overrep_kmer {qckitfastq}    R Documentation

Generate overrepresented kmers of length k based on their observed to expected ratio at each position across all sequences in the dataset

Description

Generates overrepresented kmers of length k based on their observed-to-expected ratio at each position across all sequences in the dataset. The expected proportion of a kmer of length k assumes site independence and is computed as the sum, over bases, of the count of each base in the kmer times the probability of observing that base in the dataset, i.e. P(A) * count_in_kmer(A) + P(C) * count_in_kmer(C) + ... The observed-to-expected ratio is reported on a log2 scale, as log2(obs/exp). Kmers with Obsexp_ratio > 2 are considered overrepresented and appear in the returned data frame along with their position in the read.
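
The idea of a per-position observed-to-expected ratio can be illustrated with a minimal base R sketch on toy reads. This is not the package's internal code: the reads, the helper expected_kmer, and the variable names are hypothetical. The sketch also assumes that site independence means multiplying the per-base probabilities to get the expected proportion, which is one common reading and differs from the sum written in the description above.

```r
# Toy reads of equal length (hypothetical data, not shipped with the package).
reads <- c("ACGTAC", "ACGTTT", "ACGGAC", "TTGTAC")
k <- 2

# P(base): overall frequency of each base across all reads.
bases <- unlist(strsplit(reads, ""))
base_counts <- table(factor(bases, levels = c("A", "C", "G", "T")))
base_prob <- setNames(as.numeric(base_counts) / length(bases), names(base_counts))

# ASSUMPTION: under site independence, the expected proportion of a kmer
# is the product of the probabilities of its bases.
expected_kmer <- function(kmer) {
  prod(base_prob[strsplit(kmer, "")[[1]]])
}

# For each start position, compare the observed kmer proportions
# with their expected proportions on a log2 scale.
n_pos <- nchar(reads[1]) - k + 1
result <- do.call(rbind, lapply(seq_len(n_pos), function(pos) {
  kmers <- substr(reads, pos, pos + k - 1)
  obs <- table(kmers) / length(kmers)
  expd <- vapply(names(obs), expected_kmer, numeric(1))
  data.frame(
    Position = pos,
    Obsexp_ratio = as.numeric(log2(obs / expd)),
    Kmer = names(obs),
    stringsAsFactors = FALSE
  )
}))

# Keep only kmers whose log2 ratio exceeds the documented threshold of 2.
overrep <- result[result$Obsexp_ratio > 2, ]
print(overrep)
```

In this toy input, three of the four reads start with AC, so AC at position 1 has an observed proportion far above its expected one and passes the threshold.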

Usage

overrep_kmer(infile, k, output_file = NA)

Arguments

infile

path to gzipped FASTQ file

k

the kmer length

output_file

File to save plot to. Default NA.

Value

Data frame with columns: Position (in read), Obsexp_ratio, and Kmer
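
A data frame with these three columns can be post-processed with ordinary base R. The sketch below uses a hand-made data frame whose values are invented for illustration only (they are not output from overrep_kmer); the names res, ranked, Fold, and per_kmer are hypothetical.

```r
# Hypothetical result with the documented columns; values are invented.
res <- data.frame(
  Position = c(1, 1, 5, 12),
  Obsexp_ratio = c(2.4, 3.1, 2.2, 4.0),
  Kmer = c("ACGT", "TTTT", "ACGT", "GGCC"),
  stringsAsFactors = FALSE
)

# Rank kmers from most to least overrepresented.
ranked <- res[order(-res$Obsexp_ratio), ]

# The ratio is on a log2 scale, so 2^ratio recovers the fold enrichment.
ranked$Fold <- 2^ranked$Obsexp_ratio

# Count at how many positions each kmer was flagged.
per_kmer <- table(res$Kmer)

print(ranked)
print(per_kmer)
```

Converting back from the log2 scale makes the threshold easier to read: a ratio above 2 corresponds to a kmer observed more than four times as often as expected.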

Examples


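library(qckitfastq)

# Locate the example gzipped FASTQ file shipped with the package
infile <- system.file("extdata", "test.fq.gz", package = "qckitfastq")
# Report overrepresented 4-mers by read position
overrep_kmer(infile, k = 4)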


[Package qckitfastq version 1.10.0 Index]