bc_extract {CellBarcode} | R Documentation |
bc_extract identifies the barcodes (and UMI) in the sequences using regular expressions. The pattern and pattern_type arguments are necessary; they specify the barcode (and UMI) pattern and their location within the sequences.
bc_extract(
  x,
  pattern = "",
  sample_name = NULL,
  metadata = NULL,
  maxLDist = 0,
  pattern_type = c(barcode = 1),
  costs = list(sub = 1, ins = 99, del = 99),
  ordered = TRUE
)

## S4 method for signature 'data.frame'
bc_extract(
  x,
  pattern = "",
  sample_name = NULL,
  maxLDist = 0,
  pattern_type = c(barcode = 1),
  costs = list(sub = 1, ins = 99, del = 99),
  ordered = TRUE
)

## S4 method for signature 'ShortReadQ'
bc_extract(
  x,
  pattern = "",
  sample_name = NULL,
  maxLDist = 0,
  pattern_type = c(barcode = 1),
  costs = list(sub = 1, ins = 99, del = 99),
  ordered = TRUE
)

## S4 method for signature 'DNAStringSet'
bc_extract(
  x,
  pattern = "",
  sample_name = NULL,
  maxLDist = 0,
  pattern_type = c(barcode = 1),
  costs = list(sub = 1, ins = 99, del = 99),
  ordered = TRUE
)

## S4 method for signature 'integer'
bc_extract(
  x,
  pattern = "",
  sample_name = NULL,
  maxLDist = 0,
  pattern_type = c(barcode = 1),
  costs = list(sub = 1, ins = 99, del = 99),
  ordered = TRUE
)

## S4 method for signature 'character'
bc_extract(
  x,
  pattern = "",
  sample_name = NULL,
  metadata = NULL,
  maxLDist = 0,
  pattern_type = c(barcode = 1),
  costs = list(sub = 1, ins = 99, del = 99),
  ordered = TRUE
)

## S4 method for signature 'list'
bc_extract(
  x,
  pattern = "",
  sample_name = NULL,
  metadata = NULL,
  maxLDist = 0,
  pattern_type = c(barcode = 1),
  costs = list(sub = 1, ins = 99, del = 99),
  ordered = TRUE
)
x |
A single fastq file, ShortReadQ, DNAStringSet, data.frame, or named integer vector, or a list of such objects. |
pattern |
A string specifying a regular expression with capture groups. It matches the barcode (and UMI) through the capture pattern. |
sample_name |
A string vector, applicable when |
metadata |
A |
maxLDist |
An integer. The mismatch threshold for barcode matching. When
maxLDist is 0, the |
pattern_type |
A vector. It defines the barcode (and UMI) capture group. See Details. |
costs |
A named list, applicable when maxLDist > 0, specifying the
weight of each mismatch event while extracting the barcodes. The list
element names have to be |
ordered |
A logical value. If TRUE, the returned barcodes (UMI-barcode tags) are sorted by read count. |
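To make the costs weighting more concrete, here is a minimal sketch using base R's adist(), which computes a generalized Levenshtein distance. It is an illustration only: it assumes that bc_extract weights substitutions, insertions and deletions in a comparable way, the sequences are hypothetical, and adist() takes the full cost names rather than sub, ins and del.

```r
# Weights mirroring the default costs = list(sub = 1, ins = 99, del = 99).
# adist() uses the full names; this mapping is an assumption for illustration.
costs <- list(substitutions = 1, insertions = 99, deletions = 99)

reference <- "ACGT"  # hypothetical expected sequence

# One substituted base (G -> C) costs 1.
sub_cost <- drop(adist(reference, "ACCT", costs = costs))

# One missing base (a deletion) costs 99.
del_cost <- drop(adist(reference, "ACG", costs = costs))

max_dist <- 1  # plays the role of maxLDist in this sketch
print(c(substitution = sub_cost <= max_dist, deletion = del_cost <= max_dist))
```

With these weights, a threshold of 1 tolerates a single substitution but never an insertion or deletion.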
The pattern argument is a regular expression in which the capture operation () identifies the barcode or UMI. The pattern_type argument annotates each capture group, denoting whether it captures the UMI or the barcode. In the example:
([ACTG]{3})TCGATCGATCGA([ACTG]+)ATCGATCGATC
|---------| starts with 3 base pairs UMI.
           |----------| constant sequence in the backbone.
                       |-------| flexible barcode sequences.
                                |---------| 3' constant sequence.
In the UMI part [ACGT]{3}, [ACGT] means the base can be any one of "A", "C", "G" and "T", and {3} means it repeats 3 times. In the barcode pattern [ACGT]+, the + denotes one or more bases, each being A, C, G or T.
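As a rough illustration of how the two capture groups in the pattern above pick out the UMI and the barcode, the sketch below applies the same regular expression with base R's regexec() and regmatches(). It is not bc_extract's implementation; the column names are borrowed from this page for readability, and the last read is a hypothetical non-matching sequence.

```r
# Same pattern as in the example above: group 1 = UMI, group 2 = barcode.
pattern <- "([ACTG]{3})TCGATCGATCGA([ACTG]+)ATCGATCGATC"

reads <- c(
  "ACTTCGATCGATCGAAAAGATCGATCGATC",
  "CCTTCGATCGATCGAAGAAGATCGATCGATC",
  "GGGGGGGG"  # hypothetical read without the backbone; it does not match
)

# regexec() reports the full match followed by each capture group.
hits <- regmatches(reads, regexec(pattern, reads))

# Keep only reads that matched (full match + 2 groups = 3 elements).
matched <- hits[lengths(hits) == 3]

captured <- data.frame(
  match_seq   = vapply(matched, `[`, character(1), 1),
  umi_seq     = vapply(matched, `[`, character(1), 2),
  barcode_seq = vapply(matched, `[`, character(1), 3),
  stringsAsFactors = FALSE
)
print(captured)
```

Here the first read yields UMI "ACT" with barcode "AAAG", and the second yields UMI "CCT" with barcode "AGAAG"; the third read is dropped because the constant backbone is absent.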
This function returns a BarcodeObj object if the input is a list or a vector of Fastq files; otherwise it returns a data.frame. In the latter case the data.frame has 5 columns:
reads_seq: full sequence.
match_seq: part of the full sequence matched by pattern.
umi_seq (optional): UMI sequence, applicable when a UMI is specified in the pattern and pattern_type arguments.
barcode_seq: barcode sequence.
count: number of reads.
The match_seq is part of reads_seq; the umi_seq and barcode_seq are part of match_seq. The reads_seq is the full sequence and is the unique ID of each record (row). In contrast, match_seq, umi_seq or barcode_seq may be duplicated between rows.
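The relationship between these columns can be sketched with base R alone. The table below is built by hand to mimic the layout described above, using three reads and read counts taken from the example data on this page; it is an assumption about the shape of the result, not output produced by bc_extract.

```r
pattern <- "([ACTG]{3})TCGATCGATCGA([ACTG]+)ATCGATCGATC"

# Three distinct reads that differ only in their UMI.
reads <- c(
  "ACTTCGATCGATCGAAAAGATCGATCGATC",
  "TTTTCGATCGATCGAAAAGATCGATCGATC",
  "GGGTCGATCGATCGAAAAGATCGATCGATC"
)
freq <- c(30, 10, 10)

# Full match plus the two capture groups for every read.
parts <- regmatches(reads, regexec(pattern, reads))

tab <- data.frame(
  reads_seq   = reads,
  match_seq   = vapply(parts, `[`, character(1), 1),
  umi_seq     = vapply(parts, `[`, character(1), 2),
  barcode_seq = vapply(parts, `[`, character(1), 3),
  count       = freq,
  stringsAsFactors = FALSE
)

# reads_seq is unique per row, while barcode_seq repeats across rows.
reads_unique   <- anyDuplicated(tab$reads_seq) == 0
barcode_repeat <- anyDuplicated(tab$barcode_seq) > 0

# Summing counts per barcode collapses the duplicated barcode_seq values.
per_barcode <- aggregate(count ~ barcode_seq, data = tab, FUN = sum)
print(per_barcode)
```

All three rows share the barcode "AAAG", so the per-barcode total is 30 + 10 + 10 = 50 reads.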
fq_file <- system.file("extdata", "simple.fq", package="CellBarcode")

library(ShortRead)

# barcode from fastq file
bc_extract(fq_file, pattern = "AAAAA(.*)CCCCC")

# barcode from ShortReadQ object
sr <- readFastq(fq_file)  # ShortReadQ
bc_extract(sr, pattern = "AAAAA(.*)CCCCC")

# barcode from DNAStringSet object
ds <- sread(sr)  # DNAStringSet
bc_extract(ds, pattern = "AAAAA(.*)CCCCC")

# barcode from integer vector
iv <- tables(ds, n = Inf)$top  # integer vector
bc_extract(iv, pattern = "AAAAA(.*)CCCCC")

# barcode from data.frame
df <- data.frame(seq = names(iv), freq = as.integer(iv))  # data.frame
bc_extract(df, pattern = "AAAAA(.*)CCCCC")

# barcode from list of DNAStringSet
l <- list(sample1 = ds, sample2 = ds)  # list
bc_extract(l, pattern = "AAAAA(.*)CCCCC")

# Extract UMI and barcode
d1 <- data.frame(
    seq = c(
        "ACTTCGATCGATCGAAAAGATCGATCGATC",
        "AATTCGATCGATCGAAGAGATCGATCGATC",
        "CCTTCGATCGATCGAAGAAGATCGATCGATC",
        "TTTTCGATCGATCGAAAAGATCGATCGATC",
        "AAATCGATCGATCGAAGAGATCGATCGATC",
        "CCCTCGATCGATCGAAGAAGATCGATCGATC",
        "GGGTCGATCGATCGAAAAGATCGATCGATC",
        "GGATCGATCGATCGAAGAGATCGATCGATC",
        "ACTTCGATCGATCGAACAAGATCGATCGATC",
        "GGTTCGATCGATCGACGAGATCGATCGATC",
        "GCGTCCATCGATCGAAGAAGATCGATCGATC"
    ),
    freq = c(
        30, 60, 9, 10, 14, 5, 10, 30, 6, 4, 6
    )
)

# barcode backbone with UMI and barcode
pattern <- "([ACTG]{3})TCGATCGATCGA([ACTG]+)ATCGATCGATC"
bc_extract(
    list(test = d1),
    pattern,
    sample_name=c("test"),
    pattern_type=c(UMI=1, barcode=2))