sc_long_pipeline {FLAMES}R Documentation

Pipeline for Single Cell Data

Description

Semi-supervised isoform detection and annotation for long read data. This varient is for single cell data. By default, this pipeline demultiplexes input fastq data (match_cell_barcode = TRUE). Specific parameters relating to analysis can be changed either through function arguments, or through a configuration JSON file.

Usage

sc_long_pipeline(
  annot,
  fastq,
  in_bam = NULL,
  outdir,
  genome_fa,
  minimap2_dir = "",
  downsample_ratio = 1,
  reference_csv,
  match_barcode = TRUE,
  config_file = NULL,
  do_genome_align = TRUE,
  do_isoform_id = TRUE,
  do_read_realign = TRUE,
  do_transcript_quanti = TRUE,
  gen_raw_isoform = TRUE,
  has_UMI = FALSE,
  UMI_LEN = 10,
  MAX_DIST = 10,
  MAX_TS_DIST = 100,
  MAX_SPLICE_MATCH_DIST = 10,
  min_fl_exon_len = 40,
  Max_site_per_splice = 3,
  Min_sup_cnt = 10,
  Min_cnt_pct = 0.01,
  Min_sup_pct = 0.2,
  strand_specific = 1,
  remove_incomp_reads = 5,
  use_junctions = TRUE,
  no_flank = TRUE,
  use_annotation = TRUE,
  min_tr_coverage = 0.75,
  min_read_coverage = 0.75
)

Arguments

annot

The file path to gene annotations file in gff3 format

fastq

The file path to input fastq file

in_bam

Optional file path to a bam file to use instead of fastq file (skips initial alignment step)

outdir

The path to directory to store all output files.

genome_fa

The file path to genome fasta file.

minimap2_dir

Path to the directory containing minimap2, if it is not in PATH. Only required if either or both of do_genome_align and do_read_realign are TRUE.

downsample_ratio

Integer; downsampling ratio if performing downsampling analysis.

reference_csv

The file path to the reference csv used for demultiplexing

match_barcode

Boolean; specifies if demultiplexing should be performed using FLAMES::match_cell_barcode

config_file

File path to the JSON configuration file. If specified, config_file overrides all configuration parameters

do_genome_align

Boolean; specifies whether to run the genome alignment step. TRUE is recommended

do_isoform_id

Boolean; specifies whether to run the isoform identification step. TRUE is recommended

do_read_realign

Boolean; specifies whether to run the read realignment step. TRUE is recommended

do_transcript_quanti

Boolean; specifies whether to run the transcript quantification step. TRUE is recommended

gen_raw_isoform

Boolean; specifies whether a gff3 should be generated containing the raw isoform information in the isoform identification step

has_UMI

Boolean; specifies if the data contains UMI.

UMI_LEN

Integer; the length of UMI sequence in bases

MAX_DIST

Real; maximum distance allowed when merging splicing sites in isoform consensus clustering.

MAX_TS_DIST

Real; maximum distance allowed when merging transcript start/end position in isoform consensus clustering.

MAX_SPLICE_MATCH_DIST

Real; maximum distance allowed when merging splice site called from the data and the reference annotation.

min_fl_exon_len

Real; minimum length for the first exon outside the gene body in reference annotation. This is to correct the alignment artifact

Max_site_per_splice

Real; maximum transcript start/end site combinations allowed per splice chain

Min_sup_cnt

Real; minimum number of read support an isoform. Decreasing this number will significantly increase the number of isoform detected.

Min_cnt_pct

Real; minimum percentage of count for an isoform relative to total count for the same gene.

Min_sup_pct

Real; minimum percentage of count for an splice chain that support a given transcript start/end site combination.

strand_specific

1, -1 or 0. 1 indicates if reads are in the same strand as mRNA, -1 indicates reads are reverse complemented, 0 indicates reads are not strand specific.

remove_incomp_reads

Real; determines the strength of truncated isoform filtering. Larger number means more stringent filtering.

use_junctions

Boolean; determiens whether to use known splice junctions to help correct the alignment results

no_flank

Boolean; passed to minimap2 for synthetic spike-in data. Refer to Minimap2 document for more details

use_annotation

Boolean; specifies whether to use reference to help annotate known isoforms

min_tr_coverage

Real; minimum percentage of isoform coverage for a read to be aligned to that isoform

min_read_coverage

Real; minimum percentage of read coverage for a read to be uniquely aligned to that isoform

Details

By default FLAMES use minimap2 for read alignment. After the genome alignment step (do_genome_align), FLAMES summarizes the alignment for each read by grouping reads with similar splice junctions to get a raw isoform annotation (do_isoform_id). The raw isoform annotation is compared against the reference annotation to correct potential splice site and transcript start/end errors. Transcripts that have similar splice junctions and transcript start/end to the reference transcript are merged with the reference. This process will also collapse isoforms that are likely to be truncated transcripts. Next is the read realignment step (do_read_realign), where the sequence of each polished transcript is extracted and used as the updated reference. The reads are realigned to this reference by minimap2. The transcripts with only a few full-length aligned reads are discarded. The reads are assigned to transcripts based on both alignment score, fractions of reads aligned and transcript coverage. Reads that cannot be uniquely assigned to transcripts or have low transcript coverage are discarded. The UMI transcript count matrix is generated by collapsing the reads with the same UMI in a similar way to what is done for short-read scRNA-seq data, but allowing for an edit distance of up to 2 by default. Most of the parameters, such as the minimal distance to splice site and minimal percentage of transcript coverage can be modified by the JSON configuration file (config_file).

The default parameters can be changed either through the function arguments are through the configuration JSON file config_file. the pipeline_parameters section specifies which steps are to be executed in the pipeline - by default, all steps are executed. The isoform_parameters section affects isoform detection - key parameters include:

Value

sc_long_pipeline returns a SingleCellExperiment object, containing a count matrix as an assay, gene annotations under metadata, as well as a list of the other output files generated by the pipeline. The pipeline also outputs a number of output files into the given outdir directory. These output files generated by the pipeline are:

See Also

bulk_long_pipeline() for bulk long data, SingleCellExperiment() for how data is outputted

Examples

# download the two fastq files, move them to a folder to be merged together
temp_path <- tempfile()
bfc <- BiocFileCache::BiocFileCache(temp_path, ask=FALSE)
file_url <- 
    "https://raw.githubusercontent.com/OliverVoogd/FLAMESData/master/data"
# download the required fastq files, and move them to new folder
fastq1 <- bfc[[names(BiocFileCache::bfcadd(bfc, "Fastq1", paste(file_url, "fastq/sample1.fastq.gz", sep="/")))]]
fastq2 <- bfc[[names(BiocFileCache::bfcadd(bfc, "Fastq2", paste(file_url, "fastq/sample2.fastq.gz", sep="/")))]]
fastq_dir <- paste(temp_path, "fastq_dir", sep="/") # the downloaded fastq files need to be in a directory to be merged together
dir.create(fastq_dir)
file.copy(c(fastq1, fastq2), fastq_dir)
unlink(c(fastq1, fastq2)) # the original files can be deleted

## Not run: 
# run the FLAMES bulk pipeline, using the downloaded files
se <- bulk_long_pipeline(annot=system.file("extdata/SIRV_anno.gtf", package="FLAMES"), 
                   fastq=fastq_dir,
                   outdir=tempdir(), genome_fa=system.file("extdata/SIRV_genomefa.fasta", package="FLAMES"),
                   config_file=system.file("extdata/SIRV_config_default.json", package="FLAMES"))

## End(Not run)
# OR
# run the FLAMES single cell pipeline
#sce <- sc_long_pipeline(annot, fastq, NULL, outdir, genome_fa, match_barcode=FALSE, config+file=config)

[Package FLAMES version 1.0.2 Index]