Sample output files for tximport

This data package provides output files from running several transcript abundance quantifiers on 6 samples from the GEUVADIS Project. The files are contained in the inst/extdata directory.

A citation for the GEUVADIS Project is:

Lappalainen, et al., “Transcriptome and genome sequencing uncovers functional variation in humans”, Nature 501, 506-511 (26 September 2013) doi:10.1038/nature12531.

The purpose of this vignette is to detail which versions of software were run, and exactly what calls were made.

Sample information and quantification files

A small file, samples.txt, is included in the inst/extdata directory:

dir <- system.file("extdata", package="tximportData")
samples <- read.table(file.path(dir,"samples.txt"), header=TRUE)
samples

##   pop center                assay    sample experiment       run
## 1 TSI  UNIGE NA20503.1.M_111124_5 ERS185497  ERX163094 ERR188297
## 2 TSI  UNIGE NA20504.1.M_111124_7 ERS185242  ERX162972 ERR188088
## 3 TSI  UNIGE NA20505.1.M_111124_6 ERS185048  ERX163009 ERR188329
## 4 TSI  UNIGE NA20507.1.M_111124_7 ERS185412  ERX163158 ERR188288
## 5 TSI  UNIGE NA20508.1.M_111124_2 ERS185362  ERX163159 ERR188021
## 6 TSI  UNIGE NA20514.1.M_111124_4 ERS185217  ERX163062 ERR188356
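Outside of R, the run accessions in the last column can be pulled out with a short shell helper. The following is a minimal sketch, assuming samples.txt is a whitespace-delimited text file with a single header line (as the read.table call above suggests); the function name list_runs and the temporary demo file are hypothetical and only serve the illustration.

```shell
# Print the run accession (last column) of each sample row, skipping the header.
# Taking the last field works whether or not the file carries a row-name column.
list_runs() {
  awk 'NR > 1 && NF > 0 { print $NF }' "$1"
}

# Toy copy of the table shown above, for demonstration only (assumed layout);
# in practice, point list_runs at samples.txt in the package's extdata directory.
demo=$(mktemp)
cat > "$demo" <<'EOF'
pop center assay sample experiment run
TSI UNIGE NA20503.1.M_111124_5 ERS185497 ERX163094 ERR188297
TSI UNIGE NA20504.1.M_111124_7 ERS185242 ERX162972 ERR188088
TSI UNIGE NA20505.1.M_111124_6 ERS185048 ERX163009 ERR188329
TSI UNIGE NA20507.1.M_111124_7 ERS185412 ERX163158 ERR188288
TSI UNIGE NA20508.1.M_111124_2 ERS185362 ERX163159 ERR188021
TSI UNIGE NA20514.1.M_111124_4 ERS185217 ERX163062 ERR188356
EOF

list_runs "$demo"
```

The printed accessions match the names of the per-sample sub-directories such as rsem/ERR188021.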

Further details can be found in a more extended table:

samples.ext <- read.delim(file.path(dir,"samples_extended.txt"), header=TRUE)
colnames(samples.ext)

##  [1] "Source.Name"                            
##  [2] "Comment.ENA_SAMPLE."                    
##  [3] "Characteristics.Organism."              
##  [4] "Term.Source.REF"                        
##  [5] "Term.Accession.Number"                  
##  [6] "Characteristics.Strain."                
##  [7] "Characteristics.population."            
##  [8] "Comment.1000g.Phase1.Genotypes."        
##  [9] "Protocol.REF"                           
## [10] "Protocol.REF.1"                         
## [11] "Extract.Name"                           
## [12] "Comment.LIBRARY_SELECTION."             
## [13] "Comment.LIBRARY_SOURCE."                
## [14] "Comment.SEQUENCE_LENGTH."               
## [15] "Comment.LIBRARY_STRATEGY."              
## [16] "Comment.LIBRARY_LAYOUT."                
## [17] "Comment.NOMINAL_LENGTH."                
## [18] "Comment.NOMINAL_SDEV."                  
## [19] "Protocol.REF.2"                         
## [20] "Performer"                              
## [21] "Assay.Name"                             
## [22] "Technology.Type"                        
## [23] "Comment.ENA_EXPERIMENT."                
## [24] "Comment.READ_INDEX_1_BASE_COORD."       
## [25] "Protocol.REF.3"                         
## [26] "Scan.Name"                              
## [27] "Comment.SUBMITTED_FILE_NAME."           
## [28] "Comment.ENA_RUN."                       
## [29] "Comment.FASTQ_URI."                     
## [30] "Protocol.REF.4"                         
## [31] "Derived.Array.Data.File"                
## [32] "Comment..Derived.ArrayExpress.FTP.file."
## [33] "Factor.Value.population."               
## [34] "Factor.Value.laboratory."               
## [35] "date"

The quantification outputs themselves can be found in sub-directories:

list.files(dir)

##  [1] "alevin"                  "cufflinks"              
##  [3] "kallisto"                "kallisto_boot"          
##  [5] "refseq"                  "rsem"                   
##  [7] "sailfish"                "salmon"                 
##  [9] "salmon_dm"               "salmon_ec"              
## [11] "salmon_gibbs"            "samples.txt"            
## [13] "samples_extended.txt"    "tx2gene.csv"            
## [15] "tx2gene.ensembl.v87.csv" "tx2gene.gencode.v27.csv"
## [17] "tx2gene_alevin.tsv"

list.files(file.path(dir,"cufflinks"))

## [1] "isoforms.attr_table"  "isoforms.count_table" "isoforms.fpkm_table"

list.files(file.path(dir,"rsem","ERR188021"))

## [1] "ERR188021.genes.results.gz"    "ERR188021.isoforms.results.gz"

list.files(file.path(dir,"kallisto","ERR188021"))

## [1] "abundance.h5"     "abundance.tsv.gz" "run_info.json"

list.files(file.path(dir,"salmon","ERR188021"))

## [1] "aux_info"               "cmd_info.json"          "libParams"             
## [4] "lib_format_counts.json" "logs"                   "quant.sf.gz"

list.files(file.path(dir,"sailfish","ERR188021"))

## [1] "cmd_info.json" "quant.sf"

list.files(file.path(dir,"alevin"))

## [1] "mouse1_LPS2_50"      "mouse1_unst_50"      "mouse1_unst_50_boot"

Genome and gene annotation file

For Cufflinks and Sailfish, the Illumina iGenomes reference was used as the index (see details below).
For RSEM, Salmon and kallisto (without inference replicates), the Gencode v27 CHR transcripts were used (gencode.v27.transcripts.fa).
For the salmon_gibbs and kallisto_boot directories, the Ensembl v87 cDNA transcripts were used (Homo_sapiens.GRCh38.cdna.all.fa).
For the salmon_dm directory, the Ensembl Drosophila melanogaster v92 transcripts were used (either just cDNA or combining cDNA with non-coding transcripts).

Illumina iGenomes: The human genome and annotations were downloaded from Illumina iGenomes for the UCSC hg19 version. The human genome FASTA file used was in the Sequence/WholeGenomeFasta directory and the gene annotation GTF file used was the genes.gtf file in the Annotation/Genes directory. This GTF file contains RefSeq transcript IDs and UCSC gene names. The Annotation directory contained a README.txt file with the text:

The contents of the annotation directories were downloaded from UCSC on: June 02, 2014.

The genes.gtf file was filtered to include only chromosomes 1-22, X, Y, and M.
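The exact filtering command is not recorded here. One way such a filter could be written is sketched below, assuming UCSC-style sequence names (chr1 to chr22, chrX, chrY, chrM) in the first, tab-separated column of the GTF; the function name filter_primary_chroms and the toy records are hypothetical.

```shell
# Keep only GTF records whose first column (seqname) is chr1-chr22, chrX, chrY or chrM.
# This is an illustrative filter, not the command that was actually run.
filter_primary_chroms() {
  awk -F '\t' '$1 ~ /^chr([1-9]|1[0-9]|2[0-2]|X|Y|M)$/' "$1"
}

# Toy GTF with made-up records, for demonstration only.
toy=$(mktemp)
printf 'chr1\ttoy\texon\t100\t200\t.\t+\t.\tgene_id "A";\n' > "$toy"
printf 'chr6_apd_hap1\ttoy\texon\t100\t200\t.\t+\t.\tgene_id "B";\n' >> "$toy"
printf 'chr22\ttoy\texon\t100\t200\t.\t-\t.\tgene_id "C";\n' >> "$toy"
printf 'chrUn_gl000211\ttoy\texon\t100\t200\t.\t+\t.\tgene_id "D";\n' >> "$toy"
printf 'chrM\ttoy\texon\t100\t200\t.\t+\t.\tgene_id "E";\n' >> "$toy"

# Only the chr1, chr22 and chrM records are printed.
filter_primary_chroms "$toy"
```

Anchoring the pattern at both ends is what excludes haplotype and unplaced sequences such as chr6_apd_hap1.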

Cufflinks

TopHat2 version 2.0.11 was run with the call:

tophat -p 20 -o tophat_out/$f genome fastq/$f\_1.fastq.gz fastq/$f\_2.fastq.gz;
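The variable $f is not defined in this vignette. It appears to hold one run accession from samples.txt, with each per-sample call repeated once per run. A minimal sketch of such a loop, under that assumption, is shown below; it only prints the command (a dry run), and the function name tophat_call is hypothetical.

```shell
# Run accessions from samples.txt, in table order.
runs="ERR188297 ERR188088 ERR188329 ERR188288 ERR188021 ERR188356"

# Print the TopHat2 call for one run; drop the `echo` to execute it.
# "${f}_1" expands the same way as the unquoted $f\_1 used in the calls here.
tophat_call() {
  f=$1
  echo tophat -p 20 -o "tophat_out/$f" genome "fastq/${f}_1.fastq.gz" "fastq/${f}_2.fastq.gz"
}

for f in $runs; do
  tophat_call "$f"
done
```

The same loop shape would apply to the other per-sample calls that use $f.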

Cufflinks version 2.2.1 was run with the call:

cuffquant -p 40 -b $GENO -o cufflinks/$f genes.gtf tophat_out/$f/accepted_hits.bam;

Cuffnorm was run with the call:

cuffnorm genes.gtf -o cufflinks/ \
cufflinks/ERR188297/abundances.cxb \
cufflinks/ERR188088/abundances.cxb \
cufflinks/ERR188329/abundances.cxb \
cufflinks/ERR188288/abundances.cxb \
cufflinks/ERR188021/abundances.cxb \
cufflinks/ERR188356/abundances.cxb
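The list of abundances.cxb paths follows the sample order in samples.txt, so the call can also be assembled from the run accessions. The sketch below is one way to do that, assuming the directory layout shown in the call above; it prints the command rather than running it, and the function name cuffnorm_call is hypothetical.

```shell
# Run accessions from samples.txt, in table order.
runs="ERR188297 ERR188088 ERR188329 ERR188288 ERR188021 ERR188356"

# Build the cuffnorm command line from a list of run accessions and print it.
cuffnorm_call() {
  cxb=""
  for f in "$@"; do
    cxb="$cxb cufflinks/$f/abundances.cxb"
  done
  echo "cuffnorm genes.gtf -o cufflinks/$cxb"
}

# $runs is left unquoted on purpose so that it splits into six arguments.
cuffnorm_call $runs
```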

RSEM

RSEM version 1.2.31 was run with the call:

rsem-calculate-expression --num-threads 6 --bowtie2 --paired-end <(zcat fastq/$f\_1.fastq.gz) <(zcat fastq/$f\_2.fastq.gz) index rsem/$f/$f

kallisto

kallisto version 0.43.1 was run with the call:

kallisto quant --bias -i index -t 6 -o kallisto/$f fastq/$f\_1.fastq.gz fastq/$f\_2.fastq.gz

For the files in the kallisto_boot directory, kallisto version 0.43.0 was run, quantifying against the Ensembl transcripts (v87) in Homo_sapiens.GRCh38.cdna.all.fa, using the call:

kallisto quant -i index -t 6 -b 5 -o kallisto_0.43.0/$f fastq/$f\_1.fastq.gz fastq/$f\_2.fastq.gz

Salmon

Salmon version 0.8.2 was run with the call:

salmon quant -p 6 --gcBias -i index -l IU -1 fastq/$f\_1.fastq.gz -2 fastq/$f\_2.fastq.gz -o salmon/$f

For the files in the salmon_gibbs directory, Salmon version 0.8.1 was run, quantifying against the Ensembl transcripts (v87) in Homo_sapiens.GRCh38.cdna.all.fa, using the call:

salmon quant -p 6 --numGibbsSamples 5 -i index -l IU -1 fastq/$f\_1.fastq.gz -2 fastq/$f\_2.fastq.gz -o salmon_gibbs/$f

For the files in the salmon_dm directory (Drosophila melanogaster), Salmon version 0.10.2 was run (once with only cDNA, once combining cDNA with non-coding transcripts):

salmon quant -l A --gcBias --seqBias --posBias -i Drosophila_melanogaster.BDGP6.cdna.v92_salmon_0.10.2 -o SRR1197474 -1 SRR1197474_1.fastq.gz -2 SRR1197474_2.fastq.gz

For the files in the salmon_ec directory, Salmon version 1.1.0 was run with --dumpEq on the files from Tasic, B., Yao, Z., Graybuck, L.T. et al., “Shared and distinct transcriptomic cell types across neocortical areas” (2018), doi: 10.1038/s41586-018-0654-5. These files were generated by Jeroen Gilis. The raw data are available from: https://www.ncbi.nlm.nih.gov/Traces/study/?acc=PRJNA476008&o=acc_s%3Aa

alevin

Two small examples of alevin output (50 cells each) were generated by Jeroen Gilis. The dataset is a subset from the paper Hagai et al., “Gene expression variability across cells and species shapes innate immunity” (2018), doi: 10.1038/s41586-018-0657-2. Salmon/alevin version 1.6.0 was run using the tx2gene data that is included in the package as tx2gene_alevin.tsv.

Sailfish

Sailfish version 0.9.0 was run with the call:

sailfish quant -p 10 --biasCorrect -i sailfish_0.9.0/index -l IU -1 <(zcat fastq/$f\_1.fastq.gz) -2 <(zcat fastq/$f\_2.fastq.gz) -o sailfish_0.9.0/$f

Session info

sessionInfo()

## R version 4.2.1 (2022-06-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.5 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.16-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.16-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
## [1] compiler_4.2.1 magrittr_2.0.3 tools_4.2.1    stringi_1.7.8  knitr_1.40    
## [6] stringr_1.4.1  xfun_0.34      evaluate_0.17