This data package provides a set of output files from running a number
of various transcript abundance quantifiers on 6 samples from the
GEUVADIS Project. The
files are contained in the inst/extdata
directory.
A citation for the GEUVADIS Project is:
Lappalainen, et al., “Transcriptome and genome sequencing uncovers functional variation in humans”, Nature 501, 506-511 (26 September 2013) doi:10.1038/nature12531.
The purpose of this vignette is to detail which versions of software were run, and exactly what calls were made.
A small file, samples.txt
is included in the inst/extdata
directory:
dir <- system.file("extdata", package="tximportData")
samples <- read.table(file.path(dir,"samples.txt"), header=TRUE)
samples
## pop center assay sample experiment run
## 1 TSI UNIGE NA20503.1.M_111124_5 ERS185497 ERX163094 ERR188297
## 2 TSI UNIGE NA20504.1.M_111124_7 ERS185242 ERX162972 ERR188088
## 3 TSI UNIGE NA20505.1.M_111124_6 ERS185048 ERX163009 ERR188329
## 4 TSI UNIGE NA20507.1.M_111124_7 ERS185412 ERX163158 ERR188288
## 5 TSI UNIGE NA20508.1.M_111124_2 ERS185362 ERX163159 ERR188021
## 6 TSI UNIGE NA20514.1.M_111124_4 ERS185217 ERX163062 ERR188356
Further details can be found in a more extended table:
samples.ext <- read.delim(file.path(dir,"samples_extended.txt"), header=TRUE)
colnames(samples.ext)
## [1] "Source.Name"
## [2] "Comment.ENA_SAMPLE."
## [3] "Characteristics.Organism."
## [4] "Term.Source.REF"
## [5] "Term.Accession.Number"
## [6] "Characteristics.Strain."
## [7] "Characteristics.population."
## [8] "Comment.1000g.Phase1.Genotypes."
## [9] "Protocol.REF"
## [10] "Protocol.REF.1"
## [11] "Extract.Name"
## [12] "Comment.LIBRARY_SELECTION."
## [13] "Comment.LIBRARY_SOURCE."
## [14] "Comment.SEQUENCE_LENGTH."
## [15] "Comment.LIBRARY_STRATEGY."
## [16] "Comment.LIBRARY_LAYOUT."
## [17] "Comment.NOMINAL_LENGTH."
## [18] "Comment.NOMINAL_SDEV."
## [19] "Protocol.REF.2"
## [20] "Performer"
## [21] "Assay.Name"
## [22] "Technology.Type"
## [23] "Comment.ENA_EXPERIMENT."
## [24] "Comment.READ_INDEX_1_BASE_COORD."
## [25] "Protocol.REF.3"
## [26] "Scan.Name"
## [27] "Comment.SUBMITTED_FILE_NAME."
## [28] "Comment.ENA_RUN."
## [29] "Comment.FASTQ_URI."
## [30] "Protocol.REF.4"
## [31] "Derived.Array.Data.File"
## [32] "Comment..Derived.ArrayExpress.FTP.file."
## [33] "Factor.Value.population."
## [34] "Factor.Value.laboratory."
## [35] "date"
The quantification outputs themselves can be found in sub-directories:
list.files(dir)
## [1] "cufflinks" "derivedTxome"
## [3] "kallisto" "kallisto_boot"
## [5] "rsem" "sailfish"
## [7] "salmon" "salmon_gibbs"
## [9] "samples.txt" "samples_extended.txt"
## [11] "tx2gene.csv" "tx2gene.gencode.v27.csv"
list.files(file.path(dir,"cufflinks"))
## [1] "isoforms.attr_table" "isoforms.count_table" "isoforms.fpkm_table"
list.files(file.path(dir,"rsem","ERR188021"))
## [1] "ERR188021.genes.results.gz" "ERR188021.isoforms.results.gz"
list.files(file.path(dir,"kallisto","ERR188021"))
## [1] "abundance.h5" "abundance.tsv.gz" "run_info.json"
list.files(file.path(dir,"salmon","ERR188021"))
## [1] "aux_info" "cmd_info.json"
## [3] "libParams" "lib_format_counts.json"
## [5] "logs" "quant.sf.gz"
list.files(file.path(dir,"sailfish","ERR188021"))
## [1] "cmd_info.json" "quant.sf"
gencode.v27.transcripts.fa
).salmon_gibbs
and kallisto_boot
directories,
the Ensembl v87 cDNA transcripts were used (Homo_sapiens.GRCh38.cdna.all.fa
).Illumina iGenomes: The human genome and annotations were downloaded from
Illumina iGenomes
for the UCSC hg19 version. The human genome FASTA file used was in the
Sequence/WholeGenomeFasta
directory and the gene annotation GTF file used
was the genes.gtf
file in the Annotation/Genes
directory. This GTF
file contains RefSeq transcript IDs and UCSC gene names. The
Annotation
directory contained a README.txt
file with the text:
The contents of the annotation directories were downloaded from UCSC on: June 02, 2014.
The genes.gtf
file was filtered to include only chromosomes
1-22, X, Y, and M.
Tophat2 version 2.0.11 was run with the call:
tophat -p 20 -o tophat_out/$f genome fastq/$f\_1.fastq.gz fastq/$f\_2.fastq.gz;
Cufflinks version 2.2.1 was run with the call:
cuffquant -p 40 -b $GENO -o cufflinks/$f genes.gtf tophat_out/$f/accepted_hits.bam;
Cuffnorm was run with the call:
cuffnorm genes.gtf -o cufflinks/ \
cufflinks/ERR188297/abundances.cxb \
cufflinks/ERR188088/abundances.cxb \
cufflinks/ERR188329/abundances.cxb \
cufflinks/ERR188288/abundances.cxb \
cufflinks/ERR188021/abundances.cxb \
cufflinks/ERR188356/abundances.cxb
RSEM version 1.2.31 was run with the call:
rsem-calculate-expression --num-threads 6 --bowtie2 --paired-end <(zcat fastq/$f\_1.fastq.gz) <(zcat fastq/$f\_2.fastq.gz) index rsem/$f/$f
kallisto version 0.43.1 was run with the call:
kallisto quant --bias -i index -t 6 -o kallisto/$f fastq/$f\_1.fastq.gz fastq/$f\_2.fastq.gz
For the files in kallisto_boot
directory, kallisto version 0.43.0
was run, quantifying against the Ensembl transcripts (v87) in
Homo_sapiens.GRCh38.cdna.all.fa
, using the call:
kallisto quant -i index -t 6 -b 5 -o kallisto_0.43.0/$f fastq/$f\_1.fastq.gz fastq/$f\_2.fastq.gz
Salmon version 0.8.2 was run with the call:
$salmon quant -p 6 --gcBias -i index -l IU -1 fastq/$f\_1.fastq.gz -2 fastq/$f\_2.fastq.gz -o salmon/$f
For the files in the salmon_gibbs
directory, Salmon version 0.8.1
was run, quantifying against teh Ensembl transcripts (v87) in
Homo_sapiens.GRCh38.cdna.all.fa
, using the call:
$salmon quant -p 6 --numGibbsSamples 5 -i index -l IU -1 fastq/$f\_1.fastq.gz -2 fastq/$f\_2.fastq.gz -o salmon_gibbs/$f
Sailfish version 0.9.0 was run with the call:
sailfish quant -p 10 --biasCorrect -i sailfish_0.9.0/index -l IU -1 <(zcat fastq/$f\_1.fastq.gz) -2 <(zcat fastq/$f\_2.fastq.gz) -o sailfish_0.9.0/$f
sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.7-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.7-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] compiler_3.5.0 magrittr_1.5 tools_3.5.0 stringi_1.1.7
## [5] knitr_1.20 stringr_1.3.0 evaluate_0.10.1