This vignette describes the steps that were followed for the generation of the data objects contained in the package pasilla.
pasilla 1.30.0
We used the RNA-Seq data from the publication by Brooks et al. (Brooks et al. 2010). The experiment investigated the effect of siRNA knock-down of pasilla, a gene that is known to bind to mRNA in the spliceosome, and which is thought to be involved in the regulation of splicing. The data set contains 3 biological replicates of the knockdown as well as 4 biological replicates for the untreated control. The data files are available from the NCBI Gene Expression Omnibus under the accession GSE18508. The read sequences in FASTQ format were extracted from the NCBI short read archive file (.sra files) using the SRA toolkit.
The reads in the FASTQ files were aligned using tophat
version 1.2.0
with default parameters against the reference Drosophila melanogaster
genome. The following table summarizes the read number and
alignment statistics.
The column exon.counts
refers to the number of reads that could be uniquely aligned to an exon.
dataFilesDir = system.file("extdata", package = "pasilla", mustWork=TRUE)
pasillaSampleAnno = read.csv(file.path(dataFilesDir, "pasilla_sample_annotation.csv"))
pasillaSampleAnno
## file condition type number.of.lanes total.number.of.reads
## 1 treated1fb treated single-read 5 35158667
## 2 treated2fb treated paired-end 2 12242535 (x2)
## 3 treated3fb treated paired-end 2 12443664 (x2)
## 4 untreated1fb untreated single-read 2 17812866
## 5 untreated2fb untreated single-read 6 34284521
## 6 untreated3fb untreated paired-end 2 10542625 (x2)
## 7 untreated4fb untreated paired-end 2 12214974 (x2)
## exon.counts
## 1 15679615
## 2 15620018
## 3 12733865
## 4 14924838
## 5 20764558
## 6 10283129
## 7 11653031
The reference genome fasta files were obtained from the
Ensembl ftp server.
We ran bowtie-build
to index the
fasta file. For more information on this procedure see
the bowtie
webpage.
The indexed form is required by bowtie
, and thus tophat
.
wget ftp://ftp.ensembl.org/pub/release-62/fasta/drosophila_melanogaster/ \
dna/Drosophila_melanogaster.BDGP5.25.62.dna_rm.toplevel.fa.gz
gunzip Drosophila_melanogaster.BDGP5.25.62.dna_rm.toplevel.fa.gz
bowtie-build Drosophila_melanogaster.BDGP5.25.62.dna_rm.toplevel.fa \
d_melanogaster_BDGP5.25.62
We generated the alignment BAM file using tophat
. For the single-reads data:
tophat bowtie_index <filename>1.fastq,<filename>2.fastq,...,<filename>N.fastq
For the paired-end data:
tophat -r inner-fragment-size bowtie_index \
<filename>1_1.fastq,<filename>2_1.fastq,...,<filename>N_1.fastq \
<filename>1_2.fastq,<filename>2_2.fastq,...,<filename>N_2.fastq
The SAM alignment files from which pasilla was generated are available at this URL.
To generate the per-exon read counts, we first needed to define the exonic regions.
To this end, we downloaded the file Drosophila_melanogaster.BDGP5.25.62.gtf.gz
from
Ensembl.
The script dexseq_prepare_annotation.py
contained in the DEXSeq
package was used to extract the exons of the transcripts from the file, define new
non-overlapping exonic regions and reformat it to create the file Dmel.BDGP5.25.62.DEXSeq.chr.gff
contained in pasilla/extdata
. For example, for this file we ran:
wget ftp://ftp.ensembl.org/pub/release-62/gtf/ \
drosophila_melanogaster/Drosophila_melanogaster.BDGP5.25.62.gtf.gz
gunzip Drosophila_melanogaster.BDGP5.25.62.gtf.gz
python dexseq_prepare_annotation.py Drosophila_melanogaster.BDGP5.25.62.gtf \
Dmel.BDGP5.25.62.DEXSeq.chr.gff
To count the reads that fell into each non-overlapping exonic part, the script
dexseq_count.py
, which is also contained in the DEXSeq package, was used.
It took the alignment results in the form of a SAM file
(sorted by position in the case of a paired end data) and the gtf
file
Dmel.BDGP5.25.62.DEXSeq.chr.gff
and returned one file for each biological replicate
with the exon counts. For example, for the file treated1.bam, which contained single-end
alignments, we ran:
samtools index treated1.bam
samtools view treated1.bam > treated1.sam
python dexseq_count.py Dmel.BDGP5.25.62.DEXSeq.chr.gff \
treated1.sam treated1fb.txt
For the file treated2.bam
, which contained paired-end alignments:
samtools index treated2.bam
samtools view treated2.bam > treated2.sam
sort -k1,1 -k2,2n treated2.sam > treated2_sorted.sam
python dexseq_count.py -p yes Dmel.BDGP5.25.62.DEXSeq.chr.gff \
treated2_sorted.sam treated2fb.txt
The output of the two HTSeq python scripts is provided in the pasilla package:
dir(system.file("extdata", package = "pasilla", mustWork=TRUE), pattern = ".txt$")
## [1] "geneIDsinsubset.txt" "treated1fb.txt" "treated2fb.txt"
## [4] "treated3fb.txt" "untreated1fb.txt" "untreated2fb.txt"
## [7] "untreated3fb.txt" "untreated4fb.txt"
The Python scripts are built upon the HTSeq library.
To create a DEXSeqDataSet object, we need the count files, the sample annotations, and the GFF file with the per exon annotation.
gffFile = file.path(dataFilesDir, "Dmel.BDGP5.25.62.DEXSeq.chr.gff")
With these, we could call the function DEXSeqDataSet
to construct the object dxd
.
library("DEXSeq")
dxd = DEXSeqDataSetFromHTSeq(
countfiles = file.path(dataFilesDir, paste(pasillaSampleAnno$file, "txt", sep=".")),
sampleData = pasillaSampleAnno,
design= ~ sample + exon + condition:exon,
flattenedfile = gffFile)
dxd
## class: DEXSeqDataSet
## dim: 70463 14
## metadata(1): version
## assays(1): counts
## rownames(70463): FBgn0000003:E001 FBgn0000008:E001 ... FBgn0261575:E001
## FBgn0261575:E002
## rowData names(5): featureID groupID exonBaseMean exonBaseVar
## transcripts
## colnames: NULL
## colData names(8): sample file ... exon.counts exon
We only wanted to work with data from a subset of genes, which was defined in the following file.
genesforsubset = readLines(file.path(dataFilesDir, "geneIDsinsubset.txt"))
dxd = dxd[geneIDs( dxd ) %in% genesforsubset,]
We saved the R objects in the data directory of the package:
save(dxd, file = file.path("..", "data", "pasillaDEXSeqDataSet.RData"))
sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.18-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] DEXSeq_1.48.0 RColorBrewer_1.1-3
## [3] AnnotationDbi_1.64.0 DESeq2_1.42.0
## [5] SummarizedExperiment_1.32.0 GenomicRanges_1.54.0
## [7] GenomeInfoDb_1.38.0 IRanges_2.36.0
## [9] S4Vectors_0.40.0 MatrixGenerics_1.14.0
## [11] matrixStats_1.0.0 Biobase_2.62.0
## [13] BiocGenerics_0.48.0 BiocParallel_1.36.0
## [15] BiocStyle_2.30.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.0 dplyr_1.1.3 blob_1.2.4
## [4] filelock_1.0.2 Biostrings_2.70.1 bitops_1.0-7
## [7] fastmap_1.1.1 RCurl_1.98-1.12 BiocFileCache_2.10.0
## [10] XML_3.99-0.14 digest_0.6.33 lifecycle_1.0.3
## [13] statmod_1.5.0 survival_3.5-7 KEGGREST_1.42.0
## [16] RSQLite_2.3.1 genefilter_1.84.0 magrittr_2.0.3
## [19] compiler_4.3.1 rlang_1.1.1 sass_0.4.7
## [22] progress_1.2.2 tools_4.3.1 utf8_1.2.4
## [25] yaml_2.3.7 knitr_1.44 prettyunits_1.2.0
## [28] S4Arrays_1.2.0 curl_5.1.0 bit_4.0.5
## [31] DelayedArray_0.28.0 xml2_1.3.5 abind_1.4-5
## [34] hwriter_1.3.2.1 grid_4.3.1 fansi_1.0.5
## [37] xtable_1.8-4 colorspace_2.1-0 ggplot2_3.4.4
## [40] scales_1.2.1 biomaRt_2.58.0 cli_3.6.1
## [43] rmarkdown_2.25 crayon_1.5.2 generics_0.1.3
## [46] httr_1.4.7 DBI_1.1.3 cachem_1.0.8
## [49] stringr_1.5.0 splines_4.3.1 zlibbioc_1.48.0
## [52] parallel_4.3.1 BiocManager_1.30.22 XVector_0.42.0
## [55] vctrs_0.6.4 Matrix_1.6-1.1 jsonlite_1.8.7
## [58] geneplotter_1.80.0 bookdown_0.36 hms_1.1.3
## [61] bit64_4.0.5 locfit_1.5-9.8 jquerylib_0.1.4
## [64] annotate_1.80.0 glue_1.6.2 codetools_0.2-19
## [67] stringi_1.7.12 gtable_0.3.4 munsell_0.5.0
## [70] tibble_3.2.1 pillar_1.9.0 rappdirs_0.3.3
## [73] htmltools_0.5.6.1 GenomeInfoDbData_1.2.11 dbplyr_2.3.4
## [76] R6_2.5.1 evaluate_0.22 lattice_0.22-5
## [79] png_0.1-8 Rsamtools_2.18.0 memoise_2.0.1
## [82] bslib_0.5.1 Rcpp_1.0.11 SparseArray_1.2.0
## [85] xfun_0.40 pkgconfig_2.0.3
Brooks, A. N., L. Yang, M. O. Duff, K. D. Hansen, J. W. Park, S. Dudoit, S. E. Brenner, and B. R. Graveley. 2010. “Conservation of an RNA regulatory map between Drosophila and mammals.” Genome Research, October, 193–202. https://doi.org/10.1101/gr.108662.110.