XBSeq is a novel algorithm for testing RNA-seq differential expression (DE), where a statistical model was established based on the assumption that observed signals are the convolution of true expression signals and sequencing noises. The mapped reads in non-exonic regions are considered as sequencing noises, which follows a Poisson distribution. Given measurable observed signal and background noise from RNA-seq data, true expression signals, assuming governed by the negative binomial distribution, can be delineated and thus the accurate detection of differential expressed genes. XBSeq paper is published in BMC genomics [1]. We recently also incorporated functionality of roar package for testing differential alternative polyadenylation (apa) usage.
XBSeq can be installed from Bioconductor by
source('http://www.bioconductor.org/biocLite.R')
biocLite("XBSeq")
library("XBSeq")
You can also install development version of XBSeq by
library(devtools)
install_github('liuy12/XBSeq')
In order to use XBSeq for testing DE, after sequence alignment, we need to carry out read-counting procedure twice to measure the reads mapped to exonic regions (observed signal) and non-exonic regions (background noise). There are several existing methods for this purpose. Here we introduce read counting using featureCounts, HTSeq and summarizeOverlaps
featureCounts is a read summarization program that can be used for reads generated from RNA or DNA sequencing technologies and it implements highly efficient chromosome hashing and feature blocking techniques which is considerably fast in speed and require less computer memory. featureCounts is available from Rsubread package within R. Basically, you will need to run the following commands:
fc_signal <- featureCounts(files = bamLists, annot.ext = gtf_file, isGTFAnnotationFile = TRUE)
fc_bg <- featureCounts(files = bamLists, annot.ext = gtf_file_bg, isGTFAnnotationFile = TRUE)
The gtf file used to measure observed signal and background noise can be downloaded in the gtf folder from github: https://github.com/Liuy12/XBSeq_files. We have already constructed annotations for several organism of various genome builds. If you would like to construct the gtf file by yourself, we also have deposited the perl script we used to construct the gtf file in github. Details regarding the procedure we used to construct the background gtf file can be found in the Details section in the vignette.
Alternatively, you can also use HTSeq for read-counting purpose:
htseq-count [options] <alignment_file> <gtf_file> > Observed_count.txt
htseq-count [options] <alignment_file> <gtf_file_bg> > background_count.txt
Details regarding how HTSeq works can be found here: http://www-huber.embl.de/HTSeq/doc/count.html
You can also use summarizeOverlaps
function which is available from GenomicRanges package:
features_signal <- import(gtf_file)
features_signal <- split(features_signal, mcols(features_signal)$gene_id)
so_signal <- summarizeOverlaps(features_signal, bamLists)
## for background noise
features_bg <- import(gtf_file_bg)
features_bg <- split(features_bg, mcols(features_bg)$gene_id)
so_bg <- summarizeOverlaps(features_bg, bamLists)
Alternative polyadenylation (APA) is a widespread mechanism, where alternative poly(A) sites are used by a gene to encode multiple mRNA transcripts of different 3’UTR lengths. User can infer differential APA usage directly form RNA-seq alignment files:
apaStats <- apaUsage(bamTreatment, bamControl, apaAnno)
Where bamTreatment
is a list of full path of filenames of bam alignments with data for the treatment condition and bamControl
is a list of full path of filenames of bam alignments with data for the control condition to be considered. apaAnno
is full path of apa annotation used by roar package. APA annotation for several organisms of various genome build can be downloaded from here. For details regarding how to construct APA annotation, please refer to Details section is the vignette.
After HTSeq procedure, then we will have two measurements for each gene, the observed signal and background noise. Here we will use a mouse RNA-seq dataset, which contains 3 replicates of wild type mouse liver tissues (WT) and 3 replicates of Myc transgenic mouse liver tissues (MYC). The dataset is obtained from Gene Expression Omnibus (GSE61875).
As a preliminary step, we have already carried out HTSeq procedure mentioned above to generate observed signal and background noise for each gene. The two datasets can be loaded into user’s working space by
data(ExampleData)
We can first take a look at the two datasets:
head(Observed)
## Sample_54_WT Sample_72_WT Sample_93_WT Sample_31_MYC
## 0610005C13Rik 3683 2956 3237 2985
## 0610007C21Rik 2136 1675 1519 1782
## 0610007L01Rik 2826 3584 1712 2694
## 0610007P08Rik 717 616 529 737
## 0610007P14Rik 3138 2145 2145 1995
## 0610007P22Rik 298 254 194 211
## Sample_75_MYC Sample_87_MYC
## 0610005C13Rik 4043 4437
## 0610007C21Rik 2265 2214
## 0610007L01Rik 3133 3552
## 0610007P08Rik 864 923
## 0610007P14Rik 2871 2945
## 0610007P22Rik 272 333
head(Background)
## Sample_54_WT Sample_72_WT Sample_93_WT Sample_31_MYC
## 0610005C13Rik 512 374 466 496
## 0610007C21Rik 50 44 40 42
## 0610007L01Rik 33 22 35 39
## 0610007P08Rik 14 14 28 20
## 0610007P14Rik 21 17 22 10
## 0610007P22Rik 16 19 26 14
## Sample_75_MYC Sample_87_MYC
## 0610005C13Rik 504 648
## 0610007C21Rik 42 60
## 0610007L01Rik 40 22
## 0610007P08Rik 14 18
## 0610007P14Rik 24 23
## 0610007P22Rik 28 28
Rows represent reads mapped to each gene or corresponding background region. Column represent samples. And differential expression analysis will be carried out as follows:
Firstly, we need to construct a XBSeqDataSet object. Conditions are the design matrix for the experiment. Observe and background are the output matrix from HTSeq (Remeber to remove the bottom few lines of summary statistics of the output matrix).
conditions <- factor(c(rep("C", 3), rep("T", 3)))
XB <- XBSeqDataSet(Observed, Background, conditions)
It is always recommended that you examine the distribution of observed signal and background noise beforehand. We provide function ‘r XBplot’ to achieve this. We recommended to examine the distribution in log2 transcript per million (TPM) unit by setting argument unit equals to “logTPM”. Genelength information is loaded via “ExampleData”. Ideally, library size should also be provided. By default, the sum of all the reads that mapped to exonic regions are used.
XBplot(XB, Samplenum = 1, unit = "LogTPM", Genelength = genelength[, 2])
## Warning in XBplot(XB, Samplenum = 1, unit = "LogTPM", Genelength = genelength[, : Libsize is not provided, the sum of all the read counts that mapped to exonic
## regions in each sample is used as the total library size for that sample
## Warning: Removed 16453 rows containing non-finite values (stat_bin).
Then estimate the preliminary underlying signal followed by normalizing factor and dispersion estimates
XB <- estimateRealCount(XB)
XB <- estimateSizeFactors(XB)
XB <- estimateSCV(XB, method = "pooled", sharingMode = "maximum", fitType = "local")
Take a look at the scv fitting information
plotSCVEsts(XB)
Carry out the DE test by using function XBSeqTest
Teststas <- XBSeqTest( XB, levels(conditions)[1L], levels(conditions)[2L], method ='NP')
Plot Maplot based on test statistics
MAplot(Teststas, padj = FALSE, pcuff = 0.01, lfccuff = 1)
# Alternatively, all the codes above can be done with a wrapper function XBSeq
Teststats <- XBSeq(Observed, Background, conditions, method = "pooled", sharingMode = "maximum",
fitType = "local", pvals_only = FALSE, paraMethod = "NP")
Now we will carry out DE analysis on the same dataset by using DESeq and then compare the results obtained by these two methods
If you have not installed DESeq before, DESeq is also available from Bioconductor
biocLite("DESeq")
Then DE analysis for DESeq can be carried out by:
library('DESeq')
library('ggplot2')
de <- newCountDataSet(Observed, conditions)
de <- estimateSizeFactors(de)
de <- estimateDispersions(de, method = "pooled", fitType="local")
res <- nbinomTest(de, levels(conditions)[1], levels(conditions)[2])
Then we can compare the results from XBSeq and DESeq
DE_index_DESeq <- with(res, which(pval < 0.01 & abs(log2FoldChange) > 1))
DE_index_XBSeq <- with(Teststas, which(pval < 0.01 & abs(log2FoldChange) > 1))
DE_index_inters <- intersect(DE_index_DESeq, DE_index_XBSeq)
DE_index_DESeq_uniq <- setdiff(DE_index_DESeq, DE_index_XBSeq)
DE_plot <- MAplot(Teststas, padj = FALSE, pcuff = 0.01, lfccuff = 1, shape = 16)
DE_plot + geom_point(data = Teststas[DE_index_inters, ], aes(x = baseMean, y = log2FoldChange),
color = "green", shape = 16) + geom_point(data = Teststas[DE_index_DESeq_uniq,
], aes(x = baseMean, y = log2FoldChange), color = "blue", shape = 16)
The red dots indicate DE genes identified only by XBSeq. Then green dots are the shared results of XBSeq and DESeq. The blue dots are DE genes identified only by DESeq.
Exonic region annotation is obtained from UCSC database.
More details regarding how do we construct the background region annotation file of an real example can be found in manual page of ExampleData and also our publication of XBSeq.
APA sites are predicted by using POLYAR program [2], which applies an Expectation Maximization (EM) approach by using 12 different previously mapped poly(A) signal (PAS) hexamer. The predicted APA sites by POLYAR are classified into three classes, PAS-strong, PAS-medium and PAS-weak. Only APA sites in PAS-strong class are selected to construct final APA annotation. APA annotations for human and mouse genome of different versions have been built and are available to download from github: https://github.com/Liuy12/XBSeq_files
More often than not, I have been asked about whether intron retention events have any effect over the performance of XBSeq. Intron retention is a common mechanism for alternative splicing for controlling transcriptome activity. Several articles have already demonstrated that transcripts with intronic retention will be degraded via a mechanism called Nonsense-mediated mRNA decay (NMD) [3-5]. To me, intron retention will not affect the performance of XBSeq, since this type of transcripts will be degraded eventually and it makes sense to consider them as background noise.
Updated background annotation file using latest annotation file from UCSC database
Incorprated functionality to directly process alignment files (.bam files) using featureCounts
Provided one alternative parameter estimation by using Maximum likelihood estimation (MLE)
Provided one alternative statistical test for differential expression by using beta distribution approximation
Incorporation of roar for testing differential APA usage
Report bugs as issues on our GitHub repository or you can report directly to my email: liuy12@uthscsa.edu.
sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.7-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.7-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] ggplot2_2.2.1 DESeq_1.32.0
## [3] lattice_0.20-35 locfit_1.5-9.1
## [5] XBSeq_1.12.0 DESeq2_1.20.0
## [7] SummarizedExperiment_1.10.0 DelayedArray_0.6.0
## [9] BiocParallel_1.14.0 matrixStats_0.53.1
## [11] Biobase_2.40.0 GenomicRanges_1.32.0
## [13] GenomeInfoDb_1.16.0 IRanges_2.14.0
## [15] S4Vectors_0.18.0 BiocGenerics_0.26.0
## [17] BiocStyle_2.8.0
##
## loaded via a namespace (and not attached):
## [1] bit64_0.9-7 splines_3.5.0
## [3] assertthat_0.2.0 Formula_1.2-2
## [5] latticeExtra_0.6-28 roar_1.16.0
## [7] blob_1.1.1 Rsamtools_1.32.0
## [9] GenomeInfoDbData_1.1.0 yaml_2.1.18
## [11] pillar_1.2.2 RSQLite_2.1.0
## [13] backports_1.1.2 glue_1.2.0
## [15] digest_0.6.15 RColorBrewer_1.1-2
## [17] XVector_0.20.0 checkmate_1.8.5
## [19] colorspace_1.3-2 htmltools_0.3.6
## [21] Matrix_1.2-14 plyr_1.8.4
## [23] pkgconfig_2.0.1 XML_3.98-1.11
## [25] genefilter_1.62.0 bookdown_0.7
## [27] zlibbioc_1.26.0 xtable_1.8-2
## [29] scales_0.5.0 pracma_2.1.4
## [31] htmlTable_1.11.2 tibble_1.4.2
## [33] annotate_1.58.0 nnet_7.3-12
## [35] lazyeval_0.2.1 survival_2.42-3
## [37] magrittr_1.5 memoise_1.1.0
## [39] evaluate_0.10.1 foreign_0.8-70
## [41] tools_3.5.0 data.table_1.10.4-3
## [43] formatR_1.5 stringr_1.3.0
## [45] munsell_0.4.3 cluster_2.0.7-1
## [47] bindrcpp_0.2.2 Biostrings_2.48.0
## [49] AnnotationDbi_1.42.0 compiler_3.5.0
## [51] rlang_0.2.0 grid_3.5.0
## [53] RCurl_1.95-4.10 rstudioapi_0.7
## [55] htmlwidgets_1.2 labeling_0.3
## [57] bitops_1.0-6 base64enc_0.1-3
## [59] rmarkdown_1.9 gtable_0.2.0
## [61] DBI_0.8 R6_2.2.2
## [63] GenomicAlignments_1.16.0 gridExtra_2.3
## [65] dplyr_0.7.4 rtracklayer_1.40.0
## [67] knitr_1.20 bit_1.1-12
## [69] bindr_0.1.1 Hmisc_4.1-1
## [71] rprojroot_1.3-2 stringi_1.1.7
## [73] Rcpp_0.12.16 geneplotter_1.58.0
## [75] rpart_4.1-13 acepack_1.4.1
## [77] xfun_0.1
XBSeq is implemented in R based on the source code from DESeq and DESeq2.