APAlyzer 1.0.0
APAlyzer is a toolkit for bioinformatic analysis of alternative polyadenylation (APA) events using RNA sequencing data. Our main approach is comparison of sequencing reads in regions demarcated by high quality polyadenylation sites (PASs) annotated in the PolyA_DB database (http://exon.njms.rutgers.edu/polya_db/v3/) (Wang et al. 2017, 2018). The current version (v3.0) uses RNA-seq data to examine APA events in 3’ untranslated regions (3’UTRs) and in introns. The coding regions are used for gene expression calculation.
APAlyzer should be installed using BiocManager:
if (!"BiocManager" %in% rownames(installed.packages()))
install.packages("BiocManager")
BiocManager::install("APAlyzer")
Alternatively, it can also be installed as follows:
R CMD INSTALL APAlyzer.tar.gz
In additions, user can also install development version of APAlyzer directly from GitHub:
BiocManager::install('RJWANGbioinfo/APAlyzer')
After installation, APAlyzer can be used by:
library(APAlyzer)
The package reads BAM file(s) to obtain read coverage information in different genomic regions. The following example shows that we first specify paths to example BAM files in the TBX20BamSubset (Bindreither 2018) data package. In this example, BAM files correspond to mouse RNA-seq data, (mapped to mm9).
suppressMessages(library("TBX20BamSubset"))
suppressMessages(library("Rsamtools"))
flsall = getBamFileList()
flsall
## SRR316184
## "/home/biocbuild/bbs-3.10-bioc/R/library/TBX20BamSubset/extdata/SRR316184.bam"
## SRR316185
## "/home/biocbuild/bbs-3.10-bioc/R/library/TBX20BamSubset/extdata/SRR316185.bam"
## SRR316186
## "/home/biocbuild/bbs-3.10-bioc/R/library/TBX20BamSubset/extdata/SRR316186.bam"
## SRR316187
## "/home/biocbuild/bbs-3.10-bioc/R/library/TBX20BamSubset/extdata/SRR316187.bam"
## SRR316188
## "/home/biocbuild/bbs-3.10-bioc/R/library/TBX20BamSubset/extdata/SRR316188.bam"
## SRR316189
## "/home/biocbuild/bbs-3.10-bioc/R/library/TBX20BamSubset/extdata/SRR316189.bam"
PAS references in the genome (both 3’UTRs and introns) are required by our
package. We have pre-built a reference file for the mouse genome (mm9),
which can be loaded from extdata
:
extpath = system.file("extdata", "mm9_REF.RData", package="APAlyzer")
load(extpath, verbose=TRUE)
## Loading objects:
## refUTRraw
## dfIPA
## dfLE
This extdata
covers 3’UTR APA regions (refUTRraw), IPA regions (dfIPA),
and 3’-most exon regions (dfLE). The refUTRraw
is a data frame containing
6 columns for genomic information of 3’UTR PASs:
head(refUTRraw,2)
## gene_symbol Chrom Strand First Last cdsend
## 1 Xkr4 chr1 - 3204561 3189814 3206103
## 2 Lypla1 chr1 + 4835237 4836816 4835097
dfIPA
is a data frame containing 8 columns for Intronic PASs; ‘upstreamSS’
means the closest 5’ or 3’ splice site to IPA, ‘downstreamSS’
means closest 3’ splice site:
head(dfIPA,2)
## PASid gene_symbol Chrom Strand Pos upstreamSS downstreamSS
## 5 chr4:32818455:+ Mdn1 chr4 + 32818455 32818305 32818889
## 10 chr4:32837613:+ Mdn1 chr4 + 32837613 32837452 32837887
dfLE
is a data frame containing 5 colmuns for 3’ least exon;
‘LEstart’ means the start genomic position of last 3’ exon.
head(dfLE,2)
## gene_symbol Chrom Strand LEstart TES
## 1 0610009O20Rik chr18 + 38421534 38425118
## 2 0610010F05Rik chr11 - 23465136 23462079
In additions to mouse mm9, our package has also a pre-build version for human hg19 genome:
extpath = system.file("extdata", "hg19_REF.RData", package="APAlyzer")
load(extpath, verbose=TRUE)
## Loading objects:
## refUTRraw_hg19
## dfIPA_hg19
## dfLE_hg19
To calculate 3’UTR APA relative expression (RE), we first need to define
the refence regions of aUTR and cUTR using refUTRraw
. Since the sample
data only contains mapping information on chr19, we can zoom into
reference regions on chr19 only:
refUTRraw=refUTRraw[which(refUTRraw$Chrom=='chr19'),]
UTRdbraw=REF3UTR(refUTRraw)
The REF3UTR
function returns a genomic range containing aUTR(pPAS to dPAS)
and cUTR(cdsend to pPAS) regions for each gene:
head(UTRdbraw,2)
## GRangesList object of length 2:
## $`1700020D05Rik__aUTR`
## GRanges object with 1 range and 2 metadata columns:
## seqnames ranges strand | gene_symbol gene_name
## <Rle> <IRanges> <Rle> | <factor> <character>
## 10082 chr19 5492231-5495277 - | 1700020D05Rik 1700020D05Rik__aUTR
## -------
## seqinfo: 21 sequences from an unspecified genome; no seqlengths
##
## $`1700020D05Rik__cUTR`
## GRanges object with 1 range and 2 metadata columns:
## seqnames ranges strand | gene_symbol gene_name
## <Rle> <IRanges> <Rle> | <factor> <character>
## 100821 chr19 5495277-5502867 - | 1700020D05Rik 1700020D05Rik__cUTR
## -------
## seqinfo: 21 sequences from an unspecified genome; no seqlengths
Once cUTR and aUTR regions are defined, the RE of 3’UTR APA of each gene
can be calculated by PASEXP_3UTR
:
DFUTRraw=PASEXP_3UTR(UTRdbraw, flsall, Strandtype="forward")
## [1] "SRR316184, Strand: forward, finished"
## [1] "SRR316185, Strand: forward, finished"
## [1] "SRR316186, Strand: forward, finished"
## [1] "SRR316187, Strand: forward, finished"
## [1] "SRR316188, Strand: forward, finished"
## [1] "SRR316189, Strand: forward, finished"
The PASEXP_3UTR
3UTR requires two inputs: 1) aUTR and cUTR
reference regions, and 2) path of BAM file(s). In additions
to input, one can also define the strand of the
sequencing using Strandtype
. The detailed usage can also
be obtained by the command ?PASEXP_3UTR
.The output data
frame covers reads count (in aUTR or cUTR), RPKM (in
aUTR or cUTR) and RE (log2(aUTR/cUTR)) for each gene:
head(DFUTRraw,2)
## gene_symbol SRR316184_areads SRR316184_aRPKM SRR316184_creads
## 1 1700020D05Rik 0 0 55
## 2 1810009A15Rik 0 0 0
## SRR316184_cRPKM SRR316184_3UTR_RE SRR316185_areads SRR316185_aRPKM
## 1 105.4447 -Inf 0 0
## 2 0.0000 NaN 0 0
## SRR316185_creads SRR316185_cRPKM SRR316185_3UTR_RE SRR316186_areads
## 1 47 87.28104 -Inf 0
## 2 0 0.00000 NaN 0
## SRR316186_aRPKM SRR316186_creads SRR316186_cRPKM SRR316186_3UTR_RE
## 1 0 47 91.72928 -Inf
## 2 0 0 0.00000 NaN
## SRR316187_areads SRR316187_aRPKM SRR316187_creads SRR316187_cRPKM
## 1 0 0 85 116.5675
## 2 0 0 0 0.0000
## SRR316187_3UTR_RE SRR316188_areads SRR316188_aRPKM SRR316188_creads
## 1 -Inf 0 0 64
## 2 NaN 0 0 0
## SRR316188_cRPKM SRR316188_3UTR_RE SRR316189_areads SRR316189_aRPKM
## 1 78.20346 -Inf 0 0
## 2 0.00000 NaN 0 0
## SRR316189_creads SRR316189_cRPKM SRR316189_3UTR_RE
## 1 33 46.85651 -Inf
## 2 0 0.00000 NaN
Analysis of IPA requires two genomic regions: IPA regions and 3’-most exons. As mentioned above, these regions in mouse and human genomes have been pre-built in the package:
#mouse(mm9):
extpath = system.file("extdata", "mm9_REF.RData", package="APAlyzer")
load(extpath)
#human(hg19):
extpath = system.file("extdata", "hg19_REF.RData", package="APAlyzer")
load(extpath)
Similar to 3’UTR APA, RE of IPAs can be calculated using PASEXP_IPA:
PASEXP_IPA
:
dfIPA=dfIPA[which(dfIPA$Chrom=='chr19'),]
dfLE=dfLE[which(dfLE$Chrom=='chr19'),]
IPA_OUTraw=PASEXP_IPA(dfIPA, dfLE, flsall, Strandtype="forward", nts=1)
Note that, as a specific feature for IPA, one can set more threads using
‘nts=’(the parameter passed to Rsubread::featureCounts,
check ?Rsubread::featureCounts
for details) to increase calculation speed.
The detailed usage can be obtained by the command ?PASEXP_IPA
.
The output data frame contains read count and read density IPA upstream (a), IPA downstream (b) and 3’-most exon region (c). The RE of IPA is calculated as log2((a - b)/c).
head(IPA_OUTraw,2)
## gene_symbol IPAID PASid
## 1 9130011E15Rik chr19:45894071:-9130011E15Rik chr19:45894071:-
## 2 9130011E15Rik chr19:45952090:-9130011E15Rik chr19:45952090:-
## SRR316184_IPA_UPreads SRR316184_IPA_UPRPK SRR316184_IPA_DNreads
## 1 1 1.445087 0
## 2 104 12.453598 26
## SRR316184_IPA_DNRPK SRR316184_LEreads SRR316184_LERPK SRR316184_IPA_RE
## 1 0.0000000 96 59.47955 -5.363166
## 2 0.8276828 96 59.47955 -2.355049
## SRR316185_IPA_UPreads SRR316185_IPA_UPRPK SRR316185_IPA_DNreads
## 1 0 0.00000 1
## 2 121 14.48928 20
## SRR316185_IPA_DNRPK SRR316185_LEreads SRR316185_LERPK SRR316185_IPA_RE
## 1 1.6611296 96 59.47955 0.000000
## 2 0.6366791 96 59.47955 -2.102237
## SRR316186_IPA_UPreads SRR316186_IPA_UPRPK SRR316186_IPA_DNreads
## 1 0 0.00000 0
## 2 108 12.93258 13
## SRR316186_IPA_DNRPK SRR316186_LEreads SRR316186_LERPK SRR316186_IPA_RE
## 1 0.0000000 69 42.75093 0.000000
## 2 0.4138414 69 42.75093 -1.771866
## SRR316187_IPA_UPreads SRR316187_IPA_UPRPK SRR316187_IPA_DNreads
## 1 0 0.00000 0
## 2 199 23.82948 22
## SRR316187_IPA_DNRPK SRR316187_LEreads SRR316187_LERPK SRR316187_IPA_RE
## 1 0.000000 40 24.78315 0.00000000
## 2 0.700347 40 24.78315 -0.09964814
## SRR316188_IPA_UPreads SRR316188_IPA_UPRPK SRR316188_IPA_DNreads
## 1 1 1.445087 0
## 2 207 24.787451 15
## SRR316188_IPA_DNRPK SRR316188_LEreads SRR316188_LERPK SRR316188_IPA_RE
## 1 0.0000000 55 34.07683 -4.5595631
## 2 0.4775093 55 34.07683 -0.4872446
## SRR316189_IPA_UPreads SRR316189_IPA_UPRPK SRR316189_IPA_DNreads
## 1 1 1.445087 0
## 2 226 27.062627 15
## SRR316189_IPA_DNRPK SRR316189_LEreads SRR316189_LERPK SRR316189_IPA_RE
## 1 0.0000000 46 28.50062 -4.3017653
## 2 0.4775093 46 28.50062 -0.1003744
Once the read coverage information is obtained for each sample, one can compare APA regulation difference between two different groups. In this analysis, there are two types of experimental design: 1) without replicates; 2) with replicates. A sample table will be generated according to the design:
# Build the sample table with replicates
sampleTable1 = data.frame(samplename = c(names(flsall)),
condition = c(rep("NT",3),rep("KD",3)))
sampleTable1
## samplename condition
## 1 SRR316184 NT
## 2 SRR316185 NT
## 3 SRR316186 NT
## 4 SRR316187 KD
## 5 SRR316188 KD
## 6 SRR316189 KD
# Build the sample table without replicates
sampleTable2 = data.frame(samplename = c("SRR316184","SRR316187"),
condition = c("NT","KD"))
sampleTable2
## samplename condition
## 1 SRR316184 NT
## 2 SRR316187 KD
To analyze 3’UTR APA between samples (KD and NT groups in the example)
without replicates, sampleTable2
is used. The function
used here is called APAdiff
(detailed information can be obtained by the
command ?APAdiff
). It will fist to go through the sample table to
determine whether it is a replicate design or non-replicate design.
Then the APA compassion will be performed.
# Analysis 3'UTR APA between KD and NT group using non-repilicate design
test_3UTRsing=APAdiff(sampleTable2,DFUTRraw,
conKET='NT',
trtKEY='KD',
PAS='3UTR',
CUTreads=0)
The APAdiff
function requires two inputs: 1) A sample table defining
groups/conditions of the samples, and 2) read coverage information of
aUTRs and cUTRs, which can be obtained by PASEXP_3UTR
from the previous
step. The group name, i.e., treatment or control, can be defined b
trtKEY=
and conKET=
; the PAS type analyzed should be defined by PAS=
;
and the read cutoff used for aUTR and cUTR is defined by CUTreads=
with the default value being 0. In the non-replicate design, the APA pattern
will be compared between two samples and output will be shown in a data frame:
head(test_3UTRsing,2)
## gene_symbol RED pvalue APAreg
## 3 1810055G02Rik -0.1195816 1.0000000 NC
## 4 2700081O15Rik 0.3862662 0.3292784 NC
table(test_3UTRsing$APAreg)
##
## DN NC UP
## 8 246 31
The output contains 4 columns: ‘gene symbol’ describes gene information;
‘RED’ is relative expression difference between two groups; ‘pvalue’ is
statistical significance based on the Fisher’s exact test; and ‘APAreg’
is 3’UTR APA regulation pattern in the gene. We define 3 types in ‘APAreg’,
‘UP’ means aUTR abundance in the treatment group (‘KD’ in this case) is at
least 5% higher than that in control (‘NT’ in this case), and ‘pvalue’<0.05;
‘DN’ means aUTR abundance is 5% lower in treatment than that in control and
p-value<0.05; ‘NC’ are the remaining genes. With respect to 3’UTR size changes,
‘UP’ means 3’UTR shortening, and ‘DN’ 3’UTR lengthening.
For the replicate design, we use t-test for significance analysis. However,
other tools based on negative binomial data distribution, such as
DEXSeq (Anders, Reyes, and W. 2012) might also be used.
# Analysis 3'UTR APA between KD and NT group using multi-repilicate design
test_3UTRmuti=APAdiff(sampleTable1,
DFUTRraw,
conKET='NT',
trtKEY='KD',
PAS='3UTR',
CUTreads=0)
head(test_3UTRmuti,2)
## gene_symbol RED pvalue APAreg
## 3 1810055G02Rik -0.7631689 0.08590526 NC
## 4 2700081O15Rik 0.1872116 0.07691742 NC
table(test_3UTRmuti$APAreg)
##
## DN NC UP
## 8 226 24
In the replicate design, ‘RED’ is difference of averaged relative expression between two groups; ‘pvalue’ is the p-value from t-test. In this case, ‘UP’ is defined as ‘RED’ <0 and ‘pvalue’ <0.05; while ‘DN’ is the opposite; and ‘NC’ is the remaining genes.
IPA comparison is similar to 3’UTR APA using APAdiff
, except that it
(1) uses IPA expression as input, and (2) ‘PAS=’ needs to be defined as
‘IPA’, and (3) the analysis is performed on each IPA. Note that, the direction
of IPA regulation is opposite to that of 3’UTR APA. This means ‘UP’ is defined
as up-regulation of IPA (RED > 0); ‘DN’ is the opposite;
and ‘NC’ is the remaining genes.
Analysis of IPA between KD and NT groups without replicates is shown here:
test_IPAsing=APAdiff(sampleTable2,
IPA_OUTraw,
conKET='NT',
trtKEY='KD',
PAS='IPA',
CUTreads=0)
head(test_IPAsing,2)
## gene_symbol PASid RED pvalue APAreg
## 2 9130011E15Rik chr19:45952090:- 2.255401 1.790010e-12 UP
## 5 Ablim1 chr19:57151655:- 0.626637 5.392007e-01 NC
Analysis of IPA between KD and NT groups using replicate data is shown here:
test_IPAmuti=APAdiff(sampleTable1,
IPA_OUTraw,
conKET='NT',
trtKEY='KD',
PAS='IPA',
CUTreads=0)
head(test_IPAmuti,2)
## gene_symbol PASid RED pvalue APAreg
## 2 9130011E15Rik chr19:45952090:- 1.84729507 0.001294432 UP
## 5 Ablim1 chr19:57151655:- -0.09466715 0.900027947 NC
APA comparison result can be plotted using either boxplots or violin plots or CDF curves. For the previous 3’UTR APA and IPA comparison outputs, one needs to first build the plotting data frame:
test_3UTRmuti$APA="3'UTR"
test_IPAmuti$APA="IPA"
dfplot=rbind(test_3UTRmuti[,c('RED','APA')],test_IPAmuti[,c('RED','APA')])
To make violin plots and CDF curves using ggplot2:
library(ggplot2)
###violin
ggplot(dfplot, aes(x = APA, y = RED)) +
geom_violin(trim = FALSE) +
geom_boxplot(width = 0.2)+ theme_bw() +
geom_hline(yintercept=0, linetype="dashed", color = "red")
###CDF
ggplot(dfplot, aes( x = RED, color = APA)) +
stat_ecdf(geom = "step") +
ylab("cumulative fraction")+
geom_vline(xintercept=0, linetype="dashed", color = "gray")+ theme_bw() +
geom_hline(yintercept=0.5, linetype="dashed", color = "gray")
APA is frequently involved in gene expression regulation. To compare gene expression vs. APA in different samples, our package provides a simple function to assess the expression changes using RNA-seq reads mapped to coding sequences.
suppressMessages(library("GenomicFeatures"))
suppressMessages(library("org.Mm.eg.db"))
extpath = system.file("extdata", "mm9.chr19.refGene.R.DB", package="APAlyzer")
txdb=loadDb(extpath, packageName='GenomicFeatures')
IDDB = org.Mm.eg.db
CDSdbraw=REFCDS(txdb,IDDB)
## 'select()' returned many:1 mapping between keys and columns
DFGENEraw=GENEXP_CDS(CDSdbraw, flsall, Strandtype="forward")
## [1] "SRR316184, Strand: forward, finished"
## [1] "SRR316185, Strand: forward, finished"
## [1] "SRR316186, Strand: forward, finished"
## [1] "SRR316187, Strand: forward, finished"
## [1] "SRR316188, Strand: forward, finished"
## [1] "SRR316189, Strand: forward, finished"
How to generate a BAM file list for analysis?
A BAM file list containing both BAM file names and paths of the files. Let’s say all the BAM files are stored in bamdir, then BAM file lists can be obtained through:
flsall = dir(bamdir,".bam")
flsall=paste0(bamdir,flsall)
names(flsall)=dir(bamdir,".bam")
Why am I getting error messages when I try to get txdb using makeTxDbFromUCSC?
You can try either upgrade your Bioconductor, or load the genome annotation using GTF, or load the prebuild genome annotation using ‘.R.DB’ file, e.g., mm9.refGene.R.DB.
The session information records all the package versions used in generation of the present document.
sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.10-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.10-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] org.Mm.eg.db_3.10.0 GenomicFeatures_1.38.0 AnnotationDbi_1.48.0
## [4] Biobase_2.46.0 ggplot2_3.2.1 TBX20BamSubset_1.21.0
## [7] Rsamtools_2.2.0 Biostrings_2.54.0 XVector_0.26.0
## [10] GenomicRanges_1.38.0 GenomeInfoDb_1.22.0 IRanges_2.20.0
## [13] S4Vectors_0.24.0 BiocGenerics_0.32.0 APAlyzer_1.0.0
## [16] BiocStyle_2.14.0
##
## loaded via a namespace (and not attached):
## [1] httr_1.4.1 splines_3.6.1
## [3] bit64_0.9-7 assertthat_0.2.1
## [5] askpass_1.1 BiocManager_1.30.9
## [7] BiocFileCache_1.10.0 blob_1.2.0
## [9] GenomeInfoDbData_1.2.2 yaml_2.2.0
## [11] progress_1.2.2 pillar_1.4.2
## [13] RSQLite_2.1.2 backports_1.1.5
## [15] lattice_0.20-38 glue_1.3.1
## [17] digest_0.6.22 RColorBrewer_1.1-2
## [19] colorspace_1.4-1 htmltools_0.4.0
## [21] Matrix_1.2-17 XML_3.98-1.20
## [23] pkgconfig_2.0.3 biomaRt_2.42.0
## [25] genefilter_1.68.0 bookdown_0.14
## [27] zlibbioc_1.32.0 xtable_1.8-4
## [29] purrr_0.3.3 scales_1.0.0
## [31] BiocParallel_1.20.0 tibble_2.1.3
## [33] openssl_1.4.1 annotate_1.64.0
## [35] withr_2.1.2 SummarizedExperiment_1.16.0
## [37] lazyeval_0.2.2 survival_2.44-1.1
## [39] magrittr_1.5 crayon_1.3.4
## [41] memoise_1.1.0 evaluate_0.14
## [43] tools_3.6.1 prettyunits_1.0.2
## [45] hms_0.5.1 matrixStats_0.55.0
## [47] stringr_1.4.0 munsell_0.5.0
## [49] DelayedArray_0.12.0 compiler_3.6.1
## [51] DESeq_1.38.0 rlang_0.4.1
## [53] grid_3.6.1 RCurl_1.95-4.12
## [55] rappdirs_0.3.1 labeling_0.3
## [57] bitops_1.0-6 rmarkdown_1.16
## [59] Rsubread_2.0.0 gtable_0.3.0
## [61] DBI_1.0.0 curl_4.2
## [63] R6_2.4.0 GenomicAlignments_1.22.0
## [65] knitr_1.25 dplyr_0.8.3
## [67] rtracklayer_1.46.0 bit_1.1-14
## [69] zeallot_0.1.0 stringi_1.4.3
## [71] Rcpp_1.0.2 geneplotter_1.64.0
## [73] vctrs_0.2.0 dbplyr_1.4.2
## [75] tidyselect_0.2.5 xfun_0.10
We thank members of the Bin Tian lab for helpful discussions and package testing.
Anders, S., A. Reyes, and Huber W. 2012. “Detecting Differential Usage of Exons from Rna-Seq Data.” Genome Research 22:4025. https://doi.org/10.1101/gr.133744.111.
Bindreither, D. 2018. TBX20BamSubset: Subset of Bam Files from the "Tbx20" Experiment.
Wang, R., R. Nambiar, D. Zheng, and Tian B. 2017. “PolyA_DB 3 Catalogs Cleavage and Polyadenylation Sites Identified by Deep Sequencing in Multiple Genomes.” Nucleic Acids Research 46 (D1):D315–D319. https://doi.org/10.1093/nar/gkx1000.
Wang, R., D. Zheng, G. Yehia, and Tian B. 2018. “A Compendium of Conserved Cleavage and Polyadenylation Events in Mammalian Genes.” Genome Research 28 (10):1427–41. https://doi.org/10.1101/gr.237826.118.