NBSplice installation

NBSplice is an R package sitting on Bioconductor. You could download the source file from the Bioconductor site (http://bioconductor.org/) and then use R CMD install. Another easy way to install it is by doing:

## try http:// if https:// URLs are not supported
if (!require("BiocManager"))
    install.packages("BiocManager")
BiocManager::install("NBSplice", version="devel")

Analysis workflow

Here we show the most basic steps for a differential splicing analysis pipeline using the NBSplice R package.

Input Data

To start the analysis, an expression matrix at the isoform/transcript level called isoCounts, previously generated by upstream processing tools like Kallisto (Bray et al. 2016) or RSEM (Li and Dewey 2011), is required. You also need a table describing isoform-gene relationships called geneIso and a table of experiment information called designMatrix. In particular, this table must contain a column storing the experiment factor which will be evaluated. Here, we assume that this column is called condition.

Transcript expression matrix

As input, the NBSplice package expects isoform expression counts obtained from RNA-seq in the form of a matrix of numerical values. In this matrix, the count value found in the i-th row and the j-th column represents the expression counts assigned to isoform i in experimental sample j. The values in the matrix should be un-normalized. We suggest the use of the effective counts typically reported by the mentioned quantification tools. It is not necessary to round those because NBSplice will do it internally. The expression matrix could be easily obtained by the combination of quantification files for all samples. For instance, if Kallisto was used to transcript quantification, the expression matrix should be built combining the est_counts columns of those files as is shown in the code below.

fileNames<-c(paste(rep("C1R",4), 1:4,"/abundance.tsv", sep=""),
            paste(rep("C2R",4), 1:4,"/abundance.tsv", sep=""))

myInfo<-lapply(seq_along(fileNames), function(x){
    quant<-read.delim(fileNames[x], header = TRUE)
    expression<-quant[,"est_counts"]
    return(expression)
})
# Adding the isoform IDS as row names
isoform_id<-read.delim(fileNames[x], header = TRUE)[,"target_id"]
expressionMatrix<-(do.call(cbind, myInfo))
rownames(expressionMatrix)<-isoform_id
# Adding the samples names as column names
colnames(expressionMatrix)<-c(paste(rep("C1R",4), 1:4, sep=""),
                                paste(rep("C2R",4), 1:4, sep=""))

If another quantification tool was used, you only need to change the name of the quantification files and the name of the columns having estimated counts and isoforms IDs. For instance, for RSEM quantification, isoforms counts and IDS are in the expected_count and transcript_id columns of the Sample.isoforms.results file, respectively.

As an example, we have included an isoCounts matrix which is a subset of one of the simulated matrix used to demonstrate NBSplice utility.

data(isoCounts, package="NBSplice")
head(isoCounts)

##                 C1R1 C1R2 C1R3 C1R4 C2R1 C2R2 C2R3 C2R4
## ENST00000228345   79    0  106   99   76    0   68   23
## ENST00000358495  567  212  162  715   70  176  255  243
## ENST00000397230   15    0    0   28    0    0   22    3
## ENST00000430095    0    6    0   77    0    0   27   59
## ENST00000461568   11    0    0    0    0    2    0    0
## ENST00000463750   10   31  126   60   88    0    0   31

dim(isoCounts)

## [1] 3144    8

As can be noted, the isoCounts matrix represents an experiment with 8 samples were 3122 isoforms were quantified.

Gene-isoform relationships matrix

The geneIso matrix must have two columns called isoform_id and gene_id and as many rows as the isoCounts matrix. The i-th row of the geneIso matrix specifies from which gene was originated the i-th isoform. This information will be used by NBSplice methods for model fitting and hypothesis tests. This matrix could be easily obtained from the annotation file used for isoform quantification. For instance, if RSEM was used, the matrix can be built taking the first two columns (transcript_id and gene_id) of the Sample.isoforms.results file. In the case of Kallisto, this information is not stored in the quantification file. You could use the biomaRt R package to obtain gene IDs for the quantified isoforms or the refGenome package to extract those from the annotation file.

As an example, we provided a geneIso matrix associated with the isoCounts object previously loaded.

data(geneIso, package="NBSplice")
head(geneIso)

##                         gene_id      isoform_id
## ENST00000228345 ENSG00000002016 ENST00000228345
## ENST00000358495 ENSG00000002016 ENST00000358495
## ENST00000397230 ENSG00000002016 ENST00000397230
## ENST00000430095 ENSG00000002016 ENST00000430095
## ENST00000461568 ENSG00000002016 ENST00000461568
## ENST00000463750 ENSG00000002016 ENST00000463750

dim(geneIso)

## [1] 3144    2

This matrix contains annotation information, so, you need to build it everytime that you change your reference transcriptome, downloading the respective information from the database site.

Experiment data and colName

The designMatrix table must contain experiment information relevant for the differential splicing analysis. Its number of columns is variable but it should have as rows as columns the isoCounts matrix has. Thus, in the designMatrix the i-th row specifies experimental information associated to the i-th sample. It is worth to mention that the order and names of isoCounts columns must be the same of the designMatrix rows. Currently, NBSplice allows two-levels comparison of single-factor experiments (multiple-factor experiment comparisons are in development). In order to do this, it is necessary to specify the argument colName which indicates the name of the designMatrix column corresponding to that factor. We provided the designMatrix object associated to our isoCounts matrix which could be loaded doing data(designMatrix, package="NBSplice"). Alternatively, this matrix could be created using the code below.

designMatrix<-data.frame(sample=c(paste(rep("C1R", 4), 1:4, sep=""),
                        paste(rep("C2R", 4), 1:4, sep="")), condition=factor(c(
                        rep("Normal", 4), rep("Tumor", 4)), levels=c("Normal",
                        "Tumor")))
rownames(designMatrix)<-designMatrix[,"sample"]
designMatrix

##      sample condition
## C1R1   C1R1    Normal
## C1R2   C1R2    Normal
## C1R3   C1R3    Normal
## C1R4   C1R4    Normal
## C2R1   C2R1     Tumor
## C2R2   C2R2     Tumor
## C2R3   C2R3     Tumor
## C2R4   C2R4     Tumor

In our case, we must define our colName argument indicating that our factor of interest is condition.

colName<-"condition"
levels(designMatrix[,colName])

## [1] "Normal" "Tumor"

As can be noted, our experiment involved two experimental conditions: Normal and Tumor. Note that this column of the designMatrix is a factor with two levels where the first level, commonly taken as the reference, is Normal.

IsoDataSet class

Basics

NBSplice incorporates a new class called IsoDataSet to store expression counts and experimental data. An object of this class has the next slots: counts, geneCounts, colData, isoGeneRel, design and lowExpIndex. The first slot will contain the isoCounts matrix, whereas, the second one store the expression counts at the gene level. To do this, NBSplice incorporates a function which takes every gene of geneIso and, for each sample, compute its total expression as the sum of the counts of all its isoforms contained in isoCounts. The third and fifth slots correspond to designMatrix and geneIso objects. The slot called design save the formula used to model fitting and hypothesis tests, and is internally defined by NBSplice based on the colName argument. The last argument, lowExpIndex, stores the indexes of low-expressed isoforms. The way to define those isoforms is explained in Identifying low-expressed isoforms.

Creating an object

The typical way to build an IsoDataSet object is using its constructor which have the same name as the class. In addition to the four arguments previously defined, it is possible to define the BPPARAM argument which must be a BiocParallelParam instance defining the parallel back-end to be used (Morgan et al., n.d.).

library(NBSplice)
myIsoDataSet<-IsoDataSet(isoCounts, designMatrix, colName, geneIso)

NBSplice also provides several methods for easy object and slots inspections.

show(myIsoDataSet)

## IsoDataSet 
## Isoform Counts: 
##                      C1R1     C1R2     C1R3       C1R4     C2R1   C2R2     C2R3
## ENST00000228345 111.79303   0.0000 149.7884  141.06040 111.9745   0.00 100.9089
## ENST00000358495 802.36265 296.6549 228.9219 1018.76958 103.1344 258.33 378.4085
## ENST00000397230  21.22653   0.0000   0.0000   39.89587   0.0000   0.00  32.6470
##                       C2R4
## ENST00000228345  36.469152
## ENST00000358495 385.304517
## ENST00000397230   4.756846
## 
## Gene Counts: 
##                      C1R1      C1R2      C1R3      C1R4      C2R1       C2R2
## ENSG00000002016 1158.9683  481.3646  854.9243 1667.0775  495.0451  408.04405
## ENSG00000005059  182.5481  158.1227   72.0680  115.4131  139.9681   58.71137
## ENSG00000005889 1982.5575 1859.6906 1575.6043 1530.2917 1461.5618 1241.74555
##                      C2R3      C2R4
## ENSG00000002016  651.4561  692.9139
## ENSG00000005059  136.5238  144.2910
## ENSG00000005889 1167.8724 2177.0498
## 
## Isoform-Gene relationship: 
##                         gene_id      isoform_id
## ENST00000228345 ENSG00000002016 ENST00000228345
## ENST00000358495 ENSG00000002016 ENST00000358495
## ENST00000397230 ENSG00000002016 ENST00000397230
## 
## Design Matrix: 
##      sample condition
## C1R1   C1R1    Normal
## C1R2   C1R2    Normal
## C1R3   C1R3    Normal
## 
## Formula: 
##  counts ~ condition + iso + condition:iso
## <environment: 0x557ce2191568>

head(isoCounts(myIsoDataSet))

##                      C1R1       C1R2     C1R3       C1R4     C2R1       C2R2
## ENST00000228345 111.79303   0.000000 149.7884  141.06040 111.9745   0.000000
## ENST00000358495 802.36265 296.654936 228.9219 1018.76958 103.1344 258.330043
## ENST00000397230  21.22653   0.000000   0.0000   39.89587   0.0000   0.000000
## ENST00000430095   0.00000   8.395894   0.0000  109.71365   0.0000   0.000000
## ENST00000461568  15.56612   0.000000   0.0000    0.00000   0.0000   2.935569
## ENST00000463750  14.15102  43.378788 178.0503   85.49115 129.6547   0.000000
##                      C2R3       C2R4
## ENST00000228345 100.90892  36.469152
## ENST00000358495 378.40846 385.304517
## ENST00000397230  32.64700   4.756846
## ENST00000430095  40.06678  93.551303
## ENST00000461568   0.00000   0.000000
## ENST00000463750   0.00000  49.154074

Note that the isoCounts method returns the counts slot but it seems to be different to the IsoCounts object previously loaded. This is because this method has an additional logical parameter, CPM with default value TRUE, useful to indicate if expression counts should be returned in counts per million (CPM) scale.

Identifying low-expressed isoforms

Low expression transcripts have been widely recognized as a source of noise in model fitting and parameter estimation of RNA-seq expression counts. Typically, count thresholds are used to decide if a gene or an isoform has low expression. However, NBSplice assumes that an isoform has a low expression if their absolute or relative expression is lower than specifics thresholds. This assumption is due to the package model infers differential splicing by means of changes in relative expression of isoforms. Consequently, a transcript which has a very low relative expression in both conditions could exhibit a small change but enough to be erroneously detected as significant. The buildLowExpIdx method is responsible for the identification of low-expressed isoforms. It takes an IsoDataSet object and two expression thresholds and creates a low-expression index and stores this index in the lowExpIndex object slot. The two thresholds are called ratioThres and countsThres. The first one of those is used to set a minimum of relative expression that each isoform should achieve in all experimental samples. The second one indicates a threshold in CPM for mean expression across conditions. The method also needs the specification of the colName argument, in order to compute mean expressions per factor level, and admits parallelization using the BPPARAM argument.

myIsoDataSet<-buildLowExpIdx(myIsoDataSet, colName, ratioThres = 0.01, 
                            countThres = 1)

Testing differential gene splicing

Once an IsoDataSet was built and its low-expressed isoforms were identified, you could use the NBSplice for differential splicing evaluation using the NBTest method. It provides a user-friendly interface between model fitting and hypothesis testing and the user by means a single code line. The theory behind the NBSplice model is explained in NBSplice strategy. Briefly, NBTest iterates along genes obtaining significance p-values at gene and isoform level indicating the occurrence of differential splicing and significant changes in isoform proportion, respectively. Then, isoform and gene p-values are adjusted. Low-expressed isoforms are ignored for model fitting but their mean relative expression per condition are also computed. The method requires the specification of the colName, as well as the statistic assumed for hypothesis test p-values distribution. As the colName factor could have more than two levels, it is necessary the specification of an additional argument defining the two levels to be contrasted. By default, the contrast argument takes the two first levels of colName. NBTest returns an NBSpliceRes object which stores the obtained results.

myDSResults<-NBTest(myIsoDataSet, colName, test="F")

NBSpliceRes class

Basics

NBSpliceRes is another new class provided by NBSplice to facilitate the manipulation and exploration of the differential splicing results obtained using the NBTest method. An object of this class has three slots called results, lowExpIndex and contrast. The first slot stores differential splicing results for all isoforms, whereas the second slot keeps the indexes of low-expressed isoforms and the third one contains the evaluated contrast. Specific methods for slot accession are also provided by NBSplice

head(lowExpIndex(myDSResults))

## [1] 1 3 4 5 6 7

contrast(myDSResults)

## [1] "Normal" "Tumor"

head(results(myDSResults))

##                iso            gene ratio_Normal ratio_Tumor         odd
## 2  ENST00000358495 ENSG00000002016   0.54719747  0.49422479 -0.10181928
## 21 ENST00000394650 ENSG00000005059   0.69478391  0.78304282  0.11958650
## 26 ENST00000379177 ENSG00000005889   0.34888943  0.25494787 -0.31369597
## 27 ENST00000379188 ENSG00000005889   0.40700345  0.34146857 -0.17556603
## 35 ENST00000540034 ENSG00000005889   0.08976672  0.08393136 -0.06721502
## 37 ENST00000219069 ENSG00000006194   0.76318911  0.80921396  0.05855751
##          stat      pval  genePval       FDR   geneFDR
## 2  0.14639753 0.7151892 0.7151892 0.9926409 0.9995640
## 21 1.05528398 0.3439129 0.3439129 0.7935389 0.8405744
## 26 0.55196653 0.4671017 0.8761851 0.9190379 0.9995640
## 27 0.17311930 0.6822731 0.8761851 0.9901351 0.9995640
## 35 0.02497649 0.8761851 0.8761851 0.9982574 0.9995640
## 37 0.06668665 0.8005993 0.7438810 0.9982574 0.9995640

The data.frame containing differential splicing results has the next columns:

iso: Isoform name.
gene: Gene name.
ratio_X: Relative expression of the isoform in experimental condition X, which is the first value specified with the contrast argument.
ratio_Y: Relative expression of the isoform in experimental condition Y, which is the second value specified with the contrast argument.
odd: Ratio between isoform’s relative expression in condition X and Y.
stat: Statistic of the hypothesis test performed to decide if the change in isoform’s relative expression between X and Y conditions was significant.
pval: P-value at the isoform level corresponding to the statistic stat.
genePval: P-value at the gene level.
FDR: Adjusted p-value at the isoform level.
genePval: Adjusted p-value at the gene level.

In particular, the pval or its adjusted value, FDR, is useful for deciding if an isoform experimented significant change in its relative expression due to the change between X and Y conditions, here called Normal and Tumor. Whereas, the genePval or its adjusted value, geneFDR, will help you to decide if a gene evidenced significant changes in its alternative splicing pattern when conditions X and Y were contrasted.

In the example illustrated above, non-reliable isoforms were discarded by means of the filter argument of the results methods which has TRUE as its default value. Non-reliable refers to both low expressed isoforms and those cases where model estimations did not achieve convergence. To obtain the full results matrix only you need specifies that filter argument is FALSE.

head(results(myDSResults, filter = FALSE))

##               iso            gene ratio_Normal ratio_Tumor        odd      stat
## 1 ENST00000228345 ENSG00000002016  0.089070273 0.108429887         NA        NA
## 2 ENST00000358495 ENSG00000002016  0.547197468 0.494224790 -0.1018193 0.1463975
## 3 ENST00000397230 ENSG00000002016  0.010561661 0.014244721         NA        NA
## 4 ENST00000430095 ENSG00000002016  0.020813457 0.049128715         NA        NA
## 5 ENST00000461568 ENSG00000002016  0.003357753 0.001798561         NA        NA
## 6 ENST00000463750 ENSG00000002016  0.090468201 0.083210744         NA        NA
##        pval  genePval       FDR  geneFDR
## 1        NA 0.7151892        NA 0.999564
## 2 0.7151892 0.7151892 0.9926409 0.999564
## 3        NA 0.7151892        NA 0.999564
## 4        NA 0.7151892        NA 0.999564
## 5        NA 0.7151892        NA 0.999564
## 6        NA 0.7151892        NA 0.999564

Non-reliable isoforms could be easily detected in the obtained matrix because they present NA value in all parameters derived from model fitting and hypothesis test. It is worth to note that although non-reliable isoforms were not analyzed by statistical models, they have been used to compute the total gene counts so in order to avoid overestimation of relative expression of the well-expressed isoforms.

Results exploration

The NBSpliceRes class has several methods useful for differential splicing results exploration.

Getting results of differentially spliced genes

NBSplice assumes a gene is differentially spliced if its p-value (raw or adjusted) is lower than a significant threshold (default value = 0.05). The results table only for those differentially spliced genes is easily extracted using the GetDSResults method. Whereas, the names of those genes is obtained by means of the GetDSGenes. Both methods expect at least one argument of class NBSpliceRes. In addition, they could receive a logical parameter, adjusted, indicating if p-values must be adjusted (default) or not, and a numeric parameter, p.value, establishing the significance threshold (default value = 0.05).

mySignificantRes<-GetDSResults(myDSResults)
head(mySignificantRes)

##                 iso            gene ratio_Normal ratio_Tumor        odd
## 378 ENST00000206514 ENSG00000092068   0.03696968  0.07501606  0.7076040
## 379 ENST00000316902 ENSG00000092068   0.51890400  0.17702007 -1.0754558
## 381 ENST00000339733 ENSG00000092068   0.03601217  0.05323293  0.3908204
## 383 ENST00000422941 ENSG00000092068   0.02884770  0.03681750  0.2439428
## 386 ENST00000524758 ENSG00000092068   0.03023622  0.04352228  0.3642326
## 388 ENST00000528186 ENSG00000092068   0.02675809  0.04615988  0.5452741
##           stat         pval   genePval         FDR    geneFDR
## 378  6.0654084 0.0170055687 0.00281695 0.128238715 0.03830455
## 379 14.3400329 0.0003851829 0.00281695 0.009577521 0.03830455
## 381  1.8410376 0.1804745117 0.00281695 0.566677648 0.03830455
## 383  0.7091551 0.4034386713 0.00281695 0.854605021 0.03830455
## 386  1.5880384 0.2130252802 0.00281695 0.629128681 0.03830455
## 388  3.5529531 0.0648255060 0.00281695 0.305294101 0.03830455

dim(mySignificantRes)

## [1] 110  10

myDSGenes<-GetDSGenes(myDSResults)
head(myDSGenes)

## [1] "ENSG00000092068" "ENSG00000103275" "ENSG00000103942" "ENSG00000119446"
## [5] "ENSG00000123983" "ENSG00000126883"

length(myDSGenes)

## [1] 33

As can be seen, in our example, NBSplice detected 184 isoforms with a significant change of relative expression between Normal and Tumor condition whereas 40 genes were identified as differentially spliced.

Isoforms’ relative expression plot

The scatter plot of isoforms’ relative expression between the two contrasted conditions is explored using the plotRatiosDisp. It returns a ggplot2 object which could be then easily modified and customized.

plotRatiosDisp(myDSResults)

As can be noted, isoforms are colored according to if they came from those genes detected as differentially spliced or not. To do this, the adjusted and p.value parameters, previously used to call the getDSResults method, could be specified in the plotRatiosDisp calling.

Volcano plot

The volcano plot is commonly used to explore the dispersion between statistical significance from a statistical test with the magnitude of the change. In the case of NBSplice the isoform p-value and the difference between isoform’s relative expression per contrasted condition are explored using the plotVolcano method. In addition, isoforms are colored according to if they came from those genes detected as differentially spliced or not.

plotVolcano(myDSResults)

Exploring a differentially spliced gene

NBSplice also provides two methods to individually explore differentially spliced genes: GetGeneResults and plotGeneResults. The first one is useful for obtaining the differential expression results of a single gene, specified using the gene argument. The second one is a graphical method to generate a bar-plot illustrating the isoforms’ relative expression values in the two contrasted conditions. In both cases, it is possible filtering those non-reliable isoforms by means of the filterLowExpIso argument. In the case of the plotGeneResults method, non-significant isoforms could be also filtered out setting the filterNotSignificant argument to TRUE which is its default value. As an example, the results obtained for the first analyzed gene are shown below.

## Select the first differentially spliced gene
gene<-GetDSGenes(myDSResults)[1]
GetGeneResults(myDSResults, gene)

##                 iso            gene ratio_Normal ratio_Tumor        odd
## 378 ENST00000206514 ENSG00000092068   0.03696968  0.07501606  0.7076040
## 379 ENST00000316902 ENSG00000092068   0.51890400  0.17702007 -1.0754558
## 381 ENST00000339733 ENSG00000092068   0.03601217  0.05323293  0.3908204
## 383 ENST00000422941 ENSG00000092068   0.02884770  0.03681750  0.2439428
## 386 ENST00000524758 ENSG00000092068   0.03023622  0.04352228  0.3642326
## 388 ENST00000528186 ENSG00000092068   0.02675809  0.04615988  0.5452741
## 390 ENST00000528860 ENSG00000092068   0.02884770  0.03681750  0.2439428
## 391 ENST00000529705 ENSG00000092068   0.02209550  0.06319893  1.0509135
## 392 ENST00000532568 ENSG00000092068   0.03410357  0.07346647  0.7674271
##           stat         pval   genePval         FDR    geneFDR
## 378  6.0654084 0.0170055687 0.00281695 0.128238715 0.03830455
## 379 14.3400329 0.0003851829 0.00281695 0.009577521 0.03830455
## 381  1.8410376 0.1804745117 0.00281695 0.566677648 0.03830455
## 383  0.7091551 0.4034386713 0.00281695 0.854605021 0.03830455
## 386  1.5880384 0.2130252802 0.00281695 0.629128681 0.03830455
## 388  3.5529531 0.0648255060 0.00281695 0.305294101 0.03830455
## 390  0.7091551 0.4034386713 0.00281695 0.854605021 0.03830455
## 391 13.1933293 0.0006259889 0.00281695 0.014451319 0.03830455
## 392  7.1220185 0.0100334774 0.00281695 0.093240396 0.03830455

plotGeneResults(myDSResults, gene)

## Keeping non-reliable and non-significant isoforms
plotGeneResults(myDSResults, gene, filterLowExpIso = FALSE, 
                filterNotSignificant = FALSE)

NBSplice strategy

NBSplice, is a package developed for differential splicing detection starting from isoform quantification data from RNA-seq experiments. Our tool is based on generalized linear models (GLMs) and uses them to infer significant differences in isoforms relative expression values between two experimental conditions. In addition, NBSplice not only reveals what genes were under DS but also reveals the identity of the gene isoforms affected by it as well as their relative expression in each experimental condition.

Model

NBSplice is based on the use of GLMs to model isoform expression counts. This approach has been widely used for several differential expression tools like edgeR (Robinson, McCarthy, and Smyth 2010) and DESeq2 (Love, Huber, and Anders 2014) for differential gene expression and DEXSeq (Anders, Reyes, and Huber 2012) for differential exon usage analysis. These tools are based in negative binomial (NB) models and, in general, they assume that the counts of the i-th gene/exon in the k-th sequenced sample, denoted as \(y_{ik}\), are a random variable with a NB distribution.

\[\begin{equation} y_{ik}\sim NB(\mu_{ik}=p_{ik}s_k, \phi_{i}) \end{equation}\]

This NB distribution is characterized by two parameters: the mean, \(\mu_{ik}\), and the dispersion, \(\phi_{i}\). Specifically, previously mentioned R packages assume that \(\mu_{ik}\) can be expressed as the product of the real proportion of mRNA fragments of the i-th gene in the k-th sample, \(p_{ik}\), with a factor, \(s_k\), related to the sequencing library size. The parameter \(\phi_{i}\) is directly related to the variance of \(y_{ik}\)

\[\begin{equation} V(y_{ik})=\mu_{ik}+\phi_i\mu_{ik} \label{EQ:var} \end{equation}\]

Commonly used methods fit GLMs with a logarithmic link and with the elements \(x_{kr}\) of the design matrix to infer differential expression

\[\begin{equation} log(p_{ik})=\sum\limits_rx_{kr}\beta_{ir} \label{EQ:mod} \end{equation}\]

In the most simple case, a treatment-control experiment, \(r\) be equal to \(1\) and the design matrix only will have one column where each cell indicates if sample \(k\) was or not treated.

NBSplice takes the advantage of the previously described strategy although it proposes an alternative approach. It assumes that isoform expression counts, in CPMs, of the i-th isoform from the j-th gene in the k-th sample, \(y_{ijk}\), follow a NB distribution as

\[\begin{equation} y_{ijk}\sim NB(media=p_{ijk}\mu_{jk}, dispersion=\phi_{j}) \label{EQ:eq5} \end{equation}\]

where \(p_{ijk}\) represents the proportion of the mRNA fragments from the j-th gene (\(\mu_{jk}\)) belonging to the i-th isoform and the \(\phi_{j}\) is the dispersion of the j-th gene, in the k-th sample. In this context, for each isoform, it is possible to predict its mean relative expression or proportion by means of a GLM fit at the level gene according to

\[\begin{equation} log(p_{ijk})=x_{r.}\beta_{ij} \label{EQ:eq6} \end{equation}\]

where \(\beta_{ij}\) represents the change in \(p_{ijk}\) due to the effect described by the r-th column of the design matrix X (\(x_{r.}\)). Thus, NBSplice proposes, unlike other tools such as DEXSeq or Limma (Ritchie et al. 2015), the adjustment of one model for each gene. In particular, this model must consider the different gene isoforms as effects. In the most simple case, where only one factor (cond) with two levels (\(l=1, ~2\)) is analyzed , the previous equation is summarized at:

\[\begin{equation} ln(p_{ijl})=\mu_0+iso_{ij}+cond_l+ iso_{ij}:cond_l \label{EQ:eq7} \end{equation}\]

Once coefficient models are estimated, \(\hat{\beta}\), NBSplice evaluates if the changes in isoforms proportions due to experimental conditions are significant by means of a linear hypothesis test (Wald test) based on the F-statistic defined below (Greene 2017).

\[\begin{equation} H_0:cond_l+ iso_{ij}:cond_l=0 \\ F=\frac{(H\hat{\beta}-q)'(H \hat{cov}(\hat{\beta})H')^{-1}(H\hat{\beta}-q)}{p} \end{equation}\]

Here, \(H\) is the contrast matrix, \(q\) is a single element matrix having the value of the right side of the test equation (\(q=0\)), \(\hat{cov}(\hat{\beta})\) is the covariance matrix of \(\hat{\beta}\) and \(p\) is equal to the number of rows of \(H\) (\(p=1\)). In particular, \(H\) is a full rank matrix, conformable with \(\hat{\beta}\) so, when they are multiplied, the linear combination of the left side of the test equation is defined. Thus, under the null hypothesis, the F statistics will have \(F\) distribution with \(1\) and \(n-K\) degrees of freedom. Specifically, \(n\) is the number of samples and \(K\) is the length of \(\beta\). When only one experimental factor with two levels is analyzed (treatment-control case), \(K\) will be equal to the double of the number of analyzed isoforms of gene \(j\), \(I_j\). The comparison of the F statistics against its distribution under null hypothesis provides a p-value for each isoform of gene \(j\).

Given the fact that DS affect the whole gene, once isoform hypothesis tests were performed, NBSplice uses the Simes test (Simes 1986) to combine isoform p-values obtaining a gene p-value. Given a family of null hypothesis, \(H_{01}, ..., H_{0I_j}\), to test changes in relative isoform expression of the \(I_j\) isoforms from gene \(j\), the Simes test evaluates only one global null hypothesis

\[\begin{equation} H_0=\bigcap\limits_{p=1}^{I_j}H_{0i} \end{equation}\]

To do this, the isoform p-values are sorted \(p-value_{(1)j}< \dots < p-value_{(m)j}< \dots <p-value{(I)j}\) and then the Simes statistic, \(T_j\), is defined as

\[\begin{equation} T_j=\min \{ \frac{I_j p-value{(m)j}}{m}\} \label{EQ:eq10} \end{equation}\]

Under the null hypothesis, this statistic has uniform distribution, \(U(0,1)\), representing a p-value. The Simes test rejects \(H_0\) if \(T_j\) is lower or equal to the significance threshold \(\alpha\) (Simes 1986). Finally, isoform and gene p-values are corrected using the FDR strategy (Benjamini and Hochberg 1995).

Implementation

NBSplice uses commonly used methods and R packages for classical GLM fit (McCullagh 1984). In particular, the glm.nb function of the MASS package (Venables and Ripley 2002) is used to model fitting. It was chosen because this function allows the estimation of the dispersion parameter reporting a logical parameter indicating if the model estimation converged or not. This parameter is used by NBSplice because if estimations do not converge, thus inferences over the corresponding genes won’t be done. In addition, the lht function of the car package (Fox and Weisberg 2011) was used for linear hypothesis tests and the simes.test method from mppa package (Rubin-Delanchy and Heard 2014), for the Simes test.

Session info

sessionInfo()

## R version 4.1.1 (2021-08-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] NBSplice_1.12.0
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.7          locfit_1.5-9.4      lattice_0.20-45    
##  [4] assertthat_0.2.1    digest_0.6.28       utf8_1.2.2         
##  [7] mppa_1.0            R6_2.5.1            cellranger_1.1.0   
## [10] plyr_1.8.6          evaluate_0.14       highr_0.9          
## [13] ggplot2_3.3.5       pillar_1.6.4        rlang_0.4.12       
## [16] curl_4.3.2          readxl_1.3.1        data.table_1.14.2  
## [19] car_3.0-11          jquerylib_0.1.4     rmarkdown_2.11     
## [22] labeling_0.4.2      BiocParallel_1.28.0 stringr_1.4.0      
## [25] foreign_0.8-81      munsell_0.5.0       compiler_4.1.1     
## [28] xfun_0.27           pkgconfig_2.0.3     htmltools_0.5.2    
## [31] tidyselect_1.1.1    tibble_3.1.5        edgeR_3.36.0       
## [34] rio_0.5.27          codetools_0.2-18    fansi_0.5.0        
## [37] crayon_1.4.1        dplyr_1.0.7         MASS_7.3-54        
## [40] grid_4.1.1          jsonlite_1.7.2      gtable_0.3.0       
## [43] lifecycle_1.0.1     DBI_1.1.1           magrittr_2.0.1     
## [46] scales_1.1.1        zip_2.2.0           stringi_1.7.5      
## [49] carData_3.0-4       farver_2.1.0        reshape2_1.4.4     
## [52] limma_3.50.0        bslib_0.3.1         ellipsis_0.3.2     
## [55] vctrs_0.3.8         generics_0.1.1      openxlsx_4.2.4     
## [58] BiocStyle_2.22.0    tools_4.1.1         forcats_0.5.1      
## [61] glue_1.4.2          purrr_0.3.4         hms_1.1.1          
## [64] abind_1.4-5         parallel_4.1.1      fastmap_1.1.0      
## [67] yaml_2.2.1          colorspace_2.0-2    BiocManager_1.30.16
## [70] knitr_1.36          haven_2.4.3         sass_0.4.0

References

Anders, Simon, Alejandro Reyes, and Wolfgang Huber. 2012. “Detecting Differential Usage of Exons from RNA-seq Data.” Genome Research 22 (10): 2008–17.

Benjamini, Yoav, and Yosef Hochberg. 1995. “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.” Journal of the Royal Statistical Society. Series B (Methodological), 289–300.

Bray, NL, H Pimentel, P Melsted, and L Pachter. 2016. “Near-Optimal Probabilistic RNA-seq Quantification.” Nature Biotechnology 34 (5): 525.

Fox, John, and Sanford Weisberg. 2011. An R Companion to Applied Regression. Second. Thousand Oaks CA: Sage. http://socserv.socsci.mcmaster.ca/jfox/Books/Companion.

Greene, W.H. 2017. Econometric Analysis. Pearson Education. https://books.google.com.ar/books?id=iICcDgAAQBAJ.

Li, Bo, and Colin N Dewey. 2011. “RSEM: Accurate Transcript Quantification from RNA-Seq Data with or Without a Reference Genome.” BMC Bioinformatics 12 (1): 323.

Love, Michael I, Wolfgang Huber, and Simon Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for RNA-seq Data with DESeq2.” Genome Biology 15 (12): 550.

McCullagh, Peter. 1984. “Generalized Linear Models.” European Journal of Operational Research 16 (3): 285–92.

Morgan, Martin, Valerie Obenchain, Michel Lang, and Ryan Thompson. n.d. BiocParallel: Bioconductor Facilities for Parallel Evaluation.

Ritchie, Matthew E, Belinda Phipson, Di Wu, Yifang Hu, Charity W Law, Wei Shi, and Gordon K Smyth. 2015. “Limma Powers Differential Expression Analyses for RNA-Sequencing and Microarray Studies.” Nucleic Acids Research 43 (7): e47–e47.

Robinson, Mark D, Davis J McCarthy, and Gordon K Smyth. 2010. “edgeR: a Bioconductor Package for Differential Expression Analysis of Digital Gene Expression Data.” Bioinformatics 26 (1): 139–40.

Rubin-Delanchy, Patrick, and Nicholas A Heard. 2014. Mppa: Statistics for Analysing Multiple Simultaneous Point Processes on the Real Line. https://CRAN.R-project.org/package=mppa.

Simes, R John. 1986. “An Improved Bonferroni Procedure for Multiple Tests of Significance.” Biometrika 73 (3): 751–54.

Venables, W. N., and B. D. Ripley. 2002. Modern Applied Statistics with S. Fourth. New York: Springer. http://www.stats.ox.ac.uk/pub/MASS4.

NBSplice-vignette

Gabriela A. Merino and Elmer A. Fernandez

26 October 2021

NBSplice installation

Analysis workflow

Input Data

Transcript expression matrix

Gene-isoform relationships matrix

Experiment data and colName

IsoDataSet class

Basics

Creating an object

Identifying low-expressed isoforms

Testing differential gene splicing

NBSpliceRes class

Basics

Results exploration

Getting results of differentially spliced genes

Isoforms’ relative expression plot

Volcano plot

Exploring a differentially spliced gene

NBSplice strategy

Model

Implementation

Session info

References