BASiCS 2.6.0
Single-cell mRNA sequencing can uncover novel cell-to-cell heterogeneity in gene expression levels within seemingly homogeneous populations of cells. However, these experiments are prone to high levels of technical noise, creating new challenges for identifying genes that show genuine heterogeneous expression within the group of cells under study.
BASiCS (Bayesian Analysis of Single-Cell Sequencing data) is an integrated Bayesian hierarchical model that propagates statistical uncertainty by simultaneously performing data normalisation (global scaling), technical noise quantification and two types of supervised downstream analyses:
Regression = TRUE
as a function parameter
in BASiCS_MCMC
.In all cases, a probabilistic output is provided and a decision rule is calibrated using the expected false discovery rate (EFDR) (Newton et al. 2004).
A brief description for the statistical model implemented in BASiCS is
provided in Section 6 of this document. The original
implementation of BASiCS (Vallejos, Marioni, and Richardson 2015) requires the use of spike-in
molecules — that are artificially introduced to each cell’s lysate
— to perform these analyses. More recently, Eling et al. (2018) extendeded
BASiCS to also address datasets for which spikes-ins are not available
(see Section 4). To use this feature,
please set WithSpikes = FALSE
as a function parameter in BASiCS_MCMC
.
Important: BASiCS has been designed in the context of supervised experiments where the groups of cells (e.g. experimental conditions, cell types) under study are known a priori (e.g. case-control studies). Therefore, we DO NOT advise the use of BASiCS in unsupervised settings where the aim is to uncover sub-populations of cells through clustering.
Parameter estimation is performed using the BASiCS_MCMC
function. Downstream
analyses implemented in BASiCS rely on appropriate post-processing of the
output returned by BASiCS_MCMC
. Essential parameters for running
BASiCS_MCMC
are:
Data
: a SingleCellExperiment
object created as in Section
3.1N
: total number of iterationsThin
: length of the thining period (i.e. only every Thin
iterations will be stored in the output of the BASiCS_MCMC
)Burn
: length of burn-in period (i.e. the initial Burn
iterations that will be discarded from the output of the BASiCS_MCMC
)Regression
: if this parameter is set equal to TRUE
, the regression
BASiCS model will be used (Eling et al. 2018). The latter infers a global
regression trend between mean expression and over-dispersion parameters. This
trend is subsequently used to derive a residual over-dispersion measure that
is defined as departures with respect to this trend. This is now the
recommended default setting for BASiCS*.If the optional parameter PrintProgress
is set to TRUE
, the R
console will display the progress of the MCMC algorithm.
For other optional parameters refer to help(BASiCS_MCMC)
.
Here, we illustrate the usage of BASiCS_MCMC
using a built-in
synthetic dataset.
As the outcome of this function is stochastic, users should expect small
variations across different runs.
To ensure reproducible results, users may consider using a fixed seed (as
illustrated in the following code).
NOTE: WE USE A SMALL NUMBER OF ITERATIONS FOR ILLUSTRATION PURPOSES ONLY.
LARGER NUMBER OF ITERATIONS ARE USUALLY REQUIRED TO ACHIEVE CONVERGENCE. OUR
RECOMMENDED SETTING IS N=20000
, Thin=20
and Burn=10000
.
set.seed(1)
Data <- makeExampleBASiCS_Data()
Chain <- BASiCS_MCMC(
Data = Data,
N = 1000, Thin = 10, Burn = 500,
PrintProgress = FALSE, Regression = TRUE
)
## -----------------------------------------------------
## MCMC sampler has been started: 1000 iterations to go.
## -----------------------------------------------------
## -----------------------------------------------------
## End of Burn-in period.
## -----------------------------------------------------
##
## -----------------------------------------------------
## -----------------------------------------------------
## All 1000 MCMC iterations have been completed.
## -----------------------------------------------------
## -----------------------------------------------------
##
## -----------------------------------------------------
## Please see below a summary of the overall acceptance rates.
## -----------------------------------------------------
##
## Minimum acceptance rate among mu[i]'s: 0.46
## Average acceptance rate among mu[i]'s: 0.61496
## Maximum acceptance rate among mu[i]'s: 0.816
##
##
## Minimum acceptance rate among delta[i]'s: 0.43
## Average acceptance rate among delta[i]'s: 0.52244
## Maximum acceptance rate among delta[i]'s: 0.608
##
##
## Acceptance rate for phi (joint): 0.882
##
##
## Minimum acceptance rate among nu[j]'s: 0.424
## Average acceptance rate among nu[j]'s: 0.541133
## Maximum acceptance rate among nu[j]'s: 0.726
##
##
## Minimum acceptance rate among theta[k]'s: 0.736
## Average acceptance rate among theta[k]'s: 0.736
## Maximum acceptance rate among theta[k]'s: 0.736
##
## -----------------------------------------------------
##
As a default, the code above runs the original implementation mode of BASiCS
(spikes without regression; see Section 4).
To use the regression BASiCS model (Eling et al. 2018), please set
Regression = TRUE
. To use the no-spikes implementation of BASiCS, please add
WithSpikes = FALSE
as an additional parameter.
The Data
and Chain
(a BASiCS_Chain
object) objects created by the code
above can be use for subsequent downstream analyses. See Section
3.2 for highly/lowly variable gene
detection (single group of cells, see also functions BASiCS_DetectHVG
and
BASiCS_DetectLVG
) and Section 3.3 for
differential expression analyses (two groups of cells, see also function
BASiCS_TestDE
).
Important remarks:
Please ensure the acceptance rates displayed in the console output of
BASiCS_MCMC
are around 0.44. If they are too far from this value, you
should increase N
and Burn
.
It is essential to assess the convergence of the MCMC algorithm before performing downstream analyses. For guidance regarding this step, refer to the ‘Convergence assessment’ section of this vignette
Typically, setting N=20000
, Thin=20
and Burn=10000
leads to
stable results.
The input dataset for BASiCS must be stored as an SingleCellExperiment
object (see SingleCellExperiment package).
The generation of the input SingleCellExperiment
object requires a matrix of
raw counts Counts
(columns: cells, rows: genes) after quality control
(e.g. removing low quality cells) and filtering of lowly expressed genes. If
spike-in molecules are contained in Counts
, a logical vector Tech
is
required to indicate which rows contain technical spike-in molecules and a
data.frame
object SpikeInfo
containing the names of the spike-in molecules
in the first column and the absolute number of molecules per well in the second
column. More details are provided in section 3.1. If
spike-ins are not available, a vector BatchInfo
containing batch information
is required.
The newBASiCS_Data
function can be used to create the input data object based
on the following information:
Counts
: a matrix of raw expression counts with dimensions \(q\) times \(n\).
Within this matrix, \(q_0\) rows must correspond to biological genes and \(q-q_0\)
rows must correspond to technical spike-in genes. Gene names must be stored as
rownames(Counts)
.
Tech
: a logical vector (TRUE
/FALSE
) with \(q\) elements. If
Tech[i] = FALSE
the gene i
is biological; otherwise the gene is spike-in.
This vector must be specified in the same order of genes as in the
Counts
matrix.
SpikeInfo
(optional): a data.frame
with \(q-q_0\) rows. First column must
contain the names associated to the spike-in genes (as in rownames(Counts)
).
Second column must contain the input number of molecules for the spike-in genes
(amount per cell). If a value for this parameter is not provided when calling
newBASiCS_Data
, SpikeInfo
is set as NULL
as a default value. In those
cases, the BatchInfo
argument has to be provided to allow using the no-spikes
implementation of BASiCS.
BatchInfo
(optional): vector of length \(n\) to indicate batch structure
(whenever cells have been processed using multiple batches). If a value for this
parameter is not provided when calling newBASiCS_Data
, BASiCS will assume
the data contains a single batch of samples.
For example, the following code generates a synthetic dataset with 50 genes (40 biological and 10 spike-in) and 40 cells.
set.seed(1)
Counts <- matrix(rpois(50 * 40, 2), ncol = 40)
rownames(Counts) <- c(paste0("Gene", 1:40), paste0("Spike", 1:10))
colnames(Counts) <- paste0("Cell", 1:40)
Tech <- c(rep(FALSE, 40), rep(TRUE, 10))
set.seed(2)
SpikeInput <- rgamma(10, 1, 1)
SpikeInfo <- data.frame(
"SpikeID" = paste0("Spike", 1:10),
"SpikeInput" = SpikeInput
)
# No batch structure
DataExample <- newBASiCS_Data(Counts, Tech, SpikeInfo)
# With batch structure
DataExample <- newBASiCS_Data(
Counts, Tech, SpikeInfo,
BatchInfo = rep(c(1, 2), each = 20)
)
To convert an existing SingleCellExperiment
object (Data
) into one that can
be used within BASiCS, meta-information must be stored in the object.
If spike-ins are in use, observed counts must be stored in altExp(Data)
metadata(Data)
: the SpikeInfo
object is stored in the
metadata
slot of the SingleCellExperiment
object:
metadata(Data) <- list(SpikeInput = SpikeInfo)
.
colData(Data)$BatchInfo
: the BatchInfo
object is stored in the
colData
slot of the SingleCellExperiment
object.
Once the additional information is included, the object can be used within BASiCS.
NOTE: Input number of molecules for spike-in should be calculated using experimental information. For each spike-in gene \(i\), we use
\[ \mu_{i} = C_i \times 10^{-18} \times (6.022 \times 10^{23}) \times V \times D \hspace{0.5cm} \mbox{where,} \]
To run BASiCS without incorporating reads from technical spike-in genes,
and existing SingleCellExperiment
object can be used. The only modification
to the existing object is to assign the colData(Data)$BatchInfo
slot.
set.seed(1)
CountsNoSpikes <- matrix(rpois(50 * 40, 2), ncol = 40)
rownames(CountsNoSpikes) <- paste0("Gene", 1:50)
colnames(CountsNoSpikes) <- paste0("Cell", 1:40)
# With batch structure
DataExampleNoSpikes <- SingleCellExperiment(
assays = list(counts = CountsNoSpikes),
colData = data.frame(BatchInfo = rep(c(1, 2), each = 20))
)
Note: BASiCS assumes that a pre-processing quality control step has been applied to remove cells with poor quality data and/or lowly expressed genes that were undetected through sequencing. When analysing multiple groups of cells, the gene filtering step must be jointly applied across all groups to ensure the same genes are retained.
The function BASiCS_Filter
can be used to perform this task. For examples,
refer to help(BASiCS_Filter)
. Moreover, the scater package
provides enhanced functionality for the pre-processing of scRNA-seq datasets.
We illustrate this analysis using a small extract from the MCMC chain obtained
in (Vallejos, Richardson, and Marioni 2016) when analysing the single cell samples provided in
(Grün, Kester, and Oudenaarden 2014). This is included within BASiCS
as the ChainSC
dataset.
data(ChainSC)
The following code is used to identify highly variable genes (HVG) and
lowly variable genes (LVG) among these cells. The VarThreshold
parameter
sets a lower threshold for the proportion of variability that is assigned to
the biological component (Sigma
). In the examples below:
For each gene, these functions return posterior probabilities as a measure of HVG/LVG evidence. A cut-off value for these posterior probabilities is set by controlling the EFDR (as a default option, EFDR is set as 0.10).
HVG <- BASiCS_DetectHVG(ChainSC, VarThreshold = 0.6)
LVG <- BASiCS_DetectLVG(ChainSC, VarThreshold = 0.2)
plot_grid(
BASiCS_PlotVG(HVG, "Grid"),
BASiCS_PlotVG(HVG, "VG")
)
plot_grid(
BASiCS_PlotVG(LVG, "Grid"),
BASiCS_PlotVG(LVG, "VG")
)
To access the results of these tests, please use as.data.frame
.
as.data.frame(HVG)
## GeneName GeneIndex Mu Delta Sigma Prob HVG
## 1 Amacr 21 7.164534 7.164534 0.6627267 0.6933333 TRUE
## 2 Phlda2 250 9.372955 9.372955 0.7445820 0.9600000 TRUE
## 3 Rhox13 286 9.138169 9.138169 0.7695457 0.9733333 TRUE
as.data.frame(LVG)
## GeneName GeneIndex Mu Delta Sigma Prob LVG
## 1 A630072M18Rik 8 76.74158 76.74158 0.17424707 0.7333333 TRUE
## 2 Actn1 14 143.07299 143.07299 0.11822042 0.9733333 TRUE
## 3 Ahsa1 20 167.46692 167.46692 0.09373365 1.0000000 TRUE
## 4 Ass1 30 160.89463 160.89463 0.12406375 0.9733333 TRUE
## 5 Atp1b1 36 182.63844 182.63844 0.15493222 0.9200000 TRUE
## 6 Atp5g2 37 344.95180 344.95180 0.09031420 1.0000000 TRUE
## 7 Btf3 55 263.29872 263.29872 0.06858356 1.0000000 TRUE
## 8 Chchd2 69 576.15694 576.15694 0.04702879 1.0000000 TRUE
## 9 Cops6 78 95.28245 95.28245 0.14890886 0.8400000 TRUE
## 10 Dnajc19 102 69.75996 69.75996 0.16615340 0.7200000 TRUE
## 11 Eef1d 112 164.97660 164.97660 0.10212111 0.9866667 TRUE
## 12 Eif3e 115 172.85517 172.85517 0.12818532 0.9866667 TRUE
## 13 Fkbp4 141 334.41897 334.41897 0.09779841 1.0000000 TRUE
## 14 Ftl1 147 2296.21095 2296.21095 0.06656123 1.0000000 TRUE
## 15 Gars 150 144.21635 144.21635 0.15426362 0.8533333 TRUE
## 16 Gm13251 159 214.95780 214.95780 0.11957642 0.9866667 TRUE
## 17 Gm5506 161 2117.86447 2117.86447 0.09553558 1.0000000 TRUE
## 18 Gm5860 162 879.56308 879.56308 0.11234069 1.0000000 TRUE
## 19 Hnrnpa3 176 698.74412 698.74412 0.03976716 1.0000000 TRUE
## 20 Lrrc14b 191 82.26530 82.26530 0.13761078 0.9200000 TRUE
## 21 Mcm7 199 171.91942 171.91942 0.12149953 0.9466667 TRUE
## 22 Mlec 205 92.63264 92.63264 0.12870287 0.9333333 TRUE
## 23 Myh9 214 75.00987 75.00987 0.15676464 0.7733333 TRUE
## 24 Nedd8 218 140.92099 140.92099 0.10778409 0.9600000 TRUE
## 25 Nmrk1 223 745.61731 745.61731 0.16557363 0.8933333 TRUE
## 26 Nop56 225 237.46748 237.46748 0.09231176 1.0000000 TRUE
## 27 Nop58 226 168.68657 168.68657 0.16715635 0.7466667 TRUE
## 28 Npm3-ps1 230 167.04112 167.04112 0.13345401 0.8933333 TRUE
## 29 Nucks1 232 334.69529 334.69529 0.10074699 1.0000000 TRUE
## 30 Psmc3 269 131.08951 131.08951 0.13530298 0.9333333 TRUE
## 31 Psmd8 270 106.57423 106.57423 0.14710745 0.9200000 TRUE
## 32 Ptges3 272 297.56496 297.56496 0.11596188 1.0000000 TRUE
## 33 Rad23b 275 157.73725 157.73725 0.08739974 0.9866667 TRUE
## 34 Rbbp4 277 181.18047 181.18047 0.13249099 0.9466667 TRUE
## 35 Rif1 287 121.96288 121.96288 0.12419407 0.9200000 TRUE
## 36 Rpl15 290 1479.56150 1479.56150 0.03884788 1.0000000 TRUE
## 37 Rpl23a 291 1328.68382 1328.68382 0.03264723 1.0000000 TRUE
## 38 Rps11 293 1307.43808 1307.43808 0.04070404 1.0000000 TRUE
## 39 Slc23a1 310 1608.25591 1608.25591 0.18284658 0.7333333 TRUE
## 40 Slc25a3 311 371.72673 371.72673 0.05901678 1.0000000 TRUE
## 41 Snx3 323 123.78433 123.78433 0.14466506 0.9066667 TRUE
## 42 Snx5 324 152.55332 152.55332 0.14451260 0.8933333 TRUE
## 43 Sod1 326 636.54334 636.54334 0.04662697 1.0000000 TRUE
## 44 Sod2 327 222.74992 222.74992 0.11801915 0.9866667 TRUE
## 45 Srp72 331 71.67213 71.67213 0.17290929 0.6933333 TRUE
SummarySC <- Summary(ChainSC)
plot(SummarySC, Param = "mu", Param2 = "delta", log = "xy")
HTable <- as.data.frame(HVG)
LTable <- as.data.frame(LVG)
with(HTable, points(Mu, Delta))
with(LTable, points(Mu, Delta))
Note: this decision rule implemented in this function has changed with
respect to the original release of BASiCS (where EviThreshold
was defined
such that EFDR = EFNR). However, the new choice is more stable (sometimes, it
was not posible to find a threshold such that EFDR = EFNR).
To illustrate the use of the differential mean expression and differential
over-dispersion tests between two cell populations, we use extracts from the
MCMC chains obtained in (Vallejos, Richardson, and Marioni 2016) when analysing the
(Grün, Kester, and Oudenaarden 2014) dataset (single cells vs pool-and-split samples). These
were obtained by independently running the BASiCS_MCMC
function for each
group of cells.
data(ChainSC)
data(ChainRNA)
Test <- BASiCS_TestDE(
Chain1 = ChainSC, Chain2 = ChainRNA,
GroupLabel1 = "SC", GroupLabel2 = "PaS",
EpsilonM = log2(1.5), EpsilonD = log2(1.5),
EFDR_M = 0.10, EFDR_D = 0.10,
Offset = TRUE, PlotOffset = TRUE, Plot = TRUE
)
In BASiCS_TestDE
, EpsilonM
sets the log2 fold change (log2FC) in expression
(\(\mu\)) and EpsilonD
the log2FC in over-dispersion (\(\delta\)). As a default
option: EpsilonM = EpsilonD = log2(1.5)
(i.e. the test is set to detect
absolute increases of 50% or above). To adjust for differences in overall mRNA
content, an internal offset correction is performed when OffSet=TRUE
.
This is the recommended default setting.
Previously, the output of this function was a set of plots and a nested list
structure. The new output is an object of S4 class BASiCS_ResultsDE
.
Test
## An object of class BASiCS_ResultsDE, containing:
## -------------------------------------------------------------
## An object of class BASiCS_ResultDE.
## -------------------------------------------------------------
## 0 genes with a change in mean expression
## - Higher mean expression in SC samples: 0
## - Higher mean expression in PaS samples: 0
## - Fold change tolerance = 150 %
## - Probability threshold = 0.666666666666667
## - EFDR = NaN %
## - EFNR = 19.64 %
## -------------------------------------------------------------
## An object of class BASiCS_ResultDE.
## -------------------------------------------------------------
## 0 genes with a change in over dispersion
## - Higher over dispersion in SC samples: 0
## - Higher over dispersion in PaS samples: 0
## - Fold change tolerance = 150 %
## - Probability threshold = 0.666666666666667
## - EFDR = NA %
## - EFNR = NA %
You can access the results of these tests using as.data.frame
. You must specify which
of the mean, dispersion and residual overdispersion results you want using the
Which
argument, as follows.
head(as.data.frame(Test, Parameter = "Mean"))
## [1] GeneName MeanOverall Mean1 Mean2 MeanFC
## [6] MeanLog2FC ProbDiffMean ResultDiffMean
## <0 rows> (or 0-length row.names)
head(as.data.frame(Test, Parameter = "Disp"))
## [1] GeneName DispOverall Disp1 Disp2 DispFC
## [6] DispLog2FC ProbDiffDisp ResultDiffDisp
## <0 rows> (or 0-length row.names)
## This object doesn't contain residual overdispersion tests as the chains
## were run using the non-regression version of BASiCS
# head(as.data.frame(DE, Parameter = "Disp"))
There’s a rowData
field and accessor, so we can add gene information that
will be added when formatting each of the fields (eg different gene
identifiers).
rowData(Test)
## DataFrame with 350 rows and 1 column
## GeneName
## <character>
## 1 1700094D03Rik
## 2 1700097N02Rik
## 3 1810026B05Rik
## 4 2310008H04Rik
## 5 2410137M14Rik
## ... ...
## 346 Tef
## 347 Terf2
## 348 Tfeb
## 349 Timm22
## 350 Tinf2
rowData(Test) <- cbind(rowData(Test), Index = 1:nrow(rowData(Test)))
as.data.frame(Test, Parameter = "Mean")
## [1] GeneName MeanOverall Mean1 Mean2 MeanFC
## [6] MeanLog2FC ProbDiffMean ResultDiffMean Index
## <0 rows> (or 0-length row.names)
As of BASiCS v2, the differential expression plots have been re-done in ggplot2
and can be generated using BASiCS_PlotDE
.
The default plots are “Grid” (EFDR vs EFNR), “MA”, and “Volcano”.
These are done for “Mean”, “Disp”, and “ResDisp”.
We can choose to plot eg only mean, or mean and
overdispersion, etc, and only MA plot, MA plot and volcano, etc.
BASiCS_PlotDE(Test)
BASiCS_PlotDE(Test, Plots = c("MA", "Volcano"))
BASiCS_PlotDE(Test, Plots = "MA", Parameters = "Mean")
Due to the confounding between mean and over-dispersion that is
typically observed in scRNA-seq datasets, the non-regression BASiCS model
(run using Regression = FALSE
as a function parameter in BASiCS_MCMC
)
can only be used to assess changes in over-dispersion for those genes in which
the mean expression does not change between the groups. In this case, we
recommend users to use EpsilonM = 0
as a conservative option to avoid
changes in over-dispersion to be confounded by mean expression (the genes for
which mean expression changes are marked as ExcludedFromTesting
in the
ResultDiffDisp
field).
Test <- BASiCS_TestDE(
Chain1 = ChainSC, Chain2 = ChainRNA,
GroupLabel1 = "SC", GroupLabel2 = "PaS",
EpsilonM = 0, EpsilonD = log2(1.5),
EFDR_M = 0.10, EFDR_D = 0.10,
Offset = TRUE, PlotOffset = FALSE, Plot = FALSE
)
Note: If the regression BASiCS model has been
used (Regression = TRUE
as a function parameter in BASiCS_MCMC
),
BASiCS_TestDE
will also report changes in residual over-dispersion
(not confounded by mean expression) between the groups (see Section
4 in this vignette).
Beyond its original implementation, BASiCS has been extended to address experimental designs in which spike-in molecules are not available as well as to address the confounding that is typically observed between mean and over-dispersion for scRNA-seq datasets (Eling et al. 2018). Alternative implementation modes are summarised below:
As a default, the BASiCS_MCMC
function uses WithSpikes = TRUE
.
WithSpikes = FALSE
When technical spike-in genes are not available, BASiCS uses a horizontal
integration strategy which borrows information across multiple technical
replicates (Eling et al. 2018). Therefore, BASiCS_MCMC
will fail to run if a
single batch of samples is provided. Note: batch information must be
provided via the BatchInfo
argument when using the newBASiCS_Data
function
or BatchInfo
must be stored as a slot in colData(Data)
when using an
existing SingleCellExperiment
object.
DataNoSpikes <- newBASiCS_Data(
Counts, Tech, SpikeInfo = NULL,
BatchInfo = rep(c(1, 2), each = 20)
)
# Alternatively
DataNoSpikes <- SingleCellExperiment(
assays = list(counts = Counts),
colData = data.frame(BatchInfo = rep(c(1, 2), each = 20))
)
ChainNoSpikes <- BASiCS_MCMC(
Data = DataNoSpikes, N = 1000,
Thin = 10, Burn = 500,
WithSpikes = FALSE, Regression = TRUE,
PrintProgress = FALSE
)
## -----------------------------------------------------
## MCMC sampler has been started: 1000 iterations to go.
## -----------------------------------------------------
## -----------------------------------------------------
## End of Burn-in period.
## -----------------------------------------------------
##
## -----------------------------------------------------
## -----------------------------------------------------
## All 1000 MCMC iterations have been completed.
## -----------------------------------------------------
## -----------------------------------------------------
##
## -----------------------------------------------------
## Please see below a summary of the overall acceptance rates.
## -----------------------------------------------------
##
## Minimum acceptance rate among mu[i]'s: 0.372
## Average acceptance rate among mu[i]'s: 0.48584
## Maximum acceptance rate among mu[i]'s: 0.62
##
##
## Minimum acceptance rate among delta[i]'s: 0.722
## Average acceptance rate among delta[i]'s: 0.7918
## Maximum acceptance rate among delta[i]'s: 0.838
##
##
## Minimum acceptance rate among nu[jk]'s: 0.884
## Average acceptance rate among nu[jk]'s: 0.935
## Maximum acceptance rate among nu[jk]'s: 0.982
##
##
## Minimum acceptance rate among theta[k]'s: 0.772
## Average acceptance rate among theta[k]'s: 0.785
## Maximum acceptance rate among theta[k]'s: 0.798
##
##
## -----------------------------------------------------
##
Regression = TRUE
The BASiCS model uses a joint informative prior formulation to account for the relationship between mean and over-dispersion gene-specific parameters. The latter is used to infer a global regression trend between these parameters and, subsequently, to derive a residual over-dispersion measure that is defined as departures with respect to this trend.
DataRegression <- newBASiCS_Data(
Counts, Tech, SpikeInfo,
BatchInfo = rep(c(1, 2), each = 20)
)
ChainRegression <- BASiCS_MCMC(
Data = DataRegression, N = 1000,
Thin = 10, Burn = 500,
Regression = TRUE,
PrintProgress = FALSE
)
## -----------------------------------------------------
## MCMC sampler has been started: 1000 iterations to go.
## -----------------------------------------------------
## -----------------------------------------------------
## End of Burn-in period.
## -----------------------------------------------------
##
## -----------------------------------------------------
## -----------------------------------------------------
## All 1000 MCMC iterations have been completed.
## -----------------------------------------------------
## -----------------------------------------------------
##
## -----------------------------------------------------
## Please see below a summary of the overall acceptance rates.
## -----------------------------------------------------
##
## Minimum acceptance rate among mu[i]'s: 0.428
## Average acceptance rate among mu[i]'s: 0.49715
## Maximum acceptance rate among mu[i]'s: 0.576
##
##
## Minimum acceptance rate among delta[i]'s: 0.746
## Average acceptance rate among delta[i]'s: 0.78325
## Maximum acceptance rate among delta[i]'s: 0.82
##
##
## Acceptance rate for phi (joint): 0.836
##
##
## Minimum acceptance rate among nu[j]'s: 0.794
## Average acceptance rate among nu[j]'s: 0.8442
## Maximum acceptance rate among nu[j]'s: 0.97
##
##
## Minimum acceptance rate among theta[k]'s: 0.72
## Average acceptance rate among theta[k]'s: 0.73
## Maximum acceptance rate among theta[k]'s: 0.74
##
## -----------------------------------------------------
##
This implementation provides additional functionality when performing
downstream analyses. These are illustrated below using a small extract from
the MCMC chain obtained when analysing the dataset provided in
(Grün, Kester, and Oudenaarden 2014) (single cell versus pool-and-split samples). These are
included within BASiCS
as the ChainSCReg
and ChainRNAReg
datasets.
To visualize the fit between over-dispersion \(\delta_i\) and mean expression $ _i$ the following function can be used.
data("ChainRNAReg")
BASiCS_ShowFit(ChainRNAReg)
The BASiCS_TestDE
test function will automatically perform differential
variability testing based on the residual over-dispersion parameters
\(\epsilon_i\) when its output includes two Chain
objects that were generated
by the regression BASiCS model.
data("ChainSCReg")
Test <- BASiCS_TestDE(
Chain1 = ChainSCReg, Chain2 = ChainRNAReg,
GroupLabel1 = "SC", GroupLabel2 = "PaS",
EpsilonM = log2(1.5), EpsilonD = log2(1.5),
EpsilonR = log2(1.5) / log2(exp(1)),
EFDR_M = 0.10, EFDR_D = 0.10,
Offset = TRUE, PlotOffset = FALSE, Plot = FALSE
)
This test function outputs an extra slot containing the results of the
differential testing residual over-dispersion test. Only genes that are
expressed in at least 2 cells (in both groups) are included in the test.
Genes that do not satisfy this condition are marked as ExcludedFromRegression
in the ResultDiffResDisp
field. By performing the regression,
all genes can be tested for changes in expression variability independent of
changes in mean expression.
head(as.data.frame(Test, Parameter = "ResDisp"))
## [1] GeneName ResDispOverall ResDisp1 ResDisp2
## [5] ResDispFC ResDispDistance ProbDiffResDisp ResultDiffResDisp
## [9] MeanOverall
## <0 rows> (or 0-length row.names)
BASiCS_PlotDE(Test, Parameters = "ResDisp")
Note: Additional parameters for this sampler include: k
number of
regression components (k
-2 radial basis functions, one intercept and one
linear component), Var
the scale parameter influencing the width of the basis
functions and eta
the degrees of freedom. For additional details about these
choices, please refer to Eling et al. (2018).
To externally store the output of BASiCS_MCMC
(recommended), additional
parameters StoreChains
, StoreDir
and RunName
are required. For example:
Data <- makeExampleBASiCS_Data()
Chain <- BASiCS_MCMC(
Data,
N = 1000, Thin = 10, Burn = 500,
Regression = TRUE, PrintProgress = FALSE,
StoreChains = TRUE, StoreDir = tempdir(), RunName = "Example"
)
## -----------------------------------------------------
## MCMC sampler has been started: 1000 iterations to go.
## -----------------------------------------------------
## -----------------------------------------------------
## End of Burn-in period.
## -----------------------------------------------------
##
## -----------------------------------------------------
## -----------------------------------------------------
## All 1000 MCMC iterations have been completed.
## -----------------------------------------------------
## -----------------------------------------------------
##
## -----------------------------------------------------
## Please see below a summary of the overall acceptance rates.
## -----------------------------------------------------
##
## Minimum acceptance rate among mu[i]'s: 0.43
## Average acceptance rate among mu[i]'s: 0.63636
## Maximum acceptance rate among mu[i]'s: 0.772
##
##
## Minimum acceptance rate among delta[i]'s: 0.436
## Average acceptance rate among delta[i]'s: 0.52052
## Maximum acceptance rate among delta[i]'s: 0.616
##
##
## Acceptance rate for phi (joint): 0.906
##
##
## Minimum acceptance rate among nu[j]'s: 0.44
## Average acceptance rate among nu[j]'s: 0.5334
## Maximum acceptance rate among nu[j]'s: 0.696
##
##
## Minimum acceptance rate among theta[k]'s: 0.732
## Average acceptance rate among theta[k]'s: 0.732
## Maximum acceptance rate among theta[k]'s: 0.732
##
## -----------------------------------------------------
##
In this example, the output of BASiCS_MCMC
will be stored as a BASiCS_Chain
object in the file “chain_Example.Rds”, within the tempdir()
directory.
To load pre-computed MCMC chains,
Chain <- BASiCS_LoadChain("Example", StoreDir = tempdir())
To assess convergence of the chain, the convergence diagnostics provided by the
package coda
can be used. Additionally, the chains can be visually inspected.
For example:
plot(Chain, Param = "mu", Gene = 1, log = "y")
See help(BASiCS_MCMC)
for example plots associated to other model parameters.
In the figure above:
?acf
)To access the MCMC chains associated to individual parameter use the function
displayChainBASiCS
. For example,
displayChainBASiCS(Chain, Param = "mu")[1:2, 1:2]
## Gene1 Gene2
## [1,] 5.35826 14.90511
## [2,] 12.75166 11.77155
As a summary of the posterior distribution, the function Summary
calculates
posterior medians and the High Posterior Density (HPD) intervals for each model
parameter. As a default option, HPD intervals contain 0.95 probability.
ChainSummary <- Summary(Chain)
The function displaySummaryBASiCS
extract posterior summaries for individual
parameters. For example
head(displaySummaryBASiCS(ChainSummary, Param = "mu"), n = 2)
## median lower upper
## Gene1 9.692761 5.358260 14.84602
## Gene2 12.643747 8.375004 19.98754
The following figures display posterior medians and the corresponding HPD 95% intervals for gene-specific parameters \(\mu_i\) (mean) and \(\delta_i\) (over-dispersion)
par(mfrow = c(1, 2))
plot(ChainSummary, Param = "mu", main = "All genes", log = "y")
plot(ChainSummary, Param = "delta",
Genes = c(2, 5, 10, 50), main = "5 customized genes"
)
See help(BASiCS_MCMC)
for example plots associated to other model parameters.
It is also possible to produce a matrix of normalised and denoised expression
counts for which the effect of technical variation is removed. For this purpose,
we implemented the function BASiCS_DenoisedCounts
. For each gene \(i\) and
cell \(j\) this function returns
\[
x^*_{ij} = \frac{ x_{ij} } {\hat{\phi}_j \hat{\nu}_j},
\]
where \(x_{ij}\) is the observed expression count of gene \(i\) in cell \(j\),
\(\hat{\phi}_j\) denotes the posterior median of \(\phi_j\) and \(\hat{\nu}_j\) is
the posterior median of \(\nu_j\).
DenoisedCounts <- BASiCS_DenoisedCounts(Data = Data, Chain = Chain)
DenoisedCounts[1:2, 1:2]
## Cell1 Cell2
## Gene1 1.678073 20.53865
## Gene2 10.068436 13.69243
Note: the output of BASiCS_DenoisedCounts
requires no further
data normalisation.
Alternativelly, the user can compute the normalised and denoised expression
rates underlying the expression of all genes across cells using
BASiCS_DenoisedRates
. The output of this function is given by
\[
\Lambda_{ij} = \hat{\mu_i} \hat{\rho}_{ij},
\]
where \(\hat{\mu_i}\) represents the posterior median of \(\mu_j\) and
\(\hat{\rho}_{ij}\) is given by its posterior mean (Monte Carlo estimate based
on the MCMC sample of all model parameters).
DenoisedRates <- BASiCS_DenoisedRates(
Data = Data, Chain = Chain,
Propensities = FALSE
)
DenoisedRates[1:2, 1:2]
## Cell1 Cell2
## Gene1 2.144986 17.52305
## Gene2 10.612366 13.60491
Alternative, denoised expression propensities \(\hat{\rho}_{ij}\) can also be extracted
DenoisedProp <- BASiCS_DenoisedRates(
Data = Data, Chain = Chain,
Propensities = TRUE
)
DenoisedProp[1:2, 1:2]
## Cell1 Cell2
## Gene1 0.2212977 1.807849
## Gene2 0.8393370 1.076019
We first describe the model introduced in [1], which relates to a single group of cells.
Throughout, we consider the expression counts of \(q\) genes, where \(q_0\) are
expressed in the population of cells under study (biological genes) and the
remaining \(q-q_0\) are extrinsic spike-in (technical) genes. Let \(X_{ij}\) be a
random variable representing the expression count of a gene \(i\) in cell \(j\)
(\(i=1,\ldots,q\); \(j=1,\ldots,n\)). BASiCS is based on the following hierarchical
model: \[X_{ij} \big| \mu_i, \phi_j, \nu_j, \rho_{ij} \sim \left\{ \begin{array}{ll} \mbox{Poisson}(\phi_j \nu_j \mu_i \rho_{ij}), \mbox{ for }i=1,\ldots,q_0, j=1,\ldots,n \\ \mbox{Poisson}(\nu_j \mu_i), \mbox{ for }i=q_0+1,\ldots,q, j=1,\ldots,n, \end{array} \right.\]
where \(\nu_j\) and \(\rho_{ij}\) are mutually independent random effects such that \(\nu_j|s_j,\theta \sim \mbox{Gamma}(1/\theta,1/ (s_j \theta))\) and \(\rho_{ij} | \delta_i \sim \mbox{Gamma} (1/\delta_i,1/\delta_i)\)1 We parametrize the Gamma distribution such that if \(X \sim \mbox{Gamma}(a,b)\), then \(\mbox{E}(X)=a/b\) and \(\mbox{var}(X)=a/b^2\)..
A graphical representation of this model is displayed below. This is based on the expression counts of 2 genes (\(i\): biological and \(i'\): technical) at 2 cells (\(j\) and \(j'\)). Squared and circular nodes denote known observed quantities (observed expression counts and added number of spike-in mRNA molecules) and unknown elements, respectively. Whereas black circular nodes represent the random effects that play an intermediate role in our hierarchical structure, red circular nodes relate to unknown model parameters in the top layer of hierarchy in our model. Blue, green and grey areas highlight elements that are shared within a biological gene, technical gene or cell, respectively.
In this setting, the key parameters to be used for downstream analyses are:
\(\mu_i\): mean expression parameter for gene \(i\) in the group of cells under study. In case of the spike-in technical genes, \(\mu_i\) is assumed to be known and equal to the input number of molecules of the corresponding spike-in gene).
\(\delta_i\): over-dispersion parameter for gene \(i\), a measure for the excess of variation that is observed after accounting for technical noise (with respect to Poisson sampling)
Additional (nuisance) parameters are interpreted as follows:
\(\phi_j\): cell-specific normalizing parameters related to differences in mRNA content (identifiability constrain: \(\sum_{j=1}^n \phi_j = n\)).
\(s_j\): cell-specific normalizing parameters related to technical cell-specific biases (for more details regarding this interpretation see (Vallejos et al. 2017)).
\(\theta\): technical over-dispersion parameter, controlling the strenght of cell-to-cell technical variability.
When cells from the same group are processed in multiple sequencing batches, this model is extended so that the technical over-dispersion parameter \(\theta\) is batch-specific. This extension allows a different strenght of technical noise to be inferred for each batch of cells.
In Vallejos, Richardson, and Marioni (2016), this model has been extended to cases where multiple groups of cells are under study. This is achieved by assuming gene-specific parameters to be also group-specific. Based on this setup, evidence of differential expression is quantified through log2-fold changes of gene-specific parameters (mean and over-dispersion) between the groups. Moreover, Eling et al. (2018) further extended this model by addressing the mean/over-dispersion confounding that is characteristic of scRNA-seq datasets as well as experimental designs where spike-ins are not available.
More details regarding the model setup, prior specification and implementation are described in Vallejos, Marioni, and Richardson (2015), Vallejos, Richardson, and Marioni (2016) and Eling et al. (2018).
This work has been funded by the MRC Biostatistics Unit (MRC grant no. MRC_MC_UP_0801/1; Catalina Vallejos and Sylvia Richardson), EMBL European Bioinformatics Institute (core European Molecular Biology Laboratory funding; Catalina Vallejos, Nils Eling and John Marioni), CRUK Cambridge Institute (core CRUK funding; John Marioni) and The Alan Turing Institute (EPSRC grant no. EP/N510129/1; Catalina Vallejos).
We thank several members of the Marioni laboratory (EMBL-EBI; CRUK-CI) for support and discussions throughout the development of this R library. In particular, we are grateful to Aaron Lun (LTLA) for advise and support during the preparation the Bioconductor submission.
We also acknowledge feedback, bug reports and contributions from (Github aliases provided within parenthesis): Ben Dulken (bdulken), Chang Xu (xuchang116), Danilo Horta (Horta), Dmitriy Zhukov (dvzhukov), Jens Preußner (jenzopr), Joanna Dreux (Joannacodes), Kevin Rue-Albrecht (kevinrue), Luke Zappia (lazappi), Nitesh Turaga (nturaga), Mike Morgan (MikeDMorgan), Muad Abd El Hay (Cumol), Simon Anders (s-andrews), Shian Su (Shians), Yongchao Ge and Yuan Cao (yuancao90), among others.
sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] ggplot2_3.3.5 cowplot_1.1.1
## [3] BASiCS_2.6.0 SingleCellExperiment_1.16.0
## [5] SummarizedExperiment_1.24.0 Biobase_2.54.0
## [7] GenomicRanges_1.46.0 GenomeInfoDb_1.30.0
## [9] IRanges_2.28.0 S4Vectors_0.32.0
## [11] BiocGenerics_0.40.0 MatrixGenerics_1.6.0
## [13] matrixStats_0.61.0 knitr_1.36
## [15] BiocStyle_2.22.0
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-7 RColorBrewer_1.1-2
## [3] tools_4.1.1 bslib_0.3.1
## [5] utf8_1.2.2 R6_2.5.1
## [7] irlba_2.3.3 DBI_1.1.1
## [9] colorspace_2.0-2 withr_2.4.2
## [11] gridExtra_2.3 tidyselect_1.1.1
## [13] compiler_4.1.1 BiocNeighbors_1.12.0
## [15] DelayedArray_0.20.0 labeling_0.4.2
## [17] bookdown_0.24 sass_0.4.0
## [19] scales_1.1.1 hexbin_1.28.2
## [21] stringr_1.4.0 digest_0.6.28
## [23] rmarkdown_2.11 XVector_0.34.0
## [25] pkgconfig_2.0.3 htmltools_0.5.2
## [27] sparseMatrixStats_1.6.0 highr_0.9
## [29] limma_3.50.0 fastmap_1.1.0
## [31] rlang_0.4.12 shiny_1.7.1
## [33] DelayedMatrixStats_1.16.0 farver_2.1.0
## [35] jquerylib_0.1.4 generics_0.1.1
## [37] jsonlite_1.7.2 BiocParallel_1.28.0
## [39] dplyr_1.0.7 RCurl_1.98-1.5
## [41] magrittr_2.0.1 BiocSingular_1.10.0
## [43] scuttle_1.4.0 GenomeInfoDbData_1.2.7
## [45] Matrix_1.3-4 Rcpp_1.0.7
## [47] munsell_0.5.0 fansi_0.5.0
## [49] viridis_0.6.2 lifecycle_1.0.1
## [51] stringi_1.7.5 yaml_2.2.1
## [53] edgeR_3.36.0 MASS_7.3-54
## [55] zlibbioc_1.40.0 plyr_1.8.6
## [57] grid_4.1.1 parallel_4.1.1
## [59] promises_1.2.0.1 dqrng_0.3.0
## [61] crayon_1.4.1 miniUI_0.1.1.1
## [63] lattice_0.20-45 beachmat_2.10.0
## [65] magick_2.7.3 locfit_1.5-9.4
## [67] metapod_1.2.0 pillar_1.6.4
## [69] igraph_1.2.7 reshape2_1.4.4
## [71] ScaledMatrix_1.2.0 glue_1.4.2
## [73] evaluate_0.14 scran_1.22.0
## [75] BiocManager_1.30.16 vctrs_0.3.8
## [77] httpuv_1.6.3 gtable_0.3.0
## [79] purrr_0.3.4 assertthat_0.2.1
## [81] xfun_0.27 ggExtra_0.9
## [83] rsvd_1.0.5 mime_0.12
## [85] xtable_1.8-4 coda_0.19-4
## [87] later_1.3.0 viridisLite_0.4.0
## [89] tibble_3.1.5 cluster_2.1.2
## [91] statmod_1.4.36 bluster_1.4.0
## [93] ellipsis_0.3.2
Eling, Nils, Arianne Richard, Sylvia Richardson, John C Marioni, and Catalina A Vallejos. 2018. “Correcting the Mean-Variance Dependency for Differential Variability Testing Using Single-Cell Rna Sequencing Data.” Cell Systems.
Grün, Dominic, Lennart Kester, and Alexander van Oudenaarden. 2014. “Validation of Noise Models for Single-Cell Transcriptomics.” Nature Methods 11 (6): 637–40.
Martinez-Jimenez, Celia Pilar, Nils Eling, Hung-Chang Chen, Catalina A Vallejos, Aleksandra A Kolodziejczyk, Frances Connor, Lovorka Stojic, et al. 2017. “Aging Increases Cell-to-Cell Transcriptional Variability Upon Immune Stimulation.” Science 355 (6332): 1433–6.
Newton, Michael A, Amine Noueiry, Deepayan Sarkar, and Paul Ahlquist. 2004. “Detecting Differential Gene Expression with a Semiparametric Hierarchical Mixture Method.” Biostatistics 5 (2): 155–76.
Vallejos, Catalina A, John C Marioni, and Sylvia Richardson. 2015. “BASiCS: Bayesian Analysis of Single-Cell Sequencing Data.” PLoS Computational Biology 11 (6): e1004333.
Vallejos, Catalina A, Sylvia Richardson, and John C Marioni. 2016. “Beyond Comparisons of Means: Understanding Changes in Gene Expression at the Single-Cell Level.” Genome Biology 17 (1).
Vallejos, Catalina A, Davide Risso, Antonio Scialdone, Sandrine Dudoit, and John C Marioni. 2017. “Normalizing Single-Cell RNA Sequencing Data: Challenges and Opportunities.” Nature Methods 14 (6).