BUSseq_MCMC {BUSseq}R Documentation

Condcut MCMC sampling and posterior inference for the BUSseq Model

Description

The function BUSseq_MCMC runs a Markov chain Monte Carlo (MCMC) algorithm to fit the Batch effects correction with Unknown Subtypes for scRNA-seq data (BUSseq) model. BUSseq is an interpretable Bayesian hierarchical model that closely follows the data-generating mechanism of scRNA-seq experiments. BUSseq can simultaneously correct batch effects, cluster cell types, impute missing data caused by dropout events and detect differentially expressed genes without requiring a preliminary normalization step. We adopt the MCMC algorithm to conduct posterior inference for the BUSseq model. Here, we denote the total number of cells as N, the batch number as B, the gene number as G, and the cell-type number as K.

Usage

BUSseq_MCMC(ObservedData, n.celltypes, seed = round(runif(1,1,10000)), 
            n.cores = 8,
            n.iterations = 2000, n.burnin = floor(n.iterations/2),
            n.unchanged = min(floor(n.iterations * 0.3), 500),
            n.output = n.iterations/10, working_dir = getwd(),
            Drop_ind = rep(TRUE, length(ObservedData)), fdr = 0.05, 
            hyper_pi = 2, hyper_gamma0 = 3, 
            hyper_gamma1 = c(0.001, 0.01), 
            hyper_alpha = 5, tau1sq = 50, hyper_p = c(1, 3), 
            hyper_tau0sq = c(2, 0.01), hyper_nu = 5,
            hyper_delta = 5, hyper_phi = c(1, 0.1))

Arguments

ObservedData

ObservedData is a list with the length equal to the batch number, or a SingleCellExperiment object. If it is a list, the b-th element of ObservedData should be the raw count data matrix of batch b, where each row corresponds to a gene, and each column corresponds to a cell. If it is a SingleCellExperiment object, the raw count data matrix should have a count matrix in counts, and Batch_ind exists in the colData to indicate the corresponding batch of each cell. Moreover, cells from the same batch should be arranged together.

n.celltypes

n.celltypes is an integer and represents the number of cell types, which needs to be specified by the user.

seed

seed is an integer for the Random Number Generator.

n.cores

n.cores is an integer and denotes the number of cores used for the parallel MCMC algorithm.

n.iterations

n.iterations is the total number of iterations of the MCMC algorithm. The default is 2000.

n.burnin

n.burnin is the number of burn-in iterations of the MCMC sampling. The default is a half of n.iterations

n.unchanged

n.unchanged is the number of iterations where the MCMC does not update the hyperparameter p and tau0 of the slab and spike prior of cell-type effects. The default is the minimum of one third of n.iterations and 500.

n.output

n.output is the number of iterations per hard-disk writing of the posterior sampling. The default is one tenth of n.iterations.

working_dir

working_dir is the directory to store the posterior samples. The default is the current directory.

Drop_ind

Drop_ind is a Boolean vector with length equal to the number of batches and indicates which batch suffers from dropout events.

fdr

The false discovery rate level we want to control in order to identify intrinsic genes. The default is 0.05.

hyper_pi

hyper_pi is a scalar representing the hyperparameter of the Dirichlet prior for cell type proportions. The default is 2.

hyper_gamma0

hyper_gamma0 is a scalar representing the variance of the normal prior for the intercept of the logistic regression for dropout events. The default is 3.

hyper_gamma1

hyper_gamma1 is a vector representing the hyperparameter of the Gamma prior for the log-odds ratio of dropout events. The default is (0.001, 0.01) such that the prior mean of gamma1 is 0.1.

hyper_alpha

hyper_alpha is a scalar representing the variance of the normal prior for the log-scale baseline expression levels. The default is 5.

tau1sq

tau1sq is the slab variance of the spike-and-slab prior for the cell type effects. The default is 50.

hyper_p

hyper_p is a two-dimensional vector representing the two shape parameters of the beta prior for the proportion of intrinsic genes. The default is c(1,3).

hyper_tau0sq

hyper_tau0sq is a two-dimensional vector representing the shape and scale of the inverse gamma prior for the variance of the spike normal prior. The default is c(2,0.01).

hyper_nu

hyper_nu is a scalar representing the variance of the normal prior for the batch effects. The default is 5.

hyper_delta

hyper_delta is a scalar representing the variance of the normal prior for the cell-specific size effects. The default is sqrt(5).

hyper_phi

hyper_phi is a two-dimensional vector representing the shape and rate of the gamma prior for the overdispersion parameters. The default is c(1,0.1).

Value

A SingleCellExperiment object res is outputed with an imputed count data matrix assay(res, "imputed_data") after imputing dropout events. At the same time, the estimated cell type labels and relative size factors of cells are added in the column-level metadata int_colData(res)$BUSseq, while the identified intrinsic genes are stored in the row-level metadata int_elementMetadata(res)$BUSseq.

Morover, the dimension information, the posterior inference of parameters and latent variables are stored as a list in metadata(res)$BUSseq:

n.cell

The total number of cells in all batches, a scalar.

n.gene

The number of genes, a scalar.

n.batch

The number of batches, a scalar.

n.perbatch

The number of cells in each batch, a B-dimensional vector.

n.celltype

The number of cell types specified by user, a scalar.

n.iter

The total number of iterations applied in the MCMC algorithm, a scalar.

seed

The seed for the MCMC algorithm.

n.burnin

The number of iterations as burnin, a scalar.

gamma.est

The estimated intercept and odds ratio of the logistic regression for dropout events, a B by 2 matrix.

alpha.est

The estimated log-scale baseline expression levels, a G-dimensional vector whose g-th element is the estimated log-scale mean gene expression level of gene g in the first cell type.

beta.est

The estimated cell-type effects, a G by K matrix, whose [g,k] element is the effects of cell type k on gene g compared with the first cell type. Note that the first column is zero as the first cell type is taken as the baseline cell type.

nu.est

The estimated location batch effects, a G by B matrix, where [g,b] element is the location batch effect on gene g in the batch b compared with the first batch. Note that the first column is zero as the first batch is taken as the reference batch without batch effects.

delta.est

The estimated cell-specific global effects, an N-dimensional vector. Note that the first element in each vector is zero as the first cell in each batch is taken as the reference cell.

phi.est

The estimated overdispersion parameters, a G by B matrix, where [g,b] element is the overdispersion parameter on gene g in batch b.

pi.est

The estimated cell-type proportions across batches, a B by K matrix, whose [b,k] element is the estimated proportion of cell type k in batch b.

w.est

The estimated cell-type indicators of each cell, a N-dimensional vector.

p.est

The estimated proportion of differentially expressed genes compared with the first cell type, a scalar.

PPI.est

The estimated posterior marginal probability of being differentially expressed genes compared with the first cell type. The return is G by K matrix. Noted that the first column consist of zeros as there is no differentially expressed gene compared with the cell type of itself.

D.est

The intrinsic gene indicators. The return is an N-dimensional vector.

BIC

The BIC value when K = n.celltypes, which is used to determine the number of cell types by varying the value of K.

Author(s)

Fangda Song

References

Song, Fangda, Ga Ming Angus Chan, and Yingying Wei. Flexible experimental designs for valid single-cell RNA-sequencing experiments allowing batch effects correction. Nature communications 11, no. 1 (2020): 1-15.

Examples


#######################################
# Apply BUSseq to the Simulation Data #
#######################################
library(SingleCellExperiment)
RawCountData <- assay(BUSseqfits_example, "counts")
batch_ind <- colData(BUSseqfits_example)
sce <- SingleCellExperiment(assays = list(counts = RawCountData),
                            colData = DataFrame(Batch_ind = batch_ind))
BUSseqfits_res <- BUSseq_MCMC(ObservedData = sce, 
                              seed = 1234, n.cores = 2,
                              n.celltypes = 4, n.iterations = 500)

################################################
# Extract Estimates from the BUSseqfits Object #
################################################

#return cell type indicators
w.est <- celltypes(BUSseqfits_res)

#return the intercept and odds ratio of the logistic regression
#for dropout events
gamma.est <- dropout_coefficient_values(BUSseqfits_res)

#return the log-scale baseline expression values
alpha.est <-  baseline_expression_values(BUSseqfits_res)

#return the cell-type effects
beta.est <- celltype_effects(BUSseqfits_res)

#return the mean expression levels
mu.est <- celltype_mean_expression(BUSseqfits_res)

#return the cell-specific global effects
delta.est <- cell_effect_values(BUSseqfits_res)

#return the location batch effects
nu.est <- location_batch_effects(BUSseqfits_res)

#return the overdispersion parameters
phi.est <- overdispersions(BUSseqfits_res)

#return the intrinsic gene indices
D.est <- intrinsic_genes_BUSseq(BUSseqfits_res)

#return the BIC value
BIC <- BIC_BUSseq(BUSseqfits_res)

#return the raw read count matrix
CountData_raw <- raw_read_counts(BUSseqfits_res)

#return the imputed read count matrix
CountData_imputed <- imputed_read_counts(BUSseqfits_res)

#return the corrected read count matrix
CountData_corrected <- corrected_read_counts(BUSseqfits_res)

#################
# Visualization #
#################
#generate the heatmap of raw read count data
heatmap_data_BUSseq(BUSseqfits_res, project_name="Heatmap_raw")

#generate the heatmap of imputed read count data
heatmap_data_BUSseq(BUSseqfits_res, data_type = "Imputed",
                    project_name="Heatmap_imputed")

#generate the heatmap of corrected read count data
heatmap_data_BUSseq(BUSseqfits_res, data_type = "Corrected", 
                    project_name="Heatmap_corrected")


[Package BUSseq version 1.0.0 Index]