scMultiSim Basics

We define a utility function to modify a list.

list_modify <- function (curr_list, ...) {
  args <- list(...)
  for (i in names(args)) {
    curr_list[[i]] <- args[[i]]
  }
  curr_list
}
# library("devtools")
# devtools::load_all(".")
library("scMultiSim")
library(dplyr)

Simulation Basics

Simulate true counts by calling sim_true_counts(options) where options is a list. Use scmultisim_help() to get help.

scmultisim_help("options")

GRN and Differentiation Tree

The minimal input to scMultiSim is a differentiation tree, and you can optionally provide ground truth for GRN and cell-cell interactions. The differentiation tree is an R phylo object, which can be created using e.g. ape::read.tree() or ape::rtree(). It controls the cell population structure: each node of the tree should represent a cell type, and connected nodes indicate the differentiation relationship between cell types. scMultiSim provides this explicit control on the cell population structure while preserving all other effects (such as GRN and Cell-Cell Interactions), so you can generate any cell trajectory or clustering structure you want, which is especially useful for benchmarking trajectory inference and clustering methods.

If generating a continuous population, this tree specifies the cell differentiation trajectory; if generating a discrete population, the tips of this tree will be the clusters (cell types).

scMultiSim also provides three differentiation trees. Phyla5() and Phyla3() return bifurcating trees with 5 and 3 leaves respectively. Phyla1() returns only a single branch, which can be useful when we don’t want any specific trajectory.

par(mfrow=c(1,2))
Phyla5(plotting = TRUE)
## 
## Phylogenetic tree with 5 tips and 4 internal nodes.
## 
## Tip labels:
##   1, 2, 3, 4, 5
## 
## Rooted; includes branch lengths.
Phyla3(plotting = TRUE)

plot of chunk plot-tree

## 
## Phylogenetic tree with 3 tips and 2 internal nodes.
## 
## Tip labels:
##   1, 2, 3
## 
## Rooted; includes branch lengths.
# It's not possible to plot Phyla1() because it only contains 1 branch connecting two nodes.
Phyla1()
## 
## Phylogenetic tree with 1 tips and 1 internal nodes.
## 
## Tip labels:
##   A
## 
## Rooted; includes branch lengths.

The GRN should be a data frame with 3 columns, each representing the target, regulator, and effect. The target and regulator should be gene names, which can be integers or strings. The effect should be a numeric value, indicating the effect of the regulator on the target.

scMultiSim provides two sample GRNs, GRN_params_100 and GRN_params_1139, which contain 100 and 1139 genes respectively. Let’s load them first.

data(GRN_params_100)
GRN_params <- GRN_params_100
head(GRN_params)
##   regulated.gene regulator.gene regulator.effect
## 1             16             10         1.768699
## 2             29             10         1.675360
## 3             62             10         3.723509
## 4             55             10         2.866182
## 5             81             80         4.020744
## 6             73             80         2.353371

Simulating True Counts

Now, we create the options list for the simulation session. In the following example, we simulate 500 cells with 50 CIFs.

The number of genes is determined by the option num.genes or the number of genes in the GRN. If num.genes is not specified, the number of genes will be the number of unique genes in the GRN, plus a fraction of genes that are not regulated by any other genes. this is controlled by the option unregulated.gene.ratio (default is 0.1). Since our GRN_params contains 100 gene names, 10% more genes will be added to the simulation, and the number of genes in the simulated data will be 110. If you don’t need to simulate GRN effects, simply set GRN = NA.

The cif.sigma controls the variance of the CIFs. Usually, with cif.sigma = 0.1, the trajectory will be very clear, while with cif.sigma = 1, the trajectory will be more noisy. We use cif.sigma = 0.5 in this example.

We also have do.velocity option to use the Kinetic model to simulate RNA velocity data.

set.seed(42)

options <- list(
  GRN = GRN_params,
  num.cells = 300,
  num.cifs = 20,
  cif.sigma = 1,
  tree = Phyla5(),
  diff.cif.fraction = 0.8,
  do.velocity = TRUE
)

Now we run the simulation and check what kind of data is in the returned result:

results <- sim_true_counts(options)
## Time spent: 0.24 mins
names(results)
##  [1] ".grn"             "unspliced_counts" ".options"         ".n"              
##  [5] "region_to_gene"   "atacseq_data"     "giv"              "kinetic_params"  
##  [9] "num_genes"        "velocity"         "cif"              "hge_scale"       
## [13] "counts"           "cell_meta"        "atac_counts"      "grn_params"      
## [17] "cell_time"

The results contains the following important data:

Visualize the Results

We can visualize the true counts and ATAC-seq data using ploTSNE(data, labels):

plot_tsne(log2(results$counts + 1),
         results$cell_meta$pop,
         legend = 'pop', plot.name = 'True RNA Counts Tsne')
plot of chunk plot-counts

plot of chunk plot-counts

plot_tsne(log2(results$atacseq_data + 1),
         results$cell_meta$pop,
         legend = 'pop', plot.name = 'True ATAC-seq Tsne')
plot of chunk plot-counts

plot of chunk plot-counts

Since we also have RNA velocity enabled, the results also contains the following data:

plot_rna_velocity(results, arrow.length = 2)
## $raw
plot of chunk plot-velocity

plot of chunk plot-velocity

## 
## $normalized
plot of chunk plot-velocity

plot of chunk plot-velocity

## 
## $knn_normalized
plot of chunk plot-velocity

plot of chunk plot-velocity

Adding Technical Variation

We can also add the technical variation and batch effect to the true counts:

# adds `counts_obs` to `results`
add_expr_noise(results)
## Adding experimental noise...
## 50..100..150..200..250..300..Using atac_counts
## Time spent: 0.13 mins
# adds `counts_with_batches` to `results`
divide_batches(results, nbatch = 2)
## Adding batch effects...
plot_tsne(log2(results$counts_with_batches + 1),
         results$cell_meta$pop,
         legend = 'pop', plot.name = 'RNA Counts Tsne with Batches')
plot of chunk add-expr-noise

plot of chunk add-expr-noise

Adjusting Parameters

scMultiSim provides various parameters to control each type of biological effect.

Discrete Cell Population

We can also simulate discrete cell population by setting discrete.cif = TRUE. In this case, each tip of the tree will be one cell type, therefore there will be 5 clusters in the following result.

set.seed(42)

options <- list(
  GRN = GRN_params,
  num.cells = 400,
  num.cifs = 20,
  tree = Phyla5(),
  diff.cif.fraction = 0.8,
  discrete.cif = TRUE
)

results <- sim_true_counts(options)
## Time spent: 0.17 mins
plot_tsne(log2(results$counts + 1),
         results$cell_meta$pop,
         legend = 'pop', plot.name = 'True RNA Counts Tsne')
plot of chunk simulate-discrete

plot of chunk simulate-discrete

Adjusting the Effect of Cell Population

In scMultiSim, the differentiation tree provides explicit control of the cell population. The effect of the tree can be adjusted by the option diff.cif.fraction, which controls how many CIFs are affected by the cell population. With a larger diff.cif.fraction, the effect of cell population will be larger and you may see a clearer trajectory or well separated clusters. With a smaller diff.cif.fraction, the resulting RNA counts will be more affected by other factors, such as the GRN.

Now let’s visualize the trajectory with different diff.cif.fraction values:

set.seed(42)

options <- list(
  GRN = GRN_params,
  num.cells = 300,
  num.cifs = 20,
  tree = Phyla5(),
  diff.cif.fraction = 0.8
)

results <- sim_true_counts(
        options %>% list_modify(diff.cif.fraction = 0.4))
## Time spent: 0.12 mins
plot_tsne(log2(results$counts + 1),
         results$cell_meta$pop,
         legend = 'pop', plot.name = 'RNA Counts (diff.cif.fraction = 0.2)')
plot of chunk adjust-diff-cif-fraction

plot of chunk adjust-diff-cif-fraction

results <- sim_true_counts(
        options %>% list_modify(diff.cif.fraction = 0.9))
## Time spent: 0.12 mins
plot_tsne(log2(results$counts + 1),
         results$cell_meta$pop,
         legend = 'pop', plot.name = 'RNA Counts (diff.cif.fraction = 0.8)')
plot of chunk adjust-diff-cif-fraction

plot of chunk adjust-diff-cif-fraction

Adjusting the Inherent Cell Heterogeneity

The inherent cell heterogeneity is controlled by the non-diff-CIF, which is sampled from a normal distribution with mean cif.mean and standard deviation cif.sigma. Therefore, the larger cif.sigma is, the larger the inherent cell heterogeneity is.

Now, let’s visualize the effect of cif.sigma:

set.seed(42)

options <- list(
  GRN = GRN_params,
  num.cells = 300,
  num.cifs = 20,
  tree = Phyla5(),
  diff.cif.fraction = 0.8,
  cif.sigma = 0.5
)

results <- sim_true_counts(
        options %>% list_modify(cif.sigma = 0.1))
## Time spent: 0.12 mins
plot_tsne(log2(results$counts + 1),
         results$cell_meta$pop,
         legend = 'pop', plot.name = 'RNA Counts (cif.sigma = 0.1)')
plot of chunk adjust-cif-sigma

plot of chunk adjust-cif-sigma

results <- sim_true_counts(
        options %>% list_modify(cif.sigma = 1.0))
## Time spent: 0.13 mins
plot_tsne(log2(results$counts + 1),
         results$cell_meta$pop,
         legend = 'pop', plot.name = 'RNA Counts (cif.sigma = 1.0)')
plot of chunk adjust-cif-sigma

plot of chunk adjust-cif-sigma

Adjusting the Intrinsic Noise

If we set do.velocity = FALSE, scMultiSim will simulate the RNA counts using the Beta-Poisson model, which is faster but doesn’t output RNA velocity. When using the Beta-Possion model, scMultiSim provides a intrinsic.noise parameter to control the intrinsic noise during the transcription process. By default, intrinsic.noise is set to 1, which means the true counts will be sampled from the Beta-Poisson model. If we set intrinsic.noise to a smaller value like 0.5, the true counts will be 0.5 * (theoretical mean) + 0.5 * (sampled from the Beta-Poisson model).

set.seed(42)

options <- list(
  GRN = GRN_params,
  num.cells = 300,
  num.cifs = 20,
  tree = Phyla5(),
  diff.cif.fraction = 0.8,
  intrinsic.noise = 1
)

results <- sim_true_counts(
        options %>% list_modify(intrinsic.noise = 0.5))
## Time spent: 0.12 mins
plot_tsne(log2(results$counts + 1),
         results$cell_meta$pop,
         legend = 'pop', plot.name = 'RNA Counts (intrinsic.noise = 0.5)')
plot of chunk adjust-intrinsic-noise

plot of chunk adjust-intrinsic-noise

results <- sim_true_counts(
        options %>% list_modify(intrinsic.noise = 1))
## Time spent: 0.12 mins
plot_tsne(log2(results$counts + 1),
         results$cell_meta$pop,
         legend = 'pop', plot.name = 'RNA Counts (intrinsic.noise = 1)')
plot of chunk adjust-intrinsic-noise

plot of chunk adjust-intrinsic-noise

Simulating Dynamic GRN

First, call the following function to check the usage of dynamic GRN.

scmultisim_help("dynamic.GRN")
## Dynamic GRN deletes and creates some edges in the GRN in each epoch.
## One epoch contains multiple steps, and the change is done gradually in steps.
## The specific GRN at each step will be used by one or more cells sequentially.
## When an epoch is done, another epoch will start.
## 
## Available options for dynamic.GRN:
##   - seed: the random seed
##   - num.steps: number of steps in each epoch.
##   - cell.per.step: how many cells share the GRN in the same step.
##   - involved.genes: a new edge will only be created within these specified genes.
##       The default value is NA, which will use all existing genes in the GRN.
##   - num.changing.edges: if < 1, it means the portion of edges added/deleted in each epoch.
##       if >= 1, it means the number of edges added/deleted in each epoch.
##   - create.tf.edges: whether a new edge can connect two TFs in the GRN.
##   - weight.mean: the mean value of the weight for a newly created edge.
##       The default value is NA, meaning that it will use the mean value of the input GRN.
##   - weight.sd: the standard deviation of the weight for a newly created edge.
## 
## See the returned list for the default values.
## NULL

Here we use Phyla1() as the differentiation tree to remove the effect of the trajectory. Additionally, we can use discrete.cif = TRUE to simulate discrete cell population.

set.seed(42)

options_ <- list(
  GRN = GRN_params,
  num.cells = 300,
  num.cifs = 20,
  tree = Phyla1(),
  diff.cif.fraction = 0.8,
  do.velocity = FALSE,
  dynamic.GRN = list(
    cell.per.step = 3,
    num.changing.edges = 5,
    weight.mean = 0,
    weight.sd = 4
  )
)

results <- sim_true_counts(options_)
## Time spent: 0.13 mins

results$cell_specific_grn is a list containing the gene effects matrix for each cell. Each row is a target and each column is a regulator. The corresponding gene names are displayed as column and row names.

# GRN for cell 1 (first 10 rows)
results$cell_specific_grn[[1]][1:10,]
##           2        6       10       19 80 91
## 1  2.250484 1.824607 1.941613 4.268812  0  0
## 2  0.000000 0.000000 0.000000 0.000000  0  0
## 3  2.184071 0.000000 0.000000 0.000000  0  0
## 4  1.634807 0.000000 0.000000 0.000000  0  0
## 5  4.677883 4.301893 2.090473 4.047174  0  0
## 6  0.000000 0.000000 0.000000 0.000000  0  0
## 7  3.306285 2.673968 2.414780 3.548154  0  0
## 8  2.707510 4.653263 1.357980 1.829304  0  0
## 9  1.501430 1.970649 4.359262 3.915318  0  0
## 10 0.000000 0.000000 0.000000 0.000000  0  0

Since we set cell.per.step = 3, we expect each adjacent 3 cells share the same GRN:

print(all(results$cell_specific_grn[[1]] == results$cell_specific_grn[[2]]))
## [1] TRUE
print(all(results$cell_specific_grn[[2]] == results$cell_specific_grn[[3]]))
## [1] TRUE
print(all(results$cell_specific_grn[[3]] == results$cell_specific_grn[[4]]))
## [1] FALSE

Session Information

sessionInfo()
## R version 4.4.0 beta (2024-04-15 r86425)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] dplyr_1.1.4      scMultiSim_1.0.0 knitr_1.46      
## 
## loaded via a namespace (and not attached):
##  [1] utf8_1.2.4                  generics_0.1.3             
##  [3] SparseArray_1.4.0           lattice_0.22-6             
##  [5] digest_0.6.35               magrittr_2.0.3             
##  [7] evaluate_0.23               grid_4.4.0                 
##  [9] iterators_1.0.14            foreach_1.5.2              
## [11] jsonlite_1.8.8              Matrix_1.7-0               
## [13] ape_5.8                     GenomeInfoDb_1.40.0        
## [15] httr_1.4.7                  fansi_1.0.6                
## [17] UCSC.utils_1.0.0            scales_1.3.0               
## [19] codetools_0.2-20            abind_1.4-5                
## [21] cli_3.6.2                   rlang_1.1.3                
## [23] crayon_1.5.2                XVector_0.44.0             
## [25] Biobase_2.64.0              munsell_0.5.1              
## [27] withr_3.0.0                 DelayedArray_0.30.0        
## [29] Rtsne_0.17                  S4Arrays_1.4.0             
## [31] tools_4.4.0                 parallel_4.4.0             
## [33] zeallot_0.1.0               colorspace_2.1-0           
## [35] ggplot2_3.5.1               GenomeInfoDbData_1.2.12    
## [37] SummarizedExperiment_1.34.0 BiocGenerics_0.50.0        
## [39] assertthat_0.2.1            vctrs_0.6.5                
## [41] R6_2.5.1                    matrixStats_1.3.0          
## [43] stats4_4.4.0                lifecycle_1.0.4            
## [45] zlibbioc_1.50.0             S4Vectors_0.42.0           
## [47] IRanges_2.38.0              MASS_7.3-60.2              
## [49] pkgconfig_2.0.3             pillar_1.9.0               
## [51] gtable_0.3.5                glue_1.7.0                 
## [53] Rcpp_1.0.12                 highr_0.10                 
## [55] xfun_0.43                   tibble_3.2.1               
## [57] GenomicRanges_1.56.0        tidyselect_1.2.1           
## [59] KernelKnn_1.1.5             MatrixGenerics_1.16.0      
## [61] farver_2.1.1                nlme_3.1-164               
## [63] labeling_0.4.3              compiler_4.4.0             
## [65] markdown_1.12