1 Installation

1.1 Load

2 Description

3 Downloading datasets

4 Exploring the data structure

4.1 Cell metadata
4.2 RNA expression
4.3 Chromatin Accessibility

5 Suggested software for the downstream analysis

6 sessionInfo

1 Installation

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("SingleCellMultiModal")

1.1 Load

library(SingleCellMultiModal)
library(MultiAssayExperiment)
library(scran)
library(scater)

2 Description

This data set consists of about 10K Peripheral Blood Mononuclear Cells (PBMCs) derived from a single healthy donor. It is available from the 10x Genomics website.

Provided are the RNA expression counts quantified at the gene level and the chromatin accessibility levels quantified at the peak level. Here we provide the default peaks called by the CellRanger software. If you want to explore other peak definitions or chromatin accessibility quantifications (at the promoter level, etc.), you have to download the fragments.tsv.gz file from the 10x Genomics website.

3 Downloading datasets

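With the default options (dry.run = TRUE), scMultiome only lists the available datasets. Setting dry.run = FALSE downloads the data and returns a MultiAssayExperiment: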

mae <- scMultiome("pbmc_10x", mode = "*", dry.run = FALSE, format = "MTX")

## Working on: pbmc_atac_se.rds

## Working on: pbmc_atac.mtx.gz

## Working on: pbmc_rna_se.rds

## Working on: pbmc_rna.mtx.gz

## Working on: pbmc_atac,
##  pbmc_rna

## see ?SingleCellMultiModal and browseVignettes('SingleCellMultiModal') for documentation

## loading from cache

## see ?SingleCellMultiModal and browseVignettes('SingleCellMultiModal') for documentation

## loading from cache

## Working on: pbmc_atac,
##  pbmc_rna

## see ?SingleCellMultiModal and browseVignettes('SingleCellMultiModal') for documentation

## loading from cache

## see ?SingleCellMultiModal and browseVignettes('SingleCellMultiModal') for documentation

## loading from cache

## Working on: pbmc_colData

## Working on: pbmc_sampleMap

## see ?SingleCellMultiModal and browseVignettes('SingleCellMultiModal') for documentation

## loading from cache

## see ?SingleCellMultiModal and browseVignettes('SingleCellMultiModal') for documentation

## loading from cache

4 Exploring the data structure

There are two assays, rna and atac, stored as SingleCellExperiment objects:

mae

## A MultiAssayExperiment object of 2 listed
##  experiments with user-defined names and respective classes.
##  Containing an ExperimentList class object of length 2:
##  [1] atac: SingleCellExperiment with 108344 rows and 10032 columns
##  [2] rna: SingleCellExperiment with 36549 rows and 10032 columns
## Functionality:
##  experiments() - obtain the ExperimentList instance
##  colData() - the primary/phenotype DataFrame
##  sampleMap() - the sample coordination DataFrame
##  `$`, `[`, `[[` - extract colData columns, subset, or experiment
##  *Format() - convert into a long or wide DataFrame
##  assays() - convert ExperimentList to a SimpleList of matrices
##  exportClass() - save data to flat files

where the cells are the same in both assays:

upsetSamples(mae)

4.1 Cell metadata

Columns:

nCount_RNA: number of RNA read counts
nFeature_RNA: number of genes with at least one read count
nCount_ATAC: number of ATAC read counts
nFeature_ATAC: number of ATAC peaks with at least one read count
celltype: The cell types have been annotated by the 10x Genomics R&D team using gene markers. They provide a rough characterisation of the cell type diversity, but keep in mind that they are not ground truth labels.
broad_celltype: Lymphoid or Myeloid origin

The cells have not been quality-controlled; choosing a minimum number of genes/peaks per cell is left to you! In addition, there are further quality control criteria that you may want to apply, including mitochondrial coverage, the fraction of reads overlapping ENCODE blacklisted regions, transcription start site enrichment, etc. See the suggestions below for software that can perform a semi-automated quality control pipeline.

head(colData(mae))

## DataFrame with 6 rows and 6 columns
##                  nCount_RNA nFeature_RNA nCount_ATAC nFeature_ATAC
##                   <integer>    <integer>   <integer>     <integer>
## AAACAGCCAAGGAATC       8380         3308       55582         13878
## AAACAGCCAATCCCTT       3771         1896       20495          7253
## AAACAGCCAATGCGCT       6876         2904       16674          6528
## AAACAGCCAGTAGGTG       7614         3061       39454         11633
## AAACAGCCAGTTTACG       3633         1691       20523          7245
## AAACAGCCATCCAGGT       7782         3028       22412          8602
##                                celltype broad_celltype
##                             <character>    <character>
## AAACAGCCAAGGAATC      naive CD4 T cells       Lymphoid
## AAACAGCCAATCCCTT     memory CD4 T cells       Lymphoid
## AAACAGCCAATGCGCT      naive CD4 T cells       Lymphoid
## AAACAGCCAGTAGGTG      naive CD4 T cells       Lymphoid
## AAACAGCCAGTTTACG     memory CD4 T cells       Lymphoid
## AAACAGCCATCCAGGT non-classical monocy..        Myeloid
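As a minimal sketch of the kind of per-cell filter described above, the following applies a minimum number of detected genes and peaks to the six cells printed by head(colData(mae)). The thresholds min_genes and min_peaks are hypothetical values chosen only for illustration, not recommendations from the data providers.

```r
# Metadata for the six cells shown above (values copied from the printed output)
meta <- data.frame(
  nFeature_RNA  = c(3308, 1896, 2904, 3061, 1691, 3028),
  nFeature_ATAC = c(13878, 7253, 6528, 11633, 7245, 8602),
  row.names = c("AAACAGCCAAGGAATC", "AAACAGCCAATCCCTT", "AAACAGCCAATGCGCT",
                "AAACAGCCAGTAGGTG", "AAACAGCCAGTTTACG", "AAACAGCCATCCAGGT")
)

# Hypothetical thresholds, for illustration only
min_genes <- 2000
min_peaks <- 7000

# A cell is kept only if it passes both the RNA and the ATAC criterion
keep <- meta$nFeature_RNA >= min_genes & meta$nFeature_ATAC >= min_peaks
kept_cells <- rownames(meta)[keep]
kept_cells  # 3 of the 6 cells pass both filters
```

On the full object, the same logical vector could be computed from colData(mae) and, assuming the usual MultiAssayExperiment column subsetting, applied as mae[, keep, ] so that both assays are filtered together.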

4.2 RNA expression

The RNA expression data consist of 36,549 genes and 10,032 cells, stored using the dgCMatrix sparse matrix format:

dim(experiments(mae)[["rna"]])

## [1] 36549 10032

names(experiments(mae))

## [1] "atac" "rna"
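For readers unfamiliar with the dgCMatrix format mentioned above, here is a small self-contained sketch using the Matrix package. The toy matrix and its gene/cell names are made up for illustration; only the non-zero entries are stored, which is what keeps single-cell count matrices manageable in memory.

```r
library(Matrix)

# Toy gene-by-cell count matrix with 4 non-zero entries (hypothetical data)
counts <- sparseMatrix(
  i = c(1, 3, 2, 4),          # row (gene) indices of the non-zero entries
  j = c(1, 1, 2, 3),          # column (cell) indices of the non-zero entries
  x = c(5, 1, 2, 7),          # the counts themselves
  dims = c(4, 3),
  dimnames = list(paste0("gene", 1:4), paste0("cell", 1:3))
)

class(counts)                 # a column-compressed sparse matrix (dgCMatrix)

# Fraction of entries that are zero: 8 of the 12 entries here
sparsity <- 1 - nnzero(counts) / prod(dim(counts))
sparsity
```

The same nnzero() calculation can be applied to a real count matrix of this class to check how sparse each modality is.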

Let’s make a standard dimensionality reduction plot:

sce.rna <- experiments(mae)[["rna"]]

# Normalisation
sce.rna <- logNormCounts(sce.rna)

# Feature selection
decomp <- modelGeneVar(sce.rna)
hvgs <- rownames(decomp)[decomp$mean>0.01 & decomp$p.value <= 0.05]
sce.rna <- sce.rna[hvgs,]

# PCA
sce.rna <- runPCA(sce.rna, ncomponents = 25)

# UMAP
set.seed(42)
sce.rna <- runUMAP(sce.rna, dimred="PCA", n_neighbors = 25, min_dist = 0.3)
plotUMAP(sce.rna, colour_by="celltype", point_size=0.5, point_alpha=1)

4.3 Chromatin Accessibility

The chromatin accessibility data consist of 108,344 peaks and 10,032 cells:

dim(experiments(mae)[["atac"]])

## [1] 108344  10032

Let’s make a standard dimensionality reduction plot. Note that scATAC-seq data are sparser than scRNA-seq data, almost binary. The log-normalisation + PCA approach that scater implements for scRNA-seq is not a good strategy for scATAC-seq data, and it is used below only for a quick look. Topic modelling or TF-IDF + SVD are better strategies. Please see the package recommendations below.

sce.atac <- experiments(mae)[["atac"]]

# Normalisation
sce.atac <- logNormCounts(sce.atac)

# Feature selection
decomp <- modelGeneVar(sce.atac)
hvgs <- rownames(decomp)[decomp$mean>0.25]
sce.atac <- sce.atac[hvgs,]

# PCA
sce.atac <- runPCA(sce.atac, ncomponents = 25)

# UMAP
set.seed(42)
sce.atac <- runUMAP(sce.atac, dimred="PCA", n_neighbors = 25, min_dist = 0.3)
plotUMAP(sce.atac, colour_by="celltype", point_size=0.5, point_alpha=1)
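To make the TF-IDF + SVD alternative mentioned above more concrete, here is a minimal sketch on a simulated binary peak-by-cell matrix. It uses one common TF-IDF variant and base R's svd(); the toy data, the choice of k and the exact weighting are assumptions for illustration, and dedicated packages implement their own variants with truncated SVD routines suited to the full data.

```r
library(Matrix)

set.seed(42)

# Toy binary peak-by-cell matrix (hypothetical data, ~10% non-zero entries)
acc <- rsparsematrix(200, 50, density = 0.1,
                     rand.x = function(n) rep(1, n))

# Drop empty peaks and empty cells so that both weights stay finite
acc <- acc[rowSums(acc) > 0, colSums(acc) > 0]

# Term frequency: scale each cell (column) by its total accessibility
tf <- acc %*% Diagonal(x = 1 / colSums(acc))

# Inverse document frequency: down-weight peaks that are open in many cells
idf <- log(1 + ncol(acc) / rowSums(acc))
tfidf <- Diagonal(x = idf) %*% tf

# SVD keeping k right singular vectors; each cell is embedded as a row of v
# scaled by the corresponding singular values
k <- 10
sv <- svd(as.matrix(tfidf), nu = 0, nv = k)
embedding <- sweep(sv$v, 2, sv$d[seq_len(k)], "*")
dim(embedding)                # one row per cell, k columns
```

On real data, an embedding like this would take the place of the PCA coordinates as input to UMAP.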

5 Suggested software for the downstream analysis

These are my personal recommendations of R-based analysis software:

RNA expression: scater, scran
ATAC accessibility: ArchR, SnapATAC, cisTopic, Signac, chromVAR, Cicero
Integrative analysis: MOFA+, Seurat. Note that both methods have released vignettes on their websites in which they analyse this same data set.

6 sessionInfo

sessionInfo()

## R version 4.4.0 beta (2024-04-15 r86425)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] scater_1.32.0               ggplot2_3.5.1              
##  [3] scran_1.32.0                scuttle_1.14.0             
##  [5] HDF5Array_1.32.0            rhdf5_2.48.0               
##  [7] DelayedArray_0.30.0         SparseArray_1.4.0          
##  [9] S4Arrays_1.4.0              abind_1.4-5                
## [11] Matrix_1.7-0                RaggedExperiment_1.28.0    
## [13] SingleCellExperiment_1.26.0 SingleCellMultiModal_1.16.0
## [15] MultiAssayExperiment_1.30.0 SummarizedExperiment_1.34.0
## [17] Biobase_2.64.0              GenomicRanges_1.56.0       
## [19] GenomeInfoDb_1.40.0         IRanges_2.38.0             
## [21] S4Vectors_0.42.0            BiocGenerics_0.50.0        
## [23] MatrixGenerics_1.16.0       matrixStats_1.3.0          
## [25] BiocStyle_2.32.0           
## 
## loaded via a namespace (and not attached):
##   [1] jsonlite_1.8.8            magrittr_2.0.3           
##   [3] ggbeeswarm_0.7.2          magick_2.8.3             
##   [5] farver_2.1.1              rmarkdown_2.26           
##   [7] zlibbioc_1.50.0           vctrs_0.6.5              
##   [9] memoise_2.0.1             DelayedMatrixStats_1.26.0
##  [11] tinytex_0.50              htmltools_0.5.8.1        
##  [13] BiocBaseUtils_1.6.0       AnnotationHub_3.12.0     
##  [15] curl_5.2.1                BiocNeighbors_1.22.0     
##  [17] Rhdf5lib_1.26.0           sass_0.4.9               
##  [19] bslib_0.7.0               plyr_1.8.9               
##  [21] cachem_1.0.8              igraph_2.0.3             
##  [23] mime_0.12                 lifecycle_1.0.4          
##  [25] pkgconfig_2.0.3           rsvd_1.0.5               
##  [27] R6_2.5.1                  fastmap_1.1.1            
##  [29] GenomeInfoDbData_1.2.12   digest_0.6.35            
##  [31] colorspace_2.1-0          AnnotationDbi_1.66.0     
##  [33] dqrng_0.3.2               irlba_2.3.5.1            
##  [35] ExperimentHub_2.12.0      RSQLite_2.3.6            
##  [37] beachmat_2.20.0           filelock_1.0.3           
##  [39] labeling_0.4.3            fansi_1.0.6              
##  [41] httr_1.4.7                compiler_4.4.0           
##  [43] bit64_4.0.5               withr_3.0.0              
##  [45] BiocParallel_1.38.0       viridis_0.6.5            
##  [47] DBI_1.2.2                 UpSetR_1.4.0             
##  [49] highr_0.10                rappdirs_0.3.3           
##  [51] rjson_0.2.21              bluster_1.14.0           
##  [53] tools_4.4.0               vipor_0.4.7              
##  [55] beeswarm_0.4.0            glue_1.7.0               
##  [57] rhdf5filters_1.16.0       grid_4.4.0               
##  [59] cluster_2.1.6             generics_0.1.3           
##  [61] gtable_0.3.5              BiocSingular_1.20.0      
##  [63] ScaledMatrix_1.12.0       metapod_1.12.0           
##  [65] utf8_1.2.4                XVector_0.44.0           
##  [67] RcppAnnoy_0.0.22          ggrepel_0.9.5            
##  [69] BiocVersion_3.19.1        pillar_1.9.0             
##  [71] limma_3.60.0              dplyr_1.1.4              
##  [73] BiocFileCache_2.12.0      lattice_0.22-6           
##  [75] bit_4.0.5                 tidyselect_1.2.1         
##  [77] locfit_1.5-9.9            Biostrings_2.72.0        
##  [79] knitr_1.46                gridExtra_2.3            
##  [81] bookdown_0.39             edgeR_4.2.0              
##  [83] xfun_0.43                 statmod_1.5.0            
##  [85] UCSC.utils_1.0.0          yaml_2.3.8               
##  [87] evaluate_0.23             codetools_0.2-20         
##  [89] tibble_3.2.1              BiocManager_1.30.22      
##  [91] cli_3.6.2                 uwot_0.2.2               
##  [93] munsell_0.5.1             jquerylib_0.1.4          
##  [95] Rcpp_1.0.12               dbplyr_2.5.0             
##  [97] png_0.1-8                 parallel_4.4.0           
##  [99] blob_1.2.4                sparseMatrixStats_1.16.0 
## [101] SpatialExperiment_1.14.0  viridisLite_0.4.2        
## [103] scales_1.3.0              purrr_1.0.2              
## [105] crayon_1.5.2              rlang_1.1.3              
## [107] cowplot_1.1.3             KEGGREST_1.44.0          
## [109] formatR_1.14

PBMCs profiled with the Chromium Single Cell Multiome ATAC + Gene Expression from 10x

2 May 2024