SCATE 1.10.0
Single-cell sequencing assay for transposase-accessible chromatin (scATAC-seq) is a new technology for measuring genome-wide regulatory element activities in single cells. With the ability to analyze cells’ distinct behaviors in a heterogeneous cell population, this technology is rapidly transforming biomedical research. Data produced by scATAC-seq are highly sparse and discrete. Existing computational methods typically use these data to analyze regulatory pathway activities in single cells. They cannot accurately measure activities of individual cis-regulatory elements (CREs) due to data sparsity. SCATE is a new statistical framework for analyzing scATAC-seq data. SCATE adaptively integrates information from co-activated CREs, similar cells, and publicly available regulome data to substantially increase the accuracy for estimating activities of individual CREs. We show that one can use SCATE to identify cell subpopulations and then accurately reconstruct CRE activities of each subpopulation. The reconstructed signals are accurate even for cell subpopulations consisting of only a few cells, and they significantly improve prediction of transcription factor binding sites. The accurate CRE-level signal reconstruction makes SCATE an unique tool for analyzing regulatory landscape of a heterogeneous cell population using scATAC-seq data.
The main functions of SCATE is demonstrated using the following example of 10 GM12878 and 10 K562 scATAC-seq samples.
For more details, see our paper describing the SCATE
package:
The first input of SCATE is a list of aligned bam files for scATAC-seq sample. Each single cell should have one separate bam file. One needs to compile a list that includes the location to each bam file. The easist way is to put all bam files in one folder and use function to get all locations. Below shows an example where stores the locations to bam files. Alternately users can also use a list of GRanges object as input to SCATE. See section below.
# load in SCATE
options(warn=-1)
suppressMessages(library(SCATE))
set.seed(12345)
# set up locations to bam files
bamlist <- list.files(paste0(system.file(package="SCATEData"),"/extdata/"),full.names = TRUE,pattern='.bam$')
bamlist <- bamlist[grepl('.bam$', bamlist)]
head(bamlist)
## [1] "/home/biocbuild/bbs-3.17-bioc/R/site-library/SCATEData/extdata//GSM1596831.bam"
## [2] "/home/biocbuild/bbs-3.17-bioc/R/site-library/SCATEData/extdata//GSM1596840.bam"
## [3] "/home/biocbuild/bbs-3.17-bioc/R/site-library/SCATEData/extdata//GSM1596874.bam"
## [4] "/home/biocbuild/bbs-3.17-bioc/R/site-library/SCATEData/extdata//GSM1596881.bam"
## [5] "/home/biocbuild/bbs-3.17-bioc/R/site-library/SCATEData/extdata//GSM1596940.bam"
## [6] "/home/biocbuild/bbs-3.17-bioc/R/site-library/SCATEData/extdata//GSM1596942.bam"
SCATE comes with a built in function to perform cell clustering. Users can also prepare their own clustering results. See Section Cell Clustering.
Users can also rebuild the SCATE database with their own bulk DNase-seq or list of CRE of interest. See Section Build users own database.
SCATE also supports peak calls instead of bam files as input data format. In this case, the input should be a data frame or list of data frames of scATAC-seq peaks. For each data frame, first column is chromsome name, second column is start site, third column is end site, and fourth column is the number of reads of the peak. Currently cellcluster
and SCATE
supports peak calls as input. Please refer to the manual page of cellcluster
and SCATE
functions for details.
The function satacprocess
reads bam files into R as a list of GRanges object. It transforms the reads into the midpoint of the reads (e.g. read chr1:100-150 will be transformed into chr1:125-125). scATAC-seq samples with library size smaller than libsizefilter
(here is 1000) will be discarded.
Here we run satacprocess
function using the bamlist
we prepared in the previous section. The returned satac
object is a list of Granges with length 18 (2 of the 20 cells are filtered out) and will be used in following analysis.
satac <- satacprocess(input=bamlist,type='bam',libsizefilter=1000)
# Number of elements in satac
length(satac)
## [1] 18
# Content in the first element as an example
satac[[1]]
## GRanges object with 10847 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 569924 +
## [2] chr1 569926 +
## [3] chr1 569926 +
## [4] chr1 569904 -
## [5] chr1 569910 -
## ... ... ... ...
## [10843] chrX 153763244 -
## [10844] chrX 153775451 +
## [10845] chrX 153977428 +
## [10846] chrX 154285712 +
## [10847] chrX 154842414 +
## -------
## seqinfo: 24 sequences from an unspecified genome
Users can also use a list of GRanges object as input to SCATE by specifying type='gr'
.
For the purpose of demonstration, we shorten the names of satac.
names(satac) <- sub('.*/','',names(satac))
SCATE has a built-in function cellcluster
that clusters cells based on averaged signal of CRE clusters. The below example runs the cellcluster
function using the satac
object we generated in previous sections.
clusterres <- cellcluster(satac,genome="hg19",clunum=2,perplexity=5)
Here genome
should be either “hg19” or “mm10”. It indicates the genome to which the scATAC-seq data is aligned. gives the number of clusters. If clunum=NULL
, the cluster number will be determined automatically by the function. is a number specifying perplexity for tSNE. The default is 30. It should be reduced when sample size is small. For example here it is reduced to 5.
The output of the function cluster
has three components: tSNE results, clustering results and aggregated signal for CRE cluster:
# tSNE results
tsne <- clusterres[[1]]
head(tsne)
## [,1] [,2]
## GSM1596840.bam 2.511994 -7.422190
## GSM1596874.bam -25.222696 -23.359763
## GSM1596881.bam -20.867878 7.335890
## GSM1596942.bam -6.684146 -25.335561
## GSM1596944.bam -12.007923 -3.640182
## GSM1596961.bam -8.268657 -11.268950
# clustering results
cluster <- clusterres[[2]]
cluster
## GSM1596840.bam GSM1596874.bam GSM1596881.bam GSM1596942.bam GSM1596944.bam
## 1 1 1 1 1
## GSM1596961.bam GSM1597041.bam GSM1597096.bam SRR1779746.bam SRR1779805.bam
## 1 1 1 2 2
## SRR1779856.bam SRR1779874.bam SRR1779956.bam SRR1779959.bam SRR1779973.bam
## 1 2 2 2 2
## SRR1780018.bam SRR1780020.bam SRR1780054.bam
## 2 1 2
# cell 'GSM1596840.bam' belongs to cluster 1, and cell 'SRR1779746.bam' belongs to cluster 2.
# aggregated signal for CRE cluster
aggsig <- clusterres[[3]]
aggsig[seq(1,3), seq(1,3)]
## GSM1596840.bam GSM1596874.bam GSM1596881.bam
## 7 0.0000000 0.0000000 0.3607243
## 9 0.0000000 0.0000000 0.0000000
## 10 0.6758744 0.8080204 0.4568552
We can use the following code to draw the tSNE plot and mark the clusters.
library(ggplot2)
plotdata <- data.frame(tSNE1=tsne[,1],tSNE2=tsne[,2],Cluster=as.factor(cluster))
ggplot(plotdata,aes(x=tSNE1,y=tSNE2,col=Cluster)) + geom_point()
The function SCATE
will reconstruct signals for individual CREs. For example, below code performs SCATE using the scATAC-seq data and clustering results from the previous clustering step. For demonstration purpose, here number of CRE clusters is set to be 156 (clunum=156) to allow faster speed. In real application, clunum is recommended to set to NULL so SCATE will automatically choose the optimal number of clusters.
res <- SCATE(satac,genome="hg19",cluster=cluster,clusterid=NULL,clunum=156,ncores=1,verbose=TRUE)
## Preparing data
## Fitting model
## Preparing data
## Fitting model
## Generating results
# check the 10000-10005th row of the matrix
res[seq(10000,10005),]
## 1 2
## chr1_2009200_2009399 1.8206879 1.799521
## chr1_2009400_2009599 1.8471040 1.741879
## chr1_2009600_2009799 2.0180426 1.999110
## chr1_2009800_2009999 2.1168597 2.085053
## chr1_2010000_2010199 1.9667737 1.923041
## chr1_2010200_2010399 0.9414907 1.063028
If users only want to perform SCATE in a subset of clusters, they can set clusterid
to be a subset of clusters. For example setting clusterid=c(1,2)
will let SCATE to only run in cluster 1 and 2. ncores
sets the number of computing cores to run SCATE in parallel. A larger number will result in less computational time, but will use more computational resources. If SCATE is run on a personal computer, ncores should be set to a small number such as 3. If ncores=NULL
, all available cores will be used for computing. Note that currently ncores are forced to be 1 on Windows computers. If verbose is TRUE, the current progress will be displayed. If verbose is FALSE the information will be suppressed.
The output is matrix of reconstructed signals. The number of rows of the matrix is the same as the number of bins in the genome. The number of columns is the same as the number of clusters in which SCATE is performed. The column names indicate the cluster id.
Users can also supply their own cell cluster. An example can be seen below. usercellcluster
specifies the cluster by users. Note that usercellcluster
needs to be named to correspond with the list of GRanges.
# use similar ways to construct the cluster
usercellcluster <- rep(1:2,each=9)
names(usercellcluster) <- names(satac)
# check the contents of the cluster
usercellcluster
## GSM1596840.bam GSM1596874.bam GSM1596881.bam GSM1596942.bam GSM1596944.bam
## 1 1 1 1 1
## GSM1596961.bam GSM1597041.bam GSM1597096.bam SRR1779746.bam SRR1779805.bam
## 1 1 1 1 2
## SRR1779856.bam SRR1779874.bam SRR1779956.bam SRR1779959.bam SRR1779973.bam
## 2 2 2 2 2
## SRR1780018.bam SRR1780020.bam SRR1780054.bam
## 2 2 2
userclusterres <- SCATE(satac,genome="hg19",cluster=usercellcluster,clunum=156,ncores=1,verbose=TRUE)
## Preparing data
## Fitting model
## Preparing data
## Fitting model
## Generating results
Function extractfeature
can be used to extract signals for a subset of genomic region of interest. For example if we want to extract signal for all bins that are within chromosome 5 50000-50300 and 50700-51000, we can first prepare the following region data:
region <- data.frame(chr=c('chr5','chr5'),start=c(50000,50700),end=c(50300,51000))
region
## chr start end
## 1 chr5 50000 50300
## 2 chr5 50700 51000
Region is a data frame with three columns. First column is chromosome name, second column is starting position, and third column is the ending position.
Function extractfeature
is then called in this way. mode
can be either ‘overlap’ to include all bins that overlap with the given region
, or be ‘nearest’ to include the nearest bin for each region.
extractres <- extractfeature(res,region,mode='overlap')
extractres
## 1 2
## chr5_50000_50199 1.422589 1.416037
## chr5_50200_50399 1.611571 1.537897
## chr5_50600_50799 1.730582 1.666568
## chr5_50800_50999 1.543166 1.632689
## chr5_51000_51199 1.437658 1.526997
extractfeature
also allows the extracted features to be saved to BED files that can be uploaded to genome browser (e.g. UCSC genome browser):
extractres <- extractfeature(res,region,mode='overlap',folder='destination folder')
In this example, a set of BED files will be saved to destination folder. Each BED file corresponds to a cluster and has the signal and genomic location for each bin. The name of the BED file corresponds to the name of the cluster.
The function peakcall
will perform peak calling on the SCATE results. For example, we can call peaks for the first cluster:
peakres <- peakcall(res)
# check the result for the first cluster
head(peakres[[1]])
## chr start end FDR Signal
## 13676 chr20 26188400 26190799 0 8.193352
## 9887 chr12 53772600 53774999 0 7.185601
## 851 chr1 110880200 110882999 0 7.152953
## 14622 chr19 39339400 39343599 0 7.100896
## 14699 chr19 42771800 42773799 0 7.078099
## 14126 chr19 6424400 6425999 0 7.067159
For each cluster, the output is a data frame with five column. The first three columns tell the location of the peak. They are chromosome name, starting location and ending location respectively. The fourth column is the FDR of the peak, and the fifth column is the signal of the peak. Peaks are already ordered by FDR and then by signal.
Use the following code to save the peak results for the first cluster as BED file.
write.table(peakres[[1]],file='your file.bed',sep='\t',quote=FALSE,col.names = FALSE,row.names = FALSE)
The whole process of reading in bam, clustering cell, and performing SCATE is included in the SCATEpipeline
function using following wrapper function of SCATE pipeline:
piperes <- SCATEpipeline(bamlist,genome="hg19",cellclunum=2,CREclunum=156,perplexity=5,ncores=1)
## Preparing data
## Fitting model
## Preparing data
## Fitting model
## Generating results
# get the cell cluster results, same as calling 'cellcluster' function.
cluster <- piperes[['cellcluster']]
# get the SCATE outputs, same as calling 'SCATE' function.
SCATEres <- piperes[['SCATE']]
# get the peak calling results, same as calling 'peakcall' function.
peakres <- piperes[['peak']]
As before, bamlist
stores the locations to bam files and genome
is ‘hg19’ or ‘mm10’. The output is a list with three components. The first component is the cell clustering results, same as the one returned by cellcluster
. The second component is the SCATE results, same as the one returned by SCATE
. The third component is the peak calling results, same as the one returned by peakcall
.
One can extract the region using the extractfeature
.
extractres <- extractfeature(piperes[['SCATE']],region,mode='overlap',folder='destination folder')
Users can add new bulk samples or CRE to human and mouse databases. This can be done with the following command. Note: the function may take around 1-2 days to finish.
makedatabase(datapath,savepath,bamfile=bamfile,cre=cre,genome='hg19')
Here, datapath
is the path to the data package folder (e.g. myfolder/hg19/). User must first download the data package to use this function. The data package for hg19 and mm10 can be downloaded from http://jilab.biostat.jhsph.edu/projects/scate/hg19.zip or http://jilab.biostat.jhsph.edu/projects/scate/mm10.zip. The compressed file should be unzipped.
savepath
is the path to save the generated database. e.g. myfolder/database.rds.
bamfile
is the location of bulk DNase-seq bamfiles. The format is the same as bamlist
cre
is a dataframe of new CRE sites to be added to the database. First column: chromosome name. Second column: start position. Third column: end position. The format is the same as region
argument in function extractfeature
.
genome
should be either ‘hg19’ or ‘mm10’. Default is ‘hg19’.
The function will generate a database file in ‘savepath’, which can then be passed to SCATE main function. See the manual page of SCATE function for how to include your own database.
Users can also build database from scratch using bulk samples or pooled single-cell ATAC-seq samples (pseudobulks) provided by users. This can be done with the following command:
makedatabase(datapath=NULL,savepath,bamfile=bamfile,cre=cre,genomerange=genomerange)
savepath
is the path to save the generated database. e.g. myfolder/database.rds.
bamfile
is the location of bulk DNase-seq bamfiles or pooled single-cell ATAC-seq bam files. The format is the same as bamlist
cre
is a dataframe of new CRE sites to be added to the database. First column: chromosome name. Second column: start position. Third column: end position. The format is the same as region
argument in function extractfeature
.
genomerange
is a data frame with two columns. First column is the chromosome and second column is the length of the genome. Example is https://genome.ucsc.edu/goldenpath/help/hg19.chrom.sizes
The function will generate a database file in ‘savepath’, which can then be passed to SCATE main function. See the manual page of SCATE function for how to include your own database. The function will take around 1-2 days to finish.
Once the databases are built, users can run the whole pipeline with the new database:
piperes <- SCATEpipeline(bamlist,datapath='path to new database')
The following code gives example of running each individual step with the new database. Note that the new database only needs to be specified in and .
clusterres <- cellcluster(satac,datapath='path to new database',clunum=2,perplexity=5)
cluster <- clusterres[[2]]
res <- SCATE(satac,datapath='path to new database',cluster=cluster,clusterid=NULL)
sessionInfo()
## R version 4.3.0 RC (2023-04-13 r84269)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 splines parallel stats graphics grDevices utils
## [8] datasets methods base
##
## other attached packages:
## [1] ggplot2_3.4.2 SCATE_1.10.0
## [3] mclust_6.0.0 Rtsne_0.16
## [5] SCATEData_1.9.0 GenomicAlignments_1.36.0
## [7] Rsamtools_2.16.0 Biostrings_2.68.0
## [9] XVector_0.40.0 SummarizedExperiment_1.30.0
## [11] Biobase_2.60.0 MatrixGenerics_1.12.0
## [13] matrixStats_0.63.0 GenomicRanges_1.52.0
## [15] GenomeInfoDb_1.36.0 IRanges_2.34.0
## [17] S4Vectors_0.38.0 ExperimentHub_2.8.0
## [19] AnnotationHub_3.8.0 BiocFileCache_2.8.0
## [21] dbplyr_2.3.2 BiocGenerics_0.46.0
## [23] xgboost_1.7.5.1 splines2_0.4.8
## [25] preprocessCore_1.62.0 BiocStyle_2.28.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.0 farver_2.1.1
## [3] dplyr_1.1.2 blob_1.2.4
## [5] filelock_1.0.2 bitops_1.0-7
## [7] fastmap_1.1.1 RCurl_1.98-1.12
## [9] promises_1.2.0.1 digest_0.6.31
## [11] mime_0.12 lifecycle_1.0.3
## [13] ellipsis_0.3.2 KEGGREST_1.40.0
## [15] interactiveDisplayBase_1.38.0 RSQLite_2.3.1
## [17] magrittr_2.0.3 compiler_4.3.0
## [19] rlang_1.1.0 sass_0.4.5
## [21] tools_4.3.0 utf8_1.2.3
## [23] yaml_2.3.7 data.table_1.14.8
## [25] knitr_1.42 labeling_0.4.2
## [27] bit_4.0.5 curl_5.0.0
## [29] DelayedArray_0.26.0 BiocParallel_1.34.0
## [31] withr_2.5.0 purrr_1.0.1
## [33] grid_4.3.0 fansi_1.0.4
## [35] colorspace_2.1-0 xtable_1.8-4
## [37] scales_1.2.1 cli_3.6.1
## [39] rmarkdown_2.21 crayon_1.5.2
## [41] generics_0.1.3 httr_1.4.5
## [43] DBI_1.1.3 cachem_1.0.7
## [45] zlibbioc_1.46.0 AnnotationDbi_1.62.0
## [47] BiocManager_1.30.20 vctrs_0.6.2
## [49] Matrix_1.5-4 jsonlite_1.8.4
## [51] bookdown_0.33 bit64_4.0.5
## [53] magick_2.7.4 jquerylib_0.1.4
## [55] glue_1.6.2 codetools_0.2-19
## [57] gtable_0.3.3 BiocVersion_3.17.1
## [59] later_1.3.0 munsell_0.5.0
## [61] tibble_3.2.1 pillar_1.9.0
## [63] rappdirs_0.3.3 htmltools_0.5.5
## [65] GenomeInfoDbData_1.2.10 R6_2.5.1
## [67] evaluate_0.20 shiny_1.7.4
## [69] lattice_0.21-8 highr_0.10
## [71] png_0.1-8 memoise_2.0.1
## [73] httpuv_1.6.9 bslib_0.4.2
## [75] Rcpp_1.0.10 xfun_0.39
## [77] pkgconfig_2.0.3
If the SCATEData
package is useful in your work, please cite the following paper:
- Zhicheng Ji, Weiqiang Zhou, Wenpin Hou, Hongkai Ji, Single-cell ATAC-seq Signal Extraction and Enhancement with SCATE, Genome Biol 21, 161 (2020).