Contents

1 Introduction

Nucleoli serve as major organizing hubs for the three-dimensional structure of mammalian heterochromatin1. Specific loci, termed nucleolar-associated domains (NADs), form frequent three-dimensional associations with nucleoli2. Early mammalian development is a critical period to study NAD biological function, because interactions between pericentromeric chromatin and perinucleolar regions are particularly dynamic during embryonic development4. Here, we developed a Bioconductor package NADfinder for bioinformatic analysis of the NAD-seq data, including normalization, smoothing, peak calling, peak trimming and annotation. We demonstrate the application of NADfinder in mapping the NADs in the mouse genome, determining how these associations are altered during embryonic stem cell (ESC) differentiation, and develop tools for study of these higher-order chromosome interactions in fixed and live single cells (Fig. 1).

Fig. 1 Workflow for NAD peak identification. Read counts are summarized for each 50kb moving window with step size of 1kb for nucleolar and genomic samples. Log2 ratio between nucleolar and genomic samples was computed for each window, followed by local background correction, smoothing, peak calling, filtering and annotation.

Fig. 1 Workflow for NAD peak identification. Read counts are summarized for each 50kb moving window with step size of 1kb for nucleolar and genomic samples. Log2 ratio between nucleolar and genomic samples was computed for each window, followed by local background correction, smoothing, peak calling, filtering and annotation.

2 Single pair of nucleoleus associated DNA and whole genomic DNA sequencing

samples
Here is an example to use NADfinder for peak calling.

2.1 Coverage calculation

Here is the code snippet for calculating coverage with a sliding window in a given step along the genome using a pair of bam files from genomic sample as background and purified nucleoleus-associated DNA as target signal.

## load the library
library(NADfinder)
library(SummarizedExperiment)
## test bam files in the package
path <- system.file("extdata", package = "NADfinder", lib.loc = NULL,
            mustWork = TRUE)
bamFiles <- dir(path, ".bam$")

## window size for tile of genome. Here we set it to a big window size,
## ie., 50k because of the following two reasons:
## 1. peaks of NADs are wide;
## 2. data will be smoothed better with bigger window size.
ws <- 50000
## step for tile of genome
step <- 5000
## Set the background. 
## 0.25 means 25% of the lower ratios will be used for background training.
backgroundPercentage <- 0.25
## Count the reads for each window with a given step in the genome.
## The output will be a GRanges object.
library(BSgenome.Mmusculus.UCSC.mm10)
se <- tileCount(reads=file.path(path, bamFiles), 
                genome=Mmusculus, 
                windowSize=ws, 
                step=step,
                mode = IntersectionNotStrict,
                dataOverSamples = FALSE)

Here we load the pre-computed coverage data single.count to save build time.

data(single.count)
se <- single.count

For quality asessment, cumulativePercentage can be used to plot the cumulative sums of sorted read counts for nucleolus-associated DNA and whole genomic DNA. We expect the cumulative sum in the genomic DNA to be close to a straight line because the coverage for the genomic DNA sample should be uniformly distributed. Unlike ChIP-seq data, the cumulative sum for the nucleoleus sample will not exhibit sharp changes because most of NADs are broad regions as wide as 100 kb. However, we should observe clear differences between the two curves.

## Calculate ratios for peak calling. We use nucleoleus vs genomic DNA.
dat <- log2se(se, nucleolusCols = "N18.subsampled.srt.bam", 
              genomeCols ="G18.subsampled.srt.bam",
              transformation = "log2CPMRatio")
## Smooth the ratios for each chromosome.
## We found that for each chromosome, the ratios are higher in 5'end than 3'end.
datList <- smoothRatiosByChromosome(dat, chr = c("chr18", "chr19"), N = 100)
## check the difference between the cumulative percentage tag allocation 
## in genome and nucleoleus samples.
cumulativePercentage(datList[["chr18"]])

2.2 Call peaks

Before peak calling, the function smoothRatiosByChromosome is used for log ratios calculation, normalization and smoothing.

The peaks will be called if the ratios are significantly higher than chromosome-specific background determined by trimPeaks. The following figure shows the peaks (black rectangles) called using normalized (green curve) and smoothed (red curve) log2 ratios.

## check the results of smooth function
chr18 <- datList[["chr18"]] ## we only have reads in chr18 in test data.
chr18subset <- subset(chr18, seq.int(floor(length(chr18)/10))*10)
chr18 <- assays(chr18subset)
ylim <- range(c(chr18$ratio[, 1], 
                chr18$bcRatio[, 1], 
                chr18$smoothedRatio[, 1]))
plot(chr18$ratio[, 1], 
     ylim=ylim*c(.9, 1.1), 
     type="l", main="chr18")
abline(h=0, col="cyan", lty=2)
points(chr18$bcRatio[, 1], type="l", col="green")
points(chr18$smoothedRatio[, 1], type="l", col="red")
legend("topright", 
       legend = c("raw_ratio", "background_corrected", "smoothed"), 
       fill = c("black", "green", "red"))
## call peaks for each chromosome
peaks <- lapply(datList, trimPeaks, 
                backgroundPercentage=backgroundPercentage, 
                cutoffAdjPvalue=0.05, countFilter=1000)
## plot the peaks in "chr18"
peaks.subset <- countOverlaps(rowRanges(chr18subset), peaks$chr18)>=1
peaks.run <- rle(peaks.subset)
run <- cumsum(peaks.run$lengths)
run <- data.frame(x0=c(1, run[-length(run)]), x1=run)
#run <- run[peaks.run$values, , drop=FALSE]
rect(xleft = run$x0, ybottom = ylim[2]*.75, 
     xright = run$x1, ytop = ylim[2]*.8,
     col = "black")

## convert list to a GRanges object
peaks.gr <- unlist(GRangesList(peaks))

The following shows how to save or export the called peaks for downstream analysis.

## output the peaks
write.csv(as.data.frame(unname(peaks.gr)), "peaklist.csv", row.names=FALSE)
## export peaks to a bed file.
library(rtracklayer)
#export(peaks.gr, "peaklist.bed", "BED")

3 Samples with more than one replicate

Data analysis with multiple biological replicates follows the same steps as that of a single paired samples, i.e., coverage calculation, normalization and smoothing, and peak calling. The only difference is that limma is used to determine the statistical significance in peak calling.

library(NADfinder)
## bam file path
path <- system.file("extdata", package = "NADfinder", lib.loc = NULL,
            mustWork = TRUE)
bamFiles <- dir(path, ".bam$")

f <- bamFiles
ws <- 50000
step <- 5000
library(BSgenome.Mmusculus.UCSC.mm10)
se <- tileCount(reads=file.path(path, f), 
                genome=Mmusculus, 
                windowSize=ws, 
                step=step)

Load saved coverage.

data(triplicate.count)
se <- triplicate.count

## Calculate ratios for nucleoloar vs genomic samples.
se <- log2se(se, 
             nucleolusCols = c("N18.subsampled.srt-2.bam", 
                                "N18.subsampled.srt-3.bam",
                                "N18.subsampled.srt.bam" ),
             genomeCols = c("G18.subsampled.srt-2.bam",
                            "G18.subsampled.srt-3.bam",
                            "G18.subsampled.srt.bam"), 
             transformation = "log2CPMRatio")
seList<- smoothRatiosByChromosome(se, chr="chr18")
cumulativePercentage(seList[["chr18"]])

peaks <- lapply(seList, callPeaks, 
                cutoffAdjPvalue=0.05, countFilter=1)
peaks <- unlist(GRangesList(peaks))
peaks
## GRanges object with 557 ranges and 6 metadata columns:
##             seqnames            ranges strand | N18.subsampled.srt.2.bam
##                <Rle>         <IRanges>  <Rle> |                <numeric>
##     chr18.1    chr18   2956001-3138000      * |         1.97568648812916
##     chr18.2    chr18   3141001-3249000      * |          2.1544379015058
##     chr18.3    chr18   3302001-3381000      * |         1.82705096318418
##     chr18.4    chr18   3413001-3455000      * |         1.91336569386678
##     chr18.5    chr18   3480001-3526000      * |         1.84524499394483
##         ...      ...               ...    ... .                      ...
##   chr18.557    chr18 90001001-90086000      * |         2.01710665097692
##   chr18.558    chr18 90087001-90182000      * |         2.05569754137179
##   chr18.559    chr18 90291001-90421000      * |          1.2231252965598
##   chr18.560    chr18 90465001-90554000      * |        0.849524365257808
##   chr18.561    chr18 90645001-90649000      * |         1.93710202665383
##             N18.subsampled.srt.3.bam N18.subsampled.srt.bam
##                            <numeric>              <numeric>
##     chr18.1         1.97568648812916       1.97568648812916
##     chr18.2          2.1544379015058        2.1544379015058
##     chr18.3         1.82705096318418       1.82705096318418
##     chr18.4         1.91336569386678       1.91336569386678
##     chr18.5         1.84524499394483       1.84524499394483
##         ...                      ...                    ...
##   chr18.557         2.01710665097692       2.01710665097692
##   chr18.558         2.05569754137179       2.05569754137179
##   chr18.559          1.2231252965598        1.2231252965598
##   chr18.560        0.849524365257808      0.849524365257808
##   chr18.561         1.93710202665383       1.93710202665383
##                        AveSig              P.value            adj.P.Val
##                     <numeric>            <numeric>            <numeric>
##     chr18.1  1.97568648812916 6.07774744657402e-45 1.04607251207039e-43
##     chr18.2   2.1544379015058 6.19622758967668e-45 1.04607251207039e-43
##     chr18.3  1.82705096318418 7.11951046762163e-45 1.04607251207039e-43
##     chr18.4  1.91336569386678 7.10161737193469e-45 1.04607251207039e-43
##     chr18.5  1.84524499394483 7.40228192152164e-45 1.04607251207039e-43
##         ...               ...                  ...                  ...
##   chr18.557  2.01710665097692   7.000815500399e-45 1.04607251207039e-43
##   chr18.558  2.05569754137179 6.51280307843472e-45 1.04607251207039e-43
##   chr18.559   1.2231252965598 8.00000092665794e-45 1.04607251207039e-43
##   chr18.560 0.849524365257808 1.65778201195244e-44 1.89984418683201e-43
##   chr18.561  1.93710202665383 2.38741283949983e-43 1.77613314108961e-42
##   -------
##   seqinfo: 66 sequences (1 circular) from mm10 genome

The peaks can be visualized along the ideogram using plotSig.

ideo <- readRDS(system.file("extdata", "ideo.mm10.rds", 
                            package = "NADfinder"))
plotSig(ideo=ideo, grList=GRangesList(peaks), mcolName="AveSig", 
        layout=list("chr18"), 
        parameterList=list(types="heatmap"))

4 Session Info

sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.10-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.10-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] BSgenome.Mmusculus.UCSC.mm10_1.4.0 BSgenome_1.54.0                   
##  [3] rtracklayer_1.46.0                 Biostrings_2.54.0                 
##  [5] XVector_0.26.0                     NADfinder_1.10.0                  
##  [7] SummarizedExperiment_1.16.0        DelayedArray_0.12.0               
##  [9] BiocParallel_1.20.0                matrixStats_0.55.0                
## [11] Biobase_2.46.0                     GenomicRanges_1.38.0              
## [13] GenomeInfoDb_1.22.0                IRanges_2.20.0                    
## [15] S4Vectors_0.24.0                   BiocGenerics_0.32.0               
## [17] BiocStyle_2.14.0                  
## 
## loaded via a namespace (and not attached):
##   [1] backports_1.1.5               Hmisc_4.2-0                  
##   [3] AnnotationHub_2.18.0          corrplot_0.84                
##   [5] BiocFileCache_1.10.0          lazyeval_0.2.2               
##   [7] splines_3.6.1                 ggplot2_3.2.1                
##   [9] digest_0.6.22                 ensembldb_2.10.0             
##  [11] htmltools_0.4.0               GO.db_3.10.0                 
##  [13] checkmate_1.9.4               csaw_1.20.0                  
##  [15] magrittr_1.5                  memoise_1.1.0                
##  [17] baseline_1.2-1                grImport2_0.1-5              
##  [19] cluster_2.1.0                 InteractionSet_1.14.0        
##  [21] limma_3.42.0                  askpass_1.1                  
##  [23] prettyunits_1.0.2             jpeg_0.1-8.1                 
##  [25] colorspace_1.4-1              signal_0.7-6                 
##  [27] blob_1.2.0                    rappdirs_0.3.1               
##  [29] xfun_0.10                     dplyr_0.8.3                  
##  [31] crayon_1.3.4                  RCurl_1.95-4.12              
##  [33] graph_1.64.0                  zeallot_0.1.0                
##  [35] VariantAnnotation_1.32.0      survival_2.44-1.1            
##  [37] glue_1.3.1                    EmpiricalBrownsMethod_1.14.0 
##  [39] gtable_0.3.0                  zlibbioc_1.32.0              
##  [41] seqinr_3.6-1                  Rgraphviz_2.30.0             
##  [43] SparseM_1.77                  scales_1.0.0                 
##  [45] futile.options_1.0.1          DBI_1.0.0                    
##  [47] edgeR_3.28.0                  bibtex_0.4.2                 
##  [49] Rcpp_1.0.2                    plotrix_3.7-6                
##  [51] metap_1.1                     htmlTable_1.13.2             
##  [53] xtable_1.8-4                  progress_1.2.2               
##  [55] foreign_0.8-72                bit_1.1-14                   
##  [57] Formula_1.2-3                 htmlwidgets_1.5.1            
##  [59] httr_1.4.1                    RColorBrewer_1.1-2           
##  [61] acepack_1.4.1                 pkgconfig_2.0.3              
##  [63] XML_3.98-1.20                 nnet_7.3-12                  
##  [65] Gviz_1.30.0                   dbplyr_1.4.2                 
##  [67] locfit_1.5-9.1                tidyselect_0.2.5             
##  [69] rlang_0.4.1                   later_1.0.0                  
##  [71] polynom_1.4-0                 AnnotationDbi_1.48.0         
##  [73] munsell_0.5.0                 BiocVersion_3.10.1           
##  [75] tools_3.6.1                   RSQLite_2.1.2                
##  [77] ade4_1.7-13                   evaluate_0.14                
##  [79] stringr_1.4.0                 fastmap_1.0.1                
##  [81] rGADEM_2.34.0                 yaml_2.2.0                   
##  [83] knitr_1.25                    bit64_0.9-7                  
##  [85] purrr_0.3.3                   randomForest_4.6-14          
##  [87] AnnotationFilter_1.10.0       RBGL_1.62.0                  
##  [89] mime_0.7                      formatR_1.7                  
##  [91] biomaRt_2.42.0                rstudioapi_0.10              
##  [93] compiler_3.6.1                curl_4.2                     
##  [95] png_0.1-7                     interactiveDisplayBase_1.24.0
##  [97] MotIV_1.42.0                  preseqR_4.0.0                
##  [99] ATACseqQC_1.10.0              tibble_2.1.3                 
## [101] stringi_1.4.3                 idr_1.2                      
## [103] futile.logger_1.4.3           GenomicFeatures_1.38.0       
## [105] ChIPpeakAnno_3.20.0           lattice_0.20-38              
## [107] ProtGenerics_1.18.0           Matrix_1.2-17                
## [109] trackViewer_1.22.0            multtest_2.42.0              
## [111] vctrs_0.2.0                   pillar_1.4.2                 
## [113] BiocManager_1.30.9            Rdpack_0.11-0                
## [115] grImport_0.9-2                data.table_1.12.6            
## [117] bitops_1.0-6                  gbRd_0.4-11                  
## [119] httpuv_1.5.2                  R6_2.4.0                     
## [121] latticeExtra_0.6-28           bookdown_0.14                
## [123] promises_1.1.0                gridExtra_2.3                
## [125] KernSmooth_2.23-16            motifStack_1.30.0            
## [127] dichromat_2.0-0               lambda.r_1.2.4               
## [129] MASS_7.3-51.4                 assertthat_0.2.1             
## [131] seqLogo_1.52.0                openssl_1.4.1                
## [133] GenomicScores_1.10.0          regioneR_1.18.0              
## [135] GenomicAlignments_1.22.0      Rsamtools_2.2.0              
## [137] GenomeInfoDbData_1.2.2        hms_0.5.1                    
## [139] rpart_4.1-15                  VennDiagram_1.6.20           
## [141] grid_3.6.1                    rmarkdown_1.16               
## [143] biovizBase_1.34.0             shiny_1.4.0                  
## [145] base64enc_0.1-3

References

1. Politz, J. C. R., Scalzo, D. & Groudine, M. The redundancy of the mammalian heterochromatic compartment. Current opinion in genetics & development 37, 1–8 (2016).

2. Németh, A. et al. Initial genomics of the human nucleolus. PLoS Genet 6, e1000889 (2010).

3. Koningsbruggen, S. van et al. High-resolution whole-genome sequencing reveals that specific chromatin domains from most human chromosomes associate with nucleoli. Molecular biology of the cell 21, 3735–3748 (2010).

4. Aguirre-Lavin, T. et al. 3D-fish analysis of embryonic nuclei in mouse highlights several abrupt changes of nuclear organization during preimplantation development. BMC developmental biology 12, 1 (2012).

5. Popken, J. et al. Reprogramming of fibroblast nuclei in cloned bovine embryos involves major structural remodeling with both striking similarities and differences to nuclear phenotypes of in vitro fertilized embryos. Nucleus 5, 555–589 (2014).