A comprehensive guide to using the bnbc package for normalizing Hi-C replicates.
bnbc 1.0.0
The bnbc package provides functionality to perform normalization and batch correction across samples on data obtained from Hi-C (Lieberman-Aiden et al. 2009) experiments.
In this package we implement tools for general subsetting of and data extraction from sets of Hi-C contact matrices, as well as smoothing of contact matrices, cross-sample normalization and cross-sample batch effect correction methods.
bnbc expects as input
GRanges
object representing the genome assayed, with individual ranges having the width that is equal to the bin size used to partition the genome.list
of (square) contact matrices.DataFrame
object containing sample-level covariates (i.e. gender, date of processing, etc).Please cite our bioRxiv paper.
It is well appreciated that Hi-C contact matrices exhibit an exponential decay in observed number of contacts as a function of the distance between the pair of interacting loci. In this work we operate, as has recently been done (i.e. (Yang et al. 2017)), on the set of all loci interacting at a specific distance, one chromosome at a time. For a given distance \(k\), the relevant set of loci are listed in each contact matrix as the entries comprising the \(k\)-th matrix diagonal (with the main diagonal being referred to as the first diagonal). We refer to these diagonals as matrix “bands”.
This document has the following dependencies
library(bnbc)
bnbc uses the ContactGroup
class to represent the set of contact matrices for a given set of genomic locis interactions. The class has 3 slots:
rowData
: a GRanges
object that has 1 element for each bin of the partitioned genome.colData
: a DataFrame
object that contains information on each sample (i.e. gender).contacts
: a list
of contact matrices.We expect each ContactGroup
to represent data from 1 chromosome. We are thinking about a whole-genome container. An example dataset for chromosome 22 is supplied with the package.
data(cgEx)
cgEx
## Object of class `ContactGroup` representing contact matrices with
## 1283 bins
## 40 kb average width per bin
## 14 samples
Creating a ContactGroup
object requires specifying the 3 slots above:
cgEx <- ContactGroup(rowData=rowData(cgEx),
contacts=contacts(cgEx),
colData=colData(cgEx))
Note that in this example, we used the accessor methods for each of these slots; there are also corresponding ‘setter’ methods, such as rowData(cgEx)<-
.
Printing a ContactGroup
object gives the number of bins represented by the rowData
slot, the width of the bin in terms of genomic distances (i.e. 100kb) and the number of samples:
cgEx
## Object of class `ContactGroup` representing contact matrices with
## 1283 bins
## 40 kb average width per bin
## 14 samples
The InteractionSet package contains a class called InteractionSet
which is essentially an extension of the ContactGroup
class. The internal storage format is different and InteractionSet
is not restricted to square contact matrices like the ContactGroup
class. We are interested in porting the bnbc()
function to using InteractionSet
, but bnbc()
extensively uses band matrices and we have optimized Rcpp-based routines for getting and setting bands of normal matrices, which means ContactGroup
is a pretty fast for our purposes.
To get data into bnbc you need a list of contact matrices, one per sample. We assume the contact matrices are square, with no missing values. We do not require that data have been transformed or pre-processed by various bias correction software and provide methods for some simple pre-processing techniques.
There is currently no standard Hi-C data format. Instead, different groups produces custom formats, often in forms of text files. Because contact matrices are square, it is common to only distribute the upper or lower triangular matrix. In that case, you can use the following trick to make the matrix square:
mat <- matrix(1:9, nrow = 3, ncol = 3)
mat[lower.tri(mat)] <- 0
mat
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 0 5 8
## [3,] 0 0 9
## Now we fill in the lower triangular matrix with the upper triangular
mat[lower.tri(mat)] <- mat[upper.tri(mat)]
mat
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 4 5 8
## [3,] 7 8 9
Below, we demonstrate the steps needed to convert a set of hypothetical contact matrices into a ContactGroup
object. The object upper.mats.list
is supposed to be a list of contact matrices, each represented as an upper triangular matrix. We also suppose LociData
to be a GenomicRanges
object containing the loci of the contact matrices, and SampleData
to be a DataFrame
of per-sample information (i.e. cell type, sample name etc). We first convert all contact matrices to be symmetric matrices, then use the constructor method ContactGroup()
to create the object.
## Example not run
## Convert upper triangles to symmetry matrix
MatsList <- lapply(upper.mats.list, function(M) {
M[lower.tri(M)] <- M[upper.tri(M)]
})
## Use ContactGroup constructor method
cg <- ContactGroup(rowData = LociData, contacts = MatsList, colData = SampleData)
For this to work, the contacts
list has to have the same names as the rownames of colData
.
We provide setter and getter methods for manipulating individual matrix bands for contact matrices as well. First, we have functions for working with bands of individual matrices (not bnbc related):
mat.1 <- contacts(cgEx)[[1]]
mat.1[1000:1005, 1000:1005]
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 5 30 36 32 53 20
## [2,] 30 7 36 38 50 16
## [3,] 36 36 3 44 51 15
## [4,] 32 38 44 4 89 39
## [5,] 53 50 51 89 36 55
## [6,] 20 16 15 39 55 5
b1 <- band(mat=mat.1, band.no=2)
band(mat=mat.1, band.no=2) <- b1 + 1
mat.1[1000:1005, 1000:1005]
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 5 31 36 32 53 20
## [2,] 31 7 37 38 50 16
## [3,] 36 37 3 45 51 15
## [4,] 32 38 45 4 90 39
## [5,] 53 50 51 90 36 56
## [6,] 20 16 15 39 56 5
In this example, the main diagonal of the contact matrix is also the main diagonal of the printed example above. Similarly, band number two, which is also the first off-diagonal, is also the first off-diagonal of the printed example. As can be seen from the printed example, updating a matrix band is a symmetric operation, and updated the first off-diagonal in both the upper and lower triangles of the matrix.
To utilize this across a list of contact matrices, we have the cgApply()
function which applies the same function to each of the matrices. It supports parallelization using parallel.
To adjust for differences in depth sequencing, we first apply the logCPM
transform (Law et al. 2014) to each contact matrix. This transformation divides each contact matrix by the sum of the upper triangle of that matrix (adding 0.5 to each matrix cell and 1 to sum of the upper triangle), scales the resulting matrix by \(10^6\) and finally takes the log of the scaled matrix (a fudge factor is added to both numerator and denominator prior to taking the logarithm)..
cgEx.cpm <- logCPM(cgEx)
Additionally, we smooth each contact matrix with a square smoothing kernel to reduce artifacts of the choice of bin width. We support both box and Gaussian smoothers.
cgEx.smooth <- boxSmoother(cgEx.cpm, h=5)
## or
## cgEx.smooth <- gaussSmoother(cgEx.cpm, radius=3, sigma=4)
BNBC operates on each matrix band separately. For each matrix band \(k\), we extract each sample’s observation on that band and form a matrix \(M\) from those bands; if band \(k\) has \(d\) entries, then after logCPM
transformation, \(M \in \mathbb{R}^{n \times d}\). For each such matrix, we first apply quantile normalization (Bolstad et al. 2003) to correct for distributional differences, and then ComBat (Johnson, Li, and Rabinovic 2007) to correct for batch effects.
Here we will use bnbc to do batch correction on the first 10 matrix bands, beginning with the second matrix band and ending on the eleventh.
cgEx.bnbc <- bnbc(cgEx.smooth, batch=colData(cgEx.smooth)$Batch,
threshold=1e7, step=4e4, nbands=11, verbose=FALSE)
## Standardizing Data across genes
## Standardizing Data across genes
## Standardizing Data across genes
## Standardizing Data across genes
## Standardizing Data across genes
## Standardizing Data across genes
## Standardizing Data across genes
## Standardizing Data across genes
## Standardizing Data across genes
## Standardizing Data across genes
## R version 3.4.2 (2017-09-28)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.6-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.6-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] bnbc_1.0.0 SummarizedExperiment_1.8.0
## [3] DelayedArray_0.4.0 matrixStats_0.52.2
## [5] Biobase_2.38.0 GenomicRanges_1.30.0
## [7] GenomeInfoDb_1.14.0 IRanges_2.12.0
## [9] S4Vectors_0.16.0 BiocGenerics_0.24.0
## [11] BiocStyle_2.6.0
##
## loaded via a namespace (and not attached):
## [1] genefilter_1.60.0 locfit_1.5-9.1 splines_3.4.2
## [4] lattice_0.20-35 htmltools_0.3.6 yaml_2.1.14
## [7] mgcv_1.8-22 blob_1.1.0 XML_3.98-1.9
## [10] survival_2.41-3 rlang_0.1.2 DBI_0.7
## [13] EBImage_4.20.0 BiocParallel_1.12.0 bit64_0.9-7
## [16] jpeg_0.1-8 GenomeInfoDbData_0.99.1 sva_3.26.0
## [19] stringr_1.2.0 zlibbioc_1.24.0 htmlwidgets_0.9
## [22] evaluate_0.10.1 memoise_1.1.0 knitr_1.17
## [25] AnnotationDbi_1.40.0 preprocessCore_1.40.0 Rcpp_0.12.13
## [28] xtable_1.8-2 backports_1.1.1 limma_3.34.0
## [31] annotate_1.56.0 XVector_0.18.0 abind_1.4-5
## [34] bit_1.1-12 png_0.1-7 digest_0.6.12
## [37] stringi_1.1.5 tiff_0.1-5 bookdown_0.5
## [40] grid_3.4.2 rprojroot_1.2 tools_3.4.2
## [43] bitops_1.0-6 magrittr_1.5 RCurl_1.95-4.8
## [46] RSQLite_2.0 tibble_1.3.4 Matrix_1.2-11
## [49] rmarkdown_1.6 fftwtools_0.9-8 nlme_3.1-131
## [52] compiler_3.4.2
Bolstad, B M, R A Irizarry, M Astrand, and T P Speed. 2003. “A Comparison of Normalization Methods for High Density Oligonucleotide Array Data Based on Variance and Bias.” Bioinformatics 19: 185–93. doi:10.1093/bioinformatics/19.2.185.
Johnson, W Evan, Cheng Li, and Ariel Rabinovic. 2007. “Adjusting Batch Effects in Microarray Expression Data Using Empirical Bayes Methods” 8: 118–27. doi:10.1093/biostatistics/kxj037.
Law, Charity W, Yunshun Chen, Wei Shi, and Gordon K Smyth. 2014. “Voom: Precision Weights Unlock Linear Model Analysis Tools for RNA-seq Read Counts.” Genome Biology 15: R29. doi:10.1186/gb-2014-15-2-r29.
Lieberman-Aiden, Erez, Nynke L van Berkum, Louise Williams, Maxim Imakaev, Tobias Ragoczy, Agnes Telling, Ido Amit, et al. 2009. “Comprehensive Mapping of Long-Range Interactions Reveals Folding Principles of the Human Genome.” Science 326: 289–93. doi:10.1126/science.1181369.
Yang, Tao, Feipeng Zhang, Galip Gurkan Yardimci, Fan Song, Ross C Hardison, William Stafford Noble, Feng Yue, and Qunhua Li. 2017. “HiCRep: Assessing the Reproducibility of Hi-C Data Using a Stratum- Adjusted Correlation Coefficient.” Genome Research. doi:10.1101/gr.220640.117.