Contents

1 Introduction

The mpra package provides tools for the analysis of data from massively parallel reporter assays (MPRA). Specifically, it contains the functionality described in (Myint et al. 2017). The primary analysis purpose is to enable differential analysis of activity measures, but the package can also be used to generate precision weights useful in regression analyses of activity scores on sequence features. The main workhorse of the mpra package is the mpralm() function which draws on the previously proposed voom framework for RNA-seq analysis (Law et al. 2014). In this document, we will be looking at MPRA data from a study comparing episomal and lentiviral versions of MPRA (Inoue et al. 2017).

1.1 How to cite

If you are using this package, please cite (Myint et al. 2017). If you are using the mpralm() function, it would be appropriate to also cite the voom framework (Law et al. 2014).

2 Dependencies

This document has the following dependencies

library(mpra)

3 Creating an MPRASet object

In this package, MPRA data are contained in MPRASet objects. Because MPRA data do not have a common prescribed format, these objects must be created manually. In this section, we demonstrate how to do this.

MPRASet objects must contain DNA and RNA count information because this is the information used to quantify activity levels of the elements being assayed. DNA and RNA count information should be specified as \(K \times S\) integer count matrices where \(K\) is the total number of barcodes over all elements if barcode-level information is being supplied or the total number of putative regulatory elements (PREs) if element-level information is being supplied. \(S\) is the number of samples (typically, the number of independent transfections).

MPRASet objects must also contain element identification information. This should be supplied as a character vector of length \(K\), the number of rows in the DNA and RNA count matrices. These are any strings used to describe/identify the unique PREs being assayed.

Optionally, the barcode sequences and PRE sequences can be specified as length \(K\) character vectors.

In the next sections we provide specific examples for how to specify this information for two common differential analysis settings: tissue and allele comparisons. Although we show simulated data, this information would typically be read from text files.

3.1 Tissue comparison

In tissue comparison studies, the same set of PREs is assayed in two or more cell types. In the following example, the experiment looks at four PREs with three barcodes each. Two tissues (liver and kidney) are studied, and each tissue has four replicates (four independent transfections each).

RNA and DNA count matrices would look as below:

E <- 4 # Number of elements
B <- 3 # Number of barcodes
s <- 4 # Samples per tissue
nt <- 2 # Number of tissues

set.seed(434)
rna <- matrix(rpois(E*B*s*nt, lambda = 30), nrow = E*B, ncol = s*nt)
dna <- matrix(rpois(E*B*s*nt, lambda = 30), nrow = E*B, ncol = s*nt)

rn <- as.character(outer(paste0("barcode_", seq_len(B), "_"), paste0("elem_", seq_len(E)), FUN = "paste0"))
cn <- c(paste0("liver_", seq_len(s)), paste0("kidney_", seq_len(s)))

rownames(rna) <- rn
rownames(dna) <- rn
colnames(rna) <- cn
colnames(dna) <- cn

rna
##                  liver_1 liver_2 liver_3 liver_4 kidney_1 kidney_2 kidney_3
## barcode_1_elem_1      31      26      36      18       27       33       28
## barcode_2_elem_1      33      28      29      32       28       33       34
## barcode_3_elem_1      22      43      25      41       28       34       30
## barcode_1_elem_2      28      27      31      20       37       33       24
## barcode_2_elem_2      32      31      32      21       38       22       27
## barcode_3_elem_2      35      36      28      38       26       23       30
## barcode_1_elem_3      30      34      26      34       40       26       26
## barcode_2_elem_3      37      27      37      29       25       29       38
## barcode_3_elem_3      34      30      20      26       30       35       33
## barcode_1_elem_4      28      27      34      26       29       30       25
## barcode_2_elem_4      31      32      28      36       31       24       29
## barcode_3_elem_4      24      28      34      31       37       29       34
##                  kidney_4
## barcode_1_elem_1       30
## barcode_2_elem_1       30
## barcode_3_elem_1       37
## barcode_1_elem_2       29
## barcode_2_elem_2       29
## barcode_3_elem_2       29
## barcode_1_elem_3       22
## barcode_2_elem_3       28
## barcode_3_elem_3       38
## barcode_1_elem_4       30
## barcode_2_elem_4       32
## barcode_3_elem_4       36
dna
##                  liver_1 liver_2 liver_3 liver_4 kidney_1 kidney_2 kidney_3
## barcode_1_elem_1      29      34      28      27       27       25       30
## barcode_2_elem_1      43      37      34      33       43       26       26
## barcode_3_elem_1      35      23      32      27       19       29       29
## barcode_1_elem_2      23      30      34      27       32       18       24
## barcode_2_elem_2      32      29      42      21       37       34       32
## barcode_3_elem_2      26      23      22      24       32       41       24
## barcode_1_elem_3      29      38      34      27       31       27       31
## barcode_2_elem_3      30      36      32      21       34       29       26
## barcode_3_elem_3      34      28      42      35       26       32       32
## barcode_1_elem_4      34      25      29      30       44       26       23
## barcode_2_elem_4      29      26      37      37       28       37       34
## barcode_3_elem_4      24      28      33      40       31       30       19
##                  kidney_4
## barcode_1_elem_1       31
## barcode_2_elem_1       33
## barcode_3_elem_1       25
## barcode_1_elem_2       25
## barcode_2_elem_2       31
## barcode_3_elem_2       23
## barcode_1_elem_3       33
## barcode_2_elem_3       39
## barcode_3_elem_3       26
## barcode_1_elem_4       33
## barcode_2_elem_4       31
## barcode_3_elem_4       26

PRE identification strings would look as below. When counts are provided at the barcode level, the eid character vector will have repeated elements.

eid <- rep(paste0("elem_", seq_len(E)), each = B)
eid
##  [1] "elem_1" "elem_1" "elem_1" "elem_2" "elem_2" "elem_2" "elem_3" "elem_3"
##  [9] "elem_3" "elem_4" "elem_4" "elem_4"

We may also have PRE sequences as below. These sequences must be specified in a character vector of the same length as eid and the same number of rows as rna and dna.

eseq <- replicate(E, paste(sample(c("A", "T", "C", "G"), 10, replace = TRUE), collapse = ""))
eseq <- rep(eseq, each = B)
eseq
##  [1] "TGACCATACC" "TGACCATACC" "TGACCATACC" "ACGCCCAATT" "ACGCCCAATT"
##  [6] "ACGCCCAATT" "CGGCGAGGGG" "CGGCGAGGGG" "CGGCGAGGGG" "CACTGTACTA"
## [11] "CACTGTACTA" "CACTGTACTA"

The above pieces (rna, dna, eid, and eseq) can be supplied as arguments to the MPRASet constructor function as below. If barcode (barcode sequences) or eseq (PRE sequences) is not supplied, it must be specified as NULL.

mpraset_example <- MPRASet(DNA = dna, RNA = rna, eid = eid, eseq = eseq, barcode = NULL)
mpraset_example
## class: MPRASet 
## dim: 12 8 
## metadata(0):
## assays(2): DNA RNA
## rownames(12): barcode_1_elem_1 barcode_2_elem_1 ... barcode_2_elem_4
##   barcode_3_elem_4
## rowData names(2): eid eseq
## colnames(8): liver_1 liver_2 ... kidney_3 kidney_4
## colData names(0):
## No barcodes present

3.2 Allele comparison

In allele comparison studies, PREs that exist with two or more alleles are assayed. All allelic-versions of the PREs are assayed in the same sample. Because activity comparisons between alleles is desired, these counts must be separated into different columns. In the following example, the experiment looks at four PREs with three barcodes each. There are two alleles per PRE, and there are four replicates (four independent transfections total).

RNA and DNA count matrices would look as below:

E <- 4 # Number of elements
B <- 3 # Number of barcodes
s <- 4 # Total number of samples
nalleles <- 2 # Number of alleles

set.seed(434)
rna <- matrix(rpois(E*B*s*nalleles, lambda = 30), nrow = E*B, ncol = s*nalleles)
dna <- matrix(rpois(E*B*s*nalleles, lambda = 30), nrow = E*B, ncol = s*nalleles)

rn <- as.character(outer(paste0("barcode_", seq_len(B), "_"), paste0("elem_", seq_len(E)), FUN = "paste0"))
cn <- c(paste0("allele1_sample", seq_len(s)), paste0("allele2_sample", seq_len(s)))

rownames(rna) <- rn
rownames(dna) <- rn
colnames(rna) <- cn
colnames(dna) <- cn

rna
##                  allele1_sample1 allele1_sample2 allele1_sample3
## barcode_1_elem_1              31              26              36
## barcode_2_elem_1              33              28              29
## barcode_3_elem_1              22              43              25
## barcode_1_elem_2              28              27              31
## barcode_2_elem_2              32              31              32
## barcode_3_elem_2              35              36              28
## barcode_1_elem_3              30              34              26
## barcode_2_elem_3              37              27              37
## barcode_3_elem_3              34              30              20
## barcode_1_elem_4              28              27              34
## barcode_2_elem_4              31              32              28
## barcode_3_elem_4              24              28              34
##                  allele1_sample4 allele2_sample1 allele2_sample2
## barcode_1_elem_1              18              27              33
## barcode_2_elem_1              32              28              33
## barcode_3_elem_1              41              28              34
## barcode_1_elem_2              20              37              33
## barcode_2_elem_2              21              38              22
## barcode_3_elem_2              38              26              23
## barcode_1_elem_3              34              40              26
## barcode_2_elem_3              29              25              29
## barcode_3_elem_3              26              30              35
## barcode_1_elem_4              26              29              30
## barcode_2_elem_4              36              31              24
## barcode_3_elem_4              31              37              29
##                  allele2_sample3 allele2_sample4
## barcode_1_elem_1              28              30
## barcode_2_elem_1              34              30
## barcode_3_elem_1              30              37
## barcode_1_elem_2              24              29
## barcode_2_elem_2              27              29
## barcode_3_elem_2              30              29
## barcode_1_elem_3              26              22
## barcode_2_elem_3              38              28
## barcode_3_elem_3              33              38
## barcode_1_elem_4              25              30
## barcode_2_elem_4              29              32
## barcode_3_elem_4              34              36
dna
##                  allele1_sample1 allele1_sample2 allele1_sample3
## barcode_1_elem_1              29              34              28
## barcode_2_elem_1              43              37              34
## barcode_3_elem_1              35              23              32
## barcode_1_elem_2              23              30              34
## barcode_2_elem_2              32              29              42
## barcode_3_elem_2              26              23              22
## barcode_1_elem_3              29              38              34
## barcode_2_elem_3              30              36              32
## barcode_3_elem_3              34              28              42
## barcode_1_elem_4              34              25              29
## barcode_2_elem_4              29              26              37
## barcode_3_elem_4              24              28              33
##                  allele1_sample4 allele2_sample1 allele2_sample2
## barcode_1_elem_1              27              27              25
## barcode_2_elem_1              33              43              26
## barcode_3_elem_1              27              19              29
## barcode_1_elem_2              27              32              18
## barcode_2_elem_2              21              37              34
## barcode_3_elem_2              24              32              41
## barcode_1_elem_3              27              31              27
## barcode_2_elem_3              21              34              29
## barcode_3_elem_3              35              26              32
## barcode_1_elem_4              30              44              26
## barcode_2_elem_4              37              28              37
## barcode_3_elem_4              40              31              30
##                  allele2_sample3 allele2_sample4
## barcode_1_elem_1              30              31
## barcode_2_elem_1              26              33
## barcode_3_elem_1              29              25
## barcode_1_elem_2              24              25
## barcode_2_elem_2              32              31
## barcode_3_elem_2              24              23
## barcode_1_elem_3              31              33
## barcode_2_elem_3              26              39
## barcode_3_elem_3              32              26
## barcode_1_elem_4              23              33
## barcode_2_elem_4              34              31
## barcode_3_elem_4              19              26

We could define PRE identification strings and sequences as above and use them in the MPRASet constructor function similarly:

mpraset_example2 <- MPRASet(DNA = dna, RNA = rna, eid = eid, eseq = eseq, barcode = NULL)
mpraset_example2
## class: MPRASet 
## dim: 12 8 
## metadata(0):
## assays(2): DNA RNA
## rownames(12): barcode_1_elem_1 barcode_2_elem_1 ... barcode_2_elem_4
##   barcode_3_elem_4
## rowData names(2): eid eseq
## colnames(8): allele1_sample1 allele1_sample2 ... allele2_sample3
##   allele2_sample4
## colData names(0):
## No barcodes present

4 Analysis

While the above section demonstrated how to create MPRASet objects, we will use preconstructed objects containing data from a comparison of episomal and lentiviral versions of MPRA (Inoue et al. 2017).

data(mpraSetExample)

We create the design matrix with an indicator for the episomal (mutant integrase) samples and fit the precision-weighted linear model with mpralm. In MPRA experiments, activity measures are quantified as the log2 ratio of RNA counts over DNA counts. When there is barcode level information (as in this experiment), there are various ways to summarize information over barcodes to compute the final element- and sample-specific log ratios that are used for subsequent statistical modeling.

We have specified aggregate = "mean" to indicate that the element- and sample-specific log ratios will be computed by first computing the log ratio of RNA counts over DNA counts for each barcode, then taking the mean over barcodes in a particular element and sample.

We have specifed normalize = TRUE to perform total count normalization on the RNA and DNA libraries. This scales all libraries to have a common size of 10 million reads.

Because this experiment looks at a set of PREs in two different cellular conditions, the different samples (columns of the MPRASet object) are independent. Thus we specify model_type = "indep_groups" to perform an unpaired analysis. In contrast, if we were performing an allele comparison, we would specify model_type = "corr_groups" to performed a paired analysis (indicating that different columns of the MPRASet object are linked).

Finally, we specify plot = TRUE to plot the relationship between log ratio variability versus element copy number.

design <- data.frame(intcpt = 1, episomal = grepl("MT", colnames(mpraSetExample)))
mpralm_fit <- mpralm(object = mpraSetExample, design = design, aggregate = "mean", normalize = TRUE, model_type = "indep_groups", plot = TRUE)

The resulting fit object can be used with topTable from the limma package.

toptab <- topTable(mpralm_fit, coef = 2, number = Inf)
toptab6 <- head(toptab)

Because the element codes are rather long for this dataset, we do some tricks to print the top differential elements:

rownames(toptab6)
## [1] "C:SLEA_hg18:chr2:210861483-210861650|5:V_AHRARNT_02:GGGGATCGCGTGCCAGCCC;26:V_HNF1_C:AGTTAATGATTAACCAA;45:V_HNF1_C:AGTTAATGATTAACCAA;64:V_AHRARNT_02:GGGGATCGCGTGCCAGCCC;85:V_HNF1_C:AGTTAATGATTAACCAA;104:V_AHRARNT_02:GGGGATCGCGTGCCAGCCC;125:V_HNF1_C:AGTTAATGATTAACCAA;144:V_AHRARNT_02:GGGGATCGCGTGCCAGCCC"      
## [2] "R:EP300-NoMod_chr9:12814543-12814714_[chr9:12814543-12814714]"                                                                                                                                                                                                                                                       
## [3] "C:SLEA_hg18:chr2:210861483-210861650|4:V_AHRARNT_02:GGGGATCGCGTGCCAGCCC;25:V_AHRARNT_02:GGGGATCGCGTGCCAGCCC;46:V_AHRARNT_02:GGGGATCGCGTGCCAGCCC;67:V_HNF1_C:AGTTAATGATTAACCAA;86:V_AHRARNT_02:GGGGATCGCGTGCCAGCCC;107:V_HNF1_C:AGTTAATGATTAACCAA;126:V_AHRARNT_02:GGGGATCGCGTGCCAGCCC;147:V_HNF1_C:AGTTAATGATTAACCAA"
## [4] "C:SLEA_hg18:chr2:210861483-210861650|6:V_AHRARNT_02:GGGGATCGCGTGCCAGCCC;27:V_HNF1_C:AGTTAATGATTAACCAA;46:V_HNF1_C:AGTTAATGATTAACCAA;65:V_AHRARNT_02:GGGGATCGCGTGCCAGCCC;86:V_HNF1_C:AGTTAATGATTAACCAA;105:V_HNF1_C:AGTTAATGATTAACCAA;124:V_HNF1_C:AGTTAATGATTAACCAA;143:V_AHRARNT_02:GGGGATCGCGTGCCAGCCC"            
## [5] "R:FOXA1_FOXA2-ChMod_chr1:38000211-38000353_[chr1:38000196-38000367]"                                                                                                                                                                                                                                                 
## [6] "R:EP300-NoMod_chr16:3070958-3071129_[chr16:3070958-3071129]"
rownames(toptab6) <- NULL
toptab6
##        logFC  AveExpr         t      P.Value    adj.P.Val        B
## 1 -1.2076627 1.284363 -26.99214 1.450176e-36 3.538429e-33 73.05825
## 2 -1.7316415 1.668562 -23.50738 4.357972e-33 5.316726e-30 64.87553
## 3 -1.1603766 1.497371 -23.11093 1.150733e-32 9.359292e-30 64.12459
## 4 -0.9560848 1.016586 -18.57491 2.136470e-27 1.303247e-24 52.09394
## 5 -1.1713718 1.730330 -18.25135 5.494116e-27 2.681128e-24 51.09648
## 6 -0.7639104 0.733524 -17.71599 2.689472e-26 1.093719e-23 49.54888

5 Sessioninfo

R version 3.5.1 Patched (2018-07-12 r74967) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 16.04.5 LTS

Matrix products: default BLAS: /home/biocbuild/bbs-3.8-bioc/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.8-bioc/R/lib/libRlapack.so

locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets [8] methods base

other attached packages: [1] mpra_1.4.1 limma_3.38.2
[3] SummarizedExperiment_1.12.0 DelayedArray_0.8.0
[5] BiocParallel_1.16.2 matrixStats_0.54.0
[7] Biobase_2.42.0 GenomicRanges_1.34.0
[9] GenomeInfoDb_1.18.1 IRanges_2.16.0
[11] S4Vectors_0.20.1 BiocGenerics_0.28.0
[13] BiocStyle_2.10.0

loaded via a namespace (and not attached): [1] Rcpp_1.0.0 compiler_3.5.1 BiocManager_1.30.4
[4] XVector_0.22.0 bitops_1.0-6 tools_3.5.1
[7] zlibbioc_1.28.0 statmod_1.4.30 digest_0.6.18
[10] evaluate_0.12 lattice_0.20-38 Matrix_1.2-15
[13] yaml_2.2.0 xfun_0.4 GenomeInfoDbData_1.2.0 [16] stringr_1.3.1 knitr_1.20 rprojroot_1.3-2
[19] grid_3.5.1 rmarkdown_1.10 bookdown_0.7
[22] magrittr_1.5 backports_1.1.2 scales_1.0.0
[25] htmltools_0.3.6 colorspace_1.3-2 stringi_1.2.4
[28] RCurl_1.95-4.11 munsell_0.5.0

References

Inoue, Fumitaka, Martin Kircher, Beth Martin, Gregory M Cooper, Daniela M Witten, Michael T McManus, Nadav Ahituv, and Jay Shendure. 2017. “A Systematic Comparison Reveals Substantial Differences in Chromosomal Versus Episomal Encoding of Enhancer Activity.” Genome Research 27:38–52. https://doi.org/10.1101/gr.212092.116.

Law, Charity W, Yunshun Chen, Wei Shi, and Gordon K Smyth. 2014. “Voom: Precision Weights Unlock Linear Model Analysis Tools for RNA-seq Read Counts.” Genome Biology 15:R29. https://doi.org/10.1186/gb-2014-15-2-r29.

Myint, Leslie, Dimitrios G Avramopoulos, Loyal A. Goff, and Kasper D Hansen. 2017. “Linear Models Enable Powerful Differential Activity Analysis in Massively Parallel Reporter Assays.” bioRxiv, 196394. https://doi.org/10.1101/196394.