rnaEditr
is an R package that identifies genomic sites and genomic regions that are differentially edited in RNA-seq datasets. rnaEditr
can analyze studies with continuous, binary, or survival phenotypes, along with multiple covariates and/or interaction effects. To identify hyper-edited regions, rnaEditr
first determines co-edited sub-regions without using any phenotype information. Next, rnaEditr
tests association between RNA editing levels within the co-edited regions with binary, continuous or survival phenotypes.
The latest version can be installed by:
if(!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("rnaEditr")
After installation, the rnaEditr
package can be loaded into R using:
## Warning: replacing previous import 'utils::findMatches' by
## 'S4Vectors::findMatches' when loading 'AnnotationDbi'
The input of rnaEditr
are: (1) an RNA editing dataset, with rows corresponding to the edited sites and columns corresponding to the samples; (2) a phenotype dataset, with rows corresponding to different samples, ordered in the same way as columns of dataset (1); (3) a character string of genomic regions that users are interested in. This is only required for region-based analysis and will be explained in section 4.1.
The first dataset includes RNA editing levels for each sample at each edited site. RNA editing levels are typically defined as the total number of edited reads (i.e. reads with G nucleotides) divided by the total number of reads covering the site (i.e. reads with A and G nucleotides) (Breen et al. 2019 PMID: 31455887). They range from 0 (un-edited site) to 1 (completely edited site).
We assume quality control and normalization of the RNA editing dataset have been performed, and that each site is edited in at least 5% of the samples. For illustration, we use a subset of the TCGA breast cancer RNA editing dataset (syn 2374375). This example dataset includes RNA editing levels for 272 edited sites mapped to genes PHACTR4, CCR5, METTL7A (using hg19 reference), along with a few randomly sampled sites, for 221 subjects.
## BRCA-Tumor-TCGA-AC-A2FF BRCA-Normal-TCGA-AC-A2FF
## chr1:28661572 0.00000000 0.0000000
## chr1:28661576 0.64285714 0.6307692
## chr1:28661578 0.01923077 0.0000000
## BRCA-Normal-TCGA-BH-A0AU
## chr1:28661572 0.0000000
## chr1:28661576 0.5333333
## chr1:28661578 0.0000000
In order to link the RNA editing dataset with phenotype data, the phenotype dataset needs to have a column called sample
, whose values must be a exact match to the column names in the above RNA editing dataset.
pheno_df <- readRDS(
system.file(
"extdata",
"pheno_df.RDS",
package = 'rnaEditr',
mustWork = TRUE
)
)
## submitter_id sample sample_type
## 1 TCGA-AC-A2FF BRCA-Tumor-TCGA-AC-A2FF Primary Tumor
## 2 TCGA-AC-A2FF BRCA-Normal-TCGA-AC-A2FF Solid Tissue Normal
## 3 TCGA-BH-A0AU BRCA-Normal-TCGA-BH-A0AU Solid Tissue Normal
Here we are using the command below to make sure sample
variable in phenotype dataset has a exact match to the column names in the RNA editing dataset.
## [1] TRUE
rnaEditr
can perform both site-specific and region-based analysis. In Section 3, we illustrate testing associations between cancer status and RNA editing levels at individual sites. In Section 4, we illustrate identifying cancer associated co-edited regions.
Before testing the associations, we use function CreateEditingTable()
to turn RNA editing matrix into a dataframe with special class rnaEdit_df
, which is a required format of input dataset for function TestAssociations()
.
In this example, we will test the association between cancer status ( sample_type
variable from pheno_df
dataset) and all edited sites. Because the outcome variable “cancer status” is a binary variable, rnaEditr
will apply a logistic regression model for each site: \[logit(Pr(cancer\;status = “yes”)) \sim RNA\;editing\;level\]
First, we use the command below to check the distribution of variable sample_type
. Please note that rnaEditr
will fit firth corrected logistic regression instead of regular logistic regression models for binary outcomes, when the minimum sample size per group is less than 5.
##
## Primary Tumor Solid Tissue Normal
## 182 39
tumor_single_df <- TestAssociations(
# an RNA editing dataframe with special class "rnaEdit_df" from function
# CreateEditingTable() if site-specific analysis, from function
# SummarizeAllRegions() if region-based analysis.
rnaEdit_df = rnaedit2_df,
# a phenotype dataset that must have variable "sample" whose values are a
# exact match to the colnames of "rnaEdit_df".
pheno_df = pheno_df,
# name of outcome variable in phenotype dataset "pheno_df" that you want to
# test.
responses_char = "sample_type",
# names of covariate variables in phenotype dataset "pheno_df" that you want
# to add into the model.
covariates_char = NULL,
# type of outcome variable that you input in argument "responses_char".
respType = "binary",
# order the final results by p-values or not.
orderByPval = TRUE
)
## seqnames start end width estimate stdErr pValue
## 225 chr12 51324639 51324639 1 -19.66360 3.158016 4.767627e-10
## 42 chr1 28662199 28662199 1 -14.83609 2.399372 6.276575e-10
## 178 chr12 51324163 51324163 1 -42.69833 7.221196 3.361021e-09
## fdr
## 225 8.536142e-08
## 42 8.536142e-08
## 178 2.978449e-07
Here tumor_single_df
is a data frame of all edited sites, with corresponding p-values and false discovery rate (FDRs) from the logistic regression model.
We next annotate these results by adding gene names mapped to the RNA editing sites.
tumor_annot_df <- AnnotateResults(
# the output dataset from function TestAssociations().
results_df = tumor_single_df,
# close-by regions, since this is site-specific analysis, set to NULL.
closeByRegions_gr = NULL,
# input regions, since this is site-specific analysis, set to NULL.
inputRegions_gr = NULL,
genome = "hg19",
# the type of analysis result from function TestAssociations(), since we are
# running site-specific analysis, set to "site-specific".
analysis = "site-specific"
)
## seqnames start end width symbol estimate stdErr pValue
## 1 chr12 51324639 51324639 1 METTL7A -19.66360 3.158016 4.767627e-10
## 2 chr1 28662199 28662199 1 MED18 -14.83609 2.399372 6.276575e-10
## 3 chr12 51324163 51324163 1 METTL7A -42.69833 7.221196 3.361021e-09
## fdr
## 1 8.536142e-08
## 2 8.536142e-08
## 3 2.978449e-07
The region-based analysis consists of several steps: (1) make GRanges object for the genomic regions, (2) determine RNA editing sites that are located closely within the genomic regions, (3) identify co-edited regions, and (4) test the association between cancer status and co-edited regions.
First, we make GRanges object for the genomic regions that we are interested in . We saved both hg19 and hg38 gene references in the package. The following command retrieves regions associated with 20,314 pre-processed hg19 genes. To retrieve hg38 gene reference, easily change “hg19_annoGene_gr.RDS” into “hg38_annoGene_gr.RDS” in the command below.
allGenes_gr <- readRDS(
system.file(
"extdata",
"hg19_annoGene_gr.RDS",
package = 'rnaEditr',
mustWork = TRUE
)
)
## GRanges object with 3 ranges and 2 metadata columns:
## seqnames ranges strand | ensembl_gene_id symbol
## <Rle> <IRanges> <Rle> | <character> <character>
## [1] chr1 45959538-45965751 * | ENSG00000236624 CCDC163P
## [2] chr1 44457031-44462200 * | ENSG00000159214 CCDC24
## [3] chr1 148555979-148596267 * | ENSG00000243452 NBPF15
## -------
## seqinfo: 24 sequences from an unspecified genome; no seqlengths
For candidate gene studies, we are often only interested in a few specific genes. We can use the following commands to extract GRanges associated with the specific genes:
# If input is gene symbol
inputGenes_gr <- TransformToGR(
# input a character vector of gene symbols
genes_char = c("PHACTR4", "CCR5", "METTL7A"),
# the type of "gene_char". As we input gene symbols above, set to "symbol"
type = "symbol",
genome = "hg19"
)
## GRanges object with 3 ranges and 1 metadata column:
## seqnames ranges strand | symbols
## <Rle> <IRanges> <Rle> | <character>
## [1] chr1 28696114-28826881 * | PHACTR4
## [2] chr3 46411633-46417697 * | CCR5
## [3] chr12 51317255-51326300 * | METTL7A
## -------
## seqinfo: 3 sequences from an unspecified genome; no seqlengths
If we’re interested in particular regions, we can use the following commands to make GRanges.
# If input is region ranges
inputRegions_gr <- TransformToGR(
# input a character vector of region ranges.
genes_char = c("chr22:18555686-18573797", "chr22:36883233-36908148"),
# the type of "gene_char". As we input region ranges above, set to "region".
type = "region",
genome = "hg19"
)
# Here we use AddMetaData() to find the gene symbols for inputRegions_gr.
AddMetaData(target_gr = inputRegions_gr, genome = "hg19")
## GRanges object with 2 ranges and 1 metadata column:
## seqnames ranges strand | symbol
## <Rle> <IRanges> <Rle> | <character>
## [1] chr22 18555686-18573797 * | PEX26
## [2] chr22 36883233-36908148 * | FOXRED2;EIF3D
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Next, within each input candidate region, we identify the sub-regions that contain closely located RNA editing sites. At default, we identify sub-regions with at least 3 edited sites, for which maximum distance between two neighboring sites is 50 bp.
closeByRegions_gr <- AllCloseByRegions(
# a GRanges object of genomic regions retrieved or created in section 4.1.
regions_gr = inputGenes_gr,
# an RNA editing matrix.
rnaEditMatrix = rnaedit_df,
maxGap = 50,
minSites = 3
)
## GRanges object with 8 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 28823903-28823966 *
## [2] chr1 28824256-28824490 *
## [3] chr1 28825317-28825530 *
## [4] chr1 28825670-28825927 *
## [5] chr1 28826015-28826284 *
## [6] chr3 46417171-46417221 *
## [7] chr12 51324081-51324639 *
## [8] chr12 51325229-51325622 *
## -------
## seqinfo: 3 sequences from an unspecified genome; no seqlengths
closeByRegions_gr
is a GRanges object that includes sub-regions that contain closely located RNA editing sites (close-by regions). In this example, within the three input candidate regions, 8 close-by regions were found.
Next, within each close-by region, we identify co-edited regions based on RNA editing levels. At default, a co-edited region needs to satisfy the following requirements: (1) with at least 3 edited sites; (2) the minimum correlation between RNA editing levels of one site and the mean RNA editing levels of the rest of the sites is at least 0.4; (3) the minimum pairwise correlation of sites within the selected cluster is at least 0.1.
closeByCoeditedRegions_gr <- AllCoeditedRegions(
# a GRanges object of close-by regions created by AllCloseByRegions().
regions_gr = closeByRegions_gr,
# an RNA editing matrix.
rnaEditMatrix = rnaedit_df,
# type of output data.
output = "GRanges",
rDropThresh_num = 0.4,
minPairCorr = 0.1,
minSites = 3,
# the method for computing correlations.
method = "spearman",
# When no co-edited regions are found in an input genomic region, you want to
# output the whole region (when set to TRUE) or NULL (when set to FALSE).
returnAllSites = FALSE
)
## GRanges object with 14 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 28825317-28825370 *
## [2] chr1 28825394-28825403 *
## [3] chr1 28825670-28825717 *
## [4] chr1 28825899-28825927 *
## [5] chr1 28826150-28826173 *
## ... ... ... ...
## [10] chr12 51324308-51324436 *
## [11] chr12 51324447-51324467 *
## [12] chr12 51324545-51324585 *
## [13] chr12 51324604-51324627 *
## [14] chr12 51325324-51325383 *
## -------
## seqinfo: 3 sequences from an unspecified genome; no seqlengths
closeByCoeditedRegions_gr
is a GRanges that contains co-edited regions within the close-by regions we found from last step. We identified 14 co-edited regions from the 8 close-by regions.
Let’s take a look at correlations of RNA editing levels within the first co-edited region:
Next, we summarize RNA editing levels from multiple edited sites within each co-edited region using medians.
summarizedRegions_df <- SummarizeAllRegions(
# a GRanges object of close-by regions created by AllCoeditedRegions().
regions_gr = closeByCoeditedRegions_gr,
# an RNA editing matrix.
rnaEditMatrix = rnaedit_df,
# available methods: "MaxSites", "MeanSites", "MedianSites", and "PC1Sites".
selectMethod = MedianSites
)
## seqnames start end width BRCA-Tumor-TCGA-AC-A2FF
## 1 chr1 28825317 28825370 54 0.10618417
## 2 chr1 28825394 28825403 10 0.00000000
## 3 chr1 28825670 28825717 48 0.08602151
summarizedRegions_df
is a data frame with each row corresponding to one co-edited region identified from function AllCoeditedRegions()
. Since there are 14 co-edited regions, this data frame has 14 rows. In column 5 to the end, each column includes median editing levels (over all sites within each co-edited region) for one sample.
Next, we test the association between cancer status and co-edited regions using logistic regression model: \[logit (Pr(sample\;type = “cancer”)) \sim median\;RNA\;editing\;levels\]
tumor_region_df <- TestAssociations(
# an RNA editing dataframe with special class "rnaEdit_df" from function
# CreateEditingTable() if site-specific analysis, from function
# SummarizeAllRegions() if region-based analysis.
rnaEdit_df = summarizedRegions_df,
# a phenotype dataset that must have variable "sample" whose values are a
# exact match to the colnames of "rnaEdit_df".
pheno_df = pheno_df,
# name of outcome variable in phenotype dataset "pheno_df" that you want to
# test.
responses_char = "sample_type",
# names of covariate variables in phenotype dataset "pheno_df" that you want
# to add into the model.
covariates_char = NULL,
# type of outcome variable that you input in argument "responses_char".
respType = "binary",
# order the final results by p-values or not.
orderByPval = TRUE
)
## seqnames start end width estimate stdErr pValue
## 8 chr12 51324085 51324127 43 -51.23815 8.736146 4.489288e-09
## 14 chr12 51325324 51325383 60 -94.78676 16.738315 1.488692e-08
## 11 chr12 51324447 51324467 21 -102.71801 18.271132 1.888967e-08
## fdr
## 8 6.285003e-08
## 14 8.815177e-08
## 11 8.815177e-08
Here tumor_region_df
contains results for testing each co-edited region using logistic regression model.
We next annotate results by adding the corresponding input regions, close-by regions and genes mapped to these genomic regions.
tumor_annot_df <- AnnotateResults(
# the output dataset from function TestAssociations().
results_df = tumor_region_df,
# close-by regions which is a output of AllCloseByRegions().
closeByRegions_gr = closeByRegions_gr,
# input regions, which are created in section 4.1.
inputRegions_gr = inputGenes_gr,
genome = "hg19",
# the type of analysis result from function TestAssociations(), since we are
# doing region-based analysis, use default here.
analysis = "region-based"
)
## seqnames start end width inputRegion
## 1 chr12 51324085 51324127 43 chr12:51317255-51326300
## 2 chr12 51325324 51325383 60 chr12:51317255-51326300
## 3 chr12 51324447 51324467 21 chr12:51317255-51326300
## closeByRegion symbol estimate stdErr pValue
## 1 chr12:51324081-51324639 METTL7A -51.23815 8.736146 4.489288e-09
## 2 chr12:51325229-51325622 METTL7A -94.78676 16.738315 1.488692e-08
## 3 chr12:51324081-51324639 METTL7A -102.71801 18.271132 1.888967e-08
## fdr
## 1 6.285003e-08
## 2 8.815177e-08
## 3 8.815177e-08
To summarize this final dataset, values of column inputRegion
are the user-specified candidate genomic regions, values of column closeByRegion
are the regions that contain closely located RNA editing sites, and columns seqnames
, start
, end
are the co-edited regions.
TestAssociations
in rnaEditr
rnaEditr
can analyze different types of phenotypes, including binary, continuous and survival outcomes. In the last section, we analyzed a binary phenotype sample_type
. We next illustrate analysis for continuous and survival phenotypes.
For continuous outcome, as an example, to identify co-edited regions associated age, we use the following commands:
tumor_region_df <- TestAssociations(
# an RNA editing dataframe with special class "rnaEdit_df" from function
# CreateEditingTable() if site-specific analysis, from function
# SummarizeAllRegions() if region-based analysis.
rnaEdit_df = summarizedRegions_df,
# a phenotype dataset that must have variable "sample" whose values are a
# exact match to the colnames of "rnaEdit_df".
pheno_df = pheno_df,
# name of outcome variable in phenotype dataset "pheno_df" that you want to
# test.
responses_char = "age_at_diagnosis",
# names of covariate variables in phenotype dataset "pheno_df" that you want
# to add into the model.
covariates_char = NULL,
# type of outcome variable that you input in argument "responses_char".
respType = "continuous",
# order the final results by p-values or not.
orderByPval = TRUE
)
## seqnames start end width estimate stdErr pValue fdr
## 7 chr3 46417171 46417221 51 -29.56179 12.64776 0.02032533 0.2845546
## 2 chr1 28825394 28825403 10 -72.25518 35.67815 0.04406090 0.3084263
## 11 chr12 51324447 51324467 21 54.28664 29.48137 0.06691720 0.3122803
For survival outcome, for example, we use the following commands:
tumor_region_df <- TestAssociations(
# an RNA editing dataframe with special class "rnaEdit_df" from function
# CreateEditingTable() if site-specific analysis, from function
# SummarizeAllRegions() if region-based analysis.
rnaEdit_df = summarizedRegions_df,
# a phenotype dataset that must have variable "sample" whose values are a
# exact match to the colnames of "rnaEdit_df".
pheno_df = pheno_df,
# name of outcome variable in phenotype dataset "pheno_df" that you want to
# test.
responses_char = c("OS.time", "OS"),
# names of covariate variables in phenotype dataset "pheno_df" that you want
# to add into the model.
covariates_char = NULL,
# type of outcome variable that you input in argument "responses_char".
respType = "survival",
# order the final results by p-values or not.
orderByPval = TRUE
)
## seqnames start end width coef exp_coef se_coef
## 3 chr1 28825670 28825717 48 -13.64262 1.188740e-06 3.485844
## 14 chr12 51325324 51325383 60 -21.61839 4.085568e-10 5.753652
## 4 chr1 28825899 28825927 29 -24.57176 2.131191e-11 7.063456
## pValue fdr
## 3 9.088515e-05 0.001202134
## 14 1.717334e-04 0.001202134
## 4 5.038231e-04 0.002111975
Here is the R session information for this vignette:
## R version 4.3.0 RC (2023-04-13 r84269)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] rnaEditr_1.10.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.0 dplyr_1.1.2
## [3] blob_1.2.4 filelock_1.0.2
## [5] Biostrings_2.68.0 bitops_1.0-7
## [7] fastmap_1.1.1 RCurl_1.98-1.12
## [9] BiocFileCache_2.8.0 GenomicAlignments_1.36.0
## [11] XML_3.99-0.14 digest_0.6.31
## [13] lifecycle_1.0.3 survival_3.5-5
## [15] KEGGREST_1.40.0 RSQLite_2.3.1
## [17] magrittr_2.0.3 compiler_4.3.0
## [19] rlang_1.1.0 sass_0.4.5
## [21] progress_1.2.2 rngtools_1.5.2
## [23] tools_4.3.0 corrplot_0.92
## [25] utf8_1.2.3 yaml_2.3.7
## [27] rtracklayer_1.60.0 knitr_1.42
## [29] prettyunits_1.1.1 doRNG_1.8.6
## [31] bit_4.0.5 curl_5.0.0
## [33] DelayedArray_0.26.0 plyr_1.8.8
## [35] xml2_1.3.3 BiocParallel_1.34.0
## [37] purrr_1.0.1 BiocGenerics_0.46.0
## [39] grid_4.3.0 stats4_4.3.0
## [41] fansi_1.0.4 mice_3.15.0
## [43] iterators_1.0.14 biomaRt_2.56.0
## [45] SummarizedExperiment_1.30.0 cli_3.6.1
## [47] rmarkdown_2.21 crayon_1.5.2
## [49] generics_0.1.3 httr_1.4.5
## [51] rjson_0.2.21 DBI_1.1.3
## [53] cachem_1.0.7 stringr_1.5.0
## [55] splines_4.3.0 operator.tools_1.6.3
## [57] zlibbioc_1.46.0 parallel_4.3.0
## [59] AnnotationDbi_1.62.0 XVector_0.40.0
## [61] restfulr_0.0.15 matrixStats_0.63.0
## [63] vctrs_0.6.2 Matrix_1.5-4
## [65] jsonlite_1.8.4 IRanges_2.34.0
## [67] hms_1.1.3 S4Vectors_0.38.0
## [69] bit64_4.0.5 GenomicFeatures_1.52.0
## [71] locfit_1.5-9.7 foreach_1.5.2
## [73] tidyr_1.3.0 jquerylib_0.1.4
## [75] glue_1.6.2 codetools_0.2-19
## [77] stringi_1.7.12 GenomeInfoDb_1.36.0
## [79] GenomicRanges_1.52.0 BiocIO_1.10.0
## [81] tibble_3.2.1 pillar_1.9.0
## [83] rappdirs_0.3.3 htmltools_0.5.5
## [85] GenomeInfoDbData_1.2.10 R6_2.5.1
## [87] dbplyr_2.3.2 formula.tools_1.7.1
## [89] evaluate_0.20 lattice_0.21-8
## [91] Biobase_2.60.0 highr_0.10
## [93] backports_1.4.1 png_0.1-8
## [95] Rsamtools_2.16.0 broom_1.0.4
## [97] memoise_2.0.1 bslib_0.4.2
## [99] Rcpp_1.0.10 nlme_3.1-162
## [101] mgcv_1.8-42 logistf_1.24.1
## [103] bumphunter_1.42.0 xfun_0.39
## [105] MatrixGenerics_1.12.0 pkgconfig_2.0.3