A Brief Introduction to TReNA

Overview of TReNA

TReNA is a package for inferring relationships between target genes and their transcription factors. By using TReNA, we can generate hypotheses around the regulation of gene expression and enable the construction of gene network models. In the most basic sense, TReNA accepts a matrix of gene expression data (which must include the expression levels of transcription factors) and provides two operations on the data. First, it filters putative transcriptional regulators based on user-specified criteria, such as the presence of motifs or footprints. Second, given a target gene and a list of transcription factors as predictors, it uses a user-specified solver to perform feature selection. Thus, the standard TReNA workflow is the following:

  1. Create a TReNA object using an expression matrix (RNA-seq or microarray) and a desired solver.
  2. (optional) Select candidate transcription factors from either motifs or footprints.
  3. Perform feature selection to identify putative transcriptional regulators for the target gene.
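
Putting those steps together, a minimal end-to-end sketch looks like the following. This is only a preview, using a placeholder expression matrix (here called mtx.expression); each function and argument is demonstrated in detail, on real data, in the sections below.

# Preview of the standard workflow (mtx.expression is a placeholder matrix)
trena <- TReNA(mtx.assay = mtx.expression, solver = "lasso")       # step 1
variance.filter <- VarianceFilter(mtx.assay = mtx.expression)      # step 2 (optional)
tf.list <- getCandidates(variance.filter,
                         extraArgs = list("target.gene" = "MEF2C", "var.size" = 0.5))
tbl.out <- solve(trena, "MEF2C", tf.list$tfs)                      # step 3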

The filtering step, while optional, can constrain the solution space and reduce computational time by reducing the number of predictors supplied to the solver. Rather than supplying hundreds of transcription factors, which can obscure meaningful relationships between genes, filters allow the user to screen out less likely candidates, leaving only the most plausible regulators.

Starting off, we’ll load the TReNA library.

suppressMessages(library(TReNA))

Creating a TReNA Object

Running TReNA requires a matrix of gene expression data; there are several example data sets included with the TReNA package. In this case, we will load data from a study on Alzheimer's disease (Scientific Data, doi:10.1038/sdata.2016.89), in which 278 samples were obtained for 154 transcription factors.

load(system.file(package="TReNA", "extdata/ampAD.154genes.mef2cTFs.278samples.RData"))

This gives us a matrix, mtx.sub, that contains our gene expression data in units of TMM. Before creating a TReNA object, it is wise to inspect the distribution of the matrix and consider using some sort of data transformation to normalize the mean-variance relationship. We can see, for instance, that our dataset is fairly skewed, with most of the expression values falling close to 0 and only a few values on the higher end of the spectrum.
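
For example, a quick histogram (one of several reasonable ways to inspect the matrix; base-R graphics are assumed here) makes the skew apparent:

# Inspect the distribution of the raw expression values
hist(mtx.sub, breaks = 100,
     main = "Raw expression values (TMM)", xlab = "Expression")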

The skewed nature of the matrix may profoundly affect our results. To overcome this, we will transform our matrix using the voom function in the limma package.

suppressMessages(library(limma))
mtx.voom <- voom(mtx.sub)$E

If we now plot our transformed matrix, we can see that it’s been scaled down substantially.
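
A histogram of the transformed values (again just a quick base-R check; voom returns log2-scale values in its E component) shows the difference:

# Inspect the distribution of the voom-transformed values (log2 scale)
hist(mtx.voom, breaks = 100,
     main = "voom-transformed expression values", xlab = "Expression (log2)")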

Moving forward, we will use this transformed data set in the rest of our analyses. That gives us one half of our TReNA object; the other half concerns our choice of solver. TReNA currently supports 9 different solvers. Their associated character strings for use in TReNA are (in no particular order):

  1. lasso - an application of glmnet using alpha = 0.9 as the default elastic net parameter; default TReNA solver
  2. ridge - an application of glmnet using alpha = 0 as the default elastic net parameter
  3. randomForest - an application of randomForest, the standard Random Forest function
  4. bayesSpike - an application of vbsr, the Bayes Spike function
  5. sqrtlasso - an application of the slim function from the flare package, with q = 2
  6. lassopv - an application of lassopv, the P-Value LASSO function
  7. pearson - an application of the cor function using the default parameters
  8. spearman - an application of the cor function using method = "spearman"
  9. ensemble - a combination of 2 or more of the other solvers; by default, it uses all but bayesSpike and sqrtlasso

Although TReNA defaults to using LASSO as the solver, we recommend doing some critical thinking about which solver most suits your purposes before choosing. Once a solver has been chosen, the TReNA object is simply specified (using Random Forest in this case):

trena <- TReNA(mtx.assay = mtx.voom, solver = "randomForest")

We have now created an object of the TReNA class, which contains our solver and our assay matrix.
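
As an optional sanity check, we can confirm that we have an S4 object of the expected class and inspect its slots (slot names may differ between package versions):

class(trena)
slotNames(trena)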

Using Filtering to Reduce Predictor Number

TReNA currently supports 3 different candidate transcription factor inputs:

  1. FootprintFilter: using a PostgreSQL or SQLite database of footprint data, returns transcription factors with footprints for a given target gene within a specified genomic range
  2. VarianceFilter: returns transcription factors with expression variance within a certain range of the target gene’s expression variance
  3. NullFilter: returns all transcription factors supplied in the matrix

In order to get a list of transcription factor candidates, you must construct the appropriate filter object by supplying an assay matrix, then call the getCandidates function on that object. Depending on the filter, this method may require additional parameters. The simplest is the NullFilter, which returns candidates with no additional parameters; however, as the name suggests, it simply returns all transcription factors in the matrix, not including the target gene.

null.filter <- NullFilter(mtx.assay = mtx.voom)
tfs <- getCandidates(null.filter)
str(tfs)
##  chr [1:154] "ADNP" "ADNP2" "ALX3" "ALX4" "ARX" "ATF2" "ATF7" "BACH1" ...

The VarianceFilter requires the specification of a target gene and also allows you to specify a variance size range. It finds the variance of all genes in the assay matrix, then returns all transcription factors with variance within the variance size range of the target gene. For instance, we can create a VarianceFilter and use it to find all transcription factors with variance within 50% of the target gene’s variance. This will return both the names of the transcription factors and their variances in a named list.

variance.filter <- VarianceFilter(mtx.assay = mtx.voom)
tf.list <- getCandidates(variance.filter, extraArgs = list("target.gene" = "MEF2C", "var.size" = 0.5))
str(tf.list)
## List of 2
##  $ tfs    : chr [1:33] "BCL6B" "CREB5" "CUX2" "DRGX" ...
##  $ tf.vars: Named num [1:33] 0.799 0.57 0.996 0.578 0.504 ...
##   ..- attr(*, "names")= chr [1:33] "BCL6B" "CREB5" "CUX2" "DRGX" ...

The most complex filter is the FootprintFilter, which leverages information from footprint databases in either SQLite or PostgreSQL. This filter requires connections to 2 databases: 1) a genome database, which contains information on the location and function of genes; 2) a project database, which contains footprint regions from a specific project. As an illustration of the format and utility of these databases, we have included in extdata two SQLite databases demonstrating the required tables and information needed to use the FootprintFilter. Our databases are subsets of larger databases, corresponding to the MEF2C gene and a footprinting project in the brain that used the Wellington method for determining footprints. The FootprintFilter method for getting candidates also allows for specification of the distance upstream and downstream of a transcription start site to look for footprints; in this case, we'll use 1000 bases for each, the default distance. As with the VarianceFilter, the result is a named list containing the names of the candidate transcription factors, plus a table of other information about them taken from the project and genome databases.

footprint.filter <- FootprintFilter(mtx.assay = mtx.voom)
db.address <- system.file(package = "TReNA", "extdata")
genome.db.uri <- paste("sqlite:/",db.address,"genome.sub.db", sep = "/")
project.db.uri <- paste("sqlite:/",db.address,"project.sub.db", sep = "/")
target.gene <- "MEF2C"
tfs <- getCandidates(footprint.filter, extraArgs = list("target.gene" = target.gene,
                                                        "genome.db.uri"=genome.db.uri, 
                                                        "project.db.uri" = project.db.uri,
                                                        "size.upstream" = 1000,
                                                        "size.downstream" = 1000))
str(tfs)
## List of 2
##  $ tfs: chr [1:64] "LHX4" "LHX6" "LHX9" "LHX2" ...
##  $ tbl:'data.frame': 580 obs. of  18 variables:
##   ..$ name      : chr [1:580] "MA0046.2" "MA0046.2" "MA0135.1" "MA0135.1" ...
##   ..$ loc       : chr [1:580] "chr5:88903305-88903319" "chr5:88903305-88903319" "chr5:88903381-88903393" "chr5:88903381-88903393" ...
##   ..$ chrom     : chr [1:580] "chr5" "chr5" "chr5" "chr5" ...
##   ..$ start     : int [1:580] 88903305 88903305 88903381 88903381 88903381 88903381 88903381 88903381 88903381 88903381 ...
##   ..$ endpos    : int [1:580] 88903319 88903319 88903393 88903393 88903393 88903393 88903393 88903393 88903393 88903393 ...
##   ..$ type      : chr [1:580] "motif.in.footprint" "motif.in.footprint" "motif.in.footprint" "motif.in.footprint" ...
##   ..$ length    : int [1:580] 15 15 13 13 13 13 13 13 13 13 ...
##   ..$ strand    : chr [1:580] "+" "-" "+" "+" ...
##   ..$ sample_id : chr [1:580] "ENCSR318PRQ_wellington" "ENCSR318PRQ_wellington" "ENCSR318PRQ_wellington" "ENCSR318PRQ_wellington" ...
##   ..$ method    : chr [1:580] "WELLINGTON" "WELLINGTON" "WELLINGTON" "WELLINGTON" ...
##   ..$ provenance: chr [1:580] "brain.filler.minid" "brain.filler.minid" "brain.filler.minid" "brain.filler.minid" ...
##   ..$ score1    : num [1:580] -23.1 -23.1 -25.9 -25.9 -25.9 ...
##   ..$ score2    : num [1:580] 11.7 12.8 12 12 12 ...
##   ..$ score3    : num [1:580] 3.17e-05 1.69e-05 3.33e-05 3.33e-05 3.33e-05 3.33e-05 3.33e-05 3.33e-05 3.33e-05 3.33e-05 ...
##   ..$ score4    : num [1:580] NA NA NA NA NA NA NA NA NA NA ...
##   ..$ score5    : num [1:580] NA NA NA NA NA NA NA NA NA NA ...
##   ..$ score6    : num [1:580] NA NA NA NA NA NA NA NA NA NA ...
##   ..$ tf        : chr [1:580] "HNF1A" "HNF1A" "LHX4" "LHX6" ...

The FootprintFilter makes use of another class in the package, the FootprintFinder. This class provides an interface to PostgreSQL and SQLite databases of a specific 2-table schema and has a family of methods for accessing footprint data. To demonstrate some of this functionality, we will use the FootprintFinder to extract footprints from a particular region. In the previous example, we used an SQLite database included in the package, essentially a small snapshot of a larger dataset. More usefully, there are now publicly available footprint databases at bddsrds.globusgenomics.org. Currently, the sole project database covers brain tissue using the HINT footprinting method; this database will remain available in perpetuity and will be joined by databases built from other methods, tissues, and alignments. We can access this information as follows:

genome.db.uri    <- "postgres://bddsrds.globusgenomics.org/hg38"  # has gtf and motifsgenes tables
footprint.db.uri <- "postgres://bddsrds.globusgenomics.org/brain_hint"  # has hits and regions tables
fpf <- FootprintFinder(genome.db.uri, footprint.db.uri, quiet=FALSE)
## [1] postgres://bddsrds.globusgenomics.org/hg38/gtf: 2568100 rows
## [1] postgres://bddsrds.globusgenomics.org/hg38/motifsgenes: 9289 rows
## [1] postgres://bddsrds.globusgenomics.org/brain_hint/regions: 22772585 rows
tbl.fp <- getFootprintsInRegion(fpf, "chr5", 88822685, 89011824)
str(tbl.fp)
## 'data.frame':    4105 obs. of  17 variables:
##  $ loc       : chr  "chr5:88822748-88822759" "chr5:88822748-88822759" "chr5:88823222-88823232" "chr5:88823232-88823246" ...
##  $ chrom     : chr  "chr5" "chr5" "chr5" "chr5" ...
##  $ start     : int  88822748 88822748 88823222 88823232 88823233 88823234 88823235 88823238 88823245 88823248 ...
##  $ endpos    : int  88822759 88822759 88823232 88823246 88823246 88823245 88823248 88823251 88823259 88823259 ...
##  $ type      : chr  "motif.in.footprint" "motif.in.footprint" "motif.in.footprint" "motif.in.footprint" ...
##  $ name      : chr  "MA0695.1" "MA0694.1" "MA0472.2" "MA0768.1" ...
##  $ length    : int  12 12 11 15 14 12 14 14 15 12 ...
##  $ strand    : chr  "-" "-" "-" "-" ...
##  $ sample_id : chr  "ENCSR118WIQ_hint" "ENCSR118WIQ_hint" "ENCSR706IDL_hint" "ENCSR187PYY_hint" ...
##  $ method    : chr  "HINT" "HINT" "HINT" "HINT" ...
##  $ provenance: chr  "brain.filler.minid" "brain.filler.minid" "brain.filler.minid" "brain.filler.minid" ...
##  $ score1    : num  5 5 110 5 5 5 5 5 5 5 ...
##  $ score2    : num  11.56 12.05 10.72 8.79 11.16 ...
##  $ score3    : num  5.16e-05 3.34e-05 3.95e-05 6.52e-05 6.52e-05 3.16e-05 4.63e-06 7.49e-05 7.31e-05 7.87e-06 ...
##  $ score4    : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ score5    : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ score6    : num  NA NA NA NA NA NA NA NA NA NA ...

Solving a TReNA Object

Each of the solvers has its own strengths and weaknesses that we will not delve into here, but they are generally detailed within their own packages. To begin with, we will look at the LASSO solver, which is the default solver for a TReNA object.

Once a user reaches the point of actually solving the TReNA object, a solver has already been specified, but the user must still specify the target gene response and transcription factor predictors of interest. For our purposes, we will obtain our list of transcription factors using a filter on variance, with “MEF2C” as our target gene.

trena <- TReNA(mtx.assay = mtx.voom, solver = "lasso")
target.gene <- "MEF2C"
gene.filter <- VarianceFilter(mtx.assay = mtx.voom)
tf.list <- getCandidates(gene.filter, extraArgs = list("target.gene" = target.gene, "var.size" = 0.5))

tbl.out <- solve(trena, target.gene, tf.list$tfs)
head(tbl.out)
##             beta intercept   gene.cor
## STAT4  0.2790472  5.707991  0.9399046
## SATB2  0.2401920  5.707991  0.9053878
## HLF    0.1492047  5.707991  0.9024599
## FOXO4 -0.1189245  5.707991 -0.7026636
## LHX6   0.1013413  5.707991  0.7905433
## TSHZ2  0.0933637  5.707991  0.8140928

Each solver will necessarily give a different result, necessitating some amount of thoughtfulness on the part of the user. For instance, if the goal is to return a sparse list of only the most influential genes, then LASSO may be the best choice. By contrast, the Random Forest solver will return scores for all genes, and will thus infer more relationships.
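
To see the contrast concretely, one could re-solve the same problem with the Random Forest solver and compare the resulting tables; a brief sketch, reusing the target gene and candidate list from above:

# Sketch: same target gene and candidates, different solver
trena.rf <- TReNA(mtx.assay = mtx.voom, solver = "randomForest")
tbl.rf <- solve(trena.rf, target.gene, tf.list$tfs)
head(tbl.rf)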

Using an Ensemble Solver Approach

In addition to the individual solvers, TReNA includes an ensemble approach as a solver option. The ensemble solver allows the user to specify a vector of multiple TReNA solvers as part of the extraArgs list argument (e.g. extraArgs = list(solver.list = c("lasso","ridge"))). Alternatively, the user can specify solver.list = "default.solvers" to run all solvers but Bayes Spike and square root LASSO (the most sensitive to outliers and the most computationally intensive, respectively). In addition to computing and returning the individual scores of each solver, the ensemble solver generates an "extreme" score (pcaMax) based on all solvers and a "concordance" score (concordance). These new scores differ in that the pcaMax score tends to give more weight to extreme outlying scores from a particular solver and generally scales from 0 to 15, whereas the concordance score attempts to use more of a "voting" system of which solvers consider a feature significant and ranges from 0 to 1. As an example, we can run the ensemble solver on the same problem we just solved using LASSO.

trena <- TReNA(mtx.assay = mtx.voom, solver = "ensemble")
target.gene <- "MEF2C"
gene.filter <- VarianceFilter(mtx.assay = mtx.voom)
tf.list <- getCandidates(gene.filter, extraArgs = list("target.gene" = target.gene, "var.size" = 0.5))
tbl.out <- solve(trena, target.gene, tf.list$tfs)
tbl.out
##     gene  beta.lasso  rf.score pearson.coeff spearman.coeff lasso.p.value
## 5    HLF  0.15188937 59.604965     0.9024599      0.9416976  1.062597e-49
## 9  STAT4  0.27971288 55.731636     0.9399046      0.9225219  2.375128e-55
## 8  SATB2  0.24008473 41.957092     0.9053878      0.8756181  3.765829e-50
## 4  FOXP2  0.06449762 10.550038     0.8478133      0.8365842  1.490099e-30
## 10 TSHZ2  0.09287678  9.236621     0.8140928      0.8193281  3.124494e-38
## 7   LHX6  0.09970340  6.366319     0.7905433      0.7427102  2.941322e-27
## 2  ESRRG  0.00000000 10.932977     0.8089849      0.8183043  6.521426e-02
## 1   DRGX  0.00000000 13.417914     0.8141358      0.8194466  7.692273e-01
## 3  FOXO4 -0.11729071  7.029378    -0.7026636     -0.7731181  2.380726e-31
## 6   IRF7 -0.00912229  1.768709    -0.6829667     -0.7311429  2.782271e-14
##     beta.ridge concordance   pcaMax
## 5   0.10065777   0.5435398 2.982551
## 9   0.08253372   0.5646652 2.980053
## 8   0.12747181   0.5033651 2.911247
## 4   0.12892481   0.5065816 2.451110
## 10  0.09896179   0.4847700 2.281065
## 7   0.10930127   0.4868122 2.197155
## 2   0.09087183   0.5453544 2.088372
## 1   0.07003641   0.5560267 1.996704
## 3  -0.09381616   0.5610777 1.665997
## 6  -0.08249788   0.4934256 1.280257

The ensemble approach is an attractive option for several reasons:

  1. Solver choice can be a non-trivial task for a TReNA user, as it requires some working knowledge of the pros and cons of the different methods. The ensemble method alleviates the need to choose one solver.

  2. Different solvers will give different predictions for important transcription factors; the ensemble method provides a way to create a composite score that accounts for all included solvers and can thus be seen as something of an overall metric.

  3. The relationship between transcription factors and target genes is a young area of active investigation. How this relates to the various solvers is an open question. TReNA was born out of a desire to explore these relationships.

We can also optionally pass solver-specific parameters using named arguments in the extraArgs list argument. For instance, we may want to specify the number of cores to use for the square root LASSO solver and include it in our solver list. The following solve command accomplishes that, setting the number of cores dedicated to square root LASSO to 4:

tbl.out.2 <- solve(trena, target.gene, tf.list$tfs, 
                   extraArgs = list(
                     "solver.list" = c("lasso", "ridge", "sqrtlasso", "randomForest",
                                       "lassopv", "spearman", "pearson"),
                     "sqrtlasso" = list("num.cores" = 4)
                   ))
tbl.out.2
##    gene  beta.lasso  beta.ridge beta.sqrtlasso  rf.score lasso.p.value
## 8 STAT4  0.28135578  0.08234302     0.24145443 56.796727  2.375128e-55
## 7 SATB2  0.23647191  0.12709173     0.20228804 41.903326  3.765829e-50
## 5   HLF  0.15718273  0.10019317     0.07940525 55.427868  1.062597e-49
## 4 FOXP2  0.06153010  0.12828409     0.16001585 12.997131  1.490099e-30
## 9 TSHZ2  0.09178372  0.09894517     0.09074777 10.648696  3.124494e-38
## 6  LHX6  0.09421302  0.10884922     0.12978793  7.808024  2.941322e-27
## 2  DRGX  0.00000000  0.07005068     0.00000000 13.623176  7.692273e-01
## 1  CUX2  0.00000000  0.06999594     0.09228099  6.180490  1.686995e-10
## 3 FOXO4 -0.11236100 -0.09345662    -0.14581599  6.386056  2.380726e-31
##   spearman.coeff pearson.coeff concordance   pcaMax
## 8      0.9225219     0.9399046   0.5361336 2.875762
## 7      0.8756181     0.9053878   0.4613880 2.748987
## 5      0.9416976     0.9024599   0.5018390 2.688886
## 4      0.8365842     0.8478133   0.4600733 2.295562
## 9      0.8193281     0.8140928   0.4507073 2.098685
## 6      0.7427102     0.7905433   0.4390257 2.039758
## 2      0.8194466     0.8141358   0.5374177 1.840798
## 1      0.6965599     0.7819094   0.4878311 1.740354
## 3     -0.7731181    -0.7026636   0.5424244 1.724863