1 Introduction
2 Managing bed file content with mongodb
- 2.1 Querying the txregnet database
  - 2.1.1 The connection
  - 2.1.2 Queries and aggregation
3 An integrative container
- 3.1 Sample metadata
- 3.2 Extended RaggedExperiment
4 Visualizing coincidence
5 Higher-level work with sbov

## Warning: Package 'TxRegInfra' is deprecated and will be removed from
##   Bioconductor version 3.13

1 Introduction

TxRegQuery addresses exploration of transcriptional regulatory networks by integrating data on eQTL, digital genomic footprinting (DGF), DnaseI hypersensitivity binding data (DHS), and transcription factor binding site (TFBS) data. Owing to the volume of emerging tissue-specific data, special data modalities are used.

2 Managing bed file content with mongodb

2.1 Querying the `txregnet` database

We have a long-running server that will respond to queries. We focus on mongolite as the interface.

2.1.1 The connection

suppressPackageStartupMessages({
library(TxRegInfra)
library(mongolite)
library(Gviz)
library(EnsDb.Hsapiens.v75)
library(BiocParallel)
register(SerialParam())
})
con1 = mongo(url=URL_txregInAWS(), db="txregnet")
con1

## <Mongo collection> 'test' 
##  $aggregate(pipeline = "{}", options = "{\"allowDiskUse\":true}", handler = NULL, pagesize = 1000, iterate = FALSE) 
##  $count(query = "{}") 
##  $disconnect(gc = TRUE) 
##  $distinct(key, query = "{}") 
##  $drop() 
##  $export(con = stdout(), bson = FALSE, query = "{}", fields = "{}", sort = "{\"_id\":1}") 
##  $find(query = "{}", fields = "{\"_id\":0}", sort = "{}", skip = 0, limit = 0, handler = NULL, pagesize = 1000) 
##  $import(con, bson = FALSE) 
##  $index(add = NULL, remove = NULL) 
##  $info() 
##  $insert(data, pagesize = 1000, stop_on_error = TRUE, ...) 
##  $iterate(query = "{}", fields = "{\"_id\":0}", sort = "{}", skip = 0, limit = 0) 
##  $mapreduce(map, reduce, query = "{}", sort = "{}", limit = 0, out = NULL, scope = NULL) 
##  $remove(query, just_one = FALSE) 
##  $rename(name, db = NULL) 
##  $replace(query, update = "{}", upsert = FALSE) 
##  $run(command = "{\"ping\": 1}", simplify = TRUE) 
##  $update(query, update = "{\"$set\":{}}", filters = NULL, upsert = FALSE, multiple = FALSE)

We will write methods that work with the ‘fields’ of this object.

There is not much explicit reflectance in the mongolite API. The following is improvised and may be fragile:

parent.env(con1)$orig

## $name
## [1] "test"
## 
## $db
## [1] "txregnet"
## 
## $url
## [1] "mongodb+srv://user:user123@txregnet-kui9i.mongodb.net/txregnet"
## 
## $options
## List of 6
##  $ pem_file              : NULL
##  $ ca_file               : NULL
##  $ ca_dir                : NULL
##  $ crl_file              : NULL
##  $ allow_invalid_hostname: logi FALSE
##  $ weak_cert_validation  : logi FALSE

2.1.2 Queries and aggregation

If the mongo utility is available as a system command, we can get a list of collections in the database as follows.

if (verifyHasMongoCmd()) {
  head(c1 <- listAllCollections(url=URL_txregInAWS(), db="txregnet"))
  }

## Error in system2(cmd, args = "--help", stdout = TRUE, stderr = TRUE) : 
##   error in running command

## install mongodb on your system to use this function

Otherwise, as long as mongolite is installed, as long as we know the collection names of interest, we can use them as noted throughout this vignette.

We can get a record from a given collection:

mongo(url=URL_txregInAWS(), db="txregnet", 
   collection="Adipose_Subcutaneous_allpairs_v7_eQTL")$find(limit=1)

##             gene_id       variant_id tss_distance ma_samples ma_count       maf
## 1 ENSG00000238009.2 1_807994_C_T_b37       678771         17       18 0.0233766
##   pval_nominal     slope slope_se    qvalue chr snp_pos A1 A2 build
## 1   0.00126668 -0.759564 0.233531 0.0742794   1  807994  C  T   b37

Queries can be composed using JSON. We have a tool to generate queries that employ the mongodb aggregation method. Here we demonstrate this by computing, for each chromosome, the count and minimum values of the footprint statistic on CD14 cells.

m1 = mongo(url = URL_txregInAWS(), db = "txregnet",  collection="CD14_DS17215_hg19_FP")
newagg = makeAggregator( by="chr", vbl="stat", op="$min", opname="min")

The JSON layout of this aggregating query is

[
  {
    "$group": {
      "_id": ["$chr"],
      "count": {
        "$sum": [1]
      },
      "min": {
        "$min": ["$stat"]
      }
    }
  }
]

Invocation returns a data frame:

head(m1$aggregate(newagg))

##     _id count        min
## 1  chrY   827 0.01907390
## 2 chr18 15868 0.06107950
## 3 chr10 40267 0.00601357
## 4  chr4 32947 0.02776440
## 5  chr6 54728 0.00565057
## 6 chr17 47987 0.01242310

3 An integrative container

We need to bind the metadata and information about the mongodb.

3.1 Sample metadata

The following turns a very ad hoc filtering of the collection names into a DataFrame.

# cd = makeColData() # works when mongo does
cd = TxRegInfra::basicColData
head(cd,2)

## DataFrame with 2 rows and 3 columns
##                                                  base        type
##                                           <character> <character>
## Adipose_Subcutaneous_allpairs_v7_eQTL         Adipose        eQTL
## Adipose_Visceral_Omentum_allpairs_v7_eQTL     Adipose        eQTL
##                                                              mid
##                                                      <character>
## Adipose_Subcutaneous_allpairs_v7_eQTL     Subcutaneous_allpair..
## Adipose_Visceral_Omentum_allpairs_v7_eQTL Visceral_Omentum_all..

3.2 Extended RaggedExperiment

rme0 = RaggedMongoExpt(con1, colData=cd)
rme1 = rme0[, which(cd$type=="FP")]

A key method in development is subsetting the archive by genomic coordinates.

si = GenomeInfoDb::Seqinfo(genome="hg19")["chr17"] # to fix query genome
myg = GRanges("chr17", IRanges(38.07e6,38.09e6), seqinfo=si)
s1 = sbov(rme1, myg, simplify=FALSE)

## ..........................................

s1

## class: RaggedExperiment 
## dim: 1676 42 
## assays(3): chr id stat
## rownames: NULL
## colnames(42): CD14_DS17215_hg19_FP CD19_DS17186_hg19_FP ...
##   iPS_19_11_DS15153_hg19_FP vHMEC_DS18406_hg19_FP
## colData names(6): base type ... type mid

dim(sa <- sparseAssay(s1, 3))  # compact gives segfault

## [1] 1676   42

sa[953:956,c("fLung_DS14724_hg19_FP", "fMuscle_arm_DS17765_hg19_FP")]

##                         fLung_DS14724_hg19_FP fMuscle_arm_DS17765_hg19_FP
## chr17:38084160-38084169              0.533333                          NA
## chr17:38084924-38084952              0.890476                          NA
## chr17:38080857-38080891                    NA                     0.54902
## chr17:38081914-38081926                    NA                     0.50000

4 Visualizing coincidence

ormm = txmodels("ORMDL3", plot=FALSE, name="ORMDL3")
sar = strsplit(rownames(sa), ":|-")
an = as.numeric
gr = GRanges(seqnames(ormm)[1], IRanges(an(sapply(sar,"[", 2)), an(sapply(sar,"[", 3))))
gr1 = gr
gr1$score = 1-sa[,1]
gr2 = gr
gr2$score = 1-sa[,2]
sc1 = DataTrack(gr1, name="Lung FP")
sc2 = DataTrack(gr2, name="Musc/Arm FP")
plotTracks(list(GenomeAxisTrack(), sc1, sc2, ormm), showId=TRUE)

5 Higher-level work with `sbov`

5.1 Building annotated GRanges for a selected target interval

We begin with three ‘single-concept’ assays with relevance to lung genomics. The v7 GTEx lung eQTL data, an encode DnaseI narrowPeak report on lung fibroblasts, and a digital genomic footprint report for fetal lung.

lname_eqtl = "Lung_allpairs_v7_eQTL"
lname_dhs = "ENCFF001SSA_hg19_HS" # see dnmeta, fibroblast of lung
lname_fp = "fLung_DS14724_hg19_FP"
si17 = GenomeInfoDb::Seqinfo(genome="hg19")["chr17"]
si17n = si17
GenomeInfoDb::seqlevelsStyle(si17n) = "NCBI"
s1 = sbov(rme0[,lname_eqtl], GRanges("17", IRanges(38.06e6, 38.15e6),
    seqinfo=si17n))

## .

s2 = sbov(rme0[,lname_dhs], GRanges("chr17", IRanges(38.06e6, 38.15e6),
   seqinfo=si17))

## .

s3 = sbov(rme0[,lname_fp], GRanges("chr17", IRanges(38.06e6, 38.15e6),
   seqinfo=si17))

## .

Now we have annotated GRanges for each assay. The eQTL data in part are:

names(mcols(s1))

##  [1] "gene_id"      "variant_id"   "tss_distance" "ma_samples"   "ma_count"    
##  [6] "maf"          "pval_nominal" "slope"        "slope_se"     "qvalue"      
## [11] "chr"          "snp_pos"      "A1"           "A2"           "build"       
## [16] "origin"

head(s1[, c("gene_id", "variant_id", "maf", "pval_nominal")])

## GRanges object with 6 ranges and 4 metadata columns:
##       seqnames    ranges strand |            gene_id          variant_id
##          <Rle> <IRanges>  <Rle> |        <character>         <character>
##   [1]       17  38061054      * |  ENSG00000266469.1 17_38061054_G_A_b37
##   [2]       17  38061439      * |  ENSG00000161395.8 17_38061439_T_C_b37
##   [3]       17  38061439      * | ENSG00000073605.14 17_38061439_T_C_b37
##   [4]       17  38061439      * |  ENSG00000172057.5 17_38061439_T_C_b37
##   [5]       17  38061439      * |  ENSG00000167914.6 17_38061439_T_C_b37
##   [6]       17  38062196      * |  ENSG00000161395.8 17_38062196_G_A_b37
##             maf pval_nominal
##       <numeric>    <numeric>
##   [1] 0.0195822  7.72192e-04
##   [2] 0.4203660  3.99212e-04
##   [3] 0.4203660  6.87714e-10
##   [4] 0.4203660  1.08337e-10
##   [5] 0.4203660  2.15704e-10
##   [6] 0.4188480  3.09568e-04
##   -------
##   seqinfo: 1 sequence from GRCh37.p13 genome

The names of genes and variants used here are cumbersome – symbols and rsids are preferable.

addsyms = function(x, EnsDb=EnsDb.Hsapiens.v75::EnsDb.Hsapiens.v75) {
  ensids = gsub("\\..*", "", x$gene_id) # remove post period
  gns = genes(EnsDb)
  x$symbol = gns[ensids]$symbol
  x
}
s1 = addsyms(s1)

Note that it is possible to retrieve rsids for the SNPs by address. But this is a slow operation involving a huge SNPlocs package that we do not want to work with directly for this vignette.

> snpsByOverlaps(SNPlocs.Hsapiens.dbSNP144.GRCh37, s1b)
UnstitchedGPos object with 265 positions and 2 metadata columns:
        seqnames       pos strand |   RefSNP_id alleles_as_ambig
           <Rle> <integer>  <Rle> | <character>      <character>
    [1]       17  38061054      * |  rs36049276                R
    [2]       17  38061439      * |   rs4795399                Y
    [3]       17  38062196      * |   rs2305480                R
    [4]       17  38062217      * |   rs2305479                Y
    [5]       17  38062503      * |  rs35104165                Y
    ...      ...       ...    ... .         ...              ...
  [261]       17  38149258      * |  rs58212353                K
  [262]       17  38149350      * |   rs8073254                V
  [263]       17  38149411      * |  rs34648856                R
  [264]       17  38149724      * |   rs3785549                Y
  [265]       17  38149727      * |   rs3785550                H
  -------
  seqinfo: 25 sequences (1 circular) from GRCh37.p13 genome

5.2 A bipartite graph for eQTL-gene relationships

The object s1 computed above is available as demo_eQTL_granges. We convert it to a graph via

library(graph)

## 
## Attaching package: 'graph'

## The following object is masked from 'package:Biostrings':
## 
##     complement

g1 = sbov_to_graphNEL(demo_eQTL_granges)
g1

## A graphNEL graph with directed edges
## Number of Nodes = 312 
## Number of Edges = 693

Nodes are SNPs and genes, edges are present when the resource (in this case the GTEx lung study) declares an association (in this case, an FDR for SNP-gene association not exceeding 0.10.) The graph library includes functions for creation of incidence matrices from graphs, and vice versa.

5.3 Connecting eQTL-SNPs via DHS and DGF

Given the GRanges representations for sbov results, we can use overlap computations to conveniently identify relationships between eQTL SNPs, genes, and hypersensitivity or footprint regions.

We use sbov_output_HS as a persistent instance of s2 computed above.

seqlevelsStyle(demo_eQTL_granges) = "UCSC"
fo1 = findOverlaps(demo_eQTL_granges, sbov_output_HS)

## Warning in .Seqinfo.mergexy(x, y): The 2 combined objects have no sequence levels in common. (Use
##   suppressWarnings() to suppress this warning.)

fo1

## Hits object with 0 hits and 0 metadata columns:
##    queryHits subjectHits
##    <integer>   <integer>
##   -------
##   queryLength: 693 / subjectLength: 11

eq_by_hs = split(demo_eQTL_granges[queryHits(fo1)],
   subjectHits(fo1))
eq_by_hs

## GRangesList object of length 0:
## <0 elements>

This shows that there are two DHS sites that overlap with SNPs showing eQTL associations with various genes.

For the footprint data, we have:

fo2 = findOverlaps(demo_eQTL_granges, sbov_output_FP)

## Warning in .Seqinfo.mergexy(x, y): The 2 combined objects have no sequence levels in common. (Use
##   suppressWarnings() to suppress this warning.)

fo2

## Hits object with 0 hits and 0 metadata columns:
##    queryHits subjectHits
##    <integer>   <integer>
##   -------
##   queryLength: 693 / subjectLength: 107

eq_by_fp = split(demo_eQTL_granges[queryHits(fo2)],
   subjectHits(fo2))
eq_by_fp

## GRangesList object of length 0:
## <0 elements>

5.4 Relationships to FIMO-based TFBS

We have a small number of cloud-resident FIMO search results through the TFutils package.

library(TFutils)
data(demo_fimo_granges)
seqlevelsStyle(demo_eQTL_granges) = "UCSC"
lapply(demo_fimo_granges, lapply, function(x) 
   subsetByOverlaps(demo_eQTL_granges, x))

## Warning in .Seqinfo.mergexy(x, y): The 2 combined objects have no sequence levels in common. (Use
##   suppressWarnings() to suppress this warning.)

## Warning in .Seqinfo.mergexy(x, y): The 2 combined objects have no sequence levels in common. (Use
##   suppressWarnings() to suppress this warning.)

## $VDR
## $VDR$`chr17:38070000-38090000`
## GRanges object with 0 ranges and 17 metadata columns:
##    seqnames    ranges strand |  gene_id  variant_id tss_distance ma_samples
##       <Rle> <IRanges>  <Rle> | <factor> <character>    <integer>  <integer>
##     ma_count       maf pval_nominal     slope  slope_se    qvalue       chr
##    <integer> <numeric>    <numeric> <numeric> <numeric> <numeric> <integer>
##      snp_pos       A1       A2    build      origin      symbol
##    <integer> <factor> <factor> <factor> <character> <character>
##   -------
##   seqinfo: 1 sequence from hg19 genome
## 
## 
## $POU2F1
## $POU2F1$`chr17:38070000-38090000`
## GRanges object with 0 ranges and 17 metadata columns:
##    seqnames    ranges strand |  gene_id  variant_id tss_distance ma_samples
##       <Rle> <IRanges>  <Rle> | <factor> <character>    <integer>  <integer>
##     ma_count       maf pval_nominal     slope  slope_se    qvalue       chr
##    <integer> <numeric>    <numeric> <numeric> <numeric> <numeric> <integer>
##      snp_pos       A1       A2    build      origin      symbol
##    <integer> <factor> <factor> <factor> <character> <character>
##   -------
##   seqinfo: 1 sequence from hg19 genome

TxRegInfra: support for TxRegQuery

October 28, 2020

Contents

1 Introduction

2 Managing bed file content with mongodb

2.1 Querying the `txregnet` database

2.1.1 The connection

2.1.2 Queries and aggregation

3 An integrative container

3.1 Sample metadata

3.2 Extended RaggedExperiment

4 Visualizing coincidence

5 Higher-level work with `sbov`

5.1 Building annotated GRanges for a selected target interval

5.2 A bipartite graph for eQTL-gene relationships

5.3 Connecting eQTL-SNPs via DHS and DGF

5.4 Relationships to FIMO-based TFBS

TxRegInfra: support for TxRegQuery

October 28, 2020

Contents

1 Introduction

2 Managing bed file content with mongodb

2.1 Querying the txregnet database

2.1.1 The connection

2.1.2 Queries and aggregation

3 An integrative container

3.1 Sample metadata

3.2 Extended RaggedExperiment

4 Visualizing coincidence

5 Higher-level work with sbov

5.1 Building annotated GRanges for a selected target interval

5.2 A bipartite graph for eQTL-gene relationships

5.3 Connecting eQTL-SNPs via DHS and DGF

5.4 Relationships to FIMO-based TFBS

2.1 Querying the `txregnet` database

5 Higher-level work with `sbov`