Topology-based pathway analysis of RNA-seq data

Ivana Ihnatova1* and Ludwig Geistlinger2**

1Institute of Biostatistics and Analyses, Masarykova University Brno
2School of Public Health, City University of New York

*ihnatova@iba.muni.cz
**ludwig.geistlinger@sph.cuny.edu

27 April 2020

Abstract

The ToPASeq package implements methods for topology-based pathway analysis of RNA-seq data. This includes Topological Analysis of Pathway Phenotype Association (TAPPA; Gao and Wang, 2007), PathWay Enrichment Analysis (PWEA; Hung et al., 2010), and the Pathway Regulation Score (PRS; Ibrahim et al., 2012).

Package

ToPASeq 1.22.0

1 Setup
2 Preparing the data
3 Differential expression
4 Pathway analysis
- 4.1 Pathway Regulation Score (PRS)

1 Setup

Note: the ToPASeq package currently undergoes a major rework due to the change of the package maintainer. It is recommended to use the topology-based methods implemented in the EnrichmentBrowser or the graphite package instead.

We start by loading the package.

library(ToPASeq)

2 Preparing the data

For RNA-seq data, we consider transcriptome profiles of four primary human airway smooth muscle cell lines in two conditions: control and treatment with dexamethasone Himes et al., 2014.

We load the airway dataset

library(airway)
data(airway)

For further analysis, we only keep genes that are annotated to an ENSEMBL gene ID.

airSE <- airway[grep("^ENSG", rownames(airway)),]
dim(airSE)

## [1] 63677     8

assay(airSE)[1:4,1:4]

##                 SRR1039508 SRR1039509 SRR1039512 SRR1039513
## ENSG00000000003        679        448        873        408
## ENSG00000000005          0          0          0          0
## ENSG00000000419        467        515        621        365
## ENSG00000000457        260        211        263        164

3 Differential expression

The EnrichmentBrowser package incorporates established functionality from the limma package for differential expression analysis. This involves the voom transformation when applied to RNA-seq data. Alternatively, differential expression analysis for RNA-seq data can also be carried out based on the negative binomial distribution with edgeR and DESeq2.

This can be performed using the function EnrichmentBrowser::deAna and assumes some standardized variable names:

GROUP defines the sample groups being contrasted,
BLOCK defines paired samples or sample blocks, as e.g. for batch effects.

For more information on experimental design, see the limma user’s guide, chapter 9.

For the airway dataset, the GROUP variable indicates whether the cell lines have been treated with dexamethasone (1) or not (0).

airSE$GROUP <- ifelse(airway$dex == "trt", 1, 0)
table(airSE$GROUP)

## 
## 0 1 
## 4 4

Paired samples, or in general sample batches/blocks, can be defined via a BLOCK column in the colData slot. For the airway dataset, the sample blocks correspond to the four different cell lines.

airSE$BLOCK <- airway$cell
table(airSE$BLOCK)

## 
## N052611 N061011 N080611  N61311 
##       2       2       2       2

For RNA-seq data, the deAna function can be used to carry out differential expression analysis between the two groups either based on functionality from limma (that includes the voom transformation), or alternatively, the frequently used edgeR or DESeq2 package. Here, we use the analysis based on edgeR.

library(EnrichmentBrowser)
airSE <- deAna(airSE, de.method="edgeR")

## Excluding 47751 genes not satisfying min.cpm threshold

rowData(airSE, use.names=TRUE)

## DataFrame with 15926 rows and 4 columns
##                         FC edgeR.STAT        PVAL   ADJ.PVAL
##                  <numeric>  <numeric>   <numeric>  <numeric>
## ENSG00000000003 -0.3901002 31.0558140 0.000232422 0.00217355
## ENSG00000000419  0.1978022  6.6454709 0.027419893 0.07560513
## ENSG00000000457  0.0291609  0.0929623 0.766666551 0.84808859
## ENSG00000000460 -0.1243820  0.3832263 0.549659194 0.67996523
## ENSG00000000971  0.4172901 28.7686093 0.000312276 0.00272063
## ...                    ...        ...         ...        ...
## ENSG00000273373 -0.0438722  0.0397087   0.8460260   0.901607
## ENSG00000273382 -0.8597567  7.7869742   0.0190219   0.057267
## ENSG00000273448  0.0281667  0.0103270   0.9752405   0.984888
## ENSG00000273472 -0.4642705  1.9010366   0.1978818   0.328963
## ENSG00000273486 -0.1109445  0.1536285   0.7032766   0.802377

4 Pathway analysis

Pathways are typically represented as graphs, where the nodes are genes and edges between the nodes represent interaction between genes.

The graphite package provides pathway collections from major pathway databases such as KEGG, Biocarta, Reactome, and NCI.

Here, we retrieve human KEGG pathways.

library(graphite)
pwys <- pathways(species="hsapiens", database="kegg")
pwys

## KEGG pathways for hsapiens
## 307 entries, retrieved on 22-04-2020

As the airway dataset uses ENSEMBL gene IDs, but the nodes of the pathways are based on NCBI Entrez Gene IDs,

nodes(pwys[[1]])

##  [1] "ENTREZID:10327"  "ENTREZID:124"    "ENTREZID:125"    "ENTREZID:126"   
##  [5] "ENTREZID:127"    "ENTREZID:128"    "ENTREZID:130"    "ENTREZID:130589"
##  [9] "ENTREZID:131"    "ENTREZID:160287" "ENTREZID:1737"   "ENTREZID:1738"  
## [13] "ENTREZID:2023"   "ENTREZID:2026"   "ENTREZID:2027"   "ENTREZID:217"   
## [17] "ENTREZID:218"    "ENTREZID:219"    "ENTREZID:220"    "ENTREZID:2203"  
## [21] "ENTREZID:221"    "ENTREZID:222"    "ENTREZID:223"    "ENTREZID:224"   
## [25] "ENTREZID:226"    "ENTREZID:229"    "ENTREZID:230"    "ENTREZID:2538"  
## [29] "ENTREZID:2597"   "ENTREZID:26330"  "ENTREZID:2645"   "ENTREZID:2821"  
## [33] "ENTREZID:3098"   "ENTREZID:3099"   "ENTREZID:3101"   "ENTREZID:387712"
## [37] "ENTREZID:3939"   "ENTREZID:3945"   "ENTREZID:3948"   "ENTREZID:441531"
## [41] "ENTREZID:501"    "ENTREZID:5105"   "ENTREZID:5106"   "ENTREZID:5160"  
## [45] "ENTREZID:5161"   "ENTREZID:5162"   "ENTREZID:5211"   "ENTREZID:5213"  
## [49] "ENTREZID:5214"   "ENTREZID:5223"   "ENTREZID:5224"   "ENTREZID:5230"  
## [53] "ENTREZID:5232"   "ENTREZID:5236"   "ENTREZID:5313"   "ENTREZID:5315"  
## [57] "ENTREZID:55276"  "ENTREZID:55902"  "ENTREZID:57818"  "ENTREZID:669"   
## [61] "ENTREZID:7167"   "ENTREZID:80201"  "ENTREZID:83440"  "ENTREZID:84532" 
## [65] "ENTREZID:8789"   "ENTREZID:92483"  "ENTREZID:92579"  "ENTREZID:9562"

we first map the gene IDs in the airway dataset from ENSEMBL to ENTREZ IDs.

airSE <- idMap(airSE, org="hsa", from="ENSEMBL", to="ENTREZID")

##

## Encountered 93 from.IDs with >1 corresponding to.ID

## (the first to.ID was chosen for each of them)

## Excluded 2387 from.IDs without a corresponding to.ID

## Encountered 7 to.IDs with >1 from.ID (the first from.ID was chosen for each of them)

## Mapped from.IDs have been added to the rowData column ENSEMBL

Next, we define all genes with adjusted p-value below 0.01 as differentially expressed, and collect their log2 fold change for further analysis.

all <- names(airSE)
de.ind <- rowData(airSE)$ADJ.PVAL < 0.01
de <- rowData(airSE)$FC[de.ind]
names(de) <- all[de.ind]

This results in 2,426 DE genes - out of 11,780 genes in total.

length(all)

## [1] 13532

length(de)

## [1] 2657

4.1 Pathway Regulation Score (PRS)

The Pathway Regulation Score (PRS) incorporates the pathway topology by weighting the indiviudal gene-level log2 fold changes by the number of downstream DE genes. The weighted absolute fold changes are summed across the pathway and statistical significance is assessed by permutation of genes. Ibrahim et al., 2012

res <- prs(de, all, pwys[1:100], nperm=100)

## 3697 node labels mapped to the expression data
## Average coverage 47.4046 %
## 0 (out of 100) pathways without a mapped node

## 9 pwys were filtered out

head(res)

##                                               nPRS    p.value
## Cysteine and methionine metabolism       10.858556 0.00990099
## Tyrosine metabolism                       9.243633 0.00990099
## Glycine, serine and threonine metabolism  7.065008 0.00990099
## Arachidonic acid metabolism               6.989077 0.00990099
## Retinol metabolism                        6.406457 0.00990099
## Glutathione metabolism                    6.112306 0.00990099

Corresponding gene weights (number of downstream DE genes) can be obtained for a pathway of choice via

ind <- grep("ErbB signaling pathway", names(pwys))
weights <- prsWeights(pwys[[ind]], de, all)

## 71 node labels mapped to the expression data
## Average coverage 80.68182 %
## 0 (out of 1) pathways without a mapped node

weights

##    572   5601   5291   2065   6416    815   8440   5609   5063    369   2932 
##      0      0      0      0      0      1      0      0      1      0      0 
##     25   1399   5594   6655   5595    208   5599   6198   5602   2549    867 
##      0      0      0      4      0      0      0      0      3      5      0 
##   1027   1839    868   6654  10000   8503   5290   5335   1026   6776   2002 
##      0      0      0      0      0      1      1      0      1      1      1 
##   5605  25759  10298   5894   3845   4609   2064    207     27    817   5295 
##      0      0      0      2      0      1      0      0      1      0      1 
##   1956  53358    818   5058   5578   3084    673   4690   6464   1398   5604 
##      0      1      0      0      0      1      0      0      0      0      1 
##   5747 145957   5293   6777   3265    685   6199   3725   2885   5062 399694 
##      0      0      1      1      3      1      0      1      0      0      1 
##   1978   6714   5336   2475   4893 
##      0      0      2      1      0

Inspecting the genes with maximum number of downstream DE genes

weights[weights == max(weights)]

## 2549 
##    5

reveals important upstream regulators including several G protein subunits such as subunit beta 2 (Entrez Gene ID 2783).