ToPASeq 1.22.0
Note: the ToPASeq package currently undergoes a major rework due to the change of the package maintainer. It is recommended to use the topology-based methods implemented in the EnrichmentBrowser or the graphite package instead.
We start by loading the package.
library(ToPASeq)
For RNA-seq data, we consider transcriptome profiles of four primary human airway smooth muscle cell lines in two conditions: control and treatment with dexamethasone Himes et al., 2014.
We load the airway dataset
library(airway)
data(airway)
For further analysis, we only keep genes that are annotated to an ENSEMBL gene ID.
airSE <- airway[grep("^ENSG", rownames(airway)),]
dim(airSE)
## [1] 63677 8
assay(airSE)[1:4,1:4]
## SRR1039508 SRR1039509 SRR1039512 SRR1039513
## ENSG00000000003 679 448 873 408
## ENSG00000000005 0 0 0 0
## ENSG00000000419 467 515 621 365
## ENSG00000000457 260 211 263 164
The EnrichmentBrowser package incorporates established
functionality from the limma package for differential expression
analysis.
This involves the voom
transformation when applied to RNA-seq data.
Alternatively, differential expression analysis for RNA-seq data can also be
carried out based on the negative binomial distribution with edgeR
and DESeq2.
This can be performed using the function EnrichmentBrowser::deAna
and assumes some standardized variable names:
For more information on experimental design, see the limma user’s guide, chapter 9.
For the airway dataset, the GROUP variable indicates whether the cell lines have been treated with dexamethasone (1) or not (0).
airSE$GROUP <- ifelse(airway$dex == "trt", 1, 0)
table(airSE$GROUP)
##
## 0 1
## 4 4
Paired samples, or in general sample batches/blocks, can be defined via a
BLOCK column in the colData
slot.
For the airway dataset, the sample blocks correspond to the four different cell
lines.
airSE$BLOCK <- airway$cell
table(airSE$BLOCK)
##
## N052611 N061011 N080611 N61311
## 2 2 2 2
For RNA-seq data, the deAna
function can be used to carry out differential
expression analysis between the two groups either based on functionality from
limma (that includes the voom
transformation), or
alternatively, the frequently used edgeR or DESeq2
package. Here, we use the analysis based on edgeR.
library(EnrichmentBrowser)
airSE <- deAna(airSE, de.method="edgeR")
## Excluding 47751 genes not satisfying min.cpm threshold
rowData(airSE, use.names=TRUE)
## DataFrame with 15926 rows and 4 columns
## FC edgeR.STAT PVAL ADJ.PVAL
## <numeric> <numeric> <numeric> <numeric>
## ENSG00000000003 -0.3901002 31.0558140 0.000232422 0.00217355
## ENSG00000000419 0.1978022 6.6454709 0.027419893 0.07560513
## ENSG00000000457 0.0291609 0.0929623 0.766666551 0.84808859
## ENSG00000000460 -0.1243820 0.3832263 0.549659194 0.67996523
## ENSG00000000971 0.4172901 28.7686093 0.000312276 0.00272063
## ... ... ... ... ...
## ENSG00000273373 -0.0438722 0.0397087 0.8460260 0.901607
## ENSG00000273382 -0.8597567 7.7869742 0.0190219 0.057267
## ENSG00000273448 0.0281667 0.0103270 0.9752405 0.984888
## ENSG00000273472 -0.4642705 1.9010366 0.1978818 0.328963
## ENSG00000273486 -0.1109445 0.1536285 0.7032766 0.802377
Pathways are typically represented as graphs, where the nodes are genes and edges between the nodes represent interaction between genes.
The graphite package provides pathway collections from major pathway databases such as KEGG, Biocarta, Reactome, and NCI.
Here, we retrieve human KEGG pathways.
library(graphite)
pwys <- pathways(species="hsapiens", database="kegg")
pwys
## KEGG pathways for hsapiens
## 307 entries, retrieved on 22-04-2020
As the airway dataset uses ENSEMBL gene IDs, but the nodes of the pathways are based on NCBI Entrez Gene IDs,
nodes(pwys[[1]])
## [1] "ENTREZID:10327" "ENTREZID:124" "ENTREZID:125" "ENTREZID:126"
## [5] "ENTREZID:127" "ENTREZID:128" "ENTREZID:130" "ENTREZID:130589"
## [9] "ENTREZID:131" "ENTREZID:160287" "ENTREZID:1737" "ENTREZID:1738"
## [13] "ENTREZID:2023" "ENTREZID:2026" "ENTREZID:2027" "ENTREZID:217"
## [17] "ENTREZID:218" "ENTREZID:219" "ENTREZID:220" "ENTREZID:2203"
## [21] "ENTREZID:221" "ENTREZID:222" "ENTREZID:223" "ENTREZID:224"
## [25] "ENTREZID:226" "ENTREZID:229" "ENTREZID:230" "ENTREZID:2538"
## [29] "ENTREZID:2597" "ENTREZID:26330" "ENTREZID:2645" "ENTREZID:2821"
## [33] "ENTREZID:3098" "ENTREZID:3099" "ENTREZID:3101" "ENTREZID:387712"
## [37] "ENTREZID:3939" "ENTREZID:3945" "ENTREZID:3948" "ENTREZID:441531"
## [41] "ENTREZID:501" "ENTREZID:5105" "ENTREZID:5106" "ENTREZID:5160"
## [45] "ENTREZID:5161" "ENTREZID:5162" "ENTREZID:5211" "ENTREZID:5213"
## [49] "ENTREZID:5214" "ENTREZID:5223" "ENTREZID:5224" "ENTREZID:5230"
## [53] "ENTREZID:5232" "ENTREZID:5236" "ENTREZID:5313" "ENTREZID:5315"
## [57] "ENTREZID:55276" "ENTREZID:55902" "ENTREZID:57818" "ENTREZID:669"
## [61] "ENTREZID:7167" "ENTREZID:80201" "ENTREZID:83440" "ENTREZID:84532"
## [65] "ENTREZID:8789" "ENTREZID:92483" "ENTREZID:92579" "ENTREZID:9562"
we first map the gene IDs in the airway dataset from ENSEMBL to ENTREZ IDs.
airSE <- idMap(airSE, org="hsa", from="ENSEMBL", to="ENTREZID")
##
## Encountered 93 from.IDs with >1 corresponding to.ID
## (the first to.ID was chosen for each of them)
## Excluded 2387 from.IDs without a corresponding to.ID
## Encountered 7 to.IDs with >1 from.ID (the first from.ID was chosen for each of them)
## Mapped from.IDs have been added to the rowData column ENSEMBL
Next, we define all genes with adjusted p-value below 0.01 as differentially expressed, and collect their log2 fold change for further analysis.
all <- names(airSE)
de.ind <- rowData(airSE)$ADJ.PVAL < 0.01
de <- rowData(airSE)$FC[de.ind]
names(de) <- all[de.ind]
This results in 2,426 DE genes - out of 11,780 genes in total.
length(all)
## [1] 13532
length(de)
## [1] 2657
The Pathway Regulation Score (PRS) incorporates the pathway topology by weighting the indiviudal gene-level log2 fold changes by the number of downstream DE genes. The weighted absolute fold changes are summed across the pathway and statistical significance is assessed by permutation of genes. Ibrahim et al., 2012
res <- prs(de, all, pwys[1:100], nperm=100)
## 3697 node labels mapped to the expression data
## Average coverage 47.4046 %
## 0 (out of 100) pathways without a mapped node
## 9 pwys were filtered out
head(res)
## nPRS p.value
## Cysteine and methionine metabolism 10.858556 0.00990099
## Tyrosine metabolism 9.243633 0.00990099
## Glycine, serine and threonine metabolism 7.065008 0.00990099
## Arachidonic acid metabolism 6.989077 0.00990099
## Retinol metabolism 6.406457 0.00990099
## Glutathione metabolism 6.112306 0.00990099
Corresponding gene weights (number of downstream DE genes) can be obtained for a pathway of choice via
ind <- grep("ErbB signaling pathway", names(pwys))
weights <- prsWeights(pwys[[ind]], de, all)
## 71 node labels mapped to the expression data
## Average coverage 80.68182 %
## 0 (out of 1) pathways without a mapped node
weights
## 572 5601 5291 2065 6416 815 8440 5609 5063 369 2932
## 0 0 0 0 0 1 0 0 1 0 0
## 25 1399 5594 6655 5595 208 5599 6198 5602 2549 867
## 0 0 0 4 0 0 0 0 3 5 0
## 1027 1839 868 6654 10000 8503 5290 5335 1026 6776 2002
## 0 0 0 0 0 1 1 0 1 1 1
## 5605 25759 10298 5894 3845 4609 2064 207 27 817 5295
## 0 0 0 2 0 1 0 0 1 0 1
## 1956 53358 818 5058 5578 3084 673 4690 6464 1398 5604
## 0 1 0 0 0 1 0 0 0 0 1
## 5747 145957 5293 6777 3265 685 6199 3725 2885 5062 399694
## 0 0 1 1 3 1 0 1 0 0 1
## 1978 6714 5336 2475 4893
## 0 0 2 1 0
Inspecting the genes with maximum number of downstream DE genes
weights[weights == max(weights)]
## 2549
## 5
reveals important upstream regulators including several G protein subunits such as subunit beta 2 (Entrez Gene ID 2783).