fgsea
is an R-package for fast preranked gene set enrichment analysis (GSEA). This package allows to quickly and accurately calculate arbitrarily low GSEA P-values for a collection of gene sets. P-value estimation is based on an adaptive multi-level split Monte-Carlo scheme. See the preprint for algorithmic details.
Loading example pathways and gene-level statistics and setting random seed:
Running fgsea:
## Warning in fgseaMultilevel(...): For some pathways, in reality P-values are less
## than 1e-10. You can set the `eps` argument to zero for better estimation.
The resulting table contains enrichment scores and p-values:
## pathway pval padj log2err ES NES size
## 1: 5990979_Cell_Cycle,_Mitotic 1e-10 3.6625e-09 NA 0.5594755 2.746957 317
## 2: 5990980_Cell_Cycle 1e-10 3.6625e-09 NA 0.5388497 2.665929 369
## 3: 5990981_DNA_Replication 1e-10 3.6625e-09 NA 0.6440006 2.604576 82
## 4: 5990987_Synthesis_of_DNA 1e-10 3.6625e-09 NA 0.6478555 2.614742 78
## 5: 5990988_S_Phase 1e-10 3.6625e-09 NA 0.6013092 2.495273 98
## 6: 5990990_G1_S_Transition 1e-10 3.6625e-09 NA 0.6232905 2.536670 84
## leadingEdge
## 1: 66336,66977,12442,107995,66442,12571,...
## 2: 66336,66977,12442,107995,66442,19361,...
## 3: 57441,17219,69270,12575,69263,17215,...
## 4: 17219,69270,12575,69263,17215,68240,...
## 5: 67849,17219,69270,12575,69263,71988,...
## 6: 20135,13555,17219,12575,12448,17215,...
As you can see from the warning, fgsea
has a default lower bound eps=1e-10
for estimating P-values. If you need to estimate P-value more accurately, you can set the eps
argument to zero in the fgsea
function.
fgseaRes <- fgsea(pathways = examplePathways,
stats = exampleRanks,
eps = 0.0,
minSize = 15,
maxSize = 500)
head(fgseaRes[order(pval), ])
## pathway pval padj
## 1: 5990980_Cell_Cycle 8.656249e-27 5.072562e-24
## 2: 5990979_Cell_Cycle,_Mitotic 6.996107e-26 2.049859e-23
## 3: 5991851_Mitotic_Prometaphase 9.627014e-19 1.880477e-16
## 4: 5992217_Resolution_of_Sister_Chromatid_Cohesion 5.507092e-18 8.067889e-16
## 5: 5991454_M_Phase 3.448765e-14 3.416625e-12
## 6: 5991599_Separation_of_Sister_Chromatids 3.498250e-14 3.416625e-12
## log2err ES NES size leadingEdge
## 1: 1.3422338 0.5388497 2.689405 369 66336,66977,12442,107995,66442,19361,...
## 2: 1.3188888 0.5594755 2.754347 317 66336,66977,12442,107995,66442,12571,...
## 3: 1.1146645 0.7253270 3.006720 82 66336,66977,12442,107995,66442,52276,...
## 4: 1.0959293 0.7347987 2.994767 74 66336,66977,12442,107995,66442,52276,...
## 5: 0.9653278 0.5576247 2.598307 173 66336,66977,12442,107995,66442,52276,...
## 6: 0.9653278 0.6164600 2.702081 116 66336,66977,107995,66442,52276,67629,...
One can make an enrichment plot for a pathway:
plotEnrichment(examplePathways[["5991130_Programmed_Cell_Death"]],
exampleRanks) + labs(title="Programmed Cell Death")
Or make a table plot for a bunch of selected pathways:
topPathwaysUp <- fgseaRes[ES > 0][head(order(pval), n=10), pathway]
topPathwaysDown <- fgseaRes[ES < 0][head(order(pval), n=10), pathway]
topPathways <- c(topPathwaysUp, rev(topPathwaysDown))
plotGseaTable(examplePathways[topPathways], exampleRanks, fgseaRes,
gseaParam=0.5)
From the plot above one can see that there are very similar pathways in the table (for example 5991502_Mitotic_Metaphase_and_Anaphase
and 5991600_Mitotic_Anaphase
). To select only independent pathways one can use collapsePathways
function:
collapsedPathways <- collapsePathways(fgseaRes[order(pval)][padj < 0.01],
examplePathways, exampleRanks)
mainPathways <- fgseaRes[pathway %in% collapsedPathways$mainPathways][
order(-NES), pathway]
plotGseaTable(examplePathways[mainPathways], exampleRanks, fgseaRes,
gseaParam = 0.5)
To save the results in a text format data:table::fwrite
function can be used:
To make leading edge more human-readable it can be converted using mapIdsList
(similar to AnnotationDbi::mapIds
) function and a corresponding database (here org.Mm.eg.db
for mouse):
Also, fgsea
is parallelized using BiocParallel
package. By default the first registered backend returned by bpparam()
is used. To tweak the parallelization one can either specify BPPARAM
parameter used for bplapply
of set nproc
parameter, which is a shorthand for setting BPPARAM=MulticoreParam(workers = nproc)
.
For convenience there is reactomePathways
function that obtains pathways from Reactome for given set of genes. Package reactome.db
is required to be installed.
pathways <- reactomePathways(names(exampleRanks))
fgseaRes <- fgsea(pathways, exampleRanks, maxSize=500)
head(fgseaRes)
## pathway pval
## 1: 5-Phosphoribose 1-diphosphate biosynthesis 0.84368308
## 2: A tetrasaccharide linker sequence is required for GAG synthesis 0.65530303
## 3: ABC transporters in lipid homeostasis 0.22457627
## 4: ABC-family proteins mediated transport 0.31974922
## 5: ABO blood group biosynthesis 0.97916667
## 6: ADP signalling through P2Y purinoceptor 1 0.01215366
## padj log2err ES NES size
## 1: 0.9377611 0.05664842 0.4267378 0.6850714 2
## 2: 0.8621122 0.06280040 0.3218180 0.8393227 11
## 3: 0.5863542 0.13284630 -0.4385385 -1.2314139 12
## 4: 0.6832311 0.09026355 0.2771011 1.0981143 67
## 5: 0.9928526 0.04850598 0.5120427 0.6883219 1
## 6: 0.1054198 0.38073040 0.6228710 1.7872897 16
## leadingEdge
## 1: 328099,19139
## 2: 14733,20971,20970,12032,29873,218271,...
## 3: 19299,27403,11307,11806,217265,27409,...
## 4: 56199,17463,26440,26444,19179,228769,...
## 5: 14344
## 6: 14696,14702,14700,14682,14676,66066,...
One can also start from .rnk
and .gmt
files as in original GSEA:
rnk.file <- system.file("extdata", "naive.vs.th1.rnk", package="fgsea")
gmt.file <- system.file("extdata", "mouse.reactome.gmt", package="fgsea")
Loading ranks:
ranks <- read.table(rnk.file,
header=TRUE, colClasses = c("character", "numeric"))
ranks <- setNames(ranks$t, ranks$ID)
str(ranks)
## Named num [1:12000] -63.3 -49.7 -43.6 -41.5 -33.3 ...
## - attr(*, "names")= chr [1:12000] "170942" "109711" "18124" "12775" ...
Loading pathways:
## List of 6
## $ 1221633_Meiotic_Synapsis : chr [1:64] "12189" "13006" "15077" "15078" ...
## $ 1368092_Rora_activates_gene_expression : chr [1:9] "11865" "12753" "12894" "18143" ...
## $ 1368110_Bmal1:Clock,Npas2_activates_circadian_gene_expression : chr [1:16] "11865" "11998" "12753" "12952" ...
## $ 1445146_Translocation_of_Glut4_to_the_Plasma_Membrane : chr [1:55] "11461" "11465" "11651" "11652" ...
## $ 186574_Endocrine-committed_Ngn3+_progenitor_cells : chr [1:4] "18012" "18088" "18506" "53626"
## $ 186589_Late_stage_branching_morphogenesis_pancreatic_bud_precursor_cells: chr [1:4] "11925" "15205" "21410" "246086"
And runnig fgsea:
## pathway
## 1: 1221633_Meiotic_Synapsis
## 2: 1445146_Translocation_of_Glut4_to_the_Plasma_Membrane
## 3: 442533_Transcriptional_Regulation_of_Adipocyte_Differentiation_in_3T3-L1_Pre-adipocytes
## 4: 508751_Circadian_Clock
## 5: 5334727_Mus_musculus_biological_processes
## 6: 573389_NoRC_negatively_regulates_rRNA_expression
## pval padj log2err ES NES size
## 1: 0.5716695 0.7460987 0.06378454 0.2885754 0.9307488 27
## 2: 0.6983607 0.8393464 0.05302125 0.2387284 0.8459673 39
## 3: 0.1062802 0.2707834 0.21392786 -0.3640706 -1.3586459 31
## 4: 0.8186528 0.8966763 0.04811840 0.2516324 0.7211082 17
## 5: 0.3879310 0.5950984 0.07511816 0.2469065 1.0524610 106
## 6: 0.4127807 0.6146414 0.08152651 0.3607407 1.0337822 17
## leadingEdge
## 1: 15270,12189,71846,19357
## 2: 17918,19341,20336,22628,22627,20619,...
## 3: 76199,19014,26896,229003,17977,17978,...
## 4: 20893,59027,19883
## 5: 60406,19361,15270,20893,12189,68240,...
## 6: 60406,20018,245688,20017