rGREAT has integrated Gene Ontology gene sets for all supported organisms and MSigDB gene sets for human. It is very easy to apply GREAT analysis on other geneset databases such as KEGG. THe basical use has been explained in the section “Manually set gene sets” in the vignette “Analyze with local GREAT. In this vignette, we will just show you several examples.

KEGG pathways

KEGG pathway database is widely used for gene set enrichment analysis. To apply GREAT anlaysis with KEGG pathways, we need to construct the corresponding gene sets.

KEGG provides its data via a REST API (https://rest.kegg.jp/). There are several commands that can be used to retrieve specific types of data. Here we use the package KEGGREST to obtain data directly from KEGG. To get a list of genes involved in pathways, we use the “link” command which is implemented in the function keggLink().

library(KEGGREST)
pathway2gene = keggLink("pathway", "hsa")
head(pathway2gene)
##       hsa:10327         hsa:124         hsa:125         hsa:126         hsa:127 
## "path:hsa00010" "path:hsa00010" "path:hsa00010" "path:hsa00010" "path:hsa00010" 
##         hsa:128 
## "path:hsa00010"

The returned object pathway2gene is a named vector, where the names are gene Entrez IDs (with an additional "hsa:" prefix) and values are pathway IDs (with an additional "path:" prefix). You can try to execute keggLink("hsa","pathway") to compare the results. The named vectors are not common for downstream gene set analysis. Let’s convert it to a list of genes, also remove the prefix.

kegg_pathways = split(gsub("hsa:", "", names(pathway2gene)),
                      gsub("path:", "", pathway2gene))

Now simply send kegg_pathways to great() to perform GREAT analysis.

library(rGREAT)
gr = randomRegions(genome = "hg19")
great(gr, kegg_pathways, "hg19")
## 908 regions are associated to 1377 genes' extended TSSs.
##   TSS source: TxDb.Hsapiens.UCSC.hg19.knownGene
##   Genome: hg19
##   OrgDb: org.Hs.eg.db
##   Gene sets: self-provided
##   Background: whole genome excluding gaps
## Mode: Basal plus extension
##   Proximal: 5000 bp upstream, 1000 bp downstream,
##   plus Distal: up to 1000000 bp

Here we only include the KEGG pathway IDs. The full name of pathways can be obtained by the following command:

pn = keggList("pathway/hsa")
head(pn)
##                                                     path:hsa00010 
##             "Glycolysis / Gluconeogenesis - Homo sapiens (human)" 
##                                                     path:hsa00020 
##                "Citrate cycle (TCA cycle) - Homo sapiens (human)" 
##                                                     path:hsa00030 
##                "Pentose phosphate pathway - Homo sapiens (human)" 
##                                                     path:hsa00040 
## "Pentose and glucuronate interconversions - Homo sapiens (human)" 
##                                                     path:hsa00051 
##          "Fructose and mannose metabolism - Homo sapiens (human)" 
##                                                     path:hsa00052 
##                     "Galactose metabolism - Homo sapiens (human)"

Please go the the documentation of KEGGREST or the main website of KEGG API (https://rest.kegg.jp/) to find out commands for retrieving various datasets there.

Please note, KEGG pathways are only free for academic users, https://www.pathway.jp/en/licensing.html.

Reactome pathways

Reactome is another popular pathway database for gene set enrichment analysis. In Bioconductor, there is a package reactome.db that contains genes associated to every pathway in Reactome.

To get the Reactome pathway gene sets, simply use the object reactomePATHID2EXTID.

library(reactome.db)
gs = as.list(reactomePATHID2EXTID)
great(gr, gs, "hg19")
## 908 regions are associated to 1377 genes' extended TSSs.
##   TSS source: TxDb.Hsapiens.UCSC.hg19.knownGene
##   Genome: hg19
##   OrgDb: org.Hs.eg.db
##   Gene sets: self-provided
##   Background: whole genome excluding gaps
## Mode: Basal plus extension
##   Proximal: 5000 bp upstream, 1000 bp downstream,
##   plus Distal: up to 1000000 bp

The full pathway names are in the object reactomePATHID2NAME. You can convert the object to a normal list by as.list(reactomePATHID2NAME).

UniProt keywords gene sets

UniProt database provides a list of controlled vocabulary represented as keywords for genes or proteins (https://www.uniprot.org/keywords/). This is useful for summarizing gene functions in a compact way. The UniProtKeywords package contains gene sets associated to keywords.

UniProtKeywords support several organisms. Here we use human as an example. In the function load_keyword_genesets(), you need to specify the taxon ID for the organism.

library(UniProtKeywords)
gs = load_keyword_genesets(9606)
great(gr, gs, "hg19")
## 908 regions are associated to 1377 genes' extended TSSs.
##   TSS source: TxDb.Hsapiens.UCSC.hg19.knownGene
##   Genome: hg19
##   OrgDb: org.Hs.eg.db
##   Gene sets: self-provided
##   Background: whole genome excluding gaps
## Mode: Basal plus extension
##   Proximal: 5000 bp upstream, 1000 bp downstream,
##   plus Distal: up to 1000000 bp

Specific gene sets in .gmt format

Please refer to the section “Manually set gene sets” in the vignette “Analyze with local GREAT. There introduces a function read_gmt() which can import a .gmt file into a list of genes.

The tool Enrichr provides a great resource of gene sets in .gmt format. See https://maayanlab.cloud/Enrichr/#libraries.

Session info

sessionInfo()
## R version 4.2.2 (2022-10-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.5 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.16-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.16-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] UniProtKeywords_0.99.4                 
##  [2] reactome.db_1.82.0                     
##  [3] KEGGREST_1.38.0                        
##  [4] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
##  [5] GenomicFeatures_1.50.2                 
##  [6] AnnotationDbi_1.60.0                   
##  [7] Biobase_2.58.0                         
##  [8] rGREAT_2.0.2                           
##  [9] GenomicRanges_1.50.1                   
## [10] GenomeInfoDb_1.34.3                    
## [11] IRanges_2.32.0                         
## [12] S4Vectors_0.36.0                       
## [13] BiocGenerics_0.44.0                    
## [14] knitr_1.41                             
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-7                            
##  [2] matrixStats_0.63.0                      
##  [3] bit64_4.0.5                             
##  [4] doParallel_1.0.17                       
##  [5] filelock_1.0.2                          
##  [6] RColorBrewer_1.1-3                      
##  [7] progress_1.2.2                          
##  [8] httr_1.4.4                              
##  [9] tools_4.2.2                             
## [10] bslib_0.4.1                             
## [11] utf8_1.2.2                              
## [12] R6_2.5.1                                
## [13] DT_0.26                                 
## [14] DBI_1.1.3                               
## [15] colorspace_2.0-3                        
## [16] GetoptLong_1.0.5                        
## [17] tidyselect_1.2.0                        
## [18] prettyunits_1.1.1                       
## [19] bit_4.0.5                               
## [20] curl_4.3.3                              
## [21] compiler_4.2.2                          
## [22] cli_3.4.1                               
## [23] xml2_1.3.3                              
## [24] DelayedArray_0.24.0                     
## [25] rtracklayer_1.58.0                      
## [26] sass_0.4.3                              
## [27] rappdirs_0.3.3                          
## [28] stringr_1.4.1                           
## [29] digest_0.6.30                           
## [30] Rsamtools_2.14.0                        
## [31] rmarkdown_2.18                          
## [32] XVector_0.38.0                          
## [33] pkgconfig_2.0.3                         
## [34] htmltools_0.5.3                         
## [35] MatrixGenerics_1.10.0                   
## [36] highr_0.9                               
## [37] dbplyr_2.2.1                            
## [38] fastmap_1.1.0                           
## [39] htmlwidgets_1.5.4                       
## [40] rlang_1.0.6                             
## [41] GlobalOptions_0.1.2                     
## [42] RSQLite_2.2.18                          
## [43] shiny_1.7.3                             
## [44] shape_1.4.6                             
## [45] jquerylib_0.1.4                         
## [46] BiocIO_1.8.0                            
## [47] generics_0.1.3                          
## [48] jsonlite_1.8.3                          
## [49] BiocParallel_1.32.1                     
## [50] BioMartGOGeneSets_0.99.8                
## [51] dplyr_1.0.10                            
## [52] RCurl_1.98-1.9                          
## [53] magrittr_2.0.3                          
## [54] GO.db_3.16.0                            
## [55] GenomeInfoDbData_1.2.9                  
## [56] Matrix_1.5-3                            
## [57] Rcpp_1.0.9                              
## [58] fansi_1.0.3                             
## [59] lifecycle_1.0.3                         
## [60] stringi_1.7.8                           
## [61] yaml_2.3.6                              
## [62] SummarizedExperiment_1.28.0             
## [63] zlibbioc_1.44.0                         
## [64] org.Hs.eg.db_3.16.0                     
## [65] BiocFileCache_2.6.0                     
## [66] grid_4.2.2                              
## [67] blob_1.2.3                              
## [68] promises_1.2.0.1                        
## [69] parallel_4.2.2                          
## [70] crayon_1.5.2                            
## [71] lattice_0.20-45                         
## [72] Biostrings_2.66.0                       
## [73] circlize_0.4.15                         
## [74] hms_1.1.2                               
## [75] pillar_1.8.1                            
## [76] rjson_0.2.21                            
## [77] codetools_0.2-18                        
## [78] biomaRt_2.54.0                          
## [79] XML_3.99-0.12                           
## [80] glue_1.6.2                              
## [81] evaluate_0.18                           
## [82] httpuv_1.6.6                            
## [83] foreach_1.5.2                           
## [84] png_0.1-7                               
## [85] vctrs_0.5.1                             
## [86] assertthat_0.2.1                        
## [87] cachem_1.0.6                            
## [88] xfun_0.35                               
## [89] mime_0.12                               
## [90] xtable_1.8-4                            
## [91] restfulr_0.0.15                         
## [92] later_1.3.0                             
## [93] tibble_3.1.8                            
## [94] TxDb.Hsapiens.UCSC.hg38.knownGene_3.16.0
## [95] iterators_1.0.14                        
## [96] GenomicAlignments_1.34.0                
## [97] memoise_2.0.1                           
## [98] ellipsis_0.3.2