Abstract

clusterProfiler implements methods to analyze and visualize functional profiles of genomic coordinates (supported by ChIPseeker), gene and gene clusters.

Supported Analysis

Over-Representation Analysis
Gene Set Enrichment Analysis
Biological theme comparison

Supported ontologies/pathways

Disease Ontology (via DOSE)
Network of Cancer Gene (via DOSE)
DisGeNET (via DOSE)
Gene Ontology (supports many species with GO annotation query online via AnnotationHub)
KEGG Pathway and Module with latest online data (supports more than 4000 species listed in http://www.genome.jp/kegg/catalog/org_list.html)
Reactome Pathway (via ReactomePA)
DAVID (via RDAVIDWebService)
Molecular Signatures Database
- hallmark gene sets
- positional gene sets
- curated gene sets
- motif gene sets
- computational gene sets
- GO gene sets
- oncogenic signatures
- immunologic signatures
Other Annotations
- from other sources (e.g. DisGeNET as an example)
- user’s annotation
- customized ontology
- and many others

Visualization

barplot
cnetplot
dotplot
enrichMap
gseaplot
plotGOgraph (via topGO package)
upsetplot

Citation

If you use clusterProfiler in published research, please cite:

G Yu, LG Wang, Y Han, QY He. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology 2012, 16(5):284-287. doi:[10.1089/omi.2011.0118](http://dx.doi.org/10.1089/omi.2011.0118)

Introduction

In recently years, high-throughput experimental techniques such as microarray, RNA-Seq and mass spectrometry can detect cellular molecules at systems-level. These kinds of analyses generate huge quantitaties of data, which need to be given a biological interpretation. A commonly used approach is via clustering in the gene dimension for grouping different genes based on their similarities¹.

To search for shared functions among genes, a common way is to incorporate the biological knowledge, such as Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG), for identifying predominant biological themes of a collection of genes.

After clustering analysis, researchers not only want to determine whether there is a common theme of a particular gene cluster, but also to compare the biological themes among gene clusters. The manual step to choose interesting clusters followed by enrichment analysis on each selected cluster is slow and tedious. To bridge this gap, we designed clusterProfiler², for comparing and visualizing functional profiles among gene clusters.

`bitr`: Biological Id TranslatoR

clusterProfiler provides bitr and bitr_kegg for converting ID types. Both bitr and bitr_kegg support many species including model and many non-model organisms.

x <- c("GPX3",  "GLRX",   "LBP",   "CRYAB", "DEFB1", "HCLS1",   "SOD2",   "HSPA2",
       "ORM1",  "IGFBP1", "PTHLH", "GPC3",  "IGFBP3","TOB1",    "MITF",   "NDRG1",
       "NR1H4", "FGFR3",  "PVR",   "IL6",   "PTPRM", "ERBB2",   "NID2",   "LAMB1",
       "COMP",  "PLS3",   "MCAM",  "SPP1",  "LAMC1", "COL4A2",  "COL4A1", "MYOC",
       "ANXA4", "TFPI2",  "CST6",  "SLPI",  "TIMP2", "CPM",     "GGT1",   "NNMT",
       "MAL",   "EEF1A2", "HGD",   "TCN2",  "CDA",   "PCCA",    "CRYM",   "PDXK",
       "STC1",  "WARS",  "HMOX1", "FXYD2", "RBP4",   "SLC6A12", "KDELR3", "ITM2B")
eg = bitr(x, fromType="SYMBOL", toType="ENTREZID", OrgDb="org.Hs.eg.db")
head(eg)

##   SYMBOL ENTREZID
## 1   GPX3     2878
## 2   GLRX     2745
## 3    LBP     3929
## 4  CRYAB     1410
## 5  DEFB1     1672
## 6  HCLS1     3059

User should provides an annotation package, both fromType and toType can accept any types that supported.

User can use keytypes to list all supporting types.

library(org.Hs.eg.db)
keytypes(org.Hs.eg.db)

##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT" 
##  [5] "ENSEMBLTRANS" "ENTREZID"     "ENZYME"       "EVIDENCE"    
##  [9] "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"       
## [13] "IPI"          "MAP"          "OMIM"         "ONTOLOGY"    
## [17] "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"        
## [21] "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
## [25] "UNIGENE"      "UNIPROT"

We can translate from one type to other types.

ids <- bitr(x, fromType="SYMBOL", toType=c("UNIPROT", "ENSEMBL"), OrgDb="org.Hs.eg.db")
head(ids)

##   SYMBOL    UNIPROT         ENSEMBL
## 1   GPX3     P22352 ENSG00000211445
## 2   GLRX A0A024RAM2 ENSG00000173221
## 3   GLRX     P35754 ENSG00000173221
## 4    LBP     P18428 ENSG00000129988
## 5    LBP     Q8TCF0 ENSG00000129988
## 6  CRYAB     P02511 ENSG00000109846

For GO analysis, user don’t need to convert ID, all ID type provided by OrgDb can be used in groupGO, enrichGO and gseGO by specifying keytype parameter.

`bitr_kegg`: converting biological IDs using KEGG API

data(gcSample)
hg <- gcSample[[1]]
head(hg)

## [1] "4597"  "7111"  "5266"  "2175"  "755"   "23046"

eg2np <- bitr_kegg(hg, fromType='kegg', toType='ncbi-proteinid', organism='hsa')
head(eg2np)

##    kegg ncbi-proteinid
## 1 10001      NP_005457
## 2 10209      NP_005792
## 3 10232      NP_037536
## 4 10324      NP_006054
## 5 10411   NP_001092002
## 6 10614      NP_006451

The ID type (both fromType & toType) should be one of ‘kegg’, ‘ncbi-geneid’, ‘ncbi-proteinid’ or ‘uniprot’. The ‘kegg’ is the primary ID used in KEGG database. The data source of KEGG was from NCBI. A rule of thumb for the ‘kegg’ ID is entrezgene ID for eukaryote species and Locus ID for prokaryotes.

Many prokaryote species don’t have entrezgene ID available. For example we can check the gene information of ece:Z5100 in http://www.genome.jp/dbget-bin/www_bget?ece:Z5100, which have NCBI-ProteinID and UnitProt links in the Other DBs Entry, but not NCBI-GeneID.

If we try to convert Z5100 to ncbi-geneid, bitr_kegg will throw error of ncbi-geneid is not supported.

bitr_kegg("Z5100", fromType="kegg", toType='ncbi-geneid', organism='ece')

## Error in KEGG_convert(fromType, toType, organism) :
## ncbi-geneid is not supported for ece ...

We can of course convert it to ncbi-proteinid and uniprot:

bitr_kegg("Z5100", fromType="kegg", toType='ncbi-proteinid', organism='ece')

##    kegg ncbi-proteinid
## 1 Z5100       AAG58814

bitr_kegg("Z5100", fromType="kegg", toType='uniprot', organism='ece')

##    kegg uniprot
## 1 Z5100  Q7DB85

GO Analysis

Supported organisms

GO analyses (groupGO(), enrichGO() and gseGO()) support organisms that have an OrgDb object available.

Bioconductor have already provide OrgDb for about 20 species. User can query OrgDb online by AnnotationHub or build their own by AnnotationForge. An example can be found in the vignette of GOSemSim.

If user have GO annotation data (in data.frame format with first column of gene ID and second column of GO ID), they can use enricher() and gseGO() functions to perform over-representation test and gene set enrichment analysis.

If genes are annotated by direction annotation, it should also annotated by its ancestor GO nodes (indirect annation). If user only has direct annotation, they can pass their annotation to buildGOmap function, which will infer indirection annotation and generate a data.frame that suitable for both enricher() and gseGO().

GO classification

In clusterProfiler, groupGO is designed for gene classification based on GO distribution at a specific level. Here we use dataset geneList provided by DOSE. Please refer to vignette of DOSE for more details.

data(geneList, package="DOSE")
gene <- names(geneList)[abs(geneList) > 2]
gene.df <- bitr(gene, fromType = "ENTREZID",
        toType = c("ENSEMBL", "SYMBOL"),
        OrgDb = org.Hs.eg.db)
head(gene.df)

##   ENTREZID         ENSEMBL SYMBOL
## 1     4312 ENSG00000196611   MMP1
## 2     8318 ENSG00000093009  CDC45
## 3    10874 ENSG00000109255    NMU
## 4    55143 ENSG00000134690  CDCA8
## 5    55388 ENSG00000065328  MCM10
## 6      991 ENSG00000117399  CDC20

ggo <- groupGO(gene     = gene,
               OrgDb    = org.Hs.eg.db,
               ont      = "CC",
               level    = 3,
               readable = TRUE)

head(ggo)

##                    ID                    Description Count GeneRatio
## GO:0005886 GO:0005886                plasma membrane    54    54/207
## GO:0005628 GO:0005628              prospore membrane     0     0/207
## GO:0005789 GO:0005789 endoplasmic reticulum membrane     6     6/207
## GO:0019867 GO:0019867                 outer membrane     2     2/207
## GO:0031090 GO:0031090             organelle membrane    19    19/207
## GO:0034357 GO:0034357        photosynthetic membrane     0     0/207
##                                                                                                                                                                                                                                                                                                                                              geneID
## GO:0005886 S100A9/MELK/S100A8/MARCO/CXCL10/LAMP3/CEP55/UGT8/UBE2C/SLC7A5/CXCL9/FADS2/MSLN/IL1R2/KIF18A/S100P/GZMB/TRAT1/GABRP/AQP9/GPR19/PRC1/SLC2A6/LAG3/NUDT1/CACNA1D/VSTM4/ITPR1/SYT17/SLC16A4/CORIN/KCNK15/CA12/KCNE4/HLA-DQA1/ADH1B/PDZK1/C7/ACKR1/COL17A1/PSD3/EMCN/SLC44A4/LRP2/NLGN4X/MAPT/ERBB4/CX3CR1/LAMP5/ABCA8/PIP/STEAP4/PTPRT/CYBRD1
## GO:0005628                                                                                                                                                                                                                                                                                                                                         
## GO:0005789                                                                                                                                                                                                                                                                                                  FADS2/ITPR1/HLA-DQA1/CYP4F8/CYP4B1/FMO5
## GO:0019867                                                                                                                                                                                                                                                                                                                                 MAOB/PGR
## GO:0031090                                                                                                                                                                                                                     MARCO/LAMP3/DUSP2/DTL/NUDT1/MAOB/ITPR1/FAM198B/HLA-DQA1/LRP2/HMGCS2/CYP4F8/CYP4B1/LAMP5/ABCA8/FMO5/STEAP4/PGR/CYBRD1
## GO:0034357

The input parameters of gene is a vector of gene IDs (can be any ID type that supported by corresponding OrgDb).

If readable is setting to TRUE, the input gene IDs will be converted to gene symbols.

GO over-representation test

Over-representation test³ were implemented in clusterProfiler. For calculation details and explanation of paramters, please refer to the vignette of DOSE.

ego <- enrichGO(gene          = gene,
                universe      = names(geneList),
                OrgDb         = org.Hs.eg.db,
                ont           = "CC",
                pAdjustMethod = "BH",
                pvalueCutoff  = 0.01,
                qvalueCutoff  = 0.05,
        readable      = TRUE)
head(ego)

##                    ID                              Description GeneRatio
## GO:0005819 GO:0005819                                  spindle    25/198
## GO:0000779 GO:0000779 condensed chromosome, centromeric region    15/198
## GO:0000775 GO:0000775           chromosome, centromeric region    18/198
## GO:0000776 GO:0000776                              kinetochore    15/198
## GO:0000793 GO:0000793                     condensed chromosome    18/198
## GO:0005876 GO:0005876                      spindle microtubule    10/198
##              BgRatio       pvalue     p.adjust       qvalue
## GO:0005819 238/11745 2.090374e-13 5.518588e-11 4.950886e-11
## GO:0000779  90/11745 2.241220e-11 2.958411e-09 2.654077e-09
## GO:0000775 152/11745 7.936845e-11 6.984424e-09 6.265930e-09
## GO:0000776 103/11745 1.649359e-10 8.919109e-09 8.001593e-09
## GO:0000793 159/11745 1.689225e-10 8.919109e-09 8.001593e-09
## GO:0005876  45/11745 2.821374e-09 1.241405e-07 1.113700e-07
##                                                                                                                                                         geneID
## GO:0005819 CDCA8/CDC20/KIF23/CENPE/ASPM/DLGAP5/SKA1/NUSAP1/TPX2/TACC3/NEK2/CDK1/MAD2L1/KIF18A/BIRC5/KIF11/TTK/AURKB/PRC1/KIFC1/KIF18B/KIF20A/AURKA/CCNB1/KIF4A
## GO:0000779                                                            CENPE/NDC80/HJURP/SKA1/NEK2/CENPM/CENPN/ERCC6L/MAD2L1/CDT1/BIRC5/NCAPG/AURKB/AURKA/CCNB1
## GO:0000775                                           CDCA8/CENPE/NDC80/HJURP/SKA1/NEK2/CENPM/CENPN/ERCC6L/MAD2L1/KIF18A/CDT1/BIRC5/TTK/NCAPG/AURKB/AURKA/CCNB1
## GO:0000776                                                             CENPE/NDC80/HJURP/SKA1/NEK2/CENPM/CENPN/ERCC6L/MAD2L1/KIF18A/CDT1/BIRC5/TTK/AURKB/CCNB1
## GO:0000793                                          CENPE/NDC80/TOP2A/NCAPH/HJURP/SKA1/NEK2/CENPM/CENPN/ERCC6L/MAD2L1/CDT1/BIRC5/NCAPG/AURKB/CHEK1/AURKA/CCNB1
## GO:0005876                                                                                         SKA1/NUSAP1/CDK1/KIF18A/KIF11/AURKB/PRC1/KIF18B/AURKA/KIF4A
##            Count
## GO:0005819    25
## GO:0000779    15
## GO:0000775    18
## GO:0000776    15
## GO:0000793    18
## GO:0005876    10

As I mentioned before, any gene ID type that supported in OrgDb can be directly used in GO analyses. User need to specify the keytype parameter to specify the input gene ID type.

ego2 <- enrichGO(gene         = gene.df$ENSEMBL,
                OrgDb         = org.Hs.eg.db,
        keytype       = 'ENSEMBL',
                ont           = "CC",
                pAdjustMethod = "BH",
                pvalueCutoff  = 0.01,
                qvalueCutoff  = 0.05)

Gene ID can be mapped to gene Symbol by using paramter readable=TRUE or setReadable function.

ego2 <- setReadable(ego2, OrgDb = org.Hs.eg.db)

drop specific GO terms or level

enrichGO test the whole GO corpus and enriched result may contains very general terms. With dropGO function, user can remove specific GO terms or GO level from results obtained from both enrichGO and compareCluster.

test GO at sepcific level

enrichGO doesn’t contain parameter to restrict the test at specific GO level. Instead, we provide a function gofilter to restrict the result at specific GO level. It works with results obtained from both enrichGO and compareCluster.

reduce redundancy of enriched GO terms

According to issue #28, I implement a simplify method to remove redundant GO terms obtained from enrichGO. An example can be found in the blog post. It internally call GOSemSim to calculate similarities among GO terms and remove those highly similar terms by keeping one representative term. The simplify method also works with both outputs from enrichGO and compareCluster.

GO Gene Set Enrichment Analysis

A common approach in analyzing gene expression profiles was identifying differential expressed genes that are deemed interesting. The enrichment analysis we demonstrated previous were based on these differential expressed genes. This approach will find genes where the difference is large, but it will not detect a situation where the difference is small, but evidenced in coordinated way in a set of related genes. Gene Set Enrichment Analysis (GSEA)⁴ directly addresses this limitation. All genes can be used in GSEA; GSEA aggregates the per gene statistics across genes within a gene set, therefore making it possible to detect situations where all genes in a predefined set change in a small but coordinated way. Since it is likely that many relevant phenotypic differences are manifested by small but consistent changes in a set of genes.

For algorithm details, please refer to the vignette of DOSE.

ego3 <- gseGO(geneList     = geneList,
              OrgDb        = org.Hs.eg.db,
              ont          = "CC",
              nPerm        = 1000,
              minGSSize    = 100,
              maxGSSize    = 500,
              pvalueCutoff = 0.05,
              verbose      = FALSE)

GSEA use permutation test, user can set nPerm for number of permutations. Only gene Set size in [minGSSize, maxGSSize] will be tested.

GO Semantic Similarity Analysis

GO semantic similarity can be calculated by GOSemSim¹. We can use it to cluster genes/proteins into different clusters based on their functional similarity and can also use it to measure the similarities among GO terms to reduce the redundancy of GO enrichment results.

KEGG analysis

The annotation package, KEGG.db, is not updated since 2012. It’s now pretty old and in clusterProfiler, enrichKEGG (for KEGG pathway) and enrichMKEGG (for KEGG module) supports downloading latest online version of KEGG data for enrichment analysis. Using KEGG.db is also supported by explicitly setting use_internal_data parameter to TRUE, but it’s not recommended.

With this new feature, organism is not restricted to those supported in previous release, it can be any species that have KEGG annotation data available in KEGG database. User should pass abbreviation of academic name to the organism parameter. The full list of KEGG supported organisms can be accessed via http://www.genome.jp/kegg/catalog/org_list.html.

clusterProfiler provides search_kegg_organism() function to help searching supported organisms.

search_kegg_organism('ece', by='kegg_code')

##     kegg_code                        scientific_name common_name
## 366       ece Escherichia coli O157:H7 EDL933 (EHEC)        <NA>

ecoli <- search_kegg_organism('Escherichia coli', by='scientific_name')
dim(ecoli)

## [1] 65  3

head(ecoli)

##     kegg_code                        scientific_name common_name
## 361       eco           Escherichia coli K-12 MG1655        <NA>
## 362       ecj            Escherichia coli K-12 W3110        <NA>
## 363       ecd            Escherichia coli K-12 DH10B        <NA>
## 364       ebw                Escherichia coli BW2952        <NA>
## 365      ecok            Escherichia coli K-12 MDS42        <NA>
## 366       ece Escherichia coli O157:H7 EDL933 (EHEC)        <NA>

KEGG over-representation test

kk <- enrichKEGG(gene         = gene,
                 organism     = 'hsa',
                 pvalueCutoff = 0.05)
head(kk)

##                ID                             Description GeneRatio
## hsa04110 hsa04110                              Cell cycle     11/88
## hsa04114 hsa04114                          Oocyte meiosis     10/88
## hsa04218 hsa04218                     Cellular senescence     10/88
## hsa03320 hsa03320                  PPAR signaling pathway      7/88
## hsa04914 hsa04914 Progesterone-mediated oocyte maturation      7/88
## hsa04115 hsa04115                   p53 signaling pathway      5/88
##           BgRatio       pvalue     p.adjust       qvalue
## hsa04110 124/7299 2.306110e-07 4.266304e-05 4.199548e-05
## hsa04114 124/7299 2.050914e-06 1.897096e-04 1.867411e-04
## hsa04218 160/7299 2.007076e-05 1.050736e-03 1.034295e-03
## hsa03320  72/7299 2.271861e-05 1.050736e-03 1.034295e-03
## hsa04914  99/7299 1.765455e-04 6.532184e-03 6.429973e-03
## hsa04115  68/7299 1.303794e-03 4.020033e-02 3.957131e-02
##                                                      geneID Count
## hsa04110 8318/991/9133/890/983/4085/7272/1111/891/4174/9232    11
## hsa04114    991/9133/983/4085/51806/6790/891/9232/3708/5241    10
## hsa04218     2305/4605/9133/890/983/51806/1111/891/776/3708    10
## hsa03320                 4312/9415/9370/5105/2167/3158/5346     7
## hsa04914                    9133/890/983/4085/6790/891/5241     7
## hsa04115                             9133/6241/983/1111/891     5

Input ID type can be kegg, ncbi-geneid, ncbi-proteinid or uniprot, an example can be found in the post.

KEGG Gene Set Enrichment Analysis

kk2 <- gseKEGG(geneList     = geneList,
               organism     = 'hsa',
               nPerm        = 1000,
               minGSSize    = 120,
               pvalueCutoff = 0.05,
               verbose      = FALSE)
head(kk2)

##                ID                 Description setSize enrichmentScore
## hsa04510 hsa04510              Focal adhesion     188      -0.4188582
## hsa05164 hsa05164                 Influenza A     156       0.3651059
## hsa05162 hsa05162                     Measles     122       0.3938756
## hsa03013 hsa03013               RNA transport     131       0.4116488
## hsa04218 hsa04218         Cellular senescence     143       0.4153718
## hsa04062 hsa04062 Chemokine signaling pathway     162       0.3780480
##                NES      pvalue   p.adjust    qvalues rank
## hsa04510 -1.713082 0.001424501 0.01973221 0.01298171 2183
## hsa05164  1.591310 0.003225806 0.01973221 0.01298171 2823
## hsa05162  1.639236 0.003267974 0.01973221 0.01298171 2607
## hsa03013  1.736054 0.003278689 0.01973221 0.01298171 3383
## hsa04218  1.779084 0.003300330 0.01973221 0.01298171 1155
## hsa04062  1.664238 0.003311258 0.01973221 0.01298171 1298
##                            leading_edge
## hsa04510 tags=27%, list=17%, signal=23%
## hsa05164 tags=35%, list=23%, signal=27%
## hsa05162 tags=35%, list=21%, signal=28%
## hsa03013 tags=40%, list=27%, signal=29%
## hsa04218  tags=17%, list=9%, signal=16%
## hsa04062 tags=21%, list=10%, signal=19%
##                                                                                                                                                                                                                                                                                         core_enrichment
## hsa04510                                 5228/7424/1499/4636/83660/7059/5295/1288/23396/3910/3371/3082/1291/394/3791/7450/596/3685/1280/3675/595/2318/3912/1793/1278/1277/1293/10398/55742/2317/7058/25759/56034/3693/3480/5159/857/1292/3908/3909/63923/3913/1287/3679/7060/3479/10451/80310/1311/1101
## hsa05164            3627/3576/56649/6352/4599/6772/6347/3838/3126/3112/5645/91543/5646/4938/4940/3458/64135/54205/2224/3383/4939/23586/4793/5603/3452/5604/7124/29107/3569/3119/9021/834/356/3310/5605/5610/5594/8480/4792/10379/5578/9641/5371/3593/10625/2932/3592/6300/3267/3109/5606/3108/1432/3552
## hsa05162                                                                            898/9134/4599/3559/6772/3561/917/3654/915/4938/4940/3458/64135/1019/916/4068/4939/940/23586/4793/7128/5588/6504/3452/3569/7097/2213/3596/4478/356/3310/1460/5610/4792/9367/10379/9641/3593/1147/3558/2932/3592/3560
## hsa03013 10460/1978/55110/54913/9688/8894/11260/10799/9631/4116/5042/8761/6396/23165/8662/10248/55706/79833/9775/29107/23636/5905/9513/5901/10775/10557/4927/79902/1981/26986/11171/10762/8480/8891/11097/26019/10940/4686/9972/81929/10556/3646/9470/387082/1977/57122/8563/7514/79023/3837/9818/56000
## hsa04218                                                                                                                                                                     2305/4605/9133/890/983/51806/1111/891/993/3576/1978/898/9134/4609/1869/1029/22808/3804/1871/5499/91860/292/1019/11200/1875
## hsa04062                                                                                                                      3627/10563/6373/4283/6362/6355/2921/6364/3576/6352/10663/1230/6772/6347/6351/3055/1237/1236/4067/6354/114/3702/6361/1794/1234/6367/6375/6374/2919/409/4793/2792/6360/5880

KEGG Module over-representation test

KEGG Module is a collection of manually defined function units. In some situation, KEGG Modules have a more straightforward interpretation.

mkk <- enrichMKEGG(gene = gene,
                   organism = 'hsa')

KEGG Module Gene Set Enrichment Analysis

mkk2 <- gseMKEGG(geneList = geneList,
                 species = 'hsa')

Disease analysis

DOSE⁵ supports Disease Ontology (DO) Semantic and Enrichment analysis. The enrichDO function is very useful for identifying disease association of interesting genes, and function gseDO function is designed for gene set enrichment analysis of DO.

In addition, DOSE also supports enrichment analysis of Network of Cancer Gene (NCG)⁶ and Disease Gene Network⁷, please refer to the DOSE vignettes.

Reactome pathway analysis

ReactomePA⁸ uses Reactome as a source of pathway data. The function call of enrichPathway and gsePathway in ReactomePA is consistent with enrichKEGG and gseKEGG.

DAVID functional analysis

clusterProfiler provides enrichment and GSEA analysis with GO, KEGG, DO and Reactome pathway supported internally, some user may prefer GO and KEGG analysis with DAVID⁹ and still attracted by the visualization methods provided by clusterProfiler^???. To bridge the gap between DAVID and clusterProfiler, we implemented enrichDAVID. This function query enrichment analysis result from DAVID webserver via RDAVIDWebService¹⁰ and stored the result as an enrichResult instance, so that we can use all the visualization functions in clusterProfiler to visualize DAVID results. enrichDAVID is fully compatible with compareCluster function and comparing enrichment results from different gene clusters is now available with DAVID.

david <- enrichDAVID(gene = gene,
                     idType = "ENTREZ_GENE_ID",
                     listType = "Gene",
                     annotation = "KEGG_PATHWAY",
                     david.user = "clusterProfiler@hku.hk")

DAVID Web Service has the following limitations:

A job with more than 3000 genes to generate gene or term cluster report will not be handled by DAVID due to resource limit.
No more than 200 jobs in a day from one user or computer.
DAVID Team reserves right to suspend any improper uses of the web service without notice.

For more details, please refer to http://david.abcc.ncifcrf.gov/content.jsp?file=WS.html.

As user has limited usage, please register and use your own user account to run enrichDAVID.

Universal enrichment analysis

clusterProfiler supports both hypergeometric test and gene set enrichment analyses of many ontology/pathway, but it’s still not enough for users may want to analyze their data with unsupported organisms, slim version of GO, novel functional annotation (e.g. GO via BlastGO or KEGG via KAAS), unsupported ontologies/pathways or customized annotations.

clusterProfiler provides enricher function for hypergeometric test and GSEA function for gene set enrichment analysis that are designed to accept user defined annotation. They accept two additional parameters TERM2GENE and TERM2NAME. As indicated in the parameter names, TERM2GENE is a data.frame with first column of term ID and second column of corresponding mapped gene and TERM2NAME is a data.frame with first column of term ID and second column of corresponding term name. TERM2NAME is optional.

An example of using enricher and GSEA to analyze DisGeNet annotation is presented in the post, use clusterProfiler as an universal enrichment analysis tool.

Using MSigDB gene set collections

The MSigDB is a collection of annotated gene sets, it include 8 major collections:

H: hallmark gene sets
C1: positional gene sets
C2: curated gene sets
C3: motif gene sets
C4: computational gene sets
C5: GO gene sets
C6: oncogenic signatures
C7: immunologic signatures

Users can use enricher and GSEA function to analyze gene set collections downloaded from Molecular Signatures Database (MSigDb). clusterProfiler provides a function, read.gmt, to parse the gmt file into a TERM2GENE data.frame that is ready for both enricher and GSEA functions.

gmtfile <- system.file("extdata", "c5.cc.v5.0.entrez.gmt", package="clusterProfiler")
c5 <- read.gmt(gmtfile)

egmt <- enricher(gene, TERM2GENE=c5)
head(egmt)

##                                                ID              Description
## SPINDLE                                   SPINDLE                  SPINDLE
## MICROTUBULE_CYTOSKELETON MICROTUBULE_CYTOSKELETON MICROTUBULE_CYTOSKELETON
## CYTOSKELETAL_PART               CYTOSKELETAL_PART        CYTOSKELETAL_PART
## SPINDLE_MICROTUBULE           SPINDLE_MICROTUBULE      SPINDLE_MICROTUBULE
## MICROTUBULE                           MICROTUBULE              MICROTUBULE
## CYTOSKELETON                         CYTOSKELETON             CYTOSKELETON
##                          GeneRatio  BgRatio       pvalue     p.adjust
## SPINDLE                      11/82  39/5270 7.667674e-12 5.214018e-10
## MICROTUBULE_CYTOSKELETON     16/82 152/5270 8.449298e-10 2.872761e-08
## CYTOSKELETAL_PART            15/82 235/5270 2.414879e-06 5.237096e-05
## SPINDLE_MICROTUBULE           5/82  16/5270 3.080645e-06 5.237096e-05
## MICROTUBULE                   6/82  32/5270 7.740446e-06 1.052701e-04
## CYTOSKELETON                 16/82 367/5270 1.308357e-04 1.482805e-03
##                                qvalue
## SPINDLE                  4.197043e-10
## MICROTUBULE_CYTOSKELETON 2.312439e-08
## CYTOSKELETAL_PART        4.215619e-05
## SPINDLE_MICROTUBULE      4.215619e-05
## MICROTUBULE              8.473751e-05
## CYTOSKELETON             1.193589e-03
##                                                                                                  geneID
## SPINDLE                                           991/9493/9787/22974/983/332/3832/7272/9055/6790/24137
## MICROTUBULE_CYTOSKELETON 991/9493/9133/7153/9787/22974/4751/983/332/3832/7272/9055/6790/24137/4137/7802
## CYTOSKELETAL_PART             991/9493/7153/9787/22974/4751/983/332/3832/7272/9055/6790/24137/4137/7802
## SPINDLE_MICROTUBULE                                                             983/332/3832/9055/24137
## MICROTUBULE                                                                983/332/3832/9055/24137/4137
## CYTOSKELETON             991/9493/9133/7153/9787/22974/4751/983/332/3832/7272/9055/6790/24137/4137/7802
##                          Count
## SPINDLE                     11
## MICROTUBULE_CYTOSKELETON    16
## CYTOSKELETAL_PART           15
## SPINDLE_MICROTUBULE          5
## MICROTUBULE                  6
## CYTOSKELETON                16

egmt2 <- GSEA(geneList, TERM2GENE=c5, verbose=FALSE)
head(egmt2)

##                                                                    ID
## EXTRACELLULAR_REGION                             EXTRACELLULAR_REGION
## EXTRACELLULAR_REGION_PART                   EXTRACELLULAR_REGION_PART
## EXTRACELLULAR_MATRIX                             EXTRACELLULAR_MATRIX
## PROTEINACEOUS_EXTRACELLULAR_MATRIX PROTEINACEOUS_EXTRACELLULAR_MATRIX
## CELL_PROJECTION                                       CELL_PROJECTION
## EXTRACELLULAR_MATRIX_PART                   EXTRACELLULAR_MATRIX_PART
##                                                           Description
## EXTRACELLULAR_REGION                             EXTRACELLULAR_REGION
## EXTRACELLULAR_REGION_PART                   EXTRACELLULAR_REGION_PART
## EXTRACELLULAR_MATRIX                             EXTRACELLULAR_MATRIX
## PROTEINACEOUS_EXTRACELLULAR_MATRIX PROTEINACEOUS_EXTRACELLULAR_MATRIX
## CELL_PROJECTION                                       CELL_PROJECTION
## EXTRACELLULAR_MATRIX_PART                   EXTRACELLULAR_MATRIX_PART
##                                    setSize enrichmentScore       NES
## EXTRACELLULAR_REGION                   401      -0.3860230 -1.706160
## EXTRACELLULAR_REGION_PART              310      -0.4101043 -1.780371
## EXTRACELLULAR_MATRIX                    95      -0.6229461 -2.327514
## PROTEINACEOUS_EXTRACELLULAR_MATRIX      93      -0.6355317 -2.363681
## CELL_PROJECTION                         87      -0.4729701 -1.736928
## EXTRACELLULAR_MATRIX_PART               54      -0.5908035 -1.992545
##                                         pvalue   p.adjust    qvalues rank
## EXTRACELLULAR_REGION               0.001246883 0.02600536 0.02031889 1797
## EXTRACELLULAR_REGION_PART          0.001295337 0.02600536 0.02031889 1897
## EXTRACELLULAR_MATRIX               0.001567398 0.02600536 0.02031889 1473
## PROTEINACEOUS_EXTRACELLULAR_MATRIX 0.001569859 0.02600536 0.02031889 1473
## CELL_PROJECTION                    0.001589825 0.02600536 0.02031889 2280
## EXTRACELLULAR_MATRIX_PART          0.001631321 0.02600536 0.02031889 1794
##                                                      leading_edge
## EXTRACELLULAR_REGION               tags=29%, list=14%, signal=26%
## EXTRACELLULAR_REGION_PART          tags=32%, list=15%, signal=28%
## EXTRACELLULAR_MATRIX               tags=48%, list=12%, signal=43%
## PROTEINACEOUS_EXTRACELLULAR_MATRIX tags=49%, list=12%, signal=44%
## CELL_PROJECTION                    tags=28%, list=18%, signal=23%
## EXTRACELLULAR_MATRIX_PART          tags=59%, list=14%, signal=51%
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       core_enrichment
## EXTRACELLULAR_REGION               3910/51162/2878/2717/3373/4153/10406/1301/6750/7474/4925/7450/80781/1490/1306/3931/4314/6586/3964/10272/8425/8082/11005/4256/3483/7482/8910/23037/27122/7042/3912/4322/167/2817/9353/6037/1278/2934/5176/4060/283/30008/5549/5950/22795/727/10516/23452/1293/2247/1295/1012/6469/2192/1281/4023/54360/50509/11167/4319/1290/9365/3952/10879/11096/2202/4313/3625/2199/6444/6320/1294/3075/4653/5764/3991/3263/1462/1289/3908/4016/3909/4053/8817/7033/8292/5125/2162/5744/1842/5654/10631/2331/3730/2487/347/6863/5104/3913/27123/4982/1300/2200/9607/1287/7060/1489/9723/6424/1307/1311/4693/4148/1101/2922/10647
## EXTRACELLULAR_REGION_PART                                                                                                  268/3567/57124/3910/2878/3373/4153/10406/1301/6750/7474/4925/80781/1490/1306/3931/4314/6586/3964/10272/8425/8082/4256/3483/7482/8910/27122/3912/4322/167/2817/1278/4060/283/30008/5549/5950/22795/727/10516/23452/1293/2247/1295/1012/6469/2192/1281/54360/50509/11167/4319/1290/9365/3952/10879/11096/2202/4313/2199/6444/1294/3075/4653/5764/3991/3263/1462/1289/3908/3909/4053/8817/8292/5125/5744/1842/5654/10631/2331/3730/347/6863/3913/27123/1300/2200/9607/1287/7060/9723/6424/1307/1311/4693/4148/1101/2922/10647
## EXTRACELLULAR_MATRIX                                                                                                                                                                                                                                                                                                                                                                                            1490/1306/8425/8082/4256/8910/3912/1278/4060/283/30008/5549/22795/10516/1293/1295/2192/1281/50509/4319/1290/11096/2202/2199/6444/1294/1462/1289/3908/3909/4053/8292/1842/10631/2331/3730/3913/1300/2200/1287/7060/1307/1311/4148/1101
## PROTEINACEOUS_EXTRACELLULAR_MATRIX                                                                                                                                                                                                                                                                                                                                                                              1490/1306/8425/8082/4256/8910/3912/1278/4060/283/30008/5549/22795/10516/1293/1295/2192/1281/50509/4319/1290/11096/2202/2199/6444/1294/1462/1289/3908/3909/4053/8292/1842/10631/2331/3730/3913/1300/2200/1287/7060/1307/1311/4148/1101
## CELL_PROJECTION                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             4763/57147/64064/80184/322/54997/7248/5311/7042/323/4747/4744/1012/11346/2191/4741/4646/9576/114327/51466/27124/4137/7802
## EXTRACELLULAR_MATRIX_PART                                                                                                                                                                                                                                                                                                                                                                                                                                                              3915/6443/55914/3910/1301/80781/1306/8082/8910/3912/1278/4060/283/30008/22795/1293/1295/1281/50509/1290/6444/1294/1289/3908/3909/8292/3913/1300/2200/1287/1307

Functional analysis of NGS data

Functional analysis using NGS data (eg, RNA-Seq and ChIP-Seq) can be performed by linking coding and non-coding regions to coding genes via ChIPseeker¹¹ package, which can annotates genomic regions to their nearest genes, host genes, and flanking genes respectivly. In addtion, it provides a function, seq2gene, that simultaneously considering host genes, promoter region and flanking gene from intergenic region that may under control via cis-regulation. This function maps genomic regions to genes in a many-to-many manner and facilitate functional analysis. For more details, please refer to ChIPseeker.

Visualization

The function calls of groupGO, enrichGO, enrichKEGG, enrichDO, enrichPathway and enricher are consistent and all the output can be visualized by bar plot, enrichment map and category-gene-network plot. It is very common to visualize the enrichment result in bar or pie chart. We believe the pie chart is misleading and only provide bar chart.

barplot

barplot(ggo, drop=TRUE, showCategory=12)

barplot(ego, showCategory=8)

dotplot

dotplot is a good alternative to barplot.

dotplot(ego)

enrichMap

Enrichment map can be viusalized by enrichMap, which also support results obtained from hypergeometric test and gene set enrichment analysis.

enrichMap(ego)

cnetplot

In order to consider the potentially biological complexities in which a gene may belong to multiple annotation categories and provide information of numeric changes if available, we developed cnetplot function to extract the complex association.

## categorySize can be scaled by 'pvalue' or 'geneNum'
cnetplot(ego, categorySize="pvalue", foldChange=geneList)

plotGOgraph

plotGOgraph, which is based on topGO, can accept output of enrichGO and visualized the enriched GO induced graph.

plotGOgraph(ego)

## $dag
## A graphNEL graph with directed edges
## Number of Nodes = 33 
## Number of Edges = 58 
## 
## $complete.dag
## [1] "A graph with 33 nodes."

gseaplot

Running score of gene set enrichment analysis and its association of phenotype can be visualized by gseaplot.

gseaplot(kk2, geneSetID = "hsa04145")

plotting gsea result

browseKEGG

To view the KEGG pathway, user can use browseKEGG function, which will open web browser and highlight enriched genes.

browseKEGG(kk, 'hsa04110')

pathview from pathview package

clusterProfiler users can also use pathview from the pathview¹² to visualize KEGG pathway.

The following example illustrate how to visualize “hsa04110” pathway, which was enriched in our previous analysis.

library("pathview")
hsa04110 <- pathview(gene.data  = geneList,
                     pathway.id = "hsa04110",
                     species    = "hsa",
                     limit      = list(gene=max(abs(geneList)), cpd=1))

For further information, please refer to the vignette of pathview¹².

Biological theme comparison

clusterProfiler was developed for biological theme comparison², and it provides a function, compareCluster, to automatically calculate enriched functional categories of each gene clusters.

data(gcSample)
lapply(gcSample, head)

## $X1
## [1] "4597"  "7111"  "5266"  "2175"  "755"   "23046"
## 
## $X2
## [1] "23450" "5160"  "7126"  "26118" "8452"  "3675" 
## 
## $X3
## [1] "894"   "7057"  "22906" "3339"  "10449" "6566" 
## 
## $X4
## [1] "5573"  "7453"  "5245"  "23450" "6500"  "4926" 
## 
## $X5
## [1] "5982" "7318" "6352" "2101" "8882" "7803"
## 
## $X6
## [1] "5337"  "9295"  "4035"  "811"   "23365" "4629" 
## 
## $X7
## [1] "2621" "2665" "5690" "3608" "3550" "533" 
## 
## $X8
## [1] "2665" "4735" "1327" "3192" "5573" "9528"

The input for geneCluster parameter should be a named list of gene IDs. To speed up the compilation of this document, we set use_internal_data = TRUE.

ck <- compareCluster(geneCluster = gcSample, fun = "enrichKEGG")
head(as.data.frame(ck))

##   Cluster       ID                  Description GeneRatio  BgRatio
## 1      X2 hsa04110                   Cell cycle    18/357 124/7299
## 2      X2 hsa05340     Primary immunodeficiency     8/357  37/7299
## 3      X2 hsa05200           Pathways in cancer    35/357 395/7299
## 4      X2 hsa04064 NF-kappa B signaling pathway    13/357  95/7299
## 5      X3 hsa04512     ECM-receptor interaction     9/169  82/7299
## 6      X4 hsa04110                   Cell cycle    20/384 124/7299
##         pvalue     p.adjust       qvalue
## 1 2.932169e-05 0.0081807507 0.0077470984
## 2 3.352517e-04 0.0383237111 0.0362922146
## 3 4.120829e-04 0.0383237111 0.0362922146
## 4 6.715477e-04 0.0468404538 0.0443574944
## 5 1.062486e-04 0.0257121642 0.0237102167
## 6 6.158296e-06 0.0009820694 0.0008044511
##                                                                                                                                                                    geneID
## 1                                                                                 991/1869/890/1871/701/990/10926/9088/8317/9700/9134/1029/2810/699/11200/23594/8555/4173
## 2                                                                                                                                     100/6891/3932/973/916/925/958/64421
## 3 3675/1956/1869/324/3480/1871/113/1902/2261/1909/637/355/5888/9134/5915/3908/2246/5154/7704/4437/1029/185/7187/3551/3479/332/5733/330/6654/1288/5914/405/54583/2122/6772
## 4                                                                                                         4067/3383/7128/3932/5971/4050/6850/7187/3551/10892/5588/330/958
## 5                                                                                                                            7057/3339/1299/3695/1101/3679/3910/3696/3693
## 6                                                                        6500/9184/4172/994/4175/4171/1387/10274/8697/902/4616/5591/4176/8881/7043/983/1022/1028/891/4173
##   Count
## 1    18
## 2     8
## 3    35
## 4    13
## 5     9
## 6    20

Formula interface of compareCluster

compareCluster also supports passing a formula (the code to support formula has been contributed by Giovanni Dall’Olio) of type $Entrez \sim group$ or $Entrez \sim group + othergroup$ .

mydf <- data.frame(Entrez=names(geneList), FC=geneList)
mydf <- mydf[abs(mydf$FC) > 1,]
mydf$group <- "upregulated"
mydf$group[mydf$FC < 0] <- "downregulated"
mydf$othergroup <- "A"
mydf$othergroup[abs(mydf$FC) > 2] <- "B"

formula_res <- compareCluster(Entrez~group+othergroup, data=mydf, fun="enrichKEGG")

head(as.data.frame(formula_res))

##           Cluster         group othergroup       ID
## 1 downregulated.A downregulated          A hsa04974
## 2 downregulated.A downregulated          A hsa04510
## 3 downregulated.A downregulated          A hsa04512
## 4 downregulated.A downregulated          A hsa05414
## 5 downregulated.A downregulated          A hsa04010
## 6 downregulated.A downregulated          A hsa04151
##                        Description GeneRatio  BgRatio       pvalue
## 1 Protein digestion and absorption    15/276  90/7299 1.125985e-06
## 2                   Focal adhesion    20/276 199/7299 5.801183e-05
## 3         ECM-receptor interaction    11/276  82/7299 2.377255e-04
## 4     Dilated cardiomyopathy (DCM)    11/276  90/7299 5.404096e-04
## 5           MAPK signaling pathway    23/276 294/7299 7.091830e-04
## 6       PI3K-Akt signaling pathway    26/276 351/7299 7.351909e-04
##       p.adjust       qvalue
## 1 0.0002837482 0.0002619396
## 2 0.0073094910 0.0067476922
## 3 0.0199689379 0.0184341491
## 4 0.0308780190 0.0285047711
## 5 0.0308780190 0.0285047711
## 6 0.0308780190 0.0285047711
##                                                                                                                                   geneID
## 1                                                            1281/50509/1290/477/1294/1360/1289/1292/23428/1359/1300/1287/6505/2006/7373
## 2                               55742/2317/7058/25759/56034/3693/3480/5159/857/1292/3908/3909/63923/3913/1287/3679/7060/3479/10451/80310
## 3                                                                                7058/3693/3339/1292/3908/3909/63923/3913/1287/3679/7060
## 4                                                                                55799/27092/6444/3693/775/3908/5350/7043/3679/3479/9254
## 5                55799/3306/2317/4214/55970/56034/27092/4254/3480/4908/3305/5159/775/8817/4915/3551/7786/7043/57551/3479/9254/1846/80310
## 6 55970/5618/7058/10161/56034/3693/4254/3480/4908/5159/1292/3908/2690/3909/8817/4915/3551/2791/63923/3913/3667/1287/3679/7060/3479/80310
##   Count
## 1    15
## 2    20
## 3    11
## 4    11
## 5    23
## 6    26

Visualization of profile comparison

We can visualize the result using dotplot method.

dotplot(ck)

dotplot(formula_res)

dotplot(formula_res, x=~group) + ggplot2::facet_grid(~othergroup)

By default, only top 5 (most significant) categories of each cluster was plotted. User can changes the parameter showCategory to specify how many categories of each cluster to be plotted, and if showCategory was set to NULL, the whole result will be plotted.

The plot function accepts a parameter by for setting the scale of dot sizes. The default parameter by is setting to “geneRatio”, which corresponding to the “GeneRatio” column of the output. If it was setting to count, the comparison will be based on gene counts, while if setting to rowPercentage, the dot sizes will be normalized by count/(sum of each row)

To provide the full information, we also provide number of identified genes in each category (numbers in parentheses) when by is setting to rowPercentage and number of gene clusters in each cluster label (numbers in parentheses) when by is setting to geneRatio, as shown in Figure 3. If the dot sizes were based on count, the row numbers will not shown.

The p-values indicate that which categories are more likely to have biological meanings. The dots in the plot are color-coded based on their corresponding p-values. Color gradient ranging from red to blue correspond to in order of increasing p-values. That is, red indicate low p-values (high enrichment), and blue indicate high p-values (low enrichment). P-values and adjusted p-values were filtered out by the threshold giving by parameter pvalueCutoff, and FDR can be estimated by qvalue.

User can refer to the example in Yu (2012)²; we analyzed the publicly available expression dataset of breast tumour tissues from 200 patients (GSE11121, Gene Expression Omnibus)¹³. We identified 8 gene clusters from differentially expressed genes, and using compareCluster to compare these gene clusters by their enriched biological process.

The comparison function was designed as a framework for comparing gene clusters of any kind of ontology associations, not only groupGO, enrichGO, enrichKEGG and enricher provided in this package, but also other biological and biomedical ontologies, for instance, enrichDO from DOSE⁵, enrichMeSH from meshes and enrichPathway from ReactomePA work fine with compareCluster for comparing biological themes in disease and reactome pathway perspective. More details can be found in the vignettes of DOSE⁵ and ReactomePA.

Homepage

Please visit clusterProfiler homepage for more information.

Session Information

Here is the output of sessionInfo() on the system on which this document was compiled:

## R version 3.4.2 (2017-09-28)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
## 
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.6-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.6-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
##  [1] grid      parallel  stats4    stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
##  [1] Rgraphviz_2.22.0      clusterProfiler_3.6.0 GSEABase_1.40.0      
##  [4] annotate_1.56.0       XML_3.98-1.9          topGO_2.30.0         
##  [7] SparseM_1.77          graph_1.56.0          org.Hs.eg.db_3.4.2   
## [10] GO.db_3.4.2           AnnotationDbi_1.40.0  IRanges_2.12.0       
## [13] S4Vectors_0.16.0      Biobase_2.38.0        BiocGenerics_0.24.0  
## [16] DOSE_3.4.0           
## 
## loaded via a namespace (and not attached):
##  [1] qvalue_2.10.0       prettydoc_0.2.0     fgsea_1.4.0        
##  [4] purrr_0.2.4         reshape2_1.4.2      splines_3.4.2      
##  [7] lattice_0.20-35     colorspace_1.3-2    htmltools_0.3.6    
## [10] yaml_2.1.14         blob_1.1.0          rlang_0.1.2        
## [13] glue_1.2.0          DBI_0.7             BiocParallel_1.12.0
## [16] bit64_0.9-7         matrixStats_0.52.2  rvcheck_0.0.9      
## [19] plyr_1.8.4          stringr_1.2.0       munsell_0.4.3      
## [22] GOSemSim_2.4.0      gtable_0.2.0        memoise_1.1.0      
## [25] evaluate_0.10.1     labeling_0.3        knitr_1.17         
## [28] highr_0.6           Rcpp_0.12.13        xtable_1.8-2       
## [31] backports_1.1.1     scales_0.5.0        DO.db_2.9          
## [34] bit_1.1-12          gridExtra_2.3       fastmatch_1.1-0    
## [37] ggplot2_2.2.1       digest_0.6.12       stringi_1.1.5      
## [40] rprojroot_1.2       bitops_1.0-6        tools_3.4.2        
## [43] magrittr_1.5        RCurl_1.95-4.8      lazyeval_0.2.1     
## [46] tibble_1.3.4        RSQLite_2.0         tidyr_0.7.2        
## [49] pkgconfig_2.0.1     data.table_1.10.4-3 rmarkdown_1.6      
## [52] igraph_1.1.2        compiler_3.4.2

References

1. Yu, G. et al. GOSemSim: An r package for measuring semantic similarity among go terms and gene products. Bioinformatics 26, 976–978 (2010).

2. Yu, G., Wang, L.-G., Han, Y. & He, Q.-Y. ClusterProfiler: An r package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology 16, 284–287 (2012).

3. Boyle, E. I. et al. GO::TermFinder–open source software for accessing gene ontology information and finding significantly enriched gene ontology terms associated with a list of genes. Bioinformatics (Oxford, England) 20, 3710–3715 (2004).

4. Subramanian, A. et al. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America 102, 15545–15550 (2005).

5. Yu, G., Wang, L.-G., Yan, G.-R. & He, Q.-Y. DOSE: An r/bioconductor package for disease ontology semantic and enrichment analysis. Bioinformatics 31, 608–609 (2015).

6. A., O., D., G. M., M., T. P. & C., F. D. NCG 5.0: Updates of a manually curated repository of cancer genes and associated properties from cancer mutational screenings. Nucleic Acids Research 44, D992–D999 (2016).

7. Janet, P. et al. DisGeNET: A discovery platform for the dynamical exploration of human diseases and their genes. Database 2015, bav028 (2015).

8. Yu, G. & He, Q.-Y. ReactomePA: An r/bioconductor package for reactome pathway analysis and visualization. Mol. BioSyst. 12, 477–479 (2016).

9. Huang, D. et al. The DAVID gene functional classification tool: A novel biological module-centric algorithm to functionally analyze large gene lists. Genome Biology 8, R183 (2007).

10. Fresno, C. & Fernández, E. A. RDAVIDWebService: A versatile r interface to DAVID. Bioinformatics 29, 2810–2811 (2013).

11. Yu, G., Wang, L.-G. & He, Q.-Y. ChIPseeker: An r/bioconductor package for chip peak annotation, comparison and visualization. Bioinformatics 31, 2382–2383 (2015).

12. Luo, W. & Brouwer, C. Pathview: An R/Bioconductor package for pathway-based data integration and visualization. Bioinformatics 29, 1830–1831 (2013).

13. Schmidt, M. et al. The humoral immune system has a key prognostic impact in node-negative breast cancer. Cancer Research 68, 5405–5413 (2008).

Statistical analysis and visualization of functional profiles for genes and gene clusters

Guangchuang Yu School of Public Health, The University of Hong Kong

2017-10-30

Abstract

Supported Analysis

Supported ontologies/pathways

Visualization

Citation

Introduction

bitr: Biological Id TranslatoR

bitr_kegg: converting biological IDs using KEGG API

GO Analysis

Supported organisms

GO classification

GO over-representation test

drop specific GO terms or level

test GO at sepcific level

reduce redundancy of enriched GO terms

GO Gene Set Enrichment Analysis

GO Semantic Similarity Analysis

KEGG analysis

KEGG over-representation test

KEGG Gene Set Enrichment Analysis

KEGG Module over-representation test

KEGG Module Gene Set Enrichment Analysis

Disease analysis

Reactome pathway analysis

DAVID functional analysis

Universal enrichment analysis

Using MSigDB gene set collections

Functional analysis of NGS data

Visualization

barplot

dotplot

enrichMap

cnetplot

plotGOgraph

gseaplot

browseKEGG

pathview from pathview package

Biological theme comparison

Formula interface of compareCluster

Visualization of profile comparison

Homepage

Session Information

References

Guangchuang Yu
School of Public Health, The University of Hong Kong

`bitr`: Biological Id TranslatoR

`bitr_kegg`: converting biological IDs using KEGG API