FoldGO: a tool for fold-change-specific functional enrichment analysis

A typical scenario of transcriptome data analysis is identification of differentially expressed genes (DEGs), those with significant changes in the number of transcripts (fold-change), and functional enrichment analysis of DEGs lists using Gene Ontology. Classical gene set enrichment analysis ignores the difference in the fold-changes that lead to the loss of valuable information. The FoldGO method has been created to identify the GO terms significantly overrepresented for the genes responded to the factor within a narrow fold-change-interval. FoldGO processes the DEGs list in three steps:

See an example of the FoldGO performance in the analysis of the transcriptome data on expression changes of Arabidopsis thaliana genes in response to plant hormone auxin treatment (Omelyanchuk et al., 2017)¹.

Workflow

FoldGO pipeline consists of three steps:

gene sublists generation;
GO terms preselection (optional);
fold-change-specific enrichment analysis.

As input data the algorithm uses the tables for up- and down-regulated genes that contain Gene IDs and their fold-change values:

GeneID	fold-change
AT3G65420	3.6
AT1G78450	1.5
AT2G66890	2.1
…

First, one has to separate initial set of genes in to quantiles and generate unions of all neighbouring quantiles. For example, we will use built-in data derived from experiment on auxin treatment of Arabidopsis thaliana roots. GeneGroups function will take only first two columns, so be sure that your data have gene identifiers in the first column and fold-change values (FC) in the second one. Note that data for up- and down-regulation must be processed separately. In the following example demonstrating GeneGroups function usage only the data for up-regulation is used.

head(degenes)

GeneID	FC	pval	qval
AT1G01120	2.069	2.70e-10	2.60e-08
AT1G02520	1.641	7.79e-05	2.06e-03
AT1G02730	1.290	3.37e-05	9.95e-04
AT1G02850	1.978	2.50e-06	1.01e-04
AT1G02900	1.138	1.22e-03	1.95e-02
AT1G03110	2.686	8.44e-05	2.22e-03

up_groups <- GeneGroups(degenes, quannumber=6)

At the next step we will conduct functional enrichment analysis of generated groups. For functional enrichment analysis FoldGO uses TopGO package.

Functional annotation

Custom annotaion

For custom annotation one has to provide GO id -> Gene id list or object of MgsaSets (from mgsa package) or GAFReader class. FoldGO provides GAFReader - simple and convinient parser for annotation presented in GAF format. It takes only two arguments:

file - full path to GAF file.
geneid_col - index of column with Genes IDs

NOTE: The GAF file in example is truncated. The original full file can be downloaded from GO consortium website: http://www.geneontology.org/page/download-go-annotations.

gaf_path <- system.file("extdata", "gene_association.tair.lzma", package = "FoldGO")
gaf <- GAFReader(file = gaf_path,  geneid_col = 10)

One can retrieve direct annotations as list object and version of GAF file using following methods:

getVersion(gaf)
#> [1] "2.0"

getAnnotation(gaf)

To annotate our gene groups we will use FuncAnnotGroupsTopGO function which uses topGO package for singular enrichment analysis. The minimal set of arguments needed for this function to work is:

groups - object of GeneGroups class
namespace - character string specifing GO namespace. This argument accepts the following values: “BP”, “MF” or “CC”, where
- BP - biological process
- MF - molecular function
- CC - cellular component
customAnnot - It can be an object generated by or function from mgsa package or list which has GO term ids as keys and character vectors contain gene ids as values.
annot - functions used to compile a list of GO terms such that each element in the list is a character vector containing all the gene identifiers that are mapped to the respective GO term. Here it must be assigned with topGO::annFUN.GO2genes
bggenes - vector contains background set of genes

up_annotobj <- FuncAnnotGroupsTopGO(genegroups = up_groups, namespace = "MF", customAnnot = gaf, annot = topGO::annFUN.GO2genes, bggenes = bggenes)

Using annotaions from Bioconductor packages

Another possibility is to use bioconductor packages containing annotations for specific organism. For example “org.Hs.eg.db” (Human) and “org.At.tair.db” (Arabidopsis), package name must be assigned to mapping argument. In this case one has to assign topGO::annFUN.org to annot argument and specify ID (from topGO package manual: character string specifing the gene identifier to use). Currently only the following identifiers can be used: c("entrez", "genbank", "alias", "ensembl", "symbol", "genename", "unigene").

Enrichment analysis with Arabidopsis annotations:

up_annotobj <- FuncAnnotGroupsTopGO(up_groups,"MF", mapping = "org.At.tair.db", annot = topGO::annFUN.org, ID = "entrez", bggenes = bggenes)

Enrichment analysis with annotations for Human:

up_groups <- GeneGroups(degenes_hum, quannumber=6)
FuncAnnotGroupsTopGO(up_groups,"MF", mapping = "org.Hs.eg.db", annot = topGO::annFUN.org, ID = "ensembl", bggenes = bggenes_hum)

Testing for fold-specificity

The fold-specificity recognition procedure consists of GO terms preselection from DEGs annotation and fold-change-specific enrichment analysis. At each step the FDR threshold must be established. By default FDR threshold for GO terms preselection (fdrstep1) is set to 1 (no preselection) and FDR threshold for fold-change-specific enrichment analysis (fdrstep2) is set to 0.05. As a default method for mutltiple testing correction FoldGO uses Benjamini-Hochberg correction procedure².

up_fsobj <- FoldSpecTest(up_annotobj, fdrstep1 = 0.05, fdrstep2 = 0.01)
down_fsobj <- FoldSpecTest(down_annotobj, fdrstep1 = 0.05, fdrstep2 = 0.01)

It is possible to choose another correction procedure from R base which can be listed via p.adjust.methods. Here Benjamini-Yekutieli correction procedure³ is selected.

FoldSpecTest(up_annotobj, padjmethod = "BY")

One can inspect the results of enrichment analysis as dataframes. Access dataframe with fold-specific terms

fs_table <- getFStable(up_fsobj)

head(fs_table)

ids	namespace	name	wholeint_pval	wholeint_padj	min_pval	padj	interval
GO:0003723	MF	RNA binding	1.80e-07	2.71e-05	7.28e-05	9.26e-03	1-4
GO:0003735	MF	structural constituent of ribosome	1.00e-30	1.13e-27	2.98e-27	2.32e-24	1-2
GO:0003743	MF	translation initiation factor activity	6.50e-05	4.80e-03	8.31e-05	9.26e-03	1-2
GO:0005198	MF	structural molecule activity	1.00e-30	1.13e-27	1.94e-26	7.57e-24	1-2
GO:0008135	MF	translation factor activity, RNA binding	1.80e-07	2.71e-05	2.56e-08	6.67e-06	1-2
GO:0097159	MF	organic cyclic compound binding	2.90e-07	3.88e-05	3.89e-06	6.07e-04	1-5

where:

ids - GO term identifier
namespace - GO term namespace
name - GO term full name
wholeint_pval - p-value for specific GO term derived from annotation for all differentially expressed genes set
wholeint_padj - q-value for specific GO term derived from annotation for all differentially expressed genes set
min_pval - minimal p-value for specific GO term across all intervals
padj - adjusted minimal p-value for specific GO term across all intervals
interval - interval that corresponds to minimal p-value for specific GO term

And with not fold-specific:

nfs_table <- getNFStable(up_fsobj)

head(nfs_table)

ids	namespace	name	wholeint_pval	wholeint_padj	min_pval	padj	interval
GO:0000166	MF	nucleotide binding	3.70e-07	4.17e-05	3.30e-04	2.58e-02	1-5
GO:0003697	MF	single-stranded DNA binding	5.20e-04	3.01e-02	9.43e-02	1.00e+00	1-4
GO:0003746	MF	translation elongation factor activity	2.40e-04	1.46e-02	2.11e-04	2.06e-02	1
GO:0003883	MF	CTP synthase activity	3.90e-04	2.31e-02	2.26e-01	1.00e+00	4-5
GO:0004386	MF	helicase activity	6.50e-08	1.13e-05	2.32e-02	6.04e-01	3
GO:0005488	MF	binding	6.00e-06	5.41e-04	6.48e-04	4.50e-02	1-5

Plot results

Via plot function one can plot “Fold-change specific GO Profile” on which the GO terms significantly associated with a certain fold-change intervals are presented in yellow and blue boxes for up- and down-regulated genes, correspondingly. Here the result for six equal in size fold-change intervals is presented. The diagram presents only fold-change-specific terms. If the gene was associated fold-specifically with down-regulation but not fold-specifically with up-regulation (or vise versa) than not fold-change-specific interval (1-6 here) will be also shown.

plot(up_fsobj, down_fsobj)

Omelyanchuk, N. A. et al. Auxin regulates functional gene groups in a fold-change-specific manner in Arabidopsis thaliana roots // Nat Sci Rep – 2017. – N 7(1) – p 2489↩
Benjamini, Y., Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society - 1995 - Series B. 57 (1): 289–300↩
Benjamini, Y., Yekutieli. The control of the false discovery rate in multiple testing under dependency. Annals of Statistics - 2001 - 29 (4): 1165–1188. doi:10.1214/aos/1013699998 ↩