1 Overview of GEOmetadb

The NCBI Gene Expression Omnibus (GEO) represents the largest repository of microarray data in existence. One difficulty in dealing with GEO is finding the microarray data that is of interest. As part of the NCBI Entrez search system, GEO can be searched online via web pages or using NCBI Eutils. However, the web search is not as full-featured as it could be, particularly for programmatic access. NCBI Eutils offers another option for finding data within the vast stores of GEO, but it is cumbersome to use, often requiring multiple complicated Eutils calls to get at the relevant information. We have found it absolutely critical to have ready access not just to the microarray data, but to the metadata describing the microarray experiments. To this end we have created GEOmetadb.

1.1 Citing GEOmetadb

If you use GEOmetadb or the associated SQLite database, please cite as follows:

## 
## Please cite the following if utilizing the GEOmetadb software:
## 
##   Zhu Y, Davis S, Stephens R, Meltzer PS, Chen Y. GEOmetadb: powerful
##   alternative search engine for the Gene Expression Omnibus.
##   Bioinformatics. 2008 Dec 1;24(23):2798-800. doi:
##   10.1093/bioinformatics/btn520. Epub 2008 Oct 7. PubMed PMID:
##   18842599; PubMed Central PMCID: PMC2639278.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Article{,
##     author = {Yuelin Zhu and Sean Davis and Robert Stephens and Paul S. Meltzer and Yidong Chen},
##     title = {GEOmetadb: powerful alternative search engine for the Gene Expression Omnibus.},
##     journal = {Bioinformatics (Oxford, England)},
##     year = {2008},
##     month = {Dec},
##     day = {01},
##     volume = {24},
##     number = {23},
##     pages = {2798--2800},
##     abstract = {The NCBI Gene Expression Omnibus (GEO) represents the largest public repository of microarray data. However, finding data in GEO can be challenging. We have developed GEOmetadb in an attempt to make querying the GEO metadata both easier and more powerful. All GEO metadata records as well as the relationships between them are parsed and stored in a local MySQL database. A powerful, flexible web search interface with several convenient utilities provides query capabilities not available via NCBI tools. In addition, a Bioconductor package, GEOmetadb that utilizes a SQLite export of the entire GEOmetadb database is also available, rendering the entire GEO database accessible with full power of SQL-based queries from within R.},
##     issn = {1367-4811},
##     doi = {10.1093/bioinformatics/btn520},
##     url = {http://www.ncbi.nlm.nih.gov/pubmed/18842599},
##     language = {eng},
##   }

1.2 What is GEOmetadb?

The GEOmetadb package is an attempt to make querying the metadata describing microarray experiments, platforms, and datasets both easier and more powerful. At the heart of GEOmetadb is a SQLite database that stores nearly all the metadata associated with all GEO data types including GEO samples (GSM), GEO platforms (GPL), GEO data series (GSE), and curated GEO datasets (GDS), as well as the relationships between these data types. This database is generated by our server by parsing all the records in GEO and needs to be downloaded via a simple helper function to the user’s local machine before GEOmetadb is useful. Once this is done, the entire GEO database is accessible with simple SQL-based queries. With the GEOmetadb database, queries that are simply not possible using NCBI tools or web pages are often quite simple.

The relationships between the tables in the GEOmetadb SQLite database can be seen in the following entity-relationship diagram.

1.3 Conversion capabilities

A very typical problem for large-scale consumers of GEO data is to determine the relationships between various GEO accession types. As examples, consider the following questions:

  • What samples are associated with GEO platform “GPL96”, which represents the Affymetrix hgu133a array?}
  • What GEO Series were performed using GPL96?
  • What samples are in my favorite three GEO Series records?
  • How many samples are associated with the ten most popular GEO platforms?

Because these types of questions are common, GEOmetadb contains the function geoConvert that addresses these questions directly and efficiently.

1.4 What GEOmetadb is not

We have faithfully parsed and maintained in GEO when creating GEOmetadb. This means that limitations inherent to GEO are also inherent in GEOmetadb. We have made no attempt to curate, semantically recode, or otherwise “clean up”" GEO; to do so would require significant resources, which we do not have.

GEOmetadb does not contain any microarray data. For access to microarray data from within R/Bioconductor, please look at the GEOquery package. In fact, we would expect that many users will find that the combination of GEOmetadb and GEOquery is quite powerful.

2 Getting Started

Once GEOmetadb is installed (see the Bioconductor website for full installation instructions), we are ready to begin.

2.1 Getting the GEOmetadb database

This package does not come with a pre-installed version of the database. This has the advantage that the user will get the most up-to-date version of the database to start; the database can be re-downloaded using the same command as often as desired. First, load the library.

The download and uncompress steps are done automatically with a single command, getSQLiteFile. This production SQLite database is huge. For demo purpose, we use a demo database file here.

## Unzipping...
## Metadata associate with downloaded file:
##                 name               value
## 1     schema version                 1.0
## 2 creation timestamp 2020-04-07 20:53:49

The default storage location is in the current working directory and the default filename is “GEOmetadb.sqlite”; it is best to leave the name unchanged unless there is a pressing reason to change it.

Since this SQLite file is of key importance in GEOmetadb, it is perhaps of some interest to know some details about the file itself.

##                                                                               size
## /tmp/Rtmp3IYQ5O/Rbuild3ea5e26e7f006d/GEOmetadb/vignettes/GEOmetadb.sqlite 87445504
##                                                                           isdir
## /tmp/Rtmp3IYQ5O/Rbuild3ea5e26e7f006d/GEOmetadb/vignettes/GEOmetadb.sqlite FALSE
##                                                                           mode
## /tmp/Rtmp3IYQ5O/Rbuild3ea5e26e7f006d/GEOmetadb/vignettes/GEOmetadb.sqlite  644
##                                                                                         mtime
## /tmp/Rtmp3IYQ5O/Rbuild3ea5e26e7f006d/GEOmetadb/vignettes/GEOmetadb.sqlite 2021-10-26 17:31:32
##                                                                                         ctime
## /tmp/Rtmp3IYQ5O/Rbuild3ea5e26e7f006d/GEOmetadb/vignettes/GEOmetadb.sqlite 2021-10-26 17:31:32
##                                                                                         atime
## /tmp/Rtmp3IYQ5O/Rbuild3ea5e26e7f006d/GEOmetadb/vignettes/GEOmetadb.sqlite 2021-10-26 17:31:32
##                                                                            uid
## /tmp/Rtmp3IYQ5O/Rbuild3ea5e26e7f006d/GEOmetadb/vignettes/GEOmetadb.sqlite 1002
##                                                                            gid
## /tmp/Rtmp3IYQ5O/Rbuild3ea5e26e7f006d/GEOmetadb/vignettes/GEOmetadb.sqlite 1002
##                                                                               uname
## /tmp/Rtmp3IYQ5O/Rbuild3ea5e26e7f006d/GEOmetadb/vignettes/GEOmetadb.sqlite biocbuild
##                                                                              grname
## /tmp/Rtmp3IYQ5O/Rbuild3ea5e26e7f006d/GEOmetadb/vignettes/GEOmetadb.sqlite biocbuild

Now, the SQLite file is available for connection. The standard DBI functionality as implemented in RSQLite function dbConnect makes the connection to the database. The dbDisconnect function disconnects the connection.

The variable con is an RSQLite connection object.

2.2 A word about SQL

The Structured Query Language, or SQL, is a very powerful and standard way of working with relational data. GEO is composed of several data types, all of which are related to each other; in fact, NCBI uses a relational SQL database for metadata storage and querying. SQL databases and SQL itself are designed specifically to work efficiently with just such data. While the goal of many programming projects and programmers is to hide the details of SQL from the user, we are of the opinion that such efforts may be counterproductive, particularly with complex data and the need for ad hoc queries, both of which are characteristics with GEO metadata. We have taken the view that exposing the power of SQL will enable users to maximally utilize the vast data repository that is GEO. We understand that many users are not accustomed to working with SQL and, therefore, have devoted a large section of the vignette to working examples. Our goal is not to teach SQL, so a quick tutorial of SQL is likely to be beneficial to those who have not used it before. Many such tutorials are available online and can be completed in 30 minutes or less.

3 Examples

3.1 Interacting with the database

The functionality covered in this section is covered in much more detail in the DBI and RSQLite package documentation. We cover enough here only to be useful.

Again, we connect to the database.

The dbListTables function lists all the tables in the SQLite database handled by the connection object con.

##  [1] "gds"               "gds_subset"       
##  [3] "geoConvert"        "geodb_column_desc"
##  [5] "gpl"               "gse"              
##  [7] "gse_gpl"           "gse_gsm"          
##  [9] "gsm"               "metaInfo"         
## [11] "sMatrix"

There is also the dbListFields function that can list database fields associated with a table.

##  [1] "ID"                   "title"               
##  [3] "gse"                  "status"              
##  [5] "submission_date"      "last_update_date"    
##  [7] "pubmed_id"            "summary"             
##  [9] "type"                 "contributor"         
## [11] "web_link"             "overall_design"      
## [13] "repeats"              "repeats_sample_list" 
## [15] "variable"             "variable_description"
## [17] "contact"              "supplementary_file"

Sometimes it is useful to get the actual SQL schema associated with a table. Here, we get the table schema for the gpl table.

##    cid                 name type notnull dflt_value pk
## 1    0                   ID REAL       0         NA  0
## 2    1                title TEXT       0         NA  0
## 3    2                  gpl TEXT       0         NA  0
## 4    3               status TEXT       0         NA  0
## 5    4      submission_date TEXT       0         NA  0
## 6    5     last_update_date TEXT       0         NA  0
## 7    6           technology TEXT       0         NA  0
## 8    7         distribution TEXT       0         NA  0
## 9    8             organism TEXT       0         NA  0
## 10   9         manufacturer TEXT       0         NA  0
## 11  10 manufacture_protocol TEXT       0         NA  0
## 12  11              coating TEXT       0         NA  0
## 13  12       catalog_number TEXT       0         NA  0
## 14  13              support TEXT       0         NA  0
## 15  14          description TEXT       0         NA  0
## 16  15             web_link TEXT       0         NA  0
## 17  16              contact TEXT       0         NA  0
## 18  17       data_row_count REAL       0         NA  0
## 19  18   supplementary_file TEXT       0         NA  0
## 20  19         bioc_package TEXT       0         NA  0

3.2 Writing SQL queries and getting results

Select 5 records from the gse table and show the first 7 columns.

##      ID
## 1   869
## 2  8806
## 3 97992
## 4 97994
## 5  8807
##                                                                                                         title
## 1                                                                 Osteosarcoma TE85 cell tissue culture study
## 2                            Age-dependent aorta transcriptomes in wild-type and apoE-deficient C57BL/6J mice
## 3 Molecular and functional sex differences of noradrenergic neurons in the mouse locus coeruleus [Affymetrix]
## 4              Molecular and functional sex differences of noradrenergic neurons in the mouse locus coeruleus
## 5                                                    Gene expression profiling in NCoR deficient mouse livers
##         gse                status submission_date
## 1   GSE1000 Public on Jan 28 2004      2004-01-27
## 2  GSE10000 Public on Dec 16 2008      2007-12-21
## 3 GSE100002 Public on Apr 30 2018      2017-06-13
## 4 GSE100005 Public on Apr 30 2018      2017-06-13
## 5  GSE10001 Public on Dec 04 2008      2007-12-21
##   last_update_date pubmed_id
## 1       2013-04-17        NA
## 2       2013-05-22  19139167
## 3       2018-05-01        NA
## 4       2018-05-01        NA
## 5       2013-05-01  19052228

Get the GEO series accession and title from GEO series that were submitted by “Sean Davis”. The “%” sign is used in combination with the “like” operator to do a “wildcard” search for the name “Sean Davis” with any number of characters before or after or between “Sean” and “Davis”.

##         gse
## 1  GSE14543
## 2  GSE15621
## 3  GSE16087
## 4  GSE16088
## 5  GSE16091
## 6  GSE16102
## 7  GSE18544
## 8  GSE19063
## 9  GSE20016
## 10 GSE22520
## 11 GSE25127
## 12 GSE25164
## 13  GSE2553
## 14 GSE29428
## 15 GSE33249
## 16 GSE41795
## 17  GSE4406
## 18 GSE52137
## 19  GSE5357
## 20  GSE5481
## 21  GSE7376
## 22  GSE7882
## 23 GSE80005
## 24  GSE8486
## 25  GSE9328
##                                                                                                                      title
## 1                                                                              A molecular function map of Ewing’s Sarcoma
## 2                                                                  Acute Lymphocytic Leukemia versus associated xenografts
## 3                                                                          Gene expression profiles of canine osteosarcoma
## 4                                                                           Gene expression profiles of human osteosarcoma
## 5                                                                     Gene expression profiles of human osteosarcoma, set2
## 6                                                                Gene expression profiles of canine and human osteosarcoma
## 7      Expression Profiling of a Mouse Xenograft Model of “Triple-Negative” Breast Cancer Brain Metastases With Vorinostat
## 8                                                           Genome-wide map of PAX3-FKHR binding sites in rhabdomyosarcoma
## 9  Analyses of Human Brain Metastases of Breast Cancer Reveal the Association between HK2 Up-Regulation and Poor Prognosis
## 10                                             Mouse Models of Alveolar/Embryonal Rhabdomyosarcoma & Spindle Cell Sarcomas
## 11                                                                       Ewing Sarcoma cell lines treated with mithramycin
## 12                                                                                         UV effects in mouse melanocytes
## 13                                                                                                     NHGRI_Sarcoma_Baird
## 14       A methyl-deviator epigenotype of estrogen-receptor-positive breast carcinoma is associated with malignant biology
## 15                    High resolution comparative genomic hybridisation analysis of ductal carcinoma in situ of the breast
## 16                               Accelerated high-yield generation of limb-innervating motor neurons from human stem cells
## 17                                          Gene expression profiling of CD4+ T-cells and GM6990 lymphoblastoid cell lines
## 18                                            Expression data from thymic epithelial cells of 5-6 weeks old C57BL/6J mice.
## 19                                                                                                   NHGRI Menin ChIP-Chip
## 20                                                                                             BRAF siRNA NHGRI_Liang_BRAF
## 21                                                               Detection of novel amplification units in prostate cancer
## 22                         Gene Expression and Comparative Genomic Hybridization of Ductal Carcinoma In Situ of the Breast
## 23                                                     Gene expression profile in mouse esophageal squamous cell carcinoma
## 24                                                               Whole genome DNAse hypersensitivity in human CD4+ T-cells
## 25                                                                                             ATF2 knockout in papillomas

As another example, GEOmetadb can find all samples on GPL96 (Affymetrix hgu133a) that have .CEL files available for download.

## [1] 5000    2

But why limit to only GPL96? Why not look for all Affymetrix arrays that have .CEL files? And list those with their associated GPL information, as well as the Bioconductor annotation package name?

##   bioc_package   gpl        gsm
## 1      hgu133a GPL96 GSM1003058
## 2      hgu133a GPL96 GSM1003059
## 3      hgu133a GPL96 GSM1003060
## 4      hgu133a GPL96 GSM1003061
## 5      hgu133a GPL96 GSM1003062
##                                                                        supplementary_file
## 1 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM1003nnn/GSM1003058/suppl/GSM1003058_C1.cel.gz
## 2 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM1003nnn/GSM1003059/suppl/GSM1003059_C2.cel.gz
## 3 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM1003nnn/GSM1003060/suppl/GSM1003060_C3.cel.gz
## 4 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM1003nnn/GSM1003061/suppl/GSM1003061_C4.cel.gz
## 5 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM1003nnn/GSM1003062/suppl/GSM1003062_C5.cel.gz

Of course, we can combine programming and data access. A simple sapply example shows how to query each of the tables for number of records.

##                    [,1]
## gds                3566
## gds_subset            0
## geoConvert        67734
## geodb_column_desc   104
## gpl                1433
## gse               31156
## gse_gpl            6085
## gse_gsm            5650
## gsm                5000
## metaInfo              2
## sMatrix            5000

3.3 Conversion of GEO entity types

Large-scale consumers of GEO data might want to convert GEO entity type from one to others, e.g. finding all GSM and GSE associated with ‘GPL96’. Function goeConvert does the conversion with a very fast mapping between entity types.

Covert ‘GPL96’ to other possible types in the GEOmetadb.sqlite.

Check what GEO types and how many entities in each type in the conversion.

## $gse
## [1] 686   2
## 
## $gsm
## [1] 5000    2
## 
## $gds
## [1] 365   2
## 
## $sMatrix
## [1] 4 2
##   from_acc   to_acc
## 1    GPL96  GSE1000
## 2    GPL96 GSE10072
## 3    GPL96 GSE10089
## 4    GPL96 GSE10137
## 5    GPL96 GSE10139
##   from_acc     to_acc
## 1    GPL96 GSM1003058
## 2    GPL96 GSM1003059
## 3    GPL96 GSM1003060
## 4    GPL96 GSM1003061
## 5    GPL96 GSM1003062
##   from_acc  to_acc
## 1    GPL96 GDS1023
## 2    GPL96 GDS1036
## 3    GPL96 GDS1050
## 4    GPL96 GDS1062
## 5    GPL96 GDS1063
##    from_acc                              to_acc
## 1     GPL96       GSE14543_series_matrix.txt.gz
## 2     GPL96       GSE16088_series_matrix.txt.gz
## 3     GPL96       GSE16091_series_matrix.txt.gz
## 4     GPL96 GSE16102-GPL96_series_matrix.txt.gz
## NA     <NA>                                <NA>

3.4 Mappings between GPL and Bioconductor microarry annotation packages

The function getBiocPlatformMap is to get GPL information of a given list of Bioconductor microarry annotation packages. Note currently the GEOmetadb does not contains all the mappings, but we are trying to construct a relative complete list.

Get GPL information of ‘hgu133a’ and ‘hgu95av2’:

##                                                     title
## 1          [HG-U133A] Affymetrix Human Genome U133A Array
## 2 [HG_U95Av2] Affymetrix Human Genome U95 Version 2 Array
## 3            [HG_U95A] Affymetrix Human Genome U95A Array
##       gpl bioc_package manufacturer     organism
## 1   GPL96      hgu133a   Affymetrix Homo sapiens
## 2 GPL8300     hgu95av2   Affymetrix Homo sapiens
## 3   GPL91     hgu95av2   Affymetrix Homo sapiens
##   data_row_count
## 1          22283
## 2          12625
## 3          12626

3.5 More advanced queries

Now, for something a bit more complicated, we would like to find all the human breast cancer-related Affymetrix gene expression GEO series.

## [1] 9 2
##   title                                                                                         
## 1 Breast Cancer Gene Expression Data from Frankfurt Series                                      
## 2 Expression data from breast cancer FNA biopsies from patients ( (USO samples)                 
## 3 Gene expression of breast cancer tissue in a large population-based cohort of Swedish patients
## 4 Genetic Reclassification of Histologic Grade Delineates New Clinical Subtypes of Breast Cancer
## 5 Breast Cancer Gene Expression Data from Hamburg Series                                        
##   gse     
## 1 GSE4611 
## 2 GSE42822
## 3 GSE1456 
## 4 GSE4922 
## 5 GSE46184

3.7 Using dplyr with GEOmetadb

The dplyr package provides a very streamlined functional approach to accessing SQL databases.

Load the dplyr library and connect to the GEOmetadb.sqlite database. The dplyr package is written around manipulation of tabular data, so we will work with only one GEOmetadb table, the gse table. Instead of using SQL, we can use dplyr verbs to get information from tables of interest.

## # Source:   lazy query [?? x 18]
## # Database: sqlite 3.36.0
## #   [/tmp/Rtmp3IYQ5O/Rbuild3ea5e26e7f006d/GEOmetadb/vignettes/GEOmetadb.sqlite]
##      ID title       gse    status       submission_date
##   <dbl> <chr>       <chr>  <chr>        <chr>          
## 1  2151 NHGRI_Sarc… GSE25… Public on O… 2005-04-21     
## # … with 13 more variables: last_update_date <chr>,
## #   pubmed_id <int>, summary <chr>, type <chr>,
## #   contributor <chr>, web_link <chr>,
## #   overall_design <chr>, repeats <chr>,
## #   repeats_sample_list <chr>, variable <chr>,
## #   variable_description <chr>, contact <chr>,
## #   supplementary_file <chr>

3.8 Cleanup

Finally, it is probably a good idea to close the connection. Please see the DBI package for details on this point.

If you want to remove old GEOmetadb.sqlite file before retrieve a new version from the server, execute the following codes:

4 sessionInfo()

## R version 4.1.1 (2021-08-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8      
##  [2] LC_NUMERIC=C              
##  [3] LC_TIME=en_GB             
##  [4] LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8   
##  [6] LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8      
##  [8] LC_NAME=C                 
##  [9] LC_ADDRESS=C              
## [10] LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8
## [12] LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets 
## [6] methods   base     
## 
## other attached packages:
##  [1] dplyr_1.0.7         wordcloud_2.6      
##  [3] RColorBrewer_1.1-2  tm_0.7-8           
##  [5] NLP_0.2-1           GEOmetadb_1.56.0   
##  [7] RSQLite_2.2.8       GEOquery_2.62.0    
##  [9] Biobase_2.54.0      BiocGenerics_0.40.0
## [11] knitr_1.36         
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.1.1  xfun_0.27        
##  [3] bslib_0.3.1       slam_0.1-48      
##  [5] purrr_0.3.4       vctrs_0.3.8      
##  [7] generics_0.1.1    htmltools_0.5.2  
##  [9] yaml_2.2.1        utf8_1.2.2       
## [11] blob_1.2.2        rlang_0.4.12     
## [13] jquerylib_0.1.4   pillar_1.6.4     
## [15] withr_2.4.2       glue_1.4.2       
## [17] DBI_1.1.1         dbplyr_2.1.1     
## [19] bit64_4.0.5       lifecycle_1.0.1  
## [21] stringr_1.4.0     codetools_0.2-18 
## [23] evaluate_0.14     memoise_2.0.0    
## [25] tzdb_0.1.2        fastmap_1.1.0    
## [27] parallel_4.1.1    fansi_0.5.0      
## [29] highr_0.9         Rcpp_1.0.7       
## [31] readr_2.0.2       cachem_1.0.6     
## [33] limma_3.50.0      jsonlite_1.7.2   
## [35] bit_4.0.4         hms_1.1.1        
## [37] digest_0.6.28     stringi_1.7.5    
## [39] cli_3.0.1         tools_4.1.1      
## [41] magrittr_2.0.1    sass_0.4.0       
## [43] tibble_3.1.5      crayon_1.4.1     
## [45] tidyr_1.1.4       pkgconfig_2.0.3  
## [47] ellipsis_0.3.2    data.table_1.14.2
## [49] xml2_1.3.2        assertthat_0.2.1 
## [51] rmarkdown_2.11    R6_2.5.1         
## [53] compiler_4.1.1