The phantasusLite
package contains a set functions that facilitate working with public
gene expression datasets originally developed for phantasus package. Unlike phantasus
, phantasusLite
aims to limit the amount of dependencies.
The current functionality includes:
It is recommended to install the release version of the package from Bioconductor using the following commands:
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("phantasusLite")
Alternatively, the most recent version of the package can be installed from the GitHub repository:
library(devtools)
install_github("ctlab/phantasusLite")
library(GEOquery)
library(phantasusLite)
Let’s load dataset GSE53053 from GEO using GEOquery
package:
ess <- getGEO("GSE53053")
es <- ess[[1]]
RNA-seq dataset from GEO do not contain the expression matrix, thus exprs(es)
is empty:
head(exprs(es))
## GSM1281300 GSM1281301 GSM1281302 GSM1281303 GSM1281304 GSM1281305
## GSM1281306 GSM1281307
However, a number of precomputed gene count tables are available at HSDS server ‘https://ctlab.itmo.ru/hsds/’. It features HDF5 files with counts from ARCHS4 and DEE2 projects:
url <- 'https://ctlab.itmo.ru/hsds/?domain=/counts'
getHSDSFileList(url)
## [1] "/counts/archs4/Arabidopsis_thaliana_count_matrix.h5"
## [2] "/counts/archs4/Bos_taurus_count_matrix.h5"
## [3] "/counts/archs4/Caenorhabditis_elegans_count_matrix.h5"
## [4] "/counts/archs4/Danio_rerio_count_matrix.h5"
## [5] "/counts/archs4/Drosophila_melanogaster_count_matrix.h5"
## [6] "/counts/archs4/Gallus_gallus_count_matrix.h5"
## [7] "/counts/archs4/Rattus_norvegicus_count_matrix.h5"
## [8] "/counts/archs4/Saccharomyces_cerevisiae_count_matrix.h5"
## [9] "/counts/archs4/human_matrix_v11.h5"
## [10] "/counts/archs4/meta.h5"
## [11] "/counts/archs4/mouse_matrix_v11.h5"
## [12] "/counts/archs4.old/meta.h5"
## [13] "/counts/archs4_new/human_gene_v2.2.h5"
## [14] "/counts/archs4_new/meta.h5"
## [15] "/counts/archs4_new/mouse_gene_v2.2.h5"
## [16] "/counts/dee2/athaliana_star_matrix_20221107.h5"
## [17] "/counts/dee2/celegans_star_matrix_20221107.h5"
## [18] "/counts/dee2/dmelanogaster_star_matrix_20221107.h5"
## [19] "/counts/dee2/drerio_star_matrix_20211102.h5"
## [20] "/counts/dee2/ecoli_star_matrix_20221107.h5"
## [21] "/counts/dee2/hsapiens_star_matrix_20221107.h5"
## [22] "/counts/dee2/meta.h5"
## [23] "/counts/dee2/mmusculus_star_matrix_20221107.h5"
## [24] "/counts/dee2/rnorvegicus_star_matrix_20221107.h5"
## [25] "/counts/dee2/scerevisiae_star_matrix_20221107.h5"
GSE53053 dataset was sequenced from Mus musculus and we can get an expression matrix from the corresponding HDF5-file with DEE2 data:
file <- "dee2/mmusculus_star_matrix_20221107.h5"
es <- loadCountsFromH5FileHSDS(es, url, file)
head(exprs(es))
## GSM1281300 GSM1281301 GSM1281302 GSM1281303 GSM1281304
## ENSMUSG00000102693 0 0 0 0 0
## ENSMUSG00000064842 0 0 0 0 0
## ENSMUSG00000051951 0 0 0 0 0
## ENSMUSG00000102851 0 0 0 0 0
## ENSMUSG00000103377 0 0 0 0 0
## ENSMUSG00000104017 0 0 0 0 0
## GSM1281305 GSM1281306 GSM1281307
## ENSMUSG00000102693 0 0 0
## ENSMUSG00000064842 0 0 0
## ENSMUSG00000051951 0 0 0
## ENSMUSG00000102851 0 0 0
## ENSMUSG00000103377 0 0 0
## ENSMUSG00000104017 0 0 0
Normally loadCountsFromHSDS
can be used to automatically select the HDF5-file with the largest
number of quantified samples:
es <- ess[[1]]
es <- loadCountsFromHSDS(es, url)
head(exprs(es))
## GSM1281300 GSM1281301 GSM1281302 GSM1281303 GSM1281304 GSM1281305
## 0610007P14Rik 86 67 30 46 23 61
## 0610009B22Rik 29 22 3 0 33 13
## 0610009L18Rik 0 0 7 0 0 15
## 0610009O20Rik 103 38 17 20 31 54
## 0610010F05Rik 259 91 115 88 113 185
## 0610010K14Rik 17 6 0 0 1 0
## GSM1281306 GSM1281307
## 0610007P14Rik 105 22
## 0610009B22Rik 15 26
## 0610009L18Rik 0 9
## 0610009O20Rik 24 29
## 0610010F05Rik 108 163
## 0610010K14Rik 0 7
The counts are different from the previous values as ARCHS4 counts were used – ARCHS4 is prioritized when there are several files with the same number of samples:
preproc(experimentData(es))$gene_counts_source
## [1] "archs4/mouse_matrix_v11.h5"
For some of the GEO datasets, such as GSE53053, the sample annotation is not fully available. However, frequently sample titles are structured in a way that allows to infer the groups. For example, for GSE53053 we can see there are three groups: Ctrl, MandIL4, MandLPSandIFNg, with up to 3 replicates:
es$title
## [1] "Ctrl_1" "Ctrl_2" "MandIL4_1" "MandIL4_2"
## [5] "MandIL4_3" "MandLPSandIFNg_1" "MandLPSandIFNg_2" "MandLPSandIFNg_3"
For such well-structured titles, inferCondition
function can be used to automatically
identify the sample conditions and replicates:
es <- inferCondition(es)
print(es$condition)
## [1] "Ctrl" "Ctrl" "MandIL4" "MandIL4"
## [5] "MandIL4" "MandLPSandIFNg" "MandLPSandIFNg" "MandLPSandIFNg"
print(es$replicate)
## [1] "1" "2" "1" "2" "3" "1" "2" "3"
GCT text format can be used to store annotated gene expression matrices and load them in software such as Morpheus or Phantasus.
For example, we can save the ExpressionSet
object that we defined previously:
f <- file.path(tempdir(), "GSE53053.gct")
writeGct(es, f)
And the load the file back:
es2 <- readGct(f)
## Warning in readGct(f): duplicated row IDs: missing missing missing missing
## missing missing; they were renamed
print(es2)
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 32544 features, 8 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: Ctrl_1 Ctrl_2 ... MandLPSandIFNg_3 (8 total)
## varLabels: title geo_accession ... replicate (43 total)
## varMetadata: labelDescription
## featureData
## featureNames: missing ENSMUSG00000007777 ... ENSMUSG00000064368
## (32544 total)
## fvarLabels: ENSEMBLID Gene Symbol
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.18-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] phantasusLite_1.0.0 GEOquery_2.70.0 Biobase_2.62.0
## [4] BiocGenerics_0.48.0 BiocStyle_2.30.0
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.7 utf8_1.2.4 generics_0.1.3
## [4] tidyr_1.3.0 SparseArray_1.2.0 xml2_1.3.5
## [7] stringi_1.7.12 lattice_0.22-5 hms_1.1.3
## [10] digest_0.6.33 magrittr_2.0.3 evaluate_0.22
## [13] grid_4.3.1 bookdown_0.36 fastmap_1.1.1
## [16] R.oo_1.25.0 jsonlite_1.8.7 Matrix_1.6-1.1
## [19] R.utils_2.12.2 limma_3.58.0 BiocManager_1.30.22
## [22] httr_1.4.7 purrr_1.0.2 fansi_1.0.5
## [25] jquerylib_0.1.4 abind_1.4-5 cli_3.6.1
## [28] crayon_1.5.2 rlang_1.1.1 XVector_0.42.0
## [31] R.methodsS3_1.8.2 withr_2.5.1 cachem_1.0.8
## [34] DelayedArray_0.28.0 yaml_2.3.7 S4Arrays_1.2.0
## [37] tools_4.3.1 tzdb_0.4.0 dplyr_1.1.3
## [40] curl_5.1.0 vctrs_0.6.4 R6_2.5.1
## [43] matrixStats_1.0.0 stats4_4.3.1 lifecycle_1.0.3
## [46] stringr_1.5.0 zlibbioc_1.48.0 S4Vectors_0.40.0
## [49] IRanges_2.36.0 pkgconfig_2.0.3 pillar_1.9.0
## [52] bslib_0.5.1 data.table_1.14.8 glue_1.6.2
## [55] statmod_1.5.0 xfun_0.40 tibble_3.2.1
## [58] tidyselect_1.2.0 MatrixGenerics_1.14.0 knitr_1.44
## [61] rjson_0.2.21 htmltools_0.5.6.1 rmarkdown_2.25
## [64] rhdf5client_1.24.0 readr_2.1.4 compiler_4.3.1
```