library(MungeSumstats)

MungeSumstats can now query and import GWAS summary statistics from the MRC IEU Open GWAS Project in a high-throughput manner.

1 Find GWAS datasets

#### Search for datasets ####
metagwas <- MungeSumstats::find_sumstats(traits = c("parkinson","alzheimer"), 
                                         min_sample_size = 1000)
head(metagwas,3)
ids <- (dplyr::arrange(metagwas, nsnp))$id  
##          id               trait group_name year    author
## 1 ieu-a-298 Alzheimer's disease     public 2013   Lambert
## 2   ieu-b-2 Alzheimer's disease     public 2019 Kunkle BW
## 3 ieu-a-297 Alzheimer's disease     public 2013   Lambert
##                                                                                                                                                                                                                                                                                                                    consortium
## 1                                                                                                                                                                                                                                                                                                                        IGAP
## 2 Alzheimer Disease Genetics Consortium (ADGC), European Alzheimer's Disease Initiative (EADI), Cohorts for Heart and Aging Research in Genomic Epidemiology Consortium (CHARGE), Genetic and Environmental Risk in AD/Defining Genetic, Polygenic and Environmental Risk for Alzheimer's Disease Consortium (GERAD/PERADES),
## 3                                                                                                                                                                                                                                                                                                                        IGAP
##                 sex population     unit     nsnp sample_size       build
## 1 Males and Females   European log odds    11633       74046 HG19/GRCh37
## 2 Males and Females   European       NA 10528610       63926 HG19/GRCh37
## 3 Males and Females   European log odds  7055882       54162 HG19/GRCh37
##   category                subcategory ontology mr priority     pmid sd
## 1  Disease Psychiatric / neurological       NA  1        1 24162737 NA
## 2   Binary Psychiatric / neurological       NA  1        0 30820047 NA
## 3  Disease Psychiatric / neurological       NA  1        2 24162737 NA
##                                                                      note ncase
## 1 Exposure only; Effect allele frequencies are missing; forward(+) strand 25580
## 2                                                                      NA 21982
## 3                Effect allele frequencies are missing; forward(+) strand 17008
##   ncontrol     N
## 1    48466 74046
## 2    41944 63926
## 3    37154 54162
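
The `ids` vector above is simply the `id` column of the search results, ordered by the number of SNPs in each dataset. As a minimal sketch of that step in base R, the block below rebuilds a small data.frame from the three rows printed above (only a few columns are kept) and applies the same ordering. The names `metagwas_subset`, `ids_by_nsnp` and `ids_dense`, and the 1e6 cutoff, are illustrative assumptions and not part of MungeSumstats.

```r
# Rows copied from the search output above; only a few columns are kept.
metagwas_subset <- data.frame(
  id          = c("ieu-a-298", "ieu-b-2", "ieu-a-297"),
  year        = c(2013, 2019, 2013),
  nsnp        = c(11633, 10528610, 7055882),
  sample_size = c(74046, 63926, 54162),
  stringsAsFactors = FALSE
)

# Base-R equivalent of (dplyr::arrange(metagwas, nsnp))$id:
# smallest datasets first, so quick downloads are attempted before large ones.
ids_by_nsnp <- metagwas_subset$id[order(metagwas_subset$nsnp)]
print(ids_by_nsnp)

# Optionally keep only datasets above a chosen SNP count.
# The 1e6 threshold is an arbitrary illustration, not a package default.
ids_dense <- with(metagwas_subset, id[nsnp > 1e6])
print(ids_dense)
```

Ordering by `nsnp` is a convenience: any character vector of OpenGWAS IDs can be used for the import step.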

2 Import full results

You can pass any number of OpenGWAS IDs to import_sumstats(); here we supply just one to save time.

datasets <- MungeSumstats::import_sumstats(ids = "ieu-a-298",
                                           ref_genome = "GRCH37")

2.1 Summarise results

By default, import_sumstats() returns a named list in which the names are the OpenGWAS dataset IDs and the items are the paths to the corresponding formatted summary statistics files.

print(datasets)
## $`ieu-a-298`
## [1] "/tmp/Rtmp6LvvmQ/ieu-a-298.tsv.gz"

You can easily turn this into a data.frame as well.

results_df <- data.frame(id=names(datasets), 
                         path=unlist(datasets))
print(results_df)
##                  id                             path
## ieu-a-298 ieu-a-298 /tmp/Rtmp6LvvmQ/ieu-a-298.tsv.gz
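
Because each list item is a path to a gzipped, tab-separated file, the formatted summary statistics can be read back into R with standard tools. The following is a minimal sketch using base R only. It writes a tiny placeholder file so that it runs without downloading anything: the rows, the `toy-id` name and the `datasets_toy` list are invented for illustration, and the column names (SNP, CHR, BP, P) are assumed to follow the usual MungeSumstats output headers.

```r
# Placeholder file standing in for a path returned by import_sumstats().
# The rows below are invented for illustration only.
toy <- data.frame(SNP = c("rs1", "rs2"),
                  CHR = c(1, 2),
                  BP  = c(1000, 2000),
                  P   = c(0.5, 0.01))
path <- tempfile(fileext = ".tsv.gz")
con <- gzfile(path, "w")
write.table(toy, con, sep = "\t", quote = FALSE, row.names = FALSE)
close(con)

# Same shape as the real return value: a named list of file paths.
datasets_toy <- list("toy-id" = path)

# Read every file that exists; read.delim() decompresses gzip transparently.
sumstats <- lapply(datasets_toy, function(p) {
  if (!file.exists(p)) return(NULL)
  utils::read.delim(p, stringsAsFactors = FALSE)
})
print(head(sumstats[["toy-id"]]))
```

For full-size GWAS files a faster reader such as data.table::fread() is likely a better choice than read.delim().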

3 Import full results (parallel)

Optional: speed up the import with multi-threaded downloads via axel, an external command-line tool that must be installed on your system.

datasets <- MungeSumstats::import_sumstats(ids = ids, 
                                           vcf_download = TRUE, 
                                           download_method = "axel", 
                                           nThread = max(2,future::availableCores()-2))
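
The `nThread` expression above leaves two cores free for the rest of the system while never dropping below two threads. A hedged sketch of how these settings could be chosen defensively is shown below. It checks whether axel is on the PATH and otherwise falls back to "download.file", which is assumed here to be the package's standard download method, and it uses parallel::detectCores() as a base-R stand-in for future::availableCores(). The variable names are illustrative.

```r
# Is the external axel binary available on this machine?
has_axel <- nzchar(Sys.which("axel"))

# Fall back to a single-connection download when axel is missing.
# "download.file" is assumed to be the package's standard method.
download_method <- if (has_axel) "axel" else "download.file"

# detectCores() can return NA on some platforms, so guard against that.
cores <- parallel::detectCores()
if (is.na(cores)) cores <- 1L

# Same rule as in the call above: leave two cores free, but use at least two threads.
n_threads <- max(2, cores - 2)

print(download_method)
print(n_threads)
```

These values can then be passed as `download_method = download_method` and `nThread = n_threads` in the import_sumstats() call.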

4 Further functionality

See the Getting started vignette for more information on how to use MungeSumstats and its functionality.

5 Session Info

utils::sessionInfo()
## R version 4.2.0 (2022-04-22)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.15-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.15-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] MungeSumstats_1.4.5 BiocStyle_2.24.0   
## 
## loaded via a namespace (and not attached):
##   [1] fs_1.5.2                                   
##   [2] bitops_1.0-7                               
##   [3] matrixStats_0.62.0                         
##   [4] bit64_4.0.5                                
##   [5] filelock_1.0.2                             
##   [6] progress_1.2.2                             
##   [7] httr_1.4.3                                 
##   [8] googleAuthR_2.0.0                          
##   [9] GenomeInfoDb_1.32.2                        
##  [10] tools_4.2.0                                
##  [11] bslib_0.3.1                                
##  [12] utf8_1.2.2                                 
##  [13] R6_2.5.1                                   
##  [14] DBI_1.1.2                                  
##  [15] BiocGenerics_0.42.0                        
##  [16] tidyselect_1.1.2                           
##  [17] prettyunits_1.1.1                          
##  [18] bit_4.0.4                                  
##  [19] curl_4.3.2                                 
##  [20] compiler_4.2.0                             
##  [21] cli_3.3.0                                  
##  [22] Biobase_2.56.0                             
##  [23] xml2_1.3.3                                 
##  [24] DelayedArray_0.22.0                        
##  [25] rtracklayer_1.56.0                         
##  [26] bookdown_0.26                              
##  [27] sass_0.4.1                                 
##  [28] rappdirs_0.3.3                             
##  [29] stringr_1.4.0                              
##  [30] digest_0.6.29                              
##  [31] Rsamtools_2.12.0                           
##  [32] rmarkdown_2.14                             
##  [33] R.utils_2.11.0                             
##  [34] XVector_0.36.0                             
##  [35] BSgenome.Hsapiens.1000genomes.hs37d5_0.99.1
##  [36] pkgconfig_2.0.3                            
##  [37] htmltools_0.5.2                            
##  [38] MatrixGenerics_1.8.0                       
##  [39] highr_0.9                                  
##  [40] dbplyr_2.2.0                               
##  [41] fastmap_1.1.0                              
##  [42] BSgenome_1.64.0                            
##  [43] rlang_1.0.2                                
##  [44] RSQLite_2.2.14                             
##  [45] jquerylib_0.1.4                            
##  [46] BiocIO_1.6.0                               
##  [47] generics_0.1.2                             
##  [48] jsonlite_1.8.0                             
##  [49] BiocParallel_1.30.3                        
##  [50] dplyr_1.0.9                                
##  [51] R.oo_1.24.0                                
##  [52] VariantAnnotation_1.42.1                   
##  [53] RCurl_1.98-1.7                             
##  [54] magrittr_2.0.3                             
##  [55] GenomeInfoDbData_1.2.8                     
##  [56] Matrix_1.4-1                               
##  [57] Rcpp_1.0.8.3                               
##  [58] S4Vectors_0.34.0                           
##  [59] fansi_1.0.3                                
##  [60] lifecycle_1.0.1                            
##  [61] R.methodsS3_1.8.1                          
##  [62] stringi_1.7.6                              
##  [63] yaml_2.3.5                                 
##  [64] SummarizedExperiment_1.26.1                
##  [65] zlibbioc_1.42.0                            
##  [66] BiocFileCache_2.4.0                        
##  [67] grid_4.2.0                                 
##  [68] blob_1.2.3                                 
##  [69] parallel_4.2.0                             
##  [70] crayon_1.5.1                               
##  [71] lattice_0.20-45                            
##  [72] Biostrings_2.64.0                          
##  [73] GenomicFeatures_1.48.3                     
##  [74] hms_1.1.1                                  
##  [75] KEGGREST_1.36.2                            
##  [76] seqminer_8.4                               
##  [77] knitr_1.39                                 
##  [78] pillar_1.7.0                               
##  [79] GenomicRanges_1.48.0                       
##  [80] rjson_0.2.21                               
##  [81] codetools_0.2-18                           
##  [82] biomaRt_2.52.0                             
##  [83] stats4_4.2.0                               
##  [84] XML_3.99-0.9                               
##  [85] glue_1.6.2                                 
##  [86] evaluate_0.15                              
##  [87] SNPlocs.Hsapiens.dbSNP144.GRCh37_0.99.20   
##  [88] data.table_1.14.2                          
##  [89] BiocManager_1.30.18                        
##  [90] png_0.1-7                                  
##  [91] vctrs_0.4.1                                
##  [92] purrr_0.3.4                                
##  [93] assertthat_0.2.1                           
##  [94] cachem_1.0.6                               
##  [95] xfun_0.31                                  
##  [96] restfulr_0.0.14                            
##  [97] gargle_1.2.0                               
##  [98] tibble_3.1.7                               
##  [99] GenomicAlignments_1.32.0                   
## [100] AnnotationDbi_1.58.0                       
## [101] memoise_2.0.1                              
## [102] IRanges_2.30.0                             
## [103] ellipsis_0.3.2