1 Vignette overview

metagenomeFeatures and associated annotation packages1 Other databases are avilable as Bioconductor AnnotationData including; Greengenes Release 13.5: greengenesMgDb13.5, Greengenes Release 13.8 99% OTUs,greengenes13.8_99MgDb Ribosomal Database Project Release 11.5, ribosomaldatabaseproject11.5MgDb, and Silva 128.1: silva128.1MgDb. can be used to obtain phylogentic trees and representative sequences for 16S rRNA marker gene sequence experimental datasets when the reads are clustered using closed reference clustering and a database with an MgDb annotation package. In this vignette we demostrate how to use metagenomeFeatures and Greengenes 16S rRNA database version 13.8 85% OTUs to obtain a phylogenetic tree and representative sequences for Rousk et al. (2010), a soil microbiome study. We then use the tree and sequence data obtained from the Greengenes MgDb to generate a phyloseq-class object for community analysis2 The phyloseq package defines the phyloseq-calss for working with 16S rRNA experimental data..

2 Greengenes 13.8 85% OTU MgDb

The gg 13.8 85% OTU (gg85) is provided as part of the metagenomeFeatures package. gg85 is a MgDb-class object with the taxonomic heirarchy, sequence data, and phylogeny for the Greengenes database clustered at the 0.85 similarity threshold. After loading the metagenomeFeatures package gg85 is loaded using get_gg13.8_85MgDb().3 See the “metagenomeFeatures classes and methods.” vignette, Vignettes("metagenomeFeatures classes and methods.") for more information on methods for working with MgDb-class and mgFeatures-class objects.

library(metagenomeFeatures)
gg85 <- get_gg13.8_85MgDb()

2.1 QIITA Dataset

QIITA is a public repository for sharing OTU tables and raw sequence data from 16S rRNA marker gene studies, however some of the datasets do not include representative sequences or phylogenetic trees. For this vignette we are using data from the previously mentioned soil microbiome study, https://qiita.ucsd.edu/study/description/94 (Rousk et al. 2010). A BIOM and qiime mapping file for the study can be obtained from QIITA. A vector of Greengenes ids for the study cluster centers is included in this package for use in this vignette.

data(qiita_study_94_gg_ids)

2.2 Obtaining Representative Sequences and Phylogenetic Tree

A mgFeatures object is generated from gg85 (or any other MgDb-class object) using the annotateFeatures() function along with a set of database ids or keys. The resulting mgFeatures class object has the taxonomic heirarchy, phylogeny, and sequence data for the study OTUs.4 See the “metagenomeFeatures classes and methods.” vignette, Vignettes("metagenomeFeatures classes and methods.") for more information on methods for working with MgDb-class and mgFeatures-class objects.

soil_mgF <- annotateFeatures(gg85, qiita_study_94_gg_ids)
#> Found more than one class "phylo" in cache; using the first, from namespace 'metagenomeFeatures'
#> Also defined by 'tidytree' 'phyloseq'
#> Found more than one class "phylo" in cache; using the first, from namespace 'metagenomeFeatures'
#> Also defined by 'tidytree' 'phyloseq'
#> Found more than one class "phylo" in cache; using the first, from namespace 'metagenomeFeatures'
#> Also defined by 'tidytree' 'phyloseq'
#> Found more than one class "phylo" in cache; using the first, from namespace 'metagenomeFeatures'
#> Also defined by 'tidytree' 'phyloseq'

## Taxonomic heirarchy
soil_mgF
#> mgFeatures with 61 rows and 8 columns
#>            Keys     Kingdom            Phylum                   Class
#>     <character> <character>       <character>             <character>
#> 1       1107824 k__Bacteria p__Proteobacteria  c__Gammaproteobacteria
#> 2        824826 k__Bacteria p__Proteobacteria  c__Alphaproteobacteria
#> 3        694268 k__Bacteria          p__WPS-2                     c__
#> 4        579266 k__Bacteria p__Proteobacteria   c__Betaproteobacteria
#> 5        558862 k__Bacteria       p__Chlorobi               c__SJA-28
#> ...         ...         ...               ...                     ...
#> 57      4389227 k__Bacteria  p__Acidobacteria               c__iii1-8
#> 58      4391683 k__Bacteria p__Proteobacteria  c__Alphaproteobacteria
#> 59      4421369 k__Bacteria  p__Acidobacteria c__[Chloracidobacteria]
#> 60      4477112 k__Bacteria            p__WS2              c__SHA-109
#> 61      4479102 k__Bacteria           p__OP11               c__OP11-4
#>                     Ord               Family       Genus     Species
#>             <character>          <character> <character> <character>
#> 1      o__Legionellales      f__Coxiellaceae         g__         s__
#> 2                   o__                  f__         g__         s__
#> 3                   o__                  f__         g__         s__
#> 4    o__Burkholderiales                  f__         g__         s__
#> 5                   o__                  f__         g__         s__
#> ...                 ...                  ...         ...         ...
#> 57             o__32-20                  f__         g__         s__
#> 58  o__Sphingomonadales f__Sphingomonadaceae         g__         s__
#> 59              o__RB41         f__Ellin6075         g__         s__
#> 60                  o__                  f__         g__         s__
#> 61                  o__                  f__         g__         s__

# Sequence data
mgF_seq(soil_mgF)
#> DNAStringSet object of length 61:
#>      width seq                                              names               
#>  [1]  1502 TAGAGTTTGATCCTGGCTCAGAT...GAAGTCGTAACAAGGTAGCCGT 1107824
#>  [2]  1396 AGAGTTTGATCATGGCTCAGGAT...GAGCCTTGTACACACCGCCCGT 824826
#>  [3]  1424 AACGCTGGCGGCGTGCCTAACAC...TAGGTAAGGGGGACGAAGTCGT 694268
#>  [4]  1498 AGAGTTTGATCCTGGCTCAGAGT...AAGTCGTAACAAGGTAGCCGTA 579266
#>  [5]  1432 AGAGTTTGATCATGGCTCAGGAC...GCCAGAAGTAGTTAGCCTAACC 558862
#>  ...   ... ...
#> [57]  1378 TGCTTAACACATGCAAGTCGAGC...CCTTGCACACACCGCCCGTCAC 4389227
#> [58]  1606 AGAGTTTGATCCTGGCTCAGAAC...GTGAAGTCGTAACAAGGTAACC 4391683
#> [59]  1496 AGAGTTTGATCCTGGCTCAGAAT...GTGAAGTCGTAACAAGGTAACC 4421369
#> [60]  1403 GAACGCTGGCGGTACGTCTGACA...AGCGGCCGAAGGTGGAGTCAGT 4477112
#> [61]  1336 GATGAACGCTGGCGGCGTGCCTT...AGCAAAGTTGGGGGCGCCCGAA 4479102

# Tree data 
mgF_tree(soil_mgF)
#> 
#> Phylogenetic tree with 61 tips and 60 internal nodes.
#> 
#> Tip labels:
#>  200762, 206404, 4479102, 113767, 551866, 552576, ...
#> 
#> Rooted; includes branch lengths.

3 Using mgFeatures-class object with phyloseq

Finally we use the phyogenetic tree from the mgFeatures-class object along with the OTU table for a beta diversity analysis. After loading the phyloseq package we will load the abundance data from the QITTA biom file, using the phyloseq command import_biom().
Next we will define the sample_data slot using a modified version of the mapping file 2301_mapping_file.txt obtained from QITTA, which only includes sample names and experimental variables pH, carbon/nitrogen ratio, total nitrogen, and total organic carbon. Future development of a common data structure for working with microbiome data that extends the SummarizeExperiment-class is underway. For this new class the mgFeatures-class will define the rowData slot.

data_dir <- system.file("extdata", package = "metagenomeFeatures")

## Load Biom
biom_file <-  file.path(data_dir, "229_otu_table.biom")
soil_ps <- phyloseq::import_biom(BIOMfilename = biom_file)

## Define sample data
sample_file <- file.path(data_dir, "229_sample_data.tsv")
sample_dat <- read.delim(sample_file)

## Rownames matching sample_names(), required for phyloseq sample_data slot
rownames(sample_dat) <- sample_dat$SampleID
sample_data(soil_ps) <- sample_dat

## Resulting phyloseq object
soil_ps
#> phyloseq-class experiment-level object
#> otu_table()   OTU Table:         [ 2305 taxa and 26 samples ]
#> sample_data() Sample Data:       [ 26 samples by 5 sample variables ]
#> tax_table()   Taxonomy Table:    [ 2305 taxa by 7 taxonomic ranks ]

Rousk et al. (2010) clustered the sequences at 97% similarity therefore gg85 only contains a subset of the OTUs in the dataset.
To define the tree and sequence slot we need to subset soil_ps to only include OTUs in gg85. Samples with no OTUs in gg85 were also removed for our beta diversity analysis.

# Removing OTUs not in `gg85`
soil_tree <- mgF_tree(soil_mgF)
soil_ps_gg85 <- prune_taxa(taxa = soil_tree$tip.label, x = soil_ps)

# Removing samples with no OTUs in `gg85`
samples_to_keep <- sample_sums(soil_ps_gg85) != 0
soil_ps_gg85 <- prune_samples(samples = samples_to_keep, x = soil_ps_gg85)

Now soil_mgF can be used to to define our soil_ps object refseq and tree slots.

## Defining tree slot
phy_tree(physeq = soil_ps_gg85) <- soil_tree
#> Found more than one class "phylo" in cache; using the first, from namespace 'metagenomeFeatures'
#> Also defined by 'tidytree' 'phyloseq'
#> Found more than one class "phylo" in cache; using the first, from namespace 'metagenomeFeatures'
#> Also defined by 'tidytree' 'phyloseq'
#> Found more than one class "phylo" in cache; using the first, from namespace 'metagenomeFeatures'
#> Also defined by 'tidytree' 'phyloseq'
#> Found more than one class "phylo" in cache; using the first, from namespace 'metagenomeFeatures'
#> Also defined by 'tidytree' 'phyloseq'
#> Found more than one class "phylo" in cache; using the first, from namespace 'metagenomeFeatures'
#> Also defined by 'tidytree' 'phyloseq'

## Defining seq slot
soil_ps_gg85@refseq <- mgF_seq(soil_mgF)

Now that the tree slot is defined we can perform microbial community analysis requiring a phylogenetic tree such as phylogenetic beta diversity. For the subset of features used here, pH is a primary driver for differences in beta diversity between the samples (Figure 1). Nearly 30% of the total variation in the data is explained by the first axis and sample pH is correlated with the first axis. This result is consistent with Rousk et al. (2010) even though we only used a small subset of the study OTUs. The outlier sample (94.PH18) is an artifact of using a small subset of the original dataset, only one gg85 OTU is observed in the sample and that OTU was not observed in the other samples.

soil_ord <- ordinate(physeq = soil_ps_gg85, 
                     distance = "wunifrac", 
                     method = "PCoA")
plot_ordination(soil_ps_gg85, soil_ord, 
                color = "ph", 
                type="sample",
                label = "SampleID")
Beta diversity and ordination for a subset of features from Rousk et al. [-@rousk2010soil]. Beta diversity was estimated using Weighted Unifrac and Principal Component Analysis was used for ordination. Sampels are represented as individual point and color indicates soil sample pH.

Figure 1: Beta diversity and ordination for a subset of features from Rousk et al
(2010). Beta diversity was estimated using Weighted Unifrac and Principal Component Analysis was used for ordination. Sampels are represented as individual point and color indicates soil sample pH.

Session info

#> R version 4.0.0 (2020-04-24)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.11-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.11-bioc/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] parallel  stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#> [1] phyloseq_1.32.0          forcats_0.5.0            ggplot2_3.3.0           
#> [4] dplyr_0.8.5              ggtree_2.2.0             metagenomeFeatures_2.8.0
#> [7] Biobase_2.48.0           BiocGenerics_0.34.0      BiocStyle_2.16.0        
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyr_1.0.2         splines_4.0.0       bit64_0.9-7        
#>  [4] jsonlite_1.6.1      foreach_1.5.0       assertthat_0.2.1   
#>  [7] BiocManager_1.30.10 rvcheck_0.1.8       highr_0.8          
#> [10] stats4_4.0.0        blob_1.2.1          yaml_2.2.1         
#> [13] pillar_1.4.3        RSQLite_2.2.0       lattice_0.20-41    
#> [16] glue_1.4.0          digest_0.6.25       XVector_0.28.0     
#> [19] colorspace_1.4-1    htmltools_0.4.0     Matrix_1.2-18      
#> [22] plyr_1.8.6          pkgconfig_2.0.3     magick_2.3         
#> [25] bookdown_0.18       zlibbioc_1.34.0     purrr_0.3.4        
#> [28] patchwork_1.0.0     tidytree_0.3.3      scales_1.1.0       
#> [31] tibble_3.0.1        mgcv_1.8-31         farver_2.0.3       
#> [34] IRanges_2.22.0      ellipsis_0.3.0      withr_2.2.0        
#> [37] lazyeval_0.2.2      cli_2.0.2           survival_3.1-12    
#> [40] magrittr_1.5        crayon_1.3.4        memoise_1.1.0      
#> [43] evaluate_0.14       fansi_0.4.1         nlme_3.1-147       
#> [46] MASS_7.3-51.6       vegan_2.5-6         data.table_1.12.8  
#> [49] tools_4.0.0         lifecycle_0.2.0     stringr_1.4.0      
#> [52] Rhdf5lib_1.10.0     aplot_0.0.4         S4Vectors_0.26.0   
#> [55] munsell_0.5.0       cluster_2.1.0       Biostrings_2.56.0  
#> [58] ade4_1.7-15         compiler_4.0.0      rlang_0.4.5        
#> [61] rhdf5_2.32.0        grid_4.0.0          iterators_1.0.12   
#> [64] biomformat_1.16.0   igraph_1.2.5        labeling_0.3       
#> [67] rmarkdown_2.1       multtest_2.44.0     gtable_0.3.0       
#> [70] codetools_0.2-16    DBI_1.1.0           reshape2_1.4.4     
#> [73] R6_2.4.1            knitr_1.28          bit_1.1-15.2       
#> [76] utf8_1.1.4          treeio_1.12.0       permute_0.9-5      
#> [79] ape_5.3             stringi_1.4.6       Rcpp_1.0.4.6       
#> [82] DECIPHER_2.16.0     vctrs_0.2.4         dbplyr_1.4.3       
#> [85] tidyselect_1.0.0    xfun_0.13

Reference

Rousk, Johannes, Erland B{}{}th, Philip C Brookes, Christian L Lauber, Catherine Lozupone, J Gregory Caporaso, Rob Knight, and Noah Fierer. 2010. “Soil Bacterial and Fungal Communities Across a pH Gradient in an Arable Soil.” The ISME Journal 4 (10). Nature Publishing Group:1340.