The AnnotationHubData package provides tools to acquire, annotate, convert and store data for use in Bioconductor’s AnnotationHub. BED files from the ENCODE project, GTF files from Ensembl, or annotation tracks from UCSC are examples of data that can be downloaded, described with metadata, transformed to standard Bioconductor data types, and stored so that they may be conveniently served up on demand to users via the AnnotationHub client. While data are often manipulated into a more R-friendly form, the data themselves retain their raw content and are not filtered or curated like those in ExperimentHub.
Each resource has associated metadata that can be searched through the AnnotationHub client interface.
Multiple, related resources are added to AnnotationHub by creating a software package similar to the existing annotation packages. The package itself does not contain data but serves as a lightweight wrapper around scripts that generate metadata for the resources added to AnnotationHub.
At a minimum the package should contain a man page describing the resources. Vignettes and additional R code for manipulating the objects are optional.
Creating the package involves the following steps:
Notify Bioconductor team member:
Man page and vignette examples in the software package will not work until the data are available in AnnotationHub. Adding the data to AWS S3 and the metadata to the production database involves assistance from a Bioconductor team member. If you are interested in submitting a package, please send an email to packages@bioconductor.org so a team member can work with you through the process.
Building the software package:
Below is an outline of package organization. The files listed are required unless otherwise stated.
inst/extdata/
metadata.csv: This file contains the metadata for the resources to be added to the AnnotationHub database. The file should be generated from the code in inst/scripts/make-metadata.R where the final data are written out with write.csv(…, row.names=FALSE). The required column names and data types are specified in AnnotationHubData::readMetadataFromCsv(). See ?readMetadataFromCsv for details.
inst/scripts/
make-data.R: A script describing the steps involved in making the data object(s). This includes where the original data were downloaded from, pre-processing, and how the final R object was made. Include a description of any steps performed outside of R with third party software. Data objects should be serialized with save() with the .rda extension on the filename.
make-metadata.R: A script to make the metadata.csv file located in inst/extdata of the package. See ?readMetadataFromCsv for a description of expected fields and data types. readMetadataFromCsv() can be used to validate the metadata.csv file before submitting the package.
vignettes/
OPTIONAL vignette(s) describing analysis workflows.
R/
make-metadata.R:
Code that assembles metadata for all resources and calls AnnotationHubData::AnnotationHubMetadata(). The output should be a list of AnnotationHubMetadata objects, one for each resource. Example functions can be found in the AnnotationHubData source code with names of the form make*ToAHM().
make-data.R:
Code that downloads and manipulates (if necessary) the data; outputs are files on disk ready to be pushed to S3. If data are to be hosted on a personal web site instead of S3, this file should explain any manipulation of the data prior to hosting on the web site. For data hosted on a public web site with no prior manipulation this file is not needed.
OPTIONAL functions to enhance data exploration.
man/
package man page:
The package man page serves as a landing point and should briefly describe all resources associated with the package. There should be an entry for each resource title either on the package man page or individual man pages.
resource man pages:
OPTIONAL. Man page(s) should describe the resource (raw data source, processing, QC steps) and demonstrate how the data can be loaded through the AnnotationHub interface. For example, replace the “SEARCHTERM*” placeholders below with one or more search terms that uniquely identify resources in your package.
library(AnnotationHub)
hub <- AnnotationHub()
myfiles <- query(hub, "SEARCHTERM1", "SEARCHTERM2")
myfiles[[1]] ## load the first resource in the list
DESCRIPTION / NAMESPACE
The package should depend on and fully import AnnotationHub. Package authors are encouraged to use the AnnotationHub::listResources() and AnnotationHub::loadResource() functions in their man pages and vignette. These helpers are designed to facilitate data discovery within a specific package vs within all of AnnotationHub.
Data objects:
Data are not formally part of the software package and are stored separately in AWS S3 buckets. The author should make the data available via Dropbox, an FTP site or another mutually accessible application, and a member of the Bioconductor team will upload it to S3.
Package review:
When the data and metadata are ready, a Bioconductor team member will push the data to AWS S3 and add the metadata to the production database. At this point the package man pages and vignette can be finalized. When the package passes R CMD build and check it can be submitted to the package tracker for review.
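As a rough illustration of the make-metadata.R step described above, the sketch below builds a one-row metadata table and writes it with write.csv(..., row.names=FALSE). It uses base R only. The column set is an assumption borrowed from the AnnotationHubMetadata() arguments used in the Vitis vinifera example elsewhere in this document, and the values are reused from that example purely for illustration; ?readMetadataFromCsv remains the authoritative list of required columns and types.

```r
## Sketch of inst/scripts/make-metadata.R.
## ASSUMPTION: the columns below mirror the AnnotationHubMetadata() arguments
## shown in this document; check ?readMetadataFromCsv for the required set.
meta <- data.frame(
    Title = "Vvinifera_CRIBI_IGGP12Xv0_V2.1.gff3.Rdata",
    Description = "Gene Annotation for Vitis vinifera",
    BiocVersion = "3.3",
    Genome = "IGGP12Xv0",
    SourceType = "GFF",
    SourceUrl = "http://genomes.cribi.unipd.it/DATA/V2/V2.1/V2.1.gff3",
    SourceVersion = "2.1",
    Species = "Vitis vinifera",
    TaxonomyId = 29760L,
    Coordinate_1_based = TRUE,
    DataProvider = "CRIBI",
    Maintainer = "Timothée Flutre <timothee.flutre@supagro.inra.fr>",
    RDataClass = "GRanges",
    DispatchClass = "GRanges",
    RDataPath = "community/tflutre/",
    stringsAsFactors = FALSE
)

## In a real package the target would be inst/extdata/metadata.csv;
## a temporary file is used here so the sketch runs anywhere.
out <- file.path(tempdir(), "metadata.csv")
write.csv(meta, file = out, row.names = FALSE)

## Read the file back to confirm that no row-name column was written
## and that the column names survived the round trip.
check <- read.csv(out, stringsAsFactors = FALSE)
stopifnot(nrow(check) == nrow(meta), identical(names(check), names(meta)))
```

Additional resources would be added as further rows of the data frame before it is written out.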
Individual objects of a standard class can be added to the hub by providing only the data and metadata files or by creating a package as described in the Family of Resources section.
OrgDb, TxDb and BSgenome objects are well-defined Bioconductor classes, and methods to download and process these objects already exist in AnnotationHub. When adding only one or two objects the overhead of creating a package may be unnecessary. A package provides structure for metadata generation and makes sense when there are plans to update versions or add new organisms in the future.
Make sure the OrgDb, TxDb or BSgenome object you want to add does not already exist in the Bioconductor annotation repository.
Providing just data and metadata files involves the following steps:
Notify Bioconductor team member:
Adding the data to AWS S3 and the metadata to the production database involves assistance from a Bioconductor team member. Please send an email to packages@bioconductor.org so a team member can work with you through the process.
Prepare the data:
In the case of an OrgDb object, only the sqlite file is stored in S3. See makeOrgPackageFromNCBI() and makeOrgPackage() in the AnnotationForge package for help creating the sqlite file. BSgenome objects should be made according to the steps outlined in the BSgenome vignette. TxDb objects will be made on-the-fly from a GRanges with GenomicFeatures::makeTxDbFromGRanges() when the resource is downloaded from AnnotationHub. Data should be provided as a GRanges object. See GenomicRanges::makeGRangesFromDataFrame() or rtracklayer::import() for help creating the GRanges.
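A minimal sketch of building and serializing a GRanges with GenomicRanges::makeGRangesFromDataFrame() is shown here. The table contents, the gene_id column and the output file name are hypothetical placeholders. Note that a GRanges meant for makeTxDbFromGRanges() needs GFF- or GTF-style metadata columns, which rtracklayer::import() supplies when reading such files; this toy table only illustrates the construction and save steps and is not TxDb-ready.

```r
library(GenomicRanges)

## Hypothetical toy annotation table; real input would come from the
## original source files.
df <- data.frame(
    chr = c("chr1", "chr1", "chr2"),
    start = c(100L, 500L, 200L),
    end = c(250L, 900L, 450L),
    strand = c("+", "-", "+"),
    gene_id = c("geneA", "geneB", "geneC"),
    stringsAsFactors = FALSE
)

## Columns named chr/start/end/strand are recognised automatically;
## keep.extra.columns=TRUE carries gene_id along as a metadata column.
gr <- makeGRangesFromDataFrame(df, keep.extra.columns = TRUE)

## Serialize the object so the file can be handed over for upload.
## The file name is a placeholder; a temporary directory keeps the
## sketch runnable anywhere.
out <- file.path(tempdir(), "example_granges.rda")
save(gr, file = out)
```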
Generate metadata:
Prepare a .R file that generates metadata for the resource(s) by calling the AnnotationHubData::AnnotationHubMetadata() constructor. Argument details are found on the ?AnnotationHubMetadata man page.
As an example, this piece of code generates the metadata for the Vitis vinifera TxDb that Timothée Flutre contributed to AnnotationHub:
library(AnnotationHubData)

metadata <- AnnotationHubMetadata(
    Description="Gene Annotation for Vitis vinifera",
    Genome="IGGP12Xv0",
    Species="Vitis vinifera",
    SourceUrl="http://genomes.cribi.unipd.it/DATA/V2/V2.1/V2.1.gff3",
    SourceLastModifiedDate=as.POSIXct("2014-04-17"),
    SourceVersion="2.1",
    RDataPath="community/tflutre/",
    TaxonomyId=29760L,
    Title="Vvinifera_CRIBI_IGGP12Xv0_V2.1.gff3.Rdata",
    BiocVersion=package_version("3.3"),
    Coordinate_1_based=TRUE,
    DataProvider="CRIBI",
    Maintainer="Timothée Flutre <timothee.flutre@supagro.inra.fr>",
    RDataClass="GRanges",
    DispatchClass="GRanges",
    SourceType="GFF",
    RDataDateAdded=as.POSIXct(Sys.time()),
    Recipe=NA_character_,
    PreparerClass="None",
    Tags=c("GFF", "CRIBI", "Gene", "Transcript", "Annotation"),
    Notes="chrUn renamed to chrUkn"
)
Add data to AWS S3 and the metadata to the production database:
This step is done by a Bioconductor team member.
Multiple versions of the data can be added to the same package as they become available. Be sure the title is descriptive and reflects the distinguishing information such as version or genome build.
make data available via dropbox, ftp, etc. and notify maintainer@bioconductor.org
update make-metadata.R with the new metadata information
bump package version and commit to svn/git
Contact maintainer@bioconductor.org with any questions.
A bug fix may involve a change to the metadata, data resource or both.
the replacement resource must have the same name as the original
notify maintainer@bioconductor.org that you want to replace the data and make the files available via dropbox, ftp, etc.
notify maintainer@bioconductor.org that you want to change the metadata
update make-metadata.R with modified information
bump the package version and commit to svn/git
When a resource is removed from AnnotationHub the ‘status’ field in the metadata is modified to explain why it is no longer available. Once this status is changed the AnnotationHub() constructor will not list the resource among the available ids. An attempt to extract the resource with ‘[[’ and the AH id will return an error along with the status message.
To remove a resource from AnnotationHub contact maintainer@bioconductor.org.
The process for adding data to AnnotationHub has evolved substantially since the first vignettes were written. Much of the information contained in those documents is outdated or applicable only to repeat-run recipes added to the code base. For historical purposes these documents have been moved to the inst/scripts/ directory of the AnnotationHubData package.