1 Overview

The AnnotationHubData package provides tools to acquire, annotate, convert and store data for use in Bioconductor’s AnnotationHub. BED files from the ENCODE project, GTF files from Ensembl, and annotation tracks from UCSC are examples of data that can be downloaded, described with metadata, transformed to standard Bioconductor data types, and stored so that they can be served on demand to users via the AnnotationHub client. While data are often manipulated into a more R-friendly form, the data themselves retain their raw content and are not filtered or curated like those in ExperimentHub.
Each resource has associated metadata that can be searched through the AnnotationHub client interface.

2 New resources

2.1 Family of resources

Multiple, related resources are added to AnnotationHub by creating a software package similar to the existing annotation packages. The package itself does not contain data but serves as a lightweight wrapper around scripts that generate metadata for the resources added to AnnotationHub.

At a minimum the package should contain a man page describing the resources. Vignettes and additional R code for manipulating the objects are optional.

Creating the package involves the following steps:

Notify Bioconductor team member:
Man page and vignette examples in the software package will not work until the data are available in AnnotationHub. Adding the data to AWS S3 and the metadata to the production database involves assistance from a Bioconductor team member. If you are interested in submitting a package, please send an email to packages@bioconductor.org so a team member can work with you through the process.
Building the software package:
Below is an outline of package organization. The files listed are required unless otherwise stated.

inst/extdata/
- metadata.csv: This file contains the metadata in the format of one row per resource to be added to the AnnotationHub database. The file should be generated from the code in inst/scripts/make-metadata.R where the final data are written out with write.csv(…, row.names=FALSE). The required column names and data types are specified in AnnotationHubData::readMetadataFromCsv(). See ?readMetadataFromCsv for details.
If necessary, metadata can be broken up into multiple csv files instead of having all records in a single “metadata.csv”.
inst/scripts/
- make-data.R:
  A script describing the steps involved in making the data object(s). This includes where the original data were downloaded from, pre-processing, and how the final R object was made. Include a description of any steps performed outside of R with third party software. Output of the script should be files on disk ready to be pushed to S3. If data are to be hosted on a personal web site instead of S3, this file should explain any manipulation of the data prior to hosting on the web site. For data hosted on a public web site with no prior manipulation this file is not needed.
- make-metadata.R:
  A script to make the metadata.csv file located in inst/extdata of the package. See ?readMetadataFromCsv for a description of the metadata.csv file, expected fields and data types. The readMetadataFromCsv() function can be used to validate the metadata.csv file before submitting the package.
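
As a rough illustration of what inst/scripts/make-metadata.R might look like, the sketch below builds a two-row data.frame and writes it with write.csv(..., row.names=FALSE). The column names are borrowed from the fields of the AnnotationHubMetadata() example in this document; the authoritative list of required columns and types is in ?readMetadataFromCsv and may include more than is shown here. Every value (titles, URLs, paths, maintainer address, the "MyAnnotations" package name) is a hypothetical placeholder, and the file is written under a temporary directory rather than the package's inst/extdata/ so the snippet can be run anywhere.

```r
## Hypothetical sketch of inst/scripts/make-metadata.R.
## All values are placeholders, not real resources.
meta <- data.frame(
    Title = c("MyOrganism_v84_genes", "MyOrganism_v85_genes"),
    Description = c("Gene annotation, release 84",
                    "Gene annotation, release 85"),
    BiocVersion = "3.6",
    Genome = "myGenome1",
    SourceType = "GFF",
    SourceUrl = c("http://example.org/v84/genes.gff3",
                  "http://example.org/v85/genes.gff3"),
    SourceVersion = c("84", "85"),
    Species = "Vitis vinifera",
    TaxonomyId = 29760L,
    Coordinate_1_based = TRUE,
    DataProvider = "Example provider",
    Maintainer = "Jane Doe <jane.doe@example.org>",
    RDataClass = "GRanges",
    DispatchClass = "GRanges",
    RDataPath = c("MyAnnotations/v84/genes.rda",
                  "MyAnnotations/v85/genes.rda"),
    ## Assumption: multiple tags are joined with ':' in the csv;
    ## check ?readMetadataFromCsv for the expected format.
    Tags = "GFF:Gene:Annotation",
    stringsAsFactors = FALSE
)

## In a real package this would be "inst/extdata"; a temporary
## directory is used here so the sketch has no side effects.
out_dir <- file.path(tempdir(), "inst", "extdata")
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
metadata_file <- file.path(out_dir, "metadata.csv")

## One row per resource, no row names column.
write.csv(meta, file = metadata_file, row.names = FALSE)
```

Single values such as Species are recycled across rows, which keeps the script short when most fields are shared by all resources in the package.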
vignettes/

OPTIONAL vignette(s) describing analysis workflows.
R/

OPTIONAL functions to enhance data exploration.
man/
- package man page:
  OPTIONAL. The package man page serves as a landing point and should briefly describe all resources associated with the package. There should be an entry for each resource title either on the package man page or individual man pages.
- resource man pages:
  OPTIONAL. Man page(s) should describe the resource (raw data source, processing, QC steps) and demonstrate how the data can be loaded through the AnnotationHub interface. For example, replace “SEARCHTERM*” below with one or more search terms that uniquely identify resources in your package.
```
library(AnnotationHub)
hub <- AnnotationHub()
myfiles <- query(hub, "SEARCHTERM1", "SEARCHTERM2")
myfiles[[1]]  ## load the first resource in the list
```
DESCRIPTION / NAMESPACE
The scripts used to generate the metadata will likely use functions from AnnotationHubData which should be listed in Depends/Imports as necessary.

Package authors are encouraged to use the AnnotationHub::listResources() and AnnotationHub::loadResources() functions in their man pages and vignette. These helpers are designed to facilitate data discovery within a specific package rather than across all of AnnotationHub. If used, these functions should be imported from AnnotationHub.

Data objects:
Data are not formally part of the software package and are stored separately in AWS S3 buckets. The author should make the data available via dropbox, ftp site or another mutually accessible application and it will be uploaded to S3 by a member of the Bioconductor team.
Confirm valid metadata:
Confirm the data in inst/extdata/metadata.csv are valid by running AnnotationHubData:::makeAnnotationHubMetadata() on your package. Please address any warnings or errors.
Package review:
Submit the package to the tracker for review. The primary purpose of the package review is to validate the metadata in the csv file(s). It is ok if the package fails R CMD build and check because the data and metadata are not yet in place. Once the metadata.csv is approved, records are added to the production database. At that point the package man pages and vignette can be finalized and the package should pass R CMD build and check.
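
Before running the validation described under "Confirm valid metadata", a quick base-R check can catch obvious problems such as missing columns or empty fields. This is only a sketch: the helper name check_metadata_csv and the column list passed to it are assumptions made for illustration, and it is not a substitute for the validation done by AnnotationHubData.

```r
## Hypothetical helper, not part of AnnotationHubData.
## Returns a character vector of problems (empty if none found).
check_metadata_csv <- function(path, expected_cols) {
    meta <- read.csv(path, stringsAsFactors = FALSE)
    problems <- character()

    ## Columns that were expected but are absent from the file
    missing_cols <- setdiff(expected_cols, names(meta))
    if (length(missing_cols))
        problems <- c(problems,
                      paste("missing column(s):",
                            paste(missing_cols, collapse = ", ")))

    ## Expected columns that contain NA or empty strings
    present <- intersect(expected_cols, names(meta))
    has_empty <- vapply(present, function(col) {
        x <- meta[[col]]
        any(is.na(x) | !nzchar(as.character(x)))
    }, logical(1))
    if (any(has_empty))
        problems <- c(problems,
                      paste("empty value(s) in:",
                            paste(present[has_empty], collapse = ", ")))
    problems
}

## Example on a small throwaway file with two deliberate problems
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(Title = c("res_a", "res_b"),
                     Species = c("Vitis vinifera", "")),
          tmp, row.names = FALSE)
issues <- check_metadata_csv(
    tmp, expected_cols = c("Title", "Description", "Species"))
print(issues)
```

Here the check reports the absent Description column and the empty Species value; the column list would normally be the full set documented in ?readMetadataFromCsv.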

2.2 Individual resources

Individual objects of a standard class can be added to the hub by providing only the data and metadata files or by creating a package as described in the Family of Resources section.

OrgDb, TxDb and BSgenome objects are well defined Bioconductor classes and methods to download and process these objects already exist in AnnotationHub. When adding only one or two objects the overhead of creating a package may be unnecessary. The goal of the package is to provide structure for metadata generation and makes sense when there are plans to update versions or add new organisms in the future.

Make sure the OrgDb, TxDb or BSgenome object you want to add does not already exist in the Bioconductor annotation repository.

Providing just data and metadata files involves the following steps:

Notify Bioconductor team member:
Adding the data to AWS S3 and the metadata to the production database involves assistance from a Bioconductor team member. Please send email to packages@bioconductor.org so a team member can work with you through the process.
Prepare the data:
In the case of an OrgDb object, only the sqlite file is stored in S3. See makeOrgPackageFromNCBI() and makeOrgPackage() in the AnnotationForge package for help creating the sqlite file. BSgenome objects should be made according to the steps outlined in the BSgenome vignette. TxDb objects will be made on-the-fly from a GRanges with GenomicFeatures::makeTxDbFromGRanges() when the resource is downloaded from AnnotationHub. Data should be provided as a GRanges object. See GenomicRanges::makeGRangesFromDataFrame() or rtracklayer::import() for help creating the GRanges.
Generate metadata:
Prepare a .R file that generates metadata for the resource(s) by calling the AnnotationHubData::AnnotationHubMetadata() constructor. Argument details are found on the ?AnnotationHubMetadata man page.

As an example, this piece of code generates the metadata for the Vitis vinifera TxDb Timothée Flutre contributed to AnnotationHub:

```
library(AnnotationHubData)

## One AnnotationHubMetadata object describes one resource
metadata <- AnnotationHubMetadata(
    Description="Gene Annotation for Vitis vinifera",
    Genome="IGGP12Xv0",
    Species="Vitis vinifera",
    SourceUrl="http://genomes.cribi.unipd.it/DATA/V2/V2.1/V2.1.gff3",
    SourceLastModifiedDate=as.POSIXct("2014-04-17"),
    SourceVersion="2.1",
    RDataPath="community/tflutre/",
    TaxonomyId=29760L,
    Title="Vvinifera_CRIBI_IGGP12Xv0_V2.1.gff3.Rdata",
    BiocVersion=package_version("3.3"),
    Coordinate_1_based=TRUE,
    DataProvider="CRIBI",
    Maintainer="Timothée Flutre <timothee.flutre@supagro.inra.fr>",
    RDataClass="GRanges",
    DispatchClass="GRanges",
    SourceType="GFF",
    RDataDateAdded=as.POSIXct(Sys.time()),
    Recipe=NA_character_,
    PreparerClass="None",
    Tags=c("GFF", "CRIBI", "Gene", "Transcript", "Annotation"),
    Notes="chrUn renamed to chrUkn"
)
```

Add data to S3 and metadata to the database:
This last step is done by the Bioconductor team member.
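
For the "Prepare the data" step, a GRanges can be built from a plain data.frame with GenomicRanges::makeGRangesFromDataFrame(). The sketch below uses made-up coordinates and a hypothetical output file name, and assumes the object is serialized with save(). The GenomicRanges call is guarded so that the data.frame part still runs where the package is not installed.

```r
## Made-up features for illustration only.
features <- data.frame(
    chr = c("chr1", "chr1", "chr2"),
    start = c(100L, 500L, 250L),
    end = c(200L, 800L, 400L),
    strand = c("+", "-", "+"),
    gene_id = c("geneA", "geneB", "geneC"),
    stringsAsFactors = FALSE
)

if (requireNamespace("GenomicRanges", quietly = TRUE)) {
    ## keep.extra.columns=TRUE carries gene_id along as a
    ## metadata column on the resulting GRanges.
    gr <- GenomicRanges::makeGRangesFromDataFrame(
        features, keep.extra.columns = TRUE)

    ## Hypothetical file name; this is the file that would be
    ## shared with the Bioconductor team for upload to S3.
    save(gr, file = file.path(tempdir(), "my_features.Rdata"))
}
```

For data that already exist as GFF or BED files, rtracklayer::import() is usually the shorter route to the same kind of object.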

3 Additional resources

Metadata for new versions of the data can be added to the same package as they become available.

The titles for the new versions must be unique and not match the title of any resource currently in AnnotationHub. Good practice would be to include the version and / or genome build in the title.
Make data available via dropbox, ftp, etc. and notify maintainer@bioconductor.org
Update make-metadata.R with the new metadata information
Generate a new or updated metadata.csv file. The package should contain metadata for all versions of the data in AnnotationHub. When adding a new version it might be helpful to write a new csv file named by version, e.g., metadata_v84.csv, metadata_v85.csv etc.
Bump package version and commit to svn/git
Notify maintainer@bioconductor.org that an update is ready and a team member will add the new metadata to the production database; new resources will not be visible in AnnotationHub until the metadata are added to the database.
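
Because titles must be unique across versions, it can help to check all metadata csv files in the package together before notifying the team. The sketch below is one way to do that in base R. The helper name find_duplicate_titles, the file names and the titles are hypothetical, and uniqueness against resources already in AnnotationHub still has to be checked separately.

```r
## Hypothetical helper: collect Title values from every
## metadata*.csv file in a directory and report any title
## that occurs more than once.
find_duplicate_titles <- function(extdata_dir) {
    files <- list.files(extdata_dir, pattern = "^metadata.*\\.csv$",
                        full.names = TRUE)
    titles <- unlist(lapply(files, function(f) {
        read.csv(f, stringsAsFactors = FALSE)$Title
    }))
    unique(titles[duplicated(titles)])
}

## Throwaway example: the v85 file accidentally reuses a v84 title
extdata_dir <- file.path(tempdir(), "extdata_versions")
dir.create(extdata_dir, showWarnings = FALSE)
write.csv(data.frame(Title = "MyOrganism_v84_genes"),
          file.path(extdata_dir, "metadata_v84.csv"),
          row.names = FALSE)
write.csv(data.frame(Title = c("MyOrganism_v85_genes",
                               "MyOrganism_v84_genes")),
          file.path(extdata_dir, "metadata_v85.csv"),
          row.names = FALSE)

dups <- find_duplicate_titles(extdata_dir)
print(dups)
```

In this example the reused title "MyOrganism_v84_genes" is reported; including the version or genome build in each title, as suggested above, makes such collisions unlikely.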

Contact maintainer@bioconductor.org with any questions.

4 Bug fixes

A bug fix may involve a change to the metadata, data resource or both.

4.1 Update the resource

The replacement resource must have the same name as the original
Notify maintainer@bioconductor.org that you want to replace the data and make the files available via dropbox, ftp, etc.

4.2 Update the metadata

New metadata records can be added for new resources but modifying existing records is discouraged. Record modification will only be done in the case of bug fixes.

Notify maintainer@bioconductor.org that you want to change the metadata
Update make-metadata.R with modified information
Bump the package version and commit to svn/git

5 Remove resources

When a resource is removed from AnnotationHub the ‘status’ field in the metadata is modified to explain why it is no longer available. Once this status is changed the AnnotationHub() constructor will not list the resource among the available ids. An attempt to extract the resource with ‘[[’ and the AH id will return an error along with the status message.

To remove a resource from AnnotationHub contact maintainer@bioconductor.org.
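
Code that loads resources by AH id may want to handle the error raised for a removed resource. The following is a minimal sketch, assuming only that ‘[[’ signals an ordinary R error as described above. The helper name fetch_or_null and the ids are hypothetical, and a named vector stands in for the hub so the snippet runs without network access.

```r
## Hypothetical helper: return the resource, or NULL with a
## message if extraction fails (e.g. the resource was removed).
fetch_or_null <- function(hub, id) {
    tryCatch(
        hub[[id]],
        error = function(e) {
            message("Could not load ", id, ": ", conditionMessage(e))
            NULL
        }
    )
}

## Stand-in for an AnnotationHub object; ids are placeholders.
fake_hub <- c(AH00001 = "resource one")
present <- fetch_or_null(fake_hub, "AH00001")
removed <- fetch_or_null(fake_hub, "AH99999")
```

With a real hub the message would carry the status text explaining why the resource is no longer available.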

6 Historical vignettes

The process for adding data to AnnotationHub has evolved substantially since the first vignettes were written. Much of the information contained in those documents is outdated or applicable only to repeat-run recipes added to the code base. These documents have been retained for historical purposes and are located in the inst/scripts/ directory of the AnnotationHubData package.

Introduction to AnnotationHubData

Valerie Obenchain

Modified: October 2016. Compiled: 11 Oct 2017
