1 Overview

The AnnotationHubData package provides tools to acquire, annotate, convert and store data for use in Bioconductor’s AnnotationHub. BED files from the ENCODE project, GTF files from Ensembl, and annotation tracks from UCSC are examples of data that can be downloaded, described with metadata, transformed to standard Bioconductor data types, and stored so that they can be served on demand to users via the AnnotationHub client. While data are often manipulated into a more R-friendly form, the data themselves retain their raw content and are not filtered or curated like those in ExperimentHub.
Each resource has associated metadata that can be searched through the AnnotationHub client interface.
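
As an illustration of searching resource metadata from the client side, here is a minimal sketch. The helper name search_hub() is hypothetical; it only wraps the AnnotationHub() constructor and query(), and it assumes the AnnotationHub package is installed and a network connection is available when it is called.

```r
# Hypothetical helper: return the hub records whose metadata match all terms.
search_hub <- function(terms) {
    if (!requireNamespace("AnnotationHub", quietly = TRUE))
        stop("The AnnotationHub package is required for this sketch")
    hub <- AnnotationHub::AnnotationHub()   # contacts the hub; needs network
    AnnotationHub::query(hub, terms)        # keep records matching every term
}

# Example call (not run here because it needs network access):
# hits <- search_hub(c("Vitis vinifera", "GRanges"))
# hits$title
```

The search terms above are only examples; any combination of words that appears in the metadata fields can be used.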

2 New resources

2.1 Family of resources

Multiple, related resources are added to AnnotationHub by creating a software package similar to the existing annotation packages. The package itself does not contain data but serves as a lightweight wrapper around scripts that generate metadata for the resources added to AnnotationHub.

At a minimum the package should contain a man page describing the resources. Vignettes and additional R code for manipulating the objects are optional.

Creating the package involves the following steps:

  1. Notify Bioconductor team member:
    Man page and vignette examples in the software package will not work until the data are available in AnnotationHub. Adding the data to AWS S3 and the metadata to the production database involves assistance from a Bioconductor team member. If you are interested in submitting a package, please send an email to packages@bioconductor.org so a team member can work with you through the process.

  2. Building the software package:
    Below is an outline of package organization. The files listed are required unless otherwise stated.

  3. Data objects:
    Data are not formally part of the software package and are stored separately in AWS S3 buckets. The author should make the data available via Dropbox, an FTP site or another mutually accessible application, and a member of the Bioconductor team will upload it to S3.

  4. Confirm valid metadata:
    Confirm the data in inst/extdata/metadata.csv are valid by running AnnotationHubData:::makeAnnotationHubMetadata() on your package. Please address any warnings or errors.

  5. Package review:
    Submit the package to the tracker for review. The primary purpose of the package review is to validate the metadata in the csv file(s). It is acceptable for the package to fail R CMD build and check at this stage because the data and metadata are not yet in place. Once the metadata.csv is approved, records are added to the production database. At that point the package man pages and vignette can be finalized and the package should pass R CMD build and check.
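
To make the metadata step more concrete, the sketch below writes a one-row metadata.csv into a throwaway package skeleton. Every value is made up for illustration, including the package name MyAnnotationHubPkg. The column names and the colon-separated Tags format follow my recollection of the ?makeAnnotationHubMetadata man page, so treat them as an assumption and check them against that page before relying on them.

```r
# Hypothetical package root in a temporary directory; a real package
# would use its own source tree instead.
pkg_root <- file.path(tempdir(), "MyAnnotationHubPkg")
extdata <- file.path(pkg_root, "inst", "extdata")
dir.create(extdata, recursive = TRUE, showWarnings = FALSE)

# One row per resource; all values below are illustrative only.
meta <- data.frame(
    Title = "Example GRanges resource",
    Description = "Illustrative record only; not a real resource",
    BiocVersion = "3.3",
    Genome = "IGGP12Xv0",
    SourceType = "GFF",
    SourceUrl = "http://example.org/annotation.gff3",
    SourceVersion = "1.0",
    Species = "Vitis vinifera",
    TaxonomyId = 29760L,
    Coordinate_1_based = TRUE,
    DataProvider = "Example provider",
    Maintainer = "Jane Doe <jane.doe@example.org>",
    RDataClass = "GRanges",
    DispatchClass = "GRanges",
    RDataPath = "MyAnnotationHubPkg/annotation.Rdata",
    Tags = "GFF:Gene:Annotation",
    stringsAsFactors = FALSE
)

csv_path <- file.path(extdata, "metadata.csv")
write.csv(meta, file = csv_path, row.names = FALSE)

# Validation as described in the steps above. It is left commented out
# because it requires AnnotationHubData and a complete package.
# AnnotationHubData::makeAnnotationHubMetadata(pkg_root)
```

Generating the csv from a script rather than editing it by hand makes it easier to regenerate the file when new versions of the data are added.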

2.2 Individual resources

Individual objects of a standard class can be added to the hub by providing only the data and metadata files or by creating a package as described in the Family of Resources section.

OrgDb, TxDb and BSgenome objects are well-defined Bioconductor classes, and methods to download and process these objects already exist in AnnotationHub. When adding only one or two objects the overhead of creating a package may be unnecessary. The goal of the package is to provide structure for metadata generation, and it makes sense when there are plans to update versions or add new organisms in the future.

Make sure the OrgDb, TxDb or BSgenome object you want to add does not already exist in the Bioconductor annotation repository.

Providing just data and metadata files involves the following steps:

  1. Notify Bioconductor team member:
    Adding the data to AWS S3 and the metadata to the production database involves assistance from a Bioconductor team member. Please send an email to packages@bioconductor.org so a team member can work with you through the process.

  2. Prepare the data:
    In the case of an OrgDb object, only the sqlite file is stored in S3. See makeOrgPackageFromNCBI() and makeOrgPackage() in the AnnotationForge package for help creating the sqlite file. BSgenome objects should be made according to the steps outlined in the BSgenome vignette. TxDb objects will be made on-the-fly from a GRanges with GenomicFeatures::makeTxDbFromGRanges() when the resource is downloaded from AnnotationHub, so the data for a TxDb should be provided as a GRanges object. See GenomicRanges::makeGRangesFromDataFrame() or rtracklayer::import() for help creating the GRanges.

  3. Generate metadata:
    Prepare a .R file that generates metadata for the resource(s) by calling the AnnotationHubData::AnnotationHubMetadata() constructor. Argument details are found on the ?AnnotationHubMetadata man page.

As an example, the following code generates the metadata for the Vitis vinifera TxDb that Timothée Flutre contributed to AnnotationHub:

library(AnnotationHubData)

metadata <- AnnotationHubMetadata(
    # What the resource is
    Description="Gene Annotation for Vitis vinifera",
    Genome="IGGP12Xv0",
    Species="Vitis vinifera",
    # Where the source file came from
    SourceUrl="http://genomes.cribi.unipd.it/DATA/V2/V2.1/V2.1.gff3",
    SourceLastModifiedDate=as.POSIXct("2014-04-17"),
    SourceVersion="2.1",
    RDataPath="community/tflutre/",
    TaxonomyId=29760L,
    Title="Vvinifera_CRIBI_IGGP12Xv0_V2.1.gff3.Rdata",
    BiocVersion=package_version("3.3"),
    Coordinate_1_based=TRUE,
    DataProvider="CRIBI",
    Maintainer="Timothée Flutre <timothee.flutre@supagro.inra.fr>",
    # Class of the stored object and how it is dispatched on download
    RDataClass="GRanges",
    DispatchClass="GRanges",
    SourceType="GFF",
    RDataDateAdded=Sys.time(),
    Recipe=NA_character_,
    PreparerClass="None",
    Tags=c("GFF", "CRIBI", "Gene", "Transcript", "Annotation"),
    Notes="chrUn renamed to chrUkn"
)

  4. Add data to S3 and metadata to the database:
    This last step is done by the Bioconductor team member.
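
The data preparation step above can be sketched as follows. This is a minimal example assuming the GenomicRanges package is installed; the coordinates, gene identifiers and output file name are invented. It only illustrates building and saving a GRanges. A GRanges intended for GenomicFeatures::makeTxDbFromGRanges() would normally come from rtracklayer::import() on a GFF or GTF file so that the feature columns that function expects are present.

```r
library(GenomicRanges)

# Toy annotation table; all values are made up for illustration.
genes <- data.frame(
    chr = c("chr1", "chr1", "chrUkn"),
    start = c(100L, 500L, 20L),
    end = c(250L, 900L, 80L),
    strand = c("+", "-", "+"),
    gene_id = c("g1", "g2", "g3"),
    stringsAsFactors = FALSE
)

# Convert to a GRanges, keeping gene_id as a metadata column.
gr <- makeGRangesFromDataFrame(genes, keep.extra.columns = TRUE)

# Serialize the object; this is the file that would be handed to the
# Bioconductor team for upload (hypothetical file name).
out_file <- file.path(tempdir(), "example_annotation.Rdata")
save(gr, file = out_file)
```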

3 Additional resources

Metadata for new versions of the data can be added to the same package as they become available.

Contact maintainer@bioconductor.org with any questions.

4 Bug fixes

A bug fix may involve a change to the metadata, data resource or both.

4.1 Update the resource

4.2 Update the metadata

New metadata records can be added for new resources but modifying existing records is discouraged. Record modification will only be done in the case of bug fixes.

5 Remove resources

When a resource is removed from AnnotationHub, the ‘status’ field in the metadata is modified to explain why it is no longer available. Once this status is changed, the AnnotationHub() constructor will not list the resource among the available ids. An attempt to extract the resource with ‘[[’ and the AH id will return an error along with the status message.
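
Code that retrieves resources by id may want to handle this error instead of stopping. The helper below is a hypothetical sketch in base R, not part of AnnotationHub: it wraps ‘[[’ in tryCatch() and reports the error message, which for a removed resource would carry the status text described above.

```r
# Hypothetical helper: try to extract `id` from `hub`; on error, print the
# message and return NULL instead of stopping.
fetch_or_report <- function(hub, id) {
    tryCatch(
        hub[[id]],
        error = function(e) {
            message("Could not retrieve ", id, ": ", conditionMessage(e))
            NULL
        }
    )
}

# With a real hub (not run; "AH00000" is a placeholder id):
# hub <- AnnotationHub::AnnotationHub()
# res <- fetch_or_report(hub, "AH00000")
```

Returning NULL keeps a loop over many ids running, while the message preserves the explanation of why a resource is unavailable.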

To remove a resource from AnnotationHub contact maintainer@bioconductor.org.

6 Historical vignettes

The process for adding data to AnnotationHub has evolved substantially since the first vignettes were written. Much of the information contained in those documents is outdated or applicable only to repeat-run recipes added to the code base. These documents have been retained for historical purposes and are located in the inst/scripts/ directory of the AnnotationHubData package.