1 Version Info

R version: R version 4.2.2 (2022-10-31)
Bioconductor version: 3.16
Package version: 1.22.0

2 Introduction

Annotation resources make up a significant proportion of the Bioconductor project[1]. A diverse set of online resources is also available, accessed through specific packages. This walkthrough describes the most popular of these resources and gives some high-level examples of how to use them.

Bioconductor annotation resources have traditionally been used near the end of an analysis. After the bulk of the data analysis, annotations would be used interpretatively to learn about the most significant results. But increasingly, they are also used as a starting point or even as an intermediate step to help guide a study that is still in progress. In addition to this, what it means for something to be an annotation is also becoming less clear than it once was. It used to be clear that annotations were only those things that had been established after multiple different studies had been performed (such as the primary role of a gene product). But today many large data sets are treated by communities in much the same way that classic annotations once were: as a reference for additional comparisons.

Another change that is underway with annotations in Bioconductor is in the way that they are obtained. In the past annotations existed almost exclusively as separate annotation packages[2,3,4]. Today packages are still an enormous source of annotations. The current release repository contains over eight hundred annotation packages. This table summarizes some of the more important classes of annotation objects that are often accessed using packages:

Object Type	Example Package Name	Contents
TxDb	TxDb.Hsapiens.UCSC.hg19.knownGene	Transcriptome ranges for the known gene track of Homo sapiens, e.g., introns, exons, UTR regions.
OrgDb	org.Hs.eg.db	Gene-based information for Homo sapiens; useful for mapping between gene IDs, Names, Symbols, GO and KEGG identifiers, etc.
BSgenome	BSgenome.Hsapiens.UCSC.hg19	Full genome sequence for Homo sapiens.
src_organism	Organism.dplyr	Collection of multiple annotations for a common organism and genome build.
AnnotationHub	AnnotationHub	Provides a convenient interface to annotations from many different sources; objects are returned as fully parsed Bioconductor data objects or as the name of a file on disk.

Despite the popularity of annotation packages, annotations are increasingly also being pulled down from web services like biomaRt[5,6,7] or from the AnnotationHub[8]. Both of these represent enormous resources for annotation data.

In part because of the rapidly evolving landscape, it is currently impossible to cover in a single document every possible annotation, or even every kind of annotation, present in Bioconductor. So here we will instead go over the most popular annotation resources and describe them in a way intended to expose common patterns used for accessing them. The hope is that a user with this information will be able to make educated guesses about how to find and use additional resources that will inevitably be added later. The sections that follow cover these resources in turn.

3 Set Up

In this chapter we make use of several Bioconductor packages. You can install them with BiocManager::install():

if (!"BiocManager" %in% rownames(installed.packages()))
     install.packages("BiocManager")
BiocManager::install(c("AnnotationHub", "Homo.sapiens",
           "Organism.dplyr",
           "TxDb.Hsapiens.UCSC.hg19.knownGene",
           "TxDb.Hsapiens.UCSC.hg38.knownGene",
           "BSgenome.Hsapiens.UCSC.hg19", "biomaRt",
           "TxDb.Athaliana.BioMart.plantsmart22"))

The usage of the installed packages is described in detail in the sections that follow.
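Before moving on, it can be useful to confirm that everything installed cleanly. The following is a minimal sketch using only base R; the variable names `pkgs` and `not_installed` are illustrative and not part of the workflow itself.

```r
# Packages this workflow relies on (same list as the install call above).
pkgs <- c("AnnotationHub", "Homo.sapiens", "Organism.dplyr",
          "TxDb.Hsapiens.UCSC.hg19.knownGene",
          "TxDb.Hsapiens.UCSC.hg38.knownGene",
          "BSgenome.Hsapiens.UCSC.hg19", "biomaRt",
          "TxDb.Athaliana.BioMart.plantsmart22")

# Compare against the packages currently installed in the library paths.
# The result is character(0) when nothing is missing.
not_installed <- setdiff(pkgs, rownames(installed.packages()))
not_installed
```

Any names that are printed can be passed back to BiocManager::install().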

4 Using AnnotationHub

The top of the list for learning about annotation resources is the relatively new AnnotationHub package[8]. The AnnotationHub was created to provide a convenient access point for end users to find a large range of different annotation objects for use with Bioconductor. Resources found in the AnnotationHub are easy to discover and are presented to the user as familiar Bioconductor data objects. Because it is a recent addition, the AnnotationHub allows access to a broad range of annotation-like objects, some of which may not have been considered annotations even a few years ago. To get started with the AnnotationHub, users only need to load the package and then create a local AnnotationHub object like this:

library(AnnotationHub)
ah <- AnnotationHub()

## snapshotDate(): 2022-10-31

The very first time that you call AnnotationHub(), it will create a cache directory on your system and download the latest metadata for the hub's current contents. From that time forward, whenever you download one of the hub's data objects, it will also cache those files in the local directory so that if you request the information again, you will be able to access it quickly.
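To get a sense of where that cache is likely to live, the sketch below prints the per-user cache directory that recent AnnotationHub releases are understood to use by default. This default is an assumption rather than something stated in this walkthrough, and `cache_dir` is just an illustrative variable name; in a live session, hubCache(ah) reports the location actually in use.

```r
# Assumed default cache location for recent AnnotationHub releases
# (tools::R_user_dir() requires R >= 4.0).
cache_dir <- tools::R_user_dir("AnnotationHub", which = "cache")
cache_dir

# Usually TRUE once a hub has been created with the default settings.
dir.exists(cache_dir)
```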

The show method of an AnnotationHub object will tell you how many resources are currently accessible using that object as well as give a high level overview of the most common kinds of data present.

ah

## AnnotationHub with 67944 records
## # snapshotDate(): 2022-10-31
## # $dataprovider: Ensembl, BroadInstitute, UCSC, ftp://ftp.ncbi.nlm.nih.gov/g...
## # $species: Homo sapiens, Mus musculus, Drosophila melanogaster, Bos taurus,...
## # $rdataclass: GRanges, TwoBitFile, BigWigFile, EnsDb, Rle, OrgDb, ChainFile...
## # additional mcols(): taxonomyid, genome, description,
## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
## #   rdatapath, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["AH5012"]]' 
## 
##              title                                       
##   AH5012   | Chromosome Band                             
##   AH5013   | STS Markers                                 
##   AH5014   | FISH Clones                                 
##   AH5015   | Recomb Rate                                 
##   AH5016   | ENCODE Pilot                                
##   ...        ...                                         
##   AH109476 | Ensembl 108 EnsDb for Xiphophorus couchianus
##   AH109477 | Ensembl 108 EnsDb for Xiphophorus maculatus 
##   AH109478 | Ensembl 108 EnsDb for Xenopus tropicalis    
##   AH109479 | Ensembl 108 EnsDb for Zonotrichia albicollis
##   AH109480 | Ensembl 108 EnsDb for Zalophus californianus

As you can see from the object above, there are a LOT of different resources available. So normally when you get an AnnotationHub object the first thing you want to do is to filter it to remove unwanted resources.

Fortunately, the AnnotationHub has several different kinds of metadata that you can use for searching and subsetting. To see the different categories all you need to do is to type the name of your AnnotationHub object and then tab complete from the ‘$’ operator. And to see all possible contents of one of these categories you can pass that value in to unique like this:

unique(ah$dataprovider)

##  [1] "UCSC"                                                                                                      
##  [2] "Ensembl"                                                                                                   
##  [3] "RefNet"                                                                                                    
##  [4] "Inparanoid8"                                                                                               
##  [5] "NHLBI"                                                                                                     
##  [6] "ChEA"                                                                                                      
##  [7] "Pazar"                                                                                                     
##  [8] "NIH Pathway Interaction Database"                                                                          
##  [9] "Haemcode"                                                                                                  
## [10] "BroadInstitute"                                                                                            
## [11] "PRIDE"                                                                                                     
## [12] "Gencode"                                                                                                   
## [13] "CRIBI"                                                                                                     
## [14] "Genoscope"                                                                                                 
## [15] "MISO, VAST-TOOLS, UCSC"                                                                                    
## [16] "UWashington"                                                                                               
## [17] "Stanford"                                                                                                  
## [18] "dbSNP"                                                                                                     
## [19] "BioMart"                                                                                                   
## [20] "GeneOntology"                                                                                              
## [21] "KEGG"                                                                                                      
## [22] "URGI"                                                                                                      
## [23] "EMBL-EBI"                                                                                                  
## [24] "MicrosporidiaDB"                                                                                           
## [25] "FungiDB"                                                                                                   
## [26] "TriTrypDB"                                                                                                 
## [27] "ToxoDB"                                                                                                    
## [28] "AmoebaDB"                                                                                                  
## [29] "PlasmoDB"                                                                                                  
## [30] "PiroplasmaDB"                                                                                              
## [31] "CryptoDB"                                                                                                  
## [32] "TrichDB"                                                                                                   
## [33] "GiardiaDB"                                                                                                 
## [34] "The Gene Ontology Consortium"                                                                              
## [35] "ENCODE Project"                                                                                            
## [36] "SchistoDB"                                                                                                 
## [37] "NCBI/UniProt"                                                                                              
## [38] "GENCODE"                                                                                                   
## [39] "http://www.pantherdb.org"                                                                                  
## [40] "RMBase v2.0"                                                                                               
## [41] "snoRNAdb"                                                                                                  
## [42] "tRNAdb"                                                                                                    
## [43] "NCBI"                                                                                                      
## [44] "DrugAge, DrugBank, Broad Institute"                                                                        
## [45] "DrugAge"                                                                                                   
## [46] "DrugBank"                                                                                                  
## [47] "Broad Institute"                                                                                           
## [48] "HMDB, EMBL-EBI, EPA"                                                                                       
## [49] "STRING"                                                                                                    
## [50] "OMA"                                                                                                       
## [51] "OrthoDB"                                                                                                   
## [52] "PathBank"                                                                                                  
## [53] "EBI/EMBL"                                                                                                  
## [54] "NCBI,DBCLS"                                                                                                
## [55] "FANTOM5,DLRP,IUPHAR,HPRD,STRING,SWISSPROT,TREMBL,ENSEMBL,CELLPHONEDB,BADERLAB,SINGLECELLSIGNALR,HOMOLOGENE"
## [56] "WikiPathways"                                                                                              
## [57] "VAST-TOOLS"                                                                                                
## [58] "pyGenomeTracks "                                                                                           
## [59] "NA"                                                                                                        
## [60] "UoE"                                                                                                       
## [61] "TargetScan,miRTarBase,USCS,ENSEMBL"                                                                        
## [62] "TargetScan"                                                                                                
## [63] "QuickGO"                                                                                                   
## [64] "CIS-BP"                                                                                                    
## [65] "CTCFBSDB 2.0"                                                                                              
## [66] "HOCOMOCO v11"                                                                                              
## [67] "JASPAR 2022"                                                                                               
## [68] "Jolma 2013"                                                                                                
## [69] "SwissRegulon"                                                                                              
## [70] "ENCODE SCREEN v3"                                                                                          
## [71] "MassBank"                                                                                                  
## [72] "ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/"                                                                     
## [73] "excluderanges"                                                                                             
## [74] "ENCODE"                                                                                                    
## [75] "GitHub"                                                                                                    
## [76] "Stanford.edu"                                                                                              
## [77] "Publication"                                                                                               
## [78] "CHM13"                                                                                                     
## [79] "UCSChub"

One of the most valuable ways in which the data is labeled is according to the kind of R object that will be returned to you.

unique(ah$rdataclass)

##  [1] "GRanges"                           "data.frame"                       
##  [3] "Inparanoid8Db"                     "TwoBitFile"                       
##  [5] "ChainFile"                         "SQLiteConnection"                 
##  [7] "biopax"                            "BigWigFile"                       
##  [9] "AAStringSet"                       "MSnSet"                           
## [11] "mzRident"                          "list"                             
## [13] "TxDb"                              "Rle"                              
## [15] "EnsDb"                             "VcfFile"                          
## [17] "igraph"                            "data.frame, DNAStringSet, GRanges"
## [19] "sqlite"                            "data.table"                       
## [21] "character"                         "SQLite"                           
## [23] "SQLiteFile"                        "Tibble"                           
## [25] "Rda"                               "FaFile"                           
## [27] "String"                            "CompDb"                           
## [29] "OrgDb"

Once you have identified which sorts of metadata you would like to use to find your data of interest, you can then use the subset or query methods to reduce the size of the hub object to something more manageable. For example, you could select only those records where the string ‘GRanges’ appears in the metadata. As you can see, GRanges is one of the more popular formats for data that comes from the AnnotationHub.

grs <- query(ah, "GRanges")
grs

## AnnotationHub with 28993 records
## # snapshotDate(): 2022-10-31
## # $dataprovider: Ensembl, BroadInstitute, UCSC, Haemcode, FungiDB, Pazar, Tr...
## # $species: Homo sapiens, Mus musculus, Bos taurus, Pan troglodytes, Danio r...
## # $rdataclass: GRanges, data.frame, DNAStringSet, GRanges
## # additional mcols(): taxonomyid, genome, description,
## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
## #   rdatapath, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["AH5012"]]' 
## 
##              title                  
##   AH5012   | Chromosome Band        
##   AH5013   | STS Markers            
##   AH5014   | FISH Clones            
##   AH5015   | Recomb Rate            
##   AH5016   | ENCODE Pilot           
##   ...        ...                    
##   AH107381 | danRer10.UCSC.scaffold 
##   AH107382 | dm6.UCSC.other         
##   AH107383 | dm3.UCSC.contig        
##   AH107384 | dm3.UCSC.scaffold      
##   AH107385 | TAIR10.UCSC.araTha1.gap

Or you can use subsetting to select only the records that match on a specific field:

grs <- ah[ah$rdataclass == "GRanges",]

The subset function is also provided.

orgs <- subset(ah, ah$rdataclass == "OrgDb")
orgs

## AnnotationHub with 1871 records
## # snapshotDate(): 2022-10-31
## # $dataprovider: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
## # $species: Escherichia coli, greater Indian_fruit_bat, Zootoca vivipara, Zo...
## # $rdataclass: OrgDb
## # additional mcols(): taxonomyid, genome, description,
## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
## #   rdatapath, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["AH107050"]]' 
## 
##              title                                       
##   AH107050 | org.Ag.eg.db.sqlite                         
##   AH107051 | org.At.tair.db.sqlite                       
##   AH107052 | org.Bt.eg.db.sqlite                         
##   AH107053 | org.Cf.eg.db.sqlite                         
##   AH107054 | org.Gg.eg.db.sqlite                         
##   ...        ...                                         
##   AH109234 | org.Rhizoctonia_praticola.eg.sqlite         
##   AH109235 | org.Rhizoctonia_solani.eg.sqlite            
##   AH109236 | org.Heterostelium_album_PN500.eg.sqlite     
##   AH109237 | org.Heterostelium_pallidum_PN500.eg.sqlite  
##   AH109238 | org.Polysphondylium_pallidum_PN500.eg.sqlite
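The difference between a pattern query and an exact match on one field explains why the query(ah, "GRanges") result above also lists the class "data.frame, DNAStringSet, GRanges". The sketch below mimics both approaches on a toy data.frame in base R. It assumes that query() matches its pattern, case-insensitively, against the combined metadata of each record; the table, IDs and variable names are invented for illustration.

```r
# Toy stand-in for hub metadata; the IDs and values are invented.
meta <- data.frame(
  id         = c("X1", "X2", "X3", "X4"),
  title      = c("Chromosome Band", "Gene models", "Chain file", "Mixed tracks"),
  rdataclass = c("GRanges", "TxDb", "ChainFile",
                 "data.frame, DNAStringSet, GRanges"),
  stringsAsFactors = FALSE
)

# query()-like search: does the pattern occur anywhere in a record's metadata?
all_fields <- do.call(paste, unname(as.list(meta)))
by_pattern <- meta$id[grepl("GRanges", all_fields, ignore.case = TRUE)]

# Exact match on a single field, as in ah[ah$rdataclass == "GRanges", ].
by_field <- meta$id[meta$rdataclass == "GRanges"]

by_pattern  # "X1" "X4"
by_field    # "X1"
```

The pattern search casts a wider net, so an exact match on a field is the safer choice when you need one specific class of object.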

And if you really need access to all the metadata you can extract it as a DataFrame using mcols() like so:

meta <- mcols(ah)

Also, if you are a fan of GUIs, you can use the display method to look at your data in a browser and return selected rows back as a smaller AnnotationHub object like this:

sah <- display(ah)

Calling this method will produce a web-based interface like the one pictured here:

Once you have the AnnotationHub object pared down to a reasonable size, and are sure about which records you want to retrieve, then you only need to use the ‘[[’ operator to extract them. Using the ‘[[’ operator, you can extract by numeric index (1,2,3) or by AnnotationHub ID. If you choose to use the former, you simply extract the element that you are interested in. So for our GRanges example, you might just want the first one, like this:

res <- grs[[1]]

## loading from cache

head(res, n=3)

## UCSC track 'cytoBand'
## UCSCData object with 3 ranges and 1 metadata column:
##       seqnames          ranges strand |        name
##          <Rle>       <IRanges>  <Rle> | <character>
##   [1]     chr1       1-2300000      * |      p36.33
##   [2]     chr1 2300001-5400000      * |      p36.32
##   [3]     chr1 5400001-7200000      * |      p36.31
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome

4.1 AnnotationHub exercises

Exercise 1: Use the AnnotationHub to extract UCSC data that is from Homo sapiens and also specifically from the hg19 genome. What happens to the hub object as you filter data at each step?

Exercise 2: Now that you have narrowed things down to the hg19 annotations from the UCSC genome browser, let's get one of these annotations. Find the oreganno track and save it into a local variable.


5 OrgDb objects

At this point you might be wondering: What is this OrgDb object about? OrgDb objects are one member of a family of annotation objects that all represent hidden data through a shared set of methods. So if you look closely at the dog object created below, you can see it contains data for Canis familiaris (taxonomy ID = 9615). You can learn a little more about it by calling the columns method.

query(orgs, "Canis familiaris")

## AnnotationHub with 2 records
## # snapshotDate(): 2022-10-31
## # $dataprovider: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
## # $species: Canis familiaris_dingo, Canis familiaris
## # $rdataclass: OrgDb
## # additional mcols(): taxonomyid, genome, description,
## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
## #   rdatapath, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["AH107053"]]' 
## 
##              title                               
##   AH107053 | org.Cf.eg.db.sqlite                 
##   AH107556 | org.Canis_familiaris_dingo.eg.sqlite

dog <- ah[["AH107053"]]

## loading from cache

columns(dog)

##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
##  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
## [11] "GENETYPE"     "GO"           "GOALL"        "ONTOLOGY"     "ONTOLOGYALL" 
## [16] "PATH"         "PMID"         "REFSEQ"       "SYMBOL"       "UNIPROT"

The columns method gives you a vector of data types that can be retrieved from the object that you call it on. So the above call indicates that there are several different data types that can be retrieved from the dog object.

A very similar method is the keytypes method, which will list all the data types that can also be used as keys.

keytypes(dog)

##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
##  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
## [11] "GENETYPE"     "GO"           "GOALL"        "ONTOLOGY"     "ONTOLOGYALL" 
## [16] "PATH"         "PMID"         "REFSEQ"       "SYMBOL"       "UNIPROT"

In many cases most of the things that are listed as columns will also come back from a keytypes call, but since these two things are not guaranteed to be identical, we maintain two separate methods.

Now that you can see what kinds of things can be used as keys, you can call the keys method to extract out all the keys of a given key type.

head(keys(dog, keytype="ENTREZID"))

## [1] "399518" "399530" "399544" "399545" "399653" "403152"

This is useful if you need to get all the IDs of a particular kind but the keys method has a few extra arguments that can make it even more flexible. For example, using the keys method you could also extract the gene SYMBOLS that contain “COX” like this:

keys(dog, keytype="SYMBOL", pattern="COX")

##  [1] "COX5B"   "COX7A2L" "COX8A"   "COX15"   "COX5A"   "COX4I1"  "COX6A2" 
##  [8] "COX20"   "COX18"   "ACOX1"   "COX4I2"  "ACOX3"   "COX10"   "COX17"  
## [15] "COX11"   "ACOXL"   "COX7A1"  "COX1"    "COX2"    "COX3"    "COX19"  
## [22] "COX7B2"  "COX14"   "ACOX2"   "COX16"

Or if you really needed another keytype, you can use the column argument to extract the Entrez Gene IDs for those gene SYMBOLS that contain the string “COX”:

keys(dog, keytype="ENTREZID", pattern="COX", column="SYMBOL")

## 'select()' returned 1:1 mapping between keys and columns

##  [1] "474567"    "475739"    "476040"    "477792"    "478370"    "479623"   
##  [7] "479780"    "480099"    "482193"    "483322"    "485825"    "488790"   
## [13] "489515"    "503668"    "609555"    "611729"    "612614"    "804478"   
## [19] "804479"    "804480"    "100685945" "100687434" "100688544" "100855488"
## [25] "119863880"

But often, you will really want to extract other data that matches a particular key or set of keys. For that there are two methods which you can use. The more powerful of these is probably select. Here is how you would look up the gene SYMBOL and REFSEQ ID for a specific Entrez Gene ID.

select(dog, keys="804478", columns=c("SYMBOL","REFSEQ"), keytype="ENTREZID")

## 'select()' returned 1:1 mapping between keys and columns

##   ENTREZID SYMBOL    REFSEQ
## 1   804478   COX1 NP_008473

When you call it, select will return a data.frame that attempts to fill in matching values for all the columns you requested. However, if you ask select for things that have a many-to-one relationship to your keys (several values per key), it can result in an expansion of the data object that is returned. For example, watch what happens when we ask for the GO terms for the same Entrez Gene ID:

select(dog, keys="804478", columns="GO", keytype="ENTREZID")

## 'select()' returned 1:many mapping between keys and columns

##    ENTREZID         GO EVIDENCE ONTOLOGY
## 1    804478 GO:0004129      IEA       MF
## 2    804478 GO:0005743      IEA       CC
## 3    804478 GO:0005751      IEA       CC
## 4    804478 GO:0006119      IEA       BP
## 5    804478 GO:0006123      IEA       BP
## 6    804478 GO:0009060      IEA       BP
## 7    804478 GO:0015990      IEA       BP
## 8    804478 GO:0016021      IEA       CC
## 9    804478 GO:0020037      IEA       MF
## 10   804478 GO:0022904      IEA       BP
## 11   804478 GO:0045277      IEA       CC
## 12   804478 GO:0046872      IEA       MF

Because there are several GO terms associated with the gene “804478”, you end up with many rows in the data.frame. This can become problematic if you then ask for several columns that each have a many-to-one relationship to the original key. If you were to do that, not only would the result multiply in size, it would also become really hard to use. A better strategy is to be selective when using select.
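The way the result multiplies can be seen with a small base R sketch. It uses merge() on two invented toy tables rather than select() itself, on the assumption that select() combines multi-valued columns in an analogous way, as described above; the gene ID, GO terms and PMIDs are made up.

```r
# Toy annotation tables for a single gene; all IDs are invented.
go   <- data.frame(ENTREZID = "1", GO   = c("GO:A", "GO:B", "GO:C"))
pmid <- data.frame(ENTREZID = "1", PMID = c("111", "222"))

# Requesting both columns at once pairs every GO term with every PMID.
both <- merge(go, pmid, by = "ENTREZID")
nrow(both)  # 6 rows: 3 GO terms x 2 PMIDs
```

Asking for each multi-valued column in its own call keeps every result at the size of that column's matches (3 and 2 rows here, rather than 6).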

Sometimes you might want to look up matching results in a way that is simpler than the data.frame object that select returns. This is especially true when you only want to look up one kind of value per key. For these cases, we recommend that you look at the mapIds method. Let's look at what happens if we request the same basic information as in our recent select call, but instead using the mapIds method:

mapIds(dog, keys="804478", column="GO", keytype="ENTREZID")

## 'select()' returned 1:many mapping between keys and columns

##       804478 
## "GO:0004129"

As you can see, the mapIds method allows you to simplify the result that is returned. By default, mapIds only returns the first matching element for each key. But what if you really need all those GO terms returned when you call mapIds? Well, then you can make use of the mapIds multiVals argument. There are several options for this argument; we have already seen how by default you can return only the ‘first’ element. But you can also return a ‘list’ or ‘CharacterList’ object, or you can ‘filter’ out or return ‘asNA’ any keys that have multiple matches. You can even define your own rule (as a function) and pass that in as an argument to multiVals. Let's look at what happens when you return a list:

mapIds(dog, keys="804478", column="GO", keytype="ENTREZID", multiVals="list")

## 'select()' returned 1:many mapping between keys and columns

## $`804478`
##  [1] "GO:0004129" "GO:0005743" "GO:0005751" "GO:0006119" "GO:0006123"
##  [6] "GO:0009060" "GO:0015990" "GO:0016021" "GO:0020037" "GO:0022904"
## [11] "GO:0045277" "GO:0046872"
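To make the multiVals options more concrete, the sketch below reproduces what each one is described as doing, using base R on an invented toy lookup table. It is an emulation for illustration only, not the mapIds implementation; the keys, values and variable names are hypothetical.

```r
# Toy one-to-many lookup; keys and values are invented.
tab <- data.frame(key   = c("k1", "k1", "k1", "k2"),
                  value = c("a",  "b",  "c",  "d"),
                  stringsAsFactors = FALSE)

# Group the values by key: k1 has three matches, k2 has one.
vals <- split(tab$value, tab$key)

# "first": keep only the first match per key.
first <- vapply(vals, function(v) v[1], character(1))

# "list": keep every match per key.
as_list <- vals

# "filter": drop keys that have more than one match.
filtered <- unlist(vals[lengths(vals) == 1])

# "asNA": keep all keys, but return NA where there are several matches.
as_na <- vapply(vals,
                function(v) if (length(v) > 1) NA_character_ else v,
                character(1))

# A custom rule: collapse all matches into one string per key.
collapsed <- vapply(vals, paste, character(1), collapse = ";")
```

With a real OrgDb object, a custom rule of this kind would be supplied as a function, for example multiVals = function(x) paste(x, collapse = ";").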

Now that you know how to extract information from an OrgDb object, you might find it helpful to know that there is a whole family of other AnnotationDb-derived objects that you can also use with these same five methods (keytypes(), columns(), keys(), select(), and mapIds()). For example, there are ChipDb objects, InparanoidDb objects and TxDb objects, which contain data about microarray probes, Inparanoid homology partners and transcript range information respectively. And there are also more specialized objects like GODb or ReactomeDb objects, which offer access to data from GO and Reactome. In the next section, we will be looking at one of the more popular classes of these objects: the TxDb object.

5.1 OrgDb exercises

Exercise 3: Look at the help page for the different columns and keytypes values with: help("SYMBOL"). Now use this information and what we just described to look up the Entrez Gene ID and chromosome for the gene symbol “MSX2”.

Exercise 4: In the previous exercise we had to use gene symbols as keys. But in the past this kind of behavior has sometimes been inadvisable because some gene symbols are used as the official symbol for more than one gene. To learn if this is still happening take advantage of the fact that entrez gene ids are uniquely assigned, and extract all of the gene symbols and their associated entrez gene ids from the org.Hs.eg.db package. Then check the symbols for redundancy.


6 TxDb Objects

As mentioned before, TxDb objects can be accessed using the standard set of methods: keytypes(), columns(), keys(), select(), and mapIds(). But because these objects contain information about a transcriptome, they are often used to compare range based information to these important features of the genome[3,4]. As a result they also have specialized accessors for extracting out ranges that correspond to important transcriptome characteristics.

Let's start by loading a TxDb object from an annotation package based on the UCSC knownGene track for the hg19 build of Homo sapiens. A common practice when loading these is to shorten the long name to ‘txdb’ (just as a convenience).

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
txdb

## TxDb object:
## # Db type: TxDb
## # Supporting package: GenomicFeatures
## # Data source: UCSC
## # Genome: hg19
## # Organism: Homo sapiens
## # Taxonomy ID: 9606
## # UCSC Table: knownGene
## # Resource URL: http://genome.ucsc.edu/
## # Type of Gene ID: Entrez Gene ID
## # Full dataset: yes
## # miRBase build ID: GRCh37
## # transcript_nrow: 82960
## # exon_nrow: 289969
## # cds_nrow: 237533
## # Db created by: GenomicFeatures package from Bioconductor
## # Creation time: 2015-10-07 18:11:28 +0000 (Wed, 07 Oct 2015)
## # GenomicFeatures version at creation time: 1.21.30
## # RSQLite version at creation time: 1.0.0
## # DBSCHEMAVERSION: 1.1

Just by looking at the TxDb object, we can learn a lot about what data it contains including where the data came from, which build of the UCSC genome it was based on and the last time that the object was updated. One of the most common uses for a TxDb object is to extract various kinds of transcript data out of it. So for example you can extract all the transcripts out of the TxDb as a GRanges object like this:

txs <- transcripts(txdb)
txs

## GRanges object with 5506 ranges and 2 metadata columns:
##          seqnames            ranges strand |     tx_id     tx_name
##             <Rle>         <IRanges>  <Rle> | <integer> <character>
##      [1]     chr3     238279-451097      + |     13060  uc003bot.3
##      [2]     chr3     238279-451097      + |     13061  uc003bou.3
##      [3]     chr3     239326-290282      + |     13062  uc003bov.2
##      [4]     chr3     239326-440831      + |     13063  uc003bow.2
##      [5]     chr3     361366-451097      + |     13064  uc011asi.2
##      ...      ...               ...    ... .       ...         ...
##   [5502]    chr18 77732867-77748532      - |     65761  uc002lnr.3
##   [5503]    chr18 77732867-77748532      - |     65762  uc010drf.3
##   [5504]    chr18 77732867-77793915      - |     65763  uc010drg.3
##   [5505]    chr18 77915117-78005397      - |     65764  uc002lny.3
##   [5506]    chr18 77941005-78005397      - |     65765  uc010xfp.2
##   -------
##   seqinfo: 2 sequences from hg19 genome

Similarly, there are also extractors for exons(), cds(), genes() and promoters(). Which kind of feature you choose to extract just depends on what information you are after. These basic extractors are fine if you only want a flat representation of these data, but many of these features are inherently nested. So instead of extracting a flat GRanges object, you might choose instead to extract a GRangesList object that groups the transcripts by the genes that they are associated with like this:

txby <- transcriptsBy(txdb, by="gene")
txby

## GRangesList object of length 1612:
## $`1000`
## GRanges object with 2 ranges and 2 metadata columns:
##       seqnames            ranges strand |     tx_id     tx_name
##          <Rle>         <IRanges>  <Rle> | <integer> <character>
##   [1]    chr18 25530930-25616539      - |     65378  uc010xbn.1
##   [2]    chr18 25530930-25757445      - |     65379  uc002kwg.2
##   -------
##   seqinfo: 2 sequences from hg19 genome
## 
## $`100009676`
## GRanges object with 1 range and 2 metadata columns:
##       seqnames              ranges strand |     tx_id     tx_name
##          <Rle>           <IRanges>  <Rle> | <integer> <character>
##   [1]     chr3 101395274-101398057      + |     14200  uc003dvg.3
##   -------
##   seqinfo: 2 sequences from hg19 genome
## 
## $`100101467`
## GRanges object with 3 ranges and 2 metadata columns:
##       seqnames            ranges strand |     tx_id     tx_name
##          <Rle>         <IRanges>  <Rle> | <integer> <character>
##   [1]    chr18 32831023-32870196      - |     65418  uc002kyl.3
##   [2]    chr18 32831023-32870196      - |     65419  uc002kym.3
##   [3]    chr18 32843361-32870165      - |     65420  uc002kyn.1
##   -------
##   seqinfo: 2 sequences from hg19 genome
## 
## ...
## <1609 more elements>

Just as with the flat extractors, there is a whole family of extractors available depending on what you want to extract and how you want it grouped. They include transcriptsBy(), exonsBy(), cdsBy(), intronsByTranscript(), fiveUTRsByTranscript() and threeUTRsByTranscript().
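
The following sketch contrasts the two families of extractors. It assumes the TxDb.Hsapiens.UCSC.hg19.knownGene package is installed and uses the full object rather than a subset of chromosomes; the variable names and the upstream/downstream values are illustrative choices.

```r
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

# Flat extractors each return a single GRanges object
ex <- exons(txdb)
prom <- promoters(txdb, upstream = 2000, downstream = 200)

# Grouped extractors return a GRangesList; here exons are grouped by
# transcript and the list elements are named by transcript name
exby <- exonsBy(txdb, by = "tx", use.names = TRUE)
introns <- intronsByTranscript(txdb, use.names = TRUE)

# Number of exons in each of the first few transcripts
head(elementNROWS(exby))
```

A grouped result is usually the better choice when a later step needs to know which features belong together, for example when counting exons per transcript.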

When dealing with genomic data it is almost inevitable that you will run into problems with the way that different groups have adopted alternate ways of naming chromosomes. This is because almost every major repository has cooked up their own slightly different way of labeling these important features.

To cope with this, the Seqinfo object was invented. It is attached to TxDb objects as well as to the GRanges objects extracted from them. You can extract it using the seqinfo() method like this:

si <- seqinfo(txdb)
si

## Seqinfo object with 2 sequences from hg19 genome:
##   seqnames seqlengths isCircular genome
##   chr3      198022430         NA   hg19
##   chr18      78077248         NA   hg19

And since the seqinfo information is also attached to the GRanges objects produced by the TxDb extractors, you can also call seqinfo on the results of those methods like this:

txby <- transcriptsBy(txdb, by="gene")
si <- seqinfo(txby)
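
The fields of a Seqinfo object can also be read one at a time with accessor functions. The sketch below builds a toy Seqinfo by hand so that it runs without any annotation package; the name si_toy is made up for illustration, and the lengths are copied from the output shown earlier.

```r
library(GenomeInfoDb)

# A toy Seqinfo mirroring the two sequences shown earlier
si_toy <- Seqinfo(seqnames = c("chr3", "chr18"),
                  seqlengths = c(198022430L, 78077248L),
                  genome = "hg19")

seqnames(si_toy)     # sequence names
seqlengths(si_toy)   # named vector of sequence lengths
isCircular(si_toy)   # NA here because circularity was not supplied
genome(si_toy)       # genome build recorded for each sequence
```

The same accessors can be called directly on a TxDb, GRanges or GRangesList object.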

The Seqinfo object contains a lot of valuable data about which chromosome features are present, whether they are circular or linear, and how long each one is. It is also checked whenever you perform an operation like ‘findOverlaps’ to compute overlapping ranges. So it’s a valuable way to make sure that the chromosomes and genome are the same for your annotations as for the ranges that you are comparing them to. But sometimes your annotation object contains data that is comparable to your data object but simply uses a different naming style. For those cases, there are helpers that you can use to discover what the current naming style is for an object, and there is a setter method that lets you change it to something more appropriate. In the following example, we change the seqlevelsStyle from the ‘UCSC’ naming convention to the ‘NCBI’ one (and then back again).

head(seqlevels(txdb))

## [1] "chr3"  "chr18"

seqlevelsStyle(txdb)

## [1] "UCSC"

seqlevelsStyle(txdb) <- "NCBI"
head(seqlevels(txdb))

## [1] "3"  "18"

## then change it back
seqlevelsStyle(txdb) <- "UCSC"
head(seqlevels(txdb))

## [1] "chr3"  "chr18"
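
The setter is not limited to TxDb objects. As a sketch of the typical use case, the code below harmonizes a hand-made query range with the TxDb before counting overlaps. The query coordinates and the variable name query are hypothetical, and the full hg19 TxDb is assumed to be installed.

```r
library(GenomicRanges)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

# A hypothetical query range that uses NCBI-style names ("3", not "chr3")
query <- GRanges("3:238279-451097")

# Switch the query to the UCSC style used by the TxDb
seqlevelsStyle(query) <- "UCSC"
seqlevels(query)

# The two objects now share a naming style, so overlaps can be counted
countOverlaps(query, transcripts(txdb))
```

Changing the style of the smaller object, as done here, leaves the annotation object untouched for later steps.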

In addition to being able to change the naming style used for an object with seqinfo data, you can also toggle which of the chromosomes are ‘active’ so that the software will ignore certain chromosomes. By default, all of the chromosomes are set to be ‘active’.

head(isActiveSeq(txdb), n=30)

##  chr3 chr18 
##  TRUE  TRUE

But sometimes you might wish to ignore some of them. For example, let's suppose that you wanted to ignore the Y chromosome from our txdb. You could do that like so:

isActiveSeq(txdb)["chrY"] <- FALSE
head(isActiveSeq(txdb), n=26)

6.1 TxDb exercises

Exercise 5: Use the accessors for the TxDb.Hsapiens.UCSC.hg19.knownGene package to retrieve the gene id, transcript name and transcript chromosome for all the transcripts. Do this using both the select() method and also using the transcripts() method. What is the difference in the output?

Exercise 6: Load the TxDb.Athaliana.BioMart.plantsmart22 package. This package is not from UCSC and it is based on plantsmart. Now use select or one of the range based accessors to look at the gene ids from this TxDb object. How do they compare to what you saw in the TxDb.Hsapiens.UCSC.hg19.knownGene package?

7 Organism.dplyr src_organism Objects

So what happens if you have data from multiple different annotation objects? For example, what if you had gene SYMBOLS (found in an OrgDb object) and you wanted to easily match those up with known gene transcript names from a UCSC-based TxDb object? There is an ideal tool for this kind of problem: the src_organism object from the Organism.dplyr package. src_organism objects and their related methods query the OrgDb and TxDb resources for you and then merge the results back together in a way that lets you pretend that you only have one source for all your annotations.

library(Organism.dplyr)

src_organism objects can be created for organisms that have both an OrgDb and a TxDb. To see the organisms for which src_organism objects can be made, use the function supportedOrganisms():

supported <- supportedOrganisms()
print(supported, n=Inf)

## # A tibble: 21 × 3
##    organism                OrgDb         TxDb                                  
##    <chr>                   <chr>         <chr>                                 
##  1 Bos taurus              org.Bt.eg.db  TxDb.Btaurus.UCSC.bosTau8.refGene     
##  2 Caenorhabditis elegans  org.Ce.eg.db  TxDb.Celegans.UCSC.ce11.refGene       
##  3 Caenorhabditis elegans  org.Ce.eg.db  TxDb.Celegans.UCSC.ce6.ensGene        
##  4 Canis familiaris        org.Cf.eg.db  TxDb.Cfamiliaris.UCSC.canFam3.refGene 
##  5 Drosophila melanogaster org.Dm.eg.db  TxDb.Dmelanogaster.UCSC.dm3.ensGene   
##  6 Drosophila melanogaster org.Dm.eg.db  TxDb.Dmelanogaster.UCSC.dm6.ensGene   
##  7 Danio rerio             org.Dr.eg.db  TxDb.Drerio.UCSC.danRer10.refGene     
##  8 Gallus gallus           org.Gg.eg.db  TxDb.Ggallus.UCSC.galGal4.refGene     
##  9 Homo sapiens            org.Hs.eg.db  TxDb.Hsapiens.UCSC.hg18.knownGene     
## 10 Homo sapiens            org.Hs.eg.db  TxDb.Hsapiens.UCSC.hg19.knownGene     
## 11 Homo sapiens            org.Hs.eg.db  TxDb.Hsapiens.UCSC.hg38.knownGene     
## 12 Mus musculus            org.Mm.eg.db  TxDb.Mmusculus.UCSC.mm10.ensGene      
## 13 Mus musculus            org.Mm.eg.db  TxDb.Mmusculus.UCSC.mm10.knownGene    
## 14 Mus musculus            org.Mm.eg.db  TxDb.Mmusculus.UCSC.mm9.knownGene     
## 15 Macaca mulatta          org.Mmu.eg.db TxDb.Mmulatta.UCSC.rheMac3.refGene    
## 16 Macaca mulatta          org.Mmu.eg.db TxDb.Mmulatta.UCSC.rheMac8.refGene    
## 17 Pan troglodytes         org.Pt.eg.db  TxDb.Ptroglodytes.UCSC.panTro4.refGene
## 18 Rattus norvegicus       org.Rn.eg.db  TxDb.Rnorvegicus.UCSC.rn4.ensGene     
## 19 Rattus norvegicus       org.Rn.eg.db  TxDb.Rnorvegicus.UCSC.rn5.refGene     
## 20 Rattus norvegicus       org.Rn.eg.db  TxDb.Rnorvegicus.UCSC.rn6.refGene     
## 21 Sus scrofa              org.Ss.eg.db  TxDb.Sscrofa.UCSC.susScr3.refGene

Notice how there are multiple entries for a single organism (e.g. three for Homo sapiens). There is only one OrgDb per organism, but different TxDbs can be used. To specify a certain version of a TxDb to use, we can use the src_organism() function to create an src_organism object.

library(org.Hs.eg.db)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)

src <- src_organism("TxDb.Hsapiens.UCSC.hg38.knownGene")

## creating 'src_organism' database...

src

## src:  sqlite 3.40.0 [/tmp/RtmpulTysV/file25af3f5ceb0782]
## tbls: id, id_accession, id_go, id_go_all, id_omim_pm, id_protein,
##   id_transcript, ranges_cds, ranges_exon, ranges_gene, ranges_tx

We can also create one using the src_ucsc() function. This will create an src_organism object using the most recent TxDb version available:

src <- src_ucsc("Homo sapiens")

src

## src:  sqlite 3.40.0 [/tmp/RtmpulTysV/file25af3f5ceb0782]
## tbls: id, id_accession, id_go, id_go_all, id_omim_pm, id_protein,
##   id_transcript, ranges_cds, ranges_exon, ranges_gene, ranges_tx

The five methods that worked for all of the other Db objects that we have discussed (keytypes(), columns(), keys(), select(), and mapIds()) all work for src_organism objects. Here, we use keytypes() to show which keytypes can be passed to the keytype argument of select().

keytypes(src)

##  [1] "accnum"       "alias"        "cds_chrom"    "cds_end"      "cds_id"      
##  [6] "cds_name"     "cds_start"    "cds_strand"   "ensembl"      "ensemblprot" 
## [11] "ensembltrans" "entrez"       "enzyme"       "evidence"     "evidenceall" 
## [16] "exon_chrom"   "exon_end"     "exon_id"      "exon_name"    "exon_rank"   
## [21] "exon_start"   "exon_strand"  "gene_chrom"   "gene_end"     "gene_start"  
## [26] "gene_strand"  "genename"     "go"           "goall"        "ipi"         
## [31] "map"          "omim"         "ontology"     "ontologyall"  "pfam"        
## [36] "pmid"         "prosite"      "refseq"       "symbol"       "tx_chrom"    
## [41] "tx_end"       "tx_id"        "tx_name"      "tx_start"     "tx_strand"   
## [46] "tx_type"      "uniprot"

Use columns() to show which columns can be passed to the columns argument of select().

columns(src)

##  [1] "accnum"       "alias"        "cds_chrom"    "cds_end"      "cds_id"      
##  [6] "cds_name"     "cds_start"    "cds_strand"   "ensembl"      "ensemblprot" 
## [11] "ensembltrans" "entrez"       "enzyme"       "evidence"     "evidenceall" 
## [16] "exon_chrom"   "exon_end"     "exon_id"      "exon_name"    "exon_rank"   
## [21] "exon_start"   "exon_strand"  "gene_chrom"   "gene_end"     "gene_start"  
## [26] "gene_strand"  "genename"     "go"           "goall"        "ipi"         
## [31] "map"          "omim"         "ontology"     "ontologyall"  "pfam"        
## [36] "pmid"         "prosite"      "refseq"       "symbol"       "tx_chrom"    
## [41] "tx_end"       "tx_id"        "tx_name"      "tx_start"     "tx_strand"   
## [46] "tx_type"      "uniprot"

And that’s it. You can now use these objects in the same way that you use OrgDb or TxDb objects. It works the same as the base objects that it contains:

select(src, keys="4488", columns=c("symbol", "tx_name"), keytype="entrez")

## Joining, by = "entrez"

##    entrez symbol           tx_name
## 1    4488   MSX2 ENST00000239243.7
## 2    4488   MSX2 ENST00000507785.2
## 3    4488   MSX2 ENST00000239243.7
## 4    4488   MSX2 ENST00000507785.2
## 5    4488   MSX2 ENST00000239243.7
## 6    4488   MSX2 ENST00000507785.2
## 7    4488   MSX2 ENST00000239243.7
## 8    4488   MSX2 ENST00000507785.2
## 9    4488   MSX2 ENST00000239243.7
## 10   4488   MSX2 ENST00000507785.2
## 11   4488   MSX2 ENST00000239243.7
## 12   4488   MSX2 ENST00000507785.2
## 13   4488   MSX2 ENST00000239243.7
## 14   4488   MSX2 ENST00000507785.2
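
Because a src_organism object is backed by an SQLite database, its tables can also be queried with dplyr verbs. The sketch below is one way to look up the same gene without the repeated rows seen above. It assumes that the "id" table has columns named entrez and symbol; the variable name msx2 is illustrative.

```r
library(Organism.dplyr)
library(dplyr)

src <- src_organism("TxDb.Hsapiens.UCSC.hg38.knownGene")

# List the tables held in the underlying SQLite database
src_tbls(src)

# Query the "id" table lazily, then collect() the result into a tibble.
# Assumption: the "id" table has columns named entrez and symbol.
msx2 <- tbl(src, "id") %>%
    dplyr::filter(symbol == "MSX2") %>%
    dplyr::select(entrez, symbol) %>%
    distinct() %>%
    collect()
msx2
```

The dplyr:: prefixes are deliberate: several attached packages define their own select() and filter(), so naming the package avoids dispatching to the wrong one.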

Organism.dplyr also supports numerous genomic extractor functions that allow users to filter based on information contained in the OrgDb and TxDb objects. To see the filters supported by a src_organism object, use supportedFilters():

head(supportedFilters(src))

##            filter     field
## 1    AccnumFilter    accnum
## 2     AliasFilter     alias
## 3  CdsChromFilter cds_chrom
## 44   CdsEndFilter   cds_end
## 42    CdsIdFilter    cds_id
## 4   CdsNameFilter  cds_name

The range-based accessors such as those in GenomicFeatures will also work. There are also "_tbl" functions (e.g. transcripts_tbl()) that return tbl objects instead of GRanges objects. Complex filter statements can be given as input. Here we declare a GRangesFilter and use two accessors with different return types to query transcripts whose symbol either starts with “SNORD” and that fall within our GRangesFilter, or whose symbol is “ADA”:

gr <- GRangesFilter(GenomicRanges::GRanges("chr1:44000000-55000000"))
transcripts(src, filter=~(symbol %startsWith% "SNORD" & gr) | symbol == "ADA")

## GRanges object with 66 ranges and 3 metadata columns:
##        seqnames            ranges strand |     tx_id           tx_name
##           <Rle>         <IRanges>  <Rle> | <integer>       <character>
##    [1]     chr1 44775864-44775943      + |      3425 ENST00000581525.1
##    [2]     chr1 44776490-44776593      + |      3426 ENST00000364043.1
##    [3]     chr1 44777843-44777912      + |      3429 ENST00000365161.1
##    [4]     chr1 44778390-44778456      + |      3431 ENST00000384690.1
##    [5]     chr1 44778390-44778458      + |      3432 ENST00000625943.1
##    ...      ...               ...    ... .       ...               ...
##   [62]    chr20 44623752-44651678      - |    232685 ENST00000695997.1
##   [63]    chr20 44623972-44651718      - |    232686 ENST00000696009.1
##   [64]    chr20 44626323-44651661      - |    232687 ENST00000545776.5
##   [65]    chr20 44627547-44651720      - |    232688 ENST00000696010.1
##   [66]    chr20 44636071-44652233      - |    232689 ENST00000535573.1
##             symbol
##        <character>
##    [1]     SNORD55
##    [2]     SNORD46
##    [3]    SNORD38A
##    [4]    SNORD38B
##    [5]    SNORD38B
##    ...         ...
##   [62]         ADA
##   [63]         ADA
##   [64]         ADA
##   [65]         ADA
##   [66]         ADA
##   -------
##   seqinfo: 640 sequences (1 circular) from hg38 genome

transcripts_tbl(src, filter=~(symbol %startsWith% "SNORD" & gr) | symbol == "ADA")

## # A tibble: 66 × 7
##    tx_chrom tx_start   tx_end tx_strand  tx_id tx_name           symbol  
##    <chr>       <int>    <int> <chr>      <int> <chr>             <chr>   
##  1 chr1     44775864 44775943 +           3425 ENST00000581525.1 SNORD55 
##  2 chr1     44776490 44776593 +           3426 ENST00000364043.1 SNORD46 
##  3 chr1     44777843 44777912 +           3429 ENST00000365161.1 SNORD38A
##  4 chr1     44778390 44778456 +           3431 ENST00000384690.1 SNORD38B
##  5 chr1     44778390 44778458 +           3432 ENST00000625943.1 SNORD38B
##  6 chr20    44584896 44651702 -         232629 ENST00000696034.1 ADA     
##  7 chr20    44618605 44651745 -         232630 ENST00000537820.2 ADA     
##  8 chr20    44618618 44651699 -         232631 ENST00000696003.1 ADA     
##  9 chr20    44618625 44651699 -         232632 ENST00000696004.1 ADA     
## 10 chr20    44619521 44651678 -         232633 ENST00000695991.1 ADA     
## # … with 56 more rows

7.1 Organism.dplyr exercises

Exercise 7: Use the src_organism object to look up the gene symbol, transcript start and chromosome using select(). Then do the same thing using transcripts. You might expect that this call to transcripts will look the same as it did for the TxDb object, but (temporarily) it will not.

Exercise 8: Look at the results from calling the columns method on the src_organism object and compare that to what happens when you call columns on the org.Hs.eg.db object and on the TxDb.Hsapiens.UCSC.hg19.knownGene object.

Exercise 9: Use the src_organism object with the transcripts method to look up the entrez gene IDs for all gene symbols that contain the letter ‘X’.

8 BSgenome Objects

Another important annotation resource type is a BSgenome package[10]. There are many BSgenome packages in the repository for you to choose from. And you can learn which organisms are already supported by using the available.genomes() function.

head(available.genomes())

## [1] "BSgenome.Alyrata.JGI.v1"                
## [2] "BSgenome.Amellifera.BeeBase.assembly4"  
## [3] "BSgenome.Amellifera.NCBI.AmelHAv3.1"    
## [4] "BSgenome.Amellifera.UCSC.apiMel2"       
## [5] "BSgenome.Amellifera.UCSC.apiMel2.masked"
## [6] "BSgenome.Aofficinalis.NCBI.V1"

Unlike the other resources that we have discussed here, these packages are meant to contain sequence data for a specific genome build of an organism. You can load one of these packages in the usual way. And each of them normally has an alias for the primary object that is shorter than the full package name (as a convenience):

ls(2)

## character(0)

Hsapiens

## Human genome:
## # organism: Homo sapiens (Human)
## # genome: hg19
## # provider: UCSC
## # release date: June 2013
## # 298 sequences:
## #   chr1                  chr2                  chr3                 
## #   chr4                  chr5                  chr6                 
## #   chr7                  chr8                  chr9                 
## #   chr10                 chr11                 chr12                
## #   chr13                 chr14                 chr15                
## #   ...                   ...                   ...                  
## #   chr19_gl949749_alt    chr19_gl949750_alt    chr19_gl949751_alt   
## #   chr19_gl949752_alt    chr19_gl949753_alt    chr20_gl383577_alt   
## #   chr21_gl383578_alt    chr21_gl383579_alt    chr21_gl383580_alt   
## #   chr21_gl383581_alt    chr22_gl383582_alt    chr22_gl383583_alt   
## #   chr22_kb663609_alt                                               
## # (use 'seqnames()' to see all the sequence names, use the '$' or '[[' operator
## # to access a given sequence)

The getSeq method is a useful way of extracting data from these packages. This method takes several arguments, but the important ones are the first two. The first argument specifies the BSgenome object to use and the second argument (names) specifies what data you want back out. So for example, if you pass a character vector of seqnames for the object, you will get the sequences of those chromosomes back as a DNAStringSet object.

seqNms <- seqnames(Hsapiens)
head(seqNms)

## [1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6"

getSeq(Hsapiens, seqNms[1:2])

## DNAStringSet object of length 2:
##         width seq                                           names               
## [1] 249250621 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNN chr1
## [2] 243199373 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNN chr2

Whereas if you give a GRanges object as the second argument, you instead get a DNAStringSet that corresponds to those ranges. This can be a powerful way to learn what sequence is present in a particular range. For example, here we extract the sequence for the ranges of a specific gene of interest like this.

txby <- transcriptsBy(txdb, by="gene")
geneOfInterest <- txby[["4488"]]
res <- getSeq(Hsapiens, geneOfInterest)
res

Additionally, the Biostrings[11] package has many useful functions for finding a pattern in a string set etc. You may not have noticed when it happened, but the Biostrings package was loaded when you loaded the BSgenome object, so these functions will already be available for you to explore.
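
As a taste of what Biostrings offers, the sketch below runs a few of its functions on two short, made-up sequences that stand in for the output of getSeq(). The sequences, the motif and the variable name seqs are all invented for illustration.

```r
library(Biostrings)

# Two short made-up sequences standing in for getSeq() output
seqs <- DNAStringSet(c(a = "ACGTACGTTATAAGG", b = "GGGCCCTATAAT"))

# Count occurrences of a motif in each sequence
vcountPattern("TATAA", seqs)

# Fraction of G or C bases in each sequence
letterFrequency(seqs, letters = "GC", as.prob = TRUE)

# Reverse complement of every sequence
reverseComplement(seqs)
```

The same calls work unchanged on a DNAStringSet returned by getSeq(), since that is the class getSeq() produces for multiple ranges.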

8.1 BSgenome exercises

Exercise 10: Use what you have just learned to extract the sequence for the PTEN gene.

9 biomaRt

Another great annotation resource is the biomaRt package[5,6,7]. The biomaRt package exposes a huge family of different online annotation resources called marts. Each mart is an online web resource that follows a shared convention, which is what allows it to work with this package. Historically these marts were maintained by various projects around the world; however, the majority are now maintained as part of Ensembl and we’ll focus on that resource here. If you wish to access another BioMart instance, see the biomaRt vignette Using a BioMart other than Ensembl.

The first step in using biomaRt is always to load the package and then decide which “mart” you want to use. Once you have made your decision, you will then use the useEnsembl() method to create a mart object in your R session. Here we are looking at the marts available and then choosing to use one of the most popular marts: the Ensembl “genes” mart.

listEnsembl()

##         biomart                version
## 1         genes      Ensembl Genes 108
## 2 mouse_strains      Mouse strains 108
## 3          snps  Ensembl Variation 108
## 4    regulation Ensembl Regulation 108

ensembl <- useEnsembl(biomart = "genes")
ensembl

## Object of class 'Mart':
##   Using the ENSEMBL_MART_ENSEMBL BioMart database
##   No dataset selected.

Each ‘mart’ can contain datasets for multiple different things. In our example here the “genes” mart contains separate datasets for a large number of organisms. So the next step is that you need to decide on a dataset. Once you have chosen one, you will need to specify that dataset using the dataset argument when you call the useEnsembl() constructor method. Here we will point to the dataset for humans.

head(listDatasets(ensembl))

##                        dataset                           description
## 1 abrachyrhynchus_gene_ensembl Pink-footed goose genes (ASM259213v1)
## 2     acalliptera_gene_ensembl      Eastern happy genes (fAstCal1.2)
## 3   acarolinensis_gene_ensembl       Green anole genes (AnoCar2.0v2)
## 4    acchrysaetos_gene_ensembl       Golden eagle genes (bAquChr1.2)
## 5    acitrinellus_gene_ensembl        Midas cichlid genes (Midas_v5)
## 6    amelanoleuca_gene_ensembl       Giant panda genes (ASM200744v2)
##       version
## 1 ASM259213v1
## 2  fAstCal1.2
## 3 AnoCar2.0v2
## 4  bAquChr1.2
## 5    Midas_v5
## 6 ASM200744v2

ensembl <- useEnsembl(biomart="genes", dataset="hsapiens_gene_ensembl")
ensembl

## Object of class 'Mart':
##   Using the ENSEMBL_MART_ENSEMBL BioMart database
##   Using the hsapiens_gene_ensembl dataset

Next we need to think about attributes, values and filters. Let's start with attributes. You can get a listing of the different kinds of attributes from biomaRt by using the listAttributes method:

head(listAttributes(ensembl))

##                            name                  description         page
## 1               ensembl_gene_id               Gene stable ID feature_page
## 2       ensembl_gene_id_version       Gene stable ID version feature_page
## 3         ensembl_transcript_id         Transcript stable ID feature_page
## 4 ensembl_transcript_id_version Transcript stable ID version feature_page
## 5            ensembl_peptide_id            Protein stable ID feature_page
## 6    ensembl_peptide_id_version    Protein stable ID version feature_page

And you can see what the values for a particular attribute are by using the getBM method:

head(getBM(attributes="chromosome_name", mart=ensembl))

##   chromosome_name
## 1               1
## 2              10
## 3              11
## 4              12
## 5              13
## 6              14

Attributes are the things that you can have returned from biomaRt. They are analogous to what you get when you use the columns method with other objects.

In the biomaRt package, filters are things that can be used with values to restrict or choose what comes back. The ‘values’ here are treated as keys that you are passing in and which you would like to know more information about. In contrast, the filter represents the kind of key that you are searching for. So for example, you might choose a filter name of “chromosome_name” to go with specific value of “1”. Together these two argument values would request whatever attributes matched things on the 1st chromosome. Just as there is an accessor for attributes, there is also an accessor to list all available filters:

head(listFilters(ensembl))

##              name              description
## 1 chromosome_name Chromosome/scaffold name
## 2           start                    Start
## 3             end                      End
## 4      band_start               Band Start
## 5        band_end                 Band End
## 6    marker_start             Marker Start

So now you know about attributes, values and filters, you can call the getBM() method to put it all together and request specific data from the mart. So for example, the following requests gene symbols and NCBI Gene (formerly called ‘entrezgene’) IDs that are found on chromosome 1 of humans:

res <- getBM(attributes = c("hgnc_symbol", "entrezgene_id"),
             filters = "chromosome_name",
             values = "1", 
             mart = ensembl)
head(res)

##   hgnc_symbol entrezgene_id
## 1                     84771
## 2                    727856
## 3                 100287102
## 4                 100287596
## 5                 102725121
## 6     DDX11L1            NA
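
Several filters can be combined in a single request. The sketch below is one way to do that; it needs a network connection to Ensembl, and it assumes that the human dataset offers a filter named "biotype" and an attribute named "gene_biotype" (the search functions shown first are a way to confirm such names). The variable name res2 is illustrative.

```r
library(biomaRt)
ensembl <- useEnsembl(biomart = "genes", dataset = "hsapiens_gene_ensembl")

# Search for attribute and filter names that match a pattern
searchAttributes(mart = ensembl, pattern = "hgnc")
searchFilters(mart = ensembl, pattern = "biotype")

# Combine two filters; values are supplied as a list, one element per
# filter and in the same order as the filters
res2 <- getBM(attributes = c("hgnc_symbol", "entrezgene_id", "gene_biotype"),
              filters = c("chromosome_name", "biotype"),
              values = list("1", "protein_coding"),
              mart = ensembl)
head(res2)
```

Passing values as a list keeps each filter paired with its own set of values, which matters as soon as one filter takes more than one value.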

Of course you may have noticed that a lot of the arguments for getBM are very similar to what you do when working with OrgDb objects. So if it’s your preference you can also use the standard select(), columns(), keytypes() etc methods with mart objects.

head(columns(ensembl))

## [1] "3_utr_end"   "3_utr_end"   "3_utr_start" "3_utr_start" "3utr"       
## [6] "5_utr_end"
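
As a sketch of that alternative interface, the chromosome 1 query from the getBM() example can be written with select(). This assumes a network connection to Ensembl; the variable name res3 is illustrative.

```r
library(biomaRt)
library(AnnotationDbi)
ensembl <- useEnsembl(biomart = "genes", dataset = "hsapiens_gene_ensembl")

# keytype plays the role of the filter, keys the values,
# and columns the attributes
res3 <- select(ensembl,
               keys = "1",
               keytype = "chromosome_name",
               columns = c("hgnc_symbol", "entrezgene_id"))
head(res3)
```

Which form to use is a matter of taste; select() is convenient when the same code also has to work with OrgDb or TxDb objects.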

9.1 biomaRt exercises

Exercise 11: Pull down GO terms for entrez gene id “1” from human by using the ensembl “hsapiens_gene_ensembl” dataset.

Exercise 12: Now compare the GO terms you just pulled down to the same GO terms from the org.Hs.eg.db package (which you can now retrieve using select()). What differences do you notice? Why do you suspect that is?

10 Creating annotation objects

By now you are aware that Bioconductor has a lot of annotation resources. But it is still completely impossible to have every annotation resource pre-packaged for every conceivable use. Because of this, almost all annotation objects have special functions that can be called to create those objects (or the packages that load them) from generalized data resources or specific file types. Below is a table with a few of the more popular options.

If you want this	And you have this	Then you could call this to help
TxDb	tracks from UCSC	GenomicFeatures::makeTxDbPackageFromUCSC
TxDb	data from biomaRt	GenomicFeatures::makeTxDbPackageFromBiomart
TxDb	gff or gtf file	GenomicFeatures::makeTxDbFromGFF
OrgDb	custom data.frames	AnnotationForge::makeOrgPackage
OrgDb	valid Taxonomy ID	AnnotationForge::makeOrgPackageFromNCBI
ChipDb	org package & data.frame	AnnotationForge::makeChipPackage
BSgenome	fasta or twobit sequence files	BSgenome::forgeBSgenomeDataPkg

In most cases the output for resource creation functions will be an annotation package that you can install.

And there is unfortunately not enough space to demonstrate how to call each of these functions here. But to do so is actually pretty straightforward and most such functions will be well documented with their associated manual pages and vignettes[3,4,10,12]. As usual, you can see the help page for any function right inside of R.

help("makeTxDbPackageFromUCSC")

If you plan to make use of these kinds of functions then you should expect to consult the associated documentation first. These kinds of functions tend to have a lot of arguments and most of them also require that their input data meet some fairly specific criteria. Finally, you should know that even after you have succeeded at creating an annotation package, you will also have to use the install.packages() function (with repos=NULL) to install the package source directory that has just been created.
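
As one small, concrete case, the sketch below builds a TxDb object in memory from a GFF3 file. It assumes that the installed version of GenomicFeatures provides makeTxDbFromGFF() and ships the small example file referenced here; the variable names and the package path in the final comment are hypothetical.

```r
library(GenomicFeatures)

# Assumption: this small example GFF3 file ships with GenomicFeatures
gffFile <- system.file("extdata", "GFF3_files", "a.gff3",
                       package = "GenomicFeatures")

# Build a TxDb object in memory directly from the file
txdb_gff <- makeTxDbFromGFF(file = gffFile, format = "gff3")
txdb_gff

# The usual extractors work on the new object
head(transcripts(txdb_gff))

# For functions that write a package source directory instead, the
# result can then be installed from that directory, for example
# (the path below is hypothetical):
# install.packages("path/to/TxDb.MyOrganism", repos = NULL, type = "source")
```

Unlike the package-building functions in the table, makeTxDbFromGFF() returns an object rather than a package, so no installation step is needed before using it.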

11 Important considerations

The Bioconductor project represents a very large and active codebase from an active and engaged community. Because of this, you should expect that the software described in this walkthrough will change over time, often in dramatic ways. As an example, the getSeq function that is described in this walkthrough is expected to undergo a big overhaul in the coming months. When this happens the older function will be deprecated for a full release cycle (6 months) and then labeled as defunct for another release cycle before it is removed. This cycle is in place so that active users can be warned about what is happening and where they should look for the appropriate replacement functionality. But obviously, this system cannot warn end users if they have not been vigilant about updating their software to the latest version. So please take the time to always update your software to the latest version.

To stay abreast of new developments, users are encouraged to explore the Bioconductor website, which contains many current walkthroughs and vignettes. Also visit the support site, where you can ask questions and engage in discussions.

12 sessionInfo()

Package versions used in this tutorial:

sessionInfo()

## R version 4.2.2 (2022-10-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.5 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.16-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.16-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] annotation_1.22.0                        
##  [2] TxDb.Athaliana.BioMart.plantsmart22_3.0.1
##  [3] biomaRt_2.54.0                           
##  [4] BSgenome.Hsapiens.UCSC.hg19_1.4.3        
##  [5] BSgenome_1.66.2                          
##  [6] rtracklayer_1.58.0                       
##  [7] Homo.sapiens_1.3.1                       
##  [8] GO.db_3.16.0                             
##  [9] OrganismDbi_1.40.0                       
## [10] org.Mm.eg.db_3.16.0                      
## [11] org.Hs.eg.db_3.16.0                      
## [12] TxDb.Mmusculus.UCSC.mm10.ensGene_3.4.0   
## [13] TxDb.Hsapiens.UCSC.hg38.knownGene_3.16.0 
## [14] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2  
## [15] GenomicFeatures_1.50.4                   
## [16] AnnotationDbi_1.60.0                     
## [17] Organism.dplyr_1.26.0                    
## [18] AnnotationFilter_1.22.0                  
## [19] dplyr_1.1.0                              
## [20] AnnotationHub_3.6.0                      
## [21] BiocFileCache_2.6.0                      
## [22] dbplyr_2.3.0                             
## [23] VariantAnnotation_1.44.0                 
## [24] Rsamtools_2.14.0                         
## [25] Biostrings_2.66.0                        
## [26] XVector_0.38.0                           
## [27] SummarizedExperiment_1.28.0              
## [28] Biobase_2.58.0                           
## [29] GenomicRanges_1.50.2                     
## [30] GenomeInfoDb_1.34.8                      
## [31] IRanges_2.32.0                           
## [32] S4Vectors_0.36.1                         
## [33] MatrixGenerics_1.10.0                    
## [34] matrixStats_0.63.0                       
## [35] BiocGenerics_0.44.0                      
## [36] BiocStyle_2.26.0                         
## 
## loaded via a namespace (and not attached):
##  [1] rjson_0.2.21                  ellipsis_0.3.2               
##  [3] bit64_4.0.5                   interactiveDisplayBase_1.36.0
##  [5] fansi_1.0.4                   xml2_1.3.3                   
##  [7] codetools_0.2-18              cachem_1.0.6                 
##  [9] knitr_1.42                    jsonlite_1.8.4               
## [11] png_0.1-8                     graph_1.76.0                 
## [13] shiny_1.7.4                   BiocManager_1.30.19          
## [15] compiler_4.2.2                httr_1.4.4                   
## [17] assertthat_0.2.1              Matrix_1.5-3                 
## [19] fastmap_1.1.0                 lazyeval_0.2.2               
## [21] cli_3.6.0                     later_1.3.0                  
## [23] htmltools_0.5.4               prettyunits_1.1.1            
## [25] tools_4.2.2                   glue_1.6.2                   
## [27] GenomeInfoDbData_1.2.9        rappdirs_0.3.3               
## [29] Rcpp_1.0.10                   jquerylib_0.1.4              
## [31] vctrs_0.5.2                   xfun_0.36                    
## [33] stringr_1.5.0                 mime_0.12                    
## [35] lifecycle_1.0.3               restfulr_0.0.15              
## [37] XML_3.99-0.13                 zlibbioc_1.44.0              
## [39] hms_1.1.2                     promises_1.2.0.1             
## [41] parallel_4.2.2                RBGL_1.74.0                  
## [43] yaml_2.3.7                    curl_5.0.0                   
## [45] memoise_2.0.1                 sass_0.4.5                   
## [47] stringi_1.7.12                RSQLite_2.2.20               
## [49] BiocVersion_3.16.0            BiocIO_1.8.0                 
## [51] filelock_1.0.2                BiocParallel_1.32.5          
## [53] rlang_1.0.6                   pkgconfig_2.0.3              
## [55] bitops_1.0-7                  evaluate_0.20                
## [57] lattice_0.20-45               purrr_1.0.1                  
## [59] GenomicAlignments_1.34.0      bit_4.0.5                    
## [61] tidyselect_1.2.0              magrittr_2.0.3               
## [63] bookdown_0.32                 R6_2.5.1                     
## [65] generics_0.1.3                DelayedArray_0.24.0          
## [67] DBI_1.1.3                     pillar_1.8.1                 
## [69] withr_2.5.0                   KEGGREST_1.38.0              
## [71] RCurl_1.98-1.10               tibble_3.1.8                 
## [73] crayon_1.5.2                  utf8_1.2.3                   
## [75] rmarkdown_2.20                progress_1.2.2               
## [77] grid_4.2.2                    blob_1.2.3                   
## [79] digest_0.6.31                 xtable_1.8-4                 
## [81] httpuv_1.6.8                  bslib_0.4.2

13 Acknowledgments

Research reported in this chapter was supported by the National Human Genome Research Institute of the National Institutes of Health under Award Number U41HG004059 and by the National Cancer Institute of the National Institutes of Health under Award Number U24CA180996. We also want to thank the numerous institutions who produced and maintained the data that is used for generating and updating the annotation resources described here.

14 References

[1] Wolfgang Huber, Vincent J Carey, Robert Gentleman, Simon Anders, Marc Carlson, Benilton S Carvalho, Hector Corrada Bravo, Sean Davis, Laurent Gatto, Thomas Girke, Raphael Gottardo, Florian Hahne, Kasper D Hansen, Rafael A Irizarry, Michael Lawrence, Michael I Love, James MacDonald, Valerie Obenchain, Andrzej K Oleś, Hervé Pagès, Alejandro Reyes, Paul Shannon, Gordon K Smyth, Dan Tenenbaum, Levi Waldron & Martin Morgan (2015). Orchestrating high-throughput genomic analysis with Bioconductor. Nature Methods 12:115-121.
[2] Pages H, Carlson M, Falcon S and Li N. AnnotationDbi: Annotation Database Interface. R package version 1.30.0.
[3] M. Carlson, H. Pages, P. Aboyoun, S. Falcon, M. Morgan, D. Sarkar, M. Lawrence. GenomicFeatures: Tools for making and manipulating transcript centric annotations. Version 1.19.38.
[4] Lawrence M, Huber W, Pagès H, Aboyoun P, Carlson M, Gentleman R, Morgan M and Carey V (2013). Software for Computing and Annotating Genomic Ranges. PLoS Computational Biology, 9. http://dx.doi.org/10.1371/journal.pcbi.1003118, http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1003118
[5] Steffen Durinck, Wolfgang Huber. biomaRt: Interface to BioMart databases (e.g. Ensembl, COSMIC, Wormbase and Gramene). Version 2.23.5.
[6] Durinck S, Spellman P, Birney E and Huber W (2009). Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt. Nature Protocols, 4, pp. 1184-1191.
[7] Durinck S, Moreau Y, Kasprzyk A, Davis S, De Moor B, Brazma A and Huber W (2005). BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis. Bioinformatics, 21, pp. 3439-3440.
[8] Morgan M, Carlson M, Tenenbaum D and Arora S. AnnotationHub: Client to access AnnotationHub resources. R package version 2.0.1.
[9] Carlson M, Pages H, Morgan M and Obenchain V. OrganismDbi: Software to enable the smooth interfacing of different database packages. R package version 1.10.0.
[10] Pages H. BSgenome: Infrastructure for Biostrings-based genome data packages. R package version 1.36.0.
[11] Pages H, Aboyoun P, Gentleman R and DebRoy S. Biostrings: String objects representing biological sequences, and matching algorithms. R package version 2.36.0.
[12] Carlson M and Pages H. AnnotationForge: Code for Building Annotation Database Packages. R package version 1.10.0.

15 Answers for exercises

15.1 Exercise 1:

The first thing you need to do is look for records from UCSC:

ahs <- query(ah, "UCSC")

Then you can look for Genome values that match ‘hg19’ and a species that matches ‘Homo sapiens’.

ahs <- subset(ahs, ahs$genome=='hg19')
length(ahs)

## [1] 5908

ahs <- subset(ahs, ahs$species=='Homo sapiens')
length(ahs)

## [1] 5908

You might notice that the last two filtering steps are redundant: applying the first of them gives the same result as applying both. If this were not the case, we might suspect that there was a problem with the metadata.
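The same consistency check can be sketched on a small table in base R. The data frame below is made up purely for illustration and only mimics the genome and species columns; it is not AnnotationHub metadata.

```r
# Toy metadata table; the rows are invented for illustration only.
meta <- data.frame(
  id      = c("rec1", "rec2", "rec3", "rec4"),
  genome  = c("hg19", "hg19", "mm10", "hg19"),
  species = c("Homo sapiens", "Homo sapiens", "Mus musculus", "Homo sapiens"),
  stringsAsFactors = FALSE
)

# Apply the two filters one after the other, as in the exercise.
by_genome <- subset(meta, genome == "hg19")
by_both   <- subset(by_genome, species == "Homo sapiens")

# If the second filter removes nothing, the two filters are redundant.
filters_redundant <- nrow(by_genome) == nrow(by_both)

# Records that would point to inconsistent metadata: hg19 but not human.
suspect <- subset(by_genome, species != "Homo sapiens")
```

An empty suspect table corresponds to the situation in the exercise, where both lengths came out as 5908.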

15.2 Exercise 2:

This pulls down the ORegAnno annotations, which are described on the UCSC site as follows: “This track displays literature-curated regulatory regions, transcription factor binding sites, and regulatory polymorphisms from ORegAnno (Open Regulatory Annotation). For more detailed information on a particular regulatory element, follow the link to ORegAnno from the details page.”

ahs <- query(ah, 'oreganno')
ahs

## AnnotationHub with 9 records
## # snapshotDate(): 2022-10-31
## # $dataprovider: Pazar, UCSC
## # $species: Saccharomyces cerevisiae, Homo sapiens, NA
## # $rdataclass: GRanges
## # additional mcols(): taxonomyid, genome, description,
## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
## #   rdatapath, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["AH5087"]]' 
## 
##             title                                 
##   AH5087  | ORegAnno                              
##   AH5213  | ORegAnno                              
##   AH7053  | ORegAnno                              
##   AH7061  | ORegAnno                              
##   AH22286 | pazar_ORegAnno_20120522.csv           
##   AH22287 | pazar_ORegAnno_ENCODEprom_20120522.csv
##   AH22288 | pazar_ORegAnno_Erythroid_20120522.csv 
##   AH22289 | pazar_ORegAnno_STAT1_ChIP_20120522.csv
##   AH22290 | pazar_ORegAnno_STAT1_lit_20120522.csv

ahs[1]

## AnnotationHub with 1 record
## # snapshotDate(): 2022-10-31
## # names(): AH5087
## # $dataprovider: UCSC
## # $species: Homo sapiens
## # $rdataclass: GRanges
## # $rdatadateadded: 2013-03-26
## # $title: ORegAnno
## # $description: GRanges object from UCSC track 'ORegAnno'
## # $taxonomyid: 9606
## # $genome: hg19
## # $sourcetype: UCSC track
## # $sourceurl: rtracklayer://hgdownload.cse.ucsc.edu/goldenpath/hg19/database...
## # $sourcesize: NA
## # $tags: c("oreganno", "UCSC", "track", "Gene", "Transcript",
## #   "Annotation") 
## # retrieve record with 'object[["AH5087"]]'

oreg <- ahs[['AH5087']]

## loading from cache

oreg

## GRanges object with 23118 ranges and 2 metadata columns:
##                        seqnames        ranges strand |        name     score
##                           <Rle>     <IRanges>  <Rle> | <character> <numeric>
##       [1]                  chr1 873499-873849      + | OREG0012989         0
##       [2]                  chr1 886764-887214      + | OREG0012990         0
##       [3]                  chr1 886938-886958      + | OREG0007909         0
##       [4]                  chr1 919400-919950      + | OREG0012991         0
##       [5]                  chr1 919695-919715      + | OREG0007910         0
##       ...                   ...           ...    ... .         ...       ...
##   [23114]  chr7_gl000195_random         1-851      + | OREG0026736         0
##   [23115]  chr7_gl000195_random 103427-103447      + | OREG0012963         0
##   [23116]  chr7_gl000195_random 121139-121159      + | OREG0012964         0
##   [23117] chr17_gl000204_random   58370-58955      + | OREG0026769         0
##   [23118] chr17_gl000205_random 117492-118442      + | OREG0026772         0
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome

15.3 Exercise 3:

keys <- "MSX2"
columns <- c("ENTREZID", "CHR")
select(org.Hs.eg.db, keys, columns, keytype="SYMBOL")

## Warning in .deprecatedColsMessage(): Accessing gene location information via 'CHR','CHRLOC','CHRLOCEND' is
##   deprecated. Please use a range based accessor like genes(), or select()
##   with columns values like TXCHROM and TXSTART on a TxDb or OrganismDb
##   object instead.

## 'select()' returned 1:1 mapping between keys and columns

##   SYMBOL ENTREZID CHR
## 1   MSX2     4488   5
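The deprecation warning above suggests asking a TxDb or OrganismDb object for TXCHROM instead of CHR. A minimal sketch of that alternative follows, assuming the Homo.sapiens OrganismDb package is installed; the guard and the variable name res are additions for illustration, and no output is shown because it depends on the installed annotation versions.

```r
# Run the query only if the Homo.sapiens OrganismDb package is available.
res <- NULL
if (requireNamespace("Homo.sapiens", quietly = TRUE)) {
  suppressPackageStartupMessages(library(Homo.sapiens))
  # Same question as above, but with the range-based TXCHROM column.
  # AnnotationDbi::select is spelled out to avoid clashes with dplyr::select.
  res <- AnnotationDbi::select(Homo.sapiens,
                               keys = "MSX2",
                               columns = c("ENTREZID", "TXCHROM"),
                               keytype = "SYMBOL")
}
res
```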

15.4 Exercise 4:

## First get all the gene symbols
orgSymbols <- keys(org.Hs.eg.db, keytype="SYMBOL")
## and then use them to get all gene symbols matched to all Entrez gene IDs
egr <- select(org.Hs.eg.db, keys=orgSymbols, columns="ENTREZID", keytype="SYMBOL")

## 'select()' returned 1:many mapping between keys and columns

length(egr$ENTREZID)

## [1] 77614

length(unique(egr$ENTREZID))

## [1] 77614

## VS:
length(egr$SYMBOL)

## [1] 77614

length(unique(egr$SYMBOL))

## [1] 77510
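The gap between 77614 rows and 77510 unique symbols means that some symbols map to more than one Entrez gene ID. When trapping such symbols with duplicated(), keep in mind that it flags every repeat after the first occurrence, so a symbol with many IDs is returned several times. The sketch below shows this on a made-up table (the symbols and IDs are not real), along with unique() as one possible way to list each ambiguous symbol once.

```r
# Toy symbol-to-ID table; symbols and IDs are invented for illustration.
toy <- data.frame(
  SYMBOL   = c("A", "B", "B", "C", "C", "C"),
  ENTREZID = c("1", "2", "3", "4", "5", "6"),
  stringsAsFactors = FALSE
)

# duplicated() marks every repeat after the first occurrence, so "C",
# which has three IDs, appears twice in this vector.
repeats <- toy$SYMBOL[duplicated(toy$SYMBOL)]

# Deduplicating leaves each ambiguous symbol exactly once.
ambiguous <- unique(repeats)

# All IDs for the ambiguous symbols, one row per (symbol, ID) pair.
ambiguous_rows <- toy[toy$SYMBOL %in% ambiguous, ]
ambiguous_rows
```

Passing a key vector that still contains repeats to select() returns the matching rows once per repeat, which is why long listings with repeated blocks can appear; wrapping the keys in unique() should give a compact listing instead.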

## So let's trap the symbols that are redundant and look more closely...
redund <- egr$SYMBOL
badSymbols <- redund[duplicated(redund)]
select(org.Hs.eg.db, keys=badSymbols, columns="ENTREZID", keytype="SYMBOL")

## 'select()' returned many:many mapping between keys and columns

##         SYMBOL  ENTREZID
## 1          HBD      3045
## 2          HBD 100187828
## 3         RNR1      4549
## 4         RNR1      6052
## 5         RNR2      4550
## 6         RNR2      6053
## 7          TEC      7006
## 8          TEC 100124696
## 9        MEMO1      7795
## 10       MEMO1     51072
## 11        MMD2    221938
## 12        MMD2 100505381
## 13     DEL1P36 100240737
## 14     DEL1P36 123670537
## 15    DEL11P13 100528024
## 16    DEL11P13 107648861
## 17   TRNAV-CAC 107985614
## 18   TRNAV-CAC 107985615
## 19   TRNAE-UUC 107987368
## 20   TRNAE-UUC 124905580
## 21   TRNAE-UUC 124905583
## 22   TRNAE-UUC 124905584
## 23   TRNAE-UUC 124905586
## 24   TRNAE-UUC 124905908
## 25   TRNAE-UUC 107987368
## 26   TRNAE-UUC 124905580
## 27   TRNAE-UUC 124905583
## 28   TRNAE-UUC 124905584
## 29   TRNAE-UUC 124905586
## 30   TRNAE-UUC 124905908
## 31   TRNAE-UUC 107987368
## 32   TRNAE-UUC 124905580
## 33   TRNAE-UUC 124905583
## 34   TRNAE-UUC 124905584
## 35   TRNAE-UUC 124905586
## 36   TRNAE-UUC 124905908
## 37   TRNAE-UUC 107987368
## 38   TRNAE-UUC 124905580
## 39   TRNAE-UUC 124905583
## 40   TRNAE-UUC 124905584
## 41   TRNAE-UUC 124905586
## 42   TRNAE-UUC 124905908
## 43   TRNAE-UUC 107987368
## 44   TRNAE-UUC 124905580
## 45   TRNAE-UUC 124905583
## 46   TRNAE-UUC 124905584
## 47   TRNAE-UUC 124905586
## 48   TRNAE-UUC 124905908
## 49   TRNAA-AGC 124901561
## 50   TRNAA-AGC 124901562
## 51   TRNAA-AGC 124901563
## 52   TRNAA-AGC 124901564
## 53   TRNAA-AGC 124901565
## 54   TRNAA-AGC 124906586
## 55   TRNAA-AGC 124901561
## 56   TRNAA-AGC 124901562
## 57   TRNAA-AGC 124901563
## 58   TRNAA-AGC 124901564
## 59   TRNAA-AGC 124901565
## 60   TRNAA-AGC 124906586
## 61   TRNAA-AGC 124901561
## 62   TRNAA-AGC 124901562
## 63   TRNAA-AGC 124901563
## 64   TRNAA-AGC 124901564
## 65   TRNAA-AGC 124901565
## 66   TRNAA-AGC 124906586
## 67   TRNAA-AGC 124901561
## 68   TRNAA-AGC 124901562
## 69   TRNAA-AGC 124901563
## 70   TRNAA-AGC 124901564
## 71   TRNAA-AGC 124901565
## 72   TRNAA-AGC 124906586
## 73   TRNAA-AGC 124901561
## 74   TRNAA-AGC 124901562
## 75   TRNAA-AGC 124901563
## 76   TRNAA-AGC 124901564
## 77   TRNAA-AGC 124901565
## 78   TRNAA-AGC 124906586
## 79   TRNAG-CCC 124905578
## 80   TRNAG-CCC 124905581
## 81   TRNAG-CCC 124905588
## 82   TRNAG-CCC 124905578
## 83   TRNAG-CCC 124905581
## 84   TRNAG-CCC 124905588
## 85   TRNAN-GUU 124905579
## 86   TRNAN-GUU 124905582
## 87   TRNAN-GUU 124905585
## 88   TRNAN-GUU 124905587
## 89   TRNAN-GUU 124905579
## 90   TRNAN-GUU 124905582
## 91   TRNAN-GUU 124905585
## 92   TRNAN-GUU 124905587
## 93   TRNAN-GUU 124905579
## 94   TRNAN-GUU 124905582
## 95   TRNAN-GUU 124905585
## 96   TRNAN-GUU 124905587
## 97   TRNAG-GCC 124905847
## 98   TRNAG-GCC 124905849
## 99   TRNAG-GCC 124905851
## 100  TRNAG-GCC 124905853
## 101  TRNAG-GCC 124905907
## 102  TRNAG-GCC 124905910
## 103  TRNAG-GCC 124905912
## 104  TRNAG-GCC 124905914
## 105  TRNAG-GCC 124905916
## 106  TRNAG-GCC 124905918
## 107  TRNAG-GCC 124905921
## 108  TRNAG-GCC 124905923
## 109  TRNAG-GCC 124905925
## 110  TRNAG-GCC 124905927
## 111  TRNAG-GCC 124905929
## 112  TRNAG-GCC 124905931
## 113  TRNAG-GCC 124905933
## 114  TRNAG-GCC 124905847
## 115  TRNAG-GCC 124905849
## 116  TRNAG-GCC 124905851
## 117  TRNAG-GCC 124905853
## 118  TRNAG-GCC 124905907
## 119  TRNAG-GCC 124905910
## 120  TRNAG-GCC 124905912
## 121  TRNAG-GCC 124905914
## 122  TRNAG-GCC 124905916
## 123  TRNAG-GCC 124905918
## 124  TRNAG-GCC 124905921
## 125  TRNAG-GCC 124905923
## 126  TRNAG-GCC 124905925
## 127  TRNAG-GCC 124905927
## 128  TRNAG-GCC 124905929
## 129  TRNAG-GCC 124905931
## 130  TRNAG-GCC 124905933
## 131  TRNAG-GCC 124905847
## 132  TRNAG-GCC 124905849
## 133  TRNAG-GCC 124905851
## 134  TRNAG-GCC 124905853
## 135  TRNAG-GCC 124905907
## 136  TRNAG-GCC 124905910
## 137  TRNAG-GCC 124905912
## 138  TRNAG-GCC 124905914
## 139  TRNAG-GCC 124905916
## 140  TRNAG-GCC 124905918
## 141  TRNAG-GCC 124905921
## 142  TRNAG-GCC 124905923
## 143  TRNAG-GCC 124905925
## 144  TRNAG-GCC 124905927
## 145  TRNAG-GCC 124905929
## 146  TRNAG-GCC 124905931
## 147  TRNAG-GCC 124905933
## 148  TRNAG-GCC 124905847
## 149  TRNAG-GCC 124905849
## 150  TRNAG-GCC 124905851
## 151  TRNAG-GCC 124905853
## 152  TRNAG-GCC 124905907
## 153  TRNAG-GCC 124905910
## 154  TRNAG-GCC 124905912
## 155  TRNAG-GCC 124905914
## 156  TRNAG-GCC 124905916
## 157  TRNAG-GCC 124905918
## 158  TRNAG-GCC 124905921
## 159  TRNAG-GCC 124905923
## 160  TRNAG-GCC 124905925
## 161  TRNAG-GCC 124905927
## 162  TRNAG-GCC 124905929
## 163  TRNAG-GCC 124905931
## 164  TRNAG-GCC 124905933
## 165  TRNAG-GCC 124905847
## 166  TRNAG-GCC 124905849
## 167  TRNAG-GCC 124905851
## 168  TRNAG-GCC 124905853
## 169  TRNAG-GCC 124905907
## 170  TRNAG-GCC 124905910
## 171  TRNAG-GCC 124905912
## 172  TRNAG-GCC 124905914
## 173  TRNAG-GCC 124905916
## 174  TRNAG-GCC 124905918
## 175  TRNAG-GCC 124905921
## 176  TRNAG-GCC 124905923
## 177  TRNAG-GCC 124905925
## 178  TRNAG-GCC 124905927
## 179  TRNAG-GCC 124905929
## 180  TRNAG-GCC 124905931
## 181  TRNAG-GCC 124905933
## 182  TRNAG-GCC 124905847
## 183  TRNAG-GCC 124905849
## 184  TRNAG-GCC 124905851
## 185  TRNAG-GCC 124905853
## 186  TRNAG-GCC 124905907
## 187  TRNAG-GCC 124905910
## 188  TRNAG-GCC 124905912
## 189  TRNAG-GCC 124905914
## 190  TRNAG-GCC 124905916
## 191  TRNAG-GCC 124905918
## 192  TRNAG-GCC 124905921
## 193  TRNAG-GCC 124905923
## 194  TRNAG-GCC 124905925
## 195  TRNAG-GCC 124905927
## 196  TRNAG-GCC 124905929
## 197  TRNAG-GCC 124905931
## 198  TRNAG-GCC 124905933
## 199  TRNAG-GCC 124905847
## 200  TRNAG-GCC 124905849
## 201  TRNAG-GCC 124905851
## 202  TRNAG-GCC 124905853
## 203  TRNAG-GCC 124905907
## 204  TRNAG-GCC 124905910
## 205  TRNAG-GCC 124905912
## 206  TRNAG-GCC 124905914
## 207  TRNAG-GCC 124905916
## 208  TRNAG-GCC 124905918
## 209  TRNAG-GCC 124905921
## 210  TRNAG-GCC 124905923
## 211  TRNAG-GCC 124905925
## 212  TRNAG-GCC 124905927
## 213  TRNAG-GCC 124905929
## 214  TRNAG-GCC 124905931
## 215  TRNAG-GCC 124905933
## 216  TRNAG-GCC 124905847
## 217  TRNAG-GCC 124905849
## 218  TRNAG-GCC 124905851
## 219  TRNAG-GCC 124905853
## 220  TRNAG-GCC 124905907
## 221  TRNAG-GCC 124905910
## 222  TRNAG-GCC 124905912
## 223  TRNAG-GCC 124905914
## 224  TRNAG-GCC 124905916
## 225  TRNAG-GCC 124905918
## 226  TRNAG-GCC 124905921
## 227  TRNAG-GCC 124905923
## 228  TRNAG-GCC 124905925
## 229  TRNAG-GCC 124905927
## 230  TRNAG-GCC 124905929
## 231  TRNAG-GCC 124905931
## 232  TRNAG-GCC 124905933
## 233  TRNAG-GCC 124905847
## 234  TRNAG-GCC 124905849
## 235  TRNAG-GCC 124905851
## 236  TRNAG-GCC 124905853
## 237  TRNAG-GCC 124905907
## 238  TRNAG-GCC 124905910
## 239  TRNAG-GCC 124905912
## 240  TRNAG-GCC 124905914
## 241  TRNAG-GCC 124905916
## 242  TRNAG-GCC 124905918
## 243  TRNAG-GCC 124905921
## 244  TRNAG-GCC 124905923
## 245  TRNAG-GCC 124905925
## 246  TRNAG-GCC 124905927
## 247  TRNAG-GCC 124905929
## 248  TRNAG-GCC 124905931
## 249  TRNAG-GCC 124905933
## 250  TRNAG-GCC 124905847
## 251  TRNAG-GCC 124905849
## 252  TRNAG-GCC 124905851
## 253  TRNAG-GCC 124905853
## 254  TRNAG-GCC 124905907
## 255  TRNAG-GCC 124905910
## 256  TRNAG-GCC 124905912
## 257  TRNAG-GCC 124905914
## 258  TRNAG-GCC 124905916
## 259  TRNAG-GCC 124905918
## 260  TRNAG-GCC 124905921
## 261  TRNAG-GCC 124905923
## 262  TRNAG-GCC 124905925
## 263  TRNAG-GCC 124905927
## 264  TRNAG-GCC 124905929
## 265  TRNAG-GCC 124905931
## 266  TRNAG-GCC 124905933
## 267  TRNAG-GCC 124905847
## 268  TRNAG-GCC 124905849
## 269  TRNAG-GCC 124905851
## 270  TRNAG-GCC 124905853
## 271  TRNAG-GCC 124905907
## 272  TRNAG-GCC 124905910
## 273  TRNAG-GCC 124905912
## 274  TRNAG-GCC 124905914
## 275  TRNAG-GCC 124905916
## 276  TRNAG-GCC 124905918
## 277  TRNAG-GCC 124905921
## 278  TRNAG-GCC 124905923
## 279  TRNAG-GCC 124905925
## 280  TRNAG-GCC 124905927
## 281  TRNAG-GCC 124905929
## 282  TRNAG-GCC 124905931
## 283  TRNAG-GCC 124905933
## 284  TRNAG-GCC 124905847
## 285  TRNAG-GCC 124905849
## 286  TRNAG-GCC 124905851
## 287  TRNAG-GCC 124905853
## 288  TRNAG-GCC 124905907
## 289  TRNAG-GCC 124905910
## 290  TRNAG-GCC 124905912
## 291  TRNAG-GCC 124905914
## 292  TRNAG-GCC 124905916
## 293  TRNAG-GCC 124905918
## 294  TRNAG-GCC 124905921
## 295  TRNAG-GCC 124905923
## 296  TRNAG-GCC 124905925
## 297  TRNAG-GCC 124905927
## 298  TRNAG-GCC 124905929
## 299  TRNAG-GCC 124905931
## 300  TRNAG-GCC 124905933
## 301  TRNAG-GCC 124905847
## 302  TRNAG-GCC 124905849
## 303  TRNAG-GCC 124905851
## 304  TRNAG-GCC 124905853
## 305  TRNAG-GCC 124905907
## 306  TRNAG-GCC 124905910
## 307  TRNAG-GCC 124905912
## 308  TRNAG-GCC 124905914
## 309  TRNAG-GCC 124905916
## 310  TRNAG-GCC 124905918
## 311  TRNAG-GCC 124905921
## 312  TRNAG-GCC 124905923
## 313  TRNAG-GCC 124905925
## 314  TRNAG-GCC 124905927
## 315  TRNAG-GCC 124905929
## 316  TRNAG-GCC 124905931
## 317  TRNAG-GCC 124905933
## 318  TRNAG-GCC 124905847
## 319  TRNAG-GCC 124905849
## 320  TRNAG-GCC 124905851
## 321  TRNAG-GCC 124905853
## 322  TRNAG-GCC 124905907
## 323  TRNAG-GCC 124905910
## 324  TRNAG-GCC 124905912
## 325  TRNAG-GCC 124905914
## 326  TRNAG-GCC 124905916
## 327  TRNAG-GCC 124905918
## 328  TRNAG-GCC 124905921
## 329  TRNAG-GCC 124905923
## 330  TRNAG-GCC 124905925
## 331  TRNAG-GCC 124905927
## 332  TRNAG-GCC 124905929
## 333  TRNAG-GCC 124905931
## 334  TRNAG-GCC 124905933
## 335  TRNAG-GCC 124905847
## 336  TRNAG-GCC 124905849
## 337  TRNAG-GCC 124905851
## 338  TRNAG-GCC 124905853
## 339  TRNAG-GCC 124905907
## 340  TRNAG-GCC 124905910
## 341  TRNAG-GCC 124905912
## 342  TRNAG-GCC 124905914
## 343  TRNAG-GCC 124905916
## 344  TRNAG-GCC 124905918
## 345  TRNAG-GCC 124905921
## 346  TRNAG-GCC 124905923
## 347  TRNAG-GCC 124905925
## 348  TRNAG-GCC 124905927
## 349  TRNAG-GCC 124905929
## 350  TRNAG-GCC 124905931
## 351  TRNAG-GCC 124905933
## 352  TRNAG-GCC 124905847
## 353  TRNAG-GCC 124905849
## 354  TRNAG-GCC 124905851
## 355  TRNAG-GCC 124905853
## 356  TRNAG-GCC 124905907
## 357  TRNAG-GCC 124905910
## 358  TRNAG-GCC 124905912
## 359  TRNAG-GCC 124905914
## 360  TRNAG-GCC 124905916
## 361  TRNAG-GCC 124905918
## 362  TRNAG-GCC 124905921
## 363  TRNAG-GCC 124905923
## 364  TRNAG-GCC 124905925
## 365  TRNAG-GCC 124905927
## 366  TRNAG-GCC 124905929
## 367  TRNAG-GCC 124905931
## 368  TRNAG-GCC 124905933
## 369  TRNAL-CAG 124905848
## 370  TRNAL-CAG 124905850
## 371  TRNAL-CAG 124905852
## 372  TRNAL-CAG 124905906
## 373  TRNAL-CAG 124905909
## 374  TRNAL-CAG 124905911
## 375  TRNAL-CAG 124905913
## 376  TRNAL-CAG 124905915
## 377  TRNAL-CAG 124905917
## 378  TRNAL-CAG 124905920
## 379  TRNAL-CAG 124905922
## 380  TRNAL-CAG 124905924
## 381  TRNAL-CAG 124905926
## 382  TRNAL-CAG 124905928
## 383  TRNAL-CAG 124905930
## 384  TRNAL-CAG 124905932
## 385  TRNAL-CAG 124905934
## 386  TRNAL-CAG 124905848
## 387  TRNAL-CAG 124905850
## 388  TRNAL-CAG 124905852
## 389  TRNAL-CAG 124905906
## 390  TRNAL-CAG 124905909
## 391  TRNAL-CAG 124905911
## 392  TRNAL-CAG 124905913
## 393  TRNAL-CAG 124905915
## 394  TRNAL-CAG 124905917
## 395  TRNAL-CAG 124905920
## 396  TRNAL-CAG 124905922
## 397  TRNAL-CAG 124905924
## 398  TRNAL-CAG 124905926
## 399  TRNAL-CAG 124905928
## 400  TRNAL-CAG 124905930
## 401  TRNAL-CAG 124905932
## 402  TRNAL-CAG 124905934
## 403  TRNAL-CAG 124905848
## 404  TRNAL-CAG 124905850
## 405  TRNAL-CAG 124905852
## 406  TRNAL-CAG 124905906
## 407  TRNAL-CAG 124905909
## 408  TRNAL-CAG 124905911
## 409  TRNAL-CAG 124905913
## 410  TRNAL-CAG 124905915
## 411  TRNAL-CAG 124905917
## 412  TRNAL-CAG 124905920
## 413  TRNAL-CAG 124905922
## 414  TRNAL-CAG 124905924
## 415  TRNAL-CAG 124905926
## 416  TRNAL-CAG 124905928
## 417  TRNAL-CAG 124905930
## 418  TRNAL-CAG 124905932
## 419  TRNAL-CAG 124905934
## 420  TRNAL-CAG 124905848
## 421  TRNAL-CAG 124905850
## 422  TRNAL-CAG 124905852
## 423  TRNAL-CAG 124905906
## 424  TRNAL-CAG 124905909
## 425  TRNAL-CAG 124905911
## 426  TRNAL-CAG 124905913
## 427  TRNAL-CAG 124905915
## 428  TRNAL-CAG 124905917
## 429  TRNAL-CAG 124905920
## 430  TRNAL-CAG 124905922
## 431  TRNAL-CAG 124905924
## 432  TRNAL-CAG 124905926
## 433  TRNAL-CAG 124905928
## 434  TRNAL-CAG 124905930
## 435  TRNAL-CAG 124905932
## 436  TRNAL-CAG 124905934
## 437  TRNAL-CAG 124905848
## 438  TRNAL-CAG 124905850
## 439  TRNAL-CAG 124905852
## 440  TRNAL-CAG 124905906
## 441  TRNAL-CAG 124905909
## 442  TRNAL-CAG 124905911
## 443  TRNAL-CAG 124905913
## 444  TRNAL-CAG 124905915
## 445  TRNAL-CAG 124905917
## 446  TRNAL-CAG 124905920
## 447  TRNAL-CAG 124905922
## 448  TRNAL-CAG 124905924
## 449  TRNAL-CAG 124905926
## 450  TRNAL-CAG 124905928
## 451  TRNAL-CAG 124905930
## 452  TRNAL-CAG 124905932
## 453  TRNAL-CAG 124905934
## 454  TRNAL-CAG 124905848
## 455  TRNAL-CAG 124905850
## 456  TRNAL-CAG 124905852
## 457  TRNAL-CAG 124905906
## 458  TRNAL-CAG 124905909
## 459  TRNAL-CAG 124905911
## 460  TRNAL-CAG 124905913
## 461  TRNAL-CAG 124905915
## 462  TRNAL-CAG 124905917
## 463  TRNAL-CAG 124905920
## 464  TRNAL-CAG 124905922
## 465  TRNAL-CAG 124905924
## 466  TRNAL-CAG 124905926
## 467  TRNAL-CAG 124905928
## 468  TRNAL-CAG 124905930
## 469  TRNAL-CAG 124905932
## 470  TRNAL-CAG 124905934
## 471  TRNAL-CAG 124905848
## 472  TRNAL-CAG 124905850
## 473  TRNAL-CAG 124905852
## 474  TRNAL-CAG 124905906
## 475  TRNAL-CAG 124905909
## 476  TRNAL-CAG 124905911
## 477  TRNAL-CAG 124905913
## 478  TRNAL-CAG 124905915
## 479  TRNAL-CAG 124905917
## 480  TRNAL-CAG 124905920
## 481  TRNAL-CAG 124905922
## 482  TRNAL-CAG 124905924
## 483  TRNAL-CAG 124905926
## 484  TRNAL-CAG 124905928
## 485  TRNAL-CAG 124905930
## 486  TRNAL-CAG 124905932
## 487  TRNAL-CAG 124905934
## 488  TRNAL-CAG 124905848
## 489  TRNAL-CAG 124905850
## 490  TRNAL-CAG 124905852
## 491  TRNAL-CAG 124905906
## 492  TRNAL-CAG 124905909
## 493  TRNAL-CAG 124905911
## 494  TRNAL-CAG 124905913
## 495  TRNAL-CAG 124905915
## 496  TRNAL-CAG 124905917
## 497  TRNAL-CAG 124905920
## 498  TRNAL-CAG 124905922
## 499  TRNAL-CAG 124905924
## 500  TRNAL-CAG 124905926
## 501  TRNAL-CAG 124905928
## 502  TRNAL-CAG 124905930
## 503  TRNAL-CAG 124905932
## 504  TRNAL-CAG 124905934
## 505  TRNAL-CAG 124905848
## 506  TRNAL-CAG 124905850
## 507  TRNAL-CAG 124905852
## 508  TRNAL-CAG 124905906
## 509  TRNAL-CAG 124905909
## 510  TRNAL-CAG 124905911
## 511  TRNAL-CAG 124905913
## 512  TRNAL-CAG 124905915
## 513  TRNAL-CAG 124905917
## 514  TRNAL-CAG 124905920
## 515  TRNAL-CAG 124905922
## 516  TRNAL-CAG 124905924
## 517  TRNAL-CAG 124905926
## 518  TRNAL-CAG 124905928
## 519  TRNAL-CAG 124905930
## 520  TRNAL-CAG 124905932
## 521  TRNAL-CAG 124905934
## 522  TRNAL-CAG 124905848
## 523  TRNAL-CAG 124905850
## 524  TRNAL-CAG 124905852
## 525  TRNAL-CAG 124905906
## 526  TRNAL-CAG 124905909
## 527  TRNAL-CAG 124905911
## 528  TRNAL-CAG 124905913
## 529  TRNAL-CAG 124905915
## 530  TRNAL-CAG 124905917
## 531  TRNAL-CAG 124905920
## 532  TRNAL-CAG 124905922
## 533  TRNAL-CAG 124905924
## 534  TRNAL-CAG 124905926
## 535  TRNAL-CAG 124905928
## 536  TRNAL-CAG 124905930
## 537  TRNAL-CAG 124905932
## 538  TRNAL-CAG 124905934
## 539  TRNAL-CAG 124905848
## 540  TRNAL-CAG 124905850
## 541  TRNAL-CAG 124905852
## 542  TRNAL-CAG 124905906
## 543  TRNAL-CAG 124905909
## 544  TRNAL-CAG 124905911
## 545  TRNAL-CAG 124905913
## 546  TRNAL-CAG 124905915
## 547  TRNAL-CAG 124905917
## 548  TRNAL-CAG 124905920
## 549  TRNAL-CAG 124905922
## 550  TRNAL-CAG 124905924
## 551  TRNAL-CAG 124905926
## 552  TRNAL-CAG 124905928
## 553  TRNAL-CAG 124905930
## 554  TRNAL-CAG 124905932
## 555  TRNAL-CAG 124905934
## 556  TRNAL-CAG 124905848
## 557  TRNAL-CAG 124905850
## 558  TRNAL-CAG 124905852
## 559  TRNAL-CAG 124905906
## 560  TRNAL-CAG 124905909
## 561  TRNAL-CAG 124905911
## 562  TRNAL-CAG 124905913
## 563  TRNAL-CAG 124905915
## 564  TRNAL-CAG 124905917
## 565  TRNAL-CAG 124905920
## 566  TRNAL-CAG 124905922
## 567  TRNAL-CAG 124905924
## 568  TRNAL-CAG 124905926
## 569  TRNAL-CAG 124905928
## 570  TRNAL-CAG 124905930
## 571  TRNAL-CAG 124905932
## 572  TRNAL-CAG 124905934
## 573  TRNAL-CAG 124905848
## 574  TRNAL-CAG 124905850
## 575  TRNAL-CAG 124905852
## 576  TRNAL-CAG 124905906
## 577  TRNAL-CAG 124905909
## 578  TRNAL-CAG 124905911
## 579  TRNAL-CAG 124905913
## 580  TRNAL-CAG 124905915
## 581  TRNAL-CAG 124905917
## 582  TRNAL-CAG 124905920
## 583  TRNAL-CAG 124905922
## 584  TRNAL-CAG 124905924
## 585  TRNAL-CAG 124905926
## 586  TRNAL-CAG 124905928
## 587  TRNAL-CAG 124905930
## 588  TRNAL-CAG 124905932
## 589  TRNAL-CAG 124905934
## 590  TRNAL-CAG 124905848
## 591  TRNAL-CAG 124905850
## 592  TRNAL-CAG 124905852
## 593  TRNAL-CAG 124905906
## 594  TRNAL-CAG 124905909
## 595  TRNAL-CAG 124905911
## 596  TRNAL-CAG 124905913
## 597  TRNAL-CAG 124905915
## 598  TRNAL-CAG 124905917
## 599  TRNAL-CAG 124905920
## 600  TRNAL-CAG 124905922
## 601  TRNAL-CAG 124905924
## 602  TRNAL-CAG 124905926
## 603  TRNAL-CAG 124905928
## 604  TRNAL-CAG 124905930
## 605  TRNAL-CAG 124905932
## 606  TRNAL-CAG 124905934
## 607  TRNAL-CAG 124905848
## 608  TRNAL-CAG 124905850
## 609  TRNAL-CAG 124905852
## 610  TRNAL-CAG 124905906
## 611  TRNAL-CAG 124905909
## 612  TRNAL-CAG 124905911
## 613  TRNAL-CAG 124905913
## 614  TRNAL-CAG 124905915
## 615  TRNAL-CAG 124905917
## 616  TRNAL-CAG 124905920
## 617  TRNAL-CAG 124905922
## 618  TRNAL-CAG 124905924
## 619  TRNAL-CAG 124905926
## 620  TRNAL-CAG 124905928
## 621  TRNAL-CAG 124905930
## 622  TRNAL-CAG 124905932
## 623  TRNAL-CAG 124905934
## 624  TRNAL-CAG 124905848
## 625  TRNAL-CAG 124905850
## 626  TRNAL-CAG 124905852
## 627  TRNAL-CAG 124905906
## 628  TRNAL-CAG 124905909
## 629  TRNAL-CAG 124905911
## 630  TRNAL-CAG 124905913
## 631  TRNAL-CAG 124905915
## 632  TRNAL-CAG 124905917
## 633  TRNAL-CAG 124905920
## 634  TRNAL-CAG 124905922
## 635  TRNAL-CAG 124905924
## 636  TRNAL-CAG 124905926
## 637  TRNAL-CAG 124905928
## 638  TRNAL-CAG 124905930
## 639  TRNAL-CAG 124905932
## 640  TRNAL-CAG 124905934
## 641  TRNAD-GUC 124905854
## 642  TRNAD-GUC 124905857
## 643  TRNAD-GUC 124905860
## 644  TRNAD-GUC 124905863
## 645  TRNAD-GUC 124905866
## 646  TRNAD-GUC 124905869
## 647  TRNAD-GUC 124905872
## 648  TRNAD-GUC 124905875
## 649  TRNAD-GUC 124905878
## 650  TRNAD-GUC 124905881
## 651  TRNAD-GUC 124905884
## 652  TRNAD-GUC 124905887
## 653  TRNAD-GUC 124905890
## 654  TRNAD-GUC 124905893
## 655  TRNAD-GUC 124905896
## 656  TRNAD-GUC 124905899
## 657  TRNAD-GUC 124905902
## 658  TRNAD-GUC 124905854
## 659  TRNAD-GUC 124905857
## 660  TRNAD-GUC 124905860
## 661  TRNAD-GUC 124905863
## 662  TRNAD-GUC 124905866
## 663  TRNAD-GUC 124905869
## 664  TRNAD-GUC 124905872
## 665  TRNAD-GUC 124905875
## 666  TRNAD-GUC 124905878
## 667  TRNAD-GUC 124905881
## 668  TRNAD-GUC 124905884
## 669  TRNAD-GUC 124905887
## 670  TRNAD-GUC 124905890
## 671  TRNAD-GUC 124905893
## 672  TRNAD-GUC 124905896
## 673  TRNAD-GUC 124905899
## 674  TRNAD-GUC 124905902
## 675  TRNAD-GUC 124905854
## 676  TRNAD-GUC 124905857
## 677  TRNAD-GUC 124905860
## 678  TRNAD-GUC 124905863
## 679  TRNAD-GUC 124905866
## 680  TRNAD-GUC 124905869
## 681  TRNAD-GUC 124905872
## 682  TRNAD-GUC 124905875
## 683  TRNAD-GUC 124905878
## 684  TRNAD-GUC 124905881
## 685  TRNAD-GUC 124905884
## 686  TRNAD-GUC 124905887
## 687  TRNAD-GUC 124905890
## 688  TRNAD-GUC 124905893
## 689  TRNAD-GUC 124905896
## 690  TRNAD-GUC 124905899
## 691  TRNAD-GUC 124905902
## 692  TRNAD-GUC 124905854
## 693  TRNAD-GUC 124905857
## 694  TRNAD-GUC 124905860
## 695  TRNAD-GUC 124905863
## 696  TRNAD-GUC 124905866
## 697  TRNAD-GUC 124905869
## 698  TRNAD-GUC 124905872
## 699  TRNAD-GUC 124905875
## 700  TRNAD-GUC 124905878
## 701  TRNAD-GUC 124905881
## 702  TRNAD-GUC 124905884
## 703  TRNAD-GUC 124905887
## 704  TRNAD-GUC 124905890
## 705  TRNAD-GUC 124905893
## 706  TRNAD-GUC 124905896
## 707  TRNAD-GUC 124905899
## 708  TRNAD-GUC 124905902
## 709  TRNAD-GUC 124905854
## 710  TRNAD-GUC 124905857
## 711  TRNAD-GUC 124905860
## 712  TRNAD-GUC 124905863
## 713  TRNAD-GUC 124905866
## 714  TRNAD-GUC 124905869
## 715  TRNAD-GUC 124905872
## 716  TRNAD-GUC 124905875
## 717  TRNAD-GUC 124905878
## 718  TRNAD-GUC 124905881
## 719  TRNAD-GUC 124905884
## 720  TRNAD-GUC 124905887
## 721  TRNAD-GUC 124905890
## 722  TRNAD-GUC 124905893
## 723  TRNAD-GUC 124905896
## 724  TRNAD-GUC 124905899
## 725  TRNAD-GUC 124905902
## 726  TRNAD-GUC 124905854
## 727  TRNAD-GUC 124905857
## 728  TRNAD-GUC 124905860
## 729  TRNAD-GUC 124905863
## 730  TRNAD-GUC 124905866
## 731  TRNAD-GUC 124905869
## 732  TRNAD-GUC 124905872
## 733  TRNAD-GUC 124905875
## 734  TRNAD-GUC 124905878
## 735  TRNAD-GUC 124905881
## 736  TRNAD-GUC 124905884
## 737  TRNAD-GUC 124905887
## 738  TRNAD-GUC 124905890
## 739  TRNAD-GUC 124905893
## 740  TRNAD-GUC 124905896
## 741  TRNAD-GUC 124905899
## 742  TRNAD-GUC 124905902
## 743  TRNAD-GUC 124905854
## 744  TRNAD-GUC 124905857
## 745  TRNAD-GUC 124905860
## 746  TRNAD-GUC 124905863
## 747  TRNAD-GUC 124905866
## 748  TRNAD-GUC 124905869
## 749  TRNAD-GUC 124905872
## 750  TRNAD-GUC 124905875
## 751  TRNAD-GUC 124905878
## 752  TRNAD-GUC 124905881
## 753  TRNAD-GUC 124905884
## 754  TRNAD-GUC 124905887
## 755  TRNAD-GUC 124905890
## 756  TRNAD-GUC 124905893
## 757  TRNAD-GUC 124905896
## 758  TRNAD-GUC 124905899
## 759  TRNAD-GUC 124905902
## 760  TRNAD-GUC 124905854
## 761  TRNAD-GUC 124905857
## 762  TRNAD-GUC 124905860
## 763  TRNAD-GUC 124905863
## 764  TRNAD-GUC 124905866
## 765  TRNAD-GUC 124905869
## 766  TRNAD-GUC 124905872
## 767  TRNAD-GUC 124905875
## 768  TRNAD-GUC 124905878
## 769  TRNAD-GUC 124905881
## 770  TRNAD-GUC 124905884
## 771  TRNAD-GUC 124905887
## 772  TRNAD-GUC 124905890
## 773  TRNAD-GUC 124905893
## 774  TRNAD-GUC 124905896
## 775  TRNAD-GUC 124905899
## 776  TRNAD-GUC 124905902
## 777  TRNAD-GUC 124905854
## 778  TRNAD-GUC 124905857
## 779  TRNAD-GUC 124905860
## 780  TRNAD-GUC 124905863
## 781  TRNAD-GUC 124905866
## 782  TRNAD-GUC 124905869
## 783  TRNAD-GUC 124905872
## 784  TRNAD-GUC 124905875
## 785  TRNAD-GUC 124905878
## 786  TRNAD-GUC 124905881
## 787  TRNAD-GUC 124905884
## 788  TRNAD-GUC 124905887
## 789  TRNAD-GUC 124905890
## 790  TRNAD-GUC 124905893
## 791  TRNAD-GUC 124905896
## 792  TRNAD-GUC 124905899
## 793  TRNAD-GUC 124905902
## 794  TRNAD-GUC 124905854
## 795  TRNAD-GUC 124905857
## 796  TRNAD-GUC 124905860
## 797  TRNAD-GUC 124905863
## 798  TRNAD-GUC 124905866
## 799  TRNAD-GUC 124905869
## 800  TRNAD-GUC 124905872
## 801  TRNAD-GUC 124905875
## 802  TRNAD-GUC 124905878
## 803  TRNAD-GUC 124905881
## 804  TRNAD-GUC 124905884
## 805  TRNAD-GUC 124905887
## 806  TRNAD-GUC 124905890
## 807  TRNAD-GUC 124905893
## 808  TRNAD-GUC 124905896
## 809  TRNAD-GUC 124905899
## 810  TRNAD-GUC 124905902
## 811  TRNAD-GUC 124905854
## 812  TRNAD-GUC 124905857
## 813  TRNAD-GUC 124905860
## 814  TRNAD-GUC 124905863
## 815  TRNAD-GUC 124905866
## 816  TRNAD-GUC 124905869
## 817  TRNAD-GUC 124905872
## 818  TRNAD-GUC 124905875
## 819  TRNAD-GUC 124905878
## 820  TRNAD-GUC 124905881
## 821  TRNAD-GUC 124905884
## 822  TRNAD-GUC 124905887
## 823  TRNAD-GUC 124905890
## 824  TRNAD-GUC 124905893
## 825  TRNAD-GUC 124905896
## 826  TRNAD-GUC 124905899
## 827  TRNAD-GUC 124905902
## 828  TRNAD-GUC 124905854
## 829  TRNAD-GUC 124905857
## 830  TRNAD-GUC 124905860
## 831  TRNAD-GUC 124905863
## 832  TRNAD-GUC 124905866
## 833  TRNAD-GUC 124905869
## 834  TRNAD-GUC 124905872
## 835  TRNAD-GUC 124905875
## 836  TRNAD-GUC 124905878
## 837  TRNAD-GUC 124905881
## 838  TRNAD-GUC 124905884
## 839  TRNAD-GUC 124905887
## 840  TRNAD-GUC 124905890
## 841  TRNAD-GUC 124905893
## 842  TRNAD-GUC 124905896
## 843  TRNAD-GUC 124905899
## 844  TRNAD-GUC 124905902
## 845  TRNAD-GUC 124905854
## 846  TRNAD-GUC 124905857
## 847  TRNAD-GUC 124905860
## 848  TRNAD-GUC 124905863
## 849  TRNAD-GUC 124905866
## 850  TRNAD-GUC 124905869
## 851  TRNAD-GUC 124905872
## 852  TRNAD-GUC 124905875
## 853  TRNAD-GUC 124905878
## 854  TRNAD-GUC 124905881
## 855  TRNAD-GUC 124905884
## 856  TRNAD-GUC 124905887
## 857  TRNAD-GUC 124905890
## 858  TRNAD-GUC 124905893
## 859  TRNAD-GUC 124905896
## 860  TRNAD-GUC 124905899
## 861  TRNAD-GUC 124905902
## 862  TRNAD-GUC 124905854
## 863  TRNAD-GUC 124905857
## 864  TRNAD-GUC 124905860
## 865  TRNAD-GUC 124905863
## 866  TRNAD-GUC 124905866
## 867  TRNAD-GUC 124905869
## 868  TRNAD-GUC 124905872
## 869  TRNAD-GUC 124905875
## 870  TRNAD-GUC 124905878
## 871  TRNAD-GUC 124905881
## 872  TRNAD-GUC 124905884
## 873  TRNAD-GUC 124905887
## 874  TRNAD-GUC 124905890
## 875  TRNAD-GUC 124905893
## 876  TRNAD-GUC 124905896
## 877  TRNAD-GUC 124905899
## 878  TRNAD-GUC 124905902
## 879  TRNAD-GUC 124905854
## 880  TRNAD-GUC 124905857
## 881  TRNAD-GUC 124905860
## 882  TRNAD-GUC 124905863
## 883  TRNAD-GUC 124905866
## 884  TRNAD-GUC 124905869
## 885  TRNAD-GUC 124905872
## 886  TRNAD-GUC 124905875
## 887  TRNAD-GUC 124905878
## 888  TRNAD-GUC 124905881
## 889  TRNAD-GUC 124905884
## 890  TRNAD-GUC 124905887
## 891  TRNAD-GUC 124905890
## 892  TRNAD-GUC 124905893
## 893  TRNAD-GUC 124905896
## 894  TRNAD-GUC 124905899
## 895  TRNAD-GUC 124905902
## 896  TRNAD-GUC 124905854
## 897  TRNAD-GUC 124905857
## 898  TRNAD-GUC 124905860
## 899  TRNAD-GUC 124905863
## 900  TRNAD-GUC 124905866
## 901  TRNAD-GUC 124905869
## 902  TRNAD-GUC 124905872
## 903  TRNAD-GUC 124905875
## 904  TRNAD-GUC 124905878
## 905  TRNAD-GUC 124905881
## 906  TRNAD-GUC 124905884
## 907  TRNAD-GUC 124905887
## 908  TRNAD-GUC 124905890
## 909  TRNAD-GUC 124905893
## 910  TRNAD-GUC 124905896
## 911  TRNAD-GUC 124905899
## 912  TRNAD-GUC 124905902
## 913  TRNAE-CUC 124905855
## 914  TRNAE-CUC 124905858
## 915  TRNAE-CUC 124905861
## 916  TRNAE-CUC 124905864
## 917  TRNAE-CUC 124905867
## 918  TRNAE-CUC 124905870
## 919  TRNAE-CUC 124905873
## 920  TRNAE-CUC 124905876
## 921  TRNAE-CUC 124905879
## 922  TRNAE-CUC 124905882
## 923  TRNAE-CUC 124905885
## 924  TRNAE-CUC 124905888
## 925  TRNAE-CUC 124905891
## 926  TRNAE-CUC 124905894
## 927  TRNAE-CUC 124905897
## 928  TRNAE-CUC 124905900
## 929  TRNAE-CUC 124905903
## 930  TRNAE-CUC 124905855
## 931  TRNAE-CUC 124905858
## 932  TRNAE-CUC 124905861
## 933  TRNAE-CUC 124905864
## 934  TRNAE-CUC 124905867
## 935  TRNAE-CUC 124905870
## 936  TRNAE-CUC 124905873
## 937  TRNAE-CUC 124905876
## 938  TRNAE-CUC 124905879
## 939  TRNAE-CUC 124905882
## 940  TRNAE-CUC 124905885
## 941  TRNAE-CUC 124905888
## 942  TRNAE-CUC 124905891
## 943  TRNAE-CUC 124905894
## 944  TRNAE-CUC 124905897
## 945  TRNAE-CUC 124905900
## 946  TRNAE-CUC 124905903
## 947  TRNAE-CUC 124905855
## 948  TRNAE-CUC 124905858
## 949  TRNAE-CUC 124905861
## 950  TRNAE-CUC 124905864
## 951  TRNAE-CUC 124905867
## 952  TRNAE-CUC 124905870
## 953  TRNAE-CUC 124905873
## 954  TRNAE-CUC 124905876
## 955  TRNAE-CUC 124905879
## 956  TRNAE-CUC 124905882
## 957  TRNAE-CUC 124905885
## 958  TRNAE-CUC 124905888
## 959  TRNAE-CUC 124905891
## 960  TRNAE-CUC 124905894
## 961  TRNAE-CUC 124905897
## 962  TRNAE-CUC 124905900
## 963  TRNAE-CUC 124905903
## 964  TRNAE-CUC 124905855
## 965  TRNAE-CUC 124905858
## 966  TRNAE-CUC 124905861
## 967  TRNAE-CUC 124905864
## 968  TRNAE-CUC 124905867
## 969  TRNAE-CUC 124905870
## 970  TRNAE-CUC 124905873
## 971  TRNAE-CUC 124905876
## 972  TRNAE-CUC 124905879
## 973  TRNAE-CUC 124905882
## 974  TRNAE-CUC 124905885
## 975  TRNAE-CUC 124905888
## 976  TRNAE-CUC 124905891
## 977  TRNAE-CUC 124905894
## 978  TRNAE-CUC 124905897
## 979  TRNAE-CUC 124905900
## 980  TRNAE-CUC 124905903
## 981  TRNAE-CUC 124905855
## 982  TRNAE-CUC 124905858
## 983  TRNAE-CUC 124905861
## 984  TRNAE-CUC 124905864
## 985  TRNAE-CUC 124905867
## 986  TRNAE-CUC 124905870
## 987  TRNAE-CUC 124905873
## 988  TRNAE-CUC 124905876
## 989  TRNAE-CUC 124905879
## 990  TRNAE-CUC 124905882
## 991  TRNAE-CUC 124905885
## 992  TRNAE-CUC 124905888
## 993  TRNAE-CUC 124905891
## 994  TRNAE-CUC 124905894
## 995  TRNAE-CUC 124905897
## 996  TRNAE-CUC 124905900
## 997  TRNAE-CUC 124905903
## 998  TRNAE-CUC 124905855
## 999  TRNAE-CUC 124905858
## 1000 TRNAE-CUC 124905861
## 1001 TRNAE-CUC 124905864
## 1002 TRNAE-CUC 124905867
## 1003 TRNAE-CUC 124905870
## 1004 TRNAE-CUC 124905873
## 1005 TRNAE-CUC 124905876
## 1006 TRNAE-CUC 124905879
## 1007 TRNAE-CUC 124905882
## 1008 TRNAE-CUC 124905885
## 1009 TRNAE-CUC 124905888
## 1010 TRNAE-CUC 124905891
## 1011 TRNAE-CUC 124905894
## 1012 TRNAE-CUC 124905897
## 1013 TRNAE-CUC 124905900
## 1014 TRNAE-CUC 124905903
## 1015 TRNAE-CUC 124905855
## 1016 TRNAE-CUC 124905858
## 1017 TRNAE-CUC 124905861
## 1018 TRNAE-CUC 124905864
## 1019 TRNAE-CUC 124905867
## 1020 TRNAE-CUC 124905870
## 1021 TRNAE-CUC 124905873
## 1022 TRNAE-CUC 124905876
## 1023 TRNAE-CUC 124905879
## 1024 TRNAE-CUC 124905882
## 1025 TRNAE-CUC 124905885
## 1026 TRNAE-CUC 124905888
## 1027 TRNAE-CUC 124905891
## 1028 TRNAE-CUC 124905894
## 1029 TRNAE-CUC 124905897
## 1030 TRNAE-CUC 124905900
## 1031 TRNAE-CUC 124905903
## 1032 TRNAE-CUC 124905855
## 1033 TRNAE-CUC 124905858
## 1034 TRNAE-CUC 124905861
## 1035 TRNAE-CUC 124905864
## 1036 TRNAE-CUC 124905867
## 1037 TRNAE-CUC 124905870
## 1038 TRNAE-CUC 124905873
## 1039 TRNAE-CUC 124905876
## 1040 TRNAE-CUC 124905879
## 1041 TRNAE-CUC 124905882
## 1042 TRNAE-CUC 124905885
## 1043 TRNAE-CUC 124905888
## 1044 TRNAE-CUC 124905891
## 1045 TRNAE-CUC 124905894
## 1046 TRNAE-CUC 124905897
## 1047 TRNAE-CUC 124905900
## 1048 TRNAE-CUC 124905903
## 1049 TRNAE-CUC 124905855
## 1050 TRNAE-CUC 124905858
## 1051 TRNAE-CUC 124905861
## 1052 TRNAE-CUC 124905864
## 1053 TRNAE-CUC 124905867
## 1054 TRNAE-CUC 124905870
## 1055 TRNAE-CUC 124905873
## 1056 TRNAE-CUC 124905876
## 1057 TRNAE-CUC 124905879
## 1058 TRNAE-CUC 124905882
## 1059 TRNAE-CUC 124905885
## 1060 TRNAE-CUC 124905888
## 1061 TRNAE-CUC 124905891
## 1062 TRNAE-CUC 124905894
## 1063 TRNAE-CUC 124905897
## 1064 TRNAE-CUC 124905900
## 1065 TRNAE-CUC 124905903
## 1066 TRNAE-CUC 124905855
## 1067 TRNAE-CUC 124905858
## 1068 TRNAE-CUC 124905861
## 1069 TRNAE-CUC 124905864
## 1070 TRNAE-CUC 124905867
## 1071 TRNAE-CUC 124905870
## 1072 TRNAE-CUC 124905873
## 1073 TRNAE-CUC 124905876
## 1074 TRNAE-CUC 124905879
## 1075 TRNAE-CUC 124905882
## 1076 TRNAE-CUC 124905885
## 1077 TRNAE-CUC 124905888
## 1078 TRNAE-CUC 124905891
## 1079 TRNAE-CUC 124905894
## 1080 TRNAE-CUC 124905897
## 1081 TRNAE-CUC 124905900
## 1082 TRNAE-CUC 124905903
## 1083 TRNAE-CUC 124905855
## 1084 TRNAE-CUC 124905858
## 1085 TRNAE-CUC 124905861
## 1086 TRNAE-CUC 124905864
## 1087 TRNAE-CUC 124905867
## 1088 TRNAE-CUC 124905870
## 1089 TRNAE-CUC 124905873
## 1090 TRNAE-CUC 124905876
## 1091 TRNAE-CUC 124905879
## 1092 TRNAE-CUC 124905882
## 1093 TRNAE-CUC 124905885
## 1094 TRNAE-CUC 124905888
## 1095 TRNAE-CUC 124905891
## 1096 TRNAE-CUC 124905894
## 1097 TRNAE-CUC 124905897
## 1098 TRNAE-CUC 124905900
## 1099 TRNAE-CUC 124905903
## 1100 TRNAE-CUC 124905855
## 1101 TRNAE-CUC 124905858
## 1102 TRNAE-CUC 124905861
## 1103 TRNAE-CUC 124905864
## 1104 TRNAE-CUC 124905867
## 1105 TRNAE-CUC 124905870
## 1106 TRNAE-CUC 124905873
## 1107 TRNAE-CUC 124905876
## 1108 TRNAE-CUC 124905879
## 1109 TRNAE-CUC 124905882
## 1110 TRNAE-CUC 124905885
## 1111 TRNAE-CUC 124905888
## 1112 TRNAE-CUC 124905891
## 1113 TRNAE-CUC 124905894
## 1114 TRNAE-CUC 124905897
## 1115 TRNAE-CUC 124905900
## 1116 TRNAE-CUC 124905903
## 1117 TRNAE-CUC 124905855
## 1118 TRNAE-CUC 124905858
## 1119 TRNAE-CUC 124905861
## 1120 TRNAE-CUC 124905864
## 1121 TRNAE-CUC 124905867
## 1122 TRNAE-CUC 124905870
## 1123 TRNAE-CUC 124905873
## 1124 TRNAE-CUC 124905876
## 1125 TRNAE-CUC 124905879
## 1126 TRNAE-CUC 124905882
## 1127 TRNAE-CUC 124905885
## 1128 TRNAE-CUC 124905888
## 1129 TRNAE-CUC 124905891
## 1130 TRNAE-CUC 124905894
## 1131 TRNAE-CUC 124905897
## 1132 TRNAE-CUC 124905900
## 1133 TRNAE-CUC 124905903
## 1134 TRNAE-CUC 124905855
## 1135 TRNAE-CUC 124905858
## 1136 TRNAE-CUC 124905861
## 1137 TRNAE-CUC 124905864
## 1138 TRNAE-CUC 124905867
## 1139 TRNAE-CUC 124905870
## 1140 TRNAE-CUC 124905873
## 1141 TRNAE-CUC 124905876
## 1142 TRNAE-CUC 124905879
## 1143 TRNAE-CUC 124905882
## 1144 TRNAE-CUC 124905885
## 1145 TRNAE-CUC 124905888
## 1146 TRNAE-CUC 124905891
## 1147 TRNAE-CUC 124905894
## 1148 TRNAE-CUC 124905897
## 1149 TRNAE-CUC 124905900
## 1150 TRNAE-CUC 124905903
## 1151 TRNAE-CUC 124905855
## 1152 TRNAE-CUC 124905858
## 1153 TRNAE-CUC 124905861
## 1154 TRNAE-CUC 124905864
## 1155 TRNAE-CUC 124905867
## 1156 TRNAE-CUC 124905870
## 1157 TRNAE-CUC 124905873
## 1158 TRNAE-CUC 124905876
## 1159 TRNAE-CUC 124905879
## 1160 TRNAE-CUC 124905882
## 1161 TRNAE-CUC 124905885
## 1162 TRNAE-CUC 124905888
## 1163 TRNAE-CUC 124905891
## 1164 TRNAE-CUC 124905894
## 1165 TRNAE-CUC 124905897
## 1166 TRNAE-CUC 124905900
## 1167 TRNAE-CUC 124905903
## 1168 TRNAE-CUC 124905855
## 1169 TRNAE-CUC 124905858
## 1170 TRNAE-CUC 124905861
## 1171 TRNAE-CUC 124905864
## 1172 TRNAE-CUC 124905867
## 1173 TRNAE-CUC 124905870
## 1174 TRNAE-CUC 124905873
## 1175 TRNAE-CUC 124905876
## 1176 TRNAE-CUC 124905879
## 1177 TRNAE-CUC 124905882
## 1178 TRNAE-CUC 124905885
## 1179 TRNAE-CUC 124905888
## 1180 TRNAE-CUC 124905891
## 1181 TRNAE-CUC 124905894
## 1182 TRNAE-CUC 124905897
## 1183 TRNAE-CUC 124905900
## 1184 TRNAE-CUC 124905903
## 1185 TRNAG-UCC 124905856
## 1186 TRNAG-UCC 124905859
## 1187 TRNAG-UCC 124905862
## 1188 TRNAG-UCC 124905865
## 1189 TRNAG-UCC 124905868
## 1190 TRNAG-UCC 124905871
## 1191 TRNAG-UCC 124905874
## 1192 TRNAG-UCC 124905877
## 1193 TRNAG-UCC 124905880
## 1194 TRNAG-UCC 124905883
## 1195 TRNAG-UCC 124905886
## 1196 TRNAG-UCC 124905889
## 1197 TRNAG-UCC 124905892
## 1198 TRNAG-UCC 124905895
## 1199 TRNAG-UCC 124905898
## 1200 TRNAG-UCC 124905901
## 1201 TRNAG-UCC 124905904
## 1202 TRNAG-UCC 124905856
## 1203 TRNAG-UCC 124905859
## 1204 TRNAG-UCC 124905862
## 1205 TRNAG-UCC 124905865
## 1206 TRNAG-UCC 124905868
## 1207 TRNAG-UCC 124905871
## 1208 TRNAG-UCC 124905874
## 1209 TRNAG-UCC 124905877
## 1210 TRNAG-UCC 124905880
## 1211 TRNAG-UCC 124905883
## 1212 TRNAG-UCC 124905886
## 1213 TRNAG-UCC 124905889
## 1214 TRNAG-UCC 124905892
## 1215 TRNAG-UCC 124905895
## 1216 TRNAG-UCC 124905898
## 1217 TRNAG-UCC 124905901
## 1218 TRNAG-UCC 124905904
## 1219 TRNAG-UCC 124905856
## 1220 TRNAG-UCC 124905859
## 1221 TRNAG-UCC 124905862
## 1222 TRNAG-UCC 124905865
## 1223 TRNAG-UCC 124905868
## 1224 TRNAG-UCC 124905871
## 1225 TRNAG-UCC 124905874
## 1226 TRNAG-UCC 124905877
## 1227 TRNAG-UCC 124905880
## 1228 TRNAG-UCC 124905883
## 1229 TRNAG-UCC 124905886
## 1230 TRNAG-UCC 124905889
## 1231 TRNAG-UCC 124905892
## 1232 TRNAG-UCC 124905895
## 1233 TRNAG-UCC 124905898
## 1234 TRNAG-UCC 124905901
## 1235 TRNAG-UCC 124905904
## 1236 TRNAG-UCC 124905856
## 1237 TRNAG-UCC 124905859
## 1238 TRNAG-UCC 124905862
## 1239 TRNAG-UCC 124905865
## 1240 TRNAG-UCC 124905868
## 1241 TRNAG-UCC 124905871
## 1242 TRNAG-UCC 124905874
## 1243 TRNAG-UCC 124905877
## 1244 TRNAG-UCC 124905880
## 1245 TRNAG-UCC 124905883
## 1246 TRNAG-UCC 124905886
## 1247 TRNAG-UCC 124905889
## 1248 TRNAG-UCC 124905892
## 1249 TRNAG-UCC 124905895
## 1250 TRNAG-UCC 124905898
## 1251 TRNAG-UCC 124905901
## 1252 TRNAG-UCC 124905904
## 1253 TRNAG-UCC 124905856
## 1254 TRNAG-UCC 124905859
## 1255 TRNAG-UCC 124905862
## 1256 TRNAG-UCC 124905865
## 1257 TRNAG-UCC 124905868
## 1258 TRNAG-UCC 124905871
## 1259 TRNAG-UCC 124905874
## 1260 TRNAG-UCC 124905877
## 1261 TRNAG-UCC 124905880
## 1262 TRNAG-UCC 124905883
## 1263 TRNAG-UCC 124905886
## 1264 TRNAG-UCC 124905889
## 1265 TRNAG-UCC 124905892
## 1266 TRNAG-UCC 124905895
## 1267 TRNAG-UCC 124905898
## 1268 TRNAG-UCC 124905901
## 1269 TRNAG-UCC 124905904
## 1270 TRNAG-UCC 124905856
## 1271 TRNAG-UCC 124905859
## 1272 TRNAG-UCC 124905862
## 1273 TRNAG-UCC 124905865
## 1274 TRNAG-UCC 124905868
## 1275 TRNAG-UCC 124905871
## 1276 TRNAG-UCC 124905874
## 1277 TRNAG-UCC 124905877
## 1278 TRNAG-UCC 124905880
## 1279 TRNAG-UCC 124905883
## 1280 TRNAG-UCC 124905886
## 1281 TRNAG-UCC 124905889
## 1282 TRNAG-UCC 124905892
## 1283 TRNAG-UCC 124905895
## 1284 TRNAG-UCC 124905898
## 1285 TRNAG-UCC 124905901
## 1286 TRNAG-UCC 124905904
## 1287 TRNAG-UCC 124905856
## 1288 TRNAG-UCC 124905859
## 1289 TRNAG-UCC 124905862
## 1290 TRNAG-UCC 124905865
## 1291 TRNAG-UCC 124905868
## 1292 TRNAG-UCC 124905871
## 1293 TRNAG-UCC 124905874
## 1294 TRNAG-UCC 124905877
## 1295 TRNAG-UCC 124905880
## 1296 TRNAG-UCC 124905883
## 1297 TRNAG-UCC 124905886
## 1298 TRNAG-UCC 124905889
## 1299 TRNAG-UCC 124905892
## 1300 TRNAG-UCC 124905895
## 1301 TRNAG-UCC 124905898
## 1302 TRNAG-UCC 124905901
## 1303 TRNAG-UCC 124905904
## 1304 TRNAG-UCC 124905856
## 1305 TRNAG-UCC 124905859
## 1306 TRNAG-UCC 124905862
## 1307 TRNAG-UCC 124905865
## 1308 TRNAG-UCC 124905868
## 1309 TRNAG-UCC 124905871
## 1310 TRNAG-UCC 124905874
## 1311 TRNAG-UCC 124905877
## 1312 TRNAG-UCC 124905880
## 1313 TRNAG-UCC 124905883
## 1314 TRNAG-UCC 124905886
## 1315 TRNAG-UCC 124905889
## 1316 TRNAG-UCC 124905892
## 1317 TRNAG-UCC 124905895
## 1318 TRNAG-UCC 124905898
## 1319 TRNAG-UCC 124905901
## 1320 TRNAG-UCC 124905904
## 1321 TRNAG-UCC 124905856
## 1322 TRNAG-UCC 124905859
## 1323 TRNAG-UCC 124905862
## 1324 TRNAG-UCC 124905865
## 1325 TRNAG-UCC 124905868
## 1326 TRNAG-UCC 124905871
## 1327 TRNAG-UCC 124905874
## 1328 TRNAG-UCC 124905877
## 1329 TRNAG-UCC 124905880
## 1330 TRNAG-UCC 124905883
## 1331 TRNAG-UCC 124905886
## 1332 TRNAG-UCC 124905889
## 1333 TRNAG-UCC 124905892
## 1334 TRNAG-UCC 124905895
## 1335 TRNAG-UCC 124905898
## 1336 TRNAG-UCC 124905901
## 1337 TRNAG-UCC 124905904
## 1338 TRNAG-UCC 124905856
## 1339 TRNAG-UCC 124905859
## 1340 TRNAG-UCC 124905862
## 1341 TRNAG-UCC 124905865
## 1342 TRNAG-UCC 124905868
## 1343 TRNAG-UCC 124905871
## 1344 TRNAG-UCC 124905874
## 1345 TRNAG-UCC 124905877
## 1346 TRNAG-UCC 124905880
## 1347 TRNAG-UCC 124905883
## 1348 TRNAG-UCC 124905886
## 1349 TRNAG-UCC 124905889
## 1350 TRNAG-UCC 124905892
## 1351 TRNAG-UCC 124905895
## 1352 TRNAG-UCC 124905898
## 1353 TRNAG-UCC 124905901
## 1354 TRNAG-UCC 124905904
## 1355 TRNAG-UCC 124905856
## 1356 TRNAG-UCC 124905859
## 1357 TRNAG-UCC 124905862
## 1358 TRNAG-UCC 124905865
## 1359 TRNAG-UCC 124905868
## 1360 TRNAG-UCC 124905871
## 1361 TRNAG-UCC 124905874
## 1362 TRNAG-UCC 124905877
## 1363 TRNAG-UCC 124905880
## 1364 TRNAG-UCC 124905883
## 1365 TRNAG-UCC 124905886
## 1366 TRNAG-UCC 124905889
## 1367 TRNAG-UCC 124905892
## 1368 TRNAG-UCC 124905895
## 1369 TRNAG-UCC 124905898
## 1370 TRNAG-UCC 124905901
## 1371 TRNAG-UCC 124905904
## 1372 TRNAG-UCC 124905856
## 1373 TRNAG-UCC 124905859
## 1374 TRNAG-UCC 124905862
## 1375 TRNAG-UCC 124905865
## 1376 TRNAG-UCC 124905868
## 1377 TRNAG-UCC 124905871
## 1378 TRNAG-UCC 124905874
## 1379 TRNAG-UCC 124905877
## 1380 TRNAG-UCC 124905880
## 1381 TRNAG-UCC 124905883
## 1382 TRNAG-UCC 124905886
## 1383 TRNAG-UCC 124905889
## 1384 TRNAG-UCC 124905892
## 1385 TRNAG-UCC 124905895
## 1386 TRNAG-UCC 124905898
## 1387 TRNAG-UCC 124905901
## 1388 TRNAG-UCC 124905904
## 1389 TRNAG-UCC 124905856
## 1390 TRNAG-UCC 124905859
## 1391 TRNAG-UCC 124905862
## 1392 TRNAG-UCC 124905865
## 1393 TRNAG-UCC 124905868
## 1394 TRNAG-UCC 124905871
## 1395 TRNAG-UCC 124905874
## 1396 TRNAG-UCC 124905877
## 1397 TRNAG-UCC 124905880
## 1398 TRNAG-UCC 124905883
## 1399 TRNAG-UCC 124905886
## 1400 TRNAG-UCC 124905889
## 1401 TRNAG-UCC 124905892
## 1402 TRNAG-UCC 124905895
## 1403 TRNAG-UCC 124905898
## 1404 TRNAG-UCC 124905901
## 1405 TRNAG-UCC 124905904
## 1406 TRNAG-UCC 124905856
## 1407 TRNAG-UCC 124905859
## 1408 TRNAG-UCC 124905862
## 1409 TRNAG-UCC 124905865
## 1410 TRNAG-UCC 124905868
## 1411 TRNAG-UCC 124905871
## 1412 TRNAG-UCC 124905874
## 1413 TRNAG-UCC 124905877
## 1414 TRNAG-UCC 124905880
## 1415 TRNAG-UCC 124905883
## 1416 TRNAG-UCC 124905886
## 1417 TRNAG-UCC 124905889
## 1418 TRNAG-UCC 124905892
## 1419 TRNAG-UCC 124905895
## 1420 TRNAG-UCC 124905898
## 1421 TRNAG-UCC 124905901
## 1422 TRNAG-UCC 124905904
## 1423 TRNAG-UCC 124905856
## 1424 TRNAG-UCC 124905859
## 1425 TRNAG-UCC 124905862
## 1426 TRNAG-UCC 124905865
## 1427 TRNAG-UCC 124905868
## 1428 TRNAG-UCC 124905871
## 1429 TRNAG-UCC 124905874
## 1430 TRNAG-UCC 124905877
## 1431 TRNAG-UCC 124905880
## 1432 TRNAG-UCC 124905883
## 1433 TRNAG-UCC 124905886
## 1434 TRNAG-UCC 124905889
## 1435 TRNAG-UCC 124905892
## 1436 TRNAG-UCC 124905895
## 1437 TRNAG-UCC 124905898
## 1438 TRNAG-UCC 124905901
## 1439 TRNAG-UCC 124905904
## 1440 TRNAG-UCC 124905856
## 1441 TRNAG-UCC 124905859
## 1442 TRNAG-UCC 124905862
## 1443 TRNAG-UCC 124905865
## 1444 TRNAG-UCC 124905868
## 1445 TRNAG-UCC 124905871
## 1446 TRNAG-UCC 124905874
## 1447 TRNAG-UCC 124905877
## 1448 TRNAG-UCC 124905880
## 1449 TRNAG-UCC 124905883
## 1450 TRNAG-UCC 124905886
## 1451 TRNAG-UCC 124905889
## 1452 TRNAG-UCC 124905892
## 1453 TRNAG-UCC 124905895
## 1454 TRNAG-UCC 124905898
## 1455 TRNAG-UCC 124905901
## 1456 TRNAG-UCC 124905904

15.5 Exercise 5:

To retrieve this information using select(), do the following:

res1 <- select(TxDb.Hsapiens.UCSC.hg19.knownGene,
               keys(TxDb.Hsapiens.UCSC.hg19.knownGene, keytype="TXID"),
               columns=c("GENEID","TXNAME","TXCHROM"), keytype="TXID")

## 'select()' returned 1:1 mapping between keys and columns

head(res1)

##   TXID    GENEID     TXNAME TXCHROM
## 1    1 100287102 uc001aaa.3    chr1
## 2    2 100287102 uc010nxq.1    chr1
## 3    3 100287102 uc010nxr.1    chr1
## 4    4     79501 uc001aal.1    chr1
## 5    5      <NA> uc001aaq.2    chr1
## 6    6      <NA> uc001aar.2    chr1

To do the same using transcripts():

res2 <- transcripts(TxDb.Hsapiens.UCSC.hg19.knownGene,
                    columns = c("gene_id","tx_name"))
head(res2)

## GRanges object with 6 ranges and 2 metadata columns:
##       seqnames        ranges strand |         gene_id     tx_name
##          <Rle>     <IRanges>  <Rle> | <CharacterList> <character>
##   [1]     chr3 238279-451097      + |           10752  uc003bot.3
##   [2]     chr3 238279-451097      + |           10752  uc003bou.3
##   [3]     chr3 239326-290282      + |           10752  uc003bov.2
##   [4]     chr3 239326-440831      + |           10752  uc003bow.2
##   [5]     chr3 361366-451097      + |           10752  uc011asi.2
##   [6]     chr3 577914-887698      + |                  uc003boy.1
##   -------
##   seqinfo: 2 sequences from hg19 genome

Notice that in the second case we don’t have to ask for the chromosome: transcripts() returns a GRanges object, so the chromosome is automatically returned as part of the object.
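As a minimal sketch of that point, the chromosome can be read straight off a GRanges object with seqnames(). The small object below is built by hand from the first two ranges in the printed output rather than pulled from the TxDb, so it only assumes that the GenomicRanges package is installed; the names gr, chrom and df are illustrative.

```r
library(GenomicRanges)

# Hand-built stand-in for the first two rows of res2
# (coordinates and names copied from the printed output)
gr <- GRanges(
    seqnames = c("chr3", "chr3"),
    ranges   = IRanges(start = c(238279, 238279), end = c(451097, 451097)),
    strand   = c("+", "+"),
    tx_name  = c("uc003bot.3", "uc003bou.3")
)

# The chromosome is stored in the object itself
chrom <- as.character(seqnames(gr))

# Flatten to a data.frame, comparable to what select() returns
df <- as.data.frame(gr)
df[, c("seqnames", "start", "end", "tx_name")]
```

Converting with as.data.frame() is one way to line the two results up side by side if you want to compare them column by column.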

15.6 Exercise 6:

res <- transcripts(TxDb.Athaliana.BioMart.plantsmart22, columns = c("gene_id"))

You will notice that the gene IDs for this package are TAIR locus IDs and are NOT Entrez Gene IDs like those you saw in the TxDb.Hsapiens.UCSC.hg19.knownGene package. It’s important to always pay attention to the kind of gene ID being used by the TxDb you are looking at.
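One rough way to check which kind of identifier you are holding is to look at its textual shape. The helper below, guess_gene_id_type(), is hypothetical and not part of any Bioconductor package; it assumes that TAIR locus IDs look like AT1G01010 and that Entrez Gene IDs are purely numeric, and it says nothing about whether an ID actually exists.

```r
# Hypothetical helper: classify gene IDs by their textual shape only
guess_gene_id_type <- function(ids) {
    ifelse(grepl("^AT[1-5CM]G[0-9]{5}$", ids), "TAIR locus",
           ifelse(grepl("^[0-9]+$", ids), "Entrez-like", "unknown"))
}

# One TAIR-style locus ID, two Entrez Gene IDs and one UCSC transcript name
ids <- c("AT1G01010", "79501", "100287102", "uc001aaa.3")
id_types <- guess_gene_id_type(ids)
id_types
```

With a real TxDb you could pass in something like unlist(res$gene_id), but treat the answer as a hint only; the package documentation is the authoritative source for the ID type.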

15.7 Exercise 7:

keys <- keys(Homo.sapiens, keytype="TXID")
res1 <- select(Homo.sapiens,
               keys= keys,
               columns=c("SYMBOL","TXSTART","TXCHROM"), keytype="TXID")

head(res1)

To do the same using transcripts():

res2 <- transcripts(Homo.sapiens, columns="SYMBOL")

## 'select()' returned 1:1 mapping between keys and columns

head(res2)

## GRanges object with 6 ranges and 1 metadata column:
##       seqnames        ranges strand |          SYMBOL
##          <Rle>     <IRanges>  <Rle> | <CharacterList>
##   [1]     chr3 238279-451097      + |            CHL1
##   [2]     chr3 238279-451097      + |            CHL1
##   [3]     chr3 239326-290282      + |            CHL1
##   [4]     chr3 239326-440831      + |            CHL1
##   [5]     chr3 361366-451097      + |            CHL1
##   [6]     chr3 577914-887698      + |            <NA>
##   -------
##   seqinfo: 2 sequences from hg19 genome

15.8 Exercise 8:

columns(Homo.sapiens)

##  [1] "ACCNUM"       "ALIAS"        "CDSCHROM"     "CDSEND"       "CDSID"       
##  [6] "CDSNAME"      "CDSSTART"     "CDSSTRAND"    "DEFINITION"   "ENSEMBL"     
## [11] "ENSEMBLPROT"  "ENSEMBLTRANS" "ENTREZID"     "ENZYME"       "EVIDENCE"    
## [16] "EVIDENCEALL"  "EXONCHROM"    "EXONEND"      "EXONID"       "EXONNAME"    
## [21] "EXONRANK"     "EXONSTART"    "EXONSTRAND"   "GENEID"       "GENENAME"    
## [26] "GENETYPE"     "GO"           "GOALL"        "GOID"         "IPI"         
## [31] "MAP"          "OMIM"         "ONTOLOGY"     "ONTOLOGYALL"  "PATH"        
## [36] "PFAM"         "PMID"         "PROSITE"      "REFSEQ"       "SYMBOL"      
## [41] "TERM"         "TXCHROM"      "TXEND"        "TXID"         "TXNAME"      
## [46] "TXSTART"      "TXSTRAND"     "TXTYPE"       "UCSCKG"       "UNIPROT"

columns(org.Hs.eg.db)

##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
##  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
## [11] "GENETYPE"     "GO"           "GOALL"        "IPI"          "MAP"         
## [16] "OMIM"         "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"        
## [21] "PMID"         "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
## [26] "UNIPROT"

columns(TxDb.Hsapiens.UCSC.hg19.knownGene)

##  [1] "CDSCHROM"   "CDSEND"     "CDSID"      "CDSNAME"    "CDSSTART"  
##  [6] "CDSSTRAND"  "EXONCHROM"  "EXONEND"    "EXONID"     "EXONNAME"  
## [11] "EXONRANK"   "EXONSTART"  "EXONSTRAND" "GENEID"     "TXCHROM"   
## [16] "TXEND"      "TXID"       "TXNAME"     "TXSTART"    "TXSTRAND"  
## [21] "TXTYPE"

## You might also want to look at this:
transcripts(Homo.sapiens, columns=c("SYMBOL","CHRLOC"))

## 'select()' returned 1:1 mapping between keys and columns

## GRanges object with 5506 ranges and 1 metadata column:
##          seqnames            ranges strand |          SYMBOL
##             <Rle>         <IRanges>  <Rle> | <CharacterList>
##      [1]     chr3     238279-451097      + |            CHL1
##      [2]     chr3     238279-451097      + |            CHL1
##      [3]     chr3     239326-290282      + |            CHL1
##      [4]     chr3     239326-440831      + |            CHL1
##      [5]     chr3     361366-451097      + |            CHL1
##      ...      ...               ...    ... .             ...
##   [5502]    chr18 77732867-77748532      - |          TXNL4A
##   [5503]    chr18 77732867-77748532      - |          TXNL4A
##   [5504]    chr18 77732867-77793915      - |          TXNL4A
##   [5505]    chr18 77915117-78005397      - |          PARD6G
##   [5506]    chr18 77941005-78005397      - |          PARD6G
##   -------
##   seqinfo: 2 sequences from hg19 genome

The key difference is that TXSTART refers to the start of a transcript and originates in the TxDb object from the TxDb.Hsapiens.UCSC.hg19.knownGene package, while CHRLOC refers to the same thing but originates in the OrgDb object from the org.Hs.eg.db package. The point of origin is significant because the TxDb object represents a transcriptome from UCSC, whereas the OrgDb holds primarily gene-centric data that originates at NCBI. The upshot is that CHRLOC will not have as many regions represented as TXSTART, since there has to be an official gene for there to even be a record. The CHRLOC data is also locked in for org.Hs.eg.db as data for hg19, whereas you can swap in a different TxDb object to match the genome you are using (hg18, for example). For these reasons, we strongly recommend using TXSTART instead of CHRLOC. However, CHRLOC still remains in the org packages for historical reasons.

15.9 Exercise 9:

To find the keys that match, make use of the pattern and column arguments.

xk = head(keys(Homo.sapiens, keytype="ENTREZID", pattern="X", column="SYMBOL"))

## 'select()' returned 1:1 mapping between keys and columns

xk

## [1] "51"  "179" "189" "239" "240" "241"

Use select() to verify the results:

select(Homo.sapiens, xk, "SYMBOL", "ENTREZID")

## 'select()' returned 1:1 mapping between keys and columns

##   ENTREZID  SYMBOL
## 1       51   ACOX1
## 2      179   AGMX2
## 3      189    AGXT
## 4      239  ALOX12
## 5      240   ALOX5
## 6      241 ALOX5AP
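The symbols returned here all contain an "X" somewhere, not necessarily at the start. The base R sketch below illustrates this using the six symbols from the output, under the assumption that keys() applies pattern as a regular expression in the same way grepl() does; the variable names are illustrative.

```r
# Symbols copied from the select() output
symbols <- c("ACOX1", "AGMX2", "AGXT", "ALOX12", "ALOX5", "ALOX5AP")

# "X" matches anywhere in the string
contains_x <- grepl("X", symbols)

# Anchoring with "^" restricts the match to symbols that start with X
starts_with_x <- grepl("^X", symbols)

data.frame(symbols, contains_x, starts_with_x)
```

If you only want symbols that begin with a given prefix, anchoring the pattern (for example pattern="^X") is therefore likely what you need.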

15.10 Exercise 10:

## Get the transcript ranges grouped by gene
txby <- transcriptsBy(Homo.sapiens, by="gene")
## look up the entrez ID for the gene symbol 'PTEN'
select(Homo.sapiens, keys='PTEN', columns='ENTREZID', keytype='SYMBOL')
## subset that gene's transcripts
geneOfInterest <- txby[["5728"]]
## extract the sequence
res <- getSeq(Hsapiens, geneOfInterest)
res

15.11 Exercise 11:

ensembl <- useEnsembl(biomart = "ensembl", dataset="hsapiens_gene_ensembl")
ids <- c("1")
getBM(attributes=c('go_id', 'entrezgene_id'),
      filters = 'entrezgene_id',
      values = ids,
      mart = ensembl)

##         go_id entrezgene_id
## 1                         1
## 2  GO:0005576             1
## 3  GO:0005615             1
## 4  GO:0070062             1
## 5  GO:0003674             1
## 6  GO:0008150             1
## 7  GO:0072562             1
## 8  GO:0062023             1
## 9  GO:0034774             1
## 10 GO:1904813             1
## 11 GO:0031093             1

15.12 Exercise 12:

ids <- c("1")
select(org.Hs.eg.db, keys=ids, columns="GO", keytype="ENTREZID")

## 'select()' returned 1:many mapping between keys and columns

##    ENTREZID         GO EVIDENCE ONTOLOGY
## 1         1 GO:0003674       ND       MF
## 2         1 GO:0005576      HDA       CC
## 3         1 GO:0005576      IDA       CC
## 4         1 GO:0005576      TAS       CC
## 5         1 GO:0005615      HDA       CC
## 6         1 GO:0008150       ND       BP
## 7         1 GO:0031093      TAS       CC
## 8         1 GO:0034774      TAS       CC
## 9         1 GO:0062023      HDA       CC
## 10        1 GO:0070062      HDA       CC
## 11        1 GO:0072562      HDA       CC
## 12        1 GO:1904813      TAS       CC

When this exercise was written, biomaRt returned a different number of GO terms than org.Hs.eg.db. This may not always be true in the future, as both of these resources are updated. It is expected, however, that the web service (which is updated continuously) will fall in and out of sync with the org.Hs.eg.db package (which is updated twice a year). This is an important difference, as each approach has its own advantages and disadvantages. The advantage of continuous updates is that you always have the very latest annotations, which change frequently for something like GO terms. The advantage of using a package is that the results are frozen to a release of Bioconductor, which helps you get the same answers a few years from now that you get today (reproducibility).
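When comparing the two sources, it helps to compare unique GO IDs rather than row counts. The base R sketch below does this with the GO IDs copied from the two outputs printed in Exercises 11 and 12; the variable names are illustrative, and with live data you would take the vectors from the go_id and GO columns of the respective results instead.

```r
# GO IDs as printed by getBM() in Exercise 11 (note the one empty entry)
biomart_go <- c("", "GO:0005576", "GO:0005615", "GO:0070062", "GO:0003674",
                "GO:0008150", "GO:0072562", "GO:0062023", "GO:0034774",
                "GO:1904813", "GO:0031093")

# GO IDs as printed by select() in Exercise 12 (one row per evidence code)
orgdb_go <- c("GO:0003674", "GO:0005576", "GO:0005576", "GO:0005576",
              "GO:0005615", "GO:0008150", "GO:0031093", "GO:0034774",
              "GO:0062023", "GO:0070062", "GO:0072562", "GO:1904813")

# Drop empty strings and duplicates before comparing
biomart_ids <- unique(biomart_go[nzchar(biomart_go)])
orgdb_ids   <- unique(orgdb_go)

only_biomart <- setdiff(biomart_ids, orgdb_ids)  # terms seen only via biomaRt
only_orgdb   <- setdiff(orgdb_ids, biomart_ids)  # terms seen only via org.Hs.eg.db
```

For the outputs printed in this document, both differences are empty: the 11 versus 12 rows come from the blank biomaRt entry and from org.Hs.eg.db listing GO:0005576 once per evidence code. A rerun at a later date may well give a different picture.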


Genomic Annotation Resources

24 April 2018
