Genomic annotations for functional analyses

Author

You!

Published

January 29, 2025

Approximate time: 30 minutes

Learning Objectives

Discuss the available genomic annotation databases and the different types if information stored
Compare and contrast the tools available for accessing genomic annotation databases

Genomic annotations

The analysis of next-generation sequencing results requires associating genes, transcripts, proteins, etc. with functional or regulatory information. To perform functional analysis on gene lists, we often need to obtain gene identifiers that are compatible with the tools we wish to use and this is not always trivial.

Databases

We retrieve information on the processes, pathways, etc. (for which a gene is involved in) from the necessary database where the information is stored. The database you choose will be dependent on what type of information you are trying to obtain. Examples of databases that are often queried, include:

General databases

Offer comprehensive information on genome features, feature coordinates, homology, variant information, phenotypes, protein domain/family information, associated biological processes/pathways, associated microRNAs, etc.:

Ensembl (use Ensembl gene IDs)
NCBI (use Entrez gene IDs)
UCSC
EMBL-EBI

Annotation-specific databases

Provide annotations related to a specific topic:

Gene Ontology (GO): database of gene ontology biological processes, cellular components and molecular functions - based on Ensembl or Entrez gene IDs or official gene symbols
KEGG: database of biological pathways - based on Entrez gene IDs
MSigDB: database of gene sets
Reactome: database of biological pathways
Human Phenotype Ontology: database of genes associated with human disease
CORUM: database of protein complexes for human, mouse, rat
…

Tools for accessing databases

Within R, there are many popular packages used for gene/transcript-level annotation. These packages provide tools that take the list of genes you provide and retrieve information for each gene using one or more of the databases listed above.

Annotation tools: for accessing/querying annotations from a specific databases

Tool	Description	Pros	Cons
org.Xx.eg.db	Query gene feature information for the organism of interest	gene ID conversion, biotype and coordinate information	only latest genome build available
EnsDb.Xx.vxx	Transcript and gene-level information directly fetched from Ensembl API (similar to TxDb, but with filtering ability and versioned by Ensembl release)	easy functions to extract features, direct filtering	Not the most up-to-date annotations, more difficult to use than some packages
TxDb.Xx.UCSC.hgxx.knownGene	UCSC database for transcript and gene-level information or can create own TxDb from an SQLite database file using the GenomicFeatures package	feature information, easy functions to extract features	only available current and recent genome builds - can create your own, less up-to-date with annotations than Ensembl
annotables	Gene-level feature information immediately available for the human and model organisms	super quick and easy gene ID conversion, biotype and coordinate information	static resource, not updated regularly
biomaRt	An R package version of the Ensembl BioMart online tool	all Ensembl database information available, all organisms on Ensembl, wealth of information

AnnotationHub

AnnotationHub is a wonderful resource for accessing genomic data or querying large collection of whole genome resources, including ENSEMBL, UCSC, ENCODE, Broad Institute, KEGG, NIH Pathway Interaction Database, etc. All of this information is stored and easily accessible by directly connecting to the database.

To get started with AnnotationHub, we first load the library and connect to the database:

library(AnnotationHub)
library(ensembldb)

#To avoid prompting from annotation hub
setAnnotationHubOption("CACHE", "./AnnotationHub")
setAnnotationHubOption("ASK", FALSE)

# Connect to AnnotationHub
ah <- AnnotationHub()

To see the types of information stored inside our database, we can just type the name of the object:

# Explore the AnnotationHub object
ah

AnnotationHub with 72098 records
# snapshotDate(): 2024-10-28
# $dataprovider: Ensembl, BroadInstitute, UCSC, ftp://ftp.ncbi.nlm.nih.gov/g...
# $species: Homo sapiens, Mus musculus, Drosophila melanogaster, Rattus norv...
# $rdataclass: GRanges, TwoBitFile, BigWigFile, EnsDb, Rle, OrgDb, SQLiteFil...
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH5012"]]' 

             title                                             
  AH5012   | Chromosome Band                                   
  AH5013   | STS Markers                                       
  AH5014   | FISH Clones                                       
  AH5015   | Recomb Rate                                       
  AH5016   | ENCODE Pilot                                      
  ...        ...                                               
  AH119504 | Ensembl 113 EnsDb for Xiphophorus maculatus       
  AH119505 | Ensembl 113 EnsDb for Xenopus tropicalis          
  AH119506 | Ensembl 113 EnsDb for Zonotrichia albicollis      
  AH119507 | Ensembl 113 EnsDb for Zalophus californianus      
  AH119508 | Ensembl 113 EnsDb for Zosterops lateralis melanops

If you would like to see more information about any of the classes of data you can extract that information as well. For example, if you wanted to determine all species information available, you could explore that within the AnnotationHub object:

# Explore all species information available
unique(ah$species) %>% head()

[1] "Homo sapiens"         "Vicugna pacos"        "Dasypus novemcinctus"
[4] "Otolemur garnettii"   "Papio hamadryas"      "Papio anubis"

In addition to species information, there is also additional information about the type of Data Objects and the Data Providers:

# Explore the types of Data Objects available
unique(ah$rdataclass) %>% head()

[1] "GRanges"          "data.frame"       "Inparanoid8Db"    "TwoBitFile"      
[5] "ChainFile"        "SQLiteConnection"

# Explore the Data Providers
unique(ah$dataprovider) %>% head()

[1] "UCSC"        "Ensembl"     "RefNet"      "Inparanoid8" "NHLBI"      
[6] "ChEA"

Now that we know the types of information available from AnnotationHub we can query it for the information we want using the query() function. Let’s say we would like to return the Ensembl EnsDb information for Human. To return the records available, we need to use the terms as they are output from the ah object to extract the desired data.

# Query AnnotationHub
human_ens <- query(ah, c("Homo sapiens", "EnsDb"))

The query retrieves all hits for the EnsDb objects, and you will see that they are listed by the release number. The most current release for GRCh38 is Ensembl 109 and AnnotationHub offers that as an option to use. However, if you look at options for older releases, for Homo sapiens it only goes back as far as Ensembl 87. This is fine if you are using GRCh38, however if you were using an older genome build like hg19/GRCh37, you would need to load the EnsDb package if available for that release or you might need to build your own with ensembldb.

human_ens

AnnotationHub with 28 records
# snapshotDate(): 2024-10-28
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: EnsDb
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH53211"]]' 

             title                             
  AH53211  | Ensembl 87 EnsDb for Homo Sapiens 
  AH53715  | Ensembl 88 EnsDb for Homo Sapiens 
  AH56681  | Ensembl 89 EnsDb for Homo Sapiens 
  AH57757  | Ensembl 90 EnsDb for Homo Sapiens 
  AH60773  | Ensembl 91 EnsDb for Homo Sapiens 
  ...        ...                               
  AH109606 | Ensembl 109 EnsDb for Homo sapiens
  AH113665 | Ensembl 110 EnsDb for Homo sapiens
  AH116291 | Ensembl 111 EnsDb for Homo sapiens
  AH116860 | Ensembl 112 EnsDb for Homo sapiens
  AH119325 | Ensembl 113 EnsDb for Homo sapiens

If we are looking for the latest Ensembl release, so that the annotations are the most up-to-date, we can use the AnnotationHub ID to subset the object:

# Extract annotations of interest
human_ens <- human_ens[[length(human_ens)]]

Now we can use ensembldb functions to extract the information at the gene, transcript, or exon levels. We are interested in the gene-level annotations, so we can extract that information as follows:

# Extract gene-level information
genes(human_ens, return.type = "data.frame") %>% head()

But note that it is just as easy to get the transcript- or exon-level information:

# Extract transcript-level information
transcripts(human_ens, return.type = "data.frame") %>% head()

# Extract exon-level information
exons(human_ens, return.type = "data.frame") %>% head()

To obtain an annotation data frame using AnnotationHub, we’ll use the genes() function, but only keep selected columns and filter out rows to keep those corresponding to our gene identifiers in our results file:

# Create a gene-level dataframe 
annotations_ahb <- genes(human_ens, return.type = "data.frame")  %>%
  dplyr::select(gene_id, gene_name, entrezid, gene_biotype, description) %>% 
  dplyr::filter(gene_id %in% res_tableCont_tb$gene)

This dataframe looks like it should be fine as it is, but we look a little closer we will notice that the column containing Entrez identifiers is a list, and in fact there are many Ensembl identifiers that map to more than one Entrez identifier!

# Wait a second, we don't have one-to-one mappings!
class(annotations_ahb$entrezid)

[1] "list"

which(map(annotations_ahb$entrezid, length) > 1)

ENSG00000290825 ENSG00000239149 ENSG00000229571 ENSG00000291072 ENSG00000158747 
              1             308             322             400             454 
ENSG00000235200 ENSG00000233008 ENSG00000137962 ENSG00000235501 ENSG00000225206 
           1344            1457            1604            1615            1635 
ENSG00000187733 ENSG00000226419 ENSG00000277610 ENSG00000291123 ENSG00000263513 
           1691            1829            1950            1955            1984 
ENSG00000232527 ENSG00000288905 ENSG00000117262 ENSG00000188092 ENSG00000274020 
           1988            2004            2024            2084            2099 
ENSG00000291158 ENSG00000143384 ENSG00000163156 ENSG00000272668 ENSG00000162747 
           2130            2173            2205            2496            2580 
ENSG00000120341 ENSG00000230124 ENSG00000284237 ENSG00000224535 ENSG00000199396 
           2782            2823            3148            3207            3420 
ENSG00000199270 ENSG00000201925 ENSG00000228044 ENSG00000203668 ENSG00000226005 
           3421            3422            3510            3584            3721 
ENSG00000242147 ENSG00000272381 ENSG00000228426 ENSG00000285662 ENSG00000228566 
           3745            3994            4093            4149            4288 
ENSG00000230417 ENSG00000166272 ENSG00000232903 ENSG00000171714 ENSG00000281880 
           4482            4838            5143            5592            5644 
ENSG00000254584 ENSG00000255267 ENSG00000229183 ENSG00000254911 ENSG00000245552 
           5646            5752            5940            6617            6649 
ENSG00000109927 ENSG00000149557 ENSG00000151067 ENSG00000258325 ENSG00000111671 
           6950            7009            7135            7138            7251 
ENSG00000213809 ENSG00000111215 ENSG00000212127 ENSG00000255374 ENSG00000255760 
           7393            7411            7416            7425            7659 
ENSG00000257545 ENSG00000178882 ENSG00000249345 ENSG00000215483 ENSG00000226519 
           8614            8939            8965            9346            9395 
ENSG00000176124 ENSG00000186047 ENSG00000235532 ENSG00000178734 ENSG00000223617 
           9482            9491            9596            9606            9786 
ENSG00000258038 ENSG00000206588 ENSG00000253563 ENSG00000214900 ENSG00000258537 
          10148           10195           10236           10326           10356 
ENSG00000257365 ENSG00000133961 ENSG00000223403 ENSG00000258710 ENSG00000277865 
          10549           10680           11019           11154           11167 
ENSG00000183629 ENSG00000207432 ENSG00000261247 ENSG00000178115 ENSG00000207430 
          11226           11258           11259           11276           11279 
ENSG00000137843 ENSG00000237289 ENSG00000259426 ENSG00000159289 ENSG00000167195 
          11413           11540           11942           12026           12069 
ENSG00000278662 ENSG00000251209 ENSG00000259182 ENSG00000290824 ENSG00000290355 
          12235           12488           12536           12561           12562 
ENSG00000261617 ENSG00000069764 ENSG00000224712 ENSG00000166780 ENSG00000185164 
          12899           12966           12968           12987           13032 
ENSG00000184110 ENSG00000183336 ENSG00000181625 ENSG00000169627 ENSG00000132207 
          13212           13249           13250           13296           13297 
ENSG00000261052 ENSG00000090857 ENSG00000157335 ENSG00000157429 ENSG00000196436 
          13299           13812           13813           13843           13896 
ENSG00000140839 ENSG00000197943 ENSG00000260279 ENSG00000197912 ENSG00000286190 
          13897           14007           14158           14163           14372 
ENSG00000171916 ENSG00000273018 ENSG00000154016 ENSG00000233098 ENSG00000126861 
          14704           14706           14727           14803           14993 
ENSG00000274808 ENSG00000274512 ENSG00000273709 ENSG00000263715 ENSG00000120088 
          15116           15158           15377           15479           15481 
ENSG00000228696 ENSG00000238083 ENSG00000108379 ENSG00000263293 ENSG00000267452 
          15491           15496           15501           15518           15657 
ENSG00000226364 ENSG00000287664 ENSG00000267280 ENSG00000136488 ENSG00000227036 
          15659           15769           15801           15867           15991 
ENSG00000234721 ENSG00000175711 ENSG00000262352 ENSG00000173213 ENSG00000267712 
          16008           16331           16340           16342           16845 
ENSG00000266256 ENSG00000188629 ENSG00000213999 ENSG00000267767 ENSG00000205076 
          16982           17431           17840           18091           18269 
ENSG00000178934 ENSG00000231924 ENSG00000160336 ENSG00000204577 ENSG00000267710 
          18270           18438           18952           18988           19098 
ENSG00000267858 ENSG00000236790 ENSG00000228999 ENSG00000222005 ENSG00000234690 
          19236           19305           19426           19734           19745 
ENSG00000115239 ENSG00000228590 ENSG00000271889 ENSG00000186854 ENSG00000261600 
          19775           19822           19838           20124           20224 
ENSG00000291150 ENSG00000144015 ENSG00000186281 ENSG00000015568 ENSG00000144063 
          20225           20264           20270           20443           20450 
ENSG00000290614 ENSG00000136698 ENSG00000236107 ENSG00000213160 ENSG00000237298 
          20666           20668           20910           20936           21063 
ENSG00000231646 ENSG00000185176 ENSG00000233806 ENSG00000261186 ENSG00000088298 
          21111           21735           21773           21779           22215 
ENSG00000168746 ENSG00000215440 ENSG00000228340 ENSG00000274333 ENSG00000275496 
          22355           22556           22579           22684           22691 
ENSG00000280145 ENSG00000280018 ENSG00000276077 ENSG00000278932 ENSG00000274391 
          22696           22699           22701           22730           22741 
ENSG00000156273 ENSG00000205670 ENSG00000222018 ENSG00000243627 ENSG00000221398 
          22835           22920           22921           22923           22926 
ENSG00000159216 ENSG00000238141 ENSG00000184012 ENSG00000160200 ENSG00000160201 
          22931           22995           23017           23051           23052 
ENSG00000160202 ENSG00000188660 ENSG00000234289 ENSG00000160223 ENSG00000128383 
          23055           23059           23062           23079           23789 
ENSG00000197182 ENSG00000144455 ENSG00000227110 ENSG00000231304 ENSG00000280739 
          23965           24069           24089           24274           24434 
ENSG00000186448 ENSG00000174840 ENSG00000184220 ENSG00000249846 ENSG00000243620 
          24508           24802           25018           25373           25544 
ENSG00000232832 ENSG00000233136 ENSG00000248933 ENSG00000250634 ENSG00000261761 
          26136           26329           26332           26370           26476 
ENSG00000250546 ENSG00000260641 ENSG00000245293 ENSG00000249509 ENSG00000196951 
          26894           26987           27062           27113           27279 
ENSG00000251600 ENSG00000151611 ENSG00000251249 ENSG00000198948 ENSG00000215156 
          27303           27322           27365           27487           27894 
ENSG00000205572 ENSG00000205571 ENSG00000172062 ENSG00000145736 ENSG00000245526 
          28185           28186           28200           28203           28396 
ENSG00000250331 ENSG00000120709 ENSG00000247199 ENSG00000249738 ENSG00000248469 
          28479           28846           29065           29198           29329 
ENSG00000278970 ENSG00000248275 ENSG00000112679 ENSG00000145979 ENSG00000228223 
          29465           29483           29496           29665           29836 
ENSG00000285761 ENSG00000231074 ENSG00000241370 ENSG00000146112 ENSG00000244355 
          29975           30012           30018           30035           30117 
ENSG00000244731 ENSG00000204314 ENSG00000237541 ENSG00000204209 ENSG00000124713 
          30145           30155           30177           30221           30397 
ENSG00000146147 ENSG00000065615 ENSG00000111850 ENSG00000228624 ENSG00000112541 
          30556           30743           30774           30982           31459 
ENSG00000155026 ENSG00000105889 ENSG00000196683 ENSG00000290758 ENSG00000152926 
          31647           31776           31782           32060           32238 
ENSG00000262461 ENSG00000274570 ENSG00000286014 ENSG00000178809 ENSG00000186645 
          32338           32340           32400           32405           32448 
ENSG00000105793 ENSG00000272752 ENSG00000223646 ENSG00000004866 ENSG00000224897 
          32527           32686           32874           32905           32960 
ENSG00000273297 ENSG00000234352 ENSG00000230549 ENSG00000236125 ENSG00000215372 
          33081           33110           33525           33526           33528 
ENSG00000255251 ENSG00000285765 ENSG00000285950 ENSG00000255378 ENSG00000186562 
          33530           33531           33538           33539           33540 
ENSG00000237038 ENSG00000225327 ENSG00000254229 ENSG00000248538 ENSG00000254866 
          33542           33543           33546           33566           33624 
ENSG00000145002 ENSG00000228801 ENSG00000076554 ENSG00000260493 ENSG00000248690 
          33635           34035           34256           34297           34594 
ENSG00000253438 ENSG00000229140 ENSG00000181135 ENSG00000181404 ENSG00000180071 
          34653           34667           34785           34865           35273 
ENSG00000287838 ENSG00000284116 ENSG00000290718 ENSG00000291075 ENSG00000283378 
          35281           35293           35296           35305           35334 
ENSG00000238113 ENSG00000290971 ENSG00000260691 ENSG00000196873 ENSG00000234394 
          35348           35396           35407           35409           35412 
ENSG00000224958 ENSG00000158169 ENSG00000225194 ENSG00000271086 ENSG00000234323 
          35414           35677           35691           35760           35815 
ENSG00000286112 ENSG00000233016 ENSG00000273730 ENSG00000277626 ENSG00000277641 
          36096           36257           36343           36370           36371 
ENSG00000288357 ENSG00000288487 ENSG00000288206 ENSG00000233192 ENSG00000257473 
          36400           36401           36402           36511           36531 
ENSG00000227046 ENSG00000223793 ENSG00000206206 ENSG00000231823 ENSG00000206301 
          36535           36549           36554           36571           36591 
ENSG00000231617 ENSG00000225103 ENSG00000206279 ENSG00000231526 ENSG00000229396 
          36596           36614           36619           36638           36643 
ENSG00000268009 ENSG00000269791 ENSG00000236362 ENSG00000189064 ENSG00000179304 
          37062           37063           37117           37118           37154 
ENSG00000268350 ENSG00000158301 ENSG00000269226 ENSG00000101883 ENSG00000267978 
          37156           37497           37543           37678           37892 
ENSG00000221867 ENSG00000268902 ENSG00000197172 ENSG00000235961 ENSG00000063587 
          37930           37931           37935           37942           37947 
ENSG00000224533 ENSG00000099715 ENSG00000114374 ENSG00000291016 
          38029           38044           38058           38073

So what do we do here? And why do we have this problem? An answer from the Ensembl Help Desk is that this occurs when we cannot choose a perfect match; ie when we have two good matches, but one does not appear to match with a better percentage than the other. In that case, we assign both matches. What we will do is choose to keep the first identifier for these multiple mapping cases.

annotations_ahb$entrezid <- map(annotations_ahb$entrezid,1) %>%  unlist()

Let’s take a look and see how many of our Ensembl identifiers have an associated gene symbol, and how many of them are unique:

sum(is.na(annotations_ahb$gene_name))

[1] 0

sum(duplicated(annotations_ahb$gene_name))

[1] 11482

Let’s identify the non-duplicated genes and only keep the ones that are not duplicated:

# Determine the indices for the non-duplicated genes
non_duplicates_idx <- which(duplicated(annotations_ahb$gene_name) == FALSE)

# How many rows does annotations_ahb have?
annotations_ahb %>% nrow()

[1] 38076

# Return only the non-duplicated genes using indices
annotations_ahb <- annotations_ahb[non_duplicates_idx, ]

# How many rows are we left with after removing?
annotations_ahb %>% nrow()

[1] 26594

Finally, it would be good to know what proportion of the Ensembl identifiers map to an Entrez identifier:

# Determine how many of the Entrez column entries are NA
which(is.na(annotations_ahb$entrezid)) %>% length()

[1] 7260

For the different steps in the functional analysis, we require Ensembl and Entrez IDs. We will use the gene annotations that we generated previously to merge with our differential expression results.

## Merge the AnnotationHub dataframe with the results 
res_ids_ahb <- inner_join(res_tableCont_tb, annotations_ahb, by=c("gene"="gene_id"))

Using AnnotationHub to create a tx2gene file

If you wish to easily translate between genes and transcripts (this depends on how the count matrix was computed), it would be wise to create a ‘transcript to gene’ translation file (or tx2gene file). In our case, we already have one tx2gene table generated by our pipeline, but sometimes you may not have access to an tx2gene file already, so it is useful to learn how to create such a file To create it, we would need to use a combination of the methods above and merge two dataframes together. For example:

## DO NOT RUN THIS CODE

# Create a transcript dataframe
 txdb <- transcripts(human_ens, return.type = "data.frame") %>%
   dplyr::select(tx_id, gene_id)
 txdb <- txdb[grep("ENST", txdb$tx_id),]
 
 # Create a gene-level dataframe
 genedb <- genes(human_ens, return.type = "data.frame")  %>%
   dplyr::select(gene_id, gene_name)
 
 # Merge the two dataframes together
 annotations <- inner_join(txdb, genedb)

Annotables package

The annotables package is a super easy annotation package to use. It is not updated frequently, so it’s not great for getting the most up-to-date information for the current builds and does not have information for other organisms than human and mouse, but is a quick way to get annotation information, specially for older builds.

library(annotables)
grch38

We can see that the grch37 object already contains all the information we want in a super easy way. Let’s annotate the results of our shrunken DEA for our Control vs Vampirium comparison:

## Re-run this code if you are unsure that you have the right table
res_tableCont <- lfcShrink(dds, coef = "condition_control_vs_vampirium")
res_tableCont_tb <- res_tableCont %>%
    data.frame() %>%
    rownames_to_column(var="gene") %>% 
    as_tibble()

## Return the IDs for the gene symbols in the DE results
ids <- grch38 %>% dplyr::filter(ensgene %in% rownames(res_tableCont))

## Merge the IDs with the results 
res_ids <- inner_join(res_tableCont_tb, ids, by=c("gene"="ensgene"))

head(res_ids)

Our data is now ready to use for functional analysis! We have all the ids necessary to proceed.

Exercise 1

Create a new res_ids object using the annotables package with the human build grch37. NOTE call it res_ids_grch37!
What are the differences between the res_id_ahbobject and the res_ids_grch37?

Exercise 2

Annotate the results of your DEA for Garlicum vs Vampirium with grch38.