Revision as of 13:58, 20 November 2007

The Reference Genome Annotation Project

Introduction

With more and more genomes being sequenced, we are in the middle of an explosion of genomic information. Because resources for manually annotating the growing number of sequenced genomes with functions are limited, automatic annotation will be the method of choice for many groups. Since many model organism databases have a group of trained and highly skilled GO curators, the GO Consortium has coordinated an effort to maximize and optimize the GO annotation of a large and representative set of key genomes ('reference genomes'). The goal of this project is to completely annotate 12 reference genomes so that annotations from this effort can be used to seed the automatic annotation of other genomes.

The reference genomes are:

Arabidopsis thaliana (http://www.arabidopsis.org/)
Caenorhabditis elegans (http://www.wormbase.org/)
Danio rerio (zebrafish; http://zfin.org)
Dictyostelium discoideum (http://www.dictybase.org/)
Drosophila melanogaster (http://flybase.org/)
Escherichia coli (http://www.tigr.org/)?
Gallus gallus (http://www.agbase.msstate.edu/)
Homo sapiens (http://www.ebi.ac.uk/GOA/human_release.html)?
Mus musculus (http://www.informatics.jax.org/)
Rattus norvegicus (http://rgd.mcw.edu/)
Saccharomyces cerevisiae (http://www.yeastgenome.org/)
Schizosaccharomyces pombe (http://www.genedb.org/genedb/pombe/)

The Reference Genome GO Annotation Team, with representatives from each genome annotation group, coordinates annotation, facilitates implementation of GO Consortium annotation priorities, and provides quantitative measures to assess progress toward the goal of broad and deep annotation of the reference genomes. This group represents the annotation expertise within the GO Consortium and provides key liaisons to the model organism databases that have primary responsibility for the annotation of the reference genomes.

Priorities for Annotation

Our ultimate aim is to provide comprehensive GO annotation for all gene products in each of the reference genomes. This is a huge task and requires prioritizing curation targets. Our initial annotation efforts (Aug 2006 to Sept 2007) focused on orthologs of human disease genes, but in Oct 2007 we widened our list to four priority areas:

Orthologs of human disease genes
Topical or ‘hot’ genes
Genes conserved from E. coli to human but currently lacking GO annotation
Genes involved in biochemical/signalling pathways

Each month we curate 5 genes from each category as selected by one of the participating databases on a rotational basis.

Overview of project strategy

Every month each database curates the same set of 20 genes from our priority list. Working on the same genes together promotes cross-organism discussion about annotations and frequently leads to new terms being added to the Gene Ontology.
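The monthly numbers above follow from four priority areas with five genes each. A minimal sketch of that arithmetic, together with one possible reading of "on a rotational basis" (a simple round-robin over the participating databases, which is an assumption; the database names and the function names are hypothetical placeholders):

```python
# Priority areas as listed in this draft.
CATEGORIES = [
    "Orthologs of human disease genes",
    "Topical or 'hot' genes",
    "Genes conserved from E. coli to human but currently lacking GO annotation",
    "Genes involved in biochemical/signalling pathways",
]
GENES_PER_CATEGORY = 5


def monthly_target_count():
    """Number of genes every database curates in one month."""
    return len(CATEGORIES) * GENES_PER_CATEGORY


def selecting_database(databases, month_index):
    """Assumed round-robin: the database that picks targets in a given month.

    month_index is 0-based; the ordering of `databases` is hypothetical.
    """
    return databases[month_index % len(databases)]


if __name__ == "__main__":
    dbs = ["db_A", "db_B", "db_C"]  # placeholder names, not the real rota
    print(monthly_target_count())
    print([selecting_database(dbs, m) for m in range(5)])
```

The actual rota may rotate per category rather than per month; the sketch only illustrates how the count of 20 arises.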

Curation process summary:

Where they exist, identify the ortholog(s)/homolog(s) of the selected target genes in each species.
Enter the gene identifiers in a shared spreadsheet so that all curators can see the set of genes being curated.
Collect and annotate available literature about the genes.
Assign GO terms based on experimental data.
Review existing GO annotations to make sure they conform to agreed standards.
Record in the shared spreadsheet that GO annotation is considered comprehensive for each gene.
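The bookkeeping in the steps above can be pictured as one spreadsheet row per target gene. The following is only an illustrative sketch: the `TargetRow` class, its field names, and the gene identifiers are hypothetical and do not describe the actual shared spreadsheet.

```python
from dataclasses import dataclass, field


@dataclass
class TargetRow:
    """One row of a hypothetical tracking sheet for a target gene."""
    target_gene: str
    orthologs: dict = field(default_factory=dict)      # species -> gene identifier
    comprehensive: dict = field(default_factory=dict)  # species -> True/False

    def record_ortholog(self, species, gene_id):
        """Enter the identifier of the ortholog found in a species."""
        self.orthologs[species] = gene_id
        self.comprehensive.setdefault(species, False)

    def mark_comprehensive(self, species):
        """Record that annotation is considered comprehensive for a species."""
        if species not in self.orthologs:
            raise KeyError(f"no ortholog recorded for {species}")
        self.comprehensive[species] = True

    def pending(self):
        """Species with an ortholog whose annotation is not yet comprehensive."""
        return sorted(s for s, done in self.comprehensive.items() if not done)


if __name__ == "__main__":
    row = TargetRow("POLA")
    row.record_ortholog("mouse", "example-id-1")  # identifiers are made up
    row.record_ortholog("yeast", "example-id-2")
    row.mark_comprehensive("mouse")
    print(row.pending())
```

Species without an ortholog simply have no entry, which matches the "where they exist" wording of the first step.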

A web tool for reference genome annotation is under development. This will help curators to track and compare annotations, thus streamlining the annotation process.

How does this project differ from standard GO annotation?

The reference genome databases have agreed to follow guidelines that are more stringent than those used for standard annotation:

Experimental evidence codes (IDA, IMP, IGI, IPI, IEP) should be used where possible
Terms inferred from sequence and structural similarity (ISS) should only be used where the terms are supported by experimental evidence for the similar sequence
Non-traceable author statements (NAS) should be avoided
No new annotations should be based on traceable author statements (TAS); existing terms assigned with TAS should gradually be replaced with the appropriate experimental evidence code based on the primary literature
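The four guidelines above can be read as a small decision procedure over evidence codes. Here is a hedged sketch of that reading; the function name, its parameters, and the returned messages are hypothetical, and the fallback for codes not mentioned in the guidelines is an assumption.

```python
# Experimental evidence codes named in the guidelines.
EXPERIMENTAL_CODES = {"IDA", "IMP", "IGI", "IPI", "IEP"}


def review_annotation(evidence_code, is_new=True, similar_has_experimental=False):
    """Return a short verdict for one annotation under the stricter guidelines.

    is_new: whether the annotation is being created now (vs. already existing).
    similar_has_experimental: for ISS only, whether the term is supported by
        experimental evidence for the similar sequence.
    """
    if evidence_code in EXPERIMENTAL_CODES:
        return "ok: experimental evidence"
    if evidence_code == "ISS":
        if similar_has_experimental:
            return "ok: ISS backed by experimental evidence"
        return "reject: ISS needs experimental evidence for the similar sequence"
    if evidence_code == "NAS":
        return "avoid: non-traceable author statement"
    if evidence_code == "TAS":
        if is_new:
            return "reject: no new TAS annotations"
        return "replace: move to an experimental code from the primary literature"
    # Assumption: codes the guidelines do not mention are left for manual review.
    return "review: code not covered by these guidelines"


if __name__ == "__main__":
    for code in ("IDA", "ISS", "NAS", "TAS"):
        print(code, "->", review_annotation(code))
```

Keeping `is_new` separate reflects that existing TAS annotations are to be replaced gradually rather than rejected outright.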

How do we know when GO annotation is comprehensive?

The amount of literature per gene varies widely. Where possible we review every paper about a given gene and capture all possible GO terms, but this is only feasible when there are tens of papers. For genes associated with hundreds or even thousands of publications we cannot read all of the papers, so we do our best to prioritise the literature and capture all aspects of the gene with GO terms. In this situation we often work from recent reviews to lead us to key experimental papers. Users are encouraged to notify us if we have failed to capture some aspect of a specific gene [go help - or do we need a separate forum?].

In some cases, there is no experimental data for any of the reference genome species but experimental data may be available in other model systems; in these cases we submit GO annotations for the relevant species to GOA [link] so that this information is captured from the primary literature.

Where can GO annotations from the project be viewed?

All GO annotations from this project are included in the [gene association files] that each group submits to GO. Annotations can also be viewed using [AmiGO].

It is also possible to specifically view the reference genome effort. A list of all annotated genes linking to colorful graphics can be viewed in full here:
http://www.geneontology.org/images/RefGenomeGraphs/

Each curated reference gene links to one graph. In addition to the graph, each page includes two tables: one comparing organism annotations for each term (rows are GO terms, columns are organisms), and one showing the full experimental annotations in each organism for the given gene. This facilitates comparison of the curation status in the 12 reference genomes and helps curators to identify genes that need attention.
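The first of these tables is essentially a pivot of the annotations for one gene. A minimal sketch of that layout, assuming annotations are available as (organism, GO term, evidence code) tuples; the input format, the function name, and the term labels are hypothetical and are not taken from the actual graphs pages.

```python
from collections import defaultdict


def comparison_table(annotations, organisms):
    """Pivot annotations into {GO term: {organism: [evidence codes]}}.

    Every listed organism gets a column for every term, so empty cells
    (organisms lacking the term) are visible at a glance.
    """
    table = defaultdict(lambda: {org: [] for org in organisms})
    for organism, term, evidence in annotations:
        table[term].setdefault(organism, []).append(evidence)
    return dict(table)


if __name__ == "__main__":
    # Placeholder data: term labels and evidence assignments are invented.
    annotations = [
        ("mouse", "term_A", "IDA"),
        ("yeast", "term_A", "IMP"),
        ("mouse", "term_B", "IPI"),
    ]
    table = comparison_table(annotations, ["mouse", "yeast"])
    for term, row in sorted(table.items()):
        print(term, row)
```

An empty list in a cell marks an organism with no annotation to that term, which is the kind of gap the comparison is meant to expose.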

Note from Val: The pombe annotations aren't showing here. I'll report this to Mary, but can this be updated before it's made public?

Partial Graph of Gene POLA

Concluding Remarks

This project aims to improve annotations across a wide range of organisms. The resulting high-quality annotations should improve the electronic annotations that propagate from this resource and will facilitate cross-species functional comparison. Furthermore, easy comparison of annotations between organisms can lead to new hypotheses and so inspire new research.

Activities

(this is probably not necessary for the public site?)

Monthly Conference Calls

First Reference Genome Annotation Meeting, Princeton, NJ, Sept 26, 27, 2007

Future plans

Discuss orthology determination here or wait until we have a standardised procedure?
Provide a section about 'metrics', or is that rather just for internal use? (Showing a graph or part of the graph from Chris that's on the [ref genome minutes page] might be attractive.)
We discussed at the meeting having a 'comments' form or something. There should be a link to a pop-up mail box or so.
Do we need the Activities section at all?


Ref Gen pub draft (Retired): Difference between revisions