Beginning Annotation SOP

From GO Wiki
Revision as of 10:03, 14 November 2006 by Jclark (talk | contribs)


This page is a place to build the SOP on beginning annotation.

These are the guidelines that the PIs gave me:

The initial step would be to create a document that outlines the annotation process. In addition, a case study, such as how the Chicken genome came to be annotated—what order events happened in, how the timing worked, what software was used, how they interacted with their GOA mentors, and so on—would be very useful.

As for the documentation, you might start with this outline:

First, a brief statement about how the annotation process starts once the genes or gene products are defined (i.e. unique, stable identifiers from UniProt or RefSeq are available for their sequences). Then, the document should include steps for doing GO annotations by various methods, including automated methods, such as the InterProScan approach or incorporating experimentally based annotations of orthologs, and curated methods, such as assigning literature-based (experimental) GO annotations. The document should provide pointers to the existing documentation wherever possible. Thirdly, there should be information on the gene association file format and how to submit.

Once this documentation [essentially a 'standard operating procedure' not a detailed how-to] is defined, it can then be used to frame the inquiries of annotation groups and to support these groups in many contexts.

SOP for starting annotation

What is GO annotation?

This document gives a brief introduction to the procedures involved in annotating gene products to the Gene Ontology.

The Gene Ontology is a system for categorizing gene products according to the cellular locations and biological processes in which they act, and the molecular functions that they carry out. (See figure 1 below.)

[Figure 1] (From Clark et al., 2005)

The Gene Ontology is written to accommodate the annotation of gene products from all species. This enables scientists to look at the annotations to a single GO term and find out about related research in a range of species. (See figure 2 below.) Terms are added to the ontologies as they are required for annotation, so if you find that some of your favourites are missing, please let us know and we will add them.

[Figure 2] (From Clark et al., 2005)

What do I need to start annotation?

There are several starting points for GO annotation, but you should have at least one of the following:

  • Publications showing information about your gene products and the cellular locations or biological processes in which they act, or the molecular functions that they carry out.
  • A DNA, protein or RNA sequence.
  • A whole genome sequence.

What do I gain by annotating my genes?

There are a number of benefits to be derived from annotating gene products to the Gene Ontology.

  • You can annotate your whole genome to get an overview of the proportions of gene products involved in each general process:


  • You can display your microarray results grouped according to GO categories for more meaningful conclusions.

(No pic yet)
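As a rough illustration of the genome overview described above, the following Python sketch counts what fraction of annotated gene products falls under each high-level process term. The gene product IDs and term assignments are hypothetical toy data, and the function name `process_proportions` is my own; in practice the pairs would come from your annotation set after mapping to high-level terms.

```python
from collections import defaultdict

# Hypothetical toy annotations: (gene product ID, high-level GO process term).
# Neither the IDs nor the assignments come from a real dataset.
annotations = [
    ("GP1", "metabolic process"),
    ("GP1", "transport"),
    ("GP2", "metabolic process"),
    ("GP3", "cell communication"),
    ("GP4", "metabolic process"),
]


def process_proportions(pairs):
    """Return {term: fraction of annotated gene products assigned to that term}."""
    products_by_term = defaultdict(set)
    all_products = set()
    for product, term in pairs:
        products_by_term[term].add(product)
        all_products.add(product)
    total = len(all_products)
    # A gene product can carry several terms, so the fractions need not sum to 1.
    return {term: len(products) / total for term, products in products_by_term.items()}


proportions = process_proportions(annotations)
for term, fraction in sorted(proportions.items()):
    print(f"{term}: {fraction:.0%}")
```

Counting distinct gene products per term, rather than annotation rows, keeps a heavily annotated gene product from inflating a category.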

How do I make annotations?

There are a number of different ways to start GO annotation, depending on your computing and manpower resources and on the objects that you wish to annotate.

Automatic Annotation

A. Automatic GO annotation tools

There are several GO-related annotation tools that have been developed by many groups. Look at the annotation tools on the GO website:

Please write to the GO-Friends mailing list if you have specific annotation tool needs. All the tool developers are there and will help you to choose a good tool, or may modify a tool to include the functionality that you need. Mail go-friends at

B. Automatic annotation based on GO mapping files and GO-annotated protein datasets for those users with database infrastructure in place.

1) sequence-based methods

BLAST2GO - Try BLAST2GO to find GO annotations to sequences similar to yours.
Use GOst

2) Domain-based comparison methods

InterProScan - Run your sequences through InterProScan, either online or by downloading and running the application on your own computer.

3) other

keyword2go - a mapping of Swiss-Prot keywords to GO
ec2go - a mapping of EC numbers to GO
See the additional mappings page on the GO website (link here).
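To show how a mapping file such as ec2go might be used, here is a minimal Python sketch that reads mapping lines into a lookup table. It assumes the usual layout of these files as I understand it: comment lines starting with `!`, and entries of the form `external ID > GO:term name ; GO:id`. The sample lines are illustrative and should be checked against the current mapping file. The function name `parse_mapping_lines` is hypothetical.

```python
def parse_mapping_lines(lines):
    """Parse external-ID-to-GO mapping lines into {external_id: [GO IDs]}."""
    mapping = {}
    for raw in lines:
        line = raw.strip()
        # Skip blank lines, '!' comment lines, and anything without a mapping arrow.
        if not line or line.startswith("!") or " > " not in line:
            continue
        external, targets = line.split(" > ", 1)
        # Some mapping files put a descriptive name after the ID; keep the ID only.
        external_id = external.split()[0]
        for target in targets.split(" > "):
            if ";" not in target:
                continue
            # Each target is assumed to look like "GO:term name ; GO:0000000".
            go_id = target.rsplit(";", 1)[1].strip()
            mapping.setdefault(external_id, []).append(go_id)
    return mapping


# Illustrative sample in the assumed format (not copied from a release file).
sample = [
    "!Example header comment",
    "EC:1.1.1.1 > GO:alcohol dehydrogenase (NAD) activity ; GO:0004022",
    "EC:2.7.1.1 > GO:hexokinase activity ; GO:0004396",
]

mapping = parse_mapping_lines(sample)
print(mapping)
```

Once loaded, any gene product that already carries an EC number can be given the corresponding GO ID as an automatic annotation.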

C. Annotation Services for those without a pre-existing database infrastructure:

1) TIGR's Annotation Engine Service for prokaryotic genomes. This free service provides automatic annotation and database infrastructure to anyone with a prokaryotic DNA sequence they wish to annotate. More information can be found at:

In addition, TIGR offers a 3-day Prokaryotic Annotation and Analysis course. This course complements the Annotation Engine service by providing detailed information on TIGR's annotation pipeline and on the use of the free manual annotation tool Manatee. More information can be found at:

2) GenDB

3) The public repositories

  • EBI Submissions, including EMBL-Bank
  • GenBank Submissions
  • DDBJ Submissions

(EMBL-Bank, GenBank and DDBJ exchange data amongst themselves, so you can use any of these submission interfaces and have your data appear in all three resources.) Once your sequences have been processed and passed along the pipeline to all the related databases, you will be able to retrieve:
a) Unique stable identifiers, such as UniProt or RefSeq identifiers, for your sequences.
b) Some level of automatic GO annotation to your sequences. (Publicly available.)
(I am hoping that I can insert a picture here that shows where sequences go in and where annotated sequences come out so that people can choose their favourite provider and be sure that they know what they are getting and what they are missing out on by choosing to download at that point.)

D. GOblet

Manual Annotation

A. Literature based manual annotation - Check and improve your annotations against the literature.

1) Read the manual annotation guidelines on the GO Consortium website.

2) Contact the GO Consortium to ask about annotation camps and mentoring.
[insert description of camps and mentoring]

B. Sequence based manual annotation.

The process of manual annotation based on sequence similarity involves the manual review of a host of sequence-based search data, including BLAST-type searches, domain-based searches (InterPro, Pfam, TIGRFAMs, PROSITE, etc.), SignalP, TMHMM, paralogous families, COGs, etc. The annotator evaluates this information by looking at alignments, scores, etc., while taking into consideration the genomic context of the gene product being annotated, including neighboring genes, possible operons, syntenic regions, pathway and system reconstruction, etc.

PAMGO example

Here is an example of how a new group has started working with the GO Consortium.

The Plant-Associated Microbe Gene Ontology (PAMGO) Group

In 2003 the genome sequence of the tomato pathogen Pseudomonas syringae pv. tomato DC3000 was published. This project was a collaboration between Robin Buell at The Institute for Genomic Research (TIGR) and Alan Collmer of Cornell University. As part of the annotation of P. syringae, TIGR provided some GO assignments to the P. syringae proteins. Discussion between Alan Collmer and Brett Tyler at the NSF Plant Genome Research Program Awardees Meeting that fall revealed a shared awareness of the potential power of the GO and led to the formation of the Plant-Associated Microbe Gene Ontology (PAMGO) working group. Brett Tyler coordinated the effort to bring together PIs from genome projects representing the major groups of microbial pathogens: Bacteria, Fungi, Oomycetes, and Nematodes. The PAMGO group recognized the potential power of the GO to greatly facilitate research in areas common to all these pathogens by providing a robust framework for comparing functions across species. Since TIGR is a member of the GO Consortium, the new PAMGO group entered into collaboration with TIGR staff Michelle Gwinn-Giglio and Linda Hannick to develop terms specific for interactions between pathogens and their hosts.

During 2004 the PAMGO Interest Group worked to develop high level terms to describe processes relevant to plant-microbe associations, which would provide a framework for the later development of more detailed terms. Candace Collmer (Wells College), while on sabbatical leave, and Michelle Gwinn-Giglio (TIGR) led the effort. This activity began with a full-day workshop of all the PAMGO participants on April 23, 2004 at TIGR. The workshop participants defined a set of high level terms and relationships that would be as general as possible, not only for pathogens of all kingdoms, but for the whole range of host-microbe interactions from mutualism to parasitism, and for all hosts, not only plants. Further refinement of the terms and their definitions occurred by email, and on June 2, 2004 the proposed terms were submitted to the GO community for discussion. On Aug 22-23, 2004 Candace and Alan Collmer and Michelle Gwinn-Giglio presented the proposal at a GO content meeting focused on pathogenesis, metabolism, and the cell cycle at the Carnegie Institution, Stanford, CA, and on Oct. 15-16, 2004 Michelle Gwinn-Giglio presented three modified options to a GO Consortium Meeting in Chicago. These high level terms generated much debate, both at the original workshop and within the wider GO community, because of the varied ways in which different communities use words such as "Symbiosis" and "Pathogenesis", and the difficulty of defining the term "Pathogenesis" consistently, given that some organisms may or may not cause disease depending on the physical environment and the physiological or genetic status of the host. This discussion highlighted the varied usage of these terms and stimulated user communities to think about how these terms should be used. A final version was agreed upon and resubmitted to GO on Dec 14, 2004 and made part of the active ontologies on Jan 31, 2005.

In addition to the term development activities in 2004, the PAMGO group was also busy writing a grant to the NSF/USDA Microbial Genome Sequencing Program to fund their GO development work. Fortunately, the grant was awarded and provides 3 years of funding (Fall 2005-Fall 2008) for PAMGO to continue the development of more granular terms under the initial PAMGO term set. The PAMGO group is now actively working on terms that will describe the myriad ways that pathogens affect the metabolism of their hosts. A PAMGO jamboree was held in July 2006 where more than 100 new terms were developed.

Using the PAMGO terms, as well as the rest of the GO ontologies, PAMGO annotators are assigning GO terms to the proteins from the PAMGO organisms that have a role in interacting with their hosts. It is anticipated that PAMGO will begin sending in association files of these annotations at the end of this year.

PAMGO people and pathogens:

Virginia Bioinformatics Institute

Phytophthora sojae (Oomycete)
Phytophthora ramorum (Oomycete)
Brett Tyler
Trudy Torto-Alalibo
Marcus Chibucos
Rays Jiang

Agrobacterium tumefaciens (Bacterium)
Joao Setubal
Joshua Shallom
Tsai-Tien Tseng

Cornell University

Pseudomonas syringae pv. tomato DC3000 (Bacterium)
Pseudomonas syringae pv. phaseolicola 1448A (Bacterium)
Pseudomonas syringae pv. syringae B728A (Bacterium)
Alan Collmer
Magdalen Lindeberg
Candace Collmer (Wells College, September-May)

University of Wisconsin

Erwinia chrysanthemi 3937 (Bacterium)
Nicole Perna
Jeremy Glasner
Bryan Biehl

North Carolina State University

Magnaporthe grisea (Fungus)
Meloidogyne hapla (Nematode)
Ralph Dean
David Bird
Thomas Mitchell
Shaowu Meng

The Institute for Genomic Research

Michelle Gwinn-Giglio
Linda Hannick
Robin Buell
Owen White

Chicken DB - AgBase example

An initial contact by an individual interested in using the GO for annotation of chicken genes was made at the GO Consortium/Users Meeting in October 2004. Follow-up emails with Steve Burgess of Mississippi State University (PI of the project) set up a visit by Fiona McCarthy to MGI at the Jackson Laboratory in Bar Harbor, ME. Over a two-week period (March 7-18, 2005), the curator from the Chicken Database was introduced to the Gene Ontology and to other aspects of the design of a model organism database. A rough outline of the topics covered follows:

1. MGI User Support gave an introduction to MGI; the curator worked through the "User Support Informatics Workshop" and became familiar with MGI.

2. A review of the Mouse Gene Nomenclature documentation, followed by a short nomenclature workshop. Mouse and Human gene nomenclature are co-ordinated and it is a goal to apply this co-ordination to other emerging vertebrate genomes.

3. Review of the GO homepage contents, with emphasis on available annotation documentation.

4. Introduction to Ontologies: Gene Ontology and other controlled vocabularies, including an introduction to GO, the Mouse Anatomical Dictionary, and the Mouse Phenotype Ontology.

5. GO Content, including Gene Ontology content and structure, anatomical representation in the GO, and changing the GO (OBO-Edit and SourceForge).

6. Curation documentation (curation for GO, gene expression, MGI phenotype and an introduction to the MGI editorial interface).

7. Data downloads and GO Automated annotation using UniProt and GO Translation Tables

8. Curation either by test implementation of large-scale strategies or by sequence similarities:

9. Introduction to Literature Selection and Manual Annotation. This included how MGI does literature triage. A plan was then drawn up for chicken paper literature searches and literature list creation so that GO annotation of chicken papers could begin.

10. Submitting GO annotation: the gene_association file format was explained.

Upon return to the home site, the curator set up an ongoing periodic review of annotations with a mentor at MGI. Chicken DB will submit reviewed annotations in a standard gene association file format to EBI. Once deemed “trained”, the curator would then instruct other members of their community.
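For readers who have not yet seen a gene association file, the following Python sketch assembles one tab-delimited annotation line. It assumes the 15-column layout as I understand it; the column names are paraphrased, the split into required and optional columns is my reading of the format, and every value in the example record (database name, identifiers, GO ID, reference) is a hypothetical placeholder. Check the format documentation on the GO website before submitting anything.

```python
# Column order of the 15-column gene association file, as assumed here.
GAF_COLUMNS = [
    "DB", "DB_Object_ID", "DB_Object_Symbol", "Qualifier", "GO_ID",
    "DB_Reference", "Evidence", "With_From", "Aspect", "DB_Object_Name",
    "DB_Object_Synonym", "DB_Object_Type", "Taxon", "Date", "Assigned_by",
]

# Columns treated as mandatory in this sketch; the rest may be left empty.
REQUIRED = {
    "DB", "DB_Object_ID", "DB_Object_Symbol", "GO_ID", "DB_Reference",
    "Evidence", "Aspect", "DB_Object_Type", "Taxon", "Date", "Assigned_by",
}


def format_gaf_line(record):
    """Return one tab-delimited annotation line, or raise if a required column is empty."""
    missing = sorted(col for col in REQUIRED if not record.get(col))
    if missing:
        raise ValueError("missing required columns: " + ", ".join(missing))
    return "\t".join(record.get(col, "") for col in GAF_COLUMNS)


# Hypothetical placeholder record; none of these values is a real annotation.
example_record = {
    "DB": "ExampleDB",
    "DB_Object_ID": "EX0001",
    "DB_Object_Symbol": "EXAMPLE1",
    "GO_ID": "GO:0000000",          # placeholder, not a real term
    "DB_Reference": "PMID:0000000",  # placeholder reference
    "Evidence": "IDA",               # inferred from direct assay
    "Aspect": "P",                   # P = process, F = function, C = component
    "DB_Object_Type": "protein",
    "Taxon": "taxon:9031",           # NCBI taxonomy ID for chicken
    "Date": "20061114",              # YYYYMMDD
    "Assigned_by": "ExampleDB",
}

line = format_gaf_line(example_record)
print(line)
```

Optional columns are emitted as empty strings so that every line keeps the same number of tab-separated fields.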