Beginning Annotation SOP

From GO Wiki

This page is a place to build the SOP on beginning annotation.

These are the guidelines that the PIs gave me:

The initial step would be to create a document that outlines the annotation process. In addition, a case study, such as how the Chicken genome came to be annotated—what order events happened in, how the timing worked, what software was used, how they interacted with their GOA mentors, and so on—would be very useful.

As far as the documentation is concerned, you might start with this outline:

First, a brief statement about how the annotation process starts once the genes or gene products are defined (i.e. unique, stable identifiers from UniProt or RefSeq are available for their sequences). Second, the document should include steps for doing GO annotations by various methods, including automated methods, such as the InterProScan approach or incorporating experimentally based annotations of orthologs, and curated methods, such as assigning literature-based (experimental) GO annotations. The document should provide pointers to the existing documentation wherever possible. Third, there should be information on the gene association file format and how to submit.

Once this documentation [essentially a 'standard operating procedure' not a detailed how-to] is defined, it can then be used to frame the inquiries of annotation groups and to support these groups in many contexts.

SOP for starting annotation

What is GO annotation?

This document gives a brief introduction to the procedures involved in annotating gene products to the gene ontology.

The Gene Ontology is a system for categorizing gene products according to the cellular locations and biological processes in which they act, and the molecular functions that they carry out. (See figure 1 below.) The process of categorizing gene products in this way is known as 'annotation'.

[Figure 1] (From Clark et al., 2005)

The Gene Ontology is written to accommodate the annotation of gene products from all species. This enables scientists to look at the list of gene products annotated to a single GO term (e.g. a process term) and find out about research into that process in a range of species. (See figure 2 below.) Terms are added to the ontologies as they are required for annotation, so if you find that some of your favourites are missing, please let us know and we will add them.

[Figure 2] (From Clark et al., 2005)

The paper below provides a detailed description of GO and annotation suitable for newcomers. It would be helpful to read this before reading the rest of this page.
Clark et al., 2005

What do I need to have to start annotation?

There are several starting points for GO annotation, but you should have at least one of the following:

  • Publications showing information about your gene products and the cellular locations or biological processes in which they act, or the molecular functions that they carry out.
  • A DNA, protein or RNA sequence, with the correct identifier for the species being annotated.
  • A whole genome sequence.

What do I gain by annotating my genes?

There are a number of benefits to be derived from annotating gene products to the Gene Ontology.

  • You can annotate your whole genome to get an overview of the proportions of gene products involved in each general process:


  • You can display your microarray results grouped according to GO categories for more meaningful conclusions.

(No pic yet)

[[4]](Wertheim et al., 2005)

  • (Evelyn) Your publication and data will be disseminated worldwide via existing annotation pipelines within the GO Consortium. For example, data submitted to GOA is disseminated to Entrez Gene, UniProtKB and Ensembl, as well as to more than 100 GO tools, some academic and some commercial. Much experimental research is paid for by taxpayers, and dissemination is likely to be an important grant deliverable.


  • This is what I say for GOA; perhaps you can also use some of these. (Evelyn)

GOA has significant applications for both internal and external users. Some examples of how GOA data can be used are:

  • Function prediction for individual protein sequences for bench scientists and biological database curators via GO tools or InterProScan (InterPro2GO)
  • Large-scale electronic functional classification for genome annotation projects
  • Automatic and manual annotation of UniProtKB/Swiss-Prot and TrEMBL
  • Protein function, process, component statistics for complete proteomes in Integr8
  • Analysis of large- and small-scale gene expression or mass spectrometry datasets
  • Validation of automated text-mining and natural language processing techniques for deriving information about gene function from the literature
  • To enhance dataset querying and functionality e.g. users can query IntAct and Ensembl via annotated GO terms or link to QuickGO browser
  • Provision of a fully GO annotated set of Human genes (GO Consortium Reference Genome Project) for comparative analysis
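The whole-genome overview mentioned among the benefits above amounts to counting, for each general category, the share of gene products annotated to it. The following is a minimal sketch under stated assumptions: the gene product names and category labels are hypothetical, and the input is assumed to be a list of (gene product, category) pairs that you have already derived from your annotations.

```python
from collections import defaultdict


def category_proportions(annotations):
    """Return the fraction of distinct gene products annotated to each category.

    annotations: iterable of (gene_product, category) pairs.
    A gene product may appear in several categories, so fractions can sum to more than 1.
    """
    products_by_category = defaultdict(set)
    all_products = set()
    for gene_product, category in annotations:
        products_by_category[category].add(gene_product)
        all_products.add(gene_product)
    total = len(all_products)
    if total == 0:
        return {}
    return {cat: len(prods) / total for cat, prods in products_by_category.items()}


# Hypothetical data: four made-up gene products and two made-up category labels.
example = [
    ("g1", "metabolism"),
    ("g2", "metabolism"),
    ("g3", "metabolism"),
    ("g3", "transport"),
    ("g4", "transport"),
]
print(category_proportions(example))
```

Counting distinct gene products rather than annotation rows keeps a heavily annotated gene product from inflating its category.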

How do I annotate to GO?

There are a number of different ways to start GO annotation, depending on your computing and manpower resources and on the objects that you wish to annotate.

Automatic Annotation

A. Automatic GO annotation tools

There are several GO-related annotation tools that have been developed by different groups. Look at the annotation tools on the GO website:

Please write to the GO-Friends mailing list if you have specific annotation tool needs. All the tool developers are there and will help you to choose a good tool, or may modify a tool to include the functionality that you need. Mail go-friends at

B. Automatic annotation based on GO mapping files and GO-annotated protein datasets for those users with database infrastructure in place.

1) sequence-based methods

BLAST2GO: You may like to try BLAST2GO to find GO annotations to sequences similar to yours.
Use GOst

2) Domain-based comparison methods

InterProScan: Run your sequences through InterProScan either online or by downloading and running the application on your own computer.
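Once a domain-based run has finished, the GO IDs usually need to be pulled out of the result file and grouped by sequence. The sketch below assumes a tab-separated result file in which the first column is the sequence identifier and the 14th column holds GO IDs. That layout is an assumption and varies between InterProScan versions and output options, so check the documentation for the version you run. The sample line is entirely made up.

```python
import re

GO_ID = re.compile(r"GO:\d{7}")


def go_terms_by_sequence(lines, go_column=13):
    """Collect GO IDs per sequence ID from tab-separated result lines.

    go_column is the zero-based index of the column assumed to hold GO IDs.
    """
    terms = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= go_column:
            continue  # this line has no GO column
        found = GO_ID.findall(fields[go_column])
        if found:
            terms.setdefault(fields[0], set()).update(found)
    return terms


# Hypothetical result line for a made-up sequence "seq1"; all values are placeholders.
sample = ["\t".join(["seq1", "md5", "120", "Pfam", "PF00000", "example domain",
                     "5", "80", "1e-10", "T", "01-01-2006", "IPR000000",
                     "example entry", "GO:0005515|GO:0003824"])]
print(go_terms_by_sequence(sample))
```

Matching GO IDs with a regular expression, rather than splitting on a fixed separator, tolerates small differences in how the GO column is delimited.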

3) other

keyword2go - a mapping of Swiss-Prot keywords to GO
ec2go - a mapping of EC numbers to GO
See the additional mappings page on the GO website (link here).
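Mapping-based annotation can be sketched in a few lines: read the mapping file into a lookup table, then translate each external identifier attached to a gene product into GO IDs. This sketch assumes mapping lines of the form "EC:1.1.1.1 > GO:term name ; GO:0004022", with comment lines starting with "!". Verify that layout against the mapping file you actually download. The gene product ID "MYDB:gene0001" is hypothetical.

```python
def parse_mapping(lines):
    """Parse 'EXTERNAL_ID > GO:term name ; GO:nnnnnnn' lines into {external_id: [go_ids]}."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("!"):
            continue  # skip blank lines and comments
        external_id, sep, go_part = line.partition(" > ")
        if not sep or " ; " not in go_part:
            continue  # skip lines that do not match the assumed layout
        go_id = go_part.rsplit(" ; ", 1)[1].strip()
        mapping.setdefault(external_id.strip(), []).append(go_id)
    return mapping


def annotate(ec_numbers_by_product, mapping):
    """Return (gene product, GO ID, evidence code) tuples; IEA marks electronic annotation."""
    rows = []
    for product, ec_numbers in ec_numbers_by_product.items():
        for ec in ec_numbers:
            for go_id in mapping.get(ec, []):
                rows.append((product, go_id, "IEA"))
    return rows


mapping_lines = [
    "! comment lines start with an exclamation mark",
    "EC:1.1.1.1 > GO:alcohol dehydrogenase (NAD+) activity ; GO:0004022",
]
mapping = parse_mapping(mapping_lines)
rows = annotate({"MYDB:gene0001": ["EC:1.1.1.1"]}, mapping)
print(rows)
```

The same two functions should work for a keyword-to-GO mapping, provided the file follows the same line layout.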

C. Annotation Services for those without a pre-existing database infrastructure:

1) TIGR's Annotation Engine Service for prokaryotic genomes. This free service provides automatic annotation and database infrastructure to anyone with a prokaryotic DNA sequence they wish to annotate. More information can be found at: In addition, TIGR offers a 3-day Prokaryotic Annotation and Analysis course. This course complements the Annotation Engine service in that it provides detailed information on TIGR's annotation pipeline and use of the free manual annotation tool Manatee. More info can be found at:

2) GenDB

3) The public repositories:
EBI Submissions, including EMBL-Bank
GenBank Submissions
DDBJ Submission
(EMBL-Bank, GenBank and DDBJ exchange data amongst themselves so you can use any of these submission interfaces and have your data appear in all three resources.) Once your sequences have been processed and passed along the pipeline to all the related databases you will be able to retrieve:
a) Unique stable identifiers, such as UniProt or RefSeq identifiers, for your sequences.
b) Some level of automatic GO annotation for your sequences. (Publicly available.)
(I am hoping that I can insert a picture here that shows where sequences go in and where annotated sequences come out so that people can choose their favourite provider and be sure that they know what they are getting and what they are missing out on by choosing to download at that point.)

Manual Annotation

A. Literature based manual annotation - Check and improve your annotations against the literature.

1) Read the manual annotation guidelines on the GO Consortium website.

2) Contact the GO Consortium to ask about annotation camps and mentoring.
[insert description of camps and mentoring]

Introduction to the process of manual annotation

There are two types of annotation that you are likely to want to make. The first is where a paper describes an experiment that shows some important information about a gene product and you want to capture that information as an annotation. Such experiments might include:

  • Enzyme assays
  • In vitro reconstitution (e.g. transcription)
  • Immunofluorescence (for cellular component)
  • Cell fractionation (for cellular component)
  • Physical interaction/binding
  • Transcript levels (e.g. Northerns, microarray data)
  • Protein levels (e.g. Western blots)
  • "Traditional" genetic interactions such as suppressors, synthetic lethals, etc.
  • Functional complementation
  • Rescue experiments
  • Inference about one gene drawn from the phenotype of a mutation in a different gene.
  • Any gene mutation/knockout
  • Overexpression/ectopic expression of wild-type or mutant genes
  • Anti-sense experiments
  • RNAi experiments
  • Specific protein inhibitors
  • Polymorphism or allelic variation
  • 2-hybrid interactions
  • Co-purification
  • Co-immunoprecipitation
  • Ion/protein binding experiments

The other common type of annotation transfers information about gene products with known function to other gene products with similar sequence, and this is called an ISS annotation (Inferred from sequence similarity.)

The flow charts below give a basic introduction to help you to start to make these types of annotation.

Annotations with experimental evidence

(Comment: Evelyn, it's nice to have these simple steps, but we also need to expand on each one. For example, there are a number of useful tools for finding appropriate publications: GeneRIF, iHOP, CiteXplore.)

Open a new spreadsheet and enter the following in the top row:


Gene Product | GO ID | Pubmed ID | Evidence code | Taxon

Choose a gene product to annotate.


Write down the accession number of the gene product in column 1. (e.g. UNIPROT:P35748)


Find a publication demonstrating the function or location of action of the gene product or the process that the gene product is involved in. 


Write down the pubmed id of the paper in column 3. (e.g. PMID:11781338)


Browse the GO to find the GO term that describes the relevant process, function or component. Be sure to read the definition as well as the term name. 


Write down the GO:id of the GO term in column 2. (e.g. GO:0048276)


Look at the Evidence code quickstart guide to find the relevant evidence code for the experiment that was used.


Write down the evidence code in column 4. (e.g. IDA) 


Find the Taxon ID for the species of origin of your gene product from the NCBI taxonomy browser: 


Write the taxon ID down in column 5.

Gene Product   | GO ID      | Pubmed ID     | Evidence code | Taxon
UNIPROT:P35748 | GO:0048276 | PMID:11781338 | IDA           | 9986
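Before sending a spreadsheet like this to a mentor, it can help to check each row mechanically for malformed identifiers. The sketch below is one possible check, not an official validator. The dictionary keys are names chosen for this example, the patterns are inferred from the example values on this page, and the set of evidence codes is an assumption that should be checked against the evidence code guide.

```python
import re

# Evidence codes assumed for this sketch; consult the evidence code guide for the current list.
EVIDENCE_CODES = {"IDA", "IPI", "IMP", "IGI", "IEP", "ISS",
                  "IEA", "TAS", "NAS", "IC", "ND", "RCA"}

# Patterns inferred from the example row above (an assumption, not a specification).
PATTERNS = {
    "gene_product": re.compile(r"^[A-Za-z_]+:\S+$"),  # e.g. UNIPROT:P35748
    "go_id": re.compile(r"^GO:\d{7}$"),               # e.g. GO:0048276
    "pubmed_id": re.compile(r"^PMID:\d+$"),           # e.g. PMID:11781338
    "taxon": re.compile(r"^\d+$"),                    # e.g. 9986
}


def check_row(row):
    """Return a list of problems found in one spreadsheet row given as a dict."""
    problems = []
    for column, pattern in PATTERNS.items():
        value = row.get(column, "")
        if not pattern.match(value):
            problems.append(f"{column}: unexpected value {value!r}")
    code = row.get("evidence_code")
    if code not in EVIDENCE_CODES:
        problems.append(f"evidence_code: unknown code {code!r}")
    return problems


example_row = {
    "gene_product": "UNIPROT:P35748",
    "go_id": "GO:0048276",
    "pubmed_id": "PMID:11781338",
    "evidence_code": "IDA",
    "taxon": "9986",
}
print(check_row(example_row))
```

A check like this only catches formatting slips. Whether the GO term and evidence code are the right ones for the paper still needs a curator's judgement.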

Annotation Inferred by Sequence Similarity

To make an ISS annotation you will need to add an extra row to your spreadsheet.

Sequence based manual annotation.

The process of manual annotation based on sequence similarity can involve the manual review of a host of sequence-based search data, including BLAST-type searches, domain-based searches (InterPro, Pfam, TIGRFAMs, PROSITE, etc.), SignalP, TMHMM, paralogous families, COGs, etc. The annotator evaluates this information by looking at alignments, scores, etc., while taking into consideration the genomic context of the gene product being annotated, including neighboring genes, possible operons, syntenic regions, pathway and system reconstruction, etc.

If you would like to make sequence-based annotations, there is a little more information that you need. For further details, contact the model organism database that deals with the species taxonomically closest to your species of interest. To find your closest database see

PAMGO example

Here is an example of how a new group has started working with the GO Consortium.

The Plant-Associated Microbe Gene Ontology (PAMGO) Group

In 2003 the genome sequence of the tomato pathogen Pseudomonas syringae pv. tomato DC3000 was published. This project was a collaboration between Robin Buell at The Institute for Genomic Research (TIGR) and Alan Collmer of Cornell University. As part of the annotation of P. syringae, TIGR provided some GO assignments to the P. syringae proteins. Discussion between Alan Collmer and Brett Tyler at the NSF Plant Genome Research Program Awardees Meeting that fall revealed a shared awareness of the potential power of the GO and led to the formation of the Plant-Associated Microbe Gene Ontology (PAMGO) working group. Brett Tyler coordinated the effort to bring together PIs from genome projects representing the major groups of microbial pathogens: Bacteria, Fungi, Oomycetes, and Nematodes. The PAMGO group recognized the potential power of the GO to greatly facilitate research in areas common to all these pathogens by providing a robust framework for comparing functions across species. Since TIGR is a member of the GO Consortium, the new PAMGO group entered into collaboration with TIGR staff Michelle Gwinn-Giglio and Linda Hannick to develop terms specific for interactions between pathogens and their hosts.

During 2004 the PAMGO Interest Group worked to develop high level terms to describe processes relevant to plant-microbe associations, which would provide a framework for the later development of more detailed terms. Candace Collmer (Wells College) while on sabbatical leave, and Michelle Gwinn-Giglio (TIGR) led the effort. This activity began with a full-day workshop on April 23, 2004 at TIGR of all the PAMGO participants. The workshop participants defined a set of high level terms and relationships that would be as general as possible, not only for pathogens of all kingdoms, but for the whole range of host-microbe interactions from mutualism to parasitism, and for all hosts, not only plants. Further refinement of the terms and their definitions occurred by email, and on June 2, 2004 the proposed terms were submitted to the GO community for discussion. On Aug 22-23, 2004 Candace and Alan Collmer and Michelle Gwinn-Giglio presented the proposal at a GO content meeting focused on pathogenesis, metabolism, and the cell cycle at the Carnegie Institution, Stanford, CA and on Oct. 15-16, 2004 Michelle Gwinn-Giglio presented three modified options to a GO Consortium Meeting in Chicago. These high level terms generated much debate, both at the original workshop, and within the wider GO community, because of the varied ways in which different communities use words such as "Symbiosis" and "Pathogenesis", and the difficulty of defining the term "Pathogenesis" consistently, given that some organisms may or may not cause disease depending on the physical environment and the physiological or genetic status of the host. This discussion highlighted the varied usage of these terms and stimulated user communities to think about how these terms should be used. A final version was agreed upon and resubmitted to GO on Dec 14, 2004 and made part of the active ontologies on Jan 31, 2005.

In addition to the term development activities in 2004, the PAMGO group was also busy writing a grant to the NSF/USDA Microbial Genome Sequencing Program to fund their GO development work. Fortunately, the grant was awarded and provides 3 years of funding (Fall 2005-Fall 2008) for PAMGO to continue the development of more granular terms under the initial PAMGO term set. The PAMGO group is now actively working on terms that will describe the myriad ways that pathogens affect the metabolism of their hosts. A PAMGO jamboree was held in July 2006 where more than 100 new terms were developed.

Using the PAMGO terms, as well as the rest of the GO ontologies, PAMGO annotators are assigning GO terms to the proteins from the PAMGO organisms that have a role in interacting with their hosts. It is anticipated that PAMGO will begin sending in association files of these annotations at the end of this year.

PAMGO people and pathogens:

Virginia Bioinformatics Institute

Phytophthora sojae (Oomycete)
Phytophthora ramorum (Oomycete)
Brett Tyler
Trudy Torto-Alalibo
Marcus Chibucos
Rays Jiang

Agrobacterium tumefaciens (Bacterium)
Joao Setubal
Joshua Shallom
Tsai-Tien Tseng

Cornell University

Pseudomonas syringae pv. tomato DC3000 (Bacterium)
Pseudomonas syringae pv. phaseolicola 1448A (Bacterium)
Pseudomonas syringae pv. syringae B728A (Bacterium)
Alan Collmer
Magdalen Lindeberg
Candace Collmer (Wells College, September-May)

University of Wisconsin

Erwinia chrysanthemi 3937 (Bacterium)
Nicole Perna
Jeremy Glasner
Bryan Biehl

North Carolina State University

Magnaporthe grisea (Fungus)
Meloidogyne hapla (Nematode)
Ralph Dean
David Bird
Thomas Mitchell
Shaowu Meng

The Institute for Genomic Research

Michelle Gwinn-Giglio
Linda Hannick
Robin Buell
Owen White

Chicken DB - AgBase example

An individual interested in using the GO for annotation of chicken genes made initial contact at the GO Consortium/Users Meeting in October 2004. Follow-up emails with Shane Burgess of Mississippi State University (PI of the project) set up a visit by Fiona McCarthy to MGI at the Jackson Laboratory in Bar Harbor, ME. Over a two-week period (March 7-18, 2005), the curator from the Chicken Database was introduced to the Gene Ontology and to other aspects of the design of a model organism database. A rough outline of the topics covered follows:

1. MGI User Support gave an introduction to MGI; the curator worked through the "User Support Informatics Workshop" and became familiar with MGI.

2. A review of the Mouse Gene Nomenclature documentation, followed by a short nomenclature workshop. Mouse and Human gene nomenclature are co-ordinated and it is a goal to apply this co-ordination to other emerging vertebrate genomes.

3. Review of the GO homepage contents, with emphasis on available annotation documentation.

4. Introduction to Ontologies: Gene Ontology and other controlled vocabularies, including an introduction to GO, the Mouse Anatomical Dictionary, and the Mouse Phenotype Ontology.

5. GO Content, including Gene Ontology content and structure, anatomical representation in the GO, and changing the GO (Obo-Edit and SourceForge)

6. Curation documentation (curation for GO, gene expression, MGI phenotype and an introduction to the MGI editorial interface).

7. Data downloads and GO Automated annotation using UniProt and GO Translation Tables

8. Curation either by test implementation of large-scale strategies or by sequence similarities.

9. Introduction to Literature Selection and Manual Annotation. This included how MGI does literature triage. A plan was then drawn up for chicken paper literature searches and literature list creation so that GO annotation of chicken papers could begin.

10. Submitting GO annotation: the gene_association file format was explained.
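To make the last step more concrete, the sketch below turns one row of the five-column spreadsheet described earlier on this page into a single tab-separated gene_association line. It assumes the 15-column layout (DB, object ID, symbol, qualifier, GO ID, reference, evidence, with, aspect, name, synonym, object type, taxon, date, assigned by). Check the column order and required fields against the current file format documentation before submitting anything. The symbol, aspect and assigned-by values in the example are placeholders, not real curated values.

```python
from datetime import date


def to_gene_association_line(row, symbol, aspect, assigned_by, when=None):
    """Build one tab-separated line from a spreadsheet row (dict) plus extra fields.

    aspect is P, F or C and must be looked up for the GO term; it is not derived here.
    """
    db, object_id = row["gene_product"].split(":", 1)
    when = when or date.today()
    columns = [
        db,                         # 1  DB
        object_id,                  # 2  DB object ID
        symbol,                     # 3  DB object symbol
        "",                         # 4  qualifier (optional)
        row["go_id"],               # 5  GO ID
        row["pubmed_id"],           # 6  reference
        row["evidence_code"],       # 7  evidence code
        row.get("with", ""),        # 8  with/from (optional)
        aspect,                     # 9  aspect
        "",                         # 10 object name (optional)
        "",                         # 11 synonym (optional)
        "protein",                  # 12 object type
        f"taxon:{row['taxon']}",    # 13 taxon
        when.strftime("%Y%m%d"),    # 14 date
        assigned_by,                # 15 assigned by
    ]
    return "\t".join(columns)


row = {
    "gene_product": "UNIPROT:P35748",
    "go_id": "GO:0048276",
    "pubmed_id": "PMID:11781338",
    "evidence_code": "IDA",
    "taxon": "9986",
}
# "example_symbol", "P" and "MyGroup" are placeholders for illustration only.
line = to_gene_association_line(row, "example_symbol", "P", "MyGroup", when=date(2006, 1, 1))
print(line)
```

Keeping the spreadsheet as the working document and generating the file from it means mentor comments can be applied in one place and the file regenerated.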

Upon return to the home site, the curator set up ongoing periodic review of annotations with a mentor at MGI by submitting a gene_association file. The annotations are examined against the literature reference supplied, and comments are returned to the AgBase curator. AgBase adds GO annotations to EBI using the EBI-GOA Protein2GO tool. Submissions are on a monthly basis. Once deemed “trained”, the AgBase curator would then instruct other members of their community.