Beginning Annotation SOP
This page is a place to build the SOP on beginning annotation.
These are the guidelines that the PIs gave me:
The initial step would be to create a document that outlines the annotation process. In addition, a case study would be very useful: for example, how the chicken genome came to be annotated, including what order events happened in, how the timing worked, what software was used, and how the annotators interacted with their GOA mentors.
As for the documentation, you might start with this outline:
First, a brief statement about how the annotation process starts once the genes or gene products are defined (i.e. unique, stable identifiers from UniProt or RefSeq are available for their sequences). Second, the document should include steps for doing GO annotations by various methods, including automated methods, such as the InterProScan approach or the incorporation of experimentally based annotations of orthologs, and curated methods, such as assigning literature-based (experimental) GO annotations. The document should provide pointers to the existing documentation wherever possible. Third, there should be information on the gene association file format and how to submit.
Once this documentation (essentially a 'standard operating procedure', not a detailed how-to) is defined, it can then be used to frame the inquiries of annotation groups and to support these groups in many contexts.
SOP for starting annotation
A. Automatic GO annotation tools
Many groups have developed GO-related annotation tools. See the annotation tools listed on the GO website.
Please write to the GO-Friends mailing list if you have specific annotation tool needs. All the tool developers are on the list and can help you choose a suitable tool, or may modify a tool to include the functionality that you need. Mail go-friends at geneontology.org.
B. Automatic annotation based on GO mapping files and GO-annotated protein datasets for those users with database infrastructure in place.
1) Sequence-based methods
You may also like to try Blast2GO to find GO annotations for sequences similar to yours.
2) Domain-based comparison methods
keyword2go - a mapping of Swiss-Prot keywords to GO
ec2go - a mapping of EC numbers to GO
See the additional mappings page on the GO website (link here)
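As a rough illustration of how a mapping file such as ec2go could be loaded into your own infrastructure, here is a minimal Python sketch. It assumes the mapping lines have the shape `EXTERNAL_ID > GO:term name ; GO:nnnnnnn` and that lines beginning with `!` are comments; check this against the mapping file you actually download. The function name `parse_mapping` and the sample lines are illustrative only and are not part of any GO tool.

```python
import re

# Assumed line shape (verify against the real mapping file):
#   EC:1.1.1.1 > GO:alcohol dehydrogenase (NAD+) activity ; GO:0004022
# Lines starting with "!" are treated as comments.
MAPPING_LINE = re.compile(
    r"^(?P<ext>\S.*?)\s+>\s+GO:(?P<name>.*?)\s+;\s+(?P<go_id>GO:\d{7})\s*$"
)

def parse_mapping(lines):
    """Return {external_id: [(go_id, go_term_name), ...]}."""
    mapping = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("!"):
            continue  # skip blank lines and comments
        match = MAPPING_LINE.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed shape
        # Key on the first token, so an optional label after the ID is ignored.
        external_id = match.group("ext").split()[0]
        mapping.setdefault(external_id, []).append(
            (match.group("go_id"), match.group("name"))
        )
    return mapping

# Illustrative sample lines, not a copy of the current ec2go file.
SAMPLE = """! illustrative excerpt
EC:1.1.1.1 > GO:alcohol dehydrogenase (NAD+) activity ; GO:0004022
EC:2.7.1.1 > GO:hexokinase activity ; GO:0004396
"""

mapping = parse_mapping(SAMPLE.splitlines())
for external_id, go_terms in mapping.items():
    print(external_id, go_terms)
```

With a dictionary like this, gene products that already carry EC numbers or keywords can be given candidate GO terms by simple lookup; such assignments are automatic and should be recorded with an electronic evidence code.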
C. Annotation Services
1) TIGR's Annotation Engine Service for prokaryotic genomes. This free service provides automatic annotation and database infrastructure to anyone with a prokaryotic DNA sequence they wish to annotate. More information can be found at www.tigr.org/AnnotationEngine/. In addition, TIGR offers a 3-day Prokaryotic Annotation and Analysis course. This course complements the Annotation Engine service in that it provides detailed information on TIGR's annotation pipeline and on the use of the free manual annotation tool Manatee. More information can be found at www.tigr.org/AnnotationClass/.
2) GenDB www.cebitec.uni-bielefeld.de/groups/brf/software/gendb_info/
D. If you do not have any database infrastructure, you can use the public repositories to aid in this process:
Submit your sequences to one of the large public repositories.
<a href="http://www.ebi.ac.uk/Submissions/"> EBI Submissions, including EMBL-bank </a>
<a href="http://www.ncbi.nlm.nih.gov/Genbank/submit.html">GenBank Submissions</a>
<a href="http://www.ddbj.nig.ac.jp/sub-e.html">DDBJ Submission</a>
(EMBL-Bank, GenBank and DDBJ exchange data amongst themselves so you can use any of these submission interfaces and have your data appear in all three resources.)
Once your sequences have been processed and passed along the pipeline to all the related databases you will be able to retrieve:
1) Unique, stable identifiers (such as UniProt or RefSeq accessions) for your sequences.
2) Some level of automatic GO annotation for your sequences (publicly available).
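Annotations retrieved this way are typically distributed as tab-delimited gene association files. The Python sketch below shows one way such a file might be read, assuming the 15-column layout described in the GO annotation file format documentation, with `!` marking comment lines. The column names, the function names and the sample row (database `ExampleDB`, identifier `EX0001`) are made up for illustration; confirm the layout against the file you download.

```python
# Assumed column order of a gene association file (15 columns).
GAF_COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "db_reference", "evidence_code", "with_from", "aspect",
    "db_object_name", "db_object_synonym", "db_object_type",
    "taxon", "date", "assigned_by",
]

def parse_gaf(lines):
    """Return one dict per annotation row, keyed by the names above."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("!"):
            continue  # skip blank lines and header/comment lines
        fields = line.split("\t")
        if len(fields) < len(GAF_COLUMNS):
            raise ValueError(f"expected at least 15 columns, got {len(fields)}")
        # zip() ignores any extra columns that newer file versions may add.
        records.append(dict(zip(GAF_COLUMNS, fields)))
    return records

def count_by_evidence(records):
    """Count annotations per evidence code (e.g. IEA for automatic ones)."""
    counts = {}
    for record in records:
        code = record["evidence_code"]
        counts[code] = counts.get(code, 0) + 1
    return counts

# Hypothetical row, built with explicit tabs so the columns are unambiguous.
SAMPLE_ROW = "\t".join([
    "ExampleDB", "EX0001", "exgA", "", "GO:0005524",
    "GO_REF:0000002", "IEA", "InterPro:IPR000719", "F",
    "example kinase", "", "protein", "taxon:9031", "20060101", "ExampleDB",
])

records = parse_gaf(["!illustrative header line", SAMPLE_ROW])
print(records[0]["db_object_id"], records[0]["go_id"])
print(count_by_evidence(records))
```

Counting rows by evidence code is a quick way to see how much of what you have retrieved is automatic annotation and how much, if any, is curated.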
(I am hoping that I can insert a picture here that shows where sequences go in and where annotated sequences come out so that people can choose their favourite provider and be sure that they know what they are getting and what they are missing out on by choosing to download at that point.)
1. Literature-based manual annotation - Check and improve your annotations against the literature.
1) Read the manual annotation guidelines on the GO Consortium website.
2) Contact the GO Consortium to ask about annotation camps and mentoring.
[insert description of camps and mentoring]
2. Sequence-based manual annotation.
The process of manual annotation based on sequence similarity involves the manual review of a host of sequence-based search data, including BLAST-type searches, domain-based searches (InterPro, Pfam, TIGRFAMs, PROSITE, etc.), SignalP, TMHMM, paralogous families, COGs, etc. The annotator evaluates this information by looking at alignments, scores, etc., while taking into consideration the genomic context of the gene product being annotated, including neighboring genes, possible operons, syntenic regions, pathway and system reconstruction, etc.
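Before the manual review, it can help to shortlist the search hits worth looking at. The Python sketch below filters BLAST hits by e-value and percent identity, assuming the default 12-column tabular output of BLAST+ (`-outfmt 6`). The thresholds, the names `Hit`, `parse_hits` and `candidates_for_review`, and the sample rows are hypothetical; such a filter only prepares a shortlist and does not replace the annotator's judgment about alignments and genomic context.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    query: str
    subject: str
    percent_identity: float
    alignment_length: int
    evalue: float
    bit_score: float

def parse_hits(lines):
    """Parse tabular BLAST rows (assumed default 12-column layout)."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        f = line.split("\t")
        hits.append(Hit(f[0], f[1], float(f[2]), int(f[3]),
                        float(f[10]), float(f[11])))
    return hits

def candidates_for_review(hits, max_evalue=1e-10, min_identity=30.0):
    """Keep hits passing both (hypothetical) thresholds, best score first."""
    kept = [h for h in hits
            if h.evalue <= max_evalue and h.percent_identity >= min_identity]
    return sorted(kept, key=lambda h: (h.query, -h.bit_score))

# Made-up rows: query, subject, %identity, length, mismatches, gap opens,
# query start/end, subject start/end, e-value, bit score.
SAMPLE_ROWS = [
    "\t".join(["geneA", "subj1", "45.0", "300", "160", "5",
               "1", "300", "1", "298", "1e-50", "180"]),
    "\t".join(["geneA", "subj2", "62.5", "310", "110", "3",
               "1", "310", "4", "312", "1e-90", "320"]),
    "\t".join(["geneA", "subj3", "28.0", "120", "85", "2",
               "10", "130", "5", "124", "2e-05", "42"]),
]

shortlist = candidates_for_review(parse_hits(SAMPLE_ROWS))
for hit in shortlist:
    print(hit.query, hit.subject, hit.evalue, hit.bit_score)
```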
Here is an example of how a new group has started working with the GO Consortium.
The Plant-Associated Microbe Gene Ontology (PAMGO) Group
In 2003 the genome sequence of the tomato pathogen Pseudomonas syringae pv. tomato DC3000 was published. This project was a collaboration between Robin Buell at The Institute for Genomic Research (TIGR) and Alan Collmer of Cornell University. As part of the annotation of P. syringae, TIGR provided some GO assignments to the P. syringae proteins. Discussion between Alan Collmer and Brett Tyler at the NSF Plant Genome Research Program Awardees Meeting that fall revealed a shared awareness of the potential power of the GO and led to the formation of the Plant-Associated Microbe Gene Ontology (PAMGO) working group. Brett Tyler coordinated the effort to bring together PIs from genome projects representing the major groups of microbial pathogens: Bacteria, Fungi, Oomycetes, and Nematodes. The PAMGO group recognized the potential power of the GO to greatly facilitate research in areas common to all these pathogens by providing a robust framework for comparing functions across species. Since TIGR is a member of the GO Consortium, the new PAMGO group entered into collaboration with TIGR staff Michelle Gwinn-Giglio and Linda Hannick to develop terms specific for interactions between pathogens and their hosts.
During 2004 the PAMGO Interest Group worked to develop high-level terms to describe processes relevant to plant-microbe associations, which would provide a framework for the later development of more detailed terms. Candace Collmer (Wells College), while on sabbatical leave, and Michelle Gwinn-Giglio (TIGR) led the effort. This activity began with a full-day workshop of all the PAMGO participants at TIGR on April 23, 2004. The workshop participants defined a set of high-level terms and relationships that would be as general as possible, not only for pathogens of all kingdoms, but for the whole range of host-microbe interactions from mutualism to parasitism, and for all hosts, not only plants. Further refinement of the terms and their definitions occurred by email, and on June 2, 2004 the proposed terms were submitted to the GO community for discussion. On Aug 22-23, 2004 Candace and Alan Collmer and Michelle Gwinn-Giglio presented the proposal at a GO content meeting focused on pathogenesis, metabolism, and the cell cycle at the Carnegie Institution, Stanford, CA, and on Oct. 15-16, 2004 Michelle Gwinn-Giglio presented three modified options to a GO Consortium Meeting in Chicago. These high-level terms generated much debate, both at the original workshop and within the wider GO community, because of the varied ways in which different communities use words such as "Symbiosis" and "Pathogenesis", and the difficulty of defining the term "Pathogenesis" consistently, given that some organisms may or may not cause disease depending on the physical environment and the physiological or genetic status of the host. This discussion highlighted the varied usage of these terms and stimulated user communities to think about how these terms should be used. A final version was agreed upon and resubmitted to GO on Dec 14, 2004 and made part of the active ontologies on Jan 31, 2005.
In addition to the term development activities in 2004, the PAMGO group was also busy writing a grant proposal to the NSF/USDA Microbial Genome Sequencing Program to fund their GO development work. Fortunately, the grant was awarded and provides 3 years of funding (Fall 2005-Fall 2008) for PAMGO to continue the development of more granular terms under the initial PAMGO term set. The PAMGO group is now actively working on terms that will describe the myriad ways that pathogens affect the metabolism of their hosts. A PAMGO jamboree was held in July 2006, where more than 100 new terms were developed.
Using the PAMGO terms, as well as the rest of the GO ontologies, PAMGO annotators are assigning GO terms to the proteins from the PAMGO organisms that have a role in interacting with their hosts. It is anticipated that PAMGO will begin sending in association files of these annotations at the end of this year.
PAMGO people and pathogens:
Virginia Bioinformatics Institute
Phytophthora sojae (Oomycete)
Phytophthora ramorum (Oomycete)
Agrobacterium tumefaciens (Bacterium)
Pseudomonas syringae pv. tomato DC3000 (Bacterium)
Pseudomonas syringae pv. phaseolicola 1448A (Bacterium)
Pseudomonas syringae pv. syringae B728A (Bacterium)
Candace Collmer (Wells College, September-May)
University of Wisconsin
Erwinia chrysanthemi 3937 (Bacterium)
North Carolina State University
Magnaporthe grisea (Fungus)
Meloidogyne hapla (Nematode)
The Institute for Genomic Research