WormBase, September 2009

Progress Report

In Progress: last updated: 09-28-2009



Juancarlos Chan

Developer, WormBase, Caltech, Pasadena, CA

Ranjana Kishore

Curator, WormBase, Caltech, Pasadena, CA

Paul Sternberg

PI, WormBase, Caltech, Pasadena, CA

Kimberly Van Auken

Curator, WormBase, Caltech, Pasadena, CA

Additional technical support:

Anthony Rogers

WormBase, Sanger Center, Hinxton, UK

Gary Williams

WormBase, Sanger Center, Hinxton, UK


Ruihua Fang

Developer, Textpresso, Caltech, Pasadena, CA

Hans-Michael Müller

Project Leader, Textpresso, Caltech, Pasadena, CA

Arun Rangarajan

Developer, Textpresso, Caltech, Pasadena, CA

Annotation Progress

Table 1: Number of Genes Annotated to Each GO Ontology

Type of Annotation     Number of Genes Annotated  % Change from October 2008  Number of Unique GO Terms  Total Number of GO Terms
Manual Annotation      1684                       +10%                        1536                       10573
Phenotype2GO Mappings  4769                       +2.2%                       53                         30644
IEA/Electronic         9115                       -27.7% (see Note)           418                        9115
Total                  14623                      +2.0%                       1812                       50332

Note: The decrease in IEA annotations is due to changes in our InterPro2GO pipeline that introduced improved parameters for running each of the InterPro member database prediction algorithms on the C. elegans proteome. These improvements reduced the number of low-confidence domain predictions and, consequently, the number of low-confidence IEA annotations.

Methods and Strategies for Annotation

Literature Curation

Manual curation of the C. elegans literature remains our highest curation priority, accounting for ~90% of our total curation effort.

We have implemented a GO curation check-out form that gives curators easy visual access to the curation status of all named C. elegans genes, e.g. vha-6 or egl-9. Genes are displayed in a list that includes the current number of published papers (references) indexed to each gene and the most recent date on which annotations to any of the three ontologies were made. Curators can query and sort the list by reference count, gene name, and curation status.
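The kind of querying and sorting described above can be sketched in a few lines of Python. This is only an illustration, not the form's actual implementation: the record fields, the reference counts, the dates, and the gene name xyz-1 are all invented for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class GeneStatus:
    """One row of a hypothetical check-out list."""
    name: str
    reference_count: int            # papers indexed to the gene
    last_annotated: Optional[date]  # None means no GO annotation yet


# Invented example rows; the counts and dates are not real WormBase data.
genes = [
    GeneStatus("vha-6", 12, date(2009, 3, 14)),
    GeneStatus("egl-9", 30, None),
    GeneStatus("xyz-1", 5, date(2008, 11, 2)),
]


def by_reference_count(records: List[GeneStatus]) -> List[GeneStatus]:
    """Sort so that the genes with the most indexed papers come first."""
    return sorted(records, key=lambda g: g.reference_count, reverse=True)


def uncurated(records: List[GeneStatus]) -> List[GeneStatus]:
    """Keep only genes that have never been annotated."""
    return [g for g in records if g.last_annotated is None]
```

Sorting by reference count and then filtering for uncurated genes would surface well-studied genes that still lack annotation, which is one plausible way a curator might use such a list.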

Computational Methods

Our computational methods encompass two main approaches: 1) InterPro2GO mappings for IEA annotations and 2) Phenotype2GO mappings for IMP annotations.

InterPro2GO Mappings

These are annotations of C. elegans proteins to GO terms, based on electronic matching of protein motifs/domains to those documented in the InterPro database (http://www.ebi.ac.uk/interpro/) and on the mapping of those domains to GO terms provided by the InterPro2GO file generated by the EBI (PMID:12654719, PMID:12520011). These annotations are not reviewed for accuracy by human curators, so all of them use the evidence code 'IEA'.
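As a rough illustration of the mapping step, the sketch below parses lines in the layout used by the public InterPro2GO mapping file (comment lines start with "!", and each mapping line ends with "; GO:nnnnnnn") and then assigns GO terms to proteins from their domain hits. It is not the WormBase pipeline itself: the function names, the protein identifiers PROT1 and PROT2, and the sample lines are assumptions made for the example.

```python
from collections import defaultdict

# Illustrative lines in the InterPro2GO file layout (not a complete file).
SAMPLE_MAPPING = """\
!example comment line
InterPro:IPR000003 Retinoid X receptor/HNF4 > GO:DNA binding ; GO:0003677
InterPro:IPR000003 Retinoid X receptor/HNF4 > GO:nucleus ; GO:0005634
"""


def parse_interpro2go(text):
    """Return a dict mapping InterPro accession -> set of GO IDs."""
    mapping = defaultdict(set)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("!"):
            continue  # skip blank and comment lines
        left, _, right = line.partition(" > ")
        if not right:
            continue  # not a mapping line
        interpro_id = left.split()[0].split(":", 1)[1]
        go_id = right.rsplit(";", 1)[-1].strip()
        mapping[interpro_id].add(go_id)
    return mapping


def iea_annotations(domain_hits, mapping):
    """Turn protein -> [InterPro accessions] into (protein, GO ID, 'IEA') tuples."""
    annotations = []
    for protein, domains in domain_hits.items():
        go_ids = set()
        for domain in domains:
            go_ids |= mapping.get(domain, set())
        for go_id in sorted(go_ids):
            annotations.append((protein, go_id, "IEA"))
    return annotations


# Hypothetical domain predictions for two made-up proteins.
hits = {"PROT1": ["IPR000003"], "PROT2": ["IPR999999"]}
result = iea_annotations(hits, parse_interpro2go(SAMPLE_MAPPING))
```

A protein whose only domain has no entry in the mapping (PROT2 here) simply receives no annotation.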

Phenotype2GO Mappings

These annotations are obtained by a semi-automated method in which WormBase curators map phenotypes to one or more GO terms. A script then uses these mappings to attach GO terms to genes. All of these annotations have the evidence code 'IMP'. Currently, allele phenotypes and phenotypes obtained from large-scale RNA interference screens are used for the mapping. For example, the phenotype 'STErile' (Ste), which is a specialization of 'post-embryonic defect' and 'reproductive defect', is mapped to the GO term 'reproduction' (GO:0000003). A list of the currently used mappings can be found here:
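Separately from that list, the script step can be pictured with a minimal Python sketch. It assumes a curator-maintained dictionary from phenotype to GO IDs and a list of gene/phenotype observations. Only the 'sterile' to GO:0000003 mapping comes from the text; the gene names, the data layout, and the function name are hypothetical.

```python
# Curator-made mapping; only this one entry is taken from the example above.
PHENOTYPE2GO = {
    "sterile": ["GO:0000003"],  # reproduction
}

# Hypothetical observations: (gene, phenotype, how the phenotype was observed).
observations = [
    ("gene-1", "sterile", "allele"),
    ("gene-1", "sterile", "RNAi"),
    ("gene-2", "uncoordinated", "RNAi"),  # no mapping exists for this phenotype
]


def imp_annotations(observations, phenotype2go):
    """Attach GO terms to genes via mapped phenotypes, one annotation per gene/term pair."""
    seen = set()
    annotations = []
    for gene, phenotype, _source in observations:
        for go_id in phenotype2go.get(phenotype, []):
            if (gene, go_id) not in seen:
                seen.add((gene, go_id))
                annotations.append((gene, go_id, "IMP"))
    return annotations


imp = imp_annotations(observations, PHENOTYPE2GO)
```

In this sketch, phenotypes without a mapping produce no annotation, and the same gene/term pair seen through both an allele and RNAi is reported once.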


Priorities for Annotation

Our annotation priorities are as follows:

1) Reference Genome genes

2) Genes presented for annotation via our Textpresso-based semi-automated Cellular Component curation pipeline

3) Genes from training set papers used for piloting semi-automated Textpresso-based Molecular Function curation

4) Newly described genes for which previous annotation was not available

5) C. elegans orthologs of human disease genes

6) Phenotype2GO and InterPro2GO annotations are updated with each release.

Presentations and Publications


Reference Genome Group of the Gene Ontology Consortium. The Gene Ontology's Reference Genome Project: a unified framework for functional annotation across species. PLoS Comput Biol. 2009 Jul;5(7):e1000431. Epub 2009 Jul 3.

Van Auken K, Jaffery J, Chan J, Müller HM, Sternberg PW. Semi-automated curation of protein subcellular localization: a text mining-based approach to Gene Ontology (GO) Cellular Component curation. BMC Bioinformatics. 2009 Jul 21;10:228.

Presentations including Talks and Tutorials and Teaching

Yook K, Van Auken KM, Sternberg P, and the WormBase Consortium. Using Textpresso for Information Retrieval, Fact Extraction, and Database Entry. Third International Biocuration Conference, April 16-19, 2009, Berlin, Germany. Available from Nature Precedings: http://precedings.nature.com/documents/3302/version/1

Poster presentations


Other Highlights

A. Ontology Development Contributions

WormBase curators have contributed to ontology discussion and development in the areas of intraflagellar transport, sex determination and dosage compensation, apoptosis, gastrulation, and drug withdrawal.

B. Annotation Outreach and User Advocacy Efforts

Kimberly Van Auken continues to participate in the gohelp rotation. Ranjana Kishore continues to participate in the efforts of the GO News group.

C. Other Highlights

Curation Tools: Ontology Annotator

We are developing a new, web-based curation tool, the Ontology Annotator, that can be used to annotate genes to any ontology, including the Gene Ontology and the WormBase Phenotype Ontology. The Ontology Annotator incorporates and expands upon much of the functionality of the Phenote curation tool. Some of its more useful features include bulk annotation, autocomplete, and retrieval and filtering of existing data for editing.

Semi-Automated Molecular Function Curation

We continue to explore Textpresso-based GO curation by developing pipelines for semi-automated Molecular Function curation. Our preliminary plan is a two-tiered approach: 1) document classification using Support Vector Machines (SVMs), and 2) category searches to identify curatable sentences within the documents that the SVMs classify with high confidence as containing Molecular Function information. Our initial efforts focus on the binding branch of the MF ontology, including protein-nucleic acid interactions.
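The two-tier idea can be sketched as follows, under several loud assumptions. The first tier here is a toy keyword score standing in for an SVM decision value, not an actual SVM. The category lexicons, the gene names gene-a and gene-b, the threshold, and the example text are all invented. A sentence counts as curatable when it contains at least one term from every category, which is one simple reading of a category search.

```python
import re

# Hypothetical category lexicons; real Textpresso categories are much larger.
CATEGORIES = {
    "gene": {"gene-a", "gene-b"},
    "binding": {"binds", "bound", "interacts"},
}


def tokens_of(text):
    """Lowercase word tokens, keeping hyphens so gene names stay intact."""
    return re.findall(r"[a-z0-9-]+", text.lower())


def toy_score(text):
    """Placeholder for an SVM score: fraction of tokens that are binding terms."""
    tokens = tokens_of(text)
    if not tokens:
        return 0.0
    return sum(t in CATEGORIES["binding"] for t in tokens) / len(tokens)


def curatable_sentences(text, categories):
    """Tier 2: keep sentences containing a term from every category."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [
        s for s in sentences
        if all(set(tokens_of(s)) & terms for terms in categories.values())
    ]


def two_tier(documents, score, threshold, categories):
    """Tier 1 filters documents by score; tier 2 searches the survivors."""
    return {
        doc_id: curatable_sentences(text, categories)
        for doc_id, text in documents.items()
        if score(text) >= threshold
    }


# Invented toy documents, not real abstracts.
docs = {
    "paper1": "We show that gene-a binds DNA. Animals were grown at 20 degrees.",
    "paper2": "Animals were grown at 20 degrees. Worms moved slowly.",
}
found = two_tier(docs, toy_score, 0.05, CATEGORIES)
```

Passing the scorer in as a function keeps the sentence search independent of the classifier, so a trained model could replace the toy score without changing the second tier.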

Expanded Phenotype2GO Mappings

Working with the WormBase phenotype curators, we have added 146 mappings to our Phenotype2GO mappings, which are used to make GO Biological Process annotations with the IMP evidence code. Allele- or RNAi-based phenotypes are annotated to a term from the WormBase phenotype ontology, which is then mapped to an appropriate GO term. A list of the new mappings can be found here: