WormBase


WormBase Gene Ontology Progress Report, December 2010

Staff

Juancarlos Chan

Developer, WormBase, Caltech, Pasadena, CA.

Ranjana Kishore

Curator, WormBase, Caltech, Pasadena, CA.

Paul Sternberg

PI, WormBase, Caltech, Pasadena, CA.

Kimberly Van Auken

Curator, WormBase, Caltech, Pasadena, CA.


Jolene Fernandes

Phenotype Curation, WormBase, Caltech, Pasadena, CA.

Gary Schindelman

Phenotype Curation, WormBase, Caltech, Pasadena, CA.


Ruihua Fang

Bioinformatician, Developer, WormBase, Caltech, Pasadena, CA


Additional technical support:

Anthony Rogers

WormBase, Sanger Center, Hinxton, UK

Gary Williams

WormBase, Sanger Center, Hinxton, UK


Textpresso:

Hans Michael Muller

Project Leader, Textpresso, Caltech, Pasadena, CA

Arun Rangarajan

Developer, Textpresso, Caltech, Pasadena, CA

Annotation Progress

Table 1: Number of Genes Annotated

Type of Annotation    | Number of Genes Annotated, Dec 2009 | % Change from Dec 2008 | Number of Unique GO Terms | Total Number of GO Terms
Manual Annotation     | 1,751                               | +12.9                  | 1,614                     | 8,853
Phenotype2GO Mappings | 6,719                               | +43.5                  | 105                       | 50,078
IEA/Electronic        | 12,847                              | +1.4                   | 1,519                     | 54,645
Total                 | 15,668                              | +8.9                   | 2,769                     | 113,576


Methods and Strategies for Annotation

Literature Curation

Manual curation of the C. elegans literature remains our highest curation priority, accounting for ~90% of our total curation effort. We have implemented a GO curation check-out form that gives curators easy visual access to the curation status of all named C. elegans genes, e.g. vha-6 or egl-9. Genes are displayed in a list that includes the current number of published papers (references) indexed to each gene and the date on which annotations to any of the three ontologies were last made. Curators can query and sort the list according to reference count, gene name, and curation status.

Computational Methods

InterPro2GO mappings for IEA annotations: These annotations assign C. elegans proteins to GO terms based on electronic matching of protein motifs/domains to those documented in the InterPro database (http://www.ebi.ac.uk/interpro/), and on the mapping of those motifs/domains to GO terms provided by the InterPro2GO file generated by the EBI (PMID:12654719, PMID:12520011). These annotations are not reviewed for accuracy by human curators; as such, all of them use the evidence code 'IEA'.
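
The mapping step can be sketched roughly as follows. This is a minimal illustration, not the WormBase pipeline itself: it assumes each mapping line has the layout "InterPro:IPRnnnnnn description > GO:term name ; GO:nnnnnnn", and the function names and the shape of the domain-hit input are hypothetical.

```python
import re
from collections import defaultdict

# Assumed layout of one mapping line (comment lines start with '!'):
#   InterPro:IPR000003 Retinoid X receptor/HNF4 > GO:DNA binding ; GO:0003677
MAPPING_LINE = re.compile(r"^InterPro:(IPR\d{6})\s.*>\s*GO:.*;\s*(GO:\d{7})\s*$")

def parse_interpro2go(lines):
    """Return {InterPro accession: set of GO IDs}, skipping comment lines."""
    mapping = defaultdict(set)
    for line in lines:
        if line.startswith("!"):
            continue
        match = MAPPING_LINE.match(line)
        if match:
            mapping[match.group(1)].add(match.group(2))
    return mapping

def iea_annotations(domain_hits, mapping):
    """Yield (protein, GO ID, evidence code) for every mapped domain hit.

    domain_hits is a hypothetical input: {protein ID: list of InterPro accessions}.
    """
    for protein, accessions in domain_hits.items():
        go_ids = set()
        for accession in accessions:
            go_ids |= mapping.get(accession, set())
        for go_id in sorted(go_ids):
            yield (protein, go_id, "IEA")
```

Because the evidence code is fixed at 'IEA', nothing in this sketch depends on curator review; domains without a mapping simply produce no annotation.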

TMHMM2GO mapping for IEA annotations: We also include the results of an internal pipeline that maps proteins containing a transmembrane domain, as predicted by the TMHMM algorithm, to the GO Cellular Component term 'integral to membrane'. Approximately 6,800 gene products are annotated to 'integral to membrane' via this pipeline. InterPro2GO and TMHMM2GO annotations are updated at every database release.
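
A rough sketch of this kind of mapping is shown below. It is not the internal pipeline: it assumes tab-separated prediction lines that carry a PredHel=<n> field (the predicted number of transmembrane helices), and the threshold of one helix, the function names, and the output tuple are illustrative choices.

```python
INTEGRAL_TO_MEMBRANE = "GO:0016021"  # GO Cellular Component: integral to membrane

def predicted_helices(short_line):
    """Read the PredHel=<n> field from one tab-separated prediction line."""
    for field in short_line.rstrip("\n").split("\t")[1:]:
        key, _, value = field.partition("=")
        if key == "PredHel":
            return int(value)
    return 0  # no PredHel field: treat as no predicted helix

def tmhmm2go(short_lines, min_helices=1):
    """Return (protein, GO ID, aspect, evidence) rows for predicted membrane proteins.

    min_helices is an assumed threshold, not one stated in the report.
    """
    rows = []
    for line in short_lines:
        if not line.strip():
            continue
        protein = line.split("\t", 1)[0]
        if predicted_helices(line) >= min_helices:
            rows.append((protein, INTEGRAL_TO_MEMBRANE, "C", "IEA"))
    return rows
```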

Semi-automated Methods

Review and improvement of the Phenotype2GO data pipeline

WormBase uses a well-defined phenotype ontology to annotate gene-allele and RNAi-based phenotypes. A total of 201 phenotype terms used in annotation have been mapped to a GO term. At every WormBase database build, these mappings are used to automatically generate Biological Process annotations to genes using the IMP evidence code. The complete list of WormBase phenotype to GO term mappings can be found here: http://www.wormbase.org/wiki/index.php/Gene_Ontology. We have begun a review of our phenotype to GO term mappings and are changing this pipeline so that its annotations accord with the stricter use of the IMP evidence code recently described in GO consortium annotation policies. This process will involve removing some high-level phenotype term to GO term mappings, excluding certain RNAi experiments/papers from the pipeline, and reviewing and changing several of the scripts involved.
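
The general shape of such a pipeline, including the removal of mappings during review, might look like the sketch below. All identifiers are placeholders rather than real WormBase or GO accessions, and the function name, input tuples, and exclusion set are assumptions made for illustration.

```python
# Placeholder identifiers: none of these are real WormBase or GO accessions.
PHENOTYPE2GO = {
    "WBPhenotype:EXAMPLE_A": "GO:EXAMPLE_1",
    "WBPhenotype:EXAMPLE_B": "GO:EXAMPLE_2",
}
# Mappings withdrawn during review, e.g. because the phenotype term is too high-level.
EXCLUDED_PHENOTYPES = {"WBPhenotype:EXAMPLE_B"}

def phenotype2go(observations, mapping, excluded=frozenset()):
    """Turn (gene, phenotype, source) observations into Biological Process/IMP rows."""
    rows = set()
    for gene, phenotype, source in observations:
        if phenotype in excluded or phenotype not in mapping:
            continue  # unmapped or withdrawn phenotype terms yield no annotation
        rows.add((gene, mapping[phenotype], "P", "IMP", source))
    return sorted(rows)
```

Keeping the exclusions in a separate set, rather than deleting entries from the mapping, is one way to make the outcome of a review easy to audit; excluding individual experiments or papers could be handled the same way with a set of source identifiers.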

Priorities for Annotation

Our annotation priorities are as follows:

1) Reference Genome genes

2) Genes presented for annotation via our Textpresso-based semi-automated Cellular Component curation pipeline

3) Genes from training set papers used for piloting semi-automated Textpresso-based Molecular Function curation

4) Newly described genes for which previous annotation was not available

5) C. elegans orthologs of human disease genes

6) Phenotype2GO and InterPro2GO annotations are updated with each release.


Presentations and Publications

Publications, Talks, Posters 2010-

Ontology Development Contributions

WormBase curators have contributed to ontology discussion and development in the areas of:

  • cilium terms (updates/revisions to terms added in 2005)
  • octopamine/tyramine signaling involved in the response to food (and the regulation terms)
  • alpha-tubulin acetylation
  • phagosome maturation involved in apoptotic cell clearance
  • phagosome acidification involved in apoptotic cell clearance
  • phagolysosome assembly involved in apoptotic cell clearance
  • phagosome-lysosome docking involved in apoptotic cell clearance
  • phagosome-lysosome fusion involved in apoptotic cell clearance
  • neuropeptide receptor binding
  • striated muscle contraction involved in embryonic body morphogenesis
  • striated muscle myosin thick filament assembly
  • striated muscle paramyosin thick filament assembly (2010)
  • determination of left/right asymmetry in the nervous system
  • regulation of locomotion (including positive and negative regulation child terms) involved in locomotory behavior
  • detoxification of arsenic
  • chondroitin sulfate proteoglycan binding
  • chondroitin sulfate binding
  • regulation (includes positive and negative regulation child terms) of nematode larval development
  • regulation of (includes positive and negative regulation terms) dauer larval development

Annotation Outreach and User Advocacy Efforts

Kimberly Van Auken continues to participate in the go-help rotation. Ranjana Kishore continues to participate in the efforts of the GO News group.


Other Highlights

Curation Tools: Ontology Annotator

We have developed a new, web-based curation tool, the Ontology Annotator, that can be used to annotate genes to any ontology, including the Gene Ontology and the WormBase Phenotype Ontology. The Ontology Annotator incorporates and expands upon much of the functionality of the Phenote curation tool. Some of the more useful features of the tool include bulk annotation, autocomplete, and retrieval and filtering of existing data for editing.

Semi-Automated Molecular Function Curation

In addition to using Textpresso for Cellular Component Curation, we have developed pipelines for semi-automated Molecular Function (MF) curation. Given that Molecular Function annotations can be made based upon evidence from a wide variety of experiments, our approach has been to divide molecular function curation into two broad categories: 1) macromolecular interactions and 2) enzymatic and transporter activities.

Macromolecular Interactions

For macromolecular interactions, we employ a two-step curation pipeline. First, we use an SVM-based document classification algorithm to identify new papers likely to contain reports of macromolecular interactions. These papers are then searched for matching sentences using Textpresso categories developed specifically for identifying sentences that describe macromolecular interactions. Our initial results indicate that the SVM performs very well at predicting high-confidence true positive papers (precision=86.4%) as well as true negative papers (precision=88.9%). The Textpresso categories have a high recall (89.7%) but a relatively low precision (47.3%). In practice, this means that curators are able to retrieve and curate nearly all macromolecular interactions but must still examine a number of false positive sentences, so curation efficiency needs to be improved. We are fine-tuning the Textpresso categories for macromolecular interactions to address this issue. The new Textpresso categories for this data type are scheduled to be in place at the end of December 2010.
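
To make the trade-off concrete, the short sketch below applies the standard definitions of precision, recall, and F1 to the sentence-level figures quoted above. The function names are illustrative, and the F1 score is a derived summary that was not part of the original evaluation.

```python
def precision(true_pos, false_pos):
    """Fraction of retrieved sentences that really describe an interaction."""
    return true_pos / (true_pos + false_pos)

def recall(true_pos, false_neg):
    """Fraction of real interaction sentences that were retrieved."""
    return true_pos / (true_pos + false_neg)

def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# Sentence-level values reported for the Textpresso interaction categories.
reported_p, reported_r = 0.473, 0.897
print(f"F1 = {f1(reported_p, reported_r):.3f}")                # F1 = 0.619
print(f"sentences read per true positive = {1 / reported_p:.2f}")  # 2.11
```

At this precision a curator reads roughly two retrieved sentences for each one that is actually curatable, which is the efficiency cost the category fine-tuning is meant to reduce.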

Enzymatic and Transporter Activities

In parallel with these efforts, we are also investigating semi-automated curation methods for annotating enzymatic and transporter activities. For this data type, we are developing Textpresso categories as well as training a Hidden Markov Model (HMM) to identify curatable sentences. For the Textpresso categories, we have collected 419 sentences from 64 papers. Using these sentences, we have developed two new Textpresso categories for enzymatic and transporter activities. As with the new macromolecular interaction categories, we plan to implement these categories on Textpresso by the end of December 2010.

In collaboration with Hans-Michael Mueller, we are also training a Hidden Markov Model to identify sentences describing enzymatic and transporter activities. At present, we are in the third round of training and evaluation for the model. We hope to complete an initial evaluation of the model by early next year and will report on its performance.

Textpresso-Based identification of literature for human disease gene orthologs

Human disease gene orthologs are a high-priority annotation list for both WormBase and other model organism databases of the GO consortium. Sequence-based or similarity searches for human disease gene orthologs may provide useful candidates, but the biological information needs to be extracted manually from the literature. We have started work on a project that combines Textpresso-based categories and keyword searches to identify C. elegans papers that describe the study of a human disease gene ortholog. Sentences in which a C. elegans gene co-occurs with a human disease term were deemed important for this process. A new 'human disease' Textpresso category was formed using the human disease ontology in the OBO Foundry, the Neuroscience Information Framework Standardized (NIFSTD) ontology, and the existing Textpresso human disease category. Several disease terms that would increase the number of false positives were removed iteratively by a manual process. Because an AND query on just two categories (human disease and C. elegans gene) returned too many false positives, a third category containing words such as 'ortholog', 'homolog', 'similar', 'relate' and 'model' was formed and added to the AND query. The query is currently being fine-tuned to increase precision and recall. Once established, it will automate the flagging of new articles that contain disease gene ortholog data. Subsequently, this method can also be used for the extraction of relevant information.
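
The three-category AND query can be pictured with the toy sketch below. It is not Textpresso code: the disease terms, the gene-name pattern, and the function names are hypothetical stand-ins, and only the relation stems are taken from the description above.

```python
import re

# Hypothetical stand-ins for the Textpresso categories; only the relation
# stems come from the text, the other entries are illustrative.
DISEASE_TERMS = ("parkinson's disease", "muscular dystrophy")
RELATION_STEMS = ("ortholog", "homolog", "similar", "relate", "model")
GENE_NAME = re.compile(r"\b[a-z]{3,4}-\d+\b")  # assumed pattern, e.g. egl-9

def matches_all_categories(sentence):
    """True when a sentence contains a gene name, a disease term and a relation stem."""
    has_gene = GENE_NAME.search(sentence) is not None
    lowered = sentence.lower()
    has_disease = any(term in lowered for term in DISEASE_TERMS)
    has_relation = any(stem in lowered for stem in RELATION_STEMS)
    return has_gene and has_disease and has_relation

def flag_paper(sentences):
    """Return the sentences that satisfy the three-category AND query."""
    return [s for s in sentences if matches_all_categories(s)]
```

Requiring the third category is what narrows the result set: a sentence that merely mentions a gene and a disease together, without any ortholog or model language, is no longer returned.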

Textpresso-Based Curation Pipelines for Other MODs

We have been collaborating with The Arabidopsis Information Resource (TAIR) and dictyBase to develop and implement Textpresso-based curation pipelines for Cellular Component annotation. At present, we are working with TAIR to develop a pipeline by which they can perform Textpresso searches on the Arabidopsis corpus from 2008 to identify potentially new annotations that were not previously captured in their existing GO curation pipeline. The results of these searches are being presented in a curation form that will allow TAIR curators to make new annotations and retrieve them in both a gene_association file format and a more basic, three-column format that is similar to their user submission format. We hope to generalize this pipeline so that other groups, including dictyBase, will be able to perform Textpresso queries, send the results to a curation form, and retrieve the output of the curation as a gene_association file.