WormBase


WormBase Gene Ontology Progress Report, December 2009

Staff

Juancarlos Chan

Developer, WormBase, Caltech, Pasadena, CA.

Ranjana Kishore

Curator, WormBase, Caltech, Pasadena, CA.

Paul Sternberg

PI, WormBase, Caltech, Pasadena, CA.

Kimberly Van Auken

Curator, WormBase, Caltech, Pasadena, CA.


Jolene Fernandes

Phenotype Curation, WormBase, Caltech, Pasadena, CA.

Gary Schindelman

Phenotype Curation, WormBase, Caltech, Pasadena, CA.


Ruihua Fang

Bioinformatician, Developer, WormBase, Caltech, Pasadena, CA


Additional technical support:

Anthony Rogers

WormBase, Sanger Center, Hinxton, UK

Gary Williams

WormBase, Sanger Center, Hinxton, UK


Textpresso:

Hans Michael Muller

Project Leader, Textpresso, Caltech, Pasadena, CA

Arun Rangarajan

Developer, Textpresso, Caltech, Pasadena, CA

Annotation Progress

Table 1: Number of Genes Annotated

Type of Annotation | Number of Genes Annotated, Dec 2009 | % Change from Dec 2008 | Number of Unique GO Terms | Total Number of GO Terms
Manual Annotation | 1,751 | +12.9 | 1,614 | 8,853
Phenotype2GO Mappings | 6,719 | +43.5 | 105 | 50,078
IEA/Electronic | 12,847 | +1.4 | 1,519 | 54,645
Total | 15,668 | +8.9 | 2,769 | 113,576


Methods and Strategies for Annotation

Literature Curation

Manual curation of the C. elegans literature remains our highest curation priority, accounting for ~90% of our total curation effort. We have implemented a GO curation check-out form that gives curators easy visual access to the curation status of all named C. elegans genes, e.g. vha-6 or egl-9. Genes are displayed in a list that includes the current number of published papers (references) indexed to each gene and the last date on which annotations to any of the three ontologies were made. Curators can query and sort the list according to reference count, gene name, and curation status.

Computational Methods

InterPro2GO mappings for IEA annotations: These annotations assign C. elegans proteins to GO terms based on electronic matching of protein motifs/domains to those documented in the InterPro database (http://www.ebi.ac.uk/interpro/), combined with the mapping of those domains to GO terms provided by the interpro2go file generated by the EBI (PMID:12654719, PMID:12520011). Note that these annotations are not reviewed for accuracy by human curators. As such, all of them use the evidence code 'IEA'.
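The mapping step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the WormBase pipeline itself: it assumes mapping lines of the form `InterPro:<accession> <name> > GO:<term name> ; <GO ID>` with `!` comment lines, and the function names, the protein names and the `protein_domains` input are hypothetical.

```python
import re

# Assumed shape of one mapping line in an interpro2go-style file, e.g.
# "InterPro:IPR000003 Retinoid X receptor/HNF4 > GO:DNA binding ; GO:0003677"
MAPPING_LINE = re.compile(r"^InterPro:(IPR\d+)\s.*>\s*GO:.*;\s*(GO:\d{7})\s*$")


def parse_interpro2go(lines):
    """Return {InterPro accession: set of GO IDs}, skipping '!' comment lines."""
    mapping = {}
    for line in lines:
        if line.startswith("!"):
            continue
        match = MAPPING_LINE.match(line.strip())
        if match:
            mapping.setdefault(match.group(1), set()).add(match.group(2))
    return mapping


def iea_annotations(protein_domains, mapping):
    """Yield (protein, GO ID, 'IEA') for every domain hit that has a GO mapping.

    protein_domains is a hypothetical input: {protein name: [InterPro accessions]}.
    """
    for protein, domains in sorted(protein_domains.items()):
        go_ids = set()
        for domain in domains:
            go_ids |= mapping.get(domain, set())
        for go_id in sorted(go_ids):
            yield (protein, go_id, "IEA")


if __name__ == "__main__":
    # Illustrative lines in the assumed format, not quoted from a specific release.
    sample = [
        "!example header comment",
        "InterPro:IPR000003 Retinoid X receptor/HNF4 > GO:DNA binding ; GO:0003677",
    ]
    mapping = parse_interpro2go(sample)
    for row in iea_annotations({"PROT_A": ["IPR000003"]}, mapping):
        print(row)
```

Because every row carries the 'IEA' code unconditionally, the sketch reflects the point made above that these annotations are not reviewed by a curator before being emitted.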

TMHMM2GO mapping for IEA annotations: We also include the results of an internal pipeline that maps proteins containing a transmembrane domain, as predicted by the TMHMM algorithm, to the GO Cellular Component term 'integral to membrane'. ~6800 gene products are annotated to 'integral to membrane' via this pipeline.
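A minimal sketch of such a mapping follows. The internal pipeline is not described in detail here, so several things are assumptions: that predictions are read from TMHMM's tab-separated short output (one line per protein with a `PredHel=` field), that one or more predicted helices is the cut-off, and that GO:0016021 is the identifier used for 'integral to membrane'. The function names and protein names are hypothetical.

```python
# Assumption: GO:0016021 is the ID carried by the term 'integral to membrane'.
INTEGRAL_TO_MEMBRANE = "GO:0016021"


def predicted_helices(tmhmm_line):
    """Return (sequence id, PredHel count) from one tab-separated short-format line."""
    fields = tmhmm_line.rstrip("\n").split("\t")
    seq_id = fields[0]
    # Remaining fields look like "len=300", "PredHel=2", "Topology=o10-32i".
    values = dict(f.split("=", 1) for f in fields[1:] if "=" in f)
    return seq_id, int(values.get("PredHel", 0))


def tmhmm2go(lines, min_helices=1):
    """Return IEA annotations for proteins with at least min_helices predicted helices.

    The min_helices threshold is an assumed, adjustable cut-off.
    """
    annotations = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        seq_id, n_helices = predicted_helices(line)
        if n_helices >= min_helices:
            annotations.append((seq_id, INTEGRAL_TO_MEMBRANE, "IEA"))
    return annotations


if __name__ == "__main__":
    # Hypothetical predictions for two made-up proteins.
    example = [
        "PROT_A\tlen=300\tExpAA=45.10\tFirst60=0.00\tPredHel=2\tTopology=o10-32i45-67o",
        "PROT_B\tlen=120\tExpAA=0.02\tFirst60=0.00\tPredHel=0\tTopology=o",
    ]
    print(tmhmm2go(example))
```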

Semi-automated Methods

Phenotype2GO Mappings for IMP annotations

WormBase uses a well-defined phenotype ontology to annotate gene-allele and RNAi-based phenotypes. Initially, a set of ~55 phenotype terms (mostly phenotypes observed in large-scale RNAi experiments) was mapped to high-level GO terms. These mappings were then used to annotate genes to the Biological Process ontology, using the IMP evidence code.

Recently, we have mapped an additional 146 phenotype terms to GO terms. This set contains phenotype terms that have been used to manually annotate alleles, based on experimental evidence, from the published literature. These mappings often result in GO terms that are of deeper granularity. For example, the phenotype term 'centrosome pair and associated pronuclear rotation abnormal' is mapped to the GO term 'centrosomal and pronuclear rotation', based on a careful reading of the definitions of the phenotype and GO terms. These phenotype2go_term mappings are then used to semi-automatically attach GO terms to genes based on their allele phenotypes (see below). In all, 201 allele and RNAi-based phenotypes have been mapped to a GO term. The complete list of WormBase phenotype2go mappings can be found here: http://www.wormbase.org/wiki/index.php/Gene_Ontology


Development of a new phenotype2GO data pipeline

We are developing a data pipeline that allows a closer synchronization between the efforts of the manual GO and Phenotype curation projects at WormBase. This pipeline involves a semi-automated method of attaching GO terms to genes based on their manually curated phenotype terms, which have been mapped to GO terms by curators (see above section).
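The core of such a pipeline can be illustrated with a short Python sketch. It is a simplified picture under stated assumptions rather than the pipeline under development: the single mapping entry is the example quoted earlier in this report, term names are used in place of identifiers, and the function name and gene names are hypothetical.

```python
# One curator-made mapping, taken from the example in this report.
# A real table would hold all mapped phenotype terms, keyed by identifier.
PHENOTYPE2GO = {
    "centrosome pair and associated pronuclear rotation abnormal":
        "centrosomal and pronuclear rotation",
}


def phenotype2go_annotations(gene_phenotypes, mapping=PHENOTYPE2GO):
    """Return sorted (gene, GO term, 'IMP') tuples for phenotypes that have a mapping.

    gene_phenotypes is a hypothetical input: {gene: [curated phenotype terms]}.
    Phenotypes without a mapping are skipped, so they produce no annotation.
    """
    annotations = set()
    for gene, phenotypes in gene_phenotypes.items():
        for phenotype in phenotypes:
            go_term = mapping.get(phenotype)
            if go_term is not None:
                annotations.add((gene, go_term, "IMP"))
    return sorted(annotations)


if __name__ == "__main__":
    curated = {
        "gene-1": ["centrosome pair and associated pronuclear rotation abnormal"],
        "gene-2": ["some unmapped phenotype"],
    }
    for row in phenotype2go_annotations(curated):
        print(row)
```

Keeping the mapping table separate from the annotation step means that a curator who corrects or extends a mapping changes every derived annotation at the next run, which is the synchronization the pipeline is meant to provide.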

We are currently working to import these annotations into our newly developed tool for GO annotation, the Ontology Annotator (see below), and to display them there so that a GO curator can inspect these semi-automated annotations in the course of GO annotation. This would allow curators to verify the accuracy of the annotations and to improve or add to the mappings themselves.

Priorities for Annotation

Our annotation priorities are as follows:

1) Reference Genome genes

2) Genes presented for annotation via our Textpresso-based semi-automated Cellular Component curation pipeline

3) Genes from training set papers used for piloting semi-automated Textpresso-based Molecular Function curation

4) Newly described genes for which previous annotation was not available

5) C. elegans orthologs of human disease genes

6) Phenotype2GO and InterPro2GO annotations are updated with each release.


Presentations and Publications

Publications, Talks, Posters 2010-

Ontology Development Contributions

WormBase curators have contributed to ontology discussion and development in the areas of:

  • cilium terms (updates/revisions to terms added in 2005)
  • octopamine/tyramine signaling involved in the response to food (and the regulation terms) (2010)
  • alpha-tubulin acetylation (2010)
  • phagosome maturation involved in apoptotic cell clearance (2010)
  • phagosome acidification involved in apoptotic cell clearance (2010)
  • phagolysosome assembly involved in apoptotic cell clearance (2010)
  • phagosome-lysosome docking involved in apoptotic cell clearance (2010)
  • phagosome-lysosome fusion involved in apoptotic cell clearance (2010)
  • neuropeptide receptor binding (2010)
  • striated muscle contraction involved in embryonic body morphogenesis (2010)
  • striated muscle myosin thick filament assembly (2010)
  • striated muscle paramyosin thick filament assembly (2010)
  • determination of left/right asymmetry in the nervous system (2010)
  • regulation of locomotion (including positive and negative regulation child terms) involved in locomotory behavior (2010)
  • detoxification of arsenic (2010)
  • chondroitin sulfate proteoglycan binding (2010)
  • chondroitin sulfate binding (2010)
  • regulation (includes positive and negative regulation child terms) of nematode larval development (2010)
  • regulation of (includes positive and negative regulation terms) dauer larval development (2010)

Annotation Outreach and User Advocacy Efforts

Kimberly Van Auken continues to participate in the go-help rotation. Ranjana Kishore continues to participate in the efforts of the GO News group.


Other Highlights

Curation Tools: Ontology Annotator

We have developed a new, web-based curation tool, the Ontology Annotator, that can be used to annotate genes to any ontology, including the Gene Ontology and the WormBase Phenotype Ontology. The Ontology Annotator incorporates and expands upon much of the functionality of the Phenote curation tool. Some of its more useful features include bulk annotation capabilities, autocomplete functions, and retrieval and filtering of existing data for editing.

Semi-Automated Molecular Function Curation

In addition to using Textpresso for Cellular Component Curation, we have developed pipelines for semi-automated Molecular Function (MF) curation. Given that Molecular Function annotations can be made based upon evidence from a wide variety of experiments, our approach has been to divide molecular function curation into two broad categories: 1) macromolecular interactions and 2) enzymatic and transporter activities.

For the former, we employ a two-step curation pipeline. First, we use an SVM-based document classification algorithm to identify new papers likely to contain reports of macromolecular interactions. These papers are then searched using Textpresso categories developed specifically for identifying sentences describing macromolecular interactions. Our initial results indicate that the SVM performs very well at predicting high-confidence true positive papers (precision=86.4%) as well as true negative papers (precision=88.9%), but the initial Textpresso categories, while having a recall of 89.7%, have a relatively low precision (precision=47.3%) in identifying true positive sentences. In practice, this means that while curators are able to retrieve and curate nearly all macromolecular interactions, the curation efficiency needs to be improved. We are in the process of fine-tuning the Textpresso categories for macromolecular interactions in hopes of improving curation efficiency.
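To make the trade-off concrete, the short Python sketch below computes two derived quantities from the sentence-level figures reported above: the F1 score and the average number of retrieved sentences a curator reads per true positive. Neither value is reported in the evaluation itself; both are simple arithmetic on the stated precision and recall, and the function names are illustrative.

```python
def precision(true_pos, false_pos):
    """Fraction of retrieved items that are true positives."""
    return true_pos / (true_pos + false_pos)


def recall(true_pos, false_neg):
    """Fraction of all true items that were retrieved."""
    return true_pos / (true_pos + false_neg)


def f1_score(p, r):
    """Harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r)


def sentences_per_true_positive(p):
    """Average number of retrieved sentences read to find one true positive."""
    return 1 / p


if __name__ == "__main__":
    # Sentence-level values reported for the initial Textpresso categories.
    p, r = 0.473, 0.897
    print(f"F1 = {f1_score(p, r):.3f}")
    print(f"sentences read per true positive = {sentences_per_true_positive(p):.2f}")
```

At a precision of 47.3%, a curator reads a little over two retrieved sentences for each one that actually describes an interaction, which is the efficiency cost noted above.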

In parallel to these efforts, we are also investigating semi-automated curation methods for annotating enzymatic and transporter activities. For this data type, we are developing Textpresso categories as well as training a Hidden Markov Model (HMM) to identify curatable sentences. For the former, we have collected 207 papers and 144 sentences for this data type and.... [NEED TO UPDATE NUMBERS]

Textpresso-Based Curation Pipelines for Other MODs

We have been collaborating with The Arabidopsis Information Resource (TAIR) and dictyBase to develop and implement Textpresso-based curation pipelines for Cellular Component annotation. At present, we are working with TAIR to develop a pipeline by which they can perform Textpresso searches on the Arabidopsis corpus from 2008 to identify potentially new annotations that were not captured by their existing GO curation pipeline. The results of these searches are presented in a curation form that will allow TAIR curators to make new annotations and retrieve them in both a gene_association file format and a more basic, three-column format that is similar to their user submission format. We hope to generalize this pipeline so that other groups, including dictyBase, will be able to perform Textpresso queries, send the results to a curation form, and retrieve the output of the curation as a gene_association file.
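As a rough illustration of the final export step, the Python sketch below writes one curated annotation as a tab-delimited gene_association line. It assumes the 15-column layout of the GO annotation file format; the column labels, the choice of required fields, the function name and all values in the example record are placeholders, not output of the curation form.

```python
# Assumed 15-column gene_association layout; labels are descriptive only.
GAF_COLUMNS = [
    "DB", "DB_Object_ID", "DB_Object_Symbol", "Qualifier", "GO_ID",
    "DB_Reference", "Evidence_Code", "With_From", "Aspect",
    "DB_Object_Name", "DB_Object_Synonym", "DB_Object_Type",
    "Taxon", "Date", "Assigned_By",
]

# Fields this sketch refuses to leave empty (an assumption, not the full spec).
REQUIRED = ("DB", "DB_Object_ID", "GO_ID", "DB_Reference", "Evidence_Code", "Aspect")


def gaf_line(record):
    """Serialize one annotation dict as a tab-delimited line; optional fields may be empty."""
    missing = [col for col in REQUIRED if not record.get(col)]
    if missing:
        raise ValueError("missing required fields: " + ", ".join(missing))
    return "\t".join(str(record.get(col, "")) for col in GAF_COLUMNS)


if __name__ == "__main__":
    # Placeholder record: identifiers, reference and date are made up.
    record = {
        "DB": "TAIR",
        "DB_Object_ID": "locus:0000000",
        "DB_Object_Symbol": "GENE1",
        "GO_ID": "GO:0005634",  # nucleus
        "DB_Reference": "PMID:0000000",
        "Evidence_Code": "IDA",
        "Aspect": "C",  # Cellular Component
        "DB_Object_Type": "protein",
        "Taxon": "taxon:3702",  # Arabidopsis thaliana
        "Date": "20091201",
        "Assigned_By": "TAIR",
    }
    print(gaf_line(record))
```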