WormBase Gene Ontology Progress Report, December 2010
Developer, WormBase, Caltech, Pasadena, CA.
Curator, WormBase, Caltech, Pasadena, CA.
PI, WormBase, Caltech, Pasadena, CA.
Kimberly Van Auken
Curator, WormBase, Caltech, Pasadena, CA.
Phenotype Curation, WormBase, Caltech, Pasadena, CA.
Phenotype Curation, WormBase, Caltech, Pasadena, CA.
Bioinformatician, Developer, WormBase, Caltech, Pasadena, CA
Additional technical support:
WormBase, Sanger Center, Hinxton, UK
WormBase, Sanger Center, Hinxton, UK
Hans Michael Muller
Project Leader, Textpresso, Caltech, Pasadena, CA
Developer, Textpresso, Caltech, Pasadena, CA
Table 1: Number of Genes Annotated
|Type of Annotation||Number of Genes Annotated, Dec 2010||% Change from Dec 2009||Number of Unique GO Terms||Total Number of GO Terms|
Methods and Strategies for Annotation
Manual curation of the C. elegans literature remains our highest curation priority, accounting for ~90% of our total curation effort. Curators use a GO curation check-out form that gives them easy visual access to the curation status of all named C. elegans genes, e.g. vha-6 or egl-9. Genes are displayed in a list that includes the current number of published papers (references) indexed to each gene and the most recent date on which annotations to any of the three ontologies were made. Curators can query and sort the list according to reference count, gene name, and curation status.
InterPro2GO mappings for IEA annotations: These annotations assign GO terms to C. elegans proteins based on electronic matching of protein motifs/domains to those documented in the InterPro database (http://www.ebi.ac.uk/interpro/) and on the mapping of those motifs/domains to GO terms provided by the InterPro2GO file generated by the EBI (PMID:12654719, PMID:12520011). Because these annotations are not reviewed for accuracy by human curators, all of them use the evidence code 'IEA'.
TMHMM2GO mapping for IEA annotations: We also include the results of an internal pipeline that maps proteins containing a transmembrane domain, as predicted by the TMHMM algorithm, to the GO Cellular Component term 'integral to membrane'. About 6,710 gene products are annotated to this term via this pipeline.
InterPro2GO and TMHMM2GO annotations are updated at every database release.
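The two electronic mappings above can be pictured with a minimal sketch. It assumes the InterPro2GO file uses the usual external2go line layout (shown in the comment) and that TMHMM results have already been reduced to a count of predicted helices per protein. The function names, protein identifiers and sample inputs are hypothetical and are not taken from the WormBase build scripts.

```python
import re
from collections import defaultdict

# Assumed external2go line layout, e.g.:
#   InterPro:IPR000003 Retinoid X receptor/HNF4 > GO:DNA binding ; GO:0003677
LINE_RE = re.compile(r"^InterPro:(IPR\d+)\s.*;\s*(GO:\d{7})\s*$")

# GO Cellular Component term 'integral to membrane'
INTEGRAL_TO_MEMBRANE = "GO:0016021"


def parse_interpro2go(lines):
    """Return {InterPro accession: set of GO IDs}, skipping '!' comment lines."""
    mapping = defaultdict(set)
    for line in lines:
        if line.startswith("!"):
            continue
        match = LINE_RE.match(line.strip())
        if match:
            mapping[match.group(1)].add(match.group(2))
    return mapping


def interpro_iea(protein_domains, mapping):
    """Yield (protein, GO ID, 'IEA', source) for every mapped domain hit."""
    for protein, domains in sorted(protein_domains.items()):
        for ipr in sorted(domains):
            for go_id in sorted(mapping.get(ipr, ())):
                yield (protein, go_id, "IEA", ipr)


def tmhmm_iea(helix_counts):
    """Yield an 'integral to membrane' IEA annotation for each protein
    with at least one predicted transmembrane helix."""
    for protein, n_helices in sorted(helix_counts.items()):
        if n_helices > 0:
            yield (protein, INTEGRAL_TO_MEMBRANE, "IEA", "TMHMM")
```

Because both functions only read their inputs, rerunning them at every database release simply regenerates the full set of IEA annotations from the current files.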
Review and improvement of the Phenotype2GO data pipeline
WormBase uses a well-defined phenotype ontology to annotate gene-allele and RNAi-based phenotypes. A total of 201 phenotype terms used in annotation have been mapped to a GO term. At every WormBase database build, these mappings are used to automatically generate Biological Process annotations to genes using the IMP evidence code. The complete list of WormBase phenotype to GO term mappings can be found here: http://www.wormbase.org/wiki/index.php/Gene_Ontology. We have begun a detailed review of our phenotype to GO term mappings and are changing the pipeline so that annotations are made with a stricter use of the IMP evidence code, as recently described in GO Consortium annotation policies. This process will involve removing some high-level phenotype term to GO term mappings, excluding certain RNAi experiments/papers from the pipeline, and reviewing and revising the scripts.
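As a rough illustration of the pipeline logic described above, the sketch below turns phenotype observations into IMP annotations and shows where the two planned restrictions (dropping high-level phenotype mappings and excluding certain experiments or papers) could be applied. All identifiers, field names and the function name are placeholders chosen for illustration; they are not taken from the WormBase mapping file or scripts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    gene: str        # gene name
    phenotype: str   # phenotype ontology term ID (placeholder values below)
    source: str      # allele, RNAi experiment or paper identifier


def phenotype2go(observations, pheno_to_go,
                 excluded_phenotypes=frozenset(),
                 excluded_sources=frozenset()):
    """Return sorted, de-duplicated (gene, GO ID, 'IMP', source) tuples.

    Observations are skipped when their phenotype term has been judged too
    high-level or when their source has been excluded from the pipeline.
    """
    annotations = set()
    for obs in observations:
        if obs.phenotype in excluded_phenotypes:
            continue
        if obs.source in excluded_sources:
            continue
        go_id = pheno_to_go.get(obs.phenotype)
        if go_id is not None:
            annotations.add((obs.gene, go_id, "IMP", obs.source))
    return sorted(annotations)


# Placeholder data: PHENO:* and GO:* IDs are invented for this example.
mapping = {"PHENO:0001": "GO:0000001", "PHENO:0002": "GO:0000002"}
observed = [
    Observation("gene-1", "PHENO:0001", "paperA"),
    Observation("gene-2", "PHENO:0002", "rnai_screen_B"),
    Observation("gene-3", "PHENO:0003", "paperC"),  # unmapped phenotype
]
strict = phenotype2go(observed, mapping,
                      excluded_sources=frozenset({"rnai_screen_B"}))
```

In this toy run only the gene-1 annotation survives the stricter settings: gene-2 comes from an excluded source and gene-3 has no mapped GO term.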
Textpresso-Based Cellular Component Curation
As a complementary approach to our manual curation pipeline, we continue to employ the Textpresso information retrieval system to annotate C. elegans gene products to the Cellular Component ontology. Textpresso searches for component annotations use three different categories (Assay Terms, Component Terms, and Verbs) plus a category of C. elegans protein names. Searches using these categories return sentences that contain a match to at least one term in each category. We use this approach to annotate newly published papers as well as papers published prior to 2010. For newly published papers, we prioritize our searches by first searching through papers that, as determined by a Support Vector Machine document classifier, have a relatively high probability of containing expression data.
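The rule that a sentence must match at least one term in every category can be sketched as follows. This is only an illustration of the matching rule, not Textpresso itself: the term lists are tiny invented samples and the example sentences were written for this sketch.

```python
import re


def matches_category(sentence, terms):
    """True if any term occurs in the sentence as a whole word or phrase."""
    return any(
        re.search(r"\b" + re.escape(term) + r"\b", sentence, re.IGNORECASE)
        for term in terms
    )


def category_search(sentences, categories):
    """Keep sentences that match at least one term in every category."""
    return [
        s for s in sentences
        if all(matches_category(s, terms) for terms in categories.values())
    ]


# Invented sample categories; the real category contents are not shown here.
categories = {
    "assay_terms": ["GFP", "antibody", "immunostaining"],
    "component_terms": ["nucleus", "apical membrane", "mitochondria"],
    "verbs": ["localizes", "localized", "expressed", "detected"],
    "protein_names": ["VHA-6", "EGL-9"],
}

sentences = [
    "VHA-6::GFP localizes to the apical membrane of intestinal cells.",
    "EGL-9 is required for normal egg laying.",
]

hits = category_search(sentences, categories)
```

Only the first sentence is returned, because the second lacks an assay term, a component term and a verb from the sample lists.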
Over the past year, through our manual and Textpresso-based pipelines, we added new cellular component annotations to 292 genes. Of these, 113 genes were annotated from papers published in 2010, with the remainder of the annotations coming from previously published papers. For the 2010 papers, Textpresso’s annotation recall was 91.3%. We have not yet measured the recall on papers annotated this year but published prior to 2010.
Priorities for Annotation
Our annotation priorities are as follows:
1) Reference Genome genes
2) Genes presented for annotation via our Textpresso-based semi-automated Cellular Component curation pipeline
3) Genes from training set papers used for piloting semi-automated Textpresso-based Molecular Function curation
4) Newly described genes for which previous annotation was not available
5) C. elegans orthologs of human disease genes
Presentations and Publications
Ontology Development Contributions
WormBase curators have contributed to ontology discussion and development in the areas of:
Biology of the cilium:
- updates/revisions to terms added in 2005
Biology of the phagosome-lysosome during apoptotic cell clearance, terms added:
- phagosome maturation involved in apoptotic cell clearance
- phagosome acidification involved in apoptotic cell clearance
- phagolysosome assembly involved in apoptotic cell clearance
- phagosome-lysosome docking involved in apoptotic cell clearance
- phagosome-lysosome fusion involved in apoptotic cell clearance
Biology of muscle, terms added:
- striated muscle contraction involved in embryonic body morphogenesis
- striated muscle myosin thick filament assembly
- striated muscle paramyosin thick filament assembly (2010)
- alpha-tubulin acetylation
Biology of nematode larval development, terms added:
- regulation (includes positive and negative regulation child terms) of nematode larval development
- regulation (includes positive and negative regulation terms) of dauer larval development
Other terms added were:
- neuropeptide receptor binding
- determination of left/right asymmetry in the nervous system
- regulation of locomotion (including positive and negative regulation child terms) involved in locomotory behavior
- detoxification of arsenic
- chondroitin sulfate proteoglycan binding
- chondroitin sulfate binding
- octopamine/tyramine signaling involved in the response to food (and the regulation terms)
Annotation Outreach and User Advocacy Efforts
Kimberly Van Auken continues to participate in the go-help rotation. Ranjana Kishore continues to participate in the efforts of the GO News group.
Curation Tools: Ontology Annotator
We have continued development of our web-based curation tool, the Ontology Annotator, which can be used to annotate genes to any ontology, including the Gene Ontology and the WormBase Phenotype Ontology. The Ontology Annotator incorporates and expands upon much of the functionality of the Phenote curation tool. Some of the more useful features of the tool include bulk annotation, autocomplete, and retrieval and filtering of existing data for editing. We have improved the functionality of existing annotation interfaces and have added the following curation interfaces that are fully functional: antibody, small molecule and gene regulation. We are currently working on two new curation interfaces: gene regulation and expression pattern related pictures.
Textpresso- and HMM-Based Molecular Function Curation
In addition to using Textpresso for Cellular Component Curation, we have developed pipelines for semi-automated Molecular Function (MF) curation. Given that Molecular Function annotations can be made based upon evidence from a wide variety of experiments, our approach has been to divide molecular function curation into two broad categories: 1) macromolecular interactions and 2) enzymatic and transporter activities.
For macromolecular interactions, we employ a two-step curation pipeline. First, we use an SVM-based document classification algorithm to identify new papers likely to contain reports of macromolecular interactions. These papers are then searched for matching sentences using Textpresso categories developed specifically for identifying sentences describing macromolecular interactions. Our initial results indicate that the SVM performs very well at predicting high-confidence true positive papers (precision=86.4%) as well as true negative papers (precision=88.9%). The Textpresso categories also have high recall (89.7%) but relatively low precision (47.3%). In practice, this means that while curators are able to retrieve and curate nearly all macromolecular interactions, they must still examine a number of false positive sentences, so curation efficiency needs to be improved. We are in the process of fine-tuning the Textpresso categories for macromolecular interactions to address this issue. The new Textpresso categories for this data type are scheduled to be in place at the end of December 2010.
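To make the practical cost of the reported recall and precision concrete, the short calculation below estimates how many false positive sentences a curator would read for a given number of truly curatable sentences. The figure of 100 curatable sentences is an assumption used only for illustration, and the function name is hypothetical.

```python
def review_load(n_true, recall_rate, precision_rate):
    """Estimate (true positives retrieved, false positives retrieved).

    recall    = TP / (TP + FN)  ->  TP = n_true * recall
    precision = TP / (TP + FP)  ->  TP + FP = TP / precision
    """
    true_positives = n_true * recall_rate
    retrieved = true_positives / precision_rate
    false_positives = retrieved - true_positives
    return true_positives, false_positives


# Assumed workload: 100 sentences that truly describe an interaction.
tp, fp = review_load(100, recall_rate=0.897, precision_rate=0.473)
print(f"retrieved true positives: {tp:.1f}")
print(f"false positives to review: {fp:.1f}")
```

Under this assumption, curators would see roughly one false positive sentence for every true positive, which is the inefficiency that the category fine-tuning aims to reduce.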
Enzymatic and Transporter Activities
In parallel to these efforts, we are also investigating semi-automated curation methods for annotating enzymatic and transporter activities. For this data type, we are developing Textpresso categories as well as training a Hidden Markov Model (HMM) to identify curatable sentences. For the Textpresso categories, we have collected 419 sentences from 64 papers. Using these sentences, we have developed two new Textpresso categories for enzymatic and transporter activities. As with the new macromolecular interaction categories, we plan to implement these categories on Textpresso by the end of December 2010.
In collaboration with Hans-Michael Mueller, we are also training a Hidden Markov Model to identify sentences describing enzymatic and transporter activities. At present, we are in the third round of training and evaluation for the model. We hope to complete an initial evaluation of the model by early next year and will report on its performance.
Textpresso-Based identification of literature for human disease gene orthologs
Human disease gene orthologs are a high-priority annotation list for WormBase and other model organism databases. Sequence-based or similarity searches for human disease gene orthologs provide useful candidates, but the biological information needs to be extracted manually from the literature. We are working on a project that combines Textpresso-based categories and keyword searches to identify C. elegans papers that describe the study of a human disease gene ortholog. Sentences in which a C. elegans gene co-occurs with a human disease term were deemed important for this process. A new 'human disease' Textpresso category was formed using the human disease ontology in the OBO Foundry, the Neuroscience Information Framework Standardized (NIFSTD) ontology and the existing Textpresso human disease category. Several disease terms that would increase the number of false positives were removed iteratively by a manual process. Because an 'AND' query on just two categories (human disease and C. elegans gene) returned too many false positives, a third category was formed with the words 'ortholog', 'homolog', 'similar', 'relate' and 'model', and added to the 'AND' query. The query is being fine-tuned to increase precision and recall. Once established, it will automate the flagging of new articles that contain disease gene ortholog data. Subsequently, this method can be used to extract the relevant information.
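The effect of adding the third category to the 'AND' query can be sketched as below. The five relation words come from the text above and are matched here as word prefixes (so 'relate' also matches 'related'), which is an assumption about how they are applied. The gene and disease lists and both sentences are invented for this example.

```python
import re

# The third category described above, treated as word prefixes (assumption).
RELATION_STEMS = ("ortholog", "homolog", "similar", "relate", "model")


def has_term(sentence, terms):
    """True if any term starts at a word boundary in the sentence."""
    return any(
        re.search(r"\b" + re.escape(term), sentence, re.IGNORECASE)
        for term in terms
    )


def and_query(sentences, *categories):
    """Keep sentences matching at least one term from every category."""
    return [s for s in sentences if all(has_term(s, c) for c in categories)]


# Invented sample categories and sentences.
genes = ("sel-12", "daf-2")
diseases = ("Alzheimer's disease", "cancer")
sentences = [
    "sel-12 is an ortholog of human presenilin, which is mutated in "
    "familial Alzheimer's disease.",
    "The daf-2 strain was a gift from a cancer research laboratory.",
]

two_category_hits = and_query(sentences, genes, diseases)
three_category_hits = and_query(sentences, genes, diseases, RELATION_STEMS)
```

Both sentences pass the two-category query, but only the first passes once the relation category is required, which is the kind of false positive the third category is meant to remove.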
Textpresso-Based Curation Pipelines for Other MODs
We have been collaborating with The Arabidopsis Information Resource (TAIR) and dictyBase to develop and implement Textpresso-based curation pipelines for Cellular Component annotation. At present, we are working with TAIR to develop a pipeline by which they can perform Textpresso searches on the Arabidopsis corpus from 2008 to identify potentially new annotations that were not captured by their existing GO curation pipeline. The results of these searches are presented in a curation form that will allow TAIR curators to make new annotations and retrieve them in both a gene_association file format and a more basic, three-column format that is similar to their user submission format. We hope to generalize this pipeline so that other groups, including dictyBase, will be able to perform Textpresso queries, send the results to a curation form, and retrieve the output of the curation as a gene_association file.
Collaboration with BioGRID
In late summer, we began a collaboration with BioGRID (Biological General Repository for Interaction Datasets) to add protein binding annotations (IPI evidence code) curated for GO to BioGRID. We also plan to add curated genetic interactions (IGI evidence code) to BioGRID. Our initial work will focus on adding interactions curated as part of the Reference Genome's Wnt signaling pathway annotation project. Concurrently, we will also add any newly curated protein binding annotations (Wnt pathway or otherwise) to BioGRID.