WormBase, March 2010

WormBase Gene Ontology Progress Report, March 2010

Staff

Paul Sternberg

PI, WormBase, Caltech, Pasadena, CA.

Ranjana Kishore

Curator, WormBase, Caltech, Pasadena, CA.

Kimberly Van Auken

Curator, WormBase, Caltech, Pasadena, CA.

Juancarlos Chan

Developer, WormBase, Caltech, Pasadena, CA.


Jolene Fernandes

Phenotype Curation, WormBase, Caltech, Pasadena, CA.

Gary Schindelman

Phenotype Curation, WormBase, Caltech, Pasadena, CA.

Ruihua Fang

Curation Automation Pipelines, WormBase, Caltech, Pasadena, CA.


Additional technical support:

Anthony Rogers

WormBase, Sanger Center, Hinxton, UK

Gary Williams

WormBase, Sanger Center, Hinxton, UK


Textpresso:

Hans-Michael Müller

Project Leader, Textpresso, Caltech, Pasadena, CA

Arun Rangarajan

Developer, Textpresso, Caltech, Pasadena, CA


Annotation Progress

Table 1: Number of Genes Annotated

Type of Annotation | Number of Genes Annotated, March 2010 | % Change from Sep 2009 | Number of Unique GO Terms | Total Number of GO Terms
Manual Annotation | 1803 | +7.1 | 1666 | 9210
Phenotype2GO Annotations | 6547 | +37.3 | 105 | 45177
IEA/Electronic Annotations | 12807 | +40.5 | 1498 | 53057
Total Annotations | 15553 | +6.4 | 2799 | 107444


Methods and Strategies for Annotation

Literature Curation

Manual curation of the C. elegans literature remains our highest curation priority, accounting for ~90% of our total curation efforts. We have implemented a GO curation check-out form that gives curators easy visual access to the curation status of all named C. elegans genes, e.g. vha-6 or egl-9. Genes are displayed in a list that includes the current number of published papers (references) indexed to each gene and the most recent date on which annotations to any of the three ontologies were made. Curators can query and sort the list according to reference count, gene name, and curation status.
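As an illustration only, the kind of list described above could be modelled as follows. This is a minimal sketch, not the WormBase check-out form itself: the class name, field names, reference counts, dates, and the placeholder gene xyz-1 are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class GeneCurationRecord:
    """One row of a hypothetical curation check-out list."""
    gene_name: str                  # e.g. "vha-6"
    reference_count: int            # papers indexed to the gene
    last_annotated: Optional[date]  # None if no GO annotation has been made yet

    @property
    def curated(self) -> bool:
        return self.last_annotated is not None


def uncurated_by_reference_count(
    records: List[GeneCurationRecord],
) -> List[GeneCurationRecord]:
    """Return genes with no annotation date, most-referenced first."""
    pending = [r for r in records if not r.curated]
    return sorted(pending, key=lambda r: (-r.reference_count, r.gene_name))


# Placeholder values for illustration; not real WormBase data.
records = [
    GeneCurationRecord("vha-6", 12, date(2009, 11, 3)),
    GeneCurationRecord("egl-9", 40, None),
    GeneCurationRecord("xyz-1", 7, None),
]

for record in uncurated_by_reference_count(records):
    print(record.gene_name, record.reference_count)
```

Sorting uncurated genes by reference count is one possible way to surface the genes with the most literature waiting to be curated; other sort keys (gene name, curation status) would follow the same pattern.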

Computational Methods

InterPro2GO Mappings

InterPro2GO mappings for IEA annotations: C. elegans proteins are annotated to GO terms by electronically matching their protein motifs/domains to those documented in the InterPro database (http://www.ebi.ac.uk/interpro/) and then applying the motif/domain-to-GO-term mappings provided in the interpro2go file generated by the EBI (PMID:12654719, PMID:12520011). These annotations are not reviewed for accuracy by human curators and therefore all carry the evidence code 'IEA'.
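The mapping step described above can be sketched roughly as follows. The sketch assumes the mapping file uses lines of the form `InterPro:<id> <name> > GO:<term name> ; GO:<id>` with `!` marking comment lines; the sample lines, the protein identifier PROTEIN_A, and the function names are illustrative assumptions, not the WormBase pipeline.

```python
import re
from collections import defaultdict

# Assumed line layout: "InterPro:IPR000003 <name> > GO:<term name> ; GO:0003677"
LINE_RE = re.compile(r"^InterPro:(IPR\d+)\s.*>\s*GO:.*;\s*(GO:\d{7})\s*$")

# Illustrative sample lines; a real file would be read from disk.
SAMPLE_LINES = [
    "!comment lines start with an exclamation mark",
    "InterPro:IPR000003 Retinoid X receptor/HNF4 > GO:DNA binding ; GO:0003677",
]


def parse_interpro2go(lines):
    """Build a dict of InterPro accession -> set of GO ids."""
    mapping = defaultdict(set)
    for line in lines:
        if line.startswith("!"):
            continue  # skip header/comment lines
        match = LINE_RE.match(line.strip())
        if match:
            mapping[match.group(1)].add(match.group(2))
    return mapping


def iea_annotations(protein_domains, mapping):
    """Turn protein -> InterPro hits into (protein, GO id, 'IEA', source) tuples."""
    annotations = set()
    for protein, domains in protein_domains.items():
        for ipr in domains:
            for go_id in mapping.get(ipr, ()):
                annotations.add((protein, go_id, "IEA", f"InterPro:{ipr}"))
    return sorted(annotations)


mapping = parse_interpro2go(SAMPLE_LINES)
# PROTEIN_A is a hypothetical protein with one mapped and one unmapped domain.
print(iea_annotations({"PROTEIN_A": ["IPR000003", "IPR999999"]}, mapping))
```

Domains with no entry in the mapping simply yield no annotation, which matches the fully electronic, unreviewed nature of IEA annotations.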

TMHMM Predicted Membrane Proteins

We also make annotations based upon the results of a transmembrane HMM, TMHMM. All C. elegans proteins predicted to be membrane-spanning proteins by this model are annotated to the GO Cellular Component term, GO:0016021, integral to membrane, using the IEA evidence code.
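A rough sketch of this rule is shown below. It assumes TMHMM's one-line-per-protein short output (tab-separated `key=value` fields including `PredHel`, the number of predicted helices) and assumes that "predicted to be membrane-spanning" means at least one predicted helix; that threshold, the sample lines, and the protein identifiers are assumptions, not the WormBase implementation.

```python
INTEGRAL_TO_MEMBRANE = "GO:0016021"  # GO term named in the text above

# Hypothetical sample lines in the assumed short output layout.
SAMPLE_OUTPUT = [
    "PROTEIN_A\tlen=350\tExpAA=44.10\tFirst60=21.50\tPredHel=2\tTopology=i12-34o50-72i",
    "PROTEIN_B\tlen=120\tExpAA=0.02\tFirst60=0.00\tPredHel=0\tTopology=o",
]


def parse_tmhmm_short(line):
    """Return (protein id, number of predicted TM helices) for one line."""
    fields = line.rstrip("\n").split("\t")
    protein_id = fields[0]
    values = dict(field.split("=", 1) for field in fields[1:])
    return protein_id, int(values["PredHel"])


def membrane_annotations(lines, min_helices=1):
    """Annotate proteins with >= min_helices predicted helices (assumed cutoff)."""
    annotations = []
    for line in lines:
        protein_id, helices = parse_tmhmm_short(line)
        if helices >= min_helices:
            annotations.append((protein_id, INTEGRAL_TO_MEMBRANE, "IEA"))
    return annotations


print(membrane_annotations(SAMPLE_OUTPUT))
```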

Semi-automated Methods

Phenotype2GO Mappings for IMP annotations

WormBase uses a well-defined phenotype ontology to annotate gene-allele and RNAi-based phenotypes. Initially, a set of ~55 phenotype terms (mostly phenotypes observed in large-scale RNAi experiments) was mapped to high-level GO terms. These mappings were then used to annotate genes to the Biological Process ontology, using the IMP evidence code.

We now have over 200 phenotype terms mapped to GO terms. This set now contains phenotype terms that have been used to manually annotate alleles, based on experimental evidence, from the published literature. These mappings often result in GO terms that are of deeper granularity. For example, the phenotype term 'centrosome pair and associated pronuclear rotation abnormal' is mapped to the GO term 'centrosomal and pronuclear rotation', based on a careful reading of the definitions of the phenotype and GO terms. These phenotype2go_term mappings are then used to semi-automatically attach GO terms to genes based on their allele phenotypes (see below). The complete list of WormBase phenotype2go mappings can be found here: http://www.wormbase.org/wiki/index.php/Gene_Ontology
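The semi-automated step described above, in which curated phenotype terms are translated into GO annotations through a curator-maintained mapping, might look roughly like this. The single mapping entry is the example given in the text; the gene name, the second phenotype term, and the record layout are hypothetical and chosen only to make the sketch runnable.

```python
# Curator-maintained mapping: phenotype term name -> GO term name.
# Only the entry below comes from the text; a real table has 200+ entries.
PHENOTYPE2GO = {
    "centrosome pair and associated pronuclear rotation abnormal":
        "centrosomal and pronuclear rotation",
}


def phenotype2go_annotations(gene_phenotypes, mapping=PHENOTYPE2GO):
    """Split curated phenotypes into IMP annotations and unmapped leftovers."""
    annotations, unmapped = [], []
    for gene, phenotypes in gene_phenotypes.items():
        for phenotype in phenotypes:
            go_term = mapping.get(phenotype)
            if go_term is None:
                # No mapping yet: keep for curator review.
                unmapped.append((gene, phenotype))
            else:
                annotations.append({
                    "gene": gene,
                    "go_term": go_term,
                    "evidence": "IMP",
                    "source_phenotype": phenotype,
                })
    return annotations, unmapped


# "gene-1" and "hypothetical unmapped phenotype" are placeholders.
curated = {
    "gene-1": [
        "centrosome pair and associated pronuclear rotation abnormal",
        "hypothetical unmapped phenotype",
    ],
}
annotations, unmapped = phenotype2go_annotations(curated)
print(annotations)
print(unmapped)
```

Returning the unmapped phenotypes separately is a design choice made for the sketch: it gives curators a natural list of terms that might deserve a new mapping.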


Development of a new phenotype2GO data pipeline

We are developing a data pipeline that allows a closer synchronization between the efforts of the manual GO and Phenotype curation projects at WormBase. This pipeline involves a semi-automated method of attaching GO terms to genes based on their manually curated phenotype terms, which have been mapped to GO terms by curators (see above section).

We are currently working to import these annotations into our newly developed tool for GO annotation, the Ontology Annotator (see below), and to display them there, so that a GO curator can inspect these semi-automated annotations in the course of GO annotation. This would allow curators to verify the accuracy of the annotations and to improve or add to the mappings themselves.

Priorities for Annotation

Our annotation priorities are as follows:

1) Reference Genome genes

2) Genes presented for annotation via our Textpresso-based semi-automated Cellular Component curation pipeline

3) Genes from training set papers used for piloting semi-automated Textpresso-based Molecular Function curation

4) Newly described genes for which previous annotation was not available

5) C. elegans orthologs of human disease genes

6) Phenotype2GO, InterPro2GO, and TMHMM-based annotations are updated with each release.

Presentations and Publications

The Gene Ontology Consortium. (2010). The Gene Ontology in 2010: extensions and refinements. Nucleic Acids Res. 38:D331-5.

Harris TW, Antoshechkin I, Bieri T, Blasiar D, Chan J, Chen WJ, De La Cruz N, Davis P, Duesbury M, Fang R, Fernandes J, Han M, Kishore R, Lee R, Müller HM, Nakamura C, Ozersky P, Petcherski A, Rangarajan A, Rogers A, Schindelman G, Schwarz EM, Tuli MA, Van Auken K, Wang D, Wang X, Williams G, Yook K, Durbin R, Stein LD, Spieth J, Sternberg PW. (2010). WormBase: a comprehensive resource for nematode research. Nucleic Acids Res. 38:D463-7.

Hill DP, Berardini TZ, Howe DG, Van Auken KM. (2009). Representing Ontogeny Through Ontology: A Developmental Biologist’s Guide to The Gene Ontology. Mol. Reprod. Dev. 77(4):314-29.

Ontology Development Contributions

WormBase curators have contributed to ontology discussion and development in the areas of nematode and dauer larval development, gastrulation, dense core granule, detoxification of arsenic, and chondroitin sulfate and chondroitin sulfate proteoglycan binding.

Annotation Outreach and User Advocacy Efforts

Kimberly Van Auken continues to participate in the go-help rotation. Ranjana Kishore continues to participate in the efforts of the GO News group.


Other Highlights

Highlighting Reference Genome Annotations

To bring more attention to the Reference Genome Project in WormBase, we have added a tag to all Reference Genome genes in our curation database. This tag will be propagated to the WormBase website and made visible to our users.

We also added a news item to the WormBase homepage announcing our participation in the project.

Curation Tools: Ontology Annotator

We have developed a new, web-based curation tool, the Ontology Annotator, that can be used to annotate genes to any ontology, including the Gene Ontology and the WormBase Phenotype Ontology. The Ontology Annotator incorporates and expands upon much of the functionality of the Phenote curation tool. Some of the more useful features of the tool include bulk annotation, autocomplete, data retrieval, and filtering of the retrieved data for editing.

Semi-Automated Molecular Function Curation

We continue to explore Textpresso-based GO curation by developing pipelines for semi-automated Molecular Function (MF) curation. Our curation plans involve a two-tiered approach encompassing: 1) document classification using SVMs (Support Vector Machines) and 2) category searches to identify curatable sentences within those documents identified as positives for Molecular Function data by SVM.
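The control flow of such a two-tiered approach can be sketched as below. This is only a schematic: the document classifier is a keyword stub standing in for a trained SVM, the category term lists are invented for the example, and a sentence is assumed to be "curatable" when it contains at least one term from every category. None of this reflects the actual Textpresso code or categories.

```python
def matches_all_categories(sentence, categories):
    """True if the sentence contains at least one term from every category."""
    text = sentence.lower()
    return all(
        any(term in text for term in terms)
        for terms in categories.values()
    )


def two_tier_search(papers, is_positive, categories):
    """Tier 1: keep papers the classifier accepts. Tier 2: category search."""
    hits = []
    for paper_id, sentences in papers.items():
        if not is_positive(sentences):
            continue  # paper rejected by the document classifier
        for sentence in sentences:
            if matches_all_categories(sentence, categories):
                hits.append((paper_id, sentence))
    return hits


def stub_classifier(sentences):
    """Stand-in for a trained SVM: flags papers that mention an assay."""
    return any("assay" in s.lower() for s in sentences)


# Hypothetical category term lists and paper sentences.
categories = {
    "assay_terms": {"yeast two-hybrid", "pull-down"},
    "verb_terms": {"binds", "interacts"},
}
papers = {
    "paper-1": [
        "Protein A binds protein B in a yeast two-hybrid assay.",
        "Worms were grown at 20 degrees.",
    ],
    "paper-2": ["Mutants are uncoordinated."],
}

print(two_tier_search(papers, stub_classifier, categories))
```

Passing the classifier in as a function keeps the two tiers independent, so a real trained model could replace the stub without touching the sentence search.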

Our initial efforts are focused on macromolecular interactions and enzymatic and transporter activities. For the former, we collected 248 sentences that describe experimentally determined binding activity and developed two new Textpresso categories for literature searching: MF_Int_Assay (206 terms) and MF_Int_Verbs (106 terms). Searches performed with these categories on SVM-positive papers indicate that the categories were able to identify curatable information from positive papers with ~85% recall and 75% precision. Further, the categories were able to retrieve three papers initially identified as false negatives by the SVMs, indicating that the two methods provide complementary approaches to semi-automated curation of GO terms from the published literature.
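For readers less familiar with these metrics, recall is the fraction of truly curatable items that the searches retrieve, and precision is the fraction of retrieved items that are truly curatable. The short sketch below computes both from counts of true positives, false positives, and false negatives. The counts used are hypothetical, chosen only so the output lands near the reported figures; they are not the actual evaluation data.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Return (precision, recall) from raw counts."""
    retrieved = true_pos + false_pos   # everything the search returned
    relevant = true_pos + false_neg    # everything that should be returned
    if retrieved == 0 or relevant == 0:
        raise ValueError("precision/recall undefined for empty sets")
    return true_pos / retrieved, true_pos / relevant


# Hypothetical counts for illustration only.
precision, recall = precision_recall(true_pos=85, false_pos=28, false_neg=15)
print(f"precision={precision:.2f} recall={recall:.2f}")
```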

In parallel with these efforts, we are also identifying papers and sentences that describe enzymatic and transporter activities. To date, we have collected 127 papers for document classification and 199 sentences for this data type. We are also beginning to explore hidden Markov model (HMM) approaches for identifying sentences containing Molecular Function data within the C. elegans corpus.
