Latest revision as of 12:14, 10 February 2023

WormBase Gene Ontology Progress Report, December 2010

Staff

Juancarlos Chan

Developer, WormBase, Caltech, Pasadena, CA.

Ranjana Kishore

Curator, WormBase, Caltech, Pasadena, CA.

Paul Sternberg

PI, WormBase, Caltech, Pasadena, CA.

Kimberly Van Auken

Curator, WormBase, Caltech, Pasadena, CA.

Jolene Fernandes

Phenotype Curation, WormBase, Caltech, Pasadena, CA.

Gary Schindelman

Phenotype Curation, WormBase, Caltech, Pasadena, CA.

Ruihua Fang

Bioinformatician, Developer, WormBase, Caltech, Pasadena, CA

Additional technical support:

Anthony Rogers

WormBase, Sanger Center, Hinxton, UK

Gary Williams

WormBase, Sanger Center, Hinxton, UK

Textpresso:

Hans-Michael Mueller

Project Leader, Textpresso, Caltech, Pasadena, CA

Arun Rangarajan

Developer, Textpresso, Caltech, Pasadena, CA

Annotation Progress

Table 1: Number of Genes Annotated

Type of Annotation	Number of Genes Annotated, Dec 2010	% Change from Dec 2009	Number of Unique GO Terms	Total Number of GO Terms
Manual Annotation	2,098	+19.8	1,840	10,467
Phenotype2GO Mappings	6,309	-6.1	113	42,349
IEA/Electronic	12,954	+0.83	1,476	55,091
Total	15,799	+0.84	2,937	107,907

Methods and Strategies for Annotation

Literature Curation

Manual curation of the C. elegans literature remains our highest curation priority, accounting for ~90% of our total curation effort. Curators use a GO curation check-out form that gives them easy visual access to the curation status of all named C. elegans genes, e.g. vha-6 or egl-9. Genes are displayed in a list that includes the current number of published papers (references) indexed to each gene and the date on which annotations to any of the three ontologies were last made. Curators can query and sort the list according to reference count, gene name, and curation status.

Computational Methods

InterPro2GO mappings for IEA annotations: These annotations assign C. elegans proteins to GO terms based on electronic matching of protein motifs/domains to those documented in the InterPro database (http://www.ebi.ac.uk/interpro/), combined with the mapping of those motifs/domains to GO terms provided by the InterPro2GO file generated by the EBI (PMID:12654719, PMID:12520011). Because these annotations are not reviewed for accuracy by human curators, all of them use the evidence code 'IEA'.
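
A minimal sketch of how such a mapping could be applied, assuming the plain-text interpro2go line layout `InterPro:<accession> <description> > GO:<term name> ; GO:<id>`. The sample line, the protein identifier `PROT-1`, and both function names are illustrative assumptions; this is not the WormBase pipeline itself.

```python
from collections import defaultdict


def parse_interpro2go(lines):
    """Build {InterPro accession: set of GO IDs} from interpro2go-style lines."""
    mapping = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("!"):  # '!' is assumed to mark header lines
            continue
        left, _, right = line.partition(" > ")
        if not right:
            continue  # not a mapping line
        accession = left.split()[0].split(":", 1)[1]
        go_id = right.rsplit(";", 1)[-1].strip()
        mapping[accession].add(go_id)
    return mapping


def iea_annotations(protein_domains, mapping):
    """Yield (protein, GO ID, evidence code, source) for every mapped domain hit."""
    for protein, accessions in protein_domains.items():
        for accession in accessions:
            for go_id in sorted(mapping.get(accession, ())):
                yield (protein, go_id, "IEA", "InterPro:" + accession)


# Illustrative input only: one mapping line and one hypothetical protein.
sample_lines = [
    "!hypothetical header line",
    "InterPro:IPR000003 Retinoid X receptor/HNF4 > GO:DNA binding ; GO:0003677",
]
domain_hits = {"PROT-1": ["IPR000003", "IPR999999"]}  # second hit has no mapping

annotations = list(iea_annotations(domain_hits, parse_interpro2go(sample_lines)))
```

Domain hits without an entry in the mapping file simply produce no annotation, which is why the unmapped accession above is silently skipped.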

TMHMM2GO mapping for IEA annotations: We also include the results of an internal pipeline that maps proteins containing a transmembrane domain, as predicted by the TMHMM algorithm, to the GO Cellular Component term 'integral to membrane'. About 6,710 gene products are annotated to this term via the pipeline.

InterPro2GO and TMHMM2GO annotations are updated at every database release.

Semi-automated Methods

Review and improvement of the Phenotype2GO data pipeline

WormBase uses a well-defined phenotype ontology to annotate gene-allele and RNAi-based phenotypes. A total of 201 phenotype terms used in annotation have been mapped to a GO term. These mappings are used to automatically generate Biological Process annotations to genes, using the IMP evidence code, at every WormBase database build. The complete list of WormBase phenotype to GO term mappings can be found here: http://www.wormbase.org/wiki/index.php/Gene_Ontology. We have begun a detailed review of our phenotype to GO term mappings and are changing the pipeline so that annotations are made with a stricter use of the IMP evidence code, as recently described in GO Consortium annotation policies. This process will involve removing some high-level phenotype term to GO term mappings and/or excluding certain RNAi experiments/papers from the pipeline, as well as reviewing and revising the scripts.
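
The mapping step, including the planned removal of high-level mappings, can be sketched roughly as follows. Every identifier here (gene names, phenotype terms, GO placeholders, the function name) is hypothetical and chosen only for illustration; the actual WormBase scripts are not shown in this report.

```python
def phenotype2go(gene_phenotypes, pheno_to_go, excluded_phenotypes=frozenset()):
    """Return sorted, de-duplicated (gene, GO term, 'IMP') annotations.

    gene_phenotypes     -- iterable of (gene, phenotype term) observations
    pheno_to_go         -- dict mapping a phenotype term to a GO term
    excluded_phenotypes -- phenotype terms judged too high-level for IMP
    """
    annotations = set()
    for gene, phenotype in gene_phenotypes:
        if phenotype in excluded_phenotypes:
            continue  # mapping removed during review
        go_term = pheno_to_go.get(phenotype)
        if go_term is not None:
            annotations.add((gene, go_term, "IMP"))
    return sorted(annotations)


# Hypothetical placeholders, not real ontology identifiers.
observations = [
    ("gene-a", "PHENO:specific"),
    ("gene-a", "PHENO:specific"),   # duplicate observation, e.g. a second RNAi experiment
    ("gene-b", "PHENO:high_level"),
    ("gene-c", "PHENO:unmapped"),
]
mappings = {"PHENO:specific": "GO:process_x", "PHENO:high_level": "GO:process_y"}

before_review = phenotype2go(observations, mappings)
after_review = phenotype2go(observations, mappings, excluded_phenotypes={"PHENO:high_level"})
```

Under these assumptions, excluding a high-level phenotype term removes every annotation that depended on it, which is one way a stricter IMP policy could lower the annotated gene count.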

Textpresso-Based Cellular Component Curation

As a complementary approach to our manual curation pipeline, we continue to employ the Textpresso information retrieval system to annotate C. elegans gene products to the Cellular Component ontology. Textpresso searches for component annotations use three different categories (Assay Terms, Component Terms, and Verbs) plus a category of C. elegans protein names. Searches using these categories return sentences that contain a match to at least one term in each category. We use this approach to annotate newly published papers as well as papers published prior to 2010. For newly published papers, we prioritize our searches by first searching through papers that, as determined by a Support Vector Machine document classifier, have a relatively high probability of containing expression data.
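
The "at least one term in each category" rule can be illustrated with a small sketch. The category contents and the two test sentences below are invented for illustration, and the simple tokenizer is an assumption; Textpresso's real categories and matching are far richer.

```python
import re

# Hypothetical, heavily abbreviated category term lists (all lowercase).
CATEGORIES = {
    "assay": {"gfp", "antibody", "immunostaining"},
    "component": {"nucleus", "membrane", "cytoplasm"},
    "verb": {"localizes", "expressed", "detected"},
    "protein": {"vha-6", "egl-9"},
}


def matches_all_categories(sentence, categories):
    """True if the sentence contains at least one term from every category."""
    tokens = set(re.findall(r"[a-z0-9\-]+", sentence.lower()))
    return all(tokens & terms for terms in categories.values())


# Invented sentences used only to exercise the rule.
hit = matches_all_categories("VHA-6::GFP localizes to the apical membrane.", CATEGORIES)
miss = matches_all_categories("EGL-9 is expressed in neurons.", CATEGORIES)
```

The second sentence is rejected because it matches only the verb and protein categories; requiring all four categories is what keeps the returned sentences focused on localization evidence.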

Over the past year, through our manual and Textpresso-based pipelines, we added new cellular component annotations to 292 genes. Of these, 113 genes were annotated from papers published in 2010, with the remainder of the annotations coming from previously published papers. For the 2010 papers, Textpresso’s annotation recall was 91.3%. We have not yet measured the recall on papers annotated this year but published prior to 2010.

Priorities for Annotation

Our annotation priorities are as follows:

1) Reference Genome genes

2) Genes presented for annotation via our Textpresso-based semi-automated Cellular Component curation pipeline

3) Genes from training set papers used for piloting semi-automated Textpresso-based Molecular Function curation

4) Newly described genes for which previous annotation was not available

5) C. elegans orthologs of human disease genes

Presentations and Publications

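Publications, Talks, Posters 2010 (https://docs.google.com/document/d/1w8nR5llrexyqCKk3NTlj3v3s7gKBA-DPtAT3LiIe5jo/edit#heading=h.g9z1ccafe2s)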

Ontology Development Contributions

WormBase curators have contributed to ontology discussion and development in the areas of:
Biology of the cilium:

updates/revisions to terms added in 2005

Biology of the phagosome-lysosome during apoptotic cell clearance, terms added:

phagosome maturation involved in apoptotic cell clearance
phagosome acidification involved in apoptotic cell clearance
phagolysosome assembly involved in apoptotic cell clearance
phagosome-lysosome docking involved in apoptotic cell clearance
phagosome-lysosome fusion involved in apoptotic cell clearance

Biology of muscle, terms added:

striated muscle contraction involved in embryonic body morphogenesis
striated muscle myosin thick filament assembly
striated muscle paramyosin thick filament assembly (2010)
alpha-tubulin acetylation

Biology of nematode larval development, terms added:

regulation (includes positive and negative regulation child terms) of nematode larval development
regulation of (includes positive and negative regulation terms) dauer larval development

Other terms added were:

neuropeptide receptor binding
determination of left/right asymmetry in the nervous system
regulation of locomotion (including positive and negative regulation child terms) involved in locomotory behavior
detoxification of arsenic
chondroitin sulfate proteoglycan binding
chondroitin sulfate binding
octopamine/tyramine signaling involved in the response to food (and the regulation terms)

Annotation Outreach and User Advocacy Efforts

Kimberly Van Auken continues to participate in the go-help rotation. Ranjana Kishore continues to participate in the efforts of the GO News group.

Other Highlights

Curation Tools: Ontology Annotator

We have continued development of our web-based curation tool, the Ontology Annotator, which can be used to annotate genes to any ontology, including the Gene Ontology and the WormBase Phenotype Ontology. The Ontology Annotator incorporates and expands upon much of the functionality of the Phenote curation tool. Some of the more useful features of the tool include bulk annotation capabilities, autocomplete functions, and retrieval and filtering of data for editing purposes. We have improved the functionality of existing annotation interfaces and have added the following fully functional curation interfaces: antibody, small molecule and gene regulation. We are currently working on two new curation interfaces: gene regulation and expression-pattern-related pictures.

Textpresso- and HMM-Based Molecular Function Curation

In addition to using Textpresso for Cellular Component Curation, we have developed pipelines for semi-automated Molecular Function (MF) curation. Given that Molecular Function annotations can be made based upon evidence from a wide variety of experiments, our approach has been to divide molecular function curation into two broad categories: 1) macromolecular interactions and 2) enzymatic and transporter activities.

Macromolecular Interactions

For macromolecular interactions, we employ a two-step curation pipeline. First, we use an SVM-based document classification algorithm to identify new papers likely to contain reports of macromolecular interactions. These papers are then searched for matching sentences using Textpresso categories developed specifically for identifying sentences describing macromolecular interactions. Our initial results indicate that the SVM performs very well at predicting high-confidence true positive papers (precision=86.4%) as well as true negative papers (precision=88.9%). The Textpresso categories have a high recall (89.7%) but a relatively low precision (47.3%). In practice, this means that while curators are able to retrieve and curate nearly all macromolecular interactions, they must still examine a number of false positive sentences, so curation efficiency needs to be improved. We are fine-tuning the Textpresso categories for macromolecular interactions to address this issue. The new Textpresso categories for these data are scheduled to be in place at the end of December 2010.
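
The practical meaning of these figures can be made concrete with a little arithmetic. The sketch below states the standard definitions of precision and recall and then derives the curator reading load implied by the reported 47.3% precision. The counts passed to the first two functions are hypothetical and serve only to exercise the definitions.

```python
def precision(true_pos, false_pos):
    """Fraction of retrieved items that are correct."""
    return true_pos / (true_pos + false_pos)


def recall(true_pos, false_neg):
    """Fraction of correct items that were retrieved."""
    return true_pos / (true_pos + false_neg)


def sentences_per_hit(precision_value):
    """Average number of retrieved sentences examined per curatable sentence."""
    return 1.0 / precision_value


# Hypothetical counts, not taken from the report.
example_precision = precision(true_pos=9, false_pos=1)
example_recall = recall(true_pos=9, false_neg=3)

# Reading load implied by the reported Textpresso precision of 47.3%.
reading_load = sentences_per_hit(0.473)
```

At 47.3% precision, `reading_load` comes to roughly 2.1, so a curator examines about two retrieved sentences for each one that is actually curatable.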

Enzymatic and Transporter Activities

In parallel to these efforts, we are also investigating semi-automated curation methods for annotating enzymatic and transporter activities. For this data type, we are developing Textpresso categories as well as training a Hidden Markov Model (HMM) to identify curatable sentences. For the Textpresso categories, we have collected 419 sentences from 64 papers. Using these sentences, we have developed two new Textpresso categories for enzymatic and transporter activities. As with the new macromolecular interaction categories, we plan to implement these categories on Textpresso by the end of December 2010.

In collaboration with Hans-Michael Mueller, we are also training a Hidden Markov Model to identify sentences describing enzymatic and transporter activities. At present, we are in the third round of training and evaluation for the model. We hope to complete an initial evaluation of the model by early next year and will report on its performance.

Textpresso-Based Identification of Literature for Human Disease Gene Orthologs

Human disease gene orthologs are a high-priority annotation list for WormBase and other model organism databases. Sequence-based or similarity searches for human disease gene orthologs provide useful candidates, but the biological information needs to be extracted manually from the literature. We are working on a project that combines Textpresso-based categories and keyword searches to identify C. elegans papers that describe the study of a human disease gene ortholog. Sentences in which a C. elegans gene co-occurs with a human disease term were deemed important for this process. A new 'human disease' Textpresso category was formed using the human disease ontology in the OBO Foundry, the Neuroscience Information Framework Standardized (NIFSTD) ontology and the existing Textpresso human disease category. Several disease terms that would increase the number of false positives were removed iteratively by a manual process. Because an 'AND' query on just two categories (human disease and C. elegans gene) returned too many false positives, a third category was formed with the words 'ortholog', 'homolog', 'similar', 'relate' and 'model', and added to the 'AND' query. The query is being fine-tuned to increase precision and recall. Once established, it will automate the flagging of new articles that contain disease gene ortholog data. Subsequently, this method can be used for the extraction of relevant information.

Textpresso-Based Curation Pipelines for Other MODs

We have been collaborating with The Arabidopsis Information Resource (TAIR) and dictyBase to develop and implement Textpresso-based curation pipelines for Cellular Component annotation. At present, we are working with TAIR to develop a pipeline by which they can perform Textpresso searches on the Arabidopsis corpus from 2008 to identify potentially new annotations that were not previously captured in their existing GO curation pipeline. The results of these searches are being presented in a curation form that will allow TAIR curators to make new annotations and retrieve the annotations in both a gene_association file format and a more basic, three-column format that is similar to their user submission format. We hope to generalize this pipeline so that other groups, including dictyBase, will be able to perform Textpresso queries, send the results to a curation form, and retrieve the output of the curation as a gene_association file.
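
As a rough illustration of the gene_association output, the sketch below writes one tab-delimited row using the 15-column layout of the GAF 1.0 format. The function name and all field values except the column order are placeholders; the curation form's actual export code and TAIR's three-column format are not described in this report.

```python
def gaf_row(db, object_id, symbol, go_id, reference, evidence, aspect,
            taxon, date, assigned_by, object_type="gene"):
    """Return one tab-delimited gene_association line (15 columns)."""
    fields = [
        db,            # 1  DB
        object_id,     # 2  DB Object ID
        symbol,        # 3  DB Object Symbol
        "",            # 4  Qualifier (left empty here)
        go_id,         # 5  GO ID
        reference,     # 6  DB:Reference
        evidence,      # 7  Evidence code
        "",            # 8  With/From (left empty here)
        aspect,        # 9  Aspect: P, F or C
        "",            # 10 DB Object Name (left empty here)
        "",            # 11 DB Object Synonym (left empty here)
        object_type,   # 12 DB Object Type
        taxon,         # 13 Taxon
        date,          # 14 Date, YYYYMMDD
        assigned_by,   # 15 Assigned By
    ]
    return "\t".join(fields)


# Placeholder values for a hypothetical Cellular Component annotation.
row = gaf_row(
    db="TAIR", object_id="locus:0000000", symbol="GENE1",
    go_id="GO:0005634", reference="PMID:0000000", evidence="IDA",
    aspect="C", taxon="taxon:3702", date="20101201", assigned_by="TAIR",
)
```

Keeping the export as a plain function of annotation fields is one way the same curation form could serve several databases, since only the field values differ between groups.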

Collaboration with BioGRID

In late summer, we began a collaboration with BioGRID (Biological General Repository for Interaction Datasets) to add protein binding annotations (IPI evidence code) curated for GO into BioGRID. We also plan to add curated genetic interactions (IGI evidence code) to BioGRID. Our initial work will focus on adding interactions curated as part of the Reference Genome’s Wnt signaling pathway annotation project. Concurrently, we will add any newly curated protein binding annotations (Wnt pathway or otherwise) to BioGRID.
