WormBase December 2014

Latest revision as of 13:27, 16 December 2014

Overview:

Staff

Paul Sternberg, PI, WormBase, GO [8%; 0% funded by GOC]

Juancarlos Chan, Developer, WormBase [25%; 25% funded by GOC]

James Done, Developer, Textpresso [40%; 40% funded by GOC]

Ranjana Kishore, Curator [25%; 10% funded by GOC]

Yuling Li, Developer, Textpresso [30%; 20% funded by GOC]

Hans-Michael Müller, PI, Textpresso [75%; 50% funded by GOC]

Daniela Raciti, Curator [10%; 0% funded by GOC]

Kimberly Van Auken, Curator [100%; 75% funded by GOC]

Annotation Progress

WormBase GO Annotation Statistics as of December 1, 2014

Manual annotation statistics are summarized in Tables 1-3.

Total number of unique manual annotations: 27422

Total number of genes with manual annotations: 4690

Table 1: Summary of C. elegans Manual Biological Process Annotations

Numbers refer to the total number of annotations; numbers in parentheses are the annotations that have extensions.

Annotation Group IMP IGI IDA ISS TAS IEP IPI IC NAS ISM ND IBA IRD RCA ISO IKR
WormBase 7461 (244) 2990 (53) 1107 (19) 315 (1) 115 275 (58) 60 50 (10) 32 2 0 0 0 0 0 0
UniProt 466 (2) 28 115 (1) 170 22 13 0 5 104 0 65 0 0 2 0 0
GOC 59 10 309 329 22 0 4 7 14 0 0 331 0 2 2 0
BHF-UCL 11 0 0 2 0 4 0 0 0 0 0 0 0 0 0 0
MGI 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
HGNC 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0
GO_Central 0 0 0 4 0 0 0 0 0 0 0 2810 3 0 0 1
ParkinsonsUK-UCL 2 2 (2) 6 0 0 0 0 0 0 0 0 0 0 0 0 0
Totals 8004 (246) 3030 (55) 1537 (20) 804 (1) 159 292 (58) 64 62 (10) 150 2 65 3141 3 4 2 1


Table 2: Summary of C. elegans Molecular Function Annotations

Numbers refer to the total number of annotations; numbers in parentheses are the annotations that have extensions.

Annotation Group IMP IGI IDA ISS TAS IEP IPI IC NAS ISM ND IBA IRD RCA ISO IKR
WormBase 151 (5) 35 1617 (133) 658 (1) 49 0 1211 11 7 4 35 0 0 0 2 0
IntAct 0 0 0 0 0 0 1987 (54) 0 0 0 0 0 0 0 0 0
UniProt 33 2 99 (1) 172 19 0 231 3 53 0 127 0 0 19 0 0
GO_Central 0 0 0 0 0 0 0 0 0 0 0 2096 2 0 0 1
HGNC 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0
ParkinsonsUK-UCL 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 0
Totals 184 (5) 37 1720 (134) 832 (1) 68 0 3429 (54) 13 60 4 162 2096 2 19 2 1


Table 3: Summary of C. elegans Cellular Component Annotations

Numbers refer to the total number of annotations; numbers in parentheses are the annotations that have extensions.

Annotation Group IMP IGI IDA ISS TAS IEP IPI IC NAS ISM ND IBA IRD RCA ISO IKR
WormBase 9 0 5625 (683) 322 26 0 142 (3) 43 6 4 4 0 0 1 0 0
GO_Central 0 0 0 0 0 0 0 0 0 0 0 2001 3 0 0 1
UniProt 14 1 208 186 16 0 0 19 50 0 119 0 0 18 0 0
MGI 0 0 14 0 0 0 0 0 0 0 0 0 0 0 0 0
BHF-UCL 0 0 7 0 0 0 0 0 0 0 0 0 0 0 0 0
Reactome 0 0 0 3 4 0 0 0 0 0 0 0 0 0 0 0
HGNC 0 0 0 8 0 0 0 0 0 0 0 0 0 0 0 0
Totals 23 1 5854 (683) 519 46 0 142 (3) 62 56 4 123 2001 3 19 0 1


Table 4: Summary of C. elegans Computational Annotations

Based on WormBase Release WS246

Total number of genes with Phenotype2GO-based Annotation: 6,809

Total number of genes with InterPro2GO-based Annotation: 11,282

Type of Annotation IEA
Phenotype2GO Mappings - WormBase 42,666
IEA/InterPro2GO - WormBase 35,082

Methods and strategies for annotation

Curation methods

Literature curation

Curation of the primary literature continues to be the major focus of our manual annotation efforts.

Over the past year, WormBase has adopted a topic-based approach to curation in which curators focus on one or more biological topics, or processes, for each release cycle. Topics over the past year have included the endoplasmic reticulum and mitochondrial unfolded protein responses, innate immunity and defense response, and Wnt signaling pathways (see below).

Semi-automated curation using the Textpresso information retrieval system

We also routinely employ the Textpresso information retrieval system for semi-automated curation of GO Cellular Component and Molecular Function annotations.

Computational annotation strategies

Our computational annotation strategies include mapping genes to GO terms using InterPro domains and mapping genes to Biological Process terms based upon mappings between terms in the Worm Phenotype Ontology (WPO) and GO terms. Beginning with the WS246 WormBase release, these Phenotype2GO-based annotations will include phenotypes based upon genetic variations as well as RNAi experiments. Results from automated methods are generated anew with each WormBase database build to reflect any changes in the underlying reference genome sequence and/or gene models.
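
The Phenotype2GO idea described above can be illustrated with a minimal Python sketch. It is not the WormBase pipeline: the mapping table, gene names, and every identifier in it are hypothetical placeholders, and the function name phenotype2go and its sources parameter are assumptions made for this example. It only shows how phenotype observations from RNAi and genetic variation experiments might be turned into de-duplicated IEA annotations through a phenotype-to-GO lookup.

```python
# All identifiers below are hypothetical placeholders, not real WPO or GO IDs.
PHENOTYPE2GO = {
    "WBPhenotype:EXAMPLE1": ["GO:EXAMPLE_A"],
    "WBPhenotype:EXAMPLE2": ["GO:EXAMPLE_B", "GO:EXAMPLE_C"],
}

# (gene, phenotype term, experiment type) observations; also made up.
OBSERVATIONS = [
    ("gene-1", "WBPhenotype:EXAMPLE1", "RNAi"),
    ("gene-1", "WBPhenotype:EXAMPLE1", "variation"),
    ("gene-2", "WBPhenotype:EXAMPLE2", "variation"),
    ("gene-3", "WBPhenotype:UNMAPPED", "RNAi"),
]


def phenotype2go(observations, mapping, sources=("RNAi", "variation")):
    """Return sorted, de-duplicated (gene, GO term, evidence) tuples.

    Only observations whose experiment type is listed in `sources` are used,
    and phenotypes without an entry in `mapping` yield no annotation.
    """
    annotations = set()
    for gene, phenotype, source in observations:
        if source not in sources:
            continue
        for go_id in mapping.get(phenotype, ()):
            # Computationally inferred annotations carry the IEA evidence code.
            annotations.add((gene, go_id, "IEA"))
    return sorted(annotations)


rnai_only = phenotype2go(OBSERVATIONS, PHENOTYPE2GO, sources=("RNAi",))
combined = phenotype2go(OBSERVATIONS, PHENOTYPE2GO)
print(len(rnai_only), "annotation(s) from RNAi alone;", len(combined), "with variations")
```

In this toy data set, adding variation-based phenotypes annotates a gene that RNAi alone does not cover, while the phenotype observed by both methods for the same gene still produces a single annotation.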

Curation strategies

Priorities for annotation

Selection of genes for annotation is guided by several criteria:

  • Annotation of gene sets involved in specific biological processes as part of WormBase's coordinated topic-based curation process
    • Topics annotated to date: Unfolded Protein Response (ER and mitochondrial), innate immune response, defense response to pathogen, and Wnt signaling
  • Genes identified in Textpresso-based curation pipelines
  • Re-annotation of genes affected by changes to the ontology, e.g. cilia biology, ubiquitination, enzyme regulator activities
  • Publication of newly characterized genes
  • C. elegans genes orthologous to human disease genes

Presentations and Publications

Papers with substantial GO content

  • Van Auken K, Schaeffer ML, McQuilton P, Laulederkind SJF, Li D, Wang SJ, Hayman GT, Tweedie S, Arighi CN, Done J, Müller HM, Sternberg PW, Mao Y, Wei CH, Lu Z. BC4GO: A Full-Text Corpus for the BioCreative IV GO Task. Database (Oxford). 2014 Jul 28;2014. pii: bau074. doi: 10.1093/database/bau074. Print 2014. PMID: 25070993; PMCID: PMC4112614.
  • Mao Y, Van Auken K, Li D, Arighi CN, McQuilton P, Hayman GT, Tweedie S, Schaeffer ML, Laulederkind SJF, Wang SJ, Gobeill J, Ruch P, Luu AT, Kim JJ, Chiang JH, Chen YD, Yang CJ, Liu H, Zhu D, Li Y, Yu H, Emadzadeh E, Gonzalez G, Chen JM, Dai HJ, Lu Z. Overview of the Gene Ontology Task at BioCreative IV. Database (Oxford). 2014 Aug 25;2014. pii: bau086. doi: 10.1093/database/bau086. Print 2014. PMID: 25157073; PMCID: PMC4142793.
  • Huntley RP, Harris MA, Alam-Faruque Y, Blake JA, Carbon S, Dietze H, Dimmer EC, Foulger RE, Hill DP, Khodiyar VK, Lock A, Lomax J, Lovering RC, Mutowo-Meullenet P, Sawford T, Van Auken K, Wood V, Mungall CJ. A method for increasing expressivity of Gene Ontology annotations using a compositional approach. BMC Bioinformatics. 2014 May 21;15:155. doi: 10.1186/1471-2105-15-155. PMID: 24885854; PMCID: PMC4039540.

Presentations including Talks and Tutorials and Teaching

Poster presentations

Other Highlights

Ontology Development Contributions

  • Terms Added to the Ontology in 2014:
    • lysosome-related organelle
    • gut granule
    • gut granule lumen
    • gut granule membrane
    • peptidyl-proline 4-dioxygenase binding
    • tail spike morphogenesis
    • regulation, positive regulation, and negative regulation of anterograde synaptic vesicle transport
    • positive and negative regulation of pharyngeal pumping

Annotation Outreach and User Advocacy Efforts

  • Kimberly Van Auken continues to serve on the GO-help rota.
  • Kimberly Van Auken assisted with migration of content to the new GO website.

Annotation Advocacy

  • Kimberly Van Auken is participating in bi-weekly calls on development of the LEGO curation model and accompanying curation tool, Noctua.

Other Highlights

WormBase Data Models and Software

  • Progress Reports - Juancarlos Chan and Kimberly Van Auken have written a new script for reporting our manual annotation statistics. This script reports the number of annotations per contributing group according to evidence code and also reports the number of annotations with annotation extensions.
  • WormBase GO Annotation Model - Kimberly Van Auken, Kevin Howe, Paul Davis, and Daniela Raciti collaborated on development and testing of a new GO annotation model for WormBase. The model will allow for full incorporation of annotation extension data into WormBase, as well as additional annotation details and new IEA annotations from the UniProt-GOA group. The new model and accompanying data will be included in WormBase Release WS237.
  • Phenotype2GO Annotation Pipeline - Kimberly Van Auken, Juancarlos Chan, and Kevin Howe revised the Phenotype2GO-based annotation pipeline to incorporate both RNAi- and genetic variation-based phenotypes to increase coverage for these types of annotations. Concurrently, they also changed the curation pipeline to allow for improved quality control by housing all annotations within a curation tool hosted at Caltech.
  • WormBase Ontology Browser - Raymond Lee and Juancarlos Chan, in collaboration with the GO software team at Berkeley, developed a new ontology browser for WormBase, WOBr, using the new AmiGO2.0 software. The browser allows for searching across multiple ontologies used in WormBase, including GO, WPO, and the worm Anatomy and Life Stage Ontologies.
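
The progress-report script mentioned in this list is not reproduced on this page. As a rough sketch of the kind of tally it describes, the Python below counts annotations per contributing group and evidence code and tracks how many carry an annotation extension. It assumes the input is in GAF 2.x format, where evidence code, assigned-by group, and annotation extension are columns 7, 15, and 16; the function names and sample rows are hypothetical, and the sketch counts rows rather than unique annotations.

```python
from collections import Counter


def summarize_gaf(lines):
    """Count annotations per (assigned-by group, evidence code).

    Returns two Counters keyed by (group, evidence): all annotations, and
    the subset with a non-empty annotation extension (GAF column 16).
    """
    totals, with_ext = Counter(), Counter()
    for line in lines:
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and GAF header/comment lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 15:
            continue  # not a complete annotation row
        key = (cols[14], cols[6])  # (assigned by, evidence code)
        totals[key] += 1
        if len(cols) > 15 and cols[15].strip():
            with_ext[key] += 1
    return totals, with_ext


def format_cell(totals, with_ext, key):
    """Render a count as 'N' or 'N (M)', M being annotations with extensions."""
    if with_ext[key]:
        return "{} ({})".format(totals[key], with_ext[key])
    return str(totals[key])


def _row(evidence, group, extension=""):
    # Builds one made-up 17-column GAF row; every ID here is a placeholder.
    return "\t".join([
        "WB", "WBGene00000000", "example-1", "", "GO:0000000", "PMID:0",
        evidence, "", "P", "", "", "gene", "taxon:6239", "20141201",
        group, extension, "",
    ])


SAMPLE = [
    "!gaf-version: 2.0",
    _row("IMP", "WormBase", "part_of(WBbt:0000000)"),
    _row("IMP", "WormBase"),
    _row("IDA", "WormBase"),
    _row("IMP", "UniProt"),
]

totals, with_ext = summarize_gaf(SAMPLE)
for key in sorted(totals):
    print(key[0], key[1], format_cell(totals, with_ext, key))
```

The 'N (M)' cell format mirrors the convention used in Tables 1-3, where the number in parentheses is the count of annotations with extensions.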

Text Mining and Textpresso Central

  • Kimberly Van Auken and Yuling Li developed a new support vector machine (SVM) document classifier for a subclass of the molecular function ontology: catalytic activity. This SVM is now included in the WormBase data flagging pipeline that also includes classifiers for macromolecular interactions, expression patterns, and RNAi- and variation-based phenotypes.
  • Monica McAndrews (MGI), Kimberly Van Auken, and Yuling Li are collaborating on a document classification pipeline to help MGI identify papers suitable for curation. Using training and testing papers supplied by MGI, we have developed an SVM classifier to distinguish mouse from non-mouse papers. The next step in the process will be to help MGI develop a pipeline for identifying mouse markers (genes) associated with experimental data in these papers.
  • Hans-Michael Müller and Yuling Li started developing a literature curation platform named Textpresso Central that enables curators to perform full-text literature searches, view and curate research papers, and train and apply machine learning and text mining algorithms for semantic analysis and curation purposes. Users are supported in this task with capabilities to select, edit, and store lists of papers, sentences, terms, and categories in order to perform training and mining. The system is designed with the intent to empower the user to perform as many operations on a literature corpus or a particular paper as possible. It uses state-of-the-art software packages and frameworks such as the Unstructured Information Management Architecture (http://uima.apache.org), Lucene (http://lucene.apache.org), and Wt (http://www.webtoolkit.eu/wt). The corpus of papers can be built from full-text articles that are available in PDF format (http://en.wikipedia.org/wiki/Portable_Document_Format) or NXML (http://dtd.nlm.nih.gov/). An extension for articles published in HTML (http://en.wikipedia.org/wiki/HTML) is planned.
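
The SVM document classifiers mentioned in this list are not specified in detail here (features, kernel, and training data are not given). Purely as an illustration of the general technique, the following Python sketch trains a toy linear SVM on bag-of-words counts using subgradient descent on the regularized hinge loss. The training sentences, labels, hyperparameters, and function names are all invented for the example and are not taken from the WormBase or MGI pipelines.

```python
from collections import Counter


def featurize(text):
    """Bag-of-words term counts for one document."""
    return Counter(text.lower().split())


def score(model, text):
    weights, bias = model
    return bias + sum(weights.get(t, 0.0) * c for t, c in featurize(text).items())


def train_linear_svm(docs, labels, epochs=50, lr=0.1, lam=0.01):
    """Linear SVM via per-example subgradient steps on the hinge loss.

    labels are +1 (flag the paper for curation) or -1 (do not flag).
    """
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for text, y in zip(docs, labels):
            margin = y * score((weights, bias), text)
            # L2 regularization: shrink every weight slightly at each step.
            for term in weights:
                weights[term] *= 1.0 - lr * lam
            if margin < 1.0:
                # Example is inside the margin: move the boundary away from it.
                for term, count in featurize(text).items():
                    weights[term] = weights.get(term, 0.0) + lr * y * count
                bias += lr * y
    return weights, bias


def predict(model, text):
    return 1 if score(model, text) >= 0 else -1


# Hypothetical toy corpus: +1 = describes catalytic activity, -1 = does not.
DOCS = [
    "kinase phosphorylates substrate",
    "enzyme catalyzes hydrolysis",
    "gene expressed in neurons",
    "mutant shows locomotion defect",
]
LABELS = [1, 1, -1, -1]

model = train_linear_svm(DOCS, LABELS)
print(predict(model, "enzyme phosphorylates substrate"))
print(predict(model, "gene expressed in mutant"))
```

A production classifier would work on full-text papers with a much larger vocabulary and a held-out test set; the sketch only shows how a separating boundary is learned from labeled documents.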

Back to http://wiki.geneontology.org/index.php/Progress_Reports