WormBase December 2016

From GO Wiki
Latest revision as of 15:52, 20 December 2016

Overview:

Staff

Person, Group [Effort, Funding]

Paul Sternberg, PI, WormBase, GO [8%; 0% funded by GOC]

Juancarlos Chan, Developer, WormBase [25%; 25% funded by GOC]

Sibyl Gao, Developer, WormBase [5%; 0% funded by GOC]

Kevin Howe, Project Lead, WormBase - EBI [5%; 0% funded by GOC]

Raymond Lee, Curator, WormBase [10%; 0% funded by GOC]

Yuling Li (thru August 2016), Developer, Textpresso [30%; 25% funded by GOC]

Jane Lomax (thru July 2016), Curator, WormBase ParaSite [10%; 0% funded by GOC]

Hans Michael Mueller, Project Lead, Textpresso [75%; 50% funded by GOC]

Daniela Raciti, Curator [10%; 0% funded by GOC]

Kimberly Van Auken, Curator, Co-Manager, Annotation Working Group [100%; 75% funded by GOC]

Annotation Progress

WormBase GO Annotation Statistics as of December 20, 2016

Manual annotation statistics are summarized in Tables 1 - 3.

Total number of unique manual annotations: 42747 (+8.8% from 2015)

Total number of genes with manual annotations: 7596 (+12.3% from 2015)

Table 1: Summary of C. elegans Manual Biological Process Annotations

Numbers are total annotation counts; numbers in parentheses are the annotations that have extensions.

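| Annotation Group | IMP | IGI | IDA | ISS | TAS | IEP | IPI | IC | NAS | ISM | ND | IBA | IRD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| WormBase | 7623 (426) | 3141 (90) | 1106 (24) | 327 (1) | 109 | 292 (56) | 51 | 52 (10) | 32 | 2 | 2 | 0 | 0 |
| UniProt | 1530 (552) | 976 (390) | 165 (15) | 197 | 26 (3) | 14 | 2 (2) | 5 | 104 | 0 | 65 | 0 | 0 |
| CACAO | 20 | 1 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| BHF-UCL | 11 | 0 | 0 | 2 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| MGI | 4 | 0 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| HGNC | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| GO_Central | 2 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7945 | 1 |
| ParkinsonsUK-UCL | 10 (4) | 6 (3) | 11 | 2 (1) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Totals | 9200 (982) | 4124 (483) | 1291 (39) | 536 (2) | 135 (3) | 310 (56) | 53 (2) | 57 (10) | 136 | 2 | 67 | 7945 | 1 |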


Table 2: Summary of C. elegans Molecular Function Annotations

Numbers are total annotation counts; numbers in parentheses are the annotations that have extensions.

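| Annotation Group | IMP | IGI | IDA | ISS | TAS | IPI | IC | NAS | ISM | ND | IBA | ISO | IKR | IRD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| WormBase | 161 (11) | 32 | 1688 (209) | 647 (4) | 45 | 1348 (5) | 21 (1) | 4 | 3 | 73 | 0 | 2 | 0 | 0 |
| IntAct | 0 | 0 | 0 | 0 | 0 | 2085 (52) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| UniProt | 57 (2) | 17 | 139 (3) | 194 | 23 (1) | 321 (3) | 4 | 51 | 0 | 126 | 0 | 0 | 0 | 0 |
| CACAO | 1 | 0 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| GO_Central | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6538 | 0 | 1 | 1 |
| HGNC | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ParkinsonsUK-UCL | 0 | 0 | 0 | 0 | 0 | 2 (2) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Totals | 219 (13) | 49 | 1834 (212) | 843 (4) | 68 (1) | 3754 (60) | 25 (4) | 55 | 4 | 199 | 6538 | 2 | 1 | 1 |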


Table 3: Summary of C. elegans Cellular Component Annotations

Numbers are total annotation counts; numbers in parentheses are the annotations that have extensions.

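| Annotation Group | IMP | IGI | IDA | ISS | TAS | IPI | IC | NAS | ISM | ND | IBA |
|---|---|---|---|---|---|---|---|---|---|---|---|
| WormBase | 9 | 0 | 5863 (784) | 382 | 27 | 141 (3) | 50 | 7 | 4 | 10 | 0 |
| GO_Central | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6326 |
| UniProt | 29 (10) | 1 | 379 (73) | 203 | 18 | 0 | 19 | 50 | 0 | 118 | 0 |
| MGI | 0 | 0 | 16 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| HGNC | 0 | 0 | 0 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| BHF-UCL | 0 | 0 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| CACAO | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Totals | 38 (10) | 1 | 6268 (857) | 593 | 45 | 141 (3) | 69 | 57 | 4 | 128 | 6326 |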


Table 4: Summary of C. elegans Computational Annotations

Summary Statistics Based on WormBase Release WS256

Genes Stats:

 Genes with GO_term connections  15047 
   Non-IEA-only annotation              640
   IEA-only annotation                 7830
   Both IEA and non-IEA annotations    6577

GO_term Stats:

 Distinct GO_terms connected to Genes   5679
   Associated by non-IEA only           3123
   Associated by IEA only                825
   Associated by both IEA and non-IEA   1731

| Type of Annotation | IEA |
|---|---|
| Phenotype2GO Mappings - WormBase | 37,714 |
| IEA/InterPro2GO - WormBase | 22,660 |
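
The three gene categories in the summary statistics above partition the set of annotated genes, so their counts should add up to the total (640 + 7830 + 6577 = 15047). The following is a minimal sketch of how such a breakdown could be computed from a list of annotation records. The function name, the tuple layout, and the toy identifiers are assumptions made for illustration; this is not the WormBase build script.

```python
def summarize_gene_evidence(annotations):
    """Count genes by whether their GO annotations are IEA-only, non-IEA-only, or both.

    `annotations` is an iterable of (gene_id, go_id, evidence_code) tuples
    (a hypothetical layout chosen for this sketch).
    """
    iea_genes, non_iea_genes = set(), set()
    for gene_id, _go_id, evidence in annotations:
        if evidence == "IEA":
            iea_genes.add(gene_id)
        else:
            non_iea_genes.add(gene_id)
    return {
        "total": len(iea_genes | non_iea_genes),
        "non_iea_only": len(non_iea_genes - iea_genes),
        "iea_only": len(iea_genes - non_iea_genes),
        "both": len(iea_genes & non_iea_genes),
    }


# Toy records with made-up identifiers (not WormBase data).
toy_annotations = [
    ("gene-1", "GO:toy-A", "IEA"),
    ("gene-1", "GO:toy-B", "IDA"),
    ("gene-2", "GO:toy-A", "IEA"),
    ("gene-3", "GO:toy-C", "IMP"),
]
summary = summarize_gene_evidence(toy_annotations)

# The three categories always partition the total.
assert summary["non_iea_only"] + summary["iea_only"] + summary["both"] == summary["total"]

# Consistency check on the figures reported for WS256.
assert 640 + 7830 + 6577 == 15047
```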

Methods and strategies for annotation

Curation methods

Literature curation

Curation of the primary literature continues to be the major focus of our manual annotation efforts.

Over the past year, WormBase curation efforts were focused largely on developing preliminary pathway models using the Noctua curation tool. To this end, literature curation involved reviewing C. elegans pathways, the biological entities that participate in those pathways, and the annotations, particularly Molecular Function annotations, associated with those entities. Pathways reviewed include apoptosis, asymmetric cell division, defense response, insulin signaling, neuronal cell fate specification, mRNA decay, semaphorin/plexin signaling, thermosensory transduction, and TOR signaling.

Curation using the Textpresso information retrieval system

We also employ the Textpresso information retrieval system for curation of GO Cellular Component and Molecular Function annotations.

Computational annotation strategies

Our computational annotation strategies include InterPro2GO mapping of genes to GO terms based on InterPro domains, performed as part of the WormBase build cycle, as well as computational predictions made via the UniProtKB pipeline, including keyword mappings and UniRule mappings. Also as part of the WormBase build cycle, we map genes to Biological Process terms based upon mappings between Worm Phenotype Ontology (WPO) terms and GO terms (Phenotype2GO).
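
As an illustration of the phenotype-based mapping described above, the sketch below derives IEA annotations by joining gene-to-phenotype records with a phenotype-to-GO lookup table. The function name, the data layout, and all identifiers are hypothetical placeholders; the actual build-cycle code and mapping file are not described in this report.

```python
def phenotype2go(gene_phenotypes, phenotype_to_go):
    """Derive (gene, GO term, evidence) tuples from phenotype annotations.

    gene_phenotypes: iterable of (gene_id, phenotype_id) pairs.
    phenotype_to_go: dict mapping a phenotype ID to a list of GO term IDs.
    Phenotypes without a mapping contribute no annotations.
    """
    inferred = set()  # a set removes duplicates reached via several phenotypes
    for gene_id, phenotype_id in gene_phenotypes:
        for go_id in phenotype_to_go.get(phenotype_id, ()):
            inferred.add((gene_id, go_id, "IEA"))
    return sorted(inferred)


# Placeholder identifiers, not real WPO or GO accessions.
gene_phenotypes = [
    ("gene-1", "WBPhenotype:toy-1"),
    ("gene-1", "WBPhenotype:toy-2"),
    ("gene-2", "WBPhenotype:toy-2"),
    ("gene-3", "WBPhenotype:toy-unmapped"),
]
phenotype_to_go = {
    "WBPhenotype:toy-1": ["GO:toy-A"],
    "WBPhenotype:toy-2": ["GO:toy-A", "GO:toy-B"],
}

annotations = phenotype2go(gene_phenotypes, phenotype_to_go)
for gene_id, go_id, evidence in annotations:
    print(gene_id, go_id, evidence)
```

Because the result is rebuilt from the two inputs each time, rerunning it after the phenotype annotations or the mapping table change yields an up-to-date annotation set.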

Curation strategies

Priorities for annotation

Selection of genes for annotation is guided by several criteria:

  • Annotation of gene sets involved in specific biological processes as part of the LEGO working group
  • Genes identified in Textpresso-based curation pipelines, for example genes described in papers that an SVM (Support Vector Machine) classifier flags with high confidence as reporting Molecular Function experiments such as enzymatic assays
  • Re-annotation of genes affected by changes to the ontology, e.g. cilia biology, ubiquitination, enzyme regulator activities, and obsoleted annotation extensions
  • Publication of newly characterized genes for which no previous biological data was available

Presentations and Publications

Papers with substantial GO content

  • Expansion of the Gene Ontology knowledgebase and resources. Gene Ontology Consortium. Nucleic Acids Research (2016) pii:gkw1108. PMID:27899567
  • Guidelines for the functional annotation of microRNAs using the Gene Ontology. Huntley RP, Sitnikov D, Orlic-Milacic M, Balakrishnan R, D'Eustachio P, Gillespie ME, Howe D, Kalea AZ, Maegdefessel L, Osumi-Sutherland D, Petri V, Smith JR, Van Auken K, Wood V, Zampetaki A, Mayr M, Lovering RC. RNA. 2016 May;22(5):667-76. doi:10.1261/rna.055301.115. PMID:26917558.

Presentations including Talks and Tutorials and Teaching

  • TextpressoCentral: A System for Integrating Full Text Literature Curation with Diverse Curation Platforms including the Gene Ontology Consortium's Common Annotation Framework. Kimberly Van Auken, Yuling Li, Seth Carbon, Christopher Mungall, Suzanna Lewis, Hans-Michael Muller and Paul Sternberg. ISB 2016 Geneva, Switzerland. https://www.sib.swiss/events/biocuration2016/oral-presentations

Other Highlights

Annotation Outreach and User Advocacy Efforts

  • Kimberly Van Auken continues to serve on the GO-help rota.
  • Kimberly Van Auken served on the Data Capture Working Group.

Annotation Advocacy

  • Kimberly Van Auken and David Hill (MGI) continue to serve as Annotation Working Group Co-Managers.
  • Kimberly Van Auken continued to participate in the LEGO working group as an alpha tester of the Noctua software and helped to train GO curators in LEGO curation and the Noctua annotation tool at the Geneva LEGO workshop (April, 2016), an MGI workshop (June 2016), an EBI workshop (September 2016), and the USC workshop (November 2016).

Text Mining and Textpresso Central

  • Monica McAndrews (MGI), Kimberly Van Auken, Hans-Michael Mueller, and Yuling Li (thru August 2016) are collaborating on a document classification pipeline to help MGI identify papers suitable for curation. Using training and testing papers supplied by MGI, we have developed an SVM classifier to distinguish mouse from non-mouse papers. We have begun work to put this pipeline into production.
  • Hans-Michael Mueller, Kimberly Van Auken, and Seth Carbon continued development of the TextpressoCentral (TPC) curation system and its integration with the Noctua annotation tool. TPC enables curators to perform full text literature searches, view the search results in the context of the paper, annotate text, and send those annotations to an external database. Over the past year, we have worked on developing a curation interface for GO annotation, as well as the protocol for communication between TPC and Noctua.
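
To make the document classification idea above concrete, here is a toy linear SVM trained by subgradient descent on the hinge loss over bag-of-words features. Everything in it is an assumption made for illustration: the report does not describe the features, training procedure, or software used in the MGI pipeline, and the four "papers" are invented word lists.

```python
from collections import Counter


def featurize(text, vocab):
    """Bag-of-words count vector over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocab]


def train_linear_svm(X, y, epochs=50, lr=0.1, reg=0.01):
    """Subgradient descent on the L2-regularized hinge loss; labels are +1 or -1."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # The regularization term shrinks the weights on every step.
            w = [wj * (1.0 - lr * reg) for wj in w]
            # The hinge term contributes only when the margin is violated.
            if margin < 1.0:
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def predict(w, b, x):
    """Return +1 (mouse paper) or -1 (non-mouse paper)."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Invented training examples: +1 = mouse paper, -1 = non-mouse paper.
docs = [
    ("mouse knockout embryo phenotype", 1),
    ("transgenic mouse strain allele", 1),
    ("zebrafish embryo morphant phenotype", -1),
    ("yeast strain deletion allele", -1),
]
vocab = sorted({word for text, _ in docs for word in text.split()})
X = [featurize(text, vocab) for text, _ in docs]
y = [label for _, label in docs]

w, b = train_linear_svm(X, y)
train_predictions = [predict(w, b, x) for x in X]
```

A production classifier would be trained on full-text papers with a held-out test set, as the report indicates was done with the papers supplied by MGI.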

Back to http://wiki.geneontology.org/index.php/Progress_Reports