WormBase December 2017


Overview:

Staff

Person, Group [Effort, Funding]

Paul Sternberg, PI, WormBase, GO [8%; 0% funded by GOC]

Valerio Arnaboldi, Developer, Textpresso [%; % funded by GOC]

Juancarlos Chan, Developer, WormBase [25%; 25% funded by GOC]

Jae Cho, Curator, WormBase [10%; 0% funded by GOC]

Sibyl Gao, Developer, WormBase [5%; 0% funded by GOC]

Chris Grove, Curator, WormBase [10%; 0% funded by GOC]

Kevin Howe, Project Lead, WormBase - EBI [5%; 0% funded by GOC]

Raymond Lee, Curator, WormBase [10%; 0% funded by GOC]

Jane Mendel, Curator, WormBase [10%; 0% funded by GOC]

Hans-Michael Mueller, Project Lead, Textpresso [75%; 50% funded by GOC]

Daniela Raciti, Curator [10%; 0% funded by GOC]

Kimberly Van Auken, Curator, Co-Manager, Annotation Working Group [100%; 75% funded by GOC]

Annotation Progress

WormBase GO Annotation Statistics as of December 13, 2017

Manual annotation statistics are summarized in Tables 1-3.

Total number of unique manual annotations: 44384 (+3.8% from 2016)

Total number of genes with manual annotations: 7508 (-1.1% from 2016)

Table 1: Summary of C. elegans Manual Biological Process Annotations

Numbers are total annotation counts; numbers in parentheses count the annotations that have extensions.

Annotation Group  IMP           IGI         IDA        ISS      TAS      IEP       IPI     IC       NAS  ISM  ND  IBA   IKR  HMP      HGI
WormBase          8102 (451)    3215 (92)   1135 (33)  328 (1)  108      310 (18)  51      73 (10)  24   2    3   0     1    0        0
UniProt           3189 (751)    2331 (369)  212 (22)   203      29 (3)   14        2 (2)   5        103  0    68  0     0    18 (18)  135 (135)
CACAO             20            1           3          0        0        0         0       0        0    0    0   0     0    0        0
BHF-UCL           11            0           0          2        0        4         0       0        0    0    0   0     0    0        0
MGI               4             0           6          0        0        0         0       0        0    0    0   0     0    0        0
HGNC              0             0           0          4        0        0         0       0        0    0    0   0     0    0        0
GO_Central        2             0           0          0        0        0         0       0        0    0    0   7430  0    0        0
ParkinsonsUK-UCL  14 (4)        9 (3)       11         3 (1)    0        0         0       0        0    0    0   0     0    0        0
CAFA              2             0           2          0        0        0         0       0        0    0    0   0     0    0        0
Totals            11344 (1206)  5556 (464)  1369 (55)  540 (2)  137 (3)  328 (18)  53 (2)  83 (10)  127  2    71  7530  1    18 (18)  135 (135)


Table 2: Summary of C. elegans Molecular Function Annotations

Numbers are total annotation counts; numbers in parentheses count the annotations that have extensions.

Annotation Group  IMP       IGI  IDA         ISS      TAS     IPI        IC  NAS  ISM  ND   IBA   ISO  IKR
WormBase          174 (11)  32   1895 (214)  653 (5)  45      1355 (5)   22  4    3    79   0     2    2
IntAct            0         0    0           0        0       2250 (56)  0   0    0    0    0     0    0
UniProt           111 (11)  23   165 (5)     198      24 (1)  458 (3)    4   51   0    123  0     0    0
CACAO             1         0    7           0        0       0          0   0    0    0    0     0    0
GO_Central        0         0    0           0        0       0          0   0    0    0    5957  0    1
HGNC              0         0    0           2        0       0          0   0    0    0    0     0    0
ParkinsonsUK-UCL  0         0    0           0        0       2 (2)      0   0    0    0    0     0    0
CAFA              1         0    0           0        0       1          0   0    0    0    0     0    0
Reactome          0         0    0           0        1       0          0   0    0    0    0     0    0
Totals            287 (22)  55   2067 (219)  853 (5)  70 (1)  4066 (66)  26  55   3    202  5957  2    3


Table 3: Summary of C. elegans Cellular Component Annotations

Numbers are total annotation counts; numbers in parentheses count the annotations that have extensions.

Annotation Group  IMP      IGI  IDA         ISS      TAS  IPI      IC      NAS  ISM  ND   IBA
WormBase          9        0    6674 (798)  386      27   143 (3)  50 (1)  7    4    12   0
GO_Central        0        0    0           0        0    0        0       0    0    0    6063
UniProt           72 (22)  1    711 (100)   233      20   1        19      49   0    117  0
MGI               0        0    16          0        0    0        0       0    0    0    0
HGNC              0        0    0           8        0    0        0       0    0    0    0
BHF-UCL           0        0    7           0        0    0        0       0    0    0    0
CACAO             0        0    3           0        0    0        0       0    0    0    0
Reactome          0        0    0           0        8    0        0       0    0    0    0
SynGO             0        0    0           1 (1)    0    0        0       0    0    0    0
Totals            81 (22)  1    7411 (898)  620 (1)  55   144 (3)  69 (1)  56   4    129  6063
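
The cell convention used in Tables 1-3 ("N (M)": N annotations in total, M of them with extensions) can be read mechanically. The following is a minimal sketch, not part of any WormBase pipeline; the names `parse_row` and `column_sum` are illustrative, and the rows are copied from Table 3 (all but the Totals row).

```python
import re

# One cell is "N" or "N (M)": N annotations in total, M of them with extensions.
CELL = re.compile(r"(\d+)(?:\s*\((\d+)\))?")

def parse_row(line):
    """Split a table row into (group, [(total, with_extensions), ...])."""
    group, rest = line.split(None, 1)
    return group, [(int(n), int(m or 0)) for n, m in CELL.findall(rest)]

def column_sum(rows, header, code):
    """Sum one evidence-code column over all rows: (total, with_extensions)."""
    i = header.index(code)
    cells = [parse_row(row)[1][i] for row in rows]
    return sum(c[0] for c in cells), sum(c[1] for c in cells)

HEADER = "IMP IGI IDA ISS TAS IPI IC NAS ISM ND IBA".split()
TABLE3_ROWS = [
    "WormBase 9 0 6674 (798) 386 27 143 (3) 50 (1) 7 4 12 0",
    "GO_Central 0 0 0 0 0 0 0 0 0 0 6063",
    "UniProt 72 (22) 1 711 (100) 233 20 1 19 49 0 117 0",
    "MGI 0 0 16 0 0 0 0 0 0 0 0",
    "HGNC 0 0 0 8 0 0 0 0 0 0 0",
    "BHF-UCL 0 0 7 0 0 0 0 0 0 0 0",
    "CACAO 0 0 3 0 0 0 0 0 0 0 0",
    "Reactome 0 0 0 0 8 0 0 0 0 0 0",
    "SynGO 0 0 0 1 (1) 0 0 0 0 0 0 0",
]

print(column_sum(TABLE3_ROWS, HEADER, "IDA"))  # (7411, 898)
```

The regular expression tolerates a missing space before the parenthesis, so a cell written as "18(18)" parses the same way as "18 (18)".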


Table 4: Summary of C. elegans Computational Annotations

Summary Statistics Based on WormBase Release WS262

Genes Stats:

 Genes with GO_term connections  13880 
   Non-IEA-only annotation             1170
   IEA-only annotation                 6290
   Both IEA and non-IEA annotations    6420

GO_term Stats:

 Distinct GO_terms connected to Genes   6023
   Associated by non-IEA only               3567
   Associated by IEA only                    817
   Associated by both IEA and non-IEA       1639

Type of Annotation                       IEA
InterPro2GO                              21823
Other (e.g. Swiss-Prot keyword mapping)  44965
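
The three gene categories in Table 4 partition the annotated genes by whether their evidence is IEA, non-IEA, or both. As a minimal sketch (the function name and the toy gene identifiers are hypothetical, and the input format of `(gene_id, evidence_code)` pairs is an assumption rather than the WormBase build format), the partition can be computed with set operations:

```python
def partition_by_evidence(annotations):
    """Partition genes by evidence: IEA only, non-IEA only, or both.

    annotations: iterable of (gene_id, evidence_code) pairs.
    """
    iea, non_iea = set(), set()
    for gene, code in annotations:
        (iea if code == "IEA" else non_iea).add(gene)
    return {
        "non_iea_only": non_iea - iea,
        "iea_only": iea - non_iea,
        "both": iea & non_iea,
    }

# Toy input with made-up gene identifiers.
toy = [
    ("gene-a", "IEA"), ("gene-a", "IDA"),
    ("gene-b", "IEA"),
    ("gene-c", "IMP"), ("gene-c", "IBA"),
]
parts = partition_by_evidence(toy)
print({name: sorted(genes) for name, genes in parts.items()})

# The three categories are disjoint, so the reported counts add up
# to the number of genes with GO_term connections.
assert 1170 + 6290 + 6420 == 13880
```

The same partition applied to GO terms instead of genes gives the GO_term statistics, where 3567 + 817 + 1639 = 6023.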

Methods and strategies for annotation

Curation methods

Literature curation

  • Curation of the primary literature continues to be the major focus of our manual annotation efforts.
  • Over the past year, WormBase curation efforts were focused largely on developing preliminary pathway models using the Noctua curation tool.
  • To this end, literature curation involved reviewing C. elegans pathways, the biological entities that participate in those pathways, and the annotations, particularly Molecular Function annotations, associated with those entities. Pathways reviewed include Wnt signaling, Notch signaling, serotonin-mediated GPCR signaling involved in egg laying, Q neuroblast migration, apoptosis, asymmetric cell division, defense response, insulin signaling, neuronal cell fate specification, mRNA decay, semaphorin/plexin signaling, thermosensory transduction, and TOR signaling.
  • WB holds weekly GO-CAM training and discussion sessions to review models and discuss annotation issues.

Curation using the Textpresso information retrieval system

  • We also employ the Textpresso information retrieval system for curation of GO Cellular Component and Molecular Function annotations.

Computational annotation strategies

  • Our computational annotation strategies include mapping genes to GO terms via InterPro domains, performed as part of the WormBase build cycle, as well as computational predictions made via the UniProtKB pipeline, including keyword mappings and UniRule mapping.
  • One major change to our annotation set this year was the removal of our Phenotype2GO annotations. These IEA annotations, based on manual mappings of terms from the Worm Phenotype Ontology (WPO) to GO Biological Process (BP) terms, provided a preliminary set of annotations for genes based on phenotypes from large-scale screens. However, as we move towards a more mechanistic approach to GO curation, including future use of an extended set of Gene-GO Term relations, we feel that these annotations too often represent processes relatively far downstream of the time and site of action of the encoded gene products. As we develop more GO-CAM models, we will review our set of BP annotations for these genes and replace them with more precise annotations wherever possible.
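
The domain-based mapping mentioned above can be pictured as a join between two lookups: gene to InterPro domains, and InterPro domain to GO terms. The following is a minimal sketch under stated assumptions; the function name, the data layout, and every identifier in the example are hypothetical placeholders, and it does not reproduce the actual WormBase build code or real InterPro2GO entries.

```python
def interpro_to_go(gene_domains, domain_to_go):
    """Derive IEA annotations by joining gene->domain and domain->GO lookups.

    Returns sorted (gene, go_term, "IEA", supporting_domains) tuples, with one
    entry per gene/term pair even when several domains map to the same term.
    """
    support = {}
    for gene, domains in gene_domains.items():
        for domain in domains:
            for term in domain_to_go.get(domain, ()):
                support.setdefault((gene, term), set()).add(domain)
    return [
        (gene, term, "IEA", sorted(domains))
        for (gene, term), domains in sorted(support.items())
    ]

# Placeholder identifiers, not real InterPro or GO accessions.
gene_domains = {"gene-1": ["IPR_A", "IPR_B"], "gene-2": ["IPR_C"]}
domain_to_go = {"IPR_A": ["GO:X1", "GO:X2"], "IPR_B": ["GO:X2"]}

for annotation in interpro_to_go(gene_domains, domain_to_go):
    print(annotation)
```

A domain with no mapping (here `IPR_C`) yields no annotation, so `gene-2` is absent from the output.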

Curation strategies

Priorities for annotation

Selection of genes for annotation is guided by several criteria:

  • Annotation of gene sets involved in specific biological processes as part of GO-CAM modeling
  • Genes identified in Textpresso-based curation pipelines, for example genes described in papers that an SVM (Support Vector Machine) classifier flags with high confidence as reporting Molecular Function experiments such as enzymatic assays
  • Re-annotation of genes as part of the QA/QC pipeline, particularly with respect to signaling pathways
  • Re-annotation of genes affected by changes to the ontology, e.g. cilia biology, ubiquitination, enzyme regulator activities, and obsoleted annotation extensions
  • Publication of newly characterized genes for which no previous biological data was available

Presentations and Publications

Papers with substantial GO content

  • Textpresso Central: a customizable platform for searching, text mining, viewing, and curating biomedical literature. (2017) H.-M. Müller, K. M. Van Auken, Y. Li, and P. W. Sternberg. Revised manuscript submitted to BMC Bioinformatics.

Presentations including Talks and Tutorials and Teaching

  • Data Curation in the Biomedical Sciences: from Text to Databases to Knowledge Discovery, Kimberly Van Auken, Data Science Invited Talk Series, November 3, 2017, School of Informatics, Computing, and Engineering, Indiana University, Bloomington, IN

Posters

  • Textpresso Central: A Customizable Platform for Searching, Text Mining, Viewing, and Curating the Biomedical Literature. Kimberly Van Auken, Hans-Michael Mueller, Yuling Li, Seth Carbon, Chris Mungall, Suzanna Lewis, and Paul Sternberg. GOC, June 2017, Corvallis, OR

Other Highlights

Annotation Outreach and User Advocacy Efforts

  • Kimberly Van Auken worked with the GeneDB annotation group to submit updated gene association files.

Annotation Advocacy

  • Kimberly Van Auken continues to co-manage the Annotation Working Group with David Hill (MGI) and Pascale Gaudet (SIB).
  • Kimberly Van Auken serves on the QA/QC working group with Pascale Gaudet (SIB), Sylvain Poux (SIB), and Val Wood (PomBase).
  • Kimberly Van Auken participated in the High Throughput (HTP) Working Group with Helen Attrill (FB), Stacia Engel (SGD), Pascale Gaudet (SIB), Ruth Lovering (UCL), and Sylvain Poux (SIB).
  • Kimberly Van Auken participated in the Signaling Workshop at the October 2017 GOC meeting in Cambridge, UK, where she and Helen Attrill (FB) presented work on the Wnt signaling pathway.
  • Kimberly Van Auken is participating in the Protein Complexes Working Group.
  • WB curators (Cho, Grove, Lee, Mendel, Van Auken) hold a weekly meeting to discuss GO-CAM models.

Text Mining and Textpresso Central

  • Hans-Michael Müller, Kimberly Van Auken, and Seth Carbon continued development of the Textpresso Central (TPC) curation system and its integration with the Noctua annotation tool. TPC enables curators to perform full-text literature searches, view the search results in the context of the paper, annotate text, and send those annotations to an external database. Over the past year, we have worked on developing a curation interface for GO annotation, as well as the protocol for communication between TPC and Noctua. Curators may now send GO Molecular Function annotations from TPC to Noctua by a round-trip protocol initiated from within Noctua.
