Software Group progress report for 2012

From GO Wiki

Aim 4. We will develop a Common Annotation Framework. We will develop a centralized curation system to help curators efficiently capture annotations from the literature. Our project database will be extensible to provide quality checks and to allow export to GOA files and other formats for GO projects, as well as for Model Organism Databases (MODs) and other project databases. We will complete the OBO (Open Biological Ontologies) to OWL (Web Ontology Language) mapping, and make full use of OWL constructs and tools as required within the GOC. We will port the software backend to make use of standard application programming interfaces (APIs) such as the Java OWL-API.

Over the past year, we undertook the following steps towards a Common Annotation Framework.

1) To fully ascertain the curation needs of the diverse GO community, we began this part of the project by conducting phone interviews with curators from 10 different groups that currently annotate, or have recently annotated, for GO (FlyBase, MGI, PAMGO, PomBase, RGD, SGD, TAIR, UniProt, WormBase, and ZFIN). Each curator was asked to demonstrate their current annotation tool, discuss the benefits and limitations of that tool, and provide feedback on the issues their annotation group might need to address when migrating to a common annotation tool. The resulting requirements document is at: https://docs.google.com/document/d/1JoWe_2uc0QkzpZkG9iSOdrtz1cXXtf0aXRQnouPnTA/edit

2) Upon consideration of these requirements, we decided to begin the transition to a Common Annotation Framework by systematically migrating GO Consortium members to curating with the Protein2GO tool developed by the GO Annotation (GOA) group at UniProt. The Protein2GO tool offers many essential and desired features, including sophisticated, real-time error checking and multiple ontology look-up services for populating the Annotation Extension column (Column 16).

3) Groups already using Protein2GO include UniProt, dictyBase, and AgBase. The first new group to migrate to Protein2GO will be WormBase, followed by SGD. Curators and developers at UniProt, WormBase, and SGD have developed guidelines and standard operating procedures for the transition to ease the migration of subsequent groups. Annotations made using Protein2GO will be exported biweekly to each group in the form of a Gene Association File that each MOD will subsequently submit to the GOC and upload into their own database. At this time, WormBase has finished validating their annotation file and will begin using the Protein2GO tool in January 2013.
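As an illustration of the kind of check a group might run on an exported Gene Association File, the following is a minimal sketch that assumes the 17-column, tab-separated GAF 2.0 layout, in which Column 16 is the Annotation Extension. The function names, the two checks, and the example record are hypothetical and use placeholder identifiers; they are not taken from Protein2GO or from any group's actual validation procedure.

```python
# Column names for the 17 tab-separated columns of GAF 2.0.
# Column 16 (index 15) is the Annotation Extension.
GAF_COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "db_reference", "evidence_code", "with_from", "aspect",
    "db_object_name", "db_object_synonym", "db_object_type",
    "taxon", "date", "assigned_by", "annotation_extension",
    "gene_product_form_id",
]


def parse_gaf_line(line):
    """Return a dict for one annotation line, or None for comment/blank lines."""
    if not line.strip() or line.startswith("!"):
        return None  # header/comment lines in a GAF start with '!'
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GAF_COLUMNS):
        raise ValueError(
            f"expected {len(GAF_COLUMNS)} columns, found {len(fields)}"
        )
    return dict(zip(GAF_COLUMNS, fields))


def check_annotation(record):
    """Hypothetical minimal checks; far simpler than a real curation tool's."""
    problems = []
    if not record["go_id"].startswith("GO:"):
        problems.append("GO ID must start with 'GO:'")
    if record["aspect"] not in {"P", "F", "C"}:
        problems.append("aspect must be one of P, F, C")
    return problems


# Made-up example record with placeholder identifiers (illustration only).
EXAMPLE_LINE = "\t".join([
    "WB", "WBGene00000000", "abc-1", "", "GO:0005634",
    "PMID:0000000", "IDA", "", "C", "", "", "gene",
    "taxon:6239", "20121201", "WB", "part_of(WBbt:0000000)", "",
]) + "\n"

record = parse_gaf_line(EXAMPLE_LINE)
print(record["annotation_extension"], check_annotation(record))
```

Keying each record by column name rather than by position keeps later checks readable and makes it harder to confuse Column 16 with its neighbours.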

4) An additional component of the Common Annotation Framework is the incorporation of text mining tools into the GO curation workflow. Towards this aim, the WormBase and Textpresso groups have been collaborating with dictyBase and TAIR to develop GO Cellular Component curation (Textpresso for CCC) pipelines. For TAIR, the curation pipeline has largely consisted of batch queries, first on a given year of publications and subsequently on the entire corpus. For dictyBase, the Textpresso-based pipeline will be run weekly on newly published papers. Textpresso for CCC will use a newly designed curation form, currently under development and independent of Protein2GO, that allows curators to associate GO annotations with specific sentences and to leverage past annotations for future curation. Protein2GO and WormBase curators are finalizing the details of information transfer via web services from the CCC form at Caltech to the Protein2GO tool at UniProt to ensure that all annotations are ultimately housed in the common Protein2GO database. The performance of Textpresso for CCC for dictyBase and TAIR was evaluated as part of the BioCreative 2012 Workshop in Washington, D.C. and published in Van Auken et al., Database, 2012. Additionally, a preliminary evaluation of text mining-based approaches for GO Molecular Function curation was presented as part of the 2012 BioCurator meeting, also in Washington, D.C.

5) In addition to the Textpresso-based curation pipeline, the WormBase and Textpresso groups have developed a web-based paper viewer/annotator for GO curation. This tool allows curators to upload an HTML or NLM-XML document and then highlight and annotate the sentences used for GO annotation. Sentences highlighted and linked to GO annotations will be used to track the rationale for a given annotation (an effective training tool), and will also serve as training sets for future development of text mining tools. The latter work is currently being performed in collaboration with the BioCreative (Critical Assessment of Information Extraction systems in Biology) group, which will include a GO annotation task for text mining groups in its 2013 Challenge.

Progress toward quarterly milestones: Y1.Q1, GOC database live; Y1.Q2, Web-based access to GOC via Ontology Annotator 1.0; Y1.Q3, User feedback and update Ontology Annotator 1.1; Curation Control Panel 1.0; Y1.Q4, Paper Viewer 1.0; NLP module/Textpresso-cell component curation 1.0.

The Protein2GO approach provides us, at least temporarily, with a central GOC database and a web-based annotation interface. We have a Paper Viewer and a cell component curation module, but not yet an overall control panel.

Plans: Y2.Q1, Specification for Ontology Annotator 2.0; NLP results displayed in Paper Viewer 1.1; Y2.Q2, NLP module/Textpresso-Molecular Function module 1.0; Y2.Q3, Prototype of revised Ontology Annotator 2.0; Y2.Q4, Insertion of NLP results into Ontology Annotator 2.0.

We will assess the use of Protein2GO for non-protein annotations, check the data round trip to and from MODs (e.g., WormBase), and bring on other users. We will refine the paper viewer and use it to highlight the results of NLP, initially from the cell component curation (CCC) pipeline and later in the year from the Molecular Function module. With the core components in place, we will be able to decide whether to continue with Protein2GO or to develop new software that builds on the features of Protein2GO.