Common Annotation Framework

From GO Wiki
Revision as of 19:11, 13 March 2014 by Hmmuller (talk | contribs) (→‎Literature and Text Mining Tools)

Common Annotation Framework Progress Report

Personnel

  • Juancarlos Chan, Developer (WormBase, Caltech)
  • Yuling Li, Developer (Textpresso, Caltech)
  • Hans-Michael Müller, Developer (Textpresso, Caltech)
  • Kimberly Van Auken (WormBase, Caltech)
  • Paul Sternberg, PI (Gene Ontology, WormBase, Caltech)

Annotation IDs

  • One of the goals of the common annotation framework is to allow curators to annotate to IDs for a number of different entities, including proteins, non-coding RNAs (ncRNAs), and macromolecular complexes.
  • We solicited input from consortium members on which additional IDs they would like to use in Protein2GO, the GOA annotation tool that currently serves as the common annotation tool.
  • We held two conference calls to discuss annotation to identifiers other than UniProtKB protein identifiers. The current plans are:
    • ncRNAs - the annotation tool will use RNA identifiers from RNAcentral. Groups wishing to annotate to RNAcentral IDs will need to coordinate with that resource to make sure that IDs between their database and RNAcentral can be mapped accordingly.
    • macromolecular complexes - the annotation tool will use IntAct complex identifiers. On March 4th, 2014, Sandra Orchard, Proteomics Services Team Coordinator at the EBI in Hinxton, gave a presentation to GOC curators on creating and annotating to macromolecular complexes. Groups wishing to annotate to macromolecular complexes will need to coordinate with IntAct to get editor IDs for the IntAct curation tool.

Literature and Text Mining Tools

  • Towards an Integrated Literature Curation Platform
We have started to build a literature curation platform that lets curators and general users perform curation tasks as efficiently as possible. Planned functionality includes full-text searching, paper viewing, paper curation, training and applying machine learning and text mining algorithms, and creating and editing semantic categories. The application will take advantage of state-of-the-art technologies and open-source libraries for indexing, annotation and web development.
  • Specifications
The platform will use Lucene, Wt (a C++ Web Toolkit) and UIMA (Unstructured Information Management Architecture). It is written entirely in C++, a choice made for the advantages C++ offers over other programming languages.
  • Modules Written or in Development
PDF to CAS (Common Analysis System) converter
A CAS file is a standard file format for storing unstructured text together with its annotations, and it is central to the UIMA framework. The converter takes a PDF file, extracts all text information and tokenizes it. It also extracts all graphical information.
NXML to CAS converter
Same functionality as the PDF to CAS converter except that it takes an NXML file instead of a PDF file.
Lexical Annotator
Loads lexica of various semantic categories and annotates a CAS file.
Lucene Indexer
Takes a CAS file and indexes the full text for keywords and lexical annotations.
Category Editor
Web-based editor of semantic categories and their lexica that are stored in a PostgreSQL database.
Paper Viewer
Displays a CAS file that represents a research paper (in NXML or PDF format) along with all of its annotations. The user can select text snippets in the web display and make additional manual annotations, which are stored in a PostgreSQL database.
Annotation Updater
Updates a CAS file with manual annotations obtained from the PostgreSQL database.
Search Interface
Allows users to submit a Lucene query to the index generated by the Lucene Indexer. Displays search results and bibliographical information, and provides links that open papers of interest in the Paper Viewer.
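
To make the interplay of the Lexical Annotator, the Lucene Indexer and the Search Interface more concrete, the following is a minimal sketch of the idea using only the C++ standard library. It is not the platform's code and does not use the Lucene or UIMA APIs: the names `Annotation`, `Token`, `Lexicon`, `tokenize`, `annotate` and `Index`, the `category:` key prefix, and the restriction to single-word lexicon terms are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <unordered_map>
#include <vector>

// Stand-off annotation: a category attached to a character span of the text,
// loosely modelled on how a CAS keeps annotations separate from the text.
struct Annotation {
    std::size_t begin;     // offset of the first character
    std::size_t end;       // offset one past the last character
    std::string category;  // hypothetical category name, e.g. "gene"
};

struct Token {
    std::size_t begin;
    std::size_t end;
    std::string text;  // lower-cased token text
};

// Split text into lower-cased alphanumeric tokens, keeping character offsets.
std::vector<Token> tokenize(const std::string& text) {
    std::vector<Token> tokens;
    std::size_t i = 0;
    while (i < text.size()) {
        if (!std::isalnum(static_cast<unsigned char>(text[i]))) { ++i; continue; }
        const std::size_t start = i;
        std::string word;
        while (i < text.size() && std::isalnum(static_cast<unsigned char>(text[i]))) {
            word += static_cast<char>(std::tolower(static_cast<unsigned char>(text[i])));
            ++i;
        }
        tokens.push_back({start, i, word});
    }
    return tokens;
}

// Assumed lexicon shape: lower-cased single-word term -> semantic category.
using Lexicon = std::unordered_map<std::string, std::string>;

// Lexical annotation: emit one annotation per token found in the lexicon.
std::vector<Annotation> annotate(const std::vector<Token>& tokens, const Lexicon& lexicon) {
    std::vector<Annotation> out;
    for (const Token& t : tokens) {
        assert(t.begin < t.end);  // tokenize() never produces empty tokens
        auto it = lexicon.find(t.text);
        if (it != lexicon.end()) out.push_back({t.begin, t.end, it->second});
    }
    return out;
}

// Toy inverted index over keywords and categories (a stand-in for Lucene).
class Index {
public:
    void add(int docId, const std::string& text, const Lexicon& lexicon) {
        const std::vector<Token> tokens = tokenize(text);
        for (const Token& t : tokens) postings_[t.text].insert(docId);
        for (const Annotation& a : annotate(tokens, lexicon))
            postings_["category:" + a.category].insert(docId);
    }

    // Look up a lower-cased keyword, or "category:<name>" for a category.
    std::set<int> search(const std::string& key) const {
        auto it = postings_.find(key);
        if (it == postings_.end()) return {};
        return it->second;
    }

private:
    std::map<std::string, std::set<int>> postings_;
};
```

In this sketch a paper is added with `index.add(docId, text, lexicon)`, after which `index.search("pharynx")` returns the papers containing that keyword and `index.search("category:gene")` returns the papers in which any lexicon term of the hypothetical category "gene" was annotated. Keeping annotations as offset spans rather than inline markup means that further annotations, such as the manual ones the Annotation Updater merges in, can be added without altering the text.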

--Hmmuller 23:06, 13 March 2014 (UTC)