Common Annotation Framework
Common Annotation Framework Progress Report
- Juancarlos Chan, Developer (WormBase, Caltech)
- Yuling Li, Developer (Textpresso, Caltech)
- Hans-Michael Müller, Developer (Textpresso, Caltech)
- Kimberly Van Auken (WormBase, Caltech)
- Paul Sternberg, PI (Gene Ontology, WormBase, Caltech)
- One of the goals of the common annotation framework is to allow curators to annotate to IDs for a number of different entities, including proteins, non-coding RNAs (ncRNAs), and macromolecular complexes.
- We solicited input from consortium members on which additional IDs they would like to use in Protein2GO, the GOA annotation tool that currently serves as the common annotation tool.
- We held two conference calls to discuss annotation to non-UniProtKB protein identifiers. Current plans are:
- ncRNAs - the annotation tool will use RNA identifiers from RNAcentral. Groups wishing to annotate to RNAcentral IDs will need to coordinate with that resource to ensure that the IDs in their database can be mapped to RNAcentral IDs.
- macromolecular complexes - the annotation tool will use IntAct complex identifiers. On March 4th, 2014, Sandra Orchard, Proteomics Services Team Coordinator at the EBI in Hinxton, gave a presentation to GOC curators on creating and annotating to macromolecular complexes. Groups wishing to annotate to macromolecular complexes will need to coordinate with IntAct to get editor IDs for the IntAct curation tool.
Literature and Text Mining Tools
- Towards an Integrated Literature Curation Platform
- We have started to build a literature work platform that allows curators and general users to perform curation tasks as efficiently as possible. The platform functionality will include full-text searching, paper viewing, paper curation, the training and application of machine learning and text mining algorithms, and the creation and editing of semantic categories. The application will take advantage of state-of-the-art technologies and open-source libraries for indexing, annotation, and web development.
- The platform will use Lucene, Wt (a C++ web toolkit), and UIMA (Unstructured Information Management Architecture). It is written entirely in C++, which gives the application the advantages that C++ holds over other programming languages.
- Modules Written or in Development
- PDF to CAS (Common Analysis System) converter
- CAS is a standard file format for storing unstructured text together with its annotations, and it is central to the UIMA framework. The converter takes a PDF file, extracts all text, and tokenizes it. It also extracts all graphical information.
- NXML to CAS converter
- Same functionality as the PDF to CAS converter except that it takes an NXML file instead of a PDF file.
- Lexical Annotator
- Loads lexica of various semantic categories and annotates a CAS file.
- Lucene Indexer
- Takes a CAS file and indexes the full text for keywords and lexical annotations.
- Category Editor
- Web-based editor of semantic categories and their lexica that are stored in a PostgreSQL database.
- Paper Viewer
- Displays a CAS file that represents a research paper (in NXML or PDF format) along with all of its annotations; allows the user to select text snippets from the web display and then make additional manual annotations, which are stored in a PostgreSQL database.
- Annotation Updater
- Updates a CAS file with manual annotations obtained from the PostgreSQL database.
- Search Interface
- Allows users to submit a Lucene query to the Lucene index generated by the Lucene Indexer. Displays search results and bibliographical information, and provides links to the Paper Viewer for papers of interest.
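The idea behind the Lexical Annotator can be illustrated with a small, self-contained C++ sketch. This is not the project's code and it does not use UIMA, Lucene, or Wt. It only shows how a tokenizer and a lexicon lookup can produce stand-off annotations, that is, character offsets into the text plus a category label, which is the general way a CAS associates annotations with text. The names `Annotation`, `Lexicon`, `tokenize`, and `annotate` are hypothetical, as are the simplifications that lexicon entries are single tokens, that matching is case-insensitive, and that hyphens count as token characters.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Stand-off annotation: a half-open character span [begin, end) in the
// document text plus a label. Hypothetical type, not a UIMA class.
struct Annotation {
    std::size_t begin;
    std::size_t end;
    std::string category;
};

// Hypothetical lexicon: lower-cased single-token term -> semantic category.
using Lexicon = std::map<std::string, std::string>;

// Return a lower-cased copy of s (ASCII only).
static std::string toLower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

// Assumption: letters, digits, and hyphens form tokens, so that a name
// such as "daf-16" stays in one piece.
static bool isTokenChar(unsigned char c) {
    return std::isalnum(c) != 0 || c == '-';
}

// Split the text into tokens, recording the offsets of each one.
std::vector<Annotation> tokenize(const std::string& text) {
    std::vector<Annotation> tokens;
    std::size_t i = 0;
    while (i < text.size()) {
        if (!isTokenChar(static_cast<unsigned char>(text[i]))) {
            ++i;
            continue;
        }
        const std::size_t start = i;
        while (i < text.size() &&
               isTokenChar(static_cast<unsigned char>(text[i]))) {
            ++i;
        }
        tokens.push_back({start, i, "token"});
    }
    return tokens;
}

// Emit one annotation for every token whose lower-cased form is a
// lexicon entry; the annotation carries the entry's category.
std::vector<Annotation> annotate(const std::string& text,
                                 const Lexicon& lexicon) {
    std::vector<Annotation> result;
    for (const Annotation& tok : tokenize(text)) {
        const std::string term =
            toLower(text.substr(tok.begin, tok.end - tok.begin));
        const auto hit = lexicon.find(term);
        if (hit != lexicon.end()) {
            result.push_back({tok.begin, tok.end, hit->second});
        }
    }
    return result;
}
```

Given a lexicon that maps `daf-16` to a gene category and `nucleus` to a cellular component category, calling `annotate` on a sentence that mentions both terms returns two annotations whose offsets point back into the unmodified text. Keeping the text untouched and storing only offsets lets automatic and manual annotations coexist on the same document, which is the role the CAS file plays in the modules listed above.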
--Hmmuller 23:06, 13 March 2014 (UTC)