Software Group progress report for 2014

Revision as of 13:02, 17 December 2014

Report period: Dec 1, 2013 to Nov 30, 2014.

The activities of the software group intersect all aims of the GO proposal, so these are broken down by aims here.

Aim 1

We will provide experimental annotations for human and major model organisms.

Accomplishments: Support of the automatic annotation Quality Control pipeline

  • Created a Continuous Integration System for monitoring, quality control, and publication of annotations (using Jenkins). The monitoring site includes statistics and metrics of data quality. We continue to support and extend the pipeline on an ongoing basis.
  • Documented patterns used for extended annotations together with members of the GO annotation group.
  • Developed a specification for publishing GO annotations as linked data (OBAN) in collaboration with the Parkinson team / CCTV project at EBI
  • Supported the display of annotation extensions in AmiGO
  • Created a system for verification of annotation extension relations

Publications

  • Huntley RP, Harris MA, Alam-Faruque Y, Blake JA, Carbon S, Dietze H, Dimmer EC, Foulger RE, Hill DP, Khodiyar VK, Lock A, Lomax J, Lovering RC, Mutowo-Meullenet P, Sawford T, Van Auken K, Wood V, Mungall CJ. A method for increasing expressivity of Gene Ontology annotations using a compositional approach. BMC Bioinformatics. 2014 May 21;15:155. doi: 10.1186/1471-2105-15-155. PubMed PMID: 24885854; PubMed Central PMCID: PMC4039540.
  • Chibucos MC, Mungall CJ, Balakrishnan R, Christie KR, Huntley RP, White O, Blake JA, Lewis SE, Giglio M. Standardized description of scientific evidence using the Evidence Ontology (ECO). Database (Oxford). 2014 Jul 22;2014. pii: bau075. doi: 10.1093/database/bau075. Print 2014. PubMed PMID: 25052702; PubMed Central PMCID: PMC4105709.


Aim 3

We will perform phylogenetically-based propagation of annotations.

Accomplishments: Phylogenetic Annotation Software (PAINT)

At the beginning of this period the code was at beta70. In May we released PAINT 1.0, and since then there have been 13 minor releases. A large number of enhancements were added and bugs fixed at the PAINT hackathon in July, giving attendees an immediate response to their requests.

  • Created the ability to add columns for more general terms to enable their use for ancestral annotation when the experimental annotations of the extant descendants are to more specific terms.
  • Provided complete undo/redo support, with the history recorded and displayed in the log file (also known as ‘notes’)
  • Added the capability to collapse branches of the tree for which there are no experimental annotations among the descendants.
  • Implemented a GO taxon check web service (currently runs on Berkeley server)
  • Added a call out to the GO taxon check web service dynamically when a user attempts to annotate an ancestral node to determine if it is allowable.
  • Improved the Multiple Sequence Alignment (MSA) view.
  • Improved the search functionality
  • Added special graphic for lateral transfer
  • Numerous other small enhancements (e.g. tooltips, formatting of notes, switch to GO_Central as the source) and maintenance as bugs were reported.
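
The taxon check performed when a user annotates an ancestral node can be illustrated with a minimal sketch. It assumes constraints of the "only in taxon" and "never in taxon" kind attached to GO terms and a toy child-to-parent taxonomy map. The constraint table, the taxonomy subset, and the function names are hypothetical and do not describe the actual web service interface.

```python
# Toy taxonomy: child -> parent (hypothetical subset, for illustration only).
TAXON_PARENT = {
    "Homo sapiens": "Mammalia",
    "Mus musculus": "Mammalia",
    "Mammalia": "Eukaryota",
    "Arabidopsis thaliana": "Viridiplantae",
    "Viridiplantae": "Eukaryota",
    "Eukaryota": "cellular organisms",
    "Bacteria": "cellular organisms",
}

# Illustrative constraints keyed by GO term ID; not taken from the GO
# constraint files.
CONSTRAINTS = {
    "GO:0007595": {"only_in": ["Mammalia"], "never_in": []},
    "GO:0015979": {"only_in": ["cellular organisms"], "never_in": ["Mammalia"]},
}


def lineage(taxon):
    """Return the taxon followed by all of its ancestors."""
    chain = [taxon]
    while taxon in TAXON_PARENT:
        taxon = TAXON_PARENT[taxon]
        chain.append(taxon)
    return chain


def annotation_allowed(go_id, node_taxon):
    """Check whether a term may be annotated to a tree node of this taxon."""
    rule = CONSTRAINTS.get(go_id, {})
    ancestors = set(lineage(node_taxon))
    # A "never in" taxon anywhere in the lineage rules the annotation out.
    if any(t in ancestors for t in rule.get("never_in", [])):
        return False
    # Every "only in" taxon must appear in the lineage of the node.
    return all(t in ancestors for t in rule.get("only_in", []))
```

Under these assumptions, a term restricted to Mammalia can be placed on a mouse or human node but is rejected on an ancestral Eukaryota node, which is the situation the dynamic call-out is meant to catch.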

Accomplishments: PAINT annotations:

  • Joined the PAINT curation team for phylogenetic-based curation of ontology annotations of gene families.
  • Attended a one-day training session on PAINT curation with collaborators at the University of Southern California, and a four-day PAINT curation hackathon at Stanford University.
  • Joined the annotation team as species-agnostic curators and began actively contributing annotations for approximately 2,000 gene sequences while reviewing GO terms related to DNA repair via recombinational and non-recombinational methods.
  • Currently attend bi-weekly conference calls to report progress with this group.

Accomplishments: jsPAINT:

  • Initial work on an updating script (“touchup”) is underway and will be completed in the first quarter of 2015. This code will ensure that the GAF files exported from PAINT by the annotators remain synchronized with the latest versions of the GO, the experimental annotations, and the PANTHER family trees.
  • We assisted in the mentoring of a Google Summer of Code student (under BioJS) in the development of a Web Browser MSA viewer to use in jsPAINT
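
One part of keeping exported GAF files synchronized with the latest GO can be sketched as follows. This is a minimal illustration, not the "touchup" code itself: it assumes tab-delimited GAF rows with the GO ID in the fifth column, and the `replaced_by` and `obsolete` tables, the placeholder IDs, and the function name are all hypothetical.

```python
def touch_up(gaf_lines, replaced_by, obsolete):
    """Return (updated_rows, flagged_rows) for an iterable of GAF lines.

    replaced_by: dict mapping merged GO IDs to their current primary IDs.
    obsolete:    set of GO IDs that no longer exist and need curator review.
    """
    updated, flagged = [], []
    for line in gaf_lines:
        line = line.rstrip("\n")
        if line.startswith("!"):
            # Header and comment lines pass through unchanged.
            updated.append(line)
            continue
        cols = line.split("\t")
        go_id = cols[4]  # GO ID is the fifth GAF column
        if go_id in obsolete:
            # Leave obsolete annotations for a curator rather than guessing.
            flagged.append(line)
            continue
        cols[4] = replaced_by.get(go_id, go_id)
        updated.append("\t".join(cols))
    return updated, flagged
```

Separating rows that can be rewritten automatically from rows that need review keeps the script from silently changing the meaning of an annotation.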

Presentations

  • Poster; Wellcome Trust Advanced Courses and Scientific Conferences, Genome Informatics, Churchill College, Cambridge UK, September 2014 - Phylogenetic Annotation INference Tool: PAINT.
  • Poster; Lawrence Berkeley Laboratory, Life Sciences Retreat. October 2014 - Phylogenetic Annotation INference Tool: PAINT.

Aim 4

We will develop a Common Annotation Framework.

Accomplishments: Noctua

Noctua is a new project that allows users to collaborate simultaneously on building complex annotations using a graph-based interface. It runs entirely in the user’s web browser (JavaScript), making use of jsPlumb and Socket.io. We have made substantial progress in the past year, taking Noctua from design concept, to prototypes, to a usable alpha, which is currently being used for usability evaluation. We are working against a development roadmap that seeks to meet future community needs based on the feedback we are receiving from our alpha testers. Our progress to date includes:

  • Evaluation of a range of possible technologies for Noctua development, including creation of early demonstration prototypes to evaluate technology from both user and developer perspectives
  • Design and implementation of a three part architecture to maximize flexibility and minimize component complexity: a web client, a coordination layer (Minerva), and a data engine (Barista)
  • We have added new features and enhancements to the Noctua web client interface and issued fixes prompted by user feedback. Noctua reuses AmiGO-based widgets for term and gene product searches, ensuring easier maintenance and lower development costs.
  • The ‘Minerva’ coordinator implementation includes:
    • a messaging and authorization server to coordinate communication across multiple clients (using Socket.io) and mediate communications with the BBOP JS data engine
    • a login and authorization service (via Mozilla Foundation’s Persona identity tokens)
  • The ‘Barista’ data engine component provides the data store, data model management, and all logical operations on the models, as well as attempted operation and model status. While our future plans call for use of a triplestore, the data engine currently uses an in-memory model, the filesystem, and Amazon S3 as its data store. The data engine is currently seeded in multiple ways to support legacy annotations and migrate existing GO annotations:
    • Seeding of LEGO models from existing annotation GAF files and the ontology
    • Conversely, we extended our OWLTools library to convert LEGO models to GPAD/GPI and GAF files. This backwards compatibility is automatically maintained by a Jenkins job as a standard part of the GO data pipeline
  • Additional extensions to the BBOP JS library (supporting both AmiGO & Noctua) to make it even more generic and add new functionality. The enhanced implementation of service agents supports fine-grained client/server interactions in Noctua
  • Met with Dexter Pratt (Ideker group) to initiate discussions on integration with NDEx, OpenBEL format

Accomplishments: Text Mining

  • The WormBase and Textpresso teams developed a new support vector machine (SVM) document classifier for a subclass of the Molecular Function ontology: catalytic activity. This SVM is included in the WormBase data flagging pipeline and will be incorporated into the Textpresso Central suite of curation tools (see below).
  • MGI, WormBase, and Textpresso are collaborating on a document classification pipeline to help MGI identify papers suitable for curation using an SVM classifier to distinguish mouse from non-mouse papers. The initial SVM has been developed and further work will be aimed at identifying mouse markers (genes) associated with experimental data in these papers.
  • Hans-Michael Muller and Yuling Li started developing a literature curation platform named Textpresso Central that enables curators to perform full-text literature searches, view and curate research papers, and train and apply machine learning and text mining algorithms for semantic analysis and curation purposes. Users are supported in this task by capabilities to select, edit, and store lists of papers, sentences, terms, and categories in order to perform training and mining. The system is designed with the intent to empower the user to perform as many operations on a literature corpus or a particular paper as possible. It uses state-of-the-art software packages and frameworks such as the Unstructured Information Management Architecture (http://uima.apache.org), Lucene (http://lucene.apache.org), and Wt (http://www.webtoolkit.eu/wt). The corpus of papers can be built from full-text articles that are available in PDF format (http://en.wikipedia.org/wiki/Portable_Document_Format) or NXML (http://dtd.nlm.nih.gov/). An extension for articles published in HTML (http://en.wikipedia.org/wiki/HTML) is planned.
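
The idea behind an SVM document classifier such as those described above can be sketched with a toy linear SVM over bag-of-words features, trained by stochastic subgradient descent on the regularized hinge loss. This is only an illustration of the technique: the training sentences, hyperparameters, and function names are hypothetical, and the WormBase, MGI, and Textpresso classifiers are not claimed to be implemented this way.

```python
from collections import Counter


def featurize(text):
    """Bag-of-words counts over lowercase whitespace tokens."""
    return Counter(text.lower().split())


def score(weights, features):
    """Linear decision value w . x for a sparse feature vector."""
    return sum(weights.get(tok, 0.0) * n for tok, n in features.items())


def train_linear_svm(examples, epochs=20, eta=0.1, lam=0.01):
    """Minimize lam/2 * |w|^2 + hinge loss with fixed-step subgradient steps."""
    weights = {}
    for _ in range(epochs):
        for text, label in examples:  # label is +1 or -1
            feats = featurize(text)
            margin = label * score(weights, feats)
            # Regularization step: shrink every weight slightly.
            for tok in weights:
                weights[tok] *= 1.0 - eta * lam
            # Hinge-loss step: only examples inside the margin contribute.
            if margin < 1.0:
                for tok, n in feats.items():
                    weights[tok] = weights.get(tok, 0.0) + eta * label * n
    return weights


def classify(weights, text):
    return 1 if score(weights, featurize(text)) > 0 else -1


# Hypothetical training sentences: +1 = catalytic activity, -1 = other.
TRAINING = [
    ("enzyme catalyzes hydrolysis of substrate", 1),
    ("kinase activity catalyzes phosphorylation", 1),
    ("mutant worms show locomotion defect", -1),
    ("gene expression observed in neurons", -1),
]
MODEL = train_linear_svm(TRAINING)
```

In a flagging pipeline, documents that `classify` labels +1 would be queued for a curator, so the classifier narrows the reading list rather than making curation decisions itself.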

Aim 5

We will maintain and upgrade the Gene Ontologies

Accomplishments (supplement)

  • Continued to work on integration between CL and GO; we hold biweekly meetings of CL editors (current attendees: LBNL, OHSU, Buffalo, ZFIN).
  • Integrated Uberon logical definitions and TermGenie templates.
  • Continued semi-automated alignment of Uberon with the implicit GO anatomy in various areas, e.g. renal [Alam-Faruque 2014], and performed additional integration with other ontologies [Haendel 2014]
  • Created a Cell Ontology TermGenie instance to support both OMICS consortia (in use by ENCODE) and to support GO editors and annotators.
  • Created Continuous Integration job for the cell ontology as a part of the Jenkins pipeline
  • Performed link-filling and new term requests to support the FANTOM5 project [Andersson 2014]

Accomplishments (core GO)

  • Published TermGenie paper [Dietze et al. 2014]
  • Created workflow for relation editing and relation constraint editing
  • Extensions to Relations Ontology
  • Provided support for ontology sourceforge Jamboree
  • Worked closely with the ontology group on maintaining and refactoring various aspects of the ontology
  • Initiated a project to unify the GO biological process branch and the NCI Thesaurus
  • Restored e-mail reports for active requests on SourceForge (migrating scripts to Jenkins and using the current SourceForge API)
  • Refactored pipeline for different GO builds
  • Protege Plugin for OBO-annotations in OWL, improved usability to edit OBO compliant OWL annotations for labels, references and similar in Protege
  • Commenced work on persistent cached link ontology manager (Protege plugin of high priority for GO workflow)
  • Documented and published on use of OWL in GO [Mungall 2014owled]
  • TermGenie improvements: commit to OWL, recent submissions page, quick ontology state check, support for and use of SSH keys for SVN authentication, 7 new templates added to the GO TermGenie, tree-based view of available templates in GO TermGenie

Publications

  • Heiko Dietze, Tanya Z Berardini, Rebecca E Foulger, David P Hill, Jane Lomax, David Osumi-Sutherland, Paola Roncaglia and Christopher J Mungall. TermGenie – a web-application for pattern-based ontology class generation. Journal of Biomedical Semantics [PMCID in progress]
  • Haendel MA, Balhoff JP, Bastian FB, Blackburn DC, Blake JA, Bradford Y, Comte A, Dahdul WM, Dececchi TA, Druzinsky RE, Hayamizu TF, Ibrahim N, Lewis SE, Mabee PM, Niknejad A, Robinson-Rechavi M, Sereno PC, Mungall CJ. Unification of multi-species vertebrate anatomy ontologies for comparative biology in Uberon. J Biomed Semantics. 2014 May 19;5:21. doi: 10.1186/2041-1480-5-21. eCollection 2014. PubMed PMID: 25009735; PubMed Central PMCID: PMC4089931.
  • Andersson R, Gebhard C, Miguel-Escalada I, Hoof I, Bornholdt J, Boyd M, Chen Y, Zhao X, Schmidl C, Suzuki T, Ntini E, Arner E, Valen E, Li K, Schwarzfischer L, Glatz D, Raithel J, Lilje B, Rapin N, Bagger FO, Jørgensen M, Andersen PR, Bertin N, Rackham O, Burroughs AM, Baillie JK, Ishizu Y, Shimizu Y, Furuhata E, Maeda S, Negishi Y, Mungall CJ, Meehan TF, Lassmann T, Itoh M, Kawaji H, Kondo N, Kawai J, Lennartsson A, Daub CO, Heutink P, Hume DA, Jensen TH, Suzuki H, Hayashizaki Y, Müller F; FANTOM Consortium, Forrest AR, Carninci P, Rehli M, Sandelin A. An atlas of active enhancers across human cell types and tissues. Nature. 2014 Mar 27;507(7493):455-61. doi: 10.1038/nature12787. PubMed PMID: 24670763.

  • Mungall, C. J., Dietze, H., & Osumi-Sutherland, D. (2014). Use of OWL within the Gene Ontology. In M. Keet & V. Tamma (Eds.), Proceedings of the 11th International Workshop on OWL: Experiences and Directions (OWLED 2014) (pp. 25–36). Riva del Garda, Italy, October 17-18, 2014. doi:10.1101/010090

Aim 6

We will provide annotations and ontologies to the broad genetics community, supporting the use of the Gene Ontology resources.

Accomplishments: The AmiGO Ontology and Annotation Browser:

  • AmiGO 1.x
    • We provided continued support and maintenance of AmiGO 1.x legacy code to cover any use cases not yet covered by AmiGO 2 for our large user base;
    • We provided continuing support for the legacy databases, including exploration of simplifying the pipeline using modern software (OWLTools).
  • AmiGO 2.x - We successfully released AmiGO 2, with an accelerated ‘GOlr’ backend based on Solr/Lucene technology:
    • GOlr’s new fast text and facet searching makes it possible for users to interactively search the data and filter away unwanted results;
    • The initial production release was thoroughly tested for stability and improved performance over AmiGO 1, by orders of magnitude in some cases;
    • We coordinated testing and phased rollout of the new AmiGO 2 stack, with its simplified and refactored codebase, and oversaw its deployment with the production team at Stanford;
    • Since its initial release (AmiGO 2.1), AmiGO 2 has been continuously enhanced with new user-requested features, and problems encountered by users have been fixed;
    • Since the initial production release we have continued to increase the number and detail of fields and personalities offered to users;
    • We now have in development numerous pre-beta tools, with both novel and user-requested functionality.
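
The combination of text search and facet filtering that a Solr-based backend offers can be sketched by building a query URL from standard Solr request parameters (`q`, `fq`, `facet`, `facet.field`, `rows`, `wt`). The host, the field names `document_category` and `taxon_label`, and the helper name are assumptions made for illustration and are not taken from the GOlr schema.

```python
from urllib.parse import urlencode


def build_golr_query(base_url, text, filters=None, facet_fields=(), rows=10):
    """Build a Solr select URL with filter queries and facet counts."""
    params = [("q", text), ("wt", "json"), ("rows", str(rows))]
    # Each filter narrows the result set without affecting relevance scoring.
    for field, value in (filters or {}).items():
        params.append(("fq", f'{field}:"{value}"'))
    # Facet fields ask Solr to return value counts the user can drill into.
    if facet_fields:
        params.append(("facet", "true"))
        params.extend(("facet.field", field) for field in facet_fields)
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical local Solr instance and field names.
EXAMPLE_URL = build_golr_query(
    "http://localhost:8080/solr",
    "kinase",
    filters={"document_category": "annotation"},
    facet_fields=["taxon_label"],
)
```

Each facet count returned by such a query can be turned into a further `fq` filter, which is how a user interactively filters away unwanted results.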

Accomplishments: GO Web Site:

  • Carried out design, implementation, and deployment of a new and improved website for the Gene Ontology Consortium. The new design involved:
    • 1. Ensuring that all content is up-to-date (and can easily be maintained that way),
    • 2. Ensuring that terms of licensing and usage are upgraded and visible;
    • 3. Making the site more dynamic and interactive;
    • 4. Encouraging participation from the research community while enforcing workflows to most effectively capture their input;
    • 5. Adding many new features - new layout, skin, and dynamic content;
    • 6. Adding a dynamic protein family tree viewer for display of PAINT annotations as part of the upcoming release to the public website (scheduled for 2nd quarter 2015)
  • Carried out a major push to migrate content from the legacy site, including:
    • 1. Reorganization of content in a hierarchical manner to make it consistent throughout the site,
    • 2. Staging and testing of pages as content was transferred. We edited and pushed content for approx. 80% of >200 HTML files from the outdated site.
    • 3. Training eight additional editors and creating editing documentation for them; they worked on the remaining ~20% of the pages, and we coordinated and revised their contributions. The current version of the website contains approximately 150 pages of updated and reorganized information.
  • The new website was successfully deployed to production in June, 2014.
  • Members of the software group are active “gatekeepers” and coordinators of content for the GOC website.

Accomplishments: Infrastructure:

  • Supported GO aims by fulfilling special software and data requests from the consortium
  • OWLTools: incremental improvements to loading software to add functionality for automatically loading the full NCBI Taxonomy ontology and all GOA IEA annotations. (AmiGO 2)

Accomplishments: Outreach:

  • Wrote biannual NAR paper about the GO [authored by Munoz-Torres and Drabkin 2014]
  • Took the initiative to plan and implement an education portal for the GOC (work in progress).
  • Continuously supported the GO user community by responding to inquiries received via the GO Helpdesk (http://geneontology.org/form/contact-go)

Publications

  • The Gene Ontology Consortium. 2014. Gene Ontology Consortium: Going Forward. Nucleic Acids Res., In Press (doi: 10.1093/nar/gku1179)