Software Group 2010 Future Plans

This report outlines the strategy for the GO Software and Utilities Group for the next 5 to 10 years.

See also: Timeline

Presentation: Media:software-group-report-bar-harbor-2010.pdf

Introduction

Although the GO Software and Utilities group has a history of 4-5 years, the software and infrastructure that it supports date back to the inception of GO. A lot of time is spent supporting tools and infrastructure that were constructed according to outdated requirements.

In particular, the first decade of GO constituted a "wild west" approach to the development of tools and infrastructure. We developed our own in-house software, schemas and formats because the alternatives coming from the wider ontology community were perceived to be either inadequate or not appropriate for GO.

Ontology Development

Changes in Ontology Development

The initial model was an editorial team working through tracker items submitted by annotators. In addition the editorial team would have individual projects for working out some section of the ontology. The editors use a single tool and maintain the graph manually.

The new model is a modular approach to ontology design. Editors work with external groups to get some module up to GO's standards - for example, a subset of CHEBI or the Cell ontology. Once this module is up to standard, it can be used for automated classification within GO using reasoning. Effectively GO "outsources" large pieces of ontology development. New annotator requests can be satisfied instantly by using standard templates and reasoning (the TermGenie system). This frees up GO editors to work on aspects such as upper level organization and TermGenie templates.

Implementation

We have been working towards this new model for some time. We have added OWL-inspired constructs to obo format for the representation of simple compositional class definitions that allow the use of reasoning for automated ontology classification. We have implemented our own reasoners in OBO-Edit.
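
As a rough illustration of how compositional class definitions allow automated classification, the sketch below compares simple genus-differentia definitions against an asserted is_a hierarchy. The term names, the has_output relation and the function names are all invented for this example; it is a toy structural check, not the OBO-Edit reasoner or any third-party reasoner.

```python
# Toy illustration only: hypothetical terms, not real GO or CHEBI content.

# Asserted is_a links (child -> parents), spanning GO and an external module.
IS_A = {
    "glucose": {"hexose"},
    "hexose": {"monosaccharide"},
    "monosaccharide": {"carbohydrate"},
    "glucose biosynthesis": {"biosynthesis"},
    "hexose biosynthesis": {"biosynthesis"},
    "carbohydrate biosynthesis": {"biosynthesis"},
}

# Logical definitions: term -> (genus, relation, filler),
# read as "a <genus> that <relation> some <filler>".
DEFINITIONS = {
    "glucose biosynthesis": ("biosynthesis", "has_output", "glucose"),
    "hexose biosynthesis": ("biosynthesis", "has_output", "hexose"),
    "carbohydrate biosynthesis": ("biosynthesis", "has_output", "carbohydrate"),
}


def ancestors(term):
    """Return the term itself plus everything reachable through is_a."""
    seen, stack = {term}, [term]
    while stack:
        for parent in IS_A.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def inferred_superclasses(term):
    """Defined classes that subsume `term`, found by comparing definitions."""
    genus, relation, filler = DEFINITIONS[term]
    result = set()
    for other, (o_genus, o_relation, o_filler) in DEFINITIONS.items():
        if other == term:
            continue
        if (o_relation == relation
                and o_genus in ancestors(genus)
                and o_filler in ancestors(filler)):
            result.add(other)
    return result


def most_specific(terms):
    """Drop any term that is an inferred superclass of another term in the set."""
    return {t for t in terms
            if not any(t in inferred_superclasses(u) for u in terms if u != t)}
```

Here the placement of "glucose biosynthesis" under "hexose biosynthesis" is never asserted by hand; it follows from the definitions plus the chemical hierarchy supplied by the external module.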

This strategy worked well at first as we could write custom reasoning engines dedicated to the structures we used commonly in GO. However, it doesn't scale well in the long run. There is too much dependence on the implementor of the reasoner (previously John, now Chris). We need to switch to using 3rd party reasoners and infrastructure to remove this dependence.

In addition, the initial requirements for OBO-Edit (OE) did not include modular ontology development, and it has proven difficult to adapt it. Consequently it is difficult to work in the multi-ontology environment required for modular development. The wider ontology community has adopted standards such as MIREOT (which allows for the selective import of external terms into an ontology), which is only partially implemented in OE.
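
The idea of selectively importing external terms can be sketched as follows. This is a simplification under stated assumptions: the external ontology fragment and its IDs are invented, and the sketch simply pulls in each selected term together with its is_a path to the root, which is one possible way of giving an imported term a place in the target ontology. It does not reproduce the MIREOT guideline itself or the OE implementation.

```python
# Hypothetical external ontology fragment (child -> parent); IDs are invented.
EXTERNAL_IS_A = {
    "EXT:0004": "EXT:0003",
    "EXT:0003": "EXT:0002",
    "EXT:0002": "EXT:0001",
    "EXT:0005": "EXT:0001",
    "EXT:0001": None,  # root of the external ontology
}


def import_terms(selected, is_a=EXTERNAL_IS_A):
    """Return the selected terms plus their is_a ancestors as child -> parent links."""
    imported = {}
    for term in selected:
        current = term
        # Walk up until the root, or until we reach a term already imported.
        while current is not None and current not in imported:
            imported[current] = is_a[current]
            current = is_a[current]
    return imported
```

Calling import_terms(["EXT:0004"]) brings in EXT:0004 and its three ancestors but leaves the unrelated EXT:0005 behind, so the importing ontology stays small.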

Most developers of ontology support tools assume use of the OWL language. If we wish to take advantage of these tools we will have to start using OWL. However, it will be necessary to support obo-format as a legacy format for some time. Therefore, our strategy is to treat obo-format as an alternative concrete syntax for (a subset of) OWL, and to develop converters that make the choice of underlying format irrelevant. Many converters have been written in the last few years, but most have been hampered either by the limitations of OWL 1.0 or by the inadequate specification of obo-format. Our goal is to make a formal obo-format 1.4 specification that satisfies computer scientists, and to develop watertight converters. We will also port our internal infrastructure to use the Java OWL API.
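
To make the "alternative concrete syntax" idea concrete, here is a minimal sketch that turns one obo-format term stanza into OWL functional-syntax axioms. It handles only the id, name, is_a and relationship tags, strips "!" comments naively, and uses invented EX: identifiers. The ID-to-IRI pattern and the mapping of part_of to BFO_0000050 are assumptions made for this example; the formal specification, not this sketch, would define the real mapping.

```python
OBO_PURL = "http://purl.obolibrary.org/obo/"

# Assumed relation mapping for this sketch only.
RELATION_IRI = {"part_of": "<" + OBO_PURL + "BFO_0000050>"}

EXAMPLE_STANZA = """
[Term]
id: EX:0000002
name: example process
is_a: EX:0000001 ! example parent
relationship: part_of EX:0000003 ! example whole
"""


def iri(obo_id):
    """Map an OBO-style ID such as EX:0000002 to an IRI in angle brackets."""
    return "<" + OBO_PURL + obo_id.replace(":", "_") + ">"


def parse_term_stanza(text):
    """Parse one [Term] stanza into (tag, value) pairs, dropping '!' comments."""
    pairs = []
    for line in text.strip().splitlines():
        line = line.split("!")[0].strip()  # naive: assumes no '!' inside values
        if not line or line.startswith("["):
            continue
        tag, _, value = line.partition(":")
        pairs.append((tag.strip(), value.strip()))
    return pairs


def stanza_to_owl(text):
    """Translate a term stanza into a list of OWL functional-syntax axioms."""
    pairs = parse_term_stanza(text)
    term = iri(dict(pairs)["id"])
    axioms = ["Declaration(Class(%s))" % term]
    for tag, value in pairs:
        if tag == "name":
            axioms.append('AnnotationAssertion(rdfs:label %s "%s")' % (term, value))
        elif tag == "is_a":
            axioms.append("SubClassOf(%s %s)" % (term, iri(value)))
        elif tag == "relationship":
            relation, target = value.split()
            axioms.append("SubClassOf(%s ObjectSomeValuesFrom(%s %s))"
                          % (term, RELATION_IRI[relation], iri(target)))
    return axioms
```

The point of the sketch is that each obo tag corresponds to a fixed OWL axiom pattern, so the same ontology content can be written in either syntax.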

For ontology development, we imagine a transition away from a pure OE development approach to a mixed-mode approach. Editors will use either OE or Protege4 (P4), depending on the task and the editor's individual preference. For example, reasoning and MIREOT procedures will use P4. In addition, an increasing number of new terms will be granted via TermGenie. TermGenie will be ported to use the OWLAPI. Web interfaces will be increasingly used for other tasks, e.g. pre-obsoletion.

We may switch the editors' version of the ontology to OWL (once adequate support is in place in OE). Of course, we will continue to provide downloads in obo format, as well as Excel-friendly tabular formats, but we expect an increasing number of ontology consumers to download the OWL. For example, groups such as ArrayExpress already use OWL and the OWLAPI internally.

These changes will allow a more modular automated approach to ontology development. Reasoners will be used increasingly for ontology classification and detection of errors. In addition, there will be less of a dependency on individual GO developers as standard 3rd party software is used.

Plan

  • freeze on OE new features
  • prioritize obo/owl mapping
  • plan on OE as plugin to P4

Annotation

Currently information is being lost during annotation due to lack of expressivity of the GO annotation model. We have made some progress, such as specification of col16 and col17. However, the transition has been slow in part because individual MODs lack the computational support required to implement the extensions.

Expressivity

  • Term enrichment using col16

Pulling annotations automatically from pathway databases:

GO-Pathway-Database integration slides

Automated Annotation QC and Inference

We have implemented a taxon constraint system that is in use for annotation checking and further upstream in prediction tools.

Currently this system uses an in-house constraint checking engine, which places a large dependency on in-house developers. We will therefore port this to use the OWLAPI and have the editors specify the logic directly in OWL. This will effectively take the maintenance and development of this system out of the hands of the GO developers.
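
The kind of check such a system performs can be sketched as below. The constraint names only_in_taxon and never_in_taxon, the taxon hierarchy, the terms and the constraints themselves are all assumptions invented for illustration; this is not the in-house engine and these are not actual GO constraints. The sketch shows the basic logic: a constraint placed on a term is inherited by its descendants and is tested against the lineage of the annotated taxon.

```python
# Hypothetical taxon hierarchy (child -> parent).
TAXON_PARENT = {
    "human": "mammals",
    "mammals": "animals",
    "animals": "cellular organisms",
    "arabidopsis": "plants",
    "plants": "cellular organisms",
    "cellular organisms": None,
}

# Hypothetical ontology fragment (child -> parents) and constraints.
TERM_PARENTS = {
    "lactation": {"body fluid secretion"},
    "body fluid secretion": set(),
    "chloroplast fission": {"plastid fission"},
    "plastid fission": set(),
}
ONLY_IN_TAXON = {"lactation": "mammals", "plastid fission": "plants"}
NEVER_IN_TAXON = {"body fluid secretion": "plants"}


def taxon_lineage(taxon):
    """Return the taxon and all of its ancestors."""
    lineage = set()
    while taxon is not None:
        lineage.add(taxon)
        taxon = TAXON_PARENT[taxon]
    return lineage


def term_ancestors(term):
    """Return the term plus all of its is_a ancestors."""
    seen, stack = {term}, [term]
    while stack:
        for parent in TERM_PARENTS.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def violations(term, taxon):
    """List the constraints broken by annotating a gene of `taxon` to `term`."""
    lineage = taxon_lineage(taxon)
    problems = []
    for ancestor in sorted(term_ancestors(term)):
        required = ONLY_IN_TAXON.get(ancestor)
        if required is not None and required not in lineage:
            problems.append("%s: only_in_taxon %s" % (ancestor, required))
        excluded = NEVER_IN_TAXON.get(ancestor)
        if excluded is not None and excluded in lineage:
            problems.append("%s: never_in_taxon %s" % (ancestor, excluded))
    return problems
```

An annotation is acceptable when violations() returns an empty list. Expressing the same constraints as OWL axioms would let a standard reasoner perform this check instead of custom code.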

We have also implemented an inference engine that materializes inferred BP annotations based on MF annotations and MF to BP links. Again this is an in-house tool that could be re-implemented using standard 3rd party software.
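
A minimal sketch of this kind of materialization is shown below, assuming invented term names and a hypothetical table of MF to BP links. It is not the in-house engine. An MF annotation yields a BP annotation whenever the annotated function, or one of its is_a ancestors, is linked to a process and the gene does not already carry that annotation.

```python
# Hypothetical MF hierarchy (child -> parents) and MF -> BP links.
MF_IS_A = {
    "MF:protein kinase activity": {"MF:kinase activity"},
    "MF:kinase activity": set(),
}
MF_TO_BP = {"MF:kinase activity": {"BP:phosphorylation"}}


def mf_ancestors(term):
    """Return the term plus all of its is_a ancestors."""
    seen, stack = {term}, [term]
    while stack:
        for parent in MF_IS_A.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def infer_bp_annotations(annotations):
    """Return new (gene, BP term) pairs implied by the given (gene, term) pairs."""
    existing = set(annotations)
    inferred = set()
    for gene, term in existing:
        for ancestor in mf_ancestors(term):
            for bp_term in MF_TO_BP.get(ancestor, ()):
                if (gene, bp_term) not in existing:
                    inferred.add((gene, bp_term))
    return inferred
```

Only annotations that are genuinely new are returned, so the result can be added to the annotation set without creating duplicates.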

We anticipate additional QC checks and inferences. We will integrate these all into a single rule engine.

Annotation Tools

The software group and the Panther group have been collaborating to develop PAINT (Phylogenetic Annotation Inference Tool), a java standalone application for multi-species annotation based on phylogenetic inference over experimental evidence.

In addition, each group/MOD has its own in-house annotation interface. This diversity leads to duplication of effort. It also makes it hard to move beyond the current GO model, because so much is invested in existing tools and there are no resources to migrate them.

We plan to develop a unified web-based annotation interface "IndiGO" that can be used by a variety of groups. This will also eventually support PAINT-style phylogenetic inference, and will be hooked up to different inference tools.

The existing Protein2GO tool developed by the EBI GOA group can be used as the basis for this tool. In addition, development can dovetail with the development of the generic Pombe annotation tool (Val Wood and Kim Rutherford).

Reference Genome

The Reference Genome project requires support for managing and tracking work. We also need to show this work as part of our web presence.

We already have a number of web-based RefG reports derived directly from the database (developed by Sven, using the AmiGO infrastructure). We also have separate reports generated by Mary - these need to be unified into the same architecture.

There are numerous challenges, some of which will be met by other changes described in this document. We need timely reporting of annotation statistics, which will require better database structures and incremental loading (see below). We need procedures in place for triggering reports, for example when an experimental annotation supporting a phylogenetic inference changes. We need to provide dynamic visualizations for the RefG data (see web presence section).

Metrics

Develop automated metrics.

  • Comparison of gene set enrichment. See first few slides of [1]

Database

The GO database is a crucial part of the GO infrastructure.

  • Underpins AmiGO
  • Underpins PAINT
  • Used by annotation group in QC checks
  • Used by users in GOOSE queries
  • Mirrored in a number of different places for in-house use

However, the GO db was designed in 1999 and the core remains largely unchanged. Although the design anticipated a number of changes we are increasingly hamstrung by the schema design.

  • builds and incremental loading difficult
  • inefficient for common operations
  • software crystallized/moribund around bad design decisions
  • killer queries
  • impedance mismatch of ontologies to relational database systems

In addition, the need for a single monolithic unified database schema is less compelling than five or ten years ago. There are now a number of alternatives to the relational model that may satisfy GO requirements. These include:

  • Text-indexing engines, such as Lucene/SOLR
  • Key-value databases, such as Google BigTable
  • RDF triplestores
  • In-memory querying/reasoning, e.g. OWLAPI

Note that QuickGO is not backed by a relational system - it uses its own indexing solutions for fast queries.

We do not anticipate a single monolithic system to satisfy all GO requirements. Our current strategy is to use a mixture.

A large number of GO queries can be satisfied in a highly efficient manner using text indexing engines. Currently AmiGO 1.8 uses Lucene, but we are limited by the Perl version of Lucene. We are currently exploring Apache SOLR, which is implemented in Java and provides a Web API.
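
The reason text indexing answers such queries efficiently is the inverted index: each word points directly at the records that contain it, so a lookup does not scan every record. The sketch below is a bare-bones illustration of that idea with invented record IDs and labels; it is not how Lucene or SOLR are implemented and it omits ranking, stemming and everything else a real engine provides.

```python
import re
from collections import defaultdict

# Hypothetical records: ID -> label text.
EXAMPLE_TERMS = {
    "T1": "kinase activity",
    "T2": "protein kinase activity",
    "T3": "protein binding",
}


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Build an inverted index: token -> set of record IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of records containing every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result
```

For example, with index = build_index(EXAMPLE_TERMS), the call search(index, "Protein Kinase") intersects two small sets rather than comparing the query against every label.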

We also intend to leverage existing community datastores. For example, the Neurocommons group (Science Commons, University of Buffalo) maintains a large RDF triplestore of multiple resources. We have worked with this group to get an adequate representation of GO in OWL into this triplestore, and are working with them to also get annotations in. This has a number of advantages:

  • No maintenance is required by the GO group - we are reusing community infrastructure
  • Integration for free with numerous other resources and ontologies. For example, queries spanning the GO graph, GO annotations, pubmed, interaction databases, and so on.
  • SPARQL queries may be more intuitive than SQL queries for some users.
  • Speed
  • GOOSE-like front-end
  • Use of a growing number of different 3rd party front-ends
  • Use of W3C standards

We anticipate a number of GO infrastructural functions could be shifted to a 3rd party externally maintained triplestore.

We anticipate the need for an in-house relational database for some tasks for the foreseeable future. For these requirements, we will redesign the schema from scratch, based on the requirements anticipated for the next 5 to 10 years. This schema will be optimized for instant incremental loading / OLTP storage of GO annotations.
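
As a rough sketch of what incremental loading means in practice, the example below applies a batch of added and removed annotations to an existing table in a single transaction instead of rebuilding the database. The table, its columns and the function names are hypothetical and are not the planned GO schema; SQLite is used only because it keeps the example self-contained.

```python
import sqlite3

# Hypothetical, heavily simplified annotation table.
SCHEMA = """
CREATE TABLE annotation (
    gene_id   TEXT NOT NULL,
    term_id   TEXT NOT NULL,
    evidence  TEXT NOT NULL,
    reference TEXT NOT NULL,
    PRIMARY KEY (gene_id, term_id, evidence, reference)
)
"""


def open_db():
    """Create an in-memory database holding the example table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn


def apply_changes(conn, added=(), removed=()):
    """Apply one batch of changes atomically, without reloading existing rows."""
    with conn:  # commits on success, rolls back if any statement fails
        # The primary key makes re-submitted rows a no-op rather than an error.
        conn.executemany(
            "INSERT OR IGNORE INTO annotation VALUES (?, ?, ?, ?)", added)
        conn.executemany(
            "DELETE FROM annotation WHERE gene_id = ? AND term_id = ? "
            "AND evidence = ? AND reference = ?", removed)


def all_rows(conn):
    """Return every annotation row in a stable order."""
    return conn.execute(
        "SELECT gene_id, term_id, evidence, reference FROM annotation "
        "ORDER BY gene_id, term_id, evidence, reference").fetchall()
```

Because each batch touches only the rows that changed, the cost of an update depends on the size of the change rather than on the size of the whole annotation set.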

We have funds for this redesign and reimplementation of architecture.

The supporting middleware will largely be re-implemented in Java, using Hibernate. The OWLAPI will be used for ontology queries.

Web Presence

AmiGO and QuickGO

The web presence for GO has traditionally been the website (standard Apache setup, with some custom CGIs for querying mini-databases of bibliographic references and xrefs) together with AmiGO.

Development of AmiGO has faltered slightly in the last few years for a number of reasons. The initial databases and Perl middleware designed in 1999-2000 have been a major hindrance, and it has been difficult to deploy new features on the production server. There are currently multiple AmiGO tools that are available only on the labs site.

In addition to this, QuickGO has many popular features such as graph visualization. Whilst AmiGO and QuickGO have many complementary features, there is significant overlap in functionality with little code re-use.

We have been addressing this with a loosely coupled approach - AmiGO labs is now able to show QuickGO graphs by using QuickGO as a web service. However, ideally we would have increased integration allowing for greater code re-use and less duplication of effort.

As we are currently migrating towards more use of Java within the internal infrastructure of GO (OWLAPI for ontology work, SOLR for text indexing, Hibernate for relational access), this is an opportune time to consider a major refactoring that maximizes reuse of QuickGO components, as well as other 3rd party Java ontology interfaces such as the EBI OLS and Gene Expression Atlas.

The exact nature of this refactoring is to be determined. We will form a working group consisting of the QuickGO and AmiGO developers. This group will work closely with the annotation and reference genome groups to deliver a unified front end to GO that is capable of handling all extensions to GO anticipated in the next 5 years.

Advanced Queries and Displays

Reference Genome Visualization

GO Tools

We currently list over 50 tools for GO-based analysis and processing on the GO website. We have started to collect additional metadata for each tool, but are still fairly reactive. Users must spend considerable effort selecting the right tool, installing it (if it isn't a web tool) and preparing data. It can be difficult for non-bioinformatics experts to create workflows that must be repeated regularly.

In the past we have attempted to remedy this situation by making AmiGO a central hub for analyses. AmiGO has capabilities such as ontology slimming and term enrichment (GO-TermFinder), and labs offers shopping cart capabilities, allowing users to take the output from one tool and use it as input in another.

However, this approach is difficult to scale. In addition, ID mapping remains a constant problem. We are therefore currently exploring generic workflow solutions. The most popular of these within the genomics field is Galaxy, which is frequently used in next generation sequencing analysis environments. Galaxy is actually very flexible and neutral with respect to datatype, and can be used for ontology annotation workflows. Wrapping existing analysis programs is extremely simple and can even be done by biologists with rudimentary scripting capabilities.

Working with Erick Antezana (Cell Cycle Ontology) [REF: submitted] we have created Galaxy tools for performing standard GO-based analyses. These can be chained together in workflows and shared. FIGURE.

So far the only enrichment tool we have wrapped is TermFinder, but it is in principle easy to wrap most existing tools that provide either scripting, web services or APIs.

Deployment and Cloud Computing

Currently, building of the GO database and deployment of AmiGO places a significant burden on the software group and hinders fast release cycles. This is due to a mismatch in server types and architectures between Berkeley and the SGD production site. Consequently, many AmiGO features are only available on Berkeley labs.

We anticipate that in some respects this problem will eventually be simplified as we become less dependent on Perl and MySQL. However, in other respects it will get harder, especially if we want to deploy our Galaxy environment outside Berkeley.

Fortunately, modern virtualization techniques can rescue us here. Many genomics environments are now available as virtual machine (VM) images. These can be deployed at individual sites on a VM, or executed on the cloud. The advantages of cloud deployment include elastic provision of resources and less administration. This is particularly advantageous in our current situation, where deployment on the SGD machines presents a considerable burden. By pursuing a cloud approach we can free up SGD resources to work on other aspects of production, or on database and software development.

Summary

Reuse Strategy

Our existing software infrastructure includes many legacy components and home-grown solutions. Whilst these home-grown solutions have arguably proven the most effective route in the past, they will not scale to future requirements. To be more efficient, the GO software group will take full advantage of existing 3rd party standards, libraries, tools, environments and infrastructure. This includes:

  • Increased use of OWL internally within the GO, and increased visibility of the OWL representation of GO to external bioinformatics users.
  • Use of the OWLAPI as the basis for all internal ontology processing and reasoning.
  • Use of Protege4 for advanced ontology editing together with OE where appropriate.
  • Use of standard workflow environments such as Galaxy for integrating GO tools.
  • Virtualization and cloud computing
  • Use of external infrastructure for databasing and query purposes - including RDF triplestores such as Neurocommons and bio2rdf, as well as MARTs and Mines, e.g. InterMine.
  • Integration with and reuse of EBI-developed tools, specifically QuickGO, but also with a view to sharing components across a variety of ontology-oriented bioinformatics applications, including OLS and Gene Expression Atlas.

Architecture

Existing legacy Perl middleware will be retired. All core middleware will be implemented in Java. Standard libraries used include the OWLAPI, Hibernate and the SOLR API. Lightweight REST APIs will be provided where appropriate, allowing development of web components in languages other than Java, if required.

The existing monolithic relational database will play less of a role. SOLR will be used for all text-based querying, and some additional queries. The combination of SOLR plus an in-memory copy of the ontology accessed via the OWLAPI may be sufficient for both basic and advanced queries.

Web-based

Role of Software and Utilities Group in GO

The SWUG is responsible for development of tools and infrastructure for both external users of GO and internal consortium members. By effectively reusing existing software and infrastructure, the group can become better honed towards providing effective support in both these roles.