Database Enhancement ARRA progress report for 2010
A one-year ARRA administrative supplement to the GO Consortium grant was awarded in September 2010. The aim of the supplement is to update the GO core database to support integrated annotation directly from the MOD communities that are the core developers of the GO.
Fine-grained project details and project management documents can be found on the Schema Overhaul pages.
In September we hired one FTE (through LBL) who has made significant progress with this project. We are also looking to hire additional contractors early in the new year.
We have named the new schema and architecture GOLD (GO Latest Database), and the current, soon-to-be-legacy database and infrastructure LEAD.
Current Progress Sept-Dec 2010
We opted to redesign the entire GO schema from scratch rather than patch on additional changes. The schema is split into three portions:
- ontology modeling (ontol.sql)
- gene association modeling (gaf.sql)
- phylogenetic modeling (phylo.sql)
The ontology portion is strongly coupled to OWL, which is highly expressive and should serve GO's requirements for at least the next 5 years (see Transition to OWL). This portion is now finalized. The gaf and phylo portions are being finalized.
We are also taking this opportunity to switch to PostgreSQL, which offers significant advantages over MySQL (which we have been using until now) and has a better long-term future.
We are moving away from our legacy Perl framework to a purely Java server-side solution. This brings significant advantages in developer expertise, portability and ease of installation (currently a major problem with our existing infrastructure), efficiency and speed, and the availability of robust third-party tools (e.g. Hibernate, OWLAPI).
We have created an OBO access layer over the OWL API (see Schema_Overhaul#OBO_Access_Layer). This makes the full power of the OWL API (plus associated reasoners, tooling, etc.) available together with a simplified convenience layer tuned to the requirements of GO (in particular, GO-style textual metadata).
We are in the process of developing the object model for the GAF and Phylo portions.
We are switching to the industry-standard Hibernate object-relational mapping framework. We have created a Hibernate layer for the ontology portion of the schema, and are in the process of extending this layer to other portions of the schema.
Bulkloads and Incremental Uploads
We have created a new framework that supports two modes of database population: (1) bulkloading, which populates an empty database from scratch using a specially tuned fast update procedure, and (2) incremental updates, which keep a database in sync with a data source or data submission by computing deltas.
Incremental updates use the Hibernate framework.
We will use bulkloading in the early stages, and when the database switches to production we will switch to incremental updates. Incremental updates are one of the key aspects of the ARRA proposal.
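As a rough illustration of what "computing deltas" can mean, the following minimal Java sketch compares the current database contents with an incoming data source and applies only the differences. It is not the GOLD implementation: the class and method names (DeltaSketch, computeDelta, applyDelta) and the example identifiers are hypothetical, and plain in-memory maps stand in for the Hibernate-backed database.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: class, method, and record names are hypothetical,
// not taken from the GOLD codebase. Maps stand in for database tables.
final class DeltaSketch {

    /** Ids that differ between the database and the incoming source. */
    static final class Delta {
        final Set<String> added = new TreeSet<>();
        final Set<String> removed = new TreeSet<>();
        final Set<String> changed = new TreeSet<>();
    }

    /** Compares two id-to-label maps and reports what differs. */
    static Delta computeDelta(Map<String, String> database, Map<String, String> source) {
        Delta delta = new Delta();
        for (Map.Entry<String, String> entry : source.entrySet()) {
            String id = entry.getKey();
            if (!database.containsKey(id)) {
                delta.added.add(id);      // present in source only
            } else if (!Objects.equals(database.get(id), entry.getValue())) {
                delta.changed.add(id);    // present in both, value differs
            }
        }
        for (String id : database.keySet()) {
            if (!source.containsKey(id)) {
                delta.removed.add(id);    // present in database only
            }
        }
        return delta;
    }

    /** Applies only the delta, leaving unchanged entries untouched. */
    static void applyDelta(Map<String, String> database, Map<String, String> source, Delta delta) {
        for (String id : delta.removed) database.remove(id);
        for (String id : delta.added) database.put(id, source.get(id));
        for (String id : delta.changed) database.put(id, source.get(id));
    }

    public static void main(String[] args) {
        // Made-up identifiers and labels, for demonstration only.
        Map<String, String> database = new HashMap<>();
        database.put("EX:0000001", "example term one");
        database.put("EX:0000002", "example term two");

        Map<String, String> source = new HashMap<>();
        source.put("EX:0000001", "example term one");
        source.put("EX:0000002", "example term two (renamed)");
        source.put("EX:0000003", "example term three");

        Delta delta = computeDelta(database, source);
        System.out.println("added=" + delta.added
                + " removed=" + delta.removed
                + " changed=" + delta.changed);

        applyDelta(database, source, delta);
        System.out.println("in sync: " + database.equals(source));
    }
}
```

In this sketch a bulkload corresponds to the special case where the database map starts empty, so every incoming record lands in the added set.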
We are also working on a change-tracking mechanism that will allow querying of historical GO data.
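The design of the change-tracking mechanism is not described here. One common way to support historical queries is to keep every version of a record keyed by revision and look up the latest version at or before the requested revision. The Java sketch below shows that idea only; the class HistorySketch, its methods, and the use of integer revision numbers are assumptions made for illustration, not the GOLD design.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

// Illustrative only: a hypothetical append-only version store,
// not the change-tracking mechanism under development for GOLD.
final class HistorySketch {

    // id -> (revision -> value recorded at that revision).
    // A null value marks the record as deleted from that revision on.
    private final Map<String, TreeMap<Long, String>> versions = new HashMap<>();

    /** Records the value of a record as of the given revision. */
    void record(String id, long revision, String value) {
        versions.computeIfAbsent(id, key -> new TreeMap<>()).put(revision, value);
    }

    /** Records that a record no longer exists as of the given revision. */
    void recordDeletion(String id, long revision) {
        record(id, revision, null);
    }

    /** Returns the value the record had at the given revision, if it existed then. */
    Optional<String> valueAt(String id, long revision) {
        TreeMap<Long, String> history = versions.get(id);
        if (history == null) {
            return Optional.empty();
        }
        // floorEntry finds the most recent version at or before the revision.
        Map.Entry<Long, String> entry = history.floorEntry(revision);
        return entry == null ? Optional.empty() : Optional.ofNullable(entry.getValue());
    }

    public static void main(String[] args) {
        HistorySketch history = new HistorySketch();
        // Made-up identifier and labels, for demonstration only.
        history.record("EX:0000001", 1L, "original label");
        history.record("EX:0000001", 5L, "revised label");
        history.recordDeletion("EX:0000001", 9L);

        System.out.println(history.valueAt("EX:0000001", 3L));
        System.out.println(history.valueAt("EX:0000001", 7L));
        System.out.println(history.valueAt("EX:0000001", 10L));
    }
}
```

Revision numbers could equally be timestamps or release dates; the lookup works the same way as long as the keys are ordered.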
Administrative Web Interface
In addition to command-line tools for populating the database, we have commenced creating an administrative web interface using the Jetty Servlet framework. This is not yet publicly available, but is simple to run locally. Currently this interface allows both bulkloading and updating of the ontology portion of the database. In future it will allow monitoring of timed update and QC tasks, as well as direct submission of data.
This interface will replace the patchwork Perl/cron system we currently have in place.
This interface will also be key for advanced QC and new annotation generation inference tasks (see Annotation Rule Engine).
Solr text indexing
We have also implemented a full-text indexing system using Apache Solr. So far this is implemented for the ontology portion, and it can easily be extended to the gaf and phylo portions.
Goals for 2011
We have made speedy progress with just one FTE hired so far, and expect even more rapid progress after hiring contractors next year.
We expect the new GOLD architecture will start delivering value to the GOC early next year even before it goes into production. The beta instance can be used for a number of reporting and QC functions. It can also be used within PAINT.
- complete vertical architecture stacks (schema, Hibernate, OM, Solr, admin) for gaf and phylo portions of the schema
- extend the web administrative interface, adding submission capabilities and timed tasks
- test in production environment
- start porting tools
Full details are in the Google doc: https://docs.google.com/document/d/1XuguhNAvbN5d3zssZaxYxx8gsgxObCGnUS97VBbBiD4/edit?authkey=CIqtvcQB#