Difference between revisions of "Schema Overhaul"

From GO Wiki
Jump to: navigation, search
(Schema Overhaul)
 
(46 intermediate revisions by the same user not shown)
Line 1: Line 1:
 +
'''NOTE:''' the majority of the material in this page is moving or has moved to the GO trac - see the [http://sourceforge.net/apps/trac/geneontology/milestone/GOLD%20beta GOLD beta milestone]
 +
 +
There is also some (outdated) material in: [https://docs.google.com/document/d/1XuguhNAvbN5d3zssZaxYxx8gsgxObCGnUS97VBbBiD4/edit# google doc].
 +
 +
 
== Background ==
 
== Background ==
  
Line 16: Line 21:
  
 
* incremental loading rather than rebuilding
 
* incremental loading rather than rebuilding
 +
* time travel? TBD
 +
* full audit info - e.g. time of last load
 
* fast text search
 
* fast text search
 
* efficient queries
 
* efficient queries
Line 27: Line 34:
 
* increased integration between ontology and annotation workflows
 
* increased integration between ontology and annotation workflows
 
* annotation QC, [[Rule Engine]] support
 
* annotation QC, [[Rule Engine]] support
 +
 +
== Source ==
 +
 +
All source is contained in the 'gold' project in go svn:
 +
 +
http://sourceforge.net/projects/geneontology/develop
  
 
== Overview of plan ==
 
== Overview of plan ==
Line 53: Line 66:
 
redesigned schema. This way we gain immediate short term benefit at
 
redesigned schema. This way we gain immediate short term benefit at
 
the same time laying a sustainable path for the future.
 
the same time laying a sustainable path for the future.
 +
 +
=== Load Strategy ===
 +
 +
Bulk load strategy:
 +
 +
* Each table X has a mirror table called bulk_X with identically labeled columns
 +
* java Loader object generates bulk files and bulk-loads into bulk_X tables
 +
* java Loader object executes SQL to update X with rows from bulk_X
 +
** individual logic is specific to loader
  
 
=== Architecture ===
 
=== Architecture ===
  
[[Media:gold-arch-2010.png]]
+
[[Image:Gold-arch-2010.png]]
 +
 
 +
* [http://geneontology.svn.sourceforge.net/viewvc/geneontology/OWLTools/ OWLTools] - current location for OBO Access Layer
 +
* [http://geneontology.svn.sourceforge.net/viewvc/geneontology/java/gold gold] - GOLD
  
 
=== Dataflow ===
 
=== Dataflow ===
  
[[Media:gold-dataflow.png]]
+
[[Image:Gold-dataflow.png]]
 +
 
 +
TODO:
 +
 
 +
* some rules should be processed further upstream. TBD
 +
* need diagrams for other data types - seqs, trees, taxonomy (ncbi tax goes into ontol tables)
  
 
== Deliverables ==
 
== Deliverables ==
  
 
The majority of these will be delivered by the 2+ ARRA developers.
 
The majority of these will be delivered by the 2+ ARRA developers.
 +
 +
All timings include time for writing extensive junit test suites, smoke tests, integration tests, optimization, benchmarks, maintenance
  
 
=== OBO Access Layer ===
 
=== OBO Access Layer ===
Line 85: Line 117:
  
 
   Time: 1 month
 
   Time: 1 month
 +
 +
Current status:
 +
 +
See [[OWLTools]]
 +
 +
Code in svn [http://geneontology.svn.sourceforge.net/viewvc/geneontology/OWLTools/ OWLTools] (project may split and directory change)
  
 
=== Java Ontology Loader ===
 
=== Java Ontology Loader ===
  
Create a bridge between the OBO access layer and [[GHOUL]] (GO
+
* Input: an ontology accessed through the OBO access layer
Hibernate layer). This will allow the loading and updating of the
+
* Output: Updated instance of a Gold schema database
ontology portion of the GO database from OBO files via the OWLAPI.
+
 
 +
Loading will be incremental (but can also be used for bulk ontology creation, when the database instance is empty).
 +
 
 +
Implementation could be via a Hibernate layer (analagous to [[GHOUL]]) or direct JDBC.
 +
 
 +
The following OBO Format tags are loaded:
 +
 
 +
* id
 +
* name
 +
* def (including def xrefs)
 +
* synonym
 +
* subset
 +
* is_obsolete
 +
* consider
 +
* replaced_by
 +
* is_a
 +
* relationship
 +
* intersection_of
 +
* union
 +
* created_by
 +
* creation_date
 +
 
 +
Gold schema fields to be populated:
 +
 
 +
* to be specified
 +
 
 +
Note this section does ''not'' include the reasoning component (DR Gold)
 +
 
 +
Incremental loading logic:
 +
 
 +
Note that IDs should never be deleted in GO - instead they are obsoleted. This means that the incremental logic can be simpler - new versions can simply be laid over the top of existing data. Lack of surrogate keys and FKs also makes the logic simpler.
 +
 
 +
  Time: 1 month (including schema refinement time)
 +
 
 +
[http://geneontology.svn.sourceforge.net/viewvc/geneontology/java/gold/src/org/geneontology/gold/io/OntologyLoader.java?content-type=text%2Fplain OntologyLoader]
 +
 
 +
=== Gold to Lead loader (ontologies) ===
  
Note that GHOUL will continue to be used for the existing db schema.
+
Create a loader that updates or builds the ontology tables in Lead, populated directly from Gold.
  
  Time: 1 month
+
This may be via a bridge to the existing [[GHOUL]] layer, or a direct SQL/JDBC loader.
 +
 
 +
Tables to be populated: as for Gold loader, with the exception of
 +
 
 +
* intersection_of
 +
* union_of
 +
* created_by
 +
* creation_date
 +
 
 +
[http://geneontology.svn.sourceforge.net/viewvc/geneontology/java/gold/src/org/geneontology/gold/io/CopyOntologyGoldToLead.java?content-type=text%2Fplain CopyOntologyGoldToLead]
  
=== GAF Parsing and Object Model ===
+
=== GAF Loading ===
  
Write or reuse java GAF parser and object model.
+
See [[Annotation_File_Format_Proposal]]
  
 +
* GAF normalizer - imports GAF, exports normalized annotation and gene info file.
 +
* Loads either into object model
  
 
Object model representation must mirror normalized association file /
 
Object model representation must mirror normalized association file /
 
gp info file structure, include col16 and col17.
 
gp info file structure, include col16 and col17.
 +
 +
It should be possible to reuse existing GAF code
  
 
   Time: 1 month
 
   Time: 1 month
  
=== GOLD db Loader ===
+
Code:
 +
 
 +
* GAFNormalizer
 +
* [http://geneontology.svn.sourceforge.net/viewvc/geneontology/java/gold/src/org/geneontology/gold/io/AnnotationLoader.java?content-type=text%2Fplain AnnotationLoader]
 +
* [http://geneontology.svn.sourceforge.net/viewvc/geneontology/java/gold/src/org/geneontology/gold/io/GeneInfoLoader.java?content-type=text%2Fplain GeneInfoLoader]
  
The dataflow has two parts:
+
=== Gold to Lead Loader (Annotations) ===
  
* split GAF2.0 into normalized assoc file and gp_info file (Amelia to specify)
+
Create a loader that updates or builds the association and gene_product tables in Lead, populated directly from Gold.
* normalized files loaded into db
 
  
  Time: 1 month
+
This may be via a bridge to the existing [[GHOUL]] layer, or a direct SQL/JDBC loader.
  
=== LOEAD db Loader ===
+
Time: 1 month
  
Load LEAD (old schema) from GOLD (new).  
+
* [http://geneontology.svn.sourceforge.net/viewvc/geneontology/java/gold/src/org/geneontology/gold/io/CopyAnnotationsGoldToLead.java?content-type=text%2Fplain CopyAnnotationsGoldToLead]
  
  Time: 1 month
+
* [http://geneontology.svn.sourceforge.net/viewvc/geneontology/java/gold/src/org/geneontology/gold/io/CopyGeneInfoGoldToLead.java?content-type=text%2Fplain CopyGeneInfoGoldToLead]
  
 
=== Extend/Replace OWL API OBO Parser ===
 
=== Extend/Replace OWL API OBO Parser ===
Line 132: Line 222:
 
   Time: 1 month and ongoing
 
   Time: 1 month and ongoing
  
=== Switch to load-qfo-seqs.pl ===
+
=== Java Seq Loader ===
  
Retire load-seqs-into-db.pl
+
replaces: load-qfo-seqs.pl
  
 
   Time: 1 week
 
   Time: 1 week
  
=== Java Seq Loader ===
+
[http://geneontology.svn.sourceforge.net/viewvc/geneontology/java/gold/src/org/geneontology/gold/io/SeqLoader.java?content-type=text%2Fplain SeqLoader]
  
  Time: 1 week
+
=== Java PhyloTree Loader ===
 
 
=== Create tree loader ===
 
  
 
Takes nhx file (from Panther) as input. loads tree into db. written in
 
Takes nhx file (from Panther) as input. loads tree into db. written in
Line 150: Line 238:
 
blob and as normalized recursive structure.
 
blob and as normalized recursive structure.
  
   Time: 1 week
+
   Time: 2 weeks
 +
 
 +
[http://geneontology.svn.sourceforge.net/viewvc/geneontology/java/gold/src/org/geneontology/gold/io/PhyloTreeLoader.java?content-type=text%2Fplain PhyloTreeLoader]
  
 
=== Migrate SQL Reasoner to Java ===
 
=== Migrate SQL Reasoner to Java ===
Line 175: Line 265:
 
calculate this using existing reasoner APIs.
 
calculate this using existing reasoner APIs.
  
   Time: 1-2 months
+
   Time: 1 months
 +
 
 +
Current status: much of the existing code already implemented in OWLGraph
  
=== Admin Servlet ===
+
=== Admin Servlet (Gold Watch) ===
  
 
Create a simple servlet interface to the database that will allow GOC
 
Create a simple servlet interface to the database that will allow GOC
 
members and production administrators to check on the progress of
 
members and production administrators to check on the progress of
 
database loads and perform administrative tasks. This will be
 
database loads and perform administrative tasks. This will be
implemented using RESTlet, and will also allow programmatic access,
+
implemented using RESTlet or basic servlets, and will also allow programmatic access,
 
returning JSON objects.
 
returning JSON objects.
  
  Time: 1 month and ongoing
+
Capabilities
 
 
=== Extend GHOUL ===
 
  
Extend GHOUL as required. This includes adding new convenience
+
* provide reports of any problems (e.g. source files unavailable)
methods, benchmarking and optimizing based on the results of
+
* database contents statistics
benchmarks. In particular extend GHOUL to support new phylogenetic
+
* logs
protein family tree model in database.
+
* monitor current progress in [[#Dataflow]]
 +
* GAF upload capabilities
 +
* GAF diff reports
  
 
   Time: 1 month and ongoing
 
   Time: 1 month and ongoing
Line 204: Line 296:
 
   Time: 1 month and ongoing
 
   Time: 1 month and ongoing
  
== Research Alternative Query/Storage options ===
+
=== Research Alternative Query/Storage options ===
  
 
Investigate alternative storage/query options. These should be geared
 
Investigate alternative storage/query options. These should be geared
Line 216: Line 308:
 
Create a VM of all software used in database building. Deploy on
 
Create a VM of all software used in database building. Deploy on
 
production, plus Amazon etc.
 
production, plus Amazon etc.
 +
 +
* coordinate with Stanford
 +
* coordinate with Mirrors
 +
* database downloads
  
 
   Time: 1 month, ongoing
 
   Time: 1 month, ongoing
Line 229: Line 325:
 
   Time: 1 month
 
   Time: 1 month
  
=== Schema Redesign (S2) ===
+
[http://geneontology.svn.sourceforge.net/viewvc/geneontology/java/gold/src/org/geneontology/gold/rules/ rules]
 +
 
 +
=== Golden Goose ===
 +
 
 +
GOOSE is schema neutral and could be extended. But needs to be done craefully so people don't confuse the schemas/examples. Se below.
 +
 
 +
  Time: 1 week
 +
 
 +
=== Exports ===
 +
 
 +
Provide a number of exports, views and cuts of GO information, including:
 +
 
 +
* http://wiki.geneontology.org/index.php/Annotation_File_Format_Proposal#Proposed_file_formatsplit
 +
 
 +
=== Migrate Examples ===
 +
 
 +
Migrate [[Example Queries]]
 +
 
 +
  Time: 2 weeks
 +
 
 +
=== Autodoc ===
 +
 
 +
Replicate: http://www.geneontology.org/GO.database.shtml
 +
for gold
 +
 
 +
  Time: 1 week
 +
 
 +
=== Export Utilities ===
 +
 
 +
Custom reports and queries
 +
 
 +
done via goose and/or galaxy?
 +
 
 +
Requirements TBD
 +
 
 +
=== ID Mapping ===
 +
 
 +
  Time: TBD
 +
 
 +
=== Integrate Taxon Constraint Engine ===
 +
 
 +
* [http://geneontology.svn.sourceforge.net/viewvc/geneontology/java/gold/src/org/geneontology/gold/rules/AnnotationTaxonCheck.java?view=log AnnotationTaxonCheck]
 +
 
 +
  Time:1 month
 +
 
 +
=== Schema Redesign (Lead to Gold) ===
 +
 
 +
Lead: Chris
  
 
The existing schema is inefficient for querying and loading.
 
The existing schema is inefficient for querying and loading.
Line 246: Line 389:
 
key layer can be automatically created if need be).
 
key layer can be automatically created if need be).
  
   Time: background and then 2 months
+
   Time: ongoing
 
 
=== Integrate S2 and java layer ===
 
  
May be via Hibernate and/or bulkloads.
+
=== Gold Hibernate Layer ===
  
   Time: 2 months
+
   Time: 2 weeks
  
  
  
 
[[Category:SWUG Projects]]
 
[[Category:SWUG Projects]]

Latest revision as of 17:58, 29 September 2011

NOTE: the majority of the material in this page is moving or has moved to the GO trac - see the GOLD beta milestone

There is also some (outdated) material in: google doc.


Background

The GO database is currently crucial part of the GO infrastructure.

  • Underpins AmiGO
  • Underpins PAINT
  • Used by annotation group in QC checks
  • Used by users in GOOSE queries
  • Mirrored in a number of different places for in-house use

However, the GO db was designed in 1999 and the core remains largely unchanged. In addition, the existing perl architecture is highly inefficient and depends on the outdated go-perl library.

New requirements:

  • incremental loading rather than rebuilding
  • time travel? TBD
  • full audit info - e.g. time of last load
  • fast text search
  • efficient queries
  • use of relation chains (implemented in QuickGO)
  • relation filtering
  • use of other ontologies for col16 queries and term enrichment
  • better visualization
  • Increasing_Expressivity_in_GO_Annotations open ended (cf LEGO, col16)
  • open ended increased ontology expressivity (OWL, "cross-products")
  • increased integration between ontology and annotation workflows
  • annotation QC, Rule Engine support

Source

All source is contained in the 'gold' project in go svn:

http://sourceforge.net/projects/geneontology/develop

Overview of plan

We will overhaul the GO database schema and in parallel explore other storage/index options. For example, we will use SOLR for all text-oriented search. In fact, SOLR can fulfil a number of query capabilities. See Full_Text_Indexing.

The efficiency considerations may change depending on AmiGO_and_QuickGO_Integration. QuickGO is not backed by a relational database, it uses a custom index engine. This, together with the increaed use of SOLR will make the focus of the RDB more towards advanced queries and internal QC purposes.

Schema Overhaul

We will explore a more efficient relational structure. This may involve abandoning integer surrogate keys, which hamper incremental loads.

This new schema is currently called GOLD (Gene Ontology Latest Database). The current/old mysql one is called LEAD.

We will focus immediately on developing tools for GOLD. We will have a means of populating a Lead instance from a Gold instance. We can start switching to this schema early on, and support both schemas in parallel.

The overall strategy will be to focus on new java middleware. This middleware can update and access both the existing schema and a redesigned schema. This way we gain immediate short term benefit at the same time laying a sustainable path for the future.

Load Strategy

Bulk load strategy:

  • Each table X has a mirror table called bulk_X with identically labeled columns
  • java Loader object generates bulk files and bulk-loads into bulk_X tables
  • java Loader object executes SQL to update X with rows from bulk_X
    • individual logic is specific to loader

Architecture

Gold-arch-2010.png

  • OWLTools - current location for OBO Access Layer
  • gold - GOLD

Dataflow

Gold-dataflow.png

TODO:

  • some rules should be processed further upstream. TBD
  • need diagrams for other data types - seqs, trees, taxonomy (ncbi tax goes into ontol tables)

Deliverables

The majority of these will be delivered by the 2+ ARRA developers.

All timings include time for writing extensive junit test suites, smoke tests, integration tests, optimization, benchmarks, maintenance

OBO Access Layer

A java ontology API layered on the OWLAPI. This will implement the Repository enterprise pattern and will form a simple thin layer for simplifying access to the OWLAPI according to the OBO Format 1.4 specification. For example, methods such as:

   String getTextDefinitionForClass(OWLClass cls);
   Map<String,OWLClass> getSynonymMapForEntity(OWLEntity e);

The developer will work with CM on the specification, and the developer will provide the implementation. This layer will differ from the existing OboEdit OBO API in that it does not have its own object model - rather it uses the OWLAPI object model and provides GO/OBO-friendly methods for accessing and manipulating OWL. Note that in future obof will be considered an OWL syntax and will be accessed via the OWL API.

 Time: 1 month

Current status:

See OWLTools

Code in svn OWLTools (project may split and directory change)

Java Ontology Loader

  • Input: an ontology accessed through the OBO access layer
  • Output: Updated instance of a Gold schema database

Loading will be incremental (but can also be used for bulk ontology creation, when the database instance is empty).

Implementation could be via a Hibernate layer (analagous to GHOUL) or direct JDBC.

The following OBO Format tags are loaded:

  • id
  • name
  • def (including def xrefs)
  • synonym
  • subset
  • is_obsolete
  • consider
  • replaced_by
  • is_a
  • relationship
  • intersection_of
  • union
  • created_by
  • creation_date

Gold schema fields to be populated:

  • to be specified

Note this section does not include the reasoning component (DR Gold)

Incremental loading logic:

Note that IDs should never be deleted in GO - instead they are obsoleted. This means that the incremental logic can be simpler - new versions can simply be laid over the top of existing data. Lack of surrogate keys and FKs also makes the logic simpler.

 Time: 1 month (including schema refinement time)

OntologyLoader

Gold to Lead loader (ontologies)

Create a loader that updates or builds the ontology tables in Lead, populated directly from Gold.

This may be via a bridge to the existing GHOUL layer, or a direct SQL/JDBC loader.

Tables to be populated: as for Gold loader, with the exception of

  • intersection_of
  • union_of
  • created_by
  • creation_date

CopyOntologyGoldToLead

GAF Loading

See Annotation_File_Format_Proposal

  • GAF normalizer - imports GAF, exports normalized annotation and gene info file.
  • Loads either into object model

Object model representation must mirror normalized association file / gp info file structure, include col16 and col17.

It should be possible to reuse existing GAF code

 Time: 1 month

Code:

Gold to Lead Loader (Annotations)

Create a loader that updates or builds the association and gene_product tables in Lead, populated directly from Gold.

This may be via a bridge to the existing GHOUL layer, or a direct SQL/JDBC loader.

Time: 1 month

Extend/Replace OWL API OBO Parser

The current OWLAPI OBOParser has some flaws.

Extend or replace the OWLAPI OBOParser as required. In particular, extend existing support for synonyms, text definitions etc, implemented according to obo format 1.4 spec, using IAO and purl.obofoundry.org URIs.

 Time: 1 month and ongoing

Java Seq Loader

replaces: load-qfo-seqs.pl

 Time: 1 week

SeqLoader

Java PhyloTree Loader

Takes nhx file (from Panther) as input. loads tree into db. written in java. coordinate object model with PAINT.

Db structures already exist. Tree should be stored both as text nhx blob and as normalized recursive structure.

 Time: 2 weeks

PhyloTreeLoader

Migrate SQL Reasoner to Java

See Transitive Closure for more background.

Migrate the existing GO perl/SQL rule-based reasoner to java, or replace with existing OWL reasoner that is guaranteed to complete within a given period of time. The existing reasoner is a simple forward-chaining implementation with simple hardcoded rules (subclass, transitivity, property chains). Currently it is extremely slow which will hamper incremental loading.

Classification using equivalence axioms is done ahead of time prior to database loading, so this is not required for the database infrastructure. The reasoner need only calculate the transitive closure using the correct semantics for relation transitivity and property chains.

In principle an OWL-RL reasoner should be able run as fast or faster than the existing implementation -- although there are subtleties; for example, the GO database must store all entailed <X SubClassOf some R Y> triples to be able to support existing queries, it's not trivial to calculate this using existing reasoner APIs.

 Time: 1 months

Current status: much of the existing code already implemented in OWLGraph

Admin Servlet (Gold Watch)

Create a simple servlet interface to the database that will allow GOC members and production administrators to check on the progress of database loads and perform administrative tasks. This will be implemented using RESTlet or basic servlets, and will also allow programmatic access, returning JSON objects.

Capabilities

  • provide reports of any problems (e.g. source files unavailable)
  • database contents statistics
  • logs
  • monitor current progress in #Dataflow
  • GAF upload capabilities
  • GAF diff reports
 Time: 1 month and ongoing

PAINT integration, phase 1

This will mostly come for free as PAINT uses GHOUL. However, PAINT should gradually move away from using GHOUL directly to using a standard java layer.

 Time: 1 month and ongoing

Research Alternative Query/Storage options

Investigate alternative storage/query options. These should be geared towards future requirements and increased expressivity. E.g. OWL Databases, RDF triplestores.

 Time: (optional at end) 1 month.

Deployment

Create a VM of all software used in database building. Deploy on production, plus Amazon etc.

  • coordinate with Stanford
  • coordinate with Mirrors
  • database downloads
 Time: 1 month, ongoing

Rule Engine

Implement rule system for providing live quality control of ontology and annotations. Spec provided by CM. For ontology QC much of this can be done using standard reasoners. Some annotation QC may require some kind of rule evaluation in a closed-world environment, possibly translated to SQL.

 Time: 1 month

rules

Golden Goose

GOOSE is schema neutral and could be extended. But needs to be done craefully so people don't confuse the schemas/examples. Se below.

 Time: 1 week

Exports

Provide a number of exports, views and cuts of GO information, including:

Migrate Examples

Migrate Example Queries

 Time: 2 weeks

Autodoc

Replicate: http://www.geneontology.org/GO.database.shtml for gold

 Time: 1 week

Export Utilities

Custom reports and queries

done via goose and/or galaxy?

Requirements TBD

ID Mapping

 Time: TBD

Integrate Taxon Constraint Engine

 Time:1 month

Schema Redesign (Lead to Gold)

Lead: Chris

The existing schema is inefficient for querying and loading.

The schema redesign project will be ongoing in the background for the first 6 months in an experimental/exploratory phase, followed by concretization as the graint aims become crystalized. Key questions to be addressed include the status of advanced nested post-composition (cf LEGO), evidence chains, integration with other data types.

Requirements that are certain include incremental loading and time travel. A common request is to compare annotations from two timepoints.

To satisfy requirements it is likely the schema will not employ integer surrogate keys, instead using public identifiers (a surrogate key layer can be automatically created if need be).

 Time: ongoing

Gold Hibernate Layer

 Time: 2 weeks