GO Database Wishlist
General Schema Issues
Given complaints about the current DB schema, and the fact that more people are now trying to use the db directly (through GOOSE), perhaps it would be useful to mock up an ideal schema for the data, and see how much this would differ from the current one? If it turned out that only a few fields / tables were going to need changing, and the code modifications would not be too horrendous, it may be worth altering the schema.
Species taxonomies
We currently use a nested set implementation, as opposed to a pre-computed closure approach (see Transitive_closure)
This turns out to be inefficient in MySQL
- Notes from SWUG:Meeting_2009_05_07
- Seth: but fast for Lucene
- Resolved: investigate query speed further, driven by AmiGO taxon filtering request
Table/column Names
- some table names are poorly chosen and do not do a good job of reflecting contents
Cost of problem: LOW -- ?
- minor annoyance, increases learning curve)
- if we use DBIx::Class (perl module) to access db info, it creates objects based on the tables it finds in the database. Any database weirdness is therefore mirrored.
Cost of fixing: HIGH
- Notes from SWUG:Meeting_2009_05_07
- Bad naming choices are propagated into DBIx::Class layer
- Still not a high priorit
Query Speed
Gene Speed Search
- Searching for genes in AmiGO is currently slow
- Notes from SWUG:Meeting_2009_05_07
- Is it that slow?
- Lucene fix
Proposed solution: Lucene
Cost: low-medium (Seth has done most of the preparatory work)
Associations
Currently we don't have IEAs in the main database because this would slow things down.
- Notes from SWUG:Meeting_2009_05_07
- Adding IEAs to lite load and only filtering from goa_uniprot.
Proposed solution: schema change
merge association/evidence into one table, reducing need for additional join.
Could be done as view/materialized view
- Notes from SWUG:Meeting_2009_05_07
- Not required
Proposed solution: tablespaces
Use different physical partitions for different evidence codes. Invisible at logical model level.
Supported by MySQL? Could roll our own ad-hoc solution.
Proposed solution: materialized views
Methods: write views for common queries, and then materialize these for speed.
Additional investigation required: how does this work with current ORMs?
Cost: medium-low. Additional lag time in building the database
Loading
Load speed has improved tremendously using Ben's bulkloader, but we could still do with faster turnaround or even a live db
- Notes from SWUG:Meeting_2009_05_07
- Chris: use intermediate bulk loading tables, use SQL to update/insert
- Gail: this is standard technique. leverages SQL
- Investigate further. See go-dev/sql/load
Proposed Solution: master/slave db
MODs install slave dbs, write to this, replicates in central db.
Cost: high (support, MOD transitioning)
Proposed Solution: smart updates
Currently we waste a lot of cycles rebuilding the database, even though many of the data files haven't changed. GOA update once a month.
- Incremental updates rather than rebuilds
- Changes trigger updates
Obstacles: hard to do incremental updates using current bulkloading scheme.
Proposed Solution: java API
(compatible with either of the two above)
API
Current perl db API is crufty. We are gradually replacing with DBIx::Class, but this ties us even more to the current schema.
We also have two java APIs: GHOUL and OBO API; these are not currently integrated or bridged.
Solution 1: integrated java API
- OBD API?
Solution 2: replacement perl API with MOOSE
- Notes from SWUG:Meeting_2009_05_07
- prototype go-moose API in svn
- Ben: why not use Class::Accessor (same as DBIx::Class)
- Bridge between the two.
MySQL
MySQL has certain limitations: broken inefficient implementation of views
Solution: PostgreSQL
Combine this with schema changes
Cost: medium-high. Currently have expertise with admining MySQL
- Notes from SWUG:Meeting_2009_05_07
- Gail: Production supports a mixture of MySQL and Oracle, may be adding postgresql to list of supported dbs
- Seth: PG may be slower in some circumstances
- Chris: existing codebase may be tuned around MySQL
- Gail: MySQL is fine for most purposes
- Mike: Chado default is Pg, increase interoperability
- Consensus: not a high priority. If we had extra resources may be worth investigating further but won't happen otherwise
Phylo Protein Tree Support
We currently have no support for phylogenetic protein trees.
Cost: medium-low (no changes required)
Integration with MOD schemas and GMOD code
Proposed solution: migrate to Chado
- Should be done at same time as move to postgresql
- Cost: high; lots of code needs migrating
- Cost: speed? current GO DB has lots of GO-specific denormalizations
- Benefits: high -- integration with a lot of GMOD code
Graph Support
Most of the work has been done. See Transitive_closure.
Code still needs to migrate to productions
- Notes from SWUG:Meeting_2009_05_07
- Committed to cvs