GO Database Wishlist

From GO Wiki
Jump to: navigation, search

General Schema Issues

Given complaints about the current DB schema, and the fact that more people are now trying to use the db directly (through GOOSE), perhaps it would be useful to mock up an ideal schema for the data, and see how much this would differ from the current one? If it turned out that only a few fields / tables were going to need changing, and the code modifications would not be too horrendous, it may be worth altering the schema.

Species taxonomies

We currently use a nested set implementation, as opposed to a pre-computed closure approach (see Transitive_closure)

This turns out to be inefficient in MySQL

  • Notes from SWUG:Meeting_2009_05_07
    • Seth: but fast for Lucene
    • Resolved: investigate query speed further, driven by AmiGO taxon filtering request

Table/column Names

  • some table names are poorly chosen and do not do a good job of reflecting contents

Cost of problem: LOW -- ?

  • minor annoyance, increases learning curve)
  • if we use DBIx::Class (perl module) to access db info, it creates objects based on the tables it finds in the database. Any database weirdness is therefore mirrored.

Cost of fixing: HIGH

  • Notes from SWUG:Meeting_2009_05_07
    • Bad naming choices are propagated into DBIx::Class layer
    • Still not a high priorit

Query Speed

Gene Speed Search

  • Searching for genes in AmiGO is currently slow

Proposed solution: Lucene

Cost: low-medium (Seth has done most of the preparatory work)

Associations

Currently we don't have IEAs in the main database because this would slow things down.

Proposed solution: schema change

merge association/evidence into one table, reducing need for additional join.

Could be done as view/materialized view

Proposed solution: tablespaces

Use different physical partitions for different evidence codes. Invisible at logical model level.

Supported by MySQL? Could roll our own ad-hoc solution.

Proposed solution: materialized views

Methods: write views for common queries, and then materialize these for speed.

Additional investigation required: how does this work with current ORMs?

Cost: medium-low. Additional lag time in building the database

Loading

Load speed has improved tremendously using Ben's bulkloader, but we could still do with faster turnaround or even a live db

  • Notes from SWUG:Meeting_2009_05_07
    • Chris: use intermediate bulk loading tables, use SQL to update/insert
    • Gail: this is standard technique. leverages SQL
    • Investigate further. See go-dev/sql/load

Proposed Solution: master/slave db

MODs install slave dbs, write to this, replicates in central db.

Cost: high (support, MOD transitioning)

Proposed Solution: smart updates

Currently we waste a lot of cycles rebuilding the database, even though many of the data files haven't changed. GOA update once a month.

  • Incremental updates rather than rebuilds
  • Changes trigger updates

Obstacles: hard to do incremental updates using current bulkloading scheme.

Proposed Solution: java API

(compatible with either of the two above)

API

Current perl db API is crufty. We are gradually replacing with DBIx::Class, but this ties us even more to the current schema.

We also have two java APIs: GHOUL and OBO API; these are not currently integrated or bridged.


Solution 1: integrated java API

  • OBD API?

Solution 2: replacement perl API with MOOSE

  • Notes from SWUG:Meeting_2009_05_07
    • prototype go-moose API in svn
    • Ben: why not use Class::Accessor (same as DBIx::Class)
    • Bridge between the two.

MySQL

MySQL has certain limitations: broken inefficient implementation of views

Solution: PostgreSQL

Combine this with schema changes

Cost: medium-high. Currently have expertise with admining MySQL

  • Notes from SWUG:Meeting_2009_05_07
    • Gail: Production supports a mixture of MySQL and Oracle, may be adding postgresql to list of supported dbs
    • Seth: PG may be slower in some circumstances
    • Chris: existing codebase may be tuned around MySQL
    • Gail: MySQL is fine for most purposes
    • Mike: Chado default is Pg, increase interoperability
    • Consensus: not a high priority. If we had extra resources may be worth investigating further but won't happen otherwise

Phylo Protein Tree Support

We currently have no support for phylogenetic protein trees.

Cost: medium-low (no changes required)

Integration with MOD schemas and GMOD code

Proposed solution: migrate to Chado

  • Should be done at same time as move to postgresql
  • Cost: high; lots of code needs migrating
  • Cost: speed? current GO DB has lots of GO-specific denormalizations
  • Benefits: high -- integration with a lot of GMOD code

Graph Support

Most of the work has been done. See Transitive_closure.

Code still needs to migrate to productions