Database Meeting 21 Nov 2006

From GO Wiki

Present

ChrisM, BenH, StanD, GailB, MikeC

Database load efficiency issues

 gene_associations file---go::parser---> ------assoc-xml----> XSLT -------------------------------> godb xml --->DB stag -----> GO
 |-----------------------------load-go-into-db.pl---------------------------------------------------------------------------------|
 |---------------------------------go2godb_prestore-------------------------------------------------|
 |--go2fmt.pl -w xml -p go_assoc gene-association.sgd.gz---|
                                                          | go-apply-xslt oboxml_to_godb_prestore |


Chris dropped in while we were discussing GO loading and its timing. To summarize the current situation:

  • essential parts of go-lite (association files + sequences) can be loaded in ~16 hours (10/6).
  • essential parts of go-full (association files + sequences) take 133 hours (108/25).

This does not count dumping the FTP files, etc.

These numbers are from the old machine; the new one is ~20% faster. Even so, if we want to do a "full" load (with IEAs) 3 times/week, we probably need a factor of 3 speed-up. A factor of 4 or 5 would be better, because we can only expect the files to get larger.

We ran some tests on a single GA file (SGD) using the scripts/programs outlined above. Of the 11 minutes or so it takes to load this file, only ~1 minute is accounted for by go2godb_prestore. The load seems to spend 90% of its time in DBStag (loading the XML into the MySQL database). That implies that "parallelizing" the load by preconverting the GA files to XML for DBStag would not give us the speed-up we need. On the plus side, loading the sequences seems to spend most of its time getting data over the network, so we should be able to improve/preload most of this away.

Further tests should be run just to make sure, but it does not look promising.
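As a rough check on why preconversion alone does not help much, the SGD timings above can be plugged into Amdahl's law. This is only a sketch: the function name is ours, and it assumes the ~1 minute of go2godb_prestore is the only part that preconversion or parallelization would speed up.

```python
def amdahl_speedup(improved_fraction: float, factor: float) -> float:
    """Overall speed-up when only `improved_fraction` of the run time
    is made `factor` times faster (Amdahl's law)."""
    if not 0.0 <= improved_fraction <= 1.0:
        raise ValueError("improved_fraction must be between 0 and 1")
    if factor <= 0:
        raise ValueError("factor must be positive")
    return 1.0 / ((1.0 - improved_fraction) + improved_fraction / factor)


# Timings from the SGD test described above (approximate).
total_minutes = 11.0
prestore_minutes = 1.0  # time spent in go2godb_prestore
prestore_fraction = prestore_minutes / total_minutes

# Speed-up if the prestore step is made 2x, 4x or 8x faster.
for factor in (2, 4, 8):
    print(f"prestore {factor}x faster -> overall {amdahl_speedup(prestore_fraction, factor):.3f}x")

# Upper bound: the prestore step costs nothing at load time,
# so only the remaining ~10 minutes are left.
best_case = total_minutes / (total_minutes - prestore_minutes)
print(f"best case: {best_case:.2f}x")  # about 1.1x, far short of the 3x we need
```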

Plan "B" is to load the assoc files directly.

Some options:

I - bulk loading (drop whole DB every time)
II - incremental loading (pseudocode)
  if ( is_updated(ga_file) ) {
      delete from tables where source = ga_file
      insert rows into tables from ga_file.new
  }

Option II seems slightly more complicated, but attractive considering that goa_uniprot (80% of the problem) is only updated every 3 months. If 4x/year it takes a week to load instead of 2 days, that would be fine, too.

We could also look into MySQL procedures vs. vanilla Perl DBI code.

Requirements for NCBO

If SGD has requirements for the new OBO site, pass them on to Chris. These include, but are not limited to:

a programmatic means of remotely searching or accessing any OBO ontology
software 'widgets' that can be dropped into SGD curation pages