GoDB loading

Revision as of 14:31, 8 February 2007

New GO loading - Draft Specifications 2/8/07


AmiGO (1.0 or 2.0) should be able to access what is currently called the "go-full" database, with all IEA annotations. The bulk of this database is ca. 10M gene associations (IEA) from the Uniprot GOA project. With current software, the go-full load takes ~7 days. AmiGO data should ideally be loaded daily, but 2-3 times per week is sufficient. A few days (1-2) could be trimmed from the loading procedure by removing some file i/o (for FTP files etc.) and keeping just the database load, but we would still need to speed up the loading by a factor of ~3 at a minimum.

As a side note, we investigated the possibility of keeping the current loading steps as-is but running them asynchronously; however, since the vast majority of the time is spent loading the godbxml into the database itself via DB:Stag, we would not get a substantial "throughput" improvement. That said, this would be competitive with loading 1x/week.

Components of database loading (need to be run 3xWeek)

  1. Loading ontology (obo file) this takes <2 hours on current server
  2. Loading species/taxonomy info (req for gene_product table, ~5 minutes)
  3. Loading association files (~100 hours)
  4. Loading sequence files (~31 hours)
  5. Dumping sequence files (~20 hours) - needed for GoST

Other loading functions (can be run asynchronously, maybe 1xweek)

  • dumprdfxml
  • dumpoboxml
  • dumpowl
  • make_release_tarballs

What needs to be sped up are steps 2-4. Currently, 2 is done via an XSLT transform, and 3 and 4 are done via perl DBI (the go-db-perl API) on a database loaded with GA/Ontology.

Side note: currently 3 is done on termdb only. There is demand for doing 3 on the whole database, and I nearly have the code in place. If we all decide to go this route, we can retire 1. As the code to do 1 and 3 is very similar, it will not cost us anything in cycle time to make this substitution. Placeholder page: GoInOwl


Approaches to improving the loading time:

  1. In the short term, we could probably shave 1-2 days and load AmiGO weekly.
    • This will become problematic as more associations are added.
  2. Continue doing full loads, but use faster methods (bulk loading)
  3. Do incremental loads from files
  4. Allow remote "at-will" database updates as ontology, association, or sequence (gp2protein) files change at various sources.

Bulk Loading

The fastest way to load a mysql database is to format rectangular tab files, pre-constructed with foreign keys, and load them with the "LOAD DATA INFILE" command. Slightly less fast is loading with mysqlimport. mysqlimport (on golite) is about 60x faster than the current procedure (we currently use it to update a staging server). mysqlimport requires roughly the same pre-processing as "LOAD DATA INFILE" but takes its input in sql format instead of plain text. There may be a way, with mysql 5 procedures and triggers, to use mysqlimport without defining the foreign keys explicitly in the sql file.
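To make "pre-constructed with foreign keys" concrete, here is a minimal sketch (in Python, standing in for perl) that writes a rectangular tab file with the surrogate key column already assigned, plus the matching LOAD statement. The table and column names (species: id, ncbi_taxa_id, common_name) are illustrative, not the actual GO schema:

```python
import csv, io

def make_load_file(rows, start_id=1):
    """rows: iterable of (taxon_id, common_name) pairs. Returns the
    TSV text (primary key pre-assigned by us, not by the database)
    and the LOAD DATA INFILE statement that would ingest it."""
    buf = io.StringIO()
    w = csv.writer(buf, delimiter="\t", lineterminator="\n")
    for i, (taxon, name) in enumerate(rows, start=start_id):
        w.writerow([i, taxon, name])  # id column filled in up front
    stmt = ("LOAD DATA INFILE 'species.tsv' INTO TABLE species "
            "(id, ncbi_taxa_id, common_name)")
    return buf.getvalue(), stmt

tsv, stmt = make_load_file([(9606, "human"), (10090, "mouse")])
print(tsv, end="")
print(stmt)
```

Because the ids are assigned before the load, dependent tables can reference them in their own tab files without any round-trips to the database.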

Ontology, Association, and Sequence loading are more or less independent projects and could possibly be handled with different mechanisms. For example, we could keep the DB:Stag loading for the ontology and load the sequences and associations by other mechanisms. The one area that is tricky is the dbxref table, which cross-references:

  • term (via term_dbxref 1:N)
  • associations (via evidence 1:N?) - I think these are qualifiers AND pubmed refs, evidence_dbxref
  • associations (via gene products 1:1)
  • sequences (via seq_dbxref 1:N)
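The cross-reference structure above can be sketched as DDL. This uses sqlite3 purely for illustration, and the column names are guesses based on the list above, not the actual GO schema definition:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dbxref (id INTEGER PRIMARY KEY,
                     xref_dbname TEXT, xref_key TEXT);
-- term 1:N dbxref
CREATE TABLE term_dbxref (term_id INTEGER, dbxref_id INTEGER);
-- evidence rows carry qualifiers and pubmed refs
CREATE TABLE evidence_dbxref (evidence_id INTEGER, dbxref_id INTEGER);
-- each gene_product points at exactly one dbxref (1:1)
CREATE TABLE gene_product (id INTEGER PRIMARY KEY, symbol TEXT,
                           dbxref_id INTEGER);
-- sequence 1:N dbxref
CREATE TABLE seq_dbxref (seq_id INTEGER, dbxref_id INTEGER);
""")
tables = [r[0] for r in con.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)
```

The point of the sketch is that dbxref is a hub: four different loaders all append to it, which is why its id sequence has to be coordinated across them.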
Steps for bulk loading GA files:

note: when I first drafted this I was using a very old schema diagram; I have updated it, but there may still be mistakes

  1. drop all tables and create empty ones
  2. load ontology (termdb)
  3. load species table
  4. append data to the dbxref table using load data infile command
  5. store 2 hashes termIdByGoId and speciesIdByTaxonId in RAM, note last id (index) in dbxref
  6. lock db
  7. create files for modified tables:
    • main tables: gene_product, evidence, associations, dbxref, association_qualifier, assoc_rel
    • linking tables: seq_dbxref, evidence_dbxref, gene_product_seq

(also gene_product_synonym and gene_product_property)
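Step 5 above (caching the two id hashes and noting the last dbxref id) might look like the following sketch; sqlite3 stands in for mysql, and the table/column names are illustrative rather than the actual GO schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE term (id INTEGER PRIMARY KEY, acc TEXT);
CREATE TABLE species (id INTEGER PRIMARY KEY, ncbi_taxa_id INTEGER);
CREATE TABLE dbxref (id INTEGER PRIMARY KEY);
INSERT INTO term VALUES (1, 'GO:0008150'), (2, 'GO:0003674');
INSERT INTO species VALUES (1, 9606), (2, 10090);
INSERT INTO dbxref VALUES (41), (42);
""")

# the two RAM hashes from step 5
term_id_by_go_id = dict(con.execute("SELECT acc, id FROM term"))
species_id_by_taxon_id = dict(
    con.execute("SELECT ncbi_taxa_id, id FROM species"))
# note the last id in dbxref: new rows get numbered from here
(last_dbxref_id,) = con.execute("SELECT MAX(id) FROM dbxref").fetchone()

print(term_id_by_go_id['GO:0008150'], species_id_by_taxon_id[9606],
      last_dbxref_id)
```

With these in RAM, the GA-file loop never needs to query the database for term or species ids, and it can assign new dbxref ids by simply incrementing past last_dbxref_id.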

  8. loop over GA files
    1. loop over entries in GA files
    2. if gp_is_new:
      store gp_info (symbol->id) in a hash (this hash will have ~3M entries);
      get species_id from speciesIdByTaxonId;
      write to the gene_product file, the dbxref file, and the necessary linking table files
    3. get gene_product.id from hash, term.id from hash, write association file (save association.id) role_group?
    4. write evidence/dbxrefs to file, with association.id (seq_acc?)
  9. we could load these tables now
  10. loop over gp2protein files, "load_sp.pl", load sequences
    • various improvements to load_sp: it could use bulk table loading, and could write the fasta file (without GO IDs?) simultaneously.
  11. unlock db, export, dump, etc.

Open questions:

  1. Can we truncate the dbxref table and "splice" a new table onto the end of it?
  2. Are we going to have memory issues? (I think at ~3M gp entries we are fine - I am assuming that the evidence and association rows depend ONLY on a single line in the GA files.)
  3. We are still doing a lot of "work" in perl to process files. Will it be significantly faster?
  4. I can think of a couple of approaches to mix in the current go-dev API:
    • actually write something that translates xml->text (a funny kind of xslt)
    • use the assoc parser, but write text instead of xml
    • use the GO::Model objects but write a special parser.
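The GA-file loop above can be sketched as follows (in Python, standing in for perl). The GAF column positions (db, db_object_id, symbol, go_id, reference, evidence code, taxon) follow the standard GA file layout; the output row shapes are illustrative, not the actual GO table definitions:

```python
def ga_to_table_files(gaf_lines, term_id_by_go_id, species_id_by_taxon_id):
    """Stream GA entries into in-memory row lists (stand-ins for the
    per-table tab files), deduping gene products via a RAM hash."""
    gp_id_by_key = {}            # (db, acc) -> gene_product.id
    gp_rows, assoc_rows, ev_rows = [], [], []
    next_assoc_id = 1
    for line in gaf_lines:
        if line.startswith('!'):
            continue             # GAF comment line
        col = line.rstrip('\n').split('\t')
        db, acc, symbol, go_id = col[0], col[1], col[2], col[4]
        ref, ev_code = col[5], col[6]
        taxon = int(col[12].split(':')[1])   # e.g. "taxon:9606"
        key = (db, acc)
        if key not in gp_id_by_key:          # gp_is_new
            gp_id = len(gp_id_by_key) + 1    # assign surrogate id
            gp_id_by_key[key] = gp_id
            gp_rows.append((gp_id, symbol,
                            species_id_by_taxon_id[taxon]))
        assoc_rows.append((next_assoc_id,
                           term_id_by_go_id[go_id],
                           gp_id_by_key[key]))
        ev_rows.append((next_assoc_id, ev_code, ref))
        next_assoc_id += 1
    return gp_rows, assoc_rows, ev_rows
```

This assumes, as in question 2 above, that each evidence and association row depends only on a single GA line, so everything can be written in one streaming pass.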

I am not that familiar with the API, but I am willing to trade off efficiency for development time.

AN ALTERNATIVE: Plain-old Perl (POP)

Just do this the good old dumb way. Bypass any object model or XML frameworks.


  1. Read in line from GA file
  2. do the inserts, selects and updates in a block of code with no abstractions

It may turn out to be considerably faster

  • Advantage: simple and easy to understand
  • Disadvantage: lack of code reuse - the parser is written from the ground up; changes in the file format require changes in all parsers (obvious retort: is this that big a deal?)
  • Load time: unknown; estimated faster than current method but slower than bulkloading from pre-prepared files

I (CJM) was initially keen on this idea. I think we would need a test to see how well this would work in practice.
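A minimal sketch of the no-abstractions loop, with Python/sqlite3 standing in for perl DBI + mysql; the schema and GAF column positions are illustrative, not the production ones:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE gene_product (id INTEGER PRIMARY KEY, acc TEXT UNIQUE,
                           symbol TEXT);
CREATE TABLE association (id INTEGER PRIMARY KEY, go_id TEXT,
                          gene_product_id INTEGER);
""")

def load_line(line):
    """Read one GA line; do the selects and inserts inline, with no
    object model or parser framework in between."""
    col = line.rstrip('\n').split('\t')
    acc, symbol, go_id = col[1], col[2], col[4]
    row = con.execute("SELECT id FROM gene_product WHERE acc=?",
                      (acc,)).fetchone()
    if row is None:
        cur = con.execute("INSERT INTO gene_product (acc, symbol) "
                          "VALUES (?, ?)", (acc, symbol))
        gp_id = cur.lastrowid
    else:
        gp_id = row[0]
    con.execute("INSERT INTO association (go_id, gene_product_id) "
                "VALUES (?, ?)", (go_id, gp_id))

for line in ["UniProtKB\tP12345\tTP53\t\tGO:0008150",
             "UniProtKB\tP12345\tTP53\t\tGO:0003674"]:
    load_line(line)
print(con.execute("SELECT COUNT(*) FROM association").fetchone()[0])
```

Note this version does a select-or-insert round-trip per line rather than caching ids in RAM, which is exactly the kind of per-row database chatter the bulk-load approach avoids.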

  • Disadvantage: portability (a few other groups are using the load code for postgres). This may not be a big deal as the plsql code should be easy to port manually.
  • Disadvantage: lack of familiarity. Many programmers have a knee-jerk reaction against this kind of approach.
  • Load time: unknown

ANOTHER ALTERNATIVE: Write an OBO-Edit adapter

  • Advantage: As part of a move towards more code reuse within GO as part of a single unified framework
  • Disadvantage: less familiarity with Java within the GOC software group
  • Load time: may be faster than equivalent perl code (hard to determine without writing code to test..)

The same variations are on offer here as for perl - bypass the object model, or go via objects (with a possibly higher memory footprint)


Hey, if speed is what we want, this is worth mentioning. I'm guessing there are no volunteers. I have some ontology-parsing code in C that I abandoned a long time ago..

This approach could be mixed in with the bulk-load approach, so you don't have to do any messy C/ESQL stuff.

Incremental Loading

It has been suggested that the database be loaded not all at once, but rather as the ontology and association files change. The ontology file changes hourly - and this will tend to invalidate current associations. The association files vary tremendously in how often they are committed to CVS. TAIR currently commits nightly (but has been asked to do it less often to keep the CVS file size down!), SGD 2x per week, others less often. The "big one", Uniprot GOA (in some sense "the problem"), is only updated every 3 months.
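The simplest incremental trigger is to compare each association file's checksum against the one recorded at the last load, and reload only what changed. A sketch under that assumption (file names and the manifest format are hypothetical):

```python
import hashlib

def changed_files(files, last_seen):
    """files: {name: bytes content}; last_seen: {name: md5 hexdigest
    recorded at the previous load}. Returns the names whose content
    has changed (or is new) since then."""
    out = []
    for name, content in files.items():
        digest = hashlib.md5(content).hexdigest()
        if last_seen.get(name) != digest:
            out.append(name)
    return out

files = {"gene_association.sgd": b"...new data...",
         "gene_association.goa_uniprot": b"...unchanged..."}
last = {"gene_association.goa_uniprot":
        hashlib.md5(b"...unchanged...").hexdigest()}
print(changed_files(files, last))
```

Given the update frequencies above, this would mean small nightly reloads for TAIR/SGD while the Uniprot GOA bulk only needs reprocessing quarterly.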

Ideally this type of updating would include updates via remote applications - circumventing CVS entirely. OBO-Edit, for example, could directly update a master (or slave) database.

My current thought is that the fancy versions are quite complex to code and represent a shift in how we run the production side of things. The simple versions are really the same as the bulk load proposal above. Perhaps after becoming more familiar with the API I could come up with an implementable plan for this.