GoDB loading

New GO loading - Draft Specifications 2/8/07

Background

Amigo (1.0 or 2.0) should be able to access what is currently called "go-full" database, with all IEA annotations. This bulk of this database is ca. 10M gene associations (IEA) from the Uniprot goa project. With current software, the gofull load takes ~7 days. Amigo data should be loaded ideally daily, but 2-3 times per week is sufficient. A few days (1-2) could be trimmed from the loading procedure by removing some file i/o (for FTP files etc.) and keeping just the database load, but we would still need the speed up the loading by a factor of ~3 at a minimum.

As a side note, we investigated the possiblity of keeping the current loading steps as is, but doing them asynchronously, but since the vast majority of the time was spent loading the godbxml to the database itself via DB:Stag that we would not get a substantial "throughput" time increase. However, this would be competitive with loading 1xWeek.

Components of database loading (need to be run 3xWeek)

Loading ontology (obo file) this takes <2 hours on current server
Loading species/taxonomy info (req for gene_product table, ~5 minutes)
Loading association files (~100 hours)
Loading sequence files (~31 hours)
Dumping sequence files (~20 hours) - needed for GoST

Other loading functions (can be run asynchronously, maybe 1xweek)

dumprdfxml
dumpoboxml
dumpowl
make_release_tarballs

What needs to be sped up are steps 2-4. Currently, 2 is done via XLST transform, 3) and 4) done via perl DBI (go-db-perl API) on a database loaded with GA/Ontology.

Proposals

Approaches to improving the loading time:

In the short term, we could probably shave 1-2 days and load amigo weekly.
- This will become problematic as more associations are added.

Continue doing full loads, but use faster methods (bulk loading)
Do incremental loads from files
Allow remote "at-will" database updates as ontology, association, or sequence (gp2protein) files change at various sources.

Bulk Loading

The fastest way to load a mysql database is by formatting rectangular tab files, pre-constructed with foreign keys using the "LOAD DATA INFILE" command. Slightly less fast is to load with mysqlimport. mysql import (on golite) is about 60x faster than the current procedure (we currently use it to update a staging server). mysqlimport requires roughly the same pre-processing as "LOAD DATA INFILE" but is in sql format instead of plain text. There may be a way with mysql 5 procedures and triggers to use mysqlimport without defining the foreign keys explicitly in the sql file.

Ontology, Association, and Sequence loading are more or less independent projects and could possibly be handled with different mechanisms. For example, we could keep the DB:Stag loading for the ontology and load the sequences and associations by other mechanisms. The one area that is tricky is the dbxref table, which cross-references:

term (via term_dbxref 1:N)
associations (via evidence 1:N?) - I think these are qualifiers AND pubmed refs, evidence_dbxref
associations (via gene products 1:1)
sequences (via seq_dbxref 1:N)

Steps for bulk loading ga files:

note: when I first drafed this I was using a very old schema diagram, I have updated, but may still be mistakes

drop all tables and create empty ones
load ontology (termdb)
load species table
truncate dbxref table (this should allow us to append with load data infile)
store 2 hashes termIdByGoId and speciesIdByTaxonId in RAM, note last id (index) in dbxref
lock db
create files for modified tables:
- create text files: gene_product, evidence, associations, dbxref
- association_qualifier, assoc_rel
- and linking tables:
- seq_dbxref, evidence_dbxref, gene_product_seq

(also gene_product_synonym and gene_product_property)

loop over GA files
1. loop over entries in GA files
2. if gp_is_new (
  store gp_info (symbol->id) in hash ( - this hash will have ~3M entries)
  get species_id from speciesIdByTaxon
  write to gene_product file, write to dbxref file and necesary linking table files
  )
3. get gene_product.id from hash, term.id from hash, write association file (save association.id) role_group?
4. write evidence/dbxrefs to file, with association.id (seq_acc?)
we could load these tables now
loop over gp2protein files, "load_sp.pl" load sequences
- various improvements to load_sp; could use bulk table loading, could write the fasta file (without GOIDs?) simultaneously.
unlock db, export, dump, etc.

OPEN QUESTIONS (for me)

Can we truncate the dbxref table and "splice" a new table to the end of it?
Are we going to have memory issues (I think at 3M gp_entries we are fine - I am assuming that the evidence and association rows depend ONLY on a single line in the GA files.
We are still doing alot of "work" in perl to process files. Will it be significantly faster?
I can think of a couple approaches to mix in the current go-dev API
- actually write something that translates the xml->text (funny kind of xslt)
- use the assoc parser, but write text instead of xml
- use the GO::Model objects but write a special parser.

Not that familiar with the API, but I am willing to trade of efficiency for development time.

Incremental Loading

It has been suggested that the database be loaded not all at once, but rather as the ontology and association files change. The ontology file changes hourly - and this will tend to invalidate current associations. The association files vary tremendously on how often they are comitted to CVS. TAIR currently commits nightly (but have be requested to do it less often to keep the CVS file size down!), SGD 2x week, others less often. The "big one" Uniprot GOA (in some sense "the problem") is only updated every 3 months.

Ideally this type of updating would include updates via remote application - circuventing CVS entirely. OBO-edit, for example, could directly update a master (or slave) database.

My current thought is that the fancy versions are quite complex to code, and reprsents a shift in how we run the production side of things. The simple versions are really the same as the bulk load proposal above. Perhaps after becoming more familiar with the API I could come up with an implementable plan for this.