GoDB loading

From GO Wiki
Revision as of 11:51, 8 February 2007 by Hitz (talk | contribs) (Proposal for speeding up go-full loads)


New GO loading - Draft Specifications 2/8/07

Background

AmiGO (1.0 or 2.0) should be able to access what is currently called the "go-full" database, including all IEA annotations. The bulk of this database is ca. 10M gene associations (IEA) from the UniProt GOA project. With the current software, the go-full load takes ~7 days. AmiGO data should ideally be loaded daily, but 2-3 times per week is sufficient. One or two days could be trimmed from the loading procedure by removing some file I/O (FTP transfers etc.) and keeping just the database load, but we would still need to speed up the loading by a factor of ~3 at a minimum.

Components of database loading (need to be run 3xWeek)

  1. Loading the ontology (OBO file); this takes <2 hours on the current server
  2. Loading species/taxonomy info (required for the gene_product table; ~5 minutes)
  3. Loading association files (~100 hours)
  4. Loading sequence files (~31 hours)
  5. Dumping sequence files (~20 hours; needed for GoST)

Other loading functions (can be run asynchronously, maybe 1xweek)

  • dumprdfxml
  • dumpoboxml
  • dumpowl
  • make_release_tarballs

What needs to be sped up are steps 2-4. Currently, step 2 is done via an XSLT transform, and steps 3 and 4 via Perl DBI (the go-db-perl API) on a database loaded with GA/ontology data.

Proposals

Approaches to improving the loading time:

  1. In the short term, we could probably shave 1-2 days and load AmiGO weekly.
    • This will become problematic as more associations are added.
  2. Continue doing full loads, but use faster methods (bulk loading).
  3. Do incremental loads from files.
  4. Allow remote "at-will" database updates as ontology, association, or sequence (gp2protein) files change at the various sources.


Discussion of Bulk Loading

The fastest way to load a MySQL database is to format rectangular tab files, with foreign keys pre-constructed, and load them with the "LOAD DATA INFILE" command. Slightly less fast is loading with mysqlimport. mysqlimport (on golite) is about 60x faster than the current procedure (we currently use it to update a staging server), and it requires roughly the same pre-processing as "LOAD DATA INFILE". There may be a way, with MySQL 5 stored procedures and triggers, to use mysqlimport without defining the foreign keys explicitly in the input files.
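A minimal sketch of the tab-file approach, written in Python for illustration (the actual tooling is Perl/go-db-perl): the gene_product column names follow the GO schema, but the helper functions and sample values are made up. The key point is that all integer keys, including foreign keys, are assigned in the application before the bulk load, so MySQL does no lookups.

```python
# Sketch of the "LOAD DATA INFILE" approach: pre-assign integer keys in
# the application, write rectangular tab files, and hand MySQL one bulk
# statement per table. Helper names are illustrative, not real API.
import csv

def write_tab_file(path, rows):
    """Write rows (lists/tuples of values) as a tab-delimited file."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerows(rows)

def load_data_sql(path, table, columns):
    """Build the bulk-load statement; keys are already baked into the file."""
    return (
        f"LOAD DATA INFILE '{path}' INTO TABLE {table} "
        f"FIELDS TERMINATED BY '\\t' ({', '.join(columns)})"
    )

# Pre-constructed foreign keys: dbxref_id and species_id were assigned
# while writing the dbxref and species files, so no joins are needed here.
rows = [[1, "Adh", 101, 7227], [2, "sod1", 102, 7227]]
write_tab_file("gene_product.tab", rows)
print(load_data_sql("gene_product.tab", "gene_product",
                    ["id", "symbol", "dbxref_id", "species_id"]))
```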

Ontology, Association, and Sequence loading are more or less independent projects and could possibly be handled with different mechanisms. For example, we could keep the DB:Stag loading for the ontology and load the sequences and associations by other mechanisms. The one area that is tricky is the dbxref table, which cross-references:

  • term (via term_dbxref 1:N)
  • associations (via evidence 1:N?) - I think these are qualifiers AND pubmed refs, evidence_dbxref
  • associations (via gene products 1:1)
  • sequences (via seq_dbxref 1:N)
===== Steps for bulk loading GA files =====
 (note: when I first drafted this I was using a very old schema diagram; I have since updated it, but there may still be mistakes)
  1. drop all tables and create empty ones
  2. load the ontology (termdb)
  3. load the species table
  4. truncate the dbxref table (this should allow us to append with LOAD DATA INFILE)
  5. store 2 hashes, termIdByGoId and speciesIdByTaxonId, in RAM; note the last id (index) in dbxref
  6. lock the db
  7. create files for the modified tables:
    • text files: gene_product, evidence, associations, dbxref, association_qualifier, assoc_rel
    • linking tables: seq_dbxref, evidence_dbxref, gene_product_seq
    • (also gene_product_synonym and gene_product_property)

  8. loop over GA files
    • loop over the entries in each GA file
      • if the gene product is new:
        • store gp_info (symbol->id) in a hash (this hash will have ~3M entries)
        • get species_id from speciesIdByTaxonId
        • write to the gene_product file and the dbxref file
      • get gene_product.id and term.id from the hashes; write the association file and save association.id (role_group?)
      • write evidence/dbxrefs to file, with association.id (seq_acc?)
  9. we could load these tables now
  10. loop over the gp2protein files; load the sequences with "load_sp.pl"
    • various improvements to load_sp are possible: it could use bulk table loading, and could write the FASTA file (without GO IDs?) simultaneously
  11. unlock the db, export, dump, etc.
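The GA-file pass (steps 8-9) can be sketched roughly as follows, in Python for illustration. The GA line layout, column orders, and function name are simplified inventions; the point is that all id assignment happens in RAM against pre-loaded hashes, and the output is tab-file rows rather than SQL inserts.

```python
# Rough sketch of the GA-file pass: keep id assignment in RAM, emit one
# row per new gene product / association, and never query the database.
# The 5-field "GA line" here is a simplification of the real GAF format.

def load_ga_lines(ga_lines, term_id_by_go_id, species_id_by_taxon,
                  next_dbxref_id):
    """ga_lines: (db, acc, symbol, go_id, taxon_id) tuples."""
    gp_id_by_key = {}          # (db, acc) -> gene_product.id (~3M entries)
    gene_product_rows, dbxref_rows, association_rows = [], [], []
    next_gp_id = next_assoc_id = 1

    for db, acc, symbol, go_id, taxon_id in ga_lines:
        key = (db, acc)
        if key not in gp_id_by_key:              # gp_is_new
            dbxref_rows.append((next_dbxref_id, db, acc))
            gene_product_rows.append(
                (next_gp_id, symbol, next_dbxref_id,
                 species_id_by_taxon[taxon_id]))
            gp_id_by_key[key] = next_gp_id
            next_gp_id += 1
            next_dbxref_id += 1                  # appended past the old max id
        association_rows.append(
            (next_assoc_id, term_id_by_go_id[go_id], gp_id_by_key[key]))
        next_assoc_id += 1
    return gene_product_rows, dbxref_rows, association_rows

# Tiny example: two GA lines for the same gene product.
ga = [("UniProt", "P12345", "Adh", "GO:0004022", 7227),
      ("UniProt", "P12345", "Adh", "GO:0005737", 7227)]
gps, dbx, assocs = load_ga_lines(ga, {"GO:0004022": 10, "GO:0005737": 11},
                                 {7227: 3}, next_dbxref_id=500)
print(len(gps), len(dbx), len(assocs))   # 1 gene product, 2 associations
```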

OPEN QUESTIONS (for me)

  1. Can we truncate the dbxref table and "splice" a new table onto the end of it?
  2. Are we going to have memory issues? (I think at ~3M gp entries we are fine; I am assuming that the evidence and association rows depend ONLY on a single line in the GA files.)
  3. We are still doing a lot of "work" in Perl to process the files. Will it be significantly faster?
  4. I can think of a couple of approaches to mixing in the current go-dev API:
    • actually write something that translates the XML to text (a funny kind of XSLT)
    • use the assoc parser, but write text instead of XML
    • use the GO::Model objects, but write a special parser

I am not that familiar with the API, but I am willing to trade off efficiency for development time.
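On open question 2 above, a back-of-envelope check suggests the symbol->id hash is not a problem. The ~100 bytes/entry figure is an assumption (short key string plus integer id plus hash overhead), not a measurement:

```python
# Rough memory estimate for the ~3M-entry gene-product hash (open question 2).
# bytes_per_entry is an assumed average, not a measured value.
entries = 3_000_000
bytes_per_entry = 100   # assumed: short key + integer id + hash overhead
total_mb = entries * bytes_per_entry / 1e6
print(f"~{total_mb:.0f} MB")
```

Even if the per-entry cost were several times larger, the table would still fit comfortably in RAM on a loading server.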