Difference between revisions of "GoDB loading"

From GO Wiki
Jump to: navigation, search
(New GO loading - status 3/19/07)
(New GO loading - status 3/19/07)
Line 25: Line 25:
 
* There may be some issue with obsolete/merged go ids not being handled properly, but this is more of a filtering step.
 
* There may be some issue with obsolete/merged go ids not being handled properly, but this is more of a filtering step.
  
* added !IEA to skip these associations, but Gene products were loaded anyway?  Need to fix this.
+
* added !IEA to skip these associations, but Gene products were loaded anyway?  I ran the pre-process script to filter. | OK I must not be screening duplicates in my file creation because loading by SQL this works fine, even though there are ~2x GP in file than loaded.
  
* my loader loads both "Gene" and "gene" as gene_product_types in term table
+
* my loader loads both "Gene" and "gene" as gene_product_types in term table | actually msyql filters this.
  
 
Currently going to use new code to do a 'golite' for db integrity checks.
 
Currently going to use new code to do a 'golite' for db integrity checks.

Revision as of 11:25, 3 April 2007

New GO loading - status 3/19/07

I have code that will read GA files and spit out the appropriate data tables for load data infile. A small test of SGD+CGD+goa_uniprot took <5 hours to write files for 9M associations. Current go-full has 11M. Testing by hand of load data infile from command prompt took about 40 minutes without optimizations, analyze table, or shutting down indexes.

Note: to do LOAD DATA INFILE LOCAL you must provide --local-infile=1 to mysql command line.

TODO:

  • In the above test, 2.5M evidence rows were skipped upon loading (note: IGNORE is the default for a LOAD DATA INFILE LOCAL) because they violated UK reference (although they had unique "with" == seq_acc fields. We need to determine if we should start loading these or skipping them.
  • In addition, my code creates many extra evidence_dbxref rows. Some of these (~7M?) are due to the extra evidence rows detailed above, but I think the bulk are due to denormalizing the dbxref_id column in evidence (points to the GO_REF evidence DBXREF). I will have to prove this, however.
  • Need to test various options dropping indices, etc. on the mysql LOAD DATA step. Stan Dong has worked out some test cases. On the the other hand even if the above 40 minutes (maybe 1 hour with a full go full) is the worst case (and I suspect it's the best), who cares?
  • Need to implement the actual table locking and loading in perl
  • Need to implement database integrity checks (this could be kind of tedious)
  • the NCBI species taxonomy file is never actually updated. Chris says there is a GO::Admin->update_species() function which should work, it's just never called by the pipeline.
  • load_xref.err-xml is also quite large when termdb is loaded. There are about 50 "No such GOid errors" and 600 "would not extract ID from GO:". Most of these are in mips2go file, but 13 *2go files have at least 1 error.
  • There may be some issue with obsolete/merged go ids not being handled properly, but this is more of a filtering step.
  • added !IEA to skip these associations, but Gene products were loaded anyway? I ran the pre-process script to filter. | OK I must not be screening duplicates in my file creation because loading by SQL this works fine, even though there are ~2x GP in file than loaded.
  • my loader loads both "Gene" and "gene" as gene_product_types in term table | actually msyql filters this.

Currently going to use new code to do a 'golite' for db integrity checks.

New GO loading - Draft Specifications 2/8/07

Background

Amigo (1.0 or 2.0) should be able to access what is currently called "go-full" database, with all IEA annotations. This bulk of this database is ca. 10M gene associations (IEA) from the Uniprot goa project. With current software, the gofull load takes ~7 days. Amigo data should be loaded ideally daily, but 2-3 times per week is sufficient. A few days (1-2) could be trimmed from the loading procedure by removing some file i/o (for FTP files etc.) and keeping just the database load, but we would still need the speed up the loading by a factor of ~3 at a minimum.

As a side note, we investigated the possiblity of keeping the current loading steps as is, but doing them asynchronously, but since the vast majority of the time was spent loading the godbxml to the database itself via DB:Stag that we would not get a substantial "throughput" time increase. However, this would be competitive with loading 1xWeek.

Components of database loading (need to be run 3xWeek)

  1. Loading ontology (obo file) this takes <2 hours on current server
  2. Loading species/taxonomy info (req for gene_product table, ~5 minutes)
  3. Loading association files (~100 hours)
  4. Loading sequence files (~31 hours)
  5. Dumping sequence files (~20 hours) - needed for GoST

What needs to be sped up are steps 2-4. Currently, 2 is done via XLST transform, 3) and 4) done via perl DBI (go-db-perl API) on a database loaded with GA/Ontology.

Other loading functions (can be run asynchronously, maybe 1xweek)

  1. dumprdfxml
  2. dumpoboxml
  3. dumpowl
  4. make_release_tarballs


Side note: Currently 3 is done on termdb only. There is demand of 3 on the whole database, and I nearly have the code in place. If we all decide to go this route, we can retire 1. As the code to do 1 and 3 is very similar, it will not cost us anything in cycle time to make this substitution. Place holder page: GoInOwl

Proposals

Approaches to improving the loading time:

  1. In the short term, we could probably shave 1-2 days and load amigo weekly.
    • This will become problematic as more associations are added.
  1. Continue doing full loads, but use faster methods (bulk loading)
  2. Do incremental loads from files
  3. Allow remote "at-will" database updates as ontology, association, or sequence (gp2protein) files change at various sources.


Bulk Loading

The fastest way to load a mysql database is by formatting rectangular tab files, pre-constructed with foreign keys using the "LOAD DATA INFILE" command. Slightly less fast is to load with mysqlimport. mysql import (on golite) is about 60x faster than the current procedure (we currently use it to update a staging server). mysqlimport requires roughly the same pre-processing as "LOAD DATA INFILE" but is in sql format instead of plain text. There may be a way with mysql 5 procedures and triggers to use mysqlimport without defining the foreign keys explicitly in the sql file.

Ontology, Association, and Sequence loading are more or less independent projects and could possibly be handled with different mechanisms. For example, we could keep the DB:Stag loading for the ontology and load the sequences and associations by other mechanisms. The one area that is tricky is the dbxref table, which cross-references:

  • term (via term_dbxref 1:N)
  • associations (via evidence 1:N?) - I think these are qualifiers AND pubmed refs, evidence_dbxref
  • associations (via gene products 1:1)
  • sequences (via seq_dbxref 1:N)
Steps for bulk loading ga files:

Pseudo code for approach using GO::Handler

note: when I first drafed this I was using a very old schema diagram, I have updated, but may still be mistakes

  1. drop all tables and create empty ones
  2. load ontology (termdb)
  3. load species table
  4. append data to the dbxref table using load data infile command
  5. store a bunch of hashes for foreign keys in RAM, note last id (index) in dbxref
  6. lock db
  7. create files for modified tables:
    • create text files: gene_product, evidence, associations, dbxref
    • association_qualifier, assoc_rel
    • and linking tables:
    • seq_dbxref, evidence_dbxref, gene_product_seq

(also gene_product_synonym and gene_product_property)

  1. loop over GA files
    1. loop over entries in GA files
    2. if gp_is_new (
      store gp_info (symbol->id) in hash ( - this hash will have ~3M entries)
      get species_id from speciesIdByTaxon
      write to gene_product file, write to dbxref file and necesary linking table files
      )
    3. get gene_product.id from hash, term.id from hash, write association file (save association.id) role_group?
    4. write evidence/dbxrefs to file, with association.id (seq_acc?)
  2. we could load these tables now
  3. loop over gp2protein files, "load_sp.pl" load sequences
    • various improvements to load_sp; could use bulk table loading, could write the fasta file (without GOIDs?) simultaneously.
  4. unlock db, export, dump, etc.
OPEN QUESTIONS (for me)
  1. Can we truncate the dbxref table and "splice" a new table to the end of it? Yes, this is a normal append data loading operation
  2. Are we going to have memory issues (I think at 3M gp_entries we are fine - I am assuming that the evidence and association rows depend ONLY on a single line in the GA files.
  3. We are still doing alot of "work" in perl to process files. Will it be significantly faster?
  4. I can think of a couple approaches to mix in the current go-dev API
    • actually write something that translates the xml->text (funny kind of xslt)
    • use the assoc parser, but write text instead of xml
    • use the GO::Model objects but write a special parser.

Not that familiar with the API, but I am willing to trade of efficiency for development time.

AN ALTERNATIVE: Plain-old Perl (POP)

Just do this the good old dumb way. Bypass any object model or XML frameworks.

Algorithm:

  1. Read in line from GA file
  2. do the inserts, selects and updates in a block of code with no abstractions

It may turn out to be considerably faster

  • Advantage: simple and easy to understand
  • Disadvantage: lack of code reuse - parser written from ground-up; changes in file format require changes in all parsers (obvious retort: is this that big a deal?)
  • Load time: unknown; estimated faster than current method but slower than bulkloading from pre-prepared files


COMMENT BY BEN: This is nice and simple, but I think it will be too slow for large files. This is our experience with very large GFF files and Stan's Texpresso database work (both cases have to do millions of inserts)

AN ALTERNATIVE: MySQL procedures

I (CJM) was initially keen on this idea. I think we would need a test to see how well this would work in practice.

  • Disadvantage: portability (a few other groups are using the load code for postgres). This may not be a big deal as the plsql code should be easy to port manually.
  • Disadvantage: lack of familiarity. Many programmers have a knee-jerk reaction against this kind of approach.
  • Load time: unknown

COMMENT by BEN: I like this, but I have zero idea how to do it, so I feel that the learning curve outweighs other considerations.

ANOTHER ALTERNATIVE: Write an OBO-Edit adapter
  • Advantage: As part of a move towards more code reuse within GO as part of a single unified framework
  • Disadvantage: less familiarity with Java within the GOC software group
  • Load time: may be faster than equivalent perl code (hard to determine without writing code to test..)

The same variations are on offer here as perl - bypass an object model, or go via objects (and possibly higher memory footprint)


COMMENT by BEN: I think this is would fall under the Incremental loading section, below. Roughly, the database is updated by users (i.e. MODs, GOA) remotly using OBO edit (gene-assocationEdit?). This actually seems relativly transparent for ontology updates... but not associations. If we are talking about a multi-year SW project, I could get behind it, although obviously John would have to be an integral part (at least at the onset). I think that we would want to run OBO-edit itself off of a local mysql db. The first step would essentially be a "thick" amigo client.

ANOTHER ALTERNATIVE: use C

Hey, if speed is what we want this is worth mentioning. I'm guess there is no volunteers. I have some ontology parsing code I abandoned in C a long time ago..

This approach could be mixed in with the bulkload approach, so you don't have to do any messy C/ESQL stuff

Incremental Loading

It has been suggested that the database be loaded not all at once, but rather as the ontology and association files change. The ontology file changes hourly - and this will tend to invalidate current associations. The association files vary tremendously on how often they are comitted to CVS. TAIR currently commits nightly (but have be requested to do it less often to keep the CVS file size down!), SGD 2x week, others less often. The "big one" Uniprot GOA (in some sense "the problem") is only updated every 3 months.

Ideally this type of updating would include updates via remote application - circuventing CVS entirely. OBO-edit, for example, could directly update a master (or slave) database.

My current thought is that the fancy versions are quite complex to code, and reprsents a shift in how we run the production side of things. The simple versions are really the same as the bulk load proposal above. Perhaps after becoming more familiar with the API I could come up with an implementable plan for this.