Talk:GoDB loading

From GO Wiki

Overall impressions: there are different possibilities and variants, but most of what we have to go on is gut feeling, unless we do some tests. However, properly testing and evaluating could take some time. The one approach that is guaranteed to be much faster is bulk loading, but there are some unknowns regarding the simplicity of writing the bulk-load dumper...

COMMENT: It's slightly more complicated than the POP option, because you have to keep track of the indices and foreign keys, but it's much simpler than any incremental option that I can think of.
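The index/foreign-key bookkeeping could be handled entirely in the dumper: assign the surrogate keys yourself (starting past the current maximum) and write consistent rows to the flat files. A minimal sketch — the table and column layouts here (dbxref, association) are hypothetical, not the real GO schema:

```python
# Sketch of a bulk-load dumper that manages its own surrogate keys.
# Table/column layouts are hypothetical; the point is the bookkeeping:
# we assign dbxref primary keys in the script, so association rows can
# reference them before anything touches the database.

import csv

def dump_associations(assocs, dbxref_start_id, out_dir="."):
    """Write tab-delimited files suitable for LOAD DATA INFILE."""
    next_id = dbxref_start_id  # e.g. SELECT MAX(id)+1 FROM dbxref
    with open(f"{out_dir}/dbxref.txt", "w", newline="") as dx, \
         open(f"{out_dir}/association.txt", "w", newline="") as ax:
        dx_w = csv.writer(dx, delimiter="\t")
        ax_w = csv.writer(ax, delimiter="\t")
        for term_acc, gene_product_xref in assocs:
            dbxref_id = next_id          # key tracked by the script,
            next_id += 1                 # not by AUTO_INCREMENT
            dx_w.writerow([dbxref_id, *gene_product_xref])
            ax_w.writerow([term_acc, dbxref_id])  # FK into dbxref
    return next_id  # where the next batch should start
```

Returning the next free id lets successive batches chain together without ever querying the database mid-dump.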

I agree with the general principle of treating associations differently, and keeping the current infrastructure for loading ontologies.

  • Q: Can we truncate the dbxref table and "splice" a new table to the end of it?

I'm not sure I fully understand what this means; it makes me feel uneasy.

COMMENT: Truncate was the wrong word; "append" is the correct one. Gail thinks that this is an option with LOAD DATA INFILE. What I mean is simply to bulk load the "bottom half" of the table (i.e., DBXREFs that apply to associations, not terms).
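If LOAD DATA INFILE does append (as Gail suggests), the "splice" would look something like this — column names are hypothetical, and the existing term DBXREF rows are left untouched:

```sql
-- Rows from dbxref.txt are appended to the existing table;
-- nothing already in dbxref is modified or deleted.
LOAD DATA LOCAL INFILE 'dbxref.txt'
    INTO TABLE dbxref
    FIELDS TERMINATED BY '\t'
    (id, xref_dbname, xref_key);
```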

  1. I can think of a couple of approaches to mixing in the current go-dev API:
   * actually write something that translates the xml->text (a funny kind of XSLT)
   * use the assoc parser, but write text instead of xml
   * use the GO::Model objects but write a special parser. 
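The first option might be as small as a stream handler that flattens elements into rows. A sketch, assuming a simple element-per-row layout — the tag names below are made up, not real GODB-XML:

```python
import xml.etree.ElementTree as ET

def xml_to_rows(xml_text, row_tag):
    """Flatten every <row_tag> element into a tab-delimited line,
    one column per child element, in document order."""
    root = ET.fromstring(xml_text)
    for elem in root.iter(row_tag):
        yield "\t".join((child.text or "") for child in elem)

# Hypothetical GODB-XML-ish input; element names are invented.
doc = """<godb>
  <dbxref><id>1</id><dbname>UniProt</dbname><key>P12345</key></dbxref>
  <dbxref><id>2</id><dbname>SGD</dbname><key>S000001</key></dbxref>
</godb>"""
```

For real file sizes a streaming parser (SAX or iterparse) would be the way to go, but the translation itself stays this trivial as long as the XML is table-shaped.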

These are possibilities. However, since we are isolating associations as a special case, I think it is fine to bypass the full API here.

COMMENT: I would prefer not to write my own GA parser, since yours works. What I was trying to convey above is that I would like to take advantage of the API as much as possible (for all the maintenance and compatibility issues you cited).

The procedure is currently GA file -> OBO-XML -> GODB-XML -> DB, where the last arrow is DB:Stag and the main bottleneck. GODB-XML (and I may have the name wrong, please correct it!) is "isomorphic to the go database". So it seems like it would be relatively straightforward to write that into rectangular tables that could be bulk loaded. Similarly, the parser already loads the GA files into GO::Model objects (then writes XML). Instead of writing XML, we could write flat files.
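The "write flat files instead of XML" step could be as simple as splitting each GA line and routing columns to per-table row sets. A sketch — the column positions (DB, DB object ID, GO ID in column 5) follow the tab-delimited GA format, but the two output "tables" here are hypothetical:

```python
def ga_to_tables(ga_lines):
    """Split gene-association lines into per-table row lists instead
    of emitting XML. Output table layouts are hypothetical."""
    gene_products, associations = [], []
    for line in ga_lines:
        if line.startswith("!"):        # GA comment/header lines
            continue
        cols = line.rstrip("\n").split("\t")
        db, obj_id, go_id = cols[0], cols[1], cols[4]
        gene_products.append((db, obj_id))
        associations.append((go_id, db, obj_id))
    return gene_products, associations
```

Each returned list maps onto one rectangular file for the bulk load, so the XML round-trip (and the DB:Stag bottleneck) drops out of the association path entirely.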

But maybe I misunderstand what the guts of the loading process are actually doing (especially re: termdb vs. assocdb).