GoDB loading

From GO Wiki
Jump to: navigation, search

GO bulk loading - status 9/19/07

This project was lowered in priority when it was determined that AmiGO couldn't really handle a gofull database with ca. 12M annotations.

The sequence loading step has been rewritten to be faster and to get more sequences from gp2protein file. In order to better compare bulk-loaded with API loaded databases, I had to bring the production API loading code up to date with the latest schema and code changes. This work was completed 9/15/07. Retesting of the bulk loading has begun.

Current go full has 14M associations.

New GO loading - status 3/19/07

I have code that will read GA files and spit out the appropriate data tables for load data infile. A small test of SGD+CGD+goa_uniprot took <5 hours to write files for 9M associations. Current go-full has 11M. Testing by hand of load data infile from command prompt took about 40 minutes without optimizations, analyze table, or shutting down indexes.

Note: to do LOAD DATA INFILE LOCAL you must provide --local-infile=1 to mysql command line.

TODO:

  • In the above test, 2.5M evidence rows were skipped upon loading (note: IGNORE is the default for a LOAD DATA INFILE LOCAL) because they violated UK reference (although they had unique "with" == seq_acc fields. We need to determine if we should start loading these or skipping them.
  • in a similar fashion, GENE_PRODUCT table has unique key of dbxref. I was not storing these in hash, so ~143K duplicate GPs in golite (these are not loaded)
  • In addition, my code creates many extra evidence_dbxref rows. Some of these (~7M?) are due to the extra evidence rows detailed above, but I think the bulk are due to denormalizing the dbxref_id column in evidence (points to the GO_REF evidence DBXREF). I will have to prove this, however.
  • Need to test various options dropping indices, etc. on the mysql LOAD DATA step. Stan Dong has worked out some test cases. On the the other hand even if the above 40 minutes (maybe 1 hour with a full go full) is the worst case (and I suspect it's the best), who cares?
  • Need to implement the actual table locking and loading in perl
  • Need to implement database integrity checks (this could be kind of tedious)
  • the NCBI species taxonomy file is never actually updated. Chris says there is a GO::Admin->update_species() function which should work, it's just never called by the pipeline.
  • load_xref.err-xml is also quite large when termdb is loaded. There are about 50 "No such GOid errors" and 600 "would not extract ID from GO:". Most of these are in mips2go file, but 13 *2go files have at least 1 error.
  • There may be some issue with obsolete/merged go ids not being handled properly, but this is more of a filtering step.
  • added !IEA to skip these associations, but Gene products were loaded anyway? I ran the pre-process script to filter. | OK I must not be screening duplicates in my file creation because loading by SQL this works fine, even though there are ~2x GP in file than loaded.
  • my loader loads both "Gene" and "gene" as gene_product_types in term table | actually msyql filters this.

Currently going to use new code to do a 'golite' for db integrity checks.

Status 3/3/07 updated 3/11/07- check of golite

  • the major difference appears to be here:
[mysql@goad load]$ echo "select count(*) from dbxref" | golite go
count(*)
957279
 ---
[mysql@goad load]$ echo "select count(*) from dbxref" | godev go
count(*)
331181


  • Running "example SQL queries" from [1] on bulk loaded and default loaded golite.
SELECT
 term.name AS superterm_name,
 term.acc AS superterm_acc,
 term.term_type AS superterm_type,
 association.*,
 gene_product.symbol AS gp_symbol,
 gene_product.dbxref_id AS gp_dbxref_id,
 gene_product.species_id AS gp_species_id
FROM term
 INNER JOIN graph_path ON (term.id=graph_path.term1_id)
 INNER JOIN association ON (graph_path.term2_id=association.term_id)
 INNER JOIN gene_product ON (association.gene_product_id=gene_product.id)
WHERE
term.name = 'nucleus';

Has a big descrepancy, must resolve.

godev: 23080 rows in set (0.60 sec)
golite: 35119 rows in set (0.56 sec)

Other querie: godev:

mysql>  SELECT count(*)    ->  FROM   term    ->  WHERE is_obsolete=0;
+----------+
| count(*) |
+----------+
|    22919 | 
+----------+
1 row in set (0.14 sec)

Golite

mysql> SELECT count(*) -> FROM term -> WHERE is_obsolete=0;

+----------+ | count(*) | +----------+ | 22919 | +----------+ 1 row in set (0.02 sec)

godev

mysql> SELECT term_type AS ontology, count(*) AS n_terms FROM term WHERE is_obsolete=0 GROUP BY term_type; +-----------------------+---------+ | ontology | n_terms | +-----------------------+---------+ | association_qualifier | 4 | | biological_process | 13413 | | cellular_component | 1926 | | gene_ontology | 1 | | molecular_function | 7559 | | relationship | 1 | | sequence | 4 | | subset | 5 | | synonym_type | 5 | | universal | 1 | +-----------------------+---------+ 10 rows in set (0.05 sec)

golite

mysql> SELECT term_type AS ontology,

   ->         count(*) AS n_terms
   ->  FROM   term
   ->  WHERE  is_obsolete=0
   ->  GROUP BY term_type;

+-----------------------+---------+ | ontology | n_terms | +-----------------------+---------+ | association_qualifier | 4 | | biological_process | 13413 | | cellular_component | 1926 | | gene_ontology | 1 | | molecular_function | 7559 | | relationship | 1 | | sequence | 4 | | subset | 5 | | synonym_type | 5 | | universal | 1 | +-----------------------+---------+ 10 rows in set (0.03 sec)

GoDEV:

mysql> SELECT xref_dbname, count(DISTINCT gene_product.id) FROM gene_product INNER JOIN dbxref ON (gene_product.dbxref_id=dbxref.id) GROUP BY xref_dbname; +--------------------+---------------------------------+ | xref_dbname | count(DISTINCT gene_product.id) | +--------------------+---------------------------------+ | CGD | 1232 | | Dictybase | 3757 | | FB | 10265 | | GeneDB_Lmajor | 7 | | GeneDB_Pfalciparum | 2207 | | GeneDB_Spombe | 4994 | | GeneDB_Tbrucei | 2996 | | GR | 25702 | | MGI | 13449 | | PseudoCAP | 1519 | | RGD | 11699 | | SGD | 6473 | | TAIR | 32798 | | TIGR_CMR | 40107 | | TIGR_Tba1 | 292 | | UniProt | 18270 | | WB | 5569 | | ZFIN | 7813 | +--------------------+---------------------------------+ 18 rows in set (3.74 sec)

golite:

mysql> SELECT xref_dbname, count(DISTINCT gene_product.id) FROM   gene_product INNER JOIN dbxref ON (gene_product.dbxref_id=dbxref.id) GROUP BY xref_dbname;
+--------------------+---------------------------------+
| xref_dbname        | count(DISTINCT gene_product.id) |
+--------------------+---------------------------------+
| CGD                |                            1232 | 
| Dictybase          |                            3757 | 
| FB                 |                           10265 | 
| GeneDB_Lmajor      |                               7 | 
| GeneDB_Pfalciparum |                            2207 | 
| GeneDB_Spombe      |                            4994 | 
| GeneDB_Tbrucei     |                            2996 | 
| GR                 |                           25702 | 
| MGI                |                           13449 | 
| PseudoCAP          |                            1519 | 
| RGD                |                           11699 | 
| SGD                |                            6473 | 
| TAIR               |                           32804 | 
| TIGR_CMR           |                           40107 | 
| TIGR_Tba1          |                             292 | 
| UniProt            |                           18438 | 
| WB                 |                            5569 | 
| ZFIN               |                            7813 | 
+--------------------+---------------------------------+
18 rows in set (3.98 sec)

TAIR and UniProt have different numbers of associations; probably this is a source data issue (bleah)

mysql>  SELECT count(DISTINCT species_id)
    ->  FROM   gene_product;
+----------------------------+
| count(DISTINCT species_id) |
+----------------------------+
|                        906 | 
+----------------------------+

is the same for both

For godev:

mysql> SELECT species.common_name, species.genus, species.species, count(*) AS n_gps
    -> FROM   gene_product INNER JOIN species ON (gene_product.species_id=species.id)
    -> GROUP BY species.common_name, species.genus, species.species
    -> HAVING n_gps > 10;
+--------------------------------+---------------------+---------------------------------------+-------+
| common_name                    | genus               | species                               | n_gps |
+--------------------------------+---------------------+---------------------------------------+-------+
| NULL                           | Anaplasma           | phagocytophilum HZ                    |  1292 | 
| NULL                           | Bacillus            | anthracis str. Ames                   |  5289 | 
| NULL                           | Caenorhabditis      | briggsae                              |    18 | 
| NULL                           | Caenorhabditis      | elegans                               |  7175 | 
| NULL                           | Campylobacter       | jejuni RM1221                         |  1834 | 
| NULL                           | Candida             | albicans                              |  1232 | 
| NULL                           | Chlamydomonas       | reinhardtii                           |    15 | 
| NULL                           | Coxiella            | burnetii RSA 493                      |  2038 | 
| NULL                           | Dehalococcoides     | ethenogenes 195                       |  1584 | 
| NULL                           | Dictyostelium       | discoideum                            |  3757 | 
| NULL                           | Drosophila          | erecta                                |    13 | 
| NULL                           | Drosophila          | mauritiana                            |    11 | 
| NULL                           | Drosophila          | pseudoobscura                         |    28 | 
| NULL                           | Drosophila          | simulans                              |    35 | 
| NULL                           | Drosophila          | sp.                                   |    15 | 
| NULL                           | Drosophila          | virilis                               |    22 | 
| NULL                           | Drosophila          | yakuba                                |    22 | 
| NULL                           | Ehrlichia           | chaffeensis str. Arkansas             |  1094 | 
| NULL                           | Emericella          | nidulans                              |    12 | 
| NULL                           | Escherichia         | coli                                  |   883 | 
| NULL                           | Euglena             | gracilis                              |    15 | 
| NULL                           | Geobacter           | sulfurreducens PCA                    |  3423 | 
| NULL                           | Gloeobacter         | violaceus                             |   113 | 
| NULL                           | Helicobacter        | pylori                                |    24 | 
| NULL                           | Listeria            | monocytogenes str. 4b F2365           |  2823 | 
| NULL                           | Methylococcus       | capsulatus str. Bath                  |  2928 | 
| NULL                           | Mus                 | musculus molossinus                   |    11 | 
| NULL                           | Neorickettsia       | sennetsu str. Miyayama                |   930 | 
| NULL                           | Nostoc              | sp. PCC 7120                          |   117 | 
| NULL                           | Oryza               | australiensis                         |    14 | 
| NULL                           | Oryza               | barthii                               |    19 | 
| NULL                           | Oryza               | eichingeri                            |    12 | 
| NULL                           | Oryza               | glumipatula                           |    17 | 
| NULL                           | Oryza               | grandiglumis                          |    13 | 
| NULL                           | Oryza               | granulata                             |    12 | 
| NULL                           | Oryza               | latifolia                             |    13 | 
| NULL                           | Oryza               | meridionalis                          |    15 | 
| NULL                           | Oryza               | nivara                                |   118 | 
| NULL                           | Oryza               | officinalis                           |    29 | 
| NULL                           | Oryza               | punctata                              |    15 | 
| NULL                           | Oryza               | rhizomatis                            |    14 | 
| NULL                           | Oryza               | rufipogon                             |    48 | 
| NULL                           | Oryza               | sativa                                |  3651 | 
| NULL                           | Oryza               | sativa (indica cultivar-group)        |   393 | 
| NULL                           | Oryza               | sativa (japonica cultivar-group)      | 21235 | 
| NULL                           | Prochlorococcus     | marinus                               |   106 | 
| NULL                           | Prochlorococcus     | marinus str. MIT 9313                 |   106 | 
| NULL                           | Prochlorococcus     | marinus subsp. pastoris str. CCMP1986 |   103 | 
| NULL                           | Pseudomonas         | aeruginosa PAO1                       |  1519 | 
| NULL                           | Pseudomonas         | syringae pv. tomato                   |    31 | 
| NULL                           | Pseudomonas         | syringae pv. tomato str. DC3000       |  3902 | 
| NULL                           | Shewanella          | oneidensis MR-1                       |  4850 | 
| NULL                           | Shigella            | flexneri                              |    14 | 
| NULL                           | Silicibacter        | pomeroyi DSS-3                        |  4257 | 
| NULL                           | Synechococcus       | elongatus                             |   121 | 
| NULL                           | Synechococcus       | sp. WH 8102                           |   117 | 
| NULL                           | Synechocystis       | sp. PCC 6803                          |   234 | 
| NULL                           | Tityus              | cambridgei                            |    25 | 
| NULL                           | Trypanosoma         | brucei                                |   292 | 
| NULL                           | Trypanosoma         | brucei TREU927                        |  2996 | 
| NULL                           | Vibrio              | cholerae O1 biovar eltor              |  3863 | 
| NULL                           | Viola               | odorata                               |    13 | 
| African clawed frog            | Xenopus             | laevis                                |   275 | 
| African green monkey           | Cercopithecus       | aethiops                              |    17 | 
| African rice                   | Oryza               | glaberrima                            |    12 | 
| baker's yeast                  | Saccharomyces       | cerevisiae                            |  6473 | 
| cat                            | Felis               | catus                                 |    28 | 
| chicken                        | Gallus              | gallus                                |   334 | 
| chimpanzee                     | Pan                 | troglodytes                           |    29 | 
| Chinese hamster                | Cricetulus          | griseus                               |    28 | 
| cow                            | Bos                 | taurus                                |   395 | 
| crab-eating macaque            | Macaca              | fascicularis                          |    78 | 
| dog                            | Canis               | familiaris                            |    89 | 
| domestic guinea pig            | Cavia               | porcellus                             |    28 | 
| domestic silkworm              | Bombyx              | mori                                  |    31 | 
| fission yeast                  | Schizosaccharomyces | pombe                                 |  4994 | 
| fruit fly                      | Drosophila          | melanogaster                          | 10265 | 
| golden hamster                 | Mesocricetus        | auratus                               |    12 | 
| gorilla                        | Gorilla             | gorilla                               |    16 | 
| horse                          | Equus               | caballus                              |    28 | 
| house mouse                    | Mus                 | musculus                              | 13449 | 
| human                          | Homo                | sapiens                               | 10888 | 
| Japanese eel                   | Anguilla            | japonica                              |    13 | 
| long-staminate rice            | Oryza               | longistaminata                        |    20 | 
| lowland gorilla                | Gorilla             | gorilla gorilla                       |    11 | 
| malaria parasite P. falciparum | Plasmodium          | falciparum                            |  2207 | 
| Norway rat                     | Rattus              | norvegicus                            | 11827 | 
| orangutan                      | Pongo               | pygmaeus                              |    79 | 
| pea                            | Pisum               | sativum                               |    16 | 
| pig                            | Sus                 | scrofa                                |   133 | 
| rabbit                         | Oryctolagus         | cuniculus                             |    88 | 
| rainbow trout                  | Oncorhynchus        | mykiss                                |    13 | 
| rhesus monkey                  | Macaca              | mulatta                               |    43 | 
| sheep                          | Ovis                | aries                                 |    81 | 
| southeastern Asian house mouse | Mus                 | musculus castaneus                    |    13 | 
| spinach                        | Spinacia            | oleracea                              |    13 | 
| tailed frog                    | Ascaphus            | truei                                 |    13 | 
| thale cress                    | Arabidopsis         | thaliana                              | 32804 | 
| western clawed frog            | Xenopus             | tropicalis                            |    80 | 
| western European house mouse   | Mus                 | musculus domesticus                   |    16 | 
| zebrafish                      | Danio               | rerio                                 |  7813 | 
+--------------------------------+---------------------+---------------------------------------+-------+
101 rows in set (2.01 sec)

golite:

mysql> SELECT species.common_name, species.genus, species.species, count(*) AS n_gps
    -> FROM   gene_product INNER JOIN species ON (gene_product.species_id=species.id)
    -> GROUP BY species.common_name, species.genus, species.species
    -> HAVING n_gps > 10;
+--------------------------------+---------------------+---------------------------------------+-------+
| common_name                    | genus               | species                               | n_gps |
+--------------------------------+---------------------+---------------------------------------+-------+
| NULL                           | NULL                | NULL                                  |    94 | 
| NULL                           | Anaplasma           | phagocytophilum HZ                    |  1292 | 
| NULL                           | Bacillus            | anthracis str. Ames                   |  5289 | 
| NULL                           | Caenorhabditis      | briggsae                              |    18 | 
| NULL                           | Caenorhabditis      | elegans                               |  7175 | 
| NULL                           | Campylobacter       | jejuni RM1221                         |  1834 | 
| NULL                           | Candida             | albicans                              |  1232 | 
| NULL                           | Chlamydomonas       | reinhardtii                           |    15 | 
| NULL                           | Coxiella            | burnetii RSA 493                      |  2038 | 
| NULL                           | Dehalococcoides     | ethenogenes 195                       |  1584 | 
| NULL                           | Dictyostelium       | discoideum                            |  3757 | 
| NULL                           | Drosophila          | erecta                                |    13 | 
| NULL                           | Drosophila          | mauritiana                            |    11 | 
| NULL                           | Drosophila          | pseudoobscura                         |    28 | 
| NULL                           | Drosophila          | simulans                              |    35 | 
| NULL                           | Drosophila          | sp.                                   |    15 | 
| NULL                           | Drosophila          | virilis                               |    22 | 
| NULL                           | Drosophila          | yakuba                                |    22 | 
| NULL                           | Ehrlichia           | chaffeensis str. Arkansas             |  1094 | 
| NULL                           | Emericella          | nidulans                              |    12 | 
| NULL                           | Escherichia         | coli                                  |   883 | 
| NULL                           | Euglena             | gracilis                              |    15 | 
| NULL                           | Geobacter           | sulfurreducens PCA                    |  3423 | 
| NULL                           | Gloeobacter         | violaceus                             |   113 | 
| NULL                           | Helicobacter        | pylori                                |    24 | 
| NULL                           | Listeria            | monocytogenes str. 4b F2365           |  2823 | 
| NULL                           | Methylococcus       | capsulatus str. Bath                  |  2928 | 
| NULL                           | Mus                 | musculus molossinus                   |    11 | 
| NULL                           | Neorickettsia       | sennetsu str. Miyayama                |   930 | 
| NULL                           | Nostoc              | sp. PCC 7120                          |   117 | 
| NULL                           | Oryza               | australiensis                         |    14 | 
| NULL                           | Oryza               | barthii                               |    19 | 
| NULL                           | Oryza               | eichingeri                            |    12 | 
| NULL                           | Oryza               | glumipatula                           |    17 | 
| NULL                           | Oryza               | grandiglumis                          |    13 | 
| NULL                           | Oryza               | granulata                             |    12 | 
| NULL                           | Oryza               | latifolia                             |    13 | 
| NULL                           | Oryza               | meridionalis                          |    15 | 
| NULL                           | Oryza               | nivara                                |   118 | 
| NULL                           | Oryza               | officinalis                           |    29 | 
| NULL                           | Oryza               | punctata                              |    15 | 
| NULL                           | Oryza               | rhizomatis                            |    14 | 
| NULL                           | Oryza               | rufipogon                             |    48 | 
| NULL                           | Oryza               | sativa                                |  3651 | 
| NULL                           | Oryza               | sativa (indica cultivar-group)        |   393 | 
| NULL                           | Oryza               | sativa (japonica cultivar-group)      | 21235 | 
| NULL                           | Prochlorococcus     | marinus                               |   106 | 
| NULL                           | Prochlorococcus     | marinus str. MIT 9313                 |   106 | 
| NULL                           | Prochlorococcus     | marinus subsp. pastoris str. CCMP1986 |   103 | 
| NULL                           | Pseudomonas         | aeruginosa PAO1                       |  1519 | 
| NULL                           | Pseudomonas         | syringae pv. tomato                   |    31 | 
| NULL                           | Pseudomonas         | syringae pv. tomato str. DC3000       |  3902 | 
| NULL                           | Shewanella          | oneidensis MR-1                       |  4850 | 
| NULL                           | Shigella            | flexneri                              |    14 | 
| NULL                           | Silicibacter        | pomeroyi DSS-3                        |  4257 | 
| NULL                           | Synechococcus       | elongatus                             |   121 | 
| NULL                           | Synechococcus       | sp. WH 8102                           |   117 | 
| NULL                           | Synechocystis       | sp. PCC 6803                          |   234 | 
| NULL                           | Tityus              | cambridgei                            |    25 | 
| NULL                           | Trypanosoma         | brucei                                |   292 | 
| NULL                           | Trypanosoma         | brucei TREU927                        |  2996 | 
| NULL                           | Vibrio              | cholerae O1 biovar eltor              |  3863 | 
| NULL                           | Viola               | odorata                               |    13 | 
| African clawed frog            | Xenopus             | laevis                                |   275 | 
| African green monkey           | Cercopithecus       | aethiops                              |    17 | 
| African rice                   | Oryza               | glaberrima                            |    12 | 
| baker's yeast                  | Saccharomyces       | cerevisiae                            |  6473 | 
| cat                            | Felis               | catus                                 |    28 | 
| chicken                        | Gallus              | gallus                                |   334 | 
| chimpanzee                     | Pan                 | troglodytes                           |    29 | 
| Chinese hamster                | Cricetulus          | griseus                               |    28 | 
| cow                            | Bos                 | taurus                                |   395 | 
| crab-eating macaque            | Macaca              | fascicularis                          |    78 | 
| dog                            | Canis               | familiaris                            |    89 | 
| domestic guinea pig            | Cavia               | porcellus                             |    28 | 
| domestic silkworm              | Bombyx              | mori                                  |    31 | 
| fission yeast                  | Schizosaccharomyces | pombe                                 |  4994 | 
| fruit fly                      | Drosophila          | melanogaster                          | 10265 | 
| golden hamster                 | Mesocricetus        | auratus                               |    12 | 
| gorilla                        | Gorilla             | gorilla                               |    16 | 
| horse                          | Equus               | caballus                              |    28 | 
| house mouse                    | Mus                 | musculus                              | 13449 | 
| human                          | Homo                | sapiens                               | 10888 | 
| Japanese eel                   | Anguilla            | japonica                              |    13 | 
| long-staminate rice            | Oryza               | longistaminata                        |    20 | 
| lowland gorilla                | Gorilla             | gorilla gorilla                       |    11 | 
| malaria parasite P. falciparum | Plasmodium          | falciparum                            |  2207 | 
| Norway rat                     | Rattus              | norvegicus                            | 11827 | 
| orangutan                      | Pongo               | pygmaeus                              |    79 | 
| pea                            | Pisum               | sativum                               |    16 | 
| pig                            | Sus                 | scrofa                                |   133 | 
| rabbit                         | Oryctolagus         | cuniculus                             |    88 | 
| rainbow trout                  | Oncorhynchus        | mykiss                                |    13 | 
| rhesus monkey                  | Macaca              | mulatta                               |    43 | 
| sheep                          | Ovis                | aries                                 |    81 | 
| southeastern Asian house mouse | Mus                 | musculus castaneus                    |    13 | 
| spinach                        | Spinacia            | oleracea                              |    13 | 
| tailed frog                    | Ascaphus            | truei                                 |    13 | 
| thale cress                    | Arabidopsis         | thaliana                              | 32804 | 
| western clawed frog            | Xenopus             | tropicalis                            |    80 | 
| western European house mouse   | Mus                 | musculus domesticus                   |    16 | 
| zebrafish                      | Danio               | rerio                                 |  7813 | 
+--------------------------------+---------------------+---------------------------------------+-------+
102 rows in set (0.73 sec)

Only diffece here is 94 gps with species NULL/NULL in golite.

New GO loading - Draft Specifications 2/8/07

Background

Amigo (1.0 or 2.0) should be able to access what is currently called "go-full" database, with all IEA annotations. This bulk of this database is ca. 10M gene associations (IEA) from the Uniprot goa project. With current software, the gofull load takes ~7 days. Amigo data should be loaded ideally daily, but 2-3 times per week is sufficient. A few days (1-2) could be trimmed from the loading procedure by removing some file i/o (for FTP files etc.) and keeping just the database load, but we would still need the speed up the loading by a factor of ~3 at a minimum.

As a side note, we investigated the possiblity of keeping the current loading steps as is, but doing them asynchronously, but since the vast majority of the time was spent loading the godbxml to the database itself via DB:Stag that we would not get a substantial "throughput" time increase. However, this would be competitive with loading 1xWeek.

Components of database loading (need to be run 3xWeek)

  1. Loading ontology (obo file) this takes <2 hours on current server
  2. Loading species/taxonomy info (req for gene_product table, ~5 minutes)
  3. Loading association files (~100 hours)
  4. Loading sequence files (~31 hours)
  5. Dumping sequence files (~20 hours) - needed for GoST

What needs to be sped up are steps 2-4. Currently, 2 is done via XLST transform, 3) and 4) done via perl DBI (go-db-perl API) on a database loaded with GA/Ontology.

Other loading functions (can be run asynchronously, maybe 1xweek)

  1. dumprdfxml
  2. dumpoboxml
  3. dumpowl
  4. make_release_tarballs


Side note: Currently 3 is done on termdb only. There is demand of 3 on the whole database, and I nearly have the code in place. If we all decide to go this route, we can retire 1. As the code to do 1 and 3 is very similar, it will not cost us anything in cycle time to make this substitution. Place holder page: GoInOwl

Proposals

Approaches to improving the loading time:

  1. In the short term, we could probably shave 1-2 days and load amigo weekly.
    • This will become problematic as more associations are added.
  1. Continue doing full loads, but use faster methods (bulk loading)
  2. Do incremental loads from files
  3. Allow remote "at-will" database updates as ontology, association, or sequence (gp2protein) files change at various sources.


Bulk Loading

The fastest way to load a mysql database is by formatting rectangular tab files, pre-constructed with foreign keys using the "LOAD DATA INFILE" command. Slightly less fast is to load with mysqlimport. mysql import (on golite) is about 60x faster than the current procedure (we currently use it to update a staging server). mysqlimport requires roughly the same pre-processing as "LOAD DATA INFILE" but is in sql format instead of plain text. There may be a way with mysql 5 procedures and triggers to use mysqlimport without defining the foreign keys explicitly in the sql file.

Ontology, Association, and Sequence loading are more or less independent projects and could possibly be handled with different mechanisms. For example, we could keep the DB:Stag loading for the ontology and load the sequences and associations by other mechanisms. The one area that is tricky is the dbxref table, which cross-references:

  • term (via term_dbxref 1:N)
  • associations (via evidence 1:N?) - I think these are qualifiers AND pubmed refs, evidence_dbxref
  • associations (via gene products 1:1)
  • sequences (via seq_dbxref 1:N)
Steps for bulk loading ga files:

Pseudo code for approach using GO::Handler

note: when I first drafed this I was using a very old schema diagram, I have updated, but may still be mistakes

  1. drop all tables and create empty ones
  2. load ontology (termdb)
  3. load species table
  4. append data to the dbxref table using load data infile command
  5. store a bunch of hashes for foreign keys in RAM, note last id (index) in dbxref
  6. lock db
  7. create files for modified tables:
    • create text files: gene_product, evidence, associations, dbxref
    • association_qualifier, assoc_rel
    • and linking tables:
    • seq_dbxref, evidence_dbxref, gene_product_seq

(also gene_product_synonym and gene_product_property)

  1. loop over GA files
    1. loop over entries in GA files
    2. if gp_is_new (
      store gp_info (symbol->id) in hash ( - this hash will have ~3M entries)
      get species_id from speciesIdByTaxon
      write to gene_product file, write to dbxref file and necesary linking table files
      )
    3. get gene_product.id from hash, term.id from hash, write association file (save association.id) role_group?
    4. write evidence/dbxrefs to file, with association.id (seq_acc?)
  2. we could load these tables now
  3. loop over gp2protein files, "load_sp.pl" load sequences
    • various improvements to load_sp; could use bulk table loading, could write the fasta file (without GOIDs?) simultaneously.
  4. unlock db, export, dump, etc.
OPEN QUESTIONS (for me)
  1. Can we truncate the dbxref table and "splice" a new table to the end of it? Yes, this is a normal append data loading operation
  2. Are we going to have memory issues (I think at 3M gp_entries we are fine - I am assuming that the evidence and association rows depend ONLY on a single line in the GA files.
  3. We are still doing alot of "work" in perl to process files. Will it be significantly faster?
  4. I can think of a couple approaches to mix in the current go-dev API
    • actually write something that translates the xml->text (funny kind of xslt)
    • use the assoc parser, but write text instead of xml
    • use the GO::Model objects but write a special parser.

Not that familiar with the API, but I am willing to trade of efficiency for development time.

AN ALTERNATIVE: Plain-old Perl (POP)

Just do this the good old dumb way. Bypass any object model or XML frameworks.

Algorithm:

  1. Read in line from GA file
  2. do the inserts, selects and updates in a block of code with no abstractions

It may turn out to be considerably faster

  • Advantage: simple and easy to understand
  • Disadvantage: lack of code reuse - parser written from ground-up; changes in file format require changes in all parsers (obvious retort: is this that big a deal?)
  • Load time: unknown; estimated faster than current method but slower than bulkloading from pre-prepared files


COMMENT BY BEN: This is nice and simple, but I think it will be too slow for large files. This is our experience with very large GFF files and Stan's Texpresso database work (both cases have to do millions of inserts)

AN ALTERNATIVE: MySQL procedures

I (CJM) was initially keen on this idea. I think we would need a test to see how well this would work in practice.

  • Disadvantage: portability (a few other groups are using the load code for postgres). This may not be a big deal as the plsql code should be easy to port manually.
  • Disadvantage: lack of familiarity. Many programmers have a knee-jerk reaction against this kind of approach.
  • Load time: unknown

COMMENT by BEN: I like this, but I have zero idea how to do it, so I feel that the learning curve outweighs other considerations.

ANOTHER ALTERNATIVE: Write an OBO-Edit adapter
  • Advantage: As part of a move towards more code reuse within GO as part of a single unified framework
  • Disadvantage: less familiarity with Java within the GOC software group
  • Load time: may be faster than equivalent perl code (hard to determine without writing code to test..)

The same variations are on offer here as perl - bypass an object model, or go via objects (and possibly higher memory footprint)


COMMENT by BEN: I think this is would fall under the Incremental loading section, below. Roughly, the database is updated by users (i.e. MODs, GOA) remotly using OBO edit (gene-assocationEdit?). This actually seems relativly transparent for ontology updates... but not associations. If we are talking about a multi-year SW project, I could get behind it, although obviously John would have to be an integral part (at least at the onset). I think that we would want to run OBO-edit itself off of a local mysql db. The first step would essentially be a "thick" amigo client.

ANOTHER ALTERNATIVE: use C

Hey, if speed is what we want this is worth mentioning. I'm guess there is no volunteers. I have some ontology parsing code I abandoned in C a long time ago..

This approach could be mixed in with the bulkload approach, so you don't have to do any messy C/ESQL stuff

Incremental Loading

It has been suggested that the database be loaded not all at once, but rather as the ontology and association files change. The ontology file changes hourly - and this will tend to invalidate current associations. The association files vary tremendously on how often they are comitted to CVS. TAIR currently commits nightly (but have be requested to do it less often to keep the CVS file size down!), SGD 2x week, others less often. The "big one" Uniprot GOA (in some sense "the problem") is only updated every 3 months.

Ideally this type of updating would include updates via remote application - circuventing CVS entirely. OBO-edit, for example, could directly update a master (or slave) database.

My current thought is that the fancy versions are quite complex to code, and reprsents a shift in how we run the production side of things. The simple versions are really the same as the bulk load proposal above. Perhaps after becoming more familiar with the API I could come up with an implementable plan for this.