Specific go load

This page gives pseudocode and a detailed road map for bulk loading associations.

The primary approach is to leverage the existing go-db-perl/go-perl code and keep as much of the loading process intact as possible, replacing only the slowest DBStag portions. We also have a lot invested in the loading infrastructure (wrapping scripts, etc.), so this is a minimal change.

Steps (in rough order of how they are executed)

  • Modify the load_assoc method in go-prepare-release.pl to call a new script, go-load-assoc-bulk.pl (based on load-go-into-db)
  • go-load-assoc-bulk.pl will use the go_assoc_parser (GO::Parser), but with a new handler (obo_godb_flat.pm)
    • Before the file is parsed, hashes similar to acc2name_h need to be set up, and the last index in dbxref needs to be stored.
    • We also need to append the SO terms to the term table:
+-------+-------------------+-----------+-------------------+-------------+---------+
| id    | name              | term_type | acc               | is_obsolete | is_root |
+-------+-------------------+-----------+-------------------+-------------+---------+
| 23713 | gene              | sequence  | gene              |           0 |       0 | 
| 23717 | protein           | sequence  | protein           |           0 |       0 | 
| 23719 | protein_structure | sequence  | protein_structure |           0 |       0 | 
| 23720 | transcript        | sequence  | transcript        |           0 |       0 | 
| 23721 | complex           | sequence  | complex           |           0 |       0 | 
+-------+-------------------+-----------+-------------------+-------------+---------+

Similarly for the association_qualifier terms:

mysql> select distinct(term.name), term_id from association_qualifier, term where term.id = association_qualifier.term_id;
+--------------------+---------+
| name               | term_id |
+--------------------+---------+
| not                |   23714 | 
| contributes_to     |   23715 | 
| colocalizes_with   |   23716 | 
| not|contributes_to |   23718 | 
+--------------------+---------+

In each of these term rows, acc is the same as name, and the last two fields (is_obsolete and is_root) are 0.
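
The pre-parse setup described above could look roughly like the following minimal sketch. It is not the actual loader code: `%acc2id`, `$last_id`, and `append_terms` are hypothetical names, and the starting contents are placeholders that a real run would read from the existing term table. The ids it assigns are simply sequential, so they will not match the ids shown in the tables above.

```perl
use strict;
use warnings;

# Hypothetical pre-parse state. In a real run these would be filled from
# the existing term table before the association file is parsed.
my %acc2id  = ('EXISTING:1' => 1);   # acc -> term.id (placeholder entry)
my $last_id = 1;                     # highest term.id already in use

# Append terms whose acc equals their name, with is_obsolete = 0 and
# is_root = 0. Returns tab-delimited rows in term-table column order:
# id, name, term_type, acc, is_obsolete, is_root.
sub append_terms {
    my ($term_type, @names) = @_;
    my @rows;
    for my $name (@names) {
        next if exists $acc2id{$name};   # skip terms already present
        my $id = ++$last_id;
        $acc2id{$name} = $id;
        push @rows, join("\t", $id, $name, $term_type, $name, 0, 0);
    }
    return @rows;
}

my @rows = append_terms('sequence',
    qw(gene protein protein_structure transcript complex));
print "$_\n" for @rows;
```

The qualifier terms could go through the same routine, called with whatever term_type the schema uses for them.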

  • obo_godb_flat.pm is similar to obo_text.pm, but will write the table files
  • go_assoc_parser fires the following stag events; each has to be associated with an e_EventName method in obo_godb_flat.pm
    • $self->start_event(DBSET);
    • $self->event(PRODDB, $proddb); - this sets the DBXREF.XREF_DBname for the accessions of all following gene_products (i.e. DBXREF_ID -> SGD or UniProt)... or is this the DB table?
    • $self->start_event(PROD) - When this event fires, we need to write a line to the gene_product table (file). The following columns are accessed via $self->stag_get(product, tag)
      • $self->event(PRODACC, $prodacc); - XREF to DBXREF table
      • $self->event(PRODSYMBOL, $prodsymbol);
      • $self->event(PRODNAME, $prodname)
      • $self->event(PRODTYPE, $prodtype)
      • $self->event(SECONDARY_PRODTAXA, $other[0]); - XREF to species table
      • $self->event(PRODTAXA, $prodtaxa); - XREF to species table
      • $self->event(PRODSYN, $_); - writes to the GENE_PRODUCT_SYNONYM table
    • $self->start_event(ASSOC); - When this event fires, we need to write a line to the association table (file)
      • $self->event(ASSOCDATE, $assocdate);
      • $self->event(SOURCE_DB, $source_db) see DB table XREF?
      • $self->event(TERMACC, $termacc); from acc2id hash
      • $self->event(IS_NOT, $is_not || '0');
    • $self->event(QUALIFIER, $_) - writes to ASSOCIATION_QUALIFIER table file
    • $self->event(ASPECT, $aspect); -- can be tossed I think
    • $self->start_event(EVIDENCE); - writes to EVIDENCE table file
      • $self->event(EVCODE, $evcode);
      • $self->event(WITH, $_) - writes to ASSOC_REL
      • $self->event(REF, $_) - writes to DBXREF, EVIDENCE_DBXREF

In practice, because of the nested Stag structure, only one handler method is needed: e_prod, which fires at the end of each gene_product (which is often a series of lines in the GAF file).
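
To illustrate what a single e_prod method would have to do, here is a minimal sketch that flattens one completed gene product into rows for several tables. To stay self-contained it uses a plain nested hash as a stand-in for the Stag node. The sample values, the `%next_id` counters, the `flatten_prod` name, and the reduced column sets are all assumptions for illustration and do not follow the real schema.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the data available when e_prod fires: one
# gene product with its nested associations and evidence (placeholders).
my $prod = {
    prodacc    => 'ACC1',
    prodsymbol => 'SYMBOL1',
    assoc      => [
        { termacc  => 'GO:EXAMPLE1', is_not => 0,
          evidence => [ { evcode => 'IDA' }, { evcode => 'IMP' } ] },
        { termacc  => 'GO:EXAMPLE2', is_not => 1,
          evidence => [ { evcode => 'IEA' } ] },
    ],
};

my %acc2id  = ('GO:EXAMPLE1' => 11, 'GO:EXAMPLE2' => 12);  # placeholder ids
my %next_id = (gene_product => 1, association => 1, evidence => 1);

# Flatten one product into tab-delimited rows, grouped by table name.
# Column sets are simplified for illustration only.
sub flatten_prod {
    my ($p) = @_;
    my %rows;
    my $gp_id = $next_id{gene_product}++;
    push @{ $rows{gene_product} }, join("\t", $gp_id, $p->{prodsymbol});
    for my $assoc (@{ $p->{assoc} }) {
        my $term_id = $acc2id{ $assoc->{termacc} };
        die "Unknown term acc $assoc->{termacc}" unless defined $term_id;
        my $assoc_id = $next_id{association}++;
        push @{ $rows{association} },
            join("\t", $assoc_id, $term_id, $gp_id, $assoc->{is_not});
        for my $ev (@{ $assoc->{evidence} }) {
            my $ev_id = $next_id{evidence}++;
            push @{ $rows{evidence} },
                join("\t", $ev_id, $ev->{evcode}, $assoc_id);
        }
    }
    return \%rows;
}

my $rows = flatten_prod($prod);
for my $table (sort keys %$rows) {
    print "$table\t$_\n" for @{ $rows->{$table} };
}
```

Handling the whole product in one place keeps the id bookkeeping simple, because every child row needs the id of a parent row that was written just before it.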

We must also modify the fh/file methods (from Data::Stag::Writer) for obo_godb_flat.pm. The handler needs to open a file for each table (9 at current count) and switch between them based on parse status.
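
One way to picture the file-per-table idea is the sketch below, which keeps a hash of open filehandles keyed by table name and picks the handle for each row. It uses only core Perl and does not touch the Data::Stag::Writer methods themselves. The table list is a partial, illustrative subset, and `write_row`, the temporary directory, and the `.txt` file names are assumptions.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;

# Illustrative subset of the tables; the real list is longer.
my @tables = qw(gene_product association evidence dbxref);
my $dir    = tempdir(CLEANUP => 1);   # hypothetical output location

# Open one output file per table and remember its handle.
my %fh;
for my $table (@tables) {
    my $path = File::Spec->catfile($dir, "$table.txt");
    open($fh{$table}, '>', $path) or die "Cannot open $path: $!";
}

# Write one tab-delimited row to the file belonging to the given table.
sub write_row {
    my ($table, @cols) = @_;
    my $out = $fh{$table} or die "No file open for table '$table'";
    print {$out} join("\t", @cols), "\n";
}

# Placeholder rows; the column values are not real data.
write_row('gene_product', 1, 'SYMBOL1');
write_row('association', 1, 11, 1, 0);
write_row('association', 2, 12, 1, 1);

for my $table (@tables) {
    close($fh{$table}) or die "Cannot close file for $table: $!";
}
```

Keeping all handles open for the whole parse avoids reopening files each time the parser moves between gene_product, association, and evidence data.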