Specific go load

From GO Wiki
Revision as of 10:49, 14 March 2007 by Hitz (talk | contribs)

This page gives pseudocode and a detailed road map for bulk loading associations.

The primary approach is to leverage the existing go-db-perl/go-perl code and keep as much of the loading intact as possible, replacing only the slowest DBStag portions. We also have much invested in the loading infrastructure, wrapping scripts, etc., so this is a minimal mutation.

Steps (in rough order of how they are executed)

  • Modify the load_assoc method in go-prepare-release.pl to call a new script, go-load-assoc-bulk.pl (based on load-go-into-db)
  • go-load-assoc-bulk.pl will use the go_assoc_parser (GO::Parser), but with a new handler (obo_godb_flat.pm)
    • Before the file is parsed, hashes similar to acc2name_h need to be set up, and the last index in dbxref needs to be stored.
    • We also need to append the SO terms to the term table:
+-------+-------------------+-----------+-------------------+-------------+---------+
| id    | name              | term_type | acc               | is_obsolete | is_root |
+-------+-------------------+-----------+-------------------+-------------+---------+
| 23713 | gene              | sequence  | gene              |           0 |       0 | 
| 23717 | protein           | sequence  | protein           |           0 |       0 | 
| 23719 | protein_structure | sequence  | protein_structure |           0 |       0 | 
| 23720 | transcript        | sequence  | transcript        |           0 |       0 | 
| 23721 | complex           | sequence  | complex           |           0 |       0 | 
+-------+-------------------+-----------+-------------------+-------------+---------+

Similarly for the association_qualifier terms:

mysql> select distinct(term.name), term_id from association_qualifier, term where term.id = association_qualifier.term_id;
+--------------------+---------+
| name               | term_id |
+--------------------+---------+
| not                |   23714 | 
| contributes_to     |   23715 | 
| colocalizes_with   |   23716 | 
| not|contributes_to |   23718 | 
+--------------------+---------+

These qualifier terms are added with acc = name and the last two fields (is_obsolete, is_root) = 0.
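The setup steps above (lookup hashes, last dbxref index, appended SO terms) can be sketched roughly as follows. This is a Python illustration against an in-memory SQLite database, not the go-db-perl code: the two-table schema is a cut-down assumption, and `build_lookups`, `append_terms`, and `demo_setup` are hypothetical names. Qualifier terms could be appended the same way; their term_type is not stated on this page, so it is left out here.

```python
import sqlite3

SO_TERMS = ["gene", "protein", "protein_structure", "transcript", "complex"]


def build_lookups(conn):
    """Read the acc -> id map and the last dbxref id once, before parsing."""
    acc2id = {acc: term_id
              for term_id, acc in conn.execute("SELECT id, acc FROM term")}
    (last_dbxref_id,) = conn.execute(
        "SELECT COALESCE(MAX(id), 0) FROM dbxref").fetchone()
    return acc2id, last_dbxref_id


def append_terms(conn, acc2id, names, term_type):
    """Append terms with acc = name and is_obsolete = is_root = 0."""
    for name in names:
        if name in acc2id:
            continue  # already present, keep the existing id
        cur = conn.execute(
            "INSERT INTO term (name, term_type, acc, is_obsolete, is_root) "
            "VALUES (?, ?, ?, 0, 0)",
            (name, term_type, name))
        acc2id[name] = cur.lastrowid
    return acc2id


def demo_setup():
    # Assumed, simplified schema; the real tables have more columns.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE term (id INTEGER PRIMARY KEY, name TEXT, "
                 "term_type TEXT, acc TEXT, is_obsolete INTEGER, is_root INTEGER)")
    conn.execute("CREATE TABLE dbxref (id INTEGER PRIMARY KEY, "
                 "xref_dbname TEXT, xref_key TEXT)")
    conn.execute("INSERT INTO term (name, term_type, acc, is_obsolete, is_root) "
                 "VALUES ('mitochondrion', 'cellular_component', 'GO:0005739', 0, 0)")
    conn.execute("INSERT INTO dbxref (xref_dbname, xref_key) "
                 "VALUES ('EXAMPLE_DB', 'X0001')")
    acc2id, last_dbxref_id = build_lookups(conn)
    append_terms(conn, acc2id, SO_TERMS, "sequence")
    return acc2id, last_dbxref_id
```

Holding the maps in memory means the handler never has to query the database while the association file is being parsed, which is the point of the bulk approach.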

  • obo_godb_flat.pm is similar to obo_text.pm, but will write the table files
  • go_assoc_parser fires the following stag events, each of which has to be associated with an e_EventName method in obo_godb_flat.pm
    • $self->start_event(DBSET);
    • $self->event(PRODDB, $proddb); - this sets the DBXREF.XREF_DBname for all following gene_product accessions (i.e. DBXREF_ID -> SGD or UniProt)... or is this the DB table?
    • $self->start_event(PROD) - When this event fires, we need to write a line in the gene_product table (file); the following columns are accessed via $self->stag_get(product, tag)
      • $self->event(PRODACC, $prodacc); - XREF to DBXREF table
      • $self->event(PRODSYMBOL, $prodsymbol);
      • $self->event(PRODNAME, $prodname)
      • $self->event(PRODTYPE, $prodtype)
      • $self->event(SECONDARY_PRODTAXA, $other[0]); - XREF to species table
      • $self->event(PRODTAXA, $prodtaxa); - XREF to species table
      • $self->event(PRODSYN, $_); GENE_PRODUCT_SYNONYM table
    • $self->start_event(ASSOC); - When this event fires, we need to write a line in the association table (file)
      • $self->event(ASSOCDATE, $assocdate);
      • $self->event(SOURCE_DB, $source_db) see DB table XREF?
      • $self->event(TERMACC, $termacc); from acc2id hash
      • $self->event(IS_NOT, $is_not || '0');
    • $self->event(QUALIFIER, $_) - writes to ASSOCIATION_QUALIFIER table file
    • $self->event(ASPECT, $aspect); -- can be tossed I think
    • $self->start_event(EVIDENCE); - writes to EVIDENCE table file
      • $self->event(EVCODE, $evcode);
    • $self->event(WITH, $_) writes to ASSOC_REL
      • $self->event(REF, $_) writes to DBXREF, EVIDENCE_DBXREF

Actually, because of the STAG structure, there is only one method, e_prod, which fires at the end of a gene_product (which is often a series of lines in GAF).
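A single end-of-product method then has to walk the whole nested record and emit rows for several tables. A minimal Python sketch of that flattening, assuming the product arrives as a nested dict: the class name `FlatRows`, the dict keys, and the column order are all hypothetical simplifications (the real gene_product row would reference dbxref and species ids rather than carry the accession directly).

```python
from itertools import count


class FlatRows:
    """Collect tab-separated rows per table, assigning ids from counters."""

    TABLES = ("gene_product", "association", "evidence")

    def __init__(self, acc2id, last_ids=None):
        last_ids = last_ids or {}
        self.acc2id = acc2id  # term accession -> term.id, built before parsing
        self.rows = {t: [] for t in self.TABLES}
        # Continue numbering after the last id already in each table.
        self.next_id = {t: count(last_ids.get(t, 0) + 1) for t in self.TABLES}

    def emit(self, table, *cols):
        row_id = next(self.next_id[table])
        self.rows[table].append("\t".join(str(c) for c in (row_id, *cols)))
        return row_id

    def e_prod(self, prod):
        """Called once per gene product, after all of its lines are parsed."""
        gp_id = self.emit("gene_product", prod["prodsymbol"], prod["prodacc"])
        for assoc in prod["assocs"]:
            assoc_id = self.emit("association",
                                 self.acc2id[assoc["termacc"]],
                                 gp_id,
                                 assoc.get("is_not", 0))
            for ev in assoc["evidence"]:
                self.emit("evidence", ev["evcode"], assoc_id)
        return gp_id
```

Because ids are handed out by in-memory counters, child rows can reference their parent's id immediately, without a round trip to the database.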

We must also modify the fh/file methods that obo_godb_flat.pm inherits as a Data::Stag::Writer object. We need to open a file for each table (9 at current count) and switch between them based on parse status.
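The one-file-per-table idea can be illustrated with a small Python sketch. It does not reproduce the Data::Stag::Writer interface: `TableFiles` and `demo_table_files` are hypothetical names, the `.txt` suffix is an assumption, and the table list in the demo is only a subset of the tables named on this page.

```python
import os
import tempfile


class TableFiles:
    """Keep one open output file per table and route each row by table name."""

    def __init__(self, directory, tables):
        self.paths = {t: os.path.join(directory, t + ".txt") for t in tables}
        self.handles = {t: open(p, "w", encoding="utf-8")
                        for t, p in self.paths.items()}

    def write_row(self, table, cols):
        # Selecting the handle by table name replaces the single fh
        # that a one-file writer would use.
        self.handles[table].write("\t".join(str(c) for c in cols) + "\n")

    def close(self):
        for fh in self.handles.values():
            fh.close()


def demo_table_files():
    tables = ["gene_product", "association", "evidence"]
    with tempfile.TemporaryDirectory() as directory:
        files = TableFiles(directory, tables)
        files.write_row("gene_product", [11, "SYM1", 42])
        files.write_row("association", [1, 7, 11, 0])
        files.write_row("evidence", [1, "IDA", 1])
        files.close()
        contents = {}
        for table, path in files.paths.items():
            with open(path, encoding="utf-8") as fh:
                contents[table] = fh.read()
        return contents
```

Keeping every handle open for the whole parse avoids reopening files each time the handler moves from one table to another.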