Specific go load
This is pseudocode plus a detailed road map for bulk loading associations.
The primary approach is to leverage the existing go-db-perl/go-perl code and keep as much of the loading intact as possible, replacing only the slowest DBStag portions. Additionally, we have much invested in the loading infrastructure, wrapping scripts, etc., so this is a minimal mutation.
Steps (in rough order of how they are executed)
- Modify the load_assoc method in go-prepare-release.pl to call a new script, go-load-assoc-bulk.pl (based on load-go-into-db)
- go-load-assoc-bulk.pl will use the go_assoc_parser (GO::Parser), but with a new handler (obo_godb_flat.pm)
- Before the file is parsed, hashes similar to acc2name_h need to be set up, and the last index in dbxref needs to be stored.
- We also need to append the SO terms to the term table:
+-------+-------------------+-----------+-------------------+-------------+---------+
| id    | name              | term_type | acc               | is_obsolete | is_root |
+-------+-------------------+-----------+-------------------+-------------+---------+
| 23713 | gene              | sequence  | gene              |           0 |       0 |
| 23717 | protein           | sequence  | protein           |           0 |       0 |
| 23719 | protein_structure | sequence  | protein_structure |           0 |       0 |
| 23720 | transcript        | sequence  | transcript        |           0 |       0 |
| 23721 | complex           | sequence  | complex           |           0 |       0 |
+-------+-------------------+-----------+-------------------+-------------+---------+
Similarly for the association_qualifier terms:
mysql> select distinct(term.name), term_id from association_qualifier, term where term.id = association_qualifier.term_id;
+--------------------+---------+
| name               | term_id |
+--------------------+---------+
| not                |   23714 |
| contributes_to     |   23715 |
| colocalizes_with   |   23716 |
| not|contributes_to |   23718 |
+--------------------+---------+
These term rows likewise have acc = name and the last two fields (is_obsolete, is_root) = 0.
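The extra term rows described above can be sketched as follows. This is only an illustration in Python (the real loader would be Perl), and several things in it are assumptions rather than facts from the existing code: the helper name extra_term_rows, the starting id (in practice it would come from something like SELECT MAX(id) FROM term), the term_type value used for the qualifier rows, and the sequential id assignment (the ids in the tables above are interleaved, so the real loader evidently assigns them in a different order).

```python
def extra_term_rows(names, term_type, next_id):
    """Build term rows: (id, name, term_type, acc, is_obsolete, is_root)."""
    rows = []
    for offset, name in enumerate(names):
        # acc is set equal to name; is_obsolete and is_root are both 0.
        rows.append((next_id + offset, name, term_type, name, 0, 0))
    return rows


def to_table_file(rows):
    """Render rows as tab-delimited lines, one line per row."""
    return "".join("\t".join(str(f) for f in row) + "\n" for row in rows)


so_types = ["gene", "protein", "protein_structure", "transcript", "complex"]
qualifiers = ["not", "contributes_to", "colocalizes_with", "not|contributes_to"]

last_term_id = 23712  # hypothetical value; read it from the database in practice

rows = extra_term_rows(so_types, "sequence", last_term_id + 1)
# "association_qualifier" as the term_type is an assumption, not taken from the notes.
rows += extra_term_rows(
    qualifiers, "association_qualifier", last_term_id + 1 + len(so_types)
)

term_file = to_table_file(rows)
print(term_file, end="")
```

Keeping the names in plain lists makes it easy to build the same name-to-id lookup that the handler later needs for PRODTYPE and QUALIFIER values.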
- obo_godb_flat.pm is similar to obo_text.pm, but will write the table files
- go_assoc_parser fires the following stag events; each has to be associated with an e_EventName method in obo_godb_flat.pm
- $self->start_event(DBSET);
- $self->event(PRODDB, $proddb); - this sets DBXREF.XREF_DBname for the accessions of all following gene_products (i.e. DBXREF_ID -> SGD or UniProt)... or is this the DB table?
- $self->start_event(PROD) - When this event fires, a line needs to be written to the gene_product table (file). The following columns are accessed via $self->stag_get(product, tag)
- $self->event(PRODACC, $prodacc); - XREF to DBXREF table
- $self->event(PRODSYMBOL, $prodsymbol);
- $self->event(PRODNAME, $prodname)
- $self->event(PRODTYPE, $prodtype)
- $self->event(SECONDARY_PRODTAXA, $other[0]); - XREF to species table
- $self->event(PRODTAXA, $prodtaxa); - XREF to species table
- $self->event(PRODSYN, $_); GENE_PRODUCT_SYNONYM table
- $self->start_event(ASSOC); - When this event fires, a line needs to be written to the association table (file)
- $self->event(ASSOCDATE, $assocdate);
- $self->event(SOURCE_DB, $source_db) see DB table XREF?
- $self->event(TERMACC, $termacc); from acc2id hash
- $self->event(IS_NOT, $is_not || '0');
- $self->event(QUALIFIER, $_) - writes to ASSOCIATION_QUALIFIER table file
- $self->event(ASPECT, $aspect); -- can be tossed I think
- $self->start_event(EVIDENCE); - writes to EVIDENCE table file
- $self->event(EVCODE, $evcode);
- $self->event(WITH, $_) writes to ASSOC_REL
- $self->event(REF, $_) writes to DBXREF, EVIDENCE_DBXREF
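The event-to-table mapping listed above can be sketched roughly as below. This is a simplified illustration in Python, not the obo_godb_flat.pm handler itself: the class name FlatTableWriter, the reduced column layouts, the in-memory "files", and all sample values are hypothetical. It mimics the PROD / ASSOC / EVIDENCE nesting with plain dictionaries instead of stag events, and only shows the two ideas from the steps: term ids come from a preloaded acc2id hash, and new row ids continue from the last index stored before parsing. The WITH and REF events are left out for brevity.

```python
import io
from collections import defaultdict


class FlatTableWriter:
    """Collects tab-delimited rows per table instead of issuing INSERTs."""

    def __init__(self, acc2id, last_ids):
        self.acc2id = acc2id  # term acc -> term.id, loaded before parsing
        # Continue numbering after the last id already present in each table.
        self.next_id = {table: last + 1 for table, last in last_ids.items()}
        self.files = defaultdict(io.StringIO)  # table name -> table "file"

    def new_id(self, table):
        ident = self.next_id[table]
        self.next_id[table] += 1
        return ident

    def write(self, table, *fields):
        self.files[table].write("\t".join(str(f) for f in fields) + "\n")

    def product(self, proddb, prod):
        # PRODACC becomes a dbxref row; the gene_product row points at it.
        dbxref_id = self.new_id("dbxref")
        self.write("dbxref", dbxref_id, proddb, prod["prodacc"])
        gp_id = self.new_id("gene_product")
        self.write("gene_product", gp_id, prod["prodsymbol"], dbxref_id)
        for assoc in prod["assocs"]:
            self.association(gp_id, assoc)

    def association(self, gp_id, assoc):
        term_id = self.acc2id[assoc["termacc"]]  # TERMACC via the acc2id hash
        assoc_id = self.new_id("association")
        self.write("association", assoc_id, term_id, gp_id,
                   assoc["is_not"], assoc["assocdate"])
        for qualifier in assoc.get("qualifiers", []):
            # Qualifier terms have acc = name, so the same hash resolves them.
            self.write("association_qualifier", assoc_id, self.acc2id[qualifier])
        for ev in assoc["evidence"]:
            self.write("evidence", self.new_id("evidence"), ev["evcode"], assoc_id)


# Placeholder ids and values, for illustration only.
writer = FlatTableWriter(
    acc2id={"GO:0000001": 42, "contributes_to": 23715},
    last_ids={"dbxref": 1000, "gene_product": 500,
              "association": 9000, "evidence": 7000},
)
writer.product("SGD", {
    "prodacc": "P0001",
    "prodsymbol": "abc1",
    "assocs": [{
        "termacc": "GO:0000001", "is_not": 0, "assocdate": "20060101",
        "qualifiers": ["contributes_to"],
        "evidence": [{"evcode": "IDA"}],
    }],
})
for table in sorted(writer.files):
    print(table)
    print(writer.files[table].getvalue(), end="")
```

Because every id is assigned in memory, each table file could be bulk loaded afterwards without any per-row lookups against the database, which is the point of replacing the slow DBStag portions.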