Proposed GPI1.2 format

From GO Wiki

Revision as of 06:49, 30 June 2014

gp_information files (GPI)

N.B. The first line in the gp_information file should be:

!gpi-version: 1.2
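A consumer could check this header before reading any rows. The following is a minimal sketch only; the function name `has_gpi_1_2_header` is hypothetical and not part of any GO tool, and it assumes the header must match exactly, with no surrounding whitespace.

```python
import io

# Header text taken from the N.B. above.
EXPECTED_HEADER = "!gpi-version: 1.2"


def has_gpi_1_2_header(handle):
    """Return True if the first line of a text file-like object is the GPI 1.2 header.

    Assumption: an exact match is required (no leading or trailing spaces).
    """
    first_line = handle.readline().rstrip("\r\n")
    return first_line == EXPECTED_HEADER


# Usage with an in-memory file; a real file opened in text mode works the same way.
sample = io.StringIO("!gpi-version: 1.2\nUniProtKB\tQ4VCS5-1\tAMOT\n")
print(has_gpi_1_2_header(sample))
```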

Proposed format (March 2014)

column  name                     required?  cardinality   GAF column  Example for UniProt  Example for IntAct
01      DB                       required   1             1           UniProtKB            IntAct
02      DB_Object_ID             required   1             2/17        Q4VCS5-1             EBI-9008420
03      DB_Object_Symbol         required   1             3           AMOT                 HBA1:HBB
04      DB_Object_Name           optional   0 or 1        10          Angiomotin           Hemoglobin HbA complex
05      DB_Object_Synonym(s)     optional   0 or greater  11          KIAA1071|AMOT        HBA1-HBB complex|HBA1-HBB heterotetramer
06      DB_Object_Type           required   1             12          protein              complex
07      Taxon                    required   1             13          9606                 9606
08      Parent_Object_ID         optional   0 or 1        -           UniProtKB:Q4VCS5
09      DB_Xref(s)               optional   0 or greater  -           UniProtKB:P38433     PR:000025934
010     Gene_Product_Properties  optional   0 or greater  -           See Note 4 below
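To illustrate how the ten columns fit together, here is a rough sketch of a row parser. It assumes columns are tab-separated (as in GAF; this page does not state the delimiter) and that trailing optional columns may be omitted. The names `GPI_COLUMNS`, `REQUIRED`, `MULTI_VALUED` and `parse_gpi_line` are made up for the example. Multi-valued cells are split on pipes, and Parent_Object_ID is rejected if it has no prefix.

```python
# Column names in file order, following the table above.
GPI_COLUMNS = [
    "DB", "DB_Object_ID", "DB_Object_Symbol", "DB_Object_Name",
    "DB_Object_Synonym", "DB_Object_Type", "Taxon",
    "Parent_Object_ID", "DB_Xref", "Gene_Product_Properties",
]
REQUIRED = {"DB", "DB_Object_ID", "DB_Object_Symbol", "DB_Object_Type", "Taxon"}
MULTI_VALUED = {"DB_Object_Synonym", "DB_Xref", "Gene_Product_Properties"}


def parse_gpi_line(line):
    """Parse one data row into a dict keyed by column name.

    Assumption: cells are tab-separated and missing trailing cells count as empty.
    """
    cells = line.rstrip("\r\n").split("\t")
    cells += [""] * (len(GPI_COLUMNS) - len(cells))  # pad omitted optional columns
    record = {}
    for name, cell in zip(GPI_COLUMNS, cells):
        if name in REQUIRED and not cell:
            raise ValueError(f"required column {name} is empty")
        if name in MULTI_VALUED:
            # Cardinality "0 or greater": pipe-separated list, possibly empty.
            record[name] = [value for value in cell.split("|") if value]
        else:
            record[name] = cell or None
    parent = record["Parent_Object_ID"]
    if parent is not None and ":" not in parent:
        raise ValueError("Parent_Object_ID must carry a database prefix")
    return record


# The UniProt example row from the table, with columns 09 and 010 omitted.
example = "\t".join([
    "UniProtKB", "Q4VCS5-1", "AMOT", "Angiomotin",
    "KIAA1071|AMOT", "protein", "9606", "UniProtKB:Q4VCS5",
])
record = parse_gpi_line(example)
print(record["DB_Object_Synonym"])
```

Returning a plain dict keeps the sketch short; a stricter reader might also validate DB_Object_Type and Taxon against controlled lists, which this page does not define.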


Notes

1. Where a column is stated to allow more than one value, e.g. 'with', DB_Object_Synonym(s), DB_Xref(s), the values should be given as a pipe-separated list.


2. The DB_Xref(s) column will be useful for mapping MOD-specific identifiers/symbols/synonyms to UniProt accessions, to assist MOD curators moving to Protein2GO in searching for familiar IDs/gene names. In the case of IntAct complex IDs, it will be useful to include PRO IDs as an xref to enable a look-up function in Protein2GO.

3. Identifiers in the Parent_Object_ID column must have a prefix, to avoid confusion in cases where an ID from a different database to the one specified in the header is included.

4. The Gene_Product_Properties column can be filled with a pipe-separated list of "property_name = property_value" pairs. There will be a fixed vocabulary for the property names, and this list can be extended when necessary. Supported properties will include: 'GO annotation complete', 'Phenotype annotation complete' (the value for these two properties would be a date), 'Target set' (e.g. Reference Genome, Kidney etc.), 'Database subset' (e.g. Swiss-Prot, TrEMBL), 'go_annotation_summary' (textual summary of annotations for an entity).
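As an illustration of Note 4, a column 010 cell could be unpacked roughly as follows. This is a sketch under assumptions: the function name `parse_properties` is hypothetical, whitespace around "=" is treated as optional, a property name may repeat (so values are collected in lists), and the date format in the example is invented since the proposal does not fix one.

```python
def parse_properties(cell):
    """Split a Gene_Product_Properties cell into {property_name: [values]}.

    Assumptions: items are pipe-separated, name and value are separated by
    the first "=", and surrounding whitespace is not significant.
    """
    properties = {}
    for item in cell.split("|"):
        item = item.strip()
        if not item:
            continue  # tolerate an empty cell or a stray trailing pipe
        name, separator, value = item.partition("=")
        if not separator:
            raise ValueError(f"expected 'name = value', got {item!r}")
        properties.setdefault(name.strip(), []).append(value.strip())
    return properties


# Hypothetical cell; the date format is an assumption, not part of the proposal.
cell = "GO annotation complete = 2014-06-30|Target set = Kidney|Database subset = Swiss-Prot"
print(parse_properties(cell))
```

A validating reader would additionally check each name against the fixed vocabulary once that list is agreed.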

Further questions/discussion points

1. Do we allow more than one namespace for each file?

Downside to insisting on one GPI file per namespace: It would hinder certain kinds of operations, such as the ability to concatenate multiple GPI documents together for downstream processing.

Possible fix: Perhaps we could allow the namespace to be empty, and in those cases the prefix would have to be explicitly added to each ID?

Downside: There's also a slight sociotechnological consequence that I'm worried about. When local identifiers become "separated" from their prefix (e.g. the local ID in each row and the prefix in the header), the prefix tends to get lost, and the local IDs end up somewhere downstream without a prefix, masquerading as global IDs or attached to some variant of the original prefix. This is how we end up with disasters like MGI:MGI:nnnn. But maybe this is unduly paranoid.

Fix: Document things clearly.
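To make the prefix concern concrete, one defensive approach when joining a file-level namespace to a row-level ID might look like the sketch below. It is only an illustration of the discussion above, not an agreed rule: the function name `to_global_id` is hypothetical, the MGI number is invented, and it assumes an already-prefixed ID should be left alone rather than prefixed again.

```python
def to_global_id(namespace, object_id):
    """Combine a file namespace with a row ID without doubling the prefix.

    Assumptions: an empty namespace means the row ID must already be
    prefixed, and an ID that already starts with "<namespace>:" is kept as is.
    """
    if not namespace:
        if ":" not in object_id:
            raise ValueError(f"no namespace given and {object_id!r} has no prefix")
        return object_id
    prefix = f"{namespace}:"
    if object_id.startswith(prefix):
        return object_id  # avoids results such as MGI:MGI:nnnn
    return prefix + object_id


print(to_global_id("UniProtKB", "Q4VCS5-1"))
print(to_global_id("MGI", "MGI:12345"))  # invented ID, for illustration only
print(to_global_id("", "IntAct:EBI-9008420"))
```

With the empty-namespace branch, rows from several GPI files can be concatenated as long as every ID carries its own prefix.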