Proposed GPI1.2 format: Difference between revisions

From GO Wiki
m (Replaced content with "Moved to https://github.com/geneontology/go-annotation/tree/master/specs Category:Software")
 
===gp_information files (GPI)===
Moved to https://github.com/geneontology/go-annotation/tree/master/specs




[[Category:Software]]

N.B. The first line in the gp_information file should be:
<pre>
!gpi-version: 1.2
</pre>
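A consumer of these files might check the version line before reading any rows. The following is a minimal sketch of such a check, not part of the proposal itself; the function name `has_gpi_header` is hypothetical, and the assumption is that the version line must match exactly apart from the trailing newline.

```python
import io

# The version line required as the first line of a gp_information file.
GPI_VERSION_LINE = "!gpi-version: 1.2"


def has_gpi_header(stream):
    """Return True if the first line of the stream is the GPI 1.2 version line.

    Assumption: the line must match exactly, ignoring only the line ending.
    """
    first_line = stream.readline().rstrip("\r\n")
    return first_line == GPI_VERSION_LINE


# Usage with an in-memory file; a real file object opened in text mode works the same way.
sample = io.StringIO("!gpi-version: 1.2\nUniProtKB\tQ4VCS5-1\tAMOT\n")
header_ok = has_gpi_header(sample)
```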
 
====Proposed format (March 2014)====
 
 
{| cellspacing="2" border="1"
|-
! column
! name
! required?
! cardinality
! GAF column
! Example for UniProt
! Example for IntAct
|-
| 01 || DB || required || 1 || 1 || UniProtKB || IntAct
|-
| 02 || DB_Object_ID || required || 1 || 2/17 || Q4VCS5-1 || EBI-9008420
|-
| 03 || DB_Object_Symbol || required || 1 || 3 || AMOT || HBA1:HBB
|-
| 04 || DB_Object_Name || optional || 0 or 1 || 10 || Angiomotin || Hemoglobin HbA complex
|-
| 05 || DB_Object_Synonym(s) || optional || 0 or greater || 11 || AMOT_HUMAN|KIAA1071|AMOT || HBA-HBB complex|HBA1-HBB complex|HBA1-HBB heterotetramer
|-
| 06 || DB_Object_Type || required || 1 || 12 || protein || complex
|-
| 07 || Taxon || required || 1 || 13 || 9606 || 9606
|-
| 08 || Parent_Object_ID || optional || 0 or 1 ||  || UniProtKB:Q4VCS5 ||
|-
| 09 || DB_Xref(s) || optional || 0 or greater ||  || UniProtKB:P38433 || PR:000025934
|-
| 10 || Gene_Product_Properties || optional || 0 or greater ||  || See Note 4 below ||
|-
|}
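To make the column layout concrete, here is a rough Python sketch that reads one row into a dictionary keyed by the column names in the table. It assumes rows are tab-delimited (as in GAF; the table itself does not say so) and that omitted trailing optional columns may simply be absent. The function name `parse_gpi_row` is made up for illustration.

```python
# Column names in the order given by the table above.
COLUMNS = [
    "DB", "DB_Object_ID", "DB_Object_Symbol", "DB_Object_Name",
    "DB_Object_Synonym", "DB_Object_Type", "Taxon",
    "Parent_Object_ID", "DB_Xref", "Gene_Product_Properties",
]
# Columns whose cardinality is "0 or greater" (pipe-separated values).
MULTI_VALUED = {"DB_Object_Synonym", "DB_Xref", "Gene_Product_Properties"}
# Columns marked "required" in the table.
REQUIRED = {"DB", "DB_Object_ID", "DB_Object_Symbol", "DB_Object_Type", "Taxon"}


def parse_gpi_row(line):
    """Split one row into a dict; multi-valued columns become lists.

    Assumption: fields are separated by tab characters.
    """
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) > len(COLUMNS):
        raise ValueError("too many columns: %d" % len(fields))
    # Pad trailing optional columns that were left off the row.
    fields += [""] * (len(COLUMNS) - len(fields))
    record = {}
    for name, value in zip(COLUMNS, fields):
        if name in REQUIRED and not value:
            raise ValueError("missing required column: %s" % name)
        if name in MULTI_VALUED:
            record[name] = value.split("|") if value else []
        else:
            record[name] = value
    return record


# The UniProt example row from the table, with an empty properties column.
uniprot_row = "\t".join([
    "UniProtKB", "Q4VCS5-1", "AMOT", "Angiomotin",
    "AMOT_HUMAN|KIAA1071|AMOT", "protein", "9606",
    "UniProtKB:Q4VCS5", "UniProtKB:P38433", "",
])
record = parse_gpi_row(uniprot_row)
```

Here `record["DB_Object_Synonym"]` is a three-element list and `record["Gene_Product_Properties"]` is an empty list.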
 
 
'''Notes'''
 
1. Where a column can hold more than one value,
e.g. 'with', DB_Object_Synonym(s), DB_Xref(s), the values should be given as a pipe-separated list.
 
 
2. The DB_Xref(s) column will be useful for mapping MOD-specific identifiers/symbols/synonyms to UniProt accessions, to help MOD curators moving to Protein2GO search for familiar IDs/gene names. In the case of IntAct complex IDs, it will be useful to include PRO IDs as an xref to enable a look-up function in Protein2GO.
 
3. Identifiers in the Parent_Object_ID column must have a prefix, to avoid confusion in cases where the ID comes from a different database to the one specified in the header.
 
4. The Gene_Product_Properties column can be filled with a pipe-separated list of "property_name = property_value" pairs. There will be a fixed vocabulary for the property names, and this list can be extended when necessary.
Supported properties will include: 'GO annotation complete' and 'Phenotype annotation complete' (the value for these two properties would be a date), 'Target set' (e.g. Reference Genome, Kidney etc.), 'Database subset' (e.g. Swiss-Prot, TrEMBL), and 'go_annotation_summary' (a textual summary of annotations for an entity).
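As an illustration of Note 4, the following sketch splits a Gene_Product_Properties value into (name, value) pairs. It is only a sketch under assumptions: the function name `parse_properties` is hypothetical, whitespace around "=" is treated as optional, and the date in the sample value is invented because the note does not fix a date format.

```python
def parse_properties(column_value):
    """Parse a pipe-separated list of "property_name = property_value" items.

    Returns a list of (name, value) tuples so that a repeated property name
    (e.g. several 'Target set' entries) is not silently overwritten.
    """
    pairs = []
    for item in column_value.split("|"):
        if not item.strip():
            continue  # tolerate an empty column or a stray pipe
        name, separator, value = item.partition("=")
        if not separator:
            raise ValueError("expected 'name = value', got: %r" % item)
        pairs.append((name.strip(), value.strip()))
    return pairs


# Sample value; the date and its format are invented for illustration.
properties = parse_properties(
    "GO annotation complete = 2014-03-01|Database subset = Swiss-Prot"
)
```

A real implementation would probably also check each name against the fixed vocabulary once that list is settled.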
 
'''Further questions/discussion points'''
 
1. Do we allow more than one namespace for each file?
 
Downside to insisting on one GPI file per namespace:
It would hinder certain kinds of operations, such as concatenating multiple GPI documents together for downstream processing.
 
Possible fix:
Perhaps we could allow the namespace to be empty, and in those cases the prefix would have to be explicitly added?
 
Downside:
There's also a slight sociotechnical consequence that I'm worried about: when local identifiers become "separated" from their prefix (e.g. the prefix is given in the header rather than in each row), the prefix tends to get lost, and the local IDs end up somewhere downstream either prefixless and masquerading as global IDs, or attached to some variant of the original prefix. This is how we end up with disasters like MGI:MGI:nnnn. But maybe this is unduly paranoid.
 
Fix:
Document things clearly.
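One way a downstream tool could guard against the prefix problem described above is to add the header namespace only when an ID does not already carry a prefix. This is a sketch rather than a proposal: `qualify_id` is a hypothetical helper, and it assumes that a colon in an ID always marks an existing prefix, which may not hold for every database.

```python
def qualify_id(object_id, namespace=""):
    """Return a prefixed ID, adding the namespace only when it is missing.

    Assumption: any colon in object_id means it is already prefixed.
    """
    if ":" in object_id:
        return object_id  # already global; never prefix a second time
    if not namespace:
        raise ValueError("no namespace given and ID has no prefix: %r" % object_id)
    return "%s:%s" % (namespace, object_id)


# A local ID picks up the header namespace.
from_local = qualify_id("Q4VCS5-1", "UniProtKB")
# An ID that is already prefixed is returned unchanged, not doubled.
from_global = qualify_id("UniProtKB:Q4VCS5", "UniProtKB")
```

With an empty namespace, as suggested under "Possible fix", the helper rejects any row whose ID has no explicit prefix instead of guessing one.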
 
 
[[Category:Specification]]
[[Category:GPAD]]

Latest revision as of 18:41, 6 March 2020