Final GPAD and GPI file format: Difference between revisions

m (Replaced content with "Moved to https://github.com/geneontology/go-annotation/tree/master/specs Category:Software")
 
===gp_association files (GPAD)===
Moved to https://github.com/geneontology/go-annotation/tree/master/specs


[[Category:Software]]

<pre>
N.B. The first line in the gp_association file should be:
 
!gpa-version: 1.1
</pre>
 
 
====Final format (09 Jan 2013)====
 
 
{| border="1" cellpadding="5" cellspacing="10"
|-
! column
! name
! required?
! cardinality
! old column #
! extra info
|-
| 1 || DB || required || 1 || 1 || must be in xrf_abbs
|-
| 2 || DB_Object_ID || required || 1 || 2 || canonical or spliceform ID
|-
| 3 || Qualifier || required || 0 or greater || 4 || explicit relations (see Note 2)
|-
| 4 || GO ID ||  required || 1 || 5 || must be extant GO ID
|-
| 5 || DB:Reference(s) || required || 1 or greater || 6 || DB must be in xrf_abbs
|-
| 6 || Evidence code || required || 1 || 7 || from ECO
|-
| 7 || With (or) From || optional || 0 or greater || 8 ||
|-
| 8 || Interacting taxon ID (for multi-organism processes) || optional || 0 or 1 || 13 || NCBI taxon ID
|-
| 9 || Date ||  required || 1 || 14 || YYYYMMDD
|-
| 10 || Assigned_by ||  required || 1 || 15 || from xrf_abbs
|-
| 11 || Annotation Extension || optional || 0 or greater || 16 ||
|-
| 12 || Annotation Properties || optional || 0 or greater ||  || See Note 1 below
|}
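
The column layout in the table above can be sketched as a small parser. This is a minimal illustration, not a reference implementation: it assumes the file is tab-delimited and that multi-valued columns are pipe-separated, and the names <code>GpadRow</code>, <code>split_pipe</code> and <code>parse_gpad_line</code>, as well as the example row values, are hypothetical.

```python
from typing import List, NamedTuple


class GpadRow(NamedTuple):
    """One GPAD row, fields in the column order of the table above."""
    db: str
    db_object_id: str
    qualifiers: List[str]
    go_id: str
    references: List[str]
    evidence_code: str
    with_from: List[str]
    interacting_taxon: str
    date: str
    assigned_by: str
    annotation_extension: str   # kept as raw text
    annotation_properties: str  # kept as raw text


def split_pipe(cell: str) -> List[str]:
    """Split a pipe-separated cell; an empty cell gives an empty list."""
    return [value for value in cell.split("|") if value]


def parse_gpad_line(line: str) -> GpadRow:
    # Assumption: columns are separated by tabs.
    cells = line.rstrip("\n").split("\t")
    if len(cells) != 12:
        raise ValueError(f"expected 12 columns, got {len(cells)}")
    return GpadRow(
        db=cells[0],
        db_object_id=cells[1],
        qualifiers=split_pipe(cells[2]),
        go_id=cells[3],
        references=split_pipe(cells[4]),
        evidence_code=cells[5],
        with_from=split_pipe(cells[6]),
        interacting_taxon=cells[7],
        date=cells[8],
        assigned_by=cells[9],
        annotation_extension=cells[10],
        annotation_properties=cells[11],
    )


# Hypothetical example row; the values are made up for illustration.
example = "\t".join([
    "UniProtKB", "P12345", "involved_in", "GO:0006915", "PMID:1234567",
    "ECO:0000314", "", "", "20130109", "UniProt", "", "",
])
row = parse_gpad_line(example)
print(row.go_id, row.qualifiers, row.with_from)
```

Optional columns are kept as empty strings or empty lists rather than dropped, so every row has the same shape regardless of which optional values are present.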
 
'''Notes'''
 
1. The Annotation Properties column can be filled with a pipe-separated list of "property_name = property_value" pairs. The property names will come from a fixed vocabulary, which can be extended when necessary.
The properties supported initially would be curator_name and annotation_identifier*; further properties such as curator_ID, modification_date, creation_date and annotation_notes could be added later.
 
<nowiki>*</nowiki> curator_name and annotation_identifier will be useful for groups that are using Protein2GO for protein annotation who wish to maintain their annotations in their own database. These values can be used to keep track of individual annotations.
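
The "property_name = property_value" list described in Note 1 could be read as follows. This is a sketch under assumptions: whitespace around the "=" is treated as insignificant, a property name may occur more than once, and the function name <code>parse_properties</code> and the example values are hypothetical.

```python
from typing import Dict, List


def parse_properties(cell: str) -> Dict[str, List[str]]:
    """Parse a pipe-separated list of name=value pairs into a dict.

    Values are collected in lists because a property may be repeated.
    """
    properties: Dict[str, List[str]] = {}
    for item in cell.split("|"):
        if not item.strip():
            continue  # empty cell or stray separator
        name, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"missing '=' in property: {item!r}")
        properties.setdefault(name.strip(), []).append(value.strip())
    return properties


# Hypothetical values for the two initially supported properties.
print(parse_properties("curator_name = Jane Doe|annotation_identifier = 12345"))
```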
 
2. The explicit relations will be:
 
'part_of' for Cellular Component
 
'involved_in' for Biological Process
 
'enables' for Molecular Function
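
The mapping in Note 2 can be written down as a lookup table. This is only an illustration: the ontology names used as keys and the function name <code>default_relation</code> are assumptions, not part of the format.

```python
# Explicit relation per ontology, as listed in Note 2.
DEFAULT_RELATION = {
    "cellular_component": "part_of",
    "biological_process": "involved_in",
    "molecular_function": "enables",
}


def default_relation(ontology: str) -> str:
    """Return the explicit relation for a GO ontology name (hypothetical helper)."""
    key = ontology.strip().lower().replace(" ", "_")
    if key not in DEFAULT_RELATION:
        raise ValueError(f"unknown ontology: {ontology!r}")
    return DEFAULT_RELATION[key]


print(default_relation("Molecular Function"))
```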
 
'''Further questions/discussion points'''
 
1. Evidence column.
a. Chain of evidence
 
2. Annotation properties column.
Tony has suggested including the GO evidence code here, to avoid needing a lookup to reverse-engineer it from the file.
 
===gp_information files (GPI)===
 
 
<pre>
N.B. The first two lines in the gp_information file should be:
 
!gpi-version: 1.1
 
!namespace: <database>
 
There should be a header line specifying the namespace of the annotating groups' identifiers, e.g. WB, UniProtKB
 
</pre>
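
Reading the header lines described above might look like the following sketch. It assumes that header lines start with "!" and take the form "key: value", and that blank lines between them can be ignored; the function name <code>parse_gpi_header</code> is hypothetical.

```python
from typing import Dict, Iterable


def parse_gpi_header(lines: Iterable[str]) -> Dict[str, str]:
    """Collect leading '!key: value' lines until the first data row."""
    header: Dict[str, str] = {}
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines between header lines
        if not line.startswith("!"):
            break  # first data row reached
        key, sep, value = line[1:].partition(":")
        if sep:
            header[key.strip()] = value.strip()
    return header


header = parse_gpi_header([
    "!gpi-version: 1.1\n",
    "!namespace: UniProtKB\n",
    "Q4VCS5-1\tAMOT\n",
])
print(header)
```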
 
====Final format (09 Jan 2013)====
 
 
{| cellspacing="2" border="1"
|-
! column
! name
! required?
! cardinality
! GAF column
! Example for UniProt
! Example for WormBase
 
|-
| 01 || DB_Object_ID || required || 1 || 2/17 || Q4VCS5-1 || WBGene00000035
|-
| 02 || DB_Object_Symbol || required || 1 || 3 || AMOT || ace-1
|-
| 03 || DB_Object_Name || optional || 0 or 1 || 10 || Angiomotin ||
|-
| 04 || DB_Object_Synonym(s) || optional || 0 or greater || 11 || AMOT_HUMAN|KIAA1071|AMOT || ACE1
|-
| 05 || DB_Object_Type || required || 1 || 12 || protein || gene
|-
| 06 || Taxon || required || 1 || 13 || taxon:9606 || taxon:6239
|-
| 07 || Parent_Object_ID || optional || 0 or 1 || - || UniProtKB:Q4VCS5 || WB:WBGene00000035
|-
| 08 || DB_Xref(s) || optional || 0 or greater || - || - || UniProtKB:P38433
|-
| 09 || Gene_Product_Properties || optional || 0 or greater || - || See Note 4 below ||
|-
|}
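
As an illustration of the table above, a GPI row could be split into named fields like this. The sketch assumes tab-delimited columns and pipe-separated multi-valued cells; the names <code>GPI_COLUMNS</code>, <code>MULTI_VALUED</code> and <code>parse_gpi_line</code> are hypothetical, and the example row reuses the UniProt values from the table with the optional trailing columns left empty.

```python
from typing import Dict, List, Union

GPI_COLUMNS = [
    "DB_Object_ID", "DB_Object_Symbol", "DB_Object_Name",
    "DB_Object_Synonym", "DB_Object_Type", "Taxon",
    "Parent_Object_ID", "DB_Xref", "Gene_Product_Properties",
]
# Columns whose cardinality is "0 or greater" in the table.
MULTI_VALUED = {"DB_Object_Synonym", "DB_Xref", "Gene_Product_Properties"}


def parse_gpi_line(line: str) -> Dict[str, Union[str, List[str]]]:
    # Assumption: columns are separated by tabs.
    cells = line.rstrip("\n").split("\t")
    if len(cells) != len(GPI_COLUMNS):
        raise ValueError(f"expected {len(GPI_COLUMNS)} columns, got {len(cells)}")
    row: Dict[str, Union[str, List[str]]] = {}
    for name, cell in zip(GPI_COLUMNS, cells):
        if name in MULTI_VALUED:
            row[name] = [value for value in cell.split("|") if value]
        else:
            row[name] = cell
    return row


example = "\t".join([
    "Q4VCS5-1", "AMOT", "Angiomotin", "AMOT_HUMAN|KIAA1071|AMOT",
    "protein", "taxon:9606", "UniProtKB:Q4VCS5", "", "",
])
print(parse_gpi_line(example)["DB_Object_Synonym"])
```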
 
 
'''Notes'''
 
1. Where a column can hold more than one value,
e.g. 'with', DB_Object_Synonym(s), DB_Xref(s), the values should be given as a pipe-separated list.
 
 
2. The DB_Xrefs column will be useful for mapping of MOD-specific identifiers/symbols/synonyms to UniProt accessions to assist MOD curators moving to Protein2GO in searching for familiar IDs/gene names.
 
3. Identifiers in the Parent_Object_ID column must have a prefix, to avoid confusion in cases where the ID comes from a different database to the one specified in the header.
 
4. The Gene Product Properties column can be filled with a pipe-separated list of "property_name = property_value" pairs. The property names will come from a fixed vocabulary, which can be extended when necessary.
Supported properties will include: 'GO annotation complete' and 'Phenotype annotation complete' (the value for these two properties would be a date), 'Target set' (e.g. Reference Genome, Kidney etc.), and 'Database subset' (e.g. Swiss-Prot, TrEMBL).
 
'''Further questions/discussion points'''
 
1. Do we allow more than one namespace for each file?
 
Downside to insisting on one GPI file per namespace:
It would hinder certain kinds of operations, such as concatenating multiple GPI documents together for downstream processing.
 
Possible fix:
Perhaps we could allow the namespace to be empty, and in those cases the prefix would have to be explicitly added?
 
Downside:
There's also a slight sociotechnological consequence that I'm worried about: when local identifiers become "separated" from their prefix (e.g. the ID is in each row but the prefix is only in the header), the prefix tends to get lost, and the local IDs end up somewhere downstream either prefixless, masquerading as global IDs, or attached to some variant of the original prefix. This is how we end up with disasters like MGI:MGI:nnnn. But maybe this is unduly paranoid.
 
Fix:
Document things clearly
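
One way a downstream tool might re-attach the header namespace without doubling a prefix that is already present is sketched below. The helper name <code>qualify_id</code> and the example identifiers are hypothetical; this illustrates the concern above rather than an agreed rule.

```python
def qualify_id(local_id: str, namespace: str) -> str:
    """Prefix a local ID with its namespace unless it is already prefixed."""
    prefix = namespace + ":"
    if local_id.startswith(prefix):
        return local_id  # already carries the prefix; do not double it
    return prefix + local_id


print(qualify_id("WBGene00000035", "WB"))
print(qualify_id("MGI:12345", "MGI"))
```

Because the function is idempotent, applying it more than once in a pipeline does not stack prefixes.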
 
 
[[Category:Meetings]]
[[Category:Specification]]
[[Category:GPAD]]

Latest revision as of 18:41, 6 March 2020