Final GPAD and GPI file format

From GO Wiki

General Issues

  • All prefixes must be registered in the db-xrefs.yaml file
  • Prefixes should not contain spaces; dashes and underscores are okay
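The two prefix rules above could be checked along the following lines. This is a minimal sketch, not an official validator: the function name `check_prefix` and the set `REGISTERED_PREFIXES` are hypothetical, the set is a hand-written stand-in for the contents of db-xrefs.yaml, and "no spaces" is read as "no whitespace of any kind".

```python
import re

# Hypothetical stand-in for the prefixes registered in db-xrefs.yaml;
# a real check would load the registry file instead.
REGISTERED_PREFIXES = {"UniProtKB", "WB", "MGI"}

# Assumption: "no spaces" is interpreted as "no whitespace at all".
_WHITESPACE = re.compile(r"\s")


def check_prefix(prefix, registered=REGISTERED_PREFIXES):
    """Return a list of problems found with a prefix (empty if none)."""
    problems = []
    if not prefix:
        problems.append("prefix is empty")
    elif _WHITESPACE.search(prefix):
        problems.append("prefix contains whitespace")
    if prefix not in registered:
        problems.append("prefix is not registered")
    return problems
```

Dashes and underscores pass the syntax check, so a prefix such as my-db_2 is only rejected if it is missing from the registry.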

gp_association files (GPAD)

N.B. The first line in the gp_association file should be:

!gpa-version: 1.1

Use the short version of the format name. This is intentional for backwards compatibility.

Final format (09 Jan 2013)

column  name                                                 required?  cardinality   old column #  extra info
1       DB                                                   required   1             1             must be in xrf_abbs
2       DB_Object_ID                                         required   1             2/17          canonical or isoform ID
3       Qualifier                                            required   1 or greater  4             explicit relations (see Note 2)
4       GO ID                                                required   1             5             must be extant GO ID
5       DB:Reference(s)                                      required   1 or greater  6             DB must be in xrf_abbs
6       Evidence code                                        required   1             7             from ECO
7       With (or) From                                       optional   0 or greater  8
8       Interacting taxon ID (for multi-organism processes)  optional   0 or 1        13            NCBI taxon ID
9       Date                                                 required   1             14            YYYYMMDD
10      Assigned_by                                          required   1             15            from xrf_abbs
11      Annotation Extension                                 optional   0 or greater  16
12      Annotation Properties                                optional   0 or greater                See Note 1 below
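The column table can be turned into a small row parser, sketched below. Two things are assumptions rather than statements from this page: that columns are tab-separated, and that every column with cardinality "or greater" uses the pipe as its value separator. The names `parse_gpad_line`, `GPAD_COLUMNS`, `MULTI_VALUED` and `REQUIRED` are hypothetical.

```python
from datetime import datetime

# Column keys in file order, following the table above.
GPAD_COLUMNS = [
    "db", "db_object_id", "qualifier", "go_id", "references",
    "evidence_code", "with_from", "interacting_taxon", "date",
    "assigned_by", "annotation_extension", "annotation_properties",
]
# Columns whose cardinality is "1 or greater" or "0 or greater".
MULTI_VALUED = {"qualifier", "references", "with_from",
                "annotation_extension", "annotation_properties"}
# Columns marked "required" in the table.
REQUIRED = {"db", "db_object_id", "qualifier", "go_id", "references",
            "evidence_code", "date", "assigned_by"}


def parse_gpad_line(line):
    """Parse one data line into a dict; raise ValueError if malformed."""
    # Assumption: columns are separated by tabs.
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GPAD_COLUMNS):
        raise ValueError(f"expected 12 columns, got {len(fields)}")
    record = {}
    for name, value in zip(GPAD_COLUMNS, fields):
        if name in REQUIRED and not value:
            raise ValueError(f"required column {name!r} is empty")
        if name in MULTI_VALUED:
            # Assumption: multiple values are pipe-separated.
            record[name] = value.split("|") if value else []
        else:
            record[name] = value
    date = record["date"]
    if len(date) != 8 or not date.isdigit():
        raise ValueError(f"date {date!r} is not in YYYYMMDD form")
    # Raises ValueError for impossible dates such as 20131340.
    datetime.strptime(date, "%Y%m%d")
    return record
```

The sketch checks only shape, required columns and the date. Whether a prefix is registered, whether the GO ID is extant, and whether the evidence code is a valid ECO term would all need external lookups.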

Notes

1. The Annotation Properties column can be filled with a pipe-separated list of "property_name = property_value" pairs. There will be a fixed vocabulary for the property names, which can be extended when necessary. The initial supported properties would be curator_name and annotation_identifier*, but the list can be extended to include e.g. curator_ID, modification_date, creation_date, annotation_notes, etc.

* curator_name and annotation_identifier will be useful for groups that are using Protein2GO for protein annotation who wish to maintain their annotations in their own database. These values can be used to keep track of individual annotations.

2. The explicit relations will be:

'part_of' for Cellular Component

'involved_in' for Biological Process

'enables' for Molecular Function
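The two notes above can be sketched in code as follows. The property parser returns a list of pairs rather than a dict, on the assumption that a property name might occur more than once. The dictionary keys used for the three GO aspects, and the names `DEFAULT_RELATION` and `parse_properties`, are hypothetical.

```python
# Explicit relation per GO aspect, as listed in Note 2.
# The aspect keys are illustrative labels, not part of the format.
DEFAULT_RELATION = {
    "cellular_component": "part_of",
    "biological_process": "involved_in",
    "molecular_function": "enables",
}


def parse_properties(column):
    """Split a pipe-separated 'name = value' column into (name, value) pairs."""
    pairs = []
    for item in column.split("|"):
        if not item.strip():
            continue  # empty column or stray separator
        name, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"missing '=' in property {item!r}")
        # Whitespace around the '=' is tolerated and dropped.
        pairs.append((name.strip(), value.strip()))
    return pairs
```

A check against the fixed vocabulary of property names is left out, since this page does not give the full list.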

Further questions/discussion points

1. Evidence column.
   a. Chain of evidence

2. Annotation properties column. Tony has suggested including the GO evidence code here, to avoid needing a lookup to reverse-engineer it from the file.

gp_information files (GPI)

Proposed GPI1.2 format

N.B. The first two lines in the gp_information file should be:

!gpi-version: 1.1

!namespace: <database>

The !namespace header line specifies the namespace of the annotating group's identifiers, e.g. WB or UniProtKB.
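Reading the two header lines might look like the sketch below. The function name `read_gpi_header` is hypothetical, and the sketch assumes the two lines appear in exactly the order shown, with nothing before them.

```python
from itertools import islice


def read_gpi_header(lines):
    """Return (version, namespace) from the first two lines of a GPI file."""
    first_two = list(islice(lines, 2))
    if len(first_two) < 2:
        raise ValueError("a GPI file needs two header lines")
    header = {}
    for expected, line in zip(("gpi-version", "namespace"), first_two):
        # Split '!key: value' at the first colon.
        key, sep, value = line.strip().lstrip("!").partition(":")
        if not line.startswith("!") or not sep or key.strip() != expected:
            raise ValueError(f"expected '!{expected}: ...', got {line!r}")
        header[expected] = value.strip()
    return header["gpi-version"], header["namespace"]
```

Because only two items are taken from `lines`, an open file object can be passed in directly and the remaining data lines are left unread.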

Final format (09 Jan 2013)

column  name                     required?  cardinality   GAF column  Example for UniProt  Example for WormBase
01      DB_Object_ID             required   1             2/17        Q4VCS5-1             WBGene00000035
02      DB_Object_Symbol         required   1             3           AMOT                 ace-1
03      DB_Object_Name           optional   0 or 1        10          Angiomotin
04      DB_Object_Synonym(s)     optional   0 or greater  11          KIAA1071|AMOT        ACE1
05      DB_Object_Type           required   1             12          protein              gene
06      Taxon                    required   1             13          taxon:9606           taxon:6239
07      Parent_Object_ID         optional   0 or 1        -           UniProtKB:Q4VCS5     WB:WBGene00000035
08      DB_Xref(s)               optional   0 or greater  -           -                    UniProtKB:P38433
09      Gene_Product_Properties  optional   0 or greater  -           See Note 4 below
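As an illustration of the nine columns, the sketch below parses one GPI data line into a named record. It assumes tab-separated columns with all nine present (empty optional columns included), and pipe-separated values in the multi-valued columns. `GpiRecord` and `parse_gpi_line` are hypothetical names.

```python
from typing import List, NamedTuple


class GpiRecord(NamedTuple):
    db_object_id: str
    db_object_symbol: str
    db_object_name: str
    db_object_synonyms: List[str]
    db_object_type: str
    taxon: str
    parent_object_id: str
    db_xrefs: List[str]
    gene_product_properties: List[str]


def _split(value):
    """Split a pipe-separated cell; an empty cell gives an empty list."""
    return value.split("|") if value else []


def parse_gpi_line(line):
    """Parse one data line; raise ValueError if malformed."""
    # Assumption: columns are tab-separated and all nine are present.
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, got {len(fields)}")
    obj_id, symbol, name, synonyms, obj_type, taxon, parent, xrefs, props = fields
    required = (("DB_Object_ID", obj_id), ("DB_Object_Symbol", symbol),
                ("DB_Object_Type", obj_type), ("Taxon", taxon))
    for label, value in required:
        if not value:
            raise ValueError(f"required column {label} is empty")
    return GpiRecord(obj_id, symbol, name, _split(synonyms), obj_type,
                     taxon, parent, _split(xrefs), _split(props))
```

Fed the UniProt example from the table, the synonyms cell KIAA1071|AMOT becomes a two-element list, while the empty DB_Xref(s) and Gene_Product_Properties cells become empty lists.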


Notes

1. Where a column can have more than one value (e.g. 'with', DB_Object_Synonym(s), DB_Xref(s)), the values should be given as a pipe-separated list.


2. The DB_Xref(s) column will be useful for mapping MOD-specific identifiers/symbols/synonyms to UniProt accessions, to help MOD curators moving to Protein2GO search for familiar IDs/gene names.

3. Identifiers in the Parent_Object_ID column must have a prefix, to avoid confusion in cases where an ID from a different database to the one specified in the header is included.

4. The Gene Product Properties column can be filled with a pipe-separated list of "property_name = property_value" pairs. There will be a fixed vocabulary for the property names, which can be extended when necessary. Supported properties will include: 'GO annotation complete' and 'Phenotype annotation complete' (the value for these two properties would be a date), 'Target set' (e.g. Reference Genome, Kidney, etc.) and 'Database subset' (e.g. Swiss-Prot, TrEMBL).

Further questions/discussion points

1. Do we allow more than one namespace for each file?

Downside to insisting on one GPI file per namespace: it would hinder certain kinds of operations, such as concatenating multiple GPI documents for downstream processing.

Possible fix: Perhaps we could allow the namespace to be empty, and in those cases the prefix would have to be explicitly added?

Downside: There's also a slight sociotechnological consequence that I'm worried about: when local identifiers become "separated" from their prefix (e.g. in each row vs. in the header), the prefix tends to get lost, and the local IDs end up somewhere downstream, either prefixless and masquerading as global IDs or attached to some variant of the original prefix. This is how we end up with disasters like MGI:MGI:nnnn. But maybe this is unduly paranoid.

Fix: Document things clearly
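One way a consumer could guard against both lost and doubled prefixes is sketched below. This is a heuristic under stated assumptions, not an agreed rule: the function name `qualify_id` is hypothetical, and an ID that already starts with "namespace:" is treated as already qualified. An empty namespace follows the possible fix discussed above, where each row must carry its own prefix.

```python
def qualify_id(local_id, namespace):
    """Return a prefixed ID, adding 'namespace:' only when it is missing."""
    if not namespace:
        # Empty namespace: the ID in the row must already be prefixed.
        if ":" not in local_id:
            raise ValueError(f"{local_id!r} has no prefix and no namespace is set")
        return local_id
    prefix = namespace + ":"
    if local_id.startswith(prefix):
        # Already qualified; adding the prefix again would double it.
        return local_id
    return prefix + local_id
```

With this rule, WBGene00000035 under namespace WB becomes WB:WBGene00000035, while MGI:nnnn under namespace MGI is returned unchanged rather than turned into MGI:MGI:nnnn. The heuristic cannot tell apart a local ID that legitimately begins with the namespace string, so it would still need to be backed by clear documentation.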