Gene Product Association Data (GPAD) Format (Archived)
This page proposes splitting the information in GAF files into two sets: association data and gene product data.
Current Association File Format
column | required? | contents | cardinality |
---|---|---|---|
1 | required | DB | 1 |
2 | required | DB_Object_ID | 1 |
3 | required | DB_Object_Symbol | 1 |
4 | optional | Qualifier | 0 or greater |
5 | required | GO ID | 1 |
6 | required | DB:Reference(s) | 1 or greater |
7 | required | Evidence code | 1 |
8 | optional | With (or) From | 0 or greater |
9 | required | Aspect | 1 |
10 | optional | DB_Object_Name | 0 or 1 |
11 | optional | DB_Object_Synonym(s) | 0 or greater |
12 | required | DB_Object_Type (refers to col 17 if present) | 1 |
13 | required | taxon | 1 or 2 (for multi-org processes) |
14 | required | Date | 1 |
15 | required | Assigned_by | 1 |
16 | optional | Annotation cross products | ? |
17 | optional | Spliceform | 1 |
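To make the column layout concrete, here is a minimal sketch of reading one GAF row into named fields. It assumes rows are tab-delimited and that multi-valued columns are pipe-separated, as the references in the example further down suggest. The snake_case keys and the function name `parse_gaf_line` are illustrative choices, not part of the format.

```python
from typing import Dict

# Keys mirror the 17 columns in the table above; the names themselves are hypothetical.
GAF_COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "db_reference", "evidence_code", "with_or_from", "aspect",
    "db_object_name", "db_object_synonym", "db_object_type", "taxon",
    "date", "assigned_by", "annotation_cross_products", "spliceform",
]

# Columns whose cardinality can exceed 1; assumed to be pipe-separated.
MULTI_VALUED = {"qualifier", "db_reference", "with_or_from",
                "db_object_synonym", "taxon"}


def parse_gaf_line(line: str) -> Dict[str, object]:
    """Split one tab-delimited GAF row into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    # Trailing optional columns may be absent; pad so every key is present.
    fields += [""] * (len(GAF_COLUMNS) - len(fields))
    record: Dict[str, object] = {}
    for name, value in zip(GAF_COLUMNS, fields):
        if name in MULTI_VALUED:
            record[name] = [v for v in value.split("|") if v]
        else:
            record[name] = value
    return record
```

Empty optional columns come back as an empty string or an empty list, so callers do not need to distinguish a missing trailing column from a blank one.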
Proposal: remove gene product data from the association file, leaving just an identifier. The new format would be:
old column # | required? | contents | cardinality |
---|---|---|---|
1 | required | DB | 1 |
17 if present; else 2 | required | Spliceform ID OR DB_Object_ID | 1 |
4 | optional | Qualifier | 0 or greater |
5 | required | GO ID | 1 |
6 | required | DB:Reference(s) | 1 or greater |
7 | required | Evidence code | 1 |
8 | optional | With (or) From | 0 or greater |
14 | required | Date | 1 |
15 | required | Assigned_by | 1 |
16 | optional | Annotation cross products | ? |
13 | optional | Interacting taxon ID (for multi-organism processes) | 0 or 1 |
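The mapping in this table can be sketched as a small function over one GAF row. This is only an illustration under some assumptions: the row arrives as a list of column values in GAF order, a two-taxon value in column 13 is pipe-separated with the interacting taxon second, and the output order follows the table exactly (so the aspect column is not carried over). The name `gaf_to_association` is hypothetical.

```python
from typing import List


def gaf_to_association(fields: List[str]) -> List[str]:
    """Map one GAF row (up to 17 column values) to the proposed association row."""
    # Pad so that absent trailing optional columns read as empty strings.
    f = list(fields) + [""] * (17 - len(fields))

    def col(n: int) -> str:
        # 1-based access, matching the "old column #" numbering in the table.
        return f[n - 1]

    # Use the spliceform ID (col 17) if present; else the DB_Object_ID (col 2).
    object_id = col(17) or col(2)

    # Assumption: when col 13 holds two taxa, the second is the interacting taxon.
    taxa = col(13).split("|")
    interacting_taxon = taxa[1] if len(taxa) > 1 else ""

    return [
        col(1),             # DB
        object_id,          # Spliceform ID or DB_Object_ID
        col(4),             # Qualifier
        col(5),             # GO ID
        col(6),             # DB:Reference(s)
        col(7),             # Evidence code
        col(8),             # With (or) From
        col(14),            # Date
        col(15),            # Assigned_by
        col(16),            # Annotation cross products
        interacting_taxon,  # Interacting taxon ID
    ]
```

The symbol, name, synonyms, object type and primary taxon are dropped here on purpose, since the proposal moves them to the gene product file.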
Gene product data would be stored in a separate file. It would consist of the following pieces of information:
old column # | required? | contents | cardinality |
---|---|---|---|
1 | required | DB | 1 |
2 | required | DB_Object_ID | 1 |
3 | required | DB_Object_Symbol | 1 |
10 | optional | DB_Object_Name | 0 or 1 |
11 | optional | DB_Object_Synonym(s) | 0 or greater |
12 | required | DB_Object_Type | 1 |
13 | required | taxon | 1 |
Any GPs with different spliceforms would also have the following data:
old column # | required? | contents | cardinality |
---|---|---|---|
17 | required | Spliceform ID | 1 |
12 | required | Spliceform object type | 1 |
Ideally, the gene product files would also include the gp2protein data, so we would have an additional piece of data, an xref to a UniProt or NCBI ID.
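As a rough sketch, the gene product file could be derived from existing GAF rows by keeping one record per (DB, DB_Object_ID) pair. The code below assumes each row is a list of column values in GAF order and that the first taxon in column 13 is the gene product's own taxon. It leaves out the spliceform records and the UniProt or NCBI xref, since the latter would come from gp2protein rather than from the GAF. The function name `collect_gene_products` is hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def collect_gene_products(
    rows: Iterable[List[str]],
) -> Dict[Tuple[str, str], List[str]]:
    """Build one gene product record per (DB, DB_Object_ID) from GAF rows."""
    products: Dict[Tuple[str, str], List[str]] = {}
    for fields in rows:
        # Pad so that absent trailing optional columns read as empty strings.
        f = list(fields) + [""] * (17 - len(fields))
        key = (f[0], f[1])
        if key in products:
            continue  # the same gene product appears once per annotation in a GAF
        primary_taxon = f[12].split("|")[0]
        products[key] = [
            f[0],           # DB (old column 1)
            f[1],           # DB_Object_ID (2)
            f[2],           # DB_Object_Symbol (3)
            f[9],           # DB_Object_Name (10)
            f[10],          # DB_Object_Synonym(s) (11)
            f[11],          # DB_Object_Type (12)
            primary_taxon,  # taxon (13), first value only
        ]
    return products
```

Keeping the first occurrence is a simplification: it assumes every annotation row for a gene product repeats the same symbol, name and synonyms.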
Example
The following appears on the page http://geneontology.org/GO.annotation.fields.shtml and is an example of the current GAF file structure:
SGD S000000296 PHO3 GO:0003993 SGD_REF:S000047763|PMID:2407294 IMP F acid phosphatase YBR092C gene taxon:4932 20010118 SGD
SGD S000000296 PHO3 GO:0006796 SGD_REF:S000047115|PMID:2407294 TAS P acid phosphatase YBR092C gene taxon:4932 20041220 SGD
SGD S000005370 RCL1 NOT GO:0003963 SGD_REF:S000039255|PMID:10790377 IDA F YOL010W gene taxon:4932 20020530 SGD
SGD S000005197 TEX1 GO:0006406 SGD_REF:S000069956|PMID:11979277 IC GO:0000346 P YNL253W gene taxon:4932 20030221 SGD
SGD S000005316 ABZ1 GO:0046820 SGD_REF:S000057703|PMID:8346682 ISS CGSC:pabA|CGSC:pabB F aminodeoxychorismate synthase YNR033W gene taxon:4932 20030106 SGD
This is how it could look in the proposed new format.
Association data:
SGD S000000296 GO:0003993 F SGD_REF:S000047763|PMID:2407294 IMP 20010118 SGD
SGD S000000296 GO:0006796 P SGD_REF:S000047115|PMID:2407294 TAS 20041220 SGD
SGD S000005370 NOT GO:0003963 F SGD_REF:S000039255|PMID:10790377 IDA 20020530 SGD
SGD S000005197 GO:0006406 P SGD_REF:S000069956|PMID:11979277 IC GO:0000346 20030221 SGD
SGD S000005316 GO:0046820 F SGD_REF:S000057703|PMID:8346682 ISS CGSC:pabA|CGSC:pabB 20030106 SGD
GP data:
SGD S000000296 PHO3 acid phosphatase YBR092C gene taxon:4932 UniProt:XXXXXX
SGD S000005370 RCL1 YOL010W gene taxon:4932 UniProt:XXXXXX
SGD S000005197 TEX1 YNL253W gene taxon:4932 UniProt:XXXXXX
SGD S000005316 ABZ1 aminodeoxychorismate synthase YNR033W gene taxon:4932 UniProt:XXXXXX
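Once the data is split, a consumer that still wants the symbol next to each annotation has to join the two files on the shared identifier. The following is a minimal sketch of that lookup, assuming both files have already been read into lists of column values with DB and the object ID as the first two columns, and the symbol as the third column of the GP data. The function name `attach_symbols` is hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def attach_symbols(
    associations: Sequence[Sequence[str]],
    gene_products: Sequence[Sequence[str]],
) -> List[List[str]]:
    """Insert each gene product's symbol after the ID in its association rows."""
    # Index GP records by (DB, DB_Object_ID), the key shared by both files.
    index: Dict[Tuple[str, str], Sequence[str]] = {
        (gp[0], gp[1]): gp for gp in gene_products
    }
    joined: List[List[str]] = []
    for assoc in associations:
        gp = index.get((assoc[0], assoc[1]))
        symbol = gp[2] if gp is not None else ""  # empty if no GP record exists
        joined.append([assoc[0], assoc[1], symbol] + list(assoc[2:]))
    return joined
```

For instance, joining the RCL1 association row with its GP record would yield a row starting `SGD`, `S000005370`, `RCL1`, followed by the remaining association columns.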