Gene Product Association Data (GPAD) Format (Archived)
This is a proposal to split the information in the GAF files into two sets: association data and gene product data.
The reasons for doing this are as follows:
- allow unannotated gene products to be submitted to the GO database
- reduce the amount of redundant gene product information in the GAF files
Current Association File Format
column | required? | contents | cardinality |
---|---|---|---|
1 | required | DB | 1 |
2 | required | DB_Object_ID | 1 |
3 | required | DB_Object_Symbol | 1 |
4 | optional | Qualifier | 0 or greater |
5 | required | GO ID | 1 |
6 | required | DB:Reference(s) | 1 or greater |
7 | required | Evidence code | 1 |
8 | optional | With (or) From | 0 or greater |
9 | required | Aspect | 1 |
10 | optional | DB_Object_Name | 0 or 1 |
11 | optional | DB_Object_Synonym(s) | 0 or greater |
12 | required | DB_Object_Type (refers to col 17 if present) | 1 |
13 | required | taxon | 1 or 2 (for multi-org processes) |
14 | required | Date | 1 |
15 | required | Assigned_by | 1 |
16 | optional | Annotation cross products | ? |
17 | optional | Spliceform | 1 |
Proposed file format
Proposal: remove gene product data from the association file, leaving just an identifier.
Associations
New format for storing annotations:
old column # | required? | contents | cardinality |
---|---|---|---|
1 | required | DB | 1 |
17 if present; else 2 | required | Spliceform ID OR DB_Object_ID | 1 |
4 | optional | Qualifier | 0 or greater |
5 | required | GO ID | 1 |
6 | required | DB:Reference(s) | 1 or greater |
7 | required | Evidence code | 1 |
8 | optional | With (or) From | 0 or greater |
9 | required | Aspect | 1 |
13 | optional | Interacting taxon ID (for multi-organism processes) | 0 or 1 |
14 | required | Date | 1 |
15 | required | Assigned_by | 1 |
16 | optional | Annotation cross products | ? |
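The column mapping in the table above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `gaf_to_association` is hypothetical, a row is assumed to be a list of 17 strings with empty strings for missing values, and the second entry of the old taxon column is assumed to be the interacting taxon.

```python
from typing import List

def gaf_to_association(row: List[str]) -> List[str]:
    """Project one 17-column GAF row onto the 12 proposed association columns."""
    row = row + [""] * (17 - len(row))   # pad short rows so indexing is safe
    # Old column 17 (spliceform) if present, else old column 2 (DB_Object_ID).
    object_id = row[16] or row[1]
    # Old column 13 holds "taxon:A" or "taxon:A|taxon:B"; only the second
    # (interacting) taxon stays in the association file.
    taxa = row[12].split("|")
    interacting_taxon = taxa[1] if len(taxa) > 1 else ""
    return [
        row[0],             # 1  DB
        object_id,          # 17 or 2
        row[3],             # 4  Qualifier
        row[4],             # 5  GO ID
        row[5],             # 6  DB:Reference(s)
        row[6],             # 7  Evidence code
        row[7],             # 8  With (or) From
        row[8],             # 9  Aspect
        interacting_taxon,  # 13 Interacting taxon ID
        row[13],            # 14 Date
        row[14],            # 15 Assigned_by
        row[15],            # 16 Annotation cross products
    ]

tex1 = ["SGD", "S000005197", "TEX1", "", "GO:0006406", "SGD_REF:S000069956",
        "IC", "GO:0000346", "P", "", "YNL253W", "gene",
        "taxon:4932|taxon:745953", "20030221", "SGD", "", ""]
print(gaf_to_association(tex1)[8])  # prints "taxon:745953"
```

The gene product's own taxon is dropped here because, under the proposal, it moves to the gene product file.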
Gene Products
Gene product data would be stored in a separate file (perhaps combined with the gp2protein file). It would consist of the following pieces of information:
old column # | required? | contents | cardinality |
---|---|---|---|
1 | required | DB | 1 |
2 | required | DB_Object_ID | 1 |
3 | required | DB_Object_Symbol | 1 |
10 | optional | DB_Object_Name | 0 or 1 |
11 | optional | DB_Object_Synonym(s) | 0 or greater |
12 | required | DB_Object_Type | 1 |
13 | required | taxon | 1 |
n/a | required | UniProt xref (from gp2protein file) | 1 |
Any gene products with different spliceforms would also have the following data:
old column # | required? | contents | cardinality |
---|---|---|---|
17 | required | Spliceform ID | 1 |
12 | required | Spliceform object type | 1 |
Ideally, the gene product files would also include the gp2protein data, so we would have an additional piece of data, an xref to a UniProt or NCBI ID.
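As a rough sketch of how the gene product records described above could be derived from existing GAF rows, the following Python collapses repeated rows into one record per (DB, DB_Object_ID). The function name `collect_gene_products` and the record layout are hypothetical. It assumes synonyms are pipe-separated, that the object type on a row with a spliceform describes the spliceform rather than the parent, and it leaves out the UniProt xref because that would come from the gp2protein file, not from the GAF.

```python
from typing import Dict, Iterable, List, Tuple

def collect_gene_products(rows: Iterable[List[str]]) -> Dict[Tuple[str, str], dict]:
    """Build one gene product record per (DB, DB_Object_ID) from GAF rows."""
    products: Dict[Tuple[str, str], dict] = {}
    for row in rows:
        row = row + [""] * (17 - len(row))       # pad short rows
        key = (row[0], row[1])                   # old columns 1 and 2
        gp = products.setdefault(key, {
            "symbol": row[2],                    # old column 3
            "name": row[9],                      # old column 10
            "synonyms": row[10].split("|") if row[10] else [],  # old column 11
            "type": "",                          # filled from rows without a spliceform
            "taxon": row[12].split("|")[0],      # first taxon of old column 13
            "spliceforms": {},                   # spliceform ID -> spliceform object type
        })
        if row[16]:
            # Old column 12 refers to the spliceform in old column 17.
            gp["spliceforms"][row[16]] = row[11]
        else:
            gp["type"] = row[11]
    return products
```

With the ten example rows further down this page, this would yield five records, which is the redundancy reduction the proposal is after.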
Example
The following example of the current GAF file structure appears on the page http://geneontology.org/GO.annotation.fields.shtml (shaded background: annotation data; blue text: gene product data):
1: DB | 2: DB Object ID | 3: DB Object Symbol | 4: Qualifier | 5: GO ID | 6: DB:Reference(s) | 7: Evidence code | 8: With (or) From | 9: Aspect | 10: DB Object Name | 11: DB Object Synonym(s) | 12: DB Object Type (refers to col 17 if present) | 13: taxon | 14: Date | 15: Assigned by | 16: Annotation cross products | 17: Spliceform |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
SGD | S000000296 | PHO3 | | GO:0003993 | SGD_REF:S000047763 | IMP | | F | acid phosphatase | YBR092C | gene | taxon:4932 | 20010118 | SGD | | |
SGD | S000000296 | PHO3 | | GO:0006796 | SGD_REF:S000047115 | TAS | | P | acid phosphatase | YBR092C | gene | taxon:4932 | 20041220 | SGD | | |
SGD | S000005370 | RCL1 | NOT | GO:0003963 | SGD_REF:S000039255 | IDA | | F | | YOL010W | gene | taxon:4932 | 20020530 | SGD | | |
SGD | S000005197 | TEX1 | | GO:0006406 | SGD_REF:S000069956 | IC | GO:0000346 | P | | YNL253W | gene | taxon:4932\|taxon:745953 | 20030221 | SGD | | |
SGD | S000005316 | ABZ1 | | GO:0046820 | SGD_REF:S000057703 | ISS | CGSC:pabA | F | aminodeoxychorismate synthase | YNR033W | gene | taxon:4932\|taxon:2861 | 20030106 | SGD | | |
UniProtKB | Q4VCS5 | AMOT_HUMAN | | GO:0031410 | PMID:11257124 | IDA | | C | AMOT, KIAA1071: Angiomotin | IPI00163085 | protein | taxon:9606 | 20051207 | UniProtKB | | |
UniProtKB | Q4VCS5 | AMOT_HUMAN | | GO:0043532 | PMID:11257124 | IDA | | F | AMOT, KIAA1071: Angiomotin | IPI00163085 | protein | taxon:9606 | 20051207 | UniProtKB | | |
UniProtKB | Q4VCS5 | AMOT_HUMAN | | GO:0043116 | PMID:16043488 | IDA | | P | AMOT, KIAA1071: Angiomotin | IPI00163085 | snoRNA | taxon:9606 | 20051207 | UniProtKB | | Q4VCS5-1 |
UniProtKB | Q4VCS5 | AMOT_HUMAN | | GO:0005515 | PMID:16043488 | IPI | UniProtKB:Q6RHR9-2 | F | AMOT, KIAA1071: Angiomotin | IPI00163085 | snoRNA | taxon:9606 | 20051207 | UniProtKB | | Q4VCS5-1 |
UniProtKB | Q4VCS5 | AMOT_HUMAN | | GO:0043532 | PMID:16043488 | IDA | | F | AMOT, KIAA1071: Angiomotin | IPI00163085 | protein | taxon:9606 | 20051207 | UniProtKB | | Q4VCS5-2 |
This is how it could look in the proposed new format.
Association data:
1: DB | 17 or 2: Spliceform ID OR DB Object ID | 4: Qualifier | 5: GO ID | 6: DB:Reference(s) | 7: Evidence code | 8: With (or) From | 9: Aspect | 13: Interacting taxon ID (for multi-organism processes) | 14: Date | 15: Assigned_by | 16: Annotation cross products |
---|---|---|---|---|---|---|---|---|---|---|---|
SGD | S000000296 | | GO:0003993 | SGD_REF:S000047763 | IMP | | F | | 20010118 | SGD | |
SGD | S000000296 | | GO:0006796 | SGD_REF:S000047115 | TAS | | P | | 20041220 | SGD | |
SGD | S000005370 | NOT | GO:0003963 | SGD_REF:S000039255 | IDA | | F | | 20020530 | SGD | |
SGD | S000005197 | | GO:0006406 | SGD_REF:S000069956 | IC | GO:0000346 | P | taxon:745953 | 20030221 | SGD | |
SGD | S000005316 | | GO:0046820 | SGD_REF:S000057703 | ISS | CGSC:pabA | F | taxon:2861 | 20030106 | SGD | |
UniProtKB | Q4VCS5 | | GO:0031410 | PMID:11257124 | IDA | | C | | 20051207 | UniProtKB | |
UniProtKB | Q4VCS5 | | GO:0043532 | PMID:11257124 | IDA | | F | | 20051207 | UniProtKB | |
UniProtKB | Q4VCS5-1 | | GO:0043116 | PMID:16043488 | IDA | | P | | 20051207 | UniProtKB | |
UniProtKB | Q4VCS5-1 | | GO:0005515 | PMID:16043488 | IPI | UniProtKB:Q6RHR9-2 | F | | 20051207 | UniProtKB | |
UniProtKB | Q4VCS5-2 | | GO:0043532 | PMID:16043488 | IDA | | F | | 20051207 | UniProtKB | |
GP data:
1: DB | 2: DB_Object_ID | 3: DB_Object_Symbol | 10: DB_Object_Name | 11: DB_Object_Synonym(s) | 12: DB_Object_Type | 13: taxon | n/a: UniProt xref (from gp2protein file) | n/a: Spliceform ID, spliceform type |
---|---|---|---|---|---|---|---|---|
SGD | S000000296 | PHO3 | acid phosphatase | YBR092C | gene | taxon:4932 | UniProt:NE92D8 | |
SGD | S000005370 | RCL1 | | YOL010W | gene | taxon:4932 | UniProt:JN97D8 | |
SGD | S000005197 | TEX1 | | YNL253W | gene | taxon:4932 | UniProt:F9NO8X | |
SGD | S000005316 | ABZ1 | aminodeoxychorismate synthase | YNR033W | gene | taxon:4932 | UniProt:C2BF93 | |
UniProtKB | Q4VCS5 | AMOT_HUMAN | AMOT, KIAA1071: Angiomotin | IPI00163085 | protein | taxon:9606 | UniProt:Q4VCS5 | Q4VCS5-1, snoRNA; Q4VCS5-2, protein |
The representation of the spliceforms could be changed if it isn't clear enough.
Technical requirements and impact on existing software
For the most part, the changes suggested are simply a split of existing GAF files into two separate files, with a rearrangement of existing columns. It would be fairly easy to write a parser that takes in the two files and produces a single GAF file from them, or vice versa.
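To give a sense of how small such a parser could be, here is a minimal Python sketch of the merging direction: it rebuilds a 17-column GAF row from one association row and a lookup of gene product records. Everything named here (`find_product`, `merge_to_gaf`, the record keys) is hypothetical and not part of any existing GO tool. It assumes the association row has the 12 columns proposed above and that each gene product record carries its symbol, name, synonyms, type, taxon and a mapping of spliceform ID to spliceform type.

```python
from typing import Dict, List, Tuple

GeneProducts = Dict[Tuple[str, str], dict]

def find_product(db: str, object_id: str, products: GeneProducts):
    """Return (parent_id, record, spliceform_id) for a gene product or spliceform ID."""
    if (db, object_id) in products:
        return object_id, products[(db, object_id)], ""
    # Not a top-level ID: look for a gene product that lists it as a spliceform.
    for (gp_db, gp_id), record in products.items():
        if gp_db == db and object_id in record["spliceforms"]:
            return gp_id, record, object_id
    raise KeyError(f"no gene product record for {db}:{object_id}")

def merge_to_gaf(assoc: List[str], products: GeneProducts) -> List[str]:
    """Rebuild one 17-column GAF row from a 12-column association row."""
    parent_id, gp, spliceform = find_product(assoc[0], assoc[1], products)
    # The GAF object type describes the spliceform when one is given.
    object_type = gp["spliceforms"][spliceform] if spliceform else gp["type"]
    # Re-attach the interacting taxon (association column 9) to the GP taxon.
    taxon = gp["taxon"] + ("|" + assoc[8] if assoc[8] else "")
    return [
        assoc[0], parent_id, gp["symbol"], assoc[2], assoc[3], assoc[4],
        assoc[5], assoc[6], assoc[7], gp["name"], "|".join(gp["synonyms"]),
        object_type, taxon, assoc[9], assoc[10], assoc[11], spliceform,
    ]

products = {("SGD", "S000005197"): {
    "symbol": "TEX1", "name": "", "synonyms": ["YNL253W"],
    "type": "gene", "taxon": "taxon:4932", "spliceforms": {}}}
assoc = ["SGD", "S000005197", "", "GO:0006406", "SGD_REF:S000069956", "IC",
         "GO:0000346", "P", "taxon:745953", "20030221", "SGD", ""]
print(merge_to_gaf(assoc, products)[12])  # prints "taxon:4932|taxon:745953"
```

Reading and writing the actual files is left out. Since GAF files are tab-delimited, Python's `csv.reader(f, delimiter="\t")` would be one way to obtain the row lists, skipping header lines that start with `!`.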