Gene Product Association Data (GPAD) Format (Archived)
Revision as of 07:33, 26 January 2010
Proposal to split the information in the GAF files into two sets, association data and gene product data.
The reasons for doing this are as follows:
- allow unannotated gene products to be submitted to the GO database (could be useful in estimating the proportion of a genome that has been annotated; will also allow users to see that the GP they are searching for does exist, so they won't spend a long time fruitlessly searching for it [see note below])
- reduce the amount of redundant gene product information in the GAF files; every annotation to a gene product repeats the same gene product data; this only needs to be stated once. Removing this repeated information and supplying it in a separate file will mean the GAF files will be smaller, which would certainly be helpful for huge files like the UniProt releases.
NB: although the gp2protein files may contain IDs of unannotated gene products, this data does not go into the GO database, and it is not available in AmiGO. There is also no information available about gene product name, synonyms, type or taxon of unannotated gene products in the gp2protein file.
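As a rough illustration of the first point, once unannotated gene products are listed in their own file, the annotated proportion of a genome reduces to a set comparison. The sketch below is only an assumption about how such a check might be written; the function name, the tuple layouts and the ID S000000001 (standing in for an unannotated gene product) are hypothetical and are not part of the proposal.

```python
def annotated_fraction(gene_product_ids, association_rows):
    """Return the share of gene products that have at least one annotation.

    gene_product_ids: iterable of (db, object_id) pairs from the gene product file.
    association_rows: iterable of (db, object_id, go_id) tuples from the association file.
    """
    all_products = set(gene_product_ids)
    if not all_products:
        return 0.0
    annotated = {(db, obj) for db, obj, _go in association_rows}
    # Only count annotated IDs that are actually listed as gene products.
    return len(annotated & all_products) / len(all_products)


# Two annotated SGD entries plus one hypothetical unannotated entry.
products = [("SGD", "S000000296"), ("SGD", "S000005370"), ("SGD", "S000000001")]
associations = [
    ("SGD", "S000000296", "GO:0003993"),
    ("SGD", "S000000296", "GO:0006796"),
    ("SGD", "S000005370", "GO:0006406"),
]
fraction = annotated_fraction(products, associations)
```

This sketch counts NOT annotations like any other annotation; whether they should count towards coverage is left open here.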
Current Association File Format
Annotation information has a shaded background, gene product data is in blue text, and information required for both has blue text on a shaded background.
column | required? | contents | cardinality |
---|---|---|---|
1 | required | DB | 1 |
2 | required | DB_Object_ID | 1 |
3 | required | DB_Object_Symbol | 1 |
4 | optional | Qualifier | 0 or greater |
5 | required | GO ID | 1 |
6 | required | DB:Reference(s) | 1 or greater |
7 | required | Evidence code | 1 |
8 | optional | With (or) From | 0 or greater |
9 | required | Aspect | 1 |
10 | optional | DB_Object_Name | 0 or 1 |
11 | optional | DB_Object_Synonym(s) | 0 or greater |
12 | required | DB_Object_Type (refers to col 17 if present) | 1 |
13 | required | taxon | 1 or 2 (for multi-org processes) |
14 | required | Date | 1 |
15 | required | Assigned_by | 1 |
16 | optional | Annotation cross products | ? |
17 | optional | Spliceform | 1 |
Proposed file format
Proposal: remove gene product data from the association file, leaving just an identifier.
Associations
New format for storing annotations:
contents | required? | cardinality | old column # |
---|---|---|---|
DB | required | 1 | 1 |
DB_Object_ID | required | 1 | 2 |
Qualifier | optional | 0 or greater | 4 |
GO ID | required | 1 | 5 |
DB:Reference(s) | required | 1 or greater | 6 |
Evidence code | required | 1 | 7 |
With (or) From | optional | 0 or greater | 8 |
Interacting taxon ID (for multi-organism processes) | optional | 0 or 1 | 13 |
Date | required | 1 | 14 |
Assigned_by | required | 1 | 15 |
Annotation Cross Products | optional | 0 or greater | 16 |
Spliceform ID | optional | 0 or 1 | 17 (if present) |
Gene Products
Gene product data would be stored in a separate file. It would consist of the following pieces of information:
contents | required? | cardinality | old column # |
---|---|---|---|
DB | required | 1 | 1 |
DB_Object_ID | required | 1 | 2 |
DB_Object_Symbol | required | 1 | 3 |
DB_Object_Name | optional | 0 or 1 | 10 |
DB_Object_Synonym(s) | optional | 0 or greater | 11 |
DB_Object_Type | required | 1 | 12 |
taxon | required | 1 | 13 |
Any GPs with different spliceforms would also have the following data (see GAF Col17 GeneProducts for more about spliceforms):
contents | required? | cardinality | old column # |
---|---|---|---|
Spliceform ID | required | 1 | 17 |
Spliceform object type | required | 1 | 12 |
Ideally, the gene product files would also include the gp2protein data, so we would have an additional piece of data, an xref to a UniProt or NCBI ID.
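The column mapping in the tables above could be applied to a single GAF row roughly as follows. This is a minimal sketch under stated assumptions, not an agreed implementation: the function name and the dictionary keys are invented for illustration, the row is assumed to be already split on tabs, and the first taxon in column 13 is assumed to belong to the gene product while the second is the interacting taxon.

```python
def split_gaf_row(row):
    """Split one 17-column GAF row (a list of strings) into separate records."""
    cols = list(row) + [""] * (17 - len(row))  # pad missing trailing columns
    taxa = cols[12].split("|")  # col 13: own taxon, then optional interacting taxon
    association = {
        "DB": cols[0],
        "DB_Object_ID": cols[1],
        "Qualifier": cols[3],
        "GO_ID": cols[4],
        "DB:Reference": cols[5],
        "Evidence_code": cols[6],
        "With_or_From": cols[7],
        "Interacting_taxon_ID": taxa[1] if len(taxa) > 1 else "",
        "Date": cols[13],
        "Assigned_by": cols[14],
        "Annotation_cross_products": cols[15],
        "Spliceform_ID": cols[16],
    }
    gene_product = {
        "DB": cols[0],
        "DB_Object_ID": cols[1],
        "DB_Object_Symbol": cols[2],
        "DB_Object_Name": cols[9],
        "DB_Object_Synonym": cols[10],
        "DB_Object_Type": cols[11],
        "taxon": taxa[0],
    }
    spliceform = None
    if cols[16]:
        # Col 12 describes the spliceform when col 17 is filled in, so the
        # gene product's own type cannot be recovered from this row alone.
        spliceform = {"Spliceform_ID": cols[16], "Spliceform_object_type": cols[11]}
        gene_product["DB_Object_Type"] = ""
    return association, gene_product, spliceform


line = ("SGD\tS000005370\tRCL1\t\tGO:0006406\tSGD_REF:S000069956\tIC\tGO:0000346\tP\t"
        "aminodeoxychorismate synthase\tYOL010W\tgene\ttaxon:4932|taxon:745953\t"
        "20030221\tSGD\t\t")
association, gene_product, spliceform = split_gaf_row(line.split("\t"))
```

Note that the Aspect column (old column 9) is not carried into either record, matching the tables above.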
Example
Old GAF 1.0 Format
The following appears on the page http://geneontology.org/GO.annotation.fields.shtml and is an example of the current GAF file structure (shaded bg: annotation data; blue text: gp data):
1 DB | 2 DB Object ID | 3 DB Object Symbol | 4 Qualifier | 5 GO ID | 6 DB:Reference(s) | 7 Evidence code | 8 With (or) From | 9 Aspect | 10 DB Object Name | 11 DB Object Synonym(s) | 12 DB Object Type (refers to col 17 if present) | 13 taxon | 14 Date | 15 Assigned by | 16 Annotation cross products | 17 Spliceform |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
SGD | S000000296 | PHO3 | | GO:0003993 | SGD_REF:S000047763 | IMP | | F | acid phosphatase | YBR092C | gene | taxon:4932 | 20010118 | SGD | | |
SGD | S000000296 | PHO3 | | GO:0006796 | SGD_REF:S000047115 | TAS | | P | acid phosphatase | YBR092C | gene | taxon:4932 | 20041220 | SGD | | |
SGD | S000005370 | RCL1 | NOT | GO:0003963 | SGD_REF:S000039255 | IDA | | F | aminodeoxychorismate synthase | YOL010W | gene | taxon:4932 | 20020530 | SGD | | |
SGD | S000005370 | RCL1 | | GO:0006406 | SGD_REF:S000069956 | IC | GO:0000346 | P | aminodeoxychorismate synthase | YOL010W | gene | taxon:4932|taxon:745953 | 20030221 | SGD | | |
SGD | S000005370 | RCL1 | | GO:0046820 | SGD_REF:S000057703 | ISS | CGSC:pabA | F | aminodeoxychorismate synthase | YOL010W | gene | taxon:4932|taxon:2861 | 20030106 | SGD | | |
UniProtKB | Q4VCS5 | AMOT_HUMAN | | GO:0031410 | PMID:11257124 | IDA | | C | AMOT, KIAA1071: Angiomotin | IPI00163085 | protein | taxon:9606 | 20051207 | UniProtKB | | |
UniProtKB | Q4VCS5 | AMOT_HUMAN | | GO:0043532 | PMID:11257124 | IDA | | F | AMOT, KIAA1071: Angiomotin | IPI00163085 | protein | taxon:9606 | 20051207 | UniProtKB | | |
UniProtKB | Q4VCS5 | AMOT_HUMAN | | GO:0043116 | PMID:16043488 | IDA | | P | AMOT, KIAA1071: Angiomotin | IPI00163085 | snoRNA | taxon:9606 | 20051207 | UniProtKB | | Q4VCS5-1 |
UniProtKB | Q4VCS5 | AMOT_HUMAN | | GO:0005515 | PMID:16043488 | IPI | UniProtKB:Q6RHR9-2 | F | AMOT, KIAA1071: Angiomotin | IPI00163085 | snoRNA | taxon:9606 | 20051207 | UniProtKB | | Q4VCS5-1 |
UniProtKB | Q4VCS5 | AMOT_HUMAN | | GO:0043532 | PMID:16043488 | IDA | | F | AMOT, KIAA1071: Angiomotin | IPI00163085 | protein | taxon:9606 | 20051207 | UniProtKB | | Q4VCS5-2
Proposed new format
This is how it could look in the proposed new format.
Association data:
DB | DB Object ID | Qualifier | GO ID | DB:Reference(s) | Evidence code | With (or) From | Interacting taxon ID (for multi-organism processes) | Date | Assigned_by | Annotation cross products | Spliceform ID (if applicable) |
---|---|---|---|---|---|---|---|---|---|---|---|
SGD | S000000296 | | GO:0003993 | SGD_REF:S000047763 | IMP | | | 20010118 | SGD | | |
SGD | S000000296 | | GO:0006796 | SGD_REF:S000047115 | TAS | | | 20041220 | SGD | | |
SGD | S000005370 | NOT | GO:0003963 | SGD_REF:S000039255 | IDA | | | 20020530 | SGD | | |
SGD | S000005370 | | GO:0006406 | SGD_REF:S000069956 | IC | GO:0000346 | taxon:745953 | 20030221 | SGD | | |
SGD | S000005370 | | GO:0046820 | SGD_REF:S000057703 | ISS | CGSC:pabA | taxon:2861 | 20030106 | SGD | | |
UniProtKB | Q4VCS5 | | GO:0031410 | PMID:11257124 | IDA | | | 20051207 | UniProtKB | | |
UniProtKB | Q4VCS5 | | GO:0043532 | PMID:11257124 | IDA | | | 20051207 | UniProtKB | | |
UniProtKB | Q4VCS5 | | GO:0043116 | PMID:16043488 | IDA | | | 20051207 | UniProtKB | | Q4VCS5-1 |
UniProtKB | Q4VCS5 | | GO:0005515 | PMID:16043488 | IPI | UniProtKB:Q6RHR9-2 | | 20051207 | UniProtKB | | Q4VCS5-1 |
UniProtKB | Q4VCS5-2 | | GO:0043532 | PMID:16043488 | IDA | | | 20051207 | UniProtKB | | Q4VCS5-2
GP data (including possible data from gp2protein file):
DB | DB_Object_ID | DB_Object_Symbol | DB_Object_Name | DB_Object_Synonym(s) | DB_Object_Type | taxon | Spliceform ID, spliceform type | xref from gp2protein file |
---|---|---|---|---|---|---|---|---|
SGD | S000000296 | PHO3 | acid phosphatase | YBR092C | gene | taxon:4932 | | UniProt:NE92D8 |
SGD | S000005370 | RCL1 | aminodeoxychorismate synthase | YOL010W | gene | taxon:4932 | | UniProt:JN97D8 |
UniProtKB | Q4VCS5 | AMOT_HUMAN | AMOT, KIAA1071: Angiomotin | IPI00163085 | protein | taxon:9606 | Q4VCS5-1, snoRNA; Q4VCS5-2, protein | UniProt:Q4VCS5 |
The representation of the spliceforms could be changed if it isn't clear enough.
Reformatting in obo1.3
Another option is to abandon the tab-delimited format and go for an obo-like tag-value format.
Gene Products
Gene product data for reaper in OBO 1.3 (-esque) syntax:
id: FB:FBgn0011706
symbol: rpr
name: reaper
type: gene [or use SO:id?]
taxon: 7227
synonym: anon-WO0162936.19
synonym: CG4319
synonym: Reaper
synonym: Reaper L
synonym: rp
synonym: RPR
xref: UniProtKB:Q24475 SEQ_XREF [or some kind of modifier to show that this is a seq xref]
For a gene product with several spliceforms, the information could be represented thus:
[Entity]
id: UniProtKB:Q4VCS5
symbol: AMOT_HUMAN
name: AMOT, KIAA1071: Angiomotin
type: protein
taxon: 9606
xref: UniProtKB:Q24475 SEQ_XREF

[Spliceform]
id: UniProtKB:Q4VCS5-1
type: snoRNA

[Spliceform]
id: UniProtKB:Q4VCS5-2
type: protein
or
[Entity]
id: UniProtKB:Q4VCS5
symbol: AMOT_HUMAN
name: AMOT, KIAA1071: Angiomotin
type: protein
taxon: 9606
seq_xref: UniProtKB:Q24475

[Entity]
id: UniProtKB:Q4VCS5-1
type: snoRNA
relationship: isoform_of UniProtKB:Q4VCS5

[Entity]
id: UniProtKB:Q4VCS5-2
type: protein
relationship: isoform_of UniProtKB:Q4VCS5
Annotation data
Example: FB:FBgn0011706 annotated to GO:0035071, ref: PMID:19824712, ev code IC
[Annotation]
subject: FB:FBgn0011706
object: GO:0035071
source: PMID:19824712
evidence: IC
creation_date: 20070506
assigned_by: FlyBase
Example: SGD:S000005370, NOT GO:0003963, refs: SGD_REF:S000039255, PMID:84195322, evcode IDA
[Annotation]
subject: SGD:S000005370
object: GO:0003963
is_negated: true
evidence: IDA
source: SGD_REF:S000039255
source: PMID:84195322
creation_date: 20020530
assigned_by: SGD
Example: UniProtKB:Q4VCS5-1 annotated to GO:0005515, ref: PMID:16043488, evcode IPI, with UniProtKB:Q6RHR9-2
[Annotation]
subject: UniProtKB:Q4VCS5
property-value: isoform UniProtKB:Q4VCS5-1
object: GO:0005515
source: PMID:16043488
evidence: IPI
xref: UniProtKB:Q6RHR9-2 EVIDENCE <-- or something to indicate that this is a with/from xref
creation_date: 20051207
assigned_by: UniProtKB
Example: UniProtKB:H82KBU contributes_to GO:0006917, ref: PMID:8762143, evcode TAS, annotated by AgBase
[Annotation]
subject: UniProtKB:H82KBU
object: GO:0006917
relation: contributes_to
source: PMID:8762143
evidence: TAS
creation_date: 20041207
assigned_by: AgBase
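To give a feel for how little code a tag-value format of this kind would need, here is a small reader for stanzas like those above. It is a sketch only: it assumes one tag per line, a stanza header in square brackets, and repeatable tags (such as source), and it is not a parser for the real OBO format. The function name and the returned structure are hypothetical.

```python
def parse_stanzas(text):
    """Parse '[Type]' stanzas of 'tag: value' lines into a list of dicts."""
    stanzas = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            # A new stanza header such as [Annotation] or [Entity].
            current = {"stanza_type": line[1:-1], "tags": {}}
            stanzas.append(current)
        elif current is not None and ": " in line:
            # Split on the first ': ' only, so values like SGD_REF:S000039255
            # keep their own colons.
            tag, value = line.split(": ", 1)
            current["tags"].setdefault(tag, []).append(value)
    return stanzas


example = """
[Annotation]
subject: SGD:S000005370
object: GO:0003963
is_negated: true
evidence: IDA
source: SGD_REF:S000039255
source: PMID:84195322
creation_date: 20020530
assigned_by: SGD
"""
stanzas = parse_stanzas(example)
```

Each tag maps to a list of values so that repeated tags, like the two source lines here, are preserved in order.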
Technical requirements and impact on existing software
For the most part, the changes suggested are simply a split of existing GAF files into two separate files, with a rearrangement of existing columns. It would be fairly easy to write a standalone script that could take in the two files and produce one from them, or vice versa.
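The reverse direction, rebuilding a GAF row from one association record and its gene product record, might look something like the sketch below. All names and dictionary keys are illustrative assumptions. One point worth noting is that Aspect (old column 9) appears in neither proposed file, so this sketch assumes a lookup from GO ID to aspect (aspect_of, hypothetical here) that would have to come from the ontology.

```python
def merge_to_gaf(association, gene_product, aspect_of, spliceform_type=None):
    """Rebuild one 17-column GAF row from the two split records."""
    taxon = gene_product["taxon"]
    interacting = association.get("Interacting_taxon_ID", "")
    if interacting:
        # Old col 13 holds both taxa, separated by a pipe.
        taxon = taxon + "|" + interacting
    return [
        association["DB"],
        association["DB_Object_ID"],
        gene_product["DB_Object_Symbol"],
        association.get("Qualifier", ""),
        association["GO_ID"],
        association["DB:Reference"],
        association["Evidence_code"],
        association.get("With_or_From", ""),
        aspect_of[association["GO_ID"]],  # not stored in either file
        gene_product.get("DB_Object_Name", ""),
        gene_product.get("DB_Object_Synonym", ""),
        # Col 12 refers to the spliceform when one is given.
        spliceform_type or gene_product["DB_Object_Type"],
        taxon,
        association["Date"],
        association["Assigned_by"],
        association.get("Annotation_cross_products", ""),
        association.get("Spliceform_ID", ""),
    ]


association = {
    "DB": "SGD", "DB_Object_ID": "S000000296", "GO_ID": "GO:0003993",
    "DB:Reference": "SGD_REF:S000047763", "Evidence_code": "IMP",
    "Date": "20010118", "Assigned_by": "SGD",
}
gene_product = {
    "DB": "SGD", "DB_Object_ID": "S000000296", "DB_Object_Symbol": "PHO3",
    "DB_Object_Name": "acid phosphatase", "DB_Object_Synonym": "YBR092C",
    "DB_Object_Type": "gene", "taxon": "taxon:4932",
}
row = merge_to_gaf(association, gene_product, aspect_of={"GO:0003993": "F"})
gaf_line = "\t".join(row)
```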
GO Database
Some modifications of the database loading scripts would need to be made; the new formats mirror the database structure more closely so it should actually be an improvement.
Groups submitting GO data
Rather than writing out a single GAF file, two will be required: a file containing gene product data, and a file containing annotations. For databases with a GO database-like schema, where annotation information and gene product data are kept in separate tables, this new format will more closely mirror database structure.
Groups using GO data
There are two options here; either we provide files in all formats, or we provide files in one format and supply users with a script to convert the data. The latter would incur less cost in terms of processing time and storage space for the GOC. A gradual transition to using the new format and a period of supplying both old and new files (as was done with the ontology files) is probably the best option.
Any Other Business
What's all this spliceforms / isoforms stuff about?
Please see the documentation on column 17 for more details. Although the proposal for col 17 has been accepted by the GO Consortium, it is not clear how many databases are annotating different isoforms and are hence using col 17.
Comments
GOA: This file is a great idea. We feel that it would greatly benefit our users. However, could we extend the proposed format of this file further, to additionally include optional attribute-value pairs that describe:
1. DB subset (for GOA this would have values either Swiss-Prot or TrEMBL)
2. Target set member (to indicate if a protein has been prioritized by an annotation project, such as Reference Genomes, the Renal or Cardiovascular annotation projects). This would mean curators could move away from having to fill in multiple Reference Genomes google spreadsheets.
3. Annotation Complete: yes/no (annotation data all groups now store such information, but there is no current export mechanism for this data)
(Edimmer 11:27, 26 January 2010 (UTC))