Gene Product Association Data (GPAD) Format (Archived): Difference between revisions
Revision as of 14:02, 26 October 2011
GPAD is an alternative means of exchanging annotations. The format is designed to be more normalized than GAF and is intended to work in conjunction with a separate format for exchanging gene product information.
In Brief...
This proposal separates data on genes and gene products (the objects being annotated) from the annotation data. The data related to gene products (symbol, name, synonyms, taxon) can be submitted, updated and maintained separately from the data concerned with annotation: term IDs, evidence codes, references, annotation extension (col 16), and so on.
Additionally, we would like to introduce further flexibility in annotation by using the entire suite of evidence codes available in the Evidence Code Ontology, and to standardize the types of object (gene product types) that can be annotated. A controlled vocabulary has not yet been introduced for the latter.
Why?
- allow unannotated gene products to be submitted to the GO database
  - could be useful in estimating the proportion of a genome that has been annotated
  - will also allow users to see that the GP they are searching for does exist, so they won't spend a long time fruitlessly searching for it [see note below]
- reduce the amount of redundant gene product information in the GAF files
  - every annotation to a gene product repeats the same gene product data; this only needs to be stated once. Removing this repeated information and supplying it in a separate file will mean the annotation data files will be smaller, which would certainly be helpful for huge files like the UniProt releases.
NB: although the gp2protein files may contain IDs of unannotated gene products, this data does not go into the GO database, and it is not available in AmiGO. There is also no information available about gene product name, synonyms, type or taxon of unannotated gene products in the gp2protein file or in the GAF files.
How?
- Converting a GAF file into the GP information and annotation data files, and vice versa, would be simple enough that groups submitting data will be able to provide it in either format, and data consumers will be able to download it in GAF or the proposed formats.
See Technical requirements and impact on existing software for more details.
Current Association File Format
File Header
The gene association file begins with a line declaring the format version as follows:
!gaf-version: 2.0
Any comments or information should be preceded by an exclamation mark, indicating that parsers should ignore that line. An example of a full file header could look like this:
!gaf-version: 2.0
!CVS Version: Revision: 1.134 $
!GOC Validation Date: 08/26/2009 $
!Submission Date: 8/26/2009
!
! The above "Submission Date" is when the annotation project provided
! this file to the Gene Ontology Consortium (GOC). The "GOC Validation
! Date" indicates when this file was last changed as a result of a GOC
! validation and filtering process. The "CVS Version" above is the
! GOC version of this file.
!
!
!Project_name: Schizosaccharomyces pombe GeneDB
!URL: www.genedb.org/genedb/pombe
!Contact Email: val@sanger.ac.uk
!
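The header convention above can be sketched in Python as follows. This is a minimal illustration, not a reference parser: the function name `read_annotation_file` is hypothetical, and the sketch assumes that data rows are tab-delimited and that the version declaration has the form `!<name>-version: <number>`.

```python
import io
import re

# Matches a declaration such as "!gaf-version: 2.0" (assumed form).
VERSION_RE = re.compile(r"^!(\w+)-version:\s*(\S+)")


def read_annotation_file(handle):
    """Return (format_name, version, data_rows) for a '!'-commented file."""
    fmt = version = None
    rows = []
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("!"):
            # Comment line: only the first version declaration is recorded,
            # everything else is ignored, as the text above requires.
            match = VERSION_RE.match(line)
            if match and fmt is None:
                fmt, version = match.group(1), match.group(2)
            continue
        rows.append(line.split("\t"))  # assumption: tab-delimited columns
    return fmt, version, rows


sample = io.StringIO(
    "!gaf-version: 2.0\n"
    "!Project_name: Schizosaccharomyces pombe GeneDB\n"
    "!\n"
    "SGD\tS000000296\tPHO3\n"
)
fmt, version, rows = read_annotation_file(sample)
```

Because the declaration is matched generically, the same sketch would also accept any other header that follows the same `!<name>-version:` pattern.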
File Body
Annotation data has a shaded background, gene product information is in blue text, and data required for both has blue text on a shaded background.
column | required? | contents | cardinality |
---|---|---|---|
1 | required | DB | 1 |
2 | required | DB_Object_ID | 1 |
3 | required | DB_Object_Symbol | 1 |
4 | optional | Qualifier | 0 or greater |
5 | required | GO ID | 1 |
6 | required | DB:Reference(s) | 1 or greater |
7 | required | Evidence code | 1 |
8 | optional | With (or) From | 0 or greater |
9 | required | Aspect | 1 |
10 | optional | DB_Object_Name | 0 or 1 |
11 | optional | DB_Object_Synonym(s) | 0 or greater |
12 | required | DB_Object_Type (refers to col 17 if present) | 1 |
13 | required | taxon | 1 or 2 (for multi-org processes) |
14 | required | Date | 1 |
15 | required | Assigned_by | 1 |
16 | optional | Annotation cross products | ? |
17 | optional | Spliceform | 1 |
Proposed Gene Product Association Data (GPAD) file format
All gene product data barring the ID of the object being annotated is removed from the annotation file.
File Header
The file starts with a line declaring the file format:
!gpad-version: 1.0
Further information or remarks should be prefixed by an exclamation mark. An example of a full file header:
!gpad-version: 1.0
!CVS Version: Revision: 1.134 $
!GOC Validation Date: 08/26/2009 $
!Submission Date: 8/26/2009
!
! The above "Submission Date" is when the annotation project provided
! this file to the Gene Ontology Consortium (GOC). The "GOC Validation
! Date" indicates when this file was last changed as a result of a GOC
! validation and filtering process. The "CVS Version" above is the
! GOC version of this file.
!
!
!Project_name: Schizosaccharomyces pombe GeneDB
!URL: www.genedb.org/genedb/pombe
!Contact Email: val@sanger.ac.uk
!
File Body
contents | required? | cardinality | old column # | extra info |
---|---|---|---|---|
DB | required | 1 | 1 | must be in xrf_abbs |
DB_Object_ID | required | 1 | 2 | |
Qualifier | optional | 0 or greater | 4 | 'NOT' should not be in this column |
(NOT) GO ID | required | 1 | 5 | must be extant GO ID, prefixed with NOT for NOT associations |
DB:Reference(s) | required | 1 or greater | 6 | DB must be in xrf_abbs |
Evidence code | required | 1 | 7 | from ECO |
With (or) From | optional | 0 or greater | 8 | |
Interacting taxon ID (for multi-organism processes) | optional | 0 or 1 | 13 | ncbi taxon ID |
Date | required | 1 | 14 | YYYYMMDD |
Assigned_by | required | 1 | 15 | from xrf_abbs |
Annotation XP (Annotation Cross Products) | optional | 0 or greater | 16 | |
Note: NOT would be moved to the GO ID column. This makes it clearer when an annotation is negated and makes it much more difficult for 'NOT' to be ignored.
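A minimal Python sketch of how one GAF 2.0 row might be rewritten as a GPAD row following the table above. The function name `gaf_to_gpad`, the `GAF_TO_ECO` dictionary and the list-of-strings input are illustrative assumptions, not part of the proposal. The ECO IDs are simply the ones paired with these codes in the example tables on this page, not an authoritative mapping, and the sketch assumes that multiple qualifiers and taxa are pipe-separated.

```python
# Evidence codes limited to those used in this page's examples (assumption).
GAF_TO_ECO = {
    "IMP": "ECO:0000015",
    "TAS": "ECO:0000304",
    "IDA": "ECO:0000002",
    "IC": "ECO:0000305",
    "ISS": "ECO:0000250",
    "IPI": "ECO:0000021",
}


def gaf_to_gpad(gaf):
    """Convert one GAF 2.0 row (list of 17 strings) to an 11-column GPAD row."""
    gaf = list(gaf) + [""] * (17 - len(gaf))  # pad missing optional columns
    qualifiers = [q for q in gaf[3].split("|") if q]
    negated = "NOT" in qualifiers
    # 'NOT' leaves the Qualifier column and becomes a prefix of the GO ID.
    qualifiers = [q for q in qualifiers if q != "NOT"]
    go_id = ("NOT " if negated else "") + gaf[4]
    # Only the second taxon (the interacting organism) stays with the annotation.
    taxa = gaf[12].split("|")
    interacting_taxon = taxa[1] if len(taxa) > 1 else ""
    return [
        gaf[0],                           # DB
        gaf[1],                           # DB_Object_ID
        "|".join(qualifiers),             # Qualifier (without NOT)
        go_id,                            # (NOT) GO ID
        gaf[5],                           # DB:Reference(s)
        GAF_TO_ECO.get(gaf[6], gaf[6]),   # Evidence code; unknown codes pass through
        gaf[7],                           # With (or) From
        interacting_taxon,                # Interacting taxon ID
        gaf[13],                          # Date
        gaf[14],                          # Assigned_by
        gaf[15],                          # Annotation XP
    ]


negated_row = [
    "SGD", "S000005370", "RCL1", "NOT", "GO:0003963", "SGD_REF:S000039255",
    "IDA", "", "F", "aminodeoxychorismate synthase", "YOL010W", "gene",
    "taxon:4932", "20020530", "SGD", "", "",
]
gpad_row = gaf_to_gpad(negated_row)
```

Here `gpad_row[3]` holds `NOT GO:0003963` and the Qualifier column is empty, which is the behaviour the note above describes.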
Additional Data
The addition of an annotation ID could be useful for a number of reasons, including updating, removing or citing annotations, linking different annotations, and so on. There are questions about how and by whom these IDs would be maintained that would need to be answered before introducing such IDs.
Proposed Gene Product Information (GPI) file format
Gene product data is stored separately from annotation data.
File Header
The file starts with a line declaring the file format:
!gpi-version: 1.0
Further information or remarks should be prefixed by an exclamation mark. An example of a full file header:
!gpi-version: 1.0
!CVS Version: Revision: 1.134 $
!GOC Validation Date: 08/26/2009 $
!Submission Date: 8/26/2009
!
!Project_name: Schizosaccharomyces pombe GeneDB
!URL: www.genedb.org/genedb/pombe
!Contact Email: val@sanger.ac.uk
!
File Body
contents | required? | cardinality | GAF 2.0 col # | extra info |
---|---|---|---|---|
DB | required | 1 | 1 | in xrf_abbs |
DB Object ID | required | 1 | 2 | |
DB Object Type | required | 1 | 12 | need a controlled vocab (SO + GO complex? PRO?) |
Taxon | required | 1 | 13 | |
DB Object Symbol | required | 1 | 3 | |
DB Object Name | optional | 0 or 1 | 10 | |
DB Object Synonym(s) | optional | 0 or greater | 11 | |
Parent GP ID | blank unless GP is an isoform (see next table) | 0 | n/a | protein - list gene; complex component - list complex ID |
Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file) | optional | 0+, pipe-separated | n/a | |
- check PRO for examples
- should we have a richer way to represent relationships between genes and the various types of gene product and GP complexes?
Spliceforms (see GAF Col17 GeneProducts for more about spliceforms) would have their own entries in this file, with the data as follows:
contents | required? | cardinality | GAF 2.0 col # |
---|---|---|---|
DB | required | 1 | 1 |
DB Object ID | required | 1 | 17 |
DB Object Type | required | 1 | 12 |
Taxon | required | 1 | 13 |
DB Object Symbol | required | 1 | 3 |
DB Object Name | optional | 0 or 1 | 10 |
DB Object Synonym(s) | optional | 0 or greater | 11 |
Parent GP ID | required | 1 | 2 |
Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file) | optional | 0+, pipe-separated | n/a |
Multiple entries in the xrefs col should be pipe-separated.
Ideally, the gene product files would also include the gp2protein data, so we would have an additional piece of data, an xref to a UniProt or NCBI ID.
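The two GPI tables above can be read as a recipe for deriving gene product entries from GAF rows. The Python sketch below is one possible reading, under stated assumptions: the function name `gpi_entries` is hypothetical, rows are lists of 17 strings, the first taxon in column 13 is taken to be the annotated organism, and the parent ID is written as `DB:ID` (as in the example later on this page). The xrefs column is left empty because GAF carries no such data.

```python
def gpi_entries(gaf_rows):
    """Yield one 9-column GPI row per distinct annotated object."""
    seen = set()
    for gaf in gaf_rows:
        gaf = list(gaf) + [""] * (17 - len(gaf))  # pad missing optional columns
        db, object_id, spliceform = gaf[0], gaf[1], gaf[16]
        taxon = gaf[12].split("|")[0]  # assumption: first taxon is the organism
        if spliceform:
            # Spliceform table: ID comes from column 17, parent from column 2.
            entry_id, parent = spliceform, f"{db}:{object_id}"
        else:
            entry_id, parent = object_id, ""
        key = (db, entry_id)
        if key in seen:
            continue  # gene product data only needs to be stated once
        seen.add(key)
        yield [
            db,        # DB
            entry_id,  # DB Object ID
            gaf[11],   # DB Object Type
            taxon,     # Taxon
            gaf[2],    # DB Object Symbol
            gaf[9],    # DB Object Name
            gaf[10],   # DB Object Synonym(s)
            parent,    # Parent GP ID
            "",        # Xrefs in other DBs (not available from GAF)
        ]


gaf_rows = [
    ["SGD", "S000000296", "PHO3", "", "GO:0003993", "SGD_REF:S000047763", "IMP",
     "", "F", "acid phosphatase", "YBR092C", "gene", "taxon:4932", "20010118",
     "SGD", "", ""],
    ["SGD", "S000000296", "PHO3", "", "GO:0006796", "SGD_REF:S000047115", "TAS",
     "", "P", "acid phosphatase", "YBR092C", "gene", "taxon:4932", "20041220",
     "SGD", "", ""],
    ["UniProtKB", "Q4VCS5", "AMOT_HUMAN", "", "GO:0043532", "PMID:16043488", "IDA",
     "", "F", "AMOT, KIAA1071: Angiomotin", "IPI00163085", "protein", "taxon:9606",
     "20051207", "UniProtKB", "", "Q4VCS5-2"],
]
entries = list(gpi_entries(gaf_rows))
```

The two PHO3 annotations collapse into a single entry, which illustrates the reduction in redundancy that motivates the split.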
Additional Data
Extra information can be included in the GPI files, including whether annotation is complete for a certain GP, and whether that GP belongs to a set prioritized for annotation. See GOA comment below.
Example
Old GAF 2.0 Format
The following appears on the page http://geneontology.org/GO.annotation.fields.shtml and is an example of the current GAF file structure (shaded bg: annotation data; blue text: gp data):
1 DB | 2 DB Object ID | 3 DB Object Symbol | 4 Qualifier | 5 GO ID | 6 DB:Reference(s) | 7 Evidence code | 8 With (or) From | 9 Aspect | 10 DB Object Name | 11 DB Object Synonym(s) | 12 DB Object Type (refers to col 17 if present) | 13 taxon | 14 Date | 15 Assigned by | 16 Annotation cross products | 17 Spliceform |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
SGD | S000000296 | PHO3 | | GO:0003993 | SGD_REF:S000047763 | IMP | | F | acid phosphatase | YBR092C | gene | taxon:4932 | 20010118 | SGD | | |
SGD | S000000296 | PHO3 | | GO:0006796 | SGD_REF:S000047115 | TAS | | P | acid phosphatase | YBR092C | gene | taxon:4932 | 20041220 | SGD | | |
SGD | S000005370 | RCL1 | NOT | GO:0003963 | SGD_REF:S000039255 | IDA | | F | aminodeoxychorismate synthase | YOL010W | gene | taxon:4932 | 20020530 | SGD | | |
SGD | S000005370 | RCL1 | | GO:0006406 | SGD_REF:S000069956 | IC | GO:0000346 | P | aminodeoxychorismate synthase | YOL010W | gene | taxon:4932|taxon:745953 | 20030221 | SGD | | |
SGD | S000005370 | RCL1 | | GO:0046820 | SGD_REF:S000057703 | ISS | CGSC:pabA | F | aminodeoxychorismate synthase | YOL010W | gene | taxon:4932|taxon:2861 | 20030106 | SGD | | |
UniProtKB | Q4VCS5 | AMOT_HUMAN | | GO:0031410 | PMID:11257124 | IDA | | C | AMOT, KIAA1071: Angiomotin | IPI00163085 | protein | taxon:9606 | 20051207 | UniProtKB | | |
UniProtKB | Q4VCS5 | AMOT_HUMAN | | GO:0043532 | PMID:11257124 | IDA | | F | AMOT, KIAA1071: Angiomotin | IPI00163085 | protein | taxon:9606 | 20051207 | UniProtKB | | |
UniProtKB | Q4VCS5 | AMOT_HUMAN | | GO:0043116 | PMID:16043488 | IDA | | P | AMOT, KIAA1071:Angiomotin | IPI00163085 | snoRNA | taxon:9606 | 20051207 | UniProtKB | | Q4VCS5-1 |
UniProtKB | Q4VCS5 | AMOT_HUMAN | | GO:0005515 | PMID:16043488 | IPI | UniProtKB:Q6RHR9-2 | F | AMOT, KIAA1071: Angiomotin | IPI00163085 | snoRNA | taxon:9606 | 20051207 | UniProtKB | | Q4VCS5-1 |
UniProtKB | Q4VCS5 | AMOT_HUMAN | | GO:0043532 | PMID:16043488 | IDA | | F | AMOT, KIAA1071: Angiomotin | IPI00163085 | protein | taxon:9606 | 20051207 | UniProtKB | | Q4VCS5-2 |
Proposed new format
This is how it could look in the proposed new format.
Association data:
DB | DB Object ID | Qualifier | GO ID | DB:Reference(s) | Evidence code | With (or) From | Interacting taxon ID (for multi-organism processes) | Date | Assigned_by | Annotation cross products | Spliceform ID (if applicable) |
---|---|---|---|---|---|---|---|---|---|---|---|
SGD | S000000296 | | GO:0003993 | SGD_REF:S000047763 | ECO:0000015 | | | 20010118 | SGD | | |
SGD | S000000296 | | GO:0006796 | SGD_REF:S000047115 | ECO:0000304 | | | 20041220 | SGD | | |
SGD | S000005370 | | NOT GO:0003963 | SGD_REF:S000039255 | ECO:0000002 | | | 20020530 | SGD | | |
SGD | S000005370 | | GO:0006406 | SGD_REF:S000069956 | ECO:0000305 | GO:0000346 | taxon:745953 | 20030221 | SGD | | |
SGD | S000005370 | | GO:0046820 | SGD_REF:S000057703 | ECO:0000250 | CGSC:pabA | taxon:2861 | 20030106 | SGD | | |
UniProtKB | Q4VCS5 | | GO:0031410 | PMID:11257124 | ECO:0000002 | | | 20051207 | UniProtKB | | |
UniProtKB | Q4VCS5 | | GO:0043532 | PMID:11257124 | ECO:0000002 | | | 20051207 | UniProtKB | | |
UniProtKB | Q4VCS5 | | GO:0043116 | PMID:16043488 | ECO:0000002 | | | 20051207 | UniProtKB | | Q4VCS5-1 |
UniProtKB | Q4VCS5 | | GO:0005515 | PMID:16043488 | ECO:0000021 | UniProtKB:Q6RHR9-2 | | 20051207 | UniProtKB | | Q4VCS5-1 |
UniProtKB | Q4VCS5-2 | | GO:0043532 | PMID:16043488 | ECO:0000002 | | | 20051207 | UniProtKB | | Q4VCS5-2 |
Gene Product Information (including possible data from gp2protein file) -- see the gene product information file format for an in-depth view.
DB | DB_Object_ID | DB_Object_Type | Taxon | DB Object Symbol | DB Object Name | DB Object Synonym(s) | Parent GP ID | Xrefs in other DBs |
---|---|---|---|---|---|---|---|---|
SGD | S000000296 | gene | 4932 | PHO3 | acid phosphatase | YBR092C | | UniProt:NE92D8 |
SGD | S000005370 | gene | 4932 | RCL1 | aminodeoxychorismate synthase | YOL010W | | UniProt:JN97D8 |
UniProtKB | Q4VCS5 | protein | 9606 | AMOT_HUMAN | AMOT, KIAA1071: Angiomotin | KIAA1071 | | |
UniProtKB | Q4VCS5-1 | snoRNA | 9606 | AMOT_HUMAN | Isoform 1 of Angiomotin | | UniProtKB:Q4VCS5 | |
UniProtKB | Q4VCS5-2 | protein | 9606 | AMOT_HUMAN | Isoform 2 of Angiomotin | | UniProtKB:Q4VCS5 | |
Technical requirements and impact on existing software
For the most part, the changes suggested are simply a split of existing GAF files into two separate files, with a rearrangement of existing columns. It would be fairly easy to write a standalone script that could take in the two files and produce one from them, or vice versa.
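As a rough illustration of the "two files in, one out" direction, the Python sketch below joins annotation rows to gene product rows on the (DB, DB Object ID) pair. The function name `join_gpad_gpi` and the dictionary keys are hypothetical, and the column order is assumed to match the proposed GPAD and GPI tables on this page. The result is GAF-like rather than a complete GAF row: Aspect (GAF col 9) is in neither proposed file and would presumably have to be looked up from the ontology, so it is omitted here.

```python
def join_gpad_gpi(gpad_rows, gpi_rows):
    """Attach gene product details to each annotation row."""
    # Index gene products by (DB, DB Object ID) for constant-time lookup.
    products = {(row[0], row[1]): row for row in gpi_rows}
    merged = []
    for ann in gpad_rows:
        key = (ann[0], ann[1])
        if key not in products:
            raise KeyError(f"no gene product entry for {ann[0]}:{ann[1]}")
        gp = products[key]
        go_id = ann[3]
        qualifiers = [q for q in ann[2].split("|") if q]
        if go_id.startswith("NOT "):
            # Move the negation back into the qualifier, as in GAF.
            go_id = go_id[len("NOT "):]
            qualifiers.insert(0, "NOT")
        merged.append({
            "db": ann[0],
            "object_id": ann[1],
            "symbol": gp[4],
            "qualifier": "|".join(qualifiers),
            "go_id": go_id,
            "reference": ann[4],
            "evidence": ann[5],
            "with_from": ann[6],
            "name": gp[5],
            "synonyms": gp[6],
            "object_type": gp[2],
            "taxon": gp[3],
            "interacting_taxon": ann[7],
            "date": ann[8],
            "assigned_by": ann[9],
            "annotation_xp": ann[10],
        })
    return merged


gpad_rows = [["SGD", "S000005370", "", "NOT GO:0003963", "SGD_REF:S000039255",
              "ECO:0000002", "", "", "20020530", "SGD", ""]]
gpi_rows = [["SGD", "S000005370", "gene", "4932", "RCL1",
             "aminodeoxychorismate synthase", "YOL010W", "", "UniProt:JN97D8"]]
merged = join_gpad_gpi(gpad_rows, gpi_rows)
```

Raising an error for an annotation without a matching gene product entry is a design choice of this sketch; a real converter might prefer to log and skip such rows.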
GO Database
Some modifications of the database loading scripts would need to be made; the new formats mirror the database structure more closely so it should actually be an improvement.
Groups submitting GO data
Rather than writing out a single GAF file, two will be required: a file containing gene product data, and a file containing annotations. For databases with a GO database-like schema, where annotation information and gene product data are kept in separate tables, this new format will more closely mirror database structure.
Groups using GO data
There are two options here: either we provide files in all formats, or we provide files in one format and supply users with a script to convert the data. The latter would incur less cost in terms of processing time and storage space for the GOC. A gradual transition to the new format, with a period of supplying both old and new files (as was done with the ontology files), is probably the best option.
Any Other Business
What's all this spliceforms / isoforms stuff about?
Please see the documentation on column 17 for more details. Although the proposal for col 17 has been accepted by the GO Consortium, it is not clear how many databases are annotating different isoforms and are hence using col 17.
Comments
GOA: This file is a great idea. We feel that it would greatly benefit our users. However could we extend the proposed format of this file further, to additionally include optional attribute-value pairs that describe:
1. DB subset (for GOA this would have values either Swiss-Prot or TrEMBL)
2. Target set member (to indicate if a protein has been prioritized by an annotation project, such as Reference Genomes, the Renal or Cardiovascular annotation projects). This would mean curators could move away from having to fill in multiple Reference Genomes google spreadsheets.
3. Annotation Complete: yes/no (all annotation groups now store such information, but there is currently no export mechanism for this data)
(Edimmer 11:27, 26 January 2010 (UTC))
The UniProt gp_association and gp_information files
Since June 2010 GOA has been supplying the UniProt annotation set in gp_association and gp_information files, in addition to the (GAF2.0 format) gene_association file; they can be found at gp_association.goa_uniprot.gz and gp_information.goa_uniprot.gz.
The format of these files is fully documented in gp_association_readme and gp_information_readme, but, in summary, the columns present in each of the files are as follows:
gp_association
column | name | required? | cardinality | GAF column |
---|---|---|---|---|
01 | DB | required | 1 | 1 |
02 | DB_Object_ID | required | 1 | 2 |
03 | Qualifier | optional | 0 or greater | 4 |
04 | GO ID | required | 1 | 5 |
05 | DB:Reference(s) | required | 1 or greater | 6 |
06 | Evidence code | required | 1 | 7 |
07 | With | optional | 0 or greater | 8 |
08 | Extra taxon ID | optional | 0 or 1 | 13 |
09 | Date | required | 1 | 14 |
10 | Assigned_by | required | 1 | 15 |
11 | Annotation Extension | optional | 0 or greater | 16 |
12 | Spliceform ID | optional | 0 or 1 | 17 |
gp_information
column | name | required? | cardinality | GAF column | Example content |
---|---|---|---|---|---|
01 | DB | required | 1 | 1 | UniProtKB |
02 | DB_Subset | optional | 0 or 1 | - | Swiss-Prot or TrEMBL |
03 | DB_Object_ID | required | 1 | 2 | Q4VCS5 |
04 | DB_Object_Symbol | required | 1 | 3 | AMOT |
05 | DB_Object_Name | optional | 0 or 1 | 10 | Angiomotin |
06 | DB_Object_Synonym(s) | optional | 0 or greater | 11 | KIAA1071|IPI:IPI00163085|IPI:IPI00644547|UniProtKB:AMOT_HUMAN |
07 | DB_Object_Type | required | 1 | 12 | protein |
08 | Taxon | required | 1 | 13 | taxon:9606 |
09 | Annotation_Target_Set | optional | 0 or greater | - | KRUK|Reference Genome |
10 | Annotation_Completed | optional | 1 | - | timestamp (YYYYMMDD) |
11 | Parent_Object_ID | optional | 0 or 1 | - | UniProtKB:P21677 |
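The gp_information column summary above can be turned into a small record type. The following Python sketch is illustrative only: the names `GpInfo` and `parse_gp_information_line` are hypothetical, and it assumes tab-delimited lines with pipe-separated multi-valued fields. The readme files mentioned above remain the authoritative description of the format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GpInfo:
    db: str
    db_subset: str
    object_id: str
    symbol: str
    name: str
    synonyms: List[str]
    object_type: str
    taxon: str
    target_sets: List[str]
    annotation_completed: str
    parent_object_id: str


def _split(value):
    """Split a pipe-separated field, dropping empty items."""
    return [item for item in value.split("|") if item]


def parse_gp_information_line(line):
    """Parse one data line into a GpInfo record (11 columns, assumed tab-delimited)."""
    f = line.rstrip("\n").split("\t")
    f += [""] * (11 - len(f))  # trailing optional columns may be absent
    return GpInfo(f[0], f[1], f[2], f[3], f[4], _split(f[5]),
                  f[6], f[7], _split(f[8]), f[9], f[10])


# Values taken from the "Example content" column; trailing columns omitted.
line = "\t".join([
    "UniProtKB", "Swiss-Prot", "Q4VCS5", "AMOT", "Angiomotin",
    "KIAA1071|IPI:IPI00163085|IPI:IPI00644547|UniProtKB:AMOT_HUMAN",
    "protein", "taxon:9606", "KRUK|Reference Genome",
])
info = parse_gp_information_line(line)
```

Splitting the synonym and target-set fields into lists keeps the pipe convention out of downstream code, at the cost of having to re-join them when writing the file back out.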