Difference between revisions of "Gene Product Association Data (GPAD) Format (Archived)"

From GO Wiki
m
 
(44 intermediate revisions by 5 users not shown)
Line 1: Line 1:
Proposal to split the information in the GAF files into two sets, association data and gene product data.
+
The Gene Product Association Data (GPAD) format is an alternative means of exchanging annotations. It is designed to be more normalized than GAF, and is intended to work in conjunction with a separate format for exchanging gene product information.
  
The reasons for doing this are as follows:
 
*allow unannotated gene products to be submitted to the GO database (could be useful in estimating the proportion of a genome that has been annotated; will also allow users to see that the GP they are searching for ''does'' exist, so they won't spend a long time fruitlessly searching for it [see note below])
 
*reduce the amount of redundant gene product information in the GAF files; every annotation to a gene product repeats the same gene product data; this only needs to be stated once. Removing this repeated information and supplying it in a separate file will mean the GAF files will be smaller, which would certainly be helpful for huge files like the UniProt releases.
 
  
 +
==In Brief...==
  
NB: although the gp2protein files may contain IDs of unannotated gene products, '''this data does not go into the GO database, and it is not available in AmiGO'''. There is also no information available about gene product name, synonyms, type or taxon of unannotated gene products in the gp2protein file.
+
This proposal separates data on genes and gene products (the objects being annotated) from the annotation data. The data related to gene products (symbol, name, synonyms, taxon) can be submitted, updated and maintained separately from the data concerned with annotation: term IDs, evidence codes, references, annotation XP (col 16), and so on.
 +
 
 +
Additionally, we would like to introduce further flexibility in annotation by using the entire suite of evidence codes available in the [http://www.obofoundry.org/cgi-bin/detail.cgi?id=evidence_code Evidence Code Ontology], and to standardize the types of object (gene product types) that can be annotated. A controlled vocabulary has not yet been introduced for the latter.
 +
 
 +
 
 +
===Why?===
 +
 
 +
*allow unannotated gene products to be submitted to the GO database
 +
** could be useful in estimating the proportion of a genome that has been annotated
 +
** will also allow users to see that the GP they are searching for ''does'' exist, so they won't spend a long time fruitlessly searching for it [see note below]
 +
*reduce the amount of redundant gene product information in the GAF files
 +
**every annotation to a gene product repeats the same gene product data; this only needs to be stated once. Removing this repeated information and supplying it in a separate file will mean the annotation data files will be smaller, which would certainly be helpful for huge files like the UniProt releases.
 +
 
 +
NB: although the gp2protein files may contain IDs of unannotated gene products, '''this data does not go into the GO database, and it is not available in AmiGO'''. There is also no information available about gene product name, synonyms, type or taxon of unannotated gene products in the gp2protein file or in the GAF files.
 +
 
 +
 
 +
===How?===
 +
 
 +
*Converting a GAF file into the GP information and annotation data files, and vice versa, would be simple enough that groups submitting data will be able to provide it in either format, and data consumers will be able to download it in GAF or the proposed formats.
 +
 
 +
See [[#Technical_requirements_and_impact_on_existing_software | Technical requirements and impact on existing software ]] for more details.
  
  
 
==Current Association File Format==
 
==Current Association File Format==
  
Annotation information has a shaded background, gene product data is in blue text, and information required for both has blue text on a shaded background.
+
===File Header===
 +
 
 +
The gene association file begins with a line declaring the format version as follows:
 +
 
 +
!gaf-version: 2.0
 +
 
 +
Any comments or information should be preceded by an exclamation mark, indicating that parsers should ignore that line. An example of a full file header could look like this:
 +
 
 +
!gaf-version: 2.0
 +
!CVS Version: Revision: 1.134 $
 +
!GOC Validation Date: 08/26/2009 $
 +
!Submission Date: 8/26/2009
 +
!
 +
! The above "Submission Date" is when the annotation project provided
 +
! this file to the Gene Ontology Consortium (GOC).  The "GOC Validation
 +
! Date" indicates when this file was last changed as a result of a GOC
 +
! validation and filtering process.  The "CVS Version" above is the
 +
! GOC version of this file.
 +
!
 +
!
 +
!Project_name: Schizosaccharomyces pombe GeneDB
 +
!URL: www.genedb.org/genedb/pombe
 +
!Contact Email: val@sanger.ac.uk
 +
!
 +
 
 +
===File Body===
 +
 
 +
Annotation data has a shaded background, gene product information is in blue text, and data required for both has blue text on a shaded background.
  
 
{| border=1 cell-padding=5
 
{| border=1 cell-padding=5
Line 106: Line 151:
 
|}
 
|}
  
==Proposed file format==
+
==Proposed Gene Product Association Data (GPAD) file format ==
 +
 
 +
All gene product data barring the ID of the object being annotated is removed from the annotation file.
 +
 
 +
===File Header===
 +
 
 +
The file starts with a line declaring the file format:
  
Proposal: remove gene product data from the association file, leaving just an identifier.
+
!gpad-version: 1.0
  
 +
Further information or remarks should be prefixed by an exclamation mark. An example of a full file header:
  
===Associations===
+
!gpad-version: 1.0
 +
!CVS Version: Revision: 1.134 $
 +
!GOC Validation Date: 08/26/2009 $
 +
!Submission Date: 8/26/2009
 +
!
 +
! The above "Submission Date" is when the annotation project provided
 +
! this file to the Gene Ontology Consortium (GOC).  The "GOC Validation
 +
! Date" indicates when this file was last changed as a result of a GOC
 +
! validation and filtering process.  The "CVS Version" above is the
 +
! GOC version of this file.
 +
!
 +
!
 +
!Project_name: Schizosaccharomyces pombe GeneDB
 +
!URL: www.genedb.org/genedb/pombe
 +
!Contact Email: val@sanger.ac.uk
 +
!
  
new format for storing annotations:
+
===File Body===
  
 
{| style="background:#ccffff" border=1 cell-padding=5 cell-spacing=10
 
{| style="background:#ccffff" border=1 cell-padding=5 cell-spacing=10
Line 121: Line 188:
 
! cardinality
 
! cardinality
 
! old column #
 
! old column #
 +
! extra info
 
|- style="color:blue"
 
|- style="color:blue"
| DB || style="color:red" | required || 1 || 1
+
| DB || style="color:red" | required || 1 || 1 || must be in xrf_abbs
 
|- style="color:blue"
 
|- style="color:blue"
| DB_Object_ID || style="color:red" | required || 1 || 2
+
| DB_Object_ID || style="color:red" | required || 1 || 2 ||
 
|-  
 
|-  
| Qualifier || optional || 0 or greater || 4
+
| Qualifier || optional || 0 or greater || 4 || (NOT or integral_to)? (other_organism or colocalizes_with or contributes_to)? annotation_relation
 
|-  
 
|-  
| GO ID || style="color:red" | required || 1 || 5
+
| GO ID || style="color:red" | required || 1 || 5 || must be extant GO ID
 
|-  
 
|-  
| DB:Reference(s) || style="color:red" | required || 1 or greater || 6
+
| DB:Reference(s) || style="color:red" | required || 1 or greater || 6 || DB must be in xrf_abbs
 
|-  
 
|-  
| Evidence code || style="color:red" | required || 1 || 7
+
| Evidence code || style="color:red" | required || 1 || 7 || from ECO
 
|-  
 
|-  
| With (or) From || optional || 0 or greater || 8
+
| With (or) From || optional || 0 or greater || 8 ||
 
|-  
 
|-  
| Interacting taxon ID (for multi-organism processes) || optional || 0 or 1 || 13
+
| Interacting taxon ID (for multi-organism processes) || optional || 0 or 1 || 13 || ncbi taxon ID
 
|-  
 
|-  
| Date || style="color:red" | required || 1 || 14
+
| Date || style="color:red" | required || 1 || 14 || YYYYMMDD
 
|-  
 
|-  
| Assigned_by || style="color:red" | required || 1 || 15
+
| Assigned_by || style="color:red" | required || 1 || 15 || from xrf_abbs
 
|-  
 
|-  
| [[Annotation Cross Products]] || optional || 0 or greater || 16
+
| Annotation XP ([[Annotation Cross Products]]) || optional || 0 or greater || 16 ||  
|- style="color:blue"
 
| Spliceform ID || optional || 0 or 1 || 17 (if present)
 
 
|}
 
|}
  
===Gene Products===
+
Note: NOT would be moved to the GO ID column. This makes it clearer when an annotation is negated and makes it much more difficult for 'NOT' to be ignored.
 +
 
 +
Note II: Unlike GAF 2.0, there is no extra column for spliceforms; the spliceform ID goes directly in the DB_Object_ID. The relation between the spliceform ID and the canonical form is held in the GPI file.
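The first note can be illustrated with a small Python sketch that moves a NOT qualifier from the GAF Qualifier column onto the GO ID field. The helper name <code>move_not_to_go_id</code> is hypothetical, and two details are assumptions rather than part of the proposal text: that multiple GAF qualifiers are pipe-separated, and that the negated form is written as "NOT", a single space, then the GO ID (as in the example row for GO:0003963 in this document).

```python
def move_not_to_go_id(qualifier, go_id):
    """Return (qualifier, go_id_field) with any NOT moved onto the GO ID."""
    # Assumption: multiple GAF qualifiers are pipe-separated,
    # e.g. "NOT|contributes_to".
    parts = [q for q in qualifier.split("|") if q]
    negated = "NOT" in parts
    remaining = "|".join(q for q in parts if q != "NOT")
    # Assumption: "NOT GO:nnnnnnn" with one space, following the
    # example row in this document.
    go_field = f"NOT {go_id}" if negated else go_id
    return remaining, go_field
```

Any qualifier other than NOT stays in the Qualifier column, so a negated <code>contributes_to</code> annotation keeps its relation while the negation travels with the term.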
 +
 
 +
===Additional Data===
 +
 
 +
The addition of an annotation ID could be useful for a number of reasons, including updating, removing or citing annotations, linking different annotations, and so on. There are questions about how and by whom these IDs would be maintained that would need to be answered before introducing such IDs.
 +
 
 +
 
 +
 
 +
==Proposed Gene Product Information (GPI) file format ==
 +
 
 +
Gene product data is stored separately from annotation data.
 +
 
 +
===File Header===
  
Gene product data would be stored in a separate file. It would consist of the following pieces of information -- see the [[Gene_Product_Data_File_Format | gene product data file format]] for an in-depth view.
+
The file starts with a line declaring the file format:
 +
 
 +
!gpi-version: 1.0
 +
 
 +
Further information or remarks should be prefixed by an exclamation mark.
 +
 
 +
It is strongly suggested that the final line of the file header is a tab-separated list of the column contents, as follows:
 +
 
 +
!DB    DB_Object_ID    DB_Object_Type    Taxon    DB_Object_Symbol    DB_Object_Name    DB_Object_Synonym(s)    Parent_GP_ID    DB_Object_Xrefs
 +
 
 +
An example of a full file header:
 +
 
 +
!gpi-version: 1.0
 +
!CVS Version: Revision: 1.134 $
 +
!GOC Validation Date: 08/26/2009 $
 +
!Submission Date: 8/26/2009
 +
!
 +
!Project_name: Schizosaccharomyces pombe GeneDB
 +
!URL: www.genedb.org/genedb/pombe
 +
!Contact Email: val@sanger.ac.uk
 +
!
 +
!DB    DB_Object_ID    DB_Object_Type    Taxon    DB_Object_Symbol    DB_Object_Name    DB_Object_Synonym(s)    Parent_GP_ID    DB_Object_Xrefs
 +
 
 +
===File Body===
  
 
{| border=1 cell-padding=5 style="color:blue"  
 
{| border=1 cell-padding=5 style="color:blue"  
Line 157: Line 260:
 
! cardinality
 
! cardinality
 
! GAF 2.0 col #
 
! GAF 2.0 col #
 +
! extra info
 
|- style="background:#ccffff"
 
|- style="background:#ccffff"
| DB || style="color:red" | required || 1 || 1
+
| DB || style="color:red" | required || 1 || 1 || in xrf_abbs
 
|-style="background:#ccffff"
 
|-style="background:#ccffff"
| DB Object ID || style="color:red" | required || 1 || 2
+
| DB Object ID || style="color:red" | required || 1 || 2 ||
 
|-
 
|-
| DB Object Type || style="color:red" | required || 1 || 12
+
| DB Object Type || style="color:red" | required || 1 || 12 || need a controlled vocab (SO + GO complex? PRO?)
 
|-
 
|-
| Taxon || style="color:red" | required || 1 || 13
+
| Taxon || style="color:red" | required || 1 || 13 ||
 
|-
 
|-
| DB Object Symbol || style="color:red" | required || 1 || 3
+
| DB Object Symbol || style="color:red" | required || 1 || 3 ||
 
|-
 
|-
| DB Object Name || optional || 0 or 1 || 10
+
| DB Object Name || optional || 0 or 1 || 10 ||
 
|-
 
|-
| DB Object Synonym(s) || optional || 0 or greater || 11
+
| DB Object Synonym(s) || optional || 0 or greater || 11 ||
 
|-
 
|-
| Parent GP ID || blank unless GP is an isoform (see next table) || 0 || n/a
+
| Parent GP ID || blank unless GP is an isoform (see next table) || 0 || n/a || protein - list gene; complex component - list complex ID
 
|-
 
|-
| Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file) || optional || 0+ || n/a
+
| Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file) || optional || 0+, pipe-separated || n/a ||
 
|}
 
|}
  
 +
 +
* check PRO for examples
 +
* should we have a richer way to represent relationships between genes and the various types of gene product and GP complexes?
  
 
Spliceforms (see [[ GAF Col17 GeneProducts ]] for more about spliceforms) would have their own entries in this file, with the data as follows:
 
Spliceforms (see [[ GAF Col17 GeneProducts ]] for more about spliceforms) would have their own entries in this file, with the data as follows:
Line 203: Line 310:
 
| Parent GP ID || style="color:red" | required || 1 || 2
 
| Parent GP ID || style="color:red" | required || 1 || 2
 
|-
 
|-
| Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file) || optional || 0+ || n/a
+
| Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file) || optional || 0+, pipe-separated || n/a
 
|}
 
|}
  
 +
 +
Multiple entries in the xrefs col should be pipe-separated.
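As a rough illustration of the GPI body layout, the following Python sketch reads one tab-separated line into a dictionary keyed by the column names from the suggested header line. <code>parse_gpi_line</code> is a hypothetical helper; treating synonyms as pipe-separated (like the xrefs) is an assumption carried over from GAF, and the xref values in the usage example are placeholders, not real identifiers.

```python
GPI_COLUMNS = [
    "DB", "DB_Object_ID", "DB_Object_Type", "Taxon", "DB_Object_Symbol",
    "DB_Object_Name", "DB_Object_Synonym(s)", "Parent_GP_ID", "DB_Object_Xrefs",
]
# Columns that may hold several pipe-separated values.
MULTI_VALUED = ("DB_Object_Synonym(s)", "DB_Object_Xrefs")


def parse_gpi_line(line):
    """Parse one GPI line; return None for '!' header or comment lines."""
    if line.startswith("!"):
        return None
    values = line.rstrip("\n").split("\t")
    # Pad so that trailing optional columns may be omitted.
    values += [""] * (len(GPI_COLUMNS) - len(values))
    record = dict(zip(GPI_COLUMNS, values))
    for key in MULTI_VALUED:
        record[key] = [v for v in record[key].split("|") if v]
    return record


# Usage: an isoform entry whose Parent_GP_ID points at the canonical form.
row = ("UniProtKB\tQ4VCS5-1\tprotein\ttaxon:9606\tAMOT\t"
       "Isoform 1 of Angiomotin\t\tUniProtKB:Q4VCS5\tXDB:1|XDB:2")
isoform = parse_gpi_line(row)
```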
  
 
Ideally, the gene product files would also include the gp2protein data, so we would have an additional piece of data, an xref to a UniProt or NCBI ID.
 
Ideally, the gene product files would also include the gp2protein data, so we would have an additional piece of data, an xref to a UniProt or NCBI ID.
 +
 +
====Additional Data====
 +
 +
Extra information can be included in the GPI files, including whether annotation is complete for a certain GP, and whether that GP belongs to a set prioritized for annotation. See GOA comment below.
  
 
==Example==
 
==Example==
Line 296: Line 409:
 
! Spliceform ID (if applicable)
 
! Spliceform ID (if applicable)
 
|-
 
|-
| style="color:blue" | SGD || style="color:blue" | S000000296 || || GO:0003993 || SGD_REF:S000047763 || IMP || || || 20010118 || SGD || ||
+
| style="color:blue" | SGD || style="color:blue" | S000000296 || || GO:0003993 || SGD_REF:S000047763 || ECO:0000015 || || || 20010118 || SGD || ||
 
|-
 
|-
| style="color:blue" | SGD || style="color:blue" | S000000296 || || GO:0006796 || SGD_REF:S000047115 || TAS || || || 20041220 || SGD || ||
+
| style="color:blue" | SGD || style="color:blue" | S000000296 || || GO:0006796 || SGD_REF:S000047115 || ECO:0000304 || || || 20041220 || SGD || ||
 
|-
 
|-
| style="color:blue" | SGD || style="color:blue" | S000005370 || NOT || GO:0003963 || SGD_REF:S000039255 || IDA || || || 20020530 || SGD || ||
+
| style="color:blue" | SGD || style="color:blue" | S000005370 || || NOT GO:0003963 || SGD_REF:S000039255 || ECO:0000002 || || || 20020530 || SGD || ||
 
|-
 
|-
| style="color:blue" | SGD || style="color:blue" | S000005370 || || GO:0006406 || SGD_REF:S000069956 || IC || GO:0000346 || taxon:745953 || 20030221 || SGD || ||
+
| style="color:blue" | SGD || style="color:blue" | S000005370 || || GO:0006406 || SGD_REF:S000069956 || ECO:0000305 || GO:0000346 || 745953 || 20030221 || SGD || ||
 
|-
 
|-
| style="color:blue" | SGD || style="color:blue" | S000005370 || || GO:0046820 || SGD_REF:S000057703 || ISS || CGSC:pabA || taxon:2861 || 20030106 || SGD || ||
+
| style="color:blue" | SGD || style="color:blue" | S000005370 || || GO:0046820 || SGD_REF:S000057703 || ECO:0000250 || CGSC:pabA || 2861 || 20030106 || SGD || ||
 
|-
 
|-
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0031410 || PMID:11257124 || IDA || || || 20051207 || UniProtKB || ||
+
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0031410 || PMID:11257124 || ECO:0000002 || || || 20051207 || UniProtKB || ||
 
|-
 
|-
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0043532 || PMID:11257124 || IDA || || || 20051207 || UniProtKB || ||
+
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0043532 || PMID:11257124 || ECO:0000002 || || || 20051207 || UniProtKB || ||
 
|-
 
|-
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0043116 || PMID:16043488 || IDA || || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-1
+
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0043116 || PMID:16043488 || ECO:0000002 || || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-1
 
|-
 
|-
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0005515 || PMID:16043488 || IPI || UniProtKB:Q6RHR9-2 || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-1
+
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0005515 || PMID:16043488 || ECO:0000021 || UniProtKB:Q6RHR9-2 || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-1
 
|-
 
|-
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5-2 || || GO:0043532 || PMID:16043488 || IDA || || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-2
+
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5-2 || || GO:0043532 || PMID:16043488 || ECO:0000002 || || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-2
 
|}
 
|}
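The example rows above replace each three-letter GAF evidence code with an ECO identifier. A minimal Python sketch of that substitution follows; the table contains only the six pairs that appear in the example, <code>to_eco</code> is a hypothetical helper name, and the identifiers should be checked against the current Evidence Code Ontology mapping before any real use.

```python
# Pairs taken from the example rows in this document only; this is not
# a complete or authoritative GAF-to-ECO mapping.
GAF_TO_ECO = {
    "IMP": "ECO:0000015",
    "TAS": "ECO:0000304",
    "IDA": "ECO:0000002",
    "IC": "ECO:0000305",
    "ISS": "ECO:0000250",
    "IPI": "ECO:0000021",
}


def to_eco(code):
    """Translate a GAF evidence code, failing loudly on unknown codes."""
    try:
        return GAF_TO_ECO[code]
    except KeyError:
        raise ValueError(f"no ECO mapping known for evidence code {code!r}")
```

Raising on an unknown code, rather than passing it through, keeps unmapped evidence from silently entering a GPAD file.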
  
  
GP data (including possible data from gp2protein file) -- see the [[Gene_Product_Data_File_Format | gene product data file format]] for an in-depth view.
+
Gene Product Information (including possible data from gp2protein file) -- see the [[Gene_Product_Data_File_Format | gene product information file format]] for an in-depth view.
  
 
{| cellspacing="2" border="1" style="color:blue"
 
{| cellspacing="2" border="1" style="color:blue"
Line 343: Line 456:
 
|}
 
|}
  
 +
==Technical requirements and impact on existing software==
  
  
==Reformatting in obo1.3==
+
For the most part, the changes suggested are simply a split of existing GAF files into two separate files, with a rearrangement of existing columns. It would be fairly easy to write a standalone script that could take in the two files and produce one from them, or vice versa.
  
Another option is to abandon the tab-delimited format and go for an obo-like tag-value format.
 
  
===Gene Products===
+
===GO Database===
  
Gene product data for [http://amigo.geneontology.org/cgi-bin/amigo/gp-details.cgi?gp=FB:FBgn0011706 reaper] in [http://geneontology.org/GO.format.obo-1_3.shtml OBO 1.3] (-esque)  syntax:
+
Some modifications of the database loading scripts would need to be made; the new formats mirror the database structure more closely so it should actually be an improvement.
  
id: FB:FBgn0011706
 
symbol: rpr
 
name: reaper
 
type: gene
 
taxon: 7227
 
synonym: anon-WO0162936.19
 
synonym: CG4319
 
synonym: Reaper
 
synonym: Reaper L
 
synonym: rp
 
synonym: RPR
 
xref:  UniProtKB:Q24475 SEQ_XREF [modifier to show that this is a seq xref]
 
  
 +
===Groups submitting GO data===
  
For a gene product with several spliceforms, the information could be represented thus:
+
Rather than writing out a single GAF file, two will be required: a file containing gene product data, and a file containing annotations. For databases with a GO database-like schema, where annotation information and gene product data are kept in separate tables, this new format will more closely mirror database structure.
  
  
id: UniProtKB:Q4VCS5
+
===Groups using GO data===
symbol: AMOT_HUMAN
 
name: AMOT, KIAA1071: Angiomotin
 
type: protein
 
taxon: 9606
 
synonym: KIAA1071
 
 
id: UniProtKB:Q4VCS5-1
 
type: protein
 
name: Isoform 1 of Angiomotin
 
parent: UniProtKB:Q4VCS5
 
 
id: UniProtKB:Q4VCS5-2
 
type: protein
 
name: Isoform 2 of Angiomotin
 
parent: UniProtKB:Q4VCS5
 
  
 +
There are two options here; either we provide files in all formats, or we provide files in one format and supply users with a script to convert the data. The latter would incur less cost in terms of processing time and storage space for the GOC. A gradual transition to using the new format and a period of supplying both old and new files (as was done with the ontology files) is probably the best option.
  
or
 
  
For a gene product with several spliceforms, the information could be represented thus:
+
==Any Other Business==
  
id: UniProtKB:Q4VCS5
+
===What's all this spliceforms / isoforms stuff about?===
symbol: AMOT_HUMAN
 
name: AMOT, KIAA1071: Angiomotin
 
type: protein
 
taxon: 9606
 
synonym: KIAA1071
 
 
id: UniProtKB:Q4VCS5-1
 
type: protein
 
name: Isoform 1 of Angiomotin
 
relationship: isoform_of UniProtKB:Q4VCS5
 
 
id: UniProtKB:Q4VCS5-2
 
type: protein
 
name: Isoform 2 of Angiomotin
 
relationship: isoform_of UniProtKB:Q4VCS5
 
  
===Annotation data===
+
Please see the [[ GAF Col17 GeneProducts | documentation on column 17]] for more details. Although the proposal for col 17 has been accepted by the GO Consortium, it is not clear how many databases are annotating different isoforms and are hence using col 17.
  
  
Example: FB:FBgn0011706 annotated to GO:0035071, ref: PMID:19824712, ev code IC
 
  
[Annotation]
+
====Comments====
subject: FB:FBgn0011706
 
object: GO:0035071
 
source: PMID:19824712
 
evidence: IC
 
creation_date: 20070506
 
assigned_by: FlyBase
 
  
 +
GOA: This file is a great idea. We feel that it would greatly benefit our users. However could we extend the proposed format of this file further, to additionally include optional attribute-value pairs that describe:
  
Example: SGD:S000005370, NOT GO:0003963, refs: SGD_REF:S000039255, PMID:84195322, evcode IDA
 
  
[Annotation]
+
1. DB subset (for GOA this would have values either Swiss-Prot or TrEMBL)
subject: SGD:S000005370
 
object: GO:0003963
 
is_negated: true
 
evidence: IDA
 
source: SGD_REF:S000039255
 
source: PMID:84195322
 
creation_date: 20020530
 
assigned_by: SGD
 
  
 +
2. Target set member (to indicate if a protein has been prioritized by an annotation project, such as Reference Genomes, the Renal or Cardiovascular annotation projects). This would mean curators could move away from having to fill in multiple Reference Genomes google spreadsheets.
  
Example: UniProtKB:Q4VCS5-1 annotated to GO:0005515, ref: PMID:16043488, evcode IPI, with UniProtKB:Q6RHR9-2
+
3. Annotation Complete: yes/no  (annotation data all groups now store such information, but there is no current export mechanism for this data)
  
[Annotation]
+
([[User:Edimmer|Edimmer]] 11:27, 26 January 2010 (UTC))
subject: UniProtKB:Q4VCS5
 
property-value: isoform UniProtKB:Q4VCS5-1
 
object: GO:0005515
 
source: PMID:16043488
 
evidence: IPI
 
xref: UniProtKB:Q6RHR9-2  EVIDENCE  <-- or something to indicate that this is a with/from xref
 
creation_date: 20051207
 
assigned_by: UniProtKB
 
  
 +
==The UniProt gp_association and gp_information files==
  
Example: UniProtKB:H82KBU contributes_to GO:0006917, ref: PMID:8762143, evcode TAS, annotated by AgBase
+
Since June 2010 GOA has been supplying the UniProt annotation set in gp_association and gp_information files, in addition to the (GAF2.0 format) gene_association file; they can be found at [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_association.goa_uniprot.gz gp_association.goa_uniprot.gz] and [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_information.goa_uniprot.gz gp_information.goa_uniprot.gz].
  
[Annotation]
+
The format of these files is fully documented in [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_association_readme gp_association_readme] and [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_information_readme gp_information_readme], but, in summary, the columns present in each of the files are as follows:
subject: UniProtKB:H82KBU
 
object: GO:0006917
 
relation: contributes_to
 
source: PMID:8762143
 
evidence: TAS
 
creation_date: 20041207
 
assigned_by: AgBase
 
  
==Technical requirements and impact on existing software==
+
===gp_association===
  
 +
{| cellspacing="2" border="1"
 +
|-
 +
! column
 +
! name
 +
! required?
 +
! cardinality
 +
! GAF column
 +
|-
 +
| 01 || DB || required || 1 || 1
 +
|-
 +
| 02 || DB_Object_ID || required || 1 || 2
 +
|-
 +
| 03 || Qualifier || optional || 0 or greater || 4
 +
|-
 +
| 04 || GO ID || required || 1 || 5
 +
|-
 +
| 05 || DB:Reference(s) || required || 1 or greater || 6
 +
|-
 +
| 06 || Evidence code || required || 1 || 7
 +
|-
 +
| 07 || With || optional || 0 or greater || 8
 +
|-
 +
| 08 || Extra taxon ID || optional || 0 or 1 || 13
 +
|-
 +
| 09 || Date || required || 1 || 14
 +
|-
 +
| 10 || Assigned_by || required || 1 || 15
 +
|-
 +
| 11 || Annotation XP || optional || 0 or greater || 16
 +
|-
 +
| 12 || Spliceform ID || optional || 0 or 1 || 17
 +
|}
  
For the most part, the changes suggested are simply a split of existing GAF files into two separate files, with a rearrangement of existing columns. It would be fairly easy to write a standalone script that could take in the two files and produce one from them, or vice versa.
+
===gp_information===
 +
{| cellspacing="2" border="1"
 +
|-
 +
! column
 +
! name
 +
! required?
 +
! cardinality
 +
! GAF column
 +
! Example
 +
|-
 +
| 01 || DB || required || 1 || 1 || UniProtKB
 +
|-
 +
| 02 || DB_Subset || optional || 0 or 1 || - || Swiss-Prot or TrEMBL
 +
|-
 +
| 03 || DB_Object_ID || required || 1 || 2 || Q4VCS5
 +
|-
 +
| 04 || DB_Object_Symbol || required || 1 || 3 || AMOT
 +
|-
 +
| 05 || DB_Object_Name || optional || 0 or 1 || 10 || Angiomotin
 +
|-
 +
| 06 || DB_Object_Synonym(s) || optional || 0 or greater || 11 || AMOT|KIAA1071|IPI:IPI00163085|IPI:IPI00644547|UniProtKB:AMOT_HUMAN
 +
|-
 +
| 07 || DB_Object_Type || required || 1 || 12 || protein
 +
|-
 +
| 08 || Taxon || required || 1 || 13 || taxon:9606
 +
|-
 +
| 09 || Annotation_Target_Set || optional || 0 or greater || - || BHF-UCL|KRUK|Reference Genome
 +
|-
 +
| 10 || Annotation_Completed || optional || 1 || - || timestamp (YYYYMMDD)
 +
|-
 +
| 11 || Parent_Object_ID || optional || 0 or 1 || - || UniProtKB:P21677
 +
|-
 +
|}
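To show how the two UniProt files relate back to GAF 2.0, here is a hedged Python sketch that rebuilds one 17-column GAF row from a gp_association row and its matching gp_information row, using the "GAF column" numbers in the two tables above. The function and parameter names are hypothetical. Note that Aspect (GAF column 9) is in neither file, so the sketch assumes the caller supplies a GO ID to aspect lookup derived from the ontology. Appending the extra taxon to the GAF taxon column with a pipe is likewise an assumption.

```python
def to_gaf_row(assoc, info_by_id, aspect_of):
    """Join a 12-column gp_association row with its gp_information row."""
    info = info_by_id[(assoc[0], assoc[1])]  # keyed by (DB, DB_Object_ID)
    taxon = info[7]
    if assoc[7]:
        # Assumption: the interacting taxon follows the organism taxon.
        taxon += "|" + assoc[7]
    return [
        assoc[0], assoc[1], info[3],        # 1 DB, 2 ID, 3 symbol
        assoc[2], assoc[3], assoc[4],       # 4 qualifier, 5 GO ID, 6 reference
        assoc[5], assoc[6],                 # 7 evidence, 8 with
        aspect_of[assoc[3]],                # 9 aspect, looked up from GO ID
        info[4], info[5], info[6], taxon,   # 10 name, 11 synonyms, 12 type, 13 taxon
        assoc[8], assoc[9], assoc[10],      # 14 date, 15 assigned_by, 16 XP
        assoc[11],                          # 17 spliceform
    ]


# Usage with values from this document's examples (aspect supplied by caller).
info = ["UniProtKB", "", "Q4VCS5", "AMOT", "Angiomotin", "AMOT|KIAA1071",
        "protein", "taxon:9606", "", "", ""]
assoc = ["UniProtKB", "Q4VCS5", "", "GO:0031410", "PMID:11257124", "IDA",
         "", "", "20051207", "UniProtKB", "", ""]
gaf = to_gaf_row(assoc, {("UniProtKB", "Q4VCS5"): info}, {"GO:0031410": "C"})
```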
  
 +
== JSON Serializations ==
  
 +
=== GPAD JSON ===
  
===GO Database===
+
Document Format:
  
Some modifications of the database loading scripts would need to be made; the new formats mirror the database structure more closely so it should actually be an improvement.
+
  { METADATA-KEYVAL-PAIR*,
 +
    annotations: [ ANN[1], ..., ANN[n] ] }
  
 +
Annotation format:
  
===Groups submitting GO data===
+
Each annotation is an associative array, with keys as defined below
  
Rather than writing out a single GAF file, two will be required: a file containing gene product data, and a file containing annotations. For databases with a GO database-like schema, where annotation information and gene product data are kept in separate tables, this new format will more closely mirror database structure.
+
{| style="background:#ccffff" border=1 cell-padding=5 cell-spacing=10
 +
|-
 +
! contents
 +
! required?
 +
! cardinality
 +
|- style="color:blue"
 +
| DB || style="color:red" | required || 1
 +
|- style="color:blue"
 +
| DB_Object_ID || style="color:red" | required || 1
 +
|-
 +
| Qualifier || optional || 0 or greater
 +
|-
 +
| Ontology ID || style="color:red" | required || 1
 +
|-
 +
| Reference || style="color:red" | required || 1 or greater
 +
|-
 +
| Evidence_type || style="color:red" | required || 1
 +
|-
 +
| With || optional || 0 or greater
 +
|-
 +
| Interacting_taxon_ID || optional || 0 or 1
 +
|-
 +
| Date || style="color:red" | required || 1
 +
|-
 +
| Assigned_by || style="color:red" | required || 1
 +
|-  
 +
| Annotation_extension || optional || 0 or greater
 +
|}
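A possible reading of the document and annotation formats above, sketched in Python. The exact key spellings (taken verbatim from the "contents" column), the use of lists for fields with cardinality greater than 1, and the <code>gpad-version</code> metadata key are all assumptions; the page does not fix them. The annotation values come from one of the example rows in this document.

```python
import json

# Keys marked "required" in the table above.
REQUIRED_KEYS = ["DB", "DB_Object_ID", "Ontology ID", "Reference",
                 "Evidence_type", "Date", "Assigned_by"]


def gpad_json_document(metadata, annotations):
    """Serialize metadata key-value pairs plus a list of annotation dicts."""
    annotations = list(annotations)
    for ann in annotations:
        missing = [k for k in REQUIRED_KEYS if k not in ann]
        if missing:
            raise ValueError(f"annotation lacks required keys: {missing}")
    doc = dict(metadata)
    doc["annotations"] = annotations
    return json.dumps(doc, indent=2)


example = {
    "DB": "SGD",
    "DB_Object_ID": "S000005370",
    "Qualifier": [],
    "Ontology ID": "GO:0006406",
    "Reference": ["SGD_REF:S000069956"],
    "Evidence_type": "ECO:0000305",
    "With": ["GO:0000346"],
    "Interacting_taxon_ID": "745953",
    "Date": "20030221",
    "Assigned_by": "SGD",
    "Annotation_extension": [],
}
text = gpad_json_document({"gpad-version": "1.0"}, [example])
```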
  
 +
=== GPI JSON ===
  
===Groups using GO data===
+
TODO
  
There are two options here; either we provide files in all formats, or we provide files in one format and supply users with a script to convert the data. The latter would incur less cost in terms of processing time and storage space for the GOC. A gradual transition to using the new format and a period of supplying both old and new files (as was done with the ontology files) is probably the best option.
+
== Meetings ==
  
 +
*[[Oct 2012 Meeting to finalize GPAD specification]]
  
==Any Other Business==
 
  
===What's all this spliceforms / isoforms stuff about?===
+
[[Category:Archived]]
 
 
Please see the [[ GAF Col17 GeneProducts | documentation on column 17]] for more details. Although the proposal for col 17 has been accepted by the GO Consortium, it is not clear how many databases are annotating different isoforms and are hence using col 17.
 
 
 
 
 
[[Category:GAF]] [[Category:Annotation]]
 
 
 
====Comments====
 
 
 
GOA: This file is a great idea. We feel that it would greatly benefit our users. However could we extend the proposed format of this file further, to additionally include optional attribute-value pairs that describe:
 
 
 
 
 
1. DB subset (for GOA this would have values either Swiss-Prot or TrEMBL)
 
 
 
2. Target set member (to indicate if a protein has been prioritized by an annotation project, such as Reference Genomes, the Renal or Cardiovascular annotation projects). This would mean curators could move away from having to fill in multiple Reference Genomes google spreadsheets.
 
 
 
3. Annotation Complete: yes/no  (annotation data all groups now store such information, but there is no current export mechanism for this data)
 
 
 
([[User:Edimmer|Edimmer]] 11:27, 26 January 2010 (UTC))
 

Latest revision as of 08:37, 12 April 2019

The Gene Product Association Data (GPAD) format is an alternative means of exchanging annotations. It is designed to be more normalized than GAF, and is intended to work in conjunction with a separate format for exchanging gene product information.


In Brief...

This proposal separates data on genes and gene products (the objects being annotated) from the annotation data. The data related to gene products (symbol, name, synonyms, taxon) can be submitted, updated and maintained separately from the data concerned with annotation: term IDs, evidence codes, references, annotation XP (col 16), and so on.

Additionally, we would like to introduce further flexibility in annotation by using the entire suite of evidence codes available in the Evidence Code Ontology, and to standardize the types of object (gene product types) that can be annotated. A controlled vocabulary has not yet been introduced for the latter.


Why?

  • allow unannotated gene products to be submitted to the GO database
    • could be useful in estimating the proportion of a genome that has been annotated
    • will also allow users to see that the GP they are searching for does exist, so they won't spend a long time fruitlessly searching for it [see note below]
  • reduce the amount of redundant gene product information in the GAF files
    • every annotation to a gene product repeats the same gene product data; this only needs to be stated once. Removing this repeated information and supplying it in a separate file will mean the annotation data files will be smaller, which would certainly be helpful for huge files like the UniProt releases.

NB: although the gp2protein files may contain IDs of unannotated gene products, this data does not go into the GO database, and it is not available in AmiGO. There is also no information available about gene product name, synonyms, type or taxon of unannotated gene products in the gp2protein file or in the GAF files.


How?

  • Converting a GAF file into the GP information and annotation data files, and vice versa, would be simple enough that groups submitting data will be able to provide it in either format, and data consumers will be able to download it in GAF or the proposed formats.

See Technical requirements and impact on existing software for more details.


Current Association File Format

File Header

The gene association file begins with a line declaring the format version as follows:

!gaf-version: 2.0

Any comments or information should be preceded by an exclamation mark, indicating that parsers should ignore that line. An example of a full file header could look like this:

!gaf-version: 2.0
!CVS Version: Revision: 1.134 $
!GOC Validation Date: 08/26/2009 $
!Submission Date: 8/26/2009
!
! The above "Submission Date" is when the annotation project provided
! this file to the Gene Ontology Consortium (GOC).  The "GOC Validation
! Date" indicates when this file was last changed as a result of a GOC
! validation and filtering process.  The "CVS Version" above is the
! GOC version of this file.
!
!
!Project_name: Schizosaccharomyces pombe GeneDB
!URL: www.genedb.org/genedb/pombe
!Contact Email: val@sanger.ac.uk
!
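
The header rules above (a version declaration, then any number of lines starting with an exclamation mark) can be sketched in a few lines of Python. This is only an illustration: the function name split_header and the three-part return value are assumptions made for the example, not part of any GO tool.

```python
def split_header(lines):
    """Separate '!' comment lines from data lines and extract the format version."""
    version = None
    comments, body = [], []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("!"):
            comments.append(line)
            # The version declaration looks like '!gaf-version: 2.0'.
            if line.startswith("!gaf-version:"):
                version = line.split(":", 1)[1].strip()
        elif line:
            # Anything that is not a comment and not blank is treated as data.
            body.append(line)
    return version, comments, body


sample = [
    "!gaf-version: 2.0\n",
    "!Project_name: Schizosaccharomyces pombe GeneDB\n",
    "!\n",
    "SGD\tS000000296\tPHO3\n",
]
version, comments, body = split_header(sample)
```

The same approach would apply to the GPAD and GPI headers described later on this page by checking for a different version prefix.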

File Body

Annotation data has a shaded background, gene product information is in blue text, and data required for both has blue text on a shaded background.

column  required?  contents                                      cardinality
1       required   DB                                            1
2       required   DB_Object_ID                                  1
3       required   DB_Object_Symbol                              1
4       optional   Qualifier                                     0 or greater
5       required   GO ID                                         1
6       required   DB:Reference(s)                               1 or greater
7       required   Evidence code                                 1
8       optional   With (or) From                                0 or greater
9       required   Aspect                                        1
10      optional   DB_Object_Name                                0 or 1
11      optional   DB_Object_Synonym(s)                          0 or greater
12      required   DB_Object_Type (refers to col 17 if present)  1
13      required   taxon                                         1 or 2 (for multi-org processes)
14      required   Date                                          1
15      required   Assigned_by                                   1
16      optional   Annotation cross products                     ?
17      optional   Spliceform                                    1
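
As an illustration of the column layout above, the following Python sketch turns one tab-separated GAF 2.0 line into a dictionary keyed by column name. The key spellings, the function name parse_gaf_line, and the choice to split multi-valued columns on the pipe character are assumptions for the example (the pipe is taken from the taxon values shown further down this page); the placement of the name and synonym values in the sample row is likewise a guess, since empty cells are not visible in the rendered examples.

```python
GAF_COLUMNS = [
    "DB", "DB_Object_ID", "DB_Object_Symbol", "Qualifier", "GO_ID",
    "DB_Reference", "Evidence_code", "With_From", "Aspect",
    "DB_Object_Name", "DB_Object_Synonym", "DB_Object_Type", "Taxon",
    "Date", "Assigned_by", "Annotation_XP", "Spliceform",
]
# Columns whose cardinality in the table can exceed 1 (assumed pipe-separated).
MULTI_VALUED = {"Qualifier", "DB_Reference", "With_From", "DB_Object_Synonym", "Taxon"}


def parse_gaf_line(line):
    """Map one GAF 2.0 data line onto the 17 named columns."""
    fields = line.rstrip("\n").split("\t")
    # Pad short lines so that trailing optional columns read as empty.
    fields += [""] * (len(GAF_COLUMNS) - len(fields))
    row = {}
    for name, value in zip(GAF_COLUMNS, fields):
        if name in MULTI_VALUED:
            row[name] = value.split("|") if value else []
        else:
            row[name] = value
    return row


line = ("SGD\tS000005370\tRCL1\t\tGO:0006406\tSGD_REF:S000069956\tIC\t"
        "GO:0000346\tP\taminodeoxychorismate synthase\tYOL010W\tgene\t"
        "taxon:4932|taxon:745953\t20030221\tSGD\t\t")
row = parse_gaf_line(line)
```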

Proposed Gene Product Association Data (GPAD) file format

All gene product data barring the ID of the object being annotated is removed from the annotation file.

File Header

The file starts with a line declaring the file format:

!gpad-version: 1.0

Further information or remarks should be prefixed by an exclamation mark. An example of a full file header:

!gpad-version: 1.0
!CVS Version: Revision: 1.134 $
!GOC Validation Date: 08/26/2009 $
!Submission Date: 8/26/2009
!
! The above "Submission Date" is when the annotation project provided
! this file to the Gene Ontology Consortium (GOC).  The "GOC Validation
! Date" indicates when this file was last changed as a result of a GOC
! validation and filtering process.  The "CVS Version" above is the
! GOC version of this file.
!
!
!Project_name: Schizosaccharomyces pombe GeneDB
!URL: www.genedb.org/genedb/pombe
!Contact Email: val@sanger.ac.uk
!

File Body

contents                                             required?  cardinality   old column #  extra info
DB                                                   required   1             1             must be in xrf_abbs
DB_Object_ID                                         required   1             2
Qualifier                                            optional   0 or greater  4             (NOT or integral_to)? (other_organism or colocalizes_with or contributes_to)? annotation_relation
GO ID                                                required   1             5             must be extant GO ID
DB:Reference(s)                                      required   1 or greater  6             DB must be in xrf_abbs
Evidence code                                        required   1             7             from ECO
With (or) From                                       optional   0 or greater  8
Interacting taxon ID (for multi-organism processes)  optional   0 or 1        13            ncbi taxon ID
Date                                                 required   1             14            YYYYMMDD
Assigned_by                                          required   1             15            from xrf_abbs
Annotation XP (Annotation Cross Products)            optional   0 or greater  16

Note: NOT would be moved to the GO ID column. This makes it clearer when an annotation is negated and makes it much more difficult for 'NOT' to be ignored.

Note II: Unlike GAF 2.0, there is no extra column for spliceforms; the spliceform ID goes directly in the DB_Object_ID. The relation between the spliceform ID and the canonical form is held in the GPI file.
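
The constraints in the table above lend themselves to a simple row check. The Python sketch below is one possible reading, not an official validator: the function name check_gpad_row and the key spellings are made up for the example, the pipe is assumed as the separator for multiple values, and checks that need external resources (xrf_abbs membership, whether a GO ID is extant) are left out.

```python
import re
from datetime import datetime

GPAD_COLUMNS = [
    "DB", "DB_Object_ID", "Qualifier", "GO_ID", "DB_Reference",
    "Evidence_code", "With_From", "Interacting_taxon_ID", "Date",
    "Assigned_by", "Annotation_XP",
]
REQUIRED = {"DB", "DB_Object_ID", "GO_ID", "DB_Reference",
            "Evidence_code", "Date", "Assigned_by"}


def check_gpad_row(line):
    """Return a list of problems found in one tab-separated GPAD line."""
    fields = line.rstrip("\n").split("\t")
    fields += [""] * (len(GPAD_COLUMNS) - len(fields))
    row = dict(zip(GPAD_COLUMNS, fields))
    problems = [f"missing required column: {name}"
                for name in GPAD_COLUMNS if name in REQUIRED and not row[name]]
    # The table says evidence codes come from ECO.
    if row["Evidence_code"] and not row["Evidence_code"].startswith("ECO:"):
        problems.append("evidence code is not an ECO ID")
    # Cardinality 0 or 1: more than one value (assumed pipe-separated) is an error.
    if "|" in row["Interacting_taxon_ID"]:
        problems.append("more than one interacting taxon ID")
    if row["Date"]:
        try:
            # Require exactly eight digits, then check it is a real calendar date.
            if not re.fullmatch(r"\d{8}", row["Date"]):
                raise ValueError(row["Date"])
            datetime.strptime(row["Date"], "%Y%m%d")
        except ValueError:
            problems.append("date is not a valid YYYYMMDD value")
    return problems


good = ("SGD\tS000005370\t\tGO:0006406\tSGD_REF:S000069956\tECO:0000305\t"
        "GO:0000346\t745953\t20030221\tSGD\t")
bad = "\tS000005370\t\tGO:0006406\tSGD_REF:S000069956\tIC\t\t\t2003-02-21\tSGD\t"
good_problems = check_gpad_row(good)
bad_problems = check_gpad_row(bad)
```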

Additional Data

The addition of an annotation ID could be useful for a number of reasons, including updating, removing or citing annotations, linking different annotations, and so on. There are questions about how and by whom these IDs would be maintained that would need to be answered before introducing such IDs.


Proposed Gene Product Information (GPI) file format

Gene product data is stored separately from annotation data.

File Header

The file starts with a line declaring the file format:

!gpi-version: 1.0

Further information or remarks should be prefixed by an exclamation mark.

It is strongly suggested that the final line of the file header is a tab-separated list of the column contents, as follows:

!DB    DB_Object_ID    DB_Object_Type    Taxon    DB_Object_Symbol    DB_Object_Name    DB_Object_Synonym(s)    Parent_GP_ID    DB_Object_Xrefs

An example of a full file header:

!gpi-version: 1.0
!CVS Version: Revision: 1.134 $
!GOC Validation Date: 08/26/2009 $
!Submission Date: 8/26/2009
!
!Project_name: Schizosaccharomyces pombe GeneDB
!URL: www.genedb.org/genedb/pombe
!Contact Email: val@sanger.ac.uk
!
!DB    DB_Object_ID    DB_Object_Type    Taxon    DB_Object_Symbol    DB_Object_Name    DB_Object_Synonym(s)    Parent_GP_ID    DB_Object_Xrefs
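
One reason to include the column list as the final header line is that a reader can pick up the column names from the file itself. A minimal Python sketch of that idea follows; the function name read_gpi is hypothetical, and the code assumes the header line is tab-separated as recommended above (the spaces shown on this page are a rendering of the tabs).

```python
def read_gpi(lines):
    """Read GPI lines into dictionaries keyed by the names in the '!DB ...' header line."""
    columns = None
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("!"):
            candidate = line[1:].split("\t")
            # Only a tab-separated line starting with 'DB' is taken as the column list.
            if candidate[0] == "DB" and len(candidate) > 1:
                columns = candidate
            continue
        if not line:
            continue
        if columns is None:
            raise ValueError("no column header line found before the data")
        values = line.split("\t")
        values += [""] * (len(columns) - len(values))
        records.append(dict(zip(columns, values)))
    return records


header = "!" + "\t".join([
    "DB", "DB_Object_ID", "DB_Object_Type", "Taxon", "DB_Object_Symbol",
    "DB_Object_Name", "DB_Object_Synonym(s)", "Parent_GP_ID", "DB_Object_Xrefs",
])
sample = [
    "!gpi-version: 1.0\n",
    header + "\n",
    "SGD\tS000000296\tgene\t4932\tPHO3\tacid phosphatase\tYBR092C\t\tUniProt:NE92D8\n",
]
records = read_gpi(sample)
```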

File Body

contents                                                         required?                                       cardinality         GAF 2.0 col #  extra info
DB                                                               required                                        1                   1              in xrf_abbs
DB Object ID                                                     required                                        1                   2
DB Object Type                                                   required                                        1                   12             need a controlled vocab (SO + GO complex? PRO?)
Taxon                                                            required                                        1                   13
DB Object Symbol                                                 required                                        1                   3
DB Object Name                                                   optional                                        0 or 1              10
DB Object Synonym(s)                                             optional                                        0 or greater        11
Parent GP ID                                                     blank unless GP is an isoform (see next table)  0                   n/a            protein - list gene; complex component - list complex ID
Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file)  optional                                        0+, pipe-separated  n/a


  • check PRO for examples
  • should we have a richer way to represent relationships between genes and the various types of gene product and GP complexes?

Spliceforms (see GAF Col17 GeneProducts for more about spliceforms) would have their own entries in this file, with the data as follows:

contents                                                         required?  cardinality         GAF 2.0 col #
DB                                                               required   1                   1
DB Object ID                                                     required   1                   17
DB Object Type                                                   required   1                   12
Taxon                                                            required   1                   13
DB Object Symbol                                                 required   1                   3
DB Object Name                                                   optional   0 or 1              10
DB Object Synonym(s)                                             optional   0 or greater        11
Parent GP ID                                                     required   1                   2
Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file)  optional   0+, pipe-separated  n/a


Multiple entries in the xrefs col should be pipe-separated.

Ideally, the gene product files would also include the gp2protein data, so we would have an additional piece of data, an xref to a UniProt or NCBI ID.

Additional Data

Extra information can be included in the GPI files, including whether annotation is complete for a certain GP, and whether that GP belongs to a set prioritized for annotation. See GOA comment below.

Example

Old GAF 2.0 Format

The following appears on the page http://geneontology.org/GO.annotation.fields.shtml and is an example of the current GAF file structure (shaded bg: annotation data; blue text: gp data):

Columns: 1 DB; 2 DB Object ID; 3 DB Object Symbol; 4 Qualifier; 5 GO ID; 6 DB:Reference(s); 7 Evidence code; 8 With (or) From; 9 Aspect; 10 DB Object Name; 11 DB Object Synonym(s); 12 DB Object Type (refers to col 17 if present); 13 taxon; 14 Date; 15 Assigned by; 16 Annotation cross products; 17 Spliceform

SGD S000000296 PHO3 GO:0003993 SGD_REF:S000047763 IMP F acid phosphatase YBR092C gene taxon:4932 20010118 SGD
SGD S000000296 PHO3 GO:0006796 SGD_REF:S000047115 TAS P acid phosphatase YBR092C gene taxon:4932 20041220 SGD
SGD S000005370 RCL1 NOT GO:0003963 SGD_REF:S000039255 IDA F aminodeoxychorismate synthase YOL010W gene taxon:4932 20020530 SGD
SGD S000005370 RCL1 GO:0006406 SGD_REF:S000069956 IC GO:0000346 P aminodeoxychorismate synthase YOL010W gene taxon:4932|taxon:745953 20030221 SGD
SGD S000005370 RCL1 GO:0046820 SGD_REF:S000057703 ISS CGSC:pabA F aminodeoxychorismate synthase YOL010W gene taxon:4932|taxon:2861 20030106 SGD
UniProtKB Q4VCS5 AMOT_HUMAN GO:0031410 PMID:11257124 IDA C AMOT, KIAA1071: Angiomotin IPI00163085 protein taxon:9606 20051207 UniProtKB
UniProtKB Q4VCS5 AMOT_HUMAN GO:0043532 PMID:11257124 IDA F AMOT, KIAA1071: Angiomotin IPI00163085 protein taxon:9606 20051207 UniProtKB
UniProtKB Q4VCS5 AMOT_HUMAN GO:0043116 PMID:16043488 IDA P AMOT, KIAA1071:Angiomotin IPI00163085 snoRNA taxon:9606 20051207 UniProtKB Q4VCS5-1
UniProtKB Q4VCS5 AMOT_HUMAN GO:0005515 PMID:16043488 IPI UniProtKB:Q6RHR9-2 F AMOT, KIAA1071: Angiomotin IPI00163085 snoRNA taxon:9606 20051207 UniProtKB Q4VCS5-1
UniProtKB Q4VCS5 AMOT_HUMAN GO:0043532 PMID:16043488 IDA F AMOT, KIAA1071: Angiomotin IPI00163085 protein taxon:9606 20051207 UniProtKB Q4VCS5-2


Proposed new format

This is how it could look in the proposed new format.

Association data:

DB DB Object ID Qualifier GO ID DB:Reference(s) Evidence code With (or) From Interacting taxon ID (for multi-organism processes) Date Assigned_by Annotation cross products Spliceform ID (if applicable)
SGD S000000296 GO:0003993 SGD_REF:S000047763 ECO:0000015 20010118 SGD
SGD S000000296 GO:0006796 SGD_REF:S000047115 ECO:0000304 20041220 SGD
SGD S000005370 NOT GO:0003963 SGD_REF:S000039255 ECO:0000002 20020530 SGD
SGD S000005370 GO:0006406 SGD_REF:S000069956 ECO:0000305 GO:0000346 745953 20030221 SGD
SGD S000005370 GO:0046820 SGD_REF:S000057703 ECO:0000250 CGSC:pabA 2861 20030106 SGD
UniProtKB Q4VCS5 GO:0031410 PMID:11257124 ECO:0000002 20051207 UniProtKB
UniProtKB Q4VCS5 GO:0043532 PMID:11257124 ECO:0000002 20051207 UniProtKB
UniProtKB Q4VCS5 GO:0043116 PMID:16043488 ECO:0000002 20051207 UniProtKB Q4VCS5-1
UniProtKB Q4VCS5 GO:0005515 PMID:16043488 ECO:0000021 UniProtKB:Q6RHR9-2 20051207 UniProtKB Q4VCS5-1
UniProtKB Q4VCS5-2 GO:0043532 PMID:16043488 ECO:0000002 20051207 UniProtKB Q4VCS5-2


Gene Product Information (including possible data from gp2protein file) -- see the gene product information file format for an in-depth view.

DB DB_Object_ID DB_Object_Type Taxon DB Object Symbol DB Object Name DB Object Synonym(s) Parent GP ID Xrefs in other DBs
SGD S000000296 gene 4932 PHO3 acid phosphatase YBR092C UniProt:NE92D8
SGD S000005370 gene 4932 RCL1 aminodeoxychorismate synthase YOL010W UniProt:JN97D8
UniProtKB Q4VCS5 protein 9606 AMOT_HUMAN AMOT, KIAA1071: Angiomotin KIAA1071
UniProtKB Q4VCS5-1 snoRNA 9606 AMOT_HUMAN Isoform 1 of Angiomotin UniProtKB:Q4VCS5
UniProtKB Q4VCS5-2 protein 9606 AMOT_HUMAN Isoform 2 of Angiomotin UniProtKB:Q4VCS5

Technical requirements and impact on existing software

For the most part, the changes suggested are simply a split of existing GAF files into two separate files, with a rearrangement of existing columns. It would be fairly easy to write a standalone script that could take in the two files and produce one from them, or vice versa.
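
To give a feel for how small such a script could be, here is a Python sketch of the GAF to GPAD/GPI direction. It is an illustration under several assumptions rather than a reference converter: the function names are invented, the evidence code mapping contains only the six codes that appear in the example rows on this page (with the ECO IDs used there), spliceform IDs are moved into DB_Object_ID as described in Note II, and the Parent GP ID is written as DB:ID following the example GPI rows.

```python
# Evidence code to ECO ID, copied from the example rows on this page (not a complete mapping).
GAF_TO_ECO = {
    "IMP": "ECO:0000015", "TAS": "ECO:0000304", "IDA": "ECO:0000002",
    "IC": "ECO:0000305", "ISS": "ECO:0000250", "IPI": "ECO:0000021",
}


def split_gaf_row(fields):
    """Turn the 17 values of one GAF 2.0 row into a GPAD row and a GPI row."""
    (db, obj_id, symbol, qualifier, go_id, refs, evidence, with_from, _aspect,
     name, synonyms, obj_type, taxon, date, assigned_by, xp, spliceform) = fields
    # Column 13 holds the organism's taxon, optionally followed by an interacting taxon.
    taxa = [t.replace("taxon:", "") for t in taxon.split("|")]
    interacting = taxa[1] if len(taxa) > 1 else ""
    # If a spliceform is given, it becomes the annotated object and gets its own GPI entry.
    annotated_id = spliceform or obj_id
    parent = f"{db}:{obj_id}" if spliceform else ""
    gpad = [db, annotated_id, qualifier, go_id, refs,
            GAF_TO_ECO.get(evidence, evidence),  # unknown codes pass through unchanged
            with_from, interacting, date, assigned_by, xp]
    gpi = [db, annotated_id, obj_type, taxa[0], symbol, name, synonyms, parent, ""]
    return gpad, gpi


def split_gaf(lines):
    """Split GAF data lines into GPAD rows and de-duplicated GPI rows."""
    gpad_rows, gpi_rows = [], {}
    for line in lines:
        if line.startswith("!") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        fields += [""] * (17 - len(fields))
        gpad, gpi = split_gaf_row(fields[:17])
        gpad_rows.append(gpad)
        # Gene product data is stated once per (DB, ID), however many annotations exist.
        gpi_rows.setdefault((gpi[0], gpi[1]), gpi)
    return gpad_rows, list(gpi_rows.values())


gaf = [
    "!gaf-version: 2.0\n",
    "SGD\tS000000296\tPHO3\t\tGO:0003993\tSGD_REF:S000047763\tIMP\t\tF\t"
    "acid phosphatase\tYBR092C\tgene\ttaxon:4932\t20010118\tSGD\t\t\n",
    "SGD\tS000000296\tPHO3\t\tGO:0006796\tSGD_REF:S000047115\tTAS\t\tP\t"
    "acid phosphatase\tYBR092C\tgene\ttaxon:4932\t20041220\tSGD\t\t\n",
]
gpad_rows, gpi_rows = split_gaf(gaf)
```

In this example the two PHO3 annotations yield two GPAD rows but a single GPI row, which is the redundancy reduction the proposal is after.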


GO Database

Some modifications of the database loading scripts would need to be made; the new formats mirror the database structure more closely so it should actually be an improvement.


Groups submitting GO data

Rather than writing out a single GAF file, two will be required: a file containing gene product data, and a file containing annotations. For databases with a GO database-like schema, where annotation information and gene product data are kept in separate tables, this new format will more closely mirror database structure.


Groups using GO data

There are two options here; either we provide files in all formats, or we provide files in one format and supply users with a script to convert the data. The latter would incur less cost in terms of processing time and storage space for the GOC. A gradual transition to using the new format and a period of supplying both old and new files (as was done with the ontology files) is probably the best option.
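
A conversion script for data consumers would mostly be a join on (DB, DB_Object_ID). The Python sketch below shows the idea under stated assumptions: the function name merge_to_gaf and the aspect_of parameter are hypothetical, rows are plain lists in the column order of the proposed formats, isoform handling (GAF col 17) is ignored, and evidence is left as an ECO ID. Note that the Aspect column (GAF col 9) is in neither proposed file, so it has to be looked up from the ontology; the sketch takes that lookup as a caller-supplied function.

```python
def merge_to_gaf(gpad_rows, gpi_rows, aspect_of=lambda go_id: ""):
    """Rebuild GAF-style 17-column rows from GPAD rows plus GPI rows.

    aspect_of is a caller-supplied lookup from GO ID to aspect (F, P or C);
    by default the Aspect column is left empty.
    """
    # Index gene product records by (DB, DB_Object_ID) for the join.
    gp_index = {(gp[0], gp[1]): gp for gp in gpi_rows}
    gaf_rows = []
    for row in gpad_rows:
        (db, obj_id, qualifier, go_id, refs, evidence,
         with_from, interacting, date, assigned_by, xp) = row
        # A KeyError here means the annotation refers to an unknown gene product.
        _, _, obj_type, taxon, symbol, name, synonyms, _parent, _xrefs = gp_index[(db, obj_id)]
        taxon_col = "taxon:" + taxon
        if interacting:
            taxon_col += "|taxon:" + interacting
        gaf_rows.append([
            db, obj_id, symbol, qualifier, go_id, refs, evidence, with_from,
            aspect_of(go_id), name, synonyms, obj_type, taxon_col,
            date, assigned_by, xp, "",
        ])
    return gaf_rows


gpad_rows = [["SGD", "S000005370", "", "GO:0006406", "SGD_REF:S000069956",
              "ECO:0000305", "GO:0000346", "745953", "20030221", "SGD", ""]]
gpi_rows = [["SGD", "S000005370", "gene", "4932", "RCL1",
             "aminodeoxychorismate synthase", "YOL010W", "", "UniProt:JN97D8"]]
gaf_rows = merge_to_gaf(gpad_rows, gpi_rows, aspect_of=lambda go_id: "P")
```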


Any Other Business

What's all this spliceforms / isoforms stuff about?

Please see the documentation on column 17 for more details. Although the proposal for col 17 has been accepted by the GO Consortium, it is not clear how many databases are annotating different isoforms and are hence using col 17.


Comments

GOA: This file is a great idea. We feel that it would greatly benefit our users. However could we extend the proposed format of this file further, to additionally include optional attribute-value pairs that describe:


1. DB subset (for GOA this would have values either Swiss-Prot or TrEMBL)

2. Target set member (to indicate if a protein has been prioritized by an annotation project, such as Reference Genomes, the Renal or Cardiovascular annotation projects). This would mean curators could move away from having to fill in multiple Reference Genomes google spreadsheets.

3. Annotation Complete: yes/no (all annotation groups now store such information, but there is currently no export mechanism for this data)

(Edimmer 11:27, 26 January 2010 (UTC))

The UniProt gp_association and gp_information files

Since June 2010 GOA has been supplying the UniProt annotation set in gp_association and gp_information files, in addition to the (GAF2.0 format) gene_association file; they can be found at gp_association.goa_uniprot.gz and gp_information.goa_uniprot.gz.

The format of these files is fully documented in gp_association_readme and gp_information_readme, but, in summary, the columns present in each of the files are as follows:

gp_association

column  name             required?  cardinality   GAF column
01      DB               required   1             1
02      DB_Object_ID     required   1             2
03      Qualifier        optional   0 or greater  4
04      GO ID            required   1             5
05      DB:Reference(s)  required   1 or greater  6
06      Evidence code    required   1             7
07      With             optional   0 or greater  8
08      Extra taxon ID   optional   0 or 1        13
09      Date             required   1             14
10      Assigned_by      required   1             15
11      Annotation XP    optional   0 or greater  16
12      Spliceform ID    optional   0 or 1        17

gp_information

column  name                   required?  cardinality   GAF column  Example
01      DB                     required   1             1           UniProtKB
02      DB_Subset              optional   0 or 1        -           Swiss-Prot or TrEMBL
03      DB_Object_ID           required   1             2           Q4VCS5
04      DB_Object_Symbol       required   1             3           AMOT
05      DB_Object_Name         optional   0 or 1        10          Angiomotin
06      DB_Object_Synonym(s)   optional   0 or greater  11          KIAA1071|IPI:IPI00163085|IPI:IPI00644547|UniProtKB:AMOT_HUMAN
07      DB_Object_Type         required   1             12          protein
08      Taxon                  required   1             13          taxon:9606
09      Annotation_Target_Set  optional   0 or greater  -           KRUK|Reference Genome
10      Annotation_Completed   optional   1             -           timestamp (YYYYMMDD)
11      Parent_Object_ID       optional   0 or 1        -           UniProtKB:P21677
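
As a rough illustration of the gp_information layout summarized above, the Python sketch below reads one row into a dictionary and splits the two multi-valued columns on the pipe character shown in the Example column. The function name and key spellings are assumptions for the example, the file is assumed to be tab-separated, and the readme files remain the authoritative description of the format.

```python
GP_INFO_COLUMNS = [
    "DB", "DB_Subset", "DB_Object_ID", "DB_Object_Symbol", "DB_Object_Name",
    "DB_Object_Synonym", "DB_Object_Type", "Taxon", "Annotation_Target_Set",
    "Annotation_Completed", "Parent_Object_ID",
]
# Columns with cardinality '0 or greater'; values are pipe-separated in the examples.
LIST_COLUMNS = {"DB_Object_Synonym", "Annotation_Target_Set"}


def parse_gp_information(line):
    """Map one gp_information data line onto the 11 named columns."""
    fields = line.rstrip("\n").split("\t")
    fields += [""] * (len(GP_INFO_COLUMNS) - len(fields))
    record = {}
    for name, value in zip(GP_INFO_COLUMNS, fields):
        if name in LIST_COLUMNS:
            record[name] = value.split("|") if value else []
        else:
            record[name] = value
    return record


# Row assembled from the Example column above; the last two columns are left empty.
line = ("UniProtKB\tSwiss-Prot\tQ4VCS5\tAMOT\tAngiomotin\t"
        "KIAA1071|IPI:IPI00163085|IPI:IPI00644547|UniProtKB:AMOT_HUMAN\t"
        "protein\ttaxon:9606\tKRUK|Reference Genome\t\t")
record = parse_gp_information(line)
```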

JSON Serializations

GPAD JSON

Document Format:

 { METADATA-KEYVAL-PAIR*,
   annotations: [ ANN[1], ..., ANN[n] ] }

Annotation format:

Each annotation is an associative array, with keys as defined below

contents              required?  cardinality
DB                    required   1
DB_Object_ID          required   1
Qualifier             optional   0 or greater
Ontology ID           required   1
Reference             required   1 or greater
Evidence_type         required   1
With                  optional   0 or greater
Interacting_taxon_ID  optional   0 or 1
Date                  required   1
Assigned_by           required   1
Annotation_extension  optional   0 or greater
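
To make the document format concrete, the following Python sketch builds one such document with the standard json module. Several details are assumptions for the purpose of the example: the metadata key "gpad-version", the use of JSON arrays for keys whose cardinality can exceed 1, and omitting optional keys that have no value. The key names themselves are copied from the table above, including "Ontology ID" with a space.

```python
import json


def gpad_to_json(metadata, annotations):
    """Serialize metadata key/value pairs plus a list of annotations as one JSON document."""
    document = dict(metadata)
    document["annotations"] = annotations
    return json.dumps(document, indent=2)


annotation = {
    "DB": "SGD",
    "DB_Object_ID": "S000005370",
    "Qualifier": [],                      # 0 or greater
    "Ontology ID": "GO:0006406",
    "Reference": ["SGD_REF:S000069956"],  # 1 or greater
    "Evidence_type": "ECO:0000305",
    "With": ["GO:0000346"],               # 0 or greater
    "Interacting_taxon_ID": "745953",     # 0 or 1
    "Date": "20030221",
    "Assigned_by": "SGD",
}
text = gpad_to_json({"gpad-version": "1.0"}, [annotation])
parsed = json.loads(text)
```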

GPI JSON

TODO

Meetings