Gene Product Association Data (GPAD) Format (Archived)

Proposal to split the information in the GAF files into two sets, association data and gene product information.

In Brief...

Why?

- allow unannotated gene products to be submitted to the GO database
  - could be useful in estimating the proportion of a genome that has been annotated
  - will also allow users to see that the gene product they are searching for does exist, so they won't spend a long time fruitlessly searching for it [see note below]
- reduce the amount of redundant gene product information in the GAF files
  - every annotation to a gene product repeats the same gene product data, which only needs to be stated once. Removing this repeated information and supplying it in a separate file will make the annotation data files smaller, which would be particularly helpful for huge files like the UniProt releases.

NB: although the gp2protein files may contain IDs of unannotated gene products, this data does not go into the GO database, and it is not available in AmiGO. There is also no information available about gene product name, synonyms, type or taxon of unannotated gene products in the gp2protein file or in the GAF files.

How?

Converting a GAF file into the GP information and annotation data files, and vice versa, would be simple enough that groups submitting data will be able to provide it in either format, and data consumers will be able to download it in GAF or the proposed formats.

See Technical requirements and impact on existing software for more details.

Current Association File Format

Annotation data has a shaded background, gene product information is in blue text, and data required for both has blue text on a shaded background.

column	required?	contents	cardinality
1	required	DB	1
2	required	DB_Object_ID	1
3	required	DB_Object_Symbol	1
4	optional	Qualifier	0 or greater
5	required	GO ID	1
6	required	DB:Reference(s)	1 or greater
7	required	Evidence code	1
8	optional	With (or) From	0 or greater
9	required	Aspect	1
10	optional	DB_Object_Name	0 or 1
11	optional	DB_Object_Synonym(s)	0 or greater
12	required	DB_Object_Type (refers to col 17 if present)	1
13	required	taxon	1 or 2 (for multi-org processes)
14	required	Date	1
15	required	Assigned_by	1
16	optional	Annotation cross products	?
17	optional	Spliceform	0 or 1

Proposed file format

Proposal: remove gene product information from the association data file, leaving just an identifier.

Association Data

new format for storing annotations:

contents	required?	cardinality	old column #	extra info
DB	required	1	1	must be in xrf_abbs
DB_Object_ID	required	1	2
Qualifier	optional	0 or greater	4	'NOT' should not be in this column
(NOT) GO ID	required	1	5	must be extant GO ID, prefixed with NOT for NOT associations
DB:Reference(s)	required	1 or greater	6	DB must be in xrf_abbs
Evidence code	required	1	7	from ECO
With (or) From	optional	0 or greater	8
Interacting taxon ID (for multi-organism processes)	optional	0 or 1	13	ncbi taxon ID
Date	required	1	14	YYYYMMDD
Assigned_by	required	1	15	from xrf_abbs
Annotation Extension (Annotation Cross Products)	optional	0 or greater	16
GP Context?	optional	0 or 1	17 (if present)	to be decided

Note: a transform would need to take place if GAF col 17 is filled in. Further discussion needed to decide where info should go.

Note 2: NOT would be moved to the GO ID column. This makes it clearer when an annotation is negated and makes it much more difficult for 'NOT' to be ignored.
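
As a rough illustration of the column mapping in the table above, the following Python sketch converts one GAF 2.0 row into a row of the proposed association format. The function name gaf_to_association is hypothetical, the single-space separator between NOT and the GO ID is an assumption (the proposal does not specify one), and the evidence code is copied unchanged rather than mapped to ECO.

```python
# Sketch only: column order follows the "Association Data" table above.
# Assumption: the NOT prefix is joined to the GO ID with a single space;
# the proposal does not fix the exact separator.

def gaf_to_association(gaf_cols):
    """Map one GAF 2.0 row (list of up to 17 strings) to a proposed association row."""
    cols = list(gaf_cols) + [""] * (17 - len(gaf_cols))  # pad short rows

    # Split the qualifier column and pull NOT out of it.
    qualifiers = [q for q in cols[3].split("|") if q]
    negated = "NOT" in qualifiers
    qualifiers = [q for q in qualifiers if q != "NOT"]
    go_id = ("NOT " if negated else "") + cols[4]

    # GAF col 13 holds one or two taxon IDs; the second, if present,
    # is the interacting organism for multi-organism processes.
    taxa = cols[12].split("|")
    interacting_taxon = taxa[1] if len(taxa) > 1 else ""

    return [
        cols[0],               # DB (GAF col 1)
        cols[1],               # DB_Object_ID (col 2)
        "|".join(qualifiers),  # Qualifier without NOT (col 4)
        go_id,                 # (NOT) GO ID (col 5)
        cols[5],               # DB:Reference(s) (col 6)
        cols[6],               # Evidence code (col 7), copied as-is, no ECO mapping
        cols[7],               # With (or) From (col 8)
        interacting_taxon,     # second taxon from col 13, if any
        cols[13],              # Date (col 14)
        cols[14],              # Assigned_by (col 15)
        cols[15],              # Annotation Extension (col 16)
        cols[16],              # GP context (col 17); final placement still undecided
    ]


row = ("SGD\tS000005370\tRCL1\tNOT\tGO:0003963\tSGD_REF:S000039255\tIDA\t\tF\t"
       "aminodeoxychorismate synthase\tYOL010W\tgene\ttaxon:4932\t20020530\tSGD").split("\t")
print("\t".join(gaf_to_association(row)))
```

Gene product columns (symbol, name, synonyms, type) and the aspect are simply dropped here; only the identifier pair DB and DB_Object_ID ties the row back to the gene product.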

Gene Product Information

Gene product information would be stored in a separate file. It would consist of the following pieces of information -- see the gene product information file format for an in-depth view.

contents	required?	cardinality	GAF 2.0 col #	extra info
DB	required	1	1	in xrf_abbs
DB Object ID	required	1	2
DB Object Type	required	1	12	need a controlled vocab (SO + GO complex?)
Taxon	required	1	13
DB Object Symbol	required	1	3
DB Object Name	optional	0 or 1	10
DB Object Synonym(s)	optional	0 or greater	11
Parent GP ID	blank unless GP is an isoform (see next table)	0	n/a	protein - list gene; complex component - list complex ID
Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file)	optional	0+	n/a

check PRO for examples

Spliceforms (see GAF Col17 GeneProducts for more about spliceforms) would have their own entries in this file, with the data as follows:

contents	required?	cardinality	GAF 2.0 col #
DB	required	1	1
DB Object ID	required	1	17
DB Object Type	required	1	12
Taxon	required	1	13
DB Object Symbol	required	1	3
DB Object Name	optional	0 or 1	10
DB Object Synonym(s)	optional	0 or greater	11
Parent GP ID	required	1	2
Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file)	optional	0+	n/a

Multiple entries in the xrefs col should be pipe-separated.

Ideally, the gene product files would also include the gp2protein data, so we would have an additional piece of data, an xref to a UniProt or NCBI ID.
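
The two tables above could be populated from existing GAF rows along the following lines. This is a minimal Python sketch, not a reference implementation: the function names are hypothetical, the spliceform ID is assumed to appear in col 17 without a DB prefix (as in the example further down), the Parent GP ID is assumed to be written as DB:ID, and the taxon is kept in its GAF form.

```python
def gaf_to_gp_record(gaf_cols):
    """Build one gene product information row from a GAF 2.0 row (list of strings)."""
    cols = list(gaf_cols) + [""] * (17 - len(gaf_cols))  # pad short rows
    db, gaf_object_id, spliceform = cols[0], cols[1], cols[16]
    # The first taxon in col 13 is the gene product's own organism.
    taxon = cols[12].split("|")[0]

    if spliceform:
        # Col 17 is filled in: the row describes the spliceform, and the
        # GAF col 2 identifier becomes its parent (assumed DB:ID form).
        object_id = spliceform
        parent = f"{db}:{gaf_object_id}"
    else:
        object_id = gaf_object_id
        parent = ""

    # DB, ID, type (col 12), taxon, symbol (col 3), name (col 10),
    # synonyms (col 11), parent GP ID, xrefs (not available from a GAF row).
    return [db, object_id, cols[11], taxon, cols[2], cols[9], cols[10], parent, ""]


def collect_gp_records(gaf_rows):
    """State each gene product once, however many annotations it has."""
    seen = {}
    for row in gaf_rows:
        rec = gaf_to_gp_record(row)
        seen.setdefault((rec[0], rec[1]), rec)  # keep the first occurrence
    return list(seen.values())


rows = [
    "SGD\tS000000296\tPHO3\t\tGO:0003993\tSGD_REF:S000047763\tIMP\t\tF\tacid phosphatase\tYBR092C\tgene\ttaxon:4932\t20010118\tSGD".split("\t"),
    "SGD\tS000000296\tPHO3\t\tGO:0006796\tSGD_REF:S000047115\tTAS\t\tP\tacid phosphatase\tYBR092C\tgene\ttaxon:4932\t20041220\tSGD".split("\t"),
    "UniProtKB\tQ4VCS5\tAMOT_HUMAN\t\tGO:0043116\tPMID:16043488\tIDA\t\tP\tAMOT, KIAA1071: Angiomotin\tIPI00163085\tsnoRNA\ttaxon:9606\t20051207\tUniProtKB\t\tQ4VCS5-1".split("\t"),
]
for record in collect_gp_records(rows):
    print("\t".join(record))
```

Note that a GAF row with col 17 filled in says nothing about the type of the parent gene product, so a parent entry can only be produced from rows that annotate the parent directly.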

Example

Old GAF 1.0 Format

The following appears on the page http://geneontology.org/GO.annotation.fields.shtml and is an example of the current GAF file structure (shaded bg: annotation data; blue text: gp data):

1 DB	2 DB Object ID	3 DB Object Symbol	4 Qualifier	5 GO ID	6 DB:Reference(s)	7 Evidence code	8 With (or) From	9 Aspect	10 DB Object Name	11 DB Object Synonym(s)	12 DB Object Type (refers to col 17 if present)	13 taxon	14 Date	15 Assigned by	17 Spliceform
SGD	S000000296	PHO3		GO:0003993	SGD_REF:S000047763	IMP		F	acid phosphatase	YBR092C	gene	taxon:4932	20010118	SGD
SGD	S000000296	PHO3		GO:0006796	SGD_REF:S000047115	TAS		P	acid phosphatase	YBR092C	gene	taxon:4932	20041220	SGD
SGD	S000005370	RCL1	NOT	GO:0003963	SGD_REF:S000039255	IDA		F	aminodeoxychorismate synthase	YOL010W	gene	taxon:4932	20020530	SGD
SGD	S000005370	RCL1		GO:0006406	SGD_REF:S000069956	IC	GO:0000346	P	aminodeoxychorismate synthase	YOL010W	gene	taxon:4932|taxon:745953	20030221	SGD
SGD	S000005370	RCL1		GO:0046820	SGD_REF:S000057703	ISS	CGSC:pabA	F	aminodeoxychorismate synthase	YOL010W	gene	taxon:4932|taxon:2861	20030106	SGD
UniProtKB	Q4VCS5	AMOT_HUMAN		GO:0031410	PMID:11257124	IDA		C	AMOT, KIAA1071: Angiomotin	IPI00163085	protein	taxon:9606	20051207	UniProtKB
UniProtKB	Q4VCS5	AMOT_HUMAN		GO:0043532	PMID:11257124	IDA		F	AMOT, KIAA1071: Angiomotin	IPI00163085	protein	taxon:9606	20051207	UniProtKB
UniProtKB	Q4VCS5	AMOT_HUMAN		GO:0043116	PMID:16043488	IDA		P	AMOT, KIAA1071: Angiomotin	IPI00163085	snoRNA	taxon:9606	20051207	UniProtKB	Q4VCS5-1
UniProtKB	Q4VCS5	AMOT_HUMAN		GO:0005515	PMID:16043488	IPI	UniProtKB:Q6RHR9-2	F	AMOT, KIAA1071: Angiomotin	IPI00163085	snoRNA	taxon:9606	20051207	UniProtKB	Q4VCS5-1
UniProtKB	Q4VCS5	AMOT_HUMAN		GO:0043532	PMID:16043488	IDA		F	AMOT, KIAA1071: Angiomotin	IPI00163085	protein	taxon:9606	20051207	UniProtKB	Q4VCS5-2

Proposed new format

This is how it could look in the proposed new format.

Association data:

DB	DB Object ID	Qualifier	GO ID	DB:Reference(s)	Evidence code	With (or) From	Interacting taxon ID (for multi-organism processes)	Date	Assigned_by	Spliceform ID (if applicable)
SGD	S000000296		GO:0003993	SGD_REF:S000047763	IMP			20010118	SGD
SGD	S000000296		GO:0006796	SGD_REF:S000047115	TAS			20041220	SGD
SGD	S000005370	NOT	GO:0003963	SGD_REF:S000039255	IDA			20020530	SGD
SGD	S000005370		GO:0006406	SGD_REF:S000069956	IC	GO:0000346	taxon:745953	20030221	SGD
SGD	S000005370		GO:0046820	SGD_REF:S000057703	ISS	CGSC:pabA	taxon:2861	20030106	SGD
UniProtKB	Q4VCS5		GO:0031410	PMID:11257124	IDA			20051207	UniProtKB
UniProtKB	Q4VCS5		GO:0043532	PMID:11257124	IDA			20051207	UniProtKB
UniProtKB	Q4VCS5		GO:0043116	PMID:16043488	IDA			20051207	UniProtKB	Q4VCS5-1
UniProtKB	Q4VCS5		GO:0005515	PMID:16043488	IPI	UniProtKB:Q6RHR9-2		20051207	UniProtKB	Q4VCS5-1
UniProtKB	Q4VCS5		GO:0043532	PMID:16043488	IDA			20051207	UniProtKB	Q4VCS5-2

Gene Product Information (including possible data from gp2protein file) -- see the gene product information file format for an in-depth view.

DB	DB_Object_ID	DB_Object_Type	Taxon	DB Object Symbol	DB Object Name	DB Object Synonym(s)	Parent GP ID	Xrefs in other DBs
SGD	S000000296	gene	4932	PHO3	acid phosphatase	YBR092C		UniProt:NE92D8
SGD	S000005370	gene	4932	RCL1	aminodeoxychorismate synthase	YOL010W		UniProt:JN97D8
UniProtKB	Q4VCS5	protein	9606	AMOT_HUMAN	AMOT, KIAA1071: Angiomotin	KIAA1071
UniProtKB	Q4VCS5-1	snoRNA	9606	AMOT_HUMAN	Isoform 1 of Angiomotin		UniProtKB:Q4VCS5
UniProtKB	Q4VCS5-2	protein	9606	AMOT_HUMAN	Isoform 2 of Angiomotin		UniProtKB:Q4VCS5

Technical requirements and impact on existing software

For the most part, the changes suggested are simply a split of existing GAF files into two separate files, with a rearrangement of existing columns. It would be fairly easy to write a standalone script that could take in the two files and produce one from them, or vice versa.
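
A sketch of the reverse direction (two files back into one GAF) is given below in Python, assuming the column layouts used in the two example tables above. The function merge_to_gaf and the aspect_of lookup are hypothetical: the aspect (GAF col 9) is stored in neither proposed file, so it has to come from the ontology. The sketch also assumes NOT is still carried in the Qualifier column, as in the example; a NOT prefix on the GO ID would have to be stripped before the lookup.

```python
def merge_to_gaf(assoc_rows, gp_rows, aspect_of):
    """Rebuild GAF 2.0 rows from association rows and gene product rows.

    assoc_rows: rows laid out as in the association example above (up to 11 columns).
    gp_rows:    9-column rows laid out as in the gene product example above.
    aspect_of:  dict mapping GO ID -> 'F', 'P' or 'C' (hypothetical lookup).
    """
    gp_index = {(r[0], r[1]): r for r in gp_rows}
    gaf = []
    for row in assoc_rows:
        a = list(row) + [""] * (11 - len(row))  # pad rows without a spliceform
        db, obj_id, spliceform = a[0], a[1], a[10]
        gp = gp_index[(db, obj_id)]  # KeyError if the gene product is missing
        # GAF col 12 describes the spliceform when col 17 is filled in.
        typed = gp_index.get((db, spliceform), gp) if spliceform else gp
        # The example stores a bare taxon number; GAF wants the taxon: prefix.
        taxon = "taxon:" + gp[3]
        if a[7]:
            taxon += "|" + a[7]  # interacting taxon goes second in col 13
        gaf.append([
            db, obj_id, gp[4], a[2], a[3], a[4], a[5], a[6],
            aspect_of[a[3]], gp[5], gp[6], typed[2], taxon,
            a[8], a[9], "", spliceform,  # col 16 is absent from the example layout
        ])
    return gaf


assoc = [["SGD", "S000005370", "", "GO:0006406", "SGD_REF:S000069956", "IC",
          "GO:0000346", "taxon:745953", "20030221", "SGD"]]
gps = [["SGD", "S000005370", "gene", "4932", "RCL1",
        "aminodeoxychorismate synthase", "YOL010W", "", "UniProt:JN97D8"]]
for gaf_row in merge_to_gaf(assoc, gps, {"GO:0006406": "P"}):
    print("\t".join(gaf_row))
```

The join key is the (DB, DB Object ID) pair, so the merge only works if every identifier used in the association file also has an entry in the gene product file.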

GO Database

Some modifications of the database loading scripts would need to be made; the new formats mirror the database structure more closely so it should actually be an improvement.

Groups submitting GO data

Rather than writing out a single GAF file, two will be required: a file containing gene product data, and a file containing annotations. For databases with a GO database-like schema, where annotation information and gene product data are kept in separate tables, this new format will more closely mirror database structure.

Groups using GO data

There are two options here: either we provide files in all formats, or we provide files in one format and supply users with a script to convert the data. The latter would incur less cost in terms of processing time and storage space for the GOC. A gradual transition to the new format, with a period of supplying both old and new files (as was done with the ontology files), is probably the best option.

Any Other Business

What's all this spliceforms / isoforms stuff about?

Please see the documentation on column 17 for more details. Although the proposal for col 17 has been accepted by the GO Consortium, it is not clear how many databases are annotating different isoforms and are hence using col 17.

Comments

GOA: This file is a great idea. We feel that it would greatly benefit our users. However, could we extend the proposed format of this file further, to additionally include optional attribute-value pairs that describe:

1. DB subset (for GOA this would have values either Swiss-Prot or TrEMBL)

2. Target set member (to indicate if a protein has been prioritized by an annotation project, such as Reference Genomes, the Renal or Cardiovascular annotation projects). This would mean curators could move away from having to fill in multiple Reference Genomes Google spreadsheets.

3. Annotation Complete: yes/no (all annotation groups now store this information, but there is currently no export mechanism for it)

(Edimmer 11:27, 26 January 2010 (UTC))

The UniProt gp_association and gp_information files

Since June 2010 GOA has been supplying the UniProt annotation set in gp_association and gp_information files, in addition to the (GAF2.0 format) gene_association file; they can be found at gp_association.goa_uniprot.gz and gp_information.goa_uniprot.gz.

The format of these files is fully documented in gp_association_readme and gp_information_readme, but, in summary, the columns present in each of the files are as follows:

gp_association

column	name	required?	cardinality	GAF column
01	DB	required	1	1
02	DB_Object_ID	required	1	2
03	Qualifier	optional	0 or greater	4
04	GO ID	required	1	5
05	DB:Reference(s)	required	1 or greater	6
06	Evidence code	required	1	7
07	With	optional	0 or greater	8
08	Extra taxon ID	optional	0 or 1	13
09	Date	required	1	14
10	Assigned_by	required	1	15
11	Annotation Extension	optional	0 or greater	16
12	Spliceform ID	optional	0 or 1	17
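
The required/optional and cardinality columns of this table translate fairly directly into a line check. The Python sketch below is illustrative only: the names are hypothetical, and it assumes that multiple values within a column are pipe-separated, as in GAF files; the gp_association_readme remains the authority on the actual format.

```python
# (name, required, multi-valued) for the 12 gp_association columns listed above.
GP_ASSOCIATION_COLUMNS = [
    ("DB", True, False),
    ("DB_Object_ID", True, False),
    ("Qualifier", False, True),
    ("GO ID", True, False),
    ("DB:Reference(s)", True, True),
    ("Evidence code", True, False),
    ("With", False, True),
    ("Extra taxon ID", False, False),
    ("Date", True, False),
    ("Assigned_by", True, False),
    ("Annotation Extension", False, True),
    ("Spliceform ID", False, False),
]


def check_gp_association_line(line):
    """Return a list of problems found in one tab-separated gp_association line."""
    values = line.rstrip("\n").split("\t")
    problems = []
    if len(values) > len(GP_ASSOCIATION_COLUMNS):
        problems.append("too many columns")
    # Trailing optional columns may be missing; treat them as empty.
    values += [""] * (len(GP_ASSOCIATION_COLUMNS) - len(values))
    for (name, required, multi), value in zip(GP_ASSOCIATION_COLUMNS, values):
        if required and not value:
            problems.append(f"{name}: required but empty")
        if not multi and "|" in value:  # assumption: pipe separates values
            problems.append(f"{name}: at most one value allowed")
    return problems


good = ("UniProtKB\tQ4VCS5\t\tGO:0005515\tPMID:16043488\tIPI\t"
        "UniProtKB:Q6RHR9-2\t\t20051207\tUniProtKB\t\tQ4VCS5-1")
print(check_gp_association_line(good))
```

The check covers only presence and cardinality; it does not verify that identifiers, evidence codes or dates are themselves valid.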

gp_information

column	name	required?	cardinality	GAF column	Example content
01	DB	required	1	1	UniProtKB
02	DB_Subset	optional	0 or 1	-	Swiss-Prot or TrEMBL
03	DB_Object_ID	required	1	2	Q4VCS5
04	DB_Object_Symbol	required	1	3	AMOT
05	DB_Object_Name	optional	0 or 1	10	Angiomotin
06	DB_Object_Synonym(s)	optional	0 or greater	11	KIAA1071|IPI:IPI00163085|IPI:IPI00644547|UniProtKB:AMOT_HUMAN
07	DB_Object_Type	required	1	12	protein
08	Taxon	required	1	13	taxon:9606
09	Annotation_Target_Set	optional	0 or greater	-	KRUK|Reference Genome
10	Annotation_Completed	optional	0 or 1	-	timestamp (YYYYMMDD)
11	Parent_Object_ID	optional	0 or 1	-	UniProtKB:P21677
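
For readers who want to consume the file, a line of gp_information could be read into named fields roughly as follows. This Python sketch uses hypothetical names and takes its sample line from the "Example content" column above; it assumes tab-separated columns with pipe-separated values in the multi-valued columns, and does not handle any header or comment lines the real file may contain.

```python
GP_INFORMATION_FIELDS = [
    "DB", "DB_Subset", "DB_Object_ID", "DB_Object_Symbol", "DB_Object_Name",
    "DB_Object_Synonym", "DB_Object_Type", "Taxon",
    "Annotation_Target_Set", "Annotation_Completed", "Parent_Object_ID",
]
# Columns whose cardinality is "0 or greater" in the table above.
MULTI_VALUED = {"DB_Object_Synonym", "Annotation_Target_Set"}


def parse_gp_information_line(line):
    """Turn one tab-separated gp_information line into a dict keyed by column name."""
    values = line.rstrip("\n").split("\t")
    # Trailing optional columns may be missing; treat them as empty.
    values += [""] * (len(GP_INFORMATION_FIELDS) - len(values))
    record = {}
    for name, value in zip(GP_INFORMATION_FIELDS, values):
        if name in MULTI_VALUED:
            record[name] = value.split("|") if value else []
        else:
            record[name] = value or None  # empty optional column -> None
    return record


line = ("UniProtKB\tSwiss-Prot\tQ4VCS5\tAMOT\tAngiomotin\t"
        "KIAA1071|IPI:IPI00163085|IPI:IPI00644547|UniProtKB:AMOT_HUMAN\t"
        "protein\ttaxon:9606\tKRUK|Reference Genome\t\t")
record = parse_gp_information_line(line)
print(record["DB_Object_ID"], record["DB_Object_Synonym"])
```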