Gene Product Association Data (GPAD) Format (Archived)

Revision as of 13:55, 19 October 2009

This proposal splits the information in the GAF files into two sets: association data and gene product data.

The reasons for doing this are as follows:

Allow unannotated gene products to be submitted to the GO database. This could be useful in estimating the proportion of a genome that has been annotated. It would also let users see that the gene product (GP) they are searching for does exist, so they won't spend a long time fruitlessly searching for it (see note below).
Reduce the amount of redundant gene product information in the GAF files. Every annotation to a gene product repeats the same gene product data, although this only needs to be stated once. Removing the repeated information and supplying it in a separate file would make the GAF files smaller, which would certainly help with huge files like the UniProt releases.

NB: although the gp2protein files may contain IDs of unannotated gene products, this data does not go into the GO database, and it is not available in AmiGO.

Current Association File Format

Annotation information has a shaded background, gene product data is in blue text, and information required for both has blue text on a shaded background.

column	required?	contents	cardinality
1	required	DB	1
2	required	DB_Object_ID	1
3	required	DB_Object_Symbol	1
4	optional	Qualifier	0 or greater
5	required	GO ID	1
6	required	DB:Reference(s)	1 or greater
7	required	Evidence code	1
8	optional	With (or) From	0 or greater
9	required	Aspect	1
10	optional	DB_Object_Name	0 or 1
11	optional	DB_Object_Synonym(s)	0 or greater
12	required	DB_Object_Type (refers to col 17 if present)	1
13	required	taxon	1 or 2 (for multi-org processes)
14	required	Date	1
15	required	Assigned_by	1
16	optional	Annotation cross products	?
17	optional	Spliceform	1

Proposed file format

Proposal: remove gene product data from the association file, leaving just an identifier.

Associations

New format for storing annotations:

old column #	required?	contents	cardinality
1	required	DB	1
17 if present; else 2	required	Spliceform ID OR DB_Object_ID	1
4	optional	Qualifier	0 or greater
5	required	GO ID	1
6	required	DB:Reference(s)	1 or greater
7	required	Evidence code	1
8	optional	With (or) From	0 or greater
9	required	Aspect	1
13	optional	Interacting taxon ID (for multi-organism processes)	0 or 1
14	required	Date	1
15	required	Assigned_by	1
16	optional	Annotation Cross Products	0 or greater
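
The column mapping in the table above can be sketched in a few lines of Python. This is only an illustration, not part of the proposal: the function name `gaf_to_association` is hypothetical, and it assumes that when column 13 holds two taxon IDs separated by a pipe, the second one is the interacting taxon (as in the example rows on this page).

```python
def gaf_to_association(fields):
    """Project one GAF row (a list of column strings) onto the
    proposed association columns. Hypothetical helper."""
    # Pad to 17 columns so the optional trailing columns (16, 17) may be absent.
    cols = list(fields) + [""] * (17 - len(fields))

    # Old column 17 (spliceform) replaces old column 2 when it is present.
    object_id = cols[16] or cols[1]

    # Assumption: the second taxon in old column 13 is the interacting taxon.
    taxa = cols[12].split("|")
    interacting_taxon = taxa[1] if len(taxa) > 1 else ""

    return [
        cols[0],            # 1  DB
        object_id,          # 17 or 2
        cols[3],            # 4  Qualifier
        cols[4],            # 5  GO ID
        cols[5],            # 6  DB:Reference(s)
        cols[6],            # 7  Evidence code
        cols[7],            # 8  With (or) From
        cols[8],            # 9  Aspect
        interacting_taxon,  # 13 Interacting taxon ID
        cols[13],           # 14 Date
        cols[14],           # 15 Assigned_by
        cols[15],           # 16 Annotation Cross Products
    ]


gaf_row = (
    "SGD\tS000005370\tRCL1\t\tGO:0006406\tSGD_REF:S000069956\tIC\tGO:0000346\tP\t"
    "aminodeoxychorismate synthase\tYOL010W\tgene\ttaxon:4932|taxon:745953\t20030221\tSGD"
).split("\t")
association = gaf_to_association(gaf_row)
print("\t".join(association))
```

The gene product symbol, name, synonyms, type and primary taxon are dropped here because they would live in the separate gene product file.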

Gene Products

Gene product data would be stored in a separate file. It would consist of the following pieces of information:

old column #	required?	contents	cardinality
1	required	DB	1
2	required	DB_Object_ID	1
3	required	DB_Object_Symbol	1
10	optional	DB_Object_Name	0 or 1
11	optional	DB_Object_Synonym(s)	0 or greater
12	required	DB_Object_Type	1
13	required	taxon	1

Any GPs with different spliceforms would also have the following data (see GAF Col17 GeneProducts for more about spliceforms):

old column #	required?	contents	cardinality
17	required	Spliceform ID	1
12	required	Spliceform object type	1

Ideally, the gene product files would also include the gp2protein data, which would add one more piece of data: an xref to a UniProt or NCBI ID.
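
One possible way to derive the gene product records from existing GAF rows is sketched below in Python. The function name `collect_gene_products` and the dictionary layout are assumptions made for illustration. The sketch relies on the rule that column 12 refers to column 17 when a spliceform is present, so the type of the parent gene product is only taken from rows without a spliceform. The gp2protein xref is not covered.

```python
def collect_gene_products(rows):
    """Build one record per (DB, DB_Object_ID) from GAF rows.
    Hypothetical helper; each row is a list of column strings."""
    products = {}
    for fields in rows:
        # Pad so that the optional columns 16 and 17 may be absent.
        cols = list(fields) + [""] * (17 - len(fields))
        record = products.setdefault((cols[0], cols[1]), {
            "db": cols[0],
            "id": cols[1],
            "symbol": cols[2],
            "name": cols[9],
            "synonyms": cols[10].split("|") if cols[10] else [],
            "type": "",
            # Only the first taxon describes the gene product itself.
            "taxon": cols[12].split("|")[0],
            "spliceforms": {},
        })
        if cols[16]:
            # Column 12 describes the spliceform, not the parent gene product.
            record["spliceforms"][cols[16]] = cols[11]
        elif not record["type"]:
            record["type"] = cols[11]
    return products


# Rows adapted from the example on this page; column 16 is left empty
# so that the spliceform ID sits in column 17.
common = ["AMOT, KIAA1071: Angiomotin", "IPI00163085"]
rows = [
    ["UniProtKB", "Q4VCS5", "AMOT_HUMAN", "", "GO:0031410", "PMID:11257124", "IDA", "", "C",
     *common, "protein", "taxon:9606", "20051207", "UniProtKB"],
    ["UniProtKB", "Q4VCS5", "AMOT_HUMAN", "", "GO:0043116", "PMID:16043488", "IDA", "", "P",
     *common, "snoRNA", "taxon:9606", "20051207", "UniProtKB", "", "Q4VCS5-1"],
    ["UniProtKB", "Q4VCS5", "AMOT_HUMAN", "", "GO:0043532", "PMID:16043488", "IDA", "", "F",
     *common, "protein", "taxon:9606", "20051207", "UniProtKB", "", "Q4VCS5-2"],
]
products = collect_gene_products(rows)
print(products[("UniProtKB", "Q4VCS5")])
```

A gene product that only ever appears with a spliceform would end up with an empty type in this sketch, so a real converter would need another source for that value.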

Example

The following appears on the page http://geneontology.org/GO.annotation.fields.shtml and is an example of the current GAF file structure (shaded bg: annotation data; blue text: gp data):

1 DB	2 DB Object ID	3 DB Object Symbol	4 Qualifier	5 GO ID	6 DB:Reference(s)	7 Evidence code	8 With (or) From	9 Aspect	10 DB Object Name	11 DB Object Synonym(s)	12 DB Object Type (refers to col 17 if present)	13 taxon	14 Date	15 Assigned by	17 Spliceform
SGD	S000000296	PHO3		GO:0003993	SGD_REF:S000047763	IMP		F	acid phosphatase	YBR092C	gene	taxon:4932	20010118	SGD
SGD	S000000296	PHO3		GO:0006796	SGD_REF:S000047115	TAS		P	acid phosphatase	YBR092C	gene	taxon:4932	20041220	SGD
SGD	S000005370	RCL1	NOT	GO:0003963	SGD_REF:S000039255	IDA		F	aminodeoxychorismate synthase	YOL010W	gene	taxon:4932	20020530	SGD
SGD	S000005370	RCL1		GO:0006406	SGD_REF:S000069956	IC	GO:0000346	P	aminodeoxychorismate synthase	YOL010W	gene	taxon:4932|taxon:745953	20030221	SGD
SGD	S000005370	RCL1		GO:0046820	SGD_REF:S000057703	ISS	CGSC:pabA	F	aminodeoxychorismate synthase	YOL010W	gene	taxon:4932|taxon:2861	20030106	SGD
UniProtKB	Q4VCS5	AMOT_HUMAN		GO:0031410	PMID:11257124	IDA		C	AMOT, KIAA1071: Angiomotin	IPI00163085	protein	taxon:9606	20051207	UniProtKB
UniProtKB	Q4VCS5	AMOT_HUMAN		GO:0043532	PMID:11257124	IDA		F	AMOT, KIAA1071: Angiomotin	IPI00163085	protein	taxon:9606	20051207	UniProtKB
UniProtKB	Q4VCS5	AMOT_HUMAN		GO:0043116	PMID:16043488	IDA		P	AMOT, KIAA1071: Angiomotin	IPI00163085	snoRNA	taxon:9606	20051207	UniProtKB	Q4VCS5-1
UniProtKB	Q4VCS5	AMOT_HUMAN		GO:0005515	PMID:16043488	IPI	UniProtKB:Q6RHR9-2	F	AMOT, KIAA1071: Angiomotin	IPI00163085	snoRNA	taxon:9606	20051207	UniProtKB	Q4VCS5-1
UniProtKB	Q4VCS5	AMOT_HUMAN		GO:0043532	PMID:16043488	IDA		F	AMOT, KIAA1071: Angiomotin	IPI00163085	protein	taxon:9606	20051207	UniProtKB	Q4VCS5-2

This is how it could look in the proposed new format.

Association data:

1 DB	17 or 2 Spliceform ID OR DB Object ID	4 Qualifier	5 GO ID	6 DB:Reference(s)	7 Evidence code	8 With (or) From	9 Aspect	13 Interacting taxon ID (for multi-organism processes)	14 Date	15 Assigned_by
SGD	S000000296		GO:0003993	SGD_REF:S000047763	IMP		F		20010118	SGD
SGD	S000000296		GO:0006796	SGD_REF:S000047115	TAS		P		20041220	SGD
SGD	S000005370	NOT	GO:0003963	SGD_REF:S000039255	IDA		F		20020530	SGD
SGD	S000005370		GO:0006406	SGD_REF:S000069956	IC	GO:0000346	P	taxon:745953	20030221	SGD
SGD	S000005370		GO:0046820	SGD_REF:S000057703	ISS	CGSC:pabA	F	taxon:2861	20030106	SGD
UniProtKB	Q4VCS5		GO:0031410	PMID:11257124	IDA		C		20051207	UniProtKB
UniProtKB	Q4VCS5		GO:0043532	PMID:11257124	IDA		F		20051207	UniProtKB
UniProtKB	Q4VCS5-1		GO:0043116	PMID:16043488	IDA		P		20051207	UniProtKB
UniProtKB	Q4VCS5-1		GO:0005515	PMID:16043488	IPI	UniProtKB:Q6RHR9-2	F		20051207	UniProtKB
UniProtKB	Q4VCS5-2		GO:0043532	PMID:16043488	IDA		F		20051207	UniProtKB

GP data (including possible data from gp2protein file):

1 DB	2 DB_Object_ID	3 DB_Object_Symbol	10 DB_Object_Name	11 DB_Object_Synonym(s)	12 DB_Object_Type	13 taxon	n/a Spliceform ID, spliceform type	n/a ?? xref from gp2protein file ??
SGD	S000000296	PHO3	acid phosphatase	YBR092C	gene	taxon:4932		UniProt:NE92D8
SGD	S000005370	RCL1	aminodeoxychorismate synthase	YOL010W	gene	taxon:4932		UniProt:JN97D8
UniProtKB	Q4VCS5	AMOT_HUMAN	AMOT, KIAA1071: Angiomotin	IPI00163085	protein	taxon:9606	Q4VCS5-1, snoRNA | Q4VCS5-2, protein	UniProt:Q4VCS5

The representation of the spliceforms could be changed if it isn't clear enough.
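
To judge whether the spliceform representation used in the example is clear enough, it may help to see what reading it would involve. The Python sketch below assumes the cell holds pipe-separated "ID, type" pairs, as in the Q4VCS5 row; the function name `parse_spliceforms` is hypothetical, and a different representation would need a different parser.

```python
def parse_spliceforms(cell):
    """Parse a cell such as 'Q4VCS5-1, snoRNA | Q4VCS5-2, protein'
    into a dict mapping spliceform ID to spliceform type."""
    forms = {}
    for entry in cell.split("|"):
        entry = entry.strip()
        if not entry:
            # An empty cell means the gene product has no spliceforms.
            continue
        # Split on the first comma only: ID before it, type after it.
        form_id, _, form_type = entry.partition(",")
        forms[form_id.strip()] = form_type.strip()
    return forms


spliceforms = parse_spliceforms("Q4VCS5-1, snoRNA | Q4VCS5-2, protein")
print(spliceforms)
```

Note that this only works as long as neither spliceform IDs nor type names contain a comma or a pipe.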

Technical requirements and impact on existing software

For the most part, the suggested changes simply split existing GAF files into two separate files and rearrange existing columns. It would be fairly easy to write a parser that takes the two files and produces a single GAF file from them, or vice versa.
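
As a rough illustration of the "two files into one" direction, the Python sketch below rebuilds a 17-column GAF row from one association row and a table of gene product records. Everything about the in-memory layout is an assumption: the function name `association_to_gaf`, the dictionary keys, and the linear search for the parent of a spliceform are all hypothetical choices, not part of the proposal.

```python
def association_to_gaf(assoc, products):
    """Rebuild a GAF row from a 12-column association row and a dict of
    gene product records keyed by (DB, DB_Object_ID). Hypothetical helper."""
    (db, object_id, qualifier, go_id, refs, evidence,
     with_from, aspect, interacting_taxon, date, assigned_by, xp) = assoc

    gp = products.get((db, object_id))
    spliceform = ""
    if gp is None:
        # The identifier may be a spliceform ID; look for its parent.
        for candidate in products.values():
            if candidate["db"] == db and object_id in candidate["spliceforms"]:
                gp, spliceform = candidate, object_id
                break
    if gp is None:
        raise KeyError(f"no gene product record for {db}:{object_id}")

    # Column 12 refers to the spliceform when column 17 is filled in.
    object_type = gp["spliceforms"][spliceform] if spliceform else gp["type"]
    taxon = gp["taxon"] + ("|" + interacting_taxon if interacting_taxon else "")

    return [db, gp["id"], gp["symbol"], qualifier, go_id, refs, evidence,
            with_from, aspect, gp["name"], "|".join(gp["synonyms"]),
            object_type, taxon, date, assigned_by, xp, spliceform]


products = {
    ("UniProtKB", "Q4VCS5"): {
        "db": "UniProtKB", "id": "Q4VCS5", "symbol": "AMOT_HUMAN",
        "name": "AMOT, KIAA1071: Angiomotin", "synonyms": ["IPI00163085"],
        "type": "protein", "taxon": "taxon:9606",
        "spliceforms": {"Q4VCS5-1": "snoRNA", "Q4VCS5-2": "protein"},
    },
}
assoc = ["UniProtKB", "Q4VCS5-1", "", "GO:0043116", "PMID:16043488",
         "IDA", "", "P", "", "20051207", "UniProtKB", ""]
gaf_row = association_to_gaf(assoc, products)
print("\t".join(gaf_row))
```

For large files such as the UniProt releases, an index from spliceform ID to parent record would be a better fit than the linear search used here.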

Revision as of 13:51, 19 October 2009 by Girlwithglasses (→ Example); revision as of 13:55, 19 October 2009 by Girlwithglasses (→ Gene Products)