Gene Product Association Data (GPAD) Format (Archived)

From GO Wiki
Revision as of 06:42, 29 September 2009 by Girlwithglasses

Proposal to split the information in the GAF files into two sets, association data and gene product data.

Current Association File Format

column | required? | contents | cardinality
1 | required | DB | 1
2 | required | DB_Object_ID | 1
3 | required | DB_Object_Symbol | 1
4 | optional | Qualifier | 0 or greater
5 | required | GO ID | 1
6 | required | DB:Reference | 1 or greater
7 | required | Evidence code | 1
8 | optional | With (or) From | 0 or greater
9 | required | Aspect | 1
10 | optional | DB_Object_Name | 0 or 1
11 | optional | Synonym | 0 or greater
12 | required | DB_Object_Type [refers to col 17 if present] | 1
13 | required | taxon | 1 or 2
14 | required | Date | 1
15 | required | Assigned_by | 1
16 | optional | Annotation cross products | ?
17 | optional | Spliceform | 1

Proposal: remove gene product data from the association file, leaving just an identifier for the gene product. The new format would be:

old column # | required? | contents | cardinality
1 | required | DB | 1
17 if present; else 2 | required | Spliceform ID OR DB_Object_ID | 1
4 | optional | Qualifier | 0 or greater
5 | required | GO ID | 1
6 | required | DB:Reference | 1 or greater
7 | required | Evidence code | 1
8 | optional | With (or) From | 0 or greater
14 | required | Date | 1
15 | required | Assigned_by | 1
16 | optional | Annotation cross products | ?
13 | optional | Interacting taxon ID (for multi-organism processes) | 0 or 1
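
The column mapping above can be sketched in Python as a projection of one tab-separated GAF row onto the proposed association columns. This is only an illustration, not part of the proposal: the function name `gaf_to_association` is hypothetical, it assumes that a second, pipe-separated entry in old column 13 is the interacting taxon, and it follows the column table literally, so Aspect (old column 9) is dropped.

```python
def gaf_to_association(gaf_line):
    """Project one tab-separated GAF row onto the proposed association columns.

    Hypothetical helper: the output order follows the "old column #" table
    (1, 17-or-2, 4, 5, 6, 7, 8, 14, 15, 16, 13).
    """
    cols = gaf_line.rstrip("\n").split("\t")
    # Pad to 17 fields so the optional trailing columns 16 and 17 may be absent.
    cols += [""] * (17 - len(cols))

    def col(n):
        # 1-based access, matching the column numbers used in the tables.
        return cols[n - 1].strip()

    # Use the spliceform ID if present, otherwise fall back to DB_Object_ID.
    object_id = col(17) or col(2)

    # Assumption: when old column 13 holds two taxon IDs ("taxon:A|taxon:B"),
    # the second one is the interacting taxon.
    taxa = col(13).split("|")
    interacting_taxon = taxa[1] if len(taxa) > 1 else ""

    return [col(1), object_id, col(4), col(5), col(6), col(7), col(8),
            col(14), col(15), col(16), interacting_taxon]
```

Padding short rows keeps the sketch usable on files that predate columns 16 and 17; a stricter converter might instead reject rows with fewer than 15 fields.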

Gene product data would be stored in a separate file. It would consist of the following pieces of information:

old column # | required? | contents | cardinality
1 | required | DB | 1
2 | required | DB_Object_ID | 1
3 | required | DB_Object_Symbol | 1
10 | optional | DB_Object_Name | 0 or 1
11 | optional | Synonym | 0 or greater
12 | required | DB_Object_Type | 1
13 | required | taxon | 1

Any gene products (GPs) with different spliceforms would also have the following data:

old column # | required? | contents | cardinality
17 | required | Spliceform ID | 1
12 | required | Spliceform object type | 1

Ideally, the gene product files would also include the gp2protein data, so we would have an additional piece of data, an xref to a UniProt or NCBI ID.
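
As a rough sketch of how the gene product file could be derived from an existing GAF file, the Python below keeps one record per (DB, DB_Object_ID) pair. The function name `gaf_to_gene_products` is hypothetical, the first entry in old column 13 is assumed to be the gene product's own taxon, and the spliceform and gp2protein xref data are left out because the proposal does not fix their layout.

```python
def gaf_to_gene_products(gaf_lines):
    """Collect one gene product record per (DB, DB_Object_ID) from GAF rows.

    Hypothetical helper: the output order follows the gene product table
    (old columns 1, 2, 3, 10, 11, 12, 13).
    """
    products = {}
    for line in gaf_lines:
        # GAF files mark header and comment lines with a leading "!".
        if not line.strip() or line.startswith("!"):
            continue
        cols = [field.strip() for field in line.rstrip("\n").split("\t")]
        cols += [""] * (17 - len(cols))   # tolerate missing optional columns

        key = (cols[0], cols[1])          # DB, DB_Object_ID
        if key in products:
            continue                      # already recorded from an earlier row

        # Assumption: the first taxon in old column 13 belongs to the gene
        # product itself; a second one would be the interacting taxon.
        own_taxon = cols[12].split("|")[0]
        products[key] = [cols[0], cols[1], cols[2], cols[9], cols[10],
                         cols[11], own_taxon]
    return list(products.values())
```

Keeping only the first record seen for each identifier assumes that every GAF row for a gene product repeats the same symbol, name and synonyms; a real converter would probably want to report any disagreement rather than silently discard it.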


Example

The following appears on the page http://geneontology.org/GO.annotation.fields.shtml and is an example of the current GAF file structure:

SGD	S000000296	PHO3		GO:0003993	SGD_REF:S000047763|PMID:2407294 	IMP		F	acid phosphatase	YBR092C	gene	taxon:4932	20010118	SGD
SGD	S000000296	PHO3		GO:0006796	SGD_REF:S000047115|PMID:2407294 	TAS		P	acid phosphatase	YBR092C	gene	taxon:4932	20041220	SGD
SGD	S000005370	RCL1	NOT	GO:0003963	SGD_REF:S000039255|PMID:10790377	IDA		F		YOL010W	gene	taxon:4932	20020530	SGD
SGD	S000005197	TEX1		GO:0006406	SGD_REF:S000069956|PMID:11979277	IC	GO:0000346	P		YNL253W	gene	taxon:4932	20030221	SGD
SGD	S000005316	ABZ1		GO:0046820	SGD_REF:S000057703|PMID:8346682 	ISS	CGSC:pabA|CGSC:pabB	F	aminodeoxychorismate synthase	YNR033W	gene	taxon:4932	20030106	SGD

This is how it could look in the proposed new format.

Association data:

SGD	S000000296		GO:0003993	F	SGD_REF:S000047763|PMID:2407294 	IMP		20010118	SGD
SGD	S000000296		GO:0006796	P	SGD_REF:S000047115|PMID:2407294 	TAS		20041220	SGD
SGD	S000005370	NOT	GO:0003963	F	SGD_REF:S000039255|PMID:10790377	IDA		20020530	SGD
SGD	S000005197		GO:0006406	P	SGD_REF:S000069956|PMID:11979277	IC	GO:0000346	20030221	SGD
SGD	S000005316		GO:0046820	F	SGD_REF:S000057703|PMID:8346682 	ISS	CGSC:pabA|CGSC:pabB	20030106	SGD

GP data:

SGD	S000000296	PHO3	acid phosphatase	YBR092C	gene	taxon:4932	UniProt:XXXXXX
SGD	S000005370	RCL1		YOL010W	gene	taxon:4932	UniProt:XXXXXX
SGD	S000005197	TEX1		YNL253W	gene	taxon:4932	UniProt:XXXXXX
SGD	S000005316	ABZ1	aminodeoxychorismate synthase	YNR033W	gene	taxon:4932	UniProt:XXXXXX
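
Splitting the data means a consumer has to join the two files again to show, for example, a gene symbol next to an annotation. The Python sketch below illustrates one way to do that, keyed on the first two fields (DB and object ID) that both example layouts share. The function names are hypothetical, and the sample rows are copied from the examples above, including the `UniProt:XXXXXX` placeholder.

```python
def index_gene_products(gp_lines):
    """Map (DB, object ID) to the full gene product record."""
    index = {}
    for line in gp_lines:
        fields = line.rstrip("\n").split("\t")
        index[(fields[0], fields[1])] = fields
    return index


def attach_gene_products(assoc_lines, gp_index):
    """Pair each association row with its gene product record (or None)."""
    joined = []
    for line in assoc_lines:
        fields = line.rstrip("\n").split("\t")
        # Both files start with DB and object ID, so those two fields form the key.
        joined.append((fields, gp_index.get((fields[0], fields[1]))))
    return joined


# Sample rows taken from the examples on this page.
ASSOC_ROWS = [
    "SGD\tS000000296\t\tGO:0003993\tF\tSGD_REF:S000047763|PMID:2407294\tIMP\t\t20010118\tSGD",
    "SGD\tS000005370\tNOT\tGO:0003963\tF\tSGD_REF:S000039255|PMID:10790377\tIDA\t\t20020530\tSGD",
]
GP_ROWS = [
    "SGD\tS000000296\tPHO3\tacid phosphatase\tYBR092C\tgene\ttaxon:4932\tUniProt:XXXXXX",
    "SGD\tS000005370\tRCL1\t\tYOL010W\tgene\ttaxon:4932\tUniProt:XXXXXX",
]

joined = attach_gene_products(ASSOC_ROWS, index_gene_products(GP_ROWS))
for assoc, gp in joined:
    symbol = gp[2] if gp else "(unknown gene product)"
    print(symbol, assoc[3])   # gene symbol followed by GO ID
```

Returning `None` for an unmatched identifier, rather than raising an error, lets the caller decide how to treat association rows whose gene product is missing from the second file.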