Gene Product Association Data (GPAD) Format (Archived): Difference between revisions

Line 149: Line 149:
===Gene Products===
===Gene Products===


Gene product data would be stored in a separate file. It would consist of the following pieces of information:
Gene product data would be stored in a separate file. It would consist of the following pieces of information -- see the [[Gene_Product_Data_File_Format | gene product data file format]] for an in-depth view.


{| style="color:blue" border=1 cell-padding=5
{| border=1 cell-padding=5 style="color:blue"
|-
|-
! contents
! contents
! required?
! required?
! cardinality
! cardinality
! old column #
! GAF 2.0 col #
|- style="background:#ccffff"
|- style="background:#ccffff"
| DB || style="color:red" | required || 1 || 1
| DB || style="color:red" | required || 1 || 1
|- style="background:#ccffff"
|-style="background:#ccffff"
| DB_Object_ID || style="color:red" | required || 1 || 2
| DB Object ID || style="color:red" | required || 1 || 2
|-
| DB Object Type || style="color:red" | required || 1 || 12
|-
| Taxon || style="color:red" | required || 1 || 13
|-
|-
| DB_Object_Symbol || style="color:red" | required || 1 || 3
| DB Object Symbol || style="color:red" | required || 1 || 3
|-
|-
| DB_Object_Name || optional || 0 or 1 || 10
| DB Object Name || optional || 0 or 1 || 10
|-
|-
| DB_Object_Synonym(s) || optional || 0 or greater || 11
| DB Object Synonym(s) || optional || 0 or greater || 11
|-
|-
| DB_Object_Type || style="color:red" | required || 1 || 12
| Parent GP ID || blank unless GP is an isoform (see next table) || 0 || n/a
|-
|-
| taxon || style="color:red" | required || 1 || 13
| Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file) || optional || 0+ || n/a
|}
|}




Any GPs with different spliceforms would also have the following data (see [[ GAF Col17 GeneProducts ]] for more about spliceforms):
Spliceforms (see [[ GAF Col17 GeneProducts ]] for more about spliceforms) would have their own entries in this file, with the data as follows:


{| style="color:blue" border=1 cell-padding=5
{| border=1 cell-padding=5 style="color:blue"
|-
|-
! contents
! contents
! required?
! required?
! cardinality
! cardinality
! old column #
! GAF 2.0 col #
|- style="background:#ccffff"
| DB || style="color:red" | required || 1 || 1
|- style="background:#ccffff"
|- style="background:#ccffff"
| Spliceform ID || style="color:red" | required || 1 || 17
| DB Object ID || style="color:red" | required || 1 || 17
|-
|-
| Spliceform object type || style="color:red" | required || 1 || 12
| DB Object Type || style="color:red" | required || 1 || 12
|-
| Taxon || style="color:red" | required || 1 || 13
|-
| DB Object Symbol || style="color:red" | required || 1 || 3
|-
| DB Object Name || optional || 0 or 1 || 10
|-
| DB Object Synonym(s) || optional || 0 or greater || 11
|-
| Parent GP ID || style="color:red" | required || 1 || 2
|-
| Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file) || optional || 0+ || n/a
|}
|}


Line 300: Line 318:




GP data (including possible data from gp2protein file):
GP data (including possible data from gp2protein file) -- see the [[Gene_Product_Data_File_Format | gene product data file format]] for an in-depth view.


{| cellspacing="2" border="1" style="color:blue"
{| cellspacing="2" border="1" style="color:blue"
Line 306: Line 324:
! DB
! DB
! DB_Object_ID
! DB_Object_ID
! DB_Object_Symbol
! DB_Object_Name
! DB_Object_Synonym(s)
! DB_Object_Type
! DB_Object_Type
! taxon
! Taxon
! Spliceform ID, spliceform type
! DB Object Symbol
! xref from gp2protein file
! DB Object Name
! DB Object Synonym(s)
! Parent GP ID
! Xrefs in other DBs
|-
|-
| style="background:#ccffff" |SGD || style="background:#ccffff" |S000000296 ||  PHO3 || acid phosphatase || YBR092C || gene || taxon:4932 || || UniProt:NE92D8
| style="background:#ccffff" |SGD || style="background:#ccffff" |S000000296 || gene || 4932 ||  PHO3 || acid phosphatase || YBR092C || || UniProt:NE92D8
|-
|-
| style="background:#ccffff" |SGD || style="background:#ccffff" |S000005370 ||  RCL1 || aminodeoxychorismate synthase || YOL010W || gene || taxon:4932 || || UniProt:JN97D8
| style="background:#ccffff" |SGD || style="background:#ccffff" |S000005370 || gene || 4932 ||  RCL1 || aminodeoxychorismate synthase || YOL010W || || UniProt:JN97D8
|-
|-
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5 || AMOT_HUMAN || AMOT, KIAA1071: Angiomotin || IPI00163085 || protein || taxon:9606 || Q4VCS5-1, snoRNA<nowiki> | </nowiki> Q4VCS5-2, protein || UniProt:Q4VCS5
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5 || protein || 9606 || AMOT_HUMAN || AMOT, KIAA1071: Angiomotin || KIAA1071 || ||  
|-
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5-1 || snoRNA || 9606 || AMOT_HUMAN || Isoform 1 of Angiomotin || || UniProtKB:Q4VCS5 ||
|-
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5-2 || protein || 9606 || AMOT_HUMAN || Isoform 2 of Angiomotin || || UniProtKB:Q4VCS5 ||
|}
|}


The representation of the spliceforms could be changed if it isn't clear enough.
 


==Reformatting in obo1.3==
==Reformatting in obo1.3==
Line 334: Line 356:
  symbol: rpr
  symbol: rpr
  name: reaper
  name: reaper
  type: gene       [or use SO:id?]
  type: gene
  taxon: 7227
  taxon: 7227
  synonym: anon-WO0162936.19
  synonym: anon-WO0162936.19
Line 342: Line 364:
  synonym: rp
  synonym: rp
  synonym: RPR
  synonym: RPR
  xref: UniProtKB:Q24475 SEQ_XREF [or some kind of modifier to show that this is a seq xref]
  xref:   UniProtKB:Q24475 SEQ_XREF [modifier to show that this is a seq xref]




For a gene product with several spliceforms, the information could be represented thus:
For a gene product with several spliceforms, the information could be represented thus:


[Entity]
 
  id: UniProtKB:Q4VCS5
  id: UniProtKB:Q4VCS5
  symbol: AMOT_HUMAN
  symbol: AMOT_HUMAN
Line 353: Line 375:
  type: protein
  type: protein
  taxon: 9606
  taxon: 9606
  xref: UniProtKB:Q24475  SEQ_XREF
  synonym: KIAA1071
   
   
[Spliceform]
  id: UniProtKB:Q4VCS5-1
  id: UniProtKB:Q4VCS5-1
  type: snoRNA
  type: snoRNA
name: Isoform 1 of Angiomotin
parent: UniProtKB:Q4VCS4
   
   
[Spliceform]
  id: UniProtKB:Q4VCS5-2
  id: UniProtKB:Q4VCS5-2
  type: protein
  type: protein
name: Isoform 2 of Angiomotin
parent: UniProtKB:Q4VCS4




or
or


[Entity]
For a gene product with several spliceforms, the information could be represented thus:
 
  id: UniProtKB:Q4VCS5
  id: UniProtKB:Q4VCS5
  symbol: AMOT_HUMAN
  symbol: AMOT_HUMAN
Line 372: Line 397:
  type: protein
  type: protein
  taxon: 9606
  taxon: 9606
  seq_xref: UniProtKB:Q24475
  synonym: KIAA1071
 
   
  [Entity]
  id: UniProtKB:Q4VCS5-1
  id: UniProtKB:Q4VCS5-1
  type: snoRNA
  type: snoRNA
name: Isoform 1 of Angiomotin
  relationship: isoform_of UniProtKB:Q4VCS4
  relationship: isoform_of UniProtKB:Q4VCS4
 
   
  [Entity]
  id: UniProtKB:Q4VCS5-2
  id: UniProtKB:Q4VCS5-2
  type: protein
  type: protein
name: Isoform 2 of Angiomotin
  relationship: isoform_of UniProtKB:Q4VCS4
  relationship: isoform_of UniProtKB:Q4VCS4



Revision as of 15:01, 24 March 2010

Proposal to split the information in the GAF files into two sets, association data and gene product data.

The reasons for doing this are as follows:

  • allow unannotated gene products to be submitted to the GO database (could be useful in estimating the proportion of a genome that has been annotated; will also allow users to see that the GP they are searching for does exist, so they won't spend a long time fruitlessly searching for it [see note below])
  • reduce the amount of redundant gene product information in the GAF files. Every annotation to a gene product repeats the same gene product data, which only needs to be stated once. Moving this repeated information to a separate file will make the GAF files smaller, which would be especially helpful for huge files like the UniProt releases.


NB: although the gp2protein files may contain IDs of unannotated gene products, this data does not go into the GO database, and it is not available in AmiGO. There is also no information available about gene product name, synonyms, type or taxon of unannotated gene products in the gp2protein file.


Current Association File Format

Annotation information has a shaded background, gene product data is in blue text, and information required for both has blue text on a shaded background.

column  required?  contents                                      cardinality
1       required   DB                                            1
2       required   DB_Object_ID                                  1
3       required   DB_Object_Symbol                              1
4       optional   Qualifier                                     0 or greater
5       required   GO ID                                         1
6       required   DB:Reference(s)                               1 or greater
7       required   Evidence code                                 1
8       optional   With (or) From                                0 or greater
9       required   Aspect                                        1
10      optional   DB_Object_Name                                0 or 1
11      optional   DB_Object_Synonym(s)                          0 or greater
12      required   DB_Object_Type (refers to col 17 if present)  1
13      required   taxon                                         1 or 2 (for multi-org processes)
14      required   Date                                          1
15      required   Assigned_by                                   1
16      optional   Annotation cross products                     ?
17      optional   Spliceform                                    1

Proposed file format

Proposal: remove gene product data from the association file, leaving just an identifier.


Associations

new format for storing annotations:

contents                                             required?  cardinality   old column #
DB                                                   required   1             1
DB_Object_ID                                         required   1             2
Qualifier                                            optional   0 or greater  4
GO ID                                                required   1             5
DB:Reference(s)                                      required   1 or greater  6
Evidence code                                        required   1             7
With (or) From                                       optional   0 or greater  8
Interacting taxon ID (for multi-organism processes)  optional   0 or 1        13
Date                                                 required   1             14
Assigned_by                                          required   1             15
Annotation Cross Products                            optional   0 or greater  16
Spliceform ID                                        optional   0 or 1        17 (if present)
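
As a rough illustration of the column mapping above, the following Python sketch projects one row of the current 17-column format onto the proposed association columns. The function and constant names are hypothetical, and the handling of column 13 (keeping only the second, interacting taxon) is an assumption read off the tables on this page rather than an agreed rule.

```python
# Old GAF column numbers (1-based) in the order of the proposed association format.
ASSOCIATION_COLUMNS = [1, 2, 4, 5, 6, 7, 8, 13, 14, 15, 16, 17]


def to_association(gaf_fields):
    """Project a GAF row (list of strings) onto the proposed association columns."""
    # Pad rows whose trailing optional columns (16, 17) were left off.
    fields = list(gaf_fields) + [""] * (17 - len(gaf_fields))
    row = [fields[col - 1] for col in ASSOCIATION_COLUMNS]
    # Column 13 holds "taxon:A" or "taxon:A|taxon:B". The first taxon belongs to
    # the gene product file, so only the second (interacting) taxon is kept here.
    taxa = fields[12].split("|")
    row[7] = taxa[1] if len(taxa) > 1 else ""
    return row
```

Gene product symbol, name, synonyms and type (columns 3, 10, 11 and 12) are dropped, and so is Aspect (column 9), which is absent from both proposed files.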

Gene Products

Gene product data would be stored in a separate file. It would consist of the following pieces of information -- see the gene product data file format for an in-depth view.

contents                                                         required?                                       cardinality  GAF 2.0 col #
DB                                                               required                                        1            1
DB Object ID                                                     required                                        1            2
DB Object Type                                                   required                                        1            12
Taxon                                                            required                                        1            13
DB Object Symbol                                                 required                                        1            3
DB Object Name                                                   optional                                        0 or 1       10
DB Object Synonym(s)                                             optional                                        0 or greater 11
Parent GP ID                                                     blank unless GP is an isoform (see next table)  0            n/a
Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file)  optional                                        0+           n/a


Spliceforms (see GAF Col17 GeneProducts for more about spliceforms) would have their own entries in this file, with the data as follows:

contents                                                         required?  cardinality   GAF 2.0 col #
DB                                                               required   1             1
DB Object ID                                                     required   1             17
DB Object Type                                                   required   1             12
Taxon                                                            required   1             13
DB Object Symbol                                                 required   1             3
DB Object Name                                                   optional   0 or 1        10
DB Object Synonym(s)                                             optional   0 or greater  11
Parent GP ID                                                     required   1             2
Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file)  optional   0+            n/a


Ideally, the gene product files would also include the gp2protein data, so we would have an additional piece of data, an xref to a UniProt or NCBI ID.
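
The required/optional rules in the two tables above could be checked mechanically. The Python sketch below is one possible shape for such a check; the dictionary keys (db, object_id, object_type, taxon, symbol, parent_id) are hypothetical names chosen for illustration and are not part of the proposal.

```python
# Fields marked "required" in both gene product tables (hypothetical key names).
REQUIRED_FIELDS = ("db", "object_id", "object_type", "taxon", "symbol")


def validate_gene_product(record):
    """Return a list of problems found in one gene product record (a dict)."""
    problems = [
        "missing required field: " + name
        for name in REQUIRED_FIELDS
        if not record.get(name)
    ]
    # Parent GP ID is blank for ordinary gene products and filled in for
    # spliceforms; a record naming itself as its parent is clearly wrong.
    parent = record.get("parent_id", "")
    if parent and parent == "{}:{}".format(record.get("db"), record.get("object_id")):
        problems.append("parent_id points at the record itself")
    return problems
```

An empty list means the record passes these checks. The sketch cannot tell whether a blank Parent GP ID is legitimate, because nothing else in the record marks it as a spliceform.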

Example

Old GAF 1.0 Format

The following appears on the page http://geneontology.org/GO.annotation.fields.shtml and is an example of the current GAF file structure (shaded bg: annotation data; blue text: gp data):

Columns: 1 DB; 2 DB Object ID; 3 DB Object Symbol; 4 Qualifier; 5 GO ID; 6 DB:Reference(s); 7 Evidence code; 8 With (or) From; 9 Aspect; 10 DB Object Name; 11 DB Object Synonym(s); 12 DB Object Type (refers to col 17 if present); 13 taxon; 14 Date; 15 Assigned by; 16 Annotation cross products; 17 Spliceform

SGD S000000296 PHO3 GO:0003993 SGD_REF:S000047763 IMP F acid phosphatase YBR092C gene taxon:4932 20010118 SGD
SGD S000000296 PHO3 GO:0006796 SGD_REF:S000047115 TAS P acid phosphatase YBR092C gene taxon:4932 20041220 SGD
SGD S000005370 RCL1 NOT GO:0003963 SGD_REF:S000039255 IDA F aminodeoxychorismate synthase YOL010W gene taxon:4932 20020530 SGD
SGD S000005370 RCL1 GO:0006406 SGD_REF:S000069956 IC GO:0000346 P aminodeoxychorismate synthase YOL010W gene taxon:4932|taxon:745953 20030221 SGD
SGD S000005370 RCL1 GO:0046820 SGD_REF:S000057703 ISS CGSC:pabA F aminodeoxychorismate synthase YOL010W gene taxon:4932|taxon:2861 20030106 SGD
UniProtKB Q4VCS5 AMOT_HUMAN GO:0031410 PMID:11257124 IDA C AMOT, KIAA1071: Angiomotin IPI00163085 protein taxon:9606 20051207 UniProtKB
UniProtKB Q4VCS5 AMOT_HUMAN GO:0043532 PMID:11257124 IDA F AMOT, KIAA1071: Angiomotin IPI00163085 protein taxon:9606 20051207 UniProtKB
UniProtKB Q4VCS5 AMOT_HUMAN GO:0043116 PMID:16043488 IDA P AMOT, KIAA1071: Angiomotin IPI00163085 snoRNA taxon:9606 20051207 UniProtKB Q4VCS5-1
UniProtKB Q4VCS5 AMOT_HUMAN GO:0005515 PMID:16043488 IPI UniProtKB:Q6RHR9-2 F AMOT, KIAA1071: Angiomotin IPI00163085 snoRNA taxon:9606 20051207 UniProtKB Q4VCS5-1
UniProtKB Q4VCS5 AMOT_HUMAN GO:0043532 PMID:16043488 IDA F AMOT, KIAA1071: Angiomotin IPI00163085 protein taxon:9606 20051207 UniProtKB Q4VCS5-2


Proposed new format

This is how it could look in the proposed new format.

Association data:

DB DB Object ID Qualifier GO ID DB:Reference(s) Evidence code With (or) From Interacting taxon ID (for multi-organism processes) Date Assigned_by Annotation cross products Spliceform ID (if applicable)
SGD S000000296 GO:0003993 SGD_REF:S000047763 IMP 20010118 SGD
SGD S000000296 GO:0006796 SGD_REF:S000047115 TAS 20041220 SGD
SGD S000005370 NOT GO:0003963 SGD_REF:S000039255 IDA 20020530 SGD
SGD S000005370 GO:0006406 SGD_REF:S000069956 IC GO:0000346 taxon:745953 20030221 SGD
SGD S000005370 GO:0046820 SGD_REF:S000057703 ISS CGSC:pabA taxon:2861 20030106 SGD
UniProtKB Q4VCS5 GO:0031410 PMID:11257124 IDA 20051207 UniProtKB
UniProtKB Q4VCS5 GO:0043532 PMID:11257124 IDA 20051207 UniProtKB
UniProtKB Q4VCS5 GO:0043116 PMID:16043488 IDA 20051207 UniProtKB Q4VCS5-1
UniProtKB Q4VCS5 GO:0005515 PMID:16043488 IPI UniProtKB:Q6RHR9-2 20051207 UniProtKB Q4VCS5-1
UniProtKB Q4VCS5 GO:0043532 PMID:16043488 IDA 20051207 UniProtKB Q4VCS5-2


GP data (including possible data from gp2protein file) -- see the gene product data file format for an in-depth view.

DB DB_Object_ID DB_Object_Type Taxon DB Object Symbol DB Object Name DB Object Synonym(s) Parent GP ID Xrefs in other DBs
SGD S000000296 gene 4932 PHO3 acid phosphatase YBR092C UniProt:NE92D8
SGD S000005370 gene 4932 RCL1 aminodeoxychorismate synthase YOL010W UniProt:JN97D8
UniProtKB Q4VCS5 protein 9606 AMOT_HUMAN AMOT, KIAA1071: Angiomotin KIAA1071
UniProtKB Q4VCS5-1 snoRNA 9606 AMOT_HUMAN Isoform 1 of Angiomotin UniProtKB:Q4VCS5
UniProtKB Q4VCS5-2 protein 9606 AMOT_HUMAN Isoform 2 of Angiomotin UniProtKB:Q4VCS5
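
The reduction from ten annotation rows to a handful of gene product rows can be sketched in Python as follows. This is an illustrative sketch only: the function name and dictionary keys are hypothetical, the bare taxon number mirrors the example table above, and rows that carry a spliceform ID are skipped because their column 12 describes the spliceform rather than the parent gene product.

```python
def collect_gene_products(gaf_rows):
    """Return one gene product record per (DB, DB_Object_ID), first occurrence wins."""
    products = {}
    for row in gaf_rows:
        fields = list(row) + [""] * (17 - len(row))  # pad missing trailing columns
        if fields[16]:
            # Column 17 is set, so column 12 holds the spliceform's type; a gene
            # product that only ever appears on such rows is not recorded here.
            continue
        key = (fields[0], fields[1])
        if key in products:
            continue  # the same gene product data is repeated on every annotation
        products[key] = {
            "db": fields[0],
            "object_id": fields[1],
            "object_type": fields[11],
            # The first taxon in column 13 is the gene product's own taxon.
            "taxon": fields[12].split("|")[0].split(":", 1)[-1],
            "symbol": fields[2],
            "name": fields[9],
            "synonyms": [s for s in fields[10].split("|") if s],
            "parent_id": "",
        }
    return list(products.values())
```

Spliceform entries and xrefs would have to come from another source, since the 17-column rows do not carry a symbol, name or xref for the spliceform itself.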


Reformatting in obo1.3

Another option is to abandon the tab-delimited format and go for an obo-like tag-value format.

Gene Products

Gene product data for reaper in OBO 1.3 (-esque) syntax:

id: FB:FBgn0011706
symbol: rpr
name: reaper
type: gene
taxon: 7227
synonym: anon-WO0162936.19
synonym: CG4319
synonym: Reaper
synonym: Reaper L
synonym: rp
synonym: RPR
xref:   UniProtKB:Q24475 SEQ_XREF [modifier to show that this is a seq xref]


For a gene product with several spliceforms, the information could be represented thus:


id: UniProtKB:Q4VCS5
symbol: AMOT_HUMAN
name: AMOT, KIAA1071: Angiomotin
type: protein
taxon: 9606
synonym: KIAA1071

id: UniProtKB:Q4VCS5-1
type: snoRNA
name: Isoform 1 of Angiomotin
parent: UniProtKB:Q4VCS5

id: UniProtKB:Q4VCS5-2
type: protein
name: Isoform 2 of Angiomotin
parent: UniProtKB:Q4VCS5


or, with the link to the parent gene product expressed as a relationship:

id: UniProtKB:Q4VCS5
symbol: AMOT_HUMAN
name: AMOT, KIAA1071: Angiomotin
type: protein
taxon: 9606
synonym: KIAA1071

id: UniProtKB:Q4VCS5-1
type: snoRNA
name: Isoform 1 of Angiomotin
relationship: isoform_of UniProtKB:Q4VCS5

id: UniProtKB:Q4VCS5-2
type: protein
name: Isoform 2 of Angiomotin
relationship: isoform_of UniProtKB:Q4VCS5
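
Tag-value blocks like the ones above are straightforward to read back. The following Python sketch is an assumption about how a consumer might parse them, not a parser for any official OBO release: it treats a blank line as the end of a block, ignores bracketed header lines such as [Annotation], and keeps repeated tags (for example synonym) as lists.

```python
def parse_stanzas(text):
    """Split blank-line-separated 'tag: value' blocks into dicts of tag -> list of values."""
    stanzas = []
    current = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            # A blank line closes the block collected so far.
            if current:
                stanzas.append(current)
                current = {}
            continue
        if line.startswith("[") and line.endswith("]"):
            continue  # header lines such as [Annotation] are ignored in this sketch
        # Split on the first colon only, so values like UniProtKB:Q4VCS5 stay whole.
        tag, _, value = line.partition(":")
        current.setdefault(tag.strip(), []).append(value.strip())
    if current:
        stanzas.append(current)
    return stanzas
```

Trailing bracketed notes or modifiers on a value (such as SEQ_XREF) are left inside the value string; splitting them out would need a decision on the modifier syntax that this page leaves open.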

Annotation data

Example: FB:FBgn0011706 annotated to GO:0035071, ref: PMID:19824712, ev code IC

[Annotation]
subject: FB:FBgn0011706
object: GO:0035071
source: PMID:19824712
evidence: IC
creation_date: 20070506 
assigned_by: FlyBase


Example: SGD:S000005370, NOT GO:0003963, refs: SGD_REF:S000039255, PMID:84195322, evcode IDA

[Annotation]
subject: SGD:S000005370
object: GO:0003963
is_negated: true
evidence: IDA
source: SGD_REF:S000039255
source: PMID:84195322
creation_date: 20020530
assigned_by: SGD


Example: UniProtKB:Q4VCS5-1 annotated to GO:0005515, ref: PMID:16043488, evcode IPI, with UniProtKB:Q6RHR9-2

[Annotation]
subject: UniProtKB:Q4VCS5
property-value: isoform UniProtKB:Q4VCS5-1
object: GO:0005515
source: PMID:16043488
evidence: IPI
xref: UniProtKB:Q6RHR9-2  EVIDENCE  <-- or something to indicate that this is a with/from xref
creation_date: 20051207
assigned_by: UniProtKB


Example: UniProtKB:H82KBU contributes_to GO:0006917, ref: PMID:8762143, evcode TAS, annotated by AgBase

[Annotation]
subject: UniProtKB:H82KBU
object: GO:0006917
relation: contributes_to
source: PMID:8762143
evidence: TAS
creation_date: 20041207
assigned_by: AgBase
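
Going the other way, a stanza of this kind can be written out from its parts. The Python sketch below uses the tag names from the examples above; the function name, its parameters and the tag order are illustrative choices, and with/from xrefs and isoform property-values are left out because their syntax is still marked as undecided.

```python
def format_annotation(subject, go_id, sources, evidence, date, assigned_by,
                      negated=False, relation=None):
    """Render one annotation as an [Annotation] tag-value stanza."""
    lines = ["[Annotation]", "subject: " + subject, "object: " + go_id]
    if negated:
        lines.append("is_negated: true")  # replaces the NOT qualifier
    if relation:
        lines.append("relation: " + relation)  # e.g. contributes_to
    # One source line per reference, as in the SGD example above.
    lines.extend("source: " + ref for ref in sources)
    lines.append("evidence: " + evidence)
    lines.append("creation_date: " + date)
    lines.append("assigned_by: " + assigned_by)
    return "\n".join(lines)
```

For instance, format_annotation("FB:FBgn0011706", "GO:0035071", ["PMID:19824712"], "IC", "20070506", "FlyBase") reproduces the first example stanza line for line.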

Technical requirements and impact on existing software

For the most part, the changes suggested are simply a split of existing GAF files into two separate files, with a rearrangement of existing columns. It would be fairly easy to write a standalone script that could take in the two files and produce one from them, or vice versa.
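
A minimal Python sketch of the "two files into one" direction is given below. It is a sketch under stated assumptions rather than a finished converter: the function name and dictionary keys are hypothetical, gene product records are assumed to be held in a dict keyed by (DB, DB_Object_ID), and Aspect (column 9) has to be supplied by the caller because neither proposed file carries it.

```python
def merge_to_gaf(assoc, products, aspect_of):
    """Rebuild one 17-column GAF row.

    assoc:     12 values in the proposed association column order.
    products:  dict mapping (DB, DB_Object_ID) to a gene product dict.
    aspect_of: dict mapping GO ID to F, P or C (an assumed external lookup).
    """
    (db, object_id, qualifier, go_id, refs, evidence, with_from,
     interacting_taxon, date, assigned_by, cross_products, spliceform_id) = assoc
    gp = products[(db, object_id)]
    # Column 12 describes the spliceform when column 17 is filled in.
    typed = products.get((db, spliceform_id), gp) if spliceform_id else gp
    taxon = "taxon:" + gp["taxon"]
    if interacting_taxon:
        taxon += "|" + interacting_taxon
    return [
        db, object_id, gp["symbol"], qualifier, go_id, refs, evidence, with_from,
        aspect_of.get(go_id, ""), gp.get("name", ""), "|".join(gp.get("synonyms", [])),
        typed["object_type"], taxon, date, assigned_by, cross_products, spliceform_id,
    ]
```

The reverse direction is a pair of column projections over each GAF row, followed by removal of duplicate gene product records.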


GO Database

Some modifications of the database loading scripts would need to be made; the new formats mirror the database structure more closely so it should actually be an improvement.


Groups submitting GO data

Rather than writing out a single GAF file, two will be required: a file containing gene product data, and a file containing annotations. For databases with a GO database-like schema, where annotation information and gene product data are kept in separate tables, this new format will more closely mirror database structure.


Groups using GO data

There are two options here: either we provide files in all formats, or we provide files in one format and supply users with a script to convert the data. The latter would cost the GOC less in processing time and storage space. A gradual transition to the new format, with a period of supplying both old and new files (as was done with the ontology files), is probably the best option.


Any Other Business

What's all this spliceforms / isoforms stuff about?

Please see the documentation on column 17 for more details. Although the proposal for col 17 has been accepted by the GO Consortium, it is not clear how many databases are annotating different isoforms and are hence using col 17.

Comments

GOA: This file is a great idea. We feel that it would greatly benefit our users. However, could we extend the proposed format of this file to include optional attribute-value pairs that describe:


1. DB subset (for GOA this would have values either Swiss-Prot or TrEMBL)

2. Target set member (to indicate if a protein has been prioritized by an annotation project, such as Reference Genomes, the Renal or Cardiovascular annotation projects). This would mean curators could move away from having to fill in multiple Reference Genomes google spreadsheets.

3. Annotation Complete: yes/no (all annotation groups now store such information, but there is currently no export mechanism for this data)

(Edimmer 11:27, 26 January 2010 (UTC))