Gene Product Data File Format (Archived)
File format proposal for a new file of gene product information to be submitted by annotation groups.
File Contents
The file would contain the following pieces of information:
contents | required? | cardinality | GAF 2.0 col # |
---|---|---|---|
DB | required | 1 | 1 |
DB Object ID | required | 1 | 2 |
DB Object Type | required | 1 | 12 |
Taxon | required | 1 | 13 |
DB Object Symbol | required | 1 | 3 |
DB Object Name | optional | 0 or 1 | 10 |
DB Object Synonym(s) | optional | 0 or greater | 11 |
Parent GP ID | blank unless GP is an isoform (see next table) | 0 | n/a |
Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file) | optional | 0+ | n/a |
Spliceforms (see GAF Col17 GeneProducts for more about spliceforms) would have their own entries in this file, with the data as follows:
contents | required? | cardinality | GAF 2.0 col # |
---|---|---|---|
DB | required | 1 | 1 |
DB Object ID | required | 1 | 17 |
DB Object Type | required | 1 | 12 |
Taxon | required | 1 | 13 |
DB Object Symbol | required | 1 | 3 |
DB Object Name | optional | 0 or 1 | 10 |
DB Object Synonym(s) | optional | 0 or greater | 11 |
Parent GP ID | required | 1 | 2 |
Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file) | optional | 0+ | n/a |
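The two tables above can be read as a record type plus a few required-field checks. The following is a minimal sketch only: the class name, attribute names and the `validation_errors` helper are illustrative assumptions, not part of the proposal.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class GeneProduct:
    """One entry in the proposed file (attribute names are hypothetical)."""
    db: str
    object_id: str
    object_type: str
    taxon: str
    symbol: str
    name: Optional[str] = None                          # 0 or 1
    synonyms: List[str] = field(default_factory=list)   # 0 or greater
    parent_id: Optional[str] = None                     # spliceforms only
    xrefs: List[str] = field(default_factory=list)      # 0+


def validation_errors(gp: GeneProduct, is_spliceform: bool = False) -> List[str]:
    """Return the cardinality problems found in one entry."""
    errors = []
    # These five fields are marked "required" in both tables.
    for attr in ("db", "object_id", "object_type", "taxon", "symbol"):
        if not getattr(gp, attr):
            errors.append(f"{attr} is required")
    # Parent GP ID is required for spliceforms and blank otherwise.
    if is_spliceform and not gp.parent_id:
        errors.append("parent_id is required for a spliceform")
    if not is_spliceform and gp.parent_id:
        errors.append("parent_id must be blank unless the entry is an isoform")
    return errors


reaper = GeneProduct("FB", "FBgn0011706", "gene", "7227", "rpr", name="reaper")
isoform = GeneProduct("UniProtKB", "Q4VCS5-1", "protein", "9606", "AMOT_HUMAN")
print(validation_errors(reaper))
print(validation_errors(isoform, is_spliceform=True))
```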
File Format
The data could be presented either as tab-delimited text, like GAF 2.0, or as tag-value pairs, like OBO 1.2 format.
Example data
Example data, tab-delimited
Shown as a table for ease of formatting/display.
DB:DB Object ID | DB Object Type | Taxon | DB Object Symbol | DB Object Name | DB Object Synonym(s) | Parent GP ID | Xrefs |
---|---|---|---|---|---|---|---|
FB:FBgn0011706 | gene | 7227 | rpr | reaper | anon-WO0162936.19\|CG4319\|Reaper\|Reaper L\|rp\|RPR | | UniProtKB:Q24475 |
UniProtKB:Q4VCS5 | protein | 9606 | AMOT_HUMAN | AMOT, KIAA1071: Angiomotin | KIAA1071 | | |
UniProtKB:Q4VCS5-1 | protein | 9606 | AMOT_HUMAN | Isoform 1 of Angiomotin | | UniProtKB:Q4VCS5 | |
UniProtKB:Q4VCS5-2 | protein | 9606 | AMOT_HUMAN | Isoform 2 of Angiomotin | | UniProtKB:Q4VCS5 | |
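A tab-delimited line of this kind could be read roughly as follows. This is a sketch under assumptions the proposal does not fix: the column order is taken from the example table, columns are separated by tabs, multiple synonyms or xrefs are pipe-separated as in GAF, and the `COLUMNS` names and `parse_line` function are made up for illustration.

```python
from typing import Dict, List, Union

# Assumed column order, copied from the example table above.
COLUMNS = ["id", "type", "taxon", "symbol", "name", "synonyms", "parent", "xrefs"]
MULTI_VALUED = {"synonyms", "xrefs"}


def parse_line(line: str) -> Dict[str, Union[str, List[str]]]:
    """Split one tab-delimited line into a dict keyed by column name."""
    cells = line.rstrip("\n").split("\t")
    # Pad lines whose trailing empty columns were omitted.
    cells += [""] * (len(COLUMNS) - len(cells))
    record: Dict[str, Union[str, List[str]]] = {}
    for column, cell in zip(COLUMNS, cells):
        if column in MULTI_VALUED:
            # Pipe-separated values; empty cells become empty lists.
            record[column] = [value for value in cell.split("|") if value]
        else:
            record[column] = cell
    return record


gene_row = "FB:FBgn0011706\tgene\t7227\trpr\treaper\tCG4319|Reaper|RPR\t\tUniProtKB:Q24475"
isoform_row = "UniProtKB:Q4VCS5-2\tprotein\t9606\tAMOT_HUMAN\tIsoform 2 of Angiomotin\t\tUniProtKB:Q4VCS5"
gene = parse_line(gene_row)
isoform = parse_line(isoform_row)
print(gene["synonyms"], gene["xrefs"])
print(isoform["parent"])
```

An empty Parent GP ID cell marks a top-level gene product, while a filled one marks a spliceform.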
Example data, tag-value
Gene product data for reaper in OBO 1.3 (-esque) syntax:
id: FB:FBgn0011706
symbol: rpr
name: reaper
type: gene
taxon: 7227
synonym: anon-WO0162936.19
synonym: CG4319
synonym: Reaper
synonym: Reaper L
synonym: rp
synonym: RPR
xref: UniProtKB:Q24475 SEQ_XREF [modifier to show that this is a seq xref]
For a gene product with several spliceforms, the information could be represented thus:
id: UniProtKB:Q4VCS5
symbol: AMOT_HUMAN
name: AMOT, KIAA1071: Angiomotin
type: protein
taxon: 9606
synonym: KIAA1071

id: UniProtKB:Q4VCS5-1
type: protein
name: Isoform 1 of Angiomotin
parent: UniProtKB:Q4VCS5

id: UniProtKB:Q4VCS5-2
type: protein
name: Isoform 2 of Angiomotin
parent: UniProtKB:Q4VCS5
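Tag-value data like this could be grouped into one stanza per gene product. The sketch below assumes one `tag: value` pair per line and that every stanza starts with an `id:` tag; neither rule is stated in the proposal, and `parse_stanzas` is a hypothetical helper.

```python
from typing import Dict, List


def parse_stanzas(text: str) -> List[Dict[str, List[str]]]:
    """Group tag-value lines into stanzas, starting a new one at each id tag."""
    stanzas: List[Dict[str, List[str]]] = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        # Split on the first ": " only, so values may contain colons.
        tag, _, value = line.partition(": ")
        if tag == "id" or not stanzas:
            stanzas.append({})
        # Tags such as synonym may repeat, so every value is kept in a list.
        stanzas[-1].setdefault(tag, []).append(value)
    return stanzas


text = """\
id: FB:FBgn0011706
symbol: rpr
name: reaper
type: gene
taxon: 7227
synonym: CG4319
synonym: Reaper
xref: UniProtKB:Q24475

id: UniProtKB:Q4VCS5
symbol: AMOT_HUMAN
name: AMOT, KIAA1071: Angiomotin
type: protein
taxon: 9606
synonym: KIAA1071
"""
stanzas = parse_stanzas(text)
print(len(stanzas), stanzas[0]["synonym"])
```

In the spliceform example, the isoform stanzas omit `symbol` and `taxon`. Whether those values are inherited from the parent entry is not specified, so the sketch leaves them absent rather than filling them in.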
Submission and Downloads
Converting between the gene product data file format and the GAF format is simple enough that groups submitting data will be able to provide it in either format, and data consumers will be able to download it in GAF or this format.
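One direction of that conversion can be sketched from the GAF 2.0 column numbers listed in the File Contents tables. The record keys and the `to_gaf_columns` function are hypothetical. The sketch also assumes that IDs carry a `DB:` prefix as in the example data and that GAF column 13 uses the `taxon:` prefix. The annotation-specific columns are left empty.

```python
from typing import Dict, List


def to_gaf_columns(gp: Dict[str, str]) -> List[str]:
    """Place gene product fields into a 17-column GAF 2.0 row."""
    cols = [""] * 17
    parent = gp.get("parent", "")
    # For a spliceform, columns 1-2 describe the parent gene product.
    db, _, local_id = (parent or gp["id"]).partition(":")
    cols[0] = db                           # col 1: DB
    cols[1] = local_id                     # col 2: DB Object ID
    cols[2] = gp["symbol"]                 # col 3: DB Object Symbol
    cols[9] = gp.get("name", "")           # col 10: DB Object Name
    cols[10] = gp.get("synonyms", "")      # col 11: DB Object Synonym(s)
    cols[11] = gp["type"]                  # col 12: DB Object Type
    cols[12] = "taxon:" + gp["taxon"]      # col 13: Taxon
    if parent:
        cols[16] = gp["id"]                # col 17: spliceform ID
    return cols


isoform = {
    "id": "UniProtKB:Q4VCS5-2",
    "parent": "UniProtKB:Q4VCS5",
    "symbol": "AMOT_HUMAN",
    "name": "Isoform 2 of Angiomotin",
    "type": "protein",
    "taxon": "9606",
}
row = to_gaf_columns(isoform)
print("\t".join(row))
```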
Additional Data
Proposed additional data for this file:
Subsets
subset: ref_genome
subset: cardiovascular
Subset membership would indicate whether a gene product has been prioritized by an annotation project, such as the Reference Genomes, Renal, or Cardiovascular annotation projects. Curators would then no longer have to fill in multiple Reference Genomes Google spreadsheets.
This could also be used to denote the database subset a gene product belongs to, e.g. TrEMBL or Swiss-Prot for GOA gene products.
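With subset tags in place, selecting the gene products prioritized by one project becomes a simple filter. The sketch below uses placeholder IDs and invented subset memberships purely for illustration; only the subset names `ref_genome` and `cardiovascular` come from the proposal.

```python
from typing import Dict, List

# Placeholder records; the IDs and memberships are invented for illustration.
records: List[Dict[str, object]] = [
    {"id": "EX:0001", "subsets": ["ref_genome", "cardiovascular"]},
    {"id": "EX:0002", "subsets": ["ref_genome"]},
    {"id": "EX:0003", "subsets": []},
]


def ids_in_subset(entries: List[Dict[str, object]], subset: str) -> List[str]:
    """Return the IDs of all entries tagged with the given subset."""
    return [str(entry["id"]) for entry in entries if subset in entry.get("subsets", [])]


print(ids_in_subset(records, "ref_genome"))
print(ids_in_subset(records, "cardiovascular"))
```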
Annotation complete
annotation_is_complete: <date>
All groups now store such information, but there is currently no export mechanism for this data.
More xrefs
At some point it would be useful to provide cross-references for the same (and related?) objects in other databases, especially given that many users may search with UniProt, RefSeq, or NCBI IDs for gene products from one of the reference genome databases. If the xref is not for the same object (e.g. a protein vs a sequence), the relationship between the two objects should be specified.
More File Formats
New annotation file format proposal
See GOA's GP file format proposal: http://www.ebi.ac.uk/seqdb/confluence/display/GOAP/supplementary+gene+product+information+file