Gene Product Data File Format (Archived)

From GO Wiki

This is a file format proposal for a new file of gene product information, to be submitted by annotation groups.

File Contents

The file would consist of the following pieces of information:

contents | required? | cardinality | GAF 2.0 col #
DB | required | 1 | 1
DB Object ID | required | 1 | 2
DB Object Type | required | 1 | 12
Taxon | required | 1 | 13
DB Object Symbol | required | 1 | 3
DB Object Name | optional | 0 or 1 | 10
DB Object Synonym(s) | optional | 0 or greater | 11
Parent GP ID | blank unless GP is an isoform (see next table) | 0 | n/a
Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file) | optional | 0+ | n/a


Spliceforms (see GAF Col17 GeneProducts for more about spliceforms) would have their own entries in this file, with the data as follows:

contents | required? | cardinality | GAF 2.0 col #
DB | required | 1 | 1
DB Object ID | required | 1 | 17
DB Object Type | required | 1 | 12
Taxon | required | 1 | 13
DB Object Symbol | required | 1 | 3
DB Object Name | optional | 0 or 1 | 10
DB Object Synonym(s) | optional | 0 or greater | 11
Parent GP ID | required | 1 | 2
Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file) | optional | 0+ | n/a
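
The requirements in the two tables above can be checked mechanically. The following is a minimal sketch, assuming each entry has already been read into a Python dict; the key names (`db`, `object_id`, `object_type`, `taxon`, `symbol`, `parent_id`) and the `is_spliceform` flag are hypothetical and not part of the proposal.

```python
# Hypothetical key names for the required columns listed in both tables.
REQUIRED_FIELDS = ("db", "object_id", "object_type", "taxon", "symbol")


def validate_record(record, is_spliceform=False):
    """Return a list of problems found in one gene product entry (a dict)."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not record.get(field):
            problems.append(f"missing required field: {field}")
    parent = record.get("parent_id")
    if is_spliceform and not parent:
        # Second table: Parent GP ID is required for spliceform entries.
        problems.append("spliceform entries require a parent GP ID")
    if not is_spliceform and parent:
        # First table: Parent GP ID stays blank for non-isoform entries.
        problems.append("parent GP ID must be blank unless the entry is a spliceform")
    return problems
```

An empty list means the entry satisfies the required-column rules; optional columns (name, synonyms, xrefs) are not checked in this sketch.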

File Format

The data could either be presented as tab-delimited text, like GAF 2.0, or as tag-value pairs, like OBO 1.2 format.


Example data

Example data, tab-delimited

Shown as a table for ease of formatting/display.

DB:DB Object ID | DB Object Type | Taxon | DB Object Symbol | DB Object Name | DB Object Synonym(s) | Parent GP ID | Xrefs
FB:FBgn0011706 | gene | 7227 | rpr | reaper | anon-WO0162936.19|CG4319|Reaper|Reaper L|rp|RPR |  | UniProtKB:Q24475
UniProtKB:Q4VCS5 | protein | 9606 | AMOT_HUMAN | AMOT, KIAA1071: Angiomotin | KIAA1071 |  |
UniProtKB:Q4VCS5-1 | protein | 9606 | AMOT_HUMAN | Isoform 1 of Angiomotin |  | UniProtKB:Q4VCS5 |
UniProtKB:Q4VCS5-2 | protein | 9606 | AMOT_HUMAN | Isoform 2 of Angiomotin |  | UniProtKB:Q4VCS5 |
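
As a rough illustration of how a consumer might read the tab-delimited form, here is a minimal sketch. It assumes the eight-column order of the example header row, that empty columns are kept as empty fields between tabs, and that synonyms and xrefs are both pipe-separated (the example only shows pipes for synonyms). The function name and dict keys are hypothetical.

```python
# Column order taken from the example header row; key names are illustrative.
COLUMNS = ("id", "object_type", "taxon", "symbol", "name",
           "synonyms", "parent_id", "xrefs")


def parse_row(line):
    """Parse one tab-delimited gene product line into a dict."""
    fields = line.rstrip("\n").split("\t")
    # Pad rows whose trailing empty columns were omitted.
    fields += [""] * (len(COLUMNS) - len(fields))
    row = dict(zip(COLUMNS, fields))
    # The first column combines DB and DB Object ID as "DB:ID".
    db, _, object_id = row["id"].partition(":")
    return {
        "db": db,
        "object_id": object_id,
        "object_type": row["object_type"],
        "taxon": row["taxon"],
        "symbol": row["symbol"],
        "name": row["name"] or None,
        "synonyms": [s for s in row["synonyms"].split("|") if s],
        "parent_id": row["parent_id"] or None,
        # Assumption: multiple xrefs would be pipe-separated like synonyms.
        "xrefs": [x for x in row["xrefs"].split("|") if x],
    }


reaper = parse_row(
    "FB:FBgn0011706\tgene\t7227\trpr\treaper\t"
    "anon-WO0162936.19|CG4319|Reaper|Reaper L|rp|RPR\t\tUniProtKB:Q24475"
)
```

Splitting on the first colon only keeps IDs such as `Q4VCS5-2` intact, and the empty seventh field in the reaper line becomes `parent_id = None`.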

Example data, tag-value

Gene product data for reaper in OBO 1.3 (-esque) syntax:

id: FB:FBgn0011706
symbol: rpr
name: reaper
type: gene
taxon: 7227
synonym: anon-WO0162936.19
synonym: CG4319
synonym: Reaper
synonym: Reaper L
synonym: rp
synonym: RPR
xref: UniProtKB:Q24475 SEQ_XREF [modifier to show that this is a seq xref]

For a gene product with several spliceforms, the information could be represented thus:

id: UniProtKB:Q4VCS5
symbol: AMOT_HUMAN
name: AMOT, KIAA1071: Angiomotin
type: protein
taxon: 9606
synonym: KIAA1071

id: UniProtKB:Q4VCS5-1
type: protein
name: Isoform 1 of Angiomotin
parent: UniProtKB:Q4VCS5

id: UniProtKB:Q4VCS5-2
type: protein
name: Isoform 2 of Angiomotin
parent: UniProtKB:Q4VCS5
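
The tag-value form could be read with a small stanza parser. The sketch below assumes stanzas are separated by blank lines, that each line is `tag: value`, and that `synonym`, `xref` and `subset` are the only repeatable tags; none of this is fixed by the proposal, and the function name is hypothetical.

```python
# Tags assumed to be repeatable within one stanza.
REPEATABLE = {"synonym", "xref", "subset"}


def parse_stanzas(text):
    """Split tag-value text into a list of dicts, one per stanza."""
    records = []
    current = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current stanza, if any.
            if current:
                records.append(current)
                current = {}
            continue
        # Split on the first colon only, so values may contain colons.
        tag, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"not a tag-value line: {line!r}")
        tag, value = tag.strip(), value.strip()
        if tag in REPEATABLE:
            current.setdefault(tag, []).append(value)
        else:
            current[tag] = value
    if current:
        records.append(current)
    return records


SAMPLE = """\
id: UniProtKB:Q4VCS5
symbol: AMOT_HUMAN
name: AMOT, KIAA1071: Angiomotin
type: protein
taxon: 9606
synonym: KIAA1071

id: UniProtKB:Q4VCS5-2
type: protein
name: Isoform 2 of Angiomotin
parent: UniProtKB:Q4VCS5
"""

records = parse_stanzas(SAMPLE)
```

Because only the first colon is treated as the separator, values such as `UniProtKB:Q4VCS5` and `AMOT, KIAA1071: Angiomotin` survive unchanged.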

Submission and Downloads

Converting between the gene product data file format and the GAF format is simple enough that groups submitting data will be able to provide it in either format, and data consumers will be able to download it in GAF or this format.
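
To illustrate the conversion, the sketch below places one gene product entry into the GAF 2.0 column positions given in the tables under File Contents. It fills only the gene product columns and leaves the annotation columns empty. The dict keys are hypothetical, and two details are assumptions based on GAF conventions rather than this proposal: the `taxon:` prefix in column 13, and dropping the `DB:` prefix from the parent ID when it is written to column 2.

```python
GAF_COLUMN_COUNT = 17  # GAF 2.0 has 17 columns


def gp_to_gaf_columns(record):
    """Return a 17-item list with the gene product columns filled in."""
    cols = [""] * GAF_COLUMN_COUNT
    cols[0] = record["db"]                              # col 1: DB
    cols[2] = record["symbol"]                          # col 3: DB Object Symbol
    cols[9] = record.get("name") or ""                  # col 10: DB Object Name
    cols[10] = "|".join(record.get("synonyms", []))     # col 11: synonyms
    cols[11] = record["object_type"]                    # col 12: DB Object Type
    cols[12] = f"taxon:{record['taxon']}"               # col 13: assumed prefix
    parent = record.get("parent_id")
    if parent:
        # Spliceform: own ID goes to col 17, parent ID to col 2.
        cols[16] = f"{record['db']}:{record['object_id']}"
        cols[1] = parent.partition(":")[2] or parent
    else:
        cols[1] = record["object_id"]                   # col 2: DB Object ID
    return cols
```

Joining the returned list with tabs gives a GAF-shaped line; the reverse direction would read the same positions back into a dict.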


Additional Data

Proposed additional data for this file:

Subsets

subset: ref_genome
subset: cardiovascular

Subset membership indicates whether a gene product has been prioritized by an annotation project, such as the Reference Genomes, Renal, or Cardiovascular annotation projects. Curators would then no longer have to fill in multiple Reference Genomes Google spreadsheets.

This could also be used to denote the database subset a gene product belongs to, e.g. TrEMBL or Swiss-Prot for GOA gene products.
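
As a sketch of how subset tags might be consumed, the function below builds an index from subset name to gene product IDs. It assumes entries are dicts with an `id` key and an optional `subset` list, as a tag-value reader might produce; the function name and the second sample entry's subset assignments are purely illustrative.

```python
from collections import defaultdict


def index_by_subset(records):
    """Map each subset name to the IDs of the gene products tagged with it."""
    index = defaultdict(list)
    for record in records:
        # Entries without any subset tag are simply skipped.
        for subset in record.get("subset", []):
            index[subset].append(record["id"])
    return dict(index)


# Illustrative entries only; the subset assignments are made up.
entries = [
    {"id": "FB:FBgn0011706", "subset": ["ref_genome"]},
    {"id": "UniProtKB:Q4VCS5", "subset": ["ref_genome", "cardiovascular"]},
    {"id": "UniProtKB:Q4VCS5-2"},
]
by_subset = index_by_subset(entries)
```

A lookup such as `by_subset["ref_genome"]` would then list every gene product prioritized by that project.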

Annotation complete

annotation_is_complete: <date>

All groups now store such information, but there is currently no export mechanism for this data.

More xrefs

At some point it would be useful to provide cross-references for the same (and related?) objects in other databases, especially given that many users may search with UniProt, RefSeq, or NCBI IDs for gene products from one of the reference genome databases. If the xref is not for the same object (e.g. a protein vs a sequence), the relationship between the two objects should be specified.

More File Formats

New annotation file format proposal

See GOA's GP file format proposal: http://www.ebi.ac.uk/seqdb/confluence/display/GOAP/supplementary+gene+product+information+file