Core Consortium annotation activities - 2012

From GO Wiki
Revision as of 04:37, 24 April 2012 by Edimmer (talk | contribs)

Activities to which contributing GO annotation groups must minimally agree

  • Supply an appropriately formatted GAF file with GO annotations that is syntactically valid and meets the GO Consortium data requirements as specified here: GAF:2.0. The GOC commits to backward compatibility with this format.
  • All annotations must be to UniProtKB identifiers located in the UniProt Reference Proteome Files, unless the group agrees to provide complete gp2protein, gp2rna and unlocalised_gp files, where appropriate (see below).
  • All groups must support the transition to using ECO identifiers in future annotation formats; this could be future versions of GAF or the GPAD format. ECO identifiers agreed as appropriate for usage in GO annotations may be more or less specific than the current GO evidence code list.
  • Annotation groups are not necessarily committing to ongoing updates of their annotations. For non-recurring submissions, or for submissions from groups that are no longer active annotation providers, responsibility for corrections and updates reverts to the GOC.
  • Please see the guidance provided here for full details on how new GO annotation efforts are supported by the GO Consortium.
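As an illustration of what "syntactically valid" can mean for the first requirement above, the following is a minimal sketch of a per-line check, not the GOC's own validator. It assumes the GAF 2.0 layout of 17 tab-separated columns with `!`-prefixed header and comment lines; the function name `check_gaf_line` is hypothetical, and the GAF 2.0 specification remains the authority on what is required.

```python
import re

# Column numbers (1-based) that GAF 2.0 marks as required.
REQUIRED_COLUMNS = {
    1: "DB", 2: "DB Object ID", 3: "DB Object Symbol", 5: "GO ID",
    6: "DB:Reference", 7: "Evidence Code", 9: "Aspect",
    12: "DB Object Type", 13: "Taxon", 14: "Date", 15: "Assigned By",
}


def check_gaf_line(line):
    """Return a list of problems found in one GAF 2.0 annotation line."""
    if line.startswith("!"):
        return []  # header and comment lines are not annotations
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 17:
        return [f"expected 17 tab-separated columns, found {len(cols)}"]
    # Report every required column that was left empty.
    problems = [
        f"column {n} ({name}) is empty"
        for n, name in REQUIRED_COLUMNS.items()
        if not cols[n - 1]
    ]
    if cols[4] and not re.fullmatch(r"GO:\d{7}", cols[4]):
        problems.append("column 5 is not a GO identifier")
    if cols[8] and cols[8] not in {"P", "F", "C"}:
        problems.append("column 9 (Aspect) must be P, F or C")
    if cols[13] and not re.fullmatch(r"\d{8}", cols[13]):
        problems.append("column 14 (Date) must be YYYYMMDD")
    return problems
```

A check like this only covers syntax. It does not confirm that the identifiers exist in the UniProt Reference Proteome files or that the evidence codes are permitted.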

Complete identifier mapping files

Complete gp2protein file

The file must meet the gp2protein format specification.

The gp2protein-mapping file must contain the full list of all protein-encoding genes in the respective organism (or community), including those proteins not annotated to GO.

The first column contains all gene or gene product identifiers (these are typically MOD-specific identifiers) and the second column contains mappings to canonical identifiers. Protein-coding genes must map to UniProtKB identifiers (Swiss-Prot in preference; if not, then TrEMBL). If identifiers are truly unavailable in UniProtKB, then NCBI identifiers (NP_ and XP_) are permissible.

If an annotation group is fully satisfied with the identifier mapping from an external identifier type to UniProtKB accessions, as supplied by the UniProt Knowledgebase cross-references, then UniProtKB is willing to take on the responsibility of supplying the external id -> UniProtKB mapping to the GO Consortium.
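The two-column mapping rule described in this section could be checked along the following lines. This is a sketch under stated assumptions rather than the official format check: it assumes tab-separated columns, `!`-prefixed comment lines, `;` between multiple targets, and a `UniProtKB:` prefix on UniProtKB targets. The names `classify_target` and `check_gp2protein` are hypothetical; the gp2protein format specification is authoritative.

```python
def classify_target(target):
    """Classify a mapped identifier as 'uniprot', 'refseq' or 'other'."""
    if target.startswith("UniProtKB:"):
        return "uniprot"
    # Drop any database prefix (assumed to end at the last colon).
    accession = target.rpartition(":")[2]
    if accession.startswith(("NP_", "XP_")):
        return "refseq"
    return "other"


def check_gp2protein(lines):
    """Yield (line_number, message) for rows that break the mapping rules."""
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("!"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) != 2:
            yield number, "expected two tab-separated columns"
            continue
        targets = [t for t in fields[1].split(";") if t]
        if not targets:
            yield number, "no mapped identifier"
        for target in targets:
            if classify_target(target) == "other":
                yield number, f"unexpected target identifier: {target}"
```

Calling `list(check_gp2protein(open(path)))` on a file would return every row that needs attention, where `path` stands for the group's own mapping file.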

Complete gp2rna file

If your annotation file includes ncRNAs, then the gp2rna file must include all ncRNA-encoding genes currently identified in the genome build, including those ncRNAs not annotated to GO.

Functional ncRNAs must map to NCBI identifiers (NR_ or XR_) if available; leave the mapping blank if unavailable.
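The ncRNA rule differs from the protein rule in that a blank mapping is acceptable. A minimal sketch of a row check follows, assuming a tab-separated row whose second column may be empty or absent and whose database prefix ends at the last colon. The function name `check_gp2rna_row` is hypothetical, and the gp2rna format page is the authority on the actual layout.

```python
def check_gp2rna_row(line):
    """Return None if the row is acceptable, otherwise a short message."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) > 2 or not fields[0]:
        return "expected an identifier and at most one mapping column"
    target = fields[1] if len(fields) == 2 else ""
    if not target:
        return None  # a blank mapping is allowed when no NCBI identifier exists
    accession = target.rpartition(":")[2]
    if accession.startswith(("NR_", "XR_")):
        return None
    return f"ncRNA target is not an NR_ or XR_ accession: {target}"
```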

gp2rna format

Complete gp_unlocalized file

If your database supplies gene identifiers that have been manually curated from the literature, but for which no sequence or genomic location is known (such genes have been variously described as 'unlocalised genes', 'single heritable traits' or 'phenotypic orphans'), then you should additionally supply a complete gp_unlocalized file.

This file should contain all non-genome localized gene identifiers available, including those not annotated to GO.

gp_unlocalised file format

Macromolecular complexes

If the annotation file includes macromolecular complexes as the subject of the annotation then no corresponding entry is required for the gp2protein file – only gene or gene product mappings should be included.

Updates of identifier mapping files

Groups must regularly update their gp2protein or gp2rna file (e.g. in response to UniProt-GOA feedback on inclusion of obsolete/secondary UniProtKB accessions in a group’s gp2protein, or obsoletion of NCBI identifiers).
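One way a group might act on such feedback is to list the rows of its mapping file that still point at retired accessions. The sketch below assumes the group already holds a set of obsolete or secondary accessions (for example, taken from UniProt-GOA feedback); how that set is obtained is outside its scope. The function name `stale_mappings` and the row layout (tab-separated, `;` between targets, `!` for comments) are assumptions.

```python
def stale_mappings(mapping_lines, retired_accessions):
    """Return {source_id: [targets whose accession is in retired_accessions]}."""
    stale = {}
    for line in mapping_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("!"):
            continue  # skip blank lines and comments
        source, _, targets = line.partition("\t")
        hits = [
            t for t in targets.split(";")
            if t.rpartition(":")[2] in retired_accessions
        ]
        if hits:
            stale[source] = hits
    return stale
```

The returned dictionary gives the rows to revise before the next submission. Rows without a retired target are left out.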


Groups that provide authoritative files for a species, or that are funded by the GO NIH grant, should also see the description of additional GO annotation activities for core GO Consortium members.