Identifiers

From GO Wiki
Revision as of 10:52, 21 April 2008 by Cjm

All identifiers in GO should be composed from a two-part (binary) key as follows:

 GlobalID = Database ':' LocalID

The LocalID scheme is under the control of the Database. A LocalID should contain no whitespace or non-ASCII characters.

Examples of well-behaved IDs:

  • GO:0008152
    • Database=GO LocalID=0008152
  • SGD:S000006435
    • Database=SGD LocalID=S000006435
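
The composition rule can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and splitting at the first colon assumes that a Database abbreviation never contains a colon itself.

```python
from typing import Tuple


def is_valid_local_id(local_id: str) -> bool:
    """True if the LocalID is non-empty, ASCII-only and free of whitespace."""
    return (
        bool(local_id)
        and local_id.isascii()
        and not any(ch.isspace() for ch in local_id)
    )


def compose_global_id(database: str, local_id: str) -> str:
    """GlobalID = Database ':' LocalID"""
    if not is_valid_local_id(local_id):
        raise ValueError(f"badly formed LocalID: {local_id!r}")
    return f"{database}:{local_id}"


def split_global_id(global_id: str) -> Tuple[str, str]:
    """Split at the FIRST colon (assumption: no colon in the Database part)."""
    database, sep, local_id = global_id.partition(":")
    if not sep or not database or not local_id:
        raise ValueError(f"not a GlobalID: {global_id!r}")
    return database, local_id


print(compose_global_id("GO", "0008152"))   # GO:0008152
print(split_global_id("SGD:S000006435"))    # ('SGD', 'S000006435')
```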

In the gene association files, Database goes in column 1 and LocalID goes in column 2.

When filling in the WITH column, the Global ID should be used. Otherwise it would be difficult to tell which database the ID came from.
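
As a rough sketch of how a consumer might rebuild the Global ID from a gene association file, the snippet below joins columns 1 and 2 of a tab-delimited row. The function name is hypothetical, and the sample row is the ZFIN one quoted elsewhere on this page, written with explicit tabs.

```python
def global_id_from_gaf_row(line: str) -> str:
    """Join column 1 (Database) and column 2 (LocalID) of a GAF row."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < 2:
        raise ValueError("a GAF row needs at least two tab-separated columns")
    return f"{columns[0]}:{columns[1]}"


row = "ZFIN\tZDB-GENE-980526-166\tshha"
print(global_id_from_gaf_row(row))  # ZFIN:ZDB-GENE-980526-166
```

The same Global ID form is what should appear in the WITH column, so that the source database is always explicit.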

The Database should be registered in GO.xrf_abbs, available here:

Spec:

Also available in .obo format here:

It is strongly recommended that the primary abbreviation is always used in constructing the ID. However, in some contexts it is allowable to use a database abbreviation synonym.

For example, the following is allowed:

  • UniProt:Q09212

However, the following is strongly preferred:

  • UniProtKB:Q09212
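
One way to enforce the preferred form is to map synonyms to the primary abbreviation before an ID is stored. The sketch below is an assumption-laden example: the function name is hypothetical, and the table is hard-coded with the single pair mentioned above, whereas a real table would be loaded from GO.xrf_abbs.

```python
# Hypothetical synonym table; in practice this would be built from GO.xrf_abbs.
SYNONYM_TO_PRIMARY = {
    "UniProt": "UniProtKB",
}


def to_primary_abbreviation(global_id: str) -> str:
    """Rewrite the Database part of a GlobalID to its primary abbreviation."""
    database, sep, local_id = global_id.partition(":")
    if not sep:
        raise ValueError(f"not a GlobalID: {global_id!r}")
    primary = SYNONYM_TO_PRIMARY.get(database, database)
    return f"{primary}:{local_id}"


print(to_primary_abbreviation("UniProt:Q09212"))    # UniProtKB:Q09212
print(to_primary_abbreviation("UniProtKB:Q09212"))  # UniProtKB:Q09212
```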

Problems with existing usage

FLYBASE vs FB

We have both FB and FlyBase registered here. Also, in the FlyBase gene_association files, col1 is FB but the assigned_by column is FlyBase. NCBI seems to use FLYBASE.

Josh has been alerted and is bringing this up with FlyBase.

MGI and RGD IDs

MGI and RGD IDs are a major problem.

GAF (cols 1-3):

 MGI     MGI:98297       Shh

Using the concatenation rule, this composes to the global ID:

  • MGI:MGI:98297

Here the MGI prefix is doubled up. Note that this is inconsistent with NCBI xrefs, which use the single-prefix form (inconsistency is bad!):

  • MGI:98297
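
A simple check for this situation, sketched under the assumption that a doubled prefix always shows up as the LocalID starting with the Database abbreviation plus a colon (the function name is hypothetical):

```python
def has_doubled_prefix(database: str, local_id: str) -> bool:
    """True if the LocalID already begins with 'Database:'."""
    return local_id.startswith(database + ":")


print(has_doubled_prefix("MGI", "MGI:98297"))             # True
print(has_doubled_prefix("ZFIN", "ZDB-GENE-980526-166"))  # False
```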

Note also that people have typically used the local ID in their WITH columns; e.g. the rat file has:

 RGD     RGD:3673        Shh             GO:0001525      RGD:1580654     ISS     MGI:98297       P       sonic hedgehog homolog (Drosophila)             gene    taxon:10116     20060820        RGD 

Compare with the (well-behaved) ZFIN GAFs and IDs:

 ZFIN    ZDB-GENE-980526-166     shha

col1:col2 =

  • ZFIN:ZDB-GENE-980526-166

This is identical to what NCBI uses in their xref (this is good!)

RGD uses the same pattern as MGI.

Recommendation

MGI and RGD should either:

  • change col2 in their GAFs so that only the number is used, or
  • coordinate with other databases, including NCBI, to make it clear that the global ID is MGI:MGI:nnnnn
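
For illustration, the first option could look roughly like the following rewrite of column 2 in a tab-delimited GAF row. The function name is hypothetical, and this is not something MGI or RGD are known to run; it only shows what the change would amount to.

```python
def normalize_gaf_row(line: str) -> str:
    """Drop a redundant 'Database:' prefix from column 2 of a GAF row."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < 2:
        raise ValueError("a GAF row needs at least two tab-separated columns")
    prefix = columns[0] + ":"
    if columns[1].startswith(prefix):
        columns[1] = columns[1][len(prefix):]
    return "\t".join(columns)


fixed = normalize_gaf_row("MGI\tMGI:98297\tShh")
print(fixed.split("\t")[:2])  # ['MGI', '98297']
```

With column 2 reduced to the number, col1:col2 composes to MGI:98297, which matches the form used in NCBI xrefs. Rows that are already well behaved, such as the ZFIN example, pass through unchanged.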