Identifiers: Difference between revisions

From GO Wiki
Line 1: Line 1:
[[Category:Curator Guides]] [[Category:Annotation]] [[Category:Formats]]
[[Category:Curator Guides]] [[Category:Annotation]] [[Category:Formats]]


All identifiers in GO (e.g. referenced in GAF, ontology or associated documents) should be composed from a binary key as follows:
All identifiers in GO (e.g. referenced in GAF, ontology or associated documents) should be conform to the following format:


   GlobalID = Database ':' LocalID
   GlobalID = Prefix ':' LocalID


The LocalID scheme is under the control of the Database. It should include no whitespace or non-ascii characters.
We use the term 'identifier' as synonymous with GlobalID, although confusingly some use 'identifier' to refer to what we call LocalID.
 
* Prefixes: Each prefix must be registered in the GO xrfs registry file. It should correspond to some known authority. The characters must all be alphanumeric (a-z, A-Z, 0-9), underscores or dashes.
* The LocalID scheme is under the control of the Database/authorit. It should include no whitespace or non-ascii characters. The characters must all be printable ascii characters, excluding spaces.
 
== Examples ==


Examples of well behaved IDs:
Examples of well behaved IDs:
Line 15: Line 20:
* ZFIN:ZDB-GENE-980526-166
* ZFIN:ZDB-GENE-980526-166
** Database=ZFIN    LocalID=ZDB-GENE-980526-166
** Database=ZFIN    LocalID=ZDB-GENE-980526-166
== Identifiers in GAFs ==


In the gene association files, Database goes in column 1, LocalID goes in column 2
In the gene association files, Database goes in column 1, LocalID goes in column 2


For filling in the WITH column, the ''Global'' ID should be used. This has to be the case, otherwise it would be difficult to tell where the ID came from
For filling in the WITH column, the ''Global'' ID should be used. This has to be the case, otherwise it would be difficult to tell where the ID came from
== Prefix Registry ==


The Database '''should''' be registered in GO.xrf_abbs, available here:
The Database '''should''' be registered in GO.xrf_abbs, available here:
Line 28: Line 37:


* http://www.geneontology.org/doc/GO.xrf_abbs_spec
* http://www.geneontology.org/doc/GO.xrf_abbs_spec
Also available in .obo format here:
* http://www.berkeleybop.org/ontologies/obo-all/go_xrf_metadata/go_xrf_metadata.obo
It is strongly recommended that the ''primary'' abbreviation is always used in constructing the ID. However, in some contexts it is allowable to use a database abbreviation synonym.
For example, the following is allowed:
* UniProt:Q09212
However, the following is strongly preferred:
* UniProtKB:Q09212


== Problems with existing usage ==
== Problems with existing usage ==

Revision as of 16:21, 8 December 2014


All identifiers in GO (e.g. those referenced in GAF, ontology or associated documents) should conform to the following format:

 GlobalID = Prefix ':' LocalID

We use the term 'identifier' as synonymous with GlobalID, although confusingly some use 'identifier' to refer to what we call LocalID.

  • Prefixes: Each prefix must be registered in the GO xrefs registry file (GO.xrf_abbs). It should correspond to some known authority. The characters must all be alphanumeric (a-z, A-Z, 0-9), underscores or dashes.
  • LocalIDs: The LocalID scheme is under the control of the Database/authority. The characters must all be printable ASCII characters, with no whitespace.
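The two character rules above can be checked mechanically. The following is a minimal sketch, not an official GO tool: the function name parse_global_id is hypothetical, "printable ASCII excluding spaces" is assumed to mean code points 0x21 to 0x7E, and registration of the prefix in the registry file is not checked.

```python
import re

# Prefix: alphanumerics, underscores or dashes.
# LocalID: printable ASCII without spaces (assumed here to be 0x21-0x7E).
GLOBAL_ID_RE = re.compile(r"([A-Za-z0-9_-]+):([\x21-\x7E]+)")


def parse_global_id(global_id):
    """Split a GlobalID into (prefix, local_id); raise ValueError if malformed.

    The prefix cannot contain ':', so the split happens at the first colon.
    This checks syntax only; it does not consult the prefix registry.
    """
    match = GLOBAL_ID_RE.fullmatch(global_id)
    if match is None:
        raise ValueError(f"not a well-formed GlobalID: {global_id!r}")
    return match.group(1), match.group(2)


if __name__ == "__main__":
    for example in ("GO:0008152", "SGD:S000006435", "ZFIN:ZDB-GENE-980526-166"):
        print(example, "->", parse_global_id(example))
```

Because the split is made at the first colon, a LocalID may itself contain a colon.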

Examples

Examples of well behaved IDs:

  • GO:0008152
    • Database=GO LocalID=0008152
  • SGD:S000006435
    • Database=SGD LocalID=S000006435
  • ZFIN:ZDB-GENE-980526-166
    • Database=ZFIN LocalID=ZDB-GENE-980526-166

Identifiers in GAFs

In the gene association files, Database goes in column 1 and LocalID goes in column 2.

When filling in the WITH column, the GlobalID should be used. This has to be the case, as otherwise it would be difficult to tell where the ID came from.
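As an illustration of the column rule, the sketch below composes a GlobalID from the first two columns of one GAF line. The function name global_id_from_gaf_line is hypothetical, and the line is assumed to be a tab-separated annotation row rather than a comment or header line.

```python
def global_id_from_gaf_line(line):
    """Compose Database ':' LocalID from columns 1 and 2 of a GAF row.

    Assumes a tab-separated annotation row; comment and header lines
    are not handled in this sketch.
    """
    columns = line.rstrip("\n").split("\t")
    if len(columns) < 2:
        raise ValueError("expected at least two tab-separated columns")
    database, local_id = columns[0], columns[1]
    return f"{database}:{local_id}"


if __name__ == "__main__":
    row = "ZFIN\tZDB-GENE-980526-166\tshha"
    print(global_id_from_gaf_line(row))
```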

Prefix Registry

The Database should be registered in GO.xrf_abbs, available here:

Spec:

Problems with existing usage

FLYBASE vs FB

We have both FB and FlyBase registered here. Also, in the FlyBase gene_association files, column 1 is FB but the assigned_by column is FlyBase. NCBI seem to use FLYBASE.

Josh has been alerted and is bringing this up with FlyBase.

MGI

MGI IDs are a major problem.

GAF (cols 1-3):

 MGI     MGI:98297       Shh

Using the concatenation rule, this composes to the global ID

  • MGI:MGI:98297

Here we have a doubling up of the MGI prefix. Note that this is inconsistent with NCBI xrefs, which use MGI:98297. This inconsistency has yet to be resolved with NCBI.

Note also that people typically used the local ID in their WITH columns; e.g. rat has:

 RGD     RGD:3673        Shh             GO:0001525      RGD:1580654     ISS     MGI:98297       P       sonic hedgehog homolog (Drosophila)             gene    taxon:10116     20060820        RGD 

(NOTE: RGD have fixed this. Thanks RGD!)

Compare with the (well-behaved) ZFIN GAFs and IDs:

 ZFIN    ZDB-GENE-980526-166     shha

col1:col2 =

  • ZFIN:ZDB-GENE-980526-166

This is identical to what NCBI uses in their xref.

MGI has confirmed that the global ID is MGI:MGI:nnnnn, and the local internal ID is MGI:nnnnn.
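The doubled prefix described in this section can be spotted by checking whether column 2 already starts with the column 1 value followed by a colon. This is only a sketch for auditing files; the function name has_doubled_prefix is hypothetical and the check is not part of any GO specification.

```python
def has_doubled_prefix(database, local_id):
    """Return True if local_id already begins with 'database:'.

    Such rows compose to a GlobalID with a repeated prefix,
    e.g. MGI + MGI:98297 gives MGI:MGI:98297.
    """
    return local_id.startswith(database + ":")


if __name__ == "__main__":
    rows = [("MGI", "MGI:98297"), ("ZFIN", "ZDB-GENE-980526-166")]
    for database, local_id in rows:
        flag = has_doubled_prefix(database, local_id)
        print(f"{database}:{local_id}", "doubled" if flag else "ok")
```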

RGD

RGD previously used the same pattern as MGI. As of 2008/06/23 they have confirmed their policy and fixed their files. RGD:nnnn is the global ID. The local ID is purely a number (for both genes and references)

Recommendation

MGI should either:

  • change column 2 in their GAFs such that only the number is used (PREFERRED)
  • coordinate with other databases, including NCBI, to make it clear that the global ID is MGI:MGI:nnnnn

See Also