Identifiers: Difference between revisions

From GO Wiki
Revision as of 20:50, 1 March 2021


All identifiers in GO (e.g. referenced in GAF, ontology or associated documents) should conform to the following format:

 GlobalID = Prefix ':' LocalID
  • Prefixes (aka Database): Each prefix must be registered in the GO xrefs registry metadata YAML file (see below). It should correspond to some known authority. The characters must all be alphanumeric (a-z, A-Z, 0-9), underscores or dashes.
  • LocalIDs: The LocalID scheme is under the control of the Database/authority. The characters must all be printable ASCII characters, excluding spaces.

We use the term 'identifier' as synonymous with GlobalID, although confusingly some use 'identifier' to refer to what we call LocalID. Our concept of GlobalID corresponds to the W3C standard for CURIEs (see below).
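
The format rules above can be sketched as a small syntax check. This is a minimal illustration only: the function name parse_global_id is hypothetical, and the check covers the character rules only, not whether the prefix is actually registered.

```python
import re

# Prefix: alphanumerics, underscores or dashes.
# LocalID: printable ASCII excluding space, i.e. '!' (0x21) through '~' (0x7E).
GLOBAL_ID_RE = re.compile(r"([A-Za-z0-9_-]+):([!-~]+)")


def parse_global_id(global_id):
    """Split a GlobalID into (prefix, local_id), or raise ValueError.

    Hypothetical helper: it checks syntax only and does not consult
    the prefix registry.
    """
    match = GLOBAL_ID_RE.fullmatch(global_id)
    if match is None:
        raise ValueError(f"not a well-formed GlobalID: {global_id!r}")
    return match.group(1), match.group(2)


print(parse_global_id("GO:0008152"))
print(parse_global_id("ZFIN:ZDB-GENE-980526-166"))
```

Because a prefix cannot contain a colon, the split always happens at the first colon; any later colons belong to the LocalID.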


Examples

Examples of well-behaved IDs:

  • GO:0008152
    • Database=GO LocalID=0008152
  • SGD:S000006435
    • Database=SGD LocalID=S000006435
  • ZFIN:ZDB-GENE-980526-166
    • Database=ZFIN LocalID=ZDB-GENE-980526-166
  • FlyBase:FBgn0011293
    • Database=FlyBase LocalID=FBgn0011293

Identifiers in GAFs

IMPORTANT NOTE ON THE GAF FORMAT

In the gene association (GAF) files, the global ID is split across two columns: Database goes in column 1 and LocalID goes in column 2.

When filling in the WITH column, the Global ID should be used; otherwise it would be difficult to tell where the ID came from.

Note that the GPAD and GPI formats use a single column containing the GlobalID.
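
As a rough sketch of the column rule for GAF, the GlobalID of the annotated object can be recomposed from the first two tab-separated columns. The function name gaf_global_id is hypothetical, and the sample line contains only the first three columns of a GAF row.

```python
def gaf_global_id(gaf_line):
    """Return the GlobalID of the annotated object on one GAF line.

    Hypothetical helper: joins column 1 (Database) and column 2 (LocalID)
    with a colon. GAF lines are tab-delimited.
    """
    columns = gaf_line.rstrip("\n").split("\t")
    database, local_id = columns[0], columns[1]
    return f"{database}:{local_id}"


# First three columns of a ZFIN GAF row.
line = "ZFIN\tZDB-GENE-980526-166\tshha"
print(gaf_global_id(line))  # ZFIN:ZDB-GENE-980526-166
```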

Prefix Registry

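The Database should be registered in db-xrefs.yaml, available here:

  • https://github.com/geneontology/go-site/blob/master/metadata/db-xrefs.yaml
  • http://amigo.geneontology.org/xrefs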

Our prefix registry is coordinated with identifiers.org and the prefixcommons project: https://github.com/prefixcommons/biocontext

Identifiers, CURIEs and URIs

The GO uses semantic web standards such as OWL and RDF. In these standards, URIs are used to uniquely identify ontology terms, genes and associated provenance entities such as publications.

In order to reconcile URIs with the identifier scheme used in formats such as OBO and GAF, and with how we display identifiers in publications and portals such as AmiGO, we treat GlobalIDs as Compact URIs (CURIEs):

  • https://en.wikipedia.org/wiki/CURIE
  • http://www.w3.org/TR/curie/

We assume a constant set of prefixes. For all ontologies used in GO, we assume that these have an OBO library PURL URI, so we have an implicit set of prefix declarations:

  @prefix GO: http://purl.obolibrary.org/obo/GO_
  @prefix CL: http://purl.obolibrary.org/obo/CL_
  @prefix CHEBI: http://purl.obolibrary.org/obo/CHEBI_
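
To illustrate how these prefix declarations relate a CURIE to a URI, here is a minimal sketch. The names PREFIX_MAP, expand_curie and contract_uri are hypothetical, and the map holds only the three prefixes listed above.

```python
# Prefix declarations copied from the list above.
PREFIX_MAP = {
    "GO": "http://purl.obolibrary.org/obo/GO_",
    "CL": "http://purl.obolibrary.org/obo/CL_",
    "CHEBI": "http://purl.obolibrary.org/obo/CHEBI_",
}


def expand_curie(curie, prefix_map=PREFIX_MAP):
    """Expand Prefix:LocalID to a full URI (KeyError if the prefix is unknown)."""
    prefix, local_id = curie.split(":", 1)
    return prefix_map[prefix] + local_id


def contract_uri(uri, prefix_map=PREFIX_MAP):
    """Contract a URI back to a CURIE, or raise ValueError if no prefix matches."""
    for prefix, base in prefix_map.items():
        if uri.startswith(base):
            return f"{prefix}:{uri[len(base):]}"
    raise ValueError(f"no known prefix for URI: {uri!r}")


print(expand_curie("GO:0008152"))
# http://purl.obolibrary.org/obo/GO_0008152
print(contract_uri("http://purl.obolibrary.org/obo/CL_0000000"))
# CL:0000000
```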

Problems with existing usage

The term "identifier" is used in different ways, which causes confusion. Our concept of GlobalID corresponds to W3C CURIEs and should be unambiguous.

MGI and prefix doubling

MGI IDs are a major problem, because the value in column 2 of the MGI GAF already begins with the MGI prefix.

GAF (cols 1-3):

 MGI     MGI:98297       Shh

Using the concatenation rule, this composes to the global ID

  • MGI:MGI:98297

Here we have a doubling up of the MGI prefix.

Note that NCBI previously used 'non-doubled' global identifiers of the form MGI:nnn, but they are now switching to the doubled form MGI:MGI:nnn; see the article "Important change coming for HGNC and MGI database identifiers" (Wed, 06 Aug 2014) at http://www.ncbi.nlm.nih.gov/feed/rss.cgi?ChanKey=genenews

Compare with the (well-behaved) ZFIN GAFs and IDs.

GAF (cols 1-3):

 ZFIN    ZDB-GENE-980526-166     shha

Using the concatenation rule (col1:col2), this composes to the global ID

  • ZFIN:ZDB-GENE-980526-166

This is identical to what NCBI uses in their xrefs.
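
The difference between the MGI and ZFIN rows can be checked mechanically. The following is only a sketch; has_doubled_prefix is a hypothetical helper, not part of any GO tooling.

```python
def has_doubled_prefix(database, local_id):
    """True if the col2 value already starts with the col1 prefix and a colon."""
    return local_id.startswith(database + ":")


# (col1, col2) pairs taken from the GAF rows shown in this section.
rows = [("MGI", "MGI:98297"), ("ZFIN", "ZDB-GENE-980526-166")]
for database, local_id in rows:
    global_id = f"{database}:{local_id}"
    print(global_id, has_doubled_prefix(database, local_id))
```

For the MGI row the check is true and the composed ID is MGI:MGI:98297; for the ZFIN row it is false.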

MGI previously confirmed that the global ID is MGI:MGI:nnnnn, and the local internal ID is MGI:nnnn (but this seems to have changed).

Recommendation: MGI should either

  • change col2 in their GAFs so that only the number is used (PREFERRED), or
  • coordinate with other databases, including NCBI, to make it clear that the global ID is MGI:MGI:nnnnn


RGD

RGD previously used the same pattern as MGI. As of 2008/06/23 they have confirmed their policy and fixed their files. RGD:nnnn is the global ID. The local ID is purely a number (for both genes and references)

FlyBase vs FB

Previously there was confusion as to whether to use FB or FlyBase (or FLYBASE).

In the GO registry, FB is the canonical prefix, but we include FlyBase as a synonym.

Historically, "flybase" was used in identifiers.org, but this has now been resolved to "fb", consistent with GO and other resources.

NCBI seems to use FLYBASE.
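
A consumer could normalize synonym prefixes to the canonical one along these lines. This is a sketch under stated assumptions: the PREFIX_SYNONYMS table is hard-coded here as a stand-in for the registry and holds only the FlyBase synonym mentioned above, and the function names are hypothetical.

```python
# Stand-in for the registry: synonym prefix -> canonical prefix.
# Only the synonym named in this section is included.
PREFIX_SYNONYMS = {"FlyBase": "FB"}


def canonical_prefix(prefix, synonyms=PREFIX_SYNONYMS):
    """Map a synonym prefix to its canonical form; others pass through unchanged."""
    return synonyms.get(prefix, prefix)


def canonicalize(global_id):
    """Rewrite a GlobalID so that it uses the canonical prefix."""
    prefix, local_id = global_id.split(":", 1)
    return f"{canonical_prefix(prefix)}:{local_id}"


print(canonicalize("FlyBase:FBgn0011293"))  # FB:FBgn0011293
print(canonicalize("ZFIN:ZDB-GENE-980526-166"))  # unchanged
```

The lookup is an exact, case-sensitive match, so a variant such as FLYBASE passes through unchanged unless it is added to the table.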

See Also

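  • OBO Format ID syntax: http://oboformat.googlecode.com/svn/trunk/doc/obo-syntax.html#2.5
  • OBO Foundry ID policy: http://www.obofoundry.org/id-policy.shtml
  • McMurry et al. 2017, Identifiers for the 21st century: How to design, provision, and reuse persistent identifiers to maximize utility and impact of life science data: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5490878/pdf/pbio.2001414.pdf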