Identifiers
Latest revision as of 12:40, 23 May 2022
All identifiers in GO (e.g. referenced in GAF, ontology or associated documents) should conform to the following format:
GlobalID = Prefix ':' LocalID
- Prefixes (aka Database): Each prefix must be registered in the GO xrefs registry metadata yaml file (see below). It should correspond to some known authority. The characters must all be alphanumeric (a-z, A-Z, 0-9), underscores or dashes.
- The LocalID scheme is under the control of the Database/authority. The characters must all be printable ascii characters, excluding spaces.
We use the term 'identifier' as synonymous with GlobalID, although confusingly some use 'identifier' to refer to what we call LocalID. Our concept of GlobalID corresponds with the W3C standard for CURIEs (see below).
Examples
Examples of well-behaved IDs:
- GO:0008152
- Database=GO LocalID=0008152
- SGD:S000006435
- Database=SGD LocalID=S000006435
- ZFIN:ZDB-GENE-980526-166
- Database=ZFIN LocalID=ZDB-GENE-980526-166
- FlyBase:FBgn0011293
- Database=FlyBase LocalID=FBgn0011293
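The format rules above can be checked mechanically. The following is a minimal sketch, not an official GO validator: the function name `split_global_id` and the two regular expressions are illustrative assumptions derived only from the character rules stated above. It does not check that the prefix is actually registered.

```python
import re

# Prefix: alphanumeric characters, underscores or dashes (rule stated above).
PREFIX_RE = re.compile(r"[A-Za-z0-9_-]+")
# LocalID: printable ASCII excluding spaces, i.e. code points 0x21 to 0x7E.
LOCAL_ID_RE = re.compile(r"[\x21-\x7e]+")


def split_global_id(global_id):
    """Split a GlobalID at its first colon and return (prefix, local_id).

    Raises ValueError if either part breaks the character rules.
    Hypothetical helper: registry membership is not checked here.
    """
    prefix, sep, local_id = global_id.partition(":")
    if not sep:
        raise ValueError(f"no ':' separator in {global_id!r}")
    if not PREFIX_RE.fullmatch(prefix):
        raise ValueError(f"invalid prefix in {global_id!r}")
    if not LOCAL_ID_RE.fullmatch(local_id):
        raise ValueError(f"invalid LocalID in {global_id!r}")
    return prefix, local_id


for example in ["GO:0008152", "SGD:S000006435",
                "ZFIN:ZDB-GENE-980526-166", "FlyBase:FBgn0011293"]:
    database, local_id = split_global_id(example)
    print(f"Database={database} LocalID={local_id}")
```

Splitting at the first colon is safe because a prefix cannot contain a colon, whereas a LocalID may.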
Identifiers in GAFs
IMPORTANT NOTE ON THE GAF FORMAT
In the gene association (GAF) files, the global ID is split across two columns: Database goes in column 1, LocalID goes in column 2.
When filling in the WITH column, the GlobalID should be used. Otherwise it would be difficult to tell where the ID came from.
Note that in GPAD and GPI formats, a single column with GlobalID is used.
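As a rough illustration of the column split, the sketch below rebuilds the GlobalID from the first two tab-separated columns of a GAF line. The function name `gaf_global_id` is an assumption made for this example, and the sample line is abbreviated to its first three columns; a real GAF line has more.

```python
def gaf_global_id(gaf_line):
    """Compose the GlobalID from columns 1 and 2 of a tab-separated GAF line."""
    cols = gaf_line.rstrip("\n").split("\t")
    if len(cols) < 2:
        raise ValueError("a GAF line needs at least two columns")
    database, local_id = cols[0], cols[1]
    return f"{database}:{local_id}"


# Abbreviated example row: Database, LocalID, symbol.
print(gaf_global_id("ZFIN\tZDB-GENE-980526-166\tshha"))
```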
Prefix Registry
The Database should be registered in db-xrefs.yaml, available here:
- https://github.com/geneontology/go-site/blob/master/metadata/db-xrefs.yaml
- http://amigo.geneontology.org/xrefs
Our prefix registry is coordinated with identifiers.org and the prefixcommons project: https://github.com/prefixcommons/biocontext
Identifiers, CURIEs and URIs
The GO uses semantic web standards such as OWL and RDF. In these standards, URIs are used to uniquely identify ontology terms, genes and associated provenance entities such as publications.
To reconcile URIs with the identifier scheme used in formats such as obo and GAF, and with how we display identifiers in publications and portals such as AmiGO, we conceive of GlobalIDs as *Compact URIs* (CURIEs).
We assume a constant set of prefixes. For all ontologies used in GO, we assume that these have an OBO library PURL URI, so we have an implicit set of prefix declarations:
@prefix GO: http://purl.obolibrary.org/obo/GO_
@prefix CL: http://purl.obolibrary.org/obo/CL_
@prefix CHEBI: http://purl.obolibrary.org/obo/CHEBI_
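To make the CURIE reading concrete, here is a small sketch of expanding a GlobalID to a URI and contracting it back, using only the three prefix declarations listed above. The names `PREFIX_MAP`, `expand_curie` and `contract_uri` are illustrative assumptions, not part of any GO tooling.

```python
# Prefix declarations copied from the list above.
PREFIX_MAP = {
    "GO": "http://purl.obolibrary.org/obo/GO_",
    "CL": "http://purl.obolibrary.org/obo/CL_",
    "CHEBI": "http://purl.obolibrary.org/obo/CHEBI_",
}


def expand_curie(curie, prefix_map=PREFIX_MAP):
    """Expand a GlobalID (CURIE) into a full URI using the prefix map."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or prefix not in prefix_map:
        raise KeyError(f"no prefix declaration for {curie!r}")
    return prefix_map[prefix] + local_id


def contract_uri(uri, prefix_map=PREFIX_MAP):
    """Contract a URI back into a GlobalID, if a declared base matches."""
    for prefix, base in prefix_map.items():
        if uri.startswith(base):
            return f"{prefix}:{uri[len(base):]}"
    raise KeyError(f"no prefix declaration matches {uri!r}")


uri = expand_curie("GO:0008152")
print(uri)
print(contract_uri(uri))
```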
Problems with existing usage
The term "identifier" is used in different ways which causes confusion. Our concept of GlobalID corresponds with W3C CURIEs and should be unambiguous.
MGI and prefix doubling
MGI IDs are a major problem, because the ID that MGI places in column 2 of its GAFs already begins with 'MGI:'.
GAF (cols 1-3):
MGI MGI:98297 Shh
Using the concatenation rule, this composes to the global ID
- MGI:MGI:98297
Here we have a doubling up of the MGI prefix.
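A doubled prefix of this kind is easy to detect programmatically. The check below is only a sketch under the assumption that doubling always takes the form of column 2 starting with the column 1 value followed by a colon; the function name `has_doubled_prefix` is hypothetical.

```python
def has_doubled_prefix(database, local_id):
    """Return True if the LocalID already starts with 'Database:'."""
    return local_id.startswith(database + ":")


# Columns 1 and 2 from the GAF rows discussed in this section.
for database, local_id in [("MGI", "MGI:98297"),
                           ("ZFIN", "ZDB-GENE-980526-166")]:
    global_id = f"{database}:{local_id}"
    print(global_id, "doubled" if has_doubled_prefix(database, local_id) else "ok")
```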
Note that NCBI previously used 'non-doubled' global identifiers of the form MGI:nnn, but they are now switching to the doubled form MGI:MGI:nnn; see the article "Important change coming for HGNC and MGI database identifiers" from Wed, 06 Aug 2014 at http://www.ncbi.nlm.nih.gov/feed/rss.cgi?ChanKey=genenews
Compare with the (well-behaved) ZFIN GAFs and IDs:
ZFIN ZDB-GENE-980526-166 shha
(example col1,2,3 in GAF)
col1:col2 =
- ZFIN:ZDB-GENE-980526-166
This is identical to what NCBI uses in their xref.
MGI previously confirmed that the global ID is MGI:MGI:nnnnn, and the local internal ID is MGI:nnnn (but this seems to have changed)
Recommendation:
MGI should either:
- change their col2 in their GAFs such that only the number is used (PREFERRED)
- coordinate with other databases, including NCBI to make it clear that the global ID is MGI:MGI:nnnnn
FlyBase vs FB
In the GO registry, FB is the canonical prefix, and FlyBase is included as a synonym. As of May 23, 2022:
- identifiers.org uses "FB"
- Bioregistry uses "FlyBase"
- NCBI uses "FLYBASE"
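When IDs from different sources need to be compared, one option is to map prefix synonyms to the canonical prefix first. The sketch below hard-codes the single synonym stated above (FlyBase for FB); the table `PREFIX_SYNONYMS` and the function `canonical_global_id` are assumptions for illustration, and in practice the synonyms would come from the registry file rather than from a literal dictionary.

```python
# Hypothetical synonym table holding only the mapping stated above.
PREFIX_SYNONYMS = {"FlyBase": "FB"}


def canonical_global_id(global_id, synonyms=PREFIX_SYNONYMS):
    """Rewrite a GlobalID so that its prefix is the canonical one."""
    prefix, sep, local_id = global_id.partition(":")
    if not sep:
        raise ValueError(f"no ':' separator in {global_id!r}")
    # Prefixes without a known synonym entry are returned unchanged.
    return f"{synonyms.get(prefix, prefix)}:{local_id}"


print(canonical_global_id("FlyBase:FBgn0011293"))
print(canonical_global_id("FB:FBgn0011293"))
```

The lookup is case-sensitive, so a spelling such as "FLYBASE" would pass through unchanged unless it were added to the table.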