Identifiers
Revision as of 16:37, 8 December 2014
All identifiers in GO (e.g. referenced in GAF, ontology or associated documents) should conform to the following format:
GlobalID = Prefix ':' LocalID
We use the term 'identifier' as synonymous with GlobalID, although confusingly some use 'identifier' to refer to what we call LocalID.
- Prefixes: Each prefix must be registered in the GO xrefs registry file. It should correspond to some known authority. The characters must all be alphanumeric (a-z, A-Z, 0-9), underscores or dashes.
- LocalIDs: The LocalID scheme is under the control of the Database/authority. The characters must all be printable ASCII characters, excluding spaces (no whitespace and no non-ASCII characters).
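The format rules above can be sketched as a small validator. This is a minimal illustration, not an official GO tool: the function name `split_global_id` is hypothetical, and splitting at the first colon is an assumption that follows from prefixes not being allowed to contain a colon. It does not check whether the prefix is actually registered.

```python
import re

# Prefix: alphanumerics, underscores or dashes (per the rule above).
PREFIX_RE = re.compile(r"[A-Za-z0-9_-]+")
# LocalID: printable ASCII excluding space, i.e. '!' (0x21) through '~' (0x7E).
LOCAL_ID_RE = re.compile(r"[\x21-\x7E]+")


def split_global_id(global_id):
    """Split 'Prefix:LocalID' at the first colon and validate both parts.

    Hypothetical helper; registry membership of the prefix is NOT checked.
    """
    prefix, sep, local_id = global_id.partition(":")
    if not sep:
        raise ValueError(f"no ':' separator in {global_id!r}")
    if not PREFIX_RE.fullmatch(prefix):
        raise ValueError(f"invalid prefix in {global_id!r}")
    if not LOCAL_ID_RE.fullmatch(local_id):
        raise ValueError(f"invalid LocalID in {global_id!r}")
    return prefix, local_id
```

Because the split happens at the first colon only, a LocalID may itself contain a colon and still parse.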
Examples
Examples of well behaved IDs:
- GO:0008152
- Database=GO LocalID=0008152
- SGD:S000006435
- Database=SGD LocalID=S000006435
- ZFIN:ZDB-GENE-980526-166
- Database=ZFIN LocalID=ZDB-GENE-980526-166
Identifiers in GAFs
In the gene association files (GAFs), the Database goes in column 1 and the LocalID goes in column 2.
For filling in the WITH column, the GlobalID should be used. This is required because a bare LocalID does not indicate which database it came from.
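As an illustration of the column convention, the sketch below composes a GlobalID from the first two columns of a GAF row. It assumes tab-separated columns; the function name `gaf_global_id` is hypothetical.

```python
def gaf_global_id(line):
    """Compose 'Database:LocalID' from columns 1 and 2 of one GAF row.

    Hypothetical helper; assumes the row is tab-separated.
    """
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 2:
        raise ValueError("a GAF row needs at least two columns")
    database, local_id = cols[0], cols[1]
    return f"{database}:{local_id}"


row = "ZFIN\tZDB-GENE-980526-166\tshha"
print(gaf_global_id(row))
```

The composed string is the form that should appear wherever a full GlobalID is expected, such as the WITH column.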
Prefix Registry
The Database should be registered in GO.xrf_abbs, available here:
Spec:
We are in the process of moving the primary version of this file to a YAML file with primary location in a GitHub repo:
- https://github.com/kltm/go-site/blob/master/metadata/go-db-xrefs.yaml
Identifiers, CURIES and URIs
The GO uses semantic web standards such as OWL and RDF. In these standards, URIs are used to uniquely identify ontology terms, genes and associated provenance entities such as publications.
In order to reconcile URIs with the identifier scheme used in formats such as obo and GAF, and with how we display identifiers in publications and portals such as AmiGO, we conceive of GlobalIDs as *Compact URIs* (CURIEs).
- https://en.wikipedia.org/wiki/CURIE
- http://www.w3.org/TR/curie/
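To make the CURIE idea concrete, here is a minimal sketch of expanding a GlobalID to a URI and contracting it back. The prefix map is an assumption for illustration only: it contains a single entry using the OBO PURL base for GO, whereas a real mapping would be derived from the prefix registry. The names `PREFIX_MAP`, `expand_curie` and `contract_uri` are hypothetical.

```python
# Illustrative prefix map (assumption): only GO, using the OBO PURL base.
PREFIX_MAP = {
    "GO": "http://purl.obolibrary.org/obo/GO_",
}


def expand_curie(curie, prefix_map=PREFIX_MAP):
    """Expand 'Prefix:LocalID' to a full URI using the prefix map."""
    prefix, sep, local_id = curie.partition(":")
    if not sep:
        raise ValueError(f"not a CURIE: {curie!r}")
    if prefix not in prefix_map:
        raise ValueError(f"unknown prefix: {prefix!r}")
    return prefix_map[prefix] + local_id


def contract_uri(uri, prefix_map=PREFIX_MAP):
    """Contract a URI back to 'Prefix:LocalID' if a known base matches."""
    for prefix, base in prefix_map.items():
        if uri.startswith(base):
            return f"{prefix}:{uri[len(base):]}"
    raise ValueError(f"no known prefix for {uri!r}")


print(expand_curie("GO:0008152"))
```

The round trip only works if each prefix maps to exactly one base URI, which is one reason a single registered prefix per database matters.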
Problems with existing usage
FLYBASE vs FB
We have both FB and FlyBase registered here. Also, in the FlyBase gene_association files, col1 is FB but the assigned_by column is FlyBase. NCBI seems to use FLYBASE.
Josh has been alerted and is bringing this up with FlyBase.
MGI and prefix doubling
MGI IDs are a major problem
GAF (cols 1-3):
MGI MGI:98297 Shh
Using the concatenation rule, this composes to the global ID
- MGI:MGI:98297
Here we have a doubling up of the MGI prefix.
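A check for this situation can be sketched as follows. This is an illustrative helper under the assumption that doubling means column 2 already starts with the column 1 prefix followed by a colon; the name `has_doubled_prefix` is hypothetical.

```python
def has_doubled_prefix(database, local_id):
    """Return True if col2 already starts with 'col1:' (prefix doubling)."""
    return local_id.startswith(database + ":")


# Rows taken from the examples in this document: (col1, col2).
rows = [("MGI", "MGI:98297"), ("ZFIN", "ZDB-GENE-980526-166")]
for database, local_id in rows:
    global_id = f"{database}:{local_id}"
    print(global_id, has_doubled_prefix(database, local_id))
```

Running a check like this over column 1 and column 2 of a GAF would flag every row whose composed global ID repeats the prefix.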
Note that NCBI previously used 'non-doubled' global identifiers of the form MGI:nnn, but they are now switching to the doubled form MGI:MGI:nnn. See the article from Wed, 06 Aug 2014, "Important change coming for HGNC and MGI database identifiers", at http://www.ncbi.nlm.nih.gov/feed/rss.cgi?ChanKey=genenews
Compare with the (well-behaved) ZFIN GAFs and IDs:
ZFIN ZDB-GENE-980526-166 shha
(example col1,2,3 in GAF)
col1:col2 =
- ZFIN:ZDB-GENE-980526-166
This is identical to what NCBI uses in their xref.
MGI previously confirmed that the global ID is MGI:MGI:nnnnn, and that the local internal ID is MGI:nnnn.
RGD
RGD previously used the same pattern as MGI. As of 2008/06/23 they have confirmed their policy and fixed their files. RGD:nnnn is the global ID. The local ID is purely a number (for both genes and references)
Recommendation
MGI should either:
- change col2 in their GAFs so that only the number is used (PREFERRED), or
- coordinate with other databases, including NCBI, to make it clear that the global ID is MGI:MGI:nnnnn
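The preferred option could be sketched as a row-level rewrite that strips the redundant prefix from column 2. This is only an illustration of the idea, assuming tab-separated rows; the function name `normalize_gaf_row` is hypothetical and this is not how MGI produces its files.

```python
def normalize_gaf_row(line):
    """Strip a redundant 'col1:' prefix from col2 of a tab-separated GAF row.

    Hypothetical helper; rows without a doubled prefix are returned unchanged.
    """
    cols = line.rstrip("\n").split("\t")
    if len(cols) >= 2 and cols[1].startswith(cols[0] + ":"):
        cols[1] = cols[1][len(cols[0]) + 1:]
    return "\t".join(cols)


print(normalize_gaf_row("MGI\tMGI:98297\tShh"))
```

After this rewrite, the concatenation rule col1:col2 yields MGI:98297 rather than MGI:MGI:98297, while rows that were already well behaved (such as the ZFIN example) pass through untouched.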