Annotation Extension: Difference between revisions
==Annotation Examples==
* [http://wiki.geneontology.org/index.php/Annotation_Extension:_Capturing_cell_and_tissue_types Annotation Extension field: Adding spatial location information to a GO annotation]
* [http://wiki.geneontology.org/index.php/Annotation_Extension:_Capturing_participants Annotation Extension field: Adding specific substrates, products or targets into a GO annotation]
* [http://wiki.geneontology.org/index.php/LEGO-style_annotation_ideas Data that cannot be captured by the current annotation format]
==Annotation Extension Meetings==
*[[Annotation Extension meeting 2014-06-16|June 16]]
Revision as of 04:11, 6 October 2015
Introduction
Each GO annotation pairs a single gene product identifier with a single term from the ontology. This format is very powerful; however, it can also limit how fully a specific instance of a function or subcellular location can be described, because there must be a pre-existing term in the ontology that provides full details of the specific aspects of the function.
It is not always possible to create individual terms that precisely describe the context of each activity (e.g. the cellular or anatomical location, the dependency on other processes, or particular, specific protein targets).
It is less restrictive if the annotator is able to combine additional terms in a single annotation to provide a more detailed functional description for an individual gene product.
This page describes the Annotation Extension field (column 16) of the Gene Association File (GAF) 2.0 format, which allows GO terms to be further specified using gene product or chemical identifiers, or terms from GO or external OBO ontologies.
When an annotator chooses to do this, they are effectively creating an "on-the-fly" cross-product term. We say "on-the-fly" because the combinatorial term is not added to the ontology (although it could be at a later stage, if the ontology editors choose to create the appropriate GO term).
When should a curator use the Annotation Extension field instead of requesting a new GO term?
The primary way to provide more fine-grained annotations is by requesting more specific terms. For example, it is reasonable to create sub-types of the general term “apoptotic process”, such as “anoikis” (definition: 'Apoptosis triggered by inadequate or inappropriate adherence to substrate e.g. after disruption of the interactions between normal epithelial cells and the extracellular matrix').
Terms should be requested via the ontology SourceForge tracker or via TermGenie.
However, highly detailed terms can inflate the ontology and lead to a GO term creation bottleneck. In addition, it can be more efficient for an annotator to create the entire annotation statement in one step during the curation of a paper, rather than in two steps that require making a term request and then going back later to include the new GO term in the annotation.
GO editors will regularly review the contents of the annotation extension field in submitted annotation files and create new, more specific terms if they feel enough annotations exist to warrant a pre-composed term. This effort will be assisted in future by automated methods to reason over annotations enhanced with filled annotation extension fields, to ensure the annotations are consistently grouped by an appropriate common GO term class.
1. GO term requests should not be made when the curator would like to describe activities or locations that are not evidently mechanistically or compositionally distinct from an existing GO term
Example
Unsuitable GO term: "regulation of Sonic hedgehog transcription from RNA polymerase II promoter"
There is no specific regulator for Sonic hedgehog transcription which is separable from general regulation of transcription from RNA polymerase II.
Curators should therefore be advised to use the existing term "regulation of transcription from RNA polymerase II promoter" (GO:0006357) and capture any further specifics (such as the Ensembl identifier for Sonic hedgehog) in the annotation_extension field.
2. Term requests should not be made for specific extensions which are outside the scope of GO.
- Specific protein substrates or products of an enzyme
Example:
Gene Name (col 2) | GO ID (col 5) | Reference (col 6) | Annotation Extension (col 16) |
---|---|---|---|
NEK | GO:0004672 protein kinase activity | PMID:10880350 | has_direct_input(UniProtKB:P36873 PPP1CC) |
- Specific chemical substrates for a catalytic or transporter activity are often considered trivial; curators are therefore advised to discuss the appropriateness of a new term with the ontology editors.
- N.B. The text colored green in the annotation examples on this page is not present in the annotation file, but is used here to improve the readers' understanding of annotation examples.
The basic format
An annotated GO term can be enhanced in the annotation format by one or more 'relationship(identifier)' pairs added into the annotation extension field (column 16).
The aim of the information added into the annotation extension field is to refine the GO term identifier entered into the GO_ID field (Column 5) of the annotation file.
For example, if a gene product Slp1 is localized to the plasma membrane of T-cells, the Gene Association File (GAF) would look like this (most columns omitted for brevity):
Gene Name (col 2) | GO ID (col 5) | Reference (col 6) | Annotation Extension (col 16) |
---|---|---|---|
Slp1 | GO:0005886 plasma membrane | PMID:1234567 | part_of(CL:0000084 T cell) |
Here CL:0000084 is the identifier for T cell in the OBO Cell Type (CL) Ontology.
All relationships used in the annotation extension field should be valid relationships (i.e. not obsolete) described in the gorel.obo file.
Identifiers can originate from GO or another ontology or database. All identifiers must be prefixed by the appropriate database namespace (e.g. UniProtKB:, CHEBI:, CL:, UBERON:), and all database namespaces must be listed in the GO Database Abbreviations file.
Annotation Extension Format:
relationship_type(database namespace:identifier)
Example:
requires_direct_regulator(UniProtKB:O43236)
Only one identifier can be referenced by one relationship.
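The format above can be checked mechanically. The following is a minimal sketch in Python, assuming that a relationship name consists of letters, digits and underscores and that an identifier has the form namespace:local_id. The function name parse_pair and the regular expression are illustrative only and are not part of any GO tool.

```python
import re

# Assumed pattern: relationship_type(namespace:identifier) and nothing else.
# Commas and pipes are excluded so that a second identifier is rejected.
PAIR_RE = re.compile(
    r"^(?P<rel>[A-Za-z_][A-Za-z0-9_]*)"
    r"\((?P<ns>[^:()\s,|]+):(?P<local>[^()\s,|]+)\)$"
)

def parse_pair(text):
    """Split one 'relationship(identifier)' pair into (relationship, namespace, local id)."""
    match = PAIR_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a valid relationship(identifier) pair: {text!r}")
    return match.group("rel"), match.group("ns"), match.group("local")

print(parse_pair("requires_direct_regulator(UniProtKB:O43236)"))
# → ('requires_direct_regulator', 'UniProtKB', 'O43236')
```

A pair that names two identifiers, such as part_of(CL:0000084,CL:0000182), raises ValueError in this sketch, which mirrors the rule that one relationship references one identifier.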
Defining the appropriate relationship(identifier) pair for an annotation using Domain and Range parameters
Relationship Domains
The Domain tag in a relationship's stanza refers to the type of GO identifier present in the GO_ID field (column 5 of GAF 2.0) of an annotation. Some relationships should not be used in annotations that use a GO identifier from a particular aspect or branch of the GO.
For some relationships, top-level domain restrictions apply; in these cases, top-level terms from the Basic Formal Ontology (BFO) are used:
domain: BFO:0000007 ! process
Should be interpreted as a union of the Biological Process and Molecular Function ontologies.
domain: BFO:0000001 ! entity
Should be interpreted as a union of the Biological Process, Molecular Function and Cellular Component ontologies (i.e. all GO identifiers can be used).
domain: BFO:0000004 ! independent continuant
Should be interpreted as the Cellular Component ontology.
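The three interpretations above amount to a small lookup table. A minimal Python sketch follows, assuming the single-letter aspect codes P, F and C for Biological Process, Molecular Function and Cellular Component; the names BFO_DOMAIN_TO_ASPECTS and domain_allows_aspect are hypothetical.

```python
# Aspect codes assumed here: P = Biological Process, F = Molecular Function,
# C = Cellular Component.
BFO_DOMAIN_TO_ASPECTS = {
    "BFO:0000007": {"P", "F"},       # process
    "BFO:0000001": {"P", "F", "C"},  # entity
    "BFO:0000004": {"C"},            # independent continuant
}

def domain_allows_aspect(domain_id, aspect):
    """Return True if a GO term from `aspect` fits a relationship whose domain is `domain_id`."""
    allowed = BFO_DOMAIN_TO_ASPECTS.get(domain_id)
    if allowed is None:
        raise KeyError(f"no top-level mapping known for {domain_id}")
    return aspect in allowed

print(domain_allows_aspect("BFO:0000007", "C"))  # → False
print(domain_allows_aspect("BFO:0000004", "C"))  # → True
```

Domains given as GO identifiers (rather than BFO terms) are not covered by this table; they require a lookup in the ontology itself.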
Relationship Ranges
The Range tag in a relationship's stanza refers to the types of database namespaces that can be appropriately used.
These Range values can be used as an additional check on the correctness of the relationship(identifier) pair.
In the Relations ontology, the range information is provided as named classes, each identified by an ENTITY_UNION identifier.
Example:
range: ENTITY_UNION:0000003 ! gene or gene product
ENTITY_UNION identifiers are defined in separate stanzas in the go_extension_rels.obo file:
id: ENTITY_UNION:0000003
name: gene or gene product
def: "The union of gene, RNA and protein entities." [GOC:ecd]
union_of: SO:0000704 ! gene
union_of: SO:0000673 ! transcript
union_of: PR:000000001 ! protein
Specific appropriate database identifiers can be identified using the 'entity_type' tags located in the GO References file:
abbreviation: UniProtKB
database: Universal Protein Knowledgebase
description: A central repository of protein sequence and function created by joining the information contained in Swiss-Prot, TrEMBL, and PIR databases
entity_type: PR:000000001 ! protein
Examples of appropriate and inappropriate usage of relationship(identifier) pairs
Many relationships have been restricted according to the type of term they can appropriately describe, and also the types of values that can be assigned.
Example from the go_annotation_extension_relations.obo file:
[Typedef]
id: has_regulation_target
name: has_regulation_target
def: "Identifies a gene or gene product affected by a regulation BP." [GOC:mah]
xref: GOREL:0000015
domain: GO:0065007 ! biological regulation
range: ENTITY_UNION:0000003 ! gene or gene product
The above extract of the go_annotation_extension_relations.obo file indicates that the has_regulation_target relationship can only correctly be applied in annotations which have used a GO term in column 5 that is a descendant of 'GO:0065007 biological regulation'.
In addition, only gene or gene product identifiers should be used as values for this relationship, as indicated by the 'range' tag.
Therefore, while this annotation is correct:
Gene Name (col 2) | GO ID (col 5) | Reference (col 6) | Evidence (col 7) | Annotation Extension (col 16) |
---|---|---|---|---|
DAOA | GO:1900758 negative regulation of D-amino-acid oxidase activity | PMID:21679769 | IDA | has_regulation_target(UniProtKB:P14920 D-amino-acid oxidase) |
The following annotations, although containing appropriate information for the annotation extension field, use the has_regulation_target relationship inappropriately, because their use conflicts with the Domain and Range values defined for the relation above.
Gene Name (col 2) | GO ID (col 5) | Reference (col 6) | Evidence (col 7) | Annotation Extension (col 16) |
---|---|---|---|---|
PEX19 | GO:0072662 protein localization to peroxisome | PMID:18782765 | IMP | has_regulation_target(UniProtKB:Q9Y3D6 Mitochondrial fission 1 protein) |
The example above is not acceptable because the GO term is not a descendant of 'biological regulation'.
Gene Name (col 2) | GO ID (col 5) | Reference (col 6) | Evidence (col 7) | Annotation Extension (col 16) |
---|---|---|---|---|
C-C motif chemokine 24 | GO:0008360 regulation of cell shape | PMID:10072545 | IDA | has_regulation_target(CL:0000771 eosinophil) |
The example above is not acceptable because cell type ontology (CL) identifiers are not included within the Range scope of the has_regulation_target relation.
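
The three examples above can be reproduced with a small domain and range check. The following Python sketch hard-codes only the data quoted on this page; the ancestor table is a hypothetical stand-in for a real ontology lookup and lists only the one ancestor that matters here. All names (RELATION, GO_ANCESTORS, check_annotation and so on) are illustrative assumptions.

```python
# Toy data copied from the stanzas and examples above; a real check would
# load the relations file and the ontology instead of hard-coding them.
RELATION = {
    "name": "has_regulation_target",
    "domain": "GO:0065007",           # biological regulation
    "range": "ENTITY_UNION:0000003",  # gene or gene product
}
ENTITY_UNIONS = {
    "ENTITY_UNION:0000003": {"SO:0000704", "SO:0000673", "PR:000000001"},
}
# Namespace -> entity_type, as given by the entity_type tag for UniProtKB.
NAMESPACE_ENTITY_TYPE = {"UniProtKB": "PR:000000001"}
# Hypothetical, abbreviated ancestor lookup for the three GO terms used above.
GO_ANCESTORS = {
    "GO:1900758": {"GO:0065007"},
    "GO:0008360": {"GO:0065007"},
    "GO:0072662": set(),  # not a descendant of biological regulation
}

def check_annotation(go_id, namespace):
    """Return a list naming which restrictions ('domain', 'range') are violated."""
    problems = []
    ancestors = GO_ANCESTORS.get(go_id, set())
    if go_id != RELATION["domain"] and RELATION["domain"] not in ancestors:
        problems.append("domain")
    entity_type = NAMESPACE_ENTITY_TYPE.get(namespace)
    if entity_type not in ENTITY_UNIONS[RELATION["range"]]:
        problems.append("range")
    return problems

print(check_annotation("GO:1900758", "UniProtKB"))  # → []
print(check_annotation("GO:0072662", "UniProtKB"))  # → ['domain']
print(check_annotation("GO:0008360", "CL"))         # → ['range']
```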
Format for referencing multiple relationship(identifier) pairs
Only one identifier can be referenced by one relationship. Therefore, to include multiple 'relationship(identifier)' pairs in an annotation_extension field, the pairs must be separated with commas ',' or pipes '|'.
Very simply, the pipe can be interpreted as meaning "or" and the comma as meaning "and".
Use of the Pipe to separate Annotation Extension values
The current annotation format guidelines state that, because the inclusion of data in the annotation extension field is entirely optional for the correct interpretation of an annotation, two annotations should not exist that differ only in the contents of their annotation extension field.
Therefore where a gene product carries out its activity in different places or under different circumstances, then multiple annotation extension 'relationship(identifier)' pairs should be added into the annotation extension field of the same annotation and be separated from each other with a pipe. This format indicates to the user that the different relationship(identifier) pairs are making completely independent statements, and it would be equally correct to represent the annotation extension data in separate annotation lines.
Example
Where a gene product can act catalytically on any of a number of different substrates, but not all at the same time, the different 'relationship(identifier)' pairs should be separated using pipes:
Gene Name (col 2) | GO ID (col 5) | Reference (col 6) | Evidence (col 7) | Annotation Extension (col 16) |
---|---|---|---|---|
Nos2 | GO:0005777 peroxisome | PMID:12085352 | IDA | part_of(CL:0000182 hepatocyte)|part_of(CL:0000091 Kupffer cell) |
Interpretation:
Nos2 has been observed by direct assay to be located in the peroxisome of hepatocytes, and also in the peroxisome of Kupffer cells.
An annotation using multiple, pipe-separated 'relationship(identifier)' pairs conveys the same information as two separate annotation lines, each with one of the relationship(identifier) pairs:
Gene Name (col 2) | GO ID (col 5) | Reference (col 6) | Evidence (col 7) | Annotation Extension (col 16) |
---|---|---|---|---|
Nos2 | GO:0005777 peroxisome | PMID:15127951 | IDA | part_of(CL:0000182 hepatocyte) |
Nos2 | GO:0005777 peroxisome | PMID:15127951 | IDA | part_of(CL:0000091 Kupffer cell) |
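
The equivalence between one pipe-separated line and several single-pair lines can be sketched in Python as follows. The dictionary keys and the function name expand_pipes are hypothetical; a real GAF line has more columns than are shown here.

```python
def expand_pipes(row):
    """Return one copy of `row` per pipe-separated part of its 'extension' value."""
    return [dict(row, extension=part) for part in row["extension"].split("|")]

row = {
    "gene": "Nos2",
    "go_id": "GO:0005777",
    "evidence": "IDA",
    "extension": "part_of(CL:0000182)|part_of(CL:0000091)",
}
for expanded in expand_pipes(row):
    print(expanded["gene"], expanded["go_id"], expanded["extension"])
```

Each resulting row repeats every other column unchanged, which is why the pipe-separated form and the multi-line form carry the same information.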
Use of the Comma to separate multiple Annotation Extension values
Commas enable curators to create "compound" annotation extensions. This format is applied where a combination of 'relationship(identifier)' pairs supplies a complex, detailed description of the context or specific nature of an instance of a function/location.
Example:
Gene Name (col 2) | GO ID (col 5) | Reference (col 6) | Evidence (col 7) | Annotation Extension (col 16) |
---|---|---|---|---|
TMEM115 | GO:0005634 nucleus | PMID:17973242 | IDA | part_of(CL:0000066 epithelial cell),part_of(UBERON:0004801 cervix epithelium) |
Interpretation:
TMEM115 is located in the nucleus that is part of an epithelial cell (CL:0000066) that is part of the cervix epithelium (UBERON:0004801).
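Taken together, the pipe ("or") and comma ("and") rules give column 16 a two-level structure. A minimal Python sketch of reading it, assuming that identifiers themselves never contain commas or pipes; the function name parse_extension is hypothetical.

```python
def parse_extension(column_16):
    """Return a list of alternatives; each alternative is a list of pairs that apply together."""
    if not column_16:
        return []
    return [group.split(",") for group in column_16.split("|")]

print(parse_extension("part_of(CL:0000066),part_of(UBERON:0004801)"))
# → [['part_of(CL:0000066)', 'part_of(UBERON:0004801)']]
print(parse_extension("part_of(CL:0000182)|part_of(CL:0000091)"))
# → [['part_of(CL:0000182)'], ['part_of(CL:0000091)']]
```

The outer list holds independent statements, and each inner list holds the pairs that jointly describe a single context.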
Annotation Examples