Annotation Cross Products old page (Retired): Difference between revisions
Revision as of 12:55, 22 March 2019
Information here is being moved to the main Annotation Extension wiki page
Each GO annotation refers to a single term from the ontology. This restricts annotators in what they can say - there must be a pre-existing term in the ontology, or one must be requested. It would be far less restrictive if the annotator could combine additional terms in a single annotation. These terms could even come from other OBO ontologies.
This page describes the proposed new column 16 in the GAF, which allows additional terms to be specified to extend the meaning of an annotation. If and when an annotator chooses to do this, they are effectively creating an "on-the-fly" cross product term. We say "on-the-fly" because the combinatorial term is not added to the ontology (although it could be at a later stage, if the ontology editors choose to do so).
This proposal owes a lot to the MGI structured notes internal field in the MGD database.
External Ontologies required
Only ontologies committed to the principles of the OBO Foundry (http://obofoundry.org) should be included.
- CHEBI : Chemical Entities
- CL : Cell ontology
- taxon-centric anatomy ontologies (AOs):
- ZFA (zebrafish)
- MA (adult mouse)
- FMA (human)
- XAO (xenopus)
- FBbt (fly)
- WBbt (worm)
- (add others here)
Use Cases
Function and Process co-annotation
Molecular functions are always executed in the context of a biological process (in a cellular location)
At the moment, we "weakly" co-annotate function and process, but there is no way of knowing which functions go with which processes. A gene G may be annotated to F1, F2, F3 and P1, P2, P3. It may be the case that F1 and P3 never go together, or that when G executes F2 it is always in the context of P2.
Annotators need a way of saying on a per-annotation basis that a F is executed in the context of P.
Example:
F1: protein serine/threonine/tyrosine kinase activity
P1: peptidyl-tyrosine phosphorylation
P2: positive regulation of protein kinase activity
P3: positive regulation of small GTPase mediated signal transduction
F1: sequence-specific DNA binding
lots of Ps, one of which is 'negative regulation of transcription from RNA polymerase II promoter'.
Note that this is complementary to the project to link the process and function ontologies. The inter-ontology links could be used as aids to annotators.
Immune System regulation terms: BP and CL
(see email thread from Evelyn on GO list, "another immune related query GO and CL")
Chicken IL-10 is secreted from, e.g., macrophages, but causes 'negative regulation of interferon gamma biosynthesis' in chicken splenocytes.
TODO: need help refining this use case. It was decided that splenocytes were not a great example
Subcellular localisation (CC) within a specific type of cell (CL)
- Toll-like receptor 4 (TLR4) (O00206) is located intracellularly in the perinuclear region (GO:0048471) only in immature DC, PMID:15027902
- TLR4 is located on the cell surface (GO:0005887) in monocytes, PMID:15027902
Evelyn's comments: So protein localisation is cell type specific and for immune gene GO annotation I think we need to be able to capture this.
Another example:
We want to annotate "localised to nucleus of spermatocyte"
Note that we have some pre-coordinated CC-CL terms in GO. See XP:cellular_component_xp_cell
Example from MGI: TODO
Regulation of expression and specific gene products
The GO will never pre-coordinate terms such as:
- regulation of oskar mRNA translation
- regulation of oskar mRNA transcription
But it is perfectly appropriate to post-compose such terms at annotation time.
The GO term used would be "regulation of transcription/translation"
The properties column would contain an ID for oskar or oskar mRNA. Technically it should be
- a gene ID for "regulation of gene expression"
- a transcript ID for "regulation of transcription"
- a protein ID for "regulation of translation"
However, this can often be difficult. We can relax this so long as we are clear on what it means to provide a gene ID for "regulation of translation"
Ruth: example of protein ID in Column 16:
PMID:9368760: From the experiment summarized by 'In vitro, expressed PDPK1 (PDK1) O15530 phosphorylated Thr308 of AKT1 (PKB alpha) P31749', the following annotations could be made using column 16:
- PDPK1 (PDK1) O15530 GO:0018107 peptidyl-threonine phosphorylation IDA PMID:9368760 column 16: PKBalpha/AKT1 P31749
- PDPK1 (PDK1) O15530 GO:0032148 activation of protein kinase B activity IDA PMID:9368760 column 16: PKBalpha/AKT1 P31749
- PDPK1 (PDK1) O15530 GO:0004674 protein serine/threonine kinase activity IDA PMID:9368760 column 16: PKBalpha/AKT1 P31749
Additional examples of GO annotation protein targets in column 16: For Molecular Function annotations:
- P01023 GO:0004867 serine-type endopeptidase inhibitor activity IDA PMID:12538697 column16:P48740
- P01023 GO:0004867 serine-type endopeptidase inhibitor activity IDA PMID:12538697 column16:O00187
- Q9BRA2 GO:0047134 protein-disulfide reductase activity IDA PMID:1859519 column16:P63167
- Q13535 GO:0004672 protein kinase activity IDA PMID:14657349 column16:Q14683
For Biological Process annotations:
- P31749 GO:0006469 negative regulation of protein kinase activity IMP PMID:9373175 column16:P49841
- Q92574 GO:0031397 negative regulation of ubiquitination IDA PMID: 11175345 column16:P49815
- Q8K4B2 GO:0043407 negative regulation of MAP kinase activity IMP PMID:17379480 column16:P47811
But would we include information in Column 16 for function and process terms?
Also the above in vitro experiment provides very good evidence for function and process terms, but would column 16 be completed for less direct experiment evidence, eg:
PMID:9373175: co-expression of AKT1 (AKT/PKB alpha) P31749 with GSK3B (GSK3beta) P49841 in human 293 cells leads to the inactivation of GSK3B. This effect is also seen with transfection with PDK1 and GSK3B.
Could this be interpreted as
- AKT1 P31749 GO:0006469 negative regulation of protein kinase activity IMP PMID:9373175 column 16: GSK3B P49841
- PDPK1 (PDK1) O15530 GO:0006469 negative regulation of protein kinase activity IMP PMID:9373175 column 16: GSK3B P49841
Put another way: will column 16 be limited to cases where there is evidence of a direct interaction between two proteins? Or will it be used more generally when a protein is part of a cascade that affects many proteins, in which case a large number of proteins will probably end up in column 16?
Would it be possible to pipe together multiple accessions which are 'targets' of GO annotation into column16?
Binding
https://sourceforge.net/tracker2/?func=detail&aid=2175326&group_id=36855&atid=440764
Response to drug (BP + CHEBI)
See tracker item discussion.
We don't want to make children of "response to drug" as this would violate the true path (TP) rule ("drugs" do not always play the role of drugs). Instead we would like to indicate at annotation time when the response to chemical X is a drug response.
Linking together annotations
Question from Emily:
"In addition, would this column be the place to specifically link together annotations from the different GO vocabularies? For instance if you had say, four annotations for protein X which had been annotated to: 'regulation of transcription', 'protein stabilization', 'cytoplasm' and 'nucleus' - a curator might want to link the 'regulation of transcription' process annotation specifically with the cellular component 'nucleus'."
The two options here are:
- group the annotations together somehow, perhaps using a grouping ID.
- redundantly indicate the localisation information
In the second scenario, there would be a normal-looking annotation to 'nucleus' with nothing in the properties column. There would also be an annotation to 'regulation of transcription', and this would have 'nucleus' in the properties column.
Proposed Solutions
Column 16 of the GAF is used to refine the term used to describe the aspect of the gene product. We will call this the term extension (EXT) column here.
The basic syntax is:
RELATION '(' OBO-ID ')'
The relation would be drawn from the OBO relation ontology. It is important to state the relation between the GO term in col 5 and the term in col 16.
Relations
To help ensure the correct relations are used in the correct circumstances we provide this table
Relation | Column 5 (core term) | Column 16 |
---|---|---|
occurs_in | BP | CC or CL or gross anatomy term |
part_of | CC | CC or CL or gross anatomy term |
part_of | MF | MF or BP |
part_of | BP | BP |
has_input | BP or MF | CC or CL or gross anatomy or CHEBI |
has_output | BP or MF | CC or CL or gross anatomy or CHEBI |
For example, part_of would not be used between a process and a component. It *could* be used in a CC annotation to note the cell, e.g. spermatocyte:
part_of(CL:0000017)
This is the CL ID for "spermatocyte". If the GO term in the annotation were "nucleus", then the overall meaning of the annotation would be "a nucleus that is in a spermatocyte".
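As a rough illustration, the relations table above could be turned into a lookup that a curation tool consults before accepting an extension. This is only a sketch: the name `is_allowed` and the category labels are made up for this example, and working out which category a given ID belongs to (for example, which GO aspect a GO ID has) would need the ontologies themselves, which is not shown.

```python
# (relation, aspect of the column 5 term) -> categories allowed in column 16.
# Transcribed from the relations table; "anatomy" stands for gross anatomy terms.
ALLOWED = {
    ("occurs_in", "BP"): {"CC", "CL", "anatomy"},
    ("part_of", "CC"): {"CC", "CL", "anatomy"},
    ("part_of", "MF"): {"MF", "BP"},
    ("part_of", "BP"): {"BP"},
    ("has_input", "BP"): {"CC", "CL", "anatomy", "CHEBI"},
    ("has_input", "MF"): {"CC", "CL", "anatomy", "CHEBI"},
    ("has_output", "BP"): {"CC", "CL", "anatomy", "CHEBI"},
    ("has_output", "MF"): {"CC", "CL", "anatomy", "CHEBI"},
}


def is_allowed(relation, core_aspect, ext_category):
    """Return True if the table permits this relation between the two categories."""
    return ext_category in ALLOWED.get((relation, core_aspect), set())


# A nucleus (CC) that is part_of a spermatocyte (CL) is permitted;
# part_of between a process (BP) and a component (CC) is not.
print(is_allowed("part_of", "CC", "CL"))
print(is_allowed("part_of", "BP", "CC"))
```

Any relation/aspect pair that does not appear in the table is rejected, which matches the intent of the table as a whitelist.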
Examples
The following examples are expressed as pseudo-GAFs. We omit some columns for brevity. (note that the parts after the ! would not be in the actual file, we are just including them here to make the examples readable!)
BP-MF Example
Here is gene 1234 that executes GTPase activity as part of an intracellular signaling cascade
Gene (col 2/3) | Term (col 5) | Ref (col 6) | Ext (col 16) |
---|---|---|---|
Gene1234 | GO:0003924 ! GTPase activity | PMID:nnnn | part_of(GO:0007242) ! intracellular signaling cascade |
Gene1234 | GO:0007242 ! intracellular signaling cascade | PMID:nnnn | (empty) |
CC-CL Example
Here is an imaginary gene localized to the mitochondrial membrane in a spermatocyte:
Gene (col 2/3) | Term (col 5) | Ref (col 6) | Ext (col 16) |
---|---|---|---|
Gene1234 | GO:0031966 ! mitochondrial membrane | PMID:nnnn | part_of(CL:0000017) ! spermatocyte |
BP x CC
- Gene1234 has a gene product that is involved in plastid translational elongation
At the time of writing this term is not declared in GO. Here we use the occurs_in relation:
Gene (col 2/3) | Term (col 5) | Ref (col 6) | Ext (col 16) |
---|---|---|---|
gene1234 | GO:0006414 ! translational elongation | PMID:nnnn | occurs_in(GO:0009536) ! plastid |
gene1234 | GO:0009536 ! plastid | PMID:nnnn | (empty) |
Why, you might ask, can we not just co-annotate to
- GO:0032544 ! plastid translation
- GO:0006414 ! translational elongation
The answer is that co-annotation carries less information. Computationally we have no way of knowing these two processes are linked. See the FAQ
BP x anatomy example
Example of a gene product executing its function in a particular location. Here we use the occurs_in relation:
Gene (col 2/3) | Term (col 5) | Ref (col 6) | Ext (col 16) |
---|---|---|---|
CREB | GO:0006094 ! gluconeogenesis | PMID:nnnn | occurs_in(MA:0000358) ! liver |
binding example
See tracker item 2175326
Also: http://gowiki.tamu.edu/wiki/index.php/RefGenome_Electronic_Jamboree_2008-10_PFKL
E coli pfkA has a function in PEP binding
DB (col 1) | GeneID (col 2) | Gene Symbol (col 3) | Term (col 5) | Ref (col 6) | Ext (col 16) |
---|---|---|---|---|---|
UniProt | P0A796 | pfkA | GO:0042301 ! phosphate binding | PMID:17307338 | has_input(CHEBI:44897) ! phosphoenolpyruvic acid |
Important points:
- The most specific available pre-coordinated term goes in Col 5 (i.e. phosphate binding, not binding). This ensures that searches for phosphate binding work in the absence of a reasoner
- Note that we used GO:0005488 binding, not phosphate binding. Not sure I understand why one would use the latter --JimHu 07:40, 27 March 2009 (PDT)
- It's not clear which CHEBI term to use: CHEBI:44897 or CHEBI:18021 (phosphoenolpyruvate)?
TLR example
- Toll-like receptor 4 (TLR4) (O00206) is located intracellularly in the perinuclear region (GO:0048471) only in immature DC, PMID:15027902
- TLR4 is located on the cell surface (GO:0005887) in monocytes, PMID:15027902
In this example, one of the CL terms is not present, so the GO annotator would make a request on the CL tracker (for a list of trackers, see the front page of http://obofoundry.org)
Gene (col 2/3) | Term (col 5) | Ref (col 6) | Ext (col 16) |
---|---|---|---|
TLR4 O00206 | perinuclear region (GO:0048471) | PMID:15027902 | part_of(CL:new) ! immature dendritic cell |
TLR4 O00206 | cell surface (GO:0005887) | PMID:15027902 | part_of(CL:0000576) ! monocyte |
Multiple localizations example
What if the publication describes separate observations - perhaps one for bipolar neuron and one for Purkinje cell?
We can separate these using the pipe symbol |. This is equivalent to splitting the annotation over two lines. For example:
Gene (col 2/3) | Term (col 5) | Ref (col 6) | Ext (col 16) |
---|---|---|---|
Gene1234 | GO:0031966 ! mitochondrial membrane | PMID:nnnn | part_of(CL:0000121) PIPE part_of(CL:0000103) ! bipolar neuron & Purkinje cell |
(I can't figure out how to include a pipe in a wiki table so I just wrote PIPE!)
The "|" separator indicates that this is a separate localization of a different instance of this gene product.
Remember that the CL term names would not be in the GAF - they are included here to make the examples readable.
What if we want to annotate two separate observations of the same subcellular localization - one from an astrocyte of the hippocampus, the other from a B cell in the lymph?
We use the "," to indicate an additional extension for the same observation. So the above would be:
Gene (col 2/3) | Term (col 5) | Ref (col 6) | Ext (col 16) |
---|---|---|---|
Gene1234 | GO:0031966 ! mitochondrial membrane | PMID:nnnn | part_of(CL:0000127),part_of(MA:0000953) PIPE part_of(CL:0000236),part_of(MA:0002520) ! one from an astrocyte of the hippocampus, the other from a B cell in the lymph |
This would be equivalent to two annotations
Gene (col 2/3) | Term (col 5) | Ref (col 6) | Ext (col 16) |
---|---|---|---|
Gene1234 | GO:0031966 ! mitochondrial membrane | PMID:nnnn | part_of(CL:0000127),part_of(MA:0000953) ! astrocyte of the hippocampus |
Gene1234 | GO:0031966 ! mitochondrial membrane | PMID:nnnn | part_of(CL:0000236),part_of(MA:0002520) ! a B cell in the lymph |
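The equivalence between one piped annotation and several separate annotations can be sketched in Python. This is an illustration only: the function names and the tuple layout (gene, term, reference, extension) are assumptions made for this example, and the simple string splitting only works for the flat, non-nested form of column 16.

```python
def expand_extensions(row):
    """Split one annotation into one annotation per '|'-separated group."""
    gene, term, ref, ext = row
    if not ext:
        # No extension: the annotation is returned unchanged.
        return [row]
    return [(gene, term, ref, group.strip()) for group in ext.split("|")]


def split_group(group):
    """Split one group into its ','-separated extensions (same observation)."""
    return [part.strip() for part in group.split(",") if part.strip()]


row = (
    "Gene1234",
    "GO:0031966",
    "PMID:nnnn",
    "part_of(CL:0000127),part_of(MA:0000953)|part_of(CL:0000236),part_of(MA:0002520)",
)
for expanded in expand_extensions(row):
    print(expanded[0], expanded[1], split_group(expanded[3]))
```

Each printed line corresponds to one row of the two-annotation table above.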
Here is another, real life example from MGI:
Gene (col 2/3) | Term (col 5) | Ref (col 6) | Ext (col 16) |
---|---|---|---|
MGI:1919277 Slc39a4 | GO:0016324 ! apical plasma membrane | PMID:nnnn | part_of(MA:0000337),part_of(CL:0000584) ! enterocyte of small intestine |
MGI:1919277 Slc39a4 | GO:0016324 ! apical plasma membrane | PMID:nnnn | part_of(EMAP:6894),part_of(CL:0000223) ! endodermal cell of TS22\,extraembryonic component |
Response to drug
E.g. "response to cocaine".
Option 1:
Gene (col 2/3) | Term (col 5) | Ref (col 6) | Ext (col 16) |
---|---|---|---|
moody (FBgn0025631) | GO:0042220 ! response to cocaine | PMID:nnnn | has_input(CHEBI:23888) ! drug |
Here we need a new relation, "response_to"
Transport
See: http://mcb.asm.org/cgi/content/full/22/4/1266?view=long&pmid=11809816
Gene (col 2/3) | Term (col 5) | Ref (col 6) | Ext (col 16) |
---|---|---|---|
Ipo11 (MGI:nnnn) | ribosomal import into nucleus | PMID:11809816 | results_in_transport_of(Uniprot:nnnnnn) ! rpL12 |
Implementation Plan
- test annotation files will be made available to Berkeley (contributors: MGI, GOA, Dicty...?) with col16 populated
- Berkeley will populate a test database (Seth)
- toy version of AmiGO with CL IDs queryable
- change schema of production db
- officially add spec for col16
- annotation contributors start adding columns
- CL populated and queryable in public amigo
- Extend scheme to other OBO ontologies
The toy version of AmiGO should be ready by the GO meeting.
Database Implementation
See SWUG:Database
FAQ
Will this replace existing combinatorial GO terms like "B cell differentiation"
No! It is important to keep terms like this pre-coordinated in the GO.
When do I request a new term and when do I use the annotation xp column?
Request a new term if it seems like a sensible new term to have in GO. A combinatorial term in GO is fine if it corresponds to a commonly used scientific term, and the combination is not completely arbitrary and accidental.
For more on this important issue, and a discussion of when to pre-compose and when to compose at annotation time, see this thread on the GO list from March 2009: http://fafner.stanford.edu/pipermail/go/2009-March/016501.html
How will this column be used by tools?
Tools and databases do not have to use col 16. If they elect not to use it, they are no worse off than prior to the introduction of column 16. It is an optional extension.
However, we do recommend that tools start using it in order to provide more accurate results and queries ASAP. For example, using the annotation XP column it may be possible to get more sensitive term enrichment results.
What happens when new specific GO terms corresponding to the annotation XPs are added?
Let's say annotator A wishes to annotate to "plastid translational elongation", but there is no such term in GO, because it is (for example) deemed to be not sufficiently different from generic translational elongation.
They should then annotate to "translational elongation" and also put "occurs_in(plastid)" in col 16.
Then let's say later on we discover that "plastid translational elongation" does belong in GO after all (policy changes or we discover something about the biology), so the term gets added.
Crucially, the annotator need do nothing. Their annotation can be automatically mapped forward, once an entry for "plastid translational elongation" is added to XP:biological_process_xp_cellular_component
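A minimal sketch of how such forward mapping might work is given below. Everything here is an assumption for illustration: the cross-product table is modelled as a plain dictionary, the function name `map_forward` is invented, and `GO:nnnnnnn` is a placeholder for the ID that "plastid translational elongation" would receive if it were added.

```python
# Hypothetical cross-product table:
# (column 5 term, relation, column 16 term) -> pre-coordinated term.
# "GO:nnnnnnn" is a placeholder, not a real GO ID.
XP_TABLE = {
    ("GO:0006414", "occurs_in", "GO:0009536"): "GO:nnnnnnn",
}


def map_forward(term, extensions, xp_table=XP_TABLE):
    """Replace term plus one matching extension by the pre-coordinated term.

    extensions is a list of (relation, id) pairs. Returns the (possibly new)
    term and the extensions that were not absorbed by the mapping.
    """
    for i, (relation, ext_id) in enumerate(extensions):
        target = xp_table.get((term, relation, ext_id))
        if target is not None:
            return target, extensions[:i] + extensions[i + 1:]
    return term, list(extensions)


# translational elongation + occurs_in(plastid) maps to the new term.
print(map_forward("GO:0006414", [("occurs_in", "GO:0009536")]))
```

Annotations whose extensions have no entry in the table are left untouched, so the mapping can be rerun whenever the table grows.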
Why allow GO IDs in col 16? Can I just co-annotate instead?
Co-annotation is not sufficient. Important information is lost. For example, if a gene has 4 annotations to
- mitochondrion
- nucleus
- translation
- transport
We have no way of knowing whether the gene is involved in
- nuclear translation vs mt translation (or both)
- transport within, to or from cytoplasm or nucleus
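To make the difference concrete, here is a small sketch in which one of the four annotations carries an extension. The data layout, the gene name `GeneX`, the function `genes_with`, and the choice of which extension is recorded are all invented for this example; they are not part of the proposal.

```python
# Four annotations for an imaginary gene. Only the translation annotation
# carries an extension; the rest are plain co-annotations.
annotations = [
    {"gene": "GeneX", "term": "GO:0005739", "ext": []},  # mitochondrion
    {"gene": "GeneX", "term": "GO:0005634", "ext": []},  # nucleus
    {"gene": "GeneX", "term": "GO:0006412",              # translation
     "ext": [("occurs_in", "GO:0005739")]},
    {"gene": "GeneX", "term": "GO:0006810", "ext": []},  # transport
]


def genes_with(annotations, term, relation, ext_id):
    """Genes annotated to term with the given extension on that same annotation."""
    return sorted(
        {a["gene"] for a in annotations
         if a["term"] == term and (relation, ext_id) in a["ext"]}
    )


# Translation in the mitochondrion can be found; translation in the nucleus cannot.
print(genes_with(annotations, "GO:0006412", "occurs_in", "GO:0005739"))
print(genes_with(annotations, "GO:0006412", "occurs_in", "GO:0005634"))
```

With the `ext` lists all empty, both queries would return nothing, which is exactly the information that plain co-annotation cannot supply.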
Appendix
Grammar for col 16
This is specified as a BNF grammar. This is necessary to keep the field extensible enough for future use. Note that the column is optional, so there is no requirement for people to parse it. It is an 'added bonus' column.
 PropertiesSet := Properties | Properties "|" PropertiesSet
 Properties := Property | Property ',' Properties
 Property := Relation '(' Term ')'
 Term := ID
 Relation := Relation-Abbrev | ID
 ID := ID-Space ':' Local-ID
 ID-Space := XML-NMToken
 Local-ID := chars
 Relation-Abbrev := chars
Relations can be abbreviated; eg part_of can be used in place of OBO_REL:part_of
This can be extended to allow for nested expressions:
Term := ID | ID '^' Properties
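The grammar, including the nested form of Term, is small enough for a hand-written recursive-descent parser. The following Python is a sketch under stated assumptions rather than a reference implementation: the function names and the output shape are invented here, whitespace between tokens is ignored, escaped characters are not handled, and a comma inside a nested expression is taken to belong to the innermost Properties.

```python
import re

PUNCT = set("|,()^")
# A token is either one punctuation character or a run of other characters.
TOKEN = re.compile(r"\s*([|,()^]|[^|,()^\s]+)")


def tokenize(text):
    text = text.strip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def _word(tokens, pos):
    if pos >= len(tokens) or tokens[pos] in PUNCT:
        raise ValueError(f"expected a name or ID at token {pos}")
    return tokens[pos], pos + 1


def _expect(tokens, pos, value):
    if pos >= len(tokens) or tokens[pos] != value:
        raise ValueError(f"expected {value!r} at token {pos}")
    return pos + 1


def _term(tokens, pos):
    # Term := ID | ID '^' Properties
    term_id, pos = _word(tokens, pos)
    if ":" not in term_id:
        raise ValueError(f"{term_id!r} is not of the form ID-Space:Local-ID")
    nested = []
    if pos < len(tokens) and tokens[pos] == "^":
        nested, pos = _properties(tokens, pos + 1)
    return (term_id, nested), pos


def _property(tokens, pos):
    # Property := Relation '(' Term ')'
    relation, pos = _word(tokens, pos)
    pos = _expect(tokens, pos, "(")
    term, pos = _term(tokens, pos)
    pos = _expect(tokens, pos, ")")
    return (relation, term), pos


def _properties(tokens, pos):
    # Properties := Property | Property ',' Properties
    props = []
    while True:
        prop, pos = _property(tokens, pos)
        props.append(prop)
        if pos < len(tokens) and tokens[pos] == ",":
            pos += 1
        else:
            return props, pos


def parse_properties_set(text):
    # PropertiesSet := Properties | Properties '|' PropertiesSet
    tokens = tokenize(text)
    groups, pos = [], 0
    while True:
        props, pos = _properties(tokens, pos)
        groups.append(props)
        if pos < len(tokens) and tokens[pos] == "|":
            pos += 1
        else:
            break
    if pos != len(tokens):
        raise ValueError(f"unexpected token {tokens[pos]!r}")
    return groups


print(parse_properties_set("part_of(CL:0000127),part_of(MA:0000953)|part_of(CL:0000236)"))
```

The result is a list of groups (one per "|"), each a list of `(relation, (id, nested_properties))` pairs. An empty column is not a valid PropertiesSet under this grammar, so a caller would check for emptiness before parsing.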