Annotation Cross Products old page (Retired)

From GO Wiki
Revision as of 02:20, 12 April 2019 by Pascale

 Information here is being moved to the main Annotation Extension wiki page

Each GO annotation refers to a single term from the ontology. This restricts annotators in what they can say - there must be a pre-existing term in the ontology, or one must be requested. It would be far less restrictive if the annotator could combine additional terms in a single annotation. These terms could even come from other OBO ontologies.

This page describes the proposed new column 16 in the GAF, which allows additional terms to be specified to extend the meaning of an annotation. If and when an annotator chooses to do this, they are effectively creating an "on-the-fly" cross product term. We say "on-the-fly" because the combinatorial term is not added to the ontology (although it could be at a later stage, if the ontology editors choose to do so).

This proposal owes a lot to the MGI structured notes internal field in the MGD database.

External Ontologies required

Only ontologies committed to the principles of the OBO Foundry should be included.

  • CHEBI : Chemical Entities
  • CL : Cell ontology
  • taxon-centric anatomy ontologies (AOs):
    • ZFA (zebrafish)
    • MA (adult mouse)
    • FMA (human)
    • XAO (xenopus)
    • FBbt (fly)
    • WBbt (worm)
    • (add others here)

Use Cases

Function and Process co-annotation

Molecular functions are always executed in the context of a biological process (in a cellular location).

At the moment, we "weakly" co-annotate function and process, but there is no way of knowing which functions go with which processes. A gene G may be annotated to F1, F2, F3 and P1, P2, P3. It may be the case that F1 and P3 never go together, or that when G executes F2 it is always in the context of P2.

Annotators need a way of saying, on a per-annotation basis, that an F is executed in the context of a P.


F1: protein serine/threonine/tyrosine kinase activity

P1: peptidyl-tyrosine phosphorylation

P2: positive regulation of protein kinase activity

P3: positive regulation of small GTPase mediated signal transduction

F1: sequence-specific DNA binding

lots of Ps, one of which is 'negative regulation of transcription from RNA polymerase II promoter'.

Note that this is complementary to the project to link the process and function ontologies. The inter-ontology links could be used as aids to annotators.

Immune System regulation terms: BP and CL

(see email thread from Evelyn on GO list, "another immune related query GO and CL")

Chicken IL-10 is secreted from, e.g., macrophages, but causes 'negative regulation of interferon gamma biosynthesis' in chicken splenocytes.

TODO: need help refining this use case. It was decided that splenocytes were not a great example

Subcellular localisation (CC) within a specific type of cell (CL)

  • Toll-like receptor 4 (TLR4) (O00206) is located intracellularly in the perinuclear region (GO:0048471) only in immature DC, PMID:15027902
  • TLR4 is located on the cell surface (GO:0005887) in monocytes, PMID:15027902

Evelyn's comments: So protein localisation is cell type specific and for immune gene GO annotation I think we need to be able to capture this.

Another example:

We want to annotate "localised to nucleus of spermatocyte"

Note that we have some pre-coordinated CC-CL terms in GO. See XP:cellular_component_xp_cell

Example from MGI: TODO

Regulation of expression and specific gene products

The GO will never pre-coordinate terms such as:

  • regulation of oskar mRNA translation
  • regulation of oskar mRNA transcription

But it is perfectly appropriate to post-compose such terms at annotation time.

The GO term used would be "regulation of transcription/translation"

The properties column would contain an ID for oskar or oskar mRNA. Technically it should be

  • a gene ID for "regulation of gene expression"
  • a transcript ID for "regulation of transcription"
  • a protein ID for "regulation of translation"

However, this can often be difficult. We can relax this requirement so long as we are clear on what it means to provide a gene ID for "regulation of translation".

Ruth: example of protein ID in Column 16:

PMID:9368760. From the experiment summarized by 'In vitro, expressed PDPK1 (PDK1) O15530 phosphorylated Thr308 of AKT1 (PKB alpha) P31749', the following annotations could be made using column 16:

  • PDPK1 (PDK1) O15530 GO:0018107 peptidyl-threonine phosphorylation IDA PMID:9368760 column 16: PKBalpha/AKT1 P31749
  • PDPK1 (PDK1) O15530 GO:0032148 activation of protein kinase B activity IDA PMID:9368760 column 16: PKBalpha/AKT1 P31749
  • PDPK1 (PDK1) O15530 GO:0004674 protein serine/threonine kinase activity IDA PMID:9368760 column 16: PKBalpha/AKT1 P31749

Additional examples of GO annotation protein targets in column 16:

For Molecular Function annotations:

  • P01023 GO:0004867 serine-type endopeptidase inhibitor activity IDA PMID:12538697 column16:P48740
  • P01023 GO:0004867 serine-type endopeptidase inhibitor activity IDA PMID:12538697 column16:O00187
  • Q9BRA2 GO:0047134 protein-disulfide reductase activity IDA PMID:1859519 column16:P63167
  • Q13535 GO:0004672 protein kinase activity IDA PMID:14657349 column16:Q14683

For Biological Process annotations:

  • P31749 GO:0006469 negative regulation of protein kinase activity IMP PMID:9373175 column16:P49841
  • Q92574 GO:0031397 negative regulation of ubiquitination IDA PMID:11175345 column16:P49815
  • Q8K4B2 GO:0043407 negative regulation of MAP kinase activity IMP PMID:17379480 column16:P47811

But would we include information in Column 16 for function and process terms?

Also, the above in vitro experiment provides very good evidence for function and process terms, but would column 16 be completed for less direct experimental evidence, e.g.:

PMID:9373175: co-expression of AKT1 (AKT/PKB alpha) P31749 with GSK3B (GSK3beta) P49841 in human 293 cells leads to the inactivation of GSK3B. This effect is also seen with transfection with PDK1 and GSK3B.

Could this be interpreted as

  • AKT1 P31749 GO:0006469 negative regulation of protein kinase activity IMP PMID:9373175 column 16: GSK3B P49841
  • PDPK1 (PDK1) O15530 GO:0006469 negative regulation of protein kinase activity IMP PMID:9373175 column 16: GSK3B P49841

Put another way: will column 16 be limited to cases where there is evidence of a direct interaction between two proteins? Or will it be used more generally, when a protein is part of a cascade that leads to an effect on many proteins, in which case a large number of proteins will probably end up in column 16?

Would it be possible to pipe together multiple accessions which are 'targets' of GO annotation into column16?


Response to drug (BP + CHEBI)

See tracker item discussion.

We don't want to make children of "response to drug" as this would violate the true path (TP) rule ("drugs" do not always play the role of drugs). Instead, we would like to indicate at annotation time when the response to chemical X is a drug response.

Linking together annotations

Question from Emily:

"In addition, would this column be the place to specifically link together annotations from the different GO vocabularies? For instance if you had say, four annotations for protein X which had been annotated to: 'regulation of transcription', 'protein stabilization', 'cytoplasm' and 'nucleus' - a curator might want to link the 'regulation of transcription' process annotation specifically with the cellular component 'nucleus'."

The two options here are:

  1. group the annotations together somehow, perhaps using a grouping ID.
  2. redundantly indicate the localisation information

In the second scenario, there would be a normal-looking annotation to 'nucleus' with nothing in the properties column. There would also be an annotation to 'regulation of transcription', and this would have 'nucleus' in the properties column.

Proposed Solutions

Column 16 of the GAF is used to refine the term used to describe the aspect of the gene product. We will call this the term extension (EXT) column here.

The basic syntax is:

 relation(ID)

The relation would be drawn from the OBO relation ontology. It is important to state the relation between the GO term in col 5 and the term in col 16.


To help ensure the correct relations are used in the correct circumstances, we provide this table:

 Relation     Column 5 (core term)   Column 16 (extension)
 occurs_in    BP                     CC or CL or gross anatomy term
 part_of      CC                     CC or CL or gross anatomy term
 part_of      MF                     MF or BP
 part_of      BP                     BP
 has_input    BP or MF               CC or CL or gross anatomy or CHEBI
 has_output   BP or MF               CC or CL or gross anatomy or CHEBI

For example, part_of would not be used between a process and a component. It *could* be used in a CC annotation to note the cell type, e.g. spermatocyte:

 part_of(CL:0000017)

CL:0000017 is the CL ID for "spermatocyte". If the GO term in the annotation was "nucleus", then the overall meaning of the annotation would be "a nucleus that is in a spermatocyte".
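The relation table lends itself to a mechanical check. The following is a minimal Python sketch and not part of the proposal: the `ALLOWED` mapping is transcribed from the table, while the short type labels (such as `"anatomy"` for a gross anatomy term) and the function name `is_allowed` are assumptions made for illustration. It also assumes the caller already knows the type of each term, since a GO ID alone does not reveal whether a term is BP, MF or CC.

```python
# Allowed extension types per (relation, core term type), transcribed from the table.
# The label "anatomy" (gross anatomy term) and all names here are illustrative only.
ANATOMICAL = {"CC", "CL", "anatomy"}

ALLOWED = {
    ("occurs_in", "BP"): ANATOMICAL,
    ("part_of", "CC"): ANATOMICAL,
    ("part_of", "MF"): {"MF", "BP"},
    ("part_of", "BP"): {"BP"},
    ("has_input", "BP"): ANATOMICAL | {"CHEBI"},
    ("has_input", "MF"): ANATOMICAL | {"CHEBI"},
    ("has_output", "BP"): ANATOMICAL | {"CHEBI"},
    ("has_output", "MF"): ANATOMICAL | {"CHEBI"},
}


def is_allowed(relation, core_type, ext_type):
    """Return True if the table permits this relation between the two term types."""
    return ext_type in ALLOWED.get((relation, core_type), set())


# A nucleus (CC) that is part of a spermatocyte (CL) is permitted.
print(is_allowed("part_of", "CC", "CL"))
# part_of between a process (BP) and a component (CC) is not.
print(is_allowed("part_of", "BP", "CC"))
```

A curation tool could run a check like this at data-entry time, before an annotation is written to the file.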


The following examples are expressed as pseudo-GAFs. We omit some columns for brevity. (Note that the parts after the ! would not be in the actual file; we include them here only to make the examples readable.)

BP-MF Example

Here is a gene, Gene1234, that executes GTPase activity as part of an intracellular signaling cascade:

Gene (col 2/3) Term (col 5) Ref (col 6) Ext (col 16)
Gene1234 GO:0003924 ! GTPase activity PMID:nnnn part_of(GO:0007242) ! intracellular signaling cascade
Gene1234 GO:0007242 ! intracellular signaling cascade PMID:nnnn (empty)

CC-CL Example

Here is an imaginary gene localized to the mitochondrial membrane in a spermatocyte:

Gene (col 2/3) Term (col 5) Ref (col 6) Ext (col 16)
Gene1234 GO:0031966 ! mitochondrial membrane PMID:nnnn part_of(CL:0000017) ! spermatocyte


  • Gene1234 has a gene product that is involved in plastid translational elongation

At the time of writing this term is not declared in GO. Here we use the occurs_in relation:

Gene (col 2/3) Term (col 5) Ref (col 6) Ext (col 16)
gene1234 GO:0006414 ! translational elongation PMID:nnnn occurs_in(GO:0009536) ! plastid
gene1234 GO:0009536 ! plastid PMID:nnnn (empty)

Why, you might ask, can we not just co-annotate to the following two terms?

  • GO:0032544 ! plastid translation
  • GO:0006414 ! translational elongation

The answer is that co-annotation carries less information. Computationally, we have no way of knowing that these two processes are linked. See the FAQ.

BP x anatomy example

Example of a gene product executing its function in a particular location. Here we use the occurs_in relation:

Gene (col 2/3) Term (col 5) Ref (col 6) Ext (col 16)
CREB GO:0006094 ! gluconeogenesis PMID:nnnn occurs_in(MA:0000358) ! liver

binding example

See 2175326


E coli pfkA has a function in PEP binding

DB (col 1) GeneID (col 2) Gene Symbol (col 3) Term (col 5) Ref (col 6) Ext (col 16)
UniProt P0A796 pfkA GO:0042301 ! phosphate binding PMID:17307338 has_input(CHEBI:44897) ! phosphoenolpyruvic acid

Important points:

  • The most specific available pre-coordinated term goes in Col 5 (i.e. phosphate binding, not binding). This ensures that searches for phosphate binding work in the absence of a reasoner
    • Note that we used GO:0005488 binding, not phosphate binding. Not sure I understand why one would use the latter --JimHu 07:40, 27 March 2009 (PDT)
  • It's not clear which CHEBI term to use: CHEBI:44897 or CHEBI:18021 (phosphoenolpyruvate)?
    • I chose the former in this example simply because it has an is_a parent. CHEBI terms without is_a parents should NOT be used. This is because we need the is_a parent to figure out the correct parentage in GO
    • See the thread on the GO list for further discussion

TLR example

  • Toll-like receptor 4 (TLR4) (O00206) is located intracellularly in the perinuclear region (GO:0048471) only in immature DC, PMID:15027902
  • TLR4 is located on the cell surface (GO:0005887) in monocytes, PMID:15027902

In this example, one of the CL terms is not present, so the GO annotator would make a request on the CL tracker (for a list of trackers, see the front page of

Gene (col 2/3) Term (col 5) Ref (col 6) Ext (col 16)
TLR4 (O00206) GO:0048471 ! perinuclear region PMID:15027902 part_of(CL:new) ! immature dendritic cell
TLR4 (O00206) GO:0005887 ! cell surface PMID:15027902 part_of(CL:0000576) ! monocyte

Multiple localizations example

What if the publication describes separate observations, perhaps one for bipolar neuron and one for Purkinje cell?

We can separate these using the pipe symbol |. This is equivalent to splitting the annotation over two lines. For example:

Gene (col 2/3) Term (col 5) Ref (col 6) Ext (col 16)
Gene1234 GO:0031966 ! mitochondrial membrane PMID:nnnn part_of(CL:0000121) PIPE part_of(CL:0000103) ! bipolar neuron & Purkinje cell

(I can't figure out how to include a pipe in a wiki table so I just wrote PIPE!)

The "|" separator indicates that this is a separate localization of a different instance of this gene product.

Remember that the CL term names would not be in the GAF. They are included here to make the examples readable.

What if we want to annotate two separate observations of the same subcellular localization - one from an astrocyte of the hippocampus, the other from a B cell in the lymph?

We use the "," to indicate an additional extension for the same observation. So the above would be:

Gene (col 2/3) Term (col 5) Ref (col 6) Ext (col 16)
Gene1234 GO:0031966 ! mitochondrial membrane PMID:nnnn part_of(CL:0000127),part_of(MA:0000953) PIPE part_of(CL:0000236),part_of(MA:0002520) ! one from an astrocyte of the hippocampus, the other from a B cell in the lymph

This would be equivalent to two annotations:

Gene (col 2/3) Term (col 5) Ref (col 6) Ext (col 16)
Gene1234 GO:0031966 ! mitochondrial membrane PMID:nnnn part_of(CL:0000127),part_of(MA:0000953) ! astrocyte of the hippocampus
Gene1234 GO:0031966 ! mitochondrial membrane PMID:nnnn part_of(CL:0000236),part_of(MA:0002520) ! a B cell in the lymph
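
The equivalence between one piped annotation and several separate annotations can be sketched in Python. This is an illustration under assumptions, not a specified algorithm: the dictionary keys (`gene`, `term`, `ref`, `ext`) and the function name `expand_annotation` are hypothetical, and the real "|" character is used where the tables write PIPE.

```python
def expand_annotation(annotation):
    """Split one annotation into one annotation per "|"-separated extension group.

    Properties joined by "," stay together because they describe the same observation.
    """
    ext = annotation.get("ext", "")
    if not ext:
        return [dict(annotation)]
    return [dict(annotation, ext=group) for group in ext.split("|")]


combined = {
    "gene": "Gene1234",
    "term": "GO:0031966",
    "ref": "PMID:nnnn",
    "ext": "part_of(CL:0000127),part_of(MA:0000953)|part_of(CL:0000236),part_of(MA:0002520)",
}

# Prints one line per observation: astrocyte/hippocampus, then B cell/lymph.
for row in expand_annotation(combined):
    print(row["gene"], row["term"], row["ref"], row["ext"])
```

Splitting on "|" first and leaving the "," groups intact keeps each observation's cell type and anatomical location together.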

Here is another, real-life example from MGI:

Gene (col 2/3) Term (col 5) Ref (col 6) Ext (col 16)
MGI:1919277 Slc39a4 GO:0016324 ! apical plasma membrane PMID:nnnn part_of(MA:0000337),part_of(CL:0000584) ! enterocyte of small intestine
MGI:1919277 Slc39a4 GO:0016324 ! apical plasma membrane PMID:nnnn part_of(EMAP:6894),part_of(CL:0000223) ! endodermal cell of TS22, extraembryonic component

Response to drug

E.g. "response to cocaine".

Option 1:

Gene (col 2/3) Term (col 5) Ref (col 6) Ext (col 16)
moody (FBgn0025631) GO:0042220 ! response to cocaine PMID:nnnn has_input(CHEBI:23888) ! drug

Here we need a new relation, "response_to"



Gene (col 2/3) Term (col 5) Ref (col 6) Ext (col 16)
Ipo11 (MGI:nnnn) ribosomal import into nucleus PMID:11809816 results_in_transport_of(Uniprot:nnnnnn) ! rpL12

Implementation Plan

  1. test annotation files will be made available to Berkeley (contributors: MGI, GOA, Dicty...?) with col16 populated
  2. Berkeley will populate a test database (Seth)
  3. toy version of AmiGO with CL IDs queryable
  4. change schema of production db
  5. officially add spec for col16
  6. annotation contributors start adding columns
  7. CL populated and queryable in public AmiGO
  8. Extend scheme to other OBO ontologies

The toy version of AmiGO should be ready by the GO meeting.

Database Implementation

See SWUG:Database


Will this replace existing combinatorial GO terms like "B cell differentiation"?

No! It is important to keep terms like this pre-coordinated in the GO.

When do I request a new term and when do I use the annotation xp column?

Request a new term if it seems like a sensible new term to have in GO. A combinatorial term in GO is fine if it corresponds to a commonly used scientific term and the combination is not completely arbitrary and accidental.

For more on this important issue, and a discussion of when to pre-compose and when to compose at annotation time, see the thread on the GO list from March 2009.

How will this column be used by tools?

Tools and databases do not have to use col 16. If they elect not to use it, they are no worse off than prior to the introduction of column 16. It is an optional extension.

However, we do recommend that tools start using it in order to provide more accurate results and queries ASAP. For example, using the annotation XP column it may be possible to get more sensitive term enrichment results.
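
As a minimal sketch of how a tool might treat the column as optional, the Python function below reads col 16 when it is present and falls back to an empty value otherwise. It assumes GAF lines are tab-separated with the extension in the 16th field; the function name `read_extension` and the sample lines are made up for illustration.

```python
def read_extension(gaf_line):
    """Return the col 16 value of a tab-separated GAF line, or "" if it is absent."""
    fields = gaf_line.rstrip("\n").split("\t")
    # Column 16 is index 15; files written before the column existed are shorter.
    return fields[15] if len(fields) > 15 else ""


# Two made-up lines: one with 15 columns, one with a populated 16th column.
old_style = "\t".join(["x"] * 15)
new_style = "\t".join(["x"] * 15 + ["occurs_in(MA:0000358)"])

print(repr(read_extension(old_style)))
print(repr(read_extension(new_style)))
```

A tool written this way can read both older 15-column files and files that carry the extension column.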

What happens when new specific GO terms corresponding to the annotation XPs are added?

Let's say annotator A wishes to annotate to "plastid translational elongation", but there is no such term in GO, because it is (for example) deemed to be not sufficiently different from generic translational elongation.

They should then annotate to "translational elongation" and also put "occurs_in(plastid)" in col 16.

Then let's say later on we discover that "plastid translational elongation" does belong in GO after all (because policy changes or we discover something about the biology), so the term gets added.

Crucially, the annotator need do nothing. Their annotation can be automatically mapped forward, once an entry for "plastid translational elongation" is added to XP:biological_process_xp_cellular_component
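
The forward mapping can be sketched as a simple lookup. This Python example is an illustration only: the structure of `XP_DEFINITIONS`, the function name `map_forward`, and the ID `GO:9999999` (a made-up placeholder for a future "plastid translational elongation" term) are all assumptions, not the actual format of the cross-product files.

```python
# Cross-product definitions: (generic term, relation, extension term) -> pre-coordinated term.
# GO:9999999 is a made-up placeholder for a future "plastid translational elongation" term.
XP_DEFINITIONS = {
    ("GO:0006414", "occurs_in", "GO:0009536"): "GO:9999999",
}


def map_forward(term, properties, xp_definitions):
    """Replace (term + one matching property) with the pre-coordinated term.

    `properties` is a list of (relation, term_id) pairs taken from col 16.
    Returns the possibly more specific term and the properties that remain.
    """
    for index, (relation, ext_term) in enumerate(properties):
        specific = xp_definitions.get((term, relation, ext_term))
        if specific is not None:
            remaining = properties[:index] + properties[index + 1:]
            return specific, remaining
    return term, list(properties)


# translational elongation + occurs_in(plastid) maps to the placeholder term.
print(map_forward("GO:0006414", [("occurs_in", "GO:0009536")], XP_DEFINITIONS))
```

Annotations whose extension has no matching cross-product entry are returned unchanged, so the mapping can be rerun safely whenever new terms are added.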

Why allow GO IDs in col 16? Can I just co-annotate instead?

Co-annotation is not sufficient; important information is lost. For example, if a gene has four annotations to

  • mitochondrion
  • nucleus
  • translation
  • transport

then we have no way of knowing whether the gene is involved in

  • nuclear translation vs mt translation (or both)
  • transport within, to or from cytoplasm or nucleus


Grammar for col 16

This is specified as a BNF grammar. This is necessary to keep the field extensible enough for future use. Note that the column is optional, so there is no requirement for people to parse it. It is an 'added bonus' column.

 PropertiesSet := Properties | Properties "|" PropertiesSet
 Properties := Property | Property ',' Properties
 Property := Relation '(' Term ')'
 Term := ID
 Relation := Relation-Abbrev | ID
 ID := ID-Space ':' Local-ID
 ID-Space := XML-NMToken
 Local-ID := chars
 Relation-Abbrev := chars

Relations can be abbreviated; e.g. part_of can be used in place of OBO_REL:part_of.

This can be extended to allow for nested expressions:

 Term := ID | ID '^' Properties
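
A small parser for the basic (non-nested) grammar might look like the following Python sketch. It is not a reference implementation: the grammar only says "chars" and "XML-NMToken", so the character classes in `PROPERTY_RE` are a simplifying assumption, and the names `PROPERTY_RE` and `parse_properties_set` are hypothetical. Nested terms using '^' are deliberately rejected.

```python
import re

# One Property: Relation '(' ID ')'. The character classes are an assumption:
# the grammar only says "chars" and "XML-NMToken", so this pattern is kept simple.
PROPERTY_RE = re.compile(
    r"(?P<relation>[A-Za-z_][\w.-]*(?::[\w.-]+)?)"  # abbreviation or full ID
    r"\((?P<id_space>[\w.-]+):(?P<local_id>[^\s()|,]+)\)"
)


def parse_properties_set(text):
    """Parse a col 16 value into a list of groups of (relation, id_space, local_id).

    Groups are separated by "|"; properties within a group are separated by ",".
    Nested terms (the '^' extension) are not handled and raise ValueError.
    """
    groups = []
    for group_text in text.split("|"):
        group = []
        for prop_text in group_text.split(","):
            match = PROPERTY_RE.fullmatch(prop_text.strip())
            if match is None:
                raise ValueError(f"not a valid property: {prop_text!r}")
            group.append((match["relation"], match["id_space"], match["local_id"]))
        groups.append(group)
    return groups


print(parse_properties_set("part_of(CL:0000127),part_of(MA:0000953)|part_of(CL:0000236)"))
print(parse_properties_set("OBO_REL:part_of(CL:0000017)"))
```

An empty column value also raises `ValueError` here, so a caller would check for an empty field before parsing.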

Column 16 discussion 12-12-09