LEGO-style annotation ideas

This page is for collecting together thoughts and ideas on what we want from a new annotation system. Given a clean sheet, what would you want to capture? Examples are always useful.

What can't be fully captured in the current format

For annotation developments that could be included in the current GAF format, see full details at: Proposed Developments to the GAF annotation format

Terms from external ontologies

Clarification: terms from external ontologies can be used as differentia in c16, but not in c5.
Allowing other ontologies in c5 is not hard in principle, BUT:
Adding other ontologies (e.g. CL) in c5 would require an explicit relationship type such as expressed_in (see next item).

Nested class expressions (post-composed terms)

c16 allows multiple differentia but not nested class expressions.
The syntax was designed to accommodate nesting, but at that point it becomes quite complex for people to read and write, and feels like overloading the column.
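
To make the flat versus nested distinction concrete, here is a minimal Python sketch. It is not part of any GO tooling: the class names, the helper functions and the EX: identifiers are all hypothetical, chosen only for illustration. It models a class expression as a term plus differentia, and reports whether the expression stays within the "multiple differentia, no nesting" limit described above.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class ClassExpression:
    """A term plus zero or more differentia (hypothetical model)."""
    term: str
    differentia: List["Differentium"] = field(default_factory=list)


@dataclass
class Differentium:
    """A relation whose filler is a plain ID or a nested expression."""
    relation: str
    filler: Union[str, ClassExpression]


def nesting_depth(expr: ClassExpression) -> int:
    """Return 0 for a bare term, 1 for flat differentia, 2+ when nested."""
    depths = [0]
    for d in expr.differentia:
        if isinstance(d.filler, ClassExpression):
            depths.append(1 + nesting_depth(d.filler))
        else:
            depths.append(1)
    return max(depths)


def fits_flat_extension(expr: ClassExpression) -> bool:
    """True if the expression needs no nesting (the c16 limit noted above)."""
    return nesting_depth(expr) <= 1


# Placeholder IDs only; none of these are real ontology terms.
flat = ClassExpression("EX:1", [Differentium("occurs_in", "EX:2"),
                                Differentium("has_input", "EX:3")])
nested = ClassExpression("EX:1", [
    Differentium("occurs_in",
                 ClassExpression("EX:2", [Differentium("part_of", "EX:4")])),
])
```

In this sketch `flat` can be written with today's multiple differentia, while `nested` cannot, because the filler of occurs_in is itself a class expression.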

Multiple pieces of evidence for a single assertion

A new annotation format could provide a more detailed, structured format for the evidence supporting an annotation

Currently the evidence for an annotation is located in the reference (col.6), evidence (col.7) and with (col.8) fields in the GAF. There are restrictions on the acceptable values and their cardinality in these fields. However, curators would like to build a chain of evidence, which would mean including multiple evidence and reference identifiers in support of a single annotation. While work-arounds are being discussed on calls, the solutions proposed so far are not ideal. Edimmer 09:11, 16 August 2011 (PDT)
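
As a rough illustration of what "a more detailed, structured format for the evidence" could look like, the Python sketch below attaches a list of evidence items to one assertion. The class and function names are hypothetical, the identifiers are placeholders, and the column mapping in the comments simply follows the GAF columns named above.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Evidence:
    reference: str                     # what col.6 holds today
    evidence_code: str                 # what col.7 holds today
    with_ids: List[str] = field(default_factory=list)  # what col.8 holds today


@dataclass
class Annotation:
    object_id: str                     # col.2
    go_id: str                         # col.5
    evidence: List[Evidence] = field(default_factory=list)


def to_gaf_like_rows(a: Annotation) -> List[Tuple[str, str, str, str, str]]:
    """Flatten one assertion into one row per evidence item.

    The ';' separator is only for display in this sketch; it is not a
    statement about how col.8 values are delimited in real files.
    """
    return [
        (a.object_id, a.go_id, e.reference, e.evidence_code, ";".join(e.with_ids))
        for e in a.evidence
    ]


# Placeholder identifiers, not real records.
annotation = Annotation(
    object_id="EX:gp1",
    go_id="GO:0000000",
    evidence=[
        Evidence("PMID:0000001", "IPI", ["EX:gp2"]),
        Evidence("PMID:0000002", "ISS", ["EX:gp3"]),
    ],
)
rows = to_gaf_like_rows(annotation)
```

Here one assertion carries two pieces of evidence. Flattening it yields two separate rows, which loses the information that the two rows were meant as a single chain of support.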

There are sets of annotations that can only be identified via the assigned_by field (for instance, the GOC-assigned annotations automatically inferred from MF-BP links). To me this indicates that we need another field to consistently record how these annotations are generated. Edimmer 09:11, 16 August 2011 (PDT)

A diverse range of values is included in the 'with' field. These values can only be interpreted correctly once users have consulted the evidence code documentation and, in some cases, the cited GO_REFs. Common values are:
- protein accessions:
  - for IPI-evidenced annotations these are binding partners to the annotation object (col.2) (the format also differs between curation groups when many such ids are listed: 1:many or distinct binary interactions)
  - for ISS or IEA-evidenced annotations that use sequence/orthology information, they indicate the orthologous protein from which annotation data was obtained
- gene identifiers: mutated genes
- GO IDs that support an IC annotation by providing a way of tracing back to primary-evidenced annotations
IEA annotations include external vocabularies which have supported the annotation prediction (e.g. IPR ids)

09:11, 16 August 2011 (PDT)

Optimize the annotation format for viral curators.

The dual taxon requirement for many virus annotations is not ideal: many investigators use an experimental organism that never acts as the virus's natural host. While it might be of interest to the user that the experimental host context is captured in some cases, perhaps data on known viral hosts should additionally be used/supplied? UniProt has a virus/host list that could be used. However, how is such dual taxon information intended to be used by our users? Should there exist a reciprocal annotation/link to the host protein/process to indicate that it is targeted by viral action? (see below) Edimmer 09:11, 16 August 2011 (PDT)

Capture the subject of a GO term's activity.

Although the target of an activity can now be captured in column 16, how does a curator annotate the target of an activity when the identity of the gene product carrying out the activity (the annotation object) is not known?

For instance:

1. PMID:10085113 describes the caspase cleavage site in Atrophin-1, indicating that it is a target of executioner caspases and involved in the execution phase of apoptosis. While caspase 3 is used in the paper to demonstrate that this protein is a caspase substrate, it is likely to be the target of other executioner caspases as well.

Could something be done using a full set of relationships between the id in col.2 and the GO ID in col.5? Could targets of an annotation that are cited in column 16 be used to automatically generate an annotation with the target in col.2, along with an appropriate relationship to the GO ID in column 5? Edimmer 09:11, 16 August 2011 (PDT)

2. PMID:21775285, page 2. Akt is the subject of acetylation (by P300 and PCAF), which decreases Akt protein kinase activity. Akt is also the subject of deacetylation (by SIRT1), which promotes phosphorylation by ???, which in turn increases Akt protein kinase activity. How can this data be fully represented in an annotation?

Describing the target of an activity only in relation to a particular object may not properly represent the data. Although only SIRT1 is used to test deacetylation of Akt, other protein deacetylases may also be involved in this action.

3. Example from a Jax researcher, who wants to capture proteins engaged in 'mRNA transport' to:
- synapse
- dendrite
- axon
- dendritic spines
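
The question raised under example 1, whether targets cited in column 16 could be used to generate annotations with the target in col.2, can be sketched in Python as follows. This is only an illustration under stated assumptions: the `relation(ID)` pattern for the extension text, the relation names, the inverse-relation table and all identifiers are hypothetical placeholders, not agreed vocabulary.

```python
import re
from typing import List, NamedTuple

# Assumed extension syntax for this sketch: relation(ID), comma-separated.
EXTENSION = re.compile(r"(\w+)\(([^()]+)\)")

# Hypothetical mapping from an extension relation to the relation that
# the derived, target-centred annotation would use.
HYPOTHETICAL_INVERSE = {"has_input": "is_input_of"}


class DerivedAnnotation(NamedTuple):
    object_id: str         # the former col.16 target, now in the col.2 role
    relation: str          # relationship between the target and the GO ID
    go_id: str             # unchanged GO ID from col.5
    source_object_id: str  # the original col.2 object, kept for provenance


def derive_target_annotations(object_id: str, go_id: str,
                              extension: str) -> List[DerivedAnnotation]:
    """Create one derived annotation per target with a known inverse relation."""
    derived = []
    for relation, target in EXTENSION.findall(extension):
        inverse = HYPOTHETICAL_INVERSE.get(relation)
        if inverse is not None:
            derived.append(DerivedAnnotation(target, inverse, go_id, object_id))
    return derived


# Placeholder identifiers only.
derived = derive_target_annotations(
    "EX:enzyme", "GO:0000000", "has_input(EX:substrate),occurs_in(EX:cell)"
)
```

Only the has_input target produces a derived annotation here; occurs_in has no entry in the hypothetical inverse table and is skipped. This does not address the harder case above, where the gene product carrying out the activity is unknown and there is no source annotation to derive from.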

Linking Annotations via a unique annotation id

See Alex's proposal from Bar Harbor and multiple term annotations

- this does sound powerful, but I am concerned about whether it is possible before all annotations are kept and developed in a single annotation database (CAF), where they can be consistently audited. Building complex annotation lines on the basis of annotation IDs might be problematic if we cannot be sure that all groups are maintaining the annotations and the associated IDs in the same manner.

- could be useful for different external annotation efforts. For instance, they might like to use an annotation ID to indicate where a specific gp involved in a normal MF/BP is disrupted to become involved in a disease/trait/phenotype?

Edimmer 09:11, 16 August 2011 (PDT)

Capturing further information on subcellular location

When a gene product is active in more than one location, but the curator is not provided with the activity carried out at each location, it would be useful to be able to indicate that the protein moves between locations A and B. Ideally, two GO terms (e.g. cytoplasm and nucleus) could be annotated in the equivalent of column 5. I can't see how this can be done with the current format without losing information.

Would it be desirable to indicate in an annotation when a gene product is predominately in location X?

Gene product state information

Capture specific information about the state or structure of a GP without having to give it a new ID. For example, a GP may be able to perform a reaction in a phosphorylated state but not when unphosphorylated. Different domains could be phosphorylated with different effects on the reactions the GP can perform. The configuration of pores and transporters is very important in whether or not transport occurs.

Uncertain information

Gene product A performs reaction X or Y, but we do not know which.
Several gene products are involved in process X; perhaps we know the functions involved but don't know which GP performs which, or we have two candidates for performing a reaction but don't know which one does it.
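
One possible way to picture these two kinds of uncertainty is a "one of" construct on either side of the annotation. The Python sketch below is purely illustrative: `OneOf`, `UncertainAnnotation` and the EX: identifiers are hypothetical and do not correspond to any existing format.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class OneOf:
    """A set of alternatives; which one actually holds is unknown."""
    alternatives: FrozenSet[str]


@dataclass(frozen=True)
class UncertainAnnotation:
    gene_products: OneOf  # candidate gene products
    activities: OneOf     # candidate reactions or functions


def is_uncertain(a: UncertainAnnotation) -> bool:
    """True if more than one candidate remains on either side."""
    return len(a.gene_products.alternatives) > 1 or len(a.activities.alternatives) > 1


# "Gene product A performs reaction X or Y" (placeholder IDs).
one_gp_two_reactions = UncertainAnnotation(
    OneOf(frozenset({"EX:gpA"})),
    OneOf(frozenset({"EX:reactionX", "EX:reactionY"})),
)

# "Two candidates for performing a reaction" (placeholder IDs).
two_gps_one_reaction = UncertainAnnotation(
    OneOf(frozenset({"EX:gp1", "EX:gp2"})),
    OneOf(frozenset({"EX:reactionX"})),
)
```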

Negating an annotation extension

It is not currently possible to negate an annotation extension; it would be useful to be able to do this. Examples to be added.

Build a pathway on the fly

Take a process like sucrose catabolism; there are a number of different routes by which this can occur - see this MetaCyc page for examples. It may not be possible to capture this pathway information in GO due to the strength of the part-of / has-part relations (i.e. must be ALL X have part some Y or ALL Y part of some X). The pathway could instead be created at the annotation stage by specifying the order of the reactions, the components of the cell in which the reactions occur, etc.

Annotating the route a signaling pathway takes. For example, in PMID 21245381, multiple growth factor signaling pathways (VEGF, PDGF, HGF) signal via phosphorylation of the p130Cas molecule. We probably do not want a separate GO term for 'p130Cas signaling' but could instead capture this at the annotation stage. Would be particularly useful to capture the order of the steps (Rebecca and Ruth).
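
As a minimal sketch of "specifying the order of the reactions, the components of the cell in which the reactions occur", the Python below records each step with its location and its predecessor, then recovers the order. The `Step` class, the single-predecessor simplification and all names are assumptions made for illustration; a real pathway may branch and would need more than one predecessor per step. It uses `graphlib.TopologicalSorter` from the standard library (Python 3.9+).

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter
from typing import List, Optional


@dataclass
class Step:
    name: str                      # reaction or activity (placeholder label)
    location: str                  # cell component where the step occurs
    follows: Optional[str] = None  # name of the preceding step, if any


def ordered_steps(steps: List[Step]) -> List[Step]:
    """Return the steps so that every step comes after the one it follows."""
    by_name = {s.name: s for s in steps}
    # TopologicalSorter expects a mapping of node -> set of predecessors.
    graph = {s.name: ({s.follows} if s.follows else set()) for s in steps}
    return [by_name[name] for name in TopologicalSorter(graph).static_order()]


# Steps supplied in arbitrary order, with placeholder names and locations.
steps = [
    Step("reaction_3", "location_B", follows="reaction_2"),
    Step("reaction_1", "location_A"),
    Step("reaction_2", "location_A", follows="reaction_1"),
]
pathway = ordered_steps(steps)
```

Because the order lives in the annotations rather than in the ontology, a different route through the same process could be captured as a different set of steps without needing a new GO term.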

Future annotation areas

There's likely to be a lot of data coming from metabolomics and metagenomics studies in the next few years (e.g. the Human Microbiome Project), so we might want to consider how to annotate population-level processes.