LEGO-style annotation ideas

From GO Wiki
Revision as of 12:17, 16 August 2011 by Edimmer (talk | contribs)

This page is for collecting together thoughts and ideas on what we want from a new annotation system. Given a clean sheet, what would you want to capture? Examples are always useful.

What can't be fully captured in current format

For annotation developments that could be included in the current GAF format, see full details at: Proposed Developments to the GAF annotation format

Terms from external ontologies

    • Clarification: terms from external ontologies can be used as differentia in c16, but not in c5.
    • Allowing other ontologies in c5 is not hard in principle, BUT:
    • Adding other ontologies in c5 (e.g. CL) would require an explicit relationship type such as expressed_in (next item)

Nested class expressions (post-composed terms)

    • c16 allows multiple differentia but not nested class expressions
    • the syntax was designed to accommodate this, but at this point it gets quite complex for people and feels like overloading
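
The difference between multiple differentia and a nested class expression can be sketched as a small data structure. This is only an illustration: the ClassExpression type, the render helper, the "^" separator and all identifiers below are assumptions made for this page, not an agreed syntax or real term IDs.

```python
from dataclasses import dataclass, field


@dataclass
class ClassExpression:
    """A term ID plus zero or more (relation, filler) differentia."""
    term: str
    # Each entry is a (relation_name, ClassExpression) pair.
    differentia: list = field(default_factory=list)

    def depth(self):
        """0 for a bare term, 1 for flat differentia, 2 or more when nested."""
        if not self.differentia:
            return 0
        return 1 + max(filler.depth() for _, filler in self.differentia)


def render(expr):
    """Render with a hypothetical syntax: TERM^rel(FILLER)^rel(FILLER)."""
    parts = [expr.term]
    for relation, filler in expr.differentia:
        parts.append(f"{relation}({render(filler)})")
    return "^".join(parts)


def fits_flat_c16(expr):
    """True if the expression needs no nesting (one level of differentia at most)."""
    return expr.depth() <= 1


# Placeholder identifiers, for illustration only.
flat = ClassExpression("GO:A", [
    ("occurs_in", ClassExpression("CL:B")),
    ("has_participant", ClassExpression("PR:C")),
])
nested = ClassExpression("GO:A", [
    ("occurs_in", ClassExpression("CL:B", [("part_of", ClassExpression("UBERON:D"))])),
])
```

Here flat has two differentia at the same level, which is the case c16 already covers, while nested puts a differentia inside a filler, which is the case that gets complex for people.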

Multiple pieces of evidence for a single assertion

A new annotation format could provide a more detailed, structured format for the evidence supporting an annotation

    • Currently the evidence for an annotation is located in the reference (col.6), evidence (col.7) and with (col.8) fields in the GAF. There are restrictions on the acceptable values and their cardinality in these fields. However, curators would like to build a chain of evidence, which would mean including multiple evidence and reference identifiers to support a single annotation. While work-arounds are being discussed on calls, the solutions are not ideal. Edimmer 09:11, 16 August 2011 (PDT)
    • There are sets of annotations that can only be identified via the assigned_by field (for instance, the GOC-assigned annotations automatically inferred from MF-BP links). To me this indicates that we need another field to consistently record how these annotations are generated. Edimmer 09:11, 16 August 2011 (PDT)
    • A diversity of values is included in the 'with' field. This diversity can only be interpreted correctly once users have consulted the evidence code documentation and, in some cases, the cited GO_REFs. Common values are:
      • protein accessions:
        • for IPI-evidenced annotations these are binding partners to the annotation object (col.2). Curation groups also differ in how they format many such ids: as a single 1:many list or as distinct binary interactions
        • for ISS or IEA-evidenced annotations that use sequence/orthology information, they indicate the orthologous protein from which annotation data was obtained
      • gene identifiers: mutated genes
      • GO IDs that support an IC annotation by providing a way of tracing back to primary-evidenced annotations
  • IEA annotations include external vocabularies which have supported the annotation prediction (e.g. IPR ids)

09:11, 16 August 2011 (PDT)
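
As a rough sketch of what a more structured evidence format could look like, the example below lets one annotation carry an ordered list of evidence items, and spells out how the 'with' values are read for each evidence code as summarised in the list above. The class names, the generated_by field and all identifiers are hypothetical placeholders, not part of GAF or of any agreed proposal.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    """One piece of evidence: mirrors the GAF reference, evidence and with fields."""
    reference: str           # e.g. a PMID or GO_REF identifier (GAF col. 6)
    code: str                # evidence code such as IPI, ISS, IC, IEA (GAF col. 7)
    with_ids: list = field(default_factory=list)  # GAF col. 8 values
    generated_by: str = ""   # hypothetical field: how the annotation was produced


@dataclass
class Annotation:
    """A single assertion supported by an ordered chain of evidence."""
    object_id: str           # annotated gene product (GAF col. 2)
    go_id: str               # GO term (GAF col. 5)
    evidence: list = field(default_factory=list)


# Reading of the 'with' values per evidence code, taken from the list above.
WITH_MEANING = {
    "IPI": "binding partner of the annotated object",
    "ISS": "protein the annotation was transferred from",
    "IC": "GO ID of the annotation the inference is based on",
    "IEA": "external vocabulary entry or source protein supporting the prediction",
}


def describe_with(annotation):
    """Return (with-value, meaning) pairs for every evidence item of an annotation."""
    pairs = []
    for ev in annotation.evidence:
        meaning = WITH_MEANING.get(ev.code, "see evidence code documentation")
        for value in ev.with_ids:
            pairs.append((value, meaning))
    return pairs


# Placeholder identifiers, for illustration only.
ann = Annotation("UniProtKB:P00000", "GO:0000000", [
    Evidence("PMID:0000000", "IPI", ["UniProtKB:P00001"]),
    Evidence("GO_REF:0000000", "IC", ["GO:0000001"], generated_by="curator inference"),
])
```

Keeping each reference, code and 'with' value together in one Evidence item avoids the cardinality problem of three parallel columns, and the explicit lookup means users would not have to infer the meaning of a 'with' value from the evidence code documentation.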

Optimize the annotation format for viral curators.

The dual taxon requirement for many virus annotations is not ideal: many investigators use an organism for investigations that never acts as the virus's natural host. While it might be of interest to the user that the experimental host context is captured in some cases, perhaps data on known viral hosts should additionally be used/supplied? UniProt has a virus/host list that could be used. However, how is such dual taxon information intended to be used by our users? Should there exist a reciprocal annotation/link to the host protein/process to indicate that they are targeted by viral action? (see below) Edimmer 09:11, 16 August 2011 (PDT)

Capture the subject of a GO term's activity.

Although the target of an activity can now be captured in column 16, shouldn't we be able to provide a simpler way for users to retrieve gene identifiers that are the subject of a GO process/function? In addition, what happens when you only want to indicate the subject of a GO term, and not the object? As the information we provide to users becomes more sophisticated, we should try to provide it in as simple a format as possible, to encourage all types of users to play with our data.

For instance:

1. PMID:10085113 describes the caspase cleavage site in Atrophin-1, indicating that it is a target of executioner caspases and involved in the execution phase of apoptosis. While caspase 3 is used in the paper to demonstrate that this protein is a caspase substrate, it is likely to be the target of other executioner caspases as well.

2. PMID:9766676 RTP/rit42 is located in the cytoplasm; however, in response to DNA damage, the protein's localization shifts to the nucleus. (Could this data be captured in column 16?)

That said, could something be done using a full set of relationships between the id in col.2 and col.5? Could targets of an annotation that are cited in column 16 be used to automatically generate an annotation with the target in col. 2 along with an appropriate relationship to the GO ID in column 5? Edimmer 09:11, 16 August 2011 (PDT)
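
One way to picture the automatic generation asked about above: read the relation(ID) entries from column 16 and, for relations that name a target, emit a derived row with that target in col. 2. This is a minimal sketch under stated assumptions. The dictionary keys, the relation names used as examples, the INVERSE mapping with its inverse relation names, and all identifiers are hypothetical, not an agreed vocabulary.

```python
import re

# A column-16 entry has the form relation(ID); entries are joined by ',' or '|'.
EXTENSION = re.compile(r"(\w+)\(([^()]+)\)")

# Hypothetical mapping from a col. 16 relation to the relation the target
# would hold towards the GO term in the derived annotation.
INVERSE = {
    "has_input": "input_of",
    "has_regulation_target": "regulation_target_of",
}


def derive_target_annotations(row):
    """Return derived rows with each col. 16 target moved into col. 2."""
    derived = []
    for relation, target in EXTENSION.findall(row["extension"]):
        if relation not in INVERSE:
            continue  # e.g. a location such as occurs_in(...) is not a target
        derived.append({
            "object_id": target,               # target becomes the col. 2 id
            "relation": INVERSE[relation],     # relationship to the GO ID in col. 5
            "go_id": row["go_id"],
            "reference": row["reference"],
            "derived_from": row["object_id"],  # original col. 2 id, kept for tracing
        })
    return derived


# Placeholder row, for illustration only.
row = {
    "object_id": "UniProtKB:P00000",
    "go_id": "GO:0000000",
    "reference": "PMID:0000000",
    "extension": "has_input(UniProtKB:P00001),occurs_in(CL:0000000)",
}
derived = derive_target_annotations(row)
```

In the Atrophin-1 example, the original row would be the caspase annotation with Atrophin-1 cited in column 16, and the derived row would put Atrophin-1 in col. 2. Keeping derived_from makes it possible to audit where each generated annotation came from.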

Linking Annotations via a unique annotation id

- This does sound powerful, but I am concerned about whether it is possible before all annotations are kept and developed in a single annotation database (CAF), where they can be consistently audited. Building complex annotation lines on the basis of annotation IDs might be problematic where we cannot be sure that all groups are maintaining the annotations and the associated IDs in the same manner.

- It could be useful for different external annotation efforts. For instance, they might like to use an annotation ID to indicate where a specific gp involved in a normal MF/BP is disrupted and becomes involved in a disease/trait/phenotype.

Edimmer 09:11, 16 August 2011 (PDT)

Future annotation areas

  • There's likely to be a lot of data coming from metabolomics and metagenomics studies in the next few years (e.g. the Human Microbiome Project), so we might want to consider how to annotate population-level processes

Useful links

Technical

Meetings

Aug 23 8am PST