LEGO-style annotation ideas

From GO Wiki
Revision as of 11:21, 16 August 2011

This page is for collecting together thoughts and ideas on what we want from a new annotation system. Given a clean sheet, what would you want to capture? Examples are always useful.

What can't be fully captured in current format

For annotation developments that could be included in the current GAF format, see full details at: Proposed Developments to the GAF annotation format

  • Multiple pieces of evidence for a single assertion
  • Terms from external ontologies
    • clarification: can be used as differentia in c16, but not in c5.
    • Allowing other ontologies in c5 is not in principle hard, BUT:
    • Adding other ontologies in c5 (e.g. CL) would require an explicit relationship type such as expressed_in (next item)
  • Nested class expressions (post-composed terms)
    • c16 allows multiple differentia but not nested class expressions
    • the syntax was designed to accommodate this, but at this point it gets quite complex for people and feels like overloading
  • Provide a more descriptive evidence record.

A new format could provide a more detailed, structured record of the evidence supporting an annotation.

    • Currently the evidence for an annotation is located in the reference (col.6), evidence (col.7) and with (col.8) fields in the GAF. There are restrictions on the cardinality of values in these fields. However, curators would like to make chains of evidence that would result in the inclusion of multiple evidence and reference identifiers to support an annotation. While work-arounds are being discussed on calls, solutions are not ideal.
    • there are sets of annotations that can only be identified via the assigned_by field (for instance the GOC-assigned annotations automatically inferred from MF-BP links)
    • there is a diversity of values included in the 'with' field. This diversity can only be interpreted correctly once users have consulted the evidence code documentation and, in some cases, the cited GO_REFs. Common values are:
      • protein accessions:
        • for IPI-evidenced annotations these are binding partners of the annotation object (col.2) (curation groups also differ in format where many such ids are listed: 1:many or distinct binary interactions)
        • for ISS or IEA-evidenced annotations that use sequence/orthology information, they indicate the orthologous protein from which annotation data was obtained
      • gene identifiers: mutated genes
      • GO IDs that support an IC annotation by providing a way of tracing back to primary-evidenced annotations
  • IEA annotations include external vocabularies which have supported the annotation prediction (e.g. IPR ids)
  • Optimize the annotation format for viral curators.

The dual taxon requirement for many virus annotations is not ideal: many investigators use an organism for investigations that never acts as the virus's natural host. While it might be of interest to the user that the experimental host context is captured in some cases, perhaps data on known viral hosts should additionally be used/supplied? UniProt has a virus/host list that could be used. However, how is such dual taxon information intended to be used by our users? Should there exist a reciprocal annotation/link to the host protein/process to indicate that they are targeted by viral action? (see below)
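To make the evidence-record and virus/host points above more concrete, here is a minimal Python sketch of what a structured annotation record might look like. It is only an illustration, not a proposed schema: the class names, field names, `WithRole` labels and all `EX:` identifiers are hypothetical placeholders. The sketch shows one assertion carrying several pieces of evidence, 'with' values that state their own meaning, and a host taxon flagged as natural or experimental.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class WithRole(Enum):
    """Hypothetical labels stating what a 'with' value means."""
    BINDING_PARTNER = "binding_partner"          # IPI
    SEQUENCE_SOURCE = "sequence_source"          # ISS / IEA orthologue
    MUTATED_GENE = "mutated_gene"
    SUPPORTING_GO_TERM = "supporting_go_term"    # IC
    EXTERNAL_VOCABULARY = "external_vocabulary"  # e.g. IPR ids behind IEA


@dataclass
class WithValue:
    identifier: str
    role: WithRole


@dataclass
class Evidence:
    code: str        # evidence code, e.g. "IPI"
    reference: str   # reference identifier
    with_values: List[WithValue] = field(default_factory=list)


@dataclass
class HostContext:
    taxon: str
    is_natural_host: bool  # False for a purely experimental host


@dataclass
class Annotation:
    gene_product: str
    go_id: str
    assigned_by: str
    # No cardinality limit: any number of evidence records per assertion.
    evidence: List[Evidence] = field(default_factory=list)
    host: Optional[HostContext] = None

    def with_values_by_role(self, role: WithRole) -> List[str]:
        """Collect 'with' identifiers of one role across all evidence."""
        return [w.identifier
                for e in self.evidence
                for w in e.with_values
                if w.role is role]


# All identifiers below are made-up placeholders, not real records.
example = Annotation(
    gene_product="EX:protein1",
    go_id="EX:term1",
    assigned_by="EX:group",
    evidence=[
        Evidence("IPI", "EX:ref1",
                 [WithValue("EX:protein2", WithRole.BINDING_PARTNER)]),
        Evidence("IC", "EX:ref2",
                 [WithValue("EX:term2", WithRole.SUPPORTING_GO_TERM)]),
    ],
    host=HostContext(taxon="EX:taxon1", is_natural_host=False),
)

print(len(example.evidence))                                  # 2
print(example.with_values_by_role(WithRole.BINDING_PARTNER))  # ['EX:protein2']
```

With a role attached to each 'with' value, a user would not need to consult the evidence code documentation to tell a binding partner from an orthologue source. Whether such roles should live in the annotation format or be derived from the evidence code is left open here.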

  • Capturing the subject of a GO term's activity.

For instance:

1. PMID:10085113 describes the caspase cleavage site in Atrophin-1, indicating that it is a target of executioner caspases and is involved in the execution phase of apoptosis. While caspase 3 is used in the paper to demonstrate that this protein is a caspase substrate, it is likely to be the target of other executioner caspases as well.

2. PMID:9766676: RTP/rit42 is located in the cytoplasm; however, in response to DNA damage, the protein's localization shifts to the nucleus. (Could this data be captured in column 16?)
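As a rough illustration of how the two examples above (a target of an activity, and a location that holds only under some condition) might be written as class expressions, including nested ones, here is a small Python sketch. Everything in it is an assumption made for illustration: the relation names (`exists_during`, `has_target`, `part_of`), the `EX:` placeholder identifiers, and the `^` / `relation(filler)` rendering, which only loosely mimics the column 16 style and is not a statement of the actual GAF syntax.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Term:
    """A named class from some ontology (placeholder ids only)."""
    id: str

    def render(self) -> str:
        return self.id


@dataclass(frozen=True)
class Restriction:
    """relation(filler); the filler may itself be a nested expression."""
    relation: str
    filler: Union[Term, "Intersection"]

    def render(self) -> str:
        return f"{self.relation}({self.filler.render()})"


@dataclass(frozen=True)
class Intersection:
    """A genus term refined by one or more differentia."""
    genus: Term
    differentia: Tuple[Restriction, ...]

    def render(self) -> str:
        parts = [self.genus.render()]
        parts.extend(d.render() for d in self.differentia)
        return "^".join(parts)


# Location that applies only during a process (hypothetical relation name).
relocated = Intersection(
    Term("EX:nucleus"),
    (Restriction("exists_during", Term("EX:dna_damage_response")),),
)
print(relocated.render())
# EX:nucleus^exists_during(EX:dna_damage_response)

# Target of an activity, where the target is itself a nested expression.
target = Restriction(
    "has_target",
    Intersection(Term("EX:protein"), (Restriction("part_of", Term("EX:cell")),)),
)
print(target.render())
# has_target(EX:protein^part_of(EX:cell))
```

A tree of objects like this keeps nesting explicit for software even when the flat string form becomes hard for people to read, which is the concern raised earlier about overloading the column 16 syntax.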

Future annotation areas

  • There's likely to be a lot of data coming from metabolomics and metagenomics studies in the next few years (e.g. the Human Microbiome Project), so we might want to consider how to annotate population-level processes

Useful links

Technical

Meetings

Aug 23 8am PST