Mechanisms for reducing annotation redundancy

From GO Wiki

Annotation Redundancy

What is a redundant annotation?

An annotation in which the association between a GO term and a gene/gene product identifier has already been supplied by another annotation.

The GO term of the redundant annotation can be an exact duplicate of the term in the other annotation(s), or an ancestor (less granular) GO term of it.
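As a minimal sketch of this definition, assume a toy ontology stored as a mapping from each term to its direct parents. The term names, gene product identifiers and helper functions below are hypothetical illustrations; real relationships would come from the GO ontology itself.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy ontology: each term maps to its direct parent terms.
PARENTS: Dict[str, Set[str]] = {
    "kidney development": {"organ development"},
    "organ development": {"developmental process"},
    "developmental process": set(),
}

Annotation = Tuple[str, str]  # (gene product identifier, GO term)


def ancestors(term: str) -> Set[str]:
    """Return every term reachable by following parent links from `term`."""
    found: Set[str] = set()
    stack = list(PARENTS.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(PARENTS.get(current, ()))
    return found


def redundant_indices(annotations: List[Annotation]) -> List[int]:
    """Indices of annotations already supplied by another annotation.

    An annotation counts as redundant if, for the same gene product, its term
    is an exact duplicate of an earlier annotation's term, or is an ancestor
    of another annotation's term.
    """
    redundant = []
    for i, (gene, term) in enumerate(annotations):
        for j, (other_gene, other_term) in enumerate(annotations):
            if i == j or gene != other_gene:
                continue
            exact_duplicate = term == other_term and j < i
            implied_by_descendant = term in ancestors(other_term)
            if exact_duplicate or implied_by_descendant:
                redundant.append(i)
                break
    return redundant


example = [
    ("P1", "kidney development"),
    ("P1", "organ development"),   # ancestor of the first annotation's term
    ("P1", "kidney development"),  # exact duplicate of the first annotation
    ("P2", "organ development"),   # different gene product, not redundant
]
print(redundant_indices(example))
```

Only the first of a set of exact duplicates is treated as non-redundant here; which copy to keep in practice is a separate policy question.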

How are redundant GO annotations created?

When a large amount of information is available to describe a gene product, it is often possible to find sets of annotations that repeatedly associate the same (or similar) GO term to a gene product.

Such 'redundant' GO term-gene product associations can be obtained from GO annotation efforts by:

1. the manual curation of data from different literature sources that provide information on the functioning of the same gene product.

2. the detailed manual curation of one literature source that provides the reader with multiple sources of evidence to support the association of the same GO term with the gene product.

3. the manual curation of data sources that specifically characterise different gene product forms (isoforms, post-translationally modified forms; described in column 17).

4. the independent curation of the same paper by different GO annotation groups (generally strongly discouraged, as it is often seen as a waste of valuable curation resources).

5. the manual or automatic transfer of terms from multiple similarly-functioning homologs (perhaps using a range of orthology/homology techniques).

6. automatically created annotation predictions using a range of sequence and/or complementary gene product-specific data.

The advantages of annotation redundancy

A specific association between a GO term and gene product can be made stronger by the availability of multiple annotations that replicate this association using evidence from different, independent sources of data.

The supply of full, distinct annotation sets from different evidence sources or methods allows the GO Consortium to give its users complete information on how well different methods can describe the functioning of gene products. This is of particular interest for users of methods that use a sequence or orthology-based approach for annotation prediction.

In a similar manner, some annotation groups annotate on a paper-by-paper basis - aiming to describe the full set of annotations supplied to an organism from a particular set of journals.

The disadvantages of annotation redundancy

Because each annotation is traditionally displayed alongside its supporting reference and evidence code, web displays of GO annotations for a gene product can become long, repetitive and unattractive for the reader.

Multiple GO term-gene product associations can cause a GO annotation dataset to become unnecessarily large and cumbersome for manipulation of data.

An unedited paper-by-paper curation approach can result in an annotation set that provides a historical view of the knowledge of gene product functions, rather than the most up-to-date perspective.

Methods for Dealing with Annotation Redundancy

Different displays of GO annotation data may require different filtering methods, according to their users' expectations.


Considerations for Dealing with Annotation Redundancy in Files from Authoritative Annotation Sources

Possible steps are listed below, aimed at generating GOC discussions.

A. Considerations:

  • What level of GO annotation filtering is necessary in annotation files compared with website displays?
    • Web displays are intended for 'low-throughput GO annotation' users, who are interested in looking at the full GO annotation descriptions of one or a few gene products. How important is it to filter out annotations in annotation files?
    • Filtering requirements for an 'authoritative GO annotation file for a species' can be different than the requirements for a MOD's web GO annotation display. Websites/annotation files might be considered to have different user groups or display expectations. In addition, as the association file is a representation of the 'GO Consortium product', external contributing groups need to be confident of the filtering mechanisms employed.
  • All authoritative sources of annotation data are expected to retrieve all publicly-sourced annotations of interest from the GO Consortium, including IEA-evidenced annotations, as well as column 16/17 data.

* Manual annotation filtering?

Should there be any manual annotation filtering in annotation files?

- Manual annotation is an expensive activity. 
- Problems with manual annotations should be communicated to the primary source, or removed via GOC-agreed GO Consortium annotation Hard QC checks. (soft QC check conflicts kept in annotation file?)
    • If manual annotation filtering is carried out, filtering specifically aimed at reducing annotation redundancy could be guided by considerations such as:
      • prefer data from active annotation groups over data from defunct sources (e.g. PINC, GDB etc.)?
      • prefer data from primary annotation providers, e.g. for protein binding, prefer IntAct/BioGrid, which carry out a more complete annotation, over other GOC sources describing the same interaction/paper?
      • prefer annotations that apply a more granular GO term?

N.B. this can cause problems:

  • if a curator wants to state that a protein is involved in all types of organ development, and also curates specific knowledge of its involvement in kidney development
  • where the GO term used by multiple annotations is identical, groups could use the annotation which has the lower (i.e. better-ranked) quality score, based on evidence code sets, e.g.:

    1. Experimental (IDA > IMP > IGI > IPI > EXP > IEP)
    2. IC
    3. ISS set (ISO > ISA > ISM > IBA > IBD > IKR > IGC (RCA?))
    4. TAS > NAS > ND
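The evidence-code ranking above could be applied along the following lines. This is only a sketch under stated assumptions: the record layout, the function name `keep_best_evidence` and the example identifiers are hypothetical, and placing ISS at the head of its set and RCA after IGC is a guess, since the ranking above does not fix their positions.

```python
from typing import Dict, List, NamedTuple, Tuple

# Tiers follow the example ranking above; order within a tier follows the ">" signs.
# Assumption: ISS leads its own set and RCA follows IGC (not stated in the text).
EVIDENCE_TIERS: List[List[str]] = [
    ["IDA", "IMP", "IGI", "IPI", "EXP", "IEP"],
    ["IC"],
    ["ISS", "ISO", "ISA", "ISM", "IBA", "IBD", "IKR", "IGC", "RCA"],
    ["TAS", "NAS", "ND"],
]

# Flatten the tiers into one ordering: a smaller number means better evidence.
RANK: Dict[str, int] = {
    code: position
    for position, code in enumerate(c for tier in EVIDENCE_TIERS for c in tier)
}
UNRANKED = len(RANK)  # codes missing from the table sort after all ranked codes


class Annotation(NamedTuple):
    gene_product: str
    go_term: str
    evidence: str
    reference: str


def keep_best_evidence(annotations: List[Annotation]) -> List[Annotation]:
    """Keep one annotation per (gene product, GO term): the best-ranked one."""
    best: Dict[Tuple[str, str], Annotation] = {}
    for ann in annotations:
        key = (ann.gene_product, ann.go_term)
        current = best.get(key)
        if current is None or (
            RANK.get(ann.evidence, UNRANKED) < RANK.get(current.evidence, UNRANKED)
        ):
            best[key] = ann
    return list(best.values())


# Hypothetical records; the identifiers are placeholders, not real accessions.
records = [
    Annotation("P1", "TERM_A", "TAS", "REF:a"),
    Annotation("P1", "TERM_A", "IDA", "REF:b"),
    Annotation("P1", "TERM_B", "IMP", "REF:c"),
]
for kept in keep_best_evidence(records):
    print(kept)
```

Collapsing to a single annotation discards the independent supporting evidence, so a filter of this kind is probably better suited to a display than to an authoritative file.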


* Should IEA annotation filtering be carried out?

- IEA annotations created by GOC groups are now widely seen as very high-quality sources of data; all are actively maintained and therefore consume resources/FTEs in the primary supplier groups.

- Incorrect, out-of-date or non-specific IEA annotations should be corrected by the primary supplier, or filtered out using GOC-agreed hard QC checks.

- Some users would like to retrieve the full annotation set provided by an automatic annotation method.

- If IEA filtering is considered, possible mechanisms include:

  - filter out IEA annotations that cover territory (exact or less granular terms) already covered by the manual set.

  - filter out IEA annotations from a particular source, or from all sources, that predict the same GO term using different external supporting data.

Example: InterPro can predict the same GO term for a protein from different InterPro domain matches, e.g. moeA5 Streptomyces ghanaensis
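The two IEA filtering mechanisms listed above can be sketched together as follows. This is a hedged illustration rather than an agreed GOC procedure: the record layout, the `filter_iea` function, the precomputed `ancestors` mapping and all identifiers are hypothetical.

```python
from typing import Dict, List, NamedTuple, Set, Tuple


class Annotation(NamedTuple):
    gene_product: str
    go_term: str
    evidence: str
    with_from: str  # external supporting data, e.g. a domain match


def filter_iea(
    annotations: List[Annotation],
    ancestors: Dict[str, Set[str]],
) -> List[Annotation]:
    """Drop IEA annotations that are redundant under the two mechanisms.

    `ancestors` maps a GO term to the full set of its less granular terms
    (assumed to be precomputed from the ontology).
    """
    # Territory covered by the manual set: each manual term plus its ancestors.
    manual_terms: Dict[str, Set[str]] = {}
    for ann in annotations:
        if ann.evidence != "IEA":
            covered = manual_terms.setdefault(ann.gene_product, set())
            covered.add(ann.go_term)
            covered.update(ancestors.get(ann.go_term, set()))

    kept: List[Annotation] = []
    seen_iea: Set[Tuple[str, str]] = set()
    for ann in annotations:
        if ann.evidence == "IEA":
            key = (ann.gene_product, ann.go_term)
            # Mechanism 1: exact or less granular term already in the manual set.
            if ann.go_term in manual_terms.get(ann.gene_product, set()):
                continue
            # Mechanism 2: same term already predicted from other supporting data.
            if key in seen_iea:
                continue
            seen_iea.add(key)
        kept.append(ann)
    return kept


# Hypothetical data; term names and identifiers are placeholders.
toy_ancestors = {"kidney development": {"organ development"}}
records = [
    Annotation("P1", "kidney development", "IDA", "REF:a"),
    Annotation("P1", "organ development", "IEA", "DOMAIN:x"),
    Annotation("P1", "binding", "IEA", "DOMAIN:x"),
    Annotation("P1", "binding", "IEA", "DOMAIN:y"),
]
for kept in filter_iea(records, toy_ancestors):
    print(kept)
```

Under the second mechanism the sketch keeps whichever prediction appears first; a real filter might instead merge the supporting data of the collapsed annotations into a single record.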