Mechanisms for reducing annotation redundancy
What is a redundant annotation?
At its simplest, an annotation is redundant when the same GO term to gene/gene product identifier association is supplied by more than one annotation.
The GO term could be exactly duplicated in the other annotation(s), or be an ascendant GO term.
However, this depends on whether the desired definition of redundancy should take into account distinct information contained in the reference, evidence, with/from or annotation_extension fields...
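The effect of the chosen definition can be illustrated with a minimal sketch. It assumes simplified annotation records with hypothetical field names (`gp`, `term`, `ref`, `evidence`) and placeholder identifiers; it only detects exactly duplicated terms, since detecting ascendant terms would also need the ontology structure.

```python
from collections import defaultdict

# Hypothetical, simplified annotation records. Field names and the
# EX:/REF: identifiers are placeholders, not real database entries.
ANNOTATIONS = [
    {"gp": "EX:gp1", "term": "GO:0005739", "ref": "REF:a", "evidence": "IDA"},
    {"gp": "EX:gp1", "term": "GO:0005739", "ref": "REF:b", "evidence": "IDA"},
    {"gp": "EX:gp1", "term": "GO:0005739", "ref": "REF:b", "evidence": "IMP"},
    {"gp": "EX:gp1", "term": "GO:0005737", "ref": "REF:c", "evidence": "IDA"},
]


def redundant_groups(annotations, key_fields):
    """Group annotations by the chosen fields and keep groups with more than one member."""
    groups = defaultdict(list)
    for ann in annotations:
        groups[tuple(ann[field] for field in key_fields)].append(ann)
    return {key: members for key, members in groups.items() if len(members) > 1}


# Loose definition: only the gene product and GO term matter.
loose = redundant_groups(ANNOTATIONS, ("gp", "term"))
# Stricter definition: reference and evidence are treated as distinct information.
strict = redundant_groups(ANNOTATIONS, ("gp", "term", "ref", "evidence"))
```

Under the loose key the three GO:0005739 annotations form one redundant group; under the strict key no annotation in this sample counts as redundant.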
How are redundant GO annotations created?
When a large amount of information is available to describe a gene product, it is often possible to find sets of annotations that repeatedly associate the same (or ascendant) GO term to a gene product.
Such 'redundant' GO term-gene product associations can be obtained from GO annotation efforts by:
1. the manual curation of data from different literature sources that provide information on the same gene product function/subcellular location, i.e. the same terms associated by different references.
2. the detailed manual curation of one literature source that provides the reader with multiple sources of evidence to support the association of the same GO term with the gene product, i.e. the same term associated by the same reference but different evidence.
3. the independent curation of the same paper by different GO annotation groups, i.e. the same term, paper and evidence associated by different curation efforts. (This would be the most stringent description of a redundant annotation. The practice is generally strongly discouraged, as it is often seen as wasting valuable curation resources.)
4. the manual or automatic transfer of terms from multiple similarly-functioning homologs (perhaps using a range of orthology/homology techniques), i.e. the same term, paper and evidence associated by different orthology statements (with/from field values).
5. automatically created annotation predictions using a range of sequence and/or complementary gene product-specific data, i.e. an automatic prediction and a manual annotation co-existing, using independent or related data sources.
The manual curation of data sources that specifically characterise different gene product forms with the same GO term may appear to produce redundant annotations. However, these are separate, distinct annotation statements, and users should take into consideration the gene_product_id values in column 17 when comparing them.
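The cases listed above, together with the gene product form caveat, can be sketched as a pairwise comparison. This is only an illustration: the record fields (`form` for the column 17 value, `with_from`, `assigned_by`) are hypothetical names, and the order in which the fields are checked is an assumption rather than an agreed GOC rule. An automatic prediction co-existing with a manual annotation (case 5) would normally surface here as a difference in reference or evidence.

```python
def classify_pair(a, b):
    """Describe how two annotation records overlap (illustrative sketch only)."""
    if a["gp"] != b["gp"] or a["term"] != b["term"]:
        return "not redundant (different gene product or term)"
    # Different gene product forms are distinct statements, not redundancy.
    if a.get("form") != b.get("form"):
        return "distinct: different gene product forms (column 17)"
    if a["ref"] != b["ref"]:
        return "same term, different references"
    if a["evidence"] != b["evidence"]:
        return "same term and reference, different evidence"
    if a.get("with_from") != b.get("with_from"):
        return "same term, reference and evidence, different with/from"
    if a.get("assigned_by") != b.get("assigned_by"):
        return "same term, reference and evidence, different curation groups"
    return "exact duplicate"


# Placeholder record; identifiers are invented for the example.
base = {
    "gp": "EX:gp1", "term": "GO:0005739", "ref": "REF:a", "evidence": "IDA",
    "form": None, "with_from": "", "assigned_by": "GroupA",
}
isoform = dict(base, form="EX:gp1-2")
other_paper = dict(base, ref="REF:b")
other_group = dict(base, assigned_by="GroupB")

labels = [classify_pair(base, x) for x in (isoform, other_paper, other_group)]
```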
The advantages of annotation redundancy
A specific association between a GO term and gene product can be made stronger/more trustworthy by the availability of multiple annotations that replicate an association using evidence from different, independent sources of data.
The supply of full, distinct annotation sets from different evidence sources or methods allows the GO Consortium to give its users complete information on how well different methods describe the functioning of gene products. This is of particular interest for users of methods that use a sequence or orthology-based approach for annotation prediction.
In a similar manner, some annotation groups manually annotate on a paper-by-paper basis, aiming to describe the full set of annotations supplied to an organism from a prioritized set of journals. This lets users see the full set of papers that provide characterization information for a specific gene product.
The disadvantages of annotation redundancy
As traditionally each annotation is displayed alongside its supporting reference and evidence code, web displays of GO annotations for a gene product can become long, repetitive and unattractive for the reader
Multiple GO term-gene product associations can cause a GO annotation dataset to become unnecessarily large and cumbersome for data manipulation
An unedited paper-by-paper curation approach can result in an annotation set that provides a historical view of the knowledge of gene product functions, rather than one based on the most current literature sources.
Multiple GO term-gene product associations are sometimes based on the same primary data and give a false impression that there is a weight of evidence for the association (e.g. one interesting experiment leads to many reviews and multiple TAS annotations) [ST].
Methods for Dealing with Annotation Redundancy
Different displays of GO annotation data (MOD or knowledgebase website, annotation file, GO browser) may require different amounts of annotation filtering to be applied, according to user expectations.
Considerations for Dealing with Annotation Redundancy in Files from Authoritative Annotation Sources
Possible steps are listed below. Please note that this text is aimed at generating GOC discussions; no method of annotation filtering has been agreed.
- What level of GO annotation filtering is necessary in annotation files in contrast to website displays?
- While web displays are intended for 'low-throughput GO annotation' users, interested in looking at the full GO annotation descriptions of one/few gene products; how important is filtering out of annotations in annotation files?
- Filtering requirements for an 'authoritative GO annotation file for a species' can be different than the requirements for the MOD's website GO annotation display. Websites/annotation files might be considered to have different user groups or display expectations. In addition, as the association file is a representation of the 'GO Consortium product', external contributing groups need to be confident that the appropriate level of annotation filtering has been employed.
- All authoritative sources of annotation data are expected to retrieve all publicly-sourced annotations of interest from the GO Consortium, including IEA-evidenced annotations, as well as column 16/17 data.
- Do groups providing a species-authoritative GO annotation file have to carry out any annotation filtering steps?
Should there be any manual annotation filtering in annotation files?
- Manual annotation is an expensive, detailed activity. Only IEAs should be filtered.
- Problems with manual annotations should be communicated to the primary source, or removed via GOC-agreed GO Consortium annotation Hard QC checks. (soft QC check conflicts should be kept in authoritative annotation files?)
- If manual annotation filtering is carried out, filtering specifically aimed at reducing annotation redundancy could be focused using considerations such as:
- prefer data from active annotation groups over data from defunct sources (e.g. PINC, GDB etc.)?
- prefer data from primary annotation providers e.g. for protein binding - prefer IntAct/BioGrid that carry out a more complete annotation over other GOC sources describing the same interaction/paper?
- prefer annotations that apply a more granular GO term?
N.B. this can cause problems:
* e.g. if a curator wants to state that a protein is located both in cytoplasm AND mitochondrion (mitochondrion is a part_of child of cytoplasm: http://www.ebi.ac.uk/QuickGO/GTerm?id=GO:0005739#term=ancchart)
* if curators wanted to state that a protein is involved in all types of organ development, and also curates specific knowledge regarding known involvement in kidney development
- where the GO term used by multiple annotations is identical, groups could keep the annotation with the best-ranked evidence code (lowest-numbered set first, then leftmost code within a set), based on evidence code sets, e.g.
1. Experimental (IDA > IMP > IGI > IPI > EXP > IEP)
3. ISS set (ISO > ISA > ISM > IBA > IBD > IKR > IGC (RCA?))
4. TAS > NAS > ND
- how should an annotation that only differs by the presence/absence of column 16 be treated/filtered?
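The evidence-code preference above could be sketched as follows. This assumes that ">" means "preferred over" and that the sets are ranked in the order listed. The position of RCA is marked as uncertain in the list itself, and codes that the list does not mention are left unranked here rather than guessed. The record fields and identifiers are hypothetical.

```python
# Order copied from the list above. RCA's position is flagged as uncertain
# in the text; codes not listed there (e.g. IEA, IC) are treated as unranked.
EVIDENCE_ORDER = [
    "IDA", "IMP", "IGI", "IPI", "EXP", "IEP",                # experimental set
    "ISO", "ISA", "ISM", "IBA", "IBD", "IKR", "IGC", "RCA",  # ISS set
    "TAS", "NAS", "ND",
]
RANK = {code: position for position, code in enumerate(EVIDENCE_ORDER)}
UNRANKED = len(EVIDENCE_ORDER)  # unranked codes sort after every listed code


def pick_preferred(annotations):
    """Keep one annotation per (gene product, GO term): the best-ranked evidence code."""
    best = {}
    for ann in annotations:
        key = (ann["gp"], ann["term"])
        rank = RANK.get(ann["evidence"], UNRANKED)
        # Strict '<' means the first annotation seen wins a tie.
        if key not in best or rank < best[key][0]:
            best[key] = (rank, ann)
    return [ann for _, ann in best.values()]


# Placeholder records for illustration.
sample = [
    {"gp": "EX:gp1", "term": "GO:0005739", "evidence": "TAS"},
    {"gp": "EX:gp1", "term": "GO:0005739", "evidence": "IDA"},
    {"gp": "EX:gp1", "term": "GO:0005739", "evidence": "IMP"},
    {"gp": "EX:gp1", "term": "GO:0005737", "evidence": "NAS"},
]
kept = pick_preferred(sample)
```

Note that this sketch ignores column 16 and the reference, so annotations that differ only in those fields collapse into one; whether that is acceptable is exactly the open question raised above.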
NO? - IEA annotations created by GOC groups are now widely seen as very high-quality sources of data; all are actively maintained and therefore use resources/FTEs in the primary supplier groups.
- Incorrect, old or non-specific IEA annotations should be corrected by the primary supplier, or filtered out via the GOC-agreed hard QC checks.
- Some users would like to retrieve the full annotation set provided by an automatic annotation method
If IEA filtering is considered, possible mechanisms include:
- filter out IEA annotations that cover territory (exact or less granular terms) already covered by the manual set.
This approach is taken by FlyBase at present for filtering IEAs. The problem is that the IEA annotation set always appears at odds with the manual set and it draws attention to the occasional mapping error. This leads to negative user feedback and reduced user confidence in all IEA annotations. If users saw the entire set of IEAs they would realise the majority are accurate [ST].
- filter out IEA annotations from a particular/all sources that predict the same GO term using different external supporting data
Example: InterPro can predict the same GO term for a protein from different InterPro domain matches, e.g. moeA5 Streptomyces ghanaensis
There is nothing wrong with the InterPro-provided annotations for moeA5; this is an example where InterPro has provided a number of annotations, some of which are redundant because they provide less granular GO term associations, while others repeat associations using the same term. Although this annotation set arises from protein matches to different InterPro ids, users who look only at the protein-GO term association will see some redundancy in InterPro's predictions. Should only the most specific annotation predictions be supplied, with the others filtered out of species-authoritative annotation files?
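The first mechanism above (dropping IEA annotations whose term is the same as, or less granular than, a manually annotated term) might look like the sketch below. The ancestor map is a hand-written toy containing only the mitochondrion/cytoplasm relationship mentioned earlier; a real pipeline would derive it from the ontology. The record fields and gene product identifiers are hypothetical, and this is not a description of how any particular group implements its filter.

```python
from collections import defaultdict

# Toy ancestor map (term -> set of its ancestor terms), written by hand
# for illustration only.
ANCESTORS = {
    "GO:0005739": {"GO:0005737"},  # mitochondrion -> cytoplasm
    "GO:0005737": set(),           # cytoplasm (ancestors omitted in this toy map)
    "GO:0005634": set(),           # nucleus (ancestors omitted in this toy map)
}


def filter_iea(annotations, ancestors):
    """Drop IEA annotations already covered by a manual annotation of the same gene product."""
    covered = defaultdict(set)
    for ann in annotations:
        if ann["evidence"] != "IEA":
            # A manual term covers itself and all of its ancestors.
            covered[ann["gp"]].add(ann["term"])
            covered[ann["gp"]].update(ancestors.get(ann["term"], set()))
    return [
        ann for ann in annotations
        if ann["evidence"] != "IEA" or ann["term"] not in covered[ann["gp"]]
    ]


# Placeholder records for illustration.
sample = [
    {"gp": "EX:gp1", "term": "GO:0005739", "evidence": "IDA"},  # manual
    {"gp": "EX:gp1", "term": "GO:0005737", "evidence": "IEA"},  # less granular: dropped
    {"gp": "EX:gp1", "term": "GO:0005739", "evidence": "IEA"},  # exact: dropped
    {"gp": "EX:gp1", "term": "GO:0005634", "evidence": "IEA"},  # not covered: kept
    {"gp": "EX:gp2", "term": "GO:0005737", "evidence": "IEA"},  # no manual set: kept
]
filtered = filter_iea(sample, ANCESTORS)
```

An IEA annotation that is more granular than the manual set is kept by this sketch, which matches the "exact or less granular terms" wording of the mechanism.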
- should the GOC carry out filtering centrally on species-specific submitted files?
- should groups who want to remain authoritative suppliers of annotations for a species, carry out this filtering as part of their production pipeline?
Where filtering is applied to GO annotation files, how should we make this clear to users?
- is inclusion of the method in a file's readme adequate?
- should we point to files containing full sets of IEA predictions? (some users will want easy access to an unfiltered set).