Annotation Redundancy Specification

From GO Public

Formal specification of redundancy in GO Annotations

This formal specification is to be used in implementing software systems that require processing based on the notion of redundancy. Notably, there is no single definition of redundancy - what is considered redundant is dependent on the use case and context. We therefore introduce a core definition plus various qualifiers, providing a family of redundancy relations.

A formal specification is required because there are many aspects to a GO annotation and we must have confidence that these are all taken into consideration in a robust fashion.

Briefly: an annotation A1 is redundant with respect to another annotation A2 if both are about the same gene product (taking isoforms into account) and the biological description of A2 is at least as specific as that of A1. Here "biological description" means more than the GO ID in column 5 of the GAF, because column 16 must also be taken into account. In addition, the evidence type of A2 must be at least as specific as, or more reliable than, that of A1.

Comment (Emily): however, each statement refers to a particular evidence source (evidence + reference), so the two statements supply complementary information.

Background Definitions

A1, A2, ... are annotations

A.c<N> refers to the value of column N for that annotation in a GAF2.0 file (of course, the definitions are applicable to other formats, just translate to GAF2.0)

Gene Product

A.gp is the gene product. It is derived as follows:

 IF A.c17 = null 
  THEN 
    A.gp = A.c1+":"+A.c2
  ELSE
    A.gp = A.c17

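Note that we need to take c17 into account to avoid "false positive" redundancy calls: without it, annotations to different isoforms of the same gene would be treated as annotations to the same gene product.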
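The derivation of A.gp can be sketched in Python as follows. This is a minimal illustration, assuming an annotation is held as a dict keyed by 1-based GAF 2.0 column number and that a missing key, None or an empty string stands for null. The function name and the dict representation are hypothetical and not part of this specification.

```python
def gene_product(columns):
    """Return A.gp for an annotation given as {column_number: value}."""
    isoform = columns.get(17)
    if not isoform:
        # c17 is null: fall back to c1 + ":" + c2.
        return f"{columns[1]}:{columns[2]}"
    # c17 is set: it identifies the gene product directly.
    return isoform
```

For example, gene_product({1: "UniProtKB", 2: "P12345"}) gives "UniProtKB:P12345", whereas adding 17: "UniProtKB:P12345-2" makes the function return that isoform identifier instead (the accessions are made up).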

Class Expression

A.clsx is the GO class expression used in the annotation. Note that we must take account of c16 for redundancy checking. If there is no value in c16, then the class expression is simply the GO term. If c16 is specified, then the annotation is equivalent to an annotation to a more specific term, as refined by the relations in c16.

A.clsx is an OWL class expression derived as follows:

 IF A.c16 = null
   THEN
     A.clsx = A.c5
   ELSE
     A.clsx = IntersectionOf(A.c5, R1 some Y1, ..., Rn some Yn)

Here R1(Y1), ..., Rn(Yn) are the comma-separated relation(filler) entries of A.c16.

If c16 contains pipes then A is split into separate annotations first.

We also assume that for protein binding annotations, intra-species partners from A.c8 are added by default to c16.


See the OWL docs for c16 for more details
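The construction of A.clsx, including the pipe split, can be sketched as below. This is an illustration only: it assumes c16 entries have the form relation(ID), separated by commas within one alternative and by pipes between alternatives, and it represents a class expression as a hypothetical (term, set of (relation, filler) pairs) tuple rather than a real OWL object. The default addition of protein binding partners from c8 is not modelled.

```python
import re

# One c16 entry, e.g. "occurs_in(CL:0000057)".
_EXTENSION = re.compile(r"^\s*([^()\s]+)\(([^()]+)\)\s*$")


def class_expressions(c5, c16):
    """Return one class expression per pipe-separated alternative in c16.

    Each expression is (term, frozenset of (relation, filler) pairs) and
    stands for IntersectionOf(term, R1 some Y1, ..., Rn some Yn).
    """
    if not c16:
        # No extension: the class expression is just the GO term.
        return [(c5, frozenset())]
    expressions = []
    for alternative in c16.split("|"):
        restrictions = set()
        for entry in alternative.split(","):
            match = _EXTENSION.match(entry)
            if match is None:
                raise ValueError(f"malformed c16 entry: {entry!r}")
            restrictions.add((match.group(1), match.group(2).strip()))
        expressions.append((c5, frozenset(restrictions)))
    return expressions
```

Each element of the returned list corresponds to one of the separate annotations that A is split into.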

Negation and qualifiers

A.isNegative is true if A.c4 contains NOT; otherwise A.isPositive is true.

(An alternative is to wrap the value of A.clsx in an OWL complementOf operator, but we avoid more complex constructs for now)
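A small sketch of the polarity test, assuming that c4 holds pipe-separated qualifiers and that "contains NOT" means NOT appears as one of those qualifiers rather than as a substring. The function name is hypothetical.

```python
def is_negative(c4):
    """True if the qualifier column (c4) lists NOT among its qualifiers."""
    if not c4:
        return False
    # Compare whole qualifiers so that NOT is never matched inside
    # another qualifier name.
    return "NOT" in {qualifier.strip() for qualifier in c4.split("|")}
```

A.isPositive is then simply the negation of this value.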

Other qualifiers: TODO

Core Redundancy Definition

A1 is redundant with A2, IFF

 (
  A1.isPositive AND
  A2.isPositive AND
  A1.gp = A2.gp
   AND
  (A2.clsx SubClassOf A1.clsx OR 
   A2.clsx SubClassOf part_of some A1.clsx OR
   A2.clsx SubClassOf occurs_in some A1.clsx)
   AND
  (A2.evidence_type SubClassOf A1.evidence_type OR
   A2.evidence_type SubClassOf moreReliableThan some A1.evidence_type)
 )
 OR
 (
  A1.isNegative AND
  A2.isNegative AND
  A1.gp = A2.gp
   AND
  (A1.clsx SubClassOf A2.clsx OR 
   A1.clsx SubClassOf part_of some A2.clsx OR
   A1.clsx SubClassOf occurs_in some A2.clsx)
   AND
  (A2.evidence_type SubClassOf A1.evidence_type OR
   A2.evidence_type SubClassOf moreReliableThan some A1.evidence_type)
 )
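The definition above can be transcribed into Python roughly as follows. This is a sketch under stated assumptions, not a reference implementation: the Annotation class is hypothetical, A.clsx is simplified to a single class name, and the reasoning is delegated to an entails(sub, sup) callable supplied by the caller, where sup is either a class name or a (relation, class) pair standing for "relation some class". The toy oracle at the end knows only two hand-written axioms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    gp: str                   # A.gp
    clsx: str                 # A.clsx, simplified to one class name
    evidence_type: str        # A.evidence_type
    is_negative: bool = False


def is_redundant(a1, a2, entails):
    """True if a1 is redundant with a2 under the core definition."""
    if a1.gp != a2.gp or a1.is_negative != a2.is_negative:
        return False
    # Positive case: a2 must be the more specific annotation.
    # Negative case: the direction of the class test is reversed.
    if a1.is_negative:
        sub, sup = a1.clsx, a2.clsx
    else:
        sub, sup = a2.clsx, a1.clsx
    class_ok = (entails(sub, sup)
                or entails(sub, ("part_of", sup))
                or entails(sub, ("occurs_in", sup)))
    # The evidence test has the same direction in both cases.
    evidence_ok = (entails(a2.evidence_type, a1.evidence_type)
                   or entails(a2.evidence_type,
                              ("moreReliableThan", a1.evidence_type)))
    return class_ok and evidence_ok


# Toy oracle: two hand-written axioms plus reflexivity.
AXIOMS = {("mitochondrial translation", "translation"), ("IDA", "EXP")}


def toy_entails(sub, sup):
    return sub == sup or (sub, sup) in AXIOMS
```

With this oracle, a positive annotation to translation with EXP is redundant with a positive annotation of the same gene product to 'mitochondrial translation' with IDA, but not the other way round.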

Comment: the reference is not mentioned in this definition. Should it be included?

Notes:

SubClassOf is evaluated with respect to a background ontology, which by default is the full GO plus ECO. The following rules apply:

  • SubClassOf is reflexive. I.e. T SubClassOf T for all T.
  • A SubClassOf B, B SubClassOf C ==> A SubClassOf C
  • A SubClassOf R some B, B SubClassOf R some C, Transitive R ==> A SubClassOf R some C
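The three rules above can be illustrated with a very small hand-rolled closure computation. This is a sketch, not a substitute for an OWL reasoner: it assumes the background ontology is given as two hypothetical dicts, is_a (class to asserted superclasses) and some (a (class, relation) pair to asserted fillers), and it treats the relation passed in as transitive.

```python
def ancestors(cls, is_a):
    """All superclasses of cls, including cls itself (reflexive, transitive)."""
    seen, stack = {cls}, [cls]
    while stack:
        for parent in is_a.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def subclass_of(a, b, is_a):
    """a SubClassOf b."""
    return b in ancestors(a, is_a)


def subclass_of_some(a, rel, b, is_a, some):
    """a SubClassOf rel some b, with rel treated as transitive."""
    seen, frontier = set(), set(ancestors(a, is_a))
    while frontier:
        cls = frontier.pop()
        if cls in seen:
            continue
        seen.add(cls)
        for filler in some.get((cls, rel), ()):
            for sup in ancestors(filler, is_a):
                if sup == b:
                    return True
                # Transitivity: keep following rel from the filler.
                frontier.add(sup)
    return False
```

For instance, with is_a = {"A": ["B"], "B": ["C"]} and some = {("A", "part_of"): ["P"], ("P", "part_of"): ["Q"]}, the sketch derives A SubClassOf C and A SubClassOf part_of some Q.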

Note that as we include ECO, we get axioms such as

 IDA SubClassOf EXP
 IKR SubClassOf ISS

This gives the desired results: if two annotations are identical except for evidence type, with A1.evidence = EXP and A2.evidence = IDA, then A1 is redundant with A2.

Note we also make use of a moreReliableThan relation that can be used to place a partial order on ECO. Although this is not asserted in ECO, we assume one axiom:

 used_in some ECO:manual_assertion SubClassOf 
  moreReliableThan some used_in some ECO:automatic_assertion

That is, everything is more reliable than IEA, but beyond that there is no judgment call.
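The evidence part of the definition can be sketched as below. The ECO_IS_A table contains only the two axioms quoted above, the evidence codes are used as plain strings, and the single moreReliableThan axiom is reduced to "any code other than IEA is more reliable than IEA". All names are hypothetical.

```python
# Only the two ECO axioms quoted in the text; a real system would load ECO.
ECO_IS_A = {"IDA": "EXP", "IKR": "ISS"}


def evidence_ancestors(code):
    """The code itself plus every superclass reachable through ECO_IS_A."""
    result = {code}
    while code in ECO_IS_A:
        code = ECO_IS_A[code]
        result.add(code)
    return result


def more_reliable_than(e2, e1):
    """The single assumed axiom: manual evidence outranks IEA."""
    return e1 == "IEA" and e2 != "IEA"


def evidence_subsumed(a2_evidence, a1_evidence):
    """Evidence clause of the core definition, for 'A1 is redundant with A2'."""
    return (a1_evidence in evidence_ancestors(a2_evidence)
            or more_reliable_than(a2_evidence, a1_evidence))
```

Under these assumptions an EXP annotation can be made redundant by an IDA one and an IEA annotation by any manual one, whereas IDA and IMP remain incomparable.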

Proper Redundancy

Note that if A1 and A2 share the same gene product, term and evidence, they will be mutually redundant. We want a stronger notion of redundancy, so we introduce two new definitions:

A1 is *properly* redundant with A2 IFF
 A1 is redundant with A2 AND
 A2 is not redundant with A1
A1 is *mutually* redundant with A2 IFF
 A1 is redundant with A2 AND
 A2 is redundant with A1

It follows that every annotation is mutually redundant with itself, because SubClassOf is reflexive.
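Both derived relations are thin wrappers around whichever core redundancy test is in use. The sketch below takes that test as a parameter. The toy_redundant predicate is a made-up stand-in in which annotations are (gp, depth) pairs and a larger depth means a more specific term; it is included only so that the example runs.

```python
def properly_redundant(a1, a2, redundant):
    """a1 is redundant with a2, but not the other way round."""
    return redundant(a1, a2) and not redundant(a2, a1)


def mutually_redundant(a1, a2, redundant):
    """a1 and a2 are each redundant with the other."""
    return redundant(a1, a2) and redundant(a2, a1)


# Toy stand-in for the core definition (not the real test).
def toy_redundant(a1, a2):
    return a1[0] == a2[0] and a2[1] >= a1[1]
```

With shallow = ("gp1", 1) and deep = ("gp1", 3), shallow is properly redundant with deep, and each annotation is mutually (but not properly) redundant with itself.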

Variants of core definition

Redundancy with respect to basic GO

Note that the full GO includes the axiom:

 'mitochondrial translation' SubClassOf occurs_in some mitochondrion

This means that if we have two annotations to the same gp, one to 'mitochondrial translation' (mtTn) and the other to mitochondrion (mt), the annotation to mt is redundant with the annotation to mtTn.

However, we may not wish to filter this out at display time. Here we introduce the notion of redundancy with respect to a background ontology. The GO-basic ontology has inter-ontology links removed.

Example:

Given:

A1.clsx = 'mitochondrial translation'
A2.clsx = mitochondrion
A1.gp = A2.gp
A1.evidence=A2.evidence

The following is entailed:

A2 is redundant with A1 (implicit: background is whole GO)
A2 is NOT redundant with A1 with respect to GO-basic
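The effect of the background ontology can be shown by making it an explicit argument. In this sketch each ontology is a toy set of (subclass, relation, filler) axioms containing only what the example needs, and only direct axioms are consulted (no transitive reasoning). The names FULL_GO, GO_BASIC and class_subsumed are hypothetical.

```python
# Toy backgrounds: the full GO carries the inter-ontology link,
# GO-basic has it removed.
FULL_GO = {("mitochondrial translation", "occurs_in", "mitochondrion")}
GO_BASIC = set()


def class_subsumed(general, specific, background):
    """Class clause of the core definition, using direct axioms only.

    True if specific SubClassOf general, or specific SubClassOf
    part_of/occurs_in some general, in the given background ontology.
    """
    return (specific == general
            or (specific, "is_a", general) in background
            or (specific, "part_of", general) in background
            or (specific, "occurs_in", general) in background)
```

class_subsumed("mitochondrion", "mitochondrial translation", FULL_GO) holds, so the mitochondrion annotation is redundant, whereas the same call with GO_BASIC does not hold.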

Redundancy regardless of evidence

We introduce another qualifier "ignoring evidence" that has the effect of removing evidence from consideration during calculation.

Consider

A1.clsx = 'mitochondrial translation'
A2.clsx = translation
A1.gp = A2.gp
A1.evidence=IDA
A2.evidence=IMP

It follows that:

A2 is NOT redundant with A1 (implicit: considering evidence)
A2 IS redundant with A1, when evidence is ignored 
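The "ignoring evidence" qualifier amounts to a switch on the evidence clause. The sketch below is deliberately reduced: it covers positive annotations only, uses direct SubClassOf axioms from a toy table, and leaves out part_of, occurs_in and moreReliableThan. Annotations are hypothetical dicts with gp, clsx and evidence keys.

```python
# Toy axioms: just enough for the example in the text.
IS_A = {
    "mitochondrial translation": {"translation"},
    "IDA": {"EXP"},
}


def subclass_of(sub, sup):
    """Reflexive SubClassOf over direct axioms only."""
    return sub == sup or sup in IS_A.get(sub, ())


def redundant(a1, a2, ignore_evidence=False):
    """True if a1 is redundant with a2 (positive annotations only)."""
    same_gp = a1["gp"] == a2["gp"]
    class_ok = subclass_of(a2["clsx"], a1["clsx"])
    # With ignore_evidence set, the evidence clause is dropped entirely.
    evidence_ok = ignore_evidence or subclass_of(a2["evidence"], a1["evidence"])
    return same_gp and class_ok and evidence_ok
```

Applied to the example, redundant(A2, A1) is false because IDA is not a subclass of IMP, while redundant(A2, A1, ignore_evidence=True) is true.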

Redundancy using ECO-slims

We can introduce a qualifier, similar to "with respect to GO-basic", that first maps all evidence types to a slim.

E.g. we might have a slim with three types: IEA, ISS and EXP.
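Mapping to a slim could look like the sketch below, which walks up a parent table until it reaches a slim member. The PARENT table is a toy: IDA and IKR follow the axioms quoted earlier, whereas placing IMP under EXP is an assumption of this example. Leaving codes without a slim ancestor unchanged is likewise just one possible design choice.

```python
SLIM = {"IEA", "ISS", "EXP"}

# Toy parent table; IMP -> EXP is assumed for illustration.
PARENT = {"IDA": "EXP", "IMP": "EXP", "IKR": "ISS"}


def to_slim(code):
    """Map an evidence code to its nearest ancestor in SLIM."""
    original, seen = code, set()
    while code not in SLIM:
        if code in seen or code not in PARENT:
            # No slim ancestor found: leave the code unmapped.
            return original
        seen.add(code)
        code = PARENT[code]
    return code
```

After this mapping, IDA and IMP both become EXP, so two annotations that differ only in those evidence codes would compare as having identical evidence.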
