Annotation Redundancy Specification
Formal specification of redundancy in GO Annotations
This formal specification is to be used in implementing software systems that require processing based on the notion of redundancy. Notably, there is no single definition of redundancy - what is considered redundant is dependent on the use case and context. We therefore introduce a core definition plus various qualifiers, providing a family of redundancy relations.
A formal specification is required because there are many aspects to a GO annotation and we must have confidence that these are all taken into consideration in a robust fashion.
Briefly: one annotation A1 is redundant with respect to another annotation A2 if both are about the same gene product (taking isoforms into account) and the biological description of A2 is at least as specific as that of A1. Here "biological description" is more than just the GO ID in column 5 of the GAF: column 16 must also be taken into account. In addition, the evidence type of A2 must be the same as, more specific than, or more reliable than that of A1.
Comment (Emily): however, each statement refers to a particular evidence source (evidence code + reference), so the two annotations supply complementary information.
Background Definitions
A1, A2, ... are annotations
A.c<N> refers to the value of column N for that annotation in a GAF2.0 file (of course, the definitions are applicable to other formats, just translate to GAF2.0)
Gene Product
A.gp is the gene product. It is derived as follows:
IF A.c17 = null
THEN
A.gp = A.c1+":"+A.c2
ELSE
A.gp = A.c17
Note that c17 must be taken into account to avoid "false positive" redundancy calls.
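The derivation above can be sketched in Python as follows. This is a minimal illustration, not part of the specification: the function name `gene_product` and the representation of a GAF row as a list of column strings are assumptions, and the identifiers in the sample rows are made up.

```python
from typing import Optional, Sequence


def gene_product(cols: Sequence[Optional[str]]) -> str:
    """Return A.gp for one GAF 2.0 row.

    `cols` holds the column values in order, so cols[0] is c1 and
    cols[16] is c17. An empty string or None counts as null.
    """
    c17 = cols[16] if len(cols) > 16 else None
    if not c17:
        # No isoform given: fall back to c1 + ":" + c2.
        return f"{cols[0]}:{cols[1]}"
    return c17


# Illustrative rows with made-up identifiers (17 columns each).
plain_row = ["DB", "G1"] + [""] * 15
isoform_row = ["DB", "G1"] + [""] * 14 + ["DB:G1-2"]
```

With these rows, the two annotations get different values of A.gp, so they would not be compared for redundancy even though c1 and c2 agree.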
Class Expression
A.clsx is the GO class expression used in the annotation. Note that we must take account of c16 for redundancy checking. If there is no value in c16, then the class expression is simply the GO term. If c16 is specified, then the annotation is equivalent to an annotation to a more specific term, as refined by the relations in c16.
A.clsx is an OWL class expression derived as follows:
IF A.c16 = null
THEN
A.clsx = A.c5
ELSE
A.clsx = IntersectionOf(A.c5, R1 some Y1, ..., Rn some Yn)
where each Ri(Yi) is a relation and target pair listed in A.c16.
If c16 contains pipes then A is split into separate annotations first.
We also assume that for protein binding annotations, intra-species partners from A.c8 are added by default to c16.
See the OWL docs for c16 for more details
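As a rough sketch of the class expression derivation, the following Python splits c16 on pipes and builds one expression per resulting annotation. It assumes the usual GAF 2.0 extension syntax, `relation(ID)` items separated by commas within a pipe-delimited group. The names `class_expressions` and `render` and the tuple representation are hypothetical, the identifiers are placeholders, and the protein binding default from c8 is not handled.

```python
import re
from typing import FrozenSet, List, Tuple

# (GO term, set of (relation, filler) restrictions); hypothetical representation.
ClassExpr = Tuple[str, FrozenSet[Tuple[str, str]]]

_EXTENSION = re.compile(r"^\s*([\w\-]+)\(([^()]+)\)\s*$")


def class_expressions(c5: str, c16: str) -> List[ClassExpr]:
    """Return one class expression per pipe-separated group in c16."""
    if not c16:
        # No extension: the class expression is simply the GO term.
        return [(c5, frozenset())]
    result: List[ClassExpr] = []
    for group in c16.split("|"):
        restrictions = set()
        for item in group.split(","):
            match = _EXTENSION.match(item)
            if match is None:
                raise ValueError(f"cannot parse extension: {item!r}")
            restrictions.add((match.group(1), match.group(2)))
        result.append((c5, frozenset(restrictions)))
    return result


def render(expr: ClassExpr) -> str:
    """Write the expression in the IntersectionOf(...) notation used above."""
    term, restrictions = expr
    if not restrictions:
        return term
    parts = [term] + [f"{rel} some {y}" for rel, y in sorted(restrictions)]
    return "IntersectionOf(" + ", ".join(parts) + ")"


# Placeholder identifiers, for illustration only.
split = class_expressions("GO:0000001", "part_of(X:1),occurs_in(X:2)|part_of(X:3)")
```

Here `split` contains two expressions, one per pipe-separated group, which corresponds to splitting A into two annotations before any redundancy check.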
Negation and qualifiers
A.isNegative is true if A.c4 contains NOT, otherwise A.isPositive is true
(An alternative is to wrap the value of A.clsx in an OWL complementOf operator, but we avoid more complex constructs for now)
Other qualifiers: TODO
Core Redundancy Definition
A1 is redundant with A2, IFF
(
  A1.isPositive AND A2.isPositive AND
  A1.gp = A2.gp AND
  (A2.clsx SubClassOf A1.clsx OR
   A2.clsx SubClassOf part_of some A1.clsx OR
   A2.clsx SubClassOf occurs_in some A1.clsx) AND
  (A2.evidence_type SubClassOf A1.evidence_type OR
   A2.evidence_type SubClassOf moreReliableThan some A1.evidence_type)
)
OR
(
  A1.isNegative AND A2.isNegative AND
  A1.gp = A2.gp AND
  (A1.clsx SubClassOf A2.clsx OR
   A1.clsx SubClassOf part_of some A2.clsx OR
   A1.clsx SubClassOf occurs_in some A2.clsx) AND
  (A2.evidence_type SubClassOf A1.evidence_type OR
   A2.evidence_type SubClassOf moreReliableThan some A1.evidence_type)
)
Comment: the reference is not mentioned in this definition. Should it be?
Notes:
SubClassOf is evaluated with respect to a background ontology, which by default is the full GO plus ECO. It has the following properties:
- SubClassOf is reflexive. I.e. T SubClassOf T for all T.
- A SubClassOf B, B SubClassOf C ==> A SubClassOf C
- A SubClassOf R some B, B SubClassOf R some C, Transitive R ==> A SubClassOf R some C
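The three properties listed above can be illustrated with a small reachability check over a toy ontology. This is only a sketch under simplifying assumptions: classes are plain names, `is_a` and `some` are hypothetical dictionaries standing in for the background ontology, and the relation passed to `subclass_of_some` is always treated as transitive. A real implementation would delegate to an OWL reasoner.

```python
from typing import Dict, Set, Tuple

IsA = Dict[str, Set[str]]                 # class -> direct superclasses
Some = Dict[Tuple[str, str], Set[str]]    # (class, relation) -> fillers


def ancestors(cls: str, is_a: IsA) -> Set[str]:
    """Reflexive, transitive closure of SubClassOf starting at cls."""
    seen, stack = {cls}, [cls]
    while stack:
        for parent in is_a.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def subclass_of(a: str, b: str, is_a: IsA) -> bool:
    return b in ancestors(a, is_a)


def subclass_of_some(a: str, rel: str, b: str, is_a: IsA, some: Some) -> bool:
    """True if 'a SubClassOf rel some b' follows, with rel transitive."""
    reached: Set[str] = set()
    frontier = list(ancestors(a, is_a))
    while frontier:
        for filler in some.get((frontier.pop(), rel), ()):
            # Every superclass of a filler is also a valid filler.
            for up in ancestors(filler, is_a):
                if up not in reached:
                    reached.add(up)
                    frontier.append(up)  # follow rel again (transitivity)
    return b in reached


# Abstract toy ontology: A is_a B is_a C; C part_of D; D part_of E.
is_a: IsA = {"A": {"B"}, "B": {"C"}}
some: Some = {("C", "part_of"): {"D"}, ("D", "part_of"): {"E"}}
```

In this toy ontology, `subclass_of("A", "A", is_a)` holds by reflexivity, `subclass_of("A", "C", is_a)` by transitivity, and `subclass_of_some("A", "part_of", "E", is_a, some)` by chaining the two part_of links.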
Note that as we include ECO, we get axioms such as:
IDA SubClassOf EXP
IKR SubClassOf ISS
This gives the desired results: if two annotations are identical except for evidence type, with A1.evidence = EXP and A2.evidence = IDA, then A1 is redundant with A2.
Note we also make use of a moreReliableThan relation that can be used to place a partial order on ECO. Although this is not asserted in ECO, we assume one axiom:
used_in some ECO:manual_assertion SubClassOf moreReliableThan some used_in some ECO:automatic_assertion
i.e. everything is more reliable than IEA, but beyond that there is no judgment call.
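To make the core definition concrete, here is a minimal Python sketch of the redundancy test over a toy background ontology. Several things are assumptions made for illustration: the `Annotation` and `ToyOntology` names, the restriction of clsx to named classes (no c16 extensions), the treatment of every relation as transitive, and the encoding of the moreReliableThan axiom as a single edge from a stand-in class `MANUAL` to `IEA`.

```python
from dataclasses import dataclass
from typing import Dict, Set, Tuple


@dataclass(frozen=True)
class Annotation:
    gp: str
    clsx: str          # named class only in this sketch
    evidence: str
    negative: bool = False


class ToyOntology:
    def __init__(self, is_a: Dict[str, Set[str]],
                 some: Dict[Tuple[str, str], Set[str]]) -> None:
        self.is_a, self.some = is_a, some

    def ancestors(self, cls: str) -> Set[str]:
        seen, stack = {cls}, [cls]
        while stack:
            for parent in self.is_a.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    def sub(self, a: str, b: str) -> bool:
        return b in self.ancestors(a)

    def sub_some(self, a: str, rel: str, b: str) -> bool:
        reached: Set[str] = set()
        frontier = list(self.ancestors(a))
        while frontier:
            for filler in self.some.get((frontier.pop(), rel), ()):
                for up in self.ancestors(filler):
                    if up not in reached:
                        reached.add(up)
                        frontier.append(up)
        return b in reached


def more_specific(o: ToyOntology, x: str, y: str) -> bool:
    return (o.sub(x, y) or o.sub_some(x, "part_of", y)
            or o.sub_some(x, "occurs_in", y))


def redundant(a1: Annotation, a2: Annotation, o: ToyOntology) -> bool:
    """True if a1 is redundant with a2 under the core definition."""
    if a1.gp != a2.gp or a1.negative != a2.negative:
        return False
    # For negative annotations the direction of the class test is reversed.
    term_ok = (more_specific(o, a1.clsx, a2.clsx) if a1.negative
               else more_specific(o, a2.clsx, a1.clsx))
    evidence_ok = (o.sub(a2.evidence, a1.evidence)
                   or o.sub_some(a2.evidence, "moreReliableThan", a1.evidence))
    return term_ok and evidence_ok


# Toy fragment of GO plus ECO; MANUAL is a stand-in for manual assertion.
GO_ECO = ToyOntology(
    is_a={"mitochondrial translation": {"translation"},
          "IDA": {"EXP"}, "EXP": {"MANUAL"}},
    some={("mitochondrial translation", "occurs_in"): {"mitochondrion"},
          ("MANUAL", "moreReliableThan"): {"IEA"}},
)
general = Annotation("gp1", "translation", "EXP")
specific = Annotation("gp1", "mitochondrial translation", "IDA")
electronic = Annotation("gp1", "mitochondrion", "IEA")
```

Under these assumptions, `general` and `electronic` are each redundant with `specific`, while `specific` is redundant with neither.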
Proper Redundancy
Note that if A1 and A2 share the same gene product, term and evidence, they will be mutually redundant. We want a stronger notion of redundancy, so we introduce two new definitions:
A1 is *properly* redundant with A2 IFF A1 is redundant with A2 AND A2 is not redundant with A1
A1 is *mutually* redundant with A2 IFF A1 is redundant with A2 AND A2 is redundant with A1
It follows that every annotation is mutually redundant with itself, since SubClassOf is reflexive.
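The two derived notions can be written directly on top of any core redundancy predicate. In the sketch below the predicate is passed in as a function, and the filtering helper and the string-prefix stand-in used for the example are hypothetical, chosen only so that the snippet runs on its own.

```python
from typing import Callable, List, TypeVar

A = TypeVar("A")
Redundant = Callable[[A, A], bool]


def properly_redundant(a1: A, a2: A, redundant: Redundant) -> bool:
    return redundant(a1, a2) and not redundant(a2, a1)


def mutually_redundant(a1: A, a2: A, redundant: Redundant) -> bool:
    return redundant(a1, a2) and redundant(a2, a1)


def drop_properly_redundant(annotations: List[A], redundant: Redundant) -> List[A]:
    """Keep only annotations that are not properly redundant with another one."""
    return [a for a in annotations
            if not any(properly_redundant(a, b, redundant) for b in annotations)]


# Stand-in predicate: a1 is "redundant" with a2 if a1's path is a prefix of a2's.
def toy_redundant(a1: str, a2: str) -> bool:
    return a2.startswith(a1)


kept = drop_properly_redundant(["t", "t/mt", "x"], toy_redundant)
```

Filtering on proper redundancy rather than plain redundancy means that exact duplicates, which are only mutually redundant, are not both discarded.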
Variants of core definition
Redundancy with respect to basic GO
Note that the full GO includes the axiom:
'mitochondrial translation' SubClassOf occurs_in some mitochondrion
This means that if we have two annotations to the same gp, one to 'mitochondrial translation' (mtTn) and the other to mitochondrion (mt), the annotation to mt is redundant with the annotation to mtTn, provided the evidence condition also holds.
However, we may not wish to filter this out at display time. Here we introduce the notion of redundancy with respect to a background ontology. The GO-basic ontology has inter-ontology links removed.
Example:
Given:
A1.clsx = 'mitochondrial translation'
A2.clsx = mitochondrion
A1.gp = A2.gp
A1.evidence = A2.evidence
The following is entailed:
A2 is redundant with A1 (implicit: background is whole GO)
A2 is NOT redundant with A1 with respect to GO-basic
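The effect of the background ontology can be sketched by running the same class test against two toy ontologies that differ only in the inter-ontology link. The function `covers` and the dictionary layout are assumptions for illustration, and only one hop of links is followed, which is enough for this example.

```python
from typing import Dict, Set, Tuple

IsA = Dict[str, Set[str]]
Links = Dict[str, Set[Tuple[str, str]]]   # class -> {(relation, filler)}


def ancestors(cls: str, is_a: IsA) -> Set[str]:
    seen, stack = {cls}, [cls]
    while stack:
        for parent in is_a.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def covers(specific: str, general: str, is_a: IsA, links: Links) -> bool:
    """True if specific SubClassOf general, or SubClassOf R some general."""
    ups = ancestors(specific, is_a)
    if general in ups:
        return True
    for cls in ups:
        for _rel, filler in links.get(cls, ()):
            if general in ancestors(filler, is_a):
                return True
    return False


IS_A: IsA = {"mitochondrial translation": {"translation"}}
# Full GO keeps the occurs_in link; the GO-basic stand-in has none.
FULL_GO_LINKS: Links = {
    "mitochondrial translation": {("occurs_in", "mitochondrion")},
}
GO_BASIC_LINKS: Links = {}

with_full_go = covers("mitochondrial translation", "mitochondrion", IS_A, FULL_GO_LINKS)
with_go_basic = covers("mitochondrial translation", "mitochondrion", IS_A, GO_BASIC_LINKS)
```

The class test succeeds against the full GO stand-in and fails against the GO-basic stand-in, which mirrors the two entailments in the example.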
Redundancy regardless of evidence
We introduce another qualifier "ignoring evidence" that has the effect of removing evidence from consideration during calculation.
Consider
A1.clsx = 'mitochondrial translation'
A2.clsx = translation
A1.gp = A2.gp
A1.evidence = IDA
A2.evidence = IMP
It follows that:
A2 is NOT redundant with A1 (implicit: considering evidence)
A2 IS redundant with A1, when evidence is ignored
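One way to picture the "ignoring evidence" qualifier is as a flag that skips the evidence condition. The sketch below covers positive annotations and plain SubClassOf only. The dictionary representation, the `ignore_evidence` parameter name and the toy hierarchy are assumptions made for illustration.

```python
from typing import Dict, Set

IsA = Dict[str, Set[str]]


def ancestors(cls: str, is_a: IsA) -> Set[str]:
    seen, stack = {cls}, [cls]
    while stack:
        for parent in is_a.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def redundant(a1: Dict[str, str], a2: Dict[str, str], is_a: IsA,
              ignore_evidence: bool = False) -> bool:
    """True if a1 is redundant with a2 (positive annotations, is_a only)."""
    same_gp = a1["gp"] == a2["gp"]
    term_ok = a1["clsx"] in ancestors(a2["clsx"], is_a)
    evidence_ok = ignore_evidence or a1["evidence"] in ancestors(a2["evidence"], is_a)
    return same_gp and term_ok and evidence_ok


# Toy fragment of GO plus ECO.
IS_A: IsA = {
    "mitochondrial translation": {"translation"},
    "IDA": {"EXP"},
    "IMP": {"EXP"},
}
A1 = {"gp": "gp1", "clsx": "mitochondrial translation", "evidence": "IDA"}
A2 = {"gp": "gp1", "clsx": "translation", "evidence": "IMP"}

considering_evidence = redundant(A2, A1, IS_A)
ignoring_evidence = redundant(A2, A1, IS_A, ignore_evidence=True)
```

IDA is not a subclass of IMP, so the first call fails on the evidence condition alone, and dropping that condition makes A2 redundant with A1.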
Redundancy using ECO-slims
We can introduce a qualifier, similar to "with respect to GO-basic", that involves first mapping all evidence types to a slim.
E.g. we might have a slim with 3 types: IEA, ISS and EXP.
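A possible reading of the slim mapping is to replace each evidence type by its nearest ancestor that belongs to the slim before any redundancy check is run. The following sketch assumes that reading. The function `to_slim`, the toy ECO fragment and the choice to return None for unmapped codes are all hypothetical.

```python
from collections import deque
from typing import Dict, Optional, Set


def to_slim(code: str, is_a: Dict[str, Set[str]], slim: Set[str]) -> Optional[str]:
    """Return the nearest slim member at or above `code`, or None if there is none."""
    queue, seen = deque([code]), {code}
    while queue:
        current = queue.popleft()
        if current in slim:
            return current
        for parent in is_a.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return None


# Toy ECO fragment and the 3-type slim from the example.
ECO_IS_A: Dict[str, Set[str]] = {"IDA": {"EXP"}, "IMP": {"EXP"}, "IKR": {"ISS"}}
SLIM: Set[str] = {"IEA", "ISS", "EXP"}

mapped = {code: to_slim(code, ECO_IS_A, SLIM) for code in ["IDA", "IMP", "IKR", "IEA"]}
```

After this mapping IDA and IMP both become EXP, so two annotations that differ only in those evidence types would be treated as mutually redundant.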