Proposal for integral to qualifier (Archived)
GO annotations indicate that a gene product
- is part of a cell component
- executes a molecular function
- is an active participant in a biological process.
Currently, annotations are weak in that they indicate only that there is some context in which the gene product is observed to do these things. Adding a new annotation qualifier, integral_to, will allow annotators to make stronger annotations, indicating that a gene product is required by an organism to carry out some process or to constitute a cell component. This will in turn allow additional inferences, improving the comprehensiveness of query results, term enrichment analyses, and cross-species annotation propagation.
Formally, this can be thought of as asserting a has_part or has_participant relationship from the GO term to the gene product. This relationship composes transitively with the existing has_part relationships in GO, so that those relationships can be used for inference over annotations:
- NEF3 complex has_part core TFIIH complex (asserted in GO)
- yeast TFB1 integral to core TFIIH complex (asserted GO annotation)
- yeast TFB1 integral to NEF3 complex (inferred GO annotation)
Information is currently being lost in annotations because an important distinction is not being made. We currently have no way of indicating that a specific organism requires a gene product in order to carry out some process, or that a specific complex always contains a certain gene product. This means we are losing the ability to use the has_part relation to make additional inferences.
One side effect of the current lack of inference is confusion about when to exclude certain annotations. For example, at the 2010 GO annotation camp there was a discussion about worm rpb-2, and whether to exclude IMP annotations to developmental terms. There was a reluctance to remove these IMP annotations, because they were the only experimental data in worm (this is from my memory, TODO: check). If rpb-2 were annotated as being integral to transcription, and the ontology stated that all development requires transcription, then we would see that these are just another case of redundant annotations, as we could infer that rpb-2 is involved in ALL development.
For additional information, see the slides for has_part in GO. They focus on a cell component (CC) example, but the solution presented is equally applicable to biological process (BP).
We would allow an additional qualifier integral_to. This could be mixed with existing qualifiers and the NOT modifier.
The formal meaning of the qualifier is specified below. Informally, the use of this qualifier with a gene product G in a species S means:
- for a CC annotation: every instance of the annotated component in S has a G as part
- Example: every core TFIIH complex has_part some TFB1 (in yeast)
- for a BP annotation: every instance of the annotated process in S requires G, otherwise the process cannot be carried out
- for a MF annotation: every instance of the annotated molecular function in S is catalyzed or otherwise executed by a G
I believe the following annotations could be made (TODO: to be checked by an annotator). Currently these are normal annotations. These could be "promoted" to integral_to annotations.
- rpb-2 in C. elegans is integral_to transcription
- TFB1 in S. cerevisiae is integral_to every core TFIIH complex (GO:0000439)
- MSH2 meiosis example (Pascale/Paul to fill in)
TODO: more compelling example https://sourceforge.net/tracker/?func=detail&atid=440764&aid=3047074&group_id=36855 cell cycle and DNA replication
Annotation propagation behavior (informal description)
Propagation DOWN the is_a hierarchy
Typically, annotations "propagate up" the hierarchy. integral_to annotations propagate down. For example:
- If rpb-2 is integral to transcription in C. elegans, then it is also integral to DNA transcription.
Note that an integral_to annotation also implies a normal annotation. So the full set of inferences is:
- integral to transcription, and integral to all is_a descendants
- sometimes active in transcription, and therefore sometimes active in all is_a ancestors of transcription
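The two directions of propagation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the is_a edges are a toy graph (the parent term "cellular process" is a placeholder, not checked against a GO release), and the helper names `ancestors`, `descendants`, and `propagate_integral_to` are invented for this example.

```python
from collections import defaultdict

# Toy is_a edges (child -> parents). Illustrative only, not taken from a GO release.
IS_A = {
    "DNA transcription": ["transcription"],
    "transcription": ["cellular process"],
}


def ancestors(term, is_a):
    """All is_a ancestors of `term`, excluding `term` itself."""
    seen = set()
    stack = list(is_a.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(is_a.get(current, []))
    return seen


def descendants(term, is_a):
    """All is_a descendants of `term`, excluding `term` itself."""
    # Invert the child -> parents map into parent -> children.
    children = defaultdict(list)
    for child, parents in is_a.items():
        for parent in parents:
            children[parent].append(child)
    seen = set()
    stack = list(children[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(children[current])
    return seen


def propagate_integral_to(term, is_a):
    """Terms implied by one integral_to annotation on `term`.

    Returns (integral, normal): integral_to holds for the term and all of its
    is_a descendants; the implied normal annotation holds for the term and all
    of its is_a ancestors.
    """
    integral = {term} | descendants(term, is_a)
    normal = {term} | ancestors(term, is_a)
    return integral, normal


integral, normal = propagate_integral_to("transcription", IS_A)
print(sorted(integral))
print(sorted(normal))
```

The asymmetry is the point of the sketch: the stronger claim travels to more specific terms, while the weaker "sometimes active in" claim travels to more general ones.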
Propagation over has_part
Protein Complex example:
- If TFB1 is integral to core TFIIH complex (GO:0000439), then it is integral to NEF3 complex and also integral to holo TFIIH complex. This is based on there being two relationships in the ontology:
- [every] NEF3 complex has_part [some] core TFIIH complex
- [every] holo TFIIH complex has_part [some] core TFIIH complex
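As a rough sketch, the protein complex inference above amounts to walking has_part links from the part to every whole that contains it. The two has_part links are the ones listed above; the function names `wholes_containing` and `infer_integral_to` are hypothetical and exist only for this example.

```python
# has_part links asserted in the ontology, as listed above (whole -> parts).
HAS_PART = {
    "NEF3 complex": ["core TFIIH complex"],
    "holo TFIIH complex": ["core TFIIH complex"],
}


def wholes_containing(part, has_part):
    """Every term that has `part` as a part, directly or via a chain of has_part links."""
    found = set()
    frontier = [part]
    while frontier:
        current = frontier.pop()
        for whole, parts in has_part.items():
            if current in parts and whole not in found:
                found.add(whole)
                frontier.append(whole)
    return found


def infer_integral_to(gene_product, term, has_part):
    """The asserted integral_to annotation plus those inferred over has_part."""
    terms = {term} | wholes_containing(term, has_part)
    return {(gene_product, t) for t in terms}


inferred = infer_integral_to("TFB1", "core TFIIH complex", HAS_PART)
for gene_product, term in sorted(inferred):
    print(gene_product, "integral_to", term)
```

Because every NEF3 complex contains some core TFIIH complex, and every core TFIIH complex contains TFB1, the inferred annotations follow without any further curation.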
For the biological process case (rpb-2 and development, discussed earlier), we assume that the ontology contains the links
- [every] developmental process has_part [some] gene expression
- [every] gene expression has_part [some] transcription
TODO: check with developmental biologist, but this seems uncontroversial.
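To illustrate how the assumed has_part links would expose redundant annotations, the sketch below combines the two propagation rules (up over has_part, down over is_a) and flags existing annotations that the integral_to annotation already implies. Everything beyond the two has_part links is an assumption made for the example: the is_a link for "larval development", the two IMP annotations, and the function names are hypothetical, not real curated data.

```python
# has_part links assumed in the text (whole -> parts).
HAS_PART = {
    "developmental process": ["gene expression"],
    "gene expression": ["transcription"],
}

# Hypothetical is_a link (child -> parents), for illustration only.
IS_A = {"larval development": ["developmental process"]}


def integral_closure(term, has_part, is_a):
    """Terms to which an integral_to annotation on `term` extends.

    Applies two rules until nothing new is found:
    up over has_part (part -> whole) and down over is_a (parent -> child).
    """
    result = {term}
    frontier = [term]
    while frontier:
        current = frontier.pop()
        wholes = [w for w, parts in has_part.items() if current in parts]
        children = [c for c, parents in is_a.items() if current in parents]
        for nxt in wholes + children:
            if nxt not in result:
                result.add(nxt)
                frontier.append(nxt)
    return result


def redundant(annotations, integral_annotation, has_part, is_a):
    """Annotations already implied by a single integral_to annotation."""
    gene, term = integral_annotation
    implied = integral_closure(term, has_part, is_a)
    return [(g, t) for g, t in annotations if g == gene and t in implied]


# Made-up IMP annotations, used only to exercise the function.
imp_annotations = [("rpb-2", "larval development"), ("rpb-2", "locomotion")]
flagged = redundant(imp_annotations, ("rpb-2", "transcription"), HAS_PART, IS_A)
print(flagged)
```

Under these assumptions only the developmental annotation is flagged, since it is reachable from transcription through the has_part chain; the other annotation carries information that the integral_to annotation does not.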
TODO: when the annotations and ontologies are expressed in OWL, the correct semantics come for free. Maybe move this into a separate page...
Where do these annotations come from?
Obviously it would be a lot of work to go back and strengthen existing annotations. Some of the stronger annotations could come from "common biological knowledge".
Another source is pathway databases. When a gene product is assigned to a step in Reactome, the assignment can be treated as an integral_to qualified annotation. Care must be taken when mapping from the Reactome ID to the GO term, because the Reactome process may well be more specific than the GO term.