Annotation set inference and validation (Archived)

From GO Wiki
Revision as of 06:38, 12 April 2019 by Pascale (talk | contribs)

This page formally specifies the quality control and validation procedures applied to an annotation file.

Definition of terms used in this document

An annotation file is a document representing a collection of annotations. The file MAY be in any format approved by the GOC, including GAF, GPAD, or, in future, OWL.

A rule is a procedure for determining if an individual annotation or set of annotations is valid in the context of the ontology or set of ontologies.

A rule identifier is a unique identifier for a rule. GO rule identifiers MUST be of the form GO_AR:nnnnnnn.
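The identifier format above can be checked mechanically. The following is a minimal sketch, assuming that each "n" in GO_AR:nnnnnnn stands for one decimal digit; the function name is hypothetical and not part of any GO tool.

```python
import re

# Assumption: "nnnnnnn" means exactly seven ASCII decimal digits.
RULE_ID_PATTERN = re.compile(r"GO_AR:[0-9]{7}")


def is_valid_rule_id(rule_id: str) -> bool:
    """Return True if rule_id has the form GO_AR:nnnnnnn."""
    # fullmatch rejects any leading or trailing characters.
    return RULE_ID_PATTERN.fullmatch(rule_id) is not None
```

For example, is_valid_rule_id("GO_AR:0000001") holds, while an identifier with fewer or more than seven digits is rejected.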

A rule definition describes how the rule is implemented. This MUST include a textual definition intended for humans, and SHOULD include one or more computable implementations, specified according to the GO Rule Schema. Each rule SHOULD have one or more contacts within the GOC.

A rule violation is an instance of an annotation or set of annotations not conforming to a rule.

Rules can be classified into hard and soft rules, with corresponding hard violations and soft violations. Hard violations MUST be filtered from an annotation set prior to release by the GOC. Soft violations SHOULD be reported back to the submitting group, and MAY be reported as violations on public displays. (TODO - add this to rule XML)
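The difference in handling between hard and soft violations can be sketched as follows. This is an illustration only, not the behaviour of any existing rule engine: the Rule and Severity types, the dictionary representation of an annotation, and the apply_rules function are all hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List, Tuple

# Assumption: an annotation is a flat mapping of field name to value.
Annotation = Dict[str, str]


class Severity(Enum):
    HARD = "hard"
    SOFT = "soft"


@dataclass(frozen=True)
class Rule:
    rule_id: str                                # e.g. "GO_AR:nnnnnnn"
    severity: Severity
    is_violated: Callable[[Annotation], bool]   # True if the annotation breaks the rule


Violation = Tuple[str, Annotation]


def apply_rules(
    annotations: List[Annotation], rules: List[Rule]
) -> Tuple[List[Annotation], List[Violation], List[Violation]]:
    """Return (kept annotations, hard violations, soft violations)."""
    kept: List[Annotation] = []
    hard: List[Violation] = []
    soft: List[Violation] = []
    for ann in annotations:
        violated = [r for r in rules if r.is_violated(ann)]
        hard_here = [r for r in violated if r.severity is Severity.HARD]
        hard.extend((r.rule_id, ann) for r in hard_here)
        soft.extend((r.rule_id, ann) for r in violated if r.severity is Severity.SOFT)
        # Hard violations are filtered out; soft violations are only reported.
        if not hard_here:
            kept.append(ann)
    return kept, hard, soft
```

In this sketch an annotation with only soft violations stays in the released set and appears in the soft report, which could then be sent back to the submitting group.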

A rule inference is a new fact generated by a rule operating over the annotations and the ontology. This is typically an inferred annotation, although in future ontology inferences are possible. A rule inference MAY also be a rule violation, but this need not be the case. It is only a rule violation if the rule states that the annotation should be materialized in the input file.

A rule engine is an implementation of the rules that can be executed in a variety of contexts. Rule engines can be partial or complete. A rule engine environment is the environment in which the rule engine executes. This could be web services, a script timed to run daily, etc.

GO Annotation Rules XML File

The GO rules file is an XML document representing every rule used in annotation validation for GO.

See Annotation_Quality_Control_Checks for a description of the rules file.

Classification of rules

Rules can be subdivided according to whether they are syntactic, structural/procedural or semantic. There may be some overlap between these categories.

Syntactic Checks

Syntax checks are specific to the file format used. Refer to the documentation for that format.

All syntax checks are currently bundled as rule GO_AR:0000001.

Structural and procedural Checks

These checks are independent of the file format used, and apply to any representation of an annotation set.

Structural checks, such as GO_AR:0000016 (IC annotations require a With/From GO ID), operate on the format-independent representation of the data; for example, they check the cardinality of particular fields based on values in other fields. Some structural checks may be subsumed by the annotation file syntax.
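As an illustration of a structural check, the sketch below tests the condition named by GO_AR:0000016 against a format-independent record. It is not the official implementation. The field names "evidence" and "with_from", the use of "|" and "," as value separators, and the GO:nnnnnnn identifier pattern are assumptions made for this example.

```python
import re
from typing import Dict

# Assumption: a GO term identifier is "GO:" followed by seven digits.
GO_ID_PATTERN = re.compile(r"GO:[0-9]{7}")


def ic_missing_go_with(annotation: Dict[str, str]) -> bool:
    """Return True if an IC annotation has no GO ID in its With/From field."""
    # The check only applies to annotations with evidence code IC.
    if annotation.get("evidence") != "IC":
        return False
    with_from = annotation.get("with_from", "")
    # Assumption: multiple With/From values are separated by "|" or ",".
    values = [v.strip() for v in re.split(r"[|,]", with_from) if v.strip()]
    return not any(GO_ID_PATTERN.fullmatch(v) for v in values)
```

Because the function works on a plain mapping rather than on GAF or GPAD text, the same check could be reused for any file format once the file has been parsed.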

Procedural checks are typically checks that must be implemented by special-purpose code (which includes regular expressions); for example, GO_AR:0000010.

Semantic Checks

These checks are independent of the file format used.

Semantic checks are driven by meaning explicitly represented in one or more ontologies. Semantic checks SHOULD be performed by running a general purpose reasoner, although in some cases it MAY temporarily be more efficient to implement these using special-purpose code. The reasoner must also take as input an OWL representation of a GAF.

An OWL reasoner will report certain classes as unsatisfiable if a constraint is violated. Formally this means that no instance of this class could exist according to the background ontologies.

An OWL reasoner can also provide new inferences.

Examples of semantic checks:

  • taxon constraints
  • Annotation deepening (not yet in rule file)
    • E.g. inferring "X binding" based on annotation to binding with X in c16, and a logical definition of "X binding" in GO
  • c16 relation domain and range constraints (not yet in rule file)
  • GAF inference i.e. F->P and C->P (not yet in rule file)
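The F->P inference in the list above can be illustrated with special-purpose code. This is only a sketch, assuming that F->P means propagating a function annotation to a process term that the ontology links from the function term; a reasoner-based implementation would derive those links from the ontology itself. The function name, the field names, and the placeholder term identifiers in the usage note are hypothetical, and qualifiers such as NOT are ignored.

```python
from typing import Dict, List, Set

# Assumption: an annotation is a mapping with at least "object_id" and "term".
Annotation = Dict[str, str]


def infer_process_annotations(
    annotations: List[Annotation],
    function_to_process: Dict[str, Set[str]],
) -> List[Annotation]:
    """Return new process annotations implied by existing function annotations.

    function_to_process maps a function term to the process terms it is
    linked to in the ontology (assumed to be precomputed elsewhere).
    """
    # Track what is already asserted so that nothing is inferred twice.
    seen = {(a["object_id"], a["term"]) for a in annotations}
    inferred: List[Annotation] = []
    for ann in annotations:
        # sorted() keeps the output order deterministic.
        for process_term in sorted(function_to_process.get(ann["term"], set())):
            key = (ann["object_id"], process_term)
            if key in seen:
                continue
            seen.add(key)
            inferred.append(
                {
                    "object_id": ann["object_id"],
                    "term": process_term,
                    "inferred_from": ann["term"],
                }
            )
    return inferred
```

With a link table such as {"F:1": {"P:1"}} and a single annotation of gene product "X:1" to "F:1", the function returns one inferred annotation of "X:1" to "P:1". If that process annotation is already present in the input, nothing is returned, which matches the idea that an inference is a new fact.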

Note that the rule file does not yet specify the ontology(ies) required as background to detect the violation. In future this will be done by specifying an OWL importer ontology. For example, the ontology required to perform taxon violation checks is http://purl.obolibrary.org/obo/go/extensions/x-taxon-importer.owl. See Ontology extensions for more details.

Currently we use HermiT as the OWL reasoner in semantic checks, which can be slow for some ontology extensions. We therefore currently implement the taxon rules using custom Java code. In future we expect to be able to do everything using Elk.

Rule Engines and environments

Implementations

filter-gene-associations

This is a partial implementation. It is currently executed as a cron job as part of the production pipeline.

Oort/OWLTools

Currently in beta, this is intended as a complete implementation, and will be the official GO rule engine.

For the API see owltools.gaf.rules in the OWLTools-Annotation package

Running on command line:

ADD DOCS

Running in a GUI:

Environments

Cron

The simplest means of running an engine is a cron job that runs at regular intervals. We currently use separate cron jobs to filter the GAFs, produce inferred annotations, and generate taxon violation reports. The plan is to integrate these into a more cohesive, user-friendly environment.

Web Services

We have implemented web services for GAF validation as part of the GOLD framework. These services are not currently running, as the plan is to supplant them with the Jenkins environment.

Jenkins

See Jenkins

We are piloting the rule engine in the GO Jenkins environment. This will most likely go live after we switch to SVN.