Compositional Term Submission Tool v1 Report
- 1 Background
- 2 Description
- 3 DISCUSSION
- 4 CONCLUSIONS
- 5 AVAILABILITY
- 6 TABLES
- 7 SCREENSHOTS
One of the common bottlenecks in annotation is the generation of new ontology classes. The standard workflow calls for the curator to make a request for a new class via some kind of tracking system. Ontology authors monitor the tracking system, and generate new classes on request. The curator then receives a message with the new class identifier, which can then be used in annotation. This is inefficient due to the lag between request and the generation of a new term. Curators can work more efficiently if classes are generated instantly on request.
Commonly used ontologies include the Gene Ontology (GO)[REF] and phenotype ontologies such as the Mammalian Phenotype (MP) ontology[REF]. These ontologies frequently make use of combinatorial classes that conform to standard patterns[REF]. We have exploited this feature to devise a new compositional class request system that uses logical reasoning.
On entering the system, the user is asked to select one of a series of pre-defined templates. For example, in GO one of the templates is called "morphogenesis" and is for generating classes such as "mesonephros morphogenesis". For the Human Phenotype Ontology (HPO), one of the templates is called "entity_quality" and is for generating classes such as "fragmentation of the epiphysis of the thumb".
On selecting a template, the user is then asked to fill in one or more slots with classes taken from the appropriate ontology. An AJAX auto-completion system is used to assist in the selection of the correct term. Some templates require additional information - for example, when generating catalytic activity classes or protein complex classes, the user has the option of selecting cardinality (stoichiometry).
The user then has the option of adding additional information, including a preferred label (name), definition, comments and definition database cross-reference. These are typically optional, as the system can auto-generate these. For some templates, certain pieces of information can be mandatory - for example, when generating a protein complex class it is mandatory to provide a reference. In some cases the user does not have the option of over-riding the defaults. For example, for the generation of regulation classes, the name is forced to conform to the GO naming convention.
After filling in this information, the user can submit the request. Submission runs in "dry run" mode unless the user explicitly selects the commit checkbox. The system first checks that the request conforms to the constraints encoded in the template; non-conformant requests are rejected.
The system will provisionally generate the ontology class according to the template. Names, synonyms and a textual definition will be generated using naming conventions encoded in the template (unless over-ridden by the user). A logical definition is generated, which is then used by a reasoner to calculate the correct placement of the class in the ontology graph. All of this is reported directly to the submitter.
The reasoner is capable of detecting equivalent classes. If an equivalent class is found, the annotator is informed that a class with the equivalent logical definition already exists, and can go ahead and use that class in their annotation. The system will also check to ensure there is no exact name match to an existing class.
If the class is valid and not yet created, and if the user elected to commit the request, it does not immediately go into the main ontology - instead it goes into a separate "xp submit" ontology. This ontology is publicly visible, but is typically only inspected by the main gatekeepers of the ontology. After inspection, the gatekeepers can run a script that brings the submitted classes into the main ontology. The gatekeeper can add extra information, or if they disagree with the request then they can choose to obsolete the class. Typically the gatekeeper does not have to do much here, as the system takes care of most of the details.
The curator receives a new GO identifier which they can then immediately use in annotation. This identifier will appear in the main ontology soon after. If the request is rejected and the new class goes into the ontology as obsolete, normal annotation lifecycle procedures can be used to fix the annotation.
The user then has the option of submitting another similar request.
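The request lifecycle described above (equivalence check, name check, dry run versus commit) can be sketched in Python as follows. This is an illustrative sketch only, not the system's code: the function `submit`, the `Outcome` record, the `EX:` identifiers and the tuple-shaped logical definitions are all hypothetical, and plain equality of logical definitions stands in for the reasoner's equivalence test.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Outcome:
    status: str                      # "exists", "rejected", "dry_run" or "committed"
    class_id: Optional[str] = None
    messages: List[str] = field(default_factory=list)


def submit(name, logical_def, names, definitions, staging, new_id, commit=False):
    """Process one request against existing labels and logical definitions.

    names and definitions map existing class ids to labels and definitions.
    Equality of definitions is a stand-in for reasoner-based equivalence.
    """
    for class_id, existing_def in definitions.items():
        if existing_def == logical_def:
            return Outcome("exists", class_id, ["an equivalent class already exists"])
    for class_id, existing_name in names.items():
        if existing_name == name:
            return Outcome("rejected", class_id, ["name already in use"])
    if not commit:
        return Outcome("dry_run", None, ["request is valid; nothing was saved"])
    # Committed classes go to a staging area, not to the main ontology.
    staging.append({"id": new_id, "name": name, "logical_def": logical_def})
    return Outcome("committed", new_id, ["class added to the staging ontology"])


# Example with made-up identifiers and definitions.
names = {"EX:1": "organ morphogenesis"}
definitions = {"EX:1": ("morphogenesis", "organ")}
staging = []
request = ("kidney morphogenesis", ("morphogenesis", "kidney"))
dry = submit(*request, names, definitions, staging, "EX:2")
done = submit(*request, names, definitions, staging, "EX:2", commit=True)
dup = submit("organ shaping", ("morphogenesis", "organ"), names, definitions, staging, "EX:3", commit=True)
```

In this sketch the dry run leaves `staging` untouched, the committed request appends one entry, and the request with an already-used logical definition is answered with the identifier of the existing class.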
SETH TO WRITE
- lucene + go-moose
Flexible template system
One of the requirements in building this system was that creating and modifying templates would be fast, efficient and configurable. In addition, the templates should be understandable by the ontology authors, which precludes encoding them in an imperative language such as Perl or Java.
We use a simplified version of Obol grammars[REF] to specify the templates. Each template has a collection of properties, listed in Table 1. The display and behavior of the system are driven entirely by the templates, rather than having to be explicitly programmed.
A simplified Obol grammar is used to specify how to generate names. This consists of a collection of tokens interspersed with commands for generation of names, synonyms or definitions. For example, the text definition template for development terms is:
['The process whose specific outcome is the progression of', refname(Structure),' over time, from its formation to the mature structure.', textdef(Structure)]
The entire template takes a variable called "Structure" (i.e. the anatomical entity). The token refname(Structure) is replaced by the name of the structure, prefixed by either "a" or "an". The final clause recapitulates the definition of the structure.
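The expansion of such a template can be sketched in Python. This is an illustration under stated assumptions rather than the Obol implementation: the token representation (plain strings and `(command, variable)` pairs), the `generate` function, the vowel-based article choice in `refname`, the whitespace normalisation and the `EX:1` class with its placeholder definition are all hypothetical.

```python
import re


def refname(label):
    """Prefix a label with 'a' or 'an' (naive first-letter test, an assumption)."""
    article = "an" if label[0].lower() in "aeiou" else "a"
    return article + " " + label


def generate(template, bindings, ontology):
    """Expand a token list into text using the classes bound to its variables."""
    parts = []
    for token in template:
        if isinstance(token, str):
            parts.append(token)          # literal text
            continue
        command, variable = token
        cls = ontology[bindings[variable]]
        if command == "refname":
            parts.append(refname(cls["name"]))
        elif command == "textdef":
            parts.append(cls["textdef"])
        else:
            raise ValueError("unknown command: " + command)
    # Join tokens with spaces and collapse any doubled whitespace.
    return re.sub(r"\s+", " ", " ".join(parts)).strip()


# Placeholder class; the identifier and definition text are made up.
ontology = {"EX:1": {"name": "ear", "textdef": "A placeholder definition."}}
development_textdef = [
    "The process whose specific outcome is the progression of",
    ("refname", "Structure"),
    " over time, from its formation to the mature structure.",
    ("textdef", "Structure"),
]
text = generate(development_textdef, {"Structure": "EX:1"}, ontology)
```

Keeping the template as data rather than code is what lets ontology authors edit it without programming; the same token list could in principle drive parsing as well, which this sketch does not attempt.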
Note that like Obol grammars, these templates can be used for parsing as well as generation.
One of the main requirements of the system was for all newly generated classes to be automatically placed in the ontology. It is important for the submitter to be able to see this placement, in order to confirm that no mistakes were made. The submitter also needs to receive immediate warning if an equivalent class already exists.
These tasks can all be done by standard automated reasoners, so long as logical definitions are supplied in the ontology.
We evaluated several reasoners, including OWL reasoners such as Pellet, FaCT++ and HermiT, as well as the OBO-Edit reasoner. We found in all cases that reasoning was either too slow or did not complete the reasoning task at all.
In order to overcome this obstacle we implemented our own simple reasoner. This reasoner is not as comprehensive as existing OWL reasoners, but is sufficient for the subset of OWL used by many existing ontologies such as GO. The only OWL2 constructs used by the reasoner are: EquivalentTo (=), SubClassOf (<), SubObjectPropertyOf, intersectionOf, someValuesFrom, TransitiveProperty and PropertyChain.
X < X
X < Y if EquivalentTo(X DX) and DX < Y
X < Y if EquivalentTo(Y DY) and X < DY
X < intersectionOf(Y1...Yn) if X < Y1 and ... X < Yn
intersectionOf(X1...Xn) < Y if X in X1...Xn and X < Y
someValuesFrom(PX X) < someValuesFrom(PY Y) if PX < PY and X < Y
someValuesFrom(P X) < someValuesFrom(P Y) if Transitive(P) and someValuesFrom(P X) < someValuesFrom(PY Z) and someValuesFrom(P Z) < someValuesFrom(P Y)
someValuesFrom(PX X) < someValuesFrom(PY Y) if PY < PropertyChain(PX PZ) and someValuesFrom(PX X) < someValuesFrom(PY Z) and someValuesFrom(PZ Z) < someValuesFrom(PY Y)
These rules are implemented using a backward-chaining rule engine.
The performance is generally robust with respect to the size of the input ontologies, because axioms that are not relevant to the classification of the input submission term are never used.
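A goal-directed subsumption test in the spirit of these rules can be sketched in Python. This is not the authors' rule engine: the `Ontology` container, the tuple encoding of expressions and all class and property names are hypothetical, property chains are omitted, transitivity is handled by the simplified rule "someValuesFrom(P X) < someValuesFrom(Q Y) if P < Q, Transitive(Q) and X < someValuesFrom(Q Y)", and the sketch assumes the asserted hierarchy and definitions are acyclic.

```python
class Ontology:
    """Minimal in-memory axiom store (a hypothetical structure for this sketch)."""

    def __init__(self):
        self.parents = {}        # named class -> set of asserted superclass expressions
        self.definitions = {}    # named class -> equivalent class expression
        self.super_props = {}    # property -> set of direct super-properties
        self.transitive = set()  # transitive properties


def And(*operands):
    return ("and", tuple(operands))


def Some(prop, filler):
    return ("some", prop, filler)


def _is(expr, kind):
    return isinstance(expr, tuple) and expr[0] == kind


def is_subproperty(ont, p, q):
    return p == q or any(is_subproperty(ont, s, q) for s in ont.super_props.get(p, ()))


def is_subclass(ont, x, y):
    """Test the goal x < y, visiting only axioms reachable from that goal."""
    if x == y:
        return True
    if _is(y, "and"):                      # x must be under every conjunct
        return all(is_subclass(ont, x, c) for c in y[1])
    if _is(x, "and") and any(is_subclass(ont, c, y) for c in x[1]):
        return True
    if _is(x, "some") and _is(y, "some") and is_subproperty(ont, x[1], y[1]):
        if is_subclass(ont, x[2], y[2]):
            return True
        if y[1] in ont.transitive and is_subclass(ont, x[2], y):
            return True
    if isinstance(x, str):                 # named class: use its axioms
        definition = ont.definitions.get(x)
        if definition is not None and is_subclass(ont, definition, y):
            return True
        if any(is_subclass(ont, p, y) for p in ont.parents.get(x, ())):
            return True
    if isinstance(y, str) and y in ont.definitions:
        return is_subclass(ont, x, ont.definitions[y])
    return False


def is_equivalent(ont, x, y):
    return is_subclass(ont, x, y) and is_subclass(ont, y, x)


# Illustrative axioms; names are examples only.
ont = Ontology()
ont.parents["kidney"] = {"organ"}
ont.parents["nephron"] = {Some("part_of", "kidney")}
ont.transitive.add("part_of")
ont.definitions["organ morphogenesis"] = And(
    "morphogenesis", Some("results_in_morphogenesis_of", "organ"))
new_class = And("morphogenesis", Some("results_in_morphogenesis_of", "kidney"))
placed = is_subclass(ont, new_class, "organ morphogenesis")
```

Because the search starts from the submitted definition and follows only the axioms it touches, unrelated parts of a large ontology are never examined, which matches the performance observation above.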
Most existing ontologies use version control systems such as CVS or SVN. The submission system is intended to work alongside these: all newly generated classes are placed in a version-controlled file alongside the main ontology file (in GO, this goes in a directory called xp_submit).
The system appends to three files:
- An OBO format file containing the newly submitted class, together with full axioms for the class, including the reasoner-calculated superclasses (is_a parents)
- A file of new subclass links. This is necessary when a new class is inferred to be "sandwiched" between two classes that previously had a direct subclass link.
- A file of subclass links to be deleted. When a new "sandwich" class is created, the previous link becomes redundant. Although redundant links are essentially harmless, they can confuse users, and it is good policy to remove them.
These files are visible through the normal mechanisms used by the version control system. This means that a "bleeding edge" version of the ontology can be viewed by dynamically combining the three files above with the main ontology. However, this is typically not required, as the gatekeeper can swiftly deal with new requests.
The gatekeeper can choose to edit the submission file, but this should not be necessary in the majority of cases. Usually it is sufficient to quickly inspect the files and to run a merge script to pull in the new information from the three files above (after this happens, the files are reset). If desired, even this one minimal manual step can be automated (for example, for experienced submitters it may be desirable to directly bring in the new submission).
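The bookkeeping for a "sandwiched" class can be sketched in Python. The function name `sandwich_updates`, the representation of links as `(child, parent)` pairs and the example class names are assumptions made for illustration; the actual files and merge script are not shown in this report.

```python
def sandwich_updates(direct_links, new_class, parents, children):
    """Return (links to add, links to delete) for a newly placed class.

    direct_links: set of existing (child, parent) is_a links.
    parents:      direct superclasses inferred for new_class.
    children:     existing classes inferred to be direct subclasses of new_class.
    """
    # Existing children gain a link to the new class.
    new_links = {(child, new_class) for child in children}
    # A direct child -> parent link is redundant once new_class sits between them.
    redundant = {
        (child, parent)
        for child in children
        for parent in parents
        if (child, parent) in direct_links
    }
    return new_links, redundant


# Illustrative input: one existing direct link that the new class splits.
direct_links = {("kidney morphogenesis", "morphogenesis")}
new_links, to_delete = sandwich_updates(
    direct_links,
    "organ morphogenesis",
    parents={"morphogenesis"},
    children={"kidney morphogenesis"},
)
```

The links from the new class to its own parents are not returned here, since in the layout described above they travel with the class itself in the first file.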
Text definitions: Rabbit. Robert Stevens' system.
One of the current limitations of the existing system is that all ontologies must be in OBO format, and logical definitions must be expressible in the same format. In theory this need not be a problem for OWL ontologies that use a restricted set of OWL constructs, but in practice the need to convert files imposes an additional administrative burden.
It should be relatively easy to convert the system to use OWL ontologies rather than OBO ones, and we may do this in future, depending on which ontologies use the system.
The simplified reasoning strategy may be problematic for some ontologies. For example, the cell ontology uses logical definitions that require additional constructs including negation that pose problems for our backward-chaining reasoning strategy. We expect that before long we will be able to use standard OWL reasoners within our system. For example, the latest version of the Pellet reasoner has the ability to do incremental reasoning with caching of results, which eliminates some of the wait time currently associated with OWL reasoning. In addition, segmentation strategies such as MIREOT[REF] can be used to extract a tractable subset of an ontology.
The system is designed specifically for immediate granting of requests that follow some compositional template. In principle there is nothing preventing the extension of the system to more free-form class generation. The submitter would have to manually specify all necessary information, rather than have this auto-generated according to a template. In practice there is less of a need for this within the GO, as curators can use an ordinary term request system such as SourceForge and enter the terms directly using OBO-Edit.
The class request bottleneck is a frequent cause of curator inefficiency. In addition, the manual construction and placement of compositional ontology classes is time-consuming and error-prone. We have developed a system that simultaneously deals with both of these issues.
ontology - the home for the newly generated class
description - textual summary of what the template is for
externals - external ontologies required to define the class
arguments - a list of arguments that must be supplied to the template
logical definition - a template for the generation of the logical definition.
name - a template for generation of the name (preferred label)
synonym - a template for generation of synonyms
textdef - a template for generation of the textual definition
wraps - some templates can optionally wrap other templates.
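The properties listed above could be held in a simple record type. The following Python sketch is illustrative only: the `Template` class, its field types, the `missing_arguments` helper and every example value other than the template's purpose are assumptions, and real templates are written as simplified Obol grammars rather than Python objects.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Template:
    ontology: str                 # home for the newly generated class
    description: str              # textual summary of what the template is for
    arguments: List[str]          # arguments that must be supplied
    logical_definition: str       # template for the logical definition
    name: list                    # template for the preferred label
    textdef: list                 # template for the textual definition
    synonym: list = field(default_factory=list)
    externals: List[str] = field(default_factory=list)
    wraps: Optional[str] = None   # name of a wrapped template, if any

    def missing_arguments(self, supplied: Dict[str, str]) -> List[str]:
        """List required arguments that the request left empty."""
        return [arg for arg in self.arguments if not supplied.get(arg)]


# Placeholder values; only the description reflects the text of this report.
morphogenesis = Template(
    ontology="GO",
    description="Generates classes such as 'mesonephros morphogenesis'",
    arguments=["Structure"],
    logical_definition="<placeholder logical definition>",
    name=[("name", "Structure"), "morphogenesis"],
    textdef=["<placeholder text definition>"],
)
problems = morphogenesis.missing_arguments({})
```

A request with a non-empty result from `missing_arguments` would be one kind of non-conformant request that is rejected before any class is generated.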