Ontology Quality Control

This page documents the quality control measures used to flag (or, ideally, avoid) various types of errors and inconsistencies in GO.

OBO-Edit verification system

(NOTE: need links to O-E documentation, but the "Ontology Verification" section is missing from the online version at http://www.oboedit.org/docs/index.html)

  • built-in checks - these are run manually or during editing
    • text checks on comments, definitions, names and synonyms
    • name redundancy check
    • dbxref check
  • custom checks - these are run upon every save in OBO-Edit
    • namespace: check that each term uses MF, BP or CC (not the default, gene_ontology) as its namespace
Current filter: ('self' 'namespace' 'equals' 'gene_ontology') AND (NOT 'self' 'is_property'). The criteria may have to be updated later, when we load other ontologies for cross-products.
  • is_a complete: check that every term has an all-is_a path to the root
NOT 'self' 'is is_a complete'
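The two custom checks above can also be approximated outside OBO-Edit. The following is a minimal sketch, not the OBO-Edit implementation: it assumes the ontology is available as OBO text, that obsolete terms carry `is_obsolete: true`, and that the caller supplies the root term IDs. The function names and the tiny parser are hypothetical and only handle simple `tag: value` lines.

```python
from collections import defaultdict

# The three namespaces a GO term is expected to use (anything else is flagged).
GO_NAMESPACES = {"molecular_function", "biological_process", "cellular_component"}


def parse_terms(obo_text):
    """Return {term_id: {tag: [values]}} for every [Term] stanza."""
    terms = {}
    current = None
    for raw in obo_text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # Only [Term] stanzas are collected; [Typedef] etc. are skipped.
            current = defaultdict(list) if line == "[Term]" else None
            continue
        if current is None or ": " not in line:
            continue
        tag, value = line.split(": ", 1)
        value = value.split("!")[0].strip()  # drop trailing "! comment"
        current[tag].append(value)
        if tag == "id":
            terms[value] = current
    return terms


def is_obsolete(tags):
    return "true" in tags.get("is_obsolete", [])


def terms_with_default_namespace(terms):
    """Terms whose namespace is not MF, BP or CC (e.g. still gene_ontology)."""
    return sorted(
        term_id
        for term_id, tags in terms.items()
        if not is_obsolete(tags)
        and not set(tags.get("namespace", [])) & GO_NAMESPACES
    )


def is_a_incomplete(terms, roots):
    """Terms with no path to any root that uses is_a links only."""
    children = defaultdict(list)
    for term_id, tags in terms.items():
        for parent in tags.get("is_a", []):
            children[parent].append(term_id)
    # Walk down from the roots along inverted is_a links.
    reached = set()
    stack = [root for root in roots if root in terms]
    while stack:
        term_id = stack.pop()
        if term_id in reached:
            continue
        reached.add(term_id)
        stack.extend(children[term_id])
    return sorted(
        term_id
        for term_id, tags in terms.items()
        if term_id not in reached and not is_obsolete(tags)
    )
```

Walking down from the roots marks every is_a complete term in a single pass, so whatever remains unmarked is the report.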

External reasoner-based checks

These checks are run periodically, outside the normal edit cycle.

Regulation related reports

  • missing link report

See ftp://ftp.geneontology.org/pub/go/scratch/regulates_xp_live/

Updated daily. Cross-products (intersection_of definitions) are generated by the OBO-Edit semantic parser. These are then used to create the file ftp://ftp.geneontology.org/pub/go/scratch/regulates_xp_live/go_reglive_withPosNeg_noPF-newlinks.txt, which contains missing links (usually regulation-based).
  • internal inconsistency report

See http://cvsweb.geneontology.org/cgi-bin/cvsweb.cgi/go/scratch/regulation-unimplied-report.txt

This report lists cases where the regulates hierarchy doesn't parallel the process hierarchy. What needs to be checked: Follow the path of the regulates term up the regulates hierarchy and follow the path of the regulated process up the process hierarchy. Make sure that the parentage is consistent between these two. If it is not, examine both paths to see which one is wrong. We've found examples of both cases.
  • semantic analysis report

See: http://cvsweb.geneontology.org/cgi-bin/cvsweb.cgi/go/scratch/regulation-report.txt

There are four categories of terms to check. Any terms marked as OK can be ignored:
  • UNEXPECTED: Terms that are children of regulation of biological quality (ROBQ) for which the BQ is not expected to be found in GO but the term really does exist in GO. What needs to be checked: Should the regulation terms really be is_a children of ROBQ or should they be moved/added as children of regulation of biological process (ROBP)?
  • HIERARCHY: Terms that have both regulation of molecular function (ROMF) and ROBP as is_a parents. What needs to be checked:
    • Are both is_a parents appropriate? If yes, retain the dual parentage; if not, remove one or the other. This will also detect cases where there is just one is_a parent to the regulation upper term but it is the wrong kind (i.e. a ROMF sub-term that only traces up to ROBP).
  • MISSING_LINK: Regulation terms that should be children of biological regulation via a direct is_a path but are not. What needs to be checked:
    • All regulation terms need to have an is_a path up to biological regulation.
  • NP (a.k.a. No Parse): Terms that the reasoner cannot decompose into a logical definition. This category includes terms that David and Tanya have already looked at and decided were OK subtypes of a regulates parent, but where the regulated process was not worth including in the process ontology. It also includes terms that cannot be parsed because of their structure, for example, 'renal water retention'. What needs to be checked: Check for univocity problems where the process X term is named differently from the regulation of process X term, and fix them.
  • multiple part_of parentage report
Right now, this report is not in CVS but should be. It reports terms that have multiple part_of parents. What needs to be checked: Is the multiple parentage legitimate or not? Often this points out has_part relationships rather than part_of relationships. In most cases, these can be resolved by creating specific children of the original term, each of which is always a part of one of the original part_of parents.
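Since the multiple part_of parentage report is not yet in CVS, here is a minimal sketch of how such a report could be produced. It is an assumption about the approach, not the script actually used: it reads OBO text, looks only at asserted `relationship: part_of` lines in [Term] stanzas (no reasoning over inherited links), and the function name is hypothetical.

```python
import re
from collections import defaultdict

# Matches e.g. "relationship: part_of GO:0000001 ! some name"
PART_OF = re.compile(r"^relationship:\s+part_of\s+(\S+)")


def multiple_part_of_parents(obo_text):
    """Return {term_id: [part_of parents]} for terms with more than one."""
    parents = defaultdict(set)
    in_term, term_id = False, None
    for raw in obo_text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are inspected.
            in_term, term_id = (line == "[Term]"), None
        elif in_term and line.startswith("id: "):
            term_id = line[4:].strip()
        elif in_term and term_id:
            match = PART_OF.match(line)
            if match:
                parents[term_id].add(match.group(1))
    return {t: sorted(p) for t, p in parents.items() if len(p) > 1}
```

Each term in the result still needs the manual review described above; the sketch only narrows down where to look.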

Other XP Based reports

See Category:Cross Products

Publish step checks

The New CVS layout and publish cycle allow us to add additional steps to prevent mistakes propagating to the public files.

When the new pipeline is complete, a simple cp will copy gene_ontology_write to the public place (executed by cron). However, we have the option of adding additional steps here.

For example, a simple grep could check that the saved file is in OBO format 1.2. If not, the cp would be delayed and a report sent. The public would not see the erroneous file (unless they ignored warnings and went for the gene_ontology_write file). The downside is that they may have to wait an extra day for the problem to be fixed, but this is a small price.
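The gated copy described above might look like the following sketch. It assumes that "OBO format 1.2" can be recognised by a `format-version: 1.2` line in the file header, and it stands in for the grep and cp steps with Python equivalents. The function names and the `report` callback are hypothetical; the real pipeline would send the report wherever the cron job is configured to.

```python
import shutil


def declares_format_1_2(path):
    """True if the header (before the first stanza) has format-version: 1.2."""
    with open(path, encoding="utf-8") as handle:
        for raw in handle:
            line = raw.strip()
            if line.startswith("["):
                break  # header is over, no need to read the whole file
            if line.startswith("format-version:"):
                return line.split(":", 1)[1].strip() == "1.2"
    return False


def publish(write_file, public_file, report=print):
    """Copy write_file to public_file only if the format check passes."""
    if not declares_format_1_2(write_file):
        report(f"publish delayed: {write_file} does not declare format-version: 1.2")
        return False
    shutil.copyfile(write_file, public_file)
    return True
```

Because the copy only happens after the check succeeds, a failed check leaves the previous public file in place.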

Publish checks (proposed)

  • Run OBO-Edit (OE) validation in batch mode
  • Check that the file format is OBO format 1.2
  • Check that there are no CVS artifacts (CVS clashes etc.)
  • Do an obodiff against the public file and ensure nothing is DESTROYed
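As an illustration of the CVS artifact check, the sketch below scans a file's text for the conflict markers that CVS writes into a file when a merge clashes (lines beginning with seven `<`, `=` or `>` characters). The function name is hypothetical, and other kinds of CVS artifacts are not covered.

```python
import re

# Conflict markers look like "<<<<<<< filename", "=======" and ">>>>>>> 1.42".
CONFLICT_MARKER = re.compile(r"^(<{7}( |$)|={7}$|>{7}( |$))")


def find_cvs_artifacts(text):
    """Return (line_number, line) pairs for every conflict-marker line."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if CONFLICT_MARKER.match(line)
    ]
```

An empty result means no markers were found; any hit would be a reason to delay the publish step and send a report.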

Scripts

??

Metrics

Ontology_QC_Metrics

Reports

See XP:Progress 2008