Editor Guide

This page includes some handy hints for ontology editors.

Files

For now at least (and probably for some time to come), the master version of GO is an OBO file: /ontology/editors/gene_ontology_write.obo on the GO SVN.

Routine editing checks

If an update fails

If for some reason you have to terminate an update session before the update is complete, you may get an error that your files are locked the next time you try to update. To unlock the files:

'svn cleanup [directory]'

Then run the update command again:

'svn update [directory]'

What has changed in the file - before commit

One way to find out what has changed in your edited version of a file prior to an svn commit is to use the 'svn diff' command:

'svn diff go/ontology/editors/gene_ontology_write.obo'

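You can also use this to look at the differences between checked-in versions by giving the two revisions as a single colon-separated range (Subversion revision numbers are whole numbers, not dotted CVS-style numbers such as 5.249):

'svn diff -r [old_revision]:[new_revision] go/ontology/editors/gene_ontology_write.obo'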

But you can get a much clearer display of the differences between files by using ViewVC:

http://viewvc.geneontology.org/viewvc/GO-SVN/trunk/ontology/editors/gene_ontology_write.obo?view=log

Editors

OBO-Edit and Protege are both used for editing the ontology. OBO-Edit is used for routine tasks because of its ease of use and drag and drop features. Protege is used for creation of logical definitions and checking any inferences using the reasoner.

Editing in OBO-Edit

To edit the ontology file, you must use OBO-Edit version 2.3 or later (do not use any of the 2.3 beta versions). This imports automatically

A new bug in saving from OBO-Edit introduces seven 'ghost relations' in the ontology that should be removed with a text editor before commit. The relations will be visible in the diff and have 'id: RO:' in their first line. Currently they look like this, where 'has participant' is the last stanza we want to keep before the ghosts and 'role_of' is the next stanza we want to keep after them:

is_a: has_participant ! has participant

[Typedef]
id: RO:0000057
namespace: external
xref: RO:0000057

[Typedef]
id: RO:0002233
namespace: external
xref: RO:0002233

[Typedef]
id: RO:0002234
namespace: external
xref: RO:0002234

[Typedef]
id: RO:0002313
namespace: external
xref: RO:0002313

[Typedef]
id: RO:0002332
namespace: external
xref: RO:0002332

[Typedef]
id: RO:0002340
namespace: external
xref: RO:0002340

[Typedef]
id: RO:0002345
namespace: external
xref: RO:0002345

[Typedef]
id: role_of
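
The manual clean-up described above could also be scripted. The following is only a minimal sketch, not part of the GO tooling: the function names are hypothetical, and it assumes that a ghost relation is a [Typedef] stanza whose id starts with 'RO:' and which carries nothing but id, namespace and xref tags (as in the listing above), that stanzas are separated by blank lines, and that the file uses Unix line endings.

```python
import re

# Assumption: ghost stanzas carry only these tags, as in the listing above.
GHOST_TAGS = {"id", "namespace", "xref"}


def is_ghost(stanza: str) -> bool:
    """True for a [Typedef] stanza that has an RO: id and no other content."""
    lines = [ln.strip() for ln in stanza.splitlines() if ln.strip()]
    if not lines or lines[0] != "[Typedef]":
        return False
    tags = {}
    for ln in lines[1:]:
        tag, _, value = ln.partition(":")
        tags.setdefault(tag.strip(), []).append(value.strip())
    ids = tags.get("id", [])
    return set(tags) <= GHOST_TAGS and len(ids) == 1 and ids[0].startswith("RO:")


def strip_ghost_typedefs(obo_text: str) -> str:
    """Return the OBO text with ghost stanzas removed.

    Note: runs of blank lines are normalised to a single blank line.
    """
    stanzas = re.split(r"\n\s*\n", obo_text.strip("\n"))
    kept = [s for s in stanzas if not is_ghost(s)]
    return "\n\n".join(kept) + "\n"
```

A real relation that has a name or any other tag is left alone, so only the empty ghost stanzas are dropped. Whatever method you use, check the result with 'svn diff' before committing.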

There will also be differences in the order of lines in some of the relations if the file has been saved from Protege or from TermGenie. Those differences will not cause a problem.

Editing in Protege

Guide to Editing in Protege - google doc

File:Protege OBO OWL roundtrip.docx

Verification in OBO-Edit

The verification system in OBO-Edit can check for a number of errors in the ontology, using both built-in and user-specified (custom) checks. We used to regularly run checks requiring the rule-based reasoner, but this reasoner is now painfully slow and these checks have been superseded by checks run as part of the GO Jenkins build.

OBO-Edit text checks remain extremely important and include some issues that will not be caught by Jenkins (and, at the time of writing, cannot be performed using Protege). These include checks for name redundancy, spelling and grammar. Usable spell checking requires setting up dictionaries.

Setting up dictionaries

We maintain three dictionaries on svn:

1. ontology/editors/oboedit_user.dict

This is a dictionary of terms added by GO editors. You can add to this directly through OBO-Edit, but only if you create a softlink to this file from ~/oboedit_config/dict/ as follows:

'ln -s <path_to_go_repo>/ontology/editors/oboedit_user.dict ~/oboedit_config/dict/user.dict'

(You may need to delete ~/oboedit_config/dict/user.dict first.)

Alternatively, you can add to this dictionary by simply appending words directly to oboedit_user.dict - one per line.
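
If you append words by hand often, a small helper can keep the file tidy. This is a minimal sketch under stated assumptions: the function name add_words is hypothetical, and the file is assumed to be UTF-8 text with one word per line. It skips words that are already present and makes sure new words start on their own line.

```python
from pathlib import Path


def add_words(dict_path, words):
    """Append unseen words to a one-word-per-line dictionary; return those added."""
    path = Path(dict_path)
    text = path.read_text(encoding="utf-8") if path.exists() else ""
    known = {line.strip() for line in text.splitlines()}
    added = []
    for word in words:
        word = word.strip()
        if word and word not in known:
            known.add(word)
            added.append(word)
    if added:
        # Start on a fresh line if the file does not end with a newline.
        prefix = "" if text == "" or text.endswith("\n") else "\n"
        with path.open("a", encoding="utf-8") as handle:
            handle.write(prefix + "\n".join(added) + "\n")
    return added
```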

2. ontology/editors/oboedit_go_imports.dict

This is automatically generated by parsing the names and synonyms of terms in import files, plus all of ChEBI. (Generating from all of ChEBI suppresses many errors arising from the large number of chemical names in GO names, definitions and comments.)

3. ontology/editors/oboedit_allowedrepeats.dict

Does what it says on the tin - allows some words to be repeated without generating a non-fatal error warning.

Unlike oboedit_user.dict, the other two files (oboedit_go_imports.dict and oboedit_allowedrepeats.dict) should be added directly to the list of dictionaries that OBO-Edit checks, under: Config -> Configuration Manager -> Text Editor -> Dictionary settings

Dictionary_settings.png

To configure which checks run and when, use the OBO-Edit Verification Manager (in the Tools menu):

OE verification manager.png

Some of these checks are no longer needed. Checking for cycles is not necessary because GO now allows logically consistent cycles (we produce a cycle-free GO-basic.obo file for consumption by those who need this constraint). The disjoint check is also unnecessary: disjointness is checked more thoroughly on Jenkins, and the OBO-Edit check requires running the OE reasoner first.

For more details about any of the checks mentioned below, configuring when checks run, or on setting up additional custom checks, see the OBO-Edit user guide.

Custom checks required for GO

Two checks that are not built into OBO-Edit should be run before every commit: a check that the ontology is is_a complete, and a check that each term has one of molecular_function, biological_process, or cellular_component as its namespace.

To set up these checks, open the Verification Manager, click the 'Add check' button, and set up filters as shown; it's the same interface as searching and filtering.

  • is_a complete check: The check shown in the figure checks that each term has an is_a parent, which in turn ensures that everything is is_a complete. It also ignores relationship types and the ontology roots.

Isa parent OE check.png

  • namespace check:

Namespace OE check.png
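
For reference, the logic of the two custom checks above can be sketched outside OBO-Edit. This is an illustration only, not a replacement for the Verification Manager set-up: the function names are hypothetical, the three root IDs are supplied as a default that you can override, and exempting obsolete terms from the is_a check is an assumption of this sketch.

```python
ALLOWED_NAMESPACES = {"molecular_function", "biological_process", "cellular_component"}
# Assumption: the ontology roots are identified by these three IDs.
ROOT_IDS = frozenset({"GO:0003674", "GO:0008150", "GO:0005575"})


def parse_terms(obo_text):
    """Yield a {tag: [values]} dict for each [Term] stanza; other stanzas are skipped."""
    current = None
    for raw in obo_text.splitlines():
        line = raw.strip()
        if line.startswith("["):  # a new stanza begins
            if current is not None:
                yield current
            current = {} if line == "[Term]" else None
        elif current is not None and ":" in line:
            tag, _, value = line.partition(":")
            # Drop any trailing "! comment" text from the value.
            current.setdefault(tag.strip(), []).append(value.split(" ! ")[0].strip())
    if current is not None:
        yield current


def check_terms(obo_text, roots=ROOT_IDS):
    """Return (IDs lacking an is_a parent, IDs with an unexpected namespace)."""
    no_parent, bad_namespace = [], []
    for term in parse_terms(obo_text):
        term_id = term.get("id", ["?"])[0]
        obsolete = term.get("is_obsolete") == ["true"]
        if not obsolete and term_id not in roots and not term.get("is_a"):
            no_parent.append(term_id)
        if term.get("namespace", [""])[0] not in ALLOWED_NAMESPACES:
            bad_namespace.append(term_id)
    return no_parent, bad_namespace
```

Because [Typedef] stanzas are skipped, relationship types are ignored, as in the OBO-Edit check.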

Built-in checks

  • The name redundancy check does what its name implies: it detects when two or more terms have the same name or exact synonym(s). This check should be run before each commit.
  • Text checks are useful to find typos, but a missed typo won't break the release pipeline. You may prefer to run these checks manually at some regular interval, e.g. weekly, rather than upon every load or save. Text checks can be (and by default are) configured to detect spelling errors, double spaces, repeated words, newlines, and (where relevant) correct sentence capitalization and punctuation.
  • Text checks can be run on term names, synonyms, definitions and comments.
  • The dbxref check detects various problems with dbxref formatting.
  • The disjoint check requires the reasoner, and finds terms that have is_a paths to two or more terms that have been declared disjoint. As of 2010-12-07, the three roots are mutually disjoint, so the disjoint check can find a term in one ontology with an is_a parent in a different ontology. If this check isn't run by editors, any disjoint violations will be caught by Chris and Mike's checks, but that will interrupt the release pipeline until the error is corrected.
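
To make the name redundancy check concrete, here is a minimal sketch of the idea. It is not how OBO-Edit implements the check: the function name is hypothetical, comparison is by exact, case-sensitive string match, and obsolete terms get no special treatment (all assumptions of this sketch).

```python
import re
from collections import defaultdict

# Matches lines such as: synonym: "cilium assembly" EXACT [GOC:xx]
SYNONYM_RE = re.compile(r'^synonym:\s*"(.*)"\s+EXACT\b')


def find_redundant_names(obo_text):
    """Map each name or exact synonym used by two or more terms to their IDs."""
    owners = defaultdict(set)
    term_id, in_term = None, False
    for raw in obo_text.splitlines():
        line = raw.strip()
        if line.startswith("["):  # stanza header
            in_term, term_id = (line == "[Term]"), None
        elif not in_term:
            continue
        elif line.startswith("id:"):
            term_id = line[3:].strip()
        elif line.startswith("name:"):
            owners[line[5:].strip()].add(term_id)
        else:
            match = SYNONYM_RE.match(line)
            if match:
                owners[match.group(1)].add(term_id)
    return {text: sorted(ids) for text, ids in owners.items() if len(ids) > 1}
```

A term whose own name and exact synonym are identical is not reported, because only strings shared between different term IDs count.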

Before committing, also make sure the file has been saved with a released version of OBO-Edit.

OBO-Edit (and other) utilities

Using OBOMerge

Don't! It is much safer to do branching and merging using a standard text file merge/diff/patch system. In these systems, you need to specify an ancestor file and two descendants (typically your branch and the trunk version). You can then easily iterate through changes, accepting or choosing content from either branch as necessary (or simply resolving any clashes).

Options include:

  • emacs merge (Aquamacs is a good option on Mac)
  • BBEdit {Mac diff tool?}

Using obo2obo

If you are using obo2obo it is worth looking at the OBO-Edit help guide as the documentation there is much clearer than the command line documentation.

Bulk term name changes

You can use the script swap.pl in go/software/utilities to make bulk changes to the names of a lot of terms. When using the script, you only need to change the name text in the term name, not in the relationship lines of the stanza, because OBO-Edit will automatically change the name strings in the relationship lines the next time you load and save the file.

Tips for finding background, history, etc.

Term history script

If you want to know what has happened to a term through many cvs commits you can use the script in go/software/utilities called cvs_diff_history.pl.

The script runs diffs between adjacent versions of files for as many rounds back in time as you require, and searches them for any word that you provide. This is useful, for example, if you want to check if a term has been lost in a dodgy commit some time ago, or if you just want a list of all the GO terms altered in the last 30 commits.
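
The approach the script takes (diff adjacent versions, then search the changed lines) can be illustrated as follows. This is a sketch of the general idea only, not a port of cvs_diff_history.pl: the function name is hypothetical, and it assumes you have already retrieved the file contents for each version, ordered oldest to newest.

```python
import difflib
import itertools


def changed_lines_mentioning(versions, word):
    """Return (version_index, diff_line) for added or removed lines containing word.

    versions: list of file contents (strings), ordered oldest to newest.
    version_index is the position of the newer file in each adjacent pair.
    """
    hits = []
    for index in range(1, len(versions)):
        old = versions[index - 1].splitlines()
        new = versions[index].splitlines()
        diff = difflib.unified_diff(old, new, lineterm="", n=0)
        # The first two diff lines are the '---' and '+++' file headers.
        for line in itertools.islice(diff, 2, None):
            if line.startswith(("+", "-")) and word in line:
                hits.append((index, line))
    return hits
```

Searching for a GO ID shows each version in which a line mentioning that term was added ('+') or removed ('-').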

The script is here: http://cvsweb.geneontology.org/cgi-bin/cvsweb.cgi/go/software/utilities/cvs_diff_history.pl

Checking the e-mail archive

There may have been discussions on a given term on one of the GO email lists, especially ontology-editors and go (the main GO list). If all else fails, you may be able to find something in go-diffs. See the mailing list tips on this page for more information on searching and archive locations.

OnEX

Ontology Evolution Explorer is a third-party tool that can trace the history of a GO term (or a term from any of several other ontologies) back to November 2002. It is available at http://aprilia.izbi.uni-leipzig.de:8080/onex/

CVS logs

There may be useful information on when terms were added, which SourceForge item requested a particular change, etc. in the CVS logs for the ontology files. The log messages are associated with whichever file the editors used to commit changes to the repository, so which log you need depends on when a change was made (and if you don't know, you may have to look at all of them). Logs for OBO format files:

Mailing list tips

Mailing list sign-up page

http://fafner.stanford.edu/mailman/listinfo

Archives and searching mailing lists

Each list has an archive that can be browsed, and some can be searched, to find pertinent e-mails. The search form (if available; not all lists have the search feature) is at the top of the listinfo page.

Listinfo page URL syntax:

  http://fafner.stanford.edu/mailman/listinfo/[listname]

Example:

 http://fafner.stanford.edu/mailman/listinfo/go

List archive page URL syntax:

  http://fafner.stanford.edu/pipermail/[listname]

Example:

 http://fafner.stanford.edu/pipermail/go

List of all Mailman lists (useful if you are not sure of the list's name, or to see which lists are available):

 http://fafner.stanford.edu/mailman/listinfo

GO Term Obsoletion Mails

When you have to make a term obsolete, first send an email to the GO list to give notice and allow a period for objections, questions, comments, etc.

The subject line of the email should be along the lines of:

Alert: Proposal to obsolete GO:nnnnnnn: term that impacts existing annotation

The subject line doesn't have to mention the GO:ID if you're alerting about more than one term in a single email.

You will need to give counts of the number of annotations the term is used in, as well as list any subsets of non-definition dbxrefs for the term. To get the information for the email, see GOOSE queries for terms proposed to be obsoleted below.

This SQL query can also be used to retrieve these figures.


The body of the email should include

  • The term name(s) and ID(s)
  • Counts of any annotations, by group, and distinguishing IEA from the rest
    • Over the years, two or three different format/syntax patterns have cropped up. Any of them is fine; see the example emails linked below.
  • The reason(s) for making the term(s) obsolete
  • A link to the relevant SourceForge entry/entries
  • Recommended terms to transfer annotations and mappings (the consider and replaced_by tags)
  • Any external2go mappings that use the term(s)
  • Slim subsets to which the term(s) belong
  • The deadline for comments, and a reminder that if we hear nothing, the obsoletion goes ahead

Links to examples in the GO mailing list archive:


We give two weeks' notice if the term is used in annotations, or one week if there are no annotations.


Obsoletion of terms in the generic GO slim, or with external GO mappings

Once the proposed obsoletions have been agreed within the GO consortium (or at least not objected to), for GO terms that also have external mappings (e.g. to InterPro, EC etc) or are in the GO slim subset, an obsoletion mail must also be sent to the GO friends mailing list notifying them that the term will be obsoleted in one week. This allows external mappings to be updated in advance.

Please note that discussions or disagreements regarding the proposed obsoletions will occur WITHIN the GO consortium. A notification mail will only be sent to gofriends once the term obsoletion has been agreed by the GOC.

Links to more editor guides

Relationships

Editor Guide to Regulates

Editor Guide to has_part

Content meetings

Content Meeting Participants Information

Miscellaneous tips

Merging terms

SOP for merging terms in Protege

Merging terms in Protege

SOP for merging terms in OBO-Edit

- svn update the ontology/editors directory.

- Check the inference file (go_inferences.obo) for any inference that refers to the terms you’re planning to merge. If found, delete it/them, and commit the change(s) immediately before editing the ontology itself.

Example: I’m going to merge GO:0042384 ‘cilium assembly’ into GO:0060271 ‘cilium morphogenesis’. 
In this particular case, I’m actually going to keep ‘cilium assembly’ as the primary name (see details at https://github.com/geneontology/go-ontology/issues/12236#issuecomment-246941622). 
Regardless, the ID that will become secondary is GO:0042384. Searching go_inferences.obo for ‘GO:0042384’ retrieves the following stanzas:
[Term]
id: GO:0042384 ! cilium assembly
is_a: GO:0010927 {is_inferred="true"} ! cellular component assembly involved in morphogenesis
is_a: GO:0030031 {is_inferred="true"} ! cell projection assembly
is_a: GO:0070925 {is_inferred="true"} ! organelle assembly
[Term]
id: GO:0044458 ! motile cilium assembly
is_a: GO:0042384 {is_inferred="true"} ! cilium assembly
[Term]
id: GO:1905515 ! non-motile cilium assembly
is_a: GO:0042384 {is_inferred="true"} ! cilium assembly
I will delete the stanzas above.
Also, because in this particular case the ‘mergee’ term (GO:0060271) will lose its link to the morphogenesis branch, I’ll also delete any inference that refers to GO:0060271. There’s only one:
[Term]
id: GO:0060271 ! cilium morphogenesis
is_a: GO:0048858 {is_inferred="true"} ! cell projection morphogenesis

- Immediately after committing the changes in go_inferences.obo, update the ontology file (gene_ontology_write.obo) and do the term merge in OBO-Edit. (Note that the terms you’ve just deleted inferences for will look unusual in the tree view, as if they were a branch of their own.) Make sure that logical definitions, if any, are consistent with the merge. (This may be clearer after the build completes, as you may get errors stemming from single intersections that you’ll need to fix.) Commit. Check the Jenkins go-build status and messages, as you may have to fix stray errors.
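
The inference clean-up step in the SOP above (find every stanza in go_inferences.obo that mentions the ID, then delete it) can be sketched as below. This is a minimal illustration, not an established GO script: the function name is hypothetical, and it assumes each stanza starts with a bracketed header line such as [Term], as in the example.

```python
def drop_inferences(obo_text, go_id):
    """Remove every stanza that mentions go_id.

    Returns (new_text, number_of_stanzas_removed). Lines before the first
    stanza header are treated as the file header and always kept.
    """
    header, stanzas, current = [], [], None
    for line in obo_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            current = [line]  # start a new stanza
            stanzas.append(current)
        elif current is None:
            header.append(line)
        else:
            current.append(line)
    kept = [s for s in stanzas if not any(go_id in ln for ln in s)]
    lines = header + [ln for stanza in kept for ln in stanza]
    return "\n".join(lines) + "\n", len(stanzas) - len(kept)
```

For the example above you would run it once for GO:0042384 and, because the morphogenesis link is being lost, once more for GO:0060271. Review the result with 'svn diff' before committing.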

How to use 'Status' and 'Labels' in Sourceforge ontology requests

This section is no longer valid now that we have moved to Github. It has been retained for record.

- open: I haven't worked on it yet, or I started work but didn't get to do all the submitter needed; (e.g., I only added one out of two requested terms, or I'm waiting for info from the submitter and I'm confident that I'll get it soon, such as a PMID)

- open-accepted: I did what the submitter needed in the first place, but other necessary edits came up en-route that I haven't done yet; (e.g., I added requested terms, then realized naming needs to be made consistent in that branch, but higher priorities came by)

- pending: I'm waiting for the submitter, or someone else, to provide high-level info without which the request can't be processed; (e.g., waiting for Alex Diehl to clarify a branch of CL before I can fix corresponding GO terms)

- pending-accepted: I did what the submitter needed in the first place, but am waiting for someone to provide info to make things nice and complete; (e.g., I created an enzyme activity term for which no EC or RHEA entry is available, and am waiting for RHEA to create an entry so I can dbxref to it)

- closed: all is done and I'm not going to look at that ticket ever again.

Are these good indicators of how complex a ticket is? They can tell how often we need to rely on others (atm, I have 25 tickets and 7 of them are pending or pending-accepted). And they're useful to see quickly which tickets I should work on and with what priority (open first, open-accepted next). But for complexity, I think we should use labels. My suggestions:

- mini-project (can do by myself, but need more time than an average new term request),

- jamboree (need to discuss with others, on SF call, editors call or jamboree),

- combination of the two.

We have already labeled mini-projects, so maybe if we labeled jamboree where needed, we'd get a sense of how often we need to meet to keep numbers down and ontology healthy.

'Easy' tickets wouldn't need any special label, other than the usual topic-related ones.

Closing Github issues and recording info in SVN commits

 For closing GH items and recording that information in the svn commit, this is the suggested standard for the commit message:
 GH1088 demethylase binding - closed
 For other GH related work:
 GH1088 demethylase binding - work in progress, (+ very brief edit description)
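
If you want to generate messages in this format consistently, something like the following would do. It is only a sketch of the suggested convention; the function name and parameters are hypothetical.

```python
def commit_message(issue, title, closed=False, note=""):
    """Build an svn commit message in the suggested 'GH<number> <title> - <status>' form."""
    prefix = f"GH{issue} {title} - "
    if closed:
        return prefix + "closed"
    # For ongoing work, an optional very brief edit description follows a comma.
    return prefix + "work in progress" + (f", {note}" if note else "")
```

For example, commit_message(1088, "demethylase binding", closed=True) gives the first message shown above.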

Adding references to Wikipedia

As Chris has added Wikipedia to GO.xrf_abbs, we no longer have to put URLs in definition dbxrefs if we want to cite Wikipedia as a def source. Instead, citations should now be in the form 'Wikipedia:Page_name'.

In OBO-Edit, put 'Wikipedia' in the 'Database' field and the page name, as it appears at the end of the URL, in the 'ID' field.
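
As an illustration of the conversion, the sketch below turns a Wikipedia page URL into the 'Wikipedia:Page_name' form. The function name is hypothetical, and it assumes the usual URL layout in which the page name follows '/wiki/'; the page name is kept exactly as it appears in the URL.

```python
from urllib.parse import urlparse


def wikipedia_dbxref(url):
    """Convert a Wikipedia page URL into a 'Wikipedia:Page_name' dbxref string."""
    path = urlparse(url).path  # excludes any '#section' fragment or query string
    prefix = "/wiki/"
    if not path.startswith(prefix) or len(path) == len(prefix):
        raise ValueError(f"not a recognisable Wikipedia page URL: {url}")
    return "Wikipedia:" + path[len(prefix):]
```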

Useful GOOSE queries

Some GOOSE queries to get info for a term proposed to be obsoleted

Here are some useful sample queries for GOOSE to get information for a term, or list of terms, that is proposed to be obsoleted. As written, they get information for two terms (e.g. 'GO:0030528', 'GO:0003712'). Edit the list of GOIDs to customize the queries for the term or terms you are interested in.

Query for counts of annotations to terms by Source and Evidence Code:

SELECT
 term.acc,
 term.name,
 dbxref.xref_dbname,
 evidence.code,
 count(association.id) AS a_count
FROM
 term,
 association,
 evidence,
 gene_product,
 dbxref
WHERE
 association.term_id = term.id
 AND evidence.association_id = association.id
 AND gene_product.id = association.gene_product_id
 AND gene_product.dbxref_id = dbxref.id
 AND term.acc IN ('GO:0022008', 'GO:0016787')
GROUP BY term.acc, dbxref.xref_dbname, evidence.code
ORDER BY term.acc, dbxref.xref_dbname, evidence.code;

Query for Direct, non-IEA annotations to a term, grouped by annotating database:

select db.name, 
count(association.id) 
from 
association, 
term, 
gene_product, 
evidence, 
db 
where 
association.term_id = term.id and 
association.gene_product_id = gene_product.id and 
evidence.association_id = association.id and 
association.source_db_id = db.id and 
evidence.code not in ('IEA') and 
term.acc = 'GO:0016585' 
group by db.name
order by 2 desc;

Query for Subsets a term is a member of:

SELECT
term.acc,
term.name,
subset.name AS subset
FROM
term
LEFT OUTER JOIN term_subset ON term_subset.term_id = term.id
LEFT OUTER JOIN term AS subset ON subset.id = term_subset.subset_id
WHERE
term.acc IN ('GO:0030528', 'GO:0003712')
GROUP BY term.acc, subset.name
ORDER BY term.acc, subset.name;

Query for non-definition dbxrefs:

select distinct term.acc,
 term.name,
 concat(dbxref.xref_dbname, ':', dbxref.xref_key) as xref
from
 term
 left outer join term_dbxref on (term.id = term_dbxref.term_id)
 left outer join dbxref on (term_dbxref.dbxref_id = dbxref.id)
where acc in ('GO:0003899', 'GO:0022008')
 and term_dbxref.is_for_definition = 0
order by term.acc, xref;

Some additional queries

Query for counts of annotations to terms, grouped by Evidence Code:

SELECT
term.acc,
term.name,
count(association.id) AS a_count,
evidence.code
FROM
term, association,
evidence
WHERE association.term_id = term.id
AND evidence.association_id = association.id
AND term.acc IN ('GO:0030528', 'GO:0003712')
GROUP BY term.acc, evidence.code
ORDER BY term.acc, evidence.code;

Query for definition dbxrefs:

select distinct term.acc, term.name,
 concat(dbxref.xref_dbname, ':', dbxref.xref_key) as xref
from
 term
 left outer join term_dbxref on (term.id = term_dbxref.term_id)
 left outer join dbxref on (term_dbxref.dbxref_id = dbxref.id)
where acc in ('GO:0003899', 'GO:0022008')
 and term_dbxref.is_for_definition = 1
order by term.acc, xref;

Query for all dbxrefs (In the last column, the dbxrefs for the definition have a '1'; the other dbxrefs have a '0')

select distinct
 term.acc,
 term.name,
 concat(dbxref.xref_dbname, ':', dbxref.xref_key) as xref,
 term_dbxref.is_for_definition as is_for_def
from
 term
 left outer join term_dbxref on (term.id = term_dbxref.term_id)
 left outer join dbxref on (term_dbxref.dbxref_id = dbxref.id)
where acc in ('GO:0003899', 'GO:0022008')
order by term.acc, is_for_def, xref;

Tips for Windows users

Windows line endings

To get rid of Windows line endings use:

tr -d '\r' < oldfile > newfile

To detect Windows line endings when using a Windows machine

You can detect whether a file has Windows line endings when using a Windows machine by opening the file in any hex editor (e.g. XVI32) and seeing whether the lines end in 0x0A (UNIX) or 0x0D 0x0A (DOS).
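
If Python is available, the same byte-level check can be scripted instead of using a hex editor. This is a minimal sketch with hypothetical function names; it works on raw bytes so that no newline translation hides the 0x0D characters.

```python
def has_windows_line_endings(data: bytes) -> bool:
    """True if the data contains at least one 0x0D 0x0A (CRLF) pair."""
    return b"\r\n" in data


def to_unix_line_endings(data: bytes) -> bytes:
    """Replace each CRLF pair with a single LF (0x0A)."""
    return data.replace(b"\r\n", b"\n")
```

Read the file in binary mode (for example with open(path, "rb")) before calling these functions, and write the result back in binary mode too. Unlike tr -d '\r', which deletes every carriage return, this sketch only changes carriage returns that are followed by a line feed.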

You can convert the line endings using Notepad++. The command is [Format]->[Convert to UNIX format]. If you have to commit, from a Windows machine, a file in which only the line endings have changed, then just do cvs commit; do not update and then commit. If you update first then no changes will be apparent and there will be no commit option available.

Using cvs from Windows

If you want to edit using the Windows operating system you can use TortoiseCVS (http://www.tortoisecvs.org/).

Here is an example of the settings that you will need in TortoiseCVS:

TortoiseSettings.PNG

You will also need to have PuTTY and Pageant set up, and when you are issuing cvs commands you will need to have Pageant open and the ssh key loaded.

To carry out a cvs diff command you will need to install a program that can do the diff operation. One good example is WinMerge, which you can get from http://winmerge.org/.

To set it up to work with TortoiseCVS follow this screenshot:

Diff.PNG

Use of TortoiseCVS is quite intuitive. It works from within the file explorer window just by right clicking any file as follows.

Use.PNG

Before committing you must save the file with Unix line endings using the Windows installation of Emacs. (Info: http://www.gnu.org/software/emacs/ Download: http://ftp.gnu.org/pub/gnu/emacs/windows/).

It takes at least ten minutes for each cvs command to complete so you need to be very patient.