Availability of GO (ontologies)
Currently GO has no formal versioning or release system for ontologies. Live bleeding-edge versions of the ontology OBO files are maintained in CVS and are made available over HTTP (and also through FTP and a CVS client). In addition, other dumps are made available as part of the GO database cycle:
Cycles vary: every 30 minutes, every day, or every week.
There are a variety of download options:
- obof1.0 (gene_ontology.obo)
- obof1.2 (gene_ontology_edit.obo)
- MySQL dump: not a format as such
There are also a number of other Derived_files_in_CVS that are available for public download
Note that the two .obo files conflate at least two different purposes: providing separate format versions, and separating the live editorial version from the for-public-consumption version.
Some files are gzipped, some are not
There is no indication of versioning or of how a version should be indicated. The CVS revision number (automatically added to the file) will NOT work, as it applies per physical file, not per logical document.
Previous releases are available from the archive:
snapshots are taken at midnight on the first of the month
In addition there are mappings:
These have last-update dates indicated on the web page, but not in the file. I'm not sure whether there is any process to migrate mappings forward as terms are split, merged and obsoleted.
In the near future we will also consider other kinds of mappings, eg cross-products.
Changes to ontology content are announced ad-hoc, eg on go-friends and in the newsletter
Availability of annotations
Annotations are available from a separate area:
The file format is the tab-delimited go-assoc format, which is handy for many bioinformatics users.
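As a rough illustration of why the tab-delimited format is convenient, the following is a minimal sketch of a reader for go-assoc files. It assumes the 15-column GAF 1.0 layout and that comment lines start with "!"; the column names and the function name `read_associations` are chosen here for illustration and are not part of any GO tool.

```python
# Assumed column order (GAF 1.0 style); names are illustrative only.
GAF_COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "db_reference", "evidence", "with_from", "aspect", "db_object_name",
    "synonym", "db_object_type", "taxon", "date", "assigned_by",
]


def read_associations(lines):
    """Yield one dict per annotation row, skipping blank and '!' comment lines."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("!"):
            continue
        fields = line.split("\t")
        # Pad short rows so that every column name is present in the result.
        fields += [""] * (len(GAF_COLUMNS) - len(fields))
        yield dict(zip(GAF_COLUMNS, fields))
```

Any iterable of lines works as input, for example an open file handle: `for row in read_associations(open(path)): ...`.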
Association files must be downloaded individually.
Both filtered and unfiltered files are available on this page. The difference is well documented, but there are no instructions or best-practice guidelines for authors (reproducibility depends on using the same set).
The page has helpful statistics, generated by scripts from the flatfile.
There is also a link to the gp2protein downloads
This page also links to the old database download page on godatabase.org which links back to archives.geneontology.org in a confusing tangle. The archives also allow for downloading of:
- MySQL dumps
both of which provide a combined view over ontologies and annotations (and mappings and sequences).
The cycle for the database-generated resources is completely different from the go-assoc file dumps.
All database-generated files are datestamped in YYYYMMDD format; the intent was that these stamps could stand in for versions. We do not advertise this purpose, it is out of sync with the main annotations page, and thus no one appears to cite the date-versions in journals.
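If the date stamps were to be used as versions, a consumer would need to extract and compare them. A minimal sketch, assuming the stamp appears as a standalone run of eight digits somewhere in the file name (the file names in the usage note are hypothetical, as are the function names):

```python
import re
from datetime import datetime

# Eight digits not embedded in a longer run of digits.
_STAMP = re.compile(r"(?<!\d)(\d{8})(?!\d)")


def release_date(filename):
    """Return the YYYYMMDD stamp in a file name as a date, or None if absent/invalid."""
    match = _STAMP.search(filename)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(1), "%Y%m%d").date()
    except ValueError:
        # Eight digits that do not form a real calendar date.
        return None


def latest(filenames):
    """Return the file name carrying the most recent valid date stamp, or None."""
    dated = [(release_date(name), name) for name in filenames]
    dated = [(d, name) for d, name in dated if d is not None]
    return max(dated)[0:2][1] if dated else None
```

For example, `latest(["go_20070101-dump.gz", "go_20070201-dump.gz"])` picks the February file under these assumptions.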
There is no SOP for announcements (for example, if a new organism is added). These may end up in the newsletter. GOA make independent announcements on go-friends after committing to CVS.
Can this be improved
At the least, some documentation changes at the database end of things would help. Some of this has been addressed with the new schema pages, but the rest of the database pages are confusing and out of step with the rest of the site.
The main problem is the lack of any kind of versioning system, which means vital in-silico analyses are non-reproducible. This is the main use case driving this proposal.
A coherent versioning system
Members of the GOC are happy using bleeding edge versions, but this is not suitable for the general public. In addition, a lack of separation between live/edit versions and for-public versions means we have no buffer to insulate people from important and necessary changes to either files or content.
We should of course maintain means of access to bleeding edge versions for GOC members (through CVS and FTP). These will also be available to the public, but this will not be the default mode people stumble on by accident. The public will be guided to a releases page.
The releases page will offer every appropriate format for any one particular release, all in sync. Users will see an intuitive table that makes the cross-product of their pertinent choices apparent:
- Format [+ format-version]
- Release/Version (default to latest)
- Includes/Excludes (eg just process ontology; just GOA)
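The table described by the list above is essentially a cross-product of three choices. A minimal sketch of how its rows could be enumerated follows; the format names, release labels, subset names and the file-naming pattern are all hypothetical placeholders, not a proposed convention.

```python
from itertools import product


def release_table(releases, formats, subsets):
    """Return one row per (release, format, subset) combination."""
    rows = []
    for release, fmt, subset in product(releases, formats, subsets):
        rows.append({
            "release": release,
            "format": fmt,
            "subset": subset,
            # Hypothetical naming pattern, for illustration only.
            "filename": f"go-{release}-{subset}.{fmt}",
        })
    return rows


def default_release(releases):
    """Default to the latest release, assuming labels sort chronologically."""
    return max(releases)
```

With two releases, two formats and two subsets this yields eight rows, one per downloadable file, which is what keeps every format in sync for a given release.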
The current release would be synced with AmiGO
Each release would have its own statistics, plus a changelog (which can be automatically generated)
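As a sketch of how such a changelog might be generated, the following compares the [Term] stanzas of two OBO files and reports added, removed, renamed and newly obsoleted terms. It is a deliberately minimal reader that only looks at the `id`, `name` and `is_obsolete` tags; the function names are illustrative, and a real implementation would more likely build on an existing OBO parser.

```python
def parse_terms(obo_text):
    """Map term id -> {'name': str, 'obsolete': bool} for [Term] stanzas only."""
    terms = {}
    current = None
    in_term = False
    for raw in obo_text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are of interest here.
            in_term = line == "[Term]"
            current = None
            continue
        if not in_term or ":" not in line:
            continue
        tag, _, value = line.partition(":")
        value = value.strip()
        if tag == "id":
            current = terms.setdefault(value, {"name": "", "obsolete": False})
        elif current is not None and tag == "name":
            current["name"] = value
        elif current is not None and tag == "is_obsolete":
            current["obsolete"] = value == "true"
    return terms


def changelog(old, new):
    """Return human-readable change entries between two parse_terms() results."""
    entries = []
    for tid in sorted(new.keys() - old.keys()):
        entries.append(f"added {tid} {new[tid]['name']}")
    for tid in sorted(old.keys() - new.keys()):
        entries.append(f"removed {tid} {old[tid]['name']}")
    for tid in sorted(old.keys() & new.keys()):
        if old[tid]["name"] != new[tid]["name"]:
            entries.append(f"renamed {tid}: {old[tid]['name']} -> {new[tid]['name']}")
        if new[tid]["obsolete"] and not old[tid]["obsolete"]:
            entries.append(f"obsoleted {tid} {new[tid]['name']}")
    return entries
```

Splits and merges are not detected by this sketch; tracking them would require also reading tags such as `alt_id`.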
There is no use case for getting the cycle as short as possible. Monthly should be fine for the majority of users. (Remember, power users can always bypass the page and go straight to cvs/ftp. Of course these users should repeat significant results on a release prior to publication)
Releases can be regular (eg 1st day of month) or semi-regular (after significant milestones, in ontologies or annotations)
The system should be extensible to eventually offer more MART-style downloads and custom statistics, integrated with current AmiGO functionality.
We will also want to add more formats (OWL for annotations), and optional additional datatypes (eg cross-products, in obo or OWL)
Who needs to do what
- Collect requirements
- Fully specify system
- Implement system
- Maintain system
- Encourage both authors and re-distributors (eg ensembl) to use and clearly indicate a version
- Coordination with NCBO and their proposed versioning system (note: currently versions are provided without consulting ontology maintainers)
- The gene_ontology.obo file has been separated into an editors' version and a public version