Release Pipeline
June 2018: This documentation is currently a work in progress.
Overview
The GO Consortium (GOC) now publicly releases data on a monthly basis. Data includes annotation files, ontology files, GO-CAM models, and... Official monthly releases are versioned and archived so that analyses performed with these data can be reproduced at any point in the future. Additionally, daily snapshot releases of GO data are available for internal use by GOC members. This gives annotators, for example, access to the most up-to-date version of the ontology for their curation. However, data generated using snapshot releases will not be officially released until the next monthly public release.
Release Cycle
For both the daily and monthly releases, the pipeline starts at midnight and finishes within 24 hours. The daily snapshot release starts every night. The monthly public release starts on the first of the month, or as close to it as possible if there are failures.
Annotations
Overview
Annotation files are retrieved from each participating consortium member by GO Central, merged with PAINT annotation files, run through annotation QA/QC checks and then released as daily snapshot and monthly public releases.
Annotation Source Files
How to Submit an Annotation File
- Annotation files produced by GOC members should be made available to GO Central by placing the file on a publicly accessible site such as an FTP site, an Amazon S3 bucket, an HTTP server, etc.
- The URL or address from which the file can be obtained is stored in the Source field of each group's datasets metadata file, located on GitHub in the geneontology/go-site repository.
- More information about the format of the datasets metadata file can be found in the datasets README.md file under the Schema heading.
- For example Source URLs, see:
- https://github.com/geneontology/go-site/blob/master/metadata/datasets/mgi.yaml
- https://github.com/geneontology/go-site/blob/master/metadata/datasets/wb.yaml
- https://github.com/geneontology/go-site/blob/master/metadata/datasets/xenbase.yaml
- https://github.com/geneontology/go-site/blob/master/metadata/datasets/zfin.yaml
- The Source address must resolve to the latest annotation file produced by the submitting group.
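As a rough illustration of how Source addresses might be sanity-checked before submission, the sketch below inspects an already-parsed metadata record. The record layout (a `datasets` list whose entries carry `id` and `source` keys), the function name, and the set of accepted schemes are assumptions made for this example; the datasets README.md remains the authority on the actual schema.

```python
from urllib.parse import urlparse

# Assumption: schemes accepted for a publicly reachable Source address.
ALLOWED_SCHEMES = {"http", "https", "ftp", "s3"}


def invalid_sources(metadata):
    """Return (dataset id, source) pairs whose Source is not a usable URL.

    `metadata` is a dict shaped like a parsed datasets metadata file.
    The key names used here ("datasets", "id", "source") are assumed.
    """
    problems = []
    for dataset in metadata.get("datasets", []):
        source = dataset.get("source") or ""
        parsed = urlparse(source)
        # A usable address needs a known scheme and a host (or bucket) name.
        if parsed.scheme not in ALLOWED_SCHEMES or not parsed.netloc:
            problems.append((dataset.get("id"), source))
    return problems


# Hypothetical record for illustration only; not a real group's metadata.
example = {
    "id": "examplegroup",
    "datasets": [
        {"id": "examplegroup.gaf",
         "source": "https://example.org/latest/annotations.gaf.gz"},
        {"id": "examplegroup.gpi", "source": "annotations.gpi"},
    ],
}

print(invalid_sources(example))
```

This only checks the shape of the address. It does not confirm that the address resolves to the latest annotation file, which still has to be verified by fetching it.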
The pipeline then runs checks on the retrieved file (see below) and repairs any auto-repairable issues (for example, migrating annotations to merged terms). It then publishes the processed GAFs, GPADs, GPIs, etc. to a public site, where they are available for download.
The "publish" sites that are currently part of the pipeline are:
- http://snapshot.geneontology.org (~daily)
- http://release.geneontology.org (~monthly, plus historical sets from the new pipeline)
- http://current.geneontology.org (~monthly, containing the latest release set)
All pipeline runs start at midnight (12am) PDT and currently take about 14 hours (this will be decreased in the future). The `release`/`current` pipeline runs are attempted on the first of every month. Note that the `snapshot` run does not currently run on the day of the `release`.
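The scheduling rule described above can be sketched as a small function. This is an illustration only: the function name is hypothetical, and it does not model the case where a failed release run is retried on a later day.

```python
from datetime import date


def scheduled_run(day):
    """Return which pipeline run is attempted on `day`.

    The release run is attempted on the first of the month, and the
    snapshot run is skipped on that day. Retries after a failed
    release are not modeled here.
    """
    return "release" if day.day == 1 else "snapshot"


print(scheduled_run(date(2018, 6, 1)))  # release
print(scheduled_run(date(2018, 6, 2)))  # snapshot
```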
Per-group curator QC reports
We run the full pipeline every day, without the SVN writeback and software deployments; we call this the snapshot. As part of this run, just as in the release, we generate products and reports that are of use to the database administrators and curators of the various groups (MODs, UniProt, etc.) that contribute to the GO Consortium.
- http://snapshot.geneontology.org/reports/
- summary.txt EXAMPLE: dictybase.report.md
- prediction-report.txt EXAMPLE: mgi-prediction-report.txt
- owltools-check.txt EXAMPLE: mgi-owltools-check.txt
- http://snapshot.geneontology.org/products/annotations/
- GROUP-prediction.gaf EXAMPLE: pombase-prediction.gaf
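For groups that fetch their reports regularly, the locations above can be assembled from the group name. The sketch below follows the naming pattern of the listed examples (for instance `mgi-prediction-report.txt`). The function name is hypothetical, and it assumes every group's files follow the same pattern as these examples.

```python
SNAPSHOT = "http://snapshot.geneontology.org"


def report_urls(group):
    """Build snapshot report and product URLs for a group such as 'mgi'.

    Assumes file names follow the pattern of the documented examples.
    """
    reports = f"{SNAPSHOT}/reports"
    return {
        "summary": f"{reports}/{group}.report.md",
        "prediction_report": f"{reports}/{group}-prediction-report.txt",
        "owltools_check": f"{reports}/{group}-owltools-check.txt",
        "prediction_gaf": f"{SNAPSHOT}/products/annotations/{group}-prediction.gaf",
    }


for name, url in report_urls("mgi").items():
    print(name, url)
```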
Summary
This is a basic summary of the parsing of your GAF file. It functionally replaces the old "Mike's script".
Summaries are found in the reports directory.
Example: http://snapshot.geneontology.org/reports/dictybase.report.md
They report basic syntax errors and implement a subset of the checks in the GO QC Rules.
Prediction Report and OWLTools Checks
Predictions
Technical Details
See the README.md in the pipeline GitHub repo: https://github.com/geneontology/pipeline
GO Consortium Dataflow
https://github.com/geneontology/go-site/blob/master/docs/go-consortium-dataflow.png