Release Pipeline
Revision as of 13:15, 5 June 2018
June 2018: This documentation is currently a work in progress.
Overview
The GO Consortium (GOC) now releases data publicly on a monthly basis. Data includes annotation files, ontology files, GO-CAM models, and... Official monthly releases are versioned and archived so that analyses performed with these data can be reproduced at any point in the future. Additionally, daily snapshot releases of GO data are available for internal use by GOC members. This gives annotators, for example, access to the most up-to-date version of the ontology for their curation. However, data generated using snapshot releases will not be officially released until the next monthly public release.
The information below is meant to provide an overall summary and basic instructions for submitting and consuming GO data. For a more detailed discussion of the technical details, please see the README.md file in the pipeline repository on GitHub.
Release Cycle
For both the daily and monthly releases, pipeline runs start at midnight (12 am) PDT and currently take about 14 hours (this will be reduced in the future). The daily snapshot run starts nightly; the monthly public release run starts on the first of the month, or as close to it as possible if there are failures. The `snapshot` run does not run on the day of the monthly `release` run. Data associated with each release can be accessed at the URLs below; details about the contents of released files are discussed in the relevant sections below.
- http://current.geneontology.org (~monthly, containing the latest release set)
- http://release.geneontology.org (~monthly, plus historical sets from the new pipeline)
- http://snapshot.geneontology.org (~daily)
All files generated as part of the monthly release will have a permanent, stable release identifier.
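The schedule described above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's actual scheduling code: the function names `planned_run` and `base_url_for` are hypothetical, and the sketch assumes the monthly run lands exactly on the first of the month, whereas in practice it may slip if a run fails.

```python
from datetime import date

# Base URLs taken from the list above.
BASE_URLS = {
    "release": "http://release.geneontology.org",
    "snapshot": "http://snapshot.geneontology.org",
}


def planned_run(day: date) -> str:
    """Return the run nominally scheduled for a given day.

    Assumption: the monthly release runs exactly on the first of the
    month, and the snapshot run is skipped on that day.
    """
    return "release" if day.day == 1 else "snapshot"


def base_url_for(day: date) -> str:
    """Return the base URL where that day's run would publish its data."""
    return BASE_URLS[planned_run(day)]


if __name__ == "__main__":
    for day in (date(2018, 6, 1), date(2018, 6, 5)):
        print(day.isoformat(), planned_run(day), base_url_for(day))
```

Note that http://current.geneontology.org is not included in the mapping because, per the list above, it mirrors the latest monthly release set rather than corresponding to a separate run.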
Annotations
Overview
Annotation files are retrieved from each participating consortium member by GO Central, merged with PAINT annotation files, run through annotation QA/QC checks and then released as daily snapshot and monthly public releases.
How to Submit an Annotation File
- Annotation files produced by GOC members should be made available to GO Central by placing the file on a publicly accessible site such as an FTP site, an Amazon S3 bucket, an HTTP server, etc.
- The URL or address from which the file can be obtained is stored in the Source field of each group's datasets metadata file, which is located on GitHub in the geneontology/go-site repository.
- More information about the format of the datasets metadata file can be found in the datasets README.md file under the Schema heading.
- For examples of Source URLs, see:
- https://github.com/geneontology/go-site/blob/master/metadata/datasets/mgi.yaml
- https://github.com/geneontology/go-site/blob/master/metadata/datasets/wb.yaml
- https://github.com/geneontology/go-site/blob/master/metadata/datasets/xenbase.yaml
- https://github.com/geneontology/go-site/blob/master/metadata/datasets/zfin.yaml
- Important note: the Source address must resolve to the latest annotation file produced by the submitting group.
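Before adding a Source address to a metadata file, a group may want to sanity-check that it at least looks like a publicly retrievable location. The sketch below is only a rough illustration: the function name `looks_retrievable`, the set of accepted schemes, and the example addresses are assumptions, not part of the pipeline, and the check does not contact the server or confirm that the file is the latest one.

```python
from urllib.parse import urlparse

# Assumption: schemes matching the hosting options mentioned above
# (HTTP server, FTP site, Amazon S3 bucket).
ALLOWED_SCHEMES = {"http", "https", "ftp", "s3"}


def looks_retrievable(source: str) -> bool:
    """Return True if `source` has an accepted scheme and a host/bucket part.

    This is a purely syntactic check; it performs no network access.
    """
    parsed = urlparse(source)
    return parsed.scheme in ALLOWED_SCHEMES and bool(parsed.netloc)


if __name__ == "__main__":
    # Hypothetical addresses, for illustration only.
    for candidate in (
        "http://example.org/downloads/annotations.gaf.gz",
        "s3://example-bucket/annotations.gaf.gz",
        "/home/curator/annotations.gaf.gz",
    ):
        print(candidate, looks_retrievable(candidate))
```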
What Happens to Annotations During a Release Cycle
- For both the monthly public releases and the daily snapshot releases, the same set of QA/QC checks and annotation file merges are performed. Resulting annotation files and reports are then made accessible via the release URLs listed above.
- When GO Central retrieves an annotation file from a contributing group, the pipeline will run checks on the file and repair any auto-repairable issues (for example, migrating annotations to merged terms). It will then publish the processed GAFs, GPADs, GPIs, etc. to a public site, available for download.
- Need to link to the GO rules on github and/or articulate all of the checks that annotation files undergo after submission.
Annotation Files and Reports
- The annotation release pipeline generates products (e.g. annotation and gpi files) and reports (e.g. error reports, inferred annotations) that provide feedback to GOC contributing groups (MODs, UniProt, etc.).
- Below is more detailed information about what files are generated during a release, using the snapshot URLs as an example.
Annotation and GPI Files
- Annotation (gaf and gpad) and gpi files are available here:
- Currently, annotation (gaf and gpad) and gpi files are sorted according to contributing group.
- In the future, annotation files will be sorted according to species.
- Note that the file names now conform to a new naming schema that indicates the file type in the extension:
- mgi.gaf.gz
- mgi.gpad.gz
- mgi.gpi.gz
- http://snapshot.geneontology.org/reports/
- summary.txt EXAMPLE: dictybase.report.md
- prediction-report.txt EXAMPLE: mgi-prediction-report.txt
- owltools-check.txt EXAMPLE: mgi-owltools-check.txt
- http://snapshot.geneontology.org/products/annotations/
- GROUP-prediction.gaf EXAMPLE: pombase-prediction.gaf
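The naming schema and report locations listed above can be expressed as small helper functions. This is a sketch only: the helper names are hypothetical, and the `GROUP.report.md` pattern is inferred from the single dictybase example on this page rather than from a documented rule.

```python
# File types that the naming schema encodes in the extension.
FILE_TYPES = ("gaf", "gpad", "gpi")

# Snapshot reports directory, as listed above.
REPORTS_URL = "http://snapshot.geneontology.org/reports/"


def annotation_file_name(group: str, file_type: str) -> str:
    """Build a file name such as 'mgi.gaf.gz' for a contributing group."""
    if file_type not in FILE_TYPES:
        raise ValueError(f"unknown file type: {file_type!r}")
    return f"{group}.{file_type}.gz"


def summary_report_url(group: str) -> str:
    """Build the summary report URL for a group.

    Assumption: every group follows the pattern of the dictybase example.
    """
    return f"{REPORTS_URL}{group}.report.md"


if __name__ == "__main__":
    for file_type in FILE_TYPES:
        print(annotation_file_name("mgi", file_type))
    print(summary_report_url("dictybase"))
```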
Summary
This is a basic summary of the parsing of your GAF file. It functionally replaces the old "Mike's script".
Summary reports are found in the reports directory.
Example: http://snapshot.geneontology.org/reports/dictybase.report.md
These reports flag basic syntax errors and implement a subset of the checks in the GO QC Rules.
Prediction Report and OWLTools Checks
Predictions
Ontology
Consuming GO Data
GO Consortium Dataflow
https://github.com/geneontology/go-site/blob/master/docs/go-consortium-dataflow.png