Release Pipeline: Difference between revisions

From GO Wiki
mNo edit summary
mNo edit summary
Line 1: Line 1:
'''August 2018: This documentation is currently a work in progress.'''


[[Release Pipeline v2]]
= Overview =
The information below is intended for GOC members who provide externally curated annotations and submit them to GO. It also describes how the GOC processes submitted annotations to produce the data users can see in [http://amigo.geneontology.org/amigo AmiGO] and download from our site.


== Overview ==
==Release schedule and content==
The GO Consortium (GOC) now publicly releases data on a monthly basis. The pipeline runs daily, starting at midnight (12am) PDT; the run on the 1st of each month (or as close to it as possible if there are failures) produces the official monthly release. Releases contain:
* [http://wiki.geneontology.org/index.php/Release_Pipeline#Annotations_files Annotation files]
* [http://wiki.geneontology.org/index.php/Release_Pipeline#Ontology_files Ontology files]
* [http://wiki.geneontology.org/index.php/Release_Pipeline#Annotation_reports Annotation reports]
* [http://wiki.geneontology.org/index.php/Release_Pipeline#Predictions Predictions]


The GO Consortium (GOC) is now publicly releasing data on a monthly basis. Data includes annotation files, ontology files, GO-CAM models, '''and...'''  Official monthly releases are versioned and archived so that analyses performed with these data can be reproduced at any point in the future.  Additionally, daily snapshot releases of GO data are available for internal use by GOC members.  This allows annotators, for example, to have access to the most up-to-date version of the ontology for their curation.  However, data generated using snapshot releases will not be officially released until the monthly public release.
== Data sources ==
* Primary annotations by GOC contributing groups  (ADD MORE INFO)
* PAINT annotations  (ADD MORE INFO)
* Predictions (ADD MORE INFO)
* GO-CAM models (ADD MORE INFO)


The information below is meant to provide an overall summary and basic instructions for submitting and consuming GO data.  For a more detailed discussion of the technical details, please see the [https://github.com/geneontology/pipeline README.md file] in the pipeline repository on GitHub.  
Data from these sources are merged, checked as described in the [http://wiki.geneontology.org/index.php/Release_Pipeline#Annotation_QA.2FQC_checks Annotation QA/QC checks] section, and processed to publish exported files.  


== Release Cycle ==
===Data processing: annotation merging===
Merging of annotations integrates data from all sources, including PAINT annotations. After QC, the annotations that pass the filters are included in the published export files.


*For both the daily and monthly releases, the pipeline runs start at midnight (12am) PDT, and currently take about 14hrs (this will be decreased in the future); starting nightly for the daily snapshot release and the first of the month (or as close as can be obtained if there are failures) for the monthly public release. As a note to that, the `snapshot` run does not also run on the day of the monthly `release`. Data associated with each release can be accessed at the URLs below, with specific details about the contents of released files discussed where appropriate below.
===Original data location===
**http://current.geneontology.org (~monthly, containing the latest release set)
* Annotation files produced by GOC members are accessed via the URL or address provided by each group's [https://github.com/geneontology/go-site/tree/master/metadata/datasets datasets metadata file], in the ''Source'' field. The files must be stored on a publicly accessible site (such as an FTP site, an Amazon S3 bucket, an HTTP server, etc.) to be accessible by GO. Important note: the Source address must resolve to the latest annotation file produced by the submitting group, since that link is used directly in the script fetching the data.
**http://release.geneontology.org (~monthly, plus historical sets from the new pipeline)
 
**http://snapshot.geneontology.org (~daily)
**More information about the format of the datasets metadata file can be found in the [https://github.com/geneontology/go-site/blob/master/metadata/datasets/README.md datasets README.md file] under the Schema heading.


*All files generated as part of the monthly release will have a permanent, stable release identifier.
==Versioning==
*All files generated as part of the snapshot release will NOT have permanent, stable release identifiers.
* '''Official monthly releases''': versioned and archived so that analyses performed with these data can be reproduced at any point in the future.  Note that all files generated as part of the monthly release will have a permanent, stable release identifier.
* '''Daily snapshot releases''': intended for internal use by GOC members. Daily snapshots are not versioned and not archived, therefore not citable. Note that the ''daily snapshot release'' is not generated on the day of the ''official monthly release''. Note also that all files generated as part of the snapshot release will NOT have permanent, stable release identifiers.


=== Annotations ===
== Annotation QA/QC checks==
*A number of checks are run to ensure the integrity of the data. Checks include: data format, validity of identifiers, and a number of annotation rules. There are three types of checks:
** '''filter''': Violations of the rule lead to the filtering out of annotations not conforming.
** '''repair''': Violations of the rule lead to a replacement of the incorrect value by the correct value (for example, annotations to alternate identifiers are changed to the main id).
** '''report''': Violations of the rule are reported in the '''annotation reports'''.
*The checks are documented here: https://github.com/geneontology/go-site/blob/master/metadata/rules/README.md.
*Errors are reported in [http://wiki.geneontology.org/index.php/Release_Pipeline#Annotation_reports Annotation Reports].


==== Overview ====
== Data publishing and access ==
Annotation files are retrieved from each participating consortium member by GO Central, merged with additional annotation files, run through annotation QA/QC checks and then released as daily snapshot and monthly public releases.
Once the files are loaded, checked and merged, GAF, GPAD, GPI and other files are produced and published. Data produced by each release can be accessed at the URLs below:
*Current official monthly release: http://geneontology.org/page/download-go-annotations and http://current.geneontology.org
*Monthly releases, including archived releases: http://release.geneontology.org
*Daily snapshot release: http://snapshot.geneontology.org


==== How to Submit an Annotation File ====
=== Published Files and Reports ===
*Annotation files produced by GOC members should be made available to GO Central by placing the file on a publicly accessible site such as an FTP site, an Amazon S3 bucket, an HTTP server, etc.
*The URL or address from which the file can be obtained is stored in the Source field of each group's [https://github.com/geneontology/go-site/tree/master/metadata/datasets datasets metadata file] which is located on github in the geneontology/go-site repository.
**More information about the format of the datasets metadata file can be found in the [https://github.com/geneontology/go-site/blob/master/metadata/datasets/README.md datasets README.md file] under the Schema heading.
**For example Source URLs see:
***https://github.com/geneontology/go-site/blob/master/metadata/datasets/mgi.yaml
***https://github.com/geneontology/go-site/blob/master/metadata/datasets/wb.yaml
***https://github.com/geneontology/go-site/blob/master/metadata/datasets/xenbase.yaml
***https://github.com/geneontology/go-site/blob/master/metadata/datasets/zfin.yaml
**Important note: the Source address must resolve to the latest annotation file produced by the submitting group.
 
==== What Happens to Annotations During a Release Cycle ====
*For both the monthly public releases and the daily snapshot releases, the same set of QA/QC checks and '''annotation file merges (PAINT, GO-CAM, predictions, other external groups, e.g. UniProt)''' are performed.  Resulting annotation files and reports are then made accessible via the release URLs listed above.
*When GO Central retrieves an annotation file from a contributing group, the pipeline will run checks on the file and repair any auto-repairable issues (for example, migrating annotations to merged terms). It will then publish the processed GAFs, GPADs, GPIs, etc. to a public site, available for download.
*'''Need to link to the GO rules on github and/or articulate all of the checks that annotation files undergo after submission.'''
 
==== Annotation Files and Reports ====
*The annotation release pipeline generates ''products'', e.g. annotation and gpi files, and ''reports'' e.g. error reports, inferred annotations, etc., that provide feedback to GOC contributing groups (MODs, UniProt, etc).
*Below is more detailed information about what files are generated during a release, '''using the snapshot URLs as an example'''.
*Below is more detailed information about the files generated during a release, using the citable URLs from the 2018-07-02 release as an example. A typical directory, for example http://release.geneontology.org/2018-07-02/index.html, contains the following subdirectories:
**[http://wiki.geneontology.org/index.php/Release_Pipeline#Annotations_files annotations]
**[http://wiki.geneontology.org/index.php/Release_Pipeline#bin bin]
**[http://wiki.geneontology.org/index.php/Release_Pipeline#lib lib]
**[http://wiki.geneontology.org/index.php/Release_Pipeline#Metadata metadata]
**[http://wiki.geneontology.org/index.php/Release_Pipeline#Ontology_files ontology]
**[http://wiki.geneontology.org/index.php/Release_Pipeline#Products products]
**[http://wiki.geneontology.org/index.php/Release_Pipeline#Annotation_reports reports]


===== Annotation and GPI Files =====
==== Annotations files ====
*Annotation (gaf and gpad) and gpi files are available here:
*The annotations folder contains annotations (in GAF and GPAD formats), as well as GPI files. A typical URL for these files is: http://release.geneontology.org/2018-07-02/annotations/index.html
**http://snapshot.geneontology.org/annotations/index.html
*The file names indicate the file type (GAF, GPAD, GPI) in the extension, for example:
*Currently, annotation (gaf and gpad) and gpi files are sorted according to contributing group.
*'''In the future, annotation files will be sorted according to species.'''
*Note that the file names now conform to a new naming schema that indicates the file type in the extension:
**mgi.gaf.gz
**mgi.gpad.gz
**mgi.gpi.gz
==== bin ====
==== lib ====
==== Metadata ====
* Information about database cross references used in GO and in annotations, contributing groups, etc
==== Ontology files ====
* GO ontology in OBO, OWL and JSON formats, and in go and go-basic forms
* extensions -> ?
* external2go
* imports
* reports -> ?
* subsets
==== Products ====
*'''There are also annotation files under the products directory.  What are the intended use cases for each of these files?'''
**http://snapshot.geneontology.org/products/annotations/
Line 59: Line 86:
***[http://snapshot.geneontology.org/products/annotations/index.htmlwb_noiea.gaf.gz wb_noiea.gaf.gz]
***[http://snapshot.geneontology.org/products/annotations/index.html/wb_valid.gaf.gz wb_valid.gaf.gz]
*** species-specific PAINT files


===== Annotation Reports =====
==== Annotation reports====
*Annotation reports are available here:
*Annotation reports are available here: http://snapshot.geneontology.org/reports/
**http://snapshot.geneontology.org/reports/
*Each contributing group currently has a consolidated report rendered as HTML for easier viewing in the form: http://snapshot.geneontology.org/reports/dictybase-report.html
*Each contributing group currently has a consolidated report rendered as HTML for easier viewing in the form:
http://snapshot.geneontology.org/reports/dictybase-report.html


=== Ontology ===
== Submitting GO Data ==
* Groups wishing to contribute their annotations to GO should have their group added in the '''GO groups'''
https://github.com/geneontology/go-site/blob/master/metadata/groups.yaml
* The location of the annotations to be uploaded by GO should be


== Consuming and Displaying GO Data ==
=== GO Consortium Members ===
*While GOC members may consume snapshot release files for internal purposes, we strongly encourage members to only display data from the monthly releases.
* While GOC members may consume snapshot release files for internal purposes, we request that the data displayed on third-party websites be obtained from an Official Monthly Release.
*GOC members may filter GO annotations before display on their local sites.
* GO annotations should not be changed in any way from their original content, although filtering is allowed.
*GO annotations should not be changed in any way from their original content.
* PAINT annotations are available in the 'annotations' files, which also contain all annotations for each given group.
* PAINT annotations can also be downloaded separately, from the 'products/annotations' directory, where PAINT annotations from each group are available.
 
=== Groups using GO for research and analysis purposes ===
* Groups should use the Gene Ontology and Annotations from the Official Monthly Release, and cite the version (date) of the files used.


== GO Consortium Dataflow ==


https://github.com/geneontology/go-site/blob/master/docs/go-consortium-dataflow.png
[[File:GOconsortium-dataflow.png|1000px|GO consortium dataflow]]
 
Original: https://github.com/geneontology/go-site/blob/master/docs/go-consortium-dataflow.png





Revision as of 03:33, 6 August 2018

August 2018: This documentation is currently a work in progress.

Overview

The information below is intended for GOC members who provide externally curated annotations and submit them to GO. It also describes how the GOC processes submitted annotations to produce the data users can see in AmiGO and download from our site.

Release schedule and content

The GO Consortium (GOC) now publicly releases data on a monthly basis. The pipeline runs daily, starting at midnight (12am) PDT; the run on the 1st of each month (or as close to it as possible if there are failures) produces the official monthly release. Releases contain:

  • Annotation files
  • Ontology files
  • Annotation reports
  • Predictions

Data sources

  • Primary annotations by GOC contributing groups (ADD MORE INFO)
  • PAINT annotations (ADD MORE INFO)
  • Predictions (ADD MORE INFO)
  • GO-CAM models (ADD MORE INFO)

Data from these sources are merged, checked as described in the Annotation QA/QC checks section, and processed to publish exported files.

Data processing: annotation merging

Merging of annotations integrates data from all sources, including PAINT annotations. After QC, the annotations that pass the filters are included in the published export files.
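
As a rough illustration of the merging step, the sketch below combines annotation rows from several sources and records where each row came from. It is not the pipeline's actual code: the function name, the tuple layout, and the choice to drop exact duplicates are all assumptions made for the example.

```python
def merge_annotations(sources):
    """Merge annotation rows from several sources (hypothetical helper).

    `sources` maps a source label (e.g. "mgi", "paint") to a list of
    (gene_id, go_id, evidence) tuples. Exact duplicate rows are kept
    only once; dropping duplicates is an assumption of this sketch.
    """
    merged = []
    seen = set()
    # Iterate over labels in sorted order so the result is deterministic.
    for label in sorted(sources):
        for row in sources[label]:
            if row not in seen:
                seen.add(row)
                # Append the source label so provenance is not lost.
                merged.append(row + (label,))
    return merged
```

In this sketch the merged rows would then be handed to the QA/QC checks before anything is published.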

Original data location

  • Annotation files produced by GOC members are accessed via the URL or address provided by each group's datasets metadata file, in the Source field. The files must be stored on a publicly accessible site (such as an FTP site, an Amazon S3 bucket, an HTTP server, etc.) to be accessible by GO. Important note: the Source address must resolve to the latest annotation file produced by the submitting group, since that link is used directly in the script fetching the data.
    • More information about the format of the datasets metadata file can be found in the datasets README.md file under the Schema heading.

Versioning

  • Official monthly releases: versioned and archived so that analyses performed with these data can be reproduced at any point in the future. Note that all files generated as part of the monthly release will have a permanent, stable release identifier.
  • Daily snapshot releases: intended for internal use by GOC members. Daily snapshots are not versioned and not archived, therefore not citable. Note that the daily snapshot release is not generated on the day of the official monthly release. Note also that all files generated as part of the snapshot release will NOT have permanent, stable release identifiers.
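
The practical difference between the two release types shows up in the URLs a consumer would use. The sketch below is a minimal illustration: the two base URLs and the dated directory (for example 2018-07-02) appear on this page, but the helper name and the exact path to an individual file under annotations/ are assumptions.

```python
from datetime import date
from typing import Optional

RELEASE_BASE = "http://release.geneontology.org"
SNAPSHOT_BASE = "http://snapshot.geneontology.org"


def annotation_url(filename: str, release_date: Optional[date] = None) -> str:
    """Build a URL for an annotation file (hypothetical helper).

    With a release date, the URL points into the archived, citable
    monthly release. Without one, it points at the daily snapshot,
    which has no stable identifier and should not be cited.
    """
    if release_date is None:
        return f"{SNAPSHOT_BASE}/annotations/{filename}"
    return f"{RELEASE_BASE}/{release_date.isoformat()}/annotations/{filename}"
```

For example, annotation_url("mgi.gaf.gz", date(2018, 7, 2)) yields a dated URL that can be recorded alongside an analysis, whereas omitting the date yields a snapshot URL whose content changes daily.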

Annotation QA/QC checks

  • A number of checks are run to ensure the integrity of the data. Checks include: data format, validity of identifiers, and a number of annotation rules. There are three types of checks:
    • filter: Violations of the rule lead to the filtering out of annotations not conforming.
    • repair: Violations of the rule lead to a replacement of the incorrect value by the correct value (for example, annotations to alternate identifiers are changed to the main id).
    • report: Violations of the rule are reported in the annotation reports.
  • The checks are documented here: https://github.com/geneontology/go-site/blob/master/metadata/rules/README.md.
  • Errors are reported in Annotation Reports.
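
The three check types can be sketched with one toy rule of each kind. This is only an illustration, not the GO rules themselves: the Annotation class, the identifier mapping, the list of accepted evidence codes, and the report messages are all hypothetical, and the real rules are the ones documented in the rules README linked above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Annotation:
    gene_id: str
    go_id: str
    evidence: str


# Illustrative data only; not taken from the GO rules or the ontology.
ALT_ID_TO_MAIN = {"GO:ALT1": "GO:MAIN1"}
KNOWN_EVIDENCE = {"IDA", "IEA", "IBA"}


def run_checks(annotations):
    """Apply one toy rule of each type and return (kept, report)."""
    kept, report = [], []
    for ann in annotations:
        # filter: annotations that do not conform are dropped.
        if ann.evidence not in KNOWN_EVIDENCE:
            report.append(f"filter: {ann.gene_id} dropped, unknown evidence {ann.evidence!r}")
            continue
        # repair: an alternate identifier is replaced by the main id.
        if ann.go_id in ALT_ID_TO_MAIN:
            main_id = ALT_ID_TO_MAIN[ann.go_id]
            report.append(f"repair: {ann.gene_id} {ann.go_id} -> {main_id}")
            ann = replace(ann, go_id=main_id)
        # report: the annotation is kept, but the issue is written to the report.
        if ann.evidence == "IEA":
            report.append(f"report: {ann.gene_id} has an electronically inferred annotation")
        kept.append(ann)
    return kept, report
```

Here the report list stands in for the per-group annotation reports; logging filter and repair actions there as well is an assumption of the sketch.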

Data publishing and access

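Once the files are loaded, checked and merged, GAF, GPAD, GPI and other files are produced and published. Data produced by each release can be accessed at the URLs below:

  • Current official monthly release: http://geneontology.org/page/download-go-annotations and http://current.geneontology.org
  • Monthly releases, including archived releases: http://release.geneontology.org
  • Daily snapshot release: http://snapshot.geneontology.org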

Published Files and Reports

  • The annotation release pipeline generates products, e.g. annotation and gpi files, and reports e.g. error reports, inferred annotations, etc., that provide feedback to GOC contributing groups (MODs, UniProt, etc).
  • Below is more detailed information about the files generated during a release, using the citable URLs from the 2018-07-02 release as an example. A typical directory, for example http://release.geneontology.org/2018-07-02/index.html, contains the following subdirectories:
    • annotations
    • bin
    • lib
    • metadata
    • ontology
    • products
    • reports

Annotations files

bin

lib

Metadata

  • Information about database cross references used in GO and in annotations, contributing groups, etc

Ontology files

  • GO ontology in OBO, OWL and JSON formats, and in go and go-basic forms
  • extensions -> ?
  • external2go
  • imports
  • reports -> ?
  • subsets

Products

Annotation reports

Submitting GO Data

  • Groups wishing to contribute their annotations to GO should have their group added to the GO groups file: https://github.com/geneontology/go-site/blob/master/metadata/groups.yaml

  • The location of the annotations to be uploaded by GO should be

Consuming and Displaying GO Data

GO Consortium Members

  • While GOC members may consume snapshot release files for internal purposes, we request that the data displayed on third-party websites be obtained from an Official Monthly Release.
  • GO annotations should not be changed in any way from their original content, although filtering is allowed.
  • PAINT annotations are available in the 'annotations' files, which also contain all annotations for each given group.
  • PAINT annotations can also be downloaded separately, from the 'products/annotations' directory, where PAINT annotations from each group are available.

Groups using GO for research and analysis purposes

  • Groups should use the Gene Ontology and Annotations from the Official Monthly Release, and cite the version (date) of the files used.

GO Consortium Dataflow

GO consortium dataflow

Original: https://github.com/geneontology/go-site/blob/master/docs/go-consortium-dataflow.png