Release Pipeline

From GO Wiki

Revision as of 19:57, 7 August 2018

August 2018: This documentation is currently a work in progress.

Overview

The information below is intended for GOC members who provide externally curated annotations and submit them to GO. It also describes how GOC processes upstream annotations to produce the data users can see in AmiGO and download from our sites.

Release schedule and content

The GO Consortium (GOC) now releases data publicly on a monthly basis. The pipeline runs daily, starting at midnight (12am) PDT. The run on the 1st of each month (or as close to it as possible if there are failures) produces the official monthly release. Releases contain:

Data sources

  • Primary annotations by GOC contributing groups (ADD MORE INFO)
  • PAINT annotations (ADD MORE INFO)
  • Predictions (ADD MORE INFO)
  • GO-CAM models (ADD MORE INFO)

Data from these sources are merged, checked as described in the Annotation QA/QC checks section, and processed to produce the published export files.

Data processing: annotation merging

Merging integrates annotations from all sources, including PAINT. After QC, the annotations that pass the filters are included in the published export files.

Original data location

  • Annotation files produced by GOC members are accessed via the URL or address given in the Source field of each group's datasets metadata file. The files must be stored on a publicly accessible site (such as an FTP site, an Amazon S3 bucket, or an HTTP server) so that GO can retrieve them. Important note: the Source address must resolve to the latest annotation file produced by the submitting group, since that link is used directly by the script that fetches the data.
    • More information about the format of the datasets metadata file can be found in the datasets README.md file under the Schema heading.
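
The requirement on the Source address can be checked mechanically before a file is submitted. The following is a minimal sketch, not the pipeline's actual fetching script: the `source` key name, the example entry and URL, and the set of accepted schemes are all assumptions made for illustration. The real field layout is defined in the datasets README.md.

```python
from urllib.parse import urlparse

# Schemes treated as "publicly accessible" here; this set is an assumption.
PUBLIC_SCHEMES = {"http", "https", "ftp", "s3"}


def check_source(dataset: dict) -> str:
    """Return the dataset's source address if it looks publicly fetchable."""
    source = dataset.get("source")  # hypothetical key name
    if not source:
        raise ValueError("dataset entry has no source address")
    parsed = urlparse(source)
    # Require a known scheme and a host (or bucket) part.
    if parsed.scheme not in PUBLIC_SCHEMES or not parsed.netloc:
        raise ValueError(f"source is not a public URL: {source!r}")
    return source


# Hypothetical dataset entry, for illustration only.
example = {"id": "example.gaf", "source": "https://example.org/latest/example.gaf.gz"}
print(check_source(example))
```

A check like this only validates the form of the address. It does not confirm that the link resolves to the latest annotation file, which remains the submitting group's responsibility.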

Versioning

  • Official monthly releases: versioned and archived so that analyses performed with these data can be reproduced at any point in the future. Note that all files generated as part of the monthly release will have a permanent, stable release identifier.
  • Daily snapshot releases: intended for internal use by GOC members. Daily snapshots are neither versioned nor archived, and are therefore not citable. Note that no daily snapshot is generated on the day of the official monthly release. Note also that files generated as part of a snapshot release will NOT have permanent, stable release identifiers.

Annotation QA/QC checks

  • A number of checks are run to ensure the integrity of the data. They cover data format, validity of identifiers, and a number of annotation rules. There are three types of checks:
    • filter: Annotations that violate the rule are filtered out.
    • repair: Violations of the rule lead to a replacement of the incorrect value by the correct value (for example, annotations to alternate identifiers are changed to the main id).
    • report: Violations of the rule are reported in the annotation reports.
  • The checks are documented here: https://github.com/geneontology/go-site/blob/master/metadata/rules/README.md.
  • Errors are reported in Annotation Reports.
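
The three check types above can be sketched as follows. This is an illustrative sketch only, not the GO rules engine: the `Annotation` fields, the made-up identifiers in `ALT_TO_MAIN`, the short evidence-code list, and the three example rules are all assumptions chosen to show one check of each type.

```python
from dataclasses import dataclass, replace

# Made-up identifiers and a shortened evidence list, for illustration only.
ALT_TO_MAIN = {"GO:9999991": "GO:9999990"}
KNOWN_EVIDENCE = {"IDA", "IMP", "IEA"}


@dataclass(frozen=True)
class Annotation:
    gene: str
    go_id: str
    evidence: str
    reference: str


def run_checks(annotations):
    """Apply one filter, one repair, and one report check to each annotation."""
    kept, report = [], []
    for ann in annotations:
        # filter: drop annotations with an unknown evidence code
        if ann.evidence not in KNOWN_EVIDENCE:
            report.append(f"filter: {ann.gene} dropped, bad evidence {ann.evidence!r}")
            continue
        # repair: replace an alternate identifier with the main id
        if ann.go_id in ALT_TO_MAIN:
            main_id = ALT_TO_MAIN[ann.go_id]
            report.append(f"repair: {ann.gene} {ann.go_id} -> {main_id}")
            ann = replace(ann, go_id=main_id)
        # report: keep the annotation but flag a missing reference
        if not ann.reference:
            report.append(f"report: {ann.gene} has no reference")
        kept.append(ann)
    return kept, report


kept, report = run_checks([
    Annotation("geneA", "GO:9999991", "IDA", "PMID:1"),
    Annotation("geneB", "GO:9999990", "XXX", "PMID:2"),
    Annotation("geneC", "GO:9999990", "IEA", ""),
])
for line in report:
    print(line)
```

In this sketch only the filter check removes an annotation. Repaired and reported annotations stay in the output, and every violation is written to the report list.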

Data publishing and access

Once the files are loaded, checked and merged, the GAF, GPAD, GPI and other files are produced and published. Data produced by each release can be accessed at the URLs below:

Published Files and Reports

  • The annotation release pipeline generates products (e.g. annotation and GPI files) and reports (e.g. error reports and inferred annotations) that provide feedback to GOC contributing groups (MODs, UniProt, etc.).
  • Below is more detailed information about the files generated during a release, using the citable URLs from the 2018-07-02 release as an example. A typical directory, for example http://release.geneontology.org/2018-07-02/index.html, contains the following subdirectories:

Annotations files

bin

lib

Metadata

  • Information about database cross references used in GO and in annotations, contributing groups, etc.

Ontology files

  • GO ontology in OBO, OWL and JSON formats, and in go and go-basic forms
  • extensions -> ?
  • external2go
  • imports
  • reports -> ?
  • subsets

Products

Annotation reports

Submitting GO Data

  • Groups wishing to contribute their annotations to GO should have their group added to the GO groups file:

https://github.com/geneontology/go-site/blob/master/metadata/groups.yaml

  • The location of the annotations to be uploaded by GO should be

Consuming and Displaying GO Data

GO Consortium Members

  • While GOC members may consume snapshot release files for internal purposes, we request that the data displayed on third-party websites be obtained from an Official Monthly Release.
  • GO annotations should not be changed in any way from their original content, although filtering is allowed.
  • PAINT annotations are available in the 'annotations' files, which also contain all other annotations for each group.
  • PAINT annotations can also be downloaded separately, from the 'products/annotations' directory, where PAINT annotations from each group are available.

Groups using GO for research and analysis purposes

  • Groups should use the Gene Ontology and Annotations from the Official Monthly Release, and cite the version (date) of the files used.
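
To make the cited version unambiguous, the release date can be turned into the release's index URL. The sketch below follows the pattern of the 2018-07-02 example URL given earlier. That other releases use the same layout is an assumption, and not every date corresponds to an actual release.

```python
from datetime import date

RELEASE_ROOT = "http://release.geneontology.org"


def release_index_url(release_date: date) -> str:
    """Build the index URL of a dated release, following the 2018-07-02 example."""
    return f"{RELEASE_ROOT}/{release_date.isoformat()}/index.html"


print(release_index_url(date(2018, 7, 2)))
# http://release.geneontology.org/2018-07-02/index.html
```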

GO Consortium Dataflow


Original: https://github.com/geneontology/go-site/blob/master/docs/go-consortium-dataflow.png