Release Pipeline

September 2018: This documentation is currently a work in progress.

Overview

The information below is intended for GOC members who are providers of annotations. It describes how GOC processes annotations, which can be viewed at locations like AmiGO, downloaded from our sites, and queried via the SPARQL endpoint.

Annotations integrated in the GOC pipeline

These annotations are ingested daily.

Annotation sources

  • Groups wishing to contribute their annotations to GO should have their group added into the GO groups metadata:
    https://github.com/geneontology/go-site/blob/master/metadata/groups.yaml
    • Annotation files produced by GOC members are accessed via the URL or address provided by each group's datasets metadata file, in the source field. The files must be made publicly available via HTTP or FTP to be pulled in by GO. Important note: the source URL must resolve to the latest annotation file produced by the submitting group, since that link is used directly when fetching the data.
    • More information about the format of the datasets metadata file can be found in the metadata schema.yaml file.
    • Currently, all data is ingested in GAF format.
  • The UniProt-all file is processed differently.
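
To make the GAF input more concrete, the sketch below splits one tab-separated GAF line into named fields. The column names follow the GAF 2.x column order. The function name parse_gaf_line and the sample line are invented for illustration; this is not the parser used by the pipeline itself.

```python
from typing import Dict, Optional

# Column names in GAF 2.x order (17 tab-separated columns).
GAF_COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "db_reference", "evidence_code", "with_from", "aspect",
    "db_object_name", "db_object_synonym", "db_object_type",
    "taxon", "date", "assigned_by", "annotation_extension",
    "gene_product_form_id",
]


def parse_gaf_line(line: str) -> Optional[Dict[str, str]]:
    """Return a dict of GAF fields, or None for header/comment or blank lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("!"):
        return None
    fields = line.split("\t")
    if len(fields) < 15:
        raise ValueError(f"expected at least 15 columns, got {len(fields)}")
    # Pad the trailing columns (16 and 17) so every key is present.
    fields += [""] * (len(GAF_COLUMNS) - len(fields))
    return dict(zip(GAF_COLUMNS, fields))


# Hypothetical example line, for illustration only (not a real annotation).
example = "\t".join([
    "MGI", "MGI:0000000", "Abc1", "", "GO:0008150", "PMID:0000000",
    "IDA", "", "P", "example gene", "", "protein", "taxon:10090",
    "20180702", "MGI", "", "",
])
record = parse_gaf_line(example)
print(record["go_id"], record["evidence_code"])
```

Lines starting with "!" are treated as header or comment lines and skipped, which is how GAF files carry their metadata.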

Data processing

Annotation QC checks

  • As files are read, some lines can be modified or filtered, as described in the GO Rules Documentation.
  • A number of checks are run to ensure the integrity of the data (either at the parsing step or later in the pipeline). Checks include: data format, validity of identifiers, and a number of annotation rules. There are three types of checks:
    • filter: Annotations that violate the rule are filtered out.
    • repair: Violations of the rule lead to a replacement of the incorrect value by the correct value (for example, annotations to alternate identifiers are changed to the main identifier).
    • report: Violations of the rule are reported but no action is taken by the script.
  • Annotation lines that fail a check are reported in the Snapshot Annotation Reports.
  • Each contributing group currently has a consolidated report rendered as HTML for easier viewing (group-report.html, for example for dictyBase: http://snapshot.geneontology.org/reports/dictybase-report.html)
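
The three check types above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline's actual rule engine: the rule functions, the ALT_TO_MAIN lookup table, and the sample annotations are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

Annotation = Dict[str, str]
RuleResult = Tuple[Optional[Annotation], Optional[str]]

# Hypothetical mapping from alternate GO identifiers to the main identifier.
ALT_TO_MAIN = {"GO:9999999": "GO:0000001"}


def rule_filter_bad_go_id(ann: Annotation) -> RuleResult:
    """filter: drop annotations whose GO ID is not of the form GO:nnnnnnn."""
    go_id = ann["go_id"]
    ok = go_id.startswith("GO:") and len(go_id) == 10 and go_id[3:].isdigit()
    if ok:
        return ann, None
    return None, f"filtered: malformed GO ID {go_id!r}"


def rule_repair_alt_id(ann: Annotation) -> RuleResult:
    """repair: replace an alternate identifier with the main identifier."""
    main = ALT_TO_MAIN.get(ann["go_id"])
    if main is None:
        return ann, None
    # Build a new dict so the input annotation is left untouched.
    return dict(ann, go_id=main), f"repaired: {ann['go_id']} -> {main}"


def rule_report_missing_reference(ann: Annotation) -> RuleResult:
    """report: note the problem but keep the annotation unchanged."""
    if not ann.get("db_reference"):
        return ann, "reported: missing reference"
    return ann, None


RULES: List[Callable[[Annotation], RuleResult]] = [
    rule_filter_bad_go_id,
    rule_repair_alt_id,
    rule_report_missing_reference,
]


def run_checks(annotations: List[Annotation]) -> Tuple[List[Annotation], List[str]]:
    """Apply every rule in order; return kept annotations and report messages."""
    kept: List[Annotation] = []
    report: List[str] = []
    for ann in annotations:
        current: Optional[Annotation] = ann
        for rule in RULES:
            current, message = rule(current)
            if message:
                report.append(message)
            if current is None:  # filtered out, stop checking this line
                break
        if current is not None:
            kept.append(current)
    return kept, report


annotations = [
    {"go_id": "GO:0000001", "db_reference": "PMID:0000000"},
    {"go_id": "GO:9999999", "db_reference": "PMID:0000000"},
    {"go_id": "GO-123", "db_reference": "PMID:0000000"},
    {"go_id": "GO:0000002", "db_reference": ""},
]
kept, report = run_checks(annotations)
for message in report:
    print(message)
```

In this sketch every message ends up in a single report list, loosely mirroring the idea that all violations are reported regardless of whether the line was filtered, repaired, or left as is.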

Annotation merging and file generation

  • The annotation release pipeline generates many different products, including primary products such as annotation and GPI files, reports (such as error reports), and inferred annotations (predictions) that provide feedback to GOC contributing groups such as MODs and UniProt. Once the upstream files have been loaded, checked, and merged, GAFs, GPADs, GPIs, TTLs, reports, and prediction files are produced.

Types of releases

  • Official monthly releases: versioned and archived so that analyses performed with these data can be reproduced at any point in the future. Note that all files generated as part of the monthly release have a permanent, stable release identifier.
  • Daily snapshot releases: intended for internal use by GOC members. Daily snapshots are neither versioned nor archived, and are therefore not citable. Note that the daily snapshot release is not generated on the day of the official monthly release. Note also that files generated as part of the snapshot release will NOT have permanent, stable release identifiers.

Data publishing and access

Data produced by each release can be accessed at the URLs below:

Release Content

The release content is the same for the official monthly and snapshot releases. Content is generally organized as:

  • Parent (a link to the parent directory)
  • Directories (a list of all directories or subdirectories within each specific location)
  • Files (a list of all files within each specific location)

The top-level directories are listed below, with information about the content found in each.

annotations

  • The annotations directory contains annotation files (in GAF and GPAD formats), as well as GPI files, organized alphabetically by contributing group. A typical URL for these files is: http://release.geneontology.org/2018-07-02/annotations

  • The file name extension indicates the file type (GAF, GPAD, GPI), for example:
    • mgi.gaf.gz
    • mgi.gpad.gz
    • mgi.gpi.gz
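
As a small illustration of how these file names combine with the release locations, the sketch below builds a download URL for a group's file. The helper annotation_file_url is hypothetical. It assumes that files sit directly under the annotations directory and that the snapshot site uses the same layout as the dated release site.

```python
from typing import Optional

RELEASE_BASE = "http://release.geneontology.org"
CURRENT_BASE = "http://current.geneontology.org"
SNAPSHOT_BASE = "http://snapshot.geneontology.org"

FILE_TYPES = ("gaf", "gpad", "gpi")


def annotation_file_url(group: str, file_type: str,
                        release_date: Optional[str] = None,
                        snapshot: bool = False) -> str:
    """Build the assumed URL of a group's annotation file.

    release_date is a dated monthly release such as "2018-07-02";
    with no date and snapshot=False, the "current" release is used.
    """
    if file_type not in FILE_TYPES:
        raise ValueError(f"unknown file type: {file_type!r}")
    if snapshot:
        base = SNAPSHOT_BASE
    elif release_date:
        base = f"{RELEASE_BASE}/{release_date}"
    else:
        base = CURRENT_BASE
    return f"{base}/annotations/{group}.{file_type}.gz"


print(annotation_file_url("mgi", "gaf", release_date="2018-07-02"))
print(annotation_file_url("mgi", "gpi"))
```

Using a dated release URL rather than the current or snapshot location keeps an analysis reproducible, since only the monthly releases are versioned and archived.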

bin and lib

These are the binary files and libraries used by the GO pipeline to build the release.

Metadata

  • Information about database cross references used in GO and in annotations, contributing groups, etc.

Ontology Directory

  • The ontology directory contains ontology files and links to additional ontology-related directories.
  • The following versions of the ontology files are available for download:
    • go-base.owl
    • go-basic.json
    • go-basic.json.gz
    • go-basic.obo
    • go-basic.owl
    • go.json
    • go.obo
    • go.owl
  • Additional directories that contain ontology-related information are:
    • external2go
    • extensions
    • imports
    • reports
    • subsets

Products

  • TODO: There are also annotation files under the products directory. What are the intended use cases for each of these files?

Consuming and Displaying GO Data

GO Consortium Members

  • To get the most up-to-date data, groups can download data from the 'daily snapshots', and present that data on their web pages. However, for distributing annotation files, data from an Official Monthly Release must be used. Ideally each MOD should direct their users to their group GAF in the /current/annotations directory: http://current.geneontology.org/annotations.
  • GO annotations should not be changed in any way from their original content, although filtering is allowed.
  • PAINT annotations are available in the 'annotations' files, which also contain all other annotations for each group.
  • PAINT annotations can also be downloaded separately from the 'products/annotations' directory, where PAINT annotations from each group are currently available.
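
The rule that annotations may be filtered but not altered can be illustrated with a sketch that selects GAF lines by evidence code and passes every kept line through byte for byte. The function name filter_gaf_lines and the sample lines are hypothetical; the only format fact relied on is that the evidence code is the seventh tab-separated column of a GAF line.

```python
from typing import Iterable, Iterator, Set


def filter_gaf_lines(lines: Iterable[str], keep_evidence: Set[str]) -> Iterator[str]:
    """Yield header lines and annotation lines with a wanted evidence code.

    Kept lines are yielded exactly as read; nothing is rewritten.
    """
    for line in lines:
        if line.startswith("!"):
            yield line  # keep header/comment lines as they are
            continue
        fields = line.rstrip("\n").split("\t")
        # Column 7 (index 6) holds the evidence code.
        if len(fields) > 6 and fields[6] in keep_evidence:
            yield line


def make_line(evidence: str) -> str:
    """Build a hypothetical GAF line for the demonstration below."""
    return "\t".join([
        "MGI", "MGI:0000000", "Abc1", "", "GO:0008150", "PMID:0000000",
        evidence, "", "P", "example gene", "", "protein", "taxon:10090",
        "20180702", "MGI", "", "",
    ]) + "\n"


lines = ["!gaf-version: 2.1\n", make_line("IDA"), make_line("IEA")]
filtered = list(filter_gaf_lines(lines, {"IDA"}))
print(len(filtered))
```

Because the function only selects lines and never edits them, the content of each remaining annotation is identical to what GO distributed.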

Groups using GO for research and analysis purposes

  • For citation purposes, groups should use the ontology and annotations from the official monthly release and cite the version (date) of the files used.
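
One way to record the version being cited is to read it from the ontology file itself. The sketch below assumes that the header of go.obo carries a data-version line of the form shown in the sample; the function name obo_data_version and the sample header are illustrative only.

```python
from typing import Iterable, Optional


def obo_data_version(lines: Iterable[str]) -> Optional[str]:
    """Return the value of the data-version header tag, or None if absent."""
    for line in lines:
        line = line.strip()
        if line.startswith("["):
            break  # first stanza reached, so the header is over
        if line.startswith("data-version:"):
            return line.split(":", 1)[1].strip()
    return None


# Sample header lines, assumed to resemble the start of a go.obo file.
sample = [
    "format-version: 1.2",
    "data-version: releases/2018-07-02",
    "",
    "[Term]",
    "id: GO:0000001",
]
print(obo_data_version(sample))
```

For a downloaded file, the same function can be applied to an open file handle, for example with open("go.obo") as handle: obo_data_version(handle).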

GO Consortium Dataflow

[Figure: GO consortium dataflow]

Original: https://github.com/geneontology/go-site/blob/master/docs/go-consortium-dataflow.png


Review Status

Last reviewed: August 28, 2018