Release Pipeline: Difference between revisions

Revision as of 19:10, 17 December 2018

October 2018: This documentation is currently a work in progress.

Overview

The information below is intended for GOC members who provide annotations. It describes how the GOC processes annotations, which can then be viewed in tools such as AmiGO, downloaded from our sites, and queried via the SPARQL endpoint.

Annotations integrated in the GOC pipeline

These annotations are ingested daily.

Annotation sources

  • Groups wishing to contribute their annotations to GO should have their group added into the GO groups metadata:
    https://github.com/geneontology/go-site/blob/master/metadata/groups.yaml
    • Annotation files produced by GOC members are accessed via the URL or address provided by each group's datasets metadata file, in the source field.
    • The files must be made publicly available via HTTP or FTP to be pulled in by GO. Important note: the source URL must resolve to the latest annotation file produced by the submitting group, since that link is used directly when fetching the data.
  • More information about the format of the datasets metadata file can be found in the metadata schema.yaml file.
  • Currently all data is ingested in GAF format.
    • In the future, the GOC will switch to using GPAD/GPI for all internal data exchange.
  Note: the UniProt-all file is processed differently.
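
The lookup of each group's files described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code. It assumes the datasets metadata has already been parsed (for example with a YAML library) into a dictionary holding a `datasets` list whose entries carry `id`, `type`, and `source` keys. Only the source field is named in the text above; the other keys, the example values, and the function name `gaf_sources` are assumptions made for this sketch.

```python
from typing import Dict, List


def gaf_sources(group_metadata: Dict) -> List[str]:
    """Return the source URLs of all GAF datasets declared by one group.

    The dictionary layout is an assumption made for this sketch.
    """
    urls = []
    for dataset in group_metadata.get("datasets", []):
        # Only GAF files are ingested at present, so skip other types.
        if dataset.get("type") != "gaf":
            continue
        source = dataset.get("source")
        # The source must be publicly reachable over HTTP or FTP.
        if source and source.startswith(("http://", "https://", "ftp://")):
            urls.append(source)
    return urls


# Hypothetical metadata for an imaginary group called "example".
example_group = {
    "id": "example",
    "datasets": [
        {"id": "example.gaf", "type": "gaf",
         "source": "http://example.org/downloads/example.gaf.gz"},
        {"id": "example.gpi", "type": "gpi",
         "source": "http://example.org/downloads/example.gpi.gz"},
        # A local path is not publicly reachable, so it is skipped.
        {"id": "local.gaf", "type": "gaf", "source": "/tmp/local.gaf"},
    ],
}

print(gaf_sources(example_group))
```

Because the source link is used directly when fetching, a group that publishes dated file names would need to keep a stable URL pointing at its latest file.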

Data processing

Annotation QC checks

  • As files are read, some lines may be modified or filtered as described in the GO Rules Documentation (https://github.com/geneontology/go-site/blob/master/metadata/rules/README.md).
  • Checks are run to ensure the integrity of the data, either at the parsing step or later in the pipeline. They cover data format, validity of identifiers, and a number of annotation rules. There are three types of checks:
    • filter: Annotations that violate the rule are filtered out.
    • repair: Violations of the rule lead to the replacement of an incorrect entry by the correct entry (for example, annotations to GO term alternate identifiers are changed to the GO term main identifier).
    • report: Violations of the rule are reported but no action is taken by the QC/QA pipeline.
  • Annotation lines that failed a check are collated in each group's report.html page, available from the Snapshot Annotation Reports (http://snapshot.geneontology.org/reports/).
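
The three check types can be illustrated with a small sketch. It is not the GO pipeline's implementation: the `Annotation` class, the three example rules, the identifier mapping, and the GO identifiers (which are made up) are all assumptions chosen only to show how filter, repair, and report differ in their effect on an annotation.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class Annotation:
    gene_id: str
    go_id: str
    evidence: str


# Hypothetical mapping from an alternate GO identifier to its main identifier.
ALT_TO_MAIN = {"GO:9999991": "GO:9999990"}

ReportEntry = Tuple[str, Annotation, str]


def run_checks(annotations: List[Annotation]) -> Tuple[List[Annotation], List[ReportEntry]]:
    """Apply one example rule of each type and collect a report."""
    kept: List[Annotation] = []
    report: List[ReportEntry] = []
    for ann in annotations:
        # filter: a violating annotation is dropped from the output.
        if not ann.go_id.startswith("GO:"):
            report.append(("filter", ann, "malformed GO identifier"))
            continue
        # repair: the incorrect entry is replaced by the correct one.
        if ann.go_id in ALT_TO_MAIN:
            report.append(("repair", ann, "alternate GO identifier replaced"))
            ann = replace(ann, go_id=ALT_TO_MAIN[ann.go_id])
        # report: the violation is recorded but the annotation is kept as is.
        # (Flagging NAS evidence is a hypothetical rule for this sketch.)
        if ann.evidence == "NAS":
            report.append(("report", ann, "evidence code flagged for review"))
        kept.append(ann)
    return kept, report


sample = [
    Annotation("EX:1", "GO:9999991", "IDA"),  # will be repaired
    Annotation("EX:2", "BAD", "IDA"),         # will be filtered out
    Annotation("EX:3", "GO:9999990", "NAS"),  # will only be reported
]
kept, report = run_checks(sample)
for kind, ann, message in report:
    print(kind, ann.gene_id, message)
```

In every case the violation ends up in the report. Only the effect on the output differs.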

Annotation merging and file generation

  • The annotation release pipeline generates many different products, including primary products such as annotation and GPI files, reports (such as error reports), and inferred annotations (predictions) that provide feedback to GOC contributing groups such as MODs and UniProt. Once the upstream files have been loaded, checked, and merged, the pipeline produces GAFs, GPADs, GPIs, TTLs, reports, and prediction files.

Types of releases

  • Official monthly releases: versioned and archived so that analyses performed with these data can be reproduced at any point in the future. Note that all files generated as part of the monthly release have a permanent, stable release identifier.
  • Daily snapshot releases: intended for internal use by GOC members. Daily snapshots are neither versioned nor archived, and are therefore not citable. Note that no daily snapshot is generated on the day of the official monthly release. Note also that files generated as part of a snapshot release do NOT have permanent, stable release identifiers.

Data publishing and access

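Data produced by each release can be accessed at the URLs below:

  • Current official monthly release: http://current.geneontology.org
  • Monthly releases to date: http://release.geneontology.org
  • Daily snapshot: http://snapshot.geneontology.org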

Release Content

  • The release content may be accessed from the specific URLs listed above.
  • Each page of content is generally organized as:
    • Parent (a link to the parent directory)
    • Directories (a list of all directories or subdirectories within each specific location)
    • Files (a list of all files within each specific location)
  • The main list of directories, with information about the content found in each, follows below.

annotations

  • The annotations directory contains only annotation files, organized alphabetically by contributing group.
  • The annotation files available here are those produced after the QC/QA rules have been applied, and they include annotations from GOC annotation tools, i.e. PAINT.
  • Each group has three files, compressed using the gzip utility:
    • gaf
    • gpad
    • gpi
      • Note that the GPI file here corresponds to the GPAD annotation file, not the original GPI file produced by the contributing group.
  • An example of the annotation files found in the annotations directory:
 mgi.gaf.gz
 mgi.gpad.gz
 mgi.gpi.gz
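
The naming pattern above can be turned into download URLs with a short helper. This is a sketch under stated assumptions: the function name `annotation_urls` is hypothetical, and it assumes that every group follows the `<group>.<format>.gz` pattern shown for mgi and that the current and snapshot sites share the same directory layout.

```python
from typing import Dict

CURRENT_BASE = "http://current.geneontology.org"
SNAPSHOT_BASE = "http://snapshot.geneontology.org"

# The three gzip-compressed files produced for each group.
FORMATS = ("gaf", "gpad", "gpi")


def annotation_urls(group: str, base: str = CURRENT_BASE) -> Dict[str, str]:
    """Build the URL of each annotation file for one contributing group."""
    return {fmt: f"{base}/annotations/{group}.{fmt}.gz" for fmt in FORMATS}


for fmt, url in annotation_urls("mgi").items():
    print(fmt, url)
```

Passing `SNAPSHOT_BASE` as the second argument yields the corresponding daily snapshot URLs instead.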

bin

  • The bin directory contains the binary files used by the GO pipeline to build the release.

lib

  • The lib directory contains the libraries used by the GO pipeline to build the release.

metadata

  • The metadata directory contains relevant metadata used by the pipeline. Examples include:
    • datasets.yaml (information about the groups that contribute annotations and where the associated files can be retrieved for the pipeline)
    • the list of valid GO_REFs
    • the list of GO rules

ontology

  • The ontology directory contains directories with ontology-related information as well as several different formats of the ontology.
  • Much of the content of this directory mirrors what is contained in https://github.com/geneontology/go-ontology/tree/master/src/ontology
  • Directories that contain ontology-related information are:
    • extensions
    • external2go
    • imports (terms imported from external ontologies used in GO equivalence axioms)
    • reports
    • subsets (the yaml files for metadata on each GO subset)
  • Ontology files are:
    • go-base.owl
    • go-basic.json
    • go-basic.json.gz
    • go-basic.obo
    • go-basic.owl
    • go.json
    • go.obo
    • go.owl

products

  • The products directory contains six sub-directories:
    • annotations
      • As above, the annotations directory contains files organized alphabetically by contributing group.
      • Files present in this directory are:
        • prediction.gaf (annotation predictions from annotation extensions and inter-ontology links)
        • src.gaf (original annotation source file)
        • gaf (PAINT only)
        • gpad (PAINT only)
        • gpi (PAINT only)
        • noiea.gaf (original source file parsed and filtered, with IEA annotations also filtered out; no PAINT annotations)
        • valid.gaf (original source file parsed and filtered after applying QA/QC rules but prior to merging with other files, e.g. PAINT)
    • blazegraph
      • The production data available at rdf.geneontology.org
      • Includes:
        • All release GAF data, including PAINT
        • Production Noctua models
        • GO ontology
      • Note that there is also an internal blazegraph that additionally contains Noctua development models.
    • pages
      • These are HTML pages created during the pipeline for various purposes.
    • panther
      • This contains PANTHER tree data, e.g. gene IDs and the PANTHER embedded tree structure.
      • This is used for AmiGO.
    • solr
      • This contains the Solr indexes that drive AmiGO.
    • ttl
      • All production annotations in ttl format (the same information as contained in the blazegraph directory).
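
As an illustration of how a file like noiea.gaf relates to its input, the sketch below drops IEA annotations from GAF lines. It is not the pipeline's own code. It relies only on the GAF convention that comment lines start with `!` and that the evidence code is the seventh tab-separated column. The helper names and the sample row are made up.

```python
from typing import Iterable, Iterator

EVIDENCE_COLUMN = 6  # zero-based index of the evidence code (GAF column 7)


def drop_iea(lines: Iterable[str]) -> Iterator[str]:
    """Yield GAF lines unchanged, skipping annotations with IEA evidence."""
    for line in lines:
        # Header and comment lines are passed through untouched.
        if line.startswith("!"):
            yield line
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) > EVIDENCE_COLUMN and columns[EVIDENCE_COLUMN] == "IEA":
            continue
        yield line


def make_row(evidence: str) -> str:
    """Build a made-up 17-column GAF row with the given evidence code."""
    columns = ["EX", "EX:0001", "gene1", "", "GO:9999990", "PMID:0",
               evidence, "", "F", "", "", "gene", "taxon:0", "20181217",
               "EX", "", ""]
    return "\t".join(columns) + "\n"


gaf_lines = ["!gaf-version: 2.1\n", make_row("IDA"), make_row("IEA")]
for line in drop_iea(gaf_lines):
    print(line, end="")
```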

reports

  • The reports directory contains links to files that document the results of various QC/QA checks, as well as a link to gorule-report.html (http://current.geneontology.org/reports/gorule-report.html).
  • Report files are organized alphabetically by contributing group.
  • The types of reports are:
    • owltools-check.txt
    • prediction-experimental-report.txt
    • prediction-report.txt
    • report.html (this is the central report that contains all of the violations and other reports for a given resource)
    • summary.txt
    • report.json
    • report.md

Consuming and Displaying GO Data

GO Consortium Members (to confirm)

  • To get the most up-to-date data, contributing groups can download GO data (e.g. ontology and annotations) using the snapshot URLs. For example:
 Snapshot annotations: http://snapshot.geneontology.org/annotations/wb.gaf.gz
 Snapshot ontology: http://purl.obolibrary.org/obo/go/snapshot/go.obo
  • Groups may also present snapshot data on their individual sites.
  • However, for distributing annotation or ontology files, data from a versioned monthly release should be used.
  • Each contributing group should direct users to the appropriate group GAF in the current annotations directory:
 Current annotations: http://current.geneontology.org/annotations/
 Current ontology: http://purl.obolibrary.org/obo/go/go.obo
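
Once a file has been downloaded, it can be read directly in its compressed form. The following is a minimal sketch, assuming a GAF file compressed with gzip as described above. The function name `read_gaf`, the choice of returned fields, and the sample row are assumptions made for this example, and the sample data is made up.

```python
import gzip
import io
from typing import BinaryIO, Dict, Iterator


def read_gaf(stream: BinaryIO) -> Iterator[Dict[str, str]]:
    """Yield a few fields per annotation from a gzip-compressed GAF stream."""
    with gzip.open(stream, mode="rt", encoding="utf-8") as text:
        for line in text:
            # Skip header/comment lines and blank lines.
            if line.startswith("!") or not line.strip():
                continue
            columns = line.rstrip("\n").split("\t")
            if len(columns) < 15:
                continue  # not a complete annotation row
            yield {
                "db": columns[0],
                "object_id": columns[1],
                "go_id": columns[4],
                "evidence": columns[6],
            }


# Made-up file content standing in for a downloaded <group>.gaf.gz file.
row = ["EX", "EX:0001", "gene1", "", "GO:9999990", "PMID:0", "IDA", "",
       "F", "", "", "gene", "taxon:0", "20181217", "EX", "", ""]
sample_text = "!gaf-version: 2.1\n" + "\t".join(row) + "\n"
compressed = io.BytesIO(gzip.compress(sample_text.encode("utf-8")))

for annotation in read_gaf(compressed):
    print(annotation)
```

To use this with real data, the bytes fetched from one of the URLs above (for example with `urllib.request.urlopen(url).read()`) can be wrapped in `io.BytesIO` and passed to `read_gaf`.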

Groups using GO for research and analysis purposes

  • For citation purposes, groups should use the ontology and annotations from the official monthly release and cite the date and DOI of the release they used.

GO Consortium Dataflow

[Figure: GO consortium dataflow diagram]

Original: https://github.com/geneontology/go-site/blob/master/docs/go-consortium-dataflow.png


Review Status

Last reviewed: August 28, 2018