Release Pipeline: Difference between revisions

From GO Wiki
'''June 2018: This documentation is currently a work in progress.'''


== Overview ==
= Overview =
The information below is intended for GOC members who are providers of annotations. It describes how GOC processes annotations, which can be viewed at locations like [http://amigo.geneontology.org/amigo AmiGO], downloaded from our sites, and queried via the SPARQL endpoint.


The GO Consortium (GOC) is now publicly releasing data on a monthly basis. Data includes annotation files, ontology files, GO-CAM models, '''and...''' Official monthly releases are versioned and archived so that analyses performed with these data can be reproduced at any point in the future. Additionally, daily snapshot releases of GO data are available for internal use by GOC members. This allows annotators, for example, to have access to the most up-to-date version of the ontology for their curation.  However, data generated using snapshot releases will not be officially released until the monthly public release.
== Annotations integrated in the GOC pipeline ==
* '''Primary annotations by GOC contributing groups''': [https://github.com/geneontology/go-site/tree/master/metadata/datasets datasets metadata file].
* '''PAINT annotations''': [https://github.com/geneontology/go-site/tree/master/metadata/datasets PAINT datasets metadata file]
These annotations are ingested daily. All IBA annotations come from PAINT; IBA annotations from any other source are filtered out.


The information below is meant to provide an overall summary and basic instructions for submitting and consuming GO data. For a more detailed discussion of the technical details, please see the [https://github.com/geneontology/pipeline README.md file] in the pipeline repository on GitHub.  
== Annotation sources ==
* Groups wishing to contribute their annotations to GO should have their group added into the '''GO groups''' metadata: <br /> https://github.com/geneontology/go-site/blob/master/metadata/groups.yaml
** Annotation files produced by GOC members are accessed via the URL or address provided by each group's [https://github.com/geneontology/go-site/tree/master/metadata/datasets datasets metadata file], in the ''source'' field.
** The files must be made publicly available via HTTP or FTP to be pulled in by GO. Important note: the '''source''' URL must resolve to the latest annotation file produced by the submitting group, since that link is used directly when fetching the data.
* More information about the format of the datasets metadata file can be found in the [https://github.com/geneontology/go-site/blob/master/metadata/datasets.schema.yaml metadata schema.yaml file].
* Currently all data is ingested in GAF format. In the future, the GOC will switch to using GPAD/GPI for all internal data exchange.


== Release Cycle ==
* '''Note that the UniProt-all file is processed differently, i.e. the file is loaded directly, without the checks to which other files are subjected.'''
** To see which data are impacted by this, see the [[UniProt-GOA datasources]].


*For both the daily and monthly releases, pipeline runs start at midnight (12am) PDT and currently take about 14 hours (this will be decreased in the future). The daily snapshot run starts nightly; the monthly public release run starts on the first of the month (or as close to it as possible if there are failures). Note that the `snapshot` run does not run on the day of the monthly `release`. Data associated with each release can be accessed at the URLs below, with specific details about the contents of released files discussed where appropriate below.
==Data processing==
**http://current.geneontology.org (~monthly, containing the latest release set)
**http://release.geneontology.org (~monthly, plus historical sets from the new pipeline)
**http://snapshot.geneontology.org (~daily)


*All files generated as part of the monthly release will have a permanent, stable release identifier.
=== Annotation QC checks===
*All files generated as part of the snapshot release will NOT have permanent, stable release identifiers.
* As files are read some lines may be modified or filtered as described in [https://github.com/geneontology/go-site/blob/master/metadata/rules/README.md GO Rules Documentation]
* A number of checks are run to ensure the integrity of the data (either at the parsing step or later in the pipeline). Checks include: data format, validity of identifiers, and a number of annotation rules. There are three types of checks:
** '''filter''': Violations of the rule lead to filtering of annotations not conforming.
** '''repair''': Violations of the rule lead to a replacement of an incorrect entry by the correct entry (for example, annotations to GO term alternate identifiers are changed to the GO term main identifier).  
** '''report''': Violations of the rule are reported but no action is taken by the QC/QA pipeline.
* Annotation lines that failed a check are collated in each group's reports.html page: [http://snapshot.geneontology.org/reports/ Snapshot Annotation Reports].


=== Annotations ===
=== Annotation merging and file generation===
* The annotation release pipeline generates many different ''products'', including primary products such as annotation and GPI files, ''reports'' (such as error reports), and inferred annotations (predictions) that provide feedback to GOC contributing groups like MODs and UniProt. Once the upstream files have been loaded, checked, and merged, GAFs, GPADs, GPIs, TTLs, reports, and prediction files are produced.


==== Overview ====
* The exception is annotations produced in Noctua: The GPAD/GPI files produced for the GO-CAM annotations are in Jenkins: http://build.berkeleybop.org/job/export-lego-to-gpad-sparql/lastSuccessfulBuild/artifact/legacy/
Annotation files are retrieved from each participating consortium member by GO Central, merged with PAINT annotation files, run through annotation QA/QC checks and then released as daily snapshot and monthly public releases.


==== How to Submit an Annotation File ====
=== Manual QC step during GO data release process ===
*Annotation files produced by GOC members should be made available to GO Central by placing the file on a publicly accessible site such as an FTP site, an Amazon S3 bucket, an HTTP server, etc.
* After all files are produced, the pipeline is automatically halted, and a manual input is needed for the release to be finalized.  
*The URL or address from which the file can be obtained is stored in the Source field of each group's [https://github.com/geneontology/go-site/tree/master/metadata/datasets datasets metadata file] which is located on github in the geneontology/go-site repository.
** The release process can be monitored on the [https://build.geneontology.org/job/geneontology/job/pipeline/job/release/ GO Jenkins page]
**More information about the format of the datasets metadata file can be found in the [https://github.com/geneontology/go-site/blob/master/metadata/datasets/README.md datasets README.md file] under the Schema heading.
** Data used for QC: [http://skyhook.berkeleybop.org/release/release_stats/ Release stats] and [https://amigo-staging.geneontology.io/amigo AmiGO staging], which are compared to the [http://current.geneontology.org/release_stats/index.html current stats] and [http://amigo.geneontology.org/amigo AmiGO site]
**For example Source URLs see:
** Observations/problems/queries/actions are noted in this [https://docs.google.com/document/d/1xzEwyEON6LqgMFe_Sjb1Fa-B-gYBVXGIfnEjP2656mo/edit Google doc] (notes are here: [https://docs.google.com/document/d/1xzEwyEON6LqgMFe_Sjb1Fa-B-gYBVXGIfnEjP2656mo/edit# 2021 releases candidates QC - notes] - [https://docs.google.com/document/d/1IMA54ycbHZxkFbIAyjRvzm_ybAaAj2V8q-Q670mE07U/edit#heading=h.bv1dqkyp53zf 2019 and 2020 releases candidates QC - notes]).
***https://github.com/geneontology/go-site/blob/master/metadata/datasets/mgi.yaml
** If there are issues with data from upstream sources:  
***https://github.com/geneontology/go-site/blob/master/metadata/datasets/wb.yaml
*** We always report what we find to the source. We give groups up to 5 working days to fix the issues, and trigger another release when the new data is available or after 5 days, whichever comes first.
***https://github.com/geneontology/go-site/blob/master/metadata/datasets/xenbase.yaml
*** If the issue is 'blocking' (i.e. the fluctuation in the number of annotations is too large, or something is very wrong with the data), we may decide to use the previous version of the upstream's data. We do this by pointing the source of the annotation to the previously imported file, saved at current/products/annotations/..-src.gaf.
***https://github.com/geneontology/go-site/blob/master/metadata/datasets/zfin.yaml
**Important note: the Source address must resolve to the latest annotation file produced by the submitting group.


==== What Happens to Annotations During a Release Cycle ====
In some cases (all IEAs missing, all qualifiers missing, etc), we might decide not to load the data as is.  
*For both the monthly public releases and the daily snapshot releases, the same set of QA/QC checks and '''annotation file merges (PAINT, GO-CAM, predictions, other external groups, e.g. UniProt)''' are performed.  Resulting annotation files and reports are then made accessible via the release URLs listed above.
*When GO Central retrieves an annotation file from a contributing group, the pipeline will run checks on the file and repair any auto-repairable issues (for example, migrating annotations to merged terms). It will then publish the processed GAFs, GPADs, GPIs, etc. to a public site, available for download.
*'''Need to link to the GO rules on github and/or articulate all of the checks that annotation files undergo after submission.'''


==== Annotation Files and Reports ====
*'''Procedure'''
*The annotation release pipeline generates ''products'', e.g. annotation and gpi files, and ''reports'' e.g. error reports, inferred annotations, etc., that provide feedback to GOC contributing groups (MODs, UniProt, etc).
** The [http://skyhook.berkeleybop.org/release/release_stats/go-annotation-changes.tsv go-annotation-changes.tsv file] is copied into a blank Google spreadsheet created in the [https://drive.google.com/drive/folders/1jyESoU6oUrLRBL3c7xCwKm_N3E8YR4uk Release Checks directory] of the GO Google drive.
*Below is more detailed information about what files are generated during a release, '''using the snapshot URLs as an example'''.
** The new file is named <code>YYYY-MM-DD Release candidate</code>
** First the section <code>SUMMARY: DIFF BETWEEN RELEASES</code> is checked for large differences.
*** Large changes (5,000 - 20,000) are OK for <code>annotations by evidence cluster PHYLO</code> and <code>annotations by evidence cluster IEA</code>, since that represents an overall small fraction of the changes in these groups of annotations
*** In 2020, EXP increased by 2-5,000 each month
***Any decrease to 0 in any category of the stats suggests an error, either by the contributing group or in the processing by the GO pipeline.
** Second the section <code>CHANGES IN ANNOTATIONS BY QUALIFIER</code> is checked for large differences.  
*** If there are large differences (>1-2 %), the section <code>CHANGES IN ANNOTATIONS BY MODEL ORGANISM AND EVIDENCE (ALL) THEN QUALIFIER</code> is checked to see which group contributed those changes.
** <code>CHANGES IN ANNOTATIONS BY GROUP</code> usually do not exceed 1-2%. If there are larger changes, the section <code>CHANGES IN ANNOTATIONS BY MODEL ORGANISM AND EVIDENCE (ALL) THEN QUALIFIER</code> is used to determine where changes come from. Decreases in NAS, TAS, IEA are often due to annotation reviews and are usually OK (unless they are suddenly 0). Usually expect increases in EXP annotations. ISS-types of evidence are relatively stable or have small increases.
** <code>CHANGES IN REFERENCES AND PMIDS</code> are also usually of less than 1%.
** Large changes in <code>CHANGES IN ANNOTATED BIOENTITIES BY FILTERED TAXON AND BY BIOENTITY TYPE (ALL)</code> should be investigated.
* For groups or species with few annotations (less than 100), relatively minor changes cause large percentage changes, so absolute numbers should be considered in those cases.
* AmiGO's faceted search is useful when large differences are observed, for example by evidence code or for a species that is not one of the 11 model organisms.


===== Annotation and GPI Files =====
==Types of releases==
*Annotation (gaf and gpad) and gpi files are available here:
* '''Official monthly releases''': versioned and archived so that analyses performed with these data can be reproduced at any point in the future.  Note that all files generated as part of the monthly release have a permanent, stable release identifier.
**http://snapshot.geneontology.org/annotations/index.html
* '''Daily snapshot releases''': intended for internal use by GOC members. Daily snapshots are not versioned and not archived, therefore not citable. Note that the ''daily snapshot release'' is not generated on the day of the ''official monthly release''. Note also that all files generated as part of the snapshot release will NOT have permanent, stable release identifiers.
*Currently, annotation (gaf and gpad) and gpi files are sorted according to contributing group.
*'''In the future, annotation files will be sorted according to species.'''  
*Note that the file names now conform to a new naming schema that indicates the file type in the extension:
**mgi.gaf.gz
**mgi.gpad.gz
**mgi.gpi.gz
*'''There are also annotation files under the products directory.  What are the intended use cases for each of these files?'''
**http://snapshot.geneontology.org/products/annotations/
***[http://snapshot.geneontology.org/products/annotations/wb-prediction.gaf wb-prediction.gaf]
***[http://snapshot.geneontology.org/products/annotations/index.html/wb-src.gaf.gz wb-src.gaf.gz]
***[http://snapshot.geneontology.org/products/annotations/index.htmlwb_noiea.gaf.gz wb_noiea.gaf.gz]
***[http://snapshot.geneontology.org/products/annotations/index.html/wb_valid.gaf.gz wb_valid.gaf.gz]


===== Annotation Reports =====
== Data publishing and access ==
*Annotation reports are available here:
Data produced by each release can be accessed at the URLs below:
**http://snapshot.geneontology.org/reports/
*Current official monthly release: http://current.geneontology.org
*Each contributing group currently has six different reports; in the future, '''these reports will be consolidated as much as possible and rendered as HTML for easier viewing.'''
*Monthly releases to date: http://release.geneontology.org
*The current reports are (using MGI as an example):
*Daily snapshot: http://snapshot.geneontology.org
**[http://snapshot.geneontology.org/reports/mgi-owltools-check.txt mgi-owltools-check.txt]
***'''Need link out to definitive owltools documentation'''
**[http://snapshot.geneontology.org/reports/mgi-prediction-report.txt mgi-prediction-experimental-report.txt]
**[http://snapshot.geneontology.org/reports/mgi-prediction-report.txt mgi-prediction-report.txt]
**[http://snapshot.geneontology.org/reports/mgi-summary.txt mgi-summary.txt]
***This is a basic summary of the parsing of your GAF file. It functionally replaces the old "Mike's script"
**[http://snapshot.geneontology.org/reports/mgi.report.json mgi.report.json]
**[http://snapshot.geneontology.org/reports/mgi.report.md mgi.report.md]


=== Ontology ===
==Release Content==
 
*The release content may be accessed from the specific URLs listed above.
*Each page of content is generally organized as:
**Parent (a link to the parent directory)
**Directories (a list of all directories or subdirectories within each specific location)
**Files (a list of all files within each specific location)
 
*The main list of directories, with information about the content found in each, follows below.
 
==== annotations ====
*The annotations directory contains solely annotation files organized alphabetically by contributing group.
*The annotation files available here are those produced *after* the QC/QA rules have been applied, and they include annotations from GOC annotation tools, i.e. PAINT.
*Each group has three files, compressed using the gzip utility:
**gaf
**gpad
**gpi
***Note that the GPI file here corresponds to the GPAD annotation file, not the original GPI file produced by the contributing group.
*An example of annotation files found in the annotation directory:
  mgi.gaf.gz
  mgi.gpad.gz
  mgi.gpi.gz
 
==== bin ====
*The bin directory contains the binary files used by the GO pipeline to build the release.
 
==== lib ====
*The lib directory contains the libraries used by the GO pipeline to build the release.
 
==== metadata ====
*The metadata directory contains relevant metadata used by the pipeline.  Examples include:
**datasets.yaml (information about the groups that contribute annotations and where the associated files can be retrieved for the pipeline)
**the list of valid GO_REFs
**the list of GO rules
 
==== ontology ====
*The ontology directory contains directories with ontology-related information as well as several different formats of the ontology.
*Much of the content of this directory mirrors what is contained in https://github.com/geneontology/go-ontology/tree/master/src/ontology
*Directories that contain ontology-related information are:
**extensions
**external2go
**imports (terms imported from external ontologies used in GO equivalence axioms)
**reports
**subsets (the yaml files for metadata on each GO subset)
*Ontology files are:
**go-base.owl
**go-basic.json
**go-basic.json.gz
**go-basic.obo
**go-basic.owl
**go.json
**go.obo
**go.owl
 
==== products ====
*The products directory contains six sub-directories:
**annotations
**As above, the annotations directory contains files that are pre-qc and '''not meant for general public consumption''', organized alphabetically according to contributing group.
**Files present in this directory are:
***prediction.gaf (annotation predictions from annotation extensions and inter-ontology links)
***src.gaf (original annotation source file)
***gaf (PAINT only)
***gpad (PAINT only)
***gpi (PAINT only)
***noiea.gaf (with IEA annotations additionally filtered out; no PAINT annotations)
***valid.gaf (original source file parsed and filtered after applying QA/QC rules but prior to merging with other files, e.g. PAINT)
**blazegraph
***The production data available at rdf.geneontology.org
***Includes:
****All release GAF data, including PAINT
****Production Noctua models
****GO ontology
***Note that there is also an internal blazegraph that also contains Noctua development models
**pages
***These are HTML pages created during the pipeline for various purposes.
**panther
***This contains PANTHER tree data, e.g. gene ids and PANTHER embedded tree structure.
***This is used for AmiGO.
**solr
***This contains the solr indexes that drive AmiGO.
**ttl
***All production annotations available in ttl format (same information as contained in the blazegraph directory).
 
==== release_stats ====
* The release stats are files that are used to do quality assurance on the data (ontology and annotations) before GO releases. The code that generates the statistics is on the GO GitHub: https://github.com/geneontology/go-stats (note that the same code is in go-site, which is the location from which the statistics are generated).
 
The files produced are:
 
* [[File Description: go-annotation-changes |go-annotation-changes]]
* [[File description: go-annotation-changes no pb|go-annotation-changes_no_pb]]
* [[File Description: go-ontology-changes |go-ontology-changes]]
* [[File Description: go-stats|go-stats]]
* [[File Description: go-stats-no-pb|go-stats-no-pb]]
* [[File Description: go-stats-summary|go-stats-summary]]
* [[File Description: aggregated-go-stats-summaries|aggregated-go-stats-summaries]]
 
==== reports ====
*The reports directory contains links to files that document the results of various QC/QA checks as well as a link to the [http://current.geneontology.org/reports/gorule-report.html gorule-report.html].
*Report files are organized alphabetically by contributing group.
*The types of reports are:
**owltools-check.txt
**prediction-experimental-report.txt
**prediction-report.txt
**report.html (this is the central report that contains all of the violations and other reports for a given resource)
**summary.txt
**report.json
**report.md


== Consuming and Displaying GO Data ==
=== GO Consortium Members ===
=== GO Consortium Members (to confirm) ===
*While GOC members may consume snapshot release files for internal purposes, we strongly encourage members to only display data from the monthly releases.
*To get the most up-to-date data, contributing groups can download GO data (e.g. ontology and annotations) using the snapshot URLs.  For example:
*GOC members may filter GO annotations before display on their local sites.
  Snapshot annotations: http://snapshot.geneontology.org/annotations/wb.gaf.gz
*GO annotations should not be changed in any way from their original content.
 
  Snapshot ontology: http://purl.obolibrary.org/obo/go/snapshot/go.obo
*Groups may also present snapshot data on their individual sites.  
*However, for distributing annotation or ontology files, data from a versioned monthly release should be used.
*Each contributing group should direct users to the appropriate group GAF in the current annotations directory
  Current annotations: http://current.geneontology.org/annotations/
 
  Current ontology: http://purl.obolibrary.org/obo/go/go.obo
 
=== Groups using GO for research and analysis purposes ===
* For citation purposes, groups should use the ontology and annotations from the official monthly release and cite the date and doi of the release they used.


== GO Consortium Dataflow ==


https://github.com/geneontology/go-site/blob/master/docs/go-consortium-dataflow.png
[[File:GOconsortium-dataflow.png|1000px|GO consortium dataflow]]
 
Original: https://github.com/geneontology/go-site/blob/master/docs/go-consortium-dataflow.png
 


== Review Status ==


Last reviewed: August 28, 2018




[[Category:Annotation]]
[[Category:Release Pipeline]]

Revision as of 05:36, 18 June 2021

Overview

The information below is intended for GOC members who are providers of annotations. It describes how GOC processes annotations, which can be viewed at locations like AmiGO, downloaded from our sites, and queried via the SPARQL endpoint.

Annotations integrated in the GOC pipeline

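  • Primary annotations by GOC contributing groups: datasets metadata file (https://github.com/geneontology/go-site/tree/master/metadata/datasets).
  • PAINT annotations: PAINT datasets metadata file (https://github.com/geneontology/go-site/tree/master/metadata/datasets).

These annotations are ingested daily. All IBA annotations come from PAINT; IBA annotations from any other source are filtered out.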

Annotation sources

  • Groups wishing to contribute their annotations to GO should have their group added into the GO groups metadata:
    https://github.com/geneontology/go-site/blob/master/metadata/groups.yaml
    • Annotation files produced by GOC members are accessed via the URL or address provided by each group's datasets metadata file, in the source field.
    • The files must be made publicly available via HTTP or FTP to be pulled in by GO. Important note: the source URL must resolve to the latest annotation file produced by the submitting group, since that link is used directly when fetching the data.
  • More information about the format of the datasets metadata file can be found in the metadata schema.yaml file.
  • Currently all data is ingested in GAF format. In the future, the GOC will switch to using GPAD/GPI for all internal data exchange.
  • Note that the UniProt-all file is processed differently, i.e. the file is loaded directly, without the checks to which other files are subjected.
    • To see which data are impacted by this, see the UniProt-GOA datasources page.
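
The requirement that source files be reachable via HTTP or FTP can be checked before a URL is placed in the source field. The following is only a minimal sketch, not part of the GO pipeline: the function name is_fetchable_source is hypothetical, and accepting https alongside http is an assumption.

```python
from urllib.parse import urlparse

# Schemes named in the text above; treating https as HTTP is an assumption.
ALLOWED_SCHEMES = {"http", "https", "ftp"}


def is_fetchable_source(url: str) -> bool:
    """Return True if the URL uses an allowed scheme and names a host."""
    parts = urlparse(url)
    return parts.scheme in ALLOWED_SCHEMES and bool(parts.netloc)
```

A check like this only validates the form of the address; it does not confirm that the link resolves to the latest annotation file.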

Data processing

Annotation QC checks

  • As files are read some lines may be modified or filtered as described in GO Rules Documentation
  • A number of checks are run to ensure the integrity of the data (either at the parsing step or later in the pipeline). Checks include: data format, validity of identifiers, and a number of annotation rules. There are three types of checks:
    • filter: Violations of the rule lead to filtering of annotations not conforming.
    • repair: Violations of the rule lead to a replacement of an incorrect entry by the correct entry (for example, annotations to GO term alternate identifiers are changed to the GO term main identifier).
    • report: Violations of the rule are reported but no action is taken by the QC/QA pipeline.
  • Annotation lines that failed a check are collated in each group's reports.html page: Snapshot Annotation Reports.
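
The three types of checks can be illustrated with a toy example. This is a hedged sketch only: the rules, the ALT_TO_MAIN mapping, the identifiers, and the record layout are all made up for illustration and are not the actual GO rules or pipeline code.

```python
from dataclasses import dataclass, field

# Hypothetical mapping from alternate GO IDs to main IDs (made-up identifiers).
ALT_TO_MAIN = {"GO:9999991": "GO:0000001"}


@dataclass
class QCResult:
    kept: list = field(default_factory=list)      # annotations that survive QC
    messages: list = field(default_factory=list)  # (check type, gene, detail)


def run_checks(annotations):
    """Apply one toy rule of each type: filter, repair, report."""
    result = QCResult()
    for original in annotations:
        ann = dict(original)  # work on a copy so the input is not modified
        # filter: annotations without an evidence code are dropped
        if not ann.get("evidence"):
            result.messages.append(("filter", ann["gene"], "missing evidence code"))
            continue
        # repair: alternate GO IDs are replaced by the main ID
        main_id = ALT_TO_MAIN.get(ann["go_id"])
        if main_id is not None:
            result.messages.append(("repair", ann["gene"], f"{ann['go_id']} -> {main_id}"))
            ann["go_id"] = main_id
        # report: a missing reference is noted, but the annotation is kept as is
        if not ann.get("reference"):
            result.messages.append(("report", ann["gene"], "no reference given"))
        result.kept.append(ann)
    return result
```

The messages list plays the role of the per-group report: every violation is recorded, whether or not the annotation was changed or removed.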

Annotation merging and file generation

  • The annotation release pipeline generates many different products, including primary products such as annotation and GPI files, reports (such as error reports), and inferred annotations (predictions) that provide feedback to GOC contributing groups like MODs and UniProt. Once the upstream files have been loaded, checked, and merged, GAFs, GPADs, GPIs, TTLs, reports, and prediction files are produced.

Manual QC step during GO data release process

  • After all files are produced, the pipeline is automatically halted, and a manual input is needed for the release to be finalized.
    • The release process can be monitored on the GO Jenkins page
    • Data used for QC: Release stats and AmiGO staging, which are compared to the current stats and AmiGO site
    • Observations/problems/queries/actions are noted in this Google doc (notes are here: 2021 releases candidates QC - notes - 2019 and 2020 releases candidates QC - notes).
    • If there are issues with data from upstream sources:
      • We always report what we find to the source. We give groups up to 5 working days to fix the issues, and trigger another release when the new data is available or after 5 days, whichever comes first.
      • If the issue is 'blocking' (i.e. the fluctuation in the number of annotations is too large, or something is very wrong with the data), we may decide to use the previous version of the upstream's data. We do this by pointing the source of the annotation to the previously imported file, saved at current/products/annotations/..-src.gaf.

In some cases (all IEAs missing, all qualifiers missing, etc), we might decide not to load the data as is.

  • Procedure
    • The go-annotation-changes.tsv file is copied into a blank Google spreadsheet created in the Release Checks directory of the GO Google drive.
    • The new file is named YYYY-MM-DD Release candidate
    • First the section SUMMARY: DIFF BETWEEN RELEASES is checked for large differences.
      • Large changes (5,000 - 20,000) are OK for annotations by evidence cluster PHYLO and annotations by evidence cluster IEA, since that represents an overall small fraction of the changes in these groups of annotations
      • In 2020, EXP increased by 2-5,000 each month
      • Any decrease to 0 in any category of the stats suggests an error, either by the contributing group or in the processing by the GO pipeline.
    • Second the section CHANGES IN ANNOTATIONS BY QUALIFIER is checked for large differences.
      • If there are large differences (>1-2 %), the section CHANGES IN ANNOTATIONS BY MODEL ORGANISM AND EVIDENCE (ALL) THEN QUALIFIER is checked to see which group contributed those changes.
    • CHANGES IN ANNOTATIONS BY GROUP usually do not exceed 1-2%. If there are larger changes, the section CHANGES IN ANNOTATIONS BY MODEL ORGANISM AND EVIDENCE (ALL) THEN QUALIFIER is used to determine where changes come from. Decreases in NAS, TAS, IEA are often due to annotation reviews and are usually OK (unless they are suddenly 0). Usually expect increases in EXP annotations. ISS-types of evidence are relatively stable or have small increases.
    • CHANGES IN REFERENCES AND PMIDS are also usually of less than 1%.
    • Large changes in CHANGES IN ANNOTATED BIOENTITIES BY FILTERED TAXON AND BY BIOENTITY TYPE (ALL) should be investigated.
  • For groups or species with few annotations (less than 100), relatively minor changes cause large percentage changes, so absolute numbers should be considered in those cases.
  • AmiGO's faceted search is useful when large differences are observed, for example by evidence code or for a species that is not one of the 11 model organisms.
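
The comparisons in the procedure above could be partly automated. The sketch below is an illustration under stated assumptions, not the method used for GO releases: the function name flag_changes is hypothetical, the 2% threshold is taken from the upper end of the 1-2% range mentioned above, the cut-off of 100 annotations comes from the note on small groups, and the absolute threshold of 20 is invented for the example.

```python
def flag_changes(previous, current, pct_threshold=2.0, small_count=100, abs_threshold=20):
    """Compare per-category annotation counts between two releases.

    Returns a dict mapping each suspicious category to a short reason.
    """
    flags = {}
    for category, old in previous.items():
        new = current.get(category, 0)
        delta = new - old
        if old > 0 and new == 0:
            # Any decrease to 0 suggests an error somewhere
            flags[category] = "dropped to 0"
        elif old < small_count:
            # Small categories: percentages mislead, so use absolute numbers
            if abs(delta) > abs_threshold:
                flags[category] = f"absolute change {delta:+d}"
        elif abs(delta) / old * 100 > pct_threshold:
            flags[category] = f"{delta / old * 100:+.1f}% change"
    return flags
```

Flagged categories would still need the manual follow-up described above, for example checking which group contributed the change.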

Types of releases

  • Official monthly releases: versioned and archived so that analyses performed with these data can be reproduced at any point in the future. Note that all files generated as part of the monthly release have a permanent, stable release identifier.
  • Daily snapshot releases: intended for internal use by GOC members. Daily snapshots are not versioned and not archived, therefore not citable. Note that the daily snapshot release is not generated on the day of the official monthly release. Note also that all files generated as part of the snapshot release will NOT have permanent, stable release identifiers.

Data publishing and access

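Data produced by each release can be accessed at the URLs below:

  • Current official monthly release: http://current.geneontology.org
  • Monthly releases to date: http://release.geneontology.org
  • Daily snapshot: http://snapshot.geneontology.org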

Release Content

  • The release content may be accessed from the specific URLs listed above.
  • Each page of content is generally organized as:
    • Parent (a link to the parent directory)
    • Directories (a list of all directories or subdirectories within each specific location)
    • Files (a list of all files within each specific location)
  • The main list of directories, with information about the content found in each, follows below.

annotations

  • The annotations directory contains solely annotation files organized alphabetically by contributing group.
  • The annotation files available here are those produced *after* the QC/QA rules have been applied, and they include annotations from GOC annotation tools, i.e. PAINT.
  • Each group has three files, compressed using the gzip utility:
    • gaf
    • gpad
    • gpi
      • Note that the GPI file here corresponds to the GPAD annotation file, not the original GPI file produced by the contributing group.
  • An example of annotation files found in the annotation directory:
 mgi.gaf.gz
 mgi.gpad.gz
 mgi.gpi.gz
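
The naming scheme makes it straightforward to derive the three file addresses for a group. This is a small sketch with a hypothetical helper, annotation_urls; it assumes that the current and snapshot sites both place files under /annotations/ with the same group.type.gz pattern, which this page only shows explicitly for the snapshot site.

```python
# Base addresses listed on this page
BASE_URLS = {
    "current": "http://current.geneontology.org",
    "snapshot": "http://snapshot.geneontology.org",
}
FILE_TYPES = ("gaf", "gpad", "gpi")


def annotation_urls(group, release="current"):
    """Build the expected URL of each annotation file for a contributing group."""
    base = BASE_URLS[release]
    return {ft: f"{base}/annotations/{group}.{ft}.gz" for ft in FILE_TYPES}
```

For example, annotation_urls("mgi") would yield addresses ending in mgi.gaf.gz, mgi.gpad.gz, and mgi.gpi.gz.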

bin

  • The bin directory contains the binary files used by the GO pipeline to build the release.

lib

  • The lib directory contains the libraries used by the GO pipeline to build the release.

metadata

  • The metadata directory contains relevant metadata used by the pipeline. Examples include:
    • datasets.yaml (information about the groups that contribute annotations and where the associated files can be retrieved for the pipeline)
    • the list of valid GO_REFs
    • the list of GO rules

ontology

  • The ontology directory contains directories with ontology-related information as well as several different formats of the ontology.
  • Much of the content of this directory mirrors what is contained in https://github.com/geneontology/go-ontology/tree/master/src/ontology
  • Directories that contain ontology-related information are:
    • extensions
    • external2go
    • imports (terms imported from external ontologies used in GO equivalence axioms)
    • reports
    • subsets (the yaml files for metadata on each GO subset)
  • Ontology files are:
    • go-base.owl
    • go-basic.json
    • go-basic.json.gz
    • go-basic.obo
    • go-basic.owl
    • go.json
    • go.obo
    • go.owl

products

  • The products directory contains six sub-directories:
    • annotations
    • As above, the annotations directory contains files that are pre-qc and not meant for general public consumption, organized alphabetically according to contributing group.
    • Files present in this directory are:
      • prediction.gaf (annotation predictions from annotation extensions and inter-ontology links)
      • src.gaf (original annotation source file)
      • gaf (PAINT only)
      • gpad (PAINT only)
      • gpi (PAINT only)
      • noiea.gaf (with IEA annotations additionally filtered out; no PAINT annotations)
      • valid.gaf (original source file parsed and filtered after applying QA/QC rules but prior to merging with other files, e.g. PAINT)
    • blazegraph
      • The production data available at rdf.geneontology.org
      • Includes:
        • All release GAF data, including PAINT
        • Production Noctua models
        • GO ontology
      • Note that there is also an internal blazegraph that also contains Noctua development models
    • pages
      • These are HTML pages created during the pipeline for various purposes.
    • panther
      • This contains PANTHER tree data, e.g. gene ids and PANTHER embedded tree structure.
      • This is used for AmiGO.
    • solr
      • This contains the solr indexes that drive AmiGO.
    • ttl
      • All production annotations available in ttl format (same information as contained in the blazegraph directory).

release_stats

  • The release stats are files that are used to do quality assurance on the data (ontology and annotations) before GO releases. The code that generates the statistics is on the GO GitHub: https://github.com/geneontology/go-stats (note that the same code is in go-site, which is the location from which the statistics are generated).

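The files produced are:

  • go-annotation-changes
  • go-annotation-changes_no_pb
  • go-ontology-changes
  • go-stats
  • go-stats-no-pb
  • go-stats-summary
  • aggregated-go-stats-summaries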

reports

  • The reports directory contains links to files that document the results of various QC/QA checks as well as a link to the gorule-report.html.
  • Report files are organized alphabetically by contributing group.
  • The types of reports are:
    • owltools-check.txt
    • prediction-experimental-report.txt
    • prediction-report.txt
    • report.html (this is the central report that contains all of the violations and other reports for a given resource)
    • summary.txt
    • report.json
    • report.md

Consuming and Displaying GO Data

GO Consortium Members (to confirm)

  • To get the most up-to-date data, contributing groups can download GO data (e.g. ontology and annotations) using the snapshot URLs. For example:
 Snapshot annotations: http://snapshot.geneontology.org/annotations/wb.gaf.gz
 Snapshot ontology: http://purl.obolibrary.org/obo/go/snapshot/go.obo
  • Groups may also present snapshot data on their individual sites.
  • However, for distributing annotation or ontology files, data from a versioned monthly release should be used.
  • Each contributing group should direct users to the appropriate group GAF in the current annotations directory
 Current annotations: http://current.geneontology.org/annotations/
 Current ontology: http://purl.obolibrary.org/obo/go/go.obo
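
Once a group GAF has been downloaded, it can be read as a tab-separated file. The sketch below assumes the standard GAF layout, in which comment lines start with "!" and the evidence code is the seventh column; the function names count_by_evidence and count_file are hypothetical, and the path passed to count_file is whatever local copy the reader has saved.

```python
import gzip
from collections import Counter


def count_by_evidence(lines):
    """Count annotations per evidence code in an iterable of GAF lines."""
    counts = Counter()
    for line in lines:
        # Skip header/comment lines and blank lines
        if line.startswith("!") or not line.strip():
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 15:
            continue  # not a complete annotation row
        counts[columns[6]] += 1  # column 7 holds the evidence code
    return counts


def count_file(path):
    """Open a gzip-compressed GAF file and count annotations by evidence code."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return count_by_evidence(handle)
```

Counts like these are only a quick sanity check on a downloaded file; the release_stats files remain the reference for release-level numbers.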

Groups using GO for research and analysis purposes

  • For citation purposes, groups should use the ontology and annotations from the official monthly release and cite the date and doi of the release they used.

GO Consortium Dataflow

GO consortium dataflow

Original: https://github.com/geneontology/go-site/blob/master/docs/go-consortium-dataflow.png


Review Status

Last reviewed: August 28, 2018