GO Consortium web GAF2.0 documentation

TO REPLACE INFORMATION CURRENTLY PROVIDED AT: http://www.geneontology.org/GO.format.annotation.shtml

GO Annotation File Format Guide

The Gene Ontology Consortium provides downloads of annotation data in a tab-delimited format, where each line represents a single link between a gene product (protein, gene, transcript, etc.) and a GO term, together with an evidence code and the reference that supports the link. This page documents the format of these files. For more general information on annotation, please see the GO annotation guide.

   * Association File Formats
   * Annotation File Format Quality Control Script
   * Errors Checked
   * Taxon IDs
   * Script command line options

Association File Formats

Please note that as of January 2010, the GO Consortium will supply annotation data in an additional, more descriptive 17-column, tab-delimited format called GAF2.0. This new file format is very similar to the original gene association file format (GAF1.0). It differs only in the range of identifiers permitted in the DB_Object_ID field (column 2), the interpretation of the DB_Object_Type field, and the contents of two additional fields: Annotation Cross-Products (column 16) and Gene_Product_Form_ID (column 17).

Please visit the links below for a full description of the GAF1.0 and GAF2.0 formats.

GAF2.0 file format

GAF1.0 file format


From March 2010, the GAF2.0 format will become the primary gene association file format for all annotation files supplied in:

ftp://ftp.geneontology.org/pub/go/gene-associations/

Annotation files in GAF1.0 format will continue to be made available from the submissions directory of the GO Consortium web site:

ftp://ftp.geneontology.org/pub/go/gene-associations/submission/

The format of an annotation file is indicated in the file's header, with the text:

!gaf-version: 2.0

or

!gaf-version: 1.0
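As an illustration of how a reader of these files might use the header, the following minimal Python sketch looks for the version declaration among the leading "!" comment lines. It is not part of the GO quality control script; the function name detect_gaf_version is hypothetical, and the sketch assumes the declaration appears before the first data line.

```python
import re

# Matches a header line such as "!gaf-version: 2.0".
GAF_VERSION_RE = re.compile(r"^!gaf-version:\s*(\d+\.\d+)\s*$")


def detect_gaf_version(lines):
    """Return the declared GAF version as a string, or None if none is declared.

    `lines` is any iterable of text lines, for example an open file handle.
    Hypothetical helper for illustration only.
    """
    for line in lines:
        if not line.startswith("!"):
            break  # header comments end at the first data line
        match = GAF_VERSION_RE.match(line.rstrip("\n"))
        if match:
            return match.group(1)
    return None
```

A caller could pass an open file, for instance detect_gaf_version(open(path)), and decide how to handle a None result according to local policy.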

Annotation File Format Quality Control Script

This Perl script is provided as a quality control check in an effort to validate the format and to partially check the data provided within the gene association files. This script is used on all gene association files before they are loaded into the GO database. The results of this filtering step are reported back to the submitting group.

This script is intended to be generic and to enforce the standards defined by the GO Consortium. Use this script to validate your gene association file before committing it to the archive. The checks provided define the minimum standard format for the repository. Suggestions are welcome for enhancements to this process. Download the script directly, via the GO web CVS interface, or from the directory go/software/utilities in the GO CVS repository.

Submitted gene association files are committed to the GO CVS repository into the gene association file submissions directory (go/gene-associations/submission/). The checking and filtering script is run nightly on any newly deposited files by the GO Database staff at Stanford. The output of the script is placed in the gene association file directory (go/gene-associations/) and subsequently used to load the GO database.

Errors Checked

The input file is checked for the following types of errors. If a row of the gene association file is found to contain an error, it is removed from the final output file.

The script checks each line for the correct number of columns and the cardinality of each column, looks for leading or trailing whitespace, and performs a number of specific checks on the data in particular columns.

These specific checks include use of the defined terms for Qualifier, Evidence, Aspect, and DB Object type columns. The DB:Reference, Taxon and GO ID columns are checked for minimal form. The Date is also verified to match the YYYYMMDD format.
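The kinds of per-line checks described above can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the logic of the Perl script itself: the function name check_row, the regular expressions used for "minimal form", and the column positions for GO ID (5), Aspect (9) and Taxon (13) are assumptions based on the GAF column layout, with only the date column (14) stated on this page. The column counts assumed are 15 for GAF1.0 and 17 for GAF2.0.

```python
import re
from datetime import datetime

# Assumed column counts per format version.
EXPECTED_COLUMNS = {"1.0": 15, "2.0": 17}

# Assumed "minimal form" patterns; the real script may be stricter or looser.
GO_ID_RE = re.compile(r"^GO:\d{7}$")
TAXON_RE = re.compile(r"^taxon:\d+(\|taxon:\d+)?$")
DATE_RE = re.compile(r"^\d{8}$")
ASPECTS = {"P", "F", "C"}


def check_row(line, gaf_version="2.0"):
    """Return a list of problems found in one annotation line (empty if none)."""
    cols = line.rstrip("\n").split("\t")
    expected = EXPECTED_COLUMNS[gaf_version]
    if len(cols) != expected:
        # Without the right column count the positional checks are meaningless.
        return [f"expected {expected} columns, found {len(cols)}"]

    problems = []
    for number, value in enumerate(cols, start=1):
        if value != value.strip():
            problems.append(f"column {number}: leading or trailing whitespace")
    if not GO_ID_RE.match(cols[4]):
        problems.append("column 5: GO ID is not of the form GO:nnnnnnn")
    if cols[8] not in ASPECTS:
        problems.append("column 9: aspect is not one of P, F, C")
    if not TAXON_RE.match(cols[12]):
        problems.append("column 13: taxon is not of the form taxon:n")
    if not DATE_RE.match(cols[13]):
        problems.append("column 14: date is not YYYYMMDD")
    else:
        try:
            datetime.strptime(cols[13], "%Y%m%d")  # rejects e.g. month 13
        except ValueError:
            problems.append("column 14: date is not YYYYMMDD")
    return problems
```

Returning a list of messages rather than a single boolean makes it possible to report every problem in a row back to the submitting group at once.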

Column 1 and all other database abbreviations used within the gene association file are checked (case-insensitively) to confirm that each abbreviation is defined within the GO database cross-references.

The GO IDs mentioned in the file are checked, using the current gene_ontology.obo file. Rows with obsolete GO IDs are removed, as well as any row containing an invalid GO ID.
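One way to picture the GO ID check is the Python sketch below, which reads the [Term] stanzas of an OBO file and separates current IDs from those flagged with "is_obsolete: true". It is an illustration only, not the script's actual code: the names load_go_ids and classify_go_id are hypothetical, and as a simplification secondary IDs (alt_id lines) are ignored, so they would be reported as invalid here.

```python
from itertools import chain


def load_go_ids(obo_lines):
    """Return (current_ids, obsolete_ids) collected from [Term] stanzas."""
    current, obsolete = set(), set()
    stanza_type, term_id, is_obsolete = None, None, False

    # A sentinel stanza header at the end flushes the final real stanza.
    for raw in chain(obo_lines, ["[EOF]"]):
        line = raw.strip()
        if line.startswith("[") and line.endswith("]"):
            # A new stanza begins: store the one just finished, if it was a term.
            if stanza_type == "[Term]" and term_id:
                (obsolete if is_obsolete else current).add(term_id)
            stanza_type, term_id, is_obsolete = line, None, False
        elif line.startswith("id: "):
            term_id = line[4:].strip()
        elif line == "is_obsolete: true":
            is_obsolete = True
    return current, obsolete


def classify_go_id(go_id, current, obsolete):
    """Return 'ok', 'obsolete' or 'invalid' for a GO ID from an annotation row."""
    if go_id in current:
        return "ok"
    if go_id in obsolete:
        return "obsolete"
    return "invalid"
```

Under this sketch, rows whose GO ID classifies as anything other than 'ok' would be the ones dropped from the filtered output.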

All IEA annotations that are over one year old are removed. This filtering step uses the date of annotation stated in column 14, so the validity of the information in the date column is very important.

Taxon IDs

A major component of the filtering is the requirement that particular taxon IDs can only be included within the association files provided by specific projects. For example, the taxon ID for Mus musculus (taxon:10090) is limited to the file provided by the Mouse Genome Informatics project. Please see the list of species and relevant database groups for more details.

Script command line options

Usage help for the script is available with the -h option. The script is designed to be run from the go/gene-associations/submission directory within a GO CVS sandbox. By default the script needs the go/doc/GO.xrf_abbs and go/ontology/gene_ontology_edit.obo files. The input gene association file is read from STDIN by default, or from the file specified with the -i option.

Usage

A. check a file for any errors, obsolete GO IDs or old IEA annotations

filter-gene-association.pl -i gene_association.sgd.gz

B. filter any problems and output the validated lines, including headers

filter-gene-association.pl -i gene_association.fb.gz -w > filtered-output

C. check file without the taxid checking on, and write the bad lines to STDOUT

filter-gene-association.pl -i gene_association.fb.gz -p nocheck -e > bad-lines

System requirements

The script is written in basic Perl and should be portable to most systems. It has been tested on Mac OS X with Perl 5.8.1 and on Solaris with Perl 5.6.1 and greater.

Submitted by Mike Cherry, 2005-10-19