PAINT GAF production
PAINT GAF Format
By Huaiyu Mi
This document defines the data used in each column of the PAINT GAF file. The file follows the GO Annotation File (GAF) format 2.1. See http://geneontology.org/page/go-annotation-file-gaf-format-21 for more details.
The legacy PAINT GAF files were created and stored on SVN at: http://viewvc.geneontology.org/viewvc/GO-SVN/trunk/gene-associations/submission/paint/pre-submission/?sortby=date
In March 2017, a modified PAINT tool was released to retrieve and store all PAINT data through a database. A new process was created to generate the PAINT GAF from the database. The format of these GAF files was strictly based on the legacy PAINT GAFs. In Feb. 2018, during the transition of the GO data release, a number of issues were raised with regard to the data in the GAF files, especially which type of data should be captured in each column.
This document will serve as a guideline to create and use the PAINT GAF.
PAINT GAF export
- Monthly release (not yet implemented; currently ad hoc)
- Load the current GO file (the date of the GO.obo download goes in the GAF file header) (not yet implemented?)
- Run touchup: See other document
- Remove annotations to obsolete terms: if there is a replaced_by tag, the annotation is updated to the replacement term; otherwise it is removed
- Remove annotations for which experimental evidence is no longer available
- During the annual Panther release, all PTNs and all IBDs are forward-tracked. PTNs can change families.
- Remove annotations to ‘do not annotate’ and ‘do not manually annotate’ terms
- Run taxon constraints checks
- All these actions are recorded in the ‘comments’ file (mostly; some notes are not yet populated)
- If a, b or c happens, the family curation status will be changed to ‘REQUIRE PAINT REVIEW’
- Generate GAF as described below
- GAFs are stored on the Panther db FTP site: ftp://ftp.pantherdb.org/downloads/paint/presubmission
- GO loads the yaml file from https://github.com/geneontology/go-site/blob/master/metadata/datasets/paint.yaml
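The obsolete-term step in the list above can be sketched as follows. This is a minimal illustration, not the production export code: the function names, the (gene, GO ID) tuple layout, and the GO IDs in the OBSOLETE table are all made up for the example, and the sketch assumes the replacement term is itself current.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical lookup: obsolete GO ID -> replaced_by GO ID, or None when the
# obsolete term has no replaced_by tag. These IDs are placeholders.
OBSOLETE: Dict[str, Optional[str]] = {
    "GO:9999991": "GO:9999992",
    "GO:9999993": None,
}


def resolve_term(go_id: str, obsolete: Dict[str, Optional[str]]) -> Optional[str]:
    """Return the GO ID to keep, or None if the annotation must be removed."""
    if go_id not in obsolete:
        return go_id          # term is current; keep the annotation unchanged
    return obsolete[go_id]    # replacement ID, or None when there is none


def update_annotations(
    annotations: List[Tuple[str, str]],
    obsolete: Dict[str, Optional[str]],
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Apply the obsolete-term rule and collect notes for the comments file."""
    kept: List[Tuple[str, str]] = []
    comments: List[str] = []
    for gene, go_id in annotations:
        new_id = resolve_term(go_id, obsolete)
        if new_id is None:
            comments.append(f"{gene}: removed annotation to obsolete term {go_id}")
            continue
        if new_id != go_id:
            comments.append(f"{gene}: {go_id} replaced by {new_id}")
        kept.append((gene, new_id))
    return kept, comments
```

Returning the notes alongside the surviving annotations mirrors the requirement that these actions be recorded in the comments file.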
PAINT GAF Files
File Header
All gene association files must start with a single line denoting the file format, followed by the date of creation and the PANTHER and GO versions, for example:
!gaf-version: 2.1
!Created on Thu Apr 5 23:42:21 2018.
!PANTHER version: v.13.1.
!GO version: 2017-12-27.
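As an illustration of how a consumer might read this header, the sketch below collects the leading "!" lines into a dictionary. It assumes each header item sits on its own line, as in a GAF file on disk; the function name and the "created" key are choices made for this example only.

```python
from typing import Dict, Iterable


def parse_header(lines: Iterable[str]) -> Dict[str, str]:
    """Collect the leading '!' lines of a PAINT GAF file into a dict."""
    meta: Dict[str, str] = {}
    for line in lines:
        if not line.startswith("!"):
            break  # header ends at the first annotation row
        body = line[1:].strip()
        if body.startswith("Created on "):
            # Checked first, because the timestamp itself contains colons.
            meta["created"] = body[len("Created on "):].rstrip(".")
        elif ":" in body:
            key, value = body.split(":", 1)
            meta[key.strip()] = value.strip().rstrip(".")
    return meta


header = parse_header([
    "!gaf-version: 2.1",
    "!Created on Thu Apr 5 23:42:21 2018.",
    "!PANTHER version: v.13.1.",
    "!GO version: 2017-12-27.",
])
print(header["GO version"])  # prints 2017-12-27
```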
Annotation Fields and Data Contents
DB (column 1)
Refers to the database from which the identifier in DB Object ID (column 2) is drawn. According to the GAF document, it must be one of the values from the set of GO database cross-references. Below are the DBs used for the GAF of each genome. The ones in red are the ones that are not in the GO database cross-references.
gene_association.paint_cgd.gaf: CGD, UniProtKB
gene_association.paint_dictyBase.gaf: UniProtKB, dictyBase → DictyBase
gene_association.paint_ecocyc.gaf: EcoGene, UniProtKB
gene_association.paint_fb.gaf: FB, UniProtKB
gene_association.paint_human.gaf: UniProtKB. Please note that UniProtKB IDs, NOT HGNC IDs, are used for human genes.
gene_association.paint_mgi.gaf: MGI, UniProtKB
gene_association.paint_pombase.gaf: PomBase, UniProtKB
gene_association.paint_rgd.gaf: RGD, UniProtKB
gene_association.paint_tair.gaf: Araport, TAIR, UniProtKB
gene_association.paint_wb.gaf: UniProtKB, WB
gene_association.paint_other.gaf: UniProtKB, WB (some CAEBR genes), Xenbase, ZFIN (???)
DB Object ID (column 2)
Primary identifier of the gene from the DB specified in column 1. Examples: MGI:1921966, SPCC895.04c, Q13217
DB Object Symbol (column 3)
A gene symbol or gene name is used here. This is usually taken from the Reference Proteome dataset, from the GN field of the FASTA file header.
Qualifier (column 4)
Enter a qualifier such as NOT, contributes_to, or colocalizes_with.
GO ID (column 5)
The GO identifier that is annotated to the gene in column 2, e.g., GO:0060070.
DB:Reference (column 6)
Evidence Code (column 7)
Should always be IBA
With [or] From (column 8)
The column contains the ancestral node PTN id that the annotation inherits from, as well as all the leaf sequence IDs with experimental annotations that are used as evidence for the IBD annotation to the ancestral node. They are in the following format:
Aspect (column 9)
Refers to the namespace or ontology to which the GO ID (column 5) belongs: P (biological process), F (molecular function), or C (cellular component).
DB Object Name (column 10)
Name of gene or gene product obtained from the UniProt Reference Proteomes. All annotated genomes should be in the UniProt Reference Proteomes to have the correct names.
DB Object Synonym (column 11)
This is the UniProtKB ID and the leaf PTN ID, separated by a pipe. If the DB (column 1) is already UniProtKB, the UniProtKB ID is not included. Example: UniProtKB:Q8R4E4|PTN000409177
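The rule for this column can be sketched in a few lines. The function name and signature are hypothetical, the sketch reads the rule as "omit the UniProtKB ID when column 1 is UniProtKB, keep the leaf PTN ID either way", and pairing the example ID with the MGI database is an assumption made only for the demonstration.

```python
def synonym_field(db: str, uniprot_id: str, leaf_ptn_id: str) -> str:
    """Build the DB Object Synonym value (column 11) for one annotation row."""
    parts = []
    if db != "UniProtKB":
        # The UniProtKB accession is only added when column 1 uses another DB.
        parts.append(f"UniProtKB:{uniprot_id}")
    parts.append(leaf_ptn_id)
    return "|".join(parts)


# "MGI" is an assumed value for column 1 here, chosen for illustration.
print(synonym_field("MGI", "Q8R4E4", "PTN000409177"))
# prints UniProtKB:Q8R4E4|PTN000409177
print(synonym_field("UniProtKB", "Q8R4E4", "PTN000409177"))
# prints PTN000409177
```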
DB Object Type (column 12)
It is always ‘protein’.
Taxon (column 13)
The taxon ID, in the following format: taxon:10090
Date (column 14)
Currently this is the date on which the update was made; it will be changed to the date on which the original curation was done.
Assigned By (column 15)
Annotation Extension (column 16)
Gene Product Form ID (column 17)
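Taken together, the fixed-value rules in the column descriptions above lend themselves to a mechanical check. The sketch below splits one tab-delimited GAF 2.1 row into its 17 columns and tests the rules this document states explicitly (evidence code, object type, aspect, taxon format). The column key names and function names are invented for this example, and the sample row is stitched together from example values in this document plus placeholders; it is not a real annotation.

```python
import re
from typing import Dict, List

# Key names are illustrative; the order follows columns 1-17 of GAF 2.1.
GAF_COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "db_reference", "evidence_code", "with_from", "aspect",
    "db_object_name", "db_object_synonym", "db_object_type",
    "taxon", "date", "assigned_by", "annotation_extension",
    "gene_product_form_id",
]


def parse_row(line: str) -> Dict[str, str]:
    """Split one tab-delimited annotation row into named columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GAF_COLUMNS):
        raise ValueError(f"expected {len(GAF_COLUMNS)} columns, got {len(fields)}")
    return dict(zip(GAF_COLUMNS, fields))


def check_paint_row(row: Dict[str, str]) -> List[str]:
    """Return a list of problems; an empty list means the row passed."""
    problems = []
    if row["evidence_code"] != "IBA":
        problems.append("column 7 should always be IBA")
    if row["db_object_type"] != "protein":
        problems.append("column 12 should always be protein")
    if row["aspect"] not in {"P", "F", "C"}:
        problems.append("column 9 should be P, F or C")
    if not re.fullmatch(r"taxon:\d+", row["taxon"]):
        problems.append("column 13 should look like taxon:10090")
    return problems


# Placeholder row: SYMBOL, REF, WITH, NAME, DATE and SOURCE are dummy values.
EXAMPLE_FIELDS = [
    "MGI", "MGI:1921966", "SYMBOL", "", "GO:0060070", "REF", "IBA", "WITH",
    "P", "NAME", "UniProtKB:Q8R4E4|PTN000409177", "protein", "taxon:10090",
    "DATE", "SOURCE", "", "",
]
EXAMPLE_LINE = "\t".join(EXAMPLE_FIELDS)

print(check_paint_row(parse_row(EXAMPLE_LINE)))  # prints []
```

Collecting problems in a list, rather than raising on the first one, makes it easier to write every issue for a row into a single report.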