GAF Col17 GeneProducts

This page describes a proposal for a new GAF column, column 17. The proposal was ratified at the 2008 SLC GO meeting. However, it was not fully formulated at that time and was actually presented as a summary of problems with existing GAF practices. See Annotation_of_Alternate_Spliceforms for this.

See http://www.geneontology.org/GO.format.gaf-2_0.shtml for the GAF 2.0 spec, which includes column 17.

Summary of changes

  1. Addition of a NEW constraint: the entity referenced by col 2 (DB_Object_ID) MUST be a canonical entity. In addition, the GAF must be non-redundant with respect to canonical entities in a genome.
  2. Addition of a NEW column: col 17, the spliceform column
  3. Change in meaning of an EXISTING column: col 12, the DB_Object_Type column now indicates the type of the entity in col 17. Previously it indicated the type of the entity in col 2.
  4. Addition of gaf-version header

Note that the first two changes are backwards compatible: software built to the previous specification will not give erroneous results. The first change essentially standardizes practice - many groups already follow this constraint. The second change introduces a new column, which can be ignored with only a loss of specificity.
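As a rough illustration of how a consumer might tell the two kinds of file apart, the sketch below looks for the gaf-version header among the leading comment lines. It assumes the header takes the form `!gaf-version: 2.0`, as in the GAF 2.0 spec linked above; the function name `read_gaf_version` is made up for this example.

```python
def read_gaf_version(lines):
    """Return the version from a '!gaf-version: X' header line, or None."""
    for line in lines:
        if not line.startswith("!"):
            break  # header comments end at the first data line
        if line.lower().startswith("!gaf-version:"):
            return line.split(":", 1)[1].strip()
    return None
```

A consumer could presumably treat a file with no such header as following the previous specification, in which col 12 describes the entity in col 2.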

However, the 3rd change alters the meaning of an existing column. As such, it is important that ALL USERS ARE AWARE OF THIS CHANGE.

Reason for changes

The full context is given in this document describing current practice in Annotation_of_Alternate_Spliceforms.

To summarise, we needed to:

  • standardize how groups were handling alternate splicing
  • make it simpler to do gene-centric analyses
    • remove the need for GAF consumers to map spliceform annotations to genes
    • eliminate redundancy at the gene annotation level
  • allow for the optional inclusion of annotation of alternate spliceforms

Specification details

Col 2: Canonical entity

Previously, the DB_Object_ID in col 2 (or rather cols 1, 2 and 3, as the GAF is a denormalized table) could reference various kinds of entities: genes, proteins, transcripts. GAFs could include separate entries for alternate spliceforms, and the only way to link these to the same gene was to use external ID mapping files.

Now, col 2 must always reference a canonical entity. In the context of this specification, a canonical entity is either a gene OR an abstract protein that has a 1-1 correspondence to a gene.

We anticipate that most MODs will continue to use gene identifiers to reference canonical entities. UniProtKB will continue to provide UniProtKB identifiers to reference canonical entities (TODO: add link to UniProtKB docs stating how these abstract proteins are constructed).

This additional constraint essentially standardizes practice - many groups already follow this constraint.

  • ISSUE: there were some objections to the term canonical. I am not sure of a good alternative. "Normative Entity"? "Abstract Gene" (yuk)

Col 17: Spliceform

This is the meat of the proposal. A new column is introduced for representing the specific spliceform of the gene product to which the annotation in col 5 (GO ID) applies.

This column is optional. Where no information is known about the specific gene product spliceform, the column may be blank.

If this column is populated, then the referenced spliceform must be a gene product of the gene referenced in col 2 (if col 2 has a reference to an abstract protein, then the spliceform must be a gene product of the gene that bears a 1-1 relation to the abstract protein).
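The constraint above could be checked along the following lines. This is only a sketch: the GAF itself does not say which gene products belong to which canonical entity, so the lookup table `product_to_canonical` is a hypothetical input that the caller would have to build from some other source, and the function name is invented for illustration.

```python
def spliceform_matches_canonical(cols, product_to_canonical):
    """Check that col 17 (if populated) is a product of the entity in cols 1-2.

    cols is one GAF line already split on tabs; product_to_canonical is a
    hypothetical dict mapping spliceform DB:ID -> canonical DB:ID.
    """
    spliceform = cols[16].strip() if len(cols) > 16 else ""
    if not spliceform:
        return True  # col 17 is optional
    canonical = f"{cols[0]}:{cols[1]}"
    if spliceform == canonical:
        return True  # the generic ID is repeated in col 17
    return product_to_canonical.get(spliceform) == canonical
```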

The meaning of a line in a GAF now becomes:

  • The entity in col 17 has either the function or localization indicated by the GO ID
  • The gene referenced in col 2 encodes the gene product in col 17

This allows greater specificity in annotation. It is technically not a change in semantics, as the gene in col 2 was always intended as a proxy for referencing the gene product.

Thus col 17 can be ignored with only a loss of specificity, not of correctness.
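The reading of a GAF line described above can be sketched in code. The example below assumes tab-separated columns and treats a missing or blank col 17 as "spliceform unknown", falling back to the col 1/col 2 entity as the proxy; the function and key names are invented for illustration.

```python
def annotated_entity(line):
    """Return the entity a GAF line is about, plus its GO ID and type."""
    cols = line.rstrip("\n").split("\t")
    cols += [""] * (17 - len(cols))      # pad lines that lack col 16/17
    canonical = f"{cols[0]}:{cols[1]}"   # cols 1 and 2
    spliceform = cols[16].strip()        # col 17, may be blank
    return {
        "canonical": canonical,
        "subject": spliceform or canonical,  # fall back to the proxy
        "go_id": cols[4],                    # col 5
        "type": cols[11],                    # col 12
    }
```

Software that only ever looks at `canonical` gets a gene-level view; software that uses `subject` gets the more specific spliceform where one is given.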

The identifier used in col 17 must be a standard 2-part global identifier, see Identifiers.

This identifier should be stable and dereferenceable in the usual way. For example, if UniProtKB:Q4VCS5-1 is the spliceform ID then there must be a stable UniProtKB record with this ID, with its own web page.
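A syntactic check for the 2-part form might look like the sketch below. The exact character rules live on the Identifiers page, so the regular expression here is an assumption (a database prefix without colons or whitespace, a colon, then a non-empty local ID); it says nothing about whether the identifier is actually stable or dereferenceable.

```python
import re

# Assumed shape: DB prefix, a colon, then a non-empty local identifier.
TWO_PART_ID = re.compile(r"^([A-Za-z0-9_.\-]+):(\S+)$")

def split_identifier(value):
    """Split 'DB:ID' into (db, local_id); raise ValueError if malformed."""
    match = TWO_PART_ID.match(value)
    if match is None:
        raise ValueError(f"not a 2-part identifier: {value!r}")
    return match.group(1), match.group(2)
```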

Issues:

  • Can this be blank? YES. If we don't know the isoform involved.
  • Can the generic UniProtKB protein ID go in here? YES. If we don't know the specific isoform but we know the parent UniProtKB we can put this in here.
  • Can non-UniProtKB IDs go in here?

Col 12: Type column references spliceform

Previously, the type column referenced the type of the entity in col 2. Now it references the type of the spliceform entity in col 17.

  • THIS IS A MAJOR CHANGE

If col 17 is blank, col 12 MUST still be populated. The type will be the type of the entity believed to have the function/localization described (typically a protein). Types are drawn from the following hierarchy:

  • gene_product
    • protein
    • ncRNA
      • rRNA
      • tRNA
      • snRNA
      • snoRNA
      • ...or any subtype of ncRNA in SO
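The hierarchy above can be written down as a small parent table, which makes it easy to ask whether a col 12 value is a kind of gene_product or a kind of ncRNA. This is a sketch only: it covers just the types listed here, so an unlisted value is reported rather than rejected, since it may be an SO subtype of ncRNA. The names `PARENT`, `is_a` and `col12_problem` are invented for this example.

```python
# Child -> parent, exactly as in the list above (None marks the root).
PARENT = {
    "gene_product": None,
    "protein": "gene_product",
    "ncRNA": "gene_product",
    "rRNA": "ncRNA",
    "tRNA": "ncRNA",
    "snRNA": "ncRNA",
    "snoRNA": "ncRNA",
}

def is_a(type_name, ancestor):
    """True if type_name equals ancestor or sits below it in PARENT."""
    current = type_name
    while current is not None:
        if current == ancestor:
            return True
        current = PARENT.get(current)
    return False

def col12_problem(col12):
    """Return a short problem description for a col 12 value, or None."""
    if not col12.strip():
        return "col 12 must be populated, even when col 17 is blank"
    if col12 not in PARENT:
        return "unlisted type; may still be a valid SO subtype of ncRNA"
    return None
```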

Examples

UniProtKB example

OLD way:

  Column 1    2         3           4  5                Col 12
  UniProtKB   Q4VCS5    AMOT_HUMAN     GO:0031410  ...  protein  ...
  UniProtKB   Q4VCS5    AMOT_HUMAN     GO:0043532  ...  protein  ...
  UniProtKB   Q4VCS5-1  AMOT_HUMAN     GO:0043116  ...  protein  ...
  UniProtKB   Q4VCS5-2  AMOT_HUMAN     GO:0043532  ...  protein  ...
  UniProtKB   Q4VCS5-2  AMOT_HUMAN     GO:0043536  ...  protein  ...

NEW way:

  Column 1    2         3           4  5                Col 12        Col 17
  UniProtKB   Q4VCS5    AMOT_HUMAN     GO:0031410  ...  protein  ...
  UniProtKB   Q4VCS5    AMOT_HUMAN     GO:0043532  ...  protein  ...
  UniProtKB   Q4VCS5    AMOT_HUMAN     GO:0043116  ...  protein  ...  UniProtKB:Q4VCS5-1
  UniProtKB   Q4VCS5    AMOT_HUMAN     GO:0043532  ...  protein  ...  UniProtKB:Q4VCS5-2
  UniProtKB   Q4VCS5    AMOT_HUMAN     GO:0043536  ...  protein  ...  UniProtKB:Q4VCS5-2
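The OLD-to-NEW rewrite shown in the tables could be automated roughly as follows. The sketch assumes that a UniProtKB isoform ID is the parent accession followed by a hyphen and a number, as in the example rows; that pattern, and the function name `to_new_style`, are assumptions of this example rather than part of the proposal. Other databases would need their own mapping from spliceform to canonical entity.

```python
import re

# Assumed isoform pattern, taken from the example rows: ACCESSION-N
ISOFORM = re.compile(r"^(?P<parent>[A-Z0-9]+)-(?P<n>\d+)$")

def to_new_style(cols):
    """Move a UniProtKB isoform ID from col 2 to col 17 (17-column output)."""
    cols = list(cols) + [""] * (17 - len(cols))  # pad to 17 columns
    match = ISOFORM.match(cols[1])
    if cols[0] == "UniProtKB" and match:
        cols[16] = f"{cols[0]}:{cols[1]}"   # col 17: full 2-part isoform ID
        cols[1] = match.group("parent")     # col 2: canonical entity
    return cols
```

Rows that already reference the canonical entity pass through unchanged, with col 17 left blank.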

Issues

  • col 17 - can it be left blank? (Answered YES under the Col 17 issues above.)

FAQ

What happened to column 16?

Column 16 has been reserved for Annotation_Cross_Products for some time now. It just so happens that col 17 will probably be populated before column 16, due to the complexity of the column 16 proposal.

What goes in column 2?

Gene identifiers, except in the case of UniProtKB, which supplies canonical protein IDs (these are essentially proxies for the gene).

How does this relate to the gp2protein file?

The concatenation of col 1 and col 2 of the GAF (using ':') should have a match in col 1 of the gp2protein file.
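A quick consistency check between the two files might look like this sketch. It assumes both files are tab-separated and that lines starting with `!` are comments; the function names are made up for the example.

```python
def gp2protein_keys(gp2protein_lines):
    """Collect the col 1 values (DB:ID) of a gp2protein file."""
    keys = set()
    for line in gp2protein_lines:
        if not line.strip() or line.startswith("!"):
            continue  # skip blank and comment lines
        keys.add(line.split("\t", 1)[0].strip())
    return keys

def unmatched_gaf_entities(gaf_lines, keys):
    """Return sorted col1:col2 values from the GAF that are absent from keys."""
    missing = set()
    for line in gaf_lines:
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 2:
            continue  # not a data line
        key = f"{cols[0]}:{cols[1]}"
        if key not in keys:
            missing.add(key)
    return sorted(missing)
```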

What goes in col 12?

The type of the entity in col 17.

What goes in col 12 if col 17 is blank?

  • gene_product

What goes in col 17?

The DB:ID of the entity which has the function / participates in the process / localizes to the component.