GAF Col17 GeneProducts

From GO Wiki

This page describes a proposal for a new GAF column, column 17. The proposal was ratified at the 2008 SLC GO meeting. However, it was not fully formulated at that time and was actually presented as a summary of problems with existing GAF practices. See Annotation_of_Alternate_Spliceforms for this.

Summary of changes

  1. Addition of a NEW constraint: the entity referenced by col 2 (DB_Object_ID) MUST be a canonical entity. In addition, the GAF must be non-redundant with respect to canonical entities in a genome.
  2. Addition of a NEW column: col 17, the spliceform column
  3. Change in meaning of an EXISTING column: col 12, the DB_Object_Type column now indicates the type of the entity in col 17. Previously it indicated the type of the entity in col 2.

Note that the first two changes are backwards compatible: software built to the previous specification will not give erroneous results. The first change essentially standardizes practice - many groups already follow this constraint. The second change introduces a new column, which can be ignored with only a loss of specificity.

However, the third change alters the meaning of an existing column. As such, it is important that ALL USERS ARE AWARE OF THIS CHANGE.

Reason for changes

The full context is given in this document describing current practice in Annotation_of_Alternate_Spliceforms.

To summarise, we needed to:

  • standardize how groups were handling alternate splicing
  • make it simpler to do gene-centric analyses
    • remove the need for GAF consumers to map spliceform annotations to genes
    • eliminate redundancy at the gene annotation level
  • allow for the optional inclusion of annotation of alternate spliceforms

Specification details

Col 2: Canonical entity

Previously, the DB_Object_ID in col 2 (or rather cols 1, 2 and 3, as the GAF is a denormalized table) could reference various kinds of entities: genes, proteins, transcripts. GAFs could include separate entries for alternate spliceforms, and the only way to link these to the same gene was to use external ID mapping files.

Now, col 2 must always reference a canonical entity. In the context of this specification, a canonical entity is either a gene OR an abstract protein that has a 1-1 correspondence to a gene.

We anticipate that most MODs will continue to use gene identifiers to reference canonical entities. UniProtKB will continue to provide UniProtKB identifiers to reference canonical entities (TODO: add link to UniProtKB docs stating how these abstract proteins are constructed).

This additional constraint essentially standardizes practice - many groups already follow this constraint.

  • ISSUE: there were some objections to the term canonical. I am not sure of a good alternative. "Normative Entity"? "Abstract Gene" (yuk)

Col 17: Spliceform

This is the meat of the proposal. A new column is introduced for representing the specific spliceform of the gene product to which the annotation in col 5 (GO ID) applies.

This column is optional. Where no information is known about the specific gene product spliceform, the column may be blank.

If this column is populated, the referenced spliceform must be a gene product of the gene referenced in col 2. If col 2 references an abstract protein, the spliceform must be a gene product of the gene that bears a 1-1 relation to that abstract protein.

The meaning of a line in a GAF now becomes:

  • The entity in col 17 has either the function or localization indicated by the GO ID
  • The gene referenced in col 2 encodes the gene product in col 17

This allows greater specificity in annotation. It is technically not a change in semantics, as the gene in col 2 was always intended as a proxy for referencing the gene product.

Thus col 17 can be ignored with only a loss of specificity, not of correctness.
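The reading described above can be sketched in Python. This is only an illustration, assuming a fully tab-delimited GAF line in which empty columns are preserved; the function names `parse_gaf_line` and `gene_centric_terms` are hypothetical and not part of any GO tool.

```python
from collections import defaultdict

def parse_gaf_line(line):
    """Split one tab-delimited GAF line into the fields this sketch uses."""
    cols = line.rstrip("\n").split("\t")
    # Pad short lines so files written before col 17 existed still parse.
    cols += [""] * (17 - len(cols))
    return {
        "db": cols[0],            # col 1
        "object_id": cols[1],     # col 2: canonical entity
        "go_id": cols[4],         # col 5: GO ID
        "object_type": cols[11],  # col 12
        "spliceform": cols[16],   # col 17: may be blank
    }

def gene_centric_terms(lines):
    """Group GO IDs by canonical entity (cols 1 and 2), ignoring col 17."""
    terms = defaultdict(set)
    for line in lines:
        rec = parse_gaf_line(line)
        terms[rec["db"] + ":" + rec["object_id"]].add(rec["go_id"])
    return dict(terms)
```

A consumer that only calls `gene_centric_terms` never looks at col 17, so it loses the spliceform detail but still attributes every GO ID to the right canonical entity.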

The identifier used in col 17 must be a standard 2-part global identifier; see Identifiers.

This identifier should be stable and dereferenceable in the usual way. For example, if UniProtKB:Q4VCS5-1 is the spliceform ID then there must be a stable UniProtKB record with this ID, with its own web page.
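A minimal shape check for such an identifier might look like the following Python sketch. It assumes only that "2-part" means a database prefix and a local ID separated by the first colon; the authoritative rules are on the Identifiers page, and the name `split_global_id` is hypothetical. It does not test whether the identifier is stable or dereferenceable.

```python
def split_global_id(identifier):
    """Split a 2-part global identifier of the form DB:LOCAL_ID."""
    db, sep, local_id = identifier.partition(":")
    # Reject values with no colon or with an empty part on either side.
    if not sep or not db or not local_id:
        raise ValueError(f"not a 2-part identifier: {identifier!r}")
    return db, local_id
```

For instance, `split_global_id("UniProtKB:Q4VCS5-1")` yields the pair `("UniProtKB", "Q4VCS5-1")`, while a bare `"Q4VCS5-1"` is rejected.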

Issues:

  • Can this be blank? YES. If we don't know the isoform involved.
  • Can the generic UniProtKB protein ID go in here? YES. If we don't know the specific isoform but we know the parent UniProtKB we can put this in here.
  • Can non-UniProtKB IDs go in here?

Col 12: Type column references spliceform

Previously, the type column referenced the type of the entity in col 2. Now it references the type of the spliceform entity in col 17.

  • THIS IS A MAJOR CHANGE

If col 17 is blank, col 12 MUST still be populated. The type will be the type of the entity believed to have the function/localization described (typically a protein).
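The difference between the old and new readings of col 12 can be made concrete with a short Python sketch. It assumes the line has already been split on tabs into a list with empty columns preserved; the function name `col12_subject` and its `proposal` flag are hypothetical.

```python
def col12_subject(cols, proposal=True):
    """Name the entity whose type col 12 gives, for one split GAF line.

    Returns None when that entity has no identifier on the line, which
    happens under the proposal whenever col 17 is blank.
    """
    if len(cols) < 12 or not cols[11]:
        raise ValueError("col 12 (DB_Object_Type) must be populated")
    if not proposal:
        # Previous reading: col 12 types the entity in col 2.
        return f"{cols[0]}:{cols[1]}"
    # Proposed reading: col 12 types the spliceform in col 17, if given.
    col17 = cols[16] if len(cols) > 16 else ""
    return col17 or None
```

Software that still applies the old reading would attach the type in col 12 to the col 2 entity, which is exactly the misinterpretation users need to be warned about.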

Examples

UniProtKB example

OLD way:

  UniProtKB	Q4VCS5	AMOT_HUMAN	GO:0031410	PMID:11257124	IDA	C	AMOT, KIAA1071: Angiomotin	IPI00163085	protein	taxon:9606	20051207	UniProtKB
  UniProtKB	Q4VCS5	AMOT_HUMAN	GO:0043532	PMID:11257124	IDA	F	AMOT, KIAA1071: Angiomotin	IPI00163085	protein	taxon:9606	20051207	UniProtKB
  UniProtKB	Q4VCS5-1	AMOT_HUMAN	GO:0043116	PMID:16043488	IDA	P	AMOT, KIAA1071: Angiomotin	IPI00163085	protein	taxon:9606	20051207	UniProtKB
  UniProtKB	Q4VCS5-1	AMOT_HUMAN	GO:0005515	PMID:16043488	IPI	UniProtKB:Q6RHR9-2	F	AMOT, KIAA1071: Angiomotin	IPI00163085	protein	taxon:9606	20051207	UniProtKB
  UniProtKB	Q4VCS5-2	AMOT_HUMAN	GO:0043532	PMID:16043488	IDA	F	AMOT, KIAA1071: Angiomotin	IPI00163085	protein	taxon:9606	20051207	UniProtKB
  UniProtKB	Q4VCS5-2	AMOT_HUMAN	GO:0043116	PMID:16043488	IDA	P	AMOT, KIAA1071: Angiomotin	IPI00163085	protein	taxon:9606	20051207	UniProtKB
  UniProtKB	Q4VCS5-2	AMOT_HUMAN	GO:0043536	PMID:16043488	IDA	P	AMOT, KIAA1071: Angiomotin	IPI00163085	protein	taxon:9606	20060317	UniProtKB

NEW way:

  UniProtKB	Q4VCS5	AMOT_HUMAN	GO:0031410	PMID:11257124	IDA	C	AMOT, KIAA1071: Angiomotin	IPI00163085	protein	taxon:9606	20051207	UniProtKB
  UniProtKB	Q4VCS5	AMOT_HUMAN	GO:0043532	PMID:11257124	IDA	F	AMOT, KIAA1071: Angiomotin	IPI00163085	protein	taxon:9606	20051207	UniProtKB
  UniProtKB	Q4VCS5	AMOT_HUMAN	GO:0043116	PMID:16043488	IDA	P	AMOT, KIAA1071: Angiomotin	IPI00163085	protein	taxon:9606	20051207	UniProtKB	UniProtKB:Q4VCS5-1
  UniProtKB	Q4VCS5	AMOT_HUMAN	GO:0005515	PMID:16043488	IPI	UniProtKB:Q6RHR9-2	F	AMOT, KIAA1071: Angiomotin	IPI00163085	protein	taxon:9606	20051207	UniProtKB	UniProtKB:Q4VCS5-1
  UniProtKB	Q4VCS5	AMOT_HUMAN	GO:0043532	PMID:16043488	IDA	F	AMOT, KIAA1071: Angiomotin	IPI00163085	protein	taxon:9606	20051207	UniProtKB	UniProtKB:Q4VCS5-2
  UniProtKB	Q4VCS5	AMOT_HUMAN	GO:0043116	PMID:16043488	IDA	P	AMOT, KIAA1071: Angiomotin	IPI00163085	protein	taxon:9606	20051207	UniProtKB	UniProtKB:Q4VCS5-2
  UniProtKB	Q4VCS5	AMOT_HUMAN	GO:0043536	PMID:16043488	IDA	P	AMOT, KIAA1071: Angiomotin	IPI00163085	protein	taxon:9606	20060317	UniProtKB	UniProtKB:Q4VCS5-2
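The move from the OLD way to the NEW way could be automated roughly as in the Python sketch below. It assumes each line has already been split on tabs into a list in GAF column order with empty columns preserved, and that a UniProtKB isoform ID is the parent accession followed by a hyphen and a number, as in the examples above. The names `ISOFORM` and `to_new_style` are hypothetical.

```python
import re

# Assumption: isoform ID = parent accession + "-" + isoform number.
ISOFORM = re.compile(r"^(?P<parent>[A-Z0-9]+)-(?P<n>\d+)$")

def to_new_style(cols):
    """Move an isoform ID from col 2 into col 17 of a split GAF line."""
    # Copy the input and pad it out to 17 columns.
    cols = list(cols) + [""] * (17 - len(cols))
    match = ISOFORM.match(cols[1])
    if cols[0] == "UniProtKB" and match:
        cols[16] = f"{cols[0]}:{cols[1]}"  # col 17: the specific spliceform
        cols[1] = match.group("parent")    # col 2: the canonical entity
    return cols
```

Lines whose col 2 is already a parent accession pass through unchanged apart from the padding, so col 17 stays blank for them.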

Issues

  • col 17 - can it be left blank? (Answered under Col 17 above: yes, if the isoform is not known.)

FAQ

What happened to column 16?

Column 16 has been reserved for Annotation_Cross_Products for some time now. Because the column 16 proposal is more complex, col 17 will probably be populated before column 16 is.