GAF Col17 GeneProducts
Summary of changes
- Addition of a NEW constraint: the entity referenced by col 2 (DB_Object_ID) MUST be a canonical entity. In addition, the GAF must be non-redundant with respect to canonical entities in a genome.
- Addition of a NEW column: col 17, the spliceform column
- Change in meaning of an EXISTING column: col 12, the DB_Object_Type column now indicates the type of the entity in col 17. Previously it indicated the type of the entity in col 2.
Note that the first two changes are backwards compatible: software built to the previous specification will not give erroneous results. The first change essentially standardizes practice - many groups already follow this constraint. The second change introduces a new column, which can be ignored with only a loss of specificity.
However, the third change alters the meaning of an existing column. It is therefore important that ALL USERS ARE AWARE OF THIS CHANGE.
Reason for changes
The full context is given in this document describing current practice in Annotation_of_Alternate_Spliceforms.
To summarise, we needed to:
- standardize how groups were handling alternate splicing
- make it simpler to do gene-centric analyses
- remove the need for GAF consumers to map spliceform annotations to genes
- eliminate redundancy at the gene annotation level
- allow for the optional inclusion of annotation of alternate spliceforms
Specification details
Col 2: Canonical entity
Previously, the DB_Object_ID in col 2 (or rather cols 1, 2 and 3, as the GAF is a denormalized table) could reference various kinds of entities: genes, proteins, transcripts. GAFs could include separate entries for alternate spliceforms, and the only way to link these to the same gene was to use external ID mapping files.
Now, col 2 must always reference a canonical entity. In the context of this specification, a canonical entity is either a gene OR an abstract protein that has a 1-1 correspondence to a gene.
We anticipate that most MODs will continue to use gene identifiers to reference canonical entities. UniProtKB will continue to provide UniProtKB identifiers to reference canonical entities (TODO: add link to UniProtKB docs stating how these abstract proteins are constructed).
This additional constraint essentially standardizes practice - many groups already follow this constraint.
- ISSUE: there were some objections to the term canonical. I am not sure of a good alternative. "Normative Entity"? "Abstract Gene" (yuk)
Col 17: Spliceform
This is the meat of the proposal. A new column is introduced for representing the specific spliceform of the gene product to which the annotation in col 5 (GO ID) applies.
This column is optional. Where no information is known about the specific gene product spliceform, the column may be blank.
If this column is populated, then the referenced spliceform must be a gene product of the gene referenced in col 2. If col 2 references an abstract protein, then the spliceform must be a gene product of the gene that bears a 1-1 relation to that abstract protein.
The meaning of a line in a GAF now becomes:
- The entity in col 17 has either the function or localization indicated by the GO ID
- The gene referenced in col 2 encodes the gene product in col 17
This allows greater specificity in annotation. It is technically not a change in semantics, as the gene in col 2 was always intended as a proxy for referencing the gene product.
Thus col 17 can be ignored with only a loss of specificity, not of correctness.
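As an illustration of this reading, the following is a minimal sketch of how a consumer might pick the most specific entity for an annotation line. It assumes the line is tab-separated with columns numbered from 1 as in this document; the function name `annotated_entity` is hypothetical and not part of any GO tool.

```python
def annotated_entity(gaf_line: str) -> str:
    """Return the most specific entity that the annotation applies to.

    Assumes a tab-separated GAF line. Lines written to the previous
    specification have no col 17 and are padded with empty fields.
    """
    cols = gaf_line.rstrip("\n").split("\t")
    cols += [""] * (17 - len(cols))      # no-op when 17 or more columns exist
    canonical = f"{cols[0]}:{cols[1]}"   # cols 1 and 2: the canonical entity
    spliceform = cols[16].strip()        # col 17: optional spliceform ID
    # Falling back to the canonical entity loses specificity, not correctness.
    return spliceform or canonical
```

A consumer that only needs gene-centric results can skip this step entirely and keep using cols 1 and 2.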
The identifier used in col 17 must be a standard 2-part global identifier (see Identifiers).
This identifier should be stable and dereferenceable in the usual way. For example, if UniProtKB:Q4VCS5-1 is the spliceform ID then there must be a stable UniProtKB record with this ID, with its own web page.
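A 2-part identifier has the shape DB:LocalID. The sketch below shows one way to split and sanity-check such a value; the function name `split_global_id` is hypothetical, and the check is purely syntactic (it does not test whether the identifier resolves to a record). It splits on the first colon only, since some databases use local IDs that themselves contain a colon.

```python
from typing import Tuple


def split_global_id(identifier: str) -> Tuple[str, str]:
    """Split a 2-part global identifier into (database, local ID)."""
    db, sep, local_id = identifier.strip().partition(":")
    if not sep or not db or not local_id:
        raise ValueError(f"not a 2-part global identifier: {identifier!r}")
    if any(ch.isspace() for ch in db + local_id):
        raise ValueError(f"identifier contains whitespace: {identifier!r}")
    return db, local_id


db, local_id = split_global_id("UniProtKB:Q4VCS5-1")
print(db, local_id)
```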
Col 12: Type column: references spliceform
Previously, the type column referenced the type of the entity in col 2. Now it references the type of the spliceform entity in col 17.
- THIS IS A MAJOR CHANGE
If col 17 is blank, col 12 MUST still be populated. The type will be the type of the entity believed to have the function/localization described (typically a protein).
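The constraints on cols 12 and 17 could be checked per row along the following lines. This is a sketch under stated assumptions, not a reference validator: `check_row` and the optional `products_of` mapping (canonical entity ID to the set of its known spliceform IDs) are hypothetical names, and the mapping would have to be supplied by the caller from whatever source is available.

```python
from typing import Dict, List, Optional, Set


def check_row(cols: List[str],
              products_of: Optional[Dict[str, Set[str]]] = None) -> List[str]:
    """Return a list of problems found in one GAF row (empty if none)."""
    cols = list(cols) + [""] * (17 - len(cols))  # tolerate rows without col 17
    problems = []
    canonical = f"{cols[0]}:{cols[1]}"           # cols 1 and 2
    obj_type = cols[11].strip()                  # col 12: DB_Object_Type
    spliceform = cols[16].strip()                # col 17: spliceform ID

    # Col 12 must be populated whether or not col 17 is blank.
    if not obj_type:
        problems.append("col 12 (DB_Object_Type) is empty")

    if spliceform:
        db, sep, local_id = spliceform.partition(":")
        if not (db and sep and local_id):
            problems.append("col 17 is not a 2-part identifier")
        elif (products_of is not None
              and spliceform not in products_of.get(canonical, set())):
            # Only checked when the caller supplies a mapping.
            problems.append(
                f"{spliceform} is not a known gene product of {canonical}")
    return problems
```

When no mapping is passed, only the syntactic checks run, so the function can still be used on files for which no gene-to-spliceform table exists.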
Examples
UniProtKB example
OLD way:
UniProtKB Q4VCS5 AMOT_HUMAN GO:0031410 PMID:11257124 IDA C AMOT, KIAA1071: Angiomotin IPI00163085 protein taxon:9606 20051207 UniProtKB
UniProtKB Q4VCS5 AMOT_HUMAN GO:0043532 PMID:11257124 IDA F AMOT, KIAA1071: Angiomotin IPI00163085 protein taxon:9606 20051207 UniProtKB
UniProtKB Q4VCS5-1 AMOT_HUMAN GO:0043116 PMID:16043488 IDA P AMOT, KIAA1071: Angiomotin IPI00163085 protein taxon:9606 20051207 UniProtKB
UniProtKB Q4VCS5-1 AMOT_HUMAN GO:0005515 PMID:16043488 IPI UniProtKB:Q6RHR9-2 F AMOT, KIAA1071: Angiomotin IPI00163085 protein taxon:9606 20051207 UniProtKB
UniProtKB Q4VCS5-2 AMOT_HUMAN GO:0043532 PMID:16043488 IDA F AMOT, KIAA1071: Angiomotin IPI00163085 protein taxon:9606 20051207 UniProtKB
UniProtKB Q4VCS5-2 AMOT_HUMAN GO:0043116 PMID:16043488 IDA P AMOT, KIAA1071: Angiomotin IPI00163085 protein taxon:9606 20051207 UniProtKB
UniProtKB Q4VCS5-2 AMOT_HUMAN GO:0043536 PMID:16043488 IDA P AMOT, KIAA1071: Angiomotin IPI00163085 protein taxon:9606 20060317 UniProtKB
NEW way:
UniProtKB Q4VCS5 AMOT_HUMAN GO:0031410 PMID:11257124 IDA C AMOT, KIAA1071: Angiomotin IPI00163085 protein taxon:9606 20051207 UniProtKB
UniProtKB Q4VCS5 AMOT_HUMAN GO:0043532 PMID:11257124 IDA F AMOT, KIAA1071: Angiomotin IPI00163085 protein taxon:9606 20051207 UniProtKB
UniProtKB Q4VCS5 AMOT_HUMAN GO:0043116 PMID:16043488 IDA P AMOT, KIAA1071: Angiomotin IPI00163085 protein taxon:9606 20051207 UniProtKB UniProtKB:Q4VCS5-1
UniProtKB Q4VCS5 AMOT_HUMAN GO:0005515 PMID:16043488 IPI UniProtKB:Q6RHR9-2 F AMOT, KIAA1071: Angiomotin IPI00163085 protein taxon:9606 20051207 UniProtKB UniProtKB:Q4VCS5-1
UniProtKB Q4VCS5 AMOT_HUMAN GO:0043532 PMID:16043488 IDA F AMOT, KIAA1071: Angiomotin IPI00163085 protein taxon:9606 20051207 UniProtKB UniProtKB:Q4VCS5-2
UniProtKB Q4VCS5 AMOT_HUMAN GO:0043116 PMID:16043488 IDA P AMOT, KIAA1071: Angiomotin IPI00163085 protein taxon:9606 20051207 UniProtKB UniProtKB:Q4VCS5-2
UniProtKB Q4VCS5 AMOT_HUMAN GO:0043536 PMID:16043488 IDA P AMOT, KIAA1071: Angiomotin IPI00163085 protein taxon:9606 20060317 UniProtKB UniProtKB:Q4VCS5-2
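The OLD-to-NEW rewrite in this example could be automated roughly as follows. This is only a sketch: it assumes tab-separated rows already split into fields, and it relies on the UniProtKB convention that an isoform accession is the canonical accession followed by a hyphen and a number. The function name `old_to_new` is hypothetical, and other databases would need an explicit spliceform-to-gene mapping instead of this pattern.

```python
import re
from typing import List

# Assumed shape of a UniProtKB isoform accession, e.g. Q4VCS5-1.
ISOFORM = re.compile(r"^(?P<acc>[A-Z0-9]+)-(?P<n>\d+)$")


def old_to_new(cols: List[str]) -> List[str]:
    """Move an isoform ID from col 2 to col 17, leaving the canonical ID in col 2."""
    cols = list(cols) + [""] * (17 - len(cols))  # pad to 17 columns
    match = ISOFORM.match(cols[1])
    if cols[0] == "UniProtKB" and match:
        cols[16] = f"{cols[0]}:{cols[1]}"        # col 17: 2-part spliceform ID
        cols[1] = match.group("acc")             # col 2: canonical accession
    return cols


old = ["UniProtKB", "Q4VCS5-1", "AMOT_HUMAN", "", "GO:0043116"]
new = old_to_new(old)
print(new[1], new[16])
```

Rows that already reference the canonical entity pass through unchanged apart from the padding, so the function can be applied to a whole file. Col 12 is left as is here, which is only appropriate when its value already describes the spliceform (as with "protein" in the example above).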