Annotation of Alternate Spliceforms
Latest revision as of 11:30, 12 April 2019
This page has been superseded by GAF_Spliceform_Column_Proposal. The page remains here to provide historical context.
The problem
GO annotations refer to attributes of gene products. Often the association between a GO term and a gene product is implicit: a gene identifier is used as a proxy for the gene product.
This is fine in many cases: gene identifiers serve as useful proxies for gene products in the context of functional annotation, as alternate spliceforms and protein forms often have similar functions. For organisms that rarely exhibit alternate splicing this is not an issue at all. However, sometimes different spliceforms have different functions (and localizations and processes), and curators would like to indicate this. One way is to switch from annotating gene identifiers to annotating gene product identifiers. Another is to carry on annotating to gene identifiers, but to indicate the specific spliceform some other way.
How do we indicate spliceform-specific functionality in the gene association files, in a way that still makes it simple to do comparisons at the gene level, and does not break expectations of existing software? How can we ensure this is done in a standard way across the GO?
This page assumes the reader is familiar with the GAF Spec
Current practice
For a summary of how MODs currently do this, see: Variant_annotation. What follows here is a summary.
This is the current practice in existing deposited association files:
- Most MODs annotate to genes.
- UniProt annotates to proteins. (CHECK: are these always the "canonical" protein for a gene?)
- Some MODs (WB, RGD, ...others?) do a mixture.
- Other groups express a desire to move to gene products OR to do a mixture
Some groups record additional information not communicated to the association files:
- MGI record the spliceform ID in their structured notes, where it is known that the experiment shows the F/P/C in a particular spliceform
- UniProt make additional annotations to spliceforms, which have IDs of the form: UniProt:P12345-1. It appears that not all of these annotations are submitted (this may have changed as of 2008-04)
Note that MGI do not exclude annotations: they still provide annotations at the gene level, and only the information on the specific spliceform is missing.
UniProt
We can see: Q4VCS5 and two splice-forms here:
http://www.ebi.ac.uk/ego/GProtein?ac=Q4VCS5
Format of annotations in both UniProt and Human GOA files:
UniProtKB Q4VCS5 AMOT_HUMAN GO:0031410 PMID:11257124 IDA C AMOT, KIAA1071: Angiomotin IPI00163085 protein taxon:9606 20051207 UniProtKB
UniProtKB Q4VCS5 AMOT_HUMAN GO:0043532 PMID:11257124 IDA F AMOT, KIAA1071: Angiomotin IPI00163085 protein taxon:9606 20051207 UniProtKB
UniProtKB Q4VCS5-1 AMOT_HUMAN GO:0043116 PMID:16043488 IDA P AMOT, KIAA1071: Angiomotin IPI00163085 protein taxon:9606 20051207 UniProtKB
UniProtKB Q4VCS5-1 AMOT_HUMAN GO:0005515 PMID:16043488 IPI UniProtKB:Q6RHR9-2 F AMOT, KIAA1071: Angiomotin IPI00163085 protein taxon:9606 20051207 UniProtKB
UniProtKB Q4VCS5-2 AMOT_HUMAN GO:0043532 PMID:16043488 IDA F AMOT, KIAA1071: Angiomotin IPI00163085 protein taxon:9606 20051207 UniProtKB
UniProtKB Q4VCS5-2 AMOT_HUMAN GO:0043116 PMID:16043488 IDA P AMOT, KIAA1071: Angiomotin IPI00163085 protein taxon:9606 20051207 UniProtKB
UniProtKB Q4VCS5-2 AMOT_HUMAN GO:0043536 PMID:16043488 IDA P AMOT, KIAA1071: Angiomotin IPI00163085 protein taxon:9606 20060317 UniProtKB
Individual spliceforms can be seen on the page for the canonical protein: http://beta.uniprot.org/uniprot/Q4VCS5#Q4VCS5-1
Note that Q4VCS5-1 is denoted the canonical form. This is typically the longest isoform.
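The relationship between a spliceform ID such as Q4VCS5-2 and its parent accession Q4VCS5 can be recovered mechanically from the ID itself. The following is a minimal sketch, assuming only the `ACCESSION-N` pattern seen in the rows above; the function name `split_isoform_id` is hypothetical and not part of any GO or UniProt tool.

```python
def split_isoform_id(accession):
    """Split an ID like 'Q4VCS5-2' into (parent accession, isoform number).

    Assumption: isoform IDs are the parent accession, a hyphen, and digits.
    IDs without such a suffix are returned unchanged with None as isoform.
    """
    base, sep, suffix = accession.partition("-")
    if sep and suffix.isdigit():
        return base, suffix
    return accession, None


if __name__ == "__main__":
    for acc in ("Q4VCS5", "Q4VCS5-1", "Q4VCS5-2"):
        print(acc, split_isoform_id(acc))
```

Software that wants gene-level (or canonical-protein-level) statistics could group annotations by the first element of the returned pair.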
Standardization
We seek a standard way of annotating gene products, both directly, via spliceform specific IDs, and indirectly, via gene IDs.
Ideally the standard will be non-lossy, in that it can capture everything the curator wishes to say. It should also exhibit "graceful degradation": simple software that does not take alternate spliceforms into account should "do the right thing" by default.
Here are two alternate approaches:
Annotating to the Spliceform/Isoform
Here column 2 would contain a peptide ID rather than a gene ID. Column 12 would say "protein".
This approach can be broken down into two alternate sub-approaches:
- mandating a protein ID rather than a gene ID throughout the whole file
- using a mixed approach, providing protein IDs where splice-form level annotation is known, gene ID otherwise
Annotating only to protein IDs
This is the approach UniProt uses (obviously, since the entities they curate are proteins).
A few things are not clear to me about the UniProt approach. Do the IDs denote "canonical" proteins? Each protein in UniProt is associated with a specific sequence, and thus with a specific protein isoform in cases where alternate splicing of the gene is involved. Do UniProt curators only attach GO annotations if they are sure the experiment described in the paper involved these specific forms? I am presuming not, and that the UniProt IDs denote a canonical protein, analogous to a gene record.
FlyBase have expressed a desire to switch to annotating canonical proteins. This seems reasonable. I think we need to do more work on exactly what it means to be a canonical protein. Does a CP have a specific sequence, or is it more akin to a generalisation of a collection of protein sequences? How is the CP related to the gene and to "non-canonical" proteins? If the CP to gene mapping is 1:1, is this really any different from using genes as proxies?
Annotating to a mix of gene IDs and protein IDs
Here a choice is made depending on how much information is available.
Currently WB uses the mixed approach. To illustrate, let us look at gene WBGene00000035, which has at least one known spliceform, CE07569.
Gene annotations:
WB WBGene00000035 ace-1 GO:0040012 WB:WBPaper00003620|PMID:10438595 IGI WB:WBGene00000036 P ACE1|XQ987|NM_078259 gene taxon:6239 20061031 WB
WB WBGene00000035 ace-1 GO:0040012 WB:WBPaper00006040|PMID:12911746 IGI WB:WBGene00000036 P ACE1|XQ987|NM_078259 gene taxon:6239 20061031 WB
WB WBGene00000035 ace-1 GO:0006581 WB:WBPaper00003620|PMID:10438595 IMP WB:p1000 P ACE1|XQ987|NM_078259 gene taxon:6239 20060925 WB
WB WBGene00000035 ace-1 GO:0003990 WB:WBPaper00003620|PMID:10438595 IMP WB:p1000 F ACE1|XQ987|NM_078259 gene taxon:6239 20060925 WB
WB CE07569 ACE-1 GO:0042802 WB:WBPaper00004251|PMID:10891266 IPI WB:WBGene00000035 F ACE1|XQ987|NM_078259 protein taxon:6239 20061023 WB
WB CE07569 ACE-1 GO:0042802 WB:WBPaper00004932|PMID:11580201 IPI WB:WBGene00000035 F ACE1|XQ987|NM_078259 protein taxon:6239 20061023 WB
WB WBGene00000036 ace-2 GO:0040012 WB:WBPaper00003620|PMID:10438595 IGI WB:WBGene00000035 P 1D872|NM_058740 gene taxon:6239 20061031 WB
WB WBGene00000036 ace-2 GO:0040012 WB:WBPaper00006040|PMID:12911746 IGI WB:WBGene00000035 P 1D872|NM_058740 gene taxon:6239 20061031 WB
WB WBGene00000037 ace-3 GO:0043058 WB:WBPaper00001039|PMID:3272166 IGI WB:WBGene00000035 P 2O499|NM_064562 gene taxon:6239 20060203 WB
WB WBGene00000037 ace-3 GO:0035188 WB:WBPaper00001039|PMID:3272166 IGI WB:WBGene00000035|WB:WBGene00000036 P 2O499|NM_064562 gene taxon:6239 20060203 WB
WB WBGene00000037 ace-3 GO:0050879 WB:WBPaper00001039|PMID:3272166 IGI WB:WBGene00000035|WB:WBGene00000036 P 2O499|NM_064562 gene taxon:6239 20060203 WB
WB WBGene00000037 ace-3 GO:0002119 WB:WBPaper00001039|PMID:3272166 IGI WB:WBGene00000035|WB:WBGene00000036 P 2O499|NM_064562 gene taxon:6239 20060203 WB
Protein annotations:
WB CE07569 ACE-1 GO:0006581 WB:WBPaper00002110|PMID:7835425 IDA P ACE1|XQ987|NM_078259 protein taxon:6239 20061016 WB
WB CE07569 ACE-1 GO:0006581 WB:WBPaper00004251|PMID:10891266 IDA P ACE1|XQ987|NM_078259 protein taxon:6239 20061016 WB
WB CE07569 ACE-1 GO:0001507 WB:WBPaper00004251|PMID:10891266 ISS UniProt:P07692 P ACE1|XQ987|NM_078259 protein taxon:6239 20061016 WB
WB CE07569 ACE-1 GO:0005623 WB:WBPaper00004251|PMID:10891266 IDA C ACE1|XQ987|NM_078259 protein taxon:6239 20060925 WB
WB CE07569 ACE-1 GO:0005576 WB:WBPaper00001929|PMID:8144590 IDA C ACE1|XQ987|NM_078259 protein taxon:6239 20061011 WB
WB CE07569 ACE-1 GO:0005626 WB:WBPaper00001929|PMID:8144590 IDA C ACE1|XQ987|NM_078259 protein taxon:6239 20061011 WB
WB CE07569 ACE-1 GO:0003990 WB:WBPaper00004251|PMID:10891266 IDA F ACE1|XQ987|NM_078259 protein taxon:6239 20061016 WB
WB CE07569 ACE-1 GO:0003990 WB:WBPaper00001929|PMID:8144590 IDA F ACE1|XQ987|NM_078259 protein taxon:6239 20061016 WB
WB CE07569 ACE-1 GO:0042802 WB:WBPaper00004251|PMID:10891266 IPI WB:WBGene00000035 F ACE1|XQ987|NM_078259 protein taxon:6239 20061023 WB
WB CE07569 ACE-1 GO:0042802 WB:WBPaper00004932|PMID:11580201 IPI WB:WBGene00000035 F ACE1|XQ987|NM_078259 protein taxon:6239 20061023 WB
Note there is some redundancy (PMID:11580201), but they are certainly not completely redundant.
The fact that CE07569 is a protein encoded by WBGene00000035 can be seen from the gp2protein file:
WB:CE07569 UniProtKB:P38433
WB:WBGene00000035 UniProtKB:P38433
The mixed approach exemplified by WB is problematic from the point of view of software that wishes to provide summary statistics or do any kind of enrichment analysis. Results will be biased in the above case, because ace-1 and ACE-1 will be treated as different entities.
Software could simply ONLY report for genes OR proteins - but this could lead to important omissions.
Software must explicitly use the gp2protein file in order to determine the relationship between these entities and report accordingly.
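As an illustration of what using the gp2protein file could look like, here is a minimal sketch. It assumes a two-column, whitespace-separated mapping (gene product ID, then protein ID) as in the excerpt above, and that lines starting with `!` are comments. The function name `build_collapse_map` and the rule for choosing a representative ID (prefer an ID containing "WBGene") are hypothetical choices made for this example only.

```python
from collections import defaultdict


def build_collapse_map(gp2protein_lines, is_gene=lambda i: "WBGene" in i):
    """Map every gene product ID to one representative ID.

    IDs that point at the same protein target are treated as the same
    entity. The representative is the first ID accepted by `is_gene`
    (a hypothetical rule), falling back to the first ID seen.
    """
    by_target = defaultdict(list)
    for line in gp2protein_lines:
        line = line.strip()
        if not line or line.startswith("!"):
            continue  # skip blank lines and comments
        parts = line.split()
        if len(parts) < 2:
            continue  # no mapping on this line
        gp_id, target = parts[0], parts[1]
        by_target[target].append(gp_id)

    collapse = {}
    for members in by_target.values():
        rep = next((m for m in members if is_gene(m)), members[0])
        for member in members:
            collapse[member] = rep
    return collapse


if __name__ == "__main__":
    lines = [
        "WB:CE07569\tUniProtKB:P38433",
        "WB:WBGene00000035\tUniProtKB:P38433",
    ]
    print(build_collapse_map(lines))
```

Statistics or enrichment software could then replace each annotated ID with its representative before counting, so that ace-1 and ACE-1 are counted once.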
Note that the refG display software has been modified to use the gp2protein file to collapse both the Ace gene and the ACE protein from WormBase here:
http://www.geneontology.org/images/RefGenomeGraphs/43.html
(Of course, it is commendable that WB are providing the full information set - the goal here is to standardize how this is done)
Annotating to the Gene / canonical spliceform, and indicating the spliceform as additional information
This approach is exemplified by MGI. It could be enshrined by GO, by providing an additional column 17 (note: 16 is reserved for annotation properties) in which a spliceform ID can be noted, if known.
For example, if WB were to do this we could concatenate the two sets above, using solely the gene ID in column 2, and for the second set give the CE07569 ID in column 17.
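The concatenation described above can be sketched as a per-row rewrite. This is only an illustration under stated assumptions: rows are already split into tab-delimited fields in GAF column order, the protein-to-gene lookup is supplied by the caller (for example derived from gp2protein), and the function name `to_gene_level` is hypothetical. The sketch leaves the symbol in column 3 untouched, which a real conversion would probably also want to update.

```python
def to_gene_level(row, protein_to_gene):
    """Rewrite one annotation row so column 2 holds the gene ID.

    If the ID in column 2 is a known protein/spliceform ID, it is moved
    to the proposed column 17 and replaced by its gene ID; column 12 is
    set to "gene". Other rows are only padded to 17 columns.
    """
    row = list(row) + [""] * (17 - len(row))  # pad short rows to 17 columns
    object_id = row[1]
    gene_id = protein_to_gene.get(object_id)
    if gene_id is not None:
        row[16] = object_id   # column 17: spliceform ID
        row[1] = gene_id      # column 2: gene ID
        row[11] = "gene"      # column 12: object type
    return row


if __name__ == "__main__":
    mapping = {"CE07569": "WBGene00000035"}  # illustrative lookup
    protein_row = ["WB", "CE07569", "ACE-1", "", "GO:0003990",
                   "WB:WBPaper00004251|PMID:10891266", "IDA", "", "F", "",
                   "ACE1|XQ987|NM_078259", "protein", "taxon:6239",
                   "20061016", "WB"]
    print("\t".join(to_gene_level(protein_row, mapping)))
```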
In the reference genome graph display, protein annotations would be collapsed into the gene annotations (alternate spliceform info could be optionally indicated). This is an advantage - if we show distinct protein IDs in the refG display it can misleadingly suggest additional homologs.
If UniProt were to do this they would always use IDs like UniProt:P12345 in column 2. However, where spliceform-specific information is known, an ID like UniProt:P12345-1 is added to column 17:
UniProtKB Q4VCS5 AMOT_HUMAN GO:0031410 PMID:11257124 IDA C AMOT, KIAA1071: Angiomotin IPI00163085 protein taxon:9606 20051207 UniProtKB
UniProtKB Q4VCS5 AMOT_HUMAN GO:0043532 PMID:11257124 IDA F AMOT, KIAA1071: Angiomotin IPI00163085 protein taxon:9606 20051207 UniProtKB
UniProtKB Q4VCS5 AMOT_HUMAN GO:0043116 PMID:16043488 IDA P AMOT, KIAA1071: Angiomotin IPI00163085 protein taxon:9606 20051207 UniProtKB Q4VCS5-1
UniProtKB Q4VCS5 AMOT_HUMAN GO:0005515 PMID:16043488 IPI UniProtKB:Q6RHR9-2 F AMOT, KIAA1071: Angiomotin IPI00163085 protein taxon:9606 20051207 UniProtKB Q4VCS5-1
UniProtKB Q4VCS5 AMOT_HUMAN GO:0043532 PMID:16043488 IDA F AMOT, KIAA1071: Angiomotin IPI00163085 protein taxon:9606 20051207 UniProtKB Q4VCS5-2
UniProtKB Q4VCS5 AMOT_HUMAN GO:0043116 PMID:16043488 IDA P AMOT, KIAA1071: Angiomotin IPI00163085 protein taxon:9606 20051207 UniProtKB
UniProtKB Q4VCS5 AMOT_HUMAN GO:0043536 PMID:16043488 IDA P AMOT, KIAA1071: Angiomotin IPI00163085 protein taxon:9606 20060317 UniProtKB Q4VCS5-2
The advantage of this approach is that column 17 can be ignored by software at a small loss of specificity, and the statistics will be essentially correct.
More advanced software can choose to use column 17 to provide additional information if required, and can also implement queries such as "find all genes that exhibit spliceform-specific localizations".
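A query like the one quoted above could be sketched as follows. The sketch assumes rows are lists of tab-delimited fields in GAF column order, with the aspect in column 9 ("C" for cellular component) and the spliceform ID in the proposed column 17; rows with fewer than 17 columns are treated as having no spliceform ID. The function name is hypothetical.

```python
def genes_with_spliceform_specific_localization(rows):
    """Return the set of column-2 IDs that have at least one cellular
    component annotation carrying a spliceform ID in column 17."""
    hits = set()
    for row in rows:
        aspect = row[8]                               # column 9: F, P or C
        form_id = row[16] if len(row) > 16 else ""    # column 17, if present
        if aspect == "C" and form_id:
            hits.add(row[1])                          # column 2: gene/canonical ID
    return hits
```

Simple software gets the gene-level view by reading only columns 1 to 15 and never looking at column 17, which is the graceful degradation this approach aims for.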
Protein coding vs non-coding
Conclusions
The two approaches outlined above are essentially inter-convertible. With the second approach, we are providing an additional service by mapping to the gene/canonical level such that everything is "on the same level".
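To make the inter-convertibility concrete, the following sketch goes from the second approach back to the first: where a row carries a spliceform ID in column 17, that ID is promoted into column 2. As before, this assumes rows are lists of fields in GAF column order; the function name `to_spliceform_level` and the optional `form_type` parameter (the value to write into column 12, for example "protein" when column 2 previously held a gene ID) are hypothetical.

```python
def to_spliceform_level(row, form_type=None):
    """Promote the column-17 spliceform ID, if any, into column 2.

    Rows without a column-17 value are returned unchanged (as a copy).
    """
    row = list(row)
    form_id = row[16] if len(row) > 16 else ""
    if form_id:
        row[1] = form_id          # column 2 now holds the spliceform ID
        row[16] = ""              # column 17 is emptied
        if form_type is not None:
            row[11] = form_type   # column 12: object type
    return row
```

The reverse direction needs an external lookup from spliceform ID to gene or canonical ID, which is the "additional service" that the second approach provides up front.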
The second approach has the overhead of an additional column. But this should be fairly simple, optional - and in the case of organisms like Yeast, rarely used.