Annotation of Alternate Spliceforms
Revision as of 12:06, 16 January 2008
The problem
GO annotations refer to attributes of gene products. Often the association between a GO term and a gene product is implicit, as we only have information at the gene level. This is fine in many cases, as alternate spliceforms often have similar functions. For organisms that rarely exhibit alternate splicing this is not an issue at all. However, sometimes different spliceforms have different functions (and different localizations and processes), and this is known to the curators. How do we indicate this in the association files, in a way that still makes it easy to do comparisons at the gene level?
See also
Variant_annotation
Current practice
This is the current practice in existing deposited association files:
- Most MODs annotate to genes.
- UniProt annotates to proteins. (CHECK: are these always the "canonical" protein for a gene?)
- Some MODs (WB only?) do a mixture.
- Other groups have expressed a desire to move to gene products, or to use a mixture.
Some groups record additional information not communicated to the association files:
- MGI record the spliceform ID in their structured notes, where it is known that the experiment shows the F/P/C in a particular spliceform
- UniProt make additional annotations to spliceforms, which have IDs of the form: UniProt:P12345-1. It appears that not all of these annotations are submitted.
Note that MGI do not exclude annotations: they still provide annotations at the gene level, and only the information on the specific spliceform is missing.
UniProt:
Q4VCS5 and its two spliceforms can be seen here:
http://www.ebi.ac.uk/ego/GSearch?query=Q4VCS5&mode=name_syno&ontology=all_ont
However, only annotations to the canonical form are submitted to goa_human:
UniProt Q4VCS5 AMOT_HUMAN GO:0043532 PMID:11257124 IDA F AMOT, KIAA1071: Angiomotin IPI00163085 protein taxon:9606 20051207 UniProt
UniProt Q4VCS5 AMOT_HUMAN GO:0030036 PMID:16043488 TAS P AMOT, KIAA1071: Angiomotin IPI00163085 protein taxon:9606 20051207 UniProt
UniProt Q4VCS5 AMOT_HUMAN GO:0043536 PMID:11257124 IDA P AMOT, KIAA1071: Angiomotin IPI00163085 protein taxon:9606 20051207 UniProt
UniProt Q4VCS5 AMOT_HUMAN GO:0045766 PMID:11257124 IDA P AMOT, KIAA1071: Angiomotin IPI00163085 protein taxon:9606 20051109 UniProt
UniProt Q4VCS5 AMOT_HUMAN GO:0005884 PMID:11257124 IDA C AMOT, KIAA1071: Angiomotin IPI00163085 protein taxon:9606 20051110 UniProt
UniProt Q4VCS5 AMOT_HUMAN GO:0005923 GOA:spkw|GO_REF:0000004 IEA SP_KW:KW-0796 C AMOT, KIAA1071: Angiomotin IPI00163085 protein taxon:9606 20071003 UniProt
UniProt Q4VCS5 AMOT_HUMAN GO:0030027 PMID:11257124 IDA C AMOT, KIAA1071: Angiomotin IPI00163085 protein taxon:9606 20051109 UniProt
UniProt Q4VCS5 AMOT_HUMAN GO:0030054 GOA:spkw|GO_REF:0000004 IEA SP_KW:KW-0965 C AMOT, KIAA1071: Angiomotin IPI00163085 protein taxon:9606 20071003 UniProt
UniProt Q4VCS5 AMOT_HUMAN GO:0031410 PMID:11257124 IDA C AMOT, KIAA1071: Angiomotin IPI00163085 protein taxon:9606 20051207 UniProt
Standardization
We seek a standard way of annotating gene products, both directly, via spliceform-specific IDs, and indirectly, via gene IDs.
Ideally the standard will be non-lossy, in that it can capture everything the curator wishes to say. It should also exhibit "graceful degradation" - that is, simple software that does not take alternate spliceforms into account should "do the right thing" by default.
Here are two alternate approaches:
Annotating to the Spliceform
Here column 2 would contain a peptide ID rather than a gene ID. Column 12 would say "protein".
This approach can be broken down into two alternate sub-approaches:
- mandating a protein ID rather than a gene ID throughout the whole file
- using a mixed approach, providing protein IDs where splice-form level annotation is known, gene ID otherwise
Mandating a protein ID is probably too extreme: gene IDs are ubiquitous and convenient.
Currently WB uses the mixed approach. To illustrate, let us look at gene WBGene00000035, which has at least one known spliceform, CE07569.
Gene annotations:
WB WBGene00000035 ace-1 GO:0040012 WB:WBPaper00003620|PMID:10438595 IGI WB:WBGene00000036 P ACE1|XQ987|NM_078259 gene taxon:6239 20061031 WB
WB WBGene00000035 ace-1 GO:0040012 WB:WBPaper00006040|PMID:12911746 IGI WB:WBGene00000036 P ACE1|XQ987|NM_078259 gene taxon:6239 20061031 WB
WB WBGene00000035 ace-1 GO:0006581 WB:WBPaper00003620|PMID:10438595 IMP WB:p1000 P ACE1|XQ987|NM_078259 gene taxon:6239 20060925 WB
WB WBGene00000035 ace-1 GO:0003990 WB:WBPaper00003620|PMID:10438595 IMP WB:p1000 F ACE1|XQ987|NM_078259 gene taxon:6239 20060925 WB
WB CE07569 ACE-1 GO:0042802 WB:WBPaper00004251|PMID:10891266 IPI WB:WBGene00000035 F ACE1|XQ987|NM_078259 protein taxon:6239 20061023 WB
WB CE07569 ACE-1 GO:0042802 WB:WBPaper00004932|PMID:11580201 IPI WB:WBGene00000035 F ACE1|XQ987|NM_078259 protein taxon:6239 20061023 WB
WB WBGene00000036 ace-2 GO:0040012 WB:WBPaper00003620|PMID:10438595 IGI WB:WBGene00000035 P 1D872|NM_058740 gene taxon:6239 20061031 WB
WB WBGene00000036 ace-2 GO:0040012 WB:WBPaper00006040|PMID:12911746 IGI WB:WBGene00000035 P 1D872|NM_058740 gene taxon:6239 20061031 WB
WB WBGene00000037 ace-3 GO:0043058 WB:WBPaper00001039|PMID:3272166 IGI WB:WBGene00000035 P 2O499|NM_064562 gene taxon:6239 20060203 WB
WB WBGene00000037 ace-3 GO:0035188 WB:WBPaper00001039|PMID:3272166 IGI WB:WBGene00000035|WB:WBGene00000036 P 2O499|NM_064562 gene taxon:6239 20060203 WB
WB WBGene00000037 ace-3 GO:0050879 WB:WBPaper00001039|PMID:3272166 IGI WB:WBGene00000035|WB:WBGene00000036 P 2O499|NM_064562 gene taxon:6239 20060203 WB
WB WBGene00000037 ace-3 GO:0002119 WB:WBPaper00001039|PMID:3272166 IGI WB:WBGene00000035|WB:WBGene00000036 P 2O499|NM_064562 gene taxon:6239 20060203 WB
Protein annotations:
WB CE07569 ACE-1 GO:0006581 WB:WBPaper00002110|PMID:7835425 IDA P ACE1|XQ987|NM_078259 protein taxon:6239 20061016 WB
WB CE07569 ACE-1 GO:0006581 WB:WBPaper00004251|PMID:10891266 IDA P ACE1|XQ987|NM_078259 protein taxon:6239 20061016 WB
WB CE07569 ACE-1 GO:0001507 WB:WBPaper00004251|PMID:10891266 ISS UniProt:P07692 P ACE1|XQ987|NM_078259 protein taxon:6239 20061016 WB
WB CE07569 ACE-1 GO:0005623 WB:WBPaper00004251|PMID:10891266 IDA C ACE1|XQ987|NM_078259 protein taxon:6239 20060925 WB
WB CE07569 ACE-1 GO:0005576 WB:WBPaper00001929|PMID:8144590 IDA C ACE1|XQ987|NM_078259 protein taxon:6239 20061011 WB
WB CE07569 ACE-1 GO:0005626 WB:WBPaper00001929|PMID:8144590 IDA C ACE1|XQ987|NM_078259 protein taxon:6239 20061011 WB
WB CE07569 ACE-1 GO:0003990 WB:WBPaper00004251|PMID:10891266 IDA F ACE1|XQ987|NM_078259 protein taxon:6239 20061016 WB
WB CE07569 ACE-1 GO:0003990 WB:WBPaper00001929|PMID:8144590 IDA F ACE1|XQ987|NM_078259 protein taxon:6239 20061016 WB
WB CE07569 ACE-1 GO:0042802 WB:WBPaper00004251|PMID:10891266 IPI WB:WBGene00000035 F ACE1|XQ987|NM_078259 protein taxon:6239 20061023 WB
WB CE07569 ACE-1 GO:0042802 WB:WBPaper00004932|PMID:11580201 IPI WB:WBGene00000035 F ACE1|XQ987|NM_078259 protein taxon:6239 20061023 WB
Note that there is some redundancy between the two sets (e.g. the PMID:11580201 annotation), but they are certainly not completely redundant.
The fact that CE07569 is a protein encoded by WBGene00000035 can be seen from the gp2protein file:
WB:CE07569 UniProtKB:P38433
WB:WBGene00000035 UniProtKB:P38433
The mixed approach exemplified by WB is problematic from the point of view of software that wishes to provide summary statistics or do any kind of enrichment analysis. Results will be biased in the above case, because ace-1 and ACE-1 will be treated as different entities.
Software could simply report on genes only, or on proteins only, but this could lead to important omissions.
To report correctly, software must explicitly use the gp2protein file to determine the relationship between these entities.
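As a rough sketch of how software might use the gp2protein file for this, the following Python groups IDs that share a protein accession and maps each protein ID onto its gene. The function names, the two-column tab-delimited layout with one accession per line, and the `WB:WBGene` prefix test for recognising gene IDs are assumptions made for illustration; they are not part of any existing GO tool.

```python
from collections import defaultdict

def load_gp2protein(lines):
    """Parse lines of the assumed form '<gene product ID><TAB><protein accession>'."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("!"):  # skip blanks and comment lines
            continue
        gp_id, protein_acc = line.split("\t")[:2]
        mapping[gp_id] = protein_acc
    return mapping

def protein_to_gene(gp2protein, is_gene):
    """Map each non-gene ID to the single gene ID sharing its protein accession."""
    by_acc = defaultdict(list)
    for gp_id, acc in gp2protein.items():
        by_acc[acc].append(gp_id)
    result = {}
    for ids in by_acc.values():
        genes = [i for i in ids if is_gene(i)]
        if len(genes) == 1:  # ambiguous or gene-less groups are left unmapped
            for i in ids:
                if not is_gene(i):
                    result[i] = genes[0]
    return result

gp2protein = load_gp2protein([
    "WB:CE07569\tUniProtKB:P38433",
    "WB:WBGene00000035\tUniProtKB:P38433",
])
# Hypothetical rule for telling genes from proteins in WB IDs.
collapse = protein_to_gene(gp2protein, is_gene=lambda i: i.startswith("WB:WBGene"))
print(collapse)
```

With this mapping in hand, annotations to CE07569 can be counted under WBGene00000035 rather than as a separate entity.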
Note that the refG display software has been modified to use the gp2protein file to collapse both the Ace gene and ACE protein from WormBase here:
http://www.geneontology.org/images/RefGenomeGraphs/43.html
(Of course, it is commendable that WB are providing the full information set - the goal here is to standardize how this is done)
Annotating to the Gene / canonical spliceform, and indicating the spliceform as additional information
This approach is exemplified by MGI. It could be enshrined by GO, by providing an additional column 17 (note: 16 is reserved for annotation properties) in which a spliceform ID can be noted, if known.
For example, if WB were to do this we could concatenate the two sets above, using solely the gene ID in column 2, and for the second set give the CE07569 ID in column 17.
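The conversion described here could be sketched roughly as follows. The sketch assumes rows are already split into the 15 standard columns, that a protein-to-gene lookup is available, and that column 17 holds a `DB:ID` string; the function name, the choice to rewrite column 12 to "gene", and the exact form of the column 17 value are illustrative assumptions, not an agreed format.

```python
def to_gene_level(row, protein_to_gene):
    """Rewrite one annotation row to gene level, padding it to 17 columns.

    protein_to_gene maps a column-2 protein ID to its gene ID (hypothetical lookup).
    """
    row = list(row) + [""] * (17 - len(row))
    db, obj_id = row[0], row[1]
    gene_id = protein_to_gene.get(obj_id)
    if gene_id is not None:
        row[16] = f"{db}:{obj_id}"  # column 17: spliceform the evidence refers to
        row[1] = gene_id            # column 2: gene ID
        row[11] = "gene"            # column 12: object type (assumed to change too)
        # Column 3 (symbol) is left untouched in this sketch.
    return row

protein_row = ["WB", "CE07569", "ACE-1", "", "GO:0003990",
               "WB:WBPaper00004251|PMID:10891266", "IDA", "", "F", "",
               "ACE1|XQ987|NM_078259", "protein", "taxon:6239", "20061016", "WB"]
converted = to_gene_level(protein_row, {"CE07569": "WBGene00000035"})
print(converted[1], converted[11], converted[16])
```

Rows that are already at the gene level pass through unchanged apart from the padding, so the two sets can simply be concatenated after conversion.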
In the reference genome graph display, protein annotations would be collapsed into the gene annotations (alternate spliceform info could be optionally indicated). This is an advantage - if we show distinct protein IDs in the refG display it can misleadingly suggest additional homologs.
If UniProt were to do this they would always use IDs like UniProt:P12345 in column 2. However, where spliceform-specific information is known, an ID like UniProt:P12345-1 would be added to column 17.
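For UniProt-style IDs, the canonical ID can be derived from the spliceform ID by dropping the numeric suffix. A minimal sketch, assuming spliceform IDs always take the form shown above (an accession followed by a hyphen and a number); the function name is hypothetical:

```python
def split_isoform(accession):
    """Return (canonical ID, spliceform ID or None) for IDs like 'UniProt:P12345-1'."""
    base, sep, suffix = accession.rpartition("-")
    if sep and suffix.isdigit():
        return base, accession
    return accession, None

print(split_isoform("UniProt:P12345-1"))
print(split_isoform("UniProt:P12345"))
```

The first element would go in column 2 and the second, when present, in column 17.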
The advantage of this approach is that column 17 can be ignored by software at a small loss of specificity, and the statistics will be essentially correct.
More advanced software can choose to use column 17 to provide additional information if required, and can also implement queries such as "find all genes that exhibit spliceform-specific localizations".
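Both kinds of software can be sketched in a few lines of Python. The rows below are illustrative, adapted from the WB example as it might look under the column 17 proposal, and only the columns the code reads are filled in. The helper names are hypothetical, and "spliceform-specific localization" is interpreted loosely here as any cellular component (aspect C) annotation that carries a column 17 value.

```python
from collections import defaultdict

def make_row(obj_id, go_id, aspect, spliceform=""):
    """Build a 17-column row with only the columns used below filled in."""
    row = [""] * 17
    row[0], row[1], row[4], row[8], row[16] = "WB", obj_id, go_id, aspect, spliceform
    return row

def gene_term_pairs(rows):
    """Simple software: ignore column 17 and work with (gene, term) pairs."""
    return {(r[1], r[4]) for r in rows}

def spliceform_specific_localizations(rows):
    """Advanced software: genes with aspect-C annotations tied to a spliceform."""
    hits = defaultdict(set)
    for r in rows:
        spliceform = r[16] if len(r) > 16 else ""
        if r[8] == "C" and spliceform:
            hits[r[1]].add((r[4], spliceform))
    return dict(hits)

rows = [
    make_row("WBGene00000035", "GO:0003990", "F"),
    make_row("WBGene00000035", "GO:0003990", "F", "WB:CE07569"),
    make_row("WBGene00000035", "GO:0005576", "C", "WB:CE07569"),
    make_row("WBGene00000036", "GO:0040012", "P"),
]
print(len(gene_term_pairs(rows)))
print(spliceform_specific_localizations(rows))
```

Because the gene-level view is a set of pairs, the two GO:0003990 rows count once, which is the "graceful degradation" behaviour the standard is aiming for.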
Conclusions
The two approaches outlined above are essentially inter-convertible. With the second approach, we are providing an additional service by mapping to the gene/canonical level such that everything is "on the same level".
The second approach has the overhead of an additional column. But this column should be fairly simple and optional, and in the case of organisms like yeast it would rarely be used.