Annotation of Alternate Spliceforms

From GO Wiki
Revision as of 13:49, 29 November 2007 by Cjm (talk | contribs) (→‎Standardization)

The problem

GO annotations refer to attributes of gene products. Often the association between a GO term and a gene product is implicit, as we only have information at the gene level. This is fine for many cases, as alternate spliceforms often have similar function. For organisms that rarely exhibit alternate splicing this is not an issue at all. However, sometimes different spliceforms have different functions (and localizations and processes), and this is known by the curators. How do we indicate this in the association files, in a way that still makes it easy to do comparisons at the gene level?

Current practice

This is the current practice in existing deposited association files:

  • Most MODs annotate to genes.
  • UniProt annotates to proteins. (CHECK: are these always the "canonical" protein for a gene?)
  • Some MODs (WB only?) do a mixture.
  • Other groups express a desire to move to gene products OR to do a mixture

Some groups record additional information not communicated to the association files:

  • MGI record the spliceform ID in their structured notes, where it is known that the experiment shows the F/P/C in a particular spliceform
  • UniProt make additional annotations to spliceforms, which have IDs of the form: UniProt:P12345-1. It appears that not all of these annotations are submitted

Note that MGI do not exclude these annotations: they are still provided at the gene level, and only the information on the specific spliceform is missing.

UniProt:

Q4VCS5 and its two spliceforms can be seen here:

http://www.ebi.ac.uk/ego/GSearch?query=Q4VCS5&mode=name_syno&ontology=all_ont

However, only annotations to the canonical form are submitted to goa_human:

 UniProt	Q4VCS5	AMOT_HUMAN		GO:0043532	PMID:11257124	IDA		F	AMOT, KIAA1071: Angiomotin	IPI00163085	protein	taxon:9606	20051207	UniProt
 UniProt	Q4VCS5	AMOT_HUMAN		GO:0030036	PMID:16043488	TAS		P	AMOT, KIAA1071: Angiomotin	IPI00163085	protein	taxon:9606	20051207	UniProt
 UniProt	Q4VCS5	AMOT_HUMAN		GO:0043536	PMID:11257124	IDA		P	AMOT, KIAA1071: Angiomotin	IPI00163085	protein	taxon:9606	20051207	UniProt
 UniProt	Q4VCS5	AMOT_HUMAN		GO:0045766	PMID:11257124	IDA		P	AMOT, KIAA1071: Angiomotin	IPI00163085	protein	taxon:9606	20051109	UniProt
 UniProt	Q4VCS5	AMOT_HUMAN		GO:0005884	PMID:11257124	IDA		C	AMOT, KIAA1071: Angiomotin	IPI00163085	protein	taxon:9606	20051110	UniProt
 UniProt	Q4VCS5	AMOT_HUMAN		GO:0005923	GOA:spkw|GO_REF:0000004	IEA	SP_KW:KW-0796	C	AMOT, KIAA1071: Angiomotin	IPI00163085	protein	taxon:9606	20071003	UniProt
 UniProt	Q4VCS5	AMOT_HUMAN		GO:0030027	PMID:11257124	IDA		C	AMOT, KIAA1071: Angiomotin	IPI00163085	protein	taxon:9606	20051109	UniProt
 UniProt	Q4VCS5	AMOT_HUMAN		GO:0030054	GOA:spkw|GO_REF:0000004	IEA	SP_KW:KW-0965	C	AMOT, KIAA1071: Angiomotin	IPI00163085	protein	taxon:9606	20071003	UniProt
 UniProt	Q4VCS5	AMOT_HUMAN		GO:0031410	PMID:11257124	IDA		C	AMOT, KIAA1071: Angiomotin	IPI00163085	protein	taxon:9606	20051207	UniProt
 

Standardization

We seek a standard way of annotating gene products, both directly, via spliceform specific IDs, and indirectly, via gene IDs.

Ideally the standard will be non-lossy, in that it can capture everything the curator wishes to say. It should also exhibit "graceful degradation": that is, simple software that does not take alternate spliceforms into account should "do the right thing" by default.

Here are two alternate approaches:

Annotating to the Spliceform

Here column 2 would contain a peptide ID rather than a gene ID. Column 12 would say "protein"

This approach can be broken down into two alternate sub-approaches:

  1. mandating a protein ID rather than a gene ID throughout the whole file
  2. using a mixed approach, providing protein IDs where splice-form level annotation is known, gene ID otherwise

Mandating a protein ID is probably too extreme: gene IDs are ubiquitous and convenient

Currently WB uses the mixed approach. To illustrate, let us look at gene WBGene00000035, which has at least one known spliceform, CE07569

Gene annotations:

 WB	WBGene00000035	ace-1		GO:0040012	WB:WBPaper00003620|PMID:10438595	IGI	WB:WBGene00000036	P		ACE1|XQ987|NM_078259	gene	taxon:6239	20061031	WB
 WB	WBGene00000035	ace-1		GO:0040012	WB:WBPaper00006040|PMID:12911746	IGI	WB:WBGene00000036	P		ACE1|XQ987|NM_078259	gene	taxon:6239	20061031	WB
 WB	WBGene00000035	ace-1		GO:0006581	WB:WBPaper00003620|PMID:10438595	IMP	WB:p1000	P		ACE1|XQ987|NM_078259	gene	taxon:6239	20060925	WB
 WB	WBGene00000035	ace-1		GO:0003990	WB:WBPaper00003620|PMID:10438595	IMP	WB:p1000	F		ACE1|XQ987|NM_078259	gene	taxon:6239	20060925	WB
 WB	CE07569	ACE-1		GO:0042802	WB:WBPaper00004251|PMID:10891266	IPI	WB:WBGene00000035	F		ACE1|XQ987|NM_078259	protein	taxon:6239	20061023	WB
 WB	CE07569	ACE-1		GO:0042802	WB:WBPaper00004932|PMID:11580201	IPI	WB:WBGene00000035	F		ACE1|XQ987|NM_078259	protein	taxon:6239	20061023	WB
 WB	WBGene00000036	ace-2		GO:0040012	WB:WBPaper00003620|PMID:10438595	IGI	WB:WBGene00000035	P		1D872|NM_058740	gene	taxon:6239	20061031	WB
 WB	WBGene00000036	ace-2		GO:0040012	WB:WBPaper00006040|PMID:12911746	IGI	WB:WBGene00000035	P		1D872|NM_058740	gene	taxon:6239	20061031	WB
 WB	WBGene00000037	ace-3		GO:0043058	WB:WBPaper00001039|PMID:3272166	IGI	WB:WBGene00000035	P		2O499|NM_064562	gene	taxon:6239	20060203	WB
 WB	WBGene00000037	ace-3		GO:0035188	WB:WBPaper00001039|PMID:3272166	IGI	WB:WBGene00000035|WB:WBGene00000036	P		2O499|NM_064562	gene	taxon:6239	20060203	WB
 WB	WBGene00000037	ace-3		GO:0050879	WB:WBPaper00001039|PMID:3272166	IGI	WB:WBGene00000035|WB:WBGene00000036	P		2O499|NM_064562	gene	taxon:6239	20060203	WB
 WB	WBGene00000037	ace-3		GO:0002119	WB:WBPaper00001039|PMID:3272166	IGI	WB:WBGene00000035|WB:WBGene00000036	P		2O499|NM_064562	gene	taxon:6239	20060203	WB

Protein annotations:

 WB	CE07569	ACE-1		GO:0006581	WB:WBPaper00002110|PMID:7835425	IDA		P		ACE1|XQ987|NM_078259	protein	taxon:6239	20061016	WB
 WB	CE07569	ACE-1		GO:0006581	WB:WBPaper00004251|PMID:10891266	IDA		P		ACE1|XQ987|NM_078259	protein	taxon:6239	20061016	WB
 WB	CE07569	ACE-1		GO:0001507	WB:WBPaper00004251|PMID:10891266	ISS	UniProt:P07692	P		ACE1|XQ987|NM_078259	protein	taxon:6239	20061016	WB
 WB	CE07569	ACE-1		GO:0005623	WB:WBPaper00004251|PMID:10891266	IDA		C		ACE1|XQ987|NM_078259	protein	taxon:6239	20060925	WB
 WB	CE07569	ACE-1		GO:0005576	WB:WBPaper00001929|PMID:8144590	IDA		C		ACE1|XQ987|NM_078259	protein	taxon:6239	20061011	WB
 WB	CE07569	ACE-1		GO:0005626	WB:WBPaper00001929|PMID:8144590	IDA		C		ACE1|XQ987|NM_078259	protein	taxon:6239	20061011	WB
 WB	CE07569	ACE-1		GO:0003990	WB:WBPaper00004251|PMID:10891266	IDA		F		ACE1|XQ987|NM_078259	protein	taxon:6239	20061016	WB
 WB	CE07569	ACE-1		GO:0003990	WB:WBPaper00001929|PMID:8144590	IDA		F		ACE1|XQ987|NM_078259	protein	taxon:6239	20061016	WB
 WB	CE07569	ACE-1		GO:0042802	WB:WBPaper00004251|PMID:10891266	IPI	WB:WBGene00000035	F		ACE1|XQ987|NM_078259	protein	taxon:6239	20061023	WB
 WB	CE07569	ACE-1		GO:0042802	WB:WBPaper00004932|PMID:11580201	IPI	WB:WBGene00000035	F		ACE1|XQ987|NM_078259	protein	taxon:6239	20061023	WB

Note there is some redundancy between the two sets (e.g. PMID:11580201), but they are certainly not completely redundant.

The fact that CE07569 is a protein encoded by WBGene00000035 can be seen from the gp2protein file:

 WB:CE07569      UniProtKB:P38433
 WB:WBGene00000035       UniProtKB:P38433

The mixed approach exemplified by WB is problematic from the point of view of software that wishes to provide summary statistics or do any kind of enrichment analysis. Results will be biased in the above case, because ace-1 and ACE-1 will be treated as different entities.

Software could simply ONLY report for genes OR proteins - but this could lead to important omissions.

Software must explicitly use the gp2protein file in order to determine the relationship between these entities and report accordingly.
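As a rough illustration of what "using the gp2protein file" could mean in practice, the sketch below groups GO terms by the protein ID that a gene and its spliceform share in gp2protein. It assumes a two-column, whitespace-separated gp2protein file as in the excerpt above and tab-separated association rows; the function names load_gp2protein and terms_by_entity are hypothetical, not part of any GO tool.

```python
from collections import defaultdict


def load_gp2protein(lines):
    """Map each object ID (e.g. 'WB:CE07569') to the protein ID paired with it.

    Assumes two whitespace-separated columns per line, as in the excerpt above.
    """
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("!"):  # skip blanks and comment lines
            continue
        obj_id, protein_id = line.split()[:2]
        mapping[obj_id] = protein_id
    return mapping


def terms_by_entity(gaf_lines, gp2protein):
    """Group GO IDs (column 5) by the gp2protein target of DB:ID (columns 1-2).

    Rows whose ID is absent from gp2protein are keyed by their own DB:ID.
    """
    grouped = defaultdict(set)
    for line in gaf_lines:
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        obj_id = f"{cols[0]}:{cols[1]}"
        key = gp2protein.get(obj_id, obj_id)
        grouped[key].add(cols[4])
    return dict(grouped)
```

With the two gp2protein lines shown above, rows for WBGene00000035 and CE07569 end up under the single key UniProtKB:P38433, so ace-1 and ACE-1 are counted once rather than as two entities.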

Note that the refG display software has been modified to use the gp2protein file to collapse both the Ace gene and ACE protein from WormBase here:

http://www.geneontology.org/images/RefGenomeGraphs/43.html

(Of course, it is commendable that WB are providing the full information set - the goal here is to standardize how this is done)

Annotating to the Gene / canonical spliceform, and indicating the spliceform as additional information

This approach is exemplified by MGI. It could be enshrined by GO, by providing an additional column 17 (note: 16 is reserved for annotation properties) in which a spliceform ID can be noted, if known.

For example, if WB were to do this we could concatenate the two sets above, using solely the gene ID in column 2, and for the second set give the CE07569 ID in column 17.
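The rewrite described here can be sketched as a per-row transformation. This is only an illustration under stated assumptions: the protein_to_gene lookup (protein ID to gene ID and gene symbol) is hypothetical and would have to be built from a resource such as gp2protein, the function name to_gene_level is invented, and switching column 12 to "gene" and storing the bare ID in column 17 are choices made for the example rather than anything the proposal specifies.

```python
def to_gene_level(gaf_line, protein_to_gene):
    """Rewrite one tab-separated association row to the proposed layout.

    protein_to_gene is a hypothetical lookup: protein ID -> (gene ID, gene symbol).
    Rows whose column 2 is not in the lookup are only padded to 17 columns.
    """
    cols = gaf_line.rstrip("\n").split("\t")
    cols += [""] * (17 - len(cols))  # pad so that column 17 (index 16) exists
    gene = protein_to_gene.get(cols[1])
    if gene is not None:
        gene_id, gene_symbol = gene
        cols[16] = cols[1]      # spliceform ID moves to column 17
        cols[1] = gene_id       # column 2 now holds the gene ID
        cols[2] = gene_symbol   # column 3 follows the gene
        cols[11] = "gene"       # assumption: column 12 describes column 2
    return "\t".join(cols)
```

Applied to the CE07569 rows with {"CE07569": ("WBGene00000035", "ace-1")}, every row would carry WBGene00000035 in column 2 and CE07569 in column 17, while the existing gene rows keep an empty column 17.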

In the reference genome graph display, protein annotations would be collapsed into the gene annotations (alternate spliceform info could be optionally indicated). This is an advantage - if we show distinct protein IDs in the refG display it can misleadingly suggest additional homologs.

If UniProt were to do this they would always use IDs like UniProt:P12345 in column 2. However, where spliceform-specific information is known, an ID like UniProt:P12345-1 would be added to column 17.
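For IDs of this shape, the column 2 and column 17 values can be derived from a single accession. The helper below is a minimal sketch that assumes the only convention in play is a trailing "-<number>" suffix on the spliceform ID, as in the UniProt:P12345-1 example; the name split_isoform is hypothetical.

```python
def split_isoform(accession):
    """Return (column 2 ID, column 17 ID or None) for a UniProt-style accession.

    A trailing '-<digits>' suffix is taken to mark a spliceform-specific ID;
    anything else is treated as already canonical.
    """
    base, sep, suffix = accession.rpartition("-")
    if sep and suffix.isdigit():
        return base, accession
    return accession, None
```

So "UniProt:P12345-1" yields ("UniProt:P12345", "UniProt:P12345-1"), and "UniProt:P12345" yields ("UniProt:P12345", None).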

The advantage of this approach is that column 17 can be ignored by software at a small loss of specificity, and the statistics will be essentially correct.

More advanced software can choose to use column 17 to provide additional information if required, and can also implement queries such as "find all genes that exhibit spliceform-specific localizations".
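One possible reading of that query is sketched below: a gene "exhibits a spliceform-specific localization" if it has at least one cellular component row (aspect C in column 9) with a non-empty column 17. Both that interpretation and the function name are assumptions made for illustration; rows with fewer than 17 columns are simply treated as having no spliceform information.

```python
def genes_with_spliceform_localization(gaf_lines):
    """Return sorted DB:ID strings of genes having a C annotation with column 17 set."""
    hits = set()
    for line in gaf_lines:
        if not line.strip() or line.startswith("!"):  # skip blanks and comments
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 17:
            continue  # older 15- or 16-column rows carry no spliceform ID
        if cols[8] == "C" and cols[16]:
            hits.add(f"{cols[0]}:{cols[1]}")
    return sorted(hits)
```

Software that ignores column 17 needs no change at all: it reads the same rows and sees only gene-level annotations.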

Conclusions

The two approaches outlined above are essentially inter-convertible. With the second approach, we are providing an additional service by mapping to the gene/canonical level such that everything is "on the same level".

The second approach has the overhead of an additional column. But this should be fairly simple and optional, and in the case of organisms like yeast, rarely used.