Annotation of Alternate Spliceforms

  This page has been superseded by GAF_Spliceform_Column_Proposal.
  The page remains here to provide historical context.

The problem

GO annotations refer to attributes of gene products. Often the association between a GO term and a gene product is implicit: a gene identifier is used as a proxy for the gene product.

This is fine for many cases: gene identifiers serve as useful proxies for gene products in the context of functional annotation, as alternate spliceforms and protein forms often have similar function. For organisms that rarely exhibit alternate splicing this is not an issue at all. However, sometimes different spliceforms have different functions (and different localizations and processes), and curators would like to indicate this. One way is to switch from annotating gene identifiers to annotating gene product identifiers. Another is to carry on annotating to gene identifiers, but to indicate the specific spliceform some other way.

How do we indicate spliceform-specific functionality in the gene association files, in a way that still makes it simple to do comparisons at the gene level, and does not break expectations of existing software? How can we ensure this is done in a standard way across the GO?

This page assumes the reader is familiar with the GAF Spec.

Current practice

For a summary of how MODs currently do this, see Variant_annotation. What follows here is a brief summary.


This is the current practice in existing deposited association files:

  • Most MODs annotate to genes.
  • UniProt annotates to proteins. (CHECK: are these always the "canonical" protein for a gene?)
  • Some MODs (WB, RGD, ...others?) do a mixture.
  • Other groups express a desire to move to gene products OR to do a mixture

Some groups record additional information not communicated to the association files:

  • MGI record the spliceform ID in their structured notes, where it is known that the experiment shows the F/P/C in a particular spliceform
  • UniProt make additional annotations to spliceforms, which have IDs of the form: UniProt:P12345-1. It appears that not all of these annotations are submitted (this may have changed as of 2008-04)

Note that MGI do not exclude annotations: they still provide annotations at the gene level; the association file is just missing information on the specific spliceform.

UniProt

Q4VCS5 and its two spliceforms can be seen here:

http://www.ebi.ac.uk/ego/GProtein?ac=Q4VCS5

Format of annotations in both UniProt and Human GOA files:

  UniProtKB	Q4VCS5	AMOT_HUMAN	GO:0031410	PMID:11257124	IDA	C	AMOT, KIAA1071: Angiomotin	IPI00163085	protein	taxon:9606	20051207	UniProtKB
  UniProtKB	Q4VCS5	AMOT_HUMAN	GO:0043532	PMID:11257124	IDA	F	AMOT, KIAA1071: Angiomotin	IPI00163085	protein	taxon:9606	20051207	UniProtKB
  UniProtKB	Q4VCS5-1	AMOT_HUMAN	GO:0043116	PMID:16043488	IDA	P	AMOT, KIAA1071: Angiomotin	IPI00163085	protein	taxon:9606	20051207	UniProtKB
  UniProtKB	Q4VCS5-1	AMOT_HUMAN	GO:0005515	PMID:16043488	IPI	UniProtKB:Q6RHR9-2	F	AMOT, KIAA1071: Angiomotin	IPI00163085	protein	taxon:9606	20051207	UniProtKB
  UniProtKB	Q4VCS5-2	AMOT_HUMAN	GO:0043532	PMID:16043488	IDA	F	AMOT, KIAA1071: Angiomotin	IPI00163085	protein	taxon:9606	20051207	UniProtKB
  UniProtKB	Q4VCS5-2	AMOT_HUMAN	GO:0043116	PMID:16043488	IDA	P	AMOT, KIAA1071: Angiomotin	IPI00163085	protein	taxon:9606	20051207	UniProtKB
  UniProtKB	Q4VCS5-2	AMOT_HUMAN	GO:0043536	PMID:16043488	IDA	P	AMOT, KIAA1071: Angiomotin	IPI00163085	protein	taxon:9606	20060317	UniProtKB

Individual spliceforms can be seen on the page for the canonical protein http://beta.uniprot.org/uniprot/Q4VCS5#Q4VCS5-1

Note that Q4VCS5-1 is denoted the canonical form. This is typically the longest isoform.
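As an illustration of how the isoform IDs relate to the entry accession, here is a minimal Python sketch that splits an ID such as Q4VCS5-1 into its base accession and isoform number. The function name `split_isoform` and the regular expression are assumptions made for this example; the pattern only covers the "accession, hyphen, number" shape seen in the lines above and is not a full validator of UniProt accessions.

```python
import re

# Assumed shape: base accession, optionally followed by "-<isoform number>".
ISOFORM_RE = re.compile(r"^(?P<base>[A-Z0-9]+)(?:-(?P<isoform>\d+))?$")


def split_isoform(accession):
    """Return (base_accession, isoform_number), with None if there is no suffix."""
    match = ISOFORM_RE.match(accession)
    if match is None:
        raise ValueError(f"unrecognised accession: {accession!r}")
    isoform = match.group("isoform")
    return match.group("base"), int(isoform) if isoform else None
```

With this sketch, `split_isoform("Q4VCS5-2")` gives the base accession `Q4VCS5` and isoform number 2, while `split_isoform("Q4VCS5")` gives `Q4VCS5` and `None`.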

Standardization

We seek a standard way of annotating gene products, both directly, via spliceform specific IDs, and indirectly, via gene IDs.

Ideally the standard will be non-lossy, in that it can capture everything the curator wishes to say. It should also exhibit "graceful degradation": simple software that does not take alternate spliceforms into account should "do the right thing" by default.

Here are two alternate approaches:

Annotating to the Spliceform/Isoform

Here column 2 would contain a peptide ID rather than a gene ID. Column 12 would say "protein".

This approach can be broken down into two alternate sub-approaches:

  1. mandating a protein ID rather than a gene ID throughout the whole file
  2. using a mixed approach, providing protein IDs where splice-form level annotation is known, gene ID otherwise

Annotating only to protein IDs

This is the approach UniProt uses (obviously, since the entities they curate are proteins).

A few things are not clear to me about the UniProt approach. Do the IDs denote "canonical" proteins? Each protein in UniProt is associated with a specific sequence, and thus a specific protein isoform in the case where alternate splicing of the gene is involved. Do UniProt curators only attach GO annotations if they are sure the experiment described in the paper involved these specific forms? I am presuming not, and that the UniProt IDs denote a canonical protein, analogous to a gene record.

FlyBase have expressed a desire to switch to annotating canonical proteins. This seems reasonable. I think we need to do more work on exactly what it means to be a canonical protein. Does a CP have a specific sequence, or is it more akin to a generalisation of a collection of protein sequences? How is the CP related to the gene and to "non-canonical" proteins? If the CP to gene mapping is 1:1, is this really any different from using genes as proxies?

Annotating to a mix of gene IDs and protein IDs

Here a choice is made depending on how much information is available.

Currently WB uses the mixed approach. To illustrate, let us look at gene WBGene00000035, which has at least one known spliceform, CE07569.

Gene annotations:

 WB	WBGene00000035	ace-1		GO:0040012	WB:WBPaper00003620|PMID:10438595	IGI	WB:WBGene00000036	P		ACE1|XQ987|NM_078259	gene	taxon:6239	20061031	WB
 WB	WBGene00000035	ace-1		GO:0040012	WB:WBPaper00006040|PMID:12911746	IGI	WB:WBGene00000036	P		ACE1|XQ987|NM_078259	gene	taxon:6239	20061031	WB
 WB	WBGene00000035	ace-1		GO:0006581	WB:WBPaper00003620|PMID:10438595	IMP	WB:p1000	P		ACE1|XQ987|NM_078259	gene	taxon:6239	20060925	WB
 WB	WBGene00000035	ace-1		GO:0003990	WB:WBPaper00003620|PMID:10438595	IMP	WB:p1000	F		ACE1|XQ987|NM_078259	gene	taxon:6239	20060925	WB
 WB	CE07569	ACE-1		GO:0042802	WB:WBPaper00004251|PMID:10891266	IPI	WB:WBGene00000035	F		ACE1|XQ987|NM_078259	protein	taxon:6239	20061023	WB
 WB	CE07569	ACE-1		GO:0042802	WB:WBPaper00004932|PMID:11580201	IPI	WB:WBGene00000035	F		ACE1|XQ987|NM_078259	protein	taxon:6239	20061023	WB
 WB	WBGene00000036	ace-2		GO:0040012	WB:WBPaper00003620|PMID:10438595	IGI	WB:WBGene00000035	P		1D872|NM_058740	gene	taxon:6239	20061031	WB
 WB	WBGene00000036	ace-2		GO:0040012	WB:WBPaper00006040|PMID:12911746	IGI	WB:WBGene00000035	P		1D872|NM_058740	gene	taxon:6239	20061031	WB
 WB	WBGene00000037	ace-3		GO:0043058	WB:WBPaper00001039|PMID:3272166	IGI	WB:WBGene00000035	P		2O499|NM_064562	gene	taxon:6239	20060203	WB
 WB	WBGene00000037	ace-3		GO:0035188	WB:WBPaper00001039|PMID:3272166	IGI	WB:WBGene00000035|WB:WBGene00000036	P		2O499|NM_064562	gene	taxon:6239	20060203	WB
 WB	WBGene00000037	ace-3		GO:0050879	WB:WBPaper00001039|PMID:3272166	IGI	WB:WBGene00000035|WB:WBGene00000036	P		2O499|NM_064562	gene	taxon:6239	20060203	WB
 WB	WBGene00000037	ace-3		GO:0002119	WB:WBPaper00001039|PMID:3272166	IGI	WB:WBGene00000035|WB:WBGene00000036	P		2O499|NM_064562	gene	taxon:6239	20060203	WB

Protein annotations:

 WB	CE07569	ACE-1		GO:0006581	WB:WBPaper00002110|PMID:7835425	IDA		P		ACE1|XQ987|NM_078259	protein	taxon:6239	20061016	WB
 WB	CE07569	ACE-1		GO:0006581	WB:WBPaper00004251|PMID:10891266	IDA		P		ACE1|XQ987|NM_078259	protein	taxon:6239	20061016	WB
 WB	CE07569	ACE-1		GO:0001507	WB:WBPaper00004251|PMID:10891266	ISS	UniProt:P07692	P		ACE1|XQ987|NM_078259	protein	taxon:6239	20061016	WB
 WB	CE07569	ACE-1		GO:0005623	WB:WBPaper00004251|PMID:10891266	IDA		C		ACE1|XQ987|NM_078259	protein	taxon:6239	20060925	WB
 WB	CE07569	ACE-1		GO:0005576	WB:WBPaper00001929|PMID:8144590	IDA		C		ACE1|XQ987|NM_078259	protein	taxon:6239	20061011	WB
 WB	CE07569	ACE-1		GO:0005626	WB:WBPaper00001929|PMID:8144590	IDA		C		ACE1|XQ987|NM_078259	protein	taxon:6239	20061011	WB
 WB	CE07569	ACE-1		GO:0003990	WB:WBPaper00004251|PMID:10891266	IDA		F		ACE1|XQ987|NM_078259	protein	taxon:6239	20061016	WB
 WB	CE07569	ACE-1		GO:0003990	WB:WBPaper00001929|PMID:8144590	IDA		F		ACE1|XQ987|NM_078259	protein	taxon:6239	20061016	WB
 WB	CE07569	ACE-1		GO:0042802	WB:WBPaper00004251|PMID:10891266	IPI	WB:WBGene00000035	F		ACE1|XQ987|NM_078259	protein	taxon:6239	20061023	WB
 WB	CE07569	ACE-1		GO:0042802	WB:WBPaper00004932|PMID:11580201	IPI	WB:WBGene00000035	F		ACE1|XQ987|NM_078259	protein	taxon:6239	20061023	WB

Note that there is some redundancy between the two sets (for example PMID:11580201), but they are certainly not completely redundant.

The fact that CE07569 is a protein encoded by WBGene00000035 can be seen from the gp2protein file:

 WB:CE07569      UniProtKB:P38433
 WB:WBGene00000035       UniProtKB:P38433

The mixed approach exemplified by WB is problematic from the point of view of software that wishes to provide summary statistics or do any kind of enrichment analysis. Results will be biased in the above case, because ace-1 and ACE-1 will be treated as different entities.

Software could simply ONLY report for genes OR proteins - but this could lead to important omissions.

Software must explicitly use the gp2protein file in order to determine the relationship between these entities and report accordingly.
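To make this concrete, the following Python sketch shows one way software might use the mapping to treat ace-1 and ACE-1 as a single entity. It assumes the two-column, whitespace-separated layout shown in the excerpt above, with one UniProtKB accession per line; the function names `load_gp2protein` and `collapse_terms` are invented for this example, and a real file may need more careful handling.

```python
from collections import defaultdict

# The two mapping lines quoted above, embedded so the sketch is self-contained.
GP2PROTEIN = """\
WB:CE07569 UniProtKB:P38433
WB:WBGene00000035 UniProtKB:P38433
"""


def load_gp2protein(text):
    """Map each database ID to the UniProtKB accession it is paired with."""
    mapping = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and "!" comment lines
        db_id, protein_id = line.split()[:2]
        mapping[db_id] = protein_id
    return mapping


def collapse_terms(annotations, mapping):
    """Group GO terms by shared UniProtKB accession.

    annotations is an iterable of (db_id, go_id) pairs. IDs that do not
    appear in the mapping are kept under their own ID.
    """
    grouped = defaultdict(set)
    for db_id, go_id in annotations:
        grouped[mapping.get(db_id, db_id)].add(go_id)
    return dict(grouped)
```

Given the pairs (WB:WBGene00000035, GO:0040012) and (WB:CE07569, GO:0006581), both terms end up under UniProtKB:P38433, so an enrichment tool would count one entity rather than two.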

Note that the refG display software has been modified to use the gp2protein file to collapse the Ace gene and ACE protein from WormBase here:

http://www.geneontology.org/images/RefGenomeGraphs/43.html

(Of course, it is commendable that WB are providing the full information set; the goal here is to standardize how this is done.)

Annotating to the Gene / canonical spliceform, and indicating the spliceform as additional information

This approach is exemplified by MGI. It could be enshrined by GO by providing an additional column 17 (note: column 16 is reserved for annotation properties) in which a spliceform ID can be noted, if known.

For example, if WB were to do this we could concatenate the two sets above, using solely the gene ID in column 2, and for the second set give the CE07569 ID in column 17.
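A rough Python sketch of that rewrite for a single tab-separated row follows. The `protein_to_gene` dictionary is a hypothetical lookup (it could be derived from a mapping file), and `to_gene_level` is a name invented here. The text does not say what should happen to the symbol and type columns, so this sketch leaves them untouched.

```python
def to_gene_level(columns, protein_to_gene):
    """Return a 17-column copy of a GAF row annotated at the gene level.

    columns is the list of tab-separated fields of one row. If column 2
    holds a protein ID found in protein_to_gene, the gene ID replaces it
    and the protein ID moves to column 17. Column 16 is left empty.
    """
    row = list(columns) + [""] * (17 - len(columns))
    object_id = row[1]
    gene_id = protein_to_gene.get(object_id)
    if gene_id is not None:
        row[1] = gene_id      # column 2: gene ID
        row[16] = object_id   # column 17: spliceform ID
    return row


# One of the WB protein rows shown earlier, as a tab-separated string.
LINE = ("WB\tCE07569\tACE-1\t\tGO:0006581\tWB:WBPaper00002110|PMID:7835425\t"
        "IDA\t\tP\t\tACE1|XQ987|NM_078259\tprotein\ttaxon:6239\t20061016\tWB")

converted = to_gene_level(LINE.split("\t"), {"CE07569": "WBGene00000035"})
print("\t".join(converted))
```

Rows that are already annotated to a gene ID pass through unchanged apart from being padded to 17 columns.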

In the reference genome graph display, protein annotations would be collapsed into the gene annotations (alternate spliceform info could be optionally indicated). This is an advantage - if we show distinct protein IDs in the refG display it can misleadingly suggest additional homologs.

If UniProt were to do this they would always use IDs like UniProt:P12345 in column 2. However, where spliceform-specific information is known, an ID like UniProt:P12345-1 would be added to column 17:

  UniProtKB	Q4VCS5	AMOT_HUMAN	GO:0031410	PMID:11257124	IDA	C	AMOT, KIAA1071: Angiomotin	IPI00163085	protein	taxon:9606	20051207	UniProtKB
  UniProtKB	Q4VCS5	AMOT_HUMAN	GO:0043532	PMID:11257124	IDA	F	AMOT, KIAA1071: Angiomotin	IPI00163085	protein	taxon:9606	20051207	UniProtKB
  UniProtKB	Q4VCS5	AMOT_HUMAN	GO:0043116	PMID:16043488	IDA	P	AMOT, KIAA1071: Angiomotin	IPI00163085	protein	taxon:9606	20051207	UniProtKB	Q4VCS5-1
  UniProtKB	Q4VCS5	AMOT_HUMAN	GO:0005515	PMID:16043488	IPI	UniProtKB:Q6RHR9-2	F	AMOT, KIAA1071: Angiomotin	IPI00163085	protein	taxon:9606	20051207	UniProtKB	Q4VCS5-1
  UniProtKB	Q4VCS5	AMOT_HUMAN	GO:0043532	PMID:16043488	IDA	F	AMOT, KIAA1071: Angiomotin	IPI00163085	protein	taxon:9606	20051207	UniProtKB	Q4VCS5-2
  UniProtKB	Q4VCS5	AMOT_HUMAN	GO:0043116	PMID:16043488	IDA	P	AMOT, KIAA1071: Angiomotin	IPI00163085	protein	taxon:9606	20051207	UniProtKB	Q4VCS5-2
  UniProtKB	Q4VCS5	AMOT_HUMAN	GO:0043536	PMID:16043488	IDA	P	AMOT, KIAA1071: Angiomotin	IPI00163085	protein	taxon:9606	20060317	UniProtKB	Q4VCS5-2

The advantage of this approach is that column 17 can be ignored by software at a small loss of specificity, and the statistics will be essentially correct.

More advanced software can choose to use column 17 and can provide additional info if required, and can also implement queries such as "find all genes that exhibit spliceform-specific localizations".
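Both kinds of consumer can be sketched in a few lines of Python. The sketch assumes each row has already been split into a list of columns, with column 17 at index 16. It reads "spliceform-specific localization" as "any cellular component (aspect C) row with column 17 filled in", which is only one possible interpretation. The function names and the sample rows are invented for illustration and are not real annotations.

```python
from collections import defaultdict


def gene_level_terms(rows):
    """Simple consumer: GO terms per column 2 ID, ignoring column 17."""
    terms = defaultdict(set)
    for row in rows:
        terms[row[1]].add(row[4])
    return dict(terms)


def genes_with_spliceform_localizations(rows):
    """Advanced consumer: column 2 IDs with an aspect C row that names a spliceform."""
    genes = set()
    for row in rows:
        spliceform = row[16] if len(row) > 16 else ""
        if row[8] == "C" and spliceform:
            genes.add(row[1])
    return genes


def make_row(object_id, go_id, aspect, spliceform=""):
    """Build a mostly empty 17-column row for the demonstration below."""
    row = [""] * 17
    row[1], row[4], row[8], row[16] = object_id, go_id, aspect, spliceform
    return row


# Invented rows using the placeholder accession style from the text.
ROWS = [
    make_row("P12345", "GO:0005576", "C", "P12345-1"),
    make_row("P12345", "GO:0003990", "F"),
    make_row("P67890", "GO:0005576", "C"),
]
```

Here `gene_level_terms(ROWS)` reports two terms for P12345 and one for P67890 without looking at column 17, while `genes_with_spliceform_localizations(ROWS)` returns only P12345.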

Protein coding vs non-coding

Conclusions

The two approaches outlined above are essentially inter-convertible. With the second approach, we are providing an additional service by mapping to the gene/canonical level, such that everything is "on the same level".
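The inter-conversion can be sketched in Python as a round trip on a single row. Going from the second approach to the first only needs column 17. Going the other way needs a way to find the gene or canonical ID; this sketch assumes UniProt-style IDs where stripping a numeric "-N" suffix gives that ID, and other databases would need an explicit mapping instead. The function names are invented for this example.

```python
def col17_to_col2(row):
    """Second approach -> first: move a column 17 ID into column 2."""
    row = list(row) + [""] * (17 - len(row))
    if row[16]:
        row[1], row[16] = row[16], ""
    return row


def col2_to_col17(row):
    """First approach -> second, assuming UniProt-style 'ACCESSION-N' IDs."""
    row = list(row) + [""] * (17 - len(row))
    base, sep, suffix = row[1].rpartition("-")
    if sep and suffix.isdigit():
        # Keep the full spliceform ID in column 17, the base ID in column 2.
        row[1], row[16] = base, row[1]
    return row


# A mostly empty row carrying only the fields needed for the demonstration.
original = [""] * 17
original[0], original[1], original[4] = "UniProtKB", "Q4VCS5-2", "GO:0043532"

gene_level = col2_to_col17(original)
assert gene_level[1] == "Q4VCS5" and gene_level[16] == "Q4VCS5-2"
assert col17_to_col2(gene_level) == original
```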

The second approach has the overhead of an additional column, but the column should be fairly simple and optional, and for organisms such as yeast it would rarely be used.