Variant annotation

From GO Wiki

Revision as of 17:54, 16 July 2014

This page describes the different ways in which GOC annotation generators deal with alternate spliceforms. See also Annotation_of_Alternate_Spliceforms.

Arabidopsis thaliana

We annotate splice-variant-specific information when available. Our INTERPRO2GO- and TargetP-based IEA annotations are splice-variant specific: if one of the proteins encoded by the locus lacks a domain that the other(s) have, that protein does not get the corresponding annotation. I think that we have very few experimental annotations that are splice-variant specific.*

  • CONFIRMED: gene-associations/gene_association.tair.gz has only gene in col12 -- CJM
  • NOTE: we hope to transition to having more than one DB_Object_Type in the next month or so --Tanya
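
The col12 check noted above can be repeated for any GAF file with a few lines of Python. This is a minimal sketch, assuming a tab-delimited GAF in which comment lines start with "!" and the 12th column holds DB_Object_Type. The function names and the sample rows are hypothetical and are not taken from any real annotation file.

```python
import gzip
from collections import Counter

def count_object_types(lines):
    """Tally the values found in GAF column 12 (DB_Object_Type)."""
    counts = Counter()
    for line in lines:
        if line.startswith("!") or not line.strip():
            continue  # skip header/comment lines and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 12:
            counts[cols[11]] += 1  # column 12 is index 11
    return counts

def count_object_types_in_file(path):
    """Same tally, reading a gzipped GAF from disk (path is supplied by the caller)."""
    with gzip.open(path, "rt") as handle:
        return count_object_types(handle)

def _row(object_type):
    # Hypothetical 15-column row; only column 12 matters for this sketch.
    cols = ["DB", "ID1", "sym1", "", "GO:0005886", "REF:1", "IDA", "",
            "C", "", "", object_type, "taxon:1", "20140716", "DB"]
    return "\t".join(cols) + "\n"

sample = ["!gaf-version: 2.0\n", _row("gene"), _row("gene"), _row("protein")]
result = count_object_types(sample)
print(dict(result))
```

A file that reports only "gene" in this tally carries no isoform-level annotations, at least as far as the object type column can show.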

Caenorhabditis elegans

We are starting to see a few examples of isoform-specific functions and/or localization in the C. elegans literature. Where we can confidently match the isoform to a Wormpep protein identifier (e.g., WP:CE25075), we make the annotation specifically to that isoform. If we cannot determine from the paper which isoform was used, we annotate all of the protein isoforms by default.

Danio rerio

We almost never have enough info to curate to the level of a splice variant. Our annotations are applied at the level of the gene.

Dictyostelium discoideum

So far we have only a few genes and publications that describe splice variants, and those papers have not described different functions for the different variants. Hence, we currently don't capture annotations to different variants of gene products.

Drosophila melanogaster

At the moment we attach all GO terms to genes. We are working out how to change our curation method so that we annotate proteins instead. In the meantime we make an internal note for papers that describe isoform-specific information so that we can revisit them.

Escherichia coli

Gallus gallus

We are using UniProtKB accession IDs wherever possible and this allows us to annotate specific isoforms if required.

Homo sapiens

The human group annotates to UniProtKB accessions. When a paper provides isoform-specific information, it can be captured using the appropriate UniProt isoform identifier (isoid), e.g. Q4VCS5-1, Q4VCS5-2. When isoform-specific information is not provided, only the top-level UniProt accession is annotated, e.g. Q4VCS5.

  • Chris - the above may reflect how this is captured internally, but this is not reflected in the GAF
  • Emily - this information *is* exported in the UniProt file; however, due to a production bug, isoid annotations are not getting into the species-specific files. This is being rectified.
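
The relationship between an isoid and its top-level accession can be illustrated with a small helper. This is a sketch only, assuming identifiers of the form shown above (an accession optionally followed by a hyphen and an isoform number). The function name is hypothetical, and the pattern is deliberately loose rather than a full validation of UniProt accession syntax.

```python
import re

# Assumed shape: base accession, optionally followed by "-<isoform number>".
_ISOFORM_RE = re.compile(r"^(?P<accession>[A-Z0-9]+)(?:-(?P<isoform>\d+))?$")

def split_isoform_id(identifier):
    """Return (top-level accession, isoform number or None)."""
    match = _ISOFORM_RE.match(identifier)
    if match is None:
        raise ValueError(f"unrecognised identifier: {identifier!r}")
    isoform = match.group("isoform")
    return match.group("accession"), (int(isoform) if isoform is not None else None)

for identifier in ("Q4VCS5", "Q4VCS5-1", "Q4VCS5-2"):
    print(identifier, split_isoform_id(identifier))
```

Grouping annotations by the first element of the returned pair would collect isoform-specific and top-level annotations for the same protein entry.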

Mus musculus

For each annotation, MGI has a "notes field" that is not available to the public. That note has a structure as follows:

evidence:
anatomy:
cell type:
gene product:
qualifier:
target:
external ref:
text:

If a paper specifies a particular isoform, the appropriate refseq is entered into the "gene_product" field.

E.g., for the annotation of MGI:1341722 (Kcnh2) to GO:0005886 (plasma membrane) by IDA, the field would look like:

gene_product:SPKW:O35219-1

We presently have only about 300 of these with experimental evidence codes, annotated after the adoption of the structured notes. So QC has to be done for some. Annotations done prior to that will not have any entry, as we had no way of capturing the data. We are looking at ways to "back annotate" by identifying genes that have multiple isoforms identified in references that have been used for GO annotation at MGI.
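
A structured note of the kind described above is straightforward to read programmatically. The notes field is not public, so the following is a sketch under stated assumptions: one "name:value" pair per line, split at the first colon, with spaces in field names normalised to underscores (the template shows "gene product" while the example shows "gene_product"). The function name and the sample note, which combines the template with the Kcnh2 example, are hypothetical.

```python
def parse_note(text):
    """Parse 'name:value' lines into a dict, splitting at the first colon only."""
    fields = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # ignore anything that is not a field line
        key, _, value = line.partition(":")
        # Assumption: "gene product" and "gene_product" name the same field.
        fields[key.strip().replace(" ", "_")] = value.strip()
    return fields

# Hypothetical note assembled from the template and example in the text.
note = """evidence:
anatomy:
cell type:
gene_product:SPKW:O35219-1
qualifier:
target:
external ref:
text:"""

parsed = parse_note(note)
print(parsed["gene_product"])
```

Splitting at the first colon only matters here because the value itself contains a colon between the database prefix and the isoform identifier.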

Rattus norvegicus

There are not many splice variants currently in the database. Those that are present have their own DB:ID and take the symbol of the parent gene followed by an underscore, "v" and a number, then "variant of" plus the parent symbol in parentheses, with the symbol hyperlinked to the report page of the parent gene. Example: geneX_v1 (variant of geneX). The variants can also be accessed from the top-level gene. The variants may have some mapping, sequence and other external database links, if applicable. They seldom have annotations. Occasionally the information in the literature allows for annotation of the splice variants, but that is rather rare.

  • CONFIRMED: GAF has gene and protein in col12. Protein annotations are to a mixture of ENSEMBL and UNIPROT IDs
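
The variant naming convention described above (parent symbol, underscore, "v", number) can be recognised with a simple pattern. This is an illustrative sketch based only on the geneX_v1 example; the function name is hypothetical, and real symbols may include cases this pattern does not cover.

```python
import re

# Assumed convention from the example: <parent symbol>_v<number>
_VARIANT_RE = re.compile(r"^(?P<parent>.+)_v(?P<number>\d+)$")

def parse_variant_symbol(symbol):
    """Return (parent gene symbol, variant number or None)."""
    match = _VARIANT_RE.match(symbol)
    if match is None:
        return symbol, None  # not a variant symbol; treat as a top-level gene
    return match.group("parent"), int(match.group("number"))

print(parse_variant_symbol("geneX_v1"))
print(parse_variant_symbol("geneX"))
```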

Saccharomyces cerevisiae

We have very few documented cases of splicing or processing variants, and our database structure currently cannot display variant gene product forms. So, we do not at this time annotate variants. We annotate only one gene product per gene. We are working on a database restructure so that we can represent different variants, but it has not yet been implemented.

Schizosaccharomyces pombe

S. pombe has ~44% of genes spliced but only one documented protein product variant, in which a two-exon gene is alternatively transcribed to give a single-exon form during meiosis. The Solexa data indicated that this may be a fairly common event, but we don't yet have enough information to annotate the variations. There is no evidence so far for exon skipping or alternative splice site usage.

There is a report of different polyadenylation sites for some transcripts, but these do not affect the protein product.

For the documented case I will record the longest version in the protein file.