Annotation QC

From GO Wiki
Latest revision as of 14:00, 12 August 2008

The purpose of this page is to find methods to check the quality of the GO annotations. There are four types of errors that we would like to find easily:

Omission of annotations

A gene has no annotations in one of the three ontologies while its counterparts in other organisms do (see Reference_Genome_Database_Reports); this also includes ISS annotations that have no entry in the 'with' column. Possible causes:

  • No experimental evidence in the organism: Should try using ISS. We need ways to make the ISS annotations that can safely be transferred easier to find.
  • Original data is old and difficult to find
  • Original data is from non-RG organisms
  • We propose that IEA annotations would count with respect to 'completeness' of annotation (right?)
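A check for this kind of omission could be sketched roughly as follows, in Python. It assumes annotations are available as (gene ID, aspect) pairs, where the aspect is F, P or C, and that ortholog groups are given as lists of gene IDs. The function name, the data layout and the example IDs are illustrative assumptions, not part of any existing GO tool.

```python
from collections import defaultdict


def missing_aspects(annotations, ortholog_groups):
    """Report genes lacking an aspect that some ortholog in the same group has.

    annotations: iterable of (gene_id, aspect) pairs, aspect in {"F", "P", "C"}
    ortholog_groups: list of lists of gene IDs (hypothetical input format)
    """
    covered = defaultdict(set)
    for gene_id, aspect in annotations:
        covered[gene_id].add(aspect)

    report = {}
    for group in ortholog_groups:
        # Aspects annotated for at least one member of the group.
        group_aspects = set()
        for gene_id in group:
            group_aspects |= covered[gene_id]
        # A gene is flagged when it lacks an aspect its orthologs have.
        for gene_id in group:
            gap = group_aspects - covered[gene_id]
            if gap:
                report[gene_id] = sorted(gap)
    return report


# Made-up example: GENE:B has only a function annotation.
example = [("GENE:A", "F"), ("GENE:A", "P"), ("GENE:A", "C"), ("GENE:B", "F")]
gaps = missing_aspects(example, [["GENE:A", "GENE:B"]])
```

Whether IEA annotations count toward coverage, as proposed above, would be decided by which evidence codes are included when building the input pairs.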

ISS with

  • 'with' needs to be another experimentally characterized sequence
  • A possible problem is when there is a reference and the 'source' is not cited. In that case Suzi suggests making a second annotation with the proper 'with'.
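The two conditions above could be checked with something like the following Python sketch. It assumes each annotation is a dict with 'gene', 'go_id', 'evidence' and 'with' keys, that multiple 'with' entries are separated by a pipe, and that "experimentally characterized" means the cited gene has at least one annotation with an experimental evidence code in the same data set. The function name, the dict layout and all IDs are assumptions made for illustration.

```python
# Experimental evidence codes used by GO.
EXPERIMENTAL = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}


def check_iss_with(rows):
    """Flag ISS annotations whose 'with' is empty or not experimentally backed.

    rows: list of dicts with keys 'gene', 'go_id', 'evidence', 'with'
    (a hypothetical layout; adapt to the real annotation file format).
    """
    # Genes with at least one experimental annotation in this data set.
    exp_genes = {r["gene"] for r in rows if r["evidence"] in EXPERIMENTAL}

    problems = []
    for r in rows:
        if r["evidence"] != "ISS":
            continue
        sources = [s for s in r.get("with", "").split("|") if s.strip()]
        if not sources:
            problems.append((r["gene"], r["go_id"], "empty 'with'"))
        elif not any(s in exp_genes for s in sources):
            problems.append((r["gene"], r["go_id"], "'with' has no experimental annotation"))
    return problems


# Made-up example rows.
rows = [
    {"gene": "X:1", "go_id": "GO:9999901", "evidence": "IDA", "with": ""},
    {"gene": "Y:1", "go_id": "GO:9999901", "evidence": "ISS", "with": "X:1"},
    {"gene": "Y:2", "go_id": "GO:9999901", "evidence": "ISS", "with": ""},
    {"gene": "Y:3", "go_id": "GO:9999901", "evidence": "ISS", "with": "Z:9"},
]
problems = check_iss_with(rows)
```

This sketch only looks for experimental annotations within the supplied rows; a 'with' entry pointing to a sequence in another database would need that database's annotations loaded as well.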

Overannotation ISS to process

  • Curators need to be careful about that!!!
  • This is not an excuse NOT to annotate

Problems in the ontology

When annotations in different organisms are very different, it may reflect problems in the ontology that make certain terms unusable when curating genes from certain organisms, or it may be due to a complicated branch of the graph from which curators have difficulty selecting a term.


Varying granularity of annotations

Annotations for orthologs in various species may vary in granularity for several reasons. Some come from variability in the annotation process, which we would like to address; others reflect actual differences, either in the available experimental characterization or in the biology of the orthologs in different species.

Possible causes:

  • New (more granular) term was created since the annotation was made.

How to address this: Should we warn curators when a more granular term is created and their database has annotations to the parent term?

  • Curator feels they do not have the expertise to annotate a gene.

How to address this: Better communication: SF annotation tracker, email, wiki

  • Actual differences in amount of experimental characterization of the corresponding genes of various species

How to address this: We clearly do want to see the experimental annotations available for each species, even if some are less granular than others. A separate question is whether we would like to see additional ISS annotations to more granular terms for the species whose gene(s) were less well characterized experimentally.

  • Actual differences in the process, i.e. in some cases, authors of papers indicate that genes in some species may be doing things that their homologs in other species are not doing

How to address this: Be aware that sometimes the differences in annotation reflect real differences in the biology.
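For the first cause listed above (a more granular term created after the annotation was made), a warning for curators could be generated along these lines. This is a minimal Python sketch, assuming that newly created terms are supplied as a mapping from the new term ID to its direct parent IDs and that annotations are (database, gene, GO ID) triples. The function name, the input shapes and all IDs are hypothetical.

```python
from collections import defaultdict


def granularity_warnings(new_terms, annotations):
    """List annotations made to a term that has gained a more granular child.

    new_terms: dict mapping new child term ID -> list of direct parent term IDs
    annotations: iterable of (database, gene_id, go_id) triples
    Returns {database: [(gene_id, parent_id, new_child_id), ...]}.
    """
    # Invert the mapping: parent term -> newly created children.
    children_of = defaultdict(list)
    for child, parents in new_terms.items():
        for parent in parents:
            children_of[parent].append(child)

    warnings = defaultdict(list)
    for database, gene_id, go_id in annotations:
        for child in children_of.get(go_id, []):
            warnings[database].append((gene_id, go_id, child))
    return dict(warnings)


# Made-up example: GO:9999902 is a new child of GO:9999901.
new_terms = {"GO:9999902": ["GO:9999901"]}
annotations = [
    ("DB1", "GENE:A", "GO:9999901"),
    ("DB2", "GENE:B", "GO:9999903"),
]
warnings = granularity_warnings(new_terms, annotations)
```

Each database would then receive only the list of its own annotations that might be moved to the new child term; deciding whether to move them remains a curator judgment.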


Incorrect annotations

Possible causes:

  • Errors during annotation. How to address this:
  1. See graphs (http://proto.informatics.jax.org/prototypes/GOgraphEX/RefGenomeGraphs/); also see the queries in Reference_Genome_Database_Reports, in particular "non-IEA outliers"
  2. See also the list of commonly Misused_terms
  • Different interpretations of results.
  • Different interpretations of evidence codes documentation.
  • Different interpretations of GO terms.

Some questions that have come up on annotation list:

1. Doug: For the Ref. Genomes we are annotating the gene p2rx3, a subunit of an ATP-activated cation channel. A paper I have shows that adding this gene to HEK293 cells results in the generation of an inward current in the presence of ATP.

We know (not from this paper) that P2X receptors are a complex of subunits. Should this be annotated as 'contributes_to' ATP-gated cation channel activity by IDA because it is thought to be part of an ion channel complex, or is it not 'contributes_to' because you get the current by introduction of just the single gene product (even if it is forming a homomeric channel complex)?

Introduction of both zebrafish p2rx3 and rat p2rx5 produces a channel with novel properties. What can be done with that? 'protein heterooligomerization' with the rat p2rx5 by IGI?


2. Rama: We have a question about the use of the 'colocalizes with' qualifier. We are curating PMID: 16713564. In the section titled "Separase-Dependent Downregulation of PP2ACdc55 at Anaphase Onset" the authors say that 'Colocalization with Net1 revealed nucleolar enrichment of Cdc55 in metaphase'....

The figure legend for Fig 5A says 'Cdc55 localization in the nucleolus'. Should Cdc55 be annotated to 'nucleolus' directly or to 'colocalizes with nucleolus'? The documentation on how to use this qualifier should be updated with more examples. http://www.geneontology.org/GO.annotation.conventions.shtml#colocalizes_with


3. Val: I recently used the term RNA trimethylguanosine cap binding to annotate pombe telomerase RNA and represent the fact that this is trimethylguanosine capped, but on re-reading the definition I'm not sure if this is correct? Can I use this for the modification itself? or is it for gene products which interact with a capped product?

There do not appear to be any other annotations to this term despite the fact that many RNAs are capped, which is another reason I suspect my usage may be wrong.

Should the binding terms only be used for non-covalent modifications (although this is stated only in some of the binding defs)? If so, that does not reflect how some terms are actually used. For instance, GPI anchor binding is used for a number of proteins which are GPI anchored, in addition to proteins which bind the GPI moiety during GPI anchor biosynthesis.


4. Pascale: I am looking at the 'ISS outliers' report and I wonder what is the difference between - DNA-dependent protein kinase complex (GO:0005958): A large protein complex which is involved in the repair of DNA double-strand breaks and V(D)J recombination events. In mammals, it consists of the DNA-dependent protein kinase catalytic subunit (DNA-PKcs), the DNA end-binding heterodimer, Ku, the nuclear phosphoprotein XRCC4 and DNA ligase IV. (cellular component ontology).

and

DNA ligase IV complex (GO:0032807): A eukaryotically conserved protein complex that contains DNA ligase IV and is involved in DNA repair by non-homologous end joining; in addition to the ligase, the complex also contains XRCC4 or a homolog, e.g. Saccharomyces Lif1p. (cellular component ontology).

The first one is not localized, but I think it should be nuclear as well? I don't think bacteria have DNA-PK. Also, is 'DNA-dependent protein kinase complex (GO:0005958)' a type of 'DNA ligase IV complex'?


5. Rama: Ribosomal proteins. I have a couple of ribosomal proteins to annotate as part of the ref-genome curation project. It turns out that there is no direct experimental evidence showing that these proteins are involved in translation. Almost all the studies purify the ribosome from yeast and identify the subunits by one or more techniques.

I can do IDA for the CC annotation; that is straightforward. Is IDA okay for the function annotation, structural constituent of ribosome? What about BP? I can do IC from the CC term, but that is not direct experimental evidence. What do you all think?


Return to Reference_Genome_Annotation_Project