PAINT annotation guidelines: Difference between revisions

From GO Wiki
<big>'''PAINT'''</big>

A central repository for PAINT developer info, announcements, and technical information.

== Protein Family Annotation To-Do Lists ==


We are working from two to-do lists right now:

1) The "finished" sets on the [http://sourceforge.net/tracker/?atid=1040173&group_id=36855&func=browse Source Forge Tracker].


2) Recent monthly target lists at [http://spreadsheets.google.com/ccc?key=pZhlLFuj8ewDe799QTmxzCA&hl=en Google docs].

== Standard Operating Procedures for annotating in PAINT ==


1) examine existing MOD annotations, in particular experimental annotations, in context of evolutionary trees.

2) if necessary, contact MOD curators about questions, suggestions for additions, etc.


3) determine annotations that can be propagated.  Send each MOD a GAF of proposed annotations, with a suggestion for incorporation within two weeks.

=== MSH Family done by Mike ===

=== HPRT Family done by Kara ===
Coming in via human protein UniProtKB:P00492, HPRT1, a hypoxanthine guanine phosphoribosyltransferase.
See SourceForge 1893061 and 1893082, in old version of PAINT, this is BookPTHR22573.

General notes: The PANTHER family is very large, and with this old PAINT version the GO annotations could not easily be reached, so PPOD was used to look at more manageable chunks of the larger PANTHER family:
 
==== Molecular Function ====
* [[http://ppod.princeton.edu/cgi-bin/ppod.cgi/j04/Jaccard1673 PPOD Jaccard1673]]: not enough experimental annotations (only human protein UniProtKB:P00492 had exp. ann.) in this subsection of the big PANTHER family, so could not do transfer.
 
* [[http://ppod.princeton.edu/cgi-bin/ppod.cgi/j04/Jaccard548 PPOD Jaccard548]]:
 
The three S. cerevisiae proteins, PGM1, PGM2, and PGM3, the mouse proteins PGM1 and PGM2, and the fly protein Pgm (last one on the graph) all have experimental annotations to phosphoglucomutase activity.  However, the mouse PGM5 has a NOT annotation to this activity, and the human PGM5 also has a NOT (but also a positive annotation(!)).
 
Propagation proposal:
Propagate to all proteins except for the clade with the NOT annotations (mouse and human PGM5) and the three C. elegans proteins that are part of an outgroup (top of graph). Pending info from MOD curators (see below), we might propagate that NOT to the rat proteins in that clade as well.
 
Suggestions for MOD annotators:
 
* ask about human NOT, ISO from NAS (emailed relevant curators March 18)
 
* ask E. coli to possibly add experimental annotation based on PubMed: 12351653 (emailed Debby March 18)
 
== Software Developer testing checklist prior to releasing PAINT ==
# Collapse a node
## Collapsing a node selects the node, and the selection is shown by a faint highlight
# "re-root to node"
# "output seq id's for leaves"
# Edit --> Find... --> Search text
# Tree --> Scale...
# Tree --> Reset root to main
# Default sort order for GO terms is (1) by ontology and (2) by GOid.
# After selecting term and clicking "Propagate" button, term moves to top of list.
 
== Code and Trackers ==
* [[PAINT: Getting the Source Code]]
* [https://sourceforge.net/tracker2/?group_id=184610&atid=1126622 Bug Tracker]
* [https://sourceforge.net/tracker2/?group_id=184610&atid=1126623 Feature Tracker]
* [http://pantherdb.svn.sourceforge.net/viewvc/pantherdb/ Browse PAINT code in SourceForge SVN repository]
 
Return to [[Reference_Genome_Annotation_Project]]

Revision as of 17:31, 9 November 2010

Semantics of annotations

  • Note that an annotation means that you are inferring that a particular GO term most likely FIRST EVOLVED along the branch leading to the node you are annotating. It means that a particular "character" was present in the particular ancestral gene/genome/organism you are annotating. For instance, you should not annotate a gene present in the common ancestor of all life with the term "nucleus" because that organism did not have a nucleus. A "NOT" annotation means that an ancestral term that would otherwise be inherited is inferred to have been LOST in a particular descendant, and of course will not be inherited past that point. We use NOT annotations to denote a functional change during evolution, so you will need to first make a positive annotation, and then make any annotations that indicate the loss of that GO term.
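
The inheritance semantics described above can be sketched in a few lines of Python. This is an illustration only, not PAINT's implementation: the `Node` class, its `gained` and `lost` fields, and the node and term names in the example are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """A tree node (hypothetical structure, not PAINT's data model)."""
    name: str
    children: list[Node] = field(default_factory=list)
    # Terms inferred to have first evolved on the branch leading to this node.
    gained: set[str] = field(default_factory=set)
    # NOT annotations: terms inferred to have been lost on this branch.
    lost: set[str] = field(default_factory=set)


def inherited_terms(node: Node, from_parent: frozenset[str] = frozenset()) -> dict[str, set[str]]:
    """Map every node name at or below `node` to the terms it carries."""
    not_inherited = node.lost - from_parent
    if not_inherited:
        # A NOT only makes sense for a term already annotated to an ancestor.
        raise ValueError(f"{node.name}: NOT without ancestral annotation: {sorted(not_inherited)}")
    present = (set(from_parent) | node.gained) - node.lost
    result = {node.name: present}
    for child in node.children:
        # Descendants inherit whatever is present here; lost terms stop here.
        result.update(inherited_terms(child, frozenset(present)))
    return result


# Hypothetical family: the term is gained at the root and lost in one clade.
leaf_a = Node("leaf_a")
leaf_b = Node("leaf_b")
clade = Node("clade", children=[leaf_b], lost={"example activity"})
root = Node("root", children=[leaf_a, clade], gained={"example activity"})
terms = inherited_terms(root)
```

In this sketch `leaf_a` inherits the term from the root, while `clade` and everything below it do not, which mirrors the rule that a lost term is not inherited past the NOT.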

General Rules

  • In general, we will annotate to the most specific term possible and propagate as far back as possible, given the ancestral inference.
  • For molecular function and cellular component, address every experimental annotation. For every experimental annotation, either:
    • Use it for a propagation (note that if you already annotated a more specific term, you do not need to use the more general term)
    • Explain in the notes box why you didn't use it
  • For biological process: annotate all appropriate CELLULAR LEVEL PROCESSES. Higher level processes should be annotated only if they do not require extensive work to clarify (i.e. don't read entire papers).
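
As a rough sketch of the "address every experimental annotation" rule for molecular function and cellular component, the check below lists terms that were neither propagated, covered by a more specific propagated term, nor explained in the notes box. The function name and the `more_general` lookup are hypothetical; PAINT itself may track this differently.

```python
def unaddressed_annotations(experimental, propagated, explained, more_general):
    """Return experimental MF/CC terms that were neither used nor explained.

    experimental: terms carrying experimental annotations in the family
    propagated:   terms the curator propagated
    explained:    terms the curator declined, with a reason in the notes box
    more_general: hypothetical lookup, term -> set of its more general terms
    """
    covered = set(propagated)
    for term in propagated:
        # Propagating a specific term also covers its more general terms.
        covered |= more_general.get(term, set())
    return sorted(set(experimental) - covered - set(explained))


# Illustrative input only; the term set is made up for the example.
more_general = {"protein kinase activity": {"kinase activity"}}
todo = unaddressed_annotations(
    experimental={"kinase activity", "protein kinase activity", "nucleus"},
    propagated={"protein kinase activity"},
    explained=set(),
    more_general=more_general,
)
```

Here `todo` contains only "nucleus": the general term "kinase activity" counts as addressed because the more specific term was propagated.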


Initial Steps

  • Look at the tree topology to see if it makes sense. For example, use the OrthoMCL mapping as a reality check on the tree. If the topology does not make sense, contact Paul and the tree will be edited as appropriate.
  • If you are not familiar with the family, it is very useful to spend a few minutes looking at a review, Gene Wiki entry, etc. for an overview. Please record the reviews you used in the notes box.
  • It is generally easiest to start with Molecular Function, then Cellular Component, then Biological Process.


Annotation Rules

  • For closely related genes with opposite annotations, look at the papers to see whether they are really contradictory. If they are, don't propagate. If they are not, make a note of the annotations so they can be addressed later by the specific MOD(s).
  • When an annotation looks indirect, check whether there is a more direct annotation that could explain it, and use that one if possible. (IMP annotations are more likely to be indirect.)
  • Scoping
    • Use common sense and keep the big picture of the tree and knowledge about the family in mind; we should not always limit ourselves to the bare minimal triangulation. For example, in the LON family the mitochondrial light strand promoter anti-sense binding annotation was propagated to the base of the eukaryotes. Always include an evidence note when doing so.
    • We can expand the scope of a BP term to match that of related MF and/or CC terms. For example, in LONP1 the MF and CC mitochondrial terms apply to the entire LONP1 eukaryotic clade, so we can apply mitochondrial organization to the entire eukaryotic clade.
    • If a process (e.g. a "p53 dependent apoptotic process") involves a specific target, the scope of an inferred annotation should not extend beyond the phylogenetic distribution of the target.


Term-specific notes

  • Do not propagate GO:0005515 protein binding (it will be suppressed from PAINT), GO:0005488 binding, or enzyme binding.
  • We will only propagate children of protein binding when the term is specific enough to indicate a specific protein family and/or provides useful biological information to a biologist wanting to learn more, i.e. when that molecular function is related to the biological process(es) annotated in this family.
  • We will propagate small molecule binding terms.
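
The term-specific rules above could be expressed as a simple filter. This is a minimal sketch, not how PAINT suppresses terms: the function name and both boolean flags are hypothetical, and whether a protein binding child is informative enough remains a curator judgment that the code only records.

```python
# IDs and names taken from the guideline text above.
NEVER_PROPAGATE_IDS = {"GO:0005515", "GO:0005488"}
NEVER_PROPAGATE_NAMES = {"protein binding", "binding", "enzyme binding"}


def may_propagate(go_id, name, is_protein_binding_child=False,
                  curator_judged_informative=False):
    """Return True if the guideline allows propagating this term."""
    if go_id in NEVER_PROPAGATE_IDS or name.strip().lower() in NEVER_PROPAGATE_NAMES:
        return False
    if is_protein_binding_child:
        # Children of protein binding need an explicit curator decision.
        return curator_judged_informative
    # Everything else, including small molecule binding terms, is allowed.
    return True


# "GO:0000000" is a placeholder ID used only for illustration.
blocked = may_propagate("GO:0005515", "protein binding")
allowed = may_propagate("GO:0000000", "some small molecule binding")
```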


Dealing with NOTs

  • NOTs are important: they allow us to capture likely functional changes over evolution so we do not make incorrect homology inferences.
  • You can only make a NOT for positive annotations made to an ancestor, so make the positive annotation first.
  • Every NOT must have a manual note added in the Evidence pane. Add notes below the generic paragraph that pops up.
  • If there is a NOT annotation among the experimental MOD annotations, use the "Inferred from Descendant Sequences" evidence code.
  • If there is specific evidence that active site residues are missing or substituted, use the "Inferred from Missing Residues" evidence code. (Active site residues are not yet automatically identified in PAINT, but will be taken from SwissProt and CDD in the future.)
  • If the branch is relatively long (indicating relatively rapid sequence evolution, a potential clue of adaptive evolution), use the "Inferred from Rapid Divergence" evidence code.
  • A NOT with rapid divergence evidence will not appear in the GAF provided to the MOD, but will be retained in the PAINT GAF. This makes it possible to say "do not propagate" to a particular clade, as distinct from adding an explicit NOT.
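
The evidence code choice for a NOT can be summarized as a small decision function. This is a sketch under stated assumptions: the guidelines do not say which code takes precedence when several conditions hold, so the order below (experimental NOT, then missing residues, then rapid divergence) is an assumption, as are the function name, its parameters, and the decision to raise an error when no evidence is given.

```python
DESCENDANT_SEQUENCES = "Inferred from Descendant Sequences"
MISSING_RESIDUES = "Inferred from Missing Residues"
RAPID_DIVERGENCE = "Inferred from Rapid Divergence"


def not_annotation(note, experimental_not=False, residues_missing=False,
                   long_branch=False):
    """Return (evidence code, include in MOD GAF?) for a NOT annotation."""
    if not note.strip():
        raise ValueError("every NOT needs a manual note in the Evidence pane")
    # Precedence among the three conditions is an assumption of this sketch.
    if experimental_not:
        code = DESCENDANT_SEQUENCES
    elif residues_missing:
        code = MISSING_RESIDUES
    elif long_branch:
        code = RAPID_DIVERGENCE
    else:
        raise ValueError("no supporting evidence for a NOT annotation")
    # NOT + rapid divergence is kept in the PAINT GAF but not sent to the MOD.
    return code, code != RAPID_DIVERGENCE


code, in_mod_gaf = not_annotation("long branch in this clade", long_branch=True)
```

For the rapid divergence case, `in_mod_gaf` is false, which corresponds to the "do not propagate" behavior described in the last bullet.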