PAINT annotation guidelines

From GO Wiki
Revision as of 01:11, 1 July 2021 by Pascale (talk | contribs)

PAINT is a Java application for annotating phylogenetic trees. PAINT user guide: http://wiki.geneontology.org/index.php/PAINT_User_Guide



These guidelines have been published (Gaudet, Livstone, Lewis, Thomas, 2011) [1].

General overview

An annotation of an ancestral node means that a gene function is inferred to have first arisen somewhere along the branch of the tree immediately preceding that node. Loss of a function "X" is annotated using the GO "NOT" qualifier, and means that a given function was inferred to have been lost along the branch immediately preceding the annotated node. Only experimental GO annotations (as represented by the evidence code) can be used as a basis for annotation of ancestral genes. NOT annotations can be supported by either an experimental GO annotation, absence of key residues in the sequences (IKR), or generally accelerated evolutionary rate (IRD).
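The inheritance logic described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the toy tree, the node names and the helper `has_function` are hypothetical and are not part of PAINT.

```python
# Hypothetical toy tree, stored as child -> parent (None marks the root).
PARENT = {
    "root": None,
    "anc1": "root",
    "anc2": "root",
    "geneA": "anc1",
    "geneB": "anc1",
    "geneC": "anc2",
}


def lineage(node, parent=PARENT):
    """Return the path from a node up to the root, starting with the node."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path


def has_function(leaf, gained_at, lost_at=frozenset(), parent=PARENT):
    """Return True if the leaf is inferred to have the function.

    Walk from the leaf towards the root. The nearest annotated node decides:
    a NOT annotation (loss) means the function is absent, and a positive
    annotation (gain) means it is present. No annotation means no inference.
    """
    for node in lineage(leaf, parent):
        if node in lost_at:
            return False
        if node in gained_at:
            return True
    return False
```

For example, with a gain annotated at `root` and a NOT annotation at `anc2`, `geneA` and `geneB` inherit the function while `geneC` does not.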

The PAINT curation process is a manual process based on manual annotations. To some extent, those manual procedures are subjective and subject to variability due to various factors such as the completeness of the annotations and differences in curators' expertise. Moreover, the manual annotations are extracted from the literature, which lacks standardization in terms of experimental descriptions and data interpretation. This results in some inconsistencies even in the experiment-based annotations from which PAINT annotations are produced. To increase the consistency and reproducibility of annotations, we have elaborated detailed annotation guidelines, described here.

Overview of literature on protein family function and phylogeny

The first step in PAINT curation is to identify any published literature on the family as a whole and on its phylogeny. Recent reviews are particularly helpful when available, and UniProt, OMIM and Wikipedia usually provide useful high-level overviews. These papers are reviewed and their PubMed identifiers are recorded by the curator in the Evidence Box in PAINT.

Annotation from single evidence

(to be completed)

Annotation from single paper, or papers from single groups

Sometimes up to three annotations (to the human, mouse and/or zebrafish genes, for example) originate from one article, or from several articles by the same group, and are incorrect. Check that the annotations are coherent with one another across function, process and localization, and that the protein domains are consistent with the molecular function (MF).

High throughput evidence

Evidence from high throughput experiments (Guide_to_GO_Evidence_Codes#High_Throughput_Experimental_Evidence_Codes) should be treated with caution:

Annotations derived from high throughput (HTP) experiments can be used as evidence for IBD, but only if there is additional supporting information:

  • When an HTP-derived annotation for a CC is consistent with the MF or BP (e.g. mitochondrion for respiratory chain proteins).
  • When there are several similar or closely related HTP-derived annotations from different groups (e.g. mitochondrion, thylakoid, mitochondrial matrix).
  • When there is supporting evidence in the sequence (predicted functional domains or transmembrane regions).
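The three conditions above can be read as a simple checklist. The following Python sketch shows one way to encode it; the class `HtpSupport`, the function `htp_usable_for_ibd` and the threshold of two groups for "several" are assumptions made for illustration, not rules taken from PAINT.

```python
from dataclasses import dataclass


@dataclass
class HtpSupport:
    """Supporting information recorded for one HTP-derived annotation."""
    consistent_with_mf_or_bp: bool = False  # e.g. CC term fits the known MF or BP
    agreeing_groups: int = 0                # different groups reporting similar terms
    sequence_support: bool = False          # predicted domains or transmembrane regions


def htp_usable_for_ibd(support: HtpSupport, min_groups: int = 2) -> bool:
    """Return True if at least one kind of supporting information is present.

    min_groups is an assumed reading of "several"; adjust it to local practice.
    """
    return (
        support.consistent_with_mf_or_bp
        or support.agreeing_groups >= min_groups
        or support.sequence_support
    )
```

A result of `True` only means the annotation passes the checklist; the curator still decides whether to use it.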

Verification of the tree topology and composition

Next, the curator assesses the quality of the tree. PAINT displays orthologous clusters determined by OrthoMCL and imported from the PPOD database. The curator verifies that the PANTHER tree topology is consistent with those orthologous clusters, and with any published phylogenetic analyses. The curator also verifies that no proteins that should obviously be in the family are missing; for example, if all mammals except humans have two paralogs of a gene, the curator investigates whether an ortholog of this protein can be found in the public databases. In the rare cases where there are inconsistencies that may affect PAINT annotations, the phylogeny is reviewed and reconstructed to resolve the issues. On the other hand, if the errors are small and do not affect the PAINT annotations, proteins that are mistakenly grouped in the family can be pruned (see above) either before or during curation.

Small duplications (leaf-level duplications)

The curator should look out for small duplications. For example, we often see duplications specific to S. pombe in which a function has been lost or gained, so that the two paralogs do not have the same function. The same applies to C. elegans, insects and plants.

Ensuring sufficient annotation coverage

One limitation of the PAINT curation process is that, for almost all model organisms, limited resources mean that not all experimentally characterized proteins are completely annotated. Moreover, in several cases the most recent literature is annotated first, even though the most basic functions of certain proteins may have been known for decades. To address this, before beginning to annotate a protein family the curator reviews the relevant literature and skims the existing annotations. Based on this background knowledge, the PAINT curator may request curators from one or more of the GO Reference Genomes to assign additional experimental annotations before starting the annotation of the family.

Annotating ancestral genes

The decision process involved in making annotations using PAINT is shown in this figure. Step 1 is to determine which ancestor would be annotated based on the experiment-based annotations to a given term, or its related terms in the ontology. The initial hypothesis is that the term was inherited from a common ancestor, so PAINT assists in this process by automatically highlighting the node in the tree corresponding to the most recent common ancestor (MRCA) of all sequences annotated by experiment with a particular term or its children. The curator may adjust this ancestor by considering all additional annotations, either ones that are directly related by GO relations (such as class-subclass relations), or those that may be biologically related but in a different part or even aspect of the ontology. Given this initial hypothesis, the curator needs to decide between three possibilities:

Option A The initial hypothesis is likely to be correct, i.e. the MRCA of the experimentally annotated sequences is where it likely first evolved.

Option B The actual annotation should be more ancient; in other words, the MRCA most likely inherited this function from a more ancient ancestor. In making this decision, the curator takes into account information such as duplication events/orthology, sequence conservation, the presence of essential/active site residues, branch length, and genes having inconsistent experimental annotations (i.e. descendants with annotations, or missing annotations in well-characterized genes, that are most likely not compatible with the annotation). Determining compatibility or mutual-exclusivity of annotations requires careful curator judgment. Finally, the actual term propagated is also important: annotators are more conservative for BP annotations than for MF. Curators actively look for whether the data are consistent with functional divergence occurring after duplication events or long branches.

Option C The annotation should be more recent, and probably arose more than once (homoplasy, or convergent evolution). The curator considers this possibility to be more likely for functions that are mechanistically more likely to evolve convergently, such as targeting to the mitochondrion in eukaryotes (gain or loss of a relatively short N-terminal targeting peptide) or loss of an enzymatic function by substitutions in the active site. Again, conflicting annotations among descendants are helpful, and using them, as well as assessing the likelihood of independent evolutionary events, requires curator judgment.
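The automatic starting point for this decision, the MRCA of all sequences annotated by experiment with a term or its children, can be sketched in Python as follows. This is a minimal illustration and not PAINT's implementation: the data structures (a child-to-parent map for the tree, a term-to-children map for the ontology) and the function names are assumptions.

```python
def ancestors(node, parent):
    """Return the path from a node up to the root, starting with the node."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path


def descendants(term, children):
    """Return the term together with all of its descendant terms."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(children.get(current, ()))
    return seen


def mrca(leaves, parent):
    """Return the most recent common ancestor of the given leaves, or None."""
    leaves = list(leaves)
    if not leaves:
        return None
    common = set(ancestors(leaves[0], parent))
    for leaf in leaves[1:]:
        common &= set(ancestors(leaf, parent))
    # Walking up from any leaf, the first shared node is the MRCA.
    for node in ancestors(leaves[0], parent):
        if node in common:
            return node
    return None


def candidate_node(term, annotations, children, parent):
    """MRCA of all leaves experimentally annotated with the term or its children."""
    wanted = descendants(term, children)
    leaves = [leaf for leaf, terms in annotations.items() if terms & wanted]
    return mrca(leaves, parent)
```

The node returned is only the initial hypothesis; choosing between options A, B and C remains a matter of curator judgment.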

Achieving high specificity in annotation

Curators attempt to propagate the most specific term possible. For example, if a human protein is annotated to “DNA binding” and its mouse ortholog is annotated to “double-stranded DNA binding,” the curator may infer, based on the evidence, that the human annotation refers to double-stranded DNA and may propagate the more specific term. These types of annotation transfers may increase the specificity of annotations, even for proteins that already have experimentally supported annotations.
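Finding the more specific of two related terms amounts to checking whether one is an ancestor of the other in the ontology. The Python sketch below shows this check on a toy fragment; the `PARENTS` map and the function names are hypothetical, and a real check would use the full GO graph.

```python
# Toy ontology fragment, stored as term -> list of parent terms (assumed data).
PARENTS = {
    "double-stranded DNA binding": ["DNA binding"],
    "DNA binding": ["nucleic acid binding"],
}


def is_ancestor(general, specific, parents=PARENTS):
    """Return True if `general` is reachable from `specific` via parent links."""
    stack, seen = list(parents.get(specific, ())), set()
    while stack:
        term = stack.pop()
        if term == general:
            return True
        if term not in seen:
            seen.add(term)
            stack.extend(parents.get(term, ()))
    return False


def most_specific(term_a, term_b, parents=PARENTS):
    """Return the more specific of two related terms, or None if unrelated."""
    if term_a == term_b or is_ancestor(term_a, term_b, parents):
        return term_b
    if is_ancestor(term_b, term_a, parents):
        return term_a
    return None  # unrelated terms: no automatic choice is possible
```

The function only identifies the candidate term. Whether the evidence justifies propagating it is still the curator's decision.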

Avoiding over-propagation and uncertain statements

  • Molecular functions are usually more conserved than biological processes: for example, members of the MAP kinase family have kinase activity, but regulate a large number of varied processes. Therefore, the PAINT guidelines advise curators to be particularly conservative when annotating biological processes. This often means that cellular processes can be confidently transferred, whereas only a very limited set of organismal processes may be transferred.
  • Look for evidence across many different species and from more than one article before propagating specific processes (for example, heart development or synaptic transmission).
  • Look for evidence in the molecular function to support the biological process. For example, a sodium channel annotation to synaptic transmission is likely to be correct, whereas the role of transcription factors is more difficult to assess without a lot of supporting evidence from several sources.
  • Certain molecular functions, for example those of ribosomal proteins, general transcription factors and proteins involved in vesicle movement, can have pleiotropic effects on many processes without playing a specific role in those processes. In this case we avoid making those annotations.
  • Beware of primary annotations, in particular based on phenotypes. See also http://wiki.geneontology.org/index.php/Annotating_from_phenotypes.
  • Also, curators try not to propagate terms to ancestral organisms in which they are clearly inappropriate, such as “nucleus” for a gene present in the last universal common ancestor (LUCA). GO has begun to perform taxonomic checks on annotations, and these checks are integrated within the software. If you try to annotate a term outside the taxonomic range for that term, you will be given a warning. If you think the taxonomic constraint is incorrect, please file a GitHub ticket to report it.
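The taxonomic check mentioned in the last point can be pictured as follows. This Python sketch is an assumption-laden illustration, not the check implemented in PAINT: the constraint table `ONLY_IN_TAXON`, the heavily simplified taxonomy `TAXON_PARENT` and the function `taxon_warning` are all hypothetical.

```python
# Hypothetical constraint table: term -> taxon in which the term may be used.
ONLY_IN_TAXON = {"nucleus": "Eukaryota"}

# Heavily simplified toy taxonomy, stored as taxon -> parent taxon.
TAXON_PARENT = {
    "cellular organisms": None,  # taxon of a LUCA node in this toy example
    "Eukaryota": "cellular organisms",
    "Bacteria": "cellular organisms",
    "Metazoa": "Eukaryota",
}


def taxon_warning(term, node_taxon):
    """Return a warning string if the term is out of range, otherwise None."""
    required = ONLY_IN_TAXON.get(term)
    if required is None:
        return None  # no constraint recorded for this term
    taxon = node_taxon
    while taxon is not None:
        if taxon == required:
            return None  # the node lies within the permitted taxon
        taxon = TAXON_PARENT[taxon]
    return f"warning: '{term}' is restricted to {required}, but the node is {node_taxon}"
```

Here `taxon_warning("nucleus", "cellular organisms")` produces a warning, while the same term on a Metazoa node does not.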

Do NOT propagate the following terms

1) Protein binding terms. Examples:

  • protein binding (GO:0005515) and the following children: PTHR10032: identical protein binding (GO:0042802), protein homodimerization activity (GO:0042803) and protein self-association (GO:0043621)

2) Partial molecular process terms that are covered by another MF term. For example, if you have annotated transcription factor activity, do not annotate DNA binding; if you have annotated receptor activity, do not annotate ligand binding.

3) If Molecular Function is annotated, the corresponding redundant one-step Biological Process should not be annotated. Examples:

  • PTHR10012: GO:0008601 protein phosphatase type 2A regulator activity is propagated to the root. GO:0043666 regulation of phosphoprotein phosphatase activity is a one-step process that is redundant with (but much less informative than) GO:0008601; therefore this term should NOT be propagated.
  • PTHR10032: 'sequence-specific DNA binding RNAP II transcription factor activity' is propagated to the root; do NOT propagate the redundant but less specific process term 'regulation of transcription'.

4) Downstream processes in which a gene product does not directly function. Examples:

  • PTHR10032: members of this family are zinc finger domain transcription factors. Mouse PLAGL2 is annotated to 'lipid metabolic process' (GO:0006629) (PMID 17983586). Do not propagate GO:0006629 to the root of the PLAGL2 clade since PLAGL2 does not directly contribute to chemical synthesis or breakdown of lipids.
  • PTHR13697: do not propagate to 'regulation of insulin secretion', but propagate to 'cellular glucose homeostasis'.
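Rules 1 and 3 above lend themselves to a simple lookup before a term is propagated. The Python sketch below uses only the GO identifiers quoted in this section; the table names and the function `propagation_problem` are hypothetical, and the tables are examples rather than a complete list.

```python
# Rule 1: protein binding terms named above that should not be propagated.
BLOCKED_TERMS = {
    "GO:0005515",  # protein binding
    "GO:0042802",  # identical protein binding
    "GO:0042803",  # protein homodimerization activity
    "GO:0043621",  # protein self-association
}

# Rule 3: MF term -> one-step BP terms that are redundant with it
# (only the PTHR10012 example from the text; extend as needed).
REDUNDANT_BP_FOR_MF = {
    "GO:0008601": {"GO:0043666"},
}


def propagation_problem(term_id, already_propagated=()):
    """Return a reason not to propagate the term, or None if no rule matches."""
    if term_id in BLOCKED_TERMS:
        return "protein binding term"
    for mf_term in already_propagated:
        if term_id in REDUNDANT_BP_FOR_MF.get(mf_term, ()):
            return f"redundant with {mf_term}"
    return None
```

Rules 2 and 4 depend on biological context (which MF term covers another, or whether a process is downstream) and are better left to curator judgment than to a lookup table.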

Use caution for the following terms

Specific substrates, for enzymes, transport and transporter activity

The specificity of enzyme substrates and molecules being transported by transporters can evolve rapidly. Especially for large families with many duplications (which is often the case for transporters, such as the ABC transporters), use caution when propagating substrates. Be very conservative when annotating these.

Tree issues

If a PANTHER tree needs to be reviewed, please create a ticket in the PANTHER GitHub tracker: https://github.com/pantherdb/Helpdesk/issues

PAINT issues

Issues with the PAINT tools should be reported in this tracker: https://github.com/pantherdb/db-PAINT/issues