PAINT annotation guidelines

PAINT is a Java application for annotating phylogenetic trees. PAINT user guide:

PAINT annotation guidelines

These guidelines have been published (Gaudet, Livstone, Lewis, Thomas, 2011) [1].

PAINT curation is a manual process that builds on existing manual annotations. To some extent, these manual procedures are subjective and vary with factors such as the completeness of the annotations and differences in curators' expertise. Moreover, the manual annotations are extracted from the literature, which lacks standardization in how experiments are described and how data are interpreted. This results in some inconsistencies even in the experiment-based annotations from which PAINT annotations are produced. To increase the consistency and reproducibility of annotations, we have developed the detailed annotation guidelines described here.

Overview of literature on protein family function and phylogeny

The first step in PAINT curation is to identify any published literature on the family as a whole and on its phylogeny (recent reviews are particularly helpful when available; UniProt, OMIM and Wikipedia usually have very helpful high-level overviews). These papers are reviewed and their PubMed identifiers are recorded by the curator in the Evidence Box in PAINT.

Verification of the tree topology and composition

Next, the curator assesses the quality of the tree. PAINT displays orthologous clusters determined by OrthoMCL and imported from the PPOD database. The curator verifies that the PANTHER tree topology is consistent with those orthologous clusters, and with any published phylogenetic analyses. The curator also verifies that no proteins that should obviously be in the family are missing; for example, if all mammals except humans have two paralogs of a gene, the curator investigates whether the missing human protein can be found in the public databases. In the rare cases where inconsistencies may affect PAINT annotations, the phylogeny is reviewed and reconstructed to resolve the issues. On the other hand, if the errors are small and do not affect the PAINT annotations, proteins that are mistakenly grouped in the family can be pruned (see above) either before or during curation.

Ensuring sufficient annotation coverage

One limitation of the PAINT curation process is that, for almost all model organisms, limited resources mean that not all experimentally characterized proteins are completely annotated. Moreover, in several cases the most recent literature is annotated first, even though the most basic functions of certain proteins may have been known for decades. To address this, before beginning to annotate a protein family the curator reviews the relevant literature and skims the existing annotations. Based on this background knowledge, the PAINT curator may request curators from one or more of the GO Reference Genomes to assign additional experimental annotations before starting the annotation of the family.

Annotating ancestral genes

The decision process involved in making annotations using PAINT is shown in this figure. Step 1 is to determine which ancestor would be annotated based on the experiment-based annotations to a given term, or its related terms in the ontology. The initial hypothesis is that the term was inherited from a common ancestor, so PAINT assists in this process by automatically highlighting the node in the tree corresponding to the most recent common ancestor (MRCA) of all sequences annotated by experiment with a particular term or its children. The curator may adjust this ancestor by considering all additional annotations, either ones that are directly related by GO relations (such as class-subclass relations), or those that may be biologically related but in a different part or even aspect of the ontology. Given this initial hypothesis, the curator needs to decide between three possibilities:

Option A: The initial hypothesis is likely to be correct, i.e. the MRCA of the experimentally annotated sequences is where the function likely first evolved.

Option B: The actual annotation should be more ancient; in other words, the MRCA most likely inherited this function from a more ancient ancestor. In making this decision, the curator takes into account information such as duplication events/orthology, sequence conservation, the presence of essential/active site residues, branch length, and genes having inconsistent experimental annotations (i.e. descendants with annotations, or missing annotations in well-characterized genes, that are most likely not compatible with the annotation). Determining compatibility or mutual exclusivity of annotations requires careful curator judgment. Finally, the actual term propagated is also important: annotators are more conservative for biological process (BP) annotations than for molecular function (MF) annotations. Curators actively look for whether the data are consistent with functional divergence occurring after duplication events or long branches.

Option C: The annotation should be more recent, and the function probably arose more than once (homoplasy, or convergent evolution). The curator considers this possibility to be more likely for functions that are mechanistically more likely to evolve convergently, such as targeting to the mitochondrion in eukaryotes (gain or loss of a relatively short N-terminal targeting peptide) or loss of an enzymatic function by substitutions in the active site. Again, conflicting annotations among descendants are informative; weighing them, and assessing the likelihood of independent evolutionary events, requires curator judgment.
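The automatic highlighting in Step 1 can be illustrated with a minimal sketch. It is not PAINT's actual implementation: the tree encoding (a child-to-parent dictionary), the function names and the toy data are all assumptions made for illustration. It finds the MRCA of every sequence that carries an experimental annotation to a term or to one of its descendant terms; the choice between Options A, B and C remains a curator decision.

```python
def path_to_root(node, parent):
    """Return the nodes from `node` up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def mrca(leaves, parent):
    """Most recent common ancestor of `leaves` in a child -> parent tree."""
    leaves = list(leaves)
    if not leaves:
        return None
    # Start from the first leaf's lineage and keep only nodes shared by all others.
    lineage = path_to_root(leaves[0], parent)
    for leaf in leaves[1:]:
        other = set(path_to_root(leaf, parent))
        lineage = [n for n in lineage if n in other]
    # The lineage is ordered leaf -> root, so the first shared node is the most recent.
    return lineage[0] if lineage else None


def candidate_ancestor(term, annotations, term_descendants, parent):
    """MRCA of all sequences annotated to `term` or to one of its descendants."""
    accepted = {term} | term_descendants.get(term, set())
    annotated = [seq for seq, terms in annotations.items() if terms & accepted]
    return mrca(annotated, parent)


# Hypothetical toy tree: root -> (A -> human, mouse), (B -> yeast, fly)
parent = {"human": "A", "mouse": "A", "yeast": "B", "fly": "B",
          "A": "root", "B": "root"}
# Hypothetical experimental annotations per sequence.
annotations = {
    "human": {"DNA binding"},
    "mouse": {"double-stranded DNA binding"},
    "yeast": set(),
    "fly": set(),
}
term_descendants = {"DNA binding": {"double-stranded DNA binding"}}

result = candidate_ancestor("DNA binding", annotations, term_descendants, parent)
print(result)  # prints "A"
```

In this toy tree only the human and mouse sequences are annotated, so node A is the initial hypothesis; the curator would then consider whether the function is in fact older (Option B) or arose independently (Option C).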

Achieving high specificity in annotation

Curators attempt to propagate the most specific term possible. For example, if a human protein is annotated to “DNA binding” and its mouse ortholog is annotated to “double-stranded DNA binding,” the curator may infer, based on the evidence, that the human annotation refers to double-stranded DNA and may propagate the more specific term. These types of annotation transfer may increase the specificity of annotations, even for proteins that already have experimentally supported annotations.
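As a rough sketch of how candidate terms could be narrowed down, the snippet below removes every term that is an ancestor of another term in the same set. The `is_a` dictionary and the function names are hypothetical, and the ontology fragment is a toy. The sketch only identifies the most specific candidates; whether the more specific term really applies to all descendants is still decided by the curator from the evidence.

```python
def ancestors(term, is_a):
    """All ancestors of `term`, following child -> set-of-parents links."""
    seen = set()
    stack = [term]
    while stack:
        for parent_term in is_a.get(stack.pop(), ()):
            if parent_term not in seen:
                seen.add(parent_term)
                stack.append(parent_term)
    return seen


def most_specific(terms, is_a):
    """Drop every term that is an ancestor of another term in the set."""
    terms = set(terms)
    implied = set()
    for term in terms:
        implied |= ancestors(term, is_a)
    return terms - implied


# Hypothetical fragment of the ontology (child -> parents).
is_a = {
    "double-stranded DNA binding": {"DNA binding"},
    "DNA binding": {"nucleic acid binding"},
}

specific = most_specific({"DNA binding", "double-stranded DNA binding"}, is_a)
print(specific)
```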

Avoiding over-propagation and uncertain statements

Molecular functions are usually more conserved than biological processes: for example, members of the MAP kinase family have kinase activity, but regulate a large number of varied processes. Therefore, the PAINT guidelines advise curators to be particularly conservative when annotating biological processes. This often means that cellular processes can be confidently transferred, and only very limited organismal processes may be transferred. Also, curators try not to propagate terms to ancestral organisms in which they are clearly inappropriate, such as “nucleus” for a gene present in the last universal common ancestor (LUCA). GO has begun to perform taxonomic checks on annotations. Integrating these taxonomic checks into the PAINT software is a high development priority.
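A taxonomic check of this kind might look like the sketch below. It is a simplified illustration, not the GO or PAINT implementation: the constraint table, the node names and the lineages are hypothetical, and only a single "restricted to this taxon" rule is modelled.

```python
# Hypothetical constraint table: term -> taxon to which the term is restricted.
ONLY_IN_TAXON = {"nucleus": "Eukaryota"}

# Hypothetical taxon lineages for ancestral tree nodes (most specific first).
NODE_LINEAGE = {
    "LUCA": ["cellular organisms"],
    "Eukaryota ancestor": ["Eukaryota", "cellular organisms"],
    "Opisthokonta ancestor": ["Opisthokonta", "Eukaryota", "cellular organisms"],
}


def term_allowed(term, node):
    """True if `term` may be propagated to the ancestral `node`."""
    required = ONLY_IN_TAXON.get(term)
    if required is None:
        # No constraint recorded for this term in the toy table.
        return True
    return required in NODE_LINEAGE[node]


for node in NODE_LINEAGE:
    print(node, term_allowed("nucleus", node))
```

With these toy tables, “nucleus” is rejected for the LUCA node and accepted for the two eukaryotic ancestors, which matches the example in the text.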

Do NOT propagate the following terms

1) protein binding terms: examples:

  • protein binding (GO:0005515) and its children (???): PTHR10032: identical protein binding (GO:0042802), protein homodimerization activity (GO:0042803) and protein self-association (GO:0043621)

2) single-step molecular process terms: e.g. ??? (example).

3) If Molecular Function is annotated, the corresponding redundant one-step Biological Process should not be annotated. Examples:

  • PTHR10012: GO:0008601 protein phosphatase type 2A regulator activity is propagated to the root. GO:0043666 regulation of phosphoprotein phosphatase activity is a one-step process that is redundant with (but a lot less informative than) GO:0008601; therefore this term should NOT be propagated.
  • PTHR10032: 'sequence-specific DNA binding RNAP II transcription factor activity' is propagated to the root; do NOT propagate the redundant but less specific process term 'regulation of transcription'.

[add additional examples]

4) downstream processes that a gene product does not 'directly' function in: e.g. PTHR13697: do not propagate to 'regulation of insulin secretion', but propagate to 'cellular glucose homeostasis'. Examples:

  • PTHR10032: members of this family are zinc finger domain transcription factors. Mouse PLAG2 is annotated to 'lipid metabolic process' (GO:0006629). Do not propagate GO:0006629 to the root of the PLAG2 clade, since PLAG2 does not contribute to the chemical synthesis or breakdown of lipids.
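Rules 1 and 3 above lend themselves to a simple filter, sketched below. The table and function names are hypothetical and this is not part of PAINT. The tables contain only the GO identifiers named in this section, so they are far from complete, and rules 2 and 4 are not modelled because they depend on curator judgment.

```python
# Rule 1: protein binding terms named in this section.
DO_NOT_PROPAGATE = {
    "GO:0005515",  # protein binding
    "GO:0042802",  # identical protein binding
    "GO:0042803",  # protein homodimerization activity
    "GO:0043621",  # protein self-association
}

# Rule 3: molecular function -> redundant one-step biological process terms.
REDUNDANT_PROCESS = {
    "GO:0008601": {"GO:0043666"},  # PP2A regulator activity -> regulation of
                                   # phosphoprotein phosphatase activity
}


def filter_propagation(candidate_terms):
    """Split candidate terms into (kept, blocked) according to rules 1 and 3."""
    candidates = set(candidate_terms)
    blocked = candidates & DO_NOT_PROPAGATE
    for function_term, process_terms in REDUNDANT_PROCESS.items():
        # The process term is redundant only when the function term is present.
        if function_term in candidates:
            blocked |= candidates & process_terms
    return candidates - blocked, blocked


kept, blocked = filter_propagation(["GO:0008601", "GO:0043666", "GO:0005515"])
print(sorted(kept), sorted(blocked))
```

A filter like this could flag candidate terms for review before propagation; the final decision would still rest with the curator.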