Guidelines from Annotation Camp
- 1 Downstream Process guidelines
- 1.1 Requesting more specific terms for downstream processes
- 1.2 Annotating downstream processes for gene products involved in core or specific processes
- 1.3 Annotating downstream processes to poorly characterised gene products
- 1.4 Annotating downstream processes to gene products in a ligand-receptor signaling pathway
- 1.5 General note on revision of annotation sets
- 2 Binding guidelines
- 3 'Response to' guidelines
- 4 Use of Regulation Terms
- 4.1 Background
- 4.2 Guideline 1: Use existing biological knowledge to define the process.
- 4.3 Guideline 2: If you aren’t sure, annotate to the parent process term.
- 4.4 Guideline 3: Improve the ontology by defining, wherever possible, the beginning, middle, and end of a process.
- 4.5 Guideline 4: Revisit annotations when new knowledge becomes available.
- 4.6 Guideline 5: Annotations based on mutant phenotypes should take mechanism into account.
- 4.7 Guideline 6: Some gene products may be annotated to both a process and regulation of that process.
- 5 Protein complexes guidelines
- 6 Quality control checks
Downstream Process guidelines
Requesting more specific terms for downstream processes
The most relevant GO term often does not exist. It is desirable to request terms that describe the involvement of one process in another, if that gives more specificity to the annotation. For example, to describe a gene product's "intent" to change the "state" of the cell:
• Growth factor BMP2 is instrumental in cardiac cell differentiation
• Following stimulation with BMP2, large numbers of genes are up/down regulated
Requesting the new GO term 'BMP signaling involved in cardiac cell differentiation' may be preferable to annotating to the separate terms 'BMP signaling' and 'cardiac cell differentiation', as it makes clear how the gene product is involved in cardiac cell differentiation. That is, qualify how the gene product is involved in the downstream process in preference to annotating to the downstream process term.
To assist in the creation of these new terms, the AmiGO 'Cross-product Term Request' tool will be useful, when it has been put into production.
Annotating downstream processes for gene products involved in core or specific processes
For small scale experiments, curators should annotate to the experimental evidence in the paper.
However, curator judgement should be used, taking into account what the curator knows about:
a) the gene product; does it have a central role causing it to affect multiple processes, or does it have few specific targets?
b) the quality of the experimental assays performed in the paper; are they fully explained and the evidence supplied convincing? (See separate guidelines for annotation of high-throughput experiments.)
Example 1. Gene product involved in core process.
a) Yeast RNA polymerase II subunit RPB2
• has core function of RNA polymerase activity
• likely to affect large number of processes unrelated to its function
• most curators agree should annotate only to 'transcription'
b) Yeast spliceosome
• in S. cerevisiae several genes are components of spliceosome
• when mutated the strains have defects in translation
• later evidence confirmed the genes' involvement in mRNA splicing, NOT translation
• since most splicing in yeast is to ribosome genes the effect on translation was seen
• so annotations to 'translation' were removed from the spliceosome components
Example 2. Gene product involved in core and specific process(es).
S. pombe gene Sre1
• direct transcriptional regulator of genes which have a role in heme and lipid biosynthesis PMID:16537923
• the curator judged this to be important information for this gene product
• annotations were made to:
- specific RNA polymerase II transcription factor activity
- regulation of transcription
- positive regulation of heme biosynthesis
- positive regulation of lipid biosynthesis
• In accordance with Guideline 1 for Downstream Processes, we would recommend that new terms are requested for:
- Regulation of transcription involved in heme biosynthesis
- Regulation of transcription involved in lipid biosynthesis
Annotating downstream processes to poorly characterised gene products
If a gene product has limited experimental literature, such as a newly characterised protein, it is acceptable to annotate to more general 'downstream' process terms that may represent a phenotype.
As more functional information is published about a gene product, these annotations to potential downstream processes may be removed if they are deemed by the annotating group as indirect, or they may be kept depending on each MOD's strategy.
Always remove annotations that are incorrect. Also remove annotations based on weaker evidence (NAS/TAS/IC) once they have been replaced by an annotation with better evidence to the same or a more granular term.
Annotating downstream processes to gene products in a ligand-receptor signaling pathway
Annotate ligand-receptor signaling pathways as shown in the following diagrams.
General consideration: for a signaling pathway, the ligand is considered part of the pathway, e.g. the insulin signaling pathway. In this case, a factor that limits or increases the availability of a ligand to a receptor should be annotated as regulating the ligand/receptor pathway.
N.B. Clarification of the start/end of a signaling pathway by the signaling group will allow us to refine these guidelines
General note on revision of annotation sets
Relevant to gene products with little annotatable evidence
When further information about a gene product is obtained, there are two options for the annotation set:
1. Remove annotations to indirect/downstream processes (or update them to ‘regulation’ terms). This ‘deleted’ information is usually stored in the annotating group’s phenotype database.
2. Do not remove annotations to indirect/downstream processes because:
a) the downstream annotations are supported by good evidence, the group wants to keep them as a history of annotation, or the group wants to give a complete overview of knowledge about the gene product.
b) the group does not have the resources to revise annotation sets, or has no alternative place to store the data.
It is important to note that MODs that keep these annotations will be a source of downstream process terms to MODs which do not keep these terms, via ISS from orthologs (e.g. PAINT).
Binding guidelines
Using terms that imply binding of substrates
As many terms in the Molecular Function ontology implicitly or explicitly imply the binding of a chemical or protein, it is unnecessary to co-annotate a gene product to a term from the binding node of GO to describe the binding of substrates or products that are already adequately captured in the definition of the Molecular Function term. For instance, a protein with enzymatic activity MUST bind all of the substrates and products of the reaction it catalyzes. Similarly, a protein with transporter activity MUST bind the molecules it transports. The curator should try to capture the specifics as much as feasible and avoid redundant annotations. Annotate to a binding term whenever an experiment shows binding, but not catalysis/transport. Curators should use their judgment to decide whether the interaction is physiologically relevant and capture information relevant to the in vivo situation.
Choosing more descriptive terms than 'protein binding'
Child terms that describe a particular class of protein binding (e.g. GO:0030971:receptor tyrosine kinase binding) should be used in preference to the parent term GO:0005515 protein binding. The IPI evidence code should be used where possible for annotation of all protein-protein interactions and the precise identity of the interacting protein should be captured in the ‘with’ column (8). At present a variety of identifiers can be used in the ‘with’ column (8) or the annotation extension column (16), see GO Annotation File Format 2.0 Guide.
Identifying binding partners using columns 8 and 16
When a gene product is being annotated to a binding activity term, the 'with' column (8) and/or the annotation extension column (16) can be used to capture additional information about the identity of the binding partner of the gene product being annotated. To understand when to use column 8, column 16, or both, it is important to remember that entries in column 8 support the evidence used to infer the function, while entries in column 16 modify the GO term used in the GO_ID column (5). The curator also needs to remember that the 'with' column (8) can be used with only a subset of evidence codes: IPI, IC, IEA, IGI, IMP or ISS. Column 8 cannot be used with an IDA evidence code; see evidence code documentation.
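The evidence-code restriction on the 'with' column (8) can be checked mechanically. The following is a minimal sketch, assuming annotations are available as an evidence code plus the column 8 value; the function name `with_column_problem` is hypothetical and the set of permitted codes is taken from the paragraph above, not from a GO tool.

```python
# Evidence codes that may carry a 'with' (column 8) entry, as listed in the text above.
WITH_ALLOWED = {"IPI", "IC", "IEA", "IGI", "IMP", "ISS"}


def with_column_problem(evidence_code, with_value):
    """Return a message if column 8 is filled for a code that does not accept it, else None."""
    if with_value and evidence_code not in WITH_ALLOWED:
        return f"'with' column (8) must be empty for evidence code {evidence_code}"
    return None


# IPI with an interactor is acceptable; IDA with a column 8 entry is reported.
print(with_column_problem("IPI", "UniProtKB:P68135"))
print(with_column_problem("IDA", "UniProtKB:P68135"))
```

The check only tests whether column 8 is populated. It does not validate the identifier itself.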
Examples of using the 'with' column (8)
The annotation of Protein A to a GO binding term with evidence code IPI and Protein B in the 'with' column (8) makes the statement that Protein A has the binding activity defined by the GO term and this function was inferred from interaction with Protein B; binding to Protein B isn't necessarily the in vivo function of Protein A.
1) Column 8 can be used to make annotations based on experiments where the evidence for the function of Protein A binding Protein B in species X is based on binding of protein B from species Y. For example, the C. elegans Unc-115 protein was shown to bind to actin filaments made with actin purified from rabbit skeletal muscle. This would be annotated as GO:0051015:actin filament binding using an IPI evidence code and putting an accession for rabbit skeletal muscle actin, UniProtKB:P68135, in the 'with' column (8). This annotation makes the statement that C. elegans Unc-115 has the molecular function of actin filament binding inferred from experiments using rabbit actin.
2) Column 8 can be used to indicate that the evidence for binding a small molecule is based on an experiment using an analog. The annotation Protein A GO:0005524:ATP binding IPI column 8 ATP-gamma-S captures the information that ATP binding activity was inferred from binding of a non-hydrolyzable ATP analog.
Examples of using the annotation extension column (16)
The annotation of Protein A to a GO function term with Protein B and a has_participant relationship in the annotation extension column (16) makes the statement that an in vivo target of Protein A is Protein B. This is equivalent to the post-compositional creation of a new child term.
3) The zebrafish Lnx2b protein (UniProtKB:A4VCF7) was shown to ubiquitinate zebrafish Dharma (UniProtKB:O93236) in PMID:19668196. Therefore Lnx2b can be annotated to GO:0004842:ubiquitin-protein ligase activity, adding has_input UniProtKB:O93236 in annotation extension column (16). This annotation makes the statement that Dharma is a substrate of the ubiquitin-protein ligase activity of Lnx2b.
4) The human ABCG1 protein has been annotated to GO:0034041 sterol-transporting ATPase activity with an IDA evidence code. The experiments in the paper demonstrate that the target is 7β-hydroxycholesterol. This information can be added to the annotation by including the ChEBI ID for 7β-hydroxycholesterol, CHEBI:42989, in the annotation extension column (16), thereby post-composing the GO term 7β-hydroxycholesterol-transporting ATPase activity.
The 'with' column (8) and the annotation extension column (16) should be used only for direct interactions and only when the binding relationship is not already included in the GO term and/or definition. See column 16 documentation for relationship types to use when adding IDs in the annotation extension column (16).
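To make the difference between the two columns concrete, here is a minimal sketch that writes examples 1 and 3 above as GAF 2.0 lines. The helper `gaf_line` and the field names are hypothetical, chosen for this illustration. The Unc-115 accession and reference are placeholders because the text does not give them, the IDA evidence code for Lnx2b is an assumption, and required columns such as date and assigned_by are left empty for brevity.

```python
# Column order of a GAF 2.0 line (17 tab-separated fields).
GAF_COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "reference", "evidence_code", "with_from", "aspect", "db_object_name",
    "db_object_synonym", "db_object_type", "taxon", "date", "assigned_by",
    "annotation_extension", "gene_product_form_id",
]


def gaf_line(**fields):
    """Build one GAF 2.0 line; columns that are not supplied are left empty."""
    unknown = set(fields) - set(GAF_COLUMNS)
    if unknown:
        raise ValueError(f"unknown GAF columns: {sorted(unknown)}")
    return "\t".join(fields.get(name, "") for name in GAF_COLUMNS)


# Example 1: the rabbit actin used in the assay supports the evidence, so it goes in column 8.
unc115 = gaf_line(
    db="WB",
    db_object_id="WBGene_PLACEHOLDER",  # hypothetical: the text gives no accession
    db_object_symbol="unc-115",
    go_id="GO:0051015",
    reference="PMID:PLACEHOLDER",  # hypothetical: the text gives no reference
    evidence_code="IPI",
    with_from="UniProtKB:P68135",
    aspect="F",
    taxon="taxon:6239",
)

# Example 3: Dharma is the in vivo substrate, so it modifies the GO term via column 16.
lnx2b = gaf_line(
    db="UniProtKB",
    db_object_id="A4VCF7",
    db_object_symbol="lnx2b",
    go_id="GO:0004842",
    reference="PMID:19668196",
    evidence_code="IDA",  # assumption: the text does not state the evidence code
    aspect="F",
    taxon="taxon:7955",
    annotation_extension="has_input(UniProtKB:O93236)",
)

print(unc115.split("\t")[7])   # column 8
print(lnx2b.split("\t")[15])   # column 16
```

In the first line column 16 stays empty, and in the second column 8 stays empty, which mirrors the distinction drawn above: column 8 supports the evidence, column 16 refines the term.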
Ontology development for protein binding
Future ontology development efforts should be relied upon to improve the searching capability of any user who is specifically interested in gene products carrying out a certain type of substrate/product binding. Ongoing relevant ontology development of 'has_part' relationships will provide links to implied substrate binding (the GOC are developing 'has_part' relationships to imply substrate binding). The existing GO will follow this new format, e.g. Transcription factor activity will have a 'has_part' relationship to DNA binding rather than an 'is_a' relationship. Curators should request new 'has_part' relationships (and terms) if these do not exist.
'Response to' guidelines
1. Update definition of response to terms to indicate that we are capturing mediators (wording needs to be worked out)
2. Quality control check: High level ‘response to’ terms should not directly be used for annotation
3. Update guidelines: Encourage the use of granular terms for ‘responses’
4. Update guidelines: Expression experiments should not be annotated to response to terms
Use of Regulation Terms
Background
The GO Consortium recognized quite early on in the development of the Biological Process ontology that there were gene products that participated directly in a process and gene products that regulated a process, positively and/or negatively. But how do curators know to which of these terms they should be annotating, and is it possible, for a given process, to annotate the same gene product to both a parent term and one of its associated regulation terms?
To begin to address these questions here are some guidelines for annotating, or not, to regulation terms:
Guideline 1: Use existing biological knowledge to define the process.
In order to determine whether a gene product participates in a process or regulates that process (or both) curators need to consider the nature of the process. Processes can be considered as ordered assemblies of molecular functions and every process has a beginning, middle, and end.
Use existing biological knowledge and the paper being curated as guides. Is there a defined pathway, i.e. distinct molecular functions, and have the gene products that perform those functions been identified? Does the gene product being annotated perform one of those functions or a function outside of the process that might start, stop, or change the rate at which the process proceeds?
In reality, the beginning, middle, and end of some processes will be easier to define than others. For example, signaling pathways, such as MAPK signaling, will be easier to define than broader, organismal-level processes such as embryonic development. Curators should use their judgement, based on the published literature, to guide their annotation.
Saccharomyces cerevisiae Atg1 encodes a protein kinase that is involved in autophagy: "The process by which cells digest parts of their own cytoplasm; allows for both recycling of macromolecular constituents under conditions of cellular stress and remodeling the intracellular structure for cell differentiation."
Atg1 activity is critical for the induction of autophagy, specifically for formation of autophagic vacuoles. Should Atg1 be annotated to autophagic vacuole formation or regulation of autophagic vacuole formation? Authors have used language that could lead curators to make annotations to either term.
In this case, annotators need to consider the sum of what is known about the autophagic pathway and Atg1's role in that pathway.
Using that knowledge, SGD has annotated Atg1 to the parent process term, autophagic vacuole formation, because once Atg1 is active, the 'go' or 'no go' decision for autophagy has already been made. More upstream genes appear to actually be regulating the autophagic pathway.
Guideline 2: If you aren’t sure, annotate to the parent process term.
If the gene product performs one of the functions that make up the process, annotate directly to the process. If the gene product regulates the process, it should be annotated to regulation of that process.
If you aren't sure what term to use, annotate to the parent process term. As more information about the process becomes available, you may be able to refine your annotations (see Guideline #4 below).
Guideline 3: Improve the ontology by defining, wherever possible, the beginning, middle, and end of a process.
Wherever possible, include the beginning, middle, and end of a process in the corresponding term definition. This will help annotators choose the appropriate term for their annotations.
Guideline 4: Revisit annotations when new knowledge becomes available.
GO annotations should reflect the present state of biological knowledge. Therefore, as the understanding of a biological process improves, it may be necessary to revisit and refine existing annotations.
Guideline 5: Annotations based on mutant phenotypes should take mechanism into account.
Mutant phenotypes are often used to make annotations to regulation terms because they fit the criteria of the term definition, i.e. authors report a change in the frequency, rate, or extent of a process.
However, in using IMP to correctly make regulation annotations it is important to consider various factors, including: 1) the assay type, 2) nature of the alleles (null vs reduction of function), and 3) molecular identity of the gene product.
Again, if it isn't clear that a gene product is involved in regulation, then it is better to annotate to the parent process term.
Example: muscle contraction and C. elegans mutants
In C. elegans, a number of genes can mutate to paralysis or slowed locomotion due to defects in muscle contraction. This includes genes that encode everything from myosin heavy chain to calcium channels to transcription factors. Depending upon the nature of the allele, sometimes the mutant phenotypes for the same gene can lead to both process and regulation terms. In this case, consideration of the process, the nature of the allele (complete or partial loss of function), and the molecular identity of the gene product can guide curators in making the appropriate annotation.
Guideline 6: Some gene products may be annotated to both a process and regulation of that process.
Positive and negative feedback loops are an essential part of many signaling pathways.
If one member of a pathway regulates the activity of a different member of the pathway, it could be annotated to both the process and regulation of that process.
When annotating gene products involved in a signaling pathway, however, curators should not annotate gene products that directly activate the next gene product in the pathway to regulation of that pathway.
For example, MAPKK would not be annotated to positive regulation of MAPKKK cascade just because it phosphorylates and activates MAPK.
However, gene products that, for example, feed back onto earlier steps in the pathway, may be annotated to both the parent process term and a regulation term.
ERK1/2 activation requires activity of FRS2alpha which, in turn, is negatively regulated by activated ERK1/2.
Could ERK1/2 be annotated to both MAPKKK cascade and negative regulation of MAPKKK cascade?
Cases where the presence/absence of one of the members of a pathway is limiting should not be annotated to regulation, e.g. if the amount of a receptor on the surface of a cell regulates the process, the receptor should not be annotated to the regulation term.
Protein complexes guidelines
1. Long term goal is to annotate complexes; details and requirements need to be clarified.
2. Guidelines + Quality control check: Avoid annotations to GO: MF by IPI (except for ‘protein binding’ and children) - Error reports will be generated.
3. Add to the guidelines: Do not make EXP annotations to MF when only the CC is observed.
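The quality control check in item 2 could be sketched as follows. This is an illustration only: the function name `flag_ipi_function_annotations` is hypothetical, annotations are assumed to be dictionaries with `go_id`, `aspect` and `evidence_code` keys, and the set of protein binding terms is a two-term stand-in (GO:0005515 and the child term GO:0030971 mentioned in the binding guidelines). A real check would derive the full set of children from the ontology.

```python
# Assumption: a tiny stand-in for 'protein binding' and its children.
PROTEIN_BINDING_TERMS = {"GO:0005515", "GO:0030971"}


def flag_ipi_function_annotations(annotations):
    """Return Molecular Function annotations made by IPI outside the protein binding branch."""
    return [
        a for a in annotations
        if a["aspect"] == "F"
        and a["evidence_code"] == "IPI"
        and a["go_id"] not in PROTEIN_BINDING_TERMS
    ]


annotations = [
    {"go_id": "GO:0005515", "aspect": "F", "evidence_code": "IPI"},  # allowed
    {"go_id": "GO:0004842", "aspect": "F", "evidence_code": "IPI"},  # reported
    {"go_id": "GO:0004842", "aspect": "F", "evidence_code": "IDA"},  # not IPI, ignored
]
flagged = flag_ipi_function_annotations(annotations)
print([a["go_id"] for a in flagged])
```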
Quality control checks
1. Check for co-annotation of a less-granular term with a more-granular term in the same path. Any action from this check is optional for each group, as it may still be appropriate to keep both annotations. For example, it is acceptable to retain the less-granular annotation if:
• It has a 'better' evidence code
• The curator feels it adds weight to the more-granular annotation
• Both annotations add value, e.g. 'histone methylation' and 'protein amino acid methylation'
2. No use of the 'NOT' qualifier with 'protein binding'; GO:0005515. This rule only applies to GO:0005515; children of this term can be qualified with NOT, as further information on the type of binding is then supplied in the GO term. For example, NOT + 'GO:0051529 NFAT4 protein binding' would be fine, as the negative binding statement only applies to the NFAT4 protein.
3. Annotations to 'protein binding'; GO:0005515, should only be supplied with an evidence code where the interactor can be identified in the 'with' field. This rule only applies to GO:0005515; it is less of a problem with child terms of protein binding, where the type of protein is identified in the GO term name.
4. Annotations to 'protein binding' should not use the ISS evidence code. This rule only applies to GO:0005515; it is less of a problem with child terms of protein binding, where the type of protein is identified in the GO term name.
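The four checks above lend themselves to simple scripts. The sketch below is one possible shape, not the Consortium's implementation: the function names, the dictionary keys and the toy ancestor map are all hypothetical. Check 1 needs the ontology structure, which is represented here by a caller-supplied map from each term to its less-granular ancestors, with term names used as stand-in identifiers.

```python
PROTEIN_BINDING = "GO:0005515"


def check_protein_binding(annotation):
    """Checks 2 to 4: return a list of problems; only GO:0005515 itself is tested."""
    problems = []
    if annotation["go_id"] != PROTEIN_BINDING:
        return problems  # child terms are exempt from all three rules
    # The GAF qualifier column may hold several pipe-separated values.
    if "NOT" in annotation.get("qualifier", "").split("|"):
        problems.append("NOT qualifier used with GO:0005515")
    if not annotation.get("with_from"):
        problems.append("no interactor in the 'with' field")
    if annotation["evidence_code"] == "ISS":
        problems.append("ISS evidence code used with GO:0005515")
    return problems


def redundant_pairs(terms, ancestors):
    """Check 1: list (less_granular, more_granular) pairs annotated to one gene product."""
    return sorted(
        (anc, term)
        for term in terms
        for anc in ancestors.get(term, ())
        if anc in terms
    )


bad = {"go_id": "GO:0005515", "qualifier": "NOT", "evidence_code": "ISS", "with_from": ""}
child = {"go_id": "GO:0051529", "qualifier": "NOT", "evidence_code": "IPI", "with_from": ""}
print(check_protein_binding(bad))
print(check_protein_binding(child))

# Toy ancestor map (assumption), using the example pair from check 1.
toy_ancestors = {"histone methylation": {"protein amino acid methylation"}}
gene_terms = {"histone methylation", "protein amino acid methylation"}
print(redundant_pairs(gene_terms, toy_ancestors))
```

Because action on check 1 is optional, `redundant_pairs` only reports candidate pairs and leaves the decision to keep or remove the less-granular annotation to the curator.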