User:JimHu/Binding terms

I have been remiss in keeping up with the binding term discussion and am late in getting my thoughts together. I think these comments are too long to post inline on the Binding terms working group page --JimHu 22:27, 3 June 2009 (PDT)

What problem(s) are we trying to solve?

I feel like the discussion has lost track of what problem with the status quo we are trying to solve. My interest in changing things is focused on:

Proliferation of "x binding terms" in the MF ontology.
The binding terms are currently a mess - proliferation will only make it worse
The ability to annotate to x binding without needing a new term request (NTR)

These may not be what others are concerned about. However, I think we should figure out if we're talking about the same things.

Curators must be able to annotate binding activities as Molecular Functions

Binding is a molecular function

The discussion of binding was stimulated by the observation that RefGenome jamboree participants were annotating substrates, especially ATP, despite standing guidelines to avoid such annotations. In the email discussion, someone raised the idea that ATP binding is a substep of the reaction carried out by an ATPase. Thus, ATP binding is implicit in annotation to any child of GO:0016887 ! ATPase activity. The ATP binding is not "atomic" with respect to the molecular function of the product¹. Fair enough. But there are proteins whose most basic molecular function is just binding something, e.g.:

DNA-binding proteins
Periplasmic substrate binding proteins, e.g. E. coli Maltose Binding Protein
Hemoglobin
Myoglobin
HDL and LDL
Retinoic acid binding protein

etc. A GenBank search for the literal string "binding protein" returns over 300K entries. If we remove the ability to annotate the functions of these products, then we lose a lot of the ability of GO to describe function.

In addition, I would argue that it is useful to annotate effector binding. Examples:

PEP is an allosteric inhibitor of E. coli PFK
RecA binding stimulates autoproteolysis of a number of proteins, including LexA, lambda repressor, and UmuD.

One might argue that these are secondary activities of PFK and RecA. But they are important parts of the biological functions of these proteins. Also, the main function of the regulatory subunits of aspartate transcarbamylase is to bind NTPs and regulate the activity of the catalytic subunits. The R subunits do not contribute to the catalytic activity; they regulate it in response to ligands.

Binding may be all you know about a protein

Pascale and others raised this point. The process of function discovery is not all-or-none. Sometimes a paper may identify ligand binding without demonstrating activity. I'm not sure, but I believe that there are examples where proteins are known by their cofactor binding, but the biologically relevant molecular function is not known. I suspect that high-throughput methods will increase the number of cases where binding and activity are found independently.

Proliferation of binding terms

I have argued that proliferation of "x binding" terms is problematic for two major reasons.

It is my sense that the performance of software tools degrades as the ontology gets larger. That should not be a reason to exclude terms that are needed. But x binding, where x can be any biologically significant small molecule (or worse, protein or nucleic acid sequence), will lead to massive expansion of the ontology. I'm pretty sure this would be bad for GONUTS; others can weigh in on how it would affect other applications.
When I curate results, I use search to get in the neighborhood of the proper term, but I browse the graph to find the correct term. Excessively specific terms tend to make it harder to see more useful terms. I may be the only one who works this way, but the fact that so many tools have graphical displays of the DAG suggests otherwise.

The DAG for the binding terms is already a mess

The current binding terms are already a mess. I think this is more apparent in the default views in GONUTS than in AmiGO, because GONUTS displays more about the children of a term by default.

In other parts of the ontology, there is a general, if imperfect, tendency for the information content of a term to increase with its distance from the root. This is not the case under binding, and may not be possible while adhering to the true path rule. If there is a rationale for where binding terms are placed, it is not at all obvious. In some cases, compounds are grouped into families, while in other cases, they are direct children of binding.

methotrexate, a drug, is not a child of drug binding. Moreover, drug binding is not what drug targets evolved to bind
same for suramin binding
there is a branch for phthalates - are these natural products? If not, how do we decide about ligands from combinatorial chemistry?
is the normal function of FKBP to bind FK506?
beta endorphin is a peptide, but beta-endorphin binding is a child of neurotransmitter binding and not peptide binding
there's a branch for peptide hormone binding, but not for steroid hormone binding
There are 95 children of protein binding
boron binding is almost certainly binding to borate, not elemental boron
pattern binding is in a position where it probably causes violations of the true path rule, e.g. for chitin binding.
the selection of sugar binding terms is spotty. Why lactose and not maltose? Why not ribose, or arabinose?
more violations of the true path rule - a fructose-6-P binding protein is not necessarily a fructose binding protein

Annotation without NTRs

The need for a new x binding term for any x means that annotation is delayed by the time needed to process new term requests. NTRs are currently handled pretty quickly, but this is a process that may not scale well. Is there anything in place that would block a natural language processing system from submitting NTRs based on mining PubMed for binding relationships? IEAs that capture binding are potentially valuable, but could cripple the ontology developers.

The need for NTRs also inhibits community annotation with GO.

Cross products as a solution

I proposed on SourceForge that the ChEBI entity that is bound be attached to a more generic GO term for binding at annotation time via post-composition. Originally, I proposed the wrong place to put this, but with the new GAF, it seems appropriate to annotate x binding as a cross product (see Annotation_Cross_Products#binding_example). However, I would use a generic binding term. My proposal for the generic terms was:

Binding
- Substrate binding
- Effector binding
  - Activator binding
  - Inhibitor binding

Various examples would be annotated to

Binding
- MBP: binding x CHEBI:17306 maltose
- replace things like kinetochore binding or Gram-negative bacterial binding with binding x anatomy or taxonomy term
Substrate binding
- PFK: substrate binding x CHEBI ATP; substrate binding x CHEBI F6P
Effector binding
Activator binding
Inhibitor binding
- PFK: inhibitor binding x CHEBI PEP

Note that

one would not put "lists of substrates" in a single annotation in column 16. An enzyme would get one annotation per substrate per experiment.
existing annotations to ATP binding need not be recast to substrate binding. Plain binding is still a valid annotation. This is the same as any other case where new children appear in the ontology after annotations are curated.
This solution could be generalized to many of the situations where curators want to be very specific, but doing so consistently would cause the ontology to explode.
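
As a rough sketch of what the proposal could look like as data, the Python below models the generic term hierarchy and the "one annotation per bound entity per experiment" rule. Everything here is an assumption made for illustration: the class and function names, the reference strings, and the use of plain labels such as "ATP" in place of real ChEBI identifiers (only CHEBI:17306 for maltose is taken from the example above). It is not the GAF format or any existing GO tool.

```python
from dataclasses import dataclass

# Proposed generic hierarchy, stored as child -> parent (illustrative only).
PARENT = {
    "substrate binding": "binding",
    "effector binding": "binding",
    "activator binding": "effector binding",
    "inhibitor binding": "effector binding",
}


@dataclass(frozen=True)
class BindingAnnotation:
    gene_product: str   # e.g. "PFK"
    go_term: str        # one of the generic binding terms above
    bound_entity: str   # entity attached at annotation time (ChEBI, anatomy, taxon, ...)
    reference: str      # experiment or paper supporting the annotation


def annotate(gene_product, go_term, entities, reference):
    """Return one annotation per bound entity; never a list of entities in one row."""
    if go_term != "binding" and go_term not in PARENT:
        raise ValueError(f"not a generic binding term: {go_term!r}")
    return [BindingAnnotation(gene_product, go_term, e, reference) for e in entities]


def ancestors(term):
    """Walk up the hierarchy so that child annotations still group under 'binding'."""
    found = []
    while term in PARENT:
        term = PARENT[term]
        found.append(term)
    return found


# Hypothetical reference strings; entity labels stand in for real identifiers.
mbp = annotate("MBP", "binding", ["CHEBI:17306"], "hypothetical-ref-1")
pfk = annotate("PFK", "substrate binding", ["ATP", "F6P"], "hypothetical-ref-2")
pfk += annotate("PFK", "inhibitor binding", ["PEP"], "hypothetical-ref-3")
```

Because "inhibitor binding" and "substrate binding" both resolve to "binding" through `ancestors`, existing annotations to plain binding remain valid alongside the more specific ones.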

Implicit annotations

I agree that one should be able to infer that anything with PFK activity binds ATP and F6P. So why can't we add implicit annotations with cross-products to GO term definitions?
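
To make the idea concrete, here is a minimal sketch, assuming that each activity term could carry a list of implied binding cross-products. The table `IMPLIED_BINDING` and the function `implied_annotations` are hypothetical names, and the entity labels "ATP" and "F6P" stand in for ChEBI identifiers; nothing like this exists in GO term definitions today.

```python
# Hypothetical table: activity term -> binding cross-products it implies.
# GO:0003872 is 6-phosphofructokinase activity.
IMPLIED_BINDING = {
    "GO:0003872": [
        ("substrate binding", "ATP"),
        ("substrate binding", "F6P"),
    ],
}


def implied_annotations(gene_product, activity_term):
    """Expand one activity annotation into the binding annotations it implies.

    Each result records the activity term it was derived from, so implied
    annotations can always be told apart from curated ones.
    """
    return [
        (gene_product, binding_term, entity, f"implied by {activity_term}")
        for binding_term, entity in IMPLIED_BINDING.get(activity_term, [])
    ]


pfk_implied = implied_annotations("PFK", "GO:0003872")
```

A curator would annotate only to the activity; the binding annotations would be derived, so no one needs to annotate ATP binding by hand for every ATPase.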

Responses to other comments

Pros

What these "binding" proposals all have in common is that they essentially want to track, for all enzymes the strict biochemical mechanism and all cofactors for each reaction, as well as all "relevant" substrate-product combinations. That is better left to some other database.

My definition of strict biochemical mechanism is much more mechanistic than listing substrates

GO does not track biochemical reactions. It doesn’t track the reactants nor the products.

But we already have these in many definitions, e.g. GO:0003872 ! 6-phosphofructokinase activity. The implicit annotation described above would be loaded from other databases with appropriate expertise, and could be cross referenced for source

Should GO track protein kinase substrates? Glycosylation sites? Ubiquitinylation substrates? GO needs to be consistent; why should GO partially track some reactants some of the time? That's not going to help anyone in the long run. In 99% of all cases, it will be better to cross index a database that is actually DESIGNED to store this sort of data.

There's a fundamental difference between tracking in the ontology per se, vs. tracking at annotation time. It is better to cross index to databases designed for the desired data, but the proposal doesn't cross-index - it just throws out the annotations altogether.

This proposal suggests that we should remove from GO terms such as GO:0043287: poly(3-hydroxyalkanoate) binding and replace it with nothing - because this description of substrate binding is not the role of GO, and delete the majority of ATP binding annotations, ATP binding to only be associated with proteins which bind ATP as a co-factor.

I propose to get rid of the term, but not any annotations to that term.

Cons

Currently there are substantial numbers of binding terms associated with protein records by electronic means, for instance: 1,539,419 electronic annotations to just the ATP binding terms (versus 880 manual annotations), which include both 'substrate binding' and 'cofactor binding'. Furthermore, many of these terms are associated in a 'systematic manner', for example through protein domains: InterPro includes a number of domains which define a nucleotide binding site, e.g. IPR011761 ATP-grasp fold.

These could be readily translated to the cross product system
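
A minimal sketch of such a translation, assuming that existing term names follow the pattern "x binding": the hypothetical function `to_cross_product` splits the name into the generic term plus the bound entity. In practice the entity label would still have to be mapped to a ChEBI (or other ontology) identifier, which is not shown here.

```python
def to_cross_product(term_name):
    """Translate an 'x binding' term name into (generic term, bound entity).

    Returns None for names that do not fit the pattern, including plain
    'binding' itself, so those annotations are left untouched.
    """
    suffix = " binding"
    if not term_name.endswith(suffix):
        return None
    entity = term_name[: -len(suffix)]
    # Translate to plain 'binding', not 'substrate binding': the existing
    # annotation does not say which role the ligand plays.
    return ("binding", entity)


atp = to_cross_product("ATP binding")
other = to_cross_product("ATPase activity")
```

Translating to plain binding loses nothing that the current annotation states; refining to substrate, activator, or inhibitor binding can happen later.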

Could we store the information on substrate ATP binding annotations in another way, or alternatively still help users capture this information? Might it be useful to have a high-level grouping term so that proteins can be identified as to the energy source they use to carry out a catalysis (e.g. 'catalysis; ATP-hydrolysing')? Could such ribonucleotide terms be considered a bit differently from other binding terms, as one could say that the main purpose of an ATP-dependent enzyme, for instance a peptidase, is not to break down ATP, but to break peptide bonds?

Catalysis does not change the thermodynamics of the reaction catalyzed, so I'm uncomfortable with the idea of the "energy source for catalysis". More importantly, if you have an NTP binding term, curators are going to use it for more than the intended use. With either the proposal or the status quo, this leads to inconsistency. Which kind of inconsistency is worse is not clear to me. I'd rather have incompleteness than the implication that something is not an ATPase because correct practice would have omitted that annotation, but a curator did it anyway.

It is going to involve a vast amount of work for the annotation groups to split up the nucleotide binding annotations into 'substrate binding' or 'cofactor binding' types.

Not needed, see above. Translating would not lose any information present in the current annotations to ATP binding. Where there is also an annotation to a particular activity, the annotations could be flagged for rereview, but nothing incorrect would happen if reannotation was slow. Note that reannotation of this kind could be a good community task.

In addition a large amount of information will be deleted from GO.

Not with my alternative.

In order to preserve some of this information, but in a more appropriate ‘GO’ format could the GO terms provide an indication in the definition (or term's parentage) the specific ribonucleotide being used, e.g. making the term more specific 'GTP-dependent helicase activity', 'protein kinase activity' or expanding the ontology:
>‘ribonucleotide-dependent catalytic activity’ >> ‘ATP-dependent catalytic activity’ This would follow previous terms such as:
GO:0016723 'oxidoreductase activity, oxidizing metal ions, NAD or NADP as acceptor', as well as many specific terms (e.g. 'GTP-dependent polynucleotide kinase activity', 'thymidylate synthase (FAD) activity', 'DNA ligase (ATP) activity', 'N-methylhydantoinase (ATP-hydrolyzing) activity').

As noted above, many of the definitions for enzymes have this level of specificity already.

This would at least mean that if any users had become accustomed to using the ATP binding annotation set to find those gene products that metabolised ATP, they could in future still gather together relevant gene products by using such a grouping term (and as a side benefit it could be helpful for curation consistency, if we could search for proteins co-annotated to 'ATP binding' and 'catalytic activity (ATP-hydrolysing)' then we would have reason to investigate further the validity of the 'ATP binding' annotation).

The reason I have suggested limiting the removal of substrate binding terms to all molecules except protein is that I am not comfortable with the idea that the only proteins annotated to 'receptor binding' will be their ligands, because the signal transduced after ligand binding is a series of catalytic reactions and these substrates will not be included. If the intention is that these substrates will be included in column 16 then I would be happy to accept this. However, in previous emails Ben has stressed that long substrate lists in column 16 is not his idea of column 16, (although my impression is that many annotators do want to use column 16 for this purpose).

I don't think of receptor ligands as substrates. They aren't consumed/transformed. I would annotate receptor and protein ligand both to binding, and receptor kinases to protein kinase activity x UniProtKB ID for the kinase substrates.

As an annotator I would like to be able to add the GO term 'DNA binding' to a protein whose only known function is that it binds DNA. However, unless I can show that this binding is not associated with a catalytic activity I will not be able to include this annotation because to do so would produce misleading annotations, for example potentially a novel helicase would be annotated as binding DNA, whereas no other helicases would be annotated to DNA binding. Consequently the removal of catalytic substrate binding terms will limit the amount of data that can be captured by GO. Not that I think we will run out of data to use for annotation, but for proteins which have almost no data this may make a big impact. For example if a protein is shown to bind DNA and has 60% homology to a known helicase you may be more tempted to use ISS to transfer helicase activity to the protein, but you may not wish to do this if there is no evidence of DNA binding activity.

I would be open to keeping some x binding terms for very high level classes of ligands such as DNA, protein, carbohydrate etc. But it's a slippery slope, and I would like to see guidelines regarding what kinds of ligands get to be direct children of binding.

Footnotes

¹ From one of Pascale's emails to the list:

Currently, the GO documentation states that "Binding terms should only be used in cases where a stable binding interaction occurs."
http://geneontology.org/GO.function.guidelines.shtml?all#binding
Also, one of the GO 'dogmas' is that Molecular Functions describe single steps of reactions. Thus representing a catalytic function by annotating to the function itself as well as binding to all its substrates is in disagreement with this notion. However many groups currently do that: the GO database has annotations such as GTP binding for proteins with GTPase activity, x ligand binding for proteins with receptor activity, etc.