Proteases


Background

This item grew out of work on adding terms to the function ontology for enzyme activities, based on EC entries that don't have corresponding GO terms. EC classifies enzymes mainly on the basis of reaction mechanism, which fits nicely with GO's usual criteria for including function terms, and allows us to capture most EC entries and the EC hierarchical organization. (Also see SF 1226219)

For peptidase and protease activities, however, all bets are off. EC includes not only various peptide-bond-cleaving mechanisms but also a rather large number of entries whose names and reactions make them seem much more like gene products than descriptions of distinct activities. Two examples that are very gene-producty:

[Term]
name: thermomycolin activity
namespace: molecular_function
def: "Catalysis of the reaction: Rather nonspecific hydrolysis of proteins. Preferential cleavage:
 Ala-|-, Tyr-|-, Phe-|- in small molecule substrates." [EC:3.4.21.65]

[Term]
name: streptogrisin B activity
namespace: molecular_function
def: "Catalysis of the reaction: Hydrolysis of proteins with trypsin-like specificity."
[EC:3.4.21.81]

Others are in a grey area, e.g.

[Term]
name: PepB aminopeptidase activity
namespace: molecular_function
def: "Catalysis of the reaction: Release of an N-terminal amino acid, Xaa, from a peptide or
arylamide. Xaa is preferably Glu or Asp but may be other amino acids, including Leu, Met, His,
Cys and Gln." [EC:3.4.11.23]
synonym: "PepB aminopeptidase activity" EXACT [EC:3.4.11.23]

[Term]
name: metridin activity
namespace: molecular_function
def: "Catalysis of the reaction: Preferential cleavage: Tyr-|-, Phe-|-, Leu-|-; little action on
Trp-|-." [EC:3.4.21.3]

Question: should GO continue to add new function terms corresponding to the gene-product-ish EC entries? GO has already included quite a few, e.g.

[Term]
id: GO:0004213
name: cathepsin B activity
namespace: molecular_function
def: "Catalysis of the hydrolysis of peptide bonds with a broad specificity. Preferentially
cleaves the terminal bond of -Arg-Arg-Xaa motifs in small molecule substrates (thus differing
from cathepsin L). In addition to being an endopeptidase, shows peptidyl-dipeptidase activity,
liberating C-terminal dipeptides." [EC:3.4.22.1]
synonym: "cathepsin B1 activity" EXACT [EC:3.4.22.1]
synonym: "cathepsin II" RELATED [EC:3.4.22.1]
xref: EC:3.4.22.1
xref: MetaCyc:3.4.22.1-RXN
is_a: GO:0004197 ! cysteine-type endopeptidase activity

(and the other cathepsin terms)
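
For anyone who wants to survey such terms programmatically, here is a minimal sketch of reading a [Term] stanza into a dictionary. It assumes one tag per line (so it would not handle the wrapped def lines shown on this page), and the function name `parse_term_stanza` is made up for illustration; it is not part of any GO tooling.

```python
def parse_term_stanza(text):
    """Parse one OBO [Term] stanza into a dict of tag -> list of values.

    Assumption: every tag-value pair sits on a single line.
    """
    term = {}
    for raw in text.strip().splitlines():
        line = raw.strip()
        if not line or line == "[Term]":
            continue
        tag, _, value = line.partition(": ")
        # Relationship lines carry a trailing "! name" comment; keep the ID only.
        if tag == "is_a":
            value = value.split(" ! ")[0]
        term.setdefault(tag, []).append(value)
    return term


stanza = """
[Term]
id: GO:0004213
name: cathepsin B activity
namespace: molecular_function
xref: EC:3.4.22.1
xref: MetaCyc:3.4.22.1-RXN
is_a: GO:0004197 ! cysteine-type endopeptidase activity
"""

term = parse_term_stanza(stanza)
print(term["name"][0], "is_a", term["is_a"][0])
```

Storing every tag as a list keeps repeated tags such as xref and synonym without special cases.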

Proposal

From Ben Hitz:

Comments

Proteases Produce Pernicious Problems Periodically.

The reason for this is just historical: they are such an early and important class of proteins that their nomenclature is FUBAR.

I would not add the "gene_product" like activities from EC. Unfortunately this means either clean up EC or clean up GO.

I just spent a few minutes at this page:

Contents EC 3.4 to EC 3.12

I can't see that cleaning that mess up would be fun. Basically you can classify proteases by catalytic mechanism (serine, cysteine, metallo-, aspartyl-) or substrate (X-Y) where X and Y are different amino acids. Other distinguishing characteristics: endopeptidase vs. exopeptidase, D- vs. L- amino acids.

Furthermore, there is an evolutionary classification which completely overlaps these boundaries (the catalytic triad of Ser-Asp-His is _the_ classic example of convergent evolution to a common enzyme mechanism).

Perusing the GO DAG, I would say that it would be best off dumping 90% of the substrate-specific terms. It may be worthwhile distinguishing between proteases (act on "protein") and peptidases (act only on short peptides) or endo/exo peptidases, but no further. I would also include things like EC 3.4.13.20 and EC 3.4.13.22 where the enzyme acts on "atypical" peptides.

I would probably go ahead and distinguish based on catalytic mechanism, if just to reduce the number.

Proposed organization

Here is a rough cut: -> (reverse is_a) == (reverse part_of)

catalytic activity -> hydrolase activity ->  peptidase activity
peptidase activity -> "regular" (i.e, L,L alpha-alpha peptide bond found in proteins) peptidase
activity (most of 3.4.-)
	-> D-D peptidase activity -> D-Ala-D-Ala peptidase activity (3.4.13.22, 3.4.11.19)
	-> Beta (L,L) peptidase activity (3.4.13.20; 3.4.19.5)
	-> Gamma (L, L) peptidase -> (3.4.19.9) tricky because I don't want to say Gamma-glutamyl ..
	-> Gamma (D, L) peptidase -> (3.4.19.11)

"regular" peptidase activity -> endopeptidase activity
	-> exopeptidase activity -> aminopeptidase activity
	-> carboxypeptidase activity

and/or
"regular" peptidase activity -> serine peptidase activity
	-> cysteine peptidase activity
	-> aspartyl peptidase activity
	-> threonine peptidase activity (see, e.g., 3.4.25.-)
	-> metallopeptidase activity

BUT note: EC 3.4.17.14 zinc D-Ala-D-Ala carboxypeptidase
and EC 3.4.16.4 serine-type D-Ala-D-Ala carboxypeptidase

So you could need many "cross products", but I suppose we could only add them as needed (i.e, don't need threonine exopeptidase until someone discovers one).

One area I didn't cover is the ATP-dependent proteases (Lon, ClpP, La) - GO:0004176. The Lon family I think are all serine proteases, but it's certainly not guaranteed. Not sure it's worth splitting even up higher into "ATP-dependent" and "ATP-independent"!

Hope this helps.

Ben

Very similar proposal

From: Colin Batchelor (RSC)

I pretty much second everything Ben has to say, especially about the D-amino acids.

From a text-mining point of view I don't want to see any single words ending in -ase disappearing altogether from the part of the ontology we scoop up (names, EXACT synonyms and potentially NARROW synonyms if I can be sure that there's no duplication).

So that means I want to keep "exopeptidase", "endopeptidase", "metallopeptidase", "metalloendopeptidase" and so forth. That would ideally mean a new tree, something like "molecular function attribute" with "metal-catalysed" (or even a has_catalyst relation pointing to ChEBI) but I'm happy to wait for the revolution for that one.

Likewise keep the substrate-based terms like "cyanophycinase" (though I don't see that cleavage of cyanophycin is necessarily a serine-type peptidase activity), "elastase" and "fibrolase".

On the other hand, for example, procollagen N-endopeptidase activity (GO:0017074) is, at least according to the definition, not intrinsically a metalloendopeptidase activity; that's a statement about the gene products that realize that activity.

So I'm not convinced that metalloexopeptidase, metalloendopeptidase, serine-type peptidase and so on should have any children. Does that sound fair?

I can't see the case for keeping the cathepsin terms in GO because I can't see how you would write genus--differentia definitions for them. I'd like to see the bare gene product names remain in GO as RELATED synonyms for their parents, though. Astacin activity (GO:0008533) could go, but bontoxilysin activity (GO:0033264) looks substrate-based so can stay.

A rule-of-thumb that feels right is that if something ends in -in and is qualified by a letter or a number at the end (cathepsin B activity, stromelysin 1 activity for example), it is a gene product name rather than an activity and should go.
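
The rule of thumb above (a name ending in -in followed by a letter or number qualifier) can be sketched as a regular expression. This is only an illustration: the pattern and the helper name `looks_gene_product_like` are assumptions, and a real check would still need a curator's eye.

```python
import re

# Hypothetical pattern: one word ending in "in", then a one- or two-character
# letter/number qualifier, then the word "activity".
GENE_PRODUCT_LIKE = re.compile(r"^\w*in [A-Za-z0-9]{1,2} activity$")


def looks_gene_product_like(name):
    """Return True if a term name fits the '-in plus qualifier' pattern."""
    return GENE_PRODUCT_LIKE.match(name) is not None


for name in [
    "cathepsin B activity",
    "stromelysin 1 activity",
    "metridin activity",
    "cysteine-type endopeptidase activity",
]:
    print(name, looks_gene_product_like(name))
```

Names such as metridin activity have no qualifier, so this pattern does not flag them; they would need a separate judgement.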

I certainly can't see the case for adding, for example, thermomycolin activity and metridin activity.

best wishes, Colin.

Comment

From: Peter D'Eustachio (Reactome)

EC recognizes that there is a mess at the bottom of the hierarchy, but has an organization essentially identical to the one proposed here two levels up (so the issue here is mostly one of granularity rather than of enzyme classification):

clip taken from http://www.chem.qmul.ac.uk/iubmb/enzyme/EC3/intro.html#EC34

"The nomenclature of the peptidases is troublesome. Their specificity is commonly difficult to define, depending upon the nature of several amino acid residues around the peptide bond to be hydrolysed and also on the conformation of the substrate polypeptide chain. A classification involving the additional criterion of catalytic mechanism is therefore used.

"Two sets of sub-subclasses of peptidases are recognised, those of the exopeptidases (EC 3.4.11-19) and those of the endopeptidases (EC 3.4.21-24 and EC 3.4.99). The exopeptidases act only near the ends of polypeptide chains, and those acting at a free N-terminus liberate a single amino-acid residue (aminopeptidases, EC 3.4.11), or a dipeptide or a tripeptide (dipeptidyl-peptidases and tripeptidyl-peptidases, EC 3.4.14). The exopeptidases acting at a free C-terminus liberate a single residue (carboxypeptidases, EC 3.4.16-18) or a dipeptide (peptidyl-dipeptidases, EC 3.4.15). The carboxypeptidases are allocated to three groups on the basis of catalytic mechanism: the serine-type carboxypeptidases (EC 3.4.16), the metallocarboxypeptidases (EC 3.4.17) and the cysteine-type carboxypeptidases (EC 3.4.18). Other exopeptidases are specific for dipeptides (dipeptidases, EC 3.4.13), or remove terminal residues that are substituted, cyclized or linked by isopeptide bonds (peptide linkages other than those of α-carboxyl to α-amino groups) (omega peptidases, EC 3.4.19).

"The endopeptidases are divided into sub-subclasses on the basis of catalytic mechanism, and specificity is used only to identify individual enzymes within the groups. These are the sub-subclasses of serine endopeptidases (EC 3.4.21), cysteine endopeptidases (EC 3.4.22), aspartic endopeptidases (EC 3.4.23), metalloendopeptidases (EC 3.4.24) and threonine endopeptidases (EC 3.4.25). Endopeptidases that could not be assigned to any of the sub-subclasses EC 3.4.21-25 were listed in sub-subclass EC 3.4.99."
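
The sub-subclasses named in the quoted EC text can be written down as a small lookup table. This is a sketch built only from that quote; the names `EC_34_SUBSUBCLASSES` and `classify_ec` are hypothetical, and sub-subclasses not mentioned in the quote are simply absent.

```python
# EC 3.4.x sub-subclass -> (endo/exo, group name), taken from the quoted text.
EC_34_SUBSUBCLASSES = {
    11: ("exopeptidase", "aminopeptidases"),
    13: ("exopeptidase", "dipeptidases"),
    14: ("exopeptidase", "dipeptidyl-peptidases and tripeptidyl-peptidases"),
    15: ("exopeptidase", "peptidyl-dipeptidases"),
    16: ("exopeptidase", "serine-type carboxypeptidases"),
    17: ("exopeptidase", "metallocarboxypeptidases"),
    18: ("exopeptidase", "cysteine-type carboxypeptidases"),
    19: ("exopeptidase", "omega peptidases"),
    21: ("endopeptidase", "serine endopeptidases"),
    22: ("endopeptidase", "cysteine endopeptidases"),
    23: ("endopeptidase", "aspartic endopeptidases"),
    24: ("endopeptidase", "metalloendopeptidases"),
    25: ("endopeptidase", "threonine endopeptidases"),
    99: ("endopeptidase", "endopeptidases not assigned to EC 3.4.21-25"),
}


def classify_ec(ec_number):
    """Map an EC 3.4.x.y string to its sub-subclass, or None if unknown."""
    parts = ec_number.replace("EC:", "").replace("EC", "").strip().split(".")
    if len(parts) < 3 or parts[:2] != ["3", "4"] or not parts[2].isdigit():
        return None
    return EC_34_SUBSUBCLASSES.get(int(parts[2]))


print(classify_ec("EC:3.4.22.1"))   # cathepsin B
print(classify_ec("EC 3.4.11.23"))  # PepB aminopeptidase
```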


Notes from conference call June 11, 2008

Midori, Colin, Peter, Ben

Agreed to stick with the two-dimensional organization discussed at the April GOC meeting, where one dimension is substrate and the other is mechanistic (see table below).

Substrate specificity:

Peter: endo- vs. exopeptidase is a distinction worth retaining, because whether the enzyme "seeks" an end is a meaningful mechanistic feature.

Exopeptidases can be further divided into aminopeptidases (which cleave N-terminal residues) and carboxypeptidases (C-terminal residues).

Ben: advises against distinguishing substrates based on cleavage sequence preferences.

Peter: notes that different proteases vary a lot in sequence specificity; e.g. contrast trypsins and chymotrypsins with blood coagulation cascade proteases.

The working group ended up preferring not to subdivide based on other aspects of substrate specificity such as sequence, reaction conditions, etc. The explosion would be huge, and not particularly useful; compare protein kinases and restriction endonucleases.

Classification matrix

The working group agreed that the first step is to implement the two-dimensional organization; the mechanism axis now has six entries (we forgot threonine and glutamic peptidases at the SLC meeting). The resulting "matrix" is shown in the table:

                                       | endopeptidase activity GO:0004175        | exopeptidase activity GO:0008238
                                       |                                          | aminopeptidase activity           | carboxypeptidase activity
aspartic peptidase activity            | aspartic endopeptidase activity          | aspartic aminopeptidase activity  | aspartic carboxypeptidase activity
cysteine peptidase activity GO:0008234 | cysteine endopeptidase activity          | cysteine aminopeptidase activity  | cysteine carboxypeptidase activity
glutamic peptidase activity            | glutamic endopeptidase activity          | glutamic aminopeptidase activity  | glutamic carboxypeptidase activity
metallopeptidase activity GO:0008237   | metalloendopeptidase activity GO:0004222 | metalloaminopeptidase activity    | metallocarboxypeptidase activity
serine peptidase activity GO:0008236   | serine endopeptidase activity            | serine aminopeptidase activity    | serine carboxypeptidase activity
threonine peptidase activity           | threonine endopeptidase activity         | threonine aminopeptidase activity | threonine carboxypeptidase activity

Note (2008-06-13): This is basically consistent with EC, despite the fact that EC (of necessity) uses a one-dimensional classification. As far as we know, each cell is biologically/biochemically plausible, so we'll add (or keep) GO terms for each.
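
As a rough illustration of the two-dimensional organization, the matrix cells can be generated as the cross product of the mechanism axis and the endo/amino/carboxy axis. The variable and function names here are made up for the sketch; the term names follow the table, including the unspaced "metallo" prefix.

```python
from itertools import product

# Mechanism axis (six entries) and position axis (endo-, amino-, carboxy-).
MECHANISMS = ["aspartic", "cysteine", "glutamic", "metallo", "serine", "threonine"]
POSITIONS = ["endopeptidase", "aminopeptidase", "carboxypeptidase"]


def cell_name(mechanism, position):
    """Build the term name for one matrix cell."""
    # "metallo" is a prefix and is written without a space in the table.
    sep = "" if mechanism == "metallo" else " "
    return f"{mechanism}{sep}{position} activity"


matrix = {(m, p): cell_name(m, p) for m, p in product(MECHANISMS, POSITIONS)}

for (mechanism, position), name in matrix.items():
    print(f"{mechanism:10} {position:17} {name}")
```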

The more specific terms should then fit somewhere in the matrix. We plan to determine which terms match which matrix cell, and for cells occupied by more than one term, figure out whether information that fits into the scope of GO would be lost if we didn't have the more specific terms.
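
The step of finding cells occupied by more than one term can be sketched as a simple grouping. The assignments below are illustrative only: they are inferred from the EC numbers quoted on this page (EC 3.4.21 for serine endopeptidases, EC 3.4.22 for cysteine endopeptidases) and are not curated GO relationships. The names `TERM_TO_CELL` and `crowded_cells` are hypothetical.

```python
from collections import defaultdict

# Illustrative term -> (mechanism, position) assignments, inferred from the
# EC sub-subclass of each term's EC reference on this page.
TERM_TO_CELL = {
    "thermomycolin activity": ("serine", "endopeptidase"),    # EC 3.4.21.65
    "streptogrisin B activity": ("serine", "endopeptidase"),  # EC 3.4.21.81
    "metridin activity": ("serine", "endopeptidase"),         # EC 3.4.21.3
    "cathepsin B activity": ("cysteine", "endopeptidase"),    # EC 3.4.22.1
}


def crowded_cells(term_to_cell):
    """Return {cell: sorted term names} for cells holding more than one term."""
    by_cell = defaultdict(list)
    for term, cell in term_to_cell.items():
        by_cell[cell].append(term)
    return {cell: sorted(terms) for cell, terms in by_cell.items() if len(terms) > 1}


for cell, terms in crowded_cells(TERM_TO_CELL).items():
    print(cell, terms)
```

Each crowded cell is then a candidate for the review described above: would anything within the scope of GO be lost if its specific terms were merged into the matrix term?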

We'll invite the larger GO group to comment at this point. The MEROPS database curators may also be able to help; MEROPS classifies proteases by family and clan, which usually correlate closely with mechanism.

In the long term, we envisage many of the existing EC-derived specific GO function terms being retired (either by obsoletion or merging with the relevant ancestor from the matrix) in favor of Protein Ontology (PRO) terms. Information would thus not be lost, but transferred to a more appropriate ontology.

For D-amino acid peptides, we decided to make a top-level distinction, i.e. two child terms directly below peptidase activity:

 peptidase activity GO:0008233
 -- [i] peptidase activity, acting on D-amino acid peptides GO:new
 -- [i] peptidase activity, acting on L-amino acid peptides GO:new
 ---- [i] [child terms corresponding to matrix above]

Although, in theory, we could replicate the matrix for D-amino acid peptidases, at present we don't think there's a pressing need. It can always be done (fully or partially) later if the need arises.
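
The proposed top of the hierarchy can be represented as a child-to-parent (is_a) map and walked upward. This is a sketch under assumptions: the new terms have no IDs yet, so names are used as keys, and placing endopeptidase and exopeptidase activity directly under the L-amino acid term is a guess at how the matrix terms would attach.

```python
# Child -> parent (is_a) links for the proposed organization.
# Assumption: matrix terms hang under the L-amino acid term as shown here.
IS_A = {
    "peptidase activity, acting on D-amino acid peptides": "peptidase activity",
    "peptidase activity, acting on L-amino acid peptides": "peptidase activity",
    "endopeptidase activity": "peptidase activity, acting on L-amino acid peptides",
    "exopeptidase activity": "peptidase activity, acting on L-amino acid peptides",
    "aminopeptidase activity": "exopeptidase activity",
    "carboxypeptidase activity": "exopeptidase activity",
}


def ancestors(term):
    """Return the chain of is_a parents from the given term up to the root."""
    path = []
    while term in IS_A:
        term = IS_A[term]
        path.append(term)
    return path


print(ancestors("aminopeptidase activity"))
```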

To Do list

  1. Look up GO terms corresponding to matrix entries; add terms for any missing cells (put a GO OBO file in scratch directory). Also make sure GO relationships correctly reflect matrix organization. (Midori)
    1. Also fill in GO IDs in matrix table above. (Midori)
  2. Contact MEROPS and PRO; see if anyone can help with #3. (Midori)
  3. Meet again to look at the remaining descendants of peptidase activity and determine where they fit into the matrix. (all)

Note added after meeting: Darren Natale (dan5 at georgetown dot edu) is the contact for PRO.