Proteases

From GO Wiki
Revision as of 13:00, 24 June 2008 by Midori (talk | contribs)

Background

This item grew out of work on adding terms to the function ontology for enzyme activities, based on EC entries that don't have corresponding GO terms. EC classifies enzymes mainly on the basis of reaction mechanism, which fits nicely with GO's usual criteria for including function terms, and allows us to capture most EC entries and the EC hierarchical organization. (Also see SF 1226219)

For peptidase and protease activities, however, all bets are off. EC includes not only various peptide-bond-cleaving mechanisms, it also has a rather large number of entries whose names and reactions make them seem much more like gene products than descriptions of distinct activities. Two examples that are very gene-producty:

[Term]
name: thermomycolin activity
namespace: molecular_function
def: "Catalysis of the reaction: Rather nonspecific hydrolysis of proteins. Preferential cleavage:
 Ala-|-, Tyr-|-, Phe-|- in small molecule substrates." [EC:3.4.21.65]

[Term]
name: streptogrisin B activity
namespace: molecular_function
def: "Catalysis of the reaction: Hydrolysis of proteins with trypsin-like specificity."
[EC:3.4.21.81]

Others are in a grey area, e.g.

[Term]
name: PepB aminopeptidase activity
namespace: molecular_function
def: "Catalysis of the reaction: Release of an N-terminal amino acid, Xaa, from a peptide or
arylamide. Xaa is preferably Glu or Asp but may be other amino acids, including Leu, Met, His,
Cys and Gln." [EC:3.4.11.23]
synonym: "PepB aminopeptidase activity" EXACT [EC:3.4.11.23]

[Term]
name: metridin activity
namespace: molecular_function
def: "Catalysis of the reaction: Preferential cleavage: Tyr-|-, Phe-|-, Leu-|-; little action on
Trp-|-." [EC:3.4.21.3]

Question: should GO continue to add new function terms corresponding to the gene-product-ish EC entries? GO has already included quite a few, e.g.

[Term]
id: GO:0004213
name: cathepsin B activity
namespace: molecular_function
def: "Catalysis of the hydrolysis of peptide bonds with a broad specificity. Preferentially
cleaves the terminal bond of -Arg-Arg-Xaa motifs in small molecule substrates (thus differing
from cathepsin L). In addition to being an endopeptidase, shows peptidyl-dipeptidase activity,
liberating C-terminal dipeptides." [EC:3.4.22.1]
synonym: "cathepsin B1 activity" EXACT [EC:3.4.22.1]
synonym: "cathepsin II" RELATED [EC:3.4.22.1]
xref: EC:3.4.22.1
xref: MetaCyc:3.4.22.1-RXN
is_a: GO:0004197 ! cysteine-type endopeptidase activity

(and the other cathepsin terms)
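For anyone who wants to survey such terms in bulk, a stanza like the one above can be read with a few lines of Python. This is only a minimal sketch, not an official OBO parser: the function name `parse_stanza` is made up for illustration, the `def:` line is left out of the sample for brevity, and real OBO files have escaping and trailing-modifier rules that this ignores.

```python
# Sample stanza copied from the cathepsin B term above (def: line omitted).
STANZA = '''[Term]
id: GO:0004213
name: cathepsin B activity
namespace: molecular_function
synonym: "cathepsin B1 activity" EXACT [EC:3.4.22.1]
synonym: "cathepsin II" RELATED [EC:3.4.22.1]
xref: EC:3.4.22.1
xref: MetaCyc:3.4.22.1-RXN
is_a: GO:0004197 ! cysteine-type endopeptidase activity
'''


def parse_stanza(text):
    """Collect the 'tag: value' lines of one stanza into a dict of lists."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the '[Term]' header.
        if not line or line.startswith("["):
            continue
        tag, _, value = line.partition(": ")
        # Drop the trailing '! comment' found on relationship lines.
        value = value.split(" ! ")[0].strip()
        fields.setdefault(tag, []).append(value)
    return fields


term = parse_stanza(STANZA)
ec_xrefs = [x for x in term["xref"] if x.startswith("EC:")]
print(term["name"][0], ec_xrefs, term["is_a"])
```

Keeping every tag as a list makes repeated tags (synonym, xref) and single-valued ones (id, name) uniform to handle.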

Proposal

From Ben Hitz:

Comments

Proteases Produce Pernicious Problems Periodically.

The reason for this is just historical: they are such an early and important class of proteins that their nomenclature is FUBAR.

I would not add the "gene_product" like activities from EC. Unfortunately this means either clean up EC or clean up GO.

I just spent a few minutes at this page:

Contents EC 3.4 to EC 3.12

I can't see that cleaning that mess up would be fun. Basically you can classify proteases by catalytic mechanism (serine, cysteine, metallo-, aspartyl-) or substrate (X-Y) where X and Y are different amino acids. Other distinguishing characteristics: endopeptidase vs. exopeptidase, D- vs. L- amino acids.

Furthermore, there is evolutionary classification which completely overlaps these boundaries (the catalytic triad of Ser-Asp-His is _the_ classic example of convergent evolution to a common enzyme mechanism).

Perusing the GO DAG, I would say that it would be best off dumping 90% of the substrate-specific terms. It may be worthwhile distinguishing between proteases (act on "protein") and peptidases (act only on short peptides) or endo/exo peptidases, but no further. I would also include things like EC 3.4.13.20 and EC 3.4.13.22 where the enzyme acts on "atypical" peptides.

I would probably go ahead and distinguish based on catalytic mechanism, if just to reduce the number.

Proposed organization

Here is a rough cut: -> (reverse is_a) == (reverse part_of)

catalytic activity -> hydrolase activity ->  peptidase activity
peptidase activity -> "regular" (i.e, L,L alpha-alpha peptide bond found in proteins) peptidase
activity (most of 3.4.-)
	-> D-D peptidase activity -> D-Ala-D-Ala peptidase activity (3.4.13.22, 3.4.11.19)
	-> Beta (L,L) peptidase activity (3.4.13.20; 3.4.19.5)
	-> Gamma (L, L) peptidase -> (3.4.19.9) tricky because I don't want to say Gamma-glutamyl ..
	-> Gamma (D, L) peptidase -> (3.4.19.11)

"regular" peptidase activity -> endopeptidase activity
	-> exopeptidase activity -> aminopeptidase activity
	-> carboxypeptidase activity

and/or
"regular" peptidase activity -> serine peptidase activity
	-> cysteine peptidase activity
	-> aspartyl peptidase activity
	-> threonine peptidase activity (see, e.g., 3.4.25.-)
	-> metallopeptidase activity

BUT note: EC 3.4.17.14 zinc D-Ala-D-Ala carboxypeptidase
and EC 3.4.16.4 serine-type D-Ala-D-Ala carboxypeptidase

So you could need many "cross products", but I suppose we could only add them as needed (i.e, don't need threonine exopeptidase until someone discovers one).

One area I didn't cover is the ATP-dependent proteases (Lon, ClpP, La) - GO:0004176. The Lon family I think are all serine proteases, but it's certainly not guaranteed. Not sure it's worth splitting even up higher into "ATP-dependent" and "ATP-independent"!

Hope this helps.

Ben

Very similar proposal

From: Colin Batchelor (RSC)

I pretty much second everything Ben has to say, especially about the D-amino acids.

From a text-mining point of view I don't want to see any single words ending in -ase disappearing altogether from the part of the ontology we scoop up (names, EXACT synonyms and potentially NARROW synonyms if I can be sure that there's no duplication).

So that means I want to keep "exopeptidase", "endopeptidase", "metallopeptidase", "metalloendopeptidase" and so forth. That would ideally mean a new tree, something like "molecular function attribute" with "metal-catalysed" (or even a has_catalyst relation pointing to ChEBI) but I'm happy to wait for the revolution for that one.

Likewise keep the substrate-based terms like "cyanophycinase" (though I don't see that cleavage of cyanophycin is necessarily a serine-type peptidase activity), "elastase" and "fibrolase".

On the other hand, for example, procollagen N-endopeptidase activity (GO:0017074) is, at least according to the definition, not intrinsically a metalloendopeptidase activity; that's a statement about the gene products that realize that activity.

So I'm not convinced that metalloexopeptidase, metalloendopeptidase, serine-type peptidase and so on should have any children. Does that sound fair?

I can't see the case for keeping the cathepsin terms in GO because I can't see how you would write genus--differentia definitions for them. I'd like to see the bare gene product names remain in GO as RELATED synonyms for their parents, though. Astacin activity (GO:0008533) could go, but bontoxilysin activity (GO:0033264) looks substrate-based so can stay.

A rule-of-thumb that feels right is that if something ends in -in and is qualified by a letter or a number at the end (cathepsin B activity, stromelysin 1 activity for example), it is probably a gene product name rather than a distinct activity.

I certainly can't see the case for adding, for example, thermomycolin activity and metridin activity.

best wishes, Colin.
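Colin's rule of thumb (a name ending in -in, followed by a letter or a number) is easy to try out on a list of term names. The sketch below reads the rule as a flag for gene-product-like names, which is an interpretation of the email rather than an agreed GO policy; the pattern and the function name `looks_like_gene_product` are assumptions made for illustration.

```python
import re

# Hypothetical pattern: a word ending in "in", then either a capital letter
# (optionally followed by one digit) or a number, then "activity".
GENE_PRODUCT_LIKE = re.compile(r"^\w+in (?:[A-Z]\d?|\d+) activity$")


def looks_like_gene_product(name):
    """Return True if the term name fits the '-in plus letter/number' rule."""
    return GENE_PRODUCT_LIKE.match(name) is not None


for name in [
    "cathepsin B activity",
    "stromelysin 1 activity",
    "streptogrisin B activity",
    "thermomycolin activity",
    "metallopeptidase activity",
]:
    print(name, looks_like_gene_product(name))
```

The rule as stated does not flag thermomycolin activity or metridin activity, which have no trailing letter or number, so it would only be a first-pass filter and not a substitute for looking at the definitions.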

Comment

From: Peter D'Eustachio (Reactome)

EC recognizes that there is a mess at the bottom of the hierarchy, but has an organization essentially identical to the one proposed here two levels up (so the issue here is mostly one of granularity rather than of enzyme classification):

clip taken from http://www.chem.qmul.ac.uk/iubmb/enzyme/EC3/intro.html#EC34

"The nomenclature of the peptidases is troublesome. Their specificity is commonly difficult to define, depending upon the nature of several amino acid residues around the peptide bond to be hydrolysed and also on the conformation of the substrate polypeptide chain. A classification involving the additional criterion of catalytic mechanism is therefore used.

"Two sets of sub-subclasses of peptidases are recognised, those of the exopeptidases (EC 3.4.11-19) and those of the endopeptidases (EC 3.4.21-25 and EC 3.4.99). The exopeptidases act only near the ends of polypeptide chains, and those acting at a free N-terminus liberate a single amino-acid residue (aminopeptidases, EC 3.4.11), or a dipeptide or a tripeptide (dipeptidyl-peptidases and tripeptidyl-peptidases, EC 3.4.14). The exopeptidases acting at a free C-terminus liberate a single residue (carboxypeptidases, EC 3.4.16-18) or a dipeptide (peptidyl-dipeptidases, EC 3.4.15). The carboxypeptidases are allocated to three groups on the basis of catalytic mechanism: the serine-type carboxypeptidases (EC 3.4.16), the metallocarboxypeptidases (EC 3.4.17) and the cysteine-type carboxypeptidases (EC 3.4.18). Other exopeptidases are specific for dipeptides (dipeptidases, EC 3.4.13), or remove terminal residues that are substituted, cyclized or linked by isopeptide bonds (peptide linkages other than those of α-carboxyl to α-amino groups) (omega peptidases, EC 3.4.19).

"The endopeptidases are divided into sub-subclasses on the basis of catalytic mechanism, and specificity is used only to identify individual enzymes within the groups. These are the sub-subclasses of serine endopeptidases (EC 3.4.21), cysteine endopeptidases (EC 3.4.22), aspartic endopeptidases (EC 3.4.23), metalloendopeptidases (EC 3.4.24) and threonine endopeptidases (EC 3.4.25). Endopeptidases that could not be assigned to any of the sub-subclasses EC 3.4.21-25 were listed in sub-subclass EC 3.4.99."

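The EC layout quoted above amounts to a lookup on the third field of the EC number. The following is a rough sketch of that lookup, using only the sub-subclasses named in the quoted text; the table name `EC_34_SUBSUBCLASSES` and the function `classify_ec` are invented for this example and are not part of any EC or GO tool.

```python
# Sub-subclasses of EC 3.4 as named in the quoted IUBMB introduction.
EC_34_SUBSUBCLASSES = {
    11: "aminopeptidases",
    13: "dipeptidases",
    14: "dipeptidyl-peptidases and tripeptidyl-peptidases",
    15: "peptidyl-dipeptidases",
    16: "serine-type carboxypeptidases",
    17: "metallocarboxypeptidases",
    18: "cysteine-type carboxypeptidases",
    19: "omega peptidases",
    21: "serine endopeptidases",
    22: "cysteine endopeptidases",
    23: "aspartic endopeptidases",
    24: "metalloendopeptidases",
    25: "threonine endopeptidases",
    99: "endopeptidases not assigned to EC 3.4.21-25",
}


def classify_ec(ec_number):
    """Return (exo/endo, sub-subclass name) for an EC 3.4.x.y number, else None."""
    if ec_number.startswith("EC:"):
        ec_number = ec_number[3:]
    parts = ec_number.split(".")
    if len(parts) < 3 or parts[:2] != ["3", "4"] or not parts[2].isdigit():
        return None
    sub = int(parts[2])
    if sub not in EC_34_SUBSUBCLASSES:
        return None
    # EC 3.4.11-19 are exopeptidases; the remaining listed groups are endopeptidases.
    kind = "exopeptidase" if 11 <= sub <= 19 else "endopeptidase"
    return kind, EC_34_SUBSUBCLASSES[sub]


print(classify_ec("EC:3.4.21.65"))  # thermomycolin
print(classify_ec("EC:3.4.11.23"))  # PepB aminopeptidase
print(classify_ec("EC:3.4.22.1"))   # cathepsin B
```

Anything outside EC 3.4, or with a third field not mentioned in the quote, comes back as None rather than being guessed at.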

Notes from conference call June 11, 2008

Midori, Colin, Peter, Ben

Agreed to stick with the two-dimensional organization discussed at the April GOC meeting, where one dimension is substrate and the other is mechanistic (see table below).

Substrate specificity:

Peter: endo- vs. exopeptidase is a distinction worth retaining, because whether the enzyme "seeks" an end is a meaningful mechanistic feature.

Exopeptidases can be further divided into aminopeptidases (which cleave N-terminal residues) and carboxypeptidases (C-terminal residues).

Ben: advises against distinguishing substrates based on cleavage sequence preferences.

Peter: notes that different proteases vary a lot in sequence specificity; e.g. contrast trypsins and chymotrypsins with blood coagulation cascade proteases.

The working group ended up preferring not to subdivide based on other aspects of substrate specificity such as sequence, reaction conditions, etc. The explosion would be huge, and not particularly useful; compare protein kinases and restriction endonucleases.

Classification matrix

The working group agreed that the first step is to implement the two-dimensional organization; the mechanism axis now has six entries (we forgot threonine and glutamic peptidases at the SLC meeting). The resulting "matrix" is shown in the table:

(exopeptidase activity GO:0008238 spans the aminopeptidase and carboxypeptidase columns)

mechanism | endopeptidase activity GO:0004175 | aminopeptidase activity GO:0004177 | carboxypeptidase activity GO:0004180
aspartic peptidase activity GO:0070001 | aspartic endopeptidase activity GO:0004190 | aspartic aminopeptidase activity (not needed at present) | aspartic carboxypeptidase activity (not needed at present)
cysteine peptidase activity GO:0008234 | cysteine endopeptidase activity GO:0004197 | cysteine aminopeptidase activity GO:0070005 | cysteine carboxypeptidase activity GO:0016807
glutamic peptidase activity GO:0070002 | glutamic endopeptidase activity GO:0070007 | glutamic aminopeptidase activity (not needed at present) | glutamic carboxypeptidase activity (not needed at present)
metallopeptidase activity GO:0008237 | metalloendopeptidase activity GO:0004222 | metalloaminopeptidase activity GO:0070006 | metallocarboxypeptidase activity GO:0004181
serine peptidase activity GO:0008236 | serine endopeptidase activity GO:0004252 | serine aminopeptidase activity GO:0070009 | serine carboxypeptidase activity GO:0004185
threonine peptidase activity GO:0070003 | threonine endopeptidase activity GO:0004298 | threonine aminopeptidase activity (not needed at present) | threonine carboxypeptidase activity (not needed at present)

Note (2008-06-13): can also include generic exopeptidase terms for each mechanism; e.g. we have metalloexopeptidase activity GO:0008235.

This is basically consistent with EC, despite the fact that EC (of necessity) uses a one-dimensional classification. As far as we know, each cell is biologically/biochemically plausible, so we'll add (or keep) GO terms for each.

The more specific terms should then fit somewhere in the matrix. We plan to determine which terms match which matrix cell, and for cells occupied by more than one term, figure out whether information that fits into the scope of GO would be lost if we didn't have the more specific terms.
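The bookkeeping described here (which cell does a term fall into, and which cells hold more than one term) can be sketched with a small dictionary keyed by (mechanism, cleavage position). The GO IDs are the ones given for the matrix cells; the names `MATRIX`, `empty_cells`, `crowded_cells` and the three example assignments are illustrative assumptions, with the assignments inferred from each term's EC sub-subclass or is_a parent as shown elsewhere on this page.

```python
from collections import defaultdict

MECHANISMS = ["aspartic", "cysteine", "glutamic", "metallo", "serine", "threonine"]
POSITIONS = ["endopeptidase", "aminopeptidase", "carboxypeptidase"]

# Matrix cells that currently have a GO term; absent cells are "not needed at present".
MATRIX = {
    ("aspartic", "endopeptidase"): "GO:0004190",
    ("cysteine", "endopeptidase"): "GO:0004197",
    ("cysteine", "aminopeptidase"): "GO:0070005",
    ("cysteine", "carboxypeptidase"): "GO:0016807",
    ("glutamic", "endopeptidase"): "GO:0070007",
    ("metallo", "endopeptidase"): "GO:0004222",
    ("metallo", "aminopeptidase"): "GO:0070006",
    ("metallo", "carboxypeptidase"): "GO:0004181",
    ("serine", "endopeptidase"): "GO:0004252",
    ("serine", "aminopeptidase"): "GO:0070009",
    ("serine", "carboxypeptidase"): "GO:0004185",
    ("threonine", "endopeptidase"): "GO:0004298",
}


def empty_cells():
    """List the (mechanism, position) cells that have no GO term yet."""
    return [(m, p) for m in MECHANISMS for p in POSITIONS if (m, p) not in MATRIX]


def crowded_cells(assignments):
    """Group specific terms by cell and keep the cells holding more than one term."""
    by_cell = defaultdict(list)
    for name, mechanism, position in assignments:
        by_cell[(mechanism, position)].append(name)
    return {cell: names for cell, names in by_cell.items() if len(names) > 1}


# Example assignments (assumed for illustration).
EXAMPLE = [
    ("cathepsin B activity", "cysteine", "endopeptidase"),
    ("thermomycolin activity", "serine", "endopeptidase"),
    ("streptogrisin B activity", "serine", "endopeptidase"),
]

print(empty_cells())
print(crowded_cells(EXAMPLE))
```

The crowded cells are the ones where someone has to decide whether the specific terms carry information that belongs in GO at all.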

We'll invite the larger GO group to comment at this point. The MEROPS database curators may also be able to help; MEROPS classifies proteases by family and clan, which usually correlate closely with mechanism.

In the long term, we envisage many of the existing EC-derived specific GO function terms being retired (either by obsoletion or merging with the relevant ancestor from the matrix) in favor of Protein Ontology (PRO) terms. Information would thus not be lost, but transferred to a more appropriate ontology.

For D-amino acid peptides, we decided to make a top-level distinction, i.e. two child terms directly below peptidase activity:

 peptidase activity GO:0008233
 -- [i] peptidase activity, acting on D-amino acid peptides GO:new
 -- [i] peptidase activity, acting on L-amino acid peptides GO:new
 ---- [i] [child terms corresponding to matrix above]

Although, in theory, we could replicate the matrix for D-amino acid peptidases, at present we don't think there's a pressing need. It can always be done (fully or partially) later if the need arises.
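To make the intended shape concrete, here is a small sketch of the top of the hierarchy as child-to-parent is_a links, with a helper that walks up to the root. It assumes that the matrix terms end up under the L-amino acid term as outlined above, uses the IDs later assigned to the two new terms (GO:0070010 and GO:0070011), and simplifies to one parent per term, which real GO terms are not limited to. The names `IS_A` and `ancestors` are made up for the example.

```python
# child -> parent (is_a); single-parent simplification for illustration only.
IS_A = {
    "GO:0070010": "GO:0008233",  # peptidase activity, acting on D-amino acid peptides
    "GO:0070011": "GO:0008233",  # peptidase activity, acting on L-amino acid peptides
    "GO:0004175": "GO:0070011",  # endopeptidase activity
    "GO:0008238": "GO:0070011",  # exopeptidase activity
    "GO:0004177": "GO:0008238",  # aminopeptidase activity
    "GO:0004180": "GO:0008238",  # carboxypeptidase activity
}


def ancestors(term_id):
    """Return the chain of is_a parents from term_id up to the root."""
    chain = []
    while term_id in IS_A:
        term_id = IS_A[term_id]
        chain.append(term_id)
    return chain


print(ancestors("GO:0004177"))
print(ancestors("GO:0070010"))
```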

To Do list

  1. Look up GO terms corresponding to matrix entries; add terms for any missing cells (put a GO OBO file in scratch directory). Also make sure GO relationships correctly reflect matrix organization. (Midori)
    1. Also fill in GO IDs in matrix table above. (Midori)
  2. Contact MEROPS and PRO; see if anyone can help with #3. (Midori)
  3. Meet again to look at the remaining descendants of peptidase activity and determine where they fit into the matrix. (all)

Note added after meeting: Darren Natale (dan5 at georgetown dot edu) is the contact for PRO.

Progress report June 16, 2008

  • Created file protease.obo in go/scratch/ directory (web access via http://www.geneontology.org/scratch/protease.obo)
    • logistical note: started with revision 1.108 of gene_ontology_write.obo
  • New terms added for matrix:
    • aspartic-type peptidase activity GO:0070001
    • glutamic-type peptidase activity GO:0070002
    • threonine-type peptidase activity GO:0070003
    • cysteine-type exopeptidase activity GO:0070004
    • cysteine-type aminopeptidase activity GO:0070005
    • metalloaminopeptidase activity GO:0070006
    • glutamic-type endopeptidase activity GO:0070007
    • serine-type exopeptidase activity GO:0070008
    • serine-type aminopeptidase activity GO:0070009
    • peptidase activity, acting on D-amino acid peptides GO:0070010
    • peptidase activity, acting on L-amino acid peptides GO:0070011
  • Emailed working group about other matrix terms: MEROPS reports the existence of endopeptidases, but not exopeptidases, in its families of aspartic-, glutamic-, and threonine-type peptidases, so I've asked whether we're sure we want to add:
    • aspartic-type exopeptidase activity
    • aspartic-type aminopeptidase activity
    • aspartic-type carboxypeptidase activity
    • glutamic-type exopeptidase activity
    • glutamic-type aminopeptidase activity
    • glutamic-type carboxypeptidase activity
    • threonine-type exopeptidase activity
    • threonine-type aminopeptidase activity
    • threonine-type carboxypeptidase activity
  • Rephrased some definitions to improve consistency
  • Noticed that arginine/lysine endopeptidase activity (GO:0010320) should be made obsolete; emailed GO list accordingly

Update June 18, 2008

Agreed (by email) not to add the terms in the second list above until/unless proteases falling into the categories are discovered.

Notes from conference call June 24, 2008

Midori, Colin, Peter

The matrix is shaping up, and Peter has started fitting existing terms into the matrix cells. Some can't be assigned to the most specific cells, but will have to go directly under one of the more general terms; that's no problem from GO's point of view.

We need a few more classifiers, e.g. for dipeptidases, omega peptidases, etc. Some terms exist already, so names and definitions can be examined, and edited if necessary. More generally, we'll probably need terms for peptidases that act on various non-linear peptides (e.g. branched peptide chains). See if MEROPS can help with those.

The other major task is to examine the definitions of existing terms, and put them into genus-differentiae style.

Midori will meet with MEROPS people next week, show them our proposed structure, and get their well-informed feedback.

Ideal scenario: MEROPS can devote the time and effort to provide differentiae for us to use in term definitions.


May have a quick follow-up call on Thursday (26th) just to bring Ben and Alex Diehl up to date.