

This item grew out of work on adding terms to the function ontology for enzyme activities, based on EC entries that don't have corresponding GO terms. EC classifies enzymes mainly on the basis of reaction mechanism, which fits nicely with GO's usual criteria for including function terms, and allows us to capture most EC entries and the EC hierarchical organization. (Also see SF 1226219)

For peptidase and protease activities, however, all bets are off. EC not only includes various peptide-bond-cleaving mechanisms; it also has a rather large number of entries whose names and reactions make them seem much more like gene products than descriptions of distinct activities. Two examples that are very gene-producty:

name: thermomycolin activity
namespace: molecular_function
def: "Catalysis of the reaction: Rather nonspecific hydrolysis of proteins. Preferential cleavage:
 Ala-|-, Tyr-|-, Phe-|- in small molecule substrates." [EC:]

name: streptogrisin B activity
namespace: molecular_function
def: "Catalysis of the reaction: Hydrolysis of proteins with trypsin-like specificity."

Others are in a grey area, e.g.

name: PepB aminopeptidase activity
namespace: molecular_function
def: "Catalysis of the reaction: Release of an N-terminal amino acid, Xaa, from a peptide or
arylamide. Xaa is preferably Glu or Asp but may be other amino acids, including Leu, Met, His,
Cys and Gln." [EC:]
synonym: "PepB aminopeptidase activity" EXACT [EC:]

name: metridin activity
namespace: molecular_function
def: "Catalysis of the reaction: Preferential cleavage: Tyr-|-, Phe-|-, Leu-|-; little action on
Trp-|-." [EC:]

Question: should GO continue to add new function terms corresponding to the gene-product-ish EC entries? GO has already included quite a few, e.g.

id: GO:0004213
name: cathepsin B activity
namespace: molecular_function
def: "Catalysis of the hydrolysis of peptide bonds with a broad specificity. Preferentially
cleaves the terminal bond of -Arg-Arg-Xaa motifs in small molecule substrates (thus differing
from cathepsin L). In addition to being an endopeptidase, shows peptidyl-dipeptidase activity,
liberating C-terminal dipeptides." [EC:]
synonym: "cathepsin B1 activity" EXACT [EC:]
synonym: "cathepsin II" RELATED [EC:]
xref: EC:
xref: MetaCyc:
is_a: GO:0004197 ! cysteine-type endopeptidase activity

(and the other cathepsin terms)


From Ben Hitz:


Proteases Produce Pernicious Problems Periodically.

The reason for this is just historical: they are such an early and important class of proteins that their nomenclature is FUBAR.

I would not add the "gene_product" like activities from EC. Unfortunately this means either clean up EC or clean up GO.

I just spent a few minutes at this page:

Contents EC 3.4 to EC 3.12

I can't see that cleaning that mess up would be fun. Basically you can classify proteases by catalytic mechanism (serine, cysteine, metallo-, aspartyl-) or substrate (X-Y) where X and Y are different amino acids. Other distinguishing characteristics: endopeptidase vs. exopeptidase, D- vs. L- amino acids.

Furthermore, there is evolutionary classification which completely overlaps these boundaries (the catalytic triad of Ser-Asp-His is _the_ classic example of convergent evolution to a common enzyme mechanism).

Perusing the go dag, I would say that it would be best off dumping 90% of the substrate specific terms. It may be worthwhile distinguishing between proteases (act on "protein") and peptidases (act only on short peptides) or endo/exo peptidases, but no further. I would also include things like EC and EC where the enzyme acts on "atypical" peptides.

I would probably go ahead and distinguish based on catalytic mechanism, if just to reduce the number.

Proposed organization

Here is a rough cut: -> (reverse is_a) == (reverse part_of)

catalytic activity -> hydrolase activity ->  peptidase activity
peptidase activity -> "regular" (i.e., L,L alpha-alpha peptide bond found in proteins) peptidase
activity (most of 3.4.-)
	-> D-D peptidase activity -> D-Ala-D-Ala peptidase activity (,
	-> Beta (L,L) peptidase activity (;
	-> Gamma (L, L) peptidase -> ( tricky because I don't want to say Gamma-glutamyl ..
	-> Gamma (D, L) peptidase -> (

"regular" peptidase activity -> endopeptidase activity
	-> exopeptidase activity -> aminopeptidase activity
	-> carboxypeptidase activity

"regular" peptidase activity -> serine peptidase activity
	-> cysteine peptidase activity
	-> aspartyl peptidase activity
	-> threonine peptidase activity (see, e.g., 3.4.25.-)
	-> metallopeptidase activity

BUT note: EC zinc D-Ala-D-Ala carboxypeptidase
and EC serine-type D-Ala-D-Ala carboxypeptidase

So you could need many "cross products", but I suppose we could add them only as needed (i.e., we don't need threonine exopeptidase until someone discovers one).
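A minimal sketch of what a "cross product" means structurally, assuming the rough cut is stored as child-to-parent is_a edges. The dictionary, the function name, and the exact term names are illustrative only, not the GO editors' tooling; the point is that a cross-product term has one parent on each axis and so inherits from both.

```python
# Hypothetical is_a edges (child -> list of parents), loosely following the rough cut.
IS_A = {
    "hydrolase activity": ["catalytic activity"],
    "peptidase activity": ["hydrolase activity"],
    "regular peptidase activity": ["peptidase activity"],
    "exopeptidase activity": ["regular peptidase activity"],
    "carboxypeptidase activity": ["exopeptidase activity"],
    "serine peptidase activity": ["regular peptidase activity"],
    # A "cross product" term: one parent per classification axis.
    "serine carboxypeptidase activity": [
        "serine peptidase activity",
        "carboxypeptidase activity",
    ],
}


def ancestors(term, is_a=IS_A):
    """Return every term reachable from `term` by following is_a edges upward."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in is_a.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


if __name__ == "__main__":
    for name in sorted(ancestors("serine carboxypeptidase activity")):
        print(name)
```

Adding a cross product only when it is needed then amounts to adding one entry with two parents; nothing else in the graph has to change.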

One area I didn't cover is the ATP-dependent proteases (Lon, ClpP, La) - GO:0004176. I think the Lon family are all serine proteases, but it's certainly not guaranteed. Not sure it's worth splitting even higher up into "ATP-dependent" and "ATP-independent"!

Hope this helps.


Very similar proposal

From: Colin Batchelor (RSC)

I pretty much second everything Ben has to say, especially about the D-amino acids.

From a text-mining point of view I don't want to see any single words ending in -ase disappearing altogether from the part of the ontology we scoop up (names, EXACT synonyms and potentially NARROW synonyms if I can be sure that there's no duplication).

So that means I want to keep "exopeptidase", "endopeptidase", "metallopeptidase", "metalloendopeptidase" and so forth. That would ideally mean a new tree, something like "molecular function attribute" with "metal-catalysed" (or even a has_catalyst relation pointing to ChEBI) but I'm happy to wait for the revolution for that one.

Likewise keep the substrate-based terms like "cyanophycinase" (though I don't see that cleavage of cyanophycin is necessarily a serine-type peptidase activity), "elastase" and "fibrolase".

On the other hand, for example, procollagen N-endopeptidase activity (GO:0017074) is, at least according to the definition, not intrinsically a metalloendopeptidase activity; that's a statement about the gene products that realize that activity.

So I'm not convinced that metalloexopeptidase, metalloendopeptidase, serine-type peptidase and so on should have any children. Does that sound fair?

I can't see the case for keeping the cathepsin terms in GO because I can't see how you would write genus--differentia definitions for them. I'd like to see the bare gene product names remain in GO as RELATED synonyms for their parents, though. Astacin activity (GO:0008533) could go, but bontoxilysin activity (GO:0033264) looks substrate-based so can stay.

A rule-of-thumb that feels right is that if something ends in -in and is qualified by a letter or a number at the end (cathepsin B activity, stromelysin 1 activity for example), it is probably a gene product name rather than an activity.
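The rule of thumb can be sketched as a simple pattern check over term names. The regular expression below is an assumption about how to read "ends in -in and is qualified by a letter or a number"; it is only a screening aid for spotting candidates, not a decision procedure.

```python
import re

# Assumed reading of the rule: a word ending in "in", followed by a single
# capital letter or a number, followed by "activity".
GENE_PRODUCT_LIKE = re.compile(r"\b\w+in (?:[A-Z]|\d+) activity$")


def looks_gene_product_like(term_name):
    """Return True if the term name fits the '-in plus letter/number' pattern."""
    return GENE_PRODUCT_LIKE.search(term_name) is not None


if __name__ == "__main__":
    for name in [
        "cathepsin B activity",
        "stromelysin 1 activity",
        "metalloendopeptidase activity",
        "metridin activity",
    ]:
        print(name, looks_gene_product_like(name))
```

Names with no trailing qualifier, such as thermomycolin activity and metridin activity, are not caught by this pattern and would still need to be judged by hand.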

I certainly can't see the case for adding, for example, thermomycolin activity and metridin activity.

best wishes, Colin.


From: Peter D'Eustachio (Reactome)

EC recognizes that there is a mess at the bottom of the hierarchy, but has an organization essentially identical to the one proposed here two levels up (so the issue here is mostly one of granularity rather than of enzyme classification):

clip taken from

"The nomenclature of the peptidases is troublesome. Their specificity is commonly difficult to define, depending upon the nature of several amino acid residues around the peptide bond to be hydrolysed and also on the conformation of the substrate polypeptide chain. A classification involving the additional criterion of catalytic mechanism is therefore used.

"Two sets of sub-subclasses of peptidases are recognised, those of the exopeptidases (EC 3.4.11-19) and those of the endopeptidases (EC 3.4.21-24 and EC 3.4.99). The exopeptidases act only near the ends of polypeptide chains, and those acting at a free N-terminus liberate a single amino-acid residue (aminopeptidases, EC 3.4.11), or a dipeptide or a tripeptide (dipeptidyl-peptidases and tripeptidyl-peptidases, EC 3.4.14). The exopeptidases acting at a free C-terminus liberate a single residue (carboxypeptidases, EC 3.4.16-18) or a dipeptide (peptidyl-dipeptidases, EC 3.4.15). The carboxypeptidases are allocated to four groups on the basis of catalytic mechanism: the serine-type carboxypeptidases (EC 3.4.16), the metallocarboxypeptidases (EC 3.4.17) and the cysteine-type carboxypeptidases (EC 3.4.18). Other exopeptidases are specific for dipeptides (dipeptidases, EC 3.4.13), or remove terminal residues that are substituted, cyclized or linked by isopeptide bonds (peptide linkages other than those of α-carboxyl to α-amino groups) (omega peptidases, EC 3.4.19).

"The endopeptidases are divided into sub-subclasses on the basis of catalytic mechanism, and specificity is used only to identify individual enzymes within the groups. These are the sub-subclasses of serine endopeptidases (EC 3.4.21), cysteine endopeptidases (EC 3.4.22), aspartic endopeptidases (EC 3.4.23), metalloendopeptidases (EC 3.4.24) and threonine endopeptidases (EC 3.4.25). Endopeptidases that could not be assigned to any of the sub-subclasses EC 3.4.21-25 were listed in sub-subclass EC 3.4.99."

Notes from conference call June 11, 2008

Midori, Colin, Peter, Ben

Agreed to stick with the two-dimensional organization discussed at the April GOC meeting, where one dimension is substrate and the other is mechanistic (see table below).

Substrate specificity:

Peter: endo- vs. exopeptidase is a distinction worth retaining, because whether the enzyme "seeks" an end is a meaningful mechanistic feature.

Exopeptidases can be further divided into aminopeptidases (which cleave N-terminal residues) and carboxypeptidases (C-terminal residues).

Ben: advises against distinguishing substrates based on cleavage sequence preferences.

Peter: notes that different proteases vary a lot in sequence specificity; e.g. contrast trypsins and chymotrypsins with blood coagulation cascade proteases.

The working group ended up preferring not to subdivide based on other aspects of substrate specificity such as sequence, reaction conditions, etc. The explosion would be huge, and not particularly useful; compare protein kinases and restriction endonucleases.

Classification matrix

The working group agreed that the first step is to implement the two-dimensional organization; the mechanism axis now has six entries (we forgot threonine and glutamic peptidases at the SLC meeting). The resulting "matrix" is shown in the table:

| mechanism | endopeptidase activity GO:0004175 | exopeptidase activity GO:0008238: aminopeptidase activity GO:0004177 | exopeptidase activity GO:0008238: carboxypeptidase activity GO:0004180 |
|---|---|---|---|
| aspartic peptidase activity GO:0070001 | aspartic endopeptidase activity GO:0004190 | aspartic aminopeptidase activity (not needed at present) | aspartic carboxypeptidase activity (not needed at present) |
| cysteine peptidase activity GO:0008234 | cysteine endopeptidase activity GO:0004197 | cysteine aminopeptidase activity GO:0070005 | cysteine carboxypeptidase activity GO:0016807 |
| glutamic peptidase activity GO:0070002 | glutamic endopeptidase activity GO:0070007 | glutamic aminopeptidase activity (not needed at present) | glutamic carboxypeptidase activity (not needed at present) |
| metallopeptidase activity GO:0008237 | metalloendopeptidase activity GO:0004222 | metalloaminopeptidase activity GO:0070006 | metallocarboxypeptidase activity GO:0004181 |
| serine peptidase activity GO:0008236 | serine endopeptidase activity GO:0004252 | serine aminopeptidase activity GO:0070009 | serine carboxypeptidase activity GO:0004185 |
| threonine peptidase activity GO:0070003 | threonine endopeptidase activity GO:0004298 | threonine aminopeptidase activity (not needed at present) | threonine carboxypeptidase activity (not needed at present) |

Note (2008-06-13): can also include generic exopeptidase terms for each mechanism; e.g. we have metalloexopeptidase activity GO:0008235.

This is basically consistent with EC, despite the fact that EC (of necessity) uses a one-dimensional classification. As far as we know, each cell is biologically/biochemically plausible, so we'll add (or keep) GO terms for each.
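As a rough illustration, the matrix can be held as a nested lookup keyed by mechanism and cleavage position. The key names, the function, and the fallback to the two general terms for empty cells are assumptions made for this sketch; the GO IDs are the ones listed on this page, with the threonine-type peptidase ID taken from the June 16 progress report.

```python
# Rows are mechanisms, columns are cleavage positions.
# None marks cells listed as "not needed at present".
MATRIX = {
    "aspartic": {"endo": "GO:0004190", "amino": None, "carboxy": None},
    "cysteine": {"endo": "GO:0004197", "amino": "GO:0070005", "carboxy": "GO:0016807"},
    "glutamic": {"endo": "GO:0070007", "amino": None, "carboxy": None},
    "metallo": {"endo": "GO:0004222", "amino": "GO:0070006", "carboxy": "GO:0004181"},
    "serine": {"endo": "GO:0004252", "amino": "GO:0070009", "carboxy": "GO:0004185"},
    "threonine": {"endo": "GO:0004298", "amino": None, "carboxy": None},
}

# General terms along each axis.
MECHANISM_TERM = {
    "aspartic": "GO:0070001",
    "cysteine": "GO:0008234",
    "glutamic": "GO:0070002",
    "metallo": "GO:0008237",
    "serine": "GO:0008236",
    "threonine": "GO:0070003",  # ID from the June 16 progress report
}
POSITION_TERM = {
    "endo": "GO:0004175",
    "amino": "GO:0004177",
    "carboxy": "GO:0004180",
}


def terms_for(mechanism, position):
    """Return the cell's GO ID, or both general terms if the cell is empty.

    The fallback is an assumption of this sketch, not an agreed GO policy.
    """
    cell = MATRIX[mechanism][position]
    if cell is not None:
        return [cell]
    return [MECHANISM_TERM[mechanism], POSITION_TERM[position]]


if __name__ == "__main__":
    print(terms_for("serine", "carboxy"))
    print(terms_for("threonine", "amino"))
```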

The more specific terms should then fit somewhere in the matrix. We plan to determine which terms match which matrix cell, and for cells occupied by more than one term, figure out whether information that fits into the scope of GO would be lost if we didn't have the more specific terms.

We'll invite the larger GO group to comment at this point. The MEROPS database curators may also be able to help; MEROPS classifies proteases by family and clan, which usually correlate closely with mechanism.

In the long term, we envisage many of the existing EC-derived specific GO function terms being retired (either by obsoletion or merging with the relevant ancestor from the matrix) in favor of Protein Ontology (PRO) terms. Information would thus not be lost, but transferred to a more appropriate ontology.

For D-amino acid peptides, we decided to make a top-level distinction, i.e. two child terms directly below peptidase activity:

 peptidase activity GO:0008233
 -- [i] peptidase activity, acting on D-amino acid peptides GO:new
 -- [i] peptidase activity, acting on L-amino acid peptides GO:new
 ---- [i] [child terms corresponding to matrix above]

Although, in theory, we could replicate the matrix for D-amino acid peptidases, at present we don't think there's a pressing need. It can always be done (fully or partially) later if the need arises.

To Do list

  1. Look up GO terms corresponding to matrix entries; add terms for any missing cells (put a GO OBO file in scratch directory). Also make sure GO relationships correctly reflect matrix organization. (Midori)
    1. Also fill in GO IDs in matrix table above. (Midori)
  2. Contact MEROPS and PRO; see if anyone can help with #3. (Midori)
  3. Meet again to look at the remaining descendants of peptidase activity and determine where they fit into the matrix. (all)

Note added after meeting: Darren Natale (dan5 at georgetown dot edu) is the contact for PRO.

Progress report June 16, 2008

  • Created file protease.obo in go/scratch/ directory (web access via
    • logistical note: started with revision 1.108 of gene_ontology_write.obo
  • New terms added for matrix:
    • aspartic-type peptidase activity GO:0070001
    • glutamic-type peptidase activity GO:0070002
    • threonine-type peptidase activity GO:0070003
    • cysteine-type exopeptidase activity GO:0070004
    • cysteine-type aminopeptidase activity GO:0070005
    • metalloaminopeptidase activity GO:0070006
    • glutamic-type endopeptidase activity GO:0070007
    • serine-type exopeptidase activity GO:0070008
    • serine-type aminopeptidase activity GO:0070009
    • peptidase activity, acting on D-amino acid peptides GO:0070010
    • peptidase activity, acting on L-amino acid peptides GO:0070011
  • Emailed working group about other matrix terms: MEROPS reports the existence of endopeptidases, but not exopeptidases, in its families of aspartic-, glutamic-, and threonine-type peptidases, so I've asked whether we're sure we want to add:
    • aspartic-type exopeptidase activity
    • aspartic aminopeptidase activity
    • aspartic-type carboxypeptidase activity
    • glutamic-type exopeptidase activity
    • glutamic-type aminopeptidase activity
    • glutamic-type carboxypeptidase activity
    • threonine-type exopeptidase activity
    • threonine-type aminopeptidase activity
    • threonine-type carboxypeptidase activity
  • Rephrased some definitions to improve consistency
  • Noticed that arginine/lysine endopeptidase activity (GO:0010320) should be made obsolete; emailed GO list accordingly

Update June 18, 2008

Agreed (by email) not to add the terms in the second list above until/unless proteases falling into the categories are discovered.

Notes from conference call June 24, 2008

Midori, Colin, Peter

The matrix is shaping up, and Peter has started fitting existing terms into the matrix cells. Some can't be assigned to the most specific cells, but will have to go directly under one of the more general terms; that's no problem from GO's point of view.

We need a few more classifiers, e.g. for dipeptidases, omega peptidases, etc. Some terms exist already, so names and definitions can be examined, and edited if necessary. More generally, we'll probably need terms for peptidases that act on various non-linear peptides (e.g. branched peptide chains). See if MEROPS can help with those.

The other major task is to examine the definitions of existing terms, and put them into genus-differentiae style.

Midori will meet with MEROPS people next week, show them our proposed structure, and get their well-informed feedback.

Ideal scenario: MEROPS can devote the time and effort to provide differentiae for us to use in term definitions.

May have a quick follow-up call on Thursday (26th) just to bring Ben and Alex Diehl up to date.

Notes from meeting with MEROPS July 1, 2008

Midori, Alan Barrett, Neil Rawlings (latter two from MEROPS)

It all went very well, and Alan and Neil say the matrix makes sense (once M explained that GO does want to include the "compound" terms in the matrix, and why).

A clear conclusion emerged: if GO were to take the drastic step of retiring all of the gene-product-like terms in the peptidase branch, we would have unequivocal support from Alan and Neil. They agreed with our view that peptidases are analogous to restriction endonucleases, in that capturing cleavage sequence specificity isn't useful for GO. Furthermore, Alan stated frankly that it would be "a hopeless task" to construct clear, reasonable, unambiguous definitions for most of the existing peptidase terms.

The presence of gene product-ish terms in EC is largely a historical artifact: the peptidase branch of EC (EC:3.4) was initially developed long before MEROPS came into being, so EC stored a lot of the information that is now MEROPS' purview by creating lots of peptidase terms. In light of this, and the difficulty of defining the gene product-ish terms, Alan and Neil recommend keeping only the "matrix" terms plus a few others.

Their specific recommendations included:

  • For the parent term 'peptidase activity' (GO:0008233), add a definition of "peptide bond" and change bonds to bond (singular).
  • Keep the omega peptidase and dipeptidyl-peptidase terms (as we planned to anyway).
  • Also keep the "intramembrane cleaving" term (GO:0042500), because intramembrane isn't just a location in this context; the reaction takes place in an environment that excludes water. We would have the option of replicating part or all of the classification matrix if we so choose (e.g. Alan knows of intramembrane cleaving metallopeptidases).
  • Add a generic "oligopeptidase activity" term; it will be useful despite the fact that any definition of oligo would be either fuzzy or rather arbitrary. Two existing term names can become narrow synonyms.
  • The peptidase community hasn't felt a great need to make the D- vs. L-amino acid distinction at the level of classification, so it's up to us whether we want to use it as an organizing axis in GO.

Notes from July 7-8, 2008

Midori, Darren Natale (PIR, PRO)

Explained background -- GO will soon make lots of existing peptidase terms obsolete on the grounds that they represent gene products. Wherever possible, GO suggests terms, using the OBO 'consider' and 'replaced_by' tags, that can be used in place of an obsolete term for annotations, mappings, etc. Each of the obsolete protease activity terms will point to at least one term in the 'peptidase activity' branch of GO MF (mainly from the classification matrix above).

GO would like to be able to refer annotators to protein ontology (PRO) entries as well as GO terms for peptidase activity MF terms that become obsolete. The reason is that terms that simply name and describe gene products have always been outside the scope of GO, and fit much better into the scope of PRO. Example (not all tags in OBO stanza shown):

  id: GO:0008129
  name: actinidain activity
  namespace: molecular_function
  replaced_by: GO:0004197 ! cysteine-type endopeptidase activity
  consider: PRO:nnnnnnnnn ! actinidain
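To show how a consumer of the ontology might follow these tags, here is a minimal sketch that reads a stanza like the one above and collects its replaced_by and consider targets. The parser is a deliberately simple stand-in written for this example (it is not a full OBO parser), the function names are hypothetical, and the PRO identifier is the same placeholder used in the stanza.

```python
STANZA = """
id: GO:0008129
name: actinidain activity
namespace: molecular_function
replaced_by: GO:0004197 ! cysteine-type endopeptidase activity
consider: PRO:nnnnnnnnn ! actinidain
"""


def parse_stanza(text):
    """Parse 'tag: value ! comment' lines into a dict of tag -> list of values."""
    tags = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue
        # Split on the first colon only, so IDs such as GO:0008129 stay intact.
        tag, _, value = line.partition(":")
        # Drop the trailing "! name" comment, keeping just the identifier or text.
        value = value.split(" ! ")[0].strip()
        tags.setdefault(tag.strip(), []).append(value)
    return tags


def suggestions(tags):
    """Return the IDs an annotator could use in place of an obsolete term."""
    return {
        "replaced_by": tags.get("replaced_by", []),
        "consider": tags.get("consider", []),
    }


if __name__ == "__main__":
    print(suggestions(parse_stanza(STANZA)))
```

Keeping the two tags in separate lists lets a tool distinguish a replaced_by target from consider targets instead of treating all suggestions alike.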

Midori will browse PRO, and Darren will browse GO protease terms, to get an idea of what sort of information is included in definitions. That should enable us to come up with recommendations on how to handle GO-to-PRO references that will work for most cases. One important aspect of the recommendations will be what level in PRO should be cited by an obsolete term. We probably don't want to point from an obsolete GO term to a PRO isoform-specific term, and certainly not to a PRO sequence variant term. Whether we use PRO's family level or gene level may depend on the peptidase in question.

We've also formed the impression that it will generally be safer to use the OBO 'consider' tag rather than replaced_by.

Darren asked Midori to spell out what PRO has to capture for peptidases. For example, do we need the EC numbers? GO will not dictate to PRO what terms to include or how to define them, but we will tell them what we hope to be able to capture somehow for the GO peptidase terms that we make obsolete.

PRO curators have to be careful about including function/activity info in defs, because some of the isoforms of a given protein may not ever execute the function. Information not included in PRO definitions could still be captured by PRO, in comments, synonyms, dbxrefs, etc.

Addendum: Next steps

July 8, 2008

Midori & Darren looked at an example (actinidain) and used it to work out what to do next:

  1. Midori will send Darren a .obo file containing the GO terms that will be made obsolete.
  2. Midori and Emily will send a list of UniProt IDs for proteins annotated to the GO terms.
  3. Darren/PRO will identify the active version of each protein, create PRO entries, and send GO the corresponding PRO IDs.

Links to additional data

Changes committed August 15, 2008

As agreed among the working group, and announced on the GO list, I (Midori) have made 174 terms obsolete. New terms in the classification matrix have also gone "live" in the same commit.

The obsoletion completes the largest and most significant portion of work to be done on peptidase activity terms; any additional changes can go through the usual channels (mainly SourceForge requests).