Protein complexes: Difference between revisions

From GO Wiki
 
(55 intermediate revisions by 3 users not shown)
*NOTE: This is a work in progress. It needs to be wrapped up, and revised by editors, Becky and Birgit. Also, we need to add examples - what works and what doesn't.
=GO definition of a protein-containing complex=
*Last updated: 13/5/2015, Birgit Meldal; 14/5/2015, Paola.
* A cellular component should be composed of more than one subunit (protein and another protein or an RNA), forming a stable interaction that exists as a functional unit <i>in vivo</i>. All complexes in the component ontology are created under the general term [https://amigo.geneontology.org/amigo/term/GO:0032991 GO:0032991 protein-containing complex].
* Protein-containing complex terms should have 'complex' in the term label to avoid ambiguity. For example, the molecular function term <code>GO:0004738 pyruvate dehydrogenase activity</code> describes the enzyme activity whereas the cellular component term <code>GO:0045254 pyruvate dehydrogenase complex</code> describes the multi-subunit structure in which the enzyme activity resides.
* Complexes should be as species-agnostic as possible; for example, if a homologous complex is present in different species and has different subunit composition, the definition should either be more vague about the number of subunits, or explain how the complex differs in different species.
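
The labelling rule above lends itself to a simple automated check. The following is a minimal sketch in Python, not part of any GO tooling; the function name <code>label_follows_convention</code> is hypothetical, and the two labels are the ones quoted in the list above.

```python
def label_follows_convention(label: str) -> bool:
    """Return True if a cellular component label contains the word 'complex'."""
    # Compare whole words, case-insensitively, so that e.g. 'complexity'
    # would not be mistaken for 'complex'. Hyphens are treated as spaces.
    words = label.lower().replace("-", " ").split()
    return "complex" in words


# The cellular component label passes, the molecular function label does not.
print(label_follows_convention("pyruvate dehydrogenase complex"))
print(label_follows_convention("pyruvate dehydrogenase activity"))
```

A check like this only looks at the label; whether the term really describes a stable, multi-subunit entity still needs curator judgement.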


=Textual definition for protein-containing complex terms=
* The textual definitions of protein-containing complex terms should start with "A protein-containing complex that", and continue with either "catalyzes" (some enzymatic activity), "is capable of" (some molecular function) and/or "consisting of" (and list the components).
* Term definitions for protein-containing complexes should be generic and species-agnostic as much as possible. To provide guidance, it is possible to add specific components for a (small) number of species, formulated as 'For example, in human this complex contains...' as a definition gloss or term comment.
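
The wording pattern described above could be checked mechanically before a term request is submitted. This is a minimal sketch under stated assumptions: <code>check_definition</code> and its messages are hypothetical, and the example definitions are invented for illustration rather than taken from the ontology.

```python
# Opening phrase and continuation phrases as given in the guideline above.
REQUIRED_PREFIX = "A protein-containing complex that"
EXPECTED_CONTINUATIONS = ("catalyzes", "is capable of", "consisting of")


def check_definition(definition: str) -> list:
    """Return a list of problems found in a textual definition (empty if none)."""
    problems = []
    text = definition.strip()
    if not text.startswith(REQUIRED_PREFIX):
        problems.append("definition does not start with: " + REQUIRED_PREFIX)
    if not any(phrase in text for phrase in EXPECTED_CONTINUATIONS):
        problems.append("definition uses none of: " + ", ".join(EXPECTED_CONTINUATIONS))
    return problems


# Invented example definitions, for illustration only.
good = "A protein-containing complex that catalyzes the oxidation of an example substrate."
bad = "A complex of two subunits."
print(check_definition(good))
print(check_definition(bad))
```

The check is deliberately loose: it confirms the expected phrases are present but does not judge whether the definition is generic and species-agnostic.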


=In scope=
* Complexes that exist in an ''in vivo'', physiologically relevant context.
* Homomultimeric proteins, e.g. the homodimeric alcohol dehydrogenase, may be included as cellular component terms, as should heteromultimeric proteins, e.g. hemoglobin with alpha and beta chains.
* Enzyme/substrate and receptor/ligand complexes in which the substrate or ligand is a critical part of the complex assembly (e.g. PDGF receptors only become 'dimeric' when linked by the dimeric ligand forming a tetramer).


== Background and rationale ==
=Out of scope=
* Complexes of one gene product with a cofactor, e.g. heme, chlorophyll, magnesium.
* Enzyme/substrate, receptor/ligand or any similar transient interactions unless these are a critical part of the complex assembly. These unstable interactions should be captured with 'GO:0005488 binding' or 'GO:0005515 protein binding'.
* Putative complexes where the only evidence is based on genetic interaction data.
* Proteins associated in a pulldown/coimmunoprecipitation assay with no functional link or any evidence that this is a defined biological entity rather than a loose affinity complex. In other words, a <i>bona fide</i> complex should form under physiological conditions as part of an evolved function; things formed  <i>in vitro</i> as part of an experimental procedure are assays.
* Partial complexes and subcomplexes. Note that crystallization experiments often use partial complexes, for technical reasons: some subunits (e.g. transmembrane subunits) cannot be expressed as recombinant proteins and are 'left out' of detailed studies. More reading is often necessary to find out what the full complex is thought to be.
* Complexes differentiated from their parent by the cell type in which they are present.
* Complexes should NOT be defined by their stoichiometry, though this may be mentioned in the definition as a definition gloss, or in a comment. The rationale behind this recommendation is that, as knowledge advances and more examples are found, definitions mentioning stoichiometry would have to be updated, causing a lot of work. Also, stoichiometry can vary in different organisms; it is better to keep the definition more general. It is perfectly fine though to mention something like 'usually consists of a catalytic and a regulatory subunit and possibly further accessory subunits...'.


Recently, GO and IntAct have started to work together to improve the 'protein complex' branch in GO, making it less flat and more informative, and to provide species-agnostic GO terms that IntAct can reference for their species-specific curation projects, namely the Complex Portal (http://www.ebi.ac.uk/intact/complex/). At the time of writing (Spring 2015) the focus of the Complex Portal lies on human, mouse, yeast and E. coli, but it can take direct curation requests as well (intact-help@ebi.ac.uk). Other MODs are encouraged to collaborate directly.
=Specific complexes ("instances")=
* GO describes general classes of concepts, not specific ones. Specific complexes, described by their exact subunits in a specific organism, can be submitted to [https://www.ebi.ac.uk/about/contact/support/complexportal Complex Portal] and/or [https://github.com/PROconsortium/PRoteinOntology/issues/new?assignees=nataled&labels=Term+Request&projects=&template=1--term-request.md&title=Term+issue%3A+ Protein Ontology (PRO)]. These resources capture complexes with their exact subunit composition (similar to GO annotations).


Here, we collect current guidelines on protein complex terms, to aid GO curators in discerning whether a protein complex belongs in GO or not, and if yes, in including all necessary information when requesting a new protein complex.
=Taxon constraints=
For complexes known to be only present in certain taxa, curators are encouraged to provide this information, if applicable, when they request a new term, or come across an existing one that is missing useful taxon constraints. Typically there are prokaryote- and eukaryote-specific complexes, but this can apply to any complex.


== Protein complexes in GO ==
=Interontology links=


=== Rule 1: Is the complex stable? ===
==Protein-containing complex link to MF==
* A protein-containing complex can be linked to a molecular function using the 'capable_of' relation. Note that these cannot be used to annotate individual subunits to a MF, as an annotation to a protein-containing complex doesn't indicate which is the active subunit.


In GO, 'protein complex' is defined as "A stable macromolecular complex composed (only) of two or more polypeptide subunits along with any covalently attached molecules (such as lipid anchors or oligosaccharide) or non-protein prosthetic groups (such as nucleotides or metal ions). Prosthetic group in this context refers to a tightly bound cofactor. The component polypeptide subunits may be identical."
==Protein-containing complex link to BP==
* A protein-containing complex can be linked to a biological process using the 'capable_of_part_of' relation. These CC to BP relations can be used for inference of the BP from the annotation to the protein-containing complex.


When in doubt, how can a curator figure out if a complex is really stable?
==Protein-containing complex link to CC==
* A protein-containing complex can be linked to a cellular anatomical entity using the 'part_of' relation.
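
The three interontology links above pair each relation with one GO aspect. The sketch below records that pairing and flags links whose target is in the wrong aspect. It is illustrative only: the <code>Link</code> class and <code>invalid_links</code> function are hypothetical, and using GO:0004738 as a link target for the pyruvate dehydrogenase complex is an example, not a statement about what the ontology currently asserts.

```python
from dataclasses import dataclass

# Relation -> aspect expected for the target term, following the text above.
EXPECTED_ASPECT = {
    "capable_of": "molecular_function",
    "capable_of_part_of": "biological_process",
    "part_of": "cellular_component",
}


@dataclass(frozen=True)
class Link:
    relation: str       # one of the keys of EXPECTED_ASPECT
    target_id: str      # GO identifier of the linked term
    target_aspect: str  # aspect of the linked term


def invalid_links(links):
    """Return links whose relation is unknown or points at the wrong aspect."""
    return [l for l in links if EXPECTED_ASPECT.get(l.relation) != l.target_aspect]


# Illustrative links for a complex term; the second is deliberately wrong,
# because part_of should point at a cellular component.
links = [
    Link("capable_of", "GO:0004738", "molecular_function"),
    Link("part_of", "GO:0004738", "molecular_function"),
]
for bad in invalid_links(links):
    print(bad.relation, "->", bad.target_id, "has unexpected aspect", bad.target_aspect)
```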


We can refer to the IntAct Complex Portal Rules [http://www.ebi.ac.uk/intact/complex/documentation/]. These are reported below for reference, and in a definition comment to 'protein complex' to serve as annotation guidance:
== How to request protein complexes in GO==


===== What can be described as a complex? =====
* Use the [https://github.com/geneontology/go-ontology/issues/new?assignees=&labels=&projects=&template=ntr--protein-containing-complex.md&title= GO-ontology GitHub tracker]
 
'''A stable set of (two or more) interacting macromolecules such as proteins which can be co-purified by an acceptable method and have been shown to exist as an isolated, functional unit in vivo. Any interacting non-protein molecules (e.g. small molecules, nucleic acids) will also be included.'''
 
===== What should not be captured: =====
*Enzyme/substrate, receptor/ligand or any similar transient interactions unless these are a critical part of the complex assembly (e.g. PDGF receptors only become 'dimeric' when linked by the dimeric ligand forming a tetramer).
*Proteins associated in a pulldown/coimmunoprecipitation assay with no functional link or any evidence that this is a defined biological entity rather than a loose affinity complex.
*Any literature complex where the only evidence is based on genetic interaction data.
*Partial complexes.
 
Note:
*If the complex is not stable, it's just protein binding. Interactions can then be captured by a protein-protein interaction DB such as IntAct.
*Beware of partial complexes shown experimentally, especially when crystallised. Some subunits (e.g. transmembrane subunits) cannot be expressed as recombinant proteins and are 'left out' of detailed studies. More reading is often necessary to find out what the full complex is thought to be.
 
===== Tricky cases we DO (or could) capture in the IntAct Complex Portal: =====
 
*Substrates or ligands if the enzyme or receptor complex only forms in their presence (see PDGF receptors above, e.g. EBI-9082861). These terms would also qualify for GO.
 
*Homologous proteins, with the same functionality, which would be inferred by sequence similarity to form a complex but for which no physical link has been demonstrated, e.g. proteins A and B have been shown to physically interact and form a functional complex, protein C is a homologue of protein B by sequence similarity and is known to have the same function as B but protein A-C interaction has not been demonstrated experimentally. E.g. SUMO - E1 ligase complexes where there is interaction evidence for binding with SUMO1 (EBI-9349603) but not with SUMO2 (EBI-9345927).
 
*The Complex Portal could also hold transient complexes, e.g. signaling complexes. We have not created any of these to date but they are possible, and controlled vocabulary terms exist to distinguish the two classes. BUT - they would probably fall outside the scope of GO if GO limit themselves to stable complexes.
 
*We can also curate complexes that lack full experimental evidence but are commonly regarded as existing, e.g. complexes submitted by ChEMBL for which we only have pharmacological evidence. These complexes are tagged with ECO:0000306 - inferred from background scientific knowledge by manual assertion. E.g. GABA (EBI-9008426) receptors and many other transmembrane receptors.
 
=== Rule 2: Is the complex species-agnostic? ===
 
*GO should host species-agnostic complexes, ideally conserved across taxa. Where this isn't known, we should still make the definition generic, and add 'For example, in human this complex contains...' as a definition gloss or definition comment.
 
*Species-specific complexes don't belong in GO, but IntAct/Complex Portal and/or PRO can take them. (We acknowledge that GO contains many historic terms that contravene this rule. For the time being, the agreement is that we will not review them globally, though we may fix them if and when we come across them.)
 
*We may, however, need taxon restrictions on a case-by-case basis, such as for complexes that only exist in prokaryotes or eukaryotes. Curators are encouraged to provide this information, if applicable, when they request a new term (or come across an existing one).
 
=== Rule 3: Does the complex have a molecular function? ===
 
*If yes, add capable_of link(s) to molecular function terms. These links are used by the reasoner to place the complex into the correct branch under 'protein complex'.
 
=== Rule 4: Is the complex known to be involved in one or more biological processes? ===
 
*If yes, add capable_of_part_of links to biological process(es).
 
*Note: we decided not to use BP as a qualifier for making grouping terms for complexes as these would become too unspecific, e.g. 'regulatory complex' could include most complexes!
 
=== Rule 5: Does the complex contain conserved subunits? ===
 
*GO does host complexes based on their subunits only, when no function or process information is available.
 
*Most complexes contain some wording such as: "In human, it is composed of..." BUT, this is getting messy where subunit composition is different in different branches of the tree of life and different groups/MODs add their own examples. Should these just go in as NARROW synonyms? [to be discussed]
 
*Complexes defined by their subunits but functionally identical to a more generic parent term should not be created as separate GO terms but added to the parent term as synonyms. The specific complex belongs in the Complex Portal.
 
[DOS to look into some automatic reasoning across subunits but we think it may become tricky. To be discussed on Editor's call.]
 
=== Rule 6: Where is the complex located? ===
 
*Indicate cellular location as specifically as possible, unless parent already has one.
 
*The CC location is meant for the complex as a whole. We discussed this in the context of transmembrane complexes where one or more members of the complex are located on one side of the membrane only or have no membrane attachment at all. As gene products have the part_of relationship with the complexes this is fine (and the only way of reflecting the CC for the complex as a whole).
 
*If we have complexes defined by their location (see below under 'Futures Plans'), does the reasoner take the part_of relationship to place them automatically into the right complex-by-location branch? [DOS?]
 
=== Rule 7: Adding appropriate is_a relationships ===
 
*We are trying to avoid placing complexes as direct is_a children of 'protein complex', by adding some granularity to this ontology branch.
 
*An is_a parent of a complex can be a
# complex defined by its activity, via the complex-by-activity TG template
# complex defined by its location, such as 'plasma membrane complex'. [Can we have a complex-by-location template?] [Update: DOS added some useful protein complex grouping terms based on location...]
# complex defined by its subunit composition. This may be related to protein families but it may be difficult to make it a rule/template (see below)
*We decided NOT to define complexes by their process or MF binding as they would become too generic.
 
*Note: Complexes can have multiple parents!
 
*Note: if capable_of MF links are added, and/or if location information is provided as part_of CC, the automatic assert-inference script will take care of placing most newly created protein complex terms more granularly in the ontology.
 
=== Rule 8: Adding appropriate part_of relationships ===
 
*All complexes should have a part_of link to a cellular component term, even if it's very generic, such as 'cell'.
*CC does not have to be added manually if it's the same as the parent term as it will be inferred.
*If the CC is more specific than the parent, the part_of relationship must be added manually.
*Complexes can be subcomplexes of larger entities and can therefore be part_of another protein complex. If the larger complex necessarily needs the smaller one as its component in order to be functional, a has_part link should also be added (larger complex has_part smaller complex).
*Complexes cannot have several part_of relationships to different CCs, as part_of must ALWAYS be true. If a complex can be part of several larger complexes or be found in several locations, such as cytoplasm and nucleus where it may have different functions, separate terms may have to be considered. [This point is still open to discussion, see https://sourceforge.net/p/geneontology/ontology-requests/10745/, now with DOS. To be discussed on Editor's call.]
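
The part_of rules above can be turned into two quick checks: one for complexes with no part_of link at all, and one for complexes with more than one, which the last point says may need separate terms. This is a minimal sketch; the function names and the complex and location names in the example are hypothetical placeholders, not GO terms.

```python
def missing_part_of(part_of_links):
    """Return the complexes that have no part_of link at all."""
    return sorted(name for name, parents in part_of_links.items() if not parents)


def multiple_part_of(part_of_links):
    """Return the complexes with more than one distinct part_of target."""
    return sorted(name for name, parents in part_of_links.items() if len(set(parents)) > 1)


# Hypothetical data: complex name -> list of part_of targets.
part_of_links = {
    "example complex A": ["example membrane"],
    "example complex B": [],
    "example complex C": ["example cytoplasm", "example nucleus"],
}
print(missing_part_of(part_of_links))
print(multiple_part_of(part_of_links))
```

Complexes flagged by the second check are candidates for review rather than errors, since the text notes that this point was still open to discussion.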
 
== How to request protein complexes in GO based on the above (TG template, TG freeform) ==
 
*If the complex is generic and its function exists as a GO term, use the complex-by-activity template (and add relevant synonyms as discussed above).
 
*If the function does not yet exist in GO but is clearly defined, create the new MF term first (via SF or TG FF depending on the curator's experience), then create the CC term for the complex via the template.
 
*If the complex-by-activity template is not applicable, create the complex term either via SF or TG FF depending on the curator's experience.
 
*IntAct is happy to curate requested complexes into the Complex Portal at the same time as adding to the GO structure. Curators are encouraged to curate complexes directly into the Complex Portal after being trained by IntAct. SGD are doing this already. Contact intact-help@ebi.ac.uk for either use case.
 
== Complex Definition field ==
 
We discussed the structure of the definition as there are lots of different ways to build it. At the moment, Birgit starts with the function (if applicable), such as 'A protein complex capable of X function...'. Most defs contain examples of subunits but see above for difficulties. Other complex defs start with the list of subunits. We need to come up with a set of rules that suit most cases. Processes can also be mentioned in the def.
 
Complexes should NOT be defined by their stoichiometry but it may be mentioned in the def as a 'soft' comment. The problem is that, as knowledge advances and more examples are found, stoichiometry defs have to be updated, causing a lot of work. It is perfectly fine though to mention something like 'usually consists of a catalytic and a regulatory subunit and possibly further accessory subunits...' NB: Birgit created a lot of stoichiometry definitions in the beginning before we realised this was a bad idea!
 
[To be discussed on Editor's call, then discussed with Birgit and Sandra again.]
 
== Future plans ==
As discussed in a meeting with Birgit Meldal, Sandra Orchard, David Osumi-Sutherland and Paola Roncaglia on 28/4/2015.
 
We discussed how we can make 'quick gains' in making the ontology more granular beyond the fixes Birgit does on a case by case basis. This is to target historic terms that have only 'protein complex' as a parent because they have no annotation extensions. The aim is to have most complexes grouped either by their function, location or subunit composition.
*Do a pass through term names and definitions to find major groups of complexes that can be grouped by function, e.g. 'catalytic complex' (the term exists but many historic terms have not automatically been classified as such as they have no capable_of extensions). Other keywords: kinase, activity, viral, receptor, respiratory chain... [BM, SO & DOS]
 
*Add parent terms based on location, such as 'membrane complex' and children or 'mitochondrial complex'. Can the reasoner place complexes automatically into this branch based on their part_of relationship (see above Rule 6)? Should we have a TG template for this? [BM, DOS]
 
*Combine activity and location, such as 'membrane receptor complex'.
 
*We discussed grouping by protein families but this may be tricky. Decide on a case by case basis. A working example is the BCL protein family complexes, which cannot be grouped by function as they may be pro- and/or antiapoptotic.
 
== Previous work ==
 
Emily started documentation here, in case it's helpful, but this wasn't worked on since 2011:
http://wiki.geneontology.org/index.php/Protein_Complex_ids_as_GO_annotation_objects
 
 
Birgit's comments [from email to Paola]:
 
Inheritance of annotations:
I agree with the wiki, you cannot inherit MF from a complex to a subunit and even a CC is problematic, see the transmembrane example above. This needs more thinking about. I don't know what you are doing right now...
 
Orthologies:
We infer within taxon groups, e.g. human to mouse to rat or any other mammal etc, depending on where the exp evidence comes from. We systematically infer human-mouse. We have a few pombe complexes inferred from yeast (Sc!) but we don't do it systematically.
 
Paralogues:
We make inferences between related complexes in the same species when the gene products are very similar, e.g. hemoglobin chains for adult and developmental complexes.
 
'Large' complexes:
We have tackled the 'mediator' and we can now link to RNACentral for RNAs so time permitting we'll tackle the 'biggies' soon!
 
Pro:
We have a list of Pro complexes that we consult for refs.
 
===== What IntAct is doing - a summary: =====
 
We didn't draw up an official set of rules but in summary this is what we do (and it pretty much matches what Paola says below and the wiki she cites):
A complex should be taxon agnostic but may be restricted to certain taxonomic groups, such as pro- vs eukaryotes.
 
... should contain subunits in the def
 
... should have a 'as precise as possible' part_of relationship to the CC (may have to create new terms here as well of course!) which can be a complex (in cases of subcomplexes) or a location
 
... have, if possible, capable_of and capable_of_part_of annotation extensions.
 
... should have is_a relationship to an appropriate child term of 'protein complex'. This could be a term based on its composition or function but NOT based on the BP. If no appropriate term exists, we create one based on either of the two classes. There is now a TG template for creating complex-by-MF which makes curators' lives much easier :) If there is no appropriate CC or complex-by-MF parent the new complex will be a direct child of 'protein complex'.


== Useful links ==


IntAct Complex Portal, http://www.ebi.ac.uk/intact/complex/
* [http://www.ebi.ac.uk/complexportal/ Complex Portal]


== Review Status ==
Last reviewed: 2023-09-07


Reviewed by: Peter D'Eustachio, Pascale Gaudet
----
[[Category:GO Editors]][[Category:Ontology]]

Latest revision as of 03:38, 30 January 2024

GO definition of a protein-containing complex

  • A cellular component should be composed of more than one subunit (protein and another protein or an RNA), forming a stable interaction that exists as a functional unit in vivo. All complexes in the component ontology are created under the general term GO:0032991 protein-containing complex.
  • Protein-containing complex terms should have 'complex' in the term label to avoid ambiguity. For example, the molecular function term GO:0004738 pyruvate dehydrogenase activity describes the enzyme activity whereas the cellular component term GO:0045254 pyruvate dehydrogenase complex describes the multi-subunit structure in which the enzyme activity resides.
  • Complexes should be as species-agnostic as possible; for example, if a homologous complex is present in different species and has different subunit composition, the definition should either be more vague about the number of subunits, or explain how the complex differs in different species.

Textual definition for protein-containing complex terms

  • The textual definitions of protein-containing complex terms should start with "A protein-containing complex that", and continue with either "catalyzes" (some enzymatic activity), "is capable of" (some molecular function) and/or "consisting of" (and list the components).
  • Term definitions for protein-containing complexes should be generic and species-agnostic as much as possible. To provide guidance, it is possible to add specific components for a (small) number of species, formulated as 'For example, in human this complex contains...' as a definition gloss or term comment.

In scope

  • Complexes that exist in an in vivo, physiologically relevant context.
  • Homomultimeric proteins, e.g. the homodimeric alcohol dehydrogenase, may be included as cellular component terms, as should heteromultimeric proteins, e.g. hemoglobin with alpha and beta chains.
  • Enzyme/substrate and receptor/ligand complexes in which the substrate or ligand is a critical part of the complex assembly (e.g. PDGF receptors only become 'dimeric' when linked by the dimeric ligand forming a tetramer).

Out of scope

  • Complexes of one gene product with a cofactor, e.g. heme, chlorophyll, magnesium.
  • Enzyme/substrate, receptor/ligand or any similar transient interactions unless these are a critical part of the complex assembly. These unstable interactions should be captured with 'GO:0005488 binding' or 'GO:0005515 protein binding'.
  • Putative complexes where the only evidence is based on genetic interaction data.
  • Proteins associated in a pulldown/coimmunoprecipitation assay with no functional link or any evidence that this is a defined biological entity rather than a loose affinity complex. In other words, a bona fide complex should form under physiological conditions as part of an evolved function; things formed in vitro as part of an experimental procedure are assays.
  • Partial complexes and subcomplexes. Note that crystallization experiments often use partial complexes, for technical reasons: some subunits (e.g. transmembrane subunits) cannot be expressed as recombinant proteins and are 'left out' of detailed studies. More reading is often necessary to find out what the full complex is thought to be.
  • Complexes differentiated from their parent by the cell type in which they are present.
  • Complexes should NOT be defined by their stoichiometry, though this may be mentioned in the definition as a definition gloss, or in a comment. The rationale behind this recommendation is that, as knowledge advances and more examples are found, definitions mentioning stoichiometry would have to be updated, causing a lot of work. Also, stoichiometry can vary in different organisms; it is better to keep the definition more general. It is perfectly fine though to mention something like 'usually consists of a catalytic and a regulatory subunit and possibly further accessory subunits...'.

Specific complexes ("instances")

  • GO describes general classes of concepts, not specific ones. Specific complexes, described by their exact subunits in a specific organism, can be submitted to Complex Portal and/or Protein Ontology (PRO). These resources capture complexes with their exact subunit composition (similar to GO annotations).

Taxon constraints

For complexes known to be only present in certain taxa, curators are encouraged to provide this information, if applicable, when they request a new term, or come across an existing one that is missing useful taxon constraints. Typically there are prokaryote- and eukaryote-specific complexes, but this can apply to any complex.

Interontology links

Protein-containing complex link to MF

  • A protein-containing complex can be linked to a molecular function using the 'capable_of' relation. Note that these cannot be used to annotate individual subunits to a MF, as an annotation to a protein-containing complex doesn't indicate which is the active subunit.

Protein-containing complex link to BP

  • A protein-containing complex can be linked to a biological process using the 'capable_of_part_of' relation. These CC to BP relations can be used for inference of the BP from the annotation to the protein-containing complex.

Protein-containing complex link to CC

  • A protein-containing complex can be linked to a cellular anatomical entity using the 'part_of' relation.

How to request protein complexes in GO

Useful links

Review Status

Last reviewed: 2023-09-07

Reviewed by: Peter D'Eustachio, Pascale Gaudet