Protein complexes: Difference between revisions

From GO Wiki
 
(91 intermediate revisions by 4 users not shown)
Line 1: Line 1:
*NOTE: This is a work in progress. It needs to be wrapped up, and revised by editors, Becky and Birgit. Also, we need to add examples - what works and what doesn't.
=GO definition of a protein-containing complex=
*Last updated: 30/4/2015, Birgit Meldal
* A protein-containing complex should be composed of more than one subunit (a protein and another protein or an RNA), forming a stable interaction that exists as a functional unit <i>in vivo</i>. All complexes in the component ontology are created under the general term [https://amigo.geneontology.org/amigo/term/GO:0032991 GO:0032991 protein-containing complex].
* Protein-containing complex terms should have 'complex' in the term label to avoid ambiguity. For example, the molecular function term <code>GO:0004738 pyruvate dehydrogenase activity</code> describes the enzyme activity whereas the cellular component term <code>GO:0045254 pyruvate dehydrogenase complex</code> describes the multi-subunit structure in which the enzyme activity resides.
* Complexes should be as species-agnostic as possible; for example, if a homologous complex is present in different species and has a different subunit composition, the definition should either be more vague about the number of subunits, or explain how the complex differs between species.


=Textual definition for protein-containing complex terms=
* The textual definitions of protein-containing complex terms should start with "A protein-containing complex that", and continue with one or more of "catalyzes" (some enzymatic activity), "is capable of" (some molecular function) and "consisting of" (followed by a list of the components).
* Term definitions for protein-containing complexes should be generic and species-agnostic as much as possible. To provide guidance, it is possible to add specific components for a (small) number of species, formulated as 'For example, in human this complex contains...' as a definition gloss or term comment.
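The label and definition conventions above can be checked mechanically. The following is a minimal sketch, not an official GO tool: the function name <code>check_complex_term</code> and the example definition text are hypothetical and only illustrate the wording rules stated on this page.

```python
REQUIRED_PREFIX = "A protein-containing complex that"
# Phrases the guideline lists as acceptable continuations.
CONTINUATIONS = ("catalyzes", "is capable of", "consisting of")


def check_complex_term(label: str, definition: str) -> list[str]:
    """Return a list of guideline problems; the list is empty if none were found."""
    problems = []
    if "complex" not in label.lower():
        problems.append("label lacks 'complex'")
    if not definition.startswith(REQUIRED_PREFIX):
        problems.append("definition does not start with the required phrase")
    if not any(phrase in definition for phrase in CONTINUATIONS):
        problems.append("definition lacks a recognised continuation")
    return problems


# Illustrative definition text (an assumption, not the actual GO definition).
good = check_complex_term(
    "pyruvate dehydrogenase complex",
    "A protein-containing complex that catalyzes the oxidative "
    "decarboxylation of pyruvate.",
)
bad = check_complex_term(
    "pyruvate dehydrogenase activity",
    "Catalysis of the oxidative decarboxylation of pyruvate.",
)
```

A check like this only catches wording problems; whether a definition is sufficiently generic and species-agnostic still needs a curator's judgement.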


=In scope=
* Complexes that exist in an ''in vivo'', physiologically relevant context.
* Homomultimeric proteins, e.g. the homodimeric alcohol dehydrogenase, may be included as cellular component terms, as should heteromultimeric proteins, e.g. hemoglobin with alpha and beta chains.
* Enzyme/substrate or receptor/ligand interactions that are a critical part of the complex assembly (e.g. PDGF receptors only become 'dimeric' when linked by the dimeric ligand, forming a tetramer).


== Background and rationale ==
=Out of scope=
* Complexes of one gene product with a cofactor, e.g. heme, chlorophyll, magnesium.
* Enzyme/substrate, receptor/ligand or any similar transient interactions, unless these are a critical part of the complex assembly. These unstable interactions should be captured with 'GO:0005488 binding' or 'GO:0005515 protein binding'.
* Putative complexes where the only evidence is based on genetic interaction data.
* Proteins associated in a pulldown/coimmunoprecipitation assay with no functional link or any evidence that this is a defined biological entity rather than a loose affinity complex. In other words, a <i>bona fide</i> complex should form under physiological conditions as part of an evolved function; things formed <i>in vitro</i> as part of an experimental procedure are assays.
* Partial complexes and subcomplexes. Note that crystallization experiments often use partial complexes, for technical reasons: some subunits (e.g. transmembrane subunits) cannot be expressed as recombinant proteins and are 'left out' of detailed studies. More reading is often necessary to find out what the full complex is thought to be.
* Complexes differentiated from their parent by the cell type in which they are present.
* Complexes should NOT be defined by their stoichiometry, though this may be mentioned in the definition as a definition gloss, or in a comment. The rationale behind this recommendation is that, as knowledge advances and more examples are found, definitions mentioning stoichiometry would have to be updated, causing a lot of work. Also, stoichiometry can vary in different organisms; it is better to keep the definition more general. It is perfectly fine though to mention something like 'usually consists of a catalytic and a regulatory subunit and possibly further accessory subunits...'.
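The in-scope and out-of-scope criteria can be read as a triage checklist. The sketch below encodes a subset of them; the <code>Candidate</code> class, its field names and the reason strings are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    macromolecular_subunits: int         # protein or RNA subunits; cofactors are not counted
    exists_in_vivo: bool                 # physiologically relevant context
    transient: bool = False              # e.g. enzyme/substrate, receptor/ligand
    critical_for_assembly: bool = False  # transient partner is needed to build the complex
    genetic_evidence_only: bool = False
    partial_or_subcomplex: bool = False


def out_of_scope_reasons(c: Candidate) -> list[str]:
    """Return the reasons a candidate falls outside GO; empty means in scope."""
    reasons = []
    if c.macromolecular_subunits < 2:
        reasons.append("fewer than two protein/RNA subunits")
    if not c.exists_in_vivo:
        reasons.append("no in vivo, physiologically relevant context")
    if c.transient and not c.critical_for_assembly:
        reasons.append("transient interaction; capture as binding instead")
    if c.genetic_evidence_only:
        reasons.append("only genetic interaction evidence")
    if c.partial_or_subcomplex:
        reasons.append("partial complex or subcomplex")
    return reasons


hemoglobin = Candidate(macromolecular_subunits=4, exists_in_vivo=True)
protein_plus_heme = Candidate(macromolecular_subunits=1, exists_in_vivo=True)
enzyme_substrate = Candidate(macromolecular_subunits=2, exists_in_vivo=True, transient=True)
```

Stoichiometry and cell type are deliberately absent from the fields: per the list above, neither should be used to define or differentiate a complex term.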


Recently, GO and IntAct have started to work together to improve the 'protein complex' branch in GO, making it less flat and more informative, and to provide species-agnostic GO terms that IntAct can reference in their species-specific curation projects. (At the time of writing the focus lies on human, mouse, yeast and E. coli, but we take direct curation requests as well. We'd like more MODs to collaborate directly.) Here, we collect current guidelines on protein complex terms, to aid GO curators in discerning whether a protein complex belongs in GO or not, and if yes, in including all necessary information when requesting a new protein complex.
=Specific complexes ("instances")=
* GO describes general classes of concepts, not specific ones. Specific complexes, defined by their exact subunits in a specific organism, can be submitted to [https://www.ebi.ac.uk/about/contact/support/complexportal Complex Portal] and/or [https://github.com/PROconsortium/PRoteinOntology/issues/new?assignees=nataled&labels=Term+Request&projects=&template=1--term-request.md&title=Term+issue%3A+ Protein Ontology (PRO)]. These resources capture complexes with their exact subunit composition (similar to GO annotations).


=Taxon constraints=
For complexes known to be only present in certain taxa, curators are encouraged to provide this information, if applicable, when they request a new term, or come across an existing one that is missing useful taxon constraints. Typically there are prokaryote- and eukaryote-specific complexes, but this can apply to any complex.


== Protein complexes in GO ==
=Interontology links=


==Protein-containing complex link to MF==
* A protein-containing complex can be linked to a molecular function using the 'capable_of' relation. Note that these links cannot be used to annotate individual subunits to an MF, as an annotation to a protein-containing complex doesn't indicate which subunit is the active one.


=== Rule 1: Is the complex stable? ===
==Protein-containing complex link to BP==
* A protein-containing complex can be linked to a biological process using the 'capable_of_part_of' relation. These CC to BP relations can be used for inference of the BP from the annotation to the protein-containing complex.


From the Complex Portal Rules http://www.ebi.ac.uk/intact/complex/documentation/
==Protein-containing complex link to CC==
* A protein-containing complex can be linked to a cellular anatomical entity using the 'part_of' relation.  
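The three interontology links can be sketched as a small data structure that prints OBO-style <code>relationship:</code> lines. This is an illustration under stated assumptions: the <code>ComplexTerm</code> class is hypothetical, and the specific capable_of link shown between the two pyruvate dehydrogenase terms is used as an example rather than quoted from the ontology file.

```python
from dataclasses import dataclass, field

Link = tuple[str, str]  # (target term ID, target label)


@dataclass
class ComplexTerm:
    term_id: str
    label: str
    capable_of: list[Link] = field(default_factory=list)          # links to MF
    capable_of_part_of: list[Link] = field(default_factory=list)  # links to BP
    part_of: list[Link] = field(default_factory=list)             # links to CC

    def to_obo(self) -> str:
        """Render the term and its interontology links as an OBO-style stanza."""
        lines = ["[Term]", f"id: {self.term_id}", f"name: {self.label}"]
        for relation in ("capable_of", "capable_of_part_of", "part_of"):
            for target_id, target_label in getattr(self, relation):
                lines.append(f"relationship: {relation} {target_id} ! {target_label}")
        return "\n".join(lines)


# Illustrative link only; check the ontology for the asserted relations.
pdh = ComplexTerm(
    term_id="GO:0045254",
    label="pyruvate dehydrogenase complex",
    capable_of=[("GO:0004738", "pyruvate dehydrogenase activity")],
)
stanza = pdh.to_obo()
```

Keeping the three relations in separate fields mirrors the point made above: a capable_of link describes the complex as a whole and says nothing about which subunit carries the activity.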


== How to request protein complexes in GO==


===== What can be described as a complex? =====
* Use the [https://github.com/geneontology/go-ontology/issues/new?assignees=&labels=&projects=&template=ntr--protein-containing-complex.md&title= GO-ontology GitHub tracker]
 
'''A stable set of (two or more) interacting macromolecules such as proteins which can be co-purified by an acceptable method and have been shown to exist as an isolated, functional unit in vivo. Any interacting non-protein molecules (e.g. small molecules, nucleic acids) will also be included.
'''
 
 
===== What should not be captured: =====
*Enzyme/substrate, receptor/ligand or any similar transient interactions unless these are a critical part of the complex assembly.
*Proteins associated in a pulldown/coimmunoprecipitation with no functional link or any evidence that this is a defined biological entity rather than a loose affinity complex.
*Proteins with the same function but with either no demonstrable physical link or one that can be inferred by sequence homology. [wording to be updated by Sandra]
*Any literature complex where the only evidence is based on genetic interaction data.
 
 
===== Comments: =====
*If the complex is not stable, it's just protein binding. Interactions can then be captured by a protein-protein interaction DB such as IntAct.
 
*The Complex Portal could also hold transient complexes, e.g. signaling complexes that form for only split seconds but have some experimental evidence that they exist. We haven't done any of these but they are possible. BUT - they would probably fall outside the scope of GO if they limit themselves to stable complexes.
 
*We can also curate complexes that have no full experimental evidence but are commonly regarded as truly real, e.g. complexes submitted by ChEMBL for which we only have pharmacological evidence. These complexes are tagged with ECO:0000306 - inferred from background scientific knowledge by manual assertion.
 
 
=== Rule 2: Is the complex species-agnostic? ===
 
*GO should host species-agnostic complexes, ideally conserved across taxa. Where this isn't known, still make the def generic, and add 'For example, in human this complex contains...' as a def gloss or def comment.
*Species-specific complexes don't belong in GO, but IntAct/Complex Portal and/or PRO can take them.
 
*We may, however, need taxon restrictions on a case-by-case basis, e.g. for complexes that only exist in prokaryotes or eukaryotes.
 
 
=== Rule 3: Does the complex have a molecular function? ===
 
*Ideally, add capable_of function links. These links are used by the reasoner to place the complex into the correct branch under 'protein complex'.
 
=== Rule 4: Is the complex known to be involved in one or more biological processes? ===
 
*If yes, add capable_of_part_of process links.
*Note: we decided not to use BP as a qualifier for making group terms for complexes as these would become too unspecific, e.g. 'regulatory complex' could include most complexes!
 
 
=== Rule 5: Does the complex contain conserved subunits? ===
 
*GO does host complexes based on their subunits only, when no function or process information is available.
 
*Most complexes contain some wording such as: "In human, it is composed of..." BUT, this is getting messy where subunit composition is different in different branches of the tree of life and different groups/MODs add their own examples. Should these just go in as NARROW synonyms? [to be discussed]
 
*Complexes defined by their subunits but functionally identical to a more generic parent term should not be created as separate GO terms but added to the parent term as synonyms. The specific complex belongs in the Complex Portal.
 
[DOS to look into some automatic reasoning across subunits but we think it may become tricky.]
 
 
=== Rule 6: Where is the complex located? ===
 
*Indicate cellular location as specifically as possible, unless parent already has one.
 
*The CC is for the complex as a whole. We discussed this in the context of transmembrane complexes with members that are only located on one side of the membrane or have no membrane attachment at all. As gene products have the part_of relationship with the complexes this is fine (and the only way of reflecting the CC for the complex as a whole).
 
*If we have complexes defined by their location (see below under 'Future plans'), does the reasoner take the part_of relationship to place them automatically into the right complex-by-location branch? [DOS?]
 
=== Adding appropriate is_a relationships ===
 
*We are trying to avoid placing complexes as direct is_a children of 'protein complex' but add some granularity to the ontology.
 
*An is_a parent of a complex can be a
# complex defined by its activity, via the complex-by-activity TG template
# complex defined by its location, such as 'plasma membrane complex'. [Can we have a complex-by-location template?]
# complex defined by its subunit composition. This may be related to protein families but it may be difficult to make it a rule/template (see below)
*We decided to NOT define complexes by their process or MF binding as they would become too generic.
 
*Complexes can have multiple parents!
 
=== Adding appropriate part_of relationships ===
 
*All complexes should have a part_of link to a cellular component, even if it's very generic, such as 'cell'.
*CC does not have to be added manually if it's the same as the parent term as it will be inferred.
*If the CC is more specific than the parent the part_of relationship must be added manually.
*Complexes can be subcomplexes of larger entities and can therefore be part_of another protein complex. BUT they must ALWAYS be part of this complex. If a complex can be part of several larger complexes separate terms may have to be considered. [This point is still open to discussion, see https://sourceforge.net/p/geneontology/ontology-requests/10745/, now with DOS]
 
== How to request protein complexes in GO based on the above (TG template, TG freeform) ==
 
*If the complex is generic and its function exists as a GO term, use the complex-by-activity template (and add relevant synonyms as discussed above).
 
*If the function does not yet exist in GO but is clearly defined, create the new MF term first (via SF or TG FF depending on the curator's experience), then create the CC term for the complex via the template.
 
*If the complex-by-activity template is not applicable, create the complex term either via SF or TG FF depending on the curator's experience.
 
*IntAct is happy to curate requested complexes into the Complex Portal at the same time as adding to the GO structure. Curators are encouraged to curate complexes directly into the Complex Portal after being trained by IntAct. SGD are doing this already.
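The request cases listed above amount to a short decision procedure. The sketch below restates them; the function name and the step strings are hypothetical, and the abbreviations SF and TG FF are kept as written in the list.

```python
def request_route(template_applicable: bool, mf_term_exists: bool) -> list[str]:
    """Return the ordered steps for requesting a complex term."""
    if not template_applicable:
        # The complex-by-activity template does not fit this complex.
        return ["create the complex term via SF or TG FF"]
    steps = []
    if not mf_term_exists:
        # The function is clearly defined but not yet in GO: add it first.
        steps.append("create the MF term via SF or TG FF")
    steps.append("create the complex term via the complex-by-activity template")
    return steps


both_steps = request_route(template_applicable=True, mf_term_exists=False)
```

The ordering matters in the second case: the template needs an existing MF term to refer to, so the MF term has to be created before the complex term.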
 
== Future plans ==
as discussed in a meeting with Birgit Meldal, Sandra Orchard, David Osumi-Sutherland and Paola Roncaglia on 28/4/2015
 
We discussed how we can make 'quick gains' in making the ontology more granular beyond the fixes Birgit does on a case-by-case basis. This is to target historic terms that have only 'protein complex' as a parent because they have no annotation extensions. The aim is to have most complexes grouped either by their function, location or subunit composition.
*Do a pass through term names and definitions to find major groups of complexes that can be grouped by function, e.g. catalytic complexes (the term exists but many historic terms have not automatically been classified as such as they have no capable_of extensions) [BM, SO & DOS]
 
*Add parent terms based on location, such as membrane complexes and their children. Can the reasoner place complexes automatically into this branch based on their part_of relationship (see Rule 6 above)? Should we have a TG template for this? [BM, DOS]
 
*We discussed grouping by protein families but this may be tricky. Decide on a case-by-case basis. A working example is the BCL protein family complexes, which cannot be grouped by function as they may be pro- and/or antiapoptotic.
 
== Previous work ==
 
Emily started documentation here, in case it's helpful, but this wasn't worked on since 2011:
http://wiki.geneontology.org/index.php/Protein_Complex_ids_as_GO_annotation_objects
 
[Birgit] Inheritance of annotations:
I agree with the wiki, you cannot inherit MF from a complex to a subunit and even a CC is problematic, see the transmembrane example above. This needs more thinking about. I don't know what you are doing right now...
 
Orthologies:
We infer within taxon groups, e.g. human to mouse to rat or any other mammal etc, depending on where the exp evidence comes from. We systematically infer human-mouse. We have a few pombe complexes inferred from yeast (Sc!) but we don't do it systematically.
 
Paralogues:
We make inferences between related complexes in the same species when the gene products are very similar, e.g. hemoglobin chains for adult and developmental complexes.
 
'Large' complexes:
We have tackled the 'mediator' and we can now link to RNACentral for RNAs so time permitting we'll tackle the 'biggies' soon!
 
Pro:
We have a list of Pro complexes that we consult for refs.
 
- What IntAct is doing - a summary:
 
[Birgit] We didn't draw up an official set of rules but in summary this is what we do (and it pretty much matches what Paola says below and the wiki she cites):
A complex should be taxon agnostic but may be restricted to certain taxonomic groups, such as pro- vs eukaryotes.
... should contain subunits in the def
... should have a 'as precise as possible' part_of relationship to the CC (may have to create new terms here as well of course!) which can be a complex (in cases of subcomplexes) or a location
... have, if possible, capable_of and capable_of_part_of annotation extensions.
... should have an is_a relationship to an appropriate child term of 'protein complex'. This could be a term based on its composition or function but NOT based on the BP. If no appropriate term exists, we create one based on either of the two classes. There is now a TG template for creating complex-by-MF which makes curators' lives much easier :) If there is no appropriate CC or complex-by-MF parent the new complex will be a direct child of 'protein complex'.
 


== Useful links ==


IntAct Complex portal, http://www.ebi.ac.uk/intact/complex/
* [http://www.ebi.ac.uk/complexportal/ Complex Portal]


== Review Status ==
Last reviewed: 2023-09-07


[[Category:Ontology]]
Reviewed by: Peter D'Eustachio, Pascale Gaudet
----
[[Category:GO Editors]][[Category:Ontology]]

Latest revision as of 03:38, 30 January 2024

GO definition of a protein-containing complex

  • A protein-containing complex should be composed of more than one subunit (a protein and another protein or an RNA), forming a stable interaction that exists as a functional unit in vivo. All complexes in the component ontology are created under the general term GO:0032991 protein-containing complex.
  • Protein-containing complex terms should have 'complex' in the term label to avoid ambiguity. For example, the molecular function term GO:0004738 pyruvate dehydrogenase activity describes the enzyme activity whereas the cellular component term GO:0045254 pyruvate dehydrogenase complex describes the multi-subunit structure in which the enzyme activity resides.
  • Complexes should be as species-agnostic as possible; for example, if a homologous complex is present in different species and has a different subunit composition, the definition should either be more vague about the number of subunits, or explain how the complex differs between species.

Textual definition for protein-containing complex terms

  • The textual definitions of protein-containing complex terms should start with "A protein-containing complex that", and continue with one or more of "catalyzes" (some enzymatic activity), "is capable of" (some molecular function) and "consisting of" (followed by a list of the components).
  • Term definitions for protein-containing complexes should be generic and species-agnostic as much as possible. To provide guidance, it is possible to add specific components for a (small) number of species, formulated as 'For example, in human this complex contains...' as a definition gloss or term comment.

In scope

  • Complexes that exist in an in vivo, physiologically relevant context.
  • Homomultimeric proteins, e.g. the homodimeric alcohol dehydrogenase, may be included as cellular component terms, as should heteromultimeric proteins, e.g. hemoglobin with alpha and beta chains.
  • Enzyme/substrate or receptor/ligand interactions that are a critical part of the complex assembly (e.g. PDGF receptors only become 'dimeric' when linked by the dimeric ligand, forming a tetramer).

Out of scope

  • Complexes of one gene product with a cofactor, e.g. heme, chlorophyll, magnesium.
  • Enzyme/substrate, receptor/ligand or any similar transient interactions, unless these are a critical part of the complex assembly. These unstable interactions should be captured with 'GO:0005488 binding' or 'GO:0005515 protein binding'.
  • Putative complexes where the only evidence is based on genetic interaction data.
  • Proteins associated in a pulldown/coimmunoprecipitation assay with no functional link or any evidence that this is a defined biological entity rather than a loose affinity complex. In other words, a bona fide complex should form under physiological conditions as part of an evolved function; things formed in vitro as part of an experimental procedure are assays.
  • Partial complexes and subcomplexes. Note that crystallization experiments often use partial complexes, for technical reasons: some subunits (e.g. transmembrane subunits) cannot be expressed as recombinant proteins and are 'left out' of detailed studies. More reading is often necessary to find out what the full complex is thought to be.
  • Complexes differentiated from their parent by the cell type in which they are present.
  • Complexes should NOT be defined by their stoichiometry, though this may be mentioned in the definition as a definition gloss, or in a comment. The rationale behind this recommendation is that, as knowledge advances and more examples are found, definitions mentioning stoichiometry would have to be updated, causing a lot of work. Also, stoichiometry can vary in different organisms; it is better to keep the definition more general. It is perfectly fine though to mention something like 'usually consists of a catalytic and a regulatory subunit and possibly further accessory subunits...'.

Specific complexes ("instances")

  • GO describes general classes of concepts, not specific ones. Specific complexes, defined by their exact subunits in a specific organism, can be submitted to Complex Portal and/or Protein Ontology (PRO). These resources capture complexes with their exact subunit composition (similar to GO annotations).

Taxon constraints

For complexes known to be only present in certain taxa, curators are encouraged to provide this information, if applicable, when they request a new term, or come across an existing one that is missing useful taxon constraints. Typically there are prokaryote- and eukaryote-specific complexes, but this can apply to any complex.

Interontology links

Protein-containing complex link to MF

  • A protein-containing complex can be linked to a molecular function using the 'capable_of' relation. Note that these links cannot be used to annotate individual subunits to an MF, as an annotation to a protein-containing complex doesn't indicate which subunit is the active one.

Protein-containing complex link to BP

  • A protein-containing complex can be linked to a biological process using the 'capable_of_part_of' relation. These CC to BP relations can be used for inference of the BP from the annotation to the protein-containing complex.

Protein-containing complex link to CC

  • A protein-containing complex can be linked to a cellular anatomical entity using the 'part_of' relation.

How to request protein complexes in GO

Useful links

Review Status

Last reviewed: 2023-09-07

Reviewed by: Peter D'Eustachio, Pascale Gaudet