- NOTE: This is a work in progress. It needs to be wrapped up, and revised by editors, Becky and Birgit. Also, we need to add examples - what works and what doesn't.
- Last updated: 13/5/2015, Birgit Meldal; 14/5/2015, Paola.
Background and rationale
Recently, GO and IntAct have started to work together to improve the 'protein complex' branch in GO, making it less flat and more informative, and to provide species-agnostic GO terms that IntAct can reference in their species-specific curation projects, namely the Complex Portal (http://www.ebi.ac.uk/intact/complex/). At the time of writing (Spring 2015) the Complex Portal focuses on human, mouse, yeast and E. coli, but it can take direct curation requests as well (firstname.lastname@example.org). Other MODs are encouraged to collaborate directly.
Here, we collect current guidelines on protein complex terms, to aid GO curators in discerning whether a protein complex belongs in GO or not, and if yes, in including all necessary information when requesting a new protein complex.
Protein complexes in GO
Rule 1: Is the complex stable?
In GO, 'protein complex' is defined as "A stable macromolecular complex composed (only) of two or more polypeptide subunits along with any covalently attached molecules (such as lipid anchors or oligosaccharide) or non-protein prosthetic groups (such as nucleotides or metal ions). Prosthetic group in this context refers to a tightly bound cofactor. The component polypeptide subunits may be identical."
When in doubt, how can a curator figure out if a complex is really stable?
We can refer to the IntAct Complex Portal Rules. These are reported below for reference, and in a definition comment to 'protein complex' to serve as annotation guidance:
What can be described as a complex?
A stable set of (two or more) interacting macromolecules such as proteins which can be co-purified by an acceptable method and have been shown to exist as an isolated, functional unit in vivo. Any interacting non-protein molecules (e.g. small molecules, nucleic acids) will also be included.
What should not be captured:
- Enzyme/substrate, receptor/ligand or any similar transient interactions unless these are a critical part of the complex assembly (e.g. PDGF receptors only become 'dimeric' when linked by the dimeric ligand forming a tetramer).
- Proteins associated in a pulldown/coimmunoprecipitation assay with no functional link or any evidence that this is a defined biological entity rather than a loose affinity complex.
- Any literature complex where the only evidence is based on genetic interaction data.
- Partial complexes.
- If the complex is not stable, it's just protein binding. Interactions can then be captured by a protein-protein interaction DB such as IntAct.
- Beware of partial complexes shown experimentally, especially when crystallised. Some subunits (e.g. transmembrane subunits) cannot be expressed as recombinant proteins and are 'left out' of detailed studies. More reading is often necessary to find out what the full complex is thought to be.
Tricky cases we DO (or could) capture in the IntAct Complex Portal:
- Substrates or ligands if the enzyme or receptor complex only forms in their presence (see PDGF receptors above, e.g. EBI-9082861). These terms would also qualify for GO.
- Homologous proteins with the same functionality, which would be inferred by sequence similarity to form a complex but for which no physical link has been demonstrated. For example, proteins A and B have been shown to physically interact and form a functional complex; protein C is a homologue of protein B by sequence similarity and is known to have the same function as B, but the A-C interaction has not been demonstrated experimentally. E.g. SUMO - E1 ligase complexes where there is interaction evidence for binding with SUMO1 (EBI-9349603) but not with SUMO2 (EBI-9345927).
- The Complex Portal could also hold transient complexes, e.g. signaling complexes. We have not created any of these to date but they are possible, and controlled vocabulary terms exist to distinguish the two classes. BUT - they would probably fall outside the scope of GO if GO limit themselves to stable complexes.
- We can also curate complexes that lack full experimental evidence but are commonly regarded as existing, e.g. complexes submitted by ChEMBL for which we only have pharmacological evidence. These complexes are tagged with ECO:0000306 - inferred from background scientific knowledge by manual assertion. E.g. GABA (EBI-9008426) receptors and many other transmembrane receptors.
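The Rule 1 criteria above can be read as a simple checklist. The sketch below is only one possible reading of them: the function name and its boolean parameters are hypothetical labels introduced for illustration, and a real decision will usually need curator judgement (for example for the tricky cases listed above).

```python
def qualifies_as_go_complex(stable, copurified, functional_unit_in_vivo,
                            only_genetic_evidence=False, partial=False):
    """Illustrative Rule 1 checklist; all parameter names are assumptions."""
    # Genetic-interaction-only evidence and partial complexes are not captured.
    if only_genetic_evidence or partial:
        return False
    # A complex must be stable, co-purified by an acceptable method,
    # and shown to exist as an isolated, functional unit in vivo.
    return bool(stable and copurified and functional_unit_in_vivo)


# A stable, co-purified, functional assembly passes the checklist.
print(qualifies_as_go_complex(True, True, True))
# A pulldown with no evidence of a functional unit does not.
print(qualifies_as_go_complex(True, True, False))
```

If the check fails because the assembly is not stable, the interaction is treated as protein binding and can go to an interaction database such as IntAct instead.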
Rule 2: Is the complex species-agnostic?
- GO should host species-agnostic complexes, ideally conserved across taxa. Where this isn't known, we should still make the definition generic, and add 'For example, in human this complex contains...' as a definition gloss or definition comment.
- Species-specific complexes don't belong in GO, but IntAct/Complex Portal and/or PRO can take them. (We acknowledge that GO contains many historic terms that contravene this rule. For the time being, the agreement is that we will not review them globally, though we may fix them if and when we come across them.)
- We may, however, need taxon restrictions on a case-by-case basis, such as for complexes that only exist in prokaryotes or eukaryotes. Curators are encouraged to provide this information, if applicable, when they request a new term (or come across an existing one).
Rule 3: Does the complex have a molecular function?
- If yes, add capable_of link(s) to molecular function terms. These links are used by the reasoner to place the complex into the correct branch under 'protein complex'.
Rule 4: Is the complex known to be involved in one or more biological processes?
- If yes, add capable_of_part_of links to biological process(es).
- Note: we decided not to use BP as a qualifier for making grouping terms for complexes as these would become too unspecific, e.g. 'regulatory complex' could include most complexes!
Rule 5: Does the complex contain conserved subunits?
- GO does host complexes based on their subunits only, when no function or process information is available.
- Most complexes contain some wording such as: "In human, it is composed of..." BUT, this is getting messy where subunit composition is different in different branches of the tree of life and different groups/MODs add their own examples. Should these just go in as NARROW synonyms? [to be discussed]
- Complexes defined by their subunits but functionally identical to a more generic parent term should not be created as separate GO terms but added to the parent term as synonyms. The specific complex belongs in the Complex Portal.
[DOS to look into some automatic reasoning across subunits but we think it may become tricky. To be discussed on Editor's call.]
Rule 6: Where is the complex located?
- Indicate cellular location as specifically as possible, unless parent already has one.
- The CC location is meant for the complex as a whole. We discussed this in the context of transmembrane complexes where one or more members of the complex are located on one side of the membrane only or have no membrane attachment at all. As gene products have the part_of relationship with the complexes this is fine (and the only way of reflecting the CC for the complex as a whole).
- If we have complexes defined by their location (see below under 'Future plans'), does the reasoner use the part_of relationship to place them automatically into the right complex-by-location branch? [DOS?]
Rule 7: Adding appropriate is_a relationships
- We are trying to avoid placing complexes as direct is_a children of 'protein complex', by adding some granularity to this ontology branch.
- An is_a parent of a complex can be a:
  - complex defined by its activity, via the complex-by-activity TG template
  - complex defined by its location, such as 'plasma membrane complex'. [Can we have a complex-by-location template?] [Update: DOS added some useful protein complex grouping terms based on location...]
  - complex defined by its subunit composition. This may be related to protein families but it may be difficult to make it a rule/template (see below)
- We decided NOT to define complexes by their process or MF binding as they would become too generic.
- Note: Complexes can have multiple parents!
- Note: if capable_of MF links are added, and/or if location information is provided as part_of CC, the automatic assert-inference script will take care of placing most newly created protein complex terms more granularly in the ontology.
Rule 8: Adding appropriate part_of relationships
- All complexes should have a part_of link to a cellular component term, even if it's very generic, such as 'cell'.
- CC does not have to be added manually if it's the same as the parent term as it will be inferred.
- If the CC is more specific than the parent, the part_of relationship must be added manually.
- Complexes can be subcomplexes of larger entities and can therefore be part_of another protein complex. If the larger complex necessarily needs the smaller one as its component in order to be functional, a has_part link should also be added (larger complex has_part smaller complex).
- Complexes cannot have several part_of relationships to different CCs, as part_of must ALWAYS be true. If a complex can be part of several larger complexes or be found in several locations, such as cytoplasm and nucleus where it may have different functions, separate terms may have to be considered. [This point is still open to discussion, see https://sourceforge.net/p/geneontology/ontology-requests/10745/, now with DOS. To be discussed on Editor's call.]
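The part_of points in Rule 8 can be sketched as a small consistency check. This is an illustration only: `ComplexTerm`, `check_part_of` and the term names used below are hypothetical, the `part_of` list is assumed to hold both asserted and inherited targets, and the "several part_of" case is still open to discussion as noted above.

```python
from dataclasses import dataclass, field


@dataclass
class ComplexTerm:
    """Hypothetical stand-in for a protein complex term."""
    name: str
    # CC terms (locations or larger complexes) this complex is part_of,
    # whether asserted manually or inherited from the is_a parent.
    part_of: list = field(default_factory=list)


def check_part_of(term):
    """Return a list of Rule 8 issues found for the term (empty if none)."""
    problems = []
    targets = set(term.part_of)
    if not targets:
        problems.append("no part_of link: add one, even if generic such as 'cell'")
    elif len(targets) > 1:
        problems.append("several part_of targets: part_of must always be true, "
                        "so separate terms may have to be considered")
    return problems


print(check_part_of(ComplexTerm("example complex")))
print(check_part_of(ComplexTerm("example complex", ["cytoplasm", "nucleus"])))
print(check_part_of(ComplexTerm("example complex", ["plasma membrane"])))
```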
How to request protein complexes in GO based on the above (TG template, TG freeform)
- If the complex is generic and its function exists as a GO term, use the TG complex-by-activity template (and add relevant synonyms as discussed above).
- If the function does not yet exist in GO but is clearly defined, create the new MF term first (via SF or TG FF (freeform) depending on the curator's experience), then create the new complex term via the TG complex-by-activity template.
- If the complex-by-activity template is not applicable, create the complex term either via SF or TG FF depending on the curator's experience.
- If a complex is known to be involved in a broader biological process (but not to have a specific molecular function), request the new term using TG FF (by filling in the capable_of_part_of field), or using SF depending on curator's experience. TG FF allows both capable_of and capable_of_part_of links in case a function is known and a process too, but the function is not part of that process.
- IntAct is happy to curate requested complexes into the Complex Portal at the same time as adding to the GO structure. Curators are encouraged to curate complexes directly into the Complex Portal after being trained by IntAct. SGD are doing this already. Contact email@example.com for either use case.
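The request routes listed above can be summarised as a small decision function. This is a sketch of one reading of those bullets, not a tool that exists: the function name, its parameters and the returned step descriptions are hypothetical and are not part of TermGenie or SourceForge.

```python
def request_route(is_generic, has_function, function_term_exists, has_process):
    """Return the ordered request steps for a new complex term (illustrative)."""
    if is_generic and has_function:
        steps = []
        if not function_term_exists:
            # The MF term has to exist before the template can be used.
            steps.append("create MF term via SF or TG freeform")
        steps.append("create complex via TG complex-by-activity template")
        return steps
    if has_process:
        # Process known, but no specific molecular function.
        return ["request complex via TG freeform (capable_of_part_of) or SF"]
    # The complex-by-activity template is not applicable.
    return ["request complex via SF or TG freeform"]


# Generic complex whose function is clearly defined but not yet in GO.
for step in request_route(True, True, False, False):
    print(step)
```

In every branch the choice between SF and TG freeform still depends on the curator's experience, as stated above.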
Complex Definition field
We discussed the structure of the definition as there are lots of different ways to build it. At the moment, Birgit starts with the function (if applicable), such as 'A protein complex capable of X function...'. Most defs contain examples of subunits, but see above for difficulties. Other complex defs start with the list of subunits. We need to come up with a set of rules that suit most cases. Processes can also be mentioned in the def.
Complexes should NOT be defined by their stoichiometry, though this may be mentioned in the def as a 'soft' comment (definition gloss). The problem is that, as knowledge advances and more examples are found, stoichiometry defs have to be updated, causing a lot of work. It is perfectly fine though to mention something like 'usually consists of a catalytic and a regulatory subunit and possibly further accessory subunits...'. NB: Birgit created a lot of stoichiometry definitions in the beginning before we realised this was a bad idea!
[To be discussed on Editor's call, then discussed with Birgit and Sandra again.]
Future plans, as discussed in a meeting with Birgit Meldal, Sandra Orchard, David Osumi-Sutherland and Paola Roncaglia on 28/4/2015
We discussed how we can make 'quick gains' in making the ontology more granular beyond the fixes Birgit does on a case by case basis. This is to target historic terms that have only 'protein complex' as a parent because they have no annotation extensions. The aim is to have most complexes grouped either by their function, location or subunit composition.
- Do a pass through term names and definitions to find major groups of complexes that can be grouped by function, e.g. 'catalytic complex' (the term exists but many historic terms have not automatically been classified as such as they have no capable_of extensions). Other keywords: kinase, activity, viral, receptor, respiratory chain... [BM, SO & DOS]
- Add parent terms based on location, such as 'membrane complex' and children or 'mitochondrial complex'. Can the reasoner place complexes automatically into this branch based on their part_of relationship (see above Rule 6)? Should we have a TG template for this? [BM, DOS]
- Combine activity and location, such as 'membrane receptor complex'.
- We discussed grouping by protein families but this may be tricky. Decide on a case by case basis. A working example is the BCL protein family complexes, which cannot be grouped by function as they may be pro- and/or antiapoptotic.
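The keyword pass through term names mentioned in the first point above could start from something like the following. It is a minimal sketch: the function name is hypothetical, the sample term names are illustrative inputs rather than a real export from GO, and plain substring matching is crude (for instance 'viral' would also match 'antiviral'), so every hit would still need manual review.

```python
# Keywords taken from the list above.
KEYWORDS = ["kinase", "activity", "viral", "receptor", "respiratory chain"]


def group_by_keyword(term_names, keywords=KEYWORDS):
    """Map each keyword to the term names that contain it (case-insensitive)."""
    groups = {kw: [] for kw in keywords}
    for name in term_names:
        lowered = name.lower()
        for kw in keywords:
            if kw in lowered:
                groups[kw].append(name)
    return groups


# Illustrative term names only.
sample = [
    "protein kinase CK2 complex",
    "T cell receptor complex",
    "mitochondrial respiratory chain complex I",
    "exosome complex",
]
for keyword, names in group_by_keyword(sample).items():
    print(keyword, names)
```

Terms that match no keyword, like the last sample name, stay where they are and would need grouping by location or subunit composition instead.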
Emily started documentation here, in case it's helpful, but this hasn't been worked on since 2011: http://wiki.geneontology.org/index.php/Protein_Complex_ids_as_GO_annotation_objects
Birgit's comments [from email to Paola]:
Inheritance of annotations: I agree with the wiki, you cannot inherit MF from a complex to a subunit and even a CC is problematic, see the transmembrane example above. This needs more thinking about. I don't know what you are doing right now...
Orthologies: We infer within taxon groups, e.g. human to mouse to rat or any other mammal etc, depending on where the exp evidence comes from. We systematically infer human-mouse. We have a few pombe complexes inferred from yeast (Sc!) but we don't do it systematically.
Paralogues: We make inferences between related complexes in the same species when the gene products are very similar, e.g. hemoglobin chains for adult and developmental complexes.
'Large' complexes: We have tackled the 'mediator' and we can now link to RNACentral for RNAs so time permitting we'll tackle the 'biggies' soon!
Pro: We have a list of Pro complexes that we consult for refs.
What IntAct is doing - a summary:
We didn't draw up an official set of rules but in summary this is what we do (and it pretty much matches what Paola says below and the wiki she cites): A complex should be taxon agnostic but may be restricted to certain taxonomic groups, such as pro- vs eukaryotes.
... should contain subunits in the def
... should have an 'as precise as possible' part_of relationship to the CC (may have to create new terms here as well, of course!), which can be a complex (in cases of subcomplexes) or a location
... should have, if possible, capable_of and capable_of_part_of annotation extensions.
... should have an is_a relationship to an appropriate child term of 'protein complex'. This could be a term based on its composition or function but NOT based on the BP. If no appropriate term exists, we create one based on either of the two classes. There is now a TG template for creating complex-by-MF, which makes curators' lives much easier :) If there is no appropriate CC or complex-by-MF parent, the new complex will be a direct child of 'protein complex'.
IntAct Complex Portal, http://www.ebi.ac.uk/intact/complex/