Talk:2010 GO camp Meeting Agenda
- 1 Day 1 morning session
- 1.1 9:00 Introductions and objectives of the meeting
- 1.2 GO – Ontology, annotation, tools and technical aspects
- 1.3 Binding documentation
- 2 Day 1 afternoon session
- 3 Day 2 Morning
- 4 Day 3 Morning
Day 1 morning session
9:00 Introductions and objectives of the meeting
- Introductions & Logistics: Serenella Ferro Rojas
- Poll for Thursday lunch reservations, depending on weather.
- Dinner at Brasserie la Bourse on the Carouge
- ~ 1.9 km from meeting site
Friday Reception at noon for Amos Bairoch celebration of the Otto Naegeli prize.
Goals: Pascale Gaudet
GO – Ontology, annotation, tools and technical aspects
Chairs: Serenella Ferro Rojas and Pascale Gaudet
An introduction to the GO ontology : terms, definitions, synonyms, relationships, cross-products. Jane Lomax
- Inter-ontology links
- Most tools don't make inferences across the ontologies. Make redundant annotations.
- Cross products
- between GO ontologies
- external ontologies (cell ontology; CHEBI)
- Ontology development
- large scale targeted projects
- logical consistency
- small scale requests (Sourceforge tracker; future via Amigo)
Q/A: classical relationships (e.g. part_of within an ontology) are a subset of cross-products.
General overview of the annotation guidelines used by GO, and contributing resources. Rama Balakrishnan
- Annotation guidelines
Goal: say as much as possible about a gene product. Be useful to bench and computational biologists.
- GO annotation: Gene product association with GO terms and other info.
- gene product identifiers
- GO term
- Evidence code
- Additional info
- Annotation detail (16)
- PAINT (new)
- inter-ontology inferences (new)
Differences between previous GO camps and this one. This one is more internal and focused on strengthening guidelines.
- Challenges ...
- Avoiding redundancy.
- Authoritative sources
- no MOD - UniProt-GOA.
General overview UniProtKB/SwissProt manual annotation. Serenella
- protein selected for manual annotation based on priorities
- Recent papers chosen for high impact
- Curation of specific processes (e.g. ubiquitin-like conjugation)
- User requests
- sequence curation
- One record for all different products for the same gene
- Sequence analysis. - automated. manual checking. domains, ptms, etc.
- Literature curation. Species, protein names, gene names, journals, tissues, plasmids
- Store as comment lines free text with controlled tags(?)
- Sequence annotation of features (relation to SO?)
- GO annotation 50 curators, Automated: spkw2go, mappings2GO, etc.
- Family-based curation
- QA and integration
- e.g. throw error when nucleus kw for bacterial protein
A: linked to parent ID - ACCESSION_#
Q: Connection between references and items.
A: Findable in the XML. This is being retrofitted to older entries.
Q: What is the unit of annotation - Genes, isoforms?
A: Isoforms yes. Not yet things like cleavage products, but should be in the future.
Binding documentation
- Chairs: Ruth Lovering and Ursula Hinz
- Minutes: Jim Hu - Damien Lieberherr
- Working group: 2010_GO_camp_working_groups_composition
- Working group notes: Binding documentation issues
Binding has been discussed at three consortium meetings.
Ursula Hinz presents guidelines on binding annotation (see presentation)
- Binding biological entity (not today)
Binding of macromolecules
- If possible, use one of the numerous child terms of GO:0005515 protein binding
- Protein binding should always be annotated with IPI evidence code
- Curators must use the “with” column for interaction partner
- Do not forget reciprocal annotation
- IPI for specific proteins
- Use IDA evidence code if the partner cannot be identified, i.e. IDA for classes of protein
- Annotation with IPI should not be propagated with ISS, but child terms can
- No use of the NOT qualifier with GO:0005515 Protein binding because it means no interaction with other proteins in any circumstances
- NOT with child terms is OK.
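The macromolecule-binding rules above could be checked automatically. The following is a minimal sketch, not an existing GO consortium tool: the `Annotation` record and the `check_protein_binding` function are hypothetical names, the record is much simpler than a real annotation line, and only three of the listed rules are encoded (IPI for GO:0005515, a partner in the "with" column for IPI, no NOT with GO:0005515). IDA on child terms passes, in line with the rule for unidentified partners.

```python
from dataclasses import dataclass
from typing import List

PROTEIN_BINDING = "GO:0005515"  # 'protein binding', as named in the guidelines above


@dataclass
class Annotation:
    """Hypothetical, simplified annotation record (not a full annotation line)."""
    gene_product: str
    go_id: str
    evidence: str          # e.g. "IPI", "IDA", "ISS"
    with_from: str = ""    # the "with" column: identifier of the interaction partner
    qualifier: str = ""    # e.g. "NOT"


def check_protein_binding(a: Annotation) -> List[str]:
    """Return the guideline violations found for one annotation."""
    problems = []
    if a.go_id == PROTEIN_BINDING:
        if a.qualifier == "NOT":
            problems.append("NOT must not be used with GO:0005515")
        if a.evidence != "IPI":
            problems.append("GO:0005515 should be annotated with IPI")
    if a.evidence == "IPI" and not a.with_from:
        problems.append("IPI requires an interaction partner in the 'with' column")
    return problems
```

An empty list means none of the encoded rules was violated. The reciprocal-annotation and ISS rules are left out because they need the whole annotation set rather than a single record.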
Binding small molecules
- To avoid redundant annotation, GO terms for small molecule binding should not be annotated when they are already mentioned in the MF GO term
But sometimes it is not clear or not included in the description of the MF GO term, so it can be annotated (see example in the presentation)
- avoid redundant annotation of substrates, including transporter substrates
- e.g. ATP binding for ATPases (exceptions where hydrolysis not shown)
- Example DNA demethylase/dioxygenase
- are annotations to alkylated DNA binding, O2 binding etc. redundant.
Q: protein binding - evidence that it does not bind a specific protein. Need a new GO term?
A: No. Use column 16 or create new GO term. Still in discussion. GO terms if the proteins can be put into groups. Don't want specific protein terms.
Q: What is wrong with having 25K GO terms?
A: Does it matter? May be able to do all PRO classes. Instantiate as needed.
Comment: NOT terms. IntAct only annotates negative interactions for isoforms where a different isoform has a positive interaction. Negatives are not exported to GO.
Judy summary: discussion of are we going to instantiate lots of protein binding terms. PRO families could be used for terms. Column 16 could be used for NOT and specific isoforms.
Emily: some things are not well captured by GO.
Is there possible redundancy if there is annotation of the MF without experimental evidence and the indication of the target binding in column 16 (e.g. the target protein is a transcription factor and MF term is transcription factor binding without evidence)? Is this a source of inconsistency between organism-specific annotation?
The level of experiments is different among organisms (e.g. yeast vs human) which implies different ways of doing annotation. This is not seen as a negative point.
Annotation extension discussion
- Annotation extension = column 16
- Should only be used for direct targets.
- Co-IP. Lnx-1 and Boz. Use two txn factor binding annotations with IPI and with for partner.
- Q: Do we need exp evidence that (e.g.) Boz is a txn factor?
- A: curator judgement at present. Rama: SGD would read the paper and check other annotations of Boz, not just rely on the assertion in the paper. The same paper does not have to show Boz is a txn factor. Ruth: in humans, would use sequence analysis, e.g. domains. Actually SGD doesn't annotate protein binding.
Paul: Annotations for the target must exist somewhere. Does this create redundancy to annotate binding to proteins of function X where target has function X?
Jane: Won't always be function terms. e.g. LIM binding domain binding.
Ruth: GOC still needs more discussion.
Judy: no inconsistency in what SGD does and what Ruth does. Annotations are consistent but SGD chooses different annotations to make. MODs bring specific special experimental strengths. This is a difference, not an inconsistency.
Mike L.: Biogrid curation does a lot of this. How much can be transferred. Ruth: more on this later.
- Column 16 example: Lnx-1 ubiquitinates Boz but not Gsc.
- Annotation. Lnx-1 has ubiquitin-protein ligase activity IDA Col 16:Boz
- Annotate protein ubiquitination IDA w/o target.
Q: problem of propagation across species. Col 16 identifier is species-specific.
A: Transferring from human to mouse. Use col 16 or not?
One problem raised with the column 16 is the annotation propagation by ISS, because the ID used in column 16 is species specific. Alternatives:
- Column 16 should be excluded from propagation by ISS, which is consistent with the current ISS procedure for with/from
- Column 16 should use protein classes from sources like PRO to allow propagation
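The two alternatives can be compared in a small sketch. Everything here is hypothetical: the `Annotation` record, the `transfer_by_iss` function and the identifiers in the usage note are illustrations, not an agreed procedure. The sketch assumes that an ISS annotation records the source gene product in with/from, and that the caller knows whether a column 16 identifier names a protein class.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Annotation:
    """Hypothetical, simplified annotation record."""
    gene_product: str
    go_term: str
    evidence: str
    with_from: str = ""
    extension: str = ""   # column 16: a specific target ID or a protein class ID


def transfer_by_iss(source: Annotation, target_gene_product: str,
                    extension_is_class: bool = False) -> Annotation:
    """Build an ISS annotation for target_gene_product from a source annotation.

    Column 16 is dropped unless the caller states that it holds a protein
    class, which is not tied to one species.
    """
    kept_extension = source.extension if extension_is_class else ""
    return replace(
        source,
        gene_product=target_gene_product,
        evidence="ISS",
        with_from=source.gene_product,  # assumption: ISS points back at the source
        extension=kept_extension,
    )
```

With a made-up source annotation `Annotation("human:Lnx-1", "ubiquitin-protein ligase activity", "IDA", extension="human:Boz")`, the default call yields a mouse annotation with an empty column 16 (first alternative). Passing `extension_is_class=True` keeps the extension (second alternative).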
Q: is this redundant annotation of enzyme substrates?
A: No, we are doing substrate binding if the GO term does not provide the information.
Judy: knowledge statements vs description of the experiment.
Jim: column 16 post composition is equivalent to creation of a precomposed term, so ISS should be allowed (as appropriate, depending on whether the 16 ID is a class vs a specific product).
Paul: Think in terms of how we will do this with PAINT. We are annotating to ancestor nodes.
Comment: is the discussion generalizing? More general solution is to associate records with an external reference. Relational structure problem. In terms of binding let the protein interaction databases handle these.
Several people suggest that we should not have terms like "txn factor binding".
Ruth: Quick summary
- Use the with column with IPI annotations if the GO term definition does not provide the information
- Use column 16 for target
- In disagreement about propagation of column 16 by ISS
- Ideally, use info from with or col 16 to make inferences about the function of the protein. Other functions could come from other annotations of the target.
Kimberly: this has major implications for display. Keep the more specific terms (at least for now).
Ruth: enumeration of the kinds of targets could make things less clear.
When not to use Col 16
- For indirect targets
- FGF2 -> receptor -> phosphorylation of Erk2 goes up. Erk2 is NOT a direct target of FGF2. Activation goes via Ras.
Ruth gives an example of when annotators should not use column 16 (see presentation). She mentioned that the relationship ontology is being renamed. Using the relationship ontology with has_input (substrate) and has_output (product) together with CHEBI IDs in column 16 is a complicated way of annotating. To simplify annotation, it is proposed not to use the relationship ontology here and instead to put a RHEA ID (reaction database) in column 16, which gives both substrate and product information.
The annotation rules specify that catalytic activity terms should not be annotated with the evidence code IPI. There are 144 of these annotations in the GO database and 88 are from SGD. The evidence code IMP is stronger and should be preferred for the annotation. However, particular cases can occur and they have to be considered individually.
Col 16 relationship ontology
Relationships go along with the ID in Col 16.
- Lnx-1 is_a ub protein ligase IDA has_input Boz.
Col 16 and CHEBI
Concerning the annotation of small molecule binding, the idea is that small molecules could be mentioned in column 16 of an MF term which does not already describe the molecule in its definition. There can be inconsistency when annotating calcium binding (small molecule binding), because calcium binding may or may not be required for the function. This calcium binding issue has to be discussed further.
Annotation in the column 16 provides a certain level of knowledge (e.g. the function of the target protein is known) which could be also displayed. What should be annotated in column 16 and how far to go (e.g. annotation of small molecule binding with CHEBI ID) and where to stop? There are concerns on how far to push up the annotation in GO regarding what GO has been defined for: describe what the genes are doing.
Example: steroid hydroxylase.
- CYP11B2 is_a steroid hydroxylase activity IDA has_input CHEBI:16827 Corticosterone
- CYP11B2 is_a steroid hydroxylase activity IDA has_output CHEBI:27584 Aldosterone
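A relationship paired with an identifier, as in the examples above, can be handled as structured data. The sketch below is illustrative only: it assumes a textual form `relation(ID)` with commas between several expressions, which the minutes do not specify, and `parse_extensions` is a hypothetical helper name.

```python
import re
from typing import List, Tuple

# One expression of the assumed form relation(ID), e.g. has_input(CHEBI:16827)
_EXTENSION = re.compile(r"^\s*(\w+)\(([^()]+)\)\s*$")


def parse_extensions(column16: str) -> List[Tuple[str, str]]:
    """Split 'rel(ID),rel(ID)' into (relation, identifier) pairs."""
    pairs = []
    for part in column16.split(","):
        if not part.strip():
            continue  # tolerate empty input and trailing commas
        match = _EXTENSION.match(part)
        if match is None:
            raise ValueError(f"not a relation(ID) expression: {part!r}")
        pairs.append((match.group(1), match.group(2).strip()))
    return pairs
```

For example, `parse_extensions("has_input(CHEBI:16827)")` returns `[("has_input", "CHEBI:16827")]`, while a bare identifier such as `"Boz"` is rejected because no relationship accompanies it.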
Where do we draw the lines with respect to specificity continues to be an issue of discussion.
Kimberly: Connections between CHEBI IDs and process terms - how will these be handled by GO. Will CHEBI IDs in function ontology propagate to process terms.
IPI and catalytic activity. Deprecate these?
- Rama: in SGD these came from combination of IPI and IMP evidence (Editorial comment: this is because SGD doesn't do GO:0005515).
Binding is not sufficient to infer activity by itself. GO does not capture multiple experiments in a single annotation. This is a general problem.
Judy: rules are made to be broken. (!)
Interaction with the IMEx consortium.
Results of the survey
- Consistency of the annotators on evidence code usage, but difference in MF terms annotation (parent vs child term)
- Seems ok to use column 16 in case of MF term, but not in case of BP term
Possible action items
More discussion by the working group:
- ISS propagation of binding across species requires additional discussion. Should column 16 identifier be to a class. Should column 16 be transferred in ISS transfer.
- CHEBI IDs and process terms - how will these be handled by GO. Will CHEBI IDs in function ontology propagate to process terms.
Day 1 afternoon session
Annotation and Annotation Propagation
HAMAP presentation (Alan Bridge)
Rama: How do you know which annotations are propagated and which derived from literature?
Alan: By the evidence tags, e.g. references, by similarity etc.
Paul: You said that you don’t propagate isoforms?
Alan: Isoform information is sourced from TrEMBL, we don’t project any isoform information
Judy: How does UniProt envisage to integrate their system with all the other available orthology prediction sources, to ensure that everyone works with a common set of proteins/families for GO annotation propagation?
Suzi: There is an initiative to create a common set of sequences in a common set of species to start building orthology groups. A set of species has been prepared by Dan Barrell at the EBI.
Judy: this effort needs to understand its relationship with other propagation methods
Pascale: Rolf participated in QFO meetings, the current session is only to highlight the differences between methods
Alan: In a first step, UniProt will also compare the output of their annotations with those produced by the Reference genome project using PAINT on selected protein families.
Judy: HAMAP and Quest for Orthologs both have related groupings. Sets of proteins with similarities, what is your global view. The utility of this effort is integration into global network
Alan: We have integrated into InterPro, and see several trends emerging, from this we are separating into groups
Paul: If groups want to use HAMAP will they have to fill out identity card for their species in order for it to work properly
Alan: You can either specify a species most closely related to yours or can ask a curator to fill one in for you as it is a closed system and is quite involved process
Suzi: There will be a follow up meeting for QFO next year, other groups can join in and contribute
Compara presentation (Javier Herrero)
Judy: What is the source statement for GO annotations derived from Compara and how can all these annotations be retrieved?
Emily: Compara annotations are in the GOA database, there is a GOref 19 specific for Compara-derived annotations and their annotations are present in the UniProtKB-GOA GAF.
Reference Genome presentation (Pascale)
How are the ‘high quality’ protein sets defined that are used by the project?
The sequences are from different sources for the different species and are put in a standard format using UniProtKB accession numbers.
Tree-based GO annotation presentation (Paul)
Cecilia: Which GO term to choose to annotate nodes of common ancestors? Is it better to use less specific GO terms to be able to move up to a higher node in the phylogenetic tree?
Paul: It’s better to annotate to the most specific term possible (explained in more detail in the PAINT demo presentation of Mike)
PAINT demonstration (Mike)
Judy: Concerned that correcting already existing GO annotations on proteins by going back to already curated papers during the process of annotating a tree with PAINT may be too time consuming and is not very efficient.
Cecilia: When single sequences below an annotated node are deselected for GO annotation propagation (because of curator judgement), how are these ‘negative’ GO annotations shown to the user? Is it more useful to not have an annotation there or to have a NOT annotation there?
Paul/Mike: There are two possibilities. On the one hand, if annotation propagation has been deselected because of rapid divergence of a branch, the annotation is not shown at all in the concerned entries. If the annotation propagation has been deselected because of missing critical residues in the sequence, the GO annotation is propagated with a ‘NOT’ qualifier and is available to the user.
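The two outcomes Paul and Mike describe can be written as a small rule. This is a sketch under stated assumptions, not PAINT's implementation: the reason labels, the `propagate` function and the flat list of leaf sequences are all hypothetical simplifications.

```python
from typing import Dict, List, Tuple

# Hypothetical labels for why a curator deselected a sequence.
RAPID_DIVERGENCE = "rapid_divergence"
MISSING_RESIDUES = "missing_critical_residues"


def propagate(term: str, leaves: List[str],
              deselected: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Return (sequence, qualifier, term) rows for leaves below an annotated node."""
    rows = []
    for leaf in leaves:
        reason = deselected.get(leaf)
        if reason is None:
            rows.append((leaf, "", term))        # normal propagation
        elif reason == MISSING_RESIDUES:
            rows.append((leaf, "NOT", term))     # shown to the user as a NOT annotation
        elif reason == RAPID_DIVERGENCE:
            continue                             # no annotation shown at all
        else:
            raise ValueError(f"unknown deselection reason: {reason!r}")
    return rows
```

A sequence deselected for rapid divergence simply has no row in the output, whereas one missing critical residues carries an explicit `NOT` qualifier.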
'Response to' terms
Pascale’s presentation: http://wiki.geneontology.org/images/9/9b/WG-Response-to-Becky-Pascale.pdf. The aim of the working group is to improve the representation of biological responses. This has a lot of overlap with downstream events and signalling.
1. Definition is very wide
The current GO definition of “response to stimulus” is shown on slide 3. The definition is very broad, and as a result the term is being over-annotated. Slide 7 shows numbers of annotations to some high-level “response to” terms. There are a lot of child terms under these high-level terms which should be used if possible rather than annotating to the high-level terms. This doesn’t currently affect many annotations but annotation to high-level terms should be avoided in the future.
Judy: We seem to be spending a lot of time discussing a small number of annotations. And the annotations to high-level terms are not wrong. Curators wouldn’t use a high-level term if they can use a more specific one.
Rama: Sometimes curators use high level terms to group a number of child terms.
Kimberly: It’s not always clear when to create new terms.
Paul: ‘response to stress’ means a response to at least one stress. If the response is to more stresses, we should annotate to each stress.
Judy: Agrees with this. GO is now 12 years old. If there are few annotations, they are legacy and are fine.
Pascale: Would like the guidelines clarified for future use.
Judy: High-level terms haven’t been used much.
Pascale: We need to be careful about grouping stresses to a parent term as the parent terms then mean 2 different things. This is a general issue with GO. For example, DNA-binding can be annotated to both positive and negative strands. Binding to the parent term is not the same as annotating to multiple child terms.
Judy: Agree and need to clarify this if it is a confusing issue.
Li: If something is a general core factor and annotated to lot of child terms, is there a danger of over-annotating?
Pascale: If this is what it does, it’s not wrong to annotate to all the child terms.
Summary of above discussion from Pascale: Avoid annotating to high-level terms if possible. Annotating to child terms is preferable and is not equivalent to annotating to the parent term.
Proposal 1: High-level ‘response to’ terms should not be used.
2. Specific terms can be informative
Specific “response to” terms are very informative and should be used where possible.
Proposal 2: Encourage use of granular terms.
3. Inconsistencies in ‘response to’ annotations (see slide 11)
- Some groups only capture mediators of response
- Some groups capture targets
- Some groups don’t like microarray/expression experiments and annotate mediators by IDA when what is being measured is expression levels, e.g. a Western blot showing up-regulation in response to heat, where the correct evidence code would be IEP.
Pascale: How do people feel about mediators v targets as concepts?
Rama: For a transcription factor up-regulating proteins in response to stress, the transcription factor would be a mediator.
Pascale: We need to distinguish between targets and mediators. Do we want to annotate all targets or only factors that have a role in change in the expression or state of a cell?
Example of target: response to cadmium ion (see slide 14) - 8 spots are upregulated by cadmium exposure. The role of the proteins is not known. This is showing targets. The mediators are not known.
Kimberly: Maybe this is a case where IEP is not sufficient to capture what is going into the annotation process. We either need another evidence code or need to be able to show that multiple evidence codes have been combined to produce annotation.
Debbie: If this is the only experiment, it doesn’t matter what you know about the proteins. The only thing that’s known from the experiments is that they are up-regulated so they are targets.
Pascale: Some groups don’t capture this type of experiment.
Ruth: A heat shock protein is not a mediator with respect to transcription regulation so we should capture both targets and mediators. It’s not known that they are involved in the process. There is no way of knowing if up-regulated proteins have a role in ‘response to’ based on expression data. We need a consensus on if we can use IEP for ‘response to’.
Paul: This discussion is similar to those on how to structure signalling terms. Could imagine having terms that structure these.
Ruth: 2 years ago, it was suggested that all ‘response to’ terms should have sub-terms which should be annotated to rather than high-level terms. Microarray experiments could still use the high level terms that others consider are meaningless.
Jim: The children terms don’t match very well to the parents.
Pascale: Perhaps there’s a need to restructure the ontology.
Becky: Trying to get signalling under ‘response to’ node.
Jim: IEP as evidence code doesn’t justify this kind of experiment.
Pascale: Agrees but most of the IEP cases are in this part of the tree.
Tanya: Would most groups use this experiment to annotate to GO?
Pascale: This was already asked and groups were split in the middle.
Judy: MGI doesn’t use IEP. Other groups use it sparingly such as FlyBase & WormBase.
Alan: If you have a paper with such an experiment only, there are probably other papers which could be annotated in preference to such a paper with better or more data.
Judy: There is so much core information yet to capture that energies should perhaps focus on other areas.
Debbie: From a user point of view, many people will do an array experiment and look for enrichment of terms to create a hypothesis e.g. cadmium response in multiple organisms. For users, these types of annotations are potentially useful.
Li: Some groups are small and use this kind of data to boost numbers.
Alan: Can’t you get that kind of data from a primary repository?
Pascale: Yes, but you can’t do the same analysis as with GO.
Ruth: Agrees that some groups want to use IEP. Try to get it into other terms e.g. signalling terms and leave very high level terms for groups who feel they are necessary.
Val: Is IEP disallowed for molecular function?
Rama: Yes but it is still allowed for biological process.
Judy: Leave it as an evidence code but take into account the concerns people have raised. Enrichment analysis of microarray results should be used to generate hypotheses that lead to experiments, not to create annotations in themselves.
Jim: Happy to get rid of the IEP annotations from Pascale’s experiment. But the evidence code is useful for some experiments such as some yeast experiments which make more specific process-based inference which is when IEP should be used.
Proposal 3: Update “response to” definition as described on slide 12.
Becky: The new definition is more mediator-specific and is coherent with other parts of GO like signalling which don’t include targets in signalling terms. Would anyone object to the change in the definition?
Rama: Doesn’t mind changing the definition but is not sure about the proposed new definition.
Pascale: Working group can look into it and revise new proposed definition if necessary.
4. Concern about microarrays (see slide 17)
Pascale: Some people disregard microarrays but there’s a need to look more carefully at what’s been tested.
Alan: Most people use microarray as a first-pass and then do further experiments which are the ones that should be annotated.
Pascale: What about a microarray on wild-type versus mutant cells which finds differences in expression?
Alan: That’s a valid experimental system.
Proposal 4: Microarray hits should not be annotated to response to terms.
Val: Has annotated one paper from an array for ‘response to’ which are all core environmental response genes and all were further characterised.
Pascale: Agrees that this is fine.
Mike: Microarrays can provide numerical data. Should we ask for other experiments like 2D gels to demonstrate a numerical pattern?
Pascale: Every experiment is different. We can’t make general rules for these. It depends on the specific value you’re looking at.
Mike: Many microarray experiments use very specific algorithms to measure patterns but you don’t get same idea from a 2D gel.
Jim: Quantitative analysis of 2D gels happens.
Pascale: Don’t know enough about 2D gel or microarray data to make concrete proposal here.
Ruth: Microarray shows what mRNA is doing. Proteins don’t always follow these results. If we ban microarray data, it will be difficult to interpret proteomics results. Both should be considered in a similar vein.
Debbie: Thinking about this differently due to experience in lab where people did pulse-chase labelling, cut spots out of gels and did quantitative experiments. Gels can be quantitative, depending on what is done. Not comfortable about blanket ban on data from experiments such as microarray.
Pascale: Agrees. We don’t want to capture irrelevant data but we need to allow room for annotation from these experiments.
Becky: Did we resolve if we want to capture mediators and targets?
Pascale: We are capturing targets. Not sure why it’s a target. Heat shock response is downstream. Target of response and mediator of something else in response.
Mike L.: We need to separate out the concept of regulation. Up-regulation doesn’t mean that a protein is involved in response. We need to measure this more than just saying that it is on or off. There should be threshold.
Judy: Mike has a point but biology is sloppy. We can offer guidelines to avoid lack of rigour in annotations.
Debbie: Doesn’t understand why seeing the number of pixels changing in a microarray is more valid than other experiments such as Western blot.
Pascale: There is too much of a case-by-case basis to have guidelines.
Val: Is IEP allowed for annotation transfer?
Pascale: Yes, if the primary annotation is reasonable, it should be allowed.
Val: Maybe people would feel more comfortable if propagation not allowed.
Pascale: Has seen cases where propagation is reasonable. Happy to stop transfer if others agree but it can be valid to transfer annotation. Unless there’s proof that it’s not useful, we should continue.
Alan: Why specifically microarray? What about RNA-seq?
Pascale: The discussion also applies to these. It covers any technique which measures expression levels.
Paul: Can we clarify that for microarray, we mean any differential gene expression experiment? Anything that is differentially expressed is far downstream of the effect. It is downstream of the causing stimulus so we are not sure what it may do.
Jane: The definition change is fine but where does the process start and stop? Does the process describe the pathway between the start and stop or the whole thing?
Pascale: ‘response to’ terms are being reorganised as part of signalling pathways.
Becky: A signalling pathway ends with the trigger. There are also some downstream processes under signalling. The definition tries to include only genes which have an active role in a process, not those regulated by it.
Ruth: 2 points to add. 1. Proteomics experiments are fine. There’s no problem with them. 2. If we do ‘response to’ annotation by IEP, then for a protein that negatively regulates itself, you get over-expression in these assays as it’s being degraded too quickly to auto-regulate. The protein may be getting degraded before it gets to the nucleus. This is the type of thing to be concerned about when we say that we don’t want to do ‘response to’ with IEP.
Alan: A classic example of this is P53 where there is overexpression in tumors.
Paul: To summarise, we ought to think of biological process as an encoded program for the cell to do something and effects of stress are not part of the biologically encoded program. This is what we are trying to capture with ‘response to’.
Mike C.: Strike the word “microarray” as you mean anything that measures quantity. Microarrays don’t measure expression. What is meant is any technique which measures differential expression.
Day 2 Morning
Summary of ontology development
Chris Mungall presents rules for binding propagation (see presentation)
In the case of transcription factor activity which has DNA binding as parent, will it go to the same format? This has to be considered.
- It has been decided to add a has_part relationship as a link in the ontology.
- The propagation of the has_part relationship is not suitable in all cases (see example given in the presentation) and this makes the rules more difficult.
Example G capable_of ATPase activity -> G capable_of ATP binding
- Materialize relationships at central location
- Curator annotates to ATPase activity
- GAF pipeline materializes ATP binding using same EC
- Reimport allows query against ATP binding query to recover ATPases etc.
- Q: does redundancy of annotation raise issues? Probably not?
- Navigation via CHEBI too complex.
- is_a between ATPase activity and ATP binding
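The materialization step in the example above (curator annotates ATPase activity, pipeline adds ATP binding with the same evidence code) can be sketched as follows. This is a hypothetical illustration, not the GAF pipeline's code: `Row`, `HAS_PART` and `materialize` are invented names, term labels stand in for GO IDs, the reference in the usage note is a placeholder, and links are followed one step only.

```python
from typing import Dict, List, NamedTuple, Set


class Row(NamedTuple):
    """Hypothetical, reduced annotation row."""
    gene_product: str
    term: str
    evidence: str
    reference: str


# Hypothetical links between function terms: term -> terms it implies.
HAS_PART: Dict[str, Set[str]] = {"ATPase activity": {"ATP binding"}}


def materialize(rows: List[Row], links: Dict[str, Set[str]]) -> List[Row]:
    """Return rows plus inferred rows, keeping the original evidence and reference."""
    out = list(rows)
    seen = set(rows)
    for row in rows:
        for implied in sorted(links.get(row.term, ())):
            inferred = row._replace(term=implied)
            if inferred not in seen:   # do not duplicate an existing annotation
                seen.add(inferred)
                out.append(inferred)
    return out
```

Given `[Row("G", "ATPase activity", "IDA", "PMID:0000000")]`, the result also contains an ATP binding row for G with the same evidence code and reference, so a query against ATP binding recovers the ATPase. Running it again on its own output adds nothing.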
Automated population of ontology using intersection_of terms ... has_input + has_output
The has_part links will mainly be populated automatically in the ontology using MF X CHEBI logical definitions, but this can generate errors. It is also important to keep the original evidence code and original PubMed ID, which makes it possible to trace the materialized ATP binding annotation back to its source.
Concerning the problems of propagating has_part, why not use a link like “necessitate”? This could be an alternative.
Ontology will contain information to relieve annotators of making redundant annotations.
Q: How will the chain of evidence work for the materialized ATP binding added to the GAF. A: original EC, reference, and ...?
Q: Look at other ontologies, e.g. txn factors. A: Don't want txn factor as a child of binding.
Q: is materializing a permanent solution? A: See later discussion.
- Problem: software development assumes a prior version of the GO structure
- Links are only in GO_ext files.
- Future: more links. Software will have to catch up.
- Materialization service for function to process links
- Want to limit precomposition
- Annotate as if relationships are there
- When to request new term vs use col16 - would the term make sense in an enrichment analysis
- Reasoner can find equivalent terms if they exist, and materializer will add lines to the GAF.
Isoforms. No time to discuss
- Extensions provide greater expressivity
- Possibility of expressing things different ways, but reasoner can link synonymous annotations made in different ways by annotators.
Q: relationship matrix? A: this exists in part
- Gene search
- Term search
- View direct or include annotations to child terms
- More tools
- GOOSE: SQL environment
- precomposed SQL query list. Can request new ones via help
- GO slimmer
- Visualization - input GOIDs and see relationships
- OpenSearch - Browser widgets and OSX dashboard
- Homolog Set Summary - for reference genomes
- AmiGO labs - more stuff
- Cross-product term request will issue GOIDs for specific types of cross-products (regulation, part_of, downstream process terms)
- Coannotation - see genes annotated to two GO terms
- Gene search
- download options, web services
- Term search also shows co-occurrence with other terms. Default EC selection was discussed.
- Annotation views have filtering options.
- Unlike current AmiGO, taxon filtering uses hierarchical relationships.
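Hierarchical taxon filtering means that a filter on a higher-level taxon also returns annotations for the species beneath it. The sketch below shows the idea only: the `PARENT` table, the function names and the use of taxon names in place of taxon identifiers are assumptions for illustration, not AmiGO's implementation.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical child -> parent taxonomy links (names used instead of taxon IDs).
PARENT: Dict[str, Optional[str]] = {
    "Homo sapiens": "Mammalia",
    "Mus musculus": "Mammalia",
    "Mammalia": "Eukaryota",
    "Saccharomyces cerevisiae": "Eukaryota",
    "Eukaryota": None,
}

# (gene product, taxon, GO term label)
Annotation = Tuple[str, str, str]


def is_within(taxon: str, ancestor: str, parent: Dict[str, Optional[str]]) -> bool:
    """True if taxon equals ancestor or lies below it in the hierarchy."""
    current: Optional[str] = taxon
    while current is not None:
        if current == ancestor:
            return True
        current = parent.get(current)
    return False


def filter_by_taxon(annotations: List[Annotation], ancestor: str,
                    parent: Dict[str, Optional[str]]) -> List[Annotation]:
    """Keep annotations whose taxon is the given taxon or one of its descendants."""
    return [a for a in annotations if is_within(a[1], ancestor, parent)]
```

Filtering on "Mammalia" keeps human and mouse annotations and drops yeast, whereas a flat filter would match only annotations made directly to "Mammalia".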
Annotation of complexes
Minutes by Kristian Axelsen and edited by Mike Livstone
Quick summary of session: There has been a need to address the following situation: Complexes are multiprotein machines that carry out a specific process or reaction. While it is clear that there should be annotations to the process for the catalytic subunit, there is a desire to annotate, using experimental evidence codes, other subunits in the complex based on their membership in the complex. One proposal has been to create a new experimental evidence code "ICM" (Inferred from Complex Membership). The general consensus in the session was that this type of inference should not be made and, as a consequence, ICM should not be created.
More detailed notes:
The background for the sessions at this GO camp is that, in group annotation sessions covering sets of 5-10 genes, the same 3 types of problems kept appearing.
So the working groups were created to identify the issues, improve annotation, make annotation guidelines, and provide QC checks.
Bernd presented the current situation with a very broad definition of a complex, but stressed that "complex" terms should be defined so that they could be used in other organisms and not only in the organism where they were first seen.
Current Guidelines by Ontology:
- CC: gene products can be annotated to complexes; "colocalizes_with" qualifier also allowed. (slides 8, 9)
- MF and BP: Gene products are not annotated to complexes
- MF allows "contributes_to" in the context of a complex (slide 10)
- MF: catalytic and regulatory subunits can get different annotations (slide 20)
The use of contributes_to in the MF ontology was discussed. It is to be used for essential subunits only.
Annotations to MF should NOT be done based on IPI alone.
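The guideline points above lend themselves to an automated check. The following is a minimal sketch only: the record layout, the function name and the example gene and term names are assumptions made for illustration, not an existing GO QC script. It flags qualifiers used outside the ontology where the notes say they are allowed, and MF annotations whose only evidence is IPI. A real check would presumably need exceptions (for example for binding terms), which this sketch does not attempt.

```python
from collections import defaultdict

# Qualifier -> ontology aspect in which the guidelines above allow it.
ALLOWED_ASPECT = {"contributes_to": "MF", "colocalizes_with": "CC"}


def qc_check(annotations):
    """Return a list of (gene, term, message) warnings.

    Each annotation is a dict with the (assumed) keys:
    gene, term, aspect, evidence, qualifier.
    """
    warnings = []
    evidence_by_mf = defaultdict(set)
    for a in annotations:
        allowed = ALLOWED_ASPECT.get(a["qualifier"])
        if allowed is not None and a["aspect"] != allowed:
            warnings.append(
                (a["gene"], a["term"], f"{a['qualifier']} used outside {allowed}")
            )
        if a["aspect"] == "MF":
            # Collect every evidence code supporting this gene/term pair.
            evidence_by_mf[(a["gene"], a["term"])].add(a["evidence"])
    for (gene, term), codes in sorted(evidence_by_mf.items()):
        if codes == {"IPI"}:
            warnings.append((gene, term, "MF annotation supported by IPI alone"))
    return warnings


# Hypothetical records for three subunits of one complex.
annotations = [
    {"gene": "subunitA", "term": "catalytic activity X", "aspect": "MF",
     "evidence": "IDA", "qualifier": ""},
    {"gene": "subunitB", "term": "catalytic activity X", "aspect": "MF",
     "evidence": "IPI", "qualifier": "contributes_to"},
    {"gene": "subunitC", "term": "process Y", "aspect": "BP",
     "evidence": "IMP", "qualifier": "contributes_to"},
]
warnings = qc_check(annotations)
```

In this example subunitC is flagged for using contributes_to on a BP term and subunitB for an MF annotation resting on IPI alone, while subunitA passes.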
A lot of the discussion in the working group was concerned with how to annotate the subunits which are not responsible for the catalytic activity.
Working group suggestion: to create a new evidence code: ICM (Inferred from Complex Membership)
(Note: The consensus at the end of discussion was not to create this code.)
Furthermore, annotators were urged to get better at recording "unknown" as the MF when this is the case. It is acceptable not to know.
General consensus: We need to be more conservative when assigning MFs
This would also be more in line with the biologists' view.
Working group suggestion: From the evidence code documentation (IDA): "a fractionation experiment might provide "direct assay" evidence that a gene product is in the nucleus, but "protein interaction" (IPI) evidence for its function or process." Proposal 2: Remove this statement from the annotation documentation
General consensus: This statement should be removed (this was also a conclusion from the Binding session).
An important example that was discussed: Yeast RNA polymerase II vs. III. PolII is much better studied, and subunits that are indispensable for PolII function are annotated to transcription with "contributes_to." In contrast, the same level of detail is not available for PolIII, so all subunits get contributes_to transcription. This reflects the level of understanding for both complexes, but does not sit well with many curators because it means that in cases where we know less, we make more annotations.
Summary (by Paul Thomas): We would like to be able to annotate entire complexes to MF and BP. For single gene products we should only annotate a MF for the subunits essential for the complex activity.
The use of contributes_to was raised. Pascale said incautiously that personally, she would have no problem getting rid of contributes_to.
Again, it should only be used for MF annotations of the subunits essential for activity.
Minute taker's comment (KA): This is perhaps an issue for the next camp/the continued work of the working group
Another issue: MF terms are sometimes added to a complex based on early experiments. When more detailed knowledge appears and new terms are added, it should be possible (and easier) to remove the old annotations, even when they were added by different groups.
- Michael pointed out that ICM really is an ISS inference
- Paul says we need to be able to annotate complexes directly, the same way we annotate gene products.
Day 3 Morning
How is Downstream Effect defined (Rachael and Varsha)
Rachael and Varsha: Annotating to downstream processes
Minutes: Yasmin & Ursula
- Definition of down-stream process, as proposed by work group - everyone thinks this is OK
Examples (1-4): see presentation
- Discussion of Survey (see presentation)
Everybody does at least occasionally annotate down-stream processes.
Most participants felt that annotating down-stream effect was ok, when no other information was available. Many participants felt it would be desirable to revise such annotations at a later time, but that this was not always feasible for various good reasons (see presentation)
- Guideline 1: Request new, specific terms describing a process involved in another process. Example: for growth factor BMP2 that regulates cardiac cell differentiation, it is more informative to use a composite term, such as “regulation of transcription involved in cardiac cell differentiation” as opposed to using two unlinked terms, e.g. “regulation of transcription” and “regulation of cardiac cell differentiation”. (The terms do not exactly match the case of BMP2).
- Guideline 2: for small scale experiments one should annotate to the experimental evidence in the paper. However, use curator judgment, and also take account of the quality of the evidence, etc.
If a gene product has a central role affecting multiple down-stream processes one should only annotate the core process. When a gene product is specific for a particular pathway and/or has just a few targets, one should annotate the down-stream processes.
Discussion of examples:
a) yeast RNA polII subunit should only be annotated to the core process.
b) for proteins associated with the yeast spliceosome, annotation describing indirect effects has been removed.
c) S.pombe sre1 (direct transcriptional regulator of genes which have a role in heme and lipid biosynthesis): new terms should be requested, e.g. “Regulation of transcription involved in heme biosynthesis”
Li - Are we ready to go for this transcriptional regulation process in the GO - directed at Chris: everything involves transcriptional regulation - does GO want to represent this?
Chris - yes - we should represent this. For the time being we should use precomposed terms. Use AmiGO Labs to request terms. Later it may be possible to use column 16 instead.
Li: ACTION ITEM: ontology developers in the group should discuss this.
- Guideline 3: If a gene product has limited experimental literature, such as a newly characterized protein, it is acceptable to annotate to a more general 'downstream' process
Lively discussion of example of RNA polII subunit: should one keep the experimental annotation (indirect effects)?
Mike A: rpb2 is required for every transcription process; it is not useful to list indirect effects. The gene product should be annotated to the core process using ISS, and the phenotype-based experimental annotation should be removed. Describing the k/o phenotype is not informative.
- COMMENT: if it has a specific effect, one should keep both specific down-stream effect and description of core process.
Kimberly: rpb2 annotation originated from phenotype to GO mappings (ISS). We will review the pipelines issue.
Kimberly: How can one connect the core process with the biological knowledge? This is what is being tested in C.elegans. We need feedback
Sylvain: should one have GO terms for knockout data? Propose to use them only if there is further experimental characterization of the gene product. There are multiple phenotypes for any mutation, especially if these affect an important gene product. It is not the goal of GO to describe phenotypes.
Li: in this case this is a core process, but when the underlying function of a gene product is unknown, then making these annotations will give more information for the user
Many participants agreed with the above statements. But: it really depends on the MODs whether they want to keep the annotation or not.
Mike A: proposes to delete the evidence code IMP. IMP should be used very sparingly
Pascale: annotation based only on mutants may be misleading. In such cases, we’d need further information.
Kimberly: but users may want that information. If we can use a different evidence code then that will be welcome.
Rachael - if you didn't do IMP then what would you use? IDA?
Sylvain: HTP data effects were all annotated to different development processes
Pascale: at the organismal level it's hard to annotate directly
Michael: IMP is an absolutely valid code for some processes - need to set a boundary for when to use it for capturing phenotypes.
Mutant data can be essential and have yielded precious information.
Phenotypes should be captured using existing phenotype descriptions, and maybe by a dedicated database.
Need to take into account if there is a paper discussing the mutant phenotype, or if this is stand-alone (HTP) data. We want to capture what the authors are saying, and what was accepted for publication by the reviewers.
Judy: we need to clarify what is the appropriate use of these evidence codes
- Guideline 4: annotation of ligand receptor signaling pathways (intercellular vs. intracellular)
For intercellular signaling, the ligand is part of the pathway. For intracellular signaling, the ligand regulates the pathway.
Pascale: this is confusing.
Becky: the goal is to avoid over-annotation. Is the ligand part of the pathway?
Becky, Pascale: Yes. Varsha: this is another discussion.
Becky: ligand is part of pathway. The pathway ends when response is initiated
CONSENSUS: need to clarify where pathways start and end.
ACTION: take intracellular example back to signaling group for clarification.
Going through slides showing (simplified) insulin receptor pathway and NF-kappa-B pathway: everybody agreed.
Becky: a lot of the time you don't know that the stimulus/ligand/receptor is involved in multiple pathways; would prefer to request new terms, e.g. “X signaling involved in Y pathway”
Pascale: likes this representation; it is useful for helping to think about how to correctly represent the biology
Varsha: signaling diagram will be presented to the signaling group
Present summary slide (documentation) for dealing with cases such as RNA polII subunit (guideline 3): if reasonable, remove annotation of downstream effects once core activity is known. But: there may be good reasons to keep such annotation. It really depends on the case, and on the contributing MOD (see presentation).
Suggested Quality Control checks: (see presentation).
Discussion of survey example:
- Question 1: Functions of the deubiquitinating protease CYLD
Most (almost 90%) would annotate to core process = regulation of microtubule organization. A sizeable minority also selected downstream effects.
Ursula’s example. Ursula: was strongly inclined to stick to “regulation of microtubule cytoskeleton organization”. This protein regulates everything, and it does not make sense to annotate “everything”.
Rama - SGD would be cautious too
Ursula: survey result may reflect limited time available for doing the survey. Reading several papers on the subject makes a difference.
- Q2: Bre1-Histone H2B monoubiquitination regulates histone H3 methylation
Most (85%) of the participants chose a “histone ubiquitination” term
65% also chose histone methylation. Most participants would like “Regulation of histone methylation”, or a new term such as “Histone monoubiquitination involved in regulation of histone methylation”.
Val: would annotate to main activity of enzyme but also regulation - has many pombe papers with experimental evidence
Problem: “Regulation of pathway X” is part of pathway X. People are not sure how to display the experimental data.
Rachael, Sylvain: it is important to avoid ending up with all histone-modifying proteins having exactly the same annotation. They have distinct activities, and this should be shown by the annotation.
Val: will ask people to look into this.
Sylvain: each histone modification affects other histone modifications (positive and negative regulation). There are about 80 histone-modifying enzymes, and each has down-stream effects on other histone modifications.
Ruth: these are process terms and users would find it useful to know which genes are involved in this process: how to convey the information?
Sylvain: Propose to change the definitions of existing terms, so that ubiquitination includes effect on subsequent methylation, etc.
Sylvain: Propose to split “Regulates” into cases where “Regulates process X” is a part of process X, and cases where “Regulates process X” is NOT part of process X. Is this possible?
Paul: allow distinction between core process and regulation of core process
- CONCLUSION: Ontology should be revised, annotation checked.
- ACTION POINTS:
- revise process terms for transcription
- define start and end points of signaling processes