Talk:2010 GO camp Meeting Agenda
- 1 Day 1 morning session
- 1.1 9:00 Introductions and objectives of the meeting
- 1.2 GO – Ontology, annotation, tools and technical aspects
- 1.3 Binding documentation
- 2 Day 1 afternoon session
- 3 Day 2 Morning
- 3.1 Binding continued
- 3.2 GO browsers
- 3.3 Annotation of HTP data
- 3.4 Annotation of complexes
- 4 Day 3 Morning
- 5 Day 3 afternoon session: Future plans
Day 1 morning session
9:00 Introductions and objectives of the meeting
- Introductions & Logistics: Serenella Ferro Rojas
- Poll for Thursday lunch reservations, depending on weather.
- Dinner at Brasserie la Bourse on the Carouge
- ~ 1.9 km from meeting site
Friday Reception at noon for Amos Bairoch celebration of the Otto Naegeli prize.
Goals: Pascale Gaudet
GO – Ontology, annotation, tools and technical aspects
Chairs: Serenella Ferro Rojas and Pascale Gaudet
An introduction to the GO ontology : terms, definitions, synonyms, relationships, cross-products. Jane Lomax
- Inter-ontology links
- Most tools don't make inferences across the ontologies. Make redundant annotations.
- Cross products
- between GO ontologies
- external ontologies (cell ontology; CHEBI)
- Ontology development
- large scale targeted projects
- logical consistency
- small scale requests (Sourceforge tracker; future via Amigo)
Q/A: classical relationships (e.g. part_of within an ontology) are subset of cross-products.
General overview of the annotation guidelines used by GO, and contributing resources. Rama Balakrishnan
- Annotation guidelines
Goal: say as much as possible about a gene product. Be useful to bench and computational biologists.
- GO annotation: Gene product association with GO terms and other info.
- gene product identifiers
- GO term
- Evidence code
- Additional info
- Annotation detail (16)
- PAINT (new)
- inter-ontology inferences (new)
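A minimal sketch of the annotation record listed above, assuming a simple in-memory representation. The class name, field names and the placeholder identifiers are illustrative only and are not taken from any GO tool or file format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Annotation:
    """One gene product to GO term association (illustrative field names)."""
    gene_product_id: str                    # gene product identifier
    go_id: str                              # GO term
    evidence_code: str                      # e.g. IDA, IPI, ISS, IEP
    reference: str = ""                     # additional info: source reference
    with_from: List[str] = field(default_factory=list)  # additional info: "with" column
    qualifier: str = ""                     # e.g. NOT
    extension: str = ""                     # annotation detail (column 16)

    def summary(self) -> str:
        """Render the core fields as a single space-separated string."""
        parts = [self.gene_product_id]
        if self.qualifier:
            parts.append(self.qualifier)
        parts += [self.go_id, self.evidence_code]
        if self.extension:
            parts.append(f"ext={self.extension}")
        return " ".join(parts)


# Placeholder identifiers, not real accessions.
example = Annotation("GENE_PRODUCT_1", "GO:0005515", "IPI",
                     with_from=["GENE_PRODUCT_2"])
```

Keeping the "with" column and column 16 as separate fields mirrors the distinction drawn in the binding discussion later in these notes.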
Differences between previous GO camps and this one. This one more internal and focused on strengthening guidelines.
- Challenges ...
- Avoiding redundancy.
- Authoritative sources
- no MOD - UniProt-GOA.
General overview UniProtKB/SwissProt manual annotation. Serenella
- protein selected for manual annotation based on priorities
- Recent papers chosen for high impact
- Curation of specific processes (e.g ubiquitin-like conjugation)
- User requests
- sequence curation
- One record for all different products for the same gene
- Sequence analysis: automated, with manual checking (domains, PTMs, etc.)
- Literature curation. Species, protein names, gene names, journals, tissues, plasmids
- Store as comment lines free text with controlled tags(?)
- Sequence annotation of features (relation to SO?)
- GO annotation 50 curators, Automated: spkw2go, mappings2GO, etc.
- Family-based curation
- QA and integration
- e.g. throw error when nucleus kw for bacterial protein
A: linked to parent ID - ACCESSION_#
Q: Connection between references and items.
A: Findable in the XML. This is being retrofitted to older entries.
Q: What is the unit of annotation - Genes, isoforms?
A: Isoforms yes. Not yet things like cleavage products, but should be in the future.
Binding documentation
- Chairs: Ruth Lovering and Ursula Hinz
- Minutes: Jim Hu - Damien Lieberherr
- Working group: 2010_GO_camp_working_groups_composition
- Working group notes: Binding documentation issues
Binding has been discussed at three consortium meetings.
Ursula Hinz presents guidelines on binding annotation (see presentation)
- Binding biological entity (not today)
Binding of macromolecules
- If possible, use one of the numerous child terms of GO:0005515 protein binding
- Protein binding should always be annotated with IPI evidence code
- Curators must use the “with” column for interaction partner
- Do not forget reciprocal annotation
- IPI for specific proteins
- Use IDA evidence code if the partner cannot be identified, i.e. IDA for classes of protein
- Annotation with IPI should not be propagated with ISS, but child terms can
- No use of the NOT qualifier with GO:0005515 Protein binding because it means no interaction with other proteins in any circumstances
- NOT with child terms is OK.
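The macromolecule binding rules above could be checked mechanically. The following is a rough sketch of one reading of these notes, not an official GO consortium check: it only looks at the parent term GO:0005515, accepts IDA as the fallback when no partner can be identified, and uses hypothetical function names and placeholder identifiers.

```python
PROTEIN_BINDING = "GO:0005515"


def check_protein_binding(go_id, evidence_code, with_from, qualifier=""):
    """Return a list of guideline problems for one annotation (empty if none)."""
    problems = []
    if go_id != PROTEIN_BINDING:
        return problems  # this sketch only covers the parent term
    if qualifier == "NOT":
        problems.append("NOT must not be used with GO:0005515")
    if evidence_code == "IPI":
        if not with_from:
            problems.append("IPI needs the interaction partner in the 'with' column")
    elif evidence_code != "IDA":
        problems.append("GO:0005515 should use IPI (or IDA if no partner is identified)")
    return problems


def reciprocal(gene_product_id, go_id, evidence_code, with_from):
    """Build the reciprocal annotation for each partner in the 'with' column."""
    return [(partner, go_id, evidence_code, [gene_product_id])
            for partner in with_from]
```

For example, an IPI annotation of P1 with partner P2 passes the check, and `reciprocal` yields the matching annotation of P2 with partner P1.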
Binding small molecules
- To avoid redundant annotation, GO terms for small molecule binding should not be annotated when they are already mentioned in the MF GO term
But sometimes it is not clear or not included in the description of the MF GO term, so it can be annotated (see example in the presentation)
- avoid redundant annotation of substrates, including transporter substrates
- e.g. ATP binding for ATPases (exceptions where hydrolysis not shown)
- Example DNA demethylase/dioxygenase
- are annotations to alkylated DNA binding, O2 binding etc. redundant.
Q: protein binding - evidence that it does not bind a specific protein. Need a new GO term?
A: No. Use column 16 or create new GO term. Still in discussion. GO terms if the proteins can be put into groups. Don't want specific protein terms.
Q: What is wrong with having 25K GO terms?
A: Does it matter? May be able to do all PRO classes. Instantiate as needed.
Comment: NOT terms. IntAct only annotates negative interactions for isoforms where a different isoform has a positive interaction. Negatives are not exported to GO.
Judy summary: discussion of are we going to instantiate lots of protein binding terms. PRO families could be used for terms. Column 16 could be used for NOT and specific isoforms.
Emily: some things are not well captured by GO.
Is there possible redundancy if there is annotation of the MF without experimental evidence and the indication of the target binding in column 16 (e.g. the target protein is a transcription factor and MF term is transcription factor binding without evidence)? Is this a source of inconsistency between organism-specific annotation?
The level of experiments is different among organisms (e.g. yeast vs human) which implies different ways of doing annotation. This is not seen as a negative point.
Annotation extension discussion
- Annotation extension = column 16
- Should only be used for direct targets.
- Co-IP. Lnx-1 and Boz. Use two txn factor binding annotations with IPI, with the partner in the with column.
- Q: Do we need exp evidence that (e.g.) Boz is a txn factor?
- A: curator judgement at present. Rama: SGD would read the paper and check other annotations of Boz, not just rely on the assertion in the paper. The same paper does not have to show Boz is a txn factor. Ruth: in humans, would use sequence analysis, e.g. domains. Actually SGD doesn't annotate protein binding.
Paul: Annotations for the target must exist somewhere. Does this create redundancy to annotate binding to proteins of function X where target has function X?
Jane: Won't always be function terms. e.g. LIM binding domain binding.
Ruth: GOC still needs more discussion.
Judy: no inconsistency in what SGD does and what Ruth does. Annotations are consistent but SGD chooses different annotations to make. MODs bring specific special experimental strengths. This is a difference, not an inconsistency.
Mike L.: Biogrid curation does a lot of this. How much can be transferred. Ruth: more on this later.
- Column 16 example: Lnx-1 ubiquitinates Boz but not Gsc.
- Annotation. Lnx-1 has ubiquitin-protein ligase activity IDA Col 16:Boz
- Annotate protein ubiquitination IDA w/o target.
Q: problem of propagation across species. Col 16 identifier is species-specific.
A: Transferring from human to mouse. Use col 16 or not?
One problem raised with the column 16 is the annotation propagation by ISS, because the ID used in column 16 is species specific. Alternatives:
- Column 16 should be excluded from propagation by ISS, which is consistent with the current ISS procedure for with/from
- Column 16 should use protein classes from sources like PRO to allow propagation
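The two alternatives above can be compared in a small sketch. Everything here is an assumption made for illustration: the dictionary layout, the function name `propagate_by_iss`, the placeholder gene product identifiers, and the use of a `PR:` prefix to recognise a class-level identifier.

```python
def propagate_by_iss(annotation, target_id, keep_class_extensions=False,
                     class_prefixes=("PR:",)):
    """Copy an annotation to a similar gene product with evidence code ISS.

    Alternative 1 (default): column 16 is dropped, as is done for with/from.
    Alternative 2: column 16 is kept only if it names a class, not a
    species-specific gene product.
    """
    extension = annotation.get("extension", "")
    keep = keep_class_extensions and extension.startswith(class_prefixes)
    return {
        "gene_product_id": target_id,
        "go_id": annotation["go_id"],
        "evidence_code": "ISS",
        "with_from": [annotation["gene_product_id"]],  # source of the transfer
        "extension": extension if keep else "",
    }


# Placeholder identifiers; GO:0004842 stands in for the ligase activity term.
human = {"gene_product_id": "HUMAN_X", "go_id": "GO:0004842",
         "evidence_code": "IDA", "extension": "HUMAN_Y"}
mouse = propagate_by_iss(human, "MOUSE_X")
```

Under alternative 1 the mouse annotation carries no column 16 value. Under alternative 2 a class identifier would survive the transfer while a species-specific one would still be dropped.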
Q: is this redundant annotation of enzyme substrates?
A: No, we are doing substrate binding if the GO term does not provide the information.
Judy: knowledge statements vs description of the experiment.
Jim: column 16 post composition is equivalent to creation of a precomposed term, so ISS should be allowed (as appropriate, depending on whether the 16 ID is a class vs a specific product).
Paul: Think in terms of how we will do this with PAINT. We are annotating to ancestor nodes.
Comment: is the discussion generalizing? More general solution is to associate records with an external reference. Relational structure problem. In terms of binding let the protein interaction databases handle these.
Several people suggest that we should not have terms like "txn factor binding".
Ruth: Quick summary
- Use with term with IPIs if the GO term definition does not provide information
- Use column 16 for target
- In disagreement about propagation of column 16 by ISS
- Ideally info from with or col 16 to make inferences about the function of the protein. Other functions could come from other annotations of the target.
Kimberly: this has major implications for display. Keep the more specific terms (at least for now).
Ruth: enumeration of the kinds of targets could make things less clear.
When not to use Col 16
- For indirect targets
- FGF2 -> receptor -> phosphorylation of Erk2 goes up. Erk2 is NOT a direct target of FGF2. Activation goes via Ras.
Ruth gives an example of when annotators should not use column 16 (see presentation). She mentioned that the relationship ontology is in the process of being renamed. Using the relationship ontology with has_input (substrate) and has_output (product) together with CHEBI IDs in column 16 is a complicated way of annotating. To simplify annotation, it is proposed not to use the relationship ontology and instead to put a RHEA ID (reaction DB) in column 16, which gives substrate and product information.
The annotation rules specify that catalytic activity terms should not be annotated with the evidence code IPI. There are 144 of these annotations in the GO DB and 88 are from SGD. The evidence code IMP is stronger and should be preferred for the annotation. However, particular cases can occur and they have to be considered individually.
Col 16 relationship ontology
Relationships go along with the ID in Col 16.
- Lnx-1 is_a ub protein ligase IDA has_input Boz.
Col 16 and CHEBI
Concerning the annotation of small molecule binding, the idea is that small molecules could be mentioned in column 16 of an MF term which does not already describe the molecule in its definition. There can be inconsistency when annotating calcium binding (small molecule binding), because calcium binding may or may not be required for the function. This calcium binding issue has to be discussed further.
Annotation in the column 16 provides a certain level of knowledge (e.g. the function of the target protein is known) which could be also displayed. What should be annotated in column 16 and how far to go (e.g. annotation of small molecule binding with CHEBI ID) and where to stop? There are concerns on how far to push up the annotation in GO regarding what GO has been defined for: describe what the genes are doing.
Example: steroid hydroxylase.
- CYP11B2 is_a steroid hydroxylase activity IDA has_input CHEBI:16827 Corticosterone
- CYP11B2 is_a steroid hydroxylase activity IDA has_output CHEBI:27584 Aldosterone
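A rough sketch of how a relationship plus identifier in column 16 might be read by software. It assumes the value is serialized as `relation(ID)` with commas between several expressions. The notes above only write the relation and the ID side by side, so the exact syntax and the function name are assumptions.

```python
import re

# One expression: a relation name followed by an identifier in parentheses.
_EXTENSION = re.compile(r"^\s*(\w+)\(([^()]+)\)\s*$")


def parse_extension(text):
    """Split a column 16 value into (relation, identifier) pairs."""
    pairs = []
    for part in text.split(","):
        match = _EXTENSION.match(part)
        if match is None:
            raise ValueError(f"not a relation(ID) expression: {part!r}")
        pairs.append((match.group(1), match.group(2).strip()))
    return pairs


# Substrate from the steroid hydroxylase example above.
substrate = parse_extension("has_input(CHEBI:16827)")
```

Here `substrate` holds the single pair `("has_input", "CHEBI:16827")`. Anything that does not fit the assumed pattern raises an error rather than being guessed at.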
Where do we draw the lines with respect to specificity continues to be an issue of discussion.
Kimberly: Connections between CHEBI IDs and process terms - how will these be handled by GO. Will CHEBI IDs in function ontology propagate to process terms.
IPI and catalytic activity. Deprecate these?
- Rama: in SGD these came from combination of IPI and IMP evidence (Editorial comment: this is because SGD doesn't do GO:0005515).
Binding is not sufficient to infer activity by itself. GO does not capture multiple experiments in a single annotation. This is a general problem.
Judy: rules are made to be broken. (!)
Interaction with the IMEx consortium.
Results of the survey
- Consistency of the annotators on evidence code usage, but difference in MF terms annotation (parent vs child term)
- Seems ok to use column 16 in case of MF term, but not in case of BP term
Possible action items
More discussion by the working group:
- ISS propagation of binding across species requires additional discussion. Should column 16 identifier be to a class. Should column 16 be transferred in ISS transfer.
- CHEBI IDs and process terms - how will these be handled by GO. Will CHEBI IDs in function ontology propagate to process terms.
Day 1 afternoon session
Annotation and Annotation Propagation
HAMAP presentation (Alan Bridge)
Rama: How do you know which annotations are propagated and which derived from literature?
Alan: By the evidence tags, e.g. references, by similarity etc.
Paul: You said that you don’t propagate isoforms?
Alan: Isoform information is sourced from TrEMBL, we don’t project any isoform information
Judy: How does UniProt envisage to integrate their system with all the other available orthology prediction sources, to ensure that everyone works with a common set of proteins/families for GO annotation propagation?
Suzi: There is an initiative to create a common set of sequences in a common set of species to start building orthology groups. A set of species has been prepared by Dan Barrell at the EBI.
Judy: this effort needs to understand its relationship with other propagation methods
Pascale: Rolf participated in QFO meetings, the current session is only to highlight the differences between methods
Alan: In a first step, UniProt will also compare the output of their annotations with those produced by the Reference genome project using PAINT on selected protein families.
Judy: HAMAP and Quest for Orthologs both have related groupings. Sets of proteins with similarities, what is your global view. The utility of this effort is integration into global network
Alan: We have integrated into InterPro, and see several trends emerging, from this we are separating into groups
Paul: If groups want to use HAMAP will they have to fill out identity card for their species in order for it to work properly
Alan: You can either specify a species most closely related to yours or ask a curator to fill one in for you, as it is a closed system and quite an involved process
Suzi: There will be a follow up meeting for QFO next year, other groups can join in and contribute
Compara presentation (Javier Herrero)
Judy: What is the source statement for GO annotations derived from Compara and how can all these annotations be retrieved?
Emily: Compara annotations are in the GOA database, there is a GOref 19 specific for Compara-derived annotations and their annotations are present in the UniProtKB-GOA GAF.
Reference Genome presentation (Pascale)
How are the ‘high quality’ protein sets defined that are used by the project?
The sequences are from different sources for the different species and are put in a standard format using UniProtKB accession numbers.
Tree-based GO annotation presentation (Paul)
Cecilia: Which GO term to choose to annotate nodes of common ancestors? Is it better to use less specific GO terms to be able to move up to a higher node in the phylogenetic tree?
Paul: It’s better to annotate to the most specific term possible (explained in more detail in the PAINT demo presentation of Mike)
PAINT demonstration (Mike)
Judy: Concerned that correcting already existing GO annotations on proteins by going back to already curated papers during the process of annotating a tree with PAINT may be too time consuming and is not very efficient.
Cecilia: When single sequences below an annotated node are deselected for GO annotation propagation (because of curator judgement), how are these ‘negative’ GO annotations shown to the user? Is it more useful to not have an annotation there or to have a NOT annotation there?
Paul/Mike: There are two possibilities. On the one hand, if annotation propagation has been deselected because of rapid divergence of a branch, the annotation is not shown at all in the concerned entries. If the annotation propagation has been deselected because of missing critical residues in the sequence, the GO annotation is propagated with a ‘NOT’ qualifier and is available to the user.
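The two outcomes described by Paul and Mike can be written down as a small decision function. This is only a sketch of the rule as minuted here. The names are hypothetical and it does not reflect how PAINT is actually implemented.

```python
from enum import Enum


class Reason(Enum):
    """Why a curator deselected a sequence below an annotated node."""
    RAPID_DIVERGENCE = "rapid divergence of the branch"
    MISSING_RESIDUES = "missing critical residues"


def propagate_to_leaf(go_id, deselected_reason=None):
    """Return the annotation shown on a leaf sequence, or None if nothing is shown."""
    if deselected_reason is None:
        return {"go_id": go_id, "qualifier": ""}     # normal propagation
    if deselected_reason is Reason.MISSING_RESIDUES:
        return {"go_id": go_id, "qualifier": "NOT"}  # visible negative annotation
    return None                                      # rapid divergence: no annotation
```

So only the missing-residue case leaves a trace for the user, which is the point behind Cecilia's question.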
'Response to' terms
Chairs: Pascale Gaudet & Becky Foulger
Minutes: Michele Magrane & Shyamala Sundaram
Pascale’s presentation: http://wiki.geneontology.org/images/9/9b/WG-Response-to-Becky-Pascale.pdf. The aim of the working group is to improve the representation of biological responses. This has a lot of overlap with downstream events and signalling. Four challenges will be discussed along with proposals for improvements (see slide 5):
- Definition is very wide
- Specific terms can be informative
- Not all groups understand the definition in the same way
- Expression experiments are sometimes incorrectly annotated to IDA
1. Definition is very wide
The current GO definition of “response to stimulus” is shown on slide 3. Because the definition is so broad, the term is being over-annotated. Slide 7 shows numbers of annotations to some high-level “response to” terms. There are a lot of child terms under these high-level terms which should be used if possible rather than annotating to the high-level terms. This doesn’t currently affect many annotations but annotation to high-level terms should be avoided in the future.
Judy: We seem to be spending a lot of time discussing a small number of annotations. And the annotations to high-level terms are not wrong. Curators wouldn’t use a high-level term if they can use a more specific one.
Rama: Sometimes curators use high level terms to group a number of child terms.
Kimberly: It’s not always clear when to create new terms.
Paul: ‘response to stress’ means a response to at least one stress. If the response is to more stresses, we should annotate to each stress.
Judy: Agrees with this. GO is now 12 years old. If there are few annotations, they are legacy and are fine.
Pascale: Would like the guidelines clarified for future use.
Judy: High-level terms haven’t been used much.
Pascale: We need to be careful about grouping stresses to a parent term as the parent terms then mean 2 different things. This is a general issue with GO. For example, DNA-binding can be annotated to both positive and negative strands. Binding to the parent term is not the same as annotating to multiple child terms.
Judy: Agree and need to clarify this if it is a confusing issue.
Li: If something is a general core factor and annotated to lot of child terms, is there a danger of over-annotating?
Pascale: If this is what it does, it’s not wrong to annotate to all the child terms.
Summary of above discussion from Pascale: Avoid annotating to high-level terms if possible. Annotating to child terms is preferable and is not equivalent to annotating to the parent term.
Proposal 1: High-level ‘response to’ terms should not be used.
2. Specific terms can be informative
Specific “response to” terms are very informative and should be used where possible.
Proposal 2: Encourage use of granular terms.
3. Inconsistencies in ‘response to’ annotations (see slide 11)
- Some groups only capture mediators of response
- Some groups capture targets
- Some groups don’t like microarray/expression experiments and annotate mediators by IDA when what is being measured is expression levels e.g. Western blot showing up-regulation in response to heat where the correct evidence code would be IEP.
Pascale: How do people feel about mediators v targets as concepts?
Rama: For a transcription factor up-regulating proteins in response to stress, the transcription factor would be a mediator.
Pascale: We need to distinguish between targets and mediators. Do we want to annotate all targets or only factors that have a role in change in the expression or state of a cell?
Example of target: response to cadmium ion (see slide 14) - 8 spots are upregulated by cadmium exposure. The role of the proteins is not known. This is showing targets. The mediators are not known.
Kimberly: Maybe this is a case where IEP is not sufficient to capture what is going into the annotation process. We either need another evidence code or need to be able to show that multiple evidence codes have been combined to produce annotation.
Debbie: If this is the only experiment, it doesn’t matter what you know about the proteins. The only thing that’s known from the experiments is that they are up-regulated so they are targets.
Pascale: Some groups don’t capture this type of experiment.
Ruth: A heat shock protein is not a mediator with respect to transcription regulation so we should capture both targets and mediators. It’s not known that they are involved in the process. There is no way of knowing if up-regulated proteins have a role in ‘response to’ based on expression data. We need a consensus on if we can use IEP for ‘response to’.
Paul: This discussion is similar to those on how to structure signalling terms. Could imagine having terms that structure these.
Ruth: 2 years ago, it was suggested that all ‘response to’ terms should have sub-terms which should be annotated to rather than high-level terms. Microarray experiments could still use the high level terms that others consider are meaningless.
Jim: The children terms don’t match very well to the parents.
Pascale: Perhaps there’s a need to restructure the ontology.
Becky: Trying to get signalling under ‘response to’ node.
Jim: IEP as evidence code doesn’t justify this kind of experiment.
Pascale: Agrees but most of the IEP cases are in this part of the tree.
Tanya: Would most groups use this experiment to annotate to GO?
Pascale: This was already asked and groups were split in the middle.
Judy: MGI doesn’t use IEP. Other groups use it sparingly such as FlyBase & WormBase.
Alan: If you have a paper with such an experiment only, there are probably other papers which could be annotated in preference to such a paper with better or more data.
Judy: There is so much core information yet to capture that energies should perhaps focus on other areas.
Debbie: From a user point of view, many people will do an array experiment and look for enrichment of terms to create a hypothesis e.g. cadmium response in multiple organisms. For users, these types of annotations are potentially useful.
Li: Some groups are small and use this kind of data to boost numbers.
Alan: Can’t you get that kind of data from a primary repository?
Pascale: Yes, but you can’t do the same analysis as with GO.
Ruth: Agrees that some groups want to use IEP. Try to get it into other terms e.g. signalling terms and leave very high level terms for groups who feel they are necessary.
Val: Is IEP disallowed for molecular function?
Rama: Yes but it is still allowed for biological process.
Judy: Leave it as an evidence code but take into account concerns of people. Enrichment analysis of microarray results serves to generate hypotheses, which in turn generate experiments, but the results should not in themselves create annotations.
Jim: Happy to get rid of the IEP annotations from Pascale’s experiment. But the evidence code is useful for some experiments such as some yeast experiments which make more specific process-based inference which is when IEP should be used.
Proposal 3: Update “response to” definition as described on slide 12.
Becky: The new definition is more mediator-specific and is coherent with other parts of GO like signalling which don’t include targets in signalling terms. Would anyone object to the change in the definition?
Rama: Doesn’t mind changing the definition but is not sure about the proposed new definition.
Pascale: Working group can look into it and revise new proposed definition if necessary.
4. Concern about microarrays (see slide 17)
Pascale: Some people disregard microarrays but there’s a need to look more carefully at what’s been tested.
Alan: Most people use microarray as a first-pass and then do further experiments which are the ones that should be annotated.
Pascale: What about a microarray on wild-type versus mutant cells which finds differences in expression?
Alan: That’s a valid experimental system.
Proposal 4: Microarray hits should not be annotated to response to terms.
Val: Has annotated one paper from an array for ‘response to’ which are all core environmental response genes and all were further characterised.
Pascale: Agrees that this is fine.
Mike: Microarrays can provide numerical data. Should we ask for other experiments like 2D gels to demonstrate a numerical pattern?
Pascale: Every experiment is different. We can’t make general rules for these. It depends on the specific value you’re looking at.
Mike: Many microarray experiments use very specific algorithms to measure patterns but you don’t get same idea from a 2D gel.
Jim: Quantitative analysis of 2D gels happens.
Pascale: Don’t know enough about 2D gel or microarray data to make concrete proposal here.
Ruth: Microarray shows what mRNA is doing. Proteins don’t always follow these results. If we ban microarray data, it will be difficult to interpret proteomics results. Both should be considered in a similar vein.
Debbie: Thinking about this differently due to experience in lab where people did pulse-chase labelling, cut spots out of gels and did quantitative experiments. Gels can be quantitative, depending on what is done. Not comfortable about blanket ban on data from experiments such as microarray.
Pascale: Agrees. We don’t want to capture irrelevant data but we need to allow room for annotation from these experiments.
Becky: Did we resolve if we want to capture mediators and targets?
Pascale: We are capturing targets. Not sure why it’s a target. Heat shock response is downstream. Target of response and mediator of something else in response.
Mike L.: We need to separate out the concept of regulation. Up-regulation doesn’t mean that a protein is involved in response. We need to measure this more than just saying that it is on or off. There should be threshold.
Judy: Mike has a point but biology is sloppy. We can offer guidelines to avoid lack of rigour in annotations.
Debbie: Doesn’t understand why seeing the number of pixels changing in a microarray is more valid than other experiments such as Western blot.
Pascale: There is too much of a case-by-case basis to have guidelines.
Val: Is IEP allowed for annotation transfer?
Pascale: Yes, if the primary annotation is reasonable, it should be allowed.
Val: Maybe people would feel more comfortable if propagation not allowed.
Pascale: Has seen cases where propagation is reasonable. Happy to stop transfer if others agree but it can be valid to transfer annotation. Unless there’s proof that it’s not useful, we should continue.
Alan: Why specifically microarray? What about RNA-seq?
Pascale: The discussion also applies to these. It covers any technique which measures expression levels.
Paul: Can we clarify that for microarray, we mean any differential gene expression experiment? Anything that is differentially expressed is far downstream of the effect. It is downstream of the causing stimulus so we are not sure what it may do.
Jane: The definition change is fine but where does the process start and stop? Does the process describe the pathway between the start and stop or the whole thing?
Pascale: ‘response to’ terms are being reorganised as part of signalling pathways.
Becky: A signalling pathway ends with the trigger. There are also some downstream processes under signalling. The definition tries to include only genes which have an active role in a process, not those regulated by it.
Ruth: 2 points to add. 1. Proteomics experiments are fine. There’s no problem with them. 2. If we do ‘response to’ annotation by IEP, then for a protein that negatively regulates itself, you get over-expression in these assays as it’s being degraded too quickly to auto-regulate. The protein may be getting degraded before it gets to the nucleus. This is the type of thing to be concerned about when we say that we don’t want to do ‘response to’ with IEP.
Alan: A classic example of this is P53 where there is overexpression in tumors.
Paul: To summarise, we ought to think of biological process as an encoded program for the cell to do something and effects of stress are not part of the biologically encoded program. This is what we are trying to capture with ‘response to’.
Mike C.: Strike the word “microarray” as you mean anything that measures quantity. Microarrays don’t measure expression. What is meant is any technique which measures differential expression.
Summary from Pascale:
1, 2 and 4 are related points. If all you know is very general, it is fine to annotate to high level terms. But if you know more specific information, use more granular terms. If an experiment is testing a specific step of a response, that’s what we should be looking for and not annotating 500 genes that are up-regulated in response to calcium. We need to improve curation guidelines and also to improve the ontology.
Becky: What is meant by improving the ontology?
Pascale: Better integration with signalling.
Becky: Need to improve definitions of signalling terms which affects what ‘response to’ term can be used.
Day 2 Morning
Summary of ontology development
Chris Mungall presents rules for binding propagation (see presentation)
In the case of transcription factor activity which has DNA binding as parent, will it go to the same format? This has to be considered.
- It has been decided to add a has_part relationship as a link in the ontology.
- The propagation of has part relationship is not suitable in all cases (see example given in the presentation) and this makes the rules more difficult.
Example G capable_of ATPase activity -> G capable_of ATP binding
- Materialize relationships at central location
- Curator annotates to ATPase activity
- GAF pipeline materializes ATP binding using same EC
- Reimport allows a query against ATP binding to recover ATPases etc.
- Q: does redundancy of annotation raise issues? Probably not?
- Navigation via CHEBI too complex.
- is_a between ATPase activity and ATP binding
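The materialization step in the example above might look roughly like this. It is a sketch under stated assumptions, not the GAF pipeline itself: annotations are plain tuples, the has_part links are a hand-written table of term names, the reference is a placeholder, and only one level of links is followed.

```python
# Toy table of has_part links between function terms (term names, not GO IDs).
HAS_PART = {"ATPase activity": ["ATP binding"]}


def materialize(annotations, has_part=HAS_PART):
    """Add implied annotations, reusing the original evidence code and reference.

    Each annotation is a (gene, term, evidence_code, reference) tuple.
    """
    result = list(annotations)
    seen = set(annotations)
    for gene, term, code, ref in annotations:
        for implied in has_part.get(term, []):
            row = (gene, implied, code, ref)
            if row not in seen:  # do not duplicate an annotation a curator already made
                seen.add(row)
                result.append(row)
    return result


curated = [("G", "ATPase activity", "IDA", "PMID:EXAMPLE")]
expanded = materialize(curated)
```

After this step a query against ATP binding also returns G, while the curator only had to annotate to ATPase activity. Carrying over the evidence code and reference unchanged is what lets a reader trace the added line back to its source.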
Automated population of ontology using intersection_of terms ... has_input + has_output. The has_part links will mainly be populated automatically in the ontology using the MF X CHEBI logical definitions, but this can generate errors. It is also important to keep the original evidence code and original PubMed ID, which makes it possible to go back from the materialized ATP binding annotation to its source.
Concerning the problems of propagating has_part, why not use a link such as “necessitate”? This could be an alternative.
Ontology will contain information to relieve annotators of making redundant annotations.
Q: How will the chain of evidence work for the materialized ATP binding added to the GAF. A: original EC, reference, and ...?
Q: Look at other ontologies, e.g. txn factors. A: Don't want txn factor as a child of binding.
Q: is materializing a permanent solution? A: See later discussion.
- Problem: software development assumes a prior version of the GO structure
- Links are only in GO_ext files.
- Future: more links. Software will have to catch up.
- Materialization service for function to process links
- Want to limit precomposition
- Annotate as if relationships are there
- When to request new term vs use col16 - would the term make sense in an enrichment analysis
- Reasoner can find equivalent terms if they exist, and materializer will add lines to the GAF.
Isoforms. No time to discuss
- Extensions provide greater expressivity
- Possibility of expressing things different ways, but reasoner can link synonymous annotations made in different ways by annotators.
Q: relationship matrix? A: this exists in part
GO browsers
- Gene search
- Term search
- View direct or include annotations to child terms
- More tools
- GOOSE: SQL environment
- precomposed SQL query list. Can request new ones via help
- GO slimmer
- Visualization - input GOIDs and see relationships
- OpenSearch - Browser widgets and OSX dashboard
- Homolog Set Summary - for reference genomes
- AmiGO labs - more stuff
- Cross-product term request will issue GOIDs for specific types of cross-products (regulation, part_of, downstream process terms)
- Coannotation - see genes annotated to two GO terms
- Gene search
- download options, web services
- Term search also shows co-occurrence with other terms. Default EC selection was discussed.
- Annotation views have filtering options.
- Unlike current AmiGO, taxon filtering uses hierarchical relationships.
Annotation of HTP data
Chair: Rama Balakrishnan
Minutes: Cecilia Arighi and Silvia Jiménez
Annotating from HTP studies (Rama)
Survey: 4-5 questions
Slide summary from survey: What do you consider HTP data? Genome-wide screens. Many MODs have these types of papers but would like to flag annotations as coming from these studies. SGD has been annotating this data since 2003. Come up with some guidelines. Go to SGD practice.
- Genome-wide studies
There are other studies that are neither genome-wide nor small scale.
It is not always about the number of genes characterized in the paper, they look for:
- Have the authors checked every construct? Did they do the necessary controls? E.g. GFP constructs.
- Results can be measured by cutoff.
- Are there follow-up experiments?
- At SGD every paper is discussed before doing GO annotation.
Pascale: Question about GFP fusion proteins. If the paper doesn’t test the function, do you capture cellular component?
A: In the GFP example, the authors looked at cell localization by microscopy but did not check that the constructs were OK, so those annotations were not added.
Q: Do curators verify the data against published genes? A: No.
All the data reported in the paper is loaded. Authors have decided and reviewers approved. SGD does not have the time and expertise to do that. The data is either loaded from tables or sent directly by the authors to SGD.
Q: Is HTP annotation removed at some point?
A: Yes, if a paper shows that the HTP data is flawed or if a more confident paper is found.
Ruth: Do you contact the authors themselves and make sure they are comfortable with the data?
A: No, it is not systematically done. There have been cases where the authors contacted SGD.
Q: If the author gives confidence of just part of the results?
A: Then just this high confidence part of the data will be added.
The GFP dataset was the very first one SGD dealt with. The second one was the isolation of mitochondrial proteins, for which the authors provided the high-confidence set.
Q: Does author submission mean high confidence?
Examples of not so obvious HTP papers
- PMID:16702403 ribosomal proteins
Purified using 3 different approaches. Found 77 previously uncharacterized proteins. No stringent conditions to assess the ribosome. They characterized 12 further (for these they gave manual annotation, non-HTP).
- PMID:17443350 MS pre-60S ribosomal subunit
75 proteins, 46 ribosomal. Remaining 29 annotated as GO: pre-ribosome.
- Biological process HTP data:
This can be tricky. Experiments have to be carefully checked because we could be annotating to indirect effects.
Flagging HTP annotations. SGD indicates this to the user. Example of how it is displayed at SGD: there are three sections, manual annotation, HTP annotations, and computational.
Michael L.: How many HTP papers have been annotated for GO?
A: more than 25
Michael L.: How is this discussed at length in SGD?
A: Whenever a curator finds an important HTP paper, they have to write a proposal, which is then discussed in a meeting.
Susan T.: Is this included in GAF?
A: Yes, but no way to flag in the GAF
Annotation of HTP data in SP (Emmanuel)
Semi-automatic integration: The main HTP data have been integrated in batch. Papers are manually curated, with emphasis on MS-MS parameters and identification cut-off. Only papers fitting our guidelines will be annotated.
The SUBA database collects PubMed papers on subcellular location in Arabidopsis. When curating a protein they check whether it is in the SUBA db. All retrieved PMIDs are analyzed; only those localizations that have been proven or predicted are kept. Limitation of HTP data: separation methods are not 100% "clean", they often introduce contaminations.
Positive example: annotation of Q8H1R4. Prediction for chloroplast protein. An HTP paper shows chloroplastic, then it is added.
Negative example: P34791 shows HTP data conflicting with prediction. SP does not include.
SP uses HTP data as a tool to make decisions, complementing other resources.
Michael: Is an HTP annotation added only if it matches the prediction?
A: Yes for subcellular location, but also with other papers.
Paul: So should we remove the ones in conflict?
A: You need to warn the user that these annotations come from HTP because they may be false positives.
Michael L.:Do you have computational ways to pick up conflicts?
A: Conflicts are found during the annotation process on a one by one basis, we don’t look at these things in batch.
Alan for Rama: What would you do if you have data from 2 different groups claiming high-confidence data but their cutoff criteria are different? Example for PPI.
A: PPI is not added directly at SGD but BioGRID does it. Users can filter the information themselves.
Evidence code proposal by Rama:
Conference call issue: how to flag these annotations led to the proposal.
Problem: assign evidence and indicate if has been reviewed by curator or not.
IEA is not accurate for this. New evidence proposal:
- Two high level nodes: computational|experimental.
IEA includes predictions based on various things and does not distinguish the evidence for the annotation. Anything sequence-based could get a computational evidence code.
This will be fleshed out at the ISMB meeting.
- R-IDA: reviewed IDA.
- NR-IDA: non-reviewed IDA.
Nothing was agreed since this will be discussed at ISMB.
Annotation of complexes
Minutes by Kristian Axelsen and edited by Mike Livstone
Quick summary of session: There has been a need to address the following situation: Complexes are multiprotein machines that carry out a specific process or reaction. While it is clear that there should be annotations to the process for the catalytic subunit, there is a desire to annotate, using experimental evidence codes, other subunits in the complex based on their membership in the complex. One proposal has been to create a new experimental evidence code "ICM" (Inferred from Complex Membership). The general consensus in the session was that this type of inference should not be made and, as a consequence, ICM should not be created.
More detailed notes:
The background for the sessions at this GO camp is that, after making group annotation sessions of groups of 5-10 genes, it was always the same 3 types of problems that appeared.
So the working groups were created to identify the issues, improve annotation, make annotation guidelines, and provide QC checks.
Bernd presented the current situation with a very broad definition of a complex, but stressed that "complex" terms should be defined so that they could be used in other organisms and not only in the organism where they were first seen.
Current Guidelines by Ontology:
- CC: gene products can be annotated to complexes; "colocalizes_with" qualifier also allowed. (slides 8, 9)
- MF and BP: Gene products are not annotated to complexes
- MF allows "contributes_to" in the context of a complex (slide 10)
- MF: catalytic and regulatory subunits can get different annotations (slide 20)
The use of contributes_to was discussed in the MF ontology. This was to be used for essential subunits only.
Annotations to MF should NOT be done based on IPI alone.
A lot of the discussion in the working group was concerned with how to annotate the subunits which are not responsible for the catalytic activity.
Working group suggestion: to create a new evidence code: ICM (Inferred from Complex Membership)
(Note: The consensus at the end of discussion was not to create this code.)
Furthermore, annotators were urged to be more willing to record "unknown" as the MF when this is the case. It is acceptable not to know.
General consensus: We need to be more conservative when assigning MFs
This would also be more in line with the biologists' view.
Working group suggestion: From the evidence code documentation (IDA): "a fractionation experiment might provide "direct assay" evidence that a gene product is in the nucleus, but "protein interaction" (IPI) evidence for its function or process." Proposal 2: Remove this statement from the annotation documentation
General consensus: This statement should be removed (this was also a conclusion from the Binding session).
An important example that was discussed: Yeast RNA polymerase II vs. III. PolII is much better studied, and subunits that are indispensable for PolII function are annotated to transcription with "contributes_to." In contrast, the same level of detail is not available for PolIII, so all subunits get contributes_to transcription. This reflects the level of understanding for both complexes, but does not sit well with many curators because it means that in cases where we know less, we make more annotations.
Summary (by Paul Thomas): We would like to be able to annotate entire complexes to MF and BP. For single gene products we should only annotate a MF for the subunits essential for the complex activity.
The use of contributes_to was raised. Pascale said incautiously that personally, she would have no problem getting rid of contributes_to.
Again, it should only be used for MF annotations of the subunits essential for activity.
Minute taker's comment (KA): This is perhaps an issue for the next camp/the continued work of the working group
Another issue: MF terms are sometimes added to a complex based on early experiments. When more detailed knowledge appears and new terms are added, it should be possible (and easier) to remove the old annotations, even when they were added by different groups.
- Michael pointed out that ICM really is an ISS inference
- Paul says we need to be able to annotate complexes directly, the same way we annotate gene products.
Day 3 Morning
Use of Regulation (Jane and Kimberly)
Chairs: Jane and Kimberly
Minutes: Susan & Andrea
Jane gave David Hill's presentation: Annotating to GO regulation terms
Summary of background: Regulation of a process affects the beginning, middle or end of a process. The key is defining process X. Processes can be considered as ordered assemblies of molecular functions - GO is trying to make links between functions and processes. If Y modulates any of the functions within the process then it regulates it. The problem is that processes are subjective and not all are currently defined in terms of beginning, middle and end. GO needs to reflect the community consensus about which functions are part of a process and which are not.
Rama: If Y alters ANY one of the functions? In a series of functions, doesn't function 2 regulate function 3? For instance the MAP kinase pathway - phosphorylation events are part of the pathway - but most people think these are regulation steps. Are the downstream phosphorylation events regulation or in the pathway?
Jane: It depends how you define the process. Functions can regulate the process and be part of it e.g. if enzyme 5 in pathway regulates level of enzyme 1 then it would regulate the process and be in it.
Guideline 1: If the gene product performs one of the functions, annotate directly to the process. If the gene product regulates then it should be annotated to regulation. If you aren’t sure, annotate to the process term
Pascale: For signaling pathways e.g. MAP kinases - every step is a regulation step - so is it better to annotate these to regulates?
Val: Signaling processes were all part of regulation - are they still?
Becky: No - they aren't anymore
Val: for the MAP kinase pathway I would annotate these to the pathway.
Becky/Michael(?): Regulation affects the entire pathway - it is too hard to dissect out what regulates the individual parts of the process.
Paul: Agree with Michael, few experiments show mechanism of regulation.
Guideline 2: Use your biological knowledge. Consider how much is known about the process, is there a defined path & have players been identified? Is the gene product being annotated thought to be a major player in the process or is it outside of it?
Ruth: Annotators often use ISS to decide what to curate.
Kimberly: Usually annotations based on IMP are made in conjunction with knowledge about the sequence - we should consider whether we need another evidence code to reflect this or combine them. It is easy to be misled by phenotype alone - [see example of this later].
Guideline 3: Try to reflect the paper that you are reading.
Guideline 4: Try to improve ontology by improving definition to include start and ends.
Guideline 5: Fix annotations where possible when new knowledge becomes available.
Guideline 6: Take care making annotations to regulates based on IMP. Mutant phenotypes are often used to make annotation to regulation terms because they fit the criteria of the definition of the regulation terms. In using IMP to make regulation annotations it is important to consider various factors: including the assay type, nature of the alleles (null vs reduction of function) and identity of the gene product. If it isn't clear that it is involved in regulation then it is better to annotate to the parent process term.
These guidelines are high level and annotators will need help with individual expts e.g. with worked examples for annotators to refer to. Beginning and end of processes aren’t clear. Ligand binding is beginning of process and at the same time is regulation of the process.
Becky: a signaling process starts with ligand binding to the receptor.
Pascale: Is it true of every signaling pathway?
Ruth: not every ligand will regulate the pathway but many simple pathways will have the ligand regulating the pathway.
Pascale: will this be defined for each pathway?
Becky: yes the start will be made clear in defs.
Jane: but the annotator will decide if ligand regulates the pathway?
Alan: some ligands will be regulated extracellularly (and then the receptor regulates), not the ligand.
Ruth: often a second cell will regulate things… the receptor can also regulate the pathway.
Michael: We have to be pragmatic. Going down that route, everything regulates everything else – this is not useful to GO (although is true)
Paul T. This is similar to the response_to discussion – only if it is a programmed response should it be considered regulation, although this is not always clear. We’re after mechanism.
Judy: This is a useful discussion but curator judgement will reflect the way biologists talk about regulation - not easy to make clear rules.
Ruth: it isn't helpful… when annotating ligands we need guidelines as to whether to annotate with regulates or not. We need to help curators.
Becky: any protein that affects level of a ligand should be annotated with regulation of the pathway but the ligand itself is part of the pathway - we should annotate what restricts the level of the ligand.
Jane: In summary, only if one member of a pathway goes back and regulates the level of an earlier member should it be annotated with both regulates and the pathway itself.
Mike L: given that regulation is a child of the process can you distinguish what is in the pathway and isn't?
Jane: it isn't a normal parent child relationship because it is regulates - it depends on the tool if they understand regulates or not. If it is part of then this is made explicit.
Chris/Jane: further discussion re whether these relationships are transitive or not - need clarification about this.
Chris: there will be doc on the wiki from Amelia that helps describe this see regulates page. It is important to consider the gene product and process relationship - participates_in etc. Allows more advanced queries.
Ruth: trouble is that tools don't behave in the same way, so people get confused - e.g. differences between AmiGO and QuickGO. Do the tools keep regulates separate or not? What should the default be?
Jane: sometimes you want to look both ways so ideally you would be able to toggle between taking these into account or not.
Potential mis-annotations. More than 2500 cases exist where a gene product is annotated to a process and its regulation. Which cases are correct and which are mis-annotations? Some feel all may be wrong, others feel all may be right.
Distinction between process & regulation isn’t yet well understood. As more data is available it will become clear which will be wrong. Can we figure out a way to sort this out automatically vs going and looking piece by piece?
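As a starting point for sorting these cases automatically, the candidates can at least be listed mechanically. The sketch below assumes a hypothetical lookup table from regulation terms to the process they regulate and a simplified list of (gene product, GO ID) pairs; the function name and the example link are illustrative. It only finds the candidates; deciding which are mis-annotations still needs a curator.

```python
from collections import defaultdict

# Hypothetical "regulates" links: regulation term -> regulated process.
# Illustrative only: regulation of gluconeogenesis -> gluconeogenesis.
REGULATES = {"GO:0006111": "GO:0006094"}


def process_and_regulation(annotations, regulates=REGULATES):
    """Return sorted (gene_product, process, regulation_term) triples.

    annotations is an iterable of (gene_product, go_id) pairs. Only direct
    annotations are compared; annotations to child terms are ignored here.
    """
    terms_by_gene = defaultdict(set)
    for gene, go_id in annotations:
        terms_by_gene[gene].add(go_id)

    hits = []
    for gene, terms in terms_by_gene.items():
        for reg_term, process in regulates.items():
            if reg_term in terms and process in terms:
                hits.append((gene, process, reg_term))
    return sorted(hits)
```

Grouping the resulting list by contributing database and evidence code would show how many cases each group has to review, and whether the IMP-based ones dominate.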
Judy: As we find errors they should be corrected but is this a big problem - what % of the total annotations are the 2500? Are all the regulation annotations manual annotations?
Michael: we want to get it right. Most of the 2500 are probably wrong so why don't we just check and fix them all if they are wrong. Not many per database so do-able job.
Judy: basically we are getting this right - get guidelines in place and this will help.
Pascale: We should check to make sure they’re right, especially the IMPs and we’ll get an idea from doing a few and then decide about the rest once we see what the issues are.
1. Foxo1 - would you annotate mouse Foxo1 TF to regulation of gluconeogenesis based on expression of dominant negative construct? 63% yes, 37% no. Didn’t get many comments on why.
Rama: we were worried that effect was indirect.
Tanya: there was more data in the paper to support that the effect was direct.
Li: we are concerned that this was based on expression levels only.
Ruth: but the annotation captures the intent of the authors in this case (channeling David Hill here).
Generally considered that this annotation was acceptable.
2. ACOX1 - Paper describes two patients with abnormal long chain fatty acids and deficiency of peroxisomal acyl-CoA oxidase in fibroblasts from these patients. Would you annotate human ACOX1 to fatty acid beta-oxidation or regulation of that process? 42.3% oxidation, 23.1% regulation of process, 34.6% neither. So not strong consensus.
Those that would make annotations were taking background knowledge about sequence of the protein into account. Others would be happier if there was enzyme assay in the paper.
Alan: Knowing what it does changes the way we annotate it. Are we testing or confirming hypotheses? We don’t leave ourselves open to new data if we only look for what we want to see. It is dangerous to only add terms based on what we know already
Paul: what did the authors conclude?
Ruth: We have to believe the authors know what they’re doing. Scientists would think it odd if we didn't capture this information - experiments may only ever be done in other species.
Alan: it is better to be strict about what is actually proven in the species in question.
Pascale: in this case it is only a process not a function so not so much of an issue. But worried about the IMP aspect - doesn't actually show regulation here and wouldn't annotate function from this data.
Alan: From 2 people you can’t decide anything. In this case they only looked at the one enzyme they thought would change in these patients. People are under pressure to confirm their hypotheses - when authors have preconceptions they often find what they are looking for. We should be annotating what is in the paper, not based on background knowledge.
Jim: Disagree - GO annotators shouldn’t censor which literature gets curated.
Muscle contraction vs. regulation of muscle contraction example: Presented by Kimberly.
Three C. elegans genes with different muscle phenotypes based on whether the allele is a null or partial loss of activity.
Loss of functions would lead you to annotate to the process whereas reduced function to regulation. However, knowing what the products are (channel protein, myosin, TF) influences the annotation. Transcription factor is involved in assembly of muscle, is it regulation of contraction?
Pascale: great example illustrating the problems with interpreting mutant phenotypes - shows the danger of inferring regulation based on IMP. If you don’t know, annotate to the process is the default rule.
Val: I would still annotate all of these to the process anyway
Michael: there is no evidence for any regulation here at all
Jane: with IMP it is hard to ever get evidence for regulation.
Kimberly: We need to put more examples on the wiki - e.g. where you would annotate TFs to regulation of a process or not. ACTION ITEM
How is Downstream Effect defined (Rachael and Varsha)
Chairs: Rachael and Varsha: Annotating to downstream processes
Minutes: Yasmin & Ursula
- Definition of down-stream process, as proposed by work group - everyone thinks this is OK
Examples (1-4): see presentation
- Discussion of Survey (see presentation)
Everybody does at least occasionally annotate down-stream processes.
Most participants felt that annotating down-stream effect was ok, when no other information was available. Many participants felt it would be desirable to revise such annotations at a later time, but that this was not always feasible for various good reasons (see presentation)
- Guideline 1: Request new, specific terms describing a process involved in another process. Example: for growth factor BMP2 that regulates cardiac cell differentiation, it is more informative to use a composite term, such as “regulation of transcription involved in cardiac cell differentiation” as opposed to using two unlinked terms, e.g. “regulation of transcription” and “regulation of cardiac cell differentiation”. (The terms do not exactly match the case of BMP2).
- Guideline 2: for small scale experiments one should annotate to the experimental evidence in the paper. However, use curator judgment, and also take account of the quality of the evidence, etc.
If a gene product has a central role affecting multiple down-stream processes one should only annotate the core process. When a gene product is specific for a particular pathway and/or has just a few targets, one should annotate the down-stream processes.
Discussion of examples:
a) yeast RNA polII subunit should only be annotated to the core process.
b) for proteins associated with the yeast spliceosome, annotation describing indirect effects has been removed.
c) S.pombe sre1 (direct transcriptional regulator of genes which have a role in heme and lipid biosynthesis): new terms should be requested, e.g. “Regulation of transcription involved in heme biosynthesis“
Li - Are we ready to go for this transcriptional regulation process in the GO - directed at Chris: everything involves transcriptional regulation - does GO want to represent this?
Chris - yes - we should represent this. For the time being we should use precomposed terms. Use AmiGO Labs to request terms. Later it may be possible to use column 16 instead.
Li: ontology developers in group should discuss this ACTION ITEM:
- Guideline 3: If a gene product has limited experimental literature, such as a newly characterized protein, it is acceptable to annotate to more general 'downstream' process
Lively discussion of example of RNA polII subunit: should one keep the experimental annotation (indirect effects)?
Mike A: rpb2 is required for every transcription process; it is not useful to list indirect effects. The gene product should be annotated to the core process using ISS, and the phenotype-based experimental annotation should be removed. Describing the k/o phenotype is not informative.
- COMMENT: if it has a specific effect, one should keep both specific down-stream effect and description of core process.
Kimberly: rpb2 annotation originated from phenotype to GO mappings (ISS). We will review the pipelines issue.
Kimberly: How can one connect the core process with the biological knowledge? This is what is being tested in C.elegans. We need feedback
Sylvain: should one have GO terms for knockout data? Propose to use them only if there is further experimental characterization of the gene product. There are multiple phenotypes for any mutation, especially if these affect an important gene product. It is not the goal of GO to describe phenotypes.
Li: in this case this is a core process, but when the underlying function of a gene product is unknown, then making these annotations will give more information for the user
Many participants agreed with the above statements. But: it really depends on the MODs whether they want to keep the annotation or not.
Mike A: proposes to delete the evidence code IMP. IMP should be used very sparingly
Pascale: annotation based only on mutants may be misleading. In such cases, we’d need further information.
Kimberly: but users may want that information. If we can use a different evidence code then that will be welcome.
Rachael - if you didn't do IMP then what would you use? IDA?
Sylvain: HTP data effects were all annotated to different development processes
Pascale: at the organismal level it's hard to annotate directly
Michael: IMP is an absolutely valid code in some process - need to set a boundary for when to use it for capturing phenotypes.
Mutant data can be essential and have yielded precious information.
Phenotypes should be captured using existing phenotype descriptions, and maybe by a dedicated database.
Need to take into account if there is a paper discussing the mutant phenotype, or if this is stand-alone (HTP) data. We want to capture what the authors are saying, and what was accepted for publications by the reviewers.
Judy: we need to clarify what is the appropriate use of these evidence codes
- Guideline 4: annotation of ligand receptor signaling pathways (intercellular vs. intracellular)
For intercellular signaling, the ligand is part of the pathway. For intracellular signaling, the ligand regulates the pathway.
Pascale: this is confusing.
Becky: the goal is to avoid over-annotation. Is the ligand part of the pathway?
Becky, Pascale: Yes. Varsha: this is another discussion.
Becky: ligand is part of pathway. The pathway ends when response is initiated
CONSENSUS: need to clarify where pathways start and end.
ACTION: take intracellular example back to signaling group for clarification.
Going through slides showing (simplified) insulin receptor pathway and NF-kappa-B pathway: everybody agreed.
Becky: a lot of the time you don't know that the stimulus/ligand/receptor is involved in multiple pathways. Would prefer to request new terms, e.g. X signaling involved in Y pathway
Pascale: likes this representation and is useful for helping to think about how to correctly represent the biology
Varsha: signaling diagram will be presented to the signaling group
Present summary slide (documentation) for dealing with cases such as RNA polII subunit (guideline 3): if reasonable, remove annotation of downstream effects once core activity is known. But: there may be good reasons to keep such annotation. It really depends on the case, and on the contributing MOD (see presentation).
Suggested Quality Control checks: (see presentation).
Discussion of survey example:
- Question 1: Functions of the deubiquitinating protease CYLD
Most (almost 90%) would annotate to core process = regulation of microtubule organization. A sizeable minority also selected downstream effects.
Ursula’s example. Ursula: was strongly inclined to stick to “regulation of microtubule cytoskeleton organization”. This protein regulates everything, and it does not make sense to annotate “everything”.
Rama - SGD would be cautious too
Ursula: survey result may reflect limited time available for doing the survey. Reading several papers on the subject makes a difference.
- Q2: Bre1-Histone H2B monoubiquitination regulates histone H3 methylation
Most (85%) of the participants chose a “histone ubiquitination” term
65% chose also histone methylation. Most participants would like “Regulation of histone methylation”, or a new term such as “Histone monoubiquitination involved in regulation of histone methylation”.
Val: would annotate to main activity of enzyme but also regulation - has many pombe papers with experimental evidence
Problem: “Regulation of pathway X” is part of pathway X. People are not sure how to display the experimental data.
Rachael, Sylvain: it is important to avoid ending up with all histone-modifying proteins having exactly the same annotation. They have distinct activities, and this should be shown by the annotation.
Val: will ask people to look into this.
Sylvain: each histone modification affects other histone modifications (positive and negative regulation). There are about 80 histone-modifying enzymes, and each has down-stream effects on other histone modifications.
Ruth: these are process terms and users would find it useful to know which genes are involved in this process: how to convey the information?
Sylvain: Propose to change the definitions of existing terms, so that ubiquitination includes effect on subsequent methylation, etc.
Sylvain: Propose to split “Regulates” into cases where “Regulates process X” is a part of process X, and cases, where “Regulates process X” is NOT part of process X? Is this possible?
Paul: allow distinction between core process and regulation of core process
- CONCLUSION: Ontology should be revised, annotation checked.
- ACTION POINTS:
- revise process terms for transcription
- define start and end points of signaling processes
Day 3 afternoon session: Future plans
- Chairs: Suzanna Lewis and Pascale Gaudet
- Minutes: Jane Lomax - Kimberly Van Auken
Quality control checks
- Annotation Matrix method (Val Wood)
The annotation matrix can be used for identifying annotation outliers
e.g. ribonucleotide reductase rrm2b has annotations to:
purine/pyrimidine dNTP biosynthesis
induction of apoptosis
Which of these should be annotations to regulation terms?
Val has used the matrix to identify potential annotation errors.
How could the matrix be put into practice more generally?
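One way such a matrix might be built and used more generally is sketched below. This is not necessarily Val's method: it simply counts how many gene products share each pair of terms and reports the pairs that are rare. The function names, the `max_genes` threshold and the sample data used with it are assumptions for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence(annotations):
    """Count gene products per term pair.

    annotations is an iterable of (gene_product, term) pairs.
    Returns (pair counts, terms grouped by gene product).
    """
    terms_by_gene = defaultdict(set)
    for gene, term in annotations:
        terms_by_gene[gene].add(term)

    counts = Counter()
    for terms in terms_by_gene.values():
        counts.update(combinations(sorted(terms), 2))
    return counts, terms_by_gene


def outliers(annotations, max_genes=1):
    """Return sorted (gene, term_a, term_b) for pairs shared by few genes."""
    counts, terms_by_gene = cooccurrence(annotations)
    found = []
    for gene, terms in terms_by_gene.items():
        for pair in combinations(sorted(terms), 2):
            if counts[pair] <= max_genes:
                found.append((gene, *pair))
    return sorted(found)
```

A rare pair is only a prompt for review: it may be a misannotation, an annotation that belongs on a regulation term, or a genuine and interesting link between processes.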
- Taxon checks
Taxon-based constraints to detect annotation inconsistencies
Obvious inherent differences
Rules are collected in a central taxon constraint file
e.g. lactation, photosynthesis, mitochondrion
Chris will send the URL of what’s in the current constraint file to see if it’s okay.
The checking script is run weekly.
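A taxon check of this kind could be sketched as follows. The taxonomy, the constraint tables and the function names are toy assumptions, with term and taxon names standing in for GO IDs and NCBI taxon IDs; the real constraint file and checking script may be organised quite differently.

```python
# Toy taxonomy: child taxon -> parent taxon.
PARENT = {
    "human": "Mammalia",
    "Mammalia": "Eukaryota",
    "yeast": "Fungi",
    "Fungi": "Eukaryota",
    "E. coli": "Bacteria",
}

# Hypothetical constraints, loosely based on the examples in the minutes.
ONLY_IN = {"lactation": "Mammalia", "mitochondrion": "Eukaryota"}
NEVER_IN = {"photosynthesis": "Mammalia"}


def lineage(taxon, parent=PARENT):
    """Return the taxon followed by all of its ancestors."""
    chain = [taxon]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain


def violations(annotations):
    """Return (gene, term, taxon, reason) for each broken constraint.

    annotations is an iterable of (gene_product, term, taxon) triples.
    Constraints are matched on the annotated term only; a fuller check
    would also apply them to child terms in the ontology.
    """
    problems = []
    for gene, term, taxon in annotations:
        ancestors = lineage(taxon)
        if term in ONLY_IN and ONLY_IN[term] not in ancestors:
            problems.append((gene, term, taxon, "only_in " + ONLY_IN[term]))
        if term in NEVER_IN and NEVER_IN[term] in ancestors:
            problems.append((gene, term, taxon, "never_in " + NEVER_IN[term]))
    return problems
```

Because the check walks up the taxon hierarchy, a rule stated once for a high-level taxon covers every species below it.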
- Specific checks based on GO IDs and evidence codes: 5515, IEP, etc (Rama) - File:Annotation QCs rb.pdf
Hard QC checks
Annotations are wrong for trivial reasons – obsolete, no WITH column for IPI
Soft QC checks
Annotations need to be reviewed to distinguish between misannotations and real annotations
Review of current GAF pipeline
Most error checks are carried out by an annotation control script maintained by Mike Cherry
New QC checks – New annotation checks are being formulated by annotation working groups
e.g. No use of the ‘NOT’ qualifier for protein binding
protein binding needs a WITH entry
protein binding should not use the ISS evidence code
IEP should only be used for BP annotations
Possible new checks:
IPI evidence code may not be used with catalytic activity molecular function terms
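The checks listed above lend themselves to simple per-line rules. The sketch below is an illustration only and is not the existing annotation control script: the dictionary keys, the function name and the messages are assumptions, and the proposed IPI check relies on a caller-supplied set of catalytic activity terms because finding those terms needs the ontology.

```python
PROTEIN_BINDING = "GO:0005515"


def qc_problems(row, catalytic_terms=frozenset()):
    """Return a list of rule violations for one annotation line.

    row is a dict with the keys qualifier, go_id, evidence, with_from and
    aspect, holding GAF columns 4, 5, 7, 8 and 9 respectively.
    """
    problems = []
    qualifiers = row["qualifier"].split("|") if row["qualifier"] else []

    if row["go_id"] == PROTEIN_BINDING:
        if "NOT" in qualifiers:
            problems.append("NOT qualifier used with protein binding")
        if not row["with_from"]:
            problems.append("protein binding lacks a WITH entry")
        if row["evidence"] == "ISS":
            problems.append("ISS used with protein binding")

    # Aspect "P" marks biological process annotations in a GAF.
    if row["evidence"] == "IEP" and row["aspect"] != "P":
        problems.append("IEP used outside biological process")

    # Proposed check, not yet agreed at the meeting.
    if row["evidence"] == "IPI" and row["go_id"] in catalytic_terms:
        problems.append("IPI used with a catalytic activity term")

    return problems
```

Returning messages rather than rejecting the line outright fits both notification options discussed, since the same list can go into an email or a report file.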
System will be in place to alert curators about the annotations that need to be evaluated
What is the best system for notification of QC checks? Email and a file could both be used
Every time you submit a GAF file, and then also once a week?
All annotation checks must be fully described at:
GOC wiki – annotation quality control checks
People should put suggestions for new checks on the wiki
Suggestions for particular term checks on GONuts could be fed to the QC pipeline
Annotation advocacy group
Emily, Rama – co-managers
Goal is to make sure that curators/annotators are well informed about developments in ontology, evidence codes, annotation working groups, annotation inferences (PAINT, inter-ontology links).
Encourage users to post/send concerns or requests to the go or annotation mailing lists
Future Plans for QC
Establish rules of all high level processes
Establish more specific rules, taxon-specific rules, function and component rules (for example chloroplast/cytoplasmic ribosome example)
Ultimately will be able to assess whether annotations are correct in the context of known biology, or whether they identify new previously unknown connections between divergent processes
Removal of experimental annotations: need more rigorous alerting for unsupported ISS annotations
High level terms where annotations can consistently be transferred could be identified, i.e. transcription, translation, replication, x metabolism (improve GO slims, easier to identify total ‘unknowns’)
We need to have a system to address when an annotation has been checked so that a checking pipeline doesn’t continue to flag reviewed annotations.
GO is moving towards an annotation QC team approach.
What corrections should get priority?
Misleading vs non-misleading annotations?
Should QC meetings incorporate a representative from each group?
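One possible way to keep a checking pipeline from re-flagging annotations that have already been reviewed, as raised in the QC plans above, is to record a key for each reviewed annotation and filter on it. The sketch below is only an illustration in Python; the field names, the choice of key, and the sample records are assumptions.

```python
# Hypothetical sketch: suppress QC flags for annotations already reviewed.

def annotation_key(ann):
    """Identify an annotation by the fields a reviewer actually checked."""
    return (ann["gene_product"], ann["go_id"], ann["reference"], ann["evidence"])

def unreviewed(flagged, reviewed_keys):
    """Return flagged annotations whose key has not been marked as reviewed."""
    return [a for a in flagged if annotation_key(a) not in reviewed_keys]

# Made-up records for illustration only.
flagged = [
    {"gene_product": "X0001", "go_id": "GO:0005515",
     "reference": "REF:1", "evidence": "IPI"},
    {"gene_product": "X0002", "go_id": "GO:0006950",
     "reference": "REF:2", "evidence": "IEP"},
]

# A curator has reviewed the first annotation and confirmed it.
reviewed_keys = {annotation_key(flagged[0])}

to_report = unreviewed(flagged, reviewed_keys)
print([a["gene_product"] for a in to_report])
```

Because the key includes the evidence code and reference, an annotation that is later edited no longer matches its reviewed key and would be flagged again.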
Response to Working Group
Update definitions of response to terms to indicate that we’re capturing mediators
Where does response to start and stop?
Should a glucose transporter be annotated to response to insulin?
How could terms like response to heat stress, osmotic stress be better defined?
Increase synthesis or activity of a limited number of gene products that are shown or hypothesized to help the cell deal with that stress
Should the full extent of the response, i.e. how the cell deals with it, be included?
Does having response to insulin add more to annotations than just insulin receptor signaling pathway?
Glucose transporter could be annotated to response to, but not signaling pathway term
begins with detection of signal, includes signaling, and ends when the cell has resolved the stress
From Debby Siegele:
Here are some suggested rewords of response to stress terms:
GO:0006950 response to stress
A change in state or activity of a cell or an organism (in terms of movement, secretion, enzyme production, gene expression, etc.) as a result of a disturbance in organismal or cellular homeostasis, usually, but not necessarily, exogenous (e.g. temperature, humidity, ionizing radiation). [source: GOC:mah]
GO:0009408 response to heat
A change in state or activity of a cell or an organism (in terms of movement, secretion, enzyme production, gene expression, etc.) as a result of a heat stimulus, a temperature stimulus above the optimal temperature for that organism. [source: GOC:lr]
Debby's first draft of new definitions:
response to stress:
The response of a cell or organism to a stress that leads to the increased synthesis or activity of a limited number of gene products that are thought to protect cells in some way from the deleterious effects of the stress(es) that induce their synthesis.
response to heat:
The response of a cell or organism to an increase in temperature that leads to the increased synthesis or activity of a limited number of gene products that are thought to protect cells in some way from the deleterious effects of the temperature increase.
Try to be as specific as possible with the entity to which the cell/organism is responding.
Use more granular terms for annotation.
Expression experiments should not be annotated to response to terms – soft QC
Annotators have been annotating to the definition
Looking at a change in expression sometimes may be valid, though
Need to work on the wording of this guideline
Will the current terms need to be obsoleted?
Action Item – curators need to look at their annotations to see what is valuable to capture for response to annotations wrt expression to refine these guidelines
Annotating to complexes directly would help with annotation issues, but we aren’t able to do that yet
Should we annotate to GO IDs (they exist) or do something else?
Avoid annotations to GO MF terms by IPI (except for ‘protein binding’ and children) – soft QC check
Could these be moved to IDA?
The guidelines should state that curators should not make EXP annotations to MF when only the CC is observed, i.e., MF annotations based upon existence of a complex
Request new terms as needed to qualify the role of a gene
For small-scale experiments, curators should annotate to the experimental evidence in the paper
If a gene product has limited experimental literature, such as a newly characterized protein, downstream effect annotations could be kept
Provide diagram summarizing downstream annotations which can be made to components of signaling pathways
What is the process term for a specific transcription factor?
ACTION ITEM: transcription ontology revision
ACTION ITEM: Define start and end of signaling processes, signaling working group
Some MODs keep legacy annotations; others prefer to remove them. Is this a problem?
Are legacy annotations always wrong?
How can re-annotation of legacy annotations be prioritized?
ACTION ITEM: Form a working group to look into phenotype/development/IMP issues
Use of Regulation Terms
Changes to guidelines given in initial presentation:
A gene product can be part of a pathway and also regulate that pathway if it affects a different step in the pathway (feedback loops)
9268 IMP annotations to regulation terms – how to assess?
Binding (incomplete notes) Agreed Guidelines for GOC website 19 July 2010
See wiki for unresolved issues, some of them are:
Incorporation of IMEX data
Disagreement about transferring cross-species information by ISS and inclusion of non-in-vivo targets in column 8 or 16
How specific to make substrate/product target information
Will ChEBI IDs in function ontology propagate to process terms?
Automatic creation of protein binding child term, from known functions of protein
Existing GO to follow new has_part relationships implying substrate binding
Were the mailing lists effective?
Alternative: emails could go to the annotation list with the working group title in the subject of the email
This list would be open
Community involvement in annotation
- Jim Hu: students project CACAO
Jim Hu – CACAO (Community Assessment of Community Annotation with Ontologies)
i.e. undergrads doing functional annotation using Gene Ontology
student curation focused on IEA validation for genes with lots of IEAs and very little experimental annotation
evaluation: peer review by competition
students loved the competition
need to tweak the training
need multiple rounds
more time for challenges
better wiki tools for mentors/judges to track student annotation
recruiting more participants for Fall 2010
ASM, ASMCUE, AgBase… Not just for E. coli
content related to GO/genomics/function
support for assessment, e.g. rubrics, surveys, etc.
plans for possible publications
credit for innovative teaching on their campuses
NSF broader impacts
Inter-institution teams for recruiting
- dictyBase, pomBase: involving the community
Proposing annotation projects to the reference genome
Pascale: Presentation : File:Pascale-RefGenome-Process.pdf see Strategy_for_establishing_RefG_annotation_priorities
Concurrent annotation of biological ‘modules’
Annotation consistency, guidelines, and quality control
Enable propagation of annotations via PAINT
Publicize the Reference Genome initiative – can create better stories, perhaps publishing the results
Provide opportunities to involve experts
Prioritizing projects
How conserved the process is – still trying to find processes that all MODs can annotate
Each project needs a project leader and external experts
SOP for ref genome Biocuration projects:
Leader prepares summary
Identify experts in the field
Identify biologically coherent targets
Ensure that ontology is generally correct
Curators perform primary literature annotation
Tree curators annotate families – 80+ possible
Project leaders, tree curators, experts discuss annotations, make sure biology is well represented
Next project: Wnt signaling pathway (Varsha and Suzi)
Highly conserved secreted signaling molecules
Regulate cell-cell interactions
Insights from many organisms
Mutations lead to specific developmental defects
Target families for June – July 2010
Suggest new projects on wiki page
Trial of new time frame for Ref Genome curation – curators do as much annotation on one family in a given week as possible
Closing discussion and summary of meeting