2010 GO camp Annotation of HTP data
- Members of this working group
- 1. Background
- 2. Review of current GO annotation practices
- Meetings
- 3. Proposed annotation policy
- 4. Examples (papers) and discussion of GO annotation issues
- 5. Examples of annotation incoherence for the user
- 6. Suggestions for Quality Control procedures
Members of this working group
1. Background
With advances in technology and the availability of genomic data, it is now possible to study various aspects of gene products at a genomic scale, for example the localization of thousands of gene products or phenotypes on a large scale. These studies typically describe the experimental method in the main text but cannot show the results for each gene product (gp) within the paper, so the results are often supplied in the supplementary data. It is time consuming for a curator to review the data for thousands of gps. What is the best way to annotate these types of studies?
2. Review of current GO annotation practices
HTP conference call on June 7, 2010
- What is a HTP paper?
- Should we add all the data from a HTP paper?
- for all the genes (some genes in the HTP study might be very well studied)
- use some cut-off to indicate confidence so we don't load noise
- Do HTP annotations ever get removed?
- What happens to propagation?
- How do we indicate to the user that the data is from a HTP paper?
- How to decide if a given paper is a HTP paper?
- Rama and Eurie on SGD's practice:
- In some cases, it is obvious because it is a genome-wide study.
- SGD has come across non-genome-wide papers that they have tagged as HTP.
- Criteria for identifying these non-genome-wide HTP papers are:
- The authors haven't checked every construct - this is not expected for publication (for example, GFP fusion constructs aren't necessarily shown to be functional in vivo)
- The results can be measured by one condition/cutoff
- What is the purpose of the experiment? Is it for a large group of genes/proteins? Is it hypothesis driven?
- HTP experiments tend to be more open-ended or like a fishing expedition, rather than hypothesis driven.
- The major distinction between HTP and core techniques is in the methods and controls rather than in the number of genes/proteins involved
- Do we usually verify the data to see if data for a well established gene/protein was reproduced in the large-scale paper?
- SGD: No. We review the method; almost every HTP paper is discussed in our group meeting, and then we load the data.
- Figuring out annotations for component and function is not tricky, but BP annotations from HTP studies can be. The process mentioned in the paper could reflect downstream effects. For example, SGD came across a HTP paper in which telomere length was measured in the non-essential KO collection. We made the telomere maintenance annotations, but the authors pointed out that the result could be an indirect effect, so SGD removed them. Be cautious when making BP annotations.
- When do you remove annotations? SGD removes annotations if another paper clearly shows why the HTP data was wrong.
- Currently in SGD, all HTP experimental data get an experimental evidence code with a HTP annotation method, and all HTP predictions get RCA, or whatever code is applicable, with a computational method as the annotation method.
- How to flag the annotations to indicate they are HTP?
- the new evidence code proposal discussed at the Stanford GOC meeting should provide a system to handle this.
Two high-level nodes: Computational | Experimental
- Computational
  - sequence based
    - Reviewed (R) sequence based
    - Not_Reviewed (NR) sequence based
  - integrative computational analysis
    - R ICA
    - NR_ICA
  - text-based computational analysis
- Experimental
  - IDA
    - R-IDA
    - NR-IDA
  - IMP
  - etc
Each of these will have 2 subclasses, Reviewed and Not_Reviewed, as shown above, and all these codes/subcodes will have IDs.
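As an illustration only, the proposed hierarchy could be represented as a small tree in which every code receives a Reviewed and a Not_Reviewed subclass, each with its own ID. The Python sketch below is hypothetical: the function name `build_codes` and the placeholder identifiers (`EV:0001`, ...) are assumptions made for this example and are not part of the proposal. Only the codes named in the outline are included; "etc" is left out.

```python
from itertools import count

# Top-level nodes and their child codes, as listed in the proposal outline.
HIERARCHY = {
    "Computational": [
        "sequence based",
        "integrative computational analysis",
        "text-based computational analysis",
    ],
    "Experimental": ["IDA", "IMP"],
}

REVIEW_STATES = ("Reviewed", "Not_Reviewed")


def build_codes(hierarchy):
    """Return (id, path) pairs for every node and every review subclass.

    The IDs are placeholders assumed for this sketch, not real identifiers.
    """
    ids = count(1)
    rows = []

    def add(path):
        rows.append((f"EV:{next(ids):04d}", path))

    for top, codes in hierarchy.items():
        add((top,))
        for code in codes:
            add((top, code))
            for state in REVIEW_STATES:
                add((top, code, state))
    return rows


if __name__ == "__main__":
    for code_id, path in build_codes(HIERARCHY):
        print(code_id, " > ".join(path))
```

With a structure like this, an annotation would point at a leaf such as Experimental > IDA > Not_Reviewed, which is one way a HTP-derived annotation could be told apart from a curator-reviewed one.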
3. Proposed annotation policy
4. Examples (papers) and discussion of GO annotation issues
- Hazbun TR, et al. (2003) Assigning function to yeast proteins by integration of technologies. Mol Cell 12(6):1353-65, PMID 14690591
- Kumar A, et al. (2002) Subcellular localization of the yeast proteome. Genes Dev 16(6):707-19, PMID 11914276
- Reinders J, et al. (2006) Toward the complete yeast mitochondrial proteome: multidimensional separation techniques for mitochondrial proteomics. J Proteome Res 5(7):1543-54, PMID 16823961
- Huh WK, et al. (2003) Global analysis of protein localization in budding yeast. Nature 425(6959):686-91, PMID 14562095
- Sickmann A, et al. (2003) The proteome of Saccharomyces cerevisiae mitochondria. Proc Natl Acad Sci U S A 100(23):13207-12, PMID 14576278
5. Examples of annotation incoherence for the user
Multiple subcellular locations
At the moment, there is no way to distinguish large-scale results from those of specific papers. What will happen with propagation?
GO:0005618 cell wall IDA
GO:0005730 nucleolus IDA
GO:0005739 mitochondrion
GO:0005741 mitochondrial outer membrane IDA
GO:0005750 mitochondrial respiratory chain complex III IDA
GO:0005758 mitochondrial intermembrane space IDA
GO:0005759 mitochondrial matrix IDA
GO:0005886 plasma membrane IDA
GO:0009507 chloroplast IDA
GO:0016020 membrane IDA

GO:0005730 nucleolus IDA
GO:0005739 mitochondrion IDA
GO:0005886 plasma membrane IDA
GO:0009507 chloroplast IDA
GO:0009535 chloroplast thylakoid membrane IDA
GO:0009579 thylakoid IDA
GO:0022626 cytosolic ribosome IDA
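A check for this kind of pattern could be automated. The following Python sketch flags gene products whose IDA annotations span more than a chosen number of coarse compartments. Everything beyond the GO IDs themselves is an assumption made for illustration: the gene product names (`gp_A`, `gp_B`, `gp_C`), the hand-written grouping of terms into compartments, the function name and the threshold of 2. A real check would derive the grouping from the ontology itself.

```python
from collections import defaultdict

# Hand-written grouping of the GO terms above into coarse compartments
# (an assumption for this sketch). Very general terms such as
# GO:0016020 membrane and GO:0009579 thylakoid are deliberately left out.
COMPARTMENT = {
    "GO:0005618": "cell wall",
    "GO:0005730": "nucleus",
    "GO:0005739": "mitochondrion",
    "GO:0005741": "mitochondrion",
    "GO:0005750": "mitochondrion",
    "GO:0005758": "mitochondrion",
    "GO:0005759": "mitochondrion",
    "GO:0005886": "plasma membrane",
    "GO:0009507": "chloroplast",
    "GO:0009535": "chloroplast",
    "GO:0022626": "cytosol",
}

# (gene product, GO ID, evidence code). gp_A and gp_B are placeholder names
# for the two examples above; gp_C is an invented single-location case.
# The first example lists no evidence code for GO:0005739, hence None.
ANNOTATIONS = (
    [("gp_A", go, "IDA") for go in (
        "GO:0005618", "GO:0005730", "GO:0005741", "GO:0005750", "GO:0005758",
        "GO:0005759", "GO:0005886", "GO:0009507", "GO:0016020")]
    + [("gp_A", "GO:0005739", None)]
    + [("gp_B", go, "IDA") for go in (
        "GO:0005730", "GO:0005739", "GO:0005886", "GO:0009507", "GO:0009535",
        "GO:0009579", "GO:0022626")]
    + [("gp_C", go, "IDA") for go in ("GO:0005739", "GO:0005759")]
)


def flag_multi_location(annotations, max_compartments=2):
    """Return {gene product: sorted compartments} for gene products whose
    IDA annotations cover more than max_compartments coarse compartments."""
    seen = defaultdict(set)
    for gp, go_id, evidence in annotations:
        compartment = COMPARTMENT.get(go_id)
        if evidence == "IDA" and compartment is not None:
            seen[gp].add(compartment)
    return {gp: sorted(c) for gp, c in seen.items() if len(c) > max_compartments}


if __name__ == "__main__":
    for gp, compartments in flag_multi_location(ANNOTATIONS).items():
        print(gp, compartments)
```

A flag from such a check would not mean the annotations are wrong, only that they are worth a curator's second look.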
When no distinction between 2 subcellular locations is made in the paper...
Should we annotate it at all? In PMID 16618929, no distinction is made between mitochondria and chloroplasts, but both subcellular locations have been added. See page 3: '...corresponding to mitochondria/plastids (which were not resolved from each other within the density gradient used in this study),...'
Is that correct, or should we discard such results?
GO:0005739 mitochondrion IDA GO:0009536 plastid IDA
Induction by high-throughput analysis...
Should we check whether an induction is significant, or just accept it as the authors report it? PMID 16463103 reports responses to many biotic and abiotic stresses, scored on the following scale:
+ ++ +++ ++++ >++++
When the control is + and the induction is ++, is it really significant, given that the same table goes up to >++++? What about propagation after that, especially in this case, where a large family is analyzed with differential inductions?
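One way to make this question concrete is to treat the scores as an ordinal scale and require a minimum number of steps above the control before annotating. The Python sketch below is hypothetical: the function names and the default threshold `min_steps=2` are assumptions for illustration, not values taken from the paper, and counting ordinal steps is not a statistical test of significance.

```python
# Ordinal levels for the scoring scale shown above, lowest to highest.
SCALE = ["+", "++", "+++", "++++", ">++++"]
LEVEL = {score: i for i, score in enumerate(SCALE, start=1)}


def steps_above_control(control, induced):
    """Number of scale steps between the control and the induced score."""
    return LEVEL[induced] - LEVEL[control]


def passes_cutoff(control, induced, min_steps=2):
    """True if the induced score is at least min_steps above the control.

    min_steps is an assumed, curator-chosen threshold.
    """
    return steps_above_control(control, induced) >= min_steps


if __name__ == "__main__":
    print(passes_cutoff("+", "++"))      # one step above control
    print(passes_cutoff("+", ">++++"))   # four steps above control
```

Under this assumed threshold, a control of + with an induction of ++ would not be annotated, while the same control with >++++ would.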
Limit of TAS statement
PMID 11118137, family of 1984 putative transcription factors
GO:0045449 regulation of transcription TAS
Should we consider TAS statements that cover such a large set?
6. Suggestions for Quality Control procedures