SGD GO HTP guidelines
To differentiate annotations made from published small-scale experiments, genome-wide or high-throughput experiments, and computational predictions, we have separated GO annotations at SGD into three sets: manually curated, high-throughput, and computational GO annotations.
This page contains a summary of high-throughput dataset policy and guideline discussions/decisions made during various SGD curatorpaloozas and curator jams. These guidelines are fluid, and remain a work in progress.
- 1 Three different types of GO annotations at SGD
- 2 What is an HTP paper? (2006 curatorpalooza)
- 3 Guidelines for considering manual vs HTP annotations (Feb 7, 2008)
- 4 Guidelines for adding large-scale datasets (Feb 14, 2008)
- 5 GO High-throughput annotations (June 26, 2008)
- 6 Using the RCA evidence code and Updating/Deleting RCA annotations (Jan. 31 and June 26, 2008)
- 7 What to do with HTP IMP data when no other pubs exist (July 17, 2008)
- 8 HTP GO Process annotations by IPI? (August 21, 2008)
Three different types of GO annotations at SGD
- Manually curated GO annotations reflect our best understanding of the basic molecular function, biological process, and cellular component for a gene product. Manually curated annotations are assigned by SGD curators reading the literature for each gene and making annotations from published papers when available. When published literature is available, such annotations may include those based on experiments, sequence similarity, or other computational analyses described in the paper, or on statements made by the authors. When no published literature is available for a gene, annotations may be made on the basis of curatorial judgements. Curators periodically review all Manually curated GO annotations for accuracy and completeness and update as necessary, adding new annotations to reflect advances in knowledge and removing any annotations that are no longer supported by the literature. The Last Reviewed on: date on the GO evidence and references page for a gene indicates the date when an SGD curator reviewed all of the Manually curated GO annotations for that gene. In addition, SGD also reviews and incorporates manual GO annotations for S. cerevisiae proteins from the GO Annotation (GOA) project at Uniprot. These annotations can be identified at SGD by the source, e.g., 'Uniprot', 'MGI', 'HGNC' (GO consortium members), displayed on the 'Assigned By' column of the GO evidence and references page.
- GO annotations from high-throughput experiments are assigned based on a variety of large-scale high-throughput experiments, including genome-wide experiments. Many of these annotations are made based on GO annotations (or mappings to GO annotations) assigned by the authors, rather than SGD curators. While SGD curators read these publications and often work closely with authors to incorporate the information, each individual annotation is not necessarily reviewed by a curator. GO annotations from high-throughput experiments will be assigned only when this type of data is available, and thus may not be assigned in all three aspects of the Gene Ontology.
- Computational GO annotations are made by a variety of computational methods, such as sequence similarity methods, including protein domain motifs, and keyword mapping files. When annotations based on computational methods are NOT reviewed by a curator, they are placed in the Computational GO annotations section. Note that the criterion for including a GO annotation in this section is whether or not it was reviewed by a curator; when annotations made by a computational method, such as sequence analysis, are reviewed by a curator, they may be found in the Manually curated section. Currently (as of 09/2007), all computational GO annotations for S. cerevisiae are assigned by an external source (for example, the Gene Ontology Annotation (GOA) project of the European Bioinformatics Institute (EBI)). In SGD, curators read the research literature and associate specific GO terms with the appropriate gene products to provide information about the state of knowledge of the yeast genome. We are constantly updating our GO annotations and always welcome suggestions for improvement or corrections when the understanding about a gene has changed since the last time we reviewed the literature for a given gene.
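As a rough illustration only, the way an annotation ends up in one of the three sets could be sketched as below. The function name, the three boolean inputs, and the order in which the checks are applied are assumptions made for this example; they are not taken from SGD's actual pipeline.

```python
def annotation_set(from_htp_experiment: bool,
                   computational_method: bool,
                   curator_reviewed: bool) -> str:
    """Return which of the three SGD GO annotation sets an annotation falls into.

    Hypothetical helper: the inputs and the precedence of the checks are
    illustrative assumptions, not SGD's implementation.
    """
    # Computational annotations that no curator has reviewed go to the
    # Computational section; review by a curator is the deciding criterion.
    if computational_method and not curator_reviewed:
        return "computational"
    # Annotations taken from large-scale or genome-wide experiments.
    if from_htp_experiment:
        return "high-throughput"
    # Everything else, including curator-reviewed computational annotations.
    return "manually curated"
```

For instance, a sequence-similarity annotation that a curator has reviewed would come back as "manually curated" under this sketch, matching the note above that review, not method, decides the section.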
What is an HTP paper? (2006 curatorpalooza)
Often it's obvious, but ultimately it's up to curator judgement as to whether or not to consider a paper HTP. Some characteristics of HTP papers (note that not every HTP paper will have all of these):
- The authors haven't checked every construct - this is not expected for publication
- (for example, GFP fusion constructs aren't necessarily shown to be functional in vivo)
- The results can be measured by one condition/cutoff
- It's necessary to have the genome sequence to do the experiment
- It's necessary to have the systematic deletion set to do the experiment
- What is the purpose of the experiment? Is it for a large group of genes/proteins? Is it hypothesis driven?
- HTP experiments tend to be more open-ended or like a fishing expedition, rather than hypothesis driven.
- The major distinction between HTP and core techniques is in the methods and controls rather than in the number of genes/proteins involved
Guidelines for considering manual vs HTP annotations (Feb 7, 2008)
- Was there an assay used to check the purified complex for activity?
- Were any or all of the proteins/RNAs characterized further?
- Were terms such as 'proteomic screening' used in the paper?
If the answer to either of the first two questions is YES, or the answer to the third question is NO, then consider manual annotations. In cases where some but not all of these guidelines are satisfied, or when in doubt, it never hurts to send an email to the group to discuss it further.
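The three questions can be read as a small checklist. The sketch below is one possible encoding of it, assuming a hypothetical `PaperChecklist` record and return strings invented for this example; the final call always remains with the curator.

```python
from dataclasses import dataclass


@dataclass
class PaperChecklist:
    """Answers to the three questions above (field names are hypothetical)."""
    activity_assayed: bool        # Was the purified complex assayed for activity?
    characterized_further: bool   # Were any or all proteins/RNAs characterized further?
    uses_screening_terms: bool    # Does the paper use terms such as 'proteomic screening'?


def suggest_annotation_type(paper: PaperChecklist) -> str:
    """Suggest manual vs HTP annotation; mixed answers are flagged for discussion."""
    # Each entry is True when that question points towards manual annotation.
    manual_signals = [
        paper.activity_assayed,
        paper.characterized_further,
        not paper.uses_screening_terms,
    ]
    count = sum(manual_signals)
    if count == 0:
        return "htp"
    if count == len(manual_signals):
        return "manual"
    # Some but not all guidelines satisfied: consider manual, but ask the group.
    return "manual, discuss with group"
```

A paper with an activity assay that also describes itself as a proteomic screen would return "manual, discuss with group", which corresponds to the "send an email to the group" case.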
Guidelines for adding large-scale datasets (Feb 14, 2008)
Currently our focus for GO is to capture the primary role of the gene product. Here are some general questions that can be applied to many HTP datasets:
- Add the entire set for all genes? or only for those uncharacterized?
- If and when should the dataset be removed?
- Guidelines we decided upon:
- We will do nothing with the Huh dataset.
- This is an old dataset and our first HTP dataset. We only added what was unknown at the time and that's the way it is.
- For future htp datasets, add the entire dataset.
- If there is published evidence that a reagent or construct used in an htp experiment is bad, remove all the associated data (GO, phenotypes, sequence, interactions (via an email to BioGrid), etc.)
- Another issue to think about for the future is how to record instances where a reagent or construct is faulty.
- High-throughput mutant studies should be captured as phenotypes and not GO, with room for exceptions.
- We should review papers that were used to make HTP GO annotations with IMP evidence. We may want to delete the GO annotations and curate the phenotypes instead.
GO High-throughput annotations (June 26, 2008)
It was previously agreed that the telomere maintenance process annotations from Gatbonton et al. (PMID:16552446) and Askree et al. (PMID:15161972) would be best represented through phenotype curation, rather than GO annotations. There has been a delay in adding all this information to the phenotypes, as there were some annotation issues that needed to be worked out as we transition to the new phenotype curation system. However, we are still planning on adding the phenotypes and then, once the phenotype system goes live in July 2008, deleting the GO annotations.
Using the RCA evidence code and Updating/Deleting RCA annotations (Jan. 31 and June 26, 2008)
Jan. 31, 2008: It was concluded that RCA, as a curator-assigned code, should only be used if a curator looks specifically at that gene. If it is felt to be appropriate to add annotations from a similar analysis in the future without looking at each gene to be annotated, the annotations will be associated with the IEA evidence code. For the Wade et al. 2006 (PMID:16544271) paper annotations, it was agreed that they can be removed for genes that have experimental evidence supporting an annotation to ribosome biogenesis or to a more granular term in that branch.
June 26, 2008: We decided that RCA evidence is fundamentally predictive, rather than experimental, and should expire after 1 year, just like computational evidence does. This proposal will be brought up with the larger group during SGD Group meeting.
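If the one-year expiry proposal were adopted, a check for stale RCA annotations might look like the sketch below. The function name, the 365-day reading of "1 year", and the idea that each annotation carries an assignment date are all assumptions for illustration; the notes above record only the proposal itself.

```python
from datetime import date

# Assumed reading of "1 year" from the June 26, 2008 proposal.
RCA_LIFETIME_DAYS = 365


def rca_is_expired(evidence_code: str, date_assigned: date, today: date) -> bool:
    """Return True if an RCA annotation is older than the proposed lifetime.

    Hypothetical helper: only RCA is handled here; other evidence codes
    are never reported as expired by this function.
    """
    if evidence_code != "RCA":
        return False
    return (today - date_assigned).days > RCA_LIFETIME_DAYS
```

Passing `today` explicitly, rather than calling `date.today()` inside the function, keeps the check reproducible when reviewing a batch of annotations against a fixed cutoff.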
What to do with HTP IMP data when no other pubs exist (July 17, 2008)
Some different questions to help frame the discussion:
- Do we selectively pick genes from an HTP study to do a GO annotation? We had shifted towards treating the entire dataset the same way, such as adding the entire Sickmann cellular component localization dataset.
- Should we even be making GO annotations from HTP phenotypes (IMP)? We don't think about making GO annotations from HTP IGI data that is loaded from BioGRID so it's a similar situation. Ultimately, this means that the use of the HTP designation for IMP annotations for Biological Process will be limited.
- If we can't capture HTP data in any other curated field, should we capture it as GO? We have captured information in the headlines.
- Is it better to have something than nothing? Especially for unnamed, uncharacterized ORFs where there is very little data? Can an exception be made in these cases? But if the information can be captured in the phenotypes, the data will be present. And we can always update the headline with salient summaries.
- Is there a difference in dealing with information from a paper during single paper-based curation as opposed to reviewing all the literature for a gene? Yes, when reviewing the publications for a gene, we may want to summarize some HTP results for the description. But on a per paper basis, the task of digging up all the uncharacterized ORFs from supplemental data to update their descriptions may be too onerous or impractical.
In the end, we decided, as a general guideline, not to make GO annotations from HTP IMP data, even if there are no other pubs for an uncharacterized gene. However, if there appears to be a useful paper, we should bring it up for discussion, because these are working guidelines that can be revised as necessary/appropriate.
HTP GO Process annotations by IPI? (August 21, 2008)
Last week we discussed the high-throughput GO Process annotations made from the Hazbun paper. We decided we needed to review the Hazbun paper before making a decision.
Review of: Hazbun TR, et al. (2003) Assigning function to yeast proteins by integration of technologies. Mol Cell 12(6):1353-65 PMID:14690591
- The authors used two protein interaction methods, copurification (TAP tagged) and two-hybrid, but would make assignments based on only one of the two, and "a single copurifying protein of known function determined the associated GO term for eight uncharacterized ORFs". These were the source of the Process annotations using IPI.
- Most of the time, "The cellular component term was assigned based on the fluorescence microscopy". We assigned these using IDA. They also assigned the term "nucleus" to some proteins based on copurification in a complex that was known to be nuclear. We either did not enter GO annotations based on this type of method or have already deleted them.
- "The molecular function term was assigned based on remote homologies to proteins of known function using PSI-BLAST, consensus fold recognition methods, or structure-based matches of de novo structure predictions to proteins of known structures. As proteins with the same fold can have different functions (Todd et al., 2001), assignments were only made if the GO term was consistent with the data generated by other technologies, as was true for 27 out of 29 possible annotations (Table 1, column 7)."
Hazbun IPI annots
nuclear mRNA splicing, via spliceosome:
- NTR2: has other experimental process annots, more specific; purified with multiple things
- SPP382: has other experimental process annots, more specific; purified with multiple things
- CWC25: has other experimental process annot (also from proteomics paper), identical term; copur with several things (PRP3, SNU114, PRP19, CLF1)
- YLR132C: would have no non-IEA process annots; has 2 Hazbun IPIs, both from single copurs (PRP19, COR1)
- NOP9: has other experimental process annots, more specific; copur with 3 other things, but 1 of the 3 not obviously involved in rRNA processing (GCD6 vs SNU13 & NSR1)
- ENP2: would only have Wade RCA for process, but has component annots that are consistent; purified with multiple things (BFR2, LCP5, UTP9, HCA4, NOP58, ESF1/YDR365C)
- ESF1: has other experimental process annots, identical term; purified with multiple things (BFR2, LCP5, UTP9, HCA4, NOP58, ENP2/YGR145W)
- NSE5: has other experimental process annots, identical term; purified with multiple things (QRI2, MMS21, KRE29, RHC18, SMC5)
- NSE3: has other experimental process annots, identical term; purified with multiple things (QRI2, MMS21, RHC18, SMC5, NSE1)
coenzyme A biosynthetic process:
- YKL088W: has the same P annot by ISS and a different one by IGI; Hazbun IPI should have been made by ISS according to their web page
establishment of cell polarity:
- RRP14: has other experimental process annots, inconsistent with Hazbun; two hybrid and structure (nucleolar matrix protein)
- SWC4: has other experimental process annots, less specific; purified with multiple things (RVB1, YDR334W, VID21, YEL018W, EPL1, YNG2, TRA1, ARP4, YAF9, ESA1, RVB2), also two-hybrid (nuclear division), and structure (histone methyltransferase associated binding protein)
- YPP1: has other experimental process annots, possibly more specific; copur with 1 thing
- YJR141W: would have no P annots; two-hybrid with 1 thing
protein import into mitochondrial matrix:
- TAM41: has other experimental process annots, identical term; copur with 1 thing
- JIP5: would have no process annotation; copur with 1 thing
- MIA40: has other experimental process annots, more specific; copur with 3 other things, but not all involved in same thing (KAP123, TOM40, ADH1)
- YJR136C: would have no P annots; copur with 1 thing
- YJR012C: would have no P annots; copur with multiple things (PEX7, MMP1, PLB1, HOL1), but kind of a mixed bag
Questions to be resolved:
- The IPI process annotations seem a bit of a mixed bag, with some coming from multiple interactions and others from only a single interaction. For the Samanta paper, we think we didn't make annotations when there was only a single interaction. Are we comfortable making the process annotations when some of them come from copurification with a single protein?
- We put the Process (IPI) annotations in as "high-throughput", but used "manually curated" for the Function (ISS) and the Component (IDA) annotations. Should we use the same annotation type for all of them?
- We used ISS for the Function annotations, but they only made the F annotations if there was additional data from one of the other technologies, so maybe these should be RCA.
- We agreed to remove these process annotations for the genes indicated in red, above. The only objection to this was that the authors of this paper might be upset to see these citations disappear. The majority felt that the authors were unlikely to notice, and even if they did they probably wouldn't say anything.
- We agreed to leave the process annotations for the genes indicated in black, above (using IPI and the htp tag). This was the subject that generated the most discussion, primarily focusing on making sure that our actions were consistent with our general curatorial practices. The possible actions proposed were:
- leave as is (the winner, with at least 2/3 majority in favor)
- leave these annotations, but remove the htp tag, making them manual annotations
- remove these annotations (in addition to those indicated in red)
- go through each of them individually and make a decision of the most proper annotation, changing it to manual.
- We agreed to change the function and component annotations made with this paper to "htp". Karen will make this change.
- Karen will also look at the gene indicated in orange and annotate as she sees fit.