SGD GO HTP guidelines

From GO Wiki
Revision as of 15:43, 7 August 2008 by Stacia

This page contains a summary of high-throughput dataset policy and guideline discussions/decisions made during various SGD curatorpaloozas and curator jams. These guidelines are fluid, and remain a work in progress.

What is an HTP paper? (2006 curatorpalooza)

Often it's obvious, but ultimately it's up to curator judgement as to whether or not to consider a paper HTP. Some characteristics of HTP papers (note that not every HTP paper will have all of these):

  • The authors haven't checked every construct - this is not expected for publication
    • (for example, GFP fusion constructs aren't necessarily shown to be functional in vivo)
  • The results can be measured by one condition/cutoff
  • It's necessary to have the genome sequence to do the experiment
  • It's necessary to have the systematic deletion set to do the experiment
  • What is the purpose of the experiment? Is it for a large group of genes/proteins? Is it hypothesis driven?
    • HTP experiments tend to be more open-ended or like a fishing expedition, rather than hypothesis driven.
  • The major distinction between HTP and core techniques is in the methods and controls rather than in the number of genes/proteins involved

Guidelines for considering manual vs HTP annotations (Feb 7, 2008)

  1. Was there an assay used to check the purified complex for activity?
  2. Were any or all of the proteins/RNAs characterized further?
  3. Were terms such as 'proteomic screening' used in the paper?

If the answer to question 1 or 2 is YES, or the answer to question 3 is NO, then consider manual annotations. In cases where some but not all of these guidelines are satisfied, or when in doubt, it never hurts to send an email to the group to discuss it further.
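The rule above can be sketched as a small helper. This is only an illustration of the three-question checklist, not a tool SGD uses; the names `Suggestion` and `suggest_route` and the boolean parameters are hypothetical, and the final call always rests with curator judgement.

```python
from typing import NamedTuple


class Suggestion(NamedTuple):
    """Hypothetical result type: what the checklist suggests for a paper."""
    consider_manual: bool      # at least one answer points to manual annotation
    discuss_with_group: bool   # answers are mixed, so emailing the group is advised


def suggest_route(activity_assay: bool,
                  characterized_further: bool,
                  screening_terms_used: bool) -> Suggestion:
    """Apply the three questions from the Feb 7, 2008 guidelines.

    activity_assay:        question 1, was the purified complex assayed for activity?
    characterized_further: question 2, were any proteins/RNAs characterized further?
    screening_terms_used:  question 3, does the paper use terms like 'proteomic screening'?
    """
    # Each entry is True when that answer points towards manual annotation:
    # YES to question 1, YES to question 2, or NO to question 3.
    signals = [activity_assay, characterized_further, not screening_terms_used]
    consider_manual = any(signals)
    # "Some but not all" guidelines satisfied: worth raising with the group.
    discuss_with_group = any(signals) and not all(signals)
    return Suggestion(consider_manual, discuss_with_group)
```

For example, a paper that includes an activity assay but also describes itself as a proteomic screen would come back as `consider_manual=True, discuss_with_group=True`, matching the advice to email the group when the answers are mixed.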

Guidelines for adding large-scale datasets (Feb 14, 2008)

Currently our focus for GO is to capture the primary role of the gene product. Here are some general questions that can be applied to many HTP datasets:

  1. Add the entire set for all genes, or only for uncharacterized genes?
  2. If and when should the dataset be removed?
  • Guidelines we decided upon:
    • We will do nothing with the Huh dataset.
      • This is an old dataset and our first HTP dataset. We added only what was unknown at the time, and we will leave it that way.
    • For future HTP datasets, add the entire dataset.
    • If there is published evidence that a reagent or construct used in an HTP experiment is bad, remove all the associated data: GO, phenotypes, sequence, interactions (via an email to BioGRID), etc.
      • Another issue to think about for the future is how to record instances where a reagent or construct is faulty.
    • High-throughput mutant studies should be captured as phenotypes and not GO, with room for exceptions.
    • We should review papers that were used to make HTP GO annotations with IMP evidence. We may want to delete the GO annotations and curate the phenotypes instead.

GO High-throughput annotations (June 26, 2008)

It was previously agreed that the telomere maintenance process annotations from Gatbonton et al. (PMID:16552446) and Askree et al. (PMID:15161972) would be best represented through phenotype curation, rather than GO annotations. There has been a delay in adding all this information to the phenotypes, as there were some annotation issues that needed to be worked out as we transition to the new phenotype curation system. However, we are still planning on adding the phenotypes and then, once the phenotype system goes live in July 2008, deleting the GO annotations.

Using the RCA evidence code and Updating/Deleting RCA annotations (Jan. 31 and June 26, 2008)

Jan. 31, 2008: It was concluded that RCA, as a curator-assigned code, should only be used if a curator looks specifically at that gene. If it seems appropriate in the future to add annotations from a similar analysis without looking at each gene to be annotated, those annotations will be given the IEA evidence code. For the Wade et al. 2006 (PMID:16544271) paper, it was agreed that the annotations can be removed for genes that have experimental evidence supporting an annotation to ribosome biogenesis or to a more granular term in that branch.

June 26, 2008: We decided that RCA evidence is fundamentally predictive, rather than experimental, and should expire after 1 year, just as computational evidence does. This proposal will be brought up with the larger group during the SGD group meeting.
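As a rough illustration of the proposed one-year expiry, the check might look like the sketch below. It assumes that "1 year" means the same calendar date in the following year and that IEA is the computational code treated the same way; the function names and the `EXPIRING_CODES` set are hypothetical, and the proposal itself had yet to be discussed with the larger group.

```python
from datetime import date

# Assumption: codes treated as predictive and therefore subject to expiry.
EXPIRING_CODES = {"RCA", "IEA"}


def expiry_date(assigned: date) -> date:
    """Return the date one calendar year after the annotation was assigned."""
    try:
        return assigned.replace(year=assigned.year + 1)
    except ValueError:
        # Feb 29 has no counterpart in the following year; fall back to Feb 28.
        return assigned.replace(year=assigned.year + 1, day=28)


def is_expired(evidence_code: str, assigned: date, today: date) -> bool:
    """True if a predictive annotation is at least one year old on `today`."""
    if evidence_code not in EXPIRING_CODES:
        # Experimental evidence (for example IMP) does not expire under this rule.
        return False
    return today >= expiry_date(assigned)
```

Under these assumptions, an RCA annotation assigned on 31 January 2008 would be flagged for review or removal from 31 January 2009 onwards.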

What to do with HTP IMP data when no other pubs exist (July 17, 2008)

Some different questions to help frame the discussion:

  • Do we selectively pick genes from an HTP study to make GO annotations? We had shifted towards treating the entire dataset the same way, such as adding the entire Sickmann cellular component localization dataset.
  • Should we even be making GO annotations from HTP phenotypes (IMP)? We don't consider making GO annotations from HTP IGI data that is loaded from BioGRID, so this is a similar situation. Ultimately, this means that the use of the HTP designation for IMP annotations for Biological Process will be limited.
  • If we can't capture HTP data in any other curated field, should we capture it as GO? We have captured information in the headlines.
  • Is it better to have something than nothing? Especially for unnamed, uncharacterized ORFs where there is very little data? Can an exception be made in these cases? But if the information can be captured in the phenotypes, the data will be present. And we can always update the headline with salient summaries.
  • Is there a difference in dealing with information from a paper during single paper-based curation as opposed to reviewing all the literature for a gene? Yes, when reviewing the publications for a gene, we may want to summarize some HTP results for the description. But on a per paper basis, the task of digging up all the uncharacterized ORFs from supplemental data to update their descriptions may be too onerous or impractical.

In the end, we decided as a general guideline not to make GO annotations from HTP IMP data, even if there are no other pubs for an uncharacterized gene. However, if a paper appears to be useful, we should bring it up for discussion, because these are working guidelines that can be revised as necessary/appropriate.