TermEnrichment: Gold Standard Data Sets

From GO Wiki
Revision as of 16:39, 8 November 2012 by Hitz (talk | contribs) (What goes in the gold standard gene sets)


The goal is to collect a range of sets of GO-annotated gene IDs with which to evaluate GO. These will serve as control sets so that enrichment can be run routinely, say on a monthly basis, to see how enrichment results change as a consequence of annotation and ontology changes. This will let users know what to expect when they do GO enrichment analysis. For example, we should be able to see how enrichment is affected when, say, 10% of the annotations are deleted, or when major changes happen in the ontology.

  • We need to define what to expect for any given set of genes. What is the truth?
  • SGD, mouse, fly, ZFIN, and worm will each put together some gene sets for this exercise.
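
As a rough illustration of the routine check described above, the sketch below scores a control set with a one-sided hypergeometric test and then repeats the run after randomly deleting a fraction of annotations. It is a minimal sketch, not the procedure used by any GO tool: the function names (`hypergeom_pvalue`, `enrich`, `drop_annotations`), the term-to-genes dictionary layout, and the choice of a plain hypergeometric test without multiple-testing correction or ontology propagation are all assumptions made for the example.

```python
import random
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k): chance of seeing at least k annotated genes in a sample
    of n genes drawn from a background of N genes, K of them annotated."""
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


def enrich(gene_set, background, annotations):
    """Return (term, p-value) pairs sorted by p-value, smallest first.

    annotations is a hypothetical layout: dict of term -> set of gene IDs.
    """
    background = set(background)
    gene_set = set(gene_set) & background
    results = []
    for term, genes in annotations.items():
        annotated = genes & background
        k = len(annotated & gene_set)
        if k == 0:
            continue  # term has no genes in the set, nothing to score
        p = hypergeom_pvalue(k, len(gene_set), len(annotated), len(background))
        results.append((term, p))
    return sorted(results, key=lambda pair: pair[1])


def drop_annotations(annotations, fraction, seed=0):
    """Return a copy of annotations with roughly `fraction` of the
    (term, gene) pairs removed at random, to mimic annotation loss."""
    rng = random.Random(seed)
    pairs = [(t, g) for t in sorted(annotations) for g in sorted(annotations[t])]
    keep = rng.sample(pairs, k=round(len(pairs) * (1 - fraction)))
    thinned = {}
    for term, gene in keep:
        thinned.setdefault(term, set()).add(gene)
    return thinned


# Toy data only: gene and term names are placeholders, not real identifiers.
background = [f"g{i}" for i in range(20)]
control_set = ["g0", "g1", "g2", "g3", "g4"]
annotations = {
    "term_A": {"g0", "g1", "g2", "g3", "g4"},
    "term_B": {"g0"} | {f"g{i}" for i in range(10, 19)},
}

before = enrich(control_set, background, annotations)
after = enrich(control_set, background, drop_annotations(annotations, 0.2))
```

Comparing `before` and `after` (for instance, whether the expected top terms keep their rank) is one possible way to record how a control set responds to annotation changes from month to month.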

What goes in the gold standard gene sets

  • We need separate sets for each of the 3 ontologies, although most people run enrichment only on biological process (BP).
  • Provide details on:
    • the top 5 hits/enriched terms you expect from your set
    • the background set you checked it against
    • taxon ID
    • size of the gene set
    • email address of submitter
    • year submitted
    • description
  • There can be multiple sets per ontology, and they can be of different sizes too (100 genes, 500 genes, and so on).
  • There can be a set of genes all related to metabolism, and another set where these genes are mixed with genes annotated to different processes.
  • Pick sets of genes that have known functional relationships INDEPENDENT of GO terms; many of these come from automated or semi-automated queries of MODs/UniProt. Since co-expression is so common a use case for term enrichment, I would not use lists of genes from co-expression unless there was a term (set) concordant with the conditions of the experiment.
    • In yeast, genes with the same "name", e.g., FLO*
    • In yeast, gene lists with the same or similar phenotypes (obviously not all phenotypes will map well to GO processes, but some will)
    • Lists of metabolic or signaling pathway components
    • Genes that genetically interact with each other
    • Genes that physically interact (are in the same complex(es) as each other)
  • For a very "pure" set of genes (incredibly strong signal/low p-value for a set of GO terms {X}), dilute this list by adding random genes.
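
The submission details and the dilution step listed above could be captured along the following lines. This is only a sketch under assumptions: the `GoldStandardSet` class, its field names, and the `dilute` helper are hypothetical and were chosen to mirror the bullet points, not taken from any existing GO tooling.

```python
import random
from dataclasses import dataclass


@dataclass
class GoldStandardSet:
    """One submitted gene set; fields mirror the details requested above."""
    description: str
    ontology: str             # which of the 3 ontologies, e.g. "BP"
    taxon_id: int
    genes: list               # gene IDs in the set
    background: str           # name or description of the background set used
    expected_top_terms: list  # the top 5 enriched terms the submitter expects
    submitter_email: str
    year_submitted: int

    @property
    def size(self):
        """Size of the gene set, derived rather than entered by hand."""
        return len(self.genes)


def dilute(genes, background, n_random, seed=0):
    """Return the original genes followed by n_random genes drawn at random
    from the background, excluding genes already in the set.

    Raises ValueError if the background has fewer than n_random other genes.
    """
    rng = random.Random(seed)
    pool = sorted(set(background) - set(genes))
    return list(genes) + rng.sample(pool, n_random)


# Toy data only: all identifiers below are placeholders.
pure = GoldStandardSet(
    description="placeholder pure set",
    ontology="BP",
    taxon_id=0,
    genes=["g0", "g1"],
    background="all placeholder genes",
    expected_top_terms=["term_A"],
    submitter_email="submitter@example.org",
    year_submitted=2012,
)
diluted_genes = dilute(pure.genes, [f"g{i}" for i in range(10)], n_random=3)
```

Fixing the random seed keeps a diluted set reproducible, so the same diluted list can be re-run against later annotation releases.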