TermEnrichment: Gold Standard Data Sets

From GO Wiki

The goal is to collect a range of GO-annotated gene IDs with which to evaluate GO term enrichment. The genes in these sets should be well annotated. They will serve as control sets so that enrichment can be run routinely, say on a monthly basis, to see how the results change as a consequence of annotation and ontology changes. This will let users know what to expect when they run a GO enrichment analysis. For example, we should be able to see how enrichment is affected when, say, 10% of the annotations are deleted or when major changes happen in the ontology.
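As a rough illustration of the "10% of the annotations are deleted" scenario, here is a minimal Python sketch. It assumes a one-sided hypergeometric test as the enrichment statistic (this page does not prescribe one). The function names `enrichment_pvalue` and `drop_annotations` and the synthetic gene IDs are hypothetical and used only for illustration.

```python
import math
import random


def enrichment_pvalue(study, background, annotated):
    """One-sided hypergeometric p-value for a single term.

    Probability of seeing at least as many annotated genes in `study`
    as observed, if `study` were drawn at random from `background`.
    Assumes `study` is a subset of `background`.
    """
    N = len(background)              # background size
    K = len(annotated & background)  # background genes annotated to the term
    n = len(study)                   # study set size
    k = len(study & annotated)       # study genes annotated to the term
    total = math.comb(N, n)
    tail = sum(
        math.comb(K, i) * math.comb(N - K, n - i)
        for i in range(k, min(K, n) + 1)
    )
    return tail / total


def drop_annotations(annotated, fraction, rng):
    """Return a copy of `annotated` with about `fraction` of its genes removed."""
    keep = len(annotated) - round(len(annotated) * fraction)
    return set(rng.sample(sorted(annotated), keep))


# Synthetic data (assumption): 1000 background genes, 50 annotated to one term,
# and a study set of 20 annotated genes plus 20 unannotated ones.
background = {f"g{i}" for i in range(1000)}
term_genes = {f"g{i}" for i in range(50)}
study = {f"g{i}" for i in range(20)} | {f"g{i}" for i in range(500, 520)}

rng = random.Random(0)
reduced = drop_annotations(term_genes, 0.10, rng)

p_before = enrichment_pvalue(study, background, term_genes)
p_after = enrichment_pvalue(study, background, reduced)
print(f"p before deletion: {p_before:.3e}")
print(f"p after deleting 10% of annotations: {p_after:.3e}")
```

Running the same comparison on a fixed control set each month would show how much the p-value for an expected term drifts as annotations change. Note that the p-value can move in either direction, depending on whether the deleted annotations fall inside the study set.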

  • We need to define what to expect for any given set of genes. What is the truth?
  • The SGD, mouse, fly, ZFIN, and worm groups will put together some gene sets for this exercise.

Factors affecting GO analysis

  • Annotation depth (for example, when there is a push to reannotate)
  • Ontology version
  • Size of the input list
  • IEAs in the background gene set
  • FDR
  • Define a gene set and test term enrichment after poisoning the input list, that is, diluting it.

What goes in the gold standard gene sets

  • We need separate sets for the 3 different ontologies, although most people enrich only on BP.
  • There can be multiple sets per ontology, and they can be of different sizes (100 genes, 500 genes, and so on).
  • There can be a set of genes all related to metabolism and another set where these genes are mixed with genes annotated to different processes. The key is that the genes should be manually annotated and completely annotated (all literature).
  • Pick sets of genes that have known functional relationships INDEPENDENT of GO terms. Many of these are automated or semi-automated queries from MODs/UniProt. Since "co-expression" is so common a use case for term enrichment, I would not use lists of genes from co-expression unless there was a term (set) concordant with the conditions of the experiment.
    • In yeast, gene lists with the same or similar phenotypes (obviously not all phenotypes will map well to GO processes, but some will)
    • Metabolic or signaling pathway component lists
    • Genes that genetically interact with each other
    • Genes that physically interact (are in the same complex(es) as each other)
  • For a very "pure" set of genes (incredibly strong signal/low p-value for a set of GO terms {X}), dilute the list by adding random genes.
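The dilution step in the last bullet could be sketched along these lines. This is only one possible reading: it assumes "diluting" means adding randomly chosen background genes until the pure genes make up a chosen fraction of the list. The function name `dilute`, the `purity` parameter, and the synthetic gene IDs are hypothetical.

```python
import random


def dilute(pure_genes, background, purity, rng):
    """Add random background genes so the pure set is `purity` of the result.

    purity=1.0 returns the pure set unchanged; purity=0.5 adds as many
    random genes as there are pure genes.
    """
    if not 0 < purity <= 1:
        raise ValueError("purity must be in (0, 1]")
    pure = sorted(set(pure_genes))
    # Noise is drawn only from genes that are not already in the pure set.
    pool = sorted(set(background) - set(pure))
    n_noise = round(len(pure) * (1 - purity) / purity)
    if n_noise > len(pool):
        raise ValueError("not enough background genes for this purity")
    return pure + rng.sample(pool, n_noise)


# Synthetic example (assumption): 50 "pure" genes in a background of 1000.
background = [f"g{i}" for i in range(1000)]
pure_set = background[:50]

rng = random.Random(42)
for purity in (1.0, 0.5, 0.25):
    diluted = dilute(pure_set, background, purity, rng)
    print(f"purity {purity:.2f}: {len(diluted)} genes in the input list")
```

Each diluted list can then be submitted to the enrichment tool under test, to see at what purity the expected terms stop being reported.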

Metadata for the gene sets

    • Taxon ID
    • Size of the gene set
    • email address of submitter
    • Year submitted
    • Description
    • geneID [tab] UniProt ID [tab] alternate ID
    • When you are ready, submit the file to SVN-
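To make the metadata list concrete, here is a hedged sketch of how such a file might be read. The page only names the fields and the three tab-separated columns. The on-disk layout assumed here (metadata as `# key: value` lines, followed by one tab-separated row per gene) is an invention for illustration, as are the `GeneSet` class, the `parse_gene_set` function, and the placeholder IDs.

```python
from dataclasses import dataclass, field


@dataclass
class GeneSet:
    metadata: dict = field(default_factory=dict)
    rows: list = field(default_factory=list)  # (gene_id, uniprot_id, alternate_id)


def parse_gene_set(text):
    """Parse the assumed layout: '# key: value' header lines, then gene rows."""
    gene_set = GeneSet()
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("#"):
            # Split on the first colon only, so values may contain colons.
            key, _, value = line[1:].partition(":")
            gene_set.metadata[key.strip().lower()] = value.strip()
            continue
        cols = line.split("\t")
        if len(cols) != 3:
            raise ValueError(f"line {line_no}: expected 3 tab-separated columns")
        gene_set.rows.append(tuple(col.strip() for col in cols))
    return gene_set


# Placeholder content; the IDs and the e-mail address are not real.
SAMPLE = (
    "# Taxon ID: 559292\n"
    "# Size of the gene set: 2\n"
    "# Email address of submitter: submitter@example.org\n"
    "# Year submitted: 2012\n"
    "# Description: placeholder set for illustration\n"
    "geneA\tUNIPROT_A\taltA\n"
    "geneB\tUNIPROT_B\taltB\n"
)

gene_set = parse_gene_set(SAMPLE)
declared = int(gene_set.metadata["size of the gene set"])
print(gene_set.metadata["taxon id"], declared == len(gene_set.rows))
```

Comparing the declared size against the number of rows, as in the last lines, is a cheap sanity check before a file is submitted.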