Reference Genome minutes (Archived)

From GO Wiki
Revision as of 07:41, 2 October 2007 by Pascale (talk | contribs)

Minutes for the reference genome meeting, September 26-27, 2007

Orthology determination

Kara Dolinski

Background information

  • ‘Aggregator’ tools:
    • YOGY (PMID 16845020; not really maintained; covers all species except chicken and zebrafish). Methods include KOGs, InParanoid, HomoloGene, OrthoMCL and a table of curated orthologs between budding yeast and fission yeast.
    • bioPIXIE (PMID 16420673; Princeton), a data integration approach: incorporates data from several methods to generate a ‘probability of orthology’ (with Troyanskaya). Use the same protein sets with all the algorithms and update as required. Agreed as a good idea (see action item).
    • P-POD (PMID 17712414), based on OrthoMCL and Jaccard Coefficient Cluster.
    • The problem is that none of them does exactly what is wanted. There is no gold standard set. It is sometimes necessary to look for an ortholog manually when no tool finds one (short proteins, for example, or divergent ones, like E. coli proteins).
    • Another problem is the way databases handle orthologs – the mouse schema can’t cope with many to many relationships.
    • Methods to do orthology comparisons (see slide)
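
For readers unfamiliar with the Jaccard coefficient mentioned for P-POD above, the following is a minimal sketch of the measure itself. It is not taken from P-POD; the function name and the sets of similarity hits are hypothetical, and how any particular tool builds its clusters from such scores is not described in these minutes.

```python
def jaccard(a, b):
    """Jaccard coefficient of two collections: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


# Hypothetical sets of similarity hits for two proteins.
hits_p1 = {"p2", "p3", "p4"}
hits_p2 = {"p1", "p3", "p4", "p5"}

# Two shared hits out of five distinct hits overall.
score = jaccard(hits_p1, hits_p2)
```

Proteins whose hit sets overlap strongly score close to 1 and are candidates for the same cluster; the threshold used would be a tool-specific choice.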

Comparison of tools is made more difficult due to:

  1. different species are covered by different tools
  2. problems with inconsistent use of identifiers (TreeFam is a mess - MA) (Emily points out that UniProt and Ensembl have joined forces in trying to reduce differences (in proteins?) and fill in holes in UniProt – doing human first, with mouse second on the list for clean-up.)
  3. that not all sets are based on the same proteins due to varying frequency of updates/maintenance.

- Quality of ortholog tools: PMID 17440619: assessing performance of orthology detection methods

Issues regarding ortholog determination in the context of the reference genome project

  1. We need a set of sequences to work with

[ACTION ITEM]: Suzi and Karen E will generate a page where all sequences will be available

  2. We need a complete set of orthologs that covers all reference genomes. The orthology determination should capture one-to-many and many-to-many relationships (question: does that need to be captured somewhere? “Unique putative ortholog”, “one2many”, “many2many” (gene family?)). Our concern is to capture the full set rather than to make statements about the evolutionary relationships between gene products and/or organisms (we probably need to clarify this in our documentation and in what we present to the public as our goals).
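
As an illustration of the three labels raised in the question above, here is a minimal sketch that groups pairwise ortholog calls between two species and labels each group. The function name, the gene identifiers and the input format are all hypothetical; the meeting did not decide how, or whether, these labels would be stored.

```python
from collections import defaultdict


def classify_groups(pairs):
    """Group (gene_a, gene_b) ortholog pairs and label each group.

    Genes are linked into connected groups; a group is labelled by how
    many genes it contains on each species side.
    """
    # Undirected graph; nodes are tagged by side so IDs cannot clash.
    graph = defaultdict(set)
    for a, b in pairs:
        graph[("A", a)].add(("B", b))
        graph[("B", b)].add(("A", a))

    seen = set()
    groups = []
    for start in graph:
        if start in seen:
            continue
        # Depth-first walk to collect the connected group.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component

        genes_a = sorted(g for side, g in component if side == "A")
        genes_b = sorted(g for side, g in component if side == "B")
        if len(genes_a) == 1 and len(genes_b) == 1:
            label = "unique"
        elif len(genes_a) == 1 or len(genes_b) == 1:
            label = "one2many"
        else:
            label = "many2many"
        groups.append((label, genes_a, genes_b))
    return groups


# Hypothetical yeast (y*) to human (h*) ortholog calls.
pairs = [("y1", "h1"), ("y2", "h2"), ("y2", "h3"),
         ("y3", "h4"), ("y4", "h4"), ("y4", "h5")]
result = classify_groups(pairs)
```

Labelling whole groups rather than individual pairs keeps a pair such as y3/h4 in the many2many group even though y3 itself has only one partner.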

Setting curation priorities

Rex Chisholm

Background

When we started the reference genome project last year, we made genes involved in human diseases our main priority. The Scientific Advisory Board suggested that we also try to curate genes for which there are no GO annotations but for which there is published data. Other suggestions:

  • Encode
  • members of a complex should all be done at the same time
  • all enzymes in a pathway should be done at the same time

[ACTION ITEM] (everybody): We will add categories of genes to annotate in addition to ‘disease genes’. We will choose five genes from each of the following four groups:

[ACTION ITEM] (everybody): Each database should keep an eye out for such genes, so that it has genes to suggest when it is its turn to do the assignments

  • conserved genes/unannotated genes; genes that have few annotations and have lots of literature

[ACTION ITEM] (Val): Provide the list of 207 genes conserved between pombe and human with no annotation/information

[ACTION ITEM] (Jim): Provide the set of conserved genes found by InParanoid that are conserved in all 12 species (660 or so); we might want to prioritize this list by ascending order of number of annotations to target unannotated genes (who can do that?)

[ACTION ITEM] (Ruth): Send the HGNC list of genes with few annotations

This will be done on a rotation basis from all databases. I suggest we go alphabetically:

  • November 2007: Arabidopsis thaliana
  • December 2007: Caenorhabditis elegans
  • January 2008: Danio rerio
  • February 2008: Dictyostelium discoideum
  • March 2008: Drosophila melanogaster
  • April 2008: Escherichia coli
  • May 2008: Gallus gallus
  • June 2008: Homo sapiens
  • July 2008: Mus musculus
  • August 2008: Rattus norvegicus
  • September 2008: Saccharomyces cerevisiae
  • October 2008: Schizosaccharomyces pombe
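
The prioritization suggested in Jim's action item above (ordering the conserved set by ascending number of annotations) could be sketched as follows. This is only an illustration: the function name, gene identifiers and counts are made up, and how annotation counts would be obtained for each database is left open in the minutes.

```python
def prioritize(annotation_counts):
    """Return gene IDs ordered by ascending annotation count.

    Ties are broken alphabetically by gene ID so the order is stable.
    """
    return sorted(annotation_counts,
                  key=lambda gene: (annotation_counts[gene], gene))


# Hypothetical gene IDs and annotation counts.
counts = {"geneA": 12, "geneB": 0, "geneC": 3, "geneD": 0}
ranked = prioritize(counts)  # least-annotated genes come first
```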

[ACTION ITEM]: contact/meet with people who have made tools for orthology determination to see if they can help us (that possibly includes re-running the analyses using the most recent set of sequences and proper IDs) THIS ACTION ITEM NEEDS TO BE ASSIGNED TO SOMEONE:

  • Compara: Emily?
  • Homologene: Judy?
  • TreeFam
  • InParanoid
  • others?

[ACTION ITEM]: Kara: run the P-POD analysis over the full reference genome data set. This needs a computational pipeline built with existing resources. It currently takes 3 weeks to do 8 species all v all.
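
To get a rough feel for what the full set implies, the sketch below scales the 3-week figure for 8 species up to the 12 species in the rotation list. It rests on assumptions not stated in the minutes: that run time grows with the number of species-by-species comparisons, and that the genomes are of comparable size. Both function names are hypothetical, and the result is an order-of-magnitude guess rather than a measurement.

```python
def all_vs_all_comparisons(n_species):
    """Ordered species-by-species comparisons, self-comparisons included."""
    return n_species * n_species


def scaled_weeks(known_weeks, known_species, target_species):
    """Scale a known run time by the ratio of comparison counts."""
    ratio = (all_vs_all_comparisons(target_species)
             / all_vs_all_comparisons(known_species))
    return known_weeks * ratio


# 3 weeks for 8 species (from the minutes), scaled to 12 species.
estimate = scaled_weeks(3, 8, 12)
```

Under these assumptions the comparison count rises from 64 to 144, so the estimate is 2.25 times the current run time.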