Reference Genome minutes (Archived)

From GO Wiki
Revision as of 11:39, 2 October 2007 by Pascale (talk | contribs)

Minutes for the reference genome meeting, September 26-27, 2007

Orthology determination

Kara Dolinski

Background information

  • Available tools:
    • InParanoid (PMID 15608241)
    • HomoloGene (PMID 17170002): is updated but does not cover all species; it also does not perform very well, being based on reciprocal BLAST rather than phylogeny.
    • HCOP (PMID 16284797)
    • TreeFam (PMID 16381935)
    • Compara (no publication yet): produces trees
    • OrthoMCL (PMID 12952885)
  • ‘Aggregator’ tools:
    • YOGY (PMID 16845020): not really maintained; has all species except chicken and zebrafish. Methods include KOGs, InParanoid, HomoloGene, OrthoMCL and a table of curated orthologs between budding yeast and fission yeast.
    • bioPIXIE (PMID 16420673; Princeton), a data integration approach with Troyanskaya: incorporates data from several methods to generate a ‘probability of orthology’. Use the same protein sets with all the algorithms and update as required. Agreed as a good idea (see action item).
    • P-POD (PMID 17712414; http://ortholog.princeton.edu/findorthofamily.html): based on OrthoMCL (http://orthomcl.cbil.upenn.edu/cgi-bin/OrthoMclWeb.cgi) and Jaccard coefficient clustering.
    • The problem is that none of them does exactly what is wanted. There is no gold standard set. It is sometimes necessary to look for an ortholog manually when no tool finds one (for short proteins, for example, or divergent ones, like E. coli proteins).
    • Another problem is the way databases handle orthologs: the mouse schema can’t cope with many-to-many relationships.
    • Methods to do orthology comparisons (see slide)
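
For readers unfamiliar with the "reciprocal BLAST based" approach mentioned above, the following is a minimal sketch of reciprocal best hits between two species. It is not the actual HomoloGene pipeline; the gene names, the scores and the function names are made up for illustration.

```python
from typing import Dict, Set, Tuple

# Hypothetical BLAST-like bit scores, keyed by (query, subject).
Scores = Dict[Tuple[str, str], float]


def best_hits(scores: Scores) -> Dict[str, str]:
    """Map each query gene to its single highest-scoring subject gene."""
    best: Dict[str, Tuple[str, float]] = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b: Scores, b_vs_a: Scores) -> Set[Tuple[str, str]]:
    """Keep pairs (a, b) where b is a's best hit and a is b's best hit."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return {(a, b) for a, b in forward.items() if backward.get(b) == a}


# Made-up example: yA2 and yA3 both hit hB2, but hB2 only points back to yA3.
a_vs_b = {("yA1", "hB1"): 250.0, ("yA1", "hB2"): 90.0,
          ("yA2", "hB2"): 180.0, ("yA3", "hB2"): 200.0}
b_vs_a = {("hB1", "yA1"): 245.0, ("hB2", "yA3"): 198.0, ("hB2", "yA2"): 175.0}

rbh = reciprocal_best_hits(a_vs_b, b_vs_a)
print(sorted(rbh))
```

In this toy data yA2 is dropped even though it also hits hB2, which illustrates why a purely reciprocal approach tends to miss one-to-many and many-to-many relationships.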

Comparison of tools is made more difficult due to:

  1. different species are covered by different tools
  2. problems with inconsistent use of identifiers (Treefam is a mess - MA). Emily points out that UniProt and Ensembl have joined forces to try to reduce differences (in proteins?) and fill in holes in UniProt, doing human first, with mouse second on the list for clean-up.
  3. not all sets are based on the same proteins, due to varying frequency of updates/maintenance.

- Quality of ortholog tools: PMID 17440619: assessing performance of orthology detection methods

Issues regarding ortholog determination in the context of the reference genome project

  1. We need a set of sequences to work with

[ACTION ITEM]: Suzi and Karen E will generate a page where all sequences will be available

  2. We need a complete set of orthologs that covers all reference genomes. The orthology determination should capture one-to-many and many-to-many relationships (question: does that need to be captured somewhere? “Unique putative ortholog”, “one2many”, “many2many” (gene family?)). Our concern is to capture the full set rather than to make statements about the evolutionary relationships between gene products and/or organisms (we probably need to clarify this in our documentation and in what we present to the public as our goals).
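
As one possible way to record the “Unique putative ortholog” / “one2many” / “many2many” distinction, the sketch below groups ortholog pairs between two species into connected components and labels each group by its size on either side. Grouping by connected components is an assumption made here for illustration, not a method agreed at the meeting, and the gene names are made up.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

Group = Tuple[List[str], List[str], str]


def classify_relationships(pairs: Iterable[Tuple[str, str]]) -> List[Group]:
    """Group ortholog pairs (gene in species A, gene in species B) and label them."""
    # Build an undirected bipartite graph; nodes are tagged with their species.
    adjacency = defaultdict(set)
    for gene_a, gene_b in pairs:
        adjacency[("A", gene_a)].add(("B", gene_b))
        adjacency[("B", gene_b)].add(("A", gene_a))

    seen = set()
    groups: List[Group] = []
    for start in sorted(adjacency):
        if start in seen:
            continue
        # Depth-first search to collect one connected component.
        seen.add(start)
        stack = [start]
        component = set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        genes_a = sorted(g for species, g in component if species == "A")
        genes_b = sorted(g for species, g in component if species == "B")
        if len(genes_a) == 1 and len(genes_b) == 1:
            label = "unique putative ortholog"
        elif len(genes_a) == 1 or len(genes_b) == 1:
            label = "one2many"
        else:
            label = "many2many"
        groups.append((genes_a, genes_b, label))
    return groups


# Made-up pairs covering all three cases.
example_pairs = [("g1", "h1"), ("g2", "h2"), ("g2", "h3"),
                 ("g4", "h4"), ("g4", "h5"), ("g5", "h5")]
groups = classify_relationships(example_pairs)
for genes_a, genes_b, label in groups:
    print(genes_a, genes_b, label)
```

Storing the whole group rather than individual pairs keeps the full set together, which matches the stated aim of capturing the complete set without asserting a particular evolutionary history.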

[ACTION ITEM]: contact/meet with people who have made tools for orthology determination to see if they can help us (possibly including re-running the analyses using the most recent set of sequences and proper IDs). THIS ACTION ITEM NEEDS TO BE ASSIGNED TO SOMEONE: Compara: Emily? HomoloGene: Judy? TreeFam, InParanoid, others?

[ACTION ITEM]: Kara: run the P-POD analysis over the full reference genome data set? This needs a computational pipeline using existing resources. It currently takes 3 weeks to do 8 species all v all.
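
To give a rough feel for how the all v all run time might grow with more species, here is a back-of-the-envelope sketch. It assumes that cost scales with the number of ordered species pairs (including each species against itself) and that all proteomes are of similar size; both are simplifying assumptions, and the target of 12 species in the example is hypothetical.

```python
def ordered_species_pairs(n_species: int) -> int:
    """Number of species-vs-species comparisons in an all v all run, self included."""
    return n_species * n_species


def estimate_weeks(base_weeks: float, base_species: int, target_species: int) -> float:
    """Scale a known run time by the ratio of species-pair counts (crude assumption)."""
    ratio = ordered_species_pairs(target_species) / ordered_species_pairs(base_species)
    return base_weeks * ratio


# 3 weeks for 8 species is from the minutes; 12 species is a hypothetical target.
print(estimate_weeks(3.0, 8, 12))
```

Under these assumptions the run time grows quadratically with the number of species, so the real figure would depend heavily on proteome sizes and on the hardware available.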