Reference Genome minutes (Archived)

Revision as of 11:50, 2 October 2007

Minutes for the reference genome meeting, September 26-27, 2007

Orthology determination

Kara Dolinski

Background information

Available tools:
- InParanoid (PMID 15608241)
- HomoloGene (PMID 17170002): is updated but does not cover all species; it also does not perform very well, being based on reciprocal BLAST rather than phylogeny
- HCOP (PMID 16284797)
- TreeFam (PMID 16381935)
- Compara (no publication yet): produces trees
- OrthoMCL (PMID 12952885)
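
For readers unfamiliar with the "reciprocal BLAST" approach mentioned above, here is a minimal sketch of reciprocal best hits. It is not the implementation of any tool listed; the input format (a dict of pairwise scores) and the gene names are assumptions made for illustration.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    `scores` maps (query, subject) pairs to a similarity score,
    for example a BLAST bit score (hypothetical input format).
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Toy scores with made-up gene names.
yeast_vs_human = {("yA", "hA"): 300, ("yA", "hB"): 120, ("yB", "hB"): 90}
human_vs_yeast = {("hA", "yA"): 310, ("hB", "yA"): 125, ("hB", "yB"): 80}
pairs = reciprocal_best_hits(yeast_vs_human, human_vs_yeast)
print(pairs)
```

In the toy data, yB's best hit is hB, but hB's best hit is yA, so only the yA/hA pair is reported. This shows how a purely reciprocal approach can miss one-to-many relationships.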

‘Aggregator’ tools:
- YOGY (PMID 16845020): not really maintained; has all species except chicken and zebrafish. Methods include KOGs, InParanoid, HomoloGene, OrthoMCL and a table of curated orthologs between budding yeast and fission yeast.
- bioPIXIE (PMID 16420673; Princeton): a data integration approach, with Troyanskaya, that incorporates data from several methods to generate a ‘probability of orthology’. Use the same protein sets with all the algorithms and update as required. Agreed as a good idea (see action item).
- P-POD (PMID 17712414; http://ortholog.princeton.edu/findorthofamily.html): based on OrthoMCL (http://orthomcl.cbil.upenn.edu/cgi-bin/OrthoMclWeb.cgi) and Jaccard coefficient clustering.
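
As a reminder of what a Jaccard coefficient measures, the sketch below computes it for two sets of genes. It is only the textbook definition; how P-POD applies it (thresholds, what the sets contain) is not recorded in these minutes, and the gene names are made up.

```python
def jaccard(a, b):
    """Jaccard coefficient: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


# Hypothetical sets of genes linked to two candidate family members.
links_1 = {"g1", "g2", "g3"}
links_2 = {"g2", "g3", "g4"}
similarity = jaccard(links_1, links_2)
print(similarity)  # 2 shared genes out of 4 distinct genes
```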

The problem is that none of them does exactly what is wanted. There is no gold standard set. It is sometimes necessary to look for an ortholog manually when no tool finds one (short proteins, for example, or divergent ones, like E. coli proteins).

Another problem is the way databases handle orthologs: the mouse schema can’t cope with many-to-many relationships.

- Methods to do orthology comparisons (see slide)

Comparison of tools is made more difficult due to:

different species being covered by different tools
inconsistent use of identifiers (Treefam is a mess - MA) (Emily points out that UniProt and Ensembl have joined forces in trying to reduce differences (in proteins?) and fill in holes in UniProt; human is being done first, with mouse second on the list for clean-up)
not all sets being based on the same proteins, because of varying frequency of updates/maintenance

- Quality of ortholog tools: PMID 17440619: assessing performance of orthology detection methods

Issues regarding ortholog determination in the context of the reference genome project

We need a set of sequences to work with

[ACTION ITEM]: Suzi and Karen E will generate a page where all sequences will be available

We need a complete set of orthologs that covers all reference genomes. The orthology determination should capture one-to-many and many-to-many relationships (question: does that need to be captured somewhere? “Unique putative ortholog”, “one2many”, “many2many” (gene family?)). Our concern is to capture the full set rather than to make statements about the evolutionary relationships between gene products and/or organisms (we probably need to clarify this in our documentation and in what we present to the public as our goals).
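
One possible way to record the relationship type, if we decide it should be captured, is sketched below. The labels are the ones floated in the question above; the per-pair counting rule, the function name and the gene identifiers are assumptions for illustration, not an agreed method.

```python
from collections import defaultdict


def classify_orthologs(pairs):
    """Label each (gene_a, gene_b) ortholog pair by relationship type.

    A pair is labelled by how many partners each of its two genes has
    in the other species (a simple per-pair rule, assumed here).
    """
    a_to_b = defaultdict(set)
    b_to_a = defaultdict(set)
    for a, b in pairs:
        a_to_b[a].add(b)
        b_to_a[b].add(a)

    labels = {}
    for a, b in pairs:
        a_has_many = len(a_to_b[a]) > 1  # gene a has several partners
        b_has_many = len(b_to_a[b]) > 1  # gene b has several partners
        if a_has_many and b_has_many:
            labels[(a, b)] = "many2many"
        elif a_has_many or b_has_many:
            labels[(a, b)] = "one2many"
        else:
            labels[(a, b)] = "unique putative ortholog"
    return labels


# Made-up yeast (y*) and human (h*) identifiers.
example_pairs = [
    ("y1", "h1"),
    ("y2", "h2"), ("y2", "h3"),
    ("y3", "h4"), ("y3", "h5"), ("y4", "h4"), ("y4", "h5"),
]
labels = classify_orthologs(example_pairs)
```

A stricter alternative would label whole connected groups of genes rather than individual pairs, which is closer to the "gene family" reading.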

Setting curation priorities

Rex Chisholm

Background

When we started the reference genome project last year, we made genes involved in human diseases our main priority. The Scientific Advisory Board suggested that we also try to curate genes for which there are no GO annotations but for which published data exist. Other suggestions:

ENCODE
members of a complex should all be done at the same time
all enzymes in a pathway should be done at the same time

[ACTION ITEM] (everybody): We will add categories of genes to annotate in addition to ‘disease genes’. We will choose five genes from each of the following four groups:

diseases (Resources: RGD portal (http://rgd.mcw.edu/dportal/), OMIM (http://www.ncbi.nlm.nih.gov/Omim/getmorbid.cgi))
biochemical/signaling pathways (Resources: Reactome (http://reactome.org/), Pathway Tools (http://bioinformatics.ai.sri.com/ptools/))
bleeding edge list, for example: new “hot” genes in your area of interest; genes that come up in computational studies/population studies; most cited papers by some text-mining method; genes cited in the news media; newly named genes (HGNC)

[ACTION ITEM] (everybody): Each database should keep an eye out for such genes, so that it has genes to suggest when it is its turn to do the assignments

conserved genes/unannotated genes; genes that have few annotations and have lots of literature

[ACTION ITEM] (Val): Provide the list of 207 genes conserved between pombe and human with no annotation/information

[ACTION ITEM] (Jim): Provide the set of genes found by InParanoid to be conserved in all 12 species (660 or so); we might want to prioritize this list by ascending number of annotations to target unannotated genes (who can do that?)

[ACTION ITEM] (Ruth): Send the HGNC list of genes with few annotations

This will be done on a rotation basis across all databases. I suggest we go alphabetically:

November 2007: Arabidopsis thaliana
December 2007: Caenorhabditis elegans
January 2008: Danio rerio
February 2008: Dictyostelium discoideum
March 2008: Drosophila melanogaster
April 2008: Escherichia coli
May 2008: Gallus gallus
June 2008: Homo sapiens
July 2008: Mus musculus
August 2008: Rattus norvegicus
September 2008: Saccharomyces cerevisiae
October 2008: Schizosaccharomyces pombe
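
The prioritization proposed in Jim's action item above (ordering the conserved genes by ascending number of annotations, so that unannotated genes come first) could be sketched as follows. The gene names and counts are made up, and the alphabetical tie-break is an assumption.

```python
def prioritize(annotation_counts):
    """Order genes by ascending annotation count, ties broken by name."""
    return sorted(annotation_counts, key=lambda gene: (annotation_counts[gene], gene))


# Hypothetical annotation counts per conserved gene.
counts = {"geneA": 12, "geneB": 0, "geneC": 3, "geneD": 0}
ranked = prioritize(counts)
print(ranked)
```

Genes with no annotations (geneB and geneD here) end up at the top of the list.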

[ACTION ITEM]: contact/meet with people who have made tools for orthology determination to see if they can help us (that possibly includes re-running the analyses using the most recent set of sequences and proper IDs) THIS ACTION ITEM NEEDS TO BE ASSIGNED TO SOMEONE:

Compara: Emily?
Homologene: Judy?
TreeFam
InParanoid
others?

[ACTION ITEM]: Kara: run the P-POD analysis on the full ref genome data set? Need a computational pipeline with existing resources. Currently takes 3 weeks to do 8 species all vs. all.
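
A rough back-of-the-envelope estimate of what the full 12-species set might cost is sketched below. It assumes that run time scales with the number of species pairs compared (including each species against itself) and that proteomes are of similar size; neither assumption was discussed at the meeting, so the result is only indicative.

```python
def all_vs_all_pairs(n_species, include_self=True):
    """Number of species pairs compared in an all-vs-all run."""
    pairs = n_species * (n_species - 1) // 2
    return pairs + n_species if include_self else pairs


weeks_for_8 = 3  # figure reported in the minutes for 8 species
scale = all_vs_all_pairs(12) / all_vs_all_pairs(8)
estimated_weeks = weeks_for_8 * scale
print(all_vs_all_pairs(8), all_vs_all_pairs(12), estimated_weeks)
```

Under these assumptions the 12-species run would take roughly 6.5 weeks on the same resources; real proteome sizes differ widely, so the true figure could be quite different.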
