Tools for identifying orthologs

From GO Wiki

Judy, Petra, Karen, DongHui and Kimberley

Val Wood

An increasing number of methods, employing diverse algorithms, are available to identify orthologous proteins in different organisms. These can be classified broadly as BLAST-based methods, which have high sensitivity, and tree-based methods, which have high specificity (5). KOGs (euKaryotic Orthologous Groups; ref. 6) is a homology database for seven eukaryotic genomes, which uses BLAST reciprocal best hits (RBH) between three proteins from different organisms, supplemented by manual curation (7,8). The reliance on manual curation makes the resource difficult to update and extend. OrthoMCL and Inparanoid improve on RBH by adding detection of co-orthologs, a normalization step, and Markov Clustering (OrthoMCL) or bootstrap confidence values (Inparanoid) (9-11). HomoloGene is a system for automated detection of homologs among the annotated proteins of 18 eukaryotic genomes and includes both curated and calculated orthologs (12). A curated list of orthologous clusters between fission yeast and budding yeast has been compiled by manual inspection of alignments and clusters on a protein-by-protein basis, taking into account experimental evidence, domain organization, protein length, and species distribution (13).

These various homology resources have different advantages and complement each other, but it is difficult to assess accuracy (specificity and coverage) in the absence of a ‘gold standard’ orthology test data set (5). However, agreement among the results from multiple resources can provide increased confidence in orthology calls. For individual scientists, ortholog identification and accurate extraction of relevant functional data can be time consuming and confusing, because the various resources are not integrated and give different results for many proteins (both false positives and false negatives). When a user is limited to individual resources, these differences can frequently cause important functional clues to be overlooked, or place unnecessary emphasis on false predictions. (This was from the YOGY grant app)
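The reciprocal best hit (RBH) idea mentioned above can be illustrated with a minimal sketch. It assumes BLAST results have already been reduced to (query, subject, bit score) tuples for each search direction; the function names and the protein identifiers are hypothetical, and the sketch covers only the pairwise RBH step, not the co-ortholog detection, normalization or clustering that OrthoMCL and Inparanoid add.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    `hits` is an iterable of (query, subject, bit_score) tuples.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Toy data with made-up identifiers and scores (not real BLAST output).
a_vs_b = [
    ("pombe_1", "cer_1", 250.0),
    ("pombe_1", "cer_2", 90.0),
    ("pombe_2", "cer_2", 120.0),
    ("pombe_3", "cer_2", 300.0),
]
b_vs_a = [
    ("cer_1", "pombe_1", 245.0),
    ("cer_2", "pombe_3", 295.0),
    ("cer_2", "pombe_2", 118.0),
]

pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

In this toy data pombe_2 has cer_2 as its best hit, but cer_2 scores higher against pombe_3, so pombe_2 is left without an RBH partner. This is the kind of case that co-ortholog detection is meant to address.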

5. Chen,F., Mackey,A.J., Vermunt,J.K., Roos,D.S. (2007) Assessing Performance of Orthology Detection Strategies Applied to Eukaryotic Genomes. /PLoS ONE/, *2*, e383

6. Tatusov,R.L., Fedorova,N.D., Jackson,J.D., Jacobs,A.R., Kiryutin,B., Koonin,E.V., Krylov,D.M., Mazumder,R., Mekhedov,S.L., Nikolskaya,A.N., /et al./ (2003) The COG database: an updated version includes eukaryotes. /BMC Bioinformatics/, *4*, 41

7. Tatusov,R.L., Koonin,E.V. and Lipman,D.J. (1997) A genomic perspective on protein families. /Science/, *278*, 631

8. Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. /Nucleic Acids Res./, *25*, 3389

9. Chen,F., Mackey,A.J., Stoeckert,C.J. Jr and Roos,D.S. (2006) OrthoMCL-DB: querying a comprehensive multi-species collection of ortholog groups. /Nucleic Acids Res./, *34*, D363

10. Sonnhammer,E.L. and Koonin,E.V. (2002) Orthology, paralogy and proposed classification for paralog subtypes. /Trends Genet./, *18*, 619

11. O'Brien,K.P., Remm,M. and Sonnhammer,E.L. (2005) Inparanoid: a comprehensive database of eukaryotic orthologs. /Nucleic Acids Res./, *33*, D476

12. Wheeler,D.L., Barrett,T., Benson,D.A., Bryant,S.H., Canese,K., Chetvernin,V., Church,D.M., DiCuccio,M., Edgar,R., Federhen,S., /et al./ (2005) Database resources of the National Center for Biotechnology Information. /Nucleic Acids Res./, *33*, D3

13. Wood,V. (2006) /Schizosaccharomyces pombe/ comparative genomics; from sequence to systems. In Sunnerhagen,P., Piskur,J. (eds.) /Comparative Genomics Using Fungi as Models (Series: Topics in Current Genetics)./ Vol 15, pp.233

In contrast, "TreeFam infers orthologs by means of gene trees. It fits a gene tree into the universal species tree and finds historical duplications, speciations and losses events. TreeFam uses this information to evaluate tree building, guide manual curation, and infer complex ortholog and paralog relations" (This was lifted from the longer description here:

Both BLAST-based and tree-based methods miss many divergent orthologs (at least between the yeasts). Shortly I'll be providing TreeFam with a list of all proteins which are known to be conserved from yeast to human and are currently excluded from TreeFam clusters. This should increase sensitivity for a number of broadly conserved proteins.
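To make the tree-based idea concrete, here is a rough sketch of calling orthologs from a gene tree. It does not reproduce TreeFam's reconciliation against a species tree; it swaps in the simpler species-overlap heuristic, where an internal node is treated as a duplication if its two subtrees share a species and as a speciation otherwise. The tree encoding (nested tuples with "species|gene" leaf labels), the function names and the gene names are all assumptions made for illustration.

```python
from itertools import product


def leaves(node):
    """Return the leaf labels under a node of a nested-tuple binary tree."""
    if isinstance(node, str):
        return [node]
    left, right = node
    return leaves(left) + leaves(right)


def species(leaf):
    """Extract the species part of a 'species|gene' leaf label."""
    return leaf.split("|")[0]


def orthologs(node, out=None):
    """Collect gene pairs separated by a speciation node.

    Species-overlap heuristic: if the two child subtrees share no
    species, the node is called a speciation and every cross pair
    is reported as orthologous; otherwise it is called a duplication
    and the cross pairs are skipped.
    """
    if out is None:
        out = []
    if isinstance(node, str):
        return out
    left, right = node
    left_leaves, right_leaves = leaves(left), leaves(right)
    shared = {species(x) for x in left_leaves} & {species(x) for x in right_leaves}
    if not shared:
        out.extend(product(left_leaves, right_leaves))
    orthologs(left, out)
    orthologs(right, out)
    return out


# Toy gene tree: an ancient duplication followed by a speciation in each copy.
tree = (("human|A1", "yeast|a1"), ("human|A2", "yeast|a2"))
calls = orthologs(tree)
print(calls)
```

Here the root is called a duplication because both subtrees contain human and yeast genes, so A1/a2 and A2/a1 are not reported as orthologs, which is a distinction that pairwise best-hit methods cannot make directly.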

Return to [July 10 Conference call agenda]