Orthology discussion page (Retired)

From GO Wiki
Revision as of 17:05, 11 May 2007 by Vanaukenk (talk | contribs) (Added elegans orthology determination methods)

This page is for general discussion of methods, problems, and ideas for establishing orthology between reference genome genes and the human disease gene targets.

Discussion of gene-specific issues belongs on the gene-specific pages. A link will be added here as soon as the pages are established.

ZFIN Orthology Determination Method

We always use the same methods as outlined here:

1. Check to see if orthology has already been established between a zebrafish gene and the human gene by searching in ZFIN.

2. If there is no established zebrafish ortholog, the human sequence is used to search zebrafish mRNA, Vega and Ensembl transcripts, and protein sequences for potential orthologs. If several zebrafish sequences are candidates for being the ortholog, reciprocal BLAST searches of the zebrafish sequences against human sequences are used to rank them. The best matches are then analyzed for conserved synteny with the human gene.

The current version of the zebrafish assembly at Ensembl is used to determine the location of the zebrafish gene. After the zebrafish gene has been localized, the flanking regions around the gene are analyzed for other orthologous genes between zebrafish and human chromosomes. The presence of conserved synteny is used as evidence to confirm orthology and the human gene is assigned as the ortholog of the zebrafish gene in ZFIN. If necessary, the zebrafish gene nomenclature in ZFIN is updated.

In cases where sequence analyses and synteny do not provide clear evidence to distinguish between two or more zebrafish genes, orthology is not established. This is also the case for human genes that do not match any zebrafish cDNA or EST sequences but have a sequence match in the zebrafish genome. Genscan or FGENESH identifiers are instead provided as identifiers for putative zebrafish orthologs.
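The synteny step above can be sketched in code. This is only an illustration, not ZFIN's actual tooling: the function names, the gene identifiers, and the `min_shared` threshold are all hypothetical, and the real decision rests on curator judgement. The sketch counts how many genes flanking each zebrafish candidate have human orthologs near the human gene, and declines to pick a candidate when the evidence is tied or absent.

```python
def count_shared_neighbors(flank_human_orthologs, human_neighbors):
    """Count flanking zebrafish genes whose human ortholog lies near the human gene.

    flank_human_orthologs: human orthologs of the genes flanking a zebrafish
    candidate (None where a flanking gene has no known human ortholog).
    """
    human_set = set(human_neighbors)
    return sum(
        1 for gene in flank_human_orthologs
        if gene is not None and gene in human_set
    )


def pick_by_synteny(candidates, human_neighbors, min_shared=1):
    """Return the candidate with the most syntenic support, or None.

    candidates: dict mapping candidate name -> list of human orthologs of its
    flanking genes. min_shared is an assumed cutoff, not a ZFIN rule.
    None is returned when support is too weak or two candidates tie,
    mirroring the "orthology is not established" outcome.
    """
    scored = sorted(
        ((count_shared_neighbors(flank, human_neighbors), name)
         for name, flank in candidates.items()),
        key=lambda pair: pair[0],
        reverse=True,
    )
    if not scored or scored[0][0] < min_shared:
        return None
    if len(scored) > 1 and scored[0][0] == scored[1][0]:
        return None
    return scored[0][1]
```

With placeholder identifiers, `pick_by_synteny({"cand_a": ["HSG1", "HSG2", None], "cand_b": ["HSG9"]}, ["HSG1", "HSG2", "HSG3"])` selects `cand_a`, while two candidates with equal support yield `None`.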

TAIR Orthology Determination Method

We use YOGY and maintain a separate spreadsheet that records, for each human gene, the results from each method (including cases where a method's analysis was not done). If an Arabidopsis gene appears in more than one analysis, we consider it an Arabidopsis ortholog. Arabidopsis genes that occur in only one analysis are not considered orthologs.
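The "more than one analysis" rule can be expressed as a short tally. This is a minimal sketch, not TAIR's actual spreadsheet logic: the function name and the convention of using `None` for "analysis not done" are assumptions made for illustration.

```python
from collections import Counter


def genes_in_multiple_analyses(hits_by_method):
    """Return genes reported by more than one analysis, sorted by name.

    hits_by_method: dict mapping method name -> iterable of gene IDs,
    or None when that analysis was not done (an assumed convention).
    """
    counts = Counter()
    for genes in hits_by_method.values():
        if genes is None:
            continue  # analysis not done: contributes no evidence
        counts.update(set(genes))  # count each gene once per method
    return sorted(gene for gene, n in counts.items() if n > 1)
```

For example, with placeholder gene names, `{"Inparanoid": ["gene_a", "gene_b"], "OrthoMCL": ["gene_a"], "KOG": None}` yields `["gene_a"]`.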

dictyBase Orthology Determination Method

1. Check YOGY for orthologs (if the human name(s) are not recognized in YOGY, search HGNC, UniProt or even Google).

2. If there is an ortholog in YOGY (Dicty is included in two methods: Inparanoid and OrthoMCL) we confirm orthology with reciprocal BLAST. We have the rule that an ortholog should be at least 30% identical over 80% of the length of the protein; however, the curator can decide a protein with lower sequence similarity is an ortholog if there are single genes in both organisms.

We also routinely compare domains in InterPro and/or Pfam, and TMHMM and SignalP predictions, if the human protein contains such structures. This helps to firmly determine if there is a single Dicty ortholog.

3. In case there is no ortholog in YOGY, we BLAST the human protein sequence in dictyBase, changing parameters such as the E-value and/or turning filtering off if necessary (e.g. for very short proteins such as DNAJC19). If we identify a potential ortholog this way, we proceed with our analysis as described in 2.

4. If Dicty has one or more genes that are just similar, e.g. conservation is only over a large domain, we do not consider this an ortholog. Depending on the degree of similarity we might mention this in our free-text description on each gene page.
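The identity rule in step 2 can be written as a simple check. This sketch makes an assumption about how the rule is read: the alignment must cover at least 80% of the protein length, and identity within the aligned region must be at least 30%. The function and parameter names are hypothetical, and the curator override for single-gene cases is deliberately not modeled.

```python
def passes_identity_rule(identical, aligned_len, protein_len,
                         min_identity=0.30, min_coverage=0.80):
    """Check the 30% identity over 80% of length rule for one alignment.

    identical: number of identical residues in the alignment.
    aligned_len: length of the aligned region.
    protein_len: full length of the protein being compared.
    """
    if aligned_len <= 0 or protein_len <= 0:
        return False
    identity = identical / aligned_len    # fraction identical within alignment
    coverage = aligned_len / protein_len  # fraction of protein that is aligned
    return identity >= min_identity and coverage >= min_coverage
```

An alignment of 350 residues with 120 identities against a 400-residue protein passes (about 34% identity, 87.5% coverage), whereas the same number of identities over a 200-residue alignment fails on coverage.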

SGD "Orthology" Determination Method

  1. Check each human gene name in YOGY for hits with each of the four methods: KOG, Inparanoid, HomoloGene, and OrthoMCL. Record which S. cerevisiae genes come up as hits for the human gene via each method.
  2. Evaluate whether I am getting the same hits from each of the available methods. Note that sometimes a given method is not available for a given human gene. In these cases, the comparison is made only between the results from the available methods.
  3. Take the S. cerevisiae hits obtained by searching with the human genes and use YOGY to search for their orthologs.
  4. Make a decision of what, if anything, to call, based on how many methods produced the S. cerevisiae hits, and on how many of the methods returned the original human gene as a hit in the reverse search with the S. cerevisiae gene. The examples below may help illustrate the decision process.
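The bookkeeping in steps 1 to 4 can be sketched as a tally of forward and reverse support. This is an illustration only: the data structures and function name are assumptions, YOGY results are typed in by hand rather than fetched, and the final call remains a curator decision that the code does not attempt to make.

```python
METHODS = ("KOG", "Inparanoid", "HomoloGene", "OrthoMCL")


def tally_support(forward, reverse, human_gene):
    """Summarize which methods support each S. cerevisiae hit.

    forward: dict method -> set of yeast hits for the human gene, or None
             when the method is not available for that gene.
    reverse: dict yeast gene -> dict method -> set of human hits.
    Returns (available_methods, summary), where summary maps each yeast
    hit to the methods supporting it in the forward and reverse searches.
    """
    available = [m for m in METHODS if forward.get(m) is not None]
    candidates = set().union(*(forward[m] for m in available))
    summary = {}
    for gene in sorted(candidates):
        fwd = [m for m in available if gene in forward[m]]
        rev = [
            m for m in METHODS
            if human_gene in (reverse.get(gene, {}).get(m) or set())
        ]
        summary[gene] = {"forward": fwd, "reverse": rev}
    return available, summary
```

Fed the DPM1 results described in the examples on this page (ScDPM1 returned by KOG, Inparanoid, and OrthoMCL, and HsDPM1 returned by the same three methods in the reverse search), the summary shows three of four available methods agreeing in both directions.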

Some Examples

  • Of getting hits but not making a call
    1. Sometimes I get hits in KOG, usually multiple hits, but no hits with any of the other 3 methods, where at least some of them were available. In cases like this, I take a look at the KOG info. When this occurs, the KOG usually turns out to be something rather general, e.g. "permease of the major facilitator superfamily" as was the case for SLC37A4 (G6PT1; Glycogen storage disease Ib, 232220 (3)), and I ignore the S. cerevisiae hits coming from the KOG.
    2. For the human gene DPAGT2 [Congenital disorder of glycosylation, type Ij, 608093 (3)], Inparanoid and OrthoMCL give ScALG7 as a hit, though HomoloGene does not. However, searches with ALG7 give DPAGT1 rather than DPAGT2 so I did not make a call.
  • Of making a call
    1. Sometimes I get hits with some methods, but not others. For example, for DPM1 [Congenital disorder of glycosylation, type Ie, 608799 (3)], 3 methods (KOG, Inparanoid, and OrthoMCL) produced the same hit, ScDPM1, but HomoloGene, while available, did not. When ScDPM1 was used to search YOGY, it produced the original human gene, HsDPM1, as the sole human hit by KOG, Inparanoid, and OrthoMCL. HomoloGene produced hits, but only in Kluyveromyces lactis and Eremothecium gossypii, both of which are also fungi. As I have sometimes seen this pattern before, where HomoloGene has a much narrower range of calls than both Inparanoid and OrthoMCL, I went with the consensus of the other methods and called ScDPM1.
    2. Sometimes I get the same hit with all four methods. For example, for the human gene G6PD [Favism (3)], all four methods produce the same hit, the S. cerevisiae gene ZWF1. Doing the reverse search with ScZWF1, all four methods produced HsG6PD as a hit, though two methods also produced two additional hits. In this case, I ignored the additional hits and called ScZWF1 as the best hit for HsG6PD.
  • Of a mistaken call
    1. For the human gene DPYD [Thymine-uraciluria (3)], Inparanoid and OrthoMCL hit URA1; KOG and HomoloGene, while available for this gene, did not call ScURA1. Searching YOGY with ScURA1 gives back the original human gene with KOG, Inparanoid, and OrthoMCL, so I made the call. However, Julie Park, an SGD curator who had previously curated the ScURA1 gene in detail, informed me that URA1 has a different specific human homolog, and thus that ScURA1 is not orthologous to this human gene. The appropriate human homolog for URA1 corresponds to GenBank ID: M94065 and is described in a paper (PMID:1446837) and an OMIM record (http://www.ncbi.nlm.nih.gov/entrez/dispomim.cgi?id=126064). Julie added that there is a similar domain between the human DPYD gene (encoding dihydropyrimidine dehydrogenase [NADP+]) and ScURA1 (encoding dihydroorotate dehydrogenase), but the similarity does not extend to the entire protein. So, while the call looked good in terms of the hits and reciprocal hits, it was actually false.
  • Of a case where I'm not sure whether I should have made a call
    1. For the human gene ACTC [Cardiomyopathy, familial hypertrophic, 192600 (3)], I got hits with 3 methods. For KOG, this was "Actin and related proteins" so it's not really specific enough to use. For both Inparanoid and OrthoMCL, I only got one hit in S. cerevisiae, but the human gene 'ACTC1' pulled up multiple hits in human. For the Reverse checks: searching with Sc ACT1 (S000001855, YFL039c), I got the same general KOG and ignored it. Both Inparanoid and OrthoMCL pulled up multiple human genes, the starting human gene as well as many of the same human genes pulled up in the original search with HsACTC1. However, HomoloGene pulled up only HsACTG1, and not HsACTC1. I opted not to call this, but remain uncertain as to whether I should have or not.

Further Comments

  1. I would like to know more about the methodology behind each of the four prediction methods.
  2. I would like to be able to see alignments of the results, or at least some indication as to whether the hit is across the full length or just to a domain.
  3. In reverse searches with the S. cerevisiae genes, it is sometimes very difficult to determine whether the KOG hits actually correspond to the human gene I started with.
  4. It's really time-consuming to do the equivalent searches via the individual pages for KOG, Inparanoid, HomoloGene, and OrthoMCL. They all search differently. Some don't allow searching by gene name, so you have to use an ID, and not just any ID: only the one they've chosen to reference.
  5. Should we be using the word ortholog? Some people use a very precise meaning of ortholog. Do we really mean that here?

From http://www.reference.com/browse/wiki/Homology_(biology):

Orthologs, or orthologous genes, are any genes in different species, that are similar to each other and originated from a common ancestor, regardless of their functions. Thus orthologs are separated by an evolutionary speciation event: if a gene exists in a species, and that species diverges into two species, then the divergent copies of this gene in the resulting species are orthologous. The term "ortholog" was coined in 1970.

A second definition of orthologous has arisen to describe any genes with very similar functions in different species. This differs from the original definition in that there is no statement about evolutionary relation, or similarity in sequence or structure.

WormBase: C. elegans Orthology Determination Method

1. Check YOGY using the human gene name and record the number of hits from each method in the elegans spreadsheet, Column T. C. elegans gene products are included in all four methods, but if a method lists no elegans gene product, then that method is not included in Column T.

2. If one C. elegans gene product is listed for each method, then that gene is entered as the ortholog of the human gene. We also perform reciprocal BLAST searches between the human and elegans proteins to confirm the orthology. BLAST scores are now being recorded in the spreadsheet.

3. If there is no ortholog listed in YOGY, we still perform reciprocal BLAST searches and examine the highest scoring pairs. In some cases, for example ATXN2, this identifies an elegans ortholog that was not identified by the methods listed in YOGY.

4. Since in some cases the C. elegans ortholog is highly diverged in sequence from the human protein (see BRCA2/BRC-2, for example; p53 and CEP-1 are another), we also search WormBase and the C. elegans literature using Textpresso to see if there are identified orthologs whose sequence identity was not high enough to be returned in a BLAST search.

5. The trickiest cases are those for which a number of C. elegans proteins are identified in YOGY (usually via KOG analyses) and are roughly equivalent matches as far as BLAST scores are concerned. This has happened, for example, with some transmembrane receptors and Forkhead transcription factors. In these cases, it's not always clear that there is a true elegans ortholog. However, since many of these genes play important roles in C. elegans development and/or behavior, there would no doubt be value in having them annotated in GO. We would like to annotate these genes as time permits, but may not include them as orthologs in our spreadsheet.

6. We also examine the results of TreeFam analysis as listed on the gene summary pages of WormBase to corroborate the YOGY and BLAST results.
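The reciprocal BLAST check mentioned in steps 2 and 3 can be sketched as a reciprocal best hit test. This is a simplified illustration, not WormBase's procedure: the function names and the input format (precomputed lists of subject ID and bit score pairs) are assumptions, and taking only the single top-scoring hit is a simplification of examining the highest scoring pairs.

```python
def best_hit(hits):
    """Return the subject ID with the highest score, or None if no hits.

    hits: list of (subject_id, bit_score) tuples from one BLAST search.
    """
    if not hits:
        return None
    return max(hits, key=lambda hit: hit[1])[0]


def reciprocal_best_hit(human_id, human_vs_worm, worm_vs_human):
    """Check whether a human protein and its top worm hit are reciprocal best hits.

    human_vs_worm: dict human ID -> hits against C. elegans proteins.
    worm_vs_human: dict worm ID -> hits against human proteins.
    Returns (worm_best, is_reciprocal); worm_best is None when the
    forward search found nothing.
    """
    worm_best = best_hit(human_vs_worm.get(human_id, []))
    if worm_best is None:
        return None, False
    back = best_hit(worm_vs_human.get(worm_best, []))
    return worm_best, back == human_id
```

With placeholder IDs, a human protein `HS_A` whose top worm hit `CE_1` in turn has `HS_A` as its top human hit returns `("CE_1", True)`. Highly diverged pairs such as those mentioned in step 4 would not be recovered this way, which is why the literature search remains necessary.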