Orthology discussion page (Retired)

From GO Wiki

This is the place for general discussions of methods, problems or ideas regarding general principles for establishing orthology between reference genome genes and the human disease gene targets.

Specific discussion of gene specific issues should be directed toward the gene specific pages. A link will be added here as soon as the pages are established.

ZFIN Orthology Determination Method

We always use the same methods as outlined here:

1. Check to see if orthology has already been established between a zebrafish gene and the human gene by searching in ZFIN.

2. If there is no established zebrafish ortholog, the human sequence is used to search zebrafish mRNA, Vega and Ensembl transcripts and protein sequences for potential orthologs. If there are several zebrafish sequences that are candidates for being the ortholog, reciprocal blasts of the zebrafish sequences against human sequences are used to order them. The best matches are then analyzed for conserved synteny with the human gene.

The current version of the zebrafish assembly at Ensembl is used to determine the location of the zebrafish gene. After the zebrafish gene has been localized, the flanking regions around the gene are analyzed for other orthologous genes between zebrafish and human chromosomes. The presence of conserved synteny is used as evidence to confirm orthology and the human gene is assigned as the ortholog of the zebrafish gene in ZFIN. If necessary, the zebrafish gene nomenclature in ZFIN is updated.

In cases where sequence analyses and synteny do not provide clear evidence to distinguish between two or more zebrafish genes, orthology is not established. This is also the case for human genes that do not match any zebrafish cDNA or EST sequences but have a sequence match in the zebrafish genome. Genscan or FGENESH identifiers are instead provided as identifiers for putative zebrafish orthologs.

TAIR Orthology Determination Method

We use YOGY and maintain a separate spreadsheet of results for each method (analysis not done included) for each human gene. If an Arabidopsis gene appears in more than one analysis, we consider it an Arabidopsis ortholog. Arabidopsis genes that only occur in one analysis are not considered orthologs.
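The consensus rule above can be sketched in a few lines of Python. This is only an illustration: the function name, the dictionary layout, the method names used as keys and the gene identifiers are hypothetical and are not taken from the TAIR spreadsheet.

```python
from collections import Counter


def consensus_orthologs(hits_by_method, min_methods=2):
    """Return genes reported by at least `min_methods` analyses.

    hits_by_method maps a method name to the set of Arabidopsis genes it
    returned for one human gene; None means the analysis was not done.
    """
    counts = Counter()
    for hits in hits_by_method.values():
        if hits is None:  # analysis not done for this human gene
            continue
        counts.update(set(hits))
    return sorted(gene for gene, n in counts.items() if n >= min_methods)


# Hypothetical results for a single human gene (identifiers are made up).
example = {
    "Inparanoid": {"AT1G00001"},
    "OrthoMCL": {"AT1G00001", "AT2G00002"},
    "KOG": {"AT3G00003"},
    "HomoloGene": None,
}
print(consensus_orthologs(example))
```

Only the gene seen in two analyses is returned; genes seen in a single analysis are dropped, as described above.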

dictyBase Orthology Determination Method

1. Check YOGY for orthologs (if the human name(s) are not recognized in YOGY, search HGNC, UniProt or even Google).

2. If there is an ortholog in YOGY (Dicty is included in two methods: Inparanoid and OrthoMCL), we confirm orthology with reciprocal BLAST. Our rule is that an ortholog should be at least 30% identical over 80% of the length of the protein; however, the curator can decide that a protein with lower sequence similarity is an ortholog if there are single genes in both organisms.

We also routinely compare domains in InterPro and/or Pfam, as well as TMHMM and SignalP predictions, if the human protein contains such structures. This helps to firmly determine whether there is a single Dicty ortholog.

3. If there is no ortholog in YOGY, we BLAST the human protein sequence in dictyBase, changing parameters such as the E-value and/or turning filtering off if necessary (e.g. for very short proteins such as DNAJC19). If we identify a potential ortholog this way, we proceed with our analysis as described in 2.

4. If Dicty has one or more genes that are just similar, e.g. conservation is only over a large domain, we do not consider this an ortholog. Depending on the degree of similarity we might mention this in our free-text description on each gene page.
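The 30% identity over 80% of length guideline from step 2 could be checked along the following lines. This is a sketch under stated assumptions: identity is taken as identical residues divided by aligned length, coverage as aligned length divided by the length of the query protein, and the function name and returned labels are hypothetical. The final decision remains with the curator.

```python
def classify_candidate(identical, aligned_len, protein_len,
                       single_copy_in_both=False,
                       min_identity=0.30, min_coverage=0.80):
    """Apply the 30% identity / 80% coverage guideline to one BLAST hit.

    Assumption: identity = identical / aligned_len and
    coverage = aligned_len / protein_len (length of the query protein).
    """
    if aligned_len <= 0 or protein_len <= 0:
        return "not an ortholog"
    identity = identical / aligned_len
    coverage = aligned_len / protein_len
    if identity >= min_identity and coverage >= min_coverage:
        return "ortholog"
    if single_copy_in_both:
        # Below threshold, but single genes in both organisms:
        # left to the curator's judgement.
        return "curator review"
    return "not an ortholog"


# Hypothetical alignment numbers, for illustration only.
print(classify_candidate(120, 300, 350))
print(classify_candidate(60, 300, 350, single_copy_in_both=True))
```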

SGD "Orthology" Determination Method

  1. Check each human gene name in YOGY for hits with each of the four methods: KOG, Inparanoid, HomoloGene, and OrthoMCL. Record which S. cerevisiae genes come up as hits for the human gene via each method.
  2. Evaluate whether I am getting the same hits from each of the available methods. Note that sometimes a given method is not available for a given human gene. In these cases, the comparison is made only between the results from the available methods.
  3. Take the S. cerevisiae hits obtained by searching with the human genes and use YOGY to search for their orthologs.
  4. Make a decision of what, if anything, to call, based on how many methods produced the S. cerevisiae hits, and on how many of the methods returned the original human gene as a hit in the reverse search with the S. cerevisiae gene. The examples below may help illustrate the decision process.
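The bookkeeping behind steps 1 to 4 can be sketched as a small tally. The function and data layout below are hypothetical, and the code only counts forward and reverse support; the call itself is still made by the curator. The sample data encode the DPM1 case described in the examples below.

```python
def support_summary(human_gene, forward_hits, reverse_hits):
    """Tally forward and reverse support for each yeast candidate.

    forward_hits: method -> set of yeast genes hit by the human gene,
                  or None if the method was not available.
    reverse_hits: yeast gene -> method -> set of human genes hit.
    """
    available = [m for m, hits in forward_hits.items() if hits is not None]
    candidates = set().union(*(forward_hits[m] for m in available))
    summary = {}
    for gene in sorted(candidates):
        forward = [m for m in available if gene in forward_hits[m]]
        reverse = [m for m, humans in reverse_hits.get(gene, {}).items()
                   if human_gene in humans]
        summary[gene] = {
            "forward": len(forward),      # methods giving this yeast gene
            "reverse": len(reverse),      # methods returning the human gene
            "available": len(available),  # methods available for the query
        }
    return summary


# DPM1 case: three methods agree in both directions; HomoloGene was
# available but returned neither ScDPM1 nor HsDPM1.
forward = {"KOG": {"ScDPM1"}, "Inparanoid": {"ScDPM1"},
           "OrthoMCL": {"ScDPM1"}, "HomoloGene": set()}
reverse = {"ScDPM1": {"KOG": {"HsDPM1"}, "Inparanoid": {"HsDPM1"},
                      "OrthoMCL": {"HsDPM1"}, "HomoloGene": set()}}
print(support_summary("HsDPM1", forward, reverse))
```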

Some Examples

  • Of getting hits but not making a call
    1. Sometimes I get hits in KOG, usually multiple hits, but no hits with any of the other 3 methods, even though at least some of them were available. In cases like this, I take a look at the KOG info. The KOG usually turns out to be something rather general, e.g. "permease of the major facilitator superfamily" as was the case for SLC37A4 (G6PT1; Glycogen storage disease Ib, 232220 (3)), and I ignore the S. cerevisiae hits coming from the KOG.
    2. For the human gene DPAGT2 [Congenital disorder of glycosylation, type Ij, 608093 (3)], Inparanoid and OrthoMCL give ScALG7 as a hit, though HomoloGene does not. However, searches with ALG7 give DPAGT1 rather than DPAGT2 so I did not make a call.
  • Of making a call
    1. Sometimes I get hits with some methods, but not others. For example, for DPM1 [Congenital disorder of glycosylation, type Ie, 608799 (3)], 3 methods (KOG, Inparanoid, and OrthoMCL) produced the same hit, ScDPM1, but HomoloGene, while available, did not. When ScDPM1 was used to search YOGY, it produced the original human gene, HsDPM1, as the sole human hit by KOG, Inparanoid, and OrthoMCL. HomoloGene produced hits, but only in Kluyveromyces lactis and Eremothecium gossypii, both of which are also fungi. As I have seen this pattern before, where HomoloGene has a much narrower range of calls than both Inparanoid and OrthoMCL, I went with the consensus of the other methods and called ScDPM1.
    2. Sometimes I get the same hit with all four methods. For example, for the human gene G6PD [Favism (3)], all four methods produce the same hit, the S. cerevisiae gene ZWF1. Doing the reverse search with ScZWF1, all four methods produced HsG6PD as a hit, though two methods also produced two additional hits. In this case, I ignored the additional hits and called ScZWF1 as the best hit for HsG6PD.
  • Of a mistaken call
    1. For the human gene DPYD [Thymine-uraciluria (3)], Inparanoid and OrthoMCL hit URA1; KOG and HomoloGene, while available for this gene, did not call ScURA1. Searching YOGY with ScURA1 gives back the original human gene with KOG, Inparanoid, and OrthoMCL. Thus I made the call. However, Julie Park, an SGD curator who had previously curated the ScURA1 gene in detail, informed me that URA1 has a different specific human homolog, and thus that ScURA1 is not orthologous to this human gene. The appropriate human homolog for URA1 corresponds to GenBank ID: M94065 and is described in a paper (PMID:1446837) and an OMIM record (http://www.ncbi.nlm.nih.gov/entrez/dispomim.cgi?id=126064). Julie added that there is a similar domain between the human DPYD gene (encoding dihydropyrimidine dehydrogenase [NADP+]) and ScURA1 (encoding dihydroorotate dehydrogenase), but the similarity does not extend to the entire protein. So, while the call looked good in terms of the hits and reciprocal hits, it was actually false.
  • Of a case where I'm not sure whether I should have made a call
    1. For the human gene ACTC [Cardiomyopathy, familial hypertrophic, 192600 (3)], I got hits with 3 methods. For KOG, this was "Actin and related proteins" so it's not really specific enough to use. For both Inparanoid and OrthoMCL, I only got one hit in S. cerevisiae, but the human gene 'ACTC1' pulled up multiple hits in human. For the Reverse checks: searching with Sc ACT1 (S000001855, YFL039c), I got the same general KOG and ignored it. Both Inparanoid and OrthoMCL pulled up multiple human genes, the starting human gene as well as many of the same human genes pulled up in the original search with HsACTC1. However, HomoloGene pulled up only HsACTG1, and not HsACTC1. I opted not to call this, but remain uncertain as to whether I should have or not.

Further Comments

  1. I would like to know more about the methodology behind each of the four prediction methods.
  2. I would like to be able to see alignments of the results, or at least some indication as to whether the hit is across the full length or just to a domain.
  3. In reverse searches with the S. cerevisiae genes, it is sometimes very difficult to determine whether the KOG hits actually correspond to the human gene I started with.
  4. It's really time consuming to do the equivalent searches via the individual pages for KOG, Inparanoid, HomoloGene, and OrthoMCL. They all search differently. Some don't allow searching by gene name, so you have to use an ID, and not just any ID but only the one they have chosen to reference.
  5. Should we be using the word ortholog? Some people use a very precise meaning of ortholog. Do we really mean that here?

From http://www.reference.com/browse/wiki/Homology_(biology):

Orthologs, or orthologous genes, are any genes in different species, that are similar to each other and originated from a common ancestor, regardless of their functions. Thus orthologs are separated by an evolutionary speciation event: if a gene exists in a species, and that species diverges into two species, then the divergent copies of this gene in the resulting species are orthologous. The term "ortholog" was coined in 1970.

A second definition of orthologous has arisen to describe any genes with very similar functions in different species. This differs from the original definition in that there is no statement about evolutionary relation, or similarity in sequence or structure.

WormBase: C. elegans Orthology Determination Method

1. Check YOGY using the human gene name and record the number of hits from each method in the elegans spreadsheet, Column T. C. elegans gene products are included in all four methods, but if a method lists no elegans gene product, then that method is not included in Column T.

2. If one C. elegans gene product is listed for each method, then that gene is entered as the ortholog of the human gene. We also perform reciprocal BLAST searches between the human and elegans proteins to support/confirm the orthology assignment. BLAST scores are now being recorded in the spreadsheet.

3. If there is no ortholog listed in YOGY, we still perform reciprocal BLAST searches and examine the highest scoring pairs. In some cases, for example ATXN2, this identifies an elegans ortholog that was not identified by the methods listed in the YOGY release available at that time.

4. Since in some cases the C. elegans ortholog is highly diverged in sequence from the human protein (see BRCA2/BRC-2, for example), we also search WormBase and the C. elegans literature using Textpresso to see if there are identified orthologs whose identity is not high enough to be returned in sequence-based searches. (p53 and CEP-1 are another example)

5. The trickiest cases are those for which there are a number of C. elegans proteins that are identified in YOGY (usually via KOG analyses) and that are roughly equivalent matches as far as BLAST scores are concerned. This has happened, for example, with some transmembrane receptors and Forkhead transcription factors. In these cases, it's not always clear that there is a true elegans ortholog. However, since many of these genes play important roles in C. elegans development and/or behavior, there would no doubt be value in having them annotated in GO. We would like to annotate these genes as time permits, but may not include them as orthologs in our spreadsheet.

6. We also examine the results of TreeFam analysis, on the TreeFam site as well as those listed on the gene summary pages of WormBase, to corroborate YOGY and BLAST results.

MGI: Mouse Orthology Determination Method

1. MGI has, for a long time ( > 15 years), curated and maintained orthology records for a variety of mammalian gene sets with a primary focus on mouse/human/rat sets. Recently, we updated our full orthology set loads to include chimp and dog as a result of the sequencing of chimp and dog genomes. We will continue to add full ortholog sets as additional genomes are completed. We expect to extend our orthology sets to include other vertebrates in addition to mammals in the future (chicken for example).

2. We now rely heavily on the sets resulting from the Homologene algorithm. The Homologene algorithms are re-run whenever there is a new genome release, and we re-load the Homologene sets following each new run. We are now loading orthology sets for mouse/human/rat/chimp/dog. We co-curate mouse data with EntrezGene curators for gene and sequence identities and intersections, orthology, etc. EntrezGene incorporates GO annotations from MGI. The EntrezGene files provide OMIM IDs for human genes as available.

Homologene releases new data several times a year, but we only expect to see major changes after a new genome build. Our updates are therefore timed around a new genome build and/or a couple of times a year.

3. We incorporate specific subsets of orthology assertions from Homologene. We also load additional orthologs where there is no conflict with existing orthology data. We obtain ~17,000 human/mouse ortholog determinations from this process; somewhat fewer rat/dog/chimp orthologs are included.

Homologene compares the protein sequences, determines whether a given gene has orthologs, and annotates the data as one of the following:

  1. Reciprocal best hits between two organisms (b)
  2. Reciprocal best hits between more than 3 organisms (B)
  3. A match between two organisms, not a reciprocal best (m)

We take the b and B sets and compare them with our existing MGI data. The orthology sets that exactly match existing MGI data, and any new Homologene sets that do not conflict with existing MGI orthology data, are loaded as part of the Homologene load. We follow the same steps for mouse-human, mouse-rat, mouse-chimp and mouse-dog data. We use a specific citation number to distinguish the Homologene load. We have QC reports that give details when there are conflicts with existing orthology in MGI. Curators can override Homologene.
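The filtering described above might look roughly like the sketch below. It is not MGI's load code: the function name, the tuple layout and the gene names are hypothetical, and "conflict" is assumed here to mean that either gene in a pair already has a different curated partner.

```python
def partition_homologene(pairs, existing):
    """Split Homologene pairs into loadable pairs and QC conflicts.

    pairs: iterable of (mouse_gene, other_gene, code), code in {b, B, m}.
    existing: dict of curated orthology, mouse_gene -> other_gene.
    """
    existing_reverse = {other: mouse for mouse, other in existing.items()}
    to_load, qc_report = [], []
    for mouse, other, code in pairs:
        if code not in ("b", "B"):  # keep reciprocal best hits only
            continue
        known_other = existing.get(mouse)
        known_mouse = existing_reverse.get(other)
        # Assumed definition of a conflict: a different partner is
        # already curated for either gene in the pair.
        conflict = (known_other not in (None, other)
                    or known_mouse not in (None, mouse))
        (qc_report if conflict else to_load).append((mouse, other, code))
    return to_load, qc_report


# Hypothetical data, for illustration only.
existing = {"GeneA": "HUMAN_A", "GeneB": "HUMAN_B"}
pairs = [
    ("GeneA", "HUMAN_A", "B"),  # exact match with curated data
    ("GeneB", "HUMAN_X", "b"),  # conflicts with curated GeneB/HUMAN_B
    ("GeneC", "HUMAN_C", "b"),  # new, no conflict
    ("GeneD", "HUMAN_D", "m"),  # not a reciprocal best hit
]
to_load, qc_report = partition_homologene(pairs, existing)
print(to_load)
print(qc_report)
```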

4. We work closely with gene family experts to resolve relationships among genes in mouse/human/rat where there are paralogs/orthologs with extensive sequence similarity. These are not always resolved by Homologene other than by clumping paralog/ortholog sets in a single Homologene record. Community collaborations usually result in a publication, and we enter the determined orthologs into MGI. If Homologene reports conflict with the gene family curated set, a QC record will result and Homologene won't be loaded. Curators evaluate QC reports regularly.

5. We also supplement Homologene orthology with additional orthology determinations from the HCOP set from HGNC. The HCOP project curates the Homologene, Compara, and Inparanoid ortholog set intersections to provide a unique representation of orthologs in humans and other mammalian species. Through this mechanism, we resolve some clusters that were filtered into QC reports from the Homologene load. From this process we obtain an additional 700+ ortholog sets.

6. MGI orthology sets are available on our ftp site. Start at ftp://ftp.informatics.jax.org/pub/reports/index.html to see the file structure, then select the file of interest. If you need help or want some other file structure, contact MGI user support at mgi-help@informatics.jax.org

FlyBase Orthology Determination Method

FlyBase uses InParanoid for both reference genome and general FlyBase orthology calls.

In detail:

1. Search the HGNC site to find official human gene symbol and corresponding Ensembl gene ID.

2. Search InParanoid with Ensembl gene ID for Drosophila melanogaster orthologs using the default parameters (exclude inparalogs scoring below 0.05).

3. If result is 'no cluster' then report there is no ortholog.

4. If the human sequence clusters with Drosophila sequences then report the Drosophila gene only if it is the 'main ortholog' of the human gene. This is in line with what FlyBase reports as orthology calls but is possibly too conservative.

For instance, by our criteria there is 'no ortholog' for ACHE (acetylcholinesterase) based on the following InParanoid results:

Protein ID	Score	Bootstrap	Gene
ENSP00000264381	1	99%		BCHE (human)
ENSP00000350037	0.246			ACHE (human)
FBgn0000024	1	99%		Ace (fly)

BCHE is the 'main ortholog' of Ace but ACHE is also considered to be orthologous by InParanoid. There is clearly an argument for curating the Drosophila Ace gene even though it is more closely related to BCHE than ACHE. The decision to stick with curating 'main orthologs' is partly an attempt to be consistent (I started doing it this way) and is also influenced by time limitations: I'm currently failing to keep up to date with curating even the 'main orthologs'. Also, if one fly gene is reported as the ortholog of many human genes, we will end up with the same data in several rows of the spreadsheet and with an inaccurately high impression of how much curation has been done in the metrics. Is this a problem?

5. Record either 'main ortholog', 'no cluster' or the most similar human/Drosophila gene pair from the cluster (e.g. BCHE/Ace in ACHE search) along with the search date in a local results spreadsheet.
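Steps 3 to 5 can be sketched as follows, using the ACHE cluster shown above as input. The function name and row layout are hypothetical, and the sketch assumes that the 'main ortholog' pair consists of the one human row and the one fly row with score 1 in the cluster.

```python
def flybase_record(query_symbol, cluster):
    """Return (record, reported_fly_gene) for one human query gene.

    cluster: list of (species, symbol, score) rows from one InParanoid
    cluster, or None when InParanoid reports 'no cluster'.
    Assumes exactly one human row and one fly row have score 1.
    """
    if not cluster:
        return ("no cluster", None)
    seeds = {species: symbol
             for species, symbol, score in cluster if score == 1}
    human_main, fly_main = seeds["human"], seeds["fly"]
    if query_symbol == human_main:
        return ("main ortholog", fly_main)
    # Query clusters with the fly gene but is not its main ortholog:
    # record the most similar pair and report no ortholog.
    return (f"{human_main}/{fly_main}", None)


# Rows taken from the InParanoid result table for the ACHE search.
ace_cluster = [
    ("human", "BCHE", 1.0),
    ("human", "ACHE", 0.246),
    ("fly", "Ace", 1.0),
]
print(flybase_record("ACHE", ace_cluster))
print(flybase_record("BCHE", ace_cluster))
```

The search date that step 5 also asks for is left out of the sketch.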

Comments: At present, if no cluster is reported by InParanoid, no further searches are performed. However, following the discussion of different methods, I have spot-checked a few genes using different methods and have not found any additional orthologs. By ignoring some of the human genes in many_human_genes-to-one_fly_gene relationships, I suspect we have under-reported orthologs relative to the other MODs. I am happy to revisit our calls based on whatever consensus the project reaches, but given the lack of time, the gene target numbers may need to be reviewed.

RGD Orthology Determination Method

RGD has been providing information on the homologous mouse and human genes for quite some time.

All homology information and relationships are based on the mouse/human/rat sets that MGI puts together and makes available on their ftp site. Detailed information is found at MGI's entry on this page.

Comment: occasionally, it is possible to find entries for the homologous genes but not for the rat gene because of some delays in loading rat information from Entrez Gene. However, all our pipelines have been updated and expanded and will be running on a regular basis.

GOA/Human Orthology Determination Method

Orthologs are chosen following a protein MPsrch (http://www.ebi.ac.uk/MPsrch) or BLAST search in which the alignments obtained show sequences that have a high degree of similarity over their entire lengths, making it reasonable to infer that the two proteins have a common function. It must be emphasized that curators must check each alignment and use their experience to assess whether the similarity is strong enough to project annotations. While there is no fixed cut-off point in percentage sequence similarity, generally mammalian proteins which have greater than 80% identity covering greater than 90% of the length of both proteins are examined further. Additional tools, such as the HCOP orthology tool (http://www.genenames.org/cgi-bin/hcop.pl), are used when possible.

In addition, curated gene orthology data obtained from the Ensembl Compara system is used to automatically project GO terms from a source organism onto one or more target species. Only one to one and apparent one to one orthologies are used, and only manually annotated GO terms with an evidence type of IDA, IEP, IGI, IMP or IPI are projected. GO annotations produced by this automated technique receive the evidence code IEA.
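The projection rule in the paragraph above can be sketched as a simple filter. This is an illustration, not the GOA pipeline: the function name, the tuple layouts, the gene names and the GO IDs are placeholders, and the relationship labels `ortholog_one2one` and `apparent_ortholog_one2one` are assumed names for the two accepted orthology types.

```python
# Evidence codes eligible for projection, as listed in the text above.
PROJECTABLE = {"IDA", "IEP", "IGI", "IMP", "IPI"}
# Assumed labels for "one to one" and "apparent one to one" orthologies.
ACCEPTED_TYPES = {"ortholog_one2one", "apparent_ortholog_one2one"}


def project_annotations(annotations, orthologs):
    """Project manual GO annotations onto one-to-one orthologs as IEA.

    annotations: list of (source_gene, go_id, evidence_code).
    orthologs: list of (source_gene, target_gene, relationship).
    """
    targets = {src: tgt for src, tgt, rel in orthologs
               if rel in ACCEPTED_TYPES}
    projected = []
    for gene, go_id, evidence in annotations:
        if evidence in PROJECTABLE and gene in targets:
            projected.append((targets[gene], go_id, "IEA"))
    return projected


# Placeholder genes and GO IDs, for illustration only.
annotations = [
    ("src1", "GO:0000001", "IDA"),
    ("src1", "GO:0000002", "ISS"),  # evidence code not projected
    ("src2", "GO:0000003", "IMP"),  # src2 has no one-to-one ortholog
]
orthologs = [
    ("src1", "tgt1", "ortholog_one2one"),
    ("src2", "tgt2", "ortholog_one2many"),
]
print(project_annotations(annotations, orthologs))
```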

Details on the Ensembl Compara method can be found at: http://www.ensembl.org/info/data/compara/homology_method.html

S. pombe Orthology Determination Method

Orthologs have been curated manually between S. pombe and S. cerevisiae, and the basic methods used are described in:

V. Wood, Schizosaccharomyces pombe comparative genomics; from sequence to systems. In: Comparative Genomics using fungi as models (P. Sunnerhagen, J. Piskur, eds.) Topics in Current Genetics vol 15, pp 233- 285 (2006)

Independently, the curation is also supplemented by annotation which describes species distribution, to identify/annotate gene products which are conserved from S. pombe to human.

The orthology assignment methods used combine both of these resources. First, YOGY is used to identify the potential fungal orthologs of the human gene. If the ortholog predictors are not in agreement, and this orthology has not been curated or recorded previously, further checks are made (described in the reference above). All predictions are also checked with Treefam. Curated alignments will be submitted to Treefam when they represent orthologous groups (at present alignments are submitted only to Pfam).

Checks Include

  • Checking domain organisation using Pfam (Pfam families can also be used as a robust way to confirm homology when they are specific for orthologous groups)
  • Checking whether the predicted ortholog is the assigned ortholog for an alternative human gene product which is in a different orthologous cluster (although a single yeast gene can be orthologous to more than one human gene product in a one-to-many relationship, and vice versa)
  • Checking with members of the community working on the gene/ biological knowledge

For example, biological knowledge was used to resolve the following calls where the orthology reporting from the ortholog predictors was either absent or ambiguous:

ACTC/Act1, actin

MYL2/rlc1, myosin II regulatory light chain (Treefam outgroup)

ERCC8/SPBC577.09 (because SPBC577.09 is the ortholog of S. cerevisiae RAD28, which is reported to be the ortholog of human CSA/ERCC8)

DNAJC19/tim14 (tim14 is known to be orthologous to the human Tim complex subunits DNAJC15 and DNAJC19, and this is shown correctly in Treefam: http://www.treefam.org/cgi-bin/TFinfo.pl?ac=TF320584)

The following are not picked up by any of the predictors in YOGY for yeasts (and possibly other organisms) using the human gene as a query, but are known from biological knowledge to be conserved from yeast to human, and this is confirmed using Treefam:

TPM1/cdc8 tropomyosin http://www.treefam.org/cgi-bin/TFinfo.pl?ac=TF500776

MYL3/cdc4 myosin II light chain http://www.treefam.org/cgi-bin/TFinfo.pl?ac=TF504363

ATXN2/SPBC21B10.03c ataxin-2 homolog http://www.treefam.org/cgi-bin/TFinfo.pl?ac=TF503091

All of the myosins (Pers. Comm from Dan Mulvihill)

Difficult to determine:

Some could be 'domain only' hits, for example the FOXC* genes. These are forkhead transcription factors and have conserved domains, but I haven't yet determined whether there is conservation outside these regions or whether they are just RBH to a domain. More work is needed.

Identification of false positives:

ybr172c is reported as the ortholog of TNNT2, and this is conserved in S. pombe SPAC4F10.13c. However, TNNT2 is a troponin family member, whereas ybr172c and SPAC4F10.13c are GYF domain proteins: http://www.sanger.ac.uk/cgi-bin/Pfam/swisspfamget.pl?name=O36025 http://www.treefam.org/cgi-bin/TFinfo.pl?ac=TF504363 This appears to be an example of a false positive generated by a coiled-coil region that is not backed up by domain analysis.

Distant ortholog detection:

There are many examples where both the ortholog predictors and Treefam fail to detect distant orthologs, i.e. report false negatives (especially for the yeasts).

Examples: kinetochore protein Mis14 (this is not a reference genome gene). The ortholog predictors show only fungal conservation: http://www.sanger.ac.uk/cgi-bin/PostGenomics/S_pombe/YOGY/yogy-search.pl?gene=SPAC688.02c&species=S._pombe&wild=No&go_term=No&go_final=No but building a Pfam family shows conservation to human: http://www.sanger.ac.uk/cgi-bin/Pfam/getacc?PF08641

SPAC13F5.04c endosomal sorting protein: no YOGY hits, but http://www.sanger.ac.uk/cgi-bin/Pfam/getacc?PF04652 shows human conservation (5 proteins, but all refer to c6orf55)

These alignments will be submitted to Treefam for manual tree curation

Other Notes:

  • Not all orthologs are reciprocal best hits
  • Some families (WD/TPR/LRR etc.) are known to give many false positives and require subsequent analysis, i.e. clustering
  • Coiled-coil proteins should also be treated with caution (see the email from Dan Mulvihill in which myosin relationships are resolved; all predictors and Treefam got these relationships incorrect, and Treefam are using this to resolve their fungal outgroups)
  • Not all orthologs are conserved over their entire length (will add examples)

  • The meaning of ortholog is well defined and we should adhere to the original and intended use. I think what we are trying to curate here are orthologs (i.e. direct evolutionary counterparts by vertical descent)