Talk:2010 GO camp Annotation propagation rules

From GO Wiki

Conference call April 28 2010: Ensembl Gene Trees & Annotation propagation

Javier Herrero & Glenn Proctor

Gene Tree building pipeline:

  • Load genes and the longest translation of each from all Ensembl species
  • WU-BLASTP + Smith-Waterman: longest translation of every gene against every other gene
  • Build clusters and an MSA with M-Coffee
  • Build a reconciled gene tree with internal duplication nodes, taking the species tree into account (TreeBeST)
  • Infer orthologs and paralogs (OrthoTree)
  • Gene trees predicting too many gene losses and gains are corrected to make simpler models of evolution, and the predicted ancestor is replaced by an 'ambiguous' ancestor.
  • When gene trees disagree with the species tree (or predict complicated gene losses/gene gains), the cause is often a gene split, which in turn often results from incorrect sequence or assemblies.
  • These cases are fed back to the gene builders
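
The ortholog/paralog inference step can be illustrated with a minimal sketch. It shows only the general idea of reading relationships off a reconciled tree (a gene pair is orthologous if its last common ancestor is a speciation node, paralogous if it is a duplication node). It is not the OrthoTree implementation; the `Node` class, the event labels and the gene identifiers are hypothetical, and 'ambiguous' ancestors are not handled.

```python
from dataclasses import dataclass, field
from itertools import combinations
from typing import List, Optional, Tuple


@dataclass
class Node:
    """Hypothetical tree node: leaves carry a gene, internal nodes an event."""
    event: Optional[str] = None   # "speciation" or "duplication" (internal nodes)
    gene: Optional[str] = None    # gene identifier (leaves only)
    children: List["Node"] = field(default_factory=list)


def leaf_genes(node: Node) -> List[str]:
    """Return the gene identifiers of all leaves below (or at) this node."""
    if not node.children:
        return [node.gene]
    return [g for child in node.children for g in leaf_genes(child)]


def classify_pairs(node: Node) -> List[Tuple[str, str, str]]:
    """Label each gene pair by the event at its last common ancestor."""
    pairs = []
    kind = "ortholog" if node.event == "speciation" else "paralog"
    # Genes in different child subtrees have this node as last common ancestor.
    for left, right in combinations(node.children, 2):
        for a in leaf_genes(left):
            for b in leaf_genes(right):
                pairs.append((a, b, kind))
    for child in node.children:
        pairs.extend(classify_pairs(child))
    return pairs


# Toy tree: one human gene, and a fish gene duplicated after the split.
fish_dup = Node(event="duplication",
                children=[Node(gene="fish_A1"), Node(gene="fish_A2")])
root = Node(event="speciation", children=[Node(gene="human_A"), fish_dup])
pairs = classify_pairs(root)
```

In this toy tree the human gene is an ortholog of both fish copies (a 1:many relationship), while the two fish copies are paralogs of each other.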

Gene Trees: species coverage

Ensembl is human-centric; the gene trees cover vertebrates and use Drosophila, C. elegans and yeast as outgroups. The software does not try to ensure that the trees are correct for the outgroups. Pan-compara is a separate effort aimed at extending the trees to other species.

Annotation Propagation by Ensembl

  1. Gene names:
    • From human and mouse to all vertebrates: Use the 1:1 orthologs
    • From human to fish: Use the 1:many orthologs; Names become NAME (1 of 3), NAME (2 of 3), etc.
    • Rules:
      • If the source gene has an HGNC name and the target gene has no name or only a RefSeq predicted name, add the name to the target gene.
      • Change the status of the target gene to “KNOWN_BY_PROJECTION”
  2. GO annotations
    • Use the 1:1 orthologs only
    • From human and mouse to all vertebrates
    • From rat to human and mouse
    • From zebrafish to other fish
    • From human to zebrafish
    • Rules:
      • Add GO terms from the source gene to the target gene, avoiding duplicates. Only project source GO terms with evidence codes IDA, IEP, IGI, IMP, IPI.
      • Projected GO terms on the target gene are given evidence code IEA.
      • All terms from all 3 ontologies are used; the eventual plan is to use GO's taxon rules.
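
The projection rules above can be sketched in a few lines of Python. This is only an illustration of the rules as noted on the call, not Ensembl's projection code: the `Gene` record, the `name_source` labels (`"RefSeq_predicted"`, `"projection"`), the default status and all identifiers are hypothetical. It also assumes that the "(i of n)" suffix counts every ortholog of the source gene.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Evidence codes that the notes list as eligible for projection.
PROJECTABLE_CODES = {"IDA", "IEP", "IGI", "IMP", "IPI"}


@dataclass
class Gene:
    """Hypothetical minimal gene record, for illustration only."""
    stable_id: str
    name: Optional[str] = None
    name_source: Optional[str] = None   # e.g. "HGNC" or "RefSeq_predicted"
    status: str = "NOVEL"               # placeholder default, an assumption
    go: Dict[str, str] = field(default_factory=dict)  # GO term -> evidence code


def project_name(source: Gene, targets: List[Gene]) -> None:
    """Copy an HGNC name from source onto its 1:1 or 1:many orthologs."""
    if source.name is None or source.name_source != "HGNC":
        return
    total = len(targets)
    for index, target in enumerate(targets, start=1):
        # Only fill in a missing name or replace a RefSeq predicted name.
        if target.name is not None and target.name_source != "RefSeq_predicted":
            continue
        suffix = f" ({index} of {total})" if total > 1 else ""
        target.name = source.name + suffix
        target.name_source = "projection"
        target.status = "KNOWN_BY_PROJECTION"


def project_go(source: Gene, target: Gene) -> None:
    """Copy eligible GO terms onto a 1:1 ortholog, recording them as IEA."""
    for term, code in source.go.items():
        # Skip ineligible evidence codes and terms the target already has.
        if code in PROJECTABLE_CODES and term not in target.go:
            target.go[term] = "IEA"


# Made-up example data: one human gene, two fish orthologs, one 1:1 ortholog.
human = Gene("human_gene", name="ABC1", name_source="HGNC",
             go={"GO:0000001": "IDA", "GO:0000002": "ISS", "GO:0000003": "IMP"})
fish = [Gene("fish_gene_a"), Gene("fish_gene_b")]
project_name(human, fish)

other = Gene("other_gene", go={"GO:0000003": "IPI"})
project_go(human, other)
```

After this runs, the two fish genes are named with the "(1 of 2)" and "(2 of 2)" suffixes. The 1:1 ortholog gains only the IDA-supported term as IEA: the ISS term is not eligible, and the term it already carried keeps its own evidence code.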