Reference Genome minutes (Archived): Difference between revisions

Revision as of 11:53, 2 October 2007

Minutes for the reference genome meeting, September 26-27, 2007

Orthology determination

Kara Dolinski

Background information

Available tools:
- InParanoid (PMID 15608241)
- HomoloGene (PMID 17170002): is updated but does not include all species; it also does not perform very well, being based on reciprocal BLAST rather than phylogeny
- HCOP (PMID 16284797)
- TreeFam (PMID 16381935)
- Compara (no publication yet): produces trees
- OrthoMCL (PMID 12952885)
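
To illustrate what "reciprocal BLAST based" means in the list above, here is a minimal sketch of reciprocal best hits. It is not the implementation of any of the listed tools; the function names, gene names and scores are all made up for the example.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    `scores` maps (query, subject) -> similarity score, e.g. a BLAST bit score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (gene_a, gene_b) pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical scores: species A searched against species B, and the reverse.
a_vs_b = {("a1", "b1"): 250.0, ("a1", "b2"): 90.0, ("a2", "b2"): 80.0}
b_vs_a = {("b1", "a1"): 245.0, ("b2", "a1"): 95.0, ("b2", "a2"): 70.0}

pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
```

In this toy data a2 ends up with no partner because b2's best hit is a1, which is the kind of case a best-hit method leaves unresolved.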

‘Aggregator’ tools:
- YOGY (PMID 16845020): not really maintained; has all except chicken and zebrafish. Methods include KOGs, InParanoid, HomoloGene, OrthoMCL and a table of curated orthologs between budding yeast and fission yeast.
- bioPIXIE (PMID 16420673; Princeton): a data integration approach, with Troyanskaya, that incorporates data from several methods to generate a ‘probability of orthology’. Use the same protein sets with all the algorithms and update as required. Agreed to be a good idea (see action item).
- P-POD (PMID 17712414; http://ortholog.princeton.edu/findorthofamily.html): based on OrthoMCL (http://orthomcl.cbil.upenn.edu/cgi-bin/OrthoMclWeb.cgi) and Jaccard Coefficient Cluster.
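
For readers unfamiliar with the Jaccard coefficient mentioned for P-POD, the sketch below computes it for two sets. How P-POD applies the coefficient in its clustering is not described in these minutes, so the choice of "sets of hits" as input and the protein names are assumptions made only for illustration.

```python
def jaccard(set_a, set_b):
    """Jaccard coefficient: size of the intersection over size of the union."""
    a, b = set(set_a), set(set_b)
    union = a | b
    if not union:
        # Two empty sets share nothing; define the coefficient as 0 here.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical hit lists for two proteins.
hits_x = {"p1", "p2", "p3", "p4"}
hits_y = {"p3", "p4", "p5"}

similarity = jaccard(hits_x, hits_y)  # 2 shared members out of 5 distinct
```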

The problem is that none of them does exactly what is wanted. There is no gold standard set. It is sometimes necessary to look manually for an ortholog when no tool finds one (for short proteins, for example, or divergent ones such as E. coli proteins).

Another problem is the way databases handle orthologs: the mouse schema can’t cope with many-to-many relationships.

- Methods to do orthology comparisons (see slide)

Comparison of tools is made more difficult due to:

- different species are covered by different tools;
- identifiers are used inconsistently (Treefam is a mess - MA). Emily points out that UniProt and Ensembl have joined forces to reduce differences (in proteins?) and fill in holes in UniProt, doing human first, with mouse second on the list for clean-up;
- not all sets are based on the same proteins, because of varying frequency of updates/maintenance.

- Quality of ortholog tools: PMID 17440619: assessing performance of orthology detection methods

Issues regarding ortholog determination in the context of the reference genome project

We need a set of sequences to work with

[ACTION ITEM]: Suzi and Karen E will generate a page where all sequences will be available

We need a complete set of orthologs that covers all reference genomes. The orthology determination should capture one-to-many and many-to-many relationships (question: does that need to be captured somewhere? “Unique putative ortholog”, “one2many”, “many2many” (gene family?)). Our concern is to capture the full set rather than to make statements about the evolutionary relationships between gene products and/or organisms (we probably need to clarify this in our documentation and in what we present to the public as our goals).
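
One possible way to record the relationship type raised in the question above is to label each ortholog pair from how many partners each gene has. This is a simplified sketch, not an agreed scheme: the label names reuse the ones floated above, and the gene identifiers are invented.

```python
from collections import Counter


def classify_pairs(pairs):
    """Label each (gene_a, gene_b) ortholog pair between two species.

    A pair is "unique" if neither gene has another partner, "many2many" if
    both genes have several partners, and "one2many" otherwise.
    """
    pairs = set(pairs)  # ignore duplicate pairs
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    labels = {}
    for a, b in pairs:
        if left[a] == 1 and right[b] == 1:
            labels[(a, b)] = "unique"
        elif left[a] > 1 and right[b] > 1:
            labels[(a, b)] = "many2many"
        else:
            labels[(a, b)] = "one2many"
    return labels


# Hypothetical ortholog pairs between species A and species B.
example_pairs = [
    ("a1", "b1"),                               # a1 and b1 only see each other
    ("a2", "b2"), ("a2", "b3"),                 # a2 has two partners
    ("a3", "b4"), ("a3", "b5"), ("a4", "b4"), ("a4", "b5"),
]
labels = classify_pairs(example_pairs)
```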

Setting curation priorities

Rex Chisholm

Background

When we started the reference genome project last year, we made genes involved in human diseases our main priority. The Scientific Advisory Board suggested that we also try to curate genes for which there are no GO annotations but for which there are published data. Other suggestions:

- ENCODE
- members of a complex should all be done at the same time
- all enzymes in a pathway should be done at the same time

[ACTION ITEM] (everybody): We will add categories of genes to annotate in addition to ‘disease genes’. We will choose five genes from each of the following four groups:

- diseases (Resources: RGD portal (http://rgd.mcw.edu/dportal/), OMIM (http://www.ncbi.nlm.nih.gov/Omim/getmorbid.cgi))
- biochemical/signaling pathways (Resources: Reactome (http://reactome.org/), Pathway Tools (http://bioinformatics.ai.sri.com/ptools/))
- bleeding edge list, for example: new “hot” genes in your area of interest; genes that come up in computational studies/population studies; most cited papers by some text-mining method; genes cited in news media; newly named genes (HGNC)

[ACTION ITEM] (everybody): Each database should keep an eye open for those genes, so as to have genes to suggest when it’s their turn to do the assignments

- conserved genes/unannotated genes; genes that have few annotations and lots of literature

[ACTION ITEM] (Val): Provide the list of 207 genes conserved between pombe and human with no annotation/information

[ACTION ITEM] (Jim): Provide the set of genes found by InParanoid to be conserved in all 12 species (660 or so); we might want to prioritize this list by ascending number of annotations, to target unannotated genes (who can do that?)

[ACTION ITEM] (Ruth): Send the HGNC list of genes with few annotations

This will be done on a rotation basis from all databases. I suggest we go alphabetically:

November 2007: Arabidopsis thaliana
December 2007: Caenorhabditis elegans
January 2008: Danio rerio
February 2008: Dictyostelium discoideum
March 2008: Drosophila melanogaster
April 2008: Escherichia coli
May 2008: Gallus gallus
June 2008: Homo sapiens
July 2008: Mus musculus
August 2008: Rattus norvegicus
September 2008: Saccharomyces cerevisiae
October 2008: Schizosaccharomyces pombe
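
On the earlier suggestion to prioritize the conserved-gene list by ascending number of annotations: given a per-gene annotation count, this is a simple sort. The sketch below assumes such counts are already available as a mapping; the gene names and counts are placeholders.

```python
def prioritize(annotation_counts):
    """Return gene IDs with the fewest annotations first; ties sorted by ID."""
    return sorted(annotation_counts, key=lambda gene: (annotation_counts[gene], gene))


# Hypothetical annotation counts for four conserved genes.
counts = {"geneC": 12, "geneA": 0, "geneB": 3, "geneD": 0}
ordered = prioritize(counts)
```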

[ACTION ITEM]: Contact/meet with people who have made tools for orthology determination to see if they can help us (possibly including re-running the analyses using the most recent set of sequences and proper IDs). THIS ACTION ITEM NEEDS TO BE ASSIGNED TO SOMEONE:

Compara: Emily?
Homologene: Judy?
TreeFam
InParanoid
others?

[ACTION ITEM]: Kara: run the P-POD analysis over the full ref genome data set(?). This needs a computational pipeline using existing resources. It currently takes 3 weeks to do 8 species all v all.

METRICS

One of the goals of the Reference Genome project is to provide quantitative measures of annotation progress. Specifically, we want to know the breadth (what fraction of genes have been annotated in each genome) and the depth (to what degree of precision each gene is annotated). This is difficult because each measure depends on information that is itself difficult to obtain: (1) the total number of genes (and ideally gene products, which include splice variants) in each organism; (2) the number of pseudogenes (which should be excluded when measuring annotation progress); (3) the granularity to which a gene may be annotated based on the experimental information available. The following presentations addressed different aspects of those issues.
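
As a rough illustration of the breadth measure described above, the sketch below computes the fraction of annotated genes after excluding pseudogenes. It assumes the gene and pseudogene lists are known, which is exactly the hard part noted above; the function name and gene IDs are invented. Depth is not sketched, since no way of scoring it is defined here.

```python
def annotation_breadth(all_genes, pseudogenes, annotated_genes):
    """Fraction of non-pseudogene genes that have at least one annotation."""
    eligible = set(all_genes) - set(pseudogenes)
    if not eligible:
        raise ValueError("no eligible genes to measure")
    # Annotations attached to pseudogenes do not count towards progress.
    return len(eligible & set(annotated_genes)) / len(eligible)


# Hypothetical genome of five genes, one of which is a pseudogene.
genes = ["g1", "g2", "g3", "g4", "g5"]
breadth = annotation_breadth(genes, pseudogenes=["g5"], annotated_genes=["g1", "g2", "g5"])
```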

Sequence metrics

Karen Eilbeck

The grant has a goal of managing the contents of sequence annotation. It tries to address quantitatively how to evaluate sequence annotations, how different a genome release is from a previous release, and the complexity of alternative splicing. Can we keep track of sequence curation progress? Can we track more than the splits and merges?

- Using GFF3 files, measures 1. sequence annotation turnover, 2. annotation edit distance, 3. splicing complexity
- To measure specifically what was attributable to curation, all assembly-induced changes were removed. Only annotations where there was no change to the underlying sequence were looked at.
- Measured, between two releases, how many genes still existed in the next release, and how many were traceable to the next release: Worm and C. elegans stable in terms of changes.
- How well does a prediction match a reference annotation? Burset and Guigo 1996 (PMID: 8786136)
- How to compare sequence annotations: based on a distance measure (on a scale of 0-1), built on the ideas of sensitivity, specificity and accuracy (referred to as congruency):
- How well does a prediction match a reference annotation: you can measure true positives, false positives and false negatives (in the bits of the gene models); a bad prediction would have a score of 0, a perfect match a score of 1. They give numerical values to sensitivity, specificity and congruency (accuracy). This is complicated by alternative splicing; they take this into account by comparing each pair of transcripts. See slides for examples of updated gene models and how that affects the scores. From that you get a bar graph that gives a quantitative value of how much the genome has changed between two releases; some genomes, like mouse and human, vary quite a bit between releases; others (e.g. fly) are much more stable.
- Alternative splicing: the trend for alternative splicing is increasing, but it is lower than what is found in the literature. They have a formula to calculate splicing complexity: if two splice variants are very different from each other they get a higher score (range 0-1). See slides for a graph of the distribution of splicing complexity. One issue is bad gene models: some may be annotated as splice variants but could actually be several genes.
- Conservation of alternative splicing: is there any correlation between different species? The answer is that there is a small correlation, but not as much as you would expect from reading the literature.
- Kimberly points out that WormBase annotates a mix of genes and proteins. The biggest problem is that the protein is often unspecified (which one is it?). There is concern about overstating evidence.
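
The sensitivity, specificity and congruency scores described in this list can be sketched at the nucleotide level as follows. The minutes do not give the formulas, so this is an assumption-laden illustration: sensitivity is taken as TP/(TP+FN), specificity as TP/(TP+FP) (the gene-prediction usage of the term), and congruency as their average. The exon coordinates are invented.

```python
def positions(exons):
    """Expand (start, end) exon intervals, 1-based inclusive, into a set of positions."""
    covered = set()
    for start, end in exons:
        covered.update(range(start, end + 1))
    return covered


def compare_models(reference_exons, prediction_exons):
    """Return (sensitivity, specificity, congruency) for one pair of gene models."""
    ref = positions(reference_exons)
    pred = positions(prediction_exons)
    tp = len(ref & pred)   # positions in both models
    fn = len(ref - pred)   # reference positions the prediction misses
    fp = len(pred - ref)   # predicted positions absent from the reference
    sensitivity = tp / (tp + fn) if ref else 0.0
    specificity = tp / (tp + fp) if pred else 0.0
    # Assumed definition: congruency as the mean of the two scores.
    congruency = (sensitivity + specificity) / 2
    return sensitivity, specificity, congruency


# Hypothetical models: the prediction truncates the second exon.
reference = [(1, 100), (201, 300)]
prediction = [(1, 100), (201, 250)]
sn, sp, congruency = compare_models(reference, prediction)
```

Under these assumptions, identical models score 1 and non-overlapping models score 0, matching the range described above; a distance between releases could then be taken as 1 minus congruency, which is again an assumption rather than something stated in the presentation.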

How this affects the reference genome project

Can we use this to implement a mechanism to warn curators that an annotated protein has been modified?

-> Not a high priority since most groups do not annotate to gene products; rather they annotate to genes. It is still the goal to eventually annotate to gene products and we should anticipate potential problems that might create.

For genes with information in the ‘WITH/FROM’ column, should databases be notified if there is a modification to the sequence corresponding to the ID in the ‘WITH/FROM’ column?

-> The consensus is that although this is important to consider, it’s NOT a priority. When 90% of the genes are annotated we can set that up. Moreover, with the Ref Genomes annotation guidelines, since the WITH is well characterized, this is relatively unlikely to be a major problem (corresponding proteins are expected to be well curated).
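
Should such a notification mechanism be set up later, one simple approach would be to compare sequence checksums between releases and flag the IDs that changed or disappeared. The sketch below is only an assumption about how that could work; no such tool was described at the meeting, and the IDs and sequences are placeholders.

```python
import hashlib


def sequence_checksum(sequence):
    """Checksum of a sequence, ignoring whitespace and letter case."""
    normalized = "".join(sequence.split()).upper()
    return hashlib.sha256(normalized.encode("ascii")).hexdigest()


def changed_ids(old_release, new_release):
    """IDs from the old release whose sequence changed or is now missing."""
    flagged = []
    for seq_id, old_seq in old_release.items():
        new_seq = new_release.get(seq_id)
        if new_seq is None or sequence_checksum(new_seq) != sequence_checksum(old_seq):
            flagged.append(seq_id)
    return sorted(flagged)


# Hypothetical releases: P2 was extended and P3 was withdrawn.
old = {"P1": "MKT", "P2": "MAA", "P3": "MLL"}
new = {"P1": "mkt", "P2": "MAAV"}
to_review = changed_ids(old, new)
```

The flagged IDs could then be matched against the IDs used in the ‘WITH/FROM’ column to decide which databases to notify.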

Issues about annotating splice variants

1. For most genes, we are probably not even aware which splice variants exist or are expressed; even when they are known, many papers do not specify which splice variant they use, and many assays do not allow the distinction to be made. UniProt has a “generic” version of each protein for which there is a splice variant (i.e., the complete gene product? or something else more theoretical?), so one can annotate to the generic form when the information is not available. Suzi emphasizes that it’s important to be able to document this and do it to the finest level we can.

2. We are unsure how that affects users that view annotations; using the UniProt IDs it is possible to merge all information to the ‘generic gene’, but we are unsure of whether or not this is true/easy/obvious depending on whether you look at the UniProt page, the QuickGO page, the gene association file. For example, can this ever be computed as annotations to two different genes?
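
To make the merging question concrete: UniProt isoform identifiers take the form of the parent accession followed by a hyphen and a number (for example P12345-2), so annotations to isoforms can in principle be rolled up to the parent entry. The sketch below shows that roll-up; it says nothing about what the UniProt pages, QuickGO or the gene association file actually do, and the accession and annotations are illustrative only.

```python
import re
from collections import defaultdict

# An isoform identifier is the parent accession plus "-<number>".
ISOFORM_SUFFIX = re.compile(r"-\d+$")


def parent_accession(accession):
    """Strip the isoform suffix, if any, from a UniProt-style identifier."""
    return ISOFORM_SUFFIX.sub("", accession)


def merge_by_parent(annotations):
    """Group (accession, go_term) annotations under the parent accession."""
    merged = defaultdict(set)
    for accession, go_term in annotations:
        merged[parent_accession(accession)].add(go_term)
    return {acc: sorted(terms) for acc, terms in merged.items()}


# Illustrative annotations to two isoforms and to the generic entry.
annotations = [
    ("P12345-1", "GO:0005634"),
    ("P12345-2", "GO:0005737"),
    ("P12345", "GO:0005634"),
]
merged = merge_by_parent(annotations)
```

A viewer that skips this step would show the two isoforms as separate annotated objects, which is the situation the question above is worried about.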

[ACTION ITEM]: (developers/software group): consider the potential impact of annotating to different forms of the gene.
