Reference Genome Meeting Minutes April 2008
April 20, 2008
Annotation Progress (Mike Cherry)
- Number of annotated genes per organism by evidence type (overall)
- Compare graphs for Sept 2007 and Apr 2008: overall size the same, but IEA decreasing
Discussion: What is effort/person? X-axis is absolute number of genes, which doesn't reflect differences in genome size.
- Number of annotated genes per organism by evidence code for Reference Genome project
- majority of genes have experimental evidence codes
- Discussion:
- Graph needs an outline that indicates "no ortholog". This allows a comparison of the genes present or absent in the reference genomes. It will also show which organisms are lagging behind.
- Number of annotations as a metric?
Annotation Progress (Chris Mungall)
Review Annotation Pipeline proposal (Suzi Lewis)
Step 1: Generation of protein sets (excluding functional RNAs)
- How to define a coherent set? For experimental annotations, want to annotate to isoforms. But for tree building, want the longest protein produced from a gene. So for ortholog sets, want a unique protein/gene ID for the "canonical" gene/protein.
- Heterogeneity in column 2 of Gene Association files (see Annotation of alternate spliceforms). What is current practice?
- How does UniProt deal with alternate splice forms? Most of the time, there is a 1:1 correspondence between the canonical protein ID and the gene. UniProt uses the canonical identifier followed by -1, -2, etc. to indicate isoforms. But sometimes isoforms are so different that they are given separate accessions. In that case, what connects them? Have to link out to a genomic database.
- WormBase uses a mixture of gene and protein IDs; which one is used depends upon how the experiments were done. Is this a problem? Goal would be to converge on one type.
- MGI uses canonical MGI IDs in column 2.
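The UniProt convention noted above (canonical identifier followed by -1, -2, etc.) can be illustrated with a minimal sketch. The function name `split_isoform_id`, the regular expression, and the example accession are assumptions made for illustration only; isoforms that were given separate accessions cannot be connected this way and would still need a link out to a genomic database.

```python
import re

# Assumed pattern: a base accession, optionally followed by "-<number>"
# marking an isoform. This is an illustrative simplification, not a
# validator for real accession formats.
_ISOFORM_RE = re.compile(r"^(?P<canonical>[A-Z0-9]+)(?:-(?P<isoform>\d+))?$")


def split_isoform_id(accession):
    """Split an ID into (canonical accession, isoform number or None)."""
    match = _ISOFORM_RE.match(accession)
    if match is None:
        raise ValueError(f"unrecognized accession format: {accession!r}")
    isoform = match.group("isoform")
    return match.group("canonical"), (int(isoform) if isoform is not None else None)


# Placeholder accession, used only to show the shape of the result.
print(split_isoform_id("P12345-2"))  # ('P12345', 2)
print(split_isoform_id("P12345"))    # ('P12345', None)
```

Grouping annotations by the first element of the returned pair would collect all suffixed isoforms under their canonical ID.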
Chris's Proposal: Use canonical ID in column 2. Add additional column for isoforms; put multiple isoform IDs on one line.
- Column 2: Use canonical gene ID. Gene Index.
- Column 17: ID for the thing that was annotated (protein/gene/transcript). Must match column 12 (SO type).
Discussion:
Add a column that is always for a gene. A gene is a "concept", it's a lumping term that reflects biological reality. Provides the link we want.
Rex: what is needed is
- Column 2: ID for the thing that was annotated (ideally the gene product).
- Column 12: keep as it is, because it refers to column 2.
- Add Column 17: Canonical ID for the gene that codes for the product that was annotated.
Have to look at how any change will affect our users.
What do users expect to be in column 2? They expect the canonical ID, but that isn't always the case.
Most groups in favor of making column 2 the canonical ID.
What should column 12 refer to? Still column 2? Does it have to refer to what is annotated?
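The two column layouts under discussion can be contrasted with a small sketch. It models only columns 2, 12 and 17; the names `AnnotationRow`, `proposal_chris` and `proposal_rex` and the example IDs are hypothetical, and this is one reading of the notes above, not an agreed file format.

```python
from typing import NamedTuple


class AnnotationRow(NamedTuple):
    """Only the three columns discussed in the minutes (illustrative)."""
    col2: str   # ID column whose meaning is under debate
    col12: str  # type of the annotated object (SO type)
    col17: str  # proposed additional column


def proposal_chris(gene_id, annotated_id, so_type):
    # Column 2 holds the canonical gene ID; column 17 holds the thing
    # that was annotated, so column 12 must describe column 17.
    return AnnotationRow(col2=gene_id, col12=so_type, col17=annotated_id)


def proposal_rex(gene_id, annotated_id, so_type):
    # Column 2 holds the thing that was annotated, so column 12 keeps
    # referring to column 2; column 17 holds the canonical gene ID.
    return AnnotationRow(col2=annotated_id, col12=so_type, col17=gene_id)


# Hypothetical IDs, chosen only to make the difference visible.
print(proposal_chris("GENE:0001", "PROT:0001-2", "protein"))
print(proposal_rex("GENE:0001", "PROT:0001-2", "protein"))
```

Both layouts carry the same information. They differ in which column existing users would find the canonical ID in, which is why the effect on users matters for the choice.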
- Notifying users
Before change is implemented, should it be discussed with a few users?
Need a pushout list to notify users of changes/updates.
Write up Proposal A, Proposal B and ask for public comment.