Reference Genome Meeting Minutes April 2008

From GO Wiki
Revision as of 10:22, 20 April 2008 by Siegele (talk | contribs) (Step 1: Generation of protein sets (excluding functional RNAs))


April 20, 2008

Annotation Progress (Mike Cherry)

  • Number of annotated genes per organism by evidence type (overall)
    • Compare graphs for Sept 2007 and Apr 2008: overall size is the same, but IEA is decreasing

Discussion: What is effort/person? X-axis is absolute number of genes, which doesn't reflect differences in genome size.
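One way to address the genome-size concern raised here would be to plot the fraction of genes annotated rather than the absolute count. The following is only a minimal sketch: the organism names and the counts are hypothetical placeholders, not project data, and `annotated_fraction` is an illustrative helper.

```python
def annotated_fraction(annotated_genes: int, total_genes: int) -> float:
    """Return the share of a genome's genes that carry at least one annotation."""
    if total_genes <= 0:
        raise ValueError("total_genes must be positive")
    return annotated_genes / total_genes


# Hypothetical (annotated, total) gene counts; placeholders, not real statistics.
counts = {
    "organism_a": (4000, 5000),
    "organism_b": (6000, 20000),
}

# organism_b has more annotated genes in absolute terms,
# but a smaller fraction of its genome is covered.
fractions = {org: annotated_fraction(a, t) for org, (a, t) in counts.items()}
```

Normalizing this way makes small and large genomes comparable on one axis, though it still says nothing about effort per person.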

  • Number of annotated genes per organism by evidence code for Reference Genome project
    • majority of genes have experimental evidence codes
  • Discussion:
    • Graph needs an outline that indicates "no ortholog". This allows a comparison of the genes present or absent in the genomes of the Reference Genome project. It will also show which organisms are lagging behind.
    • Number of annotations as a metric?

Annotation Progress (Chris Mungall)

Review Annotation Pipeline proposal (Suzi Lewis)

Step 1: Generation of protein sets (excluding functional RNAs)

    • How to define a coherent set? For experimental annotations, we want to annotate to isoforms. For tree building, however, we want the longest protein produced from a gene. So the ortholog sets need a unique protein/gene ID for the "canonical" gene/protein.
      • There is heterogeneity in column 2 of the Gene Association files (see Annotation of alternate spliceforms). What is current practice?
        • How does UniProt deal with alternate splice forms? Most of the time, there is a 1:1 correspondence between the canonical protein ID and the gene. UniProt uses the canonical identifier followed by -1, -2, etc. to indicate isoforms. But sometimes isoforms are so different that they are given separate accessions. In that case, what connects them? One has to link out to a genomic database.
        • WormBase uses a mixture of gene and protein IDs; which one is used depends upon how the experiments were done. Is this a problem? The goal would be to converge on one type.
        • MGI uses canonical MGI IDs in column 2.
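The UniProt convention described above (canonical identifier followed by -1, -2, etc.) can be sketched as a small parser that recovers the canonical accession from an isoform ID. This is an illustration only: `split_isoform_id` is a hypothetical helper, the accession used in the usage note is a placeholder, and the sketch does not cover the case noted above where isoforms have entirely separate accessions.

```python
import re
from typing import Optional, Tuple

# Canonical accession, a hyphen, then a numeric isoform suffix.
_ISOFORM_ID = re.compile(r"^(?P<canonical>[A-Z0-9]+)-(?P<isoform>\d+)$")


def split_isoform_id(accession: str) -> Tuple[str, Optional[int]]:
    """Split an ID into (canonical accession, isoform number or None)."""
    match = _ISOFORM_ID.match(accession)
    if match is None:
        # No "-N" suffix: treat the ID itself as the canonical accession.
        return accession, None
    return match.group("canonical"), int(match.group("isoform"))
```

For example, `split_isoform_id("P12345-2")` yields the canonical part `"P12345"` and isoform number `2`, while an ID without a suffix is returned unchanged with `None`.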

Chris's Proposal: Use canonical ID in column 2. Add an additional column for isoforms; put multiple isoform IDs on one line.

    Column 2:  Use canonical gene ID. Gene Index.
    Column 17: ID for the thing that was annotated (protein/gene/transcript). Must match column 12 (SO type).


Add a column that is always for a gene. A gene is a "concept"; it is a lumping term that reflects biological reality. It provides the link we want.

Rex: what is needed is

    Column 2:  ID for thing that was annotated.  (ideally would be the gene product)
    Column 12: keep as it is, because it refers to column 2
    Add Column 17: Canonical ID for the gene that codes for the product that was annotated.    
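To make the difference between the two proposals concrete, the sketch below lays out the same annotation under each one. It models only columns 2, 12 and 17; the `Row` class, the function names, and the IDs in the usage example are hypothetical and are not part of the Gene Association file specification.

```python
from dataclasses import dataclass


@dataclass
class Row:
    col2: str   # object ID
    col12: str  # type of the annotated object
    col17: str  # proposed new column


def chris_layout(gene_id: str, annotated_id: str, annotated_type: str) -> Row:
    # Column 2 holds the canonical gene ID; column 17 holds the thing
    # that was annotated, and column 12 gives that thing's type.
    return Row(col2=gene_id, col12=annotated_type, col17=annotated_id)


def rex_layout(gene_id: str, annotated_id: str, annotated_type: str) -> Row:
    # Column 2 holds the thing that was annotated (column 12 still
    # refers to it); column 17 holds the canonical gene ID.
    return Row(col2=annotated_id, col12=annotated_type, col17=gene_id)


# Hypothetical IDs, for illustration only.
a = chris_layout("GENE:0001", "PROT:0001-2", "protein")
b = rex_layout("GENE:0001", "PROT:0001-2", "protein")
```

Both layouts carry the same information; they differ in which column a user reading only column 2 will see, which is why the effect on existing users matters.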

Have to look at how any change will affect our users.

    What do users expect to be in column 2? They expect the canonical ID, but that isn't always the case.

Most groups in favor of making column 2 the canonical ID.

What should column 12 refer to? Still column 2? Does it have to refer to what is annotated?

        • Notifying users

Before change is implemented, should it be discussed with a few users?

Need a pushout list to notify users of changes/updates.

Write up Proposal A, Proposal B and ask for public comment.

Step 2: Experimental Annotation

Step 3: Inferential Annotation

Step 4: Quality Checks