Reference Genome Meeting Minutes April 2008

From GO Wiki

April 20, 2008

Annotation Progress (Mike Cherry)

  • Number of annotated genes per organism by evidence type (overall)
    • Compare graphs for Sept 2007 and Apr 2008: overall size is about the same, but IEA is decreasing

Discussion: What is effort/person? X-axis is absolute number of genes, which doesn't reflect differences in genome size.

  • Number of annotated genes per organism by evidence code for Reference Genome project
    • majority of genes have experimental evidence codes
  • Discussion:
    • Graph needs an outline that indicates "no ortholog". This allows a comparison of the genes present or absent in the reference genomes. It will also show which organisms are lagging behind.
    • Number of annotations as a metric?

Annotation Progress (Chris Mungall)

Review Annotation Pipeline proposal (Suzi Lewis)

Step 1: Generation of protein sets (excluding functional RNAs)

    • How to define a coherent set? For experimental annotations, we want to annotate to isoforms, but for tree building we want the longest protein produced from a gene. So for ortholog sets we want a unique protein/gene ID for the "canonical" gene/protein.
      • Heterogeneity in column 2 of Gene Association files (see Annotation of alternate spliceforms). What is current practice?
        • How does UniProt deal with alternate splice forms? Most of the time, there is a 1:1 correspondence between the canonical protein ID and the gene. UniProt uses the canonical identifier followed by -1, -2, etc. to indicate isoforms. But sometimes isoforms are so different that they are given separate accessions. In that case, what connects them? One has to link out to a genomic database.
        • WormBase uses a mixture of gene and protein IDs. (Which is used depends upon how the experiments were done.)
        • MGI uses MGI IDs in column 2.
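
The two conventions discussed above (isoform IDs written as a canonical accession plus a numeric suffix, and the longest protein per gene as the representative for tree building) can be sketched in Python. This is a minimal illustration, not an agreed pipeline step: the function names `split_isoform_id` and `pick_canonical`, the accessions, and the lengths are all hypothetical, and the ID pattern is an assumption based only on the "-1, -2" convention mentioned in the discussion.

```python
import re

# Assumed pattern: an accession, optionally followed by "-<isoform number>".
_ISOFORM_RE = re.compile(r"^(?P<acc>[A-Za-z0-9]+)(?:-(?P<num>\d+))?$")


def split_isoform_id(protein_id):
    """Split an ID like 'P12345-2' into ('P12345', 2); no suffix gives None."""
    match = _ISOFORM_RE.match(protein_id)
    if match is None:
        raise ValueError(f"unrecognized protein ID: {protein_id!r}")
    num = match.group("num")
    return match.group("acc"), (int(num) if num is not None else None)


def pick_canonical(lengths):
    """Map each canonical accession to the ID of its longest isoform.

    `lengths` maps protein IDs to sequence lengths. IDs are visited in
    sorted order and only a strictly longer isoform replaces the current
    choice, so ties are resolved deterministically.
    """
    best = {}
    for protein_id in sorted(lengths):
        acc, _ = split_isoform_id(protein_id)
        if acc not in best or lengths[protein_id] > lengths[best[acc]]:
            best[acc] = protein_id
    return best


# Hypothetical accessions and lengths, for illustration only.
example = {"P00001-1": 410, "P00001-2": 388, "P00002": 120}
print(pick_canonical(example))
```

Isoforms that have been given separate accessions are not grouped by this approach; as noted in the discussion, connecting them requires a link out to a genomic database.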

Proposal: Use canonical ID in column 2 (could be gene, protein, transcript). Add additional column for isoforms; put multiple isoform IDs on one line.
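
As a rough sketch of how the proposal might look to a parser, the snippet below reads one tab-separated annotation line with the canonical ID in column 2 and an extra column holding the isoform IDs. The position of the extra column (appended as the last column), the `|` separator, the function name `parse_line`, and every identifier in the sample line are assumptions made for illustration; the minutes do not specify them.

```python
def parse_line(line, isoform_col=-1, sep="|"):
    """Return (canonical_id, [isoform IDs]) from a tab-separated line.

    Assumes column 2 holds the canonical ID and that the isoform column,
    when non-empty, lists IDs joined by `sep`.
    """
    cols = line.rstrip("\n").split("\t")
    canonical_id = cols[1]
    raw = cols[isoform_col].strip()
    isoforms = [x for x in raw.split(sep) if x] if raw else []
    return canonical_id, isoforms


# Hypothetical, shortened line: database, canonical ID, symbol, GO ID, isoforms.
sample = "DB\tGENE0001\tabc1\tGO:0000000\tGENE0001-1|GENE0001-2"
print(parse_line(sample))
```

Keeping the canonical ID in column 2 means existing consumers that group annotations by that column would continue to work, while tools that care about isoforms can read the extra column.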


Step 2: Experimental Annotation

Step 3: Inferential Annotation

Step 4: Quality Checks