Reference Genome Meeting Minutes April 2008: Difference between revisions

Revision as of 12:41, 20 April 2008

April 20, 2008

Annotation Progress (Mike Cherry)

  • Number of annotated genes per organism by evidence type (overall)
    • Compare graphs for Sept 2007 and Apr 2008: overall size is about the same, but IEA is decreasing

Discussion: What is effort/person? X-axis is absolute number of genes, which doesn't reflect differences in genome size.

  • Number of annotated genes per organism by evidence code for Reference Genome project
    • majority of genes have experimental evidence codes
  • Discussion:
    • Graph needs an outline that indicates "no ortholog". This allows a comparison of which genes are present or absent across the reference genomes. It will also show which organisms are lagging behind.
    • Number of annotations as a metric?

Annotation Progress (Chris Mungall)

Review Annotation Pipeline proposal (Suzi Lewis)

Step 1: Generation of protein sets (excluding functional RNAs)

    • How to define a coherent set: For experimental annotations, want to annotate to isoforms. But for tree building, want the longest protein produced from a gene. So for ortholog sets, want a unique protein/gene ID for the "canonical" gene/protein.
      • Heterogeneity in column 2 of Gene Association files (see Annotation of alternate spliceforms). What is current practice?
        • How does UniProt deal with alternate splice forms? Most of the time, there is a 1:1 correspondence between the canonical protein ID and the gene. UniProt uses the canonical identifier followed by -1, -2, etc. to indicate isoforms. But sometimes isoforms are so different that they are given separate accessions. In that case, what connects them? Have to link out to a genomic database.
        • WormBase uses a mixture of gene and protein IDs. (Which is used depends upon how the experiments were done.)
        • MGI uses MGI IDs in column 2.
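
The "-1, -2" suffix convention noted above for UniProt could be handled roughly as follows. This is only a sketch: the function name `split_isoform_id` and the accession strings are illustrative, and the pattern assumes a plain alphanumeric accession with an optional numeric suffix. It does not cover isoforms that were given separate accessions.

```python
import re

# Assumed shape: an alphanumeric accession, optionally followed by "-N"
# where N is the isoform number (illustrative pattern, not an official one).
_ISOFORM_RE = re.compile(r"^(?P<canonical>[A-Za-z0-9]+)(?:-(?P<isoform>\d+))?$")


def split_isoform_id(identifier):
    """Split an identifier into (canonical accession, isoform number or None)."""
    match = _ISOFORM_RE.match(identifier)
    if match is None:
        raise ValueError(f"unrecognized identifier: {identifier!r}")
    isoform = match.group("isoform")
    return match.group("canonical"), (int(isoform) if isoform is not None else None)


# Illustrative accession-style strings, used only to show the shape of the result.
print(split_isoform_id("P12345-2"))
print(split_isoform_id("P12345"))
```

Grouping annotations by the first element of the returned pair would collect all suffixed isoforms under their canonical identifier.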

Proposal: Use canonical ID in column 2 (could be gene, protein, transcript). Add additional column for isoforms; put multiple isoform IDs on one line.
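
One way to picture the proposal is a reduced, tab-separated row with the canonical ID in column 2 and the isoform IDs in one extra column. This is a sketch under assumptions: the three-column layout, the `|` separator, and the names `make_row` and `parse_row` are hypothetical and were not decided at the meeting.

```python
def make_row(db, canonical_id, isoform_ids, separator="|"):
    """Build a reduced row: database, canonical ID (column 2), isoform column.

    The isoform column and its separator are assumptions for illustration.
    """
    return "\t".join([db, canonical_id, separator.join(isoform_ids)])


def parse_row(line, separator="|"):
    """Parse a row produced by make_row back into its three parts."""
    db, canonical_id, isoforms = line.rstrip("\n").split("\t")
    # An empty isoform column means no isoform-specific annotation.
    return db, canonical_id, [i for i in isoforms.split(separator) if i]


# Illustrative identifiers only.
row = make_row("UniProtKB", "P12345", ["P12345-1", "P12345-2"])
print(parse_row(row))
```

Keeping all isoform IDs on one line means column 2 stays unique per gene product, which is what the ortholog sets in Step 1 need.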


Step 2: Experimental Annotation

Step 3: Inferential Annotation

Step 4: Quality Checks