Reference Genome Meeting Minutes April 2008

From GO Wiki
Revision as of 14:51, 20 April 2008 by Siegele (talk | contribs) (Using Textpressor (Kimberly Van Auken))

April 20, 2008

Annotation Progress

Annotation Progress (Mike Cherry)

  • Number of annotated genes per organism by evidence type (overall)
    • Compare graphs for Sept 2007 and Apr 2008 - overall size about the same, but IEA decreasing

Discussion: What is effort/person? X-axis is absolute number of genes, which doesn't reflect differences in genome size.

  • Number of annotated genes per organism by evidence code for Reference Genome project
    • majority of genes have experimental evidence codes
  • Discussion:
    • Graph needs outline that indicates "no ortholog". This allows a comparison of the genes present or absent in the reference genome genomes. It will also show which organisms are lagging behind.
    • Number of annotations as a metric? would give a different view of the progress, but too variable b/c of differences in depth of knowledge in different organisms, different areas of the ontology.
    • View progress between Sept 2007 and April 2008 as a % change. Can see that everyone has doubled experimental annotations, although it doesn't show the starting number of annotations.
    • Need to discuss which metrics we want to track and why. Need consistent measures across groups.
    • How annotations change over time lets you see whether groups are still engaged in the process.
    • Would be useful to have a display that shows how much is known about these genes. Some of this information will come from Chris's reports.
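
The point about % change hiding the starting number can be illustrated with a minimal sketch. The organism names and counts below are made up for illustration and are not project data.

```python
def percent_change(before, after):
    """Return the percentage change from `before` to `after`."""
    if before == 0:
        raise ValueError("percent change is undefined when the starting count is 0")
    return (after - before) / before * 100.0

# Hypothetical counts of experimental annotations (Sept 2007, Apr 2008).
counts = {
    "organism_a": (100, 200),
    "organism_b": (2000, 4000),
}

for organism, (start, end) in counts.items():
    change = percent_change(start, end)
    # Reporting the starting count next to the % change keeps the scale visible.
    print(f"{organism}: start={start} end={end} change={change:.0f}%")
```

Both hypothetical organisms show the same doubling even though their absolute numbers differ twentyfold, which is why a report may want to show the starting count alongside the percentage.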

Annotation Progress (Chris Mungall)

  • Metrics:
    • distance to leaf (shows average number for all genes)
      • didn't change between Jan 2006 to Sept 2007
      • consider breaking down by the 3 ontologies, also show % of length to leaf
    • information content
      • a quality control measure
    • coverage (# of nodes covered per gene)
      • as you look at a gene in more detail it will have more coverage
      • can there be too much coverage?
    • publications per gene
    • GO terms per gene
  • General Question: what is appropriate range for each category? need a sense of the scale, perhaps express as a %
  • Reference Genome Reports
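
As a rough illustration of three of the metrics listed above, here is a minimal sketch on a toy ontology. The ontology, gene names and annotations are hypothetical, and the definitions used (shortest path to a leaf, number of distinct nodes covered including ancestors, and -log2 of annotation frequency) are assumptions about what the reports compute, not a description of Chris's implementation.

```python
import math

# Toy ontology (hypothetical): term -> list of parent terms.
PARENTS = {
    "root": [],
    "a": ["root"],
    "b": ["root"],
    "a1": ["a"],
    "a2": ["a"],
    "a1x": ["a1"],
}

# Invert the parent map so each term knows its children.
CHILDREN = {term: [] for term in PARENTS}
for child, parents in PARENTS.items():
    for parent in parents:
        CHILDREN[parent].append(child)

def distance_to_leaf(term):
    """Smallest number of edges from `term` down to any leaf term."""
    kids = CHILDREN[term]
    if not kids:
        return 0
    return 1 + min(distance_to_leaf(k) for k in kids)

def ancestors(term):
    """All ancestors of `term`, including the term itself."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def coverage(terms):
    """Number of distinct ontology nodes covered by one gene's annotations."""
    covered = set()
    for term in terms:
        covered |= ancestors(term)
    return len(covered)

def information_content(term, annotations):
    """-log2 of the fraction of genes annotated to `term` or a descendant."""
    hits = sum(
        1 for terms in annotations.values()
        if any(term in ancestors(t) for t in terms)
    )
    if hits == 0:
        raise ValueError(f"no gene is annotated to {term!r}")
    return -math.log2(hits / len(annotations))

# Hypothetical gene -> annotated terms.
ANNOTATIONS = {
    "gene1": ["a1x"],
    "gene2": ["a"],
    "gene3": ["b"],
    "gene4": ["a2", "b"],
}

for gene, terms in ANNOTATIONS.items():
    depths = [distance_to_leaf(t) for t in terms]
    print(gene, "distance to leaf:", depths, "coverage:", coverage(terms))
```

Under these assumed definitions, a rarely used specific term scores a high information content and a root term scores zero, which is one way to get the sense of scale asked for in the general question.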

Annotation Progress: Discussion of other ideas for measuring progress

  • Measure that shows progress made in curating the experimental literature for reference genes in reference genomes. This is an aim of the grant. Can determine number of publications annotated.
  • Measure of time spent (% effort) actually doing experimental annotations. Disagreements: Can't do curation w/o ontology development and vice versa. Worried about trying to parse out too much. How do you separate annotation from time spent considering how you do annotations or assessing quality of annotations?
  • Measure of the number of genes that have been comprehensively annotated.

Annotation Pipeline, Part 1 (Suzi Lewis)

Step 1: Generation of protein sets

Step 2: Experimental Annotation

Step 3: Inferential Annotation

Step 4: Quality Checks

Step 1: Generation of protein sets (excluding functional RNAs)

    • How to define a coherent set? For experimental annotations, want to annotate to isoforms. But for tree building, want the longest protein produced from a gene. So for ortho sets, want a unique protein/gene ID for the "canonical" gene/protein.
      • Currently there is heterogeneity in column 2 of Gene Association files (see Annotation of alternate spliceforms)
        • How does UniProt deal with alternate splice forms? Most of the time, there is a 1:1 correspondence between the canonical protein ID and the gene. UniProt uses the canonical identifier followed by -1, -2, etc. to indicate isoforms. But sometimes isoforms are so different that they are given separate accessions. In that case, what connects them? Have to link out to a genomic database.
        • WormBase uses a mixture of gene and protein IDs in column 2. Which is used depends upon how the experiments were done. Is this a problem? Goal would be converge on one type.
        • MGI uses canonical MGI IDs in column 2.
      • Chris's Proposal: Use canonical ID in column 2. Add additional column for isoforms; put multiple isoform IDs on one line.

Column 2: Use canonical gene ID. Gene Index Column 17: ID for the thing that was annotated (protein/gene/transcript). Must match column 12 (SO type).

        • Discussion:

Add a column that is always for a gene. A gene is a "concept", it's a lumping term that reflects biological reality. It provides the link we want.

Rex's proposal:

    Column 2:  Keep as is, the ID for thing that was annotated.  (ideally would be the gene product)
    Column 12: Keep as it is, because it refers to column 2
    Add Column 17: Canonical ID for the gene that codes for the product that was annotated.    

Have to look at how any change will affect our users.

    What do users expect to be in column 2?  they expect canonical ID, but it isn't always the case. 

Most groups in favor of the proposal of making column 2 the canonical ID.

        • What should column 12 refer to?

Point to 17, which means that column 17 must be filled in; it can't be left blank and inferred from column 2.

        • Notifying users

Before change is implemented, should it be discussed with a few users?

Need a pushout list to identify users of changes/updates.

      • Still need gene to protein associations
    right now it is a free-floating column 18

gene association file should be gene association file

gp2protein file should be separate

Proposal: The header of gene association file should state this file contains annotations for x out of total number of genes estimated in this organism.

gp2protein file : For every canonical gene ID there will be an associated canonical protein ID.

What about those cases where a gene has been annotated, but there is no known protein sequence associated with it? Leave blank, or explicitly state "uncloned"?

State that no protein has been identified for a gene that was identified. Split out functional RNAs that have been identified.

gp2protein:

    123    AA sequence accession (UniProtKB:xxx or NCBI:xxx)
    456    RNA
    789    uncloned

Don't want to overload the file (putting non ID information in an ID column). If needed, should make a separate file or find other ways of dealing with the blanks. Can generate report that gives type from column 12.

If gp2protein file has only canonical protein IDs, how do you get information about other protein IDs (column 17)?

review: GAF column 2 is canonical gene ID

   column 17 is thing you are annotating (always required)
   column 12 matches column 17 and contains SO ID's
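
The reviewed column rules could be checked by software along the following lines. This is a minimal sketch under stated assumptions: the function names, the example database name and the IDs are hypothetical, the file is assumed to be tab-separated with at least 17 columns, and the final syntax was still to be provided by Mike and Chris.

```python
def check_proposed_gaf_line(line):
    """Return a list of problems found in one tab-separated annotation line."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 17:
        return [f"expected at least 17 columns, found {len(cols)}"]
    problems = []
    if not cols[1].strip():
        problems.append("column 2 (canonical gene ID) is empty")
    if not cols[16].strip():
        # Column 17 is always required; it is not inferred from column 2.
        problems.append("column 17 (ID of the annotated entity) is empty")
    if not cols[11].strip():
        problems.append("column 12 (type of the column 17 entity) is empty")
    return problems

def make_line(gene_id, entity_type, entity_id):
    """Build a 17-column example line; every value here is a placeholder."""
    cols = [""] * 17
    cols[0] = "EXAMPLEDB"
    cols[1] = gene_id
    cols[11] = entity_type
    cols[16] = entity_id
    return "\t".join(cols)

good = make_line("GENE:0001", "protein", "PROT:0001-2")
bad = make_line("GENE:0001", "protein", "")

print(check_proposed_gaf_line(good))
print(check_proposed_gaf_line(bad))
```

A check like this only covers presence of the three fields; confirming that the column 12 type really matches the column 17 entity would need a lookup against each data provider's own records.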

gp2protein file: 1) includes complete gene index (except for pseudogenes and transposons)

  column 1 is canonical gene ID
  column 2 is accession for sequence of longest form of protein from UniProtKB: or NCBI: 
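
A report on blanks, as suggested in the discussion, might look like the sketch below. The two-column tab-separated layout follows the review above; the treatment of lines starting with "!" as comments, the function names and the example IDs are assumptions for illustration.

```python
ALLOWED_PREFIXES = ("UniProtKB:", "NCBI:")

def parse_gp2protein(text):
    """Parse two-column, tab-separated text into {gene ID: accession}."""
    mapping = {}
    for raw in text.splitlines():
        # Assumption: blank lines and lines starting with "!" are skipped.
        if not raw.strip() or raw.startswith("!"):
            continue
        parts = raw.split("\t")
        gene_id = parts[0].strip()
        accession = parts[1].strip() if len(parts) > 1 else ""
        mapping[gene_id] = accession
    return mapping

def check_against_gene_index(mapping, gene_index):
    """Report genes that are missing, blank, or carry an unexpected prefix."""
    report = {"missing": [], "blank": [], "bad_prefix": []}
    for gene_id in gene_index:
        if gene_id not in mapping:
            report["missing"].append(gene_id)
        elif not mapping[gene_id]:
            report["blank"].append(gene_id)
        elif not mapping[gene_id].startswith(ALLOWED_PREFIXES):
            report["bad_prefix"].append(gene_id)
    return report

# Placeholder content; "xxx" stands in for real accessions.
EXAMPLE = "GENE:1\tUniProtKB:xxx\nGENE:2\tNCBI:xxx\nGENE:3\t\n"
GENE_INDEX = ["GENE:1", "GENE:2", "GENE:3", "GENE:4"]

report = check_against_gene_index(parse_gp2protein(EXAMPLE), GENE_INDEX)
print(report)
```

Keeping blanks as blanks and reporting them separately avoids putting non-ID text such as "uncloned" into an ID column.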

Action items:

1) update documentation

2) write notice of changes to users

3) individual data providers make sure that their input matches

4) software changes as necessary

5) add header to gene association file

6) syntax of file will be provided by Mike and Chris

Software Update

Demo of RefGenome tracker interface (Siddhartha Basu)

  • RefGenome tracker interface (database to replace current google spreadsheet)
   Browse: List Target, List Target (from db), List Ortholog
   Search: search by id, name, target, taxon
   Curation: add target, import spreadsheet (interim feature)
    • For programmers:
  1. add box for taxon id to the "add target" entry box
  2. add column with MOD id so curators can link to the MOD rather than NCBI
    • Time frame?

RefGenome Graphs (Mary Dolan)

  • Comparison matrix of GO terms across organism

    an entry indicates that an ortholog exists for this gene
    colored entries indicate experimental annotation
    parentheses indicate ISS annotation only
    "X" indicates no ortholog

  • Graph
  • Look at annotations directly
  • PPOD graphs
  • compare PPOD clusters with MOD calls
    • Suggestion: For ISS include the "with" information

AmiGO (Seth Carbon)

  • Summary table of genes in the RefGenomes List
    • star to indicate ISS only
  • graphical display
    • will have ability to pan and zoom display
    • mouse over to see more details
    • can view direct annotations for organism
    • can mark annotations that are only ISS annotations
  • Want feedback on summary table and visual graphical displays
    • Are there other types of visual displays that people need?

AmiGO will gradually be moving to this structure.

  • Cross Products

Example TAZ gene, various annotations related to heart in different organisms, but couldn't see connections in the graph. Working on version of graph display to show these connections.

Community Annotation at GONUTS (Jim Hu)

  • Create New Gene Page
  • Connection between AmiGO and GONUTS

webservice that will identify pages in GONUTS that have been annotated by a human being

    • Discussion

two possible types of input: small, individual annotations; bulk sets of predictions

could you use it for getting input on IEA annotations?

will increase input by making it easier for people to provide input

try connecting Cardio and Immunology pages to GONUTS

Annotation Pipeline, part 2 (Suzi Lewis and Judy Blake)

  • Is there consensus about the steps shown in File:Ref Genome annotation pipeline2008Mar31.pdf?
    • Need to discuss how the "focal-sets" will be determined
    • Step V: Changed to "Curators add/remove proteins to/from the "focal set" based on dialog and agreement"
    • What is the purpose of the focal set? defined at meeting in Princeton, to be able to say that these products have been experimentally annotated across the reference genomes
    • Rex: Agrees with procedure but wants to emphasize that it can't be written in stone. There has to be an option for future discussion of changes in the procedure. Suzi: Yes, in light of further knowledge the procedure may change, and people involved have to be open-minded and leave behind their preconceived notions.
    • Rex: Want to capture both depth and breadth (annotations to as many genes as possible based on exptl. annotations and also ISS).
    • Judy: Inferential annotations. how do you transform experimental annotations in one organism to inferential annotations in another organism? what measures are useful? what about large family sets?
    • Suzi: QC is something that happens during the entire process, not just at the end of the process. Will be useful to think about QC at each of these steps.

Consistency within experimental annotations (Pascale)

   * Tricky terms Misused_terms
   * Outliers
   * Val (co-occurrences of annotations) 
  • Annotation errors
    • transient localization vs long-term cellular localization
    • secreted protein annotated to secretory process
    • IMP evidence code used for results from high-throughput experiments

David: Once a focal set is annotated, send one of Mary's graphs to someone who has published a lot of papers on that gene/protein and ask them if there is anything wrong.

Fixing errors: should there be a systematic effort to review earlier annotations?

Develop SOPs to prevent future errors.

Are there automatic checks that can be done to identify anomalies to be reviewed?

Pop-ups to warn curator?

At end of day, look at totality of annotations in other organisms and then review the annotation of your gene.

[ACTION ITEM] QuickGO co-occurring terms, maybe this should be in AmiGO?

How do we continue to study the process in order to improve it?

Judy: Have a list of misused terms and comments made about them. Where can this information be put to make it more visible?

Page needs to provide more information about how the term was misused.

[moving towards an ACTION ITEM] Is review by a human the only way to check the usage of frequently misused terms?

growth, cell growth, cell cycle, cell proliferation frequently used interchangeably

  • Part terms.
    • Need to use them, would be helpful to have SOPs.
  • Possible way to identify outliers (Val)

SLIM by SLIM matrix to review intersections of different cellular processes and look for unexpected intersections which may identify possible errors

try applying to function and component terms

outline cells that you expect to be empty

[action item] generated automatically from AmiGO database rather than each refgenome group doing it themselves
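
The SLIM by SLIM matrix idea could be prototyped roughly as follows. This is a minimal sketch, not the AmiGO implementation: the gene names, slim terms and the choice of which cell is "expected to be empty" are all hypothetical.

```python
from collections import Counter
from itertools import combinations

def slim_comatrix(gene_to_slims):
    """Count, for each pair of slim terms, how many genes carry both."""
    counts = Counter()
    for slims in gene_to_slims.values():
        for pair in combinations(sorted(set(slims)), 2):
            counts[pair] += 1
    return counts

def unexpected_cells(counts, expected_empty):
    """Return non-empty cells that curators outlined as expected to be empty."""
    expected = {tuple(sorted(pair)) for pair in expected_empty}
    return {pair: n for pair, n in counts.items() if pair in expected}

# Hypothetical gene -> slim term assignments.
GENE_TO_SLIMS = {
    "gene1": {"translation", "ribosome biogenesis"},
    "gene2": {"translation", "DNA replication"},
    "gene3": {"translation", "ribosome biogenesis"},
}

# Hypothetical cell that a curator outlined as expected to be empty.
EXPECTED_EMPTY = [("translation", "DNA replication")]

counts = slim_comatrix(GENE_TO_SLIMS)
flagged = unexpected_cells(counts, EXPECTED_EMPTY)
print(flagged)
```

A flagged cell is only a prompt for review; the intersection may reflect real biology rather than an annotation error.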

spot checks have to be built into the process, need to build different ways of looking at quality control

suggestion: collect tricky terms, run reports, and email groups asking them to review their annotations. If annotations are correct, can drop term from the tricky list.

problem in ontology systematic problem xx?

can you find systematic errors by looking at co-occurrence of misused terms with particular paper?

David: often thought that if you look at the set of annotations from a particular paper, you can get more information about the gene

Amazon shopping cart model: 90% of annotators who used this term

people should be on lookout for misused terms and add to list along with explanation

   need way of notifying people that something was added

also need regular process for resolving these

a) software for generating comatrix b) buddy annotation c) categories of problems, not just endless list of problems d) regular assessment

adding comments in OBO-edit

  • additional areas where the groups aren't consistent and need to be discussed
  • discussions at annotation camps are very helpful

annotation consistency test: summary was that there was no consistency. Does this mean that the refgenome project is doomed? No, the set included people who had never annotated before and only a small number of people who worked on the organism.

Using Textpresso (Kimberly Van Auken)

Textpresso for GO annotation key features:

  • search through fulltext
  • in addition to keyword searches, have category searches based on groups of related words

WormBase uses Textpresso

  • get PDFs
  • convert to text
  • marked up by Textpresso

Textpresso for WormBase curation presentation. Example: looking for P granule annotations. Most papers described P granule mislocalization in mutants, so relevance markup was needed. Hired a student to go through papers and mark up localization based on antibody staining: 219 publications, 1400 sentences. Used a curation form that divides the text into sentences; the student checked yes or no for relevance. Computed word frequency histograms for single words and for phrases. Single words worked well and are more efficient than phrases. Created Textpresso categories:

  • Cellular component: Adherens junction, nuclei,
  • Verbs: localizes to, accumulates
  • Other: ... missed this.
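
The word frequency step described above can be sketched as follows. The sentences and protein names are invented toy examples, and the selection rule (a word must appear at least `min_count` times in relevant sentences and more often there than in non-relevant ones) is an assumption, not the procedure used at WormBase.

```python
import re
from collections import Counter

def word_counts(sentences):
    """Count lowercase alphabetic words across a list of sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(re.findall(r"[a-z]+", sentence.lower()))
    return counts

def enriched_words(relevant, irrelevant, min_count=2):
    """Words seen at least `min_count` times in relevant sentences and
    more often there than in the non-relevant sentences."""
    rel = word_counts(relevant)
    irr = word_counts(irrelevant)
    return sorted(w for w, n in rel.items() if n >= min_count and n > irr[w])

# Toy sentences standing in for the yes/no relevance judgements.
RELEVANT = [
    "PROT-1 localizes to P granules",
    "PROT-2 localizes to nuclei",
]
IRRELEVANT = [
    "P granules are mislocalized in the mutant",
]

candidates = enriched_words(RELEVANT, IRRELEVANT)
print(candidates)
```

The raw candidates still include noise such as "to" and the stem of the protein names, so a real category list would need stop-word filtering and curator review before use.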

Metrics: precision and recall. First generation categories 75% precision 40% recall. Could get 80% of the known annotations, thanks to info redundancy. Building second generation categories. Curation pipeline.
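
For reference, precision and recall can be computed as in the minimal sketch below. The two sets are made up and were sized only so that the ratios come out at 75% and 40%; they are not the data behind the figures reported above.

```python
def precision_recall(predicted, known):
    """Return (precision, recall) for a set of predictions against known items."""
    predicted, known = set(predicted), set(known)
    true_positives = len(predicted & known)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(known) if known else 0.0
    return precision, recall

# Made-up sentence IDs: 8 retrieved, 15 truly relevant, 6 in common.
retrieved = set(range(0, 8))
relevant = set(range(2, 17))

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.0%} recall={recall:.0%}")
```

Sentence-level recall can be well below annotation-level recall, because the same annotation is often supported by several sentences; that is consistent with the note on information redundancy.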

  • Keyword - Ce protein name
  • Look for match in 3 categories

Returns matching sentences in documents. Can browse sentences in context. Sentences get a score. Interface - 3 columns: Protein, Textpresso match terms, GO terms from relationship index.
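
The matching step can be pictured with the sketch below. It is not Textpresso's implementation: the matching is plain substring search, the score is simply the number of matched terms, and the protein name is a placeholder. Only the two categories whose example terms were recorded in these minutes are included; the third category was not captured and is left out.

```python
# Example terms taken from the category list in these minutes.
CATEGORIES = {
    "cellular_component": ["adherens junction", "nuclei"],
    "verb": ["localizes to", "accumulates"],
}

def match_sentence(sentence, protein, categories=CATEGORIES):
    """Return match details if the sentence names the protein and contains
    at least one term from every category; otherwise return None."""
    text = sentence.lower()
    if protein.lower() not in text:
        return None
    matched = {}
    for name, terms in categories.items():
        hits = [term for term in terms if term in text]
        if not hits:
            return None
        matched[name] = hits
    # Assumed scoring: one point per matched term.
    score = sum(len(hits) for hits in matched.values())
    return {"protein": protein, "matched": matched, "score": score}

hit = match_sentence("ABC-1 localizes to nuclei in early embryos.", "ABC-1")
miss = match_sentence("ABC-1 is expressed early.", "ABC-1")
print(hit)
print(miss)
```

Because `match_sentence` loops over whatever categories it is given, adding a third category only requires another entry in the dictionary.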

  • working on problem of how to distinguish new associations from those that have been reported before, e.g. commonly used markers
  • how much does this increase the efficiency of curation? can't answer right now because still testing
  • how does it affect your annotation? how do you know what you're missing?
  • current pipeline is just for cellular component terms, but think it will be amenable to function terms, haven't thought about application to process terms
  • Can customize category terms

April 21, 2008