Reference Genome progress report for 2009
- Paper has been accepted in PLoS Computational Biology (April 24)
- A note about the reference genome project was published in the GO news site: http://go.berkeleybop.org/news4go/node/27
- Pascale presented the Reference Genome project at the Biocurator meeting (Berlin, April 09), the Quest for Orthologs meeting (Hinxton, July 09), and the Dicty meeting (Estes Park, CO, September 09)
- We have changed how curation targets are chosen, from gene-by-gene selection to selection by 'topic'. The first group of targets is the set of genes involved in lung development, selected by the MGI group.
Number of families annotated
- ~ 750 families (424 PANTHER families)
- ~ 6,000 gene products
- constant at about 2000 per year
This survey was done in November 2009 to assess how each participating database handles GAF files and to see how the PAINT annotations would be integrated. Most groups can upload a GAF file into their database, but not all can display the data.
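As a rough illustration of what "handling a GAF file" involves, the sketch below reads one tab-delimited annotation line into named fields. The column names follow the GAF 2.0 layout (17 columns; GAF 1.0 lines carry only the first 15). The function name `parse_gaf_line` and every value in the sample line are made up for illustration and do not come from any participating database's loader.

```python
# Column order as in the GAF 2.0 format; the last two columns are absent in GAF 1.0.
GAF_COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "db_reference", "evidence_code", "with_from", "aspect",
    "db_object_name", "db_object_synonym", "db_object_type",
    "taxon", "date", "assigned_by",
    "annotation_extension", "gene_product_form_id",
]

def parse_gaf_line(line):
    """Return a dict for one annotation line, or None for comment/blank lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("!"):
        return None
    fields = line.split("\t")
    if len(fields) < 15:
        raise ValueError(f"expected at least 15 columns, got {len(fields)}")
    # Pad 15-column lines so both format versions yield the same keys.
    fields += [""] * (len(GAF_COLUMNS) - len(fields))
    record = dict(zip(GAF_COLUMNS, fields))
    # Qualifiers such as NOT are pipe-separated when more than one is present.
    record["qualifier"] = [q for q in record["qualifier"].split("|") if q]
    return record

# Hypothetical sample line (placeholder identifiers throughout).
sample = "\t".join([
    "EXAMPLEDB", "EX:0000001", "geneA", "NOT", "GO:0000001", "PMID:0000000",
    "IDA", "", "P", "", "", "protein", "taxon:9606", "20091101", "EXAMPLEDB",
])
rec = parse_gaf_line(sample)
print(rec["go_id"], rec["evidence_code"], rec["qualifier"])
```

A database that can "upload but not display" would typically get as far as a record like `rec` and then lack a place in its interface to show fields such as the qualifier or the evidence code.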
PAINT (Protein Annotation Inferencing Tool) is the curation tool used by the reference genome project for annotating by phylogenetic relationships. In 2009 a large part of the reference genome group's effort went into developing PAINT, which allows curators to visualize annotations and annotate multiple sequences in a single step.
- April 09:
- the node annotation panel layout was simplified. There are a couple of other speed-ups to the code as well.
- May 09:
- Added links to AmiGO to visualize the ontology easily.
- Clicking on a GO term highlights, on the tree, all proteins (or collapsed nodes) annotated to that term (or its descendants) in the clade
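The term highlighting described for May 09 can be pictured as a descendant lookup in the ontology followed by a filter over the annotated nodes. The sketch below is only an illustration under assumptions: the ontology is a plain child-to-parents dictionary, the term and gene names are placeholders, and none of it is taken from the PAINT source.

```python
from collections import defaultdict

def descendants(term, parents_of):
    """Return `term` plus every term below it in the ontology graph."""
    # Invert the child -> parents map so the graph can be walked downwards.
    children_of = defaultdict(set)
    for child, parents in parents_of.items():
        for parent in parents:
            children_of[parent].add(child)
    seen, stack = {term}, [term]
    while stack:
        for child in children_of[stack.pop()]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def nodes_to_highlight(term, parents_of, annotations):
    """Names of tree nodes annotated to `term` or to any of its descendants."""
    wanted = descendants(term, parents_of)
    return sorted(node for node, terms in annotations.items() if terms & wanted)

# Placeholder ontology and annotations (hypothetical identifiers).
parents_of = {
    "T:parent": {"T:root"},
    "T:other": {"T:root"},
    "T:child": {"T:parent"},
    "T:grandchild": {"T:child"},
}
annotations = {
    "geneA": {"T:grandchild"},
    "geneB": {"T:other"},
    "geneC": {"T:parent"},
}
highlighted = nodes_to_highlight("T:parent", parents_of, annotations)
print(highlighted)  # ['geneA', 'geneC']
```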
- June 09
- The main change is that the 'NOT' information is now sent from the server to the client. This information should now be populated in the annotation panel when you click a node that has been annotated with a 'NOT' qualifier.
- August 09
- Curators can now annotate to parents of terms that have annotations.
- PAINT now has flexible docking, proper GAF files, and a proper GO hierarchy
- PAINT no longer relies on a local GO database installation, thanks to speed improvements (the GO db connection now runs as separate threads in the background)
- Annotation by dragging terms from the term tree onto the gene info
- Highlighting of nodes that share annotations to terms
- Negation and qualifiers
- And more, including some known bugs.
- November 2009: PAINT update: Vbeta15
- GAF generated by PAINT is now compliant with all GO rules
- You can now enter and save your comments/evidence for the annotations. This is the feature Pascale, Paul and I discussed. It is nothing more than a simple text box, but cutting and pasting works so you can enter URLs and the things you used as background information.
- To make this work, the evidence has to be saved separately from the GAF, so while I was at it I added the ability to save a complete session (tree, gene data, MSA, GAF, and evidence). These are stored as a suite of files, with an additional XML file as an index to the individual files. This means that you can completely restore a session, although you'll still need a connection to get the GO term file and the most current GO annotations. But it does round-trip.
- The file menu now works as follows:
- "Open from database ..." just as it did before, loads the family from the PANTHER db and the GO annotations too
- "Open from files ... " restores tree, gene data, MSA, GAF, and evidence from the local file system
- "Save annotations ... " records tree, gene data, MSA, GAF, and evidence to the local file system
- "Restore annotations ... " is to be used in conjunction with the open from db option. It lets you first retrieve the tree from the PANTHER db and then overlay your existing locally saved GAF+evidence onto that tree.
- "Export ..." saves a stripped-down version of the GAF to the local file system. These are the files that can be delivered to the MODs.
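To make the "suite of files plus an XML index" idea concrete, here is a minimal round-trip sketch. It is an assumption-laden illustration, not PAINT's actual file format: the file names, the `index.xml` layout, the element and attribute names, and the functions `save_session` and `load_session` are all hypothetical.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

# The five session parts named in the release note.
PARTS = ("tree", "gene_data", "msa", "gaf", "evidence")

def save_session(directory, family_id, contents):
    """Write each part to its own file, plus an index.xml that lists them."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    root = ET.Element("session", family=family_id)
    for part in PARTS:
        filename = f"{family_id}.{part}.txt"  # hypothetical naming scheme
        (directory / filename).write_text(contents[part], encoding="utf-8")
        ET.SubElement(root, "file", part=part, name=filename)
    ET.ElementTree(root).write(str(directory / "index.xml"),
                               encoding="utf-8", xml_declaration=True)

def load_session(directory):
    """Read index.xml and return (family_id, {part: text})."""
    directory = Path(directory)
    root = ET.parse(str(directory / "index.xml")).getroot()
    contents = {
        entry.get("part"): (directory / entry.get("name")).read_text(encoding="utf-8")
        for entry in root.findall("file")
    }
    return root.get("family"), contents

# Round trip with placeholder contents.
session = {part: f"placeholder {part} contents\n" for part in PARTS}
with tempfile.TemporaryDirectory() as tmp:
    save_session(tmp, "FAMILY0001", session)
    family, restored = load_session(tmp)
print(family, restored == session)
```

Keeping the evidence notes in their own file, with the index tying the parts together, is what lets a stripped-down GAF be exported for the MODs without carrying the free-text comments along.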
- Appearance: the color scheme was getting far too overloaded, so I tried to simplify it. Here is how it now works:
- Shapes are used to indicate the state/type of node
- speciation nodes: circles (as before)
- duplication nodes: squares
- rerooted: triangle (as before)
- collapsed: vertical rectangle
- subfamily: diamond (as before)
- selection is now "pink", no stars, just lines.
- Colors are strictly reserved for "painting": as you add more annotations, the nodes become more colored (it starts out just black and white with a sprinkling of deep red)
- Deep red: experimental annotations for the node/gene
- yellow-orange: direct annotation added by one of you
- Dark blue: inferred annotation
- December 9, 2009: PAINT update Vbeta16
- Supports configuring curation status colors. You can configure the colors (the changes are persistent) from the "Edit->Curation status colors..." menu option.
- We have started to annotate families using the PAINT tool. The data are available here for review by curators and integration into their respective databases: GAFs_for_trees-based_annotations
Visualising PAINT annotations with the GONUTS wiki
The GONUTS GOsummary extension pulls all gene associations for a gene into the summary table and graph, not just those added from the PAINT GAF.
- Currently IEA is excluded from the summary table and graph. Should other evidence codes be excluded? [Mary comment: the refG graphs display only experimental evidence code annotations (EXP, IDA, IPI, IMP, IGI, IEP); IC; and the ISS codes (ISS, ISO, ISA, ISM), which are displayed (and labeled as "ISS_only") only if there is no experimental annotation to the term.]
- ISS to an ancestor node is now displayed as ISS-An in the table
- GONUTS is also used during the electronic annotation jamborees to visualize annotations in real time.
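The display rule in Mary's comment (experimental codes and IC are shown; ISS-type codes appear only as "ISS_only" when the term has no experimental annotation; everything else, including IEA, is dropped) can be written out as a small filter. This is one reading of that comment, applied per gene and per term, and is not code from the GONUTS extension or the refG graph scripts; the function name `summarize` and the term identifiers are placeholders.

```python
EXPERIMENTAL = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}
SEQUENCE_SIMILARITY = {"ISS", "ISO", "ISA", "ISM"}

def summarize(annotations):
    """Map each term to the labels that would be displayed for it.

    `annotations` is an iterable of (term_id, evidence_code) pairs for one gene.
    """
    by_term = {}
    for term_id, code in annotations:
        by_term.setdefault(term_id, set()).add(code)

    summary = {}
    for term_id, codes in by_term.items():
        # Experimental codes and IC are always shown.
        shown = sorted(codes & (EXPERIMENTAL | {"IC"}))
        # ISS-type codes are shown only when no experimental code exists.
        if not codes & EXPERIMENTAL and codes & SEQUENCE_SIMILARITY:
            shown.append("ISS_only")
        if shown:  # terms with only excluded codes (e.g. IEA) are dropped
            summary[term_id] = shown
    return summary

# Placeholder annotations for a single gene.
example = [
    ("TERM:1", "IDA"), ("TERM:1", "ISS"),  # experimental wins, ISS hidden
    ("TERM:2", "ISO"),                     # shown as ISS_only
    ("TERM:3", "IEA"),                     # excluded entirely
    ("TERM:4", "IC"),
]
summary = summarize(example)
print(summary)
```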
- To show only the graph, enter this line on a category page.
- Alternatively, you can show only the table with:
- We released new P-POD families in July 2009, based on the new protein sets assembled by Paul Thomas' group. Note that we also ran InParanoid/MultiParanoid this time around, in addition to OrthoMCL. We are in contact with Mary about re-generating her GO graphs for the new families, and we will add back those links when the graphs are available. One interface change to note: the Functional Conservation info is now on a separate page, rather than at the bottom of the family page as it was previously.
- In 2009, we added clusters of orthologous groups of proteins from the 12 Reference Genomes to P-POD, based on the InParanoid ortholog prediction algorithm. We have also implemented an algorithm that generates consensus clusters by combining results from OrthoMCL and InParanoid, and we plan to release these results in 2010 after additional testing.
- Currently, we are running our P-POD analyses on the set of 48 genomes used in the PANTHER protein families, to better integrate the ortholog prediction results from OrthoMCL and InParanoid with the broader PANTHER families. This work required some modifications to the P-POD backend to handle the increased data and memory load. In addition to the direct benefit to the Reference Genome project, this work has two positive side effects for the community: 1) we have worked with Chris Stoeckert’s group at Penn to make improvements to the OrthoMCL code, which is used by many other groups in the research community, and 2) the P-POD pipeline can be leveraged more easily as the ortholog prediction resource for modENCODE and other projects.
Electronic annotation jamborees
The goal of the annotation jamborees is for the entire group to discuss the annotation of a gene family.