Reference Genome October 2009 (Archived)

PAINT update: Vbeta12

PAINT-based annotations

We have started to annotate families using the PAINT tool. The data are available here for review by curators and integration into their respective databases: GAFs_for_trees-based_annotations
New annotations done:

- Striatin family: PANTHER15653
- MVD Family: PANTHER10977

Meeting in Princeton: October 27, 2009

Attending: Pascale Gaudet, Paul Thomas, Kara Dolinski, Mike Livstone, Rose Oughtred, Suzanna Lewis (via phone).

PAINT software and protein family annotation

We all agreed that PAINT is really coming along, but there are a couple of features that are required in order to scale up the process in terms of:

1) protein family annotation:
- A feature that allows better tracking of each annotation is needed (e.g. so a curator can make a note about the thought process behind an annotation, basically extending the “long branch length” feature). Tracking more information on each annotation we work with will enable better metrics (see below).
- Having the software automatically calculate a default recommendation for an annotation to the common ancestor would save a huge amount of time.

2) incorporation of the annotations by the MODs:
- we need a central location to store a) the GAFs produced by PAINT that contain all the annotations, including those to ancestor nodes, and b) filtered GAFs that contain only the annotations that the MODs would be interested in, i.e. filtered by Ref Genome taxon ID. The plan that we proposed:
  - store the complete GAFs in a central place at PPOD or Panther FTP sites
  - store the filtered GAFs on the GO central site

To do this, we need modifications to PAINT to generate compliant GAFs, and to have PAINT (or a separate script) generate the filtered GAFs as well.
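
As a rough illustration of the "separate script" option, the filtering step could look something like the sketch below. It assumes the complete PAINT output is a tab-delimited GAF in which column 13 holds the taxon as `taxon:<NCBI ID>` (as in the GAF format); the function name `filter_gaf_by_taxon` and its arguments are made up for this example and are not part of PAINT.

```python
def filter_gaf_by_taxon(lines, taxon_ids):
    """Yield GAF header lines plus the annotation lines for the given taxa.

    `lines` is any iterable of GAF lines (e.g. an open file);
    `taxon_ids` is an iterable of NCBI taxon IDs such as ["9606"].
    Both names are hypothetical, chosen for this sketch.
    """
    wanted = {f"taxon:{t}" for t in taxon_ids}
    for line in lines:
        # Header/comment lines start with "!" and are passed through.
        if line.startswith("!"):
            yield line
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 13:
            continue  # skip malformed lines rather than guess at them
        # Column 13 may hold "taxon:A|taxon:B"; the first entry is the
        # taxon of the annotated gene product itself.
        if cols[12].split("|")[0] in wanted:
            yield line
```

Annotations to ancestor nodes would drop out of the filtered file simply because their taxon is not one of the requested Ref Genome taxon IDs.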

- we need to work with the MODs on improving the incoming experimental annotations, so that protein family annotation requires less re-reading of papers and re-interpreting the experiments. Ideas:
  - RefGenome annotation camp, centered around issues that come up with the protein family annotation
  - Document with specific recommendations and examples for annotation standards

We have a firm deadline of January 1 to have 1) and 2) ready to go, with MODs incorporating the GAFs that will be available at a central site.

Minor/lower priority PAINT suggestions:
- In “fetching gene products” window, show where PAINT thinks it’s getting gene products from and display an error message if the connection to the database fails.
- Enable addition of a literature-based annotation via PAINT

Objectives and Metrics

Everyone thought that focusing on biology should be a primary objective for the entire project i.e. we have biology driving the annotation (both at the protein and at the family level), and the annotation driving the ontology development.

Pascale, Suzi, and Emily are writing up a proposal describing this approach. While writing up the guidelines, we think a good first example would be full annotation of the ribosome. Pascale has been discussing this with Serenella at Swiss-Prot, and Serenella thought that would be productive on their end as well, since their ribosome annotation was done a really long time ago with lower annotation standards. Another benefit is that technically it's "easy": the members of the complex are rather well defined and conserved, and the function is pretty well characterized.

Potential future collaborator for the biologically focused curation: Kara’s group at Princeton is collaborating with Mike Tyers’ group on the BioGRID project. For curation of human interactions, we are focusing on particular areas of biology (e.g. ubiquitin pathway, signaling). It could be very fruitful to get together to do both interaction curation and Ref. Genome annotation for these groups of genes—we can look at conservation of networks across species, etc.

Other objectives that can be met via protein family annotation:

- central, consistent review of experimental annotation
- enable annotation transfer in an efficient yet rigorous manner based on evolution from MOD to MOD and from MOD to emergent species.

Quality Control note: It was a useful exercise to have Mike Livstone and Paul do the MSH family; early on, we all did the same thing with topoisomerases. It might be useful to do a few more now that we are a bit further along.

Metrics idea proposed by Mike L:

Perform this calculation one organism at a time.

1) For a given GO term, count all the proteins annotated to the term (with EXP codes) and to its children and divide by the total number of proteins annotated to that term. Then, take the negative log.
2) For a given protein, take all the EXP annotations and put them on a DAG, then remove any redundancies (duplicate annotations and parents whose children are annotated). Calculate the score for each of the remaining terms and add them up to get the total score. For a PANTHER family, calculate the total score from all the RefGenome proteins put together.
3) Repeat (2) for the set of annotations that consists of all the EXP annotations PLUS all the new ISS annotations. Comparing (3) to (2) (straight ratio should do it) will give a measure of the degree of improvement of annotation.
4) We can compare (3) to what we get if we consider all the pre-existing EXP plus IEA annotations. This will give us an idea of how much our ISS's improved annotation compared to the sum total of all the IEA's that we already had.
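
To make the proposal concrete, here is a minimal sketch of the scoring, under several assumptions that are not in the notes above: the ontology is given as a dict mapping each term to its parent terms, annotations are a dict mapping each protein to its set of terms, the denominator in the term score is taken to be the total number of annotated proteins in the organism (so that the ratio is at most 1), and the family score is the sum of its proteins' scores. All function names are hypothetical.

```python
import math


def ancestors(term, parents):
    """Return every proper ancestor of `term` in the DAG."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents.get(t, ()))
    return seen


def term_scores(annotations, parents):
    """Score each term as -log(fraction of proteins annotated to the
    term or to any of its descendants). The denominator (all annotated
    proteins in the organism) is an assumption of this sketch."""
    counts = {}
    for terms in annotations.values():
        covered = set(terms)
        for t in terms:
            covered |= ancestors(t, parents)
        for t in covered:  # each protein counts once per term
            counts[t] = counts.get(t, 0) + 1
    total = len(annotations)
    return {t: -math.log(n / total) for t, n in counts.items()}


def most_specific(terms, parents):
    """Drop duplicates and any term that is an ancestor of another
    annotated term (the redundancy removal described above)."""
    terms = set(terms)
    redundant = set()
    for t in terms:
        redundant |= ancestors(t, parents)
    return terms - redundant


def protein_score(terms, scores, parents):
    """Sum the term scores over a protein's non-redundant terms.
    `scores` must contain an entry for every term passed in."""
    return sum(scores[t] for t in most_specific(terms, parents))


def family_score(proteins, annotations, scores, parents):
    """Total score for a family: the sum over its member proteins
    (one reading of "all the RefGenome proteins put together")."""
    return sum(
        protein_score(annotations.get(p, ()), scores, parents)
        for p in proteins
    )
```

The improvement measure would then be the ratio of `family_score` computed on the EXP plus ISS annotations to `family_score` computed on the EXP annotations alone. The notes do not say which annotation set the term scores should come from when ISS annotations are included, so the sketch leaves that choice to the caller by taking `scores` as a parameter.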

Lung Development Genes: November Targets

Targets are here: http://spreadsheets.google.com/ccc?key=pZhlLFuj8ewDe799QTmxzCA&hl=en

List (human gene names):

ASCL1
FGF10
FOXA2
GATA6
GLI2 (NOTE: we have done some, but not all, members of this family)
HNF4
KDR
NKX2
NPC1/PTCH1
RARA
SOX8
VEGFA
