Princeton, September 2009
Specific Aim 1: Generate new protein family clusters as required for the Reference Genome effort. Existing PANTHER families only exhaustively cover protein families containing human genes. Additional protein families will be incrementally added to PANTHER using data from the CLUSTR project at EBI. In the nearer term, the Reference Genome project requires a method for generating new family clusters that include data from all the reference genomes that can then be run through the GO-PANTHER pipeline to generate protein trees within the GO annotation environment. The Jaccard clustering method, which is already a standard part of our P-POD analysis pipeline in Princeton, will be used in this project and is ideal for this application.
Progress: This has been done, although it is unclear whether PANTHER still needs it.
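As an illustration of the clustering idea in Aim 1, the following is a minimal sketch of Jaccard-coefficient clustering. It is not the P-POD implementation, whose details are not given here. It assumes each protein has a set of similarity neighbors (for example, its significant BLAST hits), links two proteins when the Jaccard coefficient of their neighbor sets meets a threshold, and reports the connected groups as clusters. The function names, the threshold value, and the single-linkage grouping are all assumptions made for this example.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard coefficient |a & b| / |a | b| of two sets (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def jaccard_clusters(neighbors, threshold=0.6):
    """Group proteins whose neighbor sets are similar.

    neighbors: dict mapping protein ID -> set of neighbor protein IDs
               (hypothetical input; e.g. significant BLAST hits).
    threshold: minimum Jaccard coefficient to link two proteins
               (illustrative value, not taken from the report).
    Returns a sorted list of clusters, each a sorted list of protein IDs.
    """
    # Union-find structure: every protein starts in its own cluster.
    parent = {p: p for p in neighbors}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Link every pair whose neighbor sets are similar enough.
    for p, q in combinations(neighbors, 2):
        if jaccard(neighbors[p], neighbors[q]) >= threshold:
            parent[find(p)] = find(q)

    # Collect proteins by their root to form the final clusters.
    clusters = {}
    for p in neighbors:
        clusters.setdefault(find(p), set()).add(p)
    return sorted(sorted(c) for c in clusters.values())
```

The all-against-all comparison is quadratic in the number of proteins, so a real pipeline over all reference genomes would need to restrict comparisons to proteins that already share at least one neighbor.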
Specific Aim 2: Integrate data from multiple homolog/ortholog detection methods to enable efficient leverage of existing homology resources. We will generate overlays for protein families based on alternate ortholog/protein family detection methods (e.g. TreeFam, OrthoMCL, Homologene and InParanoid). Within the context of the GO Reference Genome effort, the PANTHER protein/gene trees provide a broad evolutionary context for annotation by inference. However, the protein family trees themselves still require expert evaluation and curation. Having the results from other methodologies available for comparison will greatly assist this process. Just as importantly, from the perspective of the larger research community, these overlays will enable the comparison of different approaches to homolog set determination.
Progress: We have generated families based on OrthoMCL and Multi/InParanoid. These have been provided to PANTHER, and the OrthoMCL families are part of the PAINT display. We have also constructed "consensus" clusters based on these two methods and are testing the results. We are also working to generate OrthoMCL, Multi/InParanoid, and consensus families for the 48 species covered by PANTHER families, and will be able to do the same once the Quest for Orthologs set is available. The stumbling block thus far has been a memory problem with the publicly available version of OrthoMCL; we are working on a solution with the developers at U Penn.
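One simple way to picture a "consensus" cluster is as a group of genes that both methods place together. The sketch below takes that strict-intersection view. It is only an assumption for illustration, since the report does not say how the consensus clusters are actually built; the function name, the `min_size` parameter, and the gene IDs in the usage note are hypothetical.

```python
def consensus_clusters(clusters_a, clusters_b, min_size=2):
    """Return groups of genes that are clustered together by both methods.

    clusters_a, clusters_b: iterables of sets of gene IDs, one per method
                            (e.g. one from OrthoMCL, one from InParanoid).
    min_size: smallest shared group worth reporting (illustrative default).
    Returns a sorted list of consensus groups, each a sorted list of gene IDs.
    """
    consensus = []
    for a in clusters_a:
        for b in clusters_b:
            # Genes that both methods put in the same cluster.
            shared = set(a) & set(b)
            if len(shared) >= min_size:
                consensus.append(sorted(shared))
    return sorted(consensus)
```

For example, given `[{"h1", "m1", "y1"}, {"h2", "m2"}]` from one method and `[{"h1", "m1"}, {"y1", "h2", "m2"}]` from the other, the function keeps `h1`/`m1` and `h2`/`m2` as consensus groups and drops `y1`, which the two methods assign differently.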
Specific Aim 3: Provide the coordination and expert review necessary to enable reliable transfer of GO annotations to newly sequenced genomes. One of the strengths of the GO Reference Genome project is that the results from different model organisms, each with its own advantages for certain types of experiments, can be combined to get a much fuller picture of gene function than would be possible from any one organism alone. The experimental data from one organism can be used to infer the function of proteins/genes in another organism. The Reference Genome curators have a clear task, as described in the GO proposal, to generate comprehensive GO annotations for the proteins in their respective model organism databases (MODs). Integrating the annotations from the protein data of MOD organisms within the context of the ancestral proteins from which the proteins evolved is a separate task. This integration must use the evolutionary relationships connecting these different organisms, and it must be expertly evaluated. The objective of this aim is to integrate experimental data from different organisms by (1) annotating GO terms for ancestral sequences in an evolutionary tree, and (2) coordinating and integrating additional expert review of the annotated tree by individual GO curators.
Progress: Annotation has been slow up to this point; only a relatively small number of protein families are completely annotated and ready for incorporation into MODs, see:
However, with the recent release of PAINT, we should now have a software tool with all the basic features needed to move ahead on the annotation front, which will now be a very high priority. In the next couple of months, I would like not only to gear up annotation on our end, but also to work out how we can make it easy for the MODs to incorporate these annotations with confidence. This is a major issue, both for making real progress and for addressing the concern or impression among NIH program officers that we do not have "MOD buy-in," even though much of the GOC consists of MODs.