Reference Genome progress report for 2012
Aim 3. We will perform phylogenetically-based propagation of annotations.
There are two main parts to this aim: (1) creating the infrastructure required to support phylogenetic propagation, and (2) performing phylogenetic propagation. As originally planned in the grant proposal, the first year was mostly focused on infrastructure.
On the infrastructure side, we are currently ahead of schedule, having met the first-year goal in the first 9 months of funding. During this year, we migrated the project to use protein-coding gene sequence sets compiled by the UniProt resource (http://www.ebi.ac.uk/reference_proteomes/), rather than continuing to generate these sets in-house. Over the course of the grant period, this will incrementally reduce the resources required to update and maintain these sets, and will allow integration with orthology predictions from the wider community (http://www.ncbi.nlm.nih.gov/pubmed/22332236). However, as with any large-scale data migration, it requires a substantial up-front investment of resources. The latest release of these standard "reference proteome sets" was in April 2012, and it resolved many of the problems with unique identifiers and non-redundant protein selection found in the previous (April 2011) release. The PANTHER trees that form the basis for GO annotation propagation have now been updated to use the UniProt sets. The GOC is actively working with UniProt to resolve a few remaining bugs in these sets before the next release in April 2013. The feedback provided by the GOC curators carrying out the phylogenetic annotations is instrumental to the review and improvement of these reference proteomes.
Another significant infrastructure development is that the phylogenetic annotations now refer to stable tree node identifiers, rather than identifiers that vary between releases of the phylogenetic annotations. This development has been published (http://www.ncbi.nlm.nih.gov/pubmed/23193289), and will make maintenance of the phylogenetic annotations more robust and efficient.
We have also made further enhancements to the PAINT software for phylogenetic annotation (http://www.ncbi.nlm.nih.gov/pubmed/21873635), which increase the efficiency of the phylogenetic curation process.
The actual accumulation of phylogenetically-based annotations is just beginning. As of this report, there are phylogenetic annotations for 84 gene families, out of a projected 4000 families targeted for the 5-year funding period. This amounts to approximately 2% progress, somewhat short of the 6% we had projected for this time period. The shortfall is due to the accelerated focus on infrastructure described above, as well as the loss of 1 FTE curator on this aim as a result of project budget cuts. Importantly, though, we have met the main goal of curator training, having just held a 4-day workshop on PAINT curation (http://wiki.geneontology.org/index.php/2012_PAINT_workshop_Logistics), during which 6 curators from different model organism database groups were trained as part-time phylogenetic curators. The work of these curators has been reviewed and approved for consistency and adherence to guidelines. In addition, several new features were added to the PAINT software in response to requests by these new curators, including visual cues for finding gene duplications in phylogenetic trees and for focusing on the annotations that have the most potential for integration across related genes.