PAINT progress report for 2014
Aim 3. We will perform phylogenetically-based propagation of annotations. [This effort was cut by 1 FTE as a result of the final funding level.]
During the grant year, we made excellent progress on this aim. Thanks to the additional software and infrastructure development over the previous two years, we were able to focus our efforts this year on curated phylogenetic annotation. Our progress this year, in terms of the number of genes annotated through phylogenetic annotation, was approximately what we had projected in our original grant proposal for year 3, even though the effort was substantially decreased due to initial budget cuts.
- We annotated gene families covering approximately 3000 human genes. This represents about 15% of all protein-coding genes (nearly meeting our original goal for year 3 of 18%, which assumed substantially greater resource allocation).
- Of these, the project added new annotations for 2552 human genes, over 75% of the genes covered during this period. This project is thus making a large impact on the computational representation of human gene function.
- The project also added new annotations for an additional 101,636 genes across 84 other genomes.
- Other statistics: 706 families have now been curated. This has resulted in the annotation of 1954 internal tree nodes, comprising 976 molecular function annotations, 1335 biological process annotations and 1129 cellular component annotations. These annotations were propagated within the tree to annotate the 104,188 genes listed above, yielding a total of 202,379 biological process annotations, 143,080 molecular function annotations and 130,050 cellular component annotations.
- Updated phylogenetic trees. All gene trees were updated using the May 2014 release of the UniProt Reference Proteomes. These comprise 213 organisms, which were all used to build the phylogenetic trees. These "complete" trees are too complex for curated phylogenetic annotation, so we then pruned the trees to about 100 genomes. These updated trees will be released for curation prior to the end of the year 3 period. In addition to updating the gene sets, the trees have been improved in several ways:
- As planned, carryover funds were used to compare curated orthologs from ZFIN (zebrafish-human and zebrafish-mouse) and PomBase (fission yeast to budding yeast and fission yeast to human) to predicted orthologs from the phylogenetic trees. The goal was to improve the trees we annotate, and this project resulted in many improvements. The biggest source of discrepancies between the curated and automated orthologs was gene families that had been artificially separated into two or more distinct families. We identified nearly 200 such cases over these genomes and corrected them by merging the separated families into a single, larger family. Other discrepancies allowed us to identify a less common artifact arising from incorrect handling of partial gene sequences.
- Handling of horizontal transfer events has been improved.
- Handling of fragment/partial sequences has been improved.
- We worked closely with the UniProt team to improve the set of human and mouse genes.
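
The comparison of curated orthologs against the gene families can be illustrated with a minimal sketch. It is not the project's actual pipeline: the function name, the input structures and the gene and family identifiers are all hypothetical, and it assumes each gene belongs to at most one family. A curated ortholog pair whose two genes fall in different families is taken as evidence that those families were artificially separated, and families linked by such pairs are grouped into candidate merges with a simple union-find.

```python
from collections import defaultdict


def find_split_families(curated_pairs, gene_to_family):
    """Group families that curated ortholog pairs suggest belong together.

    curated_pairs: iterable of (gene_a, gene_b) curated ortholog pairs.
    gene_to_family: dict mapping gene ID -> family ID (hypothetical IDs).
    Returns a list of sets of family IDs; each set is a candidate merge.
    """
    parent = {}  # union-find forest over family IDs

    def find(fam):
        parent.setdefault(fam, fam)
        while parent[fam] != fam:
            parent[fam] = parent[parent[fam]]  # path halving
            fam = parent[fam]
        return fam

    def union(fam_a, fam_b):
        root_a, root_b = find(fam_a), find(fam_b)
        if root_a != root_b:
            parent[root_b] = root_a

    for gene_a, gene_b in curated_pairs:
        fam_a = gene_to_family.get(gene_a)
        fam_b = gene_to_family.get(gene_b)
        if fam_a is None or fam_b is None:
            continue  # gene absent from the trees; cannot be compared
        if fam_a != fam_b:
            union(fam_a, fam_b)  # curated orthologs split across families

    groups = defaultdict(set)
    for fam in list(parent):
        groups[find(fam)].add(fam)
    return [group for group in groups.values() if len(group) > 1]
```

Each returned group would still need manual review before merging, since a discrepancy can also come from other sources, such as the partial-sequence artifact noted above.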