PAINT progress report for 2014
Aim 3. We will perform phylogenetically based propagation of annotations. [This effort was cut by 1 FTE as a result of the final funding level.]
During the grant year, we made excellent progress on this aim. Thanks to the additional software and infrastructure development over the previous two years, we were able to focus our efforts this year on curated phylogenetic annotation. Our progress this year, in terms of the number of genes annotated through phylogenetic annotation, was approximately what we had projected in our original grant proposal for year 3, even though the effort was substantially decreased due to initial budget cuts.
Creation of GO annotations using phylogenetic inference
- We annotated gene families covering approximately 3000 human genes. This represents about 15% of all protein-coding genes (nearly meeting our original goal for year 3 of 18%, which assumed substantially greater resource allocation).
- Of these 3000 human genes that could potentially have received additional GO annotations, the project added new annotations for 2552 (about 85%). A total of 7262 biological process annotations, 4821 molecular function annotations and 4246 cellular component annotations were added for these human genes. This project is thus making a large impact on the computational representation of human gene function.
- The project also added new annotations for an additional 101,636 genes across 84 other genomes.
- Other statistics: 706 families have now been curated. This has resulted in the annotation of 1954 internal tree nodes with 976 molecular function annotations, 1335 biological process annotations and 1129 cellular component annotations. These annotations were propagated within the trees to annotate the 104,188 genes listed above (2552 human genes plus 101,636 genes from other genomes), yielding a total of 202,379 biological process annotations, 143,080 molecular function annotations and 130,050 cellular component annotations.
- Updated phylogenetic trees. All gene trees were updated using the May 2014 release of the UniProt Reference Proteomes. These comprise 213 organisms, which were all used to build the phylogenetic trees. These "complete" trees are too complex for curated phylogenetic annotation, so we then pruned the trees to about 100 genomes. These updated trees will be released for curation prior to the end of the year 3 period. In addition to updating the gene sets, the trees have been improved in several ways:
- as planned, carryover funds were used to compare curated orthologs from ZFIN (zebrafish-human and zebrafish-mouse) and PomBase (fission yeast to budding yeast and fission yeast to human) with predicted orthologs from the phylogenetic trees. The goal was to improve the trees we annotate, and this project resulted in many improvements. The biggest source of discrepancies between the curated and automated orthologs was gene families that had been artificially separated into two or more distinct families. We identified nearly 200 such cases over these genomes and corrected them by merging each group of separated families into a single, larger family. Other discrepancies allowed us to identify a less common artifact arising from incorrect handling of partial gene sequences.
- handling of horizontal transfer events has been improved
- handling of fragment/partial sequences has been improved
- worked closely with the UniProt team to improve the set of human and mouse genes
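The propagation step mentioned in the statistics above (annotations placed on internal tree nodes are inherited by the genes descended from them) can be illustrated with a minimal sketch. This is not the PAINT implementation: the `Node` class, the `propagate` function, and the term and gene names are hypothetical, and the sketch ignores any mechanism for stopping inheritance along a branch.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """A gene-tree node: leaves are extant genes, internal nodes are ancestors."""
    name: str
    children: list[Node] = field(default_factory=list)
    # Terms a curator placed directly on this node (hypothetical placeholder IDs).
    curated_terms: set[str] = field(default_factory=set)


def propagate(node, inherited=frozenset(), out=None):
    """Return {leaf name: terms inherited from all annotated ancestors}."""
    if out is None:
        out = {}
    # A node carries everything inherited from above plus its own curated terms.
    terms = set(inherited) | node.curated_terms
    if not node.children:
        out[node.name] = terms
    else:
        for child in node.children:
            propagate(child, frozenset(terms), out)
    return out


# Toy family: one annotation at the root, a second on a more recent ancestor.
tree = Node("root", curated_terms={"TERM_A"}, children=[
    Node("vertebrate_ancestor", curated_terms={"TERM_B"}, children=[
        Node("human_gene"),
        Node("mouse_gene"),
    ]),
    Node("yeast_gene"),
])
result = propagate(tree)
```

In this toy tree, two curated node annotations produce five gene-level annotations, which is the same multiplying effect that turns a few thousand node annotations into several hundred thousand gene annotations.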
Phylogenetic Annotation Software (PAINT)
At the beginning of this period the code was at beta70. In May we released PAINT 1.0, and since then there have been 13 minor releases. A large number of enhancements were added and bugs fixed during the PAINT hackathon in July, giving attendees immediate responses to their requests.
- Created the ability to add columns for more general terms, so that they can be used for ancestral annotation when the experimental annotations of the extant descendants are to more specific terms.
- Provided complete undo/redo support, with the history recorded and displayed in the log file (also known as ‘notes’)
- Added the capability to collapse branches of the tree for which there are no experimental annotations among the descendants.
- Implemented a GO taxon check web service (currently runs on a Berkeley server).
- Added a dynamic call to the GO taxon check web service when a user attempts to annotate an ancestral node, to determine whether the annotation is allowable.
- Improved the Multiple Sequence Alignment (MSA) view.
- Improved the search functionality
- Added special graphic for lateral transfer
- Numerous other small enhancements (e.g. tooltips, formatting of notes, switch to GO_Central as the source) and maintenance as bugs were reported.
- Initial work on an updating script (“touchup”) is underway and will be completed in the first quarter of 2015. This code will ensure that the GAF files exported from PAINT by the annotators remain synchronized with the latest versions of the GO, the experimental annotations, and the PANTHER family trees.
- We assisted in the mentoring of a Google Summer of Code student (under BioJS) in the development of a web browser MSA viewer for use in jsPAINT.
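The kind of decision the taxon check makes when a curator annotates an ancestral node can be sketched as follows. This is only an illustration under stated assumptions: it does not call or reproduce the actual web service, whose interface is not described here. The tiny taxonomy, the `CONSTRAINTS` table, the term names and the function names are all hypothetical; the only idea borrowed from GO is that a term may be restricted to a taxon ("only in") or excluded from one ("never in").

```python
# Hypothetical, heavily truncated taxonomy: child taxon -> parent taxon.
PARENT = {
    "Homo sapiens": "Metazoa",
    "Metazoa": "Opisthokonta",
    "Fungi": "Opisthokonta",
    "Opisthokonta": "Eukaryota",
    "Eukaryota": None,
}

# Hypothetical taxon constraints for two placeholder terms.
CONSTRAINTS = {
    "TERM_ANIMAL_ONLY": {"only_in": ["Metazoa"], "never_in": []},
    "TERM_NOT_FUNGAL": {"only_in": [], "never_in": ["Fungi"]},
}


def lineage(taxon):
    """Yield the taxon itself, then each of its ancestors up to the root."""
    while taxon is not None:
        yield taxon
        taxon = PARENT[taxon]


def is_within(taxon, clade):
    """True if `taxon` is `clade` or one of its descendants."""
    return clade in lineage(taxon)


def annotation_allowed(term, node_taxon):
    """Check a term against the taxon assigned to an ancestral tree node."""
    rules = CONSTRAINTS.get(term, {})
    # "only in" fails if the node is not inside the permitted clade.
    for clade in rules.get("only_in", []):
        if not is_within(node_taxon, clade):
            return False
    # "never in" fails if the node is inside the forbidden clade.
    for clade in rules.get("never_in", []):
        if is_within(node_taxon, clade):
            return False
    return True
```

Under these assumptions, `annotation_allowed("TERM_ANIMAL_ONLY", "Eukaryota")` is rejected because an annotation at a eukaryotic ancestor would be inherited by non-animal descendants. As a simplification, the "never in" rule here only inspects the node's own taxon, not the taxa of its descendants.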
Use of PAINT for Quality Assurance
PAINT gives curators a bird's-eye view of all annotations for a family. This functionality is extremely valuable for reviewing annotations and identifying errors and inconsistencies. These errors can be grouped into several major categories:
- Annotation to a biological process versus "regulation of" a biological process
It is sometimes difficult to establish whether a protein plays a role within a process or acts as a regulator of that process; the symptom is a family with annotations to both process X and regulation of process X. In such cases, the 'regulation' annotations are often made with the IMP evidence code, so if there is evidence for a direct role in the process, the 'regulation' term should not be annotated. This is a case where PAINT provides a clear advantage for annotation.
IMP/IGI evidence gives rise to many phenotypic annotations (e.g. cell proliferation, cell growth, and apoptosis). It often happens that the role of the gene product is actually in a process far upstream of the observed phenotype. Again, this is easily visible in PAINT; the symptom is usually many varied annotations that do not consistently point to any one process.
- HTP annotations
High-throughput papers are a source of overannotation, with false positive rates that are usually much higher than in lower throughput papers. In PAINT, we have identified many papers that have been used as evidence for cellular component annotations, but that often conflict with low-throughput experiments on the same gene products. We have created an "exclusion list" that identifies the papers with relatively high false-positive rates.
- Incorrect annotations
In almost every family we have identified at least a few misannotation issues; sometimes these are relatively minor, such as the regulation versus process problem mentioned above. In other cases there are more serious issues, such as the wrong protein being annotated, or misinterpretation of an experiment. Protein2GO has a mechanism to dispute annotations, and the feedback from PAINT curators has contributed to improving the overall quality of the GO experimental annotation set.
- Missing annotations
In many cases the annotations needed to seed the propagation of functions/processes/components in a tree are missing. This is of course to be expected, as the annotation effort necessarily lags behind the generation of data. These annotations are added to the GO annotation set via Protein2GO, which also contributes to the improvement of the overall set.
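The first category above (a family annotated to both process X and regulation of process X) lends itself to a simple automated flag. The sketch below is an illustration only and is not a feature of PAINT: the `REGULATES` lookup, the function name, and the gene and term names are hypothetical stand-ins for information that would come from the ontology and the family's experimental annotations.

```python
# Hypothetical lookup: regulation term -> the process it regulates.
REGULATES = {
    "regulation of process X": "process X",
}


def regulation_conflicts(annotations):
    """Flag 'regulation of X' annotations in a family that is also annotated to X.

    `annotations` is an iterable of (gene, term, evidence_code) tuples.
    Returns a list of (gene, regulation_term, process_term, evidence_code).
    """
    annotations = list(annotations)
    terms_seen = {term for _, term, _ in annotations}
    flagged = []
    for gene, term, evidence in annotations:
        target = REGULATES.get(term)
        # Flag only when the regulated process itself is annotated in the family.
        if target is not None and target in terms_seen:
            flagged.append((gene, term, target, evidence))
    return flagged


family = [
    ("geneA", "process X", "IDA"),
    ("geneB", "regulation of process X", "IMP"),
    ("geneC", "process Y", "IMP"),
]
flagged = regulation_conflicts(family)
```

Here only the `geneB` annotation is flagged for curator review; whether the 'regulation' annotation should actually be disputed remains a judgment for the curator.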