PAINT progress report for 2014

From GO Wiki
Revision as of 12:42, 19 December 2014 by Paul Thomas (talk | contribs) (Progress)

Aim 3. We will perform phylogenetically-based propagation of annotations. [This effort was cut by 1 FTE as a result of the final funding level.]

During the grant year, we made excellent progress on this aim. Thanks to the additional software and infrastructure development over the previous two years, we were able to focus our efforts this year on curated phylogenetic annotation. Our progress this year, in terms of the number of genes annotated through phylogenetic annotation, was approximately what we had projected in our original grant proposal for year 3, even though the effort was substantially decreased due to initial budget cuts.

Progress

  • We annotated gene families covering approximately 3000 human genes. This represents about 15% of all protein-coding genes (nearly meeting our original goal for year 3 of 18%, which assumed substantially greater resource allocation).
  • Of these, the project added new annotations for 2552 human genes, over 75% of the 3000 human genes covered during this period. These human genes received a total of 7262 biological process annotations, 4821 molecular function annotations and 4246 cellular component annotations. This project is thus making a large impact on the computational representation of human gene function.
  • The project also added new annotations for an additional 101,636 genes across 84 other genomes.
  • Other statistics: 706 families have now been curated. This has resulted in the annotation of 1954 internal tree nodes, comprising 976 molecular function annotations, 1335 biological process annotations and 1129 cellular component annotations. These annotations were propagated within the tree to annotate the 104,188 genes listed above, yielding a total of 202,379 biological process annotations, 143,080 molecular function annotations and 130,050 cellular component annotations.
  • Updated phylogenetic trees. All gene trees were updated using the May 2014 release of the UniProt Reference Proteomes. These comprise 213 organisms, which were all used to build the phylogenetic trees. These "complete" trees are too complex for curated phylogenetic annotation, so we then pruned the trees to about 100 genomes. These updated trees will be released for curation prior to the end of the year 3 period. In addition to updating the gene sets, the trees have been improved in several ways:
    • As planned, carryover funds were used to compare curated orthologs from ZFIN (zebrafish-human and zebrafish-mouse) and PomBase (fission yeast to budding yeast and fission yeast to human) to predicted orthologs from the phylogenetic trees. The goal was to improve the trees we annotate, and this project resulted in many improvements. The biggest source of discrepancies between the curated and automated orthologs was gene families that had been artificially separated into two or more distinct families. We identified nearly 200 such cases over these genomes and corrected them by merging each group of separated families into a single, larger family. Other discrepancies allowed us to identify a less common artifact arising from incorrect handling of partial gene sequences.
    • Handling of horizontal transfer events has been improved.
    • Handling of fragment/partial sequences has been improved.
    • We worked closely with the UniProt team to improve the set of human and mouse genes.
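
The propagation step described in the statistics above (annotations placed on internal tree nodes and then inherited by the genes below them) can be illustrated with a minimal sketch. This is not the PAINT implementation: the tree, node names, term labels and the `propagate` function are all hypothetical, and the sketch assumes that every annotation on an internal node is inherited unchanged by all descendant genes.

```python
from collections import defaultdict

# Hypothetical toy tree: internal node -> list of children.
# Nodes that never appear as keys are leaves (genes).
TREE = {
    "root": ["node_A", "gene_4"],
    "node_A": ["gene_1", "node_B"],
    "node_B": ["gene_2", "gene_3"],
}

# Hypothetical curated annotations on internal nodes (placeholder term labels).
NODE_ANNOTATIONS = {
    "node_A": {"MF:term_x"},
    "node_B": {"BP:term_y"},
}


def propagate(tree, node_annotations, root):
    """Return {gene: set of terms inherited from annotated ancestor nodes}."""
    leaf_terms = defaultdict(set)
    # Each stack entry carries the terms accumulated on the path from the root.
    stack = [(root, frozenset())]
    while stack:
        node, inherited = stack.pop()
        terms = inherited | node_annotations.get(node, set())
        children = tree.get(node, [])
        if not children:
            # Leaf: record everything inherited along the path.
            leaf_terms[node] |= terms
        for child in children:
            stack.append((child, terms))
    return dict(leaf_terms)


propagated = propagate(TREE, NODE_ANNOTATIONS, "root")
```

In this toy example a single annotation on `node_A` reaches three genes, which is how a relatively small number of curated internal nodes can yield a much larger number of gene annotations.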

Use of PAINT for Quality Assurance

PAINT gives curators a bird's-eye view of all annotations for a family. This functionality is extremely valuable for reviewing annotations and identifying errors and inconsistencies. These errors can be grouped into several major categories:

  • Annotation to BP versus regulation of BP

It is sometimes difficult to establish whether a protein acts within a process or as a regulator of that process; this shows up as families having annotations to both process X and the regulation of process X. In such cases, the 'regulation' annotations are often assigned with the IMP evidence code, so if there is evidence for a direct role, the 'regulation' should not be annotated. This is a case where PAINT provides a clear advantage for annotation.
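
As an illustration of the check described above, the following minimal sketch flags 'regulation of X' annotations in a family that also carries annotations to X itself. It is a hypothetical example, not part of PAINT: the `REGULATES` mapping, the gene names, the term labels and the function name are all assumptions made for illustration.

```python
# Hypothetical mapping from a "regulation of X" term to the process X it regulates.
REGULATES = {
    "regulation of process X": "process X",
    "regulation of process Y": "process Y",
}

# Hypothetical experimental annotations for one family: (gene, term, evidence code).
FAMILY_ANNOTATIONS = [
    ("gene_1", "process X", "IDA"),
    ("gene_2", "regulation of process X", "IMP"),
    ("gene_3", "regulation of process Y", "IMP"),
]


def flag_process_vs_regulation(annotations, regulates):
    """Return 'regulation of X' annotations whose target X is also annotated in the family."""
    terms_in_family = {term for _, term, _ in annotations}
    flagged = []
    for gene, term, evidence in annotations:
        target = regulates.get(term)
        if target is not None and target in terms_in_family:
            flagged.append((gene, term, evidence))
    return flagged


flagged = flag_process_vs_regulation(FAMILY_ANNOTATIONS, REGULATES)
```

Flagged annotations are only candidates for review; whether the direct or the regulatory role is correct remains a curator's decision.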

  • Over-annotation

IMP/IGI annotations lead to a lot of phenotypic annotations (cell proliferation, cell growth, apoptosis, …). It often happens that the role of the protein is actually in a process far upstream of the observed phenotype. Again this is easily visible in PAINT; the symptom is usually many, varied annotations that do not point to any one process.

  • HTP annotations

High-throughput papers are a source of over-annotation, with false positive rates that are usually much higher than in lower-throughput papers. In PAINT, these papers contributed so many false positives to the set of experimental annotations that we have created an exclusion list to filter them out.
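
The exclusion list can be thought of as a simple filter on the reference attached to each experimental annotation. The sketch below is hypothetical: the record layout, the placeholder reference identifiers and the `drop_excluded` function are assumptions for illustration and do not reflect the actual PAINT data format.

```python
# Hypothetical exclusion list of high-throughput paper references (placeholder IDs).
EXCLUDED_REFERENCES = {"PMID:0000001"}

# Hypothetical experimental annotations, each carrying its supporting reference.
ANNOTATIONS = [
    {"gene": "gene_1", "term": "term_a", "reference": "PMID:0000001"},
    {"gene": "gene_1", "term": "term_b", "reference": "PMID:0000002"},
]


def drop_excluded(annotations, excluded_references):
    """Keep only annotations whose reference is not on the exclusion list."""
    return [a for a in annotations if a["reference"] not in excluded_references]


kept = drop_excluded(ANNOTATIONS, EXCLUDED_REFERENCES)
```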

  • Incorrect annotations

In almost every family there are some misannotation issues; sometimes these are relatively minor, such as the regulation versus process problem mentioned above. In other cases there are more serious issues, such as the wrong protein being annotated, or misinterpretation of an experiment. Protein2GO has a mechanism to dispute annotations, and the feedback from PAINT curators has contributed to improving the overall quality of the GO experimental annotation set.

  • Missing annotations

In many cases, the experimental annotations needed to seed the propagation of functions/processes/components in a tree are missing. This is of course to be expected, as the annotation effort necessarily lags behind the generation of data. These annotations are added to the GO annotation set via Protein2GO, which also contributes to the improvement of the overall set.