PAINT progress report for 2014

Revision as of 10:24, 19 December 2014

Aim 3. We will perform phylogenetically-based propagation of annotations. [This effort was cut by 1 FTE as a result of the final funding level.]

During the grant year, we made excellent progress on this aim. Thanks to the additional software and infrastructure development over the previous two years, we were able to focus our efforts this year on curated phylogenetic annotation. Our progress this year, in terms of the number of genes annotated through phylogenetic annotation, was approximately what we had projected in our original grant proposal for year 3, even though the effort was substantially decreased due to initial budget cuts.

Progress

  • We annotated gene families covering approximately 3000 human genes. This represents about 15% of all protein-coding genes (nearly meeting our original goal for year 3 of 18%, which assumed substantially greater resource allocation).
  • Of these, the project added new annotations for 2552 human genes, over 75% of the genes covered during this period. This project is thus making a large impact on the computational representation of human gene function.
  • The project also added new annotations for an additional 101,636 genes across 84 other genomes.
  • Other statistics: 706 families have now been curated. This has resulted in the annotation of 1954 internal tree nodes, comprising 976 molecular function annotations, 1335 biological process annotations and 1129 cellular component annotations. These annotations were propagated within the tree to annotate the 104,188 genes listed above, yielding a total of 202,379 (7262 human) biological process annotations, 143,080 (4821 human) molecular function annotations and 130,050 (4246 human) cellular component annotations.
  • Updated phylogenetic trees. All gene trees were updated using the May 2014 release of the UniProt Reference Proteomes. These comprise 213 organisms, which were all used to build the phylogenetic trees. These "complete" trees are too complex for curated phylogenetic annotation, so we then pruned the trees to about 100 genomes. These updated trees will be released for curation prior to the end of the year 3 period. In addition to updating the gene sets, the trees have been improved in several ways:
    • As planned, carry-over funds were used to compare curated orthologs from ZFIN (zebrafish-human and zebrafish-mouse) and PomBase (fission yeast to budding yeast and fission yeast to human) with predicted orthologs from the phylogenetic trees. The goal was to improve the trees we annotate, and this project resulted in many improvements. The biggest source of discrepancies between the curated and automated orthologs was gene families that had been artificially separated into two or more distinct families. We identified nearly 200 such cases across these genomes and corrected them by merging each group of separated families into a single, larger family. Other discrepancies allowed us to identify a less common artifact arising from incorrect handling of partial gene sequences.
    • Handling of horizontal transfer events has been improved.
    • Handling of fragment/partial sequences has been improved.
    • We worked closely with the UniProt team to improve the set of human and mouse genes.
  • Use of PAINT for quality assurance

PAINT gives curators a bird's-eye view of all annotations for a family. This is extremely valuable for reviewing annotations and identifying errors and inconsistencies. These errors can be grouped into the following major categories:

    • Annotation to BP versus regulation of BP:

It is sometimes difficult to establish whether a protein acts within a process or as a regulator of that process; this shows up as families with annotations to both process X and the regulation of process X. In such cases, the 'regulation' annotations are often supported only by the IMP (Inferred from Mutant Phenotype) evidence code, so if there is evidence for a direct role in the process, the 'regulation' term should not be annotated. This is a case where PAINT provides a clear advantage for annotation.

    • Over-annotation:

IMP/IGI annotations lead to many phenotypic annotations (cell proliferation, cell growth, apoptosis, …). In many of these cases the role of the protein is entirely indirect. Again, this is easily visible in PAINT; the symptom is usually many varied annotations that do not point to any particular process.

    • HTP annotations:

These are handled with an exclusion list for now; is this the best solution? Even if there are many false positives, the majority should be true positives, and they are sometimes the only information we have.

    • Wrong annotations:

Disputes: there are no real statistics, but we could dispute at least one annotation per family. It is unfortunate that statistics are not maintained; they would help identify areas where more annotation guidelines are needed.

    • Missing annotation:

It would be helpful to annotate directly in PAINT, or to have a faster turnaround between Protein2GO and the GO database.
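
As an illustration of the "BP versus regulation of BP" check described above, the following is a minimal sketch of how families annotated to both process X and the regulation of process X might be flagged automatically. It is not part of PAINT: the function name, the tuple layout of the input, and the family and gene identifiers are all hypothetical, and matching terms by label prefix is a simplifying assumption (a real check would more likely follow the regulation relations in the ontology itself).

```python
from collections import defaultdict

# Label prefixes treated as "regulation of X" (assumption: matching on term
# labels rather than on ontology relations). Longer prefixes come first.
REGULATION_PREFIXES = (
    "positive regulation of ",
    "negative regulation of ",
    "regulation of ",
)


def flag_process_and_regulation(annotations):
    """Return {family_id: [process labels]} for families that carry
    annotations to both a process and the regulation of that process.

    `annotations` is an iterable of hypothetical
    (family_id, gene_id, term_label, evidence_code) tuples.
    """
    labels_by_family = defaultdict(set)
    for family_id, _gene_id, term_label, _evidence in annotations:
        labels_by_family[family_id].add(term_label)

    flagged = {}
    for family_id, labels in labels_by_family.items():
        conflicts = set()
        for label in labels:
            for prefix in REGULATION_PREFIXES:
                if label.startswith(prefix):
                    target = label[len(prefix):]
                    # Flag only if the regulated process is also annotated.
                    if target in labels:
                        conflicts.add(target)
                    break
        if conflicts:
            flagged[family_id] = sorted(conflicts)
    return flagged


# Hypothetical example data: family and gene identifiers are made up.
example_annotations = [
    ("FAM_A", "gene1", "apoptotic process", "IDA"),
    ("FAM_A", "gene2", "regulation of apoptotic process", "IMP"),
    ("FAM_B", "gene3", "DNA repair", "IDA"),
    ("FAM_B", "gene4", "regulation of cell growth", "IMP"),
]

flagged_families = flag_process_and_regulation(example_annotations)
```

Families returned by such a check would still need curator review; the sketch only narrows down where to look.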