PAINT progress report for 2015

From GO Wiki

Revision as of 09:35, 17 December 2015


Dec 2015

Prepared and Submitted by Huaiyu Mi and Pascale Gaudet on behalf of the PAINT working group

Curators

  • Marc Feuermann
  • Pascale Gaudet
  • Karen Christie
  • Huaiyu Mi
  • Donghui Li
  • Moni Munoz-Torres

Software

  • Suzanna Lewis
  • Heiko Dietze
  • Seth Carbon

Creation of GO annotations using phylogenetic inference

  • We have now annotated, in total, gene families covering approximately 7200 human genes. This represents about 36% of all protein-coding genes. As of this time last year, we had covered only 15% of human genes, so we covered an additional 21% this year (nearly meeting our original goal for year 4 of 24%, which assumed substantially greater resource allocation).
  • Of these 7200 human genes that could have potentially received additional GO annotations, the project added new annotations for about 6000 human genes (over 80%). A total of 25,000 annotations were added for these human genes (12,000 biological process, 7,000 molecular function and 6,000 cellular component annotations). This project is thus making a large impact on the computational representation of human gene function.
  • The project also added new annotations for an additional ~300,000 genes across 104 other genomes.
  • Other statistics: 2300 families have now been curated. This has resulted in the annotation of 5300 internal tree nodes, comprising 2400 molecular function annotations, 3200 biological process annotations and 2600 cellular component annotations. These annotations were propagated within the tree to annotate the 300,000 genes listed above, yielding a total of 565,000 biological process annotations, 430,000 molecular function annotations and 360,000 cellular component annotations.
  • Updated the phylogenetic trees. All gene trees were updated using the May 2015 release of the UniProt Reference Proteomes.
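
The human-gene figures quoted above can be cross-checked with a few lines of arithmetic. The following is only a sketch that uses the rounded numbers from this report; the variable names are illustrative, and the implied total of protein-coding genes is derived from those rounded numbers rather than stated anywhere in the report.

```python
# Rounded figures quoted in the report.
human_genes_covered = 7200
human_genes_newly_annotated = 6000
coverage_now_pct = 36
coverage_last_year_pct = 15
annotations_by_aspect = {
    "biological process": 12_000,
    "molecular function": 7_000,
    "cellular component": 6_000,
}

# Coverage gained during the year, in percentage points.
gain_pct = coverage_now_pct - coverage_last_year_pct

# Share of covered human genes that received at least one new annotation.
fraction_newly_annotated = human_genes_newly_annotated / human_genes_covered

# Sum of the per-aspect counts, to compare with the quoted total of 25,000.
total_annotations = sum(annotations_by_aspect.values())

# Total protein-coding genes implied by "7200 genes is about 36%" (derived, approximate).
implied_protein_coding_genes = human_genes_covered * 100 / coverage_now_pct

print(f"coverage gain: {gain_pct} percentage points")
print(f"newly annotated: {fraction_newly_annotated:.0%} of covered genes")
print(f"total new human annotations: {total_annotations}")
print(f"implied protein-coding genes: {implied_protein_coding_genes:.0f}")
```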

PAINT Software

  • Touchup: We designed and implemented an application for automatically updating the entire corpus of phylogeny-based annotations. It ensures that all PAINT annotations use the latest release of the GO and that all of the experimental evidence is up to date, re-propagates ancestral annotations to update current protein annotations when the protein family trees themselves are updated, and incorporates the latest quality controls recommended by the curators (e.g. improved taxon checks). This software also provides the underlying logic for the latest version of PAINT, currently a desktop application, and ultimately offers a solid foundation for the JavaScript version of PAINT to be developed in the coming year.
  • PAINT: A major refactoring of PAINT was carried out to utilize the touchup logic server. This is currently in an early release and is being tested by the curators. In addition, a number of user features were implemented and more subtle bugs were fixed. For example: a bug in Java 1.7 sorting was detected while we were increasing the speed of loading the ontology; version, date and user are now recorded in the GAF, the log file and the title bar; and considerable time was spent making PAINT smarter about ID resolution, to ensure that no gene products are introduced into the GO database that are in fact already present under a different ID.
  • PANTHER release wrangling: Whenever there is a new release of PANTHER (v10 was produced this year), a number of steps must be taken to incorporate the new release everywhere it is used. New families may be added, previous families may disappear, and annotated nodes may move from one family to another. Scripts are required to cope with all of these changes, and these have been written. The “salvage” script removes nodes from the old family GAF file and adds them to the new family GAF file; it also copies over any curator-added notes and commits the modified GAFs to the GO repository. In addition, before running either PAINT or Touchup, a check must be carried out to ensure that all the taxa included in this PANTHER release are accounted for by the current taxon checker. Finally, to load the families into AmiGO, the tree files must be corrected to standard Newick format (.nhx suffix) and converted into JSON objects. Similarly, the default database names used may change between PANTHER releases (e.g. WormBase changed to WB in PANTHER v10).
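
To illustrate the last conversion step described above (tree files in Newick format turned into JSON objects), here is a minimal sketch in Python. It is not the project's actual script: the function name `parse_newick`, the output keys (`name`, `length`, `children`) and the node labels in the example are all assumptions made for illustration, and the JSON layout that AmiGO expects is not specified in this report. The sketch handles plain Newick only; NHX bracketed comments and quoted labels are not handled.

```python
import json


def parse_newick(text):
    """Parse a plain Newick string into nested dicts (hypothetical layout)."""
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("Newick string must end with ';'")
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1  # consume "(" and read the comma-separated children
            while True:
                children.append(parse_node())
                if text[pos] == ",":
                    pos += 1
                    continue
                if text[pos] == ")":
                    pos += 1
                    break
                raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        # Read the optional "label:branch_length" up to the next delimiter.
        start = pos
        while text[pos] not in ",();":
            pos += 1
        label, _, length = text[start:pos].partition(":")
        node = {"name": label.strip()}
        if length:
            node["length"] = float(length)
        if children:
            node["children"] = children
        return node

    root = parse_node()
    if text[pos] != ";":
        raise ValueError("unexpected content after the root node")
    return root


# Example with made-up labels (not taken from a real PANTHER family).
tree = parse_newick("(A:0.1,(B:0.2,C:0.3)AN1:0.4)AN0;")
print(json.dumps(tree, indent=2))
```

A recursive-descent parser is used here because Newick is a small nested grammar; a production script would more likely rely on an established tree library and would need to preserve the NHX annotations.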

Plans for coming year (Software)

  • Develop Selenium/behave tests
  • Develop training material for PAINT
  • Deploy full JavaScript implementation of PAINT
  • Integrate JS-PAINT with Noctua and TermGenie
  • Ensure touchup is run regularly as part of the continuous integration pipeline