Berkeley Software Infrastructure

Report period: Dec 1, 2014 to Nov 30, 2015.
B. ACCOMPLISHMENTS
B.2. What’s been done this year
Aim 1. We will provide experimental annotations for human and major model organisms
  • We have made major improvements to the gene association submission, validation and ingestion pipeline. All submissions are validated by a continuous integration system, which performs a number of validations on the association files. Many of these are ontology-driven, such as taxonomy constraint checks.
  • We also perform automated inference on the association files, using OWL reasoning to ‘deepen’ annotations to more specific classes, making use of annotation extensions (Huntley et al).
  • Additionally, we implemented a new metadata system, in which all gene association files submitted to the Gene Ontology Consortium have a JSON file describing the contents of the file and submission details. This now drives the selection of association files available on the website, and helps automate many of the association file processing tasks.
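As an illustration of two ideas described under Aim 1 (metadata-driven selection of association files and an ontology-driven taxon check), a minimal Python sketch follows. Every identifier, field name and data value in it is a hypothetical stand-in; it does not reflect the actual GOC metadata schema or validation code.

```python
import json

# Hypothetical metadata records; the field names are assumptions made
# for illustration, not the Consortium's actual schema.
METADATA = json.loads("""
[
  {"id": "group_a", "file": "group_a.gaf", "status": "active"},
  {"id": "group_b", "file": "group_b.gaf", "status": "retired"}
]
""")

# Toy taxonomy, child -> parent (hypothetical identifiers).
TAXON_PARENT = {
    "EX:human": "EX:animals",
    "EX:animals": "EX:root",
    "EX:rice": "EX:plants",
    "EX:plants": "EX:root",
}

# Toy constraint table: term -> the taxon the term is restricted to.
ONLY_IN_TAXON = {"EX:plant_only_term": "EX:plants"}


def active_files(metadata):
    """Select the files whose metadata marks them as active."""
    return [m["file"] for m in metadata if m["status"] == "active"]


def is_within(taxon, ancestor):
    """True if `ancestor` is `taxon` itself or one of its ancestors."""
    while taxon is not None:
        if taxon == ancestor:
            return True
        taxon = TAXON_PARENT.get(taxon)
    return False


def violates_taxon_constraint(term, taxon):
    """True if `term` is restricted to a taxon that does not contain `taxon`."""
    required = ONLY_IN_TAXON.get(term)
    return required is not None and not is_within(taxon, required)


print(active_files(METADATA))
print(violates_taxon_constraint("EX:plant_only_term", "EX:human"))
```

In this sketch the metadata decides which files are processed at all, and the taxon check then flags individual annotations whose term is restricted to a different branch of the taxonomy.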
Aim 2. (not funded) We will extend the GO into emerging genome annotation communities and for key biological processes
Aim 3. (partially funded) We will perform phylogenetically-based propagation of annotations
  • Touchup: We designed and implemented an application that automatically updates the entire corpus of phylogenetically based annotations. It ensures that all PAINT annotations use the latest release of the GO and that all of the experimental evidence is up to date, re-propagates ancestral annotations to update current protein annotations when the protein family trees themselves change, and incorporates the latest quality controls recommended by the curators (e.g. improved taxon checks). This software also provides the underlying logic for the latest version of PAINT, which is currently a desktop application, and it offers a solid foundation for the JavaScript version of PAINT to be developed in the coming year.
  • PAINT: A major refactoring of PAINT was carried out to utilize the touchup logic server. This version is currently in an early release and is being tested by the curators. In addition, a number of user features were improved and more subtle bugs were fixed. For example: a bug in Java 1.7 sorting was detected while we were speeding up ontology loading; the version, date and user are now recorded in the GAF, the log file and the title bar; and considerable time was spent improving ID resolution in PAINT, to ensure that no gene product is introduced into the GO database when it is already present under a different ID.
  • PANTHER release wrangling: Whenever there is a new release of PANTHER (v10 was produced this year), a number of steps must be taken to incorporate the new release everywhere it is used. New families may be added, previous families may disappear, and annotated nodes may move from one family to another. Scripts are required to cope with all of these changes, and these have been written. The “salvage” script removes nodes from the old family GAF file and adds them to the new family GAF file; it also copies over any curator-added notes and commits the modified GAFs to the GO repository. In addition, before running either PAINT or Touchup, a check must be carried out to ensure that all the taxa included in the PANTHER release are accounted for by the current taxon checker. The default database names used may also change between PANTHER releases (e.g. WormBase changed to WB in PANTHER v10). Finally, to load the families into AmiGO, the tree files must be corrected to standard Newick format (.nhx suffix) and converted into JSON objects.
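The regrouping step performed by the “salvage” script can be pictured with the following minimal Python sketch. It is not the actual script: the data structures, family and node identifiers are hypothetical, and combining the database rename (WormBase to WB, the one example given in this report) with the regrouping is an assumption made to keep the example short.

```python
from collections import defaultdict

# Database prefix renames between releases. WormBase -> WB is the example
# given in the report; the mapping structure itself is an assumption.
DB_RENAMES = {"WormBase": "WB"}


def rename_db(row):
    """Return a copy of an annotation row with its first (database) column updated."""
    return [DB_RENAMES.get(row[0], row[0])] + row[1:]


def salvage(old_family_rows, new_family_of):
    """Regroup annotation rows by the family each node belongs to in the new release.

    old_family_rows: {family_id: [(node_id, row), ...]} from the previous release
    new_family_of:   {node_id: family_id} for the new release (hypothetical input)
    Returns (regrouped, orphaned): rows keyed by their new family, and rows
    whose node no longer appears in any family.
    """
    regrouped = defaultdict(list)
    orphaned = []
    for entries in old_family_rows.values():
        for node_id, row in entries:
            target = new_family_of.get(node_id)
            if target is None:
                orphaned.append((node_id, row))
            else:
                regrouped[target].append((node_id, rename_db(row)))
    return dict(regrouped), orphaned


# Hypothetical example data.
old_rows = {
    "FAM_OLD": [
        ("node1", ["WormBase", "obj1", "EX:term"]),
        ("node2", ["EXDB", "obj2", "EX:term"]),
        ("node3", ["EXDB", "obj3", "EX:term"]),
    ]
}
new_families = {"node1": "FAM_NEW", "node2": "FAM_OLD"}

regrouped, orphaned = salvage(old_rows, new_families)
print(regrouped)
print(orphaned)
```

Keeping the orphaned rows in a separate list, rather than silently discarding them, leaves a record of annotations that a curator may need to review after a release change.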
Aim 4. We will develop a Common Annotation Framework
  • Noctua software development: We have progressed our next-generation collaborative web-based GO common annotation environment ‘Noctua’ to beta release and are rolling it out for use in curation. We transitioned the back-end storage to GitHub, allowing full tracking and rollback. We have implemented numerous features on the front-end, meeting a series of milestones. These include full editing of evidence using the evidence ontology, the ability to import existing legacy annotations and upgrade them, and export to legacy GAF format. In addition, all supporting libraries have been generalized and refactored, allowing for the easy creation of extensions and new clients.
  • Noctua documentation and training: We have created a series of videos demonstrating how to use Noctua (linked from http://noctua.berkeleybop.org). We have created training material and are providing a curator training workshop in Geneva in 2015.
  • Noctua data integration: We have developed a pipeline for exporting Noctua annotations to GAF format, allowing curators to use Noctua as a replacement for existing GO annotation (note that the conversion from OWL to GAF is lossy, but the fully expressive annotation is retained as part of the stored Noctua models).
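To make the note about lossy conversion concrete, here is a toy Python sketch of why flattening a connected model into one row per annotation loses information. The model structure, relation name and identifiers are all hypothetical; this is not the Noctua data model or the actual export pipeline.

```python
# Toy model: each activity ties a gene product to a function term, and
# activities can be linked to one another. All identifiers are hypothetical.
MODEL = {
    "activities": [
        {"id": "a1", "gene": "EX:gene1", "term": "EX:function1"},
        {"id": "a2", "gene": "EX:gene2", "term": "EX:function2"},
    ],
    "links": [
        {"from": "a1", "relation": "EX:connected_to", "to": "a2"},
    ],
}


def export_flat_rows(model):
    """Flatten a model to one (gene, term) row per activity.

    The flat rows have no column for links between activities,
    so that part of the model is dropped on export.
    """
    return [(a["gene"], a["term"]) for a in model["activities"]]


rows = export_flat_rows(MODEL)
dropped_links = len(MODEL["links"])
print(rows)
print(dropped_links)
```

The flat rows are still useful to existing consumers, while the original model remains the complete record, which matches the arrangement described for the stored Noctua models.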
Aim 5. We will maintain and upgrade the Gene Ontologies
  • TermGenie: We have continued to develop TermGenie, adding new features (e.g. GitHub support) and templates (e.g. templates for anatomy using Uberon). TermGenie now accounts for 80% (1362 of 1710) of new terms added to the GO. Seven new templates have been added to the GO TermGenie instance, which now has 48 templates in total. We have also implemented TermGenie instances for the cell type ontology and other ontologies.
  • Ontology Release Tooling: We have developed a new tool for managing the release process of complex ontologies such as the GO, incorporating reasoning, classification, validation, and file format conversion. The tool is openly available from https://github.com/ontodev/robot/ and was demonstrated at the 2015 International Conference on Biomedical Ontologies (http://ceur-ws.org/Vol-1515/demo6.pdf).
  • Ontology tracking and infrastructure: The GO had hosted all of its issue trackers on SourceForge since 2000. These are vital both to the continued operation of the GO and as a provenance trail, providing a knowledge base of decisions made regarding the structure of the ontology. Due to ongoing issues at SourceForge, we migrated our trackers, including their complete history, to GitHub. We developed a framework that has since been used by multiple groups migrating from SourceForge to GitHub (https://github.com/cmungall/gosf2github/). We also took this opportunity to centralize multiple pieces of metadata in YAML on our GitHub site, improving the efficiency of multiple operations within GO. This includes metadata on all GO editors and curators, which is used by the common annotation framework and TermGenie.
  • Integration with external ontologies: We have integrated the GO with external ontologies, including the cell-type ontology and Uberon. We have also extended and improved the cell type ontology in multiple areas and are working with external groups to apply this for functional data annotation and interpretation (FANTOM5, ENCODE).
  • Molecular Function: A proposal for a major refactoring of molecular function (MF) was completed; it groups high-level and intermediate MF classes by biological context. In support of these revisions, we are currently developing unit tests for the logical definitions of these terms. Code for testing improvements to MF axiomatisation, and for defining the design patterns used to construct compound MF terms, is in development. This work includes specifying strategies and patterns for improving the axiomatisation of simple MFs, such as enzyme activities, using Rhea and its mappings to ChEBI.
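The template-driven approach to term creation mentioned for TermGenie, and the design patterns mentioned for compound MF terms, can be illustrated with a minimal Python sketch. The template class, its fields and the example template are hypothetical and are not taken from the GO TermGenie instance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Template:
    """A hypothetical term template with a placeholder for one input class."""

    name_pattern: str
    definition_pattern: str

    def instantiate(self, label):
        """Fill the template for the class with the given label."""
        return {
            "name": self.name_pattern.format(x=label),
            "definition": self.definition_pattern.format(x=label),
        }


# Illustrative template only; the wording is an assumption.
REGULATION = Template(
    name_pattern="regulation of {x}",
    definition_pattern="Any process that modulates the frequency, rate or extent of {x}.",
)

new_term = REGULATION.instantiate("example process")
print(new_term["name"])
print(new_term["definition"])
```

Because the name and the definition are generated from the same input, terms produced this way stay consistent with each other, which is one reason a pattern-based system can handle a large share of new term requests.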
Aim 6. We will provide annotations and ontologies to the broad genetics community, supporting the use of the Gene Ontology resources
  • AmiGO: AmiGO, the GOC’s tool for querying, browsing, and visualizing the GO database, continues to be updated regularly with many improvements and increased documentation. This year’s upgrades continued to expand the variety of search modes, graph traversals (and visualization), and available data types. Over 80 tickets have been closed, covering items such as improved filtering, bringing experimental results for human genes to the top of the results lists for gene products, and loading of PANTHER family data. Nearly all of these feature refinements were deployed without disruption to services (hot swap). Access to the GO enrichment tool has likewise been greatly improved.
  • Public servers of GO: Considerable work has been carried out in preparation for the transition of all GO services to Berkeley, including an increased number of unit tests and improved status reporting. For example, load balancing was improved so that users never experience discontinuity during sessions that coincide with regular reloading (which takes about one hour).
  • GOC website: Throughout this year we have continued to improve the documentation on our website, refining content for both clarity and brevity. Before these focused efforts, some areas of our documentation lacked important details, and others spread related information across two or more pages, which made those pages at times complementary but more often redundant. By going through each section and updating the information, we have produced documentation pages that are both more complete and more concise, and that bring together previously dispersed details. This makes information easier to find and use. For example, researchers wishing to contribute to the Gene Ontology Consortium (GOC), either by suggesting updates to the ontology or by providing annotations, can review the options described on our pages about “Contributing to GO” (http://geneontology.org/page/contributing-go) and “Submitting GO Annotations” (http://geneontology.org/page/submitting-go-annotations).
  • FAQ: A total of 83 frequently asked questions (FAQs) are now available on the GO website, covering a variety of topics from annotation, to analysis, to mappings. Special emphasis was also given to improving documentation on how to conduct term enrichment analyses using tools supported by the GOC as well as third-party tools, and documentation on how to submit annotations and contribute to the ontology.
  • First GO Symposium Day: In August 2015, as part of the annual meeting of the Gene Ontology Consortium in Washington, DC, we organized a day of talks and discussions centered around the GO. The GO Symposium day was open to the general public, and attendees included researchers, faculty, and students from local institutions as well as NIH senior personnel. The sessions included two talks and two workshops. The talks were offered by our guest speakers Dr. Donna Slonim (http://www.cs.tufts.edu/~slonim/) and Dr. Trey Ideker (http://healthsciences.ucsd.edu/som/medicine/research/labs/ideker/Pages/default.aspx). The workshop on Annotation included details on the use and implementation of GOC standards and tools. Lastly, the second workshop was focused on Term Enrichment Analysis using the resources of the GOC. Further details regarding the symposium are available from http://wiki.geneontology.org/index.php/2015_Washington_DC_GOC_Symposium_Agenda
B.6 ‘What do you plan to do during next reporting period to accomplish the goals?’
Aim 1. We will provide experimental annotations for human and major model organisms:
  • Move annotation submission pipeline to Berkeley
  • Replace MySQL legacy database with graph database
  • Improve our production statistics reporting (available on the public website now)
  • Tighten and improve the infrastructure for responding to challenges to existing literature annotations
  • Continue to support and extend our continuous integration system (using Jenkins) for monitoring, quality control, and publication of annotations. The monitoring site includes statistics and metrics of data quality.
Aim 3. (partially funded) We will perform phylogenetically-based propagation of annotations:
  • Develop Selenium/behave tests
  • Develop training material for PAINT
  • Deploy full JavaScript implementation of PAINT
  • Integrate JS-PAINT with Noctua and TermGenie
  • Ensure touchup is run regularly as part of the continuous integration pipeline
Aim 4. We will develop a Common Annotation Framework
  • Develop training material for Noctua/CAF
  • Implement new features based on feedback from Geneva meeting
  • Develop Selenium/behave tests
Aim 5. We will maintain and upgrade the Gene Ontologies
  • Complete the refactoring of Molecular Function
  • Implement pattern-based Term generation system
  • Continue to work on integration between CL, UBERON, and other ontologies with GO
  • Provide support to the NCI for replacement of NCIt with GO
  • Provide ongoing support to curators
Aim 6. We will provide annotations and ontologies to the broad genetics community, supporting the use of the Gene Ontology resources
  • Release AmiGO 2.4
  • Develop additional improvements to Term Enrichment
  • Continue to improve website usability
  • Continue to provide user support
  • Continue to improve the AmiGO interface
  • Continue to support online web services
C. OVERALL PRODUCTS
C.1. Publications
  • Dietze H, Berardini TZ, Foulger RE, Hill DP, Lomax J, Osumi-Sutherland D, Roncaglia P, Mungall CJ. TermGenie - a web-application for pattern-based ontology class generation. J Biomed Semantics. 2014 Dec 11;5:48. doi: 10.1186/2041-1480-5-48. eCollection 2014. PubMed PMID: 25937883; PubMed Central PMCID: PMC4417543.
  • Gene Ontology Consortium. Gene Ontology Consortium: going forward. Nucleic Acids Res. 2015 Jan;43(Database issue):D1049-56. doi: 10.1093/nar/gku1179. Epub 2014 Nov 26. PubMed PMID: 25428369; PubMed Central PMCID: PMC4383973.
  • Yoshihara M, Ohmiya H, Hara S, Kawasaki S; FANTOM consortium, Hayashizaki Y, Itoh M, Kawaji H, Tsujikawa M, Nishida K. Discovery of molecular markers to discriminate corneal endothelial cells in the human body. PLoS One. 2015 Mar 25;10(3):e0117581. doi:10.1371/journal.pone.0117581. eCollection 2015. Erratum in: PLoS One. 2015;10(5):e0129412. PubMed PMID: 25807145; PubMed Central PMCID: PMC4373821.
  • Bastian FB, Chibucos MC, Gaudet P, Giglio M, Holliday GL, Huang H, Lewis SE, Niknejad A, Orchard S, Poux S, Skunca N, Robinson-Rechavi M. The Confidence Information Ontology: a step towards a standard for asserting confidence in annotations. Database (Oxford). 2015 May 9;2015:bav043. doi: 10.1093/database/bav043. Print 2015. PubMed PMID: 25957950; PubMed Central PMCID: PMC4425939.
  • Haendel MA, Vasilevsky N, Brush M, Hochheiser HS, Jacobsen J, Oellrich A, Mungall CJ, Washington N, Köhler S, Lewis SE, Robinson PN, Smedley D. Disease insights through cross-species phenotype comparisons. Mamm Genome. 2015 Oct;26(9-10):548-55. doi:10.1007/s00335-015-9577-8. Epub 2015 Jun 20. PubMed PMID: 26092691; PubMed Central PMCID: PMC4602072.
  • Boeckmann B, Marcet-Houben M, Rees JA, Forslund K, Huerta-Cepas J, Muffato M, Yilmaz P, Xenarios I, Bork P, Lewis SE, Gabaldón T; Quest for Orthologs Species Tree Working Group. Quest for Orthologs Entails Quest for Tree of Life: In Search of the Gene Stream. Genome Biol Evol. 2015 Jul 1;7(7):1988-99. doi: 10.1093/gbe/evv121. PubMed PMID: 26133389; PubMed Central PMCID: PMC4524488.
  • Groza T, Köhler S, Moldenhauer D, Vasilevsky N, Baynam G, Zemojtel T, Schriml LM, Kibbe WA, Schofield PN, Beck T, Vasant D, Brookes AJ, Zankl A, Washington NL, Mungall CJ, Lewis SE, Haendel MA, Parkinson H, Robinson PN. The Human Phenotype Ontology: Semantic Unification of Common and Rare Disease. Am J Hum Genet. 2015 Jul 2;97(1):111-24. doi:10.1016/j.ajhg.2015.05.020. Epub 2015 Jun 25. PubMed PMID: 26119816.
  • Buske OJ, Schiettecatte F, Hutton B, Dumitriu S, Misyura A, Huang L, Hartley T, Girdea M, Sobreira N, Mungall C, Brudno M. The Matchmaker Exchange API: Automating Patient Matching Through the Exchange of Structured Phenotypic and Genotypic Profiles. Hum Mutat. 2015 Oct;36(10):922-7. doi: 10.1002/humu.22850. PubMed PMID: 26255989.
  • Bone WP, Washington NL, Buske OJ, Adams DR, Davis J, Draper D, Flynn ED, Girdea M, Godfrey R, Golas G, Groden C, Jacobsen J, Köhler S, Lee EM, Links AE, Markello TC, Mungall CJ, Nehrebecky M, Robinson PN, Sincan M, Soldatos AG, Tifft CJ, Toro C, Trang H, Valkanas E, Vasilevsky N, Wahl C, Wolfe LA, Boerkoel CF, Brudno M, Haendel MA, Gahl WA, Smedley D. Computational evaluation of exome sequence data using human and model organism phenotypes improves diagnostic efficiency. Genet Med. 2015 Nov 12. doi:10.1038/gim.2015.137. PubMed PMID: 26562225.
  • Philippakis AA, Azzariti DR, Beltran S, Brookes AJ, Brownstein CA, Brudno M, Brunner HG, Buske OJ, Carey K, Doll C, Dumitriu S, Dyke SO, den Dunnen JT, Firth HV, Gibbs RA, Girdea M, Gonzalez M, Haendel MA, Hamosh A, Holm IA, Huang L, Hurles ME, Hutton B, Krier JB, Misyura A, Mungall CJ, Paschall J, Paten B, Robinson PN, Schiettecatte F, Sobreira NL, Swaminathan GJ, Taschner PE, Terry SF, Washington NL, Züchner S, Boycott KM, Rehm HL. The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery. Hum Mutat. 2015 Oct;36(10):915-21. doi: 10.1002/humu.22858. PubMed PMID: 26295439; PubMed Central PMCID: PMC4610002.
  • Mungall CJ, Washington NL, Nguyen-Xuan J, Condit C, Smedley D, Köhler S, Groza T, Shefchek K, Hochheiser H, Robinson PN, Lewis SE, Haendel MA. Use of Model Organism and Disease Databases to Support Matchmaking for Human Disease Gene Discovery. Hum Mutat. 2015 Oct;36(10):979-84. doi: 10.1002/humu.22857. Epub 2015 Sep 8. PubMed PMID: 26269093.
C.2. Website(s) or Other Internet Site(s)

GOC website: http://geneontology.org/

AmiGO: http://amigo.geneontology.org/amigo

C.3. Technologies or Techniques
  • All of the source code and documentation for the Gene Ontology is available on GitHub: https://github.com/geneontology.
  • In addition, because most of the ontology infrastructure we have developed is generally useful to the broader bioinformatics community, a substantial body of source code can be found here: https://github.com/owlcollab/owltools (this includes, for example, the Taxon checker web service software used for validating the applicability of GO classes)
  • Because interoperating ontologies are a necessary part of our ontology engineering strategy, much of the work accomplished as part of GO includes major contributions to the OBO Foundry infrastructure: https://github.com/OBOFoundry/OBOFoundry.github.io
  • As one small example of the reuse of GO’s technology, AmiGO 2 was adopted this year by the NSF Planteome project (http://planteome.org).
D. OVERALL PARTICIPANTS

Suzanna Lewis, Christopher Mungall, Seth Carbon, Heiko Dietze, and Monica Munoz-Torres

E. OVERALL IMPACT
F. OVERALL CHANGES

Nothing significant