GOA December 2016
UniProt Gene Ontology Annotation (GOA) Project Summary 2016
EMBL-EBI has been a member of the GO Consortium since 2001. One of the major activities is the UniProt Gene Ontology Annotation project which is delivered by staff from the Protein Function Content and Development teams. The core UniProt-GOA project staff are primarily responsible for supplying the GO Consortium with manual and electronic GO annotations to the human proteome. UniProt-GOA staff not only create manual annotations, but also coordinate and check the integration of GO annotations from other curation efforts at the EMBL-EBI (including from InterPro, IntAct and Reactome). The UniProt-GOA dataset is supplemented with manual annotations from 35 annotating groups, including all members of the GO Consortium, as well as a number of external groups which produce relevant functional data. Nine electronic annotation pipelines are incorporated into the UniProt-GOA dataset, which provide the vast majority of annotations for non-model organism species. UniProt-GOA is therefore able to consolidate multiple sources of specialised knowledge, ensuring the UniProt-GOA resource remains a key up-to-date reference for a large number of research communities.
In addition, all UniProt Knowledgebase (UniProtKB) curators in the Protein Function Content team at EMBL-EBI, the SIB Swiss Institute of Bioinformatics (SIB) and the Protein Information Resource (PIR) are actively involved in curating UniProtKB entries with Gene Ontology terms during the UniProt literature curation process, providing both high-quality manual GO annotations and contributions to electronic GO annotation pipelines. The multi-species nature of UniProtKB means that the GO Annotation project is able to assist in the GO curation of proteins from around 675,000 taxonomic groups.
Staff from the Protein Function Content and Development teams at EMBL-EBI who deliver the GOA project:
Claire O'Donovan, Protein Function Content Team Leader (Consortium PI)
Maria J. Martin, Protein Function Development Team Leader (Senior Personnel)
Michele Magrane, Annotation Coordinator (with responsibility for GOA since September 2016)
Melanie Courtot, GO/GOA Project Leader (until September 2016)
George Georghiou, GOA curator
Penelope Garmiri, GOA curator
Tony Sawford*, GOA programmer
Aleksandra Shypitsyna*, GOA curator
Tony Wardell, GOA programmer
* Funded partially by GOC.
UniProt contributors (EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, UK; SIB, Geneva, Switzerland; and PIR, Washington DC): Ioannis Xenarios, Lydie Bougueleret, Ghislaine Argoud-Puy, Andrea Auchinchloss, Kristian Axelsen, Marie-Claude Blatter, Emmanuel Boutet, Lionel Breuza, Alan Bridge, Elizabeth Coudert, Isabelle Cusin, Paula Duek Roggli, Anne Estreicher, Livia Famiglietti, Marc Feuermann, Arnaud Gos, Nadine Gruaz-Gumowski, Ursula Hinz, Chantal Hulo, Florence Jungo, Guillaume Keller, Kati Laiho, Philippe Lemercier, Damien Lieberherr, Michele Magrane, Patrick Masson, Ivo Pedruzzi, Klemens Pichler, Diego Poggioli, Sylvain Poux, Catherine Rivoire, Bernd Roechert, Michel Schneider, Elena Speretta, Hema Bye-A-Jee, Rossana Zaru, Andre Stutz, Shyamala Sundaram, Michael Tognolli
Twelve sets of UniProt-GOA release files were produced by the GOA project between January 2016 and November 2016. These included non-redundant sets of GO annotations to 13 specific proteomes as well as data releases for annotations of all proteins in UniProtKB.
The UniProt-GOA project currently provides GO annotations for 65% of UniProtKB entries. Altogether, UniProt-GOA now provides almost 321 million GO annotations for more than 49 million proteins in over 675,000 different taxonomic groups. UniProt-GOA provides 403,251 annotations for the 61,394 proteins in the human reference proteome.
UniProt-GOA UniProt gene association file release stats (comparison of January 2016 and November 2016 releases)
- The difference in the numbers is due to a change in the pipeline for predicting orthologs
- The difference in the numbers is due to a change in the GOC rules
Methods and strategies for annotation
Expert curation priorities:
1. Human proteins
2. Disordered proteins
3. Moonlighting proteins
4. Requests from user community
5. Proteins annotated during UniProt curation duties
6. Annotation corrections based on quality control reports
UniProt-GOA provides IEA annotations from the following methods:
- UniProt Keyword 2GO (SPKW2GO) (notes 1, 2)
- UniProt Subcellular Locations2GO (SPSL2GO) (notes 1, 2)
- Ensembl Compara (vertebrates)
- Ensembl Genomes Compara (plants, fungi)
1: mapping tables created and maintained by UniProt
2: electronic annotations generated by UniProt
UniProtKB curators add information to entries that is subsequently used in electronic GO annotation pipelines such as UniProtKB keywords2GO, UniProtKB subcellular location2GO, UniRule2GO and HAMAP2GO. Altogether, automatic annotation pipelines provide 318 million annotations to almost 49 million proteins.
Presentations and Publications
a. Publications
An expanded evaluation of protein function prediction methods shows an improvement in accuracy. Jiang Y, Oron TR, Clark WT, Bankapur AR, D'Andrea D, Lepore R, Funk CS, Kahanda I, Verspoor KM, Ben-Hur A, Koo da CE, Penfold-Brown D, Shasha D, Youngs N, Bonneau R, Lin A, Sahraeian SM, Martelli PL, Profiti G, Casadio R, Cao R, Zhong Z, Cheng J, Altenhoff A, Skunca N, Dessimoz C, Dogan T, Hakala K, Kaewphan S, Mehryary F, Salakoski T, Ginter F, Fang H, Smithers B, Oates M, Gough J, Törönen P, Koskinen P, Holm L, Chen CT, Hsu WL, Bryson K, Cozzetto D, Minneci F, Jones DT, Chapman S, Bkc D, Khan IK, Kihara D, Ofer D, Rappoport N, Stern A, Cibrian-Uhalte E, Denny P, Foulger RE, Hieta R, Legge D, Lovering RC, Magrane M, Melidoni AN, Mutowo-Meullenet P, Pichler K, Shypitsyna A, Li B, Zakeri P, ElShal S, Tranchevent LC, Das S, Dawson NL, Lee D, Lees JG, Sillitoe I, Bhat P, Nepusz T, Romero AE, Sasidharan R, Yang H, Paccanaro A, Gillis J, Sedeño-Cortés AE, Pavlidis P, Feng S, Cejuela JM, Goldberg T, Hamp T, Richter L, Salamov A, Gabaldon T, Marcet-Houben M, Supek F, Gong Q, Ning W, Zhou Y, Tian W, Falda M, Fontana P, Lavezzo E, Toppo S, Ferrari C, Giollo M, Piovesan D, Tosatto SC, Del Pozo A, Fernández JM, Maietta P, Valencia A, Tress ML, Benso A, Di Carlo S, Politano G, Savino A, Rehman HU, Re M, Mesiti M, Valentini G, Bargsten JW, van Dijk AD, Gemovic B, Glisic S, Perovic V, Veljkovic V, Veljkovic N, Almeida-E-Silva DC, Vencio RZ, Sharan M, Vogel J, Kansakar L, Zhang S, Vucetic S, Wang Z, Sternberg MJ, Wass MN, Huntley RP, Martin MJ, O'Donovan C, Robinson PN, Moreau Y, Tramontano A, Babbitt PC, Brenner SE, Linial M, Orengo CA, Rost B, Greene CS, Mooney SD, Friedberg I, Radivojac P. Genome Biol. 2016 Sep 7;17(1):184. doi: 10.1186/s13059-016-1037-6.
Extending gene ontology in the context of extracellular RNA and vesicle communication. Cheung KH, Keerthikumar S, Roncaglia P, Subramanian SL, Roth ME, Samuel M, Anand S, Gangoda L, Gould S, Alexander R, Galas D, Gerstein MB, Hill AF, Kitchen RR, Lötvall J, Patel T, Procaccini DC, Quesenberry P, Rozowsky J, Raffai RL, Shypitsyna A, Su AI, Théry C, Vickers K, Wauben MH, Mathivanan S, Milosavljevic A, Laurent LC. J Biomed Semantics. 2016 Apr 12;7:19. doi: 10.1186/s13326-016-0061-5.
b. Presentations including Talks, Tutorials and Teaching
Talk “The Gene Ontology and Its Annotation Sets”. Plant and Animal Genome Conference XXIV. Claire O’Donovan, San Diego, CA, January, 2016.
Training “The Gene Ontology (GO) & GOA”. Industry workshop. Melanie Courtot, March 2016.
Training “Ontologies for life sciences: examples from the Gene Ontology.” Earlham Institute summer school on bioinformatics. Melanie Courtot, May 2016.
Training “Describing your data – standards and ontologies.” Bioinformatics for PI course. Melanie Courtot, June 2016.
Webinar “QuickGO - Gene ontology annotation.” Aleksandra Shypitsyna, July 2016.
Talk “Annotation project: community curation standards and best practices.” Melanie Courtot, Aleksandra Shypitsyna, Elena Speretta, Alexander Holmes, Tony Sawford, Tony Wardell, Maria Martin and Claire O'Donovan. Biocuration, Geneva, Switzerland, April 2016.
“QuickGO: a web-based tool for Gene Ontology browsing, interpretation and analysis.” Aleksandra Shypitsyna, Melanie Courtot, Elena Speretta, Alexander Holmes, Tony Sawford, Tony Wardell, Sangya Pundir, Xavier Watkins, Maria Martin and Claire O'Donovan. Biocuration, Geneva, Switzerland, April 2016.
“Ten simple rules for biomedical ontology development.” International Conference on Biological Ontology & BioCreative. Melanie Courtot, James Malone, Chris Mungall. August 2016.
Ontology Development Contributions
- All curators continue to request new GO terms or updates to the ontology where necessary, using either TermGenie or the GitHub tracker.
Annotation Outreach and User Advocacy Efforts
- Aleksandra Shypitsyna and Penelope Garmiri trained 3 new curators in GO annotation.
- Melanie Courtot is on the rota for the GO Consortium helpdesk.
- Melanie Courtot and Aleksandra Shypitsyna are on the rota for the UniProt-GOA project helpdesk.
- The Protein Function teams support external annotation groups, such as AgBase, BHF-UCL, DFLAT at Tufts University, SIB and PIR, and, from this year, WormBase and SGD, by providing use of the Protein2GO curation tool.
- The Protein Function teams assist GO Consortium groups with migration of their annotations into the GOA files and UniProtKB, as well as providing access and training for the UniProt curation tool Protein2GO.
- Access to and training for the Protein2GO curation tool have been given to curators who recently joined several groups, such as UCL, HGNC and NTNU.
i. Improvements to the QuickGO user interface
Work to improve the QuickGO user interface has continued throughout 2016. This work also involves extending the range of features currently provided by QuickGO, as well as extensive testing for the new version of QuickGO and contributions to the user interface design.
ii. Improvements to the Protein2GO curation tool
As more GO Consortium curation groups migrate their annotations into the UniProt database and move to using Protein2GO as their sole curation tool for protein GO annotation, we continue to add more functionality to the tool.
- Support for the new with_string format, plus all of the ECO-code-specific usage constraints
At the 2015 GOC meeting in Washington and the 2016 GOC meeting in Geneva, a change to the format of the with/from annotation column ("with_string") was agreed, which allows components of the with_string to be separated by both pipes and commas. In addition, a new set of rules was agreed that governs the usage, and acceptable format, of with/from with the GO evidence codes. Protein2GO now fully supports this enhanced format and the usage rules.
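As an illustration only, the pipe-and-comma format described above could be parsed along the following lines. This is a minimal sketch, not the Protein2GO implementation: the function names are hypothetical, the example identifiers are invented, and it assumes the GAF convention that pipes separate alternative values while commas join values that apply together. The evidence-code-specific usage rules are not modelled here.

```python
def parse_with_string(with_string: str) -> list[list[str]]:
    """Split a with/from value into groups.

    Assumption: pipes separate alternatives, and commas join
    identifiers that apply together within one alternative.
    """
    groups = []
    for alternative in with_string.split("|"):
        members = [m.strip() for m in alternative.split(",") if m.strip()]
        if members:
            groups.append(members)
    return groups


def is_well_formed(with_string: str) -> bool:
    """Hypothetical format check: each component must look like PREFIX:LOCAL_ID."""
    groups = parse_with_string(with_string)
    return bool(groups) and all(
        ":" in member and all(member.split(":", 1))
        for group in groups
        for member in group
    )


if __name__ == "__main__":
    # Invented identifiers, used only to show the shape of the result.
    example = "UniProtKB:P12345|UniProtKB:Q67890,InterPro:IPR000001"
    print(parse_with_string(example))
    print(is_well_formed(example))
```

Keeping the result as a list of lists preserves the distinction between the two separators, so a later rule check can treat each inner list as one complete piece of supporting evidence.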
- Annotation provenance and other newly implemented features
A number of usability enhancements have been made to Protein2GO this year; one of the most useful is the ability to trace the provenance of annotations. This makes it easy to identify the original annotations and acknowledges all the parties that took part in the annotation process. For example, if a user selects an annotation inferred from sequence or structural similarity, they can immediately find the original manual annotation and the group that created it, which makes tasks such as checking the validity of annotations much easier. In addition, a Textpresso literature search function and a lookup function for UniProtKB identifiers with priority listing (SWISS/TrEMBL entries) have been implemented this year in response to GOC community requests. Protein2GO currently has 125 active users from 23 different affiliations. We continue to provide group-specific reports containing curation statistics and quality checks that require curator input. Protein2GO technical support is provided via the GitHub website: https://github.com/geneontology/go-ontology/issues.