GOA December 2012
Gene Ontology Annotation at UniProt Summary, December 2012
The UniProt GO Annotation project (UniProt-GOA) at the European Bioinformatics Institute has been a member of the GO Consortium since 2001. All UniProt curators are actively involved in curating UniProtKB entries with Gene Ontology terms during the UniProt literature curation process, providing high-quality manual GO annotations in addition to contributing to electronic GO annotation pipelines. The multi-species nature of UniProtKB means that the GO Annotation project is able to assist in the GO curation of proteins from over 370,000 taxonomic groups.
The core UniProt-GOA project staff are primarily responsible for supplying the GO Consortium with manual and electronic GO annotations to the human proteome. UniProt-GOA staff not only manually annotate, but coordinate and check the integration of GO annotations from other curation efforts at the EBI (including from InterPro, IntAct and Reactome). The UniProt-GOA dataset is supplemented with manual annotations from 27 external groups, including all members of the GO Consortium, as well as a number of external groups which produce relevant functional data. Nine electronic annotation pipelines are incorporated into the UniProt-GOA dataset, which provide the vast majority of annotations for non-model organism species. UniProt-GOA is therefore able to consolidate multiple sources of specialised knowledge, ensuring the UniProt-GOA resource remains a key up-to-date reference for a large number of research communities.
Swiss-Prot contributors (EBI, Hinxton, UK; SIB, Geneva, Switzerland; and PIR, Washington DC): Ioannis Xenarios, Lydie Bougueleret
Ghislaine Argoud-Puy, Andrea Auchinchloss, Kristian Axelsen, Marie-Claude Blatter, Emmanuel Boutet, Lionel Breuza, Alan Bridge, Wei Mun Chan, Gayatri Chavali, Elizabeth Coudert, Isabelle Cusin, Paula Duek Roggli, Anne Estreicher, Livia Famiglietti, Marc Feuermann, Arnaud Gos, Nadine Gruaz-Gumowski, Reija Hieta, Ursula Hinz, Chantal Hulo, Janet James, Florence Jungo, Guillaume Keller, Kati Laiho, Duncan Legge, Philippe Lemercier, Damien Lieberherr, Michele Magrane, Patrick Masson, Ivo Pedruzzi, Klemens Pichler, Diego Poggioli, Sylvain Poux, Catherine Rivoire, Bernd Roechert, Michel Schneider, Andre Stutz, Shyamala Sundaram, Michael Tognolli
* Funded entirely or partially by GO
1: Left July 2012
Between January 2012 and November 2012, the UniProt-GOA project provided the GO Consortium with 11 annotation file releases, including non-redundant sets of GO annotations to 13 specific proteomes, as well as data releases for annotations of all proteins in UniProtKB.
UniProt incorporates manual annotations from other GO Consortium members and affiliates and displays these annotations in the relevant UniProtKB entries. Currently, the UniProt-GO Annotation project provides GO annotations for 68% of UniProt entries. Altogether, UniProt-GOA now provides almost 127 million GO annotations for almost 19 million proteins in over 370,000 different taxonomic groups. UniProt-GOA provides 354,486 annotations for the human proteome.
*Reduction in InterPro2GO annotations due to change in representation of ‘with’ field contents
**New sources of annotation after January 2012
Methods and strategies for annotation
The renal annotation project, funded by Kidney Research UK and under the direction of Yasmin Alam-Faruque, has been very successful. The project ended in April 2012 and resulted in the provision of 2,810 proteins with 43,858 annotations. As a result of the renal project, 479 new GO terms were created, allowing curators to choose much more specific terms when annotating kidney function and process. A paper summarizing the project is in preparation.
During 2012, Prudence Mutowo-Meullenet completed a project to annotate all of the proteins in the human peroxisome. The project has enabled us to provide a list of 88 proteins that have been experimentally determined to be located in the peroxisome. These proteins have been given full functional annotation using the available literature, resulting in a total of 218 manual annotations for this set of proteins. A further 296 proteins were partly annotated during the process, leading to a total of 1,589 annotations. A paper describing this project has been submitted for publication. Prudence has now started a similar project to annotate proteins present in the exosome.
Computational annotation strategies:
UniProt-GOA provides IEA annotations from the following methods:
- UniProt Keyword 2GO (SPKW2GO) [1,2]
- UniProt Subcellular Locations2GO (SPSL2GO) [1,2]
- Ensembl Compara (vertebrates)
- Ensembl Genomes Compara (plants, fungi)
1: mapping tables created and maintained by UniProt
2: electronic annotations generated by UniProt
UniProt curators supply information to entries that is subsequently used in electronic GO annotation pipelines such as UniProtKB keywords2GO, UniProtKB subcellular location2GO and HAMAP2GO. Altogether, automatic annotation pipelines provide 125 million annotations to almost 19 million proteins.
Two new automatic pipelines were incorporated by UniProt in 2012. UniPathway2GO, a collaboration between UniProt, INRIA (Rhone-Alpes) and Laboratoire d'Ecologie Alpine (Grenoble), provides GO annotations describing the metabolic pathways that proteins are involved in and was initiated in May 2012. The second pipeline, which uses orthology data from Ensembl Compara to project GO annotations between fungal proteins, was initiated by the Ensembl Genomes group and incorporated into the UniProt gene association file in July 2012.
UniProt-GOA now maintains an annotation blacklist, which lists UniProtKB accessions together with the GO identifiers that they should not be associated with. This is especially useful for suppressing incorrect annotations from electronic methods that predict GO terms for groups of proteins, where a term may not be correct for every member of the group. For example, some electronic annotation sources apply a cut-off such that, if an annotation is correct for 95% of the proteins in a set, the GO annotation is added to the whole set.
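The blacklist idea can be illustrated with a minimal Python sketch. The data model (an `Annotation` tuple, a set of accession/GO identifier pairs), the function name `apply_blacklist` and the placeholder accessions are assumptions made for illustration, not the actual UniProt-GOA implementation. The sketch also assumes that only IEA-evidenced annotations are suppressed.

```python
from typing import Iterable, List, NamedTuple, Set, Tuple


class Annotation(NamedTuple):
    accession: str  # UniProtKB accession (placeholder values used below)
    go_id: str      # GO identifier
    evidence: str   # evidence code, e.g. "IEA" for electronic annotations


def apply_blacklist(
    annotations: Iterable[Annotation],
    blacklist: Set[Tuple[str, str]],
) -> List[Annotation]:
    """Drop electronic annotations whose (accession, GO ID) pair is blacklisted.

    Assumption: only IEA annotations are suppressed; manual ones are kept.
    """
    return [
        a
        for a in annotations
        if not (a.evidence == "IEA" and (a.accession, a.go_id) in blacklist)
    ]


if __name__ == "__main__":
    # Placeholder accessions and GO identifiers, for illustration only.
    annotations = [
        Annotation("X00001", "GO:0000001", "IEA"),
        Annotation("X00001", "GO:0000002", "IEA"),
        Annotation("X00001", "GO:0000001", "IDA"),
    ]
    blacklist = {("X00001", "GO:0000001")}
    for kept in apply_blacklist(annotations, blacklist):
        print(kept)
```

Keying the blacklist on the accession and GO identifier pair means that a prediction can be suppressed for one protein without affecting the same prediction for the other members of the group.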
Priorities for annotation
1. Proteins associated with the exosome (Prudence)
2. Requests from user community (all curators)
3. Proteins annotated during Swiss-Prot curation duties (all Swiss-Prot/UniProtKB curators at the EBI and SIB)
Presentations and Publications
a. Publications
The UniProt Consortium. Reorganizing the protein space at the Universal Protein Resource (UniProt) 2012 Nucleic Acids Res 40 (Database issue): D71-D75. PMCID: PMC3245120
Dimmer EC, Huntley RP, Alam-Faruque Y, Sawford T, O’Donovan C, Martin MJ et al. The UniProt-GO Annotation database in 2011. 2012 Nucleic Acids Res 40 (Database issue): D565-570. PMCID: PMC3245010
b. Presentations including Talks and Tutorials and Teaching
Huntley RP. Introduction to the Gene Ontology and GO annotation resources. 15 Jan 2012 Plant and Animal Genome Conference, San Diego, CA USA (Talk)
Mutowo-Meullenet P. UniProt-GOA activities update. 12 July 2012 Hinxton Scientific Forum, Wellcome Trust Genome Campus, Hinxton UK (Talk)
Mutowo-Meullenet P. Peroxisome annotation enrichment. 4 Sept 2012 EBI Ontology Workshop, Wellcome Trust Genome Campus, Hinxton UK (Talk)
c. Poster presentations
Alam-Faruque Y. The UniProt-GOA project. 2 Apr 2012 International Society for Biocuration Conference, Washington DC USA (Poster)
Mutowo-Meullenet P. The UniProt-GOA project. 8 Sept 2012 The first 10 years of UniProt ECCB12 Satellite Symposium (Poster)
Mutowo-Meullenet P. The UniProt-GOA project. 20 Nov 2012 EBI Open Day, Wellcome Trust Genome Campus, Hinxton UK (Poster)
Ontology Development Contributions:
- During the course of the peroxisome annotation project, Prudence requested the addition of 63 terms to the ontology covering processes that occur in the peroxisome.
- The renal annotation project coordinated by Yasmin Alam-Faruque ended in April 2012. A total of 479 terms related to kidney development and processes were added to the ontology during the course of this project.
- Emily Dimmer was part of the apoptosis working group set up to discuss changes to the apoptosis node in GO. Emily Dimmer and Rachael Huntley both contributed to the Apoptosis Content Meeting held at the EBI in June 2012.
Annotation Outreach and User Advocacy Efforts:
- Emily Dimmer and Rachael Huntley provided GO annotation training to two new UniProt (EBI) curators.
- Rachael Huntley provided an introduction to GO annotation to a new InterPro curator.
- Rachael Huntley and Prudence Mutowo-Meullenet continue to answer queries sent to the GO helpdesk.
- Rachael Huntley, Prudence Mutowo-Meullenet and Yasmin Alam-Faruque continue to answer user queries sent to the UniProt-GOA project.
- UniProt is continuing to support external annotation groups, such as AgBase, BHF-UCL, DFLAT at Tufts University and SIB, by providing use of the Protein2GO curation tool.
- UniProt is assisting GO Consortium groups with migration of their annotations into the UniProt database, as well as providing access and training for the UniProt curation tool Protein2GO. WormBase are currently undergoing this transition, to be followed by SGD and other groups in 2013.
- As part of the exosome annotation project, Prudence Mutowo-Meullenet is in contact with the ExoCarta database to assist them in providing UniProt with annotations they have for exosome proteins.
- Together with Rama Balakrishnan from SGD, Rachael Huntley is a manager for the GO Consortium's Annotation Advocacy and Coordination group. The aims of the group are to:
  * educate GO Consortium curators about best annotation practice
  * enforce the annotation rules and policies within the GOC
  * maintain the annotation and evidence code documentation
  * educate and keep all the annotating groups up-to-date with changes in GAF format and ontology development
  * assist new groups with annotations
i. Improvements to the QuickGO user interface
In Spring 2012 work started on a new user interface for the UniProt GO browser QuickGO. The initial user testing phase of the work is now complete and the interface is currently being developed. It is expected that the new interface will be publicly released by Spring 2013.
ii. Improvements to the Protein2GO curation tool.
A number of improvements to the functionality of Protein2GO have been implemented in 2012.
a. The tool is now able to accept annotation extension data, a relatively new field in the GAF2.0 format of the gene association file that provides context to an annotation. UniProt now provides 1,639 manual annotations with annotation extension data.
b. A series of annotation suggestions are now displayed to assist curators based on e.g. protein binding information, GO terms that are commonly co-annotated and inter-ontology links.
c. Warnings are now displayed if a curator tries to make an annotation to a protein that has a NOT-qualified annotation to the same GO term or if it has a comment caution in the UniProtKB entry specific to the GO term being used.
d. Curators are now able to start a dispute over an annotation they disagree with. This allows a direct dialog to occur between the Protein2GO curator and the curator or group that made the annotation.
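To illustrate the annotation extension data mentioned in item a, the following Python sketch parses the annotation extension field of a GAF 2.0 line (column 16). It assumes the usual layout of that field, in which each expression has the form relation(identifier), commas join expressions that apply together and pipes separate independent alternatives. The function name and the example relations and identifiers are illustrative only and are not taken from Protein2GO.

```python
import re
from typing import List, Tuple

# One extension expression, e.g. "part_of(CL:0000576)".
_EXPRESSION = re.compile(r"^(?P<relation>[A-Za-z_]+)\((?P<target>[^()]+)\)$")


def parse_annotation_extension(field: str) -> List[List[Tuple[str, str]]]:
    """Parse an annotation extension field into alternatives.

    Returns a list of alternatives (pipe-separated in the field); each
    alternative is a list of (relation, target) pairs (comma-separated).
    """
    if not field:
        return []
    alternatives = []
    for alternative in field.split("|"):
        pairs = []
        for expression in alternative.split(","):
            match = _EXPRESSION.match(expression.strip())
            if match is None:
                raise ValueError(f"Malformed extension: {expression!r}")
            pairs.append((match.group("relation"), match.group("target")))
        alternatives.append(pairs)
    return alternatives


if __name__ == "__main__":
    # Illustrative field value; relations and identifiers are examples only.
    example = "part_of(CL:0000576),occurs_in(GO:0005739)|has_input(UniProtKB:X00001)"
    for alternative in parse_annotation_extension(example):
        print(alternative)
```

Raising an error on malformed expressions, rather than skipping them, is a design choice for the sketch: it surfaces formatting problems at parse time instead of silently dropping context from an annotation.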
iii. Annotation Formats.
- Rachael Huntley and Emily Dimmer have been directly involved, with curators from other groups, in determining the format and relationships required for the new annotation extension format. Both have provided documentation for the GO Consortium website regarding this format.
- Rachael Huntley and Tony Sawford have been involved in determining the specification of the GPAD/GPI 1.1 file formats.
iv. Gene Association File changes.
Improvements to the HAMAP2GO annotation set.
UniProt-GOA now contains an improved set of HAMAP2GO annotations. The previous HAMAP2GO pipeline was unable to use the full complexity of the manually-curated UniProtKB HAMAP rules; recent developments have removed this limitation. The HAMAP source now predicts 3,458,519 GO annotations.
As of this release, references supplied in annotations from two UniProtKB automatic annotation pipelines have changed.
Annotations created from mappings between GO and the UniProtKB keywords and UniProtKB Subcellular Location controlled vocabularies previously cited the GO references GO_REF:0000004 and GO_REF:0000023, respectively.
However, terms from these UniProtKB controlled vocabularies are applied differently to UniProt Swiss-Prot and TrEMBL entries; UniProtKB terms are manually annotated to UniProtKB/Swiss-Prot entries, whereas UniProtKB/TrEMBL entries are annotated from data supplied by the underlying nucleic acid databases and/or by the UniProt automatic annotation program. As advised in our December 2011 release, we have now changed the cited references in the supplied GO annotations to highlight these differences.
From the current release onwards, the UniProt-GOA annotation set will use:
- GO_REF:0000037 or GO_REF:0000038 instead of GO_REF:0000004 for UniProtKB keyword annotations
- GO_REF:0000039 or GO_REF:0000040 instead of GO_REF:0000023 for UniProtKB Subcellular Location annotations
Further descriptions of all of these references are available at: http://www.geneontology.org/cgi-bin/references.cgi
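For users who need to update older files or scripts, the reference change can be sketched as a small lookup. The text above does not state which reference in each pair applies to UniProtKB/Swiss-Prot and which to UniProtKB/TrEMBL. The sketch assumes that the first of each pair (GO_REF:0000037, GO_REF:0000039) is used for Swiss-Prot entries and the second (GO_REF:0000038, GO_REF:0000040) for TrEMBL entries; this assumption should be checked against the GO reference descriptions. The function name `updated_reference` and the `reviewed` flag are hypothetical.

```python
# Keys: (old reference, entry is in UniProtKB/Swiss-Prot).
# ASSUMPTION: the first reference of each pair applies to Swiss-Prot
# (manually annotated) entries, the second to TrEMBL entries.
NEW_GO_REF = {
    ("GO_REF:0000004", True): "GO_REF:0000037",
    ("GO_REF:0000004", False): "GO_REF:0000038",
    ("GO_REF:0000023", True): "GO_REF:0000039",
    ("GO_REF:0000023", False): "GO_REF:0000040",
}


def updated_reference(old_ref: str, reviewed: bool) -> str:
    """Return the replacement GO_REF, or the input if it was not changed."""
    return NEW_GO_REF.get((old_ref, reviewed), old_ref)


if __name__ == "__main__":
    print(updated_reference("GO_REF:0000004", reviewed=True))
    print(updated_reference("GO_REF:0000023", reviewed=False))
```

References that were not part of this change pass through unmodified.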
1. Annotation post-processing
UniProt-GOA now displays some IEA annotations that have been subject to minor post-processing by UniProt to correct the assigned GO term. The focus of the changes is to ensure taxonomic correctness of annotated GO terms, using data supplied by the GO taxon rules. (For further info see here; http://www.ebi.ac.uk/GOA/news.html)
2. Removal of automatic annotations that conflict with NOT-qualified manual GO annotations
All IEA-evidenced annotations have been filtered to remove those that conflict with a NOT-qualified manual GO annotation for the same protein, that is, where the NOT-qualified annotation uses a GO identifier that is the same as, or a parent of, the GO identifier applied by the IEA method.
This filtering step has resulted in the removal of 8,000 incorrect IEA annotations from the UniProt file.
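The filtering rule can be sketched in Python as follows. This is an illustration under stated assumptions, not the UniProt-GOA pipeline itself: the ontology is represented as a simple child-to-parents dictionary, "parent" is interpreted as any ancestor in the ontology, and the function names and identifiers are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple

Pair = Tuple[str, str]  # (UniProtKB accession, GO identifier)


def ancestors(term: str, parents: Dict[str, Iterable[str]]) -> Set[str]:
    """Collect all ancestors of a term by walking child -> parent links."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def filter_conflicting_iea(
    iea: Iterable[Pair],
    not_manual: Set[Pair],
    parents: Dict[str, Iterable[str]],
) -> List[Pair]:
    """Keep IEA annotations that do not conflict with a NOT annotation.

    An IEA annotation conflicts if the same protein has a NOT-qualified
    manual annotation to the same term or to one of its ancestors.
    """
    kept = []
    for accession, go_id in iea:
        lineage = ancestors(go_id, parents) | {go_id}
        if any((accession, term) in not_manual for term in lineage):
            continue
        kept.append((accession, go_id))
    return kept


if __name__ == "__main__":
    # Toy ontology: GO:C is a child of GO:B, which is a child of GO:A.
    toy_parents = {"GO:C": ["GO:B"], "GO:B": ["GO:A"]}
    not_annotations = {("X1", "GO:B")}
    iea_annotations = [("X1", "GO:C"), ("X1", "GO:B"), ("X1", "GO:A"), ("X2", "GO:C")]
    print(filter_conflicting_iea(iea_annotations, not_annotations, toy_parents))
```

In the toy example, the annotation of X1 to the more general term GO:A is kept, because a NOT annotation to GO:B says nothing about the parent term.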
1. New UniPathway2GO mapping
In collaboration with the Swiss Institute of Bioinformatics, INRIA (Rhone-Alpes) and Laboratoire d'Ecologie Alpine (Grenoble), we now offer an additional 113,285 GO annotations that describe the pathway(s) that 105,041 UniProtKB entries are involved in. 48% of these annotations apply a GO term that either uniquely describes a protein's involvement in a certain process, or supplies a more granular term than is supplied by other UniProt electronic annotation methods.
The mapping is available from: http://www.grenoble.prabi.fr/dev/obiwarehouse/download/unipathway/public/unipathway2go.tsv
(For further info see here; http://www.ebi.ac.uk/GOA/news.html)
2. Inclusion of multiple identifiers in the 'with/from' column of UniProt-GOA annotations
UniProt-GOA now includes those GO annotations that have applied more than one identifier in the 'with/from' annotation field (column 8 of the GAF2.0 format). This means that 12,317 annotations from external annotation groups are now more fully represented in the UniProt-GOA files, ensuring we are able to provide a more comprehensive set of GO annotations to users.
UniProt-GOA now includes electronic GO annotations created by EnsemblFungi. The annotations are created by projecting manual GO annotations from Saccharomyces cerevisiae or Schizosaccharomyces pombe proteins onto proteins from one or more target species, based on gene orthology obtained from Ensembl Compara. This release contains over 41,000 annotations to over 9,000 proteins covering 36 taxonomic groups, including Ashbya gossypii, Emericella nidulans and Aspergillus species.
Our display of InterPro2GO annotations was altered in order to reduce the redundancy. This resulted in a large reduction in the number of GO annotations from InterPro from 83 million to around 51 million, a decrease of approximately 32 million annotations.
Previously, when different InterPro domains predicted the same GO ID for the same protein, we displayed these as separate annotations. We have changed this so that all the InterPro domains that predict the same GO ID for the same protein are piped together in the 'with' field of a single annotation line, ensuring no loss of data.
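The change in representation can be illustrated with a short Python sketch that collapses redundant rows into a single line with a pipe-separated 'with' value. The row layout (accession, GO ID, InterPro identifier), the function name and the placeholder identifiers are assumptions for illustration; real GAF lines carry more columns.

```python
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, str, str]  # (accession, GO ID, InterPro identifier)


def collapse_interpro(rows: Iterable[Row]) -> List[Row]:
    """Merge rows sharing accession and GO ID into one piped 'with' value."""
    merged: Dict[Tuple[str, str], List[str]] = {}
    for accession, go_id, interpro_id in rows:
        identifiers = merged.setdefault((accession, go_id), [])
        if interpro_id not in identifiers:  # avoid repeating an identifier
            identifiers.append(interpro_id)
    # Dictionaries preserve insertion order (Python 3.7+), so the output
    # follows the first occurrence of each accession/GO ID pair.
    return [
        (accession, go_id, "|".join(identifiers))
        for (accession, go_id), identifiers in merged.items()
    ]


if __name__ == "__main__":
    # Placeholder accessions and identifiers, for illustration only.
    rows = [
        ("X1", "GO:0000001", "InterPro:IPR000001"),
        ("X1", "GO:0000001", "InterPro:IPR000002"),
        ("X1", "GO:0000002", "InterPro:IPR000001"),
    ]
    for row in collapse_interpro(rows):
        print("\t".join(row))
```

Every InterPro identifier from the original rows is still present in the merged 'with' value, which is why the number of annotation lines falls without any loss of data.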