GOA Progress Report for October 2008
UniProtKB - GOA, October 2008
The GOA project (a member of the PANDA team at EBI-EMBL, UK) provides high-quality, manual and electronic Gene Ontology (GO) annotations to proteins in the UniProt KnowledgeBase (UniProtKB) and International Protein Index (IPI).
GOA has been a member of the GO Consortium since 2001, and in addition to providing annotations to all species in its UniProtKB file releases, is particularly responsible for the integration and release of manual GO annotations to the human, chicken and cow proteomes.
Emily Dimmer 1 FTE: Curator (funded by NHGRI grant)
Rachael Huntley 0.8 FTE: Curator (funded by EMBL)
Daniel Barrell 0.5 FTE: Programmer (funded by NHGRI and BHF (British Heart Foundation) grants)
David Binns: maintenance of QuickGO and Protein2GO tools
EBI curators: Yasmin Alam-Faruque, Paul Browne, Wei Mun Chan, Louise Daugherty, Ruth Eberhardt, Jyoti Khadake, Kati Laiho, Michele Magrane, Eleanor Whitfield, Rebecca Foulger and Duncan Legge: manual GO annotation of UniProtKB.
Louise Daugherty, Jennifer McDowell and David Lonsdale (InterPro): development and maintenance of InterPro2GO mapping.
2. Annotation Progress
Since January 2008, GOA has provided another ten file releases, which include non-redundant sets of GO annotations to the human, mouse, rat, chicken, cow, zebrafish and Arabidopsis proteomes, as well as data releases to all species (GOA-UniProtKB). GOA has developed in parallel with the growing number of sequences and annotations available in UniProtKB, and currently contains over 34 million GO annotations to more than 4.5 million UniProtKB accession numbers, covering more than 175,140 taxonomic groups (October 2008 release). This represents an increase of over 8 million GO annotations (a 33% increase) and a 19% increase in taxonomic coverage over the last ten months. 65% of UniProtKB is now covered by GO annotation (either manual or electronic), and supplementary species-specific files are now provided for 938 complete proteomes.
Electronic Annotation Pipelines
In UniProtKB there are currently 168,308 species (4,257,090 proteins) for which the electronic annotation pipelines provided by GOA are the only source of GO annotation.
GOA is responsible for providing 5 GO mapping files to the GO Consortium (Swiss-Prot keyword to GO, UniProtKB Subcellular Location to GO, Enzyme Commission numbers to GO, InterPro to GO and HAMAP to GO), which provide electronic GO annotations for many species. GOA continues to develop and update these files, so that mappings from over 14,450 external terms are currently provided; these have produced more than 33 million annotations for UniProtKB.
In addition, the electronic annotation collaboration with Ensembl Compara has continued successfully; in the October 2008 UniProtKB release, 147,858 annotations for 30 species were generated from this pipeline, an increase of over 300% over the last ten months. Further information on electronic methods applied by GOA can be found in section 3.b.
Manual Annotation Progress
The majority of the manual annotation that GOA curators carry out follows the targets and annotation practices set forth by the Reference Genomes initiative, of which GOA is a member. The GOA and BHF-UCL annotation groups work closely together to generate high-quality GO annotations for the human proteome, and as of October 2008 the groups had fully curated 78.7% of the reference genome targets (comprising 410 gene products). GOA has provided annotations to 370 of these targets, and of their 4,688 manual annotations, 84% are supported by experimental evidence.
As well as increasing the coverage of GO annotations in UniProtKB, the group works to improve the quality of manual annotations; in the last year this has included the removal of all remaining NR-evidenced annotations and the update of over 2,000 annotations created initially by the Proteomic Inc annotation group in 2002.
The GOA group is responsible for providing manual annotations to the human proteome, for which 61,382 manual annotations (a rise of 17% over the last ten months) and 142,278 electronic annotations (a 9% increase) are available. In September 2008, 4,642 experimentally-evidenced annotations for human protein subcellular localization were integrated from the Human Protein Atlas project. Similarly, the Reactome and GOA groups have worked closely together to enable Reactome to release over 3,830 ‘EXP’-evidenced annotations to 60 different species, the large majority of which have been made to human proteins.
Manual GO annotation from UniProtKB is supplemented with the latest data from 19 external databases: AgBase, BHF-UCL, dictyBase, Flybase, Gramene, GeneDB, Human Protein Atlas, HGNC, IntAct, LIFEdb, MGI, Reactome, RGD, SGD, TAIR, TIGR, WormBase, ZFIN, and with some data from the Roslin Institute. External annotation integrations occur on a monthly basis, and integrations from all groups have increased over the year, providing the GOA UniProtKB dataset with 378,633 manual annotations (October 2008).
3. Methods and strategies for annotation
(please note % effort on literature curation vs. computational annotation methods)
Literature curation: 95% total curator effort
Computational annotation: 5% total curator effort (consisting of updates to existing mappings, creating new mappings and assessing new IEA methods).
a. Literature curation:
Part-time staff usually focus on curation of a different database (UniProtKB, IntAct or InterPro); GO annotation is therefore an extra activity, and annotations are only added if the curator finds information that is within the scope of GO annotation and has not been previously curated.
GOA curators concentrate on the GO annotation of the human proteome as well as orthologs to human proteins when necessary. As the GOA group is a member of the GO Consortium Reference Genome effort, their manual annotation work involves providing a comprehensive set of granular GO terms to selected human entries, prioritizing the assignment of terms which have experimental evidence to support the GO term-protein association.
In addition to the papers archived in the UniProtKB records, the NCBI PubMed, GOPubMed, iHOP and EBI’s CiteXplore advanced searches are queried to find papers providing appropriate functional data. The curators' aim is to find the most recent papers which provide experimental evidence for the unique features of a given protein.
Once a relevant paper is found, the full text is read to identify the unique features of a given protein. The majority of papers will mention more than one protein; however, a curator will concentrate on capturing the information pertinent to the main protein chosen for annotation.
GO terms are chosen by querying the GO files with the QuickGO web browser. Before assigning a GO term, the definition must be read to check suitability. Obsolete GO terms are not used in annotation. When electronic or manual GO annotations become obsolete, they are manually updated with an alternative appropriate term when possible. If a useful term is missing from the ontology, an existing GO term is in the incorrect hierarchical position or a definition needs to be refined, a curator request is sent to the GO editorial office using SourceForge. To improve annotation consistency, curators often consult the ‘Statistics’ tab provided for each GO term in the QuickGO browser where GO terms that are frequently assigned in tandem are displayed.
If no functional annotation can be found for a given protein after an exhaustive literature search, the root GO terms molecular_function (GO:0003674), biological_process (GO:0008150) or cellular_component (GO:0005575) can be assigned with GO evidence code ND (‘No Data’).
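The ND rule above can be sketched in a few lines of Python. The root term identifiers and the ND evidence code come from the text; the function name, the record layout, the accession and the choice to apply the rule separately to each ontology aspect are assumptions made for illustration, not a description of GOA's curation tools.

```python
# Root terms and the ND evidence code are taken from the report text.
# Everything else (names, record layout, per-aspect handling) is hypothetical.
ROOT_TERMS = {
    "molecular_function": "GO:0003674",
    "biological_process": "GO:0008150",
    "cellular_component": "GO:0005575",
}


def nd_annotations(accession, annotated_aspects):
    """Return ND annotations to the root term of each aspect with no annotation.

    annotated_aspects is the set of aspect names for which the curator
    already found functional annotation in the literature.
    """
    return [
        {"accession": accession, "go_id": go_id, "evidence": "ND"}
        for aspect, go_id in ROOT_TERMS.items()
        if aspect not in annotated_aspects
    ]


# "X00001" is a made-up accession used only for this example.
print(nd_annotations("X00001", {"molecular_function"}))
```

In this sketch a protein that already has a molecular function annotation receives ND annotations only for the process and component root terms.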
Pipeline of data integration from IntAct.
GOA has integrated annotations from the EBI's IntAct protein-protein interaction database. Only those binary interactions which are of high-enough quality to be integrated into the UniProtKB database have been included (this is decided on experimental method type). All IntAct protein-protein binding interactions are manually curated. All GO terms in these annotations are children of the protein binding or identical protein binding terms (GO:0005515 and GO:0042802), and the annotations use the 'IPI' evidence code along with information on the protein's binding partner in column 8 (with).
b. Computational annotation strategies:
IEA pipelines in GOA:
- Enzyme Commission to GO
- InterPro domains to GO
- Swiss-Prot Keyword to GO
- UniProtKB Subcellular Location to GO (new as of November 2007)
- HAMAP family rules to GO
- Ensembl Compara ortholog transfer (new as of December 2006)
The large-scale assignment of GO terms to UniProtKB entries has been made possible by successfully converting a proportion of the data in UniProtKB entries, added by external annotation projects, into GO terms. For example, UniProtKB description lines [DE] may contain Enzyme Commission (EC) numbers. Using an existing mapping of EC numbers to the GO molecular function ontology (EC2GO) and a mapping of protein accession numbers to EC numbers, GOA can produce a UniProtKB to GO association. Such mapping files are routinely used to generate a large number of annotations to the GO process, function and component ontologies. Similar methods are applied to create GO annotations from UniProtKB Subcellular Location comments and Swiss-Prot Keywords.
Bi-directional database cross-references also help to integrate GO annotations. For example, the majority of UniProtKB entries will cross-reference an InterPro identification number and vice versa. InterPro is a key database maintained at the EBI. It provides an integrated documentation resource for proteins, families and domains. A single InterPro entry provides comprehensive annotation describing a set of related proteins, some of which may have identical functions, be involved in the same processes, and act in the same locations. During the curation of each InterPro entry, high-level GO terms are manually assigned, based on a review of the information available on the set of Swiss-Prot proteins assigned to the InterPro entry. This annotation is used to generate the InterPro2GO mapping and also serves as a biological function summary in the InterPro entry. So far, the application of the InterPro2GO mapping in the electronic assignment of GO terms to gene products has produced the most coverage in the GOA dataset. To support interoperability, InterPro2GO has been used to generate GO mappings to its member databases (see Table 1) and these are also available for download.
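The EC-based transfer described above is essentially a join between two mappings. The Python sketch below illustrates the idea under stated assumptions: the function name, the tuple layout and the accessions are hypothetical, the EC number and GO identifier are illustrative values rather than an extract from the EC2GO file, and the code is not GOA's production pipeline.

```python
def transfer_ec_to_go(protein_to_ec, ec_to_go):
    """Join accession -> EC numbers with EC number -> GO terms.

    Returns (accession, go_id, evidence, source) tuples. The evidence code
    is IEA because the association is produced electronically.
    """
    annotations = []
    for accession, ec_numbers in sorted(protein_to_ec.items()):
        for ec in ec_numbers:
            # EC numbers with no mapping simply produce no annotation.
            for go_id in ec_to_go.get(ec, []):
                annotations.append((accession, go_id, "IEA", "EC:" + ec))
    return annotations


# Hypothetical inputs: made-up accessions, illustrative EC and GO values.
protein_to_ec = {"X00001": ["1.1.1.1"], "X00002": ["9.9.9.9"]}
ec_to_go = {"1.1.1.1": ["GO:0004022"]}

for row in transfer_ec_to_go(protein_to_ec, ec_to_go):
    print(row)
```

Recording the source identifier alongside each association keeps the origin of every electronic annotation traceable, and the same join shape would apply to keyword or subcellular location mappings.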
As of December 2006, GOA has released electronic annotations resulting from a collaboration with the Ensembl group, which has provided an additional 147,858 annotations for 30 proteomes (many of which are non-model organism species, such as macaque and chimp). Using orthology data obtained from the Ensembl Compara system, GO terms from a source species have been projected onto corresponding orthologs in one or more target species. Only one-to-one and apparent one-to-one orthologies are used, and only manually-annotated GO terms with an evidence type of IDA, IEP, IGI, IMP or IPI are projected.
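The two filters in the projection rule (orthology type and evidence code) can be sketched as follows. The evidence codes are those listed in the text; the orthology labels, function name, gene identifiers and tuple layouts are hypothetical and do not reflect the Ensembl Compara data model or API.

```python
# Evidence codes eligible for projection, as listed in the report text.
PROJECTABLE_EVIDENCE = {"IDA", "IEP", "IGI", "IMP", "IPI"}
# Hypothetical labels for the two orthology types that are accepted.
ALLOWED_ORTHOLOGY = {"one_to_one", "apparent_one_to_one"}


def project_annotations(source_annotations, orthologs):
    """Project GO terms from source genes onto their accepted orthologs.

    source_annotations: list of (source_gene, go_id, evidence)
    orthologs: list of (source_gene, target_gene, orthology_type)
    Returns (target_gene, go_id, "IEA", source_gene) tuples.
    """
    targets = {}
    for source, target, kind in orthologs:
        if kind in ALLOWED_ORTHOLOGY:
            targets.setdefault(source, []).append(target)

    projected = []
    for source, go_id, evidence in source_annotations:
        if evidence not in PROJECTABLE_EVIDENCE:
            continue  # only experimentally evidenced manual annotations
        for target in targets.get(source, []):
            projected.append((target, go_id, "IEA", source))
    return projected


# Made-up gene identifiers for illustration only.
annotations = [
    ("H1", "GO:0005515", "IPI"),
    ("H1", "GO:0005634", "ISS"),
    ("H2", "GO:0005737", "IDA"),
]
orthologs = [("H1", "M1", "one_to_one"), ("H2", "M2", "one_to_many")]
print(project_annotations(annotations, orthologs))
```

In this toy input only the IPI annotation on H1 is projected: the ISS annotation fails the evidence filter, and H2 has no accepted ortholog.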
c. Priorities for annotation
1. Reference Genomes Initiative target genes
2. Requests received from users
3. Those human proteins with no GO annotation
4. Presentations and Publications
a. Papers with substantial GO content
Dimmer, E.C., Huntley, R.P., Barrell, D.G., Binns, D., Draghici, S., Camon, E.B., Hubank, M., Talmud, P.J., Apweiler, R. and Lovering, R.C. (2008) ‘The Gene Ontology – Providing a Functional Role in Proteomics Studies’ Practical Proteomics DOI 10.1002/pmic.200800002
Lovering R.C., Dimmer E., Khodiyar V.K., Barrell D.G., Scambler P., Hubank M., Apweiler R. and Talmud P.J. Cardiovascular GO annotation initiative year 1 report: why cardiovascular GO? Proteomics. 2008 May;8(10):1950-3.
Accepted for publication:
Barrell D., Dimmer E., Huntley R., Binns D., O’Donovan C. and Apweiler R. The GOA database in 2009 – an integrated Gene Ontology Annotation resource. Nucleic Acids Research Database issue 2009.
Rachael P. Huntley, Emily C. Dimmer and Rolf Apweiler Practical Applications of the Gene Ontology Resource. For: Problem Solving Handbook, Springer Inc. Editors: Lenwood S. Heath and Naren Ramakrishnan from Virginia Tech
b. Presentations including Talks and Tutorials and Teaching
30th January 2008 Rachael Huntley, talk and tutorial ‘Introduction to the Gene Ontology and the GO annotation resources’, ‘Transcriptomics’ EBI Hands on Courses. EBI, Cambridge, UK
27th August 2008 Emily Dimmer, talk and tutorial: ‘Introduction to the Gene Ontology and the GO annotation resources’, ‘Interactions and Pathways workshop’ EBI Hands on Courses. EBI, Cambridge, UK
18th March 2008 Emily Dimmer, talk and tutorial: Gene Ontology tutorial primer for EBI-EMBL Post-docs. EBI, Cambridge, UK
8th October 2008 Emily Dimmer, talk: ‘Introduction to the Gene Ontology and GO Annotation (GOA) resource at the EBI’. ‘DIP into EBI resources’ workshop, EBI, Cambridge, UK
22-24th October, 2008 Rachael Huntley, talk: ‘The Gene Ontology Annotation (GOA) Database: central resource for multi-species GO annotation.’, Fungal Biology and Biotechnology in the Genomic Era conference (Eurofungbase Consortium); Sant Feliu de Guixols, Spain.
8-12th December 2008 Rachael Huntley, talk and tutorial: ‘Introduction to GOA’ Wellcome Trust Advanced Courses Proteomics Workshop, EBI, Cambridge, UK
c. Poster presentations
19th August 2008 Emily Dimmer, poster: ‘The Gene Ontology Annotation (GOA) resource at the EBI’ HUPO 7th Annual World Congress. Amsterdam.
5. Other Highlights:
A. Ontology Development Contributions:
Since January 2008, curators in the GOA team have made 42 SourceForge requests, which have led to the creation of 25 new GO terms. These requests arose either from manual curation work for the reference genomes effort or from the mapping of terms from external vocabularies, such as the Swiss-Prot subcellular location controlled vocabulary, to GO.
B. Annotation Outreach and User Advocacy Efforts:
October 22 – 24, 2008 Rachael Huntley will attend the ‘Fungal Biology and Biotechnology in the Genomic Era’ conference, Hotel Eden Roc, Sant Feliu de Guixols, Spain. Organised by the Eurofungbase Consortium.
C. Other Highlights:
Kidney Research UK
In 2008 the GOA group was successful in obtaining funding from the Kidney Research UK charity, which has agreed to fund one dedicated curator for three years to improve the functional dataset available for mammalian gene products implicated in kidney development and disease. This will provide a valuable community resource for renal researchers. It is hoped that the curator will start in January 2008. The grant is entitled 'Integration of knowledge of genes involved in renal processes from biomedical research using the Gene Ontology', and will be led by Dr. Rolf Apweiler and Prof. Peter Scambler.
QuickGO (http://www.ebi.ac.uk/QuickGO) has been extensively redeveloped with a grant from the BBSRC Tools and Resources Fund, such that users can now query QuickGO with a range of different keywords or identifier types, to find either comprehensive, detailed information on GO terms or sets of GO annotation data, which they can filter to their specific needs and download in a range of formats.
The new version of QuickGO has placed GOA’s extensive set of GO annotations at the heart of the tool and enables users to easily filter by a number of characteristics, such as species or taxonomic group, evidence type or GO term set, then evaluate and download the resulting set of annotations. Drop-down menus at the top of the annotation table on the ‘Annotation Download’ page provide a simple way of filtering the GO annotations, whereas more complex queries can be entered into the ‘Advanced’ search text box on this page, allowing users to apply a combination of Boolean operations (AND, NOT, OR) to their queries.
To support QuickGO users who would like to map GO annotations to different identifier types, an identifier mapping facility for 14 different sequence identifier types (including Ensembl, Entrez Gene, RefSeq, SGD, MGI and IPI identifiers) has been included directly in the tool.
Once users have selected their desired annotation set, QuickGO can provide detailed paginated views of annotations as well as statistics on multiple aspects of the filtered annotations. Users can also specify the format that their downloaded data takes – with the tool offering GOA association file, gene2go or customised formats as well as protein FASTA files or identifier lists. These facilities ensure that QuickGO users are now able to download GO annotations that are tailor-made to their requirements.
QuickGO also now gives users the ability to view and modify existing GO slims or generate their own. GO slims are subsets of GO terms extracted from the whole Gene Ontology and tend to consist of a limited number of high-level GO terms that have been selected to provide an overview of some or all of the content of GO. GO slims are often used to provide a broad overview of the chief functional characteristics of a set of sequences. QuickGO users have direct access to pre-defined GO slims, which have been extracted from the GO Consortium’s OBO file, but can equally easily create a new GO slim set by entering a list of GO identifiers or by selecting terms when browsing QuickGO. QuickGO can provide graphical displays of all slims, and GO annotations can be ‘mapped-up’ to these term sets, along with statistics for the number/percentage of annotations associated with each slim term.
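The idea of mapping annotations up to a slim can be illustrated with a small Python sketch. It assumes a toy ontology given as a child-to-parents dictionary and counts each annotation once for every slim term found among the annotated term and its ancestors; the function names, term labels and data layout are hypothetical, and how QuickGO itself implements this step is not stated in this report.

```python
def ancestors(term, parents):
    """Return the term together with all of its ancestors in the graph."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def map_to_slim(annotations, parents, slim):
    """Count annotations per slim term after mapping each annotation up.

    annotations: list of (protein, term); slim: set of slim terms.
    """
    counts = {term: 0 for term in slim}
    for _protein, term in annotations:
        for slim_term in ancestors(term, parents) & slim:
            counts[slim_term] += 1
    return counts


# Toy ontology using term names as labels (not real GO identifiers).
parents = {
    "identical protein binding": ["protein binding"],
    "protein binding": ["binding"],
    "kinase activity": ["catalytic activity"],
}
slim = {"binding", "catalytic activity"}
annotations = [
    ("P1", "identical protein binding"),
    ("P2", "protein binding"),
    ("P3", "kinase activity"),
]
print(map_to_slim(annotations, parents, slim))
```

Dividing each count by the total number of annotations would give the percentage view mentioned above.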
All of the data provided by QuickGO can also be queried remotely, both for GO term information and annotation data. These web services are fully integrated, so that the filtering options and datasets available are fully synchronised between the browsable and web service interfaces. Web service information is provided at the bottom of each QuickGO page where details on how to construct web queries, format options and sample scripts showing how to query QuickGO in Java, Perl and Bash are provided. The web services have been designed for ease of use; QuickGO provides a REST style query interface in which all information is provided in the URL and the results are in tab separated, OBO or XML formats conforming to well-established standards.
In September 2008 Swiss-Prot announced that it had generated a complete representation of all currently known human proteins (20,342). This remarkable achievement provides human researchers with an extremely high-quality, non-redundant proteome set, which has been extensively annotated. However, this set is of course not final, as Swiss-Prot will be adding further information, including:
- gradually adding entries for newly discovered human proteins;
- updating almost 400 entries that are tagged as having only uncertain evidence supporting their existence (PE5), e.g. where there is a possibility that the sequence originates from a pseudogene;
- continuing to correct the sequence of some proteins in this set; in this process some of them will be extended while others will shrink;
- increasing the number of splice variants (it is expected from ENCODE that 50% of genes have isoforms; currently we have only 13,475 isoforms in Swiss-Prot);
- continuing to build a comprehensive view of protein variation in the human population;
- exploring the full range of post-translational modifications.
Similarly, efforts to capture information on subcellular location, tissue expression and protein/protein interaction, among other things, are very much ongoing.
In October 2008, the Swiss-Prot group at the Swiss Institute of Bioinformatics in Geneva announced their decision to start contributing manual annotations to GO. GOA already works closely with SIB, as Swiss-Prot curators based at the EBI currently manually annotate to GO, and Swiss-Prot Geneva have worked with the GOA group in the creation of external2GO mapping links and in the generation of electronic annotations from both Swiss-Prot curation and family rules created by the Swiss-Prot HAMAP group. GOA will be visiting SIB a number of times in 2008/9 to ensure that the manual GO annotation pipeline can be extended so that SIB curators can easily contribute manual annotations to GOA, and to train new curators in GO annotation methods.