GOA December 2010

Gene Ontology Annotation at UniProtKB, 2010

Staff:

Rolf Apweiler

Claire O'Donovan

Emily Dimmer

Rachael Huntley

Yasmin Alam-Faruque

Daniel Barrell (left July 2010)

David Binns (left October 2010)

Tony Sawford

Swiss-Prot contributors (EBI, Hinxton, UK and SIB, Geneva, Switzerland): Ioannis Xenarios, Amos Bairoch, Lydie Bougueleret, Serenella Ferro-Rojas

Ghislaine Argoud-Puy, Andrea Auchinchloss, Kristian Axelsen, Marie-Claude Blatter, Emmanuel Boutet, Silvia Braconi Quintaje, Lionel Breuza, Alan Bridge, Paul Browne, Wei Mun Chan, Elizabeth Coudert, Isabelle Cusin, Louise Daugherty, Paula Duek Roggli, Ruth Eberhardt, Anne Estreicher, Livia Famiglietti, Marc Feuermann, Rebecca Foulger, Michael Gardner, Arnaud Gos, Nadine Gruaz-Gumowski, Ursula Hinz, Chantal Hulo, Janet James, Silvia Jimenez, Florence Jungo, Guillaume Keller, Kati Laiho, Duncan Legge, Philippe Lemercier, Damien Lieberherr, Michele Magrane, Patrick Masson, Madelaine Moinat, Ivo Pedruzzi, Klemens Pichler, Diego Poggioli, Sylvain Poux, Catherine Rivoire, Bernd Roechert, Michel Schneider, Harminder Sehra, Eleanor Stanley, Andre Stutz, Shyamala Sundaram, Michael Tognolli

Annotation Progress

We continue to prioritise the annotation of those genes selected for the Reference Genome Project.

Proteins associated with kidney development and disease are the focus of the UniProtKB-GOA Renal Annotation Initiative, which is headed by Dr. Yasmin Alam-Faruque.

Currently the curators from the UniProtKB-GOA and BHF-UCL projects have together completely annotated 65% (721/1111) of supplied Reference Genome Targets.

Between January 2010 and November 2010, the UniProtKB-GOA project provided the GO Consortium with eleven annotation file releases, including non-redundant sets of GO annotations to the human, mouse, rat, zebrafish, Arabidopsis, chicken and cow proteomes, as well as data releases for annotations of all proteins in UniProtKB. Since 12th July 2010, UniProtKB-GOA has provided an interim release of the human and chicken gene association files to allow the Reference Genomes PAINT project to collect the most up-to-date annotations for use in the tree curation. The human and chicken files are now released every two weeks: once as part of the main UniProtKB-GOA monthly file release and again two weeks later.

UniProtKB-GOA now provides over 72 million GO annotations for 8.5 million proteins in over 283,000 different taxonomic groups. UniProtKB-GOA provides almost 199,000 annotations for the human proteome (providing over 92% of the human proteome with at least one GO annotation). Over the last year the number of manual annotations has increased by 19.8% in the UniProtKB file and by 36% in the human file.

Between January and November 2010, UniProtKB-GOA continued training, checking and supporting 35 curators in the Swiss-Prot team at the Swiss Institute of Bioinformatics, who have so far created a total of almost 41,700 manual GO annotations for UniProtKB entries from a range of species.

UniProtKB-GOA UniProt gene association file release stats (comparison of January 2010 and November 2010 releases)

[Images: UniProt Stats 2010 A.jpg, UniProt Stats 2010 B.jpg]

Key

The two cells in orange are the TIGR statistics from UniProtKB-GOA's December 2009 release as these were temporarily missing from the January 2010 release.

* New sources of annotation after January 2010

Methods and strategies for annotation

1. Literature curation:

Literature curation continues to be the major focus of our annotation efforts, with an emphasis on the use of experimental evidence codes.


2. Computational annotation strategies:

UniProtKB-GOA provides IEA annotations from the following methods:

  1. Swiss-Prot Keyword 2GO (SPKW2GO) [1,2]
  2. Swiss-Prot Subcellular Locations2GO (SPSL2GO) [1,2]
  3. HAMAP2GO [2]
  4. InterPro2GO [2]
  5. Ensembl Compara


Key

1: mapping tables created and maintained by the UniProtKB-GOA group

2: electronic annotations generated by the UniProtKB-GOA group, using UniProtKB.


3. Priorities for annotation

  1. Genes assigned by Reference Genome Project (Rachael, Emily)
  2. Genes associated with renal processes (Yasmin)
  3. Requests from user community (all curators)
  4. Proteins annotated during Swiss-Prot curation duties (all Swiss-Prot/UniProtKB curators at the EBI and SIB)

Presentations and Publications

Publications, Talks, Posters 2010-

Other Highlights

A. Ontology Development Contributions:

  • 102 SourceForge items regarding requested changes to the GO have been placed by curators associated with the UniProtKB-GOA group between January and November 2010.
  • Yasmin Alam-Faruque and the UniProtKB-GOA group hosted a kidney-related ontology development meeting in January 2010, during which renal experts, ontology editors and curators discussed new renal-related terms. As a result of this meeting, 426 new GO terms have so far been created, allowing curators to choose much more specific terms when annotating kidney function and process.

B. Annotation Outreach and User Advocacy Efforts:

  • In September, UniProtKB-GOA was contacted by a researcher requesting that annotations he had personally made to M. tuberculosis proteins be included in the UniProtKB-GOA database. The annotations were reviewed by Rachael Huntley (UniProtKB-GOA) and Rama Balakrishnan (SGD) and were deemed to be of high-quality. The annotations were subsequently incorporated into the September release of the UniProtKB-GOA UniProt file.
  • During 2010 UniProtKB-GOA curators continued to provide manual GO annotation training to Swiss-Prot curators at the Swiss Institute of Bioinformatics (SIB), Geneva. All 35 SIB curators have now completed initial training, 18 of whom have completed the entire training program (including post-training annotation checking). The Swiss-Prot team in Geneva have so far generated 41,665 manual GO annotations to 11,072 UniProtKB proteins (data taken 24th November 2010). Annotations are created in UniProtKB-GOA's Protein2GO tool, and released in the group's gene association files. Such annotations use the existing source 'UniProtKB' (for column 15 of the gene association file). UniProtKB-GOA will continue to mentor the SIB curators into 2011.
  • Emily Dimmer and Rachael Huntley provided GO annotation training to three new UniProt (EBI) curators.
  • Emily Dimmer and Rachael Huntley continue to answer queries sent to the GO helpdesk.
  • Emily Dimmer, Rachael Huntley and Yasmin Alam-Faruque continue to answer user queries sent to the UniProtKB-GOA project.
  • UniProtKB-GOA is continuing to support external annotation groups, such as AgBase, BHF-UCL, DFLAT at Tufts University and SIB, by providing use of the Protein2GO curation tool.
  • Together with Rama Balakrishnan from SGD, Emily Dimmer is a manager for the GO Consortium's Annotation Advocacy and Coordination group. From June to December, Rachael Huntley has been covering Emily Dimmer's role in this group. The aims of the group are to:
  * educate GO Consortium curators about best annotation practice
  * enforce the annotation rules and policies within the GOC
  * maintain the annotation and evidence code documentation
  * educate and keep all the annotating groups up-to-date with changes in GAF format and ontology development 
  * assist new groups with annotations


C. Other

i. Improvements to the QuickGO user interface

Following the extensive redevelopment of QuickGO (http://www.ebi.ac.uk/QuickGO) in 2008, the user interface of QuickGO was substantially improved this year. The main improvements are:

  • inclusion of toolbars containing icons to access particular functions (e.g. filtering, ID mapping, and downloading), providing a neat and intuitive display for the functionality within QuickGO
  • addition of lightbox displays for performing the various functions within QuickGO
  • re-design of the GO Slims and GO Term Comparison page to provide a more structured procedure for using a chosen list of GO terms
  • consistency of displays, for example, wherever you see a GO ID in QuickGO this will be a link to the term page for that GO term

The statistics section of QuickGO has also been improved with the addition of statistics both for counts of annotations and for counts of proteins. These are now available for download in a text file format, which will enable users to create graphical representations of the statistical data for use in publications, e.g. a bar graph for the number of proteins in a given list annotated to GO terms present in a GO slim, which will provide an overview of the biological attributes of the list of proteins.


ii. Changes to Swiss-Prot Keyword (SPKW2GO) and Swiss-Prot Subcellular2GO (SPSL2GO)

With the aid of Serenella Ferro-Rojas at SIB, the SPKW2GO and SPSL2GO mappings have been updated - there are currently 414 mappings between SPSL and GO and 694 mappings between SPKW and GO.


iii. Renal GO annotation initiative funded by Kidney Research UK.

The UniProtKB-GOA renal project, under the direction of Yasmin Alam-Faruque, has been very successful to date, providing 1014 proteins with 9711 manual annotations. As a consequence of early annotation efforts of genes expressed in the Loop of Henle, it became apparent that there was a lack of GO terms describing the many aspects of kidney development. To address this, a kidney ontology development workshop was hosted by UniProtKB-GOA in January. As a result of this meeting, over 445 new GO terms were created (representing approximately 1.4% of the total number of terms in the Gene Ontology). The new terms were largely Biological Process terms and detailed not only the anatomical description (nephron, collecting duct, stroma, renal capsule, kidney vasculature, pattern specification and Malpighian tubule development) but also the biological processes (cell differentiation, signaling pathways, growth, morphogenic processes and regulatory processes) that contribute to renal development. The terms represent development of the various renal systems across organisms, i.e. metanephros (mammalian; 129 terms); pronephros (amphibian; 24 terms); mesonephros (fish; 102 terms) and renal system/Malpighian tubule (insect; 18 terms). A publication outlining this significant development in the Gene Ontology is currently in progress.

Following the ontology workshop, further species-specific experts have been recruited to assist with representing the correct anatomical structures within Xenopus and fly to describe renal system development within amphibians and insects.

The success of this project so far has highlighted the benefits of collaboration with the research community. Further annotation targets were suggested by the Edinburgh team of the GUDMAP Consortium and, as a result of this, GO annotation is almost complete for a list of 117 mouse gene products expressed in the urogenital system that were previously lacking any GO annotation.

Yasmin Alam-Faruque and Emily Dimmer are advisors to the SysKid project, a member of which has provided a list of 108 kidney-related proteins which are differentially expressed in response to overexpression of neuropilins. This list of gene products is also poorly represented by GO annotation and focused curation to this list is ongoing.

UniProtKB-GOA continues to support the British Heart Foundation GO Annotation Initiative at UCL.


iv. Quality Control Checks

Several new quality checks have been incorporated both into our database and into the UniProtKB-GOA curation tool Protein2GO during 2010. A summary of these is as follows:

Database

  • application of taxon rules to flag annotation errors where certain GO terms have been used for an inappropriate taxonomy
  • reporting of annotations using 'IEP' evidence which duplicate experimentally evidenced annotations to the same or descendant term - curators should decide whether the 'IEP' evidenced annotation is required.

Protein2GO

  • curators are prevented from creating 'ISS' evidenced annotations for a protein entry that already has an experimentally evidenced annotation to the same or descendant GO term
  • curators are prevented from creating an 'ND' evidenced annotation for a protein entry that already has an experimentally evidenced annotation to a GO term in the same ontological aspect as the 'ND' annotation
  • curators are prevented from copying 'NOT' qualified annotations to other proteins using the 'ISS' evidence code
  • curators are prevented from creating annotations using two taxon identifiers (in the case of dual taxon annotations) if the GO ID used does not match, or is not a descendant of, GO:0051704 (multi-organism process) or GO:0018995 (host)
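
The first two Protein2GO checks above can be sketched roughly as follows. This is an illustrative sketch only, not the Protein2GO implementation: the `Annotation` class, the `blocks_new_annotation` function, the toy `descendants` lookup and the set of evidence codes treated as experimental are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set

# Assumption: the evidence codes treated as "experimental" for these checks.
EXPERIMENTAL = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}


@dataclass(frozen=True)
class Annotation:
    """Hypothetical, minimal view of one GO annotation."""
    protein: str
    go_id: str
    evidence: str
    aspect: str  # "F", "P" or "C"


def blocks_new_annotation(
    new: Annotation,
    existing: List[Annotation],
    descendants: Dict[str, Set[str]],
) -> Optional[str]:
    """Return a reason if `new` should be rejected, otherwise None.

    `descendants` maps a GO ID to the set of its descendant GO IDs
    (a toy stand-in for a real ontology lookup).
    """
    experimental = [
        a for a in existing
        if a.protein == new.protein and a.evidence in EXPERIMENTAL
    ]
    if new.evidence == "ISS":
        # Same term or any descendant term already has experimental support.
        covered = {new.go_id} | descendants.get(new.go_id, set())
        if any(a.go_id in covered for a in experimental):
            return "ISS duplicates an experimental annotation"
    if new.evidence == "ND":
        # Any experimental annotation in the same aspect rules out 'ND'.
        if any(a.aspect == new.aspect for a in experimental):
            return "ND conflicts with an experimental annotation in this aspect"
    return None
```

The 'NOT'-qualifier and dual-taxon checks are not covered by this sketch; they would follow the same pattern, with the dual-taxon check needing the same kind of descendant lookup.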


v. Gene Association File changes

April 2010

1. To avoid processing problems for GO Consortium tools we changed the contents of column 1 in the GOA-UniProt gene association file. Column 1 originally displayed the values 'UniProtKB/TrEMBL' or 'UniProtKB/Swiss-Prot' to indicate which section of UniProtKB an accession belongs to. This information is now provided in a tab-separated, supplementary gene product information file (the gp_information file can be found here: ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/) that is released alongside the GOA-UniProt gene association files (for more information on this file see the readme: ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_information_readme). Column 1 of our UniProt GAF now consistently displays 'UniProtKB' for all UniProtKB accessions.

2. UniProtKB-GOA files are supplied in GAF2.0 format (http://www.geneontology.org/GO.format.gaf-2_0.shtml).

3. New PDB gene association file. This file has been generated from a collaboration between the InterPro, PDB and UniProtKB-GOA teams, and once again is able to offer annotations to PDB chain identifiers. In addition, further sources of GO annotations are now associated with PDB chains, to provide a more comprehensive PDB GO annotation resource. Manual and electronic GO annotations are now provided in this file from two sources:

a. where an InterPro entry matches a PDB chain, annotations supplied by the InterPro2GO electronic method are assigned to the chain identifier (for further details on this method see: http://www.geneontology.org/cgi-bin/references.cgi#GO_REF:0000002).

b. PDB chains are additionally supplied with manual and electronic GO annotations (excluding InterPro2GO) when a PDB chain maps with at least 90% identity to a UniProtKB accession (more specifically with the UniProtKB's CHAIN feature), whereupon manual and electronic annotations are supplied to the PDB chain identifier from the matching UniProtKB accession.
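
Route (b) can be pictured with the following sketch, which is an assumption-laden illustration and not the pipeline used to build the PDB file. The function name, the shape of the two input structures and the `"InterPro2GO"` method label are hypothetical; only the 90% identity threshold and the exclusion of InterPro2GO annotations come from the description above.

```python
from typing import Dict, List, Tuple


def annotations_for_chain(
    chain_id: str,
    chain_to_uniprot: List[Tuple[str, str, float]],
    uniprot_annotations: Dict[str, List[Tuple[str, str]]],
    min_identity: float = 90.0,
) -> List[Tuple[str, str, str]]:
    """Collect (chain_id, go_id, source_accession) triples for one PDB chain.

    chain_to_uniprot: (chain_id, accession, percent_identity) triples.
    uniprot_annotations: accession -> list of (go_id, method) pairs.
    """
    transferred = []
    for chain, accession, identity in chain_to_uniprot:
        if chain != chain_id or identity < min_identity:
            continue  # wrong chain, or below the identity threshold
        for go_id, method in uniprot_annotations.get(accession, []):
            if method == "InterPro2GO":
                continue  # these reach the chain via route (a) instead
            transferred.append((chain_id, go_id, accession))
    return transferred
```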

May 2010

UniProtKB-GOA gene association files changed to correctly attribute the InterPro group as the source of annotations generated by the InterPro2GO electronic annotation pipeline. This means that the value in column 15 (Assigned_By) has changed from 'UniProtKB' to 'InterPro' where column 6 (DB:Reference) displays the reference 'GO_REF:0000002'.
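
For users who hold older copies of the files, the May change amounts to the following rewrite of a GAF 2.0 row. This is a minimal sketch and not the code used by UniProtKB-GOA; the function name is hypothetical, and the column positions are those of the GAF 2.0 format (column 6 is DB:Reference, column 15 is Assigned_By).

```python
# 0-based positions of the GAF 2.0 columns used below.
DB_REFERENCE = 5   # column 6, DB:Reference
ASSIGNED_BY = 14   # column 15, Assigned_By


def reattribute_interpro2go(gaf_line: str) -> str:
    """Return the row with Assigned_By set to 'InterPro' for InterPro2GO rows."""
    line = gaf_line.rstrip("\n")
    if line.startswith("!"):
        return line  # header/comment lines are left untouched
    cols = line.split("\t")
    if len(cols) <= ASSIGNED_BY:
        return line  # not a full annotation row; leave as-is
    references = cols[DB_REFERENCE].split("|")
    if "GO_REF:0000002" in references and cols[ASSIGNED_BY] == "UniProtKB":
        cols[ASSIGNED_BY] = "InterPro"
    return "\t".join(cols)
```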

June 2010

1. UniProtKB-GOA gene association files started to include manual annotations for Candida albicans, Plasmodium falciparum and Agrobacterium tumefaciens UniProtKB accessions that have been created by the Candida Genome Database (http://www.candidagenome.org/), Plasmodium falciparum GeneDB (http://www.genedb.org/genedb/malaria) and Agrobacterium Genome Consortium, PAMGO project (http://www.agrobacterium.org/) respectively, from files these groups have submitted to the GO Consortium.

2. The UniProtKB-GOA group made available two new files that have been generated from gene_association.goa_uniprot. These files have split between them the information specifically required to describe a GO annotation (gp_association.goa_uniprot) and to describe the proteins for which annotations are provided (gp_information.goa_uniprot). Use of these two files instead of the gene association file has the advantage of reduced redundancy in the information supplied, resulting in a combined size of 126MB less than the gene_association.goa_uniprot file. However the format of these two files is subject to ongoing discussions by the GO Consortium, so their exact format may change over time. Readmes to describe the format of these files are available from the ftp site, alongside these two files. Files can be downloaded from:

ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_association.goa_uniprot.gz

ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_information.goa_uniprot.gz
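
The reduction in redundancy comes from storing each protein's descriptive fields once rather than on every annotation row. The sketch below illustrates that idea only: the division of GAF 2.0 columns between the two outputs is an assumption made for the example and should not be read as the layout of the gp_association or gp_information files, which is documented in the readmes and may change.

```python
from typing import Dict, Iterable, List, Tuple

# Assumed split of the 0-based GAF 2.0 columns; the real file layouts may differ.
# Annotation part: DB, ID, qualifier, GO ID, reference, evidence, with, aspect, date, assigned_by
ANNOTATION_COLS = (0, 1, 3, 4, 5, 6, 7, 8, 13, 14)
# Product part: DB, ID, symbol, name, synonym, type, taxon
PRODUCT_COLS = (0, 1, 2, 9, 10, 11, 12)


def split_gaf(rows: Iterable[str]) -> Tuple[List[tuple], Dict[Tuple[str, str], tuple]]:
    """Split GAF rows into per-annotation records and one record per protein."""
    associations: List[tuple] = []
    products: Dict[Tuple[str, str], tuple] = {}
    for row in rows:
        if row.startswith("!"):
            continue  # skip header/comment lines
        cols = row.rstrip("\n").split("\t")
        if len(cols) < 15:
            continue  # skip malformed rows
        associations.append(tuple(cols[i] for i in ANNOTATION_COLS))
        # Protein details are stored once, keyed by (DB, accession).
        products.setdefault((cols[0], cols[1]), tuple(cols[i] for i in PRODUCT_COLS))
    return associations, products
```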

3. The contents of column 1 (DB) of all UniProtKB-GOA gene association files has changed in this release, so that UniProtKB accession numbers are only identified by the namespace 'UniProtKB' instead of the previous values of either: 'UniProtKB/TrEMBL' or 'UniProtKB/Swiss-Prot'. This change has occurred in response to requests from GO tool providers. However the new gp_information.goa_uniprot.gz file described above does contain the UniProtKB subset (Swiss-Prot/TrEMBL) information in column 2 for all UniProtKB proteins that have been GO annotated.
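
Tools that still need to read files released before this change can apply the same normalisation themselves. The following is a minimal sketch under that assumption; the function name is hypothetical and the code is not part of any UniProtKB-GOA release.

```python
# The two values previously used in column 1 (DB) of the gene association files.
OLD_DB_VALUES = {"UniProtKB/Swiss-Prot", "UniProtKB/TrEMBL"}


def normalise_db_column(gaf_line: str) -> str:
    """Map the old Swiss-Prot/TrEMBL namespaces in column 1 to 'UniProtKB'."""
    line = gaf_line.rstrip("\n")
    if line.startswith("!"):
        return line  # header/comment lines are left untouched
    db, sep, rest = line.partition("\t")
    if db in OLD_DB_VALUES:
        db = "UniProtKB"
    return db + sep + rest
```

Note that this discards the Swiss-Prot/TrEMBL distinction, which then has to be taken from the gp_information file.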

July 2010

UniProtKB-GOA stopped producing the EC2GO mapping file. The production of EC2GO was passed to the GO Consortium and the file can be accessed from the GO Consortium ftp site here: ftp://ftp.geneontology.org/pub/go/external2go/ec2go or from GO CVS here: http://cvsweb.geneontology.org/cgi-bin/cvsweb.cgi/go/external2go/ec2go.

September 2010

Manual annotations to M. tuberculosis proteins from MTBbase were included in the UniProtKB-GOA UniProt gene association file.

October 2010

UniProtKB-GOA gene association files started to include manual annotations from the GO Consortium Reference Genome project. GO terms based on experimental data from the scientific literature are used to annotate ancestral genes in phylogenetic trees from the PANTHER database (http://www.pantherdb.org) by sequence similarity (evidence code ISS), and unannotated descendants of these ancestral genes are inferred to have inherited these same GO annotations by descent.

Planned for December 2010

The proteome sets provided by UniProtKB-GOA were historically based on those defined by the Integr8 project. Integr8 is planning to close after the launch of Ensembl Genomes as the next-generation interface for genome-scale data from non-vertebrate species. UniProtKB is taking over responsibility for the maintenance of the complete proteome sets, which can be found here: http://www.uniprot.org/taxonomy/complete-proteomes. UniProtKB-GOA have started to use the UniProtKB proteome sets to produce the proteomes gene association files.

There are two main consequences of this that should be noted:

1) The proteomes are available only for those species that have a complete genome, as defined by INSDC. For the complete genomes, please see http://www.ebi.ac.uk/genomes and the NCBI Project database, http://www.ncbi.nlm.nih.gov/sites/entrez?db=genomeprjh

2) The entries within the proteome are the direct translation of the sequenced reference genome; this logic is applied to all species. Any species-specific rules that Integr8 used have not been propagated to the new sets. The proteome sets will be subsets of the GOA-UniProt gene association file, and consequently if a protein accession is deemed not to be present in a particular proteome then any manual annotations made to this protein will not be visible in the proteome set; they will, however, still be available from the GOA-UniProt gene association file.
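
The second consequence can be illustrated with a short sketch of how a proteome file relates to the GOA-UniProt file. It assumes the proteome is available as a plain set of UniProtKB accessions; the function name is hypothetical and this is not the production code.

```python
from typing import Iterable, Iterator, Set


def proteome_subset(gaf_lines: Iterable[str], proteome_accessions: Set[str]) -> Iterator[str]:
    """Yield the GAF rows whose column 2 accession belongs to the proteome set."""
    for line in gaf_lines:
        if line.startswith("!"):
            continue  # skip header/comment lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) > 1 and cols[1] in proteome_accessions:
            yield line
```

Rows for accessions outside the set, including manually curated ones, are simply not emitted, which is why such annotations remain visible only in the full GOA-UniProt file.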