GOA December 2010: Difference between revisions

Revision as of 10:45, 24 November 2010

!!Report In Progress!!

Gene Ontology Annotation at UniProtKB, 2010

Staff:

Rolf Apweiler

Claire O'Donovan

Emily Dimmer

Rachael Huntley

Yasmin Alam-Faruque

Daniel Barrell

David Binns

Tony Sawford

Swiss-Prot contributors (EBI, Hinxton, UK and SIB, Geneva, Switzerland): Ioannis Xenarios, Amos Bairoch, Lydie Bougueleret, Serenella Ferro-Rojas

Ghislaine Argoud-Puy, Andrea Auchinchloss, Kristian Axelsen, Marie-Claude Blatter, Emmanuel Boutet, Silvia Braconi Quintaje, Lionel Breuza, Alan Bridge, Paul Browne, Wei Mun Chan, Elizabeth Coudert, Isabelle Cusin, Louise Daugherty, Paula Duek Roggli, Ruth Eberhardt, Anne Estreicher, Livia Famiglietti, Marc Feuermann, Rebecca Foulger, Michael Gardner, Arnaud Gos, Nadine Gruaz-Gumowski, Ursula Hinz, Chantal Hulo, Janet James, Silvia Jimenez, Florence Jungo, Guillaume Keller, Kati Laiho, Duncan Legge, Philippe Lemercier, Damien Lieberherr, Michele Magrane, Patrick Masson, Madelaine Moinat, Ivo Pedruzzi, Klemens Pichler, Diego Poggioli, Sylvain Poux, Catherine Rivoire, Bernd Roechert, Michel Schneider, Harminder Sehra, Eleanor Stanley, Andre Stutz, Shyamala Sundaram, Michael Tognolli

Annotation Progress

We continue to prioritise the annotation of those genes selected for the Reference Genome Project.

Proteins associated with kidney development and disease are the focus of the GOA Renal Annotation Initiative, which is headed by Dr. Yasmin Alam-Faruque.

Currently the curators from the GOA and BHF-UCL projects have together completely annotated 65% (721/1111) of supplied Reference Genome Targets.

Between January 2010 and November 2010, the GOA project provided the GO Consortium with eleven annotation file releases, including non-redundant sets of GO annotations to the human, mouse, rat, zebrafish, Arabidopsis, chicken and cow proteomes, as well as data releases for annotations of all proteins in UniProtKB. Since 12th July 2010, GOA has also provided an interim release of the human and chicken gene association files so that the Reference Genomes PAINT project can collect the most up-to-date annotations for use in tree curation. The human and chicken files are therefore now released every two weeks: once as part of the main GOA monthly file release and again two weeks later.

GOA now provides over 72 million GO annotations for 8.5 million proteins in over 283,000 different taxonomic groups. GOA provides almost 199,000 annotations for the human proteome (providing over 92% of the human proteome with at least one GO annotation). Over the last year the number of manual annotations has increased by 19.8% in the UniProtKB file and by 36% in the human file.

Between January and November 2010, GOA continued training, checking and supporting 35 curators in the Swiss-Prot team at the Swiss Institute of Bioinformatics, who have so far created a total of almost 41,700 manual GO annotations for UniProtKB entries from a range of species.

GOA UniProt gene association file release stats (comparison of January 2010 and November 2010 releases)

Key

The two cells in orange are the TIGR statistics from GOA's December 2009 release as these were temporarily missing from the January 2010 release.

* New sources of annotation after January 2010

Methods and strategies for annotation

1. Literature curation:

Literature curation continues to be the major focus of our annotation efforts, with an emphasis on the use of experimental evidence codes.


2. Computational annotation strategies:

GOA provides IEA annotations from the following methods:

  1. Swiss-Prot Keyword 2GO (SPKW2GO) [1,2]
  2. Swiss-Prot Subcellular Locations2GO (SPSL2GO) [1,2]
  3. HAMAP2GO [2]
  4. InterPro2GO [2]
  5. Ensembl Compara


Key

1: mapping tables created and maintained by the GOA group

2: electronic annotations generated by the GOA group, using UniProtKB.
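
As a rough illustration of how a mapping-table method such as SPKW2GO yields IEA annotations, the sketch below applies a keyword-to-GO lookup to the keywords of a protein. The two mapping entries, the function name and the tuple layout are assumptions made for this example; they are not taken from the released mapping files or from GOA's pipeline.

```python
from typing import Dict, Iterable, List, Tuple

# Illustrative entries only (assumption); the released SPKW2GO table is the authority.
KEYWORD_TO_GO: Dict[str, List[str]] = {
    "ATP-binding": ["GO:0005524"],  # ATP binding
    "Nucleus": ["GO:0005634"],      # nucleus
}


def keyword_annotations(
    accession: str,
    keywords: Iterable[str],
    mapping: Dict[str, List[str]] = KEYWORD_TO_GO,
) -> List[Tuple[str, str, str]]:
    """Return (accession, GO ID, evidence code) tuples for every mapped keyword.

    Keywords without an entry in the mapping table produce no annotation.
    """
    annotations = []
    for keyword in keywords:
        for go_id in mapping.get(keyword, []):
            annotations.append((accession, go_id, "IEA"))
    return annotations
```

Keeping the mapping as plain data means that an update to the table changes the electronic annotations on the next run without any change to the code.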


3. Priorities for annotation

  1. Genes assigned by Reference Genome Project (Rachael, Emily)
  2. Genes associated with renal processes (Yasmin)
  3. Requests from user community (all curators)
  4. Proteins annotated during Swiss-Prot curation duties (all Swiss-Prot/UniProtKB curators at the EBI and SIB)

Presentations and Publications

Publications, Talks, Posters 2010-

Other Highlights

A. Ontology Development Contributions:

  • 102 SourceForge items regarding requested changes to the GO have been placed by curators associated with the GOA group between January and November 2010.
  • Yasmin Alam-Faruque and the GOA group hosted a kidney-related ontology development meeting in January 2010, during which renal experts, ontology editors and curators discussed new renal-related terms. As a result of this meeting, 426 new GO terms have so far been created, allowing curators to choose much more specific terms when annotating kidney function and process.

B. Annotation Outreach and User Advocacy Efforts:

  • In September, GOA was contacted by a researcher requesting that annotations he had personally made to M. tuberculosis proteins be included in the GOA database. The annotations were reviewed by Rachael Huntley (GOA) and Rama Balakrishnan (SGD) and were deemed to be of high quality. The annotations were subsequently incorporated into the September release of the GOA-UniProt file.
  • During 2010, UniProtKB-GOA curators continued to provide manual GO annotation training to Swiss-Prot curators at the Swiss Institute of Bioinformatics (SIB), Geneva. All 35 SIB curators have now completed initial training, and 18 of them have completed the entire training program (including post-training annotation checking). The Swiss-Prot team in Geneva have so far generated 41,665 manual GO annotations to 11,072 UniProtKB proteins (data taken 24th November 2010). Annotations are created in GOA's Protein2GO tool and released in the group's gene association files. Such annotations use the existing source 'UniProtKB' (for column 15 of the gene association file). GOA will continue to mentor the SIB curators into 2011.
  • Emily Dimmer and Rachael Huntley continue to answer queries sent to the GO helpdesk.
  • Emily Dimmer, Rachael Huntley and Yasmin Alam-Faruque continue to answer user queries sent to the GOA project.
  • GOA is continuing to support external annotation groups, such as AgBase, BHF-UCL, DFLAT at Tufts University and SIB, by providing use of the Protein2GO curation tool.
  • Together with Rama Balakrishnan from SGD, Emily Dimmer is a manager for the GO Consortium's Annotation Advocacy and Coordination group. From June to December, Rachael Huntley has been covering Emily Dimmer's role in this group. The aims of the group are to:
  * educate GO Consortium curators about best annotation practice
  * enforce the annotation rules and policies within the GOC
  * maintain the annotation and evidence code documentation
  * educate and keep all the annotating groups up-to-date with changes in GAF format and ontology development 
  * assist new groups with annotations


C. Other

Improvements to the QuickGO user interface

Following the extensive redevelopment of QuickGO (http://www.ebi.ac.uk/QuickGO) in 2008, the user interface of QuickGO has now been substantially improved. The main improvements are:

  • inclusion of toolbars containing icons to access particular functions (e.g. filtering, ID mapping, and downloading), providing a neat and intuitive display for the functionality within QuickGO
  • addition of lightbox displays for performing the various functions within QuickGO
  • re-design of the GO Slims and GO Term Comparison page to provide a more structured procedure for using a chosen list of GO terms
  • consistency of displays, for example, wherever you see a GO ID in QuickGO this will be a link to the term page for that GO term

The statistics section of QuickGO has also been improved with the addition of statistics both for counts of annotations and for counts of proteins. These are now available for download in a text file format, which will enable users to create graphical representations of the statistical data for use in publications, e.g. a bar graph for the number of proteins in a given list annotated to GO terms present in a GO slim, which will provide an overview of the biological attributes of the list of proteins.


Changes to Swiss-Prot Keyword 2GO (SPKW2GO) and Swiss-Prot Subcellular Locations2GO (SPSL2GO)

With the aid of Serenella Ferro-Rojas at SIB, the SPKW2GO and SPSL2GO mappings have been updated; there are currently 414 mappings between SPSL and GO and 694 mappings between SPKW and GO.


Renal GO annotation initiative funded by Kidney Research UK.

The GOA renal project, under the direction of Yasmin Alam-Faruque, has been very successful to date, providing 745 proteins with 5942 annotations. Additionally, as a result of a kidney ontology development workshop hosted by UniProtKB-GOA, 426 new GO terms were created (1.4% of the whole of GO).

UniProtKB-GOA continues to support the British Heart Foundation GO Annotation Initiative at UCL.

Quality Control Checks


Gene Association File changes

April 2010

1. To avoid processing problems for GO Consortium tools, we changed the contents of column 1 in the GOA-UniProt gene association file. Column 1 originally displayed the values 'UniProtKB/TrEMBL' or 'UniProtKB/Swiss-Prot' to indicate which section of UniProtKB an accession belongs to. This information is now provided in a tab-separated, supplementary gene product information file (the gp_information file can be found here: ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/) that is released alongside the GOA-UniProt gene association files (for more information on this file see the readme: ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_information_readme). Column 1 of our UniProt GAF now consistently displays 'UniProtKB' for all UniProtKB accessions.

2. UniProtKB-GOA files are supplied in GAF2.0 format (http://www.geneontology.org/GO.format.gaf-2_0.shtml).

3. New PDB gene association file. This file has been generated from a collaboration between the InterPro, PDB and UniProtKB-GOA teams, and once again is able to offer annotations to PDB chain identifiers. In addition, further sources of GO annotations are now associated with PDB chains, to provide a more comprehensive PDB GO annotation resource. Manual and electronic GO annotations are now provided in this file from two sources:

a. where an InterPro entry matches a PDB chain, annotations supplied by the InterPro2GO electronic method are assigned to the chain identifier (for further details on this method see: http://www.geneontology.org/cgi-bin/references.cgi#GO_REF:0000002).

b. PDB chains are additionally supplied with manual and electronic GO annotations (excluding InterPro2GO) when a PDB chain maps with at least 90% identity to a UniProtKB accession (more specifically with the UniProtKB's CHAIN feature), whereupon manual and electronic annotations are supplied to the PDB chain identifier from the matching UniProtKB accession.
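
The two sources (a) and (b) above can be sketched as a single selection rule. This is only an illustration of the rule as described here, not the pipeline used by the InterPro, PDB and UniProtKB-GOA teams; the `Annotation` class, the function name and the way the percentage identity is passed in are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional

INTERPRO2GO_REF = "GO_REF:0000002"  # reference used by the InterPro2GO method
MIN_IDENTITY = 90.0                 # threshold stated in (b), in percent


@dataclass(frozen=True)
class Annotation:
    go_id: str
    reference: str  # value of the DB:Reference column


def annotations_for_chain(
    interpro_annotations: List[Annotation],
    uniprot_annotations: List[Annotation],
    identity_percent: Optional[float],
) -> List[Annotation]:
    """Collect the annotations a PDB chain would receive.

    interpro_annotations: InterPro2GO annotations from InterPro entries that
        match the chain directly, source (a).
    uniprot_annotations: all annotations on the mapped UniProtKB accession.
    identity_percent: identity between the chain and the UniProtKB CHAIN
        feature, or None when the chain has no mapping.
    """
    result = list(interpro_annotations)
    if identity_percent is not None and identity_percent >= MIN_IDENTITY:
        # Source (b): transfer everything except InterPro2GO annotations,
        # which are only assigned through a direct InterPro match.
        result.extend(
            a for a in uniprot_annotations if a.reference != INTERPRO2GO_REF
        )
    return result
```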

May 2010

UniProtKB-GOA gene association files changed to correctly attribute the InterPro group as the source of annotations generated by the InterPro2GO electronic annotation pipeline. This means that the value in column 15 (Assigned_By) has changed from 'UniProtKB' to 'InterPro' where column 6 (DB:Reference) displays the reference 'GO_REF:0000002'.
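
For users who still hold older files, the change can be reproduced with a small filter. The sketch below is a hedged illustration, assuming tab-separated GAF lines in which column 6 may hold several pipe-separated references; the function name is hypothetical and this is not the code GOA used.

```python
INTERPRO2GO_REF = "GO_REF:0000002"


def attribute_interpro(gaf_line: str) -> str:
    """Set column 15 (Assigned_By) to 'InterPro' for InterPro2GO annotations.

    The line is returned without its trailing newline. Comment lines and
    lines with fewer than 15 columns are returned unchanged.
    """
    line = gaf_line.rstrip("\n")
    if line.startswith("!"):
        return line
    columns = line.split("\t")
    if len(columns) < 15:
        return line
    references = columns[5].split("|")  # column 6, DB:Reference
    if INTERPRO2GO_REF in references and columns[14] == "UniProtKB":
        columns[14] = "InterPro"        # column 15, Assigned_By
    return "\t".join(columns)
```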

June 2010

1. UniProtKB-GOA gene association files started to include manual annotations for Candida albicans, Plasmodium falciparum and Agrobacterium tumefaciens UniProtKB accessions that have been created by the Candida Genome Database (http://www.candidagenome.org/), Plasmodium falciparum GeneDB (http://www.genedb.org/genedb/malaria) and Agrobacterium Genome Consortium, PAMGO project (http://www.agrobacterium.org/) respectively, from files these groups have submitted to the GO Consortium.

2. The UniProtKB-GOA group made available two new files that have been generated from gene_association.goa_uniprot. Between them, these files separate the information specifically required to describe a GO annotation (gp_association.goa_uniprot) from the information that describes the proteins for which annotations are provided (gp_information.goa_uniprot). Using these two files instead of the gene association file has the advantage of reduced redundancy in the information supplied, resulting in a combined size 126MB smaller than the gene_association.goa_uniprot file. However, the format of these two files is subject to ongoing discussions by the GO Consortium, so their exact format may change over time. Readmes describing the format of these files are available from the ftp site, alongside these two files. Files can be downloaded from:

ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_association.goa_uniprot.gz
ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_information.goa_uniprot.gz

3. The contents of column 1 (DB) of all UniProtKB-GOA gene association files have changed in this release, so that UniProtKB accession numbers are only identified by the namespace 'UniProtKB' instead of the previous values of either 'UniProtKB/TrEMBL' or 'UniProtKB/Swiss-Prot'. This change has occurred in response to requests from GO tool providers. However, the new gp_information.goa_uniprot.gz file described above does contain the UniProtKB subset (Swiss-Prot/TrEMBL) information in column 2 for all UniProtKB proteins that have been GO annotated.
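
Tools that relied on the old column 1 values can recover the subset from the gp_information file instead. The sketch below is a minimal illustration: it assumes tab-separated lines with '!' marking comment lines, and because the position of the accession column is not stated here, the caller must supply it after checking the readme. Both function names are hypothetical.

```python
from typing import Dict, Iterable

LEGACY_DB_VALUES = {"UniProtKB/Swiss-Prot", "UniProtKB/TrEMBL"}


def normalise_db(value: str) -> str:
    """Collapse the old column 1 values to the single namespace 'UniProtKB'."""
    return "UniProtKB" if value in LEGACY_DB_VALUES else value


def load_subsets(
    lines: Iterable[str],
    accession_column: int,  # 1-based; take this from the readme (not assumed here)
    subset_column: int = 2,  # 1-based; column 2 as stated in the text above
) -> Dict[str, str]:
    """Map each accession to its UniProtKB subset (Swiss-Prot or TrEMBL)."""
    subsets: Dict[str, str] = {}
    for line in lines:
        if not line.strip() or line.startswith("!"):
            continue  # skip blank and comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < max(accession_column, subset_column):
            continue  # skip lines too short to hold both columns
        subsets[fields[accession_column - 1]] = fields[subset_column - 1]
    return subsets
```

With a file opened as `handle`, a call such as `load_subsets(handle, accession_column=3)` would build the lookup, where 3 is only an example value.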

July 2010

GOA stopped producing the EC2GO mapping file. The production of EC2GO was passed to the GO Consortium, and the file can be accessed from the GO Consortium ftp site here: ftp://ftp.geneontology.org/pub/go/external2go/ec2go or from GO CVS here: http://cvsweb.geneontology.org/cgi-bin/cvsweb.cgi/go/external2go/ec2go.

September 2010

Manual annotations to M. tuberculosis proteins from MTBbase were included in the UniProtKB-GOA UniProt gene association file.

October 2010

UniProtKB-GOA gene association files started to include manual annotations from the GO Consortium Reference Genome project. GO terms based on experimental data from the scientific literature are used to annotate ancestral genes in phylogenetic trees from the PANTHER database (http://www.pantherdb.org) by sequence similarity (evidence code ISS), and unannotated descendants of these ancestral genes are inferred to have inherited these same GO annotations by descent.

December 2010??

The proteome sets provided by GOA were historically based on those defined by the Integr8 project. Integr8 is planning to close after the launch of Ensembl Genomes as the next-generation interface for genome-scale data from non-vertebrate species. UniProtKB is taking over responsibility for the maintenance of the complete proteome sets, which can be found here: http://www.uniprot.org/taxonomy/complete-proteomes. UniProtKB-GOA have started to use the UniProtKB proteome sets to produce the proteomes gene association files.

There are two main consequences of this that should be noted:

1) The proteomes are available only for those species that have a complete genome, as defined by INSDC. For the complete genomes, please see http://www.ebi.ac.uk/genomes and the NCBI Project database, http://www.ncbi.nlm.nih.gov/sites/entrez?db=genomeprjh

2) The entries within the proteome are the direct translation of the sequenced reference genome; this logic is applied to all species. Any species-specific rules that Integr8 used have not been propagated to the new sets. The proteome sets will be subsets of the GOA-UniProt gene association file, and consequently if a protein accession is deemed not to be present in a particular proteome then any manual annotations made to this protein will not be visible in the proteome set; they will, however, still be available from the GOA-UniProt gene association file.
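
The second consequence can be pictured as a simple filter over the GOA-UniProt file. The sketch below is illustrative only and is not the GOA production process: it assumes tab-separated GAF lines with the accession in column 2 (DB_Object_ID) and a set of accessions already loaded for the proteome; the function name is hypothetical.

```python
from typing import Iterable, Iterator, Set


def proteome_annotations(
    gaf_lines: Iterable[str], proteome_accessions: Set[str]
) -> Iterator[str]:
    """Yield only the GAF lines whose accession belongs to the proteome.

    Annotations to accessions outside the set are dropped from the proteome
    file, although they remain in the full GOA-UniProt file.
    """
    for line in gaf_lines:
        if line.startswith("!"):
            continue  # header and comment lines are not annotations
        columns = line.rstrip("\n").split("\t")
        if len(columns) > 1 and columns[1] in proteome_accessions:
            yield line
```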