DictyBase Progress Report December 2010



PI: Rex Chisholm
Annotators: Petra Fey, Pascale Gaudet, Robert Dodson
Developers: Siddhartha Basu, Yulia Bushmanova, Eric Just (consultant)

All dictyBase staff contribute to GO activities, for a total of 5 FTE positions. Of these, the GO grant provides enough funding to support about 1.05 FTEs.

Annotation. Gene Ontology annotation is integral to the curation process at dictyBase. Annotation of gene products to GO terms is done concurrently with curation of literature, phenotypes, and sequences. All curators work to annotate gene products of the Dictyostelium genome. We import GOA annotations into dictyBase and incorporate them into our monthly gene association file.

The Reference Genome Project: Pascale is the manager of the Reference Genome project. All dictyBase curators annotate reference genome genes and are up to date with the selected orthologs.

Other dictyBase contributions to GO:

Pascale works with Suzi Lewis, Kara Dolinski, and Paul Thomas (Panther) on PAINT, the tree-based orthology inference tool for the Reference Genome project.

Pascale co-organized the GO annotation camp in Geneva in June 2010.

Pascale is a member of the Reference Genome, AmiGO/web presence, GO Evidence Code, and Ontology development working groups.

Petra is a member of the Newsletter group, the OBO-Edit working group, and the Reference Genome annotation group.

Siddhartha is part of the Software group.

dictyBase has moved to the Chado schema to store the Gene Ontology and GO annotations. dictyBase will also facilitate the annotation of other genomes as they become publicly available.

Annotation Progress

Table 1: Number of Annotations

Category                     12/2009  12/2010  % Change
Total number of annotations  30830    31353    1.7%
Function                     13200    13209    0.07%
Process                      9885     9926     0.4%
Component                    7745     8218     6.1%

Table 2: Number of non-IEA Annotations

Category                     12/2009  12/2010  % Change
Total number of annotations  20075    20723    3.2%
Function                     6817     6878     0.9%
Process                      7182     7279     1.3%
Component                    6076     6566     8.1%

Table 3: Number of annotations per evidence code

Evidence code  12/2009  12/2010  % Change
IMP            1269     1323     4.2%
IGI            195      207      6.2%
IPI            221      223      0.9%
ISS            9491     9483     -0.1%
IDA            1807     2262     25%
IEP            132      132      0%
TAS            480      477      -0.6%
NAS            16       16       0%
NR             0        0        N/A
IEA            10755    10630    -1.2%
ND             6212     6347     2.2%
IC             252      253      0.4%
RCA            0        0        N/A
ISO            0        0        N/A
ISA            0        0        N/A
ISM            0        0        N/A
IGC            0        0        N/A
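
The percent changes in Table 3 can be recomputed from the raw counts. The following is a minimal Python sketch, not a script used by dictyBase; the counts are copied from the table, evidence codes with zero annotations in both years are left out, and the function and variable names are illustrative.

```python
# Counts copied from Table 3 as (12/2009, 12/2010).
# Evidence codes with zero annotations in both years are omitted.
COUNTS = {
    "IMP": (1269, 1323),
    "IGI": (195, 207),
    "IPI": (221, 223),
    "ISS": (9491, 9483),
    "IDA": (1807, 2262),
    "IEP": (132, 132),
    "TAS": (480, 477),
    "NAS": (16, 16),
    "IEA": (10755, 10630),
    "ND": (6212, 6347),
    "IC": (252, 253),
}


def percent_change(old, new):
    """Return the percent change from old to new, or None if old is zero."""
    if old == 0:
        return None  # reported as N/A in the table
    return 100.0 * (new - old) / old


# Column totals over all evidence codes.
total_2009 = sum(old for old, _ in COUNTS.values())
total_2010 = sum(new for _, new in COUNTS.values())

for code, (old, new) in COUNTS.items():
    print(f"{code}\t{old}\t{new}\t{percent_change(old, new):+.1f}%")
print(f"Total\t{total_2009}\t{total_2010}\t"
      f"{percent_change(total_2009, total_2010):+.1f}%")
```

Summing the per-code counts gives the overall number of annotations for each year, and subtracting the IEA row gives the non-IEA totals.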

Methods and strategies for annotation

(please note % effort on literature curation vs. computational annotation methods)

Literature and other manual curation represent nearly 100% of the curation activities at dictyBase.

Literature curation

In addition to gene product, strain and phenotype annotation, dictyBase curators extract GO annotations from Dictyostelium publications.

We are collaborating with WormBase to use Textpresso to suggest GO cellular component terms for annotation (Van Auken et al., 2009, BMC Bioinformatics 10:228). The tool has been trained to evaluate the semantic context of GO terms and to enrich for terms in the sections of papers that describe methods or results, which gives a high recall rate for experimentally supported GO annotations. Tests have been performed and evaluated, and we are ready to establish a pipeline that runs full-text searches on GO cellular component terms and gives curators suggestions for annotations. WormBase is setting up a Textpresso search interface that will funnel results to a curation form for our use; its output will be a gene_association file that we can pick up from an FTP site and import into dictyBase. WormBase is also extending the tool to capture GO molecular functions. This will make GO curation more efficient by reducing the time curators spend on literature mining.

Automated methods

Automated annotations have not been updated recently. As part of the ongoing migration of GO to Chado, we will integrate a pipeline that imports InterPro2GO and SPKW2GO mappings again and will implement regular updates.

Once these pipelines are implemented for D. discoideum, we will also use them to add automated GO annotations to other genomes.
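
To illustrate what such a pipeline does, the sketch below reads mapping lines and turns domain hits into IEA annotations. It is a minimal Python sketch under stated assumptions, not the dictyBase pipeline: it assumes the mapping file uses the `external_id description > GO:term name ; GO:id` line layout with `!` comment lines, and the function names, identifiers, and sample entries are all hypothetical.

```python
def parse_external2go(lines):
    """Map external IDs to lists of GO IDs (assumed line layout, see above)."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("!"):
            continue  # skip blank lines and comment lines
        left, sep, right = line.partition(" > ")
        if not sep:
            continue  # not a mapping line
        external_id = left.split(None, 1)[0]      # e.g. "InterPro:IPR000000"
        go_id = right.rsplit(";", 1)[-1].strip()  # text after the last ';'
        mapping.setdefault(external_id, []).append(go_id)
    return mapping


def infer_iea(domain_hits, mapping):
    """Return sorted (gene, GO ID, 'IEA') tuples for each mapped domain hit."""
    annotations = set()
    for gene, domains in domain_hits.items():
        for domain in domains:
            for go_id in mapping.get(domain, []):
                annotations.add((gene, go_id, "IEA"))
    return sorted(annotations)


# Hypothetical mapping lines and domain hits, for illustration only.
SAMPLE_LINES = [
    "!illustrative comment line",
    "InterPro:IPR000000 Example domain > GO:example term ; GO:0000000",
    "InterPro:IPR000000 Example domain > GO:other term ; GO:0000001",
]
SAMPLE_HITS = {"DDB_G0000000": ["InterPro:IPR000000"]}

mapping = parse_external2go(SAMPLE_LINES)
annotations = infer_iea(SAMPLE_HITS, mapping)
for annotation in annotations:
    print(annotation)
```

Keeping the mapping step separate from the annotation step means the same code could be rerun unchanged whenever a mapping file is updated or a new genome is added.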

Quality control measures

dictyBase curators work closely to ensure that annotations are consistent between curators and conform to the guidelines set in the annotation documentation. We also have a set of internal guidelines recorded in the dictyBase Standard Operating Procedures (http://wiki.dictybase.org/dictywiki/index.php/Standard_Operating_Procedures) to which curators adhere. The three curators discuss consistency issues as they arise and decisions are recorded in the Standard Operating Procedures.

Development: GO migration to the Chado schema

dictyBase has migrated to the GMOD Chado schema to store, display, and distribute GO annotations. The migration was done in two broad steps:

Import of OBO file

A generic loader has been written to model the OBO specification (version 1.2 or later) in Chado. The loader is generic enough to import not only the GO file but any OBO file distributed via the OBO Foundry. It can model various relation attributes (cyclicity, transitivity, domain, and range) and other OBO relations (union_of, disjoint_from, intersection_of, etc.). The OBO updater script updates any existing ontology information in Chado. The update process detects the following changes:

  • New terms
  • Term obsoletions
  • Changes to term properties such as xrefs, synonyms, comments, and definitions

The records will be refreshed in the database as necessary.
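
The three kinds of change listed above can be illustrated by comparing two versions of an OBO file. This is a minimal Python sketch of the idea, not the dictyBase updater: it uses a deliberately simplified stanza parser, compares only a hypothetical set of watched tags, and the term IDs and function names are made up for illustration.

```python
# Tags whose values are compared between versions (an illustrative choice).
WATCHED_TAGS = ("name", "def", "comment", "synonym", "xref")


def parse_obo_terms(text):
    """Parse the [Term] stanzas of OBO text into {term_id: {tag: [values]}}."""
    terms = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are collected.
            current = {} if line == "[Term]" else None
            continue
        if not line or line.startswith("!") or current is None:
            continue  # blank line, comment, header, or non-term stanza
        tag, _, value = line.partition(":")
        value = value.strip()
        if tag == "id":
            terms[value] = current
        current.setdefault(tag, []).append(value)
    return terms


def is_obsolete(term):
    return term.get("is_obsolete") == ["true"]


def diff_ontologies(old, new):
    """Return (new term IDs, newly obsoleted IDs, {id: changed tags})."""
    shared = set(old) & set(new)
    new_terms = sorted(set(new) - set(old))
    obsoleted = sorted(
        t for t in shared if is_obsolete(new[t]) and not is_obsolete(old[t])
    )
    changed = {}
    for term_id in shared:
        tags = [
            tag for tag in WATCHED_TAGS
            if sorted(old[term_id].get(tag, []))
            != sorted(new[term_id].get(tag, []))
        ]
        if tags:
            changed[term_id] = tags
    return new_terms, obsoleted, changed


# Two hypothetical versions of a tiny ontology.
OLD_OBO = """format-version: 1.2

[Term]
id: EX:0000001
name: example process
def: "An old definition." []

[Term]
id: EX:0000002
name: soon obsolete
"""

NEW_OBO = """format-version: 1.2

[Term]
id: EX:0000001
name: example process
def: "A revised definition." []
synonym: "sample process" EXACT []

[Term]
id: EX:0000002
name: soon obsolete
is_obsolete: true

[Term]
id: EX:0000003
name: brand new term
"""

new_terms, obsoleted, changed = diff_ontologies(
    parse_obo_terms(OLD_OBO), parse_obo_terms(NEW_OBO)
)
print(new_terms, obsoleted, changed)
```

A real loader would also need to handle relationships, trailing `!` comments on tag values, and the other stanza types, which this sketch ignores.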

Import/Export of GAF file

A GAF file importer has been written to model the latest GAF 2.0 format in the Chado schema. In particular, it supports the two additional columns in GAF 2.0: annotation extension (column 16) and gene product form ID (column 17). The importer will also be used to load the GAF file from the Reference Genome Annotation project. The companion exporter script generates a GAF 2.0 file that is distributed via dictyBase and the GO Consortium download section. The format of the exported file is checked with a quality control script provided by GO. Both the importer and the exporter use Chris Mungall's GOBO module to parse and output GAF and OBO files. The gaf-inference script from the GOBO module is also used to check taxon constraints. We also continue to provide a gp2protein file for mapping UniProt accession numbers.
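
For readers unfamiliar with the format, the sketch below shows how the 17 tab-separated GAF 2.0 columns line up, including the two columns added in GAF 2.0. It is a minimal Python sketch and does not use the GOBO module or reflect the dictyBase importer; the field names, the padding behaviour for lines that end after column 15, and the sample record (gene ID, symbol, and reference) are illustrative assumptions.

```python
from collections import namedtuple

# Field names are illustrative labels for the 17 GAF 2.0 columns, in order.
GAF2_COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "db_reference", "evidence_code", "with_or_from", "aspect",
    "db_object_name", "db_object_synonym", "db_object_type", "taxon",
    "date", "assigned_by",
    "annotation_extension",   # column 16, new in GAF 2.0
    "gene_product_form_id",   # column 17, new in GAF 2.0
]
GafRecord = namedtuple("GafRecord", GAF2_COLUMNS)


def parse_gaf_line(line):
    """Parse one GAF line; return None for comment or blank lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("!"):
        return None
    fields = line.split("\t")
    if len(fields) < 15 or len(fields) > 17:
        raise ValueError(f"expected 15 to 17 columns, got {len(fields)}")
    # Assumption: treat missing trailing columns 16 and 17 as empty.
    fields += [""] * (17 - len(fields))
    return GafRecord(*fields)


def format_gaf_line(record):
    """Write a record back out as one tab-separated GAF 2.0 line."""
    return "\t".join(record)


# A hypothetical annotation line with only the first 15 columns present.
SAMPLE_LINE = "\t".join([
    "dictyBase", "DDB_G0000000", "exA", "", "GO:0005737",
    "PMID:0000000", "IDA", "", "C", "", "", "gene",
    "taxon:44689", "20101201", "dictyBase",
])

record = parse_gaf_line(SAMPLE_LINE)
print(record.go_id, record.evidence_code, record.aspect)
print(repr(record.annotation_extension), repr(record.gene_product_form_id))
```

Padding short lines lets the same record type hold annotations with and without the two GAF 2.0 columns, and `format_gaf_line` always writes all 17 columns.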

Presentations and Publications

The GO annotation Camp in Geneva, June 2010

Pascale was one of the main organizers of the third Gene Ontology Annotation Camp, held June 16-18, 2010 (Wednesday-Friday) at the Centre Médical Universitaire (CMU) in Geneva, Switzerland. Several model organism databases were represented, for a total of 63 attendees, including 40 from the SIB and 23 external delegates. The camp aimed to update and refine the skills of GO biocurators, including the Swiss-Prot curation team. The major themes of the meeting were processes that are difficult to represent in the Gene Ontology, such as regulation, responses to stimulus, and protein complexes. The goal is to improve annotation consistency so that GO users have high-quality data to support their work.

Minutes of the GO camp can be found here: http://wiki.geneontology.org/index.php/Talk:2010_GO_camp_Meeting_Agenda

Other Highlights

Ontology Development Contributions

Annotators have requested several additions and changes to the ontologies that are necessary to annotate Dictyostelium development. These requests focus on, but are not limited to, process terms that describe developmental events such as cell type differentiation and the formation of developmental structures. Curators also continue to develop a pre-composed phenotype ontology using qualities from the PATO ontology. Every newly added term combines a GO process term or a Dicty anatomy term with a PATO term from quality.obo.

New D. purpureum web portal

We now have a D. purpureum web portal that is tightly integrated with dictyBase. The different genomes are available from a drop-down in the top bar. The D. purpureum GBrowse instance contains tracks of D. discoideum orthologs, and our BLAST server provides access to both genomes.

Gene model curation for a completely annotated Dicty proteome

Curators have focused for most of the year on curating all of the roughly 12,500 Dictyostelium genes. To meet the goal of finishing gene model annotation in early 2011, we have developed a new gene model annotation tool. This software produces reports in which the information needed to make a curated gene model is available on a single web page: a view of the gene structure with coding and non-coding regions highlighted, to confirm that the start and stop context and the intron boundaries are correct; BLASTP reports to assess sequence similarity support; and genome browser snapshots to visualize the transcript evidence. The gene report also provides a quick curation tool from which a gene model can be approved immediately. As a result, approval of gene models is now 15 times faster than before, when information about gene model support had to be gathered from many different pages. Curators change gene models in 15% of cases, improving gene structures in ways that often add functional domains and sequence similarity, which may later allow for GO annotations. Inspection of the last 15% of gene predictions also leads to the deletion of small, unexpressed, and obviously wrongly predicted genes, increasing the percentage of Dictyostelium genes that can be annotated.

Orthologs on dictyBase gene pages

Orthologs from eight different species (D. purpureum, H. sapiens, M. musculus, D. melanogaster, C. elegans, S. cerevisiae, A. thaliana, and E. coli) are now available from a new tab on the dictyBase gene page.