DictyBase Progress Report December 2012

From GO Wiki

Staff:

PI: Rex Chisholm

Annotators: Petra Fey, Robert Dodson; Pascale Gaudet (consultant)

Developers: Siddhartha Basu, Yogesh Pandit

All dictyBase staff contribute to GO activities, for a total of 4.1 FTE positions. Of these, funding from the GO grant supports about 0.5 FTE.

Annotation:

Gene Ontology annotation is integral to the curation process at dictyBase. Annotation of gene products to GO terms is done concurrently with curation of literature, strains, phenotypes, and general nomenclature. Both curators work to annotate gene products of the Dictyostelium genome.

In 2011 we moved to storing GO annotations in Chado and implemented the ability to import OBO and GAF 2.0 format files. This made our own GO annotation tool obsolete, and we then started to use the Protein2GO tool from the EBI to annotate GO for Dictyostelium proteins. Currently we are working to re-import annotations from GOA back into dictyBase, including electronic annotations, on a regular schedule. The annotations will then be sent to the GO Consortium on a biweekly basis.

The use of Protein2GO has the advantage that we can react quickly to changing annotation practices without immediately investing dictyBase developer time. For example, dictyBase curators now routinely use column 16 for annotation extensions in the Protein2GO tool. While these annotation extensions will be available in our GAF file, their display in dictyBase can be deferred to fit our priorities. dictyBase curators can also send feedback to Rachael Huntley and Tony Sawford at the EBI, and issues are quickly resolved through collegial discussions. For example, in response to our request, IGI annotations are now easier to make via a reciprocal mechanism. Rachael and Tony have also been very helpful with other requests, e.g. they converted our legacy DDB IDs in the ‘with’ column into UniProt IDs and deleted some ISS annotations that had no value in the ‘with’ column.
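As an illustration of where the annotation extensions sit in a GAF file, here is a minimal sketch that splits one GAF 2.0 line into its 17 tab-separated columns and reads column 16. The function names and the example line (accession, gene symbol, reference, and extension values) are invented for illustration and are not real dictyBase data; the comma split is naive and assumes no commas inside the parentheses.

```python
# GAF 2.0 column names, in file order (17 tab-separated columns).
GAF_COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "db_reference", "evidence_code", "with_from", "aspect",
    "db_object_name", "db_object_synonym", "db_object_type",
    "taxon", "date", "assigned_by", "annotation_extension",
    "gene_product_form_id",
]


def parse_gaf_line(line):
    """Split one GAF 2.0 data line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    # Pad short lines so optional trailing columns exist as empty strings.
    fields += [""] * (len(GAF_COLUMNS) - len(fields))
    return dict(zip(GAF_COLUMNS, fields))


def annotation_extensions(record):
    """Return column 16 as a list of alternatives.

    Each alternative is a list of relation(target) strings.
    """
    raw = record["annotation_extension"]
    if not raw:
        return []
    # '|' separates independent alternatives; ',' joins extensions
    # that apply together.
    return [group.split(",") for group in raw.split("|")]


# Hypothetical example line; every value here is a placeholder.
EXAMPLE_LINE = "\t".join([
    "UniProtKB", "P00000", "geneA", "", "GO:0005515", "PMID:0000000",
    "IPI", "UniProtKB:P00001", "F", "", "", "protein", "taxon:44689",
    "20121201", "dictyBase",
    "occurs_in(GO:0005829),part_of(GO:0006935)", "",
])

record = parse_gaf_line(EXAMPLE_LINE)
print(record["evidence_code"], annotation_extensions(record))
```

Because column 16 travels with the GAF file itself, a consumer of the file can read the extensions this way even before they are displayed on the dictyBase gene pages.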

Semi-automated annotation: We have done extensive testing and are about to start using Textpresso to suggest GO terms for annotation to cellular component terms (Van Auken et al., BMC Bioinformatics 2009, 10:228). As part of the evaluation we participated in the BioCreative Workshop Track III in April 2012, for which we compared purely manually curated cellular component annotations with Textpresso-assisted annotations (Van Auken et al. Database 2012; Arighi et al. Database 2012). The results indicated that Textpresso-assisted annotation would be beneficial for our small group. To fit this into our workflow, new papers in dictyBase are sent to Donghui Li at WormBase on a weekly basis and run through the Textpresso pipeline. Results are then presented in the Textpresso curation interface, which will soon be implemented for dictyBase curators. We expect this to increase the efficiency of GO curation by reducing the time curators spend reading the literature.

Petra is a member of the Newsletter group and coordinates all dictyBase GO annotation issues regarding Protein2GO with the EBI. Robert completed training in August 2012 and started to annotate Dictyostelium GO independently in Protein2GO. Petra, Robert and Siddhartha are working with Donghui Li and Kimberly Van Auken on the semi-automated CC annotations using Textpresso. Siddhartha is part of the Software group and is working with Tony Sawford on the pipeline for the biweekly import of our annotations from the EBI. Pascale is a RefGenome manager and is involved in the development of the Phylogenetic Annotation and INference Tool (PAINT) and in training other curators to use the tool.


Other dictyBase contributions to GO:

Both dictyBase curators work with GO editors and other curators in the field to improve the GO, and contribute to discussions on the GO email list and SourceForge. Seven new GO terms were requested in 2012 to curate Dictyostelium papers.

Petra and Robert participated in the BioCreative Workshop Track III in April 2012. For 15 Dictyostelium publications, Petra prepared gold-standard cellular component annotations as a control. Robert used the Textpresso annotation tool for the same papers, and the results were compared. The outcome was encouraging enough to implement the Textpresso pipeline for dictyBase.

dictyBase created a multi-genome environment that now contains websites for D. purpureum, D. fasciculatum and P. pallidum (Basu et al. Nucl. Acids Res. 2012). In the future, we plan to semi-automatically transfer experimental GO annotations by ISS to 1:1 orthologs in these species.
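The planned ISS transfer could be sketched roughly as follows. This is only an illustration under stated assumptions, not the dictyBase pipeline: the function name, the tuple layout, and the gene identifiers are hypothetical, the ortholog table is assumed to be given as a simple dictionary, and the source gene is recorded in the 'with' field of each proposed ISS annotation.

```python
from collections import Counter

# GO experimental evidence codes; only annotations carrying one of
# these are treated as candidates for transfer.
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}


def propose_iss_annotations(annotations, orthologs):
    """Propose ISS annotations for strict 1:1 orthologs.

    annotations: iterable of (gene_id, go_id, evidence_code) tuples
                 for the source species.
    orthologs:   dict mapping a source gene_id to a list of ortholog
                 ids in the target species (hypothetical input format).
    """
    # Count how many source genes point at each target gene, so that
    # many-to-one relationships can be excluded as well.
    target_counts = Counter(t for ts in orthologs.values() for t in ts)

    proposed = []
    for gene_id, go_id, evidence in annotations:
        if evidence not in EXPERIMENTAL_CODES:
            continue  # do not propagate non-experimental annotations
        targets = orthologs.get(gene_id, [])
        if len(targets) != 1 or target_counts[targets[0]] != 1:
            continue  # not a 1:1 ortholog pair
        proposed.append({
            "gene_id": targets[0],
            "go_id": go_id,
            "evidence_code": "ISS",
            "with_from": gene_id,  # ISS records the source gene here
        })
    return proposed


# Hypothetical identifiers, for illustration only.
source_annotations = [
    ("srcA", "GO:0006935", "IMP"),   # experimental, 1:1 ortholog
    ("srcB", "GO:0005829", "IEA"),   # electronic, skipped
    ("srcC", "GO:0005515", "IPI"),   # two orthologs, skipped
]
ortholog_table = {
    "srcA": ["tgtA"],
    "srcB": ["tgtB"],
    "srcC": ["tgtC1", "tgtC2"],
}
print(propose_iss_annotations(source_annotations, ortholog_table))
```

Keeping the output as proposals rather than final annotations leaves room for a curator to review them, which is what "semi-automatically" suggests.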


Annotation Progress

The first reimport of GO annotations into dictyBase from the EBI is imminent. Note, however, that the 2012 numbers in this report were obtained from QuickGO before the reimport and filtered for annotations assigned by dictyBase. Please also note that the dramatic decrease in non-IEA annotations (Table 2) is due to the deletion of the majority of ISS annotations (5734), which were not valid under current ISS annotation standards. However, new IEAs more than compensate for this loss in total annotation numbers (Table 1).

Table 1: Number of Annotations

Aspect               2011     2012     % Change
Total annotations    31604    56475    +79%
Function             13278    22166    +67%
Process              10001    18600    +86%
Component             8325    15709    +89%

Table 2: Number of non-IEA Annotations

Aspect               2011     2012     % Change
Total annotations    20985    18604    -11%
Function              6959     4566    -34%
Process               7358     8471    +15%
Component             6668     5567    -17%


Table 3: Additional Numbers

Category                                  2011    2012    % Change
Total ISS annotations                     9531    3797    -60%
New manual GO annotations (Protein2GO)     283     642    +127%
Column 16 annotations                        -      66    -
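
The "% Change" columns are relative to the 2011 count. As a quick check on a few rows, the short sketch below recomputes the change for the totals and confirms that the per-aspect counts add up to the reported totals; the helper name `percent_change` is ours, and the figures are copied from Tables 1 to 3.

```python
def percent_change(old, new):
    """Percent change from old to new, rounded to the nearest integer."""
    return round(100 * (new - old) / old)


# (2011, 2012) counts copied from the tables.
all_annotations = {"Function": (13278, 22166),
                   "Process": (10001, 18600),
                   "Component": (8325, 15709)}
non_iea = {"Function": (6959, 4566),
           "Process": (7358, 8471),
           "Component": (6668, 5567)}

# Per-aspect counts should add up to the reported totals.
total_all = tuple(sum(pair[i] for pair in all_annotations.values())
                  for i in (0, 1))
total_non_iea = tuple(sum(pair[i] for pair in non_iea.values())
                      for i in (0, 1))

print("All annotations:", total_all, percent_change(*total_all))
print("Non-IEA:", total_non_iea, percent_change(*total_non_iea))
print("New manual GO:", percent_change(283, 642))

# The drop in ISS annotations equals the number reported as deleted.
print("ISS annotations removed:", 9531 - 3797)
```

The last line ties Table 3 back to the text: the difference between the 2011 and 2012 ISS counts is the 5734 deleted ISS annotations mentioned above.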


Methods and strategies for annotation

(Please note % effort on literature curation vs. computational annotation methods.)

Literature and other manual curation represent nearly 100% of the curation activities at dictyBase.

Literature curation

In addition to gene product, strain and phenotype annotation, dictyBase curators extract GO annotations from Dictyostelium publications. GO annotations are added using the Protein2GO tool provided by the EBI.

Use of Textpresso to annotate cellular components is imminent. Extension of the Textpresso pipeline to capture other GO aspects is currently under development by WormBase.

We recently started a pilot project to involve the Dictyostelium research community in literature curation. The idea was brought forward at the 2012 International Dictyostelium Conference. Each week we ask authors of newly published papers to provide curation from their paper in a simple MS Word file. We ask for strain, phenotype, and GO annotations, and provide links and a short help document to aid annotation. The return rate is nearly 100%. The information obtained helps curators focus their effort and saves them time finding the relevant data in the paper. In addition, it may have the long-term benefit of making authors more aware of curator needs when they write.

Automated methods

IEAs will be imported from GOA and assigned to the respective gene products on a biweekly schedule.

Quality control measures

dictyBase curators work closely together to ensure that annotations are consistent between curators and conform to the guidelines set in the annotation documentation. We also have a set of internal guidelines recorded in the dictyBase Standard Operating Procedures (http://wiki.dictybase.org/dictywiki/index.php/Standard_Operating_Procedures), to which curators adhere. Curators discuss consistency issues as they arise, and decisions are recorded in the Standard Operating Procedures.

Publications

Van Auken K, Fey P, Berardini TZ, Dodson R, Cooper L, Li D, Chan J, Li Y, Basu S, Muller HM, Chisholm R, Huala E, Sternberg PW; the WormBase Consortium. Text mining in the biocuration workflow: applications for literature curation at WormBase, dictyBase and TAIR. Database (Oxford). 2012 Nov 17;2012(0):bas040. Print 2012. PubMed PMID: 23160413; PubMed Central PMCID: PMC3500519.


Arighi CN, Carterette B, Cohen KB, Krallinger M, Wilbur WJ, Fey P, Dodson R, Cooper L, Van Slyke CE, Dahdul W, Mabee P, Li D, Harris B, Gillespie M, Jimenez S, Roberts P, Matthews L, Becker K, Drabkin H, Bello S, Licata L, Chatr-aryamontri A, Schaeffer ML, Park J, Haendel M, Van Auken K, Li Y, Chan J, Muller HM, Cui H, Balhoff JP, Wu JCY, Lu Z, Wei CH, Tudor CO, Raja K, Subramani S, Natarajan J, Cejuela JM, Dubey P, Wu C. An Overview of the BioCreative 2012 Workshop Track III: Interactive Text Mining Task. Database, accepted.


Basu S, Fey P, Pandit Y, Dodson R, Kibbe WA, Chisholm RL. dictyBase 2013: integrating multiple Dictyostelid species. Nucl. Acids Res. (2012). doi: 10.1093/nar/gks1064. PubMed PMID: 23172289.

Presentations

Petra Fey, Robert Dodson, and Rex L. Chisholm. dictyBase Literature Curation and how Authors can Help. International Dictyostelium Conference 2012, Madrid, Spain.


Siddhartha Basu, Robert Dodson, Petra Fey, Warren A. Kibbe, Rex L. Chisholm. Tools to explore new genomes at dictyBase. International Dictyostelium Conference 2012, Madrid, Spain.