SGD December 2013
Saccharomyces Genome Database Summary, 2013
Rama Balakrishnan, Gail Binkley, J. Michael Cherry, Maria Costanzo, Selina Dwight, Stacia Engel, Dianna Fisk, Ben Hitz, Stuart Miyasato, Matt Simison, Rob Nash, Marek Skrzypek, Shuai Weng, Edith Wong, Paul Lloyd, Janos Demeter, Diane Inglis, Kelley Paskov
[please include FTEs working on GOC tasks, designating as well how many FTEs are funded by the GOC NHGRI grant]
|GO Aspect||Number of Annotations Added||Number of genes updated||Number of publications used|
Note that these numbers count manually curated and high-throughput annotations only for ORFs that are Verified or Uncharacterized (Dubious ORFs are excluded), for RNA genes (ncRNA, rRNA, snRNA, snoRNA, or tRNA) and for genes encoded within transposable elements. It should also be noted these annotations may include both new annotations and updated annotations which replaced older ones.
State of GO annotations Genome wide
|Type||Counts as of December 18, 2013|
|GP with Any Annotation||6380|
|GPs with Manual annotation||6380|
|GP with Experimental and Computational Evidence||5445|
|GP with Computational Evidence||5445|
|GP with Curator Evidence (TAS, NAS, IC)||1216|
|GP with No Data (ND) in MF||1953|
|GP with No Data (ND) in BP||1152|
|GP with No Data (ND) in CC||751|
|Total Annotations with Manual curation||44254|
|Total Annotations with Computational Evidence||40561|
|Annotations with Curator Evidence (TAS, NAS, IC)||2397|
|GeneProducts with No Data (ND)||3856|
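As a rough illustration of how counts like those in the table could be derived from a GAF 2.0 annotation file, the sketch below tallies distinct gene products per evidence group. It is not SGD's actual reporting code. The function name `tally_gene_products` and the group labels are hypothetical, and treating only IEA as "computational" is a simplifying assumption; the curator group (TAS, NAS, IC) and the ND group follow the table.

```python
from collections import defaultdict

# Curator codes follow the table above; the other groupings are assumptions.
CURATOR_CODES = {"TAS", "NAS", "IC"}


def tally_gene_products(gaf_lines):
    """Count distinct gene products per evidence group from GAF 2.0 lines."""
    groups = defaultdict(set)
    for line in gaf_lines:
        # Skip blank lines and "!" header comments.
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        # GAF columns (1-based): 2 = object ID, 7 = evidence code, 9 = aspect.
        gene_id, evidence, aspect = cols[1], cols[6], cols[8]
        groups["any"].add(gene_id)
        if evidence in CURATOR_CODES:
            groups["curator"].add(gene_id)
        elif evidence == "ND":
            groups["ND_" + aspect].add(gene_id)  # aspect is F, P, or C
        elif evidence == "IEA":
            groups["computational"].add(gene_id)
    return {name: len(ids) for name, ids in groups.items()}
```

Using sets keyed by gene product ID means a gene with several annotations in the same group is counted once, which matches the "GP with ..." rows rather than the "Total Annotations ..." rows.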
Methods and strategies for annotation (please note % effort on literature curation vs. computational annotation methods)
a. Literature curation: 100% of SGD’s effort is dedicated to manual curation based on the published literature for budding yeast genes and their products.
b. Computational annotation strategies: SGD does not employ automated methods to assign annotations; instead, we absorb the computationally predicted annotations made by the UniProtKB GOA project for S. cerevisiae. The IEA annotations are loaded into the SGD database from the GOA gene association file after each release. All these annotations are included in the gene_association.sgd file, which represents a significant expansion of the types of evidence codes and data sources that are provided by SGD.
c. Priorities for annotation: The highest priority is to capture annotations where new information is available for an Uncharacterized gene product. These papers are identified during the literature triage process. In addition, we update older annotations. SGD captures the date when the annotations for a gene were reviewed. Using this date reviewed, older annotations are checked for consistency with the current literature.
d. SGD has incorporated phylogeny-based annotations made by PAINT. These annotations are now part of SGD's gene_association.sgd file.
f. SGD curators are routinely creating terms via the new TermGenie interface to speed up the process of annotation.
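The IEA loading step described in item b can be sketched roughly as follows. This is a minimal illustration, not SGD's loader: the function names `extract_iea` and `merge_annotations` are hypothetical, and the in-memory lists stand in for the database and the GOA gene association file.

```python
def extract_iea(goa_lines):
    """Yield GAF rows (as column lists) whose evidence code is IEA."""
    for line in goa_lines:
        # Skip blank lines and "!" header comments.
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        # Column 7 (index 6) of a GAF row holds the evidence code.
        if len(cols) > 6 and cols[6] == "IEA":
            yield cols


def merge_annotations(manual_rows, goa_lines):
    """Combine manual rows with IEA rows from a GOA file, dropping exact duplicates."""
    seen = set()
    merged = []
    for cols in list(manual_rows) + list(extract_iea(goa_lines)):
        key = tuple(cols)
        if key not in seen:
            seen.add(key)
            merged.append(cols)
    return merged
```

Keeping the manual rows first preserves their order in the combined output, and only rows marked IEA are taken from the GOA file, so non-IEA annotations in that file do not displace curated ones.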
Presentations and Publications
Papers with substantial GO content
- Dutkowski J, Kramer M, Surma MA, Balakrishnan R, Cherry JM, Krogan NJ, Ideker T. 2013. A gene ontology inferred from molecular networks. Nat Biotechnol. 2013 Jan;31(1):38-45. PMID: 23242164
- Balakrishnan R, Harris MA, Huntley R, Van Auken K, Cherry JM. 2013. A guide to best practices for Gene Ontology (GO) manual annotation. Database (Oxford). 2013 Jul 9;2013:bat054. PMC3706743
- Tripathi S, Christie KR, Balakrishnan R, Huntley R, Hill DP, Thommesen L, Blake JA, Kuiper M, Lægreid A. 2013. Gene Ontology annotation of sequence-specific DNA binding transcription factors: setting the stage for a large-scale curation effort. Database (Oxford). 2013 Aug 27;2013:bat062. PMC3753819
- Balakrishnan R, Harris MA, Huntley R, Van Auken K, Cherry JM. 2013. A guide to best practices for Gene Ontology (GO) manual annotation. 2013 International Biocuration Conference, Cambridge, UK.
A. Migration to protein2GO
- We migrated from our internal interface to the protein2GO tool, developed and maintained by UniProtKB, for annotating GO data. This required exporting our existing annotations into the protein2GO database, complying with their QC checks, getting trained to use the protein2GO interface, and then integrating the annotations back into SGD. This round-trip process took 8 months to complete.
B. Col-16 curation
- Migrating to protein2GO has given us a tool to capture annotation extensions (also known as col-16 data). We currently have about 2000 annotations (covering 160 proteins) with col-16 data. This data will be visible on our public web pages at the beginning of 2014.
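For readers unfamiliar with col-16 data, the sketch below shows one way to parse the annotation extension column of a GAF 2.0 row. It assumes the usual `relation(ID)` syntax, with commas joining the parts of a single statement and pipes separating independent statements. The function name `parse_extensions` is hypothetical, and the identifiers in the usage example are made up for illustration.

```python
import re

# One extension looks like relation(TARGET_ID), e.g. occurs_in(GO:0005634).
_EXT = re.compile(r"^\s*([A-Za-z_]\w*)\(([^()]+)\)\s*$")


def parse_extensions(col16):
    """Parse a column-16 string into a list of statements,
    each a list of (relation, target) pairs."""
    if not col16.strip():
        return []
    statements = []
    for group in col16.split("|"):        # pipe: independent statements
        pairs = []
        for item in group.split(","):     # comma: parts of one statement
            match = _EXT.match(item)
            if match is None:
                raise ValueError(f"malformed extension: {item!r}")
            pairs.append((match.group(1), match.group(2)))
        statements.append(pairs)
    return statements
```

For example, `parse_extensions("has_input(SGD:S000001),occurs_in(GO:0005634)|part_of(GO:0006412)")` returns two statements: the first with two (relation, target) pairs and the second with one.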
C. Annotation Outreach
- SGD curators participate in annotation conference calls and curation jamborees.
- R. Balakrishnan is a manager for the Annotation Advocacy working group
- R. Balakrishnan is part of the rotation that answers user email from gohelp.
- R. Balakrishnan is part of a working group involved in the redesign of the GOC website.
- R. Balakrishnan is part of the PAINT curation team
- R. Balakrishnan is part of the AmiGO working group
- R. Balakrishnan worked with Karen Christie (MGI) and Rachael Huntley (GOA) to coordinate with Astrid Lægreid and Martin Kuiper of the Norwegian University of Science and Technology in Trondheim to annotate mammalian transcription factors.
- R. Balakrishnan is providing annotation guidance to the MoonProt group, who want to annotate moonlighting (multi-function) proteins
- R. Balakrishnan is working with PseudoCAP to assist them in providing an updated annotation file.
- R. Balakrishnan is working with Scott Dawson (UC Davis) to develop the ontologies to annotate Giardia lamblia proteins.
- Kalpana Karra has built GOMine, an implementation of InterMine. GOMine (http://gomine.geneontology.org) should serve as a fast search and retrieval tool for GO data without requiring knowledge of SQL or the database table structure. This tool has been live since the beginning of 2013.
- GO ontology data is loaded from gene_ontology_edit.obo and goslim_generic.obo (from the GO FTP site).
- GO annotation data is loaded from the following GAF files (from the GO FTP site): go-annotation-Lmajor, go-annotation-Pfalciparum, go-annotation-Tbrucei, go-annotation-Atumefaciens, go-annotation-Ddadantii, go-annotation-Mgrisea, go-annotation-Oomycetes, go-annotation-aspgd, go-annotation-gramene, go-annotation-jcvi, go-annotation-cgd, go-annotation-eco-cyc (when it does not cause problems), go-annotation-chicken, go-annotation-pseudocap, go-annotation-mgi, go-annotation-pombase, go-annotation-rgd, go-annotation-sgd, go-annotation-tair, go-annotation-wb, go-annotation-zfin, go-annotation-dictybase, go-annotation-fb, go-annotation-cow, go-annotation-dog, go-annotation-human, go-annotation-pig.
- The SwissProt portion of the GO annotations from UniProt GOA, which includes both IEA and non-IEA annotations (file generated by processing), is loaded as well. TrEMBL is not loaded.
- Information about the UniProt proteins is loaded from the UniProt SwissProt XML file. FASTA sequences are loaded as part of the loading process.
- In addition, to aid in cross referencing, external IDs are loaded from the ID mapping file.
- We created 15 templates to start with.
- We have 7 real users who have created accounts.
- Google Analytics
- We have had 121 visits from 84 unique visitors since September, when we added the Google Analytics tracking code.
- Stanford continues to serve as the Production server for the GO Database and for the AmiGO web application.
- Migration and release of AmiGO 2 from Stanford servers is in progress.
- The AmiGO 2 service is currently in a pre-beta testing state. We anticipate that this service will be moved into production (at Stanford) sometime in mid- to late January. The service consists of two web servers running on KVM/QEMU virtual machines (VMs) behind a hardware load balancer. Two more VMs provide a “reverse proxy” service to prevent the public from accessing “private” portions of the server. The VMs are running on separate physical servers in order to maximize service uptime in the event of hardware failure. There is also a physical back-end “loading” machine that downloads the source data files and creates search indexes. These servers run on Dell hardware. The VMs are running the CentOS 6.4 operating system, while the loading server and the physical machines hosting the VMs are running the Red Hat Enterprise Linux 6.4 operating system. A NetApp appliance provides the VMs with disk storage via NFS, which allows for live migration (no downtime) of VMs between physical hosts if needed. The loading server downloads and processes the raw data and creates indexes that allow for very fast data queries. This data is generated daily and copied to the web servers, which allows subsequent loads to be performed without impacting web server availability. This process is fully automated and requires no manual intervention.
- B. Hitz, S. Miyasato, G. Binkley are on the go-software mailing list