Meeting Progress Report April 2008
- 1 April: Salt Lake City GOC Meeting
- 1.1 SO Sequence Ontology
- 1.2 Production Services (Stanford)
- 1.3 Ontology Development
- 1.4 Reference Genome Project
- 1.5 Software and Utilities
- 1.6 User Advocacy
- 1.7 Outreach
April: Salt Lake City GOC Meeting
SO Sequence Ontology
SO term statistics
April 17th, 2008
Tracker items since 1st Oct 2007
Colin Batchelor of RSC became a SO editor in October.
SO work in progress
SO-BFO alignment and annotator consistency: The aim of this project is to work out which types we currently have as sequence_attributes, and to use that to overhaul that part of the tree and provide better definitions. The attributes in SO have diverged from being straightforward qualities. We have looked at how BFO describes dependent continuants and tried to work out where SO terms fit into this. To test annotator consistency, the editors (Colin and Karen) are assigned 30 terms and each decide the appropriate category for every term. We then compare our results using various statistics. After several iterations of testing, we have reached the point where we have SO versions of quality, role, disposition and function.
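The report does not say which statistics were used to compare the two editors' category assignments. As an illustration only, the sketch below computes Cohen's kappa, a common measure of agreement between two annotators that corrects for chance agreement. The function name and the six example assignments are hypothetical, not SO data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators over the same terms."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty term list")
    n = len(labels_a)
    # Fraction of terms on which the two annotators chose the same category.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's category frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical category throughout
    return (observed - expected) / (1 - expected)


# Hypothetical assignments for six terms (illustrative values only).
editor_1 = ["quality", "role", "quality", "function", "disposition", "role"]
editor_2 = ["quality", "role", "disposition", "function", "disposition", "quality"]
kappa = cohens_kappa(editor_1, editor_2)
```

A kappa of 1 means perfect agreement and a kappa near 0 means agreement no better than chance, so tracking the value across test iterations would show whether the category definitions are converging.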
Replacing derives_from: We are working on replacing the derives_from relation with more appropriate relations such as translation_of and transcribed_from.
RNAO workshop: The RNA Ontology Consortium and SO are meeting in May to discuss topological relations in SO and how they relate to RNAO. Neocles Leontis and Thomas Bittner will be attending.
Production Services (Stanford)
Gail Binkley, Mike Cherry, Frank Hallstrom, Ben Hitz, Eurie Hong, Stuart Miyasato, Shuai Weng, Edith Wong.
The SGD group at Stanford is responsible for hosting various production aspects of GO. These include maintenance and hosting of the geneontology.org Internet services, including CVS, FTP, HTTP, and CVSWEB. This covers hosting of the AmiGO ontology browsing service, regular data loading and export, Wiki and FTP hosting, nightly creation of GODIFF and the old OBO and GO formatted ontology files, and filtering of the gene association files supplied by consortium members. The statistics for the filtered gene association files are also updated nightly. Usage of geneontology.org has been relatively constant since the last report at ~30,000 visits/week. Usage of AmiGO fluctuates between 30,000 and 70,000 hits/week.
Software & Databases
Maintenance and support of the production AmiGO web site continues. AmiGO was updated to release 1.5 in April 2008. AmiGO 1.5 has reduced the number of long-running queries that previously swamped the production nodes. A “killing” script had been developed to axe long-running processes and malformed queries. This script is no longer necessary as of the release of AmiGO 1.5. The new version of AmiGO greatly reduces the unnecessary load on the production nodes in two ways. Firstly, huge SQL queries (greater than 200K characters) are no longer created by AmiGO. For the past year these huge queries were caught by the “killing” script. Secondly, a bug in MySQL caused threads with “Killed” status not to be terminated, so they would continue to execute until finished. Some of these huge SQL queries would regularly use >4 hours of CPU time. The only way to reduce the load once these huge queries had started was to shut down and restart the MySQL daemon. This was done whenever several of these Killed threads were detected and the load was greater than 8. We are happy to report that the production nodes have not needed MySQL to be restarted since AmiGO 1.5 was installed. We continue to support AmiGO on a new development server that allows efficient testing and deployment of the database and software to the two production nodes.
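The “killing” script itself is not shown in this report. The sketch below only illustrates the general idea under stated assumptions: given a snapshot of database threads (for example, rows read from MySQL's process list), it picks those that have run too long or carry an oversized query and builds the corresponding `KILL` statements. The `DbThread` type, the function names and the 600-second limit are hypothetical; the 200,000-character limit mirrors the query size mentioned above.

```python
from dataclasses import dataclass


@dataclass
class DbThread:
    thread_id: int        # process list id of the database thread
    seconds_running: int  # how long the current statement has been running
    query_length: int     # number of characters in the SQL text


def select_threads_to_kill(threads, max_seconds=600, max_query_chars=200_000):
    """Return ids of threads that exceed either limit (limits are assumptions)."""
    return [
        t.thread_id
        for t in threads
        if t.seconds_running > max_seconds or t.query_length > max_query_chars
    ]


def kill_statements(thread_ids):
    """Build one MySQL KILL statement per selected thread id."""
    return [f"KILL {tid};" for tid in thread_ids]


# Hypothetical snapshot of three threads.
snapshot = [
    DbThread(thread_id=11, seconds_running=5, query_length=300),
    DbThread(thread_id=12, seconds_running=15_000, query_length=900),
    DbThread(thread_id=13, seconds_running=2, query_length=250_000),
]
to_kill = select_threads_to_kill(snapshot)
statements = kill_statements(to_kill)
```

As the text notes, killing a thread did not always free the node because of the “Killed” status bug, which is why avoiding the huge queries in AmiGO 1.5 was the more effective fix.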
The three GO relational databases continue to be built and maintained. After several months of development and testing, a new database loading environment was put into production in March 2008. Bulk loading for both associations and sequences has been added. The new loader is faster and, more importantly, avoids a critical memory error that existed in the old sequence loading code. That error prevented the release of both the January and February GO-FULL builds. Currently, all UniProtKB sequences are loaded into GO-FULL. This should prevent scaling issues in the future, as the number of annotations is expected to grow faster than the number of sequences in UniProtKB. It is also faster than identifying the sequences that are used in the gene association files and then scanning the downloaded UniProtKB files to locate and load that subset of UniProtKB entries.
Further improvements slated for 2008 include:
- Improved unit testing procedures.
- Add IEA annotations for all organisms with species-specific gene association files.
- Use of UniProtKB mapping files to avoid problematic and slow queries to NCBI. That is, do not load NCBI protein entries unless the sequence is not present in UniProtKB.
- Add support for secondary UniProtKB ids in gp2protein files.
- Add GO.xrf_abbs to database for use in AmiGO.
- Export complete protein sets for reference genomes in FASTA format.
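The UniProtKB-first lookup planned in the list above might be sketched as follows. This is a minimal illustration, not the loader's code: the function name, the in-memory dictionaries and the `fetch_from_ncbi` callback are all hypothetical stand-ins for the mapping files and the remote query.

```python
def resolve_sequence(protein_id, uniprot_sequences, secondary_to_primary, fetch_from_ncbi):
    """Return (sequence, source), querying NCBI only when UniProtKB has no entry.

    uniprot_sequences:    dict of primary UniProtKB id -> sequence (assumed preloaded)
    secondary_to_primary: dict of secondary UniProtKB id -> primary id
    fetch_from_ncbi:      callable performing the slow remote lookup
    """
    # Map a secondary id onto its primary id; ids without a mapping pass through.
    primary_id = secondary_to_primary.get(protein_id, protein_id)
    sequence = uniprot_sequences.get(primary_id)
    if sequence is not None:
        return sequence, "UniProtKB"
    # Fall back to NCBI only for sequences absent from UniProtKB.
    return fetch_from_ncbi(protein_id), "NCBI"


# Hypothetical data: ids and sequences are made up for illustration.
uniprot = {"P00001": "MKT"}
secondary = {"Q99999": "P00001"}
ncbi_calls = []


def fake_ncbi(protein_id):
    ncbi_calls.append(protein_id)
    return "MSA"


hit = resolve_sequence("Q99999", uniprot, secondary, fake_ncbi)
miss = resolve_sequence("NP_000001", uniprot, secondary, fake_ncbi)
```

Because the local lookup is tried first, the slow remote query runs only for the residue of ids that UniProtKB cannot supply.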
Gene association filters
All submitted gene association files are filtered for errors in content or syntax before being published to the FTP site or loaded into the relational database. The filtering program is revised and modified continuously to account for changes in standards and format.
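The actual filtering program is not reproduced here. As a rough sketch of what a syntax check on one gene association line can look like, the example below assumes the 15-column tab-delimited layout (GO ID in column 5, evidence code in column 7, aspect in column 9) and applies three illustrative checks. The function name and the choice of checks are assumptions; the real filter covers far more.

```python
import re

GO_ID_PATTERN = re.compile(r"^GO:\d{7}$")
ASPECTS = {"P", "F", "C"}   # process, function, component
EXPECTED_COLUMNS = 15       # assumed 15-column gene association layout


def check_association_line(line):
    """Return a list of problems found in one line; an empty list means it passes."""
    if line.startswith("!"):
        return []  # comment lines are passed through unchecked
    columns = line.rstrip("\n").split("\t")
    if len(columns) != EXPECTED_COLUMNS:
        return [f"expected {EXPECTED_COLUMNS} columns, found {len(columns)}"]
    problems = []
    if not GO_ID_PATTERN.match(columns[4]):
        problems.append(f"malformed GO ID: {columns[4]!r}")
    if not columns[6]:
        problems.append("missing evidence code")
    if columns[8] not in ASPECTS:
        problems.append(f"unknown aspect: {columns[8]!r}")
    return problems


# Hypothetical lines for illustration (identifiers are made up).
good_line = "SGD\tS000000001\tABC1\t\tGO:0005739\tPMID:1\tIDA\t\tC\t\t\tgene\ttaxon:4932\t20080401\tSGD"
bad_line = "SGD\tS000000001\tABC1\t\tGO:5739\tPMID:1\t\t\tX\t\t\tgene\ttaxon:4932\t20080401\tSGD"
good_result = check_association_line(good_line)
bad_result = check_association_line(bad_line)
```

Returning a list of problems rather than a simple pass/fail makes it easy to report every error on a line back to the submitting group in one pass.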
The GOC wiki site is hosted by the Stanford group. This year, at the request of the GO managers, we merged gocwiki.geneontology.org into the wiki.geneontology.org site. This required merging the gocwiki database content with that of the public wiki.
All 28 GO mailing lists were transferred from the majordomo system to mailman. Mailman is by far superior to majordomo and has become the standard software used for mailing lists. Hopefully the transition caused minimal problems. The mailman environment allows moderation of messages via a web interface, whereas majordomo had no web features. The administration of mailman is also via web interfaces. Mailman automatically tracks bounced messages and unsubscribes bad addresses. More effective SPAM filtering is possible with mailman: previously the moderator (Mike) had to scan a large number of messages, whereas now spamassassin is used to remove likely SPAM from the list queues. Among the many other useful features provided by mailman, all messages to the list are automatically converted to HTML and archived in web pages. The text searching environment provided for some lists continues to be provided by a webglimpse engine.
GO database loading and AmiGO have been installed and are fully functioning on three Linux machines. A load balancer appliance is used to direct the incoming AmiGO traffic between the two production nodes. Each production node provides AmiGO and GOst services. The development node also provides a testing environment. This year we plan to add at least one more production node.
GO term statistics
October 1, 2007
April 16, 2008
SourceForge statistics (Oct. 1 - April 17)
- items opened: 500
- items closed: 476
SourceForge reports (on SF site)
Our most notable accomplishment since the Princeton meeting in September is that the regulates relationships have gone live. Chris, David and Tanya did an enormous amount of work, which is documented in the regulation section of the wiki. A brief summary of metrics is also available.
Other completed work
- The revamp of Sensu terms is now complete. We described our approach of renaming terms and, where necessary, improving definitions or merging terms, at the September meeting.
- We reported on the Cardiovascular physiology/development and Muscle Biology content meetings at the September meeting. Changes stemming from those meetings have gone live.
- Smaller-scale efforts include:
- A number of disjointness violations have been corrected.
- Electron transport terms have been reorganized.
- New enzyme-activity function terms and (many!) synonyms added, improving consistency with EC.
- Process and component terms for plasma lipoprotein particles added.
- Sporulation terms have been reorganized, and new terms added (connected with 'sensu' work).
- More new terms have been added for PAMGO.
- PIR GO slim added.
Work in progress
- Two pilot projects to add links between the function and process ontologies are going on. Progress and future directions will be discussed during the meeting.
- A content meeting on lung development was held December 5-6. Progress will be briefly noted during the meeting.
- Jen has started gathering information and identifying experts to work on an overhaul of signal transduction process terms.
Collaboration with IMG
Jane is working with Iain Anderson from IMG. The first set of IMG terms (about 1800) from April 2007 has been mapped and sent back, but since then approximately 1500 more terms have been added to IMG and I am currently mapping these. The IMG pathways and parts also require mapping.
I am collaborating with Antonio Jimeno from Dietrich Rebholz-Schuhmann's group (EBI) to create automatic mappings for these terms, which I then manually verify and return the data to him to improve the algorithms. We hope to eventually use this work to create a generic vocabulary mapping tool.
Reference Genome Project
- There are currently 394 genes in the Target gene list.
- Selection of genes: Since Nov 2007, we rotate the group selecting target genes.
- Curation priorities: Since Nov 2007, targets are not only disease genes anymore. We select 20 genes, 5 in each of 4 categories: (1) disease genes, (2) 'hot genes', (3) metabolic pathways, (4) uncharacterized.
|Organism||# genes looked at||# genes with orthologs||# genes curated|
|Arabidopsis||372||131 (35%)||129 (98%)|
|Caenorhabditis||412||271 (66%)||198 (73%)|
|Gallus||99||82 (82%)||none marked completed|
|Homo||394||393 (99%)||231 (59%)|
|Mus||394||382 (97%)||338 (88%)|
|Saccharomyces||394||164 (41%)||159 (95%)|
|Drosophila||376||171 (45%)||70 (41%)|
|Rattus||394||347 (88%)||250 (72%)|
|Danio||374||283 (75%)||264 (93%)|
|Dictyostelium||351||164 (46%)||53 (32%)|
|Schizosaccharomyces||334||124 (37%)||105 (85%)|
|Escherichia||375||51 (13%)||none marked completed|
Annotation Quality Control
We are trying to address the issue of quality control of the annotations. Some of the concerns are:
- Omission of annotations
- Errors in annotations
- Absence of 'with' for ISS annotations or 'with' object not experimentally characterized
- Overannotation with ISS to process terms
- Problems in the ontology that can become evident when comparing annotations from different species
Methods to address this:
- There are some queries that can be done: for example, finding genes that either lack annotations or are annotated only to ND while an ortholog has GO annotations
- (Val Wood): Looking for co-occurrences of annotations as a high-level way to check for errors
- Manual verification of ortho sets (SourceForge tracker: http://sourceforge.net/tracker/?group_id=36855&atid=1040173)
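The first query in the list above can be sketched in a few lines. This is an illustration under assumptions, not the consortium's query: an ortho set is represented as a list of gene ids, annotations as a dictionary of gene id to (GO ID, evidence code) pairs, and all names and example data are hypothetical.

```python
ND = "ND"  # evidence code meaning no biological data available


def genes_needing_review(ortho_set, annotations):
    """Flag genes with no informative annotation when some ortholog has one.

    ortho_set:   list of gene ids belonging to one set of orthologs
    annotations: dict of gene id -> list of (go_id, evidence_code) tuples
    """
    def has_informative(gene):
        # A gene counts as annotated only if some annotation is not ND.
        return any(evidence != ND for _, evidence in annotations.get(gene, []))

    informative = {gene: has_informative(gene) for gene in ortho_set}
    if not any(informative.values()):
        return []  # no ortholog carries annotations, so nothing to compare against
    return [gene for gene in ortho_set if not informative[gene]]


# Hypothetical ortho set: one annotated gene, one ND-only gene, one unannotated gene.
example_set = ["geneA", "geneB", "geneC"]
example_annotations = {
    "geneA": [("GO:0006412", "IDA")],
    "geneB": [("GO:0008150", ND)],
}
flagged = genes_needing_review(example_set, example_annotations)
```

Genes returned by such a query are candidates for curator attention rather than confirmed omissions, since an ortholog may genuinely lack the function in question.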
Currently the target genes and annotation status are captured using Google spreadsheets (target genes and links to every group's annotation status page can be found at http://spreadsheets.google.com/ccc?key=pwOksMOra5uq4vIYjPgefPw)
- Ortho set curation status: Siddhartha Basu, Chris Mungall, Seth Carbon and Mary Dolan are working on a database and a tool where target genes (ortho sets) and their curation status will be maintained.
- Graphical displays (Mary Dolan): several improvements
- Integration of ref genomes genes into AmiGO
Generating Ortholog sets
P-POD: Kara Dolinski (Princeton) : Procedure:
- Obtain FASTA files from each group from gp2protein files
- all vs all BLAST
- OrthoMCL
- Output will be rooted trees reconciled with species trees, graphic image of tree
- We are on the Notung step right now and are adding data as they are generated. Currently, OrthoMCL families can be queried by gene name, though you just get back a list of members right now. Data are being made available as soon as we have them:
The reference genome group holds a monthly phone conference. Minutes can be found at Conference_Calls
Software and Utilities
- Major code refactoring, split into 3 parts: general utils (org.bbop), object model and API (org.obo) and GUI (org.oboedit)
- converted map2slim to OE framework
- SWUG:OBO-Edit Report 2008-04
- OBOMerge GUI in progress (Jen)
- Analysed current patterns for Annotation_of_Alternate_Spliceforms and came up with proposal for GAF
- Proposal for Annotation_Cross_Products column (col16) in GAF
- support for species taxonomy trees added (e.g. we can now do searches by kingdom, phylum etc), in production
- support for homolsets added to schema (for refG), in production
- added support for GO.xrf_abbs (not yet in production)
- added multiple views to facilitate reporting: see http://www.berkeleybop.org/goose/reports
- Instigated SWUG:Quality_Control software QC principles
- Collating all Derived_files_in_CVS
- Managed support for regulates relation across GO software
John left for Google. Nomi has stepped in to steer the OEWG. We have hired a developer to work on OE full time, but this person is stuck outside the country due to new US entry regulations. Sohel left, Siddhartha joined dictyBase and is now working on the refG tracker. Jen has been dabbling in OE development.
The GO helpdesk continues to be run efficiently on a rota system. The email system was recently moved to Mailman.
Number of GO helpdesk queries
Two editions of the newsletter have been published since the last meeting. We have applied for an ISSN for the newsletter.
Web-presence Working Group (formerly AmiGO WG)
AmiGO 1.5 was released earlier this month with many new features including a GO slimmer tool, a term enrichment tool and SQL search interface. We are now beginning to set priorities for the next release.
The advocacy group has not been involved with AmiGO development recently, but we have decided that in future it will help set priorities, from a biologist's perspective, at the beginning of a release and will work with the software group to come up with a release plan. The software group will develop the release independently, with advocacy only getting involved again when testing is required in the run-up to the release.
Outreach group activity reduced to supporting groups who approach GO directly. We are not currently actively seeking out new annotation groups.
- Group at CRIBI (Italy) committed to carrying out grape annotation.
- Plant Physiology journal have agreed to accept annotations from submitting authors. (TAIR collaboration)
- TAIR outreach at PAG 2008. Discussion of community annotation with TAIR, SGN (SOL Genomics Network) and WormBase.
- Sol Genomics Network database annotation file has been submitted.
- Reactome have created annotation files according to the plans laid down in Princeton, and are ready to commit when they have cvs access. (Emily Dimmer and Esther Schmidt)
- ISAFG Conference - Fiona McCarthy reports continuing interest in GO.
- Muscle Annotation wiki (Erika Feltrin and Alex Diehl)