Meeting Progress Report April 2008

April: Salt Lake City GOC Meeting

SO Sequence Ontology

SO term statistics

April 17th, 2008

            Current   Defined   Obsolete
so-xp.obo   1425      1176      112

Tracker items since 1st Oct 2007

Opened Closed
58 50

SO news

Colin Batchelor of RSC became a SO editor in October.

SO work in progress

SO-BFO alignment and annotator consistency

The aim of this project is to work out what types we currently have as sequence_attributes, and to use that to overhaul that part of the tree and provide better definitions. The attributes in SO have diverged from being straightforward qualities. We have looked at how BFO describes dependent continuants and tried to work out where SO terms fit into this. To test annotator consistency, the editors (Colin and Karen) are assigned 30 terms and have to decide the appropriate category for each. We then compare our results using various statistics. After several iterations of testing, we have got to the point where we have SO versions of quality, role, disposition and function.

Replacing derives_from

We are working on replacing the derives_from relation with more appropriate relations such as translation_of and transcribed_from.

Upcoming events

RNAO workshop

The RNA Ontology Consortium and SO are having a meeting in May to discuss topological relations in SO and how they relate to RNAO. Neocles Leontis and Thomas Bittner will be attending.

Production Services (Stanford)

Staff

Gail Binkley, Mike Cherry, Frank Hallstrom, Ben Hitz, Eurie Hong, Stuart Miyasato, Shuai Weng, Edith Wong.

The SGD group at Stanford is responsible for hosting various production aspects of GO. These include maintenance and hosting of the geneontology.org Internet services, including CVS, FTP, HTTP, and CVSWEB. This covers hosting of the AmiGO ontology browsing service, regular data loading and export, Wiki and FTP hosting, nightly creation of GODIFF and the old OBO and GO formatted ontology files, and filtering of the gene association files supplied by consortium members. The statistics for the filtered gene association files are also updated nightly. Usage of geneontology.org has been relatively constant since the last report, at ~30,000 visits/week. Usage of AmiGO fluctuates between 30,000 and 70,000 hits/week.

Software & Databases

AmiGO

Maintenance and support of the production AmiGO web site continues. AmiGO was updated to release 1.5 in April 2008. AmiGO 1.5 has reduced the number of long running queries that previously swamped the production nodes. A “killing” script had been developed to axe long running processes and malformed queries; as of the release of AmiGO 1.5, this script is no longer necessary. The new version of AmiGO greatly reduces the unnecessary load on the production nodes in two ways. Firstly, huge SQL queries (greater than 200K characters) are no longer created by AmiGO. For the past year these huge queries were caught by the “killing” script. Secondly, a bug in mysql caused threads with “Killed” status not to be terminated, so they continued to execute until finished; with no huge queries left to kill, this situation no longer arises. Some of these huge SQL queries would regularly use >4 hours of CPU time. The only way to reduce the load once these huge queries were started was to shut down and restart the mysql daemon. This was done whenever several of these Killed threads were detected and the load was greater than 8. We are happy to report that the production nodes have not needed mysql to be restarted since AmiGO 1.5 was installed. We continue to support AmiGO on a new development server that allows efficient testing and deployment of the database and software to the two production nodes.
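The report does not describe how the “killing” script worked internally. As a rough illustration only, the selection logic of such a script might look like the sketch below. It assumes rows shaped like the output of MySQL's SHOW FULL PROCESSLIST (thread id, command, running time in seconds, and query text). The 200K character limit comes from the paragraph above; the one-hour time limit and all function and class names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumed thresholds: the character limit is taken from the report,
# the time limit is a hypothetical value chosen for illustration.
MAX_QUERY_CHARS = 200_000
MAX_SECONDS = 3600


@dataclass
class ProcessRow:
    """One row as returned by SHOW FULL PROCESSLIST (subset of columns)."""
    id: int
    command: str          # e.g. "Query" or "Sleep"
    time: int             # seconds spent in the current state
    info: Optional[str]   # the SQL text, or None for idle threads


def threads_to_kill(rows: List[ProcessRow],
                    max_seconds: int = MAX_SECONDS,
                    max_query_chars: int = MAX_QUERY_CHARS) -> List[int]:
    """Return ids of threads running an overlong or oversized query."""
    victims = []
    for row in rows:
        # Only active queries are candidates; idle connections are left alone.
        if row.command != "Query" or row.info is None:
            continue
        if row.time > max_seconds or len(row.info) > max_query_chars:
            victims.append(row.id)
    return victims


def kill_statements(thread_ids: List[int]) -> List[str]:
    """Build the KILL statements an operator or cron job would issue."""
    return [f"KILL {thread_id}" for thread_id in thread_ids]
```

Keeping the selection step separate from the step that issues KILL statements makes the thresholds easy to test without a live database.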

GO Database

The three GO relational databases continue to be built and maintained. After several months of development and testing, a new database loading environment was put into production in March 2008. Bulk loading for both associations and sequences has been added. The new loader is faster and, more importantly, avoids a critical memory error that existed in the old sequence loading code. That problem prevented the release of both the January and February GO-FULL builds. Currently, all UniProtKB sequences are loaded into GO-FULL. This is faster than identifying the sequences that are used in the gene association files and then scanning the downloaded UniProtKB files to locate and load that subset of UniProtKB entries. It should also prevent scaling issues in the future, as the number of annotations is expected to grow faster than the number of sequences in UniProtKB.

Further improvements slated for 2008 include:

  • Improved unit testing procedures.
  • Add IEA annotations for all organisms with species-specific gene association files.
  • Use of UniProtKB mapping files to avoid problematic and slow queries to NCBI. That is, do not load NCBI protein entries unless the sequence is not present in UniProtKB.
  • Add support for secondary UniProtKB IDs in gp2protein files.
  • Add GO.xrf_abbs to database for use in AmiGO.
  • Export complete protein sets for reference genomes in FASTA format.
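
The planned "UniProtKB first, NCBI only as a fallback" behaviour can be pictured with a small sketch. This is not the loader's code: the function name, the dictionary-based mapping, and the fetch_from_ncbi callback are all hypothetical stand-ins for whatever the real loader uses.

```python
from typing import Callable, Dict, Tuple


def resolve_sequence(ncbi_id: str,
                     ncbi_to_uniprot: Dict[str, str],
                     uniprot_seqs: Dict[str, str],
                     fetch_from_ncbi: Callable[[str], str]) -> Tuple[str, str]:
    """Return (sequence, source), preferring locally loaded UniProtKB data.

    ncbi_to_uniprot: hypothetical mapping built from UniProtKB mapping files.
    uniprot_seqs:    sequences already loaded, keyed by UniProtKB accession.
    fetch_from_ncbi: slow remote lookup, called only when no local hit exists.
    """
    accession = ncbi_to_uniprot.get(ncbi_id)
    if accession is not None and accession in uniprot_seqs:
        return uniprot_seqs[accession], "UniProtKB"
    # No UniProtKB sequence is available, so fall back to the NCBI query.
    return fetch_from_ncbi(ncbi_id), "NCBI"
```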

Gene association filters

All submitted gene association files are filtered for errors in content or syntax before being published to the FTP site or loaded into the relational database. The filtering program is revised and modified continuously to account for changes in standards and format.
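
The checks performed by the filtering program are not listed in this report. As a minimal sketch of what a syntax filter of this kind could look like, the example below assumes the 15-column, tab-delimited gene association format, in which comment lines start with "!", column 5 holds the GO ID and column 9 holds the aspect (P, F or C). The function names and the particular checks chosen are illustrative only.

```python
import re
from typing import Iterable, List, Tuple

EXPECTED_COLUMNS = 15                 # assumed: 15-column gene association format
GO_ID = re.compile(r"^GO:\d{7}$")     # e.g. GO:0005739
ASPECTS = {"P", "F", "C"}             # process, function, component


def check_line(line: str) -> List[str]:
    """Return a list of problems found in one line; empty means it passes."""
    if line.startswith("!"):
        return []  # comment and header lines are passed through unchanged
    columns = line.rstrip("\n").split("\t")
    if len(columns) != EXPECTED_COLUMNS:
        return [f"expected {EXPECTED_COLUMNS} columns, found {len(columns)}"]
    problems = []
    if not GO_ID.match(columns[4]):
        problems.append(f"malformed GO ID: {columns[4]!r}")
    if columns[8] not in ASPECTS:
        problems.append(f"unknown aspect: {columns[8]!r}")
    return problems


def filter_lines(lines: Iterable[str]) -> Tuple[List[str], List[Tuple[str, List[str]]]]:
    """Split lines into those kept and those rejected (with reasons)."""
    kept, rejected = [], []
    for line in lines:
        problems = check_line(line)
        if problems:
            rejected.append((line, problems))
        else:
            kept.append(line)
    return kept, rejected
```

Returning the rejected lines together with their reasons makes it straightforward to send a report back to the submitting group.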

Wiki

The GOC wiki site is hosted by the Stanford group. This year, at the request of the GO managers, we merged gocwiki.geneontology.org into the wiki.geneontology.org site. This required merging the gocwiki database content with that of the public wiki.

Mailing Lists

All 28 GO mailing lists were transferred from the majordomo system to mailman. Mailman is far superior to majordomo and has become the standard software used for mailing lists. Hopefully the transition caused minimal problems. The mailman environment allows moderation of messages via a web interface, whereas majordomo had no web features. The administration of mailman is also done via web interfaces. Mailman automatically tracks bounced messages and unsubscribes bad addresses. A more effective SPAM filtering setup was also possible with mailman. Previously the moderator (Mike) had to scan a large number of messages; now spamassassin is used to remove likely SPAM from the list queues. Among the many other useful features provided by mailman, all messages to the lists are automatically converted to HTML and archived in web pages. The text searching environment provided for some lists continues to be provided by a webglimpse engine.

Hardware

GO database loading and AmiGO have been installed and are fully functioning on three Linux machines. A load balancer appliance is used to direct the incoming AmiGO traffic between the two production nodes. Each production node provides AmiGO and GOst services. The development node also provides a testing environment. This year we plan to add at least one more production node.

Ontology Development

Metrics

GO term statistics

October 1, 2007

            Current   Defined   Obsolete   Total
Function    7879      7492      556        8435
Process     13916     13757     458        14374
Component   2019      2019      114        2133
All         23814     24396     1128       24942


April 16, 2008

            Current   Defined   Obsolete   Total
Function    8262      7909      566        8828
Process     14702     14564     470        15172
Component   2077      2077      117        2194
All         25041     25703     1153       26194


SourceForge statistics (Oct. 1 - April 17)

  • items opened: 500
  • items closed: 476

SourceForge reports (on SF site)

Completed work

Regulates relationship

Our most notable accomplishment since the Princeton meeting in September is that the regulates relationships have gone live. Chris, David and Tanya did an enormous amount of work, which is documented in the regulation section of the wiki. A brief summary of metrics is also available.

Other completed work

  • The revamp of Sensu terms is now complete. We described our approach of renaming terms and, where necessary, improving definitions or merging terms, at the September meeting.
  • We reported on the Cardiovascular physiology/development and Muscle Development content meetings at the September meeting. Changes stemming from those meetings have gone live.
  • Smaller-scale efforts include:
    • A number of disjointness violations have been corrected.
    • Electron transport terms have been reorganized.
    • New enzyme-activity function terms and (many!) synonyms added, improving consistency with EC.
    • Process and component terms for plasma lipoprotein particles added.
    • Sporulation terms have been reorganized, and new terms added (connected with 'sensu' work).
    • More new terms have been added for PAMGO.
    • PIR GO slim added.

Work in progress

Collaboration with IMG

Jane is working with Iain Anderson from IMG. The first set of IMG terms (about 1800) from April 2007 has been mapped and sent back, but since then approximately 1500 more terms have been added to IMG, and she is currently mapping these. The IMG pathways and parts also require mapping.

Jane is also collaborating with Antonio Jimeno from Dietrich Rebholz-Schuhmann's group (EBI) to create automatic mappings for these terms. She manually verifies the mappings and returns the data to him to improve the algorithms. We hope eventually to use this work to create a generic vocabulary mapping tool.

Reference Genome Project

Target genes

  • There are currently 394 genes in the Target gene list.
  • Selection of genes: Since Nov 2007, we rotate the group selecting target genes.
  • Curation priorities: Since Nov 2007, targets are no longer only disease genes. We select 20 genes, 5 in each of 4 categories: (1) disease genes, (2) 'hot genes', (3) metabolic pathways, (4) uncharacterized.

Annotation Progress

Organism              # genes looked at   # genes with orthologs   # genes curated
Arabidopsis           372                 131 (35%)                129 (98%)
Caenorhabditis        412                 271 (66%)                198 (73%)
Gallus                99                  82 (82%)                 none marked completed
Homo                  394                 393 (99%)                231 (59%)
Mus                   394                 382 (97%)                338 (88%)
Saccharomyces         394                 164 (41%)                159 (95%)
Drosophila            376                 171 (45%)                70 (41%)
Rattus                394                 347 (88%)                250 (72%)
Danio                 374                 283 (75%)                264 (93%)
Dictyostelium         351                 164 (46%)                53 (32%)
Schizosaccharomyces   334                 124 (37%)                105 (85%)
Escherichia           375                 51 (13%)                 none marked completed


[Figure: 2008-04-RefGenomeMetric-all data.jpg]

Annotation Quality Control

We are trying to address the issue of quality control of the annotations. Some of the concerns are:

  • Omission of annotations
  • Errors in annotations
  • Absence of 'with' for ISS annotations or 'with' object not experimentally characterized
  • Overannotation with ISS to process terms
  • Problems in the ontology that can become evident when comparing annotations from different species

Methods to address this:

  1. Some checks can be done with queries: for example, finding genes that either lack annotations or are annotated to ND, but for which an ortholog has GO annotations
  2. (Val Wood): Looking for co-occurrences of annotations as a high-level way to check for errors
  3. Manual verification of ortho sets (SourceForge tracker: http://sourceforge.net/tracker/?group_id=36855&atid=1040173)
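
The first method in the list above can be sketched in a few lines. The example assumes ortholog sets are available as lists of gene identifiers and that annotations are available as (GO ID, evidence code) pairs per gene; both data structures and the function names are hypothetical, and a real implementation would more likely be a query against the GO database.

```python
from typing import Dict, List, Tuple

Annotation = Tuple[str, str]  # (GO ID, evidence code)


def has_informative_annotation(gene: str,
                               annotations: Dict[str, List[Annotation]]) -> bool:
    """True if the gene has at least one annotation that is not ND."""
    return any(evidence != "ND" for _, evidence in annotations.get(gene, []))


def flag_candidates(ortho_sets: List[List[str]],
                    annotations: Dict[str, List[Annotation]]) -> List[str]:
    """List genes with no annotation (or only ND) whose ortholog is annotated."""
    flagged = []
    for members in ortho_sets:
        informative = {g for g in members
                       if has_informative_annotation(g, annotations)}
        if not informative:
            continue  # nothing in this set to compare against
        flagged.extend(g for g in members if g not in informative)
    return flagged
```

For instance, in a set where gene A has an IDA annotation, gene B has only an ND annotation and gene C has none, B and C would be flagged for review.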

Software development

Currently the target genes and annotation status are captured using Google spreadsheets. Target genes and links to every group's annotation status page can be found at http://spreadsheets.google.com/ccc?key=pwOksMOra5uq4vIYjPgefPw

  1. Ortho set curation status: Siddhartha Basu, Chris Mungall, Seth Carbon and Mary Dolan are working on a database and a tool where target genes (ortho sets) and their curation status will be maintained.
  2. Graphical displays (Mary Dolan): several improvements
  3. Integration of ref genomes genes into AmiGO

Generating Ortholog sets

P-POD: Kara Dolinski (Princeton)

Procedure:

  1. Obtain FASTA files from each group from gp2protein files
  2. All vs all BLAST
  3. OrthoMCL
  4. ClustalW
  5. PHYLIP
  6. Notung
  • Output will be rooted trees reconciled with species trees, graphic image of tree
  • We are on the Notung step right now and are adding data as they are generated. Currently, OrthoMCL families can be queried by gene name, though you just get back a list of members right now. Data are being made available as soon as we have them:

Web interface: http://ppod.princeton.edu/cgi-bin/ppod.cgi
FTP site: ftp://gen-ftp.princeton.edu/ppod/go_ref_genome/

Communication

The reference genome group holds a monthly phone conference. Minutes can be found at Conference_Calls

Software and Utilities

OBO-Edit

  • Major code refactoring, split into 3 parts: general utils (org.bbop), object model and API (org.obo) and GUI (org.oboedit)
  • converted map2slim to OE framework
  • SWUG:OBO-Edit Report 2008-04
  • OBOMerge GUI in progress (Jen)

AmiGO development

Annotation

Database

  • support for species taxonomy trees added (e.g. we can now do searches by kingdom, phylum etc), in production
  • support for homolsets added to schema (for refG), in production
  • added support for GO.xrf_abbs (not yet in production)
  • added multiple views to facilitate reporting: see http://www.berkeleybop.org/goose/reports

Reference Genomes

General

Personnel

John left for Google. Nomi has stepped in to steer the OEWG. We have hired a developer to work on OE full time, but this person is stuck outside the country due to new US entry regulations. Sohel left. Siddhartha joined dictyBase and is now working on the refG tracker. Jen has been dabbling in OE development.

User Advocacy

GO helpdesk

Continues to be run efficiently on a rota system. The email system was recently moved to Mailman.

Number of GO helpdesk queries

Month   Queries
Sept    27
Oct     47
Nov     34
Dec     21
Jan     39
Feb     45
March   42
April   29


GO newsletter

Two editions since the last meeting. We have applied for an ISSN for the newsletter.

Web-presence Working Group (formerly AmiGO WG)

AmiGO 1.5 was released earlier this month with many new features including a GO slimmer tool, a term enrichment tool and SQL search interface. We are now beginning to set priorities for the next release.
The advocacy group has not been involved with AmiGO development recently, but we have decided that in future it will be involved in setting priorities, from a biologist's perspective, at the beginning of a release, and in working with the software group to come up with a release plan. The software group will develop the release independently, with advocacy only getting involved again when testing is required in the run-up to the release.

Outreach

Outreach group activity has been reduced to supporting groups who approach GO directly. We are not currently seeking out new annotation groups.

Major Developments:

  • Group at CRIBI (Italy) committed to carrying out grape annotation.
  • The journal Plant Physiology has agreed to accept annotations from submitting authors. (TAIR collaboration)

[online submission tool]

  • TAIR outreach at PAG 2008. Discussion of community annotation with TAIR, SGN (SOL Genomics Network) and WormBase.
  • Sol Genomics Network database annotation file has been submitted.
  • Reactome have created annotation files according to the plans laid down in Princeton, and are ready to commit when they have cvs access. (Emily Dimmer and Esther Schmidt)
  • ISAFG Conference - Fiona McCarthy reports continuing interest in GO.
  • Muscle Annotation wiki (Erika Feltrin and Alex Diehl)