April: Salt Lake City GOC Meeting

SO Sequence Ontology

SO term statistics

April 17th, 2008

	Current	Defined	Obsolete
so-xp.obo	1425	1176	112

Tracker items since 1st Oct 2007

Opened	Closed
58	50

SO news

Colin Batchelor of RSC became a SO editor in October.

SO work in progress

SO-BFO alignment and annotator consistency

The aim of this project is to work out what types we currently have as sequence_attributes, and to use that to overhaul that part of the tree and provide better definitions. The attributes in SO have diverged from being straightforward qualities. We have looked at how BFO describes dependent continuants and tried to work out where SO terms fit into this. To test annotator consistency, the editors (Colin and Karen) are assigned 30 terms and have to decide the appropriate category for each. We then compare our results using various statistics. We have gone through several iterations of testing and have got to the point where we have SO versions of quality, role, disposition and function.

Replacing derives_from

We are working on replacing the derives_from relation with more appropriate relations such as translation_of and transcribed_from.

Upcoming events

RNAO workshop

The RNA ontology consortium and SO are having a meeting in May to discuss topological relations in SO and how they relate to RNAO. Neocles Leontis and Thomas Bittner will be attending.

Production Services (Stanford)

Staff

Gail Binkley, Mike Cherry, Frank Hallstrom, Ben Hitz, Eurie Hong, Stuart Miyasato, Shuai Weng, Edith Wong.

The SGD group at Stanford is responsible for hosting various production aspects of GO. These include maintenance and hosting of the geneontology.org Internet services, including CVS, FTP, HTTP, and CVSWEB. This covers hosting of the AmiGO ontology browsing service, regular data loading, export, Wiki and FTP hosting, nightly creation of GODIFF and the old OBO and GO formatted ontology files, and filtering of the gene association files supplied by consortium members. The statistics for the filtered gene association files are also updated nightly.

Usage of geneontology.org has been relatively constant since the last report, at ~30,000 visits/week. Usage of AmiGO fluctuates between 30,000 and 70,000 hits/week.

Software & Databases

AmiGO

Maintenance and support of the production AmiGO web site continues. AmiGO was updated to release 1.5 in April 2008. AmiGO 1.5 has reduced the number of long-running queries that previously swamped the production nodes. A “killing” script had been developed to axe long-running processes and malformed queries; as of the release of AmiGO 1.5, this script is no longer necessary.

The new version of AmiGO greatly reduces the unnecessary load on the production nodes in two ways. Firstly, huge SQL queries (greater than 200K characters) are no longer created by AmiGO. For the past year these huge queries were caught by the “killing” script. Secondly, it sidesteps a bug in mysql that caused threads with “Killed” status not to be terminated, so they continued to execute until finished. Some of these huge SQL queries would regularly use >4 hours of CPU time. The only way to reduce the load once these huge queries had started was to shut down and restart the mysql daemon. This was done whenever several of these Killed threads were detected and the load was greater than 8. We are happy to report that the production nodes have not needed mysql to be restarted since AmiGO 1.5 was installed.

We continue to support AmiGO on a new development server that allows efficient testing and deployment of the database and software to the two production nodes.
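The policy described above (kill long-running or oversized queries, and restart mysql only when Killed threads pile up under high load) can be sketched as follows. This is not the actual “killing” script, which is not reproduced in this report. The `Thread` record, the 600-second cutoff and the reading of "several" as three are assumptions made for illustration; only the 200K-character size and the load threshold of 8 come from the text.

```python
from dataclasses import dataclass

MAX_QUERY_CHARS = 200_000  # "huge" queries per the report: more than 200K characters
LOAD_LIMIT = 8             # load threshold quoted in the report
MAX_SECONDS = 600          # ASSUMPTION: cutoff for a "long running" query
MIN_STUCK = 3              # ASSUMPTION: how many Killed threads count as "several"


@dataclass
class Thread:
    """Hypothetical snapshot of one database thread."""
    thread_id: int
    seconds_running: int
    query: str
    status: str = "Query"  # "Killed" marks a thread that was killed but is still running


def threads_to_kill(threads):
    """Return IDs of live threads that run too long or carry an oversized query."""
    return [
        t.thread_id
        for t in threads
        if t.status != "Killed"
        and (t.seconds_running > MAX_SECONDS or len(t.query) > MAX_QUERY_CHARS)
    ]


def needs_restart(threads, load):
    """True when enough Killed threads linger and the machine load exceeds the limit."""
    stuck = sum(1 for t in threads if t.status == "Killed")
    return stuck >= MIN_STUCK and load > LOAD_LIMIT
```

Keeping the two decisions in separate functions mirrors the report: killing was routine, while a daemon restart was the last resort.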

GO Database

The three GO relational databases continue to be built and maintained. After several months of development and testing, a new database loading environment was put into production in March 2008. Bulk loading for both associations and sequences has been added. The new loader is faster and, more importantly, it avoids a critical memory error that existed in the old sequence loading code. This problem prevented the release of both the January and February GO-FULL builds. Currently, all UniProtKB sequences are loaded into GO-FULL. This should prevent any scaling issues in the future, as the number of annotations is expected to grow faster than the number of sequences in UniProtKB. It is also faster than identifying which sequences are used in the gene association files and then scanning the downloaded UniProtKB files to locate and load this subset of UniProtKB entries.

Further improvements slated for 2008 include:

Improved unit testing procedures.
Add IEA annotations for all organisms with species-specific gene association files.
Use of UniProtKB mapping files to avoid problematic and slow queries to NCBI. That is, NCBI protein entries will be loaded only when the sequence is not present in UniProtKB.
Add support for secondary UniProtKB IDs in gp2protein files.
Add GO.xrf_abbs to database for use in AmiGO.
Export complete protein sets for reference genomes in FASTA format.
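The planned UniProtKB-first rule for sequence loading can be illustrated with a short sketch. It is not the loader's code. The function name, the shape of the mapping (a plain dictionary from a gene product ID to a UniProtKB accession) and the IDs in the usage note are all hypothetical.

```python
def plan_sequence_sources(gene_product_ids, uniprot_mapping):
    """Split IDs into those served by UniProtKB and those that still need NCBI.

    gene_product_ids: iterable of gene product IDs (hypothetical format).
    uniprot_mapping:  dict of gene product ID -> UniProtKB accession,
                      as might be read from a mapping file (assumed shape).
    """
    from_uniprot = {}
    from_ncbi = []
    for gp_id in gene_product_ids:
        accession = uniprot_mapping.get(gp_id)
        if accession is not None:
            # Sequence is available in UniProtKB, so no NCBI query is needed.
            from_uniprot[gp_id] = accession
        else:
            # Fall back to NCBI only for entries missing from UniProtKB.
            from_ncbi.append(gp_id)
    return from_uniprot, from_ncbi
```

For example, with made-up IDs, `plan_sequence_sources(["DB:g1", "DB:g2"], {"DB:g1": "ACC001"})` sends `DB:g1` to UniProtKB and leaves only `DB:g2` for an NCBI lookup.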

Gene association filters

All submitted gene association files are filtered for errors in content or syntax before being published to the FTP site or loaded into the relational database. The filtering program is revised and modified continuously to account for changes in standards and format.
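As a rough illustration of what filtering for errors in content or syntax can involve, the sketch below checks a single gene association line. It is not the consortium's filtering program. It assumes the 15-column tab-delimited gene association layout, and the set of required columns and the ISS rule are illustrative choices rather than the official checks.

```python
GAF_COLUMNS = 15
# ASSUMPTION: zero-based indexes of columns treated as mandatory in this sketch.
REQUIRED = (0, 1, 2, 4, 5, 6, 8, 11, 12, 13, 14)
ASPECTS = {"F", "P", "C"}


def check_gaf_line(line):
    """Return a list of problems found in one gene association line."""
    if line.startswith("!"):
        return []  # comment lines are not checked
    cols = line.rstrip("\n").split("\t")
    if len(cols) != GAF_COLUMNS:
        # Syntax error: no point checking content if the layout is wrong.
        return [f"expected {GAF_COLUMNS} columns, found {len(cols)}"]
    problems = [f"column {i + 1} is empty" for i in REQUIRED if not cols[i].strip()]
    if not cols[4].startswith("GO:"):
        problems.append("column 5 is not a GO ID")
    if cols[8] not in ASPECTS:
        problems.append("column 9 is not F, P or C")
    if cols[6] == "ISS" and not cols[7].strip():
        problems.append("ISS annotation without a 'with' value")
    return problems
```

Returning a list of messages rather than a boolean lets a filter report every problem on a line to the submitting group in one pass.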

Wiki

The GOC wiki site is hosted by the Stanford group. This year, at the request of the GO managers, we merged gocwiki.geneontology.org into the wiki.geneontology.org site. This required merging the gocwiki database content with that of the public wiki.

Mailing Lists

All 28 GO mailing lists were transferred from the majordomo system to mailman. Mailman is by far superior to majordomo, and has become the standard software used for mailing lists. Hopefully the transition caused minimal problems. The mailman environment allows moderation of messages via a web interface, whereas majordomo had no web features. The administration of mailman is also via web interfaces. Mailman automatically tracks bounced messages and unsubscribes bad addresses. A more effective SPAM filtering setup was possible with mailman; previously the moderator (Mike) had to scan a large number of messages. Now spamassassin is used to remove likely SPAM from the list queues. Among the many other useful features provided by mailman, all messages to the list are automatically converted to HTML and archived in web pages. The text searching environment provided for some lists continues to be provided by a webglimpse engine.

Hardware

GO database loading and AmiGO have been installed and are fully functioning on three Linux machines. A load balancer appliance is used to direct the incoming AmiGO traffic between the two production nodes. Each production node provides AmiGO and GOst services. The development node also provides a testing environment. This year we plan to add at least one more production node.

Ontology Development

Metrics

GO term statistics

October 1, 2007

	Current	Defined	Obsolete	Total
Function	7879	7492	556	8435
Process	13916	13757	458	14374
Component	2019	2019	114	2133
All	23814	24396	1128	24942

April 16, 2008

	Current	Defined	Obsolete	Total
Function	8262	7909	566	8828
Process	14702	14564	470	15172
Component	2077	2077	117	2194
All	25041	25703	1153	26194

SourceForge statistics (Oct. 1 - April 17)

items opened: 500
items closed: 476

SourceForge reports (on SF site)

Completed work

Regulates relationship

Our most notable accomplishment since the Princeton meeting in September is that the regulates relationships have gone live. Chris, David and Tanya did an enormous amount of work, which is documented in the regulation section of the wiki. A brief summary of metrics is also available.

Other completed work

The revamp of Sensu terms is now complete. We described our approach of renaming terms and, where necessary, improving definitions or merging terms, at the September meeting.
We reported on the Cardiovascular physiology/development and Muscle Biology content meetings at the September meeting. Changes stemming from those meetings have gone live.
Smaller-scale efforts include:
- A number of disjointness violations have been corrected.
- Electron transport terms have been reorganized.
- New enzyme-activity function terms and (many!) synonyms added, improving consistency with EC.
- Process and component terms for plasma lipoprotein particles added.
- Sporulation terms have been reorganized, and new terms added (connected with 'sensu' work).
- More new terms have been added for PAMGO.
- PIR GO slim added.

Work in progress

Two pilot projects to add links between the function and process ontologies are going on. Progress and future directions will be discussed during the meeting.
A content meeting on lung development was held December 5-6. Progress will be briefly noted during the meeting.
Jen has started gathering information and identifying experts to work on an overhaul of signal transduction process terms.

Collaboration with IMG

Jane is working with Iain Anderson from IMG. The first set of IMG terms (about 1800) from April 2007 has been mapped and sent back, but since then approximately 1500 more terms have been added to IMG, and I am currently mapping these. The IMG pathways and parts also require mapping.

I am collaborating with Antonio Jimeno from Dietrich Rebholz-Schuhmann's group (EBI) to create automatic mappings for these terms. I then manually verify the mappings and return the data to him so that the algorithms can be improved. We hope eventually to use this work to create a generic vocabulary mapping tool.

Reference Genome Project

Target genes

- There are currently 394 genes in the Target gene list.
- Selection of genes: Since Nov 2007, we rotate the group selecting target genes.
- Curation priorities: Since Nov 2007, targets are not only disease genes anymore. We select 20 genes, 5 in each of 4 categories: (1) disease genes, (2) 'hot genes', (3) metabolic pathways, (4) uncharacterized.

Annotation Progress

Organism	# genes looked at	# genes with orthologs	# genes curated
Arabidopsis	372	131 (35%)	129 (98%)
Caenorhabditis	412	271 (66%)	198 (73%)
Gallus	99	82 (82%)	none marked completed
Homo	394	393 (99%)	231 (59%)
Mus	394	382 (97%)	338 (88%)
Saccharomyces	394	164 (41%)	159 (95%)
Drosophila	376	171 (45%)	70 (41%)
Rattus	394	347 (88%)	250 (72%)
Danio	374	283 (75%)	264 (93%)
Dictyostelium	351	164 (46%)	53 (32%)
Schizosaccharomyces	334	124 (37%)	105 (85%)
Escherichia	375	51 (13%)	none marked completed

Annotation Quality Control

We are trying to address the issue of quality control of the annotations. Some of the concerns are:

Omission of annotations
Errors in annotations
Absence of 'with' for ISS annotations or 'with' object not experimentally characterized
Overannotation with ISS to process terms
Problems in the ontology that can become evident when comparing annotations from different species

Methods to address this:

Some queries can be run directly: for example, finding genes that either lack annotations or are annotated to ND, although an ortholog has GO annotations
(Val Wood): Looking for co-occurrences of annotations as a high-level way to check for errors
Manual verification of ortho sets (SourceForge tracker: http://sourceforge.net/tracker/?group_id=36855&atid=1040173)
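The first query in the list above can be sketched in a few lines. This is an illustration on in-memory data, not a query against the GO database. The data shapes (ortho sets as sets of gene IDs, annotations as a dictionary of gene ID to (GO ID, evidence code) pairs) and the gene IDs in the usage note are assumptions.

```python
def annotation_gaps(ortho_sets, annotations):
    """List genes with no annotation other than ND whose ortholog is annotated.

    ortho_sets:  list of sets of gene IDs, one set per ortholog group (assumed shape).
    annotations: dict of gene ID -> list of (go_id, evidence_code) tuples.
    """
    def informative(gene):
        # Annotations with evidence code ND carry no biological data.
        return [a for a in annotations.get(gene, []) if a[1] != "ND"]

    gaps = []
    for members in ortho_sets:
        if not any(informative(g) for g in members):
            continue  # no member is annotated, so there is nothing to compare against
        for gene in sorted(members):
            if not informative(gene):
                gaps.append(gene)
    return gaps
```

For example, if gene `a1` has an IDA annotation while its orthologs `b1` (ND only) and `c1` (no annotations) do not, the function reports `b1` and `c1` as candidates for curation.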

Software development

Currently the target genes and annotation status are captured using Google spreadsheets. Target genes and links to every group's annotation status page can be found at http://spreadsheets.google.com/ccc?key=pwOksMOra5uq4vIYjPgefPw

Ortho set curation status: Siddhartha Basu, Chris Mungall, Seth Carbon and Mary Dolan are working on a database and a tool where target genes (ortho sets) and their curation status will be maintained.
Graphical displays (Mary Dolan): several improvements
Integration of ref genomes genes into AmiGO

Generating Ortholog sets

P-POD (Kara Dolinski, Princeton). Procedure:

Obtain FASTA files from each group from gp2protein files
all vs all BLAST
OrthoMCL
ClustalW
PHYLIP
Notung
Output will be rooted trees reconciled with species trees, and a graphic image of each tree
We are on the Notung step right now and are adding data as they are generated. Currently, OrthoMCL families can be queried by gene name, though for now the query returns only a list of members. Data are being made available as soon as we have them:

Web interface: http://ppod.princeton.edu/cgi-bin/ppod.cgi
FTP site: ftp://gen-ftp.princeton.edu/ppod/go_ref_genome/

Communication

The reference genome group holds a monthly phone conference. Minutes can be found at Conference_Calls

Software and Utilities

OBO-Edit

Major code refactoring, split into 3 parts: general utils (org.bbop), object model and API (org.obo) and GUI (org.oboedit)
converted map2slim to OE framework
SWUG:OBO-Edit Report 2008-04
OBOMerge GUI in progress (Jen)

AmiGO development

See also WPWG report
AmiGO_1_5 summary of features
AmiGO_1_6 plans

Annotation

Analysed current patterns for Annotation_of_Alternate_Spliceforms and came up with proposal for GAF
Proposal for Annotation_Cross_Products column (col16) in GAF

Database

support for species taxonomy trees added (e.g. we can now do searches by kingdom, phylum etc), in production
support for homolsets added to schema (for refG), in production
added support for GO.xrf_abbs (not yet in production)
added multiple views to facilitate reporting: see http://www.berkeleybop.org/goose/reports

Reference Genomes

See Reference Genomes section
RG:_Software

General

Instigated SWUG:Quality_Control software QC principles
Collating all Derived_files_in_CVS
Managed support for regulates relation across GO software

Personnel

John left for Google. Nomi has stepped in to steer the OEWG. We have hired a developer to work on OE full time, but this person is stuck outside the country due to new US entry regulations. Sohel left, Siddhartha joined dictyBase and is now working on the refG tracker. Jen has been dabbling in OE development.

User Advocacy

GO helpdesk

Continues to be run efficiently on a rota system. The email system was recently moved to Mailman.

Number of GO helpdesk queries

Sept	27
Oct	47
Nov	34
Dec	21
Jan	39
Feb	45
March	42
April	29

GO newsletter

Two editions since the last meeting. We have applied for an ISSN for the newsletter.

Web-presence Working Group (formerly AmiGO WG)

AmiGO 1.5 was released earlier this month with many new features, including a GO slimmer tool, a term enrichment tool and an SQL search interface. We are now beginning to set priorities for the next release.
The advocacy group has not been involved with AmiGO development recently. We have decided that in future the advocacy group will help set priorities, from a biologist's perspective, at the beginning of a release and will work with the software group to come up with a release plan. The software group will develop the release independently, with advocacy getting involved again only when testing is required in the run-up to the release.

Outreach

Outreach group activity has been reduced to supporting groups who approach GO directly. We are not currently seeking out new annotation groups.

Major Developments:

Group at CRIBI (Italy) committed to carrying out grape annotation.
Plant Physiology journal have agreed to accept annotations from submitting authors. (TAIR collaboration)

[online submission tool]

TAIR outreach at PAG 2008. Discussion of community annotation with TAIR, SGN (SOL Genomics Network) and WormBase.
Sol Genomics Network database annotation file has been submitted.
Reactome have created annotation files according to the plans laid down in Princeton, and are ready to commit when they have cvs access. (Emily Dimmer and Esther Schmidt)
ISAFG Conference - Fiona McCarthy reports continuing interest in GO.
Muscle Annotation wiki (Erika Feltrin and Alex Diehl)

Meeting Progress Report April 2008