Full Text Indexing

From GO Wiki
Revision as of 18:41, 21 September 2010 by Sjcarbon (talk | contribs)
=Overview=
=Rationale=
=Platforms=

==Purpose==

Existing database text searches on GO are slow and plagued by false positives and false negatives. There are many cross-cutting areas where fast ontology or annotation searching and/or autocomplete would be beneficial.

Also see Full Text Indexing Progress.

==Groups==

==Deliverables==

==Evaluate Solr==

We compared Solr vs direct Lucene connectivity.

* Perl/CLucene lags far behind the official Java Lucene API, and we cannot get the functionality we require from it; anecdotally, people have reported better search results using Solr than the old system.
* Solr has a RESTful API and is language neutral for web consumers.
* Solr is fast (see the benchmarking notes below).
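Because Solr's API is plain HTTP, any web consumer can talk to it by building a query URL; no language-specific bindings are needed. A minimal sketch, where the host, core path, and the `label` field name are illustrative assumptions rather than the actual GO setup:

```python
# Sketch: building a Solr select-query URL for a hypothetical GO term index.
# Host, path, and field names here are assumptions, not the real GO config.
from urllib.parse import urlencode

SOLR_BASE = "http://localhost:8080/solr/select"  # assumed local install

def term_search_url(text, rows=10):
    """Return a Solr query URL searching a hypothetical 'label' field."""
    params = {
        "q": "label:%s" % text,
        "wt": "json",   # ask Solr for JSON, easy for any client language
        "rows": rows,
    }
    return SOLR_BASE + "?" + urlencode(params)

print(term_search_url("apoptosis"))
```

Any language with an HTTP client and a JSON parser can consume the response the same way, which is what "language neutral" buys us.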

Doing a little light benchmarking, Solr by itself seems to run somewhere between 2 and 10 times as fast as my Perl FCGI script over the C++ Lucene bindings, and seemed to hold up even under very heavy load. Putting it behind an Apache HTTP proxy seems to more or less double the response time, but makes some issues easier to deal with.
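The "light benchmarking" here amounts to timing repeated requests against each backend. A sketch of that kind of measurement, written generically over a fetch callable (with a live server one would pass in something like `lambda: urlopen(solr_url).read()`):

```python
# Sketch: light latency benchmarking of a search backend by timing
# n sequential calls. The fetch callable stands in for a real HTTP GET.
import time

def mean_latency(fetch, n=50):
    """Return mean seconds per call over n sequential invocations of fetch()."""
    start = time.time()
    for _ in range(n):
        fetch()
    return (time.time() - start) / n
```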

Given this information, and our experiments with other systems (see Full Text Indexing Progress), we are moving ahead with a Solr-based solution.

==Install Solr on Labs==

We have an installation on labs at

===Update from 2010-08-06===

As a note for interested parties, I've got a test installation of Solr
for GO up on accordion (http://accordion.lbl.gov:8080/solr, but not
terribly useful unless you're sending it the right commands). I probably
won't play with it for a while, at least until this release of AmiGO is
out and I try to switch the search backend over for 1.9, but I wanted
to get down what I've seen so far.

Since Solr can connect directly to a MySQL database, it is much quicker
than my hand-rolled method: under a minute for terms and under twenty
minutes for a combined term/gp index. It also allows updating with just
the new data, with full read access while the reindexing is happening.
This would make it much more reasonable to try some of the more
complicated document schemas that we've talked about.
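The direct database connection goes through Solr's DataImportHandler, configured in XML. A sketch of the kind of data-config.xml involved, where the connection details and the table/column names are illustrative assumptions (the real GO schema and field mapping may differ):

```xml
<!-- Sketch of a Solr DataImportHandler config for indexing GO terms
     straight from MySQL. Connection details and column names here are
     hypothetical illustrations, not the actual GO configuration. -->
<dataConfig>
  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/go_latest"
              user="go_ro" password=""/>
  <document>
    <entity name="term"
            query="SELECT id, acc, name, term_type FROM term">
      <field column="acc"       name="id"/>
      <field column="name"      name="label"/>
      <field column="term_type" name="ontology"/>
    </entity>
  </document>
</dataConfig>
```

A full rebuild is triggered through the handler's full-import command, while delta-import picks up only new rows; the old index keeps serving queries until the new one is committed, which is what allows full access while all of this is happening.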

The general search results are better now that we have direct control
over the boosting of different fields. There also seem to be
improvements in the text analyzers that come with it. The next version
of Solr should have the term completion we want built in, but right now,
with a little munging on the client, we can get better results than what
we are getting out of the current setup. There is also the option to dig
into the Java code and get exactly what we want, but that would take a
bit more effort (maybe later on).
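Field boosting is expressed as query parameters to Solr's dismax parser. A sketch of what such a boosted query looks like; the field names (`label`, `synonym`) and boost factors are illustrative assumptions, not the tuned GO values:

```python
# Sketch: a boosted dismax query against a hypothetical term index.
# The field names and boost factors are illustrative assumptions.
from urllib.parse import urlencode

def boosted_query_params(text):
    """Build Solr dismax parameters boosting label matches over synonyms."""
    return urlencode({
        "q": text,
        "defType": "dismax",
        # a label match counts three times as much as a synonym match
        "qf": "label^3.0 synonym^1.0",
        "wt": "json",
    })

print(boosted_query_params("apoptosis"))
```

Tuning those per-field weights is the "direct control over boosting" referred to above.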

==GO Solr Indexing Engine==

The current implementation uses Solr XML configuration to query the GO database and build the indexes. This approach is quite brittle.

In the future we may do this via a Java parser of GAFs + OBO files.
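The idea of parsing GAFs directly can be sketched briefly (in Python here for illustration, though the plan above is Java): GAF is tab-separated with `!`-prefixed comment lines, and each annotation row maps onto one index document. The output field names below are illustrative assumptions, and the sample line is made up for demonstration:

```python
# Sketch: turning GAF annotation lines into Solr-ready documents.
# Column positions follow the GAF format (tab-separated; '!' starts a
# comment line); the output field names are illustrative assumptions.
def gaf_to_docs(lines):
    """Yield one dict per annotation line, keyed for a Solr index."""
    for line in lines:
        if line.startswith("!") or not line.strip():
            continue  # skip comments and blank lines
        cols = line.rstrip("\n").split("\t")
        yield {
            "gp_id": cols[0] + ":" + cols[1],  # DB + DB Object ID
            "gp_symbol": cols[2],              # DB Object Symbol
            "term_id": cols[4],                # GO ID
            "evidence": cols[6],               # Evidence code
        }

# A made-up annotation line, for illustration only.
sample = [
    "!gaf-version: 2.0",
    "SGD\tS000004660\tAAC1\t\tGO:0005743\tPMID:2167309\tIDA\t\tC"
    "\t\t\tgene\ttaxon:559292\t20100121\tSGD",
]
for doc in gaf_to_docs(sample):
    print(doc)
```

A Java version of the same loop, plus an OBO parser for term documents, would replace the brittle database-query configuration.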

==Solr Indexing Engine Optimizations==

==Integrate==

==Queries==

===Autocomplete===

===Advanced===

==Caveats==

===Scaling===

While there are impressive speed improvements with FTI, given the uses that we have in mind (e.g. text completion as a ubiquitous public and private service), there is the worry that we could be overwhelmed with requests or that our hardware would be insufficient.

Solr and Lucene are designed with much heavier tasks and larger datasets in mind. We (currently) believe that there is no problem here, and our benchmarks seem to imply the same thing; but we likely won't know until we really have to deal with it.

The worst case scenario is that we restrict it to internal use and external applications that we've okayed.

===Security===

Solr doesn't supply its own security, so access control has to be handled either through the hosting server (Jetty in this case) or through the proxy server (which causes a speed hit) if firewalled. I can't say I'm thrilled with the options, but I got something like what I wanted through some rewrite rules on Jetty.
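For the proxy-server route, one common pattern is to expose only Solr's read-only search handler and keep everything else unreachable from outside. A sketch of an Apache stanza for that; hostnames, ports, and paths are assumptions, and this is not the Jetty rewrite-rule setup actually used:

```apache
# Sketch of an Apache reverse-proxy stanza that exposes only Solr's
# read-only search handler; hostnames and paths are assumptions.
<Location /solr/select>
    ProxyPass http://localhost:8080/solr/select
    ProxyPassReverse http://localhost:8080/solr/select
</Location>
# Everything else (update handlers, admin UI) stays unproxied, so it is
# reachable only from localhost or the internal network.
```

This trades the proxy's speed hit for not having to trust Solr's own endpoints to the open web.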