Full Text Indexing

From GO Wiki
Revision as of 20:50, 9 August 2010 by Cjm (talk | contribs)

Project lead: Seth Carbon

Deliverables

Evaluate Solr

We compared Solr vs direct Lucene connectivity.

  • The Perl Lucene bindings lag behind the official release, so we cannot get the functionality we require
  • Solr has a RESTful API and is language neutral
  • Solr is fast
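Because the API is plain HTTP, any language with an HTTP client can query the index. The following is a minimal Python sketch, assuming Solr's standard /select handler and its JSON response writer. The base URL and the `id` field are placeholders for illustration, not the actual GO setup or schema.

```python
import json
from urllib.parse import urlencode


def build_select_url(base_url, query, rows=10):
    """Build a Solr /select URL that asks for a JSON response.

    base_url is assumed to point at the Solr webapp (and core, if any),
    e.g. "http://localhost:8080/solr" (a placeholder, not a real server).
    """
    params = {"q": query, "rows": rows, "wt": "json"}
    return base_url.rstrip("/") + "/select?" + urlencode(params)


def extract_ids(response_text):
    """Return the 'id' field of each document in a Solr JSON response.

    'id' is an assumed field name; the real schema may use another key.
    """
    data = json.loads(response_text)
    return [doc.get("id") for doc in data["response"]["docs"]]
```

The URL can be fetched with any HTTP client (for example `urllib.request.urlopen`), and the response body passed to `extract_ids`.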

We are moving ahead with an Apache/Solr based solution.

Install Solr on Labs

We have an installation on labs at

Update from 2010-08-06

As a note for interested parties, I've got a test installation of Solr
for GO up on accordion (http://accordion.lbl.gov:8080/solr, but not
terribly useful unless you're sending it the right commands). I probably
won't play with it for a while, at least until this release of AmiGO is
out and I try and switch the search backend over for 1.9, but I wanted
to get down what I've seen so far.

Since Solr can connect directly to a MySQL database, indexing is much
quicker than with my hand-rolled method--under a minute for terms and
under twenty minutes for a combined term/gp index. It also allows
updating just the new data, and the index stays fully accessible while
all of these things are happening. This would make it much more
reasonable to try some of the more complicated document schemas that
we've talked about.
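As an illustration of full versus incremental indexing, here is a sketch that builds the HTTP commands for Solr's DataImportHandler. It assumes the database connection described above goes through that handler and that it is registered at the conventional /dataimport path; neither is confirmed here, and the base URL is a placeholder.

```python
from urllib.parse import urlencode

# Commands understood by Solr's DataImportHandler.
DIH_COMMANDS = {"full-import", "delta-import", "status", "abort", "reload-config"}


def dataimport_url(base_url, command, clean=None, commit=True):
    """Build a DataImportHandler request URL.

    Assumes the handler is registered at /dataimport in solrconfig.xml
    (the conventional path; the real installation may differ).
    """
    if command not in DIH_COMMANDS:
        raise ValueError(f"unknown DataImportHandler command: {command}")
    params = {"command": command}
    if command in ("full-import", "delta-import"):
        # Commit after the import so the new documents become searchable.
        params["commit"] = "true" if commit else "false"
        if clean is not None:
            # clean=true wipes the existing index before importing.
            params["clean"] = "true" if clean else "false"
    return base_url.rstrip("/") + "/dataimport?" + urlencode(params)
```

A `full-import` rebuilds the index from the database, while a `delta-import` picks up only rows that changed since the last run, provided the handler's configuration defines a delta query. Searches continue to be served while either one runs.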

The general search results are better now that we have direct control
over the boosting of different fields. There also seem to be
improvements in the text analyzers that come with it. The next version
of Solr should have the term completion we want built in, but right now,
with a little munging on the client, we can get better results than what
we are getting out of the current setup. There is also the option to dig
into the Java code and get exactly what we want, but that would take a
bit more effort (maybe later on).
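A rough sketch of both ideas, per-field boosting and client-side munging for term completion, might look like the following. The field names and boost weights are hypothetical, not the real GO schema, and the completion trick shown (a prefix wildcard on the last token) is one possible approach, not necessarily the one used here.

```python
import re

# Hypothetical field names and boost weights, for illustration only.
FIELD_BOOSTS = {"name": 3.0, "synonym": 2.0, "definition": 1.0}


def boosted_params(user_text, boosts=FIELD_BOOSTS, rows=10):
    """Build request parameters for Solr's DisMax parser with field boosts.

    The qf parameter lists the fields to search, each with a ^weight.
    """
    qf = " ".join(f"{field}^{weight:g}" for field, weight in boosts.items())
    return {"defType": "dismax", "q": user_text, "qf": qf, "rows": rows}


def completion_query(partial_text, field="name"):
    """Turn partially typed text into a prefix query on its last token.

    Intended for the standard Lucene query parser, since DisMax does not
    handle wildcards. Tokens are lowercased because wildcard terms are
    not run through the field's analyzer (this assumes the field is
    lowercased at index time), and non-word characters are stripped so
    they cannot be read as query syntax.
    """
    tokens = [re.sub(r"\W", "", t) for t in partial_text.lower().split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return "*:*"  # match-all query when nothing usable was typed
    *complete, last = tokens
    clauses = [f"{field}:{t}" for t in complete] + [f"{field}:{last}*"]
    return " AND ".join(clauses)
```

For example, `completion_query("Protein kin")` produces `name:protein AND name:kin*`, which can be sent as the `q` parameter of an ordinary search request.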

Doing a little light benchmarking, Solr by itself seems to run somewhere
between 2 and 10 times as fast as my perl FCGI script over the C++
Lucene bindings, and it seemed to hold up even under very heavy load.
Putting it behind an Apache HTTP proxy seems to more or less double the
response time, but
makes some issues easier to deal with.
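A comparison of this kind can be reproduced with a small timing harness. The sketch below is not the script behind the figures above; it only shows one way to time repeated calls and compare medians, and it measures sequential requests rather than behaviour under concurrent load.

```python
import statistics
import time


def time_calls(fn, repeats=20):
    """Call fn() `repeats` times; return each call's wall-clock time in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def speedup(baseline, candidate):
    """Return how many times faster `candidate` is than `baseline`, by median."""
    return statistics.median(baseline) / statistics.median(candidate)
```

In practice `fn` would be a function that issues one search request, for example `lambda: urllib.request.urlopen(url).read()`, run once against each backend with the same query.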

Solr doesn't supply its own security, which then has to be done either
through the hosting server (Jetty in this case) or the proxy server
(which causes a speed hit) if firewalled. I can't say I'm thrilled with
the options, but I got something like I wanted through some rewrite
rules on Jetty.

On the whole, it seems to be superior to the way we have Lucene now in
almost every way. Except for the endless XML config file fiddling. And
there is a lot of that.

GO Solr Indexing Engine

The current implementation uses Solr XML configuration to query the GO database and build the indexes. This approach is quite brittle.

In the future we may do this via a Java parser of GAF and OBO files.
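To give a feel for what the GAF half of such a parser involves, here is a sketch in Python rather than Java, purely for brevity. The column order follows the GAF 2.0 format; the output field names in `to_document` are hypothetical and do not reflect an agreed schema.

```python
# Column names for the 17 tab-separated GAF 2.0 columns, in order.
GAF_COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "db_reference", "evidence_code", "with_from", "aspect",
    "db_object_name", "db_object_synonym", "db_object_type",
    "taxon", "date", "assigned_by", "annotation_extension",
    "gene_product_form_id",
]


def parse_gaf_line(line):
    """Parse one GAF line into a dict, or return None for comments and blanks."""
    if not line.strip() or line.startswith("!"):
        return None  # header and comment lines start with "!"
    values = line.rstrip("\r\n").split("\t")
    if len(values) < 15:
        raise ValueError(f"expected at least 15 columns, got {len(values)}")
    # GAF 1.0 has 15 columns; pad so the two GAF 2.0 extras are empty.
    values += [""] * (len(GAF_COLUMNS) - len(values))
    return dict(zip(GAF_COLUMNS, values))


def to_document(record):
    """Map a parsed record to a flat document for indexing.

    The output field names are hypothetical, chosen for illustration.
    """
    return {
        "id": f'{record["db"]}:{record["db_object_id"]}',
        "symbol": record["db_object_symbol"],
        "go_id": record["go_id"],
        "evidence": record["evidence_code"],
        # Synonyms are pipe-separated within the column.
        "synonyms": [s for s in record["db_object_synonym"].split("|") if s],
    }
```

Documents built this way could be posted to Solr's update handler directly, which would remove the dependency on the database query step.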