Full Text Indexing

=DEPRECATED=


Please see [[GOlr]] instead.

=Overview=

This document looks at the reasons for using FTI, as well as some of
the high-level software choices that have been made.

To see the current status, please try [[Full Text Indexing Progress]].

=Rationale=


* Existing database text searches on GO are slow and plagued by false positives and false negatives.
* Make previously impossible services possible. Things like autocomplete and certain types of search were not practical in our infrastructure (e.g. gene products annotated to multiple terms).
* Make user interfaces as fast and responsive as possible.
* Remove needless work and save resources for cases where a relational database makes sense.


==Other Uses==


Given an in-memory ontology to work with, a cleverly designed schema
could cover a lot of cases where SQL is currently used, but at
potentially much higher speed. See [[Full Text Indexing Progress]].
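
As a sketch of the document-oriented idea (the field names, the term,
and the localhost URL below are placeholders rather than our actual
schema, and the fields would have to be declared in schema.xml), a
term record with its synonyms and other join-heavy data folded into
one Solr document could be loaded like this:

<pre>
# Minimal sketch: pushing a document-shaped term record into Solr.
# Field names and the term are illustrative; the URL assumes a
# default local Solr instance.
import urllib.request
from xml.sax.saxutils import escape

SOLR_UPDATE = "http://localhost:8983/solr/update"

def post(xml):
    req = urllib.request.Request(SOLR_UPDATE, data=xml.encode("utf-8"),
                                 headers={"Content-Type": "text/xml"})
    urllib.request.urlopen(req).read()

def add_doc(fields):
    """Send a single <add><doc>...</doc></add> message to Solr."""
    parts = ["<add><doc>"]
    for name, values in fields.items():
        for value in (values if isinstance(values, list) else [values]):
            parts.append('<field name="%s">%s</field>' % (name, escape(str(value))))
    parts.append("</doc></add>")
    post("".join(parts))

# A term document that folds in data that would otherwise need SQL joins.
add_doc({
    "id": "GO:0008150",
    "label": "biological_process",
    "synonym": ["biological process", "physiological process"],
    "ontology": "biological_process",
})
post("<commit/>")  # make the new document searchable
</pre>

The point is that lookups which currently require joins in SQL become
single index queries against these pre-joined documents.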


A Solr-based system could also be used in a more general information
caching sense; for example, it may have uses in page display and quick
retrieval of commonly requested data.


=Evaluation of Solr=

Given the information below, the fact that Lucene has dominated the
FTI landscape, and our experiments with other systems (see
[[Full Text Indexing Progress]]), we are moving ahead with a
Solr-based solution.


==Useful API==

Perl/CLucene (the system our previous attempt was based on) lags far
behind the official Lucene API, and we cannot get the functionality we
require from it. For example, solutions to the leading '*' problem and
support for DisMax never appeared in the CLucene bindings.

Solr also provides different facets of the same underlying data,
giving us the ability to have an API specialized for different tasks
(e.g. search, spellcheck, autocomplete).
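
For a concrete feel of the API, here is a minimal sketch of a DisMax
query against the stock /select handler, read back as JSON. The field
names and boosts in qf are hypothetical, and the URL assumes a default
local Solr instance:

<pre>
# Illustrative only: a DisMax query with per-field boosts, sent to
# Solr's /select handler over plain HTTP and decoded from JSON.
import json
import urllib.parse
import urllib.request

SOLR_SELECT = "http://localhost:8983/solr/select"

def search(text, rows=10):
    params = {
        "q": text,
        "defType": "dismax",                    # DisMax query parsing
        "qf": "label^5 synonym^2 definition",   # hypothetical fields/boosts
        "fl": "id,label,score",
        "rows": rows,
        "wt": "json",
    }
    url = SOLR_SELECT + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["response"]["docs"]

for doc in search("apoptosis"):
    print(doc["id"], doc["label"])
</pre>

The same core index can then sit behind differently tuned request
handlers for search, spellcheck, and autocomplete without changing
this client-side pattern.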
 
==Platform Agnosticism==
 
While Lucene is written in Java, Solr has a native RESTful API, which
makes it language neutral for web consumers. There was some potential
for the old CLucene library to be platform agnostic, but given its age
issues (see above) and poor support, that never panned out.
 
==Fast Search==


Doing some simple benchmarking, Solr by itself seems to run somewhere
between 2 and 10 times as fast as the fastest iteration of the old
system, and it seemed to hold up even under very heavy load. Putting
it behind an Apache HTTP proxy seems to more or less double the
response time, but makes some issues (security, load balancing) easier
to deal with.
 
Also see go-dev source for some documentation.
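
The benchmarking was nothing fancier than timing repeated queries
against a running instance; a sketch of that kind of check (the query,
the count, and the URL are placeholders, not the original benchmark)
looks like this:

<pre>
# Crude latency check: time N identical queries against a local Solr
# and report the average. Query, count, and URL are placeholders.
import time
import urllib.parse
import urllib.request

SOLR_SELECT = "http://localhost:8983/solr/select"
N = 100

params = urllib.parse.urlencode({"q": "label:kinase", "rows": 10, "wt": "json"})
start = time.time()
for _ in range(N):
    urllib.request.urlopen(SOLR_SELECT + "?" + params).read()
elapsed = time.time() - start
print("%d queries in %.2fs (%.1f ms/query)" % (N, elapsed, 1000 * elapsed / N))
</pre>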
 
==Fast Indexing==


Since Solr can connect directly to a relational database, it is much
quicker to update and less error-prone than the previous hand-rolled
methods. At last check, it took under a minute to build a terms index
and under twenty minutes for a combined term/gene-product index. Solr
also allows for updating just the new data, and the index remains
fully available for searching while all of this is happening.
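
The direct database connection here is Solr's DataImportHandler.
Assuming a /dataimport handler has been wired up in solrconfig.xml,
with a data-config.xml mapping the GO database tables onto index
fields, full rebuilds and incremental updates are just HTTP requests:

<pre>
# Sketch: kicking off DataImportHandler jobs over HTTP. Assumes a
# /dataimport handler is configured in solrconfig.xml; the URL and
# handler path are the usual defaults, not necessarily ours.
import urllib.parse
import urllib.request

DIH = "http://localhost:8983/solr/dataimport"

def dih(command, **extra):
    params = {"command": command, "wt": "json"}
    params.update(extra)
    return urllib.request.urlopen(DIH + "?" + urllib.parse.urlencode(params)).read()

dih("full-import", clean="true")   # rebuild the whole index from the database
# dih("delta-import")              # or just pick up rows changed since the last run
# dih("status")                    # poll progress; searches keep working meanwhile
</pre>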


==Better Results==

The general search results are better now that we have direct control
over the boosting of different fields. There also seem to be
improvements in the text analyzers that come with it. The next version
of Solr should have the term completion we want built in, but right
now, with a little munging on the client, we can get better results
than what we are getting out of the current setup. There is also the
option to dig into the Java code and get exactly what we want, but
that would take a bit more effort (maybe later on).
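
As an example of the client-side munging: until completion is built
in, one workable trick is to ask the TermsComponent for indexed terms
sharing a prefix. This assumes a /terms request handler is enabled and
that 'label' is an indexed, lowercased field; both are illustrative
rather than our actual configuration:

<pre>
# Prefix-based term completion via the TermsComponent. Assumes a
# /terms handler is enabled and 'label' is an indexed field.
import json
import urllib.parse
import urllib.request

SOLR_TERMS = "http://localhost:8983/solr/terms"

def complete(prefix, limit=10):
    params = urllib.parse.urlencode({
        "terms.fl": "label",
        "terms.prefix": prefix.lower(),
        "terms.limit": limit,
        "wt": "json",
    })
    with urllib.request.urlopen(SOLR_TERMS + "?" + params) as resp:
        data = json.load(resp)
    # The JSON response typically holds a flat [term, count, term, count, ...]
    # list per field; keep just the terms.
    flat = data["terms"]["label"]
    return flat[0::2]

print(complete("mito"))
</pre>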


=Caveats=


==Scaling==


While there are impressive speed improvements with FTI, given the uses
that we have in mind (e.g. text completion as a ubiquitous public and
private service), there is the worry that we could be overwhelmed with
requests or that our hardware would be insufficient.


Solr and Lucene should have been designed with much heavier tasks and
larger datasets in mind than ours. We (currently) believe that there
is no problem here, and our benchmarks seem to imply the same thing;
but we likely won't know until we really have to deal with it.


The worst-case scenario is that we restrict it to internal use and to
external applications that we've okayed.


==Security==


Solr doesn't supply its own security, which then has to be handled
either through the hosting server (Jetty in this case) or through the
proxy server (which causes a speed hit) if firewalled. I can't say I'm
thrilled with the options, but I got something like what I wanted
through some rewrite rules on Jetty.
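
For illustration of the proxy-level lockdown idea only (this is not
the actual Jetty rewrite configuration), a minimal front end that
exposes read-only queries and refuses everything else could look like
this, assuming Solr itself is reachable only from localhost:

<pre>
# Toy allow-list proxy: forward read-only /solr/select queries to a
# Solr instance bound to localhost, refuse everything else.
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

SOLR = "http://127.0.0.1:8983"

class SelectOnly(BaseHTTPRequestHandler):
    def do_GET(self):
        if not self.path.startswith("/solr/select"):
            self.send_error(403, "only /solr/select is exposed")
            return
        with urllib.request.urlopen(SOLR + self.path) as upstream:
            body = upstream.read()
            ctype = upstream.headers.get("Content-Type", "text/plain")
        self.send_response(200)
        self.send_header("Content-Type", ctype)
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), SelectOnly).serve_forever()
</pre>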




[[Category:SWUG Projects]]
[[Category:AmiGO]]
[[Category:Software]]
