Full Text Indexing
Revision as of 20:50, 9 August 2010
Project lead: Seth Carbon
Deliverables
Evaluate Solr
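We compared Solr with direct Lucene connectivity.
- The Perl Lucene bindings lag behind the official Lucene release, so we cannot get the functionality we require
- Solr has a RESTful API and is language neutral
- Solr is fast: in light benchmarking it ran somewhere between 2 and 10 times as fast as the Perl FCGI script over the C++ Lucene bindings
We are moving ahead with an Apache/Solr-based solution.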
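To illustrate what "RESTful and language neutral" means in practice, here is a minimal Python sketch of a client that builds a search URL and reads a JSON response. The base URL is the labs installation named on this page; the `/select` handler path, the `wt=json` response format, and the example `id` field are assumptions about how that instance is configured, not something this page confirms.

```python
import json
from urllib.parse import urlencode

# Base URL taken from this page. The "/select" path used below is an
# assumption about the handler configuration of that instance.
SOLR_BASE = "http://accordion.lbl.gov:8080/solr"


def build_select_url(query, rows=10, base=SOLR_BASE):
    """Build a search URL using Solr's standard q/rows/wt parameters."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return f"{base}/select?{urlencode(params)}"


def extract_docs(raw_json):
    """Return the list of matching documents from a Solr JSON response body."""
    payload = json.loads(raw_json)
    return payload["response"]["docs"]
```

Any language with an HTTP client and a JSON parser can do the same: fetch the URL returned by `build_select_url("kinase")` and pass the response body to `extract_docs`.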
Install Solr on Labs
We have an installation on labs at http://accordion.lbl.gov:8080/solr
Update from 2010-08-06
As a note for interested parties, I've got a test installation of Solr for GO up on accordion (http://accordion.lbl.gov:8080/solr, though it is not terribly useful unless you're sending it the right commands). I probably won't play with it for a while, at least until this release of AmiGO is out and I try to switch the search backend over for 1.9, but I wanted to get down what I've seen so far.
Since Solr can connect directly to a MySQL database, indexing is much quicker than with my hand-rolled method: under a minute for terms and under twenty minutes for a combined term/gp index. It also allows updating just the new data, with full access to the index while all of this is happening. This would make it much more reasonable to try some of the more complicated document schemas that we've talked about.
The general search results are better now that we have direct control over the boosting of different fields. There also seem to be improvements in the text analyzers that come with it. The next version of Solr should have the term completion we want built in, but right now, with a little munging on the client, we can get better results than what we are getting out of the current setup. There is also the option to dig into the Java code and get exactly what we want, but that would take a bit more effort (maybe later on).
Doing a little light benchmarking, Solr by itself seems to run somewhere between 2 and 10 times as fast as my Perl FCGI script over the C++ Lucene bindings, and it seemed to hold up even under very heavy load. Putting it behind an Apache HTTP proxy seems to more or less double the time, but makes some issues easier to deal with.
Solr doesn't supply its own security, which then has to be handled either through the hosting server (Jetty in this case) or, if firewalled, through the proxy server (which causes a speed hit). I can't say I'm thrilled with the options, but I got something like what I wanted through some rewrite rules on Jetty.
On the whole, it seems to be superior to the way we have Lucene now in almost every way. Except for the endless XML config file fiddling. And there is a lot of that.
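The "direct control over the boosting of different fields" mentioned in the update could look roughly like the following sketch, which builds request parameters for Solr's DisMax query parser. The `defType` and `qf` parameters are standard Solr; the field names and weights are hypothetical, since this page does not list the actual schema or boosts used.

```python
from urllib.parse import urlencode

# Hypothetical field names and weights, for illustration only.
# The real GO schema and boost values are not given on this page.
FIELD_BOOSTS = {"name": 3.0, "synonym": 2.0, "description": 1.0}


def boosted_params(user_query, boosts=FIELD_BOOSTS, rows=10):
    """Build DisMax parameters that weight matches per field via qf."""
    # qf is a space-separated list of field^boost entries.
    qf = " ".join(f"{field}^{weight}" for field, weight in boosts.items())
    return {
        "q": user_query,
        "defType": "dismax",
        "qf": qf,
        "rows": rows,
        "wt": "json",
    }


def boosted_query_string(user_query, boosts=FIELD_BOOSTS, rows=10):
    """URL-encode the boosted parameters for use in a select request."""
    return urlencode(boosted_params(user_query, boosts, rows))
```

With weights like these, a match in a term's name would count for more than the same match in its description. Tuning then means changing the numbers in one place on the client, without touching the index.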
GO Solr Indexing Engine
The current implementation uses Solr XML configuration to query the GO database and build the indexes. This approach is quite brittle.
In the future we may do this via a Java parser of GAF and OBO files instead.
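As a rough idea of what a file-based indexer would involve, here is a sketch of turning one GAF annotation line into a document ready to send to Solr. The page proposes Java for this; the sketch uses Python only to show the mapping. The column order follows the first 15 columns of the GAF format, while the output document field names are hypothetical and not taken from any existing GO schema.

```python
# Names for the first 15 tab-separated GAF columns, in file order.
GAF_COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "db_reference", "evidence_code", "with_from", "aspect",
    "db_object_name", "db_object_synonym", "db_object_type",
    "taxon", "date", "assigned_by",
]


def parse_gaf_line(line):
    """Parse one GAF line into a dict, or return None for comments and blanks."""
    if not line.strip() or line.startswith("!"):
        return None  # "!" marks header and comment lines in GAF files
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < len(GAF_COLUMNS):
        raise ValueError(f"expected at least 15 columns, got {len(fields)}")
    # Columns beyond the first 15 (present in GAF 2.0) are ignored here.
    return dict(zip(GAF_COLUMNS, fields))


def gaf_to_solr_doc(record):
    """Map a parsed GAF record to a flat document (hypothetical field names)."""
    # Synonyms are pipe-separated within the GAF column.
    synonyms = [s for s in record["db_object_synonym"].split("|") if s]
    return {
        "id": f'{record["db"]}:{record["db_object_id"]}',
        "symbol": record["db_object_symbol"],
        "name": record["db_object_name"],
        "go_id": record["go_id"],
        "evidence_code": record["evidence_code"],
        "synonym": synonyms,
    }
```

A parser like this fails loudly on a malformed line instead of silently producing a partial index, which is one way a file-based approach could be less brittle than database queries embedded in configuration.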