Full Text Indexing Progress

From GO Wiki

Revision as of 16:17, 23 September 2010

Overview

There are two separate fronts of progress for full text indexing (FTI). The first is the indexing system itself ("system"): the software used (Solr, Jetty, etc.), schema, deployment, hardware, and other low-level issues that are unlikely to matter much to end users and programmers. The second is the consumption and use of FTI ("software"): integration into various pieces of software, services built up around FTI, and (possibly) abstraction APIs.

While there are some blurry points in this distinction (e.g. a JSON service built directly into the engine), it should provide a logical way to divide most of the problems we will face.

Goals

A changeable list of goals as we progress:

  • Produce a basic standalone FTI based on Solr.
  • Make sure it's better than the previous attempts (benchmark).
  • Convert services currently consuming old FTI to Solr.
    • Likely replace current autocomplete with Solr proxy calls.
    • Replace current AmiGO LiveSearch system using AmiGO::External calls.
    • AmiGO API for canonicalizer.
  • Move to new/public hardware.
  • Create public interface.
  • Produce version with "complicated" schema.
    • Terms in GPs.
    • GPs in terms.
    • Try associations and evidence.
    • Implement smart and robust canonicalizer for GPs and terms.
  • Test for practical speed.
  • "Big join" test.
  • See if scaling is practical.
    • Try other proxies/balancers (Nginx, Cherokee, etc.).
    • Functional as virtualized service (see Virtualization).
  • Create rich searching interfaces using new engine.
    • Final would need to be combined with "ontology engine".

Some of these are out of order or depend on something elsewhere in the list.
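
For the benchmark and practical-speed goals, a small timing harness is one way to compare engines on the same query set. The following is only a sketch: `time_queries` and `fake_search` are hypothetical names, and the search callable stands in for whatever client ends up wrapping Solr or one of the earlier attempts.

```python
import statistics
import time


def time_queries(search, queries, repeats=3):
    """Return the median wall-clock seconds per query for a search callable."""
    timings = {}
    for query in queries:
        samples = []
        for _ in range(repeats):
            start = time.perf_counter()
            search(query)
            samples.append(time.perf_counter() - start)
        timings[query] = statistics.median(samples)
    return timings


def fake_search(query):
    # Hypothetical stand-in for a real engine call (Solr, Xapian, ...).
    return [query.upper()]


results = time_queries(fake_search, ["kinase", "GO:0008150"])
for query, seconds in results.items():
    print(f"{query}: {seconds:.6f}s")
```

Running the same query list against each candidate and comparing medians would give a rough first answer to "is it better than the previous attempts".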

System Progress

Installation

Solr on Jetty, with an Apache proxy, is currently installed on a BBOP development workstation.

Currently, it is not terribly useful unless you're sending it the right commands. It probably won't be played with again at least until AmiGO 1.8 is out and we try to switch the search backend and autocomplete over for 1.9.
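
As a rough idea of what "the right commands" look like, Solr's standard `/select` handler takes a query in `q`, a result count in `rows`, and a response format in `wt`. The sketch below only builds such a URL; the host, port, path, and the `label` field name are assumptions for illustration and are not taken from the actual installation.

```python
from urllib.parse import urlencode


def build_select_url(base_url, query, rows=10):
    """Build a URL for Solr's standard /select handler, asking for JSON."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return base_url.rstrip("/") + "/select?" + urlencode(params)


# Hypothetical base URL and field name; substitute the real proxy address.
url = build_select_url("http://localhost:8080/solr", "label:kinase", rows=5)
print(url)
```

The resulting URL can be fetched with any HTTP client, which is also what a proxy in front of Solr would forward.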

The current setup is defined by files in the GO SVN repo on SF.net. [1]

We'll move to something more robust and public as soon as possible.

Schema

The production schema is essentially the SQL commands used to generate the data for Lucene, in XML format. [2]

The Lucene schema is how the GO data (taken by the production schema) is interpreted for use in Lucene. [3]

Right now, the setup uses a very flat and basic schema, with small lists for things like synonyms. In the future, we'll want a richer schema that stores valuable, commonly used information directly in the documents. The main example is association information stored directly in GPs and terms. This, coupled with an association index (for example, all "GO:9988776xUniProt", with all direct listings stored) and an ontology engine, may cover much of the ground of relational searches while being much faster.
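
To make the difference concrete, here is a sketch of a flat term document next to a richer one that carries its gene products, plus a helper that builds association keys in the "GO:9988776xUniProt" style mentioned above. All field names (`label`, `synonym`, `gene_product`) and the gene product IDs are made up for illustration; the real schema is whatever the files in [3] define.

```python
def association_key(term_id, source):
    """Join a term ID and a source name in the 'TERMxSOURCE' style."""
    return f"{term_id}x{source}"


def enrich_term(term_doc, gp_ids):
    """Return a copy of a flat term document with gene products stored in it."""
    rich = dict(term_doc)
    rich["gene_product"] = sorted(gp_ids)
    return rich


# Flat document: scalar fields plus a small list for synonyms.
flat_term = {
    "id": "GO:9988776",
    "label": "example term",
    "synonym": ["example synonym"],
}

# Richer document: the same term with hypothetical gene product IDs attached.
rich_term = enrich_term(flat_term, ["UniProt:P00002", "UniProt:P00001"])
key = association_key("GO:9988776", "UniProt")
print(rich_term)
print(key)
```

Storing the list in the term document trades index size for the ability to answer "which GPs are annotated to this term" without a join.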

Software Progress

The first steps will be converting the autocomplete to use Solr and having AmiGO use it to run LiveSearch.
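
On the consuming side, an autocomplete service mostly needs to turn a Solr JSON response into a short list of suggestions. Solr's JSON output nests matching documents under `response.docs`; the sketch below assumes that shape and a hypothetical `label` field, and uses a hand-written sample response rather than real data.

```python
import json


def completions_from_response(raw_json, field="label", limit=10):
    """Extract unique field values, in result order, from a Solr JSON response."""
    data = json.loads(raw_json)
    docs = data.get("response", {}).get("docs", [])
    seen = []
    for doc in docs:
        value = doc.get(field)
        if value is not None and value not in seen:
            seen.append(value)
    return seen[:limit]


# Hand-written sample in the shape of a Solr JSON response (not real data).
sample = json.dumps({
    "response": {
        "numFound": 3,
        "start": 0,
        "docs": [
            {"id": "EX:1", "label": "kinase activity"},
            {"id": "EX:2", "label": "kinase binding"},
            {"id": "EX:3", "label": "kinase activity"},
        ],
    }
})

suggestions = completions_from_response(sample)
print(suggestions)
```

A missing `response` key or an empty `docs` list simply yields an empty suggestion list, which keeps the autocomplete front end from failing on odd input.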

Not much yet. More to come.

Past Experiments

Past experiments for FTI have included various combinations of:

  • Perl/CLucene
  • Xapian
  • Apache mod_perl
  • FCGI
  • Ruby/Ferret