Full Text Indexing Progress: Difference between revisions

From GO Wiki
Latest revision as of 10:18, 25 October 2017

DEPRECATED

Please see GOlr instead.

Overview

There are two separate fronts of progress for full text indexing (FTI). The first is the indexing system itself ("system"): the software used (Solr, Jetty, etc.), schema, deployment, hardware, and other low-level issues that are unlikely to matter much to end users and programmers. The second is the consumption and use of FTI ("software"): integration into various pieces of software, services built up around FTI, and (possibly) abstraction APIs.

While this distinction has some blurry points (e.g. a JSON service built directly into the engine), it should provide a logical way to divide most of the problems that will be faced.

Goals

A changeable list of goals as we progress (items struck through in the wiki source are marked "Done"):

  • Produce a basic standalone FTI based on Solr. (Done)
  • Make sure it's better than the previous attempts (benchmark). (Done)
  • Convert services currently consuming old FTI to Solr.
    • Replace current autocomplete with Solr calls. (Done)
    • AmiGO LiveSearch using Solr. (Done)
    • AmiGO API for canonicalizer.
  • Move to new/public hardware.
  • Create public interface.
  • Produce version with "complicated" schema.
    • Terms (Done)
    • Try associations and evidence. (Done)
    • Annotations (Done)
    • Gene products (GPs) (necessary)
  • Fitness for purpose tests.
    • "Big join" test.
    • See if scaling works as desired.
    • Try other proxies/balancers (Nginx, Cherokee, etc.).
    • Functional as virtualized service (see Virtualization).
  • Implement smart and robust canonicalizer for GPs and terms.
  • Create rich searching interfaces using new engine. (Working)
    • Final would need to be combined with "ontology engine".

Some of these are out of order or depend on something elsewhere in the list.

System Progress

Installation

Solr on Jetty (no longer behind an Apache proxy) is currently installed on a BBOP development workstation for experimentation:

  • http://accordion.lbl.gov:8080/solr

This backend will take over most of the heavy lifting currently done in AmiGO. The current setup is defined by files in the GO SVN repository on SourceForge: http://geneontology.svn.sourceforge.net/viewvc/geneontology/java/gold/solr/

We'll move to something more robust and less experimental as soon as possible.

Schema

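The production schema is essentially the set of SQL commands used to generate the data for the Lucene index, written in XML format. Note that it feeds off of GOLD. See http://geneontology.svn.sourceforge.net/viewvc/geneontology/java/gold/solr/conf/gold-pg-config.xml

The Lucene schema defines how the GO data (pulled in by the production schema) is interpreted for use in Lucene. See http://geneontology.svn.sourceforge.net/viewvc/geneontology/java/gold/solr/conf/schema.xml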

The Lucene schema is very flat and basic, with small lists for things like synonyms. To make it generally usable (i.e. to have one index in which all aspects can be searched generically), certain items are overloaded into the same field. For example, the label field is used in multiple ways.
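
As a rough illustration of the flat layout described above, the sketch below flattens a term and a gene product into documents that share one label field. Only the label field and the synonym list come from this page; the other field names (id, document_category, synonym), the helper functions, and the sample gene product are assumptions made for illustration and are not taken from the actual schema.xml.

```python
def flatten_term(acc, name, synonyms):
    """Flatten an ontology term into a single-level document."""
    return {
        "id": acc,
        "document_category": "term",  # hypothetical discriminator field
        "label": name,                # overloaded: holds the term name here
        "synonym": list(synonyms),    # small multi-valued list
    }


def flatten_gene_product(gp_id, symbol, synonyms):
    """Flatten a gene product into the same document shape."""
    return {
        "id": gp_id,
        "document_category": "gene_product",  # hypothetical
        "label": symbol,              # overloaded: holds the GP symbol here
        "synonym": list(synonyms),
    }


def search_label(docs, text):
    """Generic search: one field covers every kind of document."""
    needle = text.lower()
    return [d["id"] for d in docs if needle in d["label"].lower()]


docs = [
    flatten_term("GO:0006915", "apoptotic process", ["apoptosis"]),
    # "EX:0001" is a made-up gene product used only for this sketch.
    flatten_gene_product("EX:0001", "apoptosis regulator example", []),
]
print(search_label(docs, "apopto"))
```

The trade-off is visible in search_label: a single generic query reaches both kinds of document, but the caller needs something like the category field to tell what a given label actually means.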

Software Progress

There are now a live search (http://amigo.berkeleybop.org/cgi-bin/amigo/amigo?mode=live_search_gold) and a term completion component that feed off of the Solr index.

These services mostly query the Solr server's JSON service directly. This will have to change in the future, because exposing the server directly raises security and integrity issues.
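
For context, a direct call to Solr's JSON service looks roughly like the following sketch. It assumes Solr's standard select handler with wt=json and a query against the label field; the base URL is a placeholder, and the helper names, the trailing-wildcard query, and the sample response are illustrative only. No request is actually sent.

```python
import json
from urllib.parse import urlencode

SOLR_BASE = "http://localhost:8080/solr"  # placeholder, not the real server


def completion_url(prefix, rows=10, base=SOLR_BASE):
    """Build a select URL asking for labels that start with `prefix`.

    The prefix is not escaped for Solr query syntax here; a real client
    would need to do that.
    """
    params = {
        "q": "label:%s*" % prefix,  # trailing wildcard = prefix match
        "wt": "json",               # ask Solr for a JSON response
        "rows": rows,
    }
    return base + "/select?" + urlencode(params)


def extract_labels(raw_json):
    """Pull the label of each hit out of a Solr JSON response body."""
    data = json.loads(raw_json)
    return [doc["label"] for doc in data["response"]["docs"]]


# Hand-written sample in the shape of a Solr JSON response.
sample = (
    '{"response": {"numFound": 1, "start": 0, '
    '"docs": [{"label": "apoptotic process"}]}}'
)
print(completion_url("apop"))
print(extract_labels(sample))
```

Because the client builds the whole query itself, it can send any query the server accepts, which is one way to read the security and integrity concern; an intermediate API that accepts only a prefix and a row limit would narrow that surface.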

Past Experiments

Past experiments for FTI have included various combinations of:

  • Perl/CLucene
  • Xapian
  • Apache mod_perl
  • FCGI
  • Ruby/Ferret