20110811

Sphinx Search - why and how to use sphinx search delta indexes

Problem:
Anyone who has used Sphinx Search knows that reindexing a big index takes a long time. The main problem is that the whole index is rebuilt from scratch every time you reindex.

Solution:
The way to handle this is to use delta index updates with index merging. The idea is to have 2 indexes:
  • the "main" index, which is big and holds the old (unchanged) data, and
  • a small "delta" index for the new (recently changed) data.
So, instead of reindexing the "main" index, you only reindex the "delta" one every few minutes.
After a while (for example once a day, depending on the size of the delta) you merge the "delta" index into the "main" index.
This is called the "main+delta" scheme.

How-to:
Given a table "documents" with the fields "id, title, body", we start from a plain Sphinx Search index whose source selects every row of that table.
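
The original listing is not reproduced here, so the following is only a minimal sketch of what such a starting sphinx.conf could look like. The connection settings and the index path are placeholders, not values from this post.

```
source main
{
    type     = mysql

    # placeholder connection settings, adjust to your setup
    sql_host = localhost
    sql_user = sphinx_user
    sql_pass = secret
    sql_db   = mydb

    # the main query: every row of the documents table
    sql_query = SELECT id, title, body FROM documents
}

index main
{
    source = main

    # placeholder path for the index files
    path   = /path/to/main
}
```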


In that setup the main sql_query returns all the documents from the documents table. The idea of the "main+delta" scheme is that every time you reindex main, you store the last id processed somewhere, so that delta can start from there and process only a small number of records.

So first, create a table in MySQL to store that id. The rest of this post calls it sph_counter.
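
A possible definition of that table is sketched below. The column names counter_id and max_doc_id are assumptions (they follow the example in the Sphinx documentation); the post itself only fixes the table name and the fact that it holds a counter number and a maximum id.

```sql
-- One row per counter; row 1 holds the highest document id
-- that was present when main was last indexed.
-- The primary key is what lets REPLACE INTO overwrite that row.
CREATE TABLE sph_counter
(
    counter_id INTEGER PRIMARY KEY NOT NULL,
    max_doc_id INTEGER NOT NULL
);
```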

Then, in sphinx.conf, define a main source and a delta source that both read the stored id from sph_counter.
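
The configuration could look roughly like the sketch below, modelled on the main+delta example in the Sphinx documentation. The counter_id and max_doc_id column names, the SET NAMES line and the index paths are assumptions, and the connection settings are omitted.

```
source main
{
    # ... type, sql_host, sql_user, sql_pass, sql_db as before ...

    sql_query_pre = SET NAMES utf8
    # remember the highest id that main is about to index
    sql_query_pre = REPLACE INTO sph_counter SELECT 1, MAX(id) FROM documents

    sql_query = SELECT id, title, body FROM documents \
        WHERE id <= ( SELECT max_doc_id FROM sph_counter WHERE counter_id = 1 )
}

source delta : main
{
    # override the inherited pre-queries, otherwise the REPLACE
    # would also run when delta is indexed and move the counter
    sql_query_pre = SET NAMES utf8

    sql_query = SELECT id, title, body FROM documents \
        WHERE id > ( SELECT max_doc_id FROM sph_counter WHERE counter_id = 1 )
}

index main
{
    source = main
    path   = /path/to/main
}

index delta : main
{
    source = delta
    path   = /path/to/delta
}
```

Because delta inherits from main, only the settings that differ have to be repeated in the delta source and the delta index.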

How does this work? The main source declares a pre-fetch query called sql_query_pre. That query is executed before the main query (you can have several of them):
REPLACE INTO sph_counter SELECT 1, MAX(id) FROM documents
Our pre-fetch query updates a record in the sph_counter table that delta uses later. The record stores the MAX(id) of the documents table at the moment main is indexed, so the main query fetches the documents with an id less than or equal to that maximum, and the delta query fetches the documents with an id greater than it.

In brief, you reindex main only once in a while and delta frequently. main gets ALL the documents that exist at the time you reindex it, and delta gets all the newer ones.

In your code you will have to search in both indexes:
$sphinxClient->Query("this is my search query", "main delta");
Indexing main will still take a long time, but your search results stay close to "live" because delta reindexes very quickly.
Instead of reindexing main, you can also merge both indexes and only ever reindex delta. I will write about merging in a future post.
