Indexing with Sphinx Search
May 19th, 2008
+1 to Andrew Aksyonoff for giving us the Sphinx search engine. I have used Apache Lucene on several projects over the last four years, and with that experience I started integrating Lucene into our Loud Feed application using ferret and acts_as_ferret. All these years two issues with Lucene kept bugging me, although I had worked around them: first, the time it takes to build the index, and second, index corruption caused by concurrent access. After getting acts_as_ferret integrated, I started indexing the records. While indexing 65,000 records in my dev environment, I ran out of patience after waiting more than an hour for the search index to be created, so I started looking around for better solutions.
I stumbled upon Sphinx. Unlike Lucene, the Sphinx search engine does not support real-time updates out of the box, although the Sphinx documentation describes a solution for them. For the Loud Feed application, we decided to rotate the search index every two hours, which is more acceptable than having a real-time Lucene index that sometimes has to be rebuilt because of corrupted data.
OK, enough talk; I will show you how we integrated Sphinx into Loud Feed.
1. Get the Sphinx source
curl http://www.sphinxsearch.com/downloads/sphinx-0.9.8-rc2.tar.gz -o sphinx-0.9.8-rc2.tar.gz
OR
wget http://www.sphinxsearch.com/downloads/sphinx-0.9.8-rc2.tar.gz
2. Configure and install
tar xzf sphinx-0.9.8-rc2.tar.gz
cd sphinx-0.9.8-rc2
./configure
make
sudo make install
3. Check that it installed correctly by running the indexer:
$ indexer
You should get the usage notes. If not, refer to the Sphinx documentation.
4. Integrate Sphinx with Rails application
I was looking for a Ruby/Rails plugin rather than trying to reinvent the wheel. I found two: acts_as_sphinx and Sphinx Client API for Ruby (SCAFR). They are very similar, except that acts_as_sphinx adds methods that load models from the database using the ids returned from the search index. However, acts_as_sphinx hides the advanced features of the Sphinx engine, such as the different match modes and search filtering. So I chose SCAFR, because it has the flexibility to configure the Sphinx query to meet the Loud Feed requirements.
5. Integrating Sphinx Client API for Ruby (Sphinx Client API 0.4.0-r1112)
curl http://kpumuk.info/files/rails/plugins/sphinx-0.4.0-r1112.zip -o sphinx-0.4.0-r1112.zip
OR
wget http://kpumuk.info/files/rails/plugins/sphinx-0.4.0-r1112.zip
Unzip the file and move the folder into your Rails plugins directory (vendor/plugins).
5.1 Refactoring client.rb to return ids in an array for easy loading
While reviewing the code, I found there is no straightforward way to get the matching doc ids. So I added one to client.rb (between lines 689 and 734) as follows.
### start ###
id64 = response[p, 4].unpack('N*').first; p += 4
# collect the matching document ids in an array
ids = []
# read matches
result['matches'] = []
while count > 0 and p < max
  count -= 1
  if id64 != 0
    dochi, doclo, weight = response[p, 12].unpack('N*N*N*'); p += 12
    # parenthesize the shift: + binds tighter than << in Ruby
    doc = (dochi << 32) + doclo
  else
    doc, weight = response[p, 8].unpack('N*N*'); p += 8
  end
  r = {} # This is a single result put in the result['matches'] array
  r['id'] = doc
  ids << doc
  r['weight'] = weight
  attrs_names_in_order.each do |attr|
    r['attrs'] ||= {}
    # handle floats
    if attrs[attr] == SPH_ATTR_FLOAT
      uval = response[p, 4].unpack('N*').first; p += 4
      # reinterpret the 32-bit integer as a float
      fval = ([uval].pack('L')).unpack('f*').first
      r['attrs'][attr] = fval
    else
      # handle everything else as unsigned ints
      val = response[p, 4].unpack('N*').first; p += 4
      if (attrs[attr] & SPH_ATTR_MULTI) != 0
        r['attrs'][attr] = []
        nvalues = val
        while nvalues > 0 and p < max
          nvalues -= 1
          val = response[p, 4].unpack('N*').first; p += 4
          r['attrs'][attr] << val
        end
      else
        r['attrs'][attr] = val
      end
    end
  end
  result['matches'] << r
end
result['ids'] = ids
### end ###
Now I can get the list of ids straight from result['ids'] without iterating through the result['matches'] array.
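One caveat when loading models from these ids: a database lookup by a list of ids does not generally return rows in the order the ids were given, so the relevance order from Sphinx can be lost. The following is a minimal sketch of restoring that order. The Album struct and the order_by_ids helper are hypothetical and are not part of the plugin; the sample values are made up.

```ruby
# Hypothetical stand-in for an ActiveRecord model; only `id` matters here.
Album = Struct.new(:id, :name)

# Reorder loaded records to follow the id order returned by Sphinx.
# Ids with no matching record (for example, rows deleted since the last
# index rotation) are skipped.
def order_by_ids(records, ids)
  by_id = {}
  records.each { |record| by_id[record.id] = record }
  ids.map { |id| by_id[id] }.compact
end

# Shape of the hash produced by the patched client.rb (values made up).
result = { 'ids' => [42, 7, 19], 'total' => 3 }

# Pretend the database returned the rows sorted by primary key.
loaded = [Album.new(7, 'B'), Album.new(19, 'C'), Album.new(42, 'A')]

ordered = order_by_ids(loaded, result['ids'])
puts ordered.map(&:id).inspect
# prints: [42, 7, 19]
```

Building a hash keyed by id keeps the reordering linear in the number of records, which is plenty for a page of search results.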
5.2 Configuring Sphinx
Sphinx uses a configuration file to get data source and index information. The Sphinx Client API for Ruby plugin has a sphinx.yml with config_file, root_dir and indexes properties. None of these made sense to me, because I already have the index information in sphinx.conf and Sphinx is installed in a system-wide location. Therefore I refactored the plugin's sphinx.rake tasks.
### start ###
namespace :sphinx do
  desc 'Run indexer for configured indexes'
  task :index do
    config = load_config
    system "indexer --config \"#{config[:config_file]}\" --all"
  end
  desc 'Rotate configured indexes and restart searchd server'
  task :rotate do
    config = load_config
    system "indexer --config \"#{config[:config_file]}\" --rotate --all"
  end
  desc 'Start searchd server'
  task :start do
    config = load_config
    if File.exist?(config[:pid_file])
      puts 'Sphinx searchd server is already started.'
    else
      system "searchd --config \"#{config[:config_file]}\""
      puts 'Sphinx searchd server started.'
    end
  end
  desc 'Stop searchd server'
  task :stop do
    config = load_config
    unless File.exist?(config[:pid_file])
      puts 'Sphinx searchd server is not running.'
    else
      pid = File.read(config[:pid_file]).chomp
      # searchd shuts down on SIGTERM; SIGHUP only triggers index rotation
      Process.kill 'SIGTERM', pid.to_i
      puts 'Sphinx searchd server stopped.'
    end
  end
  desc 'Restart searchd server'
  task :restart => [:stop, :start]
  def load_config
    return @sphinx_config if @sphinx_config
    config_url = "#{RAILS_ROOT}/config/#{RAILS_ENV}_sphinx.conf"
    # fall back to the system-wide config when no per-environment file exists
    config_url = '/etc/sphinx.conf' unless File.exist?(config_url)
    sphinx_config = File.read(config_url) rescue nil
    @sphinx_config = {
      :config_file => config_url
    }
    sphinx_config =~ /searchd\s*\{.*pid_file\s*=\s*(.*?)\n.*\}/m
    @sphinx_config[:pid_file] = $1 || '/var/run/searchd.pid'
    return @sphinx_config
  end
end
### end ###
While refactoring the sphinx.rake tasks, I added the ability to load environment-specific sphinx.conf files. This is needed because we have several environments in Loud Feed: dev, testing (Cruise Control), staging and production. Now we can keep the config files in our source code.
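The two pieces of logic in load_config, picking the per-environment file and reading pid_file from the searchd block, can be tried outside Rails. This is only a sketch: the helper names sphinx_config_path and pid_file_from are hypothetical, the sample config text is made up, and the regular expression is a simplification rather than a full sphinx.conf parser.

```ruby
# Hypothetical helper: build the per-environment config path.
def sphinx_config_path(root, env)
  File.join(root, 'config', "#{env}_sphinx.conf")
end

# Extract pid_file from the searchd block of a sphinx.conf string,
# falling back to a default when the text is nil or the setting is absent.
def pid_file_from(conf_text, default = '/var/run/searchd.pid')
  match = conf_text.to_s.match(/searchd\s*\{.*?pid_file\s*=\s*(\S+)/m)
  match ? match[1] : default
end

# Made-up config fragment for illustration.
sample = <<~CONF
  searchd
  {
    port     = 3312
    pid_file = log/searchd.pid
  }
CONF

puts sphinx_config_path('/var/www/app', 'staging')
# prints: /var/www/app/config/staging_sphinx.conf
puts pid_file_from(sample)
# prints: log/searchd.pid
puts pid_file_from(nil)
# prints: /var/run/searchd.pid
```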
5.3 Sphinx.conf
The example sphinx.conf files that ship with Sphinx, acts_as_sphinx and Sphinx Client API for Ruby are very simple. In Loud Feed, however, we index albums, songs, artists and media, and we have to apply filters to this data to implement the advanced search options.
A real-world sphinx.conf file
### start ###
source albums
{
  type = mysql
  sql_host = localhost
  sql_user = johndoe
  sql_pass = johndoe
  sql_db = some_db
  sql_query = \
    SELECT a.id, a.name AS album_name, a.account_id, art.name AS artist_name, a.original_release_year, s.name AS song_name, mf.format_id \
    FROM albums AS a LEFT JOIN artist_roles AS ar ON ar.album_id = a.id AND ar.position = '1' \
    LEFT JOIN media_formats AS mf ON (mf.owner_id = a.id AND mf.owner_type = 'ALBUM') \
    LEFT JOIN artists AS art ON art.id = ar.artist_id LEFT JOIN songs AS s ON s.album_id = a.id WHERE a.status = 'live'
  sql_attr_uint = account_id
  sql_attr_uint = format_id
  sql_attr_timestamp = original_release_year
}
index albums
{
  source = albums
  path = index/albums
  # morphology
  morphology = stem_en
}
indexer
{
  # memory limit
  mem_limit = 32M
}
searchd
{
  address = 127.0.0.1
  port = 3312
  log = log/searchd.log
  query_log = log/searchd_query.log
  read_timeout = 5
  max_children = 30
  pid_file = log/searchd.pid
  max_matches = 1000
}
### end ###
5.4 Indexing and Starting the Sphinx server.
We can use the following rake tasks to index, start, stop and rotate the Sphinx indexes.
rake sphinx:index
rake sphinx:start
rake sphinx:stop
rake sphinx:rotate
Run rake sphinx:index to create the index, then start the search server by running rake sphinx:start.
5.5 Searching the index for matching records
Now that everything is configured, you can write a functional spec to test the index and the results it returns.
The following code snippet shows how to query the search index and load the models from the database.
def self.search_live(account_id, search_form, limit, offset)
  @sphinx = Sphinx::Client.new
  @sphinx.SetMatchMode(Sphinx::Client::SPH_MATCH_EXTENDED)
  @sphinx.SetSortMode(Sphinx::Client::SPH_SORT_RELEVANCE)
  @sphinx.SetLimits(offset, limit, 1000)
  @sphinx.SetFilter("account_id", [account_id.to_i])
  # set filters
  # set format filter
  unless search_form.format.blank?
    @sphinx.SetFilter("format_id", [search_form.format.to_i])
  end
  # set decade range filter
  unless search_form.decade.blank?
    @sphinx.SetFilterRange("original_release_year", search_form.start_year, search_form.end_year)
  end
  result = @sphinx.Query(build_sphinx_query(search_form), 'albums')
  records = find(result['ids'])
  # expose the total match count on the returned array
  class << records; self end.send(:define_method, 'total') { result['total'] }
  return records
end
def self.build_sphinx_query(search_form)
  string = CGI::unescape(search_form.string)
  query = ""
  if …
    # sorry, had to take this piece of code out
  else
    ….
      # sorry, had to take this piece of code out
    else
      unless search_form.calbum.blank?
        query = " @album_name #{string}"
      end
      unless search_form.csong.blank?
        query += query.blank? ? " @song_name #{string}" : " | @song_name #{string}"
      end
      unless search_form.cartist.blank?
        query += query.blank? ? " @artist_name #{string}" : " | @artist_name #{string}"
      end
    end
  end
  return query
end
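The field-by-field concatenation in build_sphinx_query can also be written as a map and join, which avoids the repeated blank? checks. This is a minimal sketch, not the code we run: the FIELDS table and build_field_query are hypothetical names, and it covers only the final branch shown above, not the parts that were taken out.

```ruby
# Hypothetical map from search-form flags to the field names indexed in sphinx.conf.
FIELDS = {
  calbum:  'album_name',
  csong:   'song_name',
  cartist: 'artist_name'
}.freeze

# Build an extended-mode query that looks for the term in every selected
# field, OR-ing the per-field clauses together with " | ".
def build_field_query(term, selected)
  clauses = FIELDS.select { |flag, _| selected.include?(flag) }
                  .map { |_, field| "@#{field} #{term}" }
  clauses.join(' | ')
end

puts build_field_query('marley', [:calbum, :cartist])
# prints: @album_name marley | @artist_name marley
```

With no fields selected the helper returns an empty string, so the caller still has to decide what an empty query should mean.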
5.6 Updating the search index
To update the search index, we created a cron job that runs every two hours and executes the rake sphinx:rotate task.
6. See Sphinx in action
Visit myreggae, the reggae store powered by Loud Feed, to see Sphinx in action at http://www.myreggae.com
Overall we are very happy with Sphinx performance so far.
