Indexing with Sphinx Search

May 19th, 2008

+1 to Andrew Aksyonoff for giving us the Sphinx search engine. I have used Apache Lucene on several projects over the last four years, and with that experience I started integrating Lucene into our Loud Feed application using ferret and acts_as_ferret. All these years two issues with Lucene had been bugging me, although I had worked around them: first, the time it takes to build the index, and second, index corruption caused by concurrent access. After getting acts_as_ferret integrated, I started indexing the records. While indexing 65,000 records in my dev environment, I ran out of patience after waiting more than an hour for the search index to be created, so I started looking around for better solutions and stumbled upon Sphinx. Unlike Lucene, the Sphinx search engine does not support real-time updates out of the box, although the Sphinx documentation describes a solution for them. For the Loud Feed application, we decided to rotate the search index every two hours, which is more acceptable than having a real-time Lucene index that would sometimes require rebuilding due to corrupted data.

OK, enough talk; I will show you how we integrated Sphinx into Loud Feed.

1. Get the Sphinx source

curl http://www.sphinxsearch.com/downloads/sphinx-0.9.8-rc2.tar.gz -o sphinx-0.9.8-rc2.tar.gz
OR
wget http://www.sphinxsearch.com/downloads/sphinx-0.9.8-rc2.tar.gz

2. Configure and install

tar xzf sphinx-0.9.8-rc2.tar.gz
cd sphinx-0.9.8-rc2
./configure
make
sudo make install

3. Check whether it installed correctly by typing:

$ indexer

You should get the usage notes. If not, refer to the Sphinx documentation.

4. Integrate Sphinx with Rails application

I was looking for a Ruby/Rails plugin rather than reinventing the wheel. I found two: acts_as_sphinx and Sphinx Client API for Ruby (SCAFR). They are very similar, except that acts_as_sphinx has acts methods that load models from the database using the ids returned from the search index. But acts_as_sphinx hides the advanced features of the Sphinx engine, such as the different match modes and search filtering. So I chose SCAFR, because it has the flexibility to configure the Sphinx query to meet Loud Feed's requirements.

5. Integrating Sphinx Client API for Ruby (Sphinx Client API 0.4.0-r1112)

curl http://kpumuk.info/files/rails/plugins/sphinx-0.4.0-r1112.zip -o sphinx-0.4.0-r1112.zip
OR
wget http://kpumuk.info/files/rails/plugins/sphinx-0.4.0-r1112.zip

Unzip the file and move the folder into your Rails plugins folder (vendor/plugins).

5.1 Refactoring client.rb to return ids in an array for easy loading

While reviewing the code, I found there was no straightforward way to get the matching doc ids, so I added one to client.rb (between lines 689 and 734) as follows.

### start ###

id64 = response[p, 4].unpack('N*').first; p += 4
# collect the matching document ids in an array
ids = []
# read matches
result['matches'] = []
while count > 0 and p < max
  count -= 1
  if id64 != 0
    dochi, doclo, weight = response[p, 12].unpack('N*N*N*'); p += 12
    # parentheses are needed: + binds tighter than << in Ruby
    doc = (dochi << 32) + doclo
  else
    doc, weight = response[p, 8].unpack('N*N*'); p += 8
  end

  r = {} # a single result, stored in the result['matches'] array
  r['id'] = doc
  ids << doc
  r['weight'] = weight
  attrs_names_in_order.each do |attr|
    r['attrs'] ||= {}

    if attrs[attr] == SPH_ATTR_FLOAT
      # handle floats: reinterpret the 32-bit value as a native float
      uval = response[p, 4].unpack('N*').first; p += 4
      fval = ([uval].pack('L')).unpack('f*').first
      r['attrs'][attr] = fval
    else
      # handle everything else as unsigned ints
      val = response[p, 4].unpack('N*').first; p += 4
      if (attrs[attr] & SPH_ATTR_MULTI) != 0
        r['attrs'][attr] = []
        nvalues = val
        while nvalues > 0 and p < max
          nvalues -= 1
          val = response[p, 4].unpack('N*').first; p += 4
          r['attrs'][attr] << val
        end
      else
        r['attrs'][attr] = val
      end
    end
  end

  result['matches'] << r
end

result['ids'] = ids

### end ###
Now I can get the list of ids straight from result['ids'] without iterating through the result['matches'] array.
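
The 64-bit id step in the patch is easy to get wrong, because in Ruby `+` binds tighter than `<<`, so the high word has to be shifted inside parentheses. The following is a minimal sketch of that step in isolation. The helper names `decode_doc_id` and `ids_from_matches` are made up for illustration and are not part of the plugin.

```ruby
# Hypothetical helpers, not part of the Sphinx Client API plugin.

# Decodes a 64-bit document id sent as two big-endian 32-bit words.
def decode_doc_id(bytes)
  dochi, doclo = bytes.unpack('N2')
  # Without the parentheses, `dochi << 32 + doclo` would be parsed as
  # `dochi << (32 + doclo)`, which gives a wrong id.
  (dochi << 32) + doclo
end

# What a caller would have to do without result['ids']:
# walk the matches array and pick out each id.
def ids_from_matches(matches)
  matches.map { |match| match['id'] }
end

packed = [1, 2].pack('N2')     # high word 1, low word 2
puts decode_doc_id(packed)     # 4294967298, i.e. 2**32 + 2
```

With the patch in place the second helper is unnecessary, since the ids are collected while the response is being parsed.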

5.2 Configuring Sphinx

Sphinx uses a configuration file to get data source and index information. The Sphinx Client API for Ruby plugin has a sphinx.yml with config_file, root_dir and indexes properties. None of these made sense to me, because I have the index information in sphinx.conf and Sphinx is installed in a system-wide location. Therefore I refactored the plugin's sphinx.rake tasks.
### start ###
namespace :sphinx do
  desc 'Run indexer for configured indexes'
  task :index do
    config = load_config
    system "indexer --config \"#{config[:config_file]}\" --all"
  end

  desc 'Rotate configured indexes and restart searchd server'
  task :rotate do
    config = load_config
    system "indexer --config \"#{config[:config_file]}\" --rotate --all"
  end

  desc 'Start searchd server'
  task :start do
    config = load_config
    if File.exist?(config[:pid_file])
      puts 'Sphinx searchd server is already started.'
    else
      system "searchd --config \"#{config[:config_file]}\""
      puts 'Sphinx searchd server started.'
    end
  end

  desc 'Stop searchd server'
  task :stop do
    config = load_config
    unless File.exist?(config[:pid_file])
      puts 'Sphinx searchd server is not running.'
    else
      pid = File.read(config[:pid_file]).chomp
      # SIGTERM asks searchd to shut down; SIGHUP would only trigger index rotation
      Process.kill 'SIGTERM', pid.to_i
      puts 'Sphinx searchd server stopped.'
    end
  end

  desc 'Restart searchd server'
  task :restart => [:stop, :start]

  def load_config
    return @sphinx_config if @sphinx_config
    # environment-specific config file, e.g. config/production_sphinx.conf
    config_url = "#{RAILS_ROOT}/config/#{RAILS_ENV}_sphinx.conf"
    sphinx_config = File.read(config_url) rescue nil
    @sphinx_config = {
      # fall back to the system-wide file when the environment file is unreadable
      :config_file => sphinx_config ? config_url : '/etc/sphinx.conf'
    }
    sphinx_config =~ /searchd\s*\{.*pid_file\s*=\s*(.*?)\n.*\}/m
    @sphinx_config[:pid_file] = $1 || '/var/run/searchd.pid'
    return @sphinx_config
  end
end

### end ###

While refactoring the sphinx.rake tasks, I added the ability to load environment-specific sphinx.conf files. This is needed because we have several environments in Loud Feed: dev, testing (Cruise Control), staging and production. Now we can keep the config files in our source code.
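
The per-environment lookup can be tried outside of rake. This is only a sketch under assumptions: the helper names `sphinx_config_path` and `pid_file_from` are hypothetical, and the file layout (config/<environment>_sphinx.conf with /etc/sphinx.conf as the fallback) follows the rake tasks in this post.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical helpers; the names are illustrative, not from the plugin.

# Returns the environment-specific config file if it exists,
# otherwise the system-wide fallback.
def sphinx_config_path(root, env, fallback = '/etc/sphinx.conf')
  candidate = File.join(root, 'config', "#{env}_sphinx.conf")
  File.exist?(candidate) ? candidate : fallback
end

# Pulls the pid_file setting out of the searchd section of a config text.
def pid_file_from(config_text, default = '/var/run/searchd.pid')
  match = config_text.to_s.match(/searchd\s*\{.*?pid_file\s*=\s*(\S+)/m)
  match ? match[1] : default
end

Dir.mktmpdir do |root|
  FileUtils.mkdir_p(File.join(root, 'config'))
  path = File.join(root, 'config', 'staging_sphinx.conf')
  File.write(path, "searchd\n{\n  pid_file = log/searchd.pid\n}\n")

  puts sphinx_config_path(root, 'staging')     # the staging file created above
  puts sphinx_config_path(root, 'production')  # /etc/sphinx.conf (no such file here)
  puts pid_file_from(File.read(path))          # log/searchd.pid
end
```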

5.3 Sphinx.conf

The example sphinx.conf files that ship with Sphinx, acts_as_sphinx and Sphinx Client API for Ruby are very simple. In Loud Feed, however, we index albums, songs, artists and media, and we have to apply filters to this data to implement advanced search options.

A real world sphinx.conf file

### start ###
source albums
{
type = mysql
sql_host = localhost
sql_user = johndoe
sql_pass = johndoe
sql_db = some_db

sql_query = \
SELECT a.id, a.name AS album_name, a.account_id, art.name AS artist_name, a.original_release_year, s.name AS song_name, mf.format_id \
FROM albums AS a LEFT JOIN artist_roles AS ar ON ar.album_id = a.id AND ar.position = '1' \
LEFT JOIN media_formats AS mf ON (mf.owner_id = a.id AND mf.owner_type = 'ALBUM') \
LEFT JOIN artists AS art ON art.id = ar.artist_id LEFT JOIN songs AS s ON s.album_id = a.id WHERE a.status = 'live'
sql_attr_uint = account_id
sql_attr_uint = format_id
sql_attr_timestamp = original_release_year
}

index albums
{
source = albums
path = index/albums
# morphology
morphology = stem_en
}

indexer
{
# memory limit
mem_limit = 32M
}

searchd
{
address = 127.0.0.1
port = 3312
log = log/searchd.log
query_log = log/searchd_query.log
read_timeout = 5
max_children = 30
pid_file = log/searchd.pid
max_matches = 1000
}

### end ###

5.4 Indexing and Starting the Sphinx server.

We can use the following rake tasks to index, start, stop and rotate the Sphinx indexes.

rake sphinx:index
rake sphinx:start
rake sphinx:stop
rake sphinx:rotate

Run rake sphinx:index to create the index, then start the search server by running rake sphinx:start.

5.5 Searching the index for matching records

Now that everything is configured, you can write a functional spec to test the index and the results it returns.

The following code snippet shows how to call the search index and load the models from the database.

def self.search_live(account_id, search_form, limit, offset)
  @sphinx = Sphinx::Client.new
  @sphinx.SetMatchMode(Sphinx::Client::SPH_MATCH_EXTENDED)
  @sphinx.SetSortMode(Sphinx::Client::SPH_SORT_RELEVANCE)
  @sphinx.SetLimits(offset, limit, 1000)
  # set filters
  @sphinx.SetFilter("account_id", [account_id.to_i])
  # set format filter
  unless search_form.format.blank?
    @sphinx.SetFilter("format_id", [search_form.format.to_i])
  end
  # set decade range filter
  unless search_form.decade.blank?
    @sphinx.SetFilterRange("original_release_year", search_form.start_year, search_form.end_year)
  end
  result = @sphinx.Query(build_sphinx_query(search_form), 'albums')
  records = find(result['ids'])

  # expose the total number of matches on the returned array
  class << records; self end.send(:define_method, 'total') { result['total'] }
  return records
end

def self.build_sphinx_query(search_form)
  string = CGI::unescape(search_form.string)
  query = ""
  if …
    # sorry, had to take this piece of code out
  else
    ….
      # sorry, had to take this piece of code out
    else
      unless search_form.calbum.blank?
        query = " @album_name #{string}"
      end
      unless search_form.csong.blank?
        query += query.blank? ? " @song_name #{string}" : " | @song_name #{string}"
      end
      unless search_form.cartist.blank?
        query += query.blank? ? " @artist_name #{string}" : " | @artist_name #{string}"
      end
    end
  end
  return query
end
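
The part of build_sphinx_query that is still visible joins one field-scoped clause per ticked checkbox with the OR operator of the extended match mode. Here is a compact, standalone sketch of the same idea. The `FIELD_FLAGS` table and the `field_query` helper are hypothetical names, and a plain hash stands in for the search form; the removed branches are not reconstructed.

```ruby
# Hypothetical sketch; FIELD_FLAGS and field_query are illustrative names.
# Maps the form's checkbox attributes to the indexed field names.
FIELD_FLAGS = {
  calbum:  'album_name',
  csong:   'song_name',
  cartist: 'artist_name'
}.freeze

# Builds "@field term | @field term ..." for every flag that is set.
def field_query(string, flags)
  fields = FIELD_FLAGS.select { |flag, _| !flags[flag].to_s.strip.empty? }.values
  fields.map { |field| "@#{field} #{string}" }.join(' | ')
end

puts field_query('marley', calbum: '1', cartist: '1')
# @album_name marley | @artist_name marley
```

Keeping the flag-to-field mapping in one table means a new searchable field only needs one extra entry rather than another unless block.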
5.6 Updating the search index

To update the search index, we created a cron job that runs every two hours and executes the rake sphinx:rotate task.

6. See Sphinx in action

Visit myreggae, the reggae store powered by Loud Feed to see Sphinx in action, at http://www.myreggae.com

Overall we are very happy with Sphinx performance so far :)

i18nrb

February 12th, 2008

i18nrb - project home
http://i18nrb.rubyforge.org/

This is a simple i18n plugin for the Rails framework. I put it together based on my experience with i18n implementations in J2EE applications.

The plugin uses one YAML property file per language, named after the pattern “application_resources.yml”. For example:

Spanish application resources file => application_resources_es.yml
German application resources file => application_resources_de.yml
The default application resources file is the English file => application_resources.yml

The messages are grouped based on the application component and each group contains a key: value pair for each message that we expect our application to present, in the locale appropriate for the requesting browser.
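
To make the file layout concrete, here is a small sketch of how grouped messages could be looked up with a fallback to the default English file. This is not the i18nrb API: the helper names `resources_file` and `message`, the `login` group and its keys are all invented for illustration, and the YAML is inlined instead of being read from the files.

```ruby
require 'yaml'

# Hypothetical sketch; none of these names come from i18nrb itself.

# File name for a locale, following the naming pattern described above.
def resources_file(locale = nil)
  return 'application_resources.yml' if locale.nil? || locale.to_s == 'en'
  "application_resources_#{locale}.yml"
end

# Looks up group/key in the locale's messages, falling back to the default file.
def message(resources, fallback, group, key)
  resources.dig(group, key) || fallback.dig(group, key)
end

english = YAML.load(<<~YML)
  login:
    title: Sign in
    submit: Continue
YML

spanish = YAML.load(<<~YML)
  login:
    title: Entrar
YML

puts resources_file('es')                          # application_resources_es.yml
puts message(spanish, english, 'login', 'title')   # Entrar
puts message(spanish, english, 'login', 'submit')  # Continue (falls back to English)
```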


Spring Framework

November 10th, 2007

The Spring framework is the most complete open source J2EE framework to date. It makes the business tier less of a hassle to implement and eliminates the need to use EJBs for your business objects. It comes with a built-in MVC web framework and lets you integrate the ORM solution of your choice. You can also use Spring only in the middle tier, together with another web MVC framework such as Struts or WebWork. I recommend using the built-in Spring MVC web tier components because they are easy to use and very flexible to configure. For the ORM, I recommend Hibernate, because Spring addresses many Hibernate integration issues.

I started using Spring in February 2005. Since then Spring has moved from version 1.1.4 to 2.5. Before Spring, I used a similar home-made framework based on the Front Controller and Application Service patterns in a couple of my big projects. But Spring gave me a complete framework and helped me concentrate more on the business aspects of the applications I have to develop.

I have used Spring in three major applications we built for a “Big Three” auto company. We integrated the Acegi, CAS (Single Sign On), Apache Lucene and EhCache frameworks seamlessly with Spring.

To find out more about the Spring framework check out this article by the founder of Spring, Rod Johnson.

More Information:
Spring - http://www.springframework.org/
Hibernate - http://www.hibernate.org/
Lucene - http://lucene.apache.org/java/docs/index.html
EhCache - http://ehcache.sourceforge.net/
CAS - http://www.ja-sig.org/products/cas/
Acegi - http://www.acegisecurity.org/

JSTOR OAI-PMH Project

April 8th, 2004

Michael Krot and David Yakimischak of JSTOR presented the OAI-PMH application at the CERN Workshop on Innovations in Scholarly Communication: Implementing the Benefits of OAI (OAI3), held at CERN (Geneva, Switzerland) and organized by E-LIS.

Object Insight Inc. was brought in as a consultant to guide the development team on the development process, the architecture blueprints and the implementation of the application. We enjoyed working with Michael and the JSTOR team.

JSTOR OAI-PMH Presentation