<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Prabode Weebadde &#187; Sphinx</title>
	<atom:link href="http://www.weebadde.com/tag/sphinx/feed" rel="self" type="application/rss+xml" />
	<link>http://www.weebadde.com</link>
	<description>&#34;Oneself is the refuge for one&#34; - Buddha</description>
	<lastBuildDate>Fri, 09 Jul 2010 15:49:09 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<item>
		<title>Indexing with Sphinx Search</title>
		<link>http://www.weebadde.com/indexing-with-sphinx-search.html</link>
		<comments>http://www.weebadde.com/indexing-with-sphinx-search.html#comments</comments>
		<pubDate>Mon, 19 May 2008 19:17:16 +0000</pubDate>
		<dc:creator>prabode</dc:creator>
				<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Rails]]></category>
		<category><![CDATA[Ruby]]></category>
		<category><![CDATA[Sphinx]]></category>

		<guid isPermaLink="false">http://shreeni.weebadde.com/?p=15</guid>
		<description><![CDATA[+1 to Andrew Aksyonoff for giving us the Sphinx search engine. I have used Apache Lucene on several projects over the last four years, and with that experience I started integrating Lucene into our Loud Feed application using ferret and acts_as_ferret. In all that time two issues with Lucene kept bugging me, although I had worked around [...]]]></description>
			<content:encoded><![CDATA[<p>+1 to Andrew Aksyonoff for giving us the <a title="Sphinx" href="http://www.sphinxsearch.com/">Sphinx search engine</a>. I have used Apache Lucene on several projects over the last four years, and with that experience I started integrating Lucene into our Loud Feed application using ferret and acts_as_ferret. In all that time two issues with Lucene kept bugging me, although I had worked around them: first, the time it takes to build the index, and second, Lucene indexes corrupted by concurrent access. After getting acts_as_ferret integrated, I started indexing our records. While indexing 65,000 records in my dev environment, I ran out of patience after waiting more than an hour for the search index to be created, so I started looking around for better solutions and stumbled upon Sphinx.</p>
<p>Unlike Lucene, the Sphinx search engine does not support real-time updates out of the box, although the Sphinx documentation describes a solution for them. For the Loud Feed application, we decided to rotate the search index every two hours, which is more acceptable than a real-time Lucene index that sometimes has to be rebuilt because of corrupted data.</p>
<p>OK, enough talk; here is how we integrated Sphinx into Loud Feed.</p>
<p>1. Get the Sphinx source</p>
<p>curl http://www.sphinxsearch.com/downloads/sphinx-0.9.8-rc2.tar.gz -o sphinx-0.9.8-rc2.tar.gz<br />
OR<br />
wget http://www.sphinxsearch.com/downloads/sphinx-0.9.8-rc2.tar.gz</p>
<p>2. Configure and install</p>
<p>./configure<br />
make<br />
sudo make install</p>
<p>3. Check that it is installed correctly by running:</p>
<p>$ indexer</p>
<p>You should see the usage notes. If not, refer to the Sphinx documentation.</p>
<p>4. Integrate Sphinx with Rails application</p>
<p>I was looking for a Ruby/Rails plugin rather than reinventing the wheel, and found two: <a title="acts_as_sphinx-plugin" href="http://www.datanoise.com/articles/2007/3/23/acts_as_sphinx-plugin">acts_as_sphinx</a> and <a title="SCAR" href="http://kpumuk.info/projects/ror-plugins/sphinx/#documentation">Sphinx Client API for Ruby</a> (SCAFR). They are very similar, except that acts_as_sphinx adds methods to load models from the database using the ids returned from the search index. However, acts_as_sphinx hides advanced features of the Sphinx engine, such as the different match modes and search filtering. So I chose SCAFR, because it has the flexibility to configure the Sphinx query to meet Loud Feed&#8217;s requirements.</p>
<p>5. Integrating Sphinx Client API for Ruby (Sphinx Client API 0.4.0-r1112)</p>
<p>curl http://kpumuk.info/files/rails/plugins/sphinx-0.4.0-r1112.zip -o sphinx-0.4.0-r1112.zip<br />
OR<br />
wget http://kpumuk.info/files/rails/plugins/sphinx-0.4.0-r1112.zip</p>
<p>Unzip the file and move the folder to your rails plugins.</p>
<p>5.1 Refactoring client.rb &#8211; to return ids in an array for easy loading.</p>
<p>While reviewing the code, I found there is no straightforward way to get the matching doc ids, so I added one to client.rb (between lines 689 and 734) as follows.</p>
<p>### start ###</p>
<p><span style="color: #993300;">id64 = response[p, 4].unpack('N*').first; p += 4<br />
# add ids into an array<br />
ids = []<br />
# read matches<br />
result['matches'] = []<br />
while count &gt; 0 and p &lt; max<br />
count -= 1<br />
if id64 != 0<br />
dochi, doclo, weight = response[p, 12].unpack('N*N*N*'); p += 12<br />
# parentheses are required here: + binds tighter than &lt;&lt; in Ruby<br />
doc = (dochi &lt;&lt; 32) + doclo<br />
else<br />
doc, weight = response[p, 8].unpack('N*N*'); p += 8<br />
end</span></p>
<p><span style="color: #993300;">r = {} # This is a single result put in the result['matches'] array<br />
r['id'] = doc<br />
ids &lt;&lt; doc<br />
r['weight'] = weight<br />
attrs_names_in_order.each do |attr|<br />
r['attrs'] ||= {}</span></p>
<p><span style="color: #993300;"># handle floats<br />
if attrs[attr] == SPH_ATTR_FLOAT<br />
uval = response[p, 4].unpack('N*').first; p += 4<br />
# reinterpret the 32-bit integer as a float; unpack needs a format string<br />
fval = ([uval].pack('L')).unpack('f*').first<br />
r['attrs'][attr] = fval<br />
else<br />
# handle everything else as unsigned ints<br />
val = response[p, 4].unpack('N*').first; p += 4<br />
if (attrs[attr] &amp; SPH_ATTR_MULTI) != 0<br />
r['attrs'][attr] = []<br />
nvalues = val<br />
while nvalues &gt; 0 and p &lt; max<br />
nvalues -= 1<br />
val = response[p, 4].unpack('N*').first; p += 4<br />
r['attrs'][attr] &lt;&lt; val<br />
end<br />
else<br />
r['attrs'][attr] = val<br />
end<br />
end<br />
end<br />
result['matches'] &lt;&lt; r<br />
end<br />
result['ids'] = ids</span></p>
<p>### end ###</p>
<p>Now I can get the list of ids straight from result['ids'] without iterating through the result['matches'] array.</p>
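<p>To make the idea concrete, here is a minimal, stand-alone sketch of the same loop. Note that parse_matches is a hypothetical helper and not part of the plugin, and it assumes a simplified response in which every match is a 64-bit doc id (two big-endian 32-bit words) followed by a weight, with no attributes.</p>
<pre>
```ruby
# Hypothetical helper (not part of the plugin): reads `count` matches from a
# binary string. Assumption: each match is a 64-bit doc id stored as two
# big-endian 32-bit words, followed by a 32-bit weight, and no attributes.
def parse_matches(response, count)
  result = {}
  result['matches'] = []
  result['ids'] = []
  pos = 0
  count.times do
    dochi, doclo, weight = response[pos, 12].unpack('NNN')
    pos += 12
    # Combine the high and low words. Multiplying by 2**32 is the same as
    # shifting the high word left by 32 bits.
    doc = dochi * (2**32) + doclo
    match = {}
    match['id'] = doc
    match['weight'] = weight
    result['matches'].push(match)
    result['ids'].push(doc)   # collected once, while parsing
  end
  result
end

# Two fake matches: id 7 with weight 100, and id 2**32 + 5 with weight 50.
response = [0, 7, 100, 1, 5, 50].pack('N*')
result = parse_matches(response, 2)
puts result['ids'].inspect
# prints [7, 4294967301]
```
</pre>
<p>The ids array carries the same values as mapping over result['matches'], so callers that only need ids can skip that step.</p>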
<p>5.2 Configuring Sphinx</p>
<p>Sphinx uses a configuration file to get data source and index information. The Sphinx Client API for Ruby plugin has a sphinx.yml with config_file, root_dir and indexes properties. None of these made sense for us, because the index information is already in sphinx.conf and Sphinx is installed in a system-wide location. Therefore I refactored the plugin&#8217;s sphinx.rake tasks.<br />
### start ###<br />
<span style="color: #993300;"> namespace :sphinx do<br />
desc 'Run indexer for configured indexes'<br />
task :index do<br />
config = load_config</span></p>
<p><span style="color: #993300;">system "indexer --config \"#{config[:config_file]}\" --all"<br />
end</span></p>
<p><span style="color: #993300;">desc 'Rotate configured indexes and restart searchd server'<br />
task :rotate do<br />
config = load_config<br />
system "indexer --config \"#{config[:config_file]}\" --rotate --all"<br />
end</span></p>
<p><span style="color: #993300;">desc 'Start searchd server'<br />
task :start do<br />
config = load_config<br />
if File.exist?(config[:pid_file])<br />
puts 'Sphinx searchd server is already started.'<br />
else<br />
system "searchd --config \"#{config[:config_file]}\""<br />
puts 'Sphinx searchd server started.'<br />
end<br />
end</span></p>
<p><span style="color: #993300;">desc 'Stop searchd server'<br />
task :stop do<br />
config = load_config<br />
unless File.exist?(config[:pid_file])<br />
puts 'Sphinx searchd server is not running.'<br />
else<br />
pid = File.read(config[:pid_file]).chomp<br />
# searchd shuts down on SIGTERM; SIGHUP only makes it rotate indexes<br />
Process.kill('TERM', pid.to_i)<br />
puts 'Sphinx searchd server stopped.'<br />
end<br />
end</span></p>
<p><span style="color: #993300;">desc 'Restart searchd server'<br />
task :restart =&gt; [:stop, :start]</span></p>
<p><span style="color: #993300;">def load_config<br />
return @sphinx_config if @sphinx_config<br />
config_url = "#{RAILS_ROOT}/config/#{RAILS_ENV}_sphinx.conf"<br />
# fall back to the system-wide config when there is no environment-specific file<br />
config_url = '/etc/sphinx.conf' unless File.exist?(config_url)<br />
sphinx_config = File.read(config_url) rescue ''<br />
@sphinx_config = {<br />
:config_file =&gt; config_url<br />
}<br />
sphinx_config =~ /searchd\s*{.*pid_file\s*=\s*(.*?)\n.*}/m<br />
@sphinx_config[:pid_file] = $1 || '/var/run/searchd.pid'<br />
return @sphinx_config<br />
end<br />
end</span></p>
<p>### end ###</p>
<p>While refactoring the sphinx.rake tasks, I also added the ability to load environment-specific sphinx.conf files. We need this because Loud Feed has several environments: dev, testing (Cruise Control), staging and production. Now we can keep the config files in our source code.</p>
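<p>The lookup behind load_config can be sketched in isolation as follows. The helper names sphinx_config_path and pid_file_from are hypothetical (they do not exist in the plugin), and the fallback paths are simply the defaults used in the rake file above.</p>
<pre>
```ruby
# Hypothetical helpers sketching the lookup done in load_config.

# Returns the environment-specific config file if it exists,
# otherwise the system-wide fallback.
def sphinx_config_path(root, env, fallback = '/etc/sphinx.conf')
  candidate = File.join(root, 'config', "#{env}_sphinx.conf")
  File.exist?(candidate) ? candidate : fallback
end

# Pulls the pid_file setting out of the searchd section of a config string.
# Returns the default when the text has no such setting.
def pid_file_from(conf_text, default = '/var/run/searchd.pid')
  match = conf_text.match(/searchd\s*\{.*?pid_file\s*=\s*(\S+)/m)
  match ? match[1] : default
end

conf = "searchd\n{\n  port = 3312\n  pid_file = log/searchd.pid\n}\n"
puts pid_file_from(conf)
puts sphinx_config_path('/no/such/rails/root', 'staging')
```
</pre>
<p>With one file per environment (for example config/staging_sphinx.conf), each environment can point at its own database and index paths while sharing the same rake tasks.</p>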
<p>5.3 Sphinx.conf</p>
<p>If you look at the example sphinx.conf files in Sphinx, acts_as_sphinx and Sphinx Client API for Ruby, they are very simple. In Loud Feed, however, we index albums, songs, artists and media, and we have to apply filters to this data to implement advanced search options.</p>
<p>A real-world sphinx.conf file:</p>
<p>### start ###<br />
<span style="color: #993300;"> source albums<br />
{<br />
type                =  mysql<br />
sql_host            = localhost<br />
sql_user            = johndoe<br />
sql_pass            = johndoe<br />
sql_db              = some_db</span></p>
<p><span style="color: #993300;">sql_query           = \<br />
SELECT a.id, a.name AS album_name, a.account_id, art.name AS artist_name, a.original_release_year, s.name AS song_name, mf.format_id \<br />
FROM albums AS a LEFT JOIN artist_roles AS ar ON ar.album_id = a.id AND ar.position = '1' \<br />
LEFT JOIN media_formats AS mf ON (mf.owner_id = a.id AND mf.owner_type = 'ALBUM') \<br />
LEFT JOIN artists AS art ON art.id = ar.artist_id LEFT JOIN songs AS s ON s.album_id = a.id WHERE a.status = 'live'<br />
sql_attr_uint = account_id<br />
sql_attr_uint = format_id<br />
sql_attr_timestamp  = original_release_year<br />
}</span></p>
<p><span style="color: #993300;">index albums<br />
{<br />
source          = albums<br />
path            = index/albums<br />
# morphology<br />
morphology          = stem_en<br />
}</span></p>
<p><span style="color: #993300;">indexer<br />
{<br />
# memory limit<br />
mem_limit           = 32M<br />
}</span></p>
<p><span style="color: #993300;">searchd<br />
{<br />
address             = 127.0.0.1<br />
port                = 3312<br />
log                 = log/searchd.log<br />
query_log           = log/searchd_query.log<br />
read_timeout        = 5<br />
max_children        = 30<br />
pid_file            = log/searchd.pid<br />
max_matches         = 1000<br />
}<br />
</span><br />
### end ###</p>
<p>5.4 Indexing and Starting the Sphinx server.</p>
<p>We can use the following rake tasks to index, start, stop and rotate the Sphinx indexes.</p>
<p>rake sphinx:index<br />
rake sphinx:start<br />
rake sphinx:stop<br />
rake sphinx:rotate</p>
<p>Run rake sphinx:index to create the index, then start the search server by running rake sphinx:start.</p>
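<p>The index and rotate tasks run the same indexer command and differ only in the --rotate flag. As a small sketch, the command line could be built like this; indexer_command is a hypothetical helper that is not in the plugin, and quoting the path with Shellwords is my own assumption rather than what the rake file does.</p>
<pre>
```ruby
require 'shellwords'

# Hypothetical helper: builds the indexer command line used by the rake tasks.
# --config, --all and --rotate are the flags that appear in sphinx.rake above.
def indexer_command(config_file, rotate = false)
  parts = ['indexer', '--config', Shellwords.escape(config_file)]
  parts.push('--rotate') if rotate   # ask a running searchd to swap in the new index
  parts.push('--all')                # rebuild every index in the config file
  parts.join(' ')
end

puts indexer_command('config/production_sphinx.conf')
puts indexer_command('config/production_sphinx.conf', true)
```
</pre>
<p>Escaping the path keeps the command intact if the config file lives under a directory with spaces in its name.</p>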
<p>5.5 Searching Index for matching records.</p>
<p>OK, now that everything is configured, you can write a functional spec to test the index and the results it returns.</p>
<p>The following code snippet shows how to call the search index and load the models from the database.</p>
<p><span style="color: #993300;"> def self.search_live(account_id, search_form, limit, offset)<br />
@sphinx = Sphinx::Client.new<br />
@sphinx.SetMatchMode(Sphinx::Client::SPH_MATCH_EXTENDED)<br />
@sphinx.SetSortMode(Sphinx::Client::SPH_SORT_RELEVANCE)<br />
@sphinx.SetLimits(offset, limit, 1000)<br />
@sphinx.SetFilter("account_id", [account_id.to_i])<br />
#set filters<br />
#set format filter<br />
unless search_form.format.blank?<br />
@sphinx.SetFilter("format_id", [search_form.format.to_i])<br />
end<br />
#set decade range filter<br />
unless search_form.decade.blank?<br />
@sphinx.SetFilterRange("original_release_year", search_form.start_year, search_form.end_year)<br />
end<br />
result = []<br />
result = @sphinx.Query(build_sphinx_query(search_form), 'albums')<br />
records = find(result['ids'])</span></p>
<p><span style="color: #993300;">class &lt;&lt; records; self end.send(:define_method, 'total') {result['total']}<br />
return records<br />
end</span></p>
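<p>One thing to keep in mind with search_live: depending on the Rails version and the database, find with a list of ids may not return the records in the order Sphinx ranked them. If relevance order matters, the loaded records can be put back in the order of result['ids']. The sketch below uses a hypothetical Album struct as a stand-in for the real model, and in_sphinx_order is an illustrative helper, not something from the plugin or from our code.</p>
<pre>
```ruby
# Stand-in for an ActiveRecord model; this Album struct is hypothetical.
Album = Struct.new(:id, :name)

# Reorders loaded records to follow the order of the ids returned by Sphinx,
# dropping ids that no longer have a matching record.
def in_sphinx_order(records, ids)
  by_id = {}
  records.each { |record| by_id[record.id] = record }
  ids.map { |id| by_id[id] }.compact
end

ids = [42, 7, 19]   # as returned in result['ids'], best match first
loaded = [Album.new(7, 'B'), Album.new(19, 'C'), Album.new(42, 'A')]
puts in_sphinx_order(loaded, ids).map { |album| album.name }.join(', ')
# prints A, B, C
```
</pre>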
<p><span style="color: #993300;">def self.build_sphinx_query(search_form)<br />
string = CGI::unescape(search_form.string)<br />
query = ""<br />
if &#8230;<br />
#sry had to take this piece of code out<br />
else<br />
&#8230;.<br />
#sry had to take this piece of code out<br />
else<br />
unless search_form.calbum.blank?<br />
query = " @album_name #{string}"<br />
end<br />
unless search_form.csong.blank?<br />
query += query.blank? ? " @song_name #{string}" : " | @song_name #{string}"<br />
end<br />
unless search_form.cartist.blank?<br />
query += query.blank? ? " @artist_name #{string}" : " | @artist_name #{string}"<br />
end<br />
end<br />
end<br />
return query</span></p>
<p><span style="color: #993300;"> end</span></p>
<p>5.6 Updating the Search Index</p>
<p>To update the search index, we created a cron job that runs every two hours and executes the rake sphinx:rotate task.</p>
<p>6. See Sphinx in action</p>
<p>Visit myreggae, the reggae store powered by Loud Feed to see Sphinx in action, at <a title="Powered by Loud Feed" href="http://www.myreggae.com">http://www.myreggae.com</a></p>
<p>Overall we are very happy with Sphinx performance so far <img src='http://www.weebadde.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
]]></content:encoded>
			<wfw:commentRss>http://www.weebadde.com/indexing-with-sphinx-search.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
