Sphinx for search

In November or December, I started looking into other full-text search options that would be suitable for searching our forums (1.1 million posts) and projects (430,000 of them, with lots of important associations). At the time, I was using Ferret (a port of Apache Lucene) and acts_as_ferret for all of our search needs, and I wasn’t happy with how things were going with large search indexes. Indexing took a lot of time and a lot of power, unnoticed gaps would occur in the index (due to slow reindexing while live), search speed for large, frequently updated indexes was relatively slow, and sorting and filtering inside the search engine was not as simple as I had hoped.

I gave Sphinx a try and I was quickly hooked. Search is a very important part of our site and I think that Sphinx is going to be a great help.

The good

  • Really fast indexing.
  • Simple and sensible configuration – just hook it up to your database and write a single SQL statement per document type that joins in all of the data that you are interested in searching on. Once that is done, you only have to define the columns that you want to use for filtering, sorting, and grouping.
  • Fast filtering and sorting by user-defined attributes
  • Grouping – this is really handy. You can have your search results clustered by some common attribute and then sort/filter on top of that. Saves the usual waste of bringing back tons of results and then making the database do the work.
  • Pagination that works with both of the above.
  • Support for boolean, phrase, field, and proximity searches
  • Handy text-related features – case folding, encoding and international character handling, HTML filtering, stemming
  • Runs as a standalone service which makes it easier to work with all around (easier deployment, scaling, monitoring, use from various languages)
  • APIs for PHP, Python, Java, Ruby, Perl, C++
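
To make the grouping and pagination points above a bit more concrete, here is a minimal pure-Python sketch of the idea: collapse matches by a shared attribute, keep the best match per group, sort the groups, and only then cut out a page. It does not talk to Sphinx at all. The sample data and the names `group_best` and `paginate` are hypothetical, and the real engine does this work inside the search service rather than in application code.

```python
# Hypothetical matches: (post_id, topic_id, relevance_weight).
# All values are invented for illustration.
matches = [
    (101, 7, 0.91),
    (102, 7, 0.75),
    (103, 9, 0.88),
    (104, 3, 0.60),
    (105, 9, 0.95),
    (106, 5, 0.40),
]


def group_best(matches):
    """Keep the highest-weighted match per topic and count each group's size."""
    groups = {}
    for post_id, topic_id, weight in matches:
        best = groups.get(topic_id)
        count = best["count"] + 1 if best else 1
        if best is None or weight > best["weight"]:
            # New best match for this topic; carry the running count over.
            groups[topic_id] = {
                "post_id": post_id,
                "topic_id": topic_id,
                "weight": weight,
                "count": count,
            }
        else:
            best["count"] = count
    # Sort the groups themselves, best match first.
    return sorted(groups.values(), key=lambda g: g["weight"], reverse=True)


def paginate(items, page, per_page):
    """Return one page of items; pages are numbered from 1."""
    offset = (page - 1) * per_page
    return items[offset:offset + per_page]


grouped = group_best(matches)
for group in paginate(grouped, page=1, per_page=2):
    print(group["topic_id"], group["post_id"], group["count"])
```

The point of the sketch is the order of operations: because grouping and sorting happen before the page is cut, each page holds distinct groups instead of several posts from the same topic.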

The bad

  • No fuzzy or wildcard searches. We rely on autocomplete boxes and fuzzy suggestions a lot because we need to allow users to easily link to yarns and patterns that they’ve used in projects without forcing them to do a normal search. However, Sphinx does offer prefix indexing and soundex matching, which may allow me to approximate some of what I want. I haven’t tried those options out yet…
  • No real-time updating. I wish I could tell Sphinx “document #19392 in the ‘forum’ index has been updated”, but I can’t. At the moment, I am getting around this missing feature by using two indexes for every search. I have a single “main” index that is very large and updated daily and a small “delta” index that is updated every 3 minutes and contains records that have changed since the last nightly run. This is certainly good enough for forum posts and the other things that I am searching with Sphinx.

The developer is working on both of these areas, which is great because I’d love to use Sphinx for everything.
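
The main-plus-delta workaround described above can be sketched in a few lines of pure Python. This is only an illustration of the idea, under stated assumptions: the two "indexes" are plain dictionaries, the sample posts are invented, and `search` and `search_main_and_delta` are hypothetical helpers rather than Sphinx calls. In particular, dropping stale main entries for any document that appears in the delta is a choice made in this sketch; how Sphinx itself resolves a document that exists in both indexes is not covered here.

```python
# "main": large snapshot built by the nightly run (doc_id -> text).
main_index = {
    1: "i love cake",
    2: "knitting socks",
    3: "cake recipes",
}

# "delta": small index of records changed or added since the nightly run.
delta_index = {
    2: "knitting socks and eating cake",  # edited post
    3: "bread recipes",                   # edited post, no longer mentions cake
    4: "new post about cake",             # brand new post
}


def search(index, term):
    """Return {doc_id: text} for documents whose text contains the term."""
    term = term.lower()
    return {doc_id: text for doc_id, text in index.items() if term in text.lower()}


def search_main_and_delta(main, delta, term):
    """Search both indexes, letting the fresher delta copy win."""
    hits = search(main, term)
    # Anything present in delta has changed since main was built,
    # so its main entry is stale and is dropped (sketch-level assumption).
    for doc_id in delta:
        hits.pop(doc_id, None)
    hits.update(search(delta, term))
    return hits


results = search_main_and_delta(main_index, delta_index, "cake")
for doc_id in sorted(results):
    print(doc_id, results[doc_id])
```

The delta stays small because it only covers changes since the last full run, which is what makes rebuilding it every few minutes cheap.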

Snippet

Just a little example of a real indexing and search run. The database query itself is probably a bottleneck here:

./indexer --rotate forum_posts

Sphinx 0.9.8-dev (r909)
Copyright (c) 2001-2007, Andrew Aksyonoff

using config file '/opt/sphinx/etc/sphinx.conf'...
indexing index 'forum_posts'...
collected 1126764 docs, 332.8 MB
sorted 63.6 Mhits, 100.0% done
total 1126764 docs, 332788731 bytes
total 218.109 sec, 1525791.18 bytes/sec, 5166.06 docs/sec
rotating indices: succesfully sent SIGHUP to searchd (pid=18897).
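
As a quick sanity check on the totals reported by the indexer, the throughput figures can be recomputed from the document count, byte count, and elapsed time in the output above. This is just arithmetic on the logged numbers; the variable names are my own, and because the logged time is rounded to milliseconds, the recomputed bytes/sec may differ from the logged value in the last digits.

```python
# Figures copied from the indexer output above.
docs = 1126764
total_bytes = 332788731
seconds = 218.109

docs_per_sec = docs / seconds
bytes_per_sec = total_bytes / seconds
minutes = seconds / 60

print(f"{docs_per_sec:.2f} docs/sec")
print(f"{bytes_per_sec / 1e6:.2f} MB/sec")
print(f"{minutes:.1f} minutes for a full rebuild")
```

In other words, a full rebuild of all 1.1 million forum posts takes a little over three and a half minutes on this setup.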

And a command line search…

./search --index forum_posts --phrase "i love cake" -l 1

Sphinx 0.9.8-dev (r909)
Copyright (c) 2001-2007, Andrew Aksyonoff

using config file '/opt/sphinx/etc/sphinx.conf'...
index 'forum_posts': query 'i love cake ': returned 23 matches of 23 total in 0.043 sec

displaying matches:
(snipped)

words:
1. 'i': 820193 documents, 2963422 hits
2. 'love': 109871 documents, 131778 hits
3. 'cake': 7191 documents, 9629 hits

I guess that’s it! If you’ve got search needs, you may want to give Sphinx a try… and don’t forget to take a look at the API for your favorite language.

Comment (1)

  1. I also looked at ferret (and we use Lucene) and have to say that it seems to be about as good, out of the box, as Alta Vista was in 1996, which ain’t saying much. It seems like a great deal of messy configuration wrapped around a neat but overly abstracted interface.

    I’ll definitely check out Sphinx.

    Wednesday, January 9, 2008 at 8:08 pm #