I recently got this email from a reader on my site and since I haven't posted for a while, I thought it might be a good discussion:
I've been reading your site and had a question that you might have good insight for.
I'm working on a high-traffic website that includes forums. All the code is custom PHP, including the forums, simply because the PHP BB software out there doesn't scale as large and as well as I wanted. I've implemented heavy memcached usage, distributed databases, etc. to handle any level of growth we may hit.
Everything is working great, and I'm to the point where we are going to implement a search engine for forum posts. I'm afraid that applying full-text searching to the forum posts is going to cause MySQL to get very... upset... if/when we have 12 million or more posts in the forums (very possible).
In your opinion, would a properly constructed natural language search system against the MyISAM engine produce better results than trying to write our own keywords search tables on the backend? I've googled to no avail - I find TONS of tutorials on using LIKE and % for "search engines" but nothing that really discusses high volume/high traffic like I'm hoping to deal with.
This question comes up some at Yahoo. Personally, I don't spend a lot of time thinking about search engine-like indexing of our content, because we've got a lot of internally available technology that solves this problem for us in ways that doesn't use MySQL. There are still, however, people trying to solve this problem with MySQL, though it's usually to a smaller degree.
My answer to your question, then, tends to gravitate towards that architecture: Traditional SQL based databases tend not to make good search engines, so use something else to index your site for relevance based searching.
Certainly the full text indexing for the MyISAM engine is a pretty good stab at implementing this. I remember using this technology a good 7-8 years ago and I don't have the impression that it's changed much in that time. I could absolutely mistaken about this, and my apologies to any hard-working developers that I might be discrediting.
The downsides to full text indexing in MyISAM are hopefully obvious:
- It locks you into MyISAM
- Overhead of updating the index for writes
- You can really only query what's in your db. So this doesn't index like a traditional search engine: gathering the entire content of a page and indexing the whole thing (static content in addition to whatever the db spits out onto the page).
I've seen some open source projects implement search on top of MySQL, but without using full-text indexing: drupal is one of them. This allows you to use your database as your index storage, but doesn't tie you down to MyISAM and full-text indexes (or to MySQL itself, for that matter). The disadvantage of this would be more load on your databases, obviously.
If I were trying to solve your problem directly, however, I'd probably really start looking at what's out there in the open source world that can do this job for me. 12 years ago (wow, was it that long ago?) on my first programming job in college, I used an open source search engine that I think was called glimpse. A quick search reveals webglimpse, so I think that's right.
I have to believe that in the last 12 years there are even more products in this space that handle this requirement set very nicely. While adopting yet another system to your architecture has its own costs (more hardware, more learning curves, etc.), that may be offset by using a tool more directly designed to solve your problem.
Is there some storage engine out there that does this for you? Probably, but I'm not one to necessarily jump all over that unless I feel it really meets the requirements for the product and doesn't compromise my engineering intuition. Beware of golden hammers. :)
Trackback URL for this post: