Blog Posts

Why a project switched from Google Search Appliance to Zend_Lucene

Google technology does a good job when searching the wild and treacherous realms of the public internet. However, the commercial Google Search Appliance (GSA), sold for searching intranet websites, did not convince me at all. For a client, we first integrated the GSA; later we reimplemented search with Zend_Lucene. Here are some thoughts comparing the two search solutions.

This post became rather lengthy. If you just want the summary of the pros and cons of GSA versus Lucene, scroll right to the end :-)

In a project we took over, the customer had already bought a GSA (the "cheap" one - only about $20'000). The client had a list of wishes for how to optimally integrate the appliance into their web sites:

  • Limit access to authorized users
  • Index all content in all languages
  • Filter content by target group (information present as META tags in the HTML head)
  • Show a box with results from their employee directory

GSA Software

The GSA caused problems with most of those requests.

When you activate access protection, the GSA makes a HEAD request for roughly the first 20 search results of every single search request, to check whether that user has the right to see each document. Since our site has no individual visibility requirements, we did not need that. But there is no way to deactivate this check, which results in unnecessary load on the web server. We ended up catching the GSA HEAD requests quite early and just sending a Not Modified response without looking into the request any further (see the sketch below).
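A minimal sketch of such a short-circuit, assuming the GSA announces itself with a User-Agent containing "gsa-crawler" (check what your appliance actually sends) and that this runs before any expensive page rendering:

if ($_SERVER['REQUEST_METHOD'] === 'HEAD'
    && isset($_SERVER['HTTP_USER_AGENT'])
    && strpos($_SERVER['HTTP_USER_AGENT'], 'gsa-crawler') !== false) {
    // Answer the authorization check cheaply and stop before rendering anything.
    header('HTTP/1.1 304 Not Modified');
    exit;
}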

The GSA completely ignores the language declaration (whether in a META tag, in the lang attribute or elsewhere in the HTML head) and uses its own heuristics. This might be fine for the public internet, where you can assume that many sites declare their content to be in the server installation language even if it is not - but in a controlled environment we can make sure those headers are correct. We talked to Google support about this, but they could only confirm that it is not possible. This was annoying, as the heuristics got it wrong, for example when some part of a page's content was in another language.

The spider component stumbled over some bugs in the web site we needed to index. We found that the same parameter got repeated over and over in a URL. Those cycles led to the same page being indexed many times and the limit of 500'000 indexed pages filling up. This is of course a bug in the web site, but we found no way to help the GSA not to stumble over it.

Filtering by meta information would work. But we have binary documents like PDF, Word and so on, and there was no way to set the meta information for those documents. The query parameter requiredfields=gsahintview:group1|-gsahintview should trigger a filter saying: either the meta information is present with a specific value, or there is no such meta tag at all. However, Google confirmed that this combination of filter expressions is not possible. They updated their documentation to at least explain the restrictions.

The only thing that really worked without hassle was the result box for the employee directory. You can configure the GSA to request data from the web server and integrate the returned XML fragment into the search result page.

Support by Google was a very positive aspect. They answered fast and without fuss, and were motivated to help. They seemed competent - so I guess that when they did not propose alternatives but simply said there is no such feature, there really was no alternative for our feature requests.

GSA Hardware

The Google hardware, however, was a real nuisance. You get the appliance as a standard-sized server to put into the rack. Having the hardware locally makes sense: it does not use external bandwidth for indexing, and you can be more confident about your confidential data. But during the two years we used the GSA, there were three hardware failures. As part of the setup test, our hoster checks whether a system works properly by unplugging it completely. While this is of course not good for the data, the hardware should survive it. The GSA did not and had to be sent in for repair. There were two more hardware issues - one was simply a RAM module signaling an error. But as the hoster is not allowed to open the box, even such a simple repair took quite a while. Our client did not want to buy more than one appliance, as they are rather expensive, so you usually do not have a replacement ready. With any other server, the hoster can fix the system rather quickly or, in the worst case, just reinstall it from backups. With the GSA there is no such redundancy.

The GSA is not only closed at the hardware level. You also do not have shell access to the system, so all configuration has to be done in the web interface. Versioning that information is only possible by exporting and, if needed, re-importing the complete configuration. I like to have all relevant configuration in version control for better tracking.

Zend Lucene

The GSA license runs for two years. After that period, another twenty-something thousand dollars has to be paid if you want to keep support. At that point, we discussed the situation with our client and decided to invest a bit more than the license fee and move to an environment where we have more control and redundancy. The new search uses the Zend_Lucene component to build its indexes. As everything is PHP here, the indexer uses the web site framework itself to render the pages and build the indexes.

  • We run separate instances of the process for each web site and each language, each building one index. In the beginning we had one script to build all indexes, but a PHP script running for over 24 hours was not very reliable - and we wanted to use the power of the multi-core machine, as each PHP instance is single-threaded and analyzing text with Lucene is rather CPU-intensive.
  • We did not want to touch existing code that changes content, so as not to risk breaking normal operations in case something goes wrong with Lucene. Every hour, a cron job looks for new or changed documents and updates the index. Every weekend, all indexes are rebuilt and - after a sanity check - replace the old indexes. Deleting content does not trigger Lucene either; until the index is rebuilt, the result page generation simply ignores result items that no longer exist in the database.
  • For binary documents, we use Linux programs to convert the file into plain text that is then analyzed by Lucene (see the code below) - except for docx and friends (the new XML formats of Microsoft Office 2007), which are supported natively:
    • .msg, .txt, .html: cat
    • .doc, .dot: antiword (worked better than catdoc)
    • .rtf: catdoc
    • .ppt: catppt (part of catdoc package)
    • .pdf: pdftotext (part of xpdf)
    • We ignore zip files, although PHP would allow opening them.
  • All kinds of meta information can be specified during indexing. This solves the language issue: as the database knows the language of each document, even binary documents are indexed in the correct language (see the indexing sketch after this list).
  • The indexes are copied to each server (opening them on the shared NFS file server is not possible, as Zend_Lucene wants to lock the files and NFS does not support that). This provides redundancy in case a server crashes, and the integration test server can run its own copy and index the test database.
  • We were able to fine-tune the ranking relevance based on the type and age of the content.
  • To improve matching of similar words, we use stemming filters. We chose php-stemmer and are quite happy with it.
  • If we run into performance problems, we could switch to the Java Lucene for handling search requests, as the binary index format is compatible between Zend_Lucene and Java Lucene.
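To illustrate the indexing side, here is a simplified sketch of adding one document with its language as meta information to a Zend_Lucene index. The field names, the index path and the variables are made up for this example; the php-stemmer integration and error handling are left out, and the Zend Framework autoloader is assumed to be set up:

// Analyze text as UTF-8, case-insensitively, for indexing and searching alike.
Zend_Search_Lucene_Analysis_Analyzer::setDefault(
    new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num_CaseInsensitive());

// One index per web site and language, e.g. /var/search/site1-de (hypothetical path).
$index = Zend_Search_Lucene::create('/var/search/site1-de');

$doc = new Zend_Search_Lucene_Document();
// The language comes from the database, not from heuristics.
$doc->addField(Zend_Search_Lucene_Field::keyword('lang', 'de'));
$doc->addField(Zend_Search_Lucene_Field::text('title', $title, 'utf-8'));
// Stored for the result page, but not searchable.
$doc->addField(Zend_Search_Lucene_Field::unIndexed('url', $url));
// The rendered page or the converted binary document; searchable but not stored.
$doc->addField(Zend_Search_Lucene_Field::unStored('contents', $plainText, 'utf-8'));

$index->addDocument($doc);
$index->commit();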

Indexing about 50'000 documents takes about a full day, running parallel scripts that keep the CPU cores pretty busy. But our web servers are bored over the weekend anyway. If this became an issue, we could buy a separate server for searching, as you have in the case of the GSA. The hardware of that server would probably be more reliable and could be fixed by our hoster.
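One simple way to launch such parallel indexer runs from a single entry point could look like the following sketch; the script name and the site and language lists are invented for the example:

$sites = array('site1', 'site2');
$langs = array('de', 'fr', 'en');
foreach ($sites as $site) {
    foreach ($langs as $lang) {
        // Redirecting output and appending '&' lets exec() return immediately,
        // so one single-threaded PHP process runs per site/language pair.
        $cmd = sprintf('php build_index.php %s %s > /dev/null 2>&1 &',
                       escapeshellarg($site), escapeshellarg($lang));
        exec($cmd);
    }
}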

The resulting indexes are only a couple of megabytes. So even though Zend_Lucene has to load the index for each search request, it is quite fast: loading the index takes about 50ms of the request time. I assume the file system cache keeps the files in memory.
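For the search side, opening the local index copy and querying it looks roughly like this simplified sketch (index path and field names again made up, matching the indexing example above):

// Open the local copy of the index, never the NFS share (locking issue).
$index = Zend_Search_Lucene::open('/var/search/site1-de');

// Parse the user input as UTF-8 and run the query.
$query = Zend_Search_Lucene_Search_QueryParser::parse($userInput, 'utf-8');
$hits = $index->find($query);

foreach ($hits as $hit) {
    // Stored fields like 'url' are available on the hit; the content preview
    // is fetched from the database or file system, as it is not stored in the index.
    echo $hit->score . ' ' . $hit->url . "\n";
}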

Zend_Lucene worked out quite well for us, although today I would probably use Apache Solr to save some work, especially for reading documents and for stemming.

Code fragment for reading binary files as plain text:

// Shell commands to convert each supported file type to plain text.
$map = array('msg'  => 'cat %filename% 2>/dev/null',
             'txt'  => 'cat %filename% 2>/dev/null',
             'html' => 'cat %filename% 2>/dev/null',
             'doc'  => 'antiword %filename% 2>/dev/null',
             'dot'  => 'antiword %filename% 2>/dev/null',
             'rtf'  => 'catdoc %filename% 2>/dev/null',
             'ppt'  => 'catppt %filename% 2>/dev/null',
             'pdf'  => 'pdftotext -enc UTF-8 %filename% - 2>/dev/null', // the "-" tells pdftotext to write to stdout
);

if (! file_exists($filename))
    throw new Exception("File does not exist: '$filename'");

// Pick the converter based on the file extension.
$type = pathinfo($filename, PATHINFO_EXTENSION);
if (! isset($map[$type]))
    throw new Exception("Unsupported document type: '$type'");

$filename = escapeshellarg($filename);
$cmd = str_replace('%filename%', $filename, $map[$type]);
$output = array(); $status = 0;
exec($cmd, $output, $status);
if ($status != 0)
    throw new Exception("Converting $filename: exit status $status");

// Return the converted document as one plain-text string.
return implode("\n", $output);

Conclusions

Google Search Appliance

Pro:
+ Reputation with the client and acceptance by users, as it is a known brand
+ Good ranking algorithms for text, including stemming
+ Responsive and helpful support

Con:
- Closed "black box" system
- You are not allowed to fix the hardware yourself
- No redundancy unless you buy several boxes
- Missing options to tailor it to our use case (respect HTML language information, how pages are requested, filter flexibility)
- Significant price tag for the license, plus still quite some work to customize the GSA and adapt your own systems

Zend_Lucene

Pro:
+ Very flexible: we could build exactly what we needed
+ The problematic web site framework causes fewer problems, as we iterate over content lists instead of parsing URLs to spider the site
+ Well documented, and there is a community (though we have little experience with it, as we did not have questions)
+ No arbitrary limit on the number of pages in an index
+ Has proved reliable for two years now
+ If performance ever becomes an issue, we can switch to Java Lucene while keeping the PHP indexer

Con:
- In-depth programming needed
- Thus a higher risk of bugs
- More to build by hand than with the GSA - but for us, that was still less than the license costs plus the customization of the web system to play well with the GSA



Comments [23]

Jonathan Nieto, 13.01.2011 16:42 CET

Hey David,

It was a long but indeed pleasure reading.
In recent years Open Source has become a true alternative to Closed Source in the enterprise field. This post is another proof of that.

Thanks for sharing!

Ollietb, 13.01.2011 18:46 CET

Very interesting article. I made a symfony plugin which uses Zend Lucene to index web pages and pdfs, but I only touched on some of the functionality you describe here.
Here's my plugin http://www.symfony-project.org/plugins/epifonyCrawlerPlugin
Thanks for sharing your experience.

gggeek, 13.01.2011 19:32 CET

Indeed what works very well for the wide internet might not be the best solution for a smaller intranet. This is something that most people have to experiment to believe.

Slightly offtopic: I have been very frustrated with the google apps interface, that insists so much on doing away with the nested folders of documents in favor of only the 'search' box. But when the documents you have range in the thousands, and can be easily classified because of their nature, the standard 'browsing through folders' interface is good enough.

Searching for a given technical keyword will e.g. return results from project tech specs, audit results and client proposals alike, i.e. a lot of useless stuff. But I generally know what kind of document I am searching for, or can remember more or less the date when a given document was created or modified, and will be quite effective at drilling down.

Not being able to permanently save a preferred sorting method for a given folder, abysmal support for sharing with groups of people, and the inability to get a complete list of all newly shared documents made the whole experience worse than the age-old shared-webdav-folder method.

David Fishman, 13.01.2011 20:50 CET

In the end, if you're building search, it's an app, and there's a limit to what kind of app you can build with closed source sheathed in tin. But I'd be interested in others who have made the tradeoff between Zend_Lucene and Solr. Naturally, I'd favor the latter, but I must say that I do often hear "if I had to do it again, I would have started with Solr".

tss, 14.01.2011 02:47 CET

Here is another comparison between Lucene based SearchBlox and Google Mini - http://www.searchblox.com/comparison-of-searchblox-vs-google-mini

EllisGL, 14.01.2011 07:06 CET

Did you ever look at Sphinx Search?

david, 14.01.2011 11:29 CET

thanks for the feedback!

@Ollietb: i could share some of the code if you want to improve your symfony plugin. its not hyper clean and not symfony unfortunately, so quite some work would be required to make use of it.

other liipers have used solr and liked it, but i unfortunately have no real experience with it, neither with sphinx.

the searchblox page seems a bit biased to me. we also had a google mini, the others are way more expensive. i am sure a google fan could find features present in the google search appliance that are not present in searchblox. but its a good overview of pricing and restrictions in the google appliance. i did not find anything outside the searchblox site comparing their product with solr...

Nicolas BUI, 14.01.2011 16:09 CET

If you need more performance, then I would suggest Apache Solr, as it's Lucene's little brother with better performance and more features.

Luca, 15.01.2011 11:35 CET

Very interesting article. My initial experience with Zend_Lucene was really positive, but after indexing a large amount of documents we had some errors due to the number of open files (maybe it was a problem of my Linux box; the index was close to 500 Mb). After that error, we switched to SOLR with a JSON interface to the search GUI (the front end is still Zend Framework). Did you have the same problem with a huge number of documents? Was Zend_Lucene able to manage that size of index? How is the response time (with SOLR we went from a few seconds to a few milliseconds)?

david, 21.01.2011 16:55 CET

@luca: sorry for taking so long to reply. we have indexed hundreds of files and many thousand database records. the largest index file is 63M. but we do not store all content in the index. for preview in the result, we talk to the db/file system.
we never encountered the open documents issue. that sounds like you might not have closed file handles during indexing - or really a server overload issue not related to lucene.
the response time for instantiating and querying our index is about 50-100ms which is not blasting fast, but in our case good enough.

anyways, i think solr is probably the easiest way to go, at least if you can have apache on your server.

Luca, 21.01.2011 17:56 CET

SOLR does not need Apache but Tomcat or a similar application server.

david, 21.01.2011 18:49 CET

argh, friday evening. what i wanted to say was "if you can have JAVA on your server". sorry.

Lukas Kahwe Smith, 22.01.2011 00:23 CET

We are using Solr extensively at Liip already, but I guess the team on this project didn't want to introduce a dependency on Java.

However I think for people just starting out with Lucene based indexing, ElasticSearch seems like the better starting point, since it's much easier to set up. That being said, in terms of features it has pulled past Solr pretty quickly too, so I think it's only a matter of time until we start using ElasticSearch and maybe even drop Solr entirely. They seem to have serious issues maintaining their original speed after merging their subversion repo with Lucene.

Luca, 22.01.2011 12:59 CET

I will have a look at ElasticSearch. Maybe I will give it a try soon, as we need a multi-language index. How do you manage UTF8 in PHP 5.3? I have several problems handling the crawled data. E.g. str_replace is not UTF8-safe, strlen the same. And so on.

Lukas Kahwe Smith, 22.01.2011 13:05 CET

Well, for now ext/mbstring is the best approach for this. However it's a bit tedious to manage code that is supposed to work with and without it. I generally recommend for that case to just write the code requiring mbstring and add a compat script that defines the functions in case the extension isn't installed.

Adell Mirganer, 31.01.2011 21:07 CET

Interesting reading. We also bought the GSA product and, unlike your experience, we have been amazed. But I think it is because we have 2.5 million documents and 6 languages including Hebrew and Russian to index, and nothing else on the market was able to really scale. You had only 50'000 documents, so maybe for this case the GSA is not the best option. But I think that as soon as things become serious, you will not find anything as simple as the GSA.

Also: we involved a company to do all the integration. This is a complex part. Either you know how to do it or you do not.

Anyway. take care and thanks for sharing

Lukas Kahwe Smith, 31.01.2011 21:14 CET

Well, for http://infocube.ch/de we had about that amount of data (actually there are lots more documents that are indexed but not exposed yet, afaik). We used Solr and I do not remember spending a single day tweaking performance, though we did spend some time getting the sorting and phrase searches just right.

MaxxCAT Enterprise Search, 20.07.2011 20:14 CET

If you decide to give hardware another chance, you may want to check out one of our search appliances...we at MaxxCAT offer dedicated hardware that outperforms Google and costs thousands less. You can get more info at http://www.maxxcat.com/search-appliance-specs.html

Santiago de la Cruz de los Santos, 13.08.2011 15:33 CET

Hello friend, your article is interesting.

I work with the GSA and I have had no problems; it is well documented and I have made good progress. There was no need to call the provider. We use feeds, OneBox, filtering by meta tags, internationalization, query expansion, KeyMatch, etc. We have obtained excellent results.

The GSA:
- It's a very interesting tool.
- Adaptable.
- Very powerful.
- Something we have to take into account: the GSA is not a database. It is a machine that does things differently.
- Something I cannot deny: it is a black box, and it is not unlimited. You have to work within those limits.

Santiago de la Cruz de los Santos.
ujat55@yahoo.com.mx
DF, Mexico.
My comment, Greetings ....

David Ortega, 03.08.2013 13:29 CET

I came across your article while looking for more information about data management for my company. A lot of great info on here. I went to check out the Google Mini and MaxxCAT options... still debating.

Michal Kotnowski, 05.12.2013 14:39 CET

For everyone reading this (as it has become quite popular among search results): mind you, the article is more than 2 years old now and refers to version 5.0 of the Google Search Appliance, whereas version 7.0 is now available and most of the issues reported are no longer valid, especially the hardware one (it is now a Dell server, not a custom design). Of course Solr is also being developed further, but for anyone looking for peace of mind and real Search as a Service, the GSA is a match.

david, 18.12.2013 09:31 CET

hi michal, yes, you are right with this remark.

Matt Hall, 08.05.2014 11:21 CET

Does anyone have any information like this but for version 7+?
