December 14, 2009

An Awesome Drupal Search Experience With ApacheSolr


We recently implemented a complex search solution for a client that wanted content from all of their various web properties (built on a multitude of different technologies) to show up in the search results on their main Drupal site. We chose to use Apache Solr.

Why Apache Solr?

Mainly for reasons relating to the Solr search engine technology itself, such as its speed, faceting and spelling suggestion capabilities, but also because of its excellent integration with Drupal in the form of the ApacheSolr Integration module.

Many of the external sites whose content needed to be pulled in were built entirely in Flash, others were built on the ASP.NET platform, others again used proprietary or custom CMS technology. The content also comprised many PDF documents, which would need parsing and indexing. While it is possible to push content to the same Solr index from different sources, with content as disparate as a Flash game, a Drupal node, a PDF, and a product on an eCommerce site, there's no common basis on which to index it all in such a way that things like faceting will work across all of it. The ApacheSolr Drupal module is very node-centric: facets have to do with things like node type, taxonomy terms and other Drupal concepts that are meaningless when it comes to content from other sources. It was clear that in order to be able to leverage the tools that Solr had to offer, we would need every page to be at least minimally represented by a Drupal node in the system.

Challenges and Problems

One of the challenges we faced was the question of how to parse and index content that wasn't even on a CMS of any kind. So, the two basic problems were:

  1. How to crawl and parse all of the client's web pages.

  2. How to import the pertinent details of this content into Drupal as nodes.

Crawling the HTML sites was achieved using a tool called Nutch. Nutch, which is built on top of the Lucene search engine technology, is actually a crawler, parser, indexer and search engine rolled into one. It even has a plugin for pushing content to a Solr index. However, the detailed parsing we needed to do on the content required something more powerful, so in the end we used Nutch only as a crawler to follow all the links on the HTML-based sites and create a list of URLs.

A custom Drupal module then had the task of parsing all of the content at these URLs (with the exception of the PDFs, which we left for Nutch to do) using a tool called QueryPath, for which there is a Drupal module (because the developer who built QueryPath is the awesome Drupal developer Matt Butcher). The module imported everything as lightweight Drupal nodes, categorised according to information the parser had gleaned from each page.

The Flash sites could not be crawled, as there were no HTML links for Nutch to follow. So the lists of URLs to parse had to come from WebTrends data, and the only parsable information was in fact the JavaScript tracker code, which told us what type of content was on the page and, if it was a game, for example, the name of the game. Again, QueryPath was a great help in parsing this information.
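To give a rough idea of the parsing step described above, here is a minimal sketch. It is not the module we built: it uses PHP's built-in DOMDocument as a stand-in for QueryPath so that it runs without any extra libraries, and the function name example_parse_page() and the tracker variable names contentType and contentName are invented for illustration, since the real tracker code is not shown in this post.

```php
<?php
/**
 * Extracts a page title and (hypothetical) tracker values from raw HTML.
 *
 * DOMDocument stands in for QueryPath here. The tracker variable names
 * "contentType" and "contentName" are assumptions, not the real tracker code.
 */
function example_parse_page($html) {
  $info = array('title' => '', 'type' => '', 'name' => '');

  // DOMDocument::loadHTML() rejects empty input, so bail out early.
  if (trim($html) === '') {
    return $info;
  }

  $doc = new DOMDocument();
  // Suppress warnings caused by real-world, non-well-formed markup.
  @$doc->loadHTML($html);

  // The first <title> element becomes the title of the lightweight node.
  $titles = $doc->getElementsByTagName('title');
  if ($titles->length > 0) {
    $info['title'] = trim($titles->item(0)->textContent);
  }

  // Scan inline scripts for assignments such as: contentType = "game";
  foreach ($doc->getElementsByTagName('script') as $script) {
    $js = $script->textContent;
    if (preg_match('/contentType\s*=\s*["\']([^"\']+)["\']/', $js, $m)) {
      $info['type'] = $m[1];
    }
    if (preg_match('/contentName\s*=\s*["\']([^"\']+)["\']/', $js, $m)) {
      $info['name'] = $m[1];
    }
  }

  return $info;
}
```

The returned array holds roughly the details a lightweight node needs: a title, plus the values used to categorise it. In a real module the HTML would be fetched from each URL in the crawl list before being passed to a function like this.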

Once all the content was in Drupal, it could be indexed using the Drupal Solr schema, which is what maps the Drupal-specific information in a node to fields in the Solr index, to be used for faceting etc. We did implement some of the ApacheSolr module's hooks in our own custom module to make some slight changes to what got indexed. Normally, the titles of search results link to their node views. However, on this client's site, the lightweight nodes we had imported, whose sole purpose was to show up in search results, needed to link to external URLs, i.e. the URLs of the pages they were parsed from. So we needed to add an external_url CCK field, which would be indexed and then made available in the search result tpl file. Similarly, in order to display a thumbnail for each result, the file path from a thumbnail CCK imagefield needed to be added to the index.

Hooks

The following are the hooks we implemented to make custom CCK fields available in search results:

/**
 * Implementation of hook_apachesolr_update_index().
 */
function custom_search_results_apachesolr_update_index(&$document, $node) {
  $document->ss_cck_field_image = $node->field_image[0]['filepath'];
  $document->ss_cck_field_page_url = $node->field_content_url[0]['display_url'];
}

/**
 * Implementation of hook_apachesolr_modify_query().
 */
function custom_search_results_apachesolr_modify_query(&$query, &$params) {
  $params['fl'] .= ',ss_cck_field_page_url,ss_cck_field_image';
}

/**
 * Implementation of hook_apachesolr_process_results().
 */
function custom_search_results_apachesolr_process_results(&$results) {
  foreach ($results as &$result) {
    $result['external_url'] = $result['node']->ss_cck_field_page_url;
    $result['imagefield'] = $result['node']->ss_cck_field_image;
  }
}

The first hook tells Solr how to index our two new fields. Although the fields are not actually defined in the schema.xml file, there is a dynamic field definition in there that knows how to deal with any field name beginning with "ss_". The second hook adds the two fields to the "fl" (field list) parameter, which ensures that when a query is sent to Solr, it will send back these field values in the search results. The third hook copies the values onto each result so that we can easily access them in our search-result.tpl.php file.
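As an illustration of how the template side might use these values, here is a small sketch. The helper name example_result_link() is hypothetical and not part of the ApacheSolr module. It assumes each result array carries the 'external_url' key set by the third hook above, alongside the 'link' key that Drupal search results normally have.

```php
<?php
/**
 * Picks the URL that a search result title should link to.
 *
 * Hypothetical helper: prefers the external URL added by the
 * process_results hook and falls back to the result's normal link.
 */
function example_result_link(array $result) {
  // Imported lightweight nodes carry the URL of the page they were parsed from.
  if (!empty($result['external_url'])) {
    return $result['external_url'];
  }
  // Ordinary Drupal content keeps linking to its node view.
  return isset($result['link']) ? $result['link'] : '';
}
```

In the template, the title link would then point at example_result_link($result) rather than always at the node view, so ordinary Drupal content behaves as before while imported pages send the user to the original site.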

The End Result

All of the above make for quite a powerful search experience for the user on our client's corporate site. On entering a search term, the user is delivered very fast, relevant results (or spelling suggestions as the case may be) by Solr. The results, representing content from all over the web (and indeed taking them all over the web, should they click on external results) are displayed with thumbnails where available and can be refined based on content type and several vocabularies. There is also a Surprise Me! button (like Google's I'm Feeling Lucky) and blocks of 'My Recent Searches', 'Related Searches' and 'Most Popular Searches', which were all pretty easy to implement on the back of the awesome ApacheSolr module.