Configuring Solr to provide search suggestions

I needed to provide search term suggestions based on characters that the user has typed into the search box. Doing this is pretty easy with Solr, an open source enterprise search platform from the Apache project, written in Java and built on Lucene.

If you're using a version prior to 4.8, this can be accomplished using the SpellCheckComponent. See this document for details.

As of 4.8 a new component is available, the solr.SuggestComponent. This post will go through the steps to configure an index to provide search suggestions using this component. In my case I created a separate index to handle this, but it could be combined into an existing index such as sitecore_web_index (or any other custom indexes you may be using), depending on what your needs are.

Define the schema for the index:

In order to create smaller documents I trimmed the fields down to the bare minimum. This is done in schema.xml.

[code language="xml"]
<fields>
  <field name="_content" type="text_general" indexed="true" stored="false" />
  <field name="_database" type="string" indexed="true" stored="true" />
  <field name="_uniqueid" type="string" indexed="true" stored="true" required="true" />
  <field name="_name" type="text_general" indexed="true" stored="true" />
  <field name="_indexname" type="string" indexed="true" stored="true" />
  <field name="_version" type="string" indexed="true" stored="true" />
  <field name="_version_" type="long" indexed="true" stored="true" />
</fields>
[/code]

Then I added two fields that will be used by the suggester: one to store the suggestion text and another to store the weight of that suggestion. The suggestion field should be a text type and the weight field should be a float type. Both need to be stored in the index. In this case these fields get their values from corresponding fields in our Sitecore instance. These fields can be added to documents based on your specific indexing strategy.

[code language="xml"]
<field name="term" type="text_general" indexed="true" stored="true" />
<field name="weight" type="float" indexed="true" stored="true" />
[/code]

Define a custom field type for the suggest component:

Next we need to add a new type that the suggester will use to analyze and build the suggestion fields. This particular type replaces all non-alphanumeric characters with spaces, tokenizes the contents of the field on whitespace and lowercases the tokens, which makes matching case-insensitive. This is not strictly necessary; existing types may be used. Again, this is done in schema.xml.

[code language="xml"]
<types>
  ...
  <fieldType name="suggestType" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[^a-zA-Z0-9]" replacement=" " />
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
  ...
</types>
[/code]
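To get a feel for what this analysis chain does to a suggestion before it reaches the suggester, here is a rough Python approximation. It is only a sketch: the function name `suggest_type_tokens` is made up for illustration, and real Solr analysis happens inside Lucene, not in client code.

```python
import re


def suggest_type_tokens(text):
    """Approximate the suggestType analyzer (illustrative only, not Solr code)."""
    # PatternReplaceCharFilterFactory: each non-alphanumeric character becomes a space.
    cleaned = re.sub(r"[^a-zA-Z0-9]", " ", text)
    # WhitespaceTokenizerFactory: split on runs of whitespace.
    # LowerCaseFilterFactory: lowercase every token.
    return [token.lower() for token in cleaned.split()]


print(suggest_type_tokens("Wi-Fi Router (5GHz)"))
```

Note that the pattern only keeps ASCII letters and digits, so accented characters are replaced by spaces as well. If your suggestions contain them, you may want a different pattern.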

Define the suggest component for the index:

Now that we have the schema set up, we need to define a searchComponent that will do the suggesting. This is done in solrconfig.xml.

Add the following to the <config> node:

[code language="xml"]
<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">fuzzySuggester</str>
    <str name="lookupImpl">FuzzyLookupFactory</str>
    <str name="storeDir">fuzzy_suggestions</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">term</str>
    <str name="weightField">weight</str>
    <str name="suggestAnalyzerFieldType">suggestType</str>
    <str name="buildOnStartup">false</str>
    <str name="buildOnCommit">false</str>
  </lst>
  <lst name="suggester">
    <str name="name">infixSuggester</str>
    <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
    <str name="indexPath">infix_suggestions</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">term</str>
    <str name="weightField">weight</str>
    <str name="suggestAnalyzerFieldType">suggestType</str>
    <str name="buildOnStartup">false</str>
    <str name="buildOnCommit">false</str>
  </lst>
</searchComponent>
[/code]

lookupImpl

In this case we're setting up a suggest component that has two suggester data sources available to it.

  • The first uses the FuzzyLookupFactory: an FST-based suggester (Finite State Transducer) which will match terms starting with the provided characters while accounting for potential misspellings. This lookup implementation will not find terms where the provided characters are in the middle.
  • The second uses the AnalyzingInfixLookupFactory, which will look inside the terms for matches. The results will also have <b> highlights around the provided terms inside the suggestions.
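The difference between the two lookups is easier to see with a toy example. The sketch below is not how Solr implements either one (there is no FST and no fuzzy matching here), and the function names are invented for illustration. It only contrasts "the suggestion starts with what was typed" against "some word inside the suggestion starts with what was typed".

```python
def prefix_matches(terms, typed):
    """Toy stand-in for a prefix lookup: the whole suggestion must start with the input."""
    typed = typed.lower()
    return [t for t in terms if t.lower().startswith(typed)]


def infix_matches(terms, typed):
    """Toy stand-in for an infix lookup: any word in the suggestion may start with the input."""
    typed = typed.lower()
    return [t for t in terms if any(w.startswith(typed) for w in t.lower().split())]


terms = ["Solr search", "Apache Solr", "Lucene"]
print(prefix_matches(terms, "sol"))  # only the suggestion that begins with "sol"
print(infix_matches(terms, "sol"))   # also the one with "Solr" in the middle
```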

Using a combination of methods, we can get more complete results. Additional suggester implementations are available:

  • WFSTLookup: offers more fine-grained control over results ranking than FST
  • TSTLookup: “a simple, compact trie-based lookup”. Whatever that means.
  • JaspellLookup: see the Jaspell source.

See the Suggester Documentation for more details on the different types of Lookup Implementations. They each have properties unique to their implementation.

storeDir and indexPath

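These parameters define the directory where the suggester structure will be stored after it's built (storeDir for the FuzzyLookupFactory and indexPath for the AnalyzingInfixLookupFactory in the configuration above). They should be set so the data is available on disk without rebuilding.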

field

The field to get the suggestions from. This could be a computed or a copy field.

weightField

As of Solr 5.1 this field is optional. In previous versions this field is required. If no proper weight value is available, a workaround is to define a float field in your schema and use that. Even if this field is never added to a document the code will compensate.

threshold (not used in this example)

The minimum fraction of the documents (a value between 0 and 1) that a term must appear in. This can be useful for reducing the number of garbage suggestions due to misspellings if you haven’t scrubbed the input.
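As a rough illustration of the idea (not Solr's actual code), the sketch below keeps only terms that appear in at least a given fraction of the documents. The function name and sample data are made up. As far as I know, in Solr this setting is read by the HighFrequencyDictionaryFactory rather than the DocumentDictionaryFactory used in this post, so treat that as an assumption and check the documentation for your version.

```python
from collections import Counter


def frequent_terms(docs, threshold):
    """Keep terms found in at least `threshold` (0..1) of the documents. Illustrative only."""
    doc_freq = Counter()
    for tokens in docs:
        # Count each term once per document, however often it repeats inside it.
        doc_freq.update(set(tokens))
    min_docs = threshold * len(docs)
    return sorted(term for term, count in doc_freq.items() if count >= min_docs)


# Hypothetical tokenized documents; "serach" is a one-off misspelling.
docs = [
    ["solr", "search"],
    ["solr", "suggest"],
    ["solr", "serach"],
    ["solr", "search"],
]
print(frequent_terms(docs, 0.5))
```

With a threshold of 0.5 the misspelled "serach" is dropped because it only shows up in one of four documents.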

suggestAnalyzerFieldType

This parameter is set to the fieldType that will process the information in the defined 'field'. I suggest starting simple and adding complexity as the need arises.

  • This fieldType is completely independent from the analysis chain applied to the field you specify for your suggester. It’s perfectly reasonable to have the two fieldTypes be much different.
  • The "string" fieldType should probably not be used. If a "string" type is appropriate for the use case, the TermsComponent will probably serve as well and it is much simpler.

buildOnStartup and buildOnCommit

Building the suggester data involves re-reading, decompressing and adding the field from every document to the suggester, so these two settings should generally both be set to "false". buildOnStartup rebuilds every time Solr is started. buildOnCommit rebuilds every time documents are committed. In the case of a smaller list of potential suggestions, the latter is acceptable.

Define a requestHandler for the Suggest Component

[code language="xml"]
<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="suggest">true</str>
    <str name="suggest.dictionary">infixSuggester</str>
    <str name="suggest.dictionary">fuzzySuggester</str>
    <str name="suggest.onlyMorePopular">true</str>
    <str name="suggest.count">10</str>
    <str name="suggest.collate">true</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>
[/code]

The "name" of the requestHandler defines the URL that will be used to request suggestions. In this case it will be http://localhost:8983/solr/index_name/suggest. Your port number may be different.

The requestHandler definition contains two parts:

defaults

These are settings that you would like to apply to each request. They may be provided in the querystring if different values are necessary.

Multiple "suggest.dictionary" values may be used. Each one will have its own section of results. The values are the names of the suggesters that were defined in the Suggest Component.
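If you want to override these defaults from client code, the querystring just repeats the same parameter names. Here is a minimal Python sketch that builds such a URL. The helper name `suggest_url` is my own invention, and it assumes the handler is registered as /suggest as in the configuration above.

```python
from urllib.parse import urlencode


def suggest_url(base_url, query, dictionaries=(), count=None):
    """Build a request URL for the /suggest handler (hypothetical helper)."""
    params = [("suggest.q", query)]
    # Repeat suggest.dictionary once per suggester to get one result section for each.
    params += [("suggest.dictionary", name) for name in dictionaries]
    if count is not None:
        params.append(("suggest.count", str(count)))
    params.append(("wt", "json"))
    return base_url.rstrip("/") + "/suggest?" + urlencode(params)


print(suggest_url("http://localhost:8983/solr/index_name", "sol",
                  ["infixSuggester", "fuzzySuggester"], count=5))
```

Leaving out `dictionaries` and `count` falls back to whatever is in the handler's defaults.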

components

The name of the Suggest Component is set here. This connects the handler to the component.

See the documentation for more details on configuring search components and request handlers.

Actually getting suggestions

Once all of this is set up, using it is very simple. Assuming a Solr index URL like this: http://localhost:8983/solr/index_name

  • Build the suggester:
    Issue http://localhost:8983/solr/index_name/suggest?suggest.build=true.
    • Until you do this step, no suggestions are returned.
    • The two build settings (buildOnStartup and buildOnCommit) can be used to avoid this, but consider the size of your index and the time and CPU that will be required to build the suggest index automatically.
  • Ask for suggestions:
    Issue http://localhost:8983/solr/index_name/suggest?suggest.q=whatever
    • Additional parameters can be included, such as the count, the desired format (json or xml) or a specific suggest.dictionary.
    • Use "wt" and "indent" parameters to format your results into json or xml and apply indenting. e.g.: &wt=json&indent=true
    • The response will contain a "suggest" field. This field will contain fields for each of the suggest.dictionaries that were used. Each of these dictionary fields will contain a field named after the query, which has a "numFound" field as well as a "suggestions" field containing an array of the found suggestions and their weights.

Response Format:

[code language="js"]
{
  suggest: {
    suggester_name: {
      suggest_query: {
        numFound: ..,
        suggestions: [
          {term: .., weight: .., payload: ..},
          ..
        ]
      }
    }
  }
}
[/code]
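Since each suggester returns its own section, the client usually has to flatten them into one list. The Python sketch below shows one way to do that for a response that has already been parsed from JSON. The function name and the sample data are made up for illustration, and stripping the <b> highlights is just one option (you may prefer to keep them for display).

```python
import re


def extract_terms(response, query):
    """Merge the suggestions from every suggester into one de-duplicated list."""
    terms = []
    for suggester in response.get("suggest", {}).values():
        for suggestion in suggester.get(query, {}).get("suggestions", []):
            # Remove the <b>..</b> highlights added by the infix suggester.
            term = re.sub(r"</?b>", "", suggestion["term"])
            if term not in terms:
                terms.append(term)
    return terms


# Hypothetical response for suggest.q=sol, shaped like the format above.
sample = {
    "suggest": {
        "infixSuggester": {"sol": {"numFound": 2, "suggestions": [
            {"term": "<b>Sol</b>r search", "weight": 10, "payload": ""},
            {"term": "Apache <b>Sol</b>r", "weight": 5, "payload": ""},
        ]}},
        "fuzzySuggester": {"sol": {"numFound": 1, "suggestions": [
            {"term": "Solr search", "weight": 10, "payload": ""},
        ]}},
    }
}
print(extract_terms(sample, "sol"))
```

This keeps the order in which the suggesters appear in the response. Sorting by weight across suggesters would be a reasonable alternative.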

I hope you find this information useful. See the Suggester documentation for more details.

Thanks for reading!