Sitecore Standard Analyzer: Managing your own stop words filter

Sitecore uses the Standard Analyzer as its default analyzer for most of its internal search operations (for example, searches inside the Content Editor). The Standard Analyzer uses the StopFilter, which removes stop words from a token stream. As a result, you will run into a scenario where searches for terms containing common words such as "a", "an", "the" or "and" fail, because Lucene's StopFilter removes those stop words.

For example, if you search for an item called "Seller and Buyer", the Standard Analyzer processes that as "Seller Buyer". The stop words are removed from the phrase, and since no field in the index has that value, the search returns 0 results.
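The effect can be reproduced with a small Python sketch. It is not Lucene's implementation, only a simplified simulation that lowercases the text, splits on whitespace and drops stop words; the function name `analyze` and the shortened stop list are assumptions made for illustration.

```python
# Simplified simulation of stop-word removal (not Lucene's actual code).
# STOP_WORDS is an illustrative subset of the default list.
STOP_WORDS = {"a", "an", "and", "the", "of", "or"}


def analyze(text, stop_words=STOP_WORDS):
    """Lowercase the text, split on whitespace, and drop stop words."""
    tokens = text.lower().split()
    return [t for t in tokens if t not in stop_words]


if __name__ == "__main__":
    print(analyze("Seller and Buyer"))         # ['seller', 'buyer']
    print(analyze("Seller and Buyer", set()))  # ['seller', 'and', 'buyer']
```

With an empty stop list the word "and" survives analysis, which is the behaviour a custom stop words file is meant to give you.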

Out of the many ways to solve this issue, I will show you a way to manage your own stop words list, which means you can provide your own list of stop words to Lucene.

Currently, the following stopwords are declared in Sitecore:

"a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "there", "these", "they", "this", "to", "was", "will", "with"

Solution to set up your own stopwords filter

  • Download the stopwords file
  • Edit the contents of the above file to suit your needs and save it as a file with a .txt extension (this is due to my inability to attach .txt files in WordPress :))
  • Place the text file in your Data/Indexes folder.
  • Make the below config changes in the Sitecore.ContentSearch.Lucene.DefaultIndexConfiguration.config file
  • Rebuild your Sitecore indexes

[sourcecode language="xml"]
<param desc="defaultAnalyzer" type="Sitecore.ContentSearch.LuceneProvider.Analyzers.DefaultPerFieldAnalyzer, Sitecore.ContentSearch.LuceneProvider">
  <param desc="defaultAnalyzer" type="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net">
    <param hint="version">Lucene_30</param>
    <param desc="stopWords" type="System.IO.FileInfo, mscorlib">
      <param hint="fileName">[FULL_PATH_TO_SITECORE_ROOT_FOLDER]\Data\indexes\stopwords.txt</param>
    </param>
  </param>
</param>
[/sourcecode]

Please note that changes made to the stopwords.txt file are applied only after the config file is changed or the application pool is restarted. If you do not want any stop words at all, you can provide an empty .txt file.
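If you would rather generate the file than edit it by hand, a minimal Python sketch could look like the one below. It assumes the file holds one stop word per line, which is worth verifying against your Lucene.Net version; the helper name `write_stopwords` and the output path are hypothetical.

```python
from pathlib import Path

# Default stop words as listed earlier in this post.
DEFAULT_STOP_WORDS = [
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
]


def write_stopwords(path, keep_searchable=()):
    """Write the default list minus the words that should stay searchable.

    Assumes a one-word-per-line file format; returns the words written.
    """
    excluded = {w.lower() for w in keep_searchable}
    words = [w for w in DEFAULT_STOP_WORDS if w not in excluded]
    Path(path).write_text("".join(w + "\n" for w in words), encoding="utf-8")
    return words


if __name__ == "__main__":
    # Hypothetical usage: keep "and" and "or" searchable.
    written = write_stopwords("stopwords.txt", keep_searchable=["and", "or"])
    print(len(written), "stop words written")
```

Passing every default word in `keep_searchable` produces an empty file, which corresponds to the "no stop words" case mentioned above.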

My colleague Brent Svac also blogged about another way to solve the stop words issue.

Adam Conn also has an excellent post which explains the Sitecore 7 analyzers in detail.

Should you have any questions, please do not hesitate to comment or contact me via sjain@horizontalintegration.com