Vision

Working with noise words

August 27, 2007

There’s an update available for this article at: http://www.lcbridge.nl/vision/2008/noisewords_upd.htm.

Noise words are words that do not improve the relevancy of a matching document and are therefore not useful in a search query. Words such as "and", "the", or the company name are so common that they appear in almost any document.

If you want to see which words are considered noise words, or if you want to add your own noise words to the list, you can find the default set of noise word lists at the following location: [drive letter]:\Program Files\Microsoft Office Servers\12.0\Data\Config. This folder contains noise word lists for a wide range of languages, such as English (noiseeng.txt) and Dutch (noisenld.txt). If no noise word list supports your language, you can add noise words to the neutral noise word file (noiseneu.txt).
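To inspect a noise word list without opening it in an editor, a few lines of Python are enough. This is a minimal sketch, not part of SharePoint: the function name read_noise_words is ours, and we assume the file is ANSI-encoded (code page 1252 here) with words separated by whitespace or line breaks.

```python
from pathlib import Path


def read_noise_words(path, encoding="cp1252"):
    """Return the set of words found in a noise word list file.

    Assumption: the file is ANSI-encoded (cp1252) and the words are
    separated by whitespace, typically one word per line.
    """
    text = Path(path).read_text(encoding=encoding)
    return set(text.split())
```

For example, read_noise_words(r"C:\Program Files\Microsoft Office Servers\12.0\Data\Config\noiseeng.txt") would return the English list, assuming the installation lives on drive C.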

Every Shared Service Provider (SSP) uses its own set of noise word files, copied from the default set of noise word lists ([drive letter]:\Program Files\Microsoft Office Servers\12.0\Data\Config). The noise word files for an SSP can be found at the following location: [drive letter]:\Program Files\Microsoft Office Servers\12.0\Data\Applications\[application GUID]\Config. If you have only one SSP, this location is easy to find. If you have multiple SSPs, finding the correct location is trickier: other than trial and error, we have not found a way to link an application GUID (which corresponds to the name of the gatherer application) to an SSP. The following VBScript code fragment lists the available application GUIDs, although it does not give you a handle that can be linked back to an SSP:

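' Connect to the gatherer manager of the search service.
Set objGatherAdmin = WScript.CreateObject("oSearch.GatherMgr.1")

' Print the name (the application GUID) of every gatherer application.
For Each objApplication In objGatherAdmin.GatherApplications
    WScript.Echo " "
    WScript.Echo "application name: " & objApplication.Name
Next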

If you want to try out this code, save it in a file called GetApplicationGUID.vbs. Open a command prompt and type cscript GetApplicationGUID.vbs. This displays the names of all available gatherer applications.
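If you would rather look at the file system than at the gatherer object model, the same application GUIDs show up as folder names. The sketch below simply lists every folder under the Applications folder that has a Config subfolder. The function name is hypothetical, and like the script above it does not tell you which SSP a GUID belongs to.

```python
from pathlib import Path


def find_application_config_folders(applications_root):
    """Return (folder name, Config path) pairs for each application folder.

    Assumption: every search application has its own subfolder, named
    after the application GUID, that contains a Config folder.
    """
    results = []
    for entry in sorted(Path(applications_root).iterdir()):
        config = entry / "Config"
        if config.is_dir():
            results.append((entry.name, config))
    return results
```

You would call it with the Applications folder of your installation, for example find_application_config_folders(r"C:\Program Files\Microsoft Office Servers\12.0\Data\Applications") if SharePoint is installed on drive C.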

If you want to change the noise word list, you need to take a couple of steps. The next procedure explains how to add the word test to the English noise word list:

  1. Determine which folder holds the noise word list of your choice. The list is located in [drive letter]:\Program Files\Microsoft Office Servers\12.0\Data\Applications\[application GUID]\Config.
  2. Open a noise word list file in Notepad. In this example, we will open the noise word list noiseeng.txt and add the word test to it.
  3. Save the noise word list using the ANSI encoding type and close Notepad. Note: Saving noise word lists using the Unicode encoding type has been known to break the noise word functionality. Also make sure that the noise word is followed by a carriage return. Finally, if adding a noise word does not seem to work, be aware that the word breaker, which is part of the Search architecture, might break up a single noise word into separate words.
  4. Open a command prompt and type services.msc. This opens the Services window.
  5. Locate the search service for Microsoft Office SharePoint Server 2007 and restart it. This service is called Office SharePoint Server Search. Right-click its name and choose Restart.
  6. Open SharePoint Central Administration and perform a full update of the content index.

Note: You will also find an entry for the Windows SharePoint Services 3.0 search service, called Windows SharePoint Services Search. If you installed SQL Server with the optional Full Text search component, you will also find an entry for that search service, called Microsoft Search.
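Steps 2 and 3 of the procedure above are easy to get wrong by hand, so here is a minimal sketch of how they could be scripted. It is not an official tool: the function name add_noise_word is ours, and we assume that "ANSI" means code page 1252 and that words are separated by line breaks. The script refuses to touch a file that starts with a Unicode byte order mark and always ends the new word with a carriage return and line feed.

```python
from pathlib import Path

ANSI = "cp1252"  # assumption: the ANSI code page of a Western European system


def add_noise_word(path, word, encoding=ANSI):
    """Append a word to a noise word list; return True if the file changed."""
    file = Path(path)
    data = file.read_bytes()
    # A byte order mark means the file was saved as Unicode, not ANSI.
    if data.startswith((b"\xff\xfe", b"\xfe\xff", b"\xef\xbb\xbf")):
        raise ValueError("file looks Unicode-encoded; save it as ANSI first")
    text = data.decode(encoding)
    if word in text.split():
        return False  # already listed, nothing to do
    if text and not text.endswith("\n"):
        text += "\r\n"  # terminate the last existing line first
    text += word + "\r\n"
    file.write_bytes(text.encode(encoding))
    return True
```

After running it you still need to restart the search service and perform a full update of the content index, as described in steps 4 to 6.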

Changes to the noise word lists take effect only after a full update of the content index. Noise words are filtered twice. The first time is during the indexing phase: any noise words found in an indexed document are not included in the content index. The IFilter responsible for processing the document reports the document language to the protocol handler, which applies the appropriate noise word list.
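To make the idea of index-time filtering concrete, here is a purely conceptual sketch. It is not how the SharePoint indexer is implemented. The function name, the language keys, the case-insensitive comparison, and the fallback to a "neutral" list are all assumptions made for illustration.

```python
def filter_noise(tokens, language, noise_lists):
    """Drop the tokens that appear in the noise word list for a language.

    noise_lists maps a language key to a set of lowercase noise words.
    Assumption: when a language has no list of its own, the list stored
    under the key "neutral" is used instead.
    """
    noise = noise_lists.get(language, noise_lists.get("neutral", set()))
    return [token for token in tokens if token.lower() not in noise]
```

With noise_lists = {"eng": {"the", "and"}}, the call filter_noise(["The", "search", "and", "index"], "eng", noise_lists) keeps only "search" and "index", which is why a noise word can never be matched once the document has been indexed.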

Note: If you want to find more information about the status of the indexing process, you can use Search Settings in SharePoint Central Administration to take a look at the available crawl logs. Alternatively, you can look in the folder [drive letter]:\Program Files\Microsoft Office Servers\12.0\Data\Applications\[application GUID]\Projects\Search\Indexer\CiFiles to see if the content index files have been updated.
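Checking the content index folder by hand gets tedious if you do it often. The sketch below reports the most recent modification time among the files in a folder. The function name is ours, and we only assume that updated index files get a new modification time.

```python
from datetime import datetime
from pathlib import Path


def latest_modification(folder):
    """Return the newest modification time of the files in a folder.

    Returns None when the folder contains no files.
    """
    times = [p.stat().st_mtime for p in Path(folder).iterdir() if p.is_file()]
    return datetime.fromtimestamp(max(times)) if times else None
```

Point it at the CiFiles folder of the application you are interested in and compare the result with the time you started the crawl.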

Noise words are filtered a second time at query time. You can programmatically set the LCID that is used in the search query; otherwise, the locale of the search server is used.

If you want to check that adjusting the noise word list has had the desired effect, you can take a shortcut that does not require a full update of the content index:

  1. Start the Microsoft Office SharePoint Server 2007 query tool (a free community tool that can be downloaded from the following location: http://www.gotdotnet.com/Community/UserSamples/Details.aspx?SampleGuid=89b3cda7-aad9-4919-8faf-34ef9b28c57b).
  2. In the Server Url text box, enter the URL of a SharePoint site collection.
  3. In the query editor text box, enter the following query (you should adjust the keyword and LCID according to your needs):
select rank, author, path, title, "DAV:getlastmodified"
from scope()
where freetext(*,'the',1033)
order by rank desc

If you have successfully added the noise word to the noise word list, the response will indicate the following: "Your query included only common words and / or characters, which were removed. No results are available. Try to add query terms."
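If you test several words or languages, it helps to generate the query text instead of editing it by hand. The helper below is a small sketch that only builds the query string shown above; it does not talk to SharePoint. The function name is ours, and doubling single quotes inside the keyword is an assumption based on common SQL quoting rules.

```python
def build_noise_test_query(keyword, lcid=1033):
    """Build the full-text query used to test whether a word is a noise word.

    lcid is the locale identifier, for example 1033 for English (United
    States) or 1043 for Dutch (Netherlands).
    """
    escaped = keyword.replace("'", "''")  # assumption: SQL-style quote escaping
    return (
        'select rank, author, path, title, "DAV:getlastmodified"\n'
        "from scope()\n"
        f"where freetext(*,'{escaped}',{int(lcid)})\n"
        "order by rank desc"
    )
```

For example, build_noise_test_query("test") produces the query for the word added in the earlier procedure, ready to paste into the query editor text box.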

As discussed earlier in this post, the noise word files for an SSP can be found at the following location: [drive letter]:\Program Files\Microsoft Office Servers\12.0\Data\Applications\[application GUID]\Config. If you want, you can define a central location for all noise word files (and thesaurus files, for that matter), so that all your SSPs share one set. You can do this by locating the following key in the registry: HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\12.0\Search\Setup\ContentIndexCommon\LanguageResources\Override\[Application name]\[Language]. The NoiseFile string value contains the path of the noise word file for the given language and SSP; the TsaurusFile string value contains the path of the thesaurus file.
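To see which files an SSP currently points to, you can read these two values with a script instead of browsing the registry. This is a minimal sketch that uses Python's standard winreg module and therefore only runs on Windows. The function names are ours, and the ContentIndexCommon\LanguageResources spelling of the key is our reading of the path, so verify it in the Registry Editor before relying on it.

```python
# Assumption: this is the Override key under HKEY_LOCAL_MACHINE.
OVERRIDE_ROOT = (
    r"SOFTWARE\Microsoft\Office Server\12.0\Search\Setup"
    r"\ContentIndexCommon\LanguageResources\Override"
)


def override_key(application_name, language):
    """Return the registry subkey for one application and language."""
    return "\\".join([OVERRIDE_ROOT, application_name, language])


def read_language_resources(application_name, language):
    """Return the (NoiseFile, TsaurusFile) paths stored in the registry."""
    import winreg  # standard library, available on Windows only

    subkey = override_key(application_name, language)
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, subkey) as key:
        noise_file, _ = winreg.QueryValueEx(key, "NoiseFile")
        thesaurus_file, _ = winreg.QueryValueEx(key, "TsaurusFile")
    return noise_file, thesaurus_file
```

The script only reads the values. Changing them to a shared location is a manual edit in the Registry Editor, and the usual caution about backing up the registry first applies.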

