Review Foundry Review Engine User Manual

SEARCH ENGINE

It is a common misconception that a web site's search engine is working properly only if it finds everything that could possibly be found on the site. That would be like expecting you to remember every word of every conversation you have ever taken part in. You don't, because not everything you hear is worth remembering, and your brain isn't big enough anyway. Like your brain, a search engine stores only the information deemed relevant. This short section of the manual aims to teach you how to get the Review Foundry keyword search engine to work for you. It does not explain in any detail how the engine does its job.

Keyphrase Searches

When you type the phrase "cheap tennis shoes" into a search engine, and no other information, the search engine has to decide what it is that you are after. Are you looking for documents in which all three of those words occur, or are you looking for documents in which the exact phrase "cheap tennis shoes" appears at least once? The second case is usually assumed. If no documents can be found that contain the phrase, the next best thing is considered to be documents in which all three words appear. Failing that, any two words, and then any single word. This is why search engines return pages and pages of what appear to be basically useless results.
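The fallback order described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not the Review Foundry code: the function names (`tokenize`, `match_tier`, `search`) and the whitespace-based word splitting are assumptions made for the example.

```python
def tokenize(text):
    """Lowercase a string and split it into words on whitespace."""
    return text.lower().split()


def contains_phrase(doc_words, query_words):
    """True if query_words occur consecutively, in order, in doc_words."""
    n = len(query_words)
    return any(doc_words[i:i + n] == query_words
               for i in range(len(doc_words) - n + 1))


def match_tier(document, query):
    """Return a tier for a document: lower numbers are better matches.

    0 = exact phrase, 1 = all query words present,
    2, 3, ... = progressively fewer query words present,
    None = no query word present.
    """
    doc_words = tokenize(document)
    query_words = tokenize(query)
    if not query_words:
        return None
    if contains_phrase(doc_words, query_words):
        return 0
    unique = set(query_words)
    found = len(unique & set(doc_words))
    if found == 0:
        return None
    return 1 + (len(unique) - found)


def search(documents, query):
    """Return the documents from the best tier that has any match."""
    tiers = {}
    for doc in documents:
        tier = match_tier(doc, query)
        if tier is not None:
            tiers.setdefault(tier, []).append(doc)
    return tiers[min(tiers)] if tiers else []
```

With this sketch, a collection that contains no document with the exact phrase still returns whatever matches the most words, which is how a search for a three-word phrase can end up listing documents that share only one word with it.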

With a little thought it should be easy enough to see that, in order to find all documents in which a given phrase occurs, one must parse all documents and record the position of every word in them. This is basically what the Review Foundry search indexer does (the indexer is the code that creates the search index on which the search engine operates). But recording every word of every document would produce huge indices, far larger than the original documents themselves. Therefore some concessions are made when a table of records is indexed:


(1) Only columns with a 'Search Weight' are indexed. These are TEXT
    columns that contain plain ASCII text. The 'Search Weight' is an
    assigned INTEGER that ascribes a level of importance to each word
    found in the column. If one column has a 'Search Weight' of 3 and
    another has a 'Search Weight' of 1, a document in which the keyword
    is found in the first column is treated, when ranking results, as
    3 times more relevant than one in which it is found in the second.
    So not every word in a table is indexed, nor is any of the
    non-textual information.

(2) Words are stemmed to reduce the number of variations that each word
    might contribute to an index. The stem for the words 'abbreviate',
    'abbreviated', and 'abbreviation' might be 'abbreviat'. In this
    case one word appears in the search index rather than three,
    saving space. Of course, a search for "best abbreviation" might
    then catch documents that contain the phrase "best abbreviated",
    but that is the consequence of stemming. We trade accuracy for
    space, but stemming can be switched off if desired.
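To make the effect of stemming concrete, here is a deliberately naive suffix stripper in Python. It is a toy written for this example only: the suffix list and the `toy_stem` name are assumptions, and it is not the stemmer that Review Foundry uses. Real stemmers apply far more rules.

```python
# Toy suffix list, checked in order (longest suffix first).
SUFFIXES = ("ion", "ed", "e")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def stem_phrase(phrase):
    """Stem every word of a phrase, preserving word order."""
    return [toy_stem(word) for word in phrase.split()]


words = ["abbreviate", "abbreviated", "abbreviation"]
unique_stems = {toy_stem(word) for word in words}
```

Here `unique_stems` holds a single entry, so the index stores one word instead of three. The cost is visible too: `stem_phrase("best abbreviation")` and `stem_phrase("best abbreviated")` produce the same list, so the two phrases can no longer be told apart.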

Improving Search Results By Indexing More Columns

When a table is indexed (and only a few of the tables managed by Review Foundry are indexed), the words found in the records making up the table, at least for the columns that carry a Search Weight, are recorded and stored away together with their positions. For example, when the Item table is indexed, the indexable words end up in the table named Item_word_list. This table simply ascribes an integer to each unique word (or the stemmed version of it). Another table, named Item_word_index, records the position of each word in a given document (table record). In addition to the position of the word, the column in which the word was found is also recorded. This allows a search weight to be attached to the word when a search is performed.
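The relationship between the two index tables can be pictured with in-memory Python structures. This is a sketch under stated assumptions: the column names `title` and `description`, their weights, and the scoring rule (adding the column's Search Weight for every hit) are invented for the example and are not taken from the Review Foundry schema or ranking code.

```python
# Hypothetical searchable columns and their Search Weights.
SEARCH_WEIGHTS = {"title": 3, "description": 1}


def build_index(records):
    """Build stand-ins for the word list and word index tables.

    word_list maps each unique word to an integer id.
    word_index holds (record_id, word_id, column, position) rows.
    """
    word_list = {}
    word_index = []
    for record_id, record in records.items():
        for column in SEARCH_WEIGHTS:
            words = record.get(column, "").lower().split()
            for position, word in enumerate(words):
                word_id = word_list.setdefault(word, len(word_list) + 1)
                word_index.append((record_id, word_id, column, position))
    return word_list, word_index


def score(word, word_list, word_index):
    """Sum the Search Weight of every column hit, per record."""
    scores = {}
    word_id = word_list.get(word.lower())
    if word_id is None:
        return scores
    for record_id, wid, column, _position in word_index:
        if wid == word_id:
            scores[record_id] = scores.get(record_id, 0) + SEARCH_WEIGHTS[column]
    return scores


records = {
    1: {"title": "Tennis shoes", "description": "cheap and light"},
    2: {"title": "Running socks", "description": "pairs well with tennis shoes"},
}
word_list, word_index = build_index(records)
tennis_scores = score("tennis", word_list, word_index)
```

In this example record 1 scores 3 for "tennis" because the word sits in the weight-3 column, while record 2 scores 1. Because each index row stores the column, the weight can be looked up at search time rather than being baked into the index.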

The column name is recorded as an ENUM field, however, so if a searchable column is added to or removed from the Item table, the ENUM column in the index becomes inaccurate. In this case the Item_word_index table should be deleted and the Item table reindexed from scratch. You can perform reindexing from the Index control panel. Doing so immediately deletes the Item_word_list and Item_word_index tables, then recreates and repopulates them, so there is no need to delete these tables manually before reindexing.

Improving Search Results By Disabling Stemming

If you think your data is sufficiently irregular that removing stemming from the indexing process might significantly improve the behavior of your search engine, see the Configure > Search page, which allows you to disable stemming.

Beware when you change the default stemming setup. When it saves the new configuration, Review Foundry checks whether you have the stemming classes mentioned in the default setup for different languages, and it will complain if it does not find them. This is nothing to worry about. It won't happen the next time you save the page.

Note that the loss of stemming will increase the size of your index tables. How much depends on the nature of your indexed data.

Once you have toggled the stemmer from enabled to disabled, you need to reindex the Item table from scratch, since the words stored in the index will change from stems to full words.

Note that if you are dealing with records in the Supplier table, the index tables would be named Supplier_word_index and Supplier_word_list. Likewise for Member records the two index tables are Member_word_index and Member_word_list.

Supplier Searches

In addition to the keyword-based search engine, the Yellowpage branch allows for a search of suppliers by supplier name (the value of the Supplier.known_as column) and by the "location" associated with the supplier. The location search compares the string typed into the location input box with several location-related Supplier columns: Supplier.state_id (expanded to the state name), Supplier.country_id (expanded to the country name), Supplier.addr_city, Supplier.addr_region, and the postcode or zipcode columns defined for the Supplier table.

On occasion it might be preferable to compare the string typed into the Supplier name input box not just with the Supplier.known_as column but with some other Supplier columns too. For example, if your suppliers are restaurants and you have a Supplier.restaurant_type column which takes one of several values, it might be useful to let your visitors search not just for a particular restaurant in their city, but for all "pub" type restaurants. They can then simply type "pub" for the restaurant name and add their city name to the location input. All pubs that offer a menu in their city should then turn up in the search.

To add extra columns for the supplier name search, other than the Supplier.known_as column, visit the Configure > Search page.
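The way the two input boxes combine can be sketched as follows. This is an illustration only: the dictionary keys `state_name`, `country_name`, and `postcode` stand in for the expanded state and country names and the postcode or zipcode columns, and the case-insensitive substring comparison is an assumption, since the manual does not say how the strings are compared.

```python
# Stand-ins for the location-related Supplier columns (names are assumed).
LOCATION_COLUMNS = ("state_name", "country_name", "addr_city",
                    "addr_region", "postcode")


def matches_any(needle, supplier, columns):
    """True if needle is a substring of any listed column, ignoring case."""
    needle = needle.strip().lower()
    if not needle:
        return True  # assumption: an empty input box places no constraint
    return any(needle in str(supplier.get(column, "")).lower()
               for column in columns)


def supplier_search(suppliers, name="", location="",
                    name_columns=("known_as",)):
    """Return suppliers matching both the name and the location input."""
    return [supplier for supplier in suppliers
            if matches_any(name, supplier, name_columns)
            and matches_any(location, supplier, LOCATION_COLUMNS)]


suppliers = [
    {"known_as": "The Red Lion", "restaurant_type": "pub",
     "addr_city": "Leeds"},
    {"known_as": "Luigi's", "restaurant_type": "italian",
     "addr_city": "Leeds"},
    {"known_as": "The Crown", "restaurant_type": "pub",
     "addr_city": "York"},
]

# Only known_as is compared by default, so "pub" finds nothing.
default_hits = supplier_search(suppliers, name="pub", location="Leeds")

# With restaurant_type added as an extra name column, the pub in Leeds matches.
extended_hits = supplier_search(
    suppliers, name="pub", location="Leeds",
    name_columns=("known_as", "restaurant_type"))
```

In this sketch `default_hits` is empty, while `extended_hits` contains only The Red Lion, which mirrors the effect of adding an extra column to the supplier name search.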




Copyright © 2004 Random Mouse Software. All Rights Reserved.