Review Foundry Review Engine User Manual

CONFIGURATION -- Index



CONFIGURATION

Index

For your Review Foundry Search Engine to function, an index must be built for each searchable table. A table is searchable in principle if it has at least one string-type column with an assigned search weight. The tables that meet this condition and also have a built-in search interface form 3 pairs: Category and Item, Team and Member, Yellowpage and Supplier. Each pair can be searched separately using the search textbox found on any of the 3 browsable sets of Containers.

The index for each searchable table (itself a set of 3 related tables) is automatically updated each time a record is added, modified, or deleted, so most of the time you need not worry about the search index tables. However, there may be times when you need to rebuild them from scratch. To do so, visit the Index control panel, which offers options to re-index each searchable table. Indexing runs as an NPH process so that its results can be output to the browser incrementally, much as in the Build control panel. These build-via-the-browser pages are processed in blocks, that is, Staggered. Alternatively, indexing can be initiated as a command-line process:

to build the index for a table

	nph-admin.cgi --do=Index[index_type]

to contract the index for a table after first building it

	nph-admin.cgi --do=ReIndex[index_type]

to delete the index

	nph-admin.cgi --do=DeleteIndex[index_type]

where [index_type] = Item | Member | Supplier | Category | Team | Yellowpage

If you want to Index, or ReIndex, all your tables in one go:

	nph-admin.cgi --do=IndexAll
	nph-admin.cgi --do=ReIndexAll

Or, if you can run cron jobs that are not subject to timeouts (which may require a dedicated server), you might put something like the following into your crontab file:

08 2 15 * * /path/to/nph-admin.cgi --do=IndexItem        --cron=1
38 2 15 * * /path/to/nph-admin.cgi --do=IndexMember      --cron=1
08 3 15 * * /path/to/nph-admin.cgi --do=IndexSupplier    --cron=1
38 3 15 * * /path/to/nph-admin.cgi --do=IndexCategory    --cron=1
08 4 15 * * /path/to/nph-admin.cgi --do=IndexTeam        --cron=1
38 4 15 * * /path/to/nph-admin.cgi --do=IndexYellowpage  --cron=1

This set of crontab instructions specifies that, beginning at 2:08 A.M. on the 15th day of every month, one table is indexed every half hour until all 6 Thing and Container tables have been indexed. Alternatively, you can build the lot, one after the other, like this:

38 2 15 * * /path/to/nph-admin.cgi --do=IndexAll  --cron=1

The extra --cron=1 argument switches off logging to the screen unless an error message needs to be output. This keeps any email message sent to you after your cron jobs complete to a manageable size. You can also set a cron job to ReIndex tables after they have been indexed (perhaps on the next day) by changing 'IndexAll' to 'ReIndexAll' in your cron file. However, re-indexing may not be worth the effort; see the discussion on Index Contraction.

There are only 2 general parameters associated with indexing (the rest relate to stemming, which is covered next):

  • index_verbosity_high
    Example: Yes
    Toggle to "No" for the least amount of commentary while indexing.
  • index_staggered_block_size
    Example: 250
    When indexing via the browser, this parameter represents the maximum number of records in a table which should be indexed before reinitiating the browser process for the next set of records. If set too high you could exceed the web server timeout duration, after which time your indexing process will be prematurely terminated.
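
As a rough illustration of how staggered processing divides a table, the following Python sketch splits a record count into blocks of at most index_staggered_block_size records. It is not the code Review Foundry runs; the function name staggered_blocks and the 600-record table are assumptions made for the example.

```python
def staggered_blocks(total_records, block_size):
    """Yield (start, end) record offsets, each block holding at most block_size records."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    for start in range(0, total_records, block_size):
        # The final block may be shorter than block_size.
        yield start, min(start + block_size, total_records)


# Hypothetical table of 600 records indexed 250 at a time.
blocks = list(staggered_blocks(600, 250))
print(blocks)  # [(0, 250), (250, 500), (500, 600)]
```

Each block corresponds to one browser request, so a smaller block size means more requests but less work per request, which is what keeps each one under the web server timeout.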

Stemming / Stoplist

Index tables can get quite large in terms of the number of records they contain, as each word in an indexed document generally requires its own record in the index table. To keep disk space and search times manageable, 2 methods are routinely used to reduce the size of your index tables. The first is the Custom Stoplist method: any words in a record that also appear in the custom stoplist are removed from a copy of the record, and that copy is indexed instead. The second is Stemming, which replaces lexical variations of a word with a single representative. Thus the words absolute, absolution, and absolutely are all replaced with the stem absolut. Phrase matching consequently becomes less exact when we retrieve records containing the keyphrase we are seeking. However, the number of index records for documents that contain words with the absolut stem could be reduced by as much as a factor of 3. Stemming can be performed for any of the languages supported by the Perl Lingua::Stem module.
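
The two methods can be sketched in a few lines of Python. This is only an illustration of the idea: the stoplist is a tiny made-up sample, and toy_stem is a hypothetical suffix stripper written for this example, not the algorithm used by Lingua::Stem.

```python
# Tiny illustrative stoplist (an assumption; the real lists are far longer).
STOPLIST = {"a", "the", "of", "is"}

# Toy suffix rules, checked in order. NOT the Lingua::Stem algorithm.
SUFFIXES = ("ely", "ion", "e")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Remove stopwords from a copy of the record, then stem what is left."""
    words = text.lower().split()
    kept = [w for w in words if w not in STOPLIST]
    return [toy_stem(w) for w in kept]


terms = index_terms("The absolution of a rule is absolutely absolute")
print(sorted(set(terms)))  # ['absolut', 'rul']
```

Eight words in the record reduce to two distinct index terms: four are dropped as stopwords, and the three variations of absolute collapse to one stem.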

To perform stoplist parsing, a list of stopwords must be defined before indexing takes place. We may also wish to remove not only custom stopwords (words that we expect to appear in records and that carry no significant meaning in the context of the average search), but also words that appear in a large fraction of the searchable records. Adding these words to the stoplist requires that the document collection first be parsed in its entirety to find the word frequencies. So to reduce the size of your index you MAY need to index the entire record set twice. Clearly you cannot do this each time a new record is added to a table, so indexing everything from scratch should be a weekly or twice-monthly process that you can set up as a cron job. The process of re-indexing to exclude the common stopwords is termed contraction here, and it may or may not prove useful, depending on the variability of the words that make up your record set.
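
The two-pass idea can be sketched as follows in Python. This is a simplified illustration, not the indexer's own code: the function name auto_stoplist, the sample records, and the reading of the cutoff as "more than K percent of records" are all assumptions.

```python
from collections import Counter


def auto_stoplist(records, cutoff_percent):
    """First pass: return words found in more than cutoff_percent of the records."""
    doc_freq = Counter()
    for text in records:
        # Count each word once per record, however often it repeats there.
        doc_freq.update(set(text.lower().split()))
    limit = len(records) * cutoff_percent / 100.0
    return {word for word, n in doc_freq.items() if n > limit}


# Hypothetical record set.
records = [
    "review of the red widget",
    "review of the blue widget",
    "review of the green gadget",
    "review of a small gadget",
    "review of the large sprocket",
]

custom = {"a", "the"}                            # custom stoplist
stoplist = custom | auto_stoplist(records, 80)   # merged before indexing

# Second pass: index only the words that survive the merged stoplist.
print([w for w in records[0].split() if w not in stoplist])  # ['red', 'widget']
```

Here "review" and "of" occur in all 5 records, so they exceed the 80 percent cutoff and join the stoplist, while "the" (4 of 5 records, exactly 80 percent) is excluded only because it is in the custom list.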

Parameters relating to the process of stemming and stoplist configuration are:

  • db_index_stemming
    Example: Yes
    Unless you have a relatively small recordset and can afford the extra memory required to index your tables without resorting to stemming, set this to "Yes".
  • db_index_auto_stoplist
    Example: Yes
    Set to "Yes" to activate an auto-generated stoplist. By re-indexing once the stoplist has been created, you may be able to contract the size of your word index tables (to some degree) by ignoring all words that are contained in more than K percent of all indexed records (where K is set high, e.g. to 80 or 85). If the generated word frequency table (computed each time a table is indexed) shows that you would be completely WASTING your time with contractions, DISABLE this variable so that neither the word frequency table nor the auto-generated stoplist is generated (both take time, especially with larger tables).
  • db_index_auto_stoplist_percentage
    Example: 80
    This is the cutoff percentage used in the auto-generated stoplist discussed above.
  • db_index_custom_stoplist
    Example: Yes
    Toggle to "Yes" to activate a custom stoplist when indexing tables. This reduces the size of your word index tables by indexing ALL BUT the words contained in the custom stoplist. The default custom stoplists represent common words (in the relevant language) which do not convey any significant contextual meaning, and can thus be dispensed with. You can add words to the stoplist if you wish. The custom stoplist will be merged with the auto-generated stoplist before indexing takes place.
  • db_locale
    Example: EN
    Both the initial custom stoplist and the Lingua::Stem word-stemming module used by the indexer depend on the language assumed to be present in the text records on which they act. This parameter specifies the relevant language. Note: if a Stemmer for the chosen language is unavailable, stemming will NOT be activated.
  • English Custom Stop Words
    Example: a about above across after...
    This is your (language-specific) list of common stopwords. Words may be added anywhere in this list, as the content will be reordered when saved. Words should be whitespace separated.

    Also, if you wish to make abbreviated names like "A B Caruthers" turn up in a search, you will want to remove all single alphabetical characters from the list of common stopwords. Obviously this will increase the size of your index, and you might not like the result (slower search engine response), but if your database is not large and does not contain lots of indexable text, this might work out fine and make the search results more precise.
  • English Custom Bad Words
    Example: damn crap...
    This is your (language-specific) list of bad words which should not appear in any submitted records. Words may be added anywhere in this list, as the content will be reordered when saved. Words should be whitespace separated. This list is NOT presently used by the indexer.

Index Contraction

If you have enabled the autocreation of a stoplist of common words--that is, words common to K percent of the indexed records--then you will be presented with a table of word frequencies once the entire table has been indexed. The word frequency table might look something like this:

	  W	  R
	100	  0
	 75	 10
	 63	 20
	 55	 30
	 48	 40
	 32	 50
	 12	 60
	  8	 70
	  0	 80
	  0	 90
	  0	100

Here W is the percentage of WORDS that appear in more than R percent of the RECORDS. Thus, in this example, 32 percent of all words appear in more than 50 percent of the records. We might wish to cull those words from our index, reducing it in size by about a third. Depending on the nature of your records, you may never see a number like 32 percent, because virtually all words are shared by no more than a few documents. In that case, any attempt to trim (contract) the index would be a waste of time.
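
To make the definition of W and R concrete, here is a minimal Python sketch that computes such a table from a list of records. It follows the definition given above but is not the indexer's own code; the function name word_frequency_table and the 4 sample records are assumptions.

```python
from collections import Counter


def word_frequency_table(records, steps=range(0, 101, 10)):
    """Return (W, R) rows: W percent of distinct words occur in more than R percent of records."""
    doc_freq = Counter()
    for text in records:
        # Count each word once per record.
        doc_freq.update(set(text.lower().split()))
    total_words = len(doc_freq)
    table = []
    for r in steps:
        threshold = len(records) * r / 100.0
        common = sum(1 for n in doc_freq.values() if n > threshold)
        w = round(100.0 * common / total_words) if total_words else 0
        table.append((w, r))
    return table


# Hypothetical record set: "alpha" is in all 4 records, "beta" in 2, the rest in 1.
records = ["alpha beta", "alpha gamma", "alpha beta delta", "alpha"]
for w, r in word_frequency_table(records):
    print(w, r)
```

In this small sample only "alpha" appears in more than 50 percent of the records, so the R = 50 row reads W = 25 (1 distinct word in 4).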

If you do find that a significant fraction of your index can be trimmed, you should do so (a link will be presented in the case of browser-generated NPH pages to allow you to contract the index). This will free up disk space, and make your search queries run faster (if a little less accurately). If you see that you will gain only a few percent by contracting your index, then do not bother with it.

As a reminder, DISABLE the auto-generated stoplist if you find contractions to be of no use to you. Otherwise indexing from scratch will take much longer than necessary (computing a word frequency table, for example, is computationally intensive).



Copyright © 2004 Random Mouse Software. All Rights Reserved.