|
CONFIGURATION

Index

For your Review Foundry Search Engine to function, an index needs to be built for each searchable table. Tables which are in principle searchable are those that have at least one string-type column with an assigned search weight. Tables for which this is true, and for which a searchable interface is built in, form the following 3 pairs: Category and Item, Team and Member, Yellowpage and Supplier. Each of these 3 combinations can be searched separately by using the search textbox found on any of the 3 browsable sets of Containers.

The index for each searchable table (a set of 3 related tables per searchable table) is automatically updated each time a record is added, modified, or deleted. So most of the time you need not worry about dealing with the search index tables. However, there may be times when you need to rebuild these from scratch. This is done by visiting the Index control panel, where you will see options to re-index each searchable table. Re-indexing is executed as an NPH process so that the result of the indexing can be output to the browser incrementally; it therefore operates in a manner similar to the Build control panel. These build-via-the-browser pages are processed in blocks, or Staggered.
Alternately, the indexing can be initiated as a command-line process:

to build the index for a table

    nph-admin.cgi --do=Index[index_type]

to contract the index for a table after first building it

    nph-admin.cgi --do=ReIndex[index_type]

to delete the index

    nph-admin.cgi --do=DeleteIndex[index_type]

where [index_type] = Item | Member | Supplier | Category | Team | Yellowpage

If you want to Index, or ReIndex, all your tables in one go:

    nph-admin.cgi --do=IndexAll
    nph-admin.cgi --do=ReIndexAll

OR, if you can run cron jobs which are not subject to timeouts (you may need to be on a dedicated server for that to be the case), you might put something like the following into your crontab file:

    08 2 15 * * /path/to/nph-admin.cgi --do=IndexItem --cron=1
    38 2 15 * * /path/to/nph-admin.cgi --do=IndexMember --cron=1
    08 3 15 * * /path/to/nph-admin.cgi --do=IndexSupplier --cron=1
    38 3 15 * * /path/to/nph-admin.cgi --do=IndexCategory --cron=1
    08 4 15 * * /path/to/nph-admin.cgi --do=IndexTeam --cron=1
    38 4 15 * * /path/to/nph-admin.cgi --do=IndexYellowpage --cron=1

This set of crontab instructions specifies that, beginning at 2:08 A.M. on the 15th day of every month, one table is indexed every half hour until all 6 Thing and Container tables have been indexed. Or you can elect to build the lot, one after the other, like this:

    38 2 15 * * /path/to/nph-admin.cgi --do=IndexAll --cron=1

The extra --cron=1 argument switches off logging to the screen unless an error message needs to be output. This keeps any email message sent to you after your cron jobs complete to a manageable size.

You can also set a cron job to ReIndex tables after they have been indexed (perhaps on the next day). In that case you would simply change 'IndexAll' to 'ReIndexAll' when setting up your cron file. However, re-indexing may not be worth the effort; see the discussion on Index Contraction.
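If you prefer to generate the staggered crontab entries rather than type them by hand, the following is a minimal sketch in Python. The function name staggered_crontab and its parameters are illustrative assumptions and are not part of Review Foundry; only the --do=Index[index_type] and --cron=1 arguments and the /path/to/nph-admin.cgi placeholder are taken from the text above.

```python
# Index types listed in the text above, in the order of the sample crontab.
INDEX_TYPES = ["Item", "Member", "Supplier", "Category", "Team", "Yellowpage"]


def staggered_crontab(script="/path/to/nph-admin.cgi", start_hour=2,
                      start_minute=8, gap_minutes=30, day_of_month=15,
                      action="Index"):
    """Build one crontab line per table, spaced gap_minutes apart.

    Hypothetical helper: it only formats text, it does not install
    anything into your crontab.
    """
    lines = []
    for i, index_type in enumerate(INDEX_TYPES):
        total = start_hour * 60 + start_minute + i * gap_minutes
        hour, minute = divmod(total, 60)
        if hour > 23:
            raise ValueError("schedule runs past midnight")
        lines.append(
            f"{minute:02d} {hour} {day_of_month} * * "
            f"{script} --do={action}{index_type} --cron=1"
        )
    return lines


for line in staggered_crontab():
    print(line)
```

With the default arguments this prints the same six lines as the sample crontab above. Passing action="ReIndex" would produce the corresponding ReIndex entries; adjust the script path to your own installation before using the output.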
There are only 2 general parameters associated with indexing (the rest relate to stemming, which we will cover next):
Stemming / Stoplist

Index tables can get quite large in terms of the number of records they contain, as each word in an indexed document generally requires its own separate record in the index table. So to keep things manageable in terms of required disk space and search times, 2 methods are routinely used to reduce the size of your index tables.

One is the Custom Stoplist method: if any words which appear in our custom stoplist also appear in a record to be indexed, those words are removed from a copy of the record before the indexing takes place, and the copy is indexed instead.

The second method is called Stemming, and it allows us to replace lexical variations of a word with a single representative. Thus, for the words absolute, absolution, and absolutely, stemming allows us to replace those words with the stem absolut. This means that phrase matching becomes less exact when we attempt to retrieve records that contain the keyphrase we are seeking. However, the number of indexing records for documents that contain words with the absolut stem could be reduced by as much as a factor of 3. Stemming can be performed for any of the languages supported by the Perl Lingua::Stem module.

In order to perform stoplist parsing, a list of stopwords must be defined before the indexing takes place. Also, we may wish to remove not only custom stopwords (words that we expect to appear in records and which do not carry any significant meaning in the context of the average search), but also words that appear in a large fraction of the searchable records. Adding these words to the stoplist requires that the document collection first be parsed in its entirety to find the word frequencies. So to reduce the size of your index you MAY need to perform the indexing twice on the entire record set. Clearly you cannot do this each time a new record is added to a table, so indexing everything from scratch should be a weekly or twice-monthly process that you can set as a cron job.
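To make the two reduction methods concrete, here is a small illustrative sketch in Python. It is not Review Foundry's implementation: the stoplist contents, the function names, and the sample records are all made up for the example, and toy_stem is a deliberately crude suffix stripper chosen only to reproduce the absolut example above. It is not the algorithm used by Lingua::Stem.

```python
import re

# Hypothetical custom stoplist; the real list is whatever you configure.
CUSTOM_STOPLIST = {"the", "a", "of", "is", "and"}

# Toy suffix rules that merely reproduce the "absolut" example.
# NOT the Lingua::Stem algorithm.
TOY_SUFFIXES = ("ely", "ion", "e")


def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 letters."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text, stoplist=CUSTOM_STOPLIST):
    """Distinct terms left after stoplist removal and stemming."""
    return {toy_stem(w) for w in tokenize(text) if w not in stoplist}


def common_words(records, threshold_percent):
    """Words found in more than threshold_percent of the records."""
    counts = {}
    for text in records:
        for word in set(tokenize(text)):  # count each word once per record
            counts[word] = counts.get(word, 0) + 1
    limit = len(records) * threshold_percent / 100.0
    return {w for w, n in counts.items() if n > limit}


# Made-up sample records, for illustration only.
records = [
    "The absolute power of absolution",
    "Absolutely the best review",
    "A review of the supplier",
]
print(sorted(index_terms(records[0])))
print(sorted(common_words(records, 50)))
```

In this toy run the first record collapses to two index terms, absolut and power, and the words found in more than half of the records are of, review, and the. The second function also shows why a full pass over the record set is needed before common words can be added to the stoplist: the counts are only known once every record has been read.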
The process of re-indexing to exclude the common stopwords is termed contraction here and may or may not prove to be useful, depending on the variability of the words that compose your record set. Parameters relating to the process of stemming and stoplist configuration are:
Index Contraction

If you have enabled the autocreation of a stoplist of common words (that is, words common to K percent of the indexed records), then you will be presented with a table of word frequencies once the entire table has been indexed. The word frequency table might look something like this:
Here W is the percentage of WORDS that appear in more than R percent of the RECORDS. Thus, in this example, 32 percent of all words appear in more than 50 percent of the records. So we might wish to cull those words from our index, reducing it in size by about a third. Depending on the nature of your records, a number like 32 percent might never be seen, and virtually all words may be shared by no more than a few documents. In that case, any attempt to trim the index (that is, to contract it) would be a waste of time.

If you do find that a significant fraction of your index can be trimmed, you should do so (a link will be presented in the case of browser-generated NPH pages to allow you to contract the index). This will free up disk space and make your search queries run faster (if a little less accurately). If you see that you will gain only a few percent by contracting your index, then do not bother with it.

As a reminder, DISABLE the auto-generated stoplist if you find contractions to be of no use to you. Otherwise indexing from scratch will take much longer than necessary (computing a word frequency table, for example, is computationally intensive).

Copyright © 2004 Random Mouse Software. All Rights Reserved. |