conf/textIndexer.conf

<?xml version="1.0" encoding="utf-8"?>

<!-- ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ -->
<!-- Configuration file for the XTF text indexing tool                      -->
<!-- ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ -->

<textIndexer-config>

    <!-- See end of file for a description of the available options. -->
  
    <index name="default">
        <src path="./data" scan="all"/>
        <db path="./index"/>
        <!-- Expert version: -->
        	<!-- <src path="./data" scan="all" clone="yes"/> -->
        	<!-- <db path="./index" rotate="yes"/> -->
	        <!-- <validation path="./conf/indexValidation.xml"/> -->
        <!-- End of expert version -->
        <chunk size="200" overlap="20"/>
        <docselector path="./style/textIndexer/docSelector.xsl"/>
        <stopwords list="a an and are as at be but by for if in into is it no not of on or s such t that the their then there these they this to was will with"/>
        <pluralmap path="./conf/pluralFolding/pluralMap.txt.gz"/>
        <accentmap path="./conf/accentFolding/accentMap.txt"/>
        <spellcheck createDict="yes"/>
    </index>

    <!-- =====================================================================
    Tag: <index name="nnn"> ... </index>

         The 'name' attribute specifies a name for an index definition block. 
         The name may be any combination of digits and/or letters and the 
         underscore (_) character. Punctuation and other symbols are not 
         permitted, and neither is the use of the space character. Also, the 
         index name may only be used for one index block in any given 
         configuration (if it appears more than once, the first occurence is 
         used, and the remaining ones are ignored.) This index name is the name 
         passed on the command line to the textIndexer to identify which 
         indices need to be processed. 

         The sub-tags allowed for <index> are as follows:

         <src path="ppp" scan="all|pruned"/>

             The 'path' attribute specifies the file-system path where the 
             documents to be indexed are located. The path specified for an 
             index must be a valid path for the operating system on which the 
             tool is being run (e.g., Windows, Mac, Linux, etc.) If a relative 
             path is used, it is considered to be relative to the XTF_HOME
             environment variable.

             The optional scan attribute defaults (for backward compatibility)
             to "pruned" in order to prevent recursing into directories that 
             have indexed data. The distribution copy of the configuration file 
             has the value set to "all" which will recurse into all directories;
             typically most people expect this behavior.

        <db path="ppp"/>

             The 'path' attribute specifies the file-system path where the 
             database for the named index should be located. If the path does 
             not exist or there are no databases files located there, the 
             textIndexer will automatically create the necessary directories 
             and database files. As with the source path, the database path 
             specified for an index must be a valid path for the operating 
             system on which the tool is being run.) If a relative path is 
             used, it is considered to be relative to the XTF_HOME
             environment variable.

        <chunk size="S" overlap="O"/>

            Attribute: size

            This textIndexer tool splits source documents into smaller chunks 
            of text when adding the document to its index. Doing so makes 
            proximity searches and the display of their resulting summary 
            "blurbs" faster by limiting how much of the source document must 
            be read into memory. 

            The 'size' attribute defines (as a number of words) how large the 
            chunk size should be. Note: The chunk size should be equal 
            to or more than two words. If it is not, the textIndexer will 
            force it to be two.

            The 'overlap' attribute defines (as a number of words) how large 
            the chunk overlap should be. Note: The chunk overlap should be
            equal to or less than half the chunk size. If it is not, the 
            textIndexer will force it to be half.) 

            It should be mentioned that the selected chunk overlap effectively
            defines the maximum distance (in words) that can exist between two 
            words in a document and still produce a search match. Consequently 
            if you have a chunk overlap of five words, the maximum distance 
            between two words that will result in a proximity match is five 
            words. As a guideline, a chunk overlap of about 20 words for a 
            chunk size of 200 words gives fairly good results.

        <docselector path="ppp"/>

            The textIndexer provides great flexibility in deciding which files
            in a source directory should be indexed and how. It does this
            by passing the files in each directory to an XSLT stylesheet, the
            "document selector" or (docselector for short). That stylesheet
            in turn decides which files to index and specifies various 
            parameters for each one. See the documentation within the file
            "docSelector.xsl" for detailed information.

            The path and name of the stylesheet specified by this attribute 
            must valid for the operating system on which the tool is being 
            run (e.g., Windows, Mac, Linux, etc.) If a relative path is used, 
            it is considered to be relative to the XTF_HOME environment 
            variable.

        <stopwords path="ppp"/>

            This attribute specifies a list of words that the textIndexer 
            should not add to the index. Eliminating stop-words from an index 
            improves search speed for an index. This is because the search 
            doesn't need to sift through all the occurences of the stop-words 
            in the document library. Consequently, adding words like a, an, 
            the, and, etc. to the stop-word list, which occur frequently in 
            documents but are relatively uninteresting to search for, can 
            speed up the search for more interesting words enormously. The 
            one caveat is that searches for any single stop-word by itself 
            will yield no matches, so it is important to pick stop-words that 
            people aren't usually interested in finding. Note however that 
            due to an internal process called n-gramming, stop words will 
            still be found as part of larger phrases, like of in Man of War, 
            or the in The Terminator. 

            The stop-word file should be a plain text file consisting of a
            list of stop words separated by spaces and/or commas. The path 
            specified must be valid for the operating system on which 
            the tool is being run (e.g., Windows, Mac, Linux, etc.) If a 
            relative path is used, it is considered to be relative to the 
            XTF_HOME environment variable.

        <pluralmap path="ppp"/>

            This attribute specifies a list of plural words and their
            corresponding singular forms that the textIndexer should fold
            together. This can yield better search results. For instance, if a 
            user searches for "cat" they probably also would like results for 
            "cats."

            The file should be a plain text file, with one word pair per line. 
            First is the plural form of a word, followed by a "|" character, 
            followed by the singular form. All should be lowercase, even in the 
            case of acronyms.

            Optionally, the file may be compressed in GZIP format, in which case 
            it must end in the extension ".gz".

            Non-ASCII characters should be encoded in UTF-8 format.

        <accentmap path="ppp"/>

            This attribute specifies a accented characters and their 
            corresponding forms with the diacritical marks removed. This can 
            yield better search results. For instance, if a user is looking for 
            the German word "Hüt" but can't type it because they're on an 
            American keyboard, they can type in "hut" and still get a match on 
            "Hüt".

            The file should be a plain text file, with one word pair per line. 
            First is the 4-digit hex code for the accented Unicode character, 
            followed by a "|" character, followed by the hex code for the same 
            character with diacritics removed.

        <spellcheck createDict="yes|no"/>

            This attribute specifies whether a spellcheck dictionary will be
            created for this index. This can significantly increase indexing
            time. If a dictionary is created, crossQuery can use it to make
            automated spelling suggestions for queries which are likely to be
            misspelled.

    ======================================================================== -->

</textIndexer-config>