<?xml version="1.0" encoding="utf-8"?>
<!-- ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ -->
<!-- Configuration file for the XTF text indexing tool -->
<!-- ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ -->
<textIndexer-config>
<!-- See end of file for a description of the available options. -->
<index name="default">
<src path="./data" scan="all"/>
<db path="./index"/>
<!-- Expert version: -->
<!-- <src path="./data" scan="all" clone="yes"/> -->
<!-- <db path="./index" rotate="yes"/> -->
<!-- <validation path="./conf/indexValidation.xml"/> -->
<!-- End of expert version -->
<chunk size="200" overlap="20"/>
<docselector path="./style/textIndexer/docSelector.xsl"/>
<stopwords list="a an and are as at be but by for if in into is it no not of on or s such t that the their then there these they this to was will with"/>
<pluralmap path="./conf/pluralFolding/pluralMap.txt.gz"/>
<accentmap path="./conf/accentFolding/accentMap.txt"/>
<spellcheck createDict="yes"/>
</index>
<!-- =====================================================================
Tag: <index name="nnn"> ... </index>
The 'name' attribute specifies a name for an index definition block.
The name may be any combination of letters, digits, and the underscore
(_) character. Spaces, punctuation, and other symbols are not
permitted. Each index name may be used for only one index block in a
given configuration; if it appears more than once, the first
occurrence is used and the remaining ones are ignored. This is the
name passed on the command line to the textIndexer to identify which
indices need to be processed.
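For illustration, the naming rule above applies as follows. Apart
from "default", the names are hypothetical, and the notes on the
right are explanations only:
<index name="default"> ... </index>        valid
<index name="letters_1850"> ... </index>   valid (letters, digits, underscore)
<index name="my index"> ... </index>       not permitted (contains a space)
<index name="my.index"> ... </index>       not permitted (contains punctuation)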
The sub-tags allowed for <index> are as follows:
<src path="ppp" scan="all|pruned"/>
The 'path' attribute specifies the file-system path where the
documents to be indexed are located. The path must be valid for the
operating system on which the tool is being run (Windows, Mac, Linux,
etc.). If a relative path is used, it is resolved relative to the
directory named by the XTF_HOME environment variable.
The optional 'scan' attribute defaults (for backward compatibility)
to "pruned", which prevents recursing into directories that have
indexed data. The distribution copy of the configuration file sets
the value to "all", which recurses into all directories; this is the
behavior most people expect.
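As a sketch of how a relative path is resolved, assume XTF_HOME is
set to the hypothetical directory /opt/xtf. The notes on the right
are explanations only:
<src path="./data" scan="all"/>      documents are read from /opt/xtf/data,
                                     recursing into all directories
<src path="./data" scan="pruned"/>   same location, but directories that
                                     have indexed data are not recursed into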
<db path="ppp"/>
The 'path' attribute specifies the file-system path where the
database for the named index should be located. If the path does
not exist or there are no database files located there, the
textIndexer will automatically create the necessary directories
and database files. As with the source path, the database path
must be valid for the operating system on which the tool is being
run. If a relative path is used, it is resolved relative to the
directory named by the XTF_HOME environment variable.
<chunk size="S" overlap="O"/>
Attributes: size, overlap
The textIndexer tool splits source documents into smaller chunks
of text when adding them to its index. Doing so makes proximity
searches and the display of their resulting summary "blurbs" faster
by limiting how much of the source document must be read into memory.
The 'size' attribute defines (as a number of words) how large each
chunk should be. Note: The chunk size must be at least two words.
If it is smaller, the textIndexer will force it to be two.
The 'overlap' attribute defines (as a number of words) how large
the chunk overlap should be. Note: The chunk overlap must be no more
than half the chunk size. If it is larger, the textIndexer will
force it to be half the chunk size.
The chunk overlap effectively defines the maximum distance (in
words) that can exist between two words in a document and still
produce a search match. For example, with a chunk overlap of five
words, the maximum distance between two words that will result in a
proximity match is five words. As a guideline, a chunk overlap of
about 20 words for a chunk size of 200 words gives fairly good
results.
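The following sketch applies the rules above to two settings. The
second setting is hypothetical and chosen only to show the limit on
overlap; the notes on the right are explanations only:
<chunk size="200" overlap="20"/>   used as given; two words up to 20
                                   words apart can still match
<chunk size="100" overlap="80"/>   overlap exceeds half the chunk size,
                                   so it is forced to 50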
<docselector path="ppp"/>
The textIndexer provides great flexibility in deciding which files
in a source directory should be indexed and how. It does this
by passing the files in each directory to an XSLT stylesheet, the
"document selector" (or docselector for short). That stylesheet
in turn decides which files to index and specifies various
parameters for each one. See the documentation within the file
"docSelector.xsl" for detailed information.
The path and name of the stylesheet specified by this attribute
must be valid for the operating system on which the tool is being
run (Windows, Mac, Linux, etc.). If a relative path is used, it is
resolved relative to the directory named by the XTF_HOME
environment variable.
<stopwords path="ppp"/>
This attribute specifies a file containing a list of words that the
textIndexer should not add to the index. (The default index
definition above supplies its stop words inline through a 'list'
attribute instead.) Eliminating stop words from an index improves
search speed, because the search doesn't need to sift through all
the occurrences of the stop words in the document library.
Consequently, adding words like "a", "an", "the", and "and", which
occur frequently in documents but are relatively uninteresting to
search for, can speed up the search for more interesting words
enormously. The one caveat is that a search for any single stop word
by itself will yield no matches, so it is important to pick stop
words that people aren't usually interested in finding. Note,
however, that due to an internal process called n-gramming, stop
words will still be found as part of larger phrases, like "of" in
"Man of War" or "the" in "The Terminator".
The stop-word file should be a plain text file consisting of a
list of stop words separated by spaces and/or commas. The path
specified must be valid for the operating system on which the tool
is being run (Windows, Mac, Linux, etc.). If a relative path is
used, it is resolved relative to the directory named by the XTF_HOME
environment variable.
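As an illustration of the file format described above, a stop-word
file could contain a single line such as the following. This is a
hypothetical sample whose words are taken from the default list
earlier in this file; it mixes commas and spaces as separators:
a, an, and, the of to in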
<pluralmap path="ppp"/>
This attribute specifies a list of plural words and their
corresponding singular forms that the textIndexer should fold
together. This can yield better search results. For instance, if a
user searches for "cat", they would probably also like results for
"cats".
The file should be a plain text file, with one word pair per line.
First is the plural form of a word, followed by a "|" character,
followed by the singular form. All should be lowercase, even in the
case of acronyms.
Optionally, the file may be compressed in GZIP format, in which case
it must end in the extension ".gz".
Non-ASCII characters should be encoded in UTF-8 format.
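As a sketch of the format described above, a plural map file could
contain lines like these. The entries are hypothetical examples and
are not taken from the distributed pluralMap.txt.gz:
cats|cat
children|child
indices|index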
<accentmap path="ppp"/>
This attribute specifies a list of accented characters and their
corresponding forms with the diacritical marks removed. This can
yield better search results. For instance, if a user is looking for
the German word "Hüt" but can't type it because they're on an
American keyboard, they can type in "hut" and still get a match on
"Hüt".
The file should be a plain text file, with one character pair per
line. First is the 4-digit hex code for the accented Unicode
character, followed by a "|" character, followed by the hex code for
the same character with diacritics removed.
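As a sketch of the format described above, an accent map file could
contain lines like these. The entries are hypothetical examples that
assume both codes are written as four hex digits; the notes on the
right are explanations only and would not appear in the file:
00fc|0075   u with diaeresis (U+00FC) maps to u (U+0075)
00e9|0065   e with acute (U+00E9) maps to e (U+0065)
00f1|006e   n with tilde (U+00F1) maps to n (U+006E)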
<spellcheck createDict="yes|no"/>
This attribute specifies whether a spellcheck dictionary will be
created for this index. Creating the dictionary can significantly
increase indexing time. If a dictionary is created, crossQuery can
use it to make automated spelling suggestions for queries that are
likely to be misspelled.
======================================================================== -->
</textIndexer-config>