Rename latin tokenizer to "ascii_stemmer"
Showing 28 changed files with 98 additions and 71 deletions.
|
@@ -2,13 +2,19 @@ | |
|
||
There are 3 language modules available. To configure these, you will need to serve the appropriate [language bundle](./getting_started.md#hosting-the-files) in your HTML (or edit the CDN link accordingly), and edit the indexer configuration file. | ||
|
||
## Ascii Tokenizer | ||
```json | ||
{ | ||
"lang_config": { | ||
// ... options go here ... | ||
} | ||
} | ||
``` | ||
|
||
#### CDN link | ||
## Ascii Tokenizer | ||
|
||
The default tokenizer splits on sentences, then whitespaces to obtain tokens. | ||
The default tokenizer should work for any language that relies on ASCII characters, or accented variants of them (e.g. "á"). | ||
|
||
An [asciiFoldingFilter](https://github.com/tantivy-search/tantivy/blob/main/src/tokenizer/ascii_folding_filter.rs) is then applied to these tokens, followed by punctuation and non-word-character boundary removal. | ||
The text is first split into sentences, then on whitespace to obtain tokens. An [asciiFoldingFilter](https://github.com/tantivy-search/tantivy/blob/main/src/tokenizer/ascii_folding_filter.rs) is then applied to normalize diacritics, followed by punctuation and non-word-character boundary removal. | ||
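
The pipeline described above (sentence split, whitespace split, ASCII folding, boundary cleanup) can be sketched roughly as follows. This is a minimal illustration and not the library's implementation: `fold_char` and `tokenize` are hypothetical names, the folding table covers only a handful of characters, and treating `.`, `!` and `?` as sentence boundaries is an assumption.

```rust
/// Fold a few common Latin diacritics to ASCII.
/// Hypothetical helper: a real ASCII folding filter covers far more code points.
fn fold_char(c: char) -> char {
    match c {
        'á' | 'à' | 'â' | 'ä' | 'ã' | 'å' => 'a',
        'é' | 'è' | 'ê' | 'ë' => 'e',
        'í' | 'ì' | 'î' | 'ï' => 'i',
        'ó' | 'ò' | 'ô' | 'ö' | 'õ' => 'o',
        'ú' | 'ù' | 'û' | 'ü' => 'u',
        'ñ' => 'n',
        'ç' => 'c',
        other => other,
    }
}

/// Split text into sentences, then into whitespace-separated tokens.
/// Each token is folded to ASCII and stripped of non-alphanumeric
/// characters at its start and end. Empty tokens and sentences are dropped.
fn tokenize(text: &str) -> Vec<Vec<String>> {
    text.split(|c: char| matches!(c, '.' | '!' | '?')) // assumed sentence boundaries
        .map(|sentence| {
            sentence
                .split_whitespace()
                .map(|raw| {
                    let folded: String = raw.chars().map(fold_char).collect();
                    folded
                        .trim_matches(|c: char| !c.is_alphanumeric())
                        .to_string()
                })
                .filter(|token| !token.is_empty())
                .collect::<Vec<String>>()
        })
        .filter(|tokens| !tokens.is_empty())
        .collect()
}

fn main() {
    for sentence in tokenize("Café au lait! Olá, mundo.") {
        println!("{:?}", sentence);
    }
}
```

Folding before the boundary cleanup means an accented letter at the edge of a token is kept as its ASCII counterpart rather than being judged in its original form; in this sketch either order gives the same tokens, since accented letters already count as alphanumeric.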
|
||
```json | ||
{ | ||
|
@@ -34,13 +40,13 @@ An [asciiFoldingFilter](https://github.com/tantivy-search/tantivy/blob/main/src/ | |
<script src="https://cdn.jsdelivr.net/gh/ang-zeyu/[email protected]/packages/search-ui/dist/search-ui.ascii.bundle.js"></script> | ||
``` | ||
|
||
## Latin Tokenizer | ||
## Ascii Tokenizer with Stemmer | ||
|
||
This is essentially the same as the ascii tokenizer, but adds a `stemmer` option. | ||
|
||
```json | ||
{ | ||
"lang": "latin", | ||
"lang": "ascii_stemmer", | ||
"options": { | ||
// ---------------------------------- | ||
// Ascii Tokenizer options also apply | ||
|
@@ -60,7 +66,7 @@ If you do not need stemming, use the `ascii` tokenizer, which has a smaller wasm | |
**CDN Link** | ||
|
||
```html | ||
<script src="https://cdn.jsdelivr.net/gh/ang-zeyu/[email protected]/packages/search-ui/dist/search-ui.latin.bundle.js"></script> | ||
<script src="https://cdn.jsdelivr.net/gh/ang-zeyu/[email protected]/packages/search-ui/dist/search-ui.ascii-stemmer.bundle.js"></script> | ||
``` | ||
|
||
## Chinese Tokenizer | ||
|
2 changes: 1 addition & 1 deletion
...anguages/infisearch_lang_latin/Cargo.toml → .../infisearch_lang_ascii_stemmer/Cargo.toml
@@ -1,5 +1,5 @@ | ||
[package] | ||
name = "infisearch_lang_latin" | ||
name = "infisearch_lang_ascii_stemmer" | ||
version = "0.9.1" | ||
authors = ["Ze Yu <[email protected]>"] | ||
edition = "2018" | ||
|
File renamed without changes.
1 change: 1 addition & 0 deletions
packages/infisearch_languages/infisearch_lang_ascii_stemmer/src/lib.rs
@@ -0,0 +1 @@ | ||
pub mod ascii_stemmer; |
This file was deleted.
File renamed without changes.
2 changes: 1 addition & 1 deletion
...search_search/pkg/lang_latin/package.json → ...earch/pkg/lang_ascii_stemmer/package.json
@@ -1,5 +1,5 @@ | ||
{ | ||
"name": "@infisearch/lang-latin", | ||
"name": "@infisearch/lang-ascii-stemmer", | ||
"collaborators": [ | ||
"Ze Yu <[email protected]>" | ||
], | ||
|