This repository has been archived by the owner on Mar 9, 2023. It is now read-only.

Merge pull request #152 from WorksApplications/feature_changing_dictionary_linking_mechanism

Feature changing dictionary linking mechanism
kazuma-t authored Mar 26, 2021
2 parents 1c407ca + 6369574 commit 6437d51
Showing 7 changed files with 286 additions and 116 deletions.
66 changes: 53 additions & 13 deletions README.md
@@ -70,7 +70,8 @@ EOS

```bash
$ sudachipy tokenize -h
usage: sudachipy tokenize [-h] [-r file] [-m {A,B,C}] [-o file] [-a] [-d] [-v]
usage: sudachipy tokenize [-h] [-r file] [-m {A,B,C}] [-o file] [-s string]
[-a] [-d] [-v]
[file [file ...]]

Tokenize Text
@@ -83,6 +84,7 @@ optional arguments:
-r file the setting file in JSON format
-m {A,B,C} the mode of splitting
-o file the output file
-s string sudachidict type
-a print all of the fields
-d print the debug information
-v, --version print sudachipy version
@@ -175,33 +177,71 @@ tokenizer_obj.tokenize("シュミレーション", mode)[0].normalized_form()
## Dictionary Edition
**WARNING: `sudachipy link` is no longer available in SudachiPy v0.5.2 and later.**
There are three editions of Sudachi Dictionary, namely, `small`, `core`, and `full`. See [WorksApplications/SudachiDict](https://github.com/WorksApplications/SudachiDict) for details.
SudachiPy uses `sudachidict_core` by default. You can specify the dictionary with the `link -t` command.
SudachiPy uses `sudachidict_core` by default.
Dictionaries are installed as Python packages `sudachidict_small`, `sudachidict_core`, and `sudachidict_full`.
* [SudachiDict-small · PyPI](https://pypi.org/project/SudachiDict-small/)
* [SudachiDict-core · PyPI](https://pypi.org/project/SudachiDict-core/)
* [SudachiDict-full · PyPI](https://pypi.org/project/SudachiDict-full/)
The dictionary files are not in the package itself; they are downloaded upon installation.
### Dictionary option: command line
You can specify the dictionary with the tokenize option `-s`.
```bash
$ pip install sudachidict_small
$ sudachipy link -t small
$ echo "外国人参政権" | sudachipy -s small
```
```bash
$ pip install sudachidict_full
$ sudachipy link -t full
$ echo "外国人参政権" | sudachipy -s full
```
You can remove the dictionary link with the `link -u` command.
### Dictionary option: Python package
```bash
$ sudachipy link -u
You can specify the dictionary with one of the `Dictionary()` arguments, `config_path` or `dict_type`.
```python
class Dictionary(config_path=None, resource_dir=None, dict_type=None)
```
Dictionaries are installed as Python packages `sudachidict_small`, `sudachidict_core`, and `sudachidict_full`. SudachiPy refers to the `sudachidict` package to load a dictionary. The `link` subcommand creates *a symbolic link* named `sudachidict` that points to the chosen `sudachidict_*` package, which is how the packages are switched.
1. `config_path`
* You can specify the file path to the setting file with `config_path` (See [Dictionary in The Setting File](#Dictionary in The Setting File) for the detail).
* If the dictionary file is specified in the setting file as `systemDict`, SudachiPy will use the dictionary.
2. `dict_type`
* You can also specify the dictionary type with `dict_type`.
* The available arguments are `small`, `core`, or `full`.
* If different dictionaries are specified with `config_path` and `dict_type`, **the dictionary specified by `dict_type` overrides** the one defined in the setting file.
* [SudachiDict-small · PyPI](https://pypi.org/project/SudachiDict-small/)
* [SudachiDict-core · PyPI](https://pypi.org/project/SudachiDict-core/)
* [SudachiDict-full · PyPI](https://pypi.org/project/SudachiDict-full/)
```python
from sudachipy import tokenizer
from sudachipy import dictionary
# default: sudachidict_core
tokenizer_obj = dictionary.Dictionary().create()
# The dictionary given by the `systemDict` key in the config file (/path/to/sudachi.json) will be used
tokenizer_obj = dictionary.Dictionary(config_path="/path/to/sudachi.json").create()
# The dictionary specified by `dict_type` will be set.
tokenizer_obj = dictionary.Dictionary(dict_type="core").create() # sudachidict_core (same as default)
tokenizer_obj = dictionary.Dictionary(dict_type="small").create() # sudachidict_small
tokenizer_obj = dictionary.Dictionary(dict_type="full").create() # sudachidict_full
# The dictionary specified by `dict_type` overrides those defined in the config path.
# In the following code, `sudachidict_full` will be used regardless of a dictionary defined in the config file.
tokenizer_obj = dictionary.Dictionary(config_path="/path/to/sudachi.json", dict_type="full").create()
```
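The precedence described above can be sketched in a few lines. This is a minimal illustration, not SudachiPy's actual implementation: `resolve_system_dict` and its string return values are hypothetical names chosen for the example, and it assumes only the rules stated in this section (`dict_type` wins over `systemDict`, and `core` is the default).

```python
from typing import Optional

# Assumption: these mirror the three editions named in this section.
VALID_DICT_TYPES = ("small", "core", "full")
DEFAULT_DICT_TYPE = "core"


def resolve_system_dict(config: dict, dict_type: Optional[str] = None) -> str:
    """Return a label for the dictionary that would be selected.

    Hypothetical helper: `config` stands for the parsed setting file.
    """
    # 1. An explicit dict_type overrides anything in the setting file.
    if dict_type is not None:
        if dict_type not in VALID_DICT_TYPES:
            raise ValueError("dict_type must be one of " + ", ".join(VALID_DICT_TYPES))
        return "sudachidict_" + dict_type
    # 2. Otherwise use the dictionary file named by `systemDict`, if present.
    if "systemDict" in config:
        return config["systemDict"]
    # 3. Otherwise fall back to the default edition.
    return "sudachidict_" + DEFAULT_DICT_TYPE


if __name__ == "__main__":
    print(resolve_system_dict({}))
    print(resolve_system_dict({"systemDict": "/path/to/system.dic"}))
    print(resolve_system_dict({"systemDict": "/path/to/system.dic"}, dict_type="full"))
```

Keeping the rules in one function makes the order of checks explicit: explicit argument first, then setting file, then default.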
The dictionary files are not in the package itself; they are downloaded upon installation.
### Dictionary in The Setting File
@@ -256,7 +296,7 @@ optional arguments:
-h, --help show this help message and exit
-d string description comment to be embedded on dictionary
-o file output file (default: user.dic)
-s file system dictionary (default: linked system_dic, see link -h)
-s file system dictionary path (default: system core dictionary path)
```
About the dictionary file format, please refer to [this document](https://github.com/WorksApplications/Sudachi/blob/develop/docs/user_dict.md) (written in Japanese, English version is not available yet).
69 changes: 56 additions & 13 deletions docs/tutorial.md
@@ -70,7 +70,8 @@ EOS

```bash
$ sudachipy tokenize -h
usage: sudachipy tokenize [-h] [-r file] [-m {A,B,C}] [-o file] [-a] [-d] [-v]
usage: sudachipy tokenize [-h] [-r file] [-m {A,B,C}] [-o file] [-s string]
[-a] [-d] [-v]
[file [file ...]]

Tokenize Text
@@ -83,6 +84,7 @@ optional arguments:
-r file the setting file in JSON format
-m {A,B,C} the mode of splitting
-o file the output file
-s string sudachidict type
-a print all of the fields
-d print the debug information
-v, --version print sudachipy version
@@ -170,39 +172,80 @@ tokenizer_obj.tokenize("シュミレーション", mode)[0].normalized_form()
(This is an output example from the `20200330` `core` dictionary. It may change depending on the dictionary version.)
## Dictionary Edition
**WARNING: The `sudachipy link` command is no longer available in SudachiPy v0.5.2 and later.**
There are three editions of Sudachi Dictionary: `small`, `core`, and `full`. See [WorksApplications/SudachiDict](https://github.com/WorksApplications/SudachiDict) for details.
SudachiPy is set to `sudachidict_core` by default. You can change the dictionary setting with the `link -t` command.
SudachiPy is set to `sudachidict_core` by default.
`sudachidict_small`, `sudachidict_core`, and `sudachidict_full` are installed as Python packages.
* [SudachiDict-small · PyPI](https://pypi.org/project/SudachiDict-small/)
* [SudachiDict-core · PyPI](https://pypi.org/project/SudachiDict-core/)
* [SudachiDict-full · PyPI](https://pypi.org/project/SudachiDict-full/)
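The dictionary files are not included in the package itself; a step that downloads them is built into the installation of the packages above.
### Dictionary option: command line
You can change the dictionary setting with the `-s` option.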
```bash
$ pip install sudachidict_small
$ sudachipy link -t small
$ echo "外国人参政権" | sudachipy -s small
```
```bash
$ pip install sudachidict_full
$ sudachipy link -t full
$ echo "外国人参政権" | sudachipy -s full
```
If you remove the link with `link -u`, the default `sudachidict_core` is used.
### Dictionary option: Python package
```bash
$ sudachipy link -u
You can specify the dictionary to use with the `Dictionary` argument `config_path` or `dict_type`.
```python
class Dictionary(config_path=None, resource_dir=None, dict_type=None)
```
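`sudachidict_small`, `sudachidict_core`, and `sudachidict_full` are installed as Python packages. SudachiPy refers to the `sudachidict` package when it uses a dictionary. `link` creates a *symbolic link* so that `sudachidict_*` can be referenced as `sudachidict`.
1. `config_path`
* You can specify the path to the dictionary setting file with `config_path` (see [Dictionary in The Setting File](#辞書の設定ファイル)).
* If the specified setting file contains the dictionary file path `systemDict`, that dictionary is used in preference to the default.
2. `dict_type`
* You can also specify the dictionary type directly with the `dict_type` option.
* Three values are available: `small`, `core`, and `full`.
* If different dictionaries are specified with `config_path` and `dict_type`, **`dict_type` takes precedence**.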
* [SudachiDict-small · PyPI](https://pypi.org/project/SudachiDict-small/)
* [SudachiDict-core · PyPI](https://pypi.org/project/SudachiDict-core/)
* [SudachiDict-full · PyPI](https://pypi.org/project/SudachiDict-full/)
```python
from sudachipy import tokenizer
from sudachipy import dictionary
# By default, sudachidict_core is set
tokenizer_obj = dictionary.Dictionary().create()
# The dictionary specified by systemDict in /path/to/sudachi.json is set
tokenizer_obj = dictionary.Dictionary(config_path="/path/to/sudachi.json").create()
# The dictionary specified by dict_type is set
tokenizer_obj = dictionary.Dictionary(dict_type="core").create() # sudachidict_core (same as default)
tokenizer_obj = dictionary.Dictionary(dict_type="small").create() # sudachidict_small
tokenizer_obj = dictionary.Dictionary(dict_type="full").create() # sudachidict_full
# dict_type (sudachidict_full) takes precedence
tokenizer_obj = dictionary.Dictionary(config_path="/path/to/sudachi.json", dict_type="full").create()
```
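The dictionary files are not included in the package itself; a step that downloads them is built into the installation of the packages above.
### Dictionary in The Setting File
You can also switch the dictionary file with `sudachi.json`.
The dictionary file path `systemDict` can be either an absolute path or a relative path.
A relative path is resolved relative to the dictionary setting file.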
```
{
@@ -250,7 +293,7 @@ optional arguments:
-h, --help show this help message and exit
-d string description comment to be embedded on dictionary
-o file output file (default: user.dic)
-s file system dictionary (default: linked system_dic, see link -h)
-s file system dictionary path (default: system core dictionary path)
```
For the dictionary file format, see [user_dict.md](https://github.com/WorksApplications/Sudachi/blob/develop/docs/user_dict.md).
33 changes: 8 additions & 25 deletions sudachipy/command_line.py
@@ -22,7 +22,7 @@
from . import __version__
from . import dictionary
from . import tokenizer
from .config import set_default_dict_package, settings, unlink_default_dict_package
from .config import settings
from .dictionarylib import BinaryDictionary
from .dictionarylib import SYSTEM_DICT_VERSION_2, USER_DICT_VERSION_3
from .dictionarylib.dictionarybuilder import DictionaryBuilder
@@ -125,22 +125,6 @@ def _command_build(args, print_usage):
builder.build(args.in_files, rf, wf)


def _command_link(args, print_usage):
output = sys.stdout
if args.unlink:
unlink_default_dict_package(output=output)
return

dict_package = 'sudachidict_' + args.dict_type
try:
return set_default_dict_package(dict_package, output=output)
except ImportError:
print('Package `{0}` does not exist.\n'
'You may install it with a command `$ pip install {0}`'
.format(dict_package), file=sys.stderr)
exit(1)


def _command_tokenize(args, print_usage):
if args.version:
print_version()
@@ -169,7 +153,10 @@ def _command_tokenize(args, print_usage):
enable_dump = args.d

try:
dict_ = dictionary.Dictionary(config_path=args.fpath_setting)
if args.system_dict_type is not None:
dict_ = dictionary.Dictionary(config_path=args.fpath_setting, dict_type=args.system_dict_type)
else:
dict_ = dictionary.Dictionary(config_path=args.fpath_setting)
tokenizer_obj = dict_.create()
input_ = fileinput.input(args.in_files, openhook=fileinput.hook_encoded("utf-8"))
run(tokenizer_obj, mode, input_, print_all, stdout_logger, enable_dump)
@@ -192,18 +179,14 @@ def main():
parser_tk.add_argument("-r", dest="fpath_setting", metavar="file", help="the setting file in JSON format")
parser_tk.add_argument("-m", dest="mode", choices=["A", "B", "C"], default="C", help="the mode of splitting")
parser_tk.add_argument("-o", dest="fpath_out", metavar="file", help="the output file")
parser_tk.add_argument("-s", dest="system_dict_type", metavar='string', choices=["small", "core", "full"],
help="sudachidict type")
parser_tk.add_argument("-a", action="store_true", help="print all of the fields")
parser_tk.add_argument("-d", action="store_true", help="print the debug information")
parser_tk.add_argument("-v", "--version", action="store_true", dest="version", help="print sudachipy version")
parser_tk.add_argument("in_files", metavar="file", nargs=argparse.ZERO_OR_MORE, help='text written in utf-8')
parser_tk.set_defaults(handler=_command_tokenize, print_usage=parser_tk.print_usage)

# link default dict package
parser_ln = subparsers.add_parser('link', help='see `link -h`', description='Link Default Dict Package')
parser_ln.add_argument("-t", dest="dict_type", choices=["small", "core", "full"], default="core", help="dict dict")
parser_ln.add_argument("-u", dest="unlink", action="store_true", help="unlink sudachidict")
parser_ln.set_defaults(handler=_command_link, print_usage=parser_ln.print_usage)

# build dictionary parser
parser_bd = subparsers.add_parser('build', help='see `build -h`', description='Build Sudachi Dictionary')
parser_bd.add_argument('-o', dest='out_file', metavar='file', default='system.dic',
@@ -224,7 +207,7 @@ def main():
parser_ubd.add_argument('-o', dest='out_file', metavar='file', default='user.dic',
help='output file (default: user.dic)')
parser_ubd.add_argument('-s', dest='system_dic', metavar='file', required=False,
help='system dictionary (default: linked system_dic, see link -h)')
help='system dictionary path (default: system core dictionary path)')
parser_ubd.add_argument("in_files", metavar="file", nargs=argparse.ONE_OR_MORE,
help='source files with CSV format (one or more)')
parser_ubd.set_defaults(handler=_command_user_build, print_usage=parser_ubd.print_usage)
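The `-s` handling in this diff can be tried in isolation with a small stand-alone sketch. It reuses the `-r` and `-s` argument definitions shown above, but `build_parser` and `dictionary_kwargs` are hypothetical helpers written for this example, and the sketch only builds the keyword arguments rather than calling `dictionary.Dictionary`. Building a keyword dictionary is one possible alternative to the `if`/`else` in `_command_tokenize`, not what the commit does.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical cut-down parser with only the -r and -s options."""
    parser = argparse.ArgumentParser(prog="tokenize-sketch")
    parser.add_argument("-r", dest="fpath_setting", metavar="file",
                        help="the setting file in JSON format")
    # `choices` makes argparse reject anything other than the three editions.
    parser.add_argument("-s", dest="system_dict_type", metavar="string",
                        choices=["small", "core", "full"],
                        help="sudachidict type")
    return parser


def dictionary_kwargs(args: argparse.Namespace) -> dict:
    """Collect keyword arguments, adding dict_type only when -s was given."""
    kwargs = {"config_path": args.fpath_setting}
    if args.system_dict_type is not None:
        kwargs["dict_type"] = args.system_dict_type
    return kwargs


if __name__ == "__main__":
    parsed = build_parser().parse_args(["-r", "sudachi.json", "-s", "small"])
    print(dictionary_kwargs(parsed))
```

Because `-s` has no default, omitting it leaves `system_dict_type` as `None`, so the setting file (or the default edition) decides which dictionary is used.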