Chinese Tokenizer Collection

A simple wrapper around, and collection of, several Chinese tokenizers (word segmenters).

Free software: MIT license
Documentation: https://tokenziers-collection.readthedocs.io.

Features

TODO

Usage

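from tokenizers_collection.config import tokenizer_registry

# Iterate over every registered tokenizer; each entry yields a name and a callable.
for name, tokenizer in tokenizer_registry:
    print("Tokenizer: {}".format(name))
    # Tokenize the input file and write the result to the output file.
    tokenizer('input_file.txt', 'output_file.txt')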
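
As written, the loop above passes the same output path to every tokenizer, so later results presumably overwrite earlier ones. The following is a minimal sketch of one way to keep a separate output file per tokenizer. It assumes only what the usage example shows: the registry yields (name, callable) pairs, and each callable takes an input path and an output path. The names whitespace_tokenizer, demo_registry, and run_all are hypothetical stand-ins for illustration and are not part of tokenizers_collection.

```python
import os


def whitespace_tokenizer(input_file, output_file):
    """Hypothetical tokenizer: split each line on whitespace and
    write the tokens back out separated by single spaces."""
    with open(input_file, encoding="utf-8") as src, \
            open(output_file, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(" ".join(line.split()) + "\n")


# Hypothetical stand-in for the package's registry: (name, callable) pairs.
demo_registry = [("whitespace", whitespace_tokenizer)]


def run_all(registry, input_file, output_dir):
    """Run every tokenizer on input_file, writing one file per tokenizer.

    Returns a dict mapping tokenizer name to its output path.
    """
    outputs = {}
    for name, tokenizer in registry:
        output_file = os.path.join(output_dir, "{}.txt".format(name))
        tokenizer(input_file, output_file)
        outputs[name] = output_file
    return outputs
```

If the real tokenizer_registry iterates the same way, it could be passed to run_all in place of demo_registry, for example run_all(tokenizer_registry, 'input_file.txt', 'results'), provided the results directory already exists.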

Installation

pip install tokenizers_collection

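Updating license files and downloading models

Some of the wrapped tokenizers need an up-to-date license file (for example, pynlpir) or need model files to be downloaded (for example, pyltp), so an extra step is required after installation. All of these operations are wrapped in a single function, so running a command like the following is enough: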

python -m tokenizers_collection.helper

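Notes:

If you encounter Error: unable to fetch newest license. the cause may be an SSL problem in Python 3. See pynlpir update error or How to make Python use CA certificates from Mac OS TrustStore? for a fix.
The model files to download are large (600+ MB), so the download takes a while; how long depends on your network conditions. If an error occurs, try running the command again.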
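
The advice to re-run the command after a failure can be automated. The sketch below retries a command a fixed number of times with a pause between attempts. It assumes that the helper exits with a non-zero status when the license update or download fails, which this README does not confirm. The function run_with_retries and its parameters are hypothetical and not part of tokenizers_collection.

```python
import subprocess
import sys
import time


def run_with_retries(command, attempts=3, delay_seconds=5.0):
    """Run command until it exits with status 0 or attempts run out.

    Returns the number of the attempt that succeeded and raises
    RuntimeError if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        result = subprocess.run(command)
        if result.returncode == 0:
            return attempt
        print("Attempt {} failed with exit code {}".format(
            attempt, result.returncode))
        if attempt < attempts:
            # Pause before retrying, e.g. to ride out a network hiccup.
            time.sleep(delay_seconds)
    raise RuntimeError("command failed after {} attempts".format(attempts))


# The helper command from this README, run with the current interpreter.
HELPER_COMMAND = [sys.executable, "-m", "tokenizers_collection.helper"]
# Uncomment to run it (assumes a non-zero exit status on failure):
# run_with_retries(HELPER_COMMAND)
```

Retrying only helps with transient failures such as a dropped connection. An SSL certificate problem like the one described in the first note will fail on every attempt until the certificates are fixed.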

Credits

This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.
