Merge branch 'develop' into release/6.8
Showing 22 changed files with 1,073 additions and 644 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,197 @@ | ||
# Sudachi Plugin for Elasticsearch: Tutorial | ||
|
||
The Elasticsearch plugin supports the latest releases of 5.6 and 6.8, and each minor version of the 7.x series. | ||
|
||
The steps below show how to use Sudachi with Elasticsearch 7.5.0. | ||
|
||
First, install the plugin. | ||
|
||
``` | ||
$ sudo elasticsearch-plugin install https://github.com/WorksApplications/elasticsearch-sudachi/releases/download/v7.5.0-1.3.2/analysis-sudachi-elasticsearch7.5-1.3.2.zip | ||
``` | ||
|
||
The package does not include a dictionary. Download the latest dictionary from https://github.com/WorksApplications/SudachiDict and place it under `$ES_HOME/sudachi`. Of the three dictionaries, the steps below use the core dictionary. | ||
|
||
``` | ||
$ wget https://object-storage.tyo2.conoha.io/v1/nc_2520839e1f9641b08211a5c85243124a/sudachi/sudachi-dictionary-20191030-core.zip | ||
$ unzip sudachi-dictionary-20191030-core.zip | ||
$ sudo mkdir /etc/elasticsearch/sudachi | ||
$ sudo cp sudachi-dictionary-20191030/system_core.dic /etc/elasticsearch/sudachi | ||
``` | ||
|
||
After placing the dictionary, restart Elasticsearch. | ||
|
||
Create a settings file. | ||
|
||
```json:analysis_sudachi.json | ||
{ | ||
"settings" : { | ||
"analysis" : { | ||
"filter" : { | ||
"romaji_readingform" : { | ||
"type" : "sudachi_readingform", | ||
"use_romaji" : true | ||
}, | ||
"katakana_readingform" : { | ||
"type" : "sudachi_readingform", | ||
"use_romaji" : false | ||
} | ||
}, | ||
"analyzer" : { | ||
"sudachi_baseform_analyzer" : { | ||
"filter" : [ "sudachi_baseform" ], | ||
"type" : "custom", | ||
"tokenizer" : "sudachi_tokenizer" | ||
}, | ||
"sudachi_normalizedform_analyzer" : { | ||
"filter" : [ "sudachi_normalizedform" ], | ||
"type" : "custom", | ||
"tokenizer" : "sudachi_tokenizer" | ||
}, | ||
"sudachi_readingform_analyzer" : { | ||
"filter" : [ "katakana_readingform" ], | ||
"type" : "custom", | ||
"tokenizer" : "sudachi_tokenizer" | ||
}, | ||
"sudachi_romaji_analyzer" : { | ||
"filter" : [ "romaji_readingform" ], | ||
"type" : "custom", | ||
"tokenizer" : "sudachi_tokenizer" | ||
}, | ||
"sudachi_analyzer": { | ||
"filter": [], | ||
"tokenizer": "sudachi_tokenizer", | ||
"type": "custom" | ||
} | ||
}, | ||
"tokenizer" : { | ||
"sudachi_tokenizer": { | ||
"type": "sudachi_tokenizer", | ||
"mode": "search", | ||
"resources_path": "/etc/elasticsearch/sudachi" | ||
} | ||
} | ||
} | ||
} | ||
} | ||
``` | ||
|
||
Create an index. | ||
|
||
``` | ||
$ curl -X PUT 'localhost:9200/test_sudachi' -H 'Content-Type: application/json' -d @analysis_sudachi.json | ||
{"acknowledged":true,"shards_acknowledged":true,"index":"test_sudachi"} | ||
``` | ||
|
||
Try analyzing some text. | ||
|
||
``` | ||
$ curl -X GET "localhost:9200/test_sudachi/_analyze?pretty" -H 'Content-Type: application/json' -d'{"analyzer":"sudachi_analyzer", "text" : "関西国際空港"}' | ||
{ | ||
"tokens" : [ | ||
{ | ||
"token" : "関西国際空港", | ||
"start_offset" : 0, | ||
"end_offset" : 6, | ||
"type" : "word", | ||
"position" : 0, | ||
"positionLength" : 3 | ||
}, | ||
{ | ||
"token" : "関西", | ||
"start_offset" : 0, | ||
"end_offset" : 2, | ||
"type" : "word", | ||
"position" : 0 | ||
}, | ||
{ | ||
"token" : "国際", | ||
"start_offset" : 2, | ||
"end_offset" : 4, | ||
"type" : "word", | ||
"position" : 1 | ||
}, | ||
{ | ||
"token" : "空港", | ||
"start_offset" : 4, | ||
"end_offset" : 6, | ||
"type" : "word", | ||
"position" : 2 | ||
} | ||
] | ||
} | ||
``` | ||
|
||
Because `search` mode is specified, both the A-unit and C-unit tokens are output. | ||
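The `position` and `positionLength` fields in the response above describe a token graph: the C-unit token starts at position 0 and spans three positions, the same range that the three A-unit tokens cover one after another. The following is a minimal sketch of that relationship. The `Token` class and the helper names are hypothetical (they are not plugin or Lucene types), the values are copied from the response, and `positionLength` is assumed to be 1 wherever the response omits it.

```java
import java.util.Arrays;
import java.util.List;

final class TokenGraphSketch {

    // Hypothetical value class for illustration; not a plugin or Lucene type.
    static final class Token {
        final String term;
        final int position;
        final int positionLength;

        Token(String term, int position, int positionLength) {
            this.term = term;
            this.position = position;
            this.positionLength = positionLength;
        }

        // Position at which this token ends in the token graph.
        int endPosition() {
            return position + positionLength;
        }
    }

    // Values copied from the _analyze response above.
    // Assumption: positionLength is 1 where the response omits it.
    static List<Token> sampleTokens() {
        return Arrays.asList(
            new Token("関西国際空港", 0, 3),
            new Token("関西", 0, 1),
            new Token("国際", 1, 1),
            new Token("空港", 2, 1));
    }

    // True when the tokens after the first form an unbroken chain
    // that covers exactly the positions spanned by the first token.
    static boolean partsCoverCompound(List<Token> tokens) {
        Token compound = tokens.get(0);
        int cursor = compound.position;
        for (Token part : tokens.subList(1, tokens.size())) {
            if (part.position != cursor) {
                return false;
            }
            cursor = part.endPosition();
        }
        return cursor == compound.endPosition();
    }

    public static void main(String[] args) {
        System.out.println(partsCoverCompound(sampleTokens()));
    }
}
```

Read this way, a phrase query can match either the single long token or the sequence of short tokens, since both paths start and end at the same positions.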
|
||
Next, output verbs and adjectives in their dictionary (terminal) form. | ||
|
||
``` | ||
$ curl -X GET "localhost:9200/test_sudachi/_analyze?pretty" -H 'Content-Type: application/json' -d'{"analyzer":"sudachi_baseform_analyzer", "text" : "おおきく"}' | ||
{ | ||
"tokens" : [ | ||
{ | ||
"token" : "おおきい", | ||
"start_offset" : 0, | ||
"end_offset" : 4, | ||
"type" : "word", | ||
"position" : 0 | ||
} | ||
] | ||
} | ||
``` | ||
|
||
Output the normalized form. | ||
|
||
``` | ||
$ curl -X GET "localhost:9200/test_sudachi/_analyze?pretty" -H 'Content-Type: application/json' -d'{"analyzer":"sudachi_normalizedform_analyzer", "text" : "おおきく"}' | ||
{ | ||
"tokens" : [ | ||
{ | ||
"token" : "大きい", | ||
"start_offset" : 0, | ||
"end_offset" : 4, | ||
"type" : "word", | ||
"position" : 0 | ||
} | ||
] | ||
} | ||
``` | ||
|
||
Output the reading. | ||
|
||
``` | ||
$ curl -X GET "localhost:9200/test_sudachi/_analyze?pretty" -H 'Content-Type: application/json' -d'{"analyzer":"sudachi_readingform_analyzer", "text" : "おおきく"}' | ||
{ | ||
"tokens" : [ | ||
{ | ||
"token" : "オオキク", | ||
"start_offset" : 0, | ||
"end_offset" : 4, | ||
"type" : "word", | ||
"position" : 0 | ||
} | ||
] | ||
} | ||
``` | ||
|
||
Output the reading in romaji (Microsoft IME style). | ||
|
||
``` | ||
$ curl -X GET "localhost:9200/test_sudachi/_analyze?pretty" -H 'Content-Type: application/json' -d'{"analyzer":"sudachi_romaji_analyzer", "text" : "おおきく"}' | ||
{ | ||
"tokens" : [ | ||
{ | ||
"token" : "ookiku", | ||
"start_offset" : 0, | ||
"end_offset" : 4, | ||
"type" : "word", | ||
"position" : 0 | ||
} | ||
] | ||
} | ||
``` | ||
|
||
Other features, such as excluding tokens by part of speech and stop words, are also available. | ||
|
||
See also: [Elasticsearchのための新しい形態素解析器 「Sudachi」 - Qiita](https://qiita.com/sorami/items/99604ef105f13d2d472b) (Elastic stack Advent Calendar 2017) |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -3,17 +3,17 @@ | |
|
||
<groupId>com.worksap.nlp</groupId> | ||
<artifactId>analysis-sudachi-elasticsearch6.8</artifactId> | ||
-<version>1.3.2</version> | ||
+<version>2.0.0</version> | ||
<packaging>jar</packaging> | ||
|
||
<name>analysis-sudachi</name> | ||
|
||
<properties> | ||
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding> | ||
<java.version>1.8</java.version> | ||
-<elasticsearch.version>6.8.5</elasticsearch.version> | ||
-<lucene.version>7.7.2</lucene.version> | ||
-<sudachi.version>0.3.1</sudachi.version> | ||
+<elasticsearch.version>6.8.9</elasticsearch.version> | ||
+<lucene.version>7.7.3</lucene.version> | ||
+<sudachi.version>0.4.1</sudachi.version> | ||
<sonar.host.url>https://sonarcloud.io</sonar.host.url> | ||
<sonar.language>java</sonar.language> | ||
<sonar.organization>worksapplications</sonar.organization> | ||
|
@@ -153,4 +153,4 @@ | |
<developerConnection>scm:git:git@github.com:WorksApplications/elasticsearch-sudachi.git</developerConnection> | ||
<url>https://github.com/WorksApplications/elasticsearch-sudachi</url> | ||
</scm> | ||
-</project> | ||
+</project> |
src/main/java/com/worksap/nlp/elasticsearch/sudachi/index/SudachiSplitFilterFactory.java (45 changes: 45 additions & 0 deletions)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,45 @@ | ||
/* | ||
* Copyright (c) 2020 Works Applications Co., Ltd. | ||
* | ||
* Licensed under the Apache License, Version 2.0 (the "License"); | ||
* you may not use this file except in compliance with the License. | ||
* You may obtain a copy of the License at | ||
* | ||
* http://www.apache.org/licenses/LICENSE-2.0 | ||
* | ||
* Unless required by applicable law or agreed to in writing, software | ||
* distributed under the License is distributed on an "AS IS" BASIS, | ||
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
* See the License for the specific language governing permissions and | ||
* limitations under the License. | ||
*/ | ||
|
||
package com.worksap.nlp.elasticsearch.sudachi.index; | ||
|
||
import java.util.Locale; | ||
|
||
import com.worksap.nlp.lucene.sudachi.ja.SudachiSplitFilter; | ||
import com.worksap.nlp.lucene.sudachi.ja.SudachiSplitFilter.Mode; | ||
|
||
import org.apache.lucene.analysis.TokenStream; | ||
import org.elasticsearch.common.settings.Settings; | ||
import org.elasticsearch.env.Environment; | ||
import org.elasticsearch.index.IndexSettings; | ||
import org.elasticsearch.index.analysis.AbstractTokenFilterFactory; | ||
|
||
public class SudachiSplitFilterFactory extends AbstractTokenFilterFactory { | ||
|
||
private static final String MODE_PARAM = "mode"; | ||
|
||
private final Mode mode; | ||
|
||
public SudachiSplitFilterFactory(IndexSettings indexSettings, Environment env, String name, Settings settings) { | ||
super(indexSettings, name, settings); | ||
mode = Mode.valueOf(settings.get(MODE_PARAM, SudachiSplitFilter.DEFAULT_MODE.toString()).toUpperCase(Locale.ROOT)); | ||
} | ||
|
||
@Override | ||
public TokenStream create(TokenStream tokenStream) { | ||
return new SudachiSplitFilter(tokenStream, mode); | ||
} | ||
} |
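The constructor above resolves the `mode` setting by falling back to the filter's default, upper-casing the value with `Locale.ROOT`, and passing it to `Mode.valueOf`. The standalone sketch below mirrors that logic so its behavior can be tried without Elasticsearch on the classpath. It rests on assumptions: the `Mode` constants (`SEARCH`, `EXTENDED`) and the default value are stand-ins, because `SudachiSplitFilter` itself is not shown in this diff, and a plain `Map` replaces the Elasticsearch `Settings` object.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;

final class SplitModeSketch {

    // Stand-in for SudachiSplitFilter.Mode; the constant names are assumptions.
    enum Mode { SEARCH, EXTENDED }

    // Stand-in for SudachiSplitFilter.DEFAULT_MODE; the value is an assumption.
    static final Mode DEFAULT_MODE = Mode.SEARCH;

    private static final String MODE_PARAM = "mode";

    // Mirrors the constructor logic: read "mode", fall back to the default,
    // upper-case with Locale.ROOT, then resolve the enum constant.
    static Mode parseMode(Map<String, String> settings) {
        String raw = settings.getOrDefault(MODE_PARAM, DEFAULT_MODE.toString());
        return Mode.valueOf(raw.toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        // No "mode" key: the default is used.
        System.out.println(parseMode(Collections.<String, String>emptyMap()));
        // Matching is case-insensitive because of the upper-casing step.
        System.out.println(parseMode(Collections.singletonMap(MODE_PARAM, "extended")));
        // Unknown values are rejected by Enum.valueOf with IllegalArgumentException.
        try {
            parseMode(Collections.singletonMap(MODE_PARAM, "unknown"));
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Using `Locale.ROOT` keeps the upper-casing independent of the JVM's default locale; under a Turkish locale, for example, a lower-case `i` would otherwise map to a dotted capital `İ` and the lookup would fail for any constant containing `I`.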