
Commit

Merge branch 'develop' into release/6.8
kazuma-t committed May 27, 2020
2 parents bf5a9ce + 2b1157d commit 1a07a3a
Showing 22 changed files with 1,073 additions and 644 deletions.
376 changes: 225 additions & 151 deletions README.md

Large diffs are not rendered by default.

197 changes: 197 additions & 0 deletions docs/tutorial.md
@@ -0,0 +1,197 @@
# Sudachi Plugin for Elasticsearch: Tutorial

The Elasticsearch plugin supports the latest releases of the 5.6 and 6.8 series and every minor version of the 7.x series.

The steps below show how to use Sudachi with Elasticsearch 7.5.0.

First, install the plugin.

```
$ sudo elasticsearch-plugin install https://github.com/WorksApplications/elasticsearch-sudachi/releases/download/v7.5.0-1.3.2/analysis-sudachi-elasticsearch7.5-1.3.2.zip
```

The package does not include a dictionary. Download the latest dictionary from https://github.com/WorksApplications/SudachiDict and place it in the Sudachi resource directory (the steps below use `/etc/elasticsearch/sudachi`). Of the three available dictionaries, we use the core dictionary here.

```
$ wget https://object-storage.tyo2.conoha.io/v1/nc_2520839e1f9641b08211a5c85243124a/sudachi/sudachi-dictionary-20191030-core.zip
$ unzip sudachi-dictionary-20191030-core.zip
$ sudo mkdir /etc/elasticsearch/sudachi
$ sudo cp sudachi-dictionary-20191030/system_core.dic /etc/elasticsearch/sudachi
```

Once the dictionary is in place, restart Elasticsearch.

Next, create the settings file.

```json:analysis_sudachi.json
{
  "settings": {
    "analysis": {
      "filter": {
        "romaji_readingform": {
          "type": "sudachi_readingform",
          "use_romaji": true
        },
        "katakana_readingform": {
          "type": "sudachi_readingform",
          "use_romaji": false
        }
      },
      "analyzer": {
        "sudachi_baseform_analyzer": {
          "filter": [ "sudachi_baseform" ],
          "type": "custom",
          "tokenizer": "sudachi_tokenizer"
        },
        "sudachi_normalizedform_analyzer": {
          "filter": [ "sudachi_normalizedform" ],
          "type": "custom",
          "tokenizer": "sudachi_tokenizer"
        },
        "sudachi_readingform_analyzer": {
          "filter": [ "katakana_readingform" ],
          "type": "custom",
          "tokenizer": "sudachi_tokenizer"
        },
        "sudachi_romaji_analyzer": {
          "filter": [ "romaji_readingform" ],
          "type": "custom",
          "tokenizer": "sudachi_tokenizer"
        },
        "sudachi_analyzer": {
          "filter": [],
          "tokenizer": "sudachi_tokenizer",
          "type": "custom"
        }
      },
      "tokenizer": {
        "sudachi_tokenizer": {
          "type": "sudachi_tokenizer",
          "mode": "search",
          "resources_path": "/etc/elasticsearch/sudachi"
        }
      }
    }
  }
}
```

Create an index.

```
$ curl -X PUT 'localhost:9200/test_sudachi' -H 'Content-Type: application/json' -d @analysis_sudachi.json
{"acknowledged":true,"shards_acknowledged":true,"index":"test_sudachi"}
```

Try analyzing some text.

```
$ curl -X GET "localhost:9200/test_sudachi/_analyze?pretty" -H 'Content-Type: application/json' -d'{"analyzer":"sudachi_analyzer", "text" : "関西国際空港"}'
{
  "tokens" : [
    {
      "token" : "関西国際空港",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "word",
      "position" : 0,
      "positionLength" : 3
    },
    {
      "token" : "関西",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "国際",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "空港",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "word",
      "position" : 2
    }
  ]
}
```

Because the tokenizer is set to `search` mode, both the A-unit and C-unit tokens are output.
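Note that in the 2.0.0 sources changed by this commit, the tokenizer-level `mode` setting is rejected in favor of a `split_mode` setting (`A`/`B`/`C`) plus a dedicated `sudachi_split` token filter (see `SudachiSplitFilterFactory` and `SudachiTokenizerFactory.getMode` in the diffs below). A sketch of equivalent settings under that scheme might look like the following; the filter name `split_search` and the `mode`/`split_mode` values shown are assumptions, not taken verbatim from this commit:

```json
{
  "analysis": {
    "filter": {
      "split_search": {
        "type": "sudachi_split",
        "mode": "search"
      }
    },
    "tokenizer": {
      "sudachi_tokenizer": {
        "type": "sudachi_tokenizer",
        "split_mode": "C",
        "resources_path": "/etc/elasticsearch/sudachi"
      }
    },
    "analyzer": {
      "sudachi_search_analyzer": {
        "type": "custom",
        "tokenizer": "sudachi_tokenizer",
        "filter": [ "split_search" ]
      }
    }
  }
}
```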

Next, output verbs and adjectives in their dictionary form.

```
$ curl -X GET "localhost:9200/test_sudachi/_analyze?pretty" -H 'Content-Type: application/json' -d'{"analyzer":"sudachi_baseform_analyzer", "text" : "おおきく"}'
{
  "tokens" : [
    {
      "token" : "おおきい",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    }
  ]
}
```

Next, output tokens with normalized spelling.

```
$ curl -X GET "localhost:9200/test_sudachi/_analyze?pretty" -H 'Content-Type: application/json' -d'{"analyzer":"sudachi_normalizedform_analyzer", "text" : "おおきく"}'
{
  "tokens" : [
    {
      "token" : "大きい",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    }
  ]
}
```

Next, output the readings.

```
$ curl -X GET "localhost:9200/test_sudachi/_analyze?pretty" -H 'Content-Type: application/json' -d'{"analyzer":"sudachi_readingform_analyzer", "text" : "おおきく"}'
{
  "tokens" : [
    {
      "token" : "オオキク",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    }
  ]
}
```

Next, output the readings in romaji (Microsoft IME style).

```
$ curl -X GET "localhost:9200/test_sudachi/_analyze?pretty" -H 'Content-Type: application/json' -d'{"analyzer":"sudachi_romaji_analyzer", "text" : "おおきく"}'
{
  "tokens" : [
    {
      "token" : "ookiku",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    }
  ]
}
```

In addition, you can exclude tokens by part of speech, apply stop words, and more.
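The part-of-speech and stop-word filters are registered under the names `sudachi_part_of_speech` and `sudachi_ja_stop` (see `AnalysisSudachiPlugin` in the diff below). A hypothetical filter definition might look like the following; the parameter names `stoptags` and `stopwords` and the values shown are assumptions modeled on similar Lucene-based Japanese filters, not taken from this commit:

```json
{
  "filter": {
    "my_posfilter": {
      "type": "sudachi_part_of_speech",
      "stoptags": [ "助詞", "助動詞" ]
    },
    "my_stopfilter": {
      "type": "sudachi_ja_stop",
      "stopwords": [ "_japanese_", "これ" ]
    }
  }
}
```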

See also: [Elasticsearchのための新しい形態素解析器 「Sudachi」 - Qiita](https://qiita.com/sorami/items/99604ef105f13d2d472b) (Elastic stack Advent Calendar 2017)
10 changes: 5 additions & 5 deletions pom.xml
@@ -3,17 +3,17 @@

<groupId>com.worksap.nlp</groupId>
<artifactId>analysis-sudachi-elasticsearch6.8</artifactId>
<version>1.3.2</version>
<version>2.0.0</version>
<packaging>jar</packaging>

<name>analysis-sudachi</name>

<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<java.version>1.8</java.version>
<elasticsearch.version>6.8.5</elasticsearch.version>
<lucene.version>7.7.2</lucene.version>
<sudachi.version>0.3.1</sudachi.version>
<elasticsearch.version>6.8.9</elasticsearch.version>
<lucene.version>7.7.3</lucene.version>
<sudachi.version>0.4.1</sudachi.version>
<sonar.host.url>https://sonarcloud.io</sonar.host.url>
<sonar.language>java</sonar.language>
<sonar.organization>worksapplications</sonar.organization>
@@ -153,4 +153,4 @@
<developerConnection>scm:git:[email protected]:WorksApplications/elasticsearch-sudachi.git</developerConnection>
<url>https://github.com/WorksApplications/elasticsearch-sudachi</url>
</scm>
</project>
</project>
4 changes: 4 additions & 0 deletions src/main/assemblies/plugin.xml
@@ -9,6 +9,10 @@
<source>src/main/extras/plugin-descriptor.properties</source>
<filtered>true</filtered>
</file>
<file>
<source>LICENSE</source>
<filtered>false</filtered>
</file>
</files>
<dependencySets>
<dependencySet>
@@ -27,7 +27,7 @@
import org.elasticsearch.index.analysis.Analysis;

import com.worksap.nlp.lucene.sudachi.ja.SudachiAnalyzer;
import com.worksap.nlp.lucene.sudachi.ja.SudachiTokenizer;
import com.worksap.nlp.sudachi.Tokenizer.SplitMode;

public class SudachiAnalyzerProvider extends
AbstractIndexAnalyzerProvider<SudachiAnalyzer> {
@@ -39,10 +39,10 @@ public SudachiAnalyzerProvider(IndexSettings indexSettings,
super(indexSettings, name, settings);
final Set<?> stopWords = Analysis.parseStopWords(env, settings,
SudachiAnalyzer.getDefaultStopSet(), false);
final SudachiTokenizer.Mode mode = SudachiTokenizerFactory
final SplitMode mode = SudachiTokenizerFactory
.getMode(settings);
final String resourcesPath = new SudachiPathResolver(env.configFile()
.toString(), settings.get("resources_path", name))
.toString(), settings.get("resources_path", "sudachi"))
.resolvePathForDirectory();
final String settingsPath = new SudachiSettingsReader(env.configFile()
.toString(), settings.get("settings_path")).read();
@@ -0,0 +1,45 @@
/*
* Copyright (c) 2020 Works Applications Co., Ltd.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package com.worksap.nlp.elasticsearch.sudachi.index;

import java.util.Locale;

import com.worksap.nlp.lucene.sudachi.ja.SudachiSplitFilter;
import com.worksap.nlp.lucene.sudachi.ja.SudachiSplitFilter.Mode;

import org.apache.lucene.analysis.TokenStream;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.env.Environment;
import org.elasticsearch.index.IndexSettings;
import org.elasticsearch.index.analysis.AbstractTokenFilterFactory;

public class SudachiSplitFilterFactory extends AbstractTokenFilterFactory {

private static final String MODE_PARAM = "mode";

private final Mode mode;

public SudachiSplitFilterFactory(IndexSettings indexSettings, Environment env, String name, Settings settings) {
super(indexSettings, name, settings);
mode = Mode.valueOf(settings.get(MODE_PARAM, SudachiSplitFilter.DEFAULT_MODE.toString()).toUpperCase(Locale.ROOT));
}

@Override
public TokenStream create(TokenStream tokenStream) {
return new SudachiSplitFilter(tokenStream, mode);
}
}
@@ -26,10 +26,13 @@
import org.elasticsearch.index.analysis.AbstractTokenizerFactory;

import com.worksap.nlp.lucene.sudachi.ja.SudachiTokenizer;
import com.worksap.nlp.lucene.sudachi.ja.SudachiTokenizer.Mode;
import com.worksap.nlp.sudachi.Tokenizer.SplitMode;

public class SudachiTokenizerFactory extends AbstractTokenizerFactory {
private final Mode mode;
private static final String SPLIT_MODE_PARAM = "split_mode";
private static final String MODE_PARAM = "mode";

private final SplitMode mode;
private final boolean discardPunctuation;
private final String resourcesPath;
private final String settingsPath;
@@ -41,23 +44,28 @@ public SudachiTokenizerFactory(IndexSettings indexSettings,
mode = getMode(settings);
discardPunctuation = settings.getAsBoolean("discard_punctuation", true);
resourcesPath = new SudachiPathResolver(env.configFile().toString(),
settings.get("resources_path", name)).resolvePathForDirectory();
settings.get("resources_path", "sudachi")).resolvePathForDirectory();
settingsPath = new SudachiSettingsReader(env.configFile().toString(),
settings.get("settings_path")).read();
}

public static SudachiTokenizer.Mode getMode(Settings settings) {
SudachiTokenizer.Mode mode = SudachiTokenizer.DEFAULT_MODE;
String modeSetting = settings.get("mode", null);
public static SplitMode getMode(Settings settings) {
SplitMode mode = SudachiTokenizer.DEFAULT_MODE;
String modeSetting = settings.get(SPLIT_MODE_PARAM, null);
if (modeSetting != null) {
if ("search".equalsIgnoreCase(modeSetting)) {
mode = SudachiTokenizer.Mode.SEARCH;
} else if ("normal".equalsIgnoreCase(modeSetting)) {
mode = SudachiTokenizer.Mode.NORMAL;
} else if ("extended".equalsIgnoreCase(modeSetting)) {
mode = SudachiTokenizer.Mode.EXTENDED;
if ("a".equalsIgnoreCase(modeSetting)) {
mode = SplitMode.A;
} else if ("b".equalsIgnoreCase(modeSetting)) {
mode = SplitMode.B;
} else if ("c".equalsIgnoreCase(modeSetting)) {
mode = SplitMode.C;
}
}

if (settings.hasValue(MODE_PARAM)) {
throw new IllegalArgumentException(MODE_PARAM + " is deprecated, use SudachiSplitFilter");
}

return mode;
}

@@ -35,6 +35,7 @@
import com.worksap.nlp.elasticsearch.sudachi.index.SudachiPartOfSpeechFilterFactory;
import com.worksap.nlp.elasticsearch.sudachi.index.SudachiReadingFormFilterFactory;
import com.worksap.nlp.elasticsearch.sudachi.index.SudachiStopTokenFilterFactory;
import com.worksap.nlp.elasticsearch.sudachi.index.SudachiSplitFilterFactory;
import com.worksap.nlp.elasticsearch.sudachi.index.SudachiTokenizerFactory;

public class AnalysisSudachiPlugin extends Plugin implements AnalysisPlugin {
@@ -46,6 +47,7 @@ public Map<String, AnalysisProvider<TokenFilterFactory>> getTokenFilters() {
extra.put("sudachi_part_of_speech",
SudachiPartOfSpeechFilterFactory::new);
extra.put("sudachi_readingform", SudachiReadingFormFilterFactory::new);
extra.put("sudachi_split", SudachiSplitFilterFactory::new);
extra.put("sudachi_ja_stop", SudachiStopTokenFilterFactory::new);
return extra;
}
