elasticsearch n-gram

WonYong-Jang · Jan 5, 2025 · 488bdc9
1 parent 5174ecb
commit 488bdc9
Showing 1 changed file with 174 additions and 7 deletions.
diff --git a/_posts/elk/2024-12-29-ELK-Elastic-Search-Wildcard-N-Gram.md b/_posts/elk/2024-12-29-ELK-Elastic-Search-Wildcard-N-Gram.md
@@ -1,7 +1,7 @@
 ---
 layout: post
 title: "[ELK] Improving performance in Elasticsearch with n-gram instead of wildcard queries"
-subtitle: "Problems with wildcard search, a term-level query"    
+subtitle: "Problems with wildcard search, a term-level query / applying n-gram and a search analyzer"    
 comments: true
 categories : ELK
 date: 2024-12-29
@@ -49,8 +49,8 @@ GET index/_search
 ```
 
 The results confirm that the search works as expected.   
-Now let's change the keyword and search the sentence 나 보기가 역겨워 for the 
-word `기가`.   
+Now let's change the keyword and search the sentence '나 보기가 역겨워' for the 
+word `'기가'`.     
 
 An RDBMS finds the match, but Elasticsearch may not.   
 
@@ -60,12 +60,12 @@ An RDBMS finds the match, but Elasticsearch may
 
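 Text analysis usually indexes only nouns and gerunds, leaving out adverbs and 
 adjectives, so a word like `보기가` above is indexed as `보기`.   
-As a result, a term (token) matching the '기가' we are looking for can not be found.   
+As a result, a term (token) matching the '기가' we are looking for cannot be found.   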
 
 For reference, [nori](http://localhost:4000/elk/2021/06/18/ELK-Elastic-Search-analyze-korean.html) is 
 the analyzer most commonly used for Korean text.   
 
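-> Of course, if the analyzer is split only on whitespace for indexing, the search does work.  
+> Of course, if you use an analyzer that splits only on whitespace when indexing, the search does work.  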
 
 ```
 GET index/_analyze
@@ -93,6 +93,11 @@ Output
 }
 ```   
 
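+Also, if you use the keyword type instead of the text type shown above, 
+no analyzer is applied, so the value is indexed as is without being split into terms (tokens).   
+A wildcard query can then find the match, but because it requires a full scan, 
+it can cause performance problems.   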
+
 - - - 
 
 ## 2. wildcard, regexp performance issues   
@@ -111,6 +116,11 @@ Output
 
 ## 3. Searching with N-gram  
 
+To address the performance issue above, you can search using the [n-gram](https://esbook.kimjmin.net/06-text-analysis/6.6-token-filter/6.6.4-ngram-edge-ngram-shingle) approach 
+instead.     
+
+> Besides ngram there are also edge ngram and shingle, so choose whichever fits your business requirements.   
+
 With this approach the text is split into n-grams and each one is indexed as a term (token). Taking 엘라스틱서치 as 
 an example, it looks like this.   
 
@@ -129,7 +139,7 @@ PUT /test-index
         "my_ngram_tokenizer": {
           "type": "ngram",
           "min_gram": 1,
-          "max_gram": 9
+          "max_gram": 10
         }
       },
       "analyzer": {
@@ -140,7 +150,7 @@ PUT /test-index
         }
       }
     },
-    "index.max_ngram_diff": 10
+    "index.max_ngram_diff": 9
   }
 }
 ```
@@ -176,10 +186,167 @@ GET test-index/_analyze
 }
 ```
 
+Output
+
+```
+{
+  "tokens" : [
+    {
+      "token" : "진",
+      "start_offset" : 0,
+      "end_offset" : 1,
+      "type" : "word",
+      "position" : 0
+    },
+    {
+      "token" : "진달",
+      "start_offset" : 0,
+      "end_offset" : 2,
+      "type" : "word",
+      "position" : 1
+    },
+    {
+      "token" : "진달래",
+      "start_offset" : 0,
+      "end_offset" : 3,
+      "type" : "word",
+      "position" : 2
+    },
+    {
+      "token" : "진달래꽃",
+      "start_offset" : 0,
+      "end_offset" : 4,
+      "type" : "word",
+      "position" : 3
+    },
+...
+```
+
+Once this is applied and you run an actual search, unrelated documents 
+can be returned as well.   
+
+Below, the intent was to find documents containing '진달래꽃', but 
+documents containing '진' were returned too.   
+
+```
+GET test-index/_search
+{
+  "query": {
+    "match": {
+      "contents": "진달래꽃"
+    }
+  }
+}
+```
+
+Output
+
+```
+"hits" : [
+      {
+        "_index" : "test-index",
+        "_type" : "_doc",
+        "_id" : "3",
+        "_score" : 5.6357455,
+        "_source" : {
+          "contents" : "진달래꽃 나 보기가 역겨워 가실 때에는 말없이 고이 보내 드리우리다 "
+        }
+      },
+      {
+        "_index" : "test-index",
+        "_type" : "_doc",
+        "_id" : "4",
+        "_score" : 1.1321255,
+        "_source" : {
+          "contents" : "진짜"
+        }
+      },
+      {
+        "_index" : "test-index",
+        "_type" : "_doc",
+        "_id" : "5",
+        "_score" : 1.1321255,
+        "_source" : {
+          "contents" : "진실"
+        }
+```  
+
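+`This happens because, when no search analyzer is specified, 
+the analyzer registered for the field is also used at search time.`      
+`In other words, the search runs through the n-gram analyzer we registered.`   
+
+> The query string is split into tokens, and documents matching any of those tokens are returned.   
+
+`Here we want only documents that contain '진달래꽃', so we have to configure a 
+search analyzer ourselves.`   
+
+Add it to the index settings written earlier, and to the mapping as well, as shown below.  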
+
+```
+PUT /test-index
+{
+  "settings": {
+    "analysis": {
+      "tokenizer": {
+        "my_ngram_tokenizer": {
+          "type": "ngram",
+          "min_gram": 1,
+          "max_gram": 10
+        }
+      },
+      "analyzer": {
+        "my_ngram_analyzer": {
+          "type": "custom",
+          "tokenizer": "my_ngram_tokenizer",
+          "filter": ["lowercase"]
+        },
+        "my_search_analyzer":{
+          "type":"custom",
+          "tokenizer":"standard",
+          "filter": [
+            "lowercase"
+          ]
+        }
+      }
+    },
+    "index.max_ngram_diff": 9
+  }
+}
+```
+
+```
+PUT /test-index/_mapping
+{
+  "properties": {
+    "contents": {
+      "type": "text",
+      "analyzer": "my_ngram_analyzer",
+      "search_analyzer": "my_search_analyzer",
+      "fields": {
+        "keyword": {
+          "type": "keyword"
+        }
+      }
+    }
+  }
+}
+```
+
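+After adding the mapping as above, the search returns documents containing the search term, as intended.   
+
+`Using the ngram token filter makes the number of stored terms grow sharply, and searching for 'ho' 
+matches house, shoes, and so on, which makes results hard to predict, so it is best avoided for general text 
+search.`   
+
+`ngram is a good fit for features such as autocomplete over data sets whose total size is small, 
+such as category lists or tag lists.`   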
+
+
 - - - 
 
 **Reference**   
 
+<https://opster.com/guides/elasticsearch/search-apis/elasticsearch-wildcard-queries/>   
 <https://findstar.pe.kr/2018/07/14/elasticsearch-wildcard-to-ngram/>   
 <https://dgahn.tistory.com/44>
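
A note on the search analyzer section added in this change: the mismatch between index-time and search-time analysis can be reproduced without a cluster. The sketch below is plain Python, not Elasticsearch code. `ngram_tokens`, `whitespace_tokens`, and `search` are hypothetical helpers that only approximate the `ngram` tokenizer (`min_gram` 1, `max_gram` 10) and a search analyzer that splits on whitespace. Lowercasing and scoring are ignored, and the document bodies are shortened versions of the ones in the post.

```python
def ngram_tokens(text, min_gram=1, max_gram=10):
    """Approximation of an ngram tokenizer: every substring of length min_gram..max_gram."""
    tokens = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            if start + size > len(text):
                break
            tokens.append(text[start:start + size])
    return tokens


def whitespace_tokens(text):
    """Stand-in for a search analyzer that keeps each whitespace-separated word whole."""
    return text.split()


# Shortened sample documents (assumed for illustration).
docs = {
    "3": "진달래꽃 나 보기가 역겨워",
    "4": "진짜",
    "5": "진실",
}

# Index time: every document is stored as its set of n-gram terms.
index = {doc_id: set(ngram_tokens(body)) for doc_id, body in docs.items()}


def search(query, analyze):
    """Return ids of documents whose indexed terms contain any query token."""
    query_tokens = analyze(query)
    return sorted(
        doc_id
        for doc_id, terms in index.items()
        if any(token in terms for token in query_tokens)
    )


# Same analyzer on both sides: the query is also split into '진', '진달', ...
same_analyzer_hits = search("진달래꽃", ngram_tokens)
# Separate search analyzer: the query stays one token, '진달래꽃'.
separate_analyzer_hits = search("진달래꽃", whitespace_tokens)

print(same_analyzer_hits)      # ['3', '4', '5']
print(separate_analyzer_hits)  # ['3']
```

With the n-gram analyzer on the query side, the single-character token '진' is enough to pull in documents 4 and 5. Keeping the query as one token leaves only document 3, which is the effect the `search_analyzer` mapping is meant to achieve.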