Commit

wip: more migration to opensearch
nboyse committed Feb 7, 2025
1 parent c44e222 commit d0b4554
Showing 21 changed files with 68 additions and 127 deletions.
2 changes: 0 additions & 2 deletions django_app/.vscode/launch.json
@@ -15,7 +15,6 @@
 "MINIO_HOST": "localhost",
 "POSTGRES_HOST": "localhost",
 "UNSTRUCTURED_HOST": "localhost",
-"ELASTIC__HOST": "localhost"
 }
 },

@@ -33,7 +32,6 @@
 "MINIO_HOST": "localhost",
 "POSTGRES_HOST": "localhost",
 "UNSTRUCTURED_HOST": "localhost",
-"ELASTIC__HOST": "localhost"
 }
 },
 ]
6 changes: 0 additions & 6 deletions django_app/.vscode/tasks.json
@@ -16,7 +16,6 @@
 "MINIO_HOST": "localhost",
 "POSTGRES_HOST": "localhost",
 "UNSTRUCTURED_HOST": "localhost",
-"ELASTIC__HOST": "localhost",
 }
 },
 "presentation": {
@@ -34,7 +33,6 @@
 "MINIO_HOST": "localhost",
 "POSTGRES_HOST": "localhost",
 "UNSTRUCTURED_HOST": "localhost",
-"ELASTIC__HOST": "localhost",
 }
 },
 "presentation": {
@@ -52,7 +50,6 @@
 "MINIO_HOST": "localhost",
 "POSTGRES_HOST": "localhost",
 "UNSTRUCTURED_HOST": "localhost",
-"ELASTIC__HOST": "localhost",
 }
 },
 "presentation": {
@@ -70,7 +67,6 @@
 "MINIO_HOST": "localhost",
 "POSTGRES_HOST": "localhost",
 "UNSTRUCTURED_HOST": "localhost",
-"ELASTIC__HOST": "localhost",
 }
 },
 "presentation": {
@@ -88,7 +84,6 @@
 "MINIO_HOST": "localhost",
 "POSTGRES_HOST": "localhost",
 "UNSTRUCTURED_HOST": "localhost",
-"ELASTIC__HOST": "localhost",
 }
 },
 "presentation": {
@@ -106,7 +101,6 @@
 "MINIO_HOST": "localhost",
 "POSTGRES_HOST": "localhost",
 "UNSTRUCTURED_HOST": "localhost",
-"ELASTIC__HOST": "localhost",
 }
 },
 "presentation": {
2 changes: 1 addition & 1 deletion docker-compose.yml
@@ -120,7 +120,7 @@ services:
 start_period: 30s

 opensearch:
-image: opensearchproject/opensearch:2.17.0
+image: opensearchproject/opensearch:2.18.0
 environment:
 - discovery.type=single-node
 - OPENSEARCH_JAVA_OPTS=-Xms512m -Xmx512m
6 changes: 3 additions & 3 deletions docs/DEVELOPER_SETUP.md
@@ -198,16 +198,16 @@ elasticdump \
 --type=data
 ```

-### Loading data to Elasticsearch
+### Loading data to Opensearch

 If you've been provided with a dump from the vector store, add it to [data/elastic-dumps/](../data/elastic-dumps/). The below assumes the existance of `redbox-data-chunk.json` in that directory.

 Consider dumping your existing indices if you don't want to have to reembed data you're working on.

-Start the Elasticsearch service.
+Start the Opensearch service.

 ```console
-docker compose up -d elasticsearch
+docker compose up -d opensearch
 ```

 Load data from your JSONs, or your own file.
2 changes: 1 addition & 1 deletion docs/architecture/index.md
@@ -34,7 +34,7 @@ The Retrieval Augmented Generation (RAG) architecture grounds our Large Language
 | Core API | ECS | App Service | Docker | FastAPI AI Interaction and DB Intermediary |
 | Worker | ECS | App Service | Docker | Queue fed file ingester and embedder |
 | Database | RDS/Postgres | Postgres | Postgres | Chat history & user data |
-| Vector Database | ElasticCloud | ElasticCloud | Elasticsearch | RAG Database |
+| Vector Database | ElasticCloud | ElasticCloud | Opensearch | RAG Database |
 | Container Registry | ECR | ACR | Harbor | Storage for app containers |
 | Embedding API | Azure OpenAI Service | Azure OpenAI Service | Huggingface Containers | Embedding for docs into VectorDB |
 | LLM API | Azure OpenAI Service | Azure OpenAI Service | Huggingface Containers | Chat model |
10 changes: 5 additions & 5 deletions docs/architecture/transactions_and_schema.md
@@ -16,9 +16,9 @@ sequenceDiagram
 Django->>S3: file key, content
 Django->>Core: file key
 Core->>Workers: file key
-Core->>Elastic: file key
+Core->>Opensearch: file key
 S3->>Workers: file content
-Workers->>Elastic: chunk key, content
+Workers->>Opensearch: chunk key, content
 ```

 ### Chat APIs
@@ -44,7 +44,7 @@ title: Transaction sequence - POST /chat/rag
 sequenceDiagram
 Django->> Core: ChatHistory.messages[], File[].uuid
-Elastic->>Core: File[].Chunk[].embeddings
+Opensearch->>Core: File[].Chunk[].embeddings
 Core->>LLM API: ChatHistory.messages[].embeddings, File[].Chunk[].embeddings
 ```
@@ -101,13 +101,13 @@ erDiagram
 ChatHistory }|--o{ FileRecord: "ChatHistory.files_retrieved"
 ```

-### Elastic Schema
+### Opensearch Schema

 Keeping things simple is the primary ethos here. We are storing the UUID of the parent file in the chunk. This allows us to easily query for all chunks of a file. We are also storing the text of the chunk, the metadata of the chunk, and the embedding of the chunk. The embedding is a float array that is generated by the embedding API.

 ```mermaid
 ---
-title: Elastic schema
+title: Opensearch schema
 ---
 erDiagram
12 changes: 2 additions & 10 deletions docs/code_reference/models/settings.md
@@ -4,14 +4,6 @@ Redbox used the `pydantic_settings` library to manage settings. This library all

 ::: redbox.models.settings.Settings

-# Elasticsearch Settings
+# OpenSearch Settings

-Depending on the deployment scenarios we have two different ways to configure Elasticsearch: `ElasticLocalSettings` and `ElasticCloudSettings`.
-
-## `ElasticLocalSettings`
-
-::: redbox.models.settings.ElasticLocalSettings
-
-## `ElasticCloudSettings`
-
-::: redbox.models.settings.ElasticCloudSettings
+We configure Opensearch via `OpenSearchSettings` in redbox-core/redbox/models/settings.py
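
Review note: the new doc line names `OpenSearchSettings` but no longer shows how it is populated. As a rough sketch only, assuming the class exposes a `collection_endpoint` field (the ingester in this commit reads `env.opensearch.collection_endpoint`) and that nested values arrive through `__`-delimited environment variables in the style of the removed `ELASTIC__HOST`, loading might look like the following. The variable name `OPENSEARCH__COLLECTION_ENDPOINT`, the default URL, and the dataclass stand-in are assumptions for illustration; the real class lives in `redbox-core/redbox/models/settings.py` and is built on `pydantic_settings`.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class OpenSearchSettingsSketch:
    """Hypothetical stand-in for redbox's OpenSearchSettings (not the real class)."""

    collection_endpoint: str


def load_opensearch_settings(
    environ: Optional[Mapping[str, str]] = None,
) -> OpenSearchSettingsSketch:
    """Read the endpoint from a `__`-delimited variable, else use a local default."""
    source = os.environ if environ is None else environ
    # The variable name and the default URL are assumptions for illustration.
    endpoint = source.get("OPENSEARCH__COLLECTION_ENDPOINT", "http://localhost:9200")
    return OpenSearchSettingsSketch(collection_endpoint=endpoint)
```

Passing a plain dict instead of relying on `os.environ` keeps the sketch easy to exercise in tests without touching the process environment.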
8 changes: 4 additions & 4 deletions docs/installation/local.md
@@ -24,7 +24,7 @@ As the project deploys, you should eventually see the following message:
 ```
 [+] Running 8/8
 ✔ Network redbox_redbox-app-network Created 0.0s
-✔ Container redbox-elasticsearch-1 Healthy 22.7s
+✔ Container redbox-opensearch-1 Healthy 22.7s
 ✔ Container redbox-redis-1 Healthy 22.7s
 ✔ Container redbox-minio-1 Healthy 22.7s
 ✔ Container redbox-db-1 Healthy 22.7s
@@ -35,11 +35,11 @@ As the project deploys, you should eventually see the following message:

 Redbox utilises health checks to ensure that the services are running correctly.

-!!! info "Elastic and Minio failure"
-If you see that the Elasticsearch or MinIO containers are unhealthy, this may be due to a permission issue with the directory they're mounted to. You can fix this by running the following command:
+!!! info "Opensearch and Minio failure"
+If you see that the Opensearch or MinIO containers are unhealthy, this may be due to a permission issue with the directory they're mounted to. You can fix this by running the following command:

 ```bash
-chmod -R 777 ./data/elastic/
+chmod -R 777 ./data/opensearch/
 chmod -R 777 ./data/objectstore/
 ```

10 changes: 5 additions & 5 deletions redbox-core/poetry.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

1 change: 0 additions & 1 deletion redbox-core/pyproject.toml
@@ -13,7 +13,6 @@ readme = "../README.md"
 [tool.poetry.dependencies]
 python = ">=3.12,<3.13"
 pydantic = "^2.7.1"
-elasticsearch = "^8.15.0"
 langchain-community = ">0.2.12"
 langchain = "^0.3.4"
 langchain_openai = ">0.1.21"
3 changes: 1 addition & 2 deletions redbox-core/redbox/chains/components.py
@@ -11,7 +11,6 @@
 from langchain_core.runnables import Runnable
 from langchain_core.utils import convert_to_secret_str

-# from langchain_elasticsearch import ElasticsearchRetriever
 from langchain_openai.embeddings import AzureOpenAIEmbeddings, OpenAIEmbeddings


@@ -98,7 +97,7 @@ def get_all_chunks_retriever(env: Settings) -> OpenSearchRetriever:


 def get_parameterised_retriever(env: Settings, embeddings: Embeddings | None = None):
-"""Creates an Elasticsearch retriever runnable.
+"""Creates an Opensearch retriever runnable.
 Runnable takes input of a dict keyed to question, file_uuids and user_uuid.
3 changes: 1 addition & 2 deletions redbox-core/redbox/graph/nodes/tools.py
@@ -3,7 +3,6 @@
 import numpy as np
 import requests
 import tiktoken
-from elasticsearch import Elasticsearch
 from opensearchpy import OpenSearch
 from langchain_community.utilities import WikipediaAPIWrapper
 from langchain_core.documents import Document
@@ -30,7 +29,7 @@


 def build_search_documents_tool(
-es_client: Union[Elasticsearch, OpenSearch],
+es_client: OpenSearch,
 index_name: str,
 embedding_model: Embeddings,
 embedding_field_name: str,
17 changes: 2 additions & 15 deletions redbox-core/redbox/loader/ingester.py
@@ -25,33 +25,20 @@


 def get_elasticsearch_store(es, es_index_name: str):
-# return ElasticsearchStore(
-# index_name=es_index_name,
-# embedding=get_embeddings(env),
-# es_connection=es,
-# query_field="text",
-# vector_query_field=env.embedding_document_field_name,
-# )
 return OpenSearchVectorSearch(
 index_name=es_index_name,
-opensearch_url=env.elastic.collection_endpoint,
+opensearch_url=env.opensearch.collection_endpoint,
 embedding_function=get_embeddings(env),
 query_field="text",
 vector_query_field=env.embedding_document_field_name,
 )


 def get_elasticsearch_store_without_embeddings(es, es_index_name: str):
-# return ElasticsearchStore(
-# index_name=es_index_name,
-# es_connection=es,
-# query_field="text",
-# strategy=BM25Strategy(),
-# )

 return OpenSearchVectorSearch(
 index_name=es_index_name,
-opensearch_url=env.elastic.collection_endpoint,
+opensearch_url=env.opensearch.collection_endpoint,
 embedding_function=FakeEmbeddings(size=env.embedding_backend_vector_size),
 )

2 changes: 1 addition & 1 deletion redbox-core/redbox/models/chain.py
@@ -60,7 +60,7 @@ class AISettings(BaseModel):
 chat_map_question_prompt: str = prompts.CHAT_MAP_QUESTION_PROMPT
 reduce_system_prompt: str = prompts.REDUCE_SYSTEM_PROMPT

-# Elasticsearch RAG and boost values
+# Opensearch RAG and boost values
 rag_k: int = 30
 rag_num_candidates: int = 10
 rag_gauss_scale_size: int = 3
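
Review note: this commit is marked WIP, and several Elastic-era names are still visible in the hunks shown here, for example `get_elasticsearch_store`, the `es_client` parameter, the `data/elastic-dumps/` path, and the `ElasticCloud` table cells. A small helper along these lines could list what remains before the migration is called done. It is a sketch using only the standard library; the function name, the pattern, and the suffix list are hypothetical and not part of the repository.

```python
import re
from pathlib import Path

# Case-insensitive, so it matches "elastic", "Elasticsearch", "ELASTIC__HOST", etc.
LEFTOVER_PATTERN = re.compile(r"elastic", re.IGNORECASE)


def find_leftover_references(root, suffixes=(".py", ".md", ".json", ".yml", ".toml")):
    """Return (path, line_number, stripped_line) tuples for lines mentioning Elastic."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        # Skip directories and file types outside the suffix allow-list.
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for number, line in enumerate(text.splitlines(), start=1):
            if LEFTOVER_PATTERN.search(line):
                hits.append((str(path), number, line.strip()))
    return hits


if __name__ == "__main__":
    for file_path, line_number, line in find_leftover_references("."):
        print(f"{file_path}:{line_number}: {line}")
```

Some hits may be intentional (for instance the `elasticdump` tool mentioned in the developer setup doc), so the output is a checklist to review rather than a list of errors.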