Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PosgreSQL hybrid search #958

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

SignalRT
Copy link

This PR Includes:

  • Text search index
  • Hybrid Search configuration
  • Vector or hybrid search

Motivation and Context (Why the change? What's the scenario?)

Vector search do not provide the best results in an important number of scenarios. Hybrid search provides better results.

High level description (Approach, Design)

This change includes a new parameter for activate hybrid search on PostgreSQL extension. This parameters defaults to the previous implementation (vector search).
The search will use vector search or hybrid search depending on the parameter.

Include:
- Text search index
- Hybrid Search configuration
- Vector or hybrid search
@SignalRT
Copy link
Author

@microsoft-github-policy-service agree

@SignalRT SignalRT marked this pull request as ready for review December 29, 2024 12:24
@SignalRT SignalRT requested a review from dluc as a code owner December 29, 2024 12:24
@@ -175,7 +183,8 @@ public async Task CreateTableAsync(
{this._colContent} TEXT DEFAULT '' NOT NULL,
{this._colPayload} JSONB DEFAULT '{{}}'::JSONB NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_tags ON {tableName} USING GIN({this._colTags});
CREATE INDEX IF NOT EXISTS {tableName}_idx_tags ON {tableName} USING GIN({this._colTags});
CREATE INDEX IF NOT EXISTS {tableName}_idx_content ON {tableName} USING GIN(to_tsvector('english',{this._colContent}));
Copy link
Collaborator

@dluc dluc Jan 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For users already using KM + PG, their tables will lack this index. I'm assuming that if the index is missing, it will only affect performance, making hybrid search very slow, without causing any error.

@@ -175,7 +183,8 @@ public async Task CreateTableAsync(
{this._colContent} TEXT DEFAULT '' NOT NULL,
{this._colPayload} JSONB DEFAULT '{{}}'::JSONB NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_tags ON {tableName} USING GIN({this._colTags});
CREATE INDEX IF NOT EXISTS {tableName}_idx_tags ON {tableName} USING GIN({this._colTags});
CREATE INDEX IF NOT EXISTS {tableName}_idx_content ON {tableName} USING GIN(to_tsvector('english',{this._colContent}));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the language be configurable?

@@ -175,7 +183,8 @@ public async Task CreateTableAsync(
{this._colContent} TEXT DEFAULT '' NOT NULL,
{this._colPayload} JSONB DEFAULT '{{}}'::JSONB NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_tags ON {tableName} USING GIN({this._colTags});
CREATE INDEX IF NOT EXISTS {tableName}_idx_tags ON {tableName} USING GIN({this._colTags});
CREATE INDEX IF NOT EXISTS {tableName}_idx_content ON {tableName} USING GIN(to_tsvector('english',{this._colContent}));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -413,6 +426,8 @@ DO UPDATE SET

// Column names
string columns = withEmbeddings ? this._columnsListWithEmbeddings : this._columnsListNoEmbeddings;
string columnsHibrid = this._columnsListHybrid;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: var name typo columnsHibrid => columnsHybrid

Comment on lines +429 to +430
string columnsHibrid = this._columnsListHybrid;
string columnsListHybridCoalesce = this._columnsListHybridCoalesce;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these 2 vars could be removed, is there a reason to copy the values?

Comment on lines +473 to +475
FROM {tableName}, plainto_tsquery('english', @query) query
WHERE {filterSqlHybridText} AND to_tsvector('english', {this._colContent}) @@ query
ORDER BY ts_rank_cd(to_tsvector('english', {this._colContent}), query) DESC
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto: language should be configurable

Comment on lines +480 to +481
COALESCE(1.0 / (60 + semantic_search.rank), 0.0) +
COALESCE(1.0 / (60 + keyword_search.rank), 0.0) AS {colDistance}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you document or point to documentation explaining these formulas? e.g. what is 60?

COALESCE(semantic_search.{this._colTags}, keyword_search.{this._colTags}) AS {this._colTags},
COALESCE(semantic_search.{this._colContent}, keyword_search.{this._colContent}) AS {this._colContent},
COALESCE(semantic_search.{this._colPayload}, keyword_search.{this._colPayload}) AS {this._colPayload},
COALESCE(semantic_search.{this._colEmbedding}, keyword_search.{this._colEmbedding}) AS {this._colEmbedding},
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove the trailing comma (consistency with the way these values are used)

Copy link
Collaborator

@dluc dluc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks almost ready to merge, only a few minor tweaks, please see the comments inline. Two most important:

  • configurable language
  • documenting the new SQL and the hard coded calculations

@dluc dluc added the waiting for author Waiting for author to reply or address comments label Jan 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
waiting for author Waiting for author to reply or address comments
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants