text_similarity_reranker fails with NPE when zero results are returned from a shard #119617
Comments
Pinging @elastic/ml-core (Team:ML)
Pinging @elastic/search-eng (Team:SearchOrg)
Pinging @elastic/search-relevance (Team:Search - Relevance)
@kderusso I tried replicating in a yaml test and I could not. I wonder what's different?
@benwtrent Your test doesn't replicate the issue because the fetch search phase, where the NPE is thrown, is skipped when the index is empty (or when there are otherwise no documents to fetch). The issue with our replication attempts to date is that we rerank all the docs fetched by the standard retriever. Thus, if the reranker's doc set is empty, so is the standard retriever's, and the fetch search phase is skipped. The bug occurs when the reranker's doc set is empty, but the upstream retriever's doc set is not. This creates a scenario where the fetch search phase is executed on (what should be) an empty result set.

I was able to reproduce the bug with this scenario:

PUT _inference/rerank/ms-marco-minilm-l-6-v2
{
"service": "elasticsearch",
"task_type": "rerank",
"service_settings": {
"num_allocations": 1,
"num_threads": 1,
"model_id": "cross-encoder__ms-marco-minilm-l-6-v2"
},
"task_settings": {
"return_documents": true
}
}
PUT /my-index
{
"mappings": {
"properties": {
"text": {
"type": "text"
},
"rerank_text": {
"type": "text"
}
}
}
}
PUT /my-index2
{
"mappings": {
"properties": {
"text": {
"type": "text"
},
"rerank_text": {
"type": "text"
}
}
}
}
PUT /my-index3
{
"mappings": {
"properties": {
"text": {
"type": "text"
},
"rerank_text": {
"type": "text"
}
}
}
}
POST _aliases
{
"actions": [
{
"add": {
"indices": [ "my-index", "my-index2", "my-index3" ],
"alias": "my-alias"
}
}
]
}
POST /my-index2/_doc/
{
"text": "The moon occasionally covers the sun during an eclipse.",
"rerank_text": "The moon occasionally covers the sun during an eclipse."
}
POST /my-index2/_doc/
{
"text": "Solar eclipses are a type of eclipse that occurs when the Moon passes between the Earth and the Sun.",
"rerank_text": "Solar eclipses are a type of eclipse that occurs when the Moon passes between the Earth and the Sun."
}
POST /my-index/_doc/
{
"text": "How often does the moon hide the sun? This happens during a solar eclipse."
}
POST /my-index3/_doc/
{
"text": "The sun is larger than the moon but appears the same size in the sky because it is much further away.",
"rerank_text": "The sun is larger than the moon but appears the same size in the sky because it is much further away."
}
POST my-alias/_search
{
"retriever": {
"text_similarity_reranker": {
"retriever": {
"standard": {
"query": {
"match": {
"text": "moon"
}
}
}
},
"field": "rerank_text",
"inference_id": "ms-marco-minilm-l-6-v2",
"inference_text": "How often does the moon hide the sun?",
"rank_window_size": 100
}
}
}
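To make that scenario concrete, here is a minimal, standalone Java sketch — illustrative only, with hypothetical names, not the actual Elasticsearch classes. It models the skip decision being keyed off the upstream retriever's hits, so a shard whose reranked doc set is empty still reaches the fetch step, which then dereferences a per-shard result that was never produced.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of the scenario above (hypothetical names, not Elasticsearch code): the
// skip check looks at the upstream retriever's hits, so an empty reranked doc set
// can still reach the fetch step, which then reads a per-shard entry that doesn't exist.
public class EmptyRerankFetchSketch {
    public static void main(String[] args) {
        List<String> upstreamHits = List.of("doc-1");                    // standard retriever matched a doc
        Map<String, List<String>> rerankedHitsByShard = new HashMap<>(); // reranker kept nothing

        if (!upstreamHits.isEmpty()) {                                   // fetch is NOT skipped
            List<String> toFetch = rerankedHitsByShard.get("shard-0");   // null: no entry for this shard
            System.out.println("fetching " + toFetch.size() + " docs");  // NullPointerException
        }
    }
}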
I tried adding the following to the yaml test. Does it need to be more than one node? I am trying to see about a test that is repeatable in our CI for whoever needs to fix this.
@benwtrent I'm having trouble getting this to reliably fail as well. I'm confident the issue is triggered when there are no docs to rerank, but the fetch search phase is run. The complication is that there's logic in the fetch search phase to short-circuit when there are no docs to fetch (here and here). However, the existence of the bug tells us that this short-circuit logic isn't always enough. I will continue investigating to try to figure out the combination of factors that reproduces the issue.
Ok, I think I figured it out for real this time. And if so, the bug is worse than we originally thought: the NPE is a symptom of a deeper issue.

What I think is happening is that we are using inconsistent shard indices when querying via an alias. In the first part of the rank feature phase, we consider all shards queried as a single group, regardless of the actual backing index, so if we were querying two indices, the shards are simply numbered 0, 1, 2, ... across both. However, in the second part of the rank feature phase, we break down shards by their backing indices. We still use results from the first part though, resulting in potential shard index mismatches. Continuing the example above, if the doc we actually want to rerank is in a shard whose per-index id differs from its position in the request, we end up reading a per-shard result that doesn't exist, which is where the NPE surfaces.

But the problem goes deeper. Due to the shard index mismatch, if there are results for the shard we are mistakenly addressing, we are reranking the wrong document. This is bad.

I've been able to recreate the issue with the setup below; a standalone sketch of the numbering mismatch follows the reproduction requests. Let me know if it doesn't work for you. It depends on the reranker model configured in the first request:

PUT _inference/rerank/ms-marco-minilm-l-6-v2
{
"service": "elasticsearch",
"task_type": "rerank",
"service_settings": {
"num_allocations": 1,
"num_threads": 1,
"model_id": "cross-encoder__ms-marco-minilm-l-6-v2"
},
"task_settings": {
"return_documents": true
}
}
PUT /first-index
{
"mappings": {
"properties": {
"text": {
"type": "text"
}
}
}
}
PUT /second-index
{
"settings": {
"number_of_shards": 2,
"number_of_replicas": 0
},
"mappings": {
"properties": {
"text": {
"type": "text"
}
}
}
}
POST /second-index/_doc/
{
"text": "How often does the moon hide the sun? This happens during a solar eclipse."
}
POST _aliases
{
"actions": [
{
"add": {
"indices": [ "first-index", "second-index" ],
"alias": "my-alias"
}
}
]
}
// Triggers bug
POST my-alias/_search
{
"retriever": {
"text_similarity_reranker": {
"retriever": {
"standard": {
"query": {
"match": {
"text": "moon"
}
}
}
},
"field": "text",
"inference_id": "ms-marco-minilm-l-6-v2",
"inference_text": "How often does the moon hide the sun?",
"rank_window_size": 100
}
}
}
// Does not trigger bug
POST second-index/_search
{
"retriever": {
"text_similarity_reranker": {
"retriever": {
"standard": {
"query": {
"match": {
"text": "moon"
}
}
}
},
"field": "text",
"inference_id": "ms-marco-minilm-l-6-v2",
"inference_text": "How often does the moon hide the sun?",
"rank_window_size": 100
}
}
}
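To make the numbering mismatch concrete, here is a standalone Java sketch — illustrative only, with hypothetical names rather than Elasticsearch classes, and with an assumed shard ordering. One half of the toy phase stores per-shard results under the shard's position in the overall request, while the other half reads them back using the per-index shard id, so the lookup either misses entirely (the null that surfaces as the NPE) or lands on another shard's entry (the wrong-document case).

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative toy model of the shard index mismatch described above; not Elasticsearch code.
public class ShardIndexMismatchSketch {

    // A shard is identified by its backing index plus its per-index shard id.
    record Shard(String indexName, int perIndexShardId) {}

    public static void main(String[] args) {
        // my-alias resolves to first-index (1 shard) and second-index (2 shards), as in the repro;
        // the request-order positions below are assumed for the sake of the example.
        List<Shard> shardsInRequestOrder = List.of(
            new Shard("first-index", 0),
            new Shard("second-index", 0),
            new Shard("second-index", 1)
        );

        // "First part of the phase": results keyed by the shard's position in the whole request.
        Map<Integer, String> resultsByRequestPosition = new HashMap<>();
        resultsByRequestPosition.put(2, "doc from second-index, shard 1");

        // "Second part of the phase": results read back by the per-index shard id instead.
        for (Shard shard : shardsInRequestOrder) {
            String result = resultsByRequestPosition.get(shard.perIndexShardId());
            System.out.println(shard + " -> " + result);
        }
        // second-index shard 1 looks up key 1 instead of key 2: the stored doc is never found
        // (a null that becomes the NPE), and had another shard stored a result under key 1,
        // the wrong document would be reranked instead.
    }
}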
Thanks @Mikep86 for the thorough investigation; it must have been quite a pain to track down and reproduce :/ I was able to reproduce it as well. I think that the main issue is in RankFeatureShardPhase, where we build the per-shard rank feature result:

+++ b/server/src/main/java/org/elasticsearch/search/rank/feature/RankFeatureShardPhase.java
// FetchSearchResult#shardResult()
SearchHits hits = fetchSearchResult.hits();
RankFeatureShardResult featureRankShardResult = (RankFeatureShardResult) rankFeaturePhaseRankShardContext
- .buildRankFeatureShardResult(hits, searchContext.shardTarget().getShardId().id());
+ .buildRankFeatureShardResult(hits, searchContext.request().shardRequestIndex());
// save the result in the search context
// need to add profiling info as well available from fetch
if (featureRankShardResult != null) {
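If I'm reading the diff right (my interpretation, not stated elsewhere in the thread): ShardId#id() is the shard's number within its own backing index, so it can repeat across the indices behind an alias, whereas ShardSearchRequest#shardRequestIndex() is the shard's position within the overall search request, which keeps both parts of the rank feature phase addressing per-shard results with the same numbering.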
Elasticsearch Version
8.16.1 to current
Installed Plugins
No response
Java Version
bundled
OS Version
Reproducible in cloud
Problem Description
When performing a text_similarity_reranker search on multiple indices, if one or more shards return an empty result (no documents to rerank), the request fails with a null pointer exception.

Potentially this is happening in the RankFeaturePhase on null or empty entry sets.
Proposed fix from @Mikep86, but it should probably be refactored to happen on the coordinator to be more performant.
Steps to Reproduce
This can be reproduced by running MS MARCO MiniLM on an alias pointing to empty indices.
Logs (if relevant)
No response