
[Elastic Inference Service] Add ElasticInferenceService Unified ChatCompletions Integration #118871

Merged: 67 commits into main from ml-eis-integration-jbc on Jan 8, 2025

Conversation

@jaybcee (Member) commented Dec 17, 2024

Parent PR: #118301

We need to call EIS via Elasticsearch. This PR implements the functionality.

Testing

Run via:

1. `./gradlew localDistro`
2. `cd build/distribution/local/elasticsearch-9.0.0-SNAPSHOT`
3. `./bin/elasticsearch -E xpack.inference.elastic.url=https://localhost:8443 -E xpack.inference.elastic.http.ssl.verification_mode=none -E xpack.security.enabled=false -E xpack.security.enrollment.enabled=false`
4. Create an endpoint via:

curl --location --request PUT 'http://localhost:9200/_inference/completion/test' \
--header 'Content-Type: application/json' \
--data '{
    "service": "elastic",
    "service_settings": {
        "model_id": "elastic-model"
    }
}' -k

Notes:

- We eventually expect to have a default endpoint.
- The model name is a bit of a placeholder for now; it's unclear to me what we expose. In any case, it's trivial: we have an external-to-internal mapping (sketched below).
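For illustration, a hypothetical sketch of such an external-to-internal mapping. The comment above doesn't say where the mapping lives or what it contains, so every name below is invented:

import java.util.Map;

// Hypothetical illustration only: "elastic-model" is the placeholder used in
// the curl examples in this PR; the internal id is invented for this sketch.
final class ModelAliases {
    static final Map<String, String> EXTERNAL_TO_INTERNAL = Map.of(
        "elastic-model", "internal-model-id"
    );

    static String resolve(String externalModelId) {
        return EXTERNAL_TO_INTERNAL.getOrDefault(externalModelId, externalModelId);
    }
}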

It returns

{
    "inference_id": "test",
    "task_type": "completion",
    "service": "elastic",
    "service_settings": {
        "model_id": "elastic-model",
        "rate_limit": {
            "requests_per_minute": 1000
        }
    }
}

Then we perform inference via

curl --location 'http://localhost:9200/_inference/completion/test/_unified' \
--header 'Content-Type: application/json' \
--data '{
    "messages": [
        {
            "role": "user",
            "content": "In only two digits and nothing else, what is the meaning of life?"
        }
    ],
    "model" : "elastic-model",
    "temperature": 0.7,
    "max_completion_tokens": 300
}' -k 

Returns the following SSE stream:

event: message
data: {"id":"unified-a52c5569-6fca-48dd-9a03-cf6b2d999995","choices":[{"delta":{"role":"assistant"},"index":0}],"model":"elastic-model","object":"chat.completion.chunk"}

event: message
data: {"id":"unified-a52c5569-6fca-48dd-9a03-cf6b2d999995","choices":[{"delta":{"content":"42"},"index":0}],"model":"elastic-model","object":"chat.completion.chunk"}

event: message
data: {"id":"unified-a52c5569-6fca-48dd-9a03-cf6b2d999995","choices":[{"delta":{},"index":0}],"model":"elastic-model","object":"chat.completion.chunk"}

event: message
data: {"id":"unified-a52c5569-6fca-48dd-9a03-cf6b2d999995","choices":[{"delta":{},"finish_reason":"stop","index":0}],"model":"elastic-model","object":"chat.completion.chunk"}

event: message
data: {"id":"unified-a52c5569-6fca-48dd-9a03-cf6b2d999995","choices":[{"delta":{},"index":0}],"model":"elastic-model","object":"chat.completion.chunk","usage":{"completion_tokens":4,"prompt_tokens":22,"total_tokens":26}}

event: message
data: [DONE]

@jaybcee (Member Author) left a comment:

Fortunately this worked mostly out of the box. I had to change EIS a bit to reflect the SSE format.

https://github.com/elastic/eis-gateway/pull/207

It sends the response with a `data:` prefix (sketched below).

Did we want to implement more tests?
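For illustration, a minimal sketch of the `data:`-prefixed SSE framing shown in the transcript above; the class and method names are invented, not the actual Elasticsearch or eis-gateway code:

import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Each event carries a "data:" line with one chat.completion.chunk JSON
// payload, and the stream ends with the literal "[DONE]" sentinel.
final class SseDataReader {
    static List<String> readDataPayloads(BufferedReader reader) throws IOException {
        List<String> payloads = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.startsWith("data:") == false) {
                continue; // skip "event: message" lines and blank separators
            }
            String payload = line.substring("data:".length()).trim();
            if ("[DONE]".equals(payload)) {
                break; // end-of-stream sentinel
            }
            payloads.add(payload); // one chat.completion.chunk JSON object
        }
        return payloads;
    }
}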

    return new URI(elasticInferenceServiceComponents().elasticInferenceServiceUrl() + "/api/v1/chat/completions");
}

// TODO create the Configuration class?
@jaybcee (Member Author):

@jonathan-buttner Can you explain why you had this TODO? I'm not sure what it brings.


Contributor:

Just a follow-up: I think we can address this after we merge this PR. Maybe create an issue so we don't forget it.


public static final String NAME = "elastic_inference_service_completion_service_settings";

// TODO what value do we put here?
@jaybcee (Member Author):

@timgrein, do you have any suggestions? I'm not up to speed on the state of rate limiting.

Contributor:

Good question, I guess we could use the default from Bedrock for now?

@jaybcee (Member Author) commented Dec 19, 2024:

It depends on the environment and the quota set... We should leave it as is for now unless there are any objections. Is it OK to leave the TODO? I'll drop a note in the ES integration issue.


@jaybcee (Member Author):

I set it to 240 for now, but a customer's quota and our shared quota can differ. In any case, rate limiting is mildly opaque to me; this is a good enough number for now.
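For illustration, a self-contained sketch of the defaulting being described, assuming the `rate_limit.requests_per_minute` shape from the endpoint-creation response above; the class and method names are invented, not the actual service-settings code:

import java.util.Map;

// Illustrative defaulting logic: fall back to 240 requests per minute (the
// value chosen in this thread) when service_settings carries no rate_limit.
final class CompletionRateLimit {
    static final long DEFAULT_REQUESTS_PER_MINUTE = 240;

    static long requestsPerMinute(Map<String, Object> serviceSettings) {
        if (serviceSettings.get("rate_limit") instanceof Map<?, ?> rateLimit
            && rateLimit.get("requests_per_minute") instanceof Number rpm) {
            return rpm.longValue();
        }
        return DEFAULT_REQUESTS_PER_MINUTE;
    }
}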

public static ElasticInferenceServiceCompletionServiceSettings fromMap(Map<String, Object> map, ConfigurationParseContext context) {
    ValidationException validationException = new ValidationException();

    // TODO does EIS have this?
@jaybcee (Member Author):

@timgrein, same thing: do we want a limit per model at all?

Contributor:

Do you mean rate-limit grouping per model? Not yet; I think we'll group on project ids first. When ELSER is available on EIS we can additionally group by model.
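For illustration, a hypothetical sketch of the two grouping keys described; the record names are invented, and the actual EIS implementation is not shown in this PR:

// Hypothetical rate-limit grouping keys, illustrative only:
// group on project id first...
record ProjectRateLimitKey(String projectId) {}

// ...and additionally by model once ELSER (and other models) land on EIS.
record ProjectModelRateLimitKey(String projectId, String modelId) {}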

@jaybcee (Member Author):

I was not clear; I meant in the context of ES. Or did you mean we should rate limit on project id within ES?

private static final String ROLE = "user";
private static final String USER = "a_user";

// TODO remove if EIS doesn't use the model and user fields
@jaybcee (Member Author):

@maxhniebergall, we need the model. The user field is a bit ambiguous: do we set it and ignore it, or should we stop sending it?

Member:

Let's discuss at the inference sync tomorrow.

@jaybcee (Member Author):

Looks like we'll get rid of it for now. It's available for some Bedrock models, but it has to be passed in an odd way. I'll remove the references to it in the code as well.

As for its usage, I don't think we use it in a meaningful way. My brief Googling shows that it's useful for the provider to identify which of your users is "jailbreaking" the LLM, should you get suspended.
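For context, the field under discussion is the OpenAI-style top-level `user` attribution field. Had it been kept, the unified request shown earlier would have carried it alongside `model`, using the `a_user` test constant from the snippet above (illustrative request body only):

{
    "messages": [
        {
            "role": "user",
            "content": "In only two digits and nothing else, what is the meaning of life?"
        }
    ],
    "model": "elastic-model",
    "user": "a_user"
}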

@jaybcee jaybcee marked this pull request as ready for review December 19, 2024 02:26
@elasticsearchmachine elasticsearchmachine added the needs:triage Requires assignment of a team area label label Dec 19, 2024
@jaybcee jaybcee added the :SearchOrg/Inference Label for the Search Inference team label Dec 19, 2024
@elasticsearchmachine elasticsearchmachine added Team:SearchOrg Meta label for the Search Org (Enterprise Search) Team:Search - Inference labels Dec 19, 2024
@elasticsearchmachine (Collaborator):

Pinging @elastic/search-inference-team (Team:Search - Inference)

@elasticsearchmachine elasticsearchmachine removed the needs:triage Requires assignment of a team area label label Dec 19, 2024
@elasticsearchmachine (Collaborator):

Pinging @elastic/search-eng (Team:SearchOrg)

@jaybcee jaybcee requested a review from maxhniebergall January 6, 2025 19:40
@timgrein (Contributor) left a comment:

Only some small comments, happy to give it another pass afterwards.

    );
} catch (URISyntaxException e) {
    throw new ElasticsearchStatusException(
        "Failed to create URI for sparse embeddings service: " + e.getMessage(),
Contributor:

Should we also be a bit more specific here and use the service name/task type constants?

@jaybcee (Member Author):

Sure, I'll put the model and task type. Not sure what you mean by service name? Lmk if this is ok.

Contributor:

Thanks! With "service name" I mean the name of the service provider, in this case elastic.

Contributor:

I think Tim is referring to the service name here: https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/inference/src/main/java/org/elasticsearch/xpack/inference/services/elastic/ElasticInferenceService.java#L65

So saying something like:

Strings.format("Failed to create URI for sparse embeddings for service %s: %s", NAME, e.getMessage())

Comment on lines 37 to 38
// 1. Basic Serialization
// Test with minimal required fields to ensure basic serialization works.
@timgrein (Contributor) commented Jan 7, 2025:

We usually don't add numbered comments above test methods, so I think we should remove these comments for consistency reasons. The same holds true for all other tests in this class (I won't mark them explicitly :) ). Just out of curiosity: was this test re-written by some LLM? 😄 They tend to add these explicit "steps". If yes, it's pretty impressive how well they adapt even to a custom testing framework (but ES is probably part of the training data anyway).

@jaybcee (Member Author):

I think @jonathan-buttner wrote these, so I can't comment on the LLM nature. Generally I think, given enough context, an LLM can do something simple like this (though maybe one needs to clean it up). I will remove the comments 😄.

Member:

I wrote these tests. Yes, they were written by an LLM. I first asked it to create the list of tests that would be required, which is where these comments came from, and then I asked it to write the tests for each comment.

jaybcee and others added 9 commits January 7, 2025 09:30
…inference/external/http/sender/ElasticInferenceServiceUnifiedCompletionRequestManager.java

Co-authored-by: Tim Grein <[email protected]>
…inference/services/elastic/ElasticInferenceServiceSettings.java

Co-authored-by: Tim Grein <[email protected]>
…inference/services/elastic/ElasticInferenceServiceSparseEmbeddingsModel.java

Co-authored-by: Tim Grein <[email protected]>
…inference/services/elastic/completion/EISCompletionServiceSettingsTests.java

Co-authored-by: Tim Grein <[email protected]>
…inference/services/elastic/completion/EISCompletionModelTests.java

Co-authored-by: Tim Grein <[email protected]>
@timgrein (Contributor) left a comment:

Should we start to use a common prefix for EIS PRs to make it easier to grep through PRs/commits? Something like [EIS] or [Elastic Inference Service]? We could also simply use [Inference API] - I think that's how we do it for the other integrations. As elastic is just another provider, it could make sense to stick to one common prefix.

@@ -0,0 +1,5 @@
pr: 118871
summary: "Add EIS Unified `ChatCompletions` Integration"
@timgrein (Contributor) commented Jan 7, 2025:

Should we also use Elastic Inference Service here? We could also use Elastic Inference Service (EIS). I think this will land in the changelog, which is often read by customers AFAIK, so it's probably better to be a bit more explicit.

@jaybcee (Member Author):

Vote for [Elastic Inference Service] (maybe a tag is better long term?)

@jaybcee jaybcee changed the title EIS Unified ChatCompletions Integration [Elastic Inference Service] Add ElasticInferenceService Unified ChatCompletions Integration Jan 7, 2025
@jaybcee jaybcee requested a review from timgrein January 7, 2025 16:33
@jonathan-buttner (Contributor) left a comment:

Thanks for the changes! 🚢

@maxhniebergall (Member) left a comment:

LGTM, thanks Jason!

@@ -74,7 +74,8 @@ public void execute(
private static ResponseHandler createCompletionHandler() {
    return new ElasticInferenceServiceUnifiedChatCompletionResponseHandler(
        "elastic inference service completion",
Member:

Is this string with spaces in it correct? It seems like a bit of a weird value. Normally I think our non-error-message strings use underscores instead of spaces. Definitely just a nit, though.

@jaybcee (Member Author):

The OpenAI one does the same thing. Not sure what's best, but I think they should be consistent. I'll keep this for now.

@timgrein (Contributor) left a comment:

LGTM, thanks for the changes 🚢

Comment on lines -117 to -132

public void testParseRequestConfig_ThrowsUnsupportedModelType() throws IOException {
    try (var service = createServiceWithMockSender()) {
        var failureListener = getModelListenerForException(
            ElasticsearchStatusException.class,
            "The [elastic] service does not support task type [completion]"
        );

        service.parseRequestConfig(
            "id",
            TaskType.COMPLETION,
            getRequestConfigMap(Map.of(ServiceFields.MODEL_ID, ElserModels.ELSER_V2_MODEL), Map.of(), Map.of()),
            failureListener
        );
    }
}

@jaybcee (Member Author) commented Jan 8, 2025:

Removing this for now, @jonathan-buttner; it feels like we should move the configs where they belong. Too tightly coupled (and somewhat incorrect). Lmk if that's ok. I'll merge otherwise, the merges from main are catching up to me haha.

Contributor:

Yep, this looks good since we support the completion task type now 👍

@jaybcee jaybcee enabled auto-merge (squash) January 8, 2025 19:33
@jaybcee jaybcee merged commit 18345c4 into main Jan 8, 2025
17 checks passed
@jaybcee jaybcee deleted the ml-eis-integration-jbc branch January 8, 2025 19:33
Labels

>enhancement, Feature:GenAI (Features around GenAI), :SearchOrg/Inference (Label for the Search Inference team), Team:Search - Inference, Team:SearchOrg (Meta label for the Search Org (Enterprise Search)), v9.0.0