Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feature: triton generate support #675

Draft
wants to merge 19 commits into
base: main
Choose a base branch
from
Draft

Conversation

nnshah1
Copy link
Contributor

@nnshah1 nnshah1 commented May 24, 2024

This PR does two main things:

  1. Add support for triton's generate endpoint. This reuses the PA implementation for the OpenAI HTTP client - as it supports text in / text out and streaming. The format of the input message is similar to completions, but uses "text_input" and "text_output" instead of "prompt".

  2. Remove "service-kind" parameter from top level cli. Service kind can be inferred from endpoint-type and endpoint-type is more clear. endpoint-type is tied to the API and not the implementation. service kind "openAI' vs "triton" also was not parallel as "openAI" is an API and "triton" is a server. As the PA implementation is tied to service-kind - this change is only at the genai-perf level, and internally service-kind is still present it is just set based on endpoint-type. To facillitate a new endpoint-type of kserve was added.

Existing Tests have been updated.

No new tests added - could be done - or done as separate PR.

Note: most changes in genai-perf - but a small change added to PA - to allow for using the end of request as a completion event even for streaming cases. Since generate doesn't include an explicit done message - we use the end of the request as indication of done.

@nnshah1 nnshah1 marked this pull request as draft May 24, 2024 17:29
@nnshah1 nnshah1 changed the base branch from main to r24.05 May 24, 2024 17:30
Copy link
Contributor

@debermudez debermudez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems surprisingly lightweight and straightforward.

I think the only piece really missing is an update to the unit tests.
I would also like to set this up to test it e2e too.

@nnshah1 nnshah1 marked this pull request as ready for review June 4, 2024 23:15
@nnshah1 nnshah1 marked this pull request as draft June 4, 2024 23:25
@nnshah1 nnshah1 marked this pull request as ready for review June 5, 2024 03:59
@nnshah1 nnshah1 changed the title triton generate support feature: triton generate support Jun 5, 2024
@nnshah1 nnshah1 changed the base branch from r24.05 to main June 5, 2024 04:00
@nnshah1 nnshah1 force-pushed the nnshah1-triton-generate branch from af8c734 to 1f50b6e Compare June 5, 2024 04:10
@nnshah1 nnshah1 requested review from debermudez and tgerdesnv June 5, 2024 04:49
Copy link
Collaborator

@tgerdesnv tgerdesnv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd really like to see an E2E test protecting this. Not sure if the CI is ready for that yet, but a high priority ticket should be created at a minimum.

@@ -137,7 +142,7 @@ def _check_conditional_args(
if args.service_kind != "triton":
if args.output_tokens_mean_deterministic:
parser.error(
"The --output-tokens-mean-deterministic option is only supported with the Triton service-kind."
"The --output-tokens-mean-deterministic option is only supported with the kserve endpoint type."
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the code be changed to check endpoint_type != kserve? I know that with the current code it is the same result, but it introduces an assumption (endpoint kserve -> service_kind triton) that could trip up a future developer.

default="tensorrtllm",
required=False,
help=f'When using the "triton" service-kind, '
help=f'When using the "kserve" endpoint type, '
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can generate endpoint not use trtllm vs vllm?

Copy link
Contributor Author

@nnshah1 nnshah1 Jun 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can - I haven't added any different behavior for the different backends. Actually - it has only been tested against vllm at the moment. So this is fair point ...

Let me move this back to draft - plan to test trt-llm in the next week or so

@@ -63,6 +63,14 @@ namespace openai {
void
ChatCompletionRequest::SendResponse(bool is_final, bool is_null)
{
// if final response has already been sent
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The classes in this file should be renamed since they aren't specific to Chat Completions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think "HTTP with SSE Support" is in the end what it is .... not sure the best name.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd really like to see the classes refactored. We shouldn't need two independent full http clients. Either one goes away, or we get a base class and then some really thin implementation classes on top. We already have stories for this (TMA-1644), so no big deal if this is ignored in this PR.

@nnshah1 nnshah1 marked this pull request as draft June 5, 2024 14:11
@@ -731,7 +748,10 @@ def _extract_openai_text_output(self, response: str) -> str:

def _is_openai_empty_response(self, response: str) -> bool:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should change the name of the function since it's no longer just openai

)

endpoint_group.add_argument(
"--service-kind",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finally one less CLI option 😄 Can we also update the README to reflect the changes in CLI options?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nnshah1 what's blocking this PR as being marked ready for review?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its out of date from some of our other work now and needs to be ported to the new repository at the bare minimum. I really like the additions here so I would like to see them integrated soon as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

7 participants