
vllm-backend: exclude_input_in_output flag to control whether the output is prepended with the prompt #6866

Closed
mkhludnev opened this issue Feb 6, 2024 · 4 comments
@mkhludnev

Description
vllm-backend concatenates the prompt and the output before responding.

  1. It seems verbose and redundant, since the client already has the prompt it submitted. It is especially odd with a RAG approach, where prompts are so long.
  2. Combined with "gRPC response is cropped, where REST /generate fully sent" #6864, it makes the gRPC interface unusable.

Triton Information
Are you using the Triton container or did you build it yourself?

nvcr.io/nvidia/tritonserver:23.11-vllm-python-py3

To Reproduce
Steps to reproduce the behavior.

  1. Run Notus-7B under Triton + vLLM in Docker.
  2. POST to :8000/v2/models/vllm_model/generate with curl (a Python equivalent is sketched after this list).
  3. Run a sample LangChain app that uses gRPC on :8001 and watch the verbose logs (a minimal gRPC streaming client is sketched after the config files below).
  4. Actual: both responses have the prompt prepended, and the gRPC response in the verbose log is cropped (gRPC response is cropped, where REST /generate fully sent #6864).
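
A minimal Python sketch of the curl call in step 2, assuming the generate endpoint returns output tensors as top-level JSON fields; the prompt text and sampling parameters are placeholders, not values from the original report.

import json
import requests

# Input names follow the config.pbtxt below (text_input, stream, sampling_parameters).
resp = requests.post(
    "http://localhost:8000/v2/models/vllm_model/generate",
    json={
        "text_input": "What is Triton Inference Server?",
        "stream": False,
        "sampling_parameters": json.dumps({"temperature": 0.1, "max_tokens": 128}),
    },
)
resp.raise_for_status()
# The reported behavior: text_output comes back with the prompt prepended.
print(resp.json()["text_output"])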
backend: "vllm"

# Disable batching in Triton; let vLLM handle batching on its own.
max_batch_size: 0


# We need to use the decoupled transaction policy to saturate the
# vLLM engine for max throughput.
model_transaction_policy {
  decoupled: True
}
# Note: The vLLM backend uses the following input and output names.
# Any change here needs to also be made in model.py
input [
  {
    name: "text_input"
    data_type: TYPE_STRING
    dims: [ 1 ]
  },
  {
    name: "stream"
    data_type: TYPE_BOOL
    dims: [ 1 ]
  },
  {
    name: "sampling_parameters"
    data_type: TYPE_STRING
    dims: [ 1 ]
    optional: true
  }
]

output [
  {
    name: "text_output"
    data_type: TYPE_STRING
    dims: [ -1 ]
  }
]

# The usage of device is deferred to the vLLM engine
instance_group [
  {
    count: 1
    kind: KIND_GPU
  }
]

model.json
{
    "model":"TheBloke/notus-7B-v1-AWQ",
    "disable_log_requests": "true",
    "gpu_memory_utilization": 0.7,
    "max_model_len": 4096,
    "quantization": "awq"
}
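
For step 3, a minimal tritonclient.grpc streaming sketch standing in for the LangChain app (the model is decoupled, so responses must be read over a stream); the model name matches the config above, while the prompt and sampling parameters are illustrative.

import json
import queue
import numpy as np
import tritonclient.grpc as grpcclient
from tritonclient.utils import InferenceServerException

responses = queue.Queue()

def callback(result, error):
    # Collect either the response or the error for inspection after the stream closes.
    responses.put(error if error else result)

client = grpcclient.InferenceServerClient("localhost:8001")

inputs = [
    grpcclient.InferInput("text_input", [1], "BYTES"),
    grpcclient.InferInput("stream", [1], "BOOL"),
    grpcclient.InferInput("sampling_parameters", [1], "BYTES"),
]
inputs[0].set_data_from_numpy(np.array(["What is Triton Inference Server?"], dtype=np.object_))
inputs[1].set_data_from_numpy(np.array([False], dtype=bool))  # one final response
inputs[2].set_data_from_numpy(np.array([json.dumps({"temperature": 0.1, "max_tokens": 128})], dtype=np.object_))

client.start_stream(callback=callback)
client.async_stream_infer(model_name="vllm_model", inputs=inputs)
client.stop_stream()  # waits for all in-flight responses

while not responses.empty():
    r = responses.get()
    if isinstance(r, InferenceServerException):
        raise r
    # The reported behavior: the returned text starts with the submitted prompt.
    print(r.as_numpy("text_output"))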

Expected behavior
The prompt is not included in the vLLM response, which would make #6864 less painful.

@mkhludnev (Author)

Considering #6867, it is worth implementing an echo=False option for the vLLM and TRT backends.

@mkhludnev (Author)

Bumping; from #6867 (reply in thread):

The trt-llm backend has an exclude_input_in_output flag that controls prompt concatenation. We will add the same flag to vLLM soon.

@mkhludnev mkhludnev reopened this Feb 14, 2024
@mkhludnev mkhludnev changed the title from "vllm-backend prepends output with prompt" to "vllm-backend exclude_input_in_output flag whether prepend output with prompt" Feb 14, 2024
@dyastremsky (Contributor)

Thank you for this enhancement request! We have a ticket tracking this request.

Ref: 6137

@dyastremsky (Contributor)

This has been implemented in triton-inference-server/vllm_backend#35. Closing this issue.
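
For reference, a sketch of how the flag might be passed once the change from triton-inference-server/vllm_backend#35 is available; the field name exclude_input_in_output and its use as a per-request boolean are assumptions based on this thread and the TRT-LLM backend, so check the vllm_backend README for the exact interface.

import requests

resp = requests.post(
    "http://localhost:8000/v2/models/vllm_model/generate",
    json={
        "text_input": "What is Triton Inference Server?",
        "stream": False,
        "exclude_input_in_output": True,  # assumed field name; drops the prompt from text_output
    },
)
resp.raise_for_status()
# Expected: the completion only, with no prompt prefix.
print(resp.json()["text_output"])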
