
vllm-backend: exclude_input_in_output flag to control whether the output is prepended with the prompt #6866

Closed
mkhludnev opened this issue Feb 6, 2024 · 4 comments
@mkhludnev

Description
vllm-backend concatenates the prompt and the output before responding.

  1. It seems verbose and redundant, since the client already has the prompt it submitted. It is especially odd with a RAG approach, where prompts are so long.
  2. Combined with "gRPC response is cropped, where REST /generate fully sent" #6864, it makes the gRPC interface unusable.

Triton Information
Are you using the Triton container or did you build it yourself?

nvcr.io/nvidia/tritonserver:23.11-vllm-python-py3

To Reproduce
Steps to reproduce the behavior.

  1. Run Notus-7B under Triton + vLLM in Docker.
  2. POST to :8000/v2/models/vllm_model/generate with curl (a Python equivalent is sketched after this list).
  3. Run a sample LangChain app that uses gRPC on :8001 and watch the verbose logs (a minimal gRPC streaming client is sketched after the config files below).
  4. Actual: both responses have the prompt prepended, and the gRPC response in the verbose log is cropped (gRPC response is cropped, where REST /generate fully sent #6864).
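
A minimal Python sketch of the curl call in step 2, assuming the generate endpoint returns output tensors as top-level JSON fields; the prompt text and sampling parameters are placeholders, not values from the original report.

import json
import requests

# Input names follow the config.pbtxt below (text_input, stream, sampling_parameters).
resp = requests.post(
    "http://localhost:8000/v2/models/vllm_model/generate",
    json={
        "text_input": "What is Triton Inference Server?",
        "stream": False,
        "sampling_parameters": json.dumps({"temperature": 0.1, "max_tokens": 128}),
    },
)
resp.raise_for_status()
# The reported behavior: text_output comes back with the prompt prepended.
print(resp.json()["text_output"])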
backend: "vllm"

# Disable batching in Triton; let vLLM handle batching on its own.
max_batch_size: 0


# We need to use the decoupled transaction policy to saturate the
# vLLM engine for max throughput.
model_transaction_policy {
  decoupled: True
}
# Note: The vLLM backend uses the following input and output names.
# Any change here needs to also be made in model.py
input [
  {
    name: "text_input"
    data_type: TYPE_STRING
    dims: [ 1 ]
  },
  {
    name: "stream"
    data_type: TYPE_BOOL
    dims: [ 1 ]
  },
  {
    name: "sampling_parameters"
    data_type: TYPE_STRING
    dims: [ 1 ]
    optional: true
  }
]

output [
  {
    name: "text_output"
    data_type: TYPE_STRING
    dims: [ -1 ]
  }
]

# The usage of device is deferred to the vLLM engine
instance_group [
  {
    count: 1
    kind: KIND_GPU
  }
]

model.json
{
    "model":"TheBloke/notus-7B-v1-AWQ",
    "disable_log_requests": "true",
    "gpu_memory_utilization": 0.7,
    "max_model_len": 4096,
    "quantization": "awq"
}
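
For step 3, a minimal tritonclient.grpc streaming sketch standing in for the LangChain app (the model is decoupled, so responses must be read over a stream); the model name matches the config above, while the prompt and sampling parameters are illustrative.

import json
import queue
import numpy as np
import tritonclient.grpc as grpcclient
from tritonclient.utils import InferenceServerException

responses = queue.Queue()

def callback(result, error):
    # Collect either the response or the error for inspection after the stream closes.
    responses.put(error if error else result)

client = grpcclient.InferenceServerClient("localhost:8001")

inputs = [
    grpcclient.InferInput("text_input", [1], "BYTES"),
    grpcclient.InferInput("stream", [1], "BOOL"),
    grpcclient.InferInput("sampling_parameters", [1], "BYTES"),
]
inputs[0].set_data_from_numpy(np.array(["What is Triton Inference Server?"], dtype=np.object_))
inputs[1].set_data_from_numpy(np.array([False], dtype=bool))  # one final response
inputs[2].set_data_from_numpy(np.array([json.dumps({"temperature": 0.1, "max_tokens": 128})], dtype=np.object_))

client.start_stream(callback=callback)
client.async_stream_infer(model_name="vllm_model", inputs=inputs)
client.stop_stream()  # waits for all in-flight responses

while not responses.empty():
    r = responses.get()
    if isinstance(r, InferenceServerException):
        raise r
    # The reported behavior: the returned text starts with the submitted prompt.
    print(r.as_numpy("text_output"))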

Expected behavior
The prompt is not included in the vLLM response, which would make #6864 less painful.

@mkhludnev (Author)

Considering #6867, it is worth implementing an echo=False option for the vLLM and TRT backends.

@mkhludnev (Author)

Bumping; from #6867 (reply in thread):

The trt-llm backend has an exclude_input_in_output flag that controls prompt concatenation. We will add the same flag to vLLM soon.

@mkhludnev mkhludnev reopened this Feb 14, 2024
@mkhludnev mkhludnev changed the title from "vllm-backend prepends output with prompt" to "vllm-backend exclude_input_in_output flag whether prepend output with prompt" Feb 14, 2024
@dyastremsky (Contributor)

Thank you for this enhancement request! We have a ticket tracking this request.

Ref: 6137

@dyastremsky (Contributor)

This has been implemented in triton-inference-server/vllm_backend#35. Closing this issue.
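
For reference, a sketch of how the flag might be passed once the change from triton-inference-server/vllm_backend#35 is available; the field name exclude_input_in_output and its use as a per-request boolean are assumptions based on this thread and the TRT-LLM backend, so check the vllm_backend README for the exact interface.

import requests

resp = requests.post(
    "http://localhost:8000/v2/models/vllm_model/generate",
    json={
        "text_input": "What is Triton Inference Server?",
        "stream": False,
        "exclude_input_in_output": True,  # assumed field name; drops the prompt from text_output
    },
)
resp.raise_for_status()
# Expected: the completion only, with no prompt prefix.
print(resp.json()["text_output"])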
