From a17203fb9eebd919483a9a10651e3c1ff86d08ab Mon Sep 17 00:00:00 2001
From: Neelesh Gokhale
Date: Wed, 12 Feb 2025 08:48:23 +0000
Subject: [PATCH] Update documentation to reflect current bucket defaults

---
 README_GAUDI.md | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/README_GAUDI.md b/README_GAUDI.md
index b98a067c03ef1..1ada7e7d917ec 100644
--- a/README_GAUDI.md
+++ b/README_GAUDI.md
@@ -351,7 +351,7 @@ INFO 08-02 17:38:43 hpu_executor.py:91] init_cache_engine took 37.92 GiB of devi
   - batch size max (`VLLM_PROMPT_BS_BUCKET_MAX`): `min(max_num_seqs, 64)`
   - sequence length min (`VLLM_PROMPT_SEQ_BUCKET_MIN`): `block_size`
   - sequence length step (`VLLM_PROMPT_SEQ_BUCKET_STEP`): `block_size`
-  - sequence length max (`VLLM_PROMPT_SEQ_BUCKET_MAX`): `max_model_len`
+  - sequence length max (`VLLM_PROMPT_SEQ_BUCKET_MAX`): `1024`
 
 - Decode:
 
@@ -360,7 +360,18 @@ INFO 08-02 17:38:43 hpu_executor.py:91] init_cache_engine took 37.92 GiB of devi
   - batch size max (`VLLM_DECODE_BS_BUCKET_MAX`): `max_num_seqs`
   - block size min (`VLLM_DECODE_BLOCK_BUCKET_MIN`): `block_size`
   - block size step (`VLLM_DECODE_BLOCK_BUCKET_STEP`): `block_size`
-  - block size max (`VLLM_DECODE_BLOCK_BUCKET_MAX`): `max(128, (max_num_seqs*max_model_len)/block_size)`
+  - block size max (`VLLM_DECODE_BLOCK_BUCKET_MAX`): `max(128, (max_num_seqs*2048)/block_size)`
+- Recommended Values:
+  - Prompt:
+
+    - sequence length max (`VLLM_PROMPT_SEQ_BUCKET_MAX`): `max_model_len`
+  - Decode:
+
+    - block size max (`VLLM_DECODE_BLOCK_BUCKET_MAX`): `max(128, (max_num_seqs*max_model_len)/block_size)`
+
+> [!NOTE]
+> The model config may report a very high `max_model_len`; set it to the maximum expected `input_tokens + output_tokens`, rounded up to a multiple of `block_size`, as per your actual requirements.
+
 
 - `VLLM_HANDLE_TOPK_DUPLICATES`, if ``true`` - handles duplicates that are outside of top-k. `false` by default.
 - `VLLM_CONFIG_HIDDEN_LAYERS` - configures how many hidden layers to run in a HPUGraph for model splitting among hidden layers when TP is 1. The default is 1. It helps improve throughput by reducing inter-token latency limitations in some models.
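
For illustration only (not part of the patch above), a minimal Python sketch of how the recommended bucket bounds work out; `max_num_seqs`, `block_size`, `input_tokens`, and `output_tokens` below are assumed placeholder values, not figures from the documentation:

```python
# Worked example of the recommended bucket settings described in the patch.
# All concrete numbers here are assumptions chosen for illustration.
max_num_seqs = 64     # assumed scheduler limit on concurrent sequences
block_size = 128      # assumed KV-cache block size
input_tokens = 1024   # assumed maximum prompt length
output_tokens = 512   # assumed maximum generation length

# Per the note in the patch: set max_model_len to input_tokens + output_tokens,
# rounded up to a multiple of block_size.
needed = input_tokens + output_tokens
max_model_len = ((needed + block_size - 1) // block_size) * block_size  # 1536

# Recommended values from the patch.
prompt_seq_bucket_max = max_model_len                                              # 1536
decode_block_bucket_max = max(128, (max_num_seqs * max_model_len) // block_size)   # 768

# These are the values you would export as VLLM_PROMPT_SEQ_BUCKET_MAX and
# VLLM_DECODE_BLOCK_BUCKET_MAX before starting the server.
print(f"VLLM_PROMPT_SEQ_BUCKET_MAX={prompt_seq_bucket_max}")
print(f"VLLM_DECODE_BLOCK_BUCKET_MAX={decode_block_bucket_max}")
```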