setup.sh script multiple model support using HF download #61

Closed
wants to merge 76 commits
Commits:
7d68656
add print_prompts cli arg
tstescoTT Dec 4, 2024
8d78d64
remove redundant stop token from vLLM example api calls
tstescoTT Dec 4, 2024
3108bc0
add capture_trace.py util to pre-prompt vllm server to capture all tr…
tstescoTT Dec 4, 2024
ea3d75d
adding utils/startup_utils.py to refine handling of startup in automa…
tstescoTT Dec 4, 2024
cc1d17a
adding force_max_tokens as option to call_inference_api(), add input_…
tstescoTT Dec 4, 2024
059d513
faster mock model prefill
tstescoTT Dec 4, 2024
48d17de
make it not send stop tokens by default and speed up mock model decod…
tstescoTT Dec 5, 2024
fead1aa
adding token count verification for vllm open ai api server to prompt…
tstescoTT Dec 5, 2024
5a80551
add max-log-len to limit logging of prompts to avoid clutter in logs
tstescoTT Dec 5, 2024
d845f08
add InferenceServerContext to startup_utils.py, improve wait_for_healthy
tstescoTT Dec 5, 2024
632ac83
add all_responses to utils/prompt_client_cli.py not using globals
tstescoTT Dec 5, 2024
f563e32
adding new utils/prompt_client_cli.py using utils/prompt_client.py an…
tstescoTT Dec 5, 2024
2467c74
fix health endpoint
tstescoTT Dec 5, 2024
af5e8dc
add vllm_model to EnvironmentConfig instead of BatchConfig
tstescoTT Dec 5, 2024
60c7ab2
refactor utils/capture_traces.py with new prompt_client
tstescoTT Dec 5, 2024
10993a2
fix utils imports
tstescoTT Dec 5, 2024
20ccdf4
fix BatchConfig usage
tstescoTT Dec 6, 2024
eab7e76
add benchmarking/online_benchmark_prompt_client.py using prompt_clien…
tstescoTT Dec 6, 2024
90acdf6
add benchmarking/online_benchmark_prompt_client.py using prompt_clien…
tstescoTT Dec 6, 2024
ec486ad
add benchmarking, evals, and tests dirs to Dockerfile
tstescoTT Dec 6, 2024
c58d7b3
update patchfile and benchmarking README.md with commands
tstescoTT Dec 6, 2024
fe4f96d
update Docker IMAGE_VERSION to v0.0.3
tstescoTT Dec 6, 2024
f3d815a
improve doc
tstescoTT Dec 6, 2024
8246a72
update benchmark_serving.patch
tstescoTT Dec 6, 2024
765c4be
add tt_model_runner.py patch for best_of
tstescoTT Dec 6, 2024
b93370d
update benchmarking/benchmark_serving.patch
tstescoTT Dec 6, 2024
5e07baa
use CACHE_ROOT for vllm_online_benchmark_results dir
tstescoTT Dec 6, 2024
d0e0b0f
adding timestamped online benchmark run result directory, rps=1 for v…
tstescoTT Dec 9, 2024
5db2523
update benchmark output file naming convention
tstescoTT Dec 9, 2024
5ab742c
rename benchmarking/online_benchmark_prompt_client.py to benchmarking…
tstescoTT Dec 9, 2024
06420bd
increase num_prompts default, default to 128/128 online test
tstescoTT Dec 9, 2024
b7e4cfc
use min_tokens and ignore_eos=True to force output seq len
tstescoTT Dec 9, 2024
dda29a9
adding min_tokens to locust requests
tstescoTT Dec 9, 2024
f8b3033
add --ignore-eos to vllm_online_benchmark.py to force the output seq …
tstescoTT Dec 10, 2024
12c38fc
add context_lens (isl, osl) pairs to capture_traces() to capture corr…
tstescoTT Dec 10, 2024
1cabdc9
add trace pre-capture to prompt_client_cli.py with option to disable
tstescoTT Dec 10, 2024
68f08d0
better comment and logs for trace capture
tstescoTT Dec 10, 2024
962c507
use TPOT and TPS in benchmarking/prompt_client_online_benchmark.py, a…
tstescoTT Dec 12, 2024
62bf427
update utils/prompt_client_cli.py and docs
tstescoTT Dec 12, 2024
d9e163c
remove WIP utils/startup_utils.py from this branch
tstescoTT Dec 12, 2024
cd29085
adding doc string to BatchProcessor
tstescoTT Dec 31, 2024
376403d
add output_path arg to batch_processor.py::BatchProcessor to optional…
tstescoTT Dec 31, 2024
daf0625
adding tests/test_vllm_seq_lens.py to test vllm sequence lengths and …
tstescoTT Dec 12, 2024
f3e34d1
fix TEST_PARAMS
tstescoTT Dec 13, 2024
4d360eb
adding fixed_batch_size to prompt_client_online_benchmark.py for bett…
tstescoTT Dec 17, 2024
41dcc22
use standard output values in ms
tstescoTT Dec 17, 2024
308eeaf
fix output filepath for prompt_client_online_benchmark.py, remove get…
tstescoTT Dec 17, 2024
e6fc8c4
add benchmark output file reader script
tstescoTT Dec 17, 2024
6295693
ruff formatting, rename benchmarking/benchmark_output_processor.py ->…
tstescoTT Dec 17, 2024
8963a12
add percentile-metrics to add e2els stats
tstescoTT Dec 18, 2024
fc8eb06
add latency to benchmarking/prompt_client_online_benchmark.py and sum…
tstescoTT Dec 18, 2024
6c4d092
support latency measurement with mean_e2el_ms
tstescoTT Dec 18, 2024
d8ec682
update benchmark sweeps
tstescoTT Dec 18, 2024
ffaabd6
update sweeps context lengths
tstescoTT Dec 18, 2024
4602ff3
model id as header not in table
tstescoTT Dec 18, 2024
594b9a1
add better formatting in benchmark_summary.py, update iso/osl sweeps
tstescoTT Dec 18, 2024
2ce6fe7
add better markdown formatting, add saving display .csv
tstescoTT Dec 18, 2024
6be324f
update sweep isl/osl
tstescoTT Dec 18, 2024
f558876
update sweep isl/osl
tstescoTT Dec 18, 2024
b4260d3
add metadata to markdown summary
tstescoTT Dec 18, 2024
89958d9
add ignore_eos=True to locust requests to use min/max tokens, increas…
tstescoTT Dec 18, 2024
126c588
update for llama 3.1 70B v0 testing
tstescoTT Dec 19, 2024
aef6a94
adding evals changes from tstesco/llama-evals
tstescoTT Dec 19, 2024
5ab1816
adding TT_METAL_COMMIT_SHA_OR_TAG=v0.54.0-rc2
tstescoTT Dec 19, 2024
0e5b67a
update README commit tags
tstescoTT Dec 19, 2024
0c48a9f
adding vllm benchmarking patch to stop sending unsupported params bes…
tstescoTT Dec 20, 2024
471c90b
move vllm-tt-metal-llama3-70b/setup.sh -> setup.sh, add support for H…
tstescoTT Dec 18, 2024
ec43450
add llama 3.2 refs
tstescoTT Dec 18, 2024
d17e46e
WIP make setup.sh run from repo root, add fixed model impl dir, env f…
tstescoTT Dec 18, 2024
895ff8d
adding setup.sh support for multiple models, adding support for llama…
tstescoTT Dec 19, 2024
49ee14f
update .env file location in documentation
tstescoTT Dec 19, 2024
1675f54
remove MODEL_IMPL_ROOT_DIR and add note about MODEL_NAME
tstescoTT Dec 19, 2024
d1bffe0
move setup_tt_metal_cache into setup_weights to use load_env scope
tstescoTT Dec 20, 2024
97c90cf
better logging and handling of {PERSISTENT_VOLUME}/model_weights dir …
tstescoTT Dec 20, 2024
6df6c7c
adding error message when huggingface-cli download fails with common …
tstescoTT Dec 20, 2024
6beaa66
update README for llama 3.1 70B v0 drop commits
tstescoTT Dec 21, 2024
Files changed:
2 changes: 1 addition & 1 deletion .gitignore
@@ -12,7 +12,7 @@ __pycache__
 env
 .testvenv
 python_env
-.venv
+.venv*
 
 # persistent storage volume
 persistent_volume
75 changes: 75 additions & 0 deletions benchmarking/README.md
@@ -36,3 +36,78 @@ python examples/offline_inference_tt.py --measure_perf --max_seqs_in_batch 32 --
- `--max_seqs_in_batch` (default: `32`):
- **Maximum batch size** for inference, determining the number of prompts processed in parallel.

### Online Benchmarking

#### single user

```bash
python utils/prompt_client_cli.py \
--num_prompts 32 \
--batch_size 1 \
--tokenizer_model meta-llama/Llama-3.1-70B-Instruct \
--max_prompt_length 128 \
--input_seq_len 128 \
--output_seq_len 128 \
--template chat_template \
--dataset random
```
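
For reference, the server being exercised here is the vLLM OpenAI-compatible API server. A minimal sketch of a direct completions request against it (the host, port, and served model name below are assumptions; adjust them to your deployment):

```python
import requests

# Assumed server address; adjust host/port to match your vLLM deployment.
API_URL = "http://localhost:8000/v1/completions"

payload = {
    "model": "meta-llama/Llama-3.1-70B-Instruct",  # assumed served model name
    "prompt": "What is the capital of France?",
    "temperature": 0.0,
    "max_tokens": 128,
    "stream": False,
    "ignore_eos": False,  # vLLM extension; the benchmark scripts set this to force a fixed output length
}

response = requests.post(API_URL, json=payload, timeout=600)
response.raise_for_status()
print(response.json()["choices"][0]["text"])
```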

#### using vllm/benchmarking/benchmark_serving.py
Within the Docker container, use the `benchmark_serving.patch` file (applied below). First, start the vLLM API server:
```
cd ~/app/src
python run_vllm_api_server.py
```
The patch simply stops the benchmarking script from sending the `best_of` arg, which is not supported and causes issues.

To run the benchmarks, open another shell into the Docker container:
```
cd ~/vllm
git apply ~/app/benchmarking/benchmark_serving.patch
cd ~/app
export PYTHONPATH=$PYTHONPATH:$PWD
python benchmarking/vllm_online_benchmark.py
```

A timestamped output file is written for each input/output sequence length defined.

Results are also printed to stdout, for example with mock data:
```
==================================================
Benchmark Result
==================================================
Successful requests: 32
Benchmark duration (s): 0.39
Total input tokens: 4096
Total generated tokens: 64
Request throughput (req/s): 83.04
Output token throughput (tok/s): 166.07
Total Token throughput (tok/s): 10794.77
--------------------------------------------------
Time to First Token
--------------------------------------------------
Mean TTFT (ms): 358.26
Median TTFT (ms): 358.45
P99 TTFT (ms): 361.67
--------------------------------------------------
Time per Output Token (excl. 1st token)
--------------------------------------------------
Mean TPOT (ms): 14.03
Median TPOT (ms): 14.13
P99 TPOT (ms): 14.30
--------------------------------------------------
Inter-token Latency
--------------------------------------------------
Mean ITL (ms): 7.86
Median ITL (ms): 7.83
P99 ITL (ms): 8.05
==================================================
```
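
As a rough sanity check on these headline numbers (assuming the throughputs are simply counts divided by the benchmark duration), they can be reproduced from the reported counts; small differences come from the duration being rounded in the printout:

```python
# Values taken from the mock-data report above.
successful_requests = 32
duration_s = 0.39
total_input_tokens = 4096
total_generated_tokens = 64

# Each throughput is a count divided by the wall-clock benchmark duration.
request_throughput = successful_requests / duration_s                                # ~82 req/s
output_token_throughput = total_generated_tokens / duration_s                        # ~164 tok/s
total_token_throughput = (total_input_tokens + total_generated_tokens) / duration_s  # ~10,667 tok/s

print(f"{request_throughput:.2f} req/s, "
      f"{output_token_throughput:.2f} output tok/s, "
      f"{total_token_throughput:.2f} total tok/s")
```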

#### using tt-inference-server/benchmarking/prompt_client_online_benchmark.py

```bash
export PYTHONPATH=$PYTHONPATH:$PWD
python benchmarking/prompt_client_online_benchmark.py
```

40 changes: 40 additions & 0 deletions benchmarking/benchmark_serving.patch
@@ -0,0 +1,40 @@
diff --git a/benchmarks/backend_request_func.py b/benchmarks/backend_request_func.py
index 4813fde2..0cb3e72e 100644
--- a/benchmarks/backend_request_func.py
+++ b/benchmarks/backend_request_func.py
@@ -235,9 +235,7 @@ async def async_request_openai_completions(
"model": request_func_input.model,
"prompt": request_func_input.prompt,
"temperature": 0.0,
- "best_of": request_func_input.best_of,
"max_tokens": request_func_input.output_len,
- "logprobs": request_func_input.logprobs,
"stream": True,
"ignore_eos": request_func_input.ignore_eos,
}
diff --git a/benchmarks/benchmark_serving.py b/benchmarks/benchmark_serving.py
index c1a396c8..74f75a15 100644
--- a/benchmarks/benchmark_serving.py
+++ b/benchmarks/benchmark_serving.py
@@ -22,6 +22,12 @@ On the client side, run:
--endpoint /generate_stream
to the end of the command above.
"""
+import sys
+from unittest.mock import MagicMock
+# mock out ttnn fully so we can import ttnn without using it
+sys.modules["ttnn"] = MagicMock()
+sys.modules["ttnn.device"] = MagicMock()
+
import argparse
import asyncio
import base64
@@ -417,7 +423,7 @@ async def benchmark(
prompt_len=test_prompt_len,
output_len=test_output_len,
logprobs=logprobs,
- best_of=best_of,
+ best_of=None,
multi_modal_content=test_mm_content,
ignore_eos=ignore_eos,
)
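
The `sys.modules` lines in this patch pre-register stubs so that any `import ttnn` inside the benchmarking code resolves to a mock instead of the real hardware library. A minimal standalone sketch of the same technique (the attribute accessed at the end is purely illustrative, not a real `ttnn` API):

```python
import sys
from unittest.mock import MagicMock

# Register stand-ins BEFORE anything imports the real package.
sys.modules["ttnn"] = MagicMock()
sys.modules["ttnn.device"] = MagicMock()

import ttnn  # resolves to the MagicMock registered above, not the real library

# Attribute access and calls now return further MagicMocks instead of touching
# hardware (illustrative attribute only).
result = ttnn.some_function(123)
print(type(result))  # <class 'unittest.mock.MagicMock'>
```

This matches the comment in the patch: `ttnn` can be imported by the benchmarking script without actually being used.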