run end_to_end_test_llama.py error #134
Conversation
```python
try:
    result = client.infer(model_name, inputs)
```
I see the llama model is decoupled, so shouldn't the call be async_stream_infer instead of infer?
I corrected it to use start_stream(call_back()) and async_stream_infer(), but kept the old input type (HttpInferInput), and got a "TypeError: Not a cmessage" error from tritonclient/grpc/_utils.py.
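For reference, a minimal sketch of calling a decoupled model through the Triton gRPC streaming client; the server address, model name, tensor names, and dtypes below are placeholders and need to match the actual llama model configuration:

```python
import numpy as np
import tritonclient.grpc as grpcclient

def callback(result, error):
    # Invoked once per response streamed back by the decoupled model.
    if error is not None:
        print("inference error:", error)
    else:
        print(result.as_numpy("output_ids"))

client = grpcclient.InferenceServerClient("localhost:8001")

# Inputs must be gRPC InferInput objects; reusing tritonclient.http
# InferInput objects here is what triggers "TypeError: Not a cmessage",
# because the gRPC client expects protobuf messages.
input_ids = np.array([[1, 2, 3]], dtype=np.uint32)
inp = grpcclient.InferInput("input_ids", input_ids.shape, "UINT32")
inp.set_data_from_numpy(input_ids)

client.start_stream(callback=callback)
client.async_stream_infer("fastertransformer", [inp])  # placeholder model name
client.stop_stream()  # blocks until in-flight responses have been delivered
client.close()
```

The stream must be started before the first async_stream_infer call and stopped after the last one; each streamed response is delivered to the callback, which is how a decoupled model can return multiple responses for a single request.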
Add llama triton guide
fix the int8_mode and decoupled mode backend support
When I follow llama_guide.md to build this lib, this error occurs:

```bash
/workspace/build/fastertransformer_backend/src/libfastertransformer.cc: In member function 'std::shared_ptr<AbstractTransformerModel> triton::backend::fastertransformer_backend::ModelState::ModelFactory(triton::common::TritonJson::Value&, const string&)':
/workspace/build/fastertransformer_backend/src/libfastertransformer.cc:340:98: error: 'int8_mode' was not declared in this scope
  340 |     ft_model = std::make_shared<LlamaTritonModel<__nv_bfloat16>>(tp, pp, custom_ar, model_dir, int8_mode);
      |                                                                                                ^~~~~~~~~
[100%] Linking CXX executable ../../../../../bin/multi_gpu_gpt_interactive_example
[100%] Built target gptneox_example
[100%] Built target multi_gpu_gpt_triton_example
[100%] Built target llama_example
/workspace/build/fastertransformer_backend/src/libfastertransformer.cc:343:90: error: 'int8_mode' was not declared in this scope
  343 |     ft_model = std::make_shared<LlamaTritonModel<float>>(tp, pp, custom_ar, model_dir, int8_mode);
```

I think this variable should be fixed; after I removed it, the build succeeded.
Update libfastertransformer.cc
Running python3 tools/end_to_end_test_llama.py raises the error: [400] HTTP end point doesn't support models with decoupled transaction policy
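A quick way to confirm this is the cause is to inspect the deployed model's transaction policy; the sketch below assumes the gRPC endpoint is on localhost:8001 and uses "fastertransformer" as a placeholder model name. If the policy is decoupled, the HTTP client's blocking infer() used by the test script cannot be used, and the gRPC streaming API is required instead.

```python
import tritonclient.grpc as grpcclient

# Placeholder server address and model name; adjust to your deployment.
client = grpcclient.InferenceServerClient("localhost:8001")
config = client.get_model_config("fastertransformer").config

# True means the model streams responses and only the gRPC streaming
# API (start_stream / async_stream_infer) can talk to it.
print("decoupled:", config.model_transaction_policy.decoupled)
client.close()
```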