Catch errors and improve logging in Profiler #23
Conversation
src/triton_cli/profiler.py
Outdated
if args.ignore_eos:
    requests = export_data["experiments"][0]["requests"]
    for request in requests:
        if len(request["response_timestamps"]) != args.max_tokens:
Must merge after triton-inference-server/vllm_backend#28 in order for this line to be valid.
Are we receiving back any output data for the responses from PA? If so, we could check if the bonus response is empty for validation.
Otherwise, to smooth the transition, can we do something like this for now?
if len(request["response_timestamps"]) == args.max_tokens:
    # Assume FINAL flag is returned with final token response
    pass
elif len(request["response_timestamps"]) == args.max_tokens + 1:
    # Assume FINAL flag was returned with an empty response after the final token
    logger.warning('Received an extra response from the backend. This may be due to the backend sending an "empty final response".')
else:
    raise ValueError(...)
Unfortunately, the current PA output only has the timestamps of the request responses. Updated to add an additional check/warning for the empty final response.
Great changes! Only feedback is on the vllm / final response comment.
if len(request["response_timestamps"]) == args.max_tokens:
    pass
elif len(request["response_timestamps"]) == args.max_tokens + 1:
    logger.warning(
Can you include the comments from my earlier suggestion in-line here? The `# Assume ...` ones.
This will help me remember the reasoning for the check/split later on
Added comments
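For reference, a minimal sketch of how the updated check could read with the requested `# Assume ...` comments in-line. The loop over `export_data` mirrors the diff above; the exact wording of the ValueError message is an assumption, not the merged code.

```python
if args.ignore_eos:
    requests = export_data["experiments"][0]["requests"]
    for request in requests:
        token_count = len(request["response_timestamps"])
        if token_count == args.max_tokens:
            # Assume FINAL flag is returned with the final token response
            pass
        elif token_count == args.max_tokens + 1:
            # Assume FINAL flag was returned with an empty response after the final token
            logger.warning(
                "Received an extra response from the backend. This may be due "
                'to the backend sending an "empty final response".'
            )
        else:
            # Assumed message: the actual text in the PR may differ
            raise ValueError(
                f"Expected {args.max_tokens} responses per request but "
                f"received {token_count}."
            )
```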
The change includes:
The sanity check prevents potential miscalculations that can arise when the user requests the model to generate a sequence longer than the model's context length (see the sketch after the example outputs below).
Non-verbose (no error):
Non-verbose (with PA error):
Non-verbose (with sanity check error):
Verbose (no error):
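A hypothetical illustration of the kind of sanity check described above: reject a profiling run whose requested sequence would exceed the model's context window. The function name and parameters (`prompt_length`, `max_tokens`, `context_length`) are assumptions for illustration only, not the actual profiler.py signature.

```python
def check_sequence_fits_context(
    prompt_length: int, max_tokens: int, context_length: int
) -> None:
    # Hypothetical helper: the total requested sequence is the prompt plus
    # the number of tokens the model is asked to generate.
    requested = prompt_length + max_tokens
    if requested > context_length:
        raise ValueError(
            f"Requested sequence length {requested} (prompt {prompt_length} "
            f"+ max_tokens {max_tokens}) exceeds the model's context length "
            f"{context_length}."
        )
```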