fix: Fixing CI for TRTLLM HLAPI #94
Conversation
LGTM if pipeline passes. Please follow up afterwards on the other PR from my branch into main that this one is merging into.
LGTM. qq - does this also fix the GPU OOM issue or is it a separate problem?
My PR, which launches the HL API conversion as a subprocess, seems to definitively fix the OOM issue and guarantee memory cleanup. Results were muddied for a bit because it turned out the GitLab mirror got out of sync and was running old code each time.
@krishung5 That fix was made as part of Ryan's branch [Link]. IIUC, the solution was building the TRT-LLM engine in a separate child process. Here is the relevant code snippet:

```python
# Run TRT-LLM build in a separate process to make sure it definitely
# cleans up any GPU memory used when done.
p = multiprocessing.Process(
    target=self.__build_trtllm_engine, args=(huggingface_id, engines_path)
)
p.start()
p.join()
```
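As a follow-up on the design choice: because the build runs in its own process, all GPU memory it allocates is released when that process exits. A minimal sketch of how a caller might harden this pattern by checking the child's exit code is below; the helper name and error handling are illustrative, not taken from the PR:

```python
import multiprocessing


def run_build_in_child(build_fn, huggingface_id, engines_path):
    # Hypothetical wrapper (not from the PR): run the engine build in a
    # child process so any GPU memory it allocates is released on exit,
    # and surface build failures via the child's exit code.
    p = multiprocessing.Process(target=build_fn, args=(huggingface_id, engines_path))
    p.start()
    p.join()
    if p.exitcode != 0:
        raise RuntimeError(f"TRT-LLM engine build failed (exit code {p.exitcode})")
```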
This PR fixes the CI failures that have been observed when using the TRTLLM HL API. The fixes include:

- In `ScopedTritonServer`, rely on `subprocess` and its relevant functions instead of `psutil` and having to manually keep track of the process ID. The root cause was that after `process.terminate()` was called for the underlying tritonserver process, it was followed by a `process.communicate(timeout=...)` call, which triggers a file I/O operation. To resolve this, `process.communicate()` is replaced with `process.wait()`, giving the process time to clean up without making an I/O call (see the sketch below).
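A minimal sketch of the shutdown pattern described above, assuming the server is launched with `subprocess.Popen`; the class and method names here are illustrative and not the actual implementation in this PR:

```python
import subprocess


class ScopedTritonServerSketch:
    """Illustrative sketch only; assumes tritonserver is started via Popen."""

    def __init__(self, cmd):
        self.proc = subprocess.Popen(cmd)

    def stop(self, timeout=30):
        # Ask the tritonserver process to shut down gracefully.
        self.proc.terminate()
        try:
            # wait() only polls for process exit; unlike communicate(), it
            # does not read stdout/stderr, so no file I/O happens while the
            # server is cleaning up.
            self.proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            # Force-kill if the server did not exit within the timeout.
            self.proc.kill()
            self.proc.wait()
```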