
Custom Repository Agent never receiving TRITONREPOAGENT_ModelAction of type TRITONREPOAGENT_ACTION_LOAD_COMPLETE #6359

Closed
nathanjacobiOXOS opened this issue Sep 27, 2023 · 15 comments
Assignees: nnshah1
Labels: investigating (The development team is investigating this issue)

Comments

@nathanjacobiOXOS

Description
I have a custom repository agent, LoadCheckAgent.cpp. It is built into a .so library and added to the config.pbtxt of the models I am using. When a load request is sent to Triton for one of these models using TRITONSERVER_ServerLoadModel(server_, name);, the repository agent's TRITONREPOAGENT_ModelAction function is called as expected: the agent prints "AGENT CHECK" on entry to the function, and if the action type is TRITONREPOAGENT_ACTION_LOAD it also prints "MODEL LOAD - REPO". Both messages are seen at runtime.
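For context, here is a minimal sketch of what such an agent looks like (this is not the exact LoadCheckAgent.cpp, just the same shape; the debug strings for the action types not described above are placeholders):

#include <iostream>

#include "triton/core/tritonrepoagent.h"

extern "C" {

// Called by Triton for each repository agent action on a model.
TRITONSERVER_Error*
TRITONREPOAGENT_ModelAction(
    TRITONREPOAGENT_Agent* agent, TRITONREPOAGENT_AgentModel* model,
    const TRITONREPOAGENT_ActionType action_type)
{
  std::cout << "AGENT CHECK" << std::endl;  // printed on every invocation
  switch (action_type) {
    case TRITONREPOAGENT_ACTION_LOAD:
      std::cout << "MODEL LOAD - REPO" << std::endl;
      break;
    case TRITONREPOAGENT_ACTION_LOAD_COMPLETE:
      std::cout << "MODEL LOAD COMPLETE - REPO" << std::endl;  // never seen on 2.35.0
      break;
    case TRITONREPOAGENT_ACTION_LOAD_FAIL:
      std::cout << "MODEL LOAD FAILED - REPO" << std::endl;
      break;
    case TRITONREPOAGENT_ACTION_UNLOAD:
      std::cout << "MODEL UNLOAD - REPO" << std::endl;  // never seen on 2.35.0
      break;
    default:
      break;
  }
  return nullptr;  // nullptr indicates success
}

}  // extern "C"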

When the model finishes loading, Triton outputs a success message to the terminal:

I0927 18:27:31.167871 51235 model_lifecycle.cc:815] successfully loaded 'model_name'

However, the repository agent's TRITONREPOAGENT_ModelAction is not called again, and TRITONREPOAGENT_ACTION_LOAD_COMPLETE is never received.

Additionally, if an unload request is then sent using

TRITONSERVER_ServerUnloadModelAndDependents(server_, name);

further issues appear. Triton outputs the following after the unload request:

E0927 18:27:37.661673 51235 model_lifecycle.cc:409] Agent model returns error on TRITONREPOAGENT_ACTION_UNLOAD: Internal: Unexpected lifecycle state transition from TRITONREPOAGENT_ACTION_LOAD to TRITONREPOAGENT_ACTION_UNLOAD
I0927 18:27:37.662367 51235 onnxruntime.cc:2754] TRITONBACKEND_ModelInstanceFinalize: delete instance state
I0927 18:27:37.673127 51235 onnxruntime.cc:2682] TRITONBACKEND_ModelFinalize: delete model state
I0927 18:27:37.673711 51235 model_lifecycle.cc:608] successfully unloaded 'model_name' version 1

This is immediately followed by the repository agent's TRITONREPOAGENT_ModelAction being called, and it outputs its debug messages:

AGENT CHECK
MODEL LOAD FAILED - REPO

The second message is printed only if the TRITONREPOAGENT_ActionType received is TRITONREPOAGENT_ACTION_LOAD_FAIL.

There is also a debug message for when the TRITONREPOAGENT_ActionType received is TRITONREPOAGENT_ACTION_UNLOAD, but that message is never printed, meaning the repository agent never receives the unload action.

Triton Information
Triton version 2.35.0

Custom build, using an OS image based on JetPack 5.1.1-b56 with some other changes. CUDA 11.4 is still in use. Backends were pulled directly from tritonserver2.35.0-jetpack5.1.2.tgz.

To Reproduce

Create a custom repository agent that outputs the TRITONREPOAGENT_ActionType received by TRITONREPOAGENT_ModelAction. Build the .so as described in the steps here.

Place it in agents/checkload/libtritonrepoagent_checkload.so
Use TRITONSERVER_ServerOptionsSetRepoAgentDirectory(serverOptions, pathToAgents);
Include the following in the config of an onnxruntime_onnx or tensorflow_savedmodel model:

model_repository_agents
{
  agents [
    {
      name: "checkload",
      parameters {}
    }
  ]
}

Start the server and request a model load (see the sketch of these in-process calls below).
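As a rough sketch of the in-process calls involved (error handling omitted; the paths, the helper function name, and the explicit model-control-mode setting are placeholders/assumptions, not the exact code in use):

#include "triton/core/tritonserver.h"

// Sketch only: every call below returns a TRITONSERVER_Error* that should be
// checked in real code.
void LoadAndUnloadWithAgent()
{
  TRITONSERVER_ServerOptions* options = nullptr;
  TRITONSERVER_ServerOptionsNew(&options);
  TRITONSERVER_ServerOptionsSetModelRepositoryPath(options, "/path/to/model_repository");
  // Directory containing checkload/libtritonrepoagent_checkload.so
  TRITONSERVER_ServerOptionsSetRepoAgentDirectory(options, "/path/to/agents");
  // Explicit model control so models are loaded/unloaded on request.
  TRITONSERVER_ServerOptionsSetModelControlMode(options, TRITONSERVER_MODEL_CONTROL_EXPLICIT);

  TRITONSERVER_Server* server = nullptr;
  TRITONSERVER_ServerNew(&server, options);
  TRITONSERVER_ServerOptionsDelete(options);

  // Triggers TRITONREPOAGENT_ACTION_LOAD; LOAD_COMPLETE is expected afterwards.
  TRITONSERVER_ServerLoadModel(server, "model_name");

  // Expected to result in TRITONREPOAGENT_ACTION_UNLOAD / UNLOAD_COMPLETE.
  TRITONSERVER_ServerUnloadModelAndDependents(server, "model_name");

  TRITONSERVER_ServerDelete(server);
}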

Expected behavior

The behavior described above contradicts the expected behavior outlined in server/docs/customization_guide/repository_agents.md.
Here are those steps; the contradicted behavior is the final step (the LOAD_COMPLETE / LOAD_FAIL invocation), which never happens.

1. Load the model's configuration file (config.pbtxt) and extract the ModelRepositoryAgents settings. Even if a repository agent modifies the config.pbtxt file, the repository agent settings from the initial config.pbtxt file are used for the entire loading process.

2. For each repository agent specified:

  • Initialize the corresponding repository agent, loading the shared library if necessary. Model loading fails if the shared library is not available or if initialization fails.

  • Invoke the repository agent's TRITONREPOAGENT_ModelAction function with action TRITONREPOAGENT_ACTION_LOAD. As input the agent can access the model's repository as either a cloud storage location or a local filesystem location.

  • The repository agent can return success to indicate that no changes were made to the repository, can return failure to indicate that the model load should fail, or can create a new repository for the model (for example, by decrypting the input repository) and return success to indicate that the new repository should be used.

  • If the agent returns success, Triton continues to the next agent. If the agent returns failure, Triton skips invocation of any additional agents.

3. If all agents returned success, Triton attempts to load the model using the final model repository.

4. For each repository agent that was invoked with TRITONREPOAGENT_ACTION_LOAD, in reverse order:

  • Triton invokes the repository agent's TRITONREPOAGENT_ModelAction function with action TRITONREPOAGENT_ACTION_LOAD_COMPLETE if the model loaded successfully or TRITONREPOAGENT_ACTION_LOAD_FAIL if the model failed to load.
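Concretely, my reading of these steps (plus the corresponding unload-side actions) is that a single agent should see the following sequence of TRITONREPOAGENT_ModelAction calls for a successful load followed by an unload:

  1. TRITONREPOAGENT_ACTION_LOAD (observed)
  2. TRITONREPOAGENT_ACTION_LOAD_COMPLETE (never received on 2.35.0)
  3. TRITONREPOAGENT_ACTION_UNLOAD (never received; the state-transition error shown above is logged instead)
  4. TRITONREPOAGENT_ACTION_UNLOAD_COMPLETE (never received)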
@nathanjacobiOXOS
Author

Additional info to eliminate versioning issues on the model side: the models being loaded were created in an environment using TensorFlow 2.12.0 and ONNX Runtime 1.15.0.

@nathanjacobiOXOS nathanjacobiOXOS changed the title Repository never receiving TRITONREPOAGENT_ModelAction of type TRITONREPOAGENT_ACTION_LOAD_COMPLETE Custom Repository Agent never receiving TRITONREPOAGENT_ModelAction of type TRITONREPOAGENT_ACTION_LOAD_COMPLETE Sep 27, 2023
@nathanjacobiOXOS
Author

nathanjacobiOXOS commented Sep 27, 2023

I just tested loading the same models, with the same .so repository agent file, in a custom-built Triton 2.19 and JetPack 4 environment. Everything functions as expected with no issues in the older version, but not in 2.35.

@nnshah1 nnshah1 added the investigating (The development team is investigating this issue) label Sep 28, 2023
@nathanjacobiOXOS
Author

nathanjacobiOXOS commented Sep 28, 2023

I've done tests using the same OS and device. The issue persists in v2.27.0, v2.30.0, and v2.32.0; however, the repository agent behaves correctly in v2.20.0 (JP 5.0) and in v2.24.0 (JP 5.0.2).

@nathanjacobiOXOS
Author

Using v2.20.0 and v2.24.0 causes issues with other functions that previously worked on JP4. I believe there are some versioning issues in the CUDA and NVIDIA related libraries installed on the JetPack 5.1.1-b56 OS I am using, but I cannot figure this out for sure without more information on which versions are compatible with Triton. On this release page, the Docker image for Windows contains CUDA 11.5, while the supported JP5.0 release is based on 11.4. I just want to confirm that 11.4 will succeed in running this release?

@nathanjacobiOXOS
Author

nathanjacobiOXOS commented Oct 3, 2023

I've stumbled upon another bug involving custom repository agents, though it is not present in the newer releases; only the issues above are. In v2.20.0, v2.21.0, and v2.24.0, any model loaded with a custom repository agent causes TRITONSERVER_InferenceRequestNew to hang indefinitely when trying to perform inference. If a custom repository agent is not used, it does not hang. I'm not going to open a new issue due to the age of this bug, but I thought you might like to be aware of it @nnshah1

@nathanjacobiOXOS
Author

Checking in @nnshah1, any updates on the investigation of this issue?

@nathanjacobiOXOS
Author

Checking in @nnshah1 again! Please let me know what you have found out :)

@nnshah1
Contributor

nnshah1 commented Dec 5, 2023

Apologies - let me take a look this week and provide an update.

@cao-nv

cao-nv commented Dec 14, 2023

I'm facing the same issue.
After the TRITONREPOAGENT_ACTION_LOAD action was invoked and my Triton server was running normally, no other action was sent to the agent.
After I interrupted the server, TRITONREPOAGENT_ACTION_LOAD_FAIL was sent even though the model was successfully unloaded.

@nnshah1
Contributor

nnshah1 commented Dec 14, 2023

I have been able to reproduce (I believe) - will continue debugging.

@cao-nv

cao-nv commented Dec 14, 2023

> I have been able to reproduce (I believe) - will continue debugging.

Thank you.
I hope you will fix the bug soon.

@iyLester

iyLester commented Jan 5, 2024

I found that first_unload in model_lifecycle.h:InvokeAgentModels() is always false, resulting in an early return.
I modified first_unload with the following change and got the expected result.

Before:
const bool first_unload = (action_type == TRITONREPOAGENT_ACTION_UNLOAD) && (last_action_type_ != TRITONREPOAGENT_ACTION_UNLOAD);

After:
const bool first_unload = (action_type != TRITONREPOAGENT_ACTION_UNLOAD) && (last_action_type_ != TRITONREPOAGENT_ACTION_UNLOAD);

@nnshah1
Contributor

nnshah1 commented Jan 5, 2024

Thanks for the debug and insight! I took a quick look at the comment and variable there and I think you are correct. I've created a small change to the logic there to better match the comment. Can you test on your side as well?

triton-inference-server/core#309

@iyLester

iyLester commented Jan 8, 2024

After the change, it's working fine.

@nathanjacobiOXOS
Author

Thanks all for finding and fixing!

@nnshah1 nnshah1 self-assigned this Jan 8, 2024