
Custom Repository Agent never receiving TRITONREPOAGENT_ModelAction of type TRITONREPOAGENT_ACTION_LOAD_COMPLETE #6359

Closed
nathanjacobiOXOS opened this issue Sep 27, 2023 · 15 comments
Assignees: nnshah1
Labels: investigating (The development team is investigating this issue)

Comments

@nathanjacobiOXOS

Description
I have a custom repository agent, LoadCheckAgent.cpp. It is built into a .so library and added to the config.pbtxt of the models I am using. When a load request is sent to Triton for one of these models using TRITONSERVER_ServerLoadModel(server_, name);, the repository agent's TRITONREPOAGENT_ModelAction function is called as expected: the agent prints "AGENT CHECK" on entry to the function, and if the action type is TRITONREPOAGENT_ACTION_LOAD it also prints "MODEL LOAD - REPO". Both messages are seen at runtime.
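For context, here is a minimal sketch of what such an agent looks like (this is not the exact LoadCheckAgent.cpp, just the same shape; the debug strings for the action types not described above are placeholders):

#include <iostream>

#include "triton/core/tritonrepoagent.h"

extern "C" {

// Called by Triton for each repository agent action on a model.
TRITONSERVER_Error*
TRITONREPOAGENT_ModelAction(
    TRITONREPOAGENT_Agent* agent, TRITONREPOAGENT_AgentModel* model,
    const TRITONREPOAGENT_ActionType action_type)
{
  std::cout << "AGENT CHECK" << std::endl;  // printed on every invocation
  switch (action_type) {
    case TRITONREPOAGENT_ACTION_LOAD:
      std::cout << "MODEL LOAD - REPO" << std::endl;
      break;
    case TRITONREPOAGENT_ACTION_LOAD_COMPLETE:
      std::cout << "MODEL LOAD COMPLETE - REPO" << std::endl;  // never seen on 2.35.0
      break;
    case TRITONREPOAGENT_ACTION_LOAD_FAIL:
      std::cout << "MODEL LOAD FAILED - REPO" << std::endl;
      break;
    case TRITONREPOAGENT_ACTION_UNLOAD:
      std::cout << "MODEL UNLOAD - REPO" << std::endl;  // never seen on 2.35.0
      break;
    default:
      break;
  }
  return nullptr;  // nullptr indicates success
}

}  // extern "C"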

When the model finishes loading, Triton outputs a success message to the terminal:

I0927 18:27:31.167871 51235 model_lifecycle.cc:815] successfully loaded 'model_name'

However, the repository agent's TRITONREPOAGENT_ModelAction is not called again, and TRITONREPOAGENT_ACTION_LOAD_COMPLETE is never received.

Additionally, if an unload request is then sent using

TRITONSERVER_ServerUnloadModelAndDependents(server_, name);

further issues appear. Triton outputs the following after the unload request:

E0927 18:27:37.661673 51235 model_lifecycle.cc:409] Agent model returns error on TRITONREPOAGENT_ACTION_UNLOAD: Internal: Unexpected lifecycle state transition from TRITONREPOAGENT_ACTION_LOAD to TRITONREPOAGENT_ACTION_UNLOAD
I0927 18:27:37.662367 51235 onnxruntime.cc:2754] TRITONBACKEND_ModelInstanceFinalize: delete instance state
I0927 18:27:37.673127 51235 onnxruntime.cc:2682] TRITONBACKEND_ModelFinalize: delete model state
I0927 18:27:37.673711 51235 model_lifecycle.cc:608] successfully unloaded 'model_name' version 1

This is immediately followed by the repository agent's TRITONREPOAGENT_ModelAction being called, and it outputs its debug messages:

AGENT CHECK
MODEL LOAD FAILED - REPO

The second message is printed only if the TRITONREPOAGENT_ActionType received is TRITONREPOAGENT_ACTION_LOAD_FAIL.

There is also a debug message for when the TRITONREPOAGENT_ActionType received is TRITONREPOAGENT_ACTION_UNLOAD, but that message is never printed, meaning the repository agent never receives the unload action.

Triton Information
Triton version 2.35.0

Custom build, using an OS image based on JetPack 5.1.1-b56 with some other changes. CUDA 11.4 is still in use. Backends were pulled directly from tritonserver2.35.0-jetpack5.1.2.tgz.

To Reproduce

Create a custom repository agent that outputs the TRITONREPOAGENT_ActionType received by TRITONREPOAGENT_ModelAction. Build the .so as described in the steps here.

Place it in agents/checkload/libtritonrepoagent_checkload.so
Use TRITONSERVER_ServerOptionsSetRepoAgentDirectory(serverOptions, pathToAgents);
Include the following in the config of an onnxruntime_onnx or tensorflow_savedmodel model:

model_repository_agents
{
  agents [
    {
      name: "checkload",
      parameters {}
    }
  ]
}

Start the server and request a model load (see the sketch of these in-process calls below).
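As a rough sketch of the in-process calls involved (error handling omitted; the paths, the helper function name, and the explicit model-control-mode setting are placeholders/assumptions, not the exact code in use):

#include "triton/core/tritonserver.h"

// Sketch only: every call below returns a TRITONSERVER_Error* that should be
// checked in real code.
void LoadAndUnloadWithAgent()
{
  TRITONSERVER_ServerOptions* options = nullptr;
  TRITONSERVER_ServerOptionsNew(&options);
  TRITONSERVER_ServerOptionsSetModelRepositoryPath(options, "/path/to/model_repository");
  // Directory containing checkload/libtritonrepoagent_checkload.so
  TRITONSERVER_ServerOptionsSetRepoAgentDirectory(options, "/path/to/agents");
  // Explicit model control so models are loaded/unloaded on request.
  TRITONSERVER_ServerOptionsSetModelControlMode(options, TRITONSERVER_MODEL_CONTROL_EXPLICIT);

  TRITONSERVER_Server* server = nullptr;
  TRITONSERVER_ServerNew(&server, options);
  TRITONSERVER_ServerOptionsDelete(options);

  // Triggers TRITONREPOAGENT_ACTION_LOAD; LOAD_COMPLETE is expected afterwards.
  TRITONSERVER_ServerLoadModel(server, "model_name");

  // Expected to result in TRITONREPOAGENT_ACTION_UNLOAD / UNLOAD_COMPLETE.
  TRITONSERVER_ServerUnloadModelAndDependents(server, "model_name");

  TRITONSERVER_ServerDelete(server);
}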

Expected behavior

The behavior described above contradicts the expected behavior outlined in server/docs/customization_guide/repository_agents.md.
Here are those steps; the contradicted behavior is the final step (the LOAD_COMPLETE / LOAD_FAIL invocation), which never happens.

1. Load the model's configuration file (config.pbtxt) and extract the ModelRepositoryAgents settings. Even if a repository agent modifies the config.pbtxt file, the repository agent settings from the initial config.pbtxt file are used for the entire loading process.

2. For each repository agent specified:

  • Initialize the corresponding repository agent, loading the shared library if necessary. Model loading fails if the shared library is not available or if initialization fails.

  • Invoke the repository agent's TRITONREPOAGENT_ModelAction function with action TRITONREPOAGENT_ACTION_LOAD. As input the agent can access the model's repository as either a cloud storage location or a local filesystem location.

  • The repository agent can return success to indicate that no changes were made to the repository, can return failure to indicate that the model load should fail, or can create a new repository for the model (for example, by decrypting the input repository) and return success to indicate that the new repository should be used.

  • If the agent returns success, Triton continues to the next agent. If the agent returns failure, Triton skips invocation of any additional agents.

3. If all agents returned success, Triton attempts to load the model using the final model repository.

4. For each repository agent that was invoked with TRITONREPOAGENT_ACTION_LOAD, in reverse order:

  • Triton invokes the repository agent's TRITONREPOAGENT_ModelAction function with action TRITONREPOAGENT_ACTION_LOAD_COMPLETE if the model loaded successfully or TRITONREPOAGENT_ACTION_LOAD_FAIL if the model failed to load.
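Concretely, my reading of these steps (plus the corresponding unload-side actions) is that a single agent should see the following sequence of TRITONREPOAGENT_ModelAction calls for a successful load followed by an unload:

  1. TRITONREPOAGENT_ACTION_LOAD (observed)
  2. TRITONREPOAGENT_ACTION_LOAD_COMPLETE (never received on 2.35.0)
  3. TRITONREPOAGENT_ACTION_UNLOAD (never received; the state-transition error shown above is logged instead)
  4. TRITONREPOAGENT_ACTION_UNLOAD_COMPLETE (never received)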
@nathanjacobiOXOS
Author

Additional info to eliminate versioning issues on the model side: the models being loaded were created in an environment using TensorFlow 2.12.0 and ONNX Runtime 1.15.0.

@nathanjacobiOXOS nathanjacobiOXOS changed the title Repository never receiving TRITONREPOAGENT_ModelAction of type TRITONREPOAGENT_ACTION_LOAD_COMPLETE Custom Repository Agent never receiving TRITONREPOAGENT_ModelAction of type TRITONREPOAGENT_ACTION_LOAD_COMPLETE Sep 27, 2023
@nathanjacobiOXOS
Author

nathanjacobiOXOS commented Sep 27, 2023

I just tested loading the same models, with the same .so repository agent file, in a custom-built Triton 2.19 and JetPack 4 environment. Everything functions as expected with no issues in the older version, but not in 2.35.

@nnshah1 nnshah1 added the investigating (The development team is investigating this issue) label Sep 28, 2023
@nathanjacobiOXOS
Author

nathanjacobiOXOS commented Sep 28, 2023

I've done tests using the same OS and device. The issue persists in v2.27.0, v2.30.0, and v2.32.0; however, the repository agent behaves correctly in v2.20.0 (JP 5.0) and in v2.24.0 (JP 5.0.2).

@nathanjacobiOXOS
Author

Using v2.20.0 and v2.24.0 causes issues with other functions that previously worked on JP4. I believe there are some versioning issues in the CUDA and NVIDIA related libraries installed on the JetPack 5.1.1-b56 OS I am using, but I cannot figure this out for sure without more information on which versions are compatible with Triton. On this release page, the Docker image for Windows contains CUDA 11.5, while the supported JP5.0 release is based on 11.4. I just want to confirm that 11.4 will succeed in running this release?

@nathanjacobiOXOS
Author

nathanjacobiOXOS commented Oct 3, 2023

I've stumbled upon another bug involving custom repository agents, though it is not present in the newer releases; only the issues above are. In v2.20.0, v2.21.0, and v2.24.0, any model loaded with a custom repository agent causes TRITONSERVER_InferenceRequestNew to hang indefinitely when trying to perform inference. If a custom repository agent is not used, it does not hang. I'm not going to open a new issue due to the age of this bug, but I thought you might like to be aware of it @nnshah1

@nathanjacobiOXOS
Author

Checking in @nnshah1, any updates on the investigation of this issue?

@nathanjacobiOXOS
Author

Checking in @nnshah1 again! Please let me know what you have found out :)

@nnshah1
Contributor

nnshah1 commented Dec 5, 2023

Apologies - let me take a look this week and provide an update.

@cao-nv

cao-nv commented Dec 14, 2023

I'm facing the same issue.
After the TRITONREPOAGENT_ACTION_LOAD action was invoked and my Triton server was running normally, no other action was sent to the agent.
After I interrupted the server, TRITONREPOAGENT_ACTION_LOAD_FAIL was sent even though the model was successfully unloaded.

@nnshah1
Contributor

nnshah1 commented Dec 14, 2023

I have been able to reproduce (I believe) - will continue debugging.

@cao-nv

cao-nv commented Dec 14, 2023

> I have been able to reproduce (I believe) - will continue debugging.

Thank you.
I hope you will fix the bug soon.

@iyLester

iyLester commented Jan 5, 2024

I found that first_unload in model_lifecycle.h:InvokeAgentModels() is always false, resulting in an early return.
I modified first_unload with the following change and got the expected result.

Before:
const bool first_unload = (action_type == TRITONREPOAGENT_ACTION_UNLOAD) && (last_action_type_ != TRITONREPOAGENT_ACTION_UNLOAD);

After:
const bool first_unload = (action_type != TRITONREPOAGENT_ACTION_UNLOAD) && (last_action_type_ != TRITONREPOAGENT_ACTION_UNLOAD);

@nnshah1
Contributor

nnshah1 commented Jan 5, 2024

Thanks for the debug and insight! I took a quick look at the comment and variable there and I think you are correct. I've created a small change to the logic there to better match the comment. Can you test on your side as well?

triton-inference-server/core#309

@iyLester

iyLester commented Jan 8, 2024

After the change, it's working fine.

@nathanjacobiOXOS
Author

Thanks all for finding and fixing!

@nnshah1 nnshah1 self-assigned this Jan 8, 2024