Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[main] Undeploy code change: add edge case for models that are marked as not found in cache #3520 #3523

Merged
merged 1 commit into from
Feb 14, 2025

Conversation

brianf-aws
Copy link
Contributor

There is a code change that requires to check the response of the model undeploy response object to check that the model has been marked as not found on all nodes.

Previously when we received a {}response it was assumed that no nodes are servicing the model but it turns out there was a code change that didn't update the model index but reported back that it couldn't find the model

## On UNDEPLOY when the cluster doesn't recognize the model (i.e. when I did a cluster restart with the model on undeployed)
{
  "ZSDYur-iTNezJKo8KibrkA": {
    "stats": {
      "pt2r8ZQBFvLEBI9XgEUD": "not_found"
    }
  },
  "6sw6aS_BTXiX2U-Xww_lNw": {
    "stats": {
      "pt2r8ZQBFvLEBI9XgEUD": "not_found"
    }
  },
  "_I3xR4dLQSuKhbO06lpoGQ": {
    "stats": {
      "pt2r8ZQBFvLEBI9XgEUD": "not_found"
    }
  },
  "Jd1kEmnoTWyeJkalOiAakg": {
    "stats": {
      "pt2r8ZQBFvLEBI9XgEUD": "not_found"
    }
  }
}

Reproduction

A. Normal case

  1. deploy model
  2. undeploy model (you get shape like above but the status is UNDEPLOYED)
  3. undeploy model (You get the same shape previously but the statis is NOT_FOUND) This is new behavior previously we only got {}

B. Edge case (Turn off sync up job)

  1. Start docker cluster
  2. deploy model
  3. bring down cluster and revive it
  4. check that model is still in deployed and profile returns {}
  5. undeploy you get the above data
  6. check the model and observe its stuck in deployed.

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@brianf-aws
Copy link
Contributor Author

Can we add 2.19 backport?

@@ -198,7 +200,19 @@ private void undeployModels(
* Having this change enables a check that this edge case occurs along with having access to the model id
* allowing us to update the stale model index correctly to `UNDEPLOYED` since no nodes service the model.
*/
if (response.getNodes().isEmpty()) {
boolean modelNotFoundInNodesCache = response.getNodes().stream().allMatch(nodeResponse -> {
Map<String, String> status = nodeResponse.getModelUndeployStatus();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can nodeResponse be null?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my experience no, if the response is empty is a empty list but not null

@brianf-aws brianf-aws had a problem deploying to ml-commons-cicd-env-require-approval February 11, 2025 03:47 — with GitHub Actions Failure
There is a code change that requires to check the response of the model undeploy response object to check that the model has been marked as not found on all nodes.

Signed-off-by: Brian Flores <[email protected]>
@brianf-aws
Copy link
Contributor Author

Rebased with main branch

@brianf-aws brianf-aws had a problem deploying to ml-commons-cicd-env-require-approval February 14, 2025 01:35 — with GitHub Actions Error
@brianf-aws brianf-aws temporarily deployed to ml-commons-cicd-env-require-approval February 14, 2025 01:35 — with GitHub Actions Inactive
@brianf-aws brianf-aws had a problem deploying to ml-commons-cicd-env-require-approval February 14, 2025 01:35 — with GitHub Actions Failure
@brianf-aws brianf-aws temporarily deployed to ml-commons-cicd-env-require-approval February 14, 2025 01:35 — with GitHub Actions Inactive
@dhrubo-os dhrubo-os merged commit c5ceb48 into opensearch-project:main Feb 14, 2025
7 of 9 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Feb 14, 2025
There is a code change that requires to check the response of the model undeploy response object to check that the model has been marked as not found on all nodes.

Signed-off-by: Brian Flores <[email protected]>
(cherry picked from commit c5ceb48)
opensearch-trigger-bot bot pushed a commit that referenced this pull request Feb 14, 2025
There is a code change that requires to check the response of the model undeploy response object to check that the model has been marked as not found on all nodes.

Signed-off-by: Brian Flores <[email protected]>
(cherry picked from commit c5ceb48)
dhrubo-os pushed a commit that referenced this pull request Feb 18, 2025
… (#3552)

There is a code change that requires to check the response of the model undeploy response object to check that the model has been marked as not found on all nodes.

Signed-off-by: Brian Flores <[email protected]>
(cherry picked from commit c5ceb48)

Co-authored-by: Brian Flores <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants