Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: Index failed during rolling upgrade #38741

Open
1 task done
zhuwenxing opened this issue Dec 25, 2024 · 13 comments
Open
1 task done

[Bug]: Index failed during rolling upgrade #38741

zhuwenxing opened this issue Dec 25, 2024 · 13 comments
Assignees
Labels
kind/bug Issues or changes related a bug priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Milestone

Comments

@zhuwenxing
Copy link
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version:v2.4.17 --> master-20241223-f499ca47-amd64
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

[2024-12-23T17:42:56.949Z] FAILED testcases/test_single_request_operation_for_rolling_update.py::TestOperations::test_operations[9-10] - pytest_assume.plugin.FailedAssumption: 

[2024-12-23T17:42:56.949Z] 2 Failed Assumptions:

[2024-12-23T17:42:56.949Z] 

[2024-12-23T17:42:56.949Z] chaos_commons.py:124: AssumptionFailure

[2024-12-23T17:42:56.949Z] >>	pytest.assume(

[2024-12-23T17:42:56.949Z] AssertionError: Expect Succ: Op.index succ rate 0.9594594594594594, total: 74, average time: 19.8179

[2024-12-23T17:42:56.949Z] assert False

[2024-12-23T17:42:56.949Z] 

[2024-12-23T17:42:56.949Z] testcases/test_single_request_operation_for_rolling_update.py:153: AssumptionFailure

[2024-12-23T17:42:56.949Z] >>	pytest.assume(rto <= rto_threshold,  f"{self.health_checkers[k].c_name} {k} rto expect {rto_threshold}s but get {rto}s, {self.health_checkers[k].fail_records}")

[2024-12-23T17:42:56.949Z] AssertionError: IndexChecker__d1b5pX2j Op.index rto expect 10s but get 366.90567898750305s, [('failure', 36, '2024-12-23 17:16:50.265773', 1734974210.2658098), ('failure', 37, '2024-12-23 17:18:51.530762', 1734974331.5308142), ('failure', 38, '2024-12-23 17:20:53.856480', 1734974453.856523), ('success', 39, '2024-12-23 17:22:57.171459', 1734974577.1714888)]

[2024-12-23T17:42:56.949Z] assert False

[2024-12-23T17:42:56.949Z] FAILED testcases/test_single_request_operation_for_rolling_update.py::TestOperations::test_operations[3-10] - pytest_assume.plugin.FailedAssumption: 

[2024-12-23T17:42:56.949Z] 2 Failed Assumptions:

[2024-12-23T17:42:56.949Z] 

[2024-12-23T17:42:56.949Z] chaos_commons.py:124: AssumptionFailure

[2024-12-23T17:42:56.949Z] >>	pytest.assume(

[2024-12-23T17:42:56.949Z] AssertionError: Expect Succ: Op.index succ rate 0.958904109589041, total: 73, average time: 19.9359

[2024-12-23T17:42:56.949Z] assert False

[2024-12-23T17:42:56.949Z] 

[2024-12-23T17:42:56.949Z] testcases/test_single_request_operation_for_rolling_update.py:153: AssumptionFailure

[2024-12-23T17:42:56.949Z] >>	pytest.assume(rto <= rto_threshold,  f"{self.health_checkers[k].c_name} {k} rto expect {rto_threshold}s but get {rto}s, {self.health_checkers[k].fail_records}")

[2024-12-23T17:42:56.949Z] AssertionError: IndexChecker__xQe95wlv Op.index rto expect 10s but get 366.17002153396606s, [('failure', 36, '2024-12-23 17:17:02.068743', 1734974222.0687613), ('failure', 37, '2024-12-23 17:19:05.445734', 1734974345.4457502), ('failure', 38, '2024-12-23 17:21:06.761849', 1734974466.7618792), ('success', 39, '2024-12-23 17:23:08.238754', 1734974588.238783)]

[2024-12-23T17:42:56.949Z] assert False

[2024-12-23T17:42:56.949Z] FAILED testcases/test_single_request_operation_for_rolling_update.py::TestOperations::test_operations[8-10] - pytest_assume.plugin.FailedAssumption: 

[2024-12-23T17:42:56.949Z] 2 Failed Assumptions:

[2024-12-23T17:42:56.949Z] 

[2024-12-23T17:42:56.949Z] chaos_commons.py:124: AssumptionFailure

[2024-12-23T17:42:56.949Z] >>	pytest.assume(

[2024-12-23T17:42:56.949Z] AssertionError: Expect Succ: Op.index succ rate 0.958904109589041, total: 73, average time: 19.9970

[2024-12-23T17:42:56.949Z] assert False

[2024-12-23T17:42:56.949Z] 

[2024-12-23T17:42:56.949Z] testcases/test_single_request_operation_for_rolling_update.py:153: AssumptionFailure

[2024-12-23T17:42:56.949Z] >>	pytest.assume(rto <= rto_threshold,  f"{self.health_checkers[k].c_name} {k} rto expect {rto_threshold}s but get {rto}s, {self.health_checkers[k].fail_records}")

[2024-12-23T17:42:56.949Z] AssertionError: IndexChecker__edQskzPB Op.index rto expect 10s but get 364.3369879722595s, [('failure', 36, '2024-12-23 17:16:48.731750', 1734974208.7317827), ('failure', 37, '2024-12-23 17:18:50.074741', 1734974330.0747588), ('failure', 38, '2024-12-23 17:20:51.417225', 1734974451.4172559), ('success', 39, '2024-12-23 17:22:53.068747', 1734974573.0687706)]

[2024-12-23T17:42:56.949Z] assert False

[2024-12-23T17:42:56.949Z] FAILED testcases/test_single_request_operation_for_rolling_update.py::TestOperations::test_operations[7-10] - pytest_assume.plugin.FailedAssumption: 

[2024-12-23T17:42:56.949Z] 2 Failed Assumptions:

[2024-12-23T17:42:56.949Z] 

[2024-12-23T17:42:56.949Z] chaos_commons.py:124: AssumptionFailure

[2024-12-23T17:42:56.949Z] >>	pytest.assume(

[2024-12-23T17:42:56.949Z] AssertionError: Expect Succ: Op.index succ rate 0.9594594594594594, total: 74, average time: 19.8185

[2024-12-23T17:42:56.949Z] assert False

[2024-12-23T17:42:56.949Z] 

[2024-12-23T17:42:56.949Z] testcases/test_single_request_operation_for_rolling_update.py:153: AssumptionFailure

[2024-12-23T17:42:56.949Z] >>	pytest.assume(rto <= rto_threshold,  f"{self.health_checkers[k].c_name} {k} rto expect {rto_threshold}s but get {rto}s, {self.health_checkers[k].fail_records}")

[2024-12-23T17:42:56.949Z] AssertionError: IndexChecker__VFBcWJw4 Op.index rto expect 10s but get 364.1275360584259s, [('failure', 36, '2024-12-23 17:16:58.804896', 1734974218.804926), ('failure', 37, '2024-12-23 17:19:00.022746', 1734974340.0227847), ('failure', 38, '2024-12-23 17:21:01.207465', 1734974461.2075007), ('success', 39, '2024-12-23 17:23:02.932438', 1734974582.932462)]

[2024-12-23T17:42:56.949Z] assert False

[2024-12-23T17:42:56.949Z] FAILED testcases/test_single_request_operation_for_rolling_update.py::TestOperations::test_operations[2-10] - pytest_assume.plugin.FailedAssumption: 

[2024-12-23T17:42:56.949Z] 2 Failed Assumptions:

[2024-12-23T17:42:56.949Z] 

[2024-12-23T17:42:56.949Z] chaos_commons.py:124: AssumptionFailure

[2024-12-23T17:42:56.949Z] >>	pytest.assume(

[2024-12-23T17:42:56.949Z] AssertionError: Expect Succ: Op.index succ rate 0.9571428571428572, total: 70, average time: 21.1751

[2024-12-23T17:42:56.949Z] assert False

[2024-12-23T17:42:56.949Z] 

[2024-12-23T17:42:56.949Z] testcases/test_single_request_operation_for_rolling_update.py:153: AssumptionFailure

[2024-12-23T17:42:56.949Z] >>	pytest.assume(rto <= rto_threshold,  f"{self.health_checkers[k].c_name} {k} rto expect {rto_threshold}s but get {rto}s, {self.health_checkers[k].fail_records}")

[2024-12-23T17:42:56.949Z] AssertionError: IndexChecker__p1AQFq8h Op.index rto expect 10s but get 364.7519919872284s, [('failure', 32, '2024-12-23 17:16:54.095749', 1734974214.0957804), ('failure', 33, '2024-12-23 17:18:55.443125', 1734974335.4431539), ('failure', 34, '2024-12-23 17:20:57.151761', 1734974457.151787), ('success', 35, '2024-12-23 17:22:58.847746', 1734974578.8477724)]

[2024-12-23T17:42:56.949Z] assert False

[2024-12-23T17:42:56.949Z] FAILED testcases/test_single_request_operation_for_rolling_update.py::TestOperations::test_operations[4-10] - pytest_assume.plugin.FailedAssumption: 

[2024-12-23T17:42:56.949Z] 2 Failed Assumptions:

[2024-12-23T17:42:56.949Z] 

[2024-12-23T17:42:56.949Z] chaos_commons.py:124: AssumptionFailure

[2024-12-23T17:42:56.949Z] >>	pytest.assume(

[2024-12-23T17:42:56.949Z] AssertionError: Expect Succ: Op.index succ rate 0.9583333333333334, total: 72, average time: 20.5062

[2024-12-23T17:42:56.949Z] assert False

[2024-12-23T17:42:56.949Z] 

[2024-12-23T17:42:56.949Z] testcases/test_single_request_operation_for_rolling_update.py:153: AssumptionFailure

[2024-12-23T17:42:56.949Z] >>	pytest.assume(rto <= rto_threshold,  f"{self.health_checkers[k].c_name} {k} rto expect {rto_threshold}s but get {rto}s, {self.health_checkers[k].fail_records}")

[2024-12-23T17:42:56.949Z] AssertionError: IndexChecker__iMxl1dLG Op.index rto expect 10s but get 366.6630003452301s, [('failure', 34, '2024-12-23 17:17:00.529751', 1734974220.5297801), ('failure', 35, '2024-12-23 17:19:02.822762', 1734974342.8227925), ('failure', 36, '2024-12-23 17:21:05.626751', 1734974465.6267788), ('success', 37, '2024-12-23 17:23:07.192754', 1734974587.1927805)]

[2024-12-23T17:42:56.949Z] assert False

[2024-12-23T17:42:56.949Z] FAILED testcases/test_single_request_operation_for_rolling_update.py::TestOperations::test_operations[10-10] - pytest_assume.plugin.FailedAssumption: 

[2024-12-23T17:42:56.949Z] 2 Failed Assumptions:

[2024-12-23T17:42:56.949Z] 

[2024-12-23T17:42:56.949Z] chaos_commons.py:124: AssumptionFailure

[2024-12-23T17:42:56.949Z] >>	pytest.assume(

[2024-12-23T17:42:56.949Z] AssertionError: Expect Succ: Op.index succ rate 0.9571428571428572, total: 70, average time: 20.9236

[2024-12-23T17:42:56.949Z] assert False

[2024-12-23T17:42:56.949Z] 

[2024-12-23T17:42:56.949Z] testcases/test_single_request_operation_for_rolling_update.py:153: AssumptionFailure

[2024-12-23T17:42:56.950Z] >>	pytest.assume(rto <= rto_threshold,  f"{self.health_checkers[k].c_name} {k} rto expect {rto_threshold}s but get {rto}s, {self.health_checkers[k].fail_records}")

[2024-12-23T17:42:56.950Z] AssertionError: IndexChecker__z3rR2ud9 Op.index rto expect 10s but get 364.2149658203125s, [('failure', 32, '2024-12-23 17:16:55.224799', 1734974215.2248394), ('failure', 33, '2024-12-23 17:18:56.478588', 1734974336.4786143), ('failure', 34, '2024-12-23 17:20:57.964738', 1734974457.9647596), ('success', 35, '2024-12-23 17:22:59.439753', 1734974579.4398053)]

[2024-12-23T17:42:56.950Z] assert False

[2024-12-23T17:42:56.950Z] FAILED testcases/test_single_request_operation_for_rolling_update.py::TestOperations::test_operations[5-10] - pytest_assume.plugin.FailedAssumption: 

[2024-12-23T17:42:56.950Z] 2 Failed Assumptions:

[2024-12-23T17:42:56.950Z] 

[2024-12-23T17:42:56.950Z] chaos_commons.py:124: AssumptionFailure

[2024-12-23T17:42:56.950Z] >>	pytest.assume(

[2024-12-23T17:42:56.950Z] AssertionError: Expect Succ: Op.index succ rate 0.9545454545454546, total: 66, average time: 22.5251

[2024-12-23T17:42:56.950Z] assert False

[2024-12-23T17:42:56.950Z] 

[2024-12-23T17:42:56.950Z] testcases/test_single_request_operation_for_rolling_update.py:153: AssumptionFailure

[2024-12-23T17:42:56.950Z] >>	pytest.assume(rto <= rto_threshold,  f"{self.health_checkers[k].c_name} {k} rto expect {rto_threshold}s but get {rto}s, {self.health_checkers[k].fail_records}")

[2024-12-23T17:42:56.950Z] AssertionError: IndexChecker__WkfTHRMw Op.index rto expect 10s but get 363.8909728527069s, [('failure', 28, '2024-12-23 17:16:48.740758', 1734974208.7407942), ('failure', 29, '2024-12-23 17:18:49.941748', 1734974329.9417803), ('failure', 30, '2024-12-23 17:20:51.202859', 1734974451.2028792), ('success', 31, '2024-12-23 17:22:52.631742', 1734974572.631767)]

[2024-12-23T17:42:56.950Z] assert False

[2024-12-23T17:42:56.950Z] FAILED testcases/test_single_request_operation_for_rolling_update.py::TestOperations::test_operations[6-10] - pytest_assume.plugin.FailedAssumption: 

[2024-12-23T17:42:56.950Z] 2 Failed Assumptions:

[2024-12-23T17:42:56.950Z] 

[2024-12-23T17:42:56.950Z] chaos_commons.py:124: AssumptionFailure

[2024-12-23T17:42:56.950Z] >>	pytest.assume(

[2024-12-23T17:42:56.950Z] AssertionError: Expect Succ: Op.index succ rate 0.9565217391304348, total: 69, average time: 21.4263

[2024-12-23T17:42:56.950Z] assert False

[2024-12-23T17:42:56.950Z] 

[2024-12-23T17:42:56.950Z] testcases/test_single_request_operation_for_rolling_update.py:153: AssumptionFailure

[2024-12-23T17:42:56.950Z] >>	pytest.assume(rto <= rto_threshold,  f"{self.health_checkers[k].c_name} {k} rto expect {rto_threshold}s but get {rto}s, {self.health_checkers[k].fail_records}")

[2024-12-23T17:42:56.950Z] AssertionError: IndexChecker__1EWEhoRH Op.index rto expect 10s but get 366.40900230407715s, [('failure', 32, '2024-12-23 17:16:51.780754', 1734974211.7807827), ('failure', 33, '2024-12-23 17:18:53.049746', 1734974333.0497706), ('failure', 34, '2024-12-23 17:20:56.371745', 1734974456.3717666), ('success', 35, '2024-12-23 17:22:58.189757', 1734974578.189785)]

[2024-12-23T17:42:56.950Z] assert False

[2024-12-23T17:42:56.950Z] FAILED testcases/test_single_request_operation_for_rolling_update.py::TestOperations::test_operations[1-10] - pytest_assume.plugin.FailedAssumption: 

[2024-12-23T17:42:56.950Z] 2 Failed Assumptions:

[2024-12-23T17:42:56.950Z] 

[2024-12-23T17:42:56.950Z] chaos_commons.py:124: AssumptionFailure

[2024-12-23T17:42:56.950Z] >>	pytest.assume(

[2024-12-23T17:42:56.950Z] AssertionError: Expect Succ: Op.index succ rate 0.958904109589041, total: 73, average time: 19.9821

[2024-12-23T17:42:56.950Z] assert False

[2024-12-23T17:42:56.950Z] 

[2024-12-23T17:42:56.950Z] testcases/test_single_request_operation_for_rolling_update.py:153: AssumptionFailure

[2024-12-23T17:42:56.950Z] >>	pytest.assume(rto <= rto_threshold,  f"{self.health_checkers[k].c_name} {k} rto expect {rto_threshold}s but get {rto}s, {self.health_checkers[k].fail_records}")

[2024-12-23T17:42:56.950Z] AssertionError: IndexChecker__RxZ7Qcgf Op.index rto expect 10s but get 364.4219825267792s, [('failure', 32, '2024-12-23 17:16:56.784752', 1734974216.784777), ('failure', 33, '2024-12-23 17:18:58.246904', 1734974338.246924), ('failure', 34, '2024-12-23 17:20:59.669795', 1734974459.669835), ('success', 35, '2024-12-23 17:23:01.206738', 1734974581.2067595)]

[2024-12-23T17:42:56.950Z] assert False

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/rolling_update_for_operator_test_simple/detail/rolling_update_for_operator_test_simple/5688/pipeline

log:

artifacts-kafka-mixcoord-5688-server-logs.tar.gz

Anything else?

No response

@zhuwenxing zhuwenxing added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 25, 2024
@yanliang567
Copy link
Contributor

/assign @weiliu1031
/unassign

@yanliang567 yanliang567 added this to the 2.5.1 milestone Dec 25, 2024
@yanliang567 yanliang567 added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 25, 2024
@xiaofan-luan
Copy link
Collaborator

@zhuwenxing
did the index succeed finally?
usually we don't evaluate index recover speed. If the indexnode crash sometimes we need tot wait for reschedule

@weiliu1031
Copy link
Contributor

weiliu1031 commented Dec 27, 2024

image

  1. 2024/12/23 17:18:51 - The index_service received a request to create an index.
  2. 2024/12/23 17:20:03 - The index_service triggered the first segment's index task.
  3. 2024/12/23 17:20:49 - The index_service triggered the third segment's index task.
  4. 2024/12/23 17:20:52 - The SDK waited for the index creation to complete but timed out after 2 minutes.
  5. 2024/12/23 17:23 - Both index tasks were completed.

The index was eventually created successfully, but two issues were identified:

  1. The index_service took too long—approximately 70 seconds—from receiving the create index request to generating the index tasks. which is blocked by stat task.
  2. The index tasks also took nearly 2 minutes to complete.

@yanliang567 yanliang567 modified the milestones: 2.5.1, 2.5.2 Dec 30, 2024
@yanliang567 yanliang567 modified the milestones: 2.5.2, 2.5.3 Jan 6, 2025
@yanliang567 yanliang567 added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Jan 9, 2025
@weiliu1031
Copy link
Contributor

build index becomes slower, which is a know impact by stat task. won't fix

@weiliu1031
Copy link
Contributor

please refine the timeout for building index if necessary

@weiliu1031
Copy link
Contributor

/assign @zhuwenxing

@yanliang567
Copy link
Contributor

/assign @xiaocai2333

@xiaocai2333
Copy link
Contributor

During an upgrade, existing segments will trigger stats tasks after the upgrade. This might lead to longer indexing times for newly triggered tasks, as the stats tasks generated by the the existing segments need to be processed first.
So, we should either appropriately increase the timeout duration or wait for the stats tasks of the existing segments to complete before triggering indexing.

@zhuwenxing
Copy link
Contributor Author

will increase index timeout to 360s to for verifing

@zhuwenxing
Copy link
Contributor Author

still failed with 360s timeout for index

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/rolling_update_for_operator_test_simple/detail/rolling_update_for_operator_test_simple/5808/pipeline
log:

artifacts-kafka-mixcoord-5808-server-logs.tar.gz

cluster: 4am
ns: chaos-testing
pod info

[2025-01-15T17:44:44.075Z] + kubectl get pods -o wide

[2025-01-15T17:44:44.077Z] + grep kafka-mixcoord-5808

[2025-01-15T17:44:44.333Z] kafka-mixcoord-5808-etcd-0                                       1/1     Running                  0                41m     10.104.23.178   4am-node27   <none>           <none>

[2025-01-15T17:44:44.333Z] kafka-mixcoord-5808-etcd-1                                       1/1     Running                  0                41m     10.104.27.143   4am-node31   <none>           <none>

[2025-01-15T17:44:44.333Z] kafka-mixcoord-5808-etcd-2                                       1/1     Running                  0                41m     10.104.15.59    4am-node20   <none>           <none>

[2025-01-15T17:44:44.333Z] kafka-mixcoord-5808-kafka-0                                      2/2     Running                  1 (40m ago)      41m     10.104.27.144   4am-node31   <none>           <none>

[2025-01-15T17:44:44.333Z] kafka-mixcoord-5808-kafka-1                                      2/2     Running                  1 (40m ago)      41m     10.104.30.247   4am-node38   <none>           <none>

[2025-01-15T17:44:44.333Z] kafka-mixcoord-5808-kafka-2                                      2/2     Running                  0                41m     10.104.23.185   4am-node27   <none>           <none>

[2025-01-15T17:44:44.333Z] kafka-mixcoord-5808-kafka-exporter-54bf59fbb6-hwmw4              1/1     Running                  4 (40m ago)      41m     10.104.14.5     4am-node18   <none>           <none>

[2025-01-15T17:44:44.333Z] kafka-mixcoord-5808-kafka-zookeeper-0                            1/1     Running                  0                41m     10.104.32.8     4am-node39   <none>           <none>

[2025-01-15T17:44:44.333Z] kafka-mixcoord-5808-kafka-zookeeper-1                            1/1     Running                  0                41m     10.104.23.183   4am-node27   <none>           <none>

[2025-01-15T17:44:44.333Z] kafka-mixcoord-5808-kafka-zookeeper-2                            1/1     Running                  0                41m     10.104.17.63    4am-node23   <none>           <none>

[2025-01-15T17:44:44.333Z] kafka-mixcoord-5808-milvus-datanode-69cdc5c768-9mw7t             1/1     Running                  0                20m     10.104.17.65    4am-node23   <none>           <none>

[2025-01-15T17:44:44.333Z] kafka-mixcoord-5808-milvus-indexnode-87f885b7d-bxn8v             1/1     Running                  0                31m     10.104.19.237   4am-node28   <none>           <none>

[2025-01-15T17:44:44.333Z] kafka-mixcoord-5808-milvus-indexnode-87f885b7d-prb5n             1/1     Running                  0                30m     10.104.32.22    4am-node39   <none>           <none>

[2025-01-15T17:44:44.333Z] kafka-mixcoord-5808-milvus-indexnode-87f885b7d-xz98v             1/1     Running                  0                32m     10.104.21.7     4am-node24   <none>           <none>

[2025-01-15T17:44:44.333Z] kafka-mixcoord-5808-milvus-mixcoord-77ff88885d-qfhlg             1/1     Running                  0                27m     10.104.19.242   4am-node28   <none>           <none>

[2025-01-15T17:44:44.333Z] kafka-mixcoord-5808-milvus-proxy-64db849594-pk8w9                1/1     Running                  0                17m     10.104.19.11    4am-node28   <none>           <none>

[2025-01-15T17:44:44.333Z] kafka-mixcoord-5808-milvus-querynode-1-6dd6bbcffb-67tzh          1/1     Running                  0                23m     10.104.21.20    4am-node24   <none>           <none>

[2025-01-15T17:44:44.333Z] kafka-mixcoord-5808-milvus-querynode-1-6dd6bbcffb-85kh6          1/1     Running                  0                22m     10.104.32.26    4am-node39   <none>           <none>

[2025-01-15T17:44:44.333Z] kafka-mixcoord-5808-milvus-querynode-1-6dd6bbcffb-9pl5s          1/1     Running                  0                24m     10.104.19.250   4am-node28   <none>           <none>

[2025-01-15T17:44:44.333Z] kafka-mixcoord-5808-minio-0                                      1/1     Running                  0                41m     10.104.23.177   4am-node27   <none>           <none>

[2025-01-15T17:44:44.333Z] kafka-mixcoord-5808-minio-1                                      1/1     Running                  0                41m     10.104.27.142   4am-node31   <none>           <none>

[2025-01-15T17:44:44.333Z] kafka-mixcoord-5808-minio-2                                      1/1     Running                  0                41m     10.104.33.83    4am-node36   <none>           <none>

[2025-01-15T17:44:44.333Z] kafka-mixcoord-5808-minio-3                                      1/1     Running                  0                41m     10.104.16.213   4am-node21   <none>           <none>
[2025-01-15T17:44:33.976Z] FAILED testcases/test_single_request_operation_for_rolling_update.py::TestOperations::test_operations[3-10] - pytest_assume.plugin.FailedAssumption: 

[2025-01-15T17:44:33.976Z] 1 Failed Assumptions:

[2025-01-15T17:44:33.976Z] 

[2025-01-15T17:44:33.976Z] testcases/test_single_request_operation_for_rolling_update.py:153: AssumptionFailure

[2025-01-15T17:44:33.976Z] >>	pytest.assume(rto <= rto_threshold,  f"{self.health_checkers[k].c_name} {k} rto expect {rto_threshold}s but get {rto}s, {self.health_checkers[k].fail_records}")

[2025-01-15T17:44:33.976Z] AssertionError: IndexChecker__VE9KjjEi Op.index rto expect 10s but get 361.3266863822937s, [('failure', 36, '2025-01-15 17:18:01.159003', 1736961481.1590242), ('success', 37, '2025-01-15 17:24:02.485696', 1736961842.4857106)]

[2025-01-15T17:44:33.976Z] assert False

[2025-01-15T17:44:33.976Z] FAILED testcases/test_single_request_operation_for_rolling_update.py::TestOperations::test_operations[5-10] - pytest_assume.plugin.FailedAssumption: 

[2025-01-15T17:44:33.976Z] 1 Failed Assumptions:

[2025-01-15T17:44:33.976Z] 

[2025-01-15T17:44:33.976Z] testcases/test_single_request_operation_for_rolling_update.py:153: AssumptionFailure

[2025-01-15T17:44:33.976Z] >>	pytest.assume(rto <= rto_threshold,  f"{self.health_checkers[k].c_name} {k} rto expect {rto_threshold}s but get {rto}s, {self.health_checkers[k].fail_records}")

[2025-01-15T17:44:33.976Z] AssertionError: IndexChecker__aRzujAAO Op.index rto expect 10s but get 361.3993544578552s, [('failure', 36, '2025-01-15 17:17:51.069165', 1736961471.0692165), ('success', 37, '2025-01-15 17:23:52.468545', 1736961832.468571)]

[2025-01-15T17:44:33.976Z] assert False

[2025-01-15T17:44:33.976Z] FAILED testcases/test_single_request_operation_for_rolling_update.py::TestOperations::test_operations[4-10] - pytest_assume.plugin.FailedAssumption: 

[2025-01-15T17:44:33.976Z] 1 Failed Assumptions:

[2025-01-15T17:44:33.976Z] 

[2025-01-15T17:44:33.976Z] testcases/test_single_request_operation_for_rolling_update.py:153: AssumptionFailure

[2025-01-15T17:44:33.976Z] >>	pytest.assume(rto <= rto_threshold,  f"{self.health_checkers[k].c_name} {k} rto expect {rto_threshold}s but get {rto}s, {self.health_checkers[k].fail_records}")

[2025-01-15T17:44:33.976Z] AssertionError: IndexChecker__UqU99Lyv Op.index rto expect 10s but get 361.0691797733307s, [('failure', 36, '2025-01-15 17:18:03.477078', 1736961483.4771066), ('success', 37, '2025-01-15 17:24:04.546261', 1736961844.5462863)]

[2025-01-15T17:44:33.976Z] assert False

[2025-01-15T17:44:33.976Z] FAILED testcases/test_single_request_operation_for_rolling_update.py::TestOperations::test_operations[10-10] - pytest_assume.plugin.FailedAssumption: 

[2025-01-15T17:44:33.976Z] 1 Failed Assumptions:

[2025-01-15T17:44:33.976Z] 

[2025-01-15T17:44:33.976Z] testcases/test_single_request_operation_for_rolling_update.py:153: AssumptionFailure

[2025-01-15T17:44:33.976Z] >>	pytest.assume(rto <= rto_threshold,  f"{self.health_checkers[k].c_name} {k} rto expect {rto_threshold}s but get {rto}s, {self.health_checkers[k].fail_records}")

[2025-01-15T17:44:33.976Z] AssertionError: IndexChecker__zSMBok1x Op.index rto expect 10s but get 361.4542455673218s, [('failure', 33, '2025-01-15 17:17:51.012903', 1736961471.0129178), ('success', 34, '2025-01-15 17:23:52.467108', 1736961832.4671633)]

[2025-01-15T17:44:33.976Z] assert False

[2025-01-15T17:44:33.976Z] FAILED testcases/test_single_request_operation_for_rolling_update.py::TestOperations::test_operations[8-10] - pytest_assume.plugin.FailedAssumption: 

[2025-01-15T17:44:33.976Z] 1 Failed Assumptions:

[2025-01-15T17:44:33.976Z] 

[2025-01-15T17:44:33.976Z] testcases/test_single_request_operation_for_rolling_update.py:153: AssumptionFailure

[2025-01-15T17:44:33.976Z] >>	pytest.assume(rto <= rto_threshold,  f"{self.health_checkers[k].c_name} {k} rto expect {rto_threshold}s but get {rto}s, {self.health_checkers[k].fail_records}")

[2025-01-15T17:44:33.976Z] AssertionError: IndexChecker__Kyt97uif Op.index rto expect 10s but get 361.27700877189636s, [('failure', 35, '2025-01-15 17:17:53.264076', 1736961473.2641053), ('success', 36, '2025-01-15 17:23:54.541080', 1736961834.541114)]

[2025-01-15T17:44:33.976Z] assert False

[2025-01-15T17:44:33.976Z] FAILED testcases/test_single_request_operation_for_rolling_update.py::TestOperations::test_operations[9-10] - pytest_assume.plugin.FailedAssumption: 

[2025-01-15T17:44:33.976Z] 1 Failed Assumptions:

[2025-01-15T17:44:33.976Z] 

[2025-01-15T17:44:33.976Z] testcases/test_single_request_operation_for_rolling_update.py:153: AssumptionFailure

[2025-01-15T17:44:33.976Z] >>	pytest.assume(rto <= rto_threshold,  f"{self.health_checkers[k].c_name} {k} rto expect {rto_threshold}s but get {rto}s, {self.health_checkers[k].fail_records}")

[2025-01-15T17:44:33.976Z] AssertionError: IndexChecker__Dd5YijSI Op.index rto expect 10s but get 362.45383620262146s, [('failure', 35, '2025-01-15 17:17:47.246103', 1736961467.2461584), ('success', 36, '2025-01-15 17:23:49.699974', 1736961829.6999946)]

[2025-01-15T17:44:33.976Z] assert False

[2025-01-15T17:44:33.976Z] FAILED testcases/test_single_request_operation_for_rolling_update.py::TestOperations::test_operations[1-10] - pytest_assume.plugin.FailedAssumption: 

[2025-01-15T17:44:33.976Z] 1 Failed Assumptions:

[2025-01-15T17:44:33.976Z] 

[2025-01-15T17:44:33.976Z] testcases/test_single_request_operation_for_rolling_update.py:153: AssumptionFailure

[2025-01-15T17:44:33.976Z] >>	pytest.assume(rto <= rto_threshold,  f"{self.health_checkers[k].c_name} {k} rto expect {rto_threshold}s but get {rto}s, {self.health_checkers[k].fail_records}")

[2025-01-15T17:44:33.976Z] AssertionError: IndexChecker__93ndaiay Op.index rto expect 10s but get 361.2844207286835s, [('failure', 34, '2025-01-15 17:17:57.916114', 1736961477.9161696), ('success', 35, '2025-01-15 17:23:59.200531', 1736961839.2005904)]

[2025-01-15T17:44:33.976Z] assert False

[2025-01-15T17:44:33.976Z] FAILED testcases/test_single_request_operation_for_rolling_update.py::TestOperations::test_operations[7-10] - pytest_assume.plugin.FailedAssumption: 

[2025-01-15T17:44:33.976Z] 1 Failed Assumptions:

[2025-01-15T17:44:33.976Z] 

[2025-01-15T17:44:33.976Z] testcases/test_single_request_operation_for_rolling_update.py:153: AssumptionFailure

[2025-01-15T17:44:33.976Z] >>	pytest.assume(rto <= rto_threshold,  f"{self.health_checkers[k].c_name} {k} rto expect {rto_threshold}s but get {rto}s, {self.health_checkers[k].fail_records}")

[2025-01-15T17:44:33.976Z] AssertionError: IndexChecker__tktCH7SP Op.index rto expect 10s but get 361.5369760990143s, [('failure', 34, '2025-01-15 17:17:59.508880', 1736961479.508928), ('success', 35, '2025-01-15 17:24:01.045886', 1736961841.0459042)]

[2025-01-15T17:44:33.976Z] assert False

[2025-01-15T17:44:33.976Z] FAILED testcases/test_single_request_operation_for_rolling_update.py::TestOperations::test_operations[6-10] - pytest_assume.plugin.FailedAssumption: 

[2025-01-15T17:44:33.976Z] 1 Failed Assumptions:

[2025-01-15T17:44:33.976Z] 

[2025-01-15T17:44:33.976Z] testcases/test_single_request_operation_for_rolling_update.py:153: AssumptionFailure

[2025-01-15T17:44:33.976Z] >>	pytest.assume(rto <= rto_threshold,  f"{self.health_checkers[k].c_name} {k} rto expect {rto_threshold}s but get {rto}s, {self.health_checkers[k].fail_records}")

[2025-01-15T17:44:33.976Z] AssertionError: IndexChecker__41dVAGEH Op.index rto expect 10s but get 361.2454352378845s, [('failure', 31, '2025-01-15 17:17:56.449794', 1736961476.4498317), ('success', 32, '2025-01-15 17:23:57.695217', 1736961837.695267)]

[2025-01-15T17:44:33.976Z] assert False

[2025-01-15T17:44:33.976Z] FAILED testcases/test_single_request_operation_for_rolling_update.py::TestOperations::test_operations[2-10] - pytest_assume.plugin.FailedAssumption: 

[2025-01-15T17:44:33.976Z] 1 Failed Assumptions:

[2025-01-15T17:44:33.976Z] 

[2025-01-15T17:44:33.976Z] testcases/test_single_request_operation_for_rolling_update.py:153: AssumptionFailure

[2025-01-15T17:44:33.976Z] >>	pytest.assume(rto <= rto_threshold,  f"{self.health_checkers[k].c_name} {k} rto expect {rto_threshold}s but get {rto}s, {self.health_checkers[k].fail_records}")

[2025-01-15T17:44:33.976Z] AssertionError: IndexChecker__BG1aeRFJ Op.index rto expect 10s but get 361.23342394828796s, [('failure', 30, '2025-01-15 17:17:51.234539', 1736961471.2345972), ('success', 31, '2025-01-15 17:23:52.467972', 1736961832.4680212)]

[2025-01-15T17:44:33.976Z] assert False

[2025-01-15T17:44:33.976Z] ================= 10 failed, 10 warnings in 2145.86s (0:35:45) =================

@zhuwenxing
Copy link
Contributor Author

According to the logs, when the index was previously set to 120s, there were 3 failures. Now with it set to 360s, there is one failure. This suggests that during the rolling upgrade, there appears to be a time window of approximately 360 seconds during which new indexes cannot be created.

@yanliang567 yanliang567 modified the milestones: 2.5.3, 2.5.4 Jan 16, 2025
@xiaocai2333
Copy link
Contributor

After upgrading Milvus from version 2.4 to 2.5, the existing segments will undergo the following changes:
Image
This means that each segment will generate one stats task and one index task.

For the timeout case:

2025-01-16 01:17:28.030	
[2025/01/15 17:17:28.030 +00:00] [INFO] [proxy/impl.go:2081] ["CreateIndex received"] [traceID=de309f963ba55557ef11e0bb4ca982bb] [role=proxy] [db=default] [collection=IndexChecker__VE9KjjEi] [field=float_vector] [extra_params="[{\"key\":\"index_type\",\"value\":\"HNSW\"},{\"key\":\"metric_type\",\"value\":\"L2\"},{\"key\":\"params\",\"value\":\"{\\\"M\\\":48,\\\"efConstruction\\\":500}\"}]"]
2025-01-16 01:17:28.030	
[2025/01/15 17:17:28.030 +00:00] [INFO] [proxy/impl.go:2094] ["CreateIndex enqueued"] [traceID=de309f963ba55557ef11e0bb4ca982bb] [role=proxy] [db=default] [collection=IndexChecker__VE9KjjEi] [field=float_vector] [extra_params="[{\"key\":\"index_type\",\"value\":\"HNSW\"},{\"key\":\"metric_type\",\"value\":\"L2\"},{\"key\":\"params\",\"value\":\"{\\\"M\\\":48,\\\"efConstruction\\\":500}\"}]"] [BeginTs=455334021822677005] [EndTs=455334021822677005]
2025-01-16 01:17:28.042	
[2025/01/15 17:17:28.042 +00:00] [INFO] [proxy/impl.go:2112] ["CreateIndex done"] [traceID=de309f963ba55557ef11e0bb4ca982bb] [role=proxy] [db=default] [collection=IndexChecker__VE9KjjEi] [field=float_vector] [extra_params="[{\"key\":\"index_type\",\"value\":\"HNSW\"},{\"key\":\"metric_type\",\"value\":\"L2\"},{\"key\":\"params\",\"value\":\"{\\\"M\\\":48,\\\"efConstruction\\\":500}\"}]"] [BeginTs=455334021822677005] [EndTs=455334021822677005]
2025-01-16 01:24:48.450	
[2025/01/15 17:24:48.450 +00:00] [INFO] [proxy/impl.go:2371] ["DropIndex received"] [traceID=d8e476d5a324bdddccb35d72ff038282] [role=proxy] [db=default] [collection=IndexChecker__VE9KjjEi] [field=float_vector] ["index name"=]
2025-01-16 01:24:48.450	
[2025/01/15 17:24:48.450 +00:00] [DEBUG] [proxy/impl.go:2383] ["DropIndex enqueued"] [traceID=d8e476d5a324bdddccb35d72ff038282] [role=proxy] [db=default] [collection=IndexChecker__VE9KjjEi] [field=float_vector] ["index name"=] [BeginTs=455334137276137486] [EndTs=455334137276137486]
2025-01-16 01:24:48.460	
[2025/01/15 17:24:48.460 +00:00] [INFO] [proxy/impl.go:2401] ["DropIndex done"] [traceID=d8e476d5a324bdddccb35d72ff038282] [role=proxy] [db=default] [collection=IndexChecker__VE9KjjEi] [field=float_vector] ["index name"=] [BeginTs=455334137276137486] [EndTs=455334137276137486]

the collection has 4 segments, requiring 4 stats tasks to be submitted. The last stats task was submitted at 17:18:50, by which time there were already 900 tasks in the queue. The stats task was executed at 17:22:19 and completed quickly, immediately generating an index task. When the index task was enqueued, there were already 626 tasks in the queue. And the index task was executed at 17:24:38.

2025-01-16 01:18:50.403	
[2025/01/15 17:18:50.403 +00:00] [INFO] [datacoord/task_scheduler.go:248] ["task scheduler"] ["task num"=900]
2025-01-16 01:18:50.404	
[2025/01/15 17:18:50.404 +00:00] [INFO] [datacoord/stats_task_meta.go:111] ["add stats task"] [taskID=455334026730919971] [originSegmentID=455333832272257619] [targetSegmentID=455334026730919970] [subJobType=Sort]
2025-01-16 01:18:50.405	
[2025/01/15 17:18:50.405 +00:00] [INFO] [datacoord/stats_task_meta.go:127] ["add stats task success"] [taskID=455334026730919971] [originSegmentID=455333832272257619] [targetSegmentID=455334026730919970] [subJobType=Sort]
2025-01-16 01:22:19.163	
[2025/01/15 17:22:19.163 +00:00] [INFO] [indexnode/indexnode_service.go:382] ["receive stats job"] [traceID=cadd3ce212fa6e01712e295340806c37] [clusterID=kafka-mixcoord-5808] [TaskID=455334026730919971] [jobType=JobTypeStatsJob] [collectionID=455333832272257555] [partitionID=455333832272257556] [segmentID=455333832272257619] [targetSegmentID=455334026730919970] [subJobType=Sort] [startLogID=455334026732605873] [endLogID=455334026732606574]
2025-01-16 01:22:20.128	
[2025/01/15 17:22:20.128 +00:00] [INFO] [datacoord/task_scheduler.go:248] ["task scheduler"] ["task num"=626]
2025-01-16 01:22:20.139	
[2025/01/15 17:22:20.139 +00:00] [INFO] [datacoord/index_service.go:61] ["create index for segment"] [segmentID=455334026730919970] [indexID=455334026730280883]
2025-01-16 01:24:38.020	
[2025/01/15 17:24:38.020 +00:00] [INFO] [datacoord/task_index.go:262] ["index task pre check successfully"] [taskID=455334026732608075] [segID=455334026730919970]	

@xiaocai2333
Copy link
Contributor

The CPU utilization of indexnode is relative low. After the PR #39084 is merged, indexnode can concurrently execute small index tasks and stats tasks, improving resource utilization.

Image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Issues or changes related a bug priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

No branches or pull requests

5 participants