
timeout - 1500 seconds - expired while waiting the decommission process is finished #9857

Open
timtimb0t opened this issue Jan 20, 2025 · 1 comment
Packages

Scylla version: 2025.1.0~dev-20250117.1ef2d9d07692 with build-id 4f745f675a915d8e8f5658b3cd49497f5ee65c50

Kernel Version: 6.8.0-1021-aws

Issue description

This error occurred during the disrupt_decommission_streaming_err nemesis. The timeouts in the FailedDecommissionOperationMonitoring class need to be increased; see the sketch after the traceback below.

2025-01-19 00:28:25.850: (DisruptionEvent Severity.ERROR) period_type=end event_id=6def3916-9dd4-4e37-8431-e4a0cf85c2bc duration=51m18s: nemesis_name=DecommissionStreamingErr target_node=Node longevity-twcs-48h-master-db-node-e42ded24-4 [3.254.53.253 | 10.4.8.101] errors=Wait for: Waiting decommission is finished for longevity-twcs-48h-master-db-node-e42ded24-4...: timeout - 1500 seconds - expired
Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 4096, in start_and_interrupt_decommission_streaming
    ParallelObject(objects=[trigger, watcher], timeout=full_operations_timeout).call_objects()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 524, in call_objects
    return self.run(lambda x: x(), ignore_exceptions=ignore_exceptions)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 503, in run
    raise ParallelObjectException(results=results)
sdcm.utils.common.ParallelObjectException: functools.partial(<bound method BaseNode.run_nodetool of <sdcm.cluster_aws.AWSNode object at 0x710f00114250>>, sub_cmd='decommission', timeout=900, warning_event_on_exception=(<class 'Exception'>,), long_running=True, retry=0):
Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 487, in run
    result = future.result(time_out)
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 460, in result
    raise TimeoutError()
concurrent.futures._base.TimeoutError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/wait.py", line 70, in wait_for
    res = retry(func, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 404, in __call__
    do = self.iter(retry_state=retry_state)
  File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 360, in iter
    raise retry_exc.reraise()
  File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 194, in reraise
    raise self
tenacity.RetryError: RetryError[<Future at 0x710eecd6b370 state=finished returned bool>]

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5497, in wrapper
    result = method(*args[1:], **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 4175, in disrupt_decommission_streaming_err
    self.start_and_interrupt_decommission_streaming()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 4090, in start_and_interrupt_decommission_streaming
    with ignore_stream_mutation_fragments_errors(), ignore_raft_topology_cmd_failing(), \
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/topology_ops.py", line 58, in __exit__
    wait_for(func=lambda: not self.is_node_decommissioning(), step=15,
  File "/home/ubuntu/scylla-cluster-tests/sdcm/wait.py", line 86, in wait_for
    raise raising_exc from ex
sdcm.exceptions.WaitForTimeoutError: Wait for: Waiting decommission is finished for longevity-twcs-48h-master-db-node-e42ded24-4...: timeout - 1500 seconds - expired
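
A possible direction for the fix suggested above is to make the wait timeout in FailedDecommissionOperationMonitoring (sdcm/utils/topology_ops.py) larger and configurable. The sketch below is only an illustration of that idea: the parameter name decommission_timeout, its 3600 s default, the is_node_decommissioning() placeholder, and any wait_for keyword arguments beyond func/step (which come from the traceback) are assumptions, not the actual SCT implementation.

```python
# Hedged sketch only: `decommission_timeout`, its 3600 s default, and the body of
# is_node_decommissioning() are illustrative assumptions, not the current SCT code.
from sdcm.wait import wait_for  # helper seen in the traceback (sdcm/wait.py)


class FailedDecommissionOperationMonitoring:
    """Context manager that waits for an interrupted decommission to settle on exit."""

    def __init__(self, target_node, decommission_timeout: int = 3600):
        # The effective wait in this run was 1500 s; a larger, configurable value
        # would give slow streaming/cleanup enough time to finish.
        self.target_node = target_node
        self.decommission_timeout = decommission_timeout

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        # Poll every 15 s until the node no longer reports a decommission in progress.
        # Keyword arguments other than func/step are assumed from context.
        wait_for(
            func=lambda: not self.is_node_decommissioning(),
            step=15,
            timeout=self.decommission_timeout,
            text=f"Waiting decommission is finished for {self.target_node.name}",
            throw_exc=True,
        )
        return False  # do not swallow exceptions raised inside the block

    def is_node_decommissioning(self) -> bool:
        # Placeholder for the real check (e.g. inspecting nodetool/topology state).
        raise NotImplementedError
```

If the class already exposes such a parameter, the fix may simply be passing a larger value from start_and_interrupt_decommission_streaming, which already computes full_operations_timeout for the surrounding ParallelObject call.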

Impact

No direct impact; this is an SCT (scylla-cluster-tests framework) issue.

How frequently does it reproduce?

Describe the frequency with which this issue can be reproduced.

Installation details

Cluster size: 4 nodes (i3en.2xlarge)

Scylla Nodes used in this run:

  • longevity-twcs-48h-master-db-node-e42ded24-9 (54.78.169.48 | 10.4.11.51) (shards: 7)
  • longevity-twcs-48h-master-db-node-e42ded24-8 (54.195.77.27 | 10.4.11.98) (shards: 7)
  • longevity-twcs-48h-master-db-node-e42ded24-7 (54.194.148.239 | 10.4.9.12) (shards: 7)
  • longevity-twcs-48h-master-db-node-e42ded24-6 (3.255.124.105 | 10.4.9.202) (shards: -1)
  • longevity-twcs-48h-master-db-node-e42ded24-5 (34.249.120.246 | 10.4.8.122) (shards: 7)
  • longevity-twcs-48h-master-db-node-e42ded24-4 (3.254.53.253 | 10.4.8.101) (shards: 7)
  • longevity-twcs-48h-master-db-node-e42ded24-3 (54.78.145.109 | 10.4.9.197) (shards: 7)
  • longevity-twcs-48h-master-db-node-e42ded24-2 (63.32.106.79 | 10.4.11.154) (shards: 7)
  • longevity-twcs-48h-master-db-node-e42ded24-10 (3.248.189.241 | 10.4.10.127) (shards: 7)
  • longevity-twcs-48h-master-db-node-e42ded24-1 (54.76.251.166 | 10.4.9.115) (shards: 7)

OS / Image: ami-0d3649fe0e81d5c8d (NO RUNNER: NO RUNNER)

Test: longevity-twcs-48h-test
Test id: e42ded24-b9d6-4a5b-a6af-eb412c12ce5c
Test name: scylla-master/tier1/longevity-twcs-48h-test
Test method: longevity_twcs_test.TWCSLongevityTest.test_custom_time
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor e42ded24-b9d6-4a5b-a6af-eb412c12ce5c
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs e42ded24-b9d6-4a5b-a6af-eb412c12ce5c

Logs:

Jenkins job URL
Argus

@timtimb0t timtimb0t added tests/longevity-tier1 on_core_qa tasks that should be solved by Core QA team labels Jan 20, 2025
@timtimb0t timtimb0t assigned aleksbykov and unassigned timtimb0t Jan 20, 2025
@timtimb0t (Contributor, Author) commented:

@aleksbykov, could you please take a look at this issue?
