
timeout - 1500 seconds - expired while waiting the decommission process is finished #9857

Open
timtimb0t opened this issue Jan 20, 2025 · 1 comment
Packages

Scylla version: 2025.1.0~dev-20250117.1ef2d9d07692 with build-id 4f745f675a915d8e8f5658b3cd49497f5ee65c50

Kernel Version: 6.8.0-1021-aws

Issue description

This error occurred during the disrupt_decommission_streaming_err nemesis. The timeouts in the FailedDecommissionOperationMonitoring class need to be increased; see the sketch after the traceback below.

2025-01-19 00:28:25.850: (DisruptionEvent Severity.ERROR) period_type=end event_id=6def3916-9dd4-4e37-8431-e4a0cf85c2bc duration=51m18s: nemesis_name=DecommissionStreamingErr target_node=Node longevity-twcs-48h-master-db-node-e42ded24-4 [3.254.53.253 | 10.4.8.101] errors=Wait for: Waiting decommission is finished for longevity-twcs-48h-master-db-node-e42ded24-4...: timeout - 1500 seconds - expired
Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 4096, in start_and_interrupt_decommission_streaming
    ParallelObject(objects=[trigger, watcher], timeout=full_operations_timeout).call_objects()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 524, in call_objects
    return self.run(lambda x: x(), ignore_exceptions=ignore_exceptions)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 503, in run
    raise ParallelObjectException(results=results)
sdcm.utils.common.ParallelObjectException: functools.partial(<bound method BaseNode.run_nodetool of <sdcm.cluster_aws.AWSNode object at 0x710f00114250>>, sub_cmd='decommission', timeout=900, warning_event_on_exception=(<class 'Exception'>,), long_running=True, retry=0):
Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 487, in run
    result = future.result(time_out)
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 460, in result
    raise TimeoutError()
concurrent.futures._base.TimeoutError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/wait.py", line 70, in wait_for
    res = retry(func, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 404, in __call__
    do = self.iter(retry_state=retry_state)
  File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 360, in iter
    raise retry_exc.reraise()
  File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 194, in reraise
    raise self
tenacity.RetryError: RetryError[<Future at 0x710eecd6b370 state=finished returned bool>]

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5497, in wrapper
    result = method(*args[1:], **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 4175, in disrupt_decommission_streaming_err
    self.start_and_interrupt_decommission_streaming()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 4090, in start_and_interrupt_decommission_streaming
    with ignore_stream_mutation_fragments_errors(), ignore_raft_topology_cmd_failing(), \
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/topology_ops.py", line 58, in __exit__
    wait_for(func=lambda: not self.is_node_decommissioning(), step=15,
  File "/home/ubuntu/scylla-cluster-tests/sdcm/wait.py", line 86, in wait_for
    raise raising_exc from ex
sdcm.exceptions.WaitForTimeoutError: Wait for: Waiting decommission is finished for longevity-twcs-48h-master-db-node-e42ded24-4...: timeout - 1500 seconds - expired
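
A possible direction for the fix suggested above is to make the wait timeout in FailedDecommissionOperationMonitoring (sdcm/utils/topology_ops.py) larger and configurable. The sketch below is only an illustration of that idea: the parameter name decommission_timeout, its 3600 s default, the is_node_decommissioning() placeholder, and any wait_for keyword arguments beyond func/step (which come from the traceback) are assumptions, not the actual SCT implementation.

```python
# Hedged sketch only: `decommission_timeout`, its 3600 s default, and the body of
# is_node_decommissioning() are illustrative assumptions, not the current SCT code.
from sdcm.wait import wait_for  # helper seen in the traceback (sdcm/wait.py)


class FailedDecommissionOperationMonitoring:
    """Context manager that waits for an interrupted decommission to settle on exit."""

    def __init__(self, target_node, decommission_timeout: int = 3600):
        # The effective wait in this run was 1500 s; a larger, configurable value
        # would give slow streaming/cleanup enough time to finish.
        self.target_node = target_node
        self.decommission_timeout = decommission_timeout

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        # Poll every 15 s until the node no longer reports a decommission in progress.
        # Keyword arguments other than func/step are assumed from context.
        wait_for(
            func=lambda: not self.is_node_decommissioning(),
            step=15,
            timeout=self.decommission_timeout,
            text=f"Waiting decommission is finished for {self.target_node.name}",
            throw_exc=True,
        )
        return False  # do not swallow exceptions raised inside the block

    def is_node_decommissioning(self) -> bool:
        # Placeholder for the real check (e.g. inspecting nodetool/topology state).
        raise NotImplementedError
```

If the class already exposes such a parameter, the fix may simply be passing a larger value from start_and_interrupt_decommission_streaming, which already computes full_operations_timeout for the surrounding ParallelObject call.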

Impact

No direct impact; this is an SCT (scylla-cluster-tests framework) issue.

How frequently does it reproduce?

Describe the frequency with which this issue can be reproduced.

Installation details

Cluster size: 4 nodes (i3en.2xlarge)

Scylla Nodes used in this run:

  • longevity-twcs-48h-master-db-node-e42ded24-9 (54.78.169.48 | 10.4.11.51) (shards: 7)
  • longevity-twcs-48h-master-db-node-e42ded24-8 (54.195.77.27 | 10.4.11.98) (shards: 7)
  • longevity-twcs-48h-master-db-node-e42ded24-7 (54.194.148.239 | 10.4.9.12) (shards: 7)
  • longevity-twcs-48h-master-db-node-e42ded24-6 (3.255.124.105 | 10.4.9.202) (shards: -1)
  • longevity-twcs-48h-master-db-node-e42ded24-5 (34.249.120.246 | 10.4.8.122) (shards: 7)
  • longevity-twcs-48h-master-db-node-e42ded24-4 (3.254.53.253 | 10.4.8.101) (shards: 7)
  • longevity-twcs-48h-master-db-node-e42ded24-3 (54.78.145.109 | 10.4.9.197) (shards: 7)
  • longevity-twcs-48h-master-db-node-e42ded24-2 (63.32.106.79 | 10.4.11.154) (shards: 7)
  • longevity-twcs-48h-master-db-node-e42ded24-10 (3.248.189.241 | 10.4.10.127) (shards: 7)
  • longevity-twcs-48h-master-db-node-e42ded24-1 (54.76.251.166 | 10.4.9.115) (shards: 7)

OS / Image: ami-0d3649fe0e81d5c8d (NO RUNNER: NO RUNNER)

Test: longevity-twcs-48h-test
Test id: e42ded24-b9d6-4a5b-a6af-eb412c12ce5c
Test name: scylla-master/tier1/longevity-twcs-48h-test
Test method: longevity_twcs_test.TWCSLongevityTest.test_custom_time
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor e42ded24-b9d6-4a5b-a6af-eb412c12ce5c
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs e42ded24-b9d6-4a5b-a6af-eb412c12ce5c

Logs:

Jenkins job URL
Argus

@timtimb0t timtimb0t added tests/longevity-tier1 on_core_qa tasks that should be solved by Core QA team labels Jan 20, 2025
@timtimb0t timtimb0t assigned aleksbykov and unassigned timtimb0t Jan 20, 2025
@timtimb0t (Contributor, Author) commented:

@aleksbykov, could you please take a look at this issue?
