Improve upgrade mechanisms to keep service as healthy as possible #8
Comments
Something based around this should help:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# Copyright 2023 NeuroForge GmbH & Co. KG <https://neuroforge.de>
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import sys
from dataclasses import dataclass
from datetime import datetime
from typing import List

import docker
from docker.models.services import Service


def print_timed(msg):
    to_print = '{} [{}]: {}'.format(
        datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
        'docker_events',
        msg)
    print(to_print)


@dataclass
class StateInfo:
    service: Service
    target_replicas: int
    actual_replicas: int


def has_long_restart_policy(service: Service):
    """
    detects services with a long restart policy such as
    cron style services with a restart condition
    """
    try:
        restart_policy = service.attrs["Spec"]["TaskTemplate"]["RestartPolicy"]
        delay_ns = restart_policy["Delay"]
        # 10 minutes in nanoseconds
        cutoff_ns = 10 * 60 * 1e9
        return delay_ns > cutoff_ns
    except (KeyError, TypeError):
        # no restart policy or no delay configured
        return False


def is_oneshot(service: Service):
    """
    detects services that are intended as one shot
    """
    try:
        restart_policy = service.attrs["Spec"]["TaskTemplate"]["RestartPolicy"]
        return restart_policy["Condition"] == "none"
    except (KeyError, TypeError):
        # no restart policy or no condition configured
        return False


def get_state_infos(client: docker.DockerClient) -> List[StateInfo]:
    state_info: List[StateInfo] = []
    services = client.services.list()
    service: Service
    for service in services:
        mode = service.attrs["Spec"]["Mode"]
        if is_oneshot(service):
            # TODO: if it's a one shot, check if the task is still
            # running
            continue
        if has_long_restart_policy(service):
            continue
        if "Replicated" in mode:
            target_replicas = mode["Replicated"]["Replicas"]
        elif "Global" in mode:
            # global services are expected to run one task per node
            target_replicas = len(client.nodes.list())
        else:
            continue
        desired_running_tasks = service.tasks(filters={"desired-state": "running"})
        actually_running_tasks = [elem for elem in desired_running_tasks
                                  if elem["Status"]["State"] == "running"]
        actually_running_tasks_count = len(actually_running_tasks)
        state_info.append(StateInfo(
            service=service,
            target_replicas=target_replicas,
            actual_replicas=actually_running_tasks_count
        ))
    return state_info


def is_settled() -> bool:
    client = docker.DockerClient()
    state_info = get_state_infos(client)
    settled_services = [elem for elem in state_info
                        if elem.actual_replicas == elem.target_replicas]
    unsettled_services = [elem for elem in state_info
                          if elem.actual_replicas != elem.target_replicas]
    unsettled_count = len(unsettled_services)
    for elem in settled_services:
        print_timed(f"OK: service {elem.service.name} ({elem.service.id}) has settled")
    for elem in unsettled_services:
        print_timed(f"NOK: service {elem.service.name} ({elem.service.id}) has not settled yet")
    return unsettled_count == 0


if __name__ == '__main__':
    if is_settled():
        print_timed("swarm has settled")
        sys.exit(0)
    else:
        print_timed("swarm has not settled yet")
        sys.exit(1)
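The script above reports the state once and exits with 0 or 1, so an upgrade procedure would presumably have to call it repeatedly. A minimal sketch of such a polling loop follows; `wait_until` and its parameters (`timeout_s`, `interval_s`, `sleep`, `clock`) are hypothetical names chosen for illustration and are not part of the script.

```python
import time
from typing import Callable


def wait_until(check: Callable[[], bool],
               timeout_s: float = 600.0,
               interval_s: float = 5.0,
               sleep: Callable[[float], None] = time.sleep,
               clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll check() until it returns True or timeout_s has elapsed.

    Returns True if check() succeeded in time, False on timeout.
    sleep and clock are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Assuming `is_settled` from the script is importable, `wait_until(is_settled, timeout_s=900)` would block until every inspected service reports its target replica count or 15 minutes pass. The timeout value is an arbitrary example.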
Or, additionally, see moby/moby#34139 (comment).
Leaving this here as well: as an alternative way to move services off nodes that are about to be drained, it would be worth trying to update services with "--constraint-add 'node.hostname!=$(hostname)'" (or any other constraint) as needed, instead of deploying them with the constraint from the start. I haven't tried this on a multi-node swarm yet, but trying it on a local single-node swarm suggests it is worth exploring further. This could work:
To avoid forcing the tasks off the nodes immediately, do this for every node in this order
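As a rough illustration of the constraint idea only (this is not the commenter's step list), the `docker service update` invocation could be assembled like this. `constraint_update_cmd` is a hypothetical helper; it relies on the `--constraint-add` and `--constraint-rm` flags of `docker service update`, and the service and host names in the usage note are made up.

```python
from typing import List


def constraint_update_cmd(service: str, hostname: str, add: bool = True) -> List[str]:
    """Build the argv for adding (or removing) a 'keep off this host' constraint.

    add=True  -> docker service update --constraint-add node.hostname!=<hostname> <service>
    add=False -> docker service update --constraint-rm  node.hostname!=<hostname> <service>
    """
    flag = "--constraint-add" if add else "--constraint-rm"
    return ["docker", "service", "update", flag, f"node.hostname!={hostname}", service]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a manager node. For example, `constraint_update_cmd("web", "node-1")` targets a service named `web` and keeps it off `node-1`; calling it again with `add=False` after the upgrade removes the constraint.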
Currently we only wait until the node is drained. We should investigate whether it is feasible to wait for all stacks to finish being moved over. Wait for all services to stop scheduling new things during cluster upgrade?
Maybe we need to take a snapshot of all services and the replica counts before the upgrade and we then wait until the same replica counts are back?
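The snapshot idea could be sketched as follows, assuming per-service running-replica counts are collected the way `get_state_infos` does in the script posted in the comments. `take_snapshot` and `services_below_snapshot` are hypothetical names, and the sample data uses stand-in objects rather than real Docker services.

```python
from types import SimpleNamespace
from typing import Dict, Iterable, List


def take_snapshot(state_infos: Iterable) -> Dict[str, int]:
    """Map service id -> running replica count for StateInfo-like objects."""
    return {info.service.id: info.actual_replicas for info in state_infos}


def services_below_snapshot(before: Dict[str, int], now: Dict[str, int]) -> List[str]:
    """Return ids of services running fewer replicas now than in the snapshot.

    A service missing from `now` counts as 0 replicas, so it is reported too.
    """
    return sorted(sid for sid, count in before.items() if now.get(sid, 0) < count)


# Stand-in objects shaped like the script's StateInfo entries (hypothetical data).
example_before = take_snapshot([
    SimpleNamespace(service=SimpleNamespace(id="svc-a"), actual_replicas=3),
    SimpleNamespace(service=SimpleNamespace(id="svc-b"), actual_replicas=1),
])
example_now = {"svc-a": 2, "svc-b": 1}
print(services_below_snapshot(example_before, example_now))  # ['svc-a']
```

The comparison uses "fewer than" rather than "not equal", so a service that ends up with more replicas than before does not block the upgrade. Whether that is the right rule is an open design choice.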