controller: improve the mechanism of disk removing #102
Conversation
Force-pushed from 4acacd5 to f561a2c
Force-pushed from f561a2c to 59ce989
I think I understand this :-) and it looks reasonable. I just have a couple of questions:
- It looks like if the call to `utils.ForceUmountWithTimeout()` fails due to timeout, then the disk won't be removed from the node. Given that we do want to remove inactive or corrupted disks from the node, would it make sense to ignore timeout failures and remove the disk anyway?
- I gave this a rough test by attaching a disk to Harvester running in a VM, letting it be provisioned, detaching the disk, rebooting that node, and waiting for it to come up showing a failed disk. Finally, I updated node-disk-manager to a build I did of this PR. The failed/missing disk was automatically removed from the node (when viewed via the Harvester UI), and from Longhorn, so that all works fine. But it's still present as a blockdevice CR. Is that correct? Or should the CR be deleted as well?
```yaml
# kubectl get bds -A -o yaml
apiVersion: v1
items:
- apiVersion: harvesterhci.io/v1beta1
  kind: BlockDevice
  metadata:
    creationTimestamp: "2024-05-16T04:30:39Z"
    finalizers:
    - wrangler.cattle.io/harvester-block-device-handler
    generation: 560
    labels:
      kubernetes.io/hostname: harvester-node-0
      ndm.harvesterhci.io/device-type: disk
    name: b3270eab097d4515bac2d595665ac618
    namespace: longhorn-system
    resourceVersion: "118217891"
    uid: 25443c2a-985d-4077-a02b-cc0ad96eb72a
  spec:
    devPath: /dev/sda
    fileSystem:
      forceFormatted: true
      mountPoint: ""
    nodeName: harvester-node-0
  status:
    conditions:
    - lastUpdateTime: "2024-05-16T04:54:14Z"
      message: Done device ext4 filesystem formatting
      status: "False"
      type: Formatting
    - lastUpdateTime: "2024-05-16T04:54:14Z"
      status: "True"
      type: Mounted
    - lastUpdateTime: "2024-05-16T07:33:05Z"
      message: Disk not in longhorn node `harvester-node-0`
      status: "False"
      type: AddedToNode
    deviceStatus:
      capacity:
        physicalBlockSizeBytes: 512
        sizeBytes: 10737418240
      details:
        busPath: pci-0000:00:07.0-scsi-0:0:0:0
        deviceType: disk
        driveType: HDD
        model: QEMU_HARDDISK
        serialNumber: "1234567890123456"
        storageController: SCSI
        uuid: eb7b5736-12be-4181-addd-a73bd6fa9479
        vendor: QEMU
        wwn: "0x1234567890123456"
      devPath: /dev/sda
      fileSystem:
        LastFormattedAt: "2024-05-16T04:54:14Z"
        mountPoint: /var/lib/harvester/extra-disks/b3270eab097d4515bac2d595665ac618
        type: ext4
      partitioned: false
    provisionPhase: Unprovisioned
    state: Inactive
    tags:
    - harvester-ndm-disk-remove
kind: List
metadata:
  resourceVersion: ""
```
Ah... good catch, I should not return there. Maybe just log the failure and continue the remove op with the Longhorn node.
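A minimal, self-contained sketch of that "log and continue" behavior, assuming a hypothetical stand-in for `utils.ForceUmountWithTimeout()` (the real helper lives in this repo; the names and the 30-second timeout here are illustrative only):

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// forceUmountWithTimeout is a hypothetical stand-in for the repo's
// utils.ForceUmountWithTimeout helper: it shells out to umount and
// gives up once the timeout expires.
func forceUmountWithTimeout(mountPoint string, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return exec.CommandContext(ctx, "umount", mountPoint).Run()
}

// removeDisk shows the suggested control flow: an umount failure (for
// example, a timeout on a disk that is already gone) is logged but does
// not abort the removal from the Longhorn node.
func removeDisk(mountPoint string) {
	if err := forceUmountWithTimeout(mountPoint, 30*time.Second); err != nil {
		// Do not return here; the device may already be detached.
		fmt.Printf("force umount %s failed: %v, continuing with removal\n", mountPoint, err)
	}
	// ...proceed to remove the disk from the Longhorn node spec...
	fmt.Printf("removing disk mounted at %s from the longhorn node\n", mountPoint)
}

func main() {
	removeDisk("/var/lib/harvester/extra-disks/example")
}
```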
Yeah, currently the blockdevice CR needs to be deleted manually. There are some reasons why I did not delete the blockdevice CR with any delete operation.
But I also wonder whether we need to refine the whole flow. I will create a HEP to redefine NDM in v1.4.0 and decompose the component, like the provisioner. I thought we could discuss more at that time.
Force-pushed from 59ce989 to 8d8c848
- We should not skip the inactive/corrupted device in the remove op.
- Improve the order of onChange: we now handle the removal first and the addition/update later, which clarifies the logic (see the sketch after the commit messages).

Signed-off-by: Vicente Cheng <[email protected]>
- We need to remove the duplicated disk after we have tested.

Signed-off-by: Vicente Cheng <[email protected]>
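A toy, self-contained sketch of the reordered onChange flow described in the first commit message above; `BlockDevice`, its fields, and the helper names are pared-down illustrations, not the repo's real types:

```go
package main

import "fmt"

// BlockDevice is a pared-down, hypothetical stand-in for the
// harvesterhci.io/v1beta1 BlockDevice CR seen in the YAML above.
type BlockDevice struct {
	Name  string
	State string   // e.g. "Active" or "Inactive"
	Tags  []string // e.g. "harvester-ndm-disk-remove"
}

// needsRemoval checks for the removal tag that marks a disk scheduled
// for removal (tag name taken from the YAML output above).
func needsRemoval(bd *BlockDevice) bool {
	for _, tag := range bd.Tags {
		if tag == "harvester-ndm-disk-remove" {
			return true
		}
	}
	return false
}

// onChange handles removal first, so an inactive/corrupted device is
// still removed, and only afterwards handles addition/update.
func onChange(bd *BlockDevice) {
	if needsRemoval(bd) {
		fmt.Printf("removing disk %s (state=%s) from the longhorn node\n", bd.Name, bd.State)
		return
	}
	if bd.State == "Active" {
		fmt.Printf("adding/updating disk %s\n", bd.Name)
	}
}

func main() {
	// An inactive disk tagged for removal is no longer skipped.
	onChange(&BlockDevice{
		Name:  "b3270eab097d4515bac2d595665ac618",
		State: "Inactive",
		Tags:  []string{"harvester-ndm-disk-remove"},
	})
}
```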
Force-pushed from 8d8c848 to 7eea331
@Mergifyio backport v0.6.x
✅ Backports have been created
Problem:
When a disk is gone, we cannot remove it via the Harvester UI.
That's a regression after we introduced the inactive/corrupted concept.

Solution:
Improve the disk-removing mechanism.

Related Issue:
harvester/harvester#5726

Test plan:

Nice to Have: