
controller: improve the mechanism of disk removing #102

Merged

Conversation

Vicente-Cheng
Collaborator

@Vicente-Cheng Vicente-Cheng commented May 10, 2024

Problem:
When a disk is gone, we cannot remove it from the Harvester UI. This is a regression introduced with the inactive/corrupted disk concept.

Solution:
Improve the disk removal mechanism.

Related Issue:
harvester/harvester#5726

Test plan:

  1. Create a 3-node Harvester cluster.
  2. Add an extra disk and provision it as a Longhorn disk.
  3. Physically remove the disk added above. (If the nodes run on KVM, you can use virsh attach-device/detach-device.)
  4. Remove the provisioned disk via the Harvester UI.
  5. Make sure the disk is removed from both the Harvester and Longhorn UIs.

Nice to Have:

  • Add a new integration test for the above case.

@Vicente-Cheng Vicente-Cheng requested review from tserong and bk201 May 10, 2024 17:23
@Vicente-Cheng Vicente-Cheng force-pushed the improve-the-remove-mechanism branch 4 times, most recently from 4acacd5 to f561a2c on May 13, 2024 05:26
@Vicente-Cheng Vicente-Cheng marked this pull request as draft May 13, 2024 06:44
@Vicente-Cheng Vicente-Cheng force-pushed the improve-the-remove-mechanism branch from f561a2c to 59ce989 on May 14, 2024 07:33
@Vicente-Cheng Vicente-Cheng marked this pull request as ready for review May 14, 2024 08:42
@Vicente-Cheng Vicente-Cheng requested a review from votdev May 16, 2024 03:26
Contributor

@tserong tserong left a comment

I think I understand this :-) and it looks reasonable. I just have a couple of questions:

  1. It looks like if the call to utils.ForceUmountWithTimeout() fails due to timeout, then the disk won't be removed from the node. Given that we do want to remove inactive or corrupted disks from the node, would it make sense to ignore timeout failures and remove the disk anyway?
  2. I gave this a rough test by attaching a disk to harvester running in a VM, then I let it be provisioned, detached the disk, rebooted that node, waited for it to come up showing a failed disk, and then finally I updated node-disk-manager to a build I did of this PR. The failed/missing disk was automatically removed from the node (when viewed via the Harvester UI), and from Longhorn. So that all works fine. But, it's still present as a blockdevice CR. Is that correct? Or should the CR be deleted as well?
# kubectl get bds -A -o yaml
apiVersion: v1
items:
- apiVersion: harvesterhci.io/v1beta1
  kind: BlockDevice
  metadata:
    creationTimestamp: "2024-05-16T04:30:39Z"
    finalizers:
    - wrangler.cattle.io/harvester-block-device-handler
    generation: 560
    labels:
      kubernetes.io/hostname: harvester-node-0
      ndm.harvesterhci.io/device-type: disk
    name: b3270eab097d4515bac2d595665ac618
    namespace: longhorn-system
    resourceVersion: "118217891"
    uid: 25443c2a-985d-4077-a02b-cc0ad96eb72a
  spec:
    devPath: /dev/sda
    fileSystem:
      forceFormatted: true
      mountPoint: ""
    nodeName: harvester-node-0
  status:
    conditions:
    - lastUpdateTime: "2024-05-16T04:54:14Z"
      message: Done device ext4 filesystem formatting
      status: "False"
      type: Formatting
    - lastUpdateTime: "2024-05-16T04:54:14Z"
      status: "True"
      type: Mounted
    - lastUpdateTime: "2024-05-16T07:33:05Z"
      message: Disk not in longhorn node `harvester-node-0`
      status: "False"
      type: AddedToNode
    deviceStatus:
      capacity:
        physicalBlockSizeBytes: 512
        sizeBytes: 10737418240
      details:
        busPath: pci-0000:00:07.0-scsi-0:0:0:0
        deviceType: disk
        driveType: HDD
        model: QEMU_HARDDISK
        serialNumber: "1234567890123456"
        storageController: SCSI
        uuid: eb7b5736-12be-4181-addd-a73bd6fa9479
        vendor: QEMU
        wwn: "0x1234567890123456"
      devPath: /dev/sda
      fileSystem:
        LastFormattedAt: "2024-05-16T04:54:14Z"
        mountPoint: /var/lib/harvester/extra-disks/b3270eab097d4515bac2d595665ac618
        type: ext4
      partitioned: false
    provisionPhase: Unprovisioned
    state: Inactive
    tags:
    - harvester-ndm-disk-remove
kind: List
metadata:
  resourceVersion: ""

@Vicente-Cheng
Collaborator Author

I think I understand this :-) and it looks reasonable. I just have a couple of questions:

  1. It looks like if the call to utils.ForceUmountWithTimeout() fails due to timeout, then the disk won't be removed from the node. Given that we do want to remove inactive or corrupted disks from the node, would it make sense to ignore timeout failures and remove the disk anyway?

Ah, good catch. I should not return there; maybe just log the error and continue the removal with the Longhorn node.
Let me update again...
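
To illustrate the idea, here is only a rough sketch, not the actual controller code; forceUmountWithTimeout and removeFromLonghornNode below are stand-ins for utils.ForceUmountWithTimeout and the real Longhorn-node update in this repo:

package sketch

import (
	"fmt"
	"log"
	"os/exec"
	"time"
)

// forceUmountWithTimeout mimics utils.ForceUmountWithTimeout: try to unmount
// the path, but give up after the timeout instead of blocking forever.
func forceUmountWithTimeout(mountPoint string, timeout time.Duration) error {
	done := make(chan error, 1)
	go func() {
		done <- exec.Command("umount", "-f", mountPoint).Run()
	}()
	select {
	case err := <-done:
		return err
	case <-time.After(timeout):
		return fmt.Errorf("umount %s timed out after %s", mountPoint, timeout)
	}
}

// removeDisk shows the "log and continue" behaviour: an umount failure is
// logged, but it no longer stops the disk from being removed from the
// Longhorn node.
func removeDisk(mountPoint string, removeFromLonghornNode func() error) error {
	if err := forceUmountWithTimeout(mountPoint, 30*time.Second); err != nil {
		// Previously a timeout here caused an early return, so an inactive or
		// corrupted disk could never be removed; now we only log it.
		log.Printf("force umount %s failed: %v; continuing with removal", mountPoint, err)
	}
	return removeFromLonghornNode()
}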

  2. I gave this a rough test by attaching a disk to harvester running in a VM, then I let it be provisioned, detached the disk, rebooted that node, waited for it to come up showing a failed disk, and then finally I updated node-disk-manager to a build I did of this PR. The failed/missing disk was automatically removed from the node (when viewed via the Harvester UI), and from Longhorn. So that all works fine. But, it's still present as a blockdevice CR. Is that correct? Or should the CR be deleted as well?

Yeah, currently the blockdevice CR needs to be deleted manually. There are a couple of reasons why I did not delete the blockdevice CR as part of any delete operation:

  1. For the hotplug case, the blockdevice state will change from inactive to active.
  2. For the corruption case, the blockdevice needs to keep the corruption flag recorded, so we can try to repair and reuse the disk manually.

But I also wonder whether we need to refine the whole flow. I will create a HEP to redesign NDM in v1.4.0 and decompose components such as the provisioner. We could discuss more at that time.
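
To make that concrete, here is a simplified sketch (not the real onChange code; the state and tag names follow the blockdevice CR shown above, and the helper functions are hypothetical):

package sketch

// BlockDevice holds only the fields needed for this illustration.
type BlockDevice struct {
	Name      string
	State     string   // "Active" or "Inactive", as in the CR status above
	Corrupted bool     // kept so a repaired disk can be reused manually
	Tags      []string // e.g. "harvester-ndm-disk-remove"
}

// onDeviceChange sketches the ordering described in the commits: handle
// removal first, then addition/update, and keep the CR around instead of
// deleting it.
func onDeviceChange(bd *BlockDevice, present bool, removeFromLonghorn func(*BlockDevice) error) error {
	if hasTag(bd, "harvester-ndm-disk-remove") {
		// Remove the disk from the Longhorn node, but keep the CR so the
		// hotplug and corruption cases below still work.
		if err := removeFromLonghorn(bd); err != nil {
			return err
		}
	}
	if present {
		// Hotplug: the same CR flips back from Inactive to Active.
		bd.State = "Active"
		return nil
	}
	// Missing disk: keep the CR (and any corruption flag) for later repair.
	bd.State = "Inactive"
	return nil
}

func hasTag(bd *BlockDevice, tag string) bool {
	for _, t := range bd.Tags {
		if t == tag {
			return true
		}
	}
	return false
}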

@Vicente-Cheng Vicente-Cheng force-pushed the improve-the-remove-mechanism branch from 59ce989 to 8d8c848 on May 16, 2024 08:28
@Vicente-Cheng Vicente-Cheng requested a review from tserong May 16, 2024 08:29
pkg/controller/blockdevice/controller.go (outdated; review thread resolved)
pkg/controller/blockdevice/controller.go (outdated; review thread resolved)
pkg/utils/utils.go (outdated; review thread resolved)
pkg/utils/utils.go (review thread resolved)
    - We should not skip the inactive/corrupted device in the removal op

    - Improve the order of onChange. Now we handle the removal
      first and the addition/update later. That helps clarify the logic.

Signed-off-by: Vicente Cheng <[email protected]>
    - we need to remove the duplicated disk after testing

Signed-off-by: Vicente Cheng <[email protected]>
Signed-off-by: Vicente Cheng <[email protected]>
@Vicente-Cheng Vicente-Cheng force-pushed the improve-the-remove-mechanism branch from 8d8c848 to 7eea331 on May 20, 2024 06:30
@Vicente-Cheng Vicente-Cheng requested a review from votdev May 20, 2024 06:32
@Vicente-Cheng Vicente-Cheng merged commit 969ccdc into harvester:master May 21, 2024
8 checks passed
@Vicente-Cheng
Collaborator Author

@Mergifyio backport v0.6.x

mergify bot commented May 21, 2024

backport v0.6.x

✅ Backports have been created
