
controller: improve the mechanism of disk removing #102

Merged

Conversation

Vicente-Cheng
Collaborator

@Vicente-Cheng Vicente-Cheng commented May 10, 2024

Problem:
When a disk is gone, we cannot remove it from the Harvester UI. This is a regression introduced with the inactive/corrupted disk concept.

Solution:
Improve the disk removal mechanism.

Related Issue:
harvester/harvester#5726

Test plan:

  1. Create a 3-node Harvester cluster.
  2. Add an extra disk and provision it as a Longhorn disk.
  3. Physically remove the disk added above. (If the nodes run on KVM, you can use virsh attach-device/detach-device.)
  4. Remove the provisioned disk via the Harvester UI.
  5. Make sure the disk is removed from both the Harvester and Longhorn UIs.

Nice to Have:

  • Add a new integration test for the above case.

@Vicente-Cheng Vicente-Cheng requested review from tserong and bk201 May 10, 2024 17:23
@Vicente-Cheng Vicente-Cheng force-pushed the improve-the-remove-mechanism branch 4 times, most recently from 4acacd5 to f561a2c on May 13, 2024 05:26
@Vicente-Cheng Vicente-Cheng marked this pull request as draft May 13, 2024 06:44
@Vicente-Cheng Vicente-Cheng force-pushed the improve-the-remove-mechanism branch from f561a2c to 59ce989 on May 14, 2024 07:33
@Vicente-Cheng Vicente-Cheng marked this pull request as ready for review May 14, 2024 08:42
@Vicente-Cheng Vicente-Cheng requested a review from votdev May 16, 2024 03:26
Contributor

@tserong tserong left a comment

I think I understand this :-) and it looks reasonable. I just have a couple of questions:

  1. It looks like if the call to utils.ForceUmountWithTimeout() fails due to timeout, then the disk won't be removed from the node. Given that we do want to remove inactive or corrupted disks from the node, would it make sense to ignore timeout failures and remove the disk anyway?
  2. I gave this a rough test by attaching a disk to harvester running in a VM, then I let it be provisioned, detached the disk, rebooted that node, waited for it to come up showing a failed disk, and then finally I updated node-disk-manager to a build I did of this PR. The failed/missing disk was automatically removed from the node (when viewed via the Harvester UI), and from Longhorn. So that all works fine. But, it's still present as a blockdevice CR. Is that correct? Or should the CR be deleted as well?
# kubectl get bds -A -o yaml
apiVersion: v1
items:
- apiVersion: harvesterhci.io/v1beta1
  kind: BlockDevice
  metadata:
    creationTimestamp: "2024-05-16T04:30:39Z"
    finalizers:
    - wrangler.cattle.io/harvester-block-device-handler
    generation: 560
    labels:
      kubernetes.io/hostname: harvester-node-0
      ndm.harvesterhci.io/device-type: disk
    name: b3270eab097d4515bac2d595665ac618
    namespace: longhorn-system
    resourceVersion: "118217891"
    uid: 25443c2a-985d-4077-a02b-cc0ad96eb72a
  spec:
    devPath: /dev/sda
    fileSystem:
      forceFormatted: true
      mountPoint: ""
    nodeName: harvester-node-0
  status:
    conditions:
    - lastUpdateTime: "2024-05-16T04:54:14Z"
      message: Done device ext4 filesystem formatting
      status: "False"
      type: Formatting
    - lastUpdateTime: "2024-05-16T04:54:14Z"
      status: "True"
      type: Mounted
    - lastUpdateTime: "2024-05-16T07:33:05Z"
      message: Disk not in longhorn node `harvester-node-0`
      status: "False"
      type: AddedToNode
    deviceStatus:
      capacity:
        physicalBlockSizeBytes: 512
        sizeBytes: 10737418240
      details:
        busPath: pci-0000:00:07.0-scsi-0:0:0:0
        deviceType: disk
        driveType: HDD
        model: QEMU_HARDDISK
        serialNumber: "1234567890123456"
        storageController: SCSI
        uuid: eb7b5736-12be-4181-addd-a73bd6fa9479
        vendor: QEMU
        wwn: "0x1234567890123456"
      devPath: /dev/sda
      fileSystem:
        LastFormattedAt: "2024-05-16T04:54:14Z"
        mountPoint: /var/lib/harvester/extra-disks/b3270eab097d4515bac2d595665ac618
        type: ext4
      partitioned: false
    provisionPhase: Unprovisioned
    state: Inactive
    tags:
    - harvester-ndm-disk-remove
kind: List
metadata:
  resourceVersion: ""

@Vicente-Cheng
Collaborator Author

I think I understand this :-) and it looks reasonable. I just have a couple of questions:

  1. It looks like if the call to utils.ForceUmountWithTimeout() fails due to timeout, then the disk won't be removed from the node. Given that we do want to remove inactive or corrupted disks from the node, would it make sense to ignore timeout failures and remove the disk anyway?

Ah, good catch. I should not return there; maybe just log the error and continue the removal with the Longhorn node.
Let me update again...
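
To illustrate the idea, here is only a rough sketch, not the actual controller code; forceUmountWithTimeout and removeFromLonghornNode below are stand-ins for utils.ForceUmountWithTimeout and the real Longhorn-node update in this repo:

package sketch

import (
	"fmt"
	"log"
	"os/exec"
	"time"
)

// forceUmountWithTimeout mimics utils.ForceUmountWithTimeout: try to unmount
// the path, but give up after the timeout instead of blocking forever.
func forceUmountWithTimeout(mountPoint string, timeout time.Duration) error {
	done := make(chan error, 1)
	go func() {
		done <- exec.Command("umount", "-f", mountPoint).Run()
	}()
	select {
	case err := <-done:
		return err
	case <-time.After(timeout):
		return fmt.Errorf("umount %s timed out after %s", mountPoint, timeout)
	}
}

// removeDisk shows the "log and continue" behaviour: an umount failure is
// logged, but it no longer stops the disk from being removed from the
// Longhorn node.
func removeDisk(mountPoint string, removeFromLonghornNode func() error) error {
	if err := forceUmountWithTimeout(mountPoint, 30*time.Second); err != nil {
		// Previously a timeout here caused an early return, so an inactive or
		// corrupted disk could never be removed; now we only log it.
		log.Printf("force umount %s failed: %v; continuing with removal", mountPoint, err)
	}
	return removeFromLonghornNode()
}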

  2. I gave this a rough test by attaching a disk to harvester running in a VM, then I let it be provisioned, detached the disk, rebooted that node, waited for it to come up showing a failed disk, and then finally I updated node-disk-manager to a build I did of this PR. The failed/missing disk was automatically removed from the node (when viewed via the Harvester UI), and from Longhorn. So that all works fine. But, it's still present as a blockdevice CR. Is that correct? Or should the CR be deleted as well?

Yeah, currently the blockdevice CR needs to be deleted manually. There are a couple of reasons why I did not delete the blockdevice CR as part of any delete operation:

  1. For the hotplug case, the blockdevice state will change from inactive to active.
  2. For the corruption case, the blockdevice needs to keep the corruption flag recorded, so we can try to repair and reuse the disk manually.

But I also wonder whether we need to refine the whole flow. I will create a HEP to redesign NDM in v1.4.0 and decompose components such as the provisioner. We could discuss more at that time.
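
To make that concrete, here is a simplified sketch (not the real onChange code; the state and tag names follow the blockdevice CR shown above, and the helper functions are hypothetical):

package sketch

// BlockDevice holds only the fields needed for this illustration.
type BlockDevice struct {
	Name      string
	State     string   // "Active" or "Inactive", as in the CR status above
	Corrupted bool     // kept so a repaired disk can be reused manually
	Tags      []string // e.g. "harvester-ndm-disk-remove"
}

// onDeviceChange sketches the ordering described in the commits: handle
// removal first, then addition/update, and keep the CR around instead of
// deleting it.
func onDeviceChange(bd *BlockDevice, present bool, removeFromLonghorn func(*BlockDevice) error) error {
	if hasTag(bd, "harvester-ndm-disk-remove") {
		// Remove the disk from the Longhorn node, but keep the CR so the
		// hotplug and corruption cases below still work.
		if err := removeFromLonghorn(bd); err != nil {
			return err
		}
	}
	if present {
		// Hotplug: the same CR flips back from Inactive to Active.
		bd.State = "Active"
		return nil
	}
	// Missing disk: keep the CR (and any corruption flag) for later repair.
	bd.State = "Inactive"
	return nil
}

func hasTag(bd *BlockDevice, tag string) bool {
	for _, t := range bd.Tags {
		if t == tag {
			return true
		}
	}
	return false
}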

@Vicente-Cheng Vicente-Cheng force-pushed the improve-the-remove-mechanism branch from 59ce989 to 8d8c848 on May 16, 2024 08:28
@Vicente-Cheng Vicente-Cheng requested a review from tserong May 16, 2024 08:29
pkg/controller/blockdevice/controller.go (outdated; review thread resolved)
pkg/controller/blockdevice/controller.go (outdated; review thread resolved)
pkg/utils/utils.go (outdated; review thread resolved)
pkg/utils/utils.go (review thread resolved)
    - We should not skip the inactive/corrupted device in the removal op

    - Improve the order of onChange. Now we handle the removal
      first and the addition/update later. That helps clarify the logic.

Signed-off-by: Vicente Cheng <[email protected]>
    - we need to remove the duplicated disk after testing

Signed-off-by: Vicente Cheng <[email protected]>
Signed-off-by: Vicente Cheng <[email protected]>
@Vicente-Cheng Vicente-Cheng force-pushed the improve-the-remove-mechanism branch from 8d8c848 to 7eea331 on May 20, 2024 06:30
@Vicente-Cheng Vicente-Cheng requested a review from votdev May 20, 2024 06:32
@Vicente-Cheng Vicente-Cheng merged commit 969ccdc into harvester:master May 21, 2024
8 checks passed
@Vicente-Cheng
Collaborator Author

@Mergifyio backport v0.6.x

mergify bot commented May 21, 2024

backport v0.6.x

✅ Backports have been created
