
Unable to disconnect from NVMeoF subsystem #2603

Open
derekbit opened this issue Dec 5, 2024 · 9 comments

derekbit commented Dec 5, 2024

When an NVMe initiator is connected to an NVMe target via NVMe-TCP and the target node goes offline, the subsystem shown by nvme list-subsys is left with an empty Paths list. How can I delete this subsystem? Or, what parameters should be provided at connect time to handle this situation?

# nvme list-subsys -o json
[
  {
    "HostNQN":"nqn.2014-08.org.nvmexpress:uuid:a48f0944-3ab0-4e5b-8019-79192500a44e",
    "Subsystems":[
      {
        "Name":"nvme-subsys0",
        "NQN":"nqn.2023-01.io.longhorn.spdk:pvc-4aa94de4-2f75-4cbb-855d-3737a488be50-e-0",
        "IOPolicy":"numa",
        "Paths":[]
      }
    ]
  }
]

I've tried the nvme disconnect command, but it doesn't work.

# nvme disconnect -n nqn.2023-01.io.longhorn.spdk:pvc-4aa94de4-2f75-4cbb-855d-3737a488be50-e-0
NQN:nqn.2023-01.io.longhorn.spdk:pvc-4aa94de4-2f75-4cbb-855d-3737a488be50-e-0 disconnected 0 controller(s)

# nvme list-subsys -o json
[
  {
    "HostNQN":"nqn.2014-08.org.nvmexpress:uuid:a48f0944-3ab0-4e5b-8019-79192500a44e",
    "Subsystems":[
      {
        "Name":"nvme-subsys0",
        "NQN":"nqn.2023-01.io.longhorn.spdk:pvc-4aa94de4-2f75-4cbb-855d-3737a488be50-e-0",
        "IOPolicy":"numa",
        "Paths":[]
      }
    ]
  }
]

dmesg

[ 1046.878470] nvme nvme1: failed to connect socket: -110
[ 1046.878493] nvme nvme1: Failed reconnect attempt 10
[ 1046.878496] nvme nvme1: Reconnecting in 2 seconds...
[ 1051.998732] nvme nvme1: failed to connect socket: -110
[ 1051.998750] nvme nvme1: Failed reconnect attempt 11
[ 1051.998752] nvme nvme1: Reconnecting in 2 seconds...
[ 1057.118953] nvme nvme1: failed to connect socket: -110
[ 1057.118978] nvme nvme1: Failed reconnect attempt 12
[ 1057.118981] nvme nvme1: Reconnecting in 2 seconds...
[ 1062.239224] nvme nvme1: failed to connect socket: -110
[ 1062.239255] nvme nvme1: Failed reconnect attempt 13
[ 1062.239258] nvme nvme1: Reconnecting in 2 seconds...
[ 1067.359552] nvme nvme1: failed to connect socket: -110
[ 1067.359578] nvme nvme1: Failed reconnect attempt 14
[ 1067.359581] nvme nvme1: Reconnecting in 2 seconds...
[ 1072.479843] nvme nvme1: failed to connect socket: -110
[ 1072.479869] nvme nvme1: Failed reconnect attempt 15
[ 1072.479872] nvme nvme1: Removing controller...
[ 1072.479893] nvme nvme1: Removing ctrl: NQN "nqn.2023-01.io.longhorn.spdk:pvc-4aa94de4-2f75-4cbb-855d-3737a488be50-e-0"
[ 1072.494892] nvme nvme1: Property Set error: 880, offset 0x14
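
For reference, -110 here is ETIMEDOUT, and the controller was removed once the reconnect attempts ran out. A rough sketch of the connect-time fabrics options that, as far as I understand, control this retry window (address, port and subsystem NQN are placeholders):

# bound (or extend) how long the host keeps retrying after the target disappears;
# the number of reconnect attempts should come out to ctrl-loss-tmo / reconnect-delay,
# and a ctrl-loss-tmo of -1 is supposed to mean "retry forever"
nvme connect -t tcp -a <traddr> -s <trsvcid> -n <subsys-nqn> \
    --ctrl-loss-tmo=60 --reconnect-delay=5 --keep-alive-tmo=5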

igaw (Collaborator) commented Dec 6, 2024

[ 1072.479893] nvme nvme1: Removing ctrl

This message says the nvme subsystem was informed by userspace to release all resources for the mentioned controller. I suspect that the transport driver is not performing the cleanup tasks. Is this with nvme-tcp?
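
If the controller node is still present at that point, its state can be read from sysfs; a small sketch, assuming the standard nvme controller attributes and the controller name nvme1 from your log:

# controller state, e.g. live / connecting / deleting / deleting (no IO) / dead
cat /sys/class/nvme/nvme1/state
# transport in use (should report "tcp" here)
cat /sys/class/nvme/nvme1/transport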

derekbit (Author) commented Dec 6, 2024

[ 1072.479893] nvme nvme1: Removing ctrl

This message says the nvme subsystem was informed by userspace to release all resources for the mentioned controller. I suspect that the transport driver is not performing the cleanup tasks. Is this with nvme-tcp?

Yes, it is with nvme-tcp.

igaw (Collaborator) commented Dec 9, 2024

Ah sorry, stupid me, you already mentioned that it's nvme-tcp. Anyway, I've tried to replicate this with the current Linux head (6.13-rc1):

# nvme connect -t tcp -a 192.168.154.145 -s 4420 -n nqn.io-1 --hostnqn  nqn.2014-08.org.nvmexpress:uuid:befdec4c-2234-11b2-a85c-ca77c773af3 
[block traffic between host and controller]
# nvme disconnect-all
# nvme list-subsys 
[no output]

and the kernel log doesn't show anything suspicious.

[   65.062189] nvme nvme1: creating 8 I/O queues.
[   65.071495] nvme nvme1: mapped 8/0/0 default/read/poll queues.
[   65.082160] nvme nvme1: new ctrl: NQN "nqn.io-1", addr 192.168.154.145:4420, hostnqn: nqn.2014-08.org.nvmexpress:uuid:befdec4c-2234-11b2-a85c-ca77c773af36
[   75.573965] nvme nvme1: Removing ctrl: NQN "nqn.io-1"
[   90.188630] nvme nvme1: creating 8 I/O queues.
[   90.197960] nvme nvme1: mapped 8/0/0 default/read/poll queues.
[   90.205534] nvme nvme1: new ctrl: NQN "nqn.io-1", addr 192.168.154.145:4420, hostnqn: nqn.2014-08.org.nvmexpress:uuid:befdec4c-2234-11b2-a85c-ca77c773af36
[  148.957642] nvme nvme1: I/O tag 1 (7001) type 4 opcode 0x18 (Keep Alive) QID 0 timeout
[  148.958107] nvme nvme1: starting error recovery
[  148.959373] nvme nvme1: failed nvme_keep_alive_end_io error=10
[  148.968537] nvme nvme1: Reconnecting in 10 seconds...
[  155.864828] nvme nvme1: Removing ctrl: NQN "nqn.io-1"
[  156.035038] nvme nvme1: Property Set error: 880, offset 0x14

Is the above command sequence what you are doing? If not, please provide your exact steps. Which kernel are you using?
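
In case the difference is that your controller gets removed automatically when the retry counter runs out (rather than via a manual disconnect), here is roughly how I would try to hit that path, assuming iptables is available to drop the target's traffic (same test address as above):

# connect with a short loss timeout so the retry counter runs out quickly
nvme connect -t tcp -a 192.168.154.145 -s 4420 -n nqn.io-1 \
    --ctrl-loss-tmo=30 --reconnect-delay=2
# silently drop all traffic from the target instead of disconnecting cleanly
iptables -A INPUT -s 192.168.154.145 -j DROP
# wait past ctrl-loss-tmo, then check whether a subsystem entry is left behind
sleep 60
nvme list-subsys -o json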

Omar007 commented Dec 13, 2024

I'm fairly certain I've just seen this happen as well: an active NVMe-oF TCP connection was disrupted by the remote node going away, the retries ran out until the controller was removed, and I was left with a subsystem entry in nvme list-subsys and no way to remove/disconnect it, nor to refresh/reconnect it (other than just doing a new nvme connect, I mean). The connecting client system is running kernel 6.12.4.

EDIT: Actually, scratch that connect comment. Even if you properly disconnect after that manual connect, from that point onward it does not seem to properly clear out at all anymore?

igaw (Collaborator) commented Dec 16, 2024

Okay, I haven't tested the path where the retry counter hits the limit and the controller gets auto-removed. Let's see...

igaw (Collaborator) commented Dec 18, 2024

I've tried to reproduce this with current HEAD and also with 6.12, but no luck. I also can't see how the remove-ctrl path could leak the subsystem reference (which seems to be the problem here). Anyway, it's a kernel issue and not really an nvme-cli bug. I suggest you report this to the nvme mailing list. I could also post the question there, but since I can't reproduce it, it's likely going nowhere if I do so. Sorry.

derekbit (Author) commented Jan 3, 2025

@igaw
Thanks for the help.
We are also encountering the issue: the device cannot be disconnected. Do you have any thoughts?

instance-manager-76f5303efa69a5131572cedfe2bee640:/ # nvme list-subsys
nvme-subsys0 - NQN=nqn.2023-01.io.longhorn.spdk:e2e-test-volume-0-e-0
               hostnqn=nqn.2014-08.org.nvmexpress:uuid:ec2cad69-0c04-a806-2dc6-585615cc07ff
               iopolicy=numa
\
 +- nvme0 tcp traddr=10.42.3.6,trsvcid=20108 deleting (no IO)
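
A sketch of what I can still inspect from userspace while it is stuck, assuming the usual nvme sysfs attributes (the controller name nvme0 is taken from the output above):

# the state stays at "deleting (no IO)" instead of the controller going away
cat /sys/class/nvme/nvme0/state
# retrying the delete via sysfs, which I believe is the same knob nvme disconnect uses
echo 1 > /sys/class/nvme/nvme0/delete_controller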

igaw (Collaborator) commented Jan 3, 2025

Is there a way to reproduce it?

Hmm, so the connection is in the deleting (no IO) state. That might help identify the problem. Maybe we are waiting on a request to complete but we never end it...

Omar007 commented Jan 23, 2025

I wish I could provide more info, but 'have the remote become unavailable and wait past its retries' seems to be all it takes. I'm not aware of anything special at the moment. Currently on kernel 6.12.10, and it has happened again.

Looking at the /sys/class/nvme-subsystem tree, the whole node is still present when this happens. The only difference with an active subsystem is the absence of the controller symlink.

Could the /sys tree be used to force it to try again or reinitialize the connection and/or controller?
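
For reference, a sketch of what I can see under that tree, assuming the standard layout (the subsystem name nvme-subsys0 is a placeholder); the reset_controller and delete_controller knobs seem to live on the controller device, so once the controller symlink is gone there appears to be nothing left to poke for the subsystem itself:

# the subsystem NQN and whatever is still linked underneath it
cat /sys/class/nvme-subsystem/nvme-subsys0/subsysnqn
ls -l /sys/class/nvme-subsystem/nvme-subsys0/
# the per-controller knobs, only present while a controller still exists
ls /sys/class/nvme/nvme0/reset_controller /sys/class/nvme/nvme0/delete_controller 2>/dev/null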
