
Unable to disconnect from NVMeoF subsystem #2603

Open
derekbit opened this issue Dec 5, 2024 · 9 comments

derekbit commented Dec 5, 2024

When an NVMe initiator is connected to an NVMe target via NVMe-TCP and the target node goes offline, the subsystem shown by nvme list-subsys is left with an empty Paths list. How can I delete this subsystem? Or, what parameters should be provided at connect time to handle this situation?

# nvme list-subsys -o json
[
  {
    "HostNQN":"nqn.2014-08.org.nvmexpress:uuid:a48f0944-3ab0-4e5b-8019-79192500a44e",
    "Subsystems":[
      {
        "Name":"nvme-subsys0",
        "NQN":"nqn.2023-01.io.longhorn.spdk:pvc-4aa94de4-2f75-4cbb-855d-3737a488be50-e-0",
        "IOPolicy":"numa",
        "Paths":[]
      }
    ]
  }
]

I've tried the nvme disconnect command, but it doesn't work.

# nvme disconnect -n nqn.2023-01.io.longhorn.spdk:pvc-4aa94de4-2f75-4cbb-855d-3737a488be50-e-0
NQN:nqn.2023-01.io.longhorn.spdk:pvc-4aa94de4-2f75-4cbb-855d-3737a488be50-e-0 disconnected 0 controller(s)

# nvme list-subsys -o json
[
  {
    "HostNQN":"nqn.2014-08.org.nvmexpress:uuid:a48f0944-3ab0-4e5b-8019-79192500a44e",
    "Subsystems":[
      {
        "Name":"nvme-subsys0",
        "NQN":"nqn.2023-01.io.longhorn.spdk:pvc-4aa94de4-2f75-4cbb-855d-3737a488be50-e-0",
        "IOPolicy":"numa",
        "Paths":[]
      }
    ]
  }
]

dmesg

[ 1046.878470] nvme nvme1: failed to connect socket: -110
[ 1046.878493] nvme nvme1: Failed reconnect attempt 10
[ 1046.878496] nvme nvme1: Reconnecting in 2 seconds...
[ 1051.998732] nvme nvme1: failed to connect socket: -110
[ 1051.998750] nvme nvme1: Failed reconnect attempt 11
[ 1051.998752] nvme nvme1: Reconnecting in 2 seconds...
[ 1057.118953] nvme nvme1: failed to connect socket: -110
[ 1057.118978] nvme nvme1: Failed reconnect attempt 12
[ 1057.118981] nvme nvme1: Reconnecting in 2 seconds...
[ 1062.239224] nvme nvme1: failed to connect socket: -110
[ 1062.239255] nvme nvme1: Failed reconnect attempt 13
[ 1062.239258] nvme nvme1: Reconnecting in 2 seconds...
[ 1067.359552] nvme nvme1: failed to connect socket: -110
[ 1067.359578] nvme nvme1: Failed reconnect attempt 14
[ 1067.359581] nvme nvme1: Reconnecting in 2 seconds...
[ 1072.479843] nvme nvme1: failed to connect socket: -110
[ 1072.479869] nvme nvme1: Failed reconnect attempt 15
[ 1072.479872] nvme nvme1: Removing controller...
[ 1072.479893] nvme nvme1: Removing ctrl: NQN "nqn.2023-01.io.longhorn.spdk:pvc-4aa94de4-2f75-4cbb-855d-3737a488be50-e-0"
[ 1072.494892] nvme nvme1: Property Set error: 880, offset 0x14
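
For reference, -110 here is ETIMEDOUT, and the controller was removed once the reconnect attempts ran out. A rough sketch of the connect-time fabrics options that, as far as I understand, control this retry window (address, port and subsystem NQN are placeholders):

# bound (or extend) how long the host keeps retrying after the target disappears;
# the number of reconnect attempts should come out to ctrl-loss-tmo / reconnect-delay,
# and a ctrl-loss-tmo of -1 is supposed to mean "retry forever"
nvme connect -t tcp -a <traddr> -s <trsvcid> -n <subsys-nqn> \
    --ctrl-loss-tmo=60 --reconnect-delay=5 --keep-alive-tmo=5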

igaw (Collaborator) commented Dec 6, 2024

[ 1072.479893] nvme nvme1: Removing ctrl

This message says the nvme subsystem was informed by userspace to release all resources for the mentioned controller. I suspect that the transport driver is not performing the cleanup tasks. Is this with nvme-tcp?
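
If the controller node is still present at that point, its state can be read from sysfs; a small sketch, assuming the standard nvme controller attributes and the controller name nvme1 from your log:

# controller state, e.g. live / connecting / deleting / deleting (no IO) / dead
cat /sys/class/nvme/nvme1/state
# transport in use (should report "tcp" here)
cat /sys/class/nvme/nvme1/transport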

derekbit (Author) commented Dec 6, 2024

[ 1072.479893] nvme nvme1: Removing ctrl

This message says the nvme subsystem was informed by userspace to release all resources for the mentioned controller. I suspect that the transport driver is not performing the cleanup tasks. Is this with nvme-tcp?

Yes, it is with nvme-tcp.

igaw (Collaborator) commented Dec 9, 2024

Ah sorry, stupid me, you already mentioned that it's nvme-tcp. Anyway, I've tried to replicate this with the current Linux head (6.13-rc1):

# nvme connect -t tcp -a 192.168.154.145 -s 4420 -n nqn.io-1 --hostnqn  nqn.2014-08.org.nvmexpress:uuid:befdec4c-2234-11b2-a85c-ca77c773af3 
[block traffic between host and controller]
# nvme disconnect-all
# nvme list-subsys 
[no output]

and the kernel log doesn't show anything suspicious.

[   65.062189] nvme nvme1: creating 8 I/O queues.
[   65.071495] nvme nvme1: mapped 8/0/0 default/read/poll queues.
[   65.082160] nvme nvme1: new ctrl: NQN "nqn.io-1", addr 192.168.154.145:4420, hostnqn: nqn.2014-08.org.nvmexpress:uuid:befdec4c-2234-11b2-a85c-ca77c773af36
[   75.573965] nvme nvme1: Removing ctrl: NQN "nqn.io-1"
[   90.188630] nvme nvme1: creating 8 I/O queues.
[   90.197960] nvme nvme1: mapped 8/0/0 default/read/poll queues.
[   90.205534] nvme nvme1: new ctrl: NQN "nqn.io-1", addr 192.168.154.145:4420, hostnqn: nqn.2014-08.org.nvmexpress:uuid:befdec4c-2234-11b2-a85c-ca77c773af36
[  148.957642] nvme nvme1: I/O tag 1 (7001) type 4 opcode 0x18 (Keep Alive) QID 0 timeout
[  148.958107] nvme nvme1: starting error recovery
[  148.959373] nvme nvme1: failed nvme_keep_alive_end_io error=10
[  148.968537] nvme nvme1: Reconnecting in 10 seconds...
[  155.864828] nvme nvme1: Removing ctrl: NQN "nqn.io-1"
[  156.035038] nvme nvme1: Property Set error: 880, offset 0x14

Is the above command sequence what you are doing? If not, please provide your exact steps. Which kernel are you using?
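
In case the difference is that your controller gets removed automatically when the retry counter runs out (rather than via a manual disconnect), here is roughly how I would try to hit that path, assuming iptables is available to drop the target's traffic (same test address as above):

# connect with a short loss timeout so the retry counter runs out quickly
nvme connect -t tcp -a 192.168.154.145 -s 4420 -n nqn.io-1 \
    --ctrl-loss-tmo=30 --reconnect-delay=2
# silently drop all traffic from the target instead of disconnecting cleanly
iptables -A INPUT -s 192.168.154.145 -j DROP
# wait past ctrl-loss-tmo, then check whether a subsystem entry is left behind
sleep 60
nvme list-subsys -o json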

Omar007 commented Dec 13, 2024

I'm fairly certain I've just seen this happen as well: an active NVMe-oF TCP connection was disrupted by the remote node going away, the retries ran out until the controller was removed, and I was left with a subsystem entry in nvme list-subsys and no way to remove/disconnect it, nor to refresh/reconnect it (other than just doing a new nvme connect, I mean). The connecting client system is running kernel 6.12.4.

EDIT: Actually, scratch that connect comment. Even if you properly disconnect after that manual connect, from that point onward it does not seem to properly clear out at all anymore?

igaw (Collaborator) commented Dec 16, 2024

Okay, I haven't tested the path where the retry counter hits the limit and the controller gets auto-removed. Let's see...

igaw (Collaborator) commented Dec 18, 2024

I've tried to reproduce this with current HEAD and also with 6.12, but no luck. I also can't see how the remove-ctrl path could leak the subsystem reference (which seems to be the problem here). Anyway, it's a kernel issue and not really an nvme-cli bug. I suggest you report this to the nvme mailing list. I could also post the question there, but since I can't reproduce it, it's likely going nowhere if I do so. Sorry.

derekbit (Author) commented Jan 3, 2025

@igaw
Thanks for the help.
We are also encountering the issue: the device cannot be disconnected. Do you have any thoughts?

instance-manager-76f5303efa69a5131572cedfe2bee640:/ # nvme list-subsys
nvme-subsys0 - NQN=nqn.2023-01.io.longhorn.spdk:e2e-test-volume-0-e-0
               hostnqn=nqn.2014-08.org.nvmexpress:uuid:ec2cad69-0c04-a806-2dc6-585615cc07ff
               iopolicy=numa
\
 +- nvme0 tcp traddr=10.42.3.6,trsvcid=20108 deleting (no IO)
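
A sketch of what I can still inspect from userspace while it is stuck, assuming the usual nvme sysfs attributes (the controller name nvme0 is taken from the output above):

# the state stays at "deleting (no IO)" instead of the controller going away
cat /sys/class/nvme/nvme0/state
# retrying the delete via sysfs, which I believe is the same knob nvme disconnect uses
echo 1 > /sys/class/nvme/nvme0/delete_controller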

igaw (Collaborator) commented Jan 3, 2025

Is there a way to reproduce it?

Hmm, so the connection is in the deleting (no IO) state. That might help identify the problem. Maybe we are waiting on a request to complete but we never end it...

Omar007 commented Jan 23, 2025

I wish I could provide more info, but 'have the remote become unavailable and wait past its retries' seems to be all it takes. I'm not aware of anything special at the moment. Currently on kernel 6.12.10, and it has happened again.

Looking at the /sys/class/nvme-subsystem tree, the whole node is still present when this happens. The only difference with an active subsystem is the absence of the controller symlink.

Could the /sys tree be used to force it to try again or reinitialize the connection and/or controller?
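
For reference, a sketch of what I can see under that tree, assuming the standard layout (the subsystem name nvme-subsys0 is a placeholder); the reset_controller and delete_controller knobs seem to live on the controller device, so once the controller symlink is gone there appears to be nothing left to poke for the subsystem itself:

# the subsystem NQN and whatever is still linked underneath it
cat /sys/class/nvme-subsystem/nvme-subsys0/subsysnqn
ls -l /sys/class/nvme-subsystem/nvme-subsys0/
# the per-controller knobs, only present while a controller still exists
ls /sys/class/nvme/nvme0/reset_controller /sys/class/nvme/nvme0/delete_controller 2>/dev/null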
