ISCSI Session Healing can make bad situations worse #961

Open
speedyguy17 opened this issue Jan 10, 2025 · 0 comments


iSCSI session healing takes the following steps:

  • detect all iSCSI sessions that are not in the "logged in" state
  • wait for a timeout
  • log them out and back in
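
The three steps above can be sketched roughly as follows. This is an illustrative model only, not Trident's actual code: the `Session` class, the `plan_healing` function, and the `"LOGGED_IN"` spelling of the healthy state are assumptions made for the example; only the `LogoutLoginRescan` action name comes from this issue.

```python
from dataclasses import dataclass
from typing import Optional

LOGGED_IN = "LOGGED_IN"  # assumed spelling of the healthy session state


@dataclass
class Session:
    """Hypothetical record for one iSCSI session."""
    target: str
    state: str
    stale_since: Optional[float] = None  # time the session was first seen stale


def plan_healing(sessions, now, recovery_timeout):
    """Return {target: action} for sessions stale for at least recovery_timeout."""
    actions = {}
    for s in sessions:
        if s.state == LOGGED_IN:
            s.stale_since = None  # healthy again, reset the timer
            continue
        if s.stale_since is None:
            s.stale_since = now  # first sighting of the stale state, start waiting
            continue
        if now - s.stale_since >= recovery_timeout:
            actions[s.target] = "LogoutLoginRescan"
    return actions
```

In this sketch the decision depends only on session state and elapsed time. Nothing checks whether a filesystem is mounted on top of the session's devices.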

As a side effect, any ext4 filesystem mounted on top of a device owned by such a session goes read-only, and any pod consuming that PV becomes irrecoverable.

Consider an (unfortunately) extended network outage:

  • All iSCSI session states become "FREE"
  • Trident detects these sessions as stale (not LOGGED IN)
  • After the session recovery timeout, Trident sets the action for the sessions to LogoutLoginRescan
  • Trident issues iscsiadm -m ..... -u on the sessions
  • Upon logout of the sessions, Linux tears down each of the /dev/sdXX block devices
  • Upon teardown of the last sdXX backing a given volume, multipath returns EIO to any outstanding IO on the /dev/dm-XX device
  • When ext4 receives EIO for a jbd2 IO, it intentionally and irrecoverably marks the filesystem as read-only
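
To check whether a node has reached the end of this chain, one option is to look for ext4 mounts that have gone read-only. The helper below is a minimal sketch: `read_only_ext4_mounts` is a hypothetical name, and it assumes the read-only state shows up as the `ro` option in `/proc/mounts`-style text, which may not hold on every kernel.

```python
def read_only_ext4_mounts(mounts_text):
    """Return mount points of ext4 filesystems whose options include 'ro'.

    mounts_text is expected in /proc/mounts format:
    device mount_point fs_type options dump pass
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mount_point, fs_type, options = fields[:4]
        if fs_type == "ext4" and "ro" in options.split(","):
            result.append(mount_point)
    return result
```

On a live node this could be called as `read_only_ext4_mounts(open("/proc/mounts").read())`. A mount that was deliberately mounted read-only would also be reported, so the result needs to be compared against the expected access mode of each PV.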

At this point, the pod sees a read-only PV that cannot be recovered without remounting the filesystem read-write and restarting the pod. Session healing has turned a recoverable network outage into an irrecoverable degradation of the filesystem.
