refactor: Background reconnect tasks replace blocking recovery #2910

GilboaAWS · 2025-01-01T13:46:38Z

Description

Motivation:

The current implementation blocks the main processing flow during connection refresh operations. When a reconnect is needed, poll_recover blocks until the refresh operation completes, preventing the processing of any requests during this time. This blocking behavior:

Impacts system responsiveness
Inefficiently handles temporary network issues
Forces all requests to wait, even those targeting different addresses
Creates unnecessary latency for requests that could be routed to available nodes

By moving reconnect operations to background tasks, implementing exponential backoff, and enhancing request routing, we can:

Improve system throughput by eliminating blocking operations
Handle network issues more gracefully
Allow requests to wait appropriately for refreshing connections
Maintain existing ordering guarantees and consistency
Better utilize available connections during refresh operations

Current impl.:

Core Structure:

Implements Sink trait with methods: start_send, poll_ready, and poll_flush
Connected to a socket output stream via a forward implementation.

poll_ready:
Always returns Poll::Ready

forward Implementation:
Calls start_send to send requests
Calls poll_flush to process requests

poll_flush Behavior:

Returns Poll::Ready when:

No requests in pending_requests queue
No requests in inflight_requests queue

Returns Poll::Pending when:

poll_complete returns Pending

Request Processing Flow:

poll_flush is called
poll_recover is called first:
- Blocks on the recovery future if one exists
- Ensures all recovery operations complete before proceeding
After recovery completes, poll_complete is called:
- Moves all pending_requests to inflight_requests queue
- Calls poll_next on inflight_requests to process them
- Returns Pending if all current inflight requests are pending
- Returns Ready when all requests are processed

Reordering Prevention:

All requests in inflight_requests see the same connection_map
Ensures consistent behavior for batches of requests

Key Points:

The system uses a two-stage queue (pending_requests and inflight_requests)
poll_recover acts as a synchronization point, ensuring all requests see a consistent state
Batches of requests are processed together, maintaining order and consistency
poll_next is called within poll_complete to process inflight requests

This structure ensures that:

Requests are processed in batches
Each batch sees a consistent view of the connection state
Recovery operations complete before new requests are processed
The processing of inflight requests is handled within the poll_complete method

PR 1 - Move connection refresh to the background

Core Changes:

Move refresh_connection to run as a tokio task in the background, eliminating blocking in poll_recover.

New State Management:

Introduce two new sets in ConnectionContainer:

'pending_refresh': Tracks addresses that need to be refreshed
'completed_refresh': Tracks addresses that have completed refresh

Refresh Task Behavior:

When refresh is needed:
- Add address to 'pending_refresh' set
- Start refresh logic in background tokio task
When refresh completes:
- Add address to 'completed_refresh' set

Modified poll_flush Behavior:

Before calling poll_complete:
- Remove connections for all addresses in 'pending_refresh' set
- Add connections for all addresses in 'completed_refresh' set
- Clear both sets after processing
Continue with normal poll_complete flow

Consistency Maintenance:

All connection_map modifications happen at poll_flush start
Each batch of requests sees consistent connection state
No blocking on refresh operations

Key Benefits:

Unblocks poll_recover, improving system responsiveness
Maintains consistency for request batches
Enables concurrent refresh operations
Preserves existing batch processing model
Maintains request ordering guarantees

PR 2 - Enhanced get_connection with State-Based Notifier:

Core Changes:

Modify get_connection to intelligently wait on refresh operations while maintaining order guarantees.

New State Management:

Add refresh state tracking per address
Introduce notifier mechanism that's triggered by poll_flush
New state enum for refresh operations:
- Refreshing
- RefreshTimeout (when refresh operation exceeds time threshold)

Modified poll_flush Behavior:

At the beginning, before connection map updates:
- Check duration of ongoing refresh operations
- For operations running too long:
  - Update state to RefreshTimeout
  - Trigger associated notifier
  - Allow awaiting requests to proceed

New Connection Retrieval Flow:

When connection not found for a route:
- Check if address has active refresh operation
If refresh is active:
- Await on address's notifier
- When notified:
  - If connection exists in map - use it
  - If RefreshTimeout state - proceed to random node
If no refresh active:
- Proceed with current random node logic

Key Benefits:

Maintains current behavior of falling back to random node
Replaces blocking recovery with controlled waiting period
Prevents request reordering by:
- Using centralized state management in poll_flush
- Ensuring consistent state views for request batches
Provides natural transition from current blocking behavior to more efficient approach

Comparison to Current Behavior:

Current: Blocks entirely during recover task
Proposed: Waits briefly for refresh completion before random routing
Both ensure consistent handling of request batches
Both eventually route to random node when connection unavailable

PR 3 - Exponential Backoff for Reconnection: #473

Core Changes:

Enhance refresh task with exponential backoff retry mechanism for connection attempts.

Refresh Task Behavior:

Initial connection attempt
On failure:
- Calculate next retry interval using exponential backoff
- Wait for calculated duration
- Retry connection
Continue until:
- Connection succeeds
- Maximum retries reached
- Maximum total time exceeded

Key Benefits:

More robust handling of temporary network issues
Prevents overwhelming network with immediate retries
Better resource utilization during connection problems
Improved system stability during network instability

Checklist

PR 1
PR 2
PR 3

GilboaAWS added the 1_3_candidate label Jan 1, 2025

GilboaAWS added this to the 1.3 milestone Jan 1, 2025

GilboaAWS self-assigned this Jan 1, 2025

ikolomi added this to Valkey-GLIDE - internal Jan 1, 2025

ikolomi moved this to Backlog in Valkey-GLIDE - internal Jan 1, 2025

ikolomi moved this from Backlog to In Progress in Valkey-GLIDE - internal Jan 1, 2025

GilboaAWS mentioned this issue Jan 5, 2025

[redis-rs][core] Move connection refresh to the background #2915

Draft

6 tasks

asafpamzn removed the 1_3_candidate label Jan 7, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor: Background reconnect tasks replace blocking recovery #2910

refactor: Background reconnect tasks replace blocking recovery #2910

GilboaAWS commented Jan 1, 2025 •

edited

Loading

refactor: Background reconnect tasks replace blocking recovery #2910

refactor: Background reconnect tasks replace blocking recovery #2910

Comments

GilboaAWS commented Jan 1, 2025 • edited Loading

Description

Motivation:

Current impl.:

PR 1 - Move connection refresh to the background

PR 2 - Enhanced get_connection with State-Based Notifier:

PR 3 - Exponential Backoff for Reconnection: #473

Checklist

GilboaAWS commented Jan 1, 2025 •

edited

Loading