Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cache routes received from next_hop #499

Merged
merged 15 commits into from
May 22, 2024
Merged

Cache routes received from next_hop #499

merged 15 commits into from
May 22, 2024

Conversation

FelixMcFelix
Copy link
Collaborator

@FelixMcFelix FelixMcFelix commented Apr 16, 2024

This PR inserts a caching layer in front of next_hop to store routes for a given (IpAddr6, L4Hash) for around 100ms, and reorganises routing-related code into xde::route. This is a necessary step as today we spend around 21% of xde_mc_tx in route lookup on a per-outbound-packet basis. As a consequence, we need install a new Periodic to flush expired entries every 1s, as otherwise we can have a large amountof detritus pileup from different L4Hash values.

Currently this is implemented as a per-port cache rather than shared across the driver -- I've done this because it will give us better sharding of concurrent readers/writers at higher port/guest counts. The PR as-is pulls us up from ~1.74Gbps to 2Gbps on a single link between sn9/14.

I'm seeing similar gain on the all-perf-in-flight branch on in-review illumos bits (which includes mac flows + packet chains), where this change brings us from 2.25 Gbps to 2.50 Gbps on this setup.

This is not good code. It also appears to give us an extra 200Mbps.
@FelixMcFelix FelixMcFelix marked this pull request as ready for review April 19, 2024 10:44
I think this is necessary in principle because we're including hash
information in the routing key, and I'd rather not have this state
balloon out of control.
@FelixMcFelix FelixMcFelix self-assigned this Apr 22, 2024
}
}

// The following are wrappers for reference drop functions used in XDE.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It hasn't changed in this PR, but these functions seem iffy to me, since they make it trivial to trigger UB from safe Rust:

    let i = 123;
    let ptr = i as *const ip::ire_t;
    ire_refrele(ire); // probably bad!

Copy link
Collaborator Author

@FelixMcFelix FelixMcFelix Apr 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I won't have time to squeeze this in to this PR, but I've opened #515 to track it separately.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this was done to facilitate implementing DropRef. If there's a better/safer way to do this, all for it.

xde/src/route.rs Outdated Show resolved Hide resolved

/// A simple caching layer over `next_hop`.
#[derive(Clone)]
pub struct RouteCache(Arc<KRwLock<BTreeMap<RouteKey, CachedRoute>>>);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, did you test a HashMap vs BTreeMap?

Copy link
Collaborator Author

@FelixMcFelix FelixMcFelix Apr 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll admit that I hadn't -- I was following the OPTE preference for BTreeMaps. But I threw something together quickly there in criterion with RouteKey and CachedRoute to insert/lookup one row into an arbitrary sized table:

   Compiling opte-bench v0.1.0 (/Users/kyle/gits/opte/bench)
    Finished bench [optimized + debuginfo] target(s) in 2.92s
     Running benches/userland.rs (target/release/deps/userland-57e28d11e88fdd6e)

mapquest/btreemap/get/0 time:   [287.02 ps 287.61 ps 288.30 ps]
Found 9 outliers among 100 measurements (9.00%)
  5 (5.00%) high mild
  4 (4.00%) high severe
mapquest/btreemap/get/2 time:   [3.8893 ns 3.9047 ns 3.9191 ns]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
mapquest/btreemap/get/8 time:   [8.9178 ns 8.9306 ns 8.9465 ns]
Found 19 outliers among 100 measurements (19.00%)
  5 (5.00%) high mild
  14 (14.00%) high severe
mapquest/btreemap/get/64
                        time:   [8.1967 ns 8.2355 ns 8.2792 ns]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe
mapquest/btreemap/get/512
                        time:   [7.2225 ns 7.2445 ns 7.2702 ns]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high severe
mapquest/btreemap/get/8192
                        time:   [2.8060 ns 2.8106 ns 2.8155 ns]
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild

mapquest/btreemap/insert/0
                        time:   [53.993 ns 54.513 ns 55.530 ns]
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  2 (2.00%) high severe
mapquest/btreemap/insert/2
                        time:   [47.196 ns 47.271 ns 47.339 ns]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) low mild
  1 (1.00%) high mild
mapquest/btreemap/insert/8
                        time:   [57.601 ns 57.679 ns 57.760 ns]
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high mild
mapquest/btreemap/insert/64
                        time:   [290.32 ns 290.57 ns 290.83 ns]
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) low mild
  4 (4.00%) high mild
mapquest/btreemap/insert/512
                        time:   [2.4069 µs 2.4108 µs 2.4145 µs]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
mapquest/btreemap/insert/8192
                        time:   [50.270 µs 50.590 µs 50.858 µs]




mapquest/hashmap/get/0  time:   [287.28 ps 288.20 ps 289.29 ps]
Found 17 outliers among 100 measurements (17.00%)
  5 (5.00%) high mild
  12 (12.00%) high severe
mapquest/hashmap/get/2  time:   [22.259 ns 23.395 ns 24.472 ns]
mapquest/hashmap/get/8  time:   [21.733 ns 22.932 ns 24.172 ns]
mapquest/hashmap/get/64 time:   [22.045 ns 23.297 ns 24.493 ns]
mapquest/hashmap/get/512
                        time:   [19.729 ns 20.846 ns 22.009 ns]
mapquest/hashmap/get/8192
                        time:   [19.731 ns 20.825 ns 21.964 ns]

mapquest/hashmap/insert/0
                        time:   [46.374 ns 46.581 ns 46.796 ns]
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
mapquest/hashmap/insert/2
                        time:   [43.214 ns 43.935 ns 44.742 ns]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
mapquest/hashmap/insert/8
                        time:   [66.543 ns 66.619 ns 66.698 ns]
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) low mild
  3 (3.00%) high mild
mapquest/hashmap/insert/64
                        time:   [59.119 ns 60.854 ns 62.838 ns]
mapquest/hashmap/insert/512
                        time:   [295.78 ns 301.83 ns 308.87 ns]
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
mapquest/hashmap/insert/8192
                        time:   [231.58 ns 232.38 ns 233.19 ns]
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) low severe
  1 (1.00%) high mild
  1 (1.00%) high severe

This sort of tracks with my understanding of why we're using BTreeMaps in general throughout OPTE: faster lookup at the cost of more expensive insert. Perhaps a HashMap might work better here: the insert cost scaling up into microseconds around 256--512 entries is not pretty.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps a more obvious reason why we use BTree{Map,Set} that totally slipped my mind: there is no HashMap in alloc/no_std. So we would need to find and measure a suitable replacement if we wanted to go down that road. In the meantime I think the best call might be to limit the cache size.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I don't have a ton of context here, but do we expect the nexthop cache to have a lot of regular churn?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There shouldn't be in practice -- the above is very much worst-case, and I'm thinking about the case where a guest might have many active flows that are actively refreshing their routes and then add one (/n) more.

In the general case of a steady number of flows:

  • An extant flow will call next_hop every 100ms in case preferences or reachability change. If there's an existing entry (even if expired), we should only be paying for lookup rather than a full insert on the BTreeMap.
  • Expired routes are cleared out from the map every 1s.

The last part might add to the risk of churn; I'll change it up to expire flows which have expired by longer than the cleanup period to make refresh more consistent.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My intuition here is that we want to optimize for queries rather than inserts. And if that intuition is not correct I really want to understand why. If this can be modeled as an LPM query, then using poptrie might be a win.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm in agreement on that front. Summarising the other comment I made, we can rework this as an LPM query on underlay destination alone (basically removing insert cost), but we'd need to redo how we walk and inspect IREs for a destination.

This also includes more details on the relative costs we're trying to
offset, and the futur ease for impling/finding a suitable hashmap.
This places 'requesting a new IRE from illumos' and 'removing the
allocated slot in the RouteCache` onto two different timers. This should
allow a flow to keep reusing its existing slot without needing to
redo an expensive insert incurred after a removal.
@FelixMcFelix FelixMcFelix added this to the 9 milestone May 10, 2024
@FelixMcFelix FelixMcFelix requested a review from rcgoodfellow May 20, 2024 20:48
Copy link
Contributor

@rcgoodfellow rcgoodfellow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Kyle. Mostly high-level overall direction comments from my end. Code looks good to me, but let's look into the expiry rollover safety that I noted in the comments below.

// Adjacent to xde is the native IPv6 stack along with its routing
// table. This table is routinely updated to indicate the best path to
// any given IPv6 destination that may be specified in the outer IP
// header. As xde is not utilizing the native IPv6 stack to send out
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should revisit our tx-path entry point. I'm not sure if there is an intrinsic reason we need to tx directly to mac, and illumos has it's own caching mechanisms like destination cache entries (DCE) that are likely a lot quicker than full route lookups.

That's not to say we should drop this work in favor of that direction, as we're getting a clear win here, but something to think about for sure. There is also an argument to be made in the other direction that OPTE should be as self-sufficient as possible, so lots of tradeoffs to consider.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is part of what #504 is getting at; we do already need to revisit the tx path as part of getting the NICs back into hardware classification. Although, the way I've been going about that so far is by getting another client handle at the DLS layer, so it's not fundamentally different from mac_tx.

Coming at it from the IP level is probably safer still. It looks to me like the easiest way of doing this from an NCE/IRE (using public APIs) is ip_xmit, but this works on a packet at a time. There could be something worth copying in how it uses the ILL to perform the actual send?

xde/src/route.rs Outdated Show resolved Hide resolved
///
/// Naturally, bringing this down to O(ns) is desirable. Usually, illumos
/// holds `ire_t`s per `conn_t`, but we're aiming to be more fine-grained
/// with DDM -- so we need a tradeoff between 'asking about the best route
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not index too heavily on what will happen once the DDM data plane lands. There are some questions as to where this is going to fit in to the overall data plane - and one possibility is that it goes all the way down to the NIC as an offload.


/// A simple caching layer over `next_hop`.
#[derive(Clone)]
pub struct RouteCache(Arc<KRwLock<BTreeMap<RouteKey, CachedRoute>>>);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My intuition here is that we want to optimize for queries rather than inserts. And if that intuition is not correct I really want to understand why. If this can be modeled as an LPM query, then using poptrie might be a win.

// counts.
// If full and we have no old entry to update, drop the lock and do
// not insert.
// XXX: Want to profile in future to see if LRU expiry is
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is worth an issue and measuring cache performance there is 1) a pool of destination route keys larger than the cap that is b) churning within the expiry time. One could easily imagine this scenarios arising for a busy web server with lots of clients. I think the reason we have the L4 hash in the key is to pin flows to paths, so expressing this as an LPM to reduce table size would probably be tricky?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add some extra probes around cache events so we can observe cases like this.

The L4 Hash in the key captures behaviour we have today, which is that OPTE uses multiple sled-to-sled paths depending on flow hash. We could recast that behaviour into an LPM of O(sleds) entries: if we can query and cache all reachability and preference information for each underlay destination (e.g., over both NIC links), then we can handle distribution over paths ourselves. We could possibly walk ire->ire_bucket in the same vein as ire_round_robin to fill that out?

I think that would be the right call; it'd be followup work that would likely replace large parts of this, so it's up to you on whether we use what we have for now and keep an LPM in mind as the next step (until we can offload, that is 😄).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's take the win in front of us and continue to consider recasting things to fit in an LPM lookup later on.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again, #539 opened.

xde/src/route.rs Outdated Show resolved Hide resolved
///
/// Expired routes will not be removed from the cache, and will leave
/// an entry to enable a quick in-place refresh in the `BTreeMap`.
const EXPIRE_ROUTE_LIFETIME: Duration = Duration::from_millis(100);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the moment, this is well within the DDM control plane neighbor expiry lifetime. We'll revisit when the DDM data plane lands.

@FelixMcFelix FelixMcFelix merged commit d6177ca into master May 22, 2024
10 checks passed
@FelixMcFelix FelixMcFelix deleted the route-cache branch May 22, 2024 17:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants