Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move underlay NICs back into H/W Classification #504

Merged
merged 33 commits into from
Aug 20, 2024

Conversation

FelixMcFelix
Copy link
Collaborator

@FelixMcFelix FelixMcFelix commented Apr 24, 2024

Today, we get our TX and RX pathways on underlay devices for XDE by creating a secondary MAC client on each device. As part of this process we must attach a unicast MAC address (or specify MAC_OPEN_FLAGS_NO_UNICAST_ADDR) during creation to spin up a valid datapath, otherwise we can receive packets on our promiscuous mode handler but any sent packets are immediately dropped by MAC. However, datapath setup then fails to supply a dedicated ring/group for the new client, and the device is reduced to pure software classification. This hard-disables any ring polling threads, and so all packet processing occurs in the interrupt context. This limits throughput and increases OPTE's blast radius on control plane/crucible traffic between sleds.

This PR places a hold onto the underlay NICs via dls, and makes use of dls_open/dls_close to acquire a valid transmit pathway onto the original (primary) MAC client, to which we can also attach a promiscuous callback. As desired, we are back in hardware classification:

EVT22200009 # mdb -k
Loading modules: [ unix genunix specfs dtrace mac cpu.generic apix cpc crypto mm random smbios zfs sata ip hook neti sockfs lofs vmm scsi_vhci arp ufs ipcc logindmux nsmb ptm ]
> ::mac_ring
            ADDR TYPE STATE FLAG            GROUP CLASS              MIP              SRS FLOW NAME
fffffcfa18602008 RX   inuse 0000 fffffcfa19b08dc0 hw    fffffcfa3b776028 fffffcfa19dc1dc0 cxgbe1
fffffcfa186020e0 RX   inuse 0000 fffffcfa19b08dc0 hw    fffffcfa3b776028 fffffcfa19dc78c0 cxgbe1
fffffcfa186021b8 RX   inuse 0000 fffffcfa19b08dc0 hw    fffffcfa3b776028 fffffcfa4cc69b40 cxgbe1
fffffcfa18602290 TX   inuse 0000 fffffcfa3b97d4a0 no    fffffcfa3b77e020                0
fffffcfa18602368 TX   inuse 0000 fffffcfa3b97d4a0 no    fffffcfa3b77e020                0
fffffcfa18602440 TX   inuse 0000 fffffcfa3b97d4a0 no    fffffcfa3b77e020                0
fffffcfa18602518 TX   inuse 0000 fffffcfa3b97d4a0 no    fffffcfa3b77e020                0
fffffcfa186025f0 TX   inuse 0000 fffffcfa3b97d4a0 no    fffffcfa3b77e020                0
fffffcfa186026c8 TX   inuse 0000 fffffcfa3b97d4a0 no    fffffcfa3b77e020                0
fffffcfa186027a0 TX   inuse 0000 fffffcfa3b97d4a0 no    fffffcfa3b77e020                0
fffffcfa18602878 TX   inuse 0000 fffffcfa3b97d4a0 no    fffffcfa3b77e020                0
fffffcfa18602950 RX   inuse 0000 fffffcfa1968f780 hw    fffffcfa3b77e020 fffffcfa4cc68480 cxgbe0
fffffcfa18602a28 RX   inuse 0000 fffffcfa1968f780 hw    fffffcfa3b77e020 fffffcfa445d0b00 cxgbe0
fffffcfa18602b00 RX   inuse 0000 fffffcfa1968f780 hw    fffffcfa3b77e020 fffffcfa19dbf040 cxgbe0
fffffcfa18602bd8 TX   inuse 0000 fffffcfa2fe62798 no    fffffcfa3b786018                0
fffffcfa18602cb0 RX   inuse 0000 fffffcf95805cbc0 hw    fffffcfa3b786018 fffffcfa43645240 igb0
fffffcfa3d7a0880 TX   inuse 0000 fffffcfa3b97d9e0 no    fffffcfa3b776028                0
fffffcfa3d7a0958 TX   inuse 0000 fffffcfa3b97d9e0 no    fffffcfa3b776028                0
fffffcfa3d7a0a30 TX   inuse 0000 fffffcfa3b97d9e0 no    fffffcfa3b776028                0
fffffcfa3d7a0b08 TX   inuse 0000 fffffcfa3b97d9e0 no    fffffcfa3b776028                0
fffffcfa3d7a0be0 TX   inuse 0000 fffffcfa3b97d9e0 no    fffffcfa3b776028                0
fffffcfa3d7a0cb8 TX   inuse 0000 fffffcfa3b97d9e0 no    fffffcfa3b776028                0
fffffcfa3d7a0d90 TX   inuse 0000 fffffcfa3b97d9e0 no    fffffcfa3b776028                0
fffffcfa3d7a0e68 TX   inuse 0000 fffffcfa3b97d9e0 no    fffffcfa3b776028                0

And if we zoom into the flamegraph, we now receive packets via mac_rx_srs_poll_ring (!!!).

rx

Performance numbers are included below -- on its own, a 7% increase in guest throughput and a 74% increase in underlay max throughput.

This work is orthogonal to #62 (and related efforts) which will get us out of promiscuous mode -- both are necessary parts of making optimal use of the illumos networking stack.

Closes #489 .

@FelixMcFelix FelixMcFelix self-assigned this Apr 24, 2024
@FelixMcFelix
Copy link
Collaborator Author

FelixMcFelix commented Apr 25, 2024

Extra setup details on sn9/14 for TCP tunable adjustments (so we don't saturate at ~20Gbps), and better presentation of all known speeds:

# TCP tunables.
ipadm set-prop -p max_buf=8388608 tcp
ipadm set-prop -p send_buf=8388608 tcp
ipadm set-prop -p recv_buf=8388608 tcp

Global zone iperf numbers and status after a set-xde-underlay on both ends, average of 3 runs omitting slow-start:

Setup Underlay Speed (Gbps), iperf -c ... -P 1 -O 2 OPTE Speed (Gbps)
master 17.1 2.19
+= #504 (+ hw_class, + poll thread) 29.1 2.36
+= #62 (+ no promisc, - no poll thread, - subflow check cost) flows-and-hw branch 29.8 2.55
No XDE 30.9 N/A

EDIT: Updated 2024-06-20, stlouis-0-ge5ec805ada.

@FelixMcFelix FelixMcFelix added this to the 9 milestone May 23, 2024
@FelixMcFelix
Copy link
Collaborator Author

FelixMcFelix commented Jun 14, 2024

I spent some time revisiting and fixing this up, and trying to understand the incompatibility with the flows work -- hacky as it is, I wouldn't have expected breakage.

There is a fairly consistent way to trigger a kernel panic, roughly in order of:

  • Add a user-flow to an underlay link.
  • Set the underlay.
  • Pass traffic between ports across the underlay.
  • Remove all ports and clear the underlay

I have some dumps to dig into on Monday, remarkably only the test harness is causing issues when we put the two features together (i.e., only once traffic is sent over do we get a panic).

EDIT: One dump puts us at:

panic[cpu10]/thread=fffffe6ab8385840:
BAD TRAP: type=e (#pf Page fault) rp=fffffe00968b0190 addr=fffffe6aa97398a0


mdnsd:
#pf Page fault
Bad kernel fault at addr=0xfffffe6aa97398a0
pid=545, pc=0xfffffffffbbd5388, sp=0xfffffe00968b0280, eflags=0x10282
cr0: 80050033<pg,wp,ne,et,mp,pe>  cr4: 3406f8<smap,smep,osxsav,xmme,fxsr,pge,mce,pae,pse,de>
cr2: fffffe6aa97398a0
cr3: 1304f90000
cr8: 0

        rdi: fffffe6aa9739790 rsi: fffffe6b2b09d640 rdx:                0
        rcx:                0  r8:                0  r9:                0
        rax: fffffe6aa9462088 rbx: fffffe6aa9739790 rbp: fffffe00968b0330
        r10: 36a01600005e0001 r11: fffffe6a9fcb2770 r12:                0
        r13: fffffe6b2b09d640 r14: fffffe6b2b09d640 r15:               28
        fsb:                0 gsb: fffffe6a9d0ec000  ds:               4b
         es:               4b  fs:                0  gs:              1c3
        trp:                e err:                0 rip: fffffffffbbd5388
         cs:               30 rfl:            10282 rsp: fffffe00968b0280
         ss:               38

fffffe00968b00a0 unix:die+d3 ()
fffffe00968b0180 unix:trap+999 ()
fffffe00968b0190 unix:cmntrap+e9 ()
fffffe00968b0330 mac:mac_tx+28 ()
fffffe00968b0380 dld:str_mdata_fastpath_put+8e ()
fffffe00968b0490 ip:ip_xmit+9ce ()
fffffe00968b0530 ip:ip_postfrag_loopcheck+99 ()
fffffe00968b0780 ip:ire_send_wire_v4+3c2 ()
fffffe00968b0810 ip:ire_send_multicast_v4+b0 ()
fffffe00968b08d0 ip:ip_output_simple_v4+467 ()
fffffe00968b0940 ip:ip_output_simple+dd ()
fffffe00968b0b30 ip:ill_mcast_send_queued+14b ()
fffffe00968b0ba0 ip:ilg_delete_all+1f4 ()
fffffe00968b0c00 ip:ip_quiesce_conn+168 ()
fffffe00968b0c30 ip:udp_do_close+39 ()
fffffe00968b0c60 ip:udp_close+19 ()
fffffe00968b0cc0 sockfs:so_close+80 ()
fffffe00968b0cf0 sockfs:socket_close_internal+21 ()
fffffe00968b0d70 sockfs:socket_vop_close+e2 ()
fffffe00968b0df0 genunix:fop_close+66 ()
fffffe00968b0e30 genunix:closef+63 ()
fffffe00968b0ea0 genunix:closeandsetf+1cd ()
fffffe00968b0ec0 genunix:close+13 ()
fffffe00968b0f10 unix:brand_sys_syscall32+186 ()

EDIT2: Found it -- dls_open takes ownership of the dls_dl_handle_t and dls_link_t it is given, so we do not rele that. I found this out in a fairly long-winded way when it turned out that the MAC impl had 0 associated clients and a dangling mi_single_active_client. I think we can clean/fix this up now and hopefully get it in soon.

@FelixMcFelix FelixMcFelix force-pushed the goodbye-all-of-classification branch from f18abb1 to 1da884e Compare June 17, 2024 16:48
Comment on lines -1112 to -1111
// Set up a unicast callback. The MAC address here is a sentinel value with
// nothing real behind it. This is why we picked the zero value in the Oxide
// OUI space for virtual MACs. The reason this is being done is that illumos
// requires that if there is a single mac client on a link, that client must
// have an L2 address. This was not caught until recently, because this is
// only enforced as a debug assert in the kernel.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should never be the case now: being visible via (and getting a stream from) DLS implies the existence of a client on the underlying MAC. This is also the root-cause of landing in software classification, as we end up with two clients.

@FelixMcFelix FelixMcFelix changed the title WIP: Move back into H/W Classification Move underlay NICs back into H/W Classification Jun 19, 2024
@FelixMcFelix FelixMcFelix marked this pull request as ready for review June 19, 2024 18:09
Good thing we don't perform any hairpin responses on inbound traffic,
eh?
@FelixMcFelix FelixMcFelix modified the milestones: 9, 10 Jun 24, 2024
@FelixMcFelix FelixMcFelix requested a review from luqmana June 24, 2024 16:18
@FelixMcFelix FelixMcFelix marked this pull request as ready for review August 8, 2024 16:14
@FelixMcFelix
Copy link
Collaborator Author

FelixMcFelix commented Aug 8, 2024

I think this is good to go now, a quick recap on what's changed:

  • We now use the functions added as part of oxidecomputer/illumos-gate@0187c42 to construct and destroy a valid dld_str_t with the least effort. Any cleanup which we were replicating from dls/dld after closing the stream has been moved into dld_str_destroy_detached.
    • This means we only access one field within dld_str_t -- the mac client handle of the underlay device in question.
  • opteadm and the ioctl interface used by omicron now use ctf-bindgen to inspect /system/object/dld/object for a complete list of field offsets in dld_str_t. These are computed and sent when setting the underlay (i.e., this is a prerequisite for all other operations), wherein XDE will fail to load if the offset and type of ds_mch have changed.

CI is going to fail on test/bench until we get a new Helios image, but I can confirm that xde-tests and kbench behave correctly after a pfexec pkg update on my local box (as well as sn9/14).

@@ -396,7 +396,7 @@ where
}

pub fn fetch_fragile_types() -> Result<FragileInternals, Error> {
let dld_ctf = Ctf::from_file("/kernel/drv/amd64/dld")
let dld_ctf = Ctf::from_file("/system/object/dld/object")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😮 TIL objfs

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I learned that trick from Robert! I think it'll be short-lived if we use your below suggestion. I'll preserve a copy of this branch since we could end up needing these tricks again (although I hope we do not).

xde/src/dls/sys.rs Outdated Show resolved Hide resolved
Sad to see the CTF check machinery go, but it's for the best after
all...
@askfongjojo askfongjojo modified the milestones: 10, 11 Aug 15, 2024
@FelixMcFelix FelixMcFelix modified the milestones: 11, 10 Aug 15, 2024
@askfongjojo askfongjojo modified the milestones: 10, 11 Aug 15, 2024
@FelixMcFelix FelixMcFelix merged commit dcdd0b1 into master Aug 20, 2024
10 checks passed
@FelixMcFelix FelixMcFelix deleted the goodbye-all-of-classification branch August 20, 2024 20:38
FelixMcFelix added a commit to oxidecomputer/omicron that referenced this pull request Aug 20, 2024
* Move underlay NICs back into H/W Classification (oxidecomputer/opte#504)

My disposition is to wait til R11 before we merge this -- I've done
lengthy testing on `glasgow`, but I would like plenty of soak time on
dogfood before this sees a release.
FelixMcFelix added a commit to oxidecomputer/omicron that referenced this pull request Aug 23, 2024
* Move underlay NICs back into H/W Classification
(oxidecomputer/opte#504)

My disposition is to wait til R11 before we merge this -- I've done
lengthy testing on `glasgow`, but I would like plenty of soak time on
dogfood before this sees a release.
@twinfees twinfees added the customer For any bug reports or feature requests tied to customer requests label Aug 27, 2024
FelixMcFelix added a commit that referenced this pull request Nov 19, 2024
This PR rewrites the core of OPTE's packet model to use zero-copy packet
parsing/modification via the
[`ingot`](oxidecomputer/ingot#1) library. This
enables a few changes which get us just shy of the 3Gbps mark.

* **[2.36 -> 2.7]** The use of ingot for modifying packets in both the
slowpath (UFT miss) and existing fastpath (UFT hit).
* Parsing is faster -- we no longer copy out all packet header bytes
onto the stack, and we do not allocate a vector to decompose an `mblk_t`
into individual links.
* Packet field modifications are applied directly to the `mblk_t` as
they happen, and field reads are made from the same source.
  * Non-encap layers are not copied back out.
* **[2.7 -> 2.75]** Packet L4 hashes are cached as part of the UFT,
speeding up multipath route selection over the underlay.
* **[2.75 -> 2.8]** Incremental Internet checksum recalculation is only
performed when applicable fields change on inner flow headers (e.g.,
NAT'd packets).
  * VM-to-VM / intra-VPC traffic is the main use case here.
* **[2.8 -> 3.05]** `NetworkParser`s now have the concept of inbound &
outbound `LightweightMeta` formats. These support the key operations
needed to execute all our UFT flows today (`FlowId` lookup, inner
headers modification, encap push/pop, cksum update).
* This also allows us to pre-serialize any bytes to be pushed in front
of a packet, speeding up `EmitSpec`.
* This is crucial for outbound traffic in particular, which has far
smaller (in `struct`-size) metadata.
* UFT misses or incompatible flows fallback to using the full metadata.
* **[3.05 -> 2.95]** TCP state tracking uses a separate per-flow lock
and does not require any lookup from a UFT.
* I do not have numbers on how large the performance loss would be if we
held the `Port` lock for the whole time.
* (Not measured) Packet/UFT L4 Hashes are used as the Geneve source
port, spreading inbound packets over NIC Rx queues based on the inner
flow.
* This is now possible because of #504 -- software classification would
have limited us to the default inbound queue/group.
* I feel bad for removing one more FF7 reference, but that is the way of
these things. RIP port `7777`.
* Previously, Rx queue affinity was derived solely from `(Src Sled, Dst
Sled)`.

There are several other changes here made to how OPTE functions which
are needed to support the zero-copy model.

* Each collected set of header transforms are `Arc<>`'d, such that we
can apply them outside of the `Port` lock.
* `FlowTable<S>`s now store `Arc<FlowEntry<S>>`, rather than
`FlowEntry<S>`.
* This enables the UFT entry for any flow to store its matched TCP flow,
update its hit count and timestamp, and then update the TCP state
without reacquring the `Port` lock.
* This also drastically simplifies TCP state handling in fast path cases
to not rely on post-transformation packets for lookup.
* `Opte::process` returns an `EmitSpec` which is needed to finalise a
packet before it can be used.
* I'm not too happy about the ergonomics, but we have this problem
because otherwise we'd need `Packet` to have some self-referential
fields when supporting other key parts of XDE (e.g., parse -> use fields
-> select port -> process).

Closes #571, closes #481, closes #460.

Slightly alleviates #435.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
customer For any bug reports or feature requests tied to customer requests perf
Projects
None yet
Development

Successfully merging this pull request may close these issues.

set-xde-underlay forces underlay devices into software classification
4 participants