
[VPP-526] VPP Crash in vhost_user_if_input #1875

Closed
vvalderrv opened this issue Jan 31, 2025 · 0 comments

Description

Summary:

VPP crashes in vhost_user_if_input. This happens when using multiple threads and queues, and rebooting the system while sending traffic.

The traceback is as follows:

(gdb) bt

#0 0x00007ffff70671d0 in vhost_user_if_input (vm=0x7fffb6d64648, vum=0x7ffff747ee20 <vhost_user_main>, vui=0x7fffb6ec75cc, node=0x7fffb6e06cb8) at /scratch/myciscoatt/src/vpp/build-data/../vnet/vnet/devices/virtio/vhost-user.c:1150

#1 0x00007ffff7067df5 in vhost_user_input (vm=0x7fffb6d64648, node=0x7fffb6e06cb8, f=0x0) at /scratch/myciscoatt/src/vpp/build-data/../vnet/vnet/devices/virtio/vhost-user.c:1361

#2 0x00007ffff74d4fff in dispatch_node (vm=0x7fffb6d64648, node=0x7fffb6e06cb8, type=VLIB_NODE_TYPE_INPUT, dispatch_state=VLIB_NODE_STATE_POLLING, frame=0x0, last_time_stamp=116904157643429) at /scratch/myciscoatt/src/vpp/build-data/../vlib/vlib/main.c:996

#3 0x00007ffff751a2e1 in vlib_worker_thread_internal (vm=0x7fffb6d64648) at /scratch/myciscoatt/src/vpp/build-data/../vlib/vlib/threads.c:1389

#4 0x00007ffff751a5c3 in vlib_worker_thread_fn (arg=0x7fffb5427c70) at /scratch/myciscoatt/src/vpp/build-data/../vlib/vlib/threads.c:1455

#5 0x00007ffff62b0314 in clib_calljmp () at /scratch/myciscoatt/src/vpp/build-data/../vppinfra/vppinfra/longjmp.S:110

#6 0x00007fff768bcc00 in ?? ()

#7 0x00007ffff7515a44 in vlib_worker_thread_bootstrap_fn (arg=0x7fffb5427c70) at /scratch/myciscoatt/src/vpp/build-data/../vlib/vlib/threads.c:516

Backtrace stopped: previous frame inner to this frame (corrupt stack?)

(gdb) p/x txvq

$10 = 0x7fffb6ec7954

(gdb) p/x *txvq

$11 = {qsz = 0x100, last_avail_idx = 0x0, last_used_idx = 0x0, desc = 0x7fd8a4630000, avail = 0x7fd8a4631000, used = 0x7fd8a4632000, log_guest_addr = 0x424632000, callfd = 0x56, kickfd = 0x5e, errfd = 0x0, enabled = 0x1, log_used = 0x0, callfd_idx = 0x11, n_since_last_int = 0x0, int_deadline = 0x52d6}
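
For orientation, the fields in that dump correspond to per-queue vring state roughly as sketched below. This is a hedged mirror with types inferred from the printed values, not a copy of VPP's actual header; the point to notice is that desc, avail, and used point into guest memory that qemu shares with (and can take away from) VPP.

/* Hypothetical mirror of the per-queue state printed above; field
 * names follow the gdb dump, types are assumptions. */
#include <stdint.h>
#include <linux/virtio_ring.h>

typedef struct
{
  uint32_t qsz;                /* ring size: 0x100 = 256 slots */
  uint16_t last_avail_idx;     /* VPP's consumer cursor into avail */
  uint16_t last_used_idx;      /* VPP's producer cursor into used */
  struct vring_desc *desc;     /* descriptor table -- guest-shared */
  struct vring_avail *avail;   /* avail ring -- guest-shared */
  struct vring_used *used;     /* used ring -- guest-shared */
  uint64_t log_guest_addr;     /* guest address for dirty logging */
  int callfd, kickfd, errfd;   /* eventfds from the vhost protocol */
  int enabled;
  /* callfd_idx, n_since_last_int, int_deadline omitted */
} example_vring_state_t;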

The pointers look good when examined in gdb, so this points to a race condition: the shared memory backing the rings may briefly be unavailable while the system is rebooted.

Dave Barach looked at this and here is his summary:

Guys,

John D. asked me to take a look at a multiple-worker, multiple-queue vhost_user crash scenario. After some fiddling, I found a scenario that’s 100% reproducible. With vpp provisioned by the ML2 plugin [or whatever calls itself “test_papi”], ssh into the compute vm and type “sudo /sbin/reboot”.

This scenario causes a mild vhost_user shared-memory earthquake with traffic flowing.

One of the worker threads will receive SIGSEGV, right here:

/* vhost_user_if_input, at or near line 1142 */
u32 next_desc =
  txvq->avail->ring[(txvq->last_avail_idx + 1) & qsz_mask];
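
The index arithmetic on that line is sound on its own, which is what pins the fault on the mapping behind txvq->avail rather than on a stray index. A standalone sketch below (qsz taken from the gdb dump; the mask derivation and ring contents are fabricated for the demo):

/* Standalone illustration of the wrap-around indexing at the crash
 * site. With qsz = 0x100 (a power of two), qsz_mask = qsz - 1 keeps
 * the index in [0, qsz) no matter how far last_avail_idx advances. */
#include <stdint.h>
#include <stdio.h>

int
main (void)
{
  uint16_t qsz = 0x100;             /* ring size from the dump */
  uint16_t qsz_mask = qsz - 1;      /* assumed power-of-two mask */
  uint16_t last_avail_idx = 0xffff; /* free-running, about to wrap */
  uint16_t ring[0x100];             /* stands in for txvq->avail->ring */

  for (uint32_t i = 0; i < qsz; i++)
    ring[i] = (uint16_t) i;

  /* Same expression shape as the faulting line. */
  uint16_t next_desc = ring[(last_avail_idx + 1) & qsz_mask];

  printf ("masked index = %u, next_desc = %u\n",
          (unsigned) ((last_avail_idx + 1) & qsz_mask),
          (unsigned) next_desc);
  return 0;
}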

By the time one can look at the memory reference in gdb, the memory is accessible. My guess: qemu briefly changes protections on the vhost_user shared-memory segment, yadda yadda yadda.

This scenario never causes an issue when running single-queue, single-core.

An API trace - see below - indicates that vpp receives no notification of any kind. There isn't a hell of a lot that the vhost_user driver can do to protect itself.
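
About the only self-protection one could imagine is a heavyweight fault guard around the shared-memory reads. Purely a hedged sketch, and not what the vhost_user driver does -- a per-access guard like this would be far too slow for a polling packet path, and siglongjmp out of a hardware-fault handler is a well-known trick rather than strictly portable C:

/* Hypothetical SIGSEGV guard around a single load from possibly
 * revoked shared memory; all names here are illustrative. */
#include <setjmp.h>
#include <signal.h>
#include <stdint.h>

static sigjmp_buf fault_jmp;

static void
segv_handler (int sig)
{
  (void) sig;
  siglongjmp (fault_jmp, 1); /* unwind back to the guarded read */
}

/* Returns 0 and fills *out on success, -1 if the load faulted. */
static int
guarded_read_u16 (volatile uint16_t *p, uint16_t *out)
{
  struct sigaction sa = { 0 }, old;
  sa.sa_handler = segv_handler;
  sigemptyset (&sa.sa_mask);
  sigaction (SIGSEGV, &sa, &old);

  int faulted = sigsetjmp (fault_jmp, 1 /* save signal mask */);
  if (!faulted)
    *out = *p; /* may fault if qemu revoked the mapping */

  sigaction (SIGSEGV, &old, 0); /* restore the previous handler */
  return faulted ? -1 : 0;
}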

Time for someone to stare at the qemu code, I guess...

HTH… Dave

  1. api trace custom-dump /tmp/twoboot

    SCRIPT: memclnt_create name test_papi

    SCRIPT: sw_interface_dump all

    SCRIPT: control_ping

    SCRIPT: sw_interface_dump all

    SCRIPT: control_ping

    SCRIPT: sw_interface_set_flags sw_if_index 1 admin-up link-up

    SCRIPT: bridge_domain_add_del bd_id 5678 flood 1 uu-flood 1 forward 1 learn 1 arp-term 0

    SCRIPT: sw_interface_set_l2_bridge sw_if_index 1 bd_id 5678 shg 0 enable

    SCRIPT: tap_connect tapname vppef940067-0b mac fa:16:3e:6e:22:41

    SCRIPT: sw_interface_set_flags sw_if_index 4 admin-up link-up

    SCRIPT: sw_interface_set_l2_bridge sw_if_index 4 bd_id 5678 shg 0 enable

    SCRIPT: sw_interface_dump all

    SCRIPT: control_ping

    SCRIPT: sw_interface_dump all

    SCRIPT: control_ping

    SCRIPT: sw_interface_set_flags sw_if_index 3 admin-up link-up

    SCRIPT: bridge_domain_add_del bd_id 5679 flood 1 uu-flood 1 forward 1 learn 1 arp-term 0

    SCRIPT: sw_interface_set_l2_bridge sw_if_index 3 bd_id 5679 shg 0 enable

    SCRIPT: create_vhost_user_if socket /tmp/52970d78-dad3-4887-b4bf-df90d3e13602

    SCRIPT: sw_interface_set_flags sw_if_index 5 admin-up link-up

    SCRIPT: sw_interface_set_l2_bridge sw_if_index 5 bd_id 5679 shg 0 enable

    SCRIPT: create_vhost_user_if socket /tmp/92473e06-ea98-4b4f-80df-c9bb702c3885

    SCRIPT: sw_interface_set_flags sw_if_index 6 admin-up link-up

    SCRIPT: sw_interface_set_l2_bridge sw_if_index 6 bd_id 5678 shg 0 enable

    SCRIPT: sw_interface_dump all

    SCRIPT: control_ping

    SCRIPT: sw_interface_dump all

    SCRIPT: control_ping

    SCRIPT: sw_interface_set_flags sw_if_index 2 admin-up link-up

    SCRIPT: bridge_domain_add_del bd_id 5680 flood 1 uu-flood 1 forward 1 learn 1 arp-term 0

    SCRIPT: sw_interface_set_l2_bridge sw_if_index 2 bd_id 5680 shg 0 enable

    SCRIPT: create_vhost_user_if socket /tmp/e2261ff9-4953-4368-a8c9-8005ccf0e896

    SCRIPT: sw_interface_set_flags sw_if_index 7 admin-up link-up

    SCRIPT: sw_interface_set_l2_bridge sw_if_index 7 bd_id 5680 shg 0 enable

    SCRIPT: create_vhost_user_if socket /tmp/b5d9c5f0-0494-4bd0-bb28-437f5261fad5

    SCRIPT: sw_interface_set_flags sw_if_index 8 admin-up link-up

    SCRIPT: sw_interface_set_l2_bridge sw_if_index 8 bd_id 5679 shg 0 enable

    SCRIPT: tap_connect tapname vppb7464b44-11 mac fa:16:3e:66:31:79

    SCRIPT: sw_interface_set_flags sw_if_index 9 admin-up link-up

    SCRIPT: sw_interface_set_l2_bridge sw_if_index 9 bd_id 5680 shg 0 enable

    SCRIPT: tap_connect tapname vppab16509a-c5 mac fa:16:3e:c2:9f:ac

    SCRIPT: sw_interface_set_flags sw_if_index 10 admin-up link-up

    SCRIPT: sw_interface_set_l2_bridge sw_if_index 10 bd_id 5679 shg 0 enable

    SCRIPT: create_vhost_user_if socket /tmp/783d34a8-3e72-4434-97cf-80c7e199e66c

    SCRIPT: sw_interface_set_flags sw_if_index 11 admin-up link-up

    SCRIPT: sw_interface_set_l2_bridge sw_if_index 11 bd_id 5678 shg 0 enable

    SCRIPT: create_vhost_user_if socket /tmp/67a02881-e241-4ae4-abb4-dfa03e951772

    SCRIPT: sw_interface_set_flags sw_if_index 12 admin-up link-up

    SCRIPT: sw_interface_set_l2_bridge sw_if_index 12 bd_id 5680 shg 0 enable

    SCRIPT: memclnt_create name vpp_api_test # connect vpp_api_test prior to rebooting vm, as described

    SCRIPT: sw_interface_dump name_filter Ether

    SCRIPT: sw_interface_dump name_filter lo

    SCRIPT: sw_interface_dump name_filter pg

    SCRIPT: sw_interface_dump name_filter vxlan_gpe

    SCRIPT: sw_interface_dump name_filter vxlan

    SCRIPT: sw_interface_dump name_filter host

    SCRIPT: sw_interface_dump name_filter l2tpv3_tunnel

    SCRIPT: sw_interface_dump name_filter gre

    SCRIPT: sw_interface_dump name_filter lisp_gpe

    SCRIPT: sw_interface_dump name_filter ipsec

    SCRIPT: control_ping

    SCRIPT: get_first_msg_id lb_16c904aa

    SCRIPT: get_first_msg_id snat_aa4c5cd5

    SCRIPT: get_first_msg_id pot_e4aba035

    SCRIPT: get_first_msg_id ioam_trace_a2e66598

    SCRIPT: get_first_msg_id ioam_export_eb694f98

    SCRIPT: get_first_msg_id flowperpkt_789ffa7b

    SCRIPT: cli_request

    vl_api_memclnt_delete_t:

    index: 269

    handle: 0x305e16c0

    REBOOT THE VM RIGHT HERE

    Absolutely nothing to indicate that anything happened

    SCRIPT: memclnt_create name vpp_api_test # connect vpp_api_test again

    SCRIPT: sw_interface_dump name_filter Ether

    SCRIPT: sw_interface_dump name_filter lo

    SCRIPT: sw_interface_dump name_filter pg

    SCRIPT: sw_interface_dump name_filter vxlan_gpe

    SCRIPT: sw_interface_dump name_filter vxlan

    SCRIPT: sw_interface_dump name_filter host

    SCRIPT: sw_interface_dump name_filter l2tpv3_tunnel

    SCRIPT: sw_interface_dump name_filter gre

    SCRIPT: sw_interface_dump name_filter lisp_gpe

    SCRIPT: sw_interface_dump name_filter ipsec

    SCRIPT: control_ping

    SCRIPT: get_first_msg_id lb_16c904aa

    SCRIPT: get_first_msg_id snat_aa4c5cd5

    SCRIPT: get_first_msg_id pot_e4aba035

    SCRIPT: get_first_msg_id ioam_trace_a2e66598

    SCRIPT: get_first_msg_id ioam_export_eb694f98

    SCRIPT: get_first_msg_id flowperpkt_789ffa7b

    SCRIPT: cli_request

    vl_api_memclnt_delete_t:

    index: 269

    handle: 0x305e16c0

    DBGvpp#

Assignee

Unassigned

Reporter

John DeNisco

Comments

No comments.

Original issue: https://jira.fd.io/browse/VPP-526
