Description
Summary:
VPP crashes in vhost_user_if_input. This happens when using multiple threads and queues and rebooting the VM while sending traffic.
The traceback is as follows:
(gdb) bt
#0 0x00007ffff70671d0 in vhost_user_if_input (vm=0x7fffb6d64648, vum=0x7ffff747ee20 <vhost_user_main>, vui=0x7fffb6ec75cc, node=0x7fffb6e06cb8) at /scratch/myciscoatt/src/vpp/build-data/../vnet/vnet/devices/virtio/vhost-user.c:1150
#1 0x00007ffff7067df5 in vhost_user_input (vm=0x7fffb6d64648, node=0x7fffb6e06cb8, f=0x0) at /scratch/myciscoatt/src/vpp/build-data/../vnet/vnet/devices/virtio/vhost-user.c:1361
#2 0x00007ffff74d4fff in dispatch_node (vm=0x7fffb6d64648, node=0x7fffb6e06cb8, type=VLIB_NODE_TYPE_INPUT, dispatch_state=VLIB_NODE_STATE_POLLING, frame=0x0, last_time_stamp=116904157643429) at /scratch/myciscoatt/src/vpp/build-data/../vlib/vlib/main.c:996
#3 0x00007ffff751a2e1 in vlib_worker_thread_internal (vm=0x7fffb6d64648) at /scratch/myciscoatt/src/vpp/build-data/../vlib/vlib/threads.c:1389
#4 0x00007ffff751a5c3 in vlib_worker_thread_fn (arg=0x7fffb5427c70) at /scratch/myciscoatt/src/vpp/build-data/../vlib/vlib/threads.c:1455
#5 0x00007ffff62b0314 in clib_calljmp () at /scratch/myciscoatt/src/vpp/build-data/../vppinfra/vppinfra/longjmp.S:110
#6 0x00007fff768bcc00 in ?? ()
#7 0x00007ffff7515a44 in vlib_worker_thread_bootstrap_fn (arg=0x7fffb5427c70) at /scratch/myciscoatt/src/vpp/build-data/../vlib/vlib/threads.c:516
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) p/x txvq
$10 = 0x7fffb6ec7954
(gdb) p/x *txvq
$11 =
{qsz = 0x100, last_avail_idx = 0x0, last_used_idx = 0x0, desc = 0x7fd8a4630000, avail = 0x7fd8a4631000, used = 0x7fd8a4632000, log_guest_addr = 0x424632000, callfd = 0x56, kickfd = 0x5e, errfd = 0x0, enabled = 0x1, log_used = 0x0, callfd_idx = 0x11, n_since_last_int = 0x0, int_deadline = 0x52d6}
The pointers look good when examined in gdb, so this points to a race condition. The race condition might be that the memory is not available while the system is being rebooted.
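For reference, the dump above corresponds to the per-queue vring state the input node walks. A rough sketch of that state (approximate field names, not the exact definition in vhost-user.h) shows why good-looking pointers are not enough: desc, avail and used are process-virtual pointers into memory that qemu mapped and handed to VPP over the vhost-user socket, so qemu, not VPP, decides when that mapping is valid.

#include <stdint.h>

/* Approximate sketch of the per-queue vring state printed above; the real
 * definition lives in vnet/devices/virtio/vhost-user.h and differs in
 * detail.  The key point: desc, avail and used are plain pointers into
 * guest memory shared by qemu over the vhost-user socket. */
typedef struct
{
  uint16_t qsz;             /* ring size, 0x100 in the dump */
  uint16_t last_avail_idx;
  uint16_t last_used_idx;
  void *desc;               /* 0x7fd8a4630000: descriptor table, shared with qemu */
  void *avail;              /* 0x7fd8a4631000: avail ring, shared with qemu */
  void *used;               /* 0x7fd8a4632000: used ring, shared with qemu */
  uint64_t log_guest_addr;
  int callfd, kickfd, errfd;
  uint8_t enabled;
} vring_state_sketch_t;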
Dave Barach looked at this and here is his summary:
Guys,
John D. asked me to take a look at a multiple-worker, multiple-queue vhost_user crash scenario. After some fiddling, I found a scenario that’s 100% reproducible. With vpp provisioned by the ML2 plugin [or whatever calls itself “test_papi”], ssh into the compute vm and type “sudo /sbin/reboot”.
This scenario causes a mild vhost_user shared-memory earthquake with traffic flowing.
One of the worker threads will receive SIGSEGV, right here:
u32 next_desc =
By the time one can look at the memory reference in gdb, the memory is accessible. My guess: qemu briefly changes protections on the vhost_user shared-memory segment, yadda yadda yadda.
This scenario never causes an issue when running single-queue, single-core.
An API trace - see below - indicates that vpp receives no notification of any kind. There isn't a hell of a lot that the vhost_user driver can do to protect itself.
Time for someone to stare at the qemu code, I guess...
HTH… Dave
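To make the failure mode Dave describes concrete, here is a heavily simplified sketch of the kind of descriptor-chain read the worker performs at the crash site (illustrative names only, not the actual vhost-user.c code around line 1150). Every load goes straight into the qemu-owned mapping, so a transient unmap or protection change there becomes a SIGSEGV in the polling worker:

#include <stdint.h>

#define VRING_DESC_F_NEXT 1

/* Illustrative only: a stripped-down descriptor-chain walk in the spirit of
 * the vhost-user RX path.  txvq_desc points into the shared-memory region
 * qemu mapped for the guest; if qemu unmaps or write-protects it (for
 * example around a guest reboot) while a worker thread is still polling,
 * the plain loads below take SIGSEGV. */
typedef struct
{
  uint64_t addr;            /* guest-physical address of the buffer */
  uint32_t len;
  uint16_t flags;
  uint16_t next;            /* index of the next descriptor in the chain */
} vring_desc_sketch_t;

static uint16_t
walk_desc_chain (vring_desc_sketch_t * txvq_desc, uint16_t desc_current)
{
  while (txvq_desc[desc_current].flags & VRING_DESC_F_NEXT)
    {
      /* The crash in the backtrace is a load of this kind: reading the
       * chained descriptor index straight out of guest shared memory. */
      uint16_t next_desc = txvq_desc[desc_current].next;
      desc_current = next_desc;
    }
  return desc_current;
}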
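As a rough illustration of why there is little the driver can do to protect itself, guarding even a single guest-memory load against a vanishing mapping would need something like the hypothetical wrapper below (a per-thread SIGSEGV handler plus sigsetjmp). None of this exists in VPP; it is only a sketch of the shape of the problem, and it would be far too expensive for a polling data path:

#include <setjmp.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical guard, not part of VPP: shows what "protecting" a single
 * guest-memory load against a disappearing mapping would involve. */
static __thread sigjmp_buf guarded_load_env;

static void
guarded_load_segv_handler (int sig)
{
  (void) sig;
  siglongjmp (guarded_load_env, 1);
}

/* Returns 1 and fills *out on success, 0 if the load faulted because the
 * shared mapping was gone or protected at that instant. */
static int
guarded_load_u16 (const volatile uint16_t * p, uint16_t * out)
{
  struct sigaction sa, old;
  memset (&sa, 0, sizeof (sa));
  sa.sa_handler = guarded_load_segv_handler;
  sigemptyset (&sa.sa_mask);
  sigaction (SIGSEGV, &sa, &old);

  int ok = 0;
  if (sigsetjmp (guarded_load_env, 1) == 0)
    {
      *out = *p;                /* may fault if qemu pulled the mapping */
      ok = 1;
    }

  sigaction (SIGSEGV, &old, NULL);
  return ok;
}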
SCRIPT: memclnt_create name test_papi
SCRIPT: sw_interface_dump all
SCRIPT: control_ping
SCRIPT: sw_interface_dump all
SCRIPT: control_ping
SCRIPT: sw_interface_set_flags sw_if_index 1 admin-up link-up
SCRIPT: bridge_domain_add_del bd_id 5678 flood 1 uu-flood 1 forward 1 learn 1 arp-term 0
SCRIPT: sw_interface_set_l2_bridge sw_if_index 1 bd_id 5678 shg 0 enable
SCRIPT: tap_connect tapname vppef940067-0b mac fa:16:3e:6e:22:41
SCRIPT: sw_interface_set_flags sw_if_index 4 admin-up link-up
SCRIPT: sw_interface_set_l2_bridge sw_if_index 4 bd_id 5678 shg 0 enable
SCRIPT: sw_interface_dump all
SCRIPT: control_ping
SCRIPT: sw_interface_dump all
SCRIPT: control_ping
SCRIPT: sw_interface_set_flags sw_if_index 3 admin-up link-up
SCRIPT: bridge_domain_add_del bd_id 5679 flood 1 uu-flood 1 forward 1 learn 1 arp-term 0
SCRIPT: sw_interface_set_l2_bridge sw_if_index 3 bd_id 5679 shg 0 enable
SCRIPT: create_vhost_user_if socket /tmp/52970d78-dad3-4887-b4bf-df90d3e13602
SCRIPT: sw_interface_set_flags sw_if_index 5 admin-up link-up
SCRIPT: sw_interface_set_l2_bridge sw_if_index 5 bd_id 5679 shg 0 enable
SCRIPT: create_vhost_user_if socket /tmp/92473e06-ea98-4b4f-80df-c9bb702c3885
SCRIPT: sw_interface_set_flags sw_if_index 6 admin-up link-up
SCRIPT: sw_interface_set_l2_bridge sw_if_index 6 bd_id 5678 shg 0 enable
SCRIPT: sw_interface_dump all
SCRIPT: control_ping
SCRIPT: sw_interface_dump all
SCRIPT: control_ping
SCRIPT: sw_interface_set_flags sw_if_index 2 admin-up link-up
SCRIPT: bridge_domain_add_del bd_id 5680 flood 1 uu-flood 1 forward 1 learn 1 arp-term 0
SCRIPT: sw_interface_set_l2_bridge sw_if_index 2 bd_id 5680 shg 0 enable
SCRIPT: create_vhost_user_if socket /tmp/e2261ff9-4953-4368-a8c9-8005ccf0e896
SCRIPT: sw_interface_set_flags sw_if_index 7 admin-up link-up
SCRIPT: sw_interface_set_l2_bridge sw_if_index 7 bd_id 5680 shg 0 enable
SCRIPT: create_vhost_user_if socket /tmp/b5d9c5f0-0494-4bd0-bb28-437f5261fad5
SCRIPT: sw_interface_set_flags sw_if_index 8 admin-up link-up
SCRIPT: sw_interface_set_l2_bridge sw_if_index 8 bd_id 5679 shg 0 enable
SCRIPT: tap_connect tapname vppb7464b44-11 mac fa:16:3e:66:31:79
SCRIPT: sw_interface_set_flags sw_if_index 9 admin-up link-up
SCRIPT: sw_interface_set_l2_bridge sw_if_index 9 bd_id 5680 shg 0 enable
SCRIPT: tap_connect tapname vppab16509a-c5 mac fa:16:3e:c2:9f:ac
SCRIPT: sw_interface_set_flags sw_if_index 10 admin-up link-up
SCRIPT: sw_interface_set_l2_bridge sw_if_index 10 bd_id 5679 shg 0 enable
SCRIPT: create_vhost_user_if socket /tmp/783d34a8-3e72-4434-97cf-80c7e199e66c
SCRIPT: sw_interface_set_flags sw_if_index 11 admin-up link-up
SCRIPT: sw_interface_set_l2_bridge sw_if_index 11 bd_id 5678 shg 0 enable
SCRIPT: create_vhost_user_if socket /tmp/67a02881-e241-4ae4-abb4-dfa03e951772
SCRIPT: sw_interface_set_flags sw_if_index 12 admin-up link-up
SCRIPT: sw_interface_set_l2_bridge sw_if_index 12 bd_id 5680 shg 0 enable
SCRIPT: memclnt_create name vpp_api_test # connect vpp_api_test prior to rebooting vm, as described
SCRIPT: sw_interface_dump name_filter Ether
SCRIPT: sw_interface_dump name_filter lo
SCRIPT: sw_interface_dump name_filter pg
SCRIPT: sw_interface_dump name_filter vxlan_gpe
SCRIPT: sw_interface_dump name_filter vxlan
SCRIPT: sw_interface_dump name_filter host
SCRIPT: sw_interface_dump name_filter l2tpv3_tunnel
SCRIPT: sw_interface_dump name_filter gre
SCRIPT: sw_interface_dump name_filter lisp_gpe
SCRIPT: sw_interface_dump name_filter ipsec
SCRIPT: control_ping
SCRIPT: get_first_msg_id lb_16c904aa
SCRIPT: get_first_msg_id snat_aa4c5cd5
SCRIPT: get_first_msg_id pot_e4aba035
SCRIPT: get_first_msg_id ioam_trace_a2e66598
SCRIPT: get_first_msg_id ioam_export_eb694f98
SCRIPT: get_first_msg_id flowperpkt_789ffa7b
SCRIPT: cli_request
vl_api_memclnt_delete_t:
index: 269
handle: 0x305e16c0
REBOOT THE VM RIGHT HERE
Absolutely nothing to indicate that anything happened
SCRIPT: memclnt_create name vpp_api_test # connect vpp_api_test again
SCRIPT: sw_interface_dump name_filter Ether
SCRIPT: sw_interface_dump name_filter lo
SCRIPT: sw_interface_dump name_filter pg
SCRIPT: sw_interface_dump name_filter vxlan_gpe
SCRIPT: sw_interface_dump name_filter vxlan
SCRIPT: sw_interface_dump name_filter host
SCRIPT: sw_interface_dump name_filter l2tpv3_tunnel
SCRIPT: sw_interface_dump name_filter gre
SCRIPT: sw_interface_dump name_filter lisp_gpe
SCRIPT: sw_interface_dump name_filter ipsec
SCRIPT: control_ping
SCRIPT: get_first_msg_id lb_16c904aa
SCRIPT: get_first_msg_id snat_aa4c5cd5
SCRIPT: get_first_msg_id pot_e4aba035
SCRIPT: get_first_msg_id ioam_trace_a2e66598
SCRIPT: get_first_msg_id ioam_export_eb694f98
SCRIPT: get_first_msg_id flowperpkt_789ffa7b
SCRIPT: cli_request
vl_api_memclnt_delete_t:
index: 269
handle: 0x305e16c0
DBGvpp#
Assignee: Unassigned
Reporter: John DeNisco
Comments: No comments.
Original issue: https://jira.fd.io/browse/VPP-526