[VPP-171] [vhost-user] vhost traffic flowing to a wrong tx virtQueue #1427
Description
There is a case where vhost traffic flows to the wrong tx virtqueue.
This issue happens when inactive vhost-user interfaces are reused.
Here’s a scenario:
- create vhost virtual interfaces between 2 VMs: ping works.
- delete all vhost virtual interfaces
- create a loopback interface (this introduces a new interface and a different subnet, as in a normal scenario)
- create vhost virtual interfaces between 2 VMs: ping fails.
- shesha (Fri, 1 Jul 2016 18:03:28 +0000): Thanks Dave. Calling vlib_worker_thread_node_runtime_update() was the missing piece. It works now.
- dbarach (Fri, 1 Jul 2016 16:37:11 +0000): I suspect we may be back on this topic, but I've cleaned up several issues.
- shesha (Fri, 1 Jul 2016 00:10:33 +0000): Steve, can you check with code that is at least as recent as the version shown below:
- jonshin (Thu, 30 Jun 2016 22:01:31 +0000):
- shesha (Thu, 30 Jun 2016 01:03:02 +0000): Steve Shin and I actively debugged this issue and found that the following fixes it when VPP is running in single-threaded mode; however, the problem persists in multi-threaded mode. Below is the information I have so far.
- jonshin (Wed, 29 Jun 2016 19:56:49 +0000):
Assignee
Dave Barach
Reporter
Steve Shin
Comments
DBGvpp# show version verbose
Version: v16.09-rc0~161-gea3e1fc
Compiled by: localadmin
Compile host: ubuntu
Compile date: Thu Jun 30 17:03:05 PDT 2016
Compile location: /scratch/localadmin/openvpp/vpp.new
Compiler: GCC 4.8.4
CPU model name: Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz
CPU microarchitecture: Haswell (Haswell-E)
CPU flags: sse3 ssse3 sse41 sse42 avx avx2 aes
Current PID: 6472
DPDK Version: DPDK 16.04.0
DPDK EAL init args: -c 3 -n 4 --huge-dir /run/vpp/hugepages --file-prefix vpp -w 0000:0a:00.0 --master-lcore 0 --socket-mem 512,512
------------------
It looks like some sort of corruption is happening in the vec_foreach (f, feature_vector) loop in find_config_with_features(). The reason I say that is that the vlib_node_runtime_t in vnet_interface_output_node_no_flatten_inline looks spooky after the execution of the above-mentioned loop.
=========
BEFORE
=========
(gdb) p {vlib_node_runtime_t} 0x7fffc4d3d4e4
$204 = {
function = 0x7ffff6d33346 <vnet_interface_output_node_no_flatten>,
errors = 0x7fffc4df8dec,
clocks_since_last_overflow = 315980,
max_clock = 128513,
max_clock_n = 1,
calls_since_last_overflow = 8,
vectors_since_last_overflow = 8,
next_frame_index = 677,
node_index = 187,
input_main_loops_per_call = 0,
main_loop_count_last_dispatch = 162883242,
main_loop_vector_stats = {0, 1},
flags = 0,
state = 0,
n_next_nodes = 3,
cached_next_index = 2,
cpu_index = 0,
runtime_data = {25769803782, 1, 0, 0, 0, 0, 0}
}
=======
AFTER
=======
(gdb) p {vlib_node_runtime_t} 0x7fffc4d3d4e4
$209 = {
function = 0x7ffff6d33346 <vnet_interface_output_node_no_flatten>,
errors = 0x7fffc4df8dec,
clocks_since_last_overflow = 0,
max_clock = 0,
max_clock_n = 0,
calls_since_last_overflow = 0,
vectors_since_last_overflow = 0,
next_frame_index = 690,
node_index = 187,
input_main_loops_per_call = 0,
main_loop_count_last_dispatch = 0,
main_loop_vector_stats = {0, 0},
flags = 0,
state = 0,
n_next_nodes = 3,
cached_next_index = 0,
cpu_index = 0,
runtime_data = {25769803782, 4294967297, 0, 0, 0, 0, 0}   <---- note the change
}
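The interesting change is in runtime_data: the second uword went from 1 to 4294967297. A small standalone decode (my own sketch, assuming runtime_data overlays the four consecutive u32 fields of vnet_interface_output_runtime_t on a little-endian host, which matches the struct dumps further below) shows this is exactly dev_instance = 1 with is_deleted flipping from 0 to 1:

#include <stdint.h>
#include <stdio.h>

/* Illustrative decode only, not VPP code: overlay the two uwords printed
   above onto hw_if_index, sw_if_index, dev_instance, is_deleted. */
int main (void)
{
  uint64_t before[2] = { 25769803782ULL, 1ULL };          /* BEFORE dump */
  uint64_t after[2]  = { 25769803782ULL, 4294967297ULL }; /* AFTER dump  */

  printf ("before: hw_if_index=%u sw_if_index=%u dev_instance=%u is_deleted=%u\n",
          (unsigned) before[0], (unsigned) (before[0] >> 32),
          (unsigned) before[1], (unsigned) (before[1] >> 32));
  printf ("after:  hw_if_index=%u sw_if_index=%u dev_instance=%u is_deleted=%u\n",
          (unsigned) after[0], (unsigned) (after[0] >> 32),
          (unsigned) after[1], (unsigned) (after[1] >> 32));
  /* prints: before: 6 6 1 0    after: 6 6 1 1 */
  return 0;
}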
Just for kicks, to see how the system would behave if this corruption had not happened, I added the following:
--- a/vnet/vnet/interface_output.c
+++ b/vnet/vnet/interface_output.c
@@ -420,6 +420,11 @@ vnet_interface_output_node_no_flatten_inline (vlib_main_t * vm,
   from = vlib_frame_args (frame);
+  if (rt->is_deleted)
+    {
+      printf ("RESET DELETE FLAG\n");
+      rt->is_deleted = 0;
+    }
Now, the VMs were pingable after the vhost interface ADD-DEL-ADD sequence.
On my setup, I ran several vhost interface add/delete cycles along with pings between 2 VMs, but I didn't see any packet drops.
Here's my configuration:
Thread 0 vpp_main (lcore 0)
Thread 1 vpp_wk_0 (lcore 1)
cpu {
main-core 0
corelist-workers 1
}
Your issue seems to be related to your test environment.
==================================
The following fix works in single-threaded mode
==================================
--- a/vnet/vnet/interface.c
+++ b/vnet/vnet/interface.c
@@ -656,6 +656,15 @@ vnet_register_interface (vnet_main_t * vnm,
_vec_len (im->deleted_hw_interface_nodes) -= 1;
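The '+' lines of this hunk were not captured, so what follows is only a sketch of the general idea under discussion (my own illustration, not the actual patch): when vnet_register_interface() reuses an entry from im->deleted_hw_interface_nodes, the vnet_interface_output_runtime_t stored in the reused output node's runtime_data still describes the deleted interface and has to be re-initialized. The helper name and parameters below are hypothetical.

#include <vnet/vnet.h>

/* Hypothetical helper, for illustration only. */
static void
reset_reused_output_runtime (vlib_main_t * vm, vnet_hw_interface_t * hw,
                             u32 hw_if_index, u32 dev_instance)
{
  vlib_node_runtime_t *nrt = vlib_node_get_runtime (vm, hw->output_node_index);
  vnet_interface_output_runtime_t *rt = (void *) nrt->runtime_data;

  /* Without this, the reused node keeps the old interface's dev_instance
     and the is_deleted = 1 flag seen in the gdb dumps above. */
  rt->hw_if_index = hw_if_index;
  rt->sw_if_index = hw->sw_if_index;
  rt->dev_instance = dev_instance;
  rt->is_deleted = 0;
}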
================
Multi threaded case
================
The setup is very simple. VPP has two vhost interfaces, with a VM attached to each one. Boot the VMs, then delete the interfaces and add them back. Reboot the VMs to reconnect the vhost file descriptors (Steve has a QEMU patch that removes the need for this reboot). After the reboot, ping fails because one of the interfaces still shows up as deleted (show err).
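For the multi-threaded case, the comment at the top of this thread reports that calling vlib_worker_thread_node_runtime_update() was the missing piece. A minimal sketch of that idea (my own, assuming the standard vlib barrier helpers; not the actual patch):

#include <vlib/vlib.h>
#include <vlib/threads.h>

/* Sketch only: after the main thread adds or deletes interface nodes, the
   worker threads keep running on their own stale copies of the node
   runtimes until those copies are rebuilt. */
static void
refresh_worker_node_runtimes (vlib_main_t * vm)
{
  vlib_worker_thread_barrier_sync (vm);      /* stop the workers            */
  vlib_worker_thread_node_runtime_update (); /* rebuild per-worker runtimes */
  vlib_worker_thread_barrier_release (vm);   /* resume the workers          */
}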
===============
Steps to reproduce
===============
create vhost-user socket /tmp/sock0
create vhost-user socket /tmp/sock1
set interface state VirtualEthernet0/0/0 up
set interface state VirtualEthernet0/0/1 up
set interface l2 bridge VirtualEthernet0/0/0 23
set interface l2 bridge VirtualEthernet0/0/1 23
Boot VMs
set interface state VirtualEthernet0/0/0 down
set interface state VirtualEthernet0/0/1 down
delete vhost-user sw_if_index
delete vhost-user sw_if_index
create vhost-user socket /tmp/sock0
create vhost-user socket /tmp/sock1
set interface state VirtualEthernet0/0/0 up
set interface state VirtualEthernet0/0/1 up
set interface l2 bridge VirtualEthernet0/0/0 23
set interface l2 bridge VirtualEthernet0/0/1 23
Problem
========
The problem can be noticed when the first vhost interface is created after a deletion. Packets are dropped because vnet_interface_output_node_no_flatten_inline() sees rt->is_deleted as 1. This is strange, because that variable should always be zero and is never changed "explicitly". (The rt modified in vnet_register_interface() is different from the one accessed in vnet_interface_output_node_no_flatten_inline(); their addresses are different. I have verified this hundreds of times in GDB during my debugging.) What I noticed is that the rt accessed in vnet_interface_output_node_no_flatten_inline() gets updated as a side effect somewhere in the following call chain:
#0 find_config_with_features (vm=0xc68a80 <vlib_global_main>, cm=0x7ffff74a93f8 <ip6_main+312>, feature_vector=0x7fffc4d6d244)
#1 0x00007ffff6d0f515 in vnet_config_add_feature (vm=0xc68a80 <vlib_global_main>, cm=0x7ffff74a93f8 <ip6_main+312>,
#2 0x00007ffff6eda86b in ip6_sw_interface_add_del (vnm=0xc691c0 <vnet_main>, sw_if_index=7, is_add=1)
#3 0x00007ffff6d1dc75 in call_elf_section_interface_callbacks (vnm=0xc691c0 <vnet_main>, if_index=7, flags=1,
#4 0x00007ffff6d1de11 in call_sw_interface_add_del_callbacks (vnm=0xc691c0 <vnet_main>, sw_if_index=7, is_create=1)
#5 0x00007ffff6d1e0d0 in vnet_sw_interface_set_flags_helper (vnm=0xc691c0 <vnet_main>, sw_if_index=7, flags=0, helper_flags=1)
#6 0x00007ffff6d1fc15 in vnet_register_interface (vnm=0xc691c0 <vnet_main>, dev_class_index=6, dev_instance=2, hw_class_index=16,
#7 0x00007ffff6d80ff5 in ethernet_register_interface (vnm=0xc691c0 <vnet_main>, dev_class_index=6, dev_instance=2,
After the execution of vec_foreach (f, feature_vector) loop in find_config_with_features(), the variable rt gets updated.
(gdb) b vnet/vnet/config.c:116
Breakpoint 26 at 0x7ffff6d0e310: file /scratch/localadmin/openvpp/vpp.new/build-data/../vnet/vnet/config.c, line 116.
(gdb) p {vnet_interface_output_runtime_t} 0x7fffc4d3f5ec
$171 = {
hw_if_index = 6,
sw_if_index = 6,
dev_instance = 1,
is_deleted = 0
}
(gdb) c
Continuing.
Breakpoint 26, find_config_with_features (vm=0xc68a80 <vlib_global_main>, cm=0x7ffff74a93f8 <ip6_main+312>, feature_vector=0x7fffc4d6d244) at /scratch/localadmin/openvpp/vpp.new/build-data/../vnet/vnet/config.c:118
118 if (last_node_index == ~0 || last_node_index != cm->end_node_index)
(gdb) p {vnet_interface_output_runtime_t} 0x7fffc4d3f5ec
$172 = {
hw_if_index = 6,
sw_if_index = 6,
dev_instance = 1,
is_deleted = 1   <---- the flag is now set
}
At this point I have run out of ideas, and it would be helpful if someone with more knowledge of VPP chips in.
If you look at the following packet trace capture, on the VirtualEthernet0/0/0-tx node it is trying to send to the wrong virtual queue: VirtualEthernet0/0/1 tx queue 0.
00:14:02:645617: dpdk-input
VirtualEthernet0/0/1 rx queue 0
buffer 0xfbd49da: current data 0, length 42, free-list 0, totlen-nifb 0, trace 0x2
PKT MBUF: port 255, nb_segs 1, pkt_len 42
ARP: fa:16:3e:ad:f2:67 -> ff:ff:ff:ff:ff:ff
request, type ethernet/IP4, address size 6/4
fa:16:3e:ad:f2:67/51.51.51.142 -> 00:00:00:00:00:00/51.51.51.141
00:14:02:645627: ethernet-input
ARP: fa:16:3e:ad:f2:67 -> ff:ff:ff:ff:ff:ff
00:14:02:645633: l2-input
l2-input: sw_if_index 10 dst ff:ff:ff:ff:ff:ff src fa:16:3e:ad:f2:67
00:14:02:645634: arp-term-l2bd
request, type ethernet/IP4, address size 6/4
fa:16:3e:ad:f2:67/51.51.51.142 -> 00:00:00:00:00:00/51.51.51.141
00:14:02:645637: l2-flood
l2-flood: sw_if_index 10 dst ff:ff:ff:ff:ff:ff src fa:16:3e:ad:f2:67 bd_index 1
00:14:02:645639: l2-output
l2-output: sw_if_index 9 dst ff:ff:ff:ff:ff:ff src fa:16:3e:ad:f2:67
00:14:02:645643: VirtualEthernet0/0/0-output
VirtualEthernet0/0/0
ARP: fa:16:3e:ad:f2:67 -> ff:ff:ff:ff:ff:ff
request, type ethernet/IP4, address size 6/4
fa:16:3e:ad:f2:67/51.51.51.142 -> 00:00:00:00:00:00/51.51.51.141
00:14:02:645645: VirtualEthernet0/0/0-tx
VirtualEthernet0/0/1 tx queue 0 -------> This should be VirtualEthernet0/0/0.
buffer 0xfbd49da: current data 0, length 42, free-list 1, totlen-nifb 0, trace 0x2
ARP: fa:16:3e:ad:f2:67 -> ff:ff:ff:ff:ff:ff
request, type ethernet/IP4, address size 6/4
fa:16:3e:ad:f2:67/51.51.51.142 -> 00:00:00:00:00:00/51.51.51.141
00:14:02:645661: l2-flood
l2-flood: sw_if_index 10 dst 00:01:08:00:06:04 src 00:01:fa:16:3e:ad bd_index 1
00:14:02:645664: arp-input
request, type ethernet/IP4, address size 6/4
fa:16:3e:ad:f2:67/51.51.51.142 -> 00:00:00:00:00:00/51.51.51.141
00:14:02:645668: error-drop
arp-input: IP4 destination address not local to subnet
-------
This is where the problem happens:
open-vpp/vnet/vnet/devices/dpdk/device.c
dpdk_interface_tx (vlib_main_t * vm,
{
dpdk_main_t * dm = &dpdk_main;
vnet_interface_output_runtime_t * rd = (void *) node->runtime_data;
dpdk_device_t * xd = vec_elt_at_index (dm->devices, rd->dev_instance);
u32 n_packets = f->n_vectors;

xd is extracted using rd->dev_instance, which comes from the node's runtime_data; that data is determined when vnet_register_interface() runs, so a stale dev_instance makes the tx function operate on the wrong device.
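To make the failure mode concrete, here is a tiny standalone illustration (my own, not VPP code) of what the trace above shows: the tx node trusts whatever dev_instance is cached in its runtime data, so a value left over from the deleted interface selects the other device and therefore the other virtqueue.

#include <stdio.h>

/* Standalone illustration only: VirtualEthernet0/0/0-tx indexing the device
   table with a stale dev_instance picks VirtualEthernet0/0/1. */
int main (void)
{
  const char *devices[] = {
    "VirtualEthernet0/0/0",   /* dev_instance 0 */
    "VirtualEthernet0/0/1",   /* dev_instance 1 */
  };
  unsigned stale_dev_instance = 1;  /* left in runtime_data from before the delete */

  printf ("VirtualEthernet0/0/0-tx -> %s tx queue 0\n", devices[stale_dev_instance]);
  return 0;
}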
Original issue: https://jira.fd.io/browse/VPP-171