-
Notifications
You must be signed in to change notification settings - Fork 629
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[VPP-1943] sw_interface_dump sometimes doesn't return info from all interfaces #3405
Comments
Let me correct this, the issue happens in VPP Device tests (as the link shows), not performance tests. |
To be explicit, the symptom shows [1] in CSIT performance tests as: TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType' |
> The issue happens sporadically, much more frequently on aach64 than x86_64. Here [0] is an example for x86_64. [0] https://logs.fd.io/production/vex-yul-rot-jenkins-1/vpp-csit-verify-device-master-1n-skx/10496/ |
We did enable all of the debug messages by using the debugfs:
root@s56-t37-sut1:/sys/kernel/debug/i40e/0000:05:00.1# echo 'msg_enable -1' >command
This gave a bit more of the context: [1483319.788799] i40e 05:00.1 AQTX: desc and buffer writeback:
12 is this error message:
Searching for the I40E_AQ_RC_EBUSY in the linux source code shows only it being checked, but not being assigned, so I conclude it is a contention in the lower level firmware code. So, there are two questions/open items to look at:
1) is this an expected condition to happen ? what is the reason for this contention and is it normal to have it, and what is the expected correct behavior of the calling code should be 2) if "yes" to the previous question - then, since the caller in this case initialization code of DPDK - we will need this addressed it there.
|
Additional info: These errors correspond with the VFs that are improperly initialized in VPP. |
An observation: The failures seem to be absent when only one job is running, suggesting a race condition when two sources are trying to get PF version or other PF data. |
Description
When running multiple VPPs in parallel, we're observing an issue with VFs not coming up with dpdk plugin:
With config
dev 0000:05:10.0
dev 0000:05:10.1
Only one interface is up though:
Name Idx State MTU (L3/IP4/IP6/MPLS) Counter Count
VirtualFunctionEthernet5/10/0 1 down 9000/0/0/0
local0 0 down 0/0/0/0
VPP logs shows that an error in dpdk plugin is associated with this failure state:
2020/12/02 10:44:04:317 notice dpdk i40evf_check_api_version(): PF/VF API version mismatch:(0.0)-(1.1)
2020/12/02 10:44:04:317 notice dpdk i40evf_init_vf(): check_api version failed
2020/12/02 10:44:04:317 notice dpdk i40evf_dev_init(): Init vf failed
2020/12/02 10:44:04:317 notice dpdk EAL: Releasing pci mapped resource for 0000:05:10.1
2020/12/02 10:44:04:317 notice dpdk EAL: Calling pci_unmap_resource for 0000:05:10.1 at 0x2101014000
2020/12/02 10:44:04:317 notice dpdk EAL: Calling pci_unmap_resource for 0000:05:10.1 at 0x2101024000
2020/12/02 10:44:04:317 notice dpdk EAL: Requested device 0000:05:10.1 cannot be used
There are also associated error logs in dmesg:
[Thu Dec 3 02:30:56 2020] i40e 0000:05:00.1: Unable to send the message to VF 49 aq_err 12
[Thu Dec 3 02:30:56 2020] i40e 0000:05:00.1: Unable to send the message to VF 52 aq_err 12
[Thu Dec 3 02:30:56 2020] i40e 0000:05:00.1: Unable to send the message to VF 54 aq_err 12
0000:05:00.1 is the PF associated with 0000:05:10.1, VF 48 is 0000:05:10.0, VF 49 0000:05:10.1 and so on.
This suggests and issue with either NIC firmware or the i40e driver. More info in comment from Andrew below.
---------------------------------------------
OLD DESCRIPTION:
PAPI that was executed:
sw_interface_dump(name_filter_valid=False,name_filter='')
Show PCI right after VPP starts shows at 10:16:57.004:
show pci
Address Sock VID:PID Link Speed Driver Product Name Vital Product Data
0000:91:02.4 1 8086:154c unknown vfio-pci
0000:91:02.5 1 8086:154c unknown vfio-pci
vpp#
But then sw_interface_dump returns only partial interface information - 0000:91:02.4 is missing at 10:17:00.177 (3 seconds later):
[OrderedDict([('_0', 84), ('context', 2377), ('sw_if_index', 0), ('sup_sw_if_index', 0), ('l2_address', MACAddress(00:00:00:00:00:00)), ('flags', <vl_api_if_status_flags_t.0: 0>), ('type', <vl_api_if_type_t.IF_API_TYPE_HARDWARE: 0>), ('link_duplex', <vl_api_link_duplex_t.LINK_DUPLEX_API_UNKNOWN: 0>), ('link_speed', 0), ('link_mtu', 0), ('mtu', [0, 0, 0, 0]), ('sub_id', 0), ('sub_number_of_tags', 0), ('sub_outer_vlan_id', 0), ('sub_inner_vlan_id', 0), ('sub_if_flags', <vl_api_sub_if_flags_t.0: 0>), ('vtr_op', 0), ('vtr_push_dot1q', 0), ('vtr_tag1', 0), ('vtr_tag2', 0), ('outer_tag', 0), ('b_dmac', MACAddress(00:00:00:00:00:00)), ('b_smac', MACAddress(00:00:00:00:00:00)), ('b_vlanid', 0), ('i_sid', 0), ('interface_name', 'local0'), ('interface_dev_type', 'local'), ('tag', '')]),
OrderedDict([('_0', 84), ('context', 2377), ('sw_if_index', 1), ('sup_sw_if_index', 1), ('l2_address', MACAddress(ba:dc:0f:fe:02:05)), ('flags', <vl_api_if_status_flags_t.IF_STATUS_API_FLAG_LINK_UP|IF_STATUS_API_FLAG_ADMIN_UP: 3>), ('type', <vl_api_if_type_t.IF_API_TYPE_HARDWARE: 0>), ('link_duplex', <vl_api_link_duplex_t.LINK_DUPLEX_API_UNKNOWN: 0>), ('link_speed', 40000000), ('link_mtu', 9206), ('mtu', [9000, 0, 0, 0]), ('sub_id', 0), ('sub_number_of_tags', 0), ('sub_outer_vlan_id', 0), ('sub_inner_vlan_id', 0), ('sub_if_flags', <vl_api_sub_if_flags_t.0: 0>), ('vtr_op', 0), ('vtr_push_dot1q', 0), ('vtr_tag1', 0), ('vtr_tag2', 0), ('outer_tag', 0), ('b_dmac', MACAddress(00:00:00:00:00:00)), ('b_smac', MACAddress(00:00:00:00:00:00)), ('b_vlanid', 0), ('i_sid', 0), ('interface_name', 'VirtualFunctionEthernet91/2/5'), ('interface_dev_type', 'dpdk'), ('tag', '')])]
When VPP starts, the dpdk plugin complains that /dev/vfio/142 doesn't exist:
2020/11/10 12:36:03:864 notice dpdk EAL: Detected 64 lcore(s)
2020/11/10 12:36:03:864 notice dpdk EAL: Detected 2 NUMA nodes
2020/11/10 12:36:03:864 notice dpdk EAL: Selected IOVA mode 'VA'
2020/11/10 12:36:03:864 notice dpdk EAL: Probing VFIO support...
2020/11/10 12:36:03:864 notice dpdk EAL: VFIO support initialized
2020/11/10 12:36:03:864 notice dpdk EAL: Cannot open /dev/vfio/142: No such device
2020/11/10 12:36:03:864 notice dpdk EAL: Failed to open group 142
2020/11/10 12:36:03:864 notice dpdk EAL: Requested device 0000:91:02.0 cannot be used
2020/11/10 12:36:03:864 notice dpdk EAL: using IOMMU type 1 (Type 1)
2020/11/10 12:36:03:864 notice dpdk EAL: Probe PCI driver: net_i40e_vf (8086:154c) device: 0000:91:02.1 (socket 1)
2020/11/10 12:36:03:864 notice dpdk EAL: No legacy callbacks, legacy socket not created
When this error happens, all subsequent tests fail the same way.
Another way this can fail:
2020/11/09 16:05:39:251 notice dpdk EAL: Probe PCI driver: net_i40e_vf (8086:154c) device: 0000:91:02.5 (socket 1)
2020/11/09 16:05:39:251 notice dpdk i40evf_check_api_version(): PF/VF API version mismatch:(0.0)-(1.1)
2020/11/09 16:05:39:251 notice dpdk i40evf_init_vf(): check_api version failed
2020/11/09 16:05:39:251 notice dpdk i40evf_dev_init(): Init vf failed
2020/11/09 16:05:39:251 notice dpdk EAL: Releasing pci mapped resource for 0000:91:02.5
2020/11/09 16:05:39:251 notice dpdk EAL: Calling pci_unmap_resource for 0000:91:02.5 at 0x2101014000
2020/11/09 16:05:39:251 notice dpdk EAL: Calling pci_unmap_resource for 0000:91:02.5 at 0x2101024000
2020/11/09 16:05:39:251 notice dpdk EAL: Requested device 0000:91:02.5 cannot be used
This failure only results in one failed test, suggesting some bug in DPDK.
Yet another way:
2020/11/25 13:59:21:696 notice dpdk EAL: Detected 64 lcore(s)
2020/11/25 13:59:21:696 notice dpdk EAL: Detected 2 NUMA nodes
2020/11/25 13:59:21:696 notice dpdk EAL: Selected IOVA mode 'VA'
2020/11/25 13:59:21:696 notice dpdk EAL: Probing VFIO support...
2020/11/25 13:59:21:696 notice dpdk EAL: VFIO support initialized
2020/11/25 13:59:21:696 notice dpdk EAL: using IOMMU type 1 (Type 1)
2020/11/25 13:59:21:696 notice dpdk EAL: Getting a vfio_dev_fd for 0000:91:02.0 failed
2020/11/25 13:59:21:696 notice dpdk EAL: Requested device 0000:91:02.0 cannot be used
2020/11/25 13:59:21:696 notice dpdk EAL: Cannot open /dev/vfio/143: Device or resource busy
2020/11/25 13:59:21:696 notice dpdk EAL: Failed to open group 143
2020/11/25 13:59:21:696 notice dpdk EAL: Requested device 0000:91:02.1 cannot be used
2020/11/25 13:59:21:696 notice dpdk EAL: Bus (pci) probe failed.
There is something wrong with /dev/vfio/.
The issue happens sporadically, much more frequently on aach64 than x86_64. It only happens when there are multiple jobs running, suggesting this happens when multiple VPP instances simultaneously use the same PF (associated with the different VFs the different VPP instances are using). This would be a race condition and it makes sense that it happens more frequently on arm, since those servers are slower, providing a bigger window for overlap where the race condition could happen.
ADDITIONAL INFORMATION:
NIC:
Intel Ethernet Converged Network Adapter XL710-Q2
cat /sys/class/net/enp5s0f0/device/device
0x1583
cat /sys/class/net/enp5s0f0/device/vendor
0x8086
cat /sys/class/net/enp5s0f0/device/subsystem_device
0x0001
cat /sys/class/net/enp5s0f0/device/subsystem_vendor
0x8086
DRIVER/FIRMWARE VERSION:
driver: i40e
version: 2.1.14-k
firmware-version: 6.01 0x800035da 1.1747.0
expansion-rom-version:
bus-info: 0000:05:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes
The PF is using the i40e driver above.
VFs are using vfio-pci.
The environment where I've gathered logs from is Ubuntu18.04 with this kernel:
Linux s55-t36-sut1 4.15.0-55-generic #60-Ubuntu SMP Tue Jul 2 18:23:38 UTC 2019 aarch64 aarch64 aarch64 GNU/Linux
This also happens on x86 with this kernel (the environment is otherwise the same):
Linux s1-t11-sut1 4.15.0-23-generic #25-Ubuntu SMP Wed May 23 18:02:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Assignee
Juraj Linkeš
Reporter
Juraj Linkeš
Comments
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
[1] https://logs.fd.io/production/vex-yul-rot-jenkins-1/vpp-csit-verify-device-master-1n-skx/10496/archives/log.html.gz#s1-s1-s1-s2-s6-t1-k2-k8-k3-k1-k1-k1-k1-k1-k1-k3-k1-k1
Here [0] is an example for x86_64.
[0] https://logs.fd.io/production/vex-yul-rot-jenkins-1/vpp-csit-verify-device-master-1n-skx/10496/
root@s56-t37-sut1:/sys/kernel/debug/i40e/0000:05:00.1# echo 'msg_enable -1' >command
This gave a bit more of the context:
[1483319.788799] i40e 05:00.1 AQTX: desc and buffer writeback:
[1483319.788800] i40e 05:00.1 AQ CMD: opcode 0x0207, flags 0x2003, datalen 0x0000, retval 0x0000
[1483319.788801] i40e 05:00.1 cookie (h,l) 0x00000000 0x00000000
[1483319.788803] i40e 05:00.1 param (0,1) 0x00000000 0x002029E8
[1483319.788804] i40e 05:00.1 addr (h,l) 0x00000000 0x07FF07FF
[1483319.788807] i40e 05:00.1 ntc 234 head 235.
[1483319.788809] i40e 05:00.1 AQTX: desc and buffer:
[1483319.788810] i40e 05:00.1 AQ CMD: opcode 0x0207, flags 0x2000, datalen 0x0000, retval 0x0000
[1483319.788811] i40e 05:00.1 cookie (h,l) 0x00000000 0x00000000
[1483319.788812] i40e 05:00.1 param (0,1) 0x00000000 0x002031E8
[1483319.788813] i40e 05:00.1 addr (h,l) 0x00000000 0x07FF07FF
[1483319.788867] i40e 05:00.1 AQTX: desc and buffer writeback:
[1483319.788868] i40e 05:00.1 AQ CMD: opcode 0x0207, flags 0x2003, datalen 0x0000, retval 0x0000
[1483319.788869] i40e 05:00.1 cookie (h,l) 0x00000000 0x00000000
[1483319.788871] i40e 05:00.1 param (0,1) 0x00000000 0x002031E8
[1483319.788872] i40e 05:00.1 addr (h,l) 0x00000000 0x07FF07FF
[1483319.802767] vfio-pci 0000:05:10.1: enabling device (0000 -> 0002)
[1483319.826114] i40e 05:00.1 AQRX: desc and buffer:
[1483319.826115] i40e 05:00.1 AQ CMD: opcode 0x0801, flags 0x1009, datalen 0x0008, retval 0x0070
[1483319.826116] i40e 05:00.1 cookie (h,l) 0x00000001 0x00000000
[1483319.826117] i40e 05:00.1 param (0,1) 0x00000000 0x00000000
[1483319.826118] i40e 05:00.1 addr (h,l) 0x00000000 0xFFBBB000
[1483319.826119] i40e 05:00.1 AQ CMD Buffer:
[1483319.826121] i40e 05:00.1: 0x00000000: 01 00 00 00 01 00 00 00
[1483319.826126] i40e 05:00.1 ntc 235 head 236.
[1483319.826127] i40e 05:00.1 AQTX: desc and buffer:
[1483319.826129] i40e 05:00.1 AQ CMD: opcode 0x0802, flags 0x3400, datalen 0x0008, retval 0x0000
[1483319.826130] i40e 05:00.1 cookie (h,l) 0x00000001 0x00000000
[1483319.826131] i40e 05:00.1 param (0,1) 0x00000070 0x00000000
[1483319.826132] i40e 05:00.1 addr (h,l) 0x00000000 0xFFCFA000
[1483319.826132] i40e 05:00.1 AQ CMD Buffer:
[1483319.826134] i40e 05:00.1: 0x00000000: 01 00 00 00 01 00 00 00
[1483319.826187] i40e 05:00.1 AQTX: Command completed with error 0xC.
[1483319.826188] i40e 05:00.1 AQTX: desc and buffer writeback:
[1483319.826189] i40e 05:00.1 AQ CMD: opcode 0x0802, flags 0x3407, datalen 0x0008, retval 0x000C
[1483319.826190] i40e 05:00.1 cookie (h,l) 0x00000001 0x00000000
[1483319.826191] i40e 05:00.1 param (0,1) 0x00000070 0x00000000
[1483319.826192] i40e 05:00.1 addr (h,l) 0x00000000 0xFFCFA000
[1483319.826193] i40e 05:00.1 AQ CMD Buffer:
[1483319.826194] i40e 05:00.1: 0x00000000: 01 00 00 00 01 00 00 00
[1483319.826198] i40e 0000:05:00.1: Unable to send the message to VF 48 aq_err 12
12 is this error message:
https://elixir.bootlin.com/linux/v4.15/source/drivers/net/ethernet/intel/i40evf/i40e_adminq_cmd.h#L115
Searching for the I40E_AQ_RC_EBUSY in the linux source code shows only it being checked, but not being assigned, so I conclude it is a contention in the lower level firmware code.
So, there are two questions/open items to look at:
is this an expected condition to happen ? what is the reason for this contention and is it normal to have it, and what is the expected correct behavior of the calling code should be
if "yes" to the previous question - then, since the caller in this case initialization code of DPDK - we will need this addressed it there.
Additional info:
DMESG reveals the following errors:
[Thu Dec 3 02:30:56 2020] i40e 0000:05:00.1: Unable to send the message to VF 49 aq_err 12
[Thu Dec 3 02:30:56 2020] i40e 0000:05:00.1: Unable to send the message to VF 52 aq_err 12
[Thu Dec 3 02:30:56 2020] i40e 0000:05:00.1: Unable to send the message to VF 54 aq_err 12
These errors correspond with the VFs that are improperly initialized in VPP.
The failures seem to be absent when only one job is running, suggesting a race condition when two sources are trying to get PF version or other PF data.
Original issue: https://jira.fd.io/browse/VPP-1943
The text was updated successfully, but these errors were encountered: