Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[VPP-1943] sw_interface_dump sometimes doesn't return info from all interfaces #3405

Open
vvalderrv opened this issue Feb 2, 2025 · 6 comments

Comments

@vvalderrv
Copy link
Contributor

Description

When running multiple VPPs in parallel, we're observing an issue with VFs not coming up with dpdk plugin:

With config

dev 0000:05:10.0

dev 0000:05:10.1

Only one interface is up though:

Name Idx State MTU (L3/IP4/IP6/MPLS) Counter Count

VirtualFunctionEthernet5/10/0 1 down 9000/0/0/0

local0 0 down 0/0/0/0

VPP logs shows that an error in dpdk plugin is associated with this failure state:

2020/12/02 10:44:04:317 notice dpdk i40evf_check_api_version(): PF/VF API version mismatch:(0.0)-(1.1)

2020/12/02 10:44:04:317 notice dpdk i40evf_init_vf(): check_api version failed

2020/12/02 10:44:04:317 notice dpdk i40evf_dev_init(): Init vf failed

2020/12/02 10:44:04:317 notice dpdk EAL: Releasing pci mapped resource for 0000:05:10.1

2020/12/02 10:44:04:317 notice dpdk EAL: Calling pci_unmap_resource for 0000:05:10.1 at 0x2101014000

2020/12/02 10:44:04:317 notice dpdk EAL: Calling pci_unmap_resource for 0000:05:10.1 at 0x2101024000

2020/12/02 10:44:04:317 notice dpdk EAL: Requested device 0000:05:10.1 cannot be used

There are also associated error logs in dmesg:

[Thu Dec 3 02:30:56 2020] i40e 0000:05:00.1: Unable to send the message to VF 49 aq_err 12

[Thu Dec 3 02:30:56 2020] i40e 0000:05:00.1: Unable to send the message to VF 52 aq_err 12

[Thu Dec 3 02:30:56 2020] i40e 0000:05:00.1: Unable to send the message to VF 54 aq_err 12

0000:05:00.1 is the PF associated with 0000:05:10.1, VF 48 is 0000:05:10.0, VF 49 0000:05:10.1 and so on.

This suggests and issue with either NIC firmware or the i40e driver. More info in comment from Andrew below.

---------------------------------------------

OLD DESCRIPTION:

PAPI that was executed:

sw_interface_dump(name_filter_valid=False,name_filter='')

Show PCI right after VPP starts shows at 10:16:57.004:

show pci

Address Sock VID:PID Link Speed Driver Product Name Vital Product Data

0000:91:02.4 1 8086:154c unknown vfio-pci

0000:91:02.5 1 8086:154c unknown vfio-pci

vpp#

But then sw_interface_dump returns only partial interface information - 0000:91:02.4 is missing at 10:17:00.177 (3 seconds later):

[OrderedDict([('_0', 84), ('context', 2377), ('sw_if_index', 0), ('sup_sw_if_index', 0), ('l2_address', MACAddress(00:00:00:00:00:00)), ('flags', <vl_api_if_status_flags_t.0: 0>), ('type', <vl_api_if_type_t.IF_API_TYPE_HARDWARE: 0>), ('link_duplex', <vl_api_link_duplex_t.LINK_DUPLEX_API_UNKNOWN: 0>), ('link_speed', 0), ('link_mtu', 0), ('mtu', [0, 0, 0, 0]), ('sub_id', 0), ('sub_number_of_tags', 0), ('sub_outer_vlan_id', 0), ('sub_inner_vlan_id', 0), ('sub_if_flags', <vl_api_sub_if_flags_t.0: 0>), ('vtr_op', 0), ('vtr_push_dot1q', 0), ('vtr_tag1', 0), ('vtr_tag2', 0), ('outer_tag', 0), ('b_dmac', MACAddress(00:00:00:00:00:00)), ('b_smac', MACAddress(00:00:00:00:00:00)), ('b_vlanid', 0), ('i_sid', 0), ('interface_name', 'local0'), ('interface_dev_type', 'local'), ('tag', '')]),

OrderedDict([('_0', 84), ('context', 2377), ('sw_if_index', 1), ('sup_sw_if_index', 1), ('l2_address', MACAddress(ba:dc:0f:fe:02:05)), ('flags', <vl_api_if_status_flags_t.IF_STATUS_API_FLAG_LINK_UP|IF_STATUS_API_FLAG_ADMIN_UP: 3>), ('type', <vl_api_if_type_t.IF_API_TYPE_HARDWARE: 0>), ('link_duplex', <vl_api_link_duplex_t.LINK_DUPLEX_API_UNKNOWN: 0>), ('link_speed', 40000000), ('link_mtu', 9206), ('mtu', [9000, 0, 0, 0]), ('sub_id', 0), ('sub_number_of_tags', 0), ('sub_outer_vlan_id', 0), ('sub_inner_vlan_id', 0), ('sub_if_flags', <vl_api_sub_if_flags_t.0: 0>), ('vtr_op', 0), ('vtr_push_dot1q', 0), ('vtr_tag1', 0), ('vtr_tag2', 0), ('outer_tag', 0), ('b_dmac', MACAddress(00:00:00:00:00:00)), ('b_smac', MACAddress(00:00:00:00:00:00)), ('b_vlanid', 0), ('i_sid', 0), ('interface_name', 'VirtualFunctionEthernet91/2/5'), ('interface_dev_type', 'dpdk'), ('tag', '')])]

When VPP starts, the dpdk plugin complains that /dev/vfio/142 doesn't exist:

2020/11/10 12:36:03:864 notice dpdk EAL: Detected 64 lcore(s)

2020/11/10 12:36:03:864 notice dpdk EAL: Detected 2 NUMA nodes

2020/11/10 12:36:03:864 notice dpdk EAL: Selected IOVA mode 'VA'

2020/11/10 12:36:03:864 notice dpdk EAL: Probing VFIO support...

2020/11/10 12:36:03:864 notice dpdk EAL: VFIO support initialized

2020/11/10 12:36:03:864 notice dpdk EAL: Cannot open /dev/vfio/142: No such device

2020/11/10 12:36:03:864 notice dpdk EAL: Failed to open group 142

2020/11/10 12:36:03:864 notice dpdk EAL: Requested device 0000:91:02.0 cannot be used

2020/11/10 12:36:03:864 notice dpdk EAL: using IOMMU type 1 (Type 1)

2020/11/10 12:36:03:864 notice dpdk EAL: Probe PCI driver: net_i40e_vf (8086:154c) device: 0000:91:02.1 (socket 1)

2020/11/10 12:36:03:864 notice dpdk EAL: No legacy callbacks, legacy socket not created

When this error happens, all subsequent tests fail the same way.

Another way this can fail:

2020/11/09 16:05:39:251 notice dpdk EAL: Probe PCI driver: net_i40e_vf (8086:154c) device: 0000:91:02.5 (socket 1)

2020/11/09 16:05:39:251 notice dpdk i40evf_check_api_version(): PF/VF API version mismatch:(0.0)-(1.1)

2020/11/09 16:05:39:251 notice dpdk i40evf_init_vf(): check_api version failed

2020/11/09 16:05:39:251 notice dpdk i40evf_dev_init(): Init vf failed

2020/11/09 16:05:39:251 notice dpdk EAL: Releasing pci mapped resource for 0000:91:02.5

2020/11/09 16:05:39:251 notice dpdk EAL: Calling pci_unmap_resource for 0000:91:02.5 at 0x2101014000

2020/11/09 16:05:39:251 notice dpdk EAL: Calling pci_unmap_resource for 0000:91:02.5 at 0x2101024000

2020/11/09 16:05:39:251 notice dpdk EAL: Requested device 0000:91:02.5 cannot be used

This failure only results in one failed test, suggesting some bug in DPDK.

Yet another way:

2020/11/25 13:59:21:696 notice dpdk EAL: Detected 64 lcore(s)

2020/11/25 13:59:21:696 notice dpdk EAL: Detected 2 NUMA nodes

2020/11/25 13:59:21:696 notice dpdk EAL: Selected IOVA mode 'VA'

2020/11/25 13:59:21:696 notice dpdk EAL: Probing VFIO support...

2020/11/25 13:59:21:696 notice dpdk EAL: VFIO support initialized

2020/11/25 13:59:21:696 notice dpdk EAL: using IOMMU type 1 (Type 1)

2020/11/25 13:59:21:696 notice dpdk EAL: Getting a vfio_dev_fd for 0000:91:02.0 failed

2020/11/25 13:59:21:696 notice dpdk EAL: Requested device 0000:91:02.0 cannot be used

2020/11/25 13:59:21:696 notice dpdk EAL: Cannot open /dev/vfio/143: Device or resource busy

2020/11/25 13:59:21:696 notice dpdk EAL: Failed to open group 143

2020/11/25 13:59:21:696 notice dpdk EAL: Requested device 0000:91:02.1 cannot be used

2020/11/25 13:59:21:696 notice dpdk EAL: Bus (pci) probe failed.

There is something wrong with /dev/vfio/.

The issue happens sporadically, much more frequently on aach64 than x86_64. It only happens when there are multiple jobs running, suggesting this happens when multiple VPP instances simultaneously use the same PF (associated with the different VFs the different VPP instances are using). This would be a race condition and it makes sense that it happens more frequently on arm, since those servers are slower, providing a bigger window for overlap where the race condition could happen.

ADDITIONAL INFORMATION:

NIC:

Intel Ethernet Converged Network Adapter XL710-Q2

cat /sys/class/net/enp5s0f0/device/device

0x1583

cat /sys/class/net/enp5s0f0/device/vendor

0x8086

cat /sys/class/net/enp5s0f0/device/subsystem_device

0x0001

cat /sys/class/net/enp5s0f0/device/subsystem_vendor

0x8086

DRIVER/FIRMWARE VERSION:

driver: i40e

version: 2.1.14-k

firmware-version: 6.01 0x800035da 1.1747.0

expansion-rom-version:

bus-info: 0000:05:00.0

supports-statistics: yes

supports-test: yes

supports-eeprom-access: yes

supports-register-dump: yes

supports-priv-flags: yes

The PF is using the i40e driver above.

VFs are using vfio-pci.

The environment where I've gathered logs from is Ubuntu18.04 with this kernel:

Linux s55-t36-sut1 4.15.0-55-generic #60-Ubuntu SMP Tue Jul 2 18:23:38 UTC 2019 aarch64 aarch64 aarch64 GNU/Linux

This also happens on x86 with this kernel (the environment is otherwise the same):

Linux s1-t11-sut1 4.15.0-23-generic #25-Ubuntu SMP Wed May 23 18:02:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Assignee

Juraj Linkeš

Reporter

Juraj Linkeš

Comments

  • juraj.linkes (Wed, 12 May 2021 08:40:10 +0000):
    CSIT performance tests
Let me correct this, the issue happens in VPP Device tests (as the link shows), not performance tests. - **vrpolak (Tue, 11 May 2021 15:40:37 +0000)**: To be explicit, the symptom shows [1] in CSIT performance tests as:

  TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'

[1] https://logs.fd.io/production/vex-yul-rot-jenkins-1/vpp-csit-verify-device-master-1n-skx/10496/archives/log.html.gz#s1-s1-s1-s2-s6-t1-k2-k8-k3-k1-k1-k1-k1-k1-k1-k3-k1-k1

  • vrpolak (Fri, 18 Dec 2020 11:44:40 +0000): > The issue happens sporadically, much more frequently on aach64 than x86_64.

Here [0] is an example for x86_64.

[0] https://logs.fd.io/production/vex-yul-rot-jenkins-1/vpp-csit-verify-device-master-1n-skx/10496/

  • ayourtch (Thu, 3 Dec 2020 11:47:53 +0000): We did enable all of the debug messages by using the debugfs:

 

root@s56-t37-sut1:/sys/kernel/debug/i40e/0000:05:00.1# echo 'msg_enable -1' >command

 

This gave a bit more of the context:

[1483319.788799] i40e 05:00.1 AQTX: desc and buffer writeback:

[1483319.788800] i40e 05:00.1 AQ CMD: opcode 0x0207, flags 0x2003, datalen 0x0000, retval 0x0000

[1483319.788801] i40e 05:00.1 cookie (h,l) 0x00000000 0x00000000

[1483319.788803] i40e 05:00.1 param (0,1) 0x00000000 0x002029E8

[1483319.788804] i40e 05:00.1 addr (h,l) 0x00000000 0x07FF07FF

[1483319.788807] i40e 05:00.1 ntc 234 head 235.

[1483319.788809] i40e 05:00.1 AQTX: desc and buffer:

[1483319.788810] i40e 05:00.1 AQ CMD: opcode 0x0207, flags 0x2000, datalen 0x0000, retval 0x0000

[1483319.788811] i40e 05:00.1 cookie (h,l) 0x00000000 0x00000000

[1483319.788812] i40e 05:00.1 param (0,1) 0x00000000 0x002031E8

[1483319.788813] i40e 05:00.1 addr (h,l) 0x00000000 0x07FF07FF

[1483319.788867] i40e 05:00.1 AQTX: desc and buffer writeback:

[1483319.788868] i40e 05:00.1 AQ CMD: opcode 0x0207, flags 0x2003, datalen 0x0000, retval 0x0000

[1483319.788869] i40e 05:00.1 cookie (h,l) 0x00000000 0x00000000

[1483319.788871] i40e 05:00.1 param (0,1) 0x00000000 0x002031E8

[1483319.788872] i40e 05:00.1 addr (h,l) 0x00000000 0x07FF07FF

[1483319.802767] vfio-pci 0000:05:10.1: enabling device (0000 -> 0002)

[1483319.826114] i40e 05:00.1 AQRX: desc and buffer:

[1483319.826115] i40e 05:00.1 AQ CMD: opcode 0x0801, flags 0x1009, datalen 0x0008, retval 0x0070

[1483319.826116] i40e 05:00.1 cookie (h,l) 0x00000001 0x00000000

[1483319.826117] i40e 05:00.1 param (0,1) 0x00000000 0x00000000

[1483319.826118] i40e 05:00.1 addr (h,l) 0x00000000 0xFFBBB000

[1483319.826119] i40e 05:00.1 AQ CMD Buffer:

[1483319.826121] i40e 05:00.1: 0x00000000: 01 00 00 00 01 00 00 00

[1483319.826126] i40e 05:00.1 ntc 235 head 236.

[1483319.826127] i40e 05:00.1 AQTX: desc and buffer:

[1483319.826129] i40e 05:00.1 AQ CMD: opcode 0x0802, flags 0x3400, datalen 0x0008, retval 0x0000

[1483319.826130] i40e 05:00.1 cookie (h,l) 0x00000001 0x00000000

[1483319.826131] i40e 05:00.1 param (0,1) 0x00000070 0x00000000

[1483319.826132] i40e 05:00.1 addr (h,l) 0x00000000 0xFFCFA000

[1483319.826132] i40e 05:00.1 AQ CMD Buffer:

[1483319.826134] i40e 05:00.1: 0x00000000: 01 00 00 00 01 00 00 00

[1483319.826187] i40e 05:00.1 AQTX: Command completed with error 0xC.

[1483319.826188] i40e 05:00.1 AQTX: desc and buffer writeback:

[1483319.826189] i40e 05:00.1 AQ CMD: opcode 0x0802, flags 0x3407, datalen 0x0008, retval 0x000C

[1483319.826190] i40e 05:00.1 cookie (h,l) 0x00000001 0x00000000

[1483319.826191] i40e 05:00.1 param (0,1) 0x00000070 0x00000000

[1483319.826192] i40e 05:00.1 addr (h,l) 0x00000000 0xFFCFA000

[1483319.826193] i40e 05:00.1 AQ CMD Buffer:

[1483319.826194] i40e 05:00.1: 0x00000000: 01 00 00 00 01 00 00 00

[1483319.826198] i40e 0000:05:00.1: Unable to send the message to VF 48 aq_err 12

 

 12 is this error message:

 

https://elixir.bootlin.com/linux/v4.15/source/drivers/net/ethernet/intel/i40evf/i40e_adminq_cmd.h#L115

 

Searching for the I40E_AQ_RC_EBUSY in the linux source code shows only it being checked, but not being assigned, so I conclude it is a contention in the lower level firmware code.

So, there are two questions/open items to look at:

 

  1. is this an expected condition to happen ? what is the reason for this contention and is it normal to have it, and what is the expected correct behavior of the calling code should be

  2. if "yes" to the previous question - then, since the caller in this case initialization code of DPDK - we will need this addressed it there.

  • juraj.linkes (Thu, 3 Dec 2020 11:08:06 +0000):

    Additional info:

DMESG reveals the following errors:

[Thu Dec 3 02:30:56 2020] i40e 0000:05:00.1: Unable to send the message to VF 49 aq_err 12

[Thu Dec 3 02:30:56 2020] i40e 0000:05:00.1: Unable to send the message to VF 52 aq_err 12

[Thu Dec 3 02:30:56 2020] i40e 0000:05:00.1: Unable to send the message to VF 54 aq_err 12

These errors correspond with the VFs that are improperly initialized in VPP.

  • juraj.linkes (Fri, 13 Nov 2020 14:33:40 +0000): An observation:

The failures seem to be absent when only one job is running, suggesting a race condition when two sources are trying to get PF version or other PF data.

Original issue: https://jira.fd.io/browse/VPP-1943

@vvalderrv
Copy link
Contributor Author

CSIT performance tests

Let me correct this, the issue happens in VPP Device tests (as the link shows), not performance tests.

@vvalderrv
Copy link
Contributor Author

To be explicit, the symptom shows [1] in CSIT performance tests as:

  TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'

[1] https://logs.fd.io/production/vex-yul-rot-jenkins-1/vpp-csit-verify-device-master-1n-skx/10496/archives/log.html.gz#s1-s1-s1-s2-s6-t1-k2-k8-k3-k1-k1-k1-k1-k1-k1-k3-k1-k1

@vvalderrv
Copy link
Contributor Author

> The issue happens sporadically, much more frequently on aach64 than x86_64.

Here [0] is an example for x86_64.

[0] https://logs.fd.io/production/vex-yul-rot-jenkins-1/vpp-csit-verify-device-master-1n-skx/10496/

@vvalderrv
Copy link
Contributor Author

We did enable all of the debug messages by using the debugfs:

 

root@s56-t37-sut1:/sys/kernel/debug/i40e/0000:05:00.1# echo 'msg_enable -1' >command

 

This gave a bit more of the context:

[1483319.788799] i40e 05:00.1 AQTX: desc and buffer writeback:
[1483319.788800] i40e 05:00.1 AQ CMD: opcode 0x0207, flags 0x2003, datalen 0x0000, retval 0x0000
[1483319.788801] i40e 05:00.1 cookie (h,l) 0x00000000 0x00000000
[1483319.788803] i40e 05:00.1 param (0,1) 0x00000000 0x002029E8
[1483319.788804] i40e 05:00.1 addr (h,l) 0x00000000 0x07FF07FF
[1483319.788807] i40e 05:00.1 ntc 234 head 235.
[1483319.788809] i40e 05:00.1 AQTX: desc and buffer:
[1483319.788810] i40e 05:00.1 AQ CMD: opcode 0x0207, flags 0x2000, datalen 0x0000, retval 0x0000
[1483319.788811] i40e 05:00.1 cookie (h,l) 0x00000000 0x00000000
[1483319.788812] i40e 05:00.1 param (0,1) 0x00000000 0x002031E8
[1483319.788813] i40e 05:00.1 addr (h,l) 0x00000000 0x07FF07FF
[1483319.788867] i40e 05:00.1 AQTX: desc and buffer writeback:
[1483319.788868] i40e 05:00.1 AQ CMD: opcode 0x0207, flags 0x2003, datalen 0x0000, retval 0x0000
[1483319.788869] i40e 05:00.1 cookie (h,l) 0x00000000 0x00000000
[1483319.788871] i40e 05:00.1 param (0,1) 0x00000000 0x002031E8
[1483319.788872] i40e 05:00.1 addr (h,l) 0x00000000 0x07FF07FF
[1483319.802767] vfio-pci 0000:05:10.1: enabling device (0000 -> 0002)
[1483319.826114] i40e 05:00.1 AQRX: desc and buffer:
[1483319.826115] i40e 05:00.1 AQ CMD: opcode 0x0801, flags 0x1009, datalen 0x0008, retval 0x0070
[1483319.826116] i40e 05:00.1 cookie (h,l) 0x00000001 0x00000000
[1483319.826117] i40e 05:00.1 param (0,1) 0x00000000 0x00000000
[1483319.826118] i40e 05:00.1 addr (h,l) 0x00000000 0xFFBBB000
[1483319.826119] i40e 05:00.1 AQ CMD Buffer:
[1483319.826121] i40e 05:00.1: 0x00000000: 01 00 00 00 01 00 00 00
[1483319.826126] i40e 05:00.1 ntc 235 head 236.
[1483319.826127] i40e 05:00.1 AQTX: desc and buffer:
[1483319.826129] i40e 05:00.1 AQ CMD: opcode 0x0802, flags 0x3400, datalen 0x0008, retval 0x0000
[1483319.826130] i40e 05:00.1 cookie (h,l) 0x00000001 0x00000000
[1483319.826131] i40e 05:00.1 param (0,1) 0x00000070 0x00000000
[1483319.826132] i40e 05:00.1 addr (h,l) 0x00000000 0xFFCFA000
[1483319.826132] i40e 05:00.1 AQ CMD Buffer:
[1483319.826134] i40e 05:00.1: 0x00000000: 01 00 00 00 01 00 00 00
[1483319.826187] i40e 05:00.1 AQTX: Command completed with error 0xC.
[1483319.826188] i40e 05:00.1 AQTX: desc and buffer writeback:
[1483319.826189] i40e 05:00.1 AQ CMD: opcode 0x0802, flags 0x3407, datalen 0x0008, retval 0x000C
[1483319.826190] i40e 05:00.1 cookie (h,l) 0x00000001 0x00000000
[1483319.826191] i40e 05:00.1 param (0,1) 0x00000070 0x00000000
[1483319.826192] i40e 05:00.1 addr (h,l) 0x00000000 0xFFCFA000
[1483319.826193] i40e 05:00.1 AQ CMD Buffer:
[1483319.826194] i40e 05:00.1: 0x00000000: 01 00 00 00 01 00 00 00
[1483319.826198] i40e 0000:05:00.1: Unable to send the message to VF 48 aq_err 12

 

 12 is this error message:

 

https://elixir.bootlin.com/linux/v4.15/source/drivers/net/ethernet/intel/i40evf/i40e_adminq_cmd.h#L115

 

Searching for the I40E_AQ_RC_EBUSY in the linux source code shows only it being checked, but not being assigned, so I conclude it is a contention in the lower level firmware code.

So, there are two questions/open items to look at:

 

1) is this an expected condition to happen ? what is the reason for this contention and is it normal to have it, and what is the expected correct behavior of the calling code should be

2) if "yes" to the previous question - then, since the caller in this case initialization code of DPDK - we will need this addressed it there.

 

@vvalderrv
Copy link
Contributor Author

Additional info:
DMESG reveals the following errors:
[Thu Dec 3 02:30:56 2020] i40e 0000:05:00.1: Unable to send the message to VF 49 aq_err 12
[Thu Dec 3 02:30:56 2020] i40e 0000:05:00.1: Unable to send the message to VF 52 aq_err 12
[Thu Dec 3 02:30:56 2020] i40e 0000:05:00.1: Unable to send the message to VF 54 aq_err 12

These errors correspond with the VFs that are improperly initialized in VPP.

@vvalderrv
Copy link
Contributor Author

An observation:

The failures seem to be absent when only one job is running, suggesting a race condition when two sources are trying to get PF version or other PF data.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant