Docs: add mtu size configuration for SpiderMultusConfig
Signed-off-by: Cyclinder Kuo <[email protected]>
cyclinder committed Feb 19, 2025
1 parent 285e341 commit cc8cdc5
Showing 6 changed files with 958 additions and 50 deletions.
97 changes: 87 additions & 10 deletions docs/usage/install/ai/get-started-macvlan-zh_CN.md
@@ -12,6 +12,19 @@

- Macvlan interfaces cannot be created on InfiniBand IPoIB network cards, so this solution only applies to RoCE network scenarios and cannot be used in InfiniBand network scenarios.

## Comparison with the SR-IOV CNI RDMA Solution

| Comparison Dimension | Macvlan Shared RDMA Solution | SR-IOV CNI Isolated RDMA Solution |
| -------------------- | ---------------------------- | --------------------------------- |
| Network isolation | All containers share RDMA devices, poor isolation | Each container has a dedicated RDMA device, good isolation |
| Performance | High performance | Optimal performance via hardware passthrough |
| Resource utilization | High resource utilization | Low, limited by the number of VFs the hardware supports |
| Configuration complexity | Relatively simple | More complex, requires hardware support and configuration |
| Compatibility | Good compatibility, suitable for most environments | Depends on hardware support, less compatible |
| Applicable scenarios | Suitable for most scenarios, including bare metal and VMs | Bare metal only, not applicable to VM scenarios |
| Cost | Low cost, no additional hardware required | High cost, requires SR-IOV-capable hardware |
| RDMA protocol support | Supports RoCE; InfiniBand is not supported | Supports both RoCE and InfiniBand |

## Solution

This article uses the following typical AI cluster topology as an example to introduce how to set up Spiderpool.
@@ -83,20 +96,13 @@
.......
```

Confirm that the network card works in Ethernet mode:

```shell
$ ibstat mlx5_0 | grep "Link layer"
Link layer: Ethernet
```

If the network card is not working in the expected mode, run the following command to confirm that it supports the LINK_TYPE parameter. If the parameter is not available, replace the card with a supported model:

```shell
@@ -113,7 +119,53 @@
LINK_TYPE_P1 IB(1)
```
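
If the parameter exists but is set to `IB(1)`, the port can usually be switched to Ethernet mode with the `mlxconfig` tool from NVIDIA MFT. The sketch below is illustrative only; the MST device path is an example (after `mst start`, list your devices with `mst status`):

```shell
# Query the current port mode (LINK_TYPE_P1: 1 = InfiniBand, 2 = Ethernet)
mlxconfig -d /dev/mst/mt4119_pciconf0 query | grep LINK_TYPE

# Switch port 1 to Ethernet mode (-y skips the confirmation prompt),
# then reset the firmware for the change to take effect
mlxconfig -y -d /dev/mst/mt4119_pciconf0 set LINK_TYPE_P1=2
mlxfwreset -y -d /dev/mst/mt4119_pciconf0 reset
```
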

3. (Optional) Change the MTU size of the host network card

In some special communication scenarios, users need to customize the MTU size of the host network card to meet the transmission needs of different data packets. Taking Ubuntu as an example, where the default MTU of the host network card is 1500, you can customize the MTU size as follows:

Open the `netplan` configuration file located in the /etc/netplan/ directory; the filename may be 01-netcfg.yaml or similar. Open it with a text editor, for example:

```shell
vim /etc/netplan/01-netcfg.yaml
```

Modify the MTU setting in the `network:` section of the file, for example:

```yaml
network:
  version: 2
  ethernets:
    enp11s0f0np0:
      mtu: 8000
      ...
```

In this example, we set the MTU of `enp11s0f0np0` to 8000 to meet communication needs. Save the file, exit, and apply the changes with `netplan apply`:

```shell
$ sudo netplan apply
```

After applying the update, check whether the MTU of the `enp11s0f0np0` network card on the host has been updated to 8000:

```shell
~# ip l show enp11s0f0np0
6: enp11s0f0np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8000 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether b8:3f:d2:9f:09:42 brd ff:ff:ff:ff:ff:ff
...
```
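
If you want to try an MTU value before persisting it in netplan, iproute2 can change it at runtime. A minimal sketch; note that this change does not survive a reboot:

```shell
# Temporarily set the MTU; after a reboot the interface falls back to the netplan value
ip link set dev enp11s0f0np0 mtu 8000
```
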

4. Configure a lossless RDMA network on the host

In high-performance network scenarios, RDMA networks are extremely sensitive to packet loss: once retransmission occurs, performance drops sharply. To keep RDMA performance unaffected, the packet loss rate must stay below 1e-05 (one in 100,000), ideally at zero. For RoCE networks, the PFC + ECN mechanism can be used to guarantee lossless transmission.

Refer to [Configure RDMA Lossless Network](../../roce-qos-zh_CN.md).

> Configuring a lossless network requires the network card to work in an RDMA RoCE environment; InfiniBand is not supported.
>
> It also requires the switch to support the PFC + ECN mechanism, configured consistently with the host side; otherwise it will not work.
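
For illustration only (the authoritative steps are in the linked document, and the settings must match your switch), the `mlnx_qos` tool shipped with NVIDIA OFED can inspect and set PFC on a RoCE port. The priority enabled here (3) is a common example, not a requirement:

```shell
# Show the current QoS state of the port, including PFC and trust mode
mlnx_qos -i enp11s0f0np0

# Example: trust DSCP marking and enable PFC on priority 3 only
mlnx_qos -i enp11s0f0np0 --trust=dscp --pfc=0,0,0,1,0,0,0,0
```
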

5. Enable [GPUDirect RDMA](https://docs.nvidia.com/cuda/gpudirect-rdma/)

During the installation or use of [gpu-operator](https://github.com/NVIDIA/gpu-operator):

@@ -131,7 +183,7 @@
gdrdrv 24576 0
```

6. Confirm that the RDMA subsystem on the host works in shared mode, which is required for providing RDMA devices to containers in the macvlan scenario.

```shell
# Check the current operating mode (the Linux RDMA subsystem operates in shared mode by default):
@@ -246,6 +298,31 @@
EOF
```

By default, the MTU of a Pod's network interface is the same as that of its macvlan master interface. In some special communication scenarios, users need to customize the MTU size of Pods to meet the transmission needs of different data packets. You can customize the Pod MTU as follows:

```yaml
apiVersion: spiderpool.spidernet.io/v2beta1
kind: SpiderMultusConfig
metadata:
  name: gpu1-macvlan
  namespace: spiderpool
spec:
  cniType: macvlan
  rdmaResourceName: spidernet.io/shared_cx5_gpu1
  macvlan:
    master: ["enp11s0f0np0"]
    ippools:
      ipv4: ["gpu1-net11"]
  chainCNIJsonData:
    - |
      {
        "type": "tuning",
        "mtu": 1480
      }
```

Note: The MTU value must not exceed the MTU of the macvlan master interface; otherwise, the Pod cannot be created.
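
After a Pod is created with this configuration, you can verify that the tuning plugin applied the MTU. A sketch; the Pod name is hypothetical, and the interface name (`net1` here) depends on how the macvlan network is attached:

```shell
# Expect "mtu 1480" in the output
kubectl exec <pod-name> -- ip link show net1
```
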
## Create a Test Application

1. Create a DaemonSet application on the specified nodes.
88 changes: 86 additions & 2 deletions docs/usage/install/ai/get-started-macvlan.md
@@ -12,6 +12,19 @@ By using [RDMA shared device plugin](https://github.com/Mellanox/k8s-rdma-shared-dev-plugin),

- Macvlan interfaces cannot be created on an InfiniBand IPoIB network card, so this solution is only applicable in RoCE network scenarios and cannot be used in InfiniBand network scenarios.

## Comparison with the SR-IOV CNI RDMA Solution

| Comparison Dimension | Macvlan Shared RDMA Solution | SR-IOV CNI Isolated RDMA Solution |
| -------------------- | ---------------------------------- | ---------------------------------- |
| Network Isolation | All containers share RDMA devices, poor isolation | Containers have dedicated RDMA devices, good isolation |
| Performance | High performance | Optimal performance with hardware passthrough |
| Resource Utilization | High resource utilization | Low, limited by the number of supported VFs |
| Configuration Complexity | Relatively simple configuration | More complex configuration, requires hardware support |
| Compatibility | Good compatibility, suitable for most environments | Depends on hardware support, less compatible |
| Applicable Scenarios | Suitable for most scenarios, including bare metal and VMs | Only suitable for bare metal, not for VM scenarios |
| Cost | Low cost, no additional hardware support needed | High cost, requires hardware supporting SR-IOV |
| RDMA Protocol Support | Supports RoCE; InfiniBand is not supported | Supports both RoCE and InfiniBand |

## Solution

This article will introduce how to set up Spiderpool using the following typical AI cluster topology as an example.
@@ -113,7 +126,53 @@ The network planning for the cluster is as follows:
LINK_TYPE_P1 IB(1)
```
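
If the card reports the LINK_TYPE parameter but it is set to `IB(1)`, you can usually switch the port to Ethernet with the `mlxconfig` tool from NVIDIA MFT. A hedged sketch; the MST device path below is an example (after `mst start`, list your devices with `mst status`):

```shell
# Query the current port mode (LINK_TYPE_P1: 1 = InfiniBand, 2 = Ethernet)
mlxconfig -d /dev/mst/mt4119_pciconf0 query | grep LINK_TYPE

# Switch port 1 to Ethernet mode (-y skips the confirmation prompt),
# then reset the firmware for the change to take effect
mlxconfig -y -d /dev/mst/mt4119_pciconf0 set LINK_TYPE_P1=2
mlxfwreset -y -d /dev/mst/mt4119_pciconf0 reset
```
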

3. (Optional) Change the MTU size of the host network card

In some special communication scenarios, users may need to customize the MTU size of the host network card to meet the communication needs of different data packets.

This document uses the Ubuntu system as an example, where the default MTU value of the host network card is 1500. You can customize the MTU size of the host network card as follows:

Open the netplan configuration file, which is located in the /etc/netplan/ directory. The filename might be 01-netcfg.yaml or something similar. Use a text editor to open the file, for example:

```shell
vim /etc/netplan/01-netcfg.yaml
```

Modify the MTU setting in the `network:` section of the file, for example:

```yaml
network:
  version: 2
  ethernets:
    enp11s0f0np0:
      mtu: 8000
      ...
```

In this example, we set the MTU of `enp11s0f0np0` to 8000 to meet communication needs. Save the file and exit, then apply the changes using `netplan apply`.

```shell
$ sudo netplan apply
```

After executing the update, check if the MTU of the `enp11s0f0np0` network card on the host has been updated to 8000.

```shell
~# ip l show enp11s0f0np0
6: enp11s0f0np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8000 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether b8:3f:d2:9f:09:42 brd ff:ff:ff:ff:ff:ff
...
```
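
To experiment with an MTU value before persisting it in netplan, you can change it at runtime with iproute2. A minimal sketch; the setting is lost on reboot:

```shell
# Temporarily set the MTU; after a reboot the interface falls back to the netplan value
ip link set dev enp11s0f0np0 mtu 8000
```
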

4. Configure Host RDMA Lossless Network

In high-performance network scenarios, RDMA networks are very sensitive to packet loss. Once packet retransmission occurs, performance will drop sharply. Therefore, to ensure that RDMA network performance is not affected, the packet loss rate must be kept below 1e-05 (one in 100,000), ideally zero packet loss. For RoCE networks, the PFC + ECN mechanism can be used to ensure no packet loss during network transmission. Refer to [RoCE Lossless Network Configuration](../../roce-qos.md).

> Configuring a lossless network requires the network card to work in an RDMA RoCE environment; InfiniBand is not supported.
>
> Configuring a lossless network requires the switch to support the PFC + ECN mechanism and be aligned with the host side configuration; otherwise, it will not work.
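
For illustration only (follow the linked document for the authoritative procedure, and keep the settings aligned with your switch), NVIDIA's `mlnx_qos` tool, shipped with OFED, can inspect and configure PFC on a RoCE port. The priority enabled here (3) is a common example rather than a requirement:

```shell
# Show the current QoS state of the port, including PFC and trust mode
mlnx_qos -i enp11s0f0np0

# Example: trust DSCP marking and enable PFC on priority 3 only
mlnx_qos -i enp11s0f0np0 --trust=dscp --pfc=0,0,0,1,0,0,0,0
```
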

5. Enable [GPUDirect RDMA](https://docs.nvidia.com/cuda/gpudirect-rdma/)

During the installation or use of [gpu-operator](https://github.com/NVIDIA/gpu-operator):

@@ -132,7 +191,7 @@
gdrdrv 24576 0
```

6. Set the RDMA subsystem on the host to shared mode, allowing containers to share the host's RDMA devices in the macvlan scenario.

```shell
# Check the current operating mode (the Linux RDMA subsystem operates in shared mode by default):
@@ -246,6 +305,31 @@ The network planning for the cluster is as follows:
EOF
```

By default, the MTU of a Pod's network interface is the same as that of its macvlan master interface. In some special communication scenarios, users need to customize the MTU size for Pods to meet the communication needs of different data packets. You can customize the MTU size for Pods in the following way:

```yaml
apiVersion: spiderpool.spidernet.io/v2beta1
kind: SpiderMultusConfig
metadata:
  name: gpu1-macvlan
  namespace: spiderpool
spec:
  cniType: macvlan
  rdmaResourceName: spidernet.io/shared_cx5_gpu1
  macvlan:
    master: ["enp11s0f0np0"]
    ippools:
      ipv4: ["gpu1-net11"]
  chainCNIJsonData:
    - |
      {
        "type": "tuning",
        "mtu": 1480
      }
```

Note: The MTU value must not exceed the MTU of the macvlan master network interface; otherwise, the Pod cannot be created.
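
After a Pod is created with this configuration, you can verify that the tuning plugin applied the MTU. A sketch; the Pod name is hypothetical, and the interface name (`net1` here) depends on how the macvlan network is attached:

```shell
# Expect "mtu 1480" in the output
kubectl exec <pod-name> -- ip link show net1
```
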
## Create a Test Application

1. Create a DaemonSet application on the specified nodes.