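NVLink and RDMA: multi-node, multi-GPU training
Multi-GPU
DAMODEL (丹摩) A800, A100, and H20 instance types are currently equipped with NVLink interconnect modules. Any product specification that contains the keyword SXM supports NVLink.
⚠️ Lower-end GPU models generally do not support NVLink, and products whose specification contains the keyword PCI-E do not support it either.
Create a four-GPU instance, log in, and run nvidia-smi topo -m to view the interconnect topology.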
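As a quick way to read the topology matrix, the sketch below counts the NVLink cells (values such as NV4 or NV8) that nvidia-smi topo -m reports in the GPU rows. count_nvlink_entries is a hypothetical helper, not part of any NVIDIA tool, and it assumes the usual matrix layout in which each GPU row begins with GPU0, GPU1, and so on.

```shell
#!/bin/sh
# Hypothetical helper: count NV# cells in the GPU rows of
# `nvidia-smi topo -m` output, read from stdin.
count_nvlink_entries() {
  awk '$1 ~ /GPU[0-9]+$/ {
         for (i = 2; i <= NF; i++) if ($i ~ /^NV[0-9]+$/) n++
       }
       END { print n + 0 }'
}

# Only query the driver when nvidia-smi is actually available.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi topo -m | count_nvlink_entries
fi
```

The matrix is symmetric, so each GPU pair is counted twice. On a four-GPU instance there are 12 off-diagonal GPU cells, so a result of 12 would mean every pair is connected over NVLink; PCIe paths show up as PIX, PXB, PHB, NODE, or SYS instead.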

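Multi-node
In most cases, machines with built-in NVLink are also fitted with four IB/RoCEv2 NICs, so you can create several instances and train across nodes.
📢 Note: DAMODEL has renumbered the IB/RoCE HCA devices. The current HCA names are: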
- mlx5_10
- mlx5_11
- mlx5_12
- mlx5_13
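Because these names differ from the mlx5_0, mlx5_1, ... numbering often seen elsewhere, a communication library may need to be told explicitly which HCAs to use. As a minimal sketch, assuming the training job uses NCCL (which this page does not state), the names above can be joined into the comma-separated form that the NCCL_IB_HCA environment variable accepts. join_hcas is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: join its arguments with commas.
# The body runs in a subshell so the IFS change does not leak out.
join_hcas() (
  IFS=,
  printf '%s\n' "$*"
)

# Assumption: the job uses NCCL, which reads NCCL_IB_HCA to select HCAs.
NCCL_IB_HCA="$(join_hcas mlx5_10 mlx5_11 mlx5_12 mlx5_13)"
export NCCL_IB_HCA
echo "NCCL_IB_HCA=${NCCL_IB_HCA}"
```

Export the variable on every node before launching the job, and adjust the list if ibv_devinfo shows different names on your instances.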
Run ibv_devinfo to view the details of each HCA.
bash
root@d50f5e4p420c7396kbhg-oplkf:~/workspace# ibv_devinfo
hca_id: mlx5_13
transport: InfiniBand (0)
fw_ver: 20.39.1002
node_guid: 946d:ae03:008b:e8d4
sys_image_guid: 946d:ae03:008b:e8d4
vendor_id: 0x02c9
vendor_part_id: 4123
hw_ver: 0x0
board_id: MT_0000000223
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 37
port_lid: 49
port_lmc: 0x00
link_layer: InfiniBand
Failed to open device
hca_id: mlx5_11
transport: InfiniBand (0)
fw_ver: 20.39.1002
node_guid: 946d:ae03:008b:d2f0
sys_image_guid: 946d:ae03:008b:d2f0
vendor_id: 0x02c9
vendor_part_id: 4123
hw_ver: 0x0
board_id: MT_0000000223
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 37
port_lid: 50
port_lmc: 0x00
link_layer: InfiniBand
hca_id: mlx5_12
transport: InfiniBand (0)
fw_ver: 20.39.1002
node_guid: 946d:ae03:008b:efc4
sys_image_guid: 946d:ae03:008b:efc4
vendor_id: 0x02c9
vendor_part_id: 4123
hw_ver: 0x0
board_id: MT_0000000223
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 37
port_lid: 48
port_lmc: 0x00
link_layer: InfiniBand
hca_id: mlx5_10
transport: InfiniBand (0)
fw_ver: 20.39.1002
node_guid: b83f:d203:006e:682c
sys_image_guid: b83f:d203:006e:682c
vendor_id: 0x02c9
vendor_part_id: 4123
hw_ver: 0x0
board_id: MT_0000000223
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 37
port_lid: 55
port_lmc: 0x00
link_layer: InfiniBand

Multi-node, multi-GPU quick start
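Create instances
Create instances of a specification that supports both NVLink and a high-speed lossless RDMA network.
DAMODEL offers two RDMA configurations: IB and RoCEv2.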
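To see which of the two configurations an instance has, one option is to read the link_layer field that ibv_devinfo prints for each HCA: InfiniBand for IB, Ethernet for RoCE. The sketch below pairs each hca_id with its link layer; hca_link_layers is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: read `ibv_devinfo` output from stdin and print
# one "<hca_id> <link_layer>" line per port.
hca_link_layers() {
  awk '$1 == "hca_id:"     { hca = $2 }
       $1 == "link_layer:" { print hca, $2 }'
}

# Only call ibv_devinfo when it is installed (package ibverbs-utils).
if command -v ibv_devinfo >/dev/null 2>&1; then
  ibv_devinfo | hca_link_layers
fi
```

For the ibv_devinfo output shown earlier, every HCA would be reported with InfiniBand.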

Configure instances
Log in to the instances
An RDMA setup needs at least two instances.

Install RDMA dependencies
bash
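# Remove the CUDA apt source entries before refreshing the package index
rm -rf /etc/apt/sources.list.d/*cuda*
apt update
# Diagnostics, perftest benchmarks (ib_write_bw, ib_write_lat), and basic network tools
apt-get install -y infiniband-diags perftest iputils-ping iproute2
# RDMA user-space libraries and utilities
apt-get install -y librdmacm-dev libibverbs-dev ibverbs-utils ibutils rdmacm-utils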
apt-get install -y rdma-core  # raises an error: sys xxx read only

RDMA communication
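Before running the benchmarks, it can help to confirm on each instance that the tools used in this section are on PATH. This is a minimal sketch; missing_tools is a hypothetical helper that prints the name of every command it cannot find.

```shell
#!/bin/sh
# Hypothetical helper: print every named command that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# ibv_devinfo ships with ibverbs-utils; ib_write_bw and ib_write_lat with perftest.
missing_tools ibv_devinfo ib_write_bw ib_write_lat
```

No output means all three commands were found.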
RDMA throughput benchmark
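ib_write_bw reports bandwidth in MB/sec by default, while NIC speeds are usually quoted in Gbit/s. The sketch below converts between the two, assuming that perftest's MB means 2^20 bytes; mbps_to_gbps is a hypothetical helper. The example value is the 22486.08 MB/sec average reported in this section.

```shell
#!/bin/sh
# Hypothetical helper: convert a perftest MB/sec figure to Gbit/s.
# Assumption: 1 MB here is 2^20 bytes, and 1 Gbit is 10^9 bits.
mbps_to_gbps() {
  awk -v mb="$1" 'BEGIN { printf "%.2f\n", mb * 1048576 * 8 / 1e9 }'
}

mbps_to_gbps 22486.08
```

Under that assumption the average works out to about 188.63 Gbit/s. perftest also has a --report_gbits option that prints Gbit/sec directly, which avoids the manual conversion.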
RDMA server
bash
root@d50f5e4p420c7396kbhg-oplkf:~/workspace# ib_write_bw -d mlx5_10 -i 1
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_10
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
TX depth : 128
CQ Moderation : 100
Mtu : 4096[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x37 QPN 0x0028 PSN 0x4d1086 RKey 0x181fdf VAddr 0x007f0520d03000
remote address: LID 0x37 QPN 0x0027 PSN 0xa75afa RKey 0x1826e6 VAddr 0x007f23b0909000
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
Conflicting CPU frequency values detected: 3400.000000 != 3200.000000. CPU Frequency is not max.
65536 5000 22644.86 22486.08 0.359777
---------------------------------------------------------------------------------------
RDMA client
bash
root@d50f59cp420c7396kbh0-jxgqu:~/workspace# ib_write_bw -d mlx5_10 -i 1 10.18.145.165
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_10
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
TX depth : 128
CQ Moderation : 100
Mtu : 4096[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x37 QPN 0x0028 PSN 0x4d1086 RKey 0x181fdf VAddr 0x007f0520d03000
remote address: LID 0x37 QPN 0x0027 PSN 0xa75afa RKey 0x1826e6 VAddr 0x007f23b0909000
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
Conflicting CPU frequency values detected: 3400.000000 != 3200.000000. CPU Frequency is not max.
65536 5000 22644.86 22486.08 0.359777
---------------------------------------------------------------------------------------

Latency benchmark
RDMA server
bash
root@d50f5e4p420c7396kbhg-oplkf:~/workspace# ib_write_lat -d mlx5_10 -i 1
************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
RDMA_Write Latency Test
Dual-port : OFF Device : mlx5_10
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
Mtu : 4096[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x37 QPN 0x002d PSN 0x9b6969 RKey 0x1ffbbb VAddr 0x005597697fc000
remote address: LID 0x37 QPN 0x002c PSN 0x5f4b96 RKey 0x1ffcbc VAddr 0x00559fbbb93000
---------------------------------------------------------------------------------------
#bytes #iterations t_min[usec] t_max[usec] t_typical[usec] t_avg[usec] t_stdev[usec] 99% percentile[usec] 99.9% percentile[usec]
Conflicting CPU frequency values detected: 3400.000000 != 3200.000000. CPU Frequency is not max.
Conflicting CPU frequency values detected: 3400.000000 != 3200.000000. CPU Frequency is not max.
2 1000 1.80 2.86 1.83 1.83 0.03 1.93 2.86
---------------------------------------------------------------------------------------
RDMA client
bash
root@d50f59cp420c7396kbh0-jxgqu:~/workspace# ib_write_lat -d mlx5_10 -i 1 10.18.145.165
---------------------------------------------------------------------------------------
RDMA_Write Latency Test
Dual-port : OFF Device : mlx5_10
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
TX depth : 1
Mtu : 4096[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x37 QPN 0x002c PSN 0x5f4b96 RKey 0x1ffcbc VAddr 0x00559fbbb93000
remote address: LID 0x37 QPN 0x002d PSN 0x9b6969 RKey 0x1ffbbb VAddr 0x005597697fc000
---------------------------------------------------------------------------------------
#bytes #iterations t_min[usec] t_max[usec] t_typical[usec] t_avg[usec] t_stdev[usec] 99% percentile[usec] 99.9% percentile[usec]
Conflicting CPU frequency values detected: 3400.000000 != 3200.000000. CPU Frequency is not max.
Conflicting CPU frequency values detected: 3400.000000 != 3200.000000. CPU Frequency is not max.
2 1000 1.77 2.82 1.83 1.83 0.03 1.88 2.82
---------------------------------------------------------------------------------------
