Enabling SACK on a Mellanox NIC (ConnectX-7)
The lossy-network optimizations are controlled through the ROCE_ACCL register.
View the current configuration:
mlxreg -d mlx5_2 --reg_name ROCE_ACCL --get
Field Name | Data
============================================================
roce_adp_retrans_field_select | 0x00000001
roce_tx_window_field_select | 0x00000001
roce_slow_restart_field_select | 0x00000001
roce_slow_restart_idle_field_select | 0x00000001
min_ack_timeout_limit_disabled_field_select | 0x00000001
adaptive_routing_forced_en_field_select | 0x00000001
selective_repeat_forced_en_field_select | 0x00000001
dc_half_handshake_en_field_select | 0x00000000
ack_dscp_force_field_select | 0x00000000
roce_adp_retrans_en | 0x00000001
roce_tx_window_en | 0x00000000
roce_slow_restart_en | 0x00000001
roce_slow_restart_idle_en | 0x00000000
min_ack_timeout_limit_disabled | 0x00000000
adaptive_routing_forced_en | 0x00000000
selective_repeat_forced_en | 0x00000000
dc_half_handshake_en | 0x00000000
ack_dscp_force | 0x00000000
ack_dscp | 0x00000000
============================================================
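To spot quickly which features are still off, the dump can be filtered for `*_en` fields whose value is zero. The following is a minimal sketch, not part of the original procedure: the helper name `roce_accl_disabled` is hypothetical, and it assumes the `name | value` layout shown above. It runs against a few sample lines copied from the dump rather than a live device.

```shell
# Print the names of *_en fields whose value is 0 in an mlxreg dump read from stdin.
# Assumes the "Field Name | Data" layout shown above; the helper name is hypothetical.
roce_accl_disabled() {
  awk -F'|' 'NF == 2 {
    name = $1; value = $2
    gsub(/[ \t]+/, "", name)      # strip padding around the field name
    gsub(/[ \t]+/, "", value)     # strip padding around the value
    if (name ~ /_en$/ && value == "0x00000000") print name
  }'
}

# Sample lines copied from the dump above, used here instead of a live device.
sample_dump='Field Name | Data
roce_adp_retrans_en | 0x00000001
roce_tx_window_en | 0x00000000
roce_slow_restart_en | 0x00000001
roce_slow_restart_idle_en | 0x00000000'

printf '%s\n' "$sample_dump" | roce_accl_disabled
```

On a real host the same filter could be fed from the live register, for example `mlxreg -d mlx5_2 --reg_name ROCE_ACCL --get | roce_accl_disabled`.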
Enable SACK:
mlxreg -d 1b:00.1 --reg_name ROCE_ACCL --set "roce_adp_retrans_en=0x1,roce_tx_window_en=0x1,roce_slow_restart_en=0x1,roce_slow_restart_idle_en=0x1"
mlxconfig -d 1b:00.1 set LOG_TX_PSN_WINDOW=<new value>
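When the same change has to be applied to several ports, it can help to build the `--set` argument from a list of field names and review the command before running it. This is a sketch under assumptions: `build_set_arg` is a hypothetical helper, the device address is the one from the command above, and the script only prints the command rather than executing it.

```shell
# Join field names into a "name=0x1,name=0x1,..." string for mlxreg --set.
# build_set_arg is a hypothetical helper, not part of the Mellanox tools.
build_set_arg() {
  set_arg_out=""
  for field in "$@"; do
    # Prepend a comma only when the string already has an entry.
    set_arg_out="${set_arg_out:+$set_arg_out,}${field}=0x1"
  done
  printf '%s\n' "$set_arg_out"
}

DEVICE="1b:00.1"   # PCI address taken from the command above; adjust per host
SET_ARG=$(build_set_arg roce_adp_retrans_en roce_tx_window_en \
                        roce_slow_restart_en roce_slow_restart_idle_en)

# Print the command for review instead of running it.
printf 'mlxreg -d %s --reg_name ROCE_ACCL --set "%s"\n' "$DEVICE" "$SET_ARG"
```

Printing first keeps the register write a deliberate manual step; the printed line matches the mlxreg command shown above.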
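After applying these settings, a test with ib_write_bw shows that every packet is an RDMA WRITE Only packet; WRITE Middle packets no longer appear.
However, the change still did not take effect in NCCL tests. For NCCL, the environment variable NCCL_IB_ADAPTIVE_ROUTING=1 must also be set.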
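One way to apply the NCCL variable without changing the global environment is a small wrapper that sets it only for the launched command. This is a sketch: the wrapper name `run_with_ar` is hypothetical, and `NCCL_DEBUG=INFO` is an addition not in the original note, included only so NCCL prints diagnostic output during the run.

```shell
# Run any command with NCCL adaptive routing enabled for that process only.
# run_with_ar is a hypothetical wrapper name; NCCL_DEBUG=INFO is an optional
# addition (not from the original note) that turns on NCCL diagnostic logging.
run_with_ar() {
  env NCCL_IB_ADAPTIVE_ROUTING=1 NCCL_DEBUG=INFO "$@"
}

# Demonstration with a harmless command: show the value the child process sees.
run_with_ar sh -c 'echo "NCCL_IB_ADAPTIVE_ROUTING=$NCCL_IB_ADAPTIVE_ROUTING"'
```

In practice the wrapped command would be the NCCL test or training launcher. With multi-node launchers, the variable also has to reach the remote ranks, which depends on how the launcher forwards environment variables.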