Configuring PFC for RoCE v1

Environment:

Two hosts, each with a dual-port 40 Gbps ConnectX-3 NIC (driver version 4.1-1.0.2.0, OS Ubuntu 16.04).

One 32-port Mellanox Spectrum SN2700 switch running Onyx 3.6.8102.

PFC background:

Juniper describes PFC as follows: "Priority-based flow control (PFC), IEEE standard 802.1Qbb, is a link-level flow control mechanism. The flow control mechanism is similar to that used by IEEE 802.3x Ethernet PAUSE, but it operates on individual priorities. Instead of pausing all traffic on a link, PFC allows you to selectively pause traffic according to its class."

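In other words, PFC works at a finer granularity than IEEE 802.3x: it pauses a single priority instead of the whole link. Configuring PFC therefore comes down to mapping application traffic onto a chosen priority. Depending on where that priority is carried in the packet, there are two trust modes: Trust L2 (the VLAN PCP field) and Trust L3 (the DSCP field in the IP header). ConnectX-3 supports only RoCE v1, whose packets carry no IP header, so this article covers only Trust L2.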

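On the end host, the mapping chain is: ToS -> skb_priority -> VLAN QoS (also called User Priority, or UP, which is the PCP value in the VLAN tag) -> TC.

On the switch, the mapping chain is: PCP + DEI -> switch-priority -> ingress Port Group (PG). The PG holds the PFC threshold configuration.

This article uses TC 4 and switch-priority 4 as the example.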
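To make the host-side chain concrete, the bash sketch below follows a ToS value through it, using the values this article configures (every skb_priority mapped to UP 4, identity UP-to-TC map). The ToS table mirrors the Linux kernel's ip_tos2prio table, indexed by (tos & 0x1E) >> 1. The array and function names are hypothetical, and the sketch only models the mapping; it does not touch any device.

```bash
#!/usr/bin/env bash
# Sketch: follow one ToS value through the host-side chain ToS -> skb_priority -> UP -> TC.

# ToS -> skb_priority, mirroring the kernel's ip_tos2prio table;
# the table index is (tos & 0x1E) >> 1.
TOS2PRIO=(0 0 0 0 2 2 2 2 6 6 6 6 4 4 4 4)
# skb_priority -> UP: this article maps every skb_priority 0..7 to UP 4.
EGRESS_MAP=(4 4 4 4 4 4 4 4)
# UP -> TC: identity map, as set with mlnx_qos -p 0,1,2,3,4,5,6,7.
UP2TC=(0 1 2 3 4 5 6 7)

trace_tos() {
    local tos=$1 prio up tc
    prio=${TOS2PRIO[$(( (tos & 0x1E) >> 1 ))]}
    up=${EGRESS_MAP[$prio]}
    tc=${UP2TC[$up]}
    echo "tos=$tos -> skb_priority=$prio -> UP=$up -> TC=$tc"
}

trace_tos 0    # tos=0 -> skb_priority=0 -> UP=4 -> TC=4
trace_tos 16   # tos=16 -> skb_priority=6 -> UP=4 -> TC=4
```

Because every skb_priority is mapped to UP 4, the ToS value no longer affects TC selection in this setup. It would only matter if different skb_priorities were mapped to different UPs.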

Configuration steps:

First, configure the switch.

0. Enter configuration mode:

switch-6bd534 [standalone: master] > enable
switch-6bd534 [standalone: master] # configure terminal

1. Create a VLAN and set the switch ports to trunk mode:

switch-6bd534 [standalone: master] (config) # vlan 10
switch-6bd534 [standalone: master] (config vlan 10) # exit
switch-6bd534 [standalone: master] (config) # interface ethernet 1/1-1/32 switchport mode trunk

2. Disable link-level (IEEE 802.3x) flow control on all ports:

switch-6bd534 [standalone: master] (config) # interface ethernet 1/1-1/32 flowcontrol send off force
switch-6bd534 [standalone: master] (config) # interface ethernet 1/1-1/32 flowcontrol receive off force

3. Enable PFC globally, enable it on priority 4, and turn PFC on for all ports:

switch-6bd534 [standalone: master] (config) # dcb priority-flow-control enable
This action might cause traffic loss while shutting down a port with priority-flow-control mode on
Type 'yes' to confirm enable pfc globally: yes
switch-6bd534 [standalone: master] (config) # dcb priority-flow-control priority 4 enable
switch-6bd534 [standalone: master] (config) # interface ethernet 1/1-1/32 dcb priority-flow-control mode on force

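4. Configure the port buffers and map switch-priority to a PG buffer: PG0 becomes a lossy buffer, PG4 a lossless buffer with xoff/xon thresholds, TC4 gets an egress buffer, and switch-priority 4 is bound to PG4: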

switch-6bd534 [standalone: master] (config) # interface ethernet 1/1-1/32 ingress-buffer iPort.pg0 map pool iPool0 type lossy reserved 20K shared alpha 8
switch-6bd534 [standalone: master] (config) # interface ethernet 1/1-1/32 ingress-buffer iPort.pg4 map pool iPool0 type lossless reserved 70K xoff 17K xon 17K shared alpha 2
switch-6bd534 [standalone: master] (config) # interface ethernet 1/1-1/32 egress-buffer  ePort.tc4 map pool ePool0 reserved 1500 shared alpha inf
switch-6bd534 [standalone: master] (config) # interface ethernet 1/1-1/32 ingress-buffer iPort.pg4 bind switch-priority 4

5. Set the trust mode to L2 and map PCP + DEI to switch-priority (PCP 4 with DEI 0 goes to switch-priority 4):

switch-6bd534 [standalone: master] (config) # qos trust L2
switch-6bd534 [standalone: master] (config) # interface ethernet 1/1-1/32 qos map pcp 4 dei 0 to switch-priority 4

The switch side is now configured.
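Every switch command in steps 2 to 5 is parameterised by the same priority number, so they can be generated from a template. The bash sketch below is a hypothetical helper (gen_onyx_pfc is not a Mellanox tool): it just prints those commands for a given priority and port range, follows this article's convention that PG N and egress TC N carry switch-priority N, and copies the buffer sizes from the example rather than computing them. When pasted into the CLI, `dcb priority-flow-control enable` still asks for a "yes" confirmation.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the Onyx commands from steps 2-5 for one lossless priority.
# Convention from this article: PG N and egress TC N carry switch-priority N (assumes N != 0,
# since PG0 is kept as the lossy default buffer).
# Buffer sizes are the example values above, not values tuned for a particular setup.
gen_onyx_pfc() {
    local prio=$1 ports=${2:-1/1-1/32}
    cat <<EOF
interface ethernet ${ports} flowcontrol send off force
interface ethernet ${ports} flowcontrol receive off force
dcb priority-flow-control enable
dcb priority-flow-control priority ${prio} enable
interface ethernet ${ports} dcb priority-flow-control mode on force
interface ethernet ${ports} ingress-buffer iPort.pg0 map pool iPool0 type lossy reserved 20K shared alpha 8
interface ethernet ${ports} ingress-buffer iPort.pg${prio} map pool iPool0 type lossless reserved 70K xoff 17K xon 17K shared alpha 2
interface ethernet ${ports} egress-buffer ePort.tc${prio} map pool ePool0 reserved 1500 shared alpha inf
interface ethernet ${ports} ingress-buffer iPort.pg${prio} bind switch-priority ${prio}
qos trust L2
interface ethernet ${ports} qos map pcp ${prio} dei 0 to switch-priority ${prio}
EOF
}

# Print the commands used in this article (priority 4 on ports 1/1-1/32).
gen_onyx_pfc 4
```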

Next, configure the end hosts (run these steps on both hosts, giving each its own IP address).

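1. Set the pfctx and pfcrx module parameters:

# vim /etc/modprobe.d/mlx4.conf

Add:

options mlx4_en pfctx=0x10 pfcrx=0x10

Note that pfctx and pfcrx are both 8-bit bitmaps in which bit N enables PFC on priority N. Enabling priority 4 alone therefore takes 1 << 4 = 0x10 (decimal 16); a value of 0x16 (binary 00010110) would instead enable priorities 1, 2 and 4.

Then restart the driver:

# /etc/init.d/openibd restart

Verify:

# RX=`cat /sys/module/mlx4_en/parameters/pfcrx`;printf "0x%x\n" $RX

The output should be 0x10.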
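The bitmap arithmetic is easy to get wrong, so here is a small bash sketch that builds a mask from a list of priorities and decodes a mask back into priorities. pfc_bitmap and pfc_priorities are hypothetical helper names; bit N of the mask stands for priority N, as described above.

```bash
#!/usr/bin/env bash
# Build a PFC priority bitmap: bit N set = PFC enabled on priority N.
pfc_bitmap() {
    local mask=0 p
    for p in "$@"; do
        mask=$(( mask | (1 << p) ))
    done
    printf '0x%02x\n' "$mask"
}

# Decode a bitmap (hex like 0x16 or decimal like 16) into the enabled priorities.
pfc_priorities() {
    local mask=$(( $1 )) p out=()
    for p in {0..7}; do
        (( (mask >> p) & 1 )) && out+=("$p")
    done
    echo "${out[*]}"
}

pfc_bitmap 4          # 0x10
pfc_priorities 0x16   # 1 2 4
```

Since the sysfs parameter is printed in decimal, `pfc_priorities "$(cat /sys/module/mlx4_en/parameters/pfcrx)"` should print just `4` once the driver has been reloaded with 0x10.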

2. Create the VLAN interface and assign an IP address:

# modprobe 8021q
# vconfig add eth2 10
Added VLAN with VID == 10 to IF -:eth2:-
# ifconfig eth2.10 10.10.10.5/24 up

3. For TCP/IP traffic, map skb_priority to UP, sending every skb_priority (0-7) to UP 4:

# for i in {0..7}; do vconfig set_egress_map eth2.10 $i 4 ; done
Set egress mapping on device -:eth2.10:- Should be visible in /proc/net/vlan/eth2.10
Set egress mapping on device -:eth2.10:- Should be visible in /proc/net/vlan/eth2.10
Set egress mapping on device -:eth2.10:- Should be visible in /proc/net/vlan/eth2.10
Set egress mapping on device -:eth2.10:- Should be visible in /proc/net/vlan/eth2.10
Set egress mapping on device -:eth2.10:- Should be visible in /proc/net/vlan/eth2.10
Set egress mapping on device -:eth2.10:- Should be visible in /proc/net/vlan/eth2.10
Set egress mapping on device -:eth2.10:- Should be visible in /proc/net/vlan/eth2.10
Set egress mapping on device -:eth2.10:- Should be visible in /proc/net/vlan/eth2.10
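Instead of checking /proc/net/vlan/eth2.10 by eye, the EGRESS line can be parsed. The bash sketch below assumes that line has the form ` EGRESS priority mappings: 0:4 1:4 ...` (skb_priority:UP pairs), which is how the 8021q module prints it; check_egress_map is a hypothetical helper that reads the file on stdin.

```bash
#!/usr/bin/env bash
# Hypothetical helper: verify that every skb_priority on the EGRESS line maps to the expected UP.
# Assumed input format: " EGRESS priority mappings: 0:4 1:4 2:4 ..."
check_egress_map() {
    local expected_up=$1 line entry
    line=$(grep 'EGRESS priority mappings:') || { echo "no EGRESS line found"; return 1; }
    # Strip the label, then walk the "skb_priority:UP" pairs.
    for entry in ${line#*mappings:}; do
        if [ "${entry#*:}" != "$expected_up" ]; then
            echo "mismatch: $entry"
            return 1
        fi
    done
    echo "all egress mappings -> UP $expected_up"
}

# Usage on the host (not executed here):
#   check_egress_map 4 < /proc/net/vlan/eth2.10
```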

4. For traffic that bypasses the kernel, i.e. RDMA traffic, map skb_priority to UP as well, again sending every skb_priority to UP 4:

# tc_wrap.py -i eth2 -u 4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4
skprio2up is available only for RoCE in kernels that don't support set_egress_map
Traffic classes are set to 8
UP  0
UP  1
UP  2
UP  3
UP  4
        skprio: 0
        skprio: 1
        skprio: 2 (tos: 8)
        skprio: 3
        skprio: 4 (tos: 24)
        skprio: 5
        skprio: 6 (tos: 16)
        skprio: 7
        skprio: 8
        skprio: 9
        skprio: 10
        skprio: 11
        skprio: 12
        skprio: 13
        skprio: 14
        skprio: 15
        skprio: 0 (vlan 10)
        skprio: 1 (vlan 10)
        skprio: 2 (vlan 10 tos: 8)
        skprio: 3 (vlan 10)
        skprio: 4 (vlan 10 tos: 24)
        skprio: 5 (vlan 10)
        skprio: 6 (vlan 10 tos: 16)
        skprio: 7 (vlan 10)
UP  5
UP  6
UP  7
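The `-u` argument above lists one UP per skb_priority, 16 entries for skb_priority 0 to 15 as the output shows. Typing sixteen identical values is error-prone, so this bash sketch builds the list; up_list is a hypothetical helper, and the entry count of 16 is taken from the command above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: build the comma-separated UP list for tc_wrap.py -u,
# one entry per skb_priority (default 16 entries, skb_priority 0..15).
up_list() {
    local up=$1 n=${2:-16} i out=$1
    for (( i = 1; i < n; i++ )); do
        out+=",$up"
    done
    echo "$out"
}

# Usage on the host (not executed here):
#   tc_wrap.py -i eth2 -u "$(up_list 4)"
up_list 4
```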

5. Map UP to TC: UP 4 goes to TC 4, and every other UP goes to the TC with the same number:

# mlnx_qos -i eth2 -p 0,1,2,3,4,5,6,7
Priority trust mode is not supported on your system
Priority trust mode: none
PFC configuration:
        priority    0   1   2   3   4   5   6   7
        enabled     0   0   0   0   1   0   0   0

tc: 0 ratelimit: unlimited, tsa: vendor
         priority:  0
tc: 1 ratelimit: unlimited, tsa: vendor
         priority:  1
tc: 2 ratelimit: unlimited, tsa: vendor
         priority:  2
tc: 3 ratelimit: unlimited, tsa: vendor
         priority:  3
tc: 4 ratelimit: unlimited, tsa: vendor
         priority:  4
tc: 5 ratelimit: unlimited, tsa: vendor
         priority:  5
tc: 6 ratelimit: unlimited, tsa: vendor
         priority:  6
tc: 7 ratelimit: unlimited, tsa: vendor
         priority:  7
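To check the PFC row of this output in a script, the bash sketch below prints the priorities whose "enabled" flag is 1. It assumes the `enabled` row lists priorities 0 to 7 in order, as in the output above; pfc_enabled_prios is a hypothetical helper. For this setup it should print only `4`, matching the priority enabled on the switch.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list the priorities whose PFC "enabled" flag is 1
# in `mlnx_qos -i <dev>` output read from stdin (columns are priorities 0..7).
pfc_enabled_prios() {
    awk '$1 == "enabled" {
        for (i = 2; i <= NF; i++)
            if ($i == 1) printf "%s%d", (n++ ? " " : ""), i - 2
        print ""
    }'
}

# Usage on the host (not executed here):
#   mlnx_qos -i eth2 | pfc_enabled_prios
```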

Configuration is now complete.

Finally, save the switch configuration so that it survives a reboot:

switch-6bd534 [standalone: master] (config) # write memory

Verification

Test with ib_write_bw, using one host as the sender and the other as the receiver.

receiver:

$ ib_write_bw -R --report_gbits --port=12500 -D 10

sender:

$ ib_write_bw -R --report_gbits 10.10.10.6 --port=12500 -D 10

Then check on the switch whether PG4 received the traffic (here on port 1/5):

switch-6bd534 [standalone: master] (config) # show interfaces ethernet 1/5 counters pg 4

PG 4:
  44321827              packets
  48853700404           bytes
  0                     queue depth
  0                     no buffer discard
  0                     shared buffer discard

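Alternatively, check the PFC counters. PFC pause frames are only generated under congestion, when the PG4 buffer crosses its xoff threshold, so these counters can legitimately stay at zero: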

switch-6bd534 [standalone: master] (config) # show interfaces ethernet 1/5 counters pfc prio 4

PFC 4:
  Rx:
    0                     pause packets
    0                     pause duration

  Tx:
    18                    pause packets
    4                     pause duration

On the end host, check the priority 4 counters:

$ ethtool -S eth2 | grep prio_4
     rx_pause_prio_4: 88
     rx_pause_duration_prio_4: 0
     rx_pause_transition_prio_4: 0
     tx_pause_prio_4: 0
     tx_pause_duration_prio_4: 11
     tx_pause_transition_prio_4: 44
     rx_prio_4_packets: 9155756
     rx_prio_4_bytes: 752828084
     tx_prio_4_packets: 862787989
     tx_prio_4_bytes: 950840867498
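For repeated runs, the relevant counters can be pulled out of `ethtool -S` with a short bash and awk sketch. pfc_summary is a hypothetical helper; it only matches counter names containing `prio_N`, as in the output above. A non-zero rx_pause value means the host received PFC pause frames for that priority.

```bash
#!/usr/bin/env bash
# Hypothetical helper: pick the per-priority pause and packet counters
# out of `ethtool -S <dev>` output read from stdin.
pfc_summary() {
    awk -v p="prio_$1" '
        $1 ~ p { k = $1; sub(/:$/, "", k); c[k] = $2 }
        END {
            printf "rx_pause=%s tx_pause=%s rx_packets=%s tx_packets=%s\n",
                c["rx_pause_" p], c["tx_pause_" p], c["rx_" p "_packets"], c["tx_" p "_packets"]
        }'
}

# Usage on the host (not executed here):
#   ethtool -S eth2 | pfc_summary 4
```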

 



Reposted from blog.csdn.net/u013431916/article/details/82385641