Using LXD to do a VLAN test (by quqi99)

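Author: Zhang Hua    Published: 2022-08-15
Copyright notice: this article may be freely reproduced, but any reproduction must indicate the original source, the author information and this copyright notice in the form of a hyperlink.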

Problem

The customer reports that ARP replies are not received in an SR-IOV VM. Inside that VM, two SR-IOV NICs are combined into pkt0 (a bond?), made up of an active NIC (pkt0_p) and a standby NIC (pkt0_s):

<no-ip>/fa:16:3e:d8:3f:b9(pkt0)
<no-ip>/fa:16:3e:d8:3f:b9(pkt0_p)
<no-ip>/fa:16:3e:70:be:ba(pkt0_s)
151.2.143.1/151.2.143.2/fa:16:3e:d8:3f:b9(pkt0.610@pkt0)
10.139.99.1/10.139.99.2/fa:16:3e:d8:3f:b9(pkt0.510@pkt0)
10.139.160.10/10.139.160.11/10.139.160.12/fa:16:3e:d8:3f:b9(pkt0.700@pkt0)

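According to the customer, an ICMP heartbeat check on the active NIC works, but an ARP heartbeat check to the GW on the standby NIC gets no ARP reply (yet the data below seems to show that the reply was received?).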

1. ARP for the active port (fa:16:3e:d8:3f:b9)

$ tshark -r ./EXT_TMP-700.pcap-1.act.pcap eth.src==fa:16:3e:d8:3f:b9 and arp |tail -n1
357602 8141.824956 fa:16:3e:d8:3f:b9 → IETF-VRRP-VRID_64 ARP 60 Who has 10.139.160.254? Tell 10.139.160.10
$ tshark -r ./EXT_TMP-700.pcap-1.act.pcap eth.dst==fa:16:3e:d8:3f:b9 and arp |tail -n1
357603 8141.825416 IETF-VRRP-VRID_64 → fa:16:3e:d8:3f:b9 ARP 60 10.139.160.254 is at 00:00:5e:00:01:64

2. ICMP for the active port (fa:16:3e:d8:3f:b9)

$ tshark -r ./EXT_TMP-700.pcap-1.act.pcap eth.dst==fa:16:3e:d8:3f:b9 and icmp |tail -n1
358835 8169.867056 10.139.160.254 → 10.139.160.10 ICMP 102 Echo (ping) reply    id=0x000a, seq=15233/33083, ttl=64 (request in 358834)
$ tshark -r ./EXT_TMP-700.pcap-1.act.pcap eth.src==fa:16:3e:d8:3f:b9 and icmp |tail -n1
358834 8169.863263 10.139.160.10 → 10.139.160.254 ICMP 102 Echo (ping) request  id=0x000a, seq=15233/33083, ttl=64

3. ARP for the standby port (fa:16:3e:70:be:ba)

$ tshark -r ./EXT_TMP-700.pcap-1.act.pcap eth.src==fa:16:3e:70:be:ba and arp |tail -n1
358848 8170.244743 fa:16:3e:70:be:ba → Broadcast    ARP 60 Who has 10.139.160.254? (ARP Probe)
$ tshark -r ./EXT_TMP-700.pcap-1.act.pcap eth.dst==fa:16:3e:70:be:ba and arp |tail -n1
358849 8170.245117 IETF-VRRP-VRID_64 → fa:16:3e:70:be:ba ARP 60 10.139.160.254 is at 00:00:5e:00:01:64

4. ICMP for the standby port (fa:16:3e:70:be:ba)

$ tshark -r ./EXT_TMP-700.pcap-1.act.pcap eth.src==fa:16:3e:70:be:ba and icmp |tail -n1
<empty>
$ tshark -r ./EXT_TMP-700.pcap-1.act.pcap eth.dst==fa:16:3e:70:be:ba and icmp |tail -n1
<empty>
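
The four checks above differ only in the direction (eth.src or eth.dst), the MAC and the protocol, so they can be run in one loop. A minimal sketch, assuming tshark is installed and the capture file name used above; `heartbeat_filter` and `run_checks` are hypothetical helper names, and `-Y` is tshark's display-filter option:

```shell
#!/usr/bin/env bash
# Sketch: rerun the four tshark checks above in one loop.
# heartbeat_filter and run_checks are hypothetical helpers.
PCAP=${PCAP:-./EXT_TMP-700.pcap-1.act.pcap}
ACTIVE_MAC=fa:16:3e:d8:3f:b9
STANDBY_MAC=fa:16:3e:70:be:ba

# Build a display filter such as "eth.src==<mac> and arp".
heartbeat_filter() {
  local dir=$1 mac=$2 proto=$3
  printf 'eth.%s==%s and %s\n' "$dir" "$mac" "$proto"
}

run_checks() {
  local mac proto dir filter
  for mac in "$ACTIVE_MAC" "$STANDBY_MAC"; do
    for proto in arp icmp; do
      for dir in src dst; do
        filter=$(heartbeat_filter "$dir" "$mac" "$proto")
        echo "== $filter"
        # Show only the last matching packet, like the commands above.
        tshark -r "$PCAP" -Y "$filter" | tail -n1
      done
    done
  done
}

# Only run when tshark and the capture file are both present.
if command -v tshark >/dev/null 2>&1 && [ -f "$PCAP" ]; then
  run_checks
fi
```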

The following analysis has already been done:

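  • Confirmed from the SR-IOV OVN configuration below that br-data, which is used for the external network, does not use an SR-IOV NIC. If it did, and that SR-IOV NIC were used through macvtap rather than passthrough, there could be a hairpin-mode problem: a VM on a host could not reach the network on its own chassis, but could reach the network on other chassis.
$ juju config ovn-chassis-sriov-hugepages ovn-bridge-mappings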
dcfabric:br-data sriovfabric1:br-data sriovfabric2:br-data
$ juju config ovn-chassis-sriov-hugepages bridge-interface-mappings
br-data:bond1
$ juju config ovn-chassis-sriov-hugepages sriov-device-mappings
sriovfabric1:ens3f0 sriovfabric1:ens6f0 sriovfabric2:ens3f1 sriovfabric2:ens6f1
$ juju config ovn-chassis-sriov-hugepages sriov-numvfs
ens3f0:32 ens3f1:32 ens6f0:32 ens6f1:32
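  • Ruled out LP bug 1875852: the customer does not use VLAN as the tenant network type.
  • Seeing only the ARP request with tcpdump on the PF is expected. An ARP request is broadcast, so it shows up on the PF, whereas an ARP reply is unicast; if the PF is not in promiscuous mode (some Intel SR-IOV NICs have a hardware bug and do not support promiscuous mode), tcpdump on the PF will not show the ARP reply. Also, tcpdump cannot be used on a VF.
  • DHCP is disabled. Generally, with SR-IOV on OVN, DHCP should be enabled on the SR-IOV subnet, but having it disabled here should also be fine, since the customer assigns IPs statically.
  • The IP the customer assigns statically (via Heat) differs from the IP allocated by Nova, which should not matter either: SR-IOV bypasses the host, so the host's SGs (mainly the IP/MAC anti-spoofing software rules) do not affect it.
  • Since the actual IP differs from the one Nova allocated, OpenStack-level SGs do not affect it, but what about the SR-IOV hardware-level equivalent? Confirmed that spoof checking is also off:
$ grep -E 'fa:16:3e:f8:42:fe|fa:16:3e:70:be:ba|fa:16:3e:8f:56:5a|fa:16:3e:d8:3f:b9' sos_commands/networking/ip_-s_-d_link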
vf 30 MAC fa:16:3e:70:be:ba, spoof checking off, link-state auto, trust on
vf 31 MAC fa:16:3e:f8:42:fe, spoof checking off, link-state auto, trust on
vf 29 MAC fa:16:3e:8f:56:5a, spoof checking off, link-state auto, trust on
vf 30 MAC fa:16:3e:d8:3f:b9, spoof checking off, link-state auto, trust on
  • MAC filtering is ruled out (see spoof checking above), but what about VLAN filtering? The tcpdump data shows that the customer seems to have defined a VLAN inside the VM (pkt0.700@pkt0).
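
To scan a larger `ip -s -d link` dump for the same per-VF flags, a small awk filter can print one line per VF. This is a sketch that assumes the two line formats seen in this article (`vf N MAC xx, ...` and `vf N link/ether xx brd ...`); `vf_summary` is a hypothetical name:

```shell
#!/usr/bin/env bash
# Sketch: one summary line per VF (MAC, spoof checking, trust) from
# `ip -s -d link` text on stdin. vf_summary is a hypothetical helper.
vf_summary() {
  awk '
    /^[[:space:]]*vf [0-9]+/ {
      vf = $2; mac = ""; spoof = ""; trust = ""
      for (i = 1; i <= NF; i++) {
        # Older output uses "MAC", newer output uses "link/ether".
        if ($i == "MAC" || $i == "link/ether") mac = $(i + 1)
        if ($i == "spoof" && $(i + 1) == "checking") spoof = $(i + 2)
        if ($i == "trust") trust = $(i + 1)
      }
      # Drop the trailing commas that separate the fields.
      gsub(",", "", mac); gsub(",", "", spoof); gsub(",", "", trust)
      printf "vf=%s mac=%s spoof=%s trust=%s\n", vf, mac, spoof, trust
    }'
}
```

For example: `vf_summary < sos_commands/networking/ip_-s_-d_link`.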

The tests in this article mainly reproduce this VLAN setup; of course, no SR-IOV hardware is involved here.

Setting up the VLAN lab environment

lxc remote add faster https://mirrors.tuna.tsinghua.edu.cn/lxc-images/ --protocol=simplestreams --public
lxc image list faster:
lxc remote list
#Failed creating instance record: Failed detecting root disk device: No root device could be found
#lxc profile device add default root disk path=/ pool=default
#lxc profile show default
#lxc launch ubuntu:focal master -p juju-default --config=user.network-config="$(cat network.yml)"
lxc launch faster:ubuntu/jammy test1
lxc launch faster:ubuntu/jammy test2

#add two NICs from NET1 for two containers
lxc network create NET1 ipv6.address=none ipv4.address=10.139.160.1/24
lxc network attach NET1 test1 eth1
lxc network attach NET1 test1 eth2
lxc network attach NET1 test2 eth1
lxc network attach NET1 test2 eth2

#https://developers.redhat.com/blog/2018/10/22/introduction-to-linux-interfaces-for-virtual-networking#vlan
#ip link add ptk0 type bond miimon 100 mode active-backup
#ip link set eth2 master ptk0
#ip link set eth1 master ptk0
lxc exec test1 -- /bin/bash
cat << EOF |tee /etc/netplan/11-test.yaml
network:
  version: 2
  renderer: networkd
  ethernets:
    eth1:
      addresses: []
      dhcp4: false
      dhcp6: false
      macaddress: 00:16:3e:15:bd:58
    eth2:
      addresses: []
      dhcp4: false
      dhcp6: false
      macaddress: 00:16:3e:68:72:0f
  bonds:
    ptk0:
      addresses: []
      dhcp4: false
      dhcp6: false
      interfaces:
        - eth1
        - eth2
      parameters:
        mode: active-backup
        primary: eth1
  vlans:
    ptk0.700:
      id: 700
      link: ptk0
      dhcp4: no
      addresses: [ 10.139.160.10/24 ]
      nameservers:
        search: [ domain.local ]
        addresses: [ 8.8.8.8 ]
EOF
netplan apply

lxc exec test2 -- /bin/bash
cat << EOF |tee /etc/netplan/11-test.yaml
network:
  version: 2
  renderer: networkd
  ethernets:
    eth1:
      addresses: []
      dhcp4: false
      dhcp6: false
      macaddress: 00:16:3e:1e:19:25
    eth2:
      addresses: []
      dhcp4: false
      dhcp6: false
      macaddress: 00:16:3e:f7:9e:22
  bonds:
    ptk0:
      addresses: []
      dhcp4: false
      dhcp6: false
      interfaces:
        - eth1
        - eth2
      parameters:
        mode: active-backup
        primary: eth1
  vlans:
    ptk0.700:
      id: 700
      link: ptk0
      dhcp4: no
      addresses: [ 10.139.160.11/24 ]
      nameservers:
        search: [ domain.local ]
        addresses: [ 8.8.8.8 ]
EOF
netplan apply
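
The two netplan files above differ only in the slave MACs and the VLAN address, so one generator function could produce both. A sketch; `gen_bond_vlan_netplan` is a hypothetical helper, and the optional fourth argument (VLAN ID, default 700) is an addition for illustration:

```shell
#!/usr/bin/env bash
# Sketch: print the bond + VLAN netplan config used above.
# Usage: gen_bond_vlan_netplan <eth1-mac> <eth2-mac> <cidr> [vlan-id]
gen_bond_vlan_netplan() {
  local mac1=$1 mac2=$2 cidr=$3 vid=${4:-700}
  cat << EOF
network:
  version: 2
  renderer: networkd
  ethernets:
    eth1:
      addresses: []
      dhcp4: false
      dhcp6: false
      macaddress: ${mac1}
    eth2:
      addresses: []
      dhcp4: false
      dhcp6: false
      macaddress: ${mac2}
  bonds:
    ptk0:
      addresses: []
      dhcp4: false
      dhcp6: false
      interfaces:
        - eth1
        - eth2
      parameters:
        mode: active-backup
        primary: eth1
  vlans:
    ptk0.${vid}:
      id: ${vid}
      link: ptk0
      dhcp4: no
      addresses: [ ${cidr} ]
      nameservers:
        search: [ domain.local ]
        addresses: [ 8.8.8.8 ]
EOF
}
```

Something like `gen_bond_vlan_netplan 00:16:3e:15:bd:58 00:16:3e:68:72:0f 10.139.160.10/24 | lxc exec test1 -- tee /etc/netplan/11-test.yaml` followed by `lxc exec test1 -- netplan apply` should then replace the interactive steps above (not tested here).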

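The above creates two LXD containers; in each one an active/standby bond (ptk0) is created, with a VLAN (ptk0.700) on top of it. For this network to work, trunking must also be set up on the host; once that is done, the VLAN network works.
Note: the macaddress key is used above to give each of the two NICs its own MAC. Without it, all the NICs end up with the same MAC after the bond and VLAN are created.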

$ sudo brctl show |grep NET1 -A3
NET1		8000.00163eeb79c4	no		veth2af34c1d
							veth3a5b458e
							veth82c292b2
							veth9b8e8cb6
#sudo bridge vlan add vid 2-4094 dev NET1 self
sudo bridge vlan add vid 700 dev NET1 self
sudo bridge vlan add vid 700 dev veth2af34c1d
sudo bridge vlan add vid 700 dev veth3a5b458e
sudo bridge vlan add vid 700 dev veth82c292b2
sudo bridge vlan add vid 700 dev veth9b8e8cb6
sudo bridge vlan show
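
Instead of copying the veth names from the brctl output by hand, the ports of NET1 can be read from `/sys/class/net/NET1/brif/`, which holds one entry per bridge port. A sketch; `trunk_vlan`, `SYSFS` and `DRY_RUN` are hypothetical names (DRY_RUN=1 only prints the commands):

```shell
#!/usr/bin/env bash
# Sketch: add a VLAN to the bridge itself and to every port of the bridge.
# SYSFS can be overridden for testing; DRY_RUN=1 prints instead of running.
SYSFS=${SYSFS:-/sys/class/net}

trunk_vlan() {
  local br=$1 vid=$2 run="" port
  [ "${DRY_RUN:-0}" = 1 ] && run=echo
  $run sudo bridge vlan add vid "$vid" dev "$br" self
  for port in "$SYSFS/$br/brif"/*; do
    [ -e "$port" ] || continue   # the bridge has no ports
    $run sudo bridge vlan add vid "$vid" dev "$(basename "$port")"
  done
}

# Example: DRY_RUN=1 trunk_vlan NET1 700
```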

At this point, test1 can ping test2 over VLAN 700:

root@test1:~# ping 10.139.160.11 -c1
PING 10.139.160.11 (10.139.160.11) 56(84) bytes of data.
64 bytes from 10.139.160.11: icmp_seq=1 ttl=64 time=0.133 ms
root@test2:~# tcpdump -i eth1 -nn -e -l
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eth1, link-type EN10MB (Ethernet), snapshot length 262144 bytes
05:54:36.128602 00:16:3e:15:bd:58 > 00:16:3e:1e:19:25, ethertype 802.1Q (0x8100), length 102: vlan 700, p 0, ethertype IPv4 (0x0800), 10.139.160.10 > 10.139.160.11: ICMP echo request, id 37135, seq 1, length 64
05:54:36.128643 00:16:3e:1e:19:25 > 00:16:3e:15:bd:58, ethertype 802.1Q (0x8100), length 102: vlan 700, p 0, ethertype IPv4 (0x0800), 10.139.160.11 > 10.139.160.10: ICMP echo reply, id 37135, seq 1, length 64

But pinging the GW still fails:

root@test1:~# ping 10.139.160.1 -c1
PING 10.139.160.1 (10.139.160.1) 56(84) bytes of data.
From 10.139.160.10 icmp_seq=1 Destination Host Unreachable
$ sudo tcpdump -i NET1 -nn -e -l
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on NET1, link-type EN10MB (Ethernet), snapshot length 262144 bytes
14:25:24.761131 00:16:3e:15:bd:58 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 700, p 0, ethertype ARP (0x0806), Request who-has 10.139.160.1 tell 10.139.160.10, length 28

Neither creating an eth0.700 nor creating a tap0 with vlan=700 makes the ping work:

#use eth0.700
sudo ip link add link eth0 name eth0.700 type vlan id 700
sudo brctl addif NET1 eth0.700
sudo ifconfig eth0.700 up
sudo ip addr add 10.139.160.254/24 dev eth0.700
sudo bridge vlan add vid 700 dev eth0.700

#use a tap
sudo ip tuntap add mode tap tap0
sudo ip link set tap0 master NET1
sudo bridge vlan add dev tap0 vid 700 pvid untagged master
sudo ip addr add 10.139.160.254/24 dev tap0
sudo bridge vlan show

Test 1

So let's treat test2 as the GW, ping it from test1 and capture packets.
First, using ICMP from the active port only:

root@test1:~# ping -I eth1 10.139.160.1 -c1
ping: Warning: source address might be selected on device other than: eth1
PING 10.139.160.1 (10.139.160.1) from 192.168.121.88 eth1: 56(84) bytes of data.
^C
--- 10.139.160.1 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms

$ sudo tcpdump -i NET1 -nn -e -l
14:32:04.483156 00:16:3e:15:bd:58 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 10.139.160.1 tell 192.168.121.88, length 28
14:32:04.483185 00:16:3e:eb:79:c4 > 00:16:3e:15:bd:58, ethertype ARP (0x0806), length 42: Reply 10.139.160.1 is-at 00:16:3e:eb:79:c4, length 28

Running 'ping -I eth1 10.139.160.11 -c1' and 'ping -I eth2 10.139.160.11 -c1' both produce no output.

Test 2

arping needs a source IP to send an ARP request, but the standby port has no IP, so one is given with '-S'.

root@test1:~# arping -I ptk0.700 10.139.160.11 -S 10.139.160.2 -C1
ARPING 10.139.160.11
42 bytes from 00:16:3e:1e:19:25 (10.139.160.11): index=0 time=8.119 usec
root@test2:~# sudo tcpdump -i ptk0.700 -nn -e -l
09:08:16.814374 00:16:3e:15:bd:58 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 58: Request who-has 10.139.160.11 tell 10.139.160.2, length 44
09:08:16.814410 00:16:3e:1e:19:25 > 00:16:3e:15:bd:58, ethertype ARP (0x0806), length 42: Reply 10.139.160.11 is-at 00:16:3e:1e:19:25, length 28

Running 'arping -I eth1 10.139.160.11 -S 10.139.160.2 -C1' and 'arping -I eth2 10.139.160.11 -S 10.139.160.2 -C1' both produce no reply:

root@test1:~# arping -I eth2 10.139.160.11 -S 10.139.160.2 -C1
ARPING 10.139.160.11
Timeout

Is that because eth1 and eth2 themselves are not on VLAN 700?

Some Outputs

root@test1:~# cat /proc/net/bonding/ptk0 
Ethernet Channel Bonding Driver: v5.15.0-43-generic

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: eth1 (primary_reselect always)
Currently Active Slave: eth1
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0
Peer Notification Delay (ms): 0

Slave Interface: eth1
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 1
Permanent HW addr: 00:16:3e:15:bd:58
Slave queue ID: 0

Slave Interface: eth2
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 1
Permanent HW addr: 00:16:3e:68:72:0f
Slave queue ID: 0
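
When checking which slave is active during a failover test, the relevant fields can be pulled out of this file with awk. A sketch that assumes the `Key: value` layout shown above; `bond_status` is a hypothetical helper reading stdin:

```shell
#!/usr/bin/env bash
# Sketch: print the bonding mode, the active slave and each slave's MII
# status from /proc/net/bonding/<bond> text on stdin.
bond_status() {
  awk -F': ' '
    $1 == "Bonding Mode"           { print "mode=" $2 }
    $1 == "Currently Active Slave" { print "active=" $2 }
    $1 == "Slave Interface"        { slave = $2 }
    # The first "MII Status" belongs to the bond itself and is skipped.
    $1 == "MII Status" && slave != "" { print "slave=" slave " mii=" $2; slave = "" }
  '
}
```

For example, inside test1: `bond_status < /proc/net/bonding/ptk0`.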

Another pure-CLI method

Instead of configuring the network with netplan as above, this creates the bond directly with plain CLI commands, and does not use the vlan-filtering approach - https://developers.redhat.com/blog/2017/09/14/vlan-filter-support-on-bridge#bridge_and_vlan

lxc launch faster:ubuntu/jammy test1
lxc launch faster:ubuntu/jammy test2
#add two NICs from NET1 for two containers
lxc network create NET1 ipv6.address=none ipv4.address=10.139.160.1/24
lxc network attach NET1 test1 eth1
lxc network attach NET1 test1 eth2
lxc network attach NET1 test2 eth1
lxc network attach NET1 test2 eth2

#inside test1
lxc exec test1 -- /bin/bash
sudo ip link add ptk0 type bond miimon 100 mode active-backup
sudo ip link set eth1 down
sudo ip link set eth1 master ptk0
sudo ip link set eth2 down
sudo ip link set eth2 master ptk0
sudo ip link set dev ptk0 address 00:16:3e:15:bd:58
sudo ip link set dev eth1 address 00:16:3e:15:bd:58
sudo ip link set dev eth2 address 00:16:3e:68:72:0f
sudo ip link set ptk0 up
sudo ip link add link ptk0 name ptk0.700 type vlan id 700
sudo ip addr add 10.139.160.10/24 dev ptk0.700
sudo ip link set ptk0.700 up

#inside test2
lxc exec test2 -- /bin/bash
sudo ip link add ptk0 type bond miimon 100 mode active-backup
sudo ip link set eth1 down
sudo ip link set eth1 master ptk0
sudo ip link set eth2 down
sudo ip link set eth2 master ptk0
sudo ip link set dev ptk0 address 00:16:3e:1e:19:25
sudo ip link set dev eth1 address 00:16:3e:1e:19:25
sudo ip link set dev eth2 address 00:16:3e:f7:9e:22
sudo ip link set ptk0 up
sudo ip link add link ptk0 name ptk0.700 type vlan id 700
sudo ip addr add 10.139.160.11/24 dev ptk0.700
sudo ip link set ptk0.700 up

#on host
sudo bridge vlan add vid 700 dev NET1 self
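#extract only the veth names; the first line of brctl output also contains the bridge name and id
brctl show NET1 | grep -o 'veth[[:alnum:]]*' | xargs -I{} sudo bridge vlan add vid 700 dev {}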
sudo bridge vlan show
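
The per-container commands above can also be wrapped in one function that takes the MACs and the address as arguments. A sketch; `setup_bond_vlan` and `DRY_RUN` are hypothetical names, the function must otherwise be run as root inside the container, and it also brings ptk0.700 up:

```shell
#!/usr/bin/env bash
# Sketch: create an active-backup bond ptk0 from eth1/eth2 plus a VLAN on top.
# Usage: setup_bond_vlan <eth1-mac> <eth2-mac> <cidr> [vlan-id]
# DRY_RUN=1 prints the ip commands instead of running them.
setup_bond_vlan() {
  local mac1=$1 mac2=$2 cidr=$3 vid=${4:-700} run="" nic
  [ "${DRY_RUN:-0}" = 1 ] && run=echo
  $run ip link add ptk0 type bond miimon 100 mode active-backup
  for nic in eth1 eth2; do
    # Take the NIC down before enslaving it, as in the commands above.
    $run ip link set "$nic" down
    $run ip link set "$nic" master ptk0
  done
  $run ip link set dev ptk0 address "$mac1"
  $run ip link set dev eth1 address "$mac1"
  $run ip link set dev eth2 address "$mac2"
  $run ip link set ptk0 up
  $run ip link add link ptk0 name "ptk0.$vid" type vlan id "$vid"
  $run ip addr add "$cidr" dev "ptk0.$vid"
  # A new VLAN device starts down, so bring it up explicitly.
  $run ip link set "ptk0.$vid" up
}
```

For test2 the call would be `setup_bond_vlan 00:16:3e:1e:19:25 00:16:3e:f7:9e:22 10.139.160.11/24`.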

20220818 - sriov on lxd and kvm

I tried to use two SR-IOV VFs with LXD but failed; notes below:

Set up the SR-IOV environment on my desktop - https://blog.csdn.net/quqi99/article/details/53488243
#Failed growing available VFs from 3 to 7 on device "enp6s0f0": write /sys/class/net/enp6s0f0/device/sriov_numvfs
sudo modprobe -r igb && sudo modprobe igb max_vfs=7
lspci -nn |grep 82576
ip link show enp6s0f0
$ cat /proc/cmdline 
BOOT_IMAGE=/boot/vmlinuz-5.15.0-39-generic root=UUID=20355b12-b4b2-4a30-b9e1-59fafe2d7633 ro transparent_hugepage=never hugepagesz=2M hugepages=128 default_hugepagesz=2M intel_iommu=pt intel_iommu=on pci=assign-busses mitigations=off nohpet nokaslr crashkernel=512M-:192M
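
The 'Failed growing available VFs' error above comes from a write to `sriov_numvfs`. One known constraint of that sysfs file is that the kernel rejects changing a non-zero VF count directly to another non-zero value (it has to go through 0), and the count cannot exceed `sriov_totalvfs`. A sketch of doing this by hand, assuming root privileges; `set_numvfs` and `SYSFS` are hypothetical names, and whether this constraint is what tripped LXD here is not confirmed:

```shell
#!/usr/bin/env bash
# Sketch: change the number of VFs on a PF through sysfs (run as root).
# SYSFS can be overridden for testing.
SYSFS=${SYSFS:-/sys/class/net}

set_numvfs() {
  local pf=$1 want=$2
  local dev="$SYSFS/$pf/device"
  local total
  total=$(cat "$dev/sriov_totalvfs") || return 1
  if [ "$want" -gt "$total" ]; then
    echo "$pf supports at most $total VFs" >&2
    return 1
  fi
  # Go through 0 first; a direct non-zero to non-zero change is rejected.
  echo 0 > "$dev/sriov_numvfs"
  echo "$want" > "$dev/sriov_numvfs"
}

# Example: set_numvfs enp6s0f0 7
```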

#lxc launch faster:ubuntu/jammy i1
#lxc init faster:ubuntu/jammy i1
#lxc config device add i1 eth0 nic nictype=sriov parent=enp6s0f0
#lxc config device add i1 eth1 nic nictype=sriov parent=enp6s0f0
#lxc config device add i1 eth0 nic network="sriov0" name=eth0 hwaddr="da:da:9d:42:e5:f0"
lxc remote add faster https://mirrors.tuna.tsinghua.edu.cn/lxc-images/ --protocol=simplestreams --public
lxc init faster:ubuntu/jammy test1

#refer https://blog.csdn.net/quqi99/article/details/125004749
#lxd supports sriov now - https://github.com/lxc/lxd/pull/7678
lxc network create sriov0 --type=sriov parent=enp6s0f0
lxc launch faster:ubuntu/jammy test1
#Failed growing available VFs from 3 to 7 on device "enp6s0f0": write /sys/class/net/enp6s0f0/device/sriov_numvfs
lxc network attach sriov0 test1 eth1
lxc network attach sriov0 test1 eth2
lxc config device override i1 eth1 network=sriov0
lxc config device override i1 eth2 network=sriov0

lxc init faster:ubuntu/jammy i1
lxc config device add i1 eth0 nic nictype=sriov parent=enp6s0f0
$ lxc start i1
Error: Failed to start device "eth0": All virtual functions on parent device "enp6s0f0" are already in use
Try `lxc info --show-log i1` for more info

sudo snap set lxd daemon.debug=true; sudo systemctl reload snap.lxd.daemon
sudo tail -f /var/snap/lxd/common/lxd/logs/lxd.log

That failed, so next I switched to KVM: create two KVM VMs, each with two SR-IOV VFs.

lspci | grep net
sudo virt-install --name=i1 --ram=4096 --vcpus=1 --hvm --virt-type=kvm \
    --connect=qemu:///system --os-variant=ubuntu20.04 --accelerate \
    --disk=/images/i1.qcow2,bus=virtio,format=qcow2,cache=none,sparse=true,size=8 \
    --network=bridge=virbr1,model=rtl8139 --nographics -v \
    --hostdev=07:10.0 --hostdev=07:10.1 \
    --location 'https://mirrors.cloud.tencent.com/ubuntu/dists/focal/main/installer-amd64/' --extra-args='console=ttyS0,115200n8 serial'
sudo virsh --connect qemu:///system console i1
arp -a && ssh <user>@<vm-ip> -v

sudo mkdir /mnt/rootfs && sudo chown -R $USER /mnt/rootfs/
sudo mount -o ro /images/iso/ubuntu-20.04-legacy-server-amd64.iso /mnt/rootfs
cd /mnt/rootfs
sudo virt-install --name=i2 --ram=4096 --vcpus=1 --hvm --virt-type=kvm \
    --connect=qemu:///system --os-variant=ubuntu20.04 --accelerate \
    --disk=/images/i2.qcow2,bus=virtio,format=qcow2,cache=none,sparse=true,size=8 \
    --network=bridge=virbr1,model=rtl8139 --nographics -v \
    --hostdev=07:10.2 --hostdev=07:10.3 \
    --cdrom /images/iso/ubuntu-20.04-legacy-server-amd64.iso \
    --boot kernel=./install/netboot/ubuntu-installer/amd64/linux,initrd=./install/netboot/ubuntu-installer/amd64/initrd.gz,kernel_args="console=ttyS0"

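Then repeat the netplan bond + VLAN experiment above. However, even with the MACs set via macaddress, all the MACs still end up identical. Below are the interfaces before applying the config, the config itself, and the interfaces afterwards: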

3: enp6s0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 9e:fb:03:a7:84:3e brd ff:ff:ff:ff:ff:ff
4: enp7s0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 5e:55:d1:be:15:7f brd ff:ff:ff:ff:ff:ff

cat << EOF |tee /etc/netplan/11-test.yaml
network:
  version: 2
  renderer: networkd
  ethernets:
    enp6s0:
      addresses: []
      dhcp4: false
      dhcp6: false
      macaddress: 9e:fb:03:a7:84:3e
    enp7s0:
      addresses: []
      dhcp4: false
      dhcp6: false
      macaddress: 5e:55:d1:be:15:7f
  bonds:
    ptk0:
      addresses: []
      dhcp4: false
      dhcp6: false
      interfaces:
        - enp6s0
        - enp7s0
      parameters:
        mode: active-backup
        primary: enp6s0
  vlans:
    ptk0.700:
      id: 700
      link: ptk0
      dhcp4: no
      addresses: [ 10.139.160.10/24 ]
      nameservers:
        search: [ domain.local ]
        addresses: [ 8.8.8.8 ]
EOF
netplan apply


3: enp6s0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc fq_codel master ptk0 state UP group default qlen 1000
    link/ether 3a:a3:8b:e2:4b:89 brd ff:ff:ff:ff:ff:ff
4: enp7s0: <NO-CARRIER,BROADCAST,MULTICAST,SLAVE,UP> mtu 1500 qdisc fq_codel master ptk0 state DOWN group default qlen 1000
    link/ether 3a:a3:8b:e2:4b:89 brd ff:ff:ff:ff:ff:ff
5: ptk0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 3a:a3:8b:e2:4b:89 brd ff:ff:ff:ff:ff:ff
6: ptk0.700@ptk0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 3a:a3:8b:e2:4b:89 brd ff:ff:ff:ff:ff:ff
    inet 10.139.160.10/24 brd 10.139.160.255 scope global ptk0.700
       valid_lft forever preferred_lft forever

$ ip link show enp6s0f0
3: enp6s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 2c:53:4a:02:20:3c brd ff:ff:ff:ff:ff:ff
    vf 0     link/ether 3a:a3:8b:e2:4b:89 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
$ ip link show enp6s0f1
4: enp6s0f1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000
    link/ether 2c:53:4a:02:20:3d brd ff:ff:ff:ff:ff:ff
    vf 0     link/ether 3a:a3:8b:e2:4b:89 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
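
The identical MACs are consistent with the bonding driver's default `fail_over_mac=none` policy, which sets every slave of an active-backup bond to the bond's MAC at enslavement time. A sketch for checking the mode and the policy from sysfs (each file holds '<name> <number>', e.g. 'none 0'); `bond_mac_policy` and `SYSFS` are hypothetical names:

```shell
#!/usr/bin/env bash
# Sketch: show a bond's mode and fail_over_mac policy from sysfs.
# SYSFS can be overridden for testing.
SYSFS=${SYSFS:-/sys/class/net}

bond_mac_policy() {
  local d="$SYSFS/$1/bonding"
  # Keep only the policy name, dropping the numeric value after it.
  printf 'mode=%s fail_over_mac=%s\n' \
    "$(cut -d' ' -f1 < "$d/mode")" "$(cut -d' ' -f1 < "$d/fail_over_mac")"
}

# Example (inside i1): bond_mac_policy ptk0
```

In netplan the policy corresponds to the bond parameter `fail-over-mac-policy` (none, active or follow); keep in mind the caveat quoted below about VLANs on top of a bond whose MAC changes.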

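With SR-IOV there is no bridge or tap on the host, so there is no need to set up trunking with 'bridge vlan add vid 700 dev'; i1 (10.139.160.10) on vlan=700 can ping i2 (10.139.160.11) directly.

Then create a vlan=700 GW on the host's PF as well, so that i1 (10.139.160.10) can ping the GW (10.139.160.254):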

sudo ip link add link enp6s0f0 name enp6s0f0.700 type vlan id 700
sudo ip addr add 10.139.160.254/24 dev enp6s0f0.700
sudo ifconfig enp6s0f0.700 up

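About trusting a specific VF (ip link set dev enp6s0f0 vf 0 trust on): only after the VF is trusted can the VM using it change its MAC address. An experiment: change the MAC of ptk0.700 in the VM from 3a:a3:8b:e2:4b:89 to aa:bb:cc:dd:ee:00 (ip link set dev ptk0.700 address aa:bb:cc:dd:ee:00). The VF MAC on the host is still 3a:a3:8b:e2:4b:89, and since trust is off (sudo ip link set dev enp6s0f0 vf 0 trust off), the ping should in theory fail. But it still works, even after the MACs of ptk0 and enp6s0f0 are changed as well.
Setting spoof checking off on a specific VF (ip link set dev enp6s0f0 vf 0 spoofchk off) disables the SG-style anti-spoofing check.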
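
Both settings are per VF and are applied on the host. A sketch that applies them to a list of VFs of one PF; `vf_relax` and `DRY_RUN` are hypothetical names, and it uses the full `spoofchk` keyword from the ip-link manual:

```shell
#!/usr/bin/env bash
# Sketch: set trust on and spoof checking off for the given VFs of a PF.
# Usage: vf_relax <pf> <vf-index>...   (DRY_RUN=1 only prints the commands)
vf_relax() {
  local pf=$1 run="" vf
  shift
  [ "${DRY_RUN:-0}" = 1 ] && run=echo
  for vf in "$@"; do
    $run sudo ip link set dev "$pf" vf "$vf" trust on
    $run sudo ip link set dev "$pf" vf "$vf" spoofchk off
  done
}
```

For example `vf_relax enp6s0f0 0`; afterwards `ip link show enp6s0f0` should list `spoof checking off` and `trust on` for vf 0.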

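In principle, if the VM has an active/passive bond, trust on and spoof checking off should both be set. (In a test with these settings, changing the MAC in the VM to aa:bb:cc:dd:ee:00 also changed the VF's MAC on the host to aa:bb:cc:dd:ee:00 automatically.) The original wording is: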

(1). SR-IOV bonding configurations inside guests with VLANs on the interfaces. For example, when the bond shifts from active slave to standby slave, the bond interface carries the MAC of the original active. This MAC needs to be configured down on the VF else all tx packets will be dropped due MAC spoof checking. This can be also achieved if we set fail_over_mac as active which changes the bond MAC on port switchover. But with VLANs on top of bond there will be issues if the bond MAC changes as the MAC of the VLAN interfaces will still have the old MACs.
(2). Currently only a list of 30 multicast addresses can be supported per VF. This restricts the number of IPv6 IPs which can be used/interfaces, as for each IP there will be a different multicast MAC allocated by the kernel. This in turn also restricts the number of VLAN than can created while using IPv6.

The VLAN experiments and the related tcpdump output are as follows:

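1. With enp6s0f0.700 created on top of enp6s0f0, enp6s0f0 effectively acts as a trunk for VLAN 700. When running (ping -I ptk0.700 10.139.160.254 -c1) in i1, the capture on enp6s0f0 shows the VLAN information, since the frames still carry the 802.1Q tag added by the guest on ptk0.700. (This is different from 'ip link set enp6s0f0 vf 0 vlan 700 qos 2', which configures a port VLAN so that the NIC itself adds and strips the tag for VF 0.)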
$ sudo tcpdump -i enp6s0f0 -nn -e -l "(arp or icmp)"
11:19:15.133438 aa:bb:cc:dd:ee:00 > 2c:53:4a:02:20:3c, ethertype 802.1Q (0x8100), length 102: vlan 700, p 0, ethertype IPv4 (0x0800), 10.139.160.10 > 10.139.160.254: ICMP echo request, id 15, seq 1, length 64
11:19:15.133518 2c:53:4a:02:20:3c > aa:bb:cc:dd:ee:00, ethertype 802.1Q (0x8100), length 102: vlan 700, p 0, ethertype IPv4 (0x0800), 10.139.160.254 > 10.139.160.10: ICMP echo reply, id 15, seq 1, length 64
11:19:20.319191 2c:53:4a:02:20:3c > aa:bb:cc:dd:ee:00, ethertype 802.1Q (0x8100), length 46: vlan 700, p 0, ethertype ARP (0x0806), Request who-has 10.139.160.10 tell 10.139.160.254, length 28
11:19:20.319312 aa:bb:cc:dd:ee:00 > 2c:53:4a:02:20:3c, ethertype 802.1Q (0x8100), length 64: vlan 700, p 0, ethertype ARP (0x0806), Reply 10.139.160.10 is-at aa:bb:cc:dd:ee:00, length 46
11:19:20.387683 aa:bb:cc:dd:ee:00 > 2c:53:4a:02:20:3c, ethertype 802.1Q (0x8100), length 64: vlan 700, p 0, ethertype ARP (0x0806), Request who-has 10.139.160.254 tell 10.139.160.10, length 46
11:19:20.387692 2c:53:4a:02:20:3c > aa:bb:cc:dd:ee:00, ethertype 802.1Q (0x8100), length 46: vlan 700, p 0, ethertype ARP (0x0806), Reply 10.139.160.254 is-at 2c:53:4a:02:20:3c, length 28

2. On enp6s0f0.700, however, no VLAN information is visible (again running ping 10.139.160.254 -c1 in i1):
$ sudo tcpdump -i enp6s0f0.700 -nn -e -l "(arp or icmp)"
11:23:44.639393 aa:bb:cc:dd:ee:00 > 2c:53:4a:02:20:3c, ethertype IPv4 (0x0800), length 98: 10.139.160.10 > 10.139.160.254: ICMP echo request, id 16, seq 1, length 64
11:23:44.639448 2c:53:4a:02:20:3c > aa:bb:cc:dd:ee:00, ethertype IPv4 (0x0800), length 98: 10.139.160.254 > 10.139.160.10: ICMP echo reply, id 16, seq 1, length 64
11:23:49.701228 aa:bb:cc:dd:ee:00 > 2c:53:4a:02:20:3c, ethertype ARP (0x0806), length 60: Request who-has 10.139.160.254 tell 10.139.160.10, length 46
11:23:49.701245 2c:53:4a:02:20:3c > aa:bb:cc:dd:ee:00, ethertype ARP (0x0806), length 42: Reply 10.139.160.254 is-at 2c:53:4a:02:20:3c, length 28
11:23:49.887183 2c:53:4a:02:20:3c > aa:bb:cc:dd:ee:00, ethertype ARP (0x0806), length 42: Request who-has 10.139.160.10 tell 10.139.160.254, length 28
11:23:49.887297 aa:bb:cc:dd:ee:00 > 2c:53:4a:02:20:3c, ethertype ARP (0x0806), length 60: Reply 10.139.160.10 is-at aa:bb:cc:dd:ee:00, length 46

3. In i1, ping does not work from the bond (ptk0), the active NIC (enp6s0) or the standby NIC (enp7s0) (e.g. ping -I enp6s0 10.139.160.254 -c1).

4. Likewise, arping works on ptk0.700 in i1:

root@i1:/home/hua# arping -I ptk0.700 10.139.160.254 -S 10.139.160.10 -C1
ARPING 10.139.160.254
42 bytes from 2c:53:4a:02:20:3c (10.139.160.254): index=0 time=129.264 usec

11:28:52.927114 aa:bb:cc:dd:ee:00 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Request who-has 10.139.160.254 tell 10.139.160.10, length 46
11:28:52.927127 2c:53:4a:02:20:3c > aa:bb:cc:dd:ee:00, ethertype ARP (0x0806), length 42: Reply 10.139.160.254 is-at 2c:53:4a:02:20:3c, length 28

5. However, in i1, arping does run on the bond (ptk0) and on the active NIC (enp6s0) (e.g. arping -I enp6s0 10.139.160.254 -S 10.139.160.10 -C1), although tcpdump on enp6s0f0.700 shows nothing; on the standby NIC (arping -I enp7s0 10.139.160.254 -S 10.139.160.10 -C1) it does not work.

#In this case, (tcpdump -i enp6s0f0.700 -nn -e -l "(arp or icmp)") shows no output
root@i1:/home/hua# arping -I enp6s0 10.139.160.254 -S 10.139.160.10 -C1
ARPING 10.139.160.254
60 bytes from da:0d:a3:9d:5e:dd (10.139.160.254): index=0 time=266.707 usec
60 bytes from 28:d2:44:52:31:1d (10.139.160.254): index=1 time=286.628 usec
#But tcpdump -i enp6s0f0 -nn -e -l "(arp or icmp)" shows the following (only the ARP request, no ARP reply)
13:50:10.436783 aa:bb:cc:dd:ee:00 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Request who-has 10.139.160.254 tell 10.139.160.10, length 46

#In this case arping gets no reply, and (tcpdump -i enp6s0f0.700 -nn -e -l "(arp or icmp)") also shows no output
root@i1:/home/hua# arping -I enp7s0 10.139.160.254 -S 10.139.160.10 -C1
ARPING 10.139.160.254
Timeout
Timeout
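
When comparing captures like those above, the key difference is whether a frame still carries the 802.1Q tag (the tagged ICMP frame on enp6s0f0 is 102 bytes, the same frame on enp6s0f0.700 is 98 bytes, the difference being the 4-byte tag). A sketch that labels each `tcpdump -e` line with its VLAN ID or 'untagged'; `vlan_of` is a hypothetical helper:

```shell
#!/usr/bin/env bash
# Sketch: print "vlan=<id>" or "untagged" for each tcpdump -e line on stdin.
vlan_of() {
  awk '{
    # tcpdump -e prints "...length N: vlan <id>, p 0, ..." for tagged frames.
    if (match($0, /: vlan [0-9]+,/))
      print "vlan=" substr($0, RSTART + 7, RLENGTH - 8)
    else
      print "untagged"
  }'
}
```

For example: `sudo tcpdump -i enp6s0f0 -nn -e -l "(arp or icmp)" | vlan_of`.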

Reference

[1] LACP Bond configuration - https://blog.csdn.net/quqi99/article/details/51251210
[2] Three ways to use VLANs - https://blog.csdn.net/quqi99/article/details/51218884
[3] Creating VLAN over OpenStack - https://blog.csdn.net/quqi99/article/details/118341936
[4] VLAN filter support on bridge - https://developers.redhat.com/blog/2017/09/14/vlan-filter-support-on-bridge


Reposted from blog.csdn.net/quqi99/article/details/126345528