WorkingTipsOnGPUPassthrough

Hardware Info

gpu info:

root@usbserver:~# lspci | grep -i nvidia
01:00.0 VGA compatible controller: NVIDIA Corporation GP106M [GeForce GTX 1060 Mobile] (rev a1)
01:00.1 Audio device: NVIDIA Corporation GP106 High Definition Audio Controller (rev a1)
root@usbserver:~# uname -a
Linux usbserver 5.4.0-74-generic #83-Ubuntu SMP Sat May 8 02:35:39 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
root@usbserver:~# cat /etc/issue
Ubuntu 20.04.2 LTS \n \l

Newly installed system, install qemu/libvirtd/virt-manager via:

# apt-get update -y && apt-get install -y virt-manager

Configuration

Remove current NVIDIA or opensource Nouveau driver:

root@usbserver:~# apt-get remove --purge nvidia*

Add driver blacklist:

# vim /etc/modprobe.d/blacklist.conf
....
blacklist amd76x_edac #this might not be required for x86 32 bit users.
blacklist vga16fb
blacklist nouveau
blacklist rivafb
blacklist nvidiafb
blacklist rivatv
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off

Disable modset and add intel_iommu for default grub setting and update grub:

# vim /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="maybe-ubiquity nouveau.modeset=0 intel_iommu=on"
# update-grub2

Disable nouveau kernel:

# echo options nouveau modeset=0 | sudo tee -a /etc/modprobe.d/nouveau-kms.conf
# update-initramfs -u

On the above output for lspci -nn | grep -i nvidia, record the GPU id 10de:1c20 and 10de:10f1, create /etc/modprobe.d/vfio.conf, add corresponding GPU PCI Ids:

# vim /etc/modprobe.d/vfio.conf
options vfio-pci ids=10de:10f1,10de:1c20

Edit /etc/default/grub for adding vfio-ids:

# vim /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="maybe-ubiquity nouveau.modeset=0 intel_iommu=on vfio-pci.ids=10de:1c20,10de:10f1"
# update-grub2

Add vfio modules into initrd:

# vim /etc/initramfs-tools/modules

vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd
# update-initramfs -u

Now restart host machine, after boot check configurations for vfio:

# dmesg | grep -i iommu
# dmesg | grep -i dmar
# lspci -vv
See if the  drive in use is vfio-pci

VM

CrossRouter

In openwrt, configure following:

/images/2021_06_07_07_36_24_553x374.jpg

Then in nuc(192.168.1.222) add following items:

# sudo route add -net 192.168.0.17 netmask 255.255.255.255 gw 192.168.1.2 enp3s0
# sudo route add -net 192.168.0.0 netmask 255.255.255.0 gw 192.168.1.2 enp3s0

While 192.168.1.2 is the openWRT’s wan address, while the 192.168.0.0/24 is its lan address range.

From now we could directly get to the 192.168.0.1/24 range.

WorkingTipsOnGpuPassthrough

近来要调研一些虚拟化的东西,主要包括轻量级虚拟化、边缘池算力等,记下一些关键的点以便后续整理。

手头的两台GPU服务器,每台上面有7块GPU卡,需要将它们透传到虚拟机中以便调研相关系统的性能。

环境

环境信息列举如下:

CPU: model name      : Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz
Memory: 
free -m
              total        used        free      shared  buff/cache   available
Mem:         385679       22356      284811        1042       78512      358158
Swap:             0           0           0
GPU: 
# lspci -nn | grep -i nvidia
3d:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] [10de:xxxx] (rev a1)
3e:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] [10de:xxxx] (rev a1)
40:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] [10de:xxxx] (rev a1)
41:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] [10de:xxxx] (rev a1)
b1:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] [10de:xxxx] (rev a1)
b2:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] [10de:xxxx] (rev a1)
b4:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] [10de:xxxx] (rev a1)
b5:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] [10de:xxxx] (rev a1)

操作系统:

# cat /etc/redhat-release 
CentOS Linux release 7.6.1810 (Core) 
# /usr/libexec/qemu-kvm --version
QEMU emulator version 2.12.0 (qemu-kvm-ev-2.12.0-44.1.el7_8.1)
Copyright (c) 2003-2017 Fabrice Bellard and the QEMU Project developers
# uname -r
4.19.12-1.el7.elrepo.x86_64

系统配置

激活IOMMU, 通过编辑/etc/default/grub中的GRUB_CMDLINE_LINUX行,增加intel_iommu=on后,重新生成grub引导文件:

# vim /etc/default/grub
GRUB_CMDLINE_LINUX="crashkernel=auto rhgb quiet intel_iommu=on rd.driver.pre=vfio-pci"
# grub-mkconfig -o /boot/grub2/grub.cfg
# reboot
重启后,检查:   
dmesg | grep -E "DMAR|IOMMU"

激活vfio-pci内核模块, 注意填入的数值是lspci取得的:

# options vfio-pci ids=10de:xxxx

激活vfio-pci的自动加载:

# echo 'vfio-pci' > /etc/modules-load.d/vfio-pci.conf
# reboot
# dmesg | grep -i vfio

qemu的更新步骤如下:

# /usr/libexec/qemu-kvm -version
QEMU emulator version 1.5.3 (qemu-kvm-1.5.3-141.el7_4.6), Copyright (c) 2003-2008 Fabrice Bellard
# yum -y install centos-release-qemu-ev
# sed -i -e "s/enabled=1/enabled=0/g" /etc/yum.repos.d/CentOS-QEMU-EV.repo
# yum --enablerepo=centos-qemu-ev -y install qemu-kvm-ev
# systemctl restart libvirtd
# /usr/libexec/qemu-kvm -version
QEMU emulator version 2.12.0 (qemu-kvm-ev-2.12.0-44.1.el7_8.1)

现在在virt-manager中是可以指定下放GPU的。

/images/2021_06_05_08_17_16_785x569.jpg

登录到虚拟机以后:

root@localhost:~# lspci -nn | grep -i nvidia
00:0a.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] [10de:xxxx] (rev a1)

接下来我会调研如何在lxd下及在k3s+kubevirt的场景下透传GPU.

EnableHostNetworkInLXD

lxc/lxd lacks of using host network like docker run -it --net host, following are the steps on how to enable this feature:

注:使用host网络模式会使得主机层面的网络增强在容器层面被绕过,等同于完全放开了网络安全策略,被攻击面增大。

建议谨慎使用。

lxd开启hostNetwork

将默认额lxd profile导入到文件中:

# lxc profile show default>host

编辑文件,去掉关于网卡的配置(前置-的行需要被删除):

# vim host
config:
  security.secureboot: "false"
description: Default LXD profile
devices:
-  eth0:
-    name: eth0
-    network: lxdbr0
-    type: nic
  root:
    path: /
    pool: default
    type: disk
name: default
used_by:
- /1.0/instances/centos

删除后保存文件,创建一个名称为hostnetworkprofile,用修改过的配置文件定义之:

lxc profile create hostnetwork
lxc profile edit hostnetwork<host

创建一个新实例:

# lxc init adcfe657303d ubuntu1

指定hostnetwork的profile为其启动所需的profile:

# lxc profile assign ubuntu1 hostnetwork

创建一个raw.lxc的外部配置文件:

# vim /root/lxc_host_network_config
lxc.net.0.type = none

配置该lxc实例的属性(raw.lxc及特权容器), 在某些系统上可能不需要特权容器, 设置完毕后启动:

lxc config set ubuntu1 raw.lxc="lxc.include=/root/lxc_host_network_config"
lxc config set ubuntu1 security.privileged=true
lxc start ubuntu1

检查lxc实例所拥有的网络地址空间:

# lxc ls
lxc ls
+---------+---------+---------------------------+--------------------------------------------------+-----------------+-----------+
|  NAME   |  STATE  |           IPV4            |                       IPV6                       |      TYPE       | SNAPSHOTS |
+---------+---------+---------------------------+--------------------------------------------------+-----------------+-----------+
| centos  | RUNNING | 10.225.0.168 (enp5s0)     | fd42:1cac:64d0:f018:547d:b31b:251b:381e (enp5s0) | VIRTUAL-MACHINE | 0         |
+---------+---------+---------------------------+--------------------------------------------------+-----------------+-----------+
| ubuntu1 | RUNNING | 192.192.189.1 (virbr1)    | fd42:1cac:64d0:f018::1 (lxdbr0)                  | CONTAINER       | 0         |
|         |         | 192.168.122.1 (virbr0)    | 240e:3b5:cb5:abf0:36d4:30d3:285b:862e (enp3s0)   |                 |           |
|         |         | 192.168.1.222 (enp3s0)    |                                                  |                 |           |
|         |         | 172.23.8.165 (ztwdjmv5j3) |                                                  |                 |           |
|         |         | 172.17.0.1 (docker0)      |                                                  |                 |           |
|         |         | 10.33.34.1 (virbr2)       |                                                  |                 |           |
|         |         | 10.225.0.1 (lxdbr0)       |                                                  |                 |           |
+---------+---------+---------------------------+--------------------------------------------------+-----------------+-----------+

进入到网络中检查所有网卡:

# lxc exec ubuntu1 bash
root@ubuntu1:~# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: enp3s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether 74:d4:35:6a:84:19 brd ff:ff:ff:ff:ff:ff
3: ztwdjmv5j3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2800 qdisc fq_codel state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether 0e:d0:a9:d8:58:2a brd ff:ff:ff:ff:ff:ff
4: wlp2s0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN mode DORMANT group default qlen 1000
    link/ether 40:e2:30:30:1e:ee brd ff:ff:ff:ff:ff:ff
5: virbr0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 52:54:00:ea:66:bb brd ff:ff:ff:ff:ff:ff
6: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default 
    link/ether 02:42:d0:f9:2f:53 brd ff:ff:ff:ff:ff:ff
7: lxdbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 00:16:3e:b7:04:a9 brd ff:ff:ff:ff:ff:ff
8: virbr1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 52:54:00:88:da:f3 brd ff:ff:ff:ff:ff:ff
9: virbr2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 52:54:00:7a:73:8b brd ff:ff:ff:ff:ff:ff
10: tapeac054b8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master lxdbr0 state UP mode DEFAULT group default qlen 1000
    link/ether fe:25:d8:30:90:f6 brd ff:ff:ff:ff:ff:ff
11: veth43eb373c@veth5f86c132: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 00:16:3e:cb:fe:fc brd ff:ff:ff:ff:ff:ff
12: veth5f86c132@veth43eb373c: <NO-CARRIER,BROADCAST,MULTICAST,UP,M-DOWN> mtu 1500 qdisc noqueue master lxdbr0 state LOWERLAYERDOWN mode DEFAULT group default qlen 1000
    link/ether 9e:b0:74:e7:fa:86 brd ff:ff:ff:ff:ff:ff
19: veth8d89fdf9@vethb3a2bfad: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 00:16:3e:b3:06:c7 brd ff:ff:ff:ff:ff:ff
20: vethb3a2bfad@veth8d89fdf9: <NO-CARRIER,BROADCAST,MULTICAST,UP,M-DOWN> mtu 1500 qdisc noqueue master lxdbr0 state LOWERLAYERDOWN mode DEFAULT group default qlen 1000
    link/ether da:c7:ab:6b:90:7f brd ff:ff:ff:ff:ff:ff

lxc 开启hostNetwork

使用lxc创建一个容器:

lxc-create -n test -t download -- --dist ubuntu --release focal --arch amd64

编辑 /var/lib/lxc/test/config. 删除所有关于network的行, 添加 “lxc.network.type = none”

启动容器:

lxc-start -n test

检查lxc网络设置:

lxc-ls -f


NAME STATE   AUTOSTART GROUPS IPV4                                                                                          IPV6                                                          UNPRIVILEGED 
test RUNNING 0         -      10.225.0.1, 10.33.34.1, 172.17.0.1, 172.23.8.165, 192.168.1.222, 192.168.122.1, 192.192.189.1 240e:3b5:cb5:abf0:36d4:30d3:285b:862e, fd42:1cac:64d0:f018::1 false   

进入到容器后检查网卡:


➜  ~ lxc-attach test  
# /bin/bash
root@test:~# ip addr
这里可以看到主机上所有的接口地址

WorkingTipsOnK3sKubevirt

1. 目的

k3s+kubevirt,运行虚拟机工作负载. 落地平台为AllInOne节点。主要针对边缘侧算力平台落地场景。

2. 环境

嵌套虚拟化环境用于承载k3s算力管控平台。运行操作系统为Ubuntu20.04, 40Core, 274G内存。
更新: 嵌套虚拟化会产生诸多问题,导致qemu无法启动,因而后面我采用在物理机上直接启动k3s的方式。

物理机环境:

Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz
376G memory
2T nvme ssd
CentOS 7.6.1810
kernel: 4.19.12-1.el7.elrepo.x86_64

3. 步骤

注:以下为虚拟化环境,未安装成功(因后续有嵌套虚拟化导致qemu无法启动的问题)
安装步骤:

# apt-get update -y && apt-get upgrade -y
# export http_proxy export https_proxy
# curl -sfL https://get.k3s.io | sh -
# vim /etc/resolv.conf
nameserver 223.5.5.5
# systemctl stop systemd-resolved
# systemctl disable systemd-resolved

以下为正常过程:

安装k3s:

# yum update -y && yum install -y git
# curl -sfL https://get.k3s.io | sh -

安装KubeVirt:

# export VERSION=v0.41.0
# kubectl create -f https://github.com/kubevirt/kubevirt/releases/download/${VERSION}/kubevirt-operator.yaml
# kubectl create -f https://github.com/kubevirt/kubevirt/releases/download/${VERSION}/kubevirt-cr.yaml

安装virtctl用于控制KubeVirt虚拟机,这里使用Kubernetes的插件管理器Krew来安装virtctl:

# (   set -x; cd "$(mktemp -d)" &&   curl -fsSLO "https://github.com/kubernetes-sigs/krew/releases/latest/download/krew.tar.gz" &&   tar zxvf krew.tar.gz &&   KREW=./krew-"$(uname | tr '[:upper:]' '[:lower:]')_$(uname -m | sed -e 's/x86_64/amd64/' -e 's/arm.*$/arm/')" &&   "$KREW" install krew; )

# export PATH="${KREW_ROOT:-$HOME/.krew}/bin:$PATH"
# kubectl krew install virt

安装Containerized-Data-Importer(CDI)用于管理虚拟机的磁盘,安装步骤如下:

# export VERSION=$(curl -s https://github.com/kubevirt/containerized-data-importer/releases/latest | grep -o "v[0-9]\.[0-9]*\.[0-9]*")
# kubectl create -f https://github.com/kubevirt/containerized-data-importer/releases/download/$VERSION/cdi-operator.yaml
# kubectl create -f https://github.com/kubevirt/containerized-data-importer/releases/download/$VERSION/cdi-cr.yaml

https://www.microsoft.com/en-us/evalcenter/evaluate-windows-server-2019 下载Windows ISO 180天试用版, 而后上传该ISO:

# Get the CDI upload proxy service IP:
 kubectl get svc -n cdi
    
 # Upload 
 kubectl virt image-upload --image-path </path/to/iso> \
     --pvc-name iso-win2k19 --access-mode ReadWriteOnce \
     --pvc-size 10G --uploadproxy-url <upload-proxy service:443> \
     --insecure --wait-secs=240

上传前需要确保coredns正常运行,否则会出现上传不成功的情况。

PVC被设置成ReadWriteOnce, 因为默认的local-path storageclass 不支持更多的模式。因为我们只使用一个节点,所以这点没所谓,但是在大型的K3s集群里,需要注意PVC的属性配置。

virtio-container-disk 容器镜像需要被实现拉回,因为在安装的时候我们需要使用其中包含的驱动程序。

# crictl pull kubevirt/virtio-container-disk

现在我们创建一个yaml文件用于定义需要创建的KubeVirt虚拟机(win.yaml):

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: winhd
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 15Gi
  storageClassName: manual
---
apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachine
metadata:
  name: win2k19-iso
spec:
  running: false
  template:
    metadata:
      labels:
        kubevirt.io/domain: win2k19-iso
    spec:
      domain:
        cpu:
          cores: 4
        devices:
          disks:
          - bootOrder: 1
            cdrom:
              bus: sata
            name: cdromiso
          - disk:
              bus: sata
            name: harddrive
          - cdrom:
              bus: sata
            name: virtiocontainerdisk
        machine:
          type: q35
        resources:
          requests:
            memory: 8G
      volumes:
      - name: cdromiso
        persistentVolumeClaim:
          claimName: iso-win2k19
      - name: harddrive
        persistentVolumeClaim:
          claimName: winhd
      - containerDisk:
          image: kubevirt/virtio-container-disk
        name: virtiocontainerdisk

我们再创建一个PV用于承载该虚拟机的磁盘(pv.yaml):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: task-pv-volume
  labels:
    type: local
spec:
  storageClassName: manual
  capacity:
    storage: 20Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/media/sda/win"

创建虚拟机:

 kubectl apply -f pv.yaml
 kubectl apply -f win.yaml
 kubectl virt start win2k19-iso
 # If you're running this on a remote machine, use X-forwarding and
 # apt-get install virt-viewer
 kubectl virt vnc win2k19-iso

值得注意的是,win.yaml中我们只能选择sata作为主磁盘的格式,kubectl virt vnc在我的机器上无法使用,所以我用了kubectl virt --proxy-only=true win2k19-iso用于获取一个动态端口,而后vncviewer到该动态端口上去。

完成安装后,进入设备管理器安装完未安装好的驱动程序。

使用NodePort暴露安装好以后的虚拟机的RDP端口:

apiVersion: v1
kind: Service
metadata:
  name: windows-nodeport
spec:
  externalTrafficPolicy: Cluster
  ports:
  - name: nodeport
    nodePort: 30000
    port: 27017
    protocol: TCP
    targetPort: 3389
  selector:
    kubevirt.io/domain: win2k19-iso
  type: NodePort

创建完该服务后,则可通过xxx.xxx.xxx.xxx:30000用于访问该虚拟机的RDP远程桌面端口了。

创建完以后的虚拟机如图所示:

/images/2021_05_31_21_30_56_746x693.jpg

如果是virtctl,则直接访问的方式是通过proxy:

kubectl proxy --address=0.0.0.0 --accept-hosts='^*$' --port 8080

testvm:

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: testvm
spec:
  running: false
  template:
    metadata:
      labels:
        kubevirt.io/size: small
        kubevirt.io/domain: testvm
    spec:
      domain:
        devices:
          disks:
            - name: containerdisk
              disk:
                bus: virtio
            - name: cloudinitdisk
              disk:
                bus: virtio
          interfaces:
          - name: default
            bridge: {}
        resources:
          requests:
            memory: 64M
      networks:
      - name: default
        pod: {}
      volumes:
        - name: containerdisk
          containerDisk:
            image: quay.io/kubevirt/cirros-container-disk-demo
        - name: cloudinitdisk
          cloudInitNoCloud:
            userDataBase64: SGkuXG4=