Running Ubuntu Entirely from RAM

1. Purpose

Run the Ubuntu 18.04.1 (arm64) operating system entirely from memory.

2. Prerequisites

Ubuntu 18.04.1 arm64 installation ISO.
An arm64 server, or libvirtd/virt-manager (when no physical server is available, a virtual machine can be used for testing).

3. Steps

Perform a minimal installation of Ubuntu 18.04.1; the root partition should preferably hold everything (all in one).
After the OS is installed, install the packages you need and prepare the environment, then delete all temporary files and slim the system down as much as possible: once the system is customized for RAM boot, every file will be loaded into memory at startup! A fresh Ubuntu installation occupies roughly 1.5 GB of disk space.
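A minimal cleanup sketch (which packages and caches are safe to remove depends on your installation; these are only common examples):

# apt-get autoremove --purge -y
# apt-get clean
# rm -rf /tmp/* /var/tmp/*
# df -h /          # check how much data will have to fit in RAM at boot
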
The procedure for customizing the system for RAM boot is as follows:

Step 1:
Modify /etc/fstab. First back up the file:

# cp /etc/fstab /etc/fstab.bak

Edit /etc/fstab, find the line describing the root partition (/), and change it to the following (example):

#/dev/mapper/ubuntu--vg-root /               ext4    errors=remount-ro 0       1
none / tmpfs defaults 0 0

Step 2:
Modify the local script inside the initramfs. The tools and scripts contained in the initramfs are mounted and perform their initialization before the init script of the real root filesystem starts. We need to copy the contents of the on-disk root partition into a tmpfs at this early stage, so that the correct filesystem is already in place by the time /etc/fstab is processed.

First back up /usr/share/initramfs-tools/scripts/local:

# cp /usr/share/initramfs-tools/scripts/local /usr/share/initramfs-tools/scripts/local.bak   

Edit the local file and change the logic of its Mount root section (around line 204):

	# FIXME This has no error checking
	# Mount root
	#mount ${roflag} ${FSTYPE:+-t ${FSTYPE} }${ROOTFLAGS} ${ROOT} ${rootmnt}
	# Start of ramboottmp
	mkdir /ramboottmp
	mount ${roflag} -t ${FSTYPE} ${ROOTFLAGS} ${ROOT} /ramboottmp
	mount -t tmpfs -o size=100% none ${rootmnt}
	cd ${rootmnt}
	cp -rfa /ramboottmp/* ${rootmnt}
	umount /ramboottmp
	# End of ramboottmp

After saving the file, rebuild the initramfs:

# mkinitramfs -o /boot/initrd.img-ramboot

Once the build succeeds, restore the local file to its original version:

# cp -f /usr/share/initramfs-tools/scripts/local.bak /usr/share/initramfs-tools/scripts/local

Step 3:
Modify GRUB so that the operating system boots with the initrd.img-ramboot just built:

Change the initrd line in the first boot entry to the following:

# chmod +w /boot/grub/grub.cfg
# vim /boot/grub/grub.cfg
.....
.....
        linux	/boot/vmlinuz-4.15.0-29-generic root=/dev/mapper/ubuntu--vg-root ro  
	initrd	/boot/initrd.img-ramboot
......
......
# chmod -w /boot/grub/grub.cfg

Step 4:
Reboot and select the first boot entry; the entire root partition will then be loaded into a tmpfs.
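After the system comes back up, a quick sanity check (not part of the original procedure) is to confirm that the root filesystem is now a tmpfs:

# df -h /                  # the filesystem for / should be reported as "none"
# mount | grep 'on / type' # should show "none on / type tmpfs (...)"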

4. Performance Comparison Tests

Test environment:

  • aarch64, 4 cores
  • 64 GB RAM
  • 100 GB disk partition
  • Ubuntu 18.04.1 LTS
  • Kernel version: 4.15.0-29-generic
  • fio version: fio-3.1

All test cases were run on both the RAM-based host and the traditional host and the results compared.

4.1 fio 4K random read/write

The test command is as follows:

# fio --name TEST --eta-newline=5s --filename=fio-tempfile.dat --rw=randrw --size=500m --io_size=10g --blocksize=4k --ioengine=libaio --fsync=1 --iodepth=1 --numjobs=1 --runtime=60 --group_reporting

| Metric     | RAM-based host        | Traditional host        |
|------------|-----------------------|-------------------------|
| READ bw    | bw=513MiB/s (538MB/s) | bw=85.0KiB/s (87.0kB/s) |
| READ io    | io=5133MiB (5382MB)   | io=5104KiB (5226kB)     |
| READ iops  | IOPS=131k             | IOPS=21                 |
| WRITE bw   | bw=510MiB/s (535MB/s) | bw=88.1KiB/s (90.2kB/s) |
| WRITE io   | io=5107MiB (5355MB)   | io=5288KiB (5415kB)     |
| WRITE iops | IOPS=131k             | IOPS=22                 |

The results show that for 4K random I/O, the RAM-based host delivers roughly 6000 times the bandwidth of the traditional host, and roughly 6000 times its read and write IOPS.
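(For reference: 513 MiB/s ÷ 85.0 KiB/s = 513 × 1024 / 85 ≈ 6200, and 131k IOPS ÷ 21 IOPS ≈ 6200.)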

4.2 fio 4K sequential read/write

The test command is as follows:

# fio --name TEST --eta-newline=5s --filename=fio-tempfile.dat --rw=rw --size=500m --io_size=10g --blocksize=4k --ioengine=libaio --fsync=1 --iodepth=1 --numjobs=1 --runtime=60 --group_reporting

| Metric     | RAM-based host        | Traditional host        |
|------------|-----------------------|-------------------------|
| READ bw    | bw=640MiB/s (671MB/s) | bw=73.2KiB/s (75.0kB/s) |
| READ io    | io=5133MiB (5382MB)   | io=4396KiB (4502kB)     |
| READ iops  | IOPS=164k             | IOPS=18                 |
| WRITE bw   | bw=637MiB/s (668MB/s) | bw=76.8KiB/s (78.6kB/s) |
| WRITE io   | io=5107MiB (5355MB)   | io=4608KiB (4719kB)     |
| WRITE iops | IOPS=163k             | IOPS=19                 |

The results show that for 4K sequential I/O, the RAM-based host delivers roughly 9000 times the bandwidth of the traditional host, and roughly 9000 times its read and write IOPS.
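(For reference: 640 MiB/s ÷ 73.2 KiB/s = 640 × 1024 / 73.2 ≈ 9000, and 164k IOPS ÷ 18 IOPS ≈ 9100.)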

WorkingTipsOnx11lxdewine

Environment preparation

via:

x11docker --desktop --home --pulseaudio x11docker/lxde-wine

Screenshots

PlayOnLinux:

/images/2021_04_24_07_22_11_520x457.jpg

Software list:

/images/2021_04_24_07_22_48_855x552.jpg

Select "Microsoft Paint":

/images/2021_04_24_07_23_49_852x545.jpg

Installation screens:

/images/2021_04_24_07_24_19_519x400.jpg

/images/2021_04_24_07_24_51_519x403.jpg

/images/2021_04_24_07_25_12_497x377.jpg

/images/2021_04_24_07_26_09_517x414.jpg

Age Of Empires:

/images/2021_04_24_07_29_38_850x546.jpg

WorkingTipsOnWinInDocker

Building the Windows image

On CentOS 7, start a virtual machine as follows:

/usr/libexec/qemu-kvm -enable-kvm \
        -machine q35 -smp sockets=1,cores=1,threads=2 -m 2048 \
        -usb -device usb-kbd -device usb-tablet -rtc base=localtime \
        -net nic,model=virtio -net user,hostfwd=tcp::4444-:4444 \
        -drive file=hdd.img,media=disk,if=virtio \
        -drive file=/home/docker/win/cn_windows_10_consumer_editions_version_2004_x64_dvd.iso,media=cdrom \
        -drive file=/home/docker/win/virtio-win-0.1.141.iso,media=cdrom
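The backing disk hdd.img is assumed to have been created beforehand; one possible way to create it (the 40G size is only an example):

$ qemu-img create -f qcow2 hdd.img 40G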

Access the running instance through the VNC port reported by qemu:

/images/2021_04_20_10_53_00_636x464.jpg

Choose custom installation:

/images/2021_04_20_10_53_46_593x403.jpg

A driver needs to be loaded:

/images/2021_04_20_10_55_28_488x496.jpg

The driver after selection:

/images/2021_04_20_10_55_56_414x92.jpg

Ignore the warning and continue:

/images/2021_04_20_10_56_46_553x286.jpg

Continue the installation until it completes.

/images/2021_04_20_14_23_46_568x429.jpg

Password:

/images/2021_04_20_14_25_14_533x423.jpg

Update the driver:

/images/2021_04_20_14_40_16_616x407.jpg

Select E:\ and update:

/images/2021_04_20_14_41_40_548x273.jpg

Now shut down the VM, create an overlay image on top of it, and boot the VM once from that image:

$ qemu-img create -b hdd.img -f qcow2 snapshot.img
$ /usr/libexec/qemu-kvm -enable-kvm \
        -machine q35 -smp sockets=1,cores=1,threads=2 -m 2048 \
        -usb -device usb-kbd -device usb-tablet -rtc base=localtime \
        -net nic,model=virtio -net user,hostfwd=tcp::4444-:4444 \
        -drive file=snapshot.img,media=disk,if=virtio \
        -monitor stdio

In the qemu monitor, save the current state and then shut down:

(qemu) savevm windows
Then type quit to stop VM:

(qemu) quit

Because a saved state now exists, the VM can be restored quickly, provided that the qemu inside the container is the same version as the qemu outside.
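A sketch of resuming from that saved state directly on the command line (same device options as above; -loadvm restores the snapshot saved with savevm windows):

$ /usr/libexec/qemu-kvm -enable-kvm \
        -machine q35 -smp sockets=1,cores=1,threads=2 -m 2048 \
        -usb -device usb-kbd -device usb-tablet -rtc base=localtime \
        -net nic,model=virtio -net user,hostfwd=tcp::4444-:4444 \
        -drive file=snapshot.img,media=disk,if=virtio \
        -loadvm windows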

Building the container image

$ mv hdd.img snapshot.img image
$ cd image
$ docker build -t windows/win10qemu:20210420 .

On CentOS 7 series operating systems, the qemu version on the host differs from the qemu version in the container, so the VM fails to start; make the following changes:

# vim entrypoint.sh
....

  qemu-system-x86_64 -enable-kvm \
    -machine q35 -smp sockets=1,cores=1,threads=2 -m 2048 \
    -usb -device usb-kbd -device usb-tablet -rtc base=localtime \
    -net nic,model=virtio -net user,hostfwd=tcp::4444-:4444 \
    -drive file=snapshot.img,media=disk,if=virtio &
...
# vim Dockerfile
FROM windows/win10qemu:20210420
COPY entrypoint.sh /
# docker build -t win/win10new:latest .

Run the container:

# docker run -it --rm --privileged -p 4444:4444 -p 5915:5900  win/win10new:latest

Open a VNC client and connect to port 5915 to see the Windows desktop:

/images/2021_04_20_16_56_36_779x615.jpg

Running in K8s

Simply create a pod workload from the container image and expose it with a Service.
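A minimal sketch of such a workload (the resource names and the NodePort-style Service are assumptions; the image is the one built above, and privileged mode mirrors the --privileged docker run used earlier):

apiVersion: v1
kind: Pod
metadata:
  name: win10
  labels:
    app: win10
spec:
  containers:
  - name: win10
    image: win/win10new:latest
    securityContext:
      privileged: true          # same requirement as --privileged in the docker run above
    ports:
    - containerPort: 5900       # VNC server inside the container
---
apiVersion: v1
kind: Service
metadata:
  name: win10
spec:
  type: NodePort
  selector:
    app: win10
  ports:
  - port: 5900
    targetPort: 5900

Pointing a VNC client at the allocated NodePort then reaches the Windows desktop, just as with the docker run port mapping above.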

WorkingTipsOnGPUOnCentOS7

1. Prerequisites

Every worker node must run the specified kernel version and have the matching kernel-ml-devel/kernel-ml-headers/gcc dependency packages installed.

# uname -a
Linux worker2 4.19.12-1.el7.elrepo.x86_64 #1 SMP Fri Dec 21 11:06:36 EST 2018 x86_64 x86_64 x86_64 GNU/Linux
# rpm -e --nodeps kernel-headers
# yum install -y kernel-ml-devel kernel-ml-headers gcc

Install the NVIDIA driver manually:

# ./NVIDIA-Linux-x86_64-460.32.03.run 
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 460.32.03...........
..........................................................
..........................................................

Ignore this error:

/images/2021_04_19_16_53_12_622x250.jpg

Choose NO and skip installing the 32-bit compatibility libraries:

/images/2021_04_19_16_53_55_639x172.jpg

Click OK to finish the installation:

/images/2021_04_19_16_54_23_623x183.jpg

Check whether the driver installed successfully:

# nvidia-smi 
Mon Apr 19 04:55:13 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:00:08.0 Off |                  Off |
| N/A   31C    P0    36W / 250W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000000:00:0A.0 Off |                  Off |
| N/A   31C    P0    35W / 250W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

2. CCSE changes

2.1 Create a new offline repository

Import the nvidia-docker2 related offline packages and rebuild the k8s-offline-pkgs repository:

# cd k8s-offline-pkgs/
# tar xzvf /root/nvidiadocker.tar.gz -C .
libnvidia-container-tools-1.3.3-1.x86_64.rpm
nvidia-docker2-2.5.0-1.noarch.rpm
libnvidia-container1-1.3.3-1.x86_64.rpm
nvidia-container-toolkit-1.4.2-2.x86_64.rpm
nvidia-container-runtime-3.4.2-1.x86_64.rpm
# createrepo .

Replace the offline packages on the CCSE console node:

[root@first x86_64]# pwd
/dcos/app/console/backend/webapps/repo/x86_64
[root@first x86_64]# mv k8s-offline-pkgs/ k8s-offline-pkgs.back
[root@first x86_64]# scp -r docker@10.168.100.1:/home/docker/k8s-offline-pkgs .

CCSE code change: only the installation of nvidia-docker2 is added:

# vi /dcos/app/console/kubeadm-playbook/roles/util/docker/tasks/install.yml

  - name: <安装docker><install-docker> 安装 docker (ccse源)
    shell: yum install -y docker-ce nvidia-docker2 --disablerepo=\* --enablerepo=ccse-k8s,ccse-centos7-base
    when: "yum_repo == 'ccse'"
# vi /dcos/app/console/kubeadm-playbook/roles/util/docker/templates/daemon.json.j2
    { 
    {% if custom_image_repository != '' %}{{ docker_insecure_registry_mirrors | indent(2,true) }}{% endif %}
      "storage-driver": "{{ docker_storage_driver }}",
      "graph": "{{ hosts_datadir_map[inventory_hostname] }}/docker",
      "log-driver": "json-file",
      "log-opts": {
                "max-size": "1g"
            },
      "default-runtime": "nvidia",
      "runtimes": {
          "nvidia": {
              "path": "/usr/bin/nvidia-container-runtime",
              "runtimeArgs": []
          }
      }
    }

3. Verification

The relevant packages are in nvidiadockerclassic.tar under the /home/docker directory on 10.50.208.145:

# ls /home/docker/nvidiadockerclassic.tar  -l -h
-rw-r--r-- 1 root root 187M Apr 19 17:49 /home/docker/nvidiadockerclassic.tar

3.1 Image preparation

After deployment, upload and prepare the images on the CCSE console node:

# tar xvf nvidiadockerclassic.tar 
nvidiadockerclassic/
nvidiadockerclassic/nvidia-device-plugin.yml
nvidiadockerclassic/k8sdeviceplugin.tar
# cd nvidiadockerclassic
# docker load<k8sdeviceplugin.tar
# docker tag nvcr.io/nvidia/k8s-device-plugin:v0.9.0 10.168.100.144:8021/nvcr.io/nvidia/k8s-device-plugin:v0.9.0
# docker push 10.168.100.144:8021/nvcr.io/nvidia/k8s-device-plugin:v0.9.0

3.2 Plugin installation and verification

On the master node, create the resources from the nvidia-device-plugin.yml file:

# kubectl create -f nvidia-device-plugin.yml 

Verify that the device plugin installed successfully:

# kubectl  get po -A | grep device
kube-system   nvidia-device-plugin-daemonset-9mhq7           1/1     Running   0          19s
kube-system   nvidia-device-plugin-daemonset-m7txq           1/1     Running   0          19s
# kubectl logs nvidia-device-plugin-daemonset-9mhq7 -n kube-system
2021/04/19 09:53:23 Loading NVML
2021/04/19 09:53:23 Starting FS watcher.
2021/04/19 09:53:23 Starting OS watcher.
2021/04/19 09:53:23 Retreiving plugins.
2021/04/19 09:53:23 Starting GRPC server for 'nvidia.com/gpu'
2021/04/19 09:53:23 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2021/04/19 09:53:23 Registered device plugin for 'nvidia.com/gpu' with Kubelet

Test:

# kubectl create -f test.yml
# kubectl  get po -o wide
NAME             READY   STATUS    RESTARTS   AGE   IP               NODE             NOMINATED NODE   READINESS GATES
dcgmproftester   1/1     Running   0          19s   172.26.189.204   10.168.100.184   <none>           <none>
# nvidia-smi 
Mon Apr 19 05:54:55 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:00:09.0 Off |                  Off |
| N/A   56C    P0   218W / 250W |    493MiB / 32510MiB |     88%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     14182      C   /usr/bin/dcgmproftester11         489MiB |
+-----------------------------------------------------------------------------+

WorkingTipsOnGpu

1. Environment configuration

The configuration of the verification environment is as follows:

gpumaster: 10.168.100.2   4 cores, 16 GB RAM
gpunode1:  10.168.100.3   4 cores, 16 GB RAM, PCI passthrough B5:00 Tesla V100
gpunode2:  10.168.100.4   4 cores, 16 GB RAM, PCI passthrough B2:00 Tesla V100

The nodes run CentOS 7.6, installed via the minimal installation method:

# uname -a
Linux gpumaster 3.10.0-957.el7.x86_64 #1 SMP Thu Nov 8 23:39:32 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
# cat /etc/redhat-release 
CentOS Linux release 7.6.1810 (Core)

The master node also has a 500 GB data disk attached, which needs to be manually mounted at /dcos:

[root@gpumaster ~]# df -h | grep dcos
/dev/vdb1                493G   73M  467G   1% /dcos
[root@gpumaster ~]# cat /etc/fstab | grep dcos
/dev/vdb1        /dcos                       ext4       defaults        0 0
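A sketch of that manual setup, assuming the disk shows up as /dev/vdb and already has a single partition /dev/vdb1 (create the filesystem only if the partition is still blank):

# mkfs.ext4 /dev/vdb1      # skip if a filesystem already exists
# mkdir -p /dcos
# mount /dev/vdb1 /dcos
# echo '/dev/vdb1        /dcos                       ext4       defaults        0 0' >> /etc/fstab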

Disable SELinux and firewalld on each of the three nodes:

# vi /etc/selinux/config
...
SELINUX=disabled
...
# systemctl disable firewalld
# reboot

2. Deploy the CCSE cluster

Add the nodes one by one:

/images/2021_04_19_09_01_57_825x247.jpg

Create a new cluster named gpucluster:

/images/2021_04_19_09_06_40_828x248.jpg

After the cluster is created, add the two GPU nodes:

/images/2021_04_19_09_16_24_1099x449.jpg

After the nodes are added, check the cluster status:

[root@gpumaster ~]# kubectl get node
NAME           STATUS   ROLES    AGE     VERSION
10.168.100.2   Ready    master   6m19s   v1.17.3
10.168.100.3   Ready    node     78s     v1.17.3
10.168.100.4   Ready    node     78s     v1.17.3

3. Upgrade the kernel

Perform the following steps on each of the three nodes to upgrade the kernel.

Configure the offline repository:

# cd /etc/yum.repos.d
# mkdir back
# mv CentOS-* back
# vi nvidia.repo
[nvidia]
name=nvidia
baseurl=http://10.168.100.144:8200/repo/x86_64/nvidiarpms
gpgcheck=0
enabled=1
proxy=_none_
# yum install -y kernel-ml

Configure the GRUB boot settings:

# vi /etc/default/grub
...
GRUB_DEFAULT=0
...
GRUB_CMDLINE_LINUX="crashkernel=auto rd.lvm.lv=centos/root rd.lvm.lv=centos/swap rhgb quiet rd.driver.blacklist=nouveau nouveau.modeset=0"
...
# grub2-mkconfig -o /boot/grub2/grub.cfg

Completely disable the built-in nouveau driver:

# echo 'install nouveau /bin/false' >> /etc/modprobe.d/nouveau.conf

After the above steps, reboot the machine and verify that the kernel has been switched:

# uname -a
Linux gpunode2 4.19.12-1.el7.elrepo.x86_64 #1 SMP Fri Dec 21 11:06:36 EST 2018 x86_64 x86_64 x86_64 GNU/Linux
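It is also worth confirming that nouveau is no longer loaded (empty output is expected):

# lsmod | grep nouveau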

4. gpu-operator file preparation

The list of images pre-uploaded to Harbor is as follows (under nvcr.io and nvidia):

/images/2021_04_19_11_22_13_829x264.jpg

scp the following directory from 10.168.100.1 to every node:

$ scp -r docker@10.168.100.1:/home/docker/nvidia_items .

Pre-load the nfd image:

# docker load<quay.tar
...
Loaded image: quay.io/kubernetes_incubator/node-feature-discovery:v0.6.0

5. Install NVIDIA/gpu-operator

Log in to the gpumaster node and create, from a file, the configmap that the chart deployment will need:

# cat ccse.repo
[ccse-k8s]
name=Centos local yum repo for k8s
baseurl=http://10.168.100.144:8200/repo/x86_64/k8s-offline-pkgs
gpgcheck=0
enabled=1
proxy=_none_

[ccse-centos7-base]
name=Centos local yum repo for k8s
baseurl=http://10.168.100.144:8200/repo/x86_64/centos7-base
gpgcheck=0
enabled=1
proxy=_none_

[fuck]
name=Centos local yum repo for k8s 111
baseurl=http://10.168.100.144:8200/repo/x86_64/nvidiarpms
gpgcheck=0
enabled=1
proxy=_none_
# kubectl create namespace gpu-operator-resources
namespace/gpu-operator-resources created
# kubectl create configmap repo-config -n gpu-operator-resources --from-file=ccse.repo
configmap/repo-config created

Now create the gpu-operator instance:

# cd gpu-operator/
#  helm install --generate-name . -f values.yaml

Check that the instances are running:

# kubectl get po
NAME                                                              READY   STATUS    RESTARTS   AGE
chart-1618803326-node-feature-discovery-master-655c6997cd-fp465   1/1     Running   0          65s
chart-1618803326-node-feature-discovery-worker-7flft              1/1     Running   0          65s
chart-1618803326-node-feature-discovery-worker-mkqm7              1/1     Running   0          65s
chart-1618803326-node-feature-discovery-worker-w2d44              1/1     Running   0          65s
gpu-operator-945878fff-l22vc                                      1/1     Running   0          65s

Now manually label the GPU nodes and watch the instances under the gpu-operator-resources namespace come up.

Enable the GPU driver installation:

# kubectl label nodes 10.168.100.3 nvidia.com/gpu.deploy.driver=true       
node/10.168.100.3 labeled
# kubectl label nodes 10.168.100.4 nvidia.com/gpu.deploy.driver=true       
node/10.168.100.4 labeled

Check the progress of the GPU driver build:

# kubectl  get po -n gpu-operator-resources
NAME                            READY   STATUS    RESTARTS   AGE
nvidia-driver-daemonset-w6d2q   1/1     Running   0          86s
nvidia-driver-daemonset-zmf9l   1/1     Running   0          86s
# kubectl logs nvidia-driver-daemonset-zmf9l -n gpu-operator-resources
Installation of the kernel module for the NVIDIA Accelerated Graphics Driver for Linux-x86_64 (version 460.32.03) is now complete.

Loading IPMI kernel module...
Loading NVIDIA driver kernel modules...
Starting NVIDIA persistence daemon...
Mounting NVIDIA driver rootfs...
Done, now waiting for signal

Enable the device-plugin, dcgm-exporter, and the other components:

# kubectl label nodes 10.168.100.4 nvidia.com/gpu.deploy.container-toolkit=true
# kubectl label nodes 10.168.100.4 nvidia.com/gpu.deploy.device-plugin=true
# kubectl label nodes 10.168.100.4 nvidia.com/gpu.deploy.dcgm-exporter=true
# kubectl label nodes 10.168.100.4 nvidia.com/gpu.deploy.gpu-feature-discovery=true

# kubectl label nodes 10.168.100.3 nvidia.com/gpu.deploy.container-toolkit=true
# kubectl label nodes 10.168.100.3 nvidia.com/gpu.deploy.device-plugin=true
# kubectl label nodes 10.168.100.3  nvidia.com/gpu.deploy.dcgm-exporter=true
# kubectl label nodes 10.168.100.3 nvidia.com/gpu.deploy.gpu-feature-discovery=true

Check the toolkit daemonset: it reports Init:ImagePullBackOff errors:

# kubectl get po -n gpu-operator-resources
NAME                                       READY   STATUS                  RESTARTS   AGE
nvidia-container-toolkit-daemonset-6kqq5   0/1     Init:ImagePullBackOff   0          2m16s
nvidia-container-toolkit-daemonset-cbww2   0/1     Init:ImagePullBackOff   0          4m1s
# kubectl describe po nvidia-container-toolkit-daemonset-cbww2 -n gpu-operator-resources
  Normal   BackOff         3m31s (x7 over 4m46s)  kubelet, 10.168.100.4  Back-off pulling image "10.168.100.144:8021/nvcr.io/nvidia/k8s/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59"
  Warning  Failed          3m31s (x7 over 4m46s)  kubelet, 10.168.100.4  Error: ImagePullBackOff
  Normal   Pulling         3m20s (x4 over 4m48s)  kubelet, 10.168.100.4  Pulling image "10.168.100.144:8021/nvcr.io/nvidia/k8s/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59"

This happens because the pods pull an incorrect image reference; the image tag has to be corrected manually:

# kubectl get ds -n gpu-operator-resources
NAME                                 DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                  AGE
nvidia-container-toolkit-daemonset   2         2         0       2            0           nvidia.com/gpu.deploy.container-toolkit=true   133m
nvidia-driver-daemonset              2         2         2       2            2           nvidia.com/gpu.deploy.driver=true              135m
# kubectl edit ds nvidia-container-toolkit-daemonset -n gpu-operator-resources
        #image: 10.168.100.144:8021/nvcr.io/nvidia/k8s/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59
        image: 10.168.100.144:8021/nvcr.io/nvidia/cuda:11.2.1-base-ubi8

Refresh the pod status: nvidia-container-toolkit-daemonset and nvidia-device-plugin-daemonset are now running normally, while nvidia-device-plugin-validation fails with Init:CrashLoopBackOff:

# kubectl get po -n gpu-operator-resources
NAME                                       READY   STATUS                  RESTARTS   AGE
nvidia-container-toolkit-daemonset-27qj8   1/1     Running                 0          52s
nvidia-container-toolkit-daemonset-g5ndb   1/1     Running                 0          51s
nvidia-device-plugin-daemonset-sqfdc       1/1     Running                 0          26s
nvidia-device-plugin-daemonset-wldkd       1/1     Running                 0          26s
nvidia-device-plugin-validation            0/1     Init:CrashLoopBackOff   1          9s
nvidia-driver-daemonset-m4xjv              1/1     Running                 0          137m
nvidia-driver-daemonset-vkrz5              1/1     Running                 5          137m

Locate the node that the validation pod is scheduled on (10.168.100.3 in this example):

# kubectl get po nvidia-device-plugin-validation -n  gpu-operator-resources -o wide
NAME                              READY   STATUS                  RESTARTS   AGE     IP              NODE           NOMINATED NODE   READINESS GATES
nvidia-device-plugin-validation   0/1     Init:CrashLoopBackOff   4          2m55s   172.26.222.10   10.168.100.3   <none>           <none>

Get the reason for the startup failure:

# kubectl describe po nvidia-device-plugin-validation -n gpu-operator-resources
......
  Warning  Failed            56s (x5 over 2m21s)   kubelet, 10.168.100.3  Error: failed to start container "device-plugin-validation-init": Error response from daemon: linux runtime spec devices: error gathering device information while adding custom device "/dev/nvidiactl": no such file or directory

Log in to node 10.168.100.3 and look up the driver device nodes under /dev from inside the device-plugin container:

# docker ps | grep nvidia-device-plugin-daemonset | grep -v pause
abbea480fdf2        10.168.100.144:8021/nvcr.io/nvidia/k8s-device-plugin       "nvidia-device-plugin"   6 minutes ago       Up 6 minutes                            k8s_nvidia-device-plugin-ctr_nvidia-device-plugin-daemonset-sqfdc_gpu-operator-resources_b9988b02-82a6-4637-a7f0-fdee5a448d60_0
# docker exec -it k8s_nvidia-device-plugin-ctr_nvidia-device-plugin-daemonset-sqfdc_gpu-operator-resources_b9988b02-82a6-4637-a7f0-fdee5a448d60_0 /bin/bash
[root@nvidia-device-plugin-daemonset-sqfdc /]# ls /dev/nvidia* -l -h
crw-rw-rw- 1 root root 195, 254 Apr 19 03:52 /dev/nvidia-modeset
crw-rw-rw- 1 root root 237,   0 Apr 19 06:08 /dev/nvidia-uvm
crw-rw-rw- 1 root root 237,   1 Apr 19 06:08 /dev/nvidia-uvm-tools
crw-rw-rw- 1 root root 195,   0 Apr 19 03:52 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 Apr 19 03:52 /dev/nvidiactl
[root@nvidia-device-plugin-daemonset-sqfdc /]# exit

Manually create the /dev/nvidiactl device node on the host (10.168.100.3); follow the same steps on 10.168.100.4 to look up the corresponding device numbers and create /dev/nvidiactl there as well:

[root@gpunode1 ~]# mknod -m 666 /dev/nvidiactl c 195 255
[root@gpunode1 ~]# ls /dev/nvidiactl -l
crw-rw-rw- 1 root root 195, 255 Apr 19 02:19 /dev/nvidiactl

After deleting the nvidia-device-plugin-validation pod, the kubelet recreates it; the error message changes, now complaining that the /dev/nvidia-uvm device node is missing:

  Warning  Failed     10s (x2 over 11s)  kubelet, 10.168.100.4  Error: failed to start container "device-plugin-validation-init": Error response from daemon: linux runtime spec devices: error gathering device information while adding custom device "/dev/nvidia-uvm": no such file or directory

Create the /dev/nvidia-uvm device node the same way /dev/nvidiactl was created above; make sure the device numbers match those seen inside the container:

# mknod -m 666 /dev/nvidia-uvm c 237 0

Delete the pod and let it be recreated; the error now reports that /dev/nvidia-uvm-tools is missing:

  Warning  Failed     9s (x2 over 10s)  kubelet, 10.168.100.4  Error: failed to start container "device-plugin-validation-init": Error response from daemon: linux runtime spec devices: error gathering device information while adding custom device "/dev/nvidia-uvm-tools": no such file or directory

Manually create the nvidia-uvm-tools device node, delete the pod, and wait for the kubelet to recreate it:

# mknod -m 666 /dev/nvidia-uvm-tools c 237 1
  Warning  Failed     12s (x2 over 12s)  kubelet, 10.168.100.3  Error: failed to start container "device-plugin-validation-init": Error response from daemon: linux runtime spec devices: error gathering device information while adding custom device "/dev/nvidia-modeset": no such file or directory

Manually create the nvidia-modeset device node, delete the pod, and wait for the kubelet to recreate it:

# mknod -m 666 /dev/nvidia-modeset c 195 254
  Warning  Failed     13s (x2 over 14s)  kubelet, 10.168.100.4  Error: failed to start container "device-plugin-validation-init": Error response from daemon: linux runtime spec devices: error gathering device information while adding custom device "/dev/nvidia0": no such file or directory

Manually create the nvidia0 device node, delete the pod, and wait for the kubelet to recreate it:

# mknod -m 666 /dev/nvidia0 c 195 0
# kubectl  get po -A | grep device-plugin-validation
gpu-operator-resources   nvidia-device-plugin-validation                                   0/1     Completed   0          2m26s

The kubelet will then continue to bring up the remaining nvidia resources; the final state should be:

# kubectl  get po -A
NAMESPACE                NAME                                                              READY   STATUS      RESTARTS   AGE
default                  chart-1618804240-node-feature-discovery-master-5f446799f4-sk7vg   1/1     Running     0          163m
default                  chart-1618804240-node-feature-discovery-worker-5sllh              1/1     Running     1          163m
default                  chart-1618804240-node-feature-discovery-worker-86w4w              1/1     Running     0          163m
default                  chart-1618804240-node-feature-discovery-worker-fl52v              1/1     Running     0          163m
default                  gpu-operator-945878fff-88thn                                      1/1     Running     0          163m
gpu-operator-resources   gpu-feature-discovery-p6zqs                                       1/1     Running     0          53s
gpu-operator-resources   gpu-feature-discovery-x88v4                                       1/1     Running     0          53s
gpu-operator-resources   nvidia-container-toolkit-daemonset-27qj8                          1/1     Running     0          26m
gpu-operator-resources   nvidia-container-toolkit-daemonset-g5ndb                          1/1     Running     0          26m
gpu-operator-resources   nvidia-dcgm-exporter-c9vht                                        1/1     Running     0          74s
gpu-operator-resources   nvidia-dcgm-exporter-mz7rh                                        1/1     Running     0          74s
gpu-operator-resources   nvidia-device-plugin-daemonset-sqfdc                              1/1     Running     0          25m
gpu-operator-resources   nvidia-device-plugin-daemonset-wldkd                              1/1     Running     0          25m
gpu-operator-resources   nvidia-device-plugin-validation                                   0/1     Completed   0          2m47s
gpu-operator-resources   nvidia-driver-daemonset-m4xjv                                     1/1     Running     0          163m
gpu-operator-resources   nvidia-driver-daemonset-vkrz5                                     1/1     Running     5          163m
....

6. Test the GPU

A test.yaml file is provided in the gpu-operator directory; create it directly:

[root@gpumaster gpu-operator]# kubectl create -f test.yaml
pod/dcgmproftester created
[root@gpumaster gpu-operator]# kubectl  get po -o wide | grep dcgmproftester
dcgmproftester                                                    1/1     Running   0          103s   172.26.243.149   10.168.100.4   <none>           <none>
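The contents of test.yaml are not reproduced here; a minimal sketch of such a GPU-consuming pod (the image reference is a placeholder; the essential part is the nvidia.com/gpu resource limit) might look like:

apiVersion: v1
kind: Pod
metadata:
  name: dcgmproftester
spec:
  restartPolicy: OnFailure
  containers:
  - name: dcgmproftester
    image: <registry>/dcgmproftester:<tag>   # placeholder; use the test image available in your registry
    resources:
      limits:
        nvidia.com/gpu: 1                    # request one GPU from the device plugin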

Find the nvidia-device-plugin-daemonset pod on 10.168.100.4 and watch the GPU power draw and memory usage on that node; the workload is indeed using the GPU's compute units:

# kubectl exec nvidia-device-plugin-daemonset-wldkd -n gpu-operator-resources -- nvidia-smi
nvidia 33988608 269 nvidia_modeset,nvidia_uvm, Live 0xffffffffa05dd000 (PO)
Mon Apr 19 06:39:26 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:00:08.0 Off |                  Off |
| N/A   61C    P0   208W / 250W |    493MiB / 32510MiB |     84%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

After the test finishes the pod moves to Completed status; check its output:

# kubectl  get po -o wide | grep dcgm
dcgmproftester                                                    0/1     Completed   0          4m12s   172.26.243.149   10.168.100.4   <none>           <none>
# kubectl  logs dcgmproftester
.....
TensorEngineActive: generated ???, dcgm 0.000 (74380.8 gflops)
TensorEngineActive: generated ???, dcgm 0.000 (75398.9 gflops)
TensorEngineActive: generated ???, dcgm 0.000 (75787.6 gflops)
TensorEngineActive: generated ???, dcgm 0.000 (77173.9 gflops)
TensorEngineActive: generated ???, dcgm 0.000 (75669.5 gflops)
Skipping UnwatchFields() since DCGM validation is disabled