Apr 27, 2021
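Technology
1. Purpose
Run Ubuntu 18.04.1 (arm64) entirely from memory.
2. Materials
Ubuntu 18.04.1 arm64 installation ISO.
An arm64 server, or libvirtd/virt-manager (a virtual machine can stand in for testing when no physical server is available).
3. Steps
Perform a minimal installation of Ubuntu 18.04.1, preferably with a single root partition that holds everything (all in one).
After installation, add the packages and environment setup you need, then delete all temporary files and slim the system down as far as possible, because once the system is converted to RAM boot every file is loaded into memory at startup. A fresh Ubuntu installation occupies roughly 1.5 GB of disk space.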
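Since the whole root filesystem ends up in memory, it can help to compare its used size with the installed RAM before converting. The following is a minimal sketch of such a check; the function names and the 50% headroom threshold are assumptions for illustration, not part of the original procedure.

```shell
#!/bin/sh
# Sketch: decide whether a root filesystem of a given used size fits in RAM.
# The 50% headroom factor is an assumption, chosen only for illustration.

# fits_in_ram USED_KB MEMTOTAL_KB
# Succeeds (exit status 0) when the used size is at most half of total RAM.
fits_in_ram() {
    used_kb=$1
    mem_kb=$2
    [ "$used_kb" -le $((mem_kb / 2)) ]
}

# Used size of / in KiB, from the POSIX output format of df.
root_used_kb() { df -Pk / | awk 'NR == 2 {print $3}'; }

# Total RAM in KiB, from /proc/meminfo (Linux only).
mem_total_kb() { awk '/^MemTotal:/ {print $2}' /proc/meminfo; }

# Example usage on the machine being prepared:
#   fits_in_ram "$(root_used_kb)" "$(mem_total_kb)" && echo "fits" || echo "too large"
```

The check is deliberately conservative: tmpfs pages share memory with running applications, so a root filesystem close to the RAM size leaves little room for anything else.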
The procedure for converting the system to boot from RAM is as follows:
Step 1:
Modify /etc/fstab. First back up the file:
# cp /etc/fstab /etc/fstab.bak
Edit /etc/fstab, find the line that describes the root partition (/), and change it as follows (example):
#/dev/mapper/ubuntu--vg-root / ext4 errors=remount-ro 0 1
none / tmpfs defaults 0 0
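The same edit can be scripted. The sketch below comments out the existing root entry and appends the tmpfs entry; the function name `switch_root_to_tmpfs` is hypothetical, and the script assumes GNU sed (for `-i` and `-E`). It is meant to be run once on a backed-up file, since a second run would comment out the tmpfs line and append it again.

```shell
#!/bin/sh
# Sketch: rewrite the root entry of an fstab file the way the example above does.
# Assumes GNU sed. Run it only once per file.

# switch_root_to_tmpfs FSTAB_FILE
switch_root_to_tmpfs() {
    fstab=$1
    # Prefix '#' to every non-comment line whose second field is exactly "/".
    sed -i -E 's|^([^#[:space:]][^[:space:]]*[[:space:]]+/[[:space:]].*)$|#\1|' "$fstab"
    # Append the tmpfs root entry used in this walkthrough.
    printf 'none / tmpfs defaults 0 0\n' >> "$fstab"
}

# Example usage (after backing up the file):
#   switch_root_to_tmpfs /etc/fstab
```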
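Step 2:
Modify the local script in the initramfs. The tools and scripts in the initramfs are mounted and run before the init script of the real root filesystem starts. The contents of the on-disk root partition must therefore be copied into tmpfs at this stage, so that the correct root filesystem is already in place by the time /etc/fstab is processed.
First back up /usr/share/initramfs-tools/scripts/local: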
# cp /usr/share/initramfs-tools/scripts/local /usr/share/initramfs-tools/scripts/local.bak
Edit local and change the logic of its Mount root section (around line 204):
# FIXME This has no error checking
# Mount root
#mount ${roflag} ${FSTYPE:+-t ${FSTYPE} }${ROOTFLAGS} ${ROOT} ${rootmnt}
# Start of ramboottmp
mkdir /ramboottmp
mount ${roflag} -t ${FSTYPE} ${ROOTFLAGS} ${ROOT} /ramboottmp
mount -t tmpfs -o size=100% none ${rootmnt}
cd ${rootmnt}
cp -rfa /ramboottmp/* ${rootmnt}
umount /ramboottmp
### End of ramboottmp
After saving the file, rebuild the initramfs:
# mkinitramfs -o /boot/initrd.img-ramboot
After the build succeeds, restore the original version of local:
# cp -f /usr/share/initramfs-tools/scripts/local.bak /usr/share/initramfs-tools/scripts/local
Step 3:
Modify grub so that the system boots with the newly built initrd.img-ramboot.
Replace the initrd line of the first boot entry as follows:
# chmod +w /boot/grub/grub.cfg
# vim /boot/grub/grub.cfg
.....
.....
linux /boot/vmlinuz-4.15.0-29-generic root=/dev/mapper/ubuntu--vg-root ro
initrd /boot/initrd.img-ramboot
......
......
# chmod -w /boot/grub/grub.cfg
Step 4:
Reboot and select the first boot entry. The entire root partition is now loaded into tmpfs.
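A quick way to confirm the result after the reboot is to look at the filesystem type mounted on /. The sketch below reads a mounts table and prints the type of the last entry mounted on /; the function name `root_fstype` is hypothetical.

```shell
#!/bin/sh
# Sketch: report which filesystem type is mounted on "/".

# root_fstype [MOUNTS_FILE]
# Prints the type of the last "/" entry (later mounts shadow earlier ones).
# Defaults to /proc/mounts when no file is given.
root_fstype() {
    awk '$2 == "/" {t = $3} END {print t}' "${1:-/proc/mounts}"
}

# Example usage on the rebooted machine; "tmpfs" indicates the RAM root:
#   root_fstype
```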
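4. Performance comparison
Test environment:
- aarch64, 4 cores
- 64 GB memory
- 100 GB disk partition
- Ubuntu 18.04.1 LTS
- Kernel version: 4.15.0-29-generic
- fio version: fio-3.1
All test cases were run on both the RAM-disk host and the conventional host, and the results were compared.
4.1 fio 4k random read/write
Test command: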
# fio --name TEST --eta-newline=5s --filename=fio-tempfile.dat --rw=randrw --size=500m --io_size=10g --blocksize=4k --ioengine=libaio --fsync=1 --iodepth=1 --numjobs=1 --runtime=60 --group_reporting
Metric | RAM-based host | Conventional host |
---|---|---|
READ bw | bw=513MiB/s (538MB/s) | bw=85.0KiB/s (87.0kB/s) |
READ io | io=5133MiB (5382MB) | io=5104KiB (5226kB) |
READ iops | IOPS=131k | IOPS=21 |
WRITE bw | bw=510MiB/s (535MB/s) | bw=88.1KiB/s (90.2kB/s) |
WRITE io | io=5107MiB (5355MB) | io=5288KiB (5415kB) |
WRITE iops | IOPS=131k | IOPS=22 |
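The results show that for 4k random read/write, the RAM-based host delivers roughly 6,000 times the bandwidth of the conventional host, and roughly 6,000 times the read and write IOPS.
4.2 fio 4k sequential read/write
Test command: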
# fio --name TEST --eta-newline=5s --filename=fio-tempfile.dat --rw=rw --size=500m --io_size=10g --blocksize=4k --ioengine=libaio --fsync=1 --iodepth=1 --numjobs=1 --runtime=60 --group_reporting
Metric | RAM-based host | Conventional host |
---|---|---|
READ bw | bw=640MiB/s (671MB/s) | bw=73.2KiB/s (75.0kB/s) |
READ io | io=5133MiB (5382MB) | io=4396KiB (4502kB) |
READ iops | IOPS=164k | IOPS=18 |
WRITE bw | bw=637MiB/s (668MB/s) | bw=76.8KiB/s (78.6kB/s) |
WRITE io | io=5107MiB (5355MB) | io=4608KiB (4719kB) |
WRITE iops | IOPS=163k | IOPS=19 |
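The results show that for 4k sequential read/write, the RAM-based host delivers roughly 9,000 times the bandwidth of the conventional host, and roughly 9,000 times the read and write IOPS.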
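The rounded multipliers can be reproduced from the read bandwidth figures in the two tables. This small sketch converts the RAM-based values to KiB/s (1 MiB = 1024 KiB) and divides; the helper name `ratio` is made up for the example.

```shell
#!/bin/sh
# Sketch: recompute the speed-up factors from the measured read bandwidths.

# ratio NUM DEN: print NUM/DEN rounded to the nearest integer.
ratio() { awk -v a="$1" -v b="$2" 'BEGIN {printf "%.0f\n", a / b}'; }

# 4k random read: 513 MiB/s versus 85.0 KiB/s
ratio $((513 * 1024)) 85.0    # prints 6180

# 4k sequential read: 640 MiB/s versus 73.2 KiB/s
ratio $((640 * 1024)) 73.2    # prints 8953
```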
Apr 20, 2021
Technology
Building a Windows image
On CentOS 7, start the virtual machine as follows:
/usr/libexec/qemu-kvm -enable-kvm \
-machine q35 -smp sockets=1,cores=1,threads=2 -m 2048 \
-usb -device usb-kbd -device usb-tablet -rtc base=localtime \
-net nic,model=virtio -net user,hostfwd=tcp::4444-:4444 \
-drive file=hdd.img,media=disk,if=virtio \
-drive file=/home/docker/win/cn_windows_10_consumer_editions_version_2004_x64_dvd.iso,media=cdrom \
-drive file=/home/docker/win/virtio-win-0.1.141.iso,media=cdrom
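Connect to the running instance on the VNC port that qemu reports:
Choose Custom installation:
A driver needs to be loaded:
The driver after selection:
Ignore the warning and continue:
Continue until the installation finishes.
Password:
Update the drivers:
Select E:\ and update:
Now shut down the VM, create an overlay image, and boot the VM once from that image: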
$ qemu-img create -b hdd.img -f qcow2 snapshot.img
$ /usr/libexec/qemu-kvm -enable-kvm \
-machine q35 -smp sockets=1,cores=1,threads=2 -m 2048 \
-usb -device usb-kbd -device usb-tablet -rtc base=localtime \
-net nic,model=virtio -net user,hostfwd=tcp::4444-:4444 \
-drive file=snapshot.img,media=disk,if=virtio \
-monitor stdio
In the qemu monitor, save the current state and then shut down:
(qemu) savevm windows
Then type quit to stop the VM:
(qemu) quit
Because the saved state exists, the VM can be restored quickly, provided the qemu inside the container is the same version as the qemu outside it.
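One way to check the version condition is to compare the version strings that both qemu binaries print. The sketch below extracts the version number from `--version` output and compares two values; the function names are hypothetical, and the `docker run --entrypoint` usage line assumes the image built later in this post contains `qemu-system-x86_64`.

```shell
#!/bin/sh
# Sketch: compare the qemu version on the host with the one in the container.

# qemu_version: read `qemu ... --version` output on stdin, print e.g. "2.12.0".
qemu_version() {
    sed -n 's/^QEMU emulator version \([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# same_qemu_version A B: succeed only when both are non-empty and identical.
same_qemu_version() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}

# Example usage (image name taken from the build step in this post):
#   host=$(/usr/libexec/qemu-kvm --version | qemu_version)
#   ctr=$(docker run --rm --entrypoint qemu-system-x86_64 \
#           windows/win10qemu:20210420 --version | qemu_version)
#   same_qemu_version "$host" "$ctr" || echo "qemu versions differ: $host vs $ctr"
```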
Build the container image
$ mv hdd.img snapshot.img image
$ cd image
$ docker build -t windows/win10qemu:20210420 .
On CentOS 7 hosts, the version difference between the host qemu and the container qemu prevents the VM from starting, so the following changes are needed:
# vim entrypoint.sh
....
qemu-system-x86_64 -enable-kvm \
-machine q35 -smp sockets=1,cores=1,threads=2 -m 2048 \
-usb -device usb-kbd -device usb-tablet -rtc base=localtime \
-net nic,model=virtio -net user,hostfwd=tcp::4444-:4444 \
-drive file=snapshot.img,media=disk,if=virtio &
...
# vim Dockerfile
FROM windows/win10qemu:20210420
COPY entrypoint.sh /
# docker build -t win/win10new:latest .
Run the container:
# docker run -it --rm --privileged -p 4444:4444 -p 5915:5900 win/win10new:latest
Open a VNC client and connect to port 5915 to see the Windows desktop:
Running in K8s
Create a pod workload from the container image and expose it with a service.
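As a rough illustration of that last step, the sketch below prints a Pod and a NodePort Service for the image built above. The resource names, the `app: win10` label, and the choice of NodePort are assumptions; the privileged setting simply mirrors the `docker run --privileged` flag used earlier.

```shell
#!/bin/sh
# Sketch: emit a Pod plus Service manifest for the win/win10new image.
# Names, labels, and the Service type are hypothetical.
render_win10_manifest() {
    cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: win10
  labels:
    app: win10
spec:
  containers:
  - name: win10
    image: win/win10new:latest
    securityContext:
      privileged: true
    ports:
    - containerPort: 5900
    - containerPort: 4444
---
apiVersion: v1
kind: Service
metadata:
  name: win10
spec:
  type: NodePort
  selector:
    app: win10
  ports:
  - name: vnc
    port: 5900
    targetPort: 5900
  - name: guest
    port: 4444
    targetPort: 4444
EOF
}

# Example usage:
#   render_win10_manifest | kubectl apply -f -
```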
Apr 19, 2021
Technology
1. Prerequisites
Every worker node must run the specified kernel version and have the matching kernel-ml-devel, kernel-ml-headers, and gcc packages installed.
# uname -a
Linux worker2 4.19.12-1.el7.elrepo.x86_64 #1 SMP Fri Dec 21 11:06:36 EST 2018 x86_64 x86_64 x86_64 GNU/Linux
# rpm -e --nodeps kernel-headers
# yum install -y kernel-ml-devel kernel-ml-headers gcc
Install the NVIDIA driver manually:
# ./NVIDIA-Linux-x86_64-460.32.03.run
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 460.32.03...........
..........................................................
..........................................................
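Ignore this error:
Choose NO to skip installing the 32-bit compatibility libraries:
Press OK to finish the installation:
Check that the driver installed successfully: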
# nvidia-smi
Mon Apr 19 04:55:13 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... Off | 00000000:00:08.0 Off | Off |
| N/A 31C P0 36W / 250W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-PCIE... Off | 00000000:00:0A.0 Off | Off |
| N/A 31C P0 35W / 250W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
2. CCSE changes
2.1 Create a new offline repository:
Add the nvidia-docker2 offline packages and update the k8s-offline-pkgs repository:
# cd k8s-offline-pkgs/
# tar xzvf /root/nvidiadocker.tar.gz -C .
libnvidia-container-tools-1.3.3-1.x86_64.rpm
nvidia-docker2-2.5.0-1.noarch.rpm
libnvidia-container1-1.3.3-1.x86_64.rpm
nvidia-container-toolkit-1.4.2-2.x86_64.rpm
nvidia-container-runtime-3.4.2-1.x86_64.rpm
# createrepo .
Replace the offline packages on the CCSE console node:
[root@first x86_64]# pwd
/dcos/app/console/backend/webapps/repo/x86_64
[root@first x86_64]# mv k8s-offline-pkgs/ k8s-offline-pkgs.back
[root@first x86_64]# scp -r docker@10.168.100.1:/home/docker/k8s-offline-pkgs .
CCSE code changes (only the installation of nvidia-docker2 is added):
# vi /dcos/app/console/kubeadm-playbook/roles/util/docker/tasks/install.yml
- name: <安装docker><install-docker> 安装 docker (ccse源)
shell: yum install -y docker-ce nvidia-docker2 --disablerepo=\* --enablerepo=ccse-k8s,ccse-centos7-base
when: "yum_repo == 'ccse'"
# vi /dcos/app/console/kubeadm-playbook/roles/util/docker/templates/daemon.json.j2
{
{% if custom_image_repository != '' %}{{ docker_insecure_registry_mirrors | indent(2,true) }}{% endif %}
"storage-driver": "{{ docker_storage_driver }}",
"graph": "{{ hosts_datadir_map[inventory_hostname] }}/docker",
"log-driver": "json-file",
"log-opts": {
"max-size": "1g"
},
"default-runtime": "nvidia",
"runtimes": {
"nvidia": {
"path": "/usr/bin/nvidia-container-runtime",
"runtimeArgs": []
}
}
}
3. Verification
The required package is nvidiadockerclassic.tar in the /home/docker directory on 10.50.208.145:
# ls /home/docker/nvidiadockerclassic.tar -l -h
-rw-r--r-- 1 root root 187M Apr 19 17:49 /home/docker/nvidiadockerclassic.tar
3.1 Image preparation
After deployment, load the prepared image on the CCSE console node and push it to the registry:
# tar xvf nvidiadockerclassic.tar
nvidiadockerclassic/
nvidiadockerclassic/nvidia-device-plugin.yml
nvidiadockerclassic/k8sdeviceplugin.tar
# cd nvidiadockerclassic
# docker load<k8sdeviceplugin.tar
# docker tag nvcr.io/nvidia/k8s-device-plugin:v0.9.0 10.168.100.144:8021/nvcr.io/nvidia/k8s-device-plugin:v0.9.0
# docker push 10.168.100.144:8021/nvcr.io/nvidia/k8s-device-plugin:v0.9.0
3.2 Plugin installation and verification
On the master node, create the resources from nvidia-device-plugin.yml:
# kubectl create -f nvidia-device-plugin.yml
Verify that the device plugin was installed successfully:
# kubectl get po -A | grep device
kube-system nvidia-device-plugin-daemonset-9mhq7 1/1 Running 0 19s
kube-system nvidia-device-plugin-daemonset-m7txq 1/1 Running 0 19s
# kubectl logs nvidia-device-plugin-daemonset-9mhq7 -n kube-system
2021/04/19 09:53:23 Loading NVML
2021/04/19 09:53:23 Starting FS watcher.
2021/04/19 09:53:23 Starting OS watcher.
2021/04/19 09:53:23 Retreiving plugins.
2021/04/19 09:53:23 Starting GRPC server for 'nvidia.com/gpu'
2021/04/19 09:53:23 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2021/04/19 09:53:23 Registered device plugin for 'nvidia.com/gpu' with Kubelet
Test:
# kubectl create -f test.yml
# kubectl get po -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
dcgmproftester 1/1 Running 0 19s 172.26.189.204 10.168.100.184 <none> <none>
# nvidia-smi
Mon Apr 19 05:54:55 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... Off | 00000000:00:09.0 Off | Off |
| N/A 56C P0 218W / 250W | 493MiB / 32510MiB | 88% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 14182 C /usr/bin/dcgmproftester11 489MiB |
+-----------------------------------------------------------------------------+
Apr 13, 2021
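Technology
1. Environment
The verification environment is configured as follows:
gpumaster: 10.168.100.2 4 cores, 16 GB
gpunode1: 10.168.100.3 4 cores, 16 GB, PCI passthrough B5:00 Tesla V100
gpunode2: 10.168.100.4 4 cores, 16 GB, PCI passthrough B2:00 Tesla V100
The nodes run a minimal installation of CentOS 7.6: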
# uname -a
Linux gpumaster 3.10.0-957.el7.x86_64 #1 SMP Thu Nov 8 23:39:32 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
# cat /etc/redhat-release
CentOS Linux release 7.6.1810 (Core)
The master node has an additional 500 GB data disk attached, which must be mounted manually at /dcos:
[root@gpumaster ~]# df -h | grep dcos
/dev/vdb1 493G 73M 467G 1% /dcos
[root@gpumaster ~]# cat /etc/fstab | grep dcos
/dev/vdb1 /dcos ext4 defaults 0 0
On each of the three nodes, disable selinux and firewalld:
# vi /etc/selinux/config
...
SELINUX=disabled
...
# systemctl disable firewalld
# reboot
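On three nodes the SELinux edit is easier to apply non-interactively. The following sketch performs the same change with sed instead of vi; the function name is hypothetical and GNU sed is assumed.

```shell
#!/bin/sh
# Sketch: set SELINUX=disabled in an SELinux config file without an editor.
# Assumes GNU sed. SELINUXTYPE= lines are left untouched.

# disable_selinux_config [CONFIG_FILE]   (defaults to /etc/selinux/config)
disable_selinux_config() {
    sed -i -E 's/^SELINUX=.*/SELINUX=disabled/' "${1:-/etc/selinux/config}"
}

# Example usage on each node, as root:
#   disable_selinux_config && systemctl disable firewalld && reboot
```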
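2. Deploy the CCSE cluster
Add the nodes one by one:
Create a new cluster named gpucluster:
After the cluster is created, add the two GPU nodes:
When the nodes have been added, check the cluster status: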
[root@gpumaster ~]# kubectl get node
NAME STATUS ROLES AGE VERSION
10.168.100.2 Ready master 6m19s v1.17.3
10.168.100.3 Ready node 78s v1.17.3
10.168.100.4 Ready node 78s v1.17.3
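3. Upgrade the kernel
Perform the following steps on each of the three nodes to upgrade the kernel.
Configure the offline package repository: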
# cd /etc/yum.repos.d
# mkdir back
# mv CentOS-* back
# vi nvidia.repo
[nvidia]
name=nvidia
baseurl=http://10.168.100.144:8200/repo/x86_64/nvidiarpms
gpgcheck=0
enabled=1
proxy=_none_
# yum install -y kernel-ml
Configure grub:
# vi /etc/default/grub
...
GRUB_DEFAULT=0
...
GRUB_CMDLINE_LINUX="crashkernel=auto rd.lvm.lv=centos/root rd.lvm.lv=centos/swap rhgb quiet rd.driver.blacklist=nouveau nouveau.modeset=0"
...
# grub2-mkconfig -o /boot/grub2/grub.cfg
Disable the bundled nouveau driver completely:
# echo 'install nouveau /bin/false' >> /etc/modprobe.d/nouveau.conf
After these steps, reboot the machine and verify that the kernel has changed:
# uname -a
Linux gpunode2 4.19.12-1.el7.elrepo.x86_64 #1 SMP Fri Dec 21 11:06:36 EST 2018 x86_64 x86_64 x86_64 GNU/Linux
4. Preparing the gpu-operator files
The images pre-uploaded to Harbor are listed below (nvcr.io and nvidia):
Copy the following directory from 10.168.100.1 to all nodes with scp:
$ scp -r docker@10.168.100.1:/home/docker/nvidia_items .
Pre-load the nfd image:
# docker load<quay.tar
...
Loaded image: quay.io/kubernetes_incubator/node-feature-discovery:v0.6.0
5. Install NVIDIA/gpu-operator
Log in to the gpumaster node and create, from a file, the configmap that the chart deployment needs:
# cat ccse.repo
[ccse-k8s]
name=Centos local yum repo for k8s
baseurl=http://10.168.100.144:8200/repo/x86_64/k8s-offline-pkgs
gpgcheck=0
enabled=1
proxy=_none_
[ccse-centos7-base]
name=Centos local yum repo for k8s
baseurl=http://10.168.100.144:8200/repo/x86_64/centos7-base
gpgcheck=0
enabled=1
proxy=_none_
[fuck]
name=Centos local yum repo for k8s 111
baseurl=http://10.168.100.144:8200/repo/x86_64/nvidiarpms
gpgcheck=0
enabled=1
proxy=_none_
# kubectl create namespace gpu-operator-resources
namespace/gpu-operator-resources created
# kubectl create configmap repo-config -n gpu-operator-resources --from-file=ccse.repo
configmap/repo-config created
Now create the gpu-operator instance:
# cd gpu-operator/
# helm install --generate-name . -f values.yaml
Check that the instance is running:
# kubectl get po
NAME READY STATUS RESTARTS AGE
chart-1618803326-node-feature-discovery-master-655c6997cd-fp465 1/1 Running 0 65s
chart-1618803326-node-feature-discovery-worker-7flft 1/1 Running 0 65s
chart-1618803326-node-feature-discovery-worker-mkqm7 1/1 Running 0 65s
chart-1618803326-node-feature-discovery-worker-w2d44 1/1 Running 0 65s
gpu-operator-945878fff-l22vc 1/1 Running 0 65s
Label the GPU nodes manually, then watch the instances in the gpu-operator-resources namespace.
Enable GPU driver installation:
# kubectl label nodes 10.168.100.3 nvidia.com/gpu.deploy.driver=true
node/10.168.100.3 labeled
# kubectl label nodes 10.168.100.4 nvidia.com/gpu.deploy.driver=true
node/10.168.100.4 labeled
Check the GPU driver build:
# kubectl get po -n gpu-operator-resources
NAME READY STATUS RESTARTS AGE
nvidia-driver-daemonset-w6d2q 1/1 Running 0 86s
nvidia-driver-daemonset-zmf9l 1/1 Running 0 86s
# kubectl logs nvidia-driver-daemonset-zmf9l -n gpu-operator-resources
Installation of the kernel module for the NVIDIA Accelerated Graphics Driver for Linux-x86_64 (version 460.32.03) is now complete.
Loading IPMI kernel module...
Loading NVIDIA driver kernel modules...
Starting NVIDIA persistence daemon...
Mounting NVIDIA driver rootfs...
Done, now waiting for signal
Enable device-plugin, dcgm-exporter, and the other components:
# kubectl label nodes 10.168.100.4 nvidia.com/gpu.deploy.container-toolkit=true
# kubectl label nodes 10.168.100.4 nvidia.com/gpu.deploy.device-plugin=true
# kubectl label nodes 10.168.100.4 nvidia.com/gpu.deploy.dcgm-exporter=true
# kubectl label nodes 10.168.100.4 nvidia.com/gpu.deploy.gpu-feature-discovery=true
# kubectl label nodes 10.168.100.3 nvidia.com/gpu.deploy.container-toolkit=true
# kubectl label nodes 10.168.100.3 nvidia.com/gpu.deploy.device-plugin=true
# kubectl label nodes 10.168.100.3 nvidia.com/gpu.deploy.dcgm-exporter=true
# kubectl label nodes 10.168.100.3 nvidia.com/gpu.deploy.gpu-feature-discovery=true
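The eight label commands above follow one pattern, so they can be generated with a loop. This sketch only prints the commands (a dry run) so they can be reviewed before being piped to a shell; the function name `label_gpu_nodes` is hypothetical.

```shell
#!/bin/sh
# Sketch: generate the per-node, per-component label commands shown above.
# Prints the commands instead of running them, so the output can be reviewed.

# label_gpu_nodes NODE...
label_gpu_nodes() {
    for node in "$@"; do
        for component in container-toolkit device-plugin dcgm-exporter gpu-feature-discovery; do
            echo "kubectl label nodes $node nvidia.com/gpu.deploy.${component}=true"
        done
    done
}

# Example usage:
#   label_gpu_nodes 10.168.100.3 10.168.100.4        # dry run
#   label_gpu_nodes 10.168.100.3 10.168.100.4 | sh   # apply
```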
Check the toolkit daemonset; its pods report Init:ImagePullBackOff:
# kubectl get po -n gpu-operator-resources
NAME READY STATUS RESTARTS AGE
nvidia-container-toolkit-daemonset-6kqq5 0/1 Init:ImagePullBackOff 0 2m16s
nvidia-container-toolkit-daemonset-cbww2 0/1 Init:ImagePullBackOff 0 4m1s
# kubectl describe po nvidia-container-toolkit-daemonset-cbww2 -n gpu-operator-resources
Normal BackOff 3m31s (x7 over 4m46s) kubelet, 10.168.100.4 Back-off pulling image "10.168.100.144:8021/nvcr.io/nvidia/k8s/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59"
Warning Failed 3m31s (x7 over 4m46s) kubelet, 10.168.100.4 Error: ImagePullBackOff
Normal Pulling 3m20s (x4 over 4m48s) kubelet, 10.168.100.4 Pulling image "10.168.100.144:8021/nvcr.io/nvidia/k8s/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59"
This is caused by the pod pulling the wrong image tag; change the image tag manually:
# kubectl get ds -n gpu-operator-resources
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
nvidia-container-toolkit-daemonset 2 2 0 2 0 nvidia.com/gpu.deploy.container-toolkit=true 133m
nvidia-driver-daemonset 2 2 2 2 2 nvidia.com/gpu.deploy.driver=true 135m
# kubectl edit ds nvidia-container-toolkit-daemonset -n gpu-operator-resources
#image: 10.168.100.144:8021/nvcr.io/nvidia/k8s/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59
image: 10.168.100.144:8021/nvcr.io/nvidia/cuda:11.2.1-base-ubi8
Refresh the pod list: nvidia-container-toolkit-daemonset and nvidia-device-plugin-daemonset are now running normally, while nvidia-device-plugin-validation fails with Init:CrashLoopBackOff:
# kubectl get po -n gpu-operator-resources
NAME READY STATUS RESTARTS AGE
nvidia-container-toolkit-daemonset-27qj8 1/1 Running 0 52s
nvidia-container-toolkit-daemonset-g5ndb 1/1 Running 0 51s
nvidia-device-plugin-daemonset-sqfdc 1/1 Running 0 26s
nvidia-device-plugin-daemonset-wldkd 1/1 Running 0 26s
nvidia-device-plugin-validation 0/1 Init:CrashLoopBackOff 1 9s
nvidia-driver-daemonset-m4xjv 1/1 Running 0 137m
nvidia-driver-daemonset-vkrz5 1/1 Running 5 137m
Find the node that the validation pod runs on (10.168.100.3 in this example):
# kubectl get po nvidia-device-plugin-validation -n gpu-operator-resources -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nvidia-device-plugin-validation 0/1 Init:CrashLoopBackOff 4 2m55s 172.26.222.10 10.168.100.3 <none> <none>
Get the reason for the startup failure:
# kubectl describe po nvidia-device-plugin-validation -n gpu-operator-resources
......
Warning Failed 56s (x5 over 2m21s) kubelet, 10.168.100.3 Error: failed to start container "device-plugin-validation-init": Error response from daemon: linux runtime spec devices: error gathering device information while adding custom device "/dev/nvidiactl": no such file or directory
Log in to node 10.168.100.3 and list the driver device names under /dev:
# docker ps | grep nvidia-device-plugin-daemonset | grep -v pause
abbea480fdf2 10.168.100.144:8021/nvcr.io/nvidia/k8s-device-plugin "nvidia-device-plugin" 6 minutes ago Up 6 minutes k8s_nvidia-device-plugin-ctr_nvidia-device-plugin-daemonset-sqfdc_gpu-operator-resources_b9988b02-82a6-4637-a7f0-fdee5a448d60_0
# docker exec -it k8s_nvidia-device-plugin-ctr_nvidia-device-plugin-daemonset-sqfdc_gpu-operator-resources_b9988b02-82a6-4637-a7f0-fdee5a448d60_0 /bin/bash
[root@nvidia-device-plugin-daemonset-sqfdc /]# ls /dev/nvidia* -l -h
crw-rw-rw- 1 root root 195, 254 Apr 19 03:52 /dev/nvidia-modeset
crw-rw-rw- 1 root root 237, 0 Apr 19 06:08 /dev/nvidia-uvm
crw-rw-rw- 1 root root 237, 1 Apr 19 06:08 /dev/nvidia-uvm-tools
crw-rw-rw- 1 root root 195, 0 Apr 19 03:52 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 Apr 19 03:52 /dev/nvidiactl
[root@nvidia-device-plugin-daemonset-sqfdc /]# exit
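On the host itself (10.168.100.3), create the /dev/nvidiactl device file manually. Follow the same steps on 10.168.100.4 to find the corresponding device numbers and create /dev/nvidiactl there as well: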
[root@gpunode1 ~]# mknod -m 666 /dev/nvidiactl c 195 255
[root@gpunode1 ~]# ls /dev/nvidiactl -l
crw-rw-rw- 1 root root 195, 255 Apr 19 02:19 /dev/nvidiactl
After the nvidia-device-plugin-validation pod is deleted, kubelet starts a new one. The error message changes and now reports that the /dev/nvidia-uvm device file is missing:
Warning Failed 10s (x2 over 11s) kubelet, 10.168.100.4 Error: failed to start container "device-plugin-validation-init": Error response from daemon: linux runtime spec devices: error gathering device information while adding custom device "/dev/nvidia-uvm": no such file or directory
Create /dev/nvidia-uvm the same way as /dev/nvidiactl above, making sure the device numbers match those in the container:
# mknod -m 666 /dev/nvidia-uvm c 237 0
After the pod is deleted and recreated, the error reports that /dev/nvidia-uvm-tools is missing:
Warning Failed 9s (x2 over 10s) kubelet, 10.168.100.4 Error: failed to start container "device-plugin-validation-init": Error response from daemon: linux runtime spec devices: error gathering device information while adding custom device "/dev/nvidia-uvm-tools": no such file or directory
Create the nvidia-uvm-tools device file manually, then delete the pod and wait for kubelet to recreate it:
# mknod -m 666 /dev/nvidia-uvm-tools c 237 1
Warning Failed 12s (x2 over 12s) kubelet, 10.168.100.3 Error: failed to start container "device-plugin-validation-init": Error response from daemon: linux runtime spec devices: error gathering device information while adding custom device "/dev/nvidia-modeset": no such file or directory
Create the nvidia-modeset device file manually, then delete the pod and wait for kubelet to recreate it:
# mknod -m 666 /dev/nvidia-modeset c 195 254
Warning Failed 13s (x2 over 14s) kubelet, 10.168.100.4 Error: failed to start container "device-plugin-validation-init": Error response from daemon: linux runtime spec devices: error gathering device information while adding custom device "/dev/nvidia0": no such file or directory
Create the nvidia0 device file manually, then delete the pod and wait for kubelet to recreate it:
# mknod -m 666 /dev/nvidia0 c 195 0
# kubectl get po -A | grep device-plugin-validation
gpu-operator-resources nvidia-device-plugin-validation 0/1 Completed 0 2m26s
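Instead of discovering the missing device files one pod restart at a time, all five can be checked in a single pass. The sketch below prints a `mknod` command for each listed device that does not exist yet. The function names are hypothetical, and the major/minor numbers are the ones observed inside the device-plugin container in this walkthrough, so they should be re-read on each host before use.

```shell
#!/bin/sh
# Sketch: print mknod commands for NVIDIA character devices missing on the host.

# missing_device_cmds: reads "PATH MAJOR MINOR" lines on stdin and prints a
# mknod command for every path that does not exist.
missing_device_cmds() {
    while read -r path major minor; do
        [ -n "$path" ] || continue
        [ -e "$path" ] || echo "mknod -m 666 $path c $major $minor"
    done
}

# Device numbers as listed inside the container earlier in this post.
# They are an assumption for any other host.
nvidia_devices() {
    cat <<'EOF'
/dev/nvidiactl 195 255
/dev/nvidia-uvm 237 0
/dev/nvidia-uvm-tools 237 1
/dev/nvidia-modeset 195 254
/dev/nvidia0 195 0
EOF
}

# Example usage, as root on a GPU node:
#   nvidia_devices | missing_device_cmds        # review
#   nvidia_devices | missing_device_cmds | sh   # create
```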
kubelet then brings up the remaining nvidia resources. The final state should be:
# kubectl get po -A
NAMESPACE NAME READY STATUS RESTARTS AGE
default chart-1618804240-node-feature-discovery-master-5f446799f4-sk7vg 1/1 Running 0 163m
default chart-1618804240-node-feature-discovery-worker-5sllh 1/1 Running 1 163m
default chart-1618804240-node-feature-discovery-worker-86w4w 1/1 Running 0 163m
default chart-1618804240-node-feature-discovery-worker-fl52v 1/1 Running 0 163m
default gpu-operator-945878fff-88thn 1/1 Running 0 163m
gpu-operator-resources gpu-feature-discovery-p6zqs 1/1 Running 0 53s
gpu-operator-resources gpu-feature-discovery-x88v4 1/1 Running 0 53s
gpu-operator-resources nvidia-container-toolkit-daemonset-27qj8 1/1 Running 0 26m
gpu-operator-resources nvidia-container-toolkit-daemonset-g5ndb 1/1 Running 0 26m
gpu-operator-resources nvidia-dcgm-exporter-c9vht 1/1 Running 0 74s
gpu-operator-resources nvidia-dcgm-exporter-mz7rh 1/1 Running 0 74s
gpu-operator-resources nvidia-device-plugin-daemonset-sqfdc 1/1 Running 0 25m
gpu-operator-resources nvidia-device-plugin-daemonset-wldkd 1/1 Running 0 25m
gpu-operator-resources nvidia-device-plugin-validation 0/1 Completed 0 2m47s
gpu-operator-resources nvidia-driver-daemonset-m4xjv 1/1 Running 0 163m
gpu-operator-resources nvidia-driver-daemonset-vkrz5 1/1 Running 5 163m
....
6. Test the GPU
The gpu-operator directory ships with a test.yaml file; create it directly:
[root@gpumaster gpu-operator]# kubectl create -f test.yaml
pod/dcgmproftester created
[root@gpumaster gpu-operator]# kubectl get po -o wide | grep dcgmproftester
dcgmproftester 1/1 Running 0 103s 172.26.243.149 10.168.100.4 <none> <none>
Find the nvidia-device-plugin-daemonset pod on 10.168.100.4 and check the GPU's power draw and memory usage on that node. The output shows that the workload really is using the GPU's compute units:
# kubectl exec nvidia-device-plugin-daemonset-wldkd -n gpu-operator-resources -- nvidia-smi
nvidia 33988608 269 nvidia_modeset,nvidia_uvm, Live 0xffffffffa05dd000 (PO)
Mon Apr 19 06:39:26 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... On | 00000000:00:08.0 Off | Off |
| N/A 61C P0 208W / 250W | 493MiB / 32510MiB | 84% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
When the test finishes, the pod enters the Completed state. Inspect its output:
# kubectl get po -o wide | grep dcgm
dcgmproftester 0/1 Completed 0 4m12s 172.26.243.149 10.168.100.4 <none> <none>
# kubectl logs dcgmproftester
.....
TensorEngineActive: generated ???, dcgm 0.000 (74380.8 gflops)
TensorEngineActive: generated ???, dcgm 0.000 (75398.9 gflops)
TensorEngineActive: generated ???, dcgm 0.000 (75787.6 gflops)
TensorEngineActive: generated ???, dcgm 0.000 (77173.9 gflops)
TensorEngineActive: generated ???, dcgm 0.000 (75669.5 gflops)
Skipping UnwatchFields() since DCGM validation is disabled
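To turn the log into a single figure, the gflops values in parentheses can be averaged. The sketch below does this with awk, assuming the log lines keep the `TensorEngineActive: ... (N gflops)` shape shown above; the function name `avg_gflops` is hypothetical.

```shell
#!/bin/sh
# Sketch: average the gflops values reported on TensorEngineActive log lines.

# avg_gflops: reads dcgmproftester log text on stdin and prints the mean of
# the numbers inside "(... gflops)", with one decimal place.
avg_gflops() {
    awk -F'[()]' '/TensorEngineActive/ { split($2, f, " "); sum += f[1]; n++ }
                  END { if (n) printf "%.1f\n", sum / n }'
}

# Example usage:
#   kubectl logs dcgmproftester | avg_gflops
```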