WorkingTipsOnx11lxdewine

环境准备

via:

x11docker --desktop --home --pulseaudio x11docker/lxde-wine

环境截图

PlayOnLinux:

/images/2021_04_24_07_22_11_520x457.jpg

软件列表:

/images/2021_04_24_07_22_48_855x552.jpg

选择“微软绘图”:

/images/2021_04_24_07_23_49_852x545.jpg

安装界面:

/images/2021_04_24_07_24_19_519x400.jpg

/images/2021_04_24_07_24_51_519x403.jpg

/images/2021_04_24_07_25_12_497x377.jpg

/images/2021_04_24_07_26_09_517x414.jpg

Age Of Empires:

/images/2021_04_24_07_29_38_850x546.jpg

WorkingTipsOnWinInDocker

制作Windows镜像

CentOS7上以以下方式启动虚拟机:

/usr/libexec/qemu-kvm -enable-kvm \
        -machine q35 -smp sockets=1,cores=1,threads=2 -m 2048 \
        -usb -device usb-kbd -device usb-tablet -rtc base=localtime \
        -net nic,model=virtio -net user,hostfwd=tcp::4444-:4444 \
        -drive file=hdd.img,media=disk,if=virtio \
        -drive file=/home/docker/win/cn_windows_10_consumer_editions_version_2004_x64_dvd.iso,media=cdrom \
        -drive file=/home/docker/win/virtio-win-0.1.141.iso,media=cdrom

用qemu提示的vnc端口访问该运行中的实例:

/images/2021_04_20_10_53_00_636x464.jpg

选择自定义安装:

/images/2021_04_20_10_53_46_593x403.jpg

需加载驱动程序:

/images/2021_04_20_10_55_28_488x496.jpg

选择好后的驱动:

/images/2021_04_20_10_55_56_414x92.jpg

忽略警告,继续:

/images/2021_04_20_10_56_46_553x286.jpg

继续安装直到安装完毕。

/images/2021_04_20_14_23_46_568x429.jpg

密码:

/images/2021_04_20_14_25_14_533x423.jpg

更新驱动程序:

/images/2021_04_20_14_40_16_616x407.jpg

选中E:\后更新:

/images/2021_04_20_14_41_40_548x273.jpg

此时关闭vm, 并创建一个overlay的image并使用该image启动一次vm:

$ qemu-img create -b hdd.img -f qcow2 snapshot.img
$ /usr/libexec/qemu-kvm -enable-kvm \
        -machine q35 -smp sockets=1,cores=1,threads=2 -m 2048 \
        -usb -device usb-kbd -device usb-tablet -rtc base=localtime \
        -net nic,model=virtio -net user,hostfwd=tcp::4444-:4444 \
        -drive file=snapshot.img,media=disk,if=virtio \
        -monitor stdio

在qemu终端内, 保存当前的状态后关机:

(qemu) savevm windows
Then type quit to stop VM:

(qemu) quit

因为有save后的状态,因而如果我们能保证容器内的qemu与容器外的qemu是同一版本的话,则可以快速恢复。

编译容器镜像

$ mv hdd.img snapshot.img image
$ cd image
$ docker build -t windows/win10qemu:20210420 .

在Centos7系列的操作系统上,因为宿主机的qemu版本与容器中的qemu版本差异,导致无法启动,需做以下修改:

# vim entrypoint.sh
....

  qemu-system-x86_64 -enable-kvm \
    -machine q35 -smp sockets=1,cores=1,threads=2 -m 2048 \
    -usb -device usb-kbd -device usb-tablet -rtc base=localtime \
    -net nic,model=virtio -net user,hostfwd=tcp::4444-:4444 \
    -drive file=snapshot.img,media=disk,if=virtio &
...
# vim Dockerfile
FROM windows/win10qemu:20210420
COPY entrypoint.sh /
# docker build -t win/win10new:latest .

运行容器:

# docker run -it --rm --privileged -p 4444:4444 -p 5915:5900  win/win10new:latest

打开vnc软件开始访问5915端口可以看到Windows桌面:

/images/2021_04_20_16_56_36_779x615.jpg

K8s中运行

由容器镜像创建出pod负载,service暴露即可。

WorkingTipsOnGPUOnCentOS7

1. 先决条件

各工作节点上需要保证内核为指定版本,并安装对应的kernel-ml-devel/kernel-ml-headers/gcc依赖包.

# uname -a
Linux worker2 4.19.12-1.el7.elrepo.x86_64 #1 SMP Fri Dec 21 11:06:36 EST 2018 x86_64 x86_64 x86_64 GNU/Linux
# rpm -e --nodeps kernel-headers
# yum install -y kernel-ml-devel kernel-ml-headers gcc

手动安装Nvidia驱动:

# ./NVIDIA-Linux-x86_64-460.32.03.run 
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 460.32.03...........
..........................................................
..........................................................

忽略该报错:

/images/2021_04_19_16_53_12_622x250.jpg

选择NO, 忽略安装32位兼容包:

/images/2021_04_19_16_53_55_639x172.jpg

OK结束安装:

/images/2021_04_19_16_54_23_623x183.jpg

检查驱动是否安装成功:

# nvidia-smi 
Mon Apr 19 04:55:13 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:00:08.0 Off |                  Off |
| N/A   31C    P0    36W / 250W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000000:00:0A.0 Off |                  Off |
| N/A   31C    P0    35W / 250W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

2. ccse改动

2.1 创建新的离线库:

引入nvidia-docker2相关的离线包并更新k8s-offline-pkgs仓库:

# cd k8s-offline-pkgs/
# tar xzvf /root/nvidiadocker.tar.gz -C .
libnvidia-container-tools-1.3.3-1.x86_64.rpm
nvidia-docker2-2.5.0-1.noarch.rpm
libnvidia-container1-1.3.3-1.x86_64.rpm
nvidia-container-toolkit-1.4.2-2.x86_64.rpm
nvidia-container-runtime-3.4.2-1.x86_64.rpm
# createrepo .

Ccse console节点上替换离线包:

[root@first x86_64]# pwd
/dcos/app/console/backend/webapps/repo/x86_64
[root@first x86_64]# mv k8s-offline-pkgs/ k8s-offline-pkgs.back
[root@first x86_64]# scp -r docker@10.168.100.1:/home/docker/k8s-offline-pkgs .

Ccse代码改动, 仅添加nvidia-docker2的安装:

# vi /dcos/app/console/kubeadm-playbook/roles/util/docker/tasks/install.yml

  - name: <安装docker><install-docker> 安装 docker (ccse源)
    shell: yum install -y docker-ce nvidia-docker2 --disablerepo=\* --enablerepo=ccse-k8s,ccse-centos7-base
    when: "yum_repo == 'ccse'"
# vi /dcos/app/console/kubeadm-playbook/roles/util/docker/templates/daemon.json.j2
    { 
    {% if custom_image_repository != '' %}{{ docker_insecure_registry_mirrors | indent(2,true) }}{% endif %}
      "storage-driver": "{{ docker_storage_driver }}",
      "graph": "{{ hosts_datadir_map[inventory_hostname] }}/docker",
      "log-driver": "json-file",
      "log-opts": {
                "max-size": "1g"
            },
      "default-runtime": "nvidia",
      "runtimes": {
          "nvidia": {
              "path": "/usr/bin/nvidia-container-runtime",
              "runtimeArgs": []
          }
      }
    }

3. 验证

相关包位于10.50.208.145/home/docker目录下的nvidiadockerclassic.tar:

# ls /home/docker/nvidiadockerclassic.tar  -l -h
-rw-r--r-- 1 root root 187M Apr 19 17:49 /home/docker/nvidiadockerclassic.tar

3.1 镜像准备

部署完毕后, ccse console节点上上传准备镜像:

# tar xvf nvidiadockerclassic.tar 
nvidiadockerclassic/
nvidiadockerclassic/nvidia-device-plugin.yml
nvidiadockerclassic/k8sdeviceplugin.tar
# cd nvidiadockerclassic
# docker load<k8sdeviceplugin.tar
# docker tag nvcr.io/nvidia/k8s-device-plugin:v0.9.0 10.168.100.144:8021/nvcr.io/nvidia/k8s-device-plugin:v0.9.0
# docker push 10.168.100.144:8021/nvcr.io/nvidia/k8s-device-plugin:v0.9.0

3.2 插件安装及验证

master节点上create nvidia-device-plugin.yml文件:

# kubectl create -f nvidia-device-plugin.yml 

验证device-plugin安装成功:

# kubectl  get po -A | grep device
kube-system   nvidia-device-plugin-daemonset-9mhq7           1/1     Running   0          19s
kube-system   nvidia-device-plugin-daemonset-m7txq           1/1     Running   0          19s
# kubectl logs nvidia-device-plugin-daemonset-9mhq7 -n kube-system
2021/04/19 09:53:23 Loading NVML
2021/04/19 09:53:23 Starting FS watcher.
2021/04/19 09:53:23 Starting OS watcher.
2021/04/19 09:53:23 Retreiving plugins.
2021/04/19 09:53:23 Starting GRPC server for 'nvidia.com/gpu'
2021/04/19 09:53:23 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2021/04/19 09:53:23 Registered device plugin for 'nvidia.com/gpu' with Kubelet

测试:

# kubectl create -f test.yml
# kubectl  get po -o wide
NAME             READY   STATUS    RESTARTS   AGE   IP               NODE             NOMINATED NODE   READINESS GATES
dcgmproftester   1/1     Running   0          19s   172.26.189.204   10.168.100.184   <none>           <none>
# nvidia-smi 
Mon Apr 19 05:54:55 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:00:09.0 Off |                  Off |
| N/A   56C    P0   218W / 250W |    493MiB / 32510MiB |     88%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     14182      C   /usr/bin/dcgmproftester11         489MiB |
+-----------------------------------------------------------------------------+

WorkingTipsOnGpu

1. 环境配置信息

整个验证环境的配置信息如下:

gpumaster: 10.168.100.2	4核16G
gpunode1: 10.168.100.3	4核16G PCI直通B5:00 Tesla V100
gpunode2: 10.168.100.4	4核16G PCI直通B2:00 Tesla V100

节点的操作系统配置如下, CentOS 7.6最小化安装方式:

# uname -a
Linux gpumaster 3.10.0-957.el7.x86_64 #1 SMP Thu Nov 8 23:39:32 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
# cat /etc/redhat-release 
CentOS Linux release 7.6.1810 (Core)

其中master节点上外挂了一块500G 的数据盘,需要手动挂载至/dcos目录:

[root@gpumaster ~]# df -h | grep dcos
/dev/vdb1                493G   73M  467G   1% /dcos
[root@gpumaster ~]# cat /etc/fstab | grep dcos
/dev/vdb1        /dcos                       ext4       defaults        0 0

3个节点依次关闭selinux/firewalld:

# vi /etc/selinux/config
...
SELINUX=disabled
...
# systemctl disable firewalld
# reboot

2. 部署CCSE集群

依次添加节点:

/images/2021_04_19_09_01_57_825x247.jpg

新增一个名为gpucluster的集群:

/images/2021_04_19_09_06_40_828x248.jpg

集群创建完毕后,新增两个GPU节点:

/images/2021_04_19_09_16_24_1099x449.jpg

添加完成后,检查集群状态:

[root@gpumaster ~]# kubectl get node
NAME           STATUS   ROLES    AGE     VERSION
10.168.100.2   Ready    master   6m19s   v1.17.3
10.168.100.3   Ready    node     78s     v1.17.3
10.168.100.4   Ready    node     78s     v1.17.3

3. 升级内核

在三个节点上,依次执行以下操作以升级内核。

配置离线软件库:

# cd /etc/yum.repos.d
# mkdir back
# mv CentOS-* back
# vi nvidia.repo
[nvidia]
name=nvidia
baseurl=http://10.168.100.144:8200/repo/x86_64/nvidiarpms
gpgcheck=0
enabled=1
proxy=_none_
# yum install -y kernel-ml

配置grub启动:

# vi /etc/default/grub
...
GRUB_DEFAULT=0
...
GRUB_CMDLINE_LINUX="crashkernel=auto rd.lvm.lv=centos/root rd.lvm.lv=centos/swap rhgb quiet rd.driver.blacklist=nouveau nouveau.modeset=0"
...
# grub2-mkconfig -o /boot/grub2/grub.cfg

完全禁用系统自带的nouveau驱动:

# echo 'install nouveau /bin/false' >> /etc/modprobe.d/nouveau.conf

执行完上述操作后需重启机器并验证内核是否更改成功:

# uname -a
Linux gpunode2 4.19.12-1.el7.elrepo.x86_64 #1 SMP Fri Dec 21 11:06:36 EST 2018 x86_64 x86_64 x86_64 GNU/Linux

4. gpu-operator文件准备

Harbor中预上传的镜像文件列表如下(nvcr.io及nvidia):

/images/2021_04_19_11_22_13_829x264.jpg

10.168.100.1上scp以下目录到所有节点:

$ scp -r docker@10.168.100.1:/home/docker/nvidia_items .

预Load nfd镜像:

# docker load<quay.tar
...
Loaded image: quay.io/kubernetes_incubator/node-feature-discovery:v0.6.0

5. 安装NVIDIA/gpu-operator

登录到gpumaster节点,从文件创建一个部署charts时需用到的configmap:

# cat ccse.repo
[ccse-k8s]
name=Centos local yum repo for k8s
baseurl=http://10.168.100.144:8200/repo/x86_64/k8s-offline-pkgs
gpgcheck=0
enabled=1
proxy=_none_

[ccse-centos7-base]
name=Centos local yum repo for k8s
baseurl=http://10.168.100.144:8200/repo/x86_64/centos7-base
gpgcheck=0
enabled=1
proxy=_none_

[fuck]
name=Centos local yum repo for k8s 111
baseurl=http://10.168.100.144:8200/repo/x86_64/nvidiarpms
gpgcheck=0
enabled=1
proxy=_none_
# kubectl create namespace gpu-operator-resources
namespace/gpu-operator-resources created
# kubectl create configmap repo-config -n gpu-operator-resources --from-file=ccse.repo
configmap/repo-config created

现在创建gpu-operator实例:

# cd gpu-operator/
#  helm install --generate-name . -f values.yaml

检查实例运行情况:

# kubectl get po
NAME                                                              READY   STATUS    RESTARTS   AGE
chart-1618803326-node-feature-discovery-master-655c6997cd-fp465   1/1     Running   0          65s
chart-1618803326-node-feature-discovery-worker-7flft              1/1     Running   0          65s
chart-1618803326-node-feature-discovery-worker-mkqm7              1/1     Running   0          65s
chart-1618803326-node-feature-discovery-worker-w2d44              1/1     Running   0          65s
gpu-operator-945878fff-l22vc                                      1/1     Running   0          65s

给GPU节点手动添加标签,gpu-operator-resources命名空间下的实例运行情况:

使能GPU驱动安装:

# kubectl label nodes 10.168.100.3 nvidia.com/gpu.deploy.driver=true       
node/10.168.100.3 labeled
# kubectl label nodes 10.168.100.4 nvidia.com/gpu.deploy.driver=true       
node/10.168.100.4 labeled

检查GPU驱动编译情况:

# kubectl  get po -n gpu-operator-resources
NAME                            READY   STATUS    RESTARTS   AGE
nvidia-driver-daemonset-w6d2q   1/1     Running   0          86s
nvidia-driver-daemonset-zmf9l   1/1     Running   0          86s
# kubectl logs po nvidia-driver-daemonset-zmf9l -n gpu-operator-resources
Installation of the kernel module for the NVIDIA Accelerated Graphics Driver for Linux-x86_64 (version 460.32.03) is now complete.

Loading IPMI kernel module...
Loading NVIDIA driver kernel modules...
Starting NVIDIA persistence daemon...
Mounting NVIDIA driver rootfs...
Done, now waiting for signal

使能device-plugin, dcgm-exporter等:

# kubectl label nodes 10.168.100.4 nvidia.com/gpu.deploy.container-toolkit=true
# kubectl label nodes 10.168.100.4 nvidia.com/gpu.deploy.device-plugin=true
# kubectl label nodes 10.168.100.4 nvidia.com/gpu.deploy.dcgm-exporter=true
# kubectl label nodes 10.168.100.4 nvidia.com/gpu.deploy.gpu-feature-discovery=true

# kubectl label nodes 10.168.100.3 nvidia.com/gpu.deploy.container-toolkit=true
# kubectl label nodes 10.168.100.3 nvidia.com/gpu.deploy.device-plugin=true
# kubectl label nodes 10.168.100.3  nvidia.com/gpu.deploy.dcgm-exporter=true
# kubectl label nodes 10.168.100.3 nvidia.com/gpu.deploy.gpu-feature-discovery=true

检查toolkit-daemonset运行情况,会发现Init:ImagePullBackOff报错信息:

# kubectl get po -n gpu-operator-resources
NAME                                       READY   STATUS                  RESTARTS   AGE
nvidia-container-toolkit-daemonset-6kqq5   0/1     Init:ImagePullBackOff   0          2m16s
nvidia-container-toolkit-daemonset-cbww2   0/1     Init:ImagePullBackOff   0          4m1s
# kubectl logs nvidia-container-toolkit-daemonset-cbww2 -n gpu-operator-resources
  Normal   BackOff         3m31s (x7 over 4m46s)  kubelet, 10.168.100.4  Back-off pulling image "10.168.100.144:8021/nvcr.io/nvidia/k8s/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59"
  Warning  Failed          3m31s (x7 over 4m46s)  kubelet, 10.168.100.4  Error: ImagePullBackOff
  Normal   Pulling         3m20s (x4 over 4m48s)  kubelet, 10.168.100.4  Pulling image "10.168.100.144:8021/nvcr.io/nvidia/k8s/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59"

这是因为pod拉取的镜像tag不对所导致,需要手动修改image的tag:

# kubectl get ds -n gpu-operator-resources
NAME                                 DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                  AGE
nvidia-container-toolkit-daemonset   2         2         0       2            0           nvidia.com/gpu.deploy.container-toolkit=true   133m
nvidia-driver-daemonset              2         2         2       2            2           nvidia.com/gpu.deploy.driver=true              135m
# kubectl edit ds nvidia-container-toolkit-daemonset -n gpu-operator-resources
        #image: 10.168.100.144:8021/nvcr.io/nvidia/k8s/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59
        image: 10.168.100.144:8021/nvcr.io/nvidia/cuda:11.2.1-base-ubi8

刷新pod运行情况,可以看到nvidia-container-toolkit-daemonsetnvidia-device-plugin-daemonset运行正常,而nvidia-device-plugin-validationInit:CreashLoopBackOff失败:

# kubectl get po -n gpu-operator-resources
NAME                                       READY   STATUS                  RESTARTS   AGE
nvidia-container-toolkit-daemonset-27qj8   1/1     Running                 0          52s
nvidia-container-toolkit-daemonset-g5ndb   1/1     Running                 0          51s
nvidia-device-plugin-daemonset-sqfdc       1/1     Running                 0          26s
nvidia-device-plugin-daemonset-wldkd       1/1     Running                 0          26s
nvidia-device-plugin-validation            0/1     Init:CrashLoopBackOff   1          9s
nvidia-driver-daemonset-m4xjv              1/1     Running                 0          137m
nvidia-driver-daemonset-vkrz5              1/1     Running                 5          137m

定位该validation所在的节点名(此例中为10.168.100.3):

# kubectl get po nvidia-device-plugin-validation -n  gpu-operator-resources -o wide
NAME                              READY   STATUS                  RESTARTS   AGE     IP              NODE           NOMINATED NODE   READINESS GATES
nvidia-device-plugin-validation   0/1     Init:CrashLoopBackOff   4          2m55s   172.26.222.10   10.168.100.3   <none>           <none>

获取启动失败原因:

# kubectl describe po nvidia-device-plugin-validation -n gpu-operator-resources
......
  Warning  Failed            56s (x5 over 2m21s)   kubelet, 10.168.100.3  Error: failed to start container "device-plugin-validation-init": Error response from daemon: linux runtime spec devices: error gathering device information while adding custom device "/dev/nvidiactl": no such file or directory

登录10.168.100.3节点,获取/dev下驱动程序设备名:

# docker ps | grep nvidia-device-plugin-daemonset | grep -v pause
abbea480fdf2        10.168.100.144:8021/nvcr.io/nvidia/k8s-device-plugin       "nvidia-device-plugin"   6 minutes ago       Up 6 minutes                            k8s_nvidia-device-plugin-ctr_nvidia-device-plugin-daemonset-sqfdc_gpu-operator-resources_b9988b02-82a6-4637-a7f0-fdee5a448d60_0
# docker exec -it k8s_nvidia-device-plugin-ctr_nvidia-device-plugin-daemonset-sqfdc_gpu-operator-resources_b9988b02-82a6-4637-a7f0-fdee5a448d60_0 /bin/bash
[root@nvidia-device-plugin-daemonset-sqfdc /]# ls /dev/nvidia* -l -h
crw-rw-rw- 1 root root 195, 254 Apr 19 03:52 /dev/nvidia-modeset
crw-rw-rw- 1 root root 237,   0 Apr 19 06:08 /dev/nvidia-uvm
crw-rw-rw- 1 root root 237,   1 Apr 19 06:08 /dev/nvidia-uvm-tools
crw-rw-rw- 1 root root 195,   0 Apr 19 03:52 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 Apr 19 03:52 /dev/nvidiactl
[root@nvidia-device-plugin-daemonset-sqfdc /]# exit

在主机级别(10.168.100.3)上手动创建/dev/nvidiactl文件, 依据同样步骤在10.168.100.4上查找到相应的设备驱动号也添加/dev/nvidiactl文件:

[root@gpunode1 ~]# mknod -m 666 /dev/nvidiactl c 195 255
[root@gpunode1 ~]# ls /dev/nvidiactl -l
crw-rw-rw- 1 root root 195, 255 Apr 19 02:19 /dev/nvidiactl

delete掉nvidia-device-plugin-validation这个pod后,kubelet将重新拉起一个,此时报错信息有变化,提示缺少/dev/nvidia-uvm设备驱动文件:

  Warning  Failed     10s (x2 over 11s)  kubelet, 10.168.100.4  Error: failed to start container "device-plugin-validation-init": Error response from daemon: linux runtime spec devices: error gathering device information while adding custom device "/dev/nvidia-uvm": no such file or directory

按照上面创建/dev/nvidiactl的方法创建/dev/nvidia-uvm驱动文件,注意设备号与容器中保持一致:

# mknod -m 666 /dev/nvidia-uvm c 237 0

删除pod后重新拉起,报错信息为缺少/dev/nvidia-uvm-tools:

  Warning  Failed     9s (x2 over 10s)  kubelet, 10.168.100.4  Error: failed to start container "device-plugin-validation-init": Error response from daemon: linux runtime spec devices: error gathering device information while adding custom device "/dev/nvidia-uvm-tools": no such file or directory

手动创建nvidia-uvm-tools设备文件后删除pod等待kubelet重新拉起pod:

# mknod -m 666 /dev/nvidia-uvm-tools c 237 1
  Warning  Failed     12s (x2 over 12s)  kubelet, 10.168.100.3  Error: failed to start container "device-plugin-validation-init": Error response from daemon: linux runtime spec devices: error gathering device information while adding custom device "/dev/nvidia-modeset": no such file or directory

手动创建nvidia-modeset设备文件后删除pod等待kubelet重新拉起pod:

# mknod -m 666 /dev/nvidia-modeset c 195 254
  Warning  Failed     13s (x2 over 14s)  kubelet, 10.168.100.4  Error: failed to start container "device-plugin-validation-init": Error response from daemon: linux runtime spec devices: error gathering device information while adding custom device "/dev/nvidia0": no such file or directory

手动创建nvidia0设备文件后删除pod等待kubelet重新拉起pod:

# mknod -m 666 /dev/nvidia0 c 195 0
# kubectl  get po -A | grep device-plugin-validation
gpu-operator-resources   nvidia-device-plugin-validation                                   0/1     Completed   0          2m26s

此时kubelet将继续拉起剩余的nvidia资源,最终状态应该是:

# kubectl  get po -A
NAMESPACE                NAME                                                              READY   STATUS      RESTARTS   AGE
default                  chart-1618804240-node-feature-discovery-master-5f446799f4-sk7vg   1/1     Running     0          163m
default                  chart-1618804240-node-feature-discovery-worker-5sllh              1/1     Running     1          163m
default                  chart-1618804240-node-feature-discovery-worker-86w4w              1/1     Running     0          163m
default                  chart-1618804240-node-feature-discovery-worker-fl52v              1/1     Running     0          163m
default                  gpu-operator-945878fff-88thn                                      1/1     Running     0          163m
gpu-operator-resources   gpu-feature-discovery-p6zqs                                       1/1     Running     0          53s
gpu-operator-resources   gpu-feature-discovery-x88v4                                       1/1     Running     0          53s
gpu-operator-resources   nvidia-container-toolkit-daemonset-27qj8                          1/1     Running     0          26m
gpu-operator-resources   nvidia-container-toolkit-daemonset-g5ndb                          1/1     Running     0          26m
gpu-operator-resources   nvidia-dcgm-exporter-c9vht                                        1/1     Running     0          74s
gpu-operator-resources   nvidia-dcgm-exporter-mz7rh                                        1/1     Running     0          74s
gpu-operator-resources   nvidia-device-plugin-daemonset-sqfdc                              1/1     Running     0          25m
gpu-operator-resources   nvidia-device-plugin-daemonset-wldkd                              1/1     Running     0          25m
gpu-operator-resources   nvidia-device-plugin-validation                                   0/1     Completed   0          2m47s
gpu-operator-resources   nvidia-driver-daemonset-m4xjv                                     1/1     Running     0          163m
gpu-operator-resources   nvidia-driver-daemonset-vkrz5                                     1/1     Running     5          163m
....

6. 测试GPU

gpu-operator目录下预置了一个test.yaml文件,直接创建:

[root@gpumaster gpu-operator]# kubectl create -f test.yaml
pod/dcgmproftester created
[root@gpumaster gpu-operator]# kubectl  get po -o wide | grep dcgmproftester
dcgmproftester                                                    1/1     Running   0          103s   172.26.243.149   10.168.100.4   <none>           <none>

找寻到10.168.100.4上的nvidia-device-plugin-daemonset的pod, 观察该节点上gpu的功耗及显存占用情况,可以看到该工作负载确实使用了gpu中的运算单元:

# kubectl exec nvidia-device-plugin-daemonset-wldkd -n gpu-operator-resources nvidia-smi
nvidia 33988608 269 nvidia_modeset,nvidia_uvm, Live 0xffffffffa05dd000 (PO)
Mon Apr 19 06:39:26 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:00:08.0 Off |                  Off |
| N/A   61C    P0   208W / 250W |    493MiB / 32510MiB |     84%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

测试完毕后该pod将处于completed状态,观察其输出:

# kubectl  get po -o wide | grep dcgm
dcgmproftester                                                    0/1     Completed   0          4m12s   172.26.243.149   10.168.100.4   <none>           <none>
# kubectl  logs dcgmproftester
.....
TensorEngineActive: generated ???, dcgm 0.000 (74380.8 gflops)
TensorEngineActive: generated ???, dcgm 0.000 (75398.9 gflops)
TensorEngineActive: generated ???, dcgm 0.000 (75787.6 gflops)
TensorEngineActive: generated ???, dcgm 0.000 (77173.9 gflops)
TensorEngineActive: generated ???, dcgm 0.000 (75669.5 gflops)
Skipping UnwatchFields() since DCGM validation is disabled

WorkingTipsOnShrinkingCCSE

Repository Shrinking

Create a new vm via following command:

# cd /var/lib/libvirt/qemu/save
### Following is for creating a new vm for saving rpms
# virsh dumpxml node1>example.xml
# vim example.xml 
# qemu-img create -f qcow2 -b ccsebaseimage.qcow2 saverpms.qcow2 
Formatting 'saverpms.qcow2', fmt=qcow2 size=536870912000 backing_file=ccsebaseimage.qcow2 cluster_size=65536 lazy_refcounts=off refcount_bits=16
# virsh define example.xml 
Domain nodetmp defined from example.xml
# virsh start nodetmp
Domain nodetmp started
# virsh net-dhcp-leases default
### Getting the ip address for nodetmp(10.17.18.199)
# scp ./ccse-offline-files.tar.gz root@10.17.18.199:/home/
# ssh root@10.17.18.199

Following is on 10.17.18.199:

[root@first ~]# cd /home/
[root@first home]# tar xzvf ccse-offline-files.tar.gz
# vi /etc/yum.conf
keepcache=1
### Add a new vm disk (vdb)
# fdisk /dev/vdb
# mkfs.ext4 /dev/vdb1
# mkdir /dcos
# mount /dev/vdb1 /dcos
# vi /etc/fstab 
/dev/vdb1 /dcos                   ext4     defaults        0 0
# mount -a
# exit

Bug-fix(lsof):

# scp ./Packages/lsof-4.87-6.el7.x86_64.rpm root@10.17.18.199:/root/

Backup the vm disks on host machine:

# virsh destroy nodetmp
# mv saverpms.qcow2 saverpms1.qcow2
# qemu-img create -f qcow2 -b saverpms1.qcow2 saverpms.qcow2
# virsh start nodetmp
# ssh root@10.17.18.199

Re-login, and run:

#  rpm -ivh /root/lsof-4.87-6.el7.x86_64.rpm 
# cd /home/ccse-xxxxxxxx
# vi config/config.yaml
common:
  # 控制台和/或Harbor所在的主机IP
  host: 10.17.18.199
# vim ./files/offline-repo/ccse-centos7-base.repo
	#[ccse-centos7-base]
	#name=ccse-offline-repo
	#baseurl=file://{centos7_base_repo_dir}
	#enabled=1
	#gpgcheck=0
	[ccse-centos7-base]
	name=Centos local yum repo for k8s
	baseurl=http://10.17.18.2:8200/repo/x86_64/centos7-base
	gpgcheck=0
	enabled=1
	proxy=_none_
# ./deploy.sh install all 2>&1 | sudo tee install-log_`date "+%Y%m%d%H%M"`

Notice, 10.17.18.2 is for existing ccse console.
After deployment, the cached rpms is listed as:

# find /var/cache | grep rpm$
/var/cache/yum/x86_64/7/ccse-centos7-base/packages/audit-2.8.5-4.el7.x86_64.rpm
/var/cache/yum/x86_64/7/ccse-centos7-base/packages/audit-libs-2.8.5-4.el7.x86_64.rpm
/var/cache/yum/x86_64/7/ccse-centos7-base/packages/audit-libs-python-2.8.5-4.el7.x86_64.rpm
/var/cache/yum/x86_64/7/ccse-centos7-base/packages/checkpolicy-2.5-8.el7.x86_64.rpm
/var/cache/yum/x86_64/7/ccse-centos7-base/packages/libsemanage-python-2.5-14.el7.x86_64.rpm
/var/cache/yum/x86_64/7/ccse-centos7-base/packages/policycoreutils-2.5-34.el7.x86_64.rpm
/var/cache/yum/x86_64/7/ccse-centos7-base/packages/policycoreutils-python-2.5-34.el7.x86_64.rpm
/var/cache/yum/x86_64/7/ccse-centos7-base/packages/python-IPy-0.75-6.el7.noarch.rpm
/var/cache/yum/x86_64/7/ccse-centos7-base/packages/setools-libs-3.3.8-4.el7.x86_64.rpm
/var/cache/yum/x86_64/7/ccse-centos7-base/packages/libcgroup-0.41-21.el7.x86_64.rpm
/var/cache/yum/x86_64/7/ccse-centos7-base/packages/unzip-6.0-21.el7.x86_64.rpm

Now enable the visit for ccse console(web ui):

# systemctl stop firewalld
# systemctl disable firewalld
# setenforce 0
# vi /etc/selinux/config 
SELINUX=disabled

ccse webui:

/images/2021_04_12_10_37_55_515x298.jpg

Create a new vm and added it on ccse webui, in newly added vm do following command:

# vi /etc/yum.conf 
keepcached
# systemctl stop firewalld
# systemctl disable firewalld
# setenforce 0
# vi /etc/selinux/config 
SELINUX=disabled

Create a new cluster, and fetch the new vm’s rpm cache:

[root@first cache]# find . | grep rpm$
./yum/x86_64/7/ccse-centos7-base/packages/audit-2.8.5-4.el7.x86_64.rpm
./yum/x86_64/7/ccse-centos7-base/packages/audit-libs-2.8.5-4.el7.x86_64.rpm
./yum/x86_64/7/ccse-centos7-base/packages/checkpolicy-2.5-8.el7.x86_64.rpm
./yum/x86_64/7/ccse-centos7-base/packages/audit-libs-python-2.8.5-4.el7.x86_64.rpm
./yum/x86_64/7/ccse-centos7-base/packages/libsemanage-python-2.5-14.el7.x86_64.rpm
./yum/x86_64/7/ccse-centos7-base/packages/libcgroup-0.41-21.el7.x86_64.rpm
./yum/x86_64/7/ccse-centos7-base/packages/policycoreutils-2.5-34.el7.x86_64.rpm
./yum/x86_64/7/ccse-centos7-base/packages/python-IPy-0.75-6.el7.noarch.rpm
./yum/x86_64/7/ccse-centos7-base/packages/setools-libs-3.3.8-4.el7.x86_64.rpm
./yum/x86_64/7/ccse-centos7-base/packages/policycoreutils-python-2.5-34.el7.x86_64.rpm
./yum/x86_64/7/ccse-centos7-base/packages/conntrack-tools-1.4.4-7.el7.x86_64.rpm
./yum/x86_64/7/ccse-centos7-base/packages/libnetfilter_cttimeout-1.0.0-7.el7.x86_64.rpm
./yum/x86_64/7/ccse-centos7-base/packages/libnetfilter_queue-1.0.2-2.el7_2.x86_64.rpm
./yum/x86_64/7/ccse-centos7-base/packages/socat-1.7.3.2-2.el7.x86_64.rpm
./yum/x86_64/7/ccse-centos7-base/packages/libnetfilter_cthelper-1.0.0-11.el7.x86_64.rpm
./yum/x86_64/7/ccse-k8s/packages/container-selinux-2.119.1-1.c57a6f9.el7.noarch.rpm
./yum/x86_64/7/ccse-k8s/packages/docker-ce-18.09.9-3.el7.x86_64.rpm
./yum/x86_64/7/ccse-k8s/packages/containerd.io-1.2.13-3.2.el7.x86_64.rpm
./yum/x86_64/7/ccse-k8s/packages/docker-ce-cli-18.09.9-3.el7.x86_64.rpm
./yum/x86_64/7/ccse-k8s/packages/3f1db71d0bb6d72bc956d788ffee737714e5717c629b26355a2dcf1dba4ad231-kubelet-1.17.3-0.x86_64.rpm
./yum/x86_64/7/ccse-k8s/packages/548a0dcd865c16a50980420ddfa5fbccb8b59621179798e6dc905c9bf8af3b34-kubernetes-cni-0.7.5-0.x86_64.rpm
./yum/x86_64/7/ccse-k8s/packages/35625b6ab1da6c58ce4946742181c0dcf9ac9b6c2b5bea2c13eed4876024c342-kubectl-1.17.3-0.x86_64.rpm

harbor shrinking

Save the harbor images:

[root@first ~]# docker save -o harbor.tar goharbor/chartmuseum-photon:v0.9.0-v1.8.6 goharbor/harbor-migrator:v1.8.6 goharbor/redis-photon:v1.8.6 goharbor/clair-photon:v2.1.0-v1.8.6 goharbor/notary-server-photon:v0.6.1-v1.8.6 goharbor/notary-signer-photon:v0.6.1-v1.8.6 goharbor/harbor-registryctl:v1.8.6 goharbor/registry-photon:v2.7.1-patch-2819-v1.8.6 goharbor/nginx-photon:v1.8.6 goharbor/harbor-log:v1.8.6 goharbor/harbor-jobservice:v1.8.6 goharbor/harbor-core:v1.8.6 goharbor/harbor-portal:v1.8.6 goharbor/harbor-db:v1.8.6 goharbor/prepare:v1.8.6
[root@first ~]# ls -l -h harbor.tar 
-rw-------. 1 root root 1.5G Apr 11 23:31 harbor.tar
[root@first ~]# cp harbor.tar harbor.tar.back
[root@first ~]# xz -T4 harbor.tar
[root@first ~]# ls -l -h harbor.tar.*
-rw-------. 1 root root 1.5G Apr 11 23:31 harbor.tar.back
-rw-------. 1 root root 428M Apr 11 23:31 harbor.tar.xz

rpm

rpms combine:

[root@first rpms]# ls -l -h | wc -l
12
##### After transferring from working node
#########################################
[root@first rpms]# cp /tmp/rpms/* .
cp: overwrite ‘./audit-2.8.5-4.el7.x86_64.rpm’? y
cp: overwrite ‘./audit-libs-2.8.5-4.el7.x86_64.rpm’? y
cp: overwrite ‘./audit-libs-python-2.8.5-4.el7.x86_64.rpm’? y
cp: overwrite ‘./checkpolicy-2.5-8.el7.x86_64.rpm’? y
cp: overwrite ‘./libcgroup-0.41-21.el7.x86_64.rpm’? y
cp: overwrite ‘./libsemanage-python-2.5-14.el7.x86_64.rpm’? y
cp: overwrite ‘./policycoreutils-2.5-34.el7.x86_64.rpm’? y
cp: overwrite ‘./policycoreutils-python-2.5-34.el7.x86_64.rpm’? y
cp: overwrite ‘./python-IPy-0.75-6.el7.noarch.rpm’? y
cp: overwrite ‘./setools-libs-3.3.8-4.el7.x86_64.rpm’? y
[root@first rpms]# ls -l -h | wc -l
17