WorkingTipsOnx11lxdewine
Apr 24, 2021 · Technology
Environment Preparation
Launch the environment via:
x11docker --desktop --home --pulseaudio x11docker/lxde-wine
Environment Screenshots
PlayOnLinux:
Software list:
Select "Microsoft Paint":
Installation screen:
Age Of Empires:
On CentOS 7, start the virtual machine in the following way:
/usr/libexec/qemu-kvm -enable-kvm \
-machine q35 -smp sockets=1,cores=1,threads=2 -m 2048 \
-usb -device usb-kbd -device usb-tablet -rtc base=localtime \
-net nic,model=virtio -net user,hostfwd=tcp::4444-:4444 \
-drive file=hdd.img,media=disk,if=virtio \
-drive file=/home/docker/win/cn_windows_10_consumer_editions_version_2004_x64_dvd.iso,media=cdrom \
-drive file=/home/docker/win/virtio-win-0.1.141.iso,media=cdrom
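The base disk hdd.img is not created anywhere above; assuming a fresh qcow2 disk (the 40G size here is just an example), it can be prepared beforehand with:
$ qemu-img create -f qcow2 hdd.img 40G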
Access the running instance via the VNC port reported by qemu:
Select custom installation:
Drivers need to be loaded:
The selected driver:
Ignore the warning and continue:
Continue the installation until it completes.
Password:
Update the driver:
Select E:\ and update:
Now shut down the VM, create an overlay image, and boot the VM once from that image:
$ qemu-img create -b hdd.img -f qcow2 snapshot.img
$ /usr/libexec/qemu-kvm -enable-kvm \
-machine q35 -smp sockets=1,cores=1,threads=2 -m 2048 \
-usb -device usb-kbd -device usb-tablet -rtc base=localtime \
-net nic,model=virtio -net user,hostfwd=tcp::4444-:4444 \
-drive file=snapshot.img,media=disk,if=virtio \
-monitor stdio
In the qemu monitor, save the current state and then shut down:
(qemu) savevm windows
Then type quit to stop the VM:
(qemu) quit
Because we now have the saved state, the VM can be restored quickly, provided the qemu inside the container is the same version as the qemu outside it.
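With matching versions, the saved state can also be resumed directly at startup using qemu's -loadvm option; a minimal sketch reusing the snapshot name saved above:
$ /usr/libexec/qemu-kvm -enable-kvm \
-machine q35 -smp sockets=1,cores=1,threads=2 -m 2048 \
-usb -device usb-kbd -device usb-tablet -rtc base=localtime \
-net nic,model=virtio -net user,hostfwd=tcp::4444-:4444 \
-drive file=snapshot.img,media=disk,if=virtio \
-loadvm windows
Both disk files are then moved into the image/ build context and baked into a container image: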
$ mv hdd.img snapshot.img image
$ cd image
$ docker build -t windows/win10qemu:20210420 .
On CentOS 7 series operating systems, the VM fails to start because the qemu version on the host differs from the one inside the container, so the following changes are needed:
# vim entrypoint.sh
....
qemu-system-x86_64 -enable-kvm \
-machine q35 -smp sockets=1,cores=1,threads=2 -m 2048 \
-usb -device usb-kbd -device usb-tablet -rtc base=localtime \
-net nic,model=virtio -net user,hostfwd=tcp::4444-:4444 \
-drive file=snapshot.img,media=disk,if=virtio &
...
# vim Dockerfile
FROM windows/win10qemu:20210420
COPY entrypoint.sh /
# docker build -t win/win10new:latest .
Run the container:
# docker run -it --rm --privileged -p 4444:4444 -p 5915:5900 win/win10new:latest
Open a VNC client and connect to port 5915 to see the Windows desktop:
Create a pod workload from the container image and expose it with a Service.
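A minimal sketch of such a workload and Service (the names, ports, and image path below are illustrative assumptions, not an actual manifest from this setup; the container needs privileged mode, just like the docker run above):
cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: win10-qemu
spec:
  replicas: 1
  selector:
    matchLabels:
      app: win10-qemu
  template:
    metadata:
      labels:
        app: win10-qemu
    spec:
      containers:
      - name: win10
        image: win/win10new:latest   # assumed to be pushed to a registry reachable by the cluster
        securityContext:
          privileged: true           # same requirement as docker run --privileged
        ports:
        - containerPort: 5900        # VNC
        - containerPort: 4444
---
apiVersion: v1
kind: Service
metadata:
  name: win10-qemu
spec:
  type: NodePort
  selector:
    app: win10-qemu
  ports:
  - name: vnc
    port: 5900
    targetPort: 5900
  - name: fwd
    port: 4444
    targetPort: 4444
EOF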
Every worker node must run the specified kernel version and have the matching kernel-ml-devel/kernel-ml-headers/gcc dependency packages installed.
# uname -a
Linux worker2 4.19.12-1.el7.elrepo.x86_64 #1 SMP Fri Dec 21 11:06:36 EST 2018 x86_64 x86_64 x86_64 GNU/Linux
# rpm -e --nodeps kernel-headers
# yum install -y kernel-ml-devel kernel-ml-headers gcc
Manually install the NVIDIA driver:
# ./NVIDIA-Linux-x86_64-460.32.03.run
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 460.32.03...........
..........................................................
..........................................................
Ignore this error:
Select NO to skip installing the 32-bit compatibility packages:
Press OK to finish the installation:
Verify the driver was installed successfully:
# nvidia-smi
Mon Apr 19 04:55:13 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... Off | 00000000:00:08.0 Off | Off |
| N/A 31C P0 36W / 250W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-PCIE... Off | 00000000:00:0A.0 Off | Off |
| N/A 31C P0 35W / 250W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Bring in the nvidia-docker2 related offline packages and update the k8s-offline-pkgs repository:
# cd k8s-offline-pkgs/
# tar xzvf /root/nvidiadocker.tar.gz -C .
libnvidia-container-tools-1.3.3-1.x86_64.rpm
nvidia-docker2-2.5.0-1.noarch.rpm
libnvidia-container1-1.3.3-1.x86_64.rpm
nvidia-container-toolkit-1.4.2-2.x86_64.rpm
nvidia-container-runtime-3.4.2-1.x86_64.rpm
# createrepo .
Replace the offline packages on the ccse console node:
[root@first x86_64]# pwd
/dcos/app/console/backend/webapps/repo/x86_64
[root@first x86_64]# mv k8s-offline-pkgs/ k8s-offline-pkgs.back
[root@first x86_64]# scp -r docker@10.168.100.1:/home/docker/k8s-offline-pkgs .
ccse code change, only adding the installation of nvidia-docker2:
# vi /dcos/app/console/kubeadm-playbook/roles/util/docker/tasks/install.yml
- name: <安装docker><install-docker> 安装 docker (ccse源)
shell: yum install -y docker-ce nvidia-docker2 --disablerepo=\* --enablerepo=ccse-k8s,ccse-centos7-base
when: "yum_repo == 'ccse'"
# vi /dcos/app/console/kubeadm-playbook/roles/util/docker/templates/daemon.json.j2
{
{% if custom_image_repository != '' %}{{ docker_insecure_registry_mirrors | indent(2,true) }}{% endif %}
"storage-driver": "{{ docker_storage_driver }}",
"graph": "{{ hosts_datadir_map[inventory_hostname] }}/docker",
"log-driver": "json-file",
"log-opts": {
"max-size": "1g"
},
"default-runtime": "nvidia",
"runtimes": {
"nvidia": {
"path": "/usr/bin/nvidia-container-runtime",
"runtimeArgs": []
}
}
}
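After the playbook runs, the runtime switch can be sanity-checked on a GPU node; this quick check is not part of the original playbook:
# docker info --format '{{.DefaultRuntime}}'
nvidia
# nvidia-container-cli info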
The related packages are in nvidiadockerclassic.tar under /home/docker on 10.50.208.145:
# ls /home/docker/nvidiadockerclassic.tar -l -h
-rw-r--r-- 1 root root 187M Apr 19 17:49 /home/docker/nvidiadockerclassic.tar
After deployment, upload and prepare the images on the ccse console node:
# tar xvf nvidiadockerclassic.tar
nvidiadockerclassic/
nvidiadockerclassic/nvidia-device-plugin.yml
nvidiadockerclassic/k8sdeviceplugin.tar
# cd nvidiadockerclassic
# docker load<k8sdeviceplugin.tar
# docker tag nvcr.io/nvidia/k8s-device-plugin:v0.9.0 10.168.100.144:8021/nvcr.io/nvidia/k8s-device-plugin:v0.9.0
# docker push 10.168.100.144:8021/nvcr.io/nvidia/k8s-device-plugin:v0.9.0
On the master node, create resources from the nvidia-device-plugin.yml file:
# kubectl create -f nvidia-device-plugin.yml
Verify the device plugin was installed successfully:
# kubectl get po -A | grep device
kube-system nvidia-device-plugin-daemonset-9mhq7 1/1 Running 0 19s
kube-system nvidia-device-plugin-daemonset-m7txq 1/1 Running 0 19s
# kubectl logs nvidia-device-plugin-daemonset-9mhq7 -n kube-system
2021/04/19 09:53:23 Loading NVML
2021/04/19 09:53:23 Starting FS watcher.
2021/04/19 09:53:23 Starting OS watcher.
2021/04/19 09:53:23 Retreiving plugins.
2021/04/19 09:53:23 Starting GRPC server for 'nvidia.com/gpu'
2021/04/19 09:53:23 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2021/04/19 09:53:23 Registered device plugin for 'nvidia.com/gpu' with Kubelet
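Once the plugin has registered, the GPUs should also show up as allocatable node resources; a quick extra check (not in the original notes):
# kubectl describe node 10.168.100.184 | grep nvidia.com/gpu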
Test:
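The contents of test.yml are not reproduced in the post; a pod along the following lines (image and arguments follow NVIDIA's public dcgmproftester examples and are assumptions, not the author's file) requests one GPU and generates the load seen in the nvidia-smi output below:
cat <<'EOF' > test.yml
apiVersion: v1
kind: Pod
metadata:
  name: dcgmproftester
spec:
  restartPolicy: OnFailure
  containers:
  - name: dcgmproftester11
    image: nvidia/samples:dcgmproftester-2.0.10-cuda11.0-ubuntu18.04
    args: ["--no-dcgm-validation", "-t 1004", "-d 120"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF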
# kubectl create -f test.yml
# kubectl get po -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
dcgmproftester 1/1 Running 0 19s 172.26.189.204 10.168.100.184 <none> <none>
# nvidia-smi
Mon Apr 19 05:54:55 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... Off | 00000000:00:09.0 Off | Off |
| N/A 56C P0 218W / 250W | 493MiB / 32510MiB | 88% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 14182 C /usr/bin/dcgmproftester11 489MiB |
+-----------------------------------------------------------------------------+
The configuration of the whole verification environment is as follows:
gpumaster: 10.168.100.2, 4 cores / 16 GB
gpunode1: 10.168.100.3, 4 cores / 16 GB, PCI passthrough B5:00 Tesla V100
gpunode2: 10.168.100.4, 4 cores / 16 GB, PCI passthrough B2:00 Tesla V100
The nodes run CentOS 7.6 installed with the minimal installation option:
# uname -a
Linux gpumaster 3.10.0-957.el7.x86_64 #1 SMP Thu Nov 8 23:39:32 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
# cat /etc/redhat-release
CentOS Linux release 7.6.1810 (Core)
The master node has an extra 500 GB data disk attached, which must be mounted manually at the /dcos directory:
[root@gpumaster ~]# df -h | grep dcos
/dev/vdb1 493G 73M 467G 1% /dcos
[root@gpumaster ~]# cat /etc/fstab | grep dcos
/dev/vdb1 /dcos ext4 defaults 0 0
Disable selinux and firewalld on all three nodes:
# vi /etc/selinux/config
...
SELINUX=disabled
...
# systemctl disable firewalld
# reboot
Add the nodes one by one:
Create a new cluster named gpucluster:
After the cluster is created, add the two GPU nodes:
After they are added, check the cluster status:
[root@gpumaster ~]# kubectl get node
NAME STATUS ROLES AGE VERSION
10.168.100.2 Ready master 6m19s v1.17.3
10.168.100.3 Ready node 78s v1.17.3
10.168.100.4 Ready node 78s v1.17.3
On all three nodes, run the following steps to upgrade the kernel.
Configure the offline package repository:
# cd /etc/yum.repos.d
# mkdir back
# mv CentOS-* back
# vi nvidia.repo
[nvidia]
name=nvidia
baseurl=http://10.168.100.144:8200/repo/x86_64/nvidiarpms
gpgcheck=0
enabled=1
proxy=_none_
# yum install -y kernel-ml
Configure the grub boot options:
# vi /etc/default/grub
...
GRUB_DEFAULT=0
...
GRUB_CMDLINE_LINUX="crashkernel=auto rd.lvm.lv=centos/root rd.lvm.lv=centos/swap rhgb quiet rd.driver.blacklist=nouveau nouveau.modeset=0"
...
# grub2-mkconfig -o /boot/grub2/grub.cfg
Completely disable the built-in nouveau driver:
# echo 'install nouveau /bin/false' >> /etc/modprobe.d/nouveau.conf
After the steps above, reboot the machine and verify the kernel was switched successfully:
# uname -a
Linux gpunode2 4.19.12-1.el7.elrepo.x86_64 #1 SMP Fri Dec 21 11:06:36 EST 2018 x86_64 x86_64 x86_64 GNU/Linux
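It is also worth confirming that nouveau did not load after the reboot (an extra check, not part of the original steps):
# lsmod | grep nouveau
(no output means the blacklist took effect)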
The list of images pre-uploaded to Harbor is as follows (nvcr.io and nvidia):
scp the following directory from 10.168.100.1 to all nodes:
$ scp -r docker@10.168.100.1:/home/docker/nvidia_items .
Pre-load the nfd image:
# docker load<quay.tar
...
Loaded image: quay.io/kubernetes_incubator/node-feature-discovery:v0.6.0
Log in to the gpumaster node and create, from a file, the configmap that is needed when deploying the charts:
# cat ccse.repo
[ccse-k8s]
name=Centos local yum repo for k8s
baseurl=http://10.168.100.144:8200/repo/x86_64/k8s-offline-pkgs
gpgcheck=0
enabled=1
proxy=_none_
[ccse-centos7-base]
name=Centos local yum repo for k8s
baseurl=http://10.168.100.144:8200/repo/x86_64/centos7-base
gpgcheck=0
enabled=1
proxy=_none_
[fuck]
name=Centos local yum repo for k8s 111
baseurl=http://10.168.100.144:8200/repo/x86_64/nvidiarpms
gpgcheck=0
enabled=1
proxy=_none_
# kubectl create namespace gpu-operator-resources
namespace/gpu-operator-resources created
# kubectl create configmap repo-config -n gpu-operator-resources --from-file=ccse.repo
configmap/repo-config created
Now create the gpu-operator instance:
# cd gpu-operator/
# helm install --generate-name . -f values.yaml
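If values.yaml does not already point the driver container at the repo-config configmap created above, the equivalent can be passed on the helm command line; the option names below follow NVIDIA's upstream gpu-operator documentation and may differ between chart versions:
# helm install --generate-name . -f values.yaml \
--set driver.repoConfig.configMapName=repo-config \
--set driver.repoConfig.destinationDir=/etc/yum.repos.d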
Check how the instances are running:
# kubectl get po
NAME READY STATUS RESTARTS AGE
chart-1618803326-node-feature-discovery-master-655c6997cd-fp465 1/1 Running 0 65s
chart-1618803326-node-feature-discovery-worker-7flft 1/1 Running 0 65s
chart-1618803326-node-feature-discovery-worker-mkqm7 1/1 Running 0 65s
chart-1618803326-node-feature-discovery-worker-w2d44 1/1 Running 0 65s
gpu-operator-945878fff-l22vc 1/1 Running 0 65s
Manually add labels to the GPU nodes to control the instances running in the gpu-operator-resources namespace.
Enable GPU driver installation:
# kubectl label nodes 10.168.100.3 nvidia.com/gpu.deploy.driver=true
node/10.168.100.3 labeled
# kubectl label nodes 10.168.100.4 nvidia.com/gpu.deploy.driver=true
node/10.168.100.4 labeled
Check the GPU driver build status:
# kubectl get po -n gpu-operator-resources
NAME READY STATUS RESTARTS AGE
nvidia-driver-daemonset-w6d2q 1/1 Running 0 86s
nvidia-driver-daemonset-zmf9l 1/1 Running 0 86s
# kubectl logs nvidia-driver-daemonset-zmf9l -n gpu-operator-resources
Installation of the kernel module for the NVIDIA Accelerated Graphics Driver for Linux-x86_64 (version 460.32.03) is now complete.
Loading IPMI kernel module...
Loading NVIDIA driver kernel modules...
Starting NVIDIA persistence daemon...
Mounting NVIDIA driver rootfs...
Done, now waiting for signal
Enable device-plugin, dcgm-exporter, and the rest:
# kubectl label nodes 10.168.100.4 nvidia.com/gpu.deploy.container-toolkit=true
# kubectl label nodes 10.168.100.4 nvidia.com/gpu.deploy.device-plugin=true
# kubectl label nodes 10.168.100.4 nvidia.com/gpu.deploy.dcgm-exporter=true
# kubectl label nodes 10.168.100.4 nvidia.com/gpu.deploy.gpu-feature-discovery=true
# kubectl label nodes 10.168.100.3 nvidia.com/gpu.deploy.container-toolkit=true
# kubectl label nodes 10.168.100.3 nvidia.com/gpu.deploy.device-plugin=true
# kubectl label nodes 10.168.100.3 nvidia.com/gpu.deploy.dcgm-exporter=true
# kubectl label nodes 10.168.100.3 nvidia.com/gpu.deploy.gpu-feature-discovery=true
Check how the toolkit-daemonset is running; it reports Init:ImagePullBackOff errors:
# kubectl get po -n gpu-operator-resources
NAME READY STATUS RESTARTS AGE
nvidia-container-toolkit-daemonset-6kqq5 0/1 Init:ImagePullBackOff 0 2m16s
nvidia-container-toolkit-daemonset-cbww2 0/1 Init:ImagePullBackOff 0 4m1s
# kubectl describe po nvidia-container-toolkit-daemonset-cbww2 -n gpu-operator-resources
Normal BackOff 3m31s (x7 over 4m46s) kubelet, 10.168.100.4 Back-off pulling image "10.168.100.144:8021/nvcr.io/nvidia/k8s/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59"
Warning Failed 3m31s (x7 over 4m46s) kubelet, 10.168.100.4 Error: ImagePullBackOff
Normal Pulling 3m20s (x4 over 4m48s) kubelet, 10.168.100.4 Pulling image "10.168.100.144:8021/nvcr.io/nvidia/k8s/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59"
This happens because the pod pulls the wrong image tag; the image reference must be fixed manually:
# kubectl get ds -n gpu-operator-resources
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
nvidia-container-toolkit-daemonset 2 2 0 2 0 nvidia.com/gpu.deploy.container-toolkit=true 133m
nvidia-driver-daemonset 2 2 2 2 2 nvidia.com/gpu.deploy.driver=true 135m
# kubectl edit ds nvidia-container-toolkit-daemonset -n gpu-operator-resources
#image: 10.168.100.144:8021/nvcr.io/nvidia/k8s/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59
image: 10.168.100.144:8021/nvcr.io/nvidia/cuda:11.2.1-base-ubi8
Refresh the pod status: nvidia-container-toolkit-daemonset and nvidia-device-plugin-daemonset now run normally, while nvidia-device-plugin-validation fails with Init:CrashLoopBackOff:
# kubectl get po -n gpu-operator-resources
NAME READY STATUS RESTARTS AGE
nvidia-container-toolkit-daemonset-27qj8 1/1 Running 0 52s
nvidia-container-toolkit-daemonset-g5ndb 1/1 Running 0 51s
nvidia-device-plugin-daemonset-sqfdc 1/1 Running 0 26s
nvidia-device-plugin-daemonset-wldkd 1/1 Running 0 26s
nvidia-device-plugin-validation 0/1 Init:CrashLoopBackOff 1 9s
nvidia-driver-daemonset-m4xjv 1/1 Running 0 137m
nvidia-driver-daemonset-vkrz5 1/1 Running 5 137m
Locate the node the validation pod is running on (10.168.100.3 in this example):
# kubectl get po nvidia-device-plugin-validation -n gpu-operator-resources -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nvidia-device-plugin-validation 0/1 Init:CrashLoopBackOff 4 2m55s 172.26.222.10 10.168.100.3 <none> <none>
Get the reason for the startup failure:
# kubectl describe po nvidia-device-plugin-validation -n gpu-operator-resources
......
Warning Failed 56s (x5 over 2m21s) kubelet, 10.168.100.3 Error: failed to start container "device-plugin-validation-init": Error response from daemon: linux runtime spec devices: error gathering device information while adding custom device "/dev/nvidiactl": no such file or directory
Log in to the 10.168.100.3 node and look up the driver device names under /dev:
# docker ps | grep nvidia-device-plugin-daemonset | grep -v pause
abbea480fdf2 10.168.100.144:8021/nvcr.io/nvidia/k8s-device-plugin "nvidia-device-plugin" 6 minutes ago Up 6 minutes k8s_nvidia-device-plugin-ctr_nvidia-device-plugin-daemonset-sqfdc_gpu-operator-resources_b9988b02-82a6-4637-a7f0-fdee5a448d60_0
# docker exec -it k8s_nvidia-device-plugin-ctr_nvidia-device-plugin-daemonset-sqfdc_gpu-operator-resources_b9988b02-82a6-4637-a7f0-fdee5a448d60_0 /bin/bash
[root@nvidia-device-plugin-daemonset-sqfdc /]# ls /dev/nvidia* -l -h
crw-rw-rw- 1 root root 195, 254 Apr 19 03:52 /dev/nvidia-modeset
crw-rw-rw- 1 root root 237, 0 Apr 19 06:08 /dev/nvidia-uvm
crw-rw-rw- 1 root root 237, 1 Apr 19 06:08 /dev/nvidia-uvm-tools
crw-rw-rw- 1 root root 195, 0 Apr 19 03:52 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 Apr 19 03:52 /dev/nvidiactl
[root@nvidia-device-plugin-daemonset-sqfdc /]# exit
On the host itself (10.168.100.3), manually create the /dev/nvidiactl device file; follow the same steps on 10.168.100.4 to look up the corresponding device numbers there and create its /dev/nvidiactl as well:
[root@gpunode1 ~]# mknod -m 666 /dev/nvidiactl c 195 255
[root@gpunode1 ~]# ls /dev/nvidiactl -l
crw-rw-rw- 1 root root 195, 255 Apr 19 02:19 /dev/nvidiactl
After deleting the nvidia-device-plugin-validation pod, kubelet pulls up a new one; the error changes and now reports that the /dev/nvidia-uvm device file is missing:
Warning Failed 10s (x2 over 11s) kubelet, 10.168.100.4 Error: failed to start container "device-plugin-validation-init": Error response from daemon: linux runtime spec devices: error gathering device information while adding custom device "/dev/nvidia-uvm": no such file or directory
Create the /dev/nvidia-uvm device file the same way /dev/nvidiactl was created above; make sure the device numbers match the ones seen inside the container:
# mknod -m 666 /dev/nvidia-uvm c 237 0
Delete the pod and let it be recreated; the error now reports that /dev/nvidia-uvm-tools is missing:
Warning Failed 9s (x2 over 10s) kubelet, 10.168.100.4 Error: failed to start container "device-plugin-validation-init": Error response from daemon: linux runtime spec devices: error gathering device information while adding custom device "/dev/nvidia-uvm-tools": no such file or directory
Manually create the nvidia-uvm-tools device file, then delete the pod and wait for kubelet to recreate it:
# mknod -m 666 /dev/nvidia-uvm-tools c 237 1
Warning Failed 12s (x2 over 12s) kubelet, 10.168.100.3 Error: failed to start container "device-plugin-validation-init": Error response from daemon: linux runtime spec devices: error gathering device information while adding custom device "/dev/nvidia-modeset": no such file or directory
Manually create the nvidia-modeset device file, then delete the pod and wait for kubelet to recreate it:
# mknod -m 666 /dev/nvidia-modeset c 195 254
Warning Failed 13s (x2 over 14s) kubelet, 10.168.100.4 Error: failed to start container "device-plugin-validation-init": Error response from daemon: linux runtime spec devices: error gathering device information while adding custom device "/dev/nvidia0": no such file or directory
Manually create the nvidia0 device file, then delete the pod and wait for kubelet to recreate it:
# mknod -m 666 /dev/nvidia0 c 195 0
# kubectl get po -A | grep device-plugin-validation
gpu-operator-resources nvidia-device-plugin-validation 0/1 Completed 0 2m26s
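Rather than chasing one missing node per kubelet restart, all five device files found inside the device-plugin container can be created in a single pass on each GPU host; this is just a consolidation of the mknod commands above, and the major/minor numbers must match what ls -l /dev/nvidia* reports in the container:
# mknod -m 666 /dev/nvidiactl c 195 255
# mknod -m 666 /dev/nvidia0 c 195 0
# mknod -m 666 /dev/nvidia-modeset c 195 254
# mknod -m 666 /dev/nvidia-uvm c 237 0
# mknod -m 666 /dev/nvidia-uvm-tools c 237 1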
At this point kubelet will continue to bring up the remaining nvidia resources; the final state should be:
# kubectl get po -A
NAMESPACE NAME READY STATUS RESTARTS AGE
default chart-1618804240-node-feature-discovery-master-5f446799f4-sk7vg 1/1 Running 0 163m
default chart-1618804240-node-feature-discovery-worker-5sllh 1/1 Running 1 163m
default chart-1618804240-node-feature-discovery-worker-86w4w 1/1 Running 0 163m
default chart-1618804240-node-feature-discovery-worker-fl52v 1/1 Running 0 163m
default gpu-operator-945878fff-88thn 1/1 Running 0 163m
gpu-operator-resources gpu-feature-discovery-p6zqs 1/1 Running 0 53s
gpu-operator-resources gpu-feature-discovery-x88v4 1/1 Running 0 53s
gpu-operator-resources nvidia-container-toolkit-daemonset-27qj8 1/1 Running 0 26m
gpu-operator-resources nvidia-container-toolkit-daemonset-g5ndb 1/1 Running 0 26m
gpu-operator-resources nvidia-dcgm-exporter-c9vht 1/1 Running 0 74s
gpu-operator-resources nvidia-dcgm-exporter-mz7rh 1/1 Running 0 74s
gpu-operator-resources nvidia-device-plugin-daemonset-sqfdc 1/1 Running 0 25m
gpu-operator-resources nvidia-device-plugin-daemonset-wldkd 1/1 Running 0 25m
gpu-operator-resources nvidia-device-plugin-validation 0/1 Completed 0 2m47s
gpu-operator-resources nvidia-driver-daemonset-m4xjv 1/1 Running 0 163m
gpu-operator-resources nvidia-driver-daemonset-vkrz5 1/1 Running 5 163m
....
The gpu-operator directory ships with a test.yaml file; create it directly:
[root@gpumaster gpu-operator]# kubectl create -f test.yaml
pod/dcgmproftester created
[root@gpumaster gpu-operator]# kubectl get po -o wide | grep dcgmproftester
dcgmproftester 1/1 Running 0 103s 172.26.243.149 10.168.100.4 <none> <none>
Find the nvidia-device-plugin-daemonset pod on 10.168.100.4 and watch the GPU power draw and memory usage on that node; the workload really is using the GPU's compute units:
# kubectl exec nvidia-device-plugin-daemonset-wldkd -n gpu-operator-resources nvidia-smi
nvidia 33988608 269 nvidia_modeset,nvidia_uvm, Live 0xffffffffa05dd000 (PO)
Mon Apr 19 06:39:26 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... On | 00000000:00:08.0 Off | Off |
| N/A 61C P0 208W / 250W | 493MiB / 32510MiB | 84% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
After the test finishes, the pod moves to the Completed state; check its output:
# kubectl get po -o wide | grep dcgm
dcgmproftester 0/1 Completed 0 4m12s 172.26.243.149 10.168.100.4 <none> <none>
# kubectl logs dcgmproftester
.....
TensorEngineActive: generated ???, dcgm 0.000 (74380.8 gflops)
TensorEngineActive: generated ???, dcgm 0.000 (75398.9 gflops)
TensorEngineActive: generated ???, dcgm 0.000 (75787.6 gflops)
TensorEngineActive: generated ???, dcgm 0.000 (77173.9 gflops)
TensorEngineActive: generated ???, dcgm 0.000 (75669.5 gflops)
Skipping UnwatchFields() since DCGM validation is disabled
Create a new VM via the following commands:
# cd /var/lib/libvirt/qemu/save
### Following is for creating a new vm for saving rpms
# virsh dumpxml node1>example.xml
# vim example.xml
# qemu-img create -f qcow2 -b ccsebaseimage.qcow2 saverpms.qcow2
Formatting 'saverpms.qcow2', fmt=qcow2 size=536870912000 backing_file=ccsebaseimage.qcow2 cluster_size=65536 lazy_refcounts=off refcount_bits=16
# virsh define example.xml
Domain nodetmp defined from example.xml
# virsh start nodetmp
Domain nodetmp started
# virsh net-dhcp-leases default
### Getting the ip address for nodetmp(10.17.18.199)
# scp ./ccse-offline-files.tar.gz root@10.17.18.199:/home/
# ssh root@10.17.18.199
The following is run on 10.17.18.199:
[root@first ~]# cd /home/
[root@first home]# tar xzvf ccse-offline-files.tar.gz
# vi /etc/yum.conf
keepcache=1
### Add a new vm disk (vdb)
# fdisk /dev/vdb
# mkfs.ext4 /dev/vdb1
# mkdir /dcos
# mount /dev/vdb1 /dcos
# vi /etc/fstab
/dev/vdb1 /dcos ext4 defaults 0 0
# mount -a
# exit
Bug fix (lsof):
# scp ./Packages/lsof-4.87-6.el7.x86_64.rpm root@10.17.18.199:/root/
Backup the vm disks on host machine:
# virsh destroy nodetmp
# mv saverpms.qcow2 saverpms1.qcow2
# qemu-img create -f qcow2 -b saverpms1.qcow2 saverpms.qcow2
# virsh start nodetmp
# ssh root@10.17.18.199
Re-login, and run:
# rpm -ivh /root/lsof-4.87-6.el7.x86_64.rpm
# cd /home/ccse-xxxxxxxx
# vi config/config.yaml
common:
# IP of the host where the console and/or Harbor live
host: 10.17.18.199
# vim ./files/offline-repo/ccse-centos7-base.repo
#[ccse-centos7-base]
#name=ccse-offline-repo
#baseurl=file://{centos7_base_repo_dir}
#enabled=1
#gpgcheck=0
[ccse-centos7-base]
name=Centos local yum repo for k8s
baseurl=http://10.17.18.2:8200/repo/x86_64/centos7-base
gpgcheck=0
enabled=1
proxy=_none_
# ./deploy.sh install all 2>&1 | sudo tee install-log_`date "+%Y%m%d%H%M"`
Note: 10.17.18.2 is the existing ccse console.
After deployment, the cached rpms are listed as:
# find /var/cache | grep rpm$
/var/cache/yum/x86_64/7/ccse-centos7-base/packages/audit-2.8.5-4.el7.x86_64.rpm
/var/cache/yum/x86_64/7/ccse-centos7-base/packages/audit-libs-2.8.5-4.el7.x86_64.rpm
/var/cache/yum/x86_64/7/ccse-centos7-base/packages/audit-libs-python-2.8.5-4.el7.x86_64.rpm
/var/cache/yum/x86_64/7/ccse-centos7-base/packages/checkpolicy-2.5-8.el7.x86_64.rpm
/var/cache/yum/x86_64/7/ccse-centos7-base/packages/libsemanage-python-2.5-14.el7.x86_64.rpm
/var/cache/yum/x86_64/7/ccse-centos7-base/packages/policycoreutils-2.5-34.el7.x86_64.rpm
/var/cache/yum/x86_64/7/ccse-centos7-base/packages/policycoreutils-python-2.5-34.el7.x86_64.rpm
/var/cache/yum/x86_64/7/ccse-centos7-base/packages/python-IPy-0.75-6.el7.noarch.rpm
/var/cache/yum/x86_64/7/ccse-centos7-base/packages/setools-libs-3.3.8-4.el7.x86_64.rpm
/var/cache/yum/x86_64/7/ccse-centos7-base/packages/libcgroup-0.41-21.el7.x86_64.rpm
/var/cache/yum/x86_64/7/ccse-centos7-base/packages/unzip-6.0-21.el7.x86_64.rpm
Now enable access to the ccse console (web UI):
# systemctl stop firewalld
# systemctl disable firewalld
# setenforce 0
# vi /etc/selinux/config
SELINUX=disabled
ccse webui:
Create a new VM and add it via the ccse web UI; on the newly added VM, run the following:
# vi /etc/yum.conf
keepcache=1
# systemctl stop firewalld
# systemctl disable firewalld
# setenforce 0
# vi /etc/selinux/config
SELINUX=disabled
Create a new cluster, and fetch the new vm’s rpm cache:
[root@first cache]# find . | grep rpm$
./yum/x86_64/7/ccse-centos7-base/packages/audit-2.8.5-4.el7.x86_64.rpm
./yum/x86_64/7/ccse-centos7-base/packages/audit-libs-2.8.5-4.el7.x86_64.rpm
./yum/x86_64/7/ccse-centos7-base/packages/checkpolicy-2.5-8.el7.x86_64.rpm
./yum/x86_64/7/ccse-centos7-base/packages/audit-libs-python-2.8.5-4.el7.x86_64.rpm
./yum/x86_64/7/ccse-centos7-base/packages/libsemanage-python-2.5-14.el7.x86_64.rpm
./yum/x86_64/7/ccse-centos7-base/packages/libcgroup-0.41-21.el7.x86_64.rpm
./yum/x86_64/7/ccse-centos7-base/packages/policycoreutils-2.5-34.el7.x86_64.rpm
./yum/x86_64/7/ccse-centos7-base/packages/python-IPy-0.75-6.el7.noarch.rpm
./yum/x86_64/7/ccse-centos7-base/packages/setools-libs-3.3.8-4.el7.x86_64.rpm
./yum/x86_64/7/ccse-centos7-base/packages/policycoreutils-python-2.5-34.el7.x86_64.rpm
./yum/x86_64/7/ccse-centos7-base/packages/conntrack-tools-1.4.4-7.el7.x86_64.rpm
./yum/x86_64/7/ccse-centos7-base/packages/libnetfilter_cttimeout-1.0.0-7.el7.x86_64.rpm
./yum/x86_64/7/ccse-centos7-base/packages/libnetfilter_queue-1.0.2-2.el7_2.x86_64.rpm
./yum/x86_64/7/ccse-centos7-base/packages/socat-1.7.3.2-2.el7.x86_64.rpm
./yum/x86_64/7/ccse-centos7-base/packages/libnetfilter_cthelper-1.0.0-11.el7.x86_64.rpm
./yum/x86_64/7/ccse-k8s/packages/container-selinux-2.119.1-1.c57a6f9.el7.noarch.rpm
./yum/x86_64/7/ccse-k8s/packages/docker-ce-18.09.9-3.el7.x86_64.rpm
./yum/x86_64/7/ccse-k8s/packages/containerd.io-1.2.13-3.2.el7.x86_64.rpm
./yum/x86_64/7/ccse-k8s/packages/docker-ce-cli-18.09.9-3.el7.x86_64.rpm
./yum/x86_64/7/ccse-k8s/packages/3f1db71d0bb6d72bc956d788ffee737714e5717c629b26355a2dcf1dba4ad231-kubelet-1.17.3-0.x86_64.rpm
./yum/x86_64/7/ccse-k8s/packages/548a0dcd865c16a50980420ddfa5fbccb8b59621179798e6dc905c9bf8af3b34-kubernetes-cni-0.7.5-0.x86_64.rpm
./yum/x86_64/7/ccse-k8s/packages/35625b6ab1da6c58ce4946742181c0dcf9ac9b6c2b5bea2c13eed4876024c342-kubectl-1.17.3-0.x86_64.rpm
Save the harbor images:
[root@first ~]# docker save -o harbor.tar goharbor/chartmuseum-photon:v0.9.0-v1.8.6 goharbor/harbor-migrator:v1.8.6 goharbor/redis-photon:v1.8.6 goharbor/clair-photon:v2.1.0-v1.8.6 goharbor/notary-server-photon:v0.6.1-v1.8.6 goharbor/notary-signer-photon:v0.6.1-v1.8.6 goharbor/harbor-registryctl:v1.8.6 goharbor/registry-photon:v2.7.1-patch-2819-v1.8.6 goharbor/nginx-photon:v1.8.6 goharbor/harbor-log:v1.8.6 goharbor/harbor-jobservice:v1.8.6 goharbor/harbor-core:v1.8.6 goharbor/harbor-portal:v1.8.6 goharbor/harbor-db:v1.8.6 goharbor/prepare:v1.8.6
[root@first ~]# ls -l -h harbor.tar
-rw-------. 1 root root 1.5G Apr 11 23:31 harbor.tar
[root@first ~]# cp harbor.tar harbor.tar.back
[root@first ~]# xz -T4 harbor.tar
[root@first ~]# ls -l -h harbor.tar.*
-rw-------. 1 root root 1.5G Apr 11 23:31 harbor.tar.back
-rw-------. 1 root root 428M Apr 11 23:31 harbor.tar.xz
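On the target host the archive is simply decompressed and loaded back, the reverse of the steps above (assuming docker is already installed there):
# xz -d harbor.tar.xz
# docker load -i harbor.tar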
Combine the rpms:
[root@first rpms]# ls -l -h | wc -l
12
##### After transferring from working node
#########################################
[root@first rpms]# cp /tmp/rpms/* .
cp: overwrite ‘./audit-2.8.5-4.el7.x86_64.rpm’? y
cp: overwrite ‘./audit-libs-2.8.5-4.el7.x86_64.rpm’? y
cp: overwrite ‘./audit-libs-python-2.8.5-4.el7.x86_64.rpm’? y
cp: overwrite ‘./checkpolicy-2.5-8.el7.x86_64.rpm’? y
cp: overwrite ‘./libcgroup-0.41-21.el7.x86_64.rpm’? y
cp: overwrite ‘./libsemanage-python-2.5-14.el7.x86_64.rpm’? y
cp: overwrite ‘./policycoreutils-2.5-34.el7.x86_64.rpm’? y
cp: overwrite ‘./policycoreutils-python-2.5-34.el7.x86_64.rpm’? y
cp: overwrite ‘./python-IPy-0.75-6.el7.noarch.rpm’? y
cp: overwrite ‘./setools-libs-3.3.8-4.el7.x86_64.rpm’? y
[root@first rpms]# ls -l -h | wc -l
17
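After merging, the repository metadata would normally be regenerated so the combined rpms become installable; createrepo was used earlier for the nvidia packages and is assumed to apply here as well:
[root@first rpms]# createrepo .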