Basic environment
For the steps to set up the k8s cluster itself, see the companion article on deploying k8s with kubeadm on CentOS 7.
All of the operations below are performed on the GPU node.
OS
CentOS 7
Docker version
19.03+
k8s versions
kubelet-1.15.1
kubeadm-1.15.1
kubectl-1.15.1
Install the NVIDIA toolkit
Install nvidia-container-toolkit && nvidia-container-runtime
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
sudo yum install -y nvidia-container-toolkit nvidia-container-runtime
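The $distribution variable used in the repo URL is built from /etc/os-release; on CentOS 7 it expands to centos7. A minimal sketch with the os-release values hard-coded for illustration:

```shell
# Values that /etc/os-release provides on CentOS 7 (hard-coded here for illustration)
ID=centos
VERSION_ID=7
distribution="$ID$VERSION_ID"
# This is the repo file the curl command above actually fetches
echo "https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo"
```

On other distributions the same pattern yields e.g. ubuntu18.04, which is why the command is written generically instead of hard-coding centos7.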
Install nvidia-docker2
# Add the nvidia-docker2 repo
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
# Install nvidia-docker2 and reload the Docker daemon configuration
sudo yum install -y nvidia-docker2
sudo pkill -SIGHUP dockerd
Install the NVIDIA GPU driver
Add the ELRepo repository
rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
rpm -Uvh https://www.elrepo.org/elrepo-release-7.0-3.el7.elrepo.noarch.rpm
Install the driver-detection utility
yum install -y nvidia-detect
Find a suitable driver
nvidia-detect
If, for example, nvidia-detect reports kmod-nvidia as the matching driver, install it:
yum -y install kmod-nvidia
Apply the configuration
Configure /etc/docker/daemon.json
# /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "insecure-registries": [],
    "registry-mirrors": [
        "http://hub-mirror.c.163.com",
        "https://registry.docker-cn.com"
    ]
}
systemctl daemon-reload
systemctl restart docker
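A malformed daemon.json stops dockerd from starting, so it is worth sanity-checking the file before the restart. A minimal sketch, writing to a temporary path rather than /etc/docker/daemon.json:

```shell
# Write a copy of the runtime section to a temporary file (path is illustrative)
cat > /tmp/daemon.json <<'EOF'
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF
# Confirm the key settings are present before restarting Docker
grep -q '"default-runtime": "nvidia"' /tmp/daemon.json && echo "default runtime is nvidia"
grep -q '"path": "/usr/bin/nvidia-container-runtime"' /tmp/daemon.json && echo "runtime binary path set"
```

A stricter syntax check is possible with any JSON parser; after the restart, docker info should list nvidia among the runtimes.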
Deploy the k8s-device-plugin
Create the DaemonSet
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta6/nvidia-device-plugin.yml
If the following error appears
The connection to the server raw.githubusercontent.com was refused - did you specify the right host or port?
then paste the manifest in manually:
kubectl create -f - <<EOF
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      # This annotation is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      # This toleration is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      - key: CriticalAddonsOnly
        operator: Exists
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
      - image: nvidia/k8s-device-plugin:1.0.0-beta6
        name: nvidia-device-plugin-ctr
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
EOF
Verify that GPU support is configured
kubectl describe node
Find the GPU node in the output and look for the configuration below. The plugin pod itself can also be checked with kubectl get pods -n kube-system.
Name:               k8s-pro-node3
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=k8s-pro-node3
                    kubernetes.io/os=linux
Annotations:        flannel.alpha.coreos.com/backend-data: {"VtepMAC":"5e:ff:1a:83:ea:74"}
                    flannel.alpha.coreos.com/backend-type: vxlan
                    flannel.alpha.coreos.com/kube-subnet-manager: true
                    flannel.alpha.coreos.com/public-ip: 192.168.0.72
                    kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Fri, 18 Dec 2020 09:02:10 +0800
Taints:             <none>
Unschedulable:      false
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                      Message
  ----             ------  -----------------                 ------------------                ------                      -------
  MemoryPressure   False   Fri, 18 Dec 2020 09:47:05 +0800   Fri, 18 Dec 2020 09:02:57 +0800   KubeletHasSufficientMemory  kubelet has sufficient memory available
  DiskPressure     False   Fri, 18 Dec 2020 09:47:05 +0800   Fri, 18 Dec 2020 09:02:57 +0800   KubeletHasNoDiskPressure    kubelet has no disk pressure
  PIDPressure      False   Fri, 18 Dec 2020 09:47:05 +0800   Fri, 18 Dec 2020 09:02:57 +0800   KubeletHasSufficientPID     kubelet has sufficient PID available
  Ready            True    Fri, 18 Dec 2020 09:47:05 +0800   Fri, 18 Dec 2020 09:06:28 +0800   KubeletReady                kubelet is posting ready status
Addresses:
  InternalIP:  192.168.0.72
  Hostname:    k8s-pro-node3
Capacity:
  cpu:                8
  ephemeral-storage:  103079844Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             32779236Ki
  nvidia.com/gpu:     1
  pods:               110
Allocatable:
  cpu:                8
  ephemeral-storage:  94998384074
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             32676836Ki
  nvidia.com/gpu:     1
  pods:               110
If the following line appears under both Capacity and Allocatable, GPU support has been configured successfully:
nvidia.com/gpu: 1
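The describe output is long, so a quick way to pull out just the GPU lines is to grep for the resource name. A sketch, run here against a saved excerpt of the output above; on a live cluster the same filter is kubectl describe node | grep 'nvidia.com/gpu':

```shell
# Saved excerpt of the `kubectl describe node` output shown above (illustrative file path)
cat > /tmp/node-describe.txt <<'EOF'
Capacity:
  cpu:             8
  nvidia.com/gpu:  1
Allocatable:
  cpu:             8
  nvidia.com/gpu:  1
EOF
# Two matches are expected: one under Capacity, one under Allocatable
grep 'nvidia.com/gpu' /tmp/node-describe.txt
```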
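With the plugin registered, a pod can consume the GPU by requesting the extended resource nvidia.com/gpu in its resource limits. A minimal sketch of such a manifest; the pod name and image tag are illustrative:

```yaml
# gpu-test.yaml -- illustrative name
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:10.2-base   # illustrative image tag
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1          # request one GPU; GPUs cannot be overcommitted or shared
```

Apply it with kubectl create -f gpu-test.yaml; if scheduling succeeds, the pod logs should show nvidia-smi output from the node's GPU.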