[Kubernetes] Learning Pods (9): Pod Scheduling with Taints and Tolerations

These are my study notes on the book Kubernetes权威指南 (Kubernetes: The Definitive Guide).

Study notes:

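Both nodeSelector scheduling and affinity scheduling give the Pod the initiative to choose a Node at creation time. In many scenarios, however, the Node also needs a say in which Pods may be deployed on it. The Taints and Tolerations mechanism lets a Node set Taints so that it accepts only Pods with a certain property (those carrying matching Tolerations), and it also lets the Node reject and evict Pods.

Once a Node has Taints, only Pods that explicitly declare Tolerations for those Taints are eligible to be scheduled onto it (the NoSchedule effect). In addition, a Node can carry Taints with the NoExecute effect, which directly evict Pods already on the Node that lack the corresponding Tolerations.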

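When a Node has multiple Taints and a Pod has multiple Tolerations, the K8s scheduler processes them as follows: list all Taints on the node, ignore those matched by the Pod's Tolerations, and treat the Taints that remain as the ones that take effect on the Pod. The notable cases are:

At least one remaining Taint has effect=NoSchedule: the Pod will never be scheduled onto the Node (hard constraint)
No remaining Taint has NoSchedule, but at least one has PreferNoSchedule: the scheduler tries to avoid placing the Pod on the Node (soft constraint)
At least one remaining Taint has effect=NoExecute: a Pod not yet scheduled will never be scheduled there (hard constraint), and a Pod already running there is evicted
The Pod tolerates a NoExecute Taint and sets tolerationSeconds: the Pod keeps running for that many seconds after the Taint is added, and only then is evicted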
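The filtering logic above can be sketched in a few lines of Python. This is only an illustration, not the scheduler's actual code: the names Taint, remaining_taints and verdict are made up for this example, and a Toleration is assumed, for simplicity, to match a Taint only when key, value, and effect are all identical.

```python
from typing import List, NamedTuple


class Taint(NamedTuple):
    key: str
    value: str
    effect: str


def remaining_taints(node_taints: List[Taint], tolerated: List[Taint]) -> List[Taint]:
    """Drop every node taint the Pod tolerates (simplified: exact triple match)."""
    return [t for t in node_taints if t not in tolerated]


def verdict(node_taints: List[Taint], tolerated: List[Taint]) -> str:
    """Summarize what the untolerated taints mean for this Pod on this Node."""
    effects = {t.effect for t in remaining_taints(node_taints, tolerated)}
    if "NoExecute" in effects:
        return "not scheduled; evicted if already running"
    if "NoSchedule" in effects:
        return "not scheduled"
    if "PreferNoSchedule" in effects:
        return "scheduled only if no better node exists"
    return "no restriction"


# Hypothetical node with three taints; the Pod tolerates the first two.
node = [
    Taint("k1", "v1", "NoSchedule"),
    Taint("k1", "v1", "NoExecute"),
    Taint("k2", "v2", "PreferNoSchedule"),
]
pod = [Taint("k1", "v1", "NoSchedule"), Taint("k1", "v1", "NoExecute")]
print(verdict(node, pod))  # scheduled only if no better node exists
```

Only the untolerated Taints decide the outcome, which is why tolerating two of three Taints is not enough when the third one is a hard constraint.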

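This mechanism makes K8s scheduling considerably more flexible.

Before the examples, check whether the current nodes already carry Taints:

Use kubectl describe node <node-name> to view the node information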

# kubectl describe node miwifi-r4cm-srv
......
Taints: node-role.kubernetes.io/master:NoSchedule

......

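As shown, the Master node of a K8s cluster carries a built-in Taint that marks it as a Master node. Its effect is NoSchedule, so any Pod without a matching Toleration will not be scheduled onto this node after creation. This is how K8s keeps the Master node from taking on workloads as a worker node by default.

Removing a Taint from a node works much like removing a Label: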

# kubectl taint nodes <node-name> <taint-key>-

After this Taint is removed from the Master node, the scheduler considers the Master node as a candidate just like a worker node.
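If the removal is scripted, the argument list can be assembled as in the sketch below. The helper name untaint_args is made up for this example. The trailing "-" is what tells kubectl to remove the Taint; kubectl also accepts <key>:<effect>- to remove only the Taint with that effect.

```python
from typing import List, Optional


def untaint_args(node: str, key: str, effect: Optional[str] = None) -> List[str]:
    """Build the argv for removing a taint; the trailing '-' means 'remove'."""
    target = f"{key}:{effect}-" if effect else f"{key}-"
    return ["kubectl", "taint", "nodes", node, target]


# Remove the built-in master taint from the node used in these notes.
print(" ".join(untaint_args("miwifi-r4cm-srv", "node-role.kubernetes.io/master")))
```

The returned list can be passed to subprocess.run as is, which avoids shell quoting problems.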

1. Adding Taints to a Node

Command format: # kubectl taint nodes <node-name> <key>=<value>:<effect>

Here effect is what happens to a Pod that has no Toleration matching this Taint
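To make the <key>=<value>:<effect> format concrete, here is a small parsing sketch. The function parse_taint and the Taint tuple are hypothetical helpers written for these notes, not part of any Kubernetes client library. It also accepts the value-less form <key>:<effect>, which is how the Master node's Taint is displayed.

```python
from typing import NamedTuple

# The three taint effects discussed in these notes.
EFFECTS = {"NoSchedule", "PreferNoSchedule", "NoExecute"}


class Taint(NamedTuple):
    key: str
    value: str
    effect: str


def parse_taint(spec: str) -> Taint:
    """Split '<key>=<value>:<effect>' (or '<key>:<effect>') into its parts."""
    key_value, sep, effect = spec.rpartition(":")
    if not sep or effect not in EFFECTS:
        raise ValueError(f"invalid taint spec: {spec!r}")
    key, _, value = key_value.partition("=")
    if not key:
        raise ValueError(f"taint spec has no key: {spec!r}")
    return Taint(key, value, effect)


print(parse_taint("key1=value1:NoSchedule"))
print(parse_taint("node-role.kubernetes.io/master:NoSchedule"))
```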

Add three Taints to xu.node1

As the commands show, even when the key and value are identical, producing multiple effects requires defining multiple Taints

# kubectl taint node xu.node1 key1=value1:NoSchedule
node/xu.node1 tainted
# kubectl taint node xu.node1 key1=value1:NoExecute
node/xu.node1 tainted
# kubectl taint node xu.node1 key2=value2:NoSchedule
node/xu.node1 tainted

The Node's description now includes the three Taints just added

# kubectl describe node xu.node1
......
Taints:             key1=value1:NoExecute
                    key1=value1:NoSchedule
                    key2=value2:NoSchedule

......

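2. Creating a Pod that does not tolerate every Taint, and checking how it is scheduled

Create a new Pod: taint-test1

The Pod defines two Tolerations, corresponding to the first two Taints on xu.node1

For a Toleration to match a Taint, its key, value, and effect must each correspond to the Taint's

There are also two special cases:

An empty key with the Exists operator matches every key and value
An empty effect matches every effect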
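The matching rule and its two special cases can be written down as a short predicate. This is a sketch for illustration, not the Kubernetes source: the Taint and Toleration tuples and the tolerates function are names chosen for this example. The operator defaults to Equal, and with Exists the value is not compared.

```python
from typing import NamedTuple


class Taint(NamedTuple):
    key: str
    value: str
    effect: str


class Toleration(NamedTuple):
    key: str = ""
    operator: str = "Equal"  # "Equal" (the default) or "Exists"
    value: str = ""
    effect: str = ""         # empty means "any effect"


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """Return True if this toleration matches this taint."""
    if tol.effect and tol.effect != taint.effect:
        return False         # a non-empty effect must match exactly
    if tol.key and tol.key != taint.key:
        return False         # a non-empty key must match exactly
    if tol.operator == "Exists":
        return True          # Exists ignores the value
    return tol.value == taint.value


taint = Taint("key2", "value2", "NoSchedule")
print(tolerates(Toleration(operator="Exists"), taint))              # True: empty key + Exists
print(tolerates(Toleration(key="key2", operator="Exists"), taint))  # True: empty effect
print(tolerates(Toleration("key2", "Equal", "other", "NoSchedule"), taint))  # False: value differs
```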

apiVersion: v1
kind: Pod
metadata:
  name: taint-test1
spec:
  containers:
  - name: taint-test1
    image: busybox
  tolerations:
  - key: "key1"            # key1=value1:NoSchedule
    operator: "Equal"
    value: "value1"
    effect: "NoSchedule"
  - key: "key1"            # key1=value1:NoExecute
    operator: "Equal"
    value: "value1"
    effect: "NoExecute"

After creating it, use kubectl describe to view the Pod's details

The Pod's Tolerations include the two defined at creation time

.......

Tolerations:     key1=value1:NoSchedule
                 key1=value1:NoExecute
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s

......

Because the Pod does not tolerate the third Taint on xu.node1, the only worker node, it fails to schedule

Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling <unknown> default-scheduler 0/2 nodes are available: 2 node(s) had taints that the pod didn't tolerate.

3. Creating a new Pod with Tolerations for all three Taints

The new Pod taint-test2 is defined as follows

.....
tolerations:
- key: "key1"            # key1=value1:NoSchedule
  operator: "Equal"
  value: "value1"
  effect: "NoSchedule"
- key: "key1"            # key1=value1:NoExecute
  operator: "Equal"
  value: "value1"
  effect: "NoExecute"
- key: "key2"            # key2=value2:NoSchedule
  operator: "Equal"
  value: "value2"
  effect: "NoSchedule"

Create the Pod; it is successfully scheduled onto xu.node1

Events:
Type     Reason                  Age               From               Message
----     ------                  ----              ----               -------
Normal   Scheduled               <unknown>         default-scheduler Successfully assigned default/taint-test2 to xu.node1

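A problem appeared here: after a Taint with the NoExecute effect is defined on the Node, a Pod that tolerates all the Taints and is scheduled onto that Node hits the error below. The "connection refused" on 127.0.0.1:6784 indicates that the CNI network plugin's local endpoint on the node is not responding. My current guess is that this Taint caused the K8s system containers on the worker node to stop running; this needs further investigation, and I will update the post afterwards.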

Events:
Type     Reason                  Age                 From               Message
----     ------                  ----                ----               -------
Normal   Scheduled               <unknown>           default-scheduler Successfully assigned default/taint-test2 to xu.node1
Warning FailedCreatePodSandBox 15m                 kubelet, xu.node1 Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "2f7147cb2cd74b4d2219fe3c8484b4ded1a5a361939b518b255df51368899cd6" network for pod "taint-test2": networkPlugin cni failed to set up pod "taint-test2_default" network: unable to allocate IP address: Post http://127.0.0.1:6784/ip/2f7147cb2cd74b4d2219fe3c8484b4ded1a5a361939b518b255df51368899cd6: dial tcp 127.0.0.1:6784: connect: connection refused, failed to clean up sandbox container "2f7147cb2cd74b4d2219fe3c8484b4ded1a5a361939b518b255df51368899cd6" network for pod "taint-test2": networkPlugin cni failed to teardown pod "taint-test2_default" network: Delete http://127.0.0.1:6784/ip/2f7147cb2cd74b4d2219fe3c8484b4ded1a5a361939b518b255df51368899cd6: dial tcp 127.0.0.1:6784: connect: connection refused]
Normal   SandboxChanged          40s (x70 over 15m) kubelet, xu.node1 Pod sandbox changed, it will be killed and re-created.

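4. Some use cases for Taints and Tolerations

① Reserving some nodes for specific applications (certain critical Pods need dedicated nodes set aside for them)

② Nodes with special hardware, which should go first to the Pods that actually need that hardware

③ When a node fails, K8s automatically adds Taints to the Node, in a rate-limited manner

Every Pod can be seen to carry these built-in Tolerations: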

......

Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s

......

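When the system automatically adds Taints to a failed node, these automatically set Tolerations let the Pod keep running for another 300s before it is evicted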
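The timing can be illustrated with a small calculation. This is a sketch with made-up names (eviction_deadline is not a Kubernetes API), and it assumes a single matching Toleration. A missing tolerationSeconds means the Taint is tolerated indefinitely, and zero or negative values are treated as 0, that is, immediate eviction.

```python
from datetime import datetime, timedelta
from typing import Optional


def eviction_deadline(tainted_at: datetime,
                      toleration_seconds: Optional[int]) -> Optional[datetime]:
    """When a Pod tolerating a NoExecute taint would be evicted.

    Returns None when the toleration has no time limit (never evicted).
    """
    if toleration_seconds is None:
        return None
    # Zero and negative values are treated as 0 (evict immediately).
    return tainted_at + timedelta(seconds=max(toleration_seconds, 0))


# Hypothetical moment at which the node was marked unreachable.
tainted_at = datetime(2020, 1, 1, 12, 0, 0)
print(eviction_deadline(tainted_at, 300))   # 2020-01-01 12:05:00
print(eviction_deadline(tainted_at, None))  # None
```

If the Taint is removed before the deadline, for example because the node recovers, the Pod is not evicted.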

To customize these Tolerations, use the following format:

tolerations:
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 6000

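5. Deploying a Deployment that includes a Toleration

As a summary and review of what was learned, try deploying a Deployment

The Pod template includes a Toleration for the Master node's built-in Taint

The configuration file is as follows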

apiVersion: apps/v1
kind: Deployment
metadata:
  name: taint-test
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      name: nginx
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: "300m"
            memory: "64Mi"
          limits:
            cpu: "1000m"
            memory: "128Mi"
      tolerations:
      - key: "node-role.kubernetes.io/master"
        operator: "Exists"
        effect: ""           # empty effect matches every effect

After deploying it, the two Pod replicas are scheduled onto the master node and the only worker node, respectively

# kubectl get pods -o wide
NAME                          READY   STATUS    RESTARTS   AGE   IP          NODE              NOMINATED NODE   READINESS GATES
taint-test-869c6dc8d5-qz5g9   1/1     Running   0          86s   10.32.0.4   miwifi-r4cm-srv   <none>           <none>
taint-test-869c6dc8d5-xbkcg   1/1     Running   0          86s   10.44.0.1   xu.node1          <none>           <none>

刺眼的宝石蓝
