Affinity and anti-affinity scheduling

In the DaemonSet section, we talked about using nodeSelector to select the nodes a Pod is deployed to. Kubernetes also supports a more fine-grained and flexible scheduling mechanism: affinity and anti-affinity scheduling.

Kubernetes supports affinity and anti-affinity at two levels: node and Pod. By configuring affinity and anti-affinity rules, you can express hard requirements or soft preferences, such as deploying front-end Pods and back-end Pods together, deploying certain types of applications to specific nodes, or deploying different applications to different nodes.

Node Affinity

You have probably guessed that affinity rules are based on labels. Let's take a look at the labels on the nodes in a CCE cluster.

$ kubectl describe node 192.168.0.212
Name:               192.168.0.212
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/is-baremetal=false
                    failure-domain.beta.kubernetes.io/region=cn-east-3
                    failure-domain.beta.kubernetes.io/zone=cn-east-3a
                    kubernetes.io/arch=amd64
                    kubernetes.io/availablezone=cn-east-3a
                    kubernetes.io/eniquota=12
                    kubernetes.io/hostname=192.168.0.212
                    kubernetes.io/os=linux
                    node.kubernetes.io/subnetid=fd43acad-33e7-48b2-a85a-24833f362e0e
                    os.architecture=amd64
                    os.name=EulerOS_2.0_SP5
                    os.version=3.10.0-862.14.1.5.h328.eulerosv2r7.x86_64

These labels are automatically added by CCE when a node is created. The following are a few labels that are commonly used in scheduling.

  • failure-domain.beta.kubernetes.io/region: the region the node is in. The value cn-east-3 above means the node is in the Shanghai region.
  • failure-domain.beta.kubernetes.io/zone: the availability zone (AZ) the node is in.
  • kubernetes.io/hostname: the hostname of the node.

In addition, custom labels were introduced in Label: The Weapon of Organizing Pods. For a large Kubernetes cluster, many more labels are usually defined according to business needs.
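
As a quick sketch (the label key disktype and its value are assumptions used only for illustration), custom labels can be added to a node and queried with kubectl:

$ kubectl label node 192.168.0.212 disktype=ssd    # add a custom label to a node
$ kubectl get node -l disktype=ssd                 # list only the nodes carrying that label
$ kubectl get node -L disktype                     # show the label value as an extra column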

nodeSelector was introduced in the DaemonSet section. With nodeSelector, a Pod is deployed only on nodes that have specific labels. As shown below, the Pod will only be deployed on nodes with the label gpu=true.

apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  nodeSelector:                 # Node selection: the Pod is created only on nodes that have the label gpu=true
    gpu: "true"
...

The same thing can be done with a node affinity rule, as shown below.

apiVersion: apps/v1
kind: Deployment
metadata:
  name:  gpu
  labels:
    app:  gpu
spec:
  selector:
    matchLabels:
      app: gpu
  replicas: 3
  template:
    metadata:
      labels:
        app:  gpu
    spec:
      containers:
      - image:  nginx:alpine
        name:  gpu
        resources:
          requests:
            cpu: 100m
            memory: 200Mi
          limits:
            cpu: 100m
            memory: 200Mi
      imagePullSecrets:
      - name: default-secret
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: gpu
                operator: In
                values:
                - "true"

This looks a lot more complicated, but it is much more expressive, as will be shown later.

Here affinity defines the affinity settings and nodeAffinity defines node affinity. requiredDuringSchedulingIgnoredDuringExecution is a long name, but it can be read in two parts:

  • requiredDuringScheduling means the rules defined below must be satisfied (required) for the Pod to be scheduled.
  • IgnoredDuringExecution means the rules do not affect Pods already running on the node. All rules Kubernetes currently provides end with IgnoredDuringExecution, because node affinity only affects Pods that are being scheduled. Kubernetes is expected to eventually support requiredDuringExecution as well, meaning that removing a label from a node would evict the Pods that require that label.

The operator used here is In, which means the label value must be in the list under values. The other operators are:

  • NotIn: the label value is not in a given list
  • Exists: a label with the given key exists
  • DoesNotExist: no label with the given key exists
  • Gt: the label value is greater than a given value (compared as integers)
  • Lt: the label value is less than a given value (compared as integers)

Note that there is no nodeAntiAffinity, because NotIn and DoesNotExist provide the same capability. A short sketch using some of these operators is shown below.
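
As a minimal sketch (the label keys gpu and gpu-count are assumptions used only for illustration), several operators can be combined in one matchExpressions list, and all expressions in a term must match:

      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: gpu                 # the node must carry a gpu label, whatever its value
                operator: Exists
              - key: gpu-count           # assumed label; Gt compares the value as an integer
                operator: Gt
                values:
                - "4"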

Let's verify whether this rule takes effect. First, label the node 192.168.0.212 with gpu=true.

$ kubectl label node 192.168.0.212 gpu=true
node/192.168.0.212 labeled

$ kubectl get node -L gpu
NAME            STATUS   ROLES    AGE   VERSION                            GPU
192.168.0.212   Ready    <none>   13m   v1.15.6-r1-20.3.0.2.B001-15.30.2   true
192.168.0.94    Ready    <none>   13m   v1.15.6-r1-20.3.0.2.B001-15.30.2   
192.168.0.97    Ready    <none>   13m   v1.15.6-r1-20.3.0.2.B001-15.30.2   

Create this Deployment and you will find that all Pods are deployed on the node 192.168.0.212.

$ kubectl create -f affinity.yaml 
deployment.apps/gpu created

$ kubectl get pod -owide
NAME                     READY   STATUS    RESTARTS   AGE   IP            NODE         
gpu-6df65c44cf-42xw4     1/1     Running   0          15s   172.16.0.37   192.168.0.212
gpu-6df65c44cf-jzjvs     1/1     Running   0          15s   172.16.0.36   192.168.0.212
gpu-6df65c44cf-zv5cl     1/1     Running   0          15s   172.16.0.38   192.168.0.212

Node priority selection rules

The requiredDuringSchedulingIgnoredDuringExecution rule above is a hard requirement. Node affinity also has a preference rule, preferredDuringSchedulingIgnoredDuringExecution, which indicates which nodes are preferred according to the rules.

To demonstrate the effect, first add a node to the cluster above, one that is not in the same availability zone as the other three nodes. After creation, query the availability zone labels of the nodes; as shown below, the newly added node is in the availability zone cn-east-3c.

$ kubectl get node -L failure-domain.beta.kubernetes.io/zone,gpu
NAME            STATUS   ROLES    AGE     VERSION                            ZONE         GPU
192.168.0.100   Ready    <none>   7h23m   v1.15.6-r1-20.3.0.2.B001-15.30.2   cn-east-3c   
192.168.0.212   Ready    <none>   8h      v1.15.6-r1-20.3.0.2.B001-15.30.2   cn-east-3a   true
192.168.0.94    Ready    <none>   8h      v1.15.6-r1-20.3.0.2.B001-15.30.2   cn-east-3a   
192.168.0.97    Ready    <none>   8h      v1.15.6-r1-20.3.0.2.B001-15.30.2   cn-east-3a  

The following Deployment requires Pods to be deployed preferentially on nodes in availability zone cn-east-3a. It uses the preferredDuringSchedulingIgnoredDuringExecution rule and sets the weight of cn-east-3a to 80 and the weight of gpu=true to 20, so Pods are preferentially deployed on the cn-east-3a nodes.

apiVersion: apps/v1
kind: Deployment
metadata:
  name:  gpu
  labels:
    app:  gpu
spec:
  selector:
    matchLabels:
      app: gpu
  replicas: 10
  template:
    metadata:
      labels:
        app:  gpu
    spec:
      containers:
      - image:  nginx:alpine
        name:  gpu
        resources:
          requests:
            cpu:  100m
            memory:  200Mi
          limits:
            cpu:  100m
            memory:  200Mi
      imagePullSecrets:
      - name: default-secret
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 80 
            preference: 
              matchExpressions: 
              - key: failure-domain.beta.kubernetes.io/zone
                operator: In 
                values: 
                - cn-east-3a
          - weight: 20 
            preference: 
              matchExpressions: 
              - key: gpu
                operator: In 
                values: 
                - "true"

Looking at the actual deployment, you can see that 5 Pods run on node 192.168.0.212, 3 on 192.168.0.97, and only 2 on 192.168.0.100.

$ kubectl create -f affinity2.yaml 
deployment.apps/gpu created

$ kubectl get po -o wide
NAME                   READY   STATUS    RESTARTS   AGE     IP            NODE         
gpu-585455d466-5bmcz   1/1     Running   0          2m29s   172.16.0.44   192.168.0.212
gpu-585455d466-cg2l6   1/1     Running   0          2m29s   172.16.0.63   192.168.0.97 
gpu-585455d466-f2bt2   1/1     Running   0          2m29s   172.16.0.79   192.168.0.100
gpu-585455d466-hdb5n   1/1     Running   0          2m29s   172.16.0.42   192.168.0.212
gpu-585455d466-hkgvz   1/1     Running   0          2m29s   172.16.0.43   192.168.0.212
gpu-585455d466-mngvn   1/1     Running   0          2m29s   172.16.0.48   192.168.0.97 
gpu-585455d466-s26qs   1/1     Running   0          2m29s   172.16.0.62   192.168.0.97 
gpu-585455d466-sxtzm   1/1     Running   0          2m29s   172.16.0.45   192.168.0.212
gpu-585455d466-t56cm   1/1     Running   0          2m29s   172.16.0.64   192.168.0.100
gpu-585455d466-t5w5x   1/1     Running   0          2m29s   172.16.0.41   192.168.0.212

In the above example, the nodes are prioritized as follows: nodes with both labels rank highest; nodes with only the cn-east-3a label (weight 80) rank second; nodes with only gpu=true (weight 20) rank third; and nodes with neither label rank lowest.

Figure 1 Priority sort order

Notice that no Pod is scheduled to node 192.168.0.94. This is because the node already runs many other Pods and its resource usage is high, so nothing is scheduled there. This also shows that preferredDuringSchedulingIgnoredDuringExecution is a preference rule, not a mandatory one.
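
If you want to check this yourself, the resources already requested on a node can be viewed with kubectl describe (the grep filter is only a convenience to trim the output):

$ kubectl describe node 192.168.0.94 | grep -A 6 "Allocated resources"
# The "Allocated resources" section lists how much CPU and memory the Pods on
# the node have already requested, which the scheduler takes into account.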

Pod Affinity

Node affinity rules only affect the affinity between Pods and nodes. Kubernetes also supports affinity between Pods, for example deploying an application's frontend and backend together to reduce access latency. Pod affinity likewise has the requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution rules.

Take a look at the following example. Assume that an application's backend has been created and carries the label app=backend.

$ kubectl get po -o wide
NAME                       READY   STATUS    RESTARTS   AGE     IP            NODE         
backend-658f6cb858-dlrz8   1/1     Running   0          2m36s   172.16.0.67   192.168.0.100

To deploy the frontend Pods together with the backend, you can configure the Pod affinity rule as follows.

apiVersion: apps/v1
kind: Deployment
metadata:
  name:   frontend
  labels:
    app:  frontend
spec:
  selector:
    matchLabels:
      app: frontend
  replicas: 3
  template:
    metadata:
      labels:
        app:  frontend
    spec:
      containers:
      - image:  nginx:alpine
        name:  frontend
        resources:
          requests:
            cpu:  100m
            memory:  200Mi
          limits:
            cpu:  100m
            memory:  200Mi
      imagePullSecrets:
      - name: default-secret
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - topologyKey: kubernetes.io/hostname
            labelSelector:
              matchLabels:
                app: backend

Create the frontend and then check the Pods: the frontend Pods are created on the same node as the backend.

$ kubectl create -f affinity3.yaml 
deployment.apps/frontend created

$ kubectl get po -o wide
NAME                        READY   STATUS    RESTARTS   AGE     IP            NODE         
backend-658f6cb858-dlrz8    1/1     Running   0          5m38s   172.16.0.67   192.168.0.100
frontend-67ff9b7b97-dsqzn   1/1     Running   0          6s      172.16.0.70   192.168.0.100
frontend-67ff9b7b97-hxm5t   1/1     Running   0          6s      172.16.0.71   192.168.0.100
frontend-67ff9b7b97-z8pdb   1/1     Running   0          6s      172.16.0.72   192.168.0.100

Note the topologyKey field. It first delimits a scope (a topology domain) keyed by the given label, and the rule below is then matched within that scope. Since every node here carries kubernetes.io/hostname, each node is its own domain, so the effect of topologyKey is not visible yet.
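
As a sketch of a larger topology domain (reusing the zone label shown earlier), setting topologyKey to the availability-zone label would only require the frontend to run in the same zone as the backend, not necessarily on the same node:

      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - topologyKey: failure-domain.beta.kubernetes.io/zone   # domain = availability zone
            labelSelector:
              matchLabels:
                app: backend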

Now suppose the backend has two Pods, running on different nodes.

$ kubectl get po -o wide
NAME                       READY   STATUS    RESTARTS   AGE     IP            NODE         
backend-658f6cb858-5bpd6   1/1     Running   0          23m     172.16.0.40   192.168.0.97
backend-658f6cb858-dlrz8   1/1     Running   0          2m36s   172.16.0.67   192.168.0.100

Label the nodes 192.168.0.97 and 192.168.0.94 with perfer=true.

$ kubectl label node 192.168.0.97 perfer=true
node/192.168.0.97 labeled
$ kubectl label node 192.168.0.94 perfer=true
node/192.168.0.94 labeled

$ kubectl get node -L perfer
NAME            STATUS   ROLES    AGE   VERSION                            PERFER
192.168.0.100   Ready    <none>   44m   v1.15.6-r1-20.3.0.2.B001-15.30.2   
192.168.0.212   Ready    <none>   91m   v1.15.6-r1-20.3.0.2.B001-15.30.2   
192.168.0.94    Ready    <none>   91m   v1.15.6-r1-20.3.0.2.B001-15.30.2   true
192.168.0.97    Ready    <none>   91m   v1.15.6-r1-20.3.0.2.B001-15.30.2   true

Define the topologyKey of podAffinity as perfer.

      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - topologyKey: perfer
            labelSelector:
              matchLabels:
                app: backend

When scheduling, the nodes with the perfer label are circled out first, here 192.168.0.97 and 192.168.0.94, and then Pods with the app=backend label are matched within that scope, so all frontend Pods end up on 192.168.0.97.

$ kubectl create -f affinity3.yaml 
deployment.apps/frontend created

$ kubectl get po -o wide
NAME                        READY   STATUS    RESTARTS   AGE     IP            NODE         
backend-658f6cb858-5bpd6    1/1     Running   0          26m     172.16.0.40   192.168.0.97
backend-658f6cb858-dlrz8    1/1     Running   0          5m38s   172.16.0.67   192.168.0.100
frontend-67ff9b7b97-dsqzn   1/1     Running   0          6s      172.16.0.70   192.168.0.97
frontend-67ff9b7b97-hxm5t   1/1     Running   0          6s      172.16.0.71   192.168.0.97
frontend-67ff9b7b97-z8pdb   1/1     Running   0          6s      172.16.0.72   192.168.0.97

Pod Anti-Affinity

We have covered Pod affinity, which deploys Pods together. Sometimes the requirement is exactly the opposite: Pods need to be deployed apart, for example because running them on the same node would hurt performance.

The following example defines an anti-affinity rule. It states that a Pod must not be scheduled to a node that already runs a Pod with the app=frontend label, so the frontend Pods are spread across different nodes (each node runs at most one of them).

apiVersion: apps/v1
kind: Deployment
metadata:
  name:   frontend
  labels:
    app:  frontend
spec:
  selector:
    matchLabels:
      app: frontend
  replicas: 5
  template:
    metadata:
      labels:
        app:  frontend
    spec:
      containers:
      - image:  nginx:alpine
        name:  frontend
        resources:
          requests:
            cpu:  100m
            memory:  200Mi
          limits:
            cpu:  100m
            memory:  200Mi
      imagePullSecrets:
      - name: default-secret
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - topologyKey: kubernetes.io/hostname
            labelSelector:
              matchLabels:
                app: frontend

Create the Deployment and check the Pods: each node runs only one frontend Pod, and one Pod is Pending. When the fifth Pod is scheduled, all 4 nodes already run a Pod with app=frontend, so the fifth Pod stays Pending.

$ kubectl create -f affinity4.yaml 
deployment.apps/frontend created

$ kubectl get po -o wide
NAME                        READY   STATUS    RESTARTS   AGE   IP            NODE         
frontend-6f686d8d87-8dlsc   1/1     Running   0          18s   172.16.0.76   192.168.0.100
frontend-6f686d8d87-d6l8p   0/1     Pending   0          18s   <none>        <none>       
frontend-6f686d8d87-hgcq2   1/1     Running   0          18s   172.16.0.54   192.168.0.97 
frontend-6f686d8d87-q7cfq   1/1     Running   0          18s   172.16.0.47   192.168.0.212
frontend-6f686d8d87-xl8hx   1/1     Running   0          18s   172.16.0.23   192.168.0.94 
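
If leaving a Pod Pending is not acceptable, the anti-affinity can be relaxed to a preference instead of a hard requirement. A minimal sketch (the weight value is illustrative) that would let the extra Pod land on a node already running a frontend Pod:

      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100                            # illustrative weight
            podAffinityTerm:
              topologyKey: kubernetes.io/hostname
              labelSelector:
                matchLabels:
                  app: frontend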
