
This article is shared from the Huawei Cloud Community post "Test on the Effect of Preempting Online Tasks, Suppressing CPU Resources for Offline Tasks, and Ensuring Service Quality for Online Tasks in CCE Cloud Native Mixed Deployment Scenario". Author: You can make a friend.

Background

Enterprise IT environments usually run two major types of workloads: online services and offline jobs.

Online tasks: long-running; service traffic and resource utilization follow a tidal (peak and trough) pattern; sensitive to latency; high SLA requirements. Example: e-commerce transaction services.

Offline tasks: run in bounded time windows; high resource utilization while running; insensitive to latency; tolerant of failures, and an interrupted run can generally be re-executed. Example: big data processing.

Co-location (mixed deployment) mainly improves resource utilization by deploying online and offline tasks on the same node. For example, a node that previously ran 3 online tasks with high SLA requirements now runs 3 online tasks and 3 offline tasks together. The offline tasks use the resources that the online services leave idle at different times, without affecting the service quality of the online services.

At the container level, co-location mainly involves: 1) scheduling: node resources are oversubscribed, and online and offline tasks are scheduled onto the same node; 2) CPU: online tasks preempt and offline tasks are suppressed; 3) memory: not covered in this article. With the CPU co-location capability, the system automatically lets online tasks "preempt" CPU and "suppresses" offline tasks at run time, based on the resource usage of both, so that the resource demands of online tasks are met. Take a 4-core machine as an example:

When an online task needs 3 CPU cores, the system must "suppress" the offline task so that it uses at most 1 CPU core.
When the online task is in a business trough and uses only 1 CPU core, the offline task can temporarily use the remaining CPU; when online traffic rises again, the system ensures that the online task can "preempt" the CPU held by the offline task.
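
The 4-core example above comes down to simple arithmetic. The following is a minimal sketch, assuming online tasks are always served first and offline tasks get whatever is left; the function name offline_cpu_budget is hypothetical and is not part of CCE or volcano:

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of CCE or volcano: cores left for offline
# tasks once online tasks have taken what they currently need.
offline_cpu_budget() {
  local node_cores=$1 online_demand=$2
  local left=$(( node_cores - online_demand ))
  if (( left < 0 )); then
    left=0   # online demand alone already saturates the node
  fi
  echo "$left"
}

offline_cpu_budget 4 3   # online peak: offline is suppressed to 1 core
offline_cpu_budget 4 1   # online trough: offline may use up to 3 cores
```

The real mechanism works on CPU time inside the kernel rather than on whole cores, so this only illustrates the budget, not how it is enforced.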

Environment preparation

Environment requirements

Cluster version:

v1.19 cluster: v1.19.16-r4 and above
v1.21 cluster: v1.21.7-r0 and above
v1.23 cluster: v1.23.5-r0 and above
v1.25 and above
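
The version list above can be checked mechanically. The following is a rough sketch, assuming version strings of the form v1.21.7-r0 and a `sort` that supports `-V` (version sort, as in GNU coreutils); the function name cluster_version_ok is hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if a CCE cluster version string such as
# "v1.21.7-r0" meets the minimums listed above.
cluster_version_ok() {
  local ver=$1 minor min
  minor=${ver#v1.}        # "v1.21.7-r0" -> "21.7-r0"
  minor=${minor%%.*}      # -> "21"
  case $minor in
    19) min=v1.19.16-r4 ;;
    21) min=v1.21.7-r0 ;;
    23) min=v1.23.5-r0 ;;
    *)  # v1.25 and above qualify; other minor versions are not in the list
        [[ $minor =~ ^[0-9]+$ ]] && (( minor >= 25 ))
        return ;;
  esac
  # The version is acceptable if the minimum sorts first (or is equal).
  [ "$(printf '%s\n' "$min" "$ver" | sort -V | head -n 1)" = "$min" ]
}

for v in v1.19.16-r4 v1.21.5-r0 v1.25.3-r0; do
  if cluster_version_ok "$v"; then
    echo "$v: ok"
  else
    echo "$v: too old or not listed"
  fi
done
```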

Cluster type: CCE Standard cluster or CCE Turbo cluster.

Node OS: EulerOS 2.9 (kernel-4.18.0-147.5.1.6.h729.6.eulerosv2r9.x86_64) or Huawei Cloud EulerOS 2.0.

Node type: elastic virtual machine.

Volcano plug-in version: 1.7.0 and above.

Environment information

The CCE cluster has the kube-prometheus-stack, grafana and volcano plug-ins deployed.

CPU suppression and preemption demonstration

Stress test baseline

Create the workloads required for the demonstration, and make sure both are scheduled to the same node. (The expressions in the dashboard are generally tied to the Pod names, so it is recommended to keep the workload names used here; otherwise the dashboard will not display correctly.)

kind: Deployment 
apiVersion: apps/v1 
metadata:
  name: redis        
spec: 
  replicas: 1 
  selector: 
    matchLabels: 
      app: redis 
  template: 
    metadata:
      creationTimestamp: null 
      labels: 
        app: redis 
      annotations: 
        prometheus.io/path: /metrics 
        prometheus.io/port: '9121' 
        prometheus.io/scrape: 'true' 
    spec: 
      containers: 
        - name: container-1 
          image: swr.cn-north-4.myhuaweicloud.com/testapp/redis:v6 
          resources: 
            limits: 
              cpu: '1' 
            requests: 
              cpu: 250m 
        - name: container-2 
          image: bitnami/redis-exporter:latest 
          resources:
            limits:
              cpu: 250m
              memory: 512Mi 
            requests:
              cpu: 250m
              memory: 512Mi
      imagePullSecrets: 
        - name: default-secret 
      schedulerName: volcano 
--- 
kind: Deployment 
apiVersion: apps/v1
metadata:
  name: stress 
spec:
  replicas: 1
  selector:
    matchLabels:
      app: stress 
  template:
    metadata:
      labels:
        app: stress 
    spec:
      containers:
        - name: container-1
          image: swr.cn-north-4.myhuaweicloud.com/testapp/centos-stress:v1 
          command: 
            - /bin/bash 
          args: 
            - '-c' 
            - while true; do echo hello; sleep 10; done 
          resources:
            limits:
              cpu: '4' 
              memory: 4Gi 
            requests:
              cpu: 2500m 
              memory: 1Gi 
      imagePullSecrets:
        - name: default-secret
      schedulerName: volcano
      affinity: 
        podAffinity: 
          requiredDuringSchedulingIgnoredDuringExecution: 
            - labelSelector: 
                matchExpressions:
                  - key: app 
                    operator: In 
                    values: 
                      - redis 
              namespaces: 
                - default 
              topologyKey: kubernetes.io/hostname
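
To confirm that both Pods actually landed on the same node, one option is to list each Pod with its node and check that only one distinct node appears. The following is a minimal sketch; same_node is a hypothetical helper, and the kubectl invocation in the comment assumes the default namespace and the app labels from the manifests above:

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads "POD NODE" lines on stdin and succeeds only if
# all Pods report the same node.
same_node() {
  awk '{ nodes[$2] = 1 }
       END { n = 0; for (k in nodes) n++; exit (n == 1 ? 0 : 1) }'
}

# Against a live cluster the input would come from kubectl, for example:
#   kubectl get pods -n default -l 'app in (redis,stress)' --no-headers \
#     -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName | same_node
# Sample input with made-up Pod names and a made-up node IP:
printf '%s\n' 'redis-abc 192.168.0.10' 'stress-xyz 192.168.0.10' | same_node \
  && echo "same node"
```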

Use the redis-benchmark command to stress test redis; 192.168.1.159 is the Pod IP of redis.

./redis-benchmark -h 192.168.1.159 -p 6379 -n 3000000 -c 100 -q -t SET,INCR,LPUSH,LPOP,RPOP,SADD,HSET,SPOP,ZADD,ZPOPMIN

Observe the redis metrics and CPU usage on the grafana page; they serve as interference-free baseline reference data.
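
To keep the baseline for later comparison, the quiet-mode output can be saved and reduced to one line per test. The following is a sketch that assumes summary lines of the form "SET: 100000.00 requests per second" (newer redis-benchmark versions append latency figures after a comma); rps_table is a hypothetical helper and the sample numbers are made up:

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn `redis-benchmark -q` output on stdin into
# "TEST RPS" pairs. Progress updates separated by carriage returns are
# split onto their own lines first so only the final summary lines match.
rps_table() {
  tr '\r' '\n' |
    awk -F': ' '/requests per second/ { split($2, f, " "); print $1, f[1] }'
}

# Made-up sample lines, for illustration only (not measured results):
printf '%s\n' \
  'SET: 100000.00 requests per second' \
  'INCR: 98765.43 requests per second, p50=0.503 msec' | rps_table
```

In practice the benchmark output would be piped in directly, for example by appending `| rps_table > baseline.txt` to the redis-benchmark command.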

Non-mixed scenario

Create a node pool for co-location and redeploy the above workloads to the new node.

Use the redis-benchmark command again to stress test redis; 192.168.1.172 is the Pod IP of redis.

./redis-benchmark -h 192.168.1.172 -p 6379 -n 3000000 -c 100 -q -t SET,INCR,LPUSH,LPOP,RPOP,SADD,HSET,SPOP,ZADD,ZPOPMIN

Enter the stress container. Once the redis metrics reach the baseline and stabilize, run the following command to drive up CPU usage.

stress-ng -c 4 -t 3600

Observe the redis metrics and CPU usage on the grafana page: redis performance degrades rapidly while the stress container is running its stress test.
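
The degradation can be expressed as a percentage drop against the baseline. The following is a minimal sketch; rps_drop_pct is a hypothetical helper, and the numbers passed to it are made up for illustration rather than taken from this test:

```shell
#!/usr/bin/env bash
# Hypothetical helper: percentage drop in requests per second from the
# baseline run ($1) to the run under CPU contention ($2).
rps_drop_pct() {
  awk -v base="$1" -v now="$2" \
    'BEGIN { printf "%.1f\n", (base - now) / base * 100 }'
}

# Made-up numbers for illustration only, not measurements from this test:
rps_drop_pct 100000 60000   # prints 40.0
```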

Mixed scenario

Update the node pool and, in the advanced configuration, add the co-location label to the nodes: volcano.sh/colocation="true"

In the node pool, click Configuration Management, open the kubelet component configuration, and enable the node co-location feature.

Set the eviction threshold of the node to 100 to avoid direct eviction when CPU usage exceeds the threshold during the stress test.

kubectl annotate node 192.168.0.209 volcano.sh/evicting-cpu-high-watermark=100

Modify the annotations of the stress workload to mark it as an offline service. The redis workload does not need to be modified.

kind: Deployment 
apiVersion: apps/v1 
metadata:
  name: stress 
spec: 
  replicas: 1 
  selector: 
    matchLabels: 
      app: stress 
  template: 
    metadata:
      labels: 
        app: stress 
      annotations:          
        volcano.sh/qos-level: "-1" # Offline job annotations
    spec: 
      containers: 
        - name: container-1 
          image: swr.cn-north-4.myhuaweicloud.com/testapp/centos-stress:v1 
          command: 
            - /bin/bash 
          args: 
            - '-c' 
            - while true; do echo hello; sleep 10; done 
          resources: 
            limits: 
              cpu: '4' 
              memory: 4Gi 
            requests: 
              cpu: 2500m 
              memory: 1Gi 
      imagePullSecrets: 
        - name: default-secret 
      schedulerName: volcano
      affinity: 
        podAffinity: 
          requiredDuringSchedulingIgnoredDuringExecution: 
            - labelSelector: 
                matchExpressions:
                  - key: app 
                    operator: In 
                    values: 
                      - redis 
              namespaces: 
                - default 
              topologyKey: kubernetes.io/hostname
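
Instead of re-applying the whole manifest, the same annotation could probably be added with a JSON merge patch. The following sketch only prints the command so that it can be reviewed before running; it assumes the Deployment is named stress and lives in the current namespace. Changing the Pod template causes the Deployment to roll out a new Pod.

```shell
#!/usr/bin/env bash
# JSON merge patch that adds the offline annotation to the Pod template.
patch='{"spec":{"template":{"metadata":{"annotations":{"volcano.sh/qos-level":"-1"}}}}}'

# Print the command instead of running it, so it can be reviewed first.
# Assumes the Deployment is called "stress" as in the manifest above.
echo "kubectl patch deployment stress --type merge -p '$patch'"
```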

Use the redis-benchmark command to stress test redis; 192.168.1.172 is the Pod IP of redis.

./redis-benchmark -h 192.168.1.172 -p 6379 -n 3000000 -c 100 -q -t SET,INCR,LPUSH,LPOP,RPOP,SADD,HSET,SPOP,ZADD,ZPOPMIN

Enter the stress container. Once the redis metrics reach the baseline and stabilize, run the following command to drive up CPU usage.

stress-ng -c 4 -t 3600

Observe the redis metrics and CPU usage on the grafana page. In the mixed scenario, even when the offline task tries to exhaust the node's CPU, the operating system still satisfies the CPU demands of the online task, ensuring its service quality.

