This article is shared from the Huawei Cloud Community post "Test on the Effect of Preempting Online Tasks, Suppressing CPU Resources for Offline Tasks, and Ensuring Service Quality for Online Tasks in CCE Cloud Native Mixed Deployment Scenario". Author: You can make a friend.
Background
Enterprise IT environments usually run two major types of workloads: online services and offline jobs.
Online tasks: long-running; service traffic and resource utilization follow a tidal (peak and off-peak) pattern; they are latency-sensitive and have strict SLA requirements. E-commerce transaction services are a typical example.
Offline tasks: run in bounded time windows with high resource utilization while running; they are not latency-sensitive, tolerate failures well, and can generally be re-run after an interruption. Big data processing is a typical example.
The main form of co-location is deploying online and offline tasks on the same node to improve resource utilization. For example, a node that previously ran 3 online tasks with high SLA requirements now runs 3 online tasks and 3 offline tasks together. The offline tasks use the resources that the online tasks leave idle at different times, without affecting the service quality of the online tasks.
At the container level, co-location mainly involves: 1) scheduling: node resources are oversubscribed, and online and offline tasks are scheduled onto the same node; 2) CPU: online tasks preempt and offline tasks are suppressed; 3) memory: not covered in this article. With the CPU capabilities of co-location, the system automatically "preempts" and "suppresses" offline task resources at run time, based on the resource usage of online and offline tasks, to guarantee the resource demands of online tasks. Take a 4-core machine as an example:
- When the online task needs 3 CPU cores, the system must "suppress" the offline task so that it uses at most 1 CPU core;
- When the online task is in an off-peak period and uses only 1 CPU core, the offline task can temporarily use the remaining CPU resources; when online traffic rises again, the system ensures that the online task can "preempt" the CPU resources held by the offline task.
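The 4-core example above can be sketched as simple arithmetic. This is only a simplified model under the assumption that online demand is always served first; the function name `offline_cpu_budget` is hypothetical and does not reflect how the operating system actually schedules CPU time.

```python
def offline_cpu_budget(node_cores: float, online_usage: float) -> float:
    """Cores left for offline tasks once online demand has been served first.

    Simplified model of the article's 4-core example, not the kernel's
    actual scheduling logic.
    """
    if node_cores <= 0:
        raise ValueError("node_cores must be positive")
    # Online tasks have priority: clamp their usage to [0, node_cores].
    online = min(max(float(online_usage), 0.0), float(node_cores))
    return float(node_cores) - online


if __name__ == "__main__":
    # Peak (3 cores online) and off-peak (1 core online) on a 4-core node.
    for online in (3.0, 1.0):
        budget = offline_cpu_budget(4, online)
        print(f"online uses {online} cores -> offline may use up to {budget} cores")
```

When online usage rises from 1 core to 3 cores, the offline budget in this model shrinks from 3 cores to 1 core, which is the "preempt" and "suppress" behaviour described above.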
Environment preparation
Environment requirements
Cluster version:
- v1.19 cluster: v1.19.16-r4 and above
- v1.21 cluster: v1.21.7-r0 and above
- v1.23 cluster: v1.23.5-r0 and above
- v1.25 and above
Cluster type: CCE Standard cluster or CCE Turbo cluster.
Node OS: EulerOS 2.9 (kernel-4.18.0-147.5.1.6.h729.6.eulerosv2r9.x86_64) or Huawei Cloud EulerOS 2.0.
Node type: elastic virtual machine.
Volcano plug-in version: 1.7.0 and above.
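The cluster-version requirement listed above can be checked with a small helper. This is a sketch that assumes version strings look like the ones in the list (for example `v1.21.7-r0`); the function name and the decision to treat minor versions not in the list (and below v1.25) as unsupported are assumptions for illustration.

```python
import re

# Minimum (patch, revision) per minor version, copied from the list above.
MIN_PATCH_REV = {19: (16, 4), 21: (7, 0), 23: (5, 0)}


def cluster_version_supported(version: str) -> bool:
    """Return True if a version string such as 'v1.21.7-r0' meets the list."""
    m = re.fullmatch(r"v1\.(\d+)(?:\.(\d+))?(?:-r(\d+))?", version.strip())
    if not m:
        raise ValueError(f"unrecognized version string: {version!r}")
    minor = int(m.group(1))
    patch = int(m.group(2) or 0)
    rev = int(m.group(3) or 0)
    if minor >= 25:
        return True  # "v1.25 and above"
    if minor not in MIN_PATCH_REV:
        return False  # assumption: minors absent from the list are unsupported
    return (patch, rev) >= MIN_PATCH_REV[minor]


if __name__ == "__main__":
    for v in ("v1.19.16-r3", "v1.21.7-r0", "v1.25"):
        print(v, cluster_version_supported(v))
```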
Environment information
The CCE cluster has the kube-prometheus-stack, Grafana, and Volcano plug-ins deployed.
CPU suppression and preemption demonstration
Stress test baseline
Create the workloads required for the demonstration and make sure both are scheduled onto the same node. (The dashboard's general expressions are tied to the Pod names, so it is recommended not to change the workload names; otherwise the dashboard will not display correctly.)
kind: Deployment
apiVersion: apps/v1
metadata:
  name: redis
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: redis
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: '9121'
        prometheus.io/scrape: 'true'
    spec:
      containers:
        - name: container-1
          image: swr.cn-north-4.myhuaweicloud.com/testapp/redis:v6
          resources:
            limits:
              cpu: '1'
            requests:
              cpu: 250m
        - name: container-2
          image: bitnami/redis-exporter:latest
          resources:
            limits:
              cpu: 250m
              memory: 512Mi
            requests:
              cpu: 250m
              memory: 512Mi
      imagePullSecrets:
        - name: default-secret
      schedulerName: volcano
---
kind: Deployment
apiVersion: apps/v1
metadata:
  name: stress
spec:
  replicas: 1
  selector:
    matchLabels:
      app: stress
  template:
    metadata:
      labels:
        app: stress
    spec:
      containers:
        - name: container-1
          image: swr.cn-north-4.myhuaweicloud.com/testapp/centos-stress:v1
          command:
            - /bin/bash
          args:
            - '-c'
            - while true; do echo hello; sleep 10; done
          resources:
            limits:
              cpu: '4'
              memory: 4Gi
            requests:
              cpu: 2500m
              memory: 1Gi
      imagePullSecrets:
        - name: default-secret
      schedulerName: volcano
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - redis
              namespaces:
                - default
              topologyKey: kubernetes.io/hostname
Use the redis-benchmark command to stress test Redis; 192.168.1.159 is the Pod IP of redis.
./redis-benchmark -h 192.168.1.159 -p 6379 -n 3000000 -c 100 -q -t SET,INCR,LPUSH,LPOP,RPOP,SADD,HSET,SPOP,ZADD,ZPOPMIN
Observe the Redis metrics and CPU usage on the Grafana page. With no interference on the node, these values serve as the baseline reference data.
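Besides the Grafana dashboard, the baseline can also be captured from the benchmark output itself. The sketch below parses saved `redis-benchmark -q` output into a dictionary; it assumes each result line has the shape `SET: 98765.43 requests per second` (optionally followed by latency figures), and the function name and sample numbers are made up for illustration.

```python
import re
from typing import Dict

# Assumed shape of a result line printed by `redis-benchmark -q`, e.g.
#   SET: 98765.43 requests per second
# Anything after "requests per second" (such as latency figures) is ignored.
_RESULT_LINE = re.compile(
    r"^\s*(?P<test>[^:]+):\s+(?P<rps>\d+(?:\.\d+)?) requests per second"
)


def parse_quiet_output(text: str) -> Dict[str, float]:
    """Map each benchmark test name to its requests-per-second figure."""
    results: Dict[str, float] = {}
    # Progress updates may be separated by carriage returns, so split on both.
    for line in re.split(r"[\r\n]+", text):
        m = _RESULT_LINE.match(line)
        if m:
            results[m.group("test").strip()] = float(m.group("rps"))
    return results


if __name__ == "__main__":
    # Placeholder numbers, not measurements from the article's cluster.
    sample = (
        "SET: 100000.00 requests per second\n"
        "INCR: 95000.50 requests per second, p50=0.47 msec\n"
    )
    print(parse_quiet_output(sample))
```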
Non-mixed scenario
Create a node pool for hybrid deployment and redeploy the above workloads to the new nodes.
Use the redis-benchmark command again to stress test Redis; 192.168.1.172 is the Pod IP of redis.
./redis-benchmark -h 192.168.1.172 -p 6379 -n 3000000 -c 100 -q -t SET,INCR,LPUSH,LPOP,RPOP,SADD,HSET,SPOP,ZADD,ZPOPMIN
Enter the stress container. After the Redis metrics reach the baseline and become stable, run the following command to drive up CPU usage.
stress-ng -c 4 -t 3600
Observe the Redis metrics and CPU usage on the Grafana page: Redis performance degrades rapidly while the stress container is running its load.
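To put a number on the degradation, the throughput under load can be compared with the baseline per benchmark test. This is a minimal sketch; the function name is hypothetical and the figures in the usage example are placeholders, not results from this test.

```python
from typing import Dict


def throughput_change(baseline: Dict[str, float],
                      current: Dict[str, float]) -> Dict[str, float]:
    """Percent change in requests per second relative to the baseline.

    Negative values mean the current run is slower than the baseline.
    Tests missing from `current` or with a non-positive baseline are skipped.
    """
    changes: Dict[str, float] = {}
    for test, base_rps in baseline.items():
        if test in current and base_rps > 0:
            delta = (current[test] - base_rps) / base_rps * 100.0
            changes[test] = round(delta, 1)
    return changes


if __name__ == "__main__":
    # Placeholder values for illustration only.
    baseline = {"SET": 100000.0, "INCR": 80000.0}
    under_load = {"SET": 60000.0, "INCR": 80000.0}
    print(throughput_change(baseline, under_load))
```

The same comparison can be repeated in the mixed scenario to see how close the online task stays to its baseline.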
Mixed scenario
Update the node pool and, in the advanced configuration, set the hybrid deployment label on the nodes: volcano.sh/colocation="true"
In the node pool, open Configuration Management, go to the kubelet component configuration, and enable the node co-location feature.

Modify the eviction threshold of the node to 100 to avoid direct eviction when the CPU usage exceeds the threshold during stress testing.
kubectl annotate node 192.168.0.209 volcano.sh/evicting-cpu-high-watermark=100
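Why a watermark of 100 avoids eviction can be sketched as a simple comparison. This assumes, following the wording above, that eviction is triggered only when node CPU usage strictly exceeds the watermark; the function name is hypothetical and the value 80 in the usage example is just an illustrative lower threshold.

```python
def eviction_triggered(cpu_usage_percent: float, high_watermark: float) -> bool:
    """Return True if CPU usage exceeds the eviction watermark (both in %)."""
    if not 0 <= cpu_usage_percent <= 100:
        raise ValueError("cpu_usage_percent must be between 0 and 100")
    # Assumption: eviction requires usage to be strictly above the watermark.
    return cpu_usage_percent > high_watermark


if __name__ == "__main__":
    print(eviction_triggered(95, 80))    # usage above an illustrative threshold
    print(eviction_triggered(100, 100))  # usage can never exceed 100
```

Under this assumption, a fully loaded node still never crosses a watermark of 100, so the stress test can saturate the CPU without the offline Pod being evicted.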
Modify the annotations of the stress workload to mark it as an offline task. The redis workload does not need to be modified.
kind: Deployment
apiVersion: apps/v1
metadata:
  name: stress
spec:
  replicas: 1
  selector:
    matchLabels:
      app: stress
  template:
    metadata:
      labels:
        app: stress
      annotations:
        volcano.sh/qos-level: "-1" # Offline job annotation
    spec:
      containers:
        - name: container-1
          image: swr.cn-north-4.myhuaweicloud.com/testapp/centos-stress:v1
          command:
            - /bin/bash
          args:
            - '-c'
            - while true; do echo hello; sleep 10; done
          resources:
            limits:
              cpu: '4'
              memory: 4Gi
            requests:
              cpu: 2500m
              memory: 1Gi
      imagePullSecrets:
        - name: default-secret
      schedulerName: volcano
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - redis
              namespaces:
                - default
              topologyKey: kubernetes.io/hostname
Use the redis-benchmark command to stress test Redis; 192.168.1.172 is the Pod IP of redis.
./redis-benchmark -h 192.168.1.172 -p 6379 -n 3000000 -c 100 -q -t SET,INCR,LPUSH,LPOP,RPOP,SADD,HSET,SPOP,ZADD,ZPOPMIN
Enter the stress container. After the Redis metrics reach the baseline and become stable, run the following command to drive up CPU usage.
stress-ng -c 4 -t 3600
Observe the Redis metrics and CPU usage on the Grafana page. In the mixed scenario, even when the offline task tries to exhaust the node's CPU, the operating system still satisfies the CPU demands of the online task, which preserves the online task's service quality.
