Exploring the three K8S probes: ReadinessProbe, LivenessProbe, and StartupProbe

The author of this article is LEE, Lao Li, a technical veteran who has worked in the IT industry for 16 years.

event background

Because k8s relies heavily on asynchronous mechanisms and decouples its objects from one another, the system cannot guarantee that the Service and Ingress configuration related to an application is refreshed in time when instances are added or removed, or when a version change triggers a rolling upgrade. In some cases the new Pod has only finished its own initialization, and the system has not yet refreshed externally visible access information such as the Endpoint list and the load balancer, when the old Pod is deleted, so the service becomes temporarily unavailable. This is unacceptable in production, so k8s provides health probes: StartupProbe, LivenessProbe, ReadinessProbe.

technology exploration

pod status

Pod common states

  • Pending : The request to create the Pod has been accepted, but the Pod is not running yet. Either scheduling has not completed because no node currently meets the scheduling conditions, or the Pod has been assigned to a node and the cluster is still setting up the container network or downloading the image.
  • Running : All containers in the Pod have been created, and at least one container is running, starting, or restarting.
  • Succeeded : All containers in the Pod have exited successfully and will not be restarted.
  • Failed : All containers in the Pod have exited, and at least one of them exited in failure.
  • Unknown : The status of the Pod cannot be obtained. The apiserver gets Pod status by communicating with the kubelet on the node where the Pod runs. If that kubelet fails, or the apiserver cannot connect to it, no status information is available and the Pod is shown as Unknown.

Pod status wheel

### Pod Restart Policy

  • Always : Restart the container whenever it exits, regardless of the exit code. This is the default.
  • OnFailure : Restart the container only when it exits abnormally (non-zero exit code).
  • Never : Never restart the container regardless of the container state.
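To make the three policies concrete, here is a minimal sketch of where restartPolicy sits in a manifest. The Pod name and the command are placeholders chosen for illustration. restartPolicy is set once per Pod and applies to all of its containers.

```yaml
apiVersion: v1
kind: Pod
metadata:
    name: restart-policy-demo # hypothetical name
spec:
    restartPolicy: OnFailure # one of Always (default), OnFailure, Never
    containers:
        - name: worker
          image: busybox
          args:
              - /bin/sh
              - -c
              - sleep 5; exit 1 # non-zero exit code, so OnFailure restarts the container
```

With Never in place of OnFailure, the same Pod would run once and end up in the Failed state.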

### Pod common state transition scenarios

Pod state transition

Probe Introduction

K8S provides 3 probes:

  • ReadinessProbe
  • LivenessProbe
  • StartupProbe (added in version 1.16)

The purpose of the probe

In Kubernetes, the Pod is the smallest computing unit. A Pod is composed of one or more containers, and each container typically runs an application. While the application is running, the program may hang or crash because of unexpected circumstances.

How to monitor the health of these containers so that the service keeps running, and how to restart a container once a problem occurs, therefore becomes a top priority. With this in mind, Kubernetes introduced the liveness probe mechanism.

The liveness probe ensures that a program that hangs during operation is restarted automatically, but another problem is still often encountered. For example, after starting a Pod in Kubernetes, the Pod is shown as started and its port is reachable, but requests return an error. Likewise, during a rolling update there is often a period in which the Pod is already exposed to outside traffic but requests return 404. In both cases the Pod has started successfully while the application inside the container is still starting. With this in mind, Kubernetes introduced the readiness probe mechanism.

  1. LivenessProbe : The liveness probe judges whether the container is healthy. The kubelet probes according to the configuration (running a command, requesting an HTTP endpoint, or connecting to a port) to determine whether the container is working normally. If the probe fails (the number of consecutive failures that counts as unhealthy is configurable), the kubelet kills the container and then handles it according to the restartPolicy (restart policy) set on the Pod. If no liveness probe is configured, the probe result defaults to Success, so the container is considered healthy for as long as its process is running.
  2. ReadinessProbe : The readiness probe determines whether the program in the container is ready to receive traffic. Only when the program (service) is working normally does the container start to serve network traffic (startup is complete and ready). After the container starts, the kubelet probes according to the ReadinessProbe configuration. If the probe succeeds, the container is marked ready and the Pod's READY column changes from 0/1 to 1/1; while the probe keeps failing, it stays at 0/1. If no readiness probe is configured, the result defaults to Success once the container starts. The Endpoint list of every Service associated with the Pod is maintained from the Pod's Ready status. If the Pod becomes not Ready while running, the system automatically removes it from the Endpoint list of the associated Services, and kube-proxy no longer routes requests for those Services to the Pod. This mechanism prevents traffic from being forwarded to unavailable Pods. When the Pod returns to the Ready state, it is added back to the Endpoint list, and kube-proxy may again route traffic to it through its load-balancing mechanism.
  3. StartupProbe : The startup probe addresses the case where ReadinessProbe and LivenessProbe alone cannot reliably judge whether a complex or slow-starting program has finished starting. It was introduced to complement the ReadinessProbe and LivenessProbe.

The difference between ReadinessProbe and LivenessProbe

  • When ReadinessProbe fails, the Pod's IP:Port is removed from the Endpoint list of the corresponding Service; the container is not restarted.
  • When LivenessProbe fails, the kubelet kills the container and then acts according to the Pod's restart policy.
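The difference can be sketched with a Pod and the Service that selects it. This is an illustrative manifest, not taken from a real deployment: the names web and app: web are assumptions, and the nginx image simply stands in for any HTTP application.

```yaml
apiVersion: v1
kind: Pod
metadata:
    name: web # hypothetical name
    labels:
        app: web
spec:
    containers:
        - name: web
          image: nginx
          ports:
              - containerPort: 80
          readinessProbe: # on failure: Pod leaves the Service Endpoint list, no restart
              httpGet:
                  path: /
                  port: 80
              periodSeconds: 5
          livenessProbe: # on failure: container is killed, then restartPolicy applies
              tcpSocket:
                  port: 80
              periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
    name: web
spec:
    selector:
        app: web # only Ready Pods carrying this label receive traffic
    ports:
        - port: 80
          targetPort: 80
```

Using a stricter check for readiness (HTTP) than for liveness (TCP) is one common choice: a Pod that is up but cannot serve requests is taken out of rotation without being restarted.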

The difference between StartupProbe and ReadinessProbe, LivenessProbe

If all three probes are configured, the StartupProbe runs first and the other two probes are disabled until the container satisfies the conditions configured for the StartupProbe, at which point the other two probes start. If the StartupProbe fails, the container is killed and restarted according to the restart policy.

Once the StartupProbe has succeeded, it does not run again for that container; from then on the other two probes run according to their configuration until the container terminates.
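As a minimal sketch of this ordering, the manifest below configures all three probes on one container. The Pod name and all timing values are assumptions for illustration only.

```yaml
apiVersion: v1
kind: Pod
metadata:
    name: three-probes-demo # hypothetical name
spec:
    containers:
        - name: app
          image: nginx
          startupProbe: # phase 1: runs alone until it succeeds once
              tcpSocket:
                  port: 80
              periodSeconds: 2
              failureThreshold: 15
          livenessProbe: # phase 2: starts only after the startup probe has succeeded
              tcpSocket:
                  port: 80
              periodSeconds: 10
          readinessProbe: # phase 2: same, and controls READY and Service traffic
              httpGet:
                  path: /
                  port: 80
              periodSeconds: 5
```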

Correct usage of ReadinessProbe and LivenessProbe

Both LivenessProbe and ReadinessProbe support the following three detection methods:

  • ExecAction : Execute the specified command in the container. If the execution is successful and the exit code is 0, the detection is successful.
  • HTTPGetAction : Call the HTTP Get method through the IP address, port number, and path of the container. If the status code of the response is greater than or equal to 200 and less than 400, the container is considered healthy.
  • TCPSocketAction : Perform a TCP check through the IP address and port number of the container. If a TCP connection can be established, the container is healthy.

Probe results have the following values:

  • Success: Indicates that the test is passed.
  • Failure: Indicates that the test failed.
  • Unknown: Indicates that the detection has not been performed normally.

Related properties of LivenessProbe and ReadinessProbe

A probe has several optional fields that give more precise control over the behavior of the Liveness and Readiness probes:

  • initialDelaySeconds : The number of seconds to wait after the container starts before the probe begins. Default 0, minimum 0.
  • periodSeconds : The interval between probes, in seconds. Default 10, minimum 1.
  • timeoutSeconds : The number of seconds to wait for a response before the probe times out. Default 1, minimum 1.
  • successThreshold : The minimum number of consecutive successes for the probe to be considered successful after it has failed. Default 1, minimum 1; it must be 1 for the Liveness probe.
  • failureThreshold : The number of consecutive failures after which the probe is considered failed. For the readiness probe, the Pod is then marked as not ready; for the liveness probe, the container is restarted. Default 3, minimum 1.
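The fields above can be seen together in one probe. The sketch below writes every field out with its default value, so it behaves the same as a probe that sets none of them; the Pod name is a placeholder.

```yaml
apiVersion: v1
kind: Pod
metadata:
    name: probe-fields-demo # hypothetical name
spec:
    containers:
        - name: app
          image: nginx
          readinessProbe:
              httpGet:
                  path: /
                  port: 80
              initialDelaySeconds: 0 # start probing as soon as the container starts
              periodSeconds: 10 # probe every 10s
              timeoutSeconds: 1 # no response within 1s counts as a failure
              successThreshold: 1 # 1 success marks the container ready again
              failureThreshold: 3 # 3 consecutive failures mark it not ready
```

With these values, a container that stops responding is marked not ready after roughly failureThreshold x periodSeconds = 30 seconds.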

Tips : initialDelaySeconds does not have to be set on a ReadinessProbe. Without it, the ReadinessProbe starts probing as soon as the container starts, but does that matter? Unlike the StartupProbe, the ReadinessProbe and LivenessProbe run for the entire life cycle of the container. If the ReadinessProbe fails right after startup, the only effect is that READY stays at 0/1. A ReadinessProbe failure never restarts the container; only StartupProbe and LivenessProbe failures do. A few seconds later, once the service has really started and the probe succeeds, READY becomes 1/1 on its own.

The correct way to use StartupProbe

The StartupProbe probe supports the following three probing methods:

  • ExecAction : Execute the specified command in the container. If the execution is successful and the exit code is 0, the detection is successful.
  • HTTPGetAction : Call the HTTP Get method through the IP address, port number, and path of the container. If the status code of the response is greater than or equal to 200 and less than 400, the container is considered healthy.
  • TCPSocketAction : Perform a TCP check through the IP address and port number of the container. If a TCP connection can be established, the container is healthy.

Probe results have the following values:

  • Success: Indicates that the test is passed.
  • Failure: Indicates that the test failed.
  • Unknown: Indicates that the detection has not been performed normally.

StartupProbe Probe Properties

  • initialDelaySeconds : The number of seconds to wait after the container starts before the probe begins. Default 0, minimum 0.
  • periodSeconds : The interval between probes, in seconds. Default 10, minimum 1.
  • timeoutSeconds : The number of seconds to wait for a response before the probe times out. Default 1, minimum 1.
  • successThreshold : The minimum number of consecutive successes for the probe to be considered successful. Default 1, minimum 1; it must be 1 for the Startup probe, as for the Liveness probe.
  • failureThreshold : The number of consecutive failures after which the probe is considered failed. For the startup probe, the container is then killed and handled according to the restart policy. Default 3, minimum 1.

Tips : Once the StartupProbe succeeds, the other two probes start running with their own configuration. Because the StartupProbe already covers the startup wait, if the other two probes set initialDelaySeconds, it is recommended not to make it too long.
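A common way to size a StartupProbe is to treat failureThreshold x periodSeconds as the startup budget. The sketch below assumes an application that may need up to 5 minutes to start; the Pod name and port name are placeholders, and the image and /ping path are borrowed from the HTTP examples in this article.

```yaml
apiVersion: v1
kind: Pod
metadata:
    name: slow-start-demo # hypothetical name
spec:
    containers:
        - name: app
          image: test.com/test-http-prober:v0.0.1
          ports:
              - name: management
                containerPort: 8081
          startupProbe:
              httpGet:
                  path: /ping
                  port: management
              periodSeconds: 10
              failureThreshold: 30 # startup budget: 30 * 10s = 300s
          livenessProbe:
              httpGet:
                  path: /ping
                  port: management
              periodSeconds: 10
              failureThreshold: 1 # after startup, a single failed probe restarts the container
```

The long budget applies only during startup, so the LivenessProbe can stay strict without killing the container while it is still starting.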

Example of use

LivenessProbe probe usage example

1. Do health detection through exec

[root@localhost ~]# vim liveness-exec.yaml
apiVersion: v1
kind: Pod
metadata:
    name: liveness-exec
    labels:
        app: liveness
spec:
    containers:
        - name: liveness
          image: busybox
          args: # create the file that the probe checks, then delete it after 30s
              - /bin/sh
              - -c
              - touch /tmp/healthy; sleep 30; rm -rf /tmp/healthy; sleep 600
          livenessProbe:
              initialDelaySeconds: 10 # delay before the first probe
              periodSeconds: 5 # interval between probes
              exec: # probe by running a command in the container
                  command: # executed directly, not through a shell
                      - cat # the executable
                      - /tmp/healthy # each argument is its own list item; the result is `cat /tmp/healthy`

How it works:

After the container starts, it runs /bin/sh -c "touch /tmp/healthy; sleep 30; rm -rf /tmp/healthy; sleep 600": it first creates the /tmp/healthy file, sleeps for 30 seconds, and then deletes the file.

The liveness probe runs cat /tmp/healthy in the container. A single successful execution (default successThreshold: 1) counts as a successful probe. Because failureThreshold and timeoutSeconds are not configured, their defaults apply: each probe waits at most 1s for the command to finish, and 3 consecutive failures mark the container as unhealthy.

For the first 30 seconds the file exists, so cat /tmp/healthy succeeds. After the file is deleted the command fails, and once the failure threshold is reached the kubelet kills the container and handles it according to the restart policy set on the Pod.

2. Do health detection through HTTP

[root@localhost ~]# vi liveness-http.yaml
apiVersion: v1
kind: Pod
metadata:
    name: liveness-http
    labels:
        test: liveness
spec:
    containers:
        - name: liveness
          image: test.com/test-http-prober:v0.0.1
          livenessProbe:
              failureThreshold: 5 # 5 consecutive failures mark the container as unhealthy
              initialDelaySeconds: 20 # delay before the first probe
              periodSeconds: 10 # interval between probes
              timeoutSeconds: 5 # probe timeout
              successThreshold: 1 # must be 1 for a liveness probe
              httpGet:
                  scheme: HTTP
                  port: 8081
                  path: /ping

How it works:

After the container starts and the 20s initial delay has passed, the LivenessProbe starts working and requests http://Pod_IP:8081/ping, similar to running curl -I http://Pod_IP:8081/ping. Because the request may hang, each probe is given a 5s timeout. If a response arrives within 5s with a status code >=200 and <400, the probe succeeds. If the status code is anything else, or there is still no response after 5s, the request is aborted (similar to pressing ctrl+c) and the probe fails.

After 10s the probe requests http://Pod_IP:8081/ping again. A single success is enough to consider the container healthy (successThreshold must be 1 for a LivenessProbe; the apiserver rejects any other value). After 5 consecutive failures the container is considered unhealthy and is restarted. This probing continues for the entire life cycle of the container.

Tips

The Http Get detection method has the following optional control fields:

  • scheme: The protocol used to connect to the host, the default is HTTP.
  • host: The name of the host to connect to, the default is the Pod IP. To send a specific Host header, set "Host" in httpHeaders instead.
  • port: The port number or name to be accessed on the container.
  • path: The access URI on the http server.
  • httpHeaders: Custom HTTP request headers, HTTP allows repeated headers.
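As a sketch of how these fields combine, the probe below sends a Host header and one custom header with each request. The Pod name, the port name, the host name probe.example.internal, and the X-Probe-Source header are all hypothetical values chosen for illustration.

```yaml
apiVersion: v1
kind: Pod
metadata:
    name: http-headers-demo # hypothetical name
spec:
    containers:
        - name: app
          image: test.com/test-http-prober:v0.0.1
          ports:
              - name: management
                containerPort: 8081
          livenessProbe:
              httpGet:
                  scheme: HTTP
                  port: management # a named container port can be used instead of a number
                  path: /ping
                  httpHeaders:
                      - name: Host
                        value: probe.example.internal # hypothetical virtual host
                      - name: X-Probe-Source # hypothetical custom header
                        value: kubelet
```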

3. Do health detection through TCP

[root@localhost ~]# vi liveness-tcp.yaml
apiVersion: v1
kind: Pod
metadata:
    name: liveness-tcp
    labels:
        app: liveness
spec:
    containers:
        - name: liveness
          image: nginx
          livenessProbe:
              initialDelaySeconds: 15
              periodSeconds: 20
              tcpSocket:
                  port: 80

How it works:

The TCP check works much like the HTTP check. Once the time set by initialDelaySeconds (15 seconds) has passed after the container starts, the kubelet sends the first LivenessProbe and tries to connect to port 80 of the container, similar to telnet Pod_IP 80. The probe repeats every 20 seconds (periodSeconds). If the connection cannot be established failureThreshold times in a row (3 by default), the kubelet kills the container and it is restarted according to the restart policy.

ReadinessProbe Probe Usage Example

The ReadinessProbe is configured in the same way as the LivenessProbe and supports the same three detection methods. The difference is that one detects whether the application is alive, while the other decides whether the Pod should receive external traffic.

[root@localhost ~]# vim readiness-exec.yaml
apiVersion: v1
kind: Pod
metadata:
    name: readiness-exec
    labels:
        app: readiness-exec
spec:
    containers:
        - name: readiness-exec
          image: busybox
          args: # create the file that the probe checks, then delete it after 30s
              - /bin/sh
              - -c
              - touch /tmp/healthy; sleep 30; rm -rf /tmp/healthy; sleep 600
          readinessProbe:
              initialDelaySeconds: 10
              periodSeconds: 5
              exec:
                  command:
                      - cat
                      - /tmp/healthy
---
apiVersion: v1
kind: Pod
metadata:
    name: readiness-http
    labels:
        app: readiness-http
spec:
    containers:
        - name: readiness-http
          image: test.com/test-http-prober:v0.0.1
          ports:
              - name: server
                containerPort: 8080
              - name: management
                containerPort: 8081
          readinessProbe:
              initialDelaySeconds: 20
              periodSeconds: 5
              timeoutSeconds: 10
              httpGet:
                  scheme: HTTP
                  port: 8081
                  path: /ping
---
apiVersion: v1
kind: Pod
metadata:
    name: readiness-tcp
    labels:
        app: readiness-tcp
spec:
    containers:
        - name: readiness-tcp
          image: nginx
          readinessProbe:
              initialDelaySeconds: 15
              periodSeconds: 20
              tcpSocket:
                  port: 80

A note on terminationGracePeriodSeconds

The terminationGracePeriodSeconds parameter is very important and deserves a detailed explanation. For that, please refer to my other article "Detailed Interpretation of Pod Graceful Exit in Kubernetes to Help You Solve Big Problems"; here I only cover what relates to probes.

Tips : terminationGracePeriodSeconds cannot be set on a ReadinessProbe; if it is, the request is rejected by the apiserver.

livenessProbe:
    httpGet:
        path: /ping
        port: liveness-port
    failureThreshold: 1
    periodSeconds: 30
    terminationGracePeriodSeconds: 30 # grace period of 30s
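For context, here is a sketch of the fragment above inside a complete Pod, next to the Pod-level field of the same name. The Pod name and the 3600s value are assumptions for illustration; the port name liveness-port must match a named container port. On older clusters the probe-level field may depend on a feature gate, so check the version you run.

```yaml
apiVersion: v1
kind: Pod
metadata:
    name: grace-period-demo # hypothetical name
spec:
    terminationGracePeriodSeconds: 3600 # Pod-level: applies when the Pod is deleted normally
    containers:
        - name: app
          image: test.com/test-http-prober:v0.0.1
          ports:
              - name: liveness-port
                containerPort: 8081
          livenessProbe:
              httpGet:
                  path: /ping
                  port: liveness-port
              failureThreshold: 1
              periodSeconds: 30
              terminationGracePeriodSeconds: 30 # probe-level: applies when this probe fails
```

The probe-level value lets a container that failed its LivenessProbe be stopped quickly, without shortening the grace period used for a normal shutdown.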

Example of using the StartupProbe probe

[root@localhost ~]# vim startup.yaml
apiVersion: v1
kind: Pod
metadata:
    name: startup
    labels:
        app: startup
spec:
    containers:
        - name: startup
          image: nginx
          startupProbe:
              failureThreshold: 3 # consecutive failures before the probe is considered failed
              initialDelaySeconds: 5 # start probing this many seconds after the container starts
              timeoutSeconds: 10 # a probe with no result within this time counts as a failure
              periodSeconds: 5 # interval between probes
              httpGet:
                  path: /test
                  port: 80

How it works:

Once the time set by initialDelaySeconds (5 seconds) has passed after the container starts, the kubelet sends the first StartupProbe, an HTTP GET to /test on port 80 of the container. Probes are sent every 5 seconds (periodSeconds), and each one must return within the 10-second timeout (timeoutSeconds). As soon as one probe succeeds, the startup is considered successful. If 3 probes in a row fail (failureThreshold), the startup is considered failed and the kubelet kills the container, which is then handled according to the restart policy.

Summary

Through this exploration of the three probes, we can sum up in one sentence: understanding the underlying mechanisms lets you maximize the availability, security, and continuity of your Pods and keep them in their best working state. There is no "silver bullet"; important businesses in particular need a solution worked out case by case. I hope this analysis opens a door for your own thinking.

Origin blog.csdn.net/weixin_39992480/article/details/128447377