Exploring Istio traffic management issues: [Connection pool (circuit breaking)] and [Outlier detection]

Foreword

Due to business needs, we upgraded our technical architecture. After many comparisons, we finally chose the cutting-edge service mesh approach, with Istio as its representative implementation. After diving in, I discovered that there is really not much material on it in China. Although the official documentation introduces the various configuration options, most of them are only covered briefly. When you actually use it in production, you need to keep exploring the implementation mechanisms yourself; otherwise you will not know why a parameter is set the way it is, or whether that setting is appropriate for your business scenario.

Although Istio is an open source framework, as someone in an operations role I do not have a deep understanding of the code, and I cannot get to the essence by studying the source the way many experts do. So I can only find the "truth" through constant practice and experiments. This article records the various questions I encountered while using Istio. Some have been verified in practice; others have no definite answer yet and need to be explored further in future work and study.

The article will be updated continuously: adding new problems, resolving old ones, or revising earlier answers.

Definitions

Service levels

Note: the following definitions are made only for the convenience of this article's descriptions.

  • First-level service: in this article, a service whose port is directly mapped by the ingressgateway and which can therefore be called directly from outside the cluster.

  • Second-level service: a service called by a first-level service; third-level, fourth-level, and so on follow the same pattern.

Note: The service levels here are relative, not absolute; the purpose is simply to make it easier to describe the calling relationships between services later. For example, if there are three services A, B, and C, where A is mapped by the ingressgateway, A calls B, and B calls C, then A is a first-level service, B is a second-level service, and C is a third-level service. If there is also a service D that is mapped by the ingressgateway and D calls C, then C is also a second-level service.


Traffic policy related

Question 1: Will first-level services be evicted? If not, what happens if the first-level service fails? Is there any way to fail fast?

Answer: (inferred answer)

The following is only inference; I have not yet verified it in practice. If anyone knows the details, please share an accurate answer:

1. A first-level service has no downstream (calling) services inside the mesh other than the ingressgateway, so there is no service performing passive outlier detection against it. As long as its health checks pass, its instances should not be evicted; if the health checks fail, the instances will still be removed.

2. A first-level service is called directly from outside the cluster, so the external service (or client) that calls it should set the corresponding policies itself, such as timeouts and retries; if the caller is a client, it should also handle what to display when a request fails, and so on. One mesh-side way to fail fast at the gateway is sketched below.
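
As a minimal sketch of that fail-fast idea (the productpage service, the my-gateway Gateway, and the external host are all hypothetical names, not taken from this article's setup), a timeout and a bounded retry policy can be put on the gateway route in a VirtualService so that outside callers are not left hanging:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: productpage            # hypothetical first-level service
spec:
  hosts:
  - "productpage.example.com"  # assumed external host
  gateways:
  - my-gateway                 # assumed Gateway resource
  http:
  - route:
    - destination:
        host: productpage
        port:
          number: 9080
    timeout: 3s                # fail fast instead of waiting on a stuck instance
    retries:
      attempts: 2
      perTryTimeout: 1s
      retryOn: 5xx,connect-failure
```

The numbers here are placeholders; what matters is that the gateway route, rather than the external caller alone, can enforce the fail-fast behaviour.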


Question 2: Should connection pools be set up for second-level and later services? If a connection pool is set up and outlier detection is also configured, will services be evicted because an upstream service's connection pool overflows?

Answer: (please refer to question 8)


Question 3: If connection pools are not set up for second-level and later services, how should these services be protected from the impact of heavy traffic? Is rate limiting through an EnvoyFilter the way to do it?

Answer: (please refer to question 8)

Rate limiting with an EnvoyFilter is feasible, but how to apply it in production needs further exploration; a rough sketch follows.
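
As a starting point, the sketch below is modeled on the local rate limit example in the Istio documentation; the httpbin workload label and the 100-requests-per-second token bucket are assumptions for illustration only. It attaches Envoy's local rate limit filter to a workload's inbound listener:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: httpbin-local-ratelimit      # hypothetical name
spec:
  workloadSelector:
    labels:
      app: httpbin                   # assumed workload label
  configPatches:
  - applyTo: HTTP_FILTER
    match:
      context: SIDECAR_INBOUND
      listener:
        filterChain:
          filter:
            name: envoy.filters.network.http_connection_manager
    patch:
      operation: INSERT_BEFORE
      value:
        name: envoy.filters.http.local_ratelimit
        typed_config:
          "@type": type.googleapis.com/udpa.type.v1.TypedStruct
          type_url: type.googleapis.com/envoy.extensions.filters.http.local_ratelimit.v3.LocalRateLimit
          value:
            stat_prefix: http_local_rate_limiter
            token_bucket:
              max_tokens: 100        # burst size
              tokens_per_fill: 100   # refilled each interval, roughly 100 requests/second
              fill_interval: 1s
            filter_enabled:
              runtime_key: local_rate_limit_enabled
              default_value:
                numerator: 100
                denominator: HUNDRED
            filter_enforced:
              runtime_key: local_rate_limit_enforced
              default_value:
                numerator: 100
                denominator: HUNDRED
```

Requests beyond the bucket are rejected by the sidecar itself with 429, which is exactly the kind of protection Question 3 is asking about; the right bucket size still has to be worked out per service.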


Question 4: With HTTP/2, do the connectionPool restrictions still take effect? And why does HTTP/2 establish two connections per instance?

[screenshot]

Answer:

If you use HTTP/2, the connectionPool limits should be largely meaningless: normally there is only one connection between two HTTP/2 instances, so setting the value higher or lower has no effect. The official documentation is fairly clear on this:
[screenshot]
But when using HTTP/2, we find that it actually creates two connections. Searching around, I found the following explanation:

The Istio proxy does establish two TCP connections when communicating over HTTP/2, because it multiplexes over two TCP connections: one is used for the control stream and the other for the data stream. Frames on the control stream are used to control the flow of messages, while frames on the data stream carry the actual request and response data.

Note: I am not very familiar with communication protocols. If there are any mistakes, please correct me.


Question 5: How is the circuit breaker of the upstream service implemented?

Answer:

  • With HTTP/1.1, the upstream service configures a connectionPool in its DestinationRule and uses tcp.maxConnections to set the maximum number of TCP connections from each downstream instance to the upstream cluster.

Explanation:

tcp.maxConnections limits the number of TCP connections between each instance of the downstream service and the whole cluster of the upstream service. That is, if the downstream service has three instances and the upstream service also has three instances, and the upstream service sets the maximum number of connections to 100, then the total number of connections from instance 1 of the downstream service to all instances of the upstream service is at most 100; the same 100-connection limit applies to instance 2 of the downstream service, and likewise to instance 3.

It does not mean that the maximum number of connections between the downstream cluster and the upstream cluster as a whole is 100. The value of this limit can be queried via the remaining_cx counter in istio-proxy's admin interface (port 15000, /stats), as shown in the example below.
[screenshot]

Verification:

httpbin's DestinationRule sets the maximum number of connections to 10 (a sketch of this configuration follows the screenshot):
[screenshot]
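
For reference, a DestinationRule along these lines would produce the setup in the screenshot (a minimal sketch; the host name and namespace are assumptions here):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: httpbin
spec:
  host: httpbin              # assumed in-mesh service name
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 10   # limit for each downstream instance towards the whole httpbin cluster
```

On the calling side's istio-proxy, this limit shows up as the remaining_cx circuit-breaker counter for the httpbin cluster on the Envoy admin interface, which is what the screenshots below track.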

There are two fortio instances here:
Note that the mapped admin ports being accessed are different: one is 15000 and the other is 25000.

[screenshot]
[screenshot]
You can see that each one shows 10 remaining connections.

Then look at httpbin: on each instance the connection count on port 15006 is 0 (to show the results more clearly, only fortio calls httpbin, so we simply filter the connections on 15006).
[screenshot]
[screenshot]

  • First, initiate a call from one of the fortio instances:
    [screenshot]
    We start 10 concurrent connections directly and, so that the test lasts long enough, send 2,000,000 requests. You can see that remaining_cx has now dropped to 0:
    [screenshot]
    while remaining_cx on the other fortio pod is unchanged; its value is still 10:
    [screenshot]
    Looking at the connection counts on the three httpbin instances, the sum across the three pods is 10; even if the concurrency is increased, the total stays at 10:
    [screenshot]
  • Then we initiate requests from the other fortio pod as well; the total number of connections to httpbin is now 20:
    [screenshot]

[screenshot]

  • So can a single httpbin instance end up with more than the configured 10 connections? Let's change fortio to 3 instances and httpbin to 2 instances and test it.
    [screenshot]
    Clearly, a single instance can exceed the configured maximum number of connections. This also confirms, indirectly, that what tcp.maxConnections limits is the number of connections from a single downstream instance to the whole upstream cluster.

  • With HTTP/1.1, regarding maxRequestsPerConnection: anyone familiar with the HTTP protocol should understand what is going on here. Even on a persistent (keep-alive) connection, HTTP/1.1 can only send one request at a time and will not send the next until the response comes back. Once a connection has served the configured maximum number of requests, it is closed and a new connection is established. With that in mind, the role of this parameter in the Istio configuration is clear.
    [screenshot]

  • There is not much to say about http1MaxPendingRequests: it is simply the queue of pending requests, and it applies to both HTTP/1.1 and HTTP/2 (a combined configuration sketch follows the derivation below).

    Based on this, we can simply deduce the relationship between the maximum number of connections and the number of instances of upstream and downstream services:

    My mathematics is only so-so; if the derivation below is wrong, please correct me.

Suppose a single instance of service B can handle at most a connections, service B's DestinationRule sets maxConnections to x, and service B has y instances.
Service A, as the downstream service, has b instances; each instance may open up to x connections to service B, so up to x·b connections can be opened to service B in total.
Assuming these connections are spread evenly, each instance of B receives about x·b/y connections. For service B not to be overloaded, we need x·b/y <= a.
When x = a: b/y <= 1, that is, b <= y.
When x < a: b/y <= a/x, that is, b <= a·y/x.
In general, therefore, b <= a·y/x.

[screenshot]
This is only a simple derivation; the actual situation is far more complicated, and you need to calculate based on your real circumstances.
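
Putting the pieces above together, a connectionPool covering both the TCP and HTTP settings discussed so far might look like this (a sketch only; the service name and the numbers are placeholders, not recommendations):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: service-b                      # hypothetical upstream service
spec:
  host: service-b
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100            # x in the derivation: per downstream instance, towards the whole cluster
      http:
        http1MaxPendingRequests: 50    # requests queued while waiting for a free connection
        maxRequestsPerConnection: 10   # close and re-open an HTTP/1.1 connection after this many requests
        http2MaxRequests: 1000         # cap on concurrent requests when HTTP/2 is used
```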


Question 6: Is setting the connection pool size a form of rate limiting?

Answer:

There is a connection between connection pool sizing and rate limiting, but they are not the same thing.

The connection pool size limits the number of connections established and maintained towards other services, thereby avoiding connection overload or wasted resources while keeping the application performant and stable.

Rate limiting constrains the application's traffic and request load by limiting the number of requests that can be processed per unit of time. In Istio, rate limiting can be achieved through rule-based quota settings (or, as in Question 3, with an EnvoyFilter).

The difference is that the connection pool size applies to the number of established and maintained connections, while rate limiting applies to the number of requests processed per unit of time. Although the connection pool size can control the application's traffic to some extent, it is not a direct implementation of rate limiting.

To sum up, connection pool sizing and rate limiting are both means of keeping an application performant and stable, but they are two different concepts with different implementations.


Question 7: Within a mesh pod, what is the relationship between the application's connection to its sidecar and the sidecar's connection to the upstream service?

Answer:

  • If it is HTTP/2, there is not much to say: everything is a single long-lived connection.
  • If the application uses HTTP/1.1, the sidecar will by default also use HTTP/1.1 towards the upstream unless configured otherwise. Although Envoy supports converting between HTTP/1.1 and HTTP/2, Istio takes the view that forcibly converting between them may affect stability and correctness; for example, if packets are lost during HTTP/2 transmission, problems can easily surface on the HTTP/1.1 side. This brings up another feature of Envoy: for Envoy itself, upstream and downstream connections are decoupled, which is exactly why it is able to convert between HTTP/1.1 and HTTP/2. Since Envoy decouples the upstream and downstream connections, there is no strict one-to-one relationship between them. However, an indirect positive correlation still exists: the more downstream connections and requests there are, the more upstream connections and requests there will be, within the bounds of the connection pool.

Let’s use the example of bookinfo to explain:
[screenshot]
[screenshot]
Upstream and downstream are relative. When we look at connections from the sidecar's perspective, for outbound traffic its downstream is the application in the same pod and its upstream is the target cluster; for inbound traffic it is exactly the opposite.

Regardless of whether traffic is outbound or inbound, the sidecar decouples its upstream and downstream connections. In other words, the life cycles of the upstream and downstream connections are independent. For example, with HTTP/1.1 the upstream side may have 100 connections while the downstream side has 50. As another example, for outbound traffic the downstream side may be HTTP/1.1 while, with h2UpgradePolicy: UPGRADE configured, the upstream side is HTTP/2.
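
For reference, the upgrade mentioned above can be switched on per destination in the connectionPool; a sketch follows, where the reviews host is just an example name:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: reviews-h2-upgrade       # hypothetical name
spec:
  host: reviews                  # example in-mesh service
  trafficPolicy:
    connectionPool:
      http:
        h2UpgradePolicy: UPGRADE # downstream side stays HTTP/1.1, upstream connection is upgraded to HTTP/2
```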


Question 8: After the connection pool overflows and causes circuit breaking, the downstream service receives a 503. Who returns this 503?

Answer:

The 503 is returned by the sidecar of the current (calling) service.

In Istio, if the connection pool is exhausted, sending another request trips the overflow circuit breaker, and at that point the 503 is returned by Envoy. A related behaviour is that Envoy caches connections to upstream services: if those connections fail or are closed, Envoy may still try to send requests over them, causing the connections to be reset, and Envoy then wraps that error in a 503 and returns it to the downstream caller.

Analysis:
[screenshot]
First, we see that when the 503 errors occur, only the sidecar of the request initiator (fortio) logs a large number of 503s, while the httpbin sidecar shows only 200s.

Now look at fortio's access log: 503 UO upstream_reset_before_response_started{overflow}

This error indicates that the upstream connection pool overflowed, causing the request to be reset. UO is an Envoy response flag meaning upstream overflow (circuit breaking), i.e. the number of connections to the upstream service exceeded the circuit-breaker threshold; overflow is Envoy's reset type, indicating that the connection pool is exhausted. As analyzed above, the connection pool counter lives in the downstream (calling) service's sidecar, which also indirectly indicates that this 503 is generated by fortio's own sidecar.

This is also stated in envoy's documentation:

[screenshot]
Based on the above, the 503 status code is not returned by the upstream service; rather, the sidecar of the current service generates it itself, based on the connection pool overflow, and returns it to the downstream application. In other words, this error happens before load balancing: Envoy finds that the connection count has already reached the circuit-breaker threshold before it even selects an upstream instance, so it rejects the request outright without performing load balancing.

At the same time, across multiple tests no outlier detection was observed to be triggered, which also supports the view that the 503 is not returned by the upstream service; conversely, it also seems to show that connection pool overflow does not trigger outlier detection.

[screenshot]
[screenshot]
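
For completeness, this is roughly what an outlier detection configuration looks like (a sketch with placeholder numbers; whether locally generated errors such as the UO 503 above count towards ejection is exactly the behaviour the tests above probe):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: httpbin-outlier          # hypothetical name
spec:
  host: httpbin
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5    # eject a host after 5 consecutive 5xx responses
      interval: 10s              # how often hosts are analyzed
      baseEjectionTime: 30s      # minimum ejection duration
      maxEjectionPercent: 50     # never eject more than half of the hosts
```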

Origin: blog.csdn.net/Mrheiiow/article/details/130411303