Linux TCP retransmission timeout: analysis of a database disconnection

We recently discovered a problem in production. At first the application connects to the database normally, but if there is no business traffic for a long time (estimated at more than half an hour) and a request is then initiated, the application finds it has to reconnect to the database and hangs in the reconnection. Restarting the application lets it connect to the database quickly (the database is Oracle). Later, after the DBA colleagues looked into it, we found that our production database is a RAC and the clients are configured for TAF, so when a session switch occurs the original connection may not be released properly, which affects reconnection. After turning TAF off on the Oracle client, the reconnection issue was resolved.

But there was still a very strange phenomenon, which is the focus of today's post. With no business traffic for a long time, the connection is still cut off, and an executed SQL statement takes about 15 minutes before it fails and returns to the application. The application therefore cannot serve requests for 15 minutes, and it then reports the error ORA-03113: end-of-file on communication channel.

Judging from this error, it is the disconnection error returned by the Oracle client, but why does it take 15 minutes before the error is returned?

The network topology of the machines is as follows:

Host Application A ----> FW1 (Firewall 1) ----> FW2 (Firewall 2) ----> database host (OracleDB)

Later, the network colleagues determined that a firewall may have a session timeout configured: if a session carries no data for a long time, the firewall deletes it. Others on the Internet have run into a similar situation:

image

We tried something similar: after lifting the session time limit on the firewall, the problem did not occur again. But several questions remain unanswered:

1. After the firewall deletes the session, why does the host wait 15 minutes?

2. When the firewall deletes the session, does it notify the host (by sending an RST to the host)?

After discussing with colleagues this morning, our guess is that the firewall removed the session but the host does not know about it. When a database operation occurs, the Oracle client sends data over the TCP connection, but the firewall cannot find the session and discards the packets (whether it really drops them is not yet clear), so TCP keeps timing out and retransmitting.

TCP/IP Illustrated, Volume 1, section 21.2, on timeout and retransmission, has this description:

image

It mentions 9 minutes, but the book was written quite early. I suspected that Linux uses different numbers while the principle is much the same, so I googled a bit and did find a statement about 15 minutes. Reference [1] says:

TCP_RTO_MIN = (HZ / 5) = 0.2s
TCP_RTO_MAX = (120 * HZ) = 120s
linear_backoff_thresh = ilog2(120 * 5) = ilog2(0x258) = 9
timeout: the portion up to linear_backoff_thresh = 9 grows exponentially, doubling from TCP_RTO_MIN; the portion beyond it grows linearly by TCP_RTO_MAX
tcp_time_stamp: current clock time
For example, in the data transmission phase, when sysctl_tcp_retries2 = 9, timeout = 1023 * TCP_RTO_MIN = 204.6s; when sysctl_tcp_retries2 = 11, timeout = 1023 * TCP_RTO_MIN + 2 * TCP_RTO_MAX = 444.6s
By default sysctl_tcp_retries2 = 15, so timeout = 1023 * TCP_RTO_MIN + 6 * TCP_RTO_MAX = 924.6s, about 15 minutes
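The arithmetic quoted above can be checked with a small sketch. It is only an illustration of the quoted formula, not kernel code: the function name `retransmit_timeout_seconds` is hypothetical, and `HZ = 1000` is an assumed tick rate that cancels out of the result. It also assumes the RTO starts at TCP_RTO_MIN, which is the hypothetical best case; on a real connection the RTO depends on the measured round-trip time, so the actual wait can be longer.

```python
HZ = 1000                  # assumed tick rate; it cancels out of the result
TCP_RTO_MIN = HZ // 5      # 0.2 s, in ticks
TCP_RTO_MAX = 120 * HZ     # 120 s, in ticks


def retransmit_timeout_seconds(retries2: int) -> float:
    """Hypothetical total time before TCP gives up for a given tcp_retries2."""
    # ilog2(TCP_RTO_MAX / TCP_RTO_MIN) = ilog2(600) = 9
    thresh = (TCP_RTO_MAX // TCP_RTO_MIN).bit_length() - 1
    if retries2 <= thresh:
        # Pure exponential backoff: 0.2 + 0.4 + ... doubles each time.
        ticks = ((2 << retries2) - 1) * TCP_RTO_MIN
    else:
        # Exponential part up to the threshold, then TCP_RTO_MAX per extra retry.
        ticks = ((2 << thresh) - 1) * TCP_RTO_MIN
        ticks += (retries2 - thresh) * TCP_RTO_MAX
    return ticks / HZ


if __name__ == "__main__":
    for retries2 in (9, 11, 15):
        print(retries2, f"{retransmit_timeout_seconds(retries2):.1f}s")
```

With the default of 15 this gives roughly 924.6 seconds, which matches the 15 minutes we observed.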

The RTO itself is calculated by a specific algorithm (see reference [3] for the details).

Simply put, if the configured number of retransmissions is 9 or less, the wait grows exponentially; beyond 9, each further retransmission waits the maximum timeout (TCP_RTO_MAX).
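To see this "exponential first, then capped" behaviour step by step, here is a minimal sketch that lists the individual waits. The helper `rto_schedule` is hypothetical and assumes the first RTO equals 0.2s and the cap is 120s; a real connection derives its RTO from measured round-trip times, so treat the numbers as a lower bound.

```python
def rto_schedule(retries2, rto_min=0.2, rto_max=120.0):
    """Waits (in seconds) after each transmission: doubling, capped at rto_max."""
    waits = []
    rto = rto_min
    # One wait after the original send, plus one after each retransmission.
    for _ in range(retries2 + 1):
        waits.append(rto)
        rto = min(rto * 2, rto_max)
    return waits


if __name__ == "__main__":
    waits = rto_schedule(15)
    for attempt, wait in enumerate(waits):
        print(f"wait {attempt:2d}: {wait:6.1f}s")
    print(f"total: {sum(waits):.1f}s")
```

For 15 retransmissions the first ten waits double from 0.2s up to 102.4s, and the remaining six are each capped at 120s.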

The Linux default is 15, which works out to almost exactly 15 minutes. Checking the configuration of our host confirms that it is 15:

[steven@kfjk2 ~]$ cat /proc/sys/net/ipv4/tcp_retries2
15

One question is still unclear: after the firewall deletes the session, does it notify the host? So far it appears not; at least the host did not receive an RST from the firewall. Since the two firewalls come from different vendors, it is also possible that one of them sent an RST and the other swallowed it. And if a packet arrives on the original connection after the session has been deleted, is the session rebuilt? Is the packet simply discarded? Or is an RST sent? From the phenomena observed so far, our speculation is:

After the firewall deletes the session, it does not notify the host (no RST is sent to the host). When a new packet arrives and no session is found, the packet is discarded directly because it is not a SYN. As a result the host exhausts its retransmissions, reports the disconnection to the application, and then sends an RST itself.

But... all of the above is guesswork based on the observed phenomena. The most effective way would be to capture the packets with tcpdump and take a look, but this is production and we did not dare, so be it!

I am writing this down to avoid stepping into the same pit later. Developers also have to care about network deployment; at the time I did not consider the two firewalls in the middle.

[References]

  1. Linux TCP retransmission timeout
  2. [Discussion] Timeout retransmission mechanism of the TCP protocol stack under Linux
  3. TCP/IP retransmission timeout: RTO

Reproduced from: https://my.oschina.net/mawx/blog/318965

Origin blog.csdn.net/weixin_33850015/article/details/91608434