Tailscale's TLS certificate expired and the website was down for 90 minutes!

On March 7, tailscale.com, the official website of Tailscale, a well-known WireGuard-based VPN vendor, was down for about 90 minutes because its TLS certificate had expired.

Although the impact was limited, the incident triggered heated discussions and reflections on forums such as Hacker News.

Commenters weighed in one after another. smackeyacky lamented: "I've said it before, and I'll say it again: certificate expiration has become the number one cause of service outages in the new era."

Tailscale co-founder bradfitz also responded promptly on Hacker News, explaining the cause of the incident and how it was handled. It turned out that in December of last year the team had carried out a large-scale website migration that involved rebuilding the underlying architecture, DNS resolution, and more. To support IPv6, they had also set up additional proxy servers.

Unexpectedly, this seemingly innocuous change sowed the seeds of the outage. Because the proxy server terminated the TLS connection and the DNS configuration was overlooked, the monitoring system failed to flag the expiring certificate in time. bradfitz admitted that the incident showed the team still has a lot of room to improve in change management and risk assessment.
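The failure mode described above, where a probe never sees the certificate that a TLS-terminating proxy actually serves, can be illustrated with a minimal sketch. It assumes a monitor that resolves a hostname to every IPv4 and IPv6 address and checks the certificate presented on each one. The function names and the 14-day `WARN_DAYS` threshold are hypothetical and do not describe Tailscale's actual monitoring.

```python
import socket
import ssl
import time

WARN_DAYS = 14  # hypothetical alert threshold, not taken from the incident report


def days_left(not_after: str, now: float) -> float:
    """Days between `now` (epoch seconds) and a certificate's notAfter field."""
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400


def classify(not_after: str, now: float, warn_days: int = WARN_DAYS) -> str:
    """Map a notAfter string to 'expired', 'expiring', or 'ok'."""
    remaining = days_left(not_after, now)
    if remaining <= 0:
        return "expired"
    if remaining <= warn_days:
        return "expiring"
    return "ok"


def check_all_addresses(hostname: str, port: int = 443) -> dict:
    """Probe every address the name resolves to, IPv4 and IPv6 alike,
    so a proxy that serves only one address family is not skipped."""
    results = {}
    ctx = ssl.create_default_context()
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    for family, _, _, _, sockaddr in infos:
        addr = sockaddr[0]
        try:
            with socket.socket(family, socket.SOCK_STREAM) as raw:
                raw.settimeout(5)
                raw.connect(sockaddr)
                # server_hostname sets SNI so the proxy picks the right certificate.
                with ctx.wrap_socket(raw, server_hostname=hostname) as tls:
                    not_after = tls.getpeercert()["notAfter"]
                    results[addr] = classify(not_after, time.time())
        except ssl.SSLCertVerificationError as exc:
            # An already expired certificate fails the handshake and lands here.
            results[addr] = f"invalid: {exc.verify_message}"
        except OSError as exc:
            results[addr] = f"unreachable: {exc}"
    return results
```

Calling `check_all_addresses("example.com")` returns one status per resolved address. Probing each address separately is the design choice that matters here: a check that only follows the default route can report "ok" while a proxy on the other address family serves a stale certificate.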

Do you buy this explanation? Let's see what commenters had to say.

j45 asked: if IPv6 is so important, why choose a provider that doesn't support it at all? bradfitz could only reply wryly that there was no consensus within the company on this issue.

lmeyerov pointed out sharply that critical scripts and documentation should not be hosted on the marketing site, calling it tantamount to "loss of reputation."

Even more interesting, everyone had suggestions for Tailscale's next steps. amluto suggested switching to a TCP proxy so that they could make full use of Let's Encrypt's HTTP validation method (the HTTP-01 challenge). agwa's idea is even bolder: **Why not automatically renew the certificate every day?** A longer expiration window may feel safe, but renewing on such a frequent rolling basis should not be difficult, right?
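agwa's daily-renewal idea can be sketched as a small state machine. The assumption here is that renewing every day keeps the deployed certificate close to its full lifetime, so a broken renewal pipeline is noticed while many days of validity remain. The 90-day lifetime, the three-failure alert threshold, and all names below are illustrative assumptions, not details from the Hacker News thread.

```python
from dataclasses import dataclass


@dataclass
class RenewalState:
    days_valid: int    # days left on the currently deployed certificate
    failures: int = 0  # consecutive failed renewal attempts


def step(state: RenewalState, renewed_ok: bool,
         validity_days: int = 90, alert_after: int = 3):
    """Advance one day. Returns (new_state, alert)."""
    if renewed_ok:
        # A successful renewal resets both the lifetime and the failure count.
        return RenewalState(days_valid=validity_days), False
    new_state = RenewalState(days_valid=state.days_valid - 1,
                             failures=state.failures + 1)
    # Alert once renewal has failed several days in a row.
    return new_state, new_state.failures >= alert_after
```

With these assumed numbers, the alert fires after three failed days while the deployed certificate still has 87 days left, which leaves ample time to repair the pipeline before users see an expired certificate.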

Talk is cheap, so how does Tailscale plan to fix this? bradfitz said that, beyond the monitoring improvements already mentioned, they plan to further simplify the network topology and reduce reliance on special-case solutions. Meanwhile, to nip problems in the bud, they will set monthly reminders like the "ancients" did and assign dedicated staff to watch them, so they are not caught out when a certificate expires.

But bradfitz also added confidently that Tailscale's design goal is a resilient mesh architecture. Even if the control plane occasionally goes down, users' existing connections are not affected. This incident happened to confirm that advantage.

Tailscale's response this time can be described as "textbook" in the infrastructure field. They did not dodge the issue or shift blame, but had the courage to take responsibility; they did not rush out a perfunctory fix, but reflected deeply and got to the root of the problem. This open, honest, and accountable attitude is one every technical team can learn from.

Returning to the incident itself, the author believes the problems Tailscale exposed are by no means isolated. In today's era of rapid iteration, any platform will inevitably hit setbacks of one kind or another. The key is to stay vigilant, respect professionalism, and pay attention to detail. A single lapse may be the starting point of a business interruption and reputational damage.

Particularly worth watching for is this kind of "reputation-damaging" design. When a seemingly inconspicuous page or service ends up deciding the "life or death" of the entire system, it deserves special attention. Should it be properly decoupled? Does it need dedicated hardening? Only by taking precautions in advance can we reduce the impact of "black swan" events.

For startups, technology is important, but they must also keep the big picture in view. **What is the real need? What can be simplified?** Architects need to ask themselves questions like these all the time. Blindly following so-called "best practices" and building things that are "gilded on the outside, rotten within" is putting the cart before the horse.

All in all, Tailscale's "certificate-gate" has sounded the alarm for us: security and availability are the foundation of everything. Only careful design and a rigorous attitude can earn users' trust. I believe Tailscale can learn from this incident, adopt a more mature and professional approach, build a truly resilient service, and continue to prosper in the VPN field.

Although Tailscale suffered a 90-minute outage due to an expired TLS certificate, the incident highlighted one of its advantages: most users were barely affected. Tailscale's distributed architecture means client connections do not depend on a central node always being online. This resilient design is what sets Tailscale apart from traditional VPNs. A brief outage of a central service does not negate the value of Tailscale; it highlights its fault tolerance.


Origin my.oschina.net/u/4148359/blog/11051442