《设计数据密集型应用》——第一章原文+翻译(上)

Reliable, Scalable, and Maintainable Applications(可靠、可伸缩、可维护的应用)

Many applications today are data-intensive, as opposed to compute-intensive. Raw CPU power is rarely a limiting factor for these applications—bigger problems are usually the amount of data, the complexity of data, and the speed at which it is changing.

与计算密集型应用相对的,当今的很多应用都是数据密集型应用。这些应用很少受限于原始的CPU计算能力,更多的受限于数据的体量、数据的复杂度以及数据改变的速度。

A data-intensive application is typically built from standard building blocks that provide commonly needed functionality. For example, many applications need to:

  • Store data so that they, or another application, can find it again later (databases)
  • Remember the result of an expensive operation, to speed up reads (caches)
  • Allow users to search data by keyword or filter it in various ways (search indexes)
  • Send a message to another process, to be handled asynchronously (stream processing)
  • Periodically crunch a large amount of accumulated data (batch processing)

数据密集型应用通常由一些能够提供常见功能的标准构建块构建而成。例如,很多应用都需要下面的功能:

  • 存储数据,方便日后供自己或者其他应用访问(数据库)。
  • 记住一些开销大的操作的结果,加快数据读取(缓存)。
  • 允许用户根据关键字搜索数据,或者以各种方式过滤数据(搜索索引)。
  • 将消息发送至其他进程,进行异步处理(流式处理)。
  • 周期性地处理累积的大量数据(批处理)。

If that sounds painfully obvious, that’s just because these data systems are such successful abstractions: we use them all the time without thinking too much. When building an application, most engineers wouldn’t dream of writing a new data storage engine from scratch, because databases are a good tool for the job.

如果上面的这些功能听起来极其的平淡无奇,只是因为这些数据系统做到了非常成功的抽象:我们不需要考虑太多问题就可以非常轻松的使用它们。在开发应用时,大部分的软件工程师不会幻想着开发一个新的数据存储引擎,因为有现成的数据库供我们直接使用。

But reality is not that simple. There are many database systems with different characteristics, because different applications have different requirements. There are various approaches to caching, several ways of building search indexes, and so on. When building an application, we still need to figure out which tools and which approaches are the most appropriate for the job at hand. And it can be hard to combine tools when you need to do something that a single tool cannot do alone.

但是现实不是那么简单。因为不同的应用有不同的需求,所以出现了特性各不相同的数据库系统。例如实现缓存的方法有很多,构建搜索索引的方式也有几种,其他情况也类似。所以在构建应用时,我们必须弄清楚哪些才是最适合手头工作的工具和方法。另外,当单一的工具不能满足需求时就需要组合多种工具,工具的组合使用也不是那么容易。

This book is a journey through both the principles and the practicalities of data systems, and how you can use them to build data-intensive applications. We will explore what different tools have in common, what distinguishes them, and how they achieve their characteristics.

这本书自始至终讲解的是数据系统的原理以及实践,告诉你怎么使用这些数据系统构建数据密集型应用。我们将会探讨不同的数据系统之间的相同点和不同点,以及它们的实现原理。

In this chapter, we will start by exploring the fundamentals of what we are trying to achieve: reliable, scalable, and maintainable data systems. We’ll clarify what those things mean, outline some ways of thinking about them, and go over the basics we will need for later chapters.

在本章,我们首先探索我们想要达成的目标的基本内涵:稳定性、伸缩性以及可维护性。我们会阐述它们具体的含义,概述一些思考这些问题的方式,以及准备一些后面章节将会用到的基础知识。

Thinking about data systems 数据系统的思考

We typically think of databases, queues, caches, etc. as being very different categories of tools. Although a database and a message queue have some superficial similarity—both store data for some time—they have different access patterns, which means different performance characteristics, and thus different implementations.

通常情况下,数据库、队列、缓存等被划分为不同种类的工具。尽管数据库和消息队列有着一些表面上的相似点——都能存储一段时间的数据——但是它们有着完全不同的数据访问模式,这意味着不同的性能特征,因而有不同的实现。

So why should we lump them all together under an umbrella term like data systems?

那为什么我们会把这些截然不同的数据工具都放到数据系统这个术语下面?

Many new tools for data storage and processing have emerged in recent years. They are optimized for a variety of different use cases, and they no longer neatly fit into traditional categories. For example, there are message queues with database-like durability guarantees (Apache Kafka). The boundaries between the categories are becoming blurred.

近些年出现了许多针对数据存储和数据处理的新工具。它们针对一系列不同的用例而设计,不再能整齐地归入传统的类别。举个例子,现在有能够提供和数据库类似的持久化保证的消息队列(Apache Kafka)。不同类别之间的界限正在变得模糊。

Second, increasingly many applications now have such demanding or wide-ranging requirements that a single tool can no longer meet all of their data processing and storage needs. Instead, the work is broken down into tasks that can be performed efficiently by a single tool, and those different tools are stitched together using application code.

第二,越来越多的应用有着苛刻或者广泛的需求,单一的工具已经无法满足其全部的数据处理和存储需要。取而代之的是,工作被拆分成了多个能够由单一工具高效完成的任务,最后由应用层的代码将这些工具粘合起来。

For example, if you have an application-managed caching layer (using Memcached or similar), or a full-text search server (such as Elasticsearch or Solr) separate from your main database, it is normally the application code’s responsibility to keep those caches and indexes in sync with the main database.

举个例子,你的应用除了主数据库,可能还会有应用层管理的缓存服务(Memcached或者类似产品)、全文搜索服务(Elasticsearch或者Solr)。通常要靠应用层的代码来保证缓存、全文索引和主数据库的数据同步。

When you combine several tools in order to provide a service, the service’s interface or application programming interface (API) usually hides those implementation details from clients. Now you have essentially created a new, special-purpose data system from smaller, general-purpose components. Your composite data system may provide certain guarantees: e.g., that the cache will be correctly invalidated or updated on writes so that outside clients can see consistent results. You are now not only an application developer, but also a data system designer.

当结合几种工具编写出一个服务时,服务的接口或者应用程序接口(API)会对客户端掩盖掉具体的实现细节。到此,你基本上使用一些小的、通用的组件创造出了一个崭新的、专用的数据系统。这个组合出来的系统也可以提供某些保证:例如当数据写入时,缓存会被正确地失效或者更新,这样客户端才能查询到一致的结果。此时,你不仅是一个应用开发者,也是一个数据系统设计者。
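The cache-invalidation guarantee described above can be sketched in a few lines. This is a minimal illustration, not code from the book: `db`, `cache`, `read_user`, and `update_user` are hypothetical stand-ins (plain dicts rather than a real database and Memcached). The write path updates the system of record first, then invalidates the cache so that later reads see consistent results.

```python
# Toy stand-ins for a primary database and a cache; in a real system these
# would be clients for e.g. PostgreSQL and Memcached.
db = {}
cache = {}

def read_user(user_id):
    # Read-through: serve from the cache if present, else fall back to the
    # database and populate the cache for next time.
    if user_id in cache:
        return cache[user_id]
    value = db.get(user_id)
    if value is not None:
        cache[user_id] = value
    return value

def update_user(user_id, value):
    # Write to the system of record first, then invalidate the cache entry
    # so that subsequent reads cannot observe the stale value.
    db[user_id] = value
    cache.pop(user_id, None)
```

A real composite system would also have to handle concurrent writers and cache-population races, which is exactly the kind of tricky question such a design raises.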

If you are designing a data system or service, a lot of tricky questions arise. How do you ensure that the data remains correct and complete, even when things go wrong internally? How do you provide consistently good performance to clients, even when parts of your system are degraded? How do you scale to handle an increase in load? What does a good API for the service look like?

如果你正在设计一个数据系统或服务,必须要考虑一些棘手的问题。当系统内部出错时,怎么保证数据的正确和完整?当系统的某些部分降级时,怎么持续地为客户端提供良好的性能?当负载增加后,怎么相应地扩展系统?怎么设计一个良好的服务API?

There are many factors that may influence the design of a data system, including the skills and experience of the people involved, legacy system dependencies, the timescale for delivery, your organization’s tolerance of different kinds of risk, regulatory constraints, etc. Those factors depend very much on the situation.

很多因素会影响到数据系统的设计,例如参与其中的人员的能力和工作经验,遗留系统的依赖,交付的时间限制,企业对不同种类风险的容忍力度,规章的限制等等。这些因素很大程度上取决于具体的情况。

In this book, we focus on three concerns that are important in most software systems:

本书主要关注三个软件设计中的重要原则:

Reliability 稳定性

   The system should continue to work correctly even in the face of adversity (hardware or software faults, and even human error).

在发生了某些异常(硬件错误,软件错误,人为操作错误)时,系统仍能持续正常工作。

Scalability 伸缩性或扩展性

   As the system grows (in data volume, traffic volume, or complexity), there should be reasonable ways of dealing with that growth.

伴随着系统的增长,包括数据体量、流量、或者复杂性的增长,系统应该存在合理的方式应对这些增长。

Maintainability 可维护性

    Over time, many different people will work on the system (engineering and operations, both maintaining current behavior and adapting the system to new use cases), and they should all be able to work on it productively.

随时间推移,会有不同职责的人在这个系统上工作,包括研发人员和运维人员,可能维护当前的系统功能也可能引入新的用例。他们应该能够高效的在这个系统上工作。

These words are often cast around without a clear understanding of what they mean. In the interest of thoughtful engineering, we will spend the rest of this chapter exploring ways of thinking about reliability, scalability, and maintainability. Then, in the following chapters, we will look at various techniques, architectures, and algorithms that are used in order to achieve those goals.

我们经常能够听到这些词语,但是大家对这些词的含义往往缺乏清晰的理解。出于严谨工程的考虑,我们会在本章剩余的部分探讨关于稳定性、伸缩性、可维护性的思考方式。在接下来的章节中,讨论一些实现这些目标的技术、架构和算法。

Reliability 稳定性

Everybody has an intuitive idea of what it means for something to be reliable or unreliable. For software, typical expectations include:

  • The application performs the function that the user expected.
  • It can tolerate the user making mistakes or using the software in unexpected ways.
  • Its performance is good enough for the required use case, under the expected load and data volume.
  • The system prevents unauthorized access and abuse.

每个人对于一件东西是否稳定都有直观的理解。对于软件来说,稳定的系统应该是:

  • 系统能够按照用户的预期执行功能。
  • 系统能够允许用户犯错、或者按照未预期的方式使用软件。
  • 在预期到的负载情况下,系统能够提供良好的性能。
  • 系统能够阻止未授权的访问和滥用。

If all those things together mean “working correctly,” then we can understand reliability as meaning, roughly, “continuing to work correctly, even when things go wrong.”

如果上面的这些要求结合在一起意味着一个系统能正常工作,那么我们可以得出一个粗略的稳定性定义“在系统发生某些错误的时候,也能持续性的工作”。

The things that can go wrong are called faults, and systems that anticipate faults and can cope with them are called fault-tolerant or resilient. The former term is slightly misleading: it suggests that we could make a system tolerant of every possible kind of fault, which in reality is not feasible. If the entire planet Earth was swallowed by a black hole, tolerance of that fault would require web hosting in space. So it only makes sense to talk about tolerating certain types of faults.

系统运行中可能出错的东西我们称为fault(错误),能够预料这些错误并且处理错误的系统我们称为能够“容错”(fault-tolerant)或者有“弹性”(resilient)。容错这个词语可能会令人产生误解:字面上看,我们似乎可以使系统容忍各种可能的错误,但实际上是不可能的。设想一下,如果整个地球被一个黑洞吞没,我们必须把服务架设在太空中才能容忍这种错误。因此只有讨论容忍某些类型的错误才有意义。

Note that a fault is not the same as a failure. A fault is usually defined as one component of the system deviating from its spec, whereas a failure is when the system as a whole stops providing the required service to the user. It is impossible to reduce the probability of a fault to zero; therefore it is usually best to design fault-tolerance mechanisms that prevent faults from causing failures. In this book we cover several techniques for building reliable systems from unreliable parts.

注意错误(fault)不等同于失败(failure)。错误被定义为系统的某个组件偏离了其规范,而失败代表着整个系统停止向用户提供所需的服务。错误的概率不可能降为零,所以通常情况下最好设计容错机制来阻止错误造成失败。本书会介绍几种使用不可靠的组件构建可靠系统的技术。

Counterintuitively, in such fault-tolerant systems, it can make sense to increase the rate of faults by triggering them deliberately—for example, by randomly killing individual processes without warning. Many bugs are actually due to poor error handling; by deliberately inducing faults, you ensure that the fault-tolerance machinery is continually exercised and tested, which can increase your confidence that faults will be handled correctly when they occur naturally.

与直觉相反,在容错系统中,通过故意触发错误(例如在没有任何警告的情况下随机杀死某些进程)来增加出错的概率是很有意义的。很多bug实际上都是因为错误处理不当引起的,通过故意引入错误来持续考验系统的容错机制,这样当错误真正发生时才更有信心正确处理错误。
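As a toy illustration of deliberately inducing faults (in the spirit of the process-killing approach above, though far simpler), one can wrap an operation so that it fails at random, forcing the retry machinery around it to be exercised on every run. All names here are made up for the sketch, and the fault rate is an arbitrary assumed value:

```python
import random

def inject_faults(fn, fault_rate=0.3, rng=random.Random(42)):
    """Wrap fn so that it randomly raises, to exercise error handling."""
    def flaky(*args, **kwargs):
        if rng.random() < fault_rate:
            raise RuntimeError("injected fault")
        return fn(*args, **kwargs)
    return flaky

def call_with_retry(fn, attempts=10):
    # The fault-tolerance machinery under test: retry a few times, then give up.
    last_error = None
    for _ in range(attempts):
        try:
            return fn()
        except RuntimeError as e:
            last_error = e
    raise last_error

# Any function can be made flaky; this trivial one stands in for a
# network call or a worker process.
flaky_fetch = inject_faults(lambda: "ok")
```

Because the injected faults fire continually, a bug in `call_with_retry` would surface in every test run rather than only during a rare production incident.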

Although we generally prefer tolerating faults over preventing faults, there are cases where prevention is better than tolerance. This is the case with security matters, for example: if an attacker has compromised a system and gained access to sensitive data, that event cannot be undone. However, this book mostly deals with the kinds of faults that can be cured, as described in the following sections.

尽管通常情况下容忍错误优于阻止错误,但是有一些情况阻止错误优于容忍错误。例如在安全领域,如果一个黑客攻破了系统并获得了敏感数据的访问权限,这种事情是无法撤销的。然而,本书主要关注那些能够被解决的错误。

Hardware faults 硬件错误

When we think of causes of system failure, hardware faults quickly come to mind. Hard disks crash, RAM becomes faulty, the power grid has a blackout, someone unplugs the wrong network cable. Anyone who has worked with large datacenters can tell you that these things happen all the time when you have a lot of machines.

一提起系统失败的原因,映入脑海的首先是硬件错误。例如硬盘故障,内存故障,电源故障,人为拔下网线等。有过在大型数据中心工作经验的人会告诉你这些硬件故障一直在发生。

Hard disks are reported as having a mean time to failure (MTTF) of about 10 to 50 years. Thus, on a storage cluster with 10,000 disks, we should expect on average one disk to die per day.

据报告,硬盘的平均故障时间(MTTF)约为10~50年。因此,在一个拥有10000块硬盘的存储集群中,可以预期平均每天坏一块硬盘。
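The back-of-the-envelope arithmetic behind that estimate, taking an MTTF of 30 years as a representative value from the quoted 10-50 year range and assuming disks fail independently:

```python
disks = 10_000
mttf_years = 30            # assumed value within the quoted 10-50 year range
mttf_days = mttf_years * 365

# Expected disk failures per day across the whole cluster:
failures_per_day = disks / mttf_days
print(round(failures_per_day, 2))  # 0.91, i.e. roughly one disk per day
```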

Our first response is usually to add redundancy to the individual hardware components in order to reduce the failure rate of the system. Disks may be set up in a RAID configuration, servers may have dual power supplies and hot-swappable CPUs, and datacenters may have diesel generators for backup power. When one component dies, the redundant component can take its place while the broken component is replaced. This approach cannot completely prevent hardware problems from causing failures, but it is well understood and can often keep a machine running uninterrupted for years.

应对硬件故障的第一反应通常是为每个硬件组件增加冗余。例如,硬盘可以配置RAID,服务器可以配置双电源和热插拔的CPU,一些数据中心配有柴油发电机作为备份电源。当某个组件故障时,后备的组件接替故障的组件继续提供服务。这种增加冗余的技术无法完全阻止硬件故障造成系统失败,但是这种技术非常成熟,通常能够保证机器不间断的运行数年。

Until recently, redundancy of hardware components was sufficient for most applications, since it makes total failure of a single machine fairly rare. As long as you can restore a backup onto a new machine fairly quickly, the downtime in case of failure is not catastrophic in most applications. Thus, multi-machine redundancy was only required by a small number of applications for which high availability is essential.

直到最近,增加硬件组件冗余的技术对大多数应用来说仍然足够,因为它使得单台机器完全失效变得相当罕见。对大多数应用来说,发生硬件故障以后,只要能够快速把备份恢复到新的机器上,短暂的宕机时间仍然可以接受。因此,只有少量的高可用系统才需要多机冗余。

However, as data volumes and applications’ computing demands have increased, more applications have begun using larger numbers of machines, which proportionally increases the rate of hardware faults. Moreover, in some cloud platforms such as Amazon Web Services (AWS) it is fairly common for virtual machines to become unavailable without warning, as the platforms are designed to prioritize flexibility and elasticity over single-machine reliability.

然而,随着数据体量的增长和应用计算需求的增加,更多的应用开始使用大量的机器提供服务,这也成比例地增加了硬件故障的概率。而且在一些类似于AWS的云平台中,虚拟机在没有任何警告的情况下变得不可用是很常见的,因为这些平台更加关注灵活性和弹性而非单机的可靠性。

Hence there is a move toward systems that can tolerate the loss of entire machines, by using software fault-tolerance techniques in preference or in addition to hardware redundancy. Such systems also have operational advantages: a single-server system requires planned downtime if you need to reboot the machine (to apply operating system security patches, for example), whereas a system that can tolerate machine failure can be patched one node at a time, without downtime of the entire system (a rolling upgrade).

因此,出现了一种趋势:优先使用软件容错技术,或者将其作为硬件冗余的补充,使系统能够容忍整机故障。这样的系统还具有运维上的优势:如果需要重新启动机器(例如,为操作系统打安全补丁),单机的系统必须计划停机一段时间,而能够容忍机器故障的系统可以一次修补一个节点,而不会造成整个系统的停机(即滚动升级)。
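A rolling upgrade can be sketched as a loop that patches one node at a time and verifies its health before moving on. The node names, versions, and health check below are hypothetical placeholders, not a real orchestration API:

```python
# Hypothetical cluster state: three nodes, all running version v1.
nodes = ["node-1", "node-2", "node-3"]
versions = {n: "v1" for n in nodes}

def healthy(node):
    # Stand-in for a real health check, e.g. an HTTP probe of /health.
    return versions[node] in ("v1", "v2")

def rolling_upgrade(target="v2"):
    for node in nodes:
        # Patch one node at a time; the remaining nodes keep serving
        # traffic, so the system as a whole never goes down.
        versions[node] = target
        if not healthy(node):
            raise RuntimeError(f"{node} unhealthy after upgrade; halt rollout")

rolling_upgrade()
```

Halting the rollout on the first unhealthy node is the key design choice: a bad release takes down one node, not the whole cluster.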

Software Errors 软件错误

We usually think of hardware faults as being random and independent from each other: one machine’s disk failing does not imply that another machine’s disk is going to fail. There may be weak correlations (for example due to a common cause, such as the temperature in the server rack), but otherwise it is unlikely that a large number of hardware components will fail at the same time.

通常情况下,硬件故障是随机的,独立于其他的机器而发生:某台机器硬盘故障不能说明其他机器的硬盘也发生故障。尽管这其中有很弱的相关性,例如因为机架的温度造成了多台机器故障,但是一般情况下不可能大量的机器同时发生故障。

Another class of fault is a systematic error within the system. Such faults are hard to anticipate, and because they are correlated across nodes, they tend to cause many more system failures than hardware faults. Examples include:

另外一类错误是软件系统内部的错误。软件错误往往很难预测,而且因为节点之间的相关性,软件错误会比硬件错误造成更多的系统失败。软件错误如下:

  • A software bug that causes each instance of a system to crash when given a bad input. For example, consider the leap second on June 30, 2012, that caused many applications to hang simultaneously due to a bug in the Linux kernel.
  • A runaway process that uses up some shared resource—CPU time, memory, disk space, or network bandwidth.
  • A service that the system depends on that slows down, becomes unresponsive, or starts returning corrupted responses.
  • Cascading failures, where a small fault in one component triggers a fault in another component, which in turn triggers further faults.

  • 某个软件bug导致系统的每个实例在收到特定的错误输入时崩溃。例如2012年6月30日的闰秒,由于Linux内核中的一个bug,导致很多应用同时挂起。
  • 某个失控的进程用尽了共享资源——CPU时间、内存、磁盘空间或者网络带宽。
  • 系统依赖的某个服务变慢、无响应,或者开始返回损坏的响应。
  • 级联故障,某个组件中的一个小错误触发了另一个组件的错误,进而引发更多的错误。

The bugs that cause these kinds of software faults often lie dormant for a long time until they are triggered by an unusual set of circumstances. In those circumstances, it is revealed that the software is making some kind of assumption about its environment—and while that assumption is usually true, it eventually stops being true for some reason.

造成软件故障的bug在被触发之前可能会潜伏很长时间。这也说明软件系统会对它所运行的环境做出某些假设,这些假设通常是正确的,但是最终会因为某种原因变得不正确。

There is no quick solution to the problem of systematic faults in software. Lots of small things can help: carefully thinking about assumptions and interactions in the system; thorough testing; process isolation; allowing processes to crash and restart; measuring, monitoring, and analyzing system behavior in production. If a system is expected to provide some guarantee (for example, in a message queue, that the number of incoming messages equals the number of outgoing messages), it can constantly check itself while it is running and raise an alert if a discrepancy is found.

解决软件的系统性错误没有捷径。有一些小办法可以提供帮助:仔细考虑系统中的假设和交互;细致地进行测试;进程隔离;允许进程崩溃后重启;测量、监控、分析系统在生产环境中的运行情况。如果系统被要求提供某些保证,例如消息队列中的入队和出队消息数量必须一致,那么系统可以在运行中不断自检,在发现不一致时及时告警。
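The message-queue example above (incoming messages must equal outgoing messages) can be sketched as a queue that audits its own counters. This is a toy in-memory queue for illustration, not a real broker; the class and method names are made up:

```python
from collections import deque

class AuditedQueue:
    """A toy in-memory queue that checks its own message-count invariant."""

    def __init__(self):
        self._items = deque()
        self.enqueued = 0
        self.dequeued = 0

    def put(self, item):
        self._items.append(item)
        self.enqueued += 1

    def get(self):
        item = self._items.popleft()
        self.dequeued += 1
        return item

    def check_invariant(self):
        # Incoming messages must equal outgoing messages plus those still
        # queued; a mismatch means messages were lost or duplicated and
        # should raise an alert rather than go unnoticed.
        if self.enqueued != self.dequeued + len(self._items):
            raise AssertionError("message count discrepancy detected")
```

In a real system `check_invariant` would run periodically in the background and feed a monitoring alert instead of raising an exception.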

Human Errors 人为错误

Humans design and build software systems, and the operators who keep the systems running are also human. Even when they have the best intentions, humans are known to be unreliable. For example, one study of large internet services found that configuration errors by operators were the leading cause of outages, whereas hardware faults played a role in only 10-25% of outages.

是人类设计和构建了软件系统,维持系统正常运行的操作员也是人类。即使人类的意图再好,也是不可靠的。例如,一项关于大型互联网服务的研究发现,运维人员的配置错误是导致系统故障的主要原因,而硬件问题只占据了其中的10~25%。

How do we make our systems reliable, in spite of unreliable humans? The best systems combine several approaches:

  • Design systems in a way that minimizes opportunities for error. For example, well-designed abstractions, APIs, and admin interfaces make it easy to do “the right thing” and discourage “the wrong thing.” However, if the interfaces are too restrictive, people will work around them, negating their benefit, so this is a tricky balance to get right.
  • Decouple the places where people make the most mistakes from the places where they can cause failures. In particular, provide fully featured non-production sandbox environments where people can explore and experiment safely, using real data, without affecting real users.
  • Test thoroughly at all levels, from unit tests to whole-system integration tests and manual tests. Automated testing is widely used, well understood, and especially valuable for covering corner cases that rarely arise in normal operation.
  • Allow quick and easy recovery from human errors, to minimize the impact in the case of a failure. For example, make it fast to roll back configuration changes, roll out new code gradually, and provide tools to recompute data (in case it turns out that the old computation was incorrect).
  • Set up detailed and clear monitoring, such as performance metrics and error rates. In other engineering disciplines this is referred to as telemetry. (Once a rocket has left the ground, telemetry is essential for tracking what is happening and for understanding failures.) Monitoring can show us early warning signals and allow us to check whether any assumptions or constraints are being violated. When a problem occurs, metrics can be invaluable in diagnosing the issue.
  • Implement good management practices and training—a complex and important aspect, and beyond the scope of this book.

  • 以最小化出错机会的方式设计系统。例如,精心设计的抽象、API和管理界面能够让“正确的事”变得容易,让“错误的事”难以发生。不过,如果接口限制过多,人们就会想办法绕过它,抵消其带来的好处,因此这是一个需要权衡的问题。
  • 把人们最容易犯错的地方和可能导致失败的地方解耦。特别是,提供功能齐全的非生产环境的沙箱,让人们可以使用真实数据安全地探索和实验,而不会影响真实用户。
  • 在各个层面进行彻底的测试,从单元测试到全系统集成测试以及手动测试。自动化测试被广泛使用,也很容易理解,对于覆盖正常操作中很少出现的边界情况特别有价值。
  • 允许从人为错误中快速轻松地恢复,以最小化失败带来的影响。例如,让配置变更可以快速回滚,新代码逐步发布,并提供重新计算数据的工具(以防之前的计算出错)。
  • 建立详细和清晰的监控,例如性能指标和错误率。在其他工程领域这被称为遥测。(火箭升空之后,遥测对于跟踪发生的事情和理解故障至关重要。)监控能够为我们提供早期的告警信号,让我们检查是否有假设或者约束被违反。当问题发生时,这些指标对诊断问题非常有价值。
  • 实施良好的管理实践和培训——这是一个复杂而重要的方面,超出了本书的范围。

How important is reliability? 稳定性的重要性

Reliability is not just for nuclear power stations and air traffic control software—more mundane applications are also expected to work reliably. Bugs in business applications cause lost productivity (and legal risks if figures are reported incorrectly), and outages of e-commerce sites can have huge costs in terms of lost revenue and damage to reputation.

不是只有核电站或者空中交通管制软件才对稳定性有要求,普通的应用也应该可靠地工作。业务系统的bug会造成生产力损失(如果数字上报出错,还可能有法律风险),电商网站的故障会造成收入和声誉的损失。

Even in “noncritical” applications we have a responsibility to our users. Consider a parent who stores all their pictures and videos of their children in your photo application. How would they feel if that database was suddenly corrupted? Would they know how to restore it from a backup?

即使开发的是一些“不重要”的软件,我们也需要对我们的用户负责。设想一对父母把他们孩子所有的照片和视频存放在了你的照片应用上,如果数据库突然损坏,这对父母会有什么样的感受?他们知道怎么从备份中恢复吗?

There are situations in which we may sacrifice reliability in order to reduce development cost or operational cost—but we should be very conscious of when we are cutting corners.

也存在一些为了减少开发成本或者运维成本而牺牲稳定性的情况——但是我们一定要清楚什么时候是在偷工减料。

 


Reposted from blog.csdn.net/wangqinghuan1993/article/details/83992890