[Repost] Understand Big Data Distributed Computing in 20 Minutes

This is a popular-science article. I hope that, through a vivid example, it can explain big data distributed computing clearly to friends without a computer science background. Although big data technology covers a whole series of complex techniques for storage, computation, and analysis, its core has always been distributed computing, so if you want to understand big data technology, the MapReduce distributed computing model is the place to start. This model is not a new concept: Google published it as early as 2004, and after more than a decade of development it has become the cornerstone of today's big data ecosystem. You could say the road of big data technology begins with MapReduce.

Traditional computing technology

Before diving into distributed computing, let's first look at traditional computing. To make concepts from the computing field vivid and easy to understand, we will draw an analogy between humans and computers:

 
 

In this figure we establish analogies for the basic components of a computer. The analogy is not strictly rigorous, but it is enough to illustrate the idea. If concepts such as memory and the central processing unit (CPU) were vague to you before, this chart should make their roles clear. With this analogy we can translate problems from the computing field into the familiar human world. From now on, imagine that you yourself are a computer; we will call you a "human computer". You have all the basic components of a computer, and God is a programmer who can write a program (a well-defined sequence of instructions) to make you perform computing tasks.

 

Now let's use a simple case to analyze how a "human computer" solves a practical problem with traditional computing techniques. Before we begin, one more definition: just as a normal computer's memory has an upper limit, our "human computer" also has a memory limit. Here we assume that a "human computer" can hold at most four kinds of information in "memory" at once, for example the counts of four kinds of fruit such as apples and pears:

 
 

This "human computer" looks rather underpowered, but fortunately the problem we need to solve is not complicated: given a few dozen playing cards (jokers excluded), where both the suits and the ranks are uncertain (they do not necessarily make up a full deck), how do we design a program for one "human computer" to count the number of cards of each suit?
 
 

Your answer may come out instantly: have the "human computer" keep the count of each suit directly in its brain, take the cards and count them one by one, and report the four suit counts once all the cards have been processed. That answer is completely correct; the simplest computing model of a normal computer works exactly like this. Memory records the results while the input device keeps reading data, the statistics in memory are updated, and the final result is shown on the output device:
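The counting model above can be sketched in a few lines of Python (the hand of cards and the suit names here are made up for illustration):

```python
from collections import Counter

# A hypothetical hand of cards, each a (suit, rank) pair — the "input data".
cards = [("hearts", "A"), ("spades", "7"), ("hearts", "K"), ("clubs", "2")]

# "Memory": one counter per suit, updated as each card is read.
suit_counts = Counter(suit for suit, _rank in cards)

# "Output device": report the counts.
print(dict(suit_counts))  # {'hearts': 2, 'spades': 1, 'clubs': 1}
```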
 
 

Next, the difficulty goes up a level: count the number of cards of each rank, A through K, 13 ranks in total. How do we upgrade our "program"?
 
 

Notice that if we keep using the previous solution, the "human computer's" "memory" is no longer enough: its limit of four kinds of information cannot store the counts for all 13 ranks from A to K. Think of real life: when we find we cannot remember a lot of information, we use a notebook to assist our memory. A computer works the same way; when memory is not enough, it uses the disk to store information. Here a notebook plays the role of an Excel document stored on the "disk":
 
 

Then the solution to this rank-counting problem becomes: take one card at a time, update the count of the corresponding rank in the notebook, and report the results directly after all the cards have been counted:
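As a sketch, the notebook-on-"disk" idea can be simulated by persisting the counts to a file between cards (the file name and the hand of cards below are hypothetical):

```python
import json
import os

NOTEBOOK = "rank_counts.json"  # hypothetical file standing in for the notebook

if os.path.exists(NOTEBOOK):   # start from a blank notebook
    os.remove(NOTEBOOK)

def count_card(rank):
    """Take one card and update its rank's count in the notebook on 'disk'."""
    if os.path.exists(NOTEBOOK):
        with open(NOTEBOOK) as f:
            counts = json.load(f)
    else:
        counts = {}
    counts[rank] = counts.get(rank, 0) + 1
    with open(NOTEBOOK, "w") as f:
        json.dump(counts, f)

for rank in ["A", "7", "A", "K"]:
    count_card(rank)

with open(NOTEBOOK) as f:
    print(json.load(f))  # {'A': 2, '7': 1, 'K': 1}
```

Re-reading the file for every card is deliberately naive; the point is only that the counts now live on disk rather than in the limited "memory".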
 
 

This is the traditional computing model of a single computer. It can be summarized as: perform addition, subtraction, and other mathematical operations on the input data according to uniform rules, then output the result, with intermediate data stored in memory or on the hard disk. In the case above, the playing cards are the "human computer's" "input data", equivalent to the binary numbers and text a real computer can recognize, and the card counts are the "output", equivalent to the information you can see on a computer screen.
In fact, with basic components such as memory, a hard disk, and a CPU, a single computer (not just personal computers; smartphones count too) can handle daily basic needs such as watching movies and listening to music online. As long as a task does not exceed the CPU's computing limit (something like human-versus-machine Go does), a single machine is perfectly fine, and we can keep enhancing its computing power by optimizing memory, hard disks, and so on to meet people's ever-growing material and cultural needs.

 

Okay, enough background. Let's get to the point.

Big Data Distributed Computing

First, what is distributed computing? Simply put, it means splitting a large amount of data into many small chunks, having multiple computers compute them in a division of labor, and then aggregating the results. The computers that carry out distributed computing are called a cluster. Continuing the human-computer analogy, a cluster is a team: the era of fighting alone is over, and teamwork is king:


 
 

Why do we need distributed computing? Because "big data" has arrived and a single computer is no longer enough; that is, the volume of data far exceeds what one computer can handle. Sometimes it is the data volume per unit of time that is large, for example buying train tickets on the 12306 website, which may receive tens of thousands of requests per second; sometimes it is the total volume that is large, for example the Baidu search engine, which has to search hundreds of millions of Chinese web pages on its servers.

There are many ways to implement distributed computing. Researchers were studying them long before big data technology appeared, but they were never widely applied until Google published MapReduce in 2004, after which the field took off. The relationship among big data technology, distributed computing, and MapReduce can be described by the figure below: MapReduce is the application of distributed computing to the big data field:


 
 

MapReduce is a mature, commercially proven distributed computing framework. Together with Google's distributed file system GFS and its distributed data storage system BigTable, it is known as one of Google's big data "three treasures", and it laid a solid theoretical foundation for the development of big data technology. Unfortunately, Google did not release its commercial products to the outside world; what really pushed big data technology forward in great strides was Hadoop, a free open-source product implemented according to Google's theory. Today a big data technology ecosystem has formed with Hadoop at its core.


 
 

Let's return to the card-counting example. What does the playing card problem look like in the big data era?

  1. The scale of the input data grows: the cards balloon to tens of thousands
  2. The scale of the intermediate data grows: the problem is upgraded again, and we now need to count how many times each of the 52 card types appears
  3. Processing time is limited: we want the statistics as quickly as possible


     
     

How's that — can you feel big data rushing at you? Remember that our "human computer's" "memory" and "hard disk" have capacity limits, and the information for 52 card types already exceeds what a single machine can handle. Of course, some will object that expanding memory or disk capacity would solve this, and that 52 card types hardly require distributed computing. Then consider: what if the pile contained hundreds or even thousands of card types?


     
     

So 52 card types will do for our illustration. Now that we understand that a single computer cannot handle this much data on its own, and that multiple computers need to collaborate, it is time for MapReduce to unleash its big move.

After looking up some material and doing some hands-on practice, I found that MapReduce can be summed up in a four-word formula: split, transform, shuffle, merge, representing the four steps "segmentation", "transformation", "shuffling", and "merging":


 
 

Let's see how this four-word formula solves the big data card problem.

The first step, segmentation: cut the input data into multiple parts

Since a single "human computer" cannot process all the cards, we randomly divide the cards into multiple portions, each handled by one "human computer". The number of cards in each portion must not exceed a single computer's processing limit, and we try to keep the portions roughly equal in size.
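A minimal sketch of this splitting step, assuming a simple round-robin deal after a random shuffle (the hand of cards is made up):

```python
import random

def split(cards, n_workers):
    """Randomly deal the cards into n_workers roughly equal portions."""
    cards = list(cards)
    random.shuffle(cards)
    # Deal round-robin so portion sizes differ by at most one card.
    return [cards[i::n_workers] for i in range(n_workers)]

portions = split(["A♠", "K♥", "7♦", "2♣", "J♠"], n_workers=2)
```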


 
 

Here a word on the division of roles. When multiple computers cooperate, there must be a division of labor. The "human computer" in charge of splitting the data can be understood as the "commander"; there is usually only one commander (in practice there may be more), who coordinates and schedules the troops under his command. The "human computers" that carry out the concrete computing tasks are the "computing soldiers". According to the tasks they take on, the computing soldiers are divided into "transform soldiers" and "merge soldiers": the former are responsible for the second step, "transformation", and the latter for the final step, "merging".


 
 

The more "computing soldiers" the better, of course, but the ratio of "transform soldiers" to "merge soldiers" is not fixed; it can be adjusted according to the amount of data and the efficiency of the operations. When troops are short, one soldier can take on both roles, and the "commander" may also serve as a "computing soldier": just as in reality one computer can run multiple tasks, in theory one computer can play many roles.

Before splitting the cards, the "commander" decides the numbers of "transform soldiers" and "merge soldiers", splits the deck into as many portions as there are "transform soldiers", hands each portion to one "transform soldier", and then moves on to the next step.


 
 

The second step, transformation: apply a mapping transformation to each piece of input data (this is the Map in MapReduce)

Each "transform soldier" applies the same rule to every card in his portion, so that subsequent steps can process the transformed results. The transformation can be a mathematical operation such as addition, subtraction, multiplication, or division, or it can be a change to the structure of the input data. For our card problem the goal is counting, so we convert each card into a numeric structure that is easier for a computer to process: stick a small note on each card stating that its count is 1.
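The "stick a note on each card" transformation is essentially the classic Map step that emits a (key, 1) pair per record; a minimal sketch:

```python
def map_cards(cards):
    # "Stick a note on each card": pair every card with the count 1.
    return [(card, 1) for card in cards]

variant = map_cards(["A♠", "A♠", "K♥"])
print(variant)  # [('A♠', 1), ('A♠', 1), ('K♥', 1)]
```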


 
 

We call these labeled cards "variant cards". When we count card types in a later step, we only need to add up the numbers on the labels. Some readers will surely wonder why we don't just let each "computing soldier" directly count the cards of each type in his own portion. The reason is that the essence of this "mapping transformation" is to apply one identical rule to every single card, leaving the counting work for the final step. Strict pipelining makes the whole process more efficient, and because the transformation rule is defined per problem, the model adapts more easily to different kinds of computation.

The third step, shuffling: group the transformed data according to certain rules

After the transformation is done, each "transform soldier" divides his variant cards by card type into several small piles; each pile will ultimately be merged and counted by one designated "merge soldier". This process is the "shuffle": the "transform soldiers" group the transformed cards by rule and distribute them to the designated "merge soldiers".


 
 

Shuffling has two phases. In the first phase, each "transform soldier" classifies his variant cards according to a rule; the rule depends on each "merge soldier's" counting range, and the number of groups equals the number of "merge soldiers". As shown above, if three "merge soldiers" are each responsible for a different range of card types, then each "transform soldier" must split his variant cards into three piles according to which "merge soldier" is responsible for which types, and give each pile to the corresponding "merge soldier". In the second phase, under the commander's direction, each "merge soldier" collects his own share of variant cards from every "transform soldier", so that all cards of the same type end up in the hands of exactly one "merge soldier". The point of shuffling is to gather variant cards of the same type together so they can be counted.
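Phase one of the shuffle can be sketched as a partitioning function. The article assigns each "merge soldier" a range of card types; the sketch below instead uses a hash of the card (a common alternative in real frameworks, and an assumption here), which still guarantees that every copy of the same card lands with the same "merge soldier":

```python
def shuffle(mapped, n_reducers):
    """Partition (card, 1) pairs so that all copies of a card go to one reducer."""
    partitions = [[] for _ in range(n_reducers)]
    for card, one in mapped:
        partitions[hash(card) % n_reducers].append((card, one))
    return partitions

mapped = [("A♠", 1), ("K♥", 1), ("A♠", 1)]
partitions = shuffle(mapped, n_reducers=2)
```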


 
 

The fourth step, merging: aggregate the shuffled data (this is the Reduce in MapReduce)

Each "merge soldier" merges the variant cards in his hands one by one according to the same computation rule. This rule, too, is defined per problem; here it is simply adding up the numbers on the labels to produce the final counts.
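The merge rule here, summing the note values per card type, is the classic Reduce step; a minimal sketch:

```python
from collections import defaultdict

def reduce_counts(pairs):
    """Merge soldier's rule: sum the note values for each card type."""
    totals = defaultdict(int)
    for card, one in pairs:
        totals[card] += one
    return dict(totals)

print(reduce_counts([("A♠", 1), ("A♠", 1), ("K♥", 1)]))  # {'A♠': 2, 'K♥': 1}
```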


 
 

Then all the "merge soldiers" hand in their results to the "commander", who aggregates them and announces the final counts.


 
 

Summary

OK, that completes the four-word formula "split, transform, shuffle, merge". The full process looks like this:
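Putting the four steps together, here is a toy single-process sketch of the whole pipeline (real frameworks run these steps across many machines; the function name, the hash partitioning, and the cards are all made up for illustration):

```python
import random
from collections import defaultdict

def mapreduce_count(cards, n_mappers=3, n_reducers=2):
    """Toy single-process sketch of split / transform / shuffle / merge."""
    # 1. Split: deal the cards into roughly equal portions.
    cards = list(cards)
    random.shuffle(cards)
    portions = [cards[i::n_mappers] for i in range(n_mappers)]
    # 2. Transform (Map): each portion becomes (card, 1) pairs.
    mapped = [[(card, 1) for card in portion] for portion in portions]
    # 3. Shuffle: route every copy of a card to the same reducer.
    partitions = [[] for _ in range(n_reducers)]
    for portion in mapped:
        for card, one in portion:
            partitions[hash(card) % n_reducers].append((card, one))
    # 4. Merge (Reduce): each reducer sums its labels; the commander combines.
    result = {}
    for partition in partitions:
        totals = defaultdict(int)
        for card, one in partition:
            totals[card] += one
        result.update(totals)
    return result

counts = mapreduce_count(["A♠"] * 3 + ["K♥"] * 2 + ["7♦"])
print(counts)  # counts 3 for A♠, 2 for K♥, 1 for 7♦ (key order may vary)
```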


 
 

Distributed processing is not complicated in logic, but the concrete implementation involves many complex processes, such as how the "commander" coordinates and schedules all the "computing soldiers" and how the soldiers communicate with one another. For programmers who use MapReduce to complete computing tasks, however, these complexities are transparent: the distributed computing framework handles them itself, and the programmer only needs to define two computation rules: the transformation rule in step two and the merging rule in step four.

As the saying goes, the greatest truths are the simplest, and all changes stay true to their origin. Understand MapReduce and you understand big data distributed processing; understand big data distributed processing and you understand the core of big data technology.



Author: LeonLu
Link: https://www.jianshu.com/p/094c5aab1fdb
Source: Jianshu (简书)
Copyright belongs to the author. For commercial reposting, please contact the author for authorization; for non-commercial reposting, please cite the source.


Reposted from: www.cnblogs.com/landv/p/11730198.html