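The Usual Preface

May and June found me stuck in round after round of project bugs again. To be fair, I keep getting better at analyzing logs, but my coding has not improved much. Fixing bugs mostly means adding or adjusting business logic on top of existing code. I do learn a lot whenever I touch the original upstream code, but that part rarely breaks; the vast majority of the pits were dug by logic our own team added or modified. Filling them in is slow and frustrating.
Time was limited, so this round I again picked only five short articles, trying to raise the quality of the translations while freeing up some spare time to study other things.
Still, this round of translation was very rewarding for me. For two of the articles I took detailed notes and looked up a fair amount of related material while reading, and I learned a lot.
One of them is about the art of data visualization. It only uses web performance analysis as its example and gives a brief tour of when each common chart type is appropriate, but that is exactly what I needed to learn recently: for business reasons I often have to integrate third-party algorithms and evaluate their performance. My team has always analyzed performance by printing logs, which is inconvenient and often lets abnormal changes slip by unnoticed. I needed a way to make our performance analysis more efficient, and this article pointed me in the right direction.
The other is an introduction to probabilistic data structures. I chose it because I spotted Bloom Filter, which reminded me of my university days writing web crawlers for a professor. I took the chance to revisit Bloom Filter and to learn about two more algorithms, HyperLogLog and Count-Min Sketch. I have a feeling I will be using them in the near future.
Four of the articles from this round were accepted again:
On copyright: I honestly do not know whether posting these bilingual Chinese-English translations on my blog infringes the copyright of the English originals. But without the bilingual format there would be little point in posting them at all. If it does turn out to be infringement, I will take them down later~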

Copyright

Translator: StoneDemo, a member of the 云+社区 translation group
Original article: What Does Big Data Mean to You?
Original author: MEGHRAJ SINGH BENIWAL

What Does Big Data Mean to You?

题目：（大数据对你来说意味着什么？）

This is the era of Big Data and these are undoubtedly revolutionary times. Massive amounts of data are being generated by the hour, from social media and from enterprises. It would be extremely foolish to waste this treasure trove by simply doing nothing about it. Enterprises have learnt to harvest Big Data to earn higher profits, offer better services and gain a deeper understanding of their target clientèle.

毋庸置疑，现如今是属于大数据（Big Data）的、革命性的时代。从社交媒体到企业，每时每刻都在产生大量的数据。无所作为，从而把这样的宝藏白白浪费掉，是极其愚蠢的。企业已经学会了利用大数据来获取更高的利润、提供更好的服务，并更深入地了解其目标客户。

Big Data basically refers to the huge amounts of data, both organised and unorganised, that enterprises generate on a day-to-day basis. In this context, the volume of data is not as relevant as what organisations do with the data. Analysis of Big Data can lead to insights that improve strategic business decision-making.

大数据主要是指企业日常产生的大量有组织以及无组织的数据。在这一语境下，数据量的大小并不如组织如何利用这些数据来得重要。对大数据进行分析，可以得出有助于改进战略商务决策（Strategic business decision-making）的洞见。

The importance of Big Data

（大数据的重要性）

As mentioned earlier, the value of Big Data does not depend on how much information you have, but on what you are going to do with it. You can harvest data from any point and examine it to find solutions that enable the following four things:

Price reductions

Time reductions

Fresh product development and modified offerings

Making smart judgements

如前所述，大数据的价值不在于您拥有多少信息，而在于您要如何利用它。您可以从任何一个点收集数据（并对其进行检查），以找到下面四种情况的解决方案：

降低价格（Price reductions）
缩短时间（Time reductions）
新产品开发，以及改进产品
做出明智的判断

When you pool Big Data with high-energy analytics, the following business-related tasks are possible:

Identifying reasons of failures, issues and flaws in real-time.

Generating vouchers at the point-of-sale based on the customer’s purchasing history.

Calculating the full risk of certain functions within minutes.

Detecting deceitful behaviour before it impacts your organisation.

当您将大数据与高性能分析（High-energy analytics）相结合时，下面这些与业务相关的任务就可能实现：

实时识别故障原因、问题以及缺陷。
根据客户的购买历史，在销售点（Point-of-sale）生成优惠券（Voucher）。
在几分钟内计算出特定功能的全部风险。
在欺诈行为影响到您的组织之前，将其检测出来。

图1. 大数据基础结构

Examples of Big Data

（大数据实例）

The automotive industry: Ford’s modern-day hybrid Fusion model yields up to 25GB of data per hour. This data can be used to interpret driving habits and patterns in order to prevent accidents, deflect collisions, etc.

汽车行业：福特的现代混合动力车型 Fusion 每小时产生高达 25GB 的数据。这些数据可以用于解读驾驶习惯和驾驶模式，以预防事故、避免碰撞等。

Entertainment: The video game industry is using Big Data for examining over 500GB of organised data and 4TB of functional backlogs, each day.

娱乐：电子游戏行业每天都在使用大数据技术来检查超过 500GB 的有组织数据，以及 4TB 的功能性积压（Functional backlogs）。

The social media effect: About 500TB of fresh data gets added into the databases of social media site Facebook daily.

社交媒体效应：每天，社交媒体网站 Facebook 的数据库中都会增加大约 500TB 的新数据。
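To get a feel for the figures quoted in these examples, here is a minimal back-of-the-envelope sketch. It assumes decimal units (1 TB = 1,000 GB) and, for the Ford figure, a hypothetical vehicle that runs around the clock; the function names are illustrative only and do not come from the original article.

```python
# Rough conversions for the data rates quoted above.
# Assumptions: decimal units (1 TB = 1000 GB) and a 24-hour duty cycle.
GB_PER_TB = 1_000
SECONDS_PER_DAY = 24 * 60 * 60


def gb_per_day_from_hourly(gb_per_hour: float, hours: float = 24) -> float:
    """Daily volume for a source that emits gb_per_hour for `hours` hours."""
    return gb_per_hour * hours


def gb_per_second_from_daily_tb(tb_per_day: float) -> float:
    """Average ingest rate in GB/s for a source that adds tb_per_day per day."""
    return tb_per_day * GB_PER_TB / SECONDS_PER_DAY


# Ford Fusion: up to 25 GB per hour.
ford_gb_per_day = gb_per_day_from_hourly(25)

# Facebook: about 500 TB of new data per day.
facebook_gb_per_second = gb_per_second_from_daily_tb(500)

print(f"Ford Fusion: up to {ford_gb_per_day:.0f} GB/day")
print(f"Facebook: about {facebook_gb_per_second:.2f} GB/s on average")
```

Seen as a sustained rate rather than a daily total, the Facebook figure works out to several gigabytes every second, which is one way to read the "velocity" characteristic discussed later in the article.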

Types of Big Data

（大数据类型）

Big Data can be classified into the following three main categories.

大数据可以分为以下三大类。

1. Structured: Data that can be stocked, approached and refined in the form of a fixed data format is termed as structured data. With time, computer science has been able to develop methods for running with such data and also deriving value out of it. Nevertheless, these days, we are anticipating issues related to the sheer volume of such data, which is turning into zettabytes (1 billion terabytes equals 1 zettabyte).

1. 结构化：能够以固定数据格式进行存储、访问和处理的数据称为结构化数据。随着时间的推移，计算机科学已经发展出了处理这类数据并从中获取价值的方法。不过如今，我们预见到这类数据的庞大体量将带来问题，其规模正在迈向 ZB 级别（10 亿 TB 等于 1 ZB）。

2. Unstructured: Data in an unmapped form is known as unstructured data. Large volumes of unstructured data pose many challenges in terms of how to derive value out of it. For example, a heterogeneous data source, incorporating a collection of simple text files, pictures, audio as well as video recordings, will be difficult to analyse. These days, organisations have an abundance of data available to them, but unfortunately they don’t know how to extract value out of it since this data is in an unprocessed form.

2. 非结构化：非映射（Unmapped）形式的数据称为非结构化数据。如何从大量的非结构化数据中获取价值，这其中充满挑战。例如，一个包含了简单文本文件、图片、音频以及视频录像的异构数据源（Heterogeneous data source），将难以进行分析。当下，组织拥有大量可用的数据，但不幸的是，他们并不知道该如何从中提取价值，因为这些数据处于未经处理的形式。

3. Semi-structured: This can comprise both forms of data. Also, we can consider semi-structured data as a structure in form, but in reality, the data itself is not defined, e.g., data depicted in an XML file.

3. 半结构化：这可以包含两种形式的数据。另外，我们可以将半结构化数据视为一种形式上的结构，但实际上数据本身并未定义。例如，XML 文件中所描述的数据。
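The XML case mentioned in the third category can be made concrete with a small sketch. The document below is invented for illustration: it has a recognizable structure (tags and nesting), but no fixed schema, so different records carry different fields. The code uses only Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET

# Hypothetical semi-structured data: both records are <customer> elements,
# but they do not share the same set of child fields.
XML_DOC = """
<customers>
  <customer id="1"><name>Ann</name><email>ann@example.com</email></customer>
  <customer id="2"><name>Bob</name><phone>555-0100</phone></customer>
</customers>
""".strip()


def to_records(xml_text: str) -> list[dict]:
    """Flatten each <customer> element into a dict of whatever fields it has."""
    root = ET.fromstring(xml_text)
    records = []
    for customer in root.findall("customer"):
        record = {"id": customer.get("id")}
        for field in customer:
            record[field.tag] = field.text
        records.append(record)
    return records


records = to_records(XML_DOC)

# The union of fields is only known after reading the data, which is what
# distinguishes this from a fixed-format (structured) table.
all_fields = sorted({key for record in records for key in record})
print(records)
print(all_fields)
```

A structured store would reject or pad the second record because it lacks an `email` column; here the shape of each record is discovered while parsing.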

The four Vs of Big Data

（大数据的四个 “V” 值）

Some of the common characteristics of Big Data are depicted in Figure 2.

一些共同特征如图 2 所示。

1. Volume: The volume of data is an important factor in deciding on its value. Hence, volume is one property that needs to be considered while handling Big Data.

1. 体积（Volume）：数据量是决定大数据价值的重要因素。因此，体积是处理大数据时需要考虑的一个属性。

2. Variety: This refers to assorted data sources and the nature of data, both structured and unstructured. Previously, spreadsheets and databases were the only origins of data considered in most of the practical applications. But these days, data in the form of e-mails, pictures, recordings, monitoring devices, etc, are also being considered in investigation applications.

2. 种类（Variety）：指的是各种数据源以及数据的性质，这其中既有结构化的，也有非结构化的。曾经，电子表格和数据库是大多数实际应用中唯一考虑的数据来源。但现在，分析应用中还会考虑到电子邮件、图片、录音以及监控设备等形式的数据。

3. Velocity: This term refers to how swiftly data is generated. How fast the data is created and refined to meet a particular need, determines its real potential. The velocity of Big Data is the rate at which data flows from sources like business procedures, application logs, websites, etc. The speed at which Big Data flows is very high and virtually non-stop.

3. 速率（Velocity）：该术语是指数据生成的速度有多快。为满足特定需求而创建和处理数据的速度，决定了数据的真正潜力。大数据的速率是数据从业务流程、应用程序日志、网站等来源流出的速度。大数据流动的速度非常高，几乎从不间断。

4. Veracity: This refers to the incompatibility between the various formats that the data is being generated in, thus constraining the process of mining or managing the data profitably.

4. 真实性（Veracity）：这是指所生成数据的各种格式之间的不兼容性，这限制了以有利可图的方式挖掘或管理数据的过程。

图2. 大数据的特征

Big Data architecture

（大数据架构）

Big Data architecture comprises consistent, scalable and completely computerised data pipelines. The skillset needed to build such infrastructure requires a deep knowledge of every layer in the heap, starting with a cluster design to setting up the top chain responsible for processing the data. Figure 3 shows the complexity of the stack, along with how data pipeline engineering touches every part of it.

大数据架构包含一致的、可扩展的，以及完全计算机化的数据管道（Data pipelines）。构建这种基础架构，需要对技术栈中的每一层都有深入的了解，即从集群设计（Cluster design）开始，直到设置负责处理数据的顶层链路（Top chain）。图 3 展示了技术栈的复杂性，以及数据管道工程如何触及其每个部分。

In this figure, the data pipelines collect raw data and transform it into something of value. Meanwhile, the Big Data engineer has to plan what happens to the data, the way it is stored in the cluster, how access is approved internally, what equipment to use for processing the data, and finally, the mode of providing access to the outside world. Those who design and implement this architecture are referred to as Big Data engineers.

在图 3 中，数据管道收集原始数据并将其转化为有价值的东西。同时，大数据工程师必须计划好数据会发生什么情况，数据存储在集群中的方式，内部许可的访问方式，用于处理数据的设备，以及提供给外界访问的模式。那些设计和实现这种架构的人被称为大数据工程师。
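The responsibilities listed above (collecting raw data, transforming it, deciding how it is stored, and exposing it to consumers) can be sketched as a toy in-memory pipeline. This is only an illustration under simple assumptions: the event format, stage names, and the dictionary used as "storage" are all hypothetical, and a real deployment would replace each stage with cluster-scale components.

```python
import json
from collections import defaultdict

# Hypothetical raw input: one JSON event per line, with one malformed line.
RAW_EVENTS = [
    '{"user": "ann", "action": "view", "ms": 120}',
    "not-json",
    '{"user": "bob", "action": "buy", "ms": 340}',
    '{"user": "ann", "action": "buy", "ms": 95}',
]


def collect(lines):
    """Ingest stage: parse raw lines and skip anything that is not valid JSON."""
    for line in lines:
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue


def transform(events):
    """Transform stage: normalise units (milliseconds to seconds)."""
    for event in events:
        yield {
            "user": event["user"],
            "action": event["action"],
            "seconds": event["ms"] / 1000,
        }


def store(events):
    """Storage stage: partition events by user (stand-in for cluster storage)."""
    table = defaultdict(list)
    for event in events:
        table[event["user"]].append(event)
    return dict(table)


def serve(table, user):
    """Access stage: the only view exposed to outside consumers."""
    return [event["action"] for event in table.get(user, [])]


table = store(transform(collect(RAW_EVENTS)))
print(serve(table, "ann"))
```

Chaining generators keeps each stage independent, which mirrors the point in the text that the engineer has to make a separate decision for ingestion, storage, processing, and external access.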

Big Data technologies

（大数据技术）

As we know, the subject of Big Data is very broad and permeates many new technology developments. Here is an overview of some of the technologies that help users monetise Big Data.

众所周知，大数据的主题非常广泛，并且渗透到了许多新技术的发展中。以下是对一些能够帮助用户将大数据变现（Monetise）的技术的概述。

1. MapReduce: This allows job implementation, with scalability crossing thousands of servers.

Map: Input dataset transforms into a different set of values.

Reduce: Many outputs of the Map task are united to form a reduced set of values.

1. MapReduce（映射化简）：这使得任务的实现具有能够跨越数千台服务器的可扩展性。

Map：将输入数据集转换为一组不同的值。
Reduce：将 Map 任务的多个输出联合起来，形成一组简化的值。

2. Hadoop: This is the most admired execution of MapReduce, being a completely open source platform for handling Big Data. Hadoop is flexible enough to be able to work with many data sources, like aggregating data in order to do large scale processing, reading data from a database, etc.

2. Hadoop：这是最受推崇的 MapReduce 实现，是一个完全开源的大数据处理平台。Hadoop 足够灵活，能够配合多种数据源工作，例如聚合数据以进行大规模处理、从数据库读取数据等。

3. Hive: This is an SQL-like link that allows BI applications to run queries beside a Hadoop cluster. Having been developed by Facebook, it has been made open source for a little while and is a higher-level concept of the Hadoop framework. Also, it allows everyone to make queries against data stored in a Hadoop cluster and has improved on Hadoop’s functionality, making it ideal for BI users.

3. Hive：这是一个类 SQL 的接口，允许 BI（商业智能）应用程序在 Hadoop 集群上运行查询。它由 Facebook 开发，开源已有一段时间，是构建在 Hadoop 框架之上的更高层抽象。此外，它让任何人都可以对存储在 Hadoop 集群中的数据进行查询，并改进了 Hadoop 的功能，使其成为 BI 用户的理想选择。
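The Map and Reduce steps described in the MapReduce item above can be illustrated with the classic word-count example. This is a single-process sketch in plain Python, not Hadoop's API: the `shuffle` step, which groups values by key between the two phases, is done by the framework in a real cluster, and all function names here are illustrative.

```python
from collections import defaultdict
from functools import reduce


def map_phase(document: str) -> list[tuple[str, int]]:
    """Map: turn one input document into (word, 1) pairs."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all values that share a key (done by the framework in practice)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: combine the many map outputs for one key into a single value."""
    return key, reduce(lambda a, b: a + b, values)


def word_count(documents: list[str]) -> dict[str, int]:
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data is big", "data pipelines move data"])
print(counts)
```

Because each call to `map_phase` depends only on its own document and each call to `reduce_phase` only on its own key, both phases can be spread across many machines, which is where the scalability mentioned in the text comes from.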

图3. 大数据体系结构

Advantages of Big Data processing

（大数据处理的优势）

The capability of processing Big Data has various benefits.

处理大数据的能力具有多种益处。

1. Businesses can make use of outside brainpower while taking decisions: The right to use social data from search engines and websites like Facebook and Twitter is enabling enterprises to improve their business strategies.

1. 企业可以在进行决策时利用外脑（Outside brainpower）：使用来自搜索引擎以及 Facebook 和 Twitter 等网站的社交数据的权利，可以帮助企业改进商务战略。

2. Enhanced customer service: Customer response systems are getting replaced by new systems intended for Big Data technologies. Within these new systems, Big Data technologies are being utilised to read and assess consumer responses.

2. 增强客户服务：客户响应系统正在被使用了大数据技术的新系统所取代。在这些新系统中，大数据技术用于理解与评估消费者的反应。

3. Early recognition of risks for the services: Risk factors can be recognised beforehand to deliver the perfect data.

3. 在早期识别服务风险：可以事先识别风险因素，以提供完美的数据。

4. Improved operational competence: Big Data technologies can be utilised for building staging areas or landing zones for new data, prior to deciding what data should be moved to the data warehouse. Also, such incorporation of Big Data and data warehousing technologies helps businesses to bypass data that is not commonly accessed.

4. 提高操作能力：大数据技术可用于在决定将哪些数据移入数据仓库之前，为新数据构建暂存区（Staging areas）或着陆区（Landing zones）。此外，这种大数据和数据仓库技术的结合可帮助企业绕过不经常访问的数据。

The challenges

（挑战）

Though it is very easy to get trapped in all the hype around Big Data, one of the reasons it is so underutilised is that there are many challenges still to be resolved in the technologies used to harness it. Some of these are:

Companies face problems in identifying the correct data and examining how best to utilise it. Constructing data-related business cases frequently means forming opinions out-of-the-box and looking for income models that are extremely different from the traditional business model.

Companies are reluctant to choose the fine talent that is capable of both working with new technologies and examining the data to find significant business insights.

A bulk of data points have not been linked yet, and companies frequently do not have the correct platforms to combine and manage the data across the enterprise.

The technology in the data world is evolving very fast. Leveraging data means functioning with well-built, pioneering technology collaborators – companies that can help create the right IT design so as to adapt to changes in the landscape in a well-organised manner.

虽然很容易陷入各种关于大数据的炒作之中，但它未得到充分利用的原因之一就是，在使用到它的技术中仍有许多挑战需要解决。其中一些挑战如下：

公司面临的问题是：识别正确的数据，以及研究如何最好地利用它们。构建与数据有关的商业案例，往往意味着要跳出固有思维（Out-of-the-box）来形成观点，并寻找与传统商业模式截然不同的收入模式。
公司不情愿去挑选同时具有使用新技术和审查数据（以发掘重要的商业洞察）能力的优秀人才。
大量数据点还没有进行链接，公司通常没有合适的平台来整合和管理整个企业的数据。
数据世界的技术发展日新月异。借用数据之力，意味着要与实力雄厚、具有开拓精神的技术合作伙伴共事，这些公司可以帮助创建正确的 IT 设计，从而以有条不紊的方式适应环境的变化。

The accessibility of Big Data, inexpensive product hardware, and new information managing and analytics software have come together to create a unique moment in the history of data analysis. We now have the capability that is necessary to examine these amazing data sets rapidly and cost-effectively, for the first time in history. This ability symbolises an authentic leap forward, and a chance to enjoy massive improvements in terms of work productivity, income and success.

大数据的可访问性（Accessibility），便宜的硬件产品，以及新的信息管理和分析软件聚合在一起，在数据分析的历史中创造了独特的时刻。我们现在有能力快速且经济高效地审查这些惊人的数据集，这是有史以来的第一次。这种能力象征着真正的飞跃，同时也象征着一个在工作效率、收入和成功方面大幅进步的机会。

[Big Data Articles, Part 1] What Does Big Data Mean to You?