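2-1 Course Outline

1. Overview of Spark and Its Ecosystem

Background of Spark; Spark overview and features

History of Spark; Spark Survey

Spark vs. Hadoop; how Spark and Hadoop work together

Spark development languages; Spark run modes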

2-2 Spark Overview and Features

Official site: https://spark.apache.org/

1. Overview

Apache Spark™ is a unified analytics engine for large-scale data processing.

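In short, Spark is a single engine for large-scale data processing that covers batch, streaming, SQL, and machine learning workloads.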

2. Features

1. Speed

Run workloads 100x faster.

Apache Spark achieves high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine.

2. Ease of Use

Write applications quickly in Java, Scala, Python, R, and SQL.

Spark offers over 80 high-level operators that make it easy to build parallel apps. And you can use it interactively from the Scala, Python, R, and SQL shells.
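To make the "high-level operators" point concrete, here is a minimal word-count sketch using PySpark's RDD operators (`parallelize`, `flatMap`, `map`, `reduceByKey`, `collect`). The function names `tokenize` and `word_count` and the sample input are made up for illustration. The code assumes a `SparkContext` is created elsewhere and passed in, so the block itself does not import pyspark.

```python
def tokenize(line):
    """Split one line of text into lowercase words."""
    return line.lower().split()


def word_count(sc, lines):
    """Count words with RDD operators.

    `sc` is assumed to be an existing pyspark SparkContext.
    Returns a list of (word, count) pairs in no particular order.
    """
    return (
        sc.parallelize(lines)             # distribute the input lines
        .flatMap(tokenize)                # one record per word
        .map(lambda word: (word, 1))      # pair each word with a count of 1
        .reduceByKey(lambda a, b: a + b)  # sum the counts per word
        .collect()                        # bring the results back to the driver
    )


# Usage, assuming pyspark is installed:
#   from pyspark import SparkContext
#   sc = SparkContext("local[*]", "word-count-demo")
#   print(word_count(sc, ["Spark is fast", "Spark is general"]))
#   sc.stop()
```

The whole job is one chained expression. The same pipeline can be typed line by line into the `pyspark` shell, which is what the interactive use mentioned above looks like in practice.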

3. Generality

Combine SQL, streaming, and complex analytics.

Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.

4. Runs Everywhere

Spark runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources.


You can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN, on Mesos, or on Kubernetes. Access data in HDFS, Apache Cassandra, Apache HBase, Apache Hive, and hundreds of other data sources.
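In practice the cluster manager is selected through the master URL given to `spark-submit --master` (or to the `SparkContext`). The sketch below lists the URL formats for the modes named above. The helper `master_url`, the dictionary name, and the example host names and ports are hypothetical and only illustrate the shape of each URL.

```python
# Master URL formats accepted by spark-submit's --master option.
# Hosts and ports are filled in by the caller; the ones used in the
# example at the bottom are placeholders, not real endpoints.
MASTER_URL_TEMPLATES = {
    "local": "local[*]",                          # all cores on one machine
    "standalone": "spark://{host}:{port}",        # Spark's own cluster manager
    "yarn": "yarn",                               # cluster located via Hadoop config
    "mesos": "mesos://{host}:{port}",
    "kubernetes": "k8s://https://{host}:{port}",
}


def master_url(mode, host="", port=0):
    """Build the master URL for a run mode (illustrative helper)."""
    return MASTER_URL_TEMPLATES[mode].format(host=host, port=port)


print(master_url("standalone", "master.example.com", 7077))
```

For `yarn` the URL carries no host because Spark reads the cluster location from the Hadoop configuration directory instead.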

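2-3 Background of Spark

Limitations of MapReduce

1) Verbose code

2) Only the map and reduce operations are supported

3) Low execution efficiency

4) Poorly suited to iterative, interactive, and stream processing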
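The first two limitations are easier to see with the programming model written out. The sketch below imitates the MapReduce model in plain Python. It is not the Hadoop API, and the function names are made up for illustration. Even a word count has to be split into a map step, a shuffle step, and a reduce step.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.split():
            yield (word, 1)


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's list of values into one count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines):
    """Run the three phases in order."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


print(word_count(["spark is fast", "hadoop is batch"]))
```

Anything more involved than this, such as a join or an iterative algorithm, has to be expressed as a chain of such map/reduce jobs. In Hadoop MapReduce each job in the chain writes its output to disk before the next one reads it, which is a large part of why iterative workloads fit the model poorly.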

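Proliferation of frameworks

1) Batch processing (offline): MapReduce, Hive, Pig

2) Stream processing (real-time): Storm, JStorm

3) Interactive computing: Impala

Running a separate framework for each workload quietly drives up both learning and operations costs.

===> Spark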

Chapter 2: Overview of Spark and Its Ecosystem