Differences and relations between Storm, Spark Streaming, and MapReduce

1.1 Basic Concepts

Storm is a stream computing framework written in Java and Clojure. Its advantage is that computation is done entirely in memory, so it is positioned as a distributed real-time computing system.

Spark is an open source, memory-based cluster computing system aimed at faster data analysis. It is a general purpose parallel computing framework similar to Hadoop MapReduce: it implements distributed computing on the MapReduce model and keeps the advantages of Hadoop MapReduce. Unlike MapReduce, however, a job's intermediate output and results can be kept in memory, so there is no need to read and write HDFS between steps. This makes Spark better suited to MapReduce-style algorithms that need iteration, such as those used in data mining and machine learning.
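
To make the point about iteration concrete, here is a minimal sketch in plain Python (not Spark code). `CountingStore` is a hypothetical stand-in for a storage layer such as HDFS, and the two functions are illustrative names: one reloads its input on every step, the way a chain of separate MapReduce jobs would, and the other loads it once and keeps it in memory.

```python
class CountingStore:
    """Hypothetical stand-in for a storage layer such as HDFS; counts reads."""

    def __init__(self, values):
        self._values = list(values)
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self._values)


def refine(estimate, data):
    """One iteration step: move the estimate halfway toward the data mean."""
    mean = sum(data) / len(data)
    return estimate + 0.5 * (mean - estimate)


def iterate_reloading(store, steps):
    """Every step re-reads its input from storage."""
    estimate = 0.0
    for _ in range(steps):
        estimate = refine(estimate, store.load())
    return estimate


def iterate_cached(store, steps):
    """The input is read once and reused from memory on every step."""
    data = store.load()
    estimate = 0.0
    for _ in range(steps):
        estimate = refine(estimate, data)
    return estimate
```

Both functions compute the same value; what differs is the number of storage reads, which grows with the number of iterations in the first case and stays at one in the second.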

Hadoop implements the MapReduce idea: it splits the data into slices and computes on them, in order to process large amounts of data offline. The data Hadoop processes must already be stored on HDFS, in HBase, or in a similar store, so Hadoop gains efficiency by moving the computation to the machines that hold the data.
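
The slice-then-compute idea can be sketched as a word count in plain Python. This is an illustration of the model only, not Hadoop's API; the function names and the `slice_size` parameter are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in one slice of the input."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle(mapped_slices):
    """Shuffle: group all emitted values by key across every slice."""
    groups = defaultdict(list)
    for pairs in mapped_slices:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key into one result."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines, slice_size=2):
    """Split the input into slices, map each slice, then shuffle and reduce."""
    slices = [lines[i:i + slice_size] for i in range(0, len(lines), slice_size)]
    mapped = [map_phase(s) for s in slices]  # each slice could run on its own machine
    return reduce_phase(shuffle(mapped))
```

In a real cluster each slice would be mapped on the machine that already stores it; here the slices are simply processed one after another in a single process.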

1.2 Application Scenarios

1.2.1 Storm application scenarios:

1) Stream processing. Storm can process a continuous stream of incoming messages and then write the results to a store.
2) Distributed RPC. Because Storm's processing components are distributed and its processing latency is low, it can be used as a general distributed RPC framework.
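
The first scenario (consume messages as they arrive, process each one, write the result to a store) can be sketched with Python generators. This is not Storm's API: the function names, the `user,amount` message format, and the dictionary used as storage are all assumptions for illustration. Storm itself calls the source of a stream a spout and a processing step a bolt.

```python
def message_source(messages):
    """Plays the role of a spout: emits incoming messages one at a time."""
    for message in messages:
        yield message


def parse_bolt(stream):
    """Plays the role of a bolt: turns 'user,amount' text into a tuple."""
    for message in stream:
        user, amount = message.split(",")
        yield user, int(amount)


def running_total_bolt(stream, storage):
    """Updates a per-user total in storage as soon as each tuple arrives."""
    for user, amount in stream:
        storage[user] = storage.get(user, 0) + amount
        yield user, storage[user]
```

Chaining the three stages, for example `running_total_bolt(parse_bolt(message_source(msgs)), storage)`, processes each message end to end before the next one is read, which is the record-at-a-time behaviour the section describes.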

1.2.2 Spark application scenarios:

1) Applications that operate on a particular dataset many times. Spark is a memory-based iterative computing framework, so it suits applications that need to operate on the same dataset repeatedly. The more often the operations are repeated and the more data has to be read, the greater the benefit. For applications with a small amount of data but heavy computation, the benefit is relatively small.

2) Applications with coarse-grained state updates. Because of the characteristics of RDDs, Spark is not suited to applications that make asynchronous, fine-grained state updates, for example the storage behind a web service, or an incremental web crawler and indexer. In other words, application models based on incremental modification are not a good fit. Apart from that, Spark's range of application is fairly broad and general.
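
A toy model may help show what "coarse-grained" means here. The class below is a hypothetical, eager stand-in written for this example, not Spark's RDD API (real RDDs are evaluated lazily and distributed). It keeps two properties that matter for the argument: the dataset is immutable, and every operation applies to the whole dataset and returns a new one while recording how it was derived.

```python
class ToyDataset:
    """Hypothetical immutable dataset: no method updates a single element."""

    def __init__(self, items, lineage=("source",)):
        self._items = tuple(items)
        self.lineage = lineage  # names of the steps that produced this dataset

    def map(self, fn, name):
        """Apply fn to every element and return a new dataset."""
        return ToyDataset((fn(x) for x in self._items), self.lineage + (name,))

    def filter(self, predicate, name):
        """Keep the elements that satisfy predicate, as a new dataset."""
        kept = (x for x in self._items if predicate(x))
        return ToyDataset(kept, self.lineage + (name,))

    def collect(self):
        """Return the elements as a plain list."""
        return list(self._items)
```

Because there is no way to change one record in place, a workload that constantly touches individual records, such as an incremental crawler index, would have to derive a whole new dataset for each small change.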

1.2.3 Hadoop application scenarios:

1) Offline analysis and processing of massive data

2) Large-scale web information search

3) Data-intensive parallel computing

1.3 Differences and Relations

Compared with Hadoop, Storm has the advantage in real-time processing of big data. Hadoop's high latency when handling large data leaves it without an adequate answer for data with real-time characteristics. Storm is already widely used in applications such as real-time push systems, early warning systems, and many other scenarios.

Differences between Storm and MapReduce:

Storm: processes and threads stay resident in memory. Data is not written to disk, and is transferred over the network.

MapReduce: an offline batch computing framework designed for big data.

Differences between Storm and Spark Streaming:

Storm: pure streaming, designed specifically for stream processing. Its data transfer model is simpler and, in many places, more efficient. That does not mean it cannot do batch processing: it can also do micro-batching to improve throughput.

Spark Streaming: micro-batching. It cuts the data into very small RDD batches to approximate stream processing.
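
The difference between the two models can be sketched in plain Python. Neither function is Storm or Spark Streaming code, and the names are illustrative. One simplifying assumption is labeled in the code: the sketch closes a batch after a fixed number of records so the behaviour is deterministic, whereas Spark Streaming closes batches on a time interval.

```python
def per_record(stream, handle):
    """Pure streaming: hand each record to the handler as soon as it arrives."""
    return [handle(record) for record in stream]


def micro_batches(stream, batch_size):
    """Micro-batching: collect records into small batches before processing.

    Assumption for this sketch: batches are closed by record count,
    not by a time interval.
    """
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch
```

With seven records and a batch size of three, the pure streaming path makes seven handler calls, while the micro-batch path makes three. Fewer, larger units of work help throughput, at the price of each record waiting for its batch to fill.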

 

1.4 Summary

Hadoop is suited to offline batch data processing, in scenarios where real-time requirements are very low.

Storm is suited to real-time streaming data processing, and handles real-time workloads well.

Spark sits between Hadoop's MapReduce batch framework and Storm's stream processing framework: it performs better than MapReduce at batch processing, and is weaker than Storm at streaming.

Origin www.cnblogs.com/taoweizhong/p/11025731.html