Hands-On Deployment of a JStorm Cluster on CentOS 6.8

Alibaba JStorm is a powerful enterprise-grade stream computing engine that delivers roughly four times the performance of Apache Storm and can switch freely between single-record (row) mode and mini-batch mode. JStorm provides not only a stream computing engine but a complete solution for real-time computing, including additional components such as jstorm-on-yarn, jstorm-on-docker, a SQL engine, an exactly-once framework, and so on.

1. JStorm is a distributed real-time computing engine

JStorm is a system similar to Hadoop MapReduce: the user implements a task against the specified interfaces and submits it to the JStorm system, which then runs it 7 * 24 hours. If a Worker fails unexpectedly, the scheduler immediately assigns a new Worker to replace the failed one.

Therefore, from the application's point of view, a JStorm application is a distributed application that follows a certain programming specification. From the system's perspective, JStorm is a scheduling system similar to MapReduce. From the data's perspective, JStorm is a pipeline-based message processing mechanism.

Real-time computing is currently the most popular direction in the big data field: people's requirements on data keep growing, and latency requirements keep tightening. Traditional Hadoop MapReduce gradually fails to meet these needs, so demand in this area keeps increasing.

Comparison of JStorm components and Hadoop components

|                       | JStorm     | Hadoop         |
|-----------------------|------------|----------------|
| Role                  | Nimbus     | JobTracker     |
|                       | Supervisor | TaskTracker    |
|                       | Worker     | Child          |
| Application name      | Topology   | Job            |
| Programming interface | Spout/Bolt | Mapper/Reducer |

Advantages

Before Storm and JStorm appeared, there were many real-time computing engines on the market, but since their appearance they have essentially dominated the field. Their advantages:

  • Fast development: the interfaces are simple and easy to use. As long as you follow the programming conventions of Topology, Spout, and Bolt, you can develop a highly scalable application without worrying about things like the underlying RPC, redundancy between Workers, or data distribution.
  • Excellent scalability: when one level of processing units becomes a bottleneck, performance can be scaled linearly simply by increasing its configured concurrency.
  • Robustness: when a Worker fails or a machine goes down, a new Worker is automatically assigned to replace the failed one.
  • Data accuracy: the Ack mechanism can ensure that data is not lost; if stricter accuracy is required, a transaction mechanism can guarantee it.
  • Low latency: JStorm's design is biased toward processing single records, so its latency is lower than that of similar products.

Application Scenarios

JStorm processes data as a message-based pipeline, so it is especially suitable for stateless computation, where all the data a computing unit depends on can be found in the received message, and ideally one data stream does not depend on another.

Therefore, it is often used for:

  • Log analysis: extracting specific data from logs and storing the analysis results in external storage such as a database. Currently, mainstream log analysis uses JStorm or Storm.
  • Pipeline systems: moving data from one system to another, for example synchronizing a database to Hadoop.
  • Message converters: transforming received messages into a certain format and storing them in another system, such as message middleware.
  • Statistical analyzers: extracting a certain field from logs or messages, performing count or sum calculations, and finally storing the statistics in external storage. The intermediate processing can be more complex.
  • Real-time recommendation systems: running recommendation algorithms in JStorm to achieve second-level recommendation latency.

Basic concepts

First of all, JStorm is somewhat similar to Hadoop's MR (Map-Reduce). The difference is that a job submitted to Hadoop's MR ends after execution and its process exits, while a JStorm task (called a topology in JStorm) runs 7 * 24 hours unless the user actively kills it.

JStorm components

Next is a rough structure diagram of a classic Storm topology (JStorm's is the same):

[Figure: classic Storm topology structure diagram]

The faucet in the picture (a bit crude, admittedly) is called a spout, and the lightning bolt is called a bolt.

A JStorm topology is made up of two kinds of components: spouts and bolts.

# spout

The spout represents the input data source, which can be anything: Kafka, a DB, HBase, or even HDFS. JStorm continuously reads data from this source and sends it downstream to the bolts for processing.

# bolt

A bolt represents the processing logic. When a bolt receives a message, it processes it (that is, executes the user's business logic). After processing, it can either send the result on to downstream bolts, forming a processing pipeline (more precisely, a directed graph), or simply end there.

Usually the last bolt in a pipeline does some data storage work, such as writing real-time computed results into a DB or HBase for the front-end business to query and display.

Component interfaces

The JStorm framework defines an interface for the spout component: nextTuple. As the name suggests, it fetches the next message. At runtime, you can think of the JStorm framework as calling this interface continuously to pull data from the data source and send it to the bolts.

Likewise, the bolt component defines an interface: execute, which is where the user's business logic is processed.

Each topology can have multiple spouts, representing messages received from multiple data sources at the same time, and multiple bolts executing different pieces of business logic, as the sketch below illustrates.
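To make the two interfaces concrete, here is a minimal sketch of a spout and a bolt against the backtype.storm Java API that JStorm 2.1.1 is compatible with; the class names, the emitted field, and the sentence data are illustrative, not from the original article:

```java
import java.util.Map;

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// A spout: JStorm calls nextTuple() repeatedly to pull data from the source.
public class RandomSentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector; // keep the collector used to emit tuples
    }

    @Override
    public void nextTuple() {
        // A real spout would read from Kafka, a DB, HBase, etc.
        collector.emit(new Values("hello jstorm"));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence"));
    }
}

// A bolt: JStorm calls execute() once for every tuple it receives.
class PrinterBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        System.out.println(input.getStringByField("sentence")); // business logic goes here
        collector.ack(input); // acknowledge so the Ack mechanism can track the tuple
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // terminal bolt: nothing emitted downstream
    }
}
```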

Scheduling and Execution

Next, the scheduling and execution principles of a topology. For a topology, JStorm eventually schedules one or more workers; each worker is a real operating-system process, distributed to one or more machines in the cluster to execute in parallel.

Within each worker there can be multiple tasks, each of which is an execution thread. Each task is an instance of one of the components described above: either a spout or a bolt.

When users submit a topology, they will specify the following execution parameters:

# Total number of workers

That is, the total number of processes. For example, if I submit a topology and specify 3 workers, there may end up being 3 processes executing. The reason it is only "may" is that, depending on the configuration, JStorm may add internal components such as __acker or __topology_master (both of which are special bolts), which can make the final number of processes greater than what the user specified. By default, if the user sets fewer than 10 workers, __topology_master exists only as a task and does not monopolize a worker; if the user sets 10 or more workers, __topology_master runs as a task that occupies a worker exclusively.

# Parallelism of each component

As mentioned above, each topology can contain multiple spouts and bolts, and each spout or bolt can individually specify its parallelism, which is how many threads (tasks) execute that spout or bolt simultaneously.

In JStorm, every execution thread has a task id, increasing from 1, and the task ids within one component are consecutive.

Take the topology above, which contains one spout and one bolt, with the spout's parallelism being 5 and the bolt's 10. We then end up with 15 threads executing: 5 spout threads and 10 bolt threads.

In this case the spout's task ids may be 1-5 and the bolt's 6-15. "May", because JStorm's scheduler does not guarantee that task ids start with the spout and then move on to the bolt; it only guarantees that task ids within the same component are consecutive.

# The relationships between components

That is, the user needs to specify which bolts process the data emitted by a particular spout, or which bolts process the data emitted by an intermediate bolt. A sketch covering all three execution parameters follows.
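Putting the three parameters together, a minimal submission sketch (reusing the illustrative spout and bolt from above; the topology and component names are made up):

```java
import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;

public class DemoTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // Parallelism of each component: 5 spout tasks, 10 bolt tasks.
        builder.setSpout("sentence-spout", new RandomSentenceSpout(), 5);

        // Relationship between components: this bolt consumes the spout's
        // stream, with tuples shuffled evenly across the 10 bolt tasks.
        builder.setBolt("printer-bolt", new PrinterBolt(), 10)
               .shuffleGrouping("sentence-spout");

        Config conf = new Config();
        conf.setNumWorkers(3); // total number of worker processes

        StormSubmitter.submitTopology("demo-topology", conf, builder.createTopology());
    }
}
```

shuffleGrouping distributes the spout's tuples evenly and randomly across the bolt's tasks; other groupings such as fieldsGrouping exist for key-based routing.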

Still taking the topology above as an example: its tasks are distributed across 3 processes. JStorm uses an even scheduling algorithm, so at runtime you will see 5 threads executing in each process. Of course, since the spout has only 5 threads, they cannot be divided evenly among 3 processes; some process will have only 1 spout thread, and likewise some process will have 4 bolt threads.

While a topology is running, if a process (worker) dies, JStorm detects it and keeps trying to restart the process. This is what 7 * 24 hours of uninterrupted execution means.

Message communication

As mentioned above, a spout's messages are sent to particular bolts, and a bolt can also send to other bolts; so how do they communicate?

First, when a spout emits a message, JStorm computes the list of target task ids and then checks whether each target task is in the current process or in another one. If it is in the current process, the message can be passed via intra-process communication (for example, by putting it directly into the target task's execution queue in this process); if it is cross-process, JStorm uses Netty to send the message to the target task.
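A schematic sketch of that routing decision (purely illustrative pseudologic; none of these names are JStorm's actual internals):

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;

// Illustrative only: how the dispatch described above could be structured.
class MessageRouter {
    private final Map<Integer, BlockingQueue<Object>> localTaskQueues; // tasks in this worker
    private final RemoteSender remoteSender;                           // wraps a Netty client

    MessageRouter(Map<Integer, BlockingQueue<Object>> localTaskQueues, RemoteSender remoteSender) {
        this.localTaskQueues = localTaskQueues;
        this.remoteSender = remoteSender;
    }

    void emit(Object message, List<Integer> targetTaskIds) throws InterruptedException {
        for (int taskId : targetTaskIds) {
            BlockingQueue<Object> queue = localTaskQueues.get(taskId);
            if (queue != null) {
                queue.put(message);                 // same process: hand over in memory
            } else {
                remoteSender.send(taskId, message); // other process: go through Netty
            }
        }
    }

    interface RemoteSender {
        void send(int taskId, Object message);
    }
}
```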

Outputting real-time computation results

JStorm runs 7 * 24 hours. If an external system needs to query the processing result at a particular moment, it does not request JStorm directly (DRPC can support this requirement, but its performance is not great). Generally, a spout or bolt in JStorm contains logic that periodically writes computation results to external storage, so that data is persisted in real time or near real time according to business needs, and the results are then queried directly from the external storage.
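As a sketch of that pattern (the batching threshold and the saveBatchToDb method are placeholders for a real external-storage client, not part of the original article):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;

// A terminal bolt that batches results and flushes them to external storage.
public class StoreResultBolt extends BaseRichBolt {
    private OutputCollector collector;
    private List<String> buffer;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        this.buffer = new ArrayList<String>();
    }

    @Override
    public void execute(Tuple input) {
        buffer.add(input.getString(0));
        if (buffer.size() >= 100) {  // flush every 100 records (placeholder policy)
            saveBatchToDb(buffer);   // placeholder for the external-storage write
            buffer.clear();
        }
        collector.ack(input);
    }

    private void saveBatchToDb(List<String> batch) {
        // In a real topology: write to MySQL/HBase/Redis for front-end queries.
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // terminal bolt: nothing emitted downstream
    }
}
```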

(The content above is pasted straight from the JStorm official website, so no complaints, please.)

2. JStorm cluster installation

1. System environment preparation

# OS: CentOS 6.8 minimal
# host.ip: 10.1.1.78 aniutv-1
# host.ip: 10.1.1.80 aniutv-2
# host.ip: 10.1.1.97 aniutv-5

2. Customize the installation directory

jstorm : /opt/jstorm (source installation);

zookeeper : /opt/zookeeper (source code installation);

java : /usr/java/jdk1.7.0_79 (rpm package installation)

3. Zookeeper cluster installation

Zookeeper cluster reference (http://blog.csdn.net/wh211212/article/details/56014983)

4. Zeromq installation

Zeromq download address: http://zeromq.org/area:download/

Download zeromq-4.2.1.tar.gz to /usr/local/src

cd /usr/local/src && tar -zxf zeromq-4.2.1.tar.gz -C /opt

cd /opt/zeromq-4.2.1 && ./configure && make && sudo make install && sudo ldconfig

5. jzmq installation

cd /opt && git clone https://github.com/nathanmarz/jzmq.git

cd /opt/jzmq && ./autogen.sh && ./configure && make && make install

6. JStorm installation

wget https://github.com/alibaba/jstorm/releases/download/2.1.1/jstorm-2.1.1.zip -P /usr/local/src
cd /usr/local/src && unzip jstorm-2.1.1.zip -d /opt
cd /opt && mv jstorm-2.1.1 jstorm
# mkdir /opt/jstorm/jstorm_data
echo '# jstorm env' >> ~/.bashrc
echo 'export JSTORM_HOME=/opt/jstorm' >> ~/.bashrc
echo 'export PATH=$PATH:$JSTORM_HOME/bin' >> ~/.bashrc
source ~/.bashrc

# JStorm configuration

sed -i '/storm.zookeeper.servers:/a\ - "10.1.1.78"' /opt/jstorm/conf/storm.yaml
sed -i '/storm.zookeeper.servers:/a\ - "10.1.1.80"' /opt/jstorm/conf/storm.yaml
sed -i '/storm.zookeeper.servers:/a\ - "10.1.1.97"' /opt/jstorm/conf/storm.yaml
sed -i '/storm.zookeeper.root/a\ nimbus.host: "10.1.1.78"' /opt/jstorm/conf/storm.yaml
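After these edits, the relevant fragment of /opt/jstorm/conf/storm.yaml should look roughly like the following (each sed appends right after the matched line, so the servers end up in reverse order, which is harmless; keep the file's existing one-space indentation):

 storm.zookeeper.servers:
 - "10.1.1.97"
 - "10.1.1.80"
 - "10.1.1.78"

 storm.zookeeper.root: "/jstorm"

 nimbus.host: "10.1.1.78"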

Configuration items:

storm.zookeeper.servers: the addresses of the zookeeper servers;

nimbus.host: the address of nimbus;

storm.zookeeper.root: the root directory of this JStorm in zookeeper. When multiple JStorm clusters share one zookeeper, this option must be set; the default is "/jstorm";

storm.local.dir: the directory where JStorm stores temporary data; make sure the JStorm process has write permission to it;

java.library.path: the installation directories of zeromq and the java zeromq library; the default is "/usr/local/lib:/opt/local/lib:/usr/lib";

supervisor.slots.ports: the list of slot ports the Supervisor provides; take care not to conflict with other ports. The default is 68xx, while Storm's is 67xx;

topology.enable.classloader: false; the classloader is disabled by default. If the application's jars conflict with the jars JStorm depends on (for example, the application uses thrift9 while JStorm uses thrift7), the classloader needs to be enabled. The recommendation is to keep it disabled at the cluster level and enable it only for the specific topologies that need isolation.
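For instance, enabling the classloader for just one topology can be done at submission time (a sketch reusing the illustrative spout from earlier; the config key string follows the option named above):

```java
import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;

public class IsolatedTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentence-spout", new RandomSentenceSpout(), 1);

        Config conf = new Config();
        // Per-topology override: isolate this topology's jars from JStorm's own.
        conf.put("topology.enable.classloader", true);

        StormSubmitter.submitTopology("isolated-topology", conf, builder.createTopology());
    }
}
```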

# The following commands only need to be executed on the machines where the JStorm UI is installed and from which jars are submitted

mkdir ~/.jstorm
cp -f $JSTORM_HOME/conf/storm.yaml ~/.jstorm

7. Install JStorm Web UI

Tomcat 7.0 or above is required. Remember to copy storm.yaml to ~/.jstorm; the Web UI can run on the same node as Nimbus.

mkdir ~/.jstorm
cp -f $JSTORM_HOME/conf/storm.yaml ~/.jstorm
Download tomcat 7.x (apache-tomcat-7.0.75 is used here)
tar -xzf apache-tomcat-7.0.75.tar.gz
cd apache-tomcat-7.0.75
cd webapps
cp $JSTORM_HOME/jstorm-ui-2.1.1.war ./
mv ROOT ROOT.old
ln -s jstorm-ui-2.1.1 ROOT
# Note: link the exploded directory, not the war file (do not use ln -s jstorm-ui-2.1.1.war ROOT)
cd ../bin
./startup.sh

8. Start JStorm

1. On the nimbus node (10.1.1.78), execute "nohup jstorm nimbus &" and check $JSTORM_HOME/logs/nimbus.log for errors.

2. On the supervisor nodes (10.1.1.78, 10.1.1.80, 10.1.1.97), execute "nohup jstorm supervisor &" and check $JSTORM_HOME/logs/supervisor.log for errors.
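To verify the daemons came up, a quick check on each node (assuming the JDK's jps tool is on the PATH; in my experience JStorm's daemons show up under names like NimbusServer and Supervisor, but confirm against the logs):

jps                                    # expect NimbusServer on 10.1.1.78 and Supervisor on each supervisor node
tail -n 50 $JSTORM_HOME/logs/nimbus.log
tail -n 50 $JSTORM_HOME/logs/supervisor.log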

9. JStorm Web UI

The screenshot of the successful start of the JStorm cluster is as follows:

[Screenshot: JStorm Web UI after the cluster starts successfully]

# Summary of JStorm cluster installation problems

1. Pay attention to the /etc/hosts settings: add the corresponding "IP hostname" entries on every node (see the example after this list)

2. Set up passwordless SSH between the nodes (this step was completed during the zookeeper cluster setup)

3. Pay attention to the environment variable settings of each service
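
Based on the environment section above, the /etc/hosts entries on every node would be:

10.1.1.78 aniutv-1
10.1.1.80 aniutv-2
10.1.1.97 aniutv-5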

 

Source: https://blog.csdn.net/yaxuan88521/article/details/131933719