Storm 1.2: data flow of the word count topology

The word calculation topology consists of a spout and three downstream bolts, as shown in the figure below:


1. The function of the sentence generation spout
The SentenceSpout class is very simple. It emits a stream of single-valued tuples downstream; the field name is "sentence" and the value is a sentence stored as a string, like this:
{"sentence":"my dog has fleas"}
For simplicity, our data source is a list of static sentences. The spout loops over the list forever, emitting each sentence as a tuple.
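The spout's looping behavior can be sketched in plain Java, independent of Storm (the SentenceCycle class here is purely illustrative, not part of the Storm API):

```java
// Plain-Java sketch (no Storm dependencies) of the spout's looping behavior:
// an index walks the static sentence list and wraps back to zero at the end,
// just as nextTuple() will do in the real SentenceSpout below.
public class SentenceCycle {
    private static final String[] SENTENCES = {
        "my dog has fleas",
        "i like cold beverages",
        "the dog ate my homework"
    };
    private int index = 0;

    // Returns the next sentence, wrapping around when the list is exhausted.
    public String next() {
        String sentence = SENTENCES[index];
        index++;
        if (index >= SENTENCES.length) {
            index = 0;
        }
        return sentence;
    }

    public static void main(String[] args) {
        SentenceCycle cycle = new SentenceCycle();
        for (int i = 0; i < 5; i++) {
            System.out.println(cycle.next());
        }
    }
}
```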


2. The sentence splitting bolt
The SplitSentenceBolt class subscribes to the tuple stream emitted by the sentence spout. Whenever it receives a tuple, the bolt reads the sentence from the "sentence" field, splits it into words, and emits one tuple downstream for each word:
{"word":"my"}
{"word":"dog"}
{"word":"has"}
{"word":"fleas"}


3. Word count bolt
The word count bolt (WordCountBolt) subscribes to the output of the SplitSentenceBolt class and keeps a count of the occurrences of each particular word. Whenever the bolt receives a tuple, it increments the count for the corresponding word by one and emits the word's current count downstream:
{"word":"dog","count":5}


4. Report bolt
The report bolt (ReportBolt) subscribes to the output of the WordCountBolt class and, like WordCountBolt, maintains a table of counts for all words. When a tuple is received, the report bolt updates the count in its table and prints the value to the terminal.

5. Implementing the word count topology
The basic concepts of Storm were introduced earlier, so we are ready to implement a simple application: a Storm topology that we will run in local mode. Storm's local mode simulates a Storm cluster within a single JVM instance, which greatly simplifies development and debugging in a development environment or IDE.
1. Implement SentenceSpout
For simplicity, the implementation of SentenceSpout simulates a data source by repeating a list of static sentences. The spout loops over the list, emitting each sentence downstream as a single-valued tuple. The complete implementation is as follows:
public class SentenceSpout extends BaseRichSpout{
	private SpoutOutputCollector collector;
	private String[] sentences = {
			"my dog has fleas",
			"i like cold beverages",
			"the dog ate my homework",
			"don't have a cow man",
			"i don't think i like fleas"
	};
	private int index = 0;
	
	public void nextTuple() {
		this.collector.emit(new Values(sentences[index]));
		index++;
		if(index >= sentences.length){
			index=0;
		}
		Utils.sleep(1);
	}

	public void open(Map config, TopologyContext context, SpoutOutputCollector collector) {
		this.collector = collector;
	}

	public void declareOutputFields(OutputFieldsDeclarer declarer) {
		declarer.declare(new Fields("sentence"));
	}
}

The BaseRichSpout class is a convenient implementation of the ISpout and IComponent interfaces. It provides default implementations for the methods not used in this example, letting us focus on the methods we need. The declareOutputFields() method is defined in the IComponent interface, which all Storm components (spouts and bolts) must implement. Components use this method to tell Storm which streams they emit and which fields are included in each stream's tuples. In this example, we declare that the spout emits one stream whose tuples contain a single field ("sentence").
The open() method is defined in the ISpout interface; all spout components call this method during initialization. open() receives three parameters: a map containing the Storm configuration, a TopologyContext object that provides information about the components in the topology, and a SpoutOutputCollector object that provides methods for emitting tuples. In this example no extra initialization is required, so the open() implementation simply saves a reference to the SpoutOutputCollector in an instance variable.
The nextTuple() method is the core of every spout implementation. Storm requests tuples by calling this method, and the spout emits them through the output collector. In this example, we emit the sentence at the current index and then increment the index to point to the next sentence.

2. Implement statement splitting bolt
The SplitSentenceBolt class is implemented as follows:
public class SplitSentenceBolt extends BaseRichBolt{
	private OutputCollector collector;

	public void execute(Tuple tuple) {
		String sentence = tuple.getStringByField("sentence");
		String[] words = sentence.split(" ");
		for(String word : words){
			this.collector.emit(new Values(word));
		}
	}

	public void prepare(Map map, TopologyContext topologycontext,
			OutputCollector outputcollector) {
		this.collector=outputcollector;
	}

	public void declareOutputFields(OutputFieldsDeclarer outputfieldsdeclarer) {
		outputfieldsdeclarer.declare(new Fields("word"));
	}
}

The BaseRichBolt class is a convenient implementation of the IComponent and IBolt interfaces. By extending this class we don't have to implement the methods we don't care about in this example and can focus on the functionality we need.
The prepare() method is defined in IBolt, which is similar to the open() method defined in the ISpout interface. This method is called when the bolt is initialized and can be used to prepare resources used by the bolt, such as database connections. Like the SentenceSpout class, the SplitSentenceBolt class has no additional operations during initialization, so the prepare() method just holds a reference to the OutputCollector object.
In the declareOutputFields() method, SplitSentenceBolt declares an output stream, each tuple containing a field "word".
The core functionality of the SplitSentenceBolt class is implemented in the execute() method, which is defined by the IBolt interface. This method is called whenever a tuple is received from the subscribed stream. In this example, the execute() method reads the value of the "sentence" field as a string, then splits it into words, each of which emits a tuple to the subsequent output stream.

3. Implement word count bolt
The implementation of the WordCountBolt class is as follows:
public class WordCountBolt extends BaseRichBolt{
	private OutputCollector collector;
	private HashMap<String, Long> counts = null;

	public void prepare(Map map, TopologyContext topologycontext,
			OutputCollector outputcollector) {
		this.collector=outputcollector;
		this.counts = new HashMap<String, Long>();
	}

	public void execute(Tuple tuple) {
		String word = tuple.getStringByField("word");
		Long count = this.counts.get(word);
		if(count == null){
			count = 0L;
		}
		count++;
		this.counts.put(word, count);
		this.collector.emit(new Values(word,count));
	}

	public void declareOutputFields(OutputFieldsDeclarer outputfieldsdeclarer) {
		outputfieldsdeclarer.declare(new Fields("word","count"));
	}
}

The WordCountBolt class is the component in the topology that actually does the word counting. In the bolt's prepare() method, a HashMap<String, Long> instance is created to store words and their counts. Most instance variables should be instantiated in the prepare() method, a design pattern dictated by how topologies are deployed: when a topology is submitted, all bolt and spout components are serialized and sent over the network to the cluster. If a spout or bolt instantiates a non-serializable instance variable before serialization (for example, in its constructor), a NotSerializableException is thrown during serialization and the topology fails to deploy. In this case HashMap<String, Long> is serializable, so it would also be safe to create it in the constructor, but the usual practice is to assign primitive types and serializable objects in the constructor and to instantiate non-serializable objects in the prepare() method.
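The serialization hazard can be demonstrated with plain Java and no Storm at all; BadBolt, GoodBolt, and FakeConnection below are made-up stand-ins for a component holding a resource such as a database connection:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Illustrates why non-serializable fields should be created in prepare()
// rather than in the constructor: components are serialized before being
// shipped to the cluster, and a non-serializable field breaks that step.
public class SerializationDemo {
    // Stand-in for a non-serializable resource, e.g. a database connection.
    static class FakeConnection { }

    static class BadBolt implements Serializable {
        // Instantiated eagerly: serializing BadBolt will fail.
        private FakeConnection conn = new FakeConnection();
    }

    static class GoodBolt implements Serializable {
        // Marked transient and created later (as prepare() would do),
        // so serialization succeeds.
        private transient FakeConnection conn;
        void prepare() { this.conn = new FakeConnection(); }
    }

    // Returns true if the object survives Java serialization.
    static boolean serializes(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (IOException e) {
            // NotSerializableException is a subclass of IOException
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("BadBolt serializes:  " + serializes(new BadBolt()));
        System.out.println("GoodBolt serializes: " + serializes(new GoodBolt()));
    }
}
```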
In the declareOutputFields() method, the class WordCountBolt declares an output stream where the tuple contains words and their corresponding counts.
In the execute() method, when a word is received the bolt first looks up the count for that word (initializing it to 0 if the word has not been seen before), increments and stores the count, and then emits the word and its latest count downstream as a tuple. Because the counts are emitted as a stream, other bolts in the topology can subscribe to them for further processing.

4. Implement the report bolt
The ReportBolt class is implemented as follows:
public class ReportBolt extends BaseRichBolt{
	private HashMap<String, Long> counts = null;
	
	public void prepare(Map map, TopologyContext topologycontext,
			OutputCollector outputcollector) {
		this.counts = new HashMap<String, Long>();
	}

	public void execute(Tuple tuple) {
		String word = tuple.getStringByField("word");
		Long count = tuple.getLongByField("count");
		this.counts.put(word, count);
	}

	public void declareOutputFields(OutputFieldsDeclarer outputfieldsdeclarer) {
		// this bolt does not emit anything
	}
	
	public void cleanup(){
		List<String> keys = new ArrayList<String>();
		keys.addAll(this.counts.keySet());
		Collections.sort(keys);
		for(String key : keys){
			System.out.println(key+":"+this.counts.get(key));
		}
	}
}

The purpose of the ReportBolt class is to produce a report of the counts of all words. Like WordCountBolt, it uses a HashMap<String, Long> to store words and their counts; here its job is simply to store the counts carried by the tuples it receives from the word count bolt.
The difference between the report bolt and the bolts above is that it sits at the end of the data flow and only consumes tuples. Since it does not emit any stream, its declareOutputFields() method is empty.
The cleanup() method appears here for the first time; it is defined in the IBolt interface. Storm calls this method before terminating a bolt. In this example we use cleanup() to print the final counts when the topology shuts down. Typically, cleanup() is used to release resources held by the bolt, such as open file handles or database connections.
One thing to keep in mind when developing bolts is that IBolt.cleanup() is unreliable: it is not guaranteed to execute when the topology runs on a Storm cluster.

5. Implement the word count topology
The main class ties all the components together:
public class WordCountTopology {
	private static final String SENTENCE_SPOUT_ID = "sentence-spout";
	private static final String SPLIT_BOLT_ID = "split-bolt";
	private static final String COUNT_BOLT_ID = "count-bolt";
	private static final String REPORT_BOLT_ID = "report-bolt";
	private static final String TOPOLOGY_NAME = "word-count-topology";
	
	public static void main(String[] args) throws Exception{
		SentenceSpout spout = new SentenceSpout();
		SplitSentenceBolt splitBolt = new SplitSentenceBolt();
		WordCountBolt countBolt = new WordCountBolt();
		ReportBolt reportBolt = new ReportBolt();
		
		TopologyBuilder builder = new TopologyBuilder();
		
		builder.setSpout(SENTENCE_SPOUT_ID, spout);
		builder.setBolt(SPLIT_BOLT_ID, splitBolt).shuffleGrouping(SENTENCE_SPOUT_ID);
		builder.setBolt(COUNT_BOLT_ID, countBolt).fieldsGrouping(SPLIT_BOLT_ID, new Fields("word"));
		builder.setBolt(REPORT_BOLT_ID, reportBolt).globalGrouping(COUNT_BOLT_ID);
		
		Config config = new Config();
		LocalCluster cluster = new LocalCluster();
		cluster.submitTopology(TOPOLOGY_NAME, config, builder.createTopology());
		
		Utils.sleep(10000); // let the topology run for 10 seconds
		
		cluster.killTopology(TOPOLOGY_NAME);
		cluster.shutdown();
	}
}

In this example, we first define a series of string constants to serve as unique identifiers for the Storm components. The TopologyBuilder class provides a fluent-style API for defining the data flow between topology components. First, register the sentence spout and assign it a unique ID:
builder.setSpout(SENTENCE_SPOUT_ID, spout);
Then register the SplitSentenceBolt, which subscribes to the stream emitted by SentenceSpout:
builder.setBolt(SPLIT_BOLT_ID, splitBolt).shuffleGrouping(SENTENCE_SPOUT_ID);
The setBolt() method of the TopologyBuilder class registers a bolt and returns a BoltDeclarer instance, which is used to define the bolt's data sources. In this example, we pass SentenceSpout's unique ID to the shuffleGrouping() method to establish the subscription. The shuffleGrouping() method tells Storm to distribute the tuples emitted by SentenceSpout evenly and randomly among the SplitSentenceBolt instances. The next line of code establishes the connection between SplitSentenceBolt and WordCountBolt:
builder.setBolt(COUNT_BOLT_ID, countBolt).fieldsGrouping(SPLIT_BOLT_ID, new Fields("word"));
Sometimes it is necessary to route tuples containing certain values to the same bolt instance. Here we use the fieldsGrouping() method of the BoltDeclarer class to ensure that all tuples with the same "word" field value are routed to the same WordCountBolt instance.
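The routing guarantee of a fields grouping can be sketched conceptually in plain Java (this is an illustration of the idea, not Storm's actual implementation; the class and method names are made up):

```java
// Conceptual sketch of a fields grouping: hashing the grouping field's
// value modulo the task count deterministically picks a task, so the same
// word always lands on the same task index.
public class FieldsGroupingSketch {
    // Pick a task for a tuple based on its "word" field value.
    public static int taskFor(String word, int numTasks) {
        return Math.abs(word.hashCode() % numTasks);
    }

    public static void main(String[] args) {
        int tasks = 4;
        String[] words = {"dog", "fleas", "dog", "cow", "dog"};
        for (String w : words) {
            // every occurrence of "dog" prints the same task index
            System.out.println(w + " -> task " + taskFor(w, tasks));
        }
    }
}
```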
The final step in defining the data flow is to route the stream of tuples emitted by WordCountBolt to the ReportBolt class. In this example we want all tuples emitted by WordCountBolt to be routed to the single ReportBolt task, which is exactly what the globalGrouping() method provides:
builder.setBolt(REPORT_BOLT_ID, reportBolt).globalGrouping(COUNT_BOLT_ID);
With all the data flows defined, the last step in building the word count topology is to create it and submit it to the cluster:
Config config = new Config();
LocalCluster cluster = new LocalCluster();
cluster.submitTopology(TOPOLOGY_NAME, config, builder.createTopology());
Utils.sleep(10000);
cluster.killTopology(TOPOLOGY_NAME);
cluster.shutdown();
Here we use Storm's local mode: the LocalCluster class simulates a complete Storm cluster in the local development environment. Local mode is an easy way to develop and test, eliminating the overhead of repeated deployments to a distributed cluster. It also makes it easy to run a Storm topology inside an IDE, set breakpoints, pause execution, inspect variables, and profile the program, all of which can be time-consuming or even impossible once the topology is published to a distributed cluster.
Storm's Config class is a subclass of HashMap<String, Object> and defines some Storm-specific constants and convenience methods for configuring a topology's runtime behavior. When a topology is submitted, Storm merges its default configuration with the configuration in the Config instance passed to the submitTopology() method, and the merged result is distributed to the open() and prepare() methods of every spout and bolt. In this sense the Config object represents a set of configuration parameters that are global to all components of the topology. The WordCountTopology class can now be run: the main() method submits the topology, lets it execute for 10 seconds, stops it, and finally shuts down the local-mode cluster. After the program executes, you will see the final word counts printed to the console.
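The merge behavior can be sketched conceptually in plain Java (ConfigMergeSketch and its merge() helper are illustrative; Storm performs the actual merge internally):

```java
import java.util.HashMap;
import java.util.Map;

// Conceptual sketch of how topology-specific settings override defaults:
// since Storm's Config is essentially a HashMap<String, Object>, the merge
// amounts to "start from the defaults, then overlay the submitted config".
public class ConfigMergeSketch {
    public static Map<String, Object> merge(Map<String, Object> defaults,
                                            Map<String, Object> submitted) {
        Map<String, Object> merged = new HashMap<>(defaults);
        merged.putAll(submitted); // submitted values win over defaults
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Object> defaults = new HashMap<>();
        defaults.put("topology.debug", false);
        defaults.put("topology.workers", 1);

        Map<String, Object> submitted = new HashMap<>();
        submitted.put("topology.debug", true);

        // the merged view keeps untouched defaults and applies overrides
        System.out.println(merge(defaults, submitted));
    }
}
```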

