Trident in Storm 1.0.6

http://storm.apache.org/releases/1.0.6/Trident-tutorial.html
Trident is Storm's high-level abstraction for real-time computation. It seamlessly combines high-throughput, stateful stream processing with low-latency distributed queries. Trident operations include joins, aggregations, grouping, functions, and filters.
Take counting the words in a stream as an example.
To demonstrate stream processing, first create a spout that emits an endless stream of test data:

FixedBatchSpout spout = new FixedBatchSpout(new Fields("sentence"), 3,
               new Values("the cow jumped over the moon"),
               new Values("the man went to the store and bought some candy"),
               new Values("four score and seven years ago"),
               new Values("how many apples can you eat"));
spout.setCycle(true); // cycle through the sentences, emitting the stream forever
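FixedBatchSpout is a testing spout shipped with Trident: with setCycle(true) it replays the same sentences forever, three tuples per batch (the second constructor argument). Its cycling behavior can be sketched in plain Java; the class and method below are hypothetical stand-ins for illustration, not Storm's actual implementation:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch of how a cycling fixed-batch spout hands out data.
// Illustration only -- not Storm's actual FixedBatchSpout code.
public class CyclingBatchDemo {
    static final List<String> SENTENCES = Arrays.asList(
            "the cow jumped over the moon",
            "the man went to the store and bought some candy",
            "four score and seven years ago",
            "how many apples can you eat");

    // Return batch number `batchIndex` of `batchSize` sentences,
    // wrapping around the fixed list forever.
    static List<String> nextBatch(int batchIndex, int batchSize) {
        List<String> batch = new ArrayList<>();
        for (int i = 0; i < batchSize; i++) {
            int pos = (batchIndex * batchSize + i) % SENTENCES.size();
            batch.add(SENTENCES.get(pos));
        }
        return batch;
    }

    public static void main(String[] args) {
        for (int b = 0; b < 3; b++) {
            System.out.println("batch " + b + ": " + nextBatch(b, 3));
        }
    }
}
```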

Then build a TridentTopology to consume the data produced by the spout:

TridentTopology topology = new TridentTopology();        
TridentState wordCounts =
     topology.newStream("spout1", spout)
       .each(new Fields("sentence"), new Split(), new Fields("word"))
       .groupBy(new Fields("word"))
       .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"))                
       .parallelismHint(6);

Each step above is a stream operation. newStream reads from the input source and produces a stream; the source can be Kestrel or Kafka. Trident stores metadata in Zookeeper recording which data has already been processed, and the name "spout1" here identifies that metadata node. The stream is processed as small batches of tuples, and Trident provides a rich API for working with these batches.
each() applies the Split function to every tuple in the stream: it takes the "sentence" field, splits it into words, and emits one tuple per word in a new field named "word".

public class Split extends BaseFunction {
   public void execute(TridentTuple tuple, TridentCollector collector) {
       String sentence = tuple.getString(0);      // the "sentence" field
       for(String word: sentence.split(" ")) {
           collector.emit(new Values(word));      // one output tuple per word
       }
   }
}

Next, count the words and persist the result: first group the stream by the "word" field, then aggregate each group with Count(). persistentAggregate stores the running counts in the configured state source (here an in-memory map). Trident is highly fault-tolerant and guarantees exactly-once processing semantics.
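What the groupBy("word") + Count() pipeline computes per word can be simulated with ordinary Java collections. The sketch below is only a single-process approximation under the assumption of in-order delivery; real Trident shards the work across tasks (parallelismHint(6)), processes tuples in batches, and keeps the counts in the configured state (MemoryMapState in the example above):

```java
import java.util.HashMap;
import java.util.Map;

// Single-process approximation of groupBy("word") + Count().
// Illustration only -- no batching, sharding, or fault tolerance.
public class WordCountDemo {
    static Map<String, Long> countWords(String... sentences) {
        Map<String, Long> counts = new HashMap<>();
        for (String sentence : sentences) {
            for (String word : sentence.split(" ")) {  // same logic as Split
                counts.merge(word, 1L, Long::sum);     // same logic as Count
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = countWords(
                "the cow jumped over the moon",
                "the man went to the store and bought some candy",
                "four score and seven years ago");
        System.out.println("count for 'the': " + counts.get("the"));
    }
}
```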


Reposted from blog.csdn.net/weixin_42628594/article/details/83822352