Storm SQL Integration

The Storm SQL integration allows users to run SQL queries over streaming data in Storm. A SQL interface not only speeds up the development cycle for stream analytics, it also opens up opportunities to unify batch processing (Apache Hive) and real-time stream processing.

StormSQL compiles SQL queries into Trident topologies and runs them on the Storm cluster. This article describes how to use StormSQL. If you are interested in the details of the design and implementation of StormSQL, please refer to the StormSQL design documentation.

Usage

The storm sql command compiles SQL statements into a Trident topology and submits it to the Storm cluster.

$ bin/storm sql <sql-file> <topo-name>

Here sql-file contains the SQL statements to be executed, and topo-name is the name of the topology to submit.

Supported Features

The current version of the library (1.0.1) supports the following features:

  • Streaming reads from and writes to external data sources
  • Filtering tuples
  • Projections

Specifying External Data Sources

StormSQL represents data in the form of external tables. Users can specify data sources with the CREATE EXTERNAL TABLE statement. The syntax of CREATE EXTERNAL TABLE closely follows the one defined in the Hive Data Definition Language.

CREATE EXTERNAL TABLE table_name field_list
[ STORED AS
INPUTFORMAT input_format_classname
OUTPUTFORMAT output_format_classname
]
LOCATION location
[ TBLPROPERTIES tbl_properties ]
[ AS select_stmt ]

You can find a detailed explanation of each property in the Hive Data Definition Language. For example, the following statement specifies a Kafka spout and sink:

CREATE EXTERNAL TABLE FOO (ID INT PRIMARY KEY) LOCATION 'kafka://localhost:2181/brokers?topic=test' TBLPROPERTIES '{"producer":{"bootstrap.servers":"localhost:9092","acks":"1","key.serializer":"org.apache.storm.kafka.IntSerializer","value.serializer":"org.apache.storm.kafka.ByteBufferSerializer"}}'

Plugging in External Data Sources

To plug in an external data source, users implement the ISqlTridentDataSource interface and register the implementation using Java's service loading mechanism; the external data source is then selected based on the scheme of the URI in the table definition. Please refer to storm-sql-kafka for more implementation details.
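
To make the service-loading mechanism concrete, below is a minimal sketch in plain Java of the general pattern. The names DataSourceProvider and DataSourceRegistry are hypothetical stand-ins for illustration only; they are not the actual storm-sql-runtime API, whose real contract (ISqlTridentDataSource and its provider) can be found in storm-sql-runtime and storm-sql-kafka. The sketch only shows how ServiceLoader discovers implementations listed under META-INF/services and how the scheme of the table's LOCATION URI selects one of them.

import java.net.URI;
import java.util.ServiceLoader;

// Hypothetical provider interface: each data source declares which URI scheme it
// handles and knows how to build the corresponding spout/sink for a table.
interface DataSourceProvider {
    String scheme();                      // e.g. "kafka"
    Object construct(URI tableLocation);  // build the data source for that table's LOCATION
}

final class DataSourceRegistry {
    // Discover every provider listed in a META-INF/services entry on the classpath
    // and pick the one whose scheme matches the table's LOCATION URI.
    static DataSourceProvider forLocation(URI tableLocation) {
        for (DataSourceProvider provider : ServiceLoader.load(DataSourceProvider.class)) {
            if (provider.scheme().equals(tableLocation.getScheme())) {
                return provider;
            }
        }
        throw new IllegalArgumentException(
            "No data source registered for scheme: " + tableLocation.getScheme());
    }
}

Registration itself is just a text file on the classpath: a file named after the fully qualified provider interface under META-INF/services, containing the fully qualified name of each implementation class. This is also why the storm-sql-kafka jar and its dependencies must be visible to the topology at runtime (see the extlib step below).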

Example: Filtering a Kafka Stream

Suppose there is a Kafka stream that stores orders of transactions. Each message in the stream contains the id of the order, the unit price of the product, and the quantity of the order. The goal is to filter out the orders with a large total amount and insert them into another Kafka stream for further analysis.

The user can specify the following SQL statements in a SQL file:

CREATE EXTERNAL TABLE ORDERS (ID INT PRIMARY KEY, UNIT_PRICE INT, QUANTITY INT) LOCATION 'kafka://localhost:2181/brokers?topic=orders' TBLPROPERTIES '{"producer":{"bootstrap.servers":"localhost:9092","acks":"1","key.serializer":"org.apache.storm.kafka.IntSerializer","value.serializer":"org.apache.storm.kafka.ByteBufferSerializer"}}'
CREATE EXTERNAL TABLE LARGE_ORDERS (ID INT PRIMARY KEY, TOTAL INT) LOCATION 'kafka://localhost:2181/brokers?topic=large_orders' TBLPROPERTIES '{"producer":{"bootstrap.servers":"localhost:9092","acks":"1","key.serializer":"org.apache.storm.kafka.IntSerializer","value.serializer":"org.apache.storm.kafka.ByteBufferSerializer"}}'
INSERT INTO LARGE_ORDERS SELECT ID, UNIT_PRICE * QUANTITY AS TOTAL FROM ORDERS WHERE UNIT_PRICE * QUANTITY > 50

The first statement defines the table ORDERS, which represents the input stream. The LOCATION clause specifies the ZooKeeper address (localhost:2181), the path of the brokers in ZooKeeper (/brokers), and the topic (orders). The TBLPROPERTIES clause specifies the configuration of the KafkaProducer.
The current implementation of storm-sql-kafka requires both the LOCATION and TBLPROPERTIES clauses to be specified, even when the table is read-only or write-only.

Similarly, the second statement defines the table LARGE_ORDERS, which represents the output stream. The third statement, a SELECT, defines the topology: it makes StormSQL filter the orders in the external table ORDERS, keeping those whose total price is above 50, compute the total price, and insert the matching records into the LARGE_ORDERS Kafka stream.

To run this example, the user needs to include the data source (storm-sql-kafka in this example) and its dependencies on the classpath. One approach is to put the required jars into the extlib directory:

$ cp curator-client-2.5.0.jar curator-framework-2.5.0.jar zookeeper-3.4.6.jar extlib/
$ cp scala-library-2.10.4.jar kafka-clients-0.8.2.1.jar kafka_2.10-0.8.2.1.jar metrics-core-2.2.0.jar extlib/
$ cp json-simple-1.1.1.jar extlib/
$ cp jackson-annotations-2.6.0.jar extlib/
$ cp storm-kafka-*.jar storm-sql-kafka-*.jar storm-sql-runtime-*.jar extlib/

Then submit the SQL statements to StormSQL:

      
      
1
      
      
$ bin/storm sql order_filtering.sql order_filtering

Now you should be able to see the order_filtering topology in the Storm UI.

Current Limitations

Aggregation, windowing, and joins have not been implemented yet. The topology does not support specifying the degree of parallelism; the parallelism of all processing tasks is 1.

Users also need to provide the dependencies of the external data sources in the extlib directory; otherwise the topology will fail to run with a ClassNotFoundException.

The Kafka connector currently implemented in StormSQL assumes that both the input and the output data are in JSON format. The connector does not yet support INPUTFORMAT and OUTPUTFORMAT.
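
For illustration only, and assuming the JSON fields simply mirror the column names defined above (an assumption, not something spelled out in the documentation), a message on the orders topic might look like

{"ID": 1, "UNIT_PRICE": 20, "QUANTITY": 5}

and the corresponding record written to large_orders would then be

{"ID": 1, "TOTAL": 100}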
