Storm SQL integration allows users to run SQL queries over Storm streaming data. In stream analysis, SQL interfaces not only speed up the development cycle, but also open up opportunities to unify batch processing (Apache Hive) with real-time streaming data processing.
StormSQL compiles standard SQL queries into Trident topologies and runs them on a Storm cluster. This article shows users how to use StormSQL. Readers interested in the details of the design and implementation of StormSQL can refer here.
Usage
StormSQL provides the storm sql
command, which compiles SQL statements into a Trident topology and submits it to the Storm cluster.
```
$ storm sql <sql-file> <topo-name>
```
Here sql-file
is the file containing the SQL statements to be executed, and topo-name
is the name under which the topology is submitted.
Supported Features
The current version (1.0.1) supports the following features:
- Streaming reads from and writes to external data sources
- Filtering of tuples
- Projections
Specifying external data sources
Data in StormSQL is represented as external tables; users can specify a data source with the CREATE EXTERNAL TABLE
statement. The syntax of CREATE EXTERNAL TABLE
strictly follows the definition in the Hive Data Definition Language.
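A sketch of the statement's grammar, following the Hive-style DDL (bracketed clauses are optional; the clause names match those used later in this article):

```sql
CREATE EXTERNAL TABLE table_name field_list
    [ STORED AS
      INPUTFORMAT input_format_classname
      OUTPUTFORMAT output_format_classname
    ]
    LOCATION location
    [ TBLPROPERTIES tbl_properties ]
    [ AS select_stmt ]
```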
A detailed explanation of each property can be found in the Hive Data Definition Language documentation. For example, the following statement specifies a Kafka spout and sink:
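A sketch of such a statement, assuming a ZooKeeper at localhost:2181 and a topic named test (the table name, URI, and producer properties are illustrative):

```sql
CREATE EXTERNAL TABLE FOO (ID INT PRIMARY KEY)
    LOCATION 'kafka://localhost:2181/brokers?topic=test'
    TBLPROPERTIES '{"producer":{"bootstrap.servers":"localhost:9092","acks":"1","key.serializer":"org.apache.kafka.common.serialization.StringSerializer","value.serializer":"org.apache.kafka.common.serialization.ByteArraySerializer"}}'
```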
Plugging in external data sources
To plug in an external data source, users implement the ISqlTridentDataSource
interface and register the implementation through Java's service loading mechanism; the external data source is then selected based on the scheme of the URI in the table definition. Please refer to the implementation of storm-sql-kafka
for more details.
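As an illustration of the service loading mechanism, a data source module ships a provider class and registers it in a service descriptor on the classpath; the interface and class names below reflect the layout of the storm-sql-kafka module and should be treated as illustrative:

```
# src/main/resources/META-INF/services/org.apache.storm.sql.runtime.DataSourcesProvider
org.apache.storm.sql.kafka.KafkaDataSourcesProvider
```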
Example: filtering a Kafka stream
Suppose there is a Kafka stream carrying transaction orders. Each message in the stream contains the id of the order, the unit price of the product, and the quantity ordered. The goal is to filter out the orders with a large transaction amount and insert them into another Kafka stream for further analysis.
Users can specify SQL statements like the following in an SQL file:
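A sketch of the three statements, reconstructed from the description that follows (the producer properties are illustrative):

```sql
CREATE EXTERNAL TABLE ORDERS (ID INT PRIMARY KEY, UNIT_PRICE INT, QUANTITY INT)
    LOCATION 'kafka://localhost:2181/brokers?topic=orders'
    TBLPROPERTIES '{"producer":{"bootstrap.servers":"localhost:9092","acks":"1","key.serializer":"org.apache.kafka.common.serialization.StringSerializer","value.serializer":"org.apache.kafka.common.serialization.ByteArraySerializer"}}'

CREATE EXTERNAL TABLE LARGE_ORDERS (ID INT PRIMARY KEY, TOTAL INT)
    LOCATION 'kafka://localhost:2181/brokers?topic=large_orders'
    TBLPROPERTIES '{"producer":{"bootstrap.servers":"localhost:9092","acks":"1","key.serializer":"org.apache.kafka.common.serialization.StringSerializer","value.serializer":"org.apache.kafka.common.serialization.ByteArraySerializer"}}'

INSERT INTO LARGE_ORDERS
    SELECT ID, UNIT_PRICE * QUANTITY AS TOTAL FROM ORDERS WHERE UNIT_PRICE * QUANTITY > 50
```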
The first statement defines the table ORDERS
, which represents the input stream. The LOCATION
clause specifies the ZooKeeper address (localhost:2181
), the path of the brokers in ZooKeeper (/brokers
), and the topic (orders
). The TBLPROPERTIES
clause specifies the configuration of the KafkaProducer.
The current implementation of storm-sql-kafka
requires both the LOCATION
and TBLPROPERTIES
clauses to be specified, even when the table is read-only or write-only.
Similarly, the second statement defines the table LARGE_ORDERS
, which represents the output stream. The third statement, a SELECT
, defines the topology: it makes StormSQL filter all orders in the external table ORDERS
(translator's note: keeping those whose total price exceeds 50), compute the total price, and insert the matching records into the LARGE_ORDERS
Kafka stream.
To run this example, users need to include the data source (storm-sql-kafka
in this example) and its dependencies in the classpath. One approach is to put the required jars into the extlib
directory:
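One possible sequence, assuming the jars have already been downloaded into the current directory (the exact jar names and versions depend on the Storm and Kafka releases in use):

```
$ cp storm-sql-runtime-*.jar storm-sql-kafka-*.jar storm-kafka-*.jar extlib/
$ cp kafka-clients-*.jar curator-client-*.jar curator-framework-*.jar zookeeper-*.jar extlib/
```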
Then submit SQL statements to StormSQL:
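Assuming the statements above were saved in a file named order_filtering.sql (the filename is illustrative), the submission looks like:

```
$ storm sql order_filtering.sql order_filtering
```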
You should now be able to see the order_filtering
topology in the Storm UI.
Current limitations
Aggregations, windowing, and joins have not been implemented yet. The topology does not support configuring the degree of parallelism; the parallelism of all processing tasks is 1.
Users also need to provide the dependencies of the external data sources in the extlib
directory; otherwise the topology will fail to run with a ClassNotFoundException
.
The Kafka connector currently implemented in StormSQL assumes that both the input and output data are in JSON format. The connector does not yet support INPUTFORMAT
and OUTPUTFORMAT
.
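For instance, with the ORDERS schema from the example above, each message on the input topic would be a JSON object such as (the values are illustrative):

```
{"ID": 1, "UNIT_PRICE": 20, "QUANTITY": 3}
```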
Original: Storm SQL Integration