1. Prerequisites
1.1 Environment requirements:
Java 8
Linux, Mac OS X, or another Unix-like operating system (Windows is not supported)
8 GB of RAM
2 vCPUs
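Before proceeding, you can quickly confirm that a suitable JDK is on the PATH:
java -version    # should report a 1.8.x version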
1.2 Download and extract Druid
Extract the package: tar -xzf apache-druid-0.13.0-incubating-bin.tar.gz
Enter the directory: cd apache-druid-0.13.0-incubating
1.3 Download and extract ZooKeeper
Druid relies on Apache ZooKeeper for distributed coordination. From the Druid root directory, download and extract ZooKeeper with the following commands:
curl https://archive.apache.org/dist/zookeeper/zookeeper-3.4.11/zookeeper-3.4.11.tar.gz -o zookeeper-3.4.11.tar.gz
tar -xzf zookeeper-3.4.11.tar.gz
mv zookeeper-3.4.11 zk
1.4 Start and run Druid
From the Druid root directory, run the supervise script:
MacBook-Air-3:apache-druid-0.13.0-incubating g2$ bin/supervise -c quickstart/tutorial/conf/tutorial-cluster.conf
This brings up instances of ZooKeeper and all of the Druid services, running on the local machine, for example:
[Tue Dec 25 16:11:35 2018] Running command[zk], logging to[/Users/g2/myresource/druid/apache-druid-0.13.0-incubating/var/sv/zk.log]: bin/run-zk quickstart/tutorial/conf
[Tue Dec 25 16:11:35 2018] Running command[coordinator], logging to[/Users/g2/myresource/druid/apache-druid-0.13.0-incubating/var/sv/coordinator.log]: bin/run-druid coordinator quickstart/tutorial/conf
[Tue Dec 25 16:11:35 2018] Running command[broker], logging to[/Users/g2/myresource/druid/apache-druid-0.13.0-incubating/var/sv/broker.log]: bin/run-druid broker quickstart/tutorial/conf
[Tue Dec 25 16:11:35 2018] Running command[historical], logging to[/Users/g2/myresource/druid/apache-druid-0.13.0-incubating/var/sv/historical.log]: bin/run-druid historical quickstart/tutorial/conf
[Tue Dec 25 16:11:35 2018] Running command[overlord], logging to[/Users/g2/myresource/druid/apache-druid-0.13.0-incubating/var/sv/overlord.log]: bin/run-druid overlord quickstart/tutorial/conf
[Tue Dec 25 16:11:35 2018] Running command[middleManager], logging to[/Users/g2/myresource/druid/apache-druid-0.13.0-incubating/var/sv/middleManager.log]: bin/run-druid middleManager quickstart/tutorial/conf
All persistent state (such as the cluster metadata store and segments) is saved in the var directory under apache-druid-0.13.0-incubating. Service logs are written to var/sv.
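Once everything is up, you can poke each service's status endpoint to confirm it is healthy (a quick optional check, using the default quickstart ports):
curl http://localhost:8081/status   # coordinator
curl http://localhost:8082/status   # broker
curl http://localhost:8090/status   # overlord
If you later want to start over from scratch, stop the supervise process with CTRL-C and delete the var directory before restarting.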
2. Loading data
The sample data is located at quickstart/tutorial/wikiticker-2015-09-12-sampled.json.gz under the Druid package root. It contains Wikipedia page edit events stored as JSON objects in a text file, one event per line.
The sample data contains the following columns; an example event looks like this:
{
  "time": "2015-09-12T00:47:47.870Z",
  "channel": "#vi.wikipedia",
  "cityName": null,
  "comment": "clean up using [[Project:AWB|AWB]]",
  "countryIsoCode": null,
  "countryName": null,
  "isAnonymous": false,
  "isMinor": false,
  "isNew": false,
  "isRobot": true,
  "isUnpatrolled": false,
  "metroCode": null,
  "namespace": "Main",
  "page": "Atractus duboisi",
  "regionIsoCode": null,
  "regionName": null,
  "user": "ThitxongkhoiAWB",
  "delta": 18,
  "added": 18,
  "deleted": 0
}
2.1 Loading data from a Kafka stream
2.1.1 Kafka operations
(1) Download Kafka
curl -O https://archive.apache.org/dist/kafka/0.10.2.0/kafka_2.11-0.10.2.0.tgz
tar -xzf kafka_2.11-0.10.2.0.tgz
cd kafka_2.11-0.10.2.0
(2) Start Kafka (the broker connects to the ZooKeeper instance on localhost:2181 that the Druid supervise script started in section 1.4)
MacBook-Air-3:kafka_2.11-0.10.2.0 g2$ ./bin/kafka-server-start.sh config/server.properties
(3) List topics
MacBook-Air-3:kafka_2.11-0.10.2.0 g2$ ./bin/kafka-topics.sh --zookeeper localhost:2181 --list
(4) Create the wikipedia topic
MacBook-Air-3:kafka_2.11-0.10.2.0 g2$ ./bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic wikipedia
(5) Delete the topic (if you need to start over)
MacBook-Air-3:kafka_2.11-0.10.2.0 g2$ ./bin/kafka-topics.sh --zookeeper localhost:2181 --delete --topic wikipedia
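Note that on Kafka 0.10.x the delete command only marks a topic for deletion unless delete.topic.enable=true is set in config/server.properties. To double-check a topic's partition and replication settings, you can describe it:
MacBook-Air-3:kafka_2.11-0.10.2.0 g2$ ./bin/kafka-topics.sh --zookeeper localhost:2181 --describe --topic wikipedia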
2.1.2 Druid operations
Once data is flowing into Kafka, you need to define a datasource spec file (a supervisor spec) that tells Druid how to ingest the data: it declares the timestamp format, the dimension columns, the metric columns, the pre-aggregation (rollup) granularity, and so on.
(1) Start the Kafka ingestion supervisor in Druid
MacBook-Air-3:apache-druid-0.13.0-incubating g2$ curl -XPOST -H'Content-Type: application/json' -d @quickstart/tutorial/wikipedia-kafka-supervisor.json http://localhost:8090/druid/indexer/v1/supervisor
Note: if the supervisor is created successfully, the response contains the supervisor ID; in our case we should see {"id": "wikipedia"}.
The supervisor spec (which defines the datasource), wikipedia-kafka-supervisor.json:
{
  "type": "kafka",
  "dataSchema": {
    "dataSource": "wikipedia",
    "parser": {
      "type": "string",
      "parseSpec": {
        "format": "json",
        "timestampSpec": {
          "column": "time",
          "format": "auto"
        },
        "dimensionsSpec": {
          "dimensions": [
            "channel",
            "cityName",
            "comment",
            "countryIsoCode",
            "countryName",
            "isAnonymous",
            "isMinor",
            "isNew",
            "isRobot",
            "isUnpatrolled",
            "metroCode",
            "namespace",
            "page",
            "regionIsoCode",
            "regionName",
            "user",
            { "name": "added", "type": "long" },
            { "name": "deleted", "type": "long" },
            { "name": "delta", "type": "long" }
          ]
        }
      }
    },
    "metricsSpec" : [],
    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "DAY",
      "queryGranularity": "NONE",
      "rollup": false
    }
  },
  "tuningConfig": {
    "type": "kafka",
    "reportParseExceptions": false
  },
  "ioConfig": {
    "topic": "wikipedia",
    "replicas": 2,
    "taskDuration": "PT10M",
    "completionTimeout": "PT20M",
    "consumerProperties": {
      "bootstrap.servers": "localhost:9092"
    }
  }
}
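After submitting the spec, you can optionally ask the overlord for the supervisor's status to confirm that ingestion tasks are being scheduled (the overlord listens on port 8090 in this setup):
MacBook-Air-3:apache-druid-0.13.0-incubating g2$ curl http://localhost:8090/druid/indexer/v1/supervisor/wikipedia/status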
(2) Extract wikiticker-2015-09-12-sampled.json.gz
MacBook-Air-3:tutorial g2$ pwd
/Users/g2/myresource/druid/apache-druid-0.13.0-incubating/quickstart/tutorial
MacBook-Air-3:tutorial g2$ gunzip -k wikiticker-2015-09-12-sampled.json.gz
A sample event:
{
  "time": "2015-09-12T00:48:02.596Z",
  "channel": "#es.wikipedia",
  "cityName": "Mexico City",
  "comment": "Cambio en la redacción del texto y correción en sintaxis",
  "countryIsoCode": "MX",
  "countryName": "Mexico",
  "isAnonymous": true,
  "isMinor": false,
  "isNew": false,
  "isRobot": false,
  "isUnpatrolled": false,
  "metroCode": null,
  "namespace": "Main",
  "page": "Mathis Bolly",
  "regionIsoCode": "DIF",
  "regionName": "Mexico City",
  "user": "189.217.75.123",
  "delta": -67,
  "added": 0,
  "deleted": 67
}
(3) From the Kafka directory, run the following commands to post the sample events to the wikipedia topic
MacBook-Air-3:kafka_2.11-0.10.2.0 g2$ export KAFKA_OPTS="-Dfile.encoding=UTF-8"
MacBook-Air-3:kafka_2.11-0.10.2.0 g2$ ./bin/kafka-console-producer.sh --broker-list localhost:9092 --topic wikipedia < /Users/g2/myresource/druid/apache-druid-0.13.0-incubating/quickstart/tutorial/wikiticker-2015-09-12-sampled.json
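To confirm the events actually landed on the topic, you can optionally read a few of them back with the console consumer:
MacBook-Air-3:kafka_2.11-0.10.2.0 g2$ ./bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic wikipedia --from-beginning --max-messages 5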
The data has now been written to Kafka; next we can query it from Druid.
2.2 Loading data from a file
2.2.1 Prepare the data and the ingestion task spec, wikipedia-index.json:
{
  "type" : "index",
  "spec" : {
    "dataSchema" : {
      "dataSource" : "wikipedia",
      "parser" : {
        "type" : "string",
        "parseSpec" : {
          "format" : "json",
          "dimensionsSpec" : {
            "dimensions" : [
              "channel",
              "cityName",
              "comment",
              "countryIsoCode",
              "countryName",
              "isAnonymous",
              "isMinor",
              "isNew",
              "isRobot",
              "isUnpatrolled",
              "metroCode",
              "namespace",
              "page",
              "regionIsoCode",
              "regionName",
              "user",
              { "name": "added", "type": "long" },
              { "name": "deleted", "type": "long" },
              { "name": "delta", "type": "long" }
            ]
          },
          "timestampSpec": {
            "column": "time",
            "format": "iso"
          }
        }
      },
      "metricsSpec" : [],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "day",
        "queryGranularity" : "none",
        "intervals" : ["2015-09-12/2015-09-13"],
        "rollup" : false
      }
    },
    "ioConfig" : {
      "type" : "index",
      "firehose" : {
        "type" : "local",
        "baseDir" : "quickstart/tutorial/",
        "filter" : "wikiticker-2015-09-12-sampled.json.gz"
      },
      "appendToExisting" : false
    },
    "tuningConfig" : {
      "type" : "index",
      "targetPartitionSize" : 5000000,
      "maxRowsInMemory" : 25000,
      "forceExtendableShardSpecs" : true
    }
  }
}
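Two details worth noting: the local firehose's baseDir is a relative path resolved from the directory the Druid services were started in (the package root here), and the spec must be valid JSON. A quick optional syntax check before submitting:
MacBook-Air-3:apache-druid-0.13.0-incubating g2$ python -m json.tool quickstart/tutorial/wikipedia-index.json > /dev/null && echo "spec is valid JSON"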
2.2.2 Load the batch data
MacBook-Air-3:apache-druid-0.13.0-incubating g2$ bin/post-index-task --file quickstart/tutorial/wikipedia-index.json
Beginning indexing data for wikipedia
Task started: index_wikipedia_2018-12-25T10:00:45.744Z
Task log: http://localhost:8090/druid/indexer/v1/task/index_wikipedia_2018-12-25T10:00:45.744Z/log
Task status: http://localhost:8090/druid/indexer/v1/task/index_wikipedia_2018-12-25T10:00:45.744Z/status
Task index_wikipedia_2018-12-25T10:00:45.744Z still running...
Task index_wikipedia_2018-12-25T10:00:45.744Z still running...
Task index_wikipedia_2018-12-25T10:00:45.744Z still running...
Task index_wikipedia_2018-12-25T10:00:45.744Z still running...
Task index_wikipedia_2018-12-25T10:00:45.744Z still running...
Task finished with status: SUCCESS
Completed indexing data for wikipedia. Now loading indexed data onto the cluster...
wikipedia loading complete! You may now query your data
MacBook-Air-3:apache-druid-0.13.0-incubating g2$
The data has now been loaded into Druid; next we can query it.
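You can also confirm that the datasource is available by listing datasources on the coordinator (default port 8081); the returned JSON array should include "wikipedia" once the segments finish loading:
MacBook-Air-3:apache-druid-0.13.0-incubating g2$ curl http://localhost:8081/druid/coordinator/v1/datasources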
3. Querying data
3.1 Native JSON queries
3.1.1 An example query request, wikipedia-top-pages.json:
{
  "queryType" : "topN",
  "dataSource" : "wikipedia",
  "intervals" : ["2015-09-12/2015-09-13"],
  "granularity" : "all",
  "dimension" : "page",
  "metric" : "count",
  "threshold" : 15,
  "aggregations" : [
    {
      "type" : "count",
      "name" : "count"
    }
  ]
}
3.1.2 Run the query:
MacBook-Air-3:apache-druid-0.13.0-incubating g2$ curl -X 'POST' -H 'Content-Type:application/json' -d @quickstart/tutorial/wikipedia-top-pages.json http://localhost:8082/druid/v2?pretty
3.1.3 Result
[ {
  "timestamp" : "2015-09-12T00:46:58.771Z",
  "result" : [ {
    "count" : 33,
    "page" : "Wikipedia:Vandalismusmeldung"
  }, {
    "count" : 28,
    "page" : "User:Cyde/List of candidates for speedy deletion/Subpage"
  }, {
    "count" : 27,
    "page" : "Jeremy Corbyn"
  }, {
    "count" : 21,
    "page" : "Wikipedia:Administrators' noticeboard/Incidents"
  }, {
    "count" : 20,
    "page" : "Flavia Pennetta"
  }, {
    "count" : 18,
    "page" : "Total Drama Presents: The Ridonculous Race"
  }, {
    "count" : 18,
    "page" : "User talk:Dudeperson176123"
  }, {
    "count" : 18,
    "page" : "Wikipédia:Le Bistro/12 septembre 2015"
  }, {
    "count" : 17,
    "page" : "Wikipedia:In the news/Candidates"
  }, {
    "count" : 17,
    "page" : "Wikipedia:Requests for page protection"
  }, {
    "count" : 16,
    "page" : "Utente:Giulio Mainardi/Sandbox"
  }, {
    "count" : 16,
    "page" : "Wikipedia:Administrator intervention against vandalism"
  }, {
    "count" : 15,
    "page" : "Anthony Martial"
  }, {
    "count" : 13,
    "page" : "Template talk:Connected contributor"
  }, {
    "count" : 12,
    "page" : "Chronologie de la Lorraine"
  } ]
} ]
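The query is easy to vary. For example, to rank pages by total lines added rather than by edit count, you could swap in a longSum aggregator over the added column (a sketch; the aggregator name added_total is our own choice):
{
  "queryType" : "topN",
  "dataSource" : "wikipedia",
  "intervals" : ["2015-09-12/2015-09-13"],
  "granularity" : "all",
  "dimension" : "page",
  "metric" : "added_total",
  "threshold" : 15,
  "aggregations" : [
    { "type" : "longSum", "name" : "added_total", "fieldName" : "added" }
  ]
}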
3.2 Druid SQL queries
Druid also supports a SQL dialect for querying. Let's run a SQL query that is equivalent to the native JSON query shown above.
3.2.1 The SQL query wrapped in JSON, wikipedia-top-pages-sql.json:
{
  "query" : "SELECT page, COUNT(*) AS Edits FROM wikipedia WHERE \"__time\" BETWEEN TIMESTAMP '2015-09-12 00:00:00' AND TIMESTAMP '2015-09-13 00:00:00' GROUP BY page ORDER BY Edits DESC LIMIT 10"
}
3.2.2 Run the query:
MacBook-Air-3:apache-druid-0.13.0-incubating g2$ curl -X 'POST' -H 'Content-Type:application/json' -d @quickstart/tutorial/wikipedia-top-pages-sql.json http://localhost:8082/druid/v2/sql
3.2.3 Result
[
  {
    "page": "Wikipedia:Vandalismusmeldung",
    "Edits": 33
  },
  {
    "page": "User:Cyde/List of candidates for speedy deletion/Subpage",
    "Edits": 28
  },
  {
    "page": "Jeremy Corbyn",
    "Edits": 27
  },
  {
    "page": "Wikipedia:Administrators' noticeboard/Incidents",
    "Edits": 21
  },
  {
    "page": "Flavia Pennetta",
    "Edits": 20
  },
  {
    "page": "Total Drama Presents: The Ridonculous Race",
    "Edits": 18
  },
  {
    "page": "User talk:Dudeperson176123",
    "Edits": 18
  },
  {
    "page": "Wikipédia:Le Bistro/12 septembre 2015",
    "Edits": 18
  },
  {
    "page": "Wikipedia:In the news/Candidates",
    "Edits": 17
  },
  {
    "page": "Wikipedia:Requests for page protection",
    "Edits": 17
  }
]
3.3 SQL client
For convenience, the Druid package includes a SQL command-line client at bin/dsql under the package root. Run bin/dsql and you should see the following prompt:
MacBook-Air-3:apache-druid-0.13.0-incubating g2$ bin/dsql
Welcome to dsql, the command-line client for Druid SQL.
Type "\h" for help.
dsql>
3.3.1 Example 1:
dsql> SELECT page, COUNT(*) AS Edits FROM wikipedia WHERE "__time" BETWEEN TIMESTAMP '2015-09-12 00:00:00' AND TIMESTAMP '2015-09-13 00:00:00' GROUP BY page ORDER BY Edits DESC LIMIT 10;
┌──────────────────────────────────────────────────────────┬───────┐
│ page │ Edits │
├──────────────────────────────────────────────────────────┼───────┤
│ Wikipedia:Vandalismusmeldung │ 33 │
│ User:Cyde/List of candidates for speedy deletion/Subpage │ 28 │
│ Jeremy Corbyn │ 27 │
│ Wikipedia:Administrators' noticeboard/Incidents │ 21 │
│ Flavia Pennetta │ 20 │
│ Total Drama Presents: The Ridonculous Race │ 18 │
│ User talk:Dudeperson176123 │ 18 │
│ Wikipédia:Le Bistro/12 septembre 2015 │ 18 │
│ Wikipedia:In the news/Candidates │ 17 │
│ Wikipedia:Requests for page protection │ 17 │
└──────────────────────────────────────────────────────────┴───────┘
Retrieved 10 rows in 0.35s.
3.3.2 Example 2: Timeseries
dsql> SELECT FLOOR(__time to HOUR) AS HourTime, SUM(deleted) AS LinesDeleted FROM wikipedia WHERE "__time" BETWEEN TIMESTAMP '2015-09-12 00:00:00' AND TIMESTAMP '2015-09-13 00:00:00' GROUP BY FLOOR(__time to HOUR);
┌──────────────────────────┬──────────────┐
│ HourTime │ LinesDeleted │
├──────────────────────────┼──────────────┤
│ 2015-09-12T00:00:00.000Z │ 1761 │
│ 2015-09-12T01:00:00.000Z │ 16208 │
│ 2015-09-12T02:00:00.000Z │ 14543 │
│ 2015-09-12T03:00:00.000Z │ 13101 │
│ 2015-09-12T04:00:00.000Z │ 12040 │
│ 2015-09-12T05:00:00.000Z │ 6399 │
│ 2015-09-12T06:00:00.000Z │ 9036 │
│ 2015-09-12T07:00:00.000Z │ 11409 │
│ 2015-09-12T08:00:00.000Z │ 11616 │
│ 2015-09-12T09:00:00.000Z │ 17509 │
│ 2015-09-12T10:00:00.000Z │ 19406 │
│ 2015-09-12T11:00:00.000Z │ 16284 │
│ 2015-09-12T12:00:00.000Z │ 18672 │
│ 2015-09-12T13:00:00.000Z │ 30520 │
│ 2015-09-12T14:00:00.000Z │ 18025 │
│ 2015-09-12T15:00:00.000Z │ 26399 │
│ 2015-09-12T16:00:00.000Z │ 24759 │
│ 2015-09-12T17:00:00.000Z │ 19634 │
│ 2015-09-12T18:00:00.000Z │ 17345 │
│ 2015-09-12T19:00:00.000Z │ 19305 │
│ 2015-09-12T20:00:00.000Z │ 22265 │
│ 2015-09-12T21:00:00.000Z │ 16394 │
│ 2015-09-12T22:00:00.000Z │ 16379 │
│ 2015-09-12T23:00:00.000Z │ 15289 │
└──────────────────────────┴──────────────┘
Retrieved 24 rows in 0.25s.
dsql>
3.3.3 GroupBy
dsql> SELECT channel, SUM(added) FROM wikipedia WHERE "__time" BETWEEN TIMESTAMP '2015-09-12 00:00:00' AND TIMESTAMP '2015-09-13 00:00:00' GROUP BY channel ORDER BY SUM(added) DESC LIMIT 5;
┌───────────────┬─────────┐
│ channel │ EXPR$1 │
├───────────────┼─────────┤
│ #en.wikipedia │ 3045299 │
│ #it.wikipedia │ 711011 │
│ #fr.wikipedia │ 642555 │
│ #ru.wikipedia │ 640698 │
│ #es.wikipedia │ 634670 │
└───────────────┴─────────┘
Retrieved 5 rows in 0.13s.
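The second column is labeled EXPR$1 because the query never gives SUM(added) an alias; adding one names the column. For example:
dsql> SELECT channel, SUM(added) AS Added FROM wikipedia WHERE "__time" BETWEEN TIMESTAMP '2015-09-12 00:00:00' AND TIMESTAMP '2015-09-13 00:00:00' GROUP BY channel ORDER BY Added DESC LIMIT 5;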
3.3.4 EXPLAIN PLAN FOR
Prefixing a query with EXPLAIN PLAN FOR shows the native Druid query it will be translated into; here the SQL above compiles to a topN query much like the one we wrote by hand in section 3.1.
dsql> EXPLAIN PLAN FOR SELECT page, COUNT(*) AS Edits FROM wikipedia WHERE "__time" BETWEEN TIMESTAMP '2015-09-12 00:00:00' AND TIMESTAMP '2015-09-13 00:00:00' GROUP BY page ORDER BY Edits DESC LIMIT 10;
┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ PLAN │
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ DruidQueryRel(query=[{"queryType":"topN","dataSource":{"type":"table","name":"wikipedia"},"virtualColumns":[],"dimension":{"type":"default","dimension":"page","outputName":"d0","outputType":"STRING"},"metric":{"type":"numeric","metric":"a0"},"threshold":10,"intervals":{"type":"intervals","intervals":["2015-09-12T00:00:00.000Z/2015-09-13T00:00:00.001Z"]},"filter":null,"granularity":{"type":"all"},"aggregations":[{"type":"count","name":"a0"}],"postAggregations":[],"context":{},"descending":false}], signature=[{d0:STRING, a0:LONG}]) │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Retrieved 1 row in 0.09s.