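I. Key Features of Druid

Columnar storage format
Distributed system: Druid is typically deployed in clusters of tens to hundreds of servers
Massively parallel processing
Real-time or batch ingestion
Self-healing, self-balancing, easy to operate: to scale the cluster out or in, just add or remove servers, and the cluster rebalances itself automatically in the background without any downtime
Fault-tolerant architecture: once Druid has ingested data, a copy is stored safely in deep storage (typically cloud storage, HDFS, or a shared filesystem)
Indexes for fast filtering: bitmap indexes
Time-based partitioning
Automatic rollup (summarization)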

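II. Druid Use Cases

1. Druid is a good fit when

Inserts are frequent and updates are rare
Most queries are aggregations, searches, or scans
Target query latency is 100 milliseconds to a few seconds
The data has a time component
You need fast counting and ranking over high-cardinality columns (such as URL or ID)
Data is loaded from Kafka, HDFS, or object storage

2. Druid is a poor fit when

You need low-latency updates of existing records by primary key
Query latency is not very important
Queries involve "big" joins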

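III. Common Queries

Queries can be written in SQL or as native JSON. This article covers only JSON queries; for more detail, see the official documentation: https://druid.apache.org/docs/latest/design/

Aggregation queries: Timeseries, TopN, GroupBy
Metadata queries: TimeBoundary, SegmentMetadata, DataSourceMetadata
Other queries: Scan, Search

1. timeseries (time series query)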

{
  "queryType": "timeseries",
  "dataSource": "sample_datasource",
  "granularity": "day",
  "descending": true,
  "filter": {
    "type": "and",
    "fields": [
      { "type": "selector", "dimension": "sample_dimension1", "value": "sample_value1" },
      { "type": "or",
        "fields": [
          { "type": "selector", "dimension": "sample_dimension2", "value": "sample_value2" },
          { "type": "selector", "dimension": "sample_dimension3", "value": "sample_value3" }
        ]
      }
    ]
  },
  "aggregations": [
    { "type": "longSum", "name": "sample_name1", "fieldName": "sample_fieldName1" },
    { "type": "doubleSum", "name": "sample_name2", "fieldName": "sample_fieldName2" }
  ],
  "postAggregations": [
    { "type": "arithmetic",
      "name": "sample_divide",
      "fn": "/",
      "fields": [
        { "type": "fieldAccess", "name": "postAgg__sample_name1", "fieldName": "sample_name1" },
        { "type": "fieldAccess", "name": "postAgg__sample_name2", "fieldName": "sample_name2" }
      ]
    }
  ],
  "intervals": [ "2012-01-01T00:00:00.000/2012-01-03T00:00:00.000" ]
}

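Property	Description
queryType	timeseries
dataSource	Name of the datasource to query
descending	Whether to sort results in descending order. Defaults to `false` (ascending)
granularity	Aggregation granularity: `all`, `none`, `second`, `minute`, `fifteen_minute`, `thirty_minute`, `hour`, `day`, `week`, `month`, `quarter`, and `year`
filter	Filter to apply
aggregations	Aggregators to compute
postAggregations	Post-aggregations computed on the aggregated results
intervals	ISO-8601 intervals, given as JSON. These define the time range over which the query runs.

The query above returns 2 data points from the "sample_datasource" table, one for each day between 2012-01-01 and 2012-01-03. Each data point contains the (long) sum of sample_fieldName1, the (double) sum of sample_fieldName2, and the (double) result of sample_fieldName1 divided by sample_fieldName2, all computed over the rows that match the filter.

Grand total: setting "context": { "grandTotal": true } adds an extra "grand total" row as the last row of the timeseries result set.

Zero-filling: timeseries queries normally fill empty interior time buckets with zeros. Setting "context": { "skipEmptyBuckets": "true" } disables this zero-filling.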
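The two context flags above can be attached when the query body is assembled in code. The following is a minimal sketch, not an official client: the helper name `build_timeseries_query` and its parameters are hypothetical, and the single aggregator reuses the sample names from the query above. It only builds and prints the JSON body; sending it to a Druid broker is left out.

```python
import json


def build_timeseries_query(datasource, interval, granularity="day",
                           grand_total=False, skip_empty_buckets=False):
    """Build a native timeseries query body as a plain dict (hypothetical helper)."""
    query = {
        "queryType": "timeseries",
        "dataSource": datasource,
        "granularity": granularity,
        "aggregations": [
            {"type": "longSum", "name": "sample_name1", "fieldName": "sample_fieldName1"},
        ],
        "intervals": [interval],
    }
    # Only add a "context" object when at least one flag is requested.
    context = {}
    if grand_total:
        context["grandTotal"] = True
    if skip_empty_buckets:
        context["skipEmptyBuckets"] = True
    if context:
        query["context"] = context
    return query


query = build_timeseries_query(
    "sample_datasource",
    "2012-01-01T00:00:00.000/2012-01-03T00:00:00.000",
    grand_total=True,
    skip_empty_buckets=True,
)
print(json.dumps(query, indent=2))
```

Keeping the context optional means a query built without either flag is identical to one written by hand without a "context" key.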

2. TopN

{
  "queryType": "topN",
  "dataSource": "sample_data",
  "dimension": "sample_dim",
  "threshold": 5,
  "metric": "count",
  "granularity": "all",
  "filter": {
    "type": "and",
    "fields": [
      {
        "type": "selector",
        "dimension": "dim1",
        "value": "some_value"
      },
      {
        "type": "selector",
        "dimension": "dim2",
        "value": "some_other_val"
      }
    ]
  },
  "aggregations": [
    {
      "type": "longSum",
      "name": "count",
      "fieldName": "count"
    },
    {
      "type": "doubleSum",
      "name": "some_metric",
      "fieldName": "some_metric"
    }
  ],
  "postAggregations": [
    {
      "type": "arithmetic",
      "name": "average",
      "fn": "/",
      "fields": [
        {
          "type": "fieldAccess",
          "name": "some_metric",
          "fieldName": "some_metric"
        },
        {
          "type": "fieldAccess",
          "name": "count",
          "fieldName": "count"
        }
      ]
    }
  ],
  "intervals": [
    "2013-08-31T00:00:00.000/2013-09-03T00:00:00.000"
  ]
}

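Property	Description
queryType	topN
threshold	An integer defining the N in topN (i.e., how many results you want in the top list)
metric	A String or JSON object specifying the metric to sort by for the top list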
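To make the roles of dimension, metric, and threshold concrete, here is a small in-memory sketch of what a topN query computes. It is only an illustration of the semantics, not how Druid executes the query (Druid evaluates topN across distributed segments). The function name `top_n` and the sample rows are made up for this example.

```python
from collections import defaultdict


def top_n(rows, dimension, metric, threshold):
    """Sum `metric` per value of `dimension`, then keep the `threshold` largest."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[dimension]] += row[metric]
    # Rank dimension values by their summed metric, largest first.
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [{dimension: value, metric: total} for value, total in ranked[:threshold]]


# Hypothetical sample rows.
rows = [
    {"sample_dim": "a", "count": 3},
    {"sample_dim": "b", "count": 5},
    {"sample_dim": "a", "count": 4},
    {"sample_dim": "c", "count": 1},
]
top_two = top_n(rows, "sample_dim", "count", threshold=2)
print(top_two)
```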

3. GroupBy (grouping query)

{
  "queryType": "groupBy",
  "dataSource": "sample_datasource",
  "granularity": "day",
  "dimensions": ["country", "device"],
  "limitSpec": { "type": "default", "limit": 5000, "columns": ["country", "data_transfer"] },
  "filter": {
    "type": "and",
    "fields": [
      { "type": "selector", "dimension": "carrier", "value": "AT&T" },
      { "type": "or",
        "fields": [
          { "type": "selector", "dimension": "make", "value": "Apple" },
          { "type": "selector", "dimension": "make", "value": "Samsung" }
        ]
      }
    ]
  },
  "aggregations": [
    { "type": "longSum", "name": "total_usage", "fieldName": "user_count" },
    { "type": "doubleSum", "name": "data_transfer", "fieldName": "data_transfer" }
  ],
  "postAggregations": [
    { "type": "arithmetic",
      "name": "avg_usage",
      "fn": "/",
      "fields": [
        { "type": "fieldAccess", "fieldName": "data_transfer" },
        { "type": "fieldAccess", "fieldName": "total_usage" }
      ]
    }
  ],
  "intervals": [ "2012-01-01T00:00:00.000/2012-01-03T00:00:00.000" ],
  "having": {
    "type": "greaterThan",
    "aggregation": "total_usage",
    "value": 100
  }
}

Subtotals: "subtotalsSpec": [ ["D1", "D2", "D3"], ["D1", "D3"], ["D3"] ] allows multiple sub-groupings to be computed in a single query

{
  "queryType": "groupBy",
  ...
  ...
  "dimensions": [
    {
      "type" : "default",
      "dimension" : "d1col",
      "outputName": "D1"
    },
    {
      "type" : "extraction",
      "dimension" : "d2col",
      "outputName" : "D2",
      "extractionFn" : extraction_func
    },
    {
      "type": "lookup",
      "dimension": "d3col",
      "outputName": "D3",
      "name": "my_lookup"
    }
  ],
  ...
  ...
  "subtotalsSpec": [ ["D1", "D2", "D3"], ["D1", "D3"], ["D3"] ],
  ...
}
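As a rough illustration of what subtotalsSpec asks for, the sketch below groups the same rows once per dimension list and concatenates the results. It is a simplified in-memory model under stated assumptions: the function name `subtotals`, the `value` metric, and the sample rows are hypothetical, and details of Druid's actual output (such as how dimensions left out of a sub-grouping are represented) are not modeled.

```python
from collections import defaultdict


def subtotals(rows, subtotals_spec, metric):
    """Group `rows` once per dimension list in `subtotals_spec`, summing `metric`."""
    results = []
    for dims in subtotals_spec:
        sums = defaultdict(int)
        for row in rows:
            key = tuple(row[d] for d in dims)
            sums[key] += row[metric]
        # Emit one result row per distinct key, in sorted key order.
        for key, total in sorted(sums.items()):
            group = dict(zip(dims, key))
            group[metric] = total
            results.append(group)
    return results


# Hypothetical sample rows using the output names D1 and D3.
rows = [
    {"D1": "x", "D3": "p", "value": 1},
    {"D1": "x", "D3": "q", "value": 2},
    {"D1": "y", "D3": "p", "value": 4},
]
result = subtotals(rows, [["D1", "D3"], ["D3"]], "value")
for group in result:
    print(group)
```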

4. scan (scan query)

{
  "queryType": "scan",
  "dataSource": "wikipedia",
  "resultFormat": "list",
  "columns": [],
  "intervals": [
    "2013-01-01/2013-01-02"
  ],
  "batchSize": 20480,
  "limit": 3
}

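Property	Description
queryType	scan
resultFormat	How results are represented: list, compactedList, or valueVector. Currently only `list` and `compactedList` are supported. Default is `list`
columns	A string array of the dimensions and metrics to scan. If left empty, all dimensions and metrics are returned.
batchSize	Maximum number of rows buffered before being returned to the client. Default is `20480`
limit	How many rows to return. If not specified, all rows are returned.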

5. search (search query)

{
  "queryType": "search",
  "dataSource": "sample_datasource",
  "granularity": "day",
  "searchDimensions": [
    "dim1",
    "dim2"
  ],
  "query": {
    "type": "insensitive_contains",
    "value": "Ke"
  },
  "sort" : {
    "type": "lexicographic"
  },
  "intervals": [
    "2013-01-01T00:00:00.000/2013-01-03T00:00:00.000"
  ]
}

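insensitive_contains: a "match" occurs if any part of a dimension value contains the value specified in this search query spec, regardless of case

{ "type" : "insensitive_contains", "value" : "some_value" }

fragment: a "match" occurs if any part of a dimension value contains all of the values specified in this search query spec, regardless of case by default

{ "type" : "fragment", "case_sensitive" : false, "values" : ["fragment1", "fragment2"] }

contains: a "match" occurs if any part of a dimension value contains the value specified in this search query spec

{ "type" : "contains", "case_sensitive" : true, "value" : "some_value" }

regex: a "match" occurs if any part of a dimension value contains the pattern specified in this search query spec

{ "type" : "regex", "pattern" : "some_pattern" }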
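The matching rules of the four search query specs can be modeled in a few lines. This is a sketch of the semantics described above, not Druid's implementation: the function name `matches` is hypothetical, treating a missing case_sensitive flag as false is an assumption, and Python's `re` module stands in for the regular expression engine Druid actually uses, so pattern syntax may differ in edge cases.

```python
import re


def matches(spec, dimension_value):
    """Return True if `dimension_value` matches the given search query spec."""
    kind = spec["type"]
    # Assumption: a missing "case_sensitive" flag means case-insensitive.
    case_sensitive = spec.get("case_sensitive", False)
    if kind == "insensitive_contains":
        return spec["value"].lower() in dimension_value.lower()
    if kind == "contains":
        if case_sensitive:
            return spec["value"] in dimension_value
        return spec["value"].lower() in dimension_value.lower()
    if kind == "fragment":
        # Every fragment must appear somewhere in the dimension value.
        if case_sensitive:
            return all(v in dimension_value for v in spec["values"])
        lowered = dimension_value.lower()
        return all(v.lower() in lowered for v in spec["values"])
    if kind == "regex":
        # re.search matches the pattern against any part of the value.
        return re.search(spec["pattern"], dimension_value) is not None
    raise ValueError("unsupported search query spec type: " + kind)


print(matches({"type": "insensitive_contains", "value": "Ke"}, "rocket"))
print(matches({"type": "contains", "case_sensitive": True, "value": "Ke"}, "rocket"))
```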

6. timeBoundary (time boundary query)

A time boundary query returns the earliest and latest data points of a dataset

{
  "queryType" : "timeBoundary",
  "dataSource": "sample_datasource",
  "bound" : < "maxTime" | "minTime" >, # optional, defaults to returning both timestamps if not set
  "filter" : { "type": "and", "fields": [<filter>, <filter>, ...] } # optional
}

bound: set to maxTime or minTime to return only the latest or earliest timestamp. If not set, both are returned by default

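7. segmentMetadata (segment metadata query)

A segment metadata query returns per-segment information about:

Cardinality of all columns in the segment
Min/max values of string-type columns in the segment
Estimated byte size of the segment columns if they were stored in a flat format
Number of rows stored in the segment
Interval the segment covers
Column type of all columns in the segment
Estimated total segment byte size (if it were stored in a flat format)
Whether the segment is rolled up
Segment ID

{
  "queryType": "segmentMetadata",
  "dataSource": "sample_datasource",
  "intervals": ["2013-01-01/2014-01-01"]
}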

8. dataSourceMetadata (datasource metadata query)

A datasource metadata query returns metadata information for a datasource

{
  "queryType" : "dataSourceMetadata",
  "dataSource": "sample_datasource"
}

IV. Query Components

1. filter (filters)

(1) Selector filter: selector

"filter": { "type": "selector", "dimension": <dimension_string>, "value": <dimension_value_string> }

This is equivalent to WHERE <dimension_string> = '<dimension_value_string>'

(2) Column comparison filter: columnComparison

"filter": { "type": "columnComparison", "dimensions": [<dimension_a>, <dimension_b>] }

This is equivalent to WHERE <dimension_a> = <dimension_b>

(3) Regular expression filter: regex

"filter": { "type": "regex", "dimension": <dimension_string>, "pattern": <pattern_string> }

(4) Logical expression filters: and, or, not

"filter": { "type": "and", "fields": [<filter>, <filter>, ...] }

"filter": { "type": "or", "fields": [<filter>, <filter>, ...] }

"filter": { "type": "not", "field": <filter> }

(5) JavaScript filter: javascript

{
  "type" : "javascript",
  "dimension" : <dimension_string>,
  "function" : "function(value) { <...> }"
}

(6) Extraction filter: extraction

The extraction filter is no longer recommended. A selector filter with an extraction function specified provides the same functionality.

(7) Search filter: search

Used to filter on partial string matches

{
  "filter": {
    "type": "search",
    "dimension": "product",
    "query": {
      "type": "insensitive_contains",
      "value": "foo"
    }
  }
}

(8) In filter: in

{
  "type": "in",
  "dimension": "outlaw",
  "values": ["Good", "Bad", "Ugly"]
}

(9) Like filter (SQL LIKE-style pattern matching): like

{
  "type": "like",
  "dimension": "last_name",
  "pattern": "D%"
}

(10) Bound filter: bound

The following bound filter expresses the condition 21 <= age <= 31:

{
  "type": "bound",
  "dimension": "age",
  "lower": "21",
  "upper": "31",
  "ordering": "numeric"
}

This filter expresses the condition foo <= name <= hoo, using the default lexicographic sort order

{
  "type": "bound",
  "dimension": "name",
  "lower": "foo",
  "upper": "hoo"
}

Using strict bounds, this filter expresses the condition 21 < age < 31

{
  "type": "bound",
  "dimension": "age",
  "lower": "21",
  "lowerStrict": true,
  "upper": "31",
  "upperStrict": true,
  "ordering": "numeric"
}
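The effect of "ordering", "lowerStrict", and "upperStrict" is easiest to see by evaluating the same value under different specs. The sketch below models the bound filter in plain Python; the function name `bound_matches` is hypothetical, and Python string comparison is used as a stand-in for Druid's lexicographic ordering, which should agree for plain ASCII values like the ones shown.

```python
def bound_matches(spec, value):
    """Return True if `value` (a string dimension value) satisfies the bound spec."""
    numeric = spec.get("ordering", "lexicographic") == "numeric"
    convert = float if numeric else str
    v = convert(value)
    if "lower" in spec:
        lower = convert(spec["lower"])
        # A strict lower bound also rejects values equal to the bound.
        if v < lower or (spec.get("lowerStrict", False) and v == lower):
            return False
    if "upper" in spec:
        upper = convert(spec["upper"])
        if v > upper or (spec.get("upperStrict", False) and v == upper):
            return False
    return True


inclusive = {"type": "bound", "dimension": "age", "lower": "21", "upper": "31",
             "ordering": "numeric"}
strict = dict(inclusive, lowerStrict=True, upperStrict=True)
lexicographic = {"type": "bound", "dimension": "age", "lower": "21", "upper": "31"}

print(bound_matches(inclusive, "21"))      # boundary value, inclusive bounds
print(bound_matches(strict, "21"))         # boundary value, strict bounds
print(bound_matches(inclusive, "3"))       # compared as the number 3
print(bound_matches(lexicographic, "3"))   # compared as the string "3"
```

The last two calls show why "ordering": "numeric" matters for numeric dimensions: as a string, "3" sorts between "21" and "31", even though the number 3 is outside the range.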

(11) Interval filter: interval

{
  "type" : "interval",
  "dimension" : "__time",
  "intervals" : [
    "2014-10-01T00:00:00.000Z/2014-10-07T00:00:00.000Z",
    "2014-11-15T00:00:00.000Z/2014-11-16T00:00:00.000Z"
  ]
}

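2. Query granularity

"granularity": "<query granularity>"

(1) Simple granularities: day, hour, ...

Simple granularities are specified as a string and bucket timestamps by their UTC time (e.g., days starting at 00:00 UTC).

Supported granularity strings are: all, none, second, minute, fifteen_minute, thirty_minute, hour, day, week, month, quarter, and year.

all buckets everything into a single bucket
none does not bucket data (it actually uses the granularity of the index; the minimum here is none, which means millisecond granularity). Using none in a timeseries query is currently not recommended (the system will try to generate 0 values for all milliseconds that did not exist, which is often a lot).

(2) Duration granularity: duration

{"type": "duration", "duration": 3600000, "origin": "2012-01-01T00:30:00Z"}

(3) Period granularity: period

{"type": "period", "period": "P3M", "timeZone": "America/Los_Angeles", "origin": "2012-02-01T00:00:00-08:00"}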
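For the duration granularity, the bucket a timestamp falls into can be worked out with simple arithmetic: count whole durations since the origin and add them back. The sketch below assumes exactly that model (duration in milliseconds, buckets aligned to the origin); the function name `duration_bucket_start` is hypothetical. Period granularities are not covered, since calendar periods and time zones need more than fixed-length arithmetic.

```python
from datetime import datetime, timedelta, timezone


def duration_bucket_start(timestamp, duration_ms, origin):
    """Return the start of the origin-aligned bucket that contains `timestamp`."""
    duration = timedelta(milliseconds=duration_ms)
    # Floor division also handles timestamps earlier than the origin.
    buckets_since_origin = (timestamp - origin) // duration
    return origin + buckets_since_origin * duration


# Values taken from the duration example: one-hour buckets starting at 00:30 UTC.
origin = datetime(2012, 1, 1, 0, 30, tzinfo=timezone.utc)
ts = datetime(2012, 1, 1, 3, 10, tzinfo=timezone.utc)
bucket = duration_bucket_start(ts, 3600000, origin)
print(bucket.isoformat())
```

With a 00:30 origin, hourly buckets run from half past to half past, so 03:10 falls into the bucket starting at 02:30 rather than 03:00.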

3. Query dimensions

https://druid.apache.org/docs/latest/querying/dimensionspecs.html

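4. aggregations (aggregators)

(1) Count (count): { "type" : "count", "name" : <output_name> }

(2) Sum (longSum, doubleSum, floatSum): { "type" : "floatSum", "name" : <output_name>, "fieldName" : <metric_name> }

(3) Min/max aggregators: doubleMin, doubleMax, floatMin, floatMax, longMin, longMax

(4) Arithmetic mean: doubleMean

(5) First/last aggregators: doubleFirst, doubleLast, floatFirst, floatLast, longFirst, longLast, stringFirst, stringLast

(6) Any aggregators: doubleAny, floatAny, longAny, stringAny

(7) Miscellaneous aggregators: filtered aggregator (filtered)

A filtered aggregator wraps any given aggregator, but only aggregates the values for which the given dimension filter matches.

This makes it possible to compute the results of a filtered and an unfiltered aggregation at the same time, without issuing multiple queries, and to use both results as part of post-aggregations.

Note: if only the filtered results are needed, consider putting the filter on the query itself, which will be faster because it does not need to scan all the data.

{
  "type" : "filtered",
  "filter" : {
    "type" : "selector",
    "dimension" : <dimension>,
    "value" : <dimension value>
  },
  "aggregator" : <aggregation>
}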
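The point of the filtered aggregator, computing a filtered and an unfiltered result in one pass, can be sketched as follows. This is an in-memory illustration only; the function name `sum_with_filtered`, the output keys, and the sample rows are hypothetical, and the filter is limited to a selector-style equality check.

```python
def sum_with_filtered(rows, metric, dimension, value):
    """Sum `metric` over all rows and over rows where `dimension` equals `value`."""
    total = 0
    filtered_total = 0
    for row in rows:
        total += row[metric]
        # Selector-style filter: only matching rows feed the wrapped aggregator.
        if row[dimension] == value:
            filtered_total += row[metric]
    return {"total": total, "filtered_total": filtered_total}


# Hypothetical sample rows.
rows = [
    {"country": "US", "clicks": 10},
    {"country": "DE", "clicks": 4},
    {"country": "US", "clicks": 6},
]
sums = sum_with_filtered(rows, "clicks", "country", "US")
print(sums)
```

Both totals come out of a single scan, which is what makes them usable together in a post-aggregation such as a share-of-total ratio.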

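5. postAggregations (post-aggregators)

(1) Arithmetic post-aggregator

The arithmetic post-aggregator applies the provided function to the given fields from left to right. The fields can be aggregators or other post-aggregators. Supported functions are +, -, *, /, and quotient.

/ division always returns 0 when dividing by 0, regardless of the numerator.
quotient division behaves like regular floating point division

An arithmetic post-aggregator can also specify an ordering, which defines the order of the resulting values when sorting results (this is useful for topN queries, for example):

If no ordering (or null) is specified, the default floating point ordering is used.
numericFirst ordering always returns finite values first, followed by NaN, and infinite values last.

postAggregation : {
  "type" : "arithmetic",
  "name" : <output_name>,
  "fn" : <arithmetic_function>,
  "fields": [<post_aggregator>, <post_aggregator>, ...],
  "ordering" : <null (default), or "numericFirst">
}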
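The left-to-right evaluation and the difference between / and quotient can be sketched in a few lines. This models only the rules stated above; the names `divide`, `quotient`, `FUNCTIONS`, and `arithmetic` are hypothetical, and the handling of a zero divisor in `quotient` is an approximation of floating point division written out by hand, because Python raises an error on division by zero instead of returning infinity or NaN.

```python
import math
from functools import reduce


def divide(a, b):
    # "/" returns 0 whenever the divisor is 0, regardless of the numerator.
    return 0.0 if b == 0 else a / b


def quotient(a, b):
    # "quotient" follows regular floating point division, so a zero divisor
    # yields NaN (for 0/0) or a signed infinity instead of 0.
    if b != 0:
        return a / b
    if a == 0 or math.isnan(a):
        return math.nan
    return math.copysign(math.inf, a)


FUNCTIONS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": divide,
    "quotient": quotient,
}


def arithmetic(fn, values):
    """Apply `fn` to the field values from left to right."""
    return reduce(FUNCTIONS[fn], values)


print(arithmetic("-", [10.0, 3.0, 2.0]))     # (10 - 3) - 2
print(arithmetic("/", [10.0, 0.0]))          # division by zero gives 0
print(arithmetic("quotient", [10.0, 0.0]))   # floating point result: infinity
```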

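(2) Field accessor post-aggregators: fieldAccess

These post-aggregators return the value produced by the specified aggregator. fieldName refers to the output name of the aggregator given in the aggregations section of the query. Use type "fieldAccess" to return the raw aggregation object, or type "finalizingFieldAccess" to return a finalized value.

{ "type" : "fieldAccess", "name": <output_name>, "fieldName" : <aggregator_name> }

(3) Constant post-aggregator: constant

{ "type" : "constant", "name" : <output_name>, "value" : <numerical_value> }

(4) Greatest/least post-aggregators: doubleGreatest

doubleGreatest and longGreatest compute the maximum of all fields and Double.NEGATIVE_INFINITY

The difference between the doubleMax aggregator and the doubleGreatest post-aggregator is that doubleMax returns the highest value across all rows for one specific column, while doubleGreatest returns the highest value of multiple columns within one row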

Example (a percentage computed with nested arithmetic post-aggregators):

{
  ...
  "aggregations" : [
    { "type" : "doubleSum", "name" : "tot", "fieldName" : "total" },
    { "type" : "doubleSum", "name" : "part", "fieldName" : "part" }
  ],
  "postAggregations" : [{
    "type" : "arithmetic",
    "name" : "part_percentage",
    "fn" : "*",
    "fields" : [
      { "type" : "arithmetic",
        "name" : "ratio",
        "fn" : "/",
        "fields" : [
          { "type" : "fieldAccess", "name" : "part", "fieldName" : "part" },
          { "type" : "fieldAccess", "name" : "tot", "fieldName" : "tot" }
        ]
      },
      { "type" : "constant", "name": "const", "value" : 100 }
    ]
  }]
  ...
}

6. groupBy (having filters)

(1) Query filter: filter

{ "queryType": "groupBy", "dataSource": "sample_datasource", ... "having": { "type" : "filter", "filter" : <any Druid query filter> } }

(2) Numeric filters: greaterThan, equalTo, lessThan

"having": { "type": "greaterThan", "aggregation": "<aggregate_metric>", "value": <numeric_value> }

(3) Dimension selector filter: dimSelector

"having": { "type": "dimSelector", "dimension": "<dimension>", "value": <dimension_value> }

(4) Logical expression filters: or, and, not

{
  "queryType": "groupBy",
  "dataSource": "sample_datasource",
  ...
  "having":
  {
    "type": "and",
    "havingSpecs": [
      {
        "type": "greaterThan",
        "aggregation": "<aggregate_metric>",
        "value": <numeric_value>
      },
      {
        "type": "lessThan",
        "aggregation": "<aggregate_metric>",
        "value": <numeric_value>
      }
    ]
  }
}
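A having spec is applied to each grouped result row after aggregation, and the logical types simply combine the results of their nested specs. The recursive sketch below illustrates this for the numeric, dimSelector, and logical types; the function name `having_matches` and the sample rows are hypothetical, the "filter" type is left out, and the use of a "havingSpec" key for not is stated here from memory of the Druid documentation rather than from the text above.

```python
def having_matches(spec, row):
    """Return True if an aggregated result `row` passes the having `spec`."""
    kind = spec["type"]
    if kind == "greaterThan":
        return row[spec["aggregation"]] > spec["value"]
    if kind == "equalTo":
        return row[spec["aggregation"]] == spec["value"]
    if kind == "lessThan":
        return row[spec["aggregation"]] < spec["value"]
    if kind == "dimSelector":
        return row[spec["dimension"]] == spec["value"]
    if kind == "and":
        return all(having_matches(s, row) for s in spec["havingSpecs"])
    if kind == "or":
        return any(having_matches(s, row) for s in spec["havingSpecs"])
    if kind == "not":
        # Assumption: "not" wraps a single nested spec under "havingSpec".
        return not having_matches(spec["havingSpec"], row)
    raise ValueError("unsupported having type: " + kind)


between = {
    "type": "and",
    "havingSpecs": [
        {"type": "greaterThan", "aggregation": "total_usage", "value": 100},
        {"type": "lessThan", "aggregation": "total_usage", "value": 500},
    ],
}
# Hypothetical grouped result rows.
grouped = [
    {"country": "US", "total_usage": 250},
    {"country": "DE", "total_usage": 50},
    {"country": "JP", "total_usage": 900},
]
kept = [row for row in grouped if having_matches(between, row)]
print(kept)
```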

7. Virtual columns

Virtual columns are queryable column "views" created from a set of columns during a query.

A virtual column may draw from multiple underlying columns, although it always presents itself as a single column.

Virtual columns can be used as dimensions or as inputs to aggregators.

{
  "queryType": "scan",
  "dataSource": "page_data",
  "columns": [],
  "virtualColumns": [
    {
      "type": "expression",
      "name": "fooPage",
      "expression": "concat('foo' + page)",
      "outputType": "STRING"
    },
    {
      "type": "expression",
      "name": "tripleWordCount",
      "expression": "wordCount * 3",
      "outputType": "LONG"
    }
  ],
  "intervals": [
    "2013-01-01/2019-01-02"
  ]
}

Reference: https://druid.apache.org/docs/latest/design/

Introduction to Druid and Common Query Operations