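I. Key features of Druid
- Columnar storage format
- Distributed system: Druid is typically deployed in clusters of tens to hundreds of servers
- Massively parallel processing
- Real-time or batch ingestion
- Self-healing, self-balancing, easy to operate: to scale the cluster out or in, simply add or remove servers, and the cluster rebalances itself automatically in the background without any downtime
- Fault-tolerant architecture: once Druid has ingested data, a copy is stored safely in deep storage (typically cloud storage, HDFS, or a shared filesystem)
- Indexes for quick filtering: bitmap indexes
- Time-based partitioning
- Automatic summarization (rollup)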
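II. Druid use cases
1. Druid is a good fit when
- Insert rates are high and updates are rare
- Most queries are aggregation, search, or scan queries
- Target query latencies are 100 ms to a few seconds
- The data has a time component
- You need fast counting and ranking over high-cardinality columns (such as URLs or IDs)
- Data is loaded from Kafka, HDFS, or object storage
2. Druid is a poor fit when
- You need low-latency updates of existing records by primary key
- Query latency is not very important
- Queries involve "big" join operations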
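III. Common query operations
Queries can be written either as SQL statements or as JSON. This article covers only JSON queries; for more detail, see the official documentation: https://druid.apache.org/docs/latest/design/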
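A JSON query is sent to Druid as the body of an HTTP POST. The sketch below shows one way to wrap a query in such a request using only the Python standard library. The host and port in `DRUID_URL` are assumptions for a local setup, and `build_request` and `run_query` are hypothetical helper names, not part of any Druid client.

```python
import json
import urllib.request

# Assumption: a Druid Router or Broker is reachable at this address.
DRUID_URL = "http://localhost:8888/druid/v2/"


def build_request(query, url=DRUID_URL):
    """Wrap a native JSON query in an HTTP POST request (no network I/O)."""
    body = json.dumps(query).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(query, url=DRUID_URL):
    """Send the query and decode the JSON response (needs a live cluster)."""
    with urllib.request.urlopen(build_request(query, url)) as resp:
        return json.loads(resp.read().decode("utf-8"))


example_query = {
    "queryType": "timeBoundary",
    "dataSource": "sample_datasource",
}
request = build_request(example_query)
```

Keeping request construction separate from sending makes it possible to inspect or log the exact payload before it reaches the cluster.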
1. timeseries (time series queries)
{
"queryType": "timeseries",
"dataSource": "sample_datasource",
"granularity": "day",
"descending": "true",
"filter": {
"type": "and",
"fields": [
{ "type": "selector", "dimension": "sample_dimension1", "value": "sample_value1" },
{ "type": "or",
"fields": [
{ "type": "selector", "dimension": "sample_dimension2", "value": "sample_value2" },
{ "type": "selector", "dimension": "sample_dimension3", "value": "sample_value3" }
]
}
]
},
"aggregations": [
{ "type": "longSum", "name": "sample_name1", "fieldName": "sample_fieldName1" },
{ "type": "doubleSum", "name": "sample_name2", "fieldName": "sample_fieldName2" }
],
"postAggregations": [
{ "type": "arithmetic",
"name": "sample_divide",
"fn": "/",
"fields": [
{ "type": "fieldAccess", "name": "postAgg__sample_name1", "fieldName": "sample_name1" },
{ "type": "fieldAccess", "name": "postAgg__sample_name2", "fieldName": "sample_name2" }
]
}
],
"intervals": [ "2012-01-01T00:00:00.000/2012-01-03T00:00:00.000" ]
}
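Property | Description |
---|---|
queryType | timeseries |
dataSource | Name of the data source to query (similar to a table) |
descending | Whether to sort results in descending order. Default is false (ascending) |
granularity | Granularity used to bucket the results: all, none, second, minute, fifteen_minute, thirty_minute, hour, day, week, month, quarter, and year |
filter | Filter to apply |
aggregations | Aggregations to compute |
postAggregations | Post-aggregations computed on the aggregated data |
intervals | A JSON object representing ISO-8601 intervals. This defines the time ranges to run the query over. |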
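The query above returns 2 data points from the "sample_datasource" table, one for each day between 2012-01-01 and 2012-01-03. Each data point contains the longSum of sample_fieldName1, the doubleSum of sample_fieldName2, and the double result of sample_fieldName1 divided by sample_fieldName2, all computed over the rows that match the filter.
Grand totals: "context": { "grandTotal": true } adds an extra "grand total" row as the last row of the timeseries result set.
Zero-filling: timeseries queries normally fill empty interior time buckets with zeroes; "context": { "skipEmptyBuckets": "true" } disables this zero-filling.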
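The effect of zero-filling can be illustrated with a small in-memory simulation. This is a simplified sketch, not Druid's implementation: it assumes day granularity, takes already aggregated per-day sums as input, and the function name `daily_timeseries` is hypothetical.

```python
from datetime import date, timedelta


def daily_timeseries(sums_by_day, skip_empty_buckets=False):
    """Return one (day, value) row per day bucket.

    Only interior buckets (between the first and last day that have data)
    are zero-filled, and only when skip_empty_buckets is False.
    """
    if not sums_by_day:
        return []
    day, last = min(sums_by_day), max(sums_by_day)
    rows = []
    while day <= last:
        if day in sums_by_day:
            rows.append((day.isoformat(), sums_by_day[day]))
        elif not skip_empty_buckets:
            rows.append((day.isoformat(), 0))  # zero-filled empty bucket
        day += timedelta(days=1)
    return rows


# No data exists for 2012-01-02.
data = {date(2012, 1, 1): 5, date(2012, 1, 3): 7}
filled = daily_timeseries(data)
skipped = daily_timeseries(data, skip_empty_buckets=True)
```

Here `filled` contains three rows, with a zero for 2012-01-02, while `skipped` contains only the two days that have data.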
2. TopN
{
"queryType": "topN",
"dataSource": "sample_data",
"dimension": "sample_dim",
"threshold": 5,
"metric": "count",
"granularity": "all",
"filter": {
"type": "and",
"fields": [
{
"type": "selector",
"dimension": "dim1",
"value": "some_value"
},
{
"type": "selector",
"dimension": "dim2",
"value": "some_other_val"
}
]
},
"aggregations": [
{
"type": "longSum",
"name": "count",
"fieldName": "count"
},
{
"type": "doubleSum",
"name": "some_metric",
"fieldName": "some_metric"
}
],
"postAggregations": [
{
"type": "arithmetic",
"name": "average",
"fn": "/",
"fields": [
{
"type": "fieldAccess",
"name": "some_metric",
"fieldName": "some_metric"
},
{
"type": "fieldAccess",
"name": "count",
"fieldName": "count"
}
]
}
],
"intervals": [
"2013-08-31T00:00:00.000/2013-09-03T00:00:00.000"
]
}
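Property | Description |
---|---|
queryType | topN |
threshold | An integer defining the N in topN (that is, how many results you want in the top list) |
metric | A String or JSON object specifying the metric to sort by for the top list |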
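The roles of dimension, metric, and threshold can be shown with a small in-memory equivalent. This is only a sketch of the semantics under simplifying assumptions: it computes exact results over a Python list, and the helper name `top_n` and the sample rows are hypothetical.

```python
from collections import defaultdict


def top_n(rows, dimension, metric, threshold):
    """Sum `metric` per value of `dimension`, then keep the largest N."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[dimension]] += row[metric]
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:threshold]


rows = [
    {"sample_dim": "a", "count": 3},
    {"sample_dim": "b", "count": 10},
    {"sample_dim": "a", "count": 4},
    {"sample_dim": "c", "count": 1},
]
result = top_n(rows, "sample_dim", "count", threshold=2)
```

With these rows, "b" ranks first with a total of 10 and "a" second with 7; "c" falls outside the threshold.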
3. GroupBy (grouping queries)
{
"queryType": "groupBy",
"dataSource": "sample_datasource",
"granularity": "day",
"dimensions": ["country", "device"],
"limitSpec": { "type": "default", "limit": 5000, "columns": ["country", "data_transfer"] },
"filter": {
"type": "and",
"fields": [
{ "type": "selector", "dimension": "carrier", "value": "AT&T" },
{ "type": "or",
"fields": [
{ "type": "selector", "dimension": "make", "value": "Apple" },
{ "type": "selector", "dimension": "make", "value": "Samsung" }
]
}
]
},
"aggregations": [
{ "type": "longSum", "name": "total_usage", "fieldName": "user_count" },
{ "type": "doubleSum", "name": "data_transfer", "fieldName": "data_transfer" }
],
"postAggregations": [
{ "type": "arithmetic",
"name": "avg_usage",
"fn": "/",
"fields": [
{ "type": "fieldAccess", "fieldName": "data_transfer" },
{ "type": "fieldAccess", "fieldName": "total_usage" }
]
}
],
"intervals": [ "2012-01-01T00:00:00.000/2012-01-03T00:00:00.000" ],
"having": {
"type": "greaterThan",
"aggregation": "total_usage",
"value": 100
}
}
Subtotals: "subtotalsSpec": [ ["D1", "D2", "D3"], ["D1", "D3"], ["D3"] ] allows multiple sub-groupings to be computed in a single query
{
"queryType": "groupBy",
...
...
"dimensions": [
{
"type" : "default",
"dimension" : "d1col",
"outputName": "D1"
},
{
"type" : "extraction",
"dimension" : "d2col",
"outputName" : "D2",
"extractionFn" : extraction_func
},
{
"type":"lookup",
"dimension":"d3col",
"outputName":"D3",
"name":"my_lookup"
}
],
...
...
"subtotalsSpec":[ ["D1", "D2", "D3"], ["D1", "D3"], ["D3"]],
..
}
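What a subtotals spec computes can be sketched in plain Python: the same rows are grouped once per listed subset of output dimensions. This is an illustration only. The helper name `subtotals`, the sample rows, and the choice to report dimensions outside a subset as None are assumptions of this sketch.

```python
from collections import defaultdict


def subtotals(rows, dimensions, subtotals_spec, metric):
    """Group `rows` once per dimension subset and sum `metric`."""
    results = []
    for subset in subtotals_spec:
        groups = defaultdict(int)
        for row in rows:
            key = tuple(row[d] for d in subset)
            groups[key] += row[metric]
        for key, total in sorted(groups.items()):
            out = {d: None for d in dimensions}  # dims outside the subset
            out.update(zip(subset, key))
            out[metric] = total
            results.append(out)
    return results


rows = [
    {"D1": "x", "D3": "p", "n": 1},
    {"D1": "x", "D3": "q", "n": 2},
    {"D1": "y", "D3": "p", "n": 4},
]
result = subtotals(rows, ["D1", "D3"], [["D1", "D3"], ["D3"]], "n")
```

The result holds three rows for the ["D1", "D3"] grouping followed by two rows for the ["D3"] grouping.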
4. scan (scan queries)
{
"queryType": "scan",
"dataSource": "wikipedia",
"resultFormat": "list",
"columns":[],
"intervals": [
"2013-01-01/2013-01-02"
],
"batchSize":20480,
"limit":3
}
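Property | Description |
---|---|
queryType | scan |
resultFormat | How the results are represented: list, compactedList, or valueVector. Currently only list and compactedList are supported. Default is list |
columns | A String array of dimensions and metrics to scan. If left empty, all dimensions and metrics are returned. |
batchSize | The maximum number of rows buffered before being returned to the client. Default is 20480 |
limit | How many rows to return. If not specified, all rows are returned. |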
5. search (search queries)
{
"queryType": "search",
"dataSource": "sample_datasource",
"granularity": "day",
"searchDimensions": [
"dim1",
"dim2"
],
"query": {
"type": "insensitive_contains",
"value": "Ke"
},
"sort" : {
"type": "lexicographic"
},
"intervals": [
"2013-01-01T00:00:00.000/2013-01-03T00:00:00.000"
]
}
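insensitive_contains: a "match" occurs if any part of a dimension value contains the value specified in this search query spec, regardless of case
{ "type" : "insensitive_contains", "value" : "some_value" }
fragment: a "match" occurs if any part of a dimension value contains all of the values specified in this search query spec, case-insensitively by default
{ "type" : "fragment", "case_sensitive" : false, "values" : ["fragment1", "fragment2"] }
contains: a "match" occurs if any part of a dimension value contains the value specified in this search query spec
{ "type" : "contains", "case_sensitive" : true, "value" : "some_value" }
regex: a "match" occurs if any part of a dimension value contains the pattern specified in this search query spec
{ "type" : "regex", "pattern" : "some_pattern" }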
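The four matching rules can be compared side by side with a small Python sketch. It mirrors the descriptions above rather than Druid's code: the function name `matches` is hypothetical, and Python's `re` module stands in for the regex engine, so pattern syntax may differ in corner cases.

```python
import re


def matches(spec, value):
    """Return True if the dimension `value` matches the search query spec."""
    kind = spec["type"]
    if kind == "insensitive_contains":
        return spec["value"].lower() in value.lower()
    if kind == "contains":
        if spec["case_sensitive"]:
            return spec["value"] in value
        return spec["value"].lower() in value.lower()
    if kind == "fragment":
        # Every fragment must occur; case-insensitive unless stated otherwise.
        if spec.get("case_sensitive", False):
            return all(frag in value for frag in spec["values"])
        lowered = value.lower()
        return all(frag.lower() in lowered for frag in spec["values"])
    if kind == "regex":
        return re.search(spec["pattern"], value) is not None
    raise ValueError("unsupported search query spec: " + kind)


hit = matches({"type": "insensitive_contains", "value": "Ke"}, "bucKEt")
```

Note that fragment requires all listed values to be present, whereas the other specs test a single value or pattern.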
6. timeBoundary (time boundary queries)
Time boundary queries return the earliest and latest data points of a data set
{
"queryType" : "timeBoundary",
"dataSource": "sample_datasource",
"bound" : < "maxTime" | "minTime" >, # optional, defaults to returning both timestamps if not set
"filter" : { "type": "and", "fields": [<filter>, <filter>, ...] } # optional
}
bound: set to maxTime or minTime to return only the latest or the earliest timestamp. If not set, both are returned by default
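7. segmentMetadata (segment metadata queries)
Segment metadata queries return per-segment information about:
- Cardinality of all columns in the segment
- Min/max values of string-type columns in the segment
- Estimated byte size of the segment columns if they were stored in a flat format
- Number of rows stored in the segment
- Interval the segment covers
- Column type of all columns in the segment
- Estimated total segment byte size if it were stored in a flat format
- Whether the segment is rolled up
- Segment ID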
{
"queryType":"segmentMetadata",
"dataSource":"sample_datasource",
"intervals":["2013-01-01/2014-01-01"]
}
8. dataSourceMetadata (data source metadata queries)
Data source metadata queries return metadata information for a data source
{
"queryType" : "dataSourceMetadata",
"dataSource": "sample_datasource"
}
IV. Query components
1. filter (filters)
(1) Selector filter: selector
"filter": { "type": "selector", "dimension": <dimension_string>, "value": <dimension_value_string> }
This is equivalent to WHERE <dimension_string> = '<dimension_value_string>'
(2) Column comparison filter: columnComparison
"filter": { "type": "columnComparison", "dimensions": [<dimension_a>, <dimension_b>] }
This is equivalent to WHERE <dimension_a> = <dimension_b>
(3) Regular expression filter: regex
"filter": { "type": "regex", "dimension": <dimension_string>, "pattern": <pattern_string> }
(4) Logical expression filters: and, or, not
"filter": { "type": "and", "fields": [<filter>, <filter>, ...] }
"filter": { "type": "or", "fields": [<filter>, <filter>, ...] }
"filter": { "type": "not", "field": <filter> }
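Because logical filters nest, it can help to see how a filter tree is evaluated against a single row. The following is a minimal sketch covering only the selector, and, or, and not types; the function name `evaluate` and the sample rows are hypothetical, and this is not how Druid evaluates filters internally.

```python
def evaluate(filter_spec, row):
    """Recursively evaluate a filter tree against one row (a dict)."""
    kind = filter_spec["type"]
    if kind == "selector":
        return row.get(filter_spec["dimension"]) == filter_spec["value"]
    if kind == "and":
        return all(evaluate(f, row) for f in filter_spec["fields"])
    if kind == "or":
        return any(evaluate(f, row) for f in filter_spec["fields"])
    if kind == "not":
        return not evaluate(filter_spec["field"], row)
    raise ValueError("unsupported filter type: " + kind)


# carrier = 'AT&T' AND (make = 'Apple' OR make = 'Samsung')
example_filter = {
    "type": "and",
    "fields": [
        {"type": "selector", "dimension": "carrier", "value": "AT&T"},
        {"type": "or", "fields": [
            {"type": "selector", "dimension": "make", "value": "Apple"},
            {"type": "selector", "dimension": "make", "value": "Samsung"},
        ]},
    ],
}
matched = evaluate(example_filter, {"carrier": "AT&T", "make": "Apple"})
```

Note that and/or take a list under "fields", while not takes a single filter under "field".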
(5) JavaScript filter: javascript
{
"type" : "javascript",
"dimension" : <dimension_string>,
"function" : "function(value) { <...> }"
}
(6) Extraction filter: extraction
Use of the extraction filter is no longer recommended
(7) Search filter: search
Used to filter on partial string matches
{
"filter": {
"type": "search",
"dimension": "product",
"query": {
"type": "insensitive_contains",
"value": "foo"
}
}
}
(8) In filter: in
{
"type": "in",
"dimension": "outlaw",
"values": ["Good", "Bad", "Ugly"]
}
(9) Like filter (pattern matching): like
{
"type": "like",
"dimension": "last_name",
"pattern": "D%"
}
(10) Bound filter: bound
The following bound filter expresses the condition 21 <= age <= 31:
{
"type": "bound",
"dimension": "age",
"lower": "21",
"upper": "31" ,
"ordering": "numeric"
}
This filter expresses the condition foo <= name <= hoo, using the default lexicographic sorting order:
{
"type": "bound",
"dimension": "name",
"lower": "foo",
"upper": "hoo"
}
Using strict bounds, this filter expresses the condition 21 < age < 31:
{
"type": "bound",
"dimension": "age",
"lower": "21",
"lowerStrict": true,
"upper": "31" ,
"upperStrict": true,
"ordering": "numeric"
}
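The difference between numeric and lexicographic ordering, and between strict and non-strict bounds, is easy to get wrong. The sketch below evaluates a bound spec against one value. It is an approximation for illustration: the function name `in_bound` is hypothetical, and Python string comparison stands in for the lexicographic ordering.

```python
def in_bound(spec, value):
    """Check whether `value` falls inside the bounds described by `spec`."""
    convert = float if spec.get("ordering") == "numeric" else str
    v = convert(value)
    if "lower" in spec:
        lower = convert(spec["lower"])
        if v < lower or (v == lower and spec.get("lowerStrict", False)):
            return False
    if "upper" in spec:
        upper = convert(spec["upper"])
        if v > upper or (v == upper and spec.get("upperStrict", False)):
            return False
    return True


numeric = {"type": "bound", "dimension": "age",
           "lower": "21", "upper": "31", "ordering": "numeric"}
strict = dict(numeric, lowerStrict=True, upperStrict=True)
lexicographic = {"type": "bound", "dimension": "age",
                 "lower": "21", "upper": "31"}

numeric_hit = in_bound(numeric, "3")        # 3 is below 21
lexical_hit = in_bound(lexicographic, "3")  # "21" <= "3" <= "31" as strings
```

The value "3" is rejected under numeric ordering but accepted under lexicographic ordering, which is why "ordering": "numeric" matters for number-like dimensions.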
(11) Interval filter: interval
{
"type" : "interval",
"dimension" : "__time",
"intervals" : [
"2014-10-01T00:00:00.000Z/2014-10-07T00:00:00.000Z",
"2014-11-15T00:00:00.000Z/2014-11-16T00:00:00.000Z"
]
}
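2. Query granularity
"granularity": "<query granularity>"
(1) Simple granularities: day, hour, ...
Simple granularities are specified as a string and bucket timestamps by their UTC time (for example, days start at 00:00 UTC).
Supported granularity strings are: all, none, second, minute, fifteen_minute, thirty_minute, hour, day, week, month, quarter, and year.
all buckets everything into a single bucket. none does not bucket data (it actually uses the granularity of the index; the minimum here is none, which means millisecond granularity). Using none in a timeseries query is currently not recommended (the system will try to generate 0 values for every millisecond that has no data, which is often a lot).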
(2) Duration granularity: duration
{"type": "duration", "duration": 3600000, "origin": "2012-01-01T00:30:00Z"}
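With the spec above, time is chunked into one-hour buckets that start on the half hour. A bucket start can be computed by counting whole durations from the origin, as in the sketch below. The function name `bucket_start` is hypothetical, and the epoch default for a missing origin is an assumption of this sketch.

```python
from datetime import datetime, timedelta, timezone

# Assumed default origin when the spec does not provide one.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def bucket_start(ts, duration_ms, origin=EPOCH):
    """Return the start of the duration bucket that contains `ts`."""
    step = timedelta(milliseconds=duration_ms)
    # Floor division also handles timestamps that precede the origin.
    return origin + ((ts - origin) // step) * step


origin = datetime(2012, 1, 1, 0, 30, tzinfo=timezone.utc)
ts = datetime(2012, 1, 1, 1, 15, tzinfo=timezone.utc)
bucket = bucket_start(ts, 3600000, origin)
```

Here 01:15 UTC falls into the bucket that starts at 00:30 UTC, not 01:00 UTC, because buckets are counted from the origin.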
(3) Period granularity: period
{"type": "period", "period": "P3M", "timeZone": "America/Los_Angeles", "origin": "2012-02-01T00:00:00-08:00"}
3. Query dimensions
https://druid.apache.org/docs/latest/querying/dimensionspecs.html
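4. aggregations (aggregators)
(1) Count: count: { "type" : "count", "name" : <output_name> }
(2) Sum: longSum, doubleSum, floatSum: { "type" : "floatSum", "name" : <output_name>, "fieldName" : <metric_name> }
(3) Min/max aggregators: doubleMin, doubleMax, floatMin, floatMax, longMin, longMax
(4) Arithmetic mean: doubleMean
(5) First/last aggregators: doubleFirst, doubleLast, floatFirst, floatLast, longFirst, longLast, stringFirst, stringLast
(6) Any aggregators: doubleAny, floatAny, longAny, stringAny
(7) Miscellaneous aggregators: the filtered aggregator, filtered
A filtered aggregator wraps any given aggregator, but only aggregates the values for which the given dimension filter matches.
This makes it possible to compute the results of a filtered and an unfiltered aggregation at the same time, without issuing multiple queries, and to use both results as part of post-aggregations.
Note: if only the filtered results are needed, consider putting the filter on the query itself, which is faster because it does not require scanning all the data.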
{
"type" : "filtered",
"filter" : {
"type" : "selector",
"dimension" : <dimension>,
"value" : <dimension value>
},
"aggregator" : <aggregation>
}
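The point of a filtered aggregator is that the filtered and unfiltered results come out of the same pass over the data. A rough in-memory analogy, with a hypothetical helper `sum_with_filtered` and made-up rows, looks like this:

```python
def sum_with_filtered(rows, metric, dimension, value):
    """Sum `metric` over all rows and over rows where dimension == value."""
    total = 0
    filtered = 0
    for row in rows:  # a single pass feeds both aggregators
        total += row[metric]
        if row.get(dimension) == value:
            filtered += row[metric]
    return {"total": total, "filtered": filtered}


rows = [
    {"make": "Apple", "user_count": 4},
    {"make": "Samsung", "user_count": 6},
    {"make": "Apple", "user_count": 10},
]
sums = sum_with_filtered(rows, "user_count", "make", "Apple")
share = sums["filtered"] / sums["total"]  # analogous to a post-aggregation
```

The final division plays the role of a post-aggregation that combines the two aggregated values.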
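5. postAggregations (post-aggregators)
(1) Arithmetic post-aggregator: arithmetic
The arithmetic post-aggregator applies the provided function to the given fields from left to right. The fields can be aggregators or other post-aggregators. Supported functions are +, -, *, /, and quotient.
- / division always returns 0 when dividing by 0, regardless of the numerator.
- quotient division behaves like regular floating-point division.
An arithmetic post-aggregator may also specify an ordering, which defines the order of the resulting values when results are sorted (this can be useful for topN queries, for instance):
- If no ordering (or null) is specified, the default floating-point ordering is used.
- numericFirst ordering always returns finite values first, followed by NaN, and infinite values last.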
postAggregation : {
"type" : "arithmetic",
"name" : <output_name>,
"fn" : <arithmetic_function>,
"fields": [<post_aggregator>, <post_aggregator>, ...],
"ordering" : <null (default), or "numericFirst">
}
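The difference between / and quotient, and the left-to-right application over the fields, can be sketched as follows. This models the described behavior in Python and is not Druid code; the names `divide`, `quotient`, `FUNCTIONS`, and `arithmetic` are hypothetical. Python raises an error on float division by zero, so the sketch spells out the IEEE 754 results by hand.

```python
import math
import operator
from functools import reduce


def divide(lhs, rhs):
    """The "/" function: dividing by 0 yields 0, whatever the numerator."""
    return 0.0 if rhs == 0 else lhs / rhs


def quotient(lhs, rhs):
    """Regular floating-point division, including division by zero."""
    if rhs != 0:
        return lhs / rhs
    if lhs == 0 or math.isnan(lhs):
        return math.nan
    sign = math.copysign(1.0, lhs) * math.copysign(1.0, rhs)
    return math.copysign(math.inf, sign)


FUNCTIONS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": divide,
    "quotient": quotient,
}


def arithmetic(fn, values):
    """Apply `fn` to the field values from left to right."""
    return reduce(FUNCTIONS[fn], values)


safe = arithmetic("/", [10, 0])
raw = arithmetic("quotient", [10, 0])
```

In this sketch `safe` is 0.0 while `raw` is positive infinity, which shows why "/" is the more convenient choice for ratios whose denominator may be zero.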
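(2) Field accessor post-aggregators: fieldAccess
These post-aggregators return the value produced by the specified aggregator. fieldName refers to the output name of the aggregator given in the aggregations section of the query. Use type "fieldAccess" to return the raw aggregation object, or type "finalizingFieldAccess" to return the finalized value.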
{ "type" : "fieldAccess", "name": <output_name>, "fieldName" : <aggregator_name> }
(3) Constant post-aggregator: constant
{ "type" : "constant", "name" : <output_name>, "value" : <numerical_value> }
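(4) Greatest/least post-aggregators: doubleGreatest, longGreatest
doubleGreatest and longGreatest compute the maximum of all fields and Double.NEGATIVE_INFINITY.
The difference between the doubleMax aggregator and the doubleGreatest post-aggregator is that doubleMax returns the highest value across all rows for one specific column, while doubleGreatest returns the highest value of multiple columns in one row.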
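The distinction can be made concrete with a toy data set: an aggregator scans down a column, whereas a post-aggregator compares values side by side within one result row. The helper names `double_max` and `double_greatest` and the sample rows below are hypothetical and only illustrate the idea.

```python
rows = [
    {"a": 1.0, "b": 9.0},
    {"a": 5.0, "b": 2.0},
    {"a": 3.0, "b": 4.0},
]


def double_max(rows, column):
    """Aggregator-style: highest value of one column across all rows."""
    return max(row[column] for row in rows)


def double_greatest(*values):
    """Post-aggregator-style: highest of several fields in one result row.

    Negative infinity is included, matching the description above.
    """
    return max((float("-inf"), *values))


max_a = double_max(rows, "a")
max_b = double_max(rows, "b")
greatest = double_greatest(max_a, max_b)
```

Here `max_a` and `max_b` are each computed over all rows, and `greatest` then picks the larger of those two aggregated values.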
Example (computing a percentage with nested arithmetic post-aggregators):
{
...
"aggregations" : [
{ "type" : "doubleSum", "name" : "tot", "fieldName" : "total" },
{ "type" : "doubleSum", "name" : "part", "fieldName" : "part" }
],
"postAggregations" : [{
"type" : "arithmetic",
"name" : "part_percentage",
"fn" : "*",
"fields" : [
{ "type" : "arithmetic",
"name" : "ratio",
"fn" : "/",
"fields" : [
{ "type" : "fieldAccess", "name" : "part", "fieldName" : "part" },
{ "type" : "fieldAccess", "name" : "tot", "fieldName" : "tot" }
]
},
{ "type" : "constant", "name": "const", "value" : 100 }
]
}]
...
}
6. groupBy having filters
(1) Query filter: filter
{ "queryType": "groupBy", "dataSource": "sample_datasource", ... "having": { "type" : "filter", "filter" : <any Druid query filter> } }
(2) Numeric filters: greaterThan, equalTo, lessThan
"having": { "type": "greaterThan", "aggregation": "<aggregate_metric>", "value": <numeric_value> }
(3) Dimension selector filter: dimSelector
"having": { "type": "dimSelector", "dimension": "<dimension>", "value": <dimension_value> }
(4) Logical expression filters: or, and, not
{
"queryType": "groupBy",
"dataSource": "sample_datasource",
...
"having":
{
"type": "and",
"havingSpecs": [
{
"type": "greaterThan",
"aggregation": "<aggregate_metric>",
"value": <numeric_value>
},
{
"type": "lessThan",
"aggregation": "<aggregate_metric>",
"value": <numeric_value>
}
]
}
}
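A having spec is applied to the grouped result rows, after aggregation, much like HAVING in SQL. The sketch below evaluates a having spec against one result row. It is a simplified model with a hypothetical function name `having_matches` and made-up rows, and it leaves out the "filter" type.

```python
def having_matches(spec, row):
    """Evaluate a having spec against one aggregated result row."""
    kind = spec["type"]
    if kind == "greaterThan":
        return row[spec["aggregation"]] > spec["value"]
    if kind == "lessThan":
        return row[spec["aggregation"]] < spec["value"]
    if kind == "equalTo":
        return row[spec["aggregation"]] == spec["value"]
    if kind == "dimSelector":
        return row.get(spec["dimension"]) == spec["value"]
    if kind == "and":
        return all(having_matches(s, row) for s in spec["havingSpecs"])
    if kind == "or":
        return any(having_matches(s, row) for s in spec["havingSpecs"])
    if kind == "not":
        return not having_matches(spec["havingSpec"], row)
    raise ValueError("unsupported having type: " + kind)


having = {
    "type": "and",
    "havingSpecs": [
        {"type": "greaterThan", "aggregation": "total_usage", "value": 100},
        {"type": "lessThan", "aggregation": "total_usage", "value": 500},
    ],
}
grouped = [
    {"country": "US", "total_usage": 50},
    {"country": "CN", "total_usage": 300},
    {"country": "DE", "total_usage": 900},
]
kept = [row for row in grouped if having_matches(having, row)]
```

Only the row whose total_usage lies strictly between 100 and 500 survives.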
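7. Virtual columns
Virtual columns are queryable column "views" created from a set of columns during a query.
Although a virtual column always presents itself as a single column, it may draw on multiple underlying columns.
Virtual columns can be used as dimensions or as inputs to aggregators.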
{
"queryType": "scan",
"dataSource": "page_data",
"columns":[],
"virtualColumns": [
{
"type": "expression",
"name": "fooPage",
"expression": "concat('foo' + page)",
"outputType": "STRING"
},
{
"type": "expression",
"name": "tripleWordCount",
"expression": "wordCount * 3",
"outputType": "LONG"
}
],
"intervals": [
"2013-01-01/2019-01-02"
]
}