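Introduction to Druid and Common Query Operations

Contents

I. Main features of Druid
II. When to use Druid
  1. Suitable scenarios
  2. Unsuitable scenarios
III. Common query operations
  1. timeseries query
  2. TopN
  3. GroupBy query
  4. scan query
  5. search query
  6. timeBoundary query
  7. segmentMetadata query
  8. dataSourceMetadata query
IV. Query components
  1. Filters (filter)
    (1) Selector filter: selector
    (2) Column comparison filter: columnComparison
    (3) Regular expression filter: regex
    (4) Logical expression filters: and, or, not
    (5) JavaScript filter: javascript
    (6) Extraction filter: extraction
    (7) Search filter: search
    (8) In filter: in
    (9) Like filter: like
    (10) Bound filter: bound
    (11) Interval filter: interval
  2. Query granularity
    (1) Simple granularities: day, hour, ...
    (2) Duration granularity: duration
    (3) Period granularity: period
  3. Query dimensions
  4. Aggregations (aggregations)
  5. Post-aggregations (postAggregations)
    (1) Arithmetic post-aggregator
    (2) Field accessor post-aggregator: fieldAccess
    (3) Constant post-aggregator: constant
    (4) Greatest/least post-aggregators: doubleGreatest
  6. groupBy grouping: having filters
    (1) Query filter: filter
    (2) Numeric filters: greaterThan, equalTo, lessThan
    (3) Dimension selector filter: dimSelector
    (4) Logical expression filters: or, and, not
  7. Virtual columns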

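I. Main features of Druid

  • Columnar storage format
  • Distributed system: Druid is typically deployed in clusters of tens to hundreds of servers
  • Massively parallel processing
  • Real-time or batch ingestion
  • Self-healing, self-balancing, easy to operate: to scale the cluster out or in, simply add or remove servers, and the cluster rebalances itself automatically in the background without any downtime
  • Fault-tolerant architecture: once Druid has ingested data, a copy is stored safely in deep storage (typically cloud storage, HDFS, or a shared filesystem)
  • Indexes for fast filtering: bitmap indexes
  • Time-based partitioning
  • Automatic summarization (rollup)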

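II. When to use Druid

1. Suitable scenarios

  • Insert rates are high, but updates are rare
  • Most queries are aggregation, search, or scan queries
  • Target query latency is 100 milliseconds to a few seconds
  • The data has a time component
  • Fast counting and ranking is needed over high-cardinality columns (such as url or id)
  • Data is loaded from sources such as Kafka, HDFS, or object storage

2. Unsuitable scenarios

  • Low-latency updates of existing records by primary key are required
  • Query latency is not very important
  • "Big" joins are required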

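III. Common query operations

    Queries can be written either as SQL statements or as JSON (native) queries. This article covers only the JSON form; for more detail, see the official documentation: https://druid.apache.org/docs/latest/design/

1. timeseries query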

{
  "queryType": "timeseries",
  "dataSource": "sample_datasource",
  "granularity": "day",
  "descending": "true",
  "filter": {
    "type": "and",
    "fields": [
      { "type": "selector", "dimension": "sample_dimension1", "value": "sample_value1" },
      { "type": "or",
        "fields": [
          { "type": "selector", "dimension": "sample_dimension2", "value": "sample_value2" },
          { "type": "selector", "dimension": "sample_dimension3", "value": "sample_value3" }
        ]
      }
    ]
  },
  "aggregations": [
    { "type": "longSum", "name": "sample_name1", "fieldName": "sample_fieldName1" },
    { "type": "doubleSum", "name": "sample_name2", "fieldName": "sample_fieldName2" }
  ],
  "postAggregations": [
    { "type": "arithmetic",
      "name": "sample_divide",
      "fn": "/",
      "fields": [
        { "type": "fieldAccess", "name": "postAgg__sample_name1", "fieldName": "sample_name1" },
        { "type": "fieldAccess", "name": "postAgg__sample_name2", "fieldName": "sample_name2" }
      ]
    }
  ],
  "intervals": [ "2012-01-01T00:00:00.000/2012-01-03T00:00:00.000" ]
}

Property          Description
queryType         timeseries
dataSource        Name of the data source (index) to query
descending        Whether to sort in descending order. Defaults to false (ascending)
granularity       Aggregation granularity: all, none, second, minute, fifteen_minute, thirty_minute, hour, day, week, month, quarter, year
filter            Filter to apply
aggregations      Aggregations to compute
postAggregations  Post-aggregations applied to the aggregated data
intervals         A JSON object representing ISO-8601 intervals. This defines the time ranges the query runs over.

The query above returns 2 data points from the "sample_datasource" table, one for each day between 2012-01-01 and 2012-01-03. Each data point contains the (long) sum of sample_fieldName1, the (double) sum of sample_fieldName2, and the (double) result of sample_fieldName1 divided by sample_fieldName2, all computed over the rows that match the filter.

Grand total: "context": { "grandTotal": true } adds an extra "grand total" row as the last row of the timeseries result set.

Zero-filling: timeseries queries normally fill empty interior time buckets with zeroes; "context" : { "skipEmptyBuckets": "true" } disables this zero-filling.
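
As a minimal sketch of how the two context flags above might be attached to a native query, the Python snippet below builds the request body as a plain dict and serializes it to JSON. The helper name with_context and the trimmed-down query are illustrative assumptions, not part of Druid; only the grandTotal and skipEmptyBuckets keys come from the text above.

```python
import json

def with_context(query, **flags):
    """Return a copy of a native query dict with extra context flags merged in."""
    merged = dict(query)
    merged["context"] = {**query.get("context", {}), **flags}
    return merged

# Trimmed-down version of the timeseries query shown earlier (illustrative only).
base_query = {
    "queryType": "timeseries",
    "dataSource": "sample_datasource",
    "granularity": "day",
    "aggregations": [
        {"type": "longSum", "name": "sample_name1", "fieldName": "sample_fieldName1"}
    ],
    "intervals": ["2012-01-01T00:00:00.000/2012-01-03T00:00:00.000"],
}

# Ask for a grand-total row and suppress zero-filled empty buckets.
query = with_context(base_query, grandTotal=True, skipEmptyBuckets=True)
body = json.dumps(query, indent=2)  # JSON text to use as the request body
print(body)
```

The original dict is left untouched, so the same base query can be reused with different context settings.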

2. TopN

{
  "queryType": "topN",
  "dataSource": "sample_data",
  "dimension": "sample_dim",
  "threshold": 5,
  "metric": "count",
  "granularity": "all",
  "filter": {
    "type": "and",
    "fields": [
      {
        "type": "selector",
        "dimension": "dim1",
        "value": "some_value"
      },
      {
        "type": "selector",
        "dimension": "dim2",
        "value": "some_other_val"
      }
    ]
  },
  "aggregations": [
    {
      "type": "longSum",
      "name": "count",
      "fieldName": "count"
    },
    {
      "type": "doubleSum",
      "name": "some_metric",
      "fieldName": "some_metric"
    }
  ],
  "postAggregations": [
    {
      "type": "arithmetic",
      "name": "average",
      "fn": "/",
      "fields": [
        {
          "type": "fieldAccess",
          "name": "some_metric",
          "fieldName": "some_metric"
        },
        {
          "type": "fieldAccess",
          "name": "count",
          "fieldName": "count"
        }
      ]
    }
  ],
  "intervals": [
    "2013-08-31T00:00:00.000/2013-09-03T00:00:00.000"
  ]
}

Property    Description
queryType   topN
threshold   An integer defining the N in topN (that is, how many results you want in the top list)
metric      A String or JSON object specifying the metric to sort by for the top list
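
To illustrate what threshold and metric mean, the following Python sketch ranks the values of one dimension by a summed metric and keeps the top N. It is a conceptual model over in-memory rows only; the function top_n and the sample data are assumptions made for illustration and say nothing about how Druid actually executes the query.

```python
from collections import defaultdict

def top_n(rows, dimension, metric, threshold):
    """Sum `metric` per value of `dimension` and return the `threshold` largest."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[dimension]] += row[metric]
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:threshold]

rows = [
    {"sample_dim": "a", "count": 3},
    {"sample_dim": "b", "count": 7},
    {"sample_dim": "a", "count": 5},
    {"sample_dim": "c", "count": 1},
]

top = top_n(rows, "sample_dim", "count", threshold=2)
print(top)  # [('a', 8), ('b', 7)]
```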

3. GroupBy query

{
  "queryType": "groupBy",
  "dataSource": "sample_datasource",
  "granularity": "day",
  "dimensions": ["country", "device"],
  "limitSpec": { "type": "default", "limit": 5000, "columns": ["country", "data_transfer"] },
  "filter": {
    "type": "and",
    "fields": [
      { "type": "selector", "dimension": "carrier", "value": "AT&T" },
      { "type": "or",
        "fields": [
          { "type": "selector", "dimension": "make", "value": "Apple" },
          { "type": "selector", "dimension": "make", "value": "Samsung" }
        ]
      }
    ]
  },
  "aggregations": [
    { "type": "longSum", "name": "total_usage", "fieldName": "user_count" },
    { "type": "doubleSum", "name": "data_transfer", "fieldName": "data_transfer" }
  ],
  "postAggregations": [
    { "type": "arithmetic",
      "name": "avg_usage",
      "fn": "/",
      "fields": [
        { "type": "fieldAccess", "fieldName": "data_transfer" },
        { "type": "fieldAccess", "fieldName": "total_usage" }
      ]
    }
  ],
  "intervals": [ "2012-01-01T00:00:00.000/2012-01-03T00:00:00.000" ],
  "having": {
    "type": "greaterThan",
    "aggregation": "total_usage",
    "value": 100
  }
}
  • Subtotals: "subtotalsSpec": [ ["D1", "D2", "D3"], ["D1", "D3"], ["D3"] ] lets a single query compute several sub-groupings

{
  "queryType": "groupBy",
  ...
  ...
  "dimensions": [
    {
      "type" : "default",
      "dimension" : "d1col",
      "outputName": "D1"
    },
    {
      "type" : "extraction",
      "dimension" : "d2col",
      "outputName" : "D2",
      "extractionFn" : <extraction_func>
    },
    {
      "type": "lookup",
      "dimension": "d3col",
      "outputName": "D3",
      "name": "my_lookup"
    }
  ],
  ...
  ...
  "subtotalsSpec": [ ["D1", "D2", "D3"], ["D1", "D3"], ["D3"] ],
  ...
}
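
The effect of subtotalsSpec can be pictured as running one grouping per listed dimension subset and concatenating the results. The Python sketch below does exactly that over a few in-memory rows; the function subtotals, the metric name n, and the sample rows are hypothetical and only model the idea, not Druid's implementation.

```python
from collections import defaultdict

def subtotals(rows, subtotals_spec, metric):
    """For each dimension subset, group the rows and sum `metric`."""
    results = []
    for dims in subtotals_spec:
        sums = defaultdict(int)
        for row in rows:
            key = tuple(row[d] for d in dims)
            sums[key] += row[metric]
        for key, total in sorted(sums.items()):
            # Each output row carries only the dimensions of its own subset.
            results.append({**dict(zip(dims, key)), metric: total})
    return results

rows = [
    {"D1": "x", "D3": "p", "n": 1},
    {"D1": "x", "D3": "q", "n": 2},
    {"D1": "y", "D3": "p", "n": 4},
]

out = subtotals(rows, [["D1", "D3"], ["D3"]], "n")
for row in out:
    print(row)
```

With this input the first subset yields three rows (one per D1/D3 pair) and the second yields two rows (p with 5, q with 2).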

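4. scan query

{
  "queryType": "scan",
  "dataSource": "wikipedia",
  "resultFormat": "list",
  "columns": [],
  "intervals": [
    "2013-01-01/2013-01-02"
  ],
  "batchSize": 20480,
  "limit": 3
}

Property      Description
queryType     scan
resultFormat  How results are represented: list, compactedList, or valueVector. Currently only list and compactedList are supported. Default is list
columns       A string array of the dimensions and metrics to scan. If left empty, all dimensions and metrics are returned.
batchSize     The maximum number of rows buffered before being returned to the client. Default is 20480
limit         How many rows to return. If not specified, all rows are returned.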

5. search query

{
  "queryType": "search",
  "dataSource": "sample_datasource",
  "granularity": "day",
  "searchDimensions": [
    "dim1",
    "dim2"
  ],
  "query": {
    "type": "insensitive_contains",
    "value": "Ke"
  },
  "sort" : {
    "type": "lexicographic"
  },
  "intervals": [
    "2013-01-01T00:00:00.000/2013-01-03T00:00:00.000"
  ]
}

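insensitive_contains: a "match" occurs if any part of a dimension value contains the value specified in this search query spec, regardless of case

    { "type" : "insensitive_contains", "value" : "some_value" }

fragment: a "match" occurs if a dimension value contains all of the values specified in this search query spec; matching is case-insensitive by default

    { "type" : "fragment", "case_sensitive" : false, "values" : ["fragment1", "fragment2"] }

contains: a "match" occurs if any part of a dimension value contains the value specified in this search query spec

    { "type" : "contains", "case_sensitive" : true, "value" : "some_value" }

regex: a "match" occurs if any part of a dimension value contains the pattern specified in this search query spec

    { "type" : "regex", "pattern" : "some_pattern" }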
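
The four matching rules above can be modelled in a few lines of Python. This is a rough sketch under stated assumptions: the function spec_matches is hypothetical, case_sensitive is assumed to default to false when omitted, and Python's re module stands in for Druid's Java regular expressions, whose syntax differs in places.

```python
import re

def spec_matches(spec, value):
    """Return True if a dimension value matches a search query spec (simplified)."""
    kind = spec["type"]
    if kind == "insensitive_contains":
        return spec["value"].lower() in value.lower()
    if kind == "contains":
        if spec.get("case_sensitive", False):  # assumed default: case-insensitive
            return spec["value"] in value
        return spec["value"].lower() in value.lower()
    if kind == "fragment":
        if spec.get("case_sensitive", False):
            return all(fragment in value for fragment in spec["values"])
        return all(fragment.lower() in value.lower() for fragment in spec["values"])
    if kind == "regex":
        # "Any part of the value" means search, not a full match.
        return re.search(spec["pattern"], value) is not None
    raise ValueError(f"unsupported search spec type: {kind}")

print(spec_matches({"type": "insensitive_contains", "value": "Ke"}, "basket"))            # True
print(spec_matches({"type": "contains", "case_sensitive": True, "value": "Ke"}, "basket"))  # False
print(spec_matches({"type": "fragment", "values": ["fo", "AR"]}, "Foobar"))               # True
print(spec_matches({"type": "regex", "pattern": "^K.*n$"}, "Kevin"))                      # True
```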

6. timeBoundary query

A time boundary query returns the earliest and latest data points of a data set.

{
  "queryType" : "timeBoundary",
  "dataSource": "sample_datasource",
  "bound" : < "maxTime" | "minTime" >, # optional, defaults to returning both timestamps if not set
  "filter" : { "type": "and", "fields": [<filter>, <filter>, ...] } # optional
}

bound: set to maxTime or minTime to return only the latest or only the earliest timestamp. If not set, both are returned by default.

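7. segmentMetadata query

A segment metadata query returns per-segment information about:

  • The cardinality of all columns in the segment
  • The min/max values of string-type columns in the segment
  • The estimated byte size of the segment columns if they were stored in a flat format
  • The number of rows stored in the segment
  • The interval the segment covers
  • The column type of all columns in the segment
  • The estimated total segment byte size if it were stored in a flat format
  • Whether the segment is rolled up
  • The segment ID

{
  "queryType": "segmentMetadata",
  "dataSource": "sample_datasource",
  "intervals": ["2013-01-01/2014-01-01"]
}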

8. dataSourceMetadata query

A data source metadata query returns metadata information for a data source.

{
  "queryType" : "dataSourceMetadata",
  "dataSource": "sample_datasource"
}

IV. Query components

1. Filters (filter)

(1) Selector filter: selector

"filter": { "type": "selector", "dimension": <dimension_string>, "value": <dimension_value_string> }

This is equivalent to WHERE <dimension_string> = '<dimension_value_string>'

(2) Column comparison filter: columnComparison

"filter": { "type": "columnComparison", "dimensions": [<dimension_a>, <dimension_b>] }

This is equivalent to WHERE <dimension_a> = <dimension_b>

(3) Regular expression filter: regex

"filter": { "type": "regex", "dimension": <dimension_string>, "pattern": <pattern_string> }

(4) Logical expression filters: and, or, not

"filter": { "type": "and", "fields": [<filter>, <filter>, ...] }
"filter": { "type": "or", "fields": [<filter>, <filter>, ...] }
"filter": { "type": "not", "field": <filter> }
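
To make the nesting of logical filters concrete, here is a minimal Python sketch that evaluates selector, and, or, and not filters against a single row held as a dict. The function row_matches and the sample rows are illustrative assumptions; Druid itself resolves filters against its own indexes rather than row by row like this.

```python
def row_matches(flt, row):
    """Evaluate a small subset of Druid-style filters against one row dict."""
    kind = flt["type"]
    if kind == "selector":
        return row.get(flt["dimension"]) == flt["value"]
    if kind == "and":
        return all(row_matches(child, row) for child in flt["fields"])
    if kind == "or":
        return any(row_matches(child, row) for child in flt["fields"])
    if kind == "not":
        return not row_matches(flt["field"], row)
    raise ValueError(f"unsupported filter type: {kind}")

# carrier = 'AT&T' AND NOT (make = 'Apple')
carrier_filter = {
    "type": "and",
    "fields": [
        {"type": "selector", "dimension": "carrier", "value": "AT&T"},
        {"type": "not",
         "field": {"type": "selector", "dimension": "make", "value": "Apple"}},
    ],
}

print(row_matches(carrier_filter, {"carrier": "AT&T", "make": "Samsung"}))  # True
print(row_matches(carrier_filter, {"carrier": "AT&T", "make": "Apple"}))    # False
```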

(5) JavaScript filter: javascript

{
  "type" : "javascript",
  "dimension" : <dimension_string>,
  "function" : "function(value) { <...> }"
}

(6) Extraction filter: extraction

    Use of the extraction filter is no longer recommended.

(7) Search filter: search

    Used to filter on partial string matches.

{
  "filter": {
    "type": "search",
    "dimension": "product",
    "query": {
      "type": "insensitive_contains",
      "value": "foo"
    }
  }
}

(8) In filter: in

{
  "type": "in",
  "dimension": "outlaw",
  "values": ["Good", "Bad", "Ugly"]
}

(9) Like filter: like

{
  "type": "like",
  "dimension": "last_name",
  "pattern": "D%"
}

(10) Bound filter: bound

The following bound filter expresses the condition 21 <= age <= 31

{
  "type": "bound",
  "dimension": "age",
  "lower": "21",
  "upper": "31",
  "ordering": "numeric"
}

This filter expresses the condition foo <= name <= hoo, using the default lexicographic ordering

{
  "type": "bound",
  "dimension": "name",
  "lower": "foo",
  "upper": "hoo"
}

With strict bounds, this filter expresses the condition 21 < age < 31

{
  "type": "bound",
  "dimension": "age",
  "lower": "21",
  "lowerStrict": true,
  "upper": "31",
  "upperStrict": true,
  "ordering": "numeric"
}
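
The difference between numeric and lexicographic ordering, and between strict and non-strict bounds, is easy to get wrong, so here is a small Python sketch of the comparison logic. The function bound_matches is a hypothetical helper, and Python's built-in string comparison is used as an approximation of Druid's lexicographic ordering.

```python
def bound_matches(flt, value):
    """Check one string value against a bound filter (numeric or lexicographic)."""
    convert = float if flt.get("ordering") == "numeric" else str
    v = convert(value)
    if "lower" in flt:
        lower = convert(flt["lower"])
        if v < lower or (flt.get("lowerStrict", False) and v == lower):
            return False
    if "upper" in flt:
        upper = convert(flt["upper"])
        if v > upper or (flt.get("upperStrict", False) and v == upper):
            return False
    return True

numeric = {"type": "bound", "dimension": "age",
           "lower": "21", "upper": "31", "ordering": "numeric"}
strict = dict(numeric, lowerStrict=True, upperStrict=True)
lexicographic = {"type": "bound", "dimension": "age", "lower": "21", "upper": "31"}

print(bound_matches(numeric, "21"))       # True: 21 <= 21 <= 31
print(bound_matches(strict, "21"))        # False: 21 < 21 does not hold
print(bound_matches(numeric, "3"))        # False: 3 is below 21
print(bound_matches(lexicographic, "3"))  # True: "21" <= "3" <= "31" as strings
```

The last line shows why "ordering": "numeric" matters for numeric dimensions stored as strings: without it, "3" falls inside a range that was meant to cover ages 21 to 31.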

(11) Interval filter: interval

{
  "type" : "interval",
  "dimension" : "__time",
  "intervals" : [
    "2014-10-01T00:00:00.000Z/2014-10-07T00:00:00.000Z",
    "2014-11-15T00:00:00.000Z/2014-11-16T00:00:00.000Z"
  ]
}

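2. Query granularity

"granularity": "<query granularity>"

(1) Simple granularities: day, hour, ...

    Simple granularities are specified as a string and bucket timestamps by their UTC time (for example, days start at 00:00 UTC).

    Supported granularity strings are: all, none, second, minute, fifteen_minute, thirty_minute, hour, day, week, month, quarter, year

  • all buckets everything into a single bucket
  • none does not bucket data (it actually uses the granularity of the index; the minimum here is none, which means millisecond granularity). Using none in a timeseries query is currently not recommended (the system will try to generate 0 values for every millisecond that did not exist, which is often a lot).

(2) Duration granularity: duration

{"type": "duration", "duration": 3600000, "origin": "2012-01-01T00:30:00Z"}

(3) Period granularity: period

{"type": "period", "period": "P3M", "timeZone": "America/Los_Angeles", "origin": "2012-02-01T00:00:00-08:00"}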
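
As a sketch of what a duration granularity with an origin does, the Python snippet below floors a timestamp to the start of its bucket, counting fixed-length buckets from the origin. The function duration_bucket is a hypothetical helper written for this illustration; it mirrors the one-hour duration example above (3600000 ms, origin at 00:30 UTC).

```python
from datetime import datetime, timedelta, timezone

def duration_bucket(ts, duration_ms, origin):
    """Floor `ts` to the start of its fixed-length bucket, counted from `origin`."""
    step = timedelta(milliseconds=duration_ms)
    bucket_index = (ts - origin) // step  # floor division, also for ts < origin
    return origin + bucket_index * step

origin = datetime(2012, 1, 1, 0, 30, tzinfo=timezone.utc)
ts = datetime(2012, 1, 1, 3, 10, tzinfo=timezone.utc)

bucket = duration_bucket(ts, 3600000, origin)
print(bucket.isoformat())  # 2012-01-01T02:30:00+00:00
```

Because the origin is 00:30, hourly buckets start at half past the hour, so 03:10 falls into the bucket that starts at 02:30 rather than 03:00.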

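3. Query dimensions

https://druid.apache.org/docs/latest/querying/dimensionspecs.html

4. Aggregations (aggregations)

(1) Count (count): { "type" : "count", "name" : <output_name> }

(2) Sum (longSum, doubleSum, floatSum): { "type" : "floatSum", "name" : <output_name>, "fieldName" : <metric_name> }

(3) Min/max aggregators: doubleMin, doubleMax, floatMin, floatMax, longMin, longMax

(4) Arithmetic mean: doubleMean

(5) First/last aggregators: doubleFirst, doubleLast, floatFirst, floatLast, longFirst, longLast, stringFirst, stringLast

(6) Any aggregators: doubleAny, floatAny, longAny, stringAny

(7) Miscellaneous aggregators: filtered aggregator (filtered)

A filtered aggregator wraps any given aggregator, but only aggregates the values for which the given dimension filter matches.

This makes it possible to compute filtered and unfiltered aggregations at the same time, without issuing multiple queries, and to use both results as part of post-aggregations.

Note: if only the filtered results are needed, consider putting the filter on the query itself. This is much faster because it does not need to scan all the data.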

{
  "type" : "filtered",
  "filter" : {
    "type" : "selector",
    "dimension" : <dimension>,
    "value" : <dimension value>
  },
  "aggregator" : <aggregation>
}
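
The idea of computing a filtered and an unfiltered aggregate in one pass can be sketched in Python as follows. The function sum_with_filtered, the column names, and the rows are made up for illustration; the sketch only models a sum wrapped by a selector filter.

```python
def sum_with_filtered(rows, metric, dimension, value):
    """Sum `metric` over all rows and, in the same pass, over rows where dimension == value."""
    total = 0
    filtered = 0
    for row in rows:
        total += row[metric]
        if row.get(dimension) == value:  # plays the role of the selector filter
            filtered += row[metric]
    return {"total": total, "filtered": filtered}

rows = [
    {"country": "US", "clicks": 4},
    {"country": "DE", "clicks": 6},
    {"country": "US", "clicks": 1},
]

result = sum_with_filtered(rows, "clicks", "country", "US")
print(result)  # {'total': 11, 'filtered': 5}
```

Both values are then available to a post-aggregation, for example to compute the share of US clicks.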

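5. Post-aggregations (postAggregations)

(1) Arithmetic post-aggregator

The arithmetic post-aggregator applies the provided function to the given fields from left to right. The fields can be aggregators or other post-aggregators. Supported functions are +, -, *, /, and quotient.

  • / division always returns 0 when dividing by 0, regardless of the numerator.
  • quotient division behaves like regular floating-point division.

An arithmetic post-aggregator may also specify an ordering, which defines the order of the resulting values when sorting results (this is useful for topN queries, for example):

  • If no ordering (or null) is specified, the default floating-point ordering is used.
  • numericFirst ordering always returns finite values first, followed by NaN, and infinite values last.

postAggregation : {
  "type" : "arithmetic",
  "name" : <output_name>,
  "fn" : <arithmetic_function>,
  "fields": [<post_aggregator>, <post_aggregator>, ...],
  "ordering" : <null (default), or "numericFirst">
}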
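
The left-to-right evaluation and the two division rules described above can be sketched with a small recursive evaluator in Python. The names evaluate and apply_fn are hypothetical, the sample post-aggregator computes a percentage from two made-up aggregates named part and tot, and the quotient branch approximates IEEE floating-point behaviour because plain Python division raises on zero.

```python
import math

def apply_fn(fn, left, right):
    """Apply one arithmetic function following the rules described above."""
    if fn == "+":
        return left + right
    if fn == "-":
        return left - right
    if fn == "*":
        return left * right
    if fn == "/":
        return 0.0 if right == 0 else left / right  # division by zero yields 0
    if fn == "quotient":
        if right == 0:  # emulate floating-point division by zero
            return math.nan if left == 0 else math.copysign(math.inf, left)
        return left / right
    raise ValueError(f"unsupported function: {fn}")

def evaluate(post_agg, row):
    """Evaluate a constant, fieldAccess, or arithmetic post-aggregator on one result row."""
    kind = post_agg["type"]
    if kind == "constant":
        return post_agg["value"]
    if kind == "fieldAccess":
        return row[post_agg["fieldName"]]
    if kind == "arithmetic":
        values = [evaluate(field, row) for field in post_agg["fields"]]
        result = values[0]
        for value in values[1:]:  # left to right
            result = apply_fn(post_agg["fn"], result, value)
        return result
    raise ValueError(f"unsupported post-aggregator type: {kind}")

part_percentage = {
    "type": "arithmetic", "name": "part_percentage", "fn": "*",
    "fields": [
        {"type": "arithmetic", "name": "ratio", "fn": "/",
         "fields": [
             {"type": "fieldAccess", "name": "part", "fieldName": "part"},
             {"type": "fieldAccess", "name": "tot", "fieldName": "tot"},
         ]},
        {"type": "constant", "name": "const", "value": 100},
    ],
}

print(evaluate(part_percentage, {"part": 25.0, "tot": 200.0}))  # 12.5
print(evaluate(part_percentage, {"part": 25.0, "tot": 0.0}))    # 0.0
```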

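(2) Field accessor post-aggregator: fieldAccess

These post-aggregators return the value produced by the specified aggregator. fieldName refers to the output name of an aggregator given in the aggregations section of the query. Use type "fieldAccess" to return the raw aggregation object, or type "finalizingFieldAccess" to return the finalized value.

{ "type" : "fieldAccess", "name": <output_name>, "fieldName" : <aggregator_name> }

(3) Constant post-aggregator: constant

{ "type"  : "constant", "name"  : <output_name>, "value" : <numerical_value> }

(4) Greatest/least post-aggregators: doubleGreatest

doubleGreatest and longGreatest compute the maximum of all fields and Double.NEGATIVE_INFINITY.

The difference between the doubleMax aggregator and the doubleGreatest post-aggregator is that doubleMax returns the highest value of one specific column across all rows, whereas doubleGreatest returns the highest value of multiple columns within one row.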
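
The column-wise versus row-wise distinction can be shown with two lines of Python over a tiny in-memory table. The column names a and b and the values are invented for illustration only.

```python
rows = [
    {"a": 1.0, "b": 5.0},
    {"a": 7.0, "b": 2.0},
]

# doubleMax-style: the highest value of ONE column across all rows.
double_max_a = max(row["a"] for row in rows)

# doubleGreatest-style: the highest value across SEVERAL columns within each row,
# with negative infinity as the starting point.
double_greatest = [max(float("-inf"), row["a"], row["b"]) for row in rows]

print(double_max_a)     # 7.0
print(double_greatest)  # [5.0, 7.0]
```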

Example:

{
  ...
  "aggregations" : [
    { "type" : "doubleSum", "name" : "tot", "fieldName" : "total" },
    { "type" : "doubleSum", "name" : "part", "fieldName" : "part" }
  ],
  "postAggregations" : [{
    "type" : "arithmetic",
    "name" : "part_percentage",
    "fn" : "*",
    "fields" : [
      { "type" : "arithmetic",
        "name" : "ratio",
        "fn" : "/",
        "fields" : [
          { "type" : "fieldAccess", "name" : "part", "fieldName" : "part" },
          { "type" : "fieldAccess", "name" : "tot", "fieldName" : "tot" }
        ]
      },
      { "type" : "constant", "name": "const", "value" : 100 }
    ]
  }]
  ...
}

6. groupBy grouping: having filters

(1) Query filter: filter

{ "queryType": "groupBy", "dataSource": "sample_datasource", ... "having": { "type" : "filter", "filter" : <any Druid query filter> } }

(2) Numeric filters: greaterThan, equalTo, lessThan

"having": { "type": "greaterThan", "aggregation": "<aggregate_metric>", "value": <numeric_value> }

(3) Dimension selector filter: dimSelector

"having": { "type": "dimSelector", "dimension": "<dimension>", "value": <dimension_value> }

(4) Logical expression filters: or, and, not

{
  "queryType": "groupBy",
  "dataSource": "sample_datasource",
  ...
  "having":
    {
      "type": "and",
      "havingSpecs": [
        {
          "type": "greaterThan",
          "aggregation": "<aggregate_metric>",
          "value": <numeric_value>
        },
        {
          "type": "lessThan",
          "aggregation": "<aggregate_metric>",
          "value": <numeric_value>
        }
      ]
    }
}
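
A having clause acts on the grouped result rows, after aggregation. The Python sketch below evaluates the numeric, dimSelector, and logical having specs against result rows held as dicts. The function having_matches and the sample rows are hypothetical; the havingSpec key used for not is an assumption based on the havingSpecs key shown above for and.

```python
def having_matches(spec, row):
    """Evaluate a having spec against one grouped result row (simplified)."""
    kind = spec["type"]
    if kind == "greaterThan":
        return row[spec["aggregation"]] > spec["value"]
    if kind == "lessThan":
        return row[spec["aggregation"]] < spec["value"]
    if kind == "equalTo":
        return row[spec["aggregation"]] == spec["value"]
    if kind == "dimSelector":
        return row.get(spec["dimension"]) == spec["value"]
    if kind == "and":
        return all(having_matches(child, row) for child in spec["havingSpecs"])
    if kind == "or":
        return any(having_matches(child, row) for child in spec["havingSpecs"])
    if kind == "not":
        return not having_matches(spec["havingSpec"], row)  # key name assumed
    raise ValueError(f"unsupported having type: {kind}")

having = {
    "type": "and",
    "havingSpecs": [
        {"type": "greaterThan", "aggregation": "total_usage", "value": 100},
        {"type": "lessThan", "aggregation": "total_usage", "value": 1000},
    ],
}

grouped = [
    {"country": "US", "total_usage": 50},
    {"country": "DE", "total_usage": 500},
    {"country": "FR", "total_usage": 5000},
]

kept = [row for row in grouped if having_matches(having, row)]
print(kept)  # [{'country': 'DE', 'total_usage': 500}]
```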

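7. Virtual columns

Virtual columns are queryable column "views" created from a set of columns during a query.

A virtual column may draw on multiple underlying columns, although it always presents itself as a single column.

Virtual columns can be used as dimensions or as inputs to aggregators.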

{
  "queryType": "scan",
  "dataSource": "page_data",
  "columns": [],
  "virtualColumns": [
    {
      "type": "expression",
      "name": "fooPage",
      "expression": "concat('foo', page)",
      "outputType": "STRING"
    },
    {
      "type": "expression",
      "name": "tripleWordCount",
      "expression": "wordCount * 3",
      "outputType": "LONG"
    }
  ],
  "intervals": [
    "2013-01-01/2019-01-02"
  ]
}
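
To picture what the two expression virtual columns in this query produce, the Python sketch below derives them for a single row. The function add_virtual_columns and the sample row are illustrative assumptions; Druid evaluates such expressions itself at query time.

```python
def add_virtual_columns(row):
    """Return a copy of the row extended with the two derived columns."""
    extended = dict(row)
    extended["fooPage"] = "foo" + str(row["page"])      # string concatenation of 'foo' and page
    extended["tripleWordCount"] = row["wordCount"] * 3  # wordCount * 3
    return extended

row = {"page": "Main_Page", "wordCount": 120}
print(add_virtual_columns(row))
# {'page': 'Main_Page', 'wordCount': 120, 'fooPage': 'fooMain_Page', 'tripleWordCount': 360}
```

The underlying columns are unchanged; the derived values simply appear as additional columns in the result.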

Reference: https://druid.apache.org/docs/latest/design/


Reposted from blog.csdn.net/wzl1217333452/article/details/109551418