文章目录
  1. 1. Timeseries queries
  2. 2. TopN queries
  3. 3. groupBy Queries
  4. 4. Time Boundary Queries
  5. 5. Search Queries

Druid的查询都通过向可查询节点(Broker、Historcal或者Realtime)发送HTTP REST风格的请求来获得结果。请求使用JSON格式,通常发送到Broker节点。

Druid的查询分为三类:

  • 聚合查询(Aggregation Queries)

Timeseries
TopN
GroupBy

聚合查询中,尽量使用Timeseries和TopN,因为GroupBy虽然最灵活,但是性能最差。

  • 元查询(Metadata Queries)

Time Boundary
Segment Metadata
Datasource Metadata

  • 搜索查询(Search Queries)

[Search] (http://druid.io/docs/0.8.3/querying/searchquery.html)

Druid的查询可以通过Query的唯一标识取消:

DELETE /druid/v2/{queryId}

Timeseries queries

这类查询使用时间序列查询对象,获得JSON对象数组,每个JSON对象表示一个时间序列查询的结果。

每一个时间序列查询有七个主要的部分组成:

属性 描述 是否必须
queryType 这类查询中都为:timeseries,Druid首先检查这个字段来决定如何处理查询语句
dataSource 描述数据源的字符串或者对象
intervals 一个描述ISO-8601时间间隔的JSON对象,定义查询的时间区间
granularity 定义查询结果的时间粒度
filter 一个说明哪些字段参与查询的JSON对象,类似SQL中的where语句
aggregations 定义汇总查询哪些指标
postAggregations 定义汇总数据执行哪些操作
context 定义查询配置参数

一个时间序列查询对象的示例:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
{
"queryType": "timeseries",
"dataSource": "sample_datasource",
"granularity": "day",
"filter": {
"type": "and",
"fields": [
{ "type": "selector", "dimension": "sample_dimension1", "value": "sample_value1" },
{ "type": "or",
"fields": [
{ "type": "selector", "dimension": "sample_dimension2", "value": "sample_value2" },
{ "type": "selector", "dimension": "sample_dimension3", "value": "sample_value3" }
]
}
]
},
"aggregations": [
{ "type": "longSum", "name": "sample_name1", "fieldName": "sample_fieldName1" },
{ "type": "doubleSum", "name": "sample_name2", "fieldName": "sample_fieldName2" }
],
"postAggregations": [
{
"type": "arithmetic",
"name": "sample_divide",
"fn": "/",
"fields": [
{"type": "fieldAccess", "name": "sample_name1", "fieldName": "sample_fieldName1" },
{"type": "fieldAccess", "name": "sample_name2", "fieldName": "sample_fieldName2" }
]
}
],
"intervals": [ "2012-01-01T00:00:00.000/2012-01-03T00:00:00.000" ]
}

这个查询执行后,返回数据如下:

1
2
3
4
5
6
7
8
9
10
[
{
"timestamp": "2012-01-01T00:00:00.000Z",
"result": { "sample_name1": <some_value>, "sample_name2": <some_value>, "sample_divide": <some_value> }
},
{
"timestamp": "2012-01-02T00:00:00.000Z",
"result": { "sample_name1": <some_value>, "sample_name2": <some_value>, "sample_divide": <some_value> }
}
]

数据以粒度为天进行处理,返回两个结果,每个代表一天。如果某一天没有数据,则该处为0。可以在context中设置"skipEmptyBuckets": "true",则不返回空数据的结果。

TopN queries

TopN查询返回一个指定维度上有序的结果数组。在单一维度上,和GroupBy与Ordering的组合效果类似,但是TopN效率更高。

TopN查询内部近似为每个节点对内部数据排序并返回最大的K个结果到broker。K=max(1000, threshold)

topN查询由以下11部分组成:

属性 描述 是否必须
queryType 总是topN
dataSource 描述数据源的字符串或者对象
intervals 一个描述ISO-8601时间间隔的JSON对象,定义查询的时间区间
granularity 定义查询结果的时间粒度
filter 一个说明哪些字段参与查询的JSON对象,类似SQL中的where语句
aggregations 定义汇总查询哪些指标
postAggregations 定义汇总数据执行哪些操作
dimension 用来描述进行top操作选择字段的字符串或对象
threshold 阈值,控制topN中N
metric 用来描述在top列表中排序的字段的字符串或者对象
context 定义查询配置参数

例如:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
{
"queryType": "topN",
"dataSource": "sample_data",
"dimension": "sample_dim",
"threshold": 5,
"metric": "count",
"granularity": "all",
"filter": {
"type": "and",
"fields": [
{
"type": "selector",
"dimension": "dim1",
"value": "some_value"
},
{
"type": "selector",
"dimension": "dim2",
"value": "some_other_val"
}
]
},
"aggregations": [
{
"type": "longSum",
"name": "count",
"fieldName": "count"
},
{
"type": "doubleSum",
"name": "some_metric",
"fieldName": "some_metric"
}
],
"postAggregations": [
{
"type": "arithmetic",
"name": "sample_divide",
"fn": "/",
"fields": [
{
"type": "fieldAccess",
"name": "some_metric",
"fieldName": "some_metric"
},
{
"type": "fieldAccess",
"name": "count",
"fieldName": "count"
}
]
}
],
"intervals": [
"2013-08-31T00:00:00.000/2013-09-03T00:00:00.000"
]
}

返回数据示例:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
[
{
"timestamp": "2013-08-31T00:00:00.000Z",
"result": [
{
"dim1": "dim1_val",
"count": 111,
"some_metrics": 10669,
"average": 96.11711711711712
},
{
"dim1": "another_dim1_val",
"count": 88,
"some_metrics": 28344,
"average": 322.09090909090907
},
{
"dim1": "dim1_val3",
"count": 70,
"some_metrics": 871,
"average": 12.442857142857143
},
{
"dim1": "dim1_val4",
"count": 62,
"some_metrics": 815,
"average": 13.14516129032258
},
{
"dim1": "dim1_val5",
"count": 60,
"some_metrics": 2787,
"average": 46.45
}
]
}
]

由前面描述,topN是取每个节点的topK汇总,因此这个算法得到的排名和结果在数据维度大于1000时都是近似值,维度小于1000的时候得到的是准确的值。

通过服务器参数druid.query.topN.minTopNThreshold可以更改这个K阈值。

但是这样的处理过程可能会导致部分数据缺失。但是想要获得准确的topN需要更多的性能,可以考虑通过两次topN来获取。

groupBy Queries

Note: 如果只是想要获取一段时间范围内的简单的统计,使用更佳性能的Timeseries更好。如果是需要对一个维度进行有序的分组,使用topN更好。

groupBy查询语句有11部分组成:

属性 描述 是否必须
queryType 总是topN
dataSource 描述数据源的字符串或者对象
dimensions 需要分组的维度列表
limitSpec 分组结果的数量限制
having 分组语句中判断一个数据项是否被返回的条件
granularity 定义查询结果的时间粒度
filter 一个说明哪些字段参与查询的JSON对象,类似SQL中的where语句
aggregations 定义汇总查询哪些指标
postAggregations 定义汇总数据执行哪些操作
intervals 一个描述ISO-8601时间间隔的JSON对象,定义查询的时间区间
context 定义查询配置参数

比如:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
{
"queryType": "groupBy",
"dataSource": "sample_datasource",
"granularity": "day",
"dimensions": ["country", "device"],
"limitSpec": { "type": "default", "limit": 5000, "columns": ["country", "data_transfer"] },
"filter": {
"type": "and",
"fields": [
{ "type": "selector", "dimension": "carrier", "value": "AT&T" },
{ "type": "or",
"fields": [
{ "type": "selector", "dimension": "make", "value": "Apple" },
{ "type": "selector", "dimension": "make", "value": "Samsung" }
]
}
]
},
"aggregations": [
{ "type": "longSum", "name": "total_usage", "fieldName": "user_count" },
{ "type": "doubleSum", "name": "data_transfer", "fieldName": "data_transfer" }
],
"postAggregations": [
{ "type": "arithmetic",
"name": "avg_usage",
"fn": "/",
"fields": [
{ "type": "fieldAccess", "fieldName": "data_transfer" },
{ "type": "fieldAccess", "fieldName": "total_usage" }
]
}
],
"intervals": [ "2012-01-01T00:00:00.000/2012-01-03T00:00:00.000" ],
"having": {
"type": "greaterThan",
"aggregation": "total_usage",
"value": 100
}
}

这个查询将会返回时间段内每一天中,m*n个数据,最多是5000个。n表示country的基数,m表示device的基数。返回数据如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
[ 
{
"version" : "v1",
"timestamp" : "2012-01-01T00:00:00.000Z",
"event" : {
"country" : <some_dim_value_one>,
"device" : <some_dim_value_two>,
"total_usage" : <some_value_one>,
"data_transfer" :<some_value_two>,
"avg_usage" : <some_avg_usage_value>
}
},
{
"version" : "v1",
"timestamp" : "2012-01-01T00:00:12.000Z",
"event" : {
"dim1" : <some_other_dim_value_one>,
"dim2" : <some_other_dim_value_two>,
"sample_name1" : <some_other_value_one>,
"sample_name2" :<some_other_value_two>,
"avg_usage" : <some_other_avg_usage_value>
}
},
...
]

Time Boundary Queries

时间界限查询。该查询实例如下:

1
2
3
4
5
{
"queryType" : "timeBoundary",
"dataSource": "sample_datasource",
"bound" : < "maxTime" | "minTime" > # optional, defaults to returning both timestamps if not set
}

返回的数据如下:

1
2
3
4
5
6
7
[ {
"timestamp" : "2013-05-09T18:24:00.000Z",
"result" : {
"minTime" : "2013-05-09T18:24:00.000Z",
"maxTime" : "2013-05-09T18:37:00.000Z"
}
} ]

Search Queries

搜索查询包含以下部分:

属性 描述 是否必须
queryType 总是search
dataSource 描述数据源的字符串或者对象
granularity 定义查询结果的时间粒度
filter 一个说明哪些字段参与查询的JSON对象,类似SQL中的where语句
intervals 一个描述ISO-8601时间间隔的JSON对象,定义查询的时间区间
searchDimensions 进行搜索查询的维度。如果不存在这个参数则表示在所有参数上进行
query 定义怎么匹配,有InsensitiveContainsSearchQuerySpec和FragmentSearchQuerySpec两种方式
sort 指定结果的排序方式。有lexicographic(默认)和strlen两种方式
context 定义查询配置参数

查询示例如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
{
"queryType": "search",
"dataSource": "sample_datasource",
"granularity": "day",
"searchDimensions": [
"dim1",
"dim2"
],
"query": {
"type": "insensitive_contains",
"value": "Ke"
},
"sort" : {
"type": "lexicographic"
},
"intervals": [
"2013-01-01T00:00:00.000/2013-01-03T00:00:00.000"
]
}

返回数据如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
[
{
"timestamp": "2012-01-01T00:00:00.000Z",
"result": [
{
"dimension": "dim1",
"value": "Ke$ha"
},
{
"dimension": "dim2",
"value": "Ke$haForPresident"
}
]
},
{
"timestamp": "2012-01-02T00:00:00.000Z",
"result": [
{
"dimension": "dim1",
"value": "SomethingThatContainsKe"
},
{
"dimension": "dim2",
"value": "SomethingElseThatContainsKe"
}
]
}
]
文章目录
  1. 1. Timeseries queries
  2. 2. TopN queries
  3. 3. groupBy Queries
  4. 4. Time Boundary Queries
  5. 5. Search Queries