Elasticsearch Aggregation Analysis in Practice (1)

This article walks through Elasticsearch aggregations using practical examples.

1. Introduction to Aggregations

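Aggregations fall into two main categories: metrics aggregations and bucket aggregations. Other types are not covered in this article.
A metrics aggregation computes values (such as an average) over a set of documents; a bucket aggregation groups documents according to a grouping criterion.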
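
To make the distinction concrete, here is a minimal plain-Python sketch. It is not Elasticsearch code, and the three documents are invented for illustration only. It computes one metric over a document set and, separately, groups the same documents into buckets.

```python
from collections import defaultdict

# Hypothetical documents, invented for this illustration only.
docs = [
    {"sport": "Baseball", "rating": 4},
    {"sport": "Baseball", "rating": 2},
    {"sport": "Golf", "rating": 6},
]

# Metrics aggregation: reduce the whole document set to a single value.
avg_rating = sum(d["rating"] for d in docs) / len(docs)

# Bucket aggregation: assign each document to a group.
buckets = defaultdict(list)
for d in docs:
    buckets[d["sport"]].append(d)

# Each bucket reports how many documents it holds.
doc_counts = {key: len(members) for key, members in buckets.items()}

print(avg_rating, doc_counts)
```

The metric yields a number, while the buckets yield groups of documents that can be analyzed further.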

1.1. Sample Data

Create the sports index. The name and sport fields are mapped as keyword so that they can be aggregated on as exact terms.

PUT sports
{
   "mappings": {
     "properties": {
        "birthdate": {
           "type": "date",
           "format": "date_optional_time"
        },
        "location": {
           "type": "geo_point"
        },
        "name": {
           "type": "keyword"
        },
        "rating": {
           "type": "integer"
        },
        "sport": {
           "type": "keyword"
        }
     }
  }
}

Bulk-insert the data:

POST /sports/_bulk
{"index":{}}
{"name":"Michael", "birthdate":"1989-10-1", "sport":"Baseball", "rating": ["5", "4"],  "location":"46.22,-68.45"}
{"index":{}}
{"name":"Bob", "birthdate":"1989-11-2", "sport":"Baseball", "rating": ["3", "4"],  "location":"45.21,-68.35"}
{"index":{}}
{"name":"Jim", "birthdate":"1988-10-3", "sport":"Baseball", "rating": ["3", "2"],  "location":"45.16,-63.58" }
{"index":{}}
{"name":"Joe", "birthdate":"1992-5-20", "sport":"Baseball", "rating": ["4", "3"],  "location":"45.22,-68.53"}
{"index":{}}
{"name":"Tim", "birthdate":"1992-2-28", "sport":"Baseball", "rating": ["3", "3"],  "location":"46.22,-68.85"}
{"index":{}}
{"name":"Alfred", "birthdate":"1990-9-9", "sport":"Baseball", "rating": ["2", "2"],  "location":"45.12,-68.35"}
{"index":{}}
{"name":"Jeff", "birthdate":"1990-4-1", "sport":"Baseball", "rating": ["2", "3"], "location":"46.12,-68.55"}
{"index":{}}
{"name":"Will", "birthdate":"1988-3-1", "sport":"Baseball", "rating": ["4", "4"], "location":"46.25,-68.55" }
{"index":{}}
{"name":"Mick", "birthdate":"1989-10-1", "sport":"Baseball", "rating": ["3", "4"],  "location":"46.22,-68.45"}
{"index":{}}
{"name":"Pong", "birthdate":"1989-11-2", "sport":"Baseball", "rating": ["1", "3"],  "location":"45.21,-68.35"}
{"index":{}}
{"name":"Ray", "birthdate":"1988-10-3", "sport":"Baseball", "rating": ["2", "2"],  "location":"45.16,-63.58" }
{"index":{}}
{"name":"Ping", "birthdate":"1992-5-20", "sport":"Baseball", "rating": ["4", "3"],  "location":"45.22,-68.53"}
{"index":{}}
{"name":"Duke", "birthdate":"1992-2-28", "sport":"Baseball", "rating": ["5", "2"],  "location":"46.22,-68.85"}
{"index":{}}
{"name":"Hal", "birthdate":"1990-9-9", "sport":"Baseball", "rating": ["4", "2"],  "location":"45.12,-68.35"}
{"index":{}}
{"name":"Charge", "birthdate":"1990-4-1", "sport":"Baseball", "rating": ["3", "2"], "location":"46.12,-68.55"}
{"index":{}}
{"name":"Barry", "birthdate":"1988-3-1", "sport":"Baseball", "rating": ["5", "2"], "location":"46.25,-68.55" }
{"index":{}}
{"name":"Bank", "birthdate":"1988-3-1", "sport":"Golf", "rating": ["6", "4"], "location":"46.25,-68.55" }
{"index":{}}
{"name":"Bingo", "birthdate":"1988-3-1", "sport":"Golf", "rating": ["10", "7"], "location":"46.25,-68.55" }
{"index":{}}
{"name":"James", "birthdate":"1988-3-1", "sport":"Basketball", "rating": ["10", "8"], "location":"46.25,-68.55" }
{"index":{}}
{"name":"Wayne", "birthdate":"1988-3-1", "sport":"Hockey", "rating": ["10", "10"], "location":"46.25,-68.55" }
{"index":{}}
{"name":"Brady", "birthdate":"1988-3-1", "sport":"Football", "rating": ["10", "10"], "location":"46.25,-68.55" }
{"index":{}}
{"name":"Lewis", "birthdate":"1988-3-1", "sport":"Football", "rating": ["10", "10"], "location":"46.25,-68.55" }
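
The bulk body is newline-delimited JSON: one action line followed by one source line per document, and the body must end with a newline. As a rough sketch, such a body could be generated from Python dictionaries as follows. The bulk_body helper is a hypothetical name, not part of any Elasticsearch client.

```python
import json

def bulk_body(docs):
    """Build a newline-delimited _bulk body that indexes each document.

    bulk_body is an illustrative helper, not a client library function.
    """
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {}}))  # action line
        lines.append(json.dumps(doc))            # source line
    return "\n".join(lines) + "\n"               # trailing newline is required

# Two documents copied from the sample data above.
body = bulk_body([
    {"name": "Michael", "birthdate": "1989-10-1", "sport": "Baseball",
     "rating": ["5", "4"], "location": "46.22,-68.45"},
    {"name": "Bob", "birthdate": "1989-11-2", "sport": "Baseball",
     "rating": ["3", "4"], "location": "45.21,-68.35"},
])
print(body)
```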

1.2. Syntax

The general structure of an aggregation request looks like this:

"aggregations" : {
    "<aggregation_name>" : {
        "<aggregation_type>" : { 
            <aggregation_body>
        }
        [,"aggregations" : { [<sub_aggregation>]+ } ]?
    }
    [,"<aggregation_name_2>" : { ... } ]*
}

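The aggregations keyword can also be written as aggs. Each aggregation has three parts: a name, a type, and a body. <aggregation_name> is a user-defined name that uniquely identifies the aggregation in the response.

<aggregation_type> is typically the first key inside the aggregation and determines its type, for example terms, stats, or geo_distance.

<aggregation_body> is defined inside <aggregation_type> and specifies the properties the aggregation needs; each aggregation type has its own set of properties.

Two parts are optional. Sub-aggregations can be supplied to analyze the results of the parent aggregation, and additional aggregations (aggregation_name_2) can be added to the same request as independent top-level aggregations. There is no limit on nesting depth, but aggregations cannot be nested under a metrics aggregation.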
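
The name/type/body structure maps directly onto nested dictionaries. The following sketch assembles a request body in Python. The make_agg helper and the aggregation names by_sport, rating_stats and sport_count are hypothetical, chosen only for illustration.

```python
import json

def make_agg(agg_type, body, sub_aggs=None):
    """Return one aggregation definition: {<type>: <body>[, "aggs": {...}]}.

    make_agg is an illustrative helper, not an Elasticsearch API.
    """
    agg = {agg_type: body}
    if sub_aggs:
        agg["aggs"] = sub_aggs  # optional sub-aggregations
    return agg

request = {
    "size": 0,
    "aggs": {
        # Bucket aggregation with a metrics sub-aggregation.
        "by_sport": make_agg(
            "terms", {"field": "sport"},
            sub_aggs={"rating_stats": make_agg("stats", {"field": "rating"})},
        ),
        # A second, independent top-level aggregation.
        "sport_count": make_agg("value_count", {"field": "sport"}),
    },
}
print(json.dumps(request, indent=2))
```

The printed JSON has the same shape as the request bodies used in the rest of this article.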

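1.3. Value Sources

Some aggregations operate on values taken from the aggregated documents. These values can come from a specific document field or be generated per document by a script. In the example below, the terms aggregation buckets on the name field, but its order is based on the value of the rating_avg sub-aggregation. In other words, a nested metrics aggregation is used to sort the buckets of its parent bucket aggregation.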

POST /sports/_search
{
  "size": 0,
  "aggs": {
    "the_name": {
       "terms": {
          "field": "name",
          "order": {
             "rating_avg": "desc"
          }
       },
       "aggs": {
          "rating_avg": {
             "avg": {
                "field": "rating"
             }
          }
       }
    }
  }
}
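
As a rough local illustration of what this request computes, the sketch below reproduces the logic in plain Python for four documents taken from the sample data. It only mimics the grouping and sorting; it does not call Elasticsearch.

```python
from collections import defaultdict

# Four documents from the sample data (ratings written as integers).
docs = [
    {"name": "Michael", "rating": [5, 4]},
    {"name": "Alfred", "rating": [2, 2]},
    {"name": "James", "rating": [10, 8]},
    {"name": "Wayne", "rating": [10, 10]},
]

# terms on "name": collect every rating value seen in each bucket.
values_by_name = defaultdict(list)
for d in docs:
    values_by_name[d["name"]].extend(d["rating"])

# avg sub-aggregation: one value per bucket.
rating_avg = {name: sum(v) / len(v) for name, v in values_by_name.items()}

# "order": {"rating_avg": "desc"} sorts the buckets by that value.
ordered = sorted(rating_avg.items(), key=lambda kv: kv[1], reverse=True)
print(ordered)
```

Keep in mind that terms returns only the top 10 buckets by default (its size parameter), so the real request lists 10 of the 22 names.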

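1.4. Multiple Top-Level Aggregations

This request defines two top-level aggregations, the_name and type_cnt. The the_name aggregation also contains the sub-aggregation rating_avg.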

POST /sports/_search
{
  "size": 0,
  "aggs": {
    "the_name": {
       "terms": {
          "field": "name",
          "order": {
             "rating_avg": "desc"
          }
       },
       "aggs": {
          "rating_avg": {
             "avg": {
                "field": "rating"
             }
          }
       }
    },
    "type_cnt":{
      "terms": {
        "field": "sport"
      }
    }
  }
}

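2. Metrics Aggregations

A metrics aggregation computes a metric over a whole set of documents. It can produce a single value (such as avg) or several values (such as stats). A simple metrics aggregation is value_count, which returns the total number of values of a given field. The example below returns the number of sport values.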

POST /sports/_search
{
   "size": 0,
   "aggs": {
      "sport_count": {
         "value_count": {
            "field": "sport"
         }
      }
   }
}

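Note that value_count counts all values, not distinct values. Every document here has exactly one sport value, so the result equals the number of indexed documents (22).
A metrics aggregation cannot be nested inside another metrics aggregation, which would not be meaningful anyway. Nesting a metrics aggregation inside a bucket aggregation, however, is very useful. Later sections cover this, but first we need to look at bucket aggregations.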
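
The difference between counting values and counting distinct values can be checked locally. This sketch rebuilds the list of sport values from the sample data (16 Baseball, 2 Golf, 1 Basketball, 1 Hockey, 2 Football) in plain Python.

```python
# The sport values of the 22 sample documents.
sports = (["Baseball"] * 16 + ["Golf"] * 2 + ["Basketball"]
          + ["Hockey"] + ["Football"] * 2)

value_count = len(sports)          # what value_count reports: every value
distinct_count = len(set(sports))  # number of unique values

print(value_count, distinct_count)
```

For a distinct count inside Elasticsearch itself, the cardinality aggregation is the usual choice; its result is approximate.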

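3. Bucket Aggregations

A bucket aggregation is a mechanism for grouping documents. Each bucket type has its own way of classifying documents; the simplest is the terms aggregation. The example below groups documents by the value of the sport field and counts each group, much like GROUP BY with COUNT in SQL.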

POST /sports/_search
{
   "size": 0,
   "aggregations": {
      "sport": {
         "terms": {
            "field": "sport"
         }
      }
   }
}

Response:

{
  ......
  "aggregations" : {
    "sport" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "Baseball",
          "doc_count" : 16
        },
        {
          "key" : "Football",
          "doc_count" : 2
        },
        {
          "key" : "Golf",
          "doc_count" : 2
        },
        {
          "key" : "Basketball",
          "doc_count" : 1
        },
        {
          "key" : "Hockey",
          "doc_count" : 1
        }
      ]
    }
  }
}
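
A client typically flattens the buckets array into a simpler structure. The sketch below does this in Python for the response shown above; the response dictionary is copied from that output, with the elided part left out.

```python
# The "aggregations" part of the response above, as a Python dict.
response = {
    "aggregations": {
        "sport": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {"key": "Baseball", "doc_count": 16},
                {"key": "Football", "doc_count": 2},
                {"key": "Golf", "doc_count": 2},
                {"key": "Basketball", "doc_count": 1},
                {"key": "Hockey", "doc_count": 1},
            ],
        }
    }
}

agg = response["aggregations"]["sport"]

# Map each bucket key to its document count.
counts = {b["key"]: b["doc_count"] for b in agg["buckets"]}

# sum_other_doc_count covers documents in buckets that were not returned.
total = sum(counts.values()) + agg["sum_other_doc_count"]
print(counts, total)
```

Here sum_other_doc_count is 0, so the five buckets account for all 22 documents.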

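The geo_distance aggregation is more interesting. It has many options, but in the simplest case you specify an origin and a distance range, and it counts how many documents fall inside that ring. The example below counts the documents within 20 miles of the point 46.12,-68.55: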

POST /sports/_search
{
   "size": 0,
   "aggregations": {
      "baseball_player_ring": {
         "geo_distance": {
            "field": "location",
            "origin": "46.12,-68.55",
            "unit": "mi",
            "ranges": [
               {
                  "from": 0,
                  "to": 20
               }
            ]
         }
      }
   }
}

Response:

{
  ......
  "aggregations" : {
    "baseball_player_ring" : {
      "buckets" : [
        {
          "key" : "*-20.0",
          "from" : 0.0,
          "to" : 20.0,
          "doc_count" : 14
        }
      ]
    }
  }
}
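
The count of 14 can be sanity-checked outside Elasticsearch. The sketch below uses the haversine formula with an approximate mean Earth radius, which is an assumption of this example and not necessarily the exact computation Elasticsearch performs. The locations dictionary lists each distinct location in the sample data with the number of documents at it.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MI = 3958.8  # approximate mean Earth radius in miles (assumption)

def haversine_mi(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/lon points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MI * asin(sqrt(a))

origin = (46.12, -68.55)

# Distinct locations in the sample data -> number of documents at each.
locations = {
    (46.22, -68.45): 2,
    (45.21, -68.35): 2,
    (45.16, -63.58): 2,
    (45.22, -68.53): 2,
    (46.22, -68.85): 2,
    (45.12, -68.35): 2,
    (46.12, -68.55): 2,
    (46.25, -68.55): 8,
}

# A range includes "from" and excludes "to".
in_ring = sum(n for (lat, lon), n in locations.items()
              if 0 <= haversine_mi(*origin, lat, lon) < 20)
print(in_ring)
```

None of the sample locations lies close to the 20-mile boundary, so small differences in the distance formula do not change the count.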

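4. Nesting Aggregations

The most powerful feature of bucket aggregations is that they can be nested. You first define a top-level bucket aggregation, then define a second-level aggregation inside it that operates on each of the parent's buckets. Nesting can go as many levels deep as needed.

Continuing the example above, first find the documents within the given distance, then split them into those born before 1990 and those born in 1990 or later: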

POST /sports/_search
{
   "size": 0,
   "aggs": {
      "baseball_player_ring": {
         "geo_distance": {
            "field": "location",
            "origin": "46.12,-68.55",
            "unit": "mi",
            "ranges": [
               {
                  "from": 0,
                  "to": 20
               }
            ]
         },
         "aggs": {
            "ring_age_ranges": {
               "range": {
                 "field": "birthdate", 
                  "ranges": [
                      {"key":"~90", "to": "1990-1-1"},
                      {"key":"90~", "from": "1990-1-1" }
                  ]
               }
            }
         }
      }
   }
}

Response:

  "aggregations" : {
    "baseball_player_ring" : {
      "buckets" : [
        {
          "key" : "*-20.0",
          "from" : 0.0,
          "to" : 20.0,
          "doc_count" : 14,
          "ring_age_ranges" : {
            "buckets" : [
              {
                "key" : "~90",
                "doc_count" : 10
              },
              {
                "key" : "90~",
                "doc_count" : 4
              }
            ]
          }
        }
      ]
    }
  }
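
The split into the two age buckets can be reproduced with a few lines of Python. The birthdates dictionary holds the birthdates of the 14 sample documents that fall inside the ring, with the number of documents per date. The parse helper is an illustrative name.

```python
from datetime import date

def parse(text):
    """Parse a date such as '1990-4-1' (month and day may be one digit)."""
    year, month, day = (int(part) for part in text.split("-"))
    return date(year, month, day)

cutoff = parse("1990-1-1")

# birthdate -> number of documents, for the 14 documents inside the ring.
birthdates = {"1989-10-1": 2, "1992-2-28": 2, "1990-4-1": 2, "1988-3-1": 8}

counts = {"~90": 0, "90~": 0}
for raw, n in birthdates.items():
    # "to" is exclusive and "from" is inclusive, so the cutoff date
    # itself would belong to the "90~" bucket.
    key = "~90" if parse(raw) < cutoff else "90~"
    counts[key] += n

print(counts)
```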

Next, we apply stats, a multi-value metrics aggregation, to the innermost buckets.

POST /sports/_search
{
   "size": 0,
   "aggs": {
      "baseball_player_ring": {
         "geo_distance": {
            "field": "location",
            "origin": "46.12,-68.55",
            "unit": "mi",
            "ranges": [
               {
                  "from": 0,
                  "to": 20
               }
            ]
         },
         "aggs": {
            "ring_age_ranges": {
               "range": {
                 "field": "birthdate", 
                  "ranges": [
                      {"key":"~90", "to": "1990-1-1"},
                      {"key":"90~", "from": "1990-1-1" }
                  ]
               },
               "aggs": {
                  "rating_stats": {
                     "stats": {
                        "field": "rating"
                     }
                  }
               }
            }
         }
      }
   }
}

Response:

{
  ......
  "aggregations" : {
    "baseball_player_ring" : {
      "buckets" : [
        {
          "key" : "*-20.0",
          "from" : 0.0,
          "to" : 20.0,
          "doc_count" : 14,
          "ring_age_ranges" : {
            "buckets" : [
              {
                "key" : "~90",
                "doc_count" : 10,
                "rating_stats" : {
                  "count" : 20,
                  "min" : 2.0,
                  "max" : 10.0,
                  "avg" : 6.8,
                  "sum" : 136.0
                }
              },
              {
                "key" : "90~",
                "doc_count" : 4,
                "rating_stats" : {
                  "count" : 8,
                  "min" : 2.0,
                  "max" : 5.0,
                  "avg" : 2.875,
                  "sum" : 23.0
                }
              }
            ]
          }
        }
      ]
    }
  }
}
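
Notice that count in rating_stats is twice doc_count, because rating is a multi-valued field and stats looks at every value. The following Python sketch recomputes the figures for the 90~ bucket from the four sample documents it contains (Tim, Duke, Jeff and Charge).

```python
# Ratings of the four documents in the "90~" bucket, as stored in the
# sample data (strings that the integer mapping coerces to numbers).
ratings = {
    "Tim": ["3", "3"],
    "Duke": ["5", "2"],
    "Jeff": ["2", "3"],
    "Charge": ["3", "2"],
}

# Flatten all values: stats works on values, not on documents.
values = [int(v) for pair in ratings.values() for v in pair]

stats = {
    "count": len(values),
    "min": min(values),
    "max": max(values),
    "avg": sum(values) / len(values),
    "sum": sum(values),
}
print(stats)
```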

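As this shows, buckets can be nested inside buckets to build quite complex analyses.

5. Summary

This article introduced Elasticsearch aggregations. It covered the aggregation syntax and, mainly through examples, metrics aggregations, bucket aggregations, and nesting.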

Reposted from blog.csdn.net/neweastsun/article/details/104298675