A Deep Dive into Elasticsearch Metrics Aggregations (2)

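This article continues the previous post's walkthrough of Elasticsearch metrics aggregations. It covers single-value and multi-value aggregations such as geo bounds, geo centroid, percentiles, and percentile ranks. After reading both posts you should have a solid understanding of Elasticsearch metrics aggregations. The examples reuse the sample data from the previous post, which readers can download to set up a local test environment.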

1. Geolocation

1.1. Geo Bounds Aggregation

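The geo bounds aggregation is useful when you need the geographic extent of your data: it computes the bounding box that contains all values of a geo_point field.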

The sports index in our sample data contains a location field of type geo_point. The following example computes the bounding box of all documents.

GET /sports/_search?size=0
{
  "aggs" : {
    "viewport" : {
      "geo_bounds" : {
          "field" : "location", 
          "wrap_longitude" : true 
      }
    }
  }
}

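The optional wrap_longitude parameter specifies whether the bounding box is allowed to overlap the international date line. The response is as follows: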

{
  "took" : 29,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 22,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "viewport" : {
      "bounds" : {
        "top_left" : {
          "lat" : 46.24999996740371,
          "lon" : -68.85000005364418
        },
        "bottom_right" : {
          "lat" : 45.11999997776002,
          "lon" : -63.58000005595386
        }
      }
    }
  }
}

Plotted on a map, this is a rectangle that encloses all of our data points.
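
To make the idea of a bounding box concrete, here is a minimal Python sketch that computes the same two corners from a list of points. It is only an illustration, not how Elasticsearch implements geo_bounds: the Point type, the bounding_box function, and the sample coordinates are all assumptions made for this example, and the date-line handling controlled by wrap_longitude is deliberately left out.

```python
from typing import Iterable, NamedTuple


class Point(NamedTuple):
    lat: float
    lon: float


def bounding_box(points: Iterable[Point]) -> dict:
    """Return the top-left and bottom-right corners enclosing all points.

    Simplification: longitudes are compared directly, so a box that would
    cross the international date line is not handled here.
    """
    pts = list(points)
    if not pts:
        raise ValueError("at least one point is required")
    lats = [p.lat for p in pts]
    lons = [p.lon for p in pts]
    return {
        # Top-left is the northernmost latitude and westernmost longitude.
        "top_left": {"lat": max(lats), "lon": min(lons)},
        # Bottom-right is the southernmost latitude and easternmost longitude.
        "bottom_right": {"lat": min(lats), "lon": max(lons)},
    }


# Hypothetical coordinates, not the actual documents in the sports index.
sample = [Point(46.25, -68.85), Point(45.12, -63.58), Point(45.60, -66.10)]
box = bounding_box(sample)
print(box)
```

The result has the same shape as the bounds object in the response above, which makes it easy to compare a hand-computed box against what the aggregation returns.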

1.2. Geo Centroid Aggregation

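This aggregation computes the centroid of all values of a geo_point field. In geometry, the centroid is the arithmetic mean position of all points in a shape. Applied to geographic coordinates, the centroid can be viewed as the arithmetic mean of all latitude/longitude pairs in the geo_point field. The centroid is a useful measure when the region has a complex shape. Here is an example: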

GET /sports/_search?size=0
{
  "aggs" : {
      "centroid" : {
          "geo_centroid" : {
              "field" : "location" 
          }
      }
  }
}

This aggregation only requires the geo_point field to be specified. The response:

  "aggregations" : {
    "centroid" : {
      "location" : {
        "lat" : 45.842727022245526,
        "lon" : -68.07818217203021
      },
      "count" : 22
    }
  }

The returned location object is the centroid of all documents.
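
The "arithmetic mean of all latitude/longitude pairs" described above can be sketched in a few lines of Python. This follows the article's description only and is not taken from the Elasticsearch source; the centroid function and the sample coordinates are hypothetical.

```python
from statistics import fmean
from typing import Iterable, Tuple


def centroid(points: Iterable[Tuple[float, float]]) -> dict:
    """Average the latitudes and longitudes of (lat, lon) pairs."""
    pts = list(points)
    if not pts:
        raise ValueError("at least one point is required")
    return {
        "location": {
            "lat": fmean([lat for lat, _ in pts]),
            "lon": fmean([lon for _, lon in pts]),
        },
        # Number of points that contributed to the average.
        "count": len(pts),
    }


# Hypothetical coordinates, not the actual documents in the sports index.
result = centroid([(46.0, -68.0), (45.0, -67.0), (47.0, -69.0)])
print(result)
```

A plain average like this is a reasonable mental model for points clustered in a small area, as in the sample data; it does not account for longitudes that straddle the date line.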

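Of course, geo_centroid can also be used as a sub-aggregation of a bucket aggregation. For example, we can combine a terms aggregation with geo_centroid to find the centroid for the athletes of each sport: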

GET /sports/_search?size=0
{
  "aggs" : {
    "sports" : {
      "terms" : { "field" : "sport" },
      "aggs" : {
        "centroid" : {
            "geo_centroid" : { "field" : "location" }
        }
      }
    }
  }
}

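The documents are first grouped into buckets by the sport field, and the centroid of the location field is then computed for each bucket. The response is as follows: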

"aggregations" : {
    "sports" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "Football",
          "doc_count" : 9,
          "centroid" : {
            "location" : {
              "lat" : 45.74555545113981,
              "lon" : -67.37888912670314
            },
            "count" : 9
          }
        },
        {
          "key" : "Basketball",
          "doc_count" : 5,
          "centroid" : {
            "location" : {
              "lat" : 45.6239999178797,
              "lon" : -68.56200012378395
            },
            "count" : 5
          }
        },
        {
          "key" : "Hockey",
          "doc_count" : 5,
          "centroid" : {
            "location" : {
              "lat" : 45.99199991673231,
              "lon" : -68.57000014744699
            },
            "count" : 5
          }
        },
        {
          "key" : "Handball",
          "doc_count" : 3,
          "centroid" : {
            "location" : {
              "lat" : 46.24999996740371,
              "lon" : -68.55000000447035
            },
            "count" : 3
          }
        }
      ]
    }
  }

2. Percentiles

A percentile is a useful statistical measure: it indicates the value below which a given percentage of the observations in a group fall.

2.1. Percentiles Aggregation

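For example, the 75th percentile is the value below which 75% of the observations fall. Percentiles are often used to find outliers in a data set. In a normal distribution, for instance, the 0.13th and 99.87th percentiles lie three standard deviations from the mean, and any data beyond those bounds is often considered an anomaly. Besides finding outliers, the percentiles aggregation can also help determine whether the data is skewed, bimodal, and so on.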
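
The three-standard-deviation figures quoted above are easy to check. The short sketch below uses Python's standard statistics.NormalDist (it does not involve Elasticsearch) to compute which percentiles correspond to three standard deviations below and above the mean of a standard normal distribution; the variable names are chosen just for this example.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0.0, sigma=1.0)

# The CDF gives the fraction of observations at or below a value,
# so multiplying by 100 turns it into a percentile.
lower_pct = std_normal.cdf(-3.0) * 100
upper_pct = std_normal.cdf(3.0) * 100

print(f"-3 sigma is about the {lower_pct:.2f}th percentile")
print(f"+3 sigma is about the {upper_pct:.2f}th percentile")
```

Rounded to two decimals, these come out to 0.13 and 99.87, matching the figures in the text.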

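Because the percentiles aggregation can return a range of user-specified percentiles, it is a multi-value metrics aggregation. By default, Elasticsearch uses the TDigest algorithm to compute approximate percentiles. There are a few things to keep in mind about this algorithm: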

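The approximation error is proportional to q(1-q), so extreme percentiles (for example, the 99th) are more accurate than less extreme ones such as the median.
For small sets of values, the algorithm is highly accurate.
As the number of values in a bucket grows, the algorithm begins to approximate the percentiles, effectively trading accuracy for memory savings. The exact level of inaccuracy depends on the distribution of your data (for example, whether it is normally distributed) and on the volume of data being aggregated.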
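
To see why the q(1-q) term favors extreme percentiles, the following sketch simply evaluates it for a few percentiles. Reading q(1-q) as a relative error factor is our interpretation of the first point above; the error_factor function is a hypothetical helper and not part of any Elasticsearch API.

```python
def error_factor(percentile: float) -> float:
    """Evaluate q * (1 - q), where q is the percentile as a fraction."""
    if not 0.0 <= percentile <= 100.0:
        raise ValueError("percentile must be between 0 and 100")
    q = percentile / 100.0
    return q * (1.0 - q)


# The factor peaks at the median and shrinks toward the tails.
for p in (50, 75, 95, 99):
    print(f"p{p}: q(1-q) = {error_factor(p):.4f}")
```

The factor is largest at the median (0.25) and falls to roughly 0.01 at the 99th percentile, which is consistent with the claim that tail percentiles are estimated more precisely.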

Elasticsearch also provides an alternative percentile implementation, the HDR Histogram (High Dynamic Range Histogram). It can be faster than the TDigest implementation but uses more memory.

The following example uses the TDigest algorithm to compute percentiles of the goals field:

GET /sports/_search?size=0
{
  "aggs" : {
    "sport_categories":{
      "terms":{"field":"sport"},
      "aggs": {
        "scoring_percentiles" : {
           "percentiles" : {
              "field" : "goals",
              "tdigest": {
                 "compression" : 200 
               }
            }
        }
      }
    } 
  }
}

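By default, the percentiles metric generates the percentiles [1, 5, 25, 50, 75, 95, 99]. We also set the compression parameter, which controls the trade-off between memory usage and approximation error. Increasing the compression value (the default is 100) improves the accuracy of the percentile calculation at the cost of more memory.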

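However, larger compression values also make the algorithm slower. Because our data set is small, the compression value has no noticeable effect here: the percentile calculation is accurate even with the default compression, and the parameter is set only to demonstrate its use. The response is as follows: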

  "aggregations" : {
    "sport_categories" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "Football",
          "doc_count" : 9,
          "scoring_percentiles" : {
            "values" : {
              "1.0" : 34.0,
              "5.0" : 34.0,
              "25.0" : 46.75,
              "50.0" : 53.0,
              "75.0" : 60.25,
              "95.0" : 84.0,
              "99.0" : 84.0
            }
          }
        },
        {
          "key" : "Basketball",
          "doc_count" : 5,
          "scoring_percentiles" : {
            "values" : {
              "1.0" : 848.0,
              "5.0" : 848.0,
              "25.0" : 918.5,
              "50.0" : 1284.0,
              "75.0" : 1366.75,
              "95.0" : 1483.0,
              "99.0" : 1483.0
            }
          }
        },
        {
          "key" : "Hockey",
          "doc_count" : 5,
          "scoring_percentiles" : {
            "values" : {
              "1.0" : 93.0,
              "5.0" : 93.0,
              "25.0" : 108.0,
              "50.0" : 124.0,
              "75.0" : 165.5,
              "95.0" : 218.0,
              "99.0" : 218.0
            }
          }
        },
        {
          "key" : "Handball",
          "doc_count" : 3,
          "scoring_percentiles" : {
            "values" : {
              "1.0" : 143.0,
              "5.0" : 143.0,
              "25.0" : 144.75,
              "50.0" : 150.0,
              "75.0" : 369.75,
              "95.0" : 443.0,
              "99.0" : 443.0
            }
          }
        }
      ]
    }
  }

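These figures are useful for understanding how goals are distributed across the different sports. For example, goals in handball are widely spread: the 1st percentile is 143 and the 99th percentile is 443.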

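If you are only interested in outliers, you can use the aggregation's percents parameter to request specific percentiles. For example, the following query asks for the most extreme percentiles: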

GET /sports/_search?size=0
{
  "aggs" : {
    "sport_categories":{
      "terms":{"field":"sport"},
      "aggs": {
        "scoring_percentiles" : {
         "percentiles" : {
            "field" : "goals",
            "percents" : [1, 99, 99.9]
          }
        }
      }
    } 
  }
}

The response is as follows:

  "aggregations" : {
    "sport_categories" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "Football",
          "doc_count" : 9,
          "scoring_percentiles" : {
            "values" : {
              "1.0" : 34.0,
              "99.0" : 84.0,
              "99.9" : 84.0
            }
          }
        },
        {
          "key" : "Basketball",
          "doc_count" : 5,
          "scoring_percentiles" : {
            "values" : {
              "1.0" : 848.0,
              "99.0" : 1483.0,
              "99.9" : 1483.0
            }
          }
        },
        {
          "key" : "Hockey",
          "doc_count" : 5,
          "scoring_percentiles" : {
            "values" : {
              "1.0" : 93.0,
              "99.0" : 218.0,
              "99.9" : 218.0
            }
          }
        },
        {
          "key" : "Handball",
          "doc_count" : 3,
          "scoring_percentiles" : {
            "values" : {
              "1.0" : 143.0,
              "99.0" : 443.0,
              "99.9" : 443.0
            }
          }
        }
      ]
    }
  }
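
If you build these requests from application code, the two options used in this section (percents and the tdigest compression) can be assembled programmatically. The sketch below only constructs the JSON request body with the standard library and does not call Elasticsearch; the percentiles_body function and its parameter names are hypothetical, while the field names inside the body mirror the requests shown above. Putting size in the body instead of the URL is a choice made for this example.

```python
import json
from typing import Optional, Sequence


def percentiles_body(
    field: str,
    percents: Optional[Sequence[float]] = None,
    compression: Optional[int] = None,
) -> dict:
    """Build a search body with a percentiles aggregation on `field`."""
    agg: dict = {"field": field}
    if percents is not None:
        # Request specific percentiles instead of the default set.
        agg["percents"] = list(percents)
    if compression is not None:
        # Higher compression: more memory, better accuracy.
        agg["tdigest"] = {"compression": compression}
    return {
        "size": 0,  # return aggregation results only, no hits
        "aggs": {"scoring_percentiles": {"percentiles": agg}},
    }


body = percentiles_body("goals", percents=[1, 99, 99.9], compression=200)
print(json.dumps(body, indent=2))
```

Leaving both optional arguments unset produces a body that relies on the defaults described earlier.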

2.2. Percentile Ranks Aggregation

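As explained above, a percentile is the value below which a given percentage of the values fall. For example, if a value is at the 30th percentile, then 30% of the values are below it, and 30 is called its percentile rank. The percentile ranks aggregation determines the percentile rank of specific values. The following example illustrates the difference between the two percentile aggregations: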

GET /sports/_search?size=0
{
  "aggs" : {
    "goal_ranks" : {
      "terms": {"field":"sport"},
      "aggs": {
        "percentile_goals":   {
          "percentile_ranks" : {
             "field" : "goals", 
             "values" : [100,200]
          }
       }
      }
    }
  }
}

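Notice that instead of specifying which percentiles to return, we specify the values whose percentile ranks should be computed. The percentile ranks aggregation can therefore be seen as the inverse of the regular percentiles aggregation. Both aggregations use the same algorithms and approximation rules. The response is as follows: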

  "aggregations" : {
    "goal_ranks" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "Football",
          "doc_count" : 9,
          "percentile_goals" : {
            "values" : {
              "100.0" : 100.0,
              "200.0" : 100.0
            }
          }
        },
        {
          "key" : "Basketball",
          "doc_count" : 5,
          "percentile_goals" : {
            "values" : {
              "100.0" : 0.0,
              "200.0" : 0.0
            }
          }
        },
        {
          "key" : "Hockey",
          "doc_count" : 5,
          "percentile_goals" : {
            "values" : {
              "100.0" : 17.0,
              "200.0" : 100.0
            }
          }
        },
        {
          "key" : "Handball",
          "doc_count" : 3,
          "percentile_goals" : {
            "values" : {
              "100.0" : 0.0,
              "200.0" : 45.22222222222222
            }
          }
        }
      ]
    }
  }

Basketball has a percentile rank of zero for both 100 and 200, because the lowest score for this sport in the index is above 200.
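
The notion of a percentile rank can be illustrated with an exact, non-approximate calculation. The sketch below counts the share of values at or below a threshold. It is a simplified model: Elasticsearch uses an approximate algorithm with interpolation, so its results on small buckets can differ from this count-based definition. The percentile_rank function and the sample values are hypothetical.

```python
from typing import Iterable


def percentile_rank(values: Iterable[float], threshold: float) -> float:
    """Return the percentage of values that are at or below `threshold`."""
    vals = list(values)
    if not vals:
        raise ValueError("at least one value is required")
    at_or_below = sum(1 for v in vals if v <= threshold)
    return 100.0 * at_or_below / len(vals)


# Hypothetical scores, not taken from the sports index.
scores = [10, 20, 30, 40, 50]
for t in (5, 25, 50):
    print(f"rank of {t}: {percentile_rank(scores, t)}")
```

A threshold below every value yields a rank of 0, which is the same situation as the Basketball bucket in the response above.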

3. Sum Aggregation

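Sometimes you need to total all values of a numeric field. Elasticsearch has built-in support for this in the form of the sum aggregation. For example, we can use it to total the goals of all athletes in each sport.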

GET /sports/_search?size=0
{
  "aggs" : {
    "goal_ranks" : {
      "terms": {"field":"sport"},
      "aggs": {
        "sum_of_goals":   {
          "sum" : { "field" : "goals"}
        }
      }
    }
  }
}

This aggregation is straightforward and only requires a numeric field. The response is as follows:

  "aggregations" : {
    "goal_ranks" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "Football",
          "doc_count" : 9,
          "sum_of_goals" : {
            "value" : 494.0
          }
        },
        {
          "key" : "Basketball",
          "doc_count" : 5,
          "sum_of_goals" : {
            "value" : 5885.0
          }
        },
        {
          "key" : "Hockey",
          "doc_count" : 5,
          "sum_of_goals" : {
            "value" : 696.0
          }
        },
        {
          "key" : "Handball",
          "doc_count" : 3,
          "sum_of_goals" : {
            "value" : 736.0
          }
        }
      ]
    }
  }
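
Conceptually, a terms aggregation with a sum sub-aggregation is a group-by followed by a total per group. The following sketch reproduces that logic in plain Python on in-memory dictionaries. It is an analogy rather than Elasticsearch code; the sum_by_group function and the sample documents are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def sum_by_group(
    docs: Iterable[Mapping], group_field: str, value_field: str
) -> Dict[str, float]:
    """Group documents by `group_field` and total `value_field` per group."""
    totals: Dict[str, float] = defaultdict(float)
    for doc in docs:
        totals[doc[group_field]] += doc[value_field]
    return dict(totals)


# Hypothetical documents, not the actual contents of the sports index.
docs = [
    {"sport": "Football", "goals": 50},
    {"sport": "Football", "goals": 34},
    {"sport": "Hockey", "goals": 93},
]
totals = sum_by_group(docs, "sport", "goals")
print(totals)
```

Each key in the result corresponds to one bucket in the response above, and each value to that bucket's sum_of_goals.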

4. Summary

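We have covered some of the most interesting metrics aggregations in Elasticsearch, which are useful both to everyday Elasticsearch users and to professional statisticians. You now have the tools needed to analyze geolocation data, identify outliers, and assess how evenly a data set is distributed.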

neweastsun
