Understanding Elasticsearch Pipeline Aggregations in Depth (1)

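Metric and bucket aggregations work directly on document fields (typically numeric ones), whereas the pipeline aggregations discussed here work on the output of other aggregations. In other words, they operate on intermediate values rather than on the original document data. This makes them very useful for computing more complex statistics and mathematical measures such as cumulative sums, derivatives (rates of change), and moving averages.

This article covers the two basic types of pipeline aggregation and walks through examples of commonly used ones: sum, cumulative sum, minimum, maximum, average, and derivative.

1. Types of pipeline aggregations

Pipeline aggregations fall into two families: parent and sibling.
A parent pipeline aggregation works on the output of its parent aggregation: it takes the values of that aggregation, computes new buckets or new aggregations, and adds them to the buckets that already exist. Derivative and cumulative sum are two common examples.

A sibling pipeline aggregation, by contrast, works on the output of a sibling aggregation. It takes that output and computes a new aggregation at the same level as the sibling.

A pipeline aggregation needs the path to the parent or sibling aggregation it reads from. The buckets_path parameter references that aggregation and the metric to use. The parameter follows this syntax: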

AGG_SEPARATOR       =  '>' ;
METRIC_SEPARATOR    =  '.' ;
AGG_NAME            =  <the name of the aggregation> ;
METRIC              =  <the name of the metric (in case of multi-value metrics aggregation)> ;
PATH                =  <AGG_NAME> [ <AGG_SEPARATOR>, <AGG_NAME> ]* [ <METRIC_SEPARATOR>, <METRIC> ] ;

For example, my_bucket>my_stats.sum refers to the sum value of the my_stats metric, which is contained in the my_bucket bucket aggregation.
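To make the grammar concrete, here is a minimal Python sketch that splits a path into its aggregation names and optional metric. The names BucketsPath and parse_buckets_path are hypothetical helpers invented for this illustration; they are not part of Elasticsearch or any client library, and the sketch handles only the grammar quoted above.

```python
from typing import List, NamedTuple, Optional


class BucketsPath(NamedTuple):
    aggs: List[str]          # aggregation names, outermost first
    metric: Optional[str]    # metric of a multi-value aggregation, else None


def parse_buckets_path(path: str) -> BucketsPath:
    """Split a buckets_path using the two separators from the grammar."""
    aggs = path.split(">")                      # AGG_SEPARATOR
    metric = None
    if "." in aggs[-1]:                         # METRIC_SEPARATOR, last element only
        aggs[-1], metric = aggs[-1].split(".", 1)
    return BucketsPath(aggs, metric)


print(parse_buckets_path("my_bucket>my_stats.sum"))
print(parse_buckets_path("the_sum"))
```

The first call yields the aggregation chain my_bucket, my_stats with metric sum; the second yields a single aggregation name and no metric, which is the form used by parent pipeline aggregations.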
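Note that paths are relative to the position of the pipeline aggregation, so a path cannot go back up the aggregation tree. For example, the derivative pipeline aggregation below is embedded in a date_histogram and references the sibling metric the_sum: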

{
    "aggs": {
        "total_monthly_visits":{
            "date_histogram":{
                "field":"date",
                "interval":"month"
            },
            "aggs":{
                "the_sum":{
                    "sum":{ "field": "visits" } 
                },
                "the_derivative":{
                    "derivative":{ "buckets_path": "the_sum" } 
                }
            }
        }
    }
}

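A sibling pipeline aggregation can also be placed next to a series of buckets rather than embedded inside them. In that case, to reach the required metric, specify the full path including the parent aggregation: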

{
  "aggs": {
    "visits_per_month": {
      "date_histogram": {
        "field": "date",
        "interval": "month"
      },
      "aggs": {
        "total_visits": {
          "sum": {
            "field": "visits"
          }
        }
      }
    },
    "avg_monthly_visits": {
      "avg_bucket": {
        "buckets_path": "visits_per_month>total_visits" 
      }
    }
  }
}

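In the example above, we reference the total_visits aggregation through its parent date histogram visits_per_month, so the full path is visits_per_month>total_visits.

An important point to remember is that pipeline aggregations cannot have sub-aggregations. Some of them, such as derivative, can however reference other pipeline aggregations in their buckets_path, which makes it possible to chain several pipeline aggregations. For example, we can chain two first-order derivatives to compute a second-order derivative (the derivative of a derivative, that is, the rate of change of the rate of change).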

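Metric and bucket aggregations handle missing data with the missing parameter. Pipeline aggregations use the gap_policy parameter to deal with gaps, for example when documents do not contain the required field, or when no document matches the query for one or more buckets. The parameter supports the following policies:

  • skip

Treats missing data as if the bucket did not exist. With this policy the aggregation skips the empty bucket and continues with the next available value.

  • insert_zeros

Replaces every missing value with 0, and the pipeline computation then proceeds as normal.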
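The difference between the two policies is easiest to see on an average across buckets. The following Python sketch is a simplified model of that behavior, not Elasticsearch's implementation: avg_bucket here is a hypothetical local function, None stands for an empty bucket, and the input values are made up for illustration.

```python
from typing import List, Optional


def avg_bucket(values: List[Optional[float]], gap_policy: str) -> Optional[float]:
    """Average per-bucket metric values; None marks an empty bucket (a gap)."""
    if gap_policy == "skip":
        kept = [v for v in values if v is not None]         # gaps are ignored
    elif gap_policy == "insert_zeros":
        kept = [0.0 if v is None else v for v in values]    # gaps count as 0
    else:
        raise ValueError(f"unknown gap_policy: {gap_policy}")
    return sum(kept) / len(kept) if kept else None


monthly = [100.0, None, 200.0]              # hypothetical data, middle bucket is empty
print(avg_bucket(monthly, "skip"))          # 150.0
print(avg_bucket(monthly, "insert_zeros"))  # 100.0
```

With skip the empty month does not count toward the denominator; with insert_zeros it does, which pulls the average down.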

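2. Hands-on examples

Test environment: Elasticsearch 7.x, Kibana 7.x. The requests below use the interval parameter of date_histogram, which works on 7.x but has been deprecated since 7.2 in favor of calendar_interval (or fixed_interval).

2.1. Preparing the test environment

Create the following index; its mapping has three fields: date, visits, and max_time_spent.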

PUT /traffic_stats
{
 "mappings": {
       "properties": {
          "date": {
             "type": "date",
             "format": "dateOptionalTime"
          },
          "visits": {
             "type": "integer"
          },
           "max_time_spent": {
               "type": "integer"
           }
       }
    }
}

Insert the test data:

POST /traffic_stats/_bulk
{"index":{}}
{"visits":"488", "date":"2018-10-1", "max_time_spent":"900"}
{"index":{}}
{"visits":"783", "date":"2018-10-6", "max_time_spent":"928"}
{"index":{}}
{"visits":"789", "date":"2018-10-12", "max_time_spent":"1834"}
{"index":{}}
{"visits":"1299", "date":"2018-11-3", "max_time_spent":"592"}
{"index":{}}
{"visits":"394", "date":"2018-11-6", "max_time_spent":"1249"}
{"index":{}}
{"visits":"448", "date":"2018-11-24", "max_time_spent":"874"}
{"index":{}}
{"visits":"768", "date":"2018-12-18", "max_time_spent":"876"}
{"index":{}}
{"visits":"1194", "date":"2018-12-24", "max_time_spent":"1249"}
{"index":{}}
{"visits":"987", "date":"2018-12-28", "max_time_spent":"1599"}
{"index":{}}
{"visits":"872", "date":"2019-01-1", "max_time_spent":"828"}
{"index":{}}
{"visits":"972", "date":"2019-01-5", "max_time_spent":"723"}
{"index":{}}
{"visits":"827", "date":"2019-02-5", "max_time_spent":"1300"}
{"index":{}}
{"visits":"1584", "date":"2019-02-15", "max_time_spent":"1500"}
{"index":{}}
{"visits":"1604", "date":"2019-03-2", "max_time_spent":"1488"}
{"index":{}}
{"visits":"1499", "date":"2019-03-27", "max_time_spent":"1399"}
{"index":{}}
{"visits":"1392", "date":"2019-04-8", "max_time_spent":"1294"}
{"index":{}}
{"visits":"1247", "date":"2019-04-15", "max_time_spent":"1194"}
{"index":{}}
{"visits":"984", "date":"2019-05-15", "max_time_spent":"1184"}
{"index":{}}
{"visits":"1228", "date":"2019-05-18", "max_time_spent":"1485"}
{"index":{}}
{"visits":"1423", "date":"2019-06-14", "max_time_spent":"1452"}
{"index":{}}
{"visits":"1238", "date":"2019-06-24", "max_time_spent":"1329"}
{"index":{}}
{"visits":"1388", "date":"2019-07-14", "max_time_spent":"1542"}
{"index":{}}
{"visits":"1499", "date":"2019-07-24", "max_time_spent":"1742"}
{"index":{}}
{"visits":"1523", "date":"2019-08-13", "max_time_spent":"1552"}
{"index":{}}
{"visits":"1443", "date":"2019-08-19", "max_time_spent":"1511"}
{"index":{}}
{"visits":"1587", "date":"2019-09-14", "max_time_spent":"1497"}
{"index":{}}
{"visits":"1534", "date":"2019-09-27", "max_time_spent":"1434"}

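OK, the environment and data are ready. Let's start with the average bucket pipeline aggregation.

2.2. Average bucket pipeline aggregation

The average bucket (avg_bucket) aggregation is a typical sibling pipeline aggregation. It computes the average of a numeric metric across all buckets of a sibling aggregation. There are two requirements: the sibling aggregation must be a multi-bucket aggregation, and the specified metric must be numeric.

To understand how a pipeline aggregation works, it helps to split the computation into stages. The query below has three. First, Elasticsearch builds a date histogram that groups the documents in the index into monthly buckets based on the date field. The histogram produces several buckets, each containing several documents. Next, the sum sub-aggregation computes the total of the visits field within each monthly bucket. Finally, the average bucket pipeline aggregation references those per-bucket sums and averages them across all buckets. The result is the average number of blog visits per month.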

GET /traffic_stats/_search?size=0
{
  "aggs": {
    "visits_per_month": {
      "date_histogram": {
        "field": "date",
        "interval": "month"
      },
      "aggs": {
        "total_visits": {
          "sum": {
            "field": "visits"
          }
        }
      }
    },
    "avg_monthly_visits": {
      "avg_bucket": {
        "buckets_path": "visits_per_month>total_visits" 
      }
    }
  }
}

Response:

{
  "took" : 1184,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 27,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "visits_per_month" : {
      "buckets" : [
        {
          "key_as_string" : "2018-10-01T00:00:00.000Z",
          "key" : 1538352000000,
          "doc_count" : 3,
          "total_visits" : {
            "value" : 2060.0
          }
        },
        {
          "key_as_string" : "2018-11-01T00:00:00.000Z",
          "key" : 1541030400000,
          "doc_count" : 3,
          "total_visits" : {
            "value" : 2141.0
          }
        },
        {
          "key_as_string" : "2018-12-01T00:00:00.000Z",
          "key" : 1543622400000,
          "doc_count" : 3,
          "total_visits" : {
            "value" : 2949.0
          }
        },
        {
          "key_as_string" : "2019-01-01T00:00:00.000Z",
          "key" : 1546300800000,
          "doc_count" : 2,
          "total_visits" : {
            "value" : 1844.0
          }
        },
        {
          "key_as_string" : "2019-02-01T00:00:00.000Z",
          "key" : 1548979200000,
          "doc_count" : 2,
          "total_visits" : {
            "value" : 2411.0
          }
        },
        {
          "key_as_string" : "2019-03-01T00:00:00.000Z",
          "key" : 1551398400000,
          "doc_count" : 2,
          "total_visits" : {
            "value" : 3103.0
          }
        },
        {
          "key_as_string" : "2019-04-01T00:00:00.000Z",
          "key" : 1554076800000,
          "doc_count" : 2,
          "total_visits" : {
            "value" : 2639.0
          }
        },
        {
          "key_as_string" : "2019-05-01T00:00:00.000Z",
          "key" : 1556668800000,
          "doc_count" : 2,
          "total_visits" : {
            "value" : 2212.0
          }
        },
        {
          "key_as_string" : "2019-06-01T00:00:00.000Z",
          "key" : 1559347200000,
          "doc_count" : 2,
          "total_visits" : {
            "value" : 2661.0
          }
        },
        {
          "key_as_string" : "2019-07-01T00:00:00.000Z",
          "key" : 1561939200000,
          "doc_count" : 2,
          "total_visits" : {
            "value" : 2887.0
          }
        },
        {
          "key_as_string" : "2019-08-01T00:00:00.000Z",
          "key" : 1564617600000,
          "doc_count" : 2,
          "total_visits" : {
            "value" : 2966.0
          }
        },
        {
          "key_as_string" : "2019-09-01T00:00:00.000Z",
          "key" : 1567296000000,
          "doc_count" : 2,
          "total_visits" : {
            "value" : 3121.0
          }
        }
      ]
    },
    "avg_monthly_visits" : {
      "value" : 2582.8333333333335
    }
  }
}

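The average monthly number of blog visits is 2582.83. Walking through the stages described above should make the computation flow clear: a pipeline aggregation takes the intermediate results of bucket or metric aggregations and adds a further computed result on top of them.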
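As a cross-check, the three stages can be reproduced offline in a few lines of Python. This is only a sketch of the arithmetic, assuming the (date, visits) pairs from the bulk request above; it does not call Elasticsearch, and the variable names are chosen to mirror the aggregation names in the query.

```python
from collections import defaultdict

# (date, visits) pairs copied from the bulk request above.
docs = [
    ("2018-10-1", 488), ("2018-10-6", 783), ("2018-10-12", 789),
    ("2018-11-3", 1299), ("2018-11-6", 394), ("2018-11-24", 448),
    ("2018-12-18", 768), ("2018-12-24", 1194), ("2018-12-28", 987),
    ("2019-01-1", 872), ("2019-01-5", 972),
    ("2019-02-5", 827), ("2019-02-15", 1584),
    ("2019-03-2", 1604), ("2019-03-27", 1499),
    ("2019-04-8", 1392), ("2019-04-15", 1247),
    ("2019-05-15", 984), ("2019-05-18", 1228),
    ("2019-06-14", 1423), ("2019-06-24", 1238),
    ("2019-07-14", 1388), ("2019-07-24", 1499),
    ("2019-08-13", 1523), ("2019-08-19", 1443),
    ("2019-09-14", 1587), ("2019-09-27", 1534),
]

# Stages 1 and 2: bucket by calendar month (date_histogram),
# then sum visits inside each bucket (sum sub-aggregation).
total_visits = defaultdict(int)
for date, visits in docs:
    year, month, _day = date.split("-")
    total_visits[f"{year}-{int(month):02d}"] += visits

# Stage 3: average the per-bucket sums (avg_bucket).
avg_monthly_visits = sum(total_visits.values()) / len(total_visits)

print(dict(total_visits))
print(avg_monthly_visits)   # about 2582.83, in line with the response above
```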

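2.3. Derivative pipeline aggregation

This is a parent pipeline aggregation that computes the derivative of a specified metric in a parent histogram or date histogram. It has two requirements:

  • The metric must be numeric; otherwise a derivative cannot be computed.
  • min_doc_count of the enclosing histogram must be 0 (the default for histogram aggregations). If min_doc_count is greater than 0, some buckets are omitted, which can produce wrong or confusing derivative values.

Mathematically, the derivative of a function measures how sensitive the function's value (output) is to a change in its argument (input). In other words, it measures how fast the function changes with respect to its variable. For our data, the derivative aggregation computes the change of the metric relative to the previous period. The example below starts with the first derivative, which tells us whether the series is growing or shrinking and by how much: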

GET /traffic_stats/_search?size=0
{
  "aggs" : {
      "visits_per_month" : {
          "date_histogram" : {
              "field" : "date",
              "interval" : "month"
          },
          "aggs": {
              "total_visits": {
                  "sum": {
                      "field": "visits"
                  }
              },
              "visits_deriv": {
                  "derivative": {
                      "buckets_path": "total_visits" 
                  }
              }
          }
      }
  }
}

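buckets_path tells the derivative aggregation to use the output of the total_visits metric, which sits next to it inside the same parent date_histogram. Because derivative is a parent pipeline aggregation, it must be nested inside the histogram, and the path is relative to that parent. Response: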

{
  "took" : 61,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 27,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "visits_per_month" : {
      "buckets" : [
        {
          "key_as_string" : "2018-10-01T00:00:00.000Z",
          "key" : 1538352000000,
          "doc_count" : 3,
          "total_visits" : {
            "value" : 2060.0
          }
        },
        {
          "key_as_string" : "2018-11-01T00:00:00.000Z",
          "key" : 1541030400000,
          "doc_count" : 3,
          "total_visits" : {
            "value" : 2141.0
          },
          "visits_deriv" : {
            "value" : 81.0
          }
        },
        {
          "key_as_string" : "2018-12-01T00:00:00.000Z",
          "key" : 1543622400000,
          "doc_count" : 3,
          "total_visits" : {
            "value" : 2949.0
          },
          "visits_deriv" : {
            "value" : 808.0
          }
        },
        {
          "key_as_string" : "2019-01-01T00:00:00.000Z",
          "key" : 1546300800000,
          "doc_count" : 2,
          "total_visits" : {
            "value" : 1844.0
          },
          "visits_deriv" : {
            "value" : -1105.0
          }
        },
        {
          "key_as_string" : "2019-02-01T00:00:00.000Z",
          "key" : 1548979200000,
          "doc_count" : 2,
          "total_visits" : {
            "value" : 2411.0
          },
          "visits_deriv" : {
            "value" : 567.0
          }
        },
        {
          "key_as_string" : "2019-03-01T00:00:00.000Z",
          "key" : 1551398400000,
          "doc_count" : 2,
          "total_visits" : {
            "value" : 3103.0
          },
          "visits_deriv" : {
            "value" : 692.0
          }
        },
        {
          "key_as_string" : "2019-04-01T00:00:00.000Z",
          "key" : 1554076800000,
          "doc_count" : 2,
          "total_visits" : {
            "value" : 2639.0
          },
          "visits_deriv" : {
            "value" : -464.0
          }
        },
        {
          "key_as_string" : "2019-05-01T00:00:00.000Z",
          "key" : 1556668800000,
          "doc_count" : 2,
          "total_visits" : {
            "value" : 2212.0
          },
          "visits_deriv" : {
            "value" : -427.0
          }
        },
        {
          "key_as_string" : "2019-06-01T00:00:00.000Z",
          "key" : 1559347200000,
          "doc_count" : 2,
          "total_visits" : {
            "value" : 2661.0
          },
          "visits_deriv" : {
            "value" : 449.0
          }
        },
        {
          "key_as_string" : "2019-07-01T00:00:00.000Z",
          "key" : 1561939200000,
          "doc_count" : 2,
          "total_visits" : {
            "value" : 2887.0
          },
          "visits_deriv" : {
            "value" : 226.0
          }
        },
        {
          "key_as_string" : "2019-08-01T00:00:00.000Z",
          "key" : 1564617600000,
          "doc_count" : 2,
          "total_visits" : {
            "value" : 2966.0
          },
          "visits_deriv" : {
            "value" : 79.0
          }
        },
        {
          "key_as_string" : "2019-09-01T00:00:00.000Z",
          "key" : 1567296000000,
          "doc_count" : 2,
          "total_visits" : {
            "value" : 3121.0
          },
          "visits_deriv" : {
            "value" : 155.0
          }
        }
      ]
    }
  }
}

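Comparing two adjacent buckets shows that the derivative of the current bucket is the difference between its value and the value of the previous bucket. For example: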

 {
          "key_as_string" : "2018-11-01T00:00:00.000Z",
          "key" : 1541030400000,
          "doc_count" : 3,
          "total_visits" : {
            "value" : 2141.0
          },
          "visits_deriv" : {
            "value" : 81.0
          }
        },
        {
          "key_as_string" : "2018-12-01T00:00:00.000Z",
          "key" : 1543622400000,
          "doc_count" : 3,
          "total_visits" : {
            "value" : 2949.0
          },
          "visits_deriv" : {
            "value" : 808.0
          }
        }

December's total is 2949 and November's is 2141, so December's derivative is 808, the difference between the two.
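The whole visits_deriv series can be checked the same way. The Python sketch below works on the monthly totals taken from the response above and simply subtracts each value from its successor; it illustrates the arithmetic only and is not how Elasticsearch is implemented.

```python
# total_visits per month, October 2018 to September 2019, from the response above.
monthly_totals = [2060, 2141, 2949, 1844, 2411, 3103,
                  2639, 2212, 2661, 2887, 2966, 3121]

# The first bucket has no predecessor, so it has no derivative (None here).
visits_deriv = [None] + [
    cur - prev for prev, cur in zip(monthly_totals, monthly_totals[1:])
]

print(visits_deriv)
# [None, 81, 808, -1105, 567, 692, -464, -427, 449, 226, 79, 155]
```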

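2.4. Second derivative pipeline aggregation

The second derivative is the derivative of the derivative: it measures how the rate of change of a quantity is itself changing. In Elasticsearch, we can compute it by chaining one derivative pipeline aggregation to another, so that the first derivative is computed first and the second derivative is then computed from it. For example: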

GET /traffic_stats/_search?size=0
{
    "aggs" : {
        "visits_per_month" : {
            "date_histogram" : {
                "field" : "date",
                "interval" : "month"
            },
            "aggs": {
                "total_visits": {
                    "sum": {
                        "field": "visits"
                    }
                },
                "visits_deriv": {
                    "derivative": {
                        "buckets_path": "total_visits"
                    }
                },
                "visits_2nd_deriv": {
                    "derivative": {
                        "buckets_path": "visits_deriv" 
                    }
                }
            }
        }
    }
}

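The first derivative uses the path total_visits, so it is computed from the sum aggregation, while the second derivative uses the path visits_deriv, which points at the first derivative. In this sense the second derivative is a two-stage pipeline aggregation. Response: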

{
  "took" : 6,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 27,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "visits_per_month" : {
      "buckets" : [
        {
          "key_as_string" : "2018-10-01T00:00:00.000Z",
          "key" : 1538352000000,
          "doc_count" : 3,
          "total_visits" : {
            "value" : 2060.0
          }
        },
        {
          "key_as_string" : "2018-11-01T00:00:00.000Z",
          "key" : 1541030400000,
          "doc_count" : 3,
          "total_visits" : {
            "value" : 2141.0
          },
          "visits_deriv" : {
            "value" : 81.0
          }
        },
        {
          "key_as_string" : "2018-12-01T00:00:00.000Z",
          "key" : 1543622400000,
          "doc_count" : 3,
          "total_visits" : {
            "value" : 2949.0
          },
          "visits_deriv" : {
            "value" : 808.0
          },
          "visits_2nd_deriv" : {
            "value" : 727.0
          }
        },
        {
          "key_as_string" : "2019-01-01T00:00:00.000Z",
          "key" : 1546300800000,
          "doc_count" : 2,
          "total_visits" : {
            "value" : 1844.0
          },
          "visits_deriv" : {
            "value" : -1105.0
          },
          "visits_2nd_deriv" : {
            "value" : -1913.0
          }
        },
        {
          "key_as_string" : "2019-02-01T00:00:00.000Z",
          "key" : 1548979200000,
          "doc_count" : 2,
          "total_visits" : {
            "value" : 2411.0
          },
          "visits_deriv" : {
            "value" : 567.0
          },
          "visits_2nd_deriv" : {
            "value" : 1672.0
          }
        },
        {
          "key_as_string" : "2019-03-01T00:00:00.000Z",
          "key" : 1551398400000,
          "doc_count" : 2,
          "total_visits" : {
            "value" : 3103.0
          },
          "visits_deriv" : {
            "value" : 692.0
          },
          "visits_2nd_deriv" : {
            "value" : 125.0
          }
        },
        {
          "key_as_string" : "2019-04-01T00:00:00.000Z",
          "key" : 1554076800000,
          "doc_count" : 2,
          "total_visits" : {
            "value" : 2639.0
          },
          "visits_deriv" : {
            "value" : -464.0
          },
          "visits_2nd_deriv" : {
            "value" : -1156.0
          }
        },
        {
          "key_as_string" : "2019-05-01T00:00:00.000Z",
          "key" : 1556668800000,
          "doc_count" : 2,
          "total_visits" : {
            "value" : 2212.0
          },
          "visits_deriv" : {
            "value" : -427.0
          },
          "visits_2nd_deriv" : {
            "value" : 37.0
          }
        },
        {
          "key_as_string" : "2019-06-01T00:00:00.000Z",
          "key" : 1559347200000,
          "doc_count" : 2,
          "total_visits" : {
            "value" : 2661.0
          },
          "visits_deriv" : {
            "value" : 449.0
          },
          "visits_2nd_deriv" : {
            "value" : 876.0
          }
        },
        {
          "key_as_string" : "2019-07-01T00:00:00.000Z",
          "key" : 1561939200000,
          "doc_count" : 2,
          "total_visits" : {
            "value" : 2887.0
          },
          "visits_deriv" : {
            "value" : 226.0
          },
          "visits_2nd_deriv" : {
            "value" : -223.0
          }
        },
        {
          "key_as_string" : "2019-08-01T00:00:00.000Z",
          "key" : 1564617600000,
          "doc_count" : 2,
          "total_visits" : {
            "value" : 2966.0
          },
          "visits_deriv" : {
            "value" : 79.0
          },
          "visits_2nd_deriv" : {
            "value" : -147.0
          }
        },
        {
          "key_as_string" : "2019-09-01T00:00:00.000Z",
          "key" : 1567296000000,
          "doc_count" : 2,
          "total_visits" : {
            "value" : 3121.0
          },
          "visits_deriv" : {
            "value" : 155.0
          },
          "visits_2nd_deriv" : {
            "value" : 76.0
          }
        }
      ]
    }
  }
}

Compare two adjacent buckets:

        {
          "key_as_string" : "2019-08-01T00:00:00.000Z",
          "key" : 1564617600000,
          "doc_count" : 2,
          "total_visits" : {
            "value" : 2966.0
          },
          "visits_deriv" : {
            "value" : 79.0
          },
          "visits_2nd_deriv" : {
            "value" : -147.0
          }
        },
        {
          "key_as_string" : "2019-09-01T00:00:00.000Z",
          "key" : 1567296000000,
          "doc_count" : 2,
          "total_visits" : {
            "value" : 3121.0
          },
          "visits_deriv" : {
            "value" : 155.0
          },
          "visits_2nd_deriv" : {
            "value" : 76.0
          }
        }

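The first derivatives for August and September are 79 and 155, so the second derivative for September is their difference, 76.

In principle, we could chain three or more pipeline aggregations to compute third-order, fourth-order, or even higher-order derivatives. In practice, however, that adds little value for most data. Note that the first two buckets have no second derivative, because at least two data points from the first derivative are needed to compute it.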
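Chaining can be modeled as applying the same difference step twice. The sketch below uses a hypothetical helper named derivative (a local Python function, not an Elasticsearch API) on the monthly totals from the response; None marks buckets for which no value can be computed.

```python
from typing import List, Optional


def derivative(values: List[Optional[int]]) -> List[Optional[int]]:
    """Difference between each value and its predecessor; None where undefined."""
    if not values:
        return []
    out: List[Optional[int]] = [None]            # first bucket has no predecessor
    for prev, cur in zip(values, values[1:]):
        out.append(None if prev is None or cur is None else cur - prev)
    return out


monthly_totals = [2060, 2141, 2949, 1844, 2411, 3103,
                  2639, 2212, 2661, 2887, 2966, 3121]

visits_deriv = derivative(monthly_totals)        # buckets_path: total_visits
visits_2nd_deriv = derivative(visits_deriv)      # buckets_path: visits_deriv

print(visits_2nd_deriv)
# [None, None, 727, -1913, 1672, 125, -1156, 37, 876, -223, -147, 76]
```

The two leading None entries correspond to the first two buckets in the response, which carry no visits_2nd_deriv field.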

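2.5. Min and max bucket pipeline aggregations

The max bucket (max_bucket) aggregation is a sibling pipeline aggregation. It finds the bucket or buckets with the maximum value of the specified metric in a sibling aggregation and outputs both that value and the corresponding bucket keys. The metric must be numeric, and the sibling aggregation must be a multi-bucket aggregation.

In the example below, max_bucket finds the largest value across all months produced by the date histogram. It uses the result of the sum aggregation total_visits, reached through the sibling aggregation visits_per_month.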

GET /traffic_stats/_search?size=0
{
  "aggs": {
    "visits_per_month": {
      "date_histogram": {
        "field": "date",
        "interval": "month"
      },
      "aggs": {
        "total_visits": {
          "sum": {
            "field": "visits"
          }
        }
      }
    },
    "max_monthly_visits": {
      "max_bucket": {
        "buckets_path": "visits_per_month>total_visits" 
      }
    }
  }
}

Response:

{
  "took" : 8,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 27,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "visits_per_month" : {
      "buckets" : [
        {
          "key_as_string" : "2018-10-01T00:00:00.000Z",
          "key" : 1538352000000,
          "doc_count" : 3,
          "total_visits" : {
            "value" : 2060.0
          }
        },
        {
          "key_as_string" : "2018-11-01T00:00:00.000Z",
          "key" : 1541030400000,
          "doc_count" : 3,
          "total_visits" : {
            "value" : 2141.0
          }
        },
        {
          "key_as_string" : "2018-12-01T00:00:00.000Z",
          "key" : 1543622400000,
          "doc_count" : 3,
          "total_visits" : {
            "value" : 2949.0
          }
        },
        {
          "key_as_string" : "2019-01-01T00:00:00.000Z",
          "key" : 1546300800000,
          "doc_count" : 2,
          "total_visits" : {
            "value" : 1844.0
          }
        },
        {
          "key_as_string" : "2019-02-01T00:00:00.000Z",
          "key" : 1548979200000,
          "doc_count" : 2,
          "total_visits" : {
            "value" : 2411.0
          }
        },
        {
          "key_as_string" : "2019-03-01T00:00:00.000Z",
          "key" : 1551398400000,
          "doc_count" : 2,
          "total_visits" : {
            "value" : 3103.0
          }
        },
        {
          "key_as_string" : "2019-04-01T00:00:00.000Z",
          "key" : 1554076800000,
          "doc_count" : 2,
          "total_visits" : {
            "value" : 2639.0
          }
        },
        {
          "key_as_string" : "2019-05-01T00:00:00.000Z",
          "key" : 1556668800000,
          "doc_count" : 2,
          "total_visits" : {
            "value" : 2212.0
          }
        },
        {
          "key_as_string" : "2019-06-01T00:00:00.000Z",
          "key" : 1559347200000,
          "doc_count" : 2,
          "total_visits" : {
            "value" : 2661.0
          }
        },
        {
          "key_as_string" : "2019-07-01T00:00:00.000Z",
          "key" : 1561939200000,
          "doc_count" : 2,
          "total_visits" : {
            "value" : 2887.0
          }
        },
        {
          "key_as_string" : "2019-08-01T00:00:00.000Z",
          "key" : 1564617600000,
          "doc_count" : 2,
          "total_visits" : {
            "value" : 2966.0
          }
        },
        {
          "key_as_string" : "2019-09-01T00:00:00.000Z",
          "key" : 1567296000000,
          "doc_count" : 2,
          "total_visits" : {
            "value" : 3121.0
          }
        }
      ]
    },
    "max_monthly_visits" : {
      "value" : 3121.0,
      "keys" : [
        "2019-09-01T00:00:00.000Z"
      ]
    }
  }
}

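The sum aggregation computes the total visits for each monthly bucket, and the max bucket pipeline aggregation then finds the bucket with the largest total: 3121, which belongs to the bucket keyed 2019-09-01 (September 2019).

The min bucket aggregation follows the same logic. We only need to replace max_bucket with min_bucket in the query and rename the aggregation accordingly:
"min_monthly_visits": { "min_bucket": { "buckets_path": "visits_per_month>total_visits" } }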

Result:

"min_monthly_visits" : {
    "value" : 1844.0,
    "keys" : [
    "2019-01-01T00:00:00.000Z"
    ]
}
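Both results can be reproduced from the monthly totals with a short Python sketch. The helper extreme_bucket is hypothetical and only mirrors the shape of the response (a value plus a list of keys, since several buckets may tie); the month keys are abbreviated to year and month for readability.

```python
# total_visits per monthly bucket, taken from the response above.
monthly_totals = {
    "2018-10": 2060, "2018-11": 2141, "2018-12": 2949, "2019-01": 1844,
    "2019-02": 2411, "2019-03": 3103, "2019-04": 2639, "2019-05": 2212,
    "2019-06": 2661, "2019-07": 2887, "2019-08": 2966, "2019-09": 3121,
}


def extreme_bucket(buckets, pick):
    """Return the extreme value chosen by pick (max or min) and every key holding it."""
    value = pick(buckets.values())
    keys = [key for key, v in buckets.items() if v == value]
    return {"value": value, "keys": keys}


print(extreme_bucket(monthly_totals, max))   # {'value': 3121, 'keys': ['2019-09']}
print(extreme_bucket(monthly_totals, min))   # {'value': 1844, 'keys': ['2019-01']}
```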

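2.6. Sum and cumulative sum pipeline aggregations

Sometimes we need the sum of the values across all buckets produced by another aggregation. The sum bucket (sum_bucket) pipeline aggregation, a sibling aggregation, does this. The query below computes the total of all monthly visits: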

GET /traffic_stats/_search?size=0
{
  "aggs": {
    "visits_per_month": {
      "date_histogram": {
        "field": "date",
        "interval": "month"
      },
      "aggs": {
        "total_visits": {
          "sum": {
            "field": "visits"
          }
        }
      }
    },
    "sum_monthly_visits": {
      "sum_bucket": {
        "buckets_path": "visits_per_month>total_visits" 
      }
    }
  }
}

The pipeline aggregation uses the total_visits aggregation, which holds the visits for each month. Response:

{
  "took" : 6,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 27,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "visits_per_month" : {
      "buckets" : [
        {
          "key_as_string" : "2018-10-01T00:00:00.000Z",
          "key" : 1538352000000,
          "doc_count" : 3,
          "total_visits" : {
            "value" : 2060.0
          }
        },
        {
          "key_as_string" : "2018-11-01T00:00:00.000Z",
          "key" : 1541030400000,
          "doc_count" : 3,
          "total_visits" : {
            "value" : 2141.0
          }
        },
        {
          "key_as_string" : "2018-12-01T00:00:00.000Z",
          "key" : 1543622400000,
          "doc_count" : 3,
          "total_visits" : {
            "value" : 2949.0
          }
        },
        {
          "key_as_string" : "2019-01-01T00:00:00.000Z",
          "key" : 1546300800000,
          "doc_count" : 2,
          "total_visits" : {
            "value" : 1844.0
          }
        },
        {
          "key_as_string" : "2019-02-01T00:00:00.000Z",
          "key" : 1548979200000,
          "doc_count" : 2,
          "total_visits" : {
            "value" : 2411.0
          }
        },
        {
          "key_as_string" : "2019-03-01T00:00:00.000Z",
          "key" : 1551398400000,
          "doc_count" : 2,
          "total_visits" : {
            "value" : 3103.0
          }
        },
        {
          "key_as_string" : "2019-04-01T00:00:00.000Z",
          "key" : 1554076800000,
          "doc_count" : 2,
          "total_visits" : {
            "value" : 2639.0
          }
        },
        {
          "key_as_string" : "2019-05-01T00:00:00.000Z",
          "key" : 1556668800000,
          "doc_count" : 2,
          "total_visits" : {
            "value" : 2212.0
          }
        },
        {
          "key_as_string" : "2019-06-01T00:00:00.000Z",
          "key" : 1559347200000,
          "doc_count" : 2,
          "total_visits" : {
            "value" : 2661.0
          }
        },
        {
          "key_as_string" : "2019-07-01T00:00:00.000Z",
          "key" : 1561939200000,
          "doc_count" : 2,
          "total_visits" : {
            "value" : 2887.0
          }
        },
        {
          "key_as_string" : "2019-08-01T00:00:00.000Z",
          "key" : 1564617600000,
          "doc_count" : 2,
          "total_visits" : {
            "value" : 2966.0
          }
        },
        {
          "key_as_string" : "2019-09-01T00:00:00.000Z",
          "key" : 1567296000000,
          "doc_count" : 2,
          "total_visits" : {
            "value" : 3121.0
          }
        }
      ]
    },
    "sum_monthly_visits" : {
      "value" : 30994.0
    }
  }
}

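The sum bucket pipeline aggregation simply adds up the visits of all months, that is, it sums the intermediate results produced by the sum aggregation in each bucket.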
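A quick Python check of that total, using the monthly values from the response (an arithmetic sketch only, with variable names chosen to match the query):

```python
# total_visits per month, October 2018 to September 2019, from the response above.
monthly_totals = [2060, 2141, 2949, 1844, 2411, 3103,
                  2639, 2212, 2661, 2887, 2966, 3121]

sum_monthly_visits = sum(monthly_totals)   # what sum_bucket reports
print(sum_monthly_visits)                  # 30994 (Elasticsearch shows 30994.0)

# Dividing by the number of buckets gives the avg_bucket result from section 2.2.
print(sum_monthly_visits / len(monthly_totals))   # about 2582.83
```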

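The cumulative sum aggregation takes a different approach. In general, a cumulative sum is the sequence of partial sums of a given sequence. For example, the cumulative sums of the sequence {a, b, c, …} are a, a+b, a+b+c, …

The cumulative sum (cumulative_sum) aggregation is a parent pipeline aggregation that computes the running total of a specified metric in a parent histogram (or date histogram) aggregation. As with other parent pipeline aggregations, the metric must be numeric and min_doc_count of the enclosing histogram must be 0 (the default).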

GET /traffic_stats/_search?size=0
{
    "aggs" : {
        "visits_per_month" : {
            "date_histogram" : {
                "field" : "date",
                "interval" : "month"
            },
            "aggs": {
                "total_visits": {
                    "sum": {
                        "field": "visits"
                    }
                },
                "cumulative_visits": {
                    "cumulative_sum": {
                        "buckets_path": "total_visits" 
                    }
                }
            }
        }
    }
}

Response:

{
  "took" : 8,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 27,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "visits_per_month" : {
      "buckets" : [
        {
          "key_as_string" : "2018-10-01T00:00:00.000Z",
          "key" : 1538352000000,
          "doc_count" : 3,
          "total_visits" : {
            "value" : 2060.0
          },
          "cumulative_visits" : {
            "value" : 2060.0
          }
        },
        {
          "key_as_string" : "2018-11-01T00:00:00.000Z",
          "key" : 1541030400000,
          "doc_count" : 3,
          "total_visits" : {
            "value" : 2141.0
          },
          "cumulative_visits" : {
            "value" : 4201.0
          }
        },
        {
          "key_as_string" : "2018-12-01T00:00:00.000Z",
          "key" : 1543622400000,
          "doc_count" : 3,
          "total_visits" : {
            "value" : 2949.0
          },
          "cumulative_visits" : {
            "value" : 7150.0
          }
        },
        {
          "key_as_string" : "2019-01-01T00:00:00.000Z",
          "key" : 1546300800000,
          "doc_count" : 2,
          "total_visits" : {
            "value" : 1844.0
          },
          "cumulative_visits" : {
            "value" : 8994.0
          }
        },
        {
          "key_as_string" : "2019-02-01T00:00:00.000Z",
          "key" : 1548979200000,
          "doc_count" : 2,
          "total_visits" : {
            "value" : 2411.0
          },
          "cumulative_visits" : {
            "value" : 11405.0
          }
        },
        {
          "key_as_string" : "2019-03-01T00:00:00.000Z",
          "key" : 1551398400000,
          "doc_count" : 2,
          "total_visits" : {
            "value" : 3103.0
          },
          "cumulative_visits" : {
            "value" : 14508.0
          }
        },
        {
          "key_as_string" : "2019-04-01T00:00:00.000Z",
          "key" : 1554076800000,
          "doc_count" : 2,
          "total_visits" : {
            "value" : 2639.0
          },
          "cumulative_visits" : {
            "value" : 17147.0
          }
        },
        {
          "key_as_string" : "2019-05-01T00:00:00.000Z",
          "key" : 1556668800000,
          "doc_count" : 2,
          "total_visits" : {
            "value" : 2212.0
          },
          "cumulative_visits" : {
            "value" : 19359.0
          }
        },
        {
          "key_as_string" : "2019-06-01T00:00:00.000Z",
          "key" : 1559347200000,
          "doc_count" : 2,
          "total_visits" : {
            "value" : 2661.0
          },
          "cumulative_visits" : {
            "value" : 22020.0
          }
        },
        {
          "key_as_string" : "2019-07-01T00:00:00.000Z",
          "key" : 1561939200000,
          "doc_count" : 2,
          "total_visits" : {
            "value" : 2887.0
          },
          "cumulative_visits" : {
            "value" : 24907.0
          }
        },
        {
          "key_as_string" : "2019-08-01T00:00:00.000Z",
          "key" : 1564617600000,
          "doc_count" : 2,
          "total_visits" : {
            "value" : 2966.0
          },
          "cumulative_visits" : {
            "value" : 27873.0
          }
        },
        {
          "key_as_string" : "2019-09-01T00:00:00.000Z",
          "key" : 1567296000000,
          "doc_count" : 2,
          "total_visits" : {
            "value" : 3121.0
          },
          "cumulative_visits" : {
            "value" : 30994.0
          }
        }
      ]
    }
  }
}

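Each bucket's cumulative value is its own total plus the cumulative value of the previous bucket. The first bucket keeps its own value (2060), the second is 2060 + 2141 = 4201, the third is 4201 + 2949 = 7150, and so on, so the last bucket holds the sum of all buckets in the series (30994).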
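The running total is a one-liner with itertools.accumulate from the Python standard library. The sketch below recomputes cumulative_visits from the monthly totals in the response; it illustrates the arithmetic only.

```python
from itertools import accumulate

# total_visits per month, October 2018 to September 2019, from the response above.
monthly_totals = [2060, 2141, 2949, 1844, 2411, 3103,
                  2639, 2212, 2661, 2887, 2966, 3121]

# Each element is the sum of all totals up to and including that month.
cumulative_visits = list(accumulate(monthly_totals))

print(cumulative_visits)
# [2060, 4201, 7150, 8994, 11405, 14508, 17147, 19359, 22020, 24907, 27873, 30994]
```

The last element equals the sum_bucket result, since it is the sum over every bucket.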

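3. Summary

Pipeline aggregations are used for complex computations that build on the intermediate results of other aggregations. They give us measures such as derivatives, second derivatives, and moving averages, which are not computed directly from the document data but through several intermediate steps.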

Reposted from blog.csdn.net/neweastsun/article/details/104395294