Elasticsearch Core Technology and Practice Study Notes 52 | Ingest Pipeline & Painless Script

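1 Preface

This post is part of my study-notes series for the Geek Time (极客时间) course Elasticsearch Core Technology and Practice (Elasticsearch核心技术与实战).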

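2 Requirement: repairing and enriching data at write time

In the Tags field, comma-separated text should be stored as an array rather than as a single string.

Requirement: Tags will later need to be aggregated.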

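2.1 Ingest Node

A node type introduced in Elasticsearch 5.0. With the default configuration, every node is an ingest node.

It can pre-process data by intercepting Index or Bulk API requests,
transforming the documents, and handing them back to the Index or Bulk API.

Data can be pre-processed without Logstash, for example:

setting a default value for a field; renaming a field; running a split on a field's value
running a Painless script for more complex processing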

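2.2 Pipeline & Processor

Pipeline - a pipeline processes each document that passes through it, applying its processors in order
Processor - Elasticsearch's abstraction that wraps a single processing action

Elasticsearch ships with many built-in processors. Custom processors can also be added through plugins.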

2.3 Splitting a string with a pipeline

2.4 Demo

Data preparation:

DELETE tech_blogs

# Blog data with 3 fields; tags is a comma-separated string
PUT tech_blogs/_doc/1
{
  "title":"Introducing big data......",
  "tags":"hadoop,elasticsearch,spark",
  "content":"You know, for big data"
}

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "to split blog tags",
    "processors": [
      {
        "split": {
          "field": "tags",
          "separator": ","
        }
      }
    ]
  },
  "docs": [
    {
      "_index": "index",
      "_id": "id",
      "_source": {
        "title": "Introducing big data......",
        "tags": "hadoop,elasticsearch,spark",
        "content": "You know, for big data"
      }
    },
    {
      "_index": "index",
      "_id": "idxx",
      "_source": {
        "title": "Introducing cloud computing",
        "tags": "openstack,k8s",
        "content": "You know, for cloud"
      }
    }
  ]
}

This defines a split processor that splits tags on the comma separator ",".

In addition, a new field is added to each document: views, the blog post's view count.

Pipeline API

Add a pipeline to Elasticsearch
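The indexing requests later in this section refer to a pipeline named blog_pipeline, but its definition is not shown here. A minimal sketch of what it could look like, assuming it combines the split processor from the demo with a set processor that initializes views to 0 (the description text is illustrative):

```
# Store a pipeline under the name blog_pipeline (assumed definition)
PUT _ingest/pipeline/blog_pipeline
{
  "description": "a blog pipeline",
  "processors": [
    {
      "split": {
        "field": "tags",
        "separator": ","
      }
    },
    {
      "set": {
        "field": "views",
        "value": 0
      }
    }
  ]
}

# Read the stored definition back
GET _ingest/pipeline/blog_pipeline
```

Once stored, the pipeline can be referenced by name with the pipeline request parameter, so the processors do not have to be repeated in every request.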

Test the pipeline with _simulate
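A stored pipeline can be simulated by name, so only the test documents need to be supplied. The sketch below assumes a pipeline has already been stored as blog_pipeline (the name used by the indexing requests in this section); the sample document is illustrative:

```
# Run sample documents through the stored pipeline without indexing them
POST _ingest/pipeline/blog_pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "title": "Introducing cloud computing",
        "tags": "openstack,k8s",
        "content": "You know, for cloud"
      }
    }
  ]
}
```

The response contains each document as it would look after processing, which makes it easy to check the result before sending real data through the pipeline.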

Index & Update By Query

# Index a document without the pipeline
PUT tech_blogs/_doc/1
{
  "title":"Introducing big data......",
  "tags":"hadoop,elasticsearch,spark",
  "content":"You know, for big data"
}

# Index a document through the pipeline
PUT tech_blogs/_doc/2?pipeline=blog_pipeline
{
  "title": "Introducing cloud computing",
  "tags": "openstack,k8s",
  "content": "You know, for cloud"
}


# Look at both documents: one has been processed, the other has not
POST tech_blogs/_search
{}

# update_by_query over all documents fails
POST tech_blogs/_update_by_query?pipeline=blog_pipeline
{
}

java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: field [tags] of type [java.util.ArrayList] cannot be cast to [java.lang.String]

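The error occurs because document 2 has already been processed, so its tags field is already an array and the split processor cannot split it again. The fix is to add a query condition so that update_by_query only runs on documents that have no views field.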
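A sketch of such a request, assuming blog_pipeline adds a views field to every document it processes, so that a missing views field identifies unprocessed documents:

```
# Only send documents without a views field through the pipeline
POST tech_blogs/_update_by_query?pipeline=blog_pipeline
{
  "query": {
    "bool": {
      "must_not": {
        "exists": {
          "field": "views"
        }
      }
    }
  }
}
```

Documents that already carry views are skipped, so the split processor never sees a tags value that is already an array.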

Then run the search again to check the documents.

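3 Some built-in Processors

Split Processor (e.g. split a given field into an array)
Remove / Rename Processor (remove a field, or rename a field)
Append (e.g. add a new tag to a product)
Convert (e.g. convert a product price from a string to a float)
Date / JSON (date format conversion; string to JSON object)
Date Index Name Processor (routes documents passing through it to an index named after a given date format)
Fail Processor (raises an exception so that the error message specified in the pipeline is returned to the user)
Foreach Processor (for array fields: applies the same processor to every element of the array)
Grok Processor (pattern-based parsing of log lines)
Gsub / Join / Split (string replacement, array to string, string to array)
Lowercase / Uppercase (case conversion)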
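Several of these processors can be chained in one pipeline. The following is only an illustrative sketch: the product document and its field names (name, brand, price, tags) are hypothetical and do not come from the course material.

```
# Simulate a pipeline that chains convert, append, rename and lowercase
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "hypothetical product clean-up",
    "processors": [
      { "convert": { "field": "price", "type": "float" } },
      { "append": { "field": "tags", "value": ["new"] } },
      { "rename": { "field": "name", "target_field": "product_name" } },
      { "lowercase": { "field": "brand" } }
    ]
  },
  "docs": [
    {
      "_source": {
        "name": "Phone X",
        "brand": "ACME",
        "price": "199.9",
        "tags": ["mobile"]
      }
    }
  ]
}
```

Processors run in the order listed, so each one sees the document as left by the previous one.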

3.1 Ingest Node vs. Logstash

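4 Introduction to Painless

Introduced in Elasticsearch 5.x and designed specifically for Elasticsearch; its syntax extends Java's
From 6.0 on, Elasticsearch only supports Painless. Groovy, JavaScript, and Python are no longer supported
Painless supports all Java data types and a subset of the Java API
Painless scripts have the following characteristics

High performance and safe
Support for explicit types or dynamically defined types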

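4.1 Uses of Painless

Processing document fields

Updating or deleting fields, and processing data for aggregations
Script Field: compute values from fields before they are returned
Function Score: adjust a document's score

Running scripts in an Ingest Pipeline
Processing data during Reindex API and Update By Query operations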

4.2 Accessing fields from a Painless script

Context	Syntax
Ingestion	ctx.field_name
Update	ctx._source.field_name
Search & Aggregation	doc["field_name"]

Each context has its own syntax for accessing fields.
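The ingestion syntax appears in the demo in this section. For the other two contexts, here is a minimal sketch. It assumes document 1 in tech_blogs already has a numeric views field, and it uses the _update URL form of Elasticsearch 7.x. The names new_views and views_plus_one are made up for illustration.

```
# Update context: fields are reached through ctx._source
POST tech_blogs/_update/1
{
  "script": {
    "source": "ctx._source.views += params.new_views",
    "params": {
      "new_views": 100
    }
  }
}

# Search context: field values are reached through doc['field_name']
GET tech_blogs/_search
{
  "script_fields": {
    "views_plus_one": {
      "script": {
        "lang": "painless",
        "source": "doc['views'].size() == 0 ? 0 : doc['views'].value + 1"
      }
    }
  }
}
```

Passing values through params instead of concatenating them into the source string keeps the script text constant, so Elasticsearch can compile it once and reuse it. The size() check in the search script guards against documents that have no views value.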

4.3 Demo

Example 1: Script Processor

#########Demo for Painless###############

# Add a Script Processor
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "to split blog tags",
    "processors": [
      {
        "split": {
          "field": "tags",
          "separator": ","
        }
      },
      {
        "script": {
          "source": """
          if (ctx.containsKey("content")) {
            ctx.content_length = ctx.content.length();
          } else {
            ctx.content_length = 0;
          }
          """
        }
      },

      {
        "set":{
          "field": "views",
          "value": 0
        }
      }
    ]
  },

  "docs": [
    {
      "_index": "index",
      "_id": "id",
      "_source": {
        "title": "Introducing big data......",
        "tags": "hadoop,elasticsearch,spark",
        "content": "You know, for big data"
      }
    },
    {
      "_index": "index",
      "_id": "idxx",
      "_source": {
        "title": "Introducing cloud computing",
        "tags": "openstack,k8s",
        "content": "You know, for cloud"
      }
    }

    ]
}

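Here the script processor checks whether the document has a content field and stores its length in content_length, or 0 if the field is missing.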