Spring Cloud Learning Route (11) - Distributed Search: Elasticsearch Usage Scenarios

1. DSL query document

(1) DSL query classification

ES provides a JSON-based DSL to define queries.

1. Common query types:

  • Query all: query all data, for example, match_all
  • Full-text search (full text) query: uses the analyzer to segment the user's input into terms, then matches those terms against the inverted index. For example:
    • match
    • multi_match
  • Exact (term-level) query: looks up data by exact term values without analyzing the input, typically on keyword, numeric, date, or boolean fields. For example:
    • ids
    • range
    • term
  • Geographic (geo) coordinate query: query based on latitude and longitude, for example:
    • geo_distance
    • geo_bounding_box
  • Compound query: combines the query conditions above into more complex ones, for example:
    • bool
    • function_score

2. The basic syntax of the query

GET /indexName/_search
{
	"query": {
		"QUERY_TYPE": {
			"QUERY_CONDITION": "CONDITION_VALUE"
		}
	}
}

3. Use of match_all

GET /indexName/_search
{
	"query": {
		"match_all": { }
	}
}

Query result

{
  "took" : 446,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "info" : "这是我的ES拆分Demo",
          "age" : 18,
          "email" : "[email protected]",
          "name" : {
            "firstName" : "Zengoo",
            "lastName" : "En"
          }
        }
      }
    ]
  }
}

(2) Full-text search query

A full-text search query analyzes (segments) the user's input into terms before matching, so it is commonly used for search boxes.

1. Match query

(1) Structure

GET /indexName/_search
{
	"query": {
		"match": { 
			"FIELD": "TEXT"
		}
	}
}

(2) Simple example

GET /test/_search
{
	"query": {
		"match": { 
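		  "info": "ES"  # If a combined field such as "all" exists, matching against it gives multi-condition matching; the score grows with the number of matching terms.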
		}
	}
}

Result

{
  "took" : 71,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.2876821,
        "_source" : {
          "info" : "这是我的ES拆分Demo",
          "age" : 18,
          "email" : "[email protected]",
          "name" : {
            "firstName" : "Zengoo",
            "lastName" : "En"
          }
        }
      }
    ]
  }
}

2. multi_match query

multi_match runs the same match query against several fields. In effect, it is similar to querying a combined "all" field.

(1) Structure

GET /indexName/_search
{
	"query": {
		"multi_match": { 
			"query": "TEXT",
			"fields": ["FIELD1", "FIELD2"]
		}
	}
}

(2) Simple example

GET /test/_search
{
  "query": {
    "multi_match": {
      "query": "ES",
      "fields": ["info", "name.firstName"]
    }
  }
}

(3) Exact query

An exact query looks for exact values, so the search input is not analyzed into terms.

1. term: queries by the exact value of a term. In an e-commerce project it is typically used to filter by category.

(1) Structure

GET /test/_search
{
  "query": {
    "term": {
      "FIELD": {
        "value": "VALUE"
      }
    }
  }
}

(2) Simple example

GET /test/_search
{
  "query": {
    "term": {
      "city": {
        "value": "杭州" # exact value
      }
    }
  }
}

2. range: queries by a range of values. In an e-commerce project it is typically used to filter by numeric values such as price.

(1) Structure

GET /test/_search
{
  "query": {
    "range": {
      "FIELD": {
        "gte": 10,
        "lte": 20
      }
    }
  }
}

(2) Simple example

GET /test/_search
{
  "query": {
    "range": {
      "price": {
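        "gte": 699, # lower bound: gte means greater than or equal to, gt means greater than
        "lte": 1899 # upper bound: lte means less than or equal to, lt means less than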
      }
    }
  }
}

(4) Geographic coordinate query

1. geo_distance: finds all documents whose geo_point lies within a given distance of a specified center point (a circular area).

(1) Structure

GET /indexName/_search
{
  "query": {
    "geo_distance": {
    	"distance": "15km",
     	"FIELD": "13.21,121.5"
    }
  }
}

(2) Simple example

GET /test/_search
{
  "query": {
    "geo_distance": {
    	"distance": "20km",
     	"location": "13.21,121.5"
    }
  }
}
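
To make the "within a given distance of a center point" condition concrete, here is a minimal sketch in plain Java. It uses the haversine formula with a mean Earth radius of 6371 km, which is a common approximation and not necessarily the exact calculation ES performs internally. The class and method names (GeoDistanceSketch, haversineKm, within) are hypothetical and exist only for this illustration.

```java
final class GeoDistanceSketch {
    // Mean Earth radius in kilometres (an approximation assumed for this sketch)
    static final double EARTH_RADIUS_KM = 6371.0;

    // Great-circle distance between two lat/lon points, using the haversine formula
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    // True when the point lies inside the circle around the center
    static boolean within(double centerLat, double centerLon,
                          double lat, double lon, double distanceKm) {
        return haversineKm(centerLat, centerLon, lat, lon) <= distanceKm;
    }

    public static void main(String[] args) {
        // Same center as the query above; the second point is one degree of latitude away
        System.out.println(within(13.21, 121.5, 13.21, 121.5, 20));
        System.out.println(within(13.21, 121.5, 14.21, 121.5, 20));
    }
}
```

One degree of latitude is a little over 111 km, so the second point falls outside a 20km circle.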

2. geo_bounding_box: finds all documents whose geo_point value falls within a rectangle defined by its top-left and bottom-right corners.

(1) Structure

GET /indexName/_search
{
  "query": {
    "geo_bounding_box": {
      "FIELD": {
        "top_left": {
          "lat": 31.1,
          "lon": 121.5
        },
        "bottom_right": {
          "lat": 30.9,
          "lon": 121.7
        }
      }
    }
  }
}

(5) Compound query

Compound queries can combine other simple queries to implement more complex search logic.

1. function_score: function score query, which modifies the relevance score of documents and therefore controls their ranking.

When we use the match query, each document is scored by its relevance to the search terms (_score), and the results are returned in descending order of score.
For example, searching for "CSDNJava" returns:

[
	{
		"_score": 17.85048,
		"_source": {
			"name": "Java语法菜鸟教程"
		}
	},
	{
		"_score": 12.587963,
		"_source": {
			"name": "Java语法W3CScool"
		}
	},
	{
		"_score": 11.158756,
		"_source": {
			"name": "CSDNJava语法学习树"
		}
	}
]

Related algorithms:

  • The initial scoring algorithm: TF (term frequency) = number of occurrences of the term in the document / total number of terms in the document
  • TF-IDF algorithm, which lowers the weight of terms that are common across all documents:
    • IDF (Inverse Document Frequency) = log( total number of documents / number of documents containing the term )
    • score = ∑(i=1..n) TF(i) * IDF(i), summed over the n query terms
  • BM25 algorithm: the default algorithm today. It is more complex, and its score saturates: as the term frequency grows, the score curve flattens out instead of rising without bound.
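
The TF and TF-IDF formulas above can be worked through in a few lines of plain Java. This is a minimal sketch, assuming documents are already split into terms and using the natural logarithm; the class TfIdfSketch and its methods are hypothetical names for illustration, and the sketch does not reproduce the BM25 scoring that ES uses by default.

```java
import java.util.Arrays;
import java.util.List;

final class TfIdfSketch {
    // TF = occurrences of the term in the document / total number of terms in the document
    static double tf(String term, List<String> doc) {
        if (doc.isEmpty()) return 0.0;
        long count = doc.stream().filter(term::equals).count();
        return (double) count / doc.size();
    }

    // IDF = log(total documents / documents containing the term)
    static double idf(String term, List<List<String>> corpus) {
        long containing = corpus.stream().filter(d -> d.contains(term)).count();
        if (containing == 0) return 0.0; // a term absent from the corpus contributes nothing
        return Math.log((double) corpus.size() / containing);
    }

    // score = sum over the query terms of TF * IDF
    static double score(List<String> queryTerms, List<String> doc, List<List<String>> corpus) {
        double sum = 0.0;
        for (String term : queryTerms) {
            sum += tf(term, doc) * idf(term, corpus);
        }
        return sum;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("java", "syntax", "tutorial"),
                Arrays.asList("java", "learning", "tree"),
                Arrays.asList("csdn", "java", "syntax"));
        List<String> query = Arrays.asList("csdn", "java");
        for (List<String> doc : corpus) {
            System.out.println(doc + " -> " + score(query, doc, corpus));
        }
    }
}
```

In this toy corpus "java" occurs in every document, so its IDF is log(1) = 0 and it no longer influences the ranking; only the rarer term "csdn" separates the documents. That is the effect TF-IDF is meant to have on common terms.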

(1) Structure

GET /hotel/_search
{
	"query": {
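		"function_score": {    # query type
			"query": {  # the original query, which produces the query score
				"match": {
					"all": "外滩"
				}
			},
			"functions": [  # scoring functions
				{
					"filter": {     # filter: only matching documents are re-scored
						"term": {
							"id": "1"
						}
					},
					"weight": 10  # scoring function: weight uses a constant as the function result; others are field_value_factor (uses a field value), random_score (uses a random value), and script_score (uses a custom formula)
				}
			],
			"boost_mode": "multiply"  # boost mode: how the function score and the query score are combined; multiply (default) multiplies them, replace substitutes the function score for the query score; others: sum, avg, max, min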
		}
	}
}
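
The boost modes listed in the structure above are simple arithmetic on two numbers, the query score and the function score. The following plain Java sketch shows that arithmetic only; BoostModeSketch and its methods are hypothetical names, and leaving the score of non-matching documents unchanged is a simplifying assumption of this sketch, not a statement about ES internals.

```java
final class BoostModeSketch {
    // Combines the query score with the function score according to boost_mode
    static double combine(double queryScore, double functionScore, String boostMode) {
        switch (boostMode) {
            case "multiply": return queryScore * functionScore;
            case "sum":      return queryScore + functionScore;
            case "replace":  return functionScore;
            case "avg":      return (queryScore + functionScore) / 2.0;
            case "max":      return Math.max(queryScore, functionScore);
            case "min":      return Math.min(queryScore, functionScore);
            default: throw new IllegalArgumentException("unknown boost_mode: " + boostMode);
        }
    }

    // Applies a constant weight only to documents that pass the filter.
    // Assumption of this sketch: other documents keep their query score as-is.
    static double finalScore(double queryScore, boolean matchesFilter,
                             double weight, String boostMode) {
        if (!matchesFilter) return queryScore;
        return combine(queryScore, weight, boostMode);
    }

    public static void main(String[] args) {
        // A document with query score 12.0 that passes the filter, weight 10
        System.out.println(finalScore(12.0, true, 10.0, "multiply"));
        System.out.println(finalScore(12.0, true, 10.0, "sum"));
        System.out.println(finalScore(12.0, false, 10.0, "sum"));
    }
}
```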

(2) Simple example

Requirement: rank hotels of the brand 速8 higher in the results.

Elements to consider:

  • Which documents need to be weighted: documents whose brand is 速8
  • What is the scoring function: weight
  • Which boost mode to use: sum

Implementation:

GET /hotel/_search
{
	"query": {
		"function_score": {	# function score query
			"query": {
				"match": {
					"all": "速8快捷酒店"
				}
			},
			"functions": [ 
				{
					"filter": {	# condition: the brand must be 速8
						"term": {
							"brand": "速8"
						}
					},
					"weight": 2  # the function score (weight) is 2
				}
			],
			"boost_mode": "sum"
		}
	}
}

2. bool: Boolean query

A Boolean query combines one or more subqueries. The subqueries can be combined in these ways:

  • must: every subquery must match, similar to "and"
  • should: subqueries may match, similar to "or"
  • must_not: subqueries must not match; does not contribute to the score, similar to "not"
  • filter: subqueries must match; does not contribute to the score

Implementation case

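# Search for hotels in 上海 whose brand is "皇冠假日" or "华美达", with 500 < price < 600 and a score of at least 45
GET /hotel/_search
{
	"query": {
		"bool": {
			"must": [	# conditions that must match
				{ "term": { "city": "上海" } }
			],
			"should": [	# conditions that may match
				{ "term": { "brand": "皇冠假日" } },
				{ "term": { "brand": "华美达" } }
			],
			"minimum_should_match": 1,	# require at least one should clause, since should is optional when must or filter is present
			"must_not": [	# conditions that must not match: exclude price <= 500 and price >= 600
				{ "range": { "price": { "lte": 500 } } },
				{ "range": { "price": { "gte": 600 } } }
			],
			"filter": [	# filter conditions, not scored
				{ "range": { "score": { "gte": 45 } } }
			]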
		}	
	}
}

2. Search result processing

(1) Sorting

ES supports sorting search results. By default, results are sorted by relevance score (_score). Sortable field types include keyword, numeric, geo_point, and date.

1. Structure:

# Sorting by an ordinary field
GET /test/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "FIELD": {
        "order": "desc"		# sort field and sort order: asc or desc
      }
    }
  ]
}

# Sorting by geographic distance
GET /test/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "_geo_distance": {
        "FIELD": {  # latitude and longitude of the reference point
          "lat": 40,
          "lon": -70
        },
        "order": "asc",
        "unit": "km"
      }
    }
  ]
}

2. Example

Sorting requirement: sort by user rating (score) in descending order and, for equal ratings, by price in ascending order.

GET /hotel/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "score": {  # shorthand form: "score": "desc"
        "order": "desc"
      }
    },
    {
      "price": {
        "order": "asc"
      }
    }
  ]
}

Sorting requirement: sort in ascending order of distance from the user's location.

GET /hotel/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "_geo_distance": {
        "location": {
          "lat": 40.58489,
          "lon": -70.59873
        },
        "order": "asc",
        "unit": "km"
      }
    }
  ]
}

(2) Pagination

Modify pagination parameters

GET /hotel/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "price": "asc"
    }
  ],
  "from": 100,	# offset of the first document to return, default 0
  "size": 20	# number of documents to return
}

Deep pagination problem

When ES runs as a cluster, how does it return one page of results, for example the 10 documents at positions 990 to 1000 (from=990, size=10)?

Because ES is distributed, the data of an index is split across shards, and no single shard knows the global order. The query therefore runs as follows:

1. Each shard sorts its own data and returns its first 1000 documents.

2. The coordinating node gathers the results of all shards, re-sorts them in memory, and keeps the first 1000 documents.

3. From those 1000 documents, it takes the 10 documents starting at from=990.

The deeper the page or the larger the result set, the more memory and CPU this consumes, so ES limits the result window: by default from + size may not exceed 10,000.
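
The three steps above can be simulated in plain Java to show why the cost grows with from + size. This is a minimal sketch under stated assumptions: documents are reduced to integer sort values, shards are plain lists, and the names DeepPagingSketch, shardTop, page, and candidatesHeld are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class DeepPagingSketch {
    // Step 1: each shard sorts its own data and returns its first (from + size) values
    static List<Integer> shardTop(List<Integer> shardData, int from, int size) {
        List<Integer> sorted = new ArrayList<>(shardData);
        Collections.sort(sorted);
        return new ArrayList<>(sorted.subList(0, Math.min(from + size, sorted.size())));
    }

    // Steps 2 and 3: merge all shard results, re-sort, then cut out [from, from + size)
    static List<Integer> page(List<List<Integer>> shards, int from, int size) {
        List<Integer> merged = new ArrayList<>();
        for (List<Integer> shard : shards) {
            merged.addAll(shardTop(shard, from, size));
        }
        Collections.sort(merged);
        int start = Math.min(from, merged.size());
        int end = Math.min(from + size, merged.size());
        return new ArrayList<>(merged.subList(start, end));
    }

    // Upper bound on the number of candidates the merging node holds in memory
    static int candidatesHeld(int shardCount, int from, int size) {
        return shardCount * (from + size);
    }

    public static void main(String[] args) {
        List<List<Integer>> shards = Arrays.asList(
                Arrays.asList(5, 1, 9),
                Arrays.asList(2, 8, 3),
                Arrays.asList(7, 4, 6));
        System.out.println(page(shards, 2, 2));
        System.out.println(candidatesHeld(3, 990, 10));
    }
}
```

With 3 shards and from=990, size=10, up to 3000 candidates are gathered to return 10 documents, and the number keeps growing as the page gets deeper.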

How to solve the problem of deep pagination?

  • search_after: requires a sort. Each request starts from the sort values of the last document on the previous page and fetches the next page (officially recommended).
  • scroll: takes a snapshot of the sorted result set and keeps it in memory (no longer officially recommended).
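
The idea behind search_after, continuing from the last sort value instead of skipping an offset, can be sketched in plain Java. This is an illustration only: it assumes a list of unique prices already sorted in ascending order (a real query typically adds a tiebreaker field so that sort values are unique), and SearchAfterSketch and nextPage are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SearchAfterSketch {
    // Returns up to `size` values strictly greater than `after`.
    // Pass null as `after` to fetch the first page.
    static List<Integer> nextPage(List<Integer> sortedPrices, Integer after, int size) {
        List<Integer> page = new ArrayList<>();
        if (size <= 0) return page;
        for (int price : sortedPrices) {
            if (after != null && price <= after) continue; // already returned on an earlier page
            page.add(price);
            if (page.size() == size) break;
        }
        return page;
    }

    public static void main(String[] args) {
        List<Integer> prices = Arrays.asList(100, 150, 200, 250, 300);
        Integer after = null;
        List<Integer> page;
        // Keep requesting pages, feeding the last sort value of each page into the next request
        while (!(page = nextPage(prices, after, 2)).isEmpty()) {
            System.out.println(page);
            after = page.get(page.size() - 1);
        }
    }
}
```

Each request only needs `size` documents per shard after the given sort value, so the cost no longer depends on how deep the page is. The trade-off is that pages can only be walked forward in order; jumping to an arbitrary page is not possible.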

(3) Highlight

1. Concept: Search keywords are highlighted in search results.

2. Principle

  • Wrap the keywords in the search results with a tag
  • Add a CSS style for that tag on the page

3. Syntax:

GET /indexName/_search
{
 "query": {
   "match": {
     "FIELD": "TEXT"
   }
 },
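 "highlight": { 	# fields to highlight
   "fields": {
     "FIELD": {
       "pre_tags": "<em>",  	# tag placed before the keyword
       "post_tags": "</em>", 	# tag placed after the keyword
       "require_field_match": "false"	# whether the highlighted field must match the field used in the query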
     }
   }
 }
}
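
The effect of pre_tags and post_tags can be pictured with a plain Java sketch that wraps every occurrence of a keyword in a tag pair. This is only an illustration of the output shape: ES highlights analyzed terms rather than raw substrings, and HighlightSketch and highlight are hypothetical names.

```java
final class HighlightSketch {
    // Wraps each literal occurrence of the keyword with the given tags
    static String highlight(String text, String keyword, String preTag, String postTag) {
        if (keyword.isEmpty()) return text; // nothing to highlight
        return text.replace(keyword, preTag + keyword + postTag);
    }

    public static void main(String[] args) {
        System.out.println(highlight("my ES demo", "ES", "<em>", "</em>"));
    }
}
```

The page then only needs a CSS rule for the em tag, for example a different color, to make the keyword stand out.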

3. RestClient query documents

(1) Implementation of simple query case

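// Imports assume the Elasticsearch 7.x high-level REST client and fastjson
import com.alibaba.fastjson.JSON;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.SearchHits;

// client is assumed to be an initialized RestHighLevelClient
// 1. Prepare the request
SearchRequest request = new SearchRequest("hotel");
// 2. Build the DSL; QueryBuilders is the factory for ES queries
request.source().query(QueryBuilders.matchAllQuery());
// 3. Send the request and get the response
SearchResponse response = client.search(request, RequestOptions.DEFAULT);
// 4. Parse the response; the search results are in the hits collection
SearchHits searchHits = response.getHits();
// 5. Total number of matching documents
long total = searchHits.getTotalHits().value;
// 6. Array of returned documents
SearchHit[] hits = searchHits.getHits();
for (SearchHit hit : hits) {
	// Get the source, which is the JSON of the stored document
	String json = hit.getSourceAsString();
	// Deserialize it into the entity class
	HotelDoc hotelDoc = JSON.parseObject(json, HotelDoc.class);
}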

(2) match query


// 1. Prepare the request
SearchRequest request = new SearchRequest("hotel");
// 2. Build the DSL; QueryBuilders is the factory for ES queries
// Single-field query
request.source().query(QueryBuilders.matchQuery("all","皇家"));
// Multi-field query
//request.source().query(QueryBuilders.multiMatchQuery("皇家","name","business"));
// 3. Send the request and get the response
SearchResponse response = client.search(request, RequestOptions.DEFAULT);
// 4. Parse the response; the search results are in the hits collection
SearchHits searchHits = response.getHits();
// 5. Total number of matching documents
long total = searchHits.getTotalHits().value;
// 6. Array of returned documents
SearchHit[] hits = searchHits.getHits();
for (SearchHit hit : hits) {
	// Get the source, which is the JSON of the stored document
	String json = hit.getSourceAsString();
	// Deserialize it into the entity class
	HotelDoc hotelDoc = JSON.parseObject(json, HotelDoc.class);
}

(3) Exact query

// Term query
QueryBuilders.termQuery("city","杭州");
// Range query
QueryBuilders.rangeQuery("price").gte(100).lte(150);

(4) Compound query

// Create the Boolean query
BoolQueryBuilder boolQuery = QueryBuilders.boolQuery();
// Add a must condition
boolQuery.must(QueryBuilders.termQuery("city","杭州"));
// Add a filter condition
boolQuery.filter(QueryBuilders.rangeQuery("price").lte(250));

(5) Sorting, paging, highlighting

1. Sorting and paging

// Query
request.source().query(QueryBuilders.matchAllQuery());
// Paging
request.source().from(0).size(5);
// Sort by price
request.source().sort("price", SortOrder.ASC);

2. Highlight

Highlight query request

request.source().highlighter(new HighlightBuilder().field("name").requireFieldMatch(false));

Handle highlighted results

// Get the source
HotelDoc hotelDoc = JSON.parseObject(hit.getSourceAsString(), HotelDoc.class);
// Handle the highlight result
Map<String, HighlightField> highlightFields = hit.getHighlightFields();
if (!CollectionUtils.isEmpty(highlightFields)) {
	// Get the highlight result of the field
	HighlightField highlightField = highlightFields.get("name");
	if (highlightField != null) {
		// Take the first fragment of the highlight result
		String name = highlightField.getFragments()[0].string();
		hotelDoc.setName(name);
	}
}


Origin blog.csdn.net/Zain_horse/article/details/131908395