es Series 3, Section 14: Full-Text Analyzed Queries

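#### 1. Elasticsearch is a database, not an ordinary Java application. It needs the same hardware resources as a traditional database, and upgrading hardware is the most effective way to improve performance.
#### 2. Elasticsearch is a document database, not a relational database. It does not provide strict ACID transactions, so any project that tries to use it as a direct replacement in strictly transactional scenarios will fail!!!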

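##### Index fields and their attributes are static settings. If they are changed later, existing data must be reindexed before the change takes effect
##### Changes do not apply to existing data!!!!
##### You must rebuild the index!!!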

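#### Full-text concepts
### Concept overview
## 1. Text is split into tokens (analysis)
## 2. Once tokenized, the text can be searched by token
## 3. There are many tokenization algorithms, and the field goes deep
## 4. Search is based on the inverted index (Inverted Index)
## 5. Relevance scoring for token-based search: TF/IDF => BM25
## 6. Only fields of type text are analyzed this way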
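
To make points 1, 2 and 4 concrete, here is a minimal Python sketch of an inverted index. It is not how Elasticsearch is implemented; the `tokenize` and `build_inverted_index` helpers and the three sample strings are assumptions made up for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Lowercase and keep runs of ASCII letters/digits: a rough stand-in for an analyzer.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    # Map each token to the set of document ids that contain it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


# Illustrative documents only (hypothetical ids).
docs = {
    1: "Cape Town International Airport",
    2: "Ministro Pistarini International Airport",
    3: "Cape Airport",
}
index = build_inverted_index(docs)
print(sorted(index["cape"]))           # [1, 3]
print(sorted(index["international"]))  # [1, 2]
```

A search for a token is then a dictionary lookup rather than a scan over every document, which is the reason the text is tokenized at write time.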

## Full-text search is a deep topic; a working first understanding is enough for now

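# Default syntax of the es analyzer test API. The default analyzer, standard, splits text on word boundaries such as spaces and commas
# First intuition for analysis: data is tokenized and indexed when it is written, before it is ever searched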

POST _analyze
{"text": ["hello every body, 我是DavidSoCool, 我正在学习es"],"analyzer":"standard"
}
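
For intuition about what the request above returns, the following Python sketch approximates the standard analyzer with a regular expression. This is an assumption-laden simplification (the real analyzer follows Unicode text segmentation rules); `rough_standard_tokens` is a hypothetical helper, and the result should be checked against the actual _analyze response.

```python
import re


def rough_standard_tokens(text):
    # Lowercase, then emit each run of ASCII letters/digits as one token
    # and each CJK ideograph as its own single-character token.
    return re.findall(r"[a-z0-9]+|[\u4e00-\u9fff]", text.lower())


tokens = rough_standard_tokens("hello every body, 我是DavidSoCool, 我正在学习es")
print(tokens)
```

The point to notice is that English words stay whole and are lowercased, while the Chinese text falls apart into single characters, which is why a dedicated Chinese analyzer is usually preferred for Chinese content.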

### Full-text search
# Match-all: matches everything
# Match: standard analyzed match

# Prepare the data

DELETE kibana_sample_data_flights_fulltext
POST _reindex
{"source": {"index": "kibana_sample_data_flights"},"dest": {"index": "kibana_sample_data_flights_fulltext"}
}

## match_all: match everything
# 1. match_all has no conditions; it is equivalent to a search with no query
# 2. boost: adjusts the score weight

GET kibana_sample_data_flights_fulltext/_search
{"track_total_hits": true,"query":{"match_all": {"boost": 10}}
}

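## match_none: the inverse of match_all, it matches no documents. It can be used to test index health; unlike a query that returns data, it costs almost nothing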

GET kibana_sample_data_flights_fulltext/_search
{"track_total_hits": true,"query":{"match_none": {}}
}

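## match: text matching, the most commonly used query
# Results are sorted by _score by default; the more terms match, the higher the score, which can serve as a simple recommendation mechanism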

GET kibana_sample_data_flights_fulltext/_mapping
# Test the analysis result first: the text is split into 4 tokens
POST _analyze
{"text": ["Cape Town International Airport"],"analyzer":"standard"
}
# A document that matches any one of the tokens is returned
GET kibana_sample_data_flights_fulltext/_search
{"track_total_hits": true,"query":{"match": {"Origin":"Cape Town International Airport"}}
}
# Look at the _score values and how many terms of Origin match after the first 1000 hits
GET kibana_sample_data_flights_fulltext/_search
{"from":1000,"track_total_hits": true,"query":{"match": {"Origin":"Cape Town International Airport"}}
}
# Look at the _score values and how many terms of Origin match after the first 9000 hits
GET kibana_sample_data_flights_fulltext/_search
{"from":9000,"track_total_hits": true,"query":{"match": {"Origin":"Cape Town International Airport"}}
}

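## Request parameters
# query: the query text
# analyzer: the analyzer applied to the query input text
# operator: how the tokens are combined; defaults to or
# minimum_should_match: minimum number of tokens that must match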

# This request is equivalent to the next one
GET kibana_sample_data_flights_fulltext/_search
{"track_total_hits": true,"query":{"match": {"Origin":{"query": "Cape Town International Airport","analyzer": "standard","operator": "or"}}}
}
# This request is equivalent to the previous one
GET kibana_sample_data_flights_fulltext/_search
{"track_total_hits": true,"query":{"match": {"Origin":"Cape Town International Airport"}}
}
# With operator=and, every token must match; note the total
GET kibana_sample_data_flights_fulltext/_search
{"track_total_hits": true,"query":{"match": {"Origin":{"query": "Cape Town International Airport","analyzer": "standard","operator": "and"}}}
}
# Drop the first two words, then skip 100 hits and look
GET kibana_sample_data_flights_fulltext/_search
{"from": 100,"track_total_hits": true,"query":{"match": {"Origin":{"query": "International Airport","analyzer": "standard","operator": "and"}}}
}
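
The difference between operator or and operator and can be sketched in a few lines of Python. This is only a model of the matching decision, assuming whitespace tokenization; `match` is a hypothetical helper and ignores scoring entirely.

```python
def match(doc_text, query_text, operator="or"):
    # Tokenize both sides by lowercasing and splitting on whitespace.
    doc_tokens = set(doc_text.lower().split())
    query_tokens = query_text.lower().split()
    hits = [token in doc_tokens for token in query_tokens]
    # "and" requires every query token; "or" requires at least one.
    return all(hits) if operator == "and" else any(hits)


query = "Cape Town International Airport"
print(match("Ministro Pistarini International Airport", query, "or"))   # True
print(match("Ministro Pistarini International Airport", query, "and"))  # False
print(match("Cape Town International Airport", query, "and"))           # True
```

This is why the total drops sharply when switching from or to and: any airport whose name contains International or Airport satisfies or, but only the exact airport satisfies and.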
# minimum_should_match controls how strict the term matching is; it accepts a number or a percentage
# Use it only with or; with and the query returns no data
GET kibana_sample_data_flights_fulltext/_search
{"from": 100,"track_total_hits": true,"query":{"match": {"Origin":{"query": "Cape Town International Airport","analyzer": "standard","operator": "or","minimum_should_match": 2}}}
}
# Skip ahead: the total of full matches is 111, so skip 110
# The 111th hit is still a full match; from the 112th on, hits match only 2 of the words
GET kibana_sample_data_flights_fulltext/_search
{"from": 110,"track_total_hits": true,"query":{"match": {"Origin":{"query": "Cape Town International Airport","analyzer": "standard","operator": "or","minimum_should_match": 2}}}
}
# With minimum_should_match=4 this is equivalent to using and
GET kibana_sample_data_flights_fulltext/_search
{"track_total_hits": true,"query":{"match": {"Origin":{"query": "Cape Town International Airport","analyzer": "standard","operator": "or","minimum_should_match": 4}}}
}
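# minimum_should_match as a percentage: this is not simply the share of tokens, the computed number is rounded down; see the documentation
# A plain number is recommended; a percentage is useful when the query has many terms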
GET kibana_sample_data_flights_fulltext/_search
{"track_total_hits": true,"query":{"match": {"Origin":{"query": "Cape Town International Airport","analyzer": "standard","operator": "or","minimum_should_match": "50%"}}}
}
# minimum_should_match can also be negative: -1 means at most one token may be missing. It is harder to read, so it is not recommended
GET kibana_sample_data_flights_fulltext/_search
{"track_total_hits": true,"query":{"match": {"Origin":{"query": "Cape Town International Airport","analyzer": "standard","operator": "or","minimum_should_match": -1}}}
}
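
The number, percentage and negative forms used above can be compared with a small Python sketch. It follows the rules as I understand them from the documentation (percentages are rounded down, negative values are subtracted from the total); `required_matches` is a hypothetical helper and does not handle combined specs such as `3<90%`.

```python
def required_matches(optional_clauses, spec):
    # Return how many of the optional clauses must match for a given spec.
    spec = str(spec).strip()
    if spec.endswith("%"):
        percent = int(spec[:-1])
        # The count computed from a percentage is rounded down.
        count = optional_clauses * abs(percent) // 100
        negative = percent < 0
    else:
        number = int(spec)
        count = abs(number)
        negative = number < 0
    # A negative spec says how many clauses may be missing.
    result = optional_clauses - count if negative else count
    return max(result, 0)


for spec in (2, 4, "50%", -1, "-25%"):
    print(spec, "->", required_matches(4, spec))
```

For the 4 tokens of "Cape Town International Airport" this gives 2, 4, 2, 3 and 3, which matches the requests above: 4 behaves like and, and -1 requires 3 of the 4 tokens.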
# fuzziness: typo-tolerant search that helps correct mistyped terms; see the documentation for details
# Here Cape is mistyped as Capa
GET kibana_sample_data_flights_fulltext/_search
{"track_total_hits": true,"query":{"match": {"Origin":{"query": "Capa","analyzer": "standard","operator": "or","fuzziness": 1}}}
}
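
fuzziness is expressed as an edit distance. The Python sketch below computes the plain Levenshtein distance to show why Capa still finds Cape with fuzziness 1. It is a simplification: as far as I know, Elasticsearch by default also counts a transposition of two adjacent characters as a single edit, which this `edit_distance` helper (a hypothetical name) does not.

```python
def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character into a
                            prev[j - 1] + cost))  # substitute (or keep) a character
        prev = curr
    return prev[-1]


indexed_terms = ["cape", "town", "international", "airport"]
print([t for t in indexed_terms if edit_distance("capa", t) <= 1])  # ['cape']
```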

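## match_bool_prefix: prefix matching
# Combines match and bool
# Drop the final Airport and remove the trailing l from International: the first 2 words are matched as whole terms, and the last one, Internationa, is matched as a prefix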

GET kibana_sample_data_flights_fulltext/_search
{"track_total_hits": true,"query":{"match_bool_prefix": {"Origin":"Cape Town Internationa"}}
}
# Original request
GET kibana_sample_data_flights_fulltext/_search
{"track_total_hits": true,"query":{"match": {"Origin":"Cape Town International Airport"}}
}
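
The match_bool_prefix behaviour described above can be modelled roughly as follows. This is a sketch under the assumption of whitespace tokenization and the default or combination; `match_bool_prefix` here is a hypothetical Python function, not the Elasticsearch implementation.

```python
def match_bool_prefix(doc_text, query_text):
    doc_tokens = doc_text.lower().split()
    # Every query token except the last is a whole-term clause;
    # the last token is a prefix clause. Assumes a non-empty query.
    *terms, last = query_text.lower().split()
    term_hit = any(term in doc_tokens for term in terms)
    prefix_hit = any(token.startswith(last) for token in doc_tokens)
    # The clauses are combined with "or", as in the default match query.
    return term_hit or prefix_hit


query = "Cape Town Internationa"
print(match_bool_prefix("Cape Town International Airport", query))           # True
print(match_bool_prefix("Ministro Pistarini International Airport", query))  # True, via the prefix
print(match_bool_prefix("Ministro Pistarini", query))                        # False
```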

## match_phrase: phrase search. Terms must match in the order they were entered; the previous queries matched each term independently

GET kibana_sample_data_flights_fulltext/_search
{"track_total_hits": true,"query":{"match_phrase": {"Origin":"Cape Town International Airport"}}
}
GET kibana_sample_data_flights_fulltext/_search
{"track_total_hits": true,"query":{"match_phrase": {"Origin":"Cape Town"}}
}
# Skipping the word Town in the middle returns nothing, because no such phrase exists
GET kibana_sample_data_flights_fulltext/_search
{"track_total_hits": true,"query":{"match_phrase": {"Origin":"Cape International Airport"}}
}
# slop: the number of positions by which the phrase terms may deviate; with it, skipping the word Town still matches
# slop costs extra computation
GET kibana_sample_data_flights_fulltext/_search
{"track_total_hits": true,"query":{"match_phrase": {"Origin":{"query": "Cape International Airport","slop": 1}}}
}
# slop again, this time skipping two words in the middle
GET kibana_sample_data_flights_fulltext/_search
{"track_total_hits": true,"query":{"match_phrase": {"Origin":{"query": "Cape Airport","slop": 2}}}
}
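
Phrase matching works on token positions, and slop relaxes how far those positions may drift. The Python sketch below is a simplified model, assuming whitespace tokens and no repeated query terms; `phrase_match` is a hypothetical helper and the brute-force search over positions is for clarity, not speed.

```python
from itertools import product


def phrase_match(doc_tokens, query_tokens, slop=0):
    # Collect the positions of every query term in the document.
    positions = []
    for term in query_tokens:
        found = [i for i, token in enumerate(doc_tokens) if token == term]
        if not found:
            return False
        positions.append(found)
    # Shift each position back by the term's index in the query. For an exact
    # phrase all shifted values are equal; slop is the spread that is tolerated.
    for combo in product(*positions):
        shifted = [pos - idx for idx, pos in enumerate(combo)]
        if max(shifted) - min(shifted) <= slop:
            return True
    return False


doc = "cape town international airport".split()
print(phrase_match(doc, ["cape", "town"]))                              # True
print(phrase_match(doc, ["cape", "international", "airport"]))          # False
print(phrase_match(doc, ["cape", "international", "airport"], slop=1))  # True
print(phrase_match(doc, ["cape", "airport"], slop=2))                   # True
```

The search over position combinations also hints at why slop costs more than a plain phrase query.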

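## match_phrase_prefix
# Phrase prefix query: combines phrase matching with a prefix
# The leading tokens are matched as a phrase
# The last token is matched as a prefix

# Drop the trailing t from Airport; this is somewhat more efficient than slop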
GET kibana_sample_data_flights_fulltext/_search
{"track_total_hits": true,"query":{"match_phrase_prefix": {"Origin":"Cape Town International Airpor"}}
}

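## multi_match: multiple fields
# Many use cases query several fields with the same text, for example product title and product description in e-commerce
# multi_match is built for this; with a single field it is equivalent to match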

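## type: match type
# best_fields: uses the score of the best-scoring field; the default type
# most_fields: sums the scores of all matching fields
# cross_fields: for multi-field queries where some tokens are in one field and the rest in another
# phrase: phrase matching, equivalent to match_phrase
# phrase_prefix: phrase prefix matching, equivalent to match_phrase_prefix
# bool_prefix: full-text boolean prefix matching, equivalent to match_bool_prefix
# tie_breaker: controls how the scores of the other fields are combined, from 0.0 to 1.0: 0 uses only the best field's score, 1 adds all field scores together
# Switch between types (best_fields/most_fields) and compare scores and hit counts before and after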

GET kibana_sample_data_flights_fulltext/_search
{"track_total_hits": true,"query": {"multi_match": {"query": "Cape Town International Airport","type": "best_fields","fields": ["Origin","Dest"]}}
}
# Field names can also use wildcards
GET kibana_sample_data_flights_fulltext/_search
{"track_total_hits": true,"query": {"multi_match": {"query": "Cape Town International Airport","fields": "*rigin"}}
}
# With multiple fields, append ^ and a number to a field name to boost it, similar to setting boost separately
GET kibana_sample_data_flights_fulltext/_search
{"track_total_hits": true,"query": {"multi_match": {"query": "Cape Town International Airport","type": "best_fields","fields": ["Origin","Dest^2"]}}
}
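
How best_fields, most_fields, tie_breaker and the ^ boost interact can be sketched with plain arithmetic. The per-field scores below are made-up numbers, and `combine_field_scores` is a hypothetical helper that models only the combination step, not how each field score is produced.

```python
def combine_field_scores(field_scores, boosts=None, match_type="best_fields", tie_breaker=0.0):
    boosts = boosts or {}
    # Apply the per-field boost, e.g. "Dest^2" becomes boosts={"Dest": 2}.
    scores = [score * boosts.get(field, 1.0) for field, score in field_scores.items()]
    if not scores:
        return 0.0
    if match_type == "most_fields":
        # most_fields adds up the scores of all matching fields.
        return sum(scores)
    # best_fields takes the best field and adds tie_breaker times the rest.
    best = max(scores)
    return best + tie_breaker * (sum(scores) - best)


scores = {"Origin": 3.0, "Dest": 1.0}  # hypothetical per-field scores
print(combine_field_scores(scores))                                         # 3.0
print(combine_field_scores(scores, match_type="most_fields"))               # 4.0
print(combine_field_scores(scores, tie_breaker=0.5))                        # 3.5
print(combine_field_scores(scores, boosts={"Dest": 2}, tie_breaker=0.5))    # 4.0
```

With tie_breaker at 1.0, best_fields produces the same total as most_fields in this model, which is a quick way to remember what the parameter does.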

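## Intervals: ordering and spacing of text. This is fairly complex and rarely needed; it takes deeper study
# The intervals query is a very advanced full-text capability that controls the spacing between the query tokens within the content. It supports several kinds of interval rules.
# Multiple conditions are applied in sequence: the first condition is evaluated, then the later conditions run on its results, similar to if/then logic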

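## intervals match: interval match query
# match: keyword; the analyzed full-text rule of the intervals query, analogous to the match query above
# query: keyword; the input text to search for
# max_gaps: keyword; maximum number of positions allowed between matching terms; default -1, no limit
# ordered: keyword; whether the terms must appear in the given order; true/false, default false
# analyzer: keyword; the analyzer
# filter: keyword; a secondary filter on the intervals; several filter types are supported
# use_field: keyword; match intervals from this field instead of the top-level field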

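## filter parameters: the secondary interval filter and its types
# Types
# after: intervals that follow an interval from the filter rule
# before: intervals that occur before an interval from the filter rule
# contained_by: intervals contained by an interval from the filter rule
# containing: intervals that contain an interval from the filter rule
# not_contained_by: intervals not contained by an interval from the filter rule
# not_containing: intervals that do not contain an interval from the filter rule
# not_overlapping: intervals that do not overlap an interval from the filter rule
# overlapping: intervals that overlap an interval from the filter rule
# script: restricts intervals with a painless script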

POST kibana_sample_data_flights_fulltext/_search
{"track_total_hits": true,"query": {"intervals": {"Dest": {"match": {"ordered": true,"query": "Sydney Smith Airport","analyzer": "standard","max_gaps": 2,"filter": {"containing": {"match": {"query": "International"}}}}}}}
}

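## Query String
# The DSL is fairly complex, so ES also provides a SQL-like expression syntax; it does not go beyond the DSL in capability and is only a convenience
# Pros and cons
# Pros: simple and direct
# Cons: the syntax is hard to read and its expressiveness is limited; avoid it where possible

# Query Dest with OR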
POST kibana_sample_data_flights_fulltext/_search
{"query":{"query_string": {"query": "Dest:(Phoenix OR Ministro)"}}
}
# Query a numeric range
POST kibana_sample_data_flights_fulltext/_search
{"query":{"query_string": {"query": "FlightDelayMin:[10 TO 100]"}}
}

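## URL query string
# The query expression is passed in the URL
## Pros and cons
# Pros: short and direct
# Cons: limited expressiveness; used only rarely, the DSL is recommended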

POST kibana_sample_data_flights_fulltext/_search?q=(Dest:Phoenix) AND (Origin:Chubu)
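
Outside the Kibana console, the spaces, colons and parentheses in the q parameter need to be URL-encoded. A minimal Python sketch using only the standard library, assuming the same index name and query as above:

```python
from urllib.parse import parse_qs, quote, urlencode

query = "(Dest:Phoenix) AND (Origin:Chubu)"
# quote_via=quote encodes spaces as %20 instead of "+".
encoded = urlencode({"q": query}, quote_via=quote)
path = "/kibana_sample_data_flights_fulltext/_search?" + encoded
print(path)

# Decoding gives back the original expression.
print(parse_qs(encoded)["q"][0])
```

The resulting path can be appended to the cluster address in any HTTP client.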

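### Query performance analysis
## Profile
# 1. Produces a performance report based on the query tree
# 2. Analogous to the execution plan of a traditional relational database
# 3. Kibana can visualize it; reading it takes some experience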

POST kibana_sample_data_flights_fulltext/_search
{"profile":true,"query":{"query_string": {"query": "Dest:(Phoenix OR Ministro)"}}
}

The profile report is returned in the profile section of the search response.

The Search Profiler in Kibana can also be used to view it.

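## Explain: how the score is calculated; worth a deeper look if you are interested
# 1. Explains the logic and rules of score calculation
# 2. Helps in understanding the score details of full-text queries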

POST kibana_sample_data_flights_fulltext/_explain/74TR0Y8BbWz2Sn6EhZCn
{"query":{"match": {"Dest": "Ministro Pistarini International Airport"}}
}

The _explain result breaks the score down term by term, for example the score calculation for the term ministro in the Dest field.
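
The per-term numbers in an explain response follow the BM25 shape of an idf part multiplied by a tf part. The Python sketch below shows that shape with the usual defaults k1=1.2 and b=0.75. It is an approximation for reading the output, not the exact Elasticsearch computation: the real response also multiplies in a boost factor, which is left out here, and all input numbers are made up.

```python
import math


def bm25_term_score(freq, doc_len, avg_doc_len, doc_count, doc_freq, k1=1.2, b=0.75):
    # idf: rarer terms (small doc_freq relative to doc_count) weigh more.
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    # tf: grows with term frequency but saturates, and is normalized by field length.
    tf = freq / (freq + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * tf


# Hypothetical statistics: 1000 documents, fields of 4 tokens on average.
rare = bm25_term_score(freq=1, doc_len=4, avg_doc_len=4, doc_count=1000, doc_freq=10)
common = bm25_term_score(freq=1, doc_len=4, avg_doc_len=4, doc_count=1000, doc_freq=800)
print(rare > common)  # True
```

This explains why a distinctive word such as ministro contributes much more to the score than a word like airport that appears in most documents.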

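## Full-text query advice
# Full-text queries are inexact (some parameters can make them behave as exact queries)
# Relevance depends on the analysis algorithm (this needs to be understood; results that differ from expectations are not an es bug)
# Precision is approximate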

elasticsearch text: official reference for the text field type https://www.elastic.co/guide/en/elasticsearch/reference/8.6/text.html

elasticsearch analysis-analyzers: official reference for the built-in analyzers https://www.elastic.co/guide/en/elasticsearch/reference/8.6/analysis-analyzers.html

elasticsearch full-text-queries: official reference for full-text queries https://www.elastic.co/guide/en/elasticsearch/reference/8.6/full-text-queries.html

elasticsearch query-dsl-intervals-query: official reference for the intervals query https://www.elastic.co/guide/en/elasticsearch/reference/8.6/query-dsl-intervals-query.html

elasticsearch index-modules-similarity: official reference for the similarity module https://www.elastic.co/guide/en/elasticsearch/reference/8.6/index-modules-similarity.html

elasticsearch similarity: official reference for the similarity mapping parameter https://www.elastic.co/guide/en/elasticsearch/reference/8.6/similarity.html