4. Tokenization with Analyzers


1. Analysis and Analyzer

Analysis: text analysis is the process of converting full text into a series of terms (term/token); it is also called tokenization. Analysis is carried out by an Analyzer: you can use Elasticsearch's built-in analyzers or define custom analyzers as needed. Besides converting terms when data is written, the same analyzer must also be used to analyze the query string when matching Query statements.
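As a rough illustration of index-time and query-time analysis (a minimal sketch; the index name my_index and field name title are placeholders, not part of the original post), an analyzer can be set in a field mapping and then tested against that field with _analyze:

PUT my_index
{
  "mappings": {
    "properties": {
      "title": { "type": "text", "analyzer": "standard" }
    }
  }
}

# Run the field's configured analyzer over a sample string
GET my_index/_analyze
{
  "field": "title",
  "text": "Mastering brown-foxes leap over is"
}

At search time, a match query on title runs the query string through the same analyzer unless a separate search_analyzer is configured on the field.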

2. Components of an Analyzer

An analyzer is the component dedicated to text analysis. An Analyzer consists of three parts:

Character Filters: pre-process the raw text, for example stripping HTML markup.
Tokenizer: split the text into terms (tokens) according to certain rules.
Token Filters: post-process the terms produced by the tokenizer, for example lowercasing, removing stopwords, and adding synonyms.

A custom analyzer combining all three parts is sketched below.
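A minimal custom-analyzer sketch (the index name my_custom_index and analyzer name my_analyzer are placeholders) that chains an html_strip character filter, the standard tokenizer, and the lowercase and stop token filters:

PUT my_custom_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": [ "html_strip" ],    // Character Filter: strip HTML markup
          "tokenizer": "standard",            // Tokenizer: split into terms
          "filter": [ "lowercase", "stop" ]   // Token Filters: lowercase, remove stopwords
        }
      }
    }
  }
}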

3. Elasticsearch Built-in Analyzers

3.1 Standard Analyzer

GET _analyze
{
  "analyzer": "standard",
  "text": "Mastering brown-foxes leap over is"
}

# Output
{
  "tokens" : [
    { "token" : "mastering", "start_offset" : 0, "end_offset" : 9, "type" : "<ALPHANUM>", "position" : 0 },  // lowercased
    { "token" : "brown", "start_offset" : 10, "end_offset" : 15, "type" : "<ALPHANUM>", "position" : 1 },
    { "token" : "foxes", "start_offset" : 16, "end_offset" : 21, "type" : "<ALPHANUM>", "position" : 2 },    // split on the hyphen
    { "token" : "leap", "start_offset" : 22, "end_offset" : 26, "type" : "<ALPHANUM>", "position" : 3 },
    { "token" : "over", "start_offset" : 27, "end_offset" : 31, "type" : "<ALPHANUM>", "position" : 4 },
    { "token" : "is", "start_offset" : 32, "end_offset" : 34, "type" : "<ALPHANUM>", "position" : 5 }        // stopwords are not removed
  ]
}
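As the last token shows, the standard analyzer does not remove stopwords by default. If that is needed, a standard-type analyzer can be configured with a stopword list; a minimal sketch (index and analyzer names are placeholders):

PUT my_standard_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_standard": {
          "type": "standard",
          "stopwords": "_english_"   // use the built-in English stopword list
        }
      }
    }
  }
}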

3.2 Simple Analyzer

GET _analyze
{
  "analyzer": "simple",
  "text": "Mastering brown-foxes leap over is"
}

# Response
{
  "tokens" : [
    { "token" : "mastering", "start_offset" : 0, "end_offset" : 9, "type" : "word", "position" : 0 },
    { "token" : "brown", "start_offset" : 10, "end_offset" : 15, "type" : "word", "position" : 1 },
    { "token" : "foxes", "start_offset" : 16, "end_offset" : 21, "type" : "word", "position" : 2 },
    { "token" : "leap", "start_offset" : 23, "end_offset" : 27, "type" : "word", "position" : 3 },
    { "token" : "over", "start_offset" : 28, "end_offset" : 32, "type" : "word", "position" : 4 },
    { "token" : "is", "start_offset" : 33, "end_offset" : 35, "type" : "word", "position" : 5 }
  ]
}

3.3 Whitespace Analyzer

GET _analyze
{
  "analyzer": "whitespace",
  "text": "Mastering brown-foxes leap over is"
}

# Response
{
  "tokens" : [
    { "token" : "Mastering", "start_offset" : 0, "end_offset" : 9, "type" : "word", "position" : 0 },
    { "token" : "brown-foxes", "start_offset" : 10, "end_offset" : 21, "type" : "word", "position" : 1 },
    { "token" : "leap", "start_offset" : 22, "end_offset" : 26, "type" : "word", "position" : 2 },
    { "token" : "over", "start_offset" : 27, "end_offset" : 31, "type" : "word", "position" : 3 },
    { "token" : "is", "start_offset" : 32, "end_offset" : 34, "type" : "word", "position" : 4 }
  ]
}

3.4 Stop Analyzer

GET _analyze
{
  "analyzer": "stop",
  "text": "Mastering brown-foxes leap over is"
}

# Response
{
  "tokens" : [
    { "token" : "mastering", "start_offset" : 0, "end_offset" : 9, "type" : "word", "position" : 0 },
    { "token" : "brown", "start_offset" : 10, "end_offset" : 15, "type" : "word", "position" : 1 },
    { "token" : "foxes", "start_offset" : 16, "end_offset" : 21, "type" : "word", "position" : 2 },
    { "token" : "leap", "start_offset" : 22, "end_offset" : 26, "type" : "word", "position" : 3 },
    { "token" : "over", "start_offset" : 27, "end_offset" : 31, "type" : "word", "position" : 4 }
  ]
}
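The stop analyzer removes English stopwords such as "is" by default. The list is configurable; a sketch with a custom stopword list (the index and analyzer names, and the chosen stopwords, are only an example):

PUT my_stop_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_stop": {
          "type": "stop",
          "stopwords": [ "is", "over" ]   // custom stopword list instead of the default English one
        }
      }
    }
  }
}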

3.5 Keyword Analyzer

GET _analyze
{
  "analyzer": "keyword",
  "text": "Mastering elasticsearch brown-foxes leap over is"
}

# Response
{
  "tokens" : [
    { "token" : "Mastering elasticsearch brown-foxes leap over is", "start_offset" : 0, "end_offset" : 48, "type" : "word", "position" : 0 }
  ]
}

3.6 Pattern Analyzer

GET _analyze
{
  "analyzer": "pattern",
  "text": "Mastering brown-foxes leap over is"
}

# Response
{
  "tokens" : [
    { "token" : "mastering", "start_offset" : 0, "end_offset" : 9, "type" : "word", "position" : 0 },
    { "token" : "brown", "start_offset" : 10, "end_offset" : 15, "type" : "word", "position" : 1 },
    { "token" : "foxes", "start_offset" : 16, "end_offset" : 21, "type" : "word", "position" : 2 },
    { "token" : "leap", "start_offset" : 22, "end_offset" : 26, "type" : "word", "position" : 3 },
    { "token" : "over", "start_offset" : 27, "end_offset" : 31, "type" : "word", "position" : 4 },
    { "token" : "is", "start_offset" : 32, "end_offset" : 34, "type" : "word", "position" : 5 }
  ]
}
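The pattern analyzer splits on the regular expression \W+ (non-word characters) by default, which is why this output matches the standard analyzer except for the token type. The expression is configurable; a sketch of a comma-splitting variant (placeholder names):

PUT my_pattern_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "comma_analyzer": {
          "type": "pattern",
          "pattern": ","   // split on commas instead of the default \W+
        }
      }
    }
  }
}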

3.7 ICU Analyzer

A simple way to install plugins when running Elasticsearch in Docker, without building a custom image from a Dockerfile. Using the two-node ES setup from this post as an example:

1. Enter the ES container and start bash: docker exec -it es7_01 bash (es7_01 is the container name).
2. Once step 1 succeeds you are inside the container; running pwd shows the working directory /usr/share/elasticsearch. Run the plugin install command bin/elasticsearch-plugin install analysis-icu and wait for the plugin to download and install.
3. Type exit to leave the container's bash.
4. Do the same for es7_02 and install the plugin there as well.
5. Restart the containers with docker-compose restart.
6. After the restart, check that the installation succeeded with curl 127.0.0.1:9200/_cat/plugins. Output like
   es7_01 analysis-icu 7.2.0
   es7_02 analysis-icu 7.2.0
   means it worked.

GET _analyze
{
  "analyzer": "icu_analyzer",
  "text": "他说的确实有理"
}

{
  "tokens" : [
    { "token" : "他", "start_offset" : 0, "end_offset" : 1, "type" : "<IDEOGRAPHIC>", "position" : 0 },
    { "token" : "说的", "start_offset" : 1, "end_offset" : 3, "type" : "<IDEOGRAPHIC>", "position" : 1 },
    { "token" : "确实", "start_offset" : 3, "end_offset" : 5, "type" : "<IDEOGRAPHIC>", "position" : 2 },
    { "token" : "有理", "start_offset" : 5, "end_offset" : 7, "type" : "<IDEOGRAPHIC>", "position" : 3 }
  ]
}

3.8 IK Analyzer

IK analyzer installation:

bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.1.0/elasticsearch-analysis-ik-7.1.0.zip

Kibana demo:

POST _analyze
{
  "analyzer": "ik_smart",
  "text": "中华人民共和国国歌"
}

POST _analyze
{
  "analyzer": "ik_max_word",
  "text": "中华人民共和国国歌"
}

ik_max_word: performs the finest-grained segmentation. For example, "中华人民共和国国歌" is split into "中华人民共和国, 中华人民, 中华, 华人, 人民共和国, 人民, 人, 民, 共和国, 共和, 和, 国国, 国歌", exhausting every possible combination; suitable for Term Query.
ik_smart: performs the coarsest-grained segmentation. For example, "中华人民共和国国歌" is split into "中华人民共和国, 国歌"; suitable for Phrase queries.
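A common setup is to index with ik_max_word so every candidate term is searchable, and to analyze queries with ik_smart; a hedged sketch, assuming the IK plugin is installed (the index name my_ik_index and field name content are placeholders):

PUT my_ik_index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "ik_max_word",      // fine-grained analysis at index time
        "search_analyzer": "ik_smart"   // coarse-grained analysis at query time
      }
    }
  }
}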

