1. Analysis and Analyzer
Analysis: text analysis is the process of converting full text into a series of terms (term/token), also known as tokenization. Analysis is carried out by an Analyzer.
You can use Elasticsearch's built-in analyzers or define custom analyzers as needed. Besides converting text into terms when data is written, the same analyzer must also be used to analyze the query string when matching Query statements.
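As a minimal sketch of this (the index name my_index and the field title are made up for illustration), the same standard analyzer is applied when the document is indexed and again when a match query string is analyzed:

PUT my_index
{
  "mappings": {
    "properties": {
      "title": { "type": "text", "analyzer": "standard" }
    }
  }
}

# The query string "Brown-Foxes" is run through the same standard analyzer
# and becomes the terms [brown, foxes] before matching against the inverted index
GET my_index/_search
{
  "query": {
    "match": { "title": "Brown-Foxes" }
  }
}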
2. The Components of an Analyzer
An analyzer is the component dedicated to tokenization. An Analyzer consists of three parts:
Character Filters: pre-process the raw text, e.g. strip HTML tags
Tokenizer: split the text into terms according to rules
Token Filter: post-process the produced terms, e.g. lowercase them, remove stopwords, add synonyms
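As a sketch of how the three parts fit together (the index and analyzer names here are made up), a custom analyzer is assembled from a character filter, a tokenizer, and token filters in the index settings:

# Custom analyzer: html_strip char filter + standard tokenizer + lowercase/stop token filters
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": [ "html_strip" ],
          "tokenizer": "standard",
          "filter": [ "lowercase", "stop" ]
        }
      }
    }
  }
}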
3. Elasticsearch Built-in Analyzers
3.1 Standard Analyzer
GET _analyze
{
"analyzer": "standard",
"text":"Mastering brown-foxes leap over is"
}
# Output
{
"tokens" : [
{
"token" : "mastering", // 小写处理
"start_offset" : 0,
"end_offset" : 9,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "brown",
"start_offset" : 10,
"end_offset" : 15,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "foxes", // 切分字符
"start_offset" : 16,
"end_offset" : 21,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "leap",
"start_offset" : 22,
"end_offset" : 26,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "over",
"start_offset" : 27,
"end_offset" : 31,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "is", // 没有删除停用词
"start_offset" : 32,
"end_offset" : 34,
"type" : "<ALPHANUM>",
"position" : 5
}
]
}
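As the last token shows, the standard analyzer lowercases and splits on non-alphanumeric characters but does not remove stop words by default. If stop-word filtering is needed, the standard analyzer can be configured with a stopwords parameter, roughly like this (index and analyzer names are illustrative):

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_standard": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
}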
3.2 Simple Analyzer
GET _analyze
{
"analyzer": "simple",
"text":"Mastering brown-foxes leap over is"
}
# Response
{
"tokens" : [
{
"token" : "mastering",
"start_offset" : 0,
"end_offset" : 9,
"type" : "word",
"position" : 0
},
{
"token" : "brown",
"start_offset" : 10,
"end_offset" : 15,
"type" : "word",
"position" : 1
},
{
"token" : "foxes",
"start_offset" : 16,
"end_offset" : 21,
"type" : "word",
"position" : 2
},
{
"token" : "leap",
"start_offset" : 23,
"end_offset" : 27,
"type" : "word",
"position" : 3
},
{
"token" : "over",
"start_offset" : 28,
"end_offset" : 32,
"type" : "word",
"position" : 4
},
{
"token" : "is",
"start_offset" : 33,
"end_offset" : 35,
"type" : "word",
"position" : 5
}
]
}
3.3 Whitespace Analyzer
GET _analyze
{
"analyzer": "whitespace",
"text":"Mastering brown-foxes leap over is"
}
# Response
{
"tokens" : [
{
"token" : "Mastering",
"start_offset" : 0,
"end_offset" : 9,
"type" : "word",
"position" : 0
},
{
"token" : "brown-foxes",
"start_offset" : 10,
"end_offset" : 21,
"type" : "word",
"position" : 1
},
{
"token" : "leap",
"start_offset" : 22,
"end_offset" : 26,
"type" : "word",
"position" : 2
},
{
"token" : "over",
"start_offset" : 27,
"end_offset" : 31,
"type" : "word",
"position" : 3
},
{
"token" : "is",
"start_offset" : 32,
"end_offset" : 34,
"type" : "word",
"position" : 4
}
]
}
3.4 Stop Analyzer
GET _analyze
{
"analyzer": "stop",
"text":"Mastering brown-foxes leap over is"
}
# Response
{
"tokens" : [
{
"token" : "mastering",
"start_offset" : 0,
"end_offset" : 9,
"type" : "word",
"position" : 0
},
{
"token" : "brown",
"start_offset" : 10,
"end_offset" : 15,
"type" : "word",
"position" : 1
},
{
"token" : "foxes",
"start_offset" : 16,
"end_offset" : 21,
"type" : "word",
"position" : 2
},
{
"token" : "leap",
"start_offset" : 22,
"end_offset" : 26,
"type" : "word",
"position" : 3
},
{
"token" : "over",
"start_offset" : 27,
"end_offset" : 31,
"type" : "word",
"position" : 4
}
]
}
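The stop analyzer behaves like the simple analyzer plus stop-word removal, which is why the token "is" no longer appears above. Its stop-word list can also be replaced, roughly as follows (names are illustrative):

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_stop": {
          "type": "stop",
          "stopwords": [ "is", "over", "and", "the" ]
        }
      }
    }
  }
}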
3.5 Keyword Analyzer
GET _analyze
{
"analyzer": "keyword",
"text":"Mastering elasticsearch brown-foxes leap over is"
}
# Response
{
"tokens" : [
{
"token" : "Mastering elasticsearch brown-foxes leap over is",
"start_offset" : 0,
"end_offset" : 48,
"type" : "word",
"position" : 0
}
]
}
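The keyword analyzer emits the entire input as a single token, so the field only supports exact matches. A sketch of applying it to a text field (illustrative names); in practice a field of type keyword is usually used for the same effect:

PUT my_index
{
  "mappings": {
    "properties": {
      "tag": { "type": "text", "analyzer": "keyword" }
    }
  }
}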
3.6 Pattern Analyzer
GET _analyze
{
"analyzer": "pattern",
"text":"Mastering brown-foxes leap over is"
}
# Response
{
"tokens" : [
{
"token" : "mastering",
"start_offset" : 0,
"end_offset" : 9,
"type" : "word",
"position" : 0
},
{
"token" : "brown",
"start_offset" : 10,
"end_offset" : 15,
"type" : "word",
"position" : 1
},
{
"token" : "foxes",
"start_offset" : 16,
"end_offset" : 21,
"type" : "word",
"position" : 2
},
{
"token" : "leap",
"start_offset" : 22,
"end_offset" : 26,
"type" : "word",
"position" : 3
},
{
"token" : "over",
"start_offset" : 27,
"end_offset" : 31,
"type" : "word",
"position" : 4
},
{
"token" : "is",
"start_offset" : 32,
"end_offset" : 34,
"type" : "word",
"position" : 5
}
]
}
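By default the pattern analyzer splits on the regular expression \W+ (runs of non-word characters) and lowercases, which is why its output matches the standard analyzer here. A custom pattern can be supplied, sketched below with illustrative names (splitting on commas only):

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_pattern": {
          "type": "pattern",
          "pattern": ",",
          "lowercase": true
        }
      }
    }
  }
}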
3.7 ICU Analyzer
A simple way to install plugins for those running Elasticsearch in Docker, with no need to build a custom image from a Dockerfile. Taking the two-node ES setup in this post as an example:
1. Enter the ES container and start bash: docker exec -it es7_01 bash (es7_01 is the container name).
2. After step 1 you are inside the container; running pwd shows you are at /usr/share/elasticsearch. Now run the plugin install command bin/elasticsearch-plugin install analysis-icu and wait for the plugin to download and install.
3. Type exit to leave the container's bash.
4. Repeat the same steps for es7_02 and install the plugin there.
5. Restart the containers with docker-compose restart.
6. After the restart, verify the installation with curl 127.0.0.1:9200/_cat/plugins; output like
es7_01 analysis-icu 7.2.0
es7_02 analysis-icu 7.2.0
means it succeeded.
GET _analyze
{
"analyzer": "icu_analyzer",
"text":"他说的确实有理"
}
# Response
{
"tokens" : [
{
"token" : "他",
"start_offset" : 0,
"end_offset" : 1,
"type" : "<IDEOGRAPHIC>",
"position" : 0
},
{
"token" : "说的",
"start_offset" : 1,
"end_offset" : 3,
"type" : "<IDEOGRAPHIC>",
"position" : 1
},
{
"token" : "确实",
"start_offset" : 3,
"end_offset" : 5,
"type" : "<IDEOGRAPHIC>",
"position" : 2
},
{
"token" : "有理",
"start_offset" : 5,
"end_offset" : 7,
"type" : "<IDEOGRAPHIC>",
"position" : 3
}
]
}
3.8 IK Analyzer
Installing the IK analyzer:
bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.1.0/elasticsearch-analysis-ik-7.1.0.zip
Kibana demo:
POST _analyze
{
"analyzer": "ik_smart",
"text": "中华人民共和国国歌"
}
POST _analyze
{
"analyzer": "ik_max_word",
"text": "中华人民共和国国歌"
}
ik_max_word: splits the text at the finest granularity; for example, "中华人民共和国国歌" is split into "中华人民共和国, 中华人民, 中华, 华人, 人民共和国, 人民, 人, 民, 共和国, 共和, 和, 国国, 国歌", exhausting every possible combination. Suitable for Term Query.
ik_smart: performs the coarsest-grained split; for example, "中华人民共和国国歌" is split into "中华人民共和国, 国歌". Suitable for Phrase queries.
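A common setup (sketched here with made-up index and field names) is to index with ik_max_word for maximum recall and to analyze query strings with ik_smart:

PUT my_index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}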