[Elasticsearch Notes] Miscellaneous

2022-05-05

Contents

General advice
Disable field features you do not use
Do not use the default string mappings
Testing
Mixing exact and stemmed search
Using field values in the score
_recovery
_freeze

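General advice

Do not return large result sets. If you really need one, use the Scroll API.

Avoid large individual documents. By default Elasticsearch rejects request bodies larger than 100 MB (http.max_content_length). The limit can be raised, but Lucene still caps a document at roughly 2 GB.

When sending many requests, use _bulk. The best number of documents per _bulk request depends on your cluster, so benchmark it there; a binary search over the batch size finds a good value quickly.

When loading data into a freshly started cluster, two settings speed up the load: index.refresh_interval = -1 and index.number_of_replicas = 0. Restore both values once loading has finished.

Give at least half of the memory of the machine running Elasticsearch to the filesystem cache. Search speed depends heavily on the filesystem cache.

Prefer auto-generated IDs. With a custom ID, Elasticsearch has to check whether that ID already exists every time a document is indexed, and the cost of this check grows with the amount of data.

nested makes queries several times slower and parent-child makes them hundreds of times slower, so avoid both whenever you can.

Searching one combined field is faster than searching several fields. copy_to does the combining at index time:

    PUT movies
    {
      "mappings": {
        "properties": {
          "name_and_plot": { "type": "text" },
          "name": { "type": "text", "copy_to": "name_and_plot" },
          "plot": { "type": "text", "copy_to": "name_and_plot" }
        }
      }
    }

Pre-index data to speed up searches. Suppose every search runs the same range aggregation on price:

    PUT index/_doc/1
    {
      "designation": "spoon",
      "price": 13
    }

    GET index/_search
    {
      "aggs": {
        "price_ranges": {
          "range": {
            "field": "price",
            "ranges": [
              { "to": 10 },
              { "from": 10, "to": 100 },
              { "from": 100 }
            ]
          }
        }
      }
    }

The range can instead be stored in a keyword field at index time and aggregated with terms:

    PUT index
    {
      "mappings": {
        "properties": {
          "price_range": { "type": "keyword" }
        }
      }
    }

    PUT index/_doc/1
    {
      "designation": "spoon",
      "price": 13,
      "price_range": "10-100"
    }

    GET index/_search
    {
      "aggs": {
        "price_ranges": {
          "terms": { "field": "price_range" }
        }
      }
    }

Does adding replicas always raise throughput? No. For a cluster with num_nodes nodes and num_primaries primary shards in total that must survive max_failures simultaneous node failures, a reasonable replica count is max(max_failures, ceil(num_nodes / num_primaries) - 1).

Use the Profile API to see where a query spends its time. The numbers only show relative cost; the absolute values mean little.

    GET /twitter/_search
    {
      "profile": true,
      "query": {
        "match": { "message": "some number" }
      }
    }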
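The replica-count rule from the general advice is easy to get wrong by hand, so here is a minimal sketch of it as a Python helper. The function name recommended_replicas and the sample cluster sizes are assumptions made for illustration; only the formula itself comes from the notes.

```python
import math


def recommended_replicas(num_nodes: int, num_primaries: int, max_failures: int) -> int:
    """Replica count per the rule max(max_failures, ceil(num_nodes / num_primaries) - 1)."""
    if num_nodes < 1 or num_primaries < 1 or max_failures < 0:
        raise ValueError("num_nodes and num_primaries must be >= 1, max_failures >= 0")
    return max(max_failures, math.ceil(num_nodes / num_primaries) - 1)


# Hypothetical clusters: (nodes, total primary shards, node failures to survive)
for nodes, primaries, failures in [(3, 1, 1), (3, 5, 1), (10, 3, 2)]:
    print(nodes, primaries, failures, "->", recommended_replicas(nodes, primaries, failures))
```

With few primaries and many nodes the second term dominates, since extra replicas are what put every node to work. With many primaries the failure tolerance alone decides the count.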

Disable field features you do not use

A numeric field that is used for histograms but never filtered on does not need to be indexed:

    PUT index
    {
      "mappings": {
        "properties": {
          "foo": { "type": "integer", "index": false }
        }
      }
    }

If you only care whether a text field matches and not how it scores, disable norms:

    PUT index
    {
      "mappings": {
        "properties": {
          "foo": { "type": "text", "norms": false }
        }
      }
    }

If you do not need phrase queries, tell Elasticsearch not to index positions:

    PUT index
    {
      "mappings": {
        "properties": {
          "foo": { "type": "text", "index_options": "freqs" }
        }
      }
    }

If you need neither scores nor phrase queries, combine the two:

    PUT index
    {
      "mappings": {
        "properties": {
          "foo": { "type": "text", "norms": false, "index_options": "freqs" }
        }
      }
    }
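The text-field cases above differ only in which options are switched off, so the mapping can be generated from two yes/no answers. The following is a rough sketch; the helper name text_field_mapping and its parameter names are hypothetical, and it only builds the request body without sending anything to a cluster.

```python
import json


def text_field_mapping(need_scoring: bool, need_phrase_queries: bool) -> dict:
    """Build a text field mapping with the unused features switched off."""
    mapping = {"type": "text"}
    if not need_scoring:
        # Only match / no-match matters, so norms can be dropped.
        mapping["norms"] = False
    if not need_phrase_queries:
        # Without phrase queries, positions need not be indexed.
        mapping["index_options"] = "freqs"
    return mapping


# Body for "PUT index" when neither scores nor phrase queries are needed.
body = {"mappings": {"properties": {"foo": text_field_mapping(False, False)}}}
print(json.dumps(body, indent=2))
```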

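Do not use the default string mappings

By default, dynamic mapping indexes a string field as both text and keyword. In many cases keyword alone is enough, and a custom dynamic_templates entry makes that the default:

    PUT index
    {
      "mappings": {
        "dynamic_templates": [
          {
            "strings": {
              "match_mapping_type": "string",
              "mapping": { "type": "keyword" }
            }
          }
        ]
      }
    }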

Testing

    <dependencies>
      <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-test-framework</artifactId>
        <version>${lucene.version}</version>
        <scope>test</scope>
      </dependency>
      <dependency>
        <groupId>org.elasticsearch.test</groupId>
        <artifactId>framework</artifactId>
        <version>${elasticsearch.version}</version>
        <scope>test</scope>
      </dependency>
    </dependencies>

Mixing exact and stemmed search

Ordinary search on an english-analyzed field works on word stems. How can specific terms be kept from being stemmed? In a simple_query_string query, quoted terms are searched against the field whose name is the original field name plus quote_field_suffix, so pointing quote_field_suffix at the exact sub-field gives exact matching for quoted terms only.

    PUT index
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "english_exact": {
              "tokenizer": "standard",
              "filter": [ "lowercase" ]
            }
          }
        }
      },
      "mappings": {
        "properties": {
          "body": {
            "type": "text",
            "analyzer": "english",
            "fields": {
              "exact": {
                "type": "text",
                "analyzer": "english_exact"
              }
            }
          }
        }
      }
    }

    PUT index/_doc/1
    { "body": "Ski resort" }

    PUT index/_doc/2
    { "body": "A pair of skis" }

    POST index/_refresh

Searching body is stemmed, so both documents match:

    GET index/_search
    {
      "query": {
        "simple_query_string": {
          "fields": [ "body" ],
          "query": "ski"
        }
      }
    }

Searching body.exact is not stemmed, so only document 1 matches:

    GET index/_search
    {
      "query": {
        "simple_query_string": {
          "fields": [ "body.exact" ],
          "query": "ski"
        }
      }
    }

With quote_field_suffix, the quoted term is looked up in body.exact, so again only document 1 matches:

    GET index/_search
    {
      "query": {
        "simple_query_string": {
          "fields": [ "body" ],
          "quote_field_suffix": ".exact",
          "query": "\"ski\""
        }
      }
    }
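To make the effect of quote_field_suffix concrete, the toy model below shows which field each part of a query string would be matched against. It is a simplified illustration, not Elasticsearch's actual query parser: the function name route_terms is made up, and operators such as +, | and - are ignored.

```python
import re
from typing import List, Tuple


def route_terms(query: str, field: str, quote_field_suffix: str) -> List[Tuple[str, str]]:
    """Pair each quoted phrase or bare term with the field it would be searched in."""
    routed = []
    # First alternative captures a quoted phrase, second a bare whitespace-delimited term.
    for quoted, bare in re.findall(r'"([^"]+)"|(\S+)', query):
        if quoted:
            routed.append((field + quote_field_suffix, quoted))
        else:
            routed.append((field, bare))
    return routed


print(route_terms('"ski" resort', "body", ".exact"))
```

In this model the quoted term goes to body.exact (lowercased only) while the bare term goes to body (stemmed), which is why a quoted "ski" no longer matches "skis".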

Using field values in the score

script_score

    PUT script_score_index
    {
      "mappings": {
        "properties": {
          "url": { "type": "text" },
          "pagerank": { "type": "long" },
          "url_length": {
            "type": "rank_feature",
            "positive_score_impact": false
          }
        }
      }
    }

    PUT script_score_index/_doc/1
    {
      "content": "elasticsearch",
      "pagerank": 1,
      "url_length": 22
    }

    PUT script_score_index/_doc/2
    {
      "content": "elasticsearch",
      "pagerank": 8,
      "url_length": 22
    }

    GET script_score_index/_search
    {
      "query": {
        "script_score": {
          "query": {
            "match": { "content": "elasticsearch" }
          },
          "script": {
            "source": "_score*saturation(doc['pagerank'].value,10)"
          }
        }
      }
    }

rank_feature

    PUT rank_feature_index
    {
      "mappings": {
        "properties": {
          "url": { "type": "text" },
          "pagerank": { "type": "rank_feature" },
          "url_length": {
            "type": "rank_feature",
            "positive_score_impact": false
          }
        }
      }
    }

    PUT rank_feature_index/_doc/1
    {
      "content": "elasticsearch",
      "pagerank": 1,
      "url_length": 22
    }

    PUT rank_feature_index/_doc/2
    {
      "content": "elasticsearch",
      "pagerank": 8,
      "url_length": 22
    }

    GET rank_feature_index/_search
    {
      "query": {
        "rank_feature": { "field": "pagerank" }
      }
    }

    GET rank_feature_index/_search
    {
      "query": {
        "bool": {
          "must": {
            "match": { "content": "elasticsearch" }
          },
          "should": {
            "rank_feature": {
              "field": "pagerank",
              "saturation": { "pivot": 10 }
            }
          }
        }
      }
    }
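Both examples rely on the saturation function, which maps a feature value to value / (value + pivot). As a quick check of what that does to the two sample documents, here is a small Python sketch. The function below is a plain re-implementation for illustration, not the Painless built-in itself.

```python
def saturation(value: float, pivot: float) -> float:
    """Saturation as used for scoring: value / (value + pivot), in [0, 1) for value >= 0."""
    return value / (value + pivot)


# The two sample documents have pagerank 1 and 8; the queries use pivot 10.
for pagerank in (1, 8):
    print(f"pagerank={pagerank} -> {saturation(pagerank, 10):.4f}")
# pagerank=1 -> 0.0909
# pagerank=8 -> 0.4444
```

A value equal to the pivot yields exactly 0.5, and larger values approach 1 without reaching it, so very large pageranks cannot swamp the text relevance score.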

_recovery

    GET kibana_sample_data_ecommerce,kibana_sample_data_flights/_recovery?human
    GET /_recovery?human
    GET _recovery?human&detailed=true

_freeze

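A regular index keeps data structures cached in memory, which makes it very fast to search and index into.

For time-series data, however, the results can differ greatly from one search to the next, so caching the related data is pointless and costs a lot of memory.

An index that does not need the cache is a good candidate for _freeze. If it needs the cache again, use _unfreeze.

A frozen index is generally very unlikely to change in the future.

Best practice: because a frozen index is very unlikely to change, run _forcemerge before freezing it so that every shard has exactly one segment on disk. This gives better compression and also simplifies the data structures needed when searching the index.

    POST /my_index/_freeze
    POST /my_index/_unfreeze
    POST /twitter/_forcemerge?max_num_segments=1

Searches skip frozen indices by default; pass ignore_throttled=false to include them:

    GET /twitter/_search?q=user:kimchy&ignore_throttled=false

    # sth: true if the index is frozen
    GET /_cat/indices/*?v&h=i,sth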
