Sphinx Chinese Guide (Part 2): Chinese Word Segmentation with Coreseek

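Before reading this article, please read the previous part, the getting-started guide to Sphinx for Chinese.

As far as I know, there are currently three ways to get Chinese word segmentation in Sphinx:

1. Coreseek

2. Sphinx-for-chinese

3. Segment the text on the client side first, then search for the resulting words directly against a Sphinx single-character index (see the original installation article)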
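
To make the third approach a little more concrete, here is a minimal sketch of client-side segmentation using plain forward maximum matching. This is not the MMSEG algorithm that Coreseek uses. The function names `segment` and `build_query` and the toy dictionary are made up for illustration, and quoting each word as a phrase is only one plausible way to query a single-character index.

```python
def segment(text, dictionary, max_word_len=4):
    """Split text into words by greedy forward maximum matching.

    At each position, take the longest dictionary word that starts there;
    fall back to a single character when nothing matches.
    """
    words = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then shrink.
        for size in range(min(max_word_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                pos += size
                break
    return words


def build_query(words):
    """Quote each word so it is searched as a phrase of adjacent characters."""
    return " ".join('"%s"' % word for word in words)


# Hypothetical toy dictionary; a real setup would load a full word list.
TOY_DICT = {"建筑", "材料", "租赁"}

words = segment("建筑材料租赁", TOY_DICT)
print(words)               # ['建筑', '材料', '租赁']
print(build_query(words))  # "建筑" "材料" "租赁"
```

Greedy matching is the simplest possible strategy and will mis-split ambiguous text, which is one reason to prefer a proper segmenter such as the one Coreseek bundles.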

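Installing Coreseek

The previous article covered the prerequisites for installing Sphinx, so I won't repeat them here. This article builds on that one!

Download Coreseek: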

[root@localhost ~]# cd /usr/local/src
[root@localhost src]# wget http://www.coreseek.cn/uploads/csft/3.1/Source/csft-3.1.tar.gz   # Coreseek source
[root@localhost src]# wget http://www.coreseek.cn/uploads/csft/3.1/Source/mmseg-3.1.tar.gz   # mmseg, the segmenter and dictionary Coreseek uses
[root@localhost src]# tar zxvf csft-3.1.tar.gz
[root@localhost src]# tar zxvf mmseg-3.1.tar.gz

# mmseg must be installed before Coreseek
[root@localhost src]# cd mmseg-3.1
[root@localhost mmseg-3.1]# ./configure --prefix=/usr/local/mmseg
[root@localhost mmseg-3.1]# make && make install

# Install Coreseek
# The Python data source is not used here; if you need it, add --with-python.
# The mmseg paths must match the location where mmseg was installed.

[root@localhost mmseg-3.1]# cd ../csft-3.1
[root@localhost csft-3.1]# ./configure --prefix=/usr/local/coreseek --with-mmseg-includes=/usr/local/mmseg/include/mmseg \
    --with-mmseg-libs=/usr/local/mmseg/lib --without-iconv
[root@localhost csft-3.1]# make && make install

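If nothing went wrong, the installation creates the coreseek directory and its contents under /usr/local/.

Next, generate the mmseg dictionary and create its configuration file: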

[root@localhost csft-3.1]# cd /usr/local/mmseg
[root@localhost mmseg]# bin/mmseg -u /usr/local/src/mmseg-3.1/data/unigram.txt   # unigram.txt is the dictionary source; this generates unigram.txt.uni
[root@localhost mmseg]# cd ../coreseek
[root@localhost coreseek]# mkdir dict   # create the dictionary directory
[root@localhost coreseek]# cp /usr/local/src/mmseg-3.1/data/unigram.txt.uni dict/uni.lib   # copy the generated dictionary into dict
[root@localhost coreseek]# vim dict/mmseg.ini   # create the mmseg config file; the Windows build of Coreseek already ships with it

mmseg.ini:

[mmseg]
merge_number_and_ascii=1;
number_and_ascii_joint=-;
compress_space=0;
seperate_number_ascii=1;

That completes the mmseg configuration! The next step is to configure csft.conf, Coreseek's configuration file.
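
At this point the dict directory should hold two files, uni.lib and mmseg.ini. As a quick sanity check before indexing, something like the following sketch could confirm that both are present. The helper name `missing_dict_files` is made up for illustration, and the path is the one used in the steps above.

```python
import os


def missing_dict_files(dict_dir, required=("uni.lib", "mmseg.ini")):
    """Return the required file names that are absent from dict_dir."""
    return [name for name in required
            if not os.path.isfile(os.path.join(dict_dir, name))]


# Path taken from the steps above; adjust it if Coreseek lives elsewhere.
missing = missing_dict_files("/usr/local/coreseek/dict")
print("missing:", missing if missing else "nothing")
```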

My configuration:

source article_src
{
    type = mysql
    sql_host = 192.168.1.10
    sql_user = root
    sql_pass = pwd
    sql_db = test
    sql_port = 3306 # optional, default is 3306

    sql_query_pre = SET NAMES utf8
    #sql_query_pre = SET SESSION query_cache_type=OFF # uncomment to turn off the SQL query cache
    sql_query = SELECT id,title,cat_id,member_id,content,created FROM sphinx_article

    sql_attr_uint = cat_id
    sql_attr_uint = member_id
    sql_attr_timestamp = created
    sql_query_info = select * from sphinx_article where id=$id
}

index article
{
    source = article_src
    path = /usr/local/coreseek/var/data/article
    docinfo = extern
    charset_type = zh_cn.utf-8 # encoding used by Coreseek
    charset_dictpath = /usr/local/coreseek/dict # directory holding the Coreseek dictionary files

    min_prefix_len = 0
    min_infix_len = 0
    min_word_len = 2
    ngram_len = 1
    ngram_chars = U+4E00..U+9FBF, U+3400..U+4DBF, U+20000..U+2A6DF, U+F900..U+FAFF,\
        U+2F800..U+2FA1F, U+2E80..U+2EFF, U+2F00..U+2FDF, U+3100..U+312F, U+31A0..U+31BF,\
        U+3040..U+309F, U+30A0..U+30FF, U+31F0..U+31FF, U+AC00..U+D7AF, U+1100..U+11FF,\
        U+3130..U+318F, U+A000..U+A48F, U+A490..U+A4CF
    html_strip = 0
}

indexer
{
    mem_limit = 256M
}

searchd
{
    # address = 0.0.0.0
    log = /usr/local/coreseek/var/log/searchd.log
    query_log = /usr/local/coreseek/var/log/query.log
    read_timeout = 5
    max_children = 30
    pid_file = /usr/local/coreseek/var/log/searchd.pid
    max_matches = 1000
    seamless_rotate = 1
}

Build the index:

[root@localhost coreseek]# bin/indexer article
Coreseek Full Text Server 3.1
Copyright (c) 2006-2008 coreseek.com

using config file './csft.conf'...
indexing index 'article'...
collected 1000 docs, 0.0 MB
sorted 0.0 Mhits, 100.0% done
total 1000 docs, 21460 bytes
total 3.244 sec, 6614.99 bytes/sec, 30.82 docs/sec
total 2 reads, 0.0 sec, 26.8 kb/read avg, 0.4 msec/read avg
total 5 writes, 0.0 sec, 11.0 kb/write avg, 0.1 msec/write avg
[root@localhost coreseek]#

Test it with the command-line search tool:

[root@localhost coreseek]# bin/search -c csft.conf -i article 建筑材料租赁
Coreseek Full Text Server 3.1
Copyright (c) 2006-2008 coreseek.com

using config file 'csft.conf'...
index 'article': query '建筑材料租赁 ': returned 1 matches of 1 total in 0.035 sec

displaying matches:
1. document=14, weight=3
    id=14
    title=???????????????
    cat_id=1
    member_id=2
    content=??????????????????????????????????????????????????????????
    created=1264244709

words:
1. '建筑': 3 documents, 3 hits
2. '材料': 4 documents, 4 hits
3. '租赁': 2 documents, 2 hits
[root@localhost coreseek]#

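As the words section shows, the query was split into 建筑, 材料 and 租赁, so Chinese word segmentation works, and the matching row was fetched from SQL!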
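
If you want to check the segmentation from a script rather than by eye, the per-word lines printed by bin/search can be parsed. The sketch below assumes the line format seen in the search output ("N. 'word': X documents, Y hits"); the name `parse_word_stats` is made up for illustration, and the pattern accepts typographic quotes as well in case the text was copied from a web page.

```python
import re

# Matches lines such as: 1. '建筑': 3 documents, 3 hits
# Both ASCII and typographic quotes are accepted.
WORD_LINE = re.compile(
    r"^\s*\d+\.\s*['‘](?P<word>[^'’]+)['’]:\s*"
    r"(?P<docs>\d+) documents, (?P<hits>\d+) hits\s*$"
)


def parse_word_stats(output):
    """Return [(word, documents, hits), ...] from text printed by bin/search."""
    stats = []
    for line in output.splitlines():
        match = WORD_LINE.match(line)
        if match:
            stats.append((match.group("word"),
                          int(match.group("docs")),
                          int(match.group("hits"))))
    return stats


sample = """words:
1. '建筑': 3 documents, 3 hits
2. '材料': 4 documents, 4 hits
3. '租赁': 2 documents, 2 hits
"""
print(parse_word_stats(sample))
# [('建筑', 3, 3), ('材料', 4, 4), ('租赁', 2, 2)]
```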

That concludes the Coreseek part of Sphinx Chinese word segmentation! Next: Chinese word segmentation with Sphinx-for-chinese.

Last modified January 24, 2010

Reposted from: https://www.cnblogs.com/Jerry-blog/p/5044631.html
