Solr + jieba (结巴) Chinese word segmentation

2024-08-18

Why jieba

Segmentation is fast.
The corpus was built with jieba (Python).

jieba for Java

Download:
    git clone https://github.com/huaban/jieba-analysis

Build:
    cd jieba-analysis
    mvn install

Note: with a newer Maven version, you need to edit pom.xml and add a section before plugins.

Solr tokenizer

https://github.com/sing1ee/analyzer-solr (Solr 5)
https://github.com/sing1ee/jieba-solr.git (Solr 4)

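Supporting Solr 6, 7, or later

If your Solr is a newer version, as mine is, the code needs a few modifications, but they are small. Fixing whatever the compiler reports as errors is enough.

Diff for build.gradle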

diff --git a/build.gradle b/build.gradle
index 2a87525..06c5cc3 100644
--- a/build.gradle
+++ b/build.gradle
@@ -1,4 +1,4 @@
-group = 'analyzer.solr5'
+group = 'analyzer.solr7'
 version = '1.0'
 apply plugin: 'java'
 apply plugin: "eclipse"
@@ -14,15 +14,14 @@ repositories {
 
 dependencies {
     testCompile group: 'junit', name: 'junit', version: '4.11'
-    compile("org.apache.lucene:lucene-core:5.0.0")
-    compile("org.apache.lucene:lucene-queryparser:5.0.0")
-    compile("org.apache.lucene:lucene-analyzers-common:5.0.0")
-    compile('com.huaban:jieba-analysis:1.0.0')
-//    compile("org.fnlp:fnlp-core:2.0-SNAPSHOT")
+    compile("org.apache.lucene:lucene-core:7.1.0")
+    compile("org.apache.lucene:lucene-queryparser:7.1.0")
+    compile("org.apache.lucene:lucene-analyzers-common:7.1.0")
+    compile files('libs/jieba-analysis-1.0.3.jar')
     compile("edu.stanford.nlp:stanford-corenlp:3.5.1")
 }
 
 task "create-dirs" << {
     sourceSets*.java.srcDirs*.each { it.mkdirs() }
     sourceSets*.resources.srcDirs*.each { it.mkdirs() }
-}
\ No newline at end of file
+}
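The modified build.gradle no longer pulls jieba-analysis from a Maven repository. It points at a local file, libs/jieba-analysis-1.0.3.jar, so that jar has to be present before the Gradle build runs. The following is only a sketch of that staging step. It assumes the two repositories are cloned side by side, that `mvn install` left the jar under target/ (Maven's default output directory), and that the version is 1.0.3 as named in the diff. The function name `stage_jieba_jar` is made up for illustration.

```shell
# Sketch: stage the locally built jieba-analysis jar for the Gradle build.
# Assumptions: Maven wrote the jar to target/ and the version is 1.0.3,
# the one named in the build.gradle diff. Adjust both to your checkout.
stage_jieba_jar() {
    jieba_dir="$1"         # checkout of huaban/jieba-analysis
    analyzer_dir="$2"      # checkout of the Solr tokenizer project
    version="${3:-1.0.3}"  # optional third argument overrides the version
    jar="$jieba_dir/target/jieba-analysis-$version.jar"

    if [ ! -f "$jar" ]; then
        echo "jar not found: $jar (has 'mvn install' been run?)" >&2
        return 1
    fi
    mkdir -p "$analyzer_dir/libs"
    cp "$jar" "$analyzer_dir/libs/"
}
```

With both checkouts in the current directory, it could be called as `stage_jieba_jar jieba-analysis analyzer-solr`.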

Build

./gradlew build

Integrating into Solr

Copy the jar files into the following directory under the Solr installation: server/solr-webapp/webapp/WEB-INF/lib
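The copy step can be scripted as below. This is a sketch under several assumptions: the tokenizer jar is in build/libs (Gradle's default output directory for the java plugin), and the jieba-analysis jar sits in libs/ as in the build.gradle diff. Because that jar is referenced as a plain file dependency, it presumably has to be copied too, since a standard jar task does not bundle dependencies. The function name `install_jars_into_solr` and both arguments are hypothetical.

```shell
# Sketch: copy the tokenizer jar and its jieba-analysis dependency into Solr.
# Assumptions: Gradle's default build/libs output directory, and the jieba
# jar located in libs/ as referenced by the modified build.gradle.
install_jars_into_solr() {
    analyzer_dir="$1"   # checkout of the Solr tokenizer project
    solr_dir="$2"       # root of the Solr installation
    lib_dir="$solr_dir/server/solr-webapp/webapp/WEB-INF/lib"

    if [ ! -d "$lib_dir" ]; then
        echo "Solr lib directory not found: $lib_dir" >&2
        return 1
    fi
    copied=0
    for jar in "$analyzer_dir"/build/libs/*.jar "$analyzer_dir"/libs/jieba-analysis-*.jar; do
        [ -f "$jar" ] || continue   # skip glob patterns that matched nothing
        cp "$jar" "$lib_dir/" && copied=$((copied + 1))
    done
    echo "copied $copied jar(s) to $lib_dir"
    [ "$copied" -gt 0 ]             # fail if nothing was copied
}
```

A call might look like `install_jars_into_solr analyzer-solr /opt/solr`, where /opt/solr stands in for your own install path. Solr generally needs a restart before it picks up new jars in this directory.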

Schema changes

Reposted from: https://www.cnblogs.com/lotushy/p/8404603.html
