Elasticsearch IK Chinese Analysis: Installation and Configuration
Elasticsearch ships with built-in Chinese analysis, but it is quite crude (a comparison is shown at the end). The analysis-ik plugin is recommended instead: by my estimate, 80%-90% of the people who use Elasticsearch for Chinese full-text search use this analyzer, and it is actively maintained.
1. Elasticsearch analyzers
Elasticsearch comes with many built-in analyzers, such as standard (the standard analyzer), english (English analysis), and chinese (Chinese analysis).
The standard analyzer naively splits Chinese text into single characters, so it is broadly applicable but has low precision;
the english analyzer is smarter about English: it handles singular and plural forms and letter case, filters stopwords (for example the word "the"), and so on.
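To make the difference concrete, the _analyze API shows what an analyzer actually produces; here is a quick sketch (assuming a local node on 127.0.0.1:9200 and an arbitrary sample sentence):

$ curl -XPOST "http://127.0.0.1:9200/_analyze?pretty" -H "Content-Type: application/json" -d '
{
    "analyzer": "english",
    "text": "The quick brown foxes"
}'
# the stopword "The" is dropped and "foxes" is stemmed to "fox",
# so the expected tokens are roughly: quick, brown, fox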
2. Install Maven
$ brew install maven        # macOS
# apt-get install maven     # Ubuntu
# yum install maven         # CentOS / Red Hat

$ mvn -v
Apache Maven 3.5.0 (ff8f5e7444045639af65f6095c62210b5713f426; 2017-04-04T03:39:06+08:00)
Maven home: /usr/local/Cellar/maven/3.5.0/libexec
Java version: 1.8.0_112, vendor: Oracle Corporation
Java home: /Library/Java/JavaVirtualMachines/jdk1.8.0_112.jdk/Contents/Home/jre
Default locale: zh_CN, platform encoding: UTF-8
OS name: "mac os x", version: "10.12.6", arch: "x86_64", family: "mac"
3. Download the analysis-ik plugin
$ git clone https://github.com/medcl/elasticsearch-analysis-ik.git
$ cd elasticsearch-analysis-ik
$ git branch -a                 # check out the branch that matches your Elasticsearch version
* master                        # the master branch targets 6.2.3
  remotes/origin/2.x
  remotes/origin/5.3.x
  remotes/origin/5.x
  remotes/origin/6.1.x
  remotes/origin/HEAD -> origin/master
  remotes/origin/arkxu-master
  remotes/origin/master
  remotes/origin/revert-80-patch-1
  remotes/origin/rm
  remotes/origin/wyhw-ik_lucene4
$ mvn package                   # build the plugin
$ ll target/releases/
total 4400
drwxr-xr-x   3 zhangying  staff      102  4 24 13:46 ./
drwxr-xr-x  11 zhangying  staff      374  4 24 13:32 ../
-rw-r--r--   1 zhangying  staff  4501993  4 24 13:32 elasticsearch-analysis-ik-6.2.3.zip
# the build puts a zip file under target/releases/; unzip it
$ cd target/releases/ && unzip elasticsearch-analysis-ik-6.2.3.zip
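The master branch here builds against 6.2.3; if your Elasticsearch is on another release line, check out the matching branch before packaging. A sketch (the 5.x branch is only an example, pick whatever matches your node):

$ git checkout 5.x      # branch matching your Elasticsearch major version
$ mvn clean package     # the zip in target/releases/ then carries that version number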
4. Install the analysis-ik plugin
$ brew info elasticsearch
elasticsearch: stable 6.2.3, HEAD
Distributed search & analytics engine
https://www.elastic.co/products/elasticsearch
/usr/local/Cellar/elasticsearch/6.2.3 (112 files, 30.8MB) *
  Built from source on 2018-04-24 at 14:17:01
From: https://github.com/Homebrew/homebrew-core/blob/master/Formula/elasticsearch.rb
==> Requirements
Required: java = 1.8 ✔
==> Options
--HEAD
        Install HEAD version
==> Caveats
Data:    /usr/local/var/lib/elasticsearch/elasticsearch_zhangying/
Logs:    /usr/local/var/log/elasticsearch/elasticsearch_zhangying.log
Plugins: /usr/local/var/elasticsearch/plugins/          # plugin directory
Config:  /usr/local/etc/elasticsearch/

To have launchd start elasticsearch now and restart at login:
  brew services start elasticsearch
Or, if you don't want/need a background service you can just run:
  elasticsearch

# move the directory unzipped in the previous step into the plugins directory, renamed to ik
$ mv elasticsearch /usr/local/var/elasticsearch/plugins/ik
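As an alternative to building from source, the plugin also publishes prebuilt zips on its GitHub releases page, and elasticsearch-plugin can install one directly. A sketch, assuming a 6.2.3 release zip exists under the project's releases (adjust the version in the URL to your own node):

$ elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.2.3/elasticsearch-analysis-ik-6.2.3.zip
$ elasticsearch-plugin list     # should now list the ik plugin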
Note: do not add index:analysis:analyzer: settings to elasticsearch.yml. Older versions supported this, but on Elasticsearch 6.x every approach I tried fails with the following error:
node settings must not contain any index level settings
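On 6.x, analyzer configuration belongs to individual indices rather than to the node. A minimal sketch of making ik_max_word the default analyzer for one index (the index name my_index is just a placeholder):

$ curl -XPUT "http://127.0.0.1:9200/my_index?pretty" -H "Content-Type: application/json" -d '
{
    "settings": {
        "analysis": {
            "analyzer": {
                "default": {
                    "type": "ik_max_word"
                }
            }
        }
    }
}'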
5. Start Elasticsearch
$ elasticsearch                 # start Elasticsearch in the foreground
If the startup log shows the node coming up and the analysis-ik plugin being loaded without errors, the installation succeeded.
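To double-check that the node is up and the plugin was picked up, the root endpoint and the _cat/plugins API can be queried (assuming the default host and port):

$ curl "http://127.0.0.1:9200/?pretty"          # node name, version, cluster info
$ curl "http://127.0.0.1:9200/_cat/plugins?v"   # should list the installed ik plugin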
6. Test Chinese analysis
# create the index
$ curl -XPUT "http://127.0.0.1:9200/tank?pretty"

# create the mapping: _all is disabled (no catch-all full-text field),
# and username/description are analyzed with ik_max_word (fine-grained segmentation)
$ curl -XPOST "http://127.0.0.1:9200/tank/chinese/_mapping?pretty" -H "Content-Type: application/json" -d '
{
    "chinese": {
        "_all": {
            "enabled": false
        },
        "properties": {
            "id":          { "type": "integer" },
            "username":    { "type": "text", "analyzer": "ik_max_word" },
            "description": { "type": "text", "analyzer": "ik_max_word" }
        }
    }
}'

# insert two documents
$ curl -XPOST "http://127.0.0.1:9200/tank/chinese/?pretty" -H "Content-Type: application/json" -d '
{
    "id" : 1,
    "username" : "中国高铁速度很快",
    "description" : "如果要修改一个字段的类型"
}'
$ curl -XPOST "http://127.0.0.1:9200/tank/chinese/?pretty" -H "Content-Type: application/json" -d '
{
    "id" : 2,
    "username" : "动车和复兴号,都属于高铁",
    "description" : "现在想要修改为string类型"
}'

# search
$ curl -XPOST "http://127.0.0.1:9200/tank/chinese/_search?pretty" -H "Content-Type: application/json" -d '
{
    "query": {
        "match": {
            "username": "中国高铁"
        }
    }
}'
{
  "took" : 188,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.8630463,
    "hits" : [
      {
        "_index" : "tank",
        "_type" : "chinese",
        "_id" : "oJfx_2IBVvjz0l6TkJ6K",
        "_score" : 0.8630463,
        "_source" : {
          "id" : 1,
          "username" : "中国高铁速度很快",
          "description" : "如果要修改一个字段的类型"
        }
      },
      {
        "_index" : "tank",
        "_type" : "chinese",
        "_id" : "oZfx_2IBVvjz0l6Tpp64",
        "_score" : 0.5753642,
        "_source" : {
          "id" : 2,
          "username" : "动车和复兴号,都属于高铁",
          "description" : "现在想要修改为string类型"
        }
      }
    ]
  }
}
# the higher the _score, the better the match
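To see why both documents match 中国高铁, the _analyze API can be pointed at the username field of the tank index so the query text goes through the same ik_max_word analyzer. A sketch (the exact token list depends on the ik dictionary):

$ curl -XPOST "http://127.0.0.1:9200/tank/_analyze?pretty" -H "Content-Type: application/json" -d '
{
    "field": "username",
    "text": "中国高铁"
}'
# ik_max_word splits the query into words such as 中国 and 高铁; document 1 contains both,
# document 2 only 高铁, which is why document 1 gets the higher score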
7. Built-in Chinese analysis vs. ik
# ik_smart: coarse-grained segmentation
$ curl -XPOST 'http://localhost:9200/tank/_analyze?pretty=true' -H 'Content-Type: application/json' -d '
{
    "analyzer": "ik_smart",
    "text": "感叹号"
}'
{
  "tokens" : [
    {
      "token" : "感叹号",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 0
    }
  ]
}

# standard: the analyzer Elasticsearch ships with, one character per token
$ curl -XPOST 'http://localhost:9200/tank/_analyze?pretty=true' -H 'Content-Type: application/json' -d '
{
    "analyzer": "standard",
    "text": "感叹号"
}'
{
  "tokens" : [
    {
      "token" : "感",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "叹",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "号",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    }
  ]
}

# ik_max_word: fine-grained segmentation
$ curl -XPOST 'http://localhost:9200/tank/_analyze?pretty=true' -H 'Content-Type: application/json' -d '
{
    "analyzer": "ik_max_word",
    "text": "感叹号"
}'
{
  "tokens" : [
    {
      "token" : "感叹号",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "感叹",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "叹号",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 2
    }
  ]
}
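A common setup is to index with the exhaustive ik_max_word but analyze queries with the coarser ik_smart, which keeps the index rich while keeping searches less noisy. A sketch of such a mapping (the index name demo and field content are placeholders):

$ curl -XPUT "http://127.0.0.1:9200/demo?pretty" -H "Content-Type: application/json" -d '
{
    "mappings": {
        "doc": {
            "properties": {
                "content": {
                    "type": "text",
                    "analyzer": "ik_max_word",
                    "search_analyzer": "ik_smart"
                }
            }
        }
    }
}'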