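Elasticsearch: combining pinyin and Chinese analysis
The previous post covered the IK Chinese analyzer. The actual goal is for both pinyin input and Chinese input to return results, similar to the input suggestions in the Baidu and Taobao search boxes.
1, Install and configure analysis-pinyin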
# Download
$ git clone https://github.com/medcl/elasticsearch-analysis-pinyin.git
$ cd elasticsearch-analysis-pinyin
$ git branch -a
* master                          # master is 6.2.3, matching ES 6.2.3
  remotes/origin/0.16.x
  remotes/origin/1.x
  remotes/origin/2.x
  remotes/origin/5.3.x
  remotes/origin/5.x
  remotes/origin/6.1.x
  remotes/origin/HEAD -> origin/master
  remotes/origin/master
$ mvn package                     # build the plugin
$ ll target/releases/
total 4400
drwxr-xr-x   3 zhangying  staff      102  4 24 13:46 ./
drwxr-xr-x  11 zhangying  staff      374  4 24 13:32 ../
-rw-r--r--   1 zhangying  staff  4501993  4 24 13:32 elasticsearch-analysis-pinyin-6.2.3.zip
$ cd target/releases/ && unzip elasticsearch-analysis-pinyin-6.2.3.zip
$ brew info elasticsearch
elasticsearch: stable 6.2.3, HEAD
Distributed search & analytics engine
https://www.elastic.co/products/elasticsearch
/usr/local/Cellar/elasticsearch/6.2.3 (112 files, 30.8MB) *
  Built from source on 2018-04-24 at 14:17:01
From: https://github.com/Homebrew/homebrew-core/blob/master/Formula/elasticsearch.rb
==> Requirements
Required: java = 1.8 ✔
==> Options
--HEAD
    Install HEAD version
==> Caveats
Data:    /usr/local/var/lib/elasticsearch/elasticsearch_zhangying/
Logs:    /usr/local/var/log/elasticsearch/elasticsearch_zhangying.log
Plugins: /usr/local/var/elasticsearch/plugins/      # plugin directory
Config:  /usr/local/etc/elasticsearch/

To have launchd start elasticsearch now and restart at login:
  brew services start elasticsearch
Or, if you don't want/need a background service you can just run:
  elasticsearch

# Move the unzipped plugin into the ES plugin directory
$ mv elasticsearch /usr/local/var/elasticsearch/plugins/pinyin
$ elasticsearch                   # start ES
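Once the node is running, it may be worth confirming that the plugin was actually picked up before testing the analyzer. The following is a minimal sketch, not part of the original setup: `ES_URL` and `plugin_loaded` are names introduced here for illustration, the address is the one used in the later examples, and the plugin is assumed to register itself as `analysis-pinyin`.

```shell
#!/bin/sh
# Assumed node address; override by exporting ES_URL.
ES_URL="${ES_URL:-http://127.0.0.1:9200}"

# Hypothetical helper: reads a _cat/plugins listing on stdin and succeeds
# if the given plugin name appears in the second column.
# Each line of that listing is expected to look like "<node> <component> <version>".
plugin_loaded() {
    awk -v name="$1" '$2 == name { found = 1 } END { exit !found }'
}
```

Against a running node this could be used as `curl -s "$ES_URL/_cat/plugins" | plugin_loaded analysis-pinyin && echo "pinyin plugin loaded"`. Running `elasticsearch-plugin list` on the host is another way to check.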
2, Test the pinyin analyzer
2.1, Test tokenization
# Note: the /pinyin/ prefix in the URL refers to the index created in 2.2, so that index must already exist.
$ curl -XPOST 'http://localhost:9200/pinyin/_analyze?pretty=true' -H 'Content-Type: application/json' -d '
{
  "analyzer":"pinyin",
  "text":"gaotie"
}'
{
  "tokens" : [
    { "token" : "gao",    "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 0 },
    { "token" : "gaotie", "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 0 },
    { "token" : "tie",    "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 1 }
  ]
}

$ curl -XPOST 'http://localhost:9200/pinyin/_analyze?pretty=true' -H 'Content-Type: application/json' -d '
{
  "analyzer":"pinyin",
  "text":"高铁"
}'
{
  "tokens" : [
    { "token" : "gao", "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 0 },
    { "token" : "gt",  "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 0 },
    { "token" : "tie", "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 1 }
  ]
}
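As shown above, the pinyin analyzer tokenizes both pinyin and Chinese input, and the results differ: "gaotie" yields the joined token gaotie, while "高铁" yields the first-letter token gt.
2.2, Create the index and mapping, insert data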
curl -XPUT "http://127.0.0.1:9200/pinyin?pretty"

curl -XPOST "http://127.0.0.1:9200/pinyin/test/_mapping?pretty" -H "Content-Type: application/json" -d '
{
  "test": {
    "_all": {
      "enabled": false
    },
    "properties": {
      "id": {
        "type": "integer"
      },
      "username": {
        "type": "text",
        "analyzer": "pinyin"
      },
      "description": {
        "type": "text",
        "analyzer": "pinyin"
      }
    }
  }
}
'

curl -XPOST "http://127.0.0.1:9200/pinyin/test/?pretty" -H "Content-Type: application/json" -d '
{
  "id" : 1,
  "username" : "中国高铁速度很快",
  "description" : "如果要修改一个字段的类型"
}'

curl -XPOST "http://127.0.0.1:9200/pinyin/test/?pretty" -H "Content-Type: application/json" -d '
{
  "id" : 2,
  "username" : "动车和复兴号,都属于高铁",
  "description" : "现在想要修改为string类型"
}'
2.3, Full pinyin search
$ curl -XPOST "http://127.0.0.1:9200/pinyin/test/_search?pretty" -H "Content-Type: application/json" -d '
{
  "query": {
    "match": {
      "username": "gao tie"
    }
  }
}
'
{
  "took" : 13,
  "timed_out" : false,
  "_shards" : { "total" : 5, "successful" : 5, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : 2,
    "max_score" : 0.4039931,
    "hits" : [
      {
        "_index" : "pinyin",
        "_type" : "test",
        "_id" : "TGZ2AWMBlEkarXCPb7ED",
        "_score" : 0.4039931,
        "_source" : { "id" : 1, "username" : "中国高铁速度很快", "description" : "如果要修改一个字段的类型" }
      },
      {
        "_index" : "pinyin",
        "_type" : "test",
        "_id" : "TWZ2AWMBlEkarXCPb7En",
        "_score" : 0.35767543,
        "_source" : { "id" : 2, "username" : "动车和复兴号,都属于高铁", "description" : "现在想要修改为string类型" }
      }
    ]
  }
}
2.4, Chinese search against the pinyin analyzer
$ curl -XPOST "http://127.0.0.1:9200/pinyin/test/_search?pretty" -H "Content-Type: application/json" -d '
{
  "query": {
    "match": {
      "username": "中国高铁"
    }
  }
}
'
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : { "total" : 5, "successful" : 5, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : 2,
    "max_score" : 1.9398875,
    "hits" : [
      {
        "_index" : "pinyin",
        "_type" : "test",
        "_id" : "TGZ2AWMBlEkarXCPb7ED",
        "_score" : 1.9398875,
        "_source" : { "id" : 1, "username" : "中国高铁速度很快", "description" : "如果要修改一个字段的类型" }
      },
      {
        "_index" : "pinyin",
        "_type" : "test",
        "_id" : "TWZ2AWMBlEkarXCPb7En",
        "_score" : 0.35767543,
        "_source" : { "id" : 2, "username" : "动车和复兴号,都属于高铁", "description" : "现在想要修改为string类型" }
      }
    ]
  }
}
2.5, Partial first letters
$ curl -XPOST "http://127.0.0.1:9200/pinyin/test/_search?pretty" -H "Content-Type: application/json" -d '
{
  "query": {
    "match": {
      "username": "Gaot"
    }
  }
}
'
{
  "took" : 5,
  "timed_out" : false,
  "_shards" : { "total" : 5, "successful" : 5, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : 2,
    "max_score" : 0.20199655,
    "hits" : [
      {
        "_index" : "pinyin",
        "_type" : "test",
        "_id" : "TGZ2AWMBlEkarXCPb7ED",
        "_score" : 0.20199655,
        "_source" : { "id" : 1, "username" : "中国高铁速度很快", "description" : "如果要修改一个字段的类型" }
      },
      {
        "_index" : "pinyin",
        "_type" : "test",
        "_id" : "TWZ2AWMBlEkarXCPb7En",
        "_score" : 0.17883772,
        "_source" : { "id" : 2, "username" : "动车和复兴号,都属于高铁", "description" : "现在想要修改为string类型" }
      }
    ]
  }
}

# Same result as above
$ curl -XPOST "http://127.0.0.1:9200/pinyin/test/_search?pretty" -H "Content-Type: application/json" -d '
{
  "query": {
    "match": {
      "username": "gtie"
    }
  }
}
'
2.6, All first letters
$ curl -XPOST "http://127.0.0.1:9200/pinyin/test/_search?pretty" -H "Content-Type: application/json" -d '
{
  "query": {
    "match": {
      "username": "gt"
    }
  }
}
'
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : { "total" : 5, "successful" : 5, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}
Searching with only the first letters of 高铁 (gt) returns nothing.
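A likely explanation, assuming the plugin's default settings: the pinyin analyzer builds its first-letter token from the whole field value rather than from each word, so "高铁" inside a longer sentence never produces a standalone gt token. One way to check this is to run the full sentence through `_analyze` and look at the tokens. The sketch below is only an illustration; `ES_URL`, `analyze_body` and `analyze_tokens` are hypothetical names introduced here.

```shell
#!/bin/sh
# Assumed node address; override by exporting ES_URL.
ES_URL="${ES_URL:-http://127.0.0.1:9200}"

# Build an _analyze request body: $1 = analyzer name, $2 = text.
# Kept deliberately simple: inputs must not contain double quotes or backslashes.
analyze_body() {
    printf '{"analyzer":"%s","text":"%s"}' "$1" "$2"
}

# Print only the "token" lines of the _analyze response.
# $1 = index, $2 = analyzer name, $3 = text. Needs a running node.
analyze_tokens() {
    curl -s -XPOST "$ES_URL/$1/_analyze?pretty" \
        -H 'Content-Type: application/json' \
        -d "$(analyze_body "$2" "$3")" | grep '"token"'
}
```

For example, `analyze_tokens pinyin pinyin "中国高铁速度很快"` lists the tokens indexed for the first document. If gt is not among them, the empty result above is expected.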
3, Combining the pinyin and Chinese analyzers
3.1, Define a custom analyzer with token filters
$ curl -XPUT "http://localhost:9200/pinyin_ik/?pretty" -H "Content-Type: application/json" -d'
{
  "index": {
    "analysis": {
      "analyzer": {
        "ik_pinyin_analyzer": {
          "type": "custom",
          "tokenizer": "ik_max_word",
          "filter": ["my_pinyin", "word_delimiter"]
        }
      },
      "filter": {
        "my_pinyin": {
          "type": "pinyin"
        }
      }
    }
  }
}'

$ curl -XPOST "http://127.0.0.1:9200/pinyin_ik/test/_mapping?pretty" -H "Content-Type: application/json" -d '
{
  "test": {
    "_all": {
      "enabled": false
    },
    "properties": {
      "id": {
        "type": "integer"
      },
      "username": {
        "type": "text",
        "analyzer": "ik_pinyin_analyzer"
      },
      "description": {
        "type": "text",
        "analyzer": "ik_pinyin_analyzer"
      }
    }
  }
}
'

$ curl -XPOST "http://127.0.0.1:9200/pinyin_ik/test/?pretty" -H "Content-Type: application/json" -d '
{
  "id" : 1,
  "username" : "中国高铁速度很快",
  "description" : "如果要修改一个字段的类型"
}'

$ curl -XPOST "http://127.0.0.1:9200/pinyin_ik/test/?pretty" -H "Content-Type: application/json" -d '
{
  "id" : 2,
  "username" : "动车和复兴号,都属于高铁",
  "description" : "现在想要修改为string类型"
}'
3.2, All first letters search
$ curl -XPOST "http://127.0.0.1:9200/pinyin_ik/test/_search?pretty" -H "Content-Type: application/json" -d '
{
  "query": {
    "match": {
      "username": "gt"
    }
  }
}
'
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : { "total" : 5, "successful" : 5, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : 2,
    "max_score" : 0.6935897,
    "hits" : [
      {
        "_index" : "pinyin_ik",
        "_type" : "test",
        "_id" : "S2ZzAWMBlEkarXCPu7Hq",
        "_score" : 0.6935897,
        "_source" : { "id" : 2, "username" : "动车和复兴号,都属于高铁", "description" : "现在想要修改为string类型" }
      },
      {
        "_index" : "pinyin_ik",
        "_type" : "test",
        "_id" : "SmZzAWMBlEkarXCPubHw",
        "_score" : 0.6827974,
        "_source" : { "id" : 1, "username" : "中国高铁速度很快", "description" : "如果要修改一个字段的类型" }
      }
    ]
  }
}
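To re-run every query form from section 2 against either index in one go, something like the following could be used. It is a rough sketch under assumptions: `ES_URL`, `match_body`, `hit_count` and `compare_queries` are hypothetical names introduced here, and the `filter_path=hits.total` response is assumed to have the ES 6.x shape, where the total is a plain number.

```shell
#!/bin/sh
# Assumed node address; override by exporting ES_URL.
ES_URL="${ES_URL:-http://127.0.0.1:9200}"

# Build a match query body: $1 = field, $2 = query text.
# Inputs must not contain double quotes or backslashes.
match_body() {
    printf '{"query":{"match":{"%s":"%s"}}}' "$1" "$2"
}

# Print the trimmed search response (hit total only) for a username query.
# $1 = index, $2 = query text. Needs a running node.
hit_count() {
    curl -s -XPOST "$ES_URL/$1/test/_search?filter_path=hits.total" \
        -H 'Content-Type: application/json' \
        -d "$(match_body username "$2")"
}

# Run the query forms used in this post against index $1, one per line.
compare_queries() {
    for q in "gao tie" "中国高铁" "Gaot" "gtie" "gt"; do
        printf '%s\t' "$q"
        hit_count "$1" "$q"
        printf '\n'
    done
}
```

Calling `compare_queries pinyin` and then `compare_queries pinyin_ik` puts the two analyzers side by side. Based on the results above, the gt line should report 0 hits on pinyin and 2 on pinyin_ik.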