Ubuntu 下安装 elasticsearch 和 analysis-ik 插件支持中文全文搜索 - sdk社区

安装 elasticsearch 需要 java8 以上环境支持；

Java 有 OpenJDK 和 Oracle Java 两个主要的实现，几乎没有区别，只是 Oracle Java 有一些额外的商业功能。
这里安装 java10 参考了如何在Ubuntu 18.04上安装Java

1. 安装OpenJDK 10 JDK

sudo apt install default-jdk
java -version

输出类似 openjdk version "10.0.1" 即为成功。

2. 下载运行 elasticsearch

主站下载地址: https://www.elastic.co/downloads/elasticsearch

这里 Ubuntu/MacOs 点击直接下载 elasticsearch-6.4.2.tar.gz

或者

wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.4.2.tar.gz
tar -zxvf elasticsearch-6.4.2.tar.gz
cd elasticsearch-6.4.2
# 启动测试，daemon 请加 -d 参数
./bin/elasticsearch

查看是否启动成功，查看 9200 端口是否被监听

netstat -anpl|grep 9200

如果端口监听成功, 再使用 curl 命令查询测试：

curl 'http://localhost:9200/?pretty'
# 返回 json 字符串即可开始使用了

2.1 elasticsearch 配置

文件位于 config/elasticsearch.yml

打开前面的注释代码如下：

# Lock the memory on startup:

bootstrap.memory_lock: true

network.host: 127.0.0.1

http.port: 9200

内存管理 config/jvm.options

把内存改为自己服务器内存的一半以下；比如说这里改为 1G ；

#-Xms1g
-Xms1G

#-Xmx1g
-Xmx1G

3. 中文分词插件 analyzer-ik

elasticsearch 内置的分词器只能将中文每个字拆分，所以需要扩展插件

安装方法：

./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.4.2/elasticsearch-analysis-ik-6.4.2.zip
# 如上面网络问题请使用下面地址
./bin/elasticsearch-plugin install https://www.guke1.com/elasticsearch-analysis-ik-6.4.2.zip

然后重新启动 elasticsearch 测试 ik-analyzer 分词效果：

curl -H 'Content-Type: application/json'  -XGET 'localhost:9200/_analyze?pretty' -d '{"analyzer":"ik_max_word","text":"李木聪的技术博客"}'

输出效果如下：

{
  "tokens" : [
    {
      "token" : "李",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "木",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "聪",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "CN_CHAR",
      "position" : 2
    },
    {
      "token" : "的",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "CN_CHAR",
      "position" : 3
    },
    {
      "token" : "技术",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "博客",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 5
    }
  ]
}

3.1 analysis-ik 增加自定义词库

自定义词库位于 config/analysis-ik/IKAnalyzer.cfg.xml

vim ./config/analysis-ik/IKAnalyzer.cfg.xml

修改为：

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
	<comment>IK Analyzer 扩展配置</comment>
	<!--用户可以在这里配置自己的扩展字典 -->
	<entry key="ext_dict">my.dic</entry>
	 <!--用户可以在这里配置自己的扩展停止词字典-->
	<entry key="ext_stopwords"></entry>
	<!--用户可以在这里配置远程扩展字典 -->
	<!-- <entry key="remote_ext_dict">words_location</entry> -->
	<!--用户可以在这里配置远程扩展停止词字典-->
	<!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>

增加一个 李木聪 到 my.dic 词库中，自定义多个关键词需一行一个

echo '李木聪' > config/analysis-ik/my.dic

再次重新启动下服务，测试中文分词效果：

curl -H 'Content-Type: application/json'  -XGET 'localhost:9200/_analyze?pretty' -d '{"analyzer":"ik_max_word","text":"李木聪的技术博客"}'

输出效果如下：

{
  "tokens" : [
    {
      "token" : "李木聪",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "的",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "技术",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "博客",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 3
    }
  ]
}