Chapter 3 Elasticsearch RESTful API (DSL)

司允晨

2023-12-01

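3.1 Basic concepts of Elasticsearch

cluster	Elasticsearch runs as a cluster by default; the cluster as a whole holds one complete data set with redundant copies.
node	A single member of the cluster; in general one Elasticsearch process is one node.
shard	Even the data held on a single node is divided by a hash algorithm into several shards. The default used to be 5 shards and changed to 1 in 7.0.
index	In 5.x, comparable to a database in an RDBMS: a logical database from the user's point of view, although physically it is split into several shards that may live on several nodes. In 6.x and 7.x an index is comparable to a table.
type	Similar to a table in an RDBMS, though closer to a class in object-oriented programming: a collection of JSON documents with the same format. 6.x allows only one type per index and 7.0 deprecates types, so an index effectively becomes table-level.
document	Similar to a row in an RDBMS or an object in object-oriented programming.
field	Equivalent to a column or an attribute.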
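
The shard entry above says that documents are spread over shards by a hash algorithm. The following Python sketch illustrates the idea only: conceptually the shard number is the hash of the routing value (by default the document id) modulo the number of primary shards. The function name pick_shard is hypothetical, and zlib.crc32 is a stand-in chosen for the example, not the hash function Elasticsearch actually uses.

```python
import zlib


def pick_shard(routing: str, number_of_primary_shards: int) -> int:
    """Map a routing value (by default the document id) to a shard number."""
    # crc32 is only a stand-in for the hash function used by Elasticsearch.
    hashed = zlib.crc32(routing.encode("utf-8"))
    return hashed % number_of_primary_shards


if __name__ == "__main__":
    # With 5 primary shards every id lands on a shard between 0 and 4.
    for doc_id in ["1", "2", "3"]:
        print(doc_id, "->", pick_shard(doc_id, 5))
```

Because the result depends on the number of primary shards, that number cannot simply be changed after the index has been created without moving documents.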

GET /_cat/nodes?v shows the status of each node

GET /_cat/indices?v shows the status of each index

GET /_cat/shards/xxxx shows the shards of one index

3.2 Data structure stored in ES

public class Movie{
    String id;
    String name;
    Double doubanScore;
    List<Actor> actorList;
}

public class Actor {
    String id;
    String name;
}

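If these two objects were stored in a relational database they would be split into two tables, but Elasticsearch represents one document as a single JSON object.

Stored in ES, the data therefore looks like this: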

{
	"id": "1",
	"name": "operation red sea",
	"doubanScore": 8.5,
	"actorList": [
		{
			"id": "1",
			"name": "zhangyi"
		},
		{
			"id": "2",
			"name": "haiqing"
		},
		{
			"id": "3",
			"name": "zhanghanyu"
		}
	]
}
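
As a rough illustration of how the nested object graph becomes one JSON document, here is a minimal Python sketch. The dataclasses mirror the Movie and Actor classes above; the helper name to_document is hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Actor:
    id: str
    name: str


@dataclass
class Movie:
    id: str
    name: str
    doubanScore: float
    actorList: List[Actor] = field(default_factory=list)


def to_document(movie: Movie) -> str:
    """Serialize a movie and its nested actors into one JSON document."""
    # asdict() recurses into nested dataclasses, so the actors are embedded
    # in the movie instead of living in a separate table.
    return json.dumps(asdict(movie), ensure_ascii=False)


if __name__ == "__main__":
    movie = Movie("1", "operation red sea", 8.5,
                  [Actor("1", "zhangyi"), Actor("2", "haiqing")])
    print(to_document(movie))
```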

3.3 Operating on data

3.3.1 List the indices in ES

GET /_cat/indices?v

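An index named .kibana usually exists already; Kibana creates it to store its own data.

Meaning of the column headers

health	green (all primary and replica shards are allocated), yellow (all primary shards are allocated but some replicas are not), red (at least one primary shard is not allocated)
status	whether the index can be used (open or closed)
index	index name
uuid	unique identifier of the index
pri	number of primary shards
rep	number of replicas (copies of each primary shard)
docs.count	number of documents
docs.deleted	number of deleted documents
store.size	total storage size
pri.store.size	storage size of the primary shards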
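
The three health colours can be read as a simple rule over shard allocation. The sketch below is an illustration written for this section, not code from Elasticsearch; the function name and parameters are hypothetical.

```python
def index_health(primaries_total: int, primaries_assigned: int,
                 replicas_total: int, replicas_assigned: int) -> str:
    """Return the health colour for the given shard allocation counts."""
    if primaries_assigned < primaries_total:
        return "red"      # some data is not available at all
    if replicas_assigned < replicas_total:
        return "yellow"   # all data is available but not fully replicated
    return "green"        # every primary and replica shard is allocated


if __name__ == "__main__":
    # A single-node setup cannot place replicas on another node,
    # so an index with replicas typically shows up as yellow there.
    print(index_health(5, 5, 5, 0))
```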

3.3.2 Create an index

PUT /movie_index

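3.3.3 Delete an index

For documents, ES does not modify or delete data in place: an update or delete writes a new version and increments the version number, and the old copy is only marked as deleted. Deleting an index, by contrast, removes the whole index together with its data.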

DELETE /movie_index

3.3.4 Add documents

The request format is:

PUT /index/type/id

PUT /movie_index/movie/1
{ "id":1,
  "name":"operation red sea",
  "doubanScore":8.5,
  "actorList":[  
	{"id":1,"name":"zhang yi"},
	{"id":2,"name":"hai qing"},
	{"id":3,"name":"zhang han yu"}
  ]
}
PUT /movie_index/movie/2
{
  "id":2,
  "name":"operation meigong river",
  "doubanScore":8.0,
  "actorList":[  
	{"id":3,"name":"zhang han yu"}
  ]
}
 
PUT /movie_index/movie/3
{
  "id":3,
  "name":"incident red sea",
  "doubanScore":5.0,
  "actorList":[  
	{"id":4,"name":"zhang chen"}
  ]
}

If the index or type has not been created before, ES creates it automatically.
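
Outside Kibana the same request is an ordinary HTTP PUT with a JSON body. The Python sketch below only builds the request object with the standard library and does not send it. The base URL http://localhost:9200 and the helper name build_index_request are assumptions made for the example.

```python
import json
import urllib.request


def build_index_request(base_url, index, doc_type, doc_id, document):
    """Build (but do not send) a PUT /index/type/id request."""
    url = f"{base_url}/{index}/{doc_type}/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    request = build_index_request(
        "http://localhost:9200",  # assumed address of a local node
        "movie_index", "movie", 1,
        {"id": 1, "name": "operation red sea", "doubanScore": 8.5},
    )
    print(request.get_method(), request.full_url)
    # Sending it would be: urllib.request.urlopen(request)
```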

3.3.5 Look up a document by id

GET movie_index/movie/1

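3.3.6 Update: replace the whole document

The request is the same as for adding a document, with one requirement: it must contain all fields.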

PUT /movie_index/movie/3
{
  "id":"3",
  "name":"incident red sea",
  "doubanScore":"5.0",
  "actorList":[  
    {"id":"1","name":"zhang chen"}
  ]
}

3.3.7 Update: a single field

POST movie_index/movie/3/_update
{ 
  "doc": {
    "doubanScore":"7.0"
  } 
}
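
The effect of the "doc" body is a merge into the stored document: fields that are not mentioned stay unchanged. A minimal Python sketch of that merge follows, assuming nested objects are merged recursively while arrays and plain values are replaced; the function name apply_partial_update is hypothetical.

```python
import copy


def apply_partial_update(source: dict, partial: dict) -> dict:
    """Return a copy of source with the fields of partial merged in."""
    merged = copy.deepcopy(source)
    for key, value in partial.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested objects are merged field by field.
            merged[key] = apply_partial_update(merged[key], value)
        else:
            # Plain values and arrays replace the stored value.
            merged[key] = copy.deepcopy(value)
    return merged


if __name__ == "__main__":
    stored = {"id": 3, "name": "incident red sea", "doubanScore": 5.0}
    print(apply_partial_update(stored, {"doubanScore": 7.0}))
```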

3.3.8 Delete a document

DELETE movie_index/movie/3

3.3.9 Search all documents of a type

GET movie_index/movie/_search

{
  "took": 2,    // time spent, in milliseconds
  "timed_out": false, // whether the search timed out
  "_shards": {
    "total": 5,   // the search was sent to all 5 shards
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 3,  // 3 documents matched
    "max_score": 1,   // highest relevance score
    "hits": [  // the results
      {
        "_index": "movie_index",
        "_type": "movie",
        "_id": "2",
        "_score": 1,
        "_source": {
          "id": 2,
          "name": "operation meigong river",
          "doubanScore": 8.0,
          "actorList": [
            {
              "id": 3,
              "name": "zhang han yu"
            }
          ]
        }
          ......
          ......
      }

3.3.10 Query by condition (match all)

GET movie_index/movie/_search
{
  "query":{
    "match_all": {}
  }
}

3.3.11 Query by analyzed terms

GET movie_index/movie/_search
{
  "query":{
    "match": {"name":"red"}
  }
}

3.3.12 Query by analyzed terms on a sub-property

GET movie_index/movie/_search
{
  "query":{
    "match": {"actorList.name":"zhang"}
  }
}

3.3.13 match phrase

GET movie_index/movie/_search
{
    "query":{
      "match_phrase": {"name":"operation red"}
    }
}

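Phrase query: the query text is not matched term by term. All of its terms must appear in the field next to each other and in the same order, so "operation red" matches "operation red sea" but not a name that contains only one of the two words.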
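
The difference between match and match_phrase can be simulated on the three sample movie names. This Python sketch uses a toy whitespace analyzer and ignores scoring; it is an illustration of the matching rule, not the Lucene implementation.

```python
def analyze(text):
    """Split text into lowercase terms on whitespace (toy analyzer)."""
    return text.lower().split()


def match(field_value, query):
    """True if the field contains at least one of the query terms."""
    field_terms = set(analyze(field_value))
    return any(term in field_terms for term in analyze(query))


def match_phrase(field_value, query):
    """True if the query terms occur in the field adjacent and in order."""
    field_terms = analyze(field_value)
    query_terms = analyze(query)
    width = len(query_terms)
    return any(field_terms[i:i + width] == query_terms
               for i in range(len(field_terms) - width + 1))


NAMES = ["operation red sea", "operation meigong river", "incident red sea"]

if __name__ == "__main__":
    print("match:", [n for n in NAMES if match(n, "operation red")])
    print("match_phrase:", [n for n in NAMES if match_phrase(n, "operation red")])
```

With match, every name that contains either "operation" or "red" qualifies; with match_phrase only the name that contains the two words side by side does.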

3.3.14 Fuzzy query

GET movie_index/movie/_search
{
    "query":{
      "fuzzy": {"name":"rad"}
    }
}

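Fuzzy matching corrects the query term: when a word cannot be matched exactly, ES uses an edit-distance algorithm to also score terms that are very close to it, so they can still be found, at the cost of more processing.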
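
"Very close" is measured as an edit distance between the query term and the indexed term. The sketch below implements the plain Levenshtein distance in Python to show why "rad" still finds "red". The limit of one edit is an assumption for this example, and the real fuzzy query has further options that are not modelled here.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def fuzzy_match(term: str, query: str, max_edits: int = 1) -> bool:
    """True if term is within max_edits edits of the query term."""
    return edit_distance(term, query) <= max_edits


if __name__ == "__main__":
    for term in ["operation", "red", "sea"]:
        print(term, edit_distance(term, "rad"), fuzzy_match(term, "rad"))
```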

3.3.15 Filtering: filter after the query

GET movie_index/movie/_search
{
    "query":{
      "match": {"name":"red"}
    },
    "post_filter":{
      "term": {
        "actorList.id": 3
      }
    }
}

3.3.16 Filtering: filter before the query (recommended)

GET movie_index/movie/_search
{ 
    "query":{
        "bool":{
          "filter":[ {"term": {  "actorList.id": "1"  }},
                     {"term": {  "actorList.id": "3"  }}
           ], 
           "must":{"match":{"name":"red"}}
         }
    }
}

3.3.17 Filtering: filter by range

GET movie_index/movie/_search
{
   "query": {
     "bool": {
       "filter": {
         "range": {
            "doubanScore": {"gte": 8}
         }
       }
     }
   }
}

Range operators:

gt	greater than
lt	less than
gte	greater than or equal to
lte	less than or equal to
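
The four operators can be combined in one range clause, for example gte together with lt. A small Python sketch of their semantics, applied to the sample scores from 3.3.4 (the helper name in_range is hypothetical):

```python
def in_range(value, gt=None, gte=None, lt=None, lte=None):
    """True if value satisfies every bound that is given."""
    if gt is not None and not value > gt:
        return False
    if gte is not None and not value >= gte:
        return False
    if lt is not None and not value < lt:
        return False
    if lte is not None and not value <= lte:
        return False
    return True


SCORES = {
    "operation red sea": 8.5,
    "operation meigong river": 8.0,
    "incident red sea": 5.0,
}

if __name__ == "__main__":
    # Same condition as the request above: doubanScore >= 8
    print([name for name, score in SCORES.items() if in_range(score, gte=8)])
```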

3.3.18 Sorting

GET movie_index/movie/_search
{
  "query":{
    "match": {"name":"red sea"}
  },
  "sort": [
    {
      "doubanScore": {
        "order": "desc"
      }
    }
  ]
}

3.3.19 Paging

GET movie_index/movie/_search
{
  "query": { "match_all": {} },
  "from": 1,
  "size": 1
}
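
Here "from" is the zero-based offset of the first hit to return and "size" is the number of hits per page, so the request above returns the second document. A minimal Python sketch of the usual page-number conversion, with a hypothetical helper name:

```python
def page_params(page: int, size: int) -> dict:
    """Translate a 1-based page number into from/size parameters."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be at least 1")
    return {"from": (page - 1) * size, "size": size}


if __name__ == "__main__":
    # Page 2 with one hit per page gives the from/size used above.
    print(page_params(2, 1))
```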

3.3.20 Select the fields to return

GET movie_index/movie/_search
{
  "query": { "match_all": {} },
  "_source": ["name", "doubanScore"]
}

3.3.21 Highlighting

GET movie_index/movie/_search
{
    "query":{
      "match": {"name":"red sea"}
    },
    "highlight": {
      "fields": {"name":{} }
    }
}

3.3.22 Aggregation

Count how many movies each actor has appeared in

GET movie_index/movie/_search
{ 
  "aggs": {
    "groupby_actor": {
      "terms": {
        "field": "actorList.name.keyword"  
      }
    }
  }
}

Compute the average score of each actor's movies and order the result by that score

GET movie_index/movie/_search
{ 
  "aggs": {
    "groupby_actor_id": {
      "terms": {
        "field": "actorList.name.keyword" ,
        "order": {
          "avg_score": "desc"
          }
      },
      "aggs": {
        "avg_score":{
          "avg": {
            "field": "doubanScore" 
          }
        }
       }
    } 
  }
}
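
To make the two aggregations concrete, the Python sketch below computes the same buckets in memory for the three sample documents from 3.3.4: one bucket per actor name with a document count and an average score, ordered by the average in descending order. It only illustrates what the terms and avg aggregations compute and assumes each actor appears at most once per movie.

```python
from collections import defaultdict

MOVIES = [
    {"name": "operation red sea", "doubanScore": 8.5,
     "actorList": [{"id": 1, "name": "zhang yi"},
                   {"id": 2, "name": "hai qing"},
                   {"id": 3, "name": "zhang han yu"}]},
    {"name": "operation meigong river", "doubanScore": 8.0,
     "actorList": [{"id": 3, "name": "zhang han yu"}]},
    {"name": "incident red sea", "doubanScore": 5.0,
     "actorList": [{"id": 4, "name": "zhang chen"}]},
]


def avg_score_by_actor(movies):
    """Group movies by actor name, then average the scores per group."""
    scores = defaultdict(list)
    for movie in movies:
        for actor in movie["actorList"]:
            scores[actor["name"]].append(movie["doubanScore"])
    buckets = [{"key": name,
                "doc_count": len(values),
                "avg_score": sum(values) / len(values)}
               for name, values in scores.items()]
    # Same ordering as "order": {"avg_score": "desc"}
    buckets.sort(key=lambda bucket: bucket["avg_score"], reverse=True)
    return buckets


if __name__ == "__main__":
    for bucket in avg_score_by_actor(MOVIES):
        print(bucket)
```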

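Why is the .keyword suffix needed when aggregating?

.keyword is a sub-field of a string field that stores an unanalyzed copy of the value. Some operations, such as filters and aggregations, only work on the unanalyzed form, so the field name takes the .keyword suffix there.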

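3.4 Chinese word segmentation

The Chinese handling built into Elasticsearch simply splits the text character by character, with no notion of words. In practice users search by words, so if a text is split into words it matches the user's query terms more closely and queries also run faster.

Analyzer download: GitHub - medcl/elasticsearch-analysis-ik (the IK Analysis plugin integrates the Lucene IK analyzer into Elasticsearch and supports customized dictionaries).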
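
The contrast between character-by-character splitting and word-based splitting can be shown with a toy example. The Python sketch below uses simple forward maximum matching against a tiny hypothetical dictionary. This is not the algorithm used by the IK plugin; it only illustrates what "splitting by words" means.

```python
def split_by_character(text):
    """Default-style behaviour: every Chinese character is its own token."""
    return [ch for ch in text if not ch.isspace()]


def forward_max_match(text, dictionary):
    """Greedy segmentation: always take the longest dictionary word."""
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # A single character is always accepted as a fallback.
            if length == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += length
                break
    return tokens


# Hypothetical toy dictionary, far smaller than a real one.
TOY_DICTIONARY = {"中国", "中国人", "国人"}

if __name__ == "__main__":
    print(split_by_character("我是中国人"))
    print(forward_max_match("我是中国人", TOY_DICTIONARY))
```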

3.4.1 Installation

Unzip the downloaded zip package and put its contents in /opt/module/elasticsearch/plugins/ik

Then restart ES.

3.4.2 Testing

Using the default analyzer

GET movie_index/_analyze
{  
  "text": "我是中国人"
}

Using the IK analyzer

GET movie_index/_analyze
{  
  "analyzer": "ik_smart", 
  "text": "我是中国人"
}

Another analyzer: ik_max_word

GET movie_index/_analyze
{  
  "analyzer": "ik_max_word", 
  "text": "我是中国人"
}

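The results show that different analyzers split the text quite differently. From now on a type should therefore not rely on the default mapping; the mapping has to be created by hand so that the analyzer can be chosen.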

3.4.3 Custom dictionary

Edit IKAnalyzer.cfg.xml in /opt/module/elasticsearch/plugins/ik/config/

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
        <comment>IK Analyzer extension configuration</comment>
        <!-- Users can configure their own extension dictionary here -->
        <entry key="ext_dict"></entry>
        <!-- Users can configure their own extension stopword dictionary here -->
        <entry key="ext_stopwords"></entry>
        <!-- Users can configure a remote extension dictionary here -->
        <entry key="remote_ext_dict">http://192.168.67.163/fenci/myword.txt</entry>
        <!-- Users can configure a remote extension stopword dictionary here -->
        <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>

Use nginx to publish a static resource at the path configured in remote_ext_dict.

Configure it in nginx.conf:

server {
        listen  80;
        server_name  192.168.67.163;
        location /fenci/ {
           root es;
        }
}

Also create the directory /es/fenci/ under /usr/local/nginx/ and add myword.txt to it.

Write the keywords in myword.txt, one word per line.

hello word

hello es

word

Then restart the ES server and restart nginx.

Test the segmentation result in Kibana:

GET movie_index/_analyze
{  
  "analyzer": "ik_max_word", 
  "text": "hello ik"
}

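After the dictionary is updated, ES uses the new words only when analyzing newly added data; historical data is not re-analyzed. To re-analyze historical data, run: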

POST movies_index_chn/_update_by_query?conflicts=proceed

3.5 About mapping

Earlier we said that a type can be thought of as a table. How, then, is the data type of each field defined?

3.5.1 View the mapping

GET movie_index/_mapping/movie

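The data type of each field in a type is defined by the mapping.

If no mapping has been set, the system infers the data type automatically from the format of a document's values.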

true/false → boolean
1020 → long
20.1 → float (double before 5.0)
"2018-02-01" → date
"hello world" → text + keyword

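By default only text is analyzed; keyword is a string that is not analyzed.

Besides being inferred automatically, a mapping can also be defined by hand, but only for newly added fields that hold no data yet. Once a field has data, its mapping can no longer be changed.

Note: although fields may be stored under different types, a field with a given name can have only one mapping definition within an index.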
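
The inference rules listed above can be approximated in a few lines of Python. The sketch is an illustration only: the function name is hypothetical, the date check recognises just the yyyy-MM-dd form used in the example, and floating-point values map to float, which is my understanding of the behaviour from 5.0 onwards (earlier versions inferred double).

```python
import re

# Only the yyyy-MM-dd form from the example is recognised here.
DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def infer_field_type(value):
    """Guess the field type that dynamic mapping would pick for a value."""
    # bool must be tested before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        if DATE_PATTERN.match(value):
            return "date"
        return "text + keyword"
    raise TypeError(f"unsupported value: {value!r}")


if __name__ == "__main__":
    for sample in [True, 1020, 20.1, "2018-02-01", "hello world"]:
        print(repr(sample), "->", infer_field_type(sample))
```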

3.5.2 Build an index with Chinese word segmentation

1) Create the mapping

PUT movie_chn
{
  "mappings": {
    "movie":{
      "properties": {
        "id":{
          "type": "long"
        },
        "name":{
          "type": "text",
          "analyzer": "ik_smart"
        },
        "doubanScore":{
          "type": "double"
        },
        "actorList":{
          "properties": {
            "id":{
              "type":"long"
            },
            "name":{
              "type":"keyword"
            }
          }
        }
      }
    }
  }
}

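2) Insert data

The documents still have to be added after the mapping is created. The requests below assume the three movies from 3.3.4 with Chinese names, chosen to fit the queries in step 3:

PUT /movie_chn/movie/1
{ "id":1,
  "name":"红海行动",
  "doubanScore":8.5,
  "actorList":[  
    {"id":1,"name":"张译"},
    {"id":2,"name":"海清"},
    {"id":3,"name":"张涵予"}
  ]
}

PUT /movie_chn/movie/2
{
  "id":2,
  "name":"湄公河行动",
  "doubanScore":8.0,
  "actorList":[  
    {"id":3,"name":"张涵予"}
  ]
}

PUT /movie_chn/movie/3
{
  "id":3,
  "name":"红海事件",
  "doubanScore":5.0,
  "actorList":[  
    {"id":4,"name":"张晨"}
  ]
}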

3) Test with queries

GET /movie_chn/movie/_search
{
  "query": {
    "match": {
      "name": "红海战役"
    }
  }
}
 
GET /movie_chn/movie/_search
{
  "query": {
    "term": {
      "actorList.name": "张译"
    }
  }
}

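3.6 Index aliases (_aliases)

An index alias works like a shortcut or symbolic link. It can point to one or more indices and can be used in any API that expects an index name. Aliases give a great deal of flexibility and make it possible to:

group several indices (for example, last_three_months)
create a view on a subset of an index
switch seamlessly from one index to another in a running cluster

3.6.1 Create an index alias

Declare it when the index is created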

PUT movie_chn_2020
{
  "aliases": {
    "movie_chn_2020-query": {}
  },
  "mappings": {
    "movie":{
      "properties": {
        "id":{
          "type": "long"
        },
        "name":{
          "type": "text",
          "analyzer": "ik_smart"
        },
        "doubanScore":{
          "type": "double"
        },
        "actorList":{
          "properties": {
            "id":{
              "type":"long"
            },
            "name":{
              "type":"keyword"
            }
          }
        }
      }
    }
  }
}

Add an alias to an existing index

POST  _aliases
{
    "actions": [
        { "add":    
           { "index": "movie_chn_xxxx", "alias": "movie_chn_2020-query" }
        }
    ]
}

A filter condition can also be added to narrow the query range, which creates a view on a subset

POST  _aliases
{
    "actions": [
        { "add":
           { "index": "movie_chn_xxxx",
             "alias": "movie_chn0919-query-zhhy",
             "filter": {
                 "term": { "actorList.id": "3" }
             }
           }
        }
    ]
}

3.6.2 Query through an alias (no different from using an ordinary index)

GET movie_chn_2020-query/_search

3.6.3 Remove an alias from an index

POST  _aliases
{
    "actions": [
        { "remove": { 
               "index": "movie_chn_xxxx", "alias": "movie_chn_2020-query" }
        }
    ]
}
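
A remove action like the one above can be combined with an add action in a single _aliases request, which is how an alias is moved from one index to another. The Python sketch below only builds the request body. The function name and the index name movie_chn_yyyy are hypothetical, introduced for the example.

```python
import json


def build_alias_switch(alias, old_index, new_index):
    """Body for POST _aliases that moves alias from old_index to new_index."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


if __name__ == "__main__":
    body = build_alias_switch("movie_chn_2020-query",
                              "movie_chn_xxxx",   # index currently behind the alias
                              "movie_chn_yyyy")   # hypothetical new index
    print(json.dumps(body, indent=2))
```

Sending both actions in one request means clients that query the alias never have to be told about the change of index.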

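3.6.4 Seamlessly switch an alias

Put a remove action for the old index and an add action for the new index in the same POST _aliases request. The actions of one request are applied atomically, so queries through the alias are never left without an index.

3.6.5 List aliases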

GET _cat/aliases?v

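3.7 Index templates

An index template is, as the name suggests, a mold for creating indices. It defines a set of rules that build the mappings and settings an index needs for a particular business requirement, so indices created from it are predictably consistent.

3.7.1 A common scenario: splitting an index

Splitting an index means dividing one business index into several indices by time interval.

For example, order_info becomes order_info_20200101, order_info_20200102, and so on.

This has two benefits:

Flexibility for structural changes: Elasticsearch does not allow the structure of existing data to be modified, yet in practice the structure and settings of an index inevitably change. With split indices, only the index for the next interval needs to be changed and the existing indices stay as they are, which gives a degree of flexibility.
Narrower query range: queries usually do not cover the whole time span, so splitting the index physically reduces the amount of data scanned, which also improves performance.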
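
The naming scheme and the narrowed query range can be sketched in Python as follows. The helper names are hypothetical; the date format follows the order_info_20200101 example above.

```python
from datetime import date, timedelta


def daily_index_name(prefix: str, day: date) -> str:
    """Name of the index that holds the data of one day."""
    return f"{prefix}_{day:%Y%m%d}"


def indices_for_range(prefix: str, start: date, end: date) -> list:
    """Names of the daily indices a query over [start, end] has to touch."""
    days = (end - start).days
    return [daily_index_name(prefix, start + timedelta(days=offset))
            for offset in range(days + 1)]


if __name__ == "__main__":
    print(daily_index_name("order_info", date(2020, 1, 1)))
    # A two-day query only needs two indices instead of the whole history.
    print(",".join(indices_for_range("order_info",
                                     date(2020, 1, 1), date(2020, 1, 2))))
```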

3.7.2 List the templates that already exist

GET _cat/templates

3.7.3 Create a template

PUT _template/template_movie2020
{
  "index_patterns": ["movie_test*"],
  "settings": {
    "number_of_shards": 1
  },
  "aliases" : {
    "{index}-query": {},
    "movie_test-query":{}
  },
  "mappings": {
    "_doc": {
      "properties": {
        "id": {
          "type": "keyword"
        },
        "movie_name": {
          "type": "text",
          "analyzer": "ik_smart"
        }
      }
    }
  }
}

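In "index_patterns": ["movie_test*"], the pattern means that whenever data is written to an index whose name starts with movie_test and that index does not exist yet, ES creates it automatically from this template.

In "aliases", {index} stands for the name of the index that is actually created.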
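
How a template is selected and how the {index} placeholder is filled in can be imitated in Python. This is a sketch of the idea, not of the Elasticsearch implementation: fnmatchcase is used as an approximation of the wildcard matching, and the function names are hypothetical.

```python
from fnmatch import fnmatchcase

TEMPLATE = {
    "index_patterns": ["movie_test*"],
    "aliases": {"{index}-query": {}, "movie_test-query": {}},
}


def template_applies(template: dict, index_name: str) -> bool:
    """True if the new index name matches one of the template patterns."""
    return any(fnmatchcase(index_name, pattern)
               for pattern in template["index_patterns"])


def resolve_aliases(template: dict, index_name: str) -> list:
    """Alias names with the {index} placeholder replaced by the index name."""
    return [alias.replace("{index}", index_name)
            for alias in template["aliases"]]


if __name__ == "__main__":
    name = "movie_test_2020xxxx"
    if template_applies(TEMPLATE, name):
        print(resolve_aliases(TEMPLATE, name))
```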

Test

POST movie_test_2020xxxx/_doc
{
  "id":"333",
  "name":"zhang3"
}

3.7.4 View the details of a template

GET  _template/template_movie2020
or
GET  _template/template_movie*