当前位置: 首页 > news >正文

ElasticSearch映射分词

目录

弃用Type

why

映射

查询 mapping of index

创建 index with mapping

添加 field with mapping

数据迁移

1.新建 一个 index with correct mapping 

2.数据迁移 reindex data into that index

分词

POST _analyze

自定义词库

ik分词器

circuit_breaking_exception


弃用Type

ES 6.x 之前,Type 开始弃用

ES 7.x ,被弱化,仍支持

ES 8.x ,完全移除

弃用后,每个索引只包含一种文档类型

如果需要区分不同类型的文档,俩种方式:

  • 创建不同的索引
  • 在文档中添加自定义字段来实现。

why

Elasticsearch 的底层存储(Lucene)是基于索引的,而不是基于 Type 的。

在同一个索引中,不同 Type 的文档可能具有相同名称但不同类型的字段,这种字段类型冲突会导致数据不一致和查询错误。

GET /bank/_search
{
  "query": {
    "match": {
      "address": "mill lane"
    }
  },
  "_source": ["account_number","address"]
}

从查询语句可以看出,查询是基于index的,不会去指定type。如果有不同type的address,就会引起查询冲突。


映射

Mapping 定义 doc和field 如何被存储和被检索

Mapping(映射) 是 Elasticsearch 中用于定义文档结构和字段类型的机制。它类似于关系型数据库中的表结构(Schema),用于描述文档中包含哪些字段、字段的数据类型(如文本、数值、日期等),以及字段的其他属性(如是否分词、是否索引等)。

Mapping 是 Elasticsearch 的核心概念之一,它决定了数据如何被存储、索引和查询。

查询 mapping of index

 _mapping

GET /bank/_mapping
{
  "bank" : {
    "mappings" : {
      "properties" : {
        "account_number" : {
          "type" : "long"
        },
        "address" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "age" : {
          "type" : "long"
        },
        "balance" : {
          "type" : "long"
        },
        "city" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "email" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "employer" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "firstname" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "gender" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "lastname" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "state" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        }
      }
    }
  }
}
  • text 可以添加子field ---keyword,类型是 keyword。keyword存储精确值

创建 index with mapping

Put /{indexName}

Put /my_index
{
  "mappings": {
    "properties": {
      "account_number": {
        "type": "long"
      },
      "address": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "city": {
        "type": "keyword"
      }
    }
  }
}

添加 field with mapping

  •  PUT /{indexName}/_mapping + mapping.properties请求体
PUT /my_index/_mapping
{
  "properties": {
    "state": {
      "type": "keyword",
      "index": false
    }
  }
}
  •  "index": false  该字段无法被索引,不会参与检索   默认true

数据迁移

 ES不支持修改已存在的mapping。若想更新已存在的mapping,就要进行数据迁移。

1.新建 一个 index with correct mapping 

PUT /my_bank
{
  "mappings": {
    "properties": {
      "account_number": {
        "type": "long"
      },
      "address": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "age": {
        "type": "integer"
      },
      "balance": {
        "type": "long"
      },
      "city": {
        "type": "keyword"
      },
      "email": {
        "type": "keyword"
      },
      "employer": {
        "type": "keyword"
      },
      "firstname": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "gender": {
        "type": "keyword"
      },
      "lastname": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "state": {
        "type": "keyword"
      }
    }
  }
}

2.数据迁移 reindex data into that index

POST _reindex
{
  "source": {
    "index": "bank",
    "type": "account"
  },
  "dest": {
    "index": "my_bank"
  }
}
  • ES 8.0  弃用type参数 


分词

        将文本拆分为单个词项(tokens)

POST _analyze

标准分词器

POST _analyze
{
  "analyzer": "standard",
  "text": ["it's test data","hello world"]
}

 Response

{
  "tokens" : [
    {
      "token" : "it's",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "test",
      "start_offset" : 5,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "data",
      "start_offset" : 10,
      "end_offset" : 14,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "hello",
      "start_offset" : 15,
      "end_offset" : 20,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "world",
      "start_offset" : 21,
      "end_offset" : 26,
      "type" : "<ALPHANUM>",
      "position" : 4
    }
  ]
}

自定义词库

nginx/html目录下 创建es/term.text,添加词条

配置ik远程词库,/elasticsearch/config/analysis-ik/IKAnalyzer.cfg.xml

 测试

POST _analyze
{
  "analyzer": "ik_smart",
  "text": "尚硅谷项目谷粒商城"
}

 [尚硅谷,谷粒商城]为term.text词库中的词条

 Response

{
  "tokens" : [
    {
      "token" : "尚硅谷",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "项目",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "谷粒商城",
      "start_offset" : 5,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 2
    }
  ]
}

ik分词器

        中文分词

github地址

https://github.com/infinilabs/analysis-ik

    下载地址

    bin/elasticsearch-plugin install https://get.infini.cloud/elasticsearch/analysis-ik/7.4.2

    进入docker容器ES 下载 ik 插件

    卸载插件

    elasticsearch-plugin remove analysis-ik

    测试

    POST _analyze
    {
      "analyzer": "ik_smart",
      "text": "我要成为java高手"
    }

    Response 

    {
      "tokens" : [
        {
          "token" : "我",
          "start_offset" : 0,
          "end_offset" : 1,
          "type" : "CN_CHAR",
          "position" : 0
        },
        {
          "token" : "要",
          "start_offset" : 1,
          "end_offset" : 2,
          "type" : "CN_CHAR",
          "position" : 1
        },
        {
          "token" : "成为",
          "start_offset" : 2,
          "end_offset" : 4,
          "type" : "CN_WORD",
          "position" : 2
        },
        {
          "token" : "java",
          "start_offset" : 4,
          "end_offset" : 8,
          "type" : "ENGLISH",
          "position" : 3
        },
        {
          "token" : "高手",
          "start_offset" : 8,
          "end_offset" : 10,
          "type" : "CN_WORD",
          "position" : 4
        }
      ]
    }

    circuit_breaking_exception

    熔断器机制被触发

    {
        "error": {
            "root_cause": [
                {
                    "type": "circuit_breaking_exception",
                    "reason": "[parent] Data too large, data for [<http_request>] would be [124604192/118.8mb], which is larger than the limit of [123273216/117.5mb], real usage: [124604192/118.8mb], new bytes reserved: [0/0b], usages [request=0/0b, fielddata=1788/1.7kb, in_flight_requests=0/0b, accounting=225547/220.2kb]",
                    "bytes_wanted": 124604192,
                    "bytes_limit": 123273216,
                    "durability": "PERMANENT"
                }
            ],
            "type": "circuit_breaking_exception",
            "reason": "[parent] Data too large, data for [<http_request>] would be [124604192/118.8mb], which is larger than the limit of [123273216/117.5mb], real usage: [124604192/118.8mb], new bytes reserved: [0/0b], usages [request=0/0b, fielddata=1788/1.7kb, in_flight_requests=0/0b, accounting=225547/220.2kb]",
            "bytes_wanted": 124604192,
            "bytes_limit": 123273216,
            "durability": "PERMANENT"
        },
        "status": 429
    }

    查看ES日志

    docker logs elasticsearch

    检查 Elasticsearch 的内存使用情况

    GET /_cat/nodes?v&h=name,heap.percent,ram.percent
    • 如果 heap.percent 或 ram.percent 接近 100%,说明内存不足。

     增加 Elasticsearch 堆内存

    删除并重新创建容器 调整 -Xms 和 -Xmx 参数 256m

    docker run --name elasticsearch -p 9200:9200 -p 9300:9300 \
    > -e "discovery.type=single-node" \
    > -e ES_JAVA_OPTS="-Xms64m -Xmx256m" \
    > -v /mydata/elasticsearch/config/elasticsearch.yml:/usr/share/elasticsearch/config/elasticsearch.yml \
    > -v  /mydata/elasticsearch/data:/usr/share/elasticsearch/data \
    > -v /mydata/elasticsearch/plugins:/usr/share/elasticsearch/plugins \
    > -d elasticsearch:7.4.2

    相关文章:

  • vue3响应式丢失解决办法(三)
  • Leetcode Hot100 第30题 416.分割等和子集
  • CTM工具箱--系统美化工具箱
  • Leetcode100-春招-矩阵题类
  • 图论入门算法:拓扑排序(C++)
  • Copilot:Excel中的Python高级分析来了
  • C#控制台大小Console.SetWindowSize函数失效解决
  • AtCoder Beginner Contest 393(ABCDEF)
  • 苹果CMS站群插件的自动生成功能:提升网站流量的秘诀
  • DeepSeek R1 32B 本地部署实战
  • 解决“IndentationError: unexpected indent”错误
  • 【强化学习的数学原理】第07课-时序差分方法-笔记
  • 【Linux内核】进程管理(上)
  • DOS命令 setx 用法
  • MyBatis进阶
  • Spring的BeanFactory和FactoryBean有 什么不同
  • 基于矢量轨道角动量波的透射超表面设计
  • 基于deepseek api和openweather 天气API实现Function Calling技术讲解
  • pandas(11 分类数据和数据可视化)
  • 腿足机器人之八- 腿足机器人动力学
  • 网站ip过万/网络营销的种类有哪些
  • 企业为什么做网站 图片/企业营销推广策划
  • 阿里云企业网站备案/bt磁力搜索神器
  • 购物网站建设和使用/优化大师电脑版官网
  • 手机网站设置在哪里找/seo引擎优化教程
  • 如何做网站打广告/全国十大跨境电商排名