Baidu AI Cloud Elasticsearch - Baidu NLP Chinese Word Segmentation Plugin

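analysis-baidu-nlp is a Chinese word segmentation plugin developed in-house by the Baidu AI Cloud Elasticsearch (ES) team. According to Baidu, it is among the industry leaders in both segmentation performance and accuracy.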

Background

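analysis-baidu-nlp is built on DeepCRF, a model developed in-house by Baidu NLP that draws on more than ten years of Baidu's work in Chinese search. Baidu describes the model's performance and accuracy as industry-leading.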

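The plugin produces segmentation results at two granularities, basic and phrase, to suit different application needs. The phrase-granularity result is obtained by intelligently combining basic-granularity tokens.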

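Note:

  • The dictionary model is loaded into the JVM's off-heap memory on first use, so we recommend nodes with at least 8 GB of memory.
  • The NLP Chinese word segmentation plugin currently supports instances running versions 6.5.3 and 7.4.2. If your cluster's version does not support the plugin, submit a ticket and the BES team will help upgrade the cluster. For the upgrade procedure, see ES version upgrade.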
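
As a quick pre-flight check, the two notes above can be expressed as a small Python helper. This is only a sketch: the function name `check_cluster` and its parameters are hypothetical, and the version list and memory threshold are simply the values stated in the notes.

```python
# Versions listed as supported in the note above.
SUPPORTED_VERSIONS = {"6.5.3", "7.4.2"}
# Recommended minimum node memory from the note above, in GB.
RECOMMENDED_MIN_MEMORY_GB = 8


def check_cluster(version, node_memory_gb):
    """Return a list of human-readable warnings (empty if nothing to report)."""
    warnings = []
    if version.strip() not in SUPPORTED_VERSIONS:
        warnings.append(
            f"version {version} is not listed as supported; "
            "submit a ticket so the cluster can be upgraded"
        )
    if node_memory_gb < RECOMMENDED_MIN_MEMORY_GB:
        warnings.append(
            f"node memory {node_memory_gb} GB is below the recommended "
            f"{RECOMMENDED_MIN_MEMORY_GB} GB"
        )
    return warnings


for message in check_cluster("7.10.0", 4):
    print("warning:", message)
```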

Segmentation granularity

analysis-baidu-nlp provides analyzers at two granularities:

  1. Basic-granularity model (bd-nlp-basic)
  2. Phrase-granularity model (bd-nlp-phrase)

Both analyzers have a case filter and a stop-word filter built in, so they work out of the box.

Two tokenizers are provided under the same names:

  1. Basic-granularity model (bd-nlp-basic)
  2. Phrase-granularity model (bd-nlp-phrase)

Both tokenizers return only the raw segmentation result. You can add your own stop-word filtering and other, more complex filters to suit your application.
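
A minimal sketch of what such a custom filter chain might look like, written as a Python function that builds the index settings body. The filter name `my_stop`, the analyzer name `my_phrase_analyzer`, and the sample stop words are assumptions chosen for illustration; `stop` and `lowercase` are standard Elasticsearch token filters.

```python
import json


def build_index_settings(tokenizer="bd-nlp-phrase", stopwords=("的", "了")):
    """Build index settings that pair a bd-nlp tokenizer with custom filters."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Hypothetical custom stop-word list for this sketch.
                    "my_stop": {"type": "stop", "stopwords": list(stopwords)}
                },
                "analyzer": {
                    "my_phrase_analyzer": {
                        "type": "custom",
                        "tokenizer": tokenizer,
                        # Filters run in order: lowercase first, then stop words.
                        "filter": ["lowercase", "my_stop"],
                    }
                },
            }
        }
    }


# Print the request body that would be sent when creating the index.
print(json.dumps(build_index_settings(), ensure_ascii=False, indent=2))
```

Because the tokenizers apply no filtering of their own, the order of entries in `filter` fully determines how tokens are post-processed.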

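Comparison with ik at basic and phrase granularity

Basic granularity comparison

Segmenting “维修基金” at basic granularity, compared with ik_max_word:

  • bd-nlp-basic segmentation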
POST /_analyze
{
    "text": "维修基金",
    "analyzer": "bd-nlp-basic"
}

Result:

{
   "tokens": [
      {
         "token": "维修",
         "start_offset": 0,
         "end_offset": 2,
         "type": "WORD",
         "position": 0
      },
      {
         "token": "基金",
         "start_offset": 2,
         "end_offset": 4,
         "type": "WORD",
         "position": 1
      }
   ]
}
  • ik_max_word segmentation
POST _analyze
{
    "analyzer": "ik_max_word",
    "text": "维修基金"
}

Result:

{
   "tokens": [
      {
         "token": "维修基金",
         "start_offset": 0,
         "end_offset": 4,
         "type": "CN_WORD",
         "position": 0
      },
      {
         "token": "维修",
         "start_offset": 0,
         "end_offset": 2,
         "type": "CN_WORD",
         "position": 1
      },
      {
         "token": "维",
         "start_offset": 0,
         "end_offset": 1,
         "type": "CN_WORD",
         "position": 2
      },
      {
         "token": "修",
         "start_offset": 1,
         "end_offset": 2,
         "type": "CN_CHAR",
         "position": 3
      },
      {
         "token": "基金",
         "start_offset": 2,
         "end_offset": 4,
         "type": "CN_WORD",
         "position": 4
      },
      {
         "token": "基",
         "start_offset": 2,
         "end_offset": 3,
         "type": "CN_WORD",
         "position": 5
      },
      {
         "token": "金",
         "start_offset": 3,
         "end_offset": 4,
         "type": "CN_CHAR",
         "position": 6
      }
   ]
}
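
To make the difference concrete, the two responses above can be compared programmatically. The sketch below hard-codes the tokens and offsets from the two results (other fields omitted) and checks whether the tokens tile the input exactly once; the helper names `token_texts` and `tiles_text` are illustrative only.

```python
def tok(token, start, end):
    """Shorthand for one entry of the `tokens` array in an _analyze response."""
    return {"token": token, "start_offset": start, "end_offset": end}


# Tokens copied from the two responses above.
BD_BASIC = {"tokens": [tok("维修", 0, 2), tok("基金", 2, 4)]}
IK_MAX_WORD = {"tokens": [
    tok("维修基金", 0, 4), tok("维修", 0, 2), tok("维", 0, 1), tok("修", 1, 2),
    tok("基金", 2, 4), tok("基", 2, 3), tok("金", 3, 4),
]}


def token_texts(response):
    """Extract just the token strings from an _analyze response."""
    return [t["token"] for t in response["tokens"]]


def tiles_text(response, text_length):
    """True if the token spans cover [0, text_length) exactly once, no overlaps."""
    spans = sorted((t["start_offset"], t["end_offset"]) for t in response["tokens"])
    cursor = 0
    for start, end in spans:
        if start != cursor:
            return False
        cursor = end
    return cursor == text_length


for name, response in (("bd-nlp-basic", BD_BASIC), ("ik_max_word", IK_MAX_WORD)):
    print(name, token_texts(response), "tiles text:", tiles_text(response, 4))
```

In this example bd-nlp-basic emits one non-overlapping segmentation, while ik_max_word emits every dictionary match, so its tokens overlap.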

Phrase granularity comparison

Segmenting “清明节,又称踏青节、行清节、三月节、祭祖节等” at phrase granularity, compared with ik_smart:

  • bd-nlp-phrase segmentation
POST /_analyze
{
    "text": "清明节,又称踏青节、行清节、三月节、祭祖节等",
    "analyzer": "bd-nlp-phrase"
}

Result:

{
   "tokens": [
      {
         "token": "清明节",
         "start_offset": 0,
         "end_offset": 3,
         "type": "WORD",
         "position": 0
      },
      {
         "token": "又称",
         "start_offset": 4,
         "end_offset": 6,
         "type": "WORD",
         "position": 2
      },
      {
         "token": "踏青节",
         "start_offset": 6,
         "end_offset": 9,
         "type": "WORD",
         "position": 3
      },
      {
         "token": "行清节",
         "start_offset": 10,
         "end_offset": 13,
         "type": "WORD",
         "position": 5
      },
      {
         "token": "三月节",
         "start_offset": 14,
         "end_offset": 17,
         "type": "WORD",
         "position": 7
      },
      {
         "token": "祭祖",
         "start_offset": 18,
         "end_offset": 20,
         "type": "WORD",
         "position": 9
      },
      {
         "token": "节",
         "start_offset": 20,
         "end_offset": 21,
         "type": "WORD",
         "position": 10
      }
   ]
}
  • ik_smart segmentation
POST _analyze
{
    "analyzer": "ik_smart",
    "text": "清明节,又称踏青节、行清节、三月节、祭祖节等"
}

Result:

{
   "tokens": [
      {
         "token": "清明节",
         "start_offset": 0,
         "end_offset": 3,
         "type": "CN_WORD",
         "position": 0
      },
      {
         "token": "又称",
         "start_offset": 4,
         "end_offset": 6,
         "type": "CN_WORD",
         "position": 1
      },
      {
         "token": "踏青",
         "start_offset": 6,
         "end_offset": 8,
         "type": "CN_WORD",
         "position": 2
      },
      {
         "token": "节",
         "start_offset": 8,
         "end_offset": 9,
         "type": "CN_WORD",
         "position": 3
      },
      {
         "token": "行",
         "start_offset": 10,
         "end_offset": 11,
         "type": "CN_WORD",
         "position": 4
      },
      {
         "token": "清",
         "start_offset": 11,
         "end_offset": 12,
         "type": "CN_CHAR",
         "position": 5
      },
      {
         "token": "节",
         "start_offset": 12,
         "end_offset": 13,
         "type": "CN_WORD",
         "position": 6
      },
      {
         "token": "三月",
         "start_offset": 14,
         "end_offset": 16,
         "type": "CN_WORD",
         "position": 7
      },
      {
         "token": "节",
         "start_offset": 16,
         "end_offset": 17,
         "type": "COUNT",
         "position": 8
      },
      {
         "token": "祭祖",
         "start_offset": 18,
         "end_offset": 20,
         "type": "CN_WORD",
         "position": 9
      },
      {
         "token": "节",
         "start_offset": 20,
         "end_offset": 21,
         "type": "CN_WORD",
         "position": 10
      }
   ]
}

Using the Analyze API

Basic-granularity segmentation

POST /_analyze
{
   "analyzer": "bd-nlp-basic",
   "text": "去年我们和他们展开了炉际竞赛,第一回合赢了,第二回合和第三回合却败下阵来。"
}

Result:

{
  "tokens": [
     {
        "token": "去年",
        "start_offset": 0,
        "end_offset": 2,
        "type": "WORD",
        "position": 0
     },
     {
        "token": "我们",
        "start_offset": 2,
        "end_offset": 4,
        "type": "WORD",
        "position": 1
     },
     {
        "token": "和",
        "start_offset": 4,
        "end_offset": 5,
        "type": "WORD",
        "position": 2
     },
     {
        "token": "他们",
        "start_offset": 5,
        "end_offset": 7,
        "type": "WORD",
        "position": 3
     },
     {
        "token": "展开",
        "start_offset": 7,
        "end_offset": 9,
        "type": "WORD",
        "position": 4
     },
     {
        "token": "炉际",
        "start_offset": 10,
        "end_offset": 12,
        "type": "WORD",
        "position": 6
     },
     {
        "token": "竞赛",
        "start_offset": 12,
        "end_offset": 14,
        "type": "WORD",
        "position": 7
     },
     {
        "token": "第一",
        "start_offset": 15,
        "end_offset": 17,
        "type": "WORD",
        "position": 9
     },
     {
        "token": "回合",
        "start_offset": 17,
        "end_offset": 19,
        "type": "WORD",
        "position": 10
     },
     {
        "token": "赢",
        "start_offset": 19,
        "end_offset": 20,
        "type": "WORD",
        "position": 11
     },
     {
        "token": "第二",
        "start_offset": 22,
        "end_offset": 24,
        "type": "WORD",
        "position": 14
     },
     {
        "token": "回合",
        "start_offset": 24,
        "end_offset": 26,
        "type": "WORD",
        "position": 15
     },
     {
        "token": "和",
        "start_offset": 26,
        "end_offset": 27,
        "type": "WORD",
        "position": 16
     },
     {
        "token": "第三",
        "start_offset": 27,
        "end_offset": 29,
        "type": "WORD",
        "position": 17
     },
     {
        "token": "回合",
        "start_offset": 29,
        "end_offset": 31,
        "type": "WORD",
        "position": 18
     },
     {
        "token": "败",
        "start_offset": 32,
        "end_offset": 33,
        "type": "WORD",
        "position": 20
     },
     {
        "token": "下",
        "start_offset": 33,
        "end_offset": 34,
        "type": "WORD",
        "position": 21
     },
     {
        "token": "阵",
        "start_offset": 34,
        "end_offset": 35,
        "type": "WORD",
        "position": 22
     },
     {
        "token": "来",
        "start_offset": 35,
        "end_offset": 36,
        "type": "WORD",
        "position": 23
     }
  ]
}
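
The `position` values in the result above are not consecutive. The jumps appear to correspond to stop words and punctuation that the analyzer's built-in filters dropped. The Python sketch below, using a slice of the tokens above, recovers the skipped text from the offsets; the helper name `dropped_between` is hypothetical, and the code assumes the offsets can be used directly as Python string indices (true for the characters in this text).

```python
# Input text and a slice of the bd-nlp-basic tokens, copied from the result above.
TEXT = "去年我们和他们展开了炉际竞赛,第一回合赢了,第二回合和第三回合却败下阵来。"
TOKENS = [
    {"token": "展开", "start_offset": 7, "end_offset": 9, "position": 4},
    {"token": "炉际", "start_offset": 10, "end_offset": 12, "position": 6},
    {"token": "竞赛", "start_offset": 12, "end_offset": 14, "position": 7},
    {"token": "第一", "start_offset": 15, "end_offset": 17, "position": 9},
    {"token": "回合", "start_offset": 17, "end_offset": 19, "position": 10},
    {"token": "赢", "start_offset": 19, "end_offset": 20, "position": 11},
    {"token": "第二", "start_offset": 22, "end_offset": 24, "position": 14},
]


def dropped_between(text, tokens):
    """For each jump in `position`, return (previous, next, text skipped between)."""
    gaps = []
    for prev, cur in zip(tokens, tokens[1:]):
        if cur["position"] - prev["position"] > 1:
            skipped = text[prev["end_offset"]:cur["start_offset"]]
            gaps.append((prev["token"], cur["token"], skipped))
    return gaps


for prev, cur, skipped in dropped_between(TEXT, TOKENS):
    print(f"{prev} -> {cur}: skipped {skipped!r}")
```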

Phrase-granularity segmentation

POST /_analyze
{
   "analyzer": "bd-nlp-phrase",
   "text": "去年我们和他们展开了炉际竞赛,第一回合赢了,第二回合和第三回合却败下阵来。"
}

Result:

{
  "tokens": [
     {
        "token": "去年",
        "start_offset": 0,
        "end_offset": 2,
        "type": "WORD",
        "position": 0
     },
     {
        "token": "我们",
        "start_offset": 2,
        "end_offset": 4,
        "type": "WORD",
        "position": 1
     },
     {
        "token": "和",
        "start_offset": 4,
        "end_offset": 5,
        "type": "WORD",
        "position": 2
     },
     {
        "token": "他们",
        "start_offset": 5,
        "end_offset": 7,
        "type": "WORD",
        "position": 3
     },
     {
        "token": "展开",
        "start_offset": 7,
        "end_offset": 9,
        "type": "WORD",
        "position": 4
     },
     {
        "token": "炉际竞赛",
        "start_offset": 10,
        "end_offset": 14,
        "type": "WORD",
        "position": 6
     },
     {
        "token": "第一回合",
        "start_offset": 15,
        "end_offset": 19,
        "type": "WORD",
        "position": 8
     },
     {
        "token": "赢",
        "start_offset": 19,
        "end_offset": 20,
        "type": "WORD",
        "position": 9
     },
     {
        "token": "第二回合",
        "start_offset": 22,
        "end_offset": 26,
        "type": "WORD",
        "position": 12
     },
     {
        "token": "和",
        "start_offset": 26,
        "end_offset": 27,
        "type": "WORD",
        "position": 13
     },
     {
        "token": "第三",
        "start_offset": 27,
        "end_offset": 29,
        "type": "WORD",
        "position": 14
     },
     {
        "token": "回合",
        "start_offset": 29,
        "end_offset": 31,
        "type": "WORD",
        "position": 15
     },
     {
        "token": "败",
        "start_offset": 32,
        "end_offset": 33,
        "type": "WORD",
        "position": 17
     },
     {
        "token": "下",
        "start_offset": 33,
        "end_offset": 34,
        "type": "WORD",
        "position": 18
     },
     {
        "token": "阵",
        "start_offset": 34,
        "end_offset": 35,
        "type": "WORD",
        "position": 19
     },
     {
        "token": "来",
        "start_offset": 35,
        "end_offset": 36,
        "type": "WORD",
        "position": 20
     }
  ]
}
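
Comparing this result with the basic-granularity one shows how phrase tokens are assembled from basic tokens. The sketch below groups basic tokens under the phrase token whose offsets contain them, using a slice of the two results; the helper names are illustrative only.

```python
def tok(token, start, end):
    """Shorthand for one token with its character offsets."""
    return {"token": token, "start_offset": start, "end_offset": end}


# Slices of the bd-nlp-basic and bd-nlp-phrase results for the same text.
BASIC = [tok("炉际", 10, 12), tok("竞赛", 12, 14), tok("第一", 15, 17),
         tok("回合", 17, 19), tok("赢", 19, 20)]
PHRASE = [tok("炉际竞赛", 10, 14), tok("第一回合", 15, 19), tok("赢", 19, 20)]


def group_basic_by_phrase(basic, phrase):
    """Pair each phrase token with the basic tokens that fall inside its span."""
    groups = []
    for p in phrase:
        parts = [
            b["token"]
            for b in basic
            if b["start_offset"] >= p["start_offset"]
            and b["end_offset"] <= p["end_offset"]
        ]
        groups.append((p["token"], parts))
    return groups


for phrase_token, parts in group_basic_by_phrase(BASIC, PHRASE):
    print(phrase_token, "=", " + ".join(parts))
```

Note that not every candidate is merged: in the full result above, “第三” and “回合” remain separate tokens even at phrase granularity.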

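Specifying an analyzer for an index

The request below uses the mapping type doc, which is the 6.x syntax. On a 7.x instance, mapping types are not accepted by default, so put properties directly under mappings instead.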

PUT test
{
   "mappings": {
      "doc": {
         "properties": {
            "k1": {
               "type": "text",
               "analyzer": "bd-nlp-basic" // use the basic-granularity model
            },
            "k2": {
               "type": "text",
               "analyzer": "bd-nlp-phrase" // use the phrase-granularity model
            }
         }
      }
   },
   "settings": {
      "index": {
         "number_of_shards": "1",
         "number_of_replicas": "0"
      }
   }
}
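
Since the plugin supports both 6.5.3 and 7.4.2, and the two versions differ in whether the mapping needs a type name, it can help to generate the request body from one function. This is a sketch under that assumption; `build_create_index_body` and its parameters are hypothetical names.

```python
import json


def build_create_index_body(field_analyzers, mapping_type=None):
    """Build a create-index body mapping each text field to an analyzer.

    Pass mapping_type (e.g. "doc") for 6.x-style typed mappings;
    leave it as None for 7.x-style typeless mappings.
    """
    properties = {
        name: {"type": "text", "analyzer": analyzer}
        for name, analyzer in field_analyzers.items()
    }
    mappings = {"properties": properties}
    if mapping_type is not None:
        # 6.x nests the properties under a type name.
        mappings = {mapping_type: mappings}
    return {
        "mappings": mappings,
        "settings": {"index": {"number_of_shards": "1", "number_of_replicas": "0"}},
    }


fields = {"k1": "bd-nlp-basic", "k2": "bd-nlp-phrase"}
print(json.dumps(build_create_index_body(fields, mapping_type="doc"), indent=2))  # 6.x
print(json.dumps(build_create_index_body(fields), indent=2))                      # 7.x
```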

Specifying a tokenizer for an index

PUT /test
{
    "settings":{
        "analysis":{
            "analyzer":{
                "my_analyzer":{
                    "tokenizer":"bd-nlp-basic",   // tokenizer used by the custom analyzer my_analyzer
                    "filter":[
                        "lowercase"               // add whatever filters your application needs
                    ]
                }
            }
        }
    },
    "mappings":{
        "properties":{
            "k2":{
                "type":"text",
                "analyzer":"my_analyzer"         // apply the custom analyzer to this field
            }
        }
    }
}
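
Once the index exists, the custom analyzer can be tried out with the index-scoped Analyze API (POST /test/_analyze). The Python sketch below builds that request with the standard library. The endpoint `http://localhost:9200` is an assumption, a real cluster will likely also need authentication, and the actual send is left commented out so the snippet runs without a cluster.

```python
import json
import urllib.request

# Assumption: the cluster endpoint; replace with your instance's address.
BASE_URL = "http://localhost:9200"


def build_analyze_request(index, analyzer, text, base_url=BASE_URL):
    """Build (but do not send) a POST /<index>/_analyze request."""
    body = json.dumps({"analyzer": analyzer, "text": text}, ensure_ascii=False)
    return urllib.request.Request(
        url=f"{base_url}/{index}/_analyze",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and return the token strings from the JSON response."""
    with urllib.request.urlopen(request) as response:
        return [t["token"] for t in json.load(response)["tokens"]]


request = build_analyze_request("test", "my_analyzer", "维修基金")
print(request.get_method(), request.full_url)
# tokens = send(request)  # uncomment when a cluster is reachable
```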

Precision and recall

Results on Baidu's internal large-scale dataset:

Model               Precision  Recall  F-score
analysis-baidu-nlp  98.8%      98.9%   98.8%
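
The source does not say how the F-score was computed. Assuming it is the usual F1 score, that is, the harmonic mean of precision and recall, the reported figures are consistent with each other, as this short Python check suggests:

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall from the table above, as fractions.
f1 = f_score(0.988, 0.989)
print(f"{f1:.1%}")
```

Rounded to one decimal place, the computed value agrees with the 98.8% in the table.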