Introduction to KingbaseES Full-Text Search

May 31, 2023, 3:36 AM

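The default parser built into KingbaseES splits text on whitespace. Chinese words are not separated by spaces, so this approach does not work for Chinese. Chinese full-text search needs an additional Chinese word-segmentation plugin: zhparser or sys_jieba. zhparser supports the GBK and UTF8 character sets; sys_jieba supports the UTF8 character set.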

1. Default whitespace tokenization

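test=# SELECT to_tsvector('English','Try not to become a man of success, but rather try to become a man of value');
                             to_tsvector
----------------------------------------------------------------------
 'becom':4,13 'man':6,15 'rather':10 'success':8 'tri':1,11 'valu':17
(1 row)

test=# SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value');
                                                     to_tsvector
---------------------------------------------------------------------------------------------------------------------
 'a':5,14 'become':4,13 'but':9 'man':6,15 'not':2 'of':7,16 'rather':10 'success':8 'to':3,12 'try':1,11 'value':17
(1 row)

test=# show default_text_search_config;
 default_text_search_config
----------------------------
 pg_catalog.simple
(1 row)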

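Normalization performs the following operations. Uppercase letters are always converted to lowercase. The numbers after each lexeme record its positions in the original string; for example, man appears at positions 6 and 15. With a language configuration such as english, stop words (a, of, to and so on) are also dropped and the remaining words are reduced to their stems, so become is stored as becom. The default configuration in this session is simple, which only lowercases.

A tsvector is matched against a tsquery with the @@ operator: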

test=# SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value') @@ 'become';
 ?column?
----------
 t
(1 row)

test=# select 'become'::tsquery,to_tsquery('become'),to_tsquery('english','become');
 tsquery  | to_tsquery | to_tsquery
----------+------------+------------
 'become' | 'become'   | 'becom'
(1 row)

test=# SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value') @@ to_tsquery('!become');
 ?column?
----------
 f
(1 row)

test=# SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value') @@ to_tsquery('Try & !becom');
 ?column?
----------
 t
(1 row)

test=# SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value') @@ to_tsquery('bec:*');
 ?column?
----------
 t
(1 row)

test=# SELECT to_tsvector('english','Try not to become a man of success, but rather try to become a man of value') ;
                             to_tsvector
----------------------------------------------------------------------
 'becom':4,13 'man':6,15 'rather':10 'success':8 'tri':1,11 'valu':17
(1 row)

test=# SELECT to_tsvector('french'::regconfig,'Try not to become a man of success, but rather try to become a man of value') ;
                                                   to_tsvector
-----------------------------------------------------------------------------------------------------------------
 'a':5,14 'becom':4,13 'but':9 'man':6,15 'not':2 'of':7,16 'rath':10 'success':8 'to':3,12 'try':1,11 'valu':17
(1 row)

test=# SELECT to_tsvector('人大金仓致力于提供高可靠的数据库产品');
               to_tsvector
------------------------------------------
 '人大金仓致力于提供高可靠的数据库产品':1
(1 row)

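Because the built-in parser splits on whitespace and Chinese text has no spaces between words, the whole sentence is treated as a single token.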
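To see the tokenization directly rather than through the final tsvector, the parser can be queried on its own. The following is a minimal sketch, assuming KingbaseES exposes the same ts_parse and ts_debug functions and the same built-in parser name (default) as PostgreSQL; that compatibility is an assumption, not something the article shows.

```sql
-- Sketch only: assumes the PostgreSQL functions ts_parse/ts_debug
-- and the built-in parser name 'default' are available in KingbaseES.

-- Each row is one token found by the built-in parser (tokid, token).
SELECT * FROM ts_parse('default', 'Try not to become a man of success');

-- The same parser applied to the Chinese sentence from the article;
-- per the text above, the sentence is expected to come back unsplit.
SELECT * FROM ts_parse('default', '人大金仓致力于提供高可靠的数据库产品');

-- ts_debug also reports the token type alias and the lexemes
-- produced under a given configuration (here: simple).
SELECT alias, token, lexemes
FROM ts_debug('simple', '人大金仓致力于提供高可靠的数据库产品');
```

Comparing the two ts_parse results makes it easier to tell whether a failed match comes from the parser or from the dictionary stage.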

create extension zhparser;
create text search configuration zhongwen_parser (parser = zhparser);
alter text search configuration zhongwen_parser add mapping for n,v,a,i,e,l,j with simple;

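The letters after for are token types. The mapping above covers only seven of them: noun (n), verb (v), adjective (a), idiom (i), exclamation (e), abbreviation (j) and idiomatic expression (l). Tokens of every other type are discarded. The dictionary used is the built-in simple dictionary. The full set of token types is listed by: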

test=# select ts_token_type('zhparser');
     ts_token_type

The text search configuration catalog now lists zhongwen_parser alongside the built-in configurations:

  oid  |     cfgname     | cfgnamespace | cfgowner | cfgparser
-------+-----------------+--------------+----------+-----------
  3748 | simple          |           11 |       10 |      3722
 13265 | arabic          |           11 |       10 |      3722
 13267 | danish          |           11 |       10 |      3722
 13269 | dutch           |           11 |       10 |      3722
 13271 | english         |           11 |       10 |      3722
 13273 | finnish         |           11 |       10 |      3722
 13275 | french          |           11 |       10 |      3722
 13277 | german          |           11 |       10 |      3722
 13279 | hungarian       |           11 |       10 |      3722
 13281 | indonesian      |           11 |       10 |      3722
 13283 | irish           |           11 |       10 |      3722
 13285 | italian         |           11 |       10 |      3722
 13287 | lithuanian      |           11 |       10 |      3722
 13289 | nepali          |           11 |       10 |      3722
 13291 | norwegian       |           11 |       10 |      3722
 13293 | portuguese      |           11 |       10 |      3722
 13295 | romanian        |           11 |       10 |      3722
 13297 | russian         |           11 |       10 |      3722
 13299 | spanish         |           11 |       10 |      3722
 13301 | swedish         |           11 |       10 |      3722
 13303 | tamil           |           11 |       10 |      3722
 13305 | turkish         |           11 |       10 |      3722
 16390 | parser_name     |         2200 |       10 |     16389
 24587 | zhongwen_parser |         2200 |       10 |     16389

test=# select to_tsvector('zhongwen_parser','人大金仓致力于提供高可靠的数据库产品');
                           to_tsvector

test=# \df+ contains
 Schema |   Name   | Result data type | Argument data types | Type | Volatility | Parallel | Owner  | Security | Access privileges | Language |               Source code                | Description
--------+----------+------------------+---------------------+------+------------+----------+--------+----------+-------------------+----------+------------------------------------------+-------------
 sys    | contains | boolean          | text, text          | func | immutable  | safe     | system | invoker  |                   | sql      | select to_tsvector($1) @@ to_tsquery($2) |
 sys    | contains | boolean          | text, text, integer | func | immutable  | safe     | system | invoker  |                   | sql      | select to_tsvector($1) @@ to_tsquery($2) |
 sys    | contains | boolean          | text, tsquery       | func | immutable  | safe     | system | invoker  |                   | sql      | select $1::tsvector @@ $2                |
 sys    | contains | boolean          | tsvector, text      | func | immutable  | safe     | system | invoker  |                   | sql      | select $1 @@ $2::tsquery                 |
 sys    | contains | boolean          | tsvector, tsquery   | func | immutable  | safe     | system | invoker  |                   | sql      | select $1 @@ $2                          |

By default the contains function uses the whitespace parser, so it cannot be used to test for a match in Chinese text.

test=# select contains('人大金仓致力于提供高可靠的数据库产品','产品');
 contains
----------
 f
(1 row)
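One way around this limitation is to bypass contains and name the zhparser-based configuration explicitly on both sides of @@. The sketch below reuses zhongwen_parser from the article. The second variant, which changes the session default so that the single-argument to_tsvector and to_tsquery calls inside contains use it, is an assumption that was not verified against KingbaseES; it relies on the standard PostgreSQL setting default_text_search_config.

```sql
-- Variant 1: pass the configuration explicitly to both functions.
SELECT to_tsvector('zhongwen_parser', '人大金仓致力于提供高可靠的数据库产品')
       @@ to_tsquery('zhongwen_parser', '产品');

-- Variant 2 (assumption, not verified): make zhongwen_parser the
-- session default, so that contains(text, text), whose body is
-- to_tsvector($1) @@ to_tsquery($2), uses it as well.
SET default_text_search_config = 'zhongwen_parser';
SELECT contains('人大金仓致力于提供高可靠的数据库产品', '产品');

-- Restore the configured default afterwards.
RESET default_text_search_config;
```

Variant 1 keeps the choice of parser visible in the query itself, which matters when the same database holds both English and Chinese text. Whether either query matches depends on how zhparser segments the sentence and on the token types mapped in zhongwen_parser.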

Original: https://blog.csdn.net/lyu1026/article/details/120719624
Author: Kingbase 研究院
Title: KingbaseES 全文检索功能介绍 (Introduction to KingbaseES Full-Text Search)
