Introduction to Full-Text Search in KingbaseES

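The default parser built into KingbaseES splits text on whitespace. Chinese words are not separated by spaces, so this approach does not work for Chinese. Full-text search over Chinese text requires an additional word-segmentation plugin: zhparser or sys_jieba. zhparser supports the GBK and UTF8 character sets; sys_jieba supports UTF8 only.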

1. Default whitespace tokenization

test=# SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value');
                                                     to_tsvector
---------------------------------------------------------------------------------------------------------------------
 'a':5,14 'become':4,13 'but':9 'man':6,15 'not':2 'of':7,16 'rather':10 'success':8 'to':3,12 'try':1,11 'value':17
(1 row)

test=# SHOW default_text_search_config;
 default_text_search_config
----------------------------
 pg_catalog.simple
(1 row)
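
The lexeme positions in a tsvector like the one above can be reproduced with a small model. The Python sketch below is a simplification, not the KingbaseES parser: it assumes that the simple configuration only splits on whitespace, strips surrounding punctuation, and lowercases each word. The names simple_tsvector and format_tsvector are made up for illustration.

```python
from collections import defaultdict


def simple_tsvector(text):
    """Toy model of to_tsvector under the simple configuration (assumption:
    whitespace split, punctuation stripped, lowercased, no stemming)."""
    positions = defaultdict(list)
    words = [w.strip(".,;:!?\"'()") for w in text.split()]
    # Positions are 1-based and count every non-empty word.
    for pos, word in enumerate((w for w in words if w), start=1):
        positions[word.lower()].append(pos)
    return dict(sorted(positions.items()))


def format_tsvector(vector):
    """Render the mapping in the 'lexeme':pos,pos style that psql prints."""
    return " ".join(
        "'{}':{}".format(lexeme, ",".join(str(p) for p in pos))
        for lexeme, pos in vector.items()
    )


sentence = "Try not to become a man of success, but rather try to become a man of value"
print(format_tsvector(simple_tsvector(sentence)))
# 'a':5,14 'become':4,13 'but':9 'man':6,15 'not':2 'of':7,16 'rather':10 'success':8 'to':3,12 'try':1,11 'value':17
```

Each lexeme keeps every position at which it occurred, which is why words that appear twice, such as try, carry two positions.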

The normalization process performs the following operations:

test=# SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value') @@ 'become';
 ?column?
----------
 t
(1 row)

test=# select 'become'::tsquery,to_tsquery('become'),to_tsquery('english','become');
 tsquery  | to_tsquery | to_tsquery
----------+------------+------------
 'become' | 'become'   | 'becom'
(1 row)

test=# SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value') @@ to_tsquery('!become');
 ?column?
----------
 f
(1 row)

test=# SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value') @@ to_tsquery('Try & !becom');
 ?column?
----------
 t
(1 row)

test=# SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value') @@ to_tsquery('bec:*');
 ?column?
----------
 t
(1 row)
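
How the !, & and :* operators in these queries combine can be sketched in a few lines of Python. This is a toy evaluator under stated assumptions, not the tsquery implementation: it handles only terms joined by &, with an optional leading ! and an optional trailing :*, and it models the simple configuration by lowercasing without stemming. The function name matches is hypothetical.

```python
def matches(lexemes, query):
    """Evaluate a tiny subset of tsquery syntax against a set of lexemes."""
    for raw in query.split("&"):
        term = raw.strip().lower()
        negated = term.startswith("!")
        if negated:
            term = term[1:].strip()
        if term.endswith(":*"):
            # Prefix match: any lexeme starting with the given prefix.
            found = any(lex.startswith(term[:-2]) for lex in lexemes)
        else:
            found = term in lexemes
        # A plain term must be present; a negated term must be absent.
        if found == negated:
            return False
    return True


sentence = "Try not to become a man of success, but rather try to become a man of value"
lexemes = {word.strip(",").lower() for word in sentence.split()}

print(matches(lexemes, "!become"))       # False
print(matches(lexemes, "Try & !becom"))  # True
print(matches(lexemes, "bec:*"))         # True
```

Because the model does no stemming, become and becom are different lexemes, so the negated term !becom is satisfied while !become is not.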

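test=# SELECT to_tsvector('english','Try not to become a man of success, but rather try to become a man of value') ;

test=# SELECT to_tsvector('french'::regconfig,'Try not to become a man of success, but rather try to become a man of value') ;
                                                   to_tsvector
-----------------------------------------------------------------------------------------------------------------
 'a':5,14 'becom':4,13 'but':9 'man':6,15 'not':2 'of':7,16 'rath':10 'success':8 'to':3,12 'try':1,11 'valu':17
(1 row)

test=# SELECT to_tsvector('人大金仓致力于提供高可靠的数据库产品');
               to_tsvector
------------------------------------------
 '人大金仓致力于提供高可靠的数据库产品':1
(1 row)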

Because the built-in parser splits on whitespace and Chinese text contains no spaces, the whole sentence is treated as a single token.
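
The effect is easy to see outside the database. A minimal Python illustration, assuming nothing more than whitespace splitting (only an approximation of the built-in parser):

```python
english = "Try not to become a man of success"
chinese = "人大金仓致力于提供高可靠的数据库产品"

english_tokens = english.split()
chinese_tokens = chinese.split()

print(len(english_tokens))        # 8
print(len(chinese_tokens))        # 1
# The word is inside the sentence, but it is not a token of its own,
# so a token-level lookup cannot find it.
print("产品" in chinese_tokens)    # False
```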

create extension zhparser;
create text search configuration zhongwen_parser (parser = zhparser);
alter text search configuration zhongwen_parser add mapping for n,v,a,i,e,l,j with simple;

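The letters after for are token types. The mapping above covers only seven of them: noun (n), verb (v), adjective (a), idiom (i), exclamation (e), abbreviation (j), and idiomatic phrase (l). Tokens of any other type are discarded. The dictionary used is the built-in simple dictionary. The full list of token types can be queried as follows: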

test=# select ts_token_type('zhparser');

The text search configuration catalog lists the built-in configurations together with the newly created zhongwen_parser:

  oid  |     cfgname     | cfgnamespace | cfgowner | cfgparser
-------+-----------------+--------------+----------+-----------
  3748 | simple          |           11 |       10 |      3722
 13265 | arabic          |           11 |       10 |      3722
 13267 | danish          |           11 |       10 |      3722
 13269 | dutch           |           11 |       10 |      3722
 13271 | english         |           11 |       10 |      3722
 13273 | finnish         |           11 |       10 |      3722
 13275 | french          |           11 |       10 |      3722
 13277 | german          |           11 |       10 |      3722
 13279 | hungarian       |           11 |       10 |      3722
 13281 | indonesian      |           11 |       10 |      3722
 13283 | irish           |           11 |       10 |      3722
 13285 | italian         |           11 |       10 |      3722
 13287 | lithuanian      |           11 |       10 |      3722
 13289 | nepali          |           11 |       10 |      3722
 13291 | norwegian       |           11 |       10 |      3722
 13293 | portuguese      |           11 |       10 |      3722
 13295 | romanian        |           11 |       10 |      3722
 13297 | russian         |           11 |       10 |      3722
 13299 | spanish         |           11 |       10 |      3722
 13301 | swedish         |           11 |       10 |      3722
 13303 | tamil           |           11 |       10 |      3722
 13305 | turkish         |           11 |       10 |      3722
 16390 | parser_name     |         2200 |       10 |     16389
 24587 | zhongwen_parser |         2200 |       10 |     16389
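
The token-type mapping set up by the alter text search configuration statement can be modeled as a filter over (token type, word) pairs. The sketch below is hypothetical throughout: the tagged input is invented for illustration and is not actual zhparser output, the letters u and d are assumed to denote token types outside the mapped set, and apply_mapping is not a KingbaseES function.

```python
# Token types mapped to the simple dictionary: n, v, a, i, e, l, j.
MAPPED_TYPES = frozenset("nvaielj")


def apply_mapping(tagged_tokens):
    """Keep only words whose token type has a dictionary mapping."""
    return [word for token_type, word in tagged_tokens if token_type in MAPPED_TYPES]


# Invented segmentation result, for illustration only.
tagged = [
    ("n", "数据库"),
    ("u", "的"),
    ("a", "可靠"),
    ("d", "非常"),
    ("v", "提供"),
    ("n", "产品"),
]

print(apply_mapping(tagged))  # ['数据库', '可靠', '提供', '产品']
```

Words tagged with an unmapped type never reach a dictionary, so they cannot appear in the resulting tsvector.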
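test=# select to_tsvector('zhongwen_parser','人大金仓致力于提供高可靠的数据库产品');

test=# \df+ contains
                                                                                  List of functions
 Schema |   Name   | Result data type | Argument data types | Type | Volatility | Parallel | Owner  | Security | Access privileges | Language |               Source code                | Description
--------+----------+------------------+---------------------+------+------------+----------+--------+----------+-------------------+----------+------------------------------------------+-------------
 sys    | contains | boolean          | text, text          | func | immutable  | safe     | system | invoker  |                   | sql      | select to_tsvector($1) @@ to_tsquery($2) |
 sys    | contains | boolean          | text, text, integer | func | immutable  | safe     | system | invoker  |                   | sql      | select to_tsvector($1) @@ to_tsquery($2) |
 sys    | contains | boolean          | text, tsquery       | func | immutable  | safe     | system | invoker  |                   | sql      | select $1::tsvector @@ $2                |
 sys    | contains | boolean          | tsvector, text      | func | immutable  | safe     | system | invoker  |                   | sql      | select $1 @@ $2::tsquery                 |
 sys    | contains | boolean          | tsvector, tsquery   | func | immutable  | safe     | system | invoker  |                   | sql      | select $1 @@ $2                          |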

By default, the contains function uses the whitespace parser, so it cannot be used to match Chinese text.

test=# select contains('人大金仓致力于提供高可靠的数据库产品','产品');

Original: https://blog.csdn.net/lyu1026/article/details/120719624
Author: Kingbase Research Institute (Kingbase 研究院)
Title: Introduction to Full-Text Search in KingbaseES (KingbaseES 全文检索功能介绍)
