文档章节

PostgreSQL的全文检索(一)

kenyon_君羊
 kenyon_君羊
发布于 2012/09/28 15:40
字数 1864
阅读 9.4K
收藏 40

3 月,跳不动了?>>>

在全文检索没有出来之前,普通的文件检索都是采用的like,~,或者ilike来匹配文档字段中内容,这种检索方法对小数据量的文本检索是OK的,但数据量大了就不行了。

 普通检索的劣势:
1.语言不能完全支持,哪怕是英文,比如检索friend时不能检索出friends或者friendly
2.检索出的结果排序功能不好
3.缺少索引支持,查询速度慢,特别是两头加了两个%时根本就不走索引

 PostgreSQL在8.3.x版本后开始支持全文检索。执行步骤,主要分三步走:
1.将文档分词(parsing documents into tokens)
2.转换分词规则(converting tokens into lexemes),如去掉复数后缀s/es,以及加入stop词,使之不会在分词中出现,如常用的'的'
3.按一定顺序查询的优化方式存储(storing preprocessed documents optimized for searching) tsvector存储,使用tsquery查询
注:这里tokes是原始的拆分分词,可能包含常用的无意义的词,lexemes是加工过的有价值的分词

一、全文检索的环境和例子:
postgres=# show default_text_search_config ;
default_text_search_config
----------------------------
pg_catalog.english
(1 row)

--全文检索配置
postgres=# \dF
               List of text search configurations
   Schema   |    Name    |              Description             
------------+------------+---------------------------------------
pg_catalog | danish     | configuration for danish language
pg_catalog | dutch      | configuration for dutch language
pg_catalog | english    | configuration for english language
pg_catalog | finnish    | configuration for finnish language
pg_catalog | french     | configuration for french language
pg_catalog | german     | configuration for german language
pg_catalog | hungarian  | configuration for hungarian language
pg_catalog | italian    | configuration for italian language
pg_catalog | norwegian  | configuration for norwegian language
pg_catalog | portuguese | configuration for portuguese language
pg_catalog | romanian   | configuration for romanian language
pg_catalog | russian    | configuration for russian language
pg_catalog | simple     | simple configuration
pg_catalog | spanish    | configuration for spanish language
pg_catalog | swedish    | configuration for swedish language
pg_catalog | turkish    | configuration for turkish language
(16 rows)

--全文检索查看russian具体配置
postgres=# \dF+ russian
Text search configuration "pg_catalog.russian"
Parser: "pg_catalog.default"
      Token      | Dictionaries
-----------------+--------------
asciihword      | english_stem
asciiword       | english_stem
email           | simple
file            | simple
float           | simple
host            | simple
hword           | russian_stem
hword_asciipart | english_stem
hword_numpart   | simple
hword_part      | russian_stem
int             | simple
numhword        | simple
numword         | simple
sfloat          | simple
uint            | simple
url             | simple
url_path        | simple
version         | simple
word            | russian_stem

--查看全文检索模板
postgres=# \dFt+
                                             List of text search templates
   Schema   |   Name    |      Init      |      Lexize      |                        Description                       
------------+-----------+----------------+------------------+-----------------------------------------------------------
pg_catalog | ispell    | dispell_init   | dispell_lexize   | ispell dictionary
pg_catalog | simple    | dsimple_init   | dsimple_lexize   | simple dictionary: just lower case and check for stopword
pg_catalog | snowball  | dsnowball_init | dsnowball_lexize | snowball stemmer
pg_catalog | synonym   | dsynonym_init  | dsynonym_lexize  | synonym dictionary: replace word by its synonym
pg_catalog | thesaurus | thesaurus_init | thesaurus_lexize | thesaurus dictionary: phrase by phrase substitution
(5 rows)

--全文检索字典
postgres=# \dFd+
                                                                  List of text search dictionaries
   Schema   |      Name       |      Template       |                   Init options                    |                        Description                       
------------+-----------------+---------------------+---------------------------------------------------+-----------------------------------------------------------
pg_catalog | danish_stem     | pg_catalog.snowball | language = 'danish', stopwords = 'danish'         | snowball stemmer for danish language
pg_catalog | dutch_stem      | pg_catalog.snowball | language = 'dutch', stopwords = 'dutch'           | snowball stemmer for dutch language
pg_catalog | english_stem    | pg_catalog.snowball | language = 'english', stopwords = 'english'       | snowball stemmer for english language
pg_catalog | finnish_stem    | pg_catalog.snowball | language = 'finnish', stopwords = 'finnish'       | snowball stemmer for finnish language
pg_catalog | french_stem     | pg_catalog.snowball | language = 'french', stopwords = 'french'         | snowball stemmer for french language
pg_catalog | german_stem     | pg_catalog.snowball | language = 'german', stopwords = 'german'         | snowball stemmer for german language
pg_catalog | hungarian_stem  | pg_catalog.snowball | language = 'hungarian', stopwords = 'hungarian'   | snowball stemmer for hungarian language
pg_catalog | italian_stem    | pg_catalog.snowball | language = 'italian', stopwords = 'italian'       | snowball stemmer for italian language
pg_catalog | norwegian_stem  | pg_catalog.snowball | language = 'norwegian', stopwords = 'norwegian'   | snowball stemmer for norwegian language
pg_catalog | portuguese_stem | pg_catalog.snowball | language = 'portuguese', stopwords = 'portuguese' | snowball stemmer for portuguese language
pg_catalog | romanian_stem   | pg_catalog.snowball | language = 'romanian'                             | snowball stemmer for romanian language
pg_catalog | russian_stem    | pg_catalog.snowball | language = 'russian', stopwords = 'russian'       | snowball stemmer for russian language
pg_catalog | simple          | pg_catalog.simple   |                                                   | simple dictionary: just lower case and check for stopword
pg_catalog | spanish_stem    | pg_catalog.snowball | language = 'spanish', stopwords = 'spanish'       | snowball stemmer for spanish language
pg_catalog | swedish_stem    | pg_catalog.snowball | language = 'swedish', stopwords = 'swedish'       | snowball stemmer for swedish language
pg_catalog | turkish_stem    | pg_catalog.snowball | language = 'turkish', stopwords = 'turkish'       | snowball stemmer for turkish language

--查看全文检索分析器,带加号可看详细配置,命令\dFp+
postgres=# \dFp
           List of text search parsers
   Schema   |     Name      |     Description     
------------+---------------+---------------------
 pg_catalog | chineseparser | 
 pg_catalog | default       | default word parser
(2 rows)
参数和配置文件的具体位置一般在$PGHOME/SHARE里面,stop词是存放在$PGHOME/share/tsearch_data下面的

二、实际例子,以英文例子为例
postgres=# SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector @@ 'cat & rat'::tsquery as search;
search 
--------
t
(1 row)

postgres=# SELECT 'fat & cow'::tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector as search;
search 
--------
f
(1 row)

postgres=# SELECT to_tsvector('fat cats ate fat rats') @@ to_tsquery('fat & rat') as search;
search 
--------
t
(1 row)

postgres=# SELECT 'fat cats ate fat rats'::tsvector @@ to_tsquery('fat & rat') as search;
search 
--------
f
(1 row)

--默认的english分词,to_tevector区别于::tsvector是前者会加工分词,后者默认是加工好了
postgres=# SELECT to_tsvector('english','fat cats ate fat rats') @@ to_tsquery('english','fat & rat') as search;            
search
--------
t
(1 row)

--plainto_tsquery不却分分隔符,权重标签
postgres=# SELECT plainto_tsquery('english', 'The Fat & Rats:C');  
   plainto_tsquery
---------------------
'fat' & 'rat' & 'c'
(1 行记录)

--分词之间不会区分分隔符,每个分词之间插入&;,::tsquery和to_tsquery则必须要用到
postgres=# SELECT plainto_tsquery('english', 'The Fat Rats');      
plainto_tsquery
-----------------
'fat' & 'rat'
(1 行记录)


postgres=# SELECT 'The & Fat & Rats'::tsquery;
        tsquery
------------------------
'The' & 'Fat' & 'Rats'
(1 行记录)


postgres=# SELECT to_tsquery('english', 'The & Fat & Rats');
  to_tsquery
---------------
'fat' & 'rat'
(1 行记录)
三、对全文检索建立索引
有两种办法,一种是对当前文档字段加内置的转换函数,然后建索引,另一种办法是新增一个字段,然后更新原文档内容(需建立触发器和函数转换)上建立索引。推荐后一个。
方法1.原字段上建索引
CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector('english', body));
CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector(config_name, body)); --组合索引,config_name是表pgweb的一个字段
CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector('english', title || ' ' || body));

方法2.新增一列转换后建索引
ALTER TABLE pgweb ADD COLUMN textsearchable_index_col tsvector; --新建字段列类型是tsvector
UPDATE pgweb SET textsearchable_index_col = to_tsvector('english', coalesce(title,'') || ' ' || coalesce(body,'')); CREATE INDEX textsearch_idx ON pgweb USING gin(textsearchable_index_col);
SELECT title FROM pgweb WHERE textsearchable_index_col @@ to_tsquery('create & table') ORDER BY last_mod_date DESC LIMIT 10;

说明:
a.新增字段建的索引还需要创建一个触发器来实时更新新建字段内容
b.表达式索引的优点是简单,占用的空间少,缺点是每次执行需要调用to_tsvector函数来确保索引值关联
c.新建字段索引的有点是查询的速度快(无需每次去调用to_tsvevtor),尤其是使用Gist索引的时候。缺点是新建一个单独的列,消耗更多的存储空间。

四、内置实用函数示例
诸如to_tsvector,to_tsquery,tsvector_update_trigger,tsvector_update_trigger_column,ts_stat等等

--tsvector_update_trigger示例
CREATE TABLE messages (
title text,
body text,
tsv tsvector
);

CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
ON messages FOR EACH ROW EXECUTE PROCEDURE
tsvector_update_trigger(tsv, 'pg_catalog.english', title, body);

INSERT INTO messages VALUES('title here','the body text is here');

postgres=# select * from messages;
   title    |         body          |            tsv             
------------+-----------------------+----------------------------
title here | the body text is here | 'bodi':4 'text':5 'titl':1
(1 row)

postgres=# SELECT title, body FROM messages WHERE tsv @@ to_tsquery('title & body');
   title    |         body         
------------+-----------------------
title here | the body text is here
(1 row)
--ts_stat的使用
--寻找文档中出现词汇的排序
-- nentry是总的出现次数
-- ndoc是文档中(tsvector)出现的次数,重复的记为1次
postgres=# select * from messages;
        title         |                             body                             |                                                tsv                                                
----------------------+--------------------------------------------------------------+----------------------------------------------------------------------------------------------------
title here           | the body text is here                                        | 'bodi':4 'text':5 'titl':1
kenyon               | a chinese boy                                                | 'boy':4 'chines':3 'kenyon':1
Andy Roddick retired | Andy Roddick retired,a former rank number 1 player in tennis | '1':11 'andi':1,4 'former':8 'number':10 'player':12 'rank':9 'retir':3,6 'roddick':2,5 'tenni':14
kenyon retired       | kenyon retired,a open-source lover,inserting in this area    | 'area':13 'insert':10 'kenyon':1,3 'lover':9 'open':7 'open-sourc':6 'retir':2,4 'sourc':8
Michael Jordan       | MJ is an American former professional basketball player      | 'american':6 'basketbal':9 'former':7 'jordan':2 'michael':1 'mj':3 'player':10 'profession':8
(5 rows)

postgres=# SELECT * FROM ts_stat('SELECT tsv FROM messages') ORDER BY nentry DESC, ndoc DESC, word LIMIT 10;
   word    | ndoc | nentry
-----------+------+--------
retir     |    2 |      4
kenyon    |    2 |      3
former    |    2 |      2
player    |    2 |      2
andi      |    1 |      2
roddick   |    1 |      2
1         |    1 |      1
american  |    1 |      1
area      |    1 |      1
basketbal |    1 |      1
(10 rows)
五、全文检索的限制
1.The length of each lexeme must be less than 2K bytes
2.The length of a tsvector (lexemes + positions) must be less than 1 megabyte
3.The number of lexemes must be less than 264
4.Position values in tsvector must be greater than 0 and no more than 16,383 No more than 256 positions per lexeme 5.The number of nodes (lexemes + operators) in a tsquery must be less than 32,768

六、总结:
以上是PostgreSQL内置的全文检索的环境和实际使用例子,目前对中文的全文检索并不支持,但已经有比较好的第三方工具结合使用,下一篇继续PostgreSQL中文全文检索环境搭建和实际使用。

© 著作权归作者所有

kenyon_君羊
粉丝 503
博文 173
码字总数 125197
作品 0
杭州
其他
私信 提问
加载中

评论(9)

MtrS
MtrS
SELECT to_tsvector('{"neirong":["43540", "qq2003"]}'::jsonb::text) @@ to_tsquery('qq2003') as search;```
```

流明
hasonger
hasonger
很享受你的分享~
kenyon_君羊
kenyon_君羊 博主

引用来自“羊半仙”的评论

看起来比sqlite的FTS复杂多了呀

英文,俄文等是内置的,不需要额外的配置,直接使用即可,谈不上多复杂,只是功能比较丰富,看着复杂。中文检索目前不支持内置,需要第三方实现。
狗头666
狗头666
看起来比sqlite的FTS复杂多了呀
kenyon_君羊
kenyon_君羊 博主

引用来自“宏哥”的评论

中文需要分词后存储为 vector类型即可.

有时间再整一份PG的中文全文检索,欢迎品评
kenyon_君羊
kenyon_君羊 博主

引用来自“宏哥”的评论

http://www.oschina.net/question/96003_19020 , 我做过的一个测试

受教了
kenyon_君羊
kenyon_君羊 博主

引用来自“宏哥”的评论

中文需要分词后存储为 vector类型即可.

是的,我之前用的第三方是bamboo+CRF
宏哥
宏哥
http://www.oschina.net/question/96003_19020 , 我做过的一个测试
宏哥
宏哥
中文需要分词后存储为 vector类型即可.
PostgreSQL 相似文本检索与去重 - (银屑病怎么治?银屑病怎么治疗?银屑病怎么治疗好?银屑病怎么能治疗好?)

标签 PostgreSQL , 相似字符串 , 全文检索 , 去重 , 相似问题 , 医疗 , plr , plpython , madlib , 文本处理 背景 在云栖社区的问答区,有一位网友提到有一个问题: 解这个问题的思路 1. 首先...

德哥
2018/04/18
0
0
PostgreSQL 类微博FEED系统 - 设计与性能指标

标签 PostgreSQL , feed , 微博 , 推送 , 分区 , 分片 , UDF , 挖掘 , 文本挖掘 背景 类微博系统,最频繁用到的功能: 之前写过一篇《三体高可用PCC大赛 - facebook微博 like场景 - 数据库设...

德哥
2018/04/18
0
0
PostgreSQL全文检索简介

PostgreSQL自带有一个简易的全文检索引擎,可以实现小规模数据量的全文检索功能。本文我们将引导介绍一下这个功能,对于小数据量的搜索这个功能是足够使用的,而无需搭建额外的ES等重量级的全...

dragon_tech
2019/04/24
36
0
PostgreSQL 相似搜索插件介绍大汇总 (cube,rum,pg_trgm,smlar,imgsmlr,pg_similarity) (rum,gin,gist)

标签 PostgreSQL , cube , rum , pgtrgm , smlar , imgsmlr , pgsimilarity , gin , gist , 倒排 , 相似 , 向量 , 特征 , 图像 , 文本 , 字符串 , 全文检索 背景 在搜索业务场景中,相似搜索...

德哥
2019/04/14
0
0
数据库案例集锦 - 开发者的《如来神掌》

标签 PostgreSQL , PG DBA cookbook , PG Oracle兼容性 , PG 架构师 cookbook , PG 开发者 cookbook , PG 应用案例 背景 「剑魔独孤求败,纵横江湖三十馀载,杀尽仇寇,败尽英雄,天下更无抗...

德哥
2017/06/09
0
0

没有更多内容

加载失败,请刷新页面

加载更多

MyBatis-Spring:整合Mybatis与Spring方式二:SqlSessionDaoSupport

本文上接《MyBatis-Spring:无缝隙整合Mybatis与Spring方式一:SqlSessionTemplate》, SqlSessionDaoSupport是一个抽象的支持类,用来为你提供SqlSession。调用getSqlSession()方法你会得到一...

明德君
13分钟前
11
0
count(1)、count(*)与count(列名)的执行区别

执行效果: 1. count(1) and count(*) 当表的数据量大些时,对表作分析之后,使用count(1)还要比使用count(*)用时多了! 从执行计划来看,count(1)和count(*)的效果是一样的。 但是在表做过分...

七宝1
22分钟前
11
0
SpringBoot内嵌tomcat(2)- 【Tomcat的启动入口】源码简析

在springboot支持Web模式下启动加载 ApplicationContext 是: org.springframework.boot.web.servlet.context.AnnotationConfigServletWebServerApplicationContext extends org.springfram......

noob_fly
26分钟前
8
0
vscode web版 搭建

即在远程编程的基础上,把通过node.js的远程通信转为通过浏览器的通信 通信应该比原有一定局限性 一键搭建代码: systemctl stop firewalld.servicewget https://github.com/cdr/code-ser...

oschinaer
26分钟前
29
0
ElasticSearch(四)Java操作ElasticSearch

今天本来想使用Spring Boot来连接ES的,但是想了想,还是决定用ES官方的Java High Level REST Client,这样可以更好地帮助我理解Spring中ES的原理。 首先,我们在POM文件中引入 <prope...

RaindropBK
27分钟前
9
0

没有更多内容

加载失败,请刷新页面

加载更多

返回顶部
顶部