文档章节

lucene fst

pczhangtl
 pczhangtl
发布于 2013/12/10 16:34
字数 529
阅读 120
收藏 0

Using Finite State Transducers in Lucene

FSTs are finite-state machines that map a term (byte sequence) to an arbitrary output. They also look cool:



That FST maps the sorted words  mopmothpopstarstop and  top to their ordinal number (0, 1, 2, ...). As you traverse the arcs, you sum up the outputs, so  stop hits 3 on the  s and 1 on the  o, so its output ordinal is 4. The outputs can be arbitrary numbers or byte sequences, or combinations, etc. -- it's pluggable.

Essentially, an FST is a SortedMap<ByteSequence,SomeOutput>, if the arcs are in sorted order. With the right representation, it requires far less RAM than other SortedMap implementations, but has a higher CPU cost during lookup. The low memory footprint is vital for Lucene since an index can easily have many millions (sometimes, billions!) of unique terms.

There's a  great deal of theory behind FSTs. They generally support the same operations as FSMs (determinize, minimize, union, intersect, etc.). You can also compose them, where the outputs of one FST are intersected with the inputs of the next, resulting in a new FST.

There are some nice general-purpose FST toolkits ( OpenFst looks great) that support all these operations, but for Lucene I decided to implement  this neat algorithm which incrementally builds up the minimal unweighted FST from pre-sorted inputs. This is a perfect fit for Lucene since we already store all our terms in sorted (unicode) order.

The resulting implementation (currently a patch on  LUCENE-2792) is fast and memory efficient: it builds the 9.8 million terms in a 10 million Wikipedia index in ~8 seconds (on a fast computer), requiring less than 256 MB heap. The resulting FST is 69 MB. It can also build a  prefix trie, pruning by how many terms come through each node, with even less memory.

Note that because  addition is commutative, an FST with numeric outputs is not guaranteed to be minimal in my implementation; perhaps if I could generalize the algorithm to a weighted FST instead, which also stores a weight on each arc, that would yield the minimal FST. But I don't expect this will be a problem in practice for Lucene.

In the patch I modified the  SimpleText codec, which was loading all terms into a TreeMap mapping the BytesRef term to an int docFreq and long filePointer, to use an FST instead, and all tests pass!

There are lots of other potential places in Lucene where we could use FSTs, since we often need map the index terms to "something". For example, the terms index maps to a long file position; the field cache maps to ordinals; the terms dictionary maps to codec-specific metadata, etc. We also have multi-term queries (eg Prefix, Wildcard, Fuzzy, Regexp) that need to test a large number of terms, that could work directly via intersection with the FST instead (many apps could easily fit their entire terms dict in RAM as an FST since the format is so compact). The FST could be used for a key/value store. Lots of fun things to try!

Many thanks to  Dawid Weiss for helping me iterate on this.

本文转载自:http://blog.mikemccandless.com/2010/12/

共有 人打赏支持
上一篇: Jaxb
pczhangtl
粉丝 46
博文 707
码字总数 113318
作品 0
浦东
高级程序员
私信 提问
基于Lucene查询原理分析Elasticsearch的性能

前言 Elasticsearch是一个很火的分布式搜索系统,提供了非常强大而且易用的查询和分析能力,包括全文索引、模糊查询、多条件组合查询、地理位置查询等等,而且具有一定的分析聚合能力。因为其...

亦征
10/29
0
0
Apache Lucene 4.10.3 发布,文本搜索引擎库

Apache Lucene 4.10.3 发布,它是一个高性能的全 java 编写的文本搜索引擎库。几乎适用于所有需要全文搜索的应用程序。此版本中主要修复了 12 个 Bug。 Bug 修复: LUCENE-6019, LUCENE-6117...

oschina
2015/04/09
1K
6
Apache Lucene 4.2 发布,又是全新版本

Apache Lucene 4.2 来了!!! 值得关注的改进内容: Lucene 4.2 使用新的默认编码器 (Lucene42Codec) ,使用更高效的 docvalues 格式,FST 排序,更少的定位开销,改进数值压缩;更小的术语...

oschina
2013/03/12
9.5K
20
Apache Lucene 4.4 发布,搜索引擎框架

Apache Lucene 4.4 发布了,包含很多 bug 修复、优化和提升,下载地址: http://lucene.apache.org/core/mirrors-core-latest-redir.html 值得关注的改进有: * 全新的 Replicator 模块,实现...

oschina
2013/07/24
4.7K
3
Apache Solr 3.6 发布,全文搜索服务器

Apache Solr 3.6 发布了,该版本包含大量的 bug 修复、优化和改进,下载地址: http://lucene.apache.org/solr/mirrors-solr-latest-redir.html 主要改进内容: * 新的 SolrJ 客户端连接器,...

红薯
2012/04/13
1K
1

没有更多内容

加载失败,请刷新页面

加载更多

开源 java CMS - FreeCMS2.8会员我的评论

项目地址:http://www.freeteam.cn/ 我的评论 从左侧管理菜单点击我的评论进入。在这里可以查看当前登录会员的所有评论记录。 删除评论 选择评论然后点击删除按钮可以完成删除操作。 为了防止...

freeteam
14分钟前
1
0
Eureka Server启用 https服务指北

文章共 591字,阅读大约需要 2分钟 ! 概 述 在我的前文《Eureka Server 开启Spring Security Basic认证》中已经给 Eureka Server 开启了最基本的鉴权措施,本文则让 HTTPS加持于 Eureka Ser...

CodeSheep
51分钟前
13
0
OSChina 周二乱弹 —— 其实我在地板也睡不着

Osc乱弹歌单(2018)请戳(这里) 【今日歌曲】 @witt-z :分享歌词: 阴天 在不开灯的房间,当所有思绪都一点一点沉淀。 分享莫文蔚的单曲《阴天》: 《阴天》- 莫文蔚 手机党少年们想听歌,...

小小编辑
今天
460
7
微服务分布式事务实现

https://www.processon.com/view/link/5b2144d7e4b001a14d3d2d30

WALK_MAN
今天
3
0
《大漠烟尘》读书笔记及读后感文章3700字

《大漠烟尘》读书笔记及读后感文章3700字: 在这个浮躁的社会里,你有多久没有好好读完一本书了? 我们总觉得自己和别人不一样,所以当看到别人身上的问题时,很少有“反求诸己”,反思自己。...

原创小博客
今天
5
0

没有更多内容

加载失败,请刷新页面

加载更多

返回顶部
顶部