
lucene fst

pczhangtl
Published 2013/12/10 16:34

Using Finite State Transducers in Lucene

FSTs are finite-state machines that map a term (byte sequence) to an arbitrary output. They also look cool:

[Figure: diagram of an example FST]

That FST maps the sorted words mop, moth, pop, star, stop and top to their ordinal number (0, 1, 2, ...). As you traverse the arcs, you sum up the outputs, so stop hits 3 on the s and 1 on the o, so its output ordinal is 4. The outputs can be arbitrary numbers or byte sequences, or combinations, etc. -- it's pluggable.
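To make the arc-summing concrete, here is a minimal sketch in plain Java. It is not Lucene's FST class: the names `ArcSumDemo`, `Node`, `Arc`, `build` and `lookup` are made up for illustration, and the structure is a plain trie with outputs on the arcs (no suffix sharing), assuming the inputs are unique and already sorted. Each new arc receives whatever part of the ordinal has not yet been emitted on the path above it, which reproduces the 3 on the s and the 1 on the o for stop.

```java
import java.util.TreeMap;

final class ArcSumDemo {
    /** An outgoing arc: the output it contributes plus the node it leads to. */
    static final class Arc {
        final long output;
        final Node target = new Node();
        Arc(long output) { this.output = output; }
    }

    static final class Node {
        // TreeMap keeps the arcs in sorted label order.
        final TreeMap<Character, Arc> arcs = new TreeMap<>();
        boolean isFinal;
    }

    /** Builds the trie; word i gets ordinal i. Assumes unique, sorted input. */
    static Node build(String[] sortedWords) {
        Node root = new Node();
        for (int ord = 0; ord < sortedWords.length; ord++) {
            Node node = root;
            long remaining = ord; // part of the ordinal not yet emitted on this path
            for (char c : sortedWords[ord].toCharArray()) {
                Arc arc = node.arcs.get(c);
                if (arc == null) {
                    // First word through here has the smallest ordinal below this arc,
                    // so the whole remainder can be emitted as early as possible.
                    arc = new Arc(remaining);
                    node.arcs.put(c, arc);
                }
                remaining -= arc.output;
                node = arc.target;
            }
            node.isFinal = true;
        }
        return root;
    }

    /** Sums the outputs along the path for word; returns -1 if word is absent. */
    static long lookup(Node root, String word) {
        Node node = root;
        long sum = 0;
        for (char c : word.toCharArray()) {
            Arc arc = node.arcs.get(c);
            if (arc == null) return -1;
            sum += arc.output;
            node = arc.target;
        }
        return node.isFinal ? sum : -1;
    }

    public static void main(String[] args) {
        Node root = build(new String[] {"mop", "moth", "pop", "star", "stop", "top"});
        System.out.println(lookup(root, "stop"));  // 3 (on s) + 0 + 1 (on o) + 0 = 4
        System.out.println(lookup(root, "stoop")); // -1, not in the set
    }
}
```

A real FST would also share common suffixes between words, which this sketch deliberately leaves out.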

Essentially, an FST is a SortedMap<ByteSequence,SomeOutput>, if the arcs are in sorted order. With the right representation, it requires far less RAM than other SortedMap implementations, but has a higher CPU cost during lookup. The low memory footprint is vital for Lucene since an index can easily have many millions (sometimes, billions!) of unique terms.

There's a great deal of theory behind FSTs. They generally support the same operations as FSMs (determinize, minimize, union, intersect, etc.). You can also compose them, where the outputs of one FST are intersected with the inputs of the next, resulting in a new FST.

There are some nice general-purpose FST toolkits (OpenFst looks great) that support all these operations, but for Lucene I decided to implement this neat algorithm which incrementally builds up the minimal unweighted FST from pre-sorted inputs. This is a perfect fit for Lucene since we already store all our terms in sorted (Unicode) order.
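The idea of building a minimal machine incrementally from pre-sorted inputs can be sketched as follows. This is a simplified illustration and not the patch's code: it builds an acceptor only (no outputs), using the well-known "register" technique for sorted input, where each finished state is looked up by its signature and replaced by an equivalent state if one already exists. The class name `SortedFsaBuilder` and all of its methods are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class SortedFsaBuilder {
    static final class State {
        final int id;
        final TreeMap<Character, State> arcs = new TreeMap<>();
        boolean isFinal;
        State(int id) { this.id = id; }
    }

    private final Map<String, State> register = new HashMap<>(); // signature -> canonical state
    private int nextId = 0;
    private final State root = new State(nextId++);
    private String previous = null;

    /** Adds one word; words must arrive in strictly increasing order. */
    void add(String word) {
        if (previous != null && word.compareTo(previous) <= 0) {
            throw new IllegalArgumentException("inputs must be sorted and unique");
        }
        previous = word;
        State state = root;
        int i = 0;
        // Walk the prefix shared with the previous word.
        while (i < word.length() && state.arcs.containsKey(word.charAt(i))) {
            state = state.arcs.get(word.charAt(i++));
        }
        // The previous word's tail can no longer change, so minimize it now.
        if (!state.arcs.isEmpty()) replaceOrRegister(state);
        for (; i < word.length(); i++) {
            State next = new State(nextId++);
            state.arcs.put(word.charAt(i), next);
            state = next;
        }
        state.isFinal = true;
    }

    /** Minimizes the last pending path and returns the start state. */
    State finish() {
        if (!root.arcs.isEmpty()) replaceOrRegister(root);
        return root;
    }

    private void replaceOrRegister(State state) {
        Map.Entry<Character, State> last = state.arcs.lastEntry();
        State child = last.getValue();
        if (!child.arcs.isEmpty()) replaceOrRegister(child);
        State existing = register.putIfAbsent(signature(child), child);
        if (existing != null) state.arcs.put(last.getKey(), existing); // share equivalent state
    }

    /** Two states are equivalent if finality and (label, canonical target) pairs match. */
    private static String signature(State s) {
        StringBuilder sb = new StringBuilder(s.isFinal ? "F" : "N");
        for (Map.Entry<Character, State> e : s.arcs.entrySet()) {
            sb.append(e.getKey().charValue()).append(e.getValue().id).append(';');
        }
        return sb.toString();
    }

    static boolean accepts(State root, String word) {
        State s = root;
        for (char c : word.toCharArray()) {
            s = s.arcs.get(c);
            if (s == null) return false;
        }
        return s.isFinal;
    }

    /** Counts distinct states reachable from root. */
    static int countStates(State root) {
        Set<State> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        Deque<State> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            State s = stack.pop();
            if (seen.add(s)) for (State t : s.arcs.values()) stack.push(t);
        }
        return seen.size();
    }
}
```

Feeding it mop, moth, pop, star, stop, top (in that order) and calling `finish()` gives a machine with 10 states, where a plain trie over the same words needs 18. Only the path of the most recent word stays unminimized at any time, which is why sorted input keeps the builder's working memory small.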

The resulting implementation (currently a patch on LUCENE-2792) is fast and memory efficient: it builds the 9.8 million terms in a 10 million Wikipedia index in ~8 seconds (on a fast computer), requiring less than 256 MB heap. The resulting FST is 69 MB. It can also build a prefix trie, pruning by how many terms come through each node, with even less memory.
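As a rough illustration of "pruning by how many terms come through each node", the sketch below counts the terms passing through every trie node and drops nodes seen by fewer than `minCount` terms. It is an assumption-laden toy, not the patch's builder: `PrunedPrefixTrie`, `minCount` and `longestKeptPrefix` are invented names, and it builds the full trie before pruning, so it does not show the memory savings of pruning during construction.

```java
import java.util.Arrays;
import java.util.TreeMap;

final class PrunedPrefixTrie {
    static final class Node {
        final TreeMap<Character, Node> children = new TreeMap<>();
        int termCount; // number of input terms whose path passes through this node
    }

    /** Builds a trie over terms, then drops every node reached by fewer than minCount terms. */
    static Node build(Iterable<String> terms, int minCount) {
        Node root = new Node();
        for (String term : terms) {
            Node node = root;
            node.termCount++;
            for (char c : term.toCharArray()) {
                node = node.children.computeIfAbsent(c, k -> new Node());
                node.termCount++;
            }
        }
        prune(root, minCount);
        return root;
    }

    private static void prune(Node node, int minCount) {
        // Removing a child discards its whole subtree.
        node.children.values().removeIf(child -> child.termCount < minCount);
        for (Node child : node.children.values()) prune(child, minCount);
    }

    /** Returns the longest prefix of term that survived pruning. */
    static String longestKeptPrefix(Node root, String term) {
        Node node = root;
        int i = 0;
        while (i < term.length()) {
            Node next = node.children.get(term.charAt(i));
            if (next == null) break;
            node = next;
            i++;
        }
        return term.substring(0, i);
    }

    public static void main(String[] args) {
        Node root = build(Arrays.asList("mop", "moth", "pop", "star", "stop", "top"), 2);
        System.out.println(longestKeptPrefix(root, "moth")); // mo
        System.out.println(longestKeptPrefix(root, "stop")); // st
    }
}
```

With `minCount = 2`, only the prefixes shared by at least two of the six words (m, mo, s, st) remain.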

Note that because addition is commutative, an FST with numeric outputs is not guaranteed to be minimal in my implementation; perhaps if I could generalize the algorithm to a weighted FST instead, which also stores a weight on each arc, that would yield the minimal FST. But I don't expect this will be a problem in practice for Lucene.

In the patch I modified the SimpleText codec, which was loading all terms into a TreeMap mapping the BytesRef term to an int docFreq and long filePointer, to use an FST instead, and all tests pass!

There are lots of other potential places in Lucene where we could use FSTs, since we often need to map the index terms to "something". For example, the terms index maps to a long file position; the field cache maps to ordinals; the terms dictionary maps to codec-specific metadata, etc. We also have multi-term queries (e.g. Prefix, Wildcard, Fuzzy, Regexp) that need to test a large number of terms; these could work directly via intersection with the FST instead (many apps could easily fit their entire terms dict in RAM as an FST since the format is so compact). The FST could also be used for a key/value store. Lots of fun things to try!
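For the simplest of those multi-term queries, a prefix query, "intersection" amounts to walking the prefix and then enumerating everything reachable below it. The sketch below shows that walk over a plain in-memory trie with sorted arcs; it is a stand-in for illustration only, and `PrefixWalkDemo`, `build` and `withPrefix` are hypothetical names rather than anything in Lucene.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class PrefixWalkDemo {
    static final class Node {
        final TreeMap<Character, Node> arcs = new TreeMap<>(); // sorted arcs give sorted results
        boolean isFinal;
    }

    static Node build(String... terms) {
        Node root = new Node();
        for (String term : terms) {
            Node node = root;
            for (char c : term.toCharArray()) {
                node = node.arcs.computeIfAbsent(c, k -> new Node());
            }
            node.isFinal = true;
        }
        return root;
    }

    /** Returns every stored term that starts with prefix, in sorted order. */
    static List<String> withPrefix(Node root, String prefix) {
        List<String> out = new ArrayList<>();
        Node node = root;
        for (char c : prefix.toCharArray()) {
            node = node.arcs.get(c);
            if (node == null) return out; // no term has this prefix
        }
        collect(node, new StringBuilder(prefix), out);
        return out;
    }

    private static void collect(Node node, StringBuilder path, List<String> out) {
        if (node.isFinal) out.add(path.toString());
        for (Map.Entry<Character, Node> e : node.arcs.entrySet()) {
            path.append(e.getKey().charValue());
            collect(e.getValue(), path, out);
            path.setLength(path.length() - 1); // backtrack
        }
    }

    public static void main(String[] args) {
        Node root = build("mop", "moth", "pop", "star", "stop", "top");
        System.out.println(withPrefix(root, "st")); // [star, stop]
    }
}
```

Only the part of the structure under the prefix is visited, so terms that cannot match are never tested individually. Wildcard, fuzzy and regexp queries would need a real automaton on the query side, which this sketch does not attempt.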

Many thanks to Dawid Weiss for helping me iterate on this.

Reposted from: http://blog.mikemccandless.com/2010/12/
