文档章节

lucene fst

pczhangtl
 pczhangtl
发布于 2013/12/10 16:34
字数 529
阅读 114
收藏 0
点赞 0
评论 0

Using Finite State Transducers in Lucene

FSTs are finite-state machines that map a term (byte sequence) to an arbitrary output. They also look cool:



That FST maps the sorted words  mopmothpopstarstop and  top to their ordinal number (0, 1, 2, ...). As you traverse the arcs, you sum up the outputs, so  stop hits 3 on the  s and 1 on the  o, so its output ordinal is 4. The outputs can be arbitrary numbers or byte sequences, or combinations, etc. -- it's pluggable.

Essentially, an FST is a SortedMap<ByteSequence,SomeOutput>, if the arcs are in sorted order. With the right representation, it requires far less RAM than other SortedMap implementations, but has a higher CPU cost during lookup. The low memory footprint is vital for Lucene since an index can easily have many millions (sometimes, billions!) of unique terms.

There's a  great deal of theory behind FSTs. They generally support the same operations as FSMs (determinize, minimize, union, intersect, etc.). You can also compose them, where the outputs of one FST are intersected with the inputs of the next, resulting in a new FST.

There are some nice general-purpose FST toolkits ( OpenFst looks great) that support all these operations, but for Lucene I decided to implement  this neat algorithm which incrementally builds up the minimal unweighted FST from pre-sorted inputs. This is a perfect fit for Lucene since we already store all our terms in sorted (unicode) order.

The resulting implementation (currently a patch on  LUCENE-2792) is fast and memory efficient: it builds the 9.8 million terms in a 10 million Wikipedia index in ~8 seconds (on a fast computer), requiring less than 256 MB heap. The resulting FST is 69 MB. It can also build a  prefix trie, pruning by how many terms come through each node, with even less memory.

Note that because  addition is commutative, an FST with numeric outputs is not guaranteed to be minimal in my implementation; perhaps if I could generalize the algorithm to a weighted FST instead, which also stores a weight on each arc, that would yield the minimal FST. But I don't expect this will be a problem in practice for Lucene.

In the patch I modified the  SimpleText codec, which was loading all terms into a TreeMap mapping the BytesRef term to an int docFreq and long filePointer, to use an FST instead, and all tests pass!

There are lots of other potential places in Lucene where we could use FSTs, since we often need map the index terms to "something". For example, the terms index maps to a long file position; the field cache maps to ordinals; the terms dictionary maps to codec-specific metadata, etc. We also have multi-term queries (eg Prefix, Wildcard, Fuzzy, Regexp) that need to test a large number of terms, that could work directly via intersection with the FST instead (many apps could easily fit their entire terms dict in RAM as an FST since the format is so compact). The FST could be used for a key/value store. Lots of fun things to try!

Many thanks to  Dawid Weiss for helping me iterate on this.

本文转载自:http://blog.mikemccandless.com/2010/12/

共有 人打赏支持
pczhangtl
粉丝 46
博文 707
码字总数 113318
作品 0
浦东
高级程序员
Apache Lucene 4.10.3 发布,文本搜索引擎库

Apache Lucene 4.10.3 发布,它是一个高性能的全 java 编写的文本搜索引擎库。几乎适用于所有需要全文搜索的应用程序。此版本中主要修复了 12 个 Bug。 Bug 修复: LUCENE-6019, LUCENE-6117...

oschina
2015/04/09
1K
6
Apache Lucene 4.2 发布,又是全新版本

Apache Lucene 4.2 来了!!! 值得关注的改进内容: Lucene 4.2 使用新的默认编码器 (Lucene42Codec) ,使用更高效的 docvalues 格式,FST 排序,更少的定位开销,改进数值压缩;更小的术语...

oschina
2013/03/12
9.4K
20
Apache Lucene 4.4 发布,搜索引擎框架

Apache Lucene 4.4 发布了,包含很多 bug 修复、优化和提升,下载地址: http://lucene.apache.org/core/mirrors-core-latest-redir.html 值得关注的改进有: 全新的 Replicator 模块,实现复...

oschina
2013/07/24
4.7K
3
Apache Solr 3.6 发布,全文搜索服务器

Apache Solr 3.6 发布了,该版本包含大量的 bug 修复、优化和改进,下载地址: http://lucene.apache.org/solr/mirrors-solr-latest-redir.html 主要改进内容: 新的 SolrJ 客户端连接器,基...

红薯
2012/04/13
1K
1
Lucene的索引系统和搜索过程分析

前言:目前自己在做使用Lucene.net和PanGu分词实现全文检索的工作,不过自己是把别人做好的项目进行迁移。因为项目整体要迁移到ASP.NET Core 2.0版本,而Lucene使用的版本是3.6.0 ,PanGu分词...

Shendu.cc
07/12
0
0
Lucene 3.6 发布,Java 全文搜索引擎

Lucene 3.6 包含大量的 bug 修复、优化和改进,主要内容有: 完全支持 Java 7,要求 JDK 7u1 TypeTokenFilter filters tokens based on their TypeAttribute. Fixed offset bugs in a number......

红薯
2012/04/13
3.8K
8
Lucene 5.1.0 发布,Java 搜索引擎

Lucene 5.1.0 发布,此版本现已提供在:http://www.apache.org/dyn/closer.cgi/lucene/java/5.1.0。 更新内容如下: 新特性 (9) LUCENE-6066: Added DiversifiedTopDocsCollector to misc f......

chaun
2015/06/03
2.9K
12
使用FST时出现了异常

Caused by: java.lang.NullPointerException: Class is null at de.ruedigermoeller.serialization.FSTClazzInfoRegistry.getCLInfo(FSTClazzInfoRegistry.java:128) ~[fst-1.58-onejar.jar:......

JackChu
2014/08/01
1K
3
一整套的学习视频

安卓开发机密教程 http://edu.csdn.net/heima/video/androidVideo.html?fst Java基础教程 http://edu.csdn.net/heima/video/javasebxd.html?fst Java高新技术 http://edu.csdn.net/heima/vi......

adddss
2012/07/28
643
0
Apache LuceneTM5.0.0 版本即将发布

Apache LuceneTM5.0.0 版本即将发布 源文地址:http://blog.mikemccandless.com/2014/11/apache-lucene-500-is-coming.html 最后,在密集发布一系列4.x版本的特性后,最近发布的是4.10.2版本...

sbp810050504
2015/01/09
0
0

没有更多内容

加载失败,请刷新页面

加载更多

下一页

Sparkstreaming and Kafka

简介 Kafka 0.10的Spark Streaming集成设计与0.8 Direct Stream方法类似。 它提供了简单的并行性,Kafka分区和Spark分区之间的1:1对应关系,以及对偏移量和元数据的访问。 但是,由于较新的...

舒运
8分钟前
0
0
java获取当前时间所在一周的周一和周日日期

/** * 当前时间所在一周的周一和周日时间 * @param time 当前时间 * @return */ public static Map getWeekDate(String time) { Map map = new HashedMap(); SimpleDateFormat sdf = new Si......

小弱鸡
35分钟前
0
0
Redis数据的导出和导入(dump和load方式)

网上有些文章已经不再适用,本人也是踩了些坑,在此记录下。 迁移redis数据一般有如下3种方式: 第三方工具redis-dump,redis-load aof机制,需要开启aof功能 rdb存储机制 这里介绍第一种方式...

iplusx
40分钟前
1
0
ElasticSearch 高亮显示大文档搜索结果

2016年12月,我们开始研究Ambar——一个文档搜索系统。Ambar使用ElasticSearch作为核心搜索引擎。 在Ambar开发的过程中,我们处理了很多与ES相关的问题,我们想分享我们得到的宝贵经验。让我...

九州暮云
今天
0
0
Python 使用 pywifi 模块 破解wifi密码

git https://github.com/awkman/pywifi 常见常量 from pywifi import const# Define interface status.IFACE_DISCONNECTED = 0IFACE_SCANNING = 1IFACE_INACTIVE = 2IFACE_CONNEC......

阿豪boy
今天
1
0
phpstorm使用Iedis

phpstorm的redis插件Iedis是真好用 看了网上挺多的文章,但是由于我系统还是ubuntu,就有点尴尬了,现在破解之后,留个笔记,即使自己之后有需要也可以很快翻阅 先下载资源 资源下载 zip压缩...

贤郎--均灵
今天
0
0
第三章 spring-bean之FactoryBeanRegistrySupport(4)

前言 从FactoryBeanRegistrySupport类的名字可以看出FactoryBeanRegistrySupport负责FactoryBean的注册与支持。如果想知道FactoryBean相关的资料,请阅读spring-bean中关于FactoryBean的解读...

鸟菜啊
今天
0
0
CentOS “Destination Host Unreachable”问题解决办法

挑战极速安装CentOS时遇到局域网主机不能通信的情况: [root@zjd network-scripts]# ping 8.8.8.8PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.64 bytes from 8.8.8.8: icmp_seq=1 ttl=......

wffger
今天
0
0
CentoOS6.6安装netcat

CentOS下安装netcat 使用zookeeper过程中,需要监控集群状态。在使用四字命令时(echo conf | nc localhost 2181),报出如下错误:-bash: netcat: command not found。 我的系统是CentOS 6....

ghou-靠墙哭
今天
0
0
es6之解构赋值巧用

ES6 允许按照一定模式,从数组、对象等中提取值,对变量进行赋值,这被称为解构赋值。 如何进行解构赋值我这里就不赘述,本篇文章主要是将解构赋值的巧妙使用之处。 1、交互变量的值 常用交互...

秋季长青
今天
2
0

没有更多内容

加载失败,请刷新页面

加载更多

下一页

返回顶部
顶部