文档章节

机器学习开源工具及licence

liangtee
 liangtee
发布于 2013/03/02 21:28
字数 2227
阅读 527
收藏 18
以下工具绝大多数都是开源的,基于GPL、Apache等开源协议,使用时请仔细阅读各工具的license statement。

我通过浏览各开源工具网站,对其licence agreement进行了一下了解,在这里简单贴上其遵循的licence,希望有用。如果实际商用或者其他用途,还须仔细到个网站查询或联系developers。由于版本的升级,相应的licence也会有一定的改动,需要注意。如果有错误,希望大家指出,谢谢。

I. Information Retri       (like BSD)
1. Lemur/Indri
The Lemur Toolkit for Language Modeling and Information Retri
http://www.lemurproject.org/
Indri:
Lemur's latest search engine

2. Lucene/Nutch      (Apache 2.0协议)
Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java.
Lucene是apache的顶级开源项目,基于Apache 2.0协议,完全用java编写,具有perl, c/c++, dotNet等多个port
http://lucene.apache.org/
http://www.nutch.org/

3. WGet     (GPL)
GNU Wget is a free software package for retrieving files using HTTP, HTTPS and FTP, the most widely-used Internet protocols. It is a non-interactive commandline tool, so it may easily be called from scripts, cron jobs, terminals without X-Windows support, etc.
http://www.gnu.org/software/wget/wget.html

II. Natural Language Processing
1. EGYPT: A Statistical Machine Translation Toolkit
http://www.clsp.jhu.edu/ws99/projects/mt/
包括GIZA等四个工具

2. GIZA++ (Statistical Machine Translation)     (GPL)
http://www.fjoch.com/GIZA++.html
GIZA++ is an extension of the program GIZA (part of the SMT toolkit EGYPT) which was developed by the Statistical Machine Translation team during the summer workshop in 1999 at the Center for Language and Speech Processing at Johns-Hopkins University (CLSP/JHU). GIZA++ includes a lot of additional features. The extensions of GIZA++ were designed and written by Franz Josef Och.
Franz Josef Och先后在德国Aachen大学,ISI(南加州大学信息科学研究所)和Google工作。GIZA++现已有Windows移植版本,对IBM 的model 1-5有很好支持。

3. PHARAOH (Statistical Machine Translation)
http://www.isi.edu/licensed-sw/pharaoh/
a beam search decoder for phrase-based statistical machine translation models

4. OpenNLP:     (LGPL?)
http://opennlp.sourceforge.net/
包括Maxent等20多个工具

btw: 这些SMT的工具还都喜欢用埃及相关的名字命名,像什么GIZA、PHARAOH、Cairo等等。Och在ISI时开发了GIZA++,PHARAOH也是由来自ISI的Philipp Koehn 开发的,关系还真是复杂啊

5. MINIPAR by Dekang Lin (Univ. of Alberta, Canada)
MINIPAR is a broad-coverage parser for the English language. An uation with the SUSANNE corpus shows that MINIPAR achieves about 88% precision and 80% recall with respect to dependency relationships. MINIPAR is very efficient, on a Pentium II 300 with 128MB memory, it parses about 300 words per second.
binary填一个表后可以免费下载
http://www.cs.ualberta.ca/~lindek/minipar.htm

6. WordNet      (need connect if for commercial use)
http://wordnet.princeton.edu/
WordNet is an online lexical reference system whose design is inspired by current psycholinguistic theories of human lexical memory. English nouns, verbs, adjectives and adverbs are organized into synonym sets, each representing one underlying lexical concept. Different relations link the synonym sets.
WordNet was developed by the Cognitive Science Laboratory at Princeton University under the direction of Professor George A. Miller (Principal Investigator).
WordNet最新版本是2.1 (for Windows & Unix-like OS),提供bin, src和doc。
WordNet的在线版本是 http://wordnet.princeton.edu/perl/webwn

7.   HowNet  
http://www.keenage.com/
HowNet is an on-line common-sense knowledge base unveiling inter-conceptual relations and inter-attribute relations of concepts as connoting in lexicons of the Chinese and their English equivalents.
由CAS的Zhendong Dong & Qiang Dong开发,是一个类似于WordNet的东东

8. Statistical Language Modeling Toolkit
http://svr-www.eng.cam.ac.uk/~prc14/toolkit.html
The CMU-Cambridge Statistical Language Modeling toolkit is a suite of UNIX software tools to facilitate the construction and testing of statistical language models.

9. SRI Language Modeling Toolkit     (GPL)
www.speech.sri.com/projects/srilm/
SRILM is a toolkit for building and applying statistical language models (LMs), primarily for use in speech recognition, statistical tagging and segmentation. It has been under development in the SRI Speech Technology and Research Laboratory since 1995.

10. ReWrite Decoder
http://www.isi.edu/licensed-sw/rewrite-decoder/
The ISI ReWrite Decoder Release 1.0.0a by Daniel Marcu and Ulrich Germann. It is a program that translates from one natural languge into another using statistical machine translation.

11. GATE (General Architecture for Text Engineering)
http://gate.ac.uk/
A Java Library for Text Engineering

III. Machine Learning
1. YASMET: Yet Another Small MaxEnt Toolkit (Statistical Machine Learning)
http://www.fjoch.com/YASMET.html
由Franz Josef Och编写。此外,OpenNLP项目里有一个java的MaxEnt工具,使用GIS估计参数,由东北大学的张乐(目前在英国留学)port为C++版本

2 LibSVM     (BSD)
由国立台湾大学(ntu)的Chih-Jen Lin开发,有C++,Java,perl,C#等多个语言版本
http://www.csie.ntu.edu.tw/~cjlin/libsvm/
LIBSVM is an integrated software for support vector classification, (C-SVC, nu-SVC ), regression (epsilon-SVR, nu-SVR) and distribution estimation (one-class SVM ). It supports multi-class classification.

3 SVM   Light     (connect if for commercial use)
由cornell的Thorsten Joachims在dortmund大学时开发,成为LibSVM之后最为有名的SVM软件包。开源,用C语言编写,用于ranking问题
http://svmlight.joachims.org/

4 CLUTO    (GPL)
http://www-users.cs.umn.edu/~karypis/cluto/
a software package for clustering low- and high-dimensional datasets
这个软件包只提供executable/library两种形式,不提供源代码下载

5. CRF++     (LGPL/BSD)
http://chasen.org/~taku/software/CRF++/
Yet Another CRF toolkit for segmenting/labelling sequential data
CRF(Conditional Random Fields),由HMM/MEMM发展起来,广泛用于IE、IR、NLP领域

6 SVM   Struct    (non-commerical)
http://www.cs.cornell.edu/People/tj/svm_light/svm_struct.html
同SVM Light,均由cornell的Thorsten Joachims开发。
SVMstruct is a Support Vector Machine (SVM) algorithm for predicting multivariate outputs. It performs supervised learning by approximating a mapping
h: X --> Y
using labeled training examples (x1,y1), ..., (xn,yn).
Unlike regular SVMs, however, which consider only univariate predictions like in classification and regression, SVMstruct can predict complex objects y like trees, sequences, or sets. Examples of problems with complex outputs are natural language parsing, sequence alignment in protein homology detection, and markov models for part-of-speech tagging.
SVMstruct can be thought of as an API for implementing different kinds of complex prediction algorithms. Currently, we have implemented the following learning tasks:
SVMmulticlass: Multi-class classification. Learns to predict one of k mutually exclusive classes. This is probably the simplest possible instance of SVMstruct and serves as a tutorial example of how to use the programming interface.
SVMcfg: Learns a weighted context free grammar from examples. Training examples (e.g. for natural language parsing) specify the sentence along with the correct parse tree. The goal is to predict the parse tree of new sentences.
SVMalign: Learning to align sequences. Given examples of how sequence pairs align, the goal is to learn the substitution matrix as well as the insertion and deletion costs of operations so that one can predict alignments of new sequences.
SVMhmm: Learns a Markov model from examples. Training examples (e.g. for part-of-speech tagging) specify the sequence of words along with the correct assignment of tags (i.e. states). The goal is to predict the tag sequences for new sentences.

IV Misc :
1 Notepad ++:    (GPL)
一个开源编辑器,支持C#,perl,CSS等几十种语言的关键字,功能可与新版的UltraEdit,Visual Studio .NET媲美
http://notepad-plus.sourceforge.net

2. WinMerge :   (GPL)
用于文本内容比较,找出不同版本的两个程序的差异
winmerge.sourceforge.net/

3. OpenPerlIDE:     (GPL)
 开源的perl编辑器,内置编译、逐行调试功能
open-perl-ide.sourceforge.net/
ps: 论起编辑器偶见过的最好的还是VS .NET了,在每个function前面有+/-号支持expand/collapse,支持区域copy/cut/paste,使用ctrl+ c/ctrl+x/ctrl+v可以一次选取一行,使用ctrl+k+c/ctrl+k+u可以comment/uncomment多行,还有还有...... Visual Studio .NET is really kool:D

4. Berkeley DB    (dual license)
http://www.sleepycat.com/
Berkeley DB不是一个关系数据库,它被称做是一个嵌入式数据库:对于c/s模型来说,它的client和server共用一个地址空间。由于数据库最初是从文件系统中发展起来的,它更像是一个key-value pair的字典型数据库。而且数据库文件能够序列化到硬盘中,所以不受内存大小限制。BDB有个子版本Berkeley DB XML,它是一个xml数据库:以xml文件形式存储数据?BDB已被包括microsoft、google、HP、ford、motorola等公司嵌入到自己的产品中去了
Berkeley DB (libdb) is a programmatic toolkit that provides embedded database support for both traditional and client/server applications. It includes b+tree, queue, extended linear hashing, fixed, and variable-length record access methods, transactions, locking, logging, shared memory caching, database recovery, and replication for highly available systems. DB supports C, C++, Java, PHP, and Perl APIs.
It turns out that at a basic level Berkeley DB is just a very high performance, reliable way of persisting dictionary style data structures - anything where a piece of data can be stored and looked up using a unique key. The key and the value can each be up to 4 gigabytes in length and can consist of anything that can be crammed in to a string of bytes, so what you do with it is completely up to you. The only operations available are "store this value under this key", "check if this key exists" and "retrieve the value for this key" so conceptually it's pretty simple - the complicated stuff all happens under the hood.
case study:
Ask Jeeves uses Berkeley DB to provide an easy-to-use tool for searching the Internet.
Microsoft uses Berkeley DB for the Groove collaboration software
AOL uses Berkeley DB for search tool meta-data and other services.
Hitachi uses Berkeley DB in its directory services server product.
Ford uses Berkeley DB to authenticate partners who access Ford's Web applications.
Hewlett Packard uses Berkeley DB in serveral products, including storage, security and wireless software.
Google uses Berkeley DB High Availability for Google Accounts.
Motorola uses Berkeley DB to track mobile units in its wireless radio network products.

11. R    (GPL)
http://www.r-project.org/
R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R.
R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible. The S language is often the vehicle of choice for research in statistical methodology, and R provides an Open Source route to participation in that activity.
One of R's strengths is the ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed. Great care has been taken over the defaults for the minor design choices in graphics, but the user retains full control.
R is available as Free Software under the terms of the Free Software Foundation's GNU General Public License in source code form. It compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and MacOS.

R统计软件与MatLab类似,都是用在科学计算领域的。


另外还有现在很火的Mahout和Weka,个人觉得Mahout更加

适合工程应用

© 著作权归作者所有

liangtee
粉丝 107
博文 94
码字总数 38111
作品 0
朝阳
程序员
私信 提问
加载中

评论(1)

ojs
ojs
总结得不错!
油焖-菠菜/spinach

#javaWeb开源框架 spinach ## 简介 spinach是基于多个优秀的开源项目,高度整合封装而成的高效,高性能,强安全性的开源Java EE快速开发平台。 集结最新主流时尚开源技术的面向互联网Web应用...

油焖-菠菜
2015/05/11
0
0
Linux基础系列之--Linux基础入门

1、Linux相关的开源协定有: GPL:General Public License(通用公共许可证) LGPL GPLv2 BSD: Apache (1)、GPL: GPL是GNU General Public License的缩写,是GNU通用公共授权非正式的中文...

静思知意
2018/06/28
0
0
比智科技/SuperDebuger

软件名称:C#超级通信调试工具 最新版本:1.0 作者:Maximus Ye QQ:275623749 GIT开源项目地址:https://code.csdn.net/yeqi3000/supernetdebuger C#通信编程交流群:83868794 邮箱:yq@yyzq...

比智科技
2016/04/02
0
0
详细介绍 Apache Licene 2.0 协议

Apache Licence是著名的非盈利开源组织Apache采用的协议。该协议和BSD类似,同样鼓励代码共享和尊重原作者的著作权,同样允许代码修改,再发布(作为开源或商业软件)。需要满足的条件也和B...

红薯
2009/11/30
193.5K
21
edwinfound/OPMP3Player

OPMP3Player是什么 OPMP3Play是一个JQuery网页在线音乐播放器插件。 播放器采用Flex 4.0开发,结合Javascript前台脚本,实现网页在线歌词的同步。 程序使用很简单,只需简单的初始化即可使用...

edwinfound
2015/09/19
0
0

没有更多内容

加载失败,请刷新页面

加载更多

微服务架构一直火,为什么服务化要搞懂?

微服务架构,这 5 年左右一直被认可,是软件架构的未来方向。需要大家理解的是,为什么需要服务化。比如微服务架构对企业来说,带来什么价值?有啥弊端? 这里浅谈一下微服务架构,主要还是在...

泥瓦匠BYSocket
37分钟前
3
0
总结:单机与分布式

传统计算方案演变 1、单机并行运算 1,打开数据源 2,统计出有多少个文件。 3,为每个文件执行相同的统计命令 4,等待所有命令执行成功。 5,合并统计后结果输出或执行进一步统计 2、分布式并...

浮躁的码农
48分钟前
5
0
关于怎么解决CENTOS7没有ETH0网卡这个问题

CentOS7系统安装完毕之后,输入ifconfig命令发现没有eth0,不符合我们的习惯。而且也无法远程ssh连接。 1.进入目录/etc/sysconfig/network-scripts/ 2.将文件ifcfg-ens33重命名为ifcfg-eth0;...

无名氏的程序员
55分钟前
5
0
HTML5 Web Storage 存储介绍

Web Storage是HTML5 API提供一个新的重要的特性; 最新的Web Storage草案中提到,在web客户端可用html5 API,以Key-Value形式来进行数据持久存储; 目前主要的浏览器已经支持该功能: 常见的...

前端老手
今天
5
0
安装mxnet出现的错误

我出现下面的错误:是因为我前面的安装步骤都正确,只是这一步出现错误,sudo python setup.py install 其实我看了下我默认的python是3.6,是大于3.5 ,改为sudo python3 setup.py install就...

南桥北木
今天
5
0

没有更多内容

加载失败,请刷新页面

加载更多

返回顶部
顶部