Evaluating the Segmentation Quality of the word, ansj, mmseg4j, and ik-analyzer Segmenters

杨尚川
Published 2014/04/29 19:09

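word is a Chinese word segmentation component implemented in Java. It provides several dictionary-based segmentation algorithms and uses an ngram model to resolve ambiguity. It accurately recognizes English words and numbers, as well as quantity expressions such as dates and times, and it can identify out-of-vocabulary words such as person names, place names, and organization names. Plugins for Lucene, Solr, and ElasticSearch are also provided.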

 

The evaluation of the word segmenter covers the following 7 segmentation algorithms:

 

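Forward maximum matching: MaximumMatching
Reverse maximum matching: ReverseMaximumMatching
Forward minimum matching: MinimumMatching
Reverse minimum matching: ReverseMinimumMatching
Bidirectional maximum matching: BidirectionalMaximumMatching
Bidirectional minimum matching: BidirectionalMinimumMatching
Bidirectional maximum-minimum matching: BidirectionalMaximumMinimumMatching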
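
To make the matching idea concrete, here is a minimal sketch of forward and reverse maximum matching over a toy dictionary. It is not the word library's implementation: the class name, the four-entry dictionary, and the fixed maximum word length are assumptions chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Set;

/** Toy dictionary-based matching; NOT the word library's implementation. */
class MaxMatchSketch {
    // Hypothetical mini dictionary, invented for this example.
    static final Set<String> DICT = new HashSet<>(Arrays.asList("研究", "研究生", "生命", "起源"));
    // Length of the longest dictionary entry.
    static final int MAX_LEN = 3;

    /** Forward maximum matching: at each position take the longest dictionary word. */
    static List<String> forward(String text) {
        List<String> words = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int len = Math.min(MAX_LEN, text.length() - start);
            // Shrink the window until it is a dictionary word or a single character.
            while (len > 1 && !DICT.contains(text.substring(start, start + len))) {
                len--;
            }
            words.add(text.substring(start, start + len));
            start += len;
        }
        return words;
    }

    /** Reverse maximum matching: the same idea, scanning from the end of the text. */
    static List<String> reverse(String text) {
        LinkedList<String> words = new LinkedList<>();
        int end = text.length();
        while (end > 0) {
            int len = Math.min(MAX_LEN, end);
            while (len > 1 && !DICT.contains(text.substring(end - len, end))) {
                len--;
            }
            words.addFirst(text.substring(end - len, end));
            end -= len;
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(forward("研究生命起源")); // [研究生, 命, 起源]
        System.out.println(reverse("研究生命起源")); // [研究, 生命, 起源]
    }
}
```

The two directions disagree on this input, which is the kind of ambiguity a bidirectional algorithm has to resolve. A minimum matching variant would grow the window from the shortest candidate instead of shrinking it from the longest.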

 

All of the bidirectional algorithms use ngram to resolve ambiguity, so the evaluation is run separately with bigram and with trigram.
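
One plausible way to picture ngram disambiguation is to score each candidate segmentation by its adjacent word pairs and keep the best one. The sketch below does this with bigrams. The score table, the scoring rule, and all names are assumptions made up for this example; the word library's actual model and scoring may differ.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy bigram scoring over candidate segmentations; NOT the word library's model. */
class BigramDisambiguationSketch {
    // Hypothetical bigram scores ("previous:next" -> score), invented for illustration.
    static final Map<String, Integer> BIGRAM = new HashMap<>();
    static {
        BIGRAM.put("研究:生命", 5);
        BIGRAM.put("生命:起源", 8);
        BIGRAM.put("研究生:命", 1);
    }

    /** Sum the scores of all adjacent word pairs; unseen pairs score 0. */
    static int score(List<String> words) {
        int total = 0;
        for (int i = 0; i + 1 < words.size(); i++) {
            total += BIGRAM.getOrDefault(words.get(i) + ":" + words.get(i + 1), 0);
        }
        return total;
    }

    /** Return the highest-scoring candidate; ties go to the earlier candidate. */
    static List<String> pick(List<List<String>> candidates) {
        List<String> best = candidates.get(0);
        for (List<String> candidate : candidates) {
            if (score(candidate) > score(best)) {
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> forward = Arrays.asList("研究生", "命", "起源");
        List<String> reverse = Arrays.asList("研究", "生命", "起源");
        System.out.println(score(forward)); // 1
        System.out.println(score(reverse)); // 13
        System.out.println(pick(Arrays.asList(forward, reverse))); // [研究, 生命, 起源]
    }
}
```

A trigram variant would score triples of adjacent words instead of pairs, at the cost of a larger model.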

 

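The test text has 2,533,709 lines and 28,374,490 characters in total. The standard text corresponds to the test text line by line, and the words in the standard text are separated by spaces. The criterion is strict equality: a line counts as correct only if the segmenter's output is identical to the standard line. The core evaluation code is as follows: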

 

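/**
 * Evaluate segmentation quality
 * @param resultText path of the file with the actual segmentation results
 * @param standardText path of the file with the standard segmentation results
 * @return evaluation result
 */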
public static EvaluationResult evaluation(String resultText, String standardText) {
	int perfectLineCount=0;
	int wrongLineCount=0;
	int perfectCharCount=0;
	int wrongCharCount=0;
	try(BufferedReader resultReader = new BufferedReader(new InputStreamReader(new FileInputStream(resultText),"utf-8"));
		BufferedReader standardReader = new BufferedReader(new InputStreamReader(new FileInputStream(standardText),"utf-8"))){
		String result;
		while( (result = resultReader.readLine()) != null ){
			result = result.trim();
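			String standardLine = standardReader.readLine();
			if(standardLine == null){
				// The standard file has fewer lines than the result file; stop instead of throwing a NullPointerException
				break;
			}
			String standard = standardLine.trim();
			if(result.equals("")){
				// Empty result lines are skipped and counted in neither bucket
				continue;
			}
			if(result.equals(standard)){
				// The segmentation result is identical to the standard
				perfectLineCount++;
				perfectCharCount+=standard.replaceAll("\\s+", "").length();
			}else{
				// The segmentation result differs from the standard
				wrongLineCount++;
				wrongCharCount+=standard.replaceAll("\\s+", "").length();
			}
		}
	} catch (IOException ex) {
		LOGGER.error("Segmentation evaluation failed:", ex);
	}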
	int totalLineCount = perfectLineCount+wrongLineCount;
	int totalCharCount = perfectCharCount+wrongCharCount;
	EvaluationResult er = new EvaluationResult();
	er.setPerfectCharCount(perfectCharCount);
	er.setPerfectLineCount(perfectLineCount);
	er.setTotalCharCount(totalCharCount);
	er.setTotalLineCount(totalLineCount);
	er.setWrongCharCount(wrongCharCount);
	er.setWrongLineCount(wrongLineCount);     
	return er;
}

 

/**
 * Evaluation result for Chinese word segmentation
 * @author 杨尚川
 */
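public class EvaluationResult implements Comparable{
    // The next two fields are used by toString() but were missing from this listing.
    // Assumption: SegmentationAlgorithm is the word library's enum that exposes getDes().
    private SegmentationAlgorithm segmentationAlgorithm;
    private float segSpeed;
    private int totalLineCount;
    private int perfectLineCount;
    private int wrongLineCount;
    private int totalCharCount;
    private int perfectCharCount;
    private int wrongCharCount;

    public void setSegmentationAlgorithm(SegmentationAlgorithm segmentationAlgorithm) {
        this.segmentationAlgorithm = segmentationAlgorithm;
    }
    public void setSegSpeed(float segSpeed) {
        this.segSpeed = segSpeed;
    }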
    public float getLinePerfectRate(){
        return perfectLineCount/(float)totalLineCount*100;
    }
    public float getLineWrongRate(){
        return wrongLineCount/(float)totalLineCount*100;
    }
    public float getCharPerfectRate(){
        return perfectCharCount/(float)totalCharCount*100;
    }
    public float getCharWrongRate(){
        return wrongCharCount/(float)totalCharCount*100;
    }
    public int getTotalLineCount() {
        return totalLineCount;
    }
    public void setTotalLineCount(int totalLineCount) {
        this.totalLineCount = totalLineCount;
    }
    public int getPerfectLineCount() {
        return perfectLineCount;
    }
    public void setPerfectLineCount(int perfectLineCount) {
        this.perfectLineCount = perfectLineCount;
    }
    public int getWrongLineCount() {
        return wrongLineCount;
    }
    public void setWrongLineCount(int wrongLineCount) {
        this.wrongLineCount = wrongLineCount;
    }
    public int getTotalCharCount() {
        return totalCharCount;
    }
    public void setTotalCharCount(int totalCharCount) {
        this.totalCharCount = totalCharCount;
    }
    public int getPerfectCharCount() {
        return perfectCharCount;
    }
    public void setPerfectCharCount(int perfectCharCount) {
        this.perfectCharCount = perfectCharCount;
    }
    public int getWrongCharCount() {
        return wrongCharCount;
    }
    public void setWrongCharCount(int wrongCharCount) {
        this.wrongCharCount = wrongCharCount;
    }
    @Override
    public String toString(){
        return segmentationAlgorithm.name()+" ("+segmentationAlgorithm.getDes()+"):"
                +"\n"
                +"Segmentation speed: "+segSpeed+" chars/ms"
                +"\n"
                +"Line perfect rate: "+getLinePerfectRate()+"%"
                +"  Line error rate: "+getLineWrongRate()+"%"
                +"  Total lines: "+totalLineCount
                +"  Perfect lines: "+perfectLineCount
                +"  Wrong lines: "+wrongLineCount
                +"\n"
                +"Character perfect rate: "+getCharPerfectRate()+"%"
                +"  Character error rate: "+getCharWrongRate()+"%"
                +"  Total characters: "+totalCharCount
                +"  Perfect characters: "+perfectCharCount
                +"  Wrong characters: "+wrongCharCount;
    }
    @Override
    public int compareTo(Object o) {
        EvaluationResult other = (EvaluationResult)o;
        if(other.getLinePerfectRate() - getLinePerfectRate() > 0){
            return 1;
        }
        if(other.getLinePerfectRate() - getLinePerfectRate() < 0){
            return -1;
        }
        return 0;
    }
}
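
The strict criterion is easy to misread, so here is a small in-memory restatement of the counting rule on three hand-made lines. It is a sketch that mirrors the logic of the evaluation method above, not a replacement for it; the class name, the method signature, and the sample lines are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.List;

/** In-memory restatement of the strict line-match rule, for illustration only. */
class StrictMatchSketch {
    /** Returns {perfectLines, wrongLines, perfectChars, wrongChars}. */
    static int[] evaluate(List<String> results, List<String> standards) {
        int[] counts = new int[4];
        for (int i = 0; i < results.size() && i < standards.size(); i++) {
            String result = results.get(i).trim();
            String standard = standards.get(i).trim();
            if (result.isEmpty()) {
                continue; // empty result lines count toward neither bucket
            }
            // Characters are counted on the standard line with all whitespace removed.
            int chars = standard.replaceAll("\\s+", "").length();
            if (result.equals(standard)) {
                counts[0]++;
                counts[2] += chars;
            } else {
                counts[1]++;
                counts[3] += chars;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> results = Arrays.asList("研究 生命 起源", "研究生 命 起源", "");
        List<String> standards = Arrays.asList("研究 生命 起源", "研究 生命 起源", "中文 分词");
        int[] c = evaluate(results, standards);
        System.out.println(Arrays.toString(c)); // [1, 1, 6, 6]
        System.out.println(100f * c[0] / (c[0] + c[1]) + "%"); // 50.0%
    }
}
```

Two consequences follow from this rule. A single wrong word boundary marks the whole line, and all of its characters, as wrong. And because empty result lines are skipped, the reported totals can fall slightly below the size of the test set when a segmenter produces no output for some lines.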

 

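Evaluation results of the word segmenter using trigram:

BidirectionalMaximumMinimumMatching (bidirectional maximum-minimum matching):
Segmentation speed: 265.62566 chars/ms
Line perfect rate: 55.352688%  Line error rate: 44.647312%  Total lines: 2533709  Perfect lines: 1402476  Wrong lines: 1131233
Character perfect rate: 46.23227%  Character error rate: 53.76773%  Total characters: 28374490  Perfect characters: 13118171  Wrong characters: 15256319

BidirectionalMaximumMatching (bidirectional maximum matching):
Segmentation speed: 335.62155 chars/ms
Line perfect rate: 50.16934%  Line error rate: 49.83066%  Total lines: 2533709  Perfect lines: 1271145  Wrong lines: 1262564
Character perfect rate: 40.692997%  Character error rate: 59.307003%  Total characters: 28374490  Perfect characters: 11546430  Wrong characters: 16828060

ReverseMaximumMatching (reverse maximum matching):
Segmentation speed: 686.71045 chars/ms
Line perfect rate: 46.723125%  Line error rate: 53.27688%  Total lines: 2533709  Perfect lines: 1183828  Wrong lines: 1349881
Character perfect rate: 36.67598%  Character error rate: 63.32402%  Total characters: 28374490  Perfect characters: 10406622  Wrong characters: 17967868

MaximumMatching (forward maximum matching):
Segmentation speed: 733.9535 chars/ms
Line perfect rate: 46.661713%  Line error rate: 53.338287%  Total lines: 2533709  Perfect lines: 1182272  Wrong lines: 1351437
Character perfect rate: 36.72861%  Character error rate: 63.271393%  Total characters: 28374490  Perfect characters: 10421556  Wrong characters: 17952934

BidirectionalMinimumMatching (bidirectional minimum matching):
Segmentation speed: 432.87375 chars/ms
Line perfect rate: 45.863907%  Line error rate: 54.136093%  Total lines: 2533709  Perfect lines: 1162058  Wrong lines: 1371651
Character perfect rate: 35.942123%  Character error rate: 64.05788%  Total characters: 28374490  Perfect characters: 10198395  Wrong characters: 18176095

ReverseMinimumMatching (reverse minimum matching):
Segmentation speed: 1033.58636 chars/ms
Line perfect rate: 41.776066%  Line error rate: 58.223934%  Total lines: 2533709  Perfect lines: 1058484  Wrong lines: 1475225
Character perfect rate: 31.678978%  Character error rate: 68.32102%  Total characters: 28374490  Perfect characters: 8988748  Wrong characters: 19385742

MinimumMatching (forward minimum matching):
Segmentation speed: 1175.4431 chars/ms
Line perfect rate: 36.853836%  Line error rate: 63.146164%  Total lines: 2533709  Perfect lines: 933769  Wrong lines: 1599940
Character perfect rate: 26.859812%  Character error rate: 73.14019%  Total characters: 28374490  Perfect characters: 7621334  Wrong characters: 20753156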

 

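Evaluation results of the word segmenter using bigram:

BidirectionalMaximumMinimumMatching (bidirectional maximum-minimum matching):
Segmentation speed: 233.49121 chars/ms
Line perfect rate: 55.31531%  Line error rate: 44.68469%  Total lines: 2533709  Perfect lines: 1401529  Wrong lines: 1132180
Character perfect rate: 45.834396%  Character error rate: 54.165604%  Total characters: 28374490  Perfect characters: 13005277  Wrong characters: 15369213

BidirectionalMaximumMatching (bidirectional maximum matching):
Segmentation speed: 303.59401 chars/ms
Line perfect rate: 52.007233%  Line error rate: 47.992767%  Total lines: 2533709  Perfect lines: 1317712  Wrong lines: 1215997
Character perfect rate: 42.424194%  Character error rate: 57.575806%  Total characters: 28374490  Perfect characters: 12037649  Wrong characters: 16336841

BidirectionalMinimumMatching (bidirectional minimum matching):
Segmentation speed: 349.67215 chars/ms
Line perfect rate: 46.766422%  Line error rate: 53.23358%  Total lines: 2533709  Perfect lines: 1184925  Wrong lines: 1348784
Character perfect rate: 36.52718%  Character error rate: 63.47282%  Total characters: 28374490  Perfect characters: 10364401  Wrong characters: 18010089

ReverseMaximumMatching (reverse maximum matching):
Segmentation speed: 598.04272 chars/ms
Line perfect rate: 46.723125%  Line error rate: 53.27688%  Total lines: 2533709  Perfect lines: 1183828  Wrong lines: 1349881
Character perfect rate: 36.67598%  Character error rate: 63.32402%  Total characters: 28374490  Perfect characters: 10406622  Wrong characters: 17967868

MaximumMatching (forward maximum matching):
Segmentation speed: 676.7993 chars/ms
Line perfect rate: 46.661713%  Line error rate: 53.338287%  Total lines: 2533709  Perfect lines: 1182272  Wrong lines: 1351437
Character perfect rate: 36.72861%  Character error rate: 63.271393%  Total characters: 28374490  Perfect characters: 10421556  Wrong characters: 17952934

ReverseMinimumMatching (reverse minimum matching):
Segmentation speed: 806.9586 chars/ms
Line perfect rate: 41.776066%  Line error rate: 58.223934%  Total lines: 2533709  Perfect lines: 1058484  Wrong lines: 1475225
Character perfect rate: 31.678978%  Character error rate: 68.32102%  Total characters: 28374490  Perfect characters: 8988748  Wrong characters: 19385742

MinimumMatching (forward minimum matching):
Segmentation speed: 1020.9208 chars/ms
Line perfect rate: 36.853836%  Line error rate: 63.146164%  Total lines: 2533709  Perfect lines: 933769  Wrong lines: 1599940
Character perfect rate: 26.859812%  Character error rate: 73.14019%  Total characters: 28374490  Perfect characters: 7621334  Wrong characters: 20753156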

 

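The evaluation results for Ansj 0.9 are as follows:

Ansj ToAnalysis (precise segmentation):
Segmentation speed: 495.9188 chars/ms
Line perfect rate: 58.609295%  Line error rate: 41.390705%  Total lines: 2533709  Perfect lines: 1484989  Wrong lines: 1048720
Character perfect rate: 50.97614%  Character error rate: 49.023857%  Total characters: 28374490  Perfect characters: 14464220  Wrong characters: 13910270

Ansj NlpAnalysis (NLP segmentation):
Segmentation speed: 350.7527 chars/ms
Line perfect rate: 58.60353%  Line error rate: 41.396465%  Total lines: 2533709  Perfect lines: 1484843  Wrong lines: 1048866
Character perfect rate: 50.75546%  Character error rate: 49.244545%  Total characters: 28374490  Perfect characters: 14401602  Wrong characters: 13972888

Ansj BaseAnalysis (basic segmentation):
Segmentation speed: 532.65424 chars/ms
Line perfect rate: 54.028584%  Line error rate: 45.97142%  Total lines: 2533709  Perfect lines: 1368927  Wrong lines: 1164782
Character perfect rate: 46.84512%  Character error rate: 53.15488%  Total characters: 28374490  Perfect characters: 13292064  Wrong characters: 15082426

Ansj IndexAnalysis (index-oriented segmentation):
Segmentation speed: 564.6103 chars/ms
Line perfect rate: 53.510803%  Line error rate: 46.489197%  Total lines: 2533709  Perfect lines: 1355808  Wrong lines: 1177901
Character perfect rate: 46.355087%  Character error rate: 53.644913%  Total characters: 28374490  Perfect characters: 13153019  Wrong characters: 15221471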

 

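The evaluation results for Ansj 1.4 are as follows:

Ansj ToAnalysis (precise segmentation):
Segmentation speed: 581.7306 chars/ms
Line perfect rate: 58.60302%  Line error rate: 41.39698%  Total lines: 2533709  Perfect lines: 1484830  Wrong lines: 1048879
Character perfect rate: 50.968987%  Character error rate: 49.031013%  Total characters: 28374490  Perfect characters: 14462190  Wrong characters: 13912300

Ansj NlpAnalysis (NLP segmentation):
Segmentation speed: 138.81165 chars/ms
Line perfect rate: 58.1515%  Line error rate: 41.8485%  Total lines: 2533687  Perfect lines: 1473377  Wrong lines: 1060310
Character perfect rate: 49.806484%  Character error rate: 50.19352%  Total characters: 28374398  Perfect characters: 14132290  Wrong characters: 14242108

Ansj BaseAnalysis (basic segmentation):
Segmentation speed: 627.68475 chars/ms
Line perfect rate: 55.3174%  Line error rate: 44.6826%  Total lines: 2533709  Perfect lines: 1401582  Wrong lines: 1132127
Character perfect rate: 48.177986%  Character error rate: 51.822014%  Total characters: 28374490  Perfect characters: 13670258  Wrong characters: 14704232

Ansj IndexAnalysis (index-oriented segmentation):
Segmentation speed: 715.55176 chars/ms
Line perfect rate: 50.89444%  Line error rate: 49.10556%  Total lines: 2533709  Perfect lines: 1289517  Wrong lines: 1244192
Character perfect rate: 42.965115%  Character error rate: 57.034885%  Total characters: 28374490  Perfect characters: 12191132  Wrong characters: 16183358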

 

The Ansj evaluation program is as follows:

 

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.ansj.domain.Term;
import org.ansj.splitWord.analysis.BaseAnalysis;
import org.ansj.splitWord.analysis.IndexAnalysis;
import org.ansj.splitWord.analysis.NlpAnalysis;
import org.ansj.splitWord.analysis.ToAnalysis;

/**
 * Evaluation of the Ansj segmenter's segmentation quality
 * @author 杨尚川
 */
public class AnsjEvaluation {

    public static void main(String[] args) throws Exception{
        // The test file d:/test-text.txt and the standard segmentation file d:/standard-text.txt can be downloaded from:
        // http://pan.baidu.com/s/1hqihzjY
        
        List<EvaluationResult> list = new ArrayList<>();
        // Segment the text
        float rate = seg("d:/test-text.txt", "d:/result-text-BaseAnalysis.txt", "BaseAnalysis");
        // Evaluate the segmentation result
        EvaluationResult result = evaluation("d:/result-text-BaseAnalysis.txt", "d:/standard-text.txt");
        result.setAnalyzer("Ansj BaseAnalysis (basic segmentation)");
        result.setSegSpeed(rate);
        list.add(result);
        
        // Segment the text
        rate = seg("d:/test-text.txt", "d:/result-text-ToAnalysis.txt", "ToAnalysis");
        // Evaluate the segmentation result
        result = evaluation("d:/result-text-ToAnalysis.txt", "d:/standard-text.txt");
        result.setAnalyzer("Ansj ToAnalysis (precise segmentation)");
        result.setSegSpeed(rate);
        list.add(result);
        
        // Segment the text
        rate = seg("d:/test-text.txt", "d:/result-text-NlpAnalysis.txt", "NlpAnalysis");
        // Evaluate the segmentation result
        result = evaluation("d:/result-text-NlpAnalysis.txt", "d:/standard-text.txt");
        result.setAnalyzer("Ansj NlpAnalysis (NLP segmentation)");
        result.setSegSpeed(rate);
        list.add(result);
        
        // Segment the text
        rate = seg("d:/test-text.txt", "d:/result-text-IndexAnalysis.txt", "IndexAnalysis");
        // Evaluate the segmentation result
        result = evaluation("d:/result-text-IndexAnalysis.txt", "d:/standard-text.txt");
        result.setAnalyzer("Ansj IndexAnalysis (index-oriented segmentation)");
        result.setSegSpeed(rate);
        list.add(result);
        
        // Print the evaluation results, best line perfect rate first
        Collections.sort(list);
        System.out.println("");
        for(EvaluationResult r : list){
            System.out.println(r+"\n");
        }
    }
    private static float seg(final String input, final String output, final String type) throws Exception{
        float rate = 0;
        try(BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(input),"utf-8"));
                BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(output),"utf-8"))){
            long size = Files.size(Paths.get(input));
            System.out.println("size:"+size);
            System.out.println("File size: "+(float)size/1024/1024+" MB");
            int textLength=0;
            int progress=0;
            long start = System.currentTimeMillis();
            String line = null;
            while((line = reader.readLine()) != null){
                if("".equals(line.trim())){
                    writer.write("\n");
                    continue;
                }
                textLength += line.length();
                switch(type){
                    case "BaseAnalysis":
                        for(Term term : BaseAnalysis.parse(line)){
                            writer.write(term.getName()+" ");
                        }
                        break;
                    case "ToAnalysis":
                        for(Term term : ToAnalysis.parse(line)){
                            writer.write(term.getName()+" ");
                        }
                        break;
                    case "NlpAnalysis":
                        try{
                            for(Term term : NlpAnalysis.parse(line)){
                                writer.write(term.getName()+" ");
                            }
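                        }catch(Exception e){
                            // Deliberately ignored: if NlpAnalysis fails, the line keeps only what was written before the failure
                        }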
                        break;
                    case "IndexAnalysis":
                        for(Term term : IndexAnalysis.parse(line)){
                            writer.write(term.getName()+" ");
                        }
                        break;
                }                
                writer.write("\n");
                progress += line.length();
                if( progress > 500000){
                    progress = 0;
                    System.out.println("Segmentation progress: "+(int)(textLength*2.99/size*100)+"%");
                }
            }
            long cost = System.currentTimeMillis() - start;
            rate = textLength/(float)cost;
            System.out.println("Character count: "+textLength);
            System.out.println("Segmentation time: "+cost+" ms");
            System.out.println("Segmentation speed: "+rate+" chars/ms");
        }
        return rate;
    }
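    /**
     * Evaluate segmentation quality
     * @param resultText path of the file with the actual segmentation results
     * @param standardText path of the file with the standard segmentation results
     * @return evaluation result
     */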
    private static EvaluationResult evaluation(String resultText, String standardText) {
        int perfectLineCount=0;
        int wrongLineCount=0;
        int perfectCharCount=0;
        int wrongCharCount=0;
        try(BufferedReader resultReader = new BufferedReader(new InputStreamReader(new FileInputStream(resultText),"utf-8"));
            BufferedReader standardReader = new BufferedReader(new InputStreamReader(new FileInputStream(standardText),"utf-8"))){
            String result;
            while( (result = resultReader.readLine()) != null ){
                result = result.trim();
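                String standardLine = standardReader.readLine();
                if(standardLine == null){
                    // The standard file has fewer lines than the result file; stop instead of throwing a NullPointerException
                    break;
                }
                String standard = standardLine.trim();
                if(result.equals("")){
                    // Empty result lines are skipped and counted in neither bucket
                    continue;
                }
                if(result.equals(standard)){
                    // The segmentation result is identical to the standard
                    perfectLineCount++;
                    perfectCharCount+=standard.replaceAll("\\s+", "").length();
                }else{
                    // The segmentation result differs from the standard
                    wrongLineCount++;
                    wrongCharCount+=standard.replaceAll("\\s+", "").length();
                }
            }
        } catch (IOException ex) {
            System.err.println("Segmentation evaluation failed: " + ex.getMessage());
        }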
        int totalLineCount = perfectLineCount+wrongLineCount;
        int totalCharCount = perfectCharCount+wrongCharCount;
        EvaluationResult er = new EvaluationResult();
        er.setPerfectCharCount(perfectCharCount);
        er.setPerfectLineCount(perfectLineCount);
        er.setTotalCharCount(totalCharCount);
        er.setTotalLineCount(totalLineCount);
        er.setWrongCharCount(wrongCharCount);
        er.setWrongLineCount(wrongLineCount);     
        return er;
    }
    /**
     * Evaluation result
     */
    private static class EvaluationResult implements Comparable{
        private String analyzer;
        private float segSpeed;
        private int totalLineCount;
        private int perfectLineCount;
        private int wrongLineCount;
        private int totalCharCount;
        private int perfectCharCount;
        private int wrongCharCount;

        public String getAnalyzer() {
            return analyzer;
        }
        public void setAnalyzer(String analyzer) {
            this.analyzer = analyzer;
        }
        public float getSegSpeed() {
            return segSpeed;
        }
        public void setSegSpeed(float segSpeed) {
            this.segSpeed = segSpeed;
        }
        public float getLinePerfectRate(){
            return perfectLineCount/(float)totalLineCount*100;
        }
        public float getLineWrongRate(){
            return wrongLineCount/(float)totalLineCount*100;
        }
        public float getCharPerfectRate(){
            return perfectCharCount/(float)totalCharCount*100;
        }
        public float getCharWrongRate(){
            return wrongCharCount/(float)totalCharCount*100;
        }
        public int getTotalLineCount() {
            return totalLineCount;
        }
        public void setTotalLineCount(int totalLineCount) {
            this.totalLineCount = totalLineCount;
        }
        public int getPerfectLineCount() {
            return perfectLineCount;
        }
        public void setPerfectLineCount(int perfectLineCount) {
            this.perfectLineCount = perfectLineCount;
        }
        public int getWrongLineCount() {
            return wrongLineCount;
        }
        public void setWrongLineCount(int wrongLineCount) {
            this.wrongLineCount = wrongLineCount;
        }
        public int getTotalCharCount() {
            return totalCharCount;
        }
        public void setTotalCharCount(int totalCharCount) {
            this.totalCharCount = totalCharCount;
        }
        public int getPerfectCharCount() {
            return perfectCharCount;
        }
        public void setPerfectCharCount(int perfectCharCount) {
            this.perfectCharCount = perfectCharCount;
        }
        public int getWrongCharCount() {
            return wrongCharCount;
        }
        public void setWrongCharCount(int wrongCharCount) {
            this.wrongCharCount = wrongCharCount;
        }
        @Override
        public String toString(){
            return analyzer+":"
                    +"\n"
                    +"Segmentation speed: "+segSpeed+" chars/ms"
                    +"\n"
                    +"Line perfect rate: "+getLinePerfectRate()+"%"
                    +"  Line error rate: "+getLineWrongRate()+"%"
                    +"  Total lines: "+totalLineCount
                    +"  Perfect lines: "+perfectLineCount
                    +"  Wrong lines: "+wrongLineCount
                    +"\n"
                    +"Character perfect rate: "+getCharPerfectRate()+"%"
                    +"  Character error rate: "+getCharWrongRate()+"%"
                    +"  Total characters: "+totalCharCount
                    +"  Perfect characters: "+perfectCharCount
                    +"  Wrong characters: "+wrongCharCount;
        }
        @Override
        public int compareTo(Object o) {
            EvaluationResult other = (EvaluationResult)o;
            if(other.getLinePerfectRate() - getLinePerfectRate() > 0){
                return 1;
            }
            if(other.getLinePerfectRate() - getLinePerfectRate() < 0){
                return -1;
            }
            return 0;
        }
    }
}

 

 

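The evaluation results for MMSeg4j 1.9.1 are as follows:

MMSeg4j ComplexSeg:
Segmentation speed: 794.24805 chars/ms
Line perfect rate: 38.817604%  Line error rate: 61.182396%  Total lines: 2533688  Perfect lines: 983517  Wrong lines: 1550171
Character perfect rate: 29.604435%  Character error rate: 70.39557%  Total characters: 28374428  Perfect characters: 8400089  Wrong characters: 19974339

MMSeg4j SimpleSeg:
Segmentation speed: 1026.1058 chars/ms
Line perfect rate: 37.570095%  Line error rate: 62.429905%  Total lines: 2533688  Perfect lines: 951909  Wrong lines: 1581779
Character perfect rate: 28.455273%  Character error rate: 71.54473%  Total characters: 28374428  Perfect characters: 8074021  Wrong characters: 20300407

MMSeg4j MaxWordSeg:
Segmentation speed: 813.0676 chars/ms
Line perfect rate: 34.27573%  Line error rate: 65.72427%  Total lines: 2533688  Perfect lines: 868440  Wrong lines: 1665248
Character perfect rate: 25.20896%  Character error rate: 74.79104%  Total characters: 28374428  Perfect characters: 7152898  Wrong characters: 21221530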

 

The MMSeg4j 1.9.1 evaluation program is as follows:

 

import com.chenlb.mmseg4j.ComplexSeg;
import com.chenlb.mmseg4j.Dictionary;
import com.chenlb.mmseg4j.MMSeg;
import com.chenlb.mmseg4j.MaxWordSeg;
import com.chenlb.mmseg4j.Seg;
import com.chenlb.mmseg4j.SimpleSeg;
import com.chenlb.mmseg4j.Word;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/**
 * Evaluation of the MMSeg4j segmenter's segmentation quality
 * @author 杨尚川
 */
public class MMSeg4jEvaluation {

    public static void main(String[] args) throws Exception{
        // The test file d:/test-text.txt and the standard segmentation file d:/standard-text.txt can be downloaded from:
        // http://pan.baidu.com/s/1hqihzjY
        
        List<EvaluationResult> list = new ArrayList<>();
        Dictionary dic = Dictionary.getInstance();
        // Segment the text
        float rate = seg("d:/test-text.txt", "d:/result-text-ComplexSeg.txt", new ComplexSeg(dic));
        // Evaluate the segmentation result
        EvaluationResult result = evaluation("d:/result-text-ComplexSeg.txt", "d:/standard-text.txt");
        result.setAnalyzer("MMSeg4j ComplexSeg");
        result.setSegSpeed(rate);
        list.add(result);
        
        // Segment the text
        rate = seg("d:/test-text.txt", "d:/result-text-SimpleSeg.txt", new SimpleSeg(dic));
        // Evaluate the segmentation result
        result = evaluation("d:/result-text-SimpleSeg.txt", "d:/standard-text.txt");
        result.setAnalyzer("MMSeg4j SimpleSeg");
        result.setSegSpeed(rate);
        list.add(result);
        
        // Segment the text
        rate = seg("d:/test-text.txt", "d:/result-text-MaxWordSeg.txt", new MaxWordSeg(dic));
        // Evaluate the segmentation result
        result = evaluation("d:/result-text-MaxWordSeg.txt", "d:/standard-text.txt");
        result.setAnalyzer("MMSeg4j MaxWordSeg");
        result.setSegSpeed(rate);
        list.add(result);
        
        // Print the evaluation results, best line perfect rate first
        Collections.sort(list);
        System.out.println("");
        for(EvaluationResult r : list){
            System.out.println(r+"\n");
        }
    }
    private static float seg(final String input, final String output, final Seg seg) throws Exception{
        float rate = 0;
        try(BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(input),"utf-8"));
                BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(output),"utf-8"))){
            long size = Files.size(Paths.get(input));
            System.out.println("size:"+size);
            System.out.println("File size: "+(float)size/1024/1024+" MB");
            int textLength=0;
            int progress=0;
            long start = System.currentTimeMillis();
            String line = null;
            while((line = reader.readLine()) != null){
                if("".equals(line.trim())){
                    writer.write("\n");
                    continue;
                }
                textLength += line.length();
                writer.write(seg(line, seg));
                writer.write("\n");
                progress += line.length();
                if( progress > 500000){
                    progress = 0;
                    System.out.println("Segmentation progress: "+(int)(textLength*2.99/size*100)+"%");
                }
            }
            long cost = System.currentTimeMillis() - start;
            rate = textLength/(float)cost;
            System.out.println("Character count: "+textLength);
            System.out.println("Segmentation time: "+cost+" ms");
            System.out.println("Segmentation speed: "+rate+" chars/ms");
        }
        return rate;
    }
    private static String seg(String text, Seg seg) throws IOException {
        StringBuilder result = new StringBuilder();
        MMSeg mmSeg = new MMSeg(new StringReader(text), seg);
        Word word = null;
        while((word=mmSeg.next())!=null) {
            result.append(word.getString()).append(" ");			
        }
        return result.toString().trim();
    }
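    /**
     * Evaluate segmentation quality
     * @param resultText path of the file with the actual segmentation results
     * @param standardText path of the file with the standard segmentation results
     * @return evaluation result
     */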
    private static EvaluationResult evaluation(String resultText, String standardText) {
        int perfectLineCount=0;
        int wrongLineCount=0;
        int perfectCharCount=0;
        int wrongCharCount=0;
        try(BufferedReader resultReader = new BufferedReader(new InputStreamReader(new FileInputStream(resultText),"utf-8"));
            BufferedReader standardReader = new BufferedReader(new InputStreamReader(new FileInputStream(standardText),"utf-8"))){
            String result;
            while( (result = resultReader.readLine()) != null ){
                result = result.trim();
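                String standardLine = standardReader.readLine();
                if(standardLine == null){
                    // The standard file has fewer lines than the result file; stop instead of throwing a NullPointerException
                    break;
                }
                String standard = standardLine.trim();
                if(result.equals("")){
                    // Empty result lines are skipped and counted in neither bucket
                    continue;
                }
                if(result.equals(standard)){
                    // The segmentation result is identical to the standard
                    perfectLineCount++;
                    perfectCharCount+=standard.replaceAll("\\s+", "").length();
                }else{
                    // The segmentation result differs from the standard
                    wrongLineCount++;
                    wrongCharCount+=standard.replaceAll("\\s+", "").length();
                }
            }
        } catch (IOException ex) {
            System.err.println("Segmentation evaluation failed: " + ex.getMessage());
        }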
        int totalLineCount = perfectLineCount+wrongLineCount;
        int totalCharCount = perfectCharCount+wrongCharCount;
        EvaluationResult er = new EvaluationResult();
        er.setPerfectCharCount(perfectCharCount);
        er.setPerfectLineCount(perfectLineCount);
        er.setTotalCharCount(totalCharCount);
        er.setTotalLineCount(totalLineCount);
        er.setWrongCharCount(wrongCharCount);
        er.setWrongLineCount(wrongLineCount);     
        return er;
    }
    /**
     * Evaluation result
     */
    private static class EvaluationResult implements Comparable{
        private String analyzer;
        private float segSpeed;
        private int totalLineCount;
        private int perfectLineCount;
        private int wrongLineCount;
        private int totalCharCount;
        private int perfectCharCount;
        private int wrongCharCount;

        public String getAnalyzer() {
            return analyzer;
        }
        public void setAnalyzer(String analyzer) {
            this.analyzer = analyzer;
        }
        public float getSegSpeed() {
            return segSpeed;
        }
        public void setSegSpeed(float segSpeed) {
            this.segSpeed = segSpeed;
        }
        public float getLinePerfectRate(){
            return perfectLineCount/(float)totalLineCount*100;
        }
        public float getLineWrongRate(){
            return wrongLineCount/(float)totalLineCount*100;
        }
        public float getCharPerfectRate(){
            return perfectCharCount/(float)totalCharCount*100;
        }
        public float getCharWrongRate(){
            return wrongCharCount/(float)totalCharCount*100;
        }
        public int getTotalLineCount() {
            return totalLineCount;
        }
        public void setTotalLineCount(int totalLineCount) {
            this.totalLineCount = totalLineCount;
        }
        public int getPerfectLineCount() {
            return perfectLineCount;
        }
        public void setPerfectLineCount(int perfectLineCount) {
            this.perfectLineCount = perfectLineCount;
        }
        public int getWrongLineCount() {
            return wrongLineCount;
        }
        public void setWrongLineCount(int wrongLineCount) {
            this.wrongLineCount = wrongLineCount;
        }
        public int getTotalCharCount() {
            return totalCharCount;
        }
        public void setTotalCharCount(int totalCharCount) {
            this.totalCharCount = totalCharCount;
        }
        public int getPerfectCharCount() {
            return perfectCharCount;
        }
        public void setPerfectCharCount(int perfectCharCount) {
            this.perfectCharCount = perfectCharCount;
        }
        public int getWrongCharCount() {
            return wrongCharCount;
        }
        public void setWrongCharCount(int wrongCharCount) {
            this.wrongCharCount = wrongCharCount;
        }
        @Override
        public String toString(){
            return analyzer+":"
                    +"\n"
                    +"Segmentation speed: "+segSpeed+" chars/ms"
                    +"\n"
                    +"Line perfect rate: "+getLinePerfectRate()+"%"
                    +"  Line error rate: "+getLineWrongRate()+"%"
                    +"  Total lines: "+totalLineCount
                    +"  Perfect lines: "+perfectLineCount
                    +"  Wrong lines: "+wrongLineCount
                    +"\n"
                    +"Character perfect rate: "+getCharPerfectRate()+"%"
                    +"  Character error rate: "+getCharWrongRate()+"%"
                    +"  Total characters: "+totalCharCount
                    +"  Perfect characters: "+perfectCharCount
                    +"  Wrong characters: "+wrongCharCount;
        }
        @Override
        public int compareTo(Object o) {
            EvaluationResult other = (EvaluationResult)o;
            if(other.getLinePerfectRate() - getLinePerfectRate() > 0){
                return 1;
            }
            if(other.getLinePerfectRate() - getLinePerfectRate() < 0){
                return -1;
            }
            return 0;
        }
    }
}

 

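The evaluation results for ik-analyzer 2012_u6 are as follows:

IKAnalyzer (smart segmentation):
Segmentation speed: 178.3516 chars/ms
Line perfect rate: 37.55943%  Line error rate: 62.440567%  Total lines: 2533686  Perfect lines: 951638  Wrong lines: 1582048
Character perfect rate: 27.978464%  Character error rate: 72.02154%  Total characters: 28374416  Perfect characters: 7938726  Wrong characters: 20435690

IKAnalyzer (fine-grained segmentation):
Segmentation speed: 182.97859 chars/ms
Line perfect rate: 18.872742%  Line error rate: 81.12726%  Total lines: 2533686  Perfect lines: 478176  Wrong lines: 2055510
Character perfect rate: 10.936535%  Character error rate: 89.06347%  Total characters: 28374416  Perfect characters: 3103178  Wrong characters: 25271238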

 

 

The ik-analyzer 2012_u6 evaluation program is as follows:

 

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

/**
 * Evaluation of the IKAnalyzer segmenter's segmentation quality
 * @author 杨尚川
 */
public class IKAnalyzerEvaluation {

    public static void main(String[] args) throws Exception{
        // The test file d:/test-text.txt and the standard segmentation file d:/standard-text.txt can be downloaded from:
        // http://pan.baidu.com/s/1hqihzjY
        
        List<EvaluationResult> list = new ArrayList<>();
        
        // Segment the text
        float rate = seg("d:/test-text.txt", "d:/result-text-ComplexSeg.txt", true);
        // Evaluate the segmentation result
        EvaluationResult result = evaluation("d:/result-text-ComplexSeg.txt", "d:/standard-text.txt");
        result.setAnalyzer("IKAnalyzer (smart segmentation)");
        result.setSegSpeed(rate);
        list.add(result);
        
        // Segment the text
        rate = seg("d:/test-text.txt", "d:/result-text-SimpleSeg.txt", false);
        // Evaluate the segmentation result
        result = evaluation("d:/result-text-SimpleSeg.txt", "d:/standard-text.txt");
        result.setAnalyzer("IKAnalyzer (fine-grained segmentation)");
        result.setSegSpeed(rate);
        list.add(result);
        
        // Print the evaluation results, best line perfect rate first
        Collections.sort(list);
        System.out.println("");
        for(EvaluationResult r : list){
            System.out.println(r+"\n");
        }
    }
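    /**
     * Segments every line of the input file and writes the result to the output file.
     * @param input path of the text file to segment
     * @param output path of the file that receives the segmented text
     * @param useSmart true for smart segmentation, false for fine-grained segmentation
     * @return segmentation speed in characters per millisecond
     */
    private static float seg(final String input, final String output, final boolean useSmart) throws Exception{
        float rate = 0;
        try(BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(input),"utf-8"));
                BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(output),"utf-8"))){
            long size = Files.size(Paths.get(input));
            System.out.println("size:"+size);
            System.out.println("File size: "+(float)size/1024/1024+" MB");
            int textLength=0;
            int progress=0;
            long start = System.currentTimeMillis();
            String line = null;
            while((line = reader.readLine()) != null){
                if("".equals(line.trim())){
                    // Keep empty lines so that result and standard files stay aligned line by line
                    writer.write("\n");
                    continue;
                }
                textLength += line.length();
                writer.write(seg(line, useSmart));
                writer.write("\n");
                progress += line.length();
                if( progress > 500000){
                    progress = 0;
                    // Rough estimate: the factor 2.99 approximates the number of bytes per character in the file
                    System.out.println("Segmentation progress: "+(int)(textLength*2.99/size*100)+"%");
                }
            }
            long cost = System.currentTimeMillis() - start;
            rate = textLength/(float)cost;
            System.out.println("Character count: "+textLength);
            System.out.println("Segmentation time: "+cost+" ms");
            System.out.println("Segmentation speed: "+rate+" chars/ms");
        }
        return rate;
    }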
    /**
     * Segments a single line of text and returns the words joined by single spaces.
     */
    private static String seg(String text, boolean useSmart) throws IOException {
        StringBuilder result = new StringBuilder();
        IKSegmenter ik = new IKSegmenter(new StringReader(text), useSmart);
        Lexeme word = null;
        while((word=ik.next())!=null) {
            result.append(word.getLexemeText()).append(" ");
        }
        return result.toString().trim();
    }
    /**
     * Evaluates the segmentation quality.
     * @param resultText path of the file with the actual segmentation result
     * @param standardText path of the file with the standard segmentation result
     * @return the evaluation result
     */
    private static EvaluationResult evaluation(String resultText, String standardText) {
        int perfectLineCount=0;
        int wrongLineCount=0;
        int perfectCharCount=0;
        int wrongCharCount=0;
        try(BufferedReader resultReader = new BufferedReader(new InputStreamReader(new FileInputStream(resultText),"utf-8"));
            BufferedReader standardReader = new BufferedReader(new InputStreamReader(new FileInputStream(standardText),"utf-8"))){
            String result;
            while( (result = resultReader.readLine()) != null ){
                result = result.trim();
                String standard = standardReader.readLine();
                if(standard == null){
                    // The standard file has fewer lines than the result file: stop instead of failing with a NullPointerException
                    break;
                }
                standard = standard.trim();
                if(result.equals("")){
                    // Empty result lines are skipped and counted neither as perfect nor as wrong
                    continue;
                }
                if(result.equals(standard)){
                    // The segmentation result is identical to the standard
                    perfectLineCount++;
                    perfectCharCount+=standard.replaceAll("\\s+", "").length();
                }else{
                    // The segmentation result differs from the standard
                    wrongLineCount++;
                    wrongCharCount+=standard.replaceAll("\\s+", "").length();
                }
            }
        } catch (IOException ex) {
            System.err.println("Segmentation evaluation failed: " + ex.getMessage());
        }
        int totalLineCount = perfectLineCount+wrongLineCount;
        int totalCharCount = perfectCharCount+wrongCharCount;
        EvaluationResult er = new EvaluationResult();
        er.setPerfectCharCount(perfectCharCount);
        er.setPerfectLineCount(perfectLineCount);
        er.setTotalCharCount(totalCharCount);
        er.setTotalLineCount(totalLineCount);
        er.setWrongCharCount(wrongCharCount);
        er.setWrongLineCount(wrongLineCount);
        return er;
    }
    /**
     * Evaluation result for one analyzer configuration
     */
    private static class EvaluationResult implements Comparable{
        private String analyzer;
        private float segSpeed;
        private int totalLineCount;
        private int perfectLineCount;
        private int wrongLineCount;
        private int totalCharCount;
        private int perfectCharCount;
        private int wrongCharCount;

        public String getAnalyzer() {
            return analyzer;
        }
        public void setAnalyzer(String analyzer) {
            this.analyzer = analyzer;
        }
        public float getSegSpeed() {
            return segSpeed;
        }
        public void setSegSpeed(float segSpeed) {
            this.segSpeed = segSpeed;
        }
        public float getLinePerfectRate(){
            return perfectLineCount/(float)totalLineCount*100;
        }
        public float getLineWrongRate(){
            return wrongLineCount/(float)totalLineCount*100;
        }
        public float getCharPerfectRate(){
            return perfectCharCount/(float)totalCharCount*100;
        }
        public float getCharWrongRate(){
            return wrongCharCount/(float)totalCharCount*100;
        }
        public int getTotalLineCount() {
            return totalLineCount;
        }
        public void setTotalLineCount(int totalLineCount) {
            this.totalLineCount = totalLineCount;
        }
        public int getPerfectLineCount() {
            return perfectLineCount;
        }
        public void setPerfectLineCount(int perfectLineCount) {
            this.perfectLineCount = perfectLineCount;
        }
        public int getWrongLineCount() {
            return wrongLineCount;
        }
        public void setWrongLineCount(int wrongLineCount) {
            this.wrongLineCount = wrongLineCount;
        }
        public int getTotalCharCount() {
            return totalCharCount;
        }
        public void setTotalCharCount(int totalCharCount) {
            this.totalCharCount = totalCharCount;
        }
        public int getPerfectCharCount() {
            return perfectCharCount;
        }
        public void setPerfectCharCount(int perfectCharCount) {
            this.perfectCharCount = perfectCharCount;
        }
        public int getWrongCharCount() {
            return wrongCharCount;
        }
        public void setWrongCharCount(int wrongCharCount) {
            this.wrongCharCount = wrongCharCount;
        }
        @Override
        public String toString(){
            return analyzer+":"
                    +"\n"
                    +"Segmentation speed: "+segSpeed+" chars/ms"
                    +"\n"
                    +"Line perfect rate: "+getLinePerfectRate()+"%"
                    +"  Line error rate: "+getLineWrongRate()+"%"
                    +"  Total lines: "+totalLineCount
                    +"  Perfect lines: "+perfectLineCount
                    +"  Wrong lines: "+wrongLineCount
                    +"\n"
                    +"Char perfect rate: "+getCharPerfectRate()+"%"
                    +" Char error rate: "+getCharWrongRate()+"%"
                    +" Total chars: "+totalCharCount
                    +" Perfect chars: "+perfectCharCount
                    +" Wrong chars: "+wrongCharCount;
        }
        @Override
        public int compareTo(Object o) {
            // Sort in descending order of line perfect rate
            EvaluationResult other = (EvaluationResult)o;
            return Float.compare(other.getLinePerfectRate(), getLinePerfectRate());
        }
    }
}
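
The evaluation criterion used above is strict: a line counts as perfect only if the whole segmented line equals the standard line, and all of its characters are then booked as perfect or wrong together. The following minimal sketch applies the same rule to in-memory lists instead of files, so the effect of a single mismatch can be seen directly. The class name StrictLineEvaluationSketch, the int[] return layout, and the sample sentences are assumptions made for this illustration and are not part of the original program.

```java
import java.util.Arrays;
import java.util.List;

class StrictLineEvaluationSketch {
    /**
     * Compares segmented lines with standard lines one by one.
     * Returns {perfectLines, wrongLines, perfectChars, wrongChars}.
     */
    static int[] evaluate(List<String> results, List<String> standards) {
        int[] counts = new int[4];
        int n = Math.min(results.size(), standards.size());
        for (int i = 0; i < n; i++) {
            String result = results.get(i).trim();
            String standard = standards.get(i).trim();
            if (result.isEmpty()) {
                continue; // empty result lines are not counted at all
            }
            // Characters are counted on the standard line, ignoring the separating spaces
            int chars = standard.replaceAll("\\s+", "").length();
            if (result.equals(standard)) {
                counts[0]++;
                counts[2] += chars;
            } else {
                counts[1]++;
                counts[3] += chars;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> standards = Arrays.asList("中文 分词 器", "分词 效果 评估");
        // The second line merges two words, so the whole line is counted as wrong
        List<String> results = Arrays.asList("中文 分词 器", "分词 效果评估");
        int[] c = evaluate(results, standards);
        // One perfect line with 5 characters, one wrong line with 6 characters
        System.out.println(Arrays.toString(c));
    }
}
```

Because one merged or split word marks the entire line as wrong, this metric is considerably harsher than word-level precision and recall, which is worth keeping in mind when reading the low perfect rates reported above.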

 

 

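The evaluation programs for ansj, mmseg4j, and ik-analyzer can be downloaded from the attachment. To evaluate the word segmenter, simply run the evaluation.bat script in the project root directory.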

 

 

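References:

1. Test data set and standard data set for the word segmenter evaluation

2. Evaluation program for the word segmenter

3. word segmenter home page

4. ansj segmenter home page

5. mmseg4j segmenter home page

6. ik-analyzer segmenter home page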

 

