文档章节

11个重要的模型评估方法

q
 qiuliangliang
发布于 2017/05/30 18:51
字数 1096
阅读 16
收藏 0

Confidence Interval

http://www.datasciencecentral.com/profiles/blogs/black-box-confidence-intervals-excel-and-perl-implementations-det

Confidence intervals are used to assess how reliable a statistical estimate is. Wide confidence intervals mean that your model is poor (and it is worth investigating other models), or that your data is very noisy if confidence intervals don't improve by changing the model (that is, testing a different theoretical statistical distribution for your observations.) Modern confidence intervals are model-free, data -driven: click here to see how to compute them. A more general framework to assess and reduce sources of variance is called analysis of variance. Modern definitions of variance have a number of desirable properties

置信区间是指由样本统计量所构造的总体参数的估计区间。在统计学中,一个概率样本的置信区间(Confidence interval)是对这个样本的某个总体参数的区间估计。置信区间展现的是这个参数的真实值有一定概率落在测量结果的周围的程度。置信区间给出的是被测量参数的测量值的可信程度,即前面所要求的“一个概率”。

Pr(c1<=μ<=c2)=1-α

α是显著性水平(例:0.05或0.10)

100%*(1-α)指置信水平(例:95%或90%)

表达方式:interval(c1,c2)——置信区间。

Confusion Matrix. Used in the context of clustering. These N x N matrices (where N is the number of clusters) are designed as followed: the element in cell (i, j) represents the number of observations, in the test training set (as opposed to the control training set, in a cross-validation setting) that belong to cluster i and are assigned (by the clustering algorithm) to cluster j. When these numbers are transformed into proportions, these matrices are sometimes called contingency tables. A wrongly assigned observation is called false positive (non-fraudulent transaction erroneously labelled as fraudulent) or false negative (fraudulent transaction erroneously labelled as non- fraudulent). The higher the concentration of observations in the diagonal of the confusion matrix, the higher the accuracy / predictive power of your clustering algorithm.

Gain and Lift Chart. Lift is a measure of the effectiveness of a predictive model calculated as the ratio between the results obtained with and without the predictive model. Cumulative gains and lift charts are visual aids for measuring model performance. Both charts consist of a lift curve and a baseline. Click here for details

Kolmogorov-Smirnov Chart. This non-parametric statistical test is used to compare two distributions, to assess how close they are to each other. In this context, one of the distributions is the theoretical distribution that the observations are supposed to follow (usually a continuous distribution with one or two parameters, such as Gaussian law), while the other distribution is the actual, empirical, parameter-free, discrete distribution computed on the observations.

Chi Square. It is another statistical test similar to Kolmogorov-Smirnov, but in this case it is a parametric test. It requires you to aggreate observations in a number of buckets or bins, each with at least 10 observations. 

ROC curve. Unlike the lift chart, the ROC curve is almost independent of the response rate. The receiver operating characteristic (ROC), or ROC curve, is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied. The curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The true-positive rate is also known as sensitivity or the sensitivity index d', known as "d-prime" in signal detection and biomedical informatics, or recall in machine learning. The false-positive rate is also known as the fall-out and can be calculated as (1 - specificity). The ROC curve is thus the sensitivity as a function of fall-out. Click here for details

Gini Coefficient.  The Gini coefficient is sometimes used in classification problems.  Gini = 2*AUC  - 1, where AUC is the area under the curve (see the ROC curve entry above). A Gini ratio above 60% corresponds to a good model. Not to be confused with the Gini index or Gini impurity, used when building decision trees.

Root Mean Square Error. RMSE is the must used and abused metric to compute goodness of fit. It is defined as the square root of the absolute value of the correlation coefficient between true values and predicted values, and widely used by Excel users.

L^1 version of RSME. The RSME metric (see above entry) is an L^2 metric, sensitive to outliers. Modern metrics are L^1 and sometimes based on rank statistics rather than raw data. One of these new metrics, developed by our data scientist, is described here.  

Cross Validation. This is a general framework to assess how a model will perform in the future; it is also used for model selection. It consists of splitting your training set into test and control data sets, training your algorithm (classifier, or predictive algorithm) on the control data set, and testing it on the test data set. Since the true values are known on the test data set, you can compare them with your predicted values, using one of the other comparison tools mentioned in this article. Usually the test data set itself is split into multiple subsets or data bins, to compute confidence intervals for predicted values. The test data set must be carefully selected, and must include different time frames and different types of observations (compared with the control data set), each with enough data points, in order to get sound, reliable conclusions as how the model will perform on future data, or on data that has slightly involved. Another idea is to introduce noise in the test data set and see how it impacts prediction: this is referred to as model sensitivity analysis. 

Predictive Power. This metric was developed internally at Data Science Central by our data scientist. It is related to the concept of entropy or the Gini index mentioned above in this article. It was designed as a synthetic metric satisfying interesting properties, and used to select a good subset of features in any machine learning project, or as a criterion to decide which node to split at each iteration, when building decision trees. Click here for details.

 

本文转载自:http://www.datasciencecentral.com/profiles/blogs/7-important-model-evaluation-error-metrics-everyone

共有 人打赏支持
q
粉丝 0
博文 2
码字总数 0
作品 0
朝阳
私信 提问
GIS几个重要的研究方向

1 空间数据库的准确性研究 地理信息数据中误差处理和不确定性错误处理的方法和技术 ,包括 : 不确定性误差模型 ; 误差跟踪并对误差进行编码的方法 ; 计算和表达在 GIS应用中的误差 ; 数据精度...

晨曦之光
2012/04/12
213
1
入门 | 用机器学习进行欺诈预测的模型设计

Airbnb网站基于允许任何人将闲置的房屋进行长期或短期出租构建商业模式,来自房客或房东的欺诈风险是必须解决的问题。Airbnb信任和安全小组通过构建机器学习模型进行欺诈预测,本文介绍了其设...

技术小能手
09/11
0
0
基于机器学习的工控安全风险评估

1 引言 随着工业控制网络与企业信息网络的不断融合,工业控制系统的安全管理受到了重大的挑战。工控系统安全等级评估是安全管理的重要内容,传统的安全等级评估方法主要有故障树分析法、层次...

liqing1310200526
03/14
0
0
Ian Goodfellow:你的GAN水平我来打分

  选自arXiv   作者:Catherine Olsson, Surya Bhupatiraju, Tom Brown, Augustus Odena, Ian Goodfellow   机器之心编译   机器之心编辑部      如何评价生成模型的性能好坏?这...

机器之心
08/17
0
0
最完整的检测模型评估指标mAP计算指南(附代码)在这里!

前言 对于使用机器学习解决的大多数常见问题,通常有多种可用的模型。每个模型都有自己的独特之处,并随因素变化而表现不同每个模型在“验证/测试”数据集上来评估性能,性能衡量使用各种统计...

机器学习算法工程师
06/19
0
0

没有更多内容

加载失败,请刷新页面

加载更多

微服务分布式事务实现

https://www.processon.com/view/link/5b2144d7e4b001a14d3d2d30

WALK_MAN
今天
2
0
《大漠烟尘》读书笔记及读后感文章3700字

《大漠烟尘》读书笔记及读后感文章3700字: 在这个浮躁的社会里,你有多久没有好好读完一本书了? 我们总觉得自己和别人不一样,所以当看到别人身上的问题时,很少有“反求诸己”,反思自己。...

原创小博客
今天
3
0
大数据教程(9.5)用MR实现sql中的jion逻辑

上一篇博客讲解了使用jar -jar的方式来运行提交MR程序,以及通过修改YarnRunner的源码来实现MR的windows开发环境提交到集群的方式。本篇博主将分享sql中常见的join操作。 一、需求 订单数据表...

em_aaron
今天
3
0
十万个为什么之什么是resultful规范

起源 越来越多的人开始意识到,网站即软件,而且是一种新型的软件。这种"互联网软件"采用客户端/服务器模式,建立在分布式体系上,通过互联网通信,具有高延时(high latency)、高并发等特点...

尾生
今天
3
0
Terraform配置文件(Terraform configuration)

Terraform配置文件 翻译自Terraform Configuration Terraform用文本文件来描述设备、设置变量。这些文件被称为Terraform配置文件,以.tf结尾。这一部分将讲述Terraform配置文件的加载与格式。...

buddie
今天
5
0

没有更多内容

加载失败,请刷新页面

加载更多

返回顶部
顶部