文档章节

R数据分析及可视化的一个简单例子

yinkaipeng
 yinkaipeng
发布于 2014/11/12 09:37
字数 1754
阅读 420
收藏 1

需求分析

葡萄牙某银行拟根据现有客户资料建立预测模型,以配合其数据库营销策略,营销方式为电话直销,销售产品为某金融产品(term deposit),数据分析的目标为通过预测模型识别对该金融产品有较高购买意愿的用户群。  

数据形式:从数据库中导出的excel文件

数据内容:

 # bank client data:
   1 - age (numeric)
   2 - job : type of job (categorical: "admin.","unknown","unemployed","management","housemaid","entrepreneur","student",
                                       "blue-collar","self-employed","retired","technician","services")
   3 - marital : marital status (categorical: "married","divorced","single"; note: "divorced" means divorced or widowed)
   4 - education (categorical: "unknown","secondary","primary","tertiary")
   5 - default: has credit in default? (binary: "yes","no")
   6 - balance: average yearly balance, in euros (numeric)
   7 - housing: has housing loan? (binary: "yes","no")
   8 - loan: has personal loan? (binary: "yes","no")
   # related with the last contact of the current campaign:
   9 - contact: contact communication type (categorical: "unknown","telephone","cellular")
  10 - day: last contact day of the month (numeric)
  11 - month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")
  12 - duration: last contact duration, in seconds (numeric)
   # other attributes:
  13 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
  14 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)
  15 - previous: number of contacts performed before this campaign and for this client (numeric)
  16 - poutcome: outcome of the previous marketing campaign (categorical: "unknown","other","failure","success")

  Output variable (desired target):
  17 - y - has the client subscribed a term deposit? (binary: "yes","no")

下面用R语言进行数据分析及可视化

> bank <- read.csv("H:/bank/bank-full.csv", sep = ";", header = T) //载入数据
> summary(bank) //对数据进行分析汇总

//导入要用到的包
> library(caret)
> library(ggplot2)
> library(gplots)
 
> require(rpart)
> require(caret)
> require(ggplot2)
> require(gplots)
//运用决策树模型对数据做初步分类建模和变量选择
> bank.tree <- rpart(y ~ ., data = bank, method = "class", cp = 0.001)
> treeImp <- varImp(bank.tree, scale = TRUE, surrogates = FALSE, competes = TRUE)
> treeImp$Variable <- rownames(treeImp)
> treeImp.sort <- treeImp[order(-treeImp$Overall), ]
> ggplot(treeImp, aes(Variable, Overall)) + geom_bar(stat = "identity") + coord_flip()

//根据cpplot对树做裁剪
plotcp(bank.tree)

//输出
> printcp(bank.tree)

Classification tree:

rpart(formula = y ~ ., data = bank, method = "class", cp = 0.001)

 

Variables actually used in tree construction:

 [1] age       balance   contact   day       duration  education housing   job       marital 

[10] month     pdays     poutcome  previous

 

Root node error: 5289/45211 = 0.11698

 

n= 45211

 

          CP nsplit rel error  xerror     xstd

0.0380034      0   1.00000 1.00000 0.012921

0.0253356      3   0.88599 0.89147 0.012287

0.0170164      4   0.86065 0.86425 0.012120

0.0080355      5   0.84364 0.84969 0.012028

0.0042226      7   0.82757 0.84061 0.011971

0.0039705     10   0.81490 0.84061 0.011971

0.0034033     13   0.80299 0.83324 0.011924

0.0022373     15   0.79618 0.82133 0.011848

0.0019537     21   0.78276 0.81584 0.011812

10 0.0017962     24   0.77690 0.81566 0.011811

11 0.0016071     26   0.77330 0.81717 0.011821

12 0.0015126     30   0.76687 0.82057 0.011843

13 0.0014180     32   0.76385 0.81906 0.011833

14 0.0013235     40   0.75099 0.81887 0.011832

15 0.0012290     44   0.74570 0.81755 0.011823

16 0.0011344     51   0.73587 0.81660 0.011817

17 0.0010399     61   0.72452 0.81868 0.011831

18 0.0010000     63   0.72244 0.82416 0.011866

//绘制决策树
> bank.tree <- rpart(y ~ ., data = bank, method = "class", cp = 0.0022373)
> plot(bank.tree, branch = 0, margin = 0.1, uniform = T)
> text(bank.tree, use.n = T, col = "red", cex = 0.6)

//变量初选,分析和变换

根据决策树分析的结果,我们选择变量重要性最高的前5个变量做进一步研究,依次是:

 

Duration : last contact duration, in seconds (numeric)

month : last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")

poutcome : outcome of the previous marketing campaign (categorical: "unknown","other","failure","success")

pdays : number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)

previous : number of contacts performed before this campaign and for this client (numeric)

//a) Duration参数分析
> bank$y_dummy = ifelse(bank$y == "yes", 1, 0)
> summary(bank$duration)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    0.0   103.0   180.0   258.2   319.0  4918.0 
> ggplot(bank, aes(duration, y_dummy)) + geom_smooth() + geom_point()

//根据拟合形态对Duration做一个二次项。
> bank$duration.sq <- bank$duration * bank$duration

//b)对month 参数分析
> summary(bank$month)
  apr   aug   dec   feb   jan   jul   jun   mar   may   nov   oct   sep 
 2932  6247   214  2649  1403  6895  5341   477 13766  3970   738   579 
> library(Rcmdr)
> plotMeans(bank$y_dummy, bank$month, error.bars = "se")

> bank$month.sel <- ifelse(bank$month == "dec", 1, 0)
> bank$month.sel <- ifelse(bank$month == "mar", 1, bank$month)
> bank$month.sel <- ifelse(bank$month == "oct", 1, bank$month)
> bank$month.sel <- ifelse(bank$month == "sep", 1, bank$month)

//c) poutcome参数分析
>summary(bank$poutcome)
failure   other success unknown 
   4901    1840    1511   36959 
> plotMeans(bank$y_dummy, bank$poutcome, error.bars = "se")

> bank$poutcome.success <- ifelse(bank$poutcome == "success", 1, 0)

//d)pdays参数分析
 
> summary(bank$pdays)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   -1.0    -1.0    -1.0    40.2    -1.0   871.0 
> bank$nocontact <- ifelse(bank$pdays == -1, 1, 0)
> bank$pdays <- ifelse(bank$pdays == -1, 0, bank$pdays)
> summary(bank$nocontact)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  1.0000  1.0000  0.8174  1.0000  1.0000 
> plotMeans(bank$y_dummy, as.factor(bank$nocontact), error.bars = "se")

> ggplot(bank, aes(log(pdays + 1))) + geom_histogram()

> ggplot(bank, aes(log(pdays + 1), y_dummy)) + geom_smooth() + geom_point()

//e) previous参数分析
 
> summary(bank$previous)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
  0.0000   0.0000   0.0000   0.5803   0.0000 275.0000 
 
> ggplot(bank, aes(log(previous + 1))) + geom_histogram()

> ggplot(bank, aes(log(previous + 1), y_dummy)) + geom_smooth() + geom_point()

> bank$previous.0 <- as.factor(ifelse(bank$previous == 0, 1, 0))
> bank$previous.1 <- as.factor(ifelse(bank$previous == 1, 1, 0))
> bank$previous.2 <- as.factor(ifelse(bank$previous == 2, 1, 0))
> bank$previous.2plus <- as.factor(ifelse(bank$previous > 2, 1, 0))

//逻辑回归建模
> logistic.full <- glm(y_dummy ~ duration + duration.sq + month.sel + poutcome.success + 
+     bank$nocontact + log(pdays + 1) + bank$previous.0 + bank$previous.1 + bank$previous.2 + 
+     bank$previous.2plus, data = bank)
> summary(logistic.full)

Call:

glm(formula = y_dummy ~ duration + duration.sq + month.sel +

    poutcome.success + bank$nocontact + log(pdays + 1) + bank$previous.0 +

    bank$previous.1 + bank$previous.2 + bank$previous.2plus,

    data = bank)

 

Deviance Residuals:

     Min        1Q    Median        3Q       Max 

-1.15672  -0.11482  -0.04176   0.01313   1.08332 

 

Coefficients: (2 not defined because of singularities)

                       Estimate Std. Error t value Pr(>|t|)   

(Intercept)           1.583e-01  2.348e-02   6.742 1.58e-11 ***

duration              6.574e-04  9.605e-06  68.444  < 2e-16 ***

duration.sq          -1.350e-07  6.145e-09 -21.971  < 2e-16 ***

month.sel            -6.720e-03  4.346e-04 -15.463  < 2e-16 ***

poutcome.success      4.555e-01  8.085e-03  56.340  < 2e-16 ***

bank$nocontact       -1.749e-01  2.336e-02  -7.490 7.04e-14 ***

log(pdays + 1)       -2.106e-02  4.335e-03  -4.858 1.19e-06 ***

bank$previous.01             NA         NA      NA       NA   

bank$previous.11     -2.522e-02  7.126e-03  -3.540 0.000401 ***

bank$previous.21     -1.641e-02  7.714e-03  -2.128 0.033367 * 

bank$previous.2plus1         NA         NA      NA       NA   

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 

(Dispersion parameter for gaussian family taken to be 0.07690626)

 

    Null deviance: 4670.3  on 45210  degrees of freedom

Residual deviance: 3476.3  on 45202  degrees of freedom

AIC: 12340

 

Number of Fisher Scoring iterations: 2

> logistic.step <- step(logistic.full, direction = "both", k = 2)


Start:  AIC=12340.45

y_dummy ~ duration + duration.sq + month.sel + poutcome.success +

    bank$nocontact + log(pdays + 1) + bank$previous.0 + bank$previous.1 +

    bank$previous.2 + bank$previous.2plus

 

 

Step:  AIC=12340.45

y_dummy ~ duration + duration.sq + month.sel + poutcome.success +

    bank$nocontact + log(pdays + 1) + bank$previous.0 + bank$previous.1 +

    bank$previous.2

 

 

Step:  AIC=12340.45

y_dummy ~ duration + duration.sq + month.sel + poutcome.success +

    bank$nocontact + log(pdays + 1) + bank$previous.1 + bank$previous.2

 

                   Df Deviance   AIC

<none>                  3476.3 12340

- bank$previous.2   1   3476.7 12343

- bank$previous.1   1   3477.3 12351

- log(pdays + 1)    1   3478.1 12362

- bank$nocontact    1   3480.6 12394

- month.sel         1   3494.7 12577

- duration.sq       1   3513.4 12819

- poutcome.success  1   3720.4 15407

- duration          1   3836.6 16797

> summary(logistic.step)

Call:

glm(formula = y_dummy ~ duration + duration.sq + month.sel +

    poutcome.success + bank$nocontact + log(pdays + 1) + bank$previous.1 +

    bank$previous.2, data = bank)

 

Deviance Residuals:

     Min        1Q    Median        3Q       Max 

-1.15672  -0.11482  -0.04176   0.01313   1.08332 

 

Coefficients:

                   Estimate Std. Error t value Pr(>|t|)   

(Intercept)       1.583e-01  2.348e-02   6.742 1.58e-11 ***

duration          6.574e-04  9.605e-06  68.444  < 2e-16 ***

duration.sq      -1.350e-07  6.145e-09 -21.971  < 2e-16 ***

month.sel        -6.720e-03  4.346e-04 -15.463  < 2e-16 ***

poutcome.success  4.555e-01  8.085e-03  56.340  < 2e-16 ***

bank$nocontact   -1.749e-01  2.336e-02  -7.490 7.04e-14 ***

log(pdays + 1)   -2.106e-02  4.335e-03  -4.858 1.19e-06 ***

bank$previous.11 -2.522e-02  7.126e-03  -3.540 0.000401 ***

bank$previous.21 -1.641e-02  7.714e-03  -2.128 0.033367 * 

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 

(Dispersion parameter for gaussian family taken to be 0.07690626)

 

    Null deviance: 4670.3  on 45210  degrees of freedom

Residual deviance: 3476.3  on 45202  degrees of freedom

AIC: 12340

 

Number of Fisher Scoring iterations: 2

//模型scoring和ROC评估
> require(ROCR)
载入需要的程辑包:ROCR
> bank.pred<-1/(1+exp(-predict(logistic.step)))
> roc.data <- prediction(bank.pred, labels = bank$y)
> roc.data <- performance(roc.data, "tpr", "fpr")
> plot(roc.data)

> score<-data.frame("prob.y"=bank.pred,"y"=as.factor(bank$y_dummy))
>ggplot(score, aes(x=prob.y, fill=y)) + geom_histogram(position="identity", binwidth=0.01,alpha=0.5)

通过对ROCScore分布的分析,逻辑回归Score的分类效果还是不错的。具体的score cutoff值需要根据业务要求和营销成本而定。

© 著作权归作者所有

共有 人打赏支持
yinkaipeng
粉丝 0
博文 11
码字总数 11509
作品 0
西安
程序员
在ipython notebook中调用ggplot的三种不同的方法

在大数据时代,数据可视化是一个非常热门的话题。各个BI的厂商无不在数据可视化领域里投入大量的精力。Tableau凭借其强大的数据可视化的功能成为硅谷炙手可热的上市公司。Tableau的数据可视化...

naughty
2014/07/03
0
2
数据可视化之美:经典案例与实践解析

作者:jihong10102006 随着DT时代的到来,传统的统计图表很难对复杂数据进行直观地展示。这几年数据可视化作为一个新研究领域也变得越来越火。成功的可视化,如果做得漂亮,虽表面简单却富含...

op07p6aaqo9u71
2017/10/26
0
0
R | 对亚马逊新总部可能位置进行可视化

前不久,亚马逊宣布在寻找一个城市建立第二个总部。 他们对这个城市的标准是超过100万人,同时这个城市也要人才济济。 在一个新闻网站上面发现,我发现了一些可能的城市名单,他们都有超过1...

wzgl__wh
2017/10/28
0
0
Python VS R,哪个更适合开发者?

21CTO社区导读:Python和R同时出现在20世纪九十年代。它为数据科学家提供了大量的数据模型操作的能力。在发展了30年后,Python与R开始出现不同的爱好者阵营。 90年代出现了互联网,这一事物永...

21CTO
04/28
0
0
Python和R:学哪个好?

在当下,人工智能的浪潮席卷而来。从AlphaGo、无人驾驶技术、人脸识别、语音对话,到商城推荐系统,金融业的风控,量化运营、用户洞察、企业征信、智能投顾等,人工智能的应用广泛渗透到各行...

个推君
2017/11/28
0
0

没有更多内容

加载失败,请刷新页面

加载更多

ajax 提交返回map集合 获取不到值

后台java代码 @RequestMapping("/cameraList") @ResponseBody public Map<String, Object> cameraListForPage(@RequestParam(defaultValue = "1", value = "page") Integer page......

S三少S
5分钟前
0
0
TypeScrip最污的技术课-技术胖TypeScript图文视频教程

近日Node.js之父瑞安达尔(Ryan Dahl)发布新的开源项目 deno,从官方介绍来看,可以认为它是下一代 Node,使用 rust 语言代替 C++ 重新编写跨平台底层内核驱动,上层仍然使用 V8 引擎,最终...

JamesView
7分钟前
5
0
Es学习笔记

1.过滤排重聚合查询 筛选出某一个聚合值的个数统计。相当于mysql的distinct. 关键字:cardinality "aggs": { "2":{ "cardinality": { "field": "field" } ...

Gmupload
9分钟前
0
0
h5语义化标签

语义化HTML:用最恰当的HTML元素标签做恰当的事情。 优点: 提升可访问性; SEO; 结构清晰,利于维护; 通用容器:div——块级通用容器;span——短语内容无语义容器。 <title></title>:简...

莫西摩西
15分钟前
0
0
修改11g rac中 asm diskstring的发现路径

问题 : 如果我 们asm_disking以前是/dev/oracleasm/disks/* ,并且现在已经有磁盘组再用这个磁盘串了,那么,我们无法直接修改这个发现串为 ORCL:*,修改会报错,提示存在的磁盘无法使用新的...

tututu_jiang
18分钟前
0
0

没有更多内容

加载失败,请刷新页面

加载更多

返回顶部
顶部