R数据分析及可视化的一个简单例子
R数据分析及可视化的一个简单例子
yinkaipeng 发表于3年前
R数据分析及可视化的一个简单例子
  • 发表于 3年前
  • 阅读 371
  • 收藏 1
  • 点赞 0
  • 评论 0

新睿云服务器60天免费使用,快来体验!>>>   

摘要: R是一种统计学语言 R是一个面向对象语言 R为数据挖掘提供了灵活且高效的算法 R是数据可视化的一把利器 Rhadoop炼数成金

需求分析

葡萄牙某银行拟根据现有客户资料建立预测模型,以配合其数据库营销策略,营销方式为电话直销,销售产品为某金融产品(term deposit),数据分析的目标为通过预测模型识别对该金融产品有较高购买意愿的用户群。  

数据形式:从数据库中导出的excel文件

数据内容:

 # bank client data:
   1 - age (numeric)
   2 - job : type of job (categorical: "admin.","unknown","unemployed","management","housemaid","entrepreneur","student",
                                       "blue-collar","self-employed","retired","technician","services")
   3 - marital : marital status (categorical: "married","divorced","single"; note: "divorced" means divorced or widowed)
   4 - education (categorical: "unknown","secondary","primary","tertiary")
   5 - default: has credit in default? (binary: "yes","no")
   6 - balance: average yearly balance, in euros (numeric)
   7 - housing: has housing loan? (binary: "yes","no")
   8 - loan: has personal loan? (binary: "yes","no")
   # related with the last contact of the current campaign:
   9 - contact: contact communication type (categorical: "unknown","telephone","cellular")
  10 - day: last contact day of the month (numeric)
  11 - month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")
  12 - duration: last contact duration, in seconds (numeric)
   # other attributes:
  13 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
  14 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)
  15 - previous: number of contacts performed before this campaign and for this client (numeric)
  16 - poutcome: outcome of the previous marketing campaign (categorical: "unknown","other","failure","success")

  Output variable (desired target):
  17 - y - has the client subscribed a term deposit? (binary: "yes","no")

下面用R语言进行数据分析及可视化

> bank <- read.csv("H:/bank/bank-full.csv", sep = ";", header = T) //载入数据
> summary(bank) //对数据进行分析汇总

//导入要用到的包
> library(caret)
> library(ggplot2)
> library(gplots)
 
> require(rpart)
> require(caret)
> require(ggplot2)
> require(gplots)
//运用决策树模型对数据做初步分类建模和变量选择
> bank.tree <- rpart(y ~ ., data = bank, method = "class", cp = 0.001)
> treeImp <- varImp(bank.tree, scale = TRUE, surrogates = FALSE, competes = TRUE)
> treeImp$Variable <- rownames(treeImp)
> treeImp.sort <- treeImp[order(-treeImp$Overall), ]
> ggplot(treeImp, aes(Variable, Overall)) + geom_bar(stat = "identity") + coord_flip()

//根据cpplot对树做裁剪
plotcp(bank.tree)

//输出
> printcp(bank.tree)

Classification tree:

rpart(formula = y ~ ., data = bank, method = "class", cp = 0.001)

 

Variables actually used in tree construction:

 [1] age       balance   contact   day       duration  education housing   job       marital 

[10] month     pdays     poutcome  previous

 

Root node error: 5289/45211 = 0.11698

 

n= 45211

 

          CP nsplit rel error  xerror     xstd

0.0380034      0   1.00000 1.00000 0.012921

0.0253356      3   0.88599 0.89147 0.012287

0.0170164      4   0.86065 0.86425 0.012120

0.0080355      5   0.84364 0.84969 0.012028

0.0042226      7   0.82757 0.84061 0.011971

0.0039705     10   0.81490 0.84061 0.011971

0.0034033     13   0.80299 0.83324 0.011924

0.0022373     15   0.79618 0.82133 0.011848

0.0019537     21   0.78276 0.81584 0.011812

10 0.0017962     24   0.77690 0.81566 0.011811

11 0.0016071     26   0.77330 0.81717 0.011821

12 0.0015126     30   0.76687 0.82057 0.011843

13 0.0014180     32   0.76385 0.81906 0.011833

14 0.0013235     40   0.75099 0.81887 0.011832

15 0.0012290     44   0.74570 0.81755 0.011823

16 0.0011344     51   0.73587 0.81660 0.011817

17 0.0010399     61   0.72452 0.81868 0.011831

18 0.0010000     63   0.72244 0.82416 0.011866

//绘制决策树
> bank.tree <- rpart(y ~ ., data = bank, method = "class", cp = 0.0022373)
> plot(bank.tree, branch = 0, margin = 0.1, uniform = T)
> text(bank.tree, use.n = T, col = "red", cex = 0.6)

//变量初选,分析和变换

根据决策树分析的结果,我们选择变量重要性最高的前5个变量做进一步研究,依次是:

 

Duration : last contact duration, in seconds (numeric)

month : last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")

poutcome : outcome of the previous marketing campaign (categorical: "unknown","other","failure","success")

pdays : number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)

previous : number of contacts performed before this campaign and for this client (numeric)

//a) Duration参数分析
> bank$y_dummy = ifelse(bank$y == "yes", 1, 0)
> summary(bank$duration)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    0.0   103.0   180.0   258.2   319.0  4918.0 
> ggplot(bank, aes(duration, y_dummy)) + geom_smooth() + geom_point()

//根据拟合形态对Duration做一个二次项。
> bank$duration.sq <- bank$duration * bank$duration

//b)对month 参数分析
> summary(bank$month)
  apr   aug   dec   feb   jan   jul   jun   mar   may   nov   oct   sep 
 2932  6247   214  2649  1403  6895  5341   477 13766  3970   738   579 
> library(Rcmdr)
> plotMeans(bank$y_dummy, bank$month, error.bars = "se")

> bank$month.sel <- ifelse(bank$month == "dec", 1, 0)
> bank$month.sel <- ifelse(bank$month == "mar", 1, bank$month)
> bank$month.sel <- ifelse(bank$month == "oct", 1, bank$month)
> bank$month.sel <- ifelse(bank$month == "sep", 1, bank$month)

//c) poutcome参数分析
>summary(bank$poutcome)
failure   other success unknown 
   4901    1840    1511   36959 
> plotMeans(bank$y_dummy, bank$poutcome, error.bars = "se")

> bank$poutcome.success <- ifelse(bank$poutcome == "success", 1, 0)

//d)pdays参数分析
 
> summary(bank$pdays)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   -1.0    -1.0    -1.0    40.2    -1.0   871.0 
> bank$nocontact <- ifelse(bank$pdays == -1, 1, 0)
> bank$pdays <- ifelse(bank$pdays == -1, 0, bank$pdays)
> summary(bank$nocontact)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  1.0000  1.0000  0.8174  1.0000  1.0000 
> plotMeans(bank$y_dummy, as.factor(bank$nocontact), error.bars = "se")

> ggplot(bank, aes(log(pdays + 1))) + geom_histogram()

> ggplot(bank, aes(log(pdays + 1), y_dummy)) + geom_smooth() + geom_point()

//e) previous参数分析
 
> summary(bank$previous)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
  0.0000   0.0000   0.0000   0.5803   0.0000 275.0000 
 
> ggplot(bank, aes(log(previous + 1))) + geom_histogram()

> ggplot(bank, aes(log(previous + 1), y_dummy)) + geom_smooth() + geom_point()

> bank$previous.0 <- as.factor(ifelse(bank$previous == 0, 1, 0))
> bank$previous.1 <- as.factor(ifelse(bank$previous == 1, 1, 0))
> bank$previous.2 <- as.factor(ifelse(bank$previous == 2, 1, 0))
> bank$previous.2plus <- as.factor(ifelse(bank$previous > 2, 1, 0))

//逻辑回归建模
> logistic.full <- glm(y_dummy ~ duration + duration.sq + month.sel + poutcome.success + 
+     bank$nocontact + log(pdays + 1) + bank$previous.0 + bank$previous.1 + bank$previous.2 + 
+     bank$previous.2plus, data = bank)
> summary(logistic.full)

Call:

glm(formula = y_dummy ~ duration + duration.sq + month.sel +

    poutcome.success + bank$nocontact + log(pdays + 1) + bank$previous.0 +

    bank$previous.1 + bank$previous.2 + bank$previous.2plus,

    data = bank)

 

Deviance Residuals:

     Min        1Q    Median        3Q       Max 

-1.15672  -0.11482  -0.04176   0.01313   1.08332 

 

Coefficients: (2 not defined because of singularities)

                       Estimate Std. Error t value Pr(>|t|)   

(Intercept)           1.583e-01  2.348e-02   6.742 1.58e-11 ***

duration              6.574e-04  9.605e-06  68.444  < 2e-16 ***

duration.sq          -1.350e-07  6.145e-09 -21.971  < 2e-16 ***

month.sel            -6.720e-03  4.346e-04 -15.463  < 2e-16 ***

poutcome.success      4.555e-01  8.085e-03  56.340  < 2e-16 ***

bank$nocontact       -1.749e-01  2.336e-02  -7.490 7.04e-14 ***

log(pdays + 1)       -2.106e-02  4.335e-03  -4.858 1.19e-06 ***

bank$previous.01             NA         NA      NA       NA   

bank$previous.11     -2.522e-02  7.126e-03  -3.540 0.000401 ***

bank$previous.21     -1.641e-02  7.714e-03  -2.128 0.033367 * 

bank$previous.2plus1         NA         NA      NA       NA   

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 

(Dispersion parameter for gaussian family taken to be 0.07690626)

 

    Null deviance: 4670.3  on 45210  degrees of freedom

Residual deviance: 3476.3  on 45202  degrees of freedom

AIC: 12340

 

Number of Fisher Scoring iterations: 2

> logistic.step <- step(logistic.full, direction = "both", k = 2)


Start:  AIC=12340.45

y_dummy ~ duration + duration.sq + month.sel + poutcome.success +

    bank$nocontact + log(pdays + 1) + bank$previous.0 + bank$previous.1 +

    bank$previous.2 + bank$previous.2plus

 

 

Step:  AIC=12340.45

y_dummy ~ duration + duration.sq + month.sel + poutcome.success +

    bank$nocontact + log(pdays + 1) + bank$previous.0 + bank$previous.1 +

    bank$previous.2

 

 

Step:  AIC=12340.45

y_dummy ~ duration + duration.sq + month.sel + poutcome.success +

    bank$nocontact + log(pdays + 1) + bank$previous.1 + bank$previous.2

 

                   Df Deviance   AIC

<none>                  3476.3 12340

- bank$previous.2   1   3476.7 12343

- bank$previous.1   1   3477.3 12351

- log(pdays + 1)    1   3478.1 12362

- bank$nocontact    1   3480.6 12394

- month.sel         1   3494.7 12577

- duration.sq       1   3513.4 12819

- poutcome.success  1   3720.4 15407

- duration          1   3836.6 16797

> summary(logistic.step)

Call:

glm(formula = y_dummy ~ duration + duration.sq + month.sel +

    poutcome.success + bank$nocontact + log(pdays + 1) + bank$previous.1 +

    bank$previous.2, data = bank)

 

Deviance Residuals:

     Min        1Q    Median        3Q       Max 

-1.15672  -0.11482  -0.04176   0.01313   1.08332 

 

Coefficients:

                   Estimate Std. Error t value Pr(>|t|)   

(Intercept)       1.583e-01  2.348e-02   6.742 1.58e-11 ***

duration          6.574e-04  9.605e-06  68.444  < 2e-16 ***

duration.sq      -1.350e-07  6.145e-09 -21.971  < 2e-16 ***

month.sel        -6.720e-03  4.346e-04 -15.463  < 2e-16 ***

poutcome.success  4.555e-01  8.085e-03  56.340  < 2e-16 ***

bank$nocontact   -1.749e-01  2.336e-02  -7.490 7.04e-14 ***

log(pdays + 1)   -2.106e-02  4.335e-03  -4.858 1.19e-06 ***

bank$previous.11 -2.522e-02  7.126e-03  -3.540 0.000401 ***

bank$previous.21 -1.641e-02  7.714e-03  -2.128 0.033367 * 

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 

(Dispersion parameter for gaussian family taken to be 0.07690626)

 

    Null deviance: 4670.3  on 45210  degrees of freedom

Residual deviance: 3476.3  on 45202  degrees of freedom

AIC: 12340

 

Number of Fisher Scoring iterations: 2

//模型scoring和ROC评估
> require(ROCR)
载入需要的程辑包:ROCR
> bank.pred<-1/(1+exp(-predict(logistic.step)))
> roc.data <- prediction(bank.pred, labels = bank$y)
> roc.data <- performance(roc.data, "tpr", "fpr")
> plot(roc.data)

> score<-data.frame("prob.y"=bank.pred,"y"=as.factor(bank$y_dummy))
>ggplot(score, aes(x=prob.y, fill=y)) + geom_histogram(position="identity", binwidth=0.01,alpha=0.5)

通过对ROCScore分布的分析,逻辑回归Score的分类效果还是不错的。具体的score cutoff值需要根据业务要求和营销成本而定。

  • 打赏
  • 点赞
  • 收藏
  • 分享
共有 人打赏支持
粉丝 0
博文 9
码字总数 8858
×
yinkaipeng
如果觉得我的文章对您有用,请随意打赏。您的支持将鼓励我继续创作!
* 金额(元)
¥1 ¥5 ¥10 ¥20 其他金额
打赏人
留言
* 支付类型
微信扫码支付
打赏金额:
已支付成功
打赏金额: