ISLR第六章Linear Model Selection and Regularization

2018/03/06 18:35
阅读数 25


Why might we want to use another fitting procedure instead of least squares?

better prediction accuracy(预测精度) and better model interpretability(模型解释力).


Subset Selection、Shrinkage、Dimension Reduction

6.1Subset Selection

6.1.1 Best Subset Selection 

该方法从p 个预测变量中挑选出与响应变最相关的变量形成子集,再对缩减的变量集合使用最小二乘方法。



 6.1.2 Stepwise Selection

由于运算效率的限制,当p 很大时,最优子集选择方法不再适用,而且也存在一些统计学上的问题。随着搜索空间始增大,






向后选择方法将满足样本量n 大于变放个数p (保证全模型可以被拟合)的条件。相反,
向前逐步选择即使在n <p 的情况下也可以使用,因此当p非常大的时候,向前逐步选择是唯




6.1.3 Choosing the Optimal Model 

In order to select the best model with respect to test error, we need to estimate this test error. There are two common approaches:

1. We can indirectly estimate test error by making an adjustment to the training error to account for the bias due to overfitting. 

2. We can directly estimate the test error, using either a validation set approach or a cross-validation approach, as discussed in Chapter 5.

We consider both of these approaches below.

6.2 Shrinkage Methods


可以显著减少估计量方差。两种最常用的方法是岭回归(ridge regression) 和llasso。

6.2.1 Ridge Regression



其中λ 是一个调节参数(turning parameter) ,将单独确定。

called a shrinkage penalty  is shrinkage small when β1, . . . , βp are close to zero, and so it has the effect of shrinking penalty the estimates of βj towards zero. 

Selecting a good value for λ is critical;we defer this discussion to Section 6.2.3, where we use cross-validation


It is best to apply ridge regression after standardizing the predictors, using the formula


Why Does Ridge Regression Improve Over Least Squares? 

     Ridge regression’s advantage over least squares is rooted in the bias-variance trade-off. 

     In particular, when the number of variables p is almost as large as the number of observations n, as in the example in Figure 6.5, the least

squares estimates will be extremely variable. And if p > n, then the least squares estimates do not even have a unique solution,whereas

ridge regression can still perform well by trading off a small increase in bias for a large decrease in variance. Hence, ridge regression works

best in situations where the least squares estimates have high variance.(当n<p或n=p时最小二乘法将有很大的方差,然而岭回归用一点偏差的增加大大地减小了方差的值)

6.2.2 The Lasso  

    Ridge regression does have one obvious disadvantage. Unlike best subset, forward stepwise, and backward stepwise selection,

which will generally select models that involve just a subset of the variables, ridge regression will include all p predictors in the final

model.The penalty λβ2 j in (6.5) will shrink all of the coefficients towards zero, but it will not set any of them exactly to zero (unless λ = ∞).

This may not be a problem for prediction accuracy, but it can create a challenge in model interpretation in settings in which the number

of variables p is quite large. (岭回归不会将任何一个变量的系数压缩至0,这种设定不影响预测精度,但当变量个数非常大的时候,不便于模型解释)

     For example, in the Credit data set, it appears that the most important variables are income, limit, rating, and student. So we might wish to

build a model including just these predictors. However, ridge regression will always generate a model involving all ten predictors. Increasing

the value of λ will tend to reduce the magnitudes of the coefficients, but will not result in exclusion of any of the variables.




6.2.3 Selecting the Tuning Parameter 

We then select the tuning parameter value for which the cross-validation error is smallest. Finally, the model is re-fit using all of

the available observations and the selected value of the tuning parameter.

6.3 Dimension Reduction Methods

 We now explore a class of approaches that transform the predictors and then fit a least squares model using the transformed variables.

We will refer to these techniques as dimension reduction methods

using least squares. Note that in (6.17), the regression coefficients are given
by θ0, θ1, . . . , θM. If the constants φ1m, φ2m, . . . , φpm are chosen wisely, then
such dimension reduction approaches can often outperform least squares
regression. In other words, fitting (6.17) using least squares can lead to
better results than fitting (6.1) using least squares.

6.3.1Principal Components Regression(主成分回归) 

Principal components analysis (PCA) is a popular approach for deriving principal components analysis

a low-dimensional set of features from a large set of variables.


0 收藏
0 评论
0 收藏