# ISLR Chapter 6: Linear Model Selection and Regularization

2018/03/06 18:35

Why might we want to use another fitting procedure instead of least squares?

Alternative fitting procedures can yield better prediction accuracy and better model interpretability.

The chapter covers three classes of methods: subset selection, shrinkage, and dimension reduction.

6.1 Subset Selection

6.1.1 Best Subset Selection

6.1.2 Stepwise Selection

6.1.3 Choosing the Optimal Model

In order to select the best model with respect to test error, we need to estimate this test error. There are two common approaches:

1. We can indirectly estimate test error by making an adjustment to the training error to account for the bias due to overfitting.

2. We can directly estimate the test error, using either a validation set approach or a cross-validation approach, as discussed in Chapter 5.

We consider both of these approaches below.
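As an illustration of the first (indirect) approach, here is a minimal sketch that scores nested models with Cp, BIC, and adjusted R². The formulas follow the usual ISLR definitions, which these notes do not write out, so treat them as an assumption; the simulated data and the helper names `fit_rss` and `selection_criteria` are hypothetical.

```python
import numpy as np

def fit_rss(X, y):
    """Least squares with an intercept; returns the residual sum of squares."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def selection_criteria(X, y, cols, sigma2):
    """Cp, BIC and adjusted R^2 for the model using predictor columns `cols`."""
    n, d = len(y), len(cols)
    rss = fit_rss(X[:, cols], y)
    tss = float(((y - y.mean()) ** 2).sum())
    cp = (rss + 2 * d * sigma2) / n               # training error plus a penalty
    bic = (rss + np.log(n) * d * sigma2) / n      # heavier penalty once log(n) > 2
    adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
    return cp, bic, adj_r2

rng = np.random.default_rng(0)
n, p = 100, 6
X = rng.normal(size=(n, p))
# Assumed data-generating model: only the first two predictors matter.
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=n)

# The error variance is estimated from the full model.
sigma2 = fit_rss(X, y) / (n - p - 1)

# Nested candidates (first 1, 2, ..., p predictors) stand in for a stepwise path.
results = {d: selection_criteria(X, y, list(range(d)), sigma2)
           for d in range(1, p + 1)}
best_by_bic = min(results, key=lambda d: results[d][1])
print("model size chosen by BIC:", best_by_bic)
```

Cp and BIC are minimized, while adjusted R² is maximized. The second (direct) approach would replace these scores with a validation-set or cross-validation error.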

6.2 Shrinkage Methods

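We can fit a model containing all p predictors using a technique that constrains or regularizes the coefficient estimates, or equivalently, that shrinks the coefficient estimates towards zero.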

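6.2.1 Ridge Regression

Ridge regression chooses the coefficient estimates that minimize

$$\mathrm{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2 \qquad (6.5)$$

where λ ≥ 0 is a tuning parameter. The second term, $\lambda \sum_{j=1}^{p} \beta_j^2$, called a shrinkage penalty, is small when β1, . . . , βp are close to zero, and so it has the effect of shrinking the estimates of βj towards zero.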

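Selecting a good value for λ is critical; we defer this discussion to Section 6.2.3, where we use cross-validation.

It is best to apply ridge regression after standardizing the predictors, so that they are all on the same scale, using the formula

$$\tilde{x}_{ij} = \frac{x_{ij}}{\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_{ij}-\bar{x}_j)^2}}$$

Why Does Ridge Regression Improve Over Least Squares?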

Ridge regression’s advantage over least squares is rooted in the bias-variance trade-off.

In particular, when the number of variables p is almost as large as the number of observations n, as in the example in Figure 6.5, the least squares estimates will be extremely variable. And if p > n, then the least squares estimates do not even have a unique solution, whereas ridge regression can still perform well by trading off a small increase in bias for a large decrease in variance. Hence, ridge regression works best in situations where the least squares estimates have high variance. (In short: when p is close to or larger than n, least squares has very high variance, whereas ridge regression greatly reduces the variance at the cost of a small increase in bias.)
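The bias-variance argument can be checked with a small simulation. The sketch below is only an illustration under assumed settings (n = 50, p = 45, noise standard deviation 3, λ = 10, all chosen arbitrarily); it uses the closed-form ridge solution on centered, standardized predictors, and the helper names are hypothetical.

```python
import numpy as np

def standardize(X):
    """Center each column and divide by its standard deviation (1/n form)."""
    Xc = X - X.mean(axis=0)
    return Xc / np.sqrt((Xc ** 2).mean(axis=0))

def ridge_coefs(X, y, lam):
    """Slopes minimizing RSS + lam * sum(beta_j^2); lam = 0 is least squares."""
    yc = y - y.mean()              # centering y absorbs the intercept
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ yc)

def bias2_and_variance(fits, beta_true):
    """Average squared bias and average variance of the coefficient estimates."""
    bias2 = ((fits.mean(axis=0) - beta_true) ** 2).mean()
    return float(bias2), float(fits.var(axis=0).mean())

rng = np.random.default_rng(1)
n, p = 50, 45                      # p almost as large as n
beta_true = rng.normal(size=p)

ols_fits, ridge_fits = [], []
for _ in range(200):               # repeated training sets from the same model
    X = standardize(rng.normal(size=(n, p)))
    y = X @ beta_true + rng.normal(scale=3.0, size=n)
    ols_fits.append(ridge_coefs(X, y, lam=0.0))
    ridge_fits.append(ridge_coefs(X, y, lam=10.0))

ols_bias2, ols_var = bias2_and_variance(np.array(ols_fits), beta_true)
ridge_bias2, ridge_var = bias2_and_variance(np.array(ridge_fits), beta_true)
print(f"least squares: bias^2 = {ols_bias2:.3f}, variance = {ols_var:.3f}")
print(f"ridge:         bias^2 = {ridge_bias2:.3f}, variance = {ridge_var:.3f}")
```

The quantities are measured on the coefficient estimates rather than on test predictions, which keeps the sketch short. In practice λ would be chosen by cross-validation rather than fixed in advance.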

6.2.2 The Lasso

Ridge regression does have one obvious disadvantage. Unlike best subset, forward stepwise, and backward stepwise selection, which will generally select models that involve just a subset of the variables, ridge regression will include all p predictors in the final model. The penalty $\lambda \sum_{j} \beta_j^2$ in (6.5) will shrink all of the coefficients towards zero, but it will not set any of them exactly to zero (unless λ = ∞). This may not be a problem for prediction accuracy, but it can create a challenge in model interpretation in settings in which the number of variables p is quite large. (Ridge regression never shrinks a coefficient exactly to zero; this does not hurt prediction accuracy, but with a very large number of variables it makes the model harder to interpret.)
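To make the contrast concrete, the sketch below fits ridge regression and the lasso to the same simulated data and counts coefficients that are exactly zero. These notes do not write out the lasso criterion, so the sketch assumes the standard form, RSS + λ Σ|βj|, solved here by a simple coordinate descent. The data, the value λ = 80, and the function names are hypothetical.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form minimizer of RSS + lam * sum(beta_j^2)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def lasso(X, y, lam, n_iter=500):
    """Coordinate descent for RSS + lam * sum(|beta_j|)."""
    p = X.shape[1]
    beta = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual with predictor j removed from the current fit.
            r_j = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r_j
            # Soft-thresholding sets weakly related coefficients exactly to zero.
            beta[j] = np.sign(rho) * max(abs(rho) - lam / 2, 0.0) / col_ss[j]
    return beta

rng = np.random.default_rng(2)
n, p = 100, 10
X = rng.normal(size=(n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)          # standardized predictors
beta_true = np.array([5.0, -4.0, 3.0, 2.0] + [0.0] * 6)   # four real signals
y = X @ beta_true + rng.normal(size=n)
y = y - y.mean()                                  # centered, so no intercept

b_ols = ridge(X, y, lam=0.0)
b_ridge = ridge(X, y, lam=80.0)
b_lasso = lasso(X, y, lam=80.0)

print("ridge coefficients exactly zero:", int(np.sum(b_ridge == 0.0)))
print("lasso coefficients exactly zero:", int(np.sum(b_lasso == 0.0)))
```

The same numeric λ is used for both fits only for simplicity; the two penalties are on different scales, so the values are not directly comparable.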

For example, in the Credit data set, it appears that the most important variables are income, limit, rating, and student. So we might wish to build a model including just these predictors. However, ridge regression will always generate a model involving all ten predictors. Increasing the value of λ will tend to reduce the magnitudes of the coefficients, but will not result in exclusion of any of the variables.

6.2.3 Selecting the Tuning Parameter
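A minimal sketch of the procedure this section describes: compute a k-fold cross-validation error for each λ on a grid, pick the λ with the smallest error, then re-fit on all observations. Ridge regression is used as the model; the grid, the fold count k = 10, the simulated data, and the helper names are all assumptions made for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Return (intercept, slopes) minimizing RSS + lam * sum(beta_j^2)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    return y_mean - x_mean @ beta, beta

def cv_error(X, y, lam, k=10, seed=0):
    """Mean squared prediction error of ridge regression over k folds."""
    # A fixed seed gives every lambda the same folds, so errors are comparable.
    idx = np.random.default_rng(seed).permutation(len(y))
    errors = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        b0, b = ridge_fit(X[train], y[train], lam)
        errors.append(np.mean((y[fold] - b0 - X[fold] @ b) ** 2))
    return float(np.mean(errors))

rng = np.random.default_rng(3)
n, p = 60, 40
X = rng.normal(size=(n, p))        # already on a common scale
y = X[:, :5] @ np.array([2.0, -1.5, 1.0, 0.5, -0.5]) + rng.normal(size=n)

grid = 10.0 ** np.linspace(-2, 3, 30)          # candidate lambda values
cv = np.array([cv_error(X, y, lam) for lam in grid])
best_lam = grid[cv.argmin()]                   # smallest cross-validation error
b0, beta = ridge_fit(X, y, best_lam)           # re-fit on all observations
print(f"selected lambda: {best_lam:.3g}, CV error: {cv.min():.3f}")
```

With real data the predictors would be standardized first, ideally within each training fold.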

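Cross-validation provides a simple way to choose λ: we choose a grid of λ values and compute the cross-validation error for each value. We then select the tuning parameter value for which the cross-validation error is smallest. Finally, the model is re-fit using all of the available observations and the selected value of the tuning parameter.

6.3 Dimension Reduction Methods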

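We now explore a class of approaches that transform the predictors and then fit a least squares model using the transformed variables. We will refer to these techniques as dimension reduction methods.

Let Z1, Z2, . . . , ZM represent M < p linear combinations of the original p predictors:

$$Z_m = \sum_{j=1}^{p} \phi_{jm} X_j \qquad (6.16)$$

for some constants φ1m, φ2m, . . . , φpm. We then fit the linear regression model

$$y_i = \theta_0 + \sum_{m=1}^{M} \theta_m z_{im} + \epsilon_i, \quad i = 1, \ldots, n \qquad (6.17)$$

using least squares. Note that in (6.17), the regression coefficients are given by θ0, θ1, . . . , θM. If the constants φ1m, φ2m, . . . , φpm are chosen wisely, then such dimension reduction approaches can often outperform least squares regression. In other words, fitting (6.17) using least squares can lead to better results than fitting (6.1), the original linear model in X1, . . . , Xp, using least squares.

6.3.1 Principal Components Regression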

Principal components analysis (PCA) is a popular approach for deriving a low-dimensional set of features from a large set of variables.
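Principal components regression can be sketched in a few lines: take the first M principal component directions as the constants φjm, form the components Zm, and regress y on them by least squares. The example below is an illustration only; it uses an SVD of the centered predictors, simulated data driven by two latent factors, and a hypothetical helper `pcr_fit`. Predictors are centered but not standardized here, which would usually be advisable when they are on different scales.

```python
import numpy as np

def pcr_fit(X, y, M):
    """Principal components regression with M components.

    Returns (theta0, theta, phi, x_mean): the intercept, the coefficients on
    the components, the p x M loading matrix, and the predictor means.
    """
    x_mean = X.mean(axis=0)
    Xc = X - x_mean
    # Rows of Vt are the principal component directions of the centered data.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    phi = Vt[:M].T                       # column m holds phi_1m, ..., phi_pm
    Z = Xc @ phi                         # z_im = sum_j phi_jm * x_ij
    A = np.column_stack([np.ones(len(y)), Z])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)   # least squares on Z
    return coef[0], coef[1:], phi, x_mean

rng = np.random.default_rng(4)
n, p = 80, 6
latent = rng.normal(size=(n, 2))         # two factors drive all six predictors
X = latent @ rng.normal(size=(2, p)) + 0.1 * rng.normal(size=(n, p))
y = latent[:, 0] - latent[:, 1] + rng.normal(scale=0.5, size=n)

theta0, theta, phi, x_mean = pcr_fit(X, y, M=2)
beta = phi @ theta                       # implied coefficients on X_1..X_p
y_hat = theta0 + (X - x_mean) @ beta
print("training MSE with M = 2:", float(np.mean((y - y_hat) ** 2)))
```

With M = p the fit coincides with ordinary least squares on the original predictors, so any gain comes from choosing M < p, typically by cross-validation.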
