Why might we want to use another fitting procedure instead of least squares?
Better prediction accuracy and better model interpretability.
Subset Selection, Shrinkage, Dimension Reduction
6.1.1 Best Subset Selection
6.1.2 Stepwise Selection
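Backward selection requires that the sample size n be larger than the number of variables p (so that the full model can be fit). In contrast,
forward stepwise selection can be used even when n < p, and so it is the only viable subset method when p is very large.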
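A minimal sketch of forward stepwise selection, assuming plain least squares with an intercept and using training RSS to pick the predictor added at each step. The helper names (`solve`, `rss`, `forward_stepwise`) and the toy data are illustrative, not taken from the notes.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def rss(X, y, cols):
    """Training RSS of least squares on an intercept plus the columns in cols."""
    Z = [[1.0] + [row[j] for j in cols] for row in X]
    k = len(Z[0])
    gram = [[sum(z[a] * z[b] for z in Z) for b in range(k)] for a in range(k)]
    rhs = [sum(z[a] * yi for z, yi in zip(Z, y)) for a in range(k)]
    beta = solve(gram, rhs)
    return sum((yi - sum(bj * zj for bj, zj in zip(beta, z))) ** 2
               for z, yi in zip(Z, y))


def forward_stepwise(X, y, max_size):
    """Greedily add, at each step, the predictor that lowers RSS the most."""
    selected, remaining, path = [], list(range(len(X[0]))), []
    for _ in range(max_size):
        best = min(remaining, key=lambda j: rss(X, y, selected + [j]))
        selected.append(best)
        remaining.remove(best)
        path.append((list(selected), rss(X, y, selected)))
    return path


# Toy data: y = 3 * x1 + x0, and x2 is irrelevant.
X = [[1.0, 1.0, 0.0], [-1.0, 1.0, 0.0], [1.0, -1.0, 0.0],
     [-1.0, -1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, -1.0]]
y = [4.0, 2.0, -2.0, -4.0, 0.0, 0.0]
path = forward_stepwise(X, y, max_size=2)
print(path)  # predictor 1 enters first, then predictor 0
```

The loop never fits the full model, which is why the procedure can still run when n < p, provided `max_size` stays below n so that each candidate model can be fit.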
6.1.3 Choosing the Optimal Model
In order to select the best model with respect to test error, we need to estimate this test error. There are two common approaches:
1. We can indirectly estimate test error by making an adjustment to the training error to account for the bias due to overfitting.
2. We can directly estimate the test error, using either a validation set approach or a cross-validation approach, as discussed in Chapter 5.
We consider both of these approaches below.
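As a sketch of the first approach, the snippet below computes two widely used adjustments to the training error, Mallows' Cp and adjusted R². Neither formula appears in the notes above; the function names and the assumption that an estimate `sigma2_hat` of the error variance is available are illustrative.

```python
def mallows_cp(rss, n, d, sigma2_hat):
    """Cp = (RSS + 2 * d * sigma2_hat) / n for a model with d predictors."""
    return (rss + 2 * d * sigma2_hat) / n


def adjusted_r2(rss, tss, n, d):
    """Adjusted R^2 = 1 - [RSS / (n - d - 1)] / [TSS / (n - 1)]."""
    return 1 - (rss / (n - d - 1)) / (tss / (n - 1))


# Hypothetical numbers: n = 6 observations, d = 1 predictor.
cp = mallows_cp(rss=4.0, n=6, d=1, sigma2_hat=1.0)
adj = adjusted_r2(rss=4.0, tss=40.0, n=6, d=1)
print(cp, adj)
```

A lower Cp or a higher adjusted R² indicates a model expected to have lower test error; both penalize the training RSS for the number of predictors d.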
6.2 Shrinkage Methods
Shrinking the coefficient estimates can significantly reduce their variance. The two most commonly used methods are ridge regression and the lasso.
6.2.1 Ridge Regression
where λ is a tuning parameter, to be determined separately.
The second term, λ Σj βj², called a shrinkage penalty, is small when β1, . . . , βp are close to zero, and so it has the effect of shrinking the estimates of βj towards zero.
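For reference, the ridge criterion that these remarks describe is usually written as follows. This is the standard textbook form; the symbols x_ij, y_i and β0 are assumed here because the notes do not define them, and the intercept β0 is not penalized.

```latex
\hat{\beta}^{R}_{\lambda}
  = \arg\min_{\beta}
    \left\{
      \sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^{2}
      + \lambda \sum_{j=1}^{p}\beta_j^{2}
    \right\}
```

The first term is the usual RSS; the second is the shrinkage penalty, whose weight is controlled by λ.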
Selecting a good value for λ is critical; we defer this discussion to Section 6.2.3, where we use cross-validation.
It is best to apply ridge regression after standardizing the predictors, using the formula
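One standard form of this standardization, assuming each predictor is divided by its estimated standard deviation so that all predictors end up on the same scale, is:

```latex
\tilde{x}_{ij}
  = \frac{x_{ij}}
         {\sqrt{\dfrac{1}{n}\sum_{i=1}^{n}\bigl(x_{ij} - \bar{x}_j\bigr)^{2}}}
```

After this rescaling every predictor has standard deviation one, so the ridge fit no longer depends on the units in which the predictors were measured.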
Why Does Ridge Regression Improve Over Least Squares?
Ridge regression’s advantage over least squares is rooted in the bias-variance trade-off.
In particular, when the number of variables p is almost as large as the number of observations n, as in the example in Figure 6.5, the least
squares estimates will be extremely variable. And if p > n, then the least squares estimates do not even have a unique solution, whereas
ridge regression can still perform well by trading off a small increase in bias for a large decrease in variance. Hence, ridge regression works
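best in situations where the least squares estimates have high variance. (When p is close to or larger than n, least squares has very high variance, whereas ridge regression greatly reduces the variance at the cost of a small increase in bias.)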
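A small sketch of the p > n case, assuming two predictors, a single observation, and no intercept. The function names are illustrative. With λ = 0 the matrix XᵀX is singular, so least squares has no unique solution, while adding λ to the diagonal makes the system solvable.

```python
def gram(X):
    """Return X^T X for a data matrix with two columns."""
    return [[sum(r[a] * r[b] for r in X) for b in range(2)] for a in range(2)]


def det2(A):
    """Determinant of a 2x2 matrix."""
    return A[0][0] * A[1][1] - A[0][1] * A[1][0]


def ridge_two_predictors(X, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y by Cramer's rule (no intercept)."""
    A = gram(X)
    A[0][0] += lam
    A[1][1] += lam
    b = [sum(r[j] * yi for r, yi in zip(X, y)) for j in range(2)]
    d = det2(A)
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / d,
            (A[0][0] * b[1] - A[1][0] * b[0]) / d]


X = [[1.0, 1.0]]  # n = 1 observation, p = 2 predictors
y = [2.0]
ls_det = det2(gram(X))                        # 0.0: least squares is not unique
beta_ridge = ridge_two_predictors(X, y, 2.0)  # unique solution for lam > 0
print(ls_det, beta_ridge)
```

Here the ridge solution splits the fit evenly between the two identical predictors, whereas least squares cannot choose among infinitely many equally good solutions.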
6.2.2 The Lasso
Ridge regression does have one obvious disadvantage. Unlike best subset, forward stepwise, and backward stepwise selection,
which will generally select models that involve just a subset of the variables, ridge regression will include all p predictors in the final
model. The penalty λ Σj βj² in (6.5) will shrink all of the coefficients towards zero, but it will not set any of them exactly to zero (unless λ = ∞).
This may not be a problem for prediction accuracy, but it can create a challenge in model interpretation in settings in which the number
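of variables p is quite large. (Ridge regression does not shrink any coefficient exactly to zero; this does not hurt prediction accuracy, but it makes the model harder to interpret when the number of variables is very large.)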
For example, in the Credit data set, it appears that the most important variables are income, limit, rating, and student. So we might wish to
build a model including just these predictors. However, ridge regression will always generate a model involving all ten predictors. Increasing
the value of λ will tend to reduce the magnitudes of the coefficients, but will not result in exclusion of any of the variables.
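The contrast with the lasso can be sketched in a simplified setting, assuming n = p, a design matrix equal to the identity, and no intercept. Under those assumptions ridge rescales each least squares coefficient y_j by 1 / (1 + λ), while the lasso soft-thresholds it at λ / 2. The function names are illustrative and the lasso formula is not given in the notes above.

```python
def ridge_simple(y, lam):
    """Ridge estimates when X is the identity: proportional shrinkage."""
    return [yj / (1 + lam) for yj in y]


def lasso_simple(y, lam):
    """Lasso estimates when X is the identity: soft-thresholding at lam / 2."""
    out = []
    for yj in y:
        if yj > lam / 2:
            out.append(yj - lam / 2)
        elif yj < -lam / 2:
            out.append(yj + lam / 2)
        else:
            out.append(0.0)  # small coefficients are set exactly to zero
    return out


y = [3.0, 0.4, -1.0]  # least squares coefficients in this special case
ridge = ridge_simple(y, lam=1.0)
lasso = lasso_simple(y, lam=1.0)
print(ridge)  # every coefficient is shrunk, none is exactly zero
print(lasso)  # the small middle coefficient is dropped
```

This is why the lasso performs variable selection while ridge regression keeps every predictor in the model.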
6.2.3 Selecting the Tuning Parameter
We choose a grid of λ values and compute the cross-validation error for each value. We then select the tuning parameter value for which the cross-validation error is smallest. Finally, the model is re-fit using all of
the available observations and the selected value of the tuning parameter.
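The procedure can be sketched as follows, assuming a single predictor with no intercept (so that the ridge estimate has a one-line closed form) and leave-one-out cross-validation. The data, the grid, and the function names are hypothetical.

```python
def ridge_coef(x, y, lam):
    """Ridge estimate for one predictor, no intercept: sum(xy) / (sum(x^2) + lam)."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + lam)


def loocv_error(x, y, lam):
    """Mean squared prediction error over leave-one-out folds."""
    errors = []
    for i in range(len(x)):
        x_train = x[:i] + x[i + 1:]
        y_train = y[:i] + y[i + 1:]
        beta = ridge_coef(x_train, y_train, lam)
        errors.append((y[i] - beta * x[i]) ** 2)
    return sum(errors) / len(errors)


x = [1.0, 2.0, 3.0, 4.0]
y = [1.2, 1.9, 3.2, 3.8]
grid = [0.0, 0.1, 1.0, 10.0]

cv = {lam: loocv_error(x, y, lam) for lam in grid}  # CV error for each lambda
best_lam = min(cv, key=cv.get)                      # smallest CV error
final_beta = ridge_coef(x, y, best_lam)             # re-fit on all observations
print(best_lam, final_beta)
```

The final re-fit uses every observation, since the held-out folds were only needed to compare candidate values of λ.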
6.3 Dimension Reduction Methods
We now explore a class of approaches that transform the predictors and then fit a least squares model using the transformed variables.
We will refer to these techniques as dimension reduction methods.
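Let Z1, Z2, . . . , ZM represent M < p linear combinations of the original p predictors, that is,
Zm = φ1m X1 + φ2m X2 + . . . + φpm Xp for some constants φ1m, φ2m, . . . , φpm. We can then fit
the linear regression model yi = θ0 + θ1 zi1 + . . . + θM ziM + εi (6.17)
using least squares. Note that in (6.17), the regression coefficients are given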
by θ0, θ1, . . . , θM. If the constants φ1m, φ2m, . . . , φpm are chosen wisely, then
such dimension reduction approaches can often outperform least squares
regression. In other words, fitting (6.17) using least squares can lead to
better results than fitting (6.1) using least squares.
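A minimal sketch of this two-step idea with M = 1, assuming two predictors and a hypothetical choice of constants `phi`. The predictors are first combined into a single transformed variable z, and y is then regressed on z by ordinary least squares.

```python
def simple_least_squares(z, y):
    """Least squares fit of y = theta0 + theta1 * z."""
    n = len(z)
    z_bar = sum(z) / n
    y_bar = sum(y) / n
    theta1 = (sum((zi - z_bar) * (yi - y_bar) for zi, yi in zip(z, y))
              / sum((zi - z_bar) ** 2 for zi in z))
    theta0 = y_bar - theta1 * z_bar
    return theta0, theta1


X = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 4.0]]
y = [3.0, 3.0, 8.0, 8.0]   # constructed as x1 + x2
phi = [1.0, 1.0]           # hypothetical constants phi_11, phi_21

# Step 1: transform the predictors into one linear combination.
z = [sum(p * xj for p, xj in zip(phi, row)) for row in X]
# Step 2: fit least squares on the transformed variable.
theta0, theta1 = simple_least_squares(z, y)
# Implied coefficients on the original predictors: beta_j = theta1 * phi_j.
beta = [theta1 * p for p in phi]
print(theta0, theta1, beta)
```

Only M + 1 = 2 coefficients are estimated instead of p + 1 = 3, which is the sense in which the dimension of the problem has been reduced.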
6.3.1 Principal Components Regression
Principal components analysis (PCA) is a popular approach for deriving
a low-dimensional set of features from a large set of variables.
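As an illustration, the first principal component direction of two-dimensional data can be computed in closed form from the sample covariance matrix. The sketch below assumes exactly two variables; the function name and the toy points are hypothetical.

```python
import math


def first_principal_direction(points):
    """Unit vector along which 2-D data vary the most."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    # Angle of the leading eigenvector of [[sxx, sxy], [sxy, syy]].
    angle = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return math.cos(angle), math.sin(angle)


points = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]  # all points lie on the line y = x
direction = first_principal_direction(points)
print(direction)  # points along the diagonal
```

Projecting the centered data onto this direction gives a single feature that summarizes both original variables.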