Machine Learning Algorithms Notes


The kNN Algorithm (k-Nearest Neighbors)

Suppose we have a plot of data in which the blue points are malignant tumors and the red points are benign tumors, and a new sample arrives; we want to decide whether it is benign or malignant. The idea of the k-nearest neighbors algorithm is to look at the k points closest to the new point and, based on how many of those neighbors are blue versus red, decide whether the new point should be blue or red. Suppose we take k = 3.

We draw the new point in green. Two of its three nearest neighbors are red and one is blue, so we conclude that the new point is red, i.e. a benign tumor.

Since this involves computing distances, we use the Euclidean distance (which is essentially the Pythagorean theorem).

The Euclidean distance is computed the same way in two-, three-, and higher-dimensional space: for each dimension, take the difference between the two samples' values, square it, sum the squares over all dimensions, and take the square root.
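
Written out, for two n-dimensional samples $x^{(a)}$ and $x^{(b)}$ the distance is

$$d\left(x^{(a)}, x^{(b)}\right) = \sqrt{\sum_{i=1}^{n}\left(x_i^{(a)} - x_i^{(b)}\right)^2}$$

which for $n = 2$ reduces to the familiar Pythagorean form $\sqrt{\left(x_1^{(a)} - x_1^{(b)}\right)^2 + \left(x_2^{(a)} - x_2^{(b)}\right)^2}$.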

Now let's implement the k-nearest neighbors algorithm in code.

import numpy as np
import matplotlib.pyplot as plt
from math import sqrt
from collections import Counter

if __name__ == "__main__":

    # All tumor samples, each with two features
    raw_data_x = [[3.393533211, 2.331273381],
                  [3.110073483, 1.781539638],
                  [1.343808831, 3.368360954],
                  [3.582294042, 4.679179110],
                  [2.280362439, 2.866990263],
                  [7.423436942, 4.696522875],
                  [5.745051997, 3.533989803],
                  [9.172168622, 2.511101045],
                  [7.792783481, 3.424088941],
                  [7.939820817, 0.791637231]]
    # Tumor labels: 0 = benign, 1 = malignant
    raw_data_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
    X_train = np.array(raw_data_x)
    Y_train = np.array(raw_data_y)
    plt.scatter(X_train[Y_train == 0, 0], X_train[Y_train == 0, 1], color='g')
    plt.scatter(X_train[Y_train == 1, 0], X_train[Y_train == 1, 1], color='r')
    x = np.array([8.093607318, 3.365731514])
    plt.scatter(x[0], x[1], color='b')
    # Compute the distance from the new sample to every training sample
    distances = [sqrt(np.sum((x_train - x)**2)) for x_train in X_train]
    print(distances)
    # Sort by distance; the result is the indices of the training samples, nearest first
    nearest = np.argsort(distances)
    print(nearest)
    # Use the 6 nearest neighbors
    k = 6
    # Labels of those 6 nearest samples
    topK_y = [Y_train[i] for i in nearest[:k]]
    print(topK_y)
    # Vote among the 6 neighbors; the label with the most votes is assigned to the new sample
    votes = Counter(topK_y)
    # Print the predicted tumor type of the new sample
    print(votes.most_common(1)[0][0])
    plt.show()

Output

[4.812566907609877, 5.229270827235305, 6.749798999160064, 4.6986266144110695, 5.83460014556857, 1.4900114024329525, 2.354574897431513, 1.3761132675144652, 0.3064319992975, 2.5786840957478887]
[8 7 5 6 9 3 0 1 4 2]
[1, 1, 1, 1, 1, 0]
1

From the final result we can see that the new tumor sample is malignant.

The scatter plot of the tumor data is shown below.

Here the green points are benign tumors, the red points are malignant tumors, and the blue point is the newly arrived sample.

Now let's wrap this algorithm in a function.

import numpy as np
from math import sqrt
from collections import Counter

def kNN_classify(k, X_train, Y_train, x):

    assert 1 <= k <= X_train.shape[0], "k must be valid"
    assert X_train.shape[0] == Y_train.shape[0], \
        "the size of X_train must equal the size of Y_train"
    assert X_train.shape[1] == x.shape[0], \
        "the feature dimension of x must equal that of X_train"
    # Compute the distance from the new sample to every training sample
    distances = [sqrt(np.sum((x_train - x) ** 2)) for x_train in X_train]
    # Sort by distance; the result is the indices of the training samples, nearest first
    nearest = np.argsort(distances)
    # Labels of the k nearest samples
    topK_y = [Y_train[i] for i in nearest[:k]]
    # Vote among the k neighbors; the label with the most votes is the prediction
    votes = Counter(topK_y)
    return votes.most_common(1)[0][0]

import numpy as np
import matplotlib.pyplot as plt
from playLA.KNN import kNN_classify

if __name__ == "__main__":

    # All tumor samples, each with two features
    raw_data_x = [[3.393533211, 2.331273381],
                  [3.110073483, 1.781539638],
                  [1.343808831, 3.368360954],
                  [3.582294042, 4.679179110],
                  [2.280362439, 2.866990263],
                  [7.423436942, 4.696522875],
                  [5.745051997, 3.533989803],
                  [9.172168622, 2.511101045],
                  [7.792783481, 3.424088941],
                  [7.939820817, 0.791637231]]
    # Tumor labels: 0 = benign, 1 = malignant
    raw_data_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
    X_train = np.array(raw_data_x)
    Y_train = np.array(raw_data_y)
    plt.scatter(X_train[Y_train == 0, 0], X_train[Y_train == 0, 1], color='g')
    plt.scatter(X_train[Y_train == 1, 0], X_train[Y_train == 1, 1], color='r')
    x = np.array([8.093607318, 3.365731514])
    plt.scatter(x[0], x[1], color='b')
    res = kNN_classify(6, X_train, Y_train, x)
    print(res)
    plt.show()

Output

1

Now let's look at the general machine learning workflow: the training set is fed to a learning algorithm, which fits a model; new samples are then fed to that model to produce predicted results.

Notice that in kNN we never really obtain a model; kNN can be said to need no training process at all. In other words, the input example is handed directly to the dataset, the k nearest points are found, and the vote among them is the output. k-nearest neighbors is therefore quite special and can be regarded as an algorithm without a model; to keep the terminology consistent with other algorithms, we can say that the training dataset itself is the model.

Now let's redo this process with the kNN implementation from the scikit-learn machine learning library.

from sklearn.neighbors import KNeighborsClassifier
import numpy as np
import matplotlib.pyplot as plt

if __name__ == "__main__":
    # All tumor samples, each with two features
    raw_data_x = [[3.393533211, 2.331273381],
                  [3.110073483, 1.781539638],
                  [1.343808831, 3.368360954],
                  [3.582294042, 4.679179110],
                  [2.280362439, 2.866990263],
                  [7.423436942, 4.696522875],
                  [5.745051997, 3.533989803],
                  [9.172168622, 2.511101045],
                  [7.792783481, 3.424088941],
                  [7.939820817, 0.791637231]]
    # Tumor labels: 0 = benign, 1 = malignant
    raw_data_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
    X_train = np.array(raw_data_x)
    Y_train = np.array(raw_data_y)
    plt.scatter(X_train[Y_train == 0, 0], X_train[Y_train == 0, 1], color='g')
    plt.scatter(X_train[Y_train == 1, 0], X_train[Y_train == 1, 1], color='r')
    x = np.array([8.093607318, 3.365731514])
    plt.scatter(x[0], x[1], color='b')
    kNN_classifier = KNeighborsClassifier(n_neighbors=6)
    # Fit the classifier to the training set
    kNN_classifier.fit(X_train, Y_train)
    X_predict = x.reshape(1, -1)
    # Predict the label of the new sample
    predict = kNN_classifier.predict(X_predict)
    # Print the predicted label of the (single) sample
    print(predict[0])
    plt.show()

Output

1

Of course, we can also wrap our own class with the same interface as scikit-learn.

import numpy as np
from math import sqrt
from collections import Counter

class KNNClassifier:

    def __init__(self, k):
        # Initialize the kNN classifier
        assert k >= 1, "k must be valid"
        self.k = k
        self._X_train = None
        self._Y_train = None

    def fit(self, X_train, Y_train):
        # Train the kNN classifier on the training sets X_train and Y_train
        assert X_train.shape[0] == Y_train.shape[0], \
            "the size of X_train must equal the size of Y_train"
        assert self.k <= X_train.shape[0], \
            "the size of X_train must be at least k"
        self._X_train = X_train
        self._Y_train = Y_train
        return self

    def predict(self, X_predict):
        # Given a dataset X_predict to classify, return the vector of predicted labels
        assert self._X_train is not None and self._Y_train is not None, \
            "must fit before predict"
        assert X_predict.shape[1] == self._X_train.shape[1], \
            "the feature dimension of X_predict must equal that of X_train"
        y_predict = [self._predict(x) for x in X_predict]
        return np.array(y_predict)

    def _predict(self, x):
        # Given a single sample x, return its predicted label
        assert x.shape[0] == self._X_train.shape[1], \
            "the feature dimension of x must equal that of X_train"
        # Compute the distance from x to every training sample
        distances = [sqrt(np.sum((x_train - x) ** 2)) for x_train in self._X_train]
        # Sort by distance; the result is the indices of the training samples, nearest first
        nearest = np.argsort(distances)
        # Labels of the k nearest samples
        topK_y = [self._Y_train[i] for i in nearest[:self.k]]
        # Vote among the k neighbors; the label with the most votes is the prediction
        votes = Counter(topK_y)
        return votes.most_common(1)[0][0]

from playLA.KNNClassifier import KNNClassifier
import numpy as np
import matplotlib.pyplot as plt

if __name__ == "__main__":
    # All tumor samples, each with two features
    raw_data_x = [[3.393533211, 2.331273381],
                  [3.110073483, 1.781539638],
                  [1.343808831, 3.368360954],
                  [3.582294042, 4.679179110],
                  [2.280362439, 2.866990263],
                  [7.423436942, 4.696522875],
                  [5.745051997, 3.533989803],
                  [9.172168622, 2.511101045],
                  [7.792783481, 3.424088941],
                  [7.939820817, 0.791637231]]
    # Tumor labels: 0 = benign, 1 = malignant
    raw_data_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
    X_train = np.array(raw_data_x)
    Y_train = np.array(raw_data_y)
    plt.scatter(X_train[Y_train == 0, 0], X_train[Y_train == 0, 1], color='g')
    plt.scatter(X_train[Y_train == 1, 0], X_train[Y_train == 1, 1], color='r')
    x = np.array([8.093607318, 3.365731514])
    plt.scatter(x[0], x[1], color='b')
    kNN_classifier = KNNClassifier(k=6)
    # Fit the classifier to the training set
    kNN_classifier.fit(X_train, Y_train)
    X_predict = x.reshape(1, -1)
    # Predict the label of the new sample
    predict = kNN_classifier.predict(X_predict)
    # Print the predicted label of the (single) sample
    print(predict[0])
    plt.show()

Output

1

Classification Accuracy

Let's add a classification-accuracy method to our own KNN class.

import numpy as np
from math import sqrt
from collections import Counter

class KNNClassifier:

    def __init__(self, k):
        # Initialize the kNN classifier
        assert k >= 1, "k must be valid"
        self.k = k
        self._X_train = None
        self._Y_train = None

    def fit(self, X_train, Y_train):
        # Train the kNN classifier on the training sets X_train and Y_train
        assert X_train.shape[0] == Y_train.shape[0], \
            "the size of X_train must equal the size of Y_train"
        assert self.k <= X_train.shape[0], \
            "the size of X_train must be at least k"
        self._X_train = X_train
        self._Y_train = Y_train
        return self

    def predict(self, X_predict):
        # Given a dataset X_predict to classify, return the vector of predicted labels
        assert self._X_train is not None and self._Y_train is not None, \
            "must fit before predict"
        assert X_predict.shape[1] == self._X_train.shape[1], \
            "the feature dimension of X_predict must equal that of X_train"
        y_predict = [self._predict(x) for x in X_predict]
        return np.array(y_predict)

    def _predict(self, x):
        # Given a single sample x, return its predicted label
        assert x.shape[0] == self._X_train.shape[1], \
            "the feature dimension of x must equal that of X_train"
        # Compute the distance from x to every training sample
        distances = [sqrt(np.sum((x_train - x) ** 2)) for x_train in self._X_train]
        # Sort by distance; the result is the indices of the training samples, nearest first
        nearest = np.argsort(distances)
        # Labels of the k nearest samples
        topK_y = [self._Y_train[i] for i in nearest[:self.k]]
        # Vote among the k neighbors; the label with the most votes is the prediction
        votes = Counter(topK_y)
        return votes.most_common(1)[0][0]

    def score(self, X_test, y_test):
        # Compute the accuracy of the current model on the test sets X_test and y_test
        y_predict = self.predict(X_test)
        return self._accuracy_score(y_test, y_predict)

    def _accuracy_score(self, y_true, y_predict):
        # Compute the accuracy between the true labels and the predicted labels
        assert y_true.shape[0] == y_predict.shape[0], \
            "y_true and y_predict must have the same length"
        return sum(y_true == y_predict) / len(y_true)

Now let's test it on the handwritten digits dataset.
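
The test script below imports train_test_split from a playLA.model_selection module that is not listed in this article. A minimal sketch of what such a function might look like, assuming only the call signature visible below (the seed argument and the shuffling details are guesses, not the article's code):

import numpy as np

def train_test_split(X, y, test_ratio=0.2, seed=None):
    # Split X and y into a training part and a test part, in random order.
    # This is a hypothetical sketch of playLA.model_selection.train_test_split.
    assert X.shape[0] == y.shape[0], "X and y must have the same number of samples"
    assert 0.0 <= test_ratio <= 1.0, "test_ratio must be a valid proportion"
    if seed is not None:
        np.random.seed(seed)
    shuffled_indexes = np.random.permutation(len(X))
    test_size = int(len(X) * test_ratio)
    test_indexes = shuffled_indexes[:test_size]
    train_indexes = shuffled_indexes[test_size:]
    return X[train_indexes], X[test_indexes], y[train_indexes], y[test_indexes]

With something like that in place, the test script is: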

import numpy as np
import matplotlib.pyplot as plt
import matplotlib
from sklearn import datasets
from playLA.model_selection import train_test_split
from playLA.KNNClassifier import KNNClassifier

if __name__ == "__main__":

    digits = datasets.load_digits()
    X = digits.data
    y = digits.target
    print(X.shape)
    some_digit = X[666]
    print(y[666])
    some_digit_image = some_digit.reshape(8, 8)
    plt.imshow(some_digit_image, cmap=matplotlib.cm.binary)
    plt.show()
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_ratio=0.2)
    my_knn_crf = KNNClassifier(k=3)
    my_knn_crf.fit(X_train, y_train)
    print(my_knn_crf.score(X_test, y_test))

Output

(1797, 64)
0
0.9888579387186629

Classification accuracy in scikit-learn

import numpy as np
import matplotlib.pyplot as plt
import matplotlib
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

if __name__ == "__main__":

    digits = datasets.load_digits()
    X = digits.data
    y = digits.target
    print(X.shape)
    some_digit = X[666]
    print(y[666])
    some_digit_image = some_digit.reshape(8, 8)
    plt.imshow(some_digit_image, cmap=matplotlib.cm.binary)
    plt.show()
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)
    knn_crf = KNeighborsClassifier(n_neighbors=3)
    knn_crf.fit(X_train, y_train)
    print(knn_crf.score(X_test, y_test))

Output

(1797, 64)
0
0.9888888888888889

Hyperparameters

Hyperparameters: parameters that must be decided before the algorithm runs.

Model parameters: parameters that are learned during the algorithm's training process.

The k in the kNN algorithm is a typical hyperparameter.

Searching for the best hyperparameter

import numpy as np
import matplotlib.pyplot as plt
import matplotlib
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

if __name__ == "__main__":

    digits = datasets.load_digits()
    X = digits.data
    y = digits.target
    print(X.shape)
    some_digit = X[666]
    print(y[666])
    some_digit_image = some_digit.reshape(8, 8)
    # plt.imshow(some_digit_image, cmap=matplotlib.cm.binary)
    # plt.show()
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)
    knn_crf = KNeighborsClassifier(n_neighbors=3)
    knn_crf.fit(X_train, y_train)
    print(knn_crf.score(X_test, y_test))
    # Search for the best hyperparameter k
    best_score = 0
    best_k = -1
    for k in range(1, 11):
        knn_clf = KNeighborsClassifier(n_neighbors=k)
        knn_clf.fit(X_train, y_train)
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_k = k
            best_score = score
    print("best_k =", best_k)
    print("best_score =", best_score)

Output

(1797, 64)
0
0.9888888888888889
best_k = 4
best_score = 0.9916666666666667

From the results we can see that the best hyperparameter is k = 4, with a best classification accuracy of 99.17%. Note that if the best k we found were 10, i.e. the boundary of our search range, the true best k might well be larger than 10, so we would need to widen the range and keep searching, as sketched below.
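
A minimal sketch of such an extended search, reusing the same digits split as above (the widen-by-ten rule is just an illustrative choice, not something from the article):

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

if __name__ == "__main__":

    digits = datasets.load_digits()
    X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=0.2, random_state=666)
    best_score = 0
    best_k = -1
    k_range = range(1, 11)
    while True:
        for k in k_range:
            knn_clf = KNeighborsClassifier(n_neighbors=k)
            knn_clf.fit(X_train, y_train)
            score = knn_clf.score(X_test, y_test)
            if score > best_score:
                best_k = k
                best_score = score
        # If the best k does not sit on the upper edge of the current range, stop;
        # otherwise widen the range by another ten values and keep searching.
        if best_k != k_range[-1]:
            break
        k_range = range(k_range[-1], k_range[-1] + 10)
    print("best_k =", best_k)
    print("best_score =", best_score)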

k is not the only hyperparameter of the kNN algorithm.

In the plain setting, the new green point would be voted blue, because two of its three nearest neighbors are blue and only one is red. Looking at the distances, however, the red point is much closer while the two blue points are farther away. If we take distance into account, each neighbor's vote is weighted, usually by the inverse of its distance: the closer the neighbor, the larger its weight. In this example the red neighbor is at distance 1 and the two blue neighbors are at distances 3 and 4, so red gets weight 1 while blue gets 1/3 + 1/4 = 7/12, and the new green point is classified as red.
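
As a quick check of that arithmetic, here is a tiny sketch of distance-weighted voting using the illustrative labels and distances from this example:

from collections import Counter

# Each neighbor contributes a vote weighted by the inverse of its distance.
neighbors = [("red", 1.0), ("blue", 3.0), ("blue", 4.0)]
votes = Counter()
for label, distance in neighbors:
    votes[label] += 1.0 / distance
print(votes)                       # Counter({'red': 1.0, 'blue': 0.58333...})
print(votes.most_common(1)[0][0])  # 'red' wins even though blue has more neighbors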

Another situation is multi-class classification. If we ignore distance, the new green point could receive exactly one vote each for red, blue, and purple from its nearest neighbors, and it would then be assigned a color at random, which is clearly unreasonable. Taking distance into account, red still wins. The code for searching over this distance-weighting hyperparameter is as follows.

import numpy as np
import matplotlib.pyplot as plt
import matplotlib
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

if __name__ == "__main__":

    digits = datasets.load_digits()
    X = digits.data
    y = digits.target
    print(X.shape)
    some_digit = X[666]
    print(y[666])
    some_digit_image = some_digit.reshape(8, 8)
    # plt.imshow(some_digit_image, cmap=matplotlib.cm.binary)
    # plt.show()
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)
    knn_crf = KNeighborsClassifier(n_neighbors=3)
    knn_crf.fit(X_train, y_train)
    print(knn_crf.score(X_test, y_test))
    # Search for the best hyperparameter k
    best_score = 0
    best_k = -1
    for k in range(1, 11):
        knn_clf = KNeighborsClassifier(n_neighbors=k)
        knn_clf.fit(X_train, y_train)
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_k = k
            best_score = score
    print("best_k =", best_k)
    print("best_score =", best_score)
    # Take distance weighting into account
    best_method = ""
    best_score = 0
    best_k = -1
    for method in ["uniform", "distance"]:
        for k in range(1, 11):
            knn_clf = KNeighborsClassifier(n_neighbors=k, weights=method)
            knn_clf.fit(X_train, y_train)
            score = knn_clf.score(X_test, y_test)
            if score > best_score:
                best_k = k
                best_score = score
                best_method = method
    print("best_method =", best_method)
    print("best_k =", best_k)
    print("best_score =", best_score)

Output

(1797, 64)
0
0.9888888888888889
best_k = 4
best_score = 0.9916666666666667
best_method = uniform
best_k = 4
best_score = 0.9916666666666667

So far, whenever we have spoken of distance we have meant the Euclidean distance.

Besides the Euclidean distance there is also the Manhattan distance, which is the sum of the absolute differences of the two samples in each dimension: $\sum_{i=1}^{n}\left|x_i^{(a)} - x_i^{(b)}\right|$. The Euclidean distance can likewise be written as $\left(\sum_{i=1}^{n}\left|x_i^{(a)} - x_i^{(b)}\right|^2\right)^{1/2}$.

This leads to a third distance, the Minkowski distance: $\left(\sum_{i=1}^{n}\left|x_i^{(a)} - x_i^{(b)}\right|^p\right)^{1/p}$.

When p = 1 the Minkowski distance is the Manhattan distance, and when p = 2 it is the Euclidean distance. This gives us a new hyperparameter p.
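
Before searching over p with scikit-learn, here is a small sketch (not from the article) that shows the reduction directly; the two sample points are the new tumor sample and one training sample from the data above:

import numpy as np

def minkowski_distance(a, b, p):
    # Minkowski distance: p-th root of the sum of p-th powers of the per-dimension differences
    return np.sum(np.abs(a - b) ** p) ** (1 / p)

a = np.array([8.093607318, 3.365731514])
b = np.array([7.792783481, 3.424088941])
print(minkowski_distance(a, b, 1))  # p = 1: Manhattan distance
print(minkowski_distance(a, b, 2))  # p = 2: Euclidean distance, same as sqrt(np.sum((a - b) ** 2))
print(minkowski_distance(a, b, 3))  # p = 3: yet another distance, which is why p is a hyperparameter

The scikit-learn search over p then looks like this: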

import numpy as np
import matplotlib.pyplot as plt
import matplotlib
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
import timeit

if __name__ == "__main__":

    digits = datasets.load_digits()
    X = digits.data
    y = digits.target
    print(X.shape)
    some_digit = X[666]
    print(y[666])
    some_digit_image = some_digit.reshape(8, 8)
    # plt.imshow(some_digit_image, cmap=matplotlib.cm.binary)
    # plt.show()
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)
    knn_crf = KNeighborsClassifier(n_neighbors=3)
    knn_crf.fit(X_train, y_train)
    print(knn_crf.score(X_test, y_test))
    # Search for the best hyperparameter k
    best_score = 0
    best_k = -1
    for k in range(1, 11):
        knn_clf = KNeighborsClassifier(n_neighbors=k)
        knn_clf.fit(X_train, y_train)
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_k = k
            best_score = score
    print("best_k =", best_k)
    print("best_score =", best_score)
    # Take distance weighting into account
    best_method = ""
    best_score = 0
    best_k = -1
    for method in ["uniform", "distance"]:
        for k in range(1, 11):
            knn_clf = KNeighborsClassifier(n_neighbors=k, weights=method)
            knn_clf.fit(X_train, y_train)
            score = knn_clf.score(X_test, y_test)
            if score > best_score:
                best_k = k
                best_score = score
                best_method = method
    print("best_method =", best_method)
    print("best_k =", best_k)
    print("best_score =", best_score)
    # Search for the Minkowski distance parameter p
    start_time = timeit.default_timer()
    best_p = -1
    best_score = 0
    best_k = -1
    for k in range(1, 11):
        for p in range(1, 6):
            knn_clf = KNeighborsClassifier(n_neighbors=k, weights='distance', p=p)
            knn_clf.fit(X_train, y_train)
            score = knn_clf.score(X_test, y_test)
            if score > best_score:
                best_k = k
                best_score = score
                best_p = p
    print(timeit.default_timer() - start_time)
    print("best_p =", best_p)
    print("best_k =", best_k)
    print("best_score =", best_score)

Output

(1797, 64)
0
0.9888888888888889
best_k = 4
best_score = 0.9916666666666667
best_method = uniform
best_k = 4
best_score = 0.9916666666666667
10.488031614
best_p = 2
best_k = 3
best_score = 0.9888888888888889

Here we can see that searching over the Minkowski distance parameter p does take some time, about 10 seconds in this run. The result is that the Euclidean distance (p = 2) with k = 3 gives the highest classification accuracy, 98.89%.

Grid Search and More kNN Hyperparameters

So far we have written our own for loops to search the hyperparameters. scikit-learn wraps a dedicated facility for this kind of search, grid search (GridSearchCV).

import numpy as np
import matplotlib.pyplot as plt
import matplotlib
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
import timeit
from sklearn.model_selection import GridSearchCV

if __name__ == "__main__":

    digits = datasets.load_digits()
    X = digits.data
    y = digits.target
    print(X.shape)
    some_digit = X[666]
    print(y[666])
    some_digit_image = some_digit.reshape(8, 8)
    # plt.imshow(some_digit_image, cmap=matplotlib.cm.binary)
    # plt.show()
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)
    knn_crf = KNeighborsClassifier(n_neighbors=3)
    knn_crf.fit(X_train, y_train)
    print(knn_crf.score(X_test, y_test))
    # Search for the best hyperparameter k
    best_score = 0
    best_k = -1
    for k in range(1, 11):
        knn_clf = KNeighborsClassifier(n_neighbors=k)
        knn_clf.fit(X_train, y_train)
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_k = k
            best_score = score
    print("best_k =", best_k)
    print("best_score =", best_score)
    # Take distance weighting into account
    best_method = ""
    best_score = 0
    best_k = -1
    for method in ["uniform", "distance"]:
        for k in range(1, 11):
            knn_clf = KNeighborsClassifier(n_neighbors=k, weights=method)
            knn_clf.fit(X_train, y_train)
            score = knn_clf.score(X_test, y_test)
            if score > best_score:
                best_k = k
                best_score = score
                best_method = method
    print("best_method =", best_method)
    print("best_k =", best_k)
    print("best_score =", best_score)
    # Search for the Minkowski distance parameter p
    start_time = timeit.default_timer()
    best_p = -1
    best_score = 0
    best_k = -1
    for k in range(1, 11):
        for p in range(1, 6):
            knn_clf = KNeighborsClassifier(n_neighbors=k, weights='distance', p=p)
            knn_clf.fit(X_train, y_train)
            score = knn_clf.score(X_test, y_test)
            if score > best_score:
                best_k = k
                best_score = score
                best_p = p
    print(timeit.default_timer() - start_time)
    print("best_p =", best_p)
    print("best_k =", best_k)
    print("best_score =", best_score)
    # Grid search
    param_grid = [
        {
            'weights': ['uniform'],
            'n_neighbors': [i for i in range(1, 11)]
        },
        {
            'weights': ['distance'],
            'n_neighbors': [i for i in range(1, 11)],
            'p': [i for i in range(1, 6)]
        }
    ]
    knn_clf = KNeighborsClassifier()
    grid_search = GridSearchCV(knn_clf, param_grid)
    start_time = timeit.default_timer()
    grid_search.fit(X_train, y_train)
    print(timeit.default_timer() - start_time)
    print(grid_search.best_estimator_)
    print(grid_search.best_score_)
    print(grid_search.best_params_)

Output

(1797, 64)
0
0.9888888888888889
best_k = 4
best_score = 0.9916666666666667
best_method = uniform
best_k = 4
best_score = 0.9916666666666667
10.596328198999998
best_p = 2
best_k = 3
best_score = 0.9888888888888889
36.002135212
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=1, p=2,
                     weights='uniform')
0.9860820751064653
{'n_neighbors': 1, 'weights': 'uniform'}

Here we can see that the grid search took about 36 seconds, and the result it found was k = 1 with the Euclidean distance and uniform weights, giving a best classification accuracy of 98.6%. This seems inconsistent with the best accuracy we found by hand earlier, but the reason is that grid search evaluates each classifier with a more elaborate method, cross-validation (for more on cross-validation, see Machine Learning Algorithms Notes (Part 2)); the evaluation criterion is simply different. Now let's retrieve the kNN classifier that the grid search found for us.

import numpy as np
import matplotlib.pyplot as plt
import matplotlib
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
import timeit
from sklearn.model_selection import GridSearchCV

if __name__ == "__main__":

    digits = datasets.load_digits()
    X = digits.data
    y = digits.target
    print(X.shape)
    some_digit = X[666]
    print(y[666])
    some_digit_image = some_digit.reshape(8, 8)
    # plt.imshow(some_digit_image, cmap=matplotlib.cm.binary)
    # plt.show()
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)
    knn_crf = KNeighborsClassifier(n_neighbors=3)
    knn_crf.fit(X_train, y_train)
    print(knn_crf.score(X_test, y_test))
    # Search for the best hyperparameter k
    best_score = 0
    best_k = -1
    for k in range(1, 11):
        knn_clf = KNeighborsClassifier(n_neighbors=k)
        knn_clf.fit(X_train, y_train)
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_k = k
            best_score = score
    print("best_k =", best_k)
    print("best_score =", best_score)
    # Take distance weighting into account
    best_method = ""
    best_score = 0
    best_k = -1
    for method in ["uniform", "distance"]:
        for k in range(1, 11):
            knn_clf = KNeighborsClassifier(n_neighbors=k, weights=method)
            knn_clf.fit(X_train, y_train)
            score = knn_clf.score(X_test, y_test)
            if score > best_score:
                best_k = k
                best_score = score
                best_method = method
    print("best_method =", best_method)
    print("best_k =", best_k)
    print("best_score =", best_score)
    # Search for the Minkowski distance parameter p
    start_time = timeit.default_timer()
    best_p = -1
    best_score = 0
    best_k = -1
    for k in range(1, 11):
        for p in range(1, 6):
            knn_clf = KNeighborsClassifier(n_neighbors=k, weights='distance', p=p)
            knn_clf.fit(X_train, y_train)
            score = knn_clf.score(X_test, y_test)
            if score > best_score:
                best_k = k
                best_score = score
                best_p = p
    print(timeit.default_timer() - start_time)
    print("best_p =", best_p)
    print("best_k =", best_k)
    print("best_score =", best_score)
    # Grid search
    param_grid = [
        {
            'weights': ['uniform'],
            'n_neighbors': [i for i in range(1, 11)]
        },
        {
            'weights': ['distance'],
            'n_neighbors': [i for i in range(1, 11)],
            'p': [i for i in range(1, 6)]
        }
    ]
    knn_clf = KNeighborsClassifier()
    grid_search = GridSearchCV(knn_clf, param_grid)
    start_time = timeit.default_timer()
    grid_search.fit(X_train, y_train)
    print(timeit.default_timer() - start_time)
    print(grid_search.best_estimator_)
    print(grid_search.best_score_)
    print(grid_search.best_params_)
    knn_clf = grid_search.best_estimator_
    print(knn_clf.predict(X_test))
    print(knn_clf.score(X_test, y_test))

Output

(1797, 64)
0
0.9888888888888889
best_k = 4
best_score = 0.9916666666666667
best_method = uniform
best_k = 4
best_score = 0.9916666666666667
10.529266493
best_p = 2
best_k = 3
best_score = 0.9888888888888889
36.449392196999995
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=1, p=2,
                     weights='uniform')
0.9860820751064653
{'n_neighbors': 1, 'weights': 'uniform'}
[8 1 3 4 4 0 7 0 8 0 4 6 1 1 2 0 1 6 7 3 3 6 3 2 9 4 0 2 0 3 0 8 7 2 3 5 1
 3 1 5 8 6 2 6 3 1 3 0 0 4 9 9 2 8 7 0 5 4 0 9 5 5 9 3 4 2 8 8 7 1 4 3 0 2
 7 2 1 2 4 0 9 0 6 6 2 0 0 5 4 4 3 1 3 8 6 4 4 7 5 6 8 4 8 4 6 9 7 7 0 8 8
 3 9 7 1 8 4 2 7 0 0 4 9 6 7 3 4 6 4 8 4 7 2 6 5 5 8 7 2 5 5 9 7 9 3 1 9 4
 4 1 5 1 6 4 4 8 1 6 2 5 2 1 4 4 3 9 4 0 6 0 8 3 8 7 3 0 3 0 5 9 2 7 1 8 1
 4 3 3 7 8 2 7 2 2 8 0 5 7 6 7 3 4 7 1 7 0 9 2 8 9 3 8 9 1 1 1 9 8 8 0 3 7
 3 3 4 8 2 1 8 6 0 1 7 7 5 8 3 8 7 6 8 4 2 6 2 3 7 4 9 3 5 0 6 3 8 3 3 1 4
 5 3 2 5 6 8 6 9 5 5 3 6 5 9 3 7 7 0 2 4 9 9 9 2 5 6 1 9 6 9 7 7 4 5 0 0 5
 3 8 4 4 3 2 5 3 2 2 3 0 9 8 2 1 4 0 6 2 8 0 6 4 9 9 8 3 9 8 6 3 2 7 9 4 2
 7 5 1 1 6 1 0 4 9 2 9 0 3 3 0 7 4 8 5 9 5 9 5 0 7 9 8]
0.9833333333333333
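
As noted above, the 0.986 reported by best_score_ is a cross-validated score on the training set, while the 0.983 at the end is the accuracy on the held-out test set. Here is a small sketch (not from the article) of how a comparable cross-validated number can be computed directly with cross_val_score, using 5 folds, which is what the verbose output further below ("Fitting 5 folds ...") shows GridSearchCV used here:

from sklearn import datasets
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

if __name__ == "__main__":

    digits = datasets.load_digits()
    X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=0.2, random_state=666)
    knn_clf = KNeighborsClassifier(n_neighbors=1, weights="uniform")
    # Mean accuracy over 5 cross-validation folds of the training set,
    # the kind of number GridSearchCV reports as best_score_
    print(cross_val_score(knn_clf, X_train, y_train, cv=5).mean())
    # Accuracy on the held-out test set, the number computed manually above
    print(knn_clf.fit(X_train, y_train).score(X_test, y_test))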

Grid search also has some parameters that help a great deal with the hyperparameter search, such as parallel processing (n_jobs) and printing the search progress (verbose).

import numpy as np
import matplotlib.pyplot as plt
import matplotlib
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
import timeit
from sklearn.model_selection import GridSearchCV

if __name__ == "__main__":

    digits = datasets.load_digits()
    X = digits.data
    y = digits.target
    print(X.shape)
    some_digit = X[666]
    print(y[666])
    some_digit_image = some_digit.reshape(8, 8)
    # plt.imshow(some_digit_image, cmap=matplotlib.cm.binary)
    # plt.show()
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)
    knn_crf = KNeighborsClassifier(n_neighbors=3)
    knn_crf.fit(X_train, y_train)
    print(knn_crf.score(X_test, y_test))
    # Search for the best hyperparameter k
    best_score = 0
    best_k = -1
    for k in range(1, 11):
        knn_clf = KNeighborsClassifier(n_neighbors=k)
        knn_clf.fit(X_train, y_train)
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_k = k
            best_score = score
    print("best_k =", best_k)
    print("best_score =", best_score)
    # Take distance weighting into account
    best_method = ""
    best_score = 0
    best_k = -1
    for method in ["uniform", "distance"]:
        for k in range(1, 11):
            knn_clf = KNeighborsClassifier(n_neighbors=k, weights=method)
            knn_clf.fit(X_train, y_train)
            score = knn_clf.score(X_test, y_test)
            if score > best_score:
                best_k = k
                best_score = score
                best_method = method
    print("best_method =", best_method)
    print("best_k =", best_k)
    print("best_score =", best_score)
    # Search for the Minkowski distance parameter p
    start_time = timeit.default_timer()
    best_p = -1
    best_score = 0
    best_k = -1
    for k in range(1, 11):
        for p in range(1, 6):
            knn_clf = KNeighborsClassifier(n_neighbors=k, weights='distance', p=p)
            knn_clf.fit(X_train, y_train)
            score = knn_clf.score(X_test, y_test)
            if score > best_score:
                best_k = k
                best_score = score
                best_p = p
    print(timeit.default_timer() - start_time)
    print("best_p =", best_p)
    print("best_k =", best_k)
    print("best_score =", best_score)
    # Grid search
    param_grid = [
        {
            'weights': ['uniform'],
            'n_neighbors': [i for i in range(1, 11)]
        },
        {
            'weights': ['distance'],
            'n_neighbors': [i for i in range(1, 11)],
            'p': [i for i in range(1, 6)]
        }
    ]
    knn_clf = KNeighborsClassifier()
    grid_search = GridSearchCV(knn_clf, param_grid)
    start_time = timeit.default_timer()
    grid_search.fit(X_train, y_train)
    print(timeit.default_timer() - start_time)
    print(grid_search.best_estimator_)
    print(grid_search.best_score_)
    print(grid_search.best_params_)
    knn_clf = grid_search.best_estimator_
    print(knn_clf.predict(X_test))
    print(knn_clf.score(X_test, y_test))
    # Parallel search with verbose progress output
    grid_search = GridSearchCV(knn_clf, param_grid, n_jobs=-1, verbose=2)
    start_time = timeit.default_timer()
    grid_search.fit(X_train, y_train)
    print(timeit.default_timer() - start_time)

Output

(1797, 64)
0
0.9888888888888889
best_k = 4
best_score = 0.9916666666666667
best_method = uniform
best_k = 4
best_score = 0.9916666666666667
10.68357993
best_p = 2
best_k = 3
best_score = 0.9888888888888889
35.479140363
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=1, p=2,
                     weights='uniform')
0.9860820751064653
{'n_neighbors': 1, 'weights': 'uniform'}
[8 1 3 4 4 0 7 0 8 0 4 6 1 1 2 0 1 6 7 3 3 6 3 2 9 4 0 2 0 3 0 8 7 2 3 5 1
 3 1 5 8 6 2 6 3 1 3 0 0 4 9 9 2 8 7 0 5 4 0 9 5 5 9 3 4 2 8 8 7 1 4 3 0 2
 7 2 1 2 4 0 9 0 6 6 2 0 0 5 4 4 3 1 3 8 6 4 4 7 5 6 8 4 8 4 6 9 7 7 0 8 8
 3 9 7 1 8 4 2 7 0 0 4 9 6 7 3 4 6 4 8 4 7 2 6 5 5 8 7 2 5 5 9 7 9 3 1 9 4
 4 1 5 1 6 4 4 8 1 6 2 5 2 1 4 4 3 9 4 0 6 0 8 3 8 7 3 0 3 0 5 9 2 7 1 8 1
 4 3 3 7 8 2 7 2 2 8 0 5 7 6 7 3 4 7 1 7 0 9 2 8 9 3 8 9 1 1 1 9 8 8 0 3 7
 3 3 4 8 2 1 8 6 0 1 7 7 5 8 3 8 7 6 8 4 2 6 2 3 7 4 9 3 5 0 6 3 8 3 3 1 4
 5 3 2 5 6 8 6 9 5 5 3 6 5 9 3 7 7 0 2 4 9 9 9 2 5 6 1 9 6 9 7 7 4 5 0 0 5
 3 8 4 4 3 2 5 3 2 2 3 0 9 8 2 1 4 0 6 2 8 0 6 4 9 9 8 3 9 8 6 3 2 7 9 4 2
 7 5 1 1 6 1 0 4 9 2 9 0 3 3 0 7 4 8 5 9 5 9 5 0 7 9 8]
0.9833333333333333
Fitting 5 folds for each of 60 candidates, totalling 300 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[CV] n_neighbors=1, weights=uniform ..................................
[CV] n_neighbors=1, weights=uniform ..................................
[CV] n_neighbors=1, weights=uniform ..................................
[CV] n_neighbors=1, weights=uniform ..................................
[CV] n_neighbors=1, weights=uniform ..................................
[CV] n_neighbors=2, weights=uniform ..................................
[CV] n_neighbors=2, weights=uniform ..................................
[CV] n_neighbors=2, weights=uniform ..................................
[CV] ................... n_neighbors=1, weights=uniform, total=   0.1s
[CV] n_neighbors=2, weights=uniform ..................................
[CV] n_neighbors=2, weights=uniform ..................................
[CV] ................... n_neighbors=1, weights=uniform, total=   0.1s
[CV] n_neighbors=4, weights=uniform ..................................
[CV] ................... n_neighbors=1, weights=uniform, total=   0.1s
[CV] ................... n_neighbors=1, weights=uniform, total=   0.1s
[CV] ................... n_neighbors=1, weights=uniform, total=   0.1s
[CV] n_neighbors=3, weights=uniform ..................................
[CV] n_neighbors=3, weights=uniform ..................................
[CV] n_neighbors=4, weights=uniform ..................................
[CV] n_neighbors=3, weights=uniform ..................................
[CV] n_neighbors=3, weights=uniform ..................................
[CV] ................... n_neighbors=2, weights=uniform, total=   0.1s
[CV] n_neighbors=4, weights=uniform ..................................
[CV] n_neighbors=3, weights=uniform ..................................
[CV] n_neighbors=4, weights=uniform ..................................
[CV] ................... n_neighbors=2, weights=uniform, total=   0.1s
[CV] n_neighbors=4, weights=uniform ..................................
[CV] ................... n_neighbors=2, weights=uniform, total=   0.1s
[CV] n_neighbors=5, weights=uniform ..................................
[CV] ................... n_neighbors=2, weights=uniform, total=   0.1s
[CV] n_neighbors=5, weights=uniform ..................................
[CV] ................... n_neighbors=2, weights=uniform, total=   0.1s
[CV] n_neighbors=5, weights=uniform ..................................
[CV] n_neighbors=5, weights=uniform ..................................
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:    0.9s
[CV] ................... n_neighbors=4, weights=uniform, total=   0.1s
[CV] n_neighbors=5, weights=uniform ..................................
[CV] ................... n_neighbors=3, weights=uniform, total=   0.1s
[CV] n_neighbors=6, weights=uniform ..................................
[CV] ................... n_neighbors=3, weights=uniform, total=   0.1s
[CV] n_neighbors=6, weights=uniform ..................................
[CV] ................... n_neighbors=4, weights=uniform, total=   0.1s
[CV] ................... n_neighbors=3, weights=uniform, total=   0.1s
[CV] ................... n_neighbors=3, weights=uniform, total=   0.1s
[CV] n_neighbors=6, weights=uniform ..................................
[CV] ................... n_neighbors=4, weights=uniform, total=   0.1s
[CV] n_neighbors=6, weights=uniform ..................................
[CV] ................... n_neighbors=3, weights=uniform, total=   0.1s
[CV] ................... n_neighbors=4, weights=uniform, total=   0.1s
[CV] n_neighbors=6, weights=uniform ..................................
[CV] ................... n_neighbors=4, weights=uniform, total=   0.1s
[CV] ................... n_neighbors=5, weights=uniform, total=   0.1s
[CV] n_neighbors=7, weights=uniform ..................................
[CV] n_neighbors=7, weights=uniform ..................................
[CV] n_neighbors=7, weights=uniform ..................................
[CV] n_neighbors=7, weights=uniform ..................................
[CV] n_neighbors=7, weights=uniform ..................................
[CV] ................... n_neighbors=5, weights=uniform, total=   0.1s
[CV] ................... n_neighbors=5, weights=uniform, total=   0.1s
[CV] n_neighbors=8, weights=uniform ..................................
[CV] ................... n_neighbors=5, weights=uniform, total=   0.1s
[CV] n_neighbors=8, weights=uniform ..................................
[CV] n_neighbors=8, weights=uniform ..................................
[CV] n_neighbors=8, weights=uniform ..................................
[CV] ................... n_neighbors=5, weights=uniform, total=   0.1s
[CV] n_neighbors=8, weights=uniform ..................................
[CV] ................... n_neighbors=6, weights=uniform, total=   0.1s
[CV] n_neighbors=9, weights=uniform ..................................
[CV] n_neighbors=9, weights=uniform ..................................
[CV] ................... n_neighbors=6, weights=uniform, total=   0.1s
[CV] n_neighbors=9, weights=uniform ..................................
[CV] ................... n_neighbors=6, weights=uniform, total=   0.1s
[CV] n_neighbors=9, weights=uniform ..................................
[CV] ................... n_neighbors=6, weights=uniform, total=   0.1s
[CV] ................... n_neighbors=6, weights=uniform, total=   0.1s
[CV] n_neighbors=9, weights=uniform ..................................
[CV] n_neighbors=10, weights=uniform .................................
[CV] ................... n_neighbors=7, weights=uniform, total=   0.1s
[CV] n_neighbors=10, weights=uniform .................................
[CV] ................... n_neighbors=7, weights=uniform, total=   0.1s
[CV] ................... n_neighbors=7, weights=uniform, total=   0.1s
[CV] ................... n_neighbors=7, weights=uniform, total=   0.1s
[CV] n_neighbors=10, weights=uniform .................................
[CV] ................... n_neighbors=7, weights=uniform, total=   0.1s
[CV] ................... n_neighbors=8, weights=uniform, total=   0.1s
[CV] ................... n_neighbors=8, weights=uniform, total=   0.1s
[CV] n_neighbors=10, weights=uniform .................................
[CV] n_neighbors=10, weights=uniform .................................
[CV] ................... n_neighbors=8, weights=uniform, total=   0.1s
[CV] n_neighbors=1, p=1, weights=distance ............................
[CV] ................... n_neighbors=8, weights=uniform, total=   0.1s
[CV] n_neighbors=1, p=1, weights=distance ............................
[CV] n_neighbors=1, p=1, weights=distance ............................
[CV] ................... n_neighbors=8, weights=uniform, total=   0.1s
[CV] ................... n_neighbors=9, weights=uniform, total=   0.1s
[CV] n_neighbors=1, p=1, weights=distance ............................
[CV] ................... n_neighbors=9, weights=uniform, total=   0.1s
[CV] n_neighbors=1, p=1, weights=distance ............................
[CV] ................... n_neighbors=9, weights=uniform, total=   0.1s
[CV] n_neighbors=1, p=2, weights=distance ............................
[CV] n_neighbors=1, p=2, weights=distance ............................
[CV] ................... n_neighbors=9, weights=uniform, total=   0.1s
[CV] n_neighbors=1, p=2, weights=distance ............................
[CV] ................... n_neighbors=9, weights=uniform, total=   0.1s
[CV] n_neighbors=1, p=2, weights=distance ............................
[CV] .................. n_neighbors=10, weights=uniform, total=   0.1s
[CV] n_neighbors=1, p=2, weights=distance ............................
[CV] n_neighbors=1, p=3, weights=distance ............................
[CV] .................. n_neighbors=10, weights=uniform, total=   0.1s
[CV] n_neighbors=1, p=3, weights=distance ............................
[CV] .................. n_neighbors=10, weights=uniform, total=   0.0s
[CV] n_neighbors=1, p=3, weights=distance ............................
[CV] ............. n_neighbors=1, p=1, weights=distance, total=   0.0s
[CV] ............. n_neighbors=1, p=1, weights=distance, total=   0.0s
[CV] n_neighbors=1, p=3, weights=distance ............................
[CV] ............. n_neighbors=1, p=1, weights=distance, total=   0.0s
[CV] .................. n_neighbors=10, weights=uniform, total=   0.1s
[CV] ............. n_neighbors=1, p=2, weights=distance, total=   0.0s
[CV] n_neighbors=1, p=3, weights=distance ............................
[CV] .................. n_neighbors=10, weights=uniform, total=   0.1s
[CV] ............. n_neighbors=1, p=1, weights=distance, total=   0.0s
[CV] ............. n_neighbors=1, p=1, weights=distance, total=   0.0s
[CV] n_neighbors=1, p=4, weights=distance ............................
[CV] ............. n_neighbors=1, p=2, weights=distance, total=   0.0s
[CV] n_neighbors=1, p=4, weights=distance ............................
[CV] ............. n_neighbors=1, p=2, weights=distance, total=   0.0s
[CV] n_neighbors=1, p=4, weights=distance ............................
[CV] n_neighbors=1, p=4, weights=distance ............................
[CV] n_neighbors=1, p=4, weights=distance ............................
[CV] ............. n_neighbors=1, p=2, weights=distance, total=   0.0s
[CV] ............. n_neighbors=1, p=2, weights=distance, total=   0.0s
[CV] n_neighbors=1, p=5, weights=distance ............................
[CV] n_neighbors=1, p=5, weights=distance ............................
[CV] n_neighbors=1, p=5, weights=distance ............................
[CV] n_neighbors=1, p=5, weights=distance ............................
[CV] n_neighbors=1, p=5, weights=distance ............................
[CV] n_neighbors=2, p=1, weights=distance ............................
[CV] ............. n_neighbors=2, p=1, weights=distance, total=   0.0s
[CV] n_neighbors=2, p=1, weights=distance ............................
[CV] ............. n_neighbors=2, p=1, weights=distance, total=   0.0s
[CV] n_neighbors=2, p=1, weights=distance ............................
[CV] ............. n_neighbors=2, p=1, weights=distance, total=   0.0s
[CV] n_neighbors=2, p=1, weights=distance ............................
[CV] ............. n_neighbors=2, p=1, weights=distance, total=   0.0s
[CV] n_neighbors=2, p=1, weights=distance ............................
[CV] ............. n_neighbors=2, p=1, weights=distance, total=   0.0s
[CV] n_neighbors=2, p=2, weights=distance ............................
[CV] ............. n_neighbors=1, p=5, weights=distance, total=   0.3s
[CV] n_neighbors=2, p=2, weights=distance ............................
[CV] ............. n_neighbors=1, p=4, weights=distance, total=   0.3s
[CV] ............. n_neighbors=1, p=5, weights=distance, total=   0.3s
[CV] ............. n_neighbors=1, p=5, weights=distance, total=   0.3s
[CV] ............. n_neighbors=1, p=5, weights=distance, total=   0.3s
[CV] ............. n_neighbors=1, p=4, weights=distance, total=   0.3s
[CV] n_neighbors=2, p=2, weights=distance ............................
[CV] n_neighbors=2, p=3, weights=distance ............................
[CV] ............. n_neighbors=1, p=5, weights=distance, total=   0.3s
[CV] ............. n_neighbors=1, p=4, weights=distance, total=   0.3s
[CV] n_neighbors=2, p=3, weights=distance ............................
[CV] ............. n_neighbors=1, p=3, weights=distance, total=   0.3s
[CV] ............. n_neighbors=1, p=4, weights=distance, total=   0.3s
[CV] n_neighbors=2, p=4, weights=distance ............................
[CV] ............. n_neighbors=1, p=3, weights=distance, total=   0.3s
[CV] ............. n_neighbors=1, p=3, weights=distance, total=   0.3s
[CV] ............. n_neighbors=1, p=4, weights=distance, total=   0.3s
[CV] n_neighbors=2, p=4, weights=distance ............................
[CV] ............. n_neighbors=2, p=2, weights=distance, total=   0.0s
[CV] n_neighbors=2, p=2, weights=distance ............................
[CV] n_neighbors=2, p=4, weights=distance ............................
[CV] ............. n_neighbors=1, p=3, weights=distance, total=   0.3s
[CV] n_neighbors=2, p=5, weights=distance ............................
[CV] ............. n_neighbors=1, p=3, weights=distance, total=   0.3s
[CV] n_neighbors=2, p=5, weights=distance ............................
[CV] n_neighbors=3, p=1, weights=distance ............................
[CV] n_neighbors=3, p=1, weights=distance ............................
[CV] ............. n_neighbors=2, p=2, weights=distance, total=   0.0s
[CV] n_neighbors=2, p=2, weights=distance ............................
[CV] n_neighbors=3, p=1, weights=distance ............................
[CV] n_neighbors=3, p=2, weights=distance ............................
[CV] n_neighbors=3, p=2, weights=distance ............................
[CV] ............. n_neighbors=2, p=2, weights=distance, total=   0.0s
[CV] n_neighbors=2, p=3, weights=distance ............................
[CV] n_neighbors=3, p=3, weights=distance ............................
[CV] ............. n_neighbors=2, p=2, weights=distance, total=   0.0s
[CV] n_neighbors=3, p=3, weights=distance ............................
[CV] ............. n_neighbors=3, p=1, weights=distance, total=   0.0s
[CV] n_neighbors=3, p=1, weights=distance ............................
[CV] ............. n_neighbors=3, p=1, weights=distance, total=   0.0s
[CV] n_neighbors=3, p=1, weights=distance ............................
[CV] ............. n_neighbors=2, p=2, weights=distance, total=   0.0s
[CV] ............. n_neighbors=3, p=1, weights=distance, total=   0.0s
[CV] n_neighbors=3, p=2, weights=distance ............................
[CV] n_neighbors=3, p=3, weights=distance ............................
[CV] ............. n_neighbors=3, p=2, weights=distance, total=   0.0s
[CV] n_neighbors=3, p=2, weights=distance ............................
[CV] ............. n_neighbors=3, p=2, weights=distance, total=   0.0s
[CV] n_neighbors=3, p=2, weights=distance ............................
[CV] ............. n_neighbors=3, p=1, weights=distance, total=   0.0s
[CV] ............. n_neighbors=3, p=1, weights=distance, total=   0.0s
[CV] n_neighbors=3, p=4, weights=distance ............................
[CV] ............. n_neighbors=3, p=2, weights=distance, total=   0.0s
[CV] ............. n_neighbors=3, p=2, weights=distance, total=   0.0s
[CV] n_neighbors=3, p=4, weights=distance ............................
[CV] n_neighbors=3, p=5, weights=distance ............................
[CV] ............. n_neighbors=3, p=2, weights=distance, total=   0.0s
[CV] n_neighbors=3, p=5, weights=distance ............................
[CV] n_neighbors=3, p=5, weights=distance ............................
[CV] ............. n_neighbors=2, p=4, weights=distance, total=   0.3s
[CV] n_neighbors=2, p=4, weights=distance ............................
[CV] ............. n_neighbors=2, p=5, weights=distance, total=   0.3s
[CV] n_neighbors=2, p=5, weights=distance ............................
[CV] ............. n_neighbors=2, p=4, weights=distance, total=   0.3s
[CV] n_neighbors=2, p=4, weights=distance ............................
[CV] ............. n_neighbors=2, p=5, weights=distance, total=   0.3s
[CV] n_neighbors=2, p=5, weights=distance ............................
[CV] ............. n_neighbors=2, p=4, weights=distance, total=   0.3s
[CV] n_neighbors=2, p=5, weights=distance ............................
[CV] ............. n_neighbors=2, p=3, weights=distance, total=   0.3s
[CV] n_neighbors=2, p=3, weights=distance ............................
[CV] ............. n_neighbors=2, p=3, weights=distance, total=   0.4s
[CV] n_neighbors=2, p=3, weights=distance ............................
[CV] ............. n_neighbors=2, p=3, weights=distance, total=   0.4s
[CV] n_neighbors=4, p=1, weights=distance ............................
[CV] ............. n_neighbors=3, p=3, weights=distance, total=   0.4s
[CV] n_neighbors=3, p=3, weights=distance ............................
[CV] ............. n_neighbors=3, p=5, weights=distance, total=   0.4s
[CV] n_neighbors=3, p=5, weights=distance ............................
[CV] ............. n_neighbors=3, p=3, weights=distance, total=   0.4s
[CV] n_neighbors=3, p=3, weights=distance ............................
[CV] ............. n_neighbors=3, p=3, weights=distance, total=   0.4s
[CV] ............. n_neighbors=3, p=5, weights=distance, total=   0.4s
[CV] n_neighbors=3, p=4, weights=distance ............................
[CV] n_neighbors=4, p=1, weights=distance ............................
[CV] ............. n_neighbors=4, p=1, weights=distance, total=   0.1s
[CV] n_neighbors=4, p=1, weights=distance ............................
[CV] ............. n_neighbors=3, p=5, weights=distance, total=   0.4s
[CV] n_neighbors=3, p=5, weights=distance ............................
[CV] ............. n_neighbors=3, p=4, weights=distance, total=   0.4s
[CV] n_neighbors=3, p=4, weights=distance ............................
[CV] ............. n_neighbors=3, p=4, weights=distance, total=   0.4s
[CV] n_neighbors=3, p=4, weights=distance ............................
[CV] ............. n_neighbors=4, p=1, weights=distance, total=   0.1s
[CV] ............. n_neighbors=4, p=1, weights=distance, total=   0.1s
[CV] n_neighbors=4, p=1, weights=distance ............................
[CV] n_neighbors=4, p=2, weights=distance ............................
[CV] ............. n_neighbors=4, p=1, weights=distance, total=   0.1s
[CV] n_neighbors=4, p=1, weights=distance ............................
[CV] ............. n_neighbors=4, p=2, weights=distance, total=   0.1s
[CV] n_neighbors=4, p=2, weights=distance ............................
[CV] ............. n_neighbors=4, p=1, weights=distance, total=   0.1s
[CV] ............. n_neighbors=4, p=2, weights=distance, total=   0.1s
[CV] n_neighbors=4, p=2, weights=distance ............................
[CV] n_neighbors=4, p=2, weights=distance ............................
[CV] ............. n_neighbors=2, p=5, weights=distance, total=   0.4s
[CV] n_neighbors=4, p=3, weights=distance ............................
[CV] ............. n_neighbors=2, p=5, weights=distance, total=   0.4s
[CV] ............. n_neighbors=2, p=5, weights=distance, total=   0.4s
[CV] n_neighbors=4, p=3, weights=distance ............................
[CV] n_neighbors=4, p=4, weights=distance ............................
[CV] ............. n_neighbors=2, p=4, weights=distance, total=   0.5s
[CV] ............. n_neighbors=2, p=4, weights=distance, total=   0.5s
[CV] n_neighbors=4, p=4, weights=distance ............................
[CV] ............. n_neighbors=4, p=2, weights=distance, total=   0.1s
[CV] n_neighbors=4, p=3, weights=distance ............................
[CV] n_neighbors=4, p=4, weights=distance ............................
[CV] ............. n_neighbors=4, p=2, weights=distance, total=   0.1s
[CV] n_neighbors=4, p=2, weights=distance ............................
[CV] ............. n_neighbors=2, p=3, weights=distance, total=   0.5s
[CV] n_neighbors=4, p=5, weights=distance ............................
[CV] ............. n_neighbors=4, p=2, weights=distance, total=   0.0s
[CV] n_neighbors=4, p=5, weights=distance ............................
[CV] ............. n_neighbors=2, p=3, weights=distance, total=   0.5s
[CV] n_neighbors=5, p=1, weights=distance ............................
[CV] ............. n_neighbors=5, p=1, weights=distance, total=   0.0s
[CV] n_neighbors=5, p=1, weights=distance ............................
[CV] ............. n_neighbors=3, p=5, weights=distance, total=   0.4s
[CV] n_neighbors=5, p=1, weights=distance ............................
[CV] ............. n_neighbors=3, p=4, weights=distance, total=   0.4s
[CV] n_neighbors=5, p=1, weights=distance ............................
[CV] ............. n_neighbors=3, p=3, weights=distance, total=   0.5s
[CV] n_neighbors=5, p=2, weights=distance ............................
[CV] ............. n_neighbors=3, p=5, weights=distance, total=   0.4s
[CV] ............. n_neighbors=3, p=4, weights=distance, total=   0.4s
[CV] n_neighbors=5, p=2, weights=distance ............................
[CV] ............. n_neighbors=3, p=3, weights=distance, total=   0.5s
[CV] n_neighbors=5, p=3, weights=distance ............................
[CV] ............. n_neighbors=3, p=4, weights=distance, total=   0.4s
[CV] n_neighbors=5, p=3, weights=distance ............................
[CV] ............. n_neighbors=5, p=1, weights=distance, total=   0.0s
[CV] n_neighbors=5, p=3, weights=distance ............................
[CV] n_neighbors=5, p=4, weights=distance ............................
[CV] ............. n_neighbors=5, p=1, weights=distance, total=   0.0s
[CV] n_neighbors=5, p=1, weights=distance ............................
[CV] ............. n_neighbors=5, p=1, weights=distance, total=   0.0s
[CV] n_neighbors=5, p=2, weights=distance ............................
[CV] ............. n_neighbors=5, p=2, weights=distance, total=   0.0s
[CV] n_neighbors=5, p=2, weights=distance ............................
[CV] ............. n_neighbors=5, p=2, weights=distance, total=   0.0s
[CV] n_neighbors=5, p=2, weights=distance ............................
[CV] ............. n_neighbors=5, p=1, weights=distance, total=   0.0s
[CV] n_neighbors=5, p=4, weights=distance ............................
[CV] ............. n_neighbors=5, p=2, weights=distance, total=   0.0s
[CV] n_neighbors=5, p=5, weights=distance ............................
[CV] ............. n_neighbors=5, p=2, weights=distance, total=   0.1s
[CV] ............. n_neighbors=5, p=2, weights=distance, total=   0.0s
[CV] n_neighbors=5, p=5, weights=distance ............................
[CV] n_neighbors=5, p=5, weights=distance ............................
[CV] ............. n_neighbors=4, p=4, weights=distance, total=   0.4s
[CV] n_neighbors=4, p=4, weights=distance ............................
[CV] ............. n_neighbors=4, p=3, weights=distance, total=   0.5s
[CV] n_neighbors=4, p=3, weights=distance ............................
[CV] ............. n_neighbors=4, p=4, weights=distance, total=   0.4s
[CV] n_neighbors=4, p=5, weights=distance ............................
[CV] ............. n_neighbors=4, p=3, weights=distance, total=   0.5s
[CV] n_neighbors=4, p=3, weights=distance ............................
[CV] ............. n_neighbors=4, p=4, weights=distance, total=   0.5s
[CV] n_neighbors=4, p=4, weights=distance ............................
[CV] ............. n_neighbors=4, p=3, weights=distance, total=   0.5s
[CV] n_neighbors=6, p=1, weights=distance ............................
[CV] ............. n_neighbors=4, p=5, weights=distance, total=   0.4s
[CV] n_neighbors=4, p=5, weights=distance ............................
[CV] ............. n_neighbors=4, p=5, weights=distance, total=   0.5s
[CV] n_neighbors=4, p=5, weights=distance ............................
[CV] ............. n_neighbors=6, p=1, weights=distance, total=   0.1s
[CV] n_neighbors=6, p=1, weights=distance ............................
[CV] ............. n_neighbors=5, p=4, weights=distance, total=   0.5s
......(中间省略大量格式相同的 [CV] 超参数搜索日志)......
[Parallel(n_jobs=-1)]: Done 180 tasks      | elapsed:    3.3s
......(中间省略大量格式相同的 [CV] 超参数搜索日志)......
[Parallel(n_jobs=-1)]: Done 269 out of 300 | elapsed:    5.0s remaining:    0.6s
......(中间省略大量格式相同的 [CV] 超参数搜索日志)......
[CV] ............ n_neighbors=10, p=5, weights=distance, total=   0.4s
5.793163790999998

在这里我们可以看到,进行并行搜索只用了约5.79秒的时间,并且打印出了搜索过程中各组超参数的情况。

更多的距离定义

除了使用闵可夫斯基距离以外,我们还可以使用其他的距离定义,比如

  1. 向量空间余弦相似度 Cosine Similarity
  2. 调整余弦相似度 Adjusted Cosine Similarity
  3. 皮尔森相关系数 Pearson Correlation Coefficient
  4. Jaccard相似系数 Jaccard Coefficient

kNN算法中有一个叫做metric的超参数,它默认是闵可夫斯基距离,我们也可以调整这个参数来使用其他的距离,如下面的示意代码所示。
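下面给出一个最小示意(非原文内容,基于 scikit-learn 的 KNeighborsClassifier;其中的数据集和 metric 取值只是假设性的例子,实际可用的 metric 取值以所用 scikit-learn 版本的文档为准):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

if __name__ == "__main__":

    # 载入一份演示数据
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=666)
    # 默认 metric="minkowski",配合 p=2 就是欧拉距离
    knn_minkowski = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2)
    knn_minkowski.fit(X_train, y_train)
    print(knn_minkowski.score(X_test, y_test))
    # 改用余弦距离(假设当前版本支持 metric="cosine",如不支持可传入自定义的距离函数)
    knn_cosine = KNeighborsClassifier(n_neighbors=5, metric="cosine")
    knn_cosine.fit(X_train, y_train)
    print(knn_cosine.score(X_test, y_test))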

线性回归法

线性回归算法

  1. 解决回归问题
  2. 思想简单,实现容易
  3. 许多强大的非线性模型的基础
  4. 结果具有很好的可解释性
  5. 蕴含机器学习中的很多重要思想

分类问题和回归问题

跟上一章的分类问题不同:在分类问题中(左图),横轴和纵轴都是样本特征,每一个点有两个样本特征;而在线性回归的图形中(右图),只有横轴是样本特征,纵轴是样本的输出标记。如果我们想观察有两个样本特征的回归问题,就需要在三维空间中进行观察。

简单线性回归:样本特征只有一个,称为简单线性回归。

在坐标系中直线的方程为y=ax+b,其中a是斜率,b是截距。

 

对于每一个样本点 $x^{(i)}$,根据直线方程有预测值 $\hat{y}^{(i)}=ax^{(i)}+b$,其中 $\hat{y}^{(i)}$ 是预测值,$y^{(i)}$ 是真值。我们希望 $y^{(i)}$ 和 $\hat{y}^{(i)}$ 的差距尽量小。表达这个差距:$y^{(i)}-\hat{y}^{(i)}$,由于该值有正有负,所以使用绝对值 $|y^{(i)}-\hat{y}^{(i)}|$。虽然使用绝对值来表示差距是可以的,但是在我们计算a、b的时候,使用绝对值非常不方便,它不是一个处处可导的函数。为了使得函数可导,可以使用 $(y^{(i)}-\hat{y}^{(i)})^2$,这是一个非常好的、处处可导的连续函数。考虑所有的样本,则有 $\sum_{i=1}^{m}(y^{(i)}-\hat{y}^{(i)})^2$。目标就是使得 $\sum_{i=1}^{m}(y^{(i)}-\hat{y}^{(i)})^2$ 尽可能小。因为 $\hat{y}^{(i)}=ax^{(i)}+b$,则目标为:找到a和b,使得 $\sum_{i=1}^{m}(y^{(i)}-ax^{(i)}-b)^2$ 尽可能小。

我们把 $\sum_{i=1}^{m}(y^{(i)}-\hat{y}^{(i)})^2$ 称为损失函数(loss function),它度量的是我们的模型没有拟合住样本的那一部分,也就是损失的那一部分。但是在有的算法中度量的是拟合的程度,在这种情况下,称这个函数为效用函数(utility function)。不管是损失函数还是效用函数,我们的机器学习算法都是通过分析问题,确定问题的损失函数或者效用函数;再通过最优化损失函数或者效用函数,获得机器学习模型。这是一种求解机器学习算法的思路,近乎所有参数学习算法都是这样的套路。机器学习算法分为参数学习算法和非参数学习算法,所谓参数学习算法就是我们要创建一个模型,机器学习的任务就是学习这些模型的参数。参数学习算法包括

  1. 线性回归
  2. 多项式回归
  3. 逻辑回归
  4. SVM
  5. 神经网络

区别在于他们的模型不同,建立的目标函数不同,优化的方式不同。正因为大部分的机器学习算法都拥有这样的思路,所以有一个重要的学科叫做最优化原理。实际上最优化原理不仅仅是机器学习算法中使用的思路,在经典的传统计算机算法领域,最优化原理也发挥着作用,我们使用计算机解决的许多问题,其本质其实都是一个最优化问题,比如图的最短路径、最小生成树、背包问题等等。最优化领域的分支中有一个分支叫凸优化,它解决的是一类特殊的优化问题。

使 $\sum_{i=1}^{m}(y^{(i)}-ax^{(i)}-b)^2$ 尽可能小,是典型的最小二乘法问题:最小化误差的平方。通常系数a、b都有直接的表达公式

$$a=\frac{\sum_{i=1}^{m}(x^{(i)}-\bar{x})(y^{(i)}-\bar{y})}{\sum_{i=1}^{m}(x^{(i)}-\bar{x})^2},\qquad b=\bar{y}-a\bar{x}$$

它们的推导也是通过偏导数来进行的,具体参考高等数学整理(二),这里的 $\bar{x}$、$\bar{y}$ 指的是x、y数据集的平均值。
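作为一个补充的小验证(非原文内容,数据沿用下文实现中的那组样例点),可以直接按上面的公式计算a、b,并与 numpy 自带的一次多项式拟合结果对比,两者应当一致:

import numpy as np

if __name__ == "__main__":

    x = np.array([1., 2., 3., 4., 5.])
    y = np.array([1., 3., 2., 3., 5.])
    x_mean, y_mean = np.mean(x), np.mean(y)
    # 按闭式解计算斜率a和截距b
    a = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    b = y_mean - a * x_mean
    print(a, b)                 # 0.8 0.4
    print(np.polyfit(x, y, 1))  # [斜率, 截距],应与上面的结果一致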

简单线性回归的实现

import numpy as np

class SimpleLinearRegression:

    def __init__(self):
        # 初始化Simple Linear Regression模型
        self._a = None
        self._b = None

    def fit(self, x_train, y_train):
        # 根据训练数据集x_train,y_train训练Simple Linear Regression模型
        assert x_train.ndim == 1, \
            "简单线性回归仅能处理有1个特征的训练数据"
        assert len(x_train) == len(y_train), \
            "x_train的长度必须与y_train的长度相等"
        # 获取x、y训练数据集的平均值
        x_mean = np.mean(x_train)
        y_mean = np.mean(y_train)
        num = 0
        d = 0
        for x, y in zip(x_train, y_train):
            num += (x - x_mean) * (y - y_mean)
            d += (x - x_mean)**2
        self._a = num / d
        self._b = y_mean - self._a * x_mean
        return self

    def predict(self, x_predict):
        # 给定待预测数据集x_predict,返回表示x_predict的结果向量
        assert x_predict.ndim == 1, \
            "简单线性回归仅能处理有1个特征的训练数据"
        assert self._a is not None and self._b is not None, \
            "开始预测前必须fit"
        return np.array([self._predict(x) for x in x_predict])

    def _predict(self, x_single):
        # 给定单个待预测数据x_single,返回x_single的预测结果值
        return self._a * x_single + self._b

    def __repr__(self):
        return "SimpleLinearRegression()"
from playLA.SimpleLinearRegression import SimpleLinearRegression
import numpy as np
import matplotlib.pyplot as plt

if __name__ == "__main__":

    reg1 = SimpleLinearRegression()
    x = np.array([1, 2, 3, 4, 5])
    y = np.array([1, 3, 2, 3, 5])
    reg1.fit(x, y)
    x_predict = 6
    print(reg1.predict(np.array([x_predict])))
    print(reg1._a)
    print(reg1._b)
    y_hat1 = reg1.predict(x)
    plt.scatter(x, y)
    plt.plot(x, y_hat1, color="r")
    plt.axis([0, 6, 0, 6])
    plt.show()

运行结果

[5.2]
0.8
0.39999999999999947

向量化

在上面的实现当中,我们使用了循环叠加的方式来计算a分子、分母的每一个值,这样运算的性能是比较低的。

我们来看一下a的表达式 $a=\frac{\sum_{i=1}^{m}(x^{(i)}-\bar{x})(y^{(i)}-\bar{y})}{\sum_{i=1}^{m}(x^{(i)}-\bar{x})^2}$。在线性代数中,我们知道两个向量的点乘就是各个分量相乘再相加,即 $w\cdot v=\sum_{i=1}^{m}w^{(i)}v^{(i)}$。那么在a的分子当中,w就相当于由各个 $(x^{(i)}-\bar{x})$ 构成的向量,v就相当于由各个 $(y^{(i)}-\bar{y})$ 构成的向量;而在a的分母当中,w、v都是由各个 $(x^{(i)}-\bar{x})$ 构成的向量。这样分子、分母就都转化成了两个向量的点乘 w•v。

import numpy as np

class SimpleLinearRegressionVector:

    def __init__(self):
        # 初始化Simple Linear Regression模型
        self._a = None
        self._b = None

    def fit(self, x_train, y_train):
        # 根据训练数据集x_train,y_train训练Simple Linear Regression模型
        assert x_train.ndim == 1, \
            "简单线性回归仅能处理有1个特征的训练数据"
        assert len(x_train) == len(y_train), \
            "x_train的长度必须与y_train的长度相等"
        # 获取x、y训练数据集的平均值
        x_mean = np.mean(x_train)
        y_mean = np.mean(y_train)
        num = (x_train - x_mean).dot(y_train - y_mean)
        d = (x_train - x_mean).dot(x_train - x_mean)
        self._a = num / d
        self._b = y_mean - self._a * x_mean
        return self

    def predict(self, x_predict):
        # 给定待预测数据集x_predict,返回表示x_predict的结果向量
        assert x_predict.ndim == 1, \
            "简单线性回归仅能处理有1个特征的训练数据"
        assert self._a is not None and self._b is not None, \
            "开始预测前必须fit"
        return np.array([self._predict(x) for x in x_predict])

    def _predict(self, x_single):
        # 给定单个待预测数据x_single,返回x_single的预测结果值
        return self._a * x_single + self._b

    def __repr__(self):
        return "SimpleLinearRegressionVector()"
from playLA.SimpleLinearRegressionVector import SimpleLinearRegressionVector
import numpy as np
import matplotlib.pyplot as plt

if __name__ == "__main__":

    reg1 = SimpleLinearRegressionVector()
    x = np.array([1, 2, 3, 4, 5])
    y = np.array([1, 3, 2, 3, 5])
    reg1.fit(x, y)
    x_predict = 6
    print(reg1.predict(np.array([x_predict])))
    print(reg1._a)
    print(reg1._b)
    y_hat1 = reg1.predict(x)
    plt.scatter(x, y)
    plt.plot(x, y_hat1, color="r")
    plt.axis([0, 6, 0, 6])
    plt.show()

运行结果与之前相同,现在我们来进行大数据量的性能测试。

from playLA.SimpleLinearRegressionVector import SimpleLinearRegressionVector
from playLA.SimpleLinearRegression import SimpleLinearRegression
import numpy as np
import timeit

if __name__ == "__main__":

    m = 1000000
    big_x = np.random.random(size=m)
    big_y = big_x * 2 + 3 + np.random.normal(size=m)
    reg1 = SimpleLinearRegression()
    reg2 = SimpleLinearRegressionVector()
    start_time = timeit.default_timer()
    reg1.fit(big_x, big_y)
    print(timeit.default_timer() - start_time)
    start_time = timeit.default_timer()
    reg2.fit(big_x, big_y)
    print(timeit.default_timer() - start_time)
    print(reg1._a)
    print(reg1._b)
    print(reg2._a)
    print(reg2._b)

运行结果

0.870928624
0.00943964399999997
2.0010333897209
2.998210238026
2.0010333897208623
2.9982102380260187

由结果可以看出循环叠加方法和向量点乘算法存在着巨大的性能差异。

衡量线性回归法的指标:MSE,RMSE和MAE

在上一章KNN算法中,我们将数据集分成了训练数据集和测试数据集两类,我们使用训练数据集来训练出了模型,然后使用模型来预测测试数据集,将预测的结果与测试数据集自己带的真实的标签进行对比,这样我们就得到了分类准确度(accuracy)。我们可以使用分类的准确度来衡量我们的机器学习模型究竟是好是坏。对于回归模型该如何评价好坏呢?我们依然使用简单线性回归来说明。

目标:找到a和b,使得 $\sum_{i=1}^{m}(y^{(i)}-ax^{(i)}-b)^2$ 尽可能小。我们在实际使用过程中也是将数据分为训练数据集和测试数据集的,该式子其实是针对训练数据集来说的,也就是 $\sum_{i=1}^{m}(y_{train}^{(i)}-\hat{y}_{train}^{(i)})^2$。训练过程结束了,我们就得到了a和b。之后,我们将测试数据集代入式子中,根据每一个 $x_{test}^{(i)}$ 就能得到相应的 $\hat{y}_{test}^{(i)}=ax_{test}^{(i)}+b$。那么,我们就可以用 $\sum_{i=1}^{m}(y_{test}^{(i)}-\hat{y}_{test}^{(i)})^2$ 来作为我们线性回归模型好坏的衡量标准。不过当我们和别人汇报这个衡量标准的时候,这个衡量标准是和m相关的。比如说两个人都做了一个房产预测的算法,其中一人说他的累计平方和是1000,另外一人说他的是800,这样就能说明后者的算法更好吗?答案是不能,因为我们不知道这两个人的测试数据集有多少样本。如果第一个人的测试数据集有10000个样本,而另一个人的测试数据集只有10个样本,那么10000个样本累计起来的误差是1000,10个样本累计起来的误差就达到800,显然是第一个人的算法更好。所以我们需要改进这个衡量标准:给这个式子除以m,即 $\frac{1}{m}\sum_{i=1}^{m}(y_{test}^{(i)}-\hat{y}_{test}^{(i)})^2$,相当于让我们的衡量标准跟测试样本数量无关,这就是一个更为通用的线性回归算法衡量标准,称为均方误差MSE(Mean Squared Error)。这个衡量标准依然有一个问题,那就是量纲,比如说预测房产以万元为单位,那么这个衡量标准得到的结果其实是万元的平方,所以一个简单的改进方式就是开方。

$RMSE=\sqrt{\frac{1}{m}\sum_{i=1}^{m}(y_{test}^{(i)}-\hat{y}_{test}^{(i)})^2}$,这样量纲和y的量纲是一致的,这个衡量标准被称为均方根误差RMSE(Root Mean Squared Error)。

还有一种衡量标准叫做平均绝对误差MAE(Mean Absolute Error):$MAE=\frac{1}{m}\sum_{i=1}^{m}|y_{test}^{(i)}-\hat{y}_{test}^{(i)}|$。之前我们说绝对值不是一个处处可导的函数,所以它不方便用来求极值,但是它完全可以用来评价我们的线性回归算法。换句话说,我们评价一个算法的标准和我们在训练模型时最优化的目标函数是可以完全不一致的。现在我们用一份真实的波士顿房产数据来说明。

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

if __name__ == "__main__":

    boston = datasets.load_boston()
    print(boston.DESCR)

运行结果

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
.. topic:: References

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

这是一份波士顿的房产数据,一共有506个样本,13个特征,上面具体写出了这13个特征分别是什么。由于我们只做简单线性回归,所以只取出这13个特征中的一个——RM(每栋住宅的平均房间数),根据房子中房间数的多少来预测房价。我们先来看一下RM在这13个特征中的位置。

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

if __name__ == "__main__":

    boston = datasets.load_boston()
    # print(boston.DESCR)
    print(boston.feature_names)

运行结果

['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']

根据结果我们可以看出RM位于第6位(从0开始,索引为5)。

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

if __name__ == "__main__":

    boston = datasets.load_boston()
    # print(boston.DESCR)
    # print(boston.feature_names)
    # 只使用房间数量这个特征
    x = boston.data[:, 5]
    print(x.shape)
    y = boston.target
    print(y.shape)

运行结果

(506,)
(506,)

这里我们可以看到x、y都是含有506个元素的向量。我们将这些数据绘制出来。

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

if __name__ == "__main__":

    boston = datasets.load_boston()
    # print(boston.DESCR)
    # print(boston.feature_names)
    # 只使用房间数量这个特征
    x = boston.data[:, 5]
    print(x.shape)
    y = boston.target
    print(y.shape)
    plt.scatter(x, y)
    plt.show()

运行结果

我们可以看到在纵坐标50的地方有一些特殊点,这些点其实是一些被截断的值:比方说我们设定上限为50,当有些数据超过了50时,就会被统一记为50。对于这些达到最大值的点,它们有可能不是真实值,需要删除。现在我们来看一下y的最大值是多少。

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

if __name__ == "__main__":

    boston = datasets.load_boston()
    # print(boston.DESCR)
    # print(boston.feature_names)
    # 只使用房间数量这个特征
    x = boston.data[:, 5]
    print(x.shape)
    y = boston.target
    print(y.shape)
    # plt.scatter(x, y)
    # plt.show()
    print(np.max(y))

运行结果

(506,)
(506,)
50.0

从结果可以看出y的最大值确实就是50。现在我们将这些点删除,重新绘制。

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

if __name__ == "__main__":

    boston = datasets.load_boston()
    # print(boston.DESCR)
    # print(boston.feature_names)
    # 只使用房间数量这个特征
    x = boston.data[:, 5]
    print(x.shape)
    y = boston.target
    print(y.shape)
    # plt.scatter(x, y)
    # plt.show()
    # print(np.max(y))
    x = x[y < 50]
    y = y[y < 50]
    plt.scatter(x, y)
    plt.show()

运行结果

现在我们可以看到这些奇怪的点就消失了,这是我们真正要使用的数据集。

现在我们增加一个将数据集分割成训练数据集和测试数据集的方法

import numpy as np


def train_test_split(X, y, test_ratio=0.2, seed=None):
    """将数据 X 和 y 按照test_ratio分割成X_train, X_test, y_train, y_test"""
    assert X.shape[0] == y.shape[0], \
        "the size of X must be equal to the size of y"
    assert 0.0 <= test_ratio <= 1.0, \
        "test_ration must be valid"

    if seed:
        np.random.seed(seed)

    shuffled_indexes = np.random.permutation(len(X))

    test_size = int(len(X) * test_ratio)
    test_indexes = shuffled_indexes[:test_size]
    train_indexes = shuffled_indexes[test_size:]

    X_train = X[train_indexes]
    y_train = y[train_indexes]

    X_test = X[test_indexes]
    y_test = y[test_indexes]

    return X_train, X_test, y_train, y_test
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from playLA.model_selection import train_test_split

if __name__ == "__main__":

    boston = datasets.load_boston()
    # print(boston.DESCR)
    # print(boston.feature_names)
    # 只使用房间数量这个特征
    x = boston.data[:, 5]
    print(x.shape)
    y = boston.target
    print(y.shape)
    # plt.scatter(x, y)
    # plt.show()
    # print(np.max(y))
    x = x[y < 50]
    y = y[y < 50]
    # plt.scatter(x, y)
    # plt.show()
    x_train, x_test, y_train, y_test = train_test_split(x, y, seed=666)
    print(x_train.shape)
    print(x_test.shape)

运行结果

(506,)
(506,)
(392,)
(98,)

根据结果我们可以看到训练样本有392个,测试样本有98个。现在我们要进行简单线性回归的训练,看一下训练结果的a和b,并将训练数据和训练后的直线绘制出来。

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from playLA.model_selection import train_test_split
from playLA.SimpleLinearRegressionVector import SimpleLinearRegressionVector

if __name__ == "__main__":

    boston = datasets.load_boston()
    # print(boston.DESCR)
    # print(boston.feature_names)
    # 只使用房间数量这个特征
    x = boston.data[:, 5]
    print(x.shape)
    y = boston.target
    print(y.shape)
    # plt.scatter(x, y)
    # plt.show()
    # print(np.max(y))
    x = x[y < 50]
    y = y[y < 50]
    # plt.scatter(x, y)
    # plt.show()
    x_train, x_test, y_train, y_test = train_test_split(x, y, seed=666)
    print(x_train.shape)
    print(x_test.shape)
    reg = SimpleLinearRegressionVector()
    reg.fit(x_train, y_train)
    print(reg._a)
    print(reg._b)
    plt.scatter(x_train, y_train)
    plt.plot(x_train, reg.predict(x_train), color="r")
    plt.show()

运行结果

(506,)
(506,)
(392,)
(98,)
7.8608543562689555
-27.459342806705543

现在我们来看一下它的均方误差MSE、均方根误差RMSE以及平均绝对误差MAE

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from playLA.model_selection import train_test_split
from playLA.SimpleLinearRegressionVector import SimpleLinearRegressionVector
from math import sqrt

if __name__ == "__main__":

    boston = datasets.load_boston()
    # print(boston.DESCR)
    # print(boston.feature_names)
    # 只使用房间数量这个特征
    x = boston.data[:, 5]
    print(x.shape)
    y = boston.target
    print(y.shape)
    # plt.scatter(x, y)
    # plt.show()
    # print(np.max(y))
    x = x[y < 50]
    y = y[y < 50]
    # plt.scatter(x, y)
    # plt.show()
    x_train, x_test, y_train, y_test = train_test_split(x, y, seed=666)
    print(x_train.shape)
    print(x_test.shape)
    reg = SimpleLinearRegressionVector()
    reg.fit(x_train, y_train)
    print(reg._a)
    print(reg._b)
    # plt.scatter(x_train, y_train)
    # plt.plot(x_train, reg.predict(x_train), color="r")
    # plt.show()
    y_predict = reg.predict(x_test)
    mse_test = np.sum((y_predict - y_test)**2) / len(y_test)
    print(mse_test)
    rmse_test = sqrt(mse_test)
    print(rmse_test)
    mae_test = np.sum(np.absolute(y_predict - y_test)) / len(y_test)
    print(mae_test)

运行结果

(506,)
(506,)
(392,)
(98,)
7.8608543562689555
-27.459342806705543
24.156602134387438
4.914936635846635
3.5430974409463873

我们可以看到RMSE的值约为4.9,由于这份数据的房价单位是千美元,可以理解为在RMSE这个指标下,我们预测的房价平均误差在4.9千美元左右;MAE的值约为3.54,换句话说在MAE这个指标下,我们预测的房价平均误差在3.54千美元左右。现在我们把这些方法进行一个封装。

import numpy as np
from math import sqrt


def accuracy_score(y_true, y_predict):
    """计算y_true和y_predict之间的准确率"""
    assert len(y_true) == len(y_predict), \
        "the size of y_true must be equal to the size of y_predict"

    return np.sum(y_true == y_predict) / len(y_true)


def mean_squared_error(y_true, y_predict):
    """计算y_true和y_predict之间的MSE"""
    assert len(y_true) == len(y_predict), \
        "the size of y_true must be equal to the size of y_predict"

    return np.sum((y_true - y_predict)**2) / len(y_true)


def root_mean_squared_error(y_true, y_predict):
    """计算y_true和y_predict之间的RMSE"""

    return sqrt(mean_squared_error(y_true, y_predict))


def mean_absolute_error(y_true, y_predict):
    """计算y_true和y_predict之间的MAE"""

    return np.sum(np.absolute(y_true - y_predict)) / len(y_true)
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from playLA.model_selection import train_test_split
from playLA.SimpleLinearRegressionVector import SimpleLinearRegressionVector
from math import sqrt
from playLA.metrics import *

if __name__ == "__main__":

    boston = datasets.load_boston()
    # print(boston.DESCR)
    # print(boston.feature_names)
    # 只使用房间数量这个特征
    x = boston.data[:, 5]
    print(x.shape)
    y = boston.target
    print(y.shape)
    # plt.scatter(x, y)
    # plt.show()
    # print(np.max(y))
    x = x[y < 50]
    y = y[y < 50]
    # plt.scatter(x, y)
    # plt.show()
    x_train, x_test, y_train, y_test = train_test_split(x, y, seed=666)
    print(x_train.shape)
    print(x_test.shape)
    reg = SimpleLinearRegressionVector()
    reg.fit(x_train, y_train)
    print(reg._a)
    print(reg._b)
    # plt.scatter(x_train, y_train)
    # plt.plot(x_train, reg.predict(x_train), color="r")
    # plt.show()
    y_predict = reg.predict(x_test)
    # mse_test = np.sum((y_predict - y_test)**2) / len(y_test)
    # print(mse_test)
    # rmse_test = sqrt(mse_test)
    # print(rmse_test)
    # mae_test = np.sum(np.absolute(y_predict - y_test)) / len(y_test)
    # print(mae_test)
    print(mean_squared_error(y_test, y_predict))
    print(root_mean_squared_error(y_test, y_predict))
    print(mean_absolute_error(y_test, y_predict))

运行结果

(506,)
(506,)
(392,)
(98,)
7.8608543562689555
-27.459342806705543
24.156602134387438
4.914936635846635
3.5430974409463873

scikit-learn机器学习库中的MSE和MAE

scikit-learn中没有直接提供RMSE,所以我们对MSE的结果开根号就可以了

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from playLA.model_selection import train_test_split
from playLA.SimpleLinearRegressionVector import SimpleLinearRegressionVector
from math import sqrt
# from playLA.metrics import *
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error

if __name__ == "__main__":

    boston = datasets.load_boston()
    # print(boston.DESCR)
    # print(boston.feature_names)
    # 只使用房间数量这个特征
    x = boston.data[:, 5]
    print(x.shape)
    y = boston.target
    print(y.shape)
    # plt.scatter(x, y)
    # plt.show()
    # print(np.max(y))
    x = x[y < 50]
    y = y[y < 50]
    # plt.scatter(x, y)
    # plt.show()
    x_train, x_test, y_train, y_test = train_test_split(x, y, seed=666)
    print(x_train.shape)
    print(x_test.shape)
    reg = SimpleLinearRegressionVector()
    reg.fit(x_train, y_train)
    print(reg._a)
    print(reg._b)
    # plt.scatter(x_train, y_train)
    # plt.plot(x_train, reg.predict(x_train), color="r")
    # plt.show()
    y_predict = reg.predict(x_test)
    # mse_test = np.sum((y_predict - y_test)**2) / len(y_test)
    # print(mse_test)
    # rmse_test = sqrt(mse_test)
    # print(rmse_test)
    # mae_test = np.sum(np.absolute(y_predict - y_test)) / len(y_test)
    # print(mae_test)
    print(mean_squared_error(y_test, y_predict))
    print(sqrt(mean_squared_error(y_test, y_predict)))
    print(mean_absolute_error(y_test, y_predict))

运行结果

(506,)
(506,)
(392,)
(98,)
7.8608543562689555
-27.459342806705543
24.156602134387438
4.914936635846635
3.5430974409463873

RMSE vs MAE

RMSE和MAE的量纲是一样的,但是在实际的实践中会发现RMSE通常会比MAE要大,这是因为RMSE是先将误差平方累加后再开根,如果某个样本的误差是100,经过平方操作后就被放大到了10000,也就是说RMSE有放大预测结果和真实结果之间较大差距的趋势;而MAE没有这样的趋势,它直接反映样本预测结果和真实结果的差距,没有平方操作。也正是因为这个原因,从某种程度上来说,我们尽量让RMSE这个值更小,相对来讲意义更大一些,因为这背后意味着整个样本的误差中,那个最大的误差值相应比较小。
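下面用一小段示意代码(非原文内容,误差数据是随手假设的)直观对比一下:两组误差的绝对值总量相同,MAE 也相同,但误差集中在个别样本上的那一组 RMSE 明显更大:

import numpy as np

if __name__ == "__main__":

    # 两组假设的预测误差,绝对值之和都是20
    errors_even = np.array([5.0, 5.0, 5.0, 5.0])    # 误差分布均匀
    errors_spiky = np.array([0.0, 0.0, 0.0, 20.0])  # 误差集中在一个样本上
    for name, err in [("均匀误差", errors_even), ("集中误差", errors_spiky)]:
        mae = np.mean(np.abs(err))
        rmse = np.sqrt(np.mean(err ** 2))
        print(name, "MAE =", mae, "RMSE =", rmse)
    # 两组的MAE都是5.0,但集中误差那一组的RMSE是10.0,
    # 这正体现了RMSE会放大个别较大误差的特点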

最好的衡量线性回归法的指标:R Squared

我们评价分类问题的准确度:1最好,0最差,这就非常容易的判断模型的好坏,但是对于RMSE和MAE就没有这样的性质,这是RMSE和MAE的局限性。

这个问题的解决办法就是使用一个新的指标——R Squared,它的式子如下

$$R^2 = 1 - \frac{SS_{residual}}{SS_{total}}$$

其中 $SS_{residual}$ 表示的是Residual Sum of Squares,$SS_{total}$ 表示的是Total Sum of Squares,这两部分的计算方法分别为 $SS_{residual}=\sum_{i=1}^{m}(\hat{y}^{(i)}-y^{(i)})^2$,$SS_{total}=\sum_{i=1}^{m}(\bar{y}-y^{(i)})^2$。

$SS_{residual}$ 就是预测结果与真值差距的平方再进行累加,$SS_{total}$ 就是y的平均值与真值差距的平方和。

$SS_{residual}$ 是使用我们的模型预测产生的错误,$SS_{total}$ 是使用 $\bar{y}$(平均值)预测产生的错误,我们也可以把后者理解成一个模型,这个模型跟x无关,不管x是什么,预测的结果都是样本的均值。对于这样一个模型,在机器学习领域有一个统一的名词叫做Baseline Model,意思就是最基本的模型。这个最基本的模型,错误肯定是比较多的,因为它完全不考虑x,预测结果直接取样本输出的均值;而使用我们自己的模型预测产生的错误应该相应比较少,因为它充分考虑了x和y之间的关系。基于此,我们可以这样来理解 $R^2$ 这个式子:使用Baseline Model会产生非常多的错误,使用我们自己的模型来预测也会产生一些错误,但同时也会减少一些错误。所以用1减去(我们自己的模型产生的错误 / Baseline Model产生的错误),最终的结果就衡量了我们的模型拟合住数据的程度,也就是我们的模型没有产生错误的那部分所对应的指标。

结论:

  1. R^2 <= 1
  2. R^2越大越好。当我们的预测模型不犯任何错误,R^2得到最大值1
  3. 当我们的模型等于基准模型时,R^2为0
  4. 如果R^2 < 0,说明我们学习到的模型还不如基准模型。此时,很有可能我们的数据不存在任何线性关系。

我们将式子变形,得到

$$R^2 = 1 - \frac{\sum_{i=1}^{m}(\hat{y}^{(i)}-y^{(i)})^2}{\sum_{i=1}^{m}(\bar{y}-y^{(i)})^2} = 1 - \frac{\frac{1}{m}\sum_{i=1}^{m}(\hat{y}^{(i)}-y^{(i)})^2}{\frac{1}{m}\sum_{i=1}^{m}(\bar{y}-y^{(i)})^2} = 1 - \frac{MSE(\hat{y}, y)}{Var(y)}$$

其中 $Var(y)$ 表示y这组数据对应的方差。

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from playLA.model_selection import train_test_split
from playLA.SimpleLinearRegressionVector import SimpleLinearRegressionVector
from math import sqrt
# from playLA.metrics import *
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error

if __name__ == "__main__":

    boston = datasets.load_boston()
    # print(boston.DESCR)
    # print(boston.feature_names)
    # 只使用房间数量这个特征
    x = boston.data[:, 5]
    print(x.shape)
    y = boston.target
    print(y.shape)
    # plt.scatter(x, y)
    # plt.show()
    # print(np.max(y))
    x = x[y < 50]
    y = y[y < 50]
    # plt.scatter(x, y)
    # plt.show()
    x_train, x_test, y_train, y_test = train_test_split(x, y, seed=666)
    print(x_train.shape)
    print(x_test.shape)
    reg = SimpleLinearRegressionVector()
    reg.fit(x_train, y_train)
    print(reg._a)
    print(reg._b)
    # plt.scatter(x_train, y_train)
    # plt.plot(x_train, reg.predict(x_train), color="r")
    # plt.show()
    y_predict = reg.predict(x_test)
    # mse_test = np.sum((y_predict - y_test)**2) / len(y_test)
    # print(mse_test)
    # rmse_test = sqrt(mse_test)
    # print(rmse_test)
    # mae_test = np.sum(np.absolute(y_predict - y_test)) / len(y_test)
    # print(mae_test)
    print(mean_squared_error(y_test, y_predict))
    print(sqrt(mean_squared_error(y_test, y_predict)))
    print(mean_absolute_error(y_test, y_predict))
    # 计算R Square
    print(1 - mean_squared_error(y_test, y_predict) / np.var(y_test))

运行结果

(506,)
(506,)
(392,)
(98,)
7.8608543562689555
-27.459342806705543
24.156602134387438
4.914936635846635
3.5430974409463873
0.6129316803937322

当然我们也可以在我们自己封装的工具方法中增加R Square方法

import numpy as np
from math import sqrt


def accuracy_score(y_true, y_predict):
    """计算y_true和y_predict之间的准确率"""
    assert len(y_true) == len(y_predict), \
        "the size of y_true must be equal to the size of y_predict"

    return np.sum(y_true == y_predict) / len(y_true)


def mean_squared_error(y_true, y_predict):
    """计算y_true和y_predict之间的MSE"""
    assert len(y_true) == len(y_predict), \
        "the size of y_true must be equal to the size of y_predict"

    return np.sum((y_true - y_predict)**2) / len(y_true)


def root_mean_squared_error(y_true, y_predict):
    """计算y_true和y_predict之间的RMSE"""

    return sqrt(mean_squared_error(y_true, y_predict))


def mean_absolute_error(y_true, y_predict):
    """计算y_true和y_predict之间的MAE"""

    return np.sum(np.absolute(y_true - y_predict)) / len(y_true)

def r2_score(y_true, y_predict):
    """计算y_true和y_predict之间的R Square"""
    return 1 - mean_squared_error(y_true, y_predict) / np.var(y_true)
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from playLA.model_selection import train_test_split
from playLA.SimpleLinearRegressionVector import SimpleLinearRegressionVector
from math import sqrt
# from playLA.metrics import *
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from playLA.metrics import r2_score

if __name__ == "__main__":

    boston = datasets.load_boston()
    # print(boston.DESCR)
    # print(boston.feature_names)
    # 只使用房间数量这个特征
    x = boston.data[:, 5]
    print(x.shape)
    y = boston.target
    print(y.shape)
    # plt.scatter(x, y)
    # plt.show()
    # print(np.max(y))
    x = x[y < 50]
    y = y[y < 50]
    # plt.scatter(x, y)
    # plt.show()
    x_train, x_test, y_train, y_test = train_test_split(x, y, seed=666)
    print(x_train.shape)
    print(x_test.shape)
    reg = SimpleLinearRegressionVector()
    reg.fit(x_train, y_train)
    print(reg._a)
    print(reg._b)
    # plt.scatter(x_train, y_train)
    # plt.plot(x_train, reg.predict(x_train), color="r")
    # plt.show()
    y_predict = reg.predict(x_test)
    # mse_test = np.sum((y_predict - y_test)**2) / len(y_test)
    # print(mse_test)
    # rmse_test = sqrt(mse_test)
    # print(rmse_test)
    # mae_test = np.sum(np.absolute(y_predict - y_test)) / len(y_test)
    # print(mae_test)
    print(mean_squared_error(y_test, y_predict))
    print(sqrt(mean_squared_error(y_test, y_predict)))
    print(mean_absolute_error(y_test, y_predict))
    # 计算R Square
    # print(1 - mean_squared_error(y_test, y_predict) / np.var(y_test))
    print(r2_score(y_test, y_predict))

运行结果

(506,)
(506,)
(392,)
(98,)
7.8608543562689555
-27.459342806705543
24.156602134387438
4.914936635846635
3.5430974409463873
0.6129316803937322

scikit-learn中同样也有这个方法

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from playLA.model_selection import train_test_split
from playLA.SimpleLinearRegressionVector import SimpleLinearRegressionVector
from math import sqrt
# from playLA.metrics import *
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
# from playLA.metrics import r2_score
from sklearn.metrics import r2_score

if __name__ == "__main__":

    boston = datasets.load_boston()
    # print(boston.DESCR)
    # print(boston.feature_names)
    # 只使用房间数量这个特征
    x = boston.data[:, 5]
    print(x.shape)
    y = boston.target
    print(y.shape)
    # plt.scatter(x, y)
    # plt.show()
    # print(np.max(y))
    x = x[y < 50]
    y = y[y < 50]
    # plt.scatter(x, y)
    # plt.show()
    x_train, x_test, y_train, y_test = train_test_split(x, y, seed=666)
    print(x_train.shape)
    print(x_test.shape)
    reg = SimpleLinearRegressionVector()
    reg.fit(x_train, y_train)
    print(reg._a)
    print(reg._b)
    # plt.scatter(x_train, y_train)
    # plt.plot(x_train, reg.predict(x_train), color="r")
    # plt.show()
    y_predict = reg.predict(x_test)
    # mse_test = np.sum((y_predict - y_test)**2) / len(y_test)
    # print(mse_test)
    # rmse_test = sqrt(mse_test)
    # print(rmse_test)
    # mae_test = np.sum(np.absolute(y_predict - y_test)) / len(y_test)
    # print(mae_test)
    print(mean_squared_error(y_test, y_predict))
    print(sqrt(mean_squared_error(y_test, y_predict)))
    print(mean_absolute_error(y_test, y_predict))
    # 计算R Square
    # print(1 - mean_squared_error(y_test, y_predict) / np.var(y_test))
    print(r2_score(y_test, y_predict))

运行结果与之前相同,最后我们将该方法封装到我们的简单线性回归类的score方法中

import numpy as np
from .metrics import r2_score

class SimpleLinearRegressionVector:

    def __init__(self):
        # 初始化Simple Linear Regression模型
        self._a = None
        self._b = None

    def fit(self, x_train, y_train):
        # 根据训练数据集x_train,y_train训练Simple Linear Regression模型
        assert x_train.ndim == 1, \
            "简单线性回归仅能处理有1个特征的训练数据"
        assert len(x_train) == len(y_train), \
            "x_train的长度必须与y_train的长度相等"
        # 获取x、y训练数据集的平均值
        x_mean = np.mean(x_train)
        y_mean = np.mean(y_train)
        num = (x_train - x_mean).dot(y_train - y_mean)
        d = (x_train - x_mean).dot(x_train - x_mean)
        self._a = num / d
        self._b = y_mean - self._a * x_mean
        return self

    def predict(self, x_predict):
        # 给定待预测数据集x_predict,返回表示x_predict的结果向量
        assert x_predict.ndim == 1, \
            "简单线性回归仅能处理有1个特征的训练数据"
        assert self._a is not None and self._b is not None, \
            "开始预测前必须fit"
        return np.array([self._predict(x) for x in x_predict])

    def _predict(self, x_single):
        # 给定单个待预测数据x_single,返回x_single的预测结果值
        return self._a * x_single + self._b

    def score(self, x_test, y_test):
        # 根据测试数据集x_test和y_test确定当前模型的准确度
        y_predict = self.predict(x_test)
        return r2_score(y_test, y_predict)

    def __repr__(self):
        return "SimpleLinearRegressionVector()"
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from playLA.model_selection import train_test_split
from playLA.SimpleLinearRegressionVector import SimpleLinearRegressionVector
from math import sqrt
# from playLA.metrics import *
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
# from playLA.metrics import r2_score
from sklearn.metrics import r2_score

if __name__ == "__main__":

    boston = datasets.load_boston()
    # print(boston.DESCR)
    # print(boston.feature_names)
    # 只使用房间数量这个特征
    x = boston.data[:, 5]
    print(x.shape)
    y = boston.target
    print(y.shape)
    # plt.scatter(x, y)
    # plt.show()
    # print(np.max(y))
    x = x[y < 50]
    y = y[y < 50]
    # plt.scatter(x, y)
    # plt.show()
    x_train, x_test, y_train, y_test = train_test_split(x, y, seed=666)
    print(x_train.shape)
    print(x_test.shape)
    reg = SimpleLinearRegressionVector()
    reg.fit(x_train, y_train)
    print(reg._a)
    print(reg._b)
    # plt.scatter(x_train, y_train)
    # plt.plot(x_train, reg.predict(x_train), color="r")
    # plt.show()
    y_predict = reg.predict(x_test)
    # mse_test = np.sum((y_predict - y_test)**2) / len(y_test)
    # print(mse_test)
    # rmse_test = sqrt(mse_test)
    # print(rmse_test)
    # mae_test = np.sum(np.absolute(y_predict - y_test)) / len(y_test)
    # print(mae_test)
    print(mean_squared_error(y_test, y_predict))
    print(sqrt(mean_squared_error(y_test, y_predict)))
    print(mean_absolute_error(y_test, y_predict))
    # 计算R Square
    # print(1 - mean_squared_error(y_test, y_predict) / np.var(y_test))
    # print(r2_score(y_test, y_predict))
    print(reg.score(x_test, y_test))

运行结果也与之前相同

多元线性回归和正规方程解

之前我们看到的,当样本特征 x 是一个数字的时候,对应的就是简单线性回归问题;而当 x 是一个向量 $(X_1^{(i)}, X_2^{(i)}, \dots, X_n^{(i)})$ 的时候,对应的就是多元线性回归了。

这个线性方程就由原来的 y=ax+b 变成了 $y=\theta_0+\theta_1 X_1+\theta_2 X_2+\dots+\theta_n X_n$。当只有一个特征的时候,$\theta_1$ 就是a,$\theta_0$ 就是b。

则预测值就为 $\hat{y}^{(i)}=\theta_0+\theta_1 X_1^{(i)}+\theta_2 X_2^{(i)}+\dots+\theta_n X_n^{(i)}$。

多元线性回归的目标依然是使 $\sum_{i=1}^{m}(y^{(i)}-\hat{y}^{(i)})^2$ 尽可能小,只不过 $\hat{y}^{(i)}$ 变成了上面的多元形式。

则目标也变成了:找到 $\theta=(\theta_0,\theta_1,\dots,\theta_n)^T$,使得 $\sum_{i=1}^{m}(y^{(i)}-\hat{y}^{(i)})^2$ 尽可能小。

我们把 $\theta_0,\theta_1,\dots,\theta_n$ 整理成一个向量 $\theta$,并将 $\hat{y}^{(i)}$ 改写成 $\hat{y}^{(i)}=\theta_0 X_0^{(i)}+\theta_1 X_1^{(i)}+\dots+\theta_n X_n^{(i)}$,其中 $X_0^{(i)}$ 是我们虚拟出来的一个特征,且 $X_0^{(i)}\equiv 1$。则我们的 $X_b$ 是在原数据矩阵左侧加上一列1得到的矩阵:

$$X_b=\begin{pmatrix}1 & X_1^{(1)} & X_2^{(1)} & \cdots & X_n^{(1)} \\ 1 & X_1^{(2)} & X_2^{(2)} & \cdots & X_n^{(2)} \\ \vdots & & & & \vdots \\ 1 & X_1^{(m)} & X_2^{(m)} & \cdots & X_n^{(m)}\end{pmatrix}$$

则我们样本的预测值(拟合值)用向量表示就是 $\hat{y}=X_b\cdot\theta$,则目标函数可以写成 $(y-X_b\theta)^T(y-X_b\theta)$。

则目标就是使 $(y-X_b\theta)^T(y-X_b\theta)$ 尽可能小。类比极限存在的必要条件,令该函数对向量 $\theta$ 的导数为0,可以解出 $\theta=(X_b^T X_b)^{-1}X_b^T y$,该式就被称为多元线性回归的正规方程解(Normal Equation)。这个式子看似简单,但其实它的时间复杂度较高,大约是 $O(n^3)$;虽然对矩阵的乘法运算、求逆运算都有加速方案,但是即使使用这些加速方案,整体优化后时间复杂度依然有大约 $O(n^{2.4})$。由此我们知道它比 $O(n^2)$ 的级别还要高,在实际应用的时候,不管是样本量非常大(比方说有1000000个样本),还是样本特征非常多(比方说有1000000个特征),对应的就是 $X_b$ 行数特别多或者列数特别多,那么在这种情况下使用正规方程解都可能出现计算缓慢的效率问题。这个问题是有解决方案的,而且该方案是我们更常用的一种求解多元线性回归的方法。
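在进入完整的类实现之前,下面先给出一个最小的numpy示意(非原文内容,数据是随手构造的),直接套用正规方程解,验证它能够还原出我们设定的系数:

import numpy as np

if __name__ == "__main__":

    # 构造 y = 4 + 3*x1 - 2*x2 再加上少量噪声的数据
    np.random.seed(666)
    m = 1000
    X = np.random.random(size=(m, 2))
    y = 4 + 3 * X[:, 0] - 2 * X[:, 1] + np.random.normal(0, 0.1, size=m)
    # 在X左侧补上一列1,得到X_b
    X_b = np.hstack([np.ones((m, 1)), X])
    # 正规方程解 theta = (X_b^T X_b)^(-1) X_b^T y
    theta = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
    print(theta)  # 应接近 [4, 3, -2]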

实现多元线性回归

$\theta=(\theta_0,\theta_1,\dots,\theta_n)$,其中 $\theta_0$ 为截距(intercept),$\theta_1,\dots,\theta_n$ 为系数(coefficients)。在最终向用户报告结果的时候,有可能不是将 $\theta$ 整体报告给用户,而是将截距和系数分开。这样做的原因是:系数部分的每一个值都对应原来样本中的一个特征,这些系数可以用来描述各个特征对于最终预测结果的贡献程度;而截距和样本的特征不相干,它只是一个偏移,所以我们把这两部分分开。

import numpy as np
from numpy.linalg import inv
from .metrics import r2_score

class LinearRegression:
    
    def __init__(self):
        # 初始化LinearRegression模型
        self.coef = None  #系数
        self.interception = None  #截距
        self._theta = None

    def fit_normal(self, X_train, y_train):
        # 根据训练数据集X_train, y_train训练Linear Regression模型
        assert X_train.shape[0] == y_train.shape[0], \
            "X_train的列数必须等于y_train的长度"
        X_b = np.hstack([np.ones((len(X_train), 1)), X_train])
        self._theta = inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y_train)
        self.interception = self._theta[0]
        self.coef = self._theta[1:]
        return self

    def predict(self, X_predict):
        # 给定待预测数据集X_predict,返回表示X_predict的结果向量
        assert self.interception is not None and self.coef is not None, \
            "开始预测前必须fit"
        assert X_predict.shape[1] == len(self.coef), \
            "预测的特征数必须与训练的特征数相等"
        X_b = np.hstack([np.ones((len(X_predict), 1)), X_predict])
        return X_b.dot(self._theta)

    def score(self, X_test, y_test):
        # 根据测试数据集X_test和y_test确定当前模型的准确度
        y_predict = self.predict(X_test)
        return r2_score(y_test, y_predict)

    def __repr__(self):
        return "LinearRegression()"

现在我们继续使用波士顿的房产数据来测试,不过这一次使用的不是单一特征的数据,而是全部13个特征的数据。

from sklearn import datasets

if __name__ == "__main__":

    boston = datasets.load_boston()
    X = boston.data
    y = boston.target
    X = X[y < 50]
    y = y[y < 50]
    print(X.shape)

运行结果

(490, 13)

根据结果我们可以看出一共有490个样本数据,13个特征。现在我们来看一下通过这些数据训练出来的系数向量和截距以及评价预测结果的R^2

from sklearn import datasets
from playLA.model_selection import train_test_split
from playLA.LinearRegression import LinearRegression

if __name__ == "__main__":

    boston = datasets.load_boston()
    X = boston.data
    y = boston.target
    X = X[y < 50]
    y = y[y < 50]
    print(X.shape)
    X_train, X_test, y_train, y_test = train_test_split(X, y, seed=666)
    reg = LinearRegression()
    reg.fit_normal(X_train, y_train)
    print(reg.coef)
    print(reg.interception)
    print(reg.score(X_test, y_test))

运行结果

(490, 13)
[-1.20354261e-01  3.64423279e-02 -3.61493155e-02  5.12978140e-02
 -1.15775825e+01  3.42740062e+00 -2.32311760e-02 -1.19487594e+00
  2.60101728e-01 -1.40219119e-02 -8.35430488e-01  7.80472852e-03
 -3.80923751e-01]
34.11739972320593
0.8129794056212907

通过评价预测结果的R^2值可以看出,使用多个特征的多元线性回归,其准确度比使用单一特征的简单线性回归要高出不少。这从某种程度上也印证了:如果数据的特征更多,并且这些特征真的能很好地反映我们最终要预测的指标(在这里就是房价),那么使用更多特征的数据,最终的预测结果会更好。

使用scikit-learn解决回归问题

from sklearn import datasets
# from playLA.model_selection import train_test_split
from sklearn.model_selection import train_test_split
# from playLA.LinearRegression import LinearRegression
from sklearn.linear_model import LinearRegression

if __name__ == "__main__":

    boston = datasets.load_boston()
    X = boston.data
    y = boston.target
    X = X[y < 50]
    y = y[y < 50]
    print(X.shape)
    # X_train, X_test, y_train, y_test = train_test_split(X, y, seed=666)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=666)
    reg = LinearRegression()
    # reg.fit_normal(X_train, y_train)
    reg.fit(X_train, y_train)
    # print(reg.coef)
    print(reg.coef_)
    # print(reg.interception)
    print(reg.intercept_)
    print(reg.score(X_test, y_test))

运行结果

(490, 13)
[-1.15625837e-01  3.13179564e-02 -4.35662825e-02 -9.73281610e-02
 -1.09500653e+01  3.49898935e+00 -1.41780625e-02 -1.06249020e+00
  2.46031503e-01 -1.23291876e-02 -8.79440522e-01  8.31653623e-03
 -3.98593455e-01]
32.59756158869959
0.8009390227581041

在这里我们会发现,使用scikit-learn得到的结果跟我们自己封装的LinearRegression的结果有一些不同,这是因为sklearn中分割训练数据和测试数据的train_test_split方法跟我们自己实现的train_test_split方法稍有不同,这有可能导致切割出来的数据不同。如果改回我们自己的切割方法

from sklearn import datasets
from playLA.model_selection import train_test_split
# from sklearn.model_selection import train_test_split
# from playLA.LinearRegression import LinearRegression
from sklearn.linear_model import LinearRegression

if __name__ == "__main__":

    boston = datasets.load_boston()
    X = boston.data
    y = boston.target
    X = X[y < 50]
    y = y[y < 50]
    print(X.shape)
    X_train, X_test, y_train, y_test = train_test_split(X, y, seed=666)
    # X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=666)
    reg = LinearRegression()
    # reg.fit_normal(X_train, y_train)
    reg.fit(X_train, y_train)
    # print(reg.coef)
    print(reg.coef_)
    # print(reg.interception)
    print(reg.intercept_)
    print(reg.score(X_test, y_test))

运行结果

(490, 13)
[-1.20354261e-01  3.64423279e-02 -3.61493155e-02  5.12978140e-02
 -1.15775825e+01  3.42740062e+00 -2.32311760e-02 -1.19487594e+00
  2.60101728e-01 -1.40219119e-02 -8.35430488e-01  7.80472852e-03
 -3.80923751e-01]
34.117399723229845
0.8129794056212809

这样得到的结果就跟之前相同了。不过这里我们需要注意的是,scikit-learn中的fit方法并不是我们之前讲的正规方程解,具体的方式会在后续说明。

线性回归的可解释性和更多思考

import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression

if __name__ == "__main__":

    boston = datasets.load_boston()
    X = boston.data
    y = boston.target
    X = X[y < 50]
    y = y[y < 50]
    reg = LinearRegression()
    # 此处对所有数据进行拟合,不做拆分成训练数据和测试数据
    reg.fit(X, y)
    print(reg.coef_)
    # 对各个特征进行排序
    print(np.argsort(reg.coef_))
    # 查看排序后对应特征的名称
    print(boston.feature_names[np.argsort(reg.coef_)])
    print(boston.DESCR)

运行结果

[-1.06715912e-01  3.53133180e-02 -4.38830943e-02  4.52209315e-01
 -1.23981083e+01  3.75945346e+00 -2.36790549e-02 -1.21096549e+00
  2.51301879e-01 -1.37774382e-02 -8.38180086e-01  7.85316354e-03
 -3.50107918e-01]
[ 4  7 10 12  0  2  6  9 11  1  8  3  5]
['NOX' 'DIS' 'PTRATIO' 'LSTAT' 'CRIM' 'INDUS' 'AGE' 'TAX' 'B' 'ZN' 'RAD'
 'CHAS' 'RM']
.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
.. topic:: References

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

从结果可以看出,所有特征里面贡献最大的是RM,其实就是房子中的房间数量,它对房价的影响是最大的,换句话说房间数量越多的房屋房价越高。所有特征里面贡献第二大的是CHAS,表示房子是否临河,临河为1,不临河为0。相当于临河的房子房价比较高,而不临河的房子房价就会比较低。这是正相关这一头,在负相关这一头,第一项是NOX,表示房子周边一氧化氮的浓度,由于是负相关,则一氧化氮的浓度越高,相应的房价就越低。负相关第二项是DIS,表示距离5个波士顿劳务雇佣中心相应的加权平均的距离,这个距离越大,房价越低;相反,距离越小,房价越高。也就是说离上班的地方越近,房价越高,越远则房价越低。这就是线性回归对数据具有可解释性。更重要的是,当我们获得了这种可解释性之后,我们可以针对性的去采集更多的特征来更好的描述房价。即使使用线性回归对结果进行预测,预测的结果不够好,但使用这种方式首先来看一看数据的特征和最终结果的线性关系相应的系数有多大,整体很有可能也是非常有意义的。所以从某种角度来讲,我们拿到一组数据之后,先使用线性的方式去试试看,它总之是没有坏处的

梯度下降法

我们在上面直接计算出了使线性回归损失函数最小化的参数的数学解,但其实很多机器学习模型是求不出这样的数学解的。基于这样的模型,我们就需要一种基于搜索的策略来找到最优解。梯度下降法的介绍如下。

  1. 本身不是一个机器学习算法
  2. 是一种基于搜索的最优化方法
  3. 作用:最小化一个损失函数
  4. 梯度上升法:最大化一个效用函数

关于梯度下降法的数学推导请参考高等数学整理(二) ,这里不做赘述。

线性回归中使用梯度下降法

之前我们已经知道,我们的目标是使损失函数 $\sum_{i=1}^{m}(y^{(i)}-X_b^{(i)}\theta)^2$ 尽可能小。那么第一步自然是求它对各个 $\theta_j$ 的偏导数了

$$\nabla J(\theta)=\begin{pmatrix}\partial J/\partial\theta_0\\\partial J/\partial\theta_1\\\vdots\\\partial J/\partial\theta_n\end{pmatrix}=2\begin{pmatrix}\sum_{i=1}^{m}(X_b^{(i)}\theta-y^{(i)})\\\sum_{i=1}^{m}(X_b^{(i)}\theta-y^{(i)})\cdot X_1^{(i)}\\\vdots\\\sum_{i=1}^{m}(X_b^{(i)}\theta-y^{(i)})\cdot X_n^{(i)}\end{pmatrix}$$

这里是一个复合函数求导,根据复合函数求导公式即可得出上面的全部偏导数。不过需要注意的是,$\theta$ 是要求的常数向量,$X_b$ 是数据矩阵,$\nabla J(\theta)$ 就是梯度,它是一个由所有偏导数组成的向量。我们可以看到梯度的大小跟样本数量有关,样本数量越大,梯度中的每一个元素相应也就越大,这其实是不合理的,我们希望求出来的梯度中每一个元素值跟m无关。为此我们让整个梯度再除以一个m,就有

$$\nabla J(\theta)=\frac{2}{m}\begin{pmatrix}\sum_{i=1}^{m}(X_b^{(i)}\theta-y^{(i)})\\\sum_{i=1}^{m}(X_b^{(i)}\theta-y^{(i)})\cdot X_1^{(i)}\\\vdots\\\sum_{i=1}^{m}(X_b^{(i)}\theta-y^{(i)})\cdot X_n^{(i)}\end{pmatrix}$$

这就让我们的目标损失函数发生了变化,变为使 $J(\theta)=\frac{1}{m}\sum_{i=1}^{m}(y^{(i)}-X_b^{(i)}\theta)^2$ 尽可能小。实际上这个值正是均方误差MSE的值,但有的时候也会取 $\frac{1}{2m}\sum_{i=1}^{m}(y^{(i)}-X_b^{(i)}\theta)^2$,这样做的目的是为了约掉对平方求导后产生的系数2,其实作用并不大。所以当使用梯度下降法求最小值的时候,我们要对目标损失函数进行一些设计,不见得所有的目标函数都非常合适。

实现线性回归中的梯度下降法

首先我们先构建一些我们自己的虚拟数据

import numpy as np
import matplotlib.pyplot as plt

if __name__ == "__main__":

    # 随机生成100个2以内的随机数,设置种子保证每次运行随机数相同
    np.random.seed(666)
    x = 2 * np.random.random(size=100)
    # 将生成的随机数建立对应的线性数据(带噪声)
    y = x * 3 + 4 + np.random.normal(size=100)
    # 将x构建成100行1列的矩阵
    X = x.reshape(-1, 1)
    print(X.shape)
    plt.scatter(x, y)
    plt.show()

运行结果

(100, 1)

求拟合的常数向量 $\theta$ 的代码如下

import numpy as np
import matplotlib.pyplot as plt

if __name__ == "__main__":

    # 随机生成100个2以内的随机数,设置种子保证每次运行随机数相同
    np.random.seed(666)
    x = 2 * np.random.random(size=100)
    # 将生成的随机数建立对应的线性数据(带噪声)
    y = x * 3 + 4 + np.random.normal(size=100)
    # 将x构建成100行1列的矩阵
    X = x.reshape(-1, 1)
    print(X.shape)
    # plt.scatter(x, y)
    # plt.show()
    def J(theta, X_b, y):
        # 构建损失函数
        try:
            return np.sum((y - X_b.dot(theta))**2) / len(X_b)
        except:
            return float('inf')

    def dJ(theta, X_b, y):
        # 对theta求偏导数,获取梯度向量
        res = np.empty(len(theta))
        res[0] = np.sum(X_b.dot(theta) - y)
        for i in range(1, len(theta)):
            res[i] = (X_b.dot(theta) - y).dot(X_b[:, i])
        return res * 2 / len(X_b)

    def gradient_descent(X_b, y, initial_theta, eta, n_iters = 1e4, epsilon=1e-8):
        """
        梯度下降算法
        :param X_b: 带虚拟特征1的自变量特征矩阵
        :param y: 因变量向量
        :param initial_theta: 初始的常数向量,这里需要注意的是真正待求的是常数向量,求偏导的也是常数向量
        :param eta: 迭代步长、学习率
        :param n_iters: 最大迭代次数
        :param epsilon: 误差值
        :return:
        """
        theta = initial_theta
        # 真实迭代次数
        i_iter = 0
        while i_iter < n_iters:
            # 获取梯度
            gradient = dJ(theta, X_b, y)
            last_theta = theta
            # 迭代更新theta,不断顺着梯度方向寻找新的theta
            theta = theta - eta * gradient
            # 计算前后两次迭代后的损失函数差值的绝对值
            if abs(J(theta, X_b, y) - J(last_theta, X_b, y)) < epsilon:
                break
            # 更新迭代次数
            i_iter += 1
        return theta
    X_b = np.hstack([np.ones((len(X), 1)), X.reshape(-1, 1)])
    initial_theta = np.zeros(X_b.shape[1])
    eta = 0.01
    theta = gradient_descent(X_b, y, initial_theta, eta)
    print(theta)

运行结果

(100, 1)
[4.02145786 3.00706277]

From the result we can see that the fitted intercept and slope are 4.02145786 and 3.00706277 respectively, very close to the 4 and 3 we used to generate the data.
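
As an optional cross-check (a sketch, not part of the original script), scikit-learn's LinearRegression fitted on the same simulated data should recover essentially the same intercept and slope:

import numpy as np
from sklearn.linear_model import LinearRegression

np.random.seed(666)
x = 2 * np.random.random(size=100)
y = x * 3 + 4 + np.random.normal(size=100)
X = x.reshape(-1, 1)

lin_reg = LinearRegression().fit(X, y)
print(lin_reg.intercept_, lin_reg.coef_)   # expected to be close to 4 and 3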

现在我们将该方法封装到我们的线性回归类中

import numpy as np
from numpy.linalg import inv
from .metrics import r2_score

class LinearRegression:
    
    def __init__(self):
        # 初始化LinearRegression模型
        self.coef = None  #系数
        self.interception = None  #截距
        self._theta = None

    def fit_normal(self, X_train, y_train):
        # 根据训练数据集X_train, y_train训练Linear Regression模型
        assert X_train.shape[0] == y_train.shape[0], \
            "X_train的列数必须等于y_train的长度"
        X_b = np.hstack([np.ones((len(X_train), 1)), X_train])
        self._theta = inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y_train)
        self.interception = self._theta[0]
        self.coef = self._theta[1:]
        return self

    def fit_gd(self, X_train, y_train, eta=0.01, n_iters=1e4):
        # 根据训练数据集X_train,y_train,使用梯度下降法训练Linear Regression模型
        assert X_train.shape[0] == y_train.shape[0], \
            "X_train的列数必须等于y_train的长度"

        def J(theta, X_b, y):
            # 构建损失函数
            try:
                return np.sum((y - X_b.dot(theta)) ** 2) / len(X_b)
            except:
                return float('inf')

        def dJ(theta, X_b, y):
            # 对theta求偏导数,获取梯度向量
            res = np.empty(len(theta))
            res[0] = np.sum(X_b.dot(theta) - y)
            for i in range(1, len(theta)):
                res[i] = (X_b.dot(theta) - y).dot(X_b[:, i])
            return res * 2 / len(X_b)

        def gradient_descent(X_b, y, initial_theta, eta, n_iters=1e4, epsilon=1e-8):
            """
            梯度下降算法
            :param X_b: 带虚拟特征1的自变量特征矩阵
            :param y: 因变量向量
            :param initial_theta: 初始的常数向量,这里需要注意的是真正待求的是常数向量,求偏导的也是常数向量
            :param eta: 迭代步长、学习率
            :param n_iters: 最大迭代次数
            :param epsilon: 误差值
            :return:
            """
            theta = initial_theta
            # 真实迭代次数
            i_iter = 0
            while i_iter < n_iters:
                # 获取梯度
                gradient = dJ(theta, X_b, y)
                last_theta = theta
                # 迭代更新theta,不断顺着梯度方向寻找新的theta
                theta = theta - eta * gradient
                # 计算前后两次迭代后的损失函数差值的绝对值
                if abs(J(theta, X_b, y) - J(last_theta, X_b, y)) < epsilon:
                    break
                # 更新迭代次数
                i_iter += 1
            return theta

        X_b = np.hstack([np.ones((len(X_train), 1)), X_train])
        initial_theta = np.zeros(X_b.shape[1])
        self._theta = gradient_descent(X_b, y_train, initial_theta, eta, n_iters)
        self.interception = self._theta[0]
        self.coef = self._theta[1:]
        return self

    def predict(self, X_predict):
        # 给定待预测数据集X_predict,返回表示X_predict的结果向量
        assert self.interception is not None and self.coef is not None, \
            "开始预测前必须fit"
        assert X_predict.shape[1] == len(self.coef), \
            "预测的特征数必须与训练的特征数相等"
        X_b = np.hstack([np.ones((len(X_predict), 1)), X_predict])
        return X_b.dot(self._theta)

    def score(self, X_test, y_test):
        # 根据测试数据集X_test和y_test确定当前模型的准确度
        y_predict = self.predict(X_test)
        return r2_score(y_test, y_predict)

    def __repr__(self):
        return "LinearRegression()"

import numpy as np
import matplotlib.pyplot as plt
from playLA.LinearRegression import LinearRegression

if __name__ == "__main__":

    # 随机生成100个2以内的随机数,设置种子保证每次运行随机数相同
    np.random.seed(666)
    x = 2 * np.random.random(size=100)
    # 将生成的随机数建立对应的线性数据(带噪声)
    y = x * 3 + 4 + np.random.normal(size=100)
    # 将x构建成100行1列的矩阵
    X = x.reshape(-1, 1)
    print(X.shape)
    # plt.scatter(x, y)
    # plt.show()
    def J(theta, X_b, y):
        # 构建损失函数
        try:
            return np.sum((y - X_b.dot(theta))**2) / len(X_b)
        except:
            return float('inf')

    def dJ(theta, X_b, y):
        # 对theta求偏导数,获取梯度向量
        res = np.empty(len(theta))
        res[0] = np.sum(X_b.dot(theta) - y)
        for i in range(1, len(theta)):
            res[i] = (X_b.dot(theta) - y).dot(X_b[:, i])
        return res * 2 / len(X_b)

    def gradient_descent(X_b, y, initial_theta, eta, n_iters = 1e4, epsilon=1e-8):
        """
        梯度下降算法
        :param X_b: 带虚拟特征1的自变量特征矩阵
        :param y: 因变量向量
        :param initial_theta: 初始的常数向量,这里需要注意的是真正待求的是常数向量,求偏导的也是常数向量
        :param eta: 迭代步长、学习率
        :param n_iters: 最大迭代次数
        :param epsilon: 误差值
        :return:
        """
        theta = initial_theta
        # 真实迭代次数
        i_iter = 0
        while i_iter < n_iters:
            # 获取梯度
            gradient = dJ(theta, X_b, y)
            last_theta = theta
            # 迭代更新theta,不断顺着梯度方向寻找新的theta
            theta = theta - eta * gradient
            # 计算前后两次迭代后的损失函数差值的绝对值
            if abs(J(theta, X_b, y) - J(last_theta, X_b, y)) < epsilon:
                break
            # 更新迭代次数
            i_iter += 1
        return theta
    X_b = np.hstack([np.ones((len(X), 1)), X.reshape(-1, 1)])
    initial_theta = np.zeros(X_b.shape[1])
    eta = 0.01
    theta = gradient_descent(X_b, y, initial_theta, eta)
    print(theta)
    reg = LinearRegression()
    reg.fit_gd(X, y)
    print(reg.coef)
    print(reg.interception)

运行结果

(100, 1)
[4.02145786 3.00706277]
[3.00706277]
4.021457858204859

梯度下降法的向量化和数据标准化

首先我们先将梯度转换一下,使其有统一的格式

Here the gradient is

$$\nabla J(\theta) = \frac{2}{m}\begin{pmatrix} \sum_{i=1}^{m}\big(X_b^{(i)}\theta - y^{(i)}\big) \\ \sum_{i=1}^{m}\big(X_b^{(i)}\theta - y^{(i)}\big)X_1^{(i)} \\ \vdots \\ \sum_{i=1}^{m}\big(X_b^{(i)}\theta - y^{(i)}\big)X_n^{(i)} \end{pmatrix}$$

In the previous section each component of this vector was computed with a for loop, which is fairly slow. But each component is just the dot product of the vector $(X_b\theta - y)$ with one column of $X_b$, so by the rule for multiplying a vector and a matrix (each component of the vector is multiplied with the corresponding column of the matrix; see 线性代数整理, and note that the result of such a product is again a vector) the whole gradient can be rewritten as $\frac{2}{m}\,\big(X_b\theta - y\big)^T \cdot X_b$.

Since the vectors we talk about are by convention column vectors, a transpose was added above to express the row vector. To be rigorous we convert the result back into a column vector by transposing once more; by the rule for transposing a product, $\big((X_b\theta - y)^T X_b\big)^T = X_b^T\,(X_b\theta - y)$, so finally

$$\nabla J(\theta) = \frac{2}{m}\, X_b^T\,\big(X_b\theta - y\big)$$

这就是向量化后的梯度,现在我们将原来的使用for循环求梯度的方法进行一下修改
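
Before swapping the implementation, here is a quick numeric check (a sketch, not in the original) that the vectorized formula and the for-loop version agree on random data:

import numpy as np

np.random.seed(666)
X_b = np.hstack([np.ones((100, 1)), np.random.random((100, 3))])
y = np.random.random(100)
theta = np.random.random(X_b.shape[1])

# for-loop version of the gradient
res = np.empty(len(theta))
res[0] = np.sum(X_b.dot(theta) - y)
for i in range(1, len(theta)):
    res[i] = (X_b.dot(theta) - y).dot(X_b[:, i])
grad_loop = res * 2 / len(X_b)

# vectorized version
grad_vec = X_b.T.dot(X_b.dot(theta) - y) * 2 / len(X_b)

print(np.allclose(grad_loop, grad_vec))   # True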

import numpy as np
from numpy.linalg import inv
from .metrics import r2_score

class LinearRegression:
    
    def __init__(self):
        # 初始化LinearRegression模型
        self.coef = None  #系数
        self.interception = None  #截距
        self._theta = None

    def fit_normal(self, X_train, y_train):
        # 根据训练数据集X_train, y_train训练Linear Regression模型
        assert X_train.shape[0] == y_train.shape[0], \
            "X_train的列数必须等于y_train的长度"
        X_b = np.hstack([np.ones((len(X_train), 1)), X_train])
        self._theta = inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y_train)
        self.interception = self._theta[0]
        self.coef = self._theta[1:]
        return self

    def fit_gd(self, X_train, y_train, eta=0.01, n_iters=1e4):
        # 根据训练数据集X_train,y_train,使用梯度下降法训练Linear Regression模型
        assert X_train.shape[0] == y_train.shape[0], \
            "X_train的列数必须等于y_train的长度"

        def J(theta, X_b, y):
            # 构建损失函数
            try:
                return np.sum((y - X_b.dot(theta)) ** 2) / len(X_b)
            except:
                return float('inf')

        def dJ(theta, X_b, y):
            # 对theta求偏导数,获取梯度向量
            # res = np.empty(len(theta))
            # res[0] = np.sum(X_b.dot(theta) - y)
            # for i in range(1, len(theta)):
            #     res[i] = (X_b.dot(theta) - y).dot(X_b[:, i])
            # return res * 2 / len(X_b)
            return X_b.T.dot(X_b.dot(theta) - y) * 2 / len(X_b)

        def gradient_descent(X_b, y, initial_theta, eta, n_iters=1e4, epsilon=1e-8):
            """
            梯度下降算法
            :param X_b: 带虚拟特征1的自变量特征矩阵
            :param y: 因变量向量
            :param initial_theta: 初始的常数向量,这里需要注意的是真正待求的是常数向量,求偏导的也是常数向量
            :param eta: 迭代步长、学习率
            :param n_iters: 最大迭代次数
            :param epsilon: 误差值
            :return:
            """
            theta = initial_theta
            # 真实迭代次数
            i_iter = 0
            while i_iter < n_iters:
                # 获取梯度
                gradient = dJ(theta, X_b, y)
                last_theta = theta
                # 迭代更新theta,不断顺着梯度方向寻找新的theta
                theta = theta - eta * gradient
                # 计算前后两次迭代后的损失函数差值的绝对值
                if abs(J(theta, X_b, y) - J(last_theta, X_b, y)) < epsilon:
                    break
                # 更新迭代次数
                i_iter += 1
            return theta

        X_b = np.hstack([np.ones((len(X_train), 1)), X_train])
        initial_theta = np.zeros(X_b.shape[1])
        self._theta = gradient_descent(X_b, y_train, initial_theta, eta, n_iters)
        self.interception = self._theta[0]
        self.coef = self._theta[1:]
        return self

    def predict(self, X_predict):
        # 给定待预测数据集X_predict,返回表示X_predict的结果向量
        assert self.interception is not None and self.coef is not None, \
            "开始预测前必须fit"
        assert X_predict.shape[1] == len(self.coef), \
            "预测的特征数必须与训练的特征数相等"
        X_b = np.hstack([np.ones((len(X_predict), 1)), X_predict])
        return X_b.dot(self._theta)

    def score(self, X_test, y_test):
        # 根据测试数据集X_test和y_test确定当前模型的准确度
        y_predict = self.predict(X_test)
        return r2_score(y_test, y_predict)

    def __repr__(self):
        return "LinearRegression()"

现在我们还是使用波士顿的真实房源数据来看一下分析情况。

import numpy as np
from sklearn import datasets
from playLA.LinearRegression import LinearRegression
from playLA.model_selection import train_test_split
import timeit

if __name__ == "__main__":

    boston = datasets.load_boston()
    X = boston.data
    y = boston.target
    X = X[y < 50]
    y = y[y < 50]
    reg = LinearRegression()
    X_train, X_test, y_train, y_test = train_test_split(X, y, seed=666)
    reg.fit_normal(X_train, y_train)
    print(reg.score(X_test, y_test))
    reg2 = LinearRegression()
    start_time = timeit.default_timer()
    reg2.fit_gd(X_train, y_train)
    print(reg2.coef)
    print(reg2.score(X_test, y_test))
    print(X_train[:10, :])
    print(timeit.default_timer() - start_time)

运行结果

0.8129794056212907
/opt/anaconda3/lib/python3.7/site-packages/numpy/core/fromnumeric.py:90: RuntimeWarning: overflow encountered in reduce
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
/Users/admin/Downloads/PycharmProjects/matrix/playLA/LinearRegression.py:31: RuntimeWarning: overflow encountered in square
  return np.sum((y - X_b.dot(theta)) ** 2) / len(X_b)
/Users/admin/Downloads/PycharmProjects/matrix/playLA/LinearRegression.py:65: RuntimeWarning: invalid value encountered in double_scalars
  if abs(J(theta, X_b, y) - J(last_theta, X_b, y)) < epsilon:
[nan nan nan nan nan nan nan nan nan nan nan nan nan]
nan
[[1.42362e+01 0.00000e+00 1.81000e+01 0.00000e+00 6.93000e-01 6.34300e+00
  1.00000e+02 1.57410e+00 2.40000e+01 6.66000e+02 2.02000e+01 3.96900e+02
  2.03200e+01]
 [3.67822e+00 0.00000e+00 1.81000e+01 0.00000e+00 7.70000e-01 5.36200e+00
  9.62000e+01 2.10360e+00 2.40000e+01 6.66000e+02 2.02000e+01 3.80790e+02
  1.01900e+01]
 [1.04690e-01 4.00000e+01 6.41000e+00 1.00000e+00 4.47000e-01 7.26700e+00
  4.90000e+01 4.78720e+00 4.00000e+00 2.54000e+02 1.76000e+01 3.89250e+02
  6.05000e+00]
 [1.15172e+00 0.00000e+00 8.14000e+00 0.00000e+00 5.38000e-01 5.70100e+00
  9.50000e+01 3.78720e+00 4.00000e+00 3.07000e+02 2.10000e+01 3.58770e+02
  1.83500e+01]
 [6.58800e-02 0.00000e+00 2.46000e+00 0.00000e+00 4.88000e-01 7.76500e+00
  8.33000e+01 2.74100e+00 3.00000e+00 1.93000e+02 1.78000e+01 3.95560e+02
  7.56000e+00]
 [2.49800e-02 0.00000e+00 1.89000e+00 0.00000e+00 5.18000e-01 6.54000e+00
  5.97000e+01 6.26690e+00 1.00000e+00 4.22000e+02 1.59000e+01 3.89960e+02
  8.65000e+00]
 [7.75223e+00 0.00000e+00 1.81000e+01 0.00000e+00 7.13000e-01 6.30100e+00
  8.37000e+01 2.78310e+00 2.40000e+01 6.66000e+02 2.02000e+01 2.72210e+02
  1.62300e+01]
 [9.88430e-01 0.00000e+00 8.14000e+00 0.00000e+00 5.38000e-01 5.81300e+00
  1.00000e+02 4.09520e+00 4.00000e+00 3.07000e+02 2.10000e+01 3.94540e+02
  1.98800e+01]
 [1.14320e-01 0.00000e+00 8.56000e+00 0.00000e+00 5.20000e-01 6.78100e+00
  7.13000e+01 2.85610e+00 5.00000e+00 3.84000e+02 2.09000e+01 3.95580e+02
  7.67000e+00]
 [5.69175e+00 0.00000e+00 1.81000e+01 0.00000e+00 5.83000e-01 6.11400e+00
  7.98000e+01 3.54590e+00 2.40000e+01 6.66000e+02 2.02000e+01 3.92680e+02
  1.49800e+01]]
0.29687309100000003

This time gradient descent produces warnings during training: every entry of the coefficient vector is nan, and so is the score. Comparing the first 10 training rows across the different dimensions (features), we can see that some features are tiny, only a few hundredths, while others are in the hundreds. With data like this, the gradient we compute can also be very large, and the default eta (step size / learning rate) is then too big, so the gradient-descent process does not converge. Let us change the value of eta to verify this guess.

import numpy as np
from sklearn import datasets
from playLA.LinearRegression import LinearRegression
from playLA.model_selection import train_test_split
import timeit

if __name__ == "__main__":

    boston = datasets.load_boston()
    X = boston.data
    y = boston.target
    X = X[y < 50]
    y = y[y < 50]
    reg = LinearRegression()
    X_train, X_test, y_train, y_test = train_test_split(X, y, seed=666)
    reg.fit_normal(X_train, y_train)
    print(reg.score(X_test, y_test))
    reg2 = LinearRegression()
    start_time = timeit.default_timer()
    reg2.fit_gd(X_train, y_train, eta=0.000001)
    print(reg2.coef)
    print(reg2.score(X_test, y_test))
    # print(X_train[:10, :])
    print(timeit.default_timer() - start_time)

运行结果

0.8129794056212907
[-0.10245704  0.11535876 -0.06248791  0.00207516  0.00447152  0.11954208
  0.04684195  0.03460927 -0.00452122  0.00324507  0.1271939   0.04484862
 -0.22542441]
0.27586818724477224
0.301082199

There are no warnings this time, but the R^2 score is only 0.27586818724477224, whereas the normal-equation solution scored about 0.8. Clearly our gradient descent has not yet reached the minimum of the loss function. This is probably because eta is now too small, so each step moves only a tiny amount and many more loops (iterations) are needed to find the minimum. To verify this guess, we increase the maximum number of iterations.

import numpy as np
from sklearn import datasets
from playLA.LinearRegression import LinearRegression
from playLA.model_selection import train_test_split
import timeit

if __name__ == "__main__":

    boston = datasets.load_boston()
    X = boston.data
    y = boston.target
    X = X[y < 50]
    y = y[y < 50]
    reg = LinearRegression()
    X_train, X_test, y_train, y_test = train_test_split(X, y, seed=666)
    reg.fit_normal(X_train, y_train)
    print(reg.score(X_test, y_test))
    reg2 = LinearRegression()
    start_time = timeit.default_timer()
    reg2.fit_gd(X_train, y_train, eta=0.000001, n_iters=1e6)
    print(reg2.coef)
    print(reg2.score(X_test, y_test))
    # print(X_train[:10, :])
    print(timeit.default_timer() - start_time)

运行结果

0.8129794056212907
[-1.07889200e-01  5.91494760e-02 -5.72920411e-02  1.19334353e-01
  2.07223623e-01  3.91254775e+00  1.50564949e-03 -5.36511902e-01
  1.13424276e-01 -9.76209406e-03  5.35544815e-02  1.58440412e-02
 -3.78786162e-01]
0.7542932581943915
27.325919068999998

This run took about 27 seconds, and the R^2 score is still only 0.7542932581943915, which means that even after all that time gradient descent has not reached the minimum of the loss function. Increasing the iteration count further would clearly be too time-consuming. What can we do about this? The root cause is that these training features are not on the same scale.

梯度下降法与数据归一化

使用梯度下降法前,最好进行数据归一化。

When we solved linear regression with the normal equation, data normalization was not needed: the whole solution reduces to evaluating one formula, with essentially no iterative search involved. With gradient descent things change. Because of the learning rate eta, features that live on very different scales distort the gradient, and the gradient multiplied by eta is the actual step taken at each iteration. That step can easily be either too large or too small: too large and the process diverges, as we saw with the default eta; too small and the search becomes painfully slow, as in the last experiment. But if we normalize all the data, this problem goes away completely.
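
For reference, standardization itself is very simple (a sketch using the textbook definition rather than scikit-learn's source): subtract each column's training-set mean and divide by its training-set standard deviation, reusing the same statistics on the test set. This is what StandardScaler does in the code below.

import numpy as np

def standardize(X_train, X_test):
    # Fit the statistics on the training set only, then reuse them for the test set
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    return (X_train - mean) / std, (X_test - mean) / std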

import numpy as np
from sklearn import datasets
from playLA.LinearRegression import LinearRegression
from playLA.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import timeit

if __name__ == "__main__":

    boston = datasets.load_boston()
    X = boston.data
    y = boston.target
    X = X[y < 50]
    y = y[y < 50]
    reg = LinearRegression()
    X_train, X_test, y_train, y_test = train_test_split(X, y, seed=666)
    reg.fit_normal(X_train, y_train)
    print(reg.score(X_test, y_test))
    reg2 = LinearRegression()
    # 将训练数据进行归一化处理
    standardScaler = StandardScaler()
    standardScaler.fit(X_train)
    X_train_standard = standardScaler.transform(X_train)
    start_time = timeit.default_timer()
    reg2.fit_gd(X_train_standard, y_train)
    print(reg2.coef)
    X_test_standard = standardScaler.transform(X_test)
    print(reg2.score(X_test_standard, y_test))
    # print(X_train[:10, :])
    print(timeit.default_timer() - start_time)

运行结果

0.8129794056212907
[-1.04042202  0.83093351 -0.24794356  0.01179456 -1.35034756  2.25074
 -0.66384353 -2.53568774  2.25572406 -2.34011572 -1.76565394  0.70923397
 -2.72677064]
0.8129873310487505
0.13547030499999996

这一次从结果来看,使用梯度下降法得到的R^2值跟我们使用正规方程解得到的R^2值是一致的,说明我们已经找到了这个损失函数的最小值。与此同时速度是非常快的。这就是数据归一化的威力。

梯度下降法的优势

import numpy as np
from playLA.LinearRegression import LinearRegression
import timeit

if __name__ == "__main__":

    # 样本数
    m = 1000
    # 特征数
    n = 5000
    # 随机化生成一个自变量正态分布矩阵
    big_X = np.random.normal(size=(m, n))
    # 随机化生成一个0到100之间的系数向量,长度为n+1
    true_theta = np.random.uniform(0, 100, size=n+1)
    # 生成因变量向量(含噪音,均值为0,标准差为10)
    big_y = big_X.dot(true_theta[1:]) + true_theta[0] + np.random.normal(0, 10, size=m)
    big_reg1 = LinearRegression()
    start_time = timeit.default_timer()
    big_reg1.fit_normal(big_X, big_y)
    print(timeit.default_timer() - start_time)
    big_reg2 = LinearRegression()
    start_time = timeit.default_timer()
    big_reg2.fit_gd(big_X, big_y)
    print(timeit.default_timer() - start_time)

运行结果

1.9443837830000001
3.53725637

From this result, with these sizes the normal-equation solution still trains faster than gradient descent; but watch what happens as we keep increasing the number of features:

import numpy as np
from playLA.LinearRegression import LinearRegression
import timeit

if __name__ == "__main__":

    # 样本数
    m = 1000
    # 特征数
    n = 6000
    # 随机化生成一个自变量正态分布矩阵
    big_X = np.random.normal(size=(m, n))
    # 随机化生成一个0到100之间的系数向量,长度为n+1
    true_theta = np.random.uniform(0, 100, size=n+1)
    # 生成因变量向量(含噪音,均值为0,标准差为10)
    big_y = big_X.dot(true_theta[1:]) + true_theta[0] + np.random.normal(0, 10, size=m)
    big_reg1 = LinearRegression()
    start_time = timeit.default_timer()
    big_reg1.fit_normal(big_X, big_y)
    print(timeit.default_timer() - start_time)
    big_reg2 = LinearRegression()
    start_time = timeit.default_timer()
    big_reg2.fit_gd(big_X, big_y)
    print(timeit.default_timer() - start_time)

运行结果

3.018317079
3.375811186

import numpy as np
from playLA.LinearRegression import LinearRegression
import timeit

if __name__ == "__main__":

    # 样本数
    m = 1000
    # 特征数
    n = 7000
    # 随机化生成一个自变量正态分布矩阵
    big_X = np.random.normal(size=(m, n))
    # 随机化生成一个0到100之间的系数向量,长度为n+1
    true_theta = np.random.uniform(0, 100, size=n+1)
    # 生成因变量向量(含噪音,均值为0,标准差为10)
    big_y = big_X.dot(true_theta[1:]) + true_theta[0] + np.random.normal(0, 10, size=m)
    big_reg1 = LinearRegression()
    start_time = timeit.default_timer()
    big_reg1.fit_normal(big_X, big_y)
    print(timeit.default_timer() - start_time)
    big_reg2 = LinearRegression()
    start_time = timeit.default_timer()
    big_reg2.fit_gd(big_X, big_y)
    print(timeit.default_timer() - start_time)

运行结果

4.479472085
3.2846188210000005

import numpy as np
from playLA.LinearRegression import LinearRegression
import timeit

if __name__ == "__main__":

    # 样本数
    m = 1000
    # 特征数
    n = 8000
    # 随机化生成一个自变量正态分布矩阵
    big_X = np.random.normal(size=(m, n))
    # 随机化生成一个0到100之间的系数向量,长度为n+1
    true_theta = np.random.uniform(0, 100, size=n+1)
    # 生成因变量向量(含噪音,均值为0,标准差为10)
    big_y = big_X.dot(true_theta[1:]) + true_theta[0] + np.random.normal(0, 10, size=m)
    big_reg1 = LinearRegression()
    start_time = timeit.default_timer()
    big_reg1.fit_normal(big_X, big_y)
    print(timeit.default_timer() - start_time)
    big_reg2 = LinearRegression()
    start_time = timeit.default_timer()
    big_reg2.fit_gd(big_X, big_y)
    print(timeit.default_timer() - start_time)

运行结果

6.36739793
2.9523499839999996

import numpy as np
from playLA.LinearRegression import LinearRegression
import timeit

if __name__ == "__main__":

    # 样本数
    m = 1000
    # 特征数
    n = 9000
    # 随机化生成一个自变量正态分布矩阵
    big_X = np.random.normal(size=(m, n))
    # 随机化生成一个0到100之间的系数向量,长度为n+1
    true_theta = np.random.uniform(0, 100, size=n+1)
    # 生成因变量向量(含噪音,均值为0,标准差为10)
    big_y = big_X.dot(true_theta[1:]) + true_theta[0] + np.random.normal(0, 10, size=m)
    big_reg1 = LinearRegression()
    start_time = timeit.default_timer()
    big_reg1.fit_normal(big_X, big_y)
    print(timeit.default_timer() - start_time)
    big_reg2 = LinearRegression()
    start_time = timeit.default_timer()
    big_reg2.fit_gd(big_X, big_y)
    print(timeit.default_timer() - start_time)

运行结果

8.681674721
2.8626002980000003

From these results, as the number of features keeps growing, the training time of the normal-equation solution keeps increasing, while the time taken by gradient descent actually decreases and levels off, ending up far below the normal-equation time. The exact number of features at which gradient descent overtakes the normal equation is not fixed; it depends on each machine's performance and configuration and has to be found by experiment.
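
Rather than copying the same script for every n, the comparison can also be wrapped in one loop (a sketch that reuses the playLA.LinearRegression class from above; timings will of course differ from machine to machine):

import numpy as np
import timeit
from playLA.LinearRegression import LinearRegression

def compare(m=1000, n=5000):
    # Build a random regression problem with m samples and n features
    big_X = np.random.normal(size=(m, n))
    true_theta = np.random.uniform(0, 100, size=n + 1)
    big_y = big_X.dot(true_theta[1:]) + true_theta[0] + np.random.normal(0, 10, size=m)
    for fit in ("fit_normal", "fit_gd"):
        reg = LinearRegression()
        start = timeit.default_timer()
        getattr(reg, fit)(big_X, big_y)
        print(n, fit, timeit.default_timer() - start)

if __name__ == "__main__":
    for n in (5000, 6000, 7000, 8000, 9000):
        compare(n=n)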

随机梯度下降法

The gradient we have been looking at,

$$\nabla J(\theta) = \frac{2}{m}\, X_b^T\,\big(X_b\theta - y\big),$$

sums over all m samples, and the corresponding algorithm is called Batch Gradient Descent. The problem is that if the sample size m is very large, computing this gradient is itself very time-consuming.

In batch gradient descent, every component of the gradient vector is a sum over all m samples. The improvement is to use only one sample at a time to determine a search direction, instead of the true gradient of the loss function:

$$2\,\big(X_b^{(i)}\big)^T\,\big(X_b^{(i)}\theta - y^{(i)}\big)$$

Each time we pick a random index i; this expression still defines a direction, and we search along such directions, iterating again and again. Can this still bring us to the minimum of the whole loss function? This idea is called Stochastic Gradient Descent.

从上图,我们可以看出,损失函数每次搜索的路径是随机的,而不是像批量梯度下降法一样总是沿着固定方向(最快到达最小值的垂线)搜索。随机梯度下降法不仅不保证是最快的下降路径,它还有可能是回退(上升)的。但即使如此,我们通过随机梯度下降法,也能差不多的来到损失函数的最小值附近。虽然它不像批量梯度下降法一样,一定来到最小值的位置。但是如果我们的样本量m非常大的话,我们愿意用一定的精度来换取一定的时间,这样一来,随机梯度下降法就有意义了。在随机梯度下降法中,步长(学习率)的取值变的很重要,这是因为在我们的随机梯度下降法的过程中,如果学习率一直取一个固定值的话,那很有可能在一定程度上,我们的随机梯度下降法已经来到损失函数最小值中心左右的位置了,但是由于这个随机的过程不够好,学习率又是一个固定值,慢慢又跳出了最小值所在的位置,所以我们希望在实际中,我们的随机梯度下降法的学习率是逐渐递减的。我们可以设计一个函数来让我们的学习率值随着循环(迭代)次数的增多,相应的越来越小。

$$\eta = \frac{1}{\text{i\_iters}}$$

(here $\eta$ is the learning rate and i_iters the iteration count; as the number of iterations grows, $\eta$ becomes smaller and smaller)

This schedule has a problem: when the iteration count is small, $\eta$ drops too quickly. Going from iteration 1 to 2 it falls by 50%, yet from 10000 to 10001 it only falls by about one ten-thousandth. So we usually add a constant b to the denominator to soften this, and put a constant a in the numerator instead of 1 to gain a little more flexibility:

$$\eta = \frac{a}{\text{i\_iters} + b}$$

Here a and b are the two hyperparameters of stochastic gradient descent. We will not tune them here and simply take the empirical values a = 5 and b = 50. In any case, to get a reasonably good convergence with stochastic gradient descent, the learning rate should decay gradually as the number of iterations grows. This decaying idea mimics simulated annealing, a very important idea in the search literature: when forging steel, the temperature cools gradually from high to low, and that temperature is a function of the time t. So the learning rate is sometimes written as

$$\eta = \frac{t_0}{t + t_1}$$
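
A quick side-by-side look at the two schedules (a small sketch using the empirical values t0 = 5, t1 = 50 adopted later) shows how the added constants smooth out the early drop:

t0, t1 = 5, 50
for t in (1, 2, 10, 100, 10000):
    # 1/t falls by 50% between t=1 and t=2; t0/(t+t1) changes much more gently
    print(t, 1 / t, t0 / (t + t1))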

我们还是先用批量梯度下降法来算100万条模拟样本数据进行线性函数常数的拟合

import numpy as np
import timeit

if __name__ == "__main__":

    m = 1000000
    # 生成100万条自变量随机数据向量
    np.random.seed(666)
    x = np.random.normal(size=m)
    # 整理成m*1的矩阵
    X = x.reshape(-1, 1)
    # 生成因变量向量(带噪音,均值为0,标准差为3)
    y = 4 * x + 3 + np.random.normal(0, 3, size=m)
    def J(theta, X_b, y):
        # 构建损失函数
        try:
            return np.sum((y - X_b.dot(theta))**2) / len(X_b)
        except:
            return float('inf')

    def dJ(theta, X_b, y):
        # 获取梯度向量
        return X_b.T.dot(X_b.dot(theta) - y) * 2 / len(X_b)

    def gradient_descent(X_b, y, initial_theta, eta, n_iters = 1e4, epsilon=1e-8):
        """
        批量梯度下降算法
        :param X_b: 带虚拟特征1的自变量特征矩阵
        :param y: 因变量向量
        :param initial_theta: 初始的常数向量,这里需要注意的是真正待求的是常数向量,求偏导的也是常数向量
        :param eta: 迭代步长、学习率
        :param n_iters: 最大迭代次数
        :param epsilon: 误差值
        :return:
        """
        theta = initial_theta
        # 真实迭代次数
        i_iter = 0
        while i_iter < n_iters:
            # 获取梯度
            gradient = dJ(theta, X_b, y)
            last_theta = theta
            # 迭代更新theta,不断顺着梯度方向寻找新的theta
            theta = theta - eta * gradient
            # 计算前后两次迭代后的损失函数差值的绝对值
            if abs(J(theta, X_b, y) - J(last_theta, X_b, y)) < epsilon:
                break
            # 更新迭代次数
            i_iter += 1
        return theta

    start_time = timeit.default_timer()
    X_b = np.hstack([np.ones((len(X), 1)), X])
    initial_theta = np.zeros(X_b.shape[1])
    eta = 0.01
    theta = gradient_descent(X_b, y, initial_theta, eta)
    print(theta)
    print(timeit.default_timer() - start_time)

运行结果

[3.00375106 4.00219484]
6.453924730000001

现在我们加上随机梯度下降法

import numpy as np
import timeit

if __name__ == "__main__":

    m = 1000000
    # 生成100万条自变量随机数据向量
    np.random.seed(666)
    x = np.random.normal(size=m)
    # 整理成m*1的矩阵
    X = x.reshape(-1, 1)
    # 生成因变量向量(带噪音,均值为0,标准差为3)
    y = 4 * x + 3 + np.random.normal(0, 3, size=m)
    def J(theta, X_b, y):
        # 构建损失函数
        try:
            return np.sum((y - X_b.dot(theta))**2) / len(X_b)
        except:
            return float('inf')

    def dJ(theta, X_b, y):
        # 获取梯度向量
        return X_b.T.dot(X_b.dot(theta) - y) * 2 / len(X_b)

    def gradient_descent(X_b, y, initial_theta, eta, n_iters = 1e4, epsilon=1e-8):
        """
        批量梯度下降算法
        :param X_b: 带虚拟特征1的自变量特征矩阵
        :param y: 因变量向量
        :param initial_theta: 初始的常数向量,这里需要注意的是真正待求的是常数向量,求偏导的也是常数向量
        :param eta: 迭代步长、学习率
        :param n_iters: 最大迭代次数
        :param epsilon: 误差值
        :return:
        """
        theta = initial_theta
        # 真实迭代次数
        i_iter = 0
        while i_iter < n_iters:
            # 获取梯度
            gradient = dJ(theta, X_b, y)
            last_theta = theta
            # 迭代更新theta,不断顺着梯度方向寻找新的theta
            theta = theta - eta * gradient
            # 计算前后两次迭代后的损失函数差值的绝对值
            if abs(J(theta, X_b, y) - J(last_theta, X_b, y)) < epsilon:
                break
            # 更新迭代次数
            i_iter += 1
        return theta

    start_time = timeit.default_timer()
    X_b = np.hstack([np.ones((len(X), 1)), X])
    initial_theta = np.zeros(X_b.shape[1])
    eta = 0.01
    theta = gradient_descent(X_b, y, initial_theta, eta)
    print(theta)
    print(timeit.default_timer() - start_time)
    def dJ_sgd(theta, X_b_i, y_i):
        # 获取某一个随机梯度值
        return X_b_i.T.dot(X_b_i.dot(theta) - y_i) * 2

    def sgd(X_b, y, initial_theta, n_iters):
        """
        随机梯度下降法
        :param X_b: 带虚拟特征1的自变量特征矩阵
        :param y: 因变量向量
        :param initial_theta: 初始的常数向量
        :param n_iters: 最大迭代次数
        :return:
        """
        t0 = 5
        t1 = 50
        def learning_rate(t):
            # 随机梯度下降法的学习率
            return t0 / (t + t1)
        theta = initial_theta
        for cur_iter in range(n_iters):
            rand_i = np.random.randint(len(X_b))
            gradient = dJ_sgd(theta, X_b[rand_i], y[rand_i])
            theta = theta - learning_rate(cur_iter) * gradient
        return theta

    start_time = timeit.default_timer()
    initial_theta = np.zeros(X_b.shape[1])
    theta = sgd(X_b, y, initial_theta, n_iters=len(X_b) // 3)
    print(theta)
    print(timeit.default_timer() - start_time)

运行结果

[3.00375106 4.00219484]
6.453924730000001
[3.01543883 3.98431204]
4.022960640000001

通过结果,我们可以看出通过随机梯度下降法拟合的常数向量和使用批量梯度下降法是非常接近的,而使用的时间要少了2秒多。因为这里使用的是简单线性回归,所以使用了三分之一的样本数作为迭代次数,当然在高维特征线性回归中,不能只使用三分之一的样本数作为迭代次数,后面会介绍。

scikit-learn中的随机梯度下降法

Let us first wrap stochastic gradient descent into our linear regression class. Compared with the previous section there is one adjustment, concerning what the iteration count means: n_iters is now the number of passes over the whole training set rather than the number of single-sample updates.

import numpy as np
from numpy.linalg import inv
from .metrics import r2_score

class LinearRegression:
    
    def __init__(self):
        # 初始化LinearRegression模型
        self.coef = None  #系数
        self.interception = None  #截距
        self._theta = None

    def fit_normal(self, X_train, y_train):
        # 根据训练数据集X_train, y_train训练Linear Regression模型
        assert X_train.shape[0] == y_train.shape[0], \
            "X_train的列数必须等于y_train的长度"
        X_b = np.hstack([np.ones((len(X_train), 1)), X_train])
        self._theta = inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y_train)
        self.interception = self._theta[0]
        self.coef = self._theta[1:]
        return self

    def fit_gd(self, X_train, y_train, eta=0.01, n_iters=1e4):
        # 根据训练数据集X_train,y_train,使用梯度下降法训练Linear Regression模型
        assert X_train.shape[0] == y_train.shape[0], \
            "X_train的列数必须等于y_train的长度"

        def J(theta, X_b, y):
            # 构建损失函数
            try:
                return np.sum((y - X_b.dot(theta)) ** 2) / len(X_b)
            except:
                return float('inf')

        def dJ(theta, X_b, y):
            # 对theta求偏导数,获取梯度向量
            # res = np.empty(len(theta))
            # res[0] = np.sum(X_b.dot(theta) - y)
            # for i in range(1, len(theta)):
            #     res[i] = (X_b.dot(theta) - y).dot(X_b[:, i])
            # return res * 2 / len(X_b)
            return X_b.T.dot(X_b.dot(theta) - y) * 2 / len(X_b)

        def gradient_descent(X_b, y, initial_theta, eta, n_iters=1e4, epsilon=1e-8):
            """
            梯度下降算法
            :param X_b: 带虚拟特征1的自变量特征矩阵
            :param y: 因变量向量
            :param initial_theta: 初始的常数向量,这里需要注意的是真正待求的是常数向量,求偏导的也是常数向量
            :param eta: 迭代步长、学习率
            :param n_iters: 最大迭代次数
            :param epsilon: 误差值
            :return:
            """
            theta = initial_theta
            # 真实迭代次数
            i_iter = 0
            while i_iter < n_iters:
                # 获取梯度
                gradient = dJ(theta, X_b, y)
                last_theta = theta
                # 迭代更新theta,不断顺着梯度方向寻找新的theta
                theta = theta - eta * gradient
                # 计算前后两次迭代后的损失函数差值的绝对值
                if abs(J(theta, X_b, y) - J(last_theta, X_b, y)) < epsilon:
                    break
                # 更新迭代次数
                i_iter += 1
            return theta

        X_b = np.hstack([np.ones((len(X_train), 1)), X_train])
        initial_theta = np.zeros(X_b.shape[1])
        self._theta = gradient_descent(X_b, y_train, initial_theta, eta, n_iters)
        self.interception = self._theta[0]
        self.coef = self._theta[1:]
        return self

    def fit_sgd(self, X_train, y_train, n_iters=5, t0=5, t1=50):
        # 根据训练数据集X_train,y_train,使用随机梯度下降法训练Linear Regression模型
        assert X_train.shape[0] == y_train.shape[0], \
            "X_train的列数必须等于y_train的长度"
        assert n_iters >= 1

        def dJ_sgd(theta, X_b_i, y_i):
            # 获取某一个随机梯度值
            return X_b_i.T.dot(X_b_i.dot(theta) - y_i) * 2

        def sgd(X_b, y, initial_theta, n_iters, t0=5, t1=50):
            """
            随机梯度下降法
            :param X_b: 带虚拟特征1的自变量特征矩阵
            :param y: 因变量向量
            :param initial_theta: 初始的常数向量
            :param n_iters: 最大迭代次数
            :return:
            """
            def learning_rate(t):
                # 随机梯度下降法的学习率
                return t0 / (t + t1)

            theta = initial_theta
            m = len(X_b)
            # 每一轮把全部样本随机打乱后整体过一遍,一共过n_iters遍
            for cur_iter in range(n_iters):
                indexes = np.random.permutation(m)
                X_b_new = X_b[indexes]
                y_new = y[indexes]
                for i in range(m):
                    gradient = dJ_sgd(theta, X_b_new[i], y_new[i])
                    theta = theta - learning_rate(cur_iter * m + i) * gradient
            return theta
        X_b = np.hstack([np.ones((len(X_train), 1)), X_train])
        initial_theta = np.random.randn(X_b.shape[1])
        self._theta = sgd(X_b, y_train, initial_theta, n_iters, t0, t1)
        self.interception = self._theta[0]
        self.coef = self._theta[1:]
        return self

    def predict(self, X_predict):
        # 给定待预测数据集X_predict,返回表示X_predict的结果向量
        assert self.interception is not None and self.coef is not None, \
            "开始预测前必须fit"
        assert X_predict.shape[1] == len(self.coef), \
            "预测的特征数必须与训练的特征数相等"
        X_b = np.hstack([np.ones((len(X_predict), 1)), X_predict])
        return X_b.dot(self._theta)

    def score(self, X_test, y_test):
        # 根据测试数据集X_test和y_test确定当前模型的准确度
        y_predict = self.predict(X_test)
        return r2_score(y_test, y_predict)

    def __repr__(self):
        return "LinearRegression()"

import numpy as np
import timeit
from playLA.LinearRegression import LinearRegression

if __name__ == "__main__":

    m = 1000000
    # 生成100万条自变量随机数据向量
    np.random.seed(666)
    x = np.random.normal(size=m)
    # 整理成m*1的矩阵
    X = x.reshape(-1, 1)
    # 生成因变量向量(带噪音,均值为0,标准差为3)
    y = 4 * x + 3 + np.random.normal(0, 3, size=m)
    def J(theta, X_b, y):
        # 构建损失函数
        try:
            return np.sum((y - X_b.dot(theta))**2) / len(X_b)
        except:
            return float('inf')

    def dJ(theta, X_b, y):
        # 获取梯度向量
        return X_b.T.dot(X_b.dot(theta) - y) * 2 / len(X_b)

    def gradient_descent(X_b, y, initial_theta, eta, n_iters = 1e4, epsilon=1e-8):
        """
        批量梯度下降算法
        :param X_b: 带虚拟特征1的自变量特征矩阵
        :param y: 因变量向量
        :param initial_theta: 初始的常数向量,这里需要注意的是真正待求的是常数向量,求偏导的也是常数向量
        :param eta: 迭代步长、学习率
        :param n_iters: 最大迭代次数
        :param epsilon: 误差值
        :return:
        """
        theta = initial_theta
        # 真实迭代次数
        i_iter = 0
        while i_iter < n_iters:
            # 获取梯度
            gradient = dJ(theta, X_b, y)
            last_theta = theta
            # 迭代更新theta,不断顺着梯度方向寻找新的theta
            theta = theta - eta * gradient
            # 计算前后两次迭代后的损失函数差值的绝对值
            if abs(J(theta, X_b, y) - J(last_theta, X_b, y)) < epsilon:
                break
            # 更新迭代次数
            i_iter += 1
        return theta

    start_time = timeit.default_timer()
    X_b = np.hstack([np.ones((len(X), 1)), X])
    initial_theta = np.zeros(X_b.shape[1])
    eta = 0.01
    theta = gradient_descent(X_b, y, initial_theta, eta)
    print(theta)
    print(timeit.default_timer() - start_time)
    def dJ_sgd(theta, X_b_i, y_i):
        # 获取某一个随机梯度值
        return X_b_i.T.dot(X_b_i.dot(theta) - y_i) * 2

    def sgd(X_b, y, initial_theta, n_iters):
        """
        随机梯度下降法
        :param X_b: 带虚拟特征1的自变量特征矩阵
        :param y: 因变量向量
        :param initial_theta: 初始的常数向量
        :param n_iters: 最大迭代次数
        :return:
        """
        t0 = 5
        t1 = 50
        def learning_rate(t):
            # 随机梯度下降法的学习率
            return t0 / (t + t1)
        theta = initial_theta
        for cur_iter in range(n_iters):
            rand_i = np.random.randint(len(X_b))
            gradient = dJ_sgd(theta, X_b[rand_i], y[rand_i])
            theta = theta - learning_rate(cur_iter) * gradient
        return theta

    start_time = timeit.default_timer()
    initial_theta = np.zeros(X_b.shape[1])
    theta = sgd(X_b, y, initial_theta, n_iters=len(X_b) // 3)
    print(theta)
    print(timeit.default_timer() - start_time)
    reg = LinearRegression()
    reg.fit_sgd(X, y, n_iters=2)
    print(reg.coef)
    print(reg.interception)

运行结果

[3.00375106 4.00219484]
5.302007303
[3.01543883 3.98431204]
4.018497287
[4.0004335]
3.0050395227201014

现在我们使用波士顿房价的真实数据来看看随机梯度下降法的结果

from sklearn import datasets
from playLA.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from playLA.LinearRegression import LinearRegression
import timeit

if __name__ == "__main__":

    boston = datasets.load_boston()
    X = boston.data
    y = boston.target
    X = X[y < 50]
    y = y[y < 50]
    X_train, X_test, y_train, y_test = train_test_split(X, y, seed=666)
    standardScaler = StandardScaler()
    standardScaler.fit(X_train)
    X_train_standard = standardScaler.transform(X_train)
    X_test_standard = standardScaler.transform(X_test)
    reg = LinearRegression()
    start_time = timeit.default_timer()
    reg.fit_sgd(X_train_standard, y_train, n_iters=2)
    print(timeit.default_timer() - start_time)
    print(reg.score(X_test_standard, y_test))

运行结果

0.005665511000000012
0.7857275413602652

根据结果我们可以看出,R^2评价值没有达到之前使用正规方程解的0.8,说明我们的随机梯度下降法还没有达到损失函数的最小值。现在我们调大对整个样本数据查看的次数。

from sklearn import datasets
from playLA.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from playLA.LinearRegression import LinearRegression
import timeit

if __name__ == "__main__":

    boston = datasets.load_boston()
    X = boston.data
    y = boston.target
    X = X[y < 50]
    y = y[y < 50]
    X_train, X_test, y_train, y_test = train_test_split(X, y, seed=666)
    standardScaler = StandardScaler()
    standardScaler.fit(X_train)
    X_train_standard = standardScaler.transform(X_train)
    X_test_standard = standardScaler.transform(X_test)
    reg = LinearRegression()
    start_time = timeit.default_timer()
    reg.fit_sgd(X_train_standard, y_train, n_iters=50)
    print(timeit.default_timer() - start_time)
    print(reg.score(X_test_standard, y_test))

运行结果

0.07702159399999997
0.8128901094766997

从这个例子可以看出,在随机梯度下降法中,我们可能要调整整个样本数据查看的次数才有可能达到满意的结果。

最后我们来看一下在scikit-learn中是如何使用随机梯度下降法SGD的。

from sklearn import datasets
from playLA.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from playLA.LinearRegression import LinearRegression
from sklearn.linear_model import SGDRegressor
import timeit

if __name__ == "__main__":

    boston = datasets.load_boston()
    X = boston.data
    y = boston.target
    X = X[y < 50]
    y = y[y < 50]
    X_train, X_test, y_train, y_test = train_test_split(X, y, seed=666)
    standardScaler = StandardScaler()
    standardScaler.fit(X_train)
    X_train_standard = standardScaler.transform(X_train)
    X_test_standard = standardScaler.transform(X_test)
    reg = LinearRegression()
    start_time = timeit.default_timer()
    reg.fit_sgd(X_train_standard, y_train, n_iters=50)
    print(timeit.default_timer() - start_time)
    print(reg.score(X_test_standard, y_test))
    sgd_reg = SGDRegressor()
    start_time = timeit.default_timer()
    sgd_reg.fit(X_train_standard, y_train)
    print(timeit.default_timer() - start_time)
    print(sgd_reg.score(X_test_standard, y_test))

运行结果

0.07724935
0.8128901094766997
0.001499403999999982
0.8130319307073004

这里需要说明的是scikit-learn中的随机梯度下降法比我们自己封装的要复杂,效率更高(从结果也可以看出来),它使用了很多优化的手段。
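
If finer control is needed, a few of SGDRegressor's knobs can also be set explicitly (a sketch: the parameter names follow scikit-learn's documentation, the values here are only illustrative, and X_train_standard / y_train / X_test_standard / y_test are the variables prepared in the script above):

from sklearn.linear_model import SGDRegressor

# max_iter: number of passes (epochs) over the training data
# tol: stopping tolerance; eta0: initial learning rate
sgd_reg = SGDRegressor(max_iter=100, tol=1e-4, eta0=0.01)
sgd_reg.fit(X_train_standard, y_train)
print(sgd_reg.score(X_test_standard, y_test))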

PCA与梯度上升法

什么是PCA

梯度法是寻找一个函数的极值,不管是寻找损失函数的最小值还是效用函数的最大值。这里我们来看梯度法的另外一个用处——主成分分析(Principal Component Analysis)

Principal component analysis is applied not only in machine learning; it is also a very important method in statistics. It is usually abbreviated to PCA (from its English initials), and its characteristics are as follows:

  1. 一个非监督的机器学习算法
  2. 主要用于数据的降维
  3. 通过降维,可以发现更便于人类理解的特征
  4. 其他应用:可视化;去噪

以前我们在讲线性回归的时候,无论一组数据,它有多少个维度,我们都使用一根直线轴来表示,如

这里x轴是各个维度的向量数据,而y轴则是这些各个维度的线性函数值。与线性回归不同,PCA是将不同维度的数据用不同的坐标轴来表示,如

这里就是两个不同维度的特征用不同的坐标轴来表示。PCA的主要作用是降维,我们来理解一下它的原理

降维有一个非常简单的办法就是扔掉一个特征,比如

或者

The goal of this kind of dimensionality reduction is to keep the spacing between the original samples as large as possible. Clearly, after projecting onto the x axis the samples remain further apart than after projecting onto the y axis. Is there a scheme that preserves the spacing between samples even better than projecting onto the x axis?

会不会有一根直线,所有的点全部映射到这根直线上,它们间的距离是最大的呢?

这里我们需要注意的是这条直线对于我们来说是未知的,也是我们要求的。而且这条直线也可以理解为一个轴,同样也是将二维数据降到了一维。而且所有的样本点更加趋近于原来的分布情况。

Define the spread between samples using the variance (see 概率论整理(二) for the concept):

$$Var(x) = \frac{1}{m}\sum_{i=1}^{m}\big(x^{(i)} - \bar{x}\big)^2$$

Here $\bar{x}$ is the mean, which in probability theory is called the expectation.

这样PCA就变成了找到一个轴,使得样本空间的所有点映射到这个轴后,方差最大

我们首先要将样本的均值归为0(demean)

这样样本的分布是没有改变的,我们只是将坐标轴进行了移动,这样使得样本的每一个维度的均值都是0.

The variance then becomes $Var(x) = \frac{1}{m}\sum_{i=1}^{m}\big(x^{(i)}\big)^2$, since the mean is now 0. We denote the direction of this line by $w = (w_1, w_2)$, and we want that, once all samples are mapped onto w,

$$Var\big(X_{project}\big) = \frac{1}{m}\sum_{i=1}^{m}\big(X^{(i)}_{project} - \bar{X}_{project}\big)^2$$

is as large as possible.

Note that $X^{(i)}_{project}$ and $\bar{X}_{project}$ are both vectors, so after subtracting them we actually take the norm of the difference, and the quantity to maximize becomes

$$Var\big(X_{project}\big) = \frac{1}{m}\sum_{i=1}^{m}\big\lVert X^{(i)}_{project} - \bar{X}_{project}\big\rVert^2$$

Since the data have been demeaned, $\bar{X}_{project}$ is still 0, so we want to maximize

$$Var\big(X_{project}\big) = \frac{1}{m}\sum_{i=1}^{m}\big\lVert X^{(i)}_{project}\big\rVert^2$$

that is, the sum of the squared norms of all sample points after mapping them onto w should be as large as possible.

From 线性代数整理 we know that the dot product of two vectors equals the product of their norms times the cosine of the angle between them: $X^{(i)} \cdot w = \lVert X^{(i)}\rVert\,\lVert w\rVert\cos\theta$. Since w only represents a direction, its norm is 1, so $X^{(i)} \cdot w = \lVert X^{(i)}\rVert\cos\theta$.

It follows that $X^{(i)} \cdot w = \big\lVert X^{(i)}_{project}\big\rVert$, so we want $\frac{1}{m}\sum_{i=1}^{m}\big(X^{(i)} \cdot w\big)^2$ to be maximal.

The final goal is therefore: find w such that $\frac{1}{m}\sum_{i=1}^{m}\big(X^{(i)} \cdot w\big)^2$ is maximal.

Of course, here we are dealing with two-dimensional vectors; for n-dimensional vectors the objective is

$$\frac{1}{m}\sum_{i=1}^{m}\big(X_1^{(i)}w_1 + X_2^{(i)}w_2 + \dots + X_n^{(i)}w_n\big)^2$$

or, equivalently, $\frac{1}{m}\sum_{i=1}^{m}\big(X^{(i)} \cdot w\big)^2$.

这是一个目标函数的最优化问题,使用梯度上升法解决

使用梯度上升法求解PCA问题

From the previous section we know the goal: find w such that $f(X) = \frac{1}{m}\sum_{i=1}^{m}\big(X_1^{(i)}w_1 + X_2^{(i)}w_2 + \dots + X_n^{(i)}w_n\big)^2$ is maximal. Here the unknown is w, while X is the known sample data.

To compute the gradient, the first step is of course to take the partial derivative of f with respect to each component of w:

$$\nabla f = \begin{pmatrix} \partial f/\partial w_1 \\ \partial f/\partial w_2 \\ \vdots \\ \partial f/\partial w_n \end{pmatrix} = \frac{2}{m}\begin{pmatrix} \sum_{i=1}^{m}\big(X^{(i)}\cdot w\big)X_1^{(i)} \\ \sum_{i=1}^{m}\big(X^{(i)}\cdot w\big)X_2^{(i)} \\ \vdots \\ \sum_{i=1}^{m}\big(X^{(i)}\cdot w\big)X_n^{(i)} \end{pmatrix}$$

This is again the derivative of a composite function; the chain rule gives all the partial derivatives above. By the definition of the dot product, $X^{(i)}\cdot w = X_1^{(i)}w_1 + X_2^{(i)}w_2 + \dots + X_n^{(i)}w_n$, so each component of the gradient has the form $\frac{2}{m}\sum_{i}\big(X^{(i)}\cdot w\big)X_j^{(i)}$.

As with the vectorization in linear regression, by the rule for multiplying a vector and a matrix (each component of the vector is multiplied with the corresponding column of the matrix, and the result is again a vector) the expression above can be rewritten as $\frac{2}{m}\,(Xw)^T \cdot X$.

Since the vectors we talk about are by convention column vectors, the transpose is added to express the row vector; transposing back by the rule for transposing a product gives

$$\nabla f = \frac{2}{m}\, X^T\,(Xw)$$

This is the final expression we need for the gradient.

求数据的主成分PCA

我们先来绘制100个具有两个维度的向量数据,第二个维度跟第一个维度有一定的线性关系

import numpy as np
import matplotlib.pyplot as plt

if __name__ == "__main__":

    X = np.empty((100, 2))
    X[:, 0] = np.random.uniform(0, 100, size=100)
    X[:, 1] = 0.75 * X[:, 0] + 3 + np.random.normal(0, 10, size=100)
    plt.scatter(X[:, 0], X[:, 1])
    plt.show()

现在我们对这组数据进行降维。首先就是demean操作,将所有向量的均值归0。

import numpy as np
import matplotlib.pyplot as plt

if __name__ == "__main__":

    X = np.empty((100, 2))
    X[:, 0] = np.random.uniform(0, 100, size=100)
    X[:, 1] = 0.75 * X[:, 0] + 3 + np.random.normal(0, 10, size=100)
    # plt.scatter(X[:, 0], X[:, 1])
    # plt.show()

    def demean(X):
        return X - np.mean(X, axis=0)
    X_demean = demean(X)
    plt.scatter(X_demean[:, 0], X_demean[:, 1])
    plt.show()

通过这两幅图的对比,我们可以发现它们的形状并没有变化,只是坐标发生了变化。

现在我们要使用梯度上升法来求w向量,先获取目标函数

def f(w, X):
    # 目标函数
    return np.sum((X.dot(w)**2)) / len(X)

再获取梯度=

def df_math(w, X):
    # 梯度
    return X.T.dot(X.dot(w)) * 2 / len(X)

现在我们来验证一下我们求解梯度的正确性

def df_debug(w, X, epsilon=0.0001):
    res = np.empty(len(w))
    for i in range(len(w)):
        w_1 = w.copy()
        w_1[i] += epsilon
        w_2 = w.copy()
        w_2[i] -= epsilon
        res[i] = (f(w_1, X) - f(w_2, X)) / (2 * epsilon)
    return res

然后是我们的梯度上升法

def direction(w):
    # 把w变成单位向量,只表示方向
    return w / np.linalg.norm(w)

def gradient_ascent(df, X, initial_w, eta, n_iters = 1e4, epsilon=1e-8):
    """
    梯度上升法
    :param df: 梯度函数
    :param X: 数据矩阵
    :param initial_w: 初始的常数向量,这里需要注意的是真正待求的是常数向量,求偏导的也是常数向量
    :param eta: 步长,学习率
    :param n_iters: 最大迭代次数
    :param epsilon: 误差值
    :return:
    """
    # 将初始的常数向量变成单位向量
    w = direction(initial_w)
    # 真实迭代次数
    cur_iter = 0
    while cur_iter < n_iters:
        # 获取梯度
        gradient = df(w, X)
        last_w = w
        # 迭代更新w,不断顺着梯度方向寻找新的w
        # 跟梯度下降法不同的是,梯度下降法是-,梯度上升法这里是+
        w = w + eta * gradient
        # 将获取的新的w重新变成单位向量
        w = direction(w)
        # 计算前后两次迭代后的目标函数差值的绝对值
        if abs(f(w, X) - f(last_w, X)) < epsilon:
            break
        # 更新迭代次数
        cur_iter += 1
    return w

现在我们来看一下梯度上升法求出来的w向量坐标

import numpy as np
import matplotlib.pyplot as plt

if __name__ == "__main__":

    X = np.empty((100, 2))
    X[:, 0] = np.random.uniform(0, 100, size=100)
    X[:, 1] = 0.75 * X[:, 0] + 3 + np.random.normal(0, 10, size=100)
    # plt.scatter(X[:, 0], X[:, 1])
    # plt.show()

    def demean(X):
        return X - np.mean(X, axis=0)
    X_demean = demean(X)
    # plt.scatter(X_demean[:, 0], X_demean[:, 1])
    # plt.show()

    def f(w, X):
        # 目标函数
        return np.sum((X.dot(w)**2)) / len(X)

    def df_math(w, X):
        # 梯度
        return X.T.dot(X.dot(w)) * 2 / len(X)

    def df_debug(w, X, epsilon=0.0001):
        res = np.empty(len(w))
        for i in range(len(w)):
            w_1 = w.copy()
            w_1[i] += epsilon
            w_2 = w.copy()
            w_2[i] -= epsilon
            res[i] = (f(w_1, X) - f(w_2, X)) / (2 * epsilon)
        return res

    def direction(w):
        # 把w变成单位向量,只表示方向
        return w / np.linalg.norm(w)

    def gradient_ascent(df, X, initial_w, eta, n_iters = 1e4, epsilon=1e-8):
        """
        梯度上升法
        :param df: 梯度函数
        :param X: 数据矩阵
        :param initial_w: 初始的常数向量,这里需要注意的是真正待求的是常数向量,求偏导的也是常数向量
        :param eta: 步长,学习率
        :param n_iters: 最大迭代次数
        :param epsilon: 误差值
        :return:
        """
        # 将初始的常数向量变成单位向量
        w = direction(initial_w)
        # 真实迭代次数
        cur_iter = 0
        while cur_iter < n_iters:
            # 获取梯度
            gradient = df(w, X)
            last_w = w
            # 迭代更新w,不断顺着梯度方向寻找新的w
            # 跟梯度下降法不同的是,梯度下降法是-,梯度上升法这里是+
            w = w + eta * gradient
            # 将获取的新的w重新变成单位向量
            w = direction(w)
            # 计算前后两次迭代后的目标函数差值的绝对值
            if abs(f(w, X) - f(last_w, X)) < epsilon:
                break
            # 更新迭代次数
            cur_iter += 1
        return w

    # 不能从0向量开始
    initial_w = np.random.random(X.shape[1])
    eta = 0.001
    # 不能使用StandardScaler标准化数据
    print(gradient_ascent(df_debug, X_demean, initial_w, eta))
    print(gradient_ascent(df_math, X_demean, initial_w, eta))

运行结果

[0.76912318 0.63910057]
[0.76912318 0.63910057]

由此可见,两种方式求出来的w坐标是一样的。

最后我们将w这条直线给绘制出来

import numpy as np
import matplotlib.pyplot as plt

if __name__ == "__main__":

    X = np.empty((100, 2))
    X[:, 0] = np.random.uniform(0, 100, size=100)
    X[:, 1] = 0.75 * X[:, 0] + 3 + np.random.normal(0, 10, size=100)
    # plt.scatter(X[:, 0], X[:, 1])
    # plt.show()

    def demean(X):
        return X - np.mean(X, axis=0)
    X_demean = demean(X)
    # plt.scatter(X_demean[:, 0], X_demean[:, 1])
    # plt.show()

    def f(w, X):
        # 目标函数
        return np.sum((X.dot(w)**2)) / len(X)

    def df_math(w, X):
        # 梯度
        return X.T.dot(X.dot(w)) * 2 / len(X)

    def df_debug(w, X, epsilon=0.0001):
        res = np.empty(len(w))
        for i in range(len(w)):
            w_1 = w.copy()
            w_1[i] += epsilon
            w_2 = w.copy()
            w_2[i] -= epsilon
            res[i] = (f(w_1, X) - f(w_2, X)) / (2 * epsilon)
        return res

    def direction(w):
        # 把w变成单位向量,只表示方向
        return w / np.linalg.norm(w)

    def gradient_ascent(df, X, initial_w, eta, n_iters = 1e4, epsilon=1e-8):
        """
        梯度上升法
        :param df: 梯度函数
        :param X: 数据矩阵
        :param initial_w: 初始的常数向量,这里需要注意的是真正待求的是常数向量,求偏导的也是常数向量
        :param eta: 步长,学习率
        :param n_iters: 最大迭代次数
        :param epsilon: 误差值
        :return:
        """
        # 将初始的常数向量变成单位向量
        w = direction(initial_w)
        # 真实迭代次数
        cur_iter = 0
        while cur_iter < n_iters:
            # 获取梯度
            gradient = df(w, X)
            last_w = w
            # 迭代更新w,不断顺着梯度方向寻找新的w
            # 跟梯度下降法不同的是,梯度下降法是-,梯度上升法这里是+
            w = w + eta * gradient
            # 将获取的新的w重新变成单位向量
            w = direction(w)
            # 计算前后两次迭代后的目标函数差值的绝对值
            if abs(f(w, X) - f(last_w, X)) < epsilon:
                break
            # 更新迭代次数
            cur_iter += 1
        return w

    # 不能从0向量开始
    initial_w = np.random.random(X.shape[1])
    eta = 0.001
    # 不能使用StandardScaler标准化数据
    # print(gradient_ascent(df_debug, X_demean, initial_w, eta))
    w = gradient_ascent(df_math, X_demean, initial_w, eta)
    plt.scatter(X_demean[:, 0], X_demean[:, 1])
    plt.plot([0, w[0] * 100], [0, w[1] * 100], color='r')
    plt.show()

也就是说所有的数据点落到这个红线上的映射,它的方差最大

现在我们去掉生成数据的噪音,来看看会有什么结果

X2 = np.empty((100, 2))
X2[:, 0] = np.random.uniform(0, 100, size=100)
X2[:, 1] = 0.75 * X2[:, 0] + 3
plt.scatter(X2[:, 0], X2[:, 1])
plt.show()

最后我们把这组去除噪音的数据的w直线绘制出来

import numpy as np
import matplotlib.pyplot as plt

if __name__ == "__main__":

    X = np.empty((100, 2))
    X[:, 0] = np.random.uniform(0, 100, size=100)
    X[:, 1] = 0.75 * X[:, 0] + 3 + np.random.normal(0, 10, size=100)
    # plt.scatter(X[:, 0], X[:, 1])
    # plt.show()

    def demean(X):
        return X - np.mean(X, axis=0)
    X_demean = demean(X)
    # plt.scatter(X_demean[:, 0], X_demean[:, 1])
    # plt.show()

    def f(w, X):
        # 目标函数
        return np.sum((X.dot(w)**2)) / len(X)

    def df_math(w, X):
        # 梯度
        return X.T.dot(X.dot(w)) * 2 / len(X)

    def df_debug(w, X, epsilon=0.0001):
        res = np.empty(len(w))
        for i in range(len(w)):
            w_1 = w.copy()
            w_1[i] += epsilon
            w_2 = w.copy()
            w_2[i] -= epsilon
            res[i] = (f(w_1, X) - f(w_2, X)) / (2 * epsilon)
        return res

    def direction(w):
        # 把w变成单位向量,只表示方向
        return w / np.linalg.norm(w)

    def gradient_ascent(df, X, initial_w, eta, n_iters = 1e4, epsilon=1e-8):
        """
        梯度上升法
        :param df: 梯度函数
        :param X: 数据矩阵
        :param initial_w: 初始的常数向量,这里需要注意的是真正待求的是常数向量,求偏导的也是常数向量
        :param eta: 步长,学习率
        :param n_iters: 最大迭代次数
        :param epsilon: 误差值
        :return:
        """
        # 将初始的常数向量变成单位向量
        w = direction(initial_w)
        # 真实迭代次数
        cur_iter = 0
        while cur_iter < n_iters:
            # 获取梯度
            gradient = df(w, X)
            last_w = w
            # 迭代更新w,不断顺着梯度方向寻找新的w
            # 跟梯度下降法不同的是,梯度下降法是-,梯度上升法这里是+
            w = w + eta * gradient
            # 将获取的新的w重新变成单位向量
            w = direction(w)
            # 计算前后两次迭代后的目标函数差值的绝对值
            if abs(f(w, X) - f(last_w, X)) < epsilon:
                break
            # 更新迭代次数
            cur_iter += 1
        return w

    # 不能从0向量开始
    initial_w = np.random.random(X.shape[1])
    eta = 0.001
    # 不能使用StandardScaler标准化数据
    # print(gradient_ascent(df_debug, X_demean, initial_w, eta))
    w = gradient_ascent(df_math, X_demean, initial_w, eta)
    # plt.scatter(X_demean[:, 0], X_demean[:, 1])
    # plt.plot([0, w[0] * 100], [0, w[1] * 100], color='r')
    # plt.show()
    X2 = np.empty((100, 2))
    X2[:, 0] = np.random.uniform(0, 100, size=100)
    X2[:, 1] = 0.75 * X2[:, 0] + 3
    # plt.scatter(X2[:, 0], X2[:, 1])
    # plt.show()
    X2_demean = demean(X2)
    w2 = gradient_ascent(df_math, X2_demean, initial_w, eta)
    plt.scatter(X2_demean[:, 0], X2_demean[:, 1])
    plt.plot([0, w2[0] * 100], [0, w2[1] * 100], color='r')
    plt.show()

求数据的前n个主成分

虽然我们现在将这些数据点映射到一条直线上了,并且保持了方差最大。但是在该坐标系下,这条直线上的点依然是一个二维数据的点,那如何才能将其变为一维数据呢?这里是一个二维数据,如果是n维数据,我们又如何不断对其降维呢?

Principal component analysis is really the process of moving from one coordinate system to another. So far we have only found the direction of the first axis of the new system, i.e. the first principal component. Once the first principal component is known, how do we find the next one? We change the data: remove from each sample its component along the first principal component, and the first principal component of what remains is the second principal component.

In other words, what we now need is the green residual vector, which can clearly be obtained as

$$X'^{(i)} = X^{(i)} - \big(X^{(i)} \cdot w\big)\,w$$

where $\big(X^{(i)} \cdot w\big)\,w$ is the component of $X^{(i)}$ along the first principal component. Here w is a unit vector, which the code also enforces:

def direction(w):
    # 把w变成单位向量,只表示方向
    return w / np.linalg.norm(w)
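
A tiny numeric check of the formula above (a sketch with made-up numbers): after removing the component along w, every sample is perpendicular to w, so the dot products are essentially zero.

import numpy as np

np.random.seed(0)
X = np.random.random((5, 2))
w = np.array([0.8, 0.6])                   # already a unit vector
X2 = X - X.dot(w).reshape(-1, 1) * w       # remove the component along w
print(X2.dot(w))                           # all entries are (numerically) 0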

我们将上一小节的代码进行一点点改变

import numpy as np
import matplotlib.pyplot as plt

if __name__ == "__main__":

    X = np.empty((100, 2))
    X[:, 0] = np.random.uniform(0, 100, size=100)
    X[:, 1] = 0.75 * X[:, 0] + 3 + np.random.normal(0, 10, size=100)

    def demean(X):
        return X - np.mean(X, axis=0)
    X = demean(X)

    def f(w, X):
        # 目标函数
        return np.sum((X.dot(w)**2)) / len(X)

    def df(w, X):
        # 梯度
        return X.T.dot(X.dot(w)) * 2 / len(X)

    def direction(w):
        # 把w变成单位向量,只表示方向
        return w / np.linalg.norm(w)

    def first_component(X, initial_w, eta, n_iters = 1e4, epsilon=1e-8):
        """
        梯度上升法,求出第一主成分
        :param X: 数据矩阵
        :param initial_w: 初始的常数向量,这里需要注意的是真正待求的是常数向量,求偏导的也是常数向量
        :param eta: 步长,学习率
        :param n_iters: 最大迭代次数
        :param epsilon: 误差值
        :return:
        """
        # 将初始的常数向量变成单位向量
        w = direction(initial_w)
        # 真实迭代次数
        cur_iter = 0
        while cur_iter < n_iters:
            # 获取梯度
            gradient = df(w, X)
            last_w = w
            # 迭代更新w,不断顺着梯度方向寻找新的w
            # 跟梯度下降法不同的是,梯度下降法是-,梯度上升法这里是+
            w = w + eta * gradient
            # 将获取的新的w重新变成单位向量
            w = direction(w)
            # 计算前后两次迭代后的目标函数差值的绝对值
            if abs(f(w, X) - f(last_w, X)) < epsilon:
                break
            # 更新迭代次数
            cur_iter += 1
        return w

    # 不能从0向量开始
    initial_w = np.random.random(X.shape[1])
    eta = 0.01
    # 不能使用StandardScaler标准化数据
    # print(gradient_ascent(df_debug, X_demean, initial_w, eta))
    w = first_component(X, initial_w, eta)

求第二主成分

import numpy as np
import matplotlib.pyplot as plt

if __name__ == "__main__":

    X = np.empty((100, 2))
    X[:, 0] = np.random.uniform(0, 100, size=100)
    X[:, 1] = 0.75 * X[:, 0] + 3 + np.random.normal(0, 10, size=100)

    def demean(X):
        return X - np.mean(X, axis=0)
    X = demean(X)

    def f(w, X):
        # 目标函数
        return np.sum((X.dot(w)**2)) / len(X)

    def df(w, X):
        # 梯度
        return X.T.dot(X.dot(w)) * 2 / len(X)

    def direction(w):
        # 把w变成单位向量,只表示方向
        return w / np.linalg.norm(w)

    def first_component(X, initial_w, eta, n_iters = 1e4, epsilon=1e-8):
        """
        梯度上升法,求出第一主成分
        :param X: 数据矩阵
        :param initial_w: 初始的常数向量,这里需要注意的是真正待求的是常数向量,求偏导的也是常数向量
        :param eta: 步长,学习率
        :param n_iters: 最大迭代次数
        :param epsilon: 误差值
        :return:
        """
        # 将初始的常数向量变成单位向量
        w = direction(initial_w)
        # 真实迭代次数
        cur_iter = 0
        while cur_iter < n_iters:
            # 获取梯度
            gradient = df(w, X)
            last_w = w
            # 迭代更新w,不断顺着梯度方向寻找新的w
            # 跟梯度下降法不同的是,梯度下降法是-,梯度上升法这里是+
            w = w + eta * gradient
            # 将获取的新的w重新变成单位向量
            w = direction(w)
            # 计算前后两次迭代后的目标函数差值的绝对值
            if abs(f(w, X) - f(last_w, X)) < epsilon:
                break
            # 更新迭代次数
            cur_iter += 1
        return w

    # 不能从0向量开始
    initial_w = np.random.random(X.shape[1])
    eta = 0.01
    # 不能使用StandardScaler标准化数据
    # print(gradient_ascent(df_debug, X_demean, initial_w, eta))
    w = first_component(X, initial_w, eta)
    X2 = np.empty(X.shape)
    # 求第二主成分
    X2 = X - X.dot(w).reshape(-1, 1) * w
    plt.scatter(X2[:, 0], X2[:, 1])
    plt.show()

This result is not surprising at all: it is simply the direction perpendicular to the earlier red line. We know that two perpendicular vectors have a dot product of 0; let us verify that:

import numpy as np
import matplotlib.pyplot as plt

if __name__ == "__main__":

    X = np.empty((100, 2))
    X[:, 0] = np.random.uniform(0, 100, size=100)
    X[:, 1] = 0.75 * X[:, 0] + 3 + np.random.normal(0, 10, size=100)

    def demean(X):
        return X - np.mean(X, axis=0)
    X = demean(X)

    def f(w, X):
        # 目标函数
        return np.sum((X.dot(w)**2)) / len(X)

    def df(w, X):
        # 梯度
        return X.T.dot(X.dot(w)) * 2 / len(X)

    def direction(w):
        # 把w变成单位向量,只表示方向
        return w / np.linalg.norm(w)

    def first_component(X, initial_w, eta, n_iters = 1e4, epsilon=1e-8):
        """
        梯度上升法,求出第一主成分
        :param X: 数据矩阵
        :param initial_w: 初始的常数向量,这里需要注意的是真正待求的是常数向量,求偏导的也是常数向量
        :param eta: 步长,学习率
        :param n_iters: 最大迭代次数
        :param epsilon: 误差值
        :return:
        """
        # 将初始的常数向量变成单位向量
        w = direction(initial_w)
        # 真实迭代次数
        cur_iter = 0
        while cur_iter < n_iters:
            # 获取梯度
            gradient = df(w, X)
            last_w = w
            # 迭代更新w,不断顺着梯度方向寻找新的w
            # 跟梯度下降法不同的是,梯度下降法是-,梯度上升法这里是+
            w = w + eta * gradient
            # 将获取的新的w重新变成单位向量
            w = direction(w)
            # 计算前后两次迭代后的目标函数差值的绝对值
            if abs(f(w, X) - f(last_w, X)) < epsilon:
                break
            # 更新迭代次数
            cur_iter += 1
        return w

    # 不能从0向量开始
    initial_w = np.random.random(X.shape[1])
    eta = 0.01
    # 不能使用StandardScaler标准化数据
    # print(gradient_ascent(df_debug, X_demean, initial_w, eta))
    w = first_component(X, initial_w, eta)
    X2 = np.empty(X.shape)
    # 求第二主成分
    X2 = X - X.dot(w).reshape(-1, 1) * w
    # plt.scatter(X2[:, 0], X2[:, 1])
    # plt.show()
    w2 = first_component(X2, initial_w, eta)
    print(w.dot(w2))

运行结果

2.6676799232960846e-06

由结果可知,这是一个非常接近0的数。

现在我们来求前n个主成分

import numpy as np
import matplotlib.pyplot as plt

if __name__ == "__main__":

    X = np.empty((100, 2))
    X[:, 0] = np.random.uniform(0, 100, size=100)
    X[:, 1] = 0.75 * X[:, 0] + 3 + np.random.normal(0, 10, size=100)

    def demean(X):
        # Center the data by subtracting the mean of each feature
        return X - np.mean(X, axis=0)
    X = demean(X)

    def f(w, X):
        # Objective function: the variance of X projected onto the direction w
        return np.sum((X.dot(w)**2)) / len(X)

    def df(w, X):
        # Gradient of the objective function with respect to w
        return X.T.dot(X.dot(w)) * 2 / len(X)

    def direction(w):
        # Normalize w to a unit vector so that it only represents a direction
        return w / np.linalg.norm(w)

    def first_component(X, initial_w, eta, n_iters=1e4, epsilon=1e-8):
        """
        Gradient ascent to find the first principal component
        :param X: data matrix
        :param initial_w: initial direction vector w; note that w itself is the variable
                          being optimized and the gradient is taken with respect to w
        :param eta: step size (learning rate)
        :param n_iters: maximum number of iterations
        :param epsilon: convergence tolerance
        :return: the unit vector of the first principal component
        """
        # Turn the initial vector into a unit vector
        w = direction(initial_w)
        # Actual iteration counter
        cur_iter = 0
        while cur_iter < n_iters:
            # Compute the gradient
            gradient = df(w, X)
            last_w = w
            # Move w along the gradient direction to find a better w
            # Unlike gradient descent, which subtracts the gradient, gradient ascent adds it
            w = w + eta * gradient
            # Re-normalize the new w to a unit vector
            w = direction(w)
            # Stop when the objective changes by less than epsilon between two iterations
            if abs(f(w, X) - f(last_w, X)) < epsilon:
                break
            # Update the iteration counter
            cur_iter += 1
        return w

    # The initial vector must not be the zero vector
    initial_w = np.random.random(X.shape[1])
    eta = 0.01
    # Note: do not standardize the data with StandardScaler
    w = first_component(X, initial_w, eta)
    # Remove the component of X along w; the residual X2 no longer contains the first component
    X2 = X - X.dot(w).reshape(-1, 1) * w
    # plt.scatter(X2[:, 0], X2[:, 1])
    # plt.show()
    w2 = first_component(X2, initial_w, eta)
    # print(w.dot(w2))

    def first_n_components(n, X, eta=0.01, n_iters=1e4, epsilon=1e-8):
        # Compute the first n principal components of X
        X_pca = X.copy()
        X_pca = demean(X_pca)
        res = []
        for i in range(n):
            # Start each search from a random non-zero vector
            initial_w = np.random.random(X_pca.shape[1])
            w = first_component(X_pca, initial_w, eta)
            res.append(w)
            # Remove the component along w before searching for the next principal component
            X_pca = X_pca - X_pca.dot(w).reshape(-1, 1) * w
        return res

    print(first_n_components(2, X))

Running results

[array([0.79120886, 0.61154603]), array([-0.61154382,  0.79121057])]

Although this is only two-dimensional simulated data, the same procedure works for real high-dimensional data, which we will use later.

Mapping high-dimensional data to low-dimensional data

Here X is our original data matrix, where m is the number of samples and n is the dimension of each sample; W_k is the matrix of the first k principal components we have computed. Each principal component corresponds to one unit direction, each row vector gives the coordinates of that direction in space, and there are k such rows, so W_k is a k×n matrix. The essence of principal component analysis is a change of coordinate system: if the original coordinate system has n dimensions, the transformed one also has n dimensions, but in the transformed system we keep only the first k axes, the k directions we consider most important. What we want to do now is convert X from n dimensions to k dimensions. We already know that the length (magnitude) of the projection of one vector onto another unit vector is their dot product, so dotting each sample with each row of W_k gives that sample's coordinate along each principal component.

The reduced data matrix we finally obtain is therefore

X_k = X · W_k^T

where k is the dimension of the new data and the number of samples is still m, so X_k is an m×k matrix. Of course, if we want to recover high-dimensional data from the low-dimensional data, we can compute

X_restore = X_k · W_k

But note that X_restore is not the original X: information is lost during dimensionality reduction, and even if we perform the recovery operation, the lost information cannot be restored. Mathematically, however, this reverse operation is perfectly valid.
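
To make the two mappings concrete, here is a minimal NumPy sketch; the shapes m=100, n=3, k=2 and the way W_k is built (a QR factorization of a random matrix, used only to obtain orthonormal rows) are made up purely for illustration:

import numpy as np

m, n, k = 100, 3, 2
X = np.random.random((m, n))                          # original data, shape (m, n)
W_k = np.linalg.qr(np.random.random((n, k)))[0].T     # k orthonormal unit row vectors, shape (k, n)

X_k = X.dot(W_k.T)                                    # reduced data, shape (m, k)
X_restore = X_k.dot(W_k)                              # mapped back to n dimensions, shape (m, n)
print(X_k.shape, X_restore.shape)                     # (100, 2) (100, 3)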

Now let us wrap the previous code into a class and add both the mapping from high-dimensional data to low-dimensional data and the reverse mapping from low-dimensional data back to the original space.

import numpy as np

class PCA:

    def __init__(self, n_components):
        """
        Initialize PCA
        :param n_components: number of principal components
        """
        assert n_components >= 1, "n_components must be valid"
        self._n_components = n_components
        self.components_ = None

    def fit(self, X, eta=0.01, n_iters=1e4):
        # Compute the first n principal components of the data set X
        assert self._n_components <= X.shape[1], \
            "n_components must not be greater than the number of features of X"

        def demean(X):
            # Center the data by subtracting the mean of each feature
            return X - np.mean(X, axis=0)

        def f(w, X):
            # Objective function: the variance of X projected onto the direction w
            return np.sum((X.dot(w) ** 2)) / len(X)

        def df(w, X):
            # Gradient of the objective function with respect to w
            return X.T.dot(X.dot(w)) * 2 / len(X)

        def direction(w):
            # Normalize w to a unit vector so that it only represents a direction
            return w / np.linalg.norm(w)

        def first_component(X, initial_w, eta, n_iters=1e4, epsilon=1e-8):
            """
            Gradient ascent to find the first principal component
            :param X: data matrix
            :param initial_w: initial direction vector w; note that w itself is the variable
                              being optimized and the gradient is taken with respect to w
            :param eta: step size (learning rate)
            :param n_iters: maximum number of iterations
            :param epsilon: convergence tolerance
            :return: the unit vector of the first principal component
            """
            # Turn the initial vector into a unit vector
            w = direction(initial_w)
            # Actual iteration counter
            cur_iter = 0
            while cur_iter < n_iters:
                # Compute the gradient
                gradient = df(w, X)
                last_w = w
                # Move w along the gradient direction to find a better w
                # Unlike gradient descent, which subtracts the gradient, gradient ascent adds it
                w = w + eta * gradient
                # Re-normalize the new w to a unit vector
                w = direction(w)
                # Stop when the objective changes by less than epsilon between two iterations
                if abs(f(w, X) - f(last_w, X)) < epsilon:
                    break
                # Update the iteration counter
                cur_iter += 1
            return w

        X_pca = demean(X)
        self.components_ = np.empty(shape=(self._n_components, X.shape[1]))
        for i in range(self._n_components):
            # Start each search from a random non-zero vector
            initial_w = np.random.random(X_pca.shape[1])
            w = first_component(X_pca, initial_w, eta, n_iters)
            self.components_[i, :] = w
            # Remove the component along w before searching for the next principal component
            X_pca = X_pca - X_pca.dot(w).reshape(-1, 1) * w

        return self

    def transform(self, X):
        # Map the given X onto the principal components (X_k = X · W_k^T)
        assert X.shape[1] == self.components_.shape[1]
        return X.dot(self.components_.T)

    def inverse_transform(self, X):
        # Map the given X back to the original feature space (X_restore = X_k · W_k)
        assert X.shape[1] == self.components_.shape[0]
        return X.dot(self.components_)

    def __repr__(self):
        return "PCA(n_components=%d)" % self._n_components

import numpy as np
import matplotlib.pyplot as plt
from playLA.PCA import PCA

if __name__ == "__main__":

    X = np.empty((100, 2))
    X[:, 0] = np.random.uniform(0, 100, size=100)
    X[:, 1] = 0.75 * X[:, 0] + 3 + np.random.normal(0, 10, size=100)
    pca = PCA(n_components=2)
    pca.fit(X)
    # Inspect the directions of the two principal components
    print(pca.components_)
    pca = PCA(n_components=1)
    pca.fit(X)
    # Reduce the data to one dimension
    X_reduction = pca.transform(X)
    print(X_reduction.shape)
    # Map the one-dimensional data back to the original two-dimensional space
    X_restore = pca.inverse_transform(X_reduction)
    print(X_restore.shape)
    plt.scatter(X[:, 0], X[:, 1], color='b', alpha=0.5)
    plt.scatter(X_restore[:, 0], X_restore[:, 1], color='r', alpha=0.5)
    plt.show()

Running results

[[ 0.77885098  0.62720901]
 [-0.62720655  0.77885297]]
(100, 1)
(100, 2)

Here the blue points are the original two-dimensional data, and the red points are the data that was first reduced to one dimension and then restored to two dimensions. Although the data can be mapped back to two dimensions, it has lost information: all of the restored points lie on a single line along the first principal component.
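
Since some information is discarded, it can be useful to quantify how much of the total variance the single principal component keeps. The following is a minimal sketch, not part of the original code; retained_variance_ratio is a hypothetical helper name, and it assumes the X and the fitted pca = PCA(n_components=1) from the script above:

import numpy as np

def retained_variance_ratio(X, w):
    # Ratio of the variance along the unit direction w to the total variance of the demeaned data
    X_demean = X - np.mean(X, axis=0)
    var_along_w = np.sum(X_demean.dot(w) ** 2) / len(X_demean)   # variance captured by w
    total_var = np.sum(X_demean ** 2) / len(X_demean)            # sum of the variances of all features
    return var_along_w / total_var

# Example usage with the fitted one-component PCA from above:
# print(retained_variance_ratio(X, pca.components_[0]))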
