Kanonpy

# 每日一课 Kaggle 练习讲解¶

• （1） 数据可视化和数据分布变换
• （2） 缺省值处理
• （3） 数据特征变换
• （4） 数据建模及交叉检验
• （5） 模型组合
In [1]:
```import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import norm, skew
from scipy.special import boxcox1p
from scipy.stats import boxcox_normmax
from sklearn.model_selection import KFold, cross_val_score
from sklearn.preprocessing import LabelEncoder
%matplotlib inline```

## 加载数据¶

In [2]:
```train_path = "http://kaggle.shikanon.com/house-prices-advanced-regression-techniques/train.csv"
Out[2]:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 12 2008 WD Normal 250000

5 rows × 81 columns

In [3]:
`train_df.columns`
Out[3]:
```Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF',
'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC',
'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType',
'SaleCondition', 'SalePrice'],
dtype='object')```

• MSSubClass: 建筑的等级，类型：类别型
• MSZoning: 区域分类，类型：类别型
• LotFrontage: 距离街道的直线距离，类型：数值型，单位：英尺
• LotArea: 地皮面积，类型：数值型，单位：平方英尺
• Street: 街道类型，类型：类别型
• Alley: 巷子类型，类型：类别型
• LotShape: 房子整体形状，类型：类别型
• LandContour: 平整度级别，类型：类别型
• Utilities: 公共设施类型，类型：类别型
• LotConfig: 房屋配置，类型：类别型
• LandSlope: 倾斜度，类型：类别型
• Neighborhood: 市区物理位置，类型：类别型
• Condition1: 主干道或者铁路便利程度，类型：类别型
• Condition2: 主干道或者铁路便利程度，类型：类别型
• BldgType: 住宅类型，类型：类别型
• HouseStyle: 住宅风格，类型：类别型
• OverallQual: 整体材料和饰面质量，类型：数值型
• OverallCond: 总体状况评价，类型：数值型
• YearBuilt: 建筑年份，类型：数值型
• RoofStyle: 屋顶类型，类型：类别型
• RoofMatl: 屋顶材料，类型：类别型
• Exterior1st: 住宅外墙，类型：类别型
• Exterior2nd: 住宅外墙，类型：类别型
• MasVnrType: 砌体饰面类型，类型：类别型
• MasVnrArea: 砌体饰面面积，类型：数值型，单位：平方英尺
• ExterQual: 外部材料质量，类型：类别型
• ExterCond: 外部材料的现状，类型：类别型
• Foundation: 地基类型，类型：类别型
• BsmtQual: 地下室高度，类型：类别型
• BsmtCond: 地下室概况，类型：类别型
• BsmtExposure: 花园地下室墙，类型：类别型
• BsmtFinType1: 地下室装饰质量，类型：类别型
• BsmtFinSF1: 地下室装饰面积，类型：类别型
• BsmtFinType2: 地下室装饰质量，类型：类别型
• BsmtFinSF2: 地下室装饰面积，类型：类别型
• BsmtUnfSF: 未装饰的地下室面积，类型：数值型，单位：平方英尺
• TotalBsmtSF: 地下室总面积，类型：数值型，单位：平方英尺
• Heating: 供暖类型，类型：类别型
• HeatingQC: 供暖质量和条件，类型：类别型
• CentralAir: 中央空调状况，类型：类别型
• Electrical: 电力系统，类型：类别型
• 1stFlrSF: 首层面积，类型：数值型，单位：平方英尺
• 2ndFlrSF: 二层面积，类型：数值型，单位：平方英尺
• LowQualFinSF: 低质装饰面积，类型：数值型，单位：平方英尺
• GrLivArea: 地面以上居住面积，类型：数值型，单位：平方英尺
• BsmtFullBath: 地下室全浴室，类型：数值
• BsmtHalfBath: 地下室半浴室，类型：数值
• FullBath: 高档全浴室，类型：数值
• HalfBath: 高档半浴室，类型：数值
• BedroomAbvGr: 地下室以上的卧室数量，类型：数值
• KitchenAbvGr: 厨房数量，类型：数值
• KitchenQual: 厨房质量，类型：类别型
• TotRmsAbvGrd: 地上除卧室以外的房间数，类型：数值
• Functional: 房屋功用性评级，类型：类别型
• Fireplaces: 壁炉数量，类型：数值
• FireplaceQu: 壁炉质量，类型：类别型
• GarageType: 车库位置，类型：类别型
• GarageYrBlt: 车库建造年份，类别：数值型
• GarageFinish: 车库内饰，类型：类别型
• GarageCars: 车库车容量大小，类别：数值型
• GarageArea: 车库面积，类别：数值型，单位：平方英尺
• GarageQual: 车库质量，类型：类别型
• GarageCond: 车库条件，类型：类别型
• PavedDrive: 铺的车道情况，类型：类别型
• WoodDeckSF: 木地板面积，类型：数值型，单位：平方英尺
• OpenPorchSF: 开放式门廊区面积，类型：数值型，单位：平方英尺
• EnclosedPorch: 封闭式门廊区面积，类型：数值型，单位：平方英尺
• 3SsnPorch: 三个季节门廊面积，类型：数值型，单位：平方英尺
• ScreenPorch: 纱门门廊面积，类型：数值型，单位：平方英尺
• PoolArea: 泳池面积，类型：数值型，单位：平方英尺
• PoolQC:泳池质量，类型：类别型
• Fence: 围墙质量，类型：类别型
• MiscFeature: 其他特征，类型：类别型
• MiscVal: 其他杂项特征值，类型：类别型
• MoSold: 卖出月份，类别：数值型
• YrSold: 卖出年份，类别：数值型
• SaleType: 交易类型，类型：类别型
• SaleCondition: 交易条件，类型：类别型

## 数据处理和特征分析¶

In [4]:
```#Saving Ids
train_ID = train_df['Id']
test_ID = test_df['Id']

#Dropping Ids
train_df.drop("Id", axis = 1, inplace = True)
test_df.drop("Id", axis = 1, inplace = True)```

### 数据观察和可视化¶

In [5]:
```fig, ax = plt.subplots()
ax.scatter(x = train_df['GrLivArea'], y = train_df['SalePrice'], c = "skyblue")
plt.ylabel('SalePrice', fontsize=8)
plt.xlabel('GrLivArea', fontsize=8)
plt.show()```

Kaggle的作者在这里有建议去掉:)

In [6]:
`train_df.drop(train_df[(train_df['GrLivArea']>4000)&(train_df['GrLivArea']<30000)].index,inplace=True)`
In [7]:
```fig, ax = plt.subplots()
ax.scatter(x = train_df['GrLivArea'], y = train_df['SalePrice'], c = "skyblue")
plt.ylabel('SalePrice', fontsize=8)
plt.xlabel('GrLivArea', fontsize=8)
plt.show()```

• 最后一行是 `SalePrice`， 我们可以看到她跟各变量的关系，还有各变量相互之间的关系

### 观察数据分布¶

In [8]:
```# 统计表述
train_df['SalePrice'].describe()```
Out[8]:
```count      1456.000000
mean     180151.233516
std       76696.592530
min       34900.000000
25%      129900.000000
50%      163000.000000
75%      214000.000000
max      625000.000000
Name: SalePrice, dtype: float64```
In [9]:
```# 绘制分布图
sns.distplot(train_df['SalePrice'],
kde_kws={"color": "coral", "lw": 1, "label": "KDE"},
hist_kws={"histtype": "stepfilled", "linewidth": 3, "alpha": 1, "color": "skyblue"});```

Q-Q图，全称 Quantile Quantile Plot，中文名叫分位数图，Q-Q图是一个概率图，用于比较观测与预测值之间的概率分布差异，这里的比较对象一般采用正态分布，Q-Q图可以用于检验数据分布的相似性，而P-P图是根据变量的累积概率对应于所指定的理论分布累积概率绘制的散点图，两者基本一样

In [10]:
```# 绘制P-P图
fig = plt.figure()
res = stats.probplot(train_df['SalePrice'], dist="norm", plot=plt)
plt.show()```

In [11]:
```# 对数变换

train_df['SalePrice_Log'] = np.log(train_df['SalePrice'])

sns.distplot(train_df['SalePrice_Log'],
kde_kws={"color": "coral", "lw": 1, "label": "KDE"},
hist_kws={"histtype": "stepfilled", "linewidth": 3, "alpha": 1, "color": "skyblue"});

# 偏度与峰值(skewness and kurtosis)
print("Skewness: %f" % train_df['SalePrice_Log'].skew())
print("Kurtosis: %f" % train_df['SalePrice_Log'].kurt())

fig = plt.figure()
res = stats.probplot(train_df['SalePrice_Log'], plot=plt)
plt.show()```

```Skewness: 0.065449
Kurtosis: 0.666438```

In [12]:
```# 指数变换
train_df['SalePrice_Exp'] = np.exp(train_df['SalePrice']/train_df['SalePrice'].mean())

sns.distplot(train_df['SalePrice_Exp'],
kde_kws={"color": "coral", "lw": 1, "label": "KDE"},
hist_kws={"histtype": "stepfilled", "linewidth": 3, "alpha": 1, "color": "skyblue"});

# 偏度与峰值(skewness and kurtosis)
print("Skewness: %f" % train_df['SalePrice_Exp'].skew())
print("Kurtosis: %f" % train_df['SalePrice_Exp'].kurt())

fig = plt.figure()
res = stats.probplot(train_df['SalePrice_Exp'], plot=plt)
plt.show()```

```Skewness: 6.060076
Kurtosis: 56.822460```

In [13]:
```# 幂函数变换
train_df['SalePrice_Square'] = train_df['SalePrice']**0.5

sns.distplot(train_df['SalePrice_Square'],
kde_kws={"color": "coral", "lw": 1, "label": "KDE"},
hist_kws={"histtype": "stepfilled", "linewidth": 3, "alpha": 1, "color": "skyblue"});

# 偏度与峰值(skewness and kurtosis)
print("Skewness: %f" % train_df['SalePrice_Square'].skew())
print("Kurtosis: %f" % train_df['SalePrice_Square'].kurt())

fig = plt.figure()
res = stats.probplot(train_df['SalePrice_Square'], plot=plt)
plt.show()```

```Skewness: 0.810797
Kurtosis: 1.245798```

In [14]:
```# 对数变换

train_df['SalePrice_Log1p'] = np.log1p(train_df['SalePrice'])

sns.distplot(train_df['SalePrice_Log1p'],
kde_kws={"color": "coral", "lw": 1, "label": "KDE"},
hist_kws={"histtype": "stepfilled", "linewidth": 3, "alpha": 1, "color": "skyblue"});

# 偏度与峰值(skewness and kurtosis)
print("Skewness: %f" % train_df['SalePrice_Log1p'].skew())
print("Kurtosis: %f" % train_df['SalePrice_Log1p'].kurt())

fig = plt.figure()
res = stats.probplot(train_df['SalePrice_Log1p'], plot=plt)
plt.show()```

```Skewness: 0.065460
Kurtosis: 0.666423```

In [15]:
```del train_df['SalePrice_Square']
del train_df["SalePrice_Exp"]
del train_df['SalePrice_Log']```
In [16]:
`del train_df["SalePrice"]`

In [17]:
```size_train_df = train_df.shape[0]
size_test_df = test_df.shape[0]
target_variable = train_df['SalePrice_Log1p'].values
data = pd.concat((train_df, test_df),sort=False).reset_index(drop=True)
data.drop(['SalePrice_Log1p'], axis=1, inplace=True)```

### 缺失值处理¶

• （1）把缺失值单独作为一类，比如对类别型用none。
• （2）采用平均数、中值、众数等特定统计值来填充缺失值。
• （3）采用函数预测等方法填充缺失值。

NA for 'PoolQC' means "No Pool".

MiscFeature: NA means "None"

Alley: NA means "No alley access"

Fence: NA means "No fence"

FireplaceQu: NA means "No fireplace"

LotFrontage: fill missing values with median LotFrontage of neighborhood

GarageFinish: NA means "None"

GarageQual: NA means "None"

GarageCond: NA means "None"

GarageYrBlt: NA means 0

GarageType: NA means "None"

BsmtCond: NA means "None"

BsmtExposure: NA means "None"

BsmtQual: NA means "None"

BsmtFinType2: NA means "None"

BsmtFinType1: NA means "None"

MasVnrType: NA means "None"

MasVnrArea: NA means "0"

BsmtHalfBath: NA means "0"

BsmtFullBath: NA means "0"

BsmtFinSF1: NA means "0"

BsmtFinSF2: NA means "0"

BsmtUnfSF: NA means "0"

TotalBsmtSF: NA means "0"

GarageCars: NA means 0

GarageArea: NA means 0

In [18]:
`data.count().sort_values().head(20) # 通过 count 可以找出有缺失值的数据`
Out[18]:
```PoolQC             8
MiscFeature      105
Alley            198
Fence            570
FireplaceQu     1495
LotFrontage     2429
GarageCond      2756
GarageYrBlt     2756
GarageFinish    2756
GarageQual      2756
GarageType      2758
BsmtExposure    2833
BsmtCond        2833
BsmtQual        2834
BsmtFinType2    2835
BsmtFinType1    2836
MasVnrType      2891
MasVnrArea      2892
MSZoning        2911
BsmtFullBath    2913
dtype: int64```
In [19]:
```data_na = (data.isnull().sum() / len(data)) * 100
data_na.drop(data_na[data_na==0].index,inplace=True)
data_na = data_na.sort_values(ascending=False)
f, ax = plt.subplots(figsize=(10, 10))
plt.xticks(rotation='90')
sns.barplot(x=data_na.index, y=data_na)
plt.xlabel('Features', fontsize=10)
plt.ylabel('Percent of missing values', fontsize=10)
plt.title('Percent missing data by feature', fontsize=10)```
Out[19]:
`Text(0.5, 1.0, 'Percent missing data by feature')`

In [20]:
```var = 'Utilities'
train_var_count = train_df[var].value_counts()
fig = sns.barplot(x=train_var_count.index, y=train_var_count)
plt.xticks(rotation=90);```

In [21]:
```var = 'MSZoning'
train_var_count = train_df[var].value_counts()
fig = sns.barplot(x=train_var_count.index, y=train_var_count)
plt.xticks(rotation=90);```

In [22]:
```# 填充nil
features_fill_na_none = ['PoolQC','MiscFeature','Alley','Fence','FireplaceQu',
'GarageQual','GarageCond','GarageFinish','GarageType',
'BsmtExposure','BsmtCond','BsmtQual','BsmtFinType1','BsmtFinType2',
'MasVnrType']

# 填充0
features_fill_na_0 = ['GarageYrBlt', 'GarageArea', 'GarageCars', 'MasVnrArea',
'BsmtFullBath','BsmtHalfBath', 'BsmtFinSF1', 'BsmtFinSF2',
'BsmtUnfSF', 'TotalBsmtSF']

# 填众数
features_fill_na_mode = ["Functional", "MSZoning", "SaleType", "Electrical",
"KitchenQual", "Exterior2nd", "Exterior1st"]

for feature_none in features_fill_na_none:
data[feature_none].fillna('None',inplace=True)

for feature_0 in features_fill_na_0:
data[feature_0].fillna(0,inplace=True)

for feature_mode in features_fill_na_mode:
mode_value = data[feature_mode].value_counts().sort_values(ascending=False).index[0]
data[features_fill_na_mode] = data[features_fill_na_mode].fillna(mode_value)

# 用中值代替
data["LotFrontage"] = data.groupby("Neighborhood")["LotFrontage"].transform(
lambda x: x.fillna(x.median()))

# 像 Utilities 这种总共才两个值，同时有一个值是作为主要的，这种字段是无意义的，应该删除
data.drop(['Utilities'], axis=1,inplace=True)```
In [23]:
```data_na = (data.isnull().sum() / len(data)) * 100
data_na.drop(data_na[data_na==0].index,inplace=True)
data_na = data_na.sort_values(ascending=False)
data_na # data_na 为空```
Out[23]:
`Series([], dtype: float64)`

In [24]:
```#MSSubClass=The building class
data['MSSubClass'] = data['MSSubClass'].apply(str)

#Changing OverallCond into a categorical variable
data['OverallCond'] = data['OverallCond'].astype(str)

#Year and month sold are transformed into categorical features.
data['YrSold'] = data['YrSold'].astype(str)
data['MoSold'] = data['MoSold'].astype(str)```

### 绘制关系矩阵¶

In [25]:
```# 关系矩阵
corrmat = train_df.corr()
corrmat```
Out[25]:
MSSubClass LotFrontage LotArea OverallQual OverallCond YearBuilt YearRemodAdd MasVnrArea BsmtFinSF1 BsmtFinSF2 ... WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea MiscVal MoSold YrSold SalePrice_Log1p
MSSubClass 1.000000 -0.408655 -0.142192 0.032416 -0.059277 0.027689 0.040459 0.022800 -0.075268 -0.065598 ... -0.012853 -0.006687 -0.011966 -0.043802 -0.025979 0.007957 -0.007666 -0.013512 -0.021330 -0.075083
LotFrontage -0.408655 1.000000 0.387570 0.225994 -0.055673 0.113913 0.079785 0.162026 0.133704 0.056971 ... 0.077154 0.116131 0.017010 0.075575 0.047832 0.075160 0.005636 0.028042 0.010598 0.363286
LotArea -0.142192 0.387570 1.000000 0.088719 -0.002832 0.006590 0.006930 0.081170 0.173426 0.114691 ... 0.167040 0.061679 -0.016108 0.021505 0.045620 0.033875 0.039192 0.007188 -0.013014 0.258945
OverallQual 0.032416 0.225994 0.088719 1.000000 -0.090692 0.571712 0.550971 0.400028 0.213079 -0.057520 ... 0.232819 0.297803 -0.112407 0.031621 0.067732 0.018121 -0.031068 0.076414 -0.024321 0.819240
OverallCond -0.059277 -0.055673 -0.002832 -0.090692 1.000000 -0.375691 0.074703 -0.130053 -0.042542 0.040015 ... -0.003063 -0.029649 0.070103 0.025419 0.054617 0.008079 0.068729 -0.003135 0.043755 -0.036843
YearBuilt 0.027689 0.113913 0.006590 0.571712 -0.375691 1.000000 0.591906 0.314066 0.248272 -0.048393 ... 0.222690 0.183905 -0.386904 0.031717 -0.049703 -0.014373 -0.034193 0.013881 -0.012593 0.588977
YearRemodAdd 0.040459 0.079785 0.006930 0.550971 0.074703 0.591906 1.000000 0.176067 0.121690 -0.067188 ... 0.204020 0.222649 -0.193348 0.045596 -0.038176 -0.009490 -0.010100 0.022629 0.036597 0.568986
MasVnrArea 0.022800 0.162026 0.081170 0.400028 -0.130053 0.314066 0.176067 1.000000 0.235557 -0.071337 ... 0.149491 0.106073 -0.109453 0.020270 0.065296 -0.016025 -0.029691 0.004019 -0.004880 0.430073
BsmtFinSF1 -0.075268 0.133704 0.173426 0.213079 -0.042542 0.248272 0.121690 0.235557 1.000000 -0.048738 ... 0.201462 0.071851 -0.103053 0.029879 0.070026 0.016380 0.005149 -0.001773 0.018506 0.382710
BsmtFinSF2 -0.065598 0.056971 0.114691 -0.057520 0.040015 -0.048393 -0.067188 -0.071337 -0.048738 1.000000 ... 0.069028 0.005083 0.036269 -0.030090 0.088676 0.053178 0.004871 -0.015726 0.031384 0.006420
BsmtUnfSF -0.140890 0.142075 -0.003774 0.310164 -0.137267 0.148810 0.180972 0.111684 -0.526140 -0.209286 ... -0.006876 0.129148 -0.002336 0.020843 -0.012435 -0.031244 -0.023802 0.035456 -0.040834 0.223248
TotalBsmtSF -0.255441 0.313392 0.221940 0.532666 -0.176000 0.399867 0.294866 0.337869 0.460324 0.116478 ... 0.229984 0.215559 -0.095872 0.041762 0.094511 0.004418 -0.018253 0.030026 -0.012192 0.641553
1stFlrSF -0.265001 0.398400 0.267644 0.462042 -0.145613 0.279929 0.238304 0.316507 0.386453 0.106134 ... 0.230865 0.179049 -0.063074 0.060553 0.097093 0.032088 -0.020801 0.045081 -0.010014 0.613742
2ndFlrSF 0.311294 0.049902 0.037277 0.279745 0.031297 0.002953 0.136103 0.156796 -0.183358 -0.098241 ... 0.083670 0.198407 0.065690 -0.023740 0.043308 0.038375 0.017111 0.039163 -0.024874 0.306605
LowQualFinSF 0.046499 0.042589 0.005675 -0.029826 0.025406 -0.183720 -0.062215 -0.069533 -0.066611 0.014714 ... -0.025114 0.019339 0.060975 -0.004334 0.026713 0.073070 -0.003822 -0.022441 -0.029074 -0.037698
GrLivArea 0.077956 0.341427 0.231887 0.583519 -0.078567 0.192645 0.289264 0.363472 0.121479 -0.004995 ... 0.241827 0.307325 0.016148 0.023967 0.112408 0.064346 -0.000974 0.065328 -0.031898 0.718844
BsmtFullBath 0.003282 0.074686 0.147595 0.104092 -0.053107 0.185009 0.116765 0.080342 0.661933 0.160254 ... 0.174636 0.056251 -0.049034 0.000249 0.024075 0.037039 -0.022877 -0.023770 0.067665 0.238851
BsmtHalfBath -0.002509 -0.009424 0.047391 -0.047172 0.117207 -0.039945 -0.013297 0.012157 0.068869 0.071986 ... 0.034626 -0.024385 -0.007803 0.035565 0.032902 0.027886 -0.007211 0.038478 -0.045303 -0.014974
FullBath 0.132131 0.187048 0.117336 0.543791 -0.194167 0.466710 0.438212 0.266317 0.037159 -0.075291 ... 0.182131 0.252911 -0.113812 0.036304 -0.006557 0.021509 -0.013872 0.058197 -0.016574 0.590919
HalfBath 0.177476 0.037031 0.005981 0.267431 -0.059927 0.240144 0.181136 0.195264 -0.014508 -0.031243 ... 0.104538 0.194921 -0.094317 -0.004590 0.073497 0.001010 0.001589 -0.007127 -0.008853 0.311191
BedroomAbvGr -0.023627 0.269945 0.118960 0.096848 0.013249 -0.072623 -0.041919 0.099176 -0.121893 -0.015134 ... 0.044039 0.093803 0.042402 -0.024263 0.044941 0.064118 0.007965 0.048477 -0.034849 0.204117
KitchenAbvGr 0.281783 -0.002959 -0.016565 -0.184281 -0.087204 -0.174481 -0.149288 -0.036558 -0.082722 -0.040926 ... -0.089670 -0.069738 0.037113 -0.024670 -0.051779 -0.012306 0.062294 0.026340 0.031454 -0.147891
TotRmsAbvGrd 0.040247 0.330249 0.173629 0.415834 -0.055766 0.089207 0.187520 0.265334 0.001877 -0.033490 ... 0.159720 0.219969 0.006790 -0.005908 0.061924 0.041588 0.025640 0.041966 -0.032190 0.533446
Fireplaces -0.046377 0.238140 0.259701 0.387425 -0.022277 0.143162 0.108732 0.236906 0.236219 0.049027 ... 0.194972 0.160647 -0.022885 0.012042 0.187656 0.051221 0.001943 0.053947 -0.022567 0.487126
GarageYrBlt 0.084926 0.058055 -0.033111 0.547320 -0.323836 0.825192 0.641445 0.249917 0.146532 -0.087353 ... 0.222580 0.224239 -0.296517 0.023897 -0.074779 -0.035670 -0.032231 0.006725 -0.000030 0.544005
GarageCars -0.040490 0.289064 0.150977 0.598739 -0.185494 0.536749 0.419573 0.362122 0.224043 -0.037331 ... 0.223010 0.209762 -0.150590 0.036290 0.051622 0.003360 -0.042886 0.041608 -0.037179 0.680408
GarageArea -0.100145 0.318820 0.162183 0.554905 -0.150679 0.477311 0.369590 0.361810 0.268651 -0.016485 ... 0.219967 0.228089 -0.120615 0.036213 0.053732 0.011637 -0.027088 0.034602 -0.025870 0.655212
WoodDeckSF -0.012853 0.077154 0.167040 0.232819 -0.003063 0.222690 0.204020 0.149491 0.201462 0.069028 ... 1.000000 0.053498 -0.125151 -0.032472 -0.073489 0.068309 -0.009287 0.024595 0.023860 0.330573
OpenPorchSF -0.006687 0.116131 0.061679 0.297803 -0.029649 0.183905 0.222649 0.106073 0.071851 0.005083 ... 0.053498 1.000000 -0.092094 -0.005148 0.077261 0.030360 -0.018277 0.072515 -0.056326 0.327038
EnclosedPorch -0.011966 0.017010 -0.016108 -0.112407 0.070103 -0.386904 -0.193348 -0.109453 -0.103053 0.036269 ... -0.125151 -0.092094 1.000000 -0.037427 -0.083154 0.068796 0.018277 -0.029565 -0.010343 -0.148636
3SsnPorch -0.043802 0.075575 0.021505 0.031621 0.025419 0.031717 0.045596 0.020270 0.029879 -0.030090 ... -0.032472 -0.005148 -0.037427 1.000000 -0.031526 -0.006770 0.000326 0.029386 0.018516 0.056065
ScreenPorch -0.025979 0.047832 0.045620 0.067732 0.054617 -0.049703 -0.038176 0.065296 0.070026 0.088676 ... -0.073489 0.077261 -0.083154 -0.031526 1.000000 0.063724 0.031884 0.022863 0.010383 0.123860
PoolArea 0.007957 0.075160 0.033875 0.018121 0.008079 -0.014373 -0.009490 -0.016025 0.016380 0.053178 ... 0.068309 0.030360 0.068796 -0.006770 0.063724 1.000000 0.035480 -0.022901 -0.062640 0.040679
MiscVal -0.007666 0.005636 0.039192 -0.031068 0.068729 -0.034193 -0.010100 -0.029691 0.005149 0.004871 ... -0.009287 -0.018277 0.018277 0.000326 0.031884 0.035480 1.000000 -0.006657 0.004806 -0.019752
MoSold -0.013512 0.028042 0.007188 0.076414 -0.003135 0.013881 0.022629 0.004019 -0.001773 -0.015726 ... 0.024595 0.072515 -0.029565 0.029386 0.022863 -0.022901 -0.006657 1.000000 -0.146229 0.062231
YrSold -0.021330 0.010598 -0.013014 -0.024321 0.043755 -0.012593 0.036597 -0.004880 0.018506 0.031384 ... 0.023860 -0.056326 -0.010343 0.018516 0.010383 -0.062640 0.004806 -0.146229 1.000000 -0.034319
SalePrice_Log1p -0.075083 0.363286 0.258945 0.819240 -0.036843 0.588977 0.568986 0.430073 0.382710 0.006420 ... 0.330573 0.327038 -0.148636 0.056065 0.123860 0.040679 -0.019752 0.062231 -0.034319 1.000000

37 rows × 37 columns

In [26]:
```mask = np.zeros_like(corrmat) # 返回相同大小的0矩阵
Out[26]:
```array([[1., 1., 1., ..., 1., 1., 1.],
[0., 1., 1., ..., 1., 1., 1.],
[0., 0., 1., ..., 1., 1., 1.],
...,
[0., 0., 0., ..., 1., 1., 1.],
[0., 0., 0., ..., 0., 1., 1.],
[0., 0., 0., ..., 0., 0., 1.]])```
In [27]:
```# 绘制热力图
plt.subplots(figsize=(12,9))
Out[27]:
`<matplotlib.axes._subplots.AxesSubplot at 0x7f8c20d156d8>`

## 特征工程¶

In [28]:
```encode_cat_variables = ('Alley', 'BldgType', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'BsmtQual', 'CentralAir',
'Condition1', 'Condition2', 'Electrical', 'ExterCond', 'ExterQual', 'Exterior1st', 'Exterior2nd', 'Fence',
'FireplaceQu', 'Foundation', 'Functional', 'GarageCond', 'GarageFinish', 'GarageQual', 'GarageType',
'Heating', 'HeatingQC', 'HouseStyle', 'KitchenQual', 'LandContour', 'LandSlope', 'LotConfig', 'LotShape',
'MSSubClass', 'MSZoning', 'MasVnrType', 'MiscFeature', 'MoSold', 'Neighborhood', 'OverallCond', 'PavedDrive',
'PoolQC', 'RoofMatl', 'RoofStyle', 'SaleCondition', 'SaleType', 'Street', 'YrSold')

numerical_features = [col for col in data.columns if col not in encode_cat_variables]
# for variable in encode_cat_variables:
#     lbl = LabelEncoder()
#     lbl.fit(list(data[variable].values))
#     data[variable] = lbl.transform(list(data[variable].values))

for variable in data.columns:
if variable not in encode_cat_variables:
data[variable] = data[variable].apply(float)
else:
data[variable] = data[variable].apply(str)

print(data.shape)

data = pd.get_dummies(data)
print(data.shape)```

```(2915, 78)
(2915, 343)```
In [29]:
```# 可以计算一个总面积指标
data['TotalSF'] = data['TotalBsmtSF'] + data['1stFlrSF'] + data['2ndFlrSF']```
In [30]:
```print("Categorical Features: %d"%len(encode_cat_variables))
print("Numerical Features: %d"%len(numerical_features))```

```Categorical Features: 46
Numerical Features: 32```

In [31]:
```#Boxplot for numerical_features
sns.set_style("whitegrid")
f, ax = plt.subplots(figsize=(15, 15))
ax.set_xscale("log")
ax = sns.boxplot(data=data[numerical_features] , orient="h", palette="ch:2.5,-.2,dark=.3")
ax.set(ylabel="Features")
ax.set(xlabel="Value")
ax.set(title="Distribution")
sns.despine(trim=True, left=True)```

box-cox变换

Box-Cox变换是Box和Cox在1964年提出的一种广义幂变换方法，用于连续的响应变量不满足正态分布的情况。Box-Cox变换之后，可以一定程度上减小不可观测的误差和预测变量的相关性。Box-Cox变换的主要特点是引入一个参数，通过数据本身估计该参数进而确定应采取的数据变换形式

In [32]:
```# 计算数值型变量的偏态
skewed_features = data[numerical_features].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
skewed_features```
Out[32]:
```MiscVal          21.932147
PoolArea         18.701829
LotArea          13.123758
LowQualFinSF     12.080315
3SsnPorch        11.368094
KitchenAbvGr      4.298845
BsmtFinSF2        4.142863
EnclosedPorch     4.000796
ScreenPorch       3.943508
BsmtHalfBath      3.942892
MasVnrArea        2.600697
OpenPorchSF       2.529245
WoodDeckSF        1.848285
1stFlrSF          1.253011
LotFrontage       1.092709
GrLivArea         0.977860
BsmtFinSF1        0.974138
BsmtUnfSF         0.920135
2ndFlrSF          0.843237
TotRmsAbvGrd      0.749579
Fireplaces        0.725958
HalfBath          0.698770
TotalBsmtSF       0.662657
BsmtFullBath      0.622820
BedroomAbvGr      0.328129
GarageArea        0.217748
OverallQual       0.181902
FullBath          0.159917
GarageCars       -0.219402
YearBuilt        -0.598087
GarageYrBlt      -3.903046
dtype: float64```
In [33]:
```skewed_features = skewed_features[abs(skewed_features) > 0.75]
print("There are {} skewed numerical features to Box Cox transform".format(skewed_features.shape[0]))

from scipy.special import boxcox1p
skewed_features_name = skewed_features.index
lam = 0.15 # 超参数
for feat in skewed_features_name:
tranformer_feat = boxcox1p(data[feat], lam)
data[feat] = tranformer_feat

data[numerical_features].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)```

`There are 20 skewed numerical features to Box Cox transform`
Out[33]:
```PoolArea         16.487849
3SsnPorch         8.918476
LowQualFinSF      8.737917
MiscVal           5.592866
BsmtHalfBath      3.798589
KitchenAbvGr      3.695780
ScreenPorch       2.975708
BsmtFinSF2        2.561989
EnclosedPorch     2.023182
TotRmsAbvGrd      0.749579
Fireplaces        0.725958
HalfBath          0.698770
TotalBsmtSF       0.662657
MasVnrArea        0.636713
BsmtFullBath      0.622820
2ndFlrSF          0.329999
BedroomAbvGr      0.328129
WoodDeckSF        0.225571
GarageArea        0.217748
OverallQual       0.181902
LotArea           0.178523
1stFlrSF          0.173668
FullBath          0.159917
GrLivArea         0.102287
OpenPorchSF       0.100679
GarageCars       -0.219402
BsmtFinSF1       -0.488447
YearBuilt        -0.598087
LotFrontage      -0.808022
BsmtUnfSF        -1.536727
GarageYrBlt      -3.922152
dtype: float64```
In [34]:
```#Boxplot for numerical_features
sns.set_style("whitegrid")
f, ax = plt.subplots(figsize=(15, 15))
ax.set_xscale("log")
ax = sns.boxplot(data=data[numerical_features] , orient="h", palette="ch:2.5,-.2,dark=.3")
ax.set(ylabel="Features")
ax.set(xlabel="Value")
ax.set(title="Distribution")
sns.despine(trim=True, left=True)```

In [35]:
```train = data[:size_train_df]
test = data[size_train_df:]```

## 建模¶

In [36]:
```from sklearn.linear_model import ElasticNet, Lasso,  BayesianRidge, LassoLarsIC
from sklearn.kernel_ridge import KernelRidge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error
import xgboost as xgb
import lightgbm as lgb```

In [40]:
```#Validation function
n_folds = 5

def rmsle_cv(model):
kf = KFold(n_folds, shuffle=True, random_state=42).get_n_splits(train.values)
rmse= np.sqrt(-cross_val_score(model, train.values, target_variable, scoring="neg_mean_squared_error", cv = kf))
return(rmse)```

LASSO回归(LASSO Regression)

In [41]:
```lasso = make_pipeline(RobustScaler(), Lasso(alpha =0.0005, random_state=1))
score = rmsle_cv(lasso)
print("\nLasso score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))```

`Lasso score: 0.1101 (0.0058)`

In [44]:
```KRR = make_pipeline(RobustScaler(), KernelRidge(alpha=0.6, kernel='polynomial', degree=2, coef0=2.5))
score = rmsle_cv(KRR)
print("\nLasso score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))```

`Lasso score: 0.1152 (0.0043)`

ElasticNet 是一种使用L1和L2先验作为正则化矩阵的线性回归模型,弹性网络是结合了岭回归和Lasso回归，由两者加权平均所得。

https://www.jianshu.com/p/65a573b9fe32

In [75]:
```ENet = make_pipeline(RobustScaler(), ElasticNet(alpha=0.0005, l1_ratio=.9, random_state=3))
score = rmsle_cv(ENet)
print("\nLasso score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))```

`Lasso score: 0.1100 (0.0059)`

In [74]:
```GBoost = GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05,
max_depth=4, max_features='sqrt',
min_samples_leaf=15, min_samples_split=10,
loss='huber', random_state=5)

score = rmsle_cv(GBoost)
print("\nLasso score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))```

`Lasso score: 0.1164 (0.0082)`

XGBoost

In [47]:
```model_xgb = xgb.XGBRegressor(colsample_bytree=0.4603, gamma=0.0468,
learning_rate=0.05, max_depth=3,
min_child_weight=1.7817, n_estimators=2200,
reg_alpha=0.4640, reg_lambda=0.8571,
subsample=0.5213, silent=1,

score = rmsle_cv(model_xgb)
print("\nLasso score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))```

`Lasso score: 0.1172 (0.0053)`

LightGBM

In [48]:
```model_lgb = lgb.LGBMRegressor(objective='regression',num_leaves=5,
learning_rate=0.05, n_estimators=720,
max_bin = 55, bagging_fraction = 0.8,
bagging_freq = 5, feature_fraction = 0.2319,
feature_fraction_seed=9, bagging_seed=9,
min_data_in_leaf =6, min_sum_hessian_in_leaf = 11)
score = rmsle_cv(model_lgb)
print("\nLasso score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))```

`Lasso score: 0.1162 (0.0060)`

In [68]:
```alphas = [0.00005, 0.0001, 0.0005, 0.001, 0.005, 0.01]
cv_ridge_score = [rmsle_cv(make_pipeline(RobustScaler(), Lasso(alpha=alpha, random_state=1))).mean()
for alpha in alphas]```
In [69]:
```cv_ridge = pd.Series(cv_ridge_score, index = alphas)
cv_ridge.plot(title = "Validation - Just Do It")
plt.xlabel("alpha")
plt.ylabel("rmse")```
Out[69]:
`Text(0, 0.5, 'rmse')`

In [78]:
```train_size = int(len(train)*0.7)
X_train = train.values[:train_size]
X_test = train.values[train_size:]
y_train = target_variable[:train_size]
y_test = target_variable[train_size:]

GBoost.fit(X_train, y_train)
ENet.fit(X_train, y_train)```
Out[78]:
```Pipeline(memory=None,
steps=[('robustscaler', RobustScaler(copy=True, quantile_range=(25.0, 75.0), with_centering=True,
with_scaling=True)), ('elasticnet', ElasticNet(alpha=0.0005, copy_X=True, fit_intercept=True, l1_ratio=0.9,
max_iter=1000, normalize=False, positive=False, precompute=False,
random_state=3, selection='cyclic', tol=0.0001, warm_start=False))])```

In [81]:
```preds = pd.DataFrame({"preds":GBoost.predict(X_test), "true":y_test})
preds["residuals"] = preds["true"] - preds["preds"]
preds.plot(x = "preds", y = "residuals",kind = "scatter")```
Out[81]:
`<matplotlib.axes._subplots.AxesSubplot at 0x7f8c1a72aef0>`

In [82]:
```preds = pd.DataFrame({"preds":ENet.predict(X_test), "true":y_test})
preds["residuals"] = preds["true"] - preds["preds"]
preds.plot(x = "preds", y = "residuals",kind = "scatter")```
Out[82]:
`<matplotlib.axes._subplots.AxesSubplot at 0x7f8c1a6ed128>`

In [79]:
```xgb_preds = np.expm1(GBoost.predict(X_test))
lasso_preds = np.expm1(ENet.predict(X_test))

predictions = pd.DataFrame({"xgb":xgb_preds, "lasso":lasso_preds})
predictions.plot(x = "xgb", y = "lasso", kind = "scatter")```
Out[79]:
`<matplotlib.axes._subplots.AxesSubplot at 0x7f8c00ba9d30>`

## 集成学习(模型融合)¶

Stacking models 是指通过将多个模型进行集成以取得更小预测方差的方法, 通过前面的两个模型残差图的分布可以发现具有一定的互补性，所以这才是融合的目的，将他的残差分布变得更均匀。

In [83]:
```class AveragingModels(BaseEstimator, RegressorMixin, TransformerMixin):
def __init__(self, models):
self.models = models

# we define clones of the original models to fit the data in
def fit(self, X, y):
self.models_ = [clone(x) for x in self.models]

# Train cloned base models
for model in self.models_:
model.fit(X, y)

return self

#Now we do the predictions for cloned models and average them
def predict(self, X):
predictions = np.column_stack([
model.predict(X) for model in self.models_
])
return np.mean(predictions, axis=1)```
In [85]:
```averaged_models = AveragingModels(models = (ENet, GBoost, KRR, lasso))

score = rmsle_cv(averaged_models)
print(" Averaged base models score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))```

`Averaged base models score: 0.1081 (0.0056)`

In [87]:
```class StackingAveragedModels(BaseEstimator, RegressorMixin, TransformerMixin):
def __init__(self, base_models, meta_model, n_folds=5):
self.base_models = base_models
self.meta_model = meta_model
self.n_folds = n_folds

def fit(self, X, y):
self.base_models_ = [list() for x in self.base_models]
self.meta_model_ = clone(self.meta_model) # 复制基准模型，因为这里会有多个模型
kfold = KFold(n_splits=self.n_folds, shuffle=True, random_state=156)

# 训练基准模型，基于基准模型训练的结果导出成特征
# that are needed to train the cloned meta-model
out_of_fold_predictions = np.zeros((X.shape[0], len(self.base_models)))
for i, model in enumerate(self.base_models):
for train_index, holdout_index in kfold.split(X, y): #分为预测和训练
instance = clone(model)
self.base_models_[i].append(instance)
instance.fit(X[train_index], y[train_index])
y_pred = instance.predict(X[holdout_index])
out_of_fold_predictions[holdout_index, i] = y_pred

# 将基准模型预测数据作为特征用来给meta_model训练
self.meta_model_.fit(out_of_fold_predictions, y)
return self

def predict(self, X):
meta_features = np.column_stack([
np.column_stack([model.predict(X) for model in base_models]).mean(axis=1)
for base_models in self.base_models_ ])
return self.meta_model_.predict(meta_features)```
In [95]:
```from sklearn.linear_model import LinearRegression
meta_model = LinearRegression()
stacked_averaged_models = StackingAveragedModels(base_models = (ENet, GBoost, KRR, lasso),
meta_model = meta_model,
n_folds=10)

score = rmsle_cv(stacked_averaged_models)
print("Stacking Averaged models score: {:.4f} ({:.4f})".format(score.mean(), score.std()))```

`Stacking Averaged models score: 0.1086 (0.0060)`

### Kanonpy

y0w1as5eg37urfds
2017/12/28
0
0

04/29
0
0
ML4T笔记 | 03-01 How Machine Learning is used at a hedge fund

01 - Overview The focus is on creating a model that can be used to predict future prices for stocks or other assets. machine learning provides a suite of tools that support a da......

01/23
0
0
AOPR每日一贴使用

neluzyy1
2016/01/25
33
0
Scikit-Learn大变化：合并Pandas

AI科技大本营
2018/09/15
0
0

FAT_mt

5
0

7
0
tabel 中含有复选框的列 数据理解

1、el-ui中实现某一列为复选框 实现多选非常简单: 手动添加一个el-table-column，设type属性为selction即可； 2、@selection-change事件：选项发生勾选状态变化时触发该事件 <el-table @sel...

everthing

6
0
【技术分享】TestFlight测试的流程文档

qtb999

10
0

2019年8月6日，Spring 官方在其博客宣布，Spring Boot 1.x 停止维护，Spring Boot 1.x 生命周期正式结束。 其实早在2018年7月30号，Spring 官方就已经在博客进行过预告，Spring Boot 1.X 将维...

Java技术剑

18
0