# Training Your Machine Learning Models with cuML in Python

2020/11/13 21:07

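### Motivation

Sklearn is a great library with a wide range of machine learning models you can train on your data. But if your dataset is large, training can take a long time, especially when you try many different hyperparameters to find the best model.

cuML is a suite of fast, GPU-accelerated machine learning algorithms designed for data science and analytical tasks. Its API is similar to Sklearn's, which means the code you use to train a Sklearn model can also train a cuML model.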

from cuml.neighbors import KNeighborsClassifier

# X, y: training features and labels (created in the next section)
clf = KNeighborsClassifier(n_neighbors=10)
clf.fit(X, y)
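Because the two APIs are so similar, the same training code can run on either library. The following is a minimal sketch of that idea, not something from the original article: the helper name `load_knn_class` is hypothetical, and it only assumes that both `cuml.neighbors` and `sklearn.neighbors` expose a class named `KNeighborsClassifier`.

```python
import importlib
import importlib.util


def load_knn_class(preferred=("cuml.neighbors", "sklearn.neighbors")):
    """Return (module_name, KNeighborsClassifier) from the first usable module.

    Returns (None, None) when none of the candidate libraries can be imported.
    """
    for module_name in preferred:
        root_package = module_name.split(".")[0]
        # Skip libraries that are not installed at all.
        if importlib.util.find_spec(root_package) is None:
            continue
        try:
            module = importlib.import_module(module_name)
        except Exception:
            # Installed but unusable, for example cuML without a working GPU.
            continue
        return module_name, module.KNeighborsClassifier
    return None, None


backend, KNeighborsClassifier = load_knn_class()
print("Using backend:", backend)

# With either backend the training code is identical:
# clf = KNeighborsClassifier(n_neighbors=10)
# clf.fit(X, y)
```

The GPU library is tried first and the CPU library is the fallback, so the script still runs on a machine without a GPU.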


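### Creating the Data

import numpy as np
from sklearn import datasets

# Generate a synthetic classification dataset with 40,000 samples
X, y = datasets.make_classification(n_samples=40000)

# Convert to float32, since some cuML algorithms require float32 input
X = X.astype(np.float32)
y = y.astype(np.float32)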


### Support Vector Machine

def train_data(model, X=X, y=y):
    # Fit the given estimator on the full dataset
    clf = model
    clf.fit(X, y)


from sklearn.svm import SVC
from cuml.svm import SVC as SVC_gpu

# Time the CPU (sklearn) model; %timeit is an IPython/Jupyter magic
clf_svc = SVC(kernel='poly', degree=2, gamma='auto', C=1)
sklearn_time_svc = %timeit -o train_data(clf_svc)

# Time the GPU (cuML) model with the same hyperparameters
clf_svc = SVC_gpu(kernel='poly', degree=2, gamma='auto', C=1)
cuml_time_svc = %timeit -o train_data(clf_svc)

print(f"Average time of sklearn's {clf_svc.__class__.__name__}", sklearn_time_svc.average, 's')
print(f"Average time of cuml's {clf_svc.__class__.__name__}", cuml_time_svc.average, 's')

print('Ratio between sklearn and cuml is', sklearn_time_svc.average/cuml_time_svc.average)

Average time of sklearn's SVC 48.56009825014287 s
Average time of cuml's SVC 19.611496431714304 s
Ratio between sklearn and cuml is 2.476103668030909


cuML's SVC is about 2.5 times faster than sklearn's SVC!
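The `%timeit -o` magic used above only works inside IPython or Jupyter. As a rough plain-Python equivalent, the sketch below uses the standard library's `timeit.repeat`. The helper names `average_fit_time` and `speedup` are hypothetical, and unlike `%timeit`, which picks its loop count automatically, the number of runs here is fixed by the caller. The workload is a stand-in so the block runs without a GPU.

```python
import timeit


def average_fit_time(fit_fn, repeat=7, number=1):
    """Average wall-clock seconds per call of fit_fn over `repeat` runs."""
    timings = timeit.repeat(fit_fn, repeat=repeat, number=number)
    # Each entry is the total time for `number` calls, so normalize per call.
    return sum(t / number for t in timings) / len(timings)


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


# Stand-in workloads; in the article these would be
# lambda: train_data(SVC(...)) and lambda: train_data(SVC_gpu(...)).
slow = average_fit_time(lambda: sum(range(200_000)), repeat=3)
fast = average_fit_time(lambda: sum(range(20_000)), repeat=3)
print("Average time of slow workload", slow, "s")
print("Average time of fast workload", fast, "s")
print("Ratio between slow and fast is", speedup(slow, fast))
```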

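!pip install cutecharts

import cutecharts.charts as ctc

def plot(sklearn_time, cuml_time):
    # Bar chart comparing the average training time of both libraries
    chart = ctc.Bar('Sklearn vs cuml')
    chart.set_options(
        labels=['sklearn', 'cuml'],
        x_label='library',
        y_label='time (s)',
    )
    # Add the measured averages; without a series the chart would be empty
    chart.add_series('time', [round(sklearn_time.average, 2), round(cuml_time.average, 2)])
    return chart

plot(sklearn_time_svc, cuml_time_svc).render_notebook()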


#### With a better GPU

Average time of sklearn's SVC 35.791008955999914 s
Average time of cuml's SVC 1.9953700327142931 s
Ratio between sklearn and cuml is 17.93702840535976


### Random Forest Classifier

from sklearn.ensemble import RandomForestClassifier
from cuml.ensemble import RandomForestClassifier as RandomForestClassifier_gpu

# Time the CPU (sklearn) model
clf_rf = RandomForestClassifier(max_features=1.0,
                                n_estimators=40)
sklearn_time_rf = %timeit -o train_data(clf_rf)

# Time the GPU (cuML) model with the same hyperparameters
clf_rf = RandomForestClassifier_gpu(max_features=1.0,
                                    n_estimators=40)
cuml_time_rf = %timeit -o train_data(clf_rf)

print(f"Average time of sklearn's {clf_rf.__class__.__name__}", sklearn_time_rf.average, 's')
print(f"Average time of cuml's {clf_rf.__class__.__name__}", cuml_time_rf.average, 's')

print('Ratio between sklearn and cuml is', sklearn_time_rf.average/cuml_time_rf.average)

Average time of sklearn's RandomForestClassifier 29.824075075857113 s
Average time of cuml's RandomForestClassifier 0.49404465585715635 s
Ratio between sklearn and cuml is 60.3671646323408


cuML's RandomForestClassifier is about 60 times faster than Sklearn's RandomForestClassifier! If training Sklearn's RandomForestClassifier takes 30 seconds, training cuML's RandomForestClassifier takes less than half a second!
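To see what this gap means for a hyperparameter search, here is a back-of-the-envelope estimate using the per-fit averages reported above. The search size (20 candidates, 5 folds) and the helper `search_time_seconds` are hypothetical and chosen only for illustration. The estimate also assumes every fit costs as much as a fit on the full dataset, which overstates real cross-validation cost somewhat.

```python
def search_time_seconds(per_fit_seconds, n_candidates, n_folds):
    """Rough total training time for a search: one fit per candidate per fold."""
    return per_fit_seconds * n_candidates * n_folds


# Per-fit averages taken from the RandomForestClassifier output above (rounded).
sklearn_fit_seconds = 29.824
cuml_fit_seconds = 0.494

# Hypothetical search size, not from the article.
n_candidates, n_folds = 20, 5

sklearn_total = search_time_seconds(sklearn_fit_seconds, n_candidates, n_folds)
cuml_total = search_time_seconds(cuml_fit_seconds, n_candidates, n_folds)

print(f"sklearn: {sklearn_total / 60:.1f} minutes")
print(f"cuML:    {cuml_total:.1f} seconds")
```

Under these assumptions the search takes roughly 50 minutes with sklearn and under a minute with cuML.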

#### With a better GPU

Average time of sklearn's RandomForestClassifier 24.006061030143037 s
Average time of cuml's RandomForestClassifier 0.15141178591425808 s
Ratio between sklearn and cuml is 158.54816641379068


### Nearest Neighbors Classifier

Average time of sklearn's KNeighborsClassifier 0.07836367340000508 s
Average time of cuml's KNeighborsClassifier 0.004251259535714585 s
Ratio between sklearn and cuml is 18.43304854518441


cuML's KNeighborsClassifier is about 18 times faster than Sklearn's KNeighborsClassifier.

#### With more GPU memory

Average time of sklearn's KNeighborsClassifier 0.07511190322854547 s
Average time of cuml's KNeighborsClassifier 0.0015137992111426033 s
Ratio between sklearn and cuml is 49.618141346401956


### Summary

- Alienware M15: GeForce 2060 with 6.3 GB of GPU memory

| Model | sklearn (s) | cuML (s) | sklearn / cuML |
| --- | --- | --- | --- |
| SVC | 50.24 | 23.69 | 2.121 |
| RandomForestClassifier | 29.82 | 0.443 | 67.32 |
| KNeighborsClassifier | 0.078 | 0.004 | 19.5 |
| LinearRegression | 0.005 | 0.006 | 0.8333 |
| Ridge | 0.021 | 0.006 | 3.5 |
| KNeighborsRegressor | 0.076 | 0.002 | 38 |

- Dell Precision 7740: Quadro RTX 5000 with 17 GB of GPU memory

| Model | sklearn (s) | cuML (s) | sklearn / cuML |
| --- | --- | --- | --- |
| SVC | 35.79 | 1.995 | 17.94 |
| RandomForestClassifier | 24.01 | 0.151 | 159 |
| KNeighborsClassifier | 0.075 | 0.002 | 37.5 |
| LinearRegression | 0.006 | 0.002 | 3 |
| Ridge | 0.005 | 0.002 | 2.5 |
| KNeighborsRegressor | 0.069 | 0.001 | 69 |
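
The last column of each table is the sklearn time divided by the cuML time. As a small sketch of that arithmetic, the snippet below recomputes the ratios from the rounded timings in the tables and flags any model where cuML was not faster. The dictionary layout and the helper `speedups` are illustrative choices, not part of the original benchmark code.

```python
# (sklearn seconds, cuML seconds), copied from the summary tables above.
results = {
    "GeForce 2060": {
        "SVC": (50.24, 23.69),
        "RandomForestClassifier": (29.82, 0.443),
        "KNeighborsClassifier": (0.078, 0.004),
        "LinearRegression": (0.005, 0.006),
        "Ridge": (0.021, 0.006),
        "KNeighborsRegressor": (0.076, 0.002),
    },
    "Quadro RTX 5000": {
        "SVC": (35.79, 1.995),
        "RandomForestClassifier": (24.01, 0.151),
        "KNeighborsClassifier": (0.075, 0.002),
        "LinearRegression": (0.006, 0.002),
        "Ridge": (0.005, 0.002),
        "KNeighborsRegressor": (0.069, 0.001),
    },
}


def speedups(timings):
    """Map each model to sklearn_time / cuml_time (above 1 means cuML is faster)."""
    return {model: cpu / gpu for model, (cpu, gpu) in timings.items()}


for gpu_name, timings in results.items():
    print(gpu_name)
    for model, ratio in speedups(timings).items():
        note = "" if ratio >= 1 else "  (cuML slower here)"
        print(f"  {model}: {ratio:.2f}x{note}")
```

A ratio below 1, as with LinearRegression on the GeForce 2060, means that for a model this cheap to train the GPU version brought no benefit on that machine.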

### Conclusion

Chinese-language sklearn documentation: http://sklearn123.com/
