Model Training
KYO4321, published 3 months ago


## Random forest parameter tuning

K-fold cross-validation references:
https://spark.apache.org/docs/2.1.0/ml-tuning.html
https://stackoverflow.com/questions/32769573/how-to-cross-validate-randomforest-model

## Computing AUC

https://weiminwang.blog/2016/06/09/pyspark-tutorial-building-a-random-forest-binary-classifier-on-unbalanced-dataset/
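The linked tutorial evaluates a PySpark classifier by AUC. For intuition about what the number means, here is a small pure-Python sketch (independent of Spark, not the MLlib API): AUC equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counted as half.

```python
def auc(scores, labels):
    """AUC via the Mann-Whitney U statistic: the probability that a
    random positive example outranks a random negative one."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A perfectly ranked set of scores gives AUC = 1.0
print(auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))
```

Note this definition is threshold-free, which is why AUC is a common choice on the unbalanced datasets the tutorial discusses.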

The difference between cross-validation and Train-Validation Split: CrossValidator evaluates every parameter combination on each of k folds and averages the metric, while TrainValidationSplit evaluates on a single random train/validation split, which is cheaper but gives a noisier estimate. https://spark.apache.org/docs/2.1.0/ml-tuning.html
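The index bookkeeping behind the two strategies can be sketched in plain Python (an illustration only, not Spark's implementation):

```python
import random

def kfold_indices(n, k, seed=0):
    """Partition indices 0..n-1 into k folds; each fold serves once
    as the validation set, the rest as training data."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    return [(sorted(set(idx) - set(f)), sorted(f)) for f in folds]

def train_validation_split(n, train_ratio=0.75, seed=0):
    """Single random split: every example is used exactly once,
    so it is k times cheaper than k-fold but noisier."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(n * train_ratio)
    return sorted(idx[:cut]), sorted(idx[cut:])
```

With k folds, every example is validated exactly once and trained on k-1 times; with a single split, the held-out fraction is never trained on at all.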

https://stackoverflow.com/questions/41902360/random-forest-in-spark

## Computing multiple evaluation metrics in batch

https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-data-science-spark-advanced-data-exploration-modeling
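The linked Azure walkthrough reports several rates at once. A minimal pure-Python sketch of the standard rates derived from a binary confusion matrix (illustration only, not the MLlib evaluator API):

```python
def rates(preds, labels):
    """Derive the standard binary-classification rates from predicted
    and true 0/1 labels. Assumes both classes appear in the inputs."""
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    tn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 0)
    return {
        "precision": tp / (tp + fp),
        "recall":    tp / (tp + fn),   # true positive rate
        "fpr":       fp / (fp + tn),   # false positive rate
        "accuracy":  (tp + tn) / len(labels),
    }
```

In Spark itself these would come from evaluators or `MulticlassMetrics` over a predictions DataFrame; the arithmetic is the same.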

Each model-building code section is split into steps:

1. Model training on the data with one parameter set
2. Model evaluation on a test data set with metrics
3. Saving the model in blob storage for future consumption

Save the best model so it can be loaded and applied later.

```python
####################################################
# CV USING ELASTIC NET FOR LINEAR REGRESSION
####################################################

# RECORD START TIME
import datetime
timestart = datetime.datetime.now()

# LOAD PYSPARK LIBRARIES
from pyspark.ml.regression import LinearRegression
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# DEFINE ALGORITHM/MODEL
lr = LinearRegression()

# DEFINE GRID PARAMETERS
paramGrid = ParamGridBuilder()\
    .addGrid(lr.regParam, (0.01, 0.1))\
    .addGrid(lr.maxIter, (5, 10))\
    .addGrid(lr.tol, (1e-4, 1e-5))\
    .addGrid(lr.elasticNetParam, (0.25, 0.75))\
    .build()

# DEFINE PIPELINE
# (SIMPLY THE MODEL HERE, WITHOUT TRANSFORMATIONS)
pipeline = Pipeline(stages=[lr])

# DEFINE CV WITH PARAMETER SWEEP
cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=paramGrid,
                    evaluator=RegressionEvaluator(),
                    numFolds=3)

# CONVERT TO DATA FRAME, AS CROSSVALIDATOR WON'T RUN ON RDDS
trainDataFrame = sqlContext.createDataFrame(oneHotTRAINreg, ["features", "label"])

# TRAIN WITH CROSS-VALIDATION
cv_model = cv.fit(trainDataFrame)

# EVALUATE MODEL ON TEST SET
testDataFrame = sqlContext.createDataFrame(oneHotTESTreg, ["features", "label"])

# MAKE PREDICTIONS ON TEST DOCUMENTS
# cv_model uses the best model found (lrModel).
predictionAndLabels = cv_model.transform(testDataFrame)

# REGISTER PREDICTIONS DF AS A TABLE
predictionAndLabels.registerTempTable("tmp_results")

# PRINT ELAPSED TIME
timeend = datetime.datetime.now()
timedelta = round((timeend - timestart).total_seconds(), 2)
print("Time taken to execute above cell: " + str(timedelta) + " seconds")
```
