Wine Quality Prediction

原创
2017/03/26 20:48
阅读数 64

1、下载数据Wine Quality Data Set 

2、删除csv文件的第一行

3、编写spark代码

package com.spark.machine.learning

import org.apache.log4j.{Level, Logger}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession

/**
  * Created by 黄坤平 on 2017/3/26.
  */
object WinePredicted extends App {

  Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
  Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF)

  //创建SparkSession
  val spark = SparkSession
    .builder()
    .master("local")
    .appName("WinePredicted")
    .config("spark.some.config.option", "some-value")
    .getOrCreate();

  import spark.implicits._

  //读取训练数据并转化为DataFrame
  val trainingDF = spark.sparkContext.textFile("D:\\winequality\\winequality-red.csv").map(_.split(";"))
    .map(w => (w(11).toDouble, Vectors.dense(w(0).toDouble, w(1).toDouble, w(2).toDouble,
      w(3).toDouble, w(4).toDouble, w(5).toDouble, w(6).toDouble, w(7).toDouble,
      w(8).toDouble, w(9).toDouble, w(10).toDouble))).toDF("label", "features")

  //显示训练数据
  trainingDF.show()

  //创建线性回归对象,并设置迭代最大值,通过fix(训练数据)生成模型
  val model = new LinearRegression().setMaxIter(10).fit(trainingDF)

  //使用模型测试
  val testDF = spark.createDataFrame(Seq((5.0, Vectors.dense(7.4,
    0.7, 0.0, 1.9, 0.076, 25.0, 67.0, 0.9968, 3.2, 0.68, 9.8)), (5.0,
    Vectors.dense(7.8, 0.88, 0.0, 2.6, 0.098, 11.0, 34.0, 0.9978, 3.51, 0.56,
      9.4)), (7.0, Vectors.dense(7.3, 0.65, 0.0, 1.2, 0.065, 15.0, 18.0, 0.9968,
    3.36, 0.57, 9.5)))).toDF("label", "features")
  testDF.show()
  testDF.createOrReplaceTempView("test")
  //获取预测值
  val tested = model.transform(testDF).select("features", "label", "prediction")
  tested.show()

  //获取预测的数据
  val predictDF = spark.sql("SELECT features FROM test")
  predictDF.show()

  //预测
  val predicted = model.transform(predictDF).select("features", "prediction")
  predicted.show()
}

4、输出结果


//trainingDF.show()
+-----+--------------------+
|label|            features|
+-----+--------------------+
|  5.0|[7.4,0.7,0.0,1.9,...|
|  5.0|[7.8,0.88,0.0,2.6...|
|  5.0|[7.8,0.76,0.04,2....|
|  6.0|[11.2,0.28,0.56,1...|
|  5.0|[7.4,0.7,0.0,1.9,...|
|  5.0|[7.4,0.66,0.0,1.8...|
|  5.0|[7.9,0.6,0.06,1.6...|
|  7.0|[7.3,0.65,0.0,1.2...|
|  7.0|[7.8,0.58,0.02,2....|
|  5.0|[7.5,0.5,0.36,6.1...|
|  5.0|[6.7,0.58,0.08,1....|
|  5.0|[7.5,0.5,0.36,6.1...|
|  5.0|[5.6,0.615,0.0,1....|
|  5.0|[7.8,0.61,0.29,1....|
|  5.0|[8.9,0.62,0.18,3....|
|  5.0|[8.9,0.62,0.19,3....|
|  7.0|[8.5,0.28,0.56,1....|
|  5.0|[8.1,0.56,0.28,1....|
|  4.0|[7.4,0.59,0.08,4....|
|  6.0|[7.9,0.32,0.51,1....|
+-----+--------------------+
only showing top 20 rows

//testDF.show()
+-----+--------------------+
|label|            features|
+-----+--------------------+
|  5.0|[7.4,0.7,0.0,1.9,...|
|  5.0|[7.8,0.88,0.0,2.6...|
|  7.0|[7.3,0.65,0.0,1.2...|
+-----+--------------------+

//tested.show()
+--------------------+-----+-----------------+
|            features|label|       prediction|
+--------------------+-----+-----------------+
|[7.4,0.7,0.0,1.9,...|  5.0|5.352730835965481|
|[7.8,0.88,0.0,2.6...|  5.0|4.817999361975048|
|[7.3,0.65,0.0,1.2...|  7.0|5.280106355690734|
+--------------------+-----+-----------------+

//predictDF.show()
+--------------------+
|            features|
+--------------------+
|[7.4,0.7,0.0,1.9,...|
|[7.8,0.88,0.0,2.6...|
|[7.3,0.65,0.0,1.2...|
+--------------------+

predicted.show()
+--------------------+-----------------+
|            features|       prediction|
+--------------------+-----------------+
|[7.4,0.7,0.0,1.9,...|5.352730835965481|
|[7.8,0.88,0.0,2.6...|4.817999361975048|
|[7.3,0.65,0.0,1.2...|5.280106355690734|
+--------------------+-----------------+


Process finished with exit code 0

5、总结(创建训练数据 => 创建线性回归对象生成模型 => 预测(或测试))

#主要步骤
	a. 创建训练数据框
		val trainingDF = spark.sparkContext.textFile("D:\\winequality\\winequality-red.csv").map(_.split(";"))
			.map(w => (w(11).toDouble, Vectors.dense(w(0).toDouble, w(1).toDouble, w(2).toDouble,
			w(3).toDouble, w(4).toDouble, w(5).toDouble, w(6).toDouble, w(7).toDouble,
			w(8).toDouble, w(9).toDouble, w(10).toDouble))).toDF("label", "features")
	b. 创建线性回归对象,并设置迭代最大值,通过fix(训练数据)生成模型
		val model = new LinearRegression().setMaxIter(10).fit(trainingDF)
	c. 使用模型进行预测(或测试)
		val predicted = model.transform(predictDF).select("features", "prediction")

6、maven依赖

    <properties>
      <scala.version>2.11.8</scala.version>
      <spark.version>2.1.0</spark.version>
    </properties>

    <!-- scala -->
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
    </dependency>

    <!-- spark -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>${spark.version}</version>
    </dependency>

      <!-- spark mllib -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-mllib_2.11</artifactId>
      <version>${spark.version}</version>
    </dependency>

 

展开阅读全文
打赏
0
0 收藏
分享
加载中
更多评论
打赏
0 评论
0 收藏
0
分享
返回顶部
顶部