Spark MLlib Feature Transformation (Key Points)

1. Feature Transformation (all key points should be mastered)

1.1 Converting categorical attributes to numeric values [master]
  • Gender: male, female, other
  • Label encoding: 0, 1, 2
    • In Spark MLlib this is implemented by StringIndexer
    • Requirement: StringIndexer encodes a string column of labels into a column of label indices
  • In the example below, "a" appears most often and gets index 0, then "c" gets 1, and "b" gets 2 (the most frequent labels come first, starting from 0)

Scala code:

import org.apache.spark.ml.feature.{StringIndexer, StringIndexerModel}
import org.apache.spark.sql.{DataFrame, SparkSession}

/**
 * @author liu a fu
 * @date 2021/1/28 0028
 * @version 1.0
 * @DESC
 *    1-Set up the SparkSession environment
 *    2-Create data with two columns: id and category
 *    3-Apply StringIndexer to encode the category column in frequency order
 *    4-Show the result
 */
object _01StringToIndexTest {

  def main(args: Array[String]): Unit = {
    //1-Set up the environment
    val spark: SparkSession = SparkSession
      .builder()
      .appName(this.getClass.getSimpleName.stripSuffix("$"))
      .master("local[5]")
      .getOrCreate()
    spark.sparkContext.setLogLevel("WARN")

    //2-Create data with two columns: id and category
    val data: DataFrame = spark.createDataFrame(Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")))
      .toDF("id","category")   //映射的表的字段

    data.printSchema()
    /**
     * root
     * |-- id: integer (nullable = false)
     * |-- category: string (nullable = true)
     */
    //data.show()

    //3-Apply StringIndexer to encode the category column in frequency order
    val stringIndex: StringIndexer = new StringIndexer().setInputCol("category").setOutputCol("category_index")
    val strModel: StringIndexerModel = stringIndex.fit(data)
    val strResult: DataFrame = strModel.transform(data)
    //4-Show the result
    strResult.show(truncate = false)
    /**
     * +---+--------+--------------+
     * |id |category|category_index|
     * +---+--------+--------------+
     * |0  |a       |0.0           |
     * |1  |b       |2.0           |
     * |2  |c       |1.0           |
     * |3  |a       |0.0           |
     * |4  |a       |0.0           |
     * |5  |c       |1.0           |
     * +---+--------+--------------+
     */

    //X (features, e.g. male/female attributes) -> y (encoded labels 0/1) -> (man/woman)
    
  }

}

What if you need to map the indexed values back to the original labels?

  • IndexToString
    import org.apache.spark.ml.feature.IndexToString

    val index: IndexToString = new IndexToString()
      .setInputCol("category_index")
      .setOutputCol("defore_Category")
      .setLabels(strModel.labels)   //reference labels from the fitted StringIndexerModel
    index.transform(strResult).show(false)

    /**
     * +---+--------+--------------+---------------+
     * |id |category|category_index|defore_Category|
     * +---+--------+--------------+---------------+
     * |0  |a       |0.0           |a              |
     * |1  |b       |2.0           |b              |
     * |2  |c       |1.0           |c              |
     * |3  |a       |0.0           |a              |
     * |4  |a       |0.0           |a              |
     * |5  |c       |1.0           |c              |
     * +---+--------+--------------+---------------+
     */

Summary:

  • StringIndexer converts categorical values into numeric indices
  • IndexToString converts the indices back into categorical values; it needs the reference labels (here taken from the fitted StringIndexerModel)
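
By default StringIndexer assigns indices by descending label frequency. A minimal sketch, assuming Spark 2.3+ where setStringOrderType is available, of switching to alphabetical ordering (continuing from the data DataFrame in the example above):

    //"alphabetAsc" would give a -> 0.0, b -> 1.0, c -> 2.0; the default is "frequencyDesc"
    val alphaIndexer: StringIndexer = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("category_index_alpha")
      .setStringOrderType("alphabetAsc")
    alphaIndexer.fit(data).transform(data).show(false)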
1.2 Discretizing continuous attributes [master]


  • Age: 0–130 → youth, middle-aged, elderly (split roughly at 14, 30 and 60; see the Bucketizer sketch at the end of this subsection)
  • Binarization
import org.apache.spark.ml.feature.Binarizer
import org.apache.spark.sql.SparkSession
object Binaziner_3 {

  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder()
      .appName("SparkMlilb")
      .master("local[2]")
      .getOrCreate()
    spark.sparkContext.setLogLevel("WARN")
    val data = Array((0, 0.1), (1, 0.8), (2, 0.2))
    val dataFrame = spark.createDataFrame(data).toDF("label", "feature")

    val binarizer: Binarizer = new Binarizer()
      .setInputCol("feature")
      .setOutputCol("binarized_feature")
      .setThreshold(0.5)   //threshold 0.5: values greater than 0.5 become 1.0, the rest 0.0

    val binarizedDataFrame = binarizer.transform(dataFrame)
    val binarizedFeatures = binarizedDataFrame.select("binarized_feature")
    binarizedFeatures.collect().foreach(println)
  }
}

Result:
[0.0]
[1.0]
[0.0]
  • Binning: Bucketizer
    Bucketizer transforms a column of continuous features into a column of feature buckets, where the buckets are specified by the user. It takes one parameter, splits, which defines the bucket boundaries.
  • Discretizes continuous values into multiple buckets
import org.apache.spark.ml.feature.Bucketizer
import org.apache.spark.sql.SparkSession

/**
 * @author liu a fu
 * @date 2021/1/28 0028
 * @version 1.0
 * @DESC
 */
object _02BinarizerTest {

  def main(args: Array[String]): Unit = {
    //1-Set up the environment
    val spark: SparkSession = SparkSession
      .builder()
      .appName(this.getClass.getSimpleName.stripSuffix("$"))
      .master("local[8]")
      .getOrCreate()

    spark.sparkContext.setLogLevel("WARN")

    //2-Prepare the data
    val data = Array(-0.5, -0.3, 0.0, 0.2)
    val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")
    df.printSchema()

    //3-Apply the Bucketizer to the data
    val bucket: Bucketizer = new Bucketizer()
      .setInputCol("features")
      .setOutputCol("bucketResult")
      //NegativeInfinity / PositiveInfinity cover all possible values; each bucket is a left-closed, right-open interval [lower, upper)
      .setSplits(Array(Double.NegativeInfinity, -0.5, 0, 0.5, Double.PositiveInfinity))

    //4-Show the result
    bucket.transform(df).show()

    /**
     * +--------+------------+
     * |features|bucketResult|
     * +--------+------------+
     * |    -0.5|         1.0|
     * |    -0.3|         1.0|
     * |     0.0|         2.0|
     * |     0.2|         2.0|
     * +--------+------------+
     */

  }

}

  • QuantileDiscretizer: discretizes a continuous attribute into a given number of quantile-based buckets
import org.apache.spark.ml.feature.QuantileDiscretizer
import org.apache.spark.sql.SparkSession

/**
 * @author liu a fu
 * @date 2021/1/28 0028
 * @version 1.0
 * @DESC
 */
object _03QuantileDiscretizerTest {

  def main(args: Array[String]): Unit = {
    //1-Set up the environment
    val spark: SparkSession = SparkSession
      .builder()
      .appName(this.getClass.getSimpleName.stripSuffix("$"))
      .master("local[8]")
      .getOrCreate()

    spark.sparkContext.setLogLevel("WARN")

    //2-Prepare the data
    val data = Array((0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, 2.2))
    val df = spark.createDataFrame(data).toDF("id", "hour")
    df.describe().show(false)

    //3-Process the data
    val buckets: QuantileDiscretizer = new QuantileDiscretizer()
      .setInputCol("hour")
      .setOutputCol("selectHour")
      .setNumBuckets(3)  //use 3 buckets

    buckets.fit(df).transform(df).show(truncate = false)
    /**
     * +---+----+----------+
     * | id|hour|selectHour|
     * +---+----+----------+
     * |  0|18.0|       2.0|
     * |  1|19.0|       2.0|
     * |  2| 8.0|       1.0|
     * |  3| 5.0|       1.0|
     * |  4| 2.2|       0.0|
     * +---+----+----------+
     */
    
  }

}
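
Tying this back to the age example at the start of this subsection, below is a minimal sketch of bucketing ages with Bucketizer; the split points 14, 30 and 60, the toy rows, and the column names are assumptions used only for illustration:

import org.apache.spark.ml.feature.Bucketizer
import org.apache.spark.sql.SparkSession

object AgeBucketizerSketch {

  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder()
      .appName("AgeBucketizerSketch")
      .master("local[2]")
      .getOrCreate()
    spark.sparkContext.setLogLevel("WARN")

    //toy ages, purely illustrative
    val ages = spark.createDataFrame(Seq((0, 8.0), (1, 25.0), (2, 45.0), (3, 72.0)))
      .toDF("id", "age")

    //buckets: 0.0 = child, 1.0 = youth, 2.0 = middle-aged, 3.0 = elderly
    val ageBucketizer: Bucketizer = new Bucketizer()
      .setInputCol("age")
      .setOutputCol("age_group")
      .setSplits(Array(Double.NegativeInfinity, 14.0, 30.0, 60.0, Double.PositiveInfinity))

    ageBucketizer.transform(ages).show(false)
  }
}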
1.3 Combining features into a feature vector [must master]
  • VectorAssembler: feature assembly

Features: age, gender, weight → X = (age, gender, weight), y = match (1)

import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.{DataFrame, SparkSession}

/**
 * @author liu a fu
 * @date 2021/1/28 0028
 * @version 1.0
 * @DESC  Feature assembly with VectorAssembler
 */
object _04FeaturesVectorAssemble {

  def main(args: Array[String]): Unit = {
    //1-Set up the environment
    val spark: SparkSession = SparkSession
      .builder()
      .appName(this.getClass.getSimpleName.stripSuffix("$"))
      .master("local[8]")
      .getOrCreate()

    spark.sparkContext.setLogLevel("WARN")

    //2-Prepare the data
    val df: DataFrame = spark.createDataFrame(
      Seq((0, 18, 1.0, Vectors.dense(0.0, 10.0, 0.5), 1.0),
        (1, 20, 2.0, Vectors.dense(0.1, 11.0, 0.5), 0.0))
    ).toDF("id", "hour", "mobile", "userFeatures", "clicked")

    //3-Use VectorAssembler to combine the separate columns into a single feature vector and show the result
    val assembler: VectorAssembler = new VectorAssembler()
      .setInputCols(Array("id","hour", "mobile", "userFeatures"))
      .setOutputCol("features")
    assembler.transform(df).show(false)
    /**
     * +---+----+------+--------------+-------+---------------------------+
     * |id |hour|mobile|userFeatures  |clicked|features                   |
     * +---+----+------+--------------+-------+---------------------------+
     * |0  |18  |1.0   |[0.0,10.0,0.5]|1.0    |[0.0,18.0,1.0,0.0,10.0,0.5]|
     * |1  |20  |2.0   |[0.1,11.0,0.5]|0.0    |[1.0,20.0,2.0,0.1,11.0,0.5]|
     * +---+----+------+--------------+-------+---------------------------+
     */
  }
}
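
The assembled "features" column is what Spark ML estimators consume. As a usage note, here is a minimal sketch of feeding the assembled vector and the clicked column into a model, continuing from the transform above; the LogisticRegression settings are illustrative assumptions, not part of the original example:

    import org.apache.spark.ml.classification.LogisticRegression

    //reuse the output of the VectorAssembler above
    val assembled: DataFrame = assembler.transform(df)

    //train on the assembled feature vector, using "clicked" as the label
    val lr = new LogisticRegression()
      .setFeaturesCol("features")
      .setLabelCol("clicked")
      .setMaxIter(10)
    lr.fit(assembled).transform(assembled).select("clicked", "prediction").show(false)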

1.4 Standardizing numeric data
  • StandardScaler: subtracts the mean and divides by the standard deviation
import org.apache.spark.SparkContext
import org.apache.spark.ml.feature.{StandardScaler, StandardScalerModel}
import org.apache.spark.sql.{DataFrame, SparkSession}

/**
  * DESC: 
  * Complete data processing and modeling process steps:
  *
  */
object StandScalerTest {

  def main(args: Array[String]): Unit = {
    // 1-Set up the environment
    val spark: SparkSession = SparkSession.builder().master("local[*]").appName("StringToIndexerTest").getOrCreate()
    val sc: SparkContext = spark.sparkContext
    sc.setLogLevel("WARN")
    // 2-Read the data
    val path = "D:\\BigData\\Workspace\\SparkMachineLearningTest\\SparkMllib_BigDataSH16\\src\\main\\resources\\sample_libsvm_data.txt"
    val df: DataFrame = spark.read.format("libsvm").load(path)
    df.printSchema() //label+features
    // 3-Standardize: subtract the mean and divide by the standard deviation
    val std: StandardScaler = new StandardScaler()
      .setInputCol("features")
      .setOutputCol("std_features")
      .setWithMean(true)  //center by subtracting the mean
      .setWithStd(true)   //scale to unit standard deviation
    val stdModel: StandardScalerModel = std.fit(df)
    val result: DataFrame = stdModel.transform(df)
    result.show(false)
  }
}
  • MinMaxScaler: subtracts the minimum and divides by (max - min), rescaling each feature to [0, 1]
  • MaxAbsScaler: divides each value by the maximum absolute value of the feature, rescaling to [-1, 1]
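
Neither of these scalers is shown above; below is a minimal sketch of applying MinMaxScaler and MaxAbsScaler to a vector column, where the toy vectors and column names are assumptions used only for illustration:

import org.apache.spark.ml.feature.{MaxAbsScaler, MinMaxScaler}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object MinMaxAndMaxAbsScalerSketch {

  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder()
      .appName("ScalerSketch")
      .master("local[2]")
      .getOrCreate()
    spark.sparkContext.setLogLevel("WARN")

    val df = spark.createDataFrame(Seq(
      (0, Vectors.dense(1.0, 0.5, -1.0)),
      (1, Vectors.dense(2.0, 1.0, 1.0)),
      (2, Vectors.dense(4.0, 10.0, 2.0))
    )).toDF("id", "features")

    //MinMaxScaler: (x - min) / (max - min) per feature, rescaled to [0, 1] by default
    val minMax = new MinMaxScaler().setInputCol("features").setOutputCol("minmax_features")
    minMax.fit(df).transform(df).show(false)

    //MaxAbsScaler: x / max(|x|) per feature, rescaled to [-1, 1]
    val maxAbs = new MaxAbsScaler().setInputCol("features").setOutputCol("maxabs_features")
    maxAbs.fit(df).transform(df).show(false)
  }
}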


Reposted from blog.csdn.net/m0_49834705/article/details/113358575