1. Feature Transformation (master all of the key points)
1.1 Numerical encoding of categorical attributes [master]
- Gender: male, female, other
- Label encoding: 0 / 1 / 2
- The Spark MLlib implementation is StringIndexer
- Requirement analysis: StringIndexer encodes a string column of labels into a column of label indices
- In the example below, "a" appears most often and is encoded as 0, "c" is next and becomes 1, and "b" becomes 2 (the most frequent value comes first, with indices starting at 0)
Scala code:
import org.apache.spark.ml.feature.{StringIndexer, StringIndexerModel}
import org.apache.spark.sql.{DataFrame, SparkSession}
/**
 * @author liu a fu
 * @date 2021/1/28 0028
 * @version 1.0
 * @DESC
 *       1 - set up the SparkSession environment
 *       2 - create the data with two columns: id and category
 *       3 - apply StringIndexer to encode the category column by frequency
 *       4 - show the result
 */
object _01StringToIndexTest {
def main(args: Array[String]): Unit = {
//1 - set up the environment
val spark: SparkSession = SparkSession
.builder()
.appName(this.getClass.getSimpleName.stripSuffix("$"))
.master("local[5]")
.getOrCreate()
spark.sparkContext.setLogLevel("WARN")
//2 - create the data with two columns: id and category
val data: DataFrame = spark.createDataFrame(Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")))
.toDF("id","category") //映射的表的字段
data.printSchema()
/**
* root
* |-- id: integer (nullable = false)
* |-- category: string (nullable = true)
*/
//data.show()
//3 - apply StringIndexer to encode the labels by frequency
val stringIndex: StringIndexer = new StringIndexer().setInputCol("category").setOutputCol("category_index")
val strModel: StringIndexerModel = stringIndex.fit(data)
val strResult: DataFrame = strModel.transform(data)
//4 - show the result
strResult.show(truncate = false)
/**
* +---+--------+--------------+
* |id |category|category_index|
* +---+--------+--------------+
* |0 |a |0.0 |
* |1 |b |2.0 |
* |2 |c |1.0 |
* |3 |a |0.0 |
* |4 |a |0.0 |
* |5 |c |1.0 |
* +---+--------+--------------+
*/
//X (features, e.g. male/female attributes) ----- y (0/1 label) ----- (man/woman)
}
}
What if the encoded indices need to be mapped back to the original labels?
- IndexToString
import org.apache.spark.ml.feature.IndexToString

val index: IndexToString = new IndexToString()
  .setInputCol("category_index")
  .setOutputCol("before_category")
  .setLabels(strModel.labels) // use the labels learned by the StringIndexerModel
index.transform(strResult).show(false)
/**
* +---+--------+--------------+---------------+
* |id |category|category_index|before_category|
* +---+--------+--------------+---------------+
* |0 |a |0.0 |a |
* |1 |b |2.0 |b |
* |2 |c |1.0 |c |
* |3 |a |0.0 |a |
* |4 |a |0.0 |a |
* |5 |c |1.0 |c |
* +---+--------+--------------+---------------+
*/
Summary:
- StringIndexer converts categorical values into numeric indices
- IndexToString converts the indices back into categorical values, and needs the reference labels (the fitted StringIndexerModel)
1.2 Discretizing continuous attributes [master]
- Age: 0-130 years → e.g. youth (>14) + middle-aged (>30) + elderly (>60); see the age-binning sketch after the Bucketizer example below
- Binarization (Binarizer):
import org.apache.spark.ml.feature.Binarizer
import org.apache.spark.sql.SparkSession
object Binarizer_3 {
def main(args: Array[String]): Unit = {
val spark: SparkSession = SparkSession.builder()
.appName("SparkMlilb")
.master("local[2]")
.getOrCreate()
spark.sparkContext.setLogLevel("WARN")
val data = Array((0, 0.1), (1, 0.8), (2, 0.2))
val dataFrame = spark.createDataFrame(data).toDF("label", "feature")
val binarizer: Binarizer = new Binarizer()
.setInputCol("feature")
.setOutputCol("binarized_feature")
.setThreshold(0.5) // threshold 0.5: values above it become 1.0, the rest become 0.0
val binarizedDataFrame = binarizer.transform(dataFrame)
val binarizedFeatures = binarizedDataFrame.select("binarized_feature")
binarizedFeatures.collect().foreach(println)
}
}
Result:
[0.0]
[1.0]
[0.0]
- Binning: Bucketizer
Bucketizer maps a column of continuous features into a column of buckets, where the bucket boundaries (splits) are specified by the user. In other words, it discretizes a continuous attribute into a user-defined set of bins.
import org.apache.spark.ml.feature.Bucketizer
import org.apache.spark.sql.SparkSession
/**
* @author liu a fu
* @date 2021/1/28 0028
* @version 1.0
* @DESC
*/
object _02BucketizerTest {
def main(args: Array[String]): Unit = {
//1 - set up the environment
val spark: SparkSession = SparkSession
.builder()
.appName(this.getClass.getSimpleName.stripSuffix("$"))
.master("local[8]")
.getOrCreate()
spark.sparkContext.setLogLevel("WARN")
//2 - prepare the data
val data = Array(-0.5, -0.3, 0.0, 0.2)
val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")
df.printSchema()
//3 - apply the Bucketizer to the data
val bucket: Bucketizer = new Bucketizer()
.setInputCol("features")
.setOutputCol("bucketResult")
//NegativeInfinity / PositiveInfinity for open-ended boundaries; each bucket is [x, y), left-closed and right-open, except the last bucket which also includes the right endpoint
.setSplits(Array(Double.NegativeInfinity, -0.5, 0, 0.5, Double.PositiveInfinity))
//4 - show the result
bucket.transform(df).show()
/**
* +--------+------------+
* |features|bucketResult|
* +--------+------------+
* | -0.5| 1.0|
* | -0.3| 1.0|
* | 0.0| 2.0|
* | 0.2| 2.0|
* +--------+------------+
*/
}
}
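To tie this back to the age example at the start of 1.2, here is a minimal sketch that bins ages into the youth / middle-aged / elderly groups with Bucketizer. The cut points (14 / 30 / 60), the object name, and the column names are illustrative assumptions, not part of the original:
import org.apache.spark.ml.feature.Bucketizer
import org.apache.spark.sql.SparkSession

object AgeBucketizerSketch {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder().master("local[2]").appName("AgeBucketizerSketch").getOrCreate()
    spark.sparkContext.setLogLevel("WARN")
    // toy ages for illustration only
    val df = spark.createDataFrame(Array(5.0, 16.0, 28.0, 45.0, 72.0).map(Tuple1.apply)).toDF("age")
    // buckets: [0,14) child, [14,30) youth, [30,60) middle-aged, [60,130] elderly
    val ageBucketizer: Bucketizer = new Bucketizer()
      .setInputCol("age")
      .setOutputCol("age_group")
      .setSplits(Array(0.0, 14.0, 30.0, 60.0, 130.0))
    ageBucketizer.transform(df).show(false)
  }
}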
- QuantileDiscretizer: discretizes a continuous attribute into a given number of buckets with roughly equal numbers of records (quantile-based binning)
import org.apache.spark.ml.feature.QuantileDiscretizer
import org.apache.spark.sql.SparkSession
/**
* @author liu a fu
* @date 2021/1/28 0028
* @version 1.0
* @DESC
*/
object _03QuantileDiscretizerTest {
def main(args: Array[String]): Unit = {
//1 - set up the environment
val spark: SparkSession = SparkSession
.builder()
.appName(this.getClass.getSimpleName.stripSuffix("$"))
.master("local[8]")
.getOrCreate()
spark.sparkContext.setLogLevel("WARN")
//2 - prepare the data
val data = Array((0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, 2.2))
val df = spark.createDataFrame(data).toDF("id", "hour")
df.describe().show(false)
//3 - discretize the data
val buckets: QuantileDiscretizer = new QuantileDiscretizer()
.setInputCol("hour")
.setOutputCol("selectHour")
.setNumBuckets(3) // use 3 buckets
buckets.fit(df).transform(df).show(truncate = false)
/**
* +---+----+----------+
* | id|hour|selectHour|
* +---+----+----------+
* | 0|18.0| 2.0|
* | 1|19.0| 2.0|
* | 2| 8.0| 1.0|
* | 3| 5.0| 1.0|
* | 4| 2.2| 0.0|
* +---+----+----------+
*/
}
}
1.3 Assembling features into a single feature vector [must master]
- VectorAssembler: feature assembly
Features: age, gender, weight → X = (age, gender, weight), y = blind-date outcome (e.g. 1 = match)
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.{DataFrame, SparkSession}
/**
* @author liu a fu
* @date 2021/1/28 0028
* @version 1.0
* @DESC feature assembly with VectorAssembler
*/
object _04FeaturesVectorAssemble {
def main(args: Array[String]): Unit = {
//1 - set up the environment
val spark: SparkSession = SparkSession
.builder()
.appName(this.getClass.getSimpleName.stripSuffix("$"))
.master("local[8]")
.getOrCreate()
spark.sparkContext.setLogLevel("WARN")
//2 - prepare the data
val df: DataFrame = spark.createDataFrame(
Seq((0, 18, 1.0, Vectors.dense(0.0, 10.0, 0.5), 1.0),
(1, 20, 2.0, Vectors.dense(0.1, 11.0, 0.5), 0.0))
).toDF("id", "hour", "mobile", "userFeatures", "clicked")
//3 - use VectorAssembler to combine the separate feature columns into one feature vector and show the output
val assembler: VectorAssembler = new VectorAssembler()
.setInputCols(Array("id","hour", "mobile", "userFeatures"))
.setOutputCol("features")
assembler.transform(df).show(false)
/**
* +---+----+------+--------------+-------+---------------------------+
* |id |hour|mobile|userFeatures |clicked|features |
* +---+----+------+--------------+-------+---------------------------+
* |0 |18 |1.0 |[0.0,10.0,0.5]|1.0 |[0.0,18.0,1.0,0.0,10.0,0.5]|
* |1 |20 |2.0 |[0.1,11.0,0.5]|0.0 |[1.0,20.0,2.0,0.1,11.0,0.5]|
* +---+----+------+--------------+-------+---------------------------+
*/
}
}
1.4 Standardizing numeric data
- StandardScaler: subtracts the column mean and divides by the standard deviation, i.e. z = (x - mean) / stddev
import org.apache.spark.SparkContext
import org.apache.spark.ml.feature.{StandardScaler, StandardScalerModel}
import org.apache.spark.sql.{DataFrame, SparkSession}
/**
* DESC:
* Complete data processing and modeling process steps:
*
*/
object StandScalerTest {
def main(args: Array[String]): Unit = {
// * 1 - set up the environment
val spark: SparkSession = SparkSession.builder().master("local[*]").appName("StandardScalerTest").getOrCreate()
val sc: SparkContext = spark.sparkContext
sc.setLogLevel("WARN")
// * 2 - read the data
val path = "D:\\BigData\\Workspace\\SparkMachineLearningTest\\SparkMllib_BigDataSH16\\src\\main\\resources\\sample_libsvm_data.txt"
val df: DataFrame = spark.read.format("libsvm").load(path)
df.printSchema() //label+features
// * 3 - standardize: subtract the mean and divide by the standard deviation
val std: StandardScaler = new StandardScaler()
.setInputCol("features")
.setOutputCol("std_features")
.setWithMean(true) // center each feature by subtracting its mean
.setWithStd(true) // scale each feature to unit standard deviation
val stdModel: StandardScalerModel = std.fit(df)
val result: DataFrame = stdModel.transform(df)
result.show(false)
}
}
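The example above reads sample_libsvm_data.txt from a local path. If that file is not at hand, a minimal self-contained sketch of the same StandardScaler usage on a tiny in-memory DataFrame could look like this (the object name, data values, and column names are made up for illustration):
import org.apache.spark.ml.feature.StandardScaler
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object StandardScalerInMemorySketch {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder().master("local[*]").appName("StandardScalerInMemorySketch").getOrCreate()
    spark.sparkContext.setLogLevel("WARN")
    // tiny toy feature vectors instead of the libsvm file
    val df = spark.createDataFrame(Seq(
      (0, Vectors.dense(1.0, 10.0, 100.0)),
      (1, Vectors.dense(2.0, 20.0, 200.0)),
      (2, Vectors.dense(3.0, 30.0, 300.0))
    )).toDF("id", "features")
    // z = (x - mean) / stddev, computed per feature dimension
    val scaler = new StandardScaler()
      .setInputCol("features")
      .setOutputCol("std_features")
      .setWithMean(true)
      .setWithStd(true)
    scaler.fit(df).transform(df).show(false)
  }
}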
- MinMaxScaler: subtracts the feature's minimum and divides by (max - min), rescaling each feature to [0, 1] by default
- MaxAbsScaler: divides each value by the maximum absolute value of that feature, rescaling to [-1, 1] without shifting the data; a combined sketch of both scalers follows below
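Neither of these two scalers has an example above, so here is a minimal combined sketch, assuming toy feature vectors (the object name, data values, and column names are illustrative only):
import org.apache.spark.ml.feature.{MaxAbsScaler, MinMaxScaler}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object MinMaxMaxAbsScalerSketch {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder().master("local[2]").appName("MinMaxMaxAbsScalerSketch").getOrCreate()
    spark.sparkContext.setLogLevel("WARN")
    val df = spark.createDataFrame(Seq(
      (0, Vectors.dense(1.0, 0.1, -1.0)),
      (1, Vectors.dense(2.0, 1.1, 1.0)),
      (2, Vectors.dense(3.0, 10.1, 3.0))
    )).toDF("id", "features")
    // MinMaxScaler: (x - min) / (max - min), rescales each feature to [0, 1] by default
    val minMax = new MinMaxScaler().setInputCol("features").setOutputCol("minmax_features")
    minMax.fit(df).transform(df).show(false)
    // MaxAbsScaler: x / max(|x|), rescales each feature to [-1, 1] without centering
    val maxAbs = new MaxAbsScaler().setInputCol("features").setOutputCol("maxabs_features")
    maxAbs.fit(df).transform(df).show(false)
  }
}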