0x0 Dataset to POJO
Method:
- Convert the query result to an RDD
- Rebuild a DataFrame from that RDD, passing in a schema whose field names match the POJO properties
- Call the as method with a bean encoder to turn the Dataset<Row> into a Dataset of the POJO type
- Call collectAsList() to materialize the result
The code is as follows:
1. Table structure
+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
| id| string| null|
| name| string| null|
| class| string| null|
+--------+---------+-------+
2. POJO type
public class Student {
String id;
String name;
String major;
...
}
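Note that Encoders.bean (used in the conversion code below) requires a public class with a no-argument constructor and a getter/setter pair per field; the "..." above elides these. A minimal complete sketch of such a bean:

```java
// Minimal bean sketch for use with Encoders.bean: public class,
// no-arg constructor, and a getter/setter pair per column.
public class Student {
    private String id;
    private String name;
    private String major;

    public Student() {}

    public String getId() { return id; }
    public void setId(String id) { this.id = id; }
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public String getMajor() { return major; }
    public void setMajor(String major) { this.major = major; }

    @Override
    public String toString() {
        return "Student{id=" + id + ", name=" + name + ", major=" + major + "}";
    }
}
```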
3. Conversion code
SparkSession spark = CloudUtils.getSparkSession();
// Query the raw data
Dataset<Row> student = spark.sql("select * from `event`.`student`");
// Build the schema; field names must match the POJO properties
List<StructField> fields = new ArrayList<>();
fields.add(DataTypes.createStructField("id", DataTypes.StringType, true));
fields.add(DataTypes.createStructField("name", DataTypes.StringType, true));
fields.add(DataTypes.createStructField("major", DataTypes.StringType, true));
StructType schema = DataTypes.createStructType(fields);
// Convert the query result into a list of POJOs
List<Student> students = spark.createDataFrame(student.toJavaRDD(), schema)
        .as(Encoders.bean(Student.class))
        .collectAsList();
System.out.println(students);
Note:
Spark's timestamp column type is not compatible with java.util.Date on the POJO side, but it is compatible with java.sql.Timestamp.
To work around this, we can first convert the Dataset to JSON and then deserialize the JSON into POJOs:
// Query the data and collect it as a list of JSON strings
List<String> jsonList = spark.sql("select * from `event`.`user`")
        .toJSON()
        .collectAsList();
// Deserialize the JSON into POJOs (FastJSON is used here)
List<User> users = jsonList.stream()
        .map(jsonString -> JSON.parseObject(jsonString, User.class))
        .collect(Collectors.toList());
System.out.println(users);
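Alternatively, when only a date column blocks the direct Encoders.bean route, declaring the bean field as java.sql.Timestamp avoids the JSON detour entirely. A sketch, with a hypothetical birthday column (the class name UserWithBirthday is ours):

```java
import java.sql.Timestamp;

// Sketch: a bean whose timestamp column is declared as java.sql.Timestamp.
// A java.util.Date field here would not work with Encoders.bean,
// but Timestamp (a Date subclass) does.
public class UserWithBirthday {
    private String name;
    private Timestamp birthday;

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public Timestamp getBirthday() { return birthday; }
    public void setBirthday(Timestamp birthday) { this.birthday = birthday; }
}
```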
0x1 POJO to Dataset
1. Table structure
+---------+---------+-------+
| col_name|data_type|comment|
+---------+---------+-------+
|  user_id|   string|   null|
|user_name|   string|   null|
| user_age|      int|   null|
+---------+---------+-------+
2. POJO type
public class User {
String userId;
String userName;
Integer userAge;
...
}
3. Conversion code
// Get the list of users
List<User> users = createUsers();
// Convert to a Dataset with createDataFrame
Dataset<Row> ds = spark.createDataFrame(users, User.class);
// Rename the camelCase columns to snake_case. Note that toDF returns a new
// Dataset rather than modifying ds in place, so the result must be kept.
String[] columns = ds.columns();
String[] newColumns = Arrays.stream(columns)
        .map(column -> camelToUnderline(column))
        .toArray(String[]::new);
Dataset<Row> renamed = ds.toDF(newColumns);
renamed.show();
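The camelToUnderline helper is left to the reader above (the original suggests searching for one online); one possible regex-based sketch, wrapped in a class name of our choosing:

```java
// Sketch of a camelCase -> snake_case renamer for column names,
// e.g. "userName" -> "user_name", "userAge" -> "user_age".
public class ColumnNames {
    public static String camelToUnderline(String name) {
        // Insert an underscore between a lowercase/digit and an uppercase
        // letter, then lowercase the whole string.
        return name.replaceAll("([a-z0-9])([A-Z])", "$1_$2").toLowerCase();
    }
}
```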
Note also:
For types that cannot be converted directly, the JSON detour works here as well:
// Create the user list
List<User> users = createUsers();
// Convert the user list to a list of JSON strings
List<String> jsonList = users.stream()
        .map(JSON::toJSONString)
        .collect(Collectors.toList());
// Wrap the JSON strings in a Dataset<String>
Dataset<String> jsonDataset = spark.createDataset(jsonList, Encoders.STRING());
// Read the JSON into a Dataset<Row>
Dataset<Row> ds = spark.read().json(jsonDataset);
ds.show();
Output result:
+------------+---+----+
| birthday| id|name|
+------------+---+----+
|689875200000| 1| AAA|
|689875200000| 2| BBB|
+------------+---+----+