[Real-time data warehouse] DWM order wide table: requirements analysis and the order / order-detail join source code

I. DWM layer: the order wide table

1 Requirements analysis and approach

Orders are a key object of statistical analysis, and many statistical requirements revolve around order dimensions such as user, region, commodity, category, and brand.

To make later statistical calculations more convenient and to reduce joins between large tables, the data surrounding an order is integrated into an order wide table during real-time computation.

(Figure: fact data flowing through Kafka and dimension data stored in HBase, joined in the DWM layer into the order wide table)

As shown in the figure above, the previous stage already split the data into fact data and dimension data: fact data (green) enters the Kafka data stream (DWD layer), while dimension data (blue) enters HBase for long-term storage. In the DWM layer, the real-time fact data and the dimension data must be joined to form a wide table. Two kinds of association therefore need to be handled: fact data with fact data, and fact data with dimension data.

  • Associating fact data with fact data is a join between streams.
  • Associating fact data with dimension data amounts to querying an external data source from within the stream computation.

2 Implementing the order and order-detail join

(1) Receive order and order-detail data from Kafka's DWD layer

The implementation idea is as follows:

(Figure: implementation idea for reading order and order-detail data from the DWD layer)

a Create the order entity class

package com.hzy.gmall.realtime.beans;

import lombok.Data;
import java.math.BigDecimal;

/**
 * Desc: order entity class
 */
@Data
public class OrderInfo {
    // Field names must match the column names in the database table
    Long id;
    Long province_id;
    String order_status;
    Long user_id;
    BigDecimal total_amount;    // actual amount paid
    BigDecimal activity_reduce_amount;
    BigDecimal coupon_reduce_amount;
    BigDecimal original_total_amount;
    BigDecimal feight_fee;
    String expire_time;
    String create_time;
    String operate_time;
    String create_date; // derived from other fields
    String create_hour;
    Long create_ts; // converted from create_time
}

b Create the order-detail entity class

package com.hzy.gmall.realtime.beans;

import lombok.Data;
import java.math.BigDecimal;

/**
 * Desc: order detail entity class
 */
@Data
public class OrderDetail {
    Long id;
    Long order_id; // no foreign key constraint
    Long sku_id;
    BigDecimal order_price;
    Long sku_num;
    String sku_name; // redundant field; reduces the number of join queries
    String create_time;
    BigDecimal split_total_amount;
    BigDecimal split_activity_amount;
    BigDecimal split_coupon_amount;
    Long create_ts; // converted from create_time
}

c Create OrderWideApp under the dwm package to read order and order-detail data

package com.hzy.gmall.realtime.app.dwm;

import com.alibaba.fastjson.JSON;
import com.hzy.gmall.realtime.beans.OrderDetail;
import com.hzy.gmall.realtime.beans.OrderInfo;
import com.hzy.gmall.realtime.utils.MyKafkaUtil; // assumed location of the project's Kafka utility class
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

import java.text.SimpleDateFormat;

/**
 * Preparation of the order wide table
 */
public class OrderWideApp {

    public static void main(String[] args) throws Exception {
        //TODO 1 Basic environment setup
        //1.1 Stream processing environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        //1.2 Set parallelism (ideally matching the Kafka partition count)
        env.setParallelism(4);

        //TODO 2 Checkpoint settings (omitted)

        //TODO 3 Read data from Kafka
        //3.1 Declare the source topics and the consumer group
        String orderInfoSourceTopic = "dwd_order_info";
        String orderDetailSourceTopic = "dwd_order_detail";
        String groupId = "order_wide_app_group";

        //3.2 Create the Kafka consumer objects
        // order
        FlinkKafkaConsumer<String> orderInfoKafkaSource = MyKafkaUtil.getKafkaSource(orderInfoSourceTopic, groupId);
        // order detail
        FlinkKafkaConsumer<String> orderDetailKafkaSource = MyKafkaUtil.getKafkaSource(orderDetailSourceTopic, groupId);
        //3.3 Read the data and wrap it into streams
        // order stream
        DataStreamSource<String> orderInfoStrDS = env.addSource(orderInfoKafkaSource);
        // order-detail stream
        DataStreamSource<String> orderDetailStrDS = env.addSource(orderDetailKafkaSource);

        //TODO 4 Convert the element type of each stream: String -> entity object
        // order
        SingleOutputStreamOperator<OrderInfo> orderInfoDS = orderInfoStrDS.map(
                new RichMapFunction<String, OrderInfo>() {
                    // SimpleDateFormat is not thread-safe, so create one instance per subtask in open()
                    private SimpleDateFormat sdf;

                    @Override
                    public void open(Configuration parameters) throws Exception {
                        sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
                    }

                    @Override
                    public OrderInfo map(String jsonStr) throws Exception {
                        OrderInfo orderInfo = JSON.parseObject(jsonStr, OrderInfo.class);
                        orderInfo.setCreate_ts(sdf.parse(orderInfo.getCreate_time()).getTime());
                        return orderInfo;
                    }
                }
        );
        // order detail
        SingleOutputStreamOperator<OrderDetail> orderDetailDS = orderDetailStrDS.map(
                new RichMapFunction<String, OrderDetail>() {
                    private SimpleDateFormat sdf;

                    @Override
                    public void open(Configuration parameters) throws Exception {
                        sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
                    }

                    @Override
                    public OrderDetail map(String jsonStr) throws Exception {
                        OrderDetail orderDetail = JSON.parseObject(jsonStr, OrderDetail.class);
                        orderDetail.setCreate_ts(sdf.parse(orderDetail.getCreate_time()).getTime());
                        return orderDetail;
                    }
                }
        );

        orderInfoDS.print("order info:");
        orderDetailDS.print("order detail:");

        env.execute();
    }
}
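Note that create_date and create_hour are declared in OrderInfo as fields derived from other fields, but the map function above never populates them. A minimal sketch of one way to fill them in inside map(), assuming create_time always has the form "yyyy-MM-dd HH:mm:ss":

// Sketch only (not in the original): derive create_date and create_hour from create_time.
OrderInfo orderInfo = JSON.parseObject(jsonStr, OrderInfo.class);
String[] dateTimeArr = orderInfo.getCreate_time().split(" ");
orderInfo.setCreate_date(dateTimeArr[0]);               // "yyyy-MM-dd"
orderInfo.setCreate_hour(dateTimeArr[1].split(":")[0]); // "HH"
orderInfo.setCreate_ts(sdf.parse(orderInfo.getCreate_time()).getTime());
return orderInfo;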

d Test

Start ZooKeeper, Kafka, Maxwell, and HDFS; wait for HDFS to exit safe mode, then start HBase.

Add two rows to the configuration table table_process, as follows:

source_table   operate_type   sink_type   sink_table         sink_pk   sink_extend
order_detail   insert         kafka       dwd_order_detail   id        (NULL)
order_info     insert         kafka       dwd_order_info     id        (NULL)

sink_columns for order_detail:
id,order_id,sku_id,sku_name,img_url,order_price,sku_num,create_time,source_type,source_id,split_total_amount,split_activity_amount,split_coupon_amount

sink_columns for order_info:
id,consignee,consignee_tel,total_amount,order_status,user_id,payment_way,delivery_address,province_id,activity_reduce_amount,coupon_reduce_amount,original_total_amount,feight_fee,feight_fee_reduce,refundable_time

Start BaseDBApp and OrderWideApp, generate simulated business data, and observe the results.

  • Implementation process

Business data generation -> Maxwell synchronization -> Kafka's ods_base_db_m topic -> BaseDBApp splits the stream and writes back to Kafka -> dwd_order_info and dwd_order_detail topics -> OrderWideApp reads the data from Kafka's DWD layer and prints it

(2) Joining orders and order details (dual-stream join)

Stream joins in Flink fall roughly into two types. One is based on time windows (window join), such as join and coGroup. The other is based on state caching, such as intervalJoin (the interval join).

intervalJoin is chosen here because, compared with a window join, it is easier to use and avoids the problem of data to be matched landing in different windows. Its one current limitation is that it does not yet support left joins.

Since the association between the order main table and the order detail table does not require a left join, intervalJoin is the better choice.

For details, see the interval join description in the official Flink documentation.

The condition for combining elements in an interval join is that the keys of the two streams (call them A and B) are equal and the timestamp of the element in B lies within a certain range of the timestamp of the element in A.

This condition can be expressed more formally as b.timestamp ∈ [a.timestamp + lowerBound, a.timestamp + upperBound], or equivalently a.timestamp + lowerBound <= b.timestamp <= a.timestamp + upperBound.

Here a and b are elements in A and B that share the same key. The upper bound and the lower bound can be positive or negative, as long as the lower bound is always less than or equal to the upper bound. Interval join currently only performs inner joins.

When a pair of elements is passed to the ProcessJoinFunction, their timestamp is the maximum of the two elements' timestamps (the timestamp can be accessed via ProcessJoinFunction.Context).

Interval join currently only supports event time.

The way intervalJoin connects data is as follows:

(Figure: interval join of an orange stream and a green stream with lower bound -2 ms and upper bound +1 ms, from the Flink documentation)

In the above example, we join two streams, orange and green, with -2 milliseconds as the lower bound and +1 millisecond as the upper bound. By default, both bounds are inclusive, but .lowerBoundExclusive() and .upperBoundExclusive() can make them exclusive.

The condition represented by the triangles in the figure can also be written as a more formal expression:

orangeElem.ts + lowerBound <= greenElem.ts <= orangeElem.ts + upperBound

For example, with these bounds an orange element with timestamp 4 joins green elements with timestamps in [2, 5].

a Assign watermarks and event-time timestamps

// TODO 5 Assign the watermark and extract the event-time field
// (uses WatermarkStrategy and SerializableTimestampAssigner from org.apache.flink.api.common.eventtime, plus java.time.Duration)
// order
SingleOutputStreamOperator<OrderInfo> orderInfoWithWatermarkDS = orderInfoDS.assignTimestampsAndWatermarks(
        WatermarkStrategy.<OrderInfo>forBoundedOutOfOrderness(Duration.ofSeconds(3))
                .withTimestampAssigner(
                        new SerializableTimestampAssigner<OrderInfo>() {
                            @Override
                            public long extractTimestamp(OrderInfo orderInfo, long recordTimestamp) {
                                return orderInfo.getCreate_ts();
                            }
                        }
                )
);
// order detail
SingleOutputStreamOperator<OrderDetail> orderDetailWithWatermarkDS = orderDetailDS.assignTimestampsAndWatermarks(
        WatermarkStrategy.<OrderDetail>forBoundedOutOfOrderness(Duration.ofSeconds(3))
                .withTimestampAssigner(
                        new SerializableTimestampAssigner<OrderDetail>() {
                            @Override
                            public long extractTimestamp(OrderDetail orderDetail, long recordTimestamp) {
                                return orderDetail.getCreate_ts();
                            }
                        }
                )
);
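One optional refinement, not part of the original code: if a Kafka partition goes idle, the watermark of the corresponding subtask stops advancing and can stall the interval join downstream. Flink's WatermarkStrategy.withIdleness can mark such sources idle, as in this sketch (the one-minute timeout is an arbitrary assumption):

// Sketch only: the same strategy as above, plus idleness handling.
WatermarkStrategy<OrderInfo> strategyWithIdleness =
        WatermarkStrategy.<OrderInfo>forBoundedOutOfOrderness(Duration.ofSeconds(3))
                .withIdleness(Duration.ofMinutes(1)) // assumed timeout: mark a source idle after 1 min without data
                .withTimestampAssigner((orderInfo, recordTimestamp) -> orderInfo.getCreate_ts());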

b Create the merged wide-table entity class

package com.hzy.gmall.realtime.beans;

import lombok.AllArgsConstructor;
import lombok.Data;
import org.apache.commons.lang3.ObjectUtils;

import java.math.BigDecimal;

/**
 * Desc: entity class for the wide table joining order and order detail
 */
@Data
@AllArgsConstructor
public class OrderWide {
    Long detail_id;
    Long order_id;
    Long sku_id;
    BigDecimal order_price;
    Long sku_num;
    String sku_name;
    Long province_id;
    String order_status;
    Long user_id;

    BigDecimal total_amount;
    BigDecimal activity_reduce_amount;
    BigDecimal coupon_reduce_amount;
    BigDecimal original_total_amount;
    BigDecimal feight_fee;
    BigDecimal split_feight_fee;
    BigDecimal split_activity_amount;
    BigDecimal split_coupon_amount;
    BigDecimal split_total_amount;

    String expire_time;
    String create_time;
    String operate_time;
    String create_date; // derived from other fields
    String create_hour;

    String province_name; // obtained by querying the dimension table
    String province_area_code;
    String province_iso_code;
    String province_3166_2_code;

    Integer user_age;
    String user_gender;

    Long spu_id; // dimension data, to be joined in later
    Long tm_id;
    Long category3_id;
    String spu_name;
    String tm_name;
    String category3_name;

    public OrderWide(OrderInfo orderInfo, OrderDetail orderDetail) {
        mergeOrderInfo(orderInfo);
        mergeOrderDetail(orderDetail);
    }

    // Copy the order fields into the wide table
    public void mergeOrderInfo(OrderInfo orderInfo) {
        if (orderInfo != null) {
            this.order_id = orderInfo.id;
            this.order_status = orderInfo.order_status;
            this.create_time = orderInfo.create_time;
            this.create_date = orderInfo.create_date;
            this.activity_reduce_amount = orderInfo.activity_reduce_amount;
            this.coupon_reduce_amount = orderInfo.coupon_reduce_amount;
            this.original_total_amount = orderInfo.original_total_amount;
            this.feight_fee = orderInfo.feight_fee;
            this.total_amount = orderInfo.total_amount;
            this.province_id = orderInfo.province_id;
            this.user_id = orderInfo.user_id;
        }
    }

    // Copy the order-detail fields into the wide table
    public void mergeOrderDetail(OrderDetail orderDetail) {
        if (orderDetail != null) {
            this.detail_id = orderDetail.id;
            this.sku_id = orderDetail.sku_id;
            this.sku_name = orderDetail.sku_name;
            this.order_price = orderDetail.order_price;
            this.sku_num = orderDetail.sku_num;
            this.split_activity_amount = orderDetail.split_activity_amount;
            this.split_coupon_amount = orderDetail.split_coupon_amount;
            this.split_total_amount = orderDetail.split_total_amount;
        }
    }

    // firstNonNull returns the first non-null argument
    public void mergeOtherOrderWide(OrderWide otherOrderWide) {
        this.order_status = ObjectUtils.firstNonNull(this.order_status, otherOrderWide.order_status);
        this.create_time = ObjectUtils.firstNonNull(this.create_time, otherOrderWide.create_time);
        this.create_date = ObjectUtils.firstNonNull(this.create_date, otherOrderWide.create_date);
        this.coupon_reduce_amount = ObjectUtils.firstNonNull(this.coupon_reduce_amount, otherOrderWide.coupon_reduce_amount);
        this.activity_reduce_amount = ObjectUtils.firstNonNull(this.activity_reduce_amount, otherOrderWide.activity_reduce_amount);
        this.original_total_amount = ObjectUtils.firstNonNull(this.original_total_amount, otherOrderWide.original_total_amount);
        this.feight_fee = ObjectUtils.firstNonNull(this.feight_fee, otherOrderWide.feight_fee);
        this.total_amount = ObjectUtils.firstNonNull(this.total_amount, otherOrderWide.total_amount);
        this.user_id = ObjectUtils.<Long>firstNonNull(this.user_id, otherOrderWide.user_id);
        this.sku_id = ObjectUtils.firstNonNull(this.sku_id, otherOrderWide.sku_id);
        this.sku_name = ObjectUtils.firstNonNull(this.sku_name, otherOrderWide.sku_name);
        this.order_price = ObjectUtils.firstNonNull(this.order_price, otherOrderWide.order_price);
        this.sku_num = ObjectUtils.firstNonNull(this.sku_num, otherOrderWide.sku_num);
        this.split_activity_amount = ObjectUtils.firstNonNull(this.split_activity_amount, otherOrderWide.split_activity_amount);
        this.split_coupon_amount = ObjectUtils.firstNonNull(this.split_coupon_amount, otherOrderWide.split_coupon_amount);
        this.split_total_amount = ObjectUtils.firstNonNull(this.split_total_amount, otherOrderWide.split_total_amount);
    }
}
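A quick illustrative sketch, not from the original, of how the merge methods compose: two partially filled wide records, one carrying only order fields and one carrying only detail fields, can be combined with mergeOtherOrderWide, which keeps the first non-null value per field. Here orderInfo and orderDetail are assumed to be already-parsed entities:

OrderWide fromOrder = new OrderWide(orderInfo, null);    // detail-side fields remain null
OrderWide fromDetail = new OrderWide(null, orderDetail); // order-side fields remain null
fromOrder.mergeOtherOrderWide(fromDetail);               // fills each null field from the other record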

c Specify the join key

// TODO 6 Specify the join field of the two streams via keyBy -- order_id
// order (keyed by its primary key, id)
KeyedStream<OrderInfo, Long> orderInfoKeyedDS = orderInfoWithWatermarkDS.keyBy(OrderInfo::getId);
// order detail (keyed by the foreign key, order_id)
KeyedStream<OrderDetail, Long> orderDetailKeyedDS = orderDetailWithWatermarkDS.keyBy(OrderDetail::getOrder_id);

d Join orders and order details with intervalJoin

A bound of plus or minus 5 seconds is used here to tolerate the time difference between the main-table and detail-table records in the business system.

// TODO 7 Dual-stream join using intervalJoin
// Join order (the "one" side) with order detail (the "many" side)
// (Time is org.apache.flink.streaming.api.windowing.time.Time)
SingleOutputStreamOperator<OrderWide> orderWideDS = orderInfoKeyedDS
        .intervalJoin(orderDetailKeyedDS)
        .between(Time.seconds(-5), Time.seconds(5))
        .process(
                new ProcessJoinFunction<OrderInfo, OrderDetail, OrderWide>() {
                    @Override
                    public void processElement(OrderInfo orderInfo, OrderDetail orderDetail, Context ctx, Collector<OrderWide> out) throws Exception {
                        out.collect(new OrderWide(orderInfo, orderDetail));
                    }
                }
        );
orderWideDS.print(">>>");
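The second kind of association from section 1, fact data with dimension data, is not implemented in this section. As a rough sketch of the query-external-source pattern it describes, the following looks up the province dimension from HBase via Phoenix inside a RichMapFunction. The ZooKeeper hosts, the DIM_BASE_PROVINCE table, and its columns are assumptions for illustration, not the project's actual code:

// Sketch only; requires java.sql.* imports and the Phoenix JDBC driver on the classpath.
SingleOutputStreamOperator<OrderWide> orderWideWithProvinceDS = orderWideDS.map(
        new RichMapFunction<OrderWide, OrderWide>() {
            private Connection conn;

            @Override
            public void open(Configuration parameters) throws Exception {
                // open one Phoenix connection per subtask (host names are placeholders)
                Class.forName("org.apache.phoenix.jdbc.PhoenixDriver");
                conn = DriverManager.getConnection("jdbc:phoenix:hadoop102,hadoop103,hadoop104:2181");
            }

            @Override
            public OrderWide map(OrderWide orderWide) throws Exception {
                // query the (hypothetical) province dimension table by id
                try (PreparedStatement ps = conn.prepareStatement(
                        "select NAME, AREA_CODE, ISO_CODE from DIM_BASE_PROVINCE where ID = ?")) {
                    ps.setLong(1, orderWide.getProvince_id());
                    ResultSet rs = ps.executeQuery();
                    if (rs.next()) {
                        orderWide.setProvince_name(rs.getString(1));
                        orderWide.setProvince_area_code(rs.getString(2));
                        orderWide.setProvince_iso_code(rs.getString(3));
                    }
                }
                return orderWide;
            }
        }
);

A per-record synchronous lookup like this is the simplest form; in practice a cache or Flink's async I/O would normally be layered on top.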

e Test

What has been implemented so far:

  • Basic environment preparation

  • Checkpoint setup

  • Read the two streams from Kafka and fill in the creation timestamp (create_ts) while converting the structure

  • Specify the join field via keyBy -- order_id

  • Dual-stream join:

    A.intervalJoin(B)
     .between(lowerBound, upperBound)
     .process()
    
  • Test, the same as in (1) d

Origin blog.csdn.net/weixin_43923463/article/details/128322073