0503 - Data Warehouse Data Collection

Chapter 1: User Behavior Data Collection

1.1 Flume Collection


  1. Source
    Taildir Source
    Before Flume 1.7, monitoring new content appended to a file was usually done with an exec source running tail, but this has a drawback: if the server crashes and restarts, reading starts over from the beginning of the file, which is clearly not what we want.
    Before Flume 1.7 the usual workaround was: after reading a record, write its line number to a file; on restart after a crash, read the last processed line number back from that file and resume monitoring from there, so that data is neither lost nor duplicated. (A minimal sketch of this idea appears after this list.)
    Flume 1.7 added the taildir source type, which can monitor multiple files under a directory and keeps a persistent record of the read positions, so collection resumes from the last checkpoint instead of starting over.
    Note, however, that in Flume 1.7 a renamed file is treated as a new file and is re-collected.
  2. Channel
    (1) Memory Channel
    Memory Channel keeps events in an in-memory queue with an upper bound on the number of events it can hold. Because all event data lives in memory, Memory Channel has the best performance, but it also carries a risk of data loss: if Flume crashes or restarts, every event still in the channel is lost. In addition, because memory is limited, events can also be lost when the event count or memory usage reaches the configured capacity.
    (2) File Channel
    File Channel stores events on local disk, which gives better reliability and recoverability than Memory Channel, but the local file I/O makes it slower.
    (3) Kafka Channel
    Kafka Channel stores events in a Kafka cluster, offering better performance than File Channel and higher reliability than Memory Channel.
  3. Sink
    (1) Avro Sink
    Avro Sink is a key part of Flume's tiered collection topology. Flume events sent to this sink are turned into Avro events and forwarded to the hostname/port pair given in the configuration. Events are taken from the configured channel in batches of the configured batch size.
    (2) Kafka Sink
    Kafka Sink uses the topic and key attributes in the FlumeEvent headers to route each event to Kafka. If the event header contains a topic attribute, the event is sent to that topic. If the header contains a key attribute, it is used to choose the partition for the event's data: events with the same key end up in the same partition, and if the key is null the event is sent to a random partition. A custom interceptor can be used to set the topic or key header of an event.
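As a concrete illustration of the pre-1.7 workaround described under Taildir Source above, here is a minimal shell sketch of the idea; the log and position file paths are hypothetical, and the downstream hand-off is left as a placeholder.

#!/bin/bash
# Minimal sketch of the pre-Flume-1.7 workaround: persist the last read
# line number and resume from it after a restart (paths are hypothetical).
LOG_FILE=/tmp/logs/app.log
POS_FILE=/tmp/logs/app.log.pos

# Resume from the last recorded line number (0 on the first run).
last_line=$(cat "$POS_FILE" 2>/dev/null || echo 0)

tail -n +$((last_line + 1)) -F "$LOG_FILE" | while read -r line; do
    echo "$line"                      # hand the record to the downstream collector here
    last_line=$((last_line + 1))
    echo "$last_line" > "$POS_FILE"   # record the position so a restart neither re-reads nor skips data
done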

1.1.1 Flume Interceptors

Two custom interceptors are defined:

  1. ETL interceptor: filters out log records with an invalid timestamp or incomplete JSON data
  2. Log-type interceptor: separates error logs, start logs, and event logs so they can be sent to different Kafka topics
  1. ETL interceptor
package com.lz.flume.interceptor;

import org.apache.commons.lang.math.NumberUtils;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

/**
 * @ClassName LogETLInterceptor
 * @Description: ETL interceptor that drops events with an invalid timestamp or malformed JSON body
 * @Author MAlone
 * @Date 2019/12/19
 * @Version V1.0
 **/
public class LogETLInterceptor implements Interceptor {
    @Override
    public void initialize() {

    }

    @Override
    public Event intercept(Event event) {

        String body = new String(event.getBody(), Charset.forName("UTF-8"));

        // The body format is expected to be: timestamp|json
        String[] logArray = body.split("\\|");
        if (logArray.length < 2) {
            return null;
        }

        // The timestamp must be a 13-digit (millisecond) number
        if (logArray[0].length() != 13 || !NumberUtils.isDigits(logArray[0])) {
            return null;
        }

        // The JSON part must at least start with '{' and end with '}'
        if (!logArray[1].trim().startsWith("{") || !logArray[1].trim().endsWith("}")) {
            return null;
        }

        return event;
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        ArrayList<Event> eventsToBack = new ArrayList<>();

        for (Event event : events) {
            Event eventToBack = intercept(event);
            if (eventToBack != null) {
                eventsToBack.add(eventToBack);
            }
        }

        return eventsToBack;
    }

    @Override
    public void close() {

    }

    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new LogETLInterceptor();
        }

        @Override
        public void configure(Context context) {

        }
    }

}
  2. Log-type interceptor
package com.lz.flume.interceptor;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/**
 * @ClassName LogTypeInterceptor
 * @Description: Log-type interceptor that tags each event with a logType header ("start" or "event")
 * @Author MAlone
 * @Date 2019/12/19
 * @Version V1.0
 **/
public class LogTypeInterceptor implements Interceptor {
    @Override
    public void initialize() {

    }

    @Override
    public Event intercept(Event event) {

        // 1. Get the event headers received by Flume
        Map<String, String> headers = event.getHeaders();
        // 2. Get the JSON body received by Flume
        byte[] json = event.getBody();
        // 3. Convert the byte array to a string
        String jsonStr = new String(json);

        // A start log contains "start"; everything else is treated as an event log
        String logType = "";
        if (jsonStr.contains("start")) {
            logType = "start";
        } else {
            logType = "event";
        }
        // 4. Store the log type in the Flume event header
        headers.put("logType", logType);
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> events) {

        ArrayList<Event> eventsToBack = new ArrayList<>();

        for (Event event : events) {
            Event eventToBack = intercept(event);
            if (eventToBack != null) {
                eventsToBack.add(eventToBack);
            }
        }

        return eventsToBack;
    }

    @Override
    public void close() {

    }

    public static class Builder implements Interceptor.Builder{

        @Override
        public Interceptor build() {
            return new LogTypeInterceptor();
        }

        @Override
        public void configure(Context context) {

        }
    }
}
  3. Packaging

After packaging the interceptors, only the interceptor jar itself needs to be uploaded; its dependency jars do not. The jar must then be placed in Flume's lib directory.
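A minimal sketch of the build-and-deploy step, assuming a Maven build and the Flume install path used elsewhere in this article (the jar name is hypothetical):

# Build the interceptor jar in the interceptor project directory.
mvn clean package

# Copy the jar (without its dependencies) into Flume's lib directory.
cp target/flume-interceptor-1.0-SNAPSHOT.jar /opt/module/flume/lib/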

1.1.2 Flume Configuration

  • file-flume-kafka.conf
a1.sources=r1
a1.channels=c1 c2 
a1.sinks=k1 k2 

# configure source
a1.sources.r1.type = TAILDIR
a1.sources.r1.positionFile = /opt/module/flume/log_position.json
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /tmp/logs/app.+
a1.sources.r1.fileHeader = true
a1.sources.r1.channels = c1 c2

#interceptor
a1.sources.r1.interceptors = i1 i2
a1.sources.r1.interceptors.i1.type = com.lz.flume.interceptor.LogETLInterceptor$Builder
a1.sources.r1.interceptors.i2.type = com.lz.flume.interceptor.LogTypeInterceptor$Builder

# selector
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = logType
a1.sources.r1.selector.mapping.start = c1
a1.sources.r1.selector.mapping.event = c2

# configure channel
a1.channels.c1.type = memory
a1.channels.c1.capacity=10000
a1.channels.c1.byteCapacityBufferPercentage=20

a1.channels.c2.type = memory
a1.channels.c2.capacity=10000
a1.channels.c2.byteCapacityBufferPercentage=20

# configure sink
# start-sink
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = tstart
a1.sinks.k1.kafka.bootstrap.servers = node11:9092,node12:9092,node13:9092
a1.sinks.k1.kafka.flumeBatchSize = 2000
a1.sinks.k1.kafka.producer.acks = 1
a1.sinks.k1.channel = c1

# event-sink
a1.sinks.k2.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k2.kafka.topic = tevent
a1.sinks.k2.kafka.bootstrap.servers = node11:9092,node12:9092,node13:9092
a1.sinks.k2.kafka.flumeBatchSize = 2000
a1.sinks.k2.kafka.producer.acks = 1
a1.sinks.k2.channel = c2

Remember to distribute the configuration to node12 as well.
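For example, assuming node12 uses the same install paths (the interceptor jar name is hypothetical):

scp /opt/module/flume/conf/file-flume-kafka.conf node12:/opt/module/flume/conf/
scp /opt/module/flume/lib/flume-interceptor-1.0-SNAPSHOT.jar node12:/opt/module/flume/lib/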

1.1.3 Flume Collection Script

  • f1.sh
#! /bin/bash

case $1 in
"start"){
        for i in node11 node12
        do
                echo " --------启动 $i 采集flume-------"
                ssh $i "nohup /opt/module/flume/bin/flume-ng agent --conf-file /opt/module/flume/conf/file-flume-kafka.conf --name a1 -Dflume.root.logger=INFO,LOGFILE >/dev/null 2>&1 &"
        done
};;
"stop"){
        for i in node11 node12
        do
                echo " --------停止 $i 采集flume-------"
                ssh $i "ps -ef | grep file-flume-kafka | grep -v grep |awk '{print \$2}' | xargs kill"
        done

};;
esac

1.2 Kafka

1.2.1 Kafka Cluster Start/Stop Script

#! /bin/bash

case $1 in
"start"){
        for i in node11 node12 node13
        do
                echo " --------启动 $i kafka-------"
                # 用于KafkaManager监控

                ssh $i "export JMX_PORT=9988 && /opt/module/kafka/bin/kafka-server-start.sh -daemon /opt/module/kafka/config/server.properties "
        done
};;
"stop"){
        for i in node11 node12 node13
        do
                echo " --------停止 $i kafka-------"
                ssh $i "ps -ef | grep server.properties | grep -v grep| awk '{print $2}' | xargs kill >/dev/null 2>&1 &"
        done
};;
esac

1.2.2 Testing the Data Coming from Flume

  1. Create the topics
  • Create the start-log topic
[yanlzh@node11 kafka]$ bin/kafka-topics.sh --zookeeper node11:2181,node12:2181,node13:2181  --create --replication-factor 1 --partitions 1 --topic tstart
  • Create the event-log topic
[yanlzh@node11 kafka]$ bin/kafka-topics.sh --zookeeper node11:2181,node12:2181,node13:2181  --create --replication-factor 1 --partitions 1 --topic tevent
  2. Run f1.sh to start collecting data
  3. Consume the data
[yanlzh@node11 kafka]$ bin/kafka-console-consumer.sh --zookeeper node11:2181 --from-beginning --topic tstart

1.3 Flume Consuming Kafka Data and Writing to HDFS


1.3.1 Flume Configuration

  • kafka-flume-hdfs.conf
## components
a1.sources=r1 r2
a1.channels=c1 c2
a1.sinks=k1 k2

## source1
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.batchSize = 5000
a1.sources.r1.batchDurationMillis = 2000
a1.sources.r1.kafka.bootstrap.servers = node11:9092,node12:9092,node13:9092
a1.sources.r1.kafka.zookeeperConnect = node11:2181,node12:2181,node13:2181
a1.sources.r1.kafka.topics=tstart

## source2
a1.sources.r2.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r2.batchSize = 5000
a1.sources.r2.batchDurationMillis = 2000
a1.sources.r2.kafka.bootstrap.servers = node11:9092,node12:9092,node13:9092
a1.sources.r2.kafka.zookeeperConnect = node11:2181,node12:2181,node13:2181
a1.sources.r2.kafka.topics=tevent

## channel1
a1.channels.c1.type=memory
a1.channels.c1.capacity=100000
a1.channels.c1.transactionCapacity=10000

## channel2
a1.channels.c2.type=memory
a1.channels.c2.capacity=100000
a1.channels.c2.transactionCapacity=10000

## sink1
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /origin_data/gmall/log/topic_start/%Y-%m-%d
a1.sinks.k1.hdfs.filePrefix = logstart-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 30
a1.sinks.k1.hdfs.roundUnit = second

##sink2
a1.sinks.k2.type = hdfs
a1.sinks.k2.hdfs.path = /origin_data/gmall/log/topic_event/%Y-%m-%d
a1.sinks.k2.hdfs.filePrefix = logevent-
a1.sinks.k2.hdfs.round = true
a1.sinks.k2.hdfs.roundValue = 30
a1.sinks.k2.hdfs.roundUnit = second

## avoid generating large numbers of small files: roll by time only
a1.sinks.k1.hdfs.rollInterval = 30
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 0

a1.sinks.k2.hdfs.rollInterval = 30
a1.sinks.k2.hdfs.rollSize = 0
a1.sinks.k2.hdfs.rollCount = 0

## write compressed output files (lzop)
a1.sinks.k1.hdfs.fileType = CompressedStream 
a1.sinks.k2.hdfs.fileType = CompressedStream 

a1.sinks.k1.hdfs.codeC = lzop
a1.sinks.k2.hdfs.codeC = lzop

## wire sources and sinks to channels
a1.sources.r1.channels = c1
a1.sinks.k1.channel= c1

a1.sources.r2.channels = c2
a1.sinks.k2.channel= c2

1.3.2 Flume Consumer Script

  • f2.sh
#! /bin/bash

case $1 in
"start"){
        for i in node13
        do
                echo " --------启动 $i 消费flume-------"
                ssh $i "nohup /opt/module/flume/bin/flume-ng agent --conf-file /opt/module/flume/conf/kafka-flume-hdfs.conf --name a1 -Dflume.root.logger=INFO,LOGFILE >/opt/module/flume/log.txt   2>&1 &"
        done
};;
"stop"){
        for i in node13
        do
                echo " --------停止 $i 消费flume-------"
                ssh $i "ps -ef | grep kafka-flume-hdfs | grep -v grep |awk '{print \$2}' | xargs kill"
        done

};;
esac

1.4 Collection Pipeline Start/Stop Script

#! /bin/bash

case $1 in
"start"){
	echo " -------- 启动 集群 -------"

	echo " -------- 启动 hadoop集群 -------"
	/opt/module/hadoop-2.7.2/sbin/start-dfs.sh 
	ssh node12 "/opt/module/hadoop-2.7.2/sbin/start-yarn.sh"

	#启动 Zookeeper集群
	zk.sh start

	#启动 Flume采集集群
	f1.sh start

	#启动 Kafka采集集群
	kf.sh start

sleep 4s;

	#启动 Flume消费集群
	f2.sh start

};;
"stop"){
        echo " -------- 停止 集群 -------"

    #停止 Flume消费集群
	f2.sh stop

	#停止 Kafka采集集群
	kf.sh stop

    sleep 4s;

	#停止 Flume采集集群
	f1.sh stop

	#停止 Zookeeper集群
	zk.sh stop

	echo " -------- 停止 hadoop集群 -------"
	ssh node12 "/opt/module/hadoop-2.7.2/sbin/stop-yarn.sh"
	/opt/module/hadoop-2.7.2/sbin/stop-dfs.sh 
};;
esac
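Assuming the script above is saved as cluster.sh (the name is an assumption; it is not given here) and sits on the PATH together with zk.sh, f1.sh, kf.sh, and f2.sh, the whole collection pipeline is started and stopped with:

cluster.sh start    # HDFS/YARN, Zookeeper, Flume collection, Kafka, Flume consumer
cluster.sh stop     # stops the same components in roughly the reverse order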


Chapter 2: Business Data Collection

2.1 Sqoop Import Command

/opt/module/sqoop/bin/sqoop import \
--connect  \
--username  \
--password  \
--target-dir  \
--delete-target-dir \
--num-mappers   \
--fields-terminated-by   \
--query   "$2"' and  $CONDITIONS;'

2.2 Sqoop Scheduled Import Script

  • sqoop.import.sh
#!/bin/bash

db_date=$2
echo $db_date
db_name=gmall

import_data() {
/opt/module/sqoop/bin/sqoop import \
--connect jdbc:mysql://node11:3306/$db_name \
--username root \
--password 1229 \
--target-dir  /origin_data/$db_name/db/$1/$db_date \
--delete-target-dir \
--num-mappers 1 \
--fields-terminated-by "\t" \
--query   "$2"' and  $CONDITIONS;'
}

import_sku_info(){
  import_data  "sku_info"  "select 
id, spu_id, price, sku_name, sku_desc, weight, tm_id,
category3_id, create_time 
  from sku_info  where 1=1"
}

import_user_info(){
  import_data "user_info" "select 
id, name, birthday, gender, email, user_level, 
create_time 
from user_info where 1=1"
}

import_base_category1(){
  import_data "base_category1" "select 
id, name from base_category1 where 1=1"
}

import_base_category2(){
  import_data "base_category2" "select 
id, name, category1_id from base_category2 where 1=1"
}

import_base_category3(){
  import_data "base_category3" "select id, name, category2_id from base_category3 where 1=1"
}

import_order_detail(){
  import_data   "order_detail"   "select 
    od.id, 
    order_id, 
    user_id, 
    sku_id, 
    sku_name, 
    order_price, 
    sku_num, 
    o.create_time  
  from order_info o , order_detail od 
  where o.id=od.order_id 
  and DATE_FORMAT(create_time,'%Y-%m-%d')='$db_date'"
}

import_payment_info(){
  import_data  "payment_info"   "select 
    id,  
    out_trade_no, 
    order_id, 
    user_id, 
    alipay_trade_no, 
    total_amount,  
    subject , 
    payment_type, 
    payment_time 
  from payment_info 
  where DATE_FORMAT(payment_time,'%Y-%m-%d')='$db_date'"
}

import_order_info(){
  import_data   "order_info"   "select 
    id, 
    total_amount, 
    order_status, 
    user_id, 
    payment_way, 
    out_trade_no, 
    create_time, 
    operate_time  
  from order_info 
  where  (DATE_FORMAT(create_time,'%Y-%m-%d')='$db_date' or DATE_FORMAT(operate_time,'%Y-%m-%d')='$db_date')"
}

case $1 in
  "base_category1")
     import_base_category1
;;
  "base_category2")
     import_base_category2
;;
  "base_category3")
     import_base_category3
;;
  "order_info")
     import_order_info
;;
  "order_detail")
     import_order_detail
;;
  "sku_info")
     import_sku_info
;;
  "user_info")
     import_user_info
;;
  "payment_info")
     import_payment_info
;;
   "all")
   import_base_category1
   import_base_category2
   import_base_category3
   import_order_info
   import_order_detail
   import_sku_info
   import_user_info
   import_payment_info
;;
esac

2.3 Running the Script

[yanlzh@node11 bin]$ sqoop.import.sh all 2019-12-19
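The all argument imports every table for the given date; a single table can be imported by passing its name as the first argument instead, for example:

[yanlzh@node11 bin]$ sqoop.import.sh order_info 2019-12-19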
