Streaming in Practice


 

Developing a Python log generator that produces access URLs, IP addresses, timestamps, HTTP request headers, and search keywords:

 

 

Based on the main site's URLs and IP addresses we can see which pages are visited, and then generate IPs and URLs on a schedule to match the actual traffic pattern.


Use a scheduling tool to generate a batch of data every minute.

Linux crontab

Crontab expression tool: http://tool.lu/crontab

Crontab expression that runs once every minute: */1 * * * *

 

The Python code written on Linux (Python 2). The sample-data lists at the top are illustrative placeholders; substitute URLs and search keywords from your own site:

# -*- coding: utf-8 -*-
import random
import time

# illustrative sample data; replace with values from your own site
url_paths = ["class/112.html", "class/128.html", "class/145.html", "learn/821", "course/list"]

ip_slices = [132, 156, 124, 10, 29, 167, 143, 187, 30, 46, 55, 63, 72, 87, 98]

http_referers = [
        "http://www.baidu.com/s?wd={query}",
        "https://www.sogou.com/web?query={query}",
        "http://cn.bing.com/search?q={query}"
]

search_keyword = ["Spark", "Hadoop", "Storm", "Spark Streaming", "Flume"]

status_codes = ["200", "404", "500"]


def sample_url():
        return random.sample(url_paths, 1)[0]


def sample_ip():
        # build an address such as "132.156.124.10" from four random slices
        slice = random.sample(ip_slices, 4)
        return ".".join([str(item) for item in slice])


def sample_referer():
        # roughly 80% of the requests carry no referer
        if random.uniform(0, 1) > 0.2:
                return "-"

        refer_str = random.sample(http_referers, 1)
        query_str = random.sample(search_keyword, 1)
        return refer_str[0].format(query=query_str[0])


def sample_status_code():
        return random.sample(status_codes, 1)[0]


def generate_log(count=10):
        time_str = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())

        # "w+" overwrites access.log on every run
        f = open("/home/hadoop/data/project/logs/access.log", "w+")

        while count >= 1:
                query_log = "{ip}\t{local_time}\t\"GET /{url} HTTP/1.1\"\t{status_code}\t{referer}".format(
                        url=sample_url(), ip=sample_ip(), referer=sample_referer(),
                        status_code=sample_status_code(), local_time=time_str)
                f.write(query_log + "\n")
                count = count - 1

        f.close()


if __name__ == '__main__':
        generate_log(100)

 

Create a log_generator.sh script and add the command to execute:
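A minimal sketch of what log_generator.sh might contain, assuming the generator above is saved as generate_log.py under /home/hadoop/data/project (both the file name and the path are assumptions; adjust them to your layout):

python /home/hadoop/data/project/generate_log.py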

 

Change the script's permissions so that it is executable:
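For example, assuming the script sits in the project directory:

chmod u+x /home/hadoop/data/project/log_generator.sh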

 

Use crontab -e to add an entry that runs the script:
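The entry combines the every-minute expression above with the script path (the path is the same assumption as before):

*/1 * * * * /home/hadoop/data/project/log_generator.sh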

 

Once the logs are being generated:
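To confirm that new lines are being written you can, for example, run tail -f /home/hadoop/data/project/logs/access.log; each line follows the generator's format: ip, local time, "GET /url HTTP/1.1", status code, and referer, separated by tabs.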

 

Feeding the logs produced by the Python log generator into Flume
Streaming_project.conf
Design choice: access.log ==> console output



Streaming_project.conf:

exec-memory-logger.sources = exec-source
exec-memory-logger.sinks = logger-sink
exec-memory-logger.channels = memory-channel

exec-memory-logger.sources.exec-source.type = exec
exec-memory-logger.sources.exec-source.command = tail -F /home/hadoop/data/project/logs/access.log
exec-memory-logger.sources.exec-source.shell = /bin/sh -c

exec-memory-logger.channels.memory-channel.type = memory

exec-memory-logger.sinks.logger-sink.type = logger

exec-memory-logger.sources.exec-source.channels = memory-channel
exec-memory-logger.sinks.logger-sink.channel = memory-channel


Start Flume (note: there must be no space after the equals sign here; it is easy to add one out of coding habit):
flume-ng agent --name exec-memory-logger --conf $FLUME_HOME/conf --conf-file /home/hadoop/data/project/streaming_project.conf -Dflume.root.logger=INFO,console

Logs ==> Flume ==> Kafka

  • Start ZooKeeper: ./zkServer.sh start
  • Start the Kafka server:
  • Modify the Flume configuration so that Flume sinks the data to Kafka

Kafka's server.properties:
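For a single-node setup the defaults largely work; the entries usually worth checking are shown below (the host name follows the hadoop000 host used elsewhere in this article, and the log directory is an assumption):

broker.id=0
listeners=PLAINTEXT://:9092
host.name=hadoop000
log.dirs=/home/hadoop/app/tmp/kafka-logs
zookeeper.connect=hadoop000:2181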

When starting Kafka, start ZooKeeper first:

./zkServer.sh start

Then start Kafka:

./kafka-server-start.sh -daemon /home/hadoop/app/kafka_2.11-0.9.0.0/config/server.properties
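If the streamingtopic1 topic used below does not exist yet, create it first (single-node replication and partition settings assumed):

./kafka-topics.sh --create --zookeeper hadoop000:2181 --replication-factor 1 --partitions 1 --topic streamingtopic1

./kafka-topics.sh --list --zookeeper hadoop000:2181 then lists the available topics.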

Streaming_project2.conf: 


exec-memory-kafka.sources = exec-source
exec-memory-kafka.sinks = kafka-sink
exec-memory-kafka.channels = memory-channel

exec-memory-kafka.sources.exec-source.type = exec
exec-memory-kafka.sources.exec-source.command = tail -F /home/hadoop/data/project/logs/access.log
exec-memory-kafka.sources.exec-source.shell = /bin/sh -c

exec-memory-kafka.channels.memory-channel.type = memory

exec-memory-kafka.sinks.kafka-sink.type = org.apache.flume.sink.kafka.KafkaSink
exec-memory-kafka.sinks.kafka-sink.brokerList = hadoop000:9092
exec-memory-kafka.sinks.kafka-sink.topic = streamingtopic1
exec-memory-kafka.sinks.kafka-sink.batchSize = 5
exec-memory-kafka.sinks.kafka-sink.requiredAcks = 1

exec-memory-kafka.sources.exec-source.channels = memory-channel
exec-memory-kafka.sinks.kafka-sink.channel = memory-channel

 

Start Flume:

flume-ng agent --name exec-memory-kafka --conf $FLUME_HOME/conf --conf-file /home/hadoop/data/project/streaming_project2.conf -Dflume.root.logger=INFO,console


After Flume is started, we need a client that can consume from Kafka.

Start a Kafka console consumer:

kafka-console-consumer.sh --zookeeper hadoop000:2181 --topic streamingtopic1

The data received by Flume after startup:

The Kafka consumer consuming the data:

 

Consuming the Kafka data with Spark Streaming:

Configuration parameters:
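A minimal PySpark sketch of this connection, assuming the spark-streaming-kafka-0-8 connector (which also works against the 0.9 broker used above); the application name, batch interval, and the trivial count are placeholders rather than the final project code:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

# one-minute batches, matching the one-minute log generation schedule
sc = SparkContext(appName="StreamingProject")
ssc = StreamingContext(sc, 60)

# read the messages Flume wrote to Kafka directly from the broker
stream = KafkaUtils.createDirectStream(
        ssc, ["streamingtopic1"], {"metadata.broker.list": "hadoop000:9092"})

# each record is a (key, value) pair; the value is one access-log line
stream.map(lambda record: record[1]).count().pprint()

ssc.start()
ssc.awaitTermination()

It can be submitted with, for example, spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.2.0 so that the Kafka connector is on the classpath (match the version to your Spark build).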

The data-processing steps are to be continued!
