Developing a log generator in Python: producing access URLs, IP addresses, timestamps, HTTP referers, and search keywords.
Page-view information can be derived from the site's URLs and visitor IP addresses, so the generator produces realistic IPs and URLs on a fixed schedule to simulate real traffic.
Generate a batch of data every minute with a scheduled job.
Linux crontab
Crontab expression for running once a minute: */1 * * * *
The Python 2 code written on the Linux machine (the data pools at the top of the script are assumed example values, since the post does not show them; substitute your own URLs and keywords):
import random
import time

# Data pools for sampling (assumed example values; the originals are not shown in the post)
url_paths = ["class/112.html", "class/128.html", "learn/821", "course/list"]
ip_slices = [132, 156, 124, 10, 29, 167, 143, 187, 30, 46, 55, 63, 72, 98, 168]
http_referers = ["http://www.baidu.com/s?wd={query}", "https://www.sogou.com/web?query={query}"]
search_keyword = ["Spark SQL", "Hadoop", "Storm", "Spark Streaming"]
status_codes = ["200", "404", "500"]

def sample_url():
    return random.sample(url_paths, 1)[0]

def sample_ip():
    # Build an IPv4 address from four randomly chosen octets
    octets = random.sample(ip_slices, 4)
    return ".".join([str(item) for item in octets])

def sample_referer():
    # About 80% of records carry no referer
    if random.uniform(0, 1) > 0.2:
        return "-"
    refer_str = random.sample(http_referers, 1)
    query_str = random.sample(search_keyword, 1)
    return refer_str[0].format(query=query_str[0])

def sample_status_code():
    return random.sample(status_codes, 1)[0]

def generate_log(count=10):
    time_str = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
    f = open("/home/hadoop/data/project/logs/access.log", "w+")
    while count >= 1:
        query_log = "{ip}\t{local_time}\t\"GET /{url} HTTP/1.1\"\t{status_code}\t{referer}".format(
            url=sample_url(), ip=sample_ip(), referer=sample_referer(),
            status_code=sample_status_code(), local_time=time_str)
        f.write(query_log + "\n")
        count = count - 1
    f.close()

if __name__ == '__main__':
    generate_log(100)
Create a log_generator.sh script containing the command to execute, as sketched below:
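A minimal sketch of log_generator.sh, assuming the generator above was saved as /home/hadoop/data/project/generate_log.py (this filename is an assumption):

#!/bin/bash
python /home/hadoop/data/project/generate_log.py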
Make the script executable:
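For example, with the path assumed above:

chmod u+x /home/hadoop/data/project/log_generator.sh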
Register the script via crontab -e so it runs every minute:
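Using the expression from earlier, the crontab entry would look like this (same assumed path):

*/1 * * * * /home/hadoop/data/project/log_generator.sh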
Once logs are being generated:
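Each record in access.log is tab-separated, following the format string in generate_log(); an illustrative line (all values made up) looks like:

29.132.167.124	2023-01-01 10:00:00	"GET /class/112.html HTTP/1.1"	200	-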
Wiring the logs produced by the Python generator into Flume
Pipeline choice: access.log ==> exec source ==> memory channel ==> logger sink (console output)
streaming_project.conf:
exec-memory-logger.sources = exec-source
exec-memory-logger.sinks = logger-sink
exec-memory-logger.channels = memory-channel
exec-memory-logger.sources.exec-source.type = exec
exec-memory-logger.sources.exec-source.command = tail -F /home/hadoop/data/project/logs/access.log
exec-memory-logger.sources.exec-source.shell = /bin/sh -c
exec-memory-logger.channels.memory-channel.type = memory
exec-memory-logger.sinks.logger-sink.type = logger
exec-memory-logger.sources.exec-source.channels = memory-channel
exec-memory-logger.sinks.logger-sink.channel = memory-channel
Start Flume (note: there must be no space after the equals signs here; out of programming habit it's easy to add one):
flume-ng agent --name exec-memory-logger --conf $FLUME_HOME/conf --conf-file /home/hadoop/data/project/streaming_project.conf -Dflume.root.logger=INFO,console
Logs ==> Flume ==> Kafka
- Start ZooKeeper: ./zkServer.sh start
- Start the Kafka server:
- Modify the Flume configuration so that Flume sinks the data to Kafka
Kafka's server.properties:
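The post does not show the file; for this single-broker setup, the relevant entries would typically look like the following (log.dirs is an assumed path; the ZooKeeper address matches the commands used above):

broker.id=0
listeners=PLAINTEXT://:9092
log.dirs=/home/hadoop/app/tmp/kafka-logs
zookeeper.connect=hadoop000:2181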
ZooKeeper must be running before Kafka, so start ZooKeeper first, then start Kafka:
./kafka-server-start.sh -daemon /home/hadoop/app/kafka_2.11-0.9.0.0/config/server.properties
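If the topic does not exist yet, create it first; a sketch for this single-node setup (replication factor and partition count are assumptions):

./kafka-topics.sh --create --zookeeper hadoop000:2181 --replication-factor 1 --partitions 1 --topic streamingtopic1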
streaming_project2.conf:
exec-memory-kafka.sources = exec-source
exec-memory-kafka.sinks = kafka-sink
exec-memory-kafka.channels = memory-channel
exec-memory-kafka.sources.exec-source.type = exec
exec-memory-kafka.sources.exec-source.command = tail -F /home/hadoop/data/project/logs/access.log
exec-memory-kafka.sources.exec-source.shell = /bin/sh -c
exec-memory-kafka.channels.memory-channel.type = memory
exec-memory-kafka.sinks.kafka-sink.type = org.apache.flume.sink.kafka.KafkaSink
exec-memory-kafka.sinks.kafka-sink.brokerList = hadoop000:9092
exec-memory-kafka.sinks.kafka-sink.topic = streamingtopic1
exec-memory-kafka.sinks.kafka-sink.batchSize = 5
exec-memory-kafka.sinks.kafka-sink.requiredAcks = 1
exec-memory-kafka.sources.exec-source.channels = memory-channel
exec-memory-kafka.sinks.kafka-sink.channel = memory-channel
Start Flume:
flume-ng agent --name exec-memory-kafka --conf $FLUME_HOME/conf --conf-file /home/hadoop/data/project/streaming_project2.conf -Dflume.root.logger=INFO,console
After starting Flume we need a client that can consume from Kafka.
Start a Kafka console consumer:
kafka-console-consumer.sh --zookeeper hadoop000:2181 --topic streamingtopic1
Data received by Flume after startup:
Data consumed by the Kafka consumer:
Consuming the Kafka data with Spark Streaming:
Configuration parameters:
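The post ends before showing these; as a minimal sketch, a receiver-based PySpark consumer for the topic above could look like the following (assumes the spark-streaming-kafka 0.8 integration is on the classpath; the app name, consumer group, and 60-second batch interval are assumptions):

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

# Assumed app name and 60-second batch interval
conf = SparkConf().setAppName("StreamingProject")
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 60)

# Receiver-based stream: ZooKeeper quorum, consumer group (assumed name), {topic: receiver threads}
kafka_stream = KafkaUtils.createStream(ssc, "hadoop000:2181", "streaming-group", {"streamingtopic1": 1})
lines = kafka_stream.map(lambda kv: kv[1])  # keep only the message value of each (key, value) pair
lines.count().pprint()  # sanity check: print the number of records in each batch

ssc.start()
ssc.awaitTermination()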
The data processing steps are to be continued!