4.1 Introduction to the Real-Time Computing Business
Learning objectives
- Objectives
    - Understand the business requirements of real-time computing
    - Know what real-time computing is used for
- Application
    - None
4.1.1 Real-Time Computing Business Requirements
Real-time (online) computing:
- Helps address the user cold-start problem
- Reacts to user clicks as immediate feedback, so the system can quickly track a user's preferences
4.1.2 Real-Time Computing Business Diagram
(Figure: user click behaviour is collected by Flume, written to Kafka, consumed by Spark Streaming, and the resulting recall sets are stored in HBase and Redis.)
4.2 Real-Time Log Analysis
Learning objectives
- Objectives
    - Understand the business requirements of real-time computing
    - Know what real-time computing is used for
- Application
    - None
The log data has already been collected into Hadoop, but for real-time analysis we also need to collect every click event that users generate into Kafka, where it waits for the Spark Streaming program to consume it.
4.2.1 Collecting Logs into Kafka with Flume
- Goal: collect the local real-time click log data into Kafka
- Steps:
    - 1. Start ZooKeeper and Kafka and test them
    - 2. Create the Flume configuration file and start Flume
    - 3. Start Kafka and test writing log data
    - 4. Add the startup scripts and manage them with Supervisor
Start ZooKeeper. It has to keep running on the server, so start it as a daemon:
/root/bigdata/kafka/bin/zookeeper-server-start.sh -daemon /root/bigdata/kafka/config/zookeeper.properties
Then start Kafka for testing:
/root/bigdata/kafka/bin/kafka-server-start.sh /root/bigdata/kafka/config/server.properties
Test
Start a message producer:
/root/bigdata/kafka/bin/kafka-console-producer.sh --broker-list 192.168.19.137:9092 --sync --topic click-trace
Start a consumer:
/root/bigdata/kafka/bin/kafka-console-consumer.sh --bootstrap-server 192.168.19.137:9092 --topic click-trace
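The same round trip can also be checked from Python with kafka-python (the library installed later in section 4.4.4). This is only a hedged verification sketch, not part of the project code; the broker address and topic are the ones used above:

```python
from kafka import KafkaProducer, KafkaConsumer

# push one test message into click-trace ...
producer = KafkaProducer(bootstrap_servers=['192.168.19.137:9092'])
producer.send('click-trace', b'{"test": "message"}')
producer.flush()

# ... then read it back (gives up after 5 seconds if nothing arrives)
consumer = KafkaConsumer('click-trace',
                         bootstrap_servers=['192.168.19.137:9092'],
                         auto_offset_reset='earliest',
                         consumer_timeout_ms=5000)
for message in consumer:
    print(message.value)
```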
2. Modify the existing log-collection configuration, adding the source, channels, and sinks that send click events to Kafka. Channel c1 and sink k1 keep writing to HDFS, while channel c2 and sink k2 fan the same events out to Kafka:
a1.sources = s1
a1.sinks = k1 k2
a1.channels = c1 c2
a1.sources.s1.channels= c1 c2
a1.sources.s1.type = exec
a1.sources.s1.command = tail -F /root/logs/userClick.log
a1.sources.s1.interceptors=i1 i2
a1.sources.s1.interceptors.i1.type=regex_filter
a1.sources.s1.interceptors.i1.regex=\\{.*\\}
a1.sources.s1.interceptors.i2.type=timestamp
# channel1
a1.channels.c1.type=memory
a1.channels.c1.capacity=30000
a1.channels.c1.transactionCapacity=1000
# channel2
a1.channels.c2.type=memory
a1.channels.c2.capacity=30000
a1.channels.c2.transactionCapacity=1000
# k1
a1.sinks.k1.type=hdfs
a1.sinks.k1.channel=c1
a1.sinks.k1.hdfs.path=hdfs://192.168.19.137:9000/user/hive/warehouse/profile.db/user_action/%Y-%m-%d
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.fileType=DataStream
a1.sinks.k1.hdfs.writeFormat=Text
a1.sinks.k1.hdfs.rollInterval=0
a1.sinks.k1.hdfs.rollSize=10240
a1.sinks.k1.hdfs.rollCount=0
a1.sinks.k1.hdfs.idleTimeout=60
# k2
a1.sinks.k2.channel=c2
a1.sinks.k2.type=org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k2.kafka.bootstrap.servers=192.168.19.137:9092
a1.sinks.k2.kafka.topic=click-trace
a1.sinks.k2.kafka.batchSize=20
a1.sinks.k2.kafka.producer.requiredAcks=1
3. Start Flume with the new configuration for testing. Stop the previously running Flume process before starting it:
#!/usr/bin/env bash
export JAVA_HOME=/root/bigdata/jdk
export HADOOP_HOME=/root/bigdata/hadoop
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin
/root/bigdata/flume/bin/flume-ng agent -c /root/bigdata/flume/conf -f /root/bigdata/flume/conf/collect_click.conf -Dflume.root.logger=INFO,console -name a1
Start Kafka via a script for testing. ZooKeeper can also go into this script; shut down the previously started ZooKeeper first:
#!/usr/bin/env bash
# /root/bigdata/kafka/bin/zookeeper-server-start.sh -daemon /root/bigdata/kafka/config/zookeeper.properties
/root/bigdata/kafka/bin/kafka-server-start.sh /root/bigdata/kafka/config/server.properties
/root/bigdata/kafka/bin/kafka-topics.sh --zookeeper 192.168.19.137:2181 --create --replication-factor 1 --topic click-trace --partitions 1
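To confirm that the click-trace topic really exists on the broker, one hedged check is to list the topics with kafka-python (the same library used in section 4.4.4):

```python
from kafka import KafkaConsumer

# the set of topics known to the broker should now include 'click-trace'
consumer = KafkaConsumer(bootstrap_servers=['192.168.19.137:9092'])
print(consumer.topics())
```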
4.2.2 Adding the Script to Supervisor
[program:kafka]
command=/bin/bash /root/toutiao_project/scripts/start_kafka.sh
user=root
autorestart=true
redirect_stderr=true
stdout_logfile=/root/logs/kafka.log
loglevel=info
stopsignal=KILL
stopasgroup=true
killasgroup=true
Run `update` in supervisorctl so Supervisor picks up the new program.
4.2.3 Testing

Start a Kafka consumer:
/root/bigdata/kafka/bin/kafka-console-consumer.sh --bootstrap-server 192.168.19.137:9092 --topic click-trace
Write one click event into the log file:
echo {\"actionTime\":\"2019-04-10 21:04:39\",\"readTime\":\"\",\"channelId\":18,\"param\":{\"action\": \"click\", \"userId\": \"2\", \"articleId\": \"14299\", \"algorithmCombine\": \"C2\"}} >> userClick.log
Observe the consumer output:
[root@hadoop-master ~]# /root/bigdata/kafka/bin/kafka-console-consumer.sh --bootstrap-server 192.168.19.137:9092 --topic click-trace
{"actionTime":"2019-04-10 21:04:39","readTime":"","channelId":18,"param":{"action": "click", "userId": "2", "articleId": "14299", "algorithmCombine": "C2"}}
4.3 Real-Time Recall Set Business
Learning objectives
- Objectives
    - Understand the role of real-time content recall
- Application
    - Use Spark Streaming to build the real-time recall set
4.3.1 Implementing Real-Time Recall
Real-time recall recommends articles that are similar (content/profile-based similarity) to the articles a user just interacted with.
Create an online folder to hold the online real-time processing program.
- Goal: process the user click logs in real time, compute similar articles, and add them to the user's recall set
- Steps:
    - 1. Configure Spark Streaming
    - 2. Read the click log data and fetch the list of similar articles
    - 3. Filter out articles already in the user's history
    - 4. Store the recall results and the history records
Create the Spark Streaming configuration and the happybase connection pool.
Import the default configuration, SPARK_ONLINE_CONFIG:
# Spark online startup configuration
class DefaultConfig(object):
    """Default configuration values
    """
    SPARK_ONLINE_CONFIG = (
        ("spark.app.name", "onlineUpdate"),  # Spark application name; if not provided, a random name is generated
        ("spark.master", "yarn"),
        ("spark.executor.instances", 4)
    )
Configure the StreamingContext. Add the following to online/__init__.py so it can be used directly when the module is imported:
# Spark Streaming configuration for connecting to Kafka
from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from setting.default import DefaultConfig
import happybase
# connection pool used to read cached results from HBase
pool = happybase.ConnectionPool(size=10, host='hadoop-master', port=9090)
# 1. create the SparkConf
conf = SparkConf()
conf.setAll(DefaultConfig.SPARK_ONLINE_CONFIG)
# create the SparkContext
sc = SparkContext(conf=conf)
# create the StreamingContext with a 60-second batch interval
stream_c = StreamingContext(sc, 60)
Configure the streaming job to read from Kafka, adding the Kafka IP and port to the configuration file:
# Kafka configuration
KAFKA_SERVER = "192.168.19.137:9092"
# content-based recall configuration: collects user behaviour to fetch similar articles for real-time recommendation
similar_kafkaParams = {"metadata.broker.list": DefaultConfig.KAFKA_SERVER, "group.id": 'similar'}
SIMILAR_DS = KafkaUtils.createDirectStream(stream_c, ['click-trace'], similar_kafkaParams)
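Each element delivered by createDirectStream is a (key, value) tuple whose value is the raw JSON log line, which is why the later code always applies `json.loads(x[1])`. A small, hypothetical debugging helper for inspecting the incoming messages (not part of the project code):

```python
import json

def debug_print(rdd):
    # collect() is acceptable here only because this is a debugging helper
    for key, value in rdd.collect():
        print(json.loads(value))  # e.g. {"actionTime": ..., "channelId": 18, "param": {...}}

# SIMILAR_DS.foreachRDD(debug_print)  # uncomment to watch the stream
```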
Create the online_update file and define the online recall class:
import os
import sys
BASE_DIR = os.path.dirname(os.getcwd())
sys.path.insert(0, os.path.join(BASE_DIR))
print(BASE_DIR)
PYSPARK_PYTHON = "/miniconda2/envs/reco_sys/bin/python"
# when multiple Python versions exist, not specifying one is likely to cause errors
os.environ["PYSPARK_PYTHON"] = PYSPARK_PYTHON
os.environ["PYSPARK_DRIVER_PYTHON"] = PYSPARK_PYTHON
os.environ["PYSPARK_SUBMIT_ARGS"] = "--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.2.2 pyspark-shell"
from online import stream_c, SIMILAR_DS, pool
from setting.default import DefaultConfig
from datetime import datetime
import setting.logging as lg
import logging
import redis
import json
import time
Note: add the runtime environment settings
# Note: when using Spark Streaming with Kafka from Jupyter or IPython, the following line must be added
# Also note: for Spark versions above 2.2.2 the Kafka module in pyspark (spark-streaming-kafka-0-8) is deprecated, so only Spark 2.2.2 can be used here for now
os.environ["PYSPARK_SUBMIT_ARGS"] = "--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.2.2 pyspark-shell"
- 2. Read the click behaviour log data from Kafka and fetch the list of similar articles
Sample data sent into Kafka:
OK
2019-03-05 10:19:40 0 {"action":"exposure","userId":"2","articleId":"[16000, 44371, 16421, 16181, 17454]","algorithmCombine":"C2"} 2019-03-05
Time taken: 3.72 seconds, Fetched: 1 row(s)
- 3. Filter out articles already in the user's history
- 4. Store the recall results and the history records
class OnlineRecall(object):
    """Online (streaming) processing platform
    """
    def __init__(self):
        # number of similar articles to keep for each recall (TOP-K)
        self.k = 20

    def _update_online_cb(self):
        """
        Update the online results in the user's content-based recall table
        according to click behaviour
        :return:
        """
        def foreachFunc(rdd):
            for data in rdd.collect():
                logger.info(
                    "{}, INFO: rdd filter".format(datetime.now().strftime('%Y-%m-%d %H:%M:%S')))
                # check the behaviour type; only click-like events are processed
                if data["param"]["action"] in ["click", "collect", "share"]:
                    # print(data)
                    with pool.connection() as conn:
                        try:
                            # similar-article table
                            sim_table = conn.table("article_similar")
                            # for the article in the click log, find the most similar articles
                            # (content-based similarity) and take the TOP-K as recall candidates
                            _dic = sim_table.row(str(data["param"]["articleId"]).encode(), columns=[b"similar"])
                            _srt = sorted(_dic.items(), key=lambda obj: obj[1], reverse=True)  # sort by similarity
                            if _srt:
                                topKSimIds = [int(i[0].split(b":")[1]) for i in _srt[:self.k]]

                                # filter out articles already recommended to this user
                                history_table = conn.table("history_recall")
                                _history_data = history_table.cells(
                                    b"reco:his:%s" % data["param"]["userId"].encode(),
                                    b"channel:%d" % data["channelId"]
                                )
                                # print("_history_data: ", _history_data)
                                history = []
                                if len(_history_data) >= 2:
                                    for l in _history_data[:-1]:
                                        history.extend(eval(l))

                                # filter the recall candidates against the history records
                                recall_list = list(set(topKSimIds) - set(history))
                                # print("recall_list: ", recall_list)
                                logger.info("{}, INFO: store user:{} cb_recall data".format(datetime.now().strftime('%Y-%m-%d %H:%M:%S'), data["param"]["userId"]))
                                if recall_list:
                                    # if there are recall results, write them to the cb_recall table
                                    # and record them in the history table
                                    logger.info(
                                        "{}, INFO: get online-recall data".format(datetime.now().strftime('%Y-%m-%d %H:%M:%S')))
                                    recall_table = conn.table("cb_recall")
                                    recall_table.put(
                                        b"recall:user:%s" % data["param"]["userId"].encode(),
                                        {b"online:%d" % data["channelId"]: str(recall_list).encode()}
                                    )
                                    history_table.put(
                                        b"reco:his:%s" % data["param"]["userId"].encode(),
                                        {b"channel:%d" % data["channelId"]: str(recall_list).encode()}
                                    )
                        except Exception as e:
                            logger.info("{}, WARN: {}".format(datetime.now().strftime('%Y-%m-%d %H:%M:%S'), e))
                        finally:
                            conn.close()

        SIMILAR_DS.map(lambda x: json.loads(x[1])).foreachRDD(foreachFunc)
        return None
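The method above assumes a particular layout for the article_similar table: row key = article ID, column family `similar`, one column per similar article (qualifier `similar:<article_id>`), value = similarity score, which is why the similar IDs are recovered with `i[0].split(b":")[1]`. A hedged happybase sketch of that layout; the table and column names come from the code above, while the sample IDs and scores are made up for illustration:

```python
import happybase

sketch_pool = happybase.ConnectionPool(size=3, host='hadoop-master', port=9090)
with sketch_pool.connection() as conn:
    table = conn.table("article_similar")
    # one illustrative row: article 14299 with two similar articles and their scores
    table.put(b"14299", {b"similar:16421": b"0.83", b"similar:17454": b"0.47"})
    # read it back the same way _update_online_cb does
    row = table.row(b"14299", columns=[b"similar"])
    print(row)  # {b'similar:16421': b'0.83', b'similar:17454': b'0.47'}
```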
Start the real-time job and add log output:
if __name__ == '__main__':
    # set up logging
    lg.create_logger()
    op = OnlineRecall()
    op._update_online_cb()
    stream_c.start()
    # press ctrl+c to stop the service
    _ONE_DAY_IN_SECONDS = 60 * 60 * 24
    try:
        while True:
            time.sleep(_ONE_DAY_IN_SECONDS)
    except KeyboardInterrupt:
        stream_c.stop()
Add a file handler so the log messages are written to a file
# add this to any file that needs to write log output
logger = logging.getLogger('online')
# handler for the online update log
trace_file_handler = logging.FileHandler(
    os.path.join(logging_file_dir, 'online.log')
)
trace_file_handler.setFormatter(logging.Formatter('%(message)s'))
log_trace = logging.getLogger('online')
log_trace.addHandler(trace_file_handler)
log_trace.setLevel(logging.INFO)
4.4 Hot and New Article Recall
Learning objectives
- Objectives
    - Understand the role of hot article and new article recall
- Application
    - Use Spark Streaming to build these recall sets
4.4.1 Hot Articles and New Articles
- Hot articles: processed from the log data to update metrics such as an article's click count in real time
- New articles: articles approved by the Toutiao back end are pushed into the Kafka new-article topic
- Storage: Redis database 10
| New article recall | Key structure | Example |
| --- | --- | --- |
| new_article | ch:{}:new | ch:18:new |

| Hot article recall | Key structure | Example |
| --- | --- | --- |
| popular_recall | ch:{}:hot | ch:18:hot |
# store new articles
# ZADD / ZRANGE
# ZADD key score member [[score member] [score member] ...]
# ZRANGE page_rank 0 -1
client.zadd("ch:{}:new".format(channel_id), {article_id: time.time()})
# store hot articles
# ZINCRBY key increment member
# ZSCORE
# adds increment to the score of member in the sorted set stored at key
client.zincrby("ch:{}:hot".format(row['channelId']), 1, row['param']['articleId'])
# ZREVRANGE key start stop [WITHSCORES]
client.zrevrange("ch:{}:new".format(channel_id), 0, -1)
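For reference, a hedged sketch of how these sorted sets could be read back when serving recommendations; the key patterns come from the tables above, the connection mirrors section 4.4.3 (database 10), and the host/port values are assumptions:

```python
import redis

client = redis.StrictRedis(host="127.0.0.1", port=6379, db=10)  # assumed host/port

channel_id = 18
# ten hottest articles in the channel, highest score (click count) first
hot_articles = client.zrevrange("ch:{}:hot".format(channel_id), 0, 9)
# ten newest articles in the channel, most recently added first
new_articles = client.zrevrange("ch:{}:new".format(channel_id), 0, 9)
print(hot_articles, new_articles)
```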
4.4.2 Adding Kafka Configuration for Hot and New Articles
# Spark Streaming configuration for connecting to Kafka
# Kafka configuration used to read hot-article (click) events
click_kafkaParams = {"metadata.broker.list": DefaultConfig.KAFKA_SERVER}
HOT_DS = KafkaUtils.createDirectStream(stream_c, ['click-trace'], click_kafkaParams)
# new-article: Kafka configuration for reading new articles
NEW_ARTICLE_DS = KafkaUtils.createDirectStream(stream_c, ['new-article'], click_kafkaParams)
Also import the related stream objects:
from online import HOT_DS, NEW_ARTICLE_DS
Then add the following to the Kafka startup script; stop Flume and Kafka and restart them:
/root/bigdata/kafka/bin/kafka-topics.sh --zookeeper 192.168.19.137:2181 --create --replication-factor 1 --topic new-article --partitions 1
This adds a topic for new articles, which the back end will write to.
4.4.3 Writing the Hot Article Collection Program
- Read from and store into Redis online in real time
class OnlineRecall(object):
    """Real-time (streaming) processing
    """
    def __init__(self):
        self.client = redis.StrictRedis(host=DefaultConfig.REDIS_HOST,
                                        port=DefaultConfig.REDIS_PORT,
                                        db=10)
        # keep the TOP-K results for online recall
        self.k = 20
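REDIS_HOST and REDIS_PORT are read from DefaultConfig. If they are not defined yet, a hedged sketch of the extra entries in setting/default.py; the attribute names come from the code above, and the values are assumptions that match the 127.0.0.1:6379 CLI examples below:

```python
class DefaultConfig(object):
    # ... existing settings such as SPARK_ONLINE_CONFIG and KAFKA_SERVER ...
    REDIS_HOST = "127.0.0.1"  # assumption: point this at your Redis host
    REDIS_PORT = 6379
```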
Code for collecting hot articles:
def _update_hot_redis(self):
    """Update hot articles from the click-trace topic
    :return:
    """
    client = self.client

    def updateHotArt(rdd):
        for row in rdd.collect():
            logger.info("{}, INFO: {}".format(datetime.now().strftime('%Y-%m-%d %H:%M:%S'), row))
            # skip exposure events and read-time events
            if row['param']['action'] == 'exposure' or row['param']['action'] == 'read':
                pass
            else:
                # parse each behaviour log and count clicks, likes and shares;
                # every remaining behaviour type increments the score by 1
                client.zincrby("ch:{}:hot".format(row['channelId']), 1, row['param']['articleId'])

    HOT_DS.map(lambda x: json.loads(x[1])).foreachRDD(updateHotArt)
    return None
Test the result:
[root@hadoop-master logs]# echo {\"actionTime\":\"2019-04-10 21:04:39\",\"readTime\":\"\",\"channelId\":18,\"param\":{\"action\": \"click\", \"userId\": \"2\", \"articleId\": \"14299\", \"algorithmCombine\": \"C2\"}} >> userClick.log
The log output then shows:
2019-05-18 03:24:01, INFO: {'actionTime': '2019-04-10 21:04:39', 'readTime': '', 'channelId': 18, 'param': {'action': 'click', 'userId': '2', 'articleId': '14299', 'algorithmCombine': 'C2'}}
Finally, check whether the hot article result was stored in Redis:
127.0.0.1:6379[10]> keys *
1) "ch:18:hot"
127.0.0.1:6379[10]> ZRANGE "ch:18:hot" 0 -1
1) "14299"
127.0.0.1:6379[10]>
# ZREMRANGEBYRANK "ch:18:hot" 0 -1 can be used to remove the previous results
4.4.4 Writing the New Article Collection Program
Where do new articles come from? After an article passes review and is published, the Heima Toutiao back end sends the new article ID in a fixed format to the Kafka new-article topic.
New article code:
def _update_new_redis(self):
    """Update each channel's new articles from the new-article topic
    :return:
    """
    client = self.client

    def computeFunction(rdd):
        for row in rdd.collect():
            channel_id, article_id = row.split(',')
            logger.info("{}, INFO: get kafka new_article each data:channel_id{}, article_id{}".format(
                datetime.now().strftime('%Y-%m-%d %H:%M:%S'), channel_id, article_id))
            client.zadd("ch:{}:new".format(channel_id), {article_id: time.time()})

    NEW_ARTICLE_DS.map(lambda x: x[1]).foreachRDD(computeFunction)
    return None
Test: pip install kafka-python
List all topics on the local broker:
from kafka import KafkaClient
client = KafkaClient(hosts="127.0.0.1:9092")
for topic in client.topics:
    print(topic)
from kafka import KafkaProducer
# Kafka message producer
kafka_producer = KafkaProducer(bootstrap_servers=['192.168.19.137:9092'])
# build and send the message
msg = '{},{}'.format(18, 13891)
kafka_producer.send('new-article', msg.encode())
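Note that KafkaProducer.send() is asynchronous in kafka-python; if the key does not appear in Redis right away, flush the producer before checking:

```python
# block until the buffered message has actually been delivered to the broker
kafka_producer.flush()
```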
The result can then be seen in Redis:
127.0.0.1:6379[10]> keys *
1) "ch:18:hot"
2) "ch:18:new"
127.0.0.1:6379[10]> ZRANGE "ch:18:new" 0 -1
1) "13890"
2) "13891"
4.4.5 Managing the Online Real-Time Process with Supervisor
Add the following configuration:
[program:online]
environment=JAVA_HOME=/root/bigdata/jdk,SPARK_HOME=/root/bigdata/spark,HADOOP_HOME=/root/bigdata/hadoop,PYSPARK_PYTHON=/miniconda2/envs/reco_sys/bin/python,PYSPARK_DRIVER_PYTHON=/miniconda2/envs/reco_sys/bin/python,PYSPARK_SUBMIT_ARGS='--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.2.2 pyspark-shell'
command=/miniconda2/envs/reco_sys/bin/python /root/toutiao_project/reco_sys/online/online_update.py
directory=/root/toutiao_project/reco_sys/online
user=root
autorestart=true
redirect_stderr=true
stdout_logfile=/root/logs/onlinesuper.log
loglevel=info
stopsignal=KILL
stopasgroup=true
killasgroup=true
supervisor> update
online: added process group
supervisor> status
collect-click RUNNING pid 97209, uptime 6:46:53
kafka RUNNING pid 105159, uptime 6:20:09
offline STOPPED Apr 16 04:31 PM
online RUNNING pid 124591, uptime 0:00:02
supervisor>