Python Crawler Tutorial: Continuing with Distributed Crawlers Using Celery

Let's build a writing habit together! This is day 6 of my participation in the Juejin "Daily New Plan · April Writing Challenge".

Preface

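In learning distributed crawlers, as in learning any technology, there are no shortcuts. There are two paths: first, practice repeatedly yourself, since practice makes perfect; second, study code that others have shared and work through their approach again and again until you can write it yourself.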

Today we continue using celery to build a distributed crawler, this time crawling https://book.douban.com/tag/?view=type&icn=index-sorttags-all

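A quick recap of the celery concepts from the previous post: celery officially describes itself as a distributed task queue, and its core idea is to use a queue to distribute work across threads or machines.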

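The input to the queue is a unit of work called a task. To define one, just add the @app.task decorator above a function; see the documentation for the decorator's other parameters.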

Once a task is defined, a worker monitors the queue and executes new jobs as they arrive.
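
To make the "worker watches a queue" idea concrete, here is a minimal sketch using only Python's standard library. It is not how celery is implemented (celery passes messages through a broker such as Redis and can span processes and machines); the names worker, square, jobs, and results are made up for this illustration.

```python
import queue
import threading


def worker(jobs, results):
    """Pull jobs off the queue until a None sentinel arrives."""
    while True:
        job = jobs.get()
        if job is None:  # sentinel: no more work for this worker
            jobs.task_done()
            break
        func, args = job
        results.append(func(*args))  # run the unit of work
        jobs.task_done()


def square(n):
    # Hypothetical unit of work, standing in for a real task
    return n * n


jobs = queue.Queue()
results = []
threads = [threading.Thread(target=worker, args=(jobs, results)) for _ in range(2)]
for t in threads:
    t.start()

# The producer only enqueues work; it never runs the function itself
for n in range(5):
    jobs.put((square, (n,)))
for _ in threads:
    jobs.put(None)  # one sentinel per worker

jobs.join()
for t in threads:
    t.join()
print(sorted(results))
```

The producer and the workers only share the queue, which is the same separation celery gives you between the code that calls a task and the worker that executes it.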

On to the Python code

Next comes the actual coding. First, some background.

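A celery task can be invoked through three APIs:

1. apply_async(args[, kwargs[, …]]): sends a message to the task directly
2. delay(*args, **kwargs): a shortcut for 1
3. calling (__call__): calls the task directly, like a plain function, without sending a message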

The documentation explains all of this in detail: docs.celeryproject.org/en/latest/r…

Splitting out the celery configuration so it is read from a file

The code for celery_app.py is as follows:

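from celery import Celery  # import the Celery class
# Since this file is named celery_app.py, the app is created as app = Celery('celery_app', include=[...]).
# The first argument to Celery is the project name; the worker is likewise started with celery -A celery_app worker --loglevel=info
# include loads the task module douban_task.py. Note: if it is not loaded up front, its tasks are not registered with the worker
app = Celery('celery_app', include=['douban_task'])

# load the configuration file
app.config_from_object('config')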

if __name__ == '__main__':
    app.start()

app.config_from_object('config') loads the settings from config.py, the task configuration file:

# use Redis as the message broker
BROKER_URL = "redis://localhost:6379/2"
# use Redis to store task results
CELERY_RESULT_BACKEND = "redis://localhost:6379/3"
# set the time zone
CELERY_TIMEZONE = 'Asia/Shanghai'
# serialization format for tasks
CELERY_TASK_SERIALIZER = 'json'
# serialization format for task results
CELERY_RESULT_SERIALIZER = 'json'

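
The uppercase names above are the older setting style. Celery 4.0 and later also accept lowercase names, so an equivalent config.py could look like the sketch below. This is offered as an alternative under that assumption, not as what the original project uses, and as far as I know the two styles should not be mixed in one configuration.

```python
# config.py, written with the lowercase setting names (Celery 4.0+)

# use Redis as the message broker
broker_url = "redis://localhost:6379/2"
# use Redis to store task results
result_backend = "redis://localhost:6379/3"
# set the time zone
timezone = 'Asia/Shanghai'
# serialization format for tasks
task_serializer = 'json'
# serialization format for task results
result_serializer = 'json'
```

Either file is loaded the same way, through app.config_from_object('config').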

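The core code is in douban_task.py, shown below. It contains three task functions: main is the entry task and calls run; run fetches the page and hands it to get_detail, which does the main parsing and output. When celery starts, all three tasks are registered and ready at the same time.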

from celery_app import app
import requests
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0'  # placeholder: use your browser's full UA string
}

@app.task
def main(url):
    # entry task
    print("main task running")
    run.delay(url)

@app.task
def run(url):
    # send the request
    print("run task running")
    try:
        res = requests.get(url, headers=headers)
        get_detail.delay(res.text)  # hand the tag index page source to the get_detail task
    except Exception as e:
        print(e)


@app.task
def get_detail(html):
    print("get_detail task running")
    if not html:
        return None
    # parse the tag index page
    et = etree.HTML(html)
    tags = et.xpath("//table[@class='tagCol']/tbody/tr/td/a/text()")
    result = []
    for tag in tags:
        tag_url = f"https://book.douban.com/tag/{tag}"
        tag_res = requests.get(tag_url,headers=headers)
        tag_et = etree.HTML(tag_res.text)
        title_result = tag_et.xpath("//div[@class='info']/h2/a/@title")
        result.extend(title_result)
        print(result)  # the results are not saved to a database, just printed
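
To see what the two XPath expressions in get_detail select without hitting the network, here is a small sketch that runs them against hand-written HTML. The TAG_PAGE and LIST_PAGE snippets are simplified, hypothetical stand-ins, not the real Douban markup.

```python
from lxml import etree

# Hypothetical, simplified stand-in for the tag index page
TAG_PAGE = """
<table class="tagCol"><tbody>
  <tr>
    <td><a href="/tag/fiction">fiction</a><b>(100)</b></td>
    <td><a href="/tag/history">history</a><b>(50)</b></td>
  </tr>
</tbody></table>
"""

# Hypothetical, simplified stand-in for one tag's book listing page
LIST_PAGE = """
<ul>
  <li><div class="info"><h2><a href="#" title="Book A">Book A</a></h2></div></li>
  <li><div class="info"><h2><a href="#" title="Book B">Book B : subtitle</a></h2></div></li>
</ul>
"""

# text() selects the link text of every tag cell
tags = etree.HTML(TAG_PAGE).xpath("//table[@class='tagCol']/tbody/tr/td/a/text()")
# @title selects the title attribute rather than the visible link text
titles = etree.HTML(LIST_PAGE).xpath("//div[@class='info']/h2/a/@title")

print(tags)
print(titles)
```

One thing worth checking on a real page: browsers insert tbody into tables automatically, but lxml does not, so the first expression only matches when tbody is present in the raw HTML the server returns.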

Finally, add the crawler entry point crawl.py:

from douban_task import main
url = "https://book.douban.com/tag/?view=type&icn=index-sorttags-all"
res = main.delay(url)


The code structure is shown in the figure. [Figure: Python Crawler Beginner Tutorial 76-100, Continuing with Distributed Crawlers Using Celery]

Running the celery distributed crawler

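[Figure: Python Crawler Beginner Tutorial 76-100, Continuing with Distributed Crawlers Using Celery] Note the tasks section of the worker's startup output: three tasks are registered and ready. After that, simply run the crawler entry point crawl.py. The command to start the celery worker is: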

celery -A celery_app worker --loglevel=info --concurrency=5

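The -A option specifies where the Celery instance lives; make sure the module name is spelled correctly.
The --loglevel option sets the log level; the default is warning.
The --concurrency option sets the maximum number of concurrent workers; the default is the number of CPU cores.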

Afterword

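Dig deeper into celery and there are many more features worth writing about; this crawler series covers only a small fraction of them. I hope you can build on this starting point, and if this post ever helps you while you are using celery, it will have been worth writing.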

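While searching I found a Chinese translation of the documentation that may be useful: www.celerycn.io/ru-men/cele… The code above also uses some XPath syntax, which you may want to look up and review.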

Author ID: 梦想橡皮擦. Likes, comments, and bookmarks are appreciated.