Deploying Spiders with Scrapyd

The system used here is Ubuntu 16.04 LTS.

1. Installation

Use scrapyd-deploy, which ships with scrapyd-client, to deploy a Scrapy project to a scrapyd server.

# Install scrapyd
pip install scrapyd
# Install scrapyd-client
# for python2.x
pip install git+https://github.com/scrapy/scrapyd-client
# for python3.6
pip install scrapyd-client

2. Usage

a. Configure scrapy.cfg

[settings]
default = njupt.settings

[deploy:server-njupt]
url = http://localhost:6800/
project = njupt

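b. Configure scrapyd

See the Scrapyd documentation for the available configuration options.
Configuration files are loaded in the following order, with later files taking precedence over earlier ones:
/etc/scrapyd/scrapyd.conf
/etc/scrapyd/conf.d/*
scrapyd.conf
~/.scrapyd.conf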
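
The override behaviour of a layered load order can be sketched with Python's standard configparser. This is only an illustration, not Scrapyd's own loader: the helper load_layered_config and the two temporary file names are hypothetical, and the values are made up for the demo.

```python
import configparser
import os
import tempfile


def load_layered_config(paths):
    """Read INI files in the given order; later files override earlier ones."""
    parser = configparser.ConfigParser()
    # ConfigParser.read() silently skips paths that do not exist.
    parser.read(paths)
    return parser


# Two throwaway files standing in for a system-wide and a per-user config.
tmp_dir = tempfile.mkdtemp()
system_conf = os.path.join(tmp_dir, "system.conf")
user_conf = os.path.join(tmp_dir, "user.conf")

with open(system_conf, "w") as f:
    f.write("[scrapyd]\nhttp_port = 6800\nmax_proc_per_cpu = 4\n")
with open(user_conf, "w") as f:
    f.write("[scrapyd]\nhttp_port = 6900\n")

config = load_layered_config([system_conf, user_conf])
print(config.get("scrapyd", "http_port"))         # taken from the later file
print(config.get("scrapyd", "max_proc_per_cpu"))  # only set in the earlier file
```

A key set in several files takes its value from the file read last, while keys set in only one file are kept as they are.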

example:

[scrapyd]
eggs_dir    = eggs
logs_dir    = logs
items_dir   =
jobs_to_keep = 5
dbs_dir     = dbs
max_proc    = 0
max_proc_per_cpu = 4
finished_to_keep = 100
poll_interval = 5.0
bind_address = 127.0.0.1
http_port   = 6800
debug       = off
runner      = scrapyd.runner
application = scrapyd.app.application
launcher    = scrapyd.launcher.Launcher
webroot     = scrapyd.website.Root

[services]
schedule.json     = scrapyd.webservice.Schedule
cancel.json       = scrapyd.webservice.Cancel
addversion.json   = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json  = scrapyd.webservice.ListSpiders
delproject.json   = scrapyd.webservice.DeleteProject
delversion.json   = scrapyd.webservice.DeleteVersion
listjobs.json     = scrapyd.webservice.ListJobs
daemonstatus.json = scrapyd.webservice.DaemonStatus

c. Start scrapyd

scrapyd

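d. Deploy

# Change to the root directory of the scrapy project
scrapyd-deploy server-njupt -p njupt
# Specify a version number; the default is the current timestamp
scrapyd-deploy server-njupt -p njupt --version 1.0

See the help output of scrapyd-deploy for its other options.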

e. Run a spider job

curl http://localhost:6800/schedule.json -d project=njupt -d spider=njupt

Use scrapyd-client spiders -p njupt to list the spiders under project=njupt.
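
The same scheduling call can be made from Python. The following is a minimal sketch using only the standard library; the helper names build_schedule_request and send_request are hypothetical, and the URL, project and spider names are the ones used in the curl example above.

```python
import json
import urllib.parse
import urllib.request


def build_schedule_request(base_url, project, spider):
    """Build the POST request that the curl command above sends."""
    form = urllib.parse.urlencode({"project": project, "spider": spider})
    url = urllib.parse.urljoin(base_url, "schedule.json")
    return urllib.request.Request(url, data=form.encode("ascii"), method="POST")


def send_request(request):
    """Send the request and decode the JSON body returned by scrapyd."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


schedule_request = build_schedule_request("http://localhost:6800/", "njupt", "njupt")
print(schedule_request.full_url)
print(schedule_request.data)

# With scrapyd running locally, the job would be submitted with:
# result = send_request(schedule_request)
```

Building the request separately from sending it makes it easy to inspect what will be posted before a running scrapyd instance is available.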

3. Security

You can put a reverse proxy in front of scrapyd to add user authentication. Taking nginx as an example, configure nginx as follows:

server {
       listen 6801;
       location / {
            proxy_pass            http://127.0.0.1:6800/;
            auth_basic            "Restricted";
            auth_basic_user_file  /etc/nginx/htpasswd/user.htpasswd;
        }
}

Set the username and password in /etc/nginx/htpasswd/user.htpasswd; assume both are test. Then modify scrapy.cfg as follows:

[settings]
default = njupt.settings

[deploy:server-njupt]
url = http://localhost:6801/
project = njupt
username = test
password = test
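
Direct API calls made through the proxy also have to carry the credentials. As a rough sketch, assuming the nginx listener on port 6801 and the test/test account from above, an HTTP Basic Authorization header can be attached with the standard library; the helper name with_basic_auth is hypothetical.

```python
import base64
import urllib.request


def with_basic_auth(request, username, password):
    """Attach an HTTP Basic Authorization header to a urllib request."""
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    request.add_header("Authorization", f"Basic {token}")
    return request


# listprojects.json is one of the services listed in the scrapyd config above.
auth_request = with_basic_auth(
    urllib.request.Request("http://localhost:6801/listprojects.json"),
    "test",
    "test",
)
print(auth_request.get_header("Authorization"))

# With nginx and scrapyd running, the call would be sent with:
# urllib.request.urlopen(auth_request)
```

Basic authentication only base64-encodes the credentials, so the proxy should be reached over a trusted network or TLS.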

4. API

See the API section of the official documentation.

Reposted from blog.csdn.net/haiyanggeng/article/details/78665402