nutch 1.4: Scheduling the Crawler to Run Periodically

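Combined with Linux scheduled tasks, nutch 1.4 can crawl automatically on a fixed schedule. For background on Linux scheduled tasks, see "Linux commands for scheduled tasks: at and crontab".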

The steps are as follows:

1. First, check the current user's crontab by running:

crontab -l
Output:
no crontab for ***
This means no crontab has been defined for this user yet.

2. Edit the crontab:

crontab -e
# Run every 10 minutes; the *.sh file contains the nutch crawl script (e.g. crawl)
*/10 * * * * /home/*/*.sh
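The entry can also be added without opening an editor. The following is a minimal sketch, assuming a POSIX shell; the helper name add_cron_line is made up for illustration, and the elided path /home/*/*.sh is kept exactly as written above. The helper only merges text, so nothing is installed until its output is piped into crontab.

```shell
#!/bin/sh
# Sketch: append a cron entry to existing crontab text only if it is missing.
# add_cron_line is a hypothetical helper name, not part of cron or nutch.
add_cron_line() {
  line=$1               # the entry to add
  existing=$(cat)       # current crontab text, read from stdin
  if [ -n "$existing" ]; then
    printf '%s\n' "$existing"
  fi
  # -x: match the whole line, -F: treat the entry as a fixed string
  if ! printf '%s\n' "$existing" | grep -qxF -- "$line"; then
    printf '%s\n' "$line"
  fi
}

# The path is elided in the original article and is left as-is here.
entry='*/10 * * * * /home/*/*.sh'

# Dry run on an empty crontab, then on the result (the entry is not duplicated).
first=$(printf '' | add_cron_line "$entry")
second=$(printf '%s\n' "$first" | add_cron_line "$entry")
printf '%s\n' "$second"

# To actually install it (this changes the current user's crontab):
# crontab -l 2>/dev/null | add_cron_line "$entry" | crontab -
```

Running the helper twice yields the same single entry, which makes the commented-out install line safe to repeat.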

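Pay attention to which account the job runs under: here it is root; for any other account, change the account name accordingly. Also give the *.sh file execute permission.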
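As a rough sketch of what such a script might contain and how to make it executable: the directory nutch-cron-demo, the file name crawl.sh, the NUTCH_HOME location and the crawl arguments below are all placeholders chosen for illustration, not values from the original setup.

```shell
#!/bin/sh
# Sketch only: every path and argument here is a hypothetical placeholder.
demo_dir="${TMPDIR:-/tmp}/nutch-cron-demo"
mkdir -p "$demo_dir"

# Write a minimal wrapper script that cron could call.
cat > "$demo_dir/crawl.sh" <<'EOF'
#!/bin/sh
# Hypothetical nutch install location; adjust to the real one.
NUTCH_HOME=/opt/nutch-1.4
cd "$NUTCH_HOME" || exit 1
# One crawl run; the seed directory, depth and topN are example values.
bin/nutch crawl urls -dir crawl -depth 3 -topN 50
EOF

# cron can only execute the file directly if it has execute permission.
chmod +x "$demo_dir/crawl.sh"
ls -l "$demo_dir/crawl.sh"
```

The block only writes the file and sets its permission; it does not start a crawl.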

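If the *.sh script relies on system environment variables, it will fail to run properly, because cron does not load those variables (see http://peigang.iteye.com/blog/1567706). Use the following form instead: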

crontab -e
*/10 * * * * . /etc/profile;/bin/sh /home/*/*.sh

  The ". /etc/profile;" prefix sources the system profile, so the environment variables are defined before /bin/sh runs the script.
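The effect can be reproduced outside cron. The sketch below uses env -i to start a shell with an empty environment, which only approximates what cron provides (cron still sets a few variables such as HOME and PATH). DEMO_HOME and the temporary profile file are stand-ins invented for this example, playing the roles of a variable like JAVA_HOME and of /etc/profile.

```shell
#!/bin/sh
# Sketch: a variable is missing in a bare environment unless a profile is sourced.
# DEMO_HOME and the temporary profile are hypothetical stand-ins.
profile=$(mktemp)
echo 'DEMO_HOME=/opt/demo; export DEMO_HOME' > "$profile"

# Bare environment, similar in spirit to a cron job: the variable is not set.
bare=$(env -i /bin/sh -c 'echo "${DEMO_HOME:-unset}"')

# Same bare environment, but the profile is sourced first (like ". /etc/profile;").
sourced=$(env -i /bin/sh -c '. "$1"; echo "${DEMO_HOME:-unset}"' sh "$profile")

echo "without profile: $bare"
echo "with profile:    $sourced"
rm -f "$profile"
```

The first line reports the variable as unset and the second reports /opt/demo, which mirrors why the crontab entry sources the profile before calling the script.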

3. Run sudo apt-get install libnotify-bin

4. Restart the cron process:

~#sudo /etc/init.d/cron restart 

    Check the result. The restart may not succeed; in that case, stop and start cron separately as follows:

15:40:34^O^bin$ sudo /etc/init.d/cron stop
 [sudo] password for sniffer: 
 Rather than invoking init scripts through /etc/init.d, use the service(8)
 utility, e.g. service cron stop

 Since the script you are attempting to invoke has been converted to an
 Upstart job, you may also use the stop(8) utility, e.g. stop cron
 cron stop/waiting
 15:40:49^O^bin$ ps -A | grep cron
 15:40:54^O^bin$ sudo /etc/int.d/cron start
 sudo: /etc/int.d/cron: command not found
 15:41:11^O^bin$ sudo /etc/init.d/cron start
 Rather than invoking init scripts through /etc/init.d, use the service(8)
 utility, e.g. service cron start

 Since the script you are attempting to invoke has been converted to an
 Upstart job, you may also use the start(8) utility, e.g. start cron
 cron start/running, process 14362
 15:41:19^O^bin$ ps -A | grep cron
 14362 ?        00:00:00 cron

 Note: if the nutch script fails because it cannot find JAVA_HOME, change the following part of it:

if [ "$JAVA_HOME" = "" ]; then
  #echo "Error: JAVA_HOME is not set."
  #exit 1
  JAVA_HOME="***"
fi
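The same fallback can be written with shell parameter expansion. This is a sketch only: the function name resolve_java_home and the default path /usr/lib/jvm/default-java are assumptions made for illustration, and the real JDK path has to be substituted, just as for the "***" placeholder above.

```shell
#!/bin/sh
# Sketch: use a default when JAVA_HOME is empty or unset.
# resolve_java_home and /usr/lib/jvm/default-java are hypothetical.
resolve_java_home() {
  # ${1:-default} expands to the default when $1 is empty or unset.
  echo "${1:-/usr/lib/jvm/default-java}"
}

JAVA_HOME=$(resolve_java_home "${JAVA_HOME:-}")
export JAVA_HOME
echo "JAVA_HOME=$JAVA_HOME"
```

An existing JAVA_HOME is left unchanged, so the script still behaves normally when it is started from an interactive shell.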


Reposted from peigang.iteye.com/blog/1560008