Getting Started with Nutch

Environment

Nutch website: http://nutch.apache.org/
Linux: CentOS 7.3, 64-bit
JDK 1.8
apache-nutch-2.2.1-src.tar.gz
MySQL

JDK configuration

yum search jdk | grep java
yum install java-1.8.0-openjdk
vi /etc/profile

Add the following:

#set java environment
JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-5.b12.el7_4.x86_64
JRE_HOME=$JAVA_HOME/jre
CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$JRE_HOME/lib
PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin
export JAVA_HOME JRE_HOME CLASSPATH PATH
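
The JAVA_HOME path above is tied to one specific OpenJDK build, so it can go stale after a package update. As a minimal sketch, JAVA_HOME can instead be derived from the resolved java binary. The helper name java_home_from_bin is hypothetical, and the sketch assumes the binary sits at bin/java or jre/bin/java under the JDK home, as in the yum OpenJDK 1.8 packages:

```shell
# Derive a JDK home directory from the full path of a java binary.
# Hypothetical helper; assumes the binary lives at <home>/bin/java
# or <home>/jre/bin/java.
java_home_from_bin() {
    jh_path="${1%/bin/java}"     # strip trailing /bin/java
    jh_path="${jh_path%/jre}"    # strip trailing /jre if present
    printf '%s\n' "$jh_path"
}

# Only resolve when java is actually installed.
if command -v java >/dev/null 2>&1; then
    JAVA_HOME="$(java_home_from_bin "$(readlink -f "$(command -v java)")")"
    export JAVA_HOME
    echo "JAVA_HOME=$JAVA_HOME"
fi
```

After editing /etc/profile, run `source /etc/profile` and then `java -version` to confirm the settings are picked up.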

Installing Ant

yum install ant

Building Nutch

wget http://archive.apache.org/dist/nutch/2.2.1/apache-nutch-2.2.1-src.tar.gz
tar -xvf apache-nutch-2.2.1-src.tar.gz
cd apache-nutch-2.2.1/
ant

Changing the Maven repository in ivysettings.xml

http://maven.aliyun.com/nexus/content/groups/public/
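
As a sketch of the repository change, the helper below rewrites the repository URL in an ivysettings.xml file with sed. The name use_aliyun_mirror is hypothetical. It assumes the stock file points at http://repo1.maven.org/maven2/, which should be confirmed in your own copy before running it:

```shell
# Replace the default Maven repository URL with the Aliyun mirror.
# Assumption: the file references http://repo1.maven.org/maven2/.
# A .bak backup of the original file is kept next to it.
use_aliyun_mirror() {
    settings_file="$1"
    sed -i.bak \
        's#http://repo1\.maven\.org/maven2/#http://maven.aliyun.com/nexus/content/groups/public/#g' \
        "$settings_file"
}

# Usage (from the apache-nutch-2.2.1 directory):
#   use_aliyun_mirror ivy/ivysettings.xml
```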

The build generates a runtime folder containing two subfolders, deploy and local.

deploy depends on Hadoop: HDFS handles storage and MapReduce handles analysis, supplemented by other features. local has no such dependency.
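
A quick way to confirm the build produced both folders is a small check like the following sketch. The function name has_runtime_dirs is hypothetical; its argument is the Nutch source directory:

```shell
# Succeed only if both runtime/local and runtime/deploy exist
# under the given Nutch source directory.
has_runtime_dirs() {
    [ -d "$1/runtime/local" ] && [ -d "$1/runtime/deploy" ]
}

# Usage (from the apache-nutch-2.2.1 directory):
#   has_runtime_dirs . && echo "runtime is ready"
```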

To configure MySQL support in Nutch, edit ${APACHE_NUTCH_HOME}/ivy/ivy.xml:

-- Find the following lines and uncomment them
<dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>
<dependency org="org.apache.gora" name="gora-sql" rev="0.1.1-incubating" conf="*->default" />

Database connection configuration

Edit ${NUTCH_HOME}/conf/gora.properties

###############################
# MySQL properties           
###############################
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://192.168.58.1:3306/nutch?createDatabaseIfNotExist=true
gora.sqlstore.jdbc.user=root
gora.sqlstore.jdbc.password=123456
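
Before building, it can help to confirm that the MySQL server named in gora.sqlstore.jdbc.url is reachable. The sketch below splits a jdbc:mysql:// URL into host, port, and database using shell parameter expansion. The helper and variable names are hypothetical, the URL is assumed to carry an explicit port, and the commented mysql call assumes the mysql command-line client is installed:

```shell
# Split a jdbc:mysql://host:port/db?params URL into its parts.
# Hypothetical helper; assumes the URL includes an explicit port.
parse_jdbc_mysql_url() {
    rest="${1#jdbc:mysql://}"     # host:port/db?params
    hostport="${rest%%/*}"        # host:port
    dbparams="${rest#*/}"         # db?params
    JDBC_HOST="${hostport%%:*}"
    JDBC_PORT="${hostport##*:}"
    JDBC_DB="${dbparams%%\?*}"
}

parse_jdbc_mysql_url 'jdbc:mysql://192.168.58.1:3306/nutch?createDatabaseIfNotExist=true'
echo "$JDBC_HOST $JDBC_PORT $JDBC_DB"   # prints: 192.168.58.1 3306 nutch

# Connection check (prompts for the password):
#   mysql -h "$JDBC_HOST" -P "$JDBC_PORT" -u root -p -e 'SELECT 1'
```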

Modify the nutch-site configuration file

 vim nutch-site.xml 
<property>
<name>http.agent.name</name>
<value>LiuXun Nutch Spider</value>
</property>

<property>
<name>http.accept.language</name>
<value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
<description>Value of the "Accept-Language" request header field.
This allows selecting non-English language as default one to retrieve.
It is a useful setting for search engines build for certain national group.
</description>
</property>

<property>
<name>parser.character.encoding.default</name>
<value>utf-8</value>
<description>The character encoding to fall back to when no other information
is available</description>
</property>

<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.sql.store.SqlStore</value>
<description>The Gora DataStore class for storing and retrieving data.
Currently the following stores are available: ….
</description>

</property>
<!-- Added specifically -->
<property>
    <name>generate.batch.id</name>
    <value>*</value>
</property>

Then run on the command line:

ant clean

Then run

ant runtime

Start crawling

mkdir urls
vim urls/url.txt    ------ list the sites to crawl
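
The seed list is a plain text file inside the urls directory with one URL per line. A non-interactive sketch of the same step follows. The helper name write_seeds is hypothetical, and the URL in the usage line is only a placeholder:

```shell
# Create the seed directory and write one URL per line into url.txt.
write_seeds() {
    seed_dir="$1"
    shift
    mkdir -p "$seed_dir"
    printf '%s\n' "$@" > "$seed_dir/url.txt"
}

# Usage:
#   write_seeds urls 'http://nutch.apache.org/'
```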

Run the command

bin/nutch crawl urls -depth 3 -topN 5


Reposted from blog.csdn.net/zxh476771756/article/details/78965225