A Walkthrough of the Nutch 2.1 Crawler Source Code

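When Crawler.java runs, the program goes through six jobs: InjectorJob, GeneratorJob, FetcherJob, ParserJob, DbUpdaterJob, and SolrIndexerJob. All of these classes implement the org.apache.hadoop.util.Tool interface, and Crawler switches from one job to the next through its runTool method, which takes the job's class (a Class<? extends Tool>) and args. Tool declares a single method, run(String[]). It also extends a parent interface, org.apache.hadoop.conf.Configurable, which has two methods: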

void setConf(Configuration conf);

Configuration getConf();
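The relationship between these interfaces and a class-based dispatch helper can be sketched roughly as follows. This is a minimal sketch, not Nutch or Hadoop code: SimpleConfigurable, SimpleTool, ToyJob and ToolDispatch are hypothetical stand-ins, and a plain Map replaces Hadoop's Configuration so the example runs without Hadoop on the classpath.

```java
import java.util.Map;

// Hypothetical stand-in for org.apache.hadoop.conf.Configurable,
// with a plain Map in place of Hadoop's Configuration.
interface SimpleConfigurable {
    void setConf(Map<String, String> conf);
    Map<String, String> getConf();
}

// Hypothetical stand-in for org.apache.hadoop.util.Tool.
interface SimpleTool extends SimpleConfigurable {
    int run(String[] args) throws Exception;
}

// A toy job: returns 0 only when it has been configured
// and received at least one argument, otherwise -1.
class ToyJob implements SimpleTool {
    private Map<String, String> conf;

    @Override public void setConf(Map<String, String> conf) { this.conf = conf; }
    @Override public Map<String, String> getConf() { return conf; }

    @Override
    public int run(String[] args) {
        return (conf != null && args.length > 0) ? 0 : -1;
    }
}

class ToolDispatch {
    // Instantiate the job class reflectively, hand it the shared
    // configuration, then run it with the given arguments.
    static int runTool(Class<? extends SimpleTool> toolClass,
                       Map<String, String> conf, String[] args) throws Exception {
        SimpleTool tool = toolClass.getDeclaredConstructor().newInstance();
        tool.setConf(conf);
        return tool.run(args);
    }
}
```

Because every job is reached through the same interface, the caller only needs to know the job's class, which is what makes a single dispatch helper sufficient for all six jobs.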

The source of Crawler's main method is as follows:

public static void main(String[] args) throws Exception {
	Crawler c = new Crawler();
	Configuration conf = NutchConfiguration.create();
	int res = ToolRunner.run(conf, c, args);
	System.exit(res);
}

Explanation:

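In the code above, the conf variable is carried through the entire run of the program. NutchConfiguration.create() loads Nutch's standard configuration files, nutch-default.xml and nutch-site.xml: nutch-default.xml is loaded first, then nutch-site.xml. Any property declared in nutch-site.xml overrides the setting of the same property in nutch-default.xml.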
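The override order can be illustrated with a small sketch. It uses java.util.Properties rather than Hadoop's Configuration, so it only models the "later resource wins" behaviour. LayeredConfigSketch and its load method are hypothetical, and the property values are illustrative, not copied from a real Nutch installation.

```java
import java.util.Properties;

class LayeredConfigSketch {
    // Apply resources in order; a key set by a later resource
    // replaces the value set by an earlier one.
    static Properties load(Properties... resourcesInOrder) {
        Properties merged = new Properties();
        for (Properties resource : resourcesInOrder) {
            merged.putAll(resource);
        }
        return merged;
    }

    public static void main(String[] args) {
        // Plays the role of nutch-default.xml (values are illustrative).
        Properties defaults = new Properties();
        defaults.setProperty("http.agent.name", "");
        defaults.setProperty("fetcher.threads.fetch", "10");

        // Plays the role of nutch-site.xml: only overrides one property.
        Properties site = new Properties();
        site.setProperty("http.agent.name", "MyCrawler");

        Properties conf = load(defaults, site);
        System.out.println(conf.getProperty("http.agent.name"));       // MyCrawler
        System.out.println(conf.getProperty("fetcher.threads.fetch")); // 10
    }
}
```

Properties that the site file does not mention keep their default values, which is why nutch-site.xml normally contains only the handful of settings a deployment needs to change.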

The call that actually runs the program is ToolRunner.run(conf, c, args):

public static int run(Configuration conf, Tool tool, String[] args)
    throws Exception {
  if (conf == null) {
    conf = new Configuration();
  }
  // parse args (the arguments we entered on the command line)
  GenericOptionsParser parser = new GenericOptionsParser(conf, args);
  // set the configuration back, so that Tool can configure itself
  tool.setConf(conf);
  // get the args w/o generic hadoop args
  String[] toolArgs = parser.getRemainingArgs();
  return tool.run(toolArgs);
}

Explanation:

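tool.run(toolArgs) hands control back to Crawler by calling its run(String[]) method. This method is fairly simple: it processes the values in toolArgs to read the option values, converts the arguments into a Map, and then calls run(Map). The actual crawl happens inside run(Map).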
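The conversion from an argument array to a Map might look roughly like the sketch below. It is not the Crawler source: the class ArgMapSketch, the option names -depth and -topN, and the map keys are all assumptions chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;

class ArgMapSketch {
    // Turn arguments such as "urls -depth 3 -topN 50" into a map.
    // Option names and map keys are hypothetical.
    static Map<String, Object> toArgMap(String[] args) {
        Map<String, Object> map = new HashMap<>();
        for (int i = 0; i < args.length; i++) {
            String arg = args[i];
            if ("-depth".equals(arg)) {
                map.put("depth", Integer.parseInt(valueOf(args, ++i, arg)));
            } else if ("-topN".equals(arg)) {
                map.put("topN", Long.parseLong(valueOf(args, ++i, arg)));
            } else {
                // Any other argument is treated as the seed directory.
                map.put("seedDir", arg);
            }
        }
        return map;
    }

    // Return the value following an option, or fail if it is missing.
    private static String valueOf(String[] args, int index, String option) {
        if (index >= args.length) {
            throw new IllegalArgumentException("missing value for " + option);
        }
        return args[index];
    }
}
```

Passing a Map instead of the raw array means each job can pick out the entries it needs by key, without re-parsing the command line.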

Inside that method, six steps are carried out:

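1. InjectorJob: reads the seed file path from the seedDir argument and adds the seed URLs to the crawl chain (the store of URLs known to the crawler)

2. GeneratorJob: takes links from the crawl chain and puts them into the fetch queue

3. FetcherJob: takes tasks from the fetch queue and fetches the pages

4. ParserJob: parses the fetched pages, producing new links and the parse results

5. DbUpdaterJob: writes the new links back to the crawl chain

6. SolrIndexerJob: indexes the fetched content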
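The order of the steps above can be sketched as a simple plan builder. This is an illustration under stated assumptions, not the Crawler source: it assumes that injection happens once, that steps 2 to 5 repeat once per crawl round, and that indexing runs once at the end if requested. CrawlCycleSketch, plan, and the lowercase step labels are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class CrawlCycleSketch {
    // Return the order in which the steps would be invoked for a crawl
    // of the given number of rounds. Assumption: generate, fetch, parse
    // and update repeat per round; inject and index run once.
    static List<String> plan(int rounds, boolean index) {
        List<String> steps = new ArrayList<>();
        steps.add("inject");
        for (int round = 0; round < rounds; round++) {
            steps.add("generate");
            steps.add("fetch");
            steps.add("parse");
            steps.add("updatedb");
        }
        if (index) {
            steps.add("index");
        }
        return steps;
    }

    public static void main(String[] args) {
        System.out.println(plan(2, true));
    }
}
```

Each round feeds the next: the links written back in the update step are the ones the following generate step can select, which is how the crawl reaches pages beyond the original seeds.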

The source code of the six steps above will be walked through in later posts.
