How to call Nutch from a web project

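Out of the box, Nutch gives users two ways to run it. The first is to use Cygwin to emulate a Unix-like environment and run Nutch from the Cygwin command line. The second is for developers: import the Nutch source code into an Eclipse project and run the main() method of the org.apache.nutch.crawl.Crawl class. Both approaches only suit a programmer starting a crawl by hand while debugging, so when you build your own application on top of Nutch, the crawl has to be invoked from a JSP instead.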

Steps

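1: Create a new web project. Copy the directories under the plugins folder of the Nutch source into src, then copy the packages under Nutch's src/java into src as well.

2: Copy all the configuration files from Nutch's conf folder into src, and copy the nutch.job file from the Nutch directory into src too.

3: Copy the jar files from Nutch's lib folder into WEB-INF/lib.

4: Create a new class under src and use it to call the main() method of Crawl.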

 

package valley.test;

import org.apache.nutch.crawl.Crawl;

public class test {

	public static void main(String[] args) {
		String []arg ={"url.txt","-dir","crawled","-depth","10","-topN","50"};
		try {
			Crawl.main(arg);
		} catch (Exception e) {
			// Print the stack trace so the failure shows up in the server console
			e.printStackTrace();
		}
	}
}
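Calling Crawl.main() directly from a JSP keeps the HTTP request blocked until the whole crawl finishes. The sketch below is one possible way around that, not something from the Nutch distribution: CrawlLauncher, buildArgs and startInBackground are hypothetical names made up for this example. It builds the same argument array as the test class above and hands the crawl to a single background thread. It sticks to Java 6 syntax and takes the crawl as a Runnable, so it compiles without the Nutch jars.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadFactory;

/** Hypothetical helper for illustration only; it is not part of Nutch. */
final class CrawlLauncher {

    // A single daemon worker thread: at most one crawl runs at a time,
    // and the thread never prevents the JVM from shutting down.
    private static final ExecutorService WORKER =
            Executors.newSingleThreadExecutor(new ThreadFactory() {
                public Thread newThread(Runnable r) {
                    Thread t = new Thread(r, "nutch-crawl");
                    t.setDaemon(true);
                    return t;
                }
            });

    private CrawlLauncher() {
    }

    /** Builds the same argument layout that the test class passes to Crawl.main(). */
    static String[] buildArgs(String urlFile, String outputDir, int depth, int topN) {
        return new String[] {
            urlFile,
            "-dir", outputDir,
            "-depth", String.valueOf(depth),
            "-topN", String.valueOf(topN)
        };
    }

    /** Queues the crawl on the worker thread and returns immediately. */
    static Future<?> startInBackground(Runnable crawlTask) {
        return WORKER.submit(crawlTask);
    }
}
```

In the web project, the Runnable passed to startInBackground would wrap the real call, for example Crawl.main(CrawlLauncher.buildArgs("url.txt", "crawled", 10, 50)) inside a try/catch, and the JSP could return a "crawl started" page right away.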

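5: Now you can call this test class from a JSP. The call usually fails with an exception tied to the JVM memory settings (-Xms100m -Xmx800m). The output looks like this:

Injector: Converting injected urls to crawl db entries.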

java.io.IOException: Job failed!
	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
	at org.apache.nutch.crawl.Injector.inject(Injector.java:160)
	at org.apache.nutch.crawl.Crawl.main(Crawl.java:113)
	at valley.test.test.main(test.java:10)
	at org.apache.jsp.MyJsp_jsp._jspService(MyJsp_jsp.java:79)
	at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:94)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:717)
	at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:324)
	at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:292)
	at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:236)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:717)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
	at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
	at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
	at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
	at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
	at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
	at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
	at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:849)
	at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
	at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:454)
	at java.lang.Thread.run(Thread.java:619)

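This happens because Tomcat does not have enough memory; I won't go through the fix in detail here. Also, don't forget to edit the configuration files under src. They are edited exactly the same way as in a normal Nutch setup.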
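To confirm whether the heap settings actually reached the Tomcat JVM, it can help to print the heap limit from inside the web application. The following is a minimal sketch: HeapCheck and its methods are hypothetical names, and the threshold you compare against is your own choice. One common way to pass -Xms100m -Xmx800m to Tomcat is through the JAVA_OPTS environment variable before starting it, though the right mechanism depends on how your Tomcat is launched.

```java
/** Hypothetical helper for illustration; reports the heap limit of the running JVM. */
final class HeapCheck {

    private HeapCheck() {
    }

    /** Maximum heap the JVM will try to use, in megabytes. */
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    /** True if the heap limit is at least requiredMb megabytes. */
    static boolean hasAtLeast(long requiredMb) {
        return maxHeapMb() >= requiredMb;
    }
}
```

Printing HeapCheck.maxHeapMb() from the JSP before starting the crawl shows what the JVM really got. The reported value can be somewhat lower than the -Xmx figure, so treat it as approximate.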

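Finally, you are probably wondering why I never said where url.txt should go. url.txt must be placed in tomcat/bin, and the crawled data is also written under bin automatically. I don't understand the exact reason myself. All of this assumes you are using Tomcat as the web server. The project must also use JDK 1.6, otherwise it will not compile.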
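A likely explanation for the tomcat/bin behaviour: "url.txt" and "crawled" are relative paths, and Java resolves relative paths against the JVM's working directory (the user.dir system property). When Tomcat is started from its bin directory, that directory becomes the working directory. The sketch below shows how to check this from inside the web application; PathProbe is a hypothetical name used only for this example.

```java
import java.io.File;

/** Hypothetical helper for illustration; shows where relative paths end up. */
final class PathProbe {

    private PathProbe() {
    }

    /** The working directory of the running JVM. */
    static String workingDir() {
        return System.getProperty("user.dir");
    }

    /** The absolute location a relative path such as "url.txt" resolves to. */
    static String resolve(String relativePath) {
        return new File(relativePath).getAbsolutePath();
    }
}
```

Printing PathProbe.resolve("url.txt") from the JSP shows exactly where the file is expected. Passing absolute paths in the argument array instead of "url.txt" and "crawled" should remove the dependency on where Tomcat was started.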


Reposted from zha-zi.iteye.com/blog/639850