Nutch notes: errors I ran into and how I fixed them

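1. Running bin/nutch crawl urls -dir crawled -depth 3 -topN 50 >&crawl.log under Cygwin

The following error appears: bin/nutch: line 251: exec: C:\Program: not found.

Fix: reinstall Cygwin with a complete installation, instead of installing only the handful of packages that online guides say are needed.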

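2. Garbled text in the top-right tabs

The "简介" (About) and "常见问题" (FAQ) tabs in the top-right corner display correctly on the main search page, but turn into garbled characters once a search is run.

Change the encoding of Tomcat 7.0/webapps/nutch-1.2/zh/header.html to GBK: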

<?xml version="1.0" encoding="GBK"?>
Note: after <?xml version="1.0" encoding="GBK"?>, also add <META http-equiv="Content-Type" content="text/html; charset=UTF-8">.

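3. Adding IKAnalyzer Chinese word segmentation to Nutch 1.2 (following the referenced article)
Modifying the source as that article describes can produce the following error: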
LinkDb: finished at 2011-07-14 11:34:06, elapsed: 00:00:03
Indexer: starting at 2011-07-14 11:34:06
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
at org.apache.nutch.indexer.Indexer.index(Indexer.java:76)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:167)
Fix: this happens while crawling, most likely because IKAnalyzer3.2.8.jar was not copied into the nutch/lib directory.
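
A quick way to rule this cause in or out is to check whether the analyzer class can be loaded at all. The sketch below is only an illustration: the class name org.wltea.analyzer.lucene.IKAnalyzer is my assumption for IK Analyzer 3.x (check it against your own jar), and ClasspathCheck is a hypothetical helper, not part of Nutch.

```java
// Minimal sketch: report whether a class can be loaded from the current classpath.
// ClasspathCheck is a hypothetical helper written for this note, not a Nutch class.
class ClasspathCheck {

    /** Returns true if the named class can be found by this class's loader. */
    static boolean isLoadable(String className) {
        try {
            // 'false' means: locate the class without running its static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed class name for IK Analyzer 3.x; pass another name as the first argument.
        String name = args.length > 0 ? args[0] : "org.wltea.analyzer.lucene.IKAnalyzer";
        String verdict = isLoadable(name) ? " is on the classpath" : " is NOT on the classpath";
        System.out.println(name + verdict);
    }
}
```

Run it with the same classpath the crawl uses (including the jars in nutch/lib). If it reports that the class is missing, the jar was not picked up.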

4. After modifying the source, searches return a blank page (this one cost me three days)

The error is:

Caused by: java.lang.IllegalArgumentException: This AttributeSource does not have the attribute 'org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute'.

at org.apache.lucene.util.AttributeSource.getAttribute(AttributeSource.java:277)

at org.apache.nutch.summary.basic.BasicSummarizer.getTokens(BasicSummarizer.java:362)

at org.apache.nutch.summary.basic.BasicSummarizer.getSummary(BasicSummarizer.java:134)

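Cause:

Earlier we modified the NutchDocumentAnalyzer class to use IKAnalyzer, so the source of the open-source IKAnalyzer segmenter now has to be changed as well.
IKAnalyzer never adds the PositionIncrementAttribute attribute, which is what triggers the exception. To fix it, add the following to IKTokenizer.java in the IKAnalyzer source:

// in the import section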

import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

// in the field declarations

private PositionIncrementAttribute posIncrAtt;

// inside the public IKTokenizer(Reader in, boolean isMaxWordLength) constructor

posIncrAtt = addAttribute(PositionIncrementAttribute.class);
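
Rebuild IKAnalyzer with ant to produce IKAnalyzer3.2.8.jar (it seems you have to write the ant build.xml yourself; I simply exported the jar from Eclipse).

Replace the corresponding file under nutch, then rebuild Nutch.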
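
To see why a missing addAttribute call leads to this exception, here is a toy model of the mechanism. It is a simplified sketch, not Lucene's actual implementation: ToyAttributeSource, PosIncr, and TokenType are hypothetical names made up for illustration. Only the addAttribute/getAttribute call shape and the exception message are taken from the code and stack trace above.

```java
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for Lucene's AttributeSource; a simplified model, not the real class.
class ToyAttributeSource {

    // Hypothetical attribute types standing in for Lucene's attribute interfaces.
    static class PosIncr { int value = 1; }
    static class TokenType { String value = "word"; }

    private final Map<Class<?>, Object> attributes = new HashMap<>();

    /** Registers the attribute on first use and returns the shared instance. */
    <T> T addAttribute(Class<T> type) {
        Object existing = attributes.get(type);
        if (existing == null) {
            try {
                existing = type.getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("cannot instantiate " + type.getName(), e);
            }
            attributes.put(type, existing);
        }
        return type.cast(existing);
    }

    /** Returns a registered attribute; fails if the producer never added it. */
    <T> T getAttribute(Class<T> type) {
        Object found = attributes.get(type);
        if (found == null) {
            throw new IllegalArgumentException(
                "This AttributeSource does not have the attribute '" + type.getName() + "'.");
        }
        return type.cast(found);
    }

    public static void main(String[] args) {
        ToyAttributeSource tokenizer = new ToyAttributeSource();
        // The "tokenizer" registers only one attribute, as a constructor would.
        tokenizer.addAttribute(PosIncr.class);
        // A consumer asking for the registered attribute succeeds.
        System.out.println(tokenizer.getAttribute(PosIncr.class).value);
        // A consumer asking for an attribute that was never added gets the exception.
        try {
            tokenizer.getAttribute(TokenType.class);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this model the consumer (BasicSummarizer in the stack trace) only reads attributes, so every attribute it asks for has to be registered by the tokenizer beforehand. That is why the fix goes into the IKTokenizer constructor.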

5. Still a blank page after fixing problem 4 (this one cost me three days)

The log files under tomcat show the following exception:

java.lang.IllegalArgumentException: This AttributeSource does not have the attribute 'org.apache.lucene.analysis.tokenattributes.TypeAttribute'.

at org.apache.lucene.util.AttributeSource.getAttribute(AttributeSource.java:277)

at org.apache.nutch.summary.basic.BasicSummarizer.getTokens(BasicSummarizer.java:364)

at org.apache.nutch.summary.basic.BasicSummarizer.getSummary(BasicSummarizer.java:135)

at org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:263)

at org.apache.nutch.searcher.FetchedSegments$SummaryTask.call(FetchedSegments.java:63)

at org.apache.nutch.searcher.FetchedSegments$SummaryTask.call(FetchedSegments.java:53)

at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)

at java.util.concurrent.FutureTask.run(FutureTask.java:138)

at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

at java.lang.Thread.run(Thread.java:662)

The cause is similar to problem 4. Add the following in the same places:

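import org.apache.lucene.analysis.tokenattributes.TypeAttribute; // in the import section

private TypeAttribute typeAtt; // in the field declarations

typeAtt = addAttribute(TypeAttribute.class); // inside the IKTokenizer constructor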

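Then rebuild IKAnalyzer3.2.8.jar once more.

Finally, run ant again to produce nutch-1.2.job, nutch-1.2.war, and nutch-1.2.jar. Replace the files used for both crawling and searching with the new builds, and don't forget IKAnalyzer3.2.8.jar.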
