Heritrix 3.1.1: New Features

This is an original article. If you repost it, please credit the source: http://guoyunsky.iteye.com/blog/1744866

My Sina Weibo: http://weibo.com/guoyunwb

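I took a look at Heritrix over the weekend and found that quite a lot has changed. I no longer write crawlers much, but following a project that keeps evolving over the long run is a real pleasure, and I learn a good deal from it. Here is a summary to share.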

Heritrix 3.1.1 was released in May 2012. Its release notes follow.

      Nicer code editor for crawl config and script console (HER-2001)

The crawl configuration cxml editor and the scripting console editor now use CodeMirror, which adds syntax highlighting, line numbers and other features

       Fixed occasional mangling of DNS records in ARCs and WARCs (HER-1983)

A longstanding bug that caused some DNS records in ARCs and WARCs to be mangled, due to unsafe use of a shared variable among threads, is now fixed.

       Remember all surts across checkpoint/resume (HER-1985)

Surts that were derived from seeds, or listed as surts in the seeds source, or that were added using a .seeds file in the action directory, can now be remembered across checkpoint/resume. For that to work the relevant SurtPrefixedDecideRule must be a top-level bean. The default cxml distributed with heritrix now includes the key decide rule as a top-level bean with id "acceptSurts".

       Support for saving script state (HER-1984)

Added a shared map for arbitrary use during a crawl. It can be used for state persisting for the duration of the crawl, shared among ScriptedProcessor, scripting console and other scripts, or other purposes. In scripts it can be obtained with appCtx.getData().

 

My reading of each item, with some commentary.

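1. A friendlier code editor in the UI (HER-2001)

Heritrix lets you edit the crawl configuration through its web UI, and since Heritrix 3.0 it has supported dynamic scripting (Python, JS, etc.). The old interface was not very friendly, though: it was just a plain text box. As of Heritrix 3.1.1, both the cxml configuration editor and the scripting console use CodeMirror (http://codemirror.net/), which adds syntax highlighting, line numbers and other features.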

 

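2. Fixed mangled DNS records in ARCs and WARCs (HER-1983)

This fixes a long-standing bug: a variable shared among threads was used unsafely, which occasionally caused DNS records written to ARCs and WARCs to be mangled.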
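To make this class of bug concrete, here is a minimal Java sketch. It is not Heritrix's code: the class names, the record layout and the host/IP values are all made up for illustration. It only contrasts a formatter that reuses one buffer across threads (the unsafe pattern) with one that gives every call its own buffer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class DnsRecordSketch {
    // Unsafe pattern (hypothetical): one buffer shared by every calling thread.
    // Two threads can interleave setLength/append and produce a mixed-up record.
    static class UnsafeFormatter {
        private final StringBuilder shared = new StringBuilder();
        String format(String host, String ip) {
            shared.setLength(0);
            shared.append(host).append(' ').append(ip);
            return shared.toString();
        }
    }

    // Safe pattern: the buffer is local to the call, so nothing is shared.
    static class SafeFormatter {
        String format(String host, String ip) {
            StringBuilder local = new StringBuilder();
            local.append(host).append(' ').append(ip);
            return local.toString();
        }
    }

    // Formats records from several threads and counts any that come back wrong.
    static int countMangled(SafeFormatter f, int threads, int perThread) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<Integer>> results = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            final String host = "host" + t + ".example.";
            final String ip = "10.0.0." + t;
            results.add(pool.submit(() -> {
                int bad = 0;
                for (int i = 0; i < perThread; i++) {
                    if (!f.format(host, ip).equals(host + " " + ip)) bad++;
                }
                return bad;
            }));
        }
        int total = 0;
        for (Future<Integer> r : results) total += r.get();
        pool.shutdown();
        return total;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("mangled records: " + countMangled(new SafeFormatter(), 8, 10000));
    }
}
```

The safe formatter never reports a mangled record because no state outlives a single call. UnsafeFormatter is deliberately left out of the run: under concurrent use its behavior is unpredictable, which is exactly why such bugs show up only occasionally.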

 

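3. Surts are remembered across checkpoint/resume (HER-1985) (my understanding may be wrong; to be confirmed)

Surts can come from three places: they can be derived from seeds, listed as surts in the seeds source, or added through a .seeds file dropped into the action directory. All of them can now be remembered across checkpoint/resume. For this to work, the relevant SurtPrefixedDecideRule must be a top-level bean; the default cxml shipped with Heritrix now declares the key decide rule as a top-level bean with id "acceptSurts". (As I understand it, previously only surts from explicitly configured seed sources survived, while those added through the action directory did not.)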
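For readers who have not met the term: a SURT (Sort-friendly URI Reordering Transform) writes the host name in reverse label order, so that a simple string prefix can describe a whole domain. The Java sketch below is a deliberately simplified illustration of that idea and of prefix-based acceptance. It is not Heritrix's own SURT code, it ignores ports, user info and query strings, and the class and method names are hypothetical.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class SurtSketch {
    // Simplified SURT form: scheme, then reversed host labels in "(...,)", then the path.
    static String toSurt(String url) {
        URI uri = URI.create(url);
        List<String> labels = new ArrayList<>(Arrays.asList(uri.getHost().split("\\.")));
        Collections.reverse(labels);
        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        return uri.getScheme() + "://(" + String.join(",", labels) + ",)" + path;
    }

    // A URL is accepted when its SURT form starts with any remembered prefix.
    static boolean accepted(String url, List<String> prefixes) {
        String surt = toSurt(url);
        for (String prefix : prefixes) {
            if (surt.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // A prefix that covers example.com and all of its subdomains.
        List<String> prefixes = Arrays.asList("http://(com,example,");
        System.out.println(toSurt("http://www.example.com/a/b"));
        System.out.println(accepted("http://news.example.com/x", prefixes));
        System.out.println(accepted("http://example.org/", prefixes));
    }
}
```

The list of prefixes here plays the role of the state that item 3 is about: if it is lost at resume time, URLs that were in scope before the checkpoint would no longer be accepted.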

 

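4. Support for saving script state (HER-1984)

A shared map for arbitrary use has been added, and it lives for the duration of the crawl. It can hold state that must persist while the crawl runs, be shared among ScriptedProcessor, the scripting console and other scripts, or serve other purposes. In scripts it is obtained with appCtx.getData(). (Previously each script execution was transient. Some scripts need context, for example when a later script needs the result of an earlier one, or when different scripts need to share variables. Heritrix 3.1.1 now supports this.)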
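The idea can be sketched in Java as follows. Only the call appCtx.getData() comes from the release notes; StubAppCtx, the map's type parameters, the "urisSeen" key and the two "script" methods are hypothetical stand-ins so that the example runs on its own without Heritrix.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class SharedMapSketch {
    // Hypothetical stand-in for the crawl's application context.
    static class StubAppCtx {
        private final Map<String, Object> data = new ConcurrentHashMap<>();
        Map<String, Object> getData() {
            return data;
        }
    }

    // First "script": bumps a counter kept in the shared map.
    static void countUri(StubAppCtx appCtx) {
        appCtx.getData().merge("urisSeen", 1L, (a, b) -> (Long) a + (Long) b);
    }

    // Second "script", run later: reads what the first one left behind.
    static long report(StubAppCtx appCtx) {
        return (Long) appCtx.getData().getOrDefault("urisSeen", 0L);
    }

    public static void main(String[] args) {
        StubAppCtx appCtx = new StubAppCtx();
        countUri(appCtx);
        countUri(appCtx);
        countUri(appCtx);
        System.out.println("urisSeen = " + report(appCtx)); // prints "urisSeen = 3"
    }
}
```

Because both methods go through the same map, the second one sees the state left by the first, which is the behavior a transient script run could not offer on its own.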

 

