Mirroring HTML Files Only

Suppose you would like to save the crawled files in a file/directory layout instead of saving them in WARC files.
First, create a job with a single seed, http://foo.org/bar/. Then configure the warcWriter bean so that its class is org.archive.modules.writer.MirrorWriterProcessor. This processor stores files in a directory structure that matches the crawled URIs. The files are written to the crawl job's mirror directory.
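
As a rough illustration, the bean change might look like the fragment below. This is a sketch, assuming the job's crawler-beans.cxml already defines a bean with the id warcWriter; only the class attribute changes. Any WARC-specific properties set inside the original bean would probably need to be removed, since the mirror processor may not accept them.

```xml
<!-- Sketch: swap the writer class, keeping the bean id so that
     existing references to "warcWriter" still resolve. -->
<bean id="warcWriter"
      class="org.archive.modules.writer.MirrorWriterProcessor">
  <!-- No properties set here; the processor's defaults are assumed. -->
</bean>
```

Keeping the id unchanged means the rest of the configuration, which refers to the writer by that id, should not need further edits.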

Reposted from sharehua.iteye.com/blog/1745554