Web Crawlers from Getting Started to Giving Up: Reading the WebMagic Source (PageModel)

PageModel

The OOSpider class carries the following comment:


/**
 * The spider for page model extractor.<br>
 * In webmagic, we call a POJO containing extract result as "page model". <br>
 * You can customize a crawler by write a page model with annotations. <br>
 * Such as:
 * <pre>
 * {@literal @}TargetUrl("http://my.oschina.net/flashsword/blog/\\d+")
 *  public class OschinaBlog{
 *
 *      {@literal @}ExtractBy("//title")
 *      private String title;
 *
 *      {@literal @}ExtractBy(value = "div.BlogContent",type = ExtractBy.Type.Css)
 *      private String content;
 *
 *      {@literal @}ExtractBy(value = "//div[@class='BlogTags']/a/text()", multi = true)
 *      private List&lt;String&gt; tags;
 * }
 * </pre>
 * And start the spider by:
 * <pre>
 *   OOSpider.create(Site.me().addStartUrl("http://my.oschina.net/flashsword/blog")
 *        ,new JsonFilePageModelPipeline(), OschinaBlog.class).run();
 * }
 * </pre>
 *
 * @author [email protected] <br>
 * @since 0.2.0
 */

The comment shows how a PageModel is written. Let's look at the details.

TargetUrl and HelpUrl

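HelpUrl/TargetUrl is a very effective pattern for crawler development. A TargetUrl is a URL we ultimately want to scrape, the page that holds the data we are after; a HelpUrl is a page we have to visit in order to discover those target URLs. Almost every vertical-crawler requirement can be reduced to handling these two kinds of URL: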

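For a blog, the HelpUrl is the list page and the TargetUrl is the article page.
For a forum, the HelpUrl is the thread list and the TargetUrl is the thread detail page.
For an e-commerce site, the HelpUrl is the category list and the TargetUrl is the product detail page.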
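
To make the division of labor concrete, here is a minimal sketch of a crawl loop driven by the two kinds of URL. It does not use WebMagic: the in-memory SITE map, the example.com URLs, and the HELP and TARGET patterns are all made up for illustration. Links are only followed when they match one of the two patterns, and data would only be extracted on pages that match TARGET.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;

class HelpTargetDemo {
    // Hypothetical in-memory "site": each URL maps to the links found on that page.
    static final Map<String, List<String>> SITE = Map.of(
            "http://example.com/blog", List.of(
                    "http://example.com/blog?page=2",
                    "http://example.com/blog/1",
                    "http://example.com/about"),
            "http://example.com/blog?page=2", List.of(
                    "http://example.com/blog/2",
                    "http://example.com/blog"),
            "http://example.com/blog/1", List.of("http://example.com/blog"),
            "http://example.com/blog/2", List.of(),
            "http://example.com/about", List.of());

    // Help pages are list pages; target pages are individual articles.
    static final Pattern HELP = Pattern.compile("http://example\\.com/blog(\\?page=\\d+)?");
    static final Pattern TARGET = Pattern.compile("http://example\\.com/blog/\\d+");

    /** Crawls breadth-first from startUrl and returns the target URLs in visiting order. */
    static List<String> crawl(String startUrl) {
        Deque<String> queue = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        List<String> targets = new ArrayList<>();
        queue.add(startUrl);
        seen.add(startUrl);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            if (TARGET.matcher(url).matches()) {
                targets.add(url); // a real crawler would extract the page's fields here
            }
            for (String link : SITE.getOrDefault(url, List.of())) {
                boolean wanted = HELP.matcher(link).matches() || TARGET.matcher(link).matches();
                if (wanted && seen.add(link)) { // seen.add is false for already queued URLs
                    queue.add(link);
                }
            }
        }
        return targets;
    }

    public static void main(String[] args) {
        // prints [http://example.com/blog/1, http://example.com/blog/2]
        System.out.println(crawl("http://example.com/blog"));
    }
}
```

The /about page is never visited because it matches neither pattern, which is the point of the approach: the two patterns together define the crawl frontier.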

The source of TargetUrl is as follows:

@Retention(java.lang.annotation.RetentionPolicy.RUNTIME)
@Target({ElementType.TYPE})
public @interface TargetUrl {

    /**
     * The url patterns for class.<br>
     * Use regex expression with some changes: <br>
     *      "." stand for literal character "." instead of "any character". <br>
     *      "*" stand for any legal character for url in 0-n length ([^"'#]*) instead of "any length". <br>
     *
     * @return the url patterns for class
     */
    String[] value();

    /**
     * Define the region for url extracting. <br>
     * Only support XPath.<br>
     * When sourceRegion is set, the urls will be extracted only from the region instead of entire content. <br>
     *
     * @return the region for url extracting
     */
    String sourceRegion() default "";

}

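@Retention is one of Java's meta-annotations: it is placed on an annotation declaration and determines how long that annotation is kept (in the source only, in the class file, or at run time).
Take the following code: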

@Retention(RetentionPolicy.RUNTIME)
@interface Task{.......}

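The argument RetentionPolicy.RUNTIME means that the @Task annotation is visible while the program is running. Why explain this annotation? Because the source code read earlier relies on it: the process() method of ModelPipeline reads an annotation (@ExtractBy) and branches on its content. Understanding it therefore makes the source much easier to follow.
Runtime retention is what makes an annotation usable through reflection.
①Obtain a reference to the runtime Class object of a class with getClass()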

     public class Point{.....} // declare a class
     Point pt = new Point(); // create an instance of the class
     Class<?> cls = pt.getClass();    // cls now refers to the runtime Class object of Point

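②Member methods of the runtime object cls
<1>public String getName()
Returns the name of the class.
<2>public boolean isAnnotationPresent(AnnotationName.class)
Tells whether the given annotation is present on the class that cls represents and visible at run time.
<3>public boolean isAnnotation()
Tells whether cls itself represents an annotation type. It does not tell whether the class carries any annotations.
<4>public A getAnnotation(AnnotationName.class)
A is the annotation type; null is returned when the annotation is not present. It is used like this: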

@Retention(RetentionPolicy.RUNTIME) // make @Task visible at run time
@interface Task{String description(); }

@Task(description="NoFinished")   // annotate Computer
class Computer{.....}

With that in place, the following calls become meaningful:

Computer my = new Computer() ;
Class<?> cls = my.getClass() ;
Task tk = cls.getAnnotation(Task.class);
// tk now refers to the @Task annotation on Computer
tk.description(); // calls description() on @Task and returns "NoFinished"
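
The fragments above can be assembled into one runnable file. This is a minimal sketch: the wrapper class AnnotationLookupDemo, the unannotated Phone class, and the describe helper are names introduced here for illustration. It exercises getClass(), isAnnotationPresent(), getAnnotation() and isAnnotation() from the standard reflection API.

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

class AnnotationLookupDemo {
    @Retention(RetentionPolicy.RUNTIME) // keep @Task visible to reflection
    @interface Task {
        String description();
    }

    @Task(description = "NoFinished")
    static class Computer {
    }

    // Hypothetical class without the annotation, used for comparison.
    static class Phone {
    }

    /** Returns the @Task description on cls, or null when cls is not annotated. */
    static String describe(Class<?> cls) {
        if (!cls.isAnnotationPresent(Task.class)) {
            return null;
        }
        return cls.getAnnotation(Task.class).description();
    }

    public static void main(String[] args) {
        System.out.println(describe(new Computer().getClass())); // NoFinished
        System.out.println(describe(Phone.class));               // null
        System.out.println(Task.class.isAnnotation());           // true: Task is an annotation type
        System.out.println(Computer.class.isAnnotation());       // false: Computer is an ordinary class
    }
}
```

The last two lines show why isAnnotation() is easy to misread: it describes the Class object itself, not the annotations placed on it.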

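@Target
@Target specifies which program elements an annotation may be applied to: packages, types (classes, interfaces, enums, annotation types), type members (methods, constructors, fields, enum constants), method parameters, and local variables (such as loop variables and catch parameters). Declaring a target on an annotation type makes its intended use explicit.
Values (ElementType) include:
    1.CONSTRUCTOR: constructors
    2.FIELD: fields
    3.LOCAL_VARIABLE: local variables
    4.METHOD: methods
    5.PACKAGE: packages
    6.PARAMETER: parameters
    7.TYPE: classes, interfaces (including annotation types), and enum declarations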
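@Retention
@Retention defines how long an annotation is kept. Some annotations exist only in the source and are discarded by the compiler; others are compiled into the class file. Of those in the class file, some are ignored by the virtual machine while others can be read once the class is loaded (this does not affect how the class executes, because annotations are separate from the class's behavior). This meta-annotation therefore limits the "lifetime" of an annotation.
Values (RetentionPolicy):
    1.SOURCE: kept in the source file only
    2.CLASS: kept in the class file but not available at run time (the default)
    3.RUNTIME: kept at run time and readable through reflection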
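
The practical difference between the three policies is easy to observe with reflection. In the sketch below, the three marker annotations and the Marked class are hypothetical names; only the RUNTIME one should be visible to isAnnotationPresent().

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

class RetentionDemo {
    @Retention(RetentionPolicy.SOURCE)
    @interface SourceOnly {
    }

    @Retention(RetentionPolicy.CLASS)
    @interface ClassOnly {
    }

    @Retention(RetentionPolicy.RUNTIME)
    @interface AtRuntime {
    }

    // The same class carries all three annotations in the source.
    @SourceOnly
    @ClassOnly
    @AtRuntime
    static class Marked {
    }

    /** Number of annotations on Marked that reflection can still see. */
    static int visibleCount() {
        return Marked.class.getAnnotations().length;
    }

    public static void main(String[] args) {
        System.out.println(Marked.class.isAnnotationPresent(SourceOnly.class)); // false
        System.out.println(Marked.class.isAnnotationPresent(ClassOnly.class));  // false
        System.out.println(Marked.class.isAnnotationPresent(AtRuntime.class));  // true
        System.out.println(visibleCount());                                     // 1
    }
}
```

This is why TargetUrl and ExtractBy are declared with RetentionPolicy.RUNTIME: a framework that reads them through reflection could not see them otherwise.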
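@Documented
@Documented indicates that an annotation is part of the public API of the elements it annotates, so tools such as javadoc include it in the generated documentation. Documented is a marker annotation with no members.
@Inherited
@Inherited is a marker meta-annotation stating that an annotation type is inherited: if an annotation type marked with @Inherited is applied to a class, subclasses of that class are treated as carrying the annotation too.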
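
A short sketch of the @Inherited behavior, using hypothetical names (Crawlable, NotInherited, BasePage, BlogPage). Both annotations are placed on the base class, but only the one marked @Inherited is reported on the subclass.

```java
import java.lang.annotation.Inherited;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

class InheritedDemo {
    @Inherited
    @Retention(RetentionPolicy.RUNTIME)
    @interface Crawlable {
    }

    @Retention(RetentionPolicy.RUNTIME)
    @interface NotInherited {
    }

    @Crawlable
    @NotInherited
    static class BasePage {
    }

    // Carries no annotation of its own.
    static class BlogPage extends BasePage {
    }

    public static void main(String[] args) {
        System.out.println(BlogPage.class.isAnnotationPresent(Crawlable.class));    // true
        System.out.println(BlogPage.class.isAnnotationPresent(NotInherited.class)); // false
        // getDeclaredAnnotations() ignores inheritance, so nothing is declared on BlogPage.
        System.out.println(BlogPage.class.getDeclaredAnnotations().length);         // 0
    }
}
```

Note that @Inherited only applies to annotations on classes; annotations on interfaces or methods are not inherited this way.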

Back on track:
@TargetUrl(value = "http://*.iteye.com/blog/*",sourceRegion = "")
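In a complete TargetUrl annotation, value holds the URL pattern (a regular expression with the two modifications described in the Javadoc above) and sourceRegion holds an XPath. sourceRegion specifies where URLs are taken from: URLs outside the sourceRegion are not extracted.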
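
The Javadoc of value() says that "." stands for a literal dot and "*" stands for [^"'#]*. The sketch below shows one way those two rewrites could be applied to turn a pattern such as http://*.iteye.com/blog/* into an ordinary java.util.regex.Pattern. It is an illustration of the documented rules, not a copy of WebMagic's internals; the class and method names are made up.

```java
import java.util.regex.Pattern;

class TargetUrlPatternDemo {
    /** Applies the two rewrites from the TargetUrl Javadoc, then compiles the result. */
    static Pattern compile(String targetUrl) {
        String regex = targetUrl
                .replace(".", "\\.")        // "." becomes a literal dot
                .replace("*", "[^\"'#]*");  // "*" becomes any run of URL-legal characters
        return Pattern.compile(regex);
    }

    static boolean matches(String targetUrl, String url) {
        return compile(targetUrl).matcher(url).matches();
    }

    public static void main(String[] args) {
        String pattern = "http://*.iteye.com/blog/*";
        System.out.println(matches(pattern, "http://flashsword20.iteye.com/blog/1234")); // true
        System.out.println(matches(pattern, "http://flashsword20xiteye.com/blog/1234")); // false: the dot is literal
        System.out.println(matches(pattern, "http://a.iteye.com/blog/1#comments"));      // false: "#" is excluded
    }
}
```

Because only "." and "*" are rewritten, other regex syntax such as \\d+ in the OschinaBlog example passes through unchanged.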

The source of @HelpUrl is identical to that of @TargetUrl; the two differ only in how they are used, so it is not analyzed here.

ExtractBy

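@ExtractBy is an annotation for extracting elements; it describes an extraction rule.
@ExtractBy is mainly applied to fields, where it means "apply this extraction rule and store the result in this field". For example: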

@ExtractBy("//div[@id='readme']/text()")
private String readme;

Here "//div[@id='readme']/text()" is an extraction rule written in XPath, and the extracted result is stored in the readme field.
Let's look at the source:

@Retention(java.lang.annotation.RetentionPolicy.RUNTIME)
@Target({ElementType.FIELD, ElementType.TYPE})
public @interface ExtractBy {

    /**
     * Extractor expression, support XPath, CSS Selector and regex.
     *
     * @return extractor expression
     */
    String value();

    /**
     * types of extractor expressions
     */
    public static enum Type {XPath, Regex, Css, JsonPath}

    /**
     * Extractor type, support XPath, CSS Selector and regex.
     *
     * @return extractor type
     */
    Type type() default Type.XPath;

    /**
     * Define whether the field can be null.<br>
     * If set to 'true' and the extractor get no result, the entire class will be discarded. <br>
     *
     * @return whether the field can be null
     */
    boolean notNull() default false;
    
    /**
     * Define whether the extractor return more than one result.
     * When set to 'true', the extractor return a list of string (so you should define the field as List). <br>
     *
     * Deprecated since 0.4.2. This option is determined automatically by the class of field.
     * @deprecated since 0.4.2
     * @return whether the extractor return more than one result
     */
    boolean multi() default false;
   .....

}

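Only part of the code is shown here; the rest is deprecated, so it is not analyzed.
From public static enum Type {XPath, Regex, Css, JsonPath} we can see that besides XPath, other extraction methods are available: CSS selectors, regular expressions, and JsonPath. Set type in the annotation to choose one.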

@ExtractBy(value = "div.BlogContent", type = ExtractBy.Type.Css)
private String content;

When notNull is true, the field must not be empty: if the extractor returns no result, the entire object is discarded.
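
To show how a field-level rule and notNull could work together, here is a minimal sketch that does not depend on WebMagic. RegexField is a hypothetical stand-in for @ExtractBy that only supports a regex whose first group is the value; Repo and extract are likewise illustrative names. The extractor walks the fields by reflection, stores each match, and drops the whole object when a notNull field finds nothing.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FieldExtractDemo {
    // Hypothetical stand-in for @ExtractBy: a regex whose group 1 is the extracted value.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface RegexField {
        String value();

        boolean notNull() default false;
    }

    static class Repo {
        @RegexField(value = "<h1>(.*?)</h1>", notNull = true)
        String name;

        @RegexField("<p id=\"readme\">(.*?)</p>")
        String readme;
    }

    /** Fills annotated String fields from page; returns null if a notNull field has no match. */
    static <T> T extract(String page, Class<T> type) {
        try {
            T model = type.getDeclaredConstructor().newInstance();
            for (Field field : type.getDeclaredFields()) {
                RegexField rule = field.getAnnotation(RegexField.class);
                if (rule == null) {
                    continue; // field without a rule is left untouched
                }
                Matcher m = Pattern.compile(rule.value()).matcher(page);
                if (m.find()) {
                    field.setAccessible(true);
                    field.set(model, m.group(1));
                } else if (rule.notNull()) {
                    return null; // a required field is missing: discard the whole object
                }
            }
            return model;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot populate " + type.getName(), e);
        }
    }

    public static void main(String[] args) {
        Repo full = extract("<h1>webmagic</h1><p id=\"readme\">A crawler</p>", Repo.class);
        System.out.println(full.name + " / " + full.readme);                    // webmagic / A crawler
        System.out.println(extract("<p id=\"readme\">A crawler</p>", Repo.class)); // null
    }
}
```

An optional field that finds nothing simply stays null, while a missing notNull field invalidates the page model as a whole.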

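The multi attribute states whether the extraction rule yields multiple records or a single one. When the field is a List, it is treated as true automatically and does not need to be set. @ExtractBy can also be placed on a class, which handles the case where one page contains several records to extract.
Putting the annotation on a class means: use this rule to extract regions, and let each region correspond to one result.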

@ExtractBy(value = "//ul[@id=\"promos_list2\"]/li",multi = true)
public class QQMeishi {
    ……
}
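
The idea of "one region, one result" can be sketched without WebMagic as two nested matching steps. In this illustration the REGION pattern plays the role of the class-level rule and the TITLE pattern plays the role of a field-level rule that is applied only inside each region; Promo and all other names are hypothetical, and regexes replace XPath to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegionExtractDemo {
    /** Hypothetical result type: one instance per matched region. */
    static class Promo {
        final String title;

        Promo(String title) {
            this.title = title;
        }
    }

    // Stand-in for the class-level rule: every <li> element is one region.
    static final Pattern REGION = Pattern.compile("<li>(.*?)</li>", Pattern.DOTALL);
    // Stand-in for a field-level rule, evaluated inside a single region.
    static final Pattern TITLE = Pattern.compile("<b>(.*?)</b>");

    /** Builds one Promo per region that contains a title. */
    static List<Promo> extractAll(String page) {
        List<Promo> promos = new ArrayList<>();
        Matcher region = REGION.matcher(page);
        while (region.find()) {
            Matcher title = TITLE.matcher(region.group(1));
            if (title.find()) {
                promos.add(new Promo(title.group(1)));
            }
        }
        return promos;
    }

    public static void main(String[] args) {
        String page = "<b>Header</b><ul><li><b>A</b> 10% off</li><li><b>B</b></li><li>no title</li></ul>";
        for (Promo promo : extractAll(page)) {
            System.out.println(promo.title); // prints A, then B
        }
    }
}
```

The "Header" text outside the list is ignored because field rules never see anything beyond their own region.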

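Type conversion (the Formatter mechanism) was added in WebMagic 0.3.2. Extracted content is always a String, while the value we want may be of another type. Formatter converts the extracted content to certain basic types automatically, with no manual conversion code.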

@ExtractBy("//ul[@class='pagehead-actions']/li[1]//a[@class='social-count js-social-count']/text()")
private int star;

Besides the basic types, conversion to java.util.Date is also supported, but the date format has to be specified with @Formatter:

@Formatter("yyyy-MM-dd HH:mm")
@ExtractBy("//div[@class='BlogStat']/regex('\\d+-\\d+-\\d+\\s+\\d+:\\d+')")
private Date date;

Normally Formatter converts according to the field type, but in special cases the type has to be specified by hand. This mainly happens when the field is a List.

@Formatter(value = "",subClazz = Integer.class)
@ExtractBy(value = "//div[@class='id']/text()", multi = true)
private List<Integer> ids;
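
The conversion step itself can be pictured as a function from the extracted String and the field's type to a typed value. The following is a minimal sketch under that assumption, not WebMagic's Formatter implementation: convert and convertAll are hypothetical helpers, only a handful of types are handled, and the elementType parameter mirrors the role of subClazz for List fields.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

class FormatterDemo {
    /** Converts raw text to the requested type; pattern is only used for Date. */
    static Object convert(String raw, Class<?> type, String pattern) {
        String text = raw.trim();
        if (type == String.class) return text;
        if (type == int.class || type == Integer.class) return Integer.parseInt(text);
        if (type == long.class || type == Long.class) return Long.parseLong(text);
        if (type == double.class || type == Double.class) return Double.parseDouble(text);
        if (type == boolean.class || type == Boolean.class) return Boolean.parseBoolean(text);
        if (type == Date.class) {
            try {
                return new SimpleDateFormat(pattern).parse(text);
            } catch (ParseException e) {
                throw new IllegalArgumentException("Cannot parse date: " + text, e);
            }
        }
        throw new IllegalArgumentException("Unsupported type: " + type);
    }

    /** Converts every element; elementType must be a wrapper type such as Integer.class. */
    static <T> List<T> convertAll(List<String> raws, Class<T> elementType) {
        List<T> out = new ArrayList<>();
        for (String raw : raws) {
            out.add(elementType.cast(convert(raw, elementType, null)));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(convert(" 42 ", int.class, null));                       // 42
        System.out.println(convert("2013-05-01 12:30", Date.class, "yyyy-MM-dd HH:mm"));
        System.out.println(convertAll(List.of("1", "2", "3"), Integer.class));     // [1, 2, 3]
    }
}
```

For a single field the target type can be read from the field declaration, whereas in this sketch the element type of a list has to be passed in explicitly, just as subClazz is given in the annotation above.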
