1. Local crawler

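Pattern: represents a compiled regular expression.

Matcher: the text matcher. It reads a string from the beginning according to the rules of the regular expression and searches the larger string for substrings that match.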

1.2. Obtaining a Pattern object

Use Pattern p = Pattern.compile("regular expression");

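1.3. Obtaining a Matcher object

Use Matcher m = p.matcher(str); (m searches str for substrings that match the rules of p)

Here m is the Matcher object, p is the Pattern holding the regular expression, and str is the string to be searched.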

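1.4. Finding a match in the text

boolean b = m.find(); makes the matcher read from the beginning of the text and look for a substring that satisfies the rule. If there is none, the method returns false. If there is one, it returns true and the matcher records internally the start index of the substring and its end index + 1 (the index just past the last matched character). These two values are available through m.start() and m.end().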
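
To make the recorded indices concrete, here is a minimal sketch. The class name FindDemo, the helper firstMatchRange, and the sample text are made up for illustration; only Pattern.compile, matcher, find, start, and end come from java.util.regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindDemo {
    // Hypothetical helper: returns {start, end} of the first match, or null if there is none.
    static int[] firstMatchRange(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        if (!m.find()) {
            return null; // find() returned false: no substring satisfies the rule
        }
        // start() is the index of the first matched character;
        // end() is the index just past the last matched character.
        return new int[] { m.start(), m.end() };
    }

    public static void main(String[] args) {
        int[] range = firstMatchRange("Java", "Java and JavaScript");
        System.out.println(range[0] + " " + range[1]); // prints "0 4"
    }
}
```

The first "Java" occupies indices 0 to 3, so the recorded pair is 0 and 4.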

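1.5. Extracting the matched substring

String s = m.group(); returns the substring found by the most recent find(), cut out of the text using the recorded indices. For example, if the recorded indices are 0 and 4, it returns the characters at indices 0 through 3 (index 4 is not included). At this point the matcher stays at the end index + 1 of the first match.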
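
As a rough check of the description above, the sketch below compares group() with a manual substring(start, end) on the same match. The class name GroupDemo, the helper firstMatch, and the sample text are assumptions for illustration. Note that calling group() before a successful find() throws an IllegalStateException.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Hypothetical helper: returns the first match, or null if there is none.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        if (!m.find()) {
            return null;
        }
        String viaGroup = m.group();
        // Cutting the text by hand with the recorded indices gives the same string.
        String viaSubstring = text.substring(m.start(), m.end());
        if (!viaGroup.equals(viaSubstring)) {
            throw new IllegalStateException("group() and substring(start, end) differ");
        }
        return viaGroup;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("\\d{4}", "year 2024, month 05")); // prints "2024"
    }
}
```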

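1.6. Continuing to match and extract

Repeat steps 1.4 and 1.5. Each call to find() continues searching from where the previous match ended.

Steps 1.4 and 1.5 are usually combined in a while (m.find()) {} loop.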

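import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LocalCrawler {
    public static void main(String[] args) {
        String s = "Phone 12345678901, email [email protected]";
        // Regex: an 11-digit number starting with 1, or an e-mail address
        String regex = "([1]\\d{10}|\\w{1,}@[\\w&&[^-]]{2,}([.][c][omn]{1,3})+)";
        // Compile the regex into a Pattern object
        Pattern pattern = Pattern.compile(regex);
        // Create a Matcher object for the text
        Matcher matcher = pattern.matcher(s);
        // Loop over every match in the text
        while (matcher.find()) {
            // Get the matched substring
            String group = matcher.group();
            System.out.println(group);
        }
    }
}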

2. Web crawler

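import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

public class WebCrawler {
    public static void main(String[] args) throws IOException {
        // Create a URL object
        URL url = new URL("https://blog.csdn.net/Orange_sparkle?type=lately");
        // Open a connection to the address
        URLConnection conn = url.openConnection();
        // Optional: request headers must be set before getInputStream() is called
//        conn.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)");
//        conn.setRequestProperty("User-Agent", "Mozilla/4.76");
        // Build the regex Pattern object once, outside the loop
//        String regex = "";
//        Pattern pattern = Pattern.compile(regex);
        // Create the reader (the page is assumed to be UTF-8); try-with-resources closes it
        try (BufferedReader bufferedReader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String information;
            // Read one line at a time
            while ((information = bufferedReader.readLine()) != null) {
                System.out.println(information);
//                Matcher matcher = pattern.matcher(information);
//                while (matcher.find()) {
//                    System.out.println(matcher.group());
//                }
            }
        }
    }
}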
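
The commented-out lines above apply a Pattern to each line as it is read. Here is a minimal sketch of that idea, runnable without a network connection because it reads from a StringReader instead of a URLConnection. The class name LineScanDemo, the URL regex, and the sample HTML are assumptions for illustration, not the pattern the original code intended.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineScanDemo {
    // Illustrative pattern: http/https URLs up to the next quote, whitespace, or angle bracket.
    static final Pattern URL_PATTERN = Pattern.compile("https?://[^\"\\s<>]+");

    // Reads line by line and collects every match, like the loop in the crawler above.
    static List<String> extractLinks(BufferedReader reader) throws IOException {
        List<String> links = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null) {
            Matcher matcher = URL_PATTERN.matcher(line);
            while (matcher.find()) {
                links.add(matcher.group());
            }
        }
        return links;
    }

    // Convenience wrapper for in-memory text, so the sketch can run offline.
    static List<String> extractLinksFromText(String text) {
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            return extractLinks(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String html = "<a href=\"https://example.com/a\">A</a>\n"
                + "<p>no link here</p>\n"
                + "<a href=\"http://example.org/b\">B</a> <a href=\"/relative\">C</a>\n";
        System.out.println(extractLinksFromText(html));
    }
}
```

With a real page, the same extractLinks method could be given the BufferedReader built from conn.getInputStream(). A regex is only a rough way to pull data out of HTML, so the pattern would need adjusting to the actual page.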