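Disclaimer: this article is only a set of notes from learning to program. If it infringes on anyone's rights, please contact me and I will delete it immediately.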

Regular expressions

Regular expressions in Python are very similar to those in Java, with a few small differences in usage.
1 Basic usage of Python regular expressions

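Python's regular expression module is re. Its basic syntax rule is to specify a character sequence. For example, to find the string abc in the string s = '123abc456eabc789', you only need to write: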

import re
string = '123abc456eabc789'
reg = re.compile(r'abc')
# Named "matches" rather than "list" so the built-in list type is not shadowed
matches = re.findall(reg, string)
print(matches)

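The result is: ['abc', 'abc']
The function used here, findall(rule, target[, flag]), is fairly intuitive: it searches the target string for substrings that match the rule. The first argument is the rule, the second is the target string, and an optional flag may follow (flags are described in detail with the compile function). The return value is a list of the matching strings. If nothing matches, an empty list is returned.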
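The three points above (argument order, the optional flag, and the empty-list result) can be sketched as follows. This is a minimal illustration using only the standard re module. The sample string is made up, and re.IGNORECASE is just one example of a flag.

```python
import re

target = '123abc456eABC789'  # made-up sample string

# Rule first, target second: every non-overlapping match comes back in a list
found = re.findall(r'abc', target)
print(found)            # ['abc']

# The optional third argument is a flag; re.IGNORECASE also matches 'ABC'
found_any_case = re.findall(r'abc', target, re.IGNORECASE)
print(found_any_case)   # ['abc', 'ABC']

# No match yields an empty list rather than None or an exception
not_found = re.findall(r'xyz', target)
print(not_found)        # []
```

Note that flags are passed together with a string rule here. A rule already compiled with re.compile carries its own flags.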
　　
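Why use an r'...' string (a raw string)? A regex rule is itself defined by a string, and regexes make heavy use of the escape character '\'. Without a raw string, every '\' in the rule has to be written as '\\', so matching a literal '\' in the target string takes four of them: '\\\\'! That is tedious and hard to read, so rules are usually written as r'' strings. In some cases, though, a non-raw string may be the better choice.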
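A small sketch of the backslash problem, assuming a made-up Windows-style path as the target string. It shows that the four-backslash plain string and the two-backslash raw string describe the same rule.

```python
import re

path = 'C:\\temp'     # made-up target: 7 characters, exactly one backslash
print(len(path))      # 7

# Plain string: Python reduces '\\\\' to two characters,
# and the regex engine then reads those two as one escaped backslash
plain = re.findall('\\\\', path)

# Raw string: Python leaves the backslashes alone, so two are enough
raw = re.findall(r'\\', path)

print(plain == raw)   # True
print(len(raw))       # 1
```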

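A few that are easy to confuse:

[^a-zA-Z] matches any single character that is not an English letter
[a-z^A-Z] matches any English letter or the character '^'
[a-zA-Z]|[0-9] matches either a letter or a digit; this rule is equivalent to [a-zA-Z0-9]
dog|cat matches 'dog' or 'cat', not the letter g or the letter c
To match 'I have a dog' or 'I have a cat', write r'I have a (?:dog|cat)', not r'I have a dog|cat'
'(?:)' is a non-capturing group
When you want to treat part of a rule as a single unit, for example to give it a repetition count, wrap that part in '(?:' and ')'. A plain pair of parentheses also groups, but it creates a capturing group, which changes what functions such as findall return and can give surprising results.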
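The cases listed above can be checked directly. The following is a minimal sketch with made-up sample strings. It also shows why plain parentheses surprise people: findall returns only the captured group instead of the whole match.

```python
import re

# A leading ^ negates the class: every character that is NOT a letter
non_letters = re.findall(r'[^a-zA-Z]', 'a1^b2')
print(non_letters)        # ['1', '^', '2']

# A ^ that is not first is just a literal character inside the class
letters_or_caret = re.findall(r'[a-z^A-Z]', 'a1^b2')
print(letters_or_caret)   # ['a', '^', 'b']

sentence = 'I have a dog. I have a cat.'

# Without grouping, | splits the WHOLE rule: 'I have a dog' or 'cat'
ungrouped = re.findall(r'I have a dog|cat', sentence)
print(ungrouped)          # ['I have a dog', 'cat']

# (?:...) limits the alternation, and findall still returns the full match
non_capturing = re.findall(r'I have a (?:dog|cat)', sentence)
print(non_capturing)      # ['I have a dog', 'I have a cat']

# Plain parentheses also group, but findall then returns only the group
capturing = re.findall(r'I have a (dog|cat)', sentence)
print(capturing)          # ['dog', 'cat']
```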
2 Scraping images from Xiaohuar (xiaohuar.com)

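Step 1: look at the site, http://www.xiaohuar.com/hua/ . We want to download the images on pages 1 to 10. The URL of page 2 is http://www.xiaohuar.com/list-1-1.html and the URL of page 3 is http://www.xiaohuar.com/list-1-2.html , so page 1 can also be reached through http://www.xiaohuar.com/list-1-0.html .
Step 2: I use Google Chrome; view the page source.
Step 3: click Inspect on the page: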

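Step 4: find the content we want to scrape. Locate the corresponding img tags in the page's HTML. Each page has 25 of them, so look for the structure that repeats.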

Step 5: what we need to extract is the src="/d/file/……" value. Opening one of these links shows that the full URL looks like this (a few examples):
http://www.xiaohuar.com/d/file/20180116/10f221ab4e822b9e0aff7aa5ae9a3005.jpg
http://www.xiaohuar.com/d/file/20180106/501ef3a33d49f390e243c560c8ca0349.jpg
In other words, the images at these URLs are what the program needs to download.
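Steps 1 and 5 can be tried offline before touching the network. The sketch below builds the ten list-page URLs and extracts image paths from a made-up HTML fragment. sample_html is hypothetical and only imitates the img tags described in step 4; the real markup may differ. It assumes Python 3, where urljoin lives in urllib.parse.

```python
import re
from urllib.parse import urljoin

ROOT = 'http://www.xiaohuar.com'

# Step 1: pages 1 to 10 correspond to list-1-0.html ... list-1-9.html
page_urls = [ROOT + '/list-1-%d.html' % i for i in range(10)]
print(page_urls[0])
print(page_urls[-1])

# Hypothetical fragment imitating the img tags found in step 4
sample_html = '''
<img alt="a" src="/d/file/20180116/10f221ab4e822b9e0aff7aa5ae9a3005.jpg" />
<img alt="b" src="/d/file/20180106/501ef3a33d49f390e243c560c8ca0349.jpg" />
<img src="/skin/logo.gif" />
'''

# Step 5: non-greedy match from /d/file/ up to the first jpg or png
paths = re.findall(r'/d/file/.+?(?:jpg|png)', sample_html)

# Turn each site-relative path into the full URL to download
image_urls = [urljoin(ROOT, p) for p in paths]
for u in image_urls:
    print(u)
```

The third img tag is skipped because its src does not start with /d/file/, which is what keeps logos and other page decoration out of the download list.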
Step 6: write the program.
Python version:

# coding=utf-8
import datetime
import re
import urllib.request  # Python 3: urlopen and urlretrieve live in urllib.request


def getHtml(url):
    # Download the page and decode it as GBK
    page = urllib.request.urlopen(url)
    html = page.read().decode("gbk")
    return html


def main():
    begin = datetime.datetime.now()
    root = "http://www.xiaohuar.com"
    # Non-greedy match of image paths ending in jpg or png
    imgre = re.compile(r'/d/file/.+?(?:jpg|png)')
    x = 0  # running counter used as the file name
    for index in range(10):
        url = root + "/list-1-" + str(index) + ".html"
        html = getHtml(url)
        for path in imgre.findall(html):
            imageUrl = root + path
            # The target directory d:/images must already exist
            urllib.request.urlretrieve(imageUrl, 'd:/images/%s.jpg' % x)
            x = x + 1
    end = datetime.datetime.now()
    print(end - begin)  # total running time


if __name__ == "__main__":
    main()


Java version: download the jar (jsoup) yourself, for example via Baidu, or use my link directly:
http://download.csdn.net/download/tsfx051435adsl/10210925

package com.spyder;

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.jsoup.Jsoup;

public class SpyderUtils {

    // Fetch the page at the given url and return its HTML
    public static String getHTML(String url) throws IOException {
        return Jsoup.connect(url).get().toString();
    }

    // Return every substring of html that matches the given regex pattern
    public static List<String> getMatcher(String html, String pattern) {
        List<String> result = new ArrayList<String>();
        Pattern p = Pattern.compile(pattern);
        Matcher m = p.matcher(html);
        while (m.find()) {
            result.add(m.group(0));
        }
        return result;
    }

    // Download the image at imageURL and save it as docName/imgName
    public static void copyImage(URL imageURL, String docName, String imgName) {
        try {
            URLConnection con = imageURL.openConnection();
            // try-with-resources closes both streams even if the copy fails
            try (BufferedInputStream in = new BufferedInputStream(con.getInputStream());
                 BufferedOutputStream os = new BufferedOutputStream(
                         new FileOutputStream(docName + "/" + imgName))) {
                byte[] buffer = new byte[1024 * 2];
                int len;
                while ((len = in.read(buffer)) != -1) {
                    os.write(buffer, 0, len);
                }
            }
        } catch (IOException e) {
            // Report the failure instead of silently swallowing it
            System.err.println("Failed to download " + imageURL + ": " + e.getMessage());
        }
    }
}

package com.spyder;

import java.io.IOException;
import java.net.URL;
import java.util.List;

public class XiaoHuaTest {
    // xiaohuar.com
    public static void spyder1() throws IOException {
        long begin = System.currentTimeMillis();
        int i = 0;
        String rootString = "http://www.xiaohuar.com"; // site root, used below to build full image URLs
        String desturl; // URL of the list page to parse
        String html; // HTML string fetched from that URL
        URL url;
        List<String> output; // the parts of the HTML extracted by the regex
        for (int j = 0; j < 10; j++) {
            desturl = "http://www.xiaohuar.com/list-1-" + j + ".html";
            html = SpyderUtils.getHTML(desturl);
            output =  SpyderUtils.getMatcher(html, "src=\"([\\w\\s./:]+?)\"");
            for (String temp : output) {
                if (temp.startsWith("src=\"/d/file")) {
                    temp = temp.replaceAll("src=", "").replaceAll("\"", "");
                    url = new URL(rootString + temp);
                    SpyderUtils.copyImage(url, "F:/SpyderImages",(i++)+".png");
                }
            }
        }
        long end = System.currentTimeMillis();
        System.out.println(end - begin);

    }
    public static void main(String[] args) throws Exception {
        spyder1();
    }
}
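
The Java version matches the whole src="..." attribute and then strips src= and the quotes by hand. As a hedged alternative (not the author's code), a capture group can return just the path. The sketch below shows the idea in Python on a made-up fragment, reusing the character class from the Java pattern. In Java the same group would be read with m.group(1) instead of m.group(0).

```python
import re

# Hypothetical fragment; only the first src points at /d/file
sample_html = '<img src="/d/file/20180116/a.jpg"><img src="/skin/logo.gif">'

# Same character class as the Java pattern, with the path captured in group 1
src_pattern = re.compile(r'src="([\w\s./:]+?)"')

# With one capture group, findall returns the group contents, not the full match
all_srcs = src_pattern.findall(sample_html)
print(all_srcs)      # ['/d/file/20180116/a.jpg', '/skin/logo.gif']

# Keep only the image paths, as the startsWith check does in the Java code
image_paths = [s for s in all_srcs if s.startswith('/d/file')]
print(image_paths)   # ['/d/file/20180116/a.jpg']
```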

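For some reason the screenshot of the Java run would not upload...
After running both, I found that Java took 290 seconds and Python took 584 seconds, which surprised me. Of course I am only a beginner: I did not use multithreading or anything else, and this is just a small pure-regex crawler demo. For this site, the markup shown by Inspect and by right-click View Source is the same, so parsing the HTML directly is enough. On some sites the two differ, for example Duitang (堆糖网): search for pictures of Zhao Liying (赵丽颖) and the Inspect view does not match the source (one telltale sign is that more content loads when you pull down, so it is probably generated by AJAX). Such pages cannot be scraped directly like the demo above. I will cover that another time.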

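Once again: this article is only a set of notes from learning to program. If it infringes on anyone's rights, please contact me and I will delete it immediately.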
