Simple web page parsing with Python

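(This post uses D站 (dilidili), as the site looked on May 10, 2018, as the example: it scrapes the anime that have new episodes and sends the list to an email address.)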

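Python version: 3.6.5 (PS: when searching online for Python topics, searching for "python3" plus the topic gives more accurate results, because python2.x and python3.x differ considerably)
Libraries used: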

import urllib.request
from bs4 import BeautifulSoup
import pymysql
import smtplib
from email.mime.text import MIMEText
from email.utils import formataddr
import re

1. First, fetch the web page

url = 'http://www.dilidili.wang/'
soup = BeautifulSoup(urllib.request.urlopen(url).read(), 'lxml', from_encoding="utf-8")
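
The one-liner above has no timeout, so a slow server can hang the script, and any network error ends it with a traceback. Here is a minimal sketch of a more defensive fetch. The helper name `fetch_html`, the 10-second timeout and the User-Agent string are assumptions for illustration and are not part of the original script.

```python
import urllib.request


def fetch_html(url, timeout=10):
    """Return the raw bytes behind `url`, or None if the request fails.

    `fetch_html`, the timeout value and the User-Agent string are
    illustrative assumptions, not part of the original script.
    """
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read()
    except OSError as e:
        # URLError, its subclass HTTPError, and socket timeouts
        # all derive from OSError, so one clause covers them.
        print("fetch failed:", e)
        return None
```

The returned bytes can be handed to BeautifulSoup in place of `urllib.request.urlopen(url).read()`, after checking for `None`.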

Note that the 'lxml' argument requires the lxml library to be installed; without it you may get this exception:
bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to inst
Install it yourself with pip3 install lxml
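
If installing lxml is not an option, one possible workaround is to fall back to the parser that ships with Python. This is only a sketch: the helper name `make_soup` is hypothetical, and the two parsers can build slightly different trees from malformed HTML.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse with lxml when it is installed, else with the built-in parser.

    `make_soup` is a hypothetical helper, not part of the original script.
    """
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is missing; "html.parser" needs no extra install
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>hi</p>")
print(soup.p.get_text())  # hi
```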

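2. Next comes parsing the page. You can print the whole document with print(soup) and then work out the tag and attributes of the content you are after; on D站, for example, newly updated shows are marked with a span tag whose class is hot. From there BeautifulSoup does the work: the select function (that is, CSS selectors) picks out the tags, and the parent and children attributes move between a tag and its surrounding context.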

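    # The page changes over time, so this selector may no longer match anything
    updates = {}
    for span in soup.select('span.hot'):
        figcaption = span.parent.figcaption
        if figcaption is None:
            continue  # no caption next to this badge, skip it
        # select() always returns a list, so only its length needs checking
        p = figcaption.select('p')
        if len(p) == 2:
            # first <p> is the title, second <p> is the latest episode
            updates[p[0].get_text()] = p[1].get_text()
        elif len(p) == 1:
            updates[p[0].get_text()] = 'none'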
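
To see how `select`, `parent` and `children` fit together without depending on the live site, here is a small self-contained sketch. The HTML snippet is hypothetical, loosely modelled on the structure described in step 2, and the real page may differ. It uses the built-in 'html.parser' so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a "hot" badge next to a caption with title and episode
html = (
    '<ul><li><span class="hot">new</span>'
    '<figure><figcaption><p>Anime A</p><p>Episode 5</p></figcaption></figure>'
    '</li></ul>'
)

soup = BeautifulSoup(html, "html.parser")

# CSS selector: every <span> whose class is "hot"
badge = soup.select("span.hot")[0]

# .parent climbs one level up; .figcaption finds the first
# <figcaption> anywhere below that parent
titles = [p.get_text() for p in badge.parent.figcaption.select("p")]
print(titles)  # ['Anime A', 'Episode 5']

# .children iterates over the direct children of a tag
children = [child.name for child in badge.parent.children if child.name]
print(children)  # ['span', 'figure']
```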

There are plenty of tutorials online; here I recommend a good blog post by another author that introduces CSS selectors

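3. Next, persist the scraped data to a database; MySQL is used here.
I struggled for quite a while with version problems during the MySQL setup, so here are the versions of the modules I installed with pip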

beautifulsoup4 (4.6.0)
HTMLParser (0.0.2)
lxml (4.2.1)
pip (9.0.3)
PyMySQL (0.8.0)
setuptools (39.0.1)

I won't go through a MySQL tutorial here; this is my implementation

    def update_mysql(updates):
        data = ""
        select_name_sql = "select * from dilidili where name like %s"
        select_update_sql = "select * from dilidili where name like %s AND update_content like %s"
        insert_sql = "INSERT INTO `dilidili` (`name`, `update_content`) VALUES (%s,%s)"
        update_sql = "UPDATE dilidili set update_content = %s where name like %s"
        # Open the database connection
        db = pymysql.connect(host="*******", user="****", passwd="*****", db="python",
                             charset="utf8")
        # Create a cursor object with cursor()
        cursor = db.cursor()
        for key in updates:
            try:
                # execute() returns the number of matching rows; compare it
                # with ==, because `is` tests identity rather than equality
                if cursor.execute(select_name_sql, str(key)) == 1:
                    if cursor.execute(select_update_sql, (str(key), str(updates[key]))) == 0:
                        # Known title whose stored episode differs: update it
                        cursor.execute(update_sql, (str(updates[key]), str(key)))
                        data += str(key) + "     updated: " + str(updates[key]) + "\n"
                else:
                    # Title not in the table yet: insert it
                    cursor.execute(insert_sql, (str(key), str(updates[key])))
                    data += str(key) + "     updated: " + str(updates[key]) + "\n"
                # Commit the transaction
                db.commit()
            except Exception as e:
                # Roll back when an error occurs
                db.rollback()
                print(e)
                send_message(str(e))
        # Close the database connection
        db.close()
        if data != "":
            send_message(data)
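
The decision logic in `update_mysql` (insert a title that is new, update one whose episode changed, and report both) can be tried out without a MySQL server. The sketch below swaps in Python's built-in `sqlite3` with an in-memory database as a stand-in. The function name `record_updates`, the table definition, the sample titles and the `?` placeholders (sqlite3's style, where PyMySQL uses `%s`) are assumptions for illustration.

```python
import sqlite3


def record_updates(conn, updates):
    """Store {name: latest_episode} and return the names that changed.

    Hypothetical stand-in for the MySQL version, using sqlite3.
    """
    changed = []
    cur = conn.cursor()
    for name, episode in updates.items():
        cur.execute("SELECT update_content FROM dilidili WHERE name = ?", (name,))
        row = cur.fetchone()
        if row is None:
            # Title not seen before: insert it
            cur.execute(
                "INSERT INTO dilidili (name, update_content) VALUES (?, ?)",
                (name, episode),
            )
            changed.append(name)
        elif row[0] != episode:
            # Known title with a different stored episode: update it
            cur.execute(
                "UPDATE dilidili SET update_content = ? WHERE name = ?",
                (episode, name),
            )
            changed.append(name)
    conn.commit()
    return changed


conn = sqlite3.connect(":memory:")
# Assumed schema; the original post does not show the table definition
conn.execute("CREATE TABLE dilidili (name TEXT PRIMARY KEY, update_content TEXT)")

print(record_updates(conn, {"Anime A": "Episode 1", "Anime B": "Episode 3"}))
# ['Anime A', 'Anime B']
print(record_updates(conn, {"Anime A": "Episode 2", "Anime B": "Episode 3"}))
# ['Anime A']
```

Comparing with `=` instead of `like` is a deliberate choice in this sketch: `like` treats `%` and `_` inside a title as wildcards, which could match the wrong row.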

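4. Finally, send the email. QQ Mail is used here; first get an authorization code from the QQ Mail website, which is then used in place of the account password when logging in to send mail.
The implementation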

    def send_message(data):
        my_sender = '*****@qq.com'  # sender's email account
        my_pass = '*********'  # sender's mailbox authorization code
        my_user = '******@qq.com'  # recipient's email account

        def mail():
            ret = True
            try:
                msg = MIMEText(data, 'plain', 'utf-8')
                msg['From'] = formataddr(["MDY", my_sender])  # sender's display name and email account
                msg['To'] = formataddr(["FK", my_user])  # recipient's display name and email account
                msg['Subject'] = "dilidili anime update reminder"  # subject (title) of the email

                server = smtplib.SMTP_SSL("smtp.qq.com", 465)  # the sender's SMTP server; 465 is the SSL port
                server.login(my_sender, my_pass)  # sender's email account and authorization code
                server.sendmail(my_sender, [my_user, ], msg.as_string())  # sender, list of recipients, message text
                server.quit()  # close the connection
            except Exception as e:  # any exception raised inside try sets ret = False
                ret = False
                print(e)
            return ret

        ret = mail()
        if ret:
            print("Mail sent successfully")
        else:
            print("Failed to send mail")
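
Before wiring up real credentials, it can help to build the message on its own and inspect it without contacting any SMTP server. The following is a minimal sketch of that idea; the helper name `build_message`, the example.com addresses and the sample body are hypothetical.

```python
from email.mime.text import MIMEText
from email.utils import formataddr


def build_message(body, sender, recipient):
    """Build the notification mail without sending it (hypothetical helper)."""
    msg = MIMEText(body, "plain", "utf-8")
    # formataddr takes a (display name, address) pair
    msg["From"] = formataddr(("MDY", sender))
    msg["To"] = formataddr(("FK", recipient))
    msg["Subject"] = "dilidili update reminder"
    return msg


msg = build_message(
    "Anime A updated: Episode 2\n",
    "sender@example.com",
    "receiver@example.com",
)
print(msg["To"])  # FK <receiver@example.com>
# The utf-8 body is transfer-encoded; decode=True reverses that
print(msg.get_payload(decode=True).decode("utf-8"))
```

`msg.as_string()` on the result gives the text that `send_message` passes to `server.sendmail`.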

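Once the code is written, it only feels like an accomplishment when it is actually running. For that you need a cloud host, or you can simply leave it running on your own PC. Here a cloud host running CentOS is the example, with a scheduled task running the Python file. Install python3.6 on the host, and take care not to overwrite the system's own python reference during installation, otherwise some system features will stop working properly. The scheduled task is implemented with crontab. In the listing below, the first entry runs the script at 07:00, 12:00, 18:00 and 23:00 from Monday to Friday, and the second runs it every minute on Saturday and Sunday: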

[root@VM_16_11_centos ~]# crontab -l
0 7,12,18,23 * * 1,2,3,4,5 /usr/local/bin/python3.6  /root/python/dilidili.py
*/1 * * * 6,7 /usr/local/bin/python3.6  /root/python/dilidili.py
