BeautifulSoup Study Notes

Today I learned the basics of using BeautifulSoup.

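Beautiful Soup is a Python library for extracting data from HTML and XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying the document, and it can save hours or even days of work.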

How it works: Beautiful Soup converts a complex HTML document into a tree of Python objects. Every node in the tree is one of four types: Tag, NavigableString, BeautifulSoup, and Comment.
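To make the four object types concrete, here is a minimal sketch. The one-line HTML document is invented for illustration, and the standard library parser 'html.parser' is used so that no extra package is needed.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Tiny hand-written document, invented for illustration.
html = "<p class='hero'>Axe<!-- melee --></p>"
soup = BeautifulSoup(html, "html.parser")

p = soup.p              # a Tag
text = p.contents[0]    # a NavigableString holding the text "Axe"
note = p.contents[1]    # a Comment (a subclass of NavigableString)

# Print the class name of each of the four kinds of object.
for obj in (soup, p, text, note):
    print(type(obj).__name__)
```

The BeautifulSoup object represents the whole document, while the other three are the nodes found inside it.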


Installation: pip install beautifulsoup4 (the lxml parser used in these notes is a separate package: pip install lxml)

Goal: scrape the Dota 2 hero usage statistics from dotamax.com.

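Approach: first download the target page with the requests library, then use BeautifulSoup to pick out the needed values, store them in a dictionary, and finally write them to a database.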

The BeautifulSoup part

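Not much of the library is needed here: inspect the page source to find where the elements live, then extract them with BeautifulSoup.
First create the "bowl of soup": soup = BeautifulSoup(html, 'lxml')
Then collect the raw data by using find_all to get every tr tag: data = soup.find_all('tr')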
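As a rough illustration of what find_all returns, the sketch below runs it on a small hand-written table. The markup is an assumption made up for this example (only the class name hero-name-list comes from these notes); the real page is larger and may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
html = """
<table>
  <tr><th>Hero</th><th>Win rate</th></tr>
  <tr><td><span class="hero-name-list"> Axe </span></td></tr>
  <tr><td><span class="hero-name-list"> Lina </span></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all("tr")          # list-like result, one entry per <tr>

names = []
for row in rows:
    span = row.find("span", attrs={"class": "hero-name-list"})
    if span is None:                # find returns None when nothing matches
        continue
    names.append(span.text.strip())

print(len(rows), names)
```

Note that find returns None for a row with no matching element, so calling .text on the result directly would raise an AttributeError for such rows.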

Second-pass extraction

A for loop walks through the rows in data, uses find to pull the needed values out of each row, and stores them in a dictionary.

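comment = {}
for i in data:
    name_tag = i.find('span', attrs={"class": 'hero-name-list'})
    if name_tag is None:
        # Rows without a hero name (e.g. a header row) are skipped.
        continue
    name = name_tag.text.strip()
    # NOTE: strip('width:') removes any of those characters from both ends,
    # not the literal prefix; it works here because the value is numeric.
    win = i.find(
        'div', attrs={
            "class": 'segment segment-green'})['style'].strip('width:').strip('%;')
    # NOTE: this repeats the selector used for win, so picks equals win;
    # check the page source for the class of the pick-count bar.
    picks = i.find(
        'div', attrs={
            "class": 'segment segment-green'})['style'].strip('width:').strip('%;')
    comment[name] = [win, picks]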
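The chained strip calls above rely on str.strip removing a set of characters rather than a prefix. A more explicit alternative, offered only as a sketch, is a small regular expression. The helper name width_percent is hypothetical, and the 'width:NN%;' format of the style attribute is assumed from the code above.

```python
import re


def width_percent(style):
    """Return the number in a 'width:NN%' declaration, or None if absent."""
    match = re.search(r"width\s*:\s*([\d.]+)\s*%", style)
    return float(match.group(1)) if match else None


print(width_percent("width:52.31%;"))

# str.strip treats its argument as a set of characters, not as a prefix:
print("width:52.31%;".strip("width:").strip("%;"))
print("hidden".strip("width:"))
```

The last line shows the pitfall: every leading 'h', 'i', and 'd' is removed from "hidden" because those letters appear in "width:".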

BeautifulSoup parsers

soup = BeautifulSoup(html, 'lxml') parses the HTML content with lxml. Besides lxml, there are:
the Python standard library parser, BeautifulSoup(html, 'html.parser'),
the lxml XML parser, BeautifulSoup(html, ['lxml', 'xml']) or simply BeautifulSoup(html, 'xml'),
and the html5lib parser, BeautifulSoup(html, 'html5lib').
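Since lxml and html5lib are optional packages, one way to handle a missing parser is to fall back to the next one. The sketch below assumes that order of preference; make_soup is a hypothetical helper, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html, preferred=("lxml", "html5lib", "html.parser")):
    """Parse html with the first parser in `preferred` that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(html, parser), parser
        except FeatureNotFound:
            # Raised by Beautiful Soup when the requested parser is missing.
            continue
    raise RuntimeError("no usable parser found")


soup, used = make_soup("<p>Axe</p>")
print(used, soup.p.text)
```

Different parsers can build slightly different trees from invalid HTML, so pinning one parser explicitly is usually safer once a scraper works.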

The database part

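At first I overlooked the database encoding, which led to garbled Chinese text, so here is a record of how to fix it: every encoding setting of the database needs to be utf8.
Check the database encoding: show variables like 'character_set_database';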

Check the encoding of a table: show create table table_name;


Set the default encoding at creation time: character set utf8

For example:
Specify the character set when creating a database: create database <database_name> character set utf8;

Change the encoding of an existing database: alter database <database_name> character set utf8;
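The garbled text described in this section can be reproduced in plain Python, which may help explain why every layer has to agree on the encoding. This sketch assumes the mismatch was UTF-8 bytes read back as latin-1 (one common cause; the actual cause in this case is not recorded), and the sample string is chosen only for illustration.

```python
original = "斧王"                       # sample hero name, for illustration only
raw = original.encode("utf-8")          # the bytes a UTF-8 client sends
garbled = raw.decode("latin-1")         # the same bytes read back as latin-1
restored = garbled.encode("latin-1").decode("utf-8")

# 2 characters become 6 bytes, which latin-1 turns into 6 unrelated characters.
print(len(original), len(raw), len(garbled))
print(restored == original)
```

The data is not lost at this stage, only mislabeled, which is why setting the database, table, and connection to utf8 fixes the display.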

Complete code:

# Python Version: 3.6.3
import requests
from bs4 import BeautifulSoup
import pymysql

url = 'http://www.dotamax.com/hero/rate/'


def get_html(url):
    """Download the page and return its text, or None on failure."""
    try:
        r = requests.get(url, timeout=10)
        r.raise_for_status()
        return r.text
    except requests.RequestException as e:
        print(e)
        return None


def get_content(url):
    comment = {}
    html = get_html(url)
    if html is None:
        return
    soup = BeautifulSoup(html, 'lxml')  # requires the lxml package
    data = soup.find_all('tr')
    # Keyword arguments: PyMySQL 1.0+ no longer accepts positional ones.
    db = pymysql.connect(host='localhost', user='root', password='root',
                         database='test', charset='utf8')
    cursor = db.cursor()
    cursor.execute("DROP TABLE IF EXISTS DOTA2")
    sql_create = """CREATE TABLE DOTA2 (aNAMES VARCHAR(20) NOT NULL, WINS  FLOAT NOT NULL, PICKS FLOAT NOT NULL) character set utf8"""
    cursor.execute(sql_create)
    cursor.execute("set names 'utf8'")
    for i in data:
        name_tag = i.find('span', attrs={"class": 'hero-name-list'})
        if name_tag is None:  # skip rows without a hero name
            continue
        name = name_tag.text.strip()
        win = i.find(
            'div', attrs={
                "class": 'segment segment-green'})['style'].strip('width:').strip('%;')
        # NOTE: same selector as win, so picks equals win; check the page
        # source for the class of the pick-count bar.
        picks = i.find(
            'div', attrs={
                "class": 'segment segment-green'})['style'].strip('width:').strip('%;')
        comment[name] = [win, picks]
    # Parameterized query: the driver handles quoting of the values.
    sql_insert = "INSERT INTO DOTA2(aNAMES, WINS, PICKS) VALUES (%s, %s, %s)"
    for i in comment:
        try:
            cursor.execute(
                sql_insert,
                (i, float(comment[i][0][:5]), float(comment[i][1][:5])))
        except Exception as err:
            print(err)
    db.commit()  # PyMySQL does not autocommit by default
    print('ok')
    db.close()


if __name__ == '__main__':
    get_content(url)
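The insert step can be tried without a MySQL server. The sketch below uses the standard library's sqlite3 as a stand-in, so it is not the code used in these notes: the column types are SQLite's, the placeholder is ? (PyMySQL uses %s), and the hero data is made up.

```python
import sqlite3

# Made-up sample data in the same shape as the `comment` dictionary.
rows = {"Axe": ["52.31", "10.5"], "Queen's Guard": ["48.00", "3.2"]}

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE DOTA2 (aNAMES TEXT NOT NULL, WINS REAL NOT NULL, PICKS REAL NOT NULL)"
)
# Values are passed separately from the SQL text, so the apostrophe in
# the second name needs no manual escaping.
db.executemany(
    "INSERT INTO DOTA2 (aNAMES, WINS, PICKS) VALUES (?, ?, ?)",
    [(name, float(win), float(picks)) for name, (win, picks) in rows.items()],
)
db.commit()

stored = db.execute("SELECT aNAMES, WINS FROM DOTA2 ORDER BY aNAMES").fetchall()
print(stored)
db.close()
```

Building the statement with string formatting instead would break on the name containing an apostrophe, which is the main reason to prefer placeholders.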