Python Crawler: Beautiful Soup in Practice

In the previous chapter we covered the basic syntax of bs4; today we will put it to work parsing a real page.

Getting the article summaries

In the XPath chapter we showed how to extract the summaries of my blog posts; today we will grab the same content with Beautiful Soup and compare the two approaches. The XPath version is at https://blog.csdn.net/lovemenghaibin/article/details/82898280
Here is the same extraction with Beautiful Soup:

from bs4 import BeautifulSoup  # the "lxml" parser backend only needs lxml installed, not imported
import requests

url = "https://blog.csdn.net/lovemenghaibin"

header = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36"
}

response = requests.get(url, headers=header)

soup = BeautifulSoup(response.text, "lxml")

# Each article entry is a div.article-item-box; iterate and pull out the fields
articles = soup.select(".article-item-box")
infos = []
for article in articles:
    # The first <a> holds the title; drop the "原" (original) badge text
    title = article.select_one("a").get_text().replace("原", "").strip()
    href_url = article.select_one("a")["href"]
    content = article.select_one("p[class=content] > a").get_text().strip()
    # The info-box <p> tags carry the date, read count and comment count, in that order
    article_date = article.select_one(".info-box > p").get_text().strip()
    article_read = article.select(".info-box > p")[1].get_text().replace("阅读数:", "").strip()
    article_comment = article.select(".info-box > p")[2].get_text().replace("评论数:", "").strip()
    info = {
        "title": title,
        "href": href_url,
        "content": content,
        "article_date": article_date,
        "article_read": article_read,
        "article_comment": article_comment
    }
    infos.append(info)
print(infos)
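The `select`/`select_one` calls above also have `find`/`find_all` equivalents; both APIs return the same tags. A minimal sketch on a hand-written snippet (the markup below is a simplified, hypothetical stand-in for one article box, and uses the stdlib `html.parser` so no extra backend is needed):

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup for a single article entry.
html = """
<div class="article-item-box">
  <a href="/post/1">原 My first post</a>
  <p class="content"><a>A short summary.</a></p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
box = soup.select_one(".article-item-box")

# CSS-selector style, as used in the script above.
title_css = box.select_one("a").get_text().replace("原", "").strip()

# Equivalent find() style: same tag, different lookup API.
title_find = box.find("a").get_text().replace("原", "").strip()

print(title_css)   # My first post
print(title_css == title_find)  # True
```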

Getting the profile information


from bs4 import BeautifulSoup
import requests

url = "https://blog.csdn.net/lovemenghaibin"

header = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36"
}

response = requests.get(url, headers=header)

soup = BeautifulSoup(response.text, "lxml")

# The sidebar profile: the #uid link gives the home URL and display name
home_url = soup.select_one("#uid")["href"]
name = soup.select_one("#uid").get_text()
profile = soup.select_one("#asideProfile")
# Only the fans counter has an id (#fanBox); the rest are indexed by position
article_count = soup.select("div .data-info dl")[0].select_one(".count").get_text()
fans = profile.select_one("#fanBox dd").get_text()
attentions = soup.select("div .data-info dl")[2].select_one("dd").get_text()
comment_count = soup.select("div .data-info dl")[3].select_one("dd").get_text()

# Same positional approach for the grade-box stats; the level sits in a title attribute
blog_level = soup.select("div .grade-box dl")[0].select_one("a")["title"].split(",")[0]
read_total_count = soup.select("div .grade-box dl")[1].select_one("dd").get_text().strip()
point_count = soup.select("div .grade-box dl")[2].select_one("dd").get_text().strip()
rank = soup.select("div .grade-box dl")[3]["title"]

info = {
    "name": name,
    "home_url": home_url,
    "article_count": article_count,
    "fans": fans,
    "attentions": attentions,
    "comment_count": comment_count,
    "blog_level": blog_level,
    "read_total_count": read_total_count,
    "point_count": point_count,
    "rank": rank
}
print(info)
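Note how `blog_level` is read: tag attributes are accessed dict-style with `tag["attr"]`, and the string is then split. A minimal sketch on a hypothetical, simplified grade-box snippet:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the level text lives in the <a> tag's title attribute.
html = '<dl class="grade-box"><a title="Level 5,read count 100k">icon</a></dl>'
soup = BeautifulSoup(html, "html.parser")

# Attributes are accessed dict-style; keep only the part before the comma.
level = soup.select_one(".grade-box a")["title"].split(",")[0]
print(level)  # Level 5
```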

Summary

Comparing this section with the previous one, you can see that in Beautiful Soup plain CSS selectors are enough for all of these basic operations, which keeps things very simple: instead of hunting for patterns in the text, we rely far less on the document's exact structure. Prefer looking elements up by attribute rather than by position; only when there is no identifying attribute do we fall back to selecting a list with CSS and taking the n-th tag's content.
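The "attribute over position" advice can be sketched in a few lines (the markup below is a hypothetical, simplified stand-in for the stats panel used earlier):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: only the fans counter carries an id.
html = """
<div class="data-info">
  <dl id="fanBox"><dd>120</dd></dl>
  <dl><dd>35</dd></dl>
  <dl><dd>980</dd></dl>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Preferred: anchor on a stable attribute; this survives reordering of the <dl>s.
fans = soup.select_one("#fanBox dd").get_text()

# Fallback: no identifying attribute, so take the n-th <dl> by position.
comments = soup.select(".data-info dl")[2].select_one("dd").get_text()

print(fans, comments)  # 120 980
```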



Reposted from blog.csdn.net/lovemenghaibin/article/details/82913203