Case 1: Scraping the Douban Movie Top 250
Goal: collect the title, rating, and number of ratings for each film in the Douban Movie Top 250.
Implementation: use the requests library to send HTTP requests, BeautifulSoup to parse the HTML pages, and the csv module to save the data to a CSV file.
Code example:
import requests
from bs4 import BeautifulSoup
import csv

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

def parse_html(html):
    # Parse one page of the Top 250 list and append its rows to the CSV file.
    soup = BeautifulSoup(html, 'lxml')
    movie_list = soup.find('ol', class_='grid_view').find_all('li')
    with open('douban_movie_top250.csv', 'a', newline='', encoding='utf-8-sig') as f:
        writer = csv.writer(f)
        for movie in movie_list:
            title = movie.find('div', class_='hd').find('span', class_='title').get_text()
            rating_num = movie.find('div', class_='star').find('span', class_='rating_num').get_text()
            comment_num = movie.find('div', class_='star').find_all('span')[-1].get_text()
            writer.writerow([title, rating_num, comment_num])

# Write the header row once, then crawl the 10 list pages (25 films per page).
with open('douban_movie_top250.csv', 'w', newline='', encoding='utf-8-sig') as f:
    csv.writer(f).writerow(['电影名称', '评分', '评价人数'])

for i in range(10):
    url = f'https://movie.douban.com/top250?start={i*25}&filter='
    response = requests.get(url, headers=headers)
    parse_html(response.text)
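Douban may throttle or reject rapid anonymous requests. The helper below is a minimal hardening sketch (the retry count, delay, and timeout values are assumptions, not part of the original example); the page loop above could call fetch(url, headers) and skip a page when it returns None.

import time
import requests

def fetch(url, headers, retries=3, delay=2):
    # Retry a few times with a short pause between attempts so the crawl stays polite.
    for _ in range(retries):
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            return response.text
        time.sleep(delay)
    return None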
Case 2: Scraping the Maoyan Top 100 Films
Goal: collect the title, starring cast, and release date of each film in the Maoyan Top 100.
Implementation: use the requests library to send HTTP requests, regular expressions to parse the HTML pages, and save the data to a TXT file.
Code example:
import requests
import re

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

def parse_html(html):
    # Capture the title, cast, and release date from each board entry.
    pattern = re.compile('<p class="name"><a href=".*?" title="(.*?)" data-act="boarditem-click" data-val="{movieId:\\d+}">(.*?)</a></p>.*?<p class="star">(.*?)</p>.*?<p class="releasetime">(.*?)</p>', re.S)
    items = re.findall(pattern, html)
    return [{'电影名称': item[1], '主演': item[2].strip(), '上映时间': item[3]} for item in items]

def save_data():
    # The board is paged 10 films at a time via the offset parameter.
    with open('maoyan_top100.txt', 'w', encoding='utf-8') as f:
        for i in range(10):
            url = f'https://maoyan.com/board/4?offset={i*10}'
            response = requests.get(url, headers=headers)
            for item in parse_html(response.text):
                f.write(str(item) + '\n')

save_data()
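Writing str(item) stores each record as a Python dict literal, which is awkward to parse back later. A common variation, shown here only as a sketch and not part of the original example, is to serialize each record as one JSON object per line:

import json

def save_items(items, path='maoyan_top100.txt'):
    # Append one JSON object per line; ensure_ascii=False keeps the Chinese field values readable.
    with open(path, 'a', encoding='utf-8') as f:
        for item in items:
            f.write(json.dumps(item, ensure_ascii=False) + '\n')

Each line can then be reloaded with json.loads when the data is reused.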
Case 3: Scraping Replies from a Baidu Tieba Thread
Goal: collect all replies in a given Baidu Tieba thread.
Implementation: use the requests library to send HTTP requests, regular expressions to parse the HTML pages, and save the data to a CSV file.
Code example:
import csv
import requests
import re

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'
}

def main(page):
    # Fetch one page of the thread; replace 帖子ID in the URL with the actual thread ID.
    url = f'https://tieba.baidu.com/p/帖子ID?pn={page}'
    resp = requests.get(url, headers=headers)
    html = resp.text
    # Extract reply text, author names, and post times from the raw HTML.
    comments = re.findall('style="display:;">(.*?)</div>', html)
    users = re.findall('class="p_author_name j_user_card" href=".*?" target="_blank">(.*?)</a>', html)
    comment_times = re.findall('楼</span><span class="tail-info">(.*?)</span><div', html)
    with open('tieba_comments.csv', 'a', newline='', encoding='utf-8') as f:
        csvwriter = csv.writer(f)
        for u, c, t in zip(users, comments, comment_times):
            # Skip rows whose captured text still contains embedded markup or an implausibly long user name.
            if 'img' in c or 'div' in c or len(u) > 50:
                continue
            csvwriter.writerow((u, t, c))
    print(f'第{page}页爬取完毕')

# Write the header row once, then crawl the first 7 pages of the thread.
with open('tieba_comments.csv', 'w', newline='', encoding='utf-8') as f:
    csv.writer(f).writerow(('评论用户', '评论时间', '评论内容'))

for page in range(1, 8):
    main(page)
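The filter above simply drops replies whose captured text still contains img or div markup. A gentler alternative, a sketch rather than part of the original script, is to strip leftover tags and keep the row:

import re

def clean_comment(raw):
    # Remove residual HTML tags from the captured reply text and trim surrounding whitespace.
    return re.sub(r'<[^>]+>', '', raw).strip()

The write loop could then call csvwriter.writerow((u, t, clean_comment(c))) instead of skipping those rows.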
Case 4: Scraping a Nationwide List of Universities
Goal: collect a nationwide list of universities and save it to a TXT file.
Implementation: use the requests library to send HTTP requests and regular expressions to parse the HTML page.
Note: the page structure of university listings varies from site to site, so the concrete implementation has to be adapted to the target page.
Code example (assuming the structure of the target page is known):
import requests
import re

url = '目标网页URL'  # replace with the URL of the target page
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

response = requests.get(url, headers=headers)
html = response.text
# Write the regex according to the structure of the target page.
pattern = re.compile('正则表达式匹配高校名称的模式')
schools = re.findall(pattern, html)

with open('universities.txt', 'w', encoding='utf-8') as f:
    for school in schools:
        f.write(school + '\n')
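As an illustration only: if the target page happened to list each school name inside a table cell such as <td class="name">XX大学</td> (a hypothetical structure, not any specific site), the pattern could be written like this:

import re

# Hypothetical pattern for a page whose school names sit in <td class="name"> cells.
pattern = re.compile(r'<td class="name">\s*(.*?)\s*</td>', re.S)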
Case 5: Scraping Article Information from Cnblogs
Goal: collect article information from Cnblogs listing pages, including the URL, title, author, digg (like) count, and comment count.
Implementation: use the requests library to send HTTP requests, BeautifulSoup to parse the HTML pages, and pandas to save the data to an Excel file.
Code example:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url_root = 'https://www.cnblogs.com/sitehome/p/'
headers = {
    "Referer": "https://www.cnblogs.com/",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36"
}

# Listing pages 2 through 9.
urls = [url_root + f'{i}' for i in range(2, 10)]

def get_single_article_info(url):
    # Fetch one listing page and collect the per-article fields.
    resp = requests.get(url, headers=headers)
    if resp.status_code != 200:
        print('error!')
        return []
    soup = BeautifulSoup(resp.text, "html.parser")
    articles = soup.find('div', id='post_list', class_='post-list').find_all('article', class_='post-item')
    data = []
    for article in articles:
        author, comment, recomment, views = '', 0, 0, 0
        infos = article.find_all('a')
        for info in infos:
            if 'post-item-title' in str(info):
                href = info['href']