Python Data Analysis in Practice: Crawling and Analyzing Content from an Online Travel Guide Website

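1. Case Background

Online travel guide websites offer travelers a wealth of information, covering attraction descriptions, itinerary planning, hotel recommendations, and food recommendations. This information helps travelers plan their trips and also shapes consumption trends in the tourism market. For tourism professionals and destination marketing organizations, understanding user behavior and content preferences on these sites is essential. Yet collecting and analyzing data scattered across many pages is not easy. A Python crawler can collect the data systematically, and data analysis methods can then extract useful insights that help stakeholders improve their travel products and services and become more competitive.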

2. Code Implementation

import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

2.1 Data Collection

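We use a well-known online travel guide website as the data source. The site has many destination pages, and each page has sections for attractions, hotels, and food.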

# Fetch the HTML of a single page; return None on any non-200 response
def get_page_content(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }
    # A timeout keeps the script from hanging on an unresponsive server
    response = requests.get(url, headers=headers, timeout=10)
    if response.status_code == 200:
        return response.text
    else:
        return None
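
The fetch function above makes a single attempt and does not handle network errors. As a rough sketch of one way to make it more tolerant, the block below mounts urllib3's `Retry` on a `requests.Session` so that transient failures are retried with backoff. The helper names `build_session` and `fetch_html`, the retry counts, and the status list are assumptions chosen for illustration, not part of the original article.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_session(total_retries=3, backoff_factor=0.5):
    """Return a requests.Session that retries transient failures (hypothetical helper)."""
    retry = Retry(
        total=total_retries,
        backoff_factor=backoff_factor,
        # Retry on rate limiting and common server-side errors
        status_forcelist=[429, 500, 502, 503, 504],
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session


def fetch_html(session, url, timeout=10):
    """Return the page HTML, or None if the request fails or returns an error status."""
    try:
        response = session.get(url, timeout=timeout)
        response.raise_for_status()
    except requests.RequestException:
        return None
    return response.text
```

A caller would create the session once with `build_session()` and reuse it for every destination page, which also lets the underlying connections be reused.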

# Fetch the Beijing travel guide page as an example
beijing_url = 'https://www.travelguide.com/beijing'
beijing_page = get_page_content(beijing_url)
if beijing_page:
    soup = BeautifulSoup(beijing_page, 'html.parser')

# Extract attraction information (name, rating, review count)
def extract_attractions(soup):
    attractions = []
    attraction_divs = soup.find_all('div', class_='attraction-item')
    for div in attraction_divs:
        name = div.find('h3', class_='attraction-name').text.strip()
        rating = div.find('span', class_='rating-value')
        rating = float(rating.text.strip()) if rating else None
        review_count = div.find('span', class_='review-count')
        # The review count is taken from the first space-separated token
        review_count = int(review_count.text.strip().split(' ')[0]) if review_count else None
        attractions.append({'Attraction_Name': name, 'Rating': rating, 'Review_Count': review_count})
    return attractions

# Extract hotel information (name, price, rating)
def extract_hotels(soup):
    hotels = []
    hotel_divs = soup.find_all('div', class_='hotel-item')
    for div in hotel_divs:
        name = div.find('h3', class_='hotel-name').text.strip()
        price = div.find('span', class_='hotel-price')
        # The price is taken from the first space-separated token
        price = float(price.text.strip().split(' ')[0]) if price else None
        rating = div.find('span', class_='hotel-rating')
        rating = float(rating.text.strip()) if rating else None
        hotels.append({'Hotel_Name': name, 'Price': price, 'Rating': rating})
    return hotels

# Extract food information (name, price)
def extract_foods(soup):
    foods = []
    food_divs = soup.find_all('div', class_='food-item')
    for div in food_divs:
        name = div.find('h3', class_='food-name').text.strip()
        price = div.find('span', class_='food-price')
        # The price is taken from the first space-separated token
        price = float(price.text.strip().split(' ')[0]) if price else None
        foods.append({'Food_Name': name, 'Price': price})
    return foods

# Run the extraction and load the results into DataFrames
if beijing_page:
    attractions = extract_attractions(soup)
    hotels = extract_hotels(soup)
    foods = extract_foods(soup)
    attractions_df = pd.DataFrame(attractions)
    hotels_df = pd.DataFrame(hotels)
    foods_df = pd.DataFrame(foods)
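
The extraction functions above assume that every item has a name element and that numbers appear as the first space-separated token. Real pages often break both assumptions, for example with "12,345 reviews" or a missing rating. The sketch below shows one possible defensive variant of `extract_attractions`, checked against a small inline HTML sample. The sample markup and the helpers `parse_number`, `text_of`, and `extract_attractions_safe` are hypothetical and only mirror the class names used in this article.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup that mirrors the class names assumed by extract_attractions
SAMPLE_HTML = """
<div class="attraction-item">
  <h3 class="attraction-name"> Forbidden City </h3>
  <span class="rating-value">4.8</span>
  <span class="review-count">12,345 reviews</span>
</div>
<div class="attraction-item">
  <h3 class="attraction-name">Summer Palace</h3>
</div>
"""


def parse_number(text):
    """Return the first number in text as a float, ignoring thousands separators."""
    if text is None:
        return None
    match = re.search(r'\d+(?:\.\d+)?', text.replace(',', ''))
    return float(match.group()) if match else None


def text_of(parent, tag, class_name):
    """Return the stripped text of a child element, or None if it is missing."""
    node = parent.find(tag, class_=class_name)
    return node.get_text(strip=True) if node else None


def extract_attractions_safe(soup):
    """Like extract_attractions, but missing fields become None instead of raising."""
    rows = []
    for div in soup.find_all('div', class_='attraction-item'):
        reviews = parse_number(text_of(div, 'span', 'review-count'))
        rows.append({
            'Attraction_Name': text_of(div, 'h3', 'attraction-name'),
            'Rating': parse_number(text_of(div, 'span', 'rating-value')),
            'Review_Count': int(reviews) if reviews is not None else None,
        })
    return rows


sample_rows = extract_attractions_safe(BeautifulSoup(SAMPLE_HTML, 'html.parser'))
print(sample_rows)
```

Testing selectors against a saved or inline snippet like this makes it possible to verify the parsing logic without sending repeated requests to the site. Note that stripping commas assumes they are thousands separators, which would be wrong for locales that use a comma as the decimal mark.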

2.2 Exploratory Data Analysis

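# Show basic information about the attraction data
# (info() prints its report directly and returns None, so no print() is needed)
attractions_df.info()
# Show basic information about the hotel data
hotels_df.info()
# Show basic information about the food data
foods_df.info()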
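
`info()` reports column types and non-null counts. A common next step in exploratory analysis is to count missing values per column and look at summary statistics for the numeric fields. The sketch below does this on a tiny placeholder table with the same columns as `hotels_df`. The rows are illustrative values made up for the example, not scraped data.

```python
import pandas as pd

# Illustrative placeholder rows with the same columns as hotels_df (not scraped data)
hotels_demo = pd.DataFrame({
    'Hotel_Name': ['Hotel A', 'Hotel B', 'Hotel C', 'Hotel D'],
    'Price': [320.0, 580.0, None, 1200.0],
    'Rating': [4.2, 4.6, 3.9, None],
})

# Number of missing values in each column
missing_counts = hotels_demo.isna().sum()

# Count, mean, std, min, quartiles, and max for the numeric columns;
# describe() skips missing values when computing each statistic
summary = hotels_demo[['Price', 'Rating']].describe()

print(missing_counts)
print(summary)
```

The same two calls can be applied to `attractions_df`, `hotels_df`, and `foods_df` once the crawl has produced real rows.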

# Plot the distribution of attraction ratings
plt.figure(figsize=(10, 6))
sns.histplot(attractions_df['Rating'].dropna(), kde=True)
plt.title('Distribution of Attraction Ratings')
plt.xlabel('Rating')
plt.ylabel('Frequency')
plt.show()
