Software testing | Using Python to scrape the page content of Baidu News


Introduction

As busy technical engineers, we may not have time to browse the hot news during the workday, but those of us who understand technology do not need to visit a website to keep up with major events: web crawler technology can fetch the latest, hottest headlines for us. This article introduces how to use Python to scrape the page content of Baidu News.

Environmental preparation

The libraries we will use this time are two common ones, requests and beautifulsoup4. The installation command is as follows:

pip install requests beautifulsoup4
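
If you want to confirm that the installation succeeded, a quick sanity check is to import both libraries and print their versions:

import requests
import bs4

# both imports succeeding means the packages are installed correctly
print(requests.__version__, bs4.__version__)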

Fetch page content

First, we use the requests library to send an HTTP request and get the content of the web page. The sample code is as follows:

import requests

url = 'http://news.baidu.com/'
response = requests.get(url)   # send a GET request to the Baidu News homepage
html = response.text           # the response body decoded as text
print(html)

In the above code, we use the requests library to send a GET request, and get the HTML content of the webpage through the response.text property.
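
Note that many sites reject requests bearing the default client signature, and the character encoding of the body is worth setting explicitly. The following is a more defensive sketch under those assumptions; the User-Agent string and timeout value are illustrative, not values required by Baidu's site:

import requests

url = 'http://news.baidu.com/'
# an illustrative browser-like User-Agent; real scrapers often need one
headers = {'User-Agent': 'Mozilla/5.0'}

response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()                      # stop early on HTTP errors (4xx/5xx)
response.encoding = response.apparent_encoding   # infer the encoding from the body
html = response.text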

Next, we can use the BeautifulSoup library to parse the obtained HTML content and extract the required information. The sample code is as follows:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
# find every <a> tag whose class attribute is "f-title"
news_list = soup.find_all('a', class_='f-title')
for news in news_list:
    print(news.get('href'))   # the link to the article
    print(news.get_text())    # the headline text

In the above code, we use the BeautifulSoup library to parse the HTML content, find all a tags with the class "f-title" via the find_all method, and then obtain each link and title via the get and get_text methods.
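
One practical detail the snippet glosses over: the href values may be relative paths rather than absolute URLs. Here is a small sketch that resolves them with urljoin and collects the results for the next step; news_items is a name introduced here purely for illustration:

from urllib.parse import urljoin

news_items = []
for news in news_list:
    link = urljoin(url, news.get('href', ''))  # resolve relative links against the page URL
    title = news.get_text(strip=True)          # strip surrounding whitespace
    news_items.append((link, title))

print(news_items[:3])  # inspect the first few results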

Analyze news content

In the previous step, we obtained the link and title of each news item. Next, we need to dig further into the content of each article.

First, we can use the requests library from above to send an HTTP request to a news link and get the HTML of the article's detail page. The sample code is as follows:

# 'some_news_url' is a placeholder; substitute a real article link here
news_url = 'http://news.baidu.com/some_news_url'
news_response = requests.get(news_url)
news_html = news_response.text
print(news_html)
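
Rather than hard-coding one URL, you can walk the list of links collected earlier. A minimal sketch, assuming the hypothetical news_items list built in the previous section; the one-second pause between requests is a courtesy to the server:

import time

for link, title in news_items:
    try:
        news_response = requests.get(link, timeout=10)
        news_response.raise_for_status()
    except requests.RequestException as exc:
        print(f'Failed to fetch {link}: {exc}')  # skip unreachable articles
        continue
    print(title, len(news_response.text))
    time.sleep(1)  # throttle requests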

Then, we can use the BeautifulSoup library to parse the HTML content of the news and extract the body content of the news. The sample code is as follows:

news_soup = BeautifulSoup(news_html, 'html.parser')
# assume the article body lives in a <div> with class "news-content"
news_content = news_soup.find('div', class_='news-content')
print(news_content.get_text())

In the above code, we assume that the tag containing the news body has the class attribute "news-content"; we locate that tag via the find method and obtain its text content via the get_text method.
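
Since "news-content" is only an assumption about the page structure, find may return None on a real page, and the get_text call above would then raise an AttributeError. A guarded sketch:

news_content = news_soup.find('div', class_='news-content')
if news_content is not None:
    print(news_content.get_text(strip=True))
else:
    print('No element with class "news-content" on this page')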

Data storage and processing

In the first two steps, we obtained the links, titles, and content of the news items. Next, we can save this data to a local file or a database, or process it further.

A common way to save data is to write the data to a CSV file. The sample code is as follows:

import csv

data = [['Link', 'Title', 'Content'],
        ['http://news.baidu.com/some_news_url', 'News title 1', 'News content 1'],
        ['http://news.baidu.com/some_news_url', 'News title 2', 'News content 2'],
        ['http://news.baidu.com/some_news_url', 'News title 3', 'News content 3']]

with open('news.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerows(data)

In the above code, we first define a two-dimensional list, data, containing a header row plus the link, title, and content of each news item, and then use the csv library to write it to a file named news.csv.
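
For the database option mentioned above, here is a minimal sketch using Python's built-in sqlite3 module; the news.db filename and the table schema are illustrative choices, and data[1:] skips the header row of the list defined earlier:

import sqlite3

conn = sqlite3.connect('news.db')
conn.execute('CREATE TABLE IF NOT EXISTS news (link TEXT, title TEXT, content TEXT)')
conn.executemany('INSERT INTO news VALUES (?, ?, ?)', data[1:])  # skip the header row
conn.commit()
conn.close()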

Note: When crawling website data, please make sure to comply with relevant laws, regulations and website usage policies to avoid legal disputes.
