In this blog post, we will learn how to use Python to write a web crawler that collects statistics from government open data websites (e.g. the US government data portal data.gov). We will use the requests and BeautifulSoup libraries, plus the csv module, to achieve this. The article covers the following:
Table of contents
1. Basic concepts of web crawlers
2. Use the requests library to get the content of the web page
3. Use BeautifulSoup to parse HTML
4. Extract statistics from government open data websites
5. Store the acquired data into a CSV file
6. Optimization and improvement of crawlers
1. Basic concepts of web crawlers
A web crawler is a program that automatically accesses the Internet and obtains information. In simple terms, it is like a virtual "spider", crawling on the "web" of the Internet, from one link to another, to obtain the data it needs. In this example, we'll use Python to write a web crawler that fetches statistics from government open data websites.
2. Use the requests library to get the content of the web page
First, we need to fetch the web page content with Python's requests library. requests is a simple, easy-to-use HTTP library that helps us send HTTP requests and read the responses. Before anything else, install the library:
pip install requests
After the installation is complete, we can use the following code to get the content of the web page:
import requests

url = 'https://www.example.gov/data'
response = requests.get(url)

if response.status_code == 200:
    html_content = response.text
else:
    print(f'Error {response.status_code}: Could not fetch the webpage.')
3. Use BeautifulSoup to parse HTML
Next, we need to parse the fetched HTML content. For this, we will use the BeautifulSoup library. First, install beautifulsoup4 together with the lxml parser:
pip install beautifulsoup4 lxml
Once installed, we can parse HTML with the following code:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_content, 'lxml')
4. Extract statistics from government open data websites
Now that we have fetched and parsed the web content, we need to extract statistics from it. This requires analyzing the HTML structure of the target website and finding the elements that contain statistical information. Here is a sample code to extract dataset name, description, release date and download link:
def extract_datasets(soup):
    dataset_list = []
    for dataset in soup.find_all('li', class_='dataset'):
        title = dataset.find('h2', class_='title').text.strip()
        description = dataset.find('p', class_='description').text.strip()
        release_date = dataset.find('span', class_='release-date').text.strip()
        download_link = dataset.find('a', class_='download')['href']
        dataset_data = {
            'title': title,
            'description': description,
            'release_date': release_date,
            'download_link': download_link
        }
        dataset_list.append(dataset_data)
    return dataset_list
datasets = extract_datasets(soup)
5. Store the acquired data into a CSV file
Once the statistics are obtained, we can store them in a CSV file. Here is a sample code for writing data to a CSV file:
import csv
def write_datasets_to_csv(datasets, filename):
    with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
        fieldnames = ['title', 'description', 'release_date', 'download_link']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for dataset in datasets:
            writer.writerow(dataset)
write_datasets_to_csv(datasets, 'gov_datasets.csv')
6. Optimization and improvement of crawlers
In practice, we may need to optimize the crawler to improve efficiency and avoid being blocked. Here are some suggestions:
- Limit the request rate: to respect the website's crawling policy and avoid putting excessive load on the server, add a delay between requests. You can use the time.sleep() function for this.
- Use a proxy server: to avoid being blocked by the website for visiting too frequently, send requests through a proxy using the proxies parameter of the requests library.
- Error handling and retries: network requests can fail in many ways (for example, connection timeouts or server errors). Add exception handling for these errors and retry the request when one occurs.
- Pagination: when crawling large amounts of data, you may need to handle pagination. Analyze the page-turning links in the page and scrape the data page by page.
- Multi-threading or asynchronous I/O: to improve crawler throughput, consider sending requests in parallel using Python's threading library or asyncio.
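The first and third suggestions (rate limiting, and error handling with retries) can be combined into a small helper. The sketch below is illustrative: the function name fetch_with_retries and the retry count and delay values are assumptions, not part of the original tutorial.

```python
import time
import requests

def fetch_with_retries(url, retries=3, delay=2.0):
    """Fetch a URL, waiting `delay` seconds between attempts and
    retrying on network errors or non-200 responses (illustrative values)."""
    for attempt in range(1, retries + 1):
        try:
            response = requests.get(url, timeout=10)
            if response.status_code == 200:
                return response.text
            print(f'Attempt {attempt}: got status {response.status_code}')
        except requests.RequestException as exc:
            print(f'Attempt {attempt} failed: {exc}')
        # Pause before the next attempt -- this doubles as simple rate limiting.
        time.sleep(delay)
    return None  # all attempts failed
```

In the earlier example you would then call fetch_with_retries(url) instead of requests.get(url) directly, and check the result for None before parsing.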
To sum up, we have learned how to use Python to write a web crawler that obtains statistics from government open data websites. In practice, you can optimize and extend the crawler to suit your own needs. I hope this article has been helpful, and I wish you all the best on the road of data scraping!