Source code included! Use Python and the requests library to easily crawl national university rankings

Preface

China's university ranking data is relatively scattered. Some authoritative bodies publish comprehensive lists, such as the Ministry of Education's "Double First-Class", "985", and "211" programs, while other institutions publish subject-specific rankings. The specific ranking data to crawl therefore may need to be filtered according to your actual needs.

Code

  1. Import the requests and BeautifulSoup libraries: `import requests` and `from bs4 import BeautifulSoup`. The requests library is used to send HTTP requests and obtain response data, while the BeautifulSoup library parses HTML and provides convenient methods for traversing and searching the HTML DOM structure.
import requests
from bs4 import BeautifulSoup
  2. Define a function `get_rank_data()` for crawling the ranking data. Inside the function, first define the target URL: `url = 'https://www.shanghairanking.cn/rankings/bcur/2021'`, the main page of the national university rankings. Then define the request headers: `headers = {'User-Agent': '...'}`. This header carries the visitor's browser, operating system, and related information, which helps simulate a real browser visiting the target site.

  3. Use the requests library to send an HTTP request: `response = requests.get(url, headers=headers)`. The `requests.get()` method sends a GET request to the target URL with the given headers, and the returned response data is stored in the `response` object.

# Send an HTTP request and get the response data
def get_rank_data():
    url = 'https://www.shanghairanking.cn/rankings/bcur/2021'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
    }
    response = requests.get(url, headers=headers)
  4. Parse the HTML data with BeautifulSoup: `soup = BeautifulSoup(response.text, 'html.parser')`. The `text` attribute of the response object contains the retrieved HTML; pass it to the `BeautifulSoup()` constructor, which returns a BeautifulSoup object. That object's `find()` and `find_all()` methods then make it easy to locate and extract target elements.
    # Parse the HTML data
    soup = BeautifulSoup(response.text, 'html.parser')
    tbody = soup.find('tbody')
    trs = tbody.find_all('tr')
  5. Extract the ranking data: `tbody = soup.find('tbody')` and `trs = tbody.find_all('tr')` use the `find()` and `find_all()` methods to locate the elements holding the ranking data, here the table's `tbody` and `tr` elements. Then iterate over all `tr` elements, use `find_all()` to extract each row's `td` elements, and call `get_text()` to obtain each cell's text. Store each row in a list as a dictionary: `rank_data.append({'rank': rank, 'name': name, 'location': location, 'category': category})`.
    # Extract the ranking data
    rank_data = []
    for tr in trs:
        tds = tr.find_all('td')
        if tds:
            rank = tds[0].get_text()
            name = tds[1].get_text()
            location = tds[2].get_text()
            category = tds[3].get_text()
            rank_data.append({'rank': rank, 'name': name,
                              'location': location, 'category': category})

    return rank_data
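Because the page's HTML structure may change over time, it helps to exercise the same extraction logic offline against a small HTML snippet before running the real crawl. A minimal sketch, where the sample table below is made up for illustration and only mimics the shape of the real ranking table:

```python
from bs4 import BeautifulSoup

# A made-up fragment mimicking the ranking table's structure
sample_html = """
<table><tbody>
  <tr><td>1</td><td>Tsinghua University</td><td>Beijing</td><td>Comprehensive</td></tr>
  <tr><td>2</td><td>Peking University</td><td>Beijing</td><td>Comprehensive</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(sample_html, 'html.parser')
rows = []
for tr in soup.find('tbody').find_all('tr'):
    tds = tr.find_all('td')
    if tds:
        # strip=True trims surrounding whitespace from each cell
        rows.append({'rank': tds[0].get_text(strip=True),
                     'name': tds[1].get_text(strip=True)})

print(rows)
```

If the selectors stop matching after a site redesign, this kind of offline check fails immediately, without any network traffic.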

if __name__ == '__main__':
    rank_data = get_rank_data()
    for data in rank_data:
        print(f'{data["rank"]}: {data["name"]} ({data["location"]}) - {data["category"]}')
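Beyond printing, you may want to persist the results. A minimal sketch using the standard csv module; the filename `rank_data.csv` is an arbitrary choice, and the sample rows below stand in for what `get_rank_data()` returns:

```python
import csv

# Sample rows in the same dictionary shape get_rank_data() produces
rank_data = [
    {'rank': '1', 'name': 'Tsinghua University', 'location': 'Beijing', 'category': 'Comprehensive'},
    {'rank': '2', 'name': 'Peking University', 'location': 'Beijing', 'category': 'Comprehensive'},
]

# utf-8-sig adds a BOM so Excel opens Chinese text correctly
with open('rank_data.csv', 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.DictWriter(f, fieldnames=['rank', 'name', 'location', 'category'])
    writer.writeheader()      # write the column headers first
    writer.writerows(rank_data)
```

`csv.DictWriter` maps each dictionary's keys onto the declared `fieldnames`, so the column order in the file is fixed regardless of dictionary insertion order.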

Summarize

The above code uses the third-party libraries requests and BeautifulSoup. It first visits the national university ranking page at shanghairanking.cn, then uses BeautifulSoup to parse the table data in the HTML page (note that the table's HTML structure may change as the page changes), and finally stores the data in the rank_data list as dictionaries.

You can change the URL in the code to another address with relevant data, such as the official websites of major universities or other ranking sites, and then process and clean the crawled data to meet your visualization needs.
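As one example of such post-processing, you could aggregate the crawled rows before visualizing them. A sketch counting universities per province with the standard library; the sample rows are made up and only mirror the shape produced by `get_rank_data()`:

```python
from collections import Counter

# Made-up rows in the shape produced by get_rank_data()
rank_data = [
    {'rank': '1', 'name': 'Tsinghua University', 'location': 'Beijing', 'category': 'Comprehensive'},
    {'rank': '2', 'name': 'Peking University', 'location': 'Beijing', 'category': 'Comprehensive'},
    {'rank': '3', 'name': 'Zhejiang University', 'location': 'Zhejiang', 'category': 'Comprehensive'},
]

# Tally how many ranked universities each province/city has
by_location = Counter(row['location'] for row in rank_data)
print(by_location.most_common())  # [('Beijing', 2), ('Zhejiang', 1)]
```

The resulting counts can be fed directly into a bar chart or similar visualization.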

Origin blog.csdn.net/m0_48405781/article/details/131021793