[Crawler Notes] Using proxy IPs in a simple Python crawler

1. Introduction

In recent years, crawlers have become increasingly common on the Internet, and many websites now restrict them and block suspicious requests. To keep crawling tasks running normally, crawlers often use proxy IPs to hide their real IP addresses and avoid being banned by the server. This article introduces how to obtain proxy IPs with a Python crawler and how to use them in your own crawler.

2. Obtaining proxy IPs

There are two ways to obtain proxy IPs: free proxy IP websites and paid proxy IP services. Free websites provide proxy IPs at no cost, but their quality is very unstable and the IPs are easily banned or expire quickly; paid services provide stable, reliable proxy IPs, but they cost money. Since this article focuses on the Python crawler itself, we will obtain proxy IPs from a free proxy IP website.

Specifically, we can use a crawler to scrape the proxy IP lists published on free proxy IP websites. Here we take the Zdaye proxy website as an example. The specific steps are as follows:

  1. Open the Zdaye proxy website (https://www.zdaye.com/), select the proxy IP type and location, and click the search button.
  2. Open the browser's developer tools (F12), switch to the Network tab, click the Clear button, then click the "Get More Content" button and observe whether new requests are sent.
  3. You will find a request named "nn" whose URL is https://www.zdaye.com/nn/1, where "nn" stands for high-anonymity proxy IPs and the number "1" is the page number. We can obtain proxy IPs from different pages by changing the page number.
  4. Add a "User-Agent" field to the request headers to simulate a browser request and avoid being rejected by the server.
  5. Fetch the HTML in the response and extract the proxy IPs and their port numbers from it using regular expressions or the BeautifulSoup library (a regex sketch follows this list).
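
As an illustration of the regular-expression option in step 5, here is a minimal sketch that pulls IP:port pairs out of the HTML. It assumes each IP and port sit in adjacent <td> cells; the actual page markup may differ, so treat the pattern as illustrative. The BeautifulSoup-based implementation below is the one we actually use.

import re

# Illustrative only: extract "IP:port" pairs, assuming adjacent <td> cells
def extract_proxies_with_regex(html):
    pattern = re.compile(r'<td>(\d{1,3}(?:\.\d{1,3}){3})</td>\s*<td>(\d{2,5})</td>')
    return [ip + ':' + port for ip, port in pattern.findall(html)]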

The complete Python implementation of these steps is as follows:

import requests
from bs4 import BeautifulSoup

# Fetch the proxy IP list from the first 10 pages of the site
def fetch_proxy_ips():
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
    url_template = 'https://www.zdaye.com/nn/{}'
    proxy_ips = []
    for page in range(1, 11):
        url = url_template.format(page)
        resp = requests.get(url, headers=headers)
        soup = BeautifulSoup(resp.text, 'html.parser')
        trs = soup.find_all('tr')
        # Skip the header row; the IP and port are taken from the 2nd and 3rd cells
        for tr in trs[1:]:
            tds = tr.find_all('td')
            proxy_ip = tds[1].text + ':' + tds[2].text
            proxy_ips.append(proxy_ip)
    return proxy_ips

# Test whether a proxy IP is usable
def test_proxy_ip(proxy_ip):
    url = 'http://httpbin.org/ip'
    proxies = {
        'http': 'http://' + proxy_ip,
        'https': 'https://' + proxy_ip
    }
    try:
        resp = requests.get(url, proxies=proxies, timeout=5)
        resp.raise_for_status()
        return True
    except requests.RequestException:
        return False

# Collect the proxy IPs that pass the test
def get_valid_proxy_ips():
    proxy_ips = fetch_proxy_ips()
    valid_proxy_ips = []
    for proxy_ip in proxy_ips:
        if test_proxy_ip(proxy_ip):
            valid_proxy_ips.append(proxy_ip)
    return valid_proxy_ips

print(get_valid_proxy_ips())

In the above code, we first use the fetch_proxy_ips() function to crawl the first 10 pages of the high-anonymity proxy IP list on the Zdaye proxy website, then use the test_proxy_ip() function to test whether each proxy IP works (the test URL is http://httpbin.org/ip), and finally use the get_valid_proxy_ips() function to return the list of usable proxy IPs.
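
Testing proxies one by one can be slow, because every dead proxy waits for the full timeout. As an optional improvement that is not part of the original walkthrough, a thread pool can check many proxies in parallel. This is a minimal sketch that reuses fetch_proxy_ips() and test_proxy_ip() from the code above.

from concurrent.futures import ThreadPoolExecutor

# Optional: validate proxies concurrently so slow or dead proxies
# do not serialize the whole run (assumes the functions above are defined)
def get_valid_proxy_ips_concurrent(max_workers=20):
    proxy_ips = fetch_proxy_ips()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(test_proxy_ip, proxy_ips))
    return [ip for ip, ok in zip(proxy_ips, results) if ok]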

3. Using proxy IPs

In the requests library, a proxy IP is specified with the proxies parameter. The proxies parameter is a dictionary whose keys are the protocol ('http' or 'https') and whose values are the proxy IP and its port number. For example, to use a proxy server with IP address "1.2.3.4" and port "5678", the proxies parameter would be:

proxies = {
    'http': 'http://1.2.3.4:5678',
    'https': 'https://1.2.3.4:5678'
}
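
Passing this dictionary to requests.get() routes the request through the proxy. A minimal example is shown below; note that 1.2.3.4:5678 is just the placeholder address from above, so substitute a real proxy before running it.

import requests

# Placeholder proxy from the example above -- replace with a working proxy
proxies = {
    'http': 'http://1.2.3.4:5678',
    'https': 'https://1.2.3.4:5678'
}
resp = requests.get('http://httpbin.org/ip', proxies=proxies, timeout=5)
print(resp.text)  # should show the proxy's IP rather than your own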

The following is a simple crawler example that uses a proxy IP to crawl JD's product search pages:

import requests

# Crawl JD product search pages through a proxy IP
def crawl_jd_goods(query, proxy_ip):
    url_template = 'https://search.jd.com/Search?keyword={}&enc=utf-8&page={}'
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
    proxies = {
        'http': 'http://' + proxy_ip,
        'https': 'https://' + proxy_ip
    }
    for page in range(1, 6):
        url = url_template.format(query, page)
        resp = requests.get(url, headers=headers, proxies=proxies)
        print(resp.status_code)
        print(resp.text)

# Get the list of usable proxy IPs (get_valid_proxy_ips() is defined in the previous section)
proxy_ips = get_valid_proxy_ips()

# Crawl the JD product search pages with the first usable proxy IP
query = 'Python编程'
proxy_ip = proxy_ips[0]
crawl_jd_goods(query, proxy_ip)

In the above code, we first obtain the list of usable proxy IPs and then use the first one to crawl the JD.com product search pages (the search keyword is "Python编程", i.e. "Python programming").

4. Summary

It should be noted that a proxy IP is not a silver bullet. On websites with very strong anti-crawler mechanisms, requests can still be banned even when a proxy IP is used. In addition, some proxy IPs are of poor quality, have slow access speeds, or even return error responses. In actual use, therefore, you need to select a working proxy IP according to the specific situation; one simple approach is to fall back to the next proxy when a request fails, as sketched below.
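
The following hedged sketch rotates through the validated list and moves on to the next proxy whenever a request fails. It reuses get_valid_proxy_ips() from section 2; the httpbin.org URL is just a stand-in target.

import requests

# Try each validated proxy in turn until one request succeeds (illustrative only)
def fetch_with_rotation(url, proxy_ips, headers=None, timeout=5):
    for proxy_ip in proxy_ips:
        proxies = {
            'http': 'http://' + proxy_ip,
            'https': 'https://' + proxy_ip
        }
        try:
            resp = requests.get(url, headers=headers, proxies=proxies, timeout=timeout)
            resp.raise_for_status()
            return resp
        except requests.RequestException:
            continue  # this proxy failed, try the next one
    raise RuntimeError('All proxies failed for ' + url)

# Example usage (assumes get_valid_proxy_ips() from section 2 is available)
# resp = fetch_with_rotation('http://httpbin.org/ip', get_valid_proxy_ips())
# print(resp.text)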
