Scraping Proxy IPs with Python: Using Proxies to Collect Proxies and Build Your Own Proxy IP Pool

I. Introduction:

    When collecting data, you will inevitably run into all kinds of anti-scraping mechanisms, and IP bans are one of the most troublesome.

    IP bans come in two forms:

         Case 1: the interval between requests is too short and you hit the site too fast, so it throttles you: it warns that your request rate is too high and returns something other than the content you asked for;


         Case 2: the IP is banned outright and the site cannot be reached at all.


        Today we tackle the IP-ban problem. The solution is to use proxy IPs. Plenty of proxies are available online, both free and paid. The free ones cost nothing but are comparatively unstable and short-lived; for something practical and convenient, you need to build your own IP pool and maintain it regularly.
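
To see why this works, note that requests can route any call through a proxy via its proxies parameter. A minimal sketch (1.2.3.4:8080 is a made-up placeholder, not a live proxy):

import requests

# route the request through a proxy; the target site then sees the proxy's IP, not yours
proxies = {"http": "http://1.2.3.4:8080", "https": "http://1.2.3.4:8080"}
res = requests.get("http://example.com", proxies=proxies, timeout=10)
print(res.status_code)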

II. Building Your Own Proxy IP Pool:

1. The overall logic:

       Scrape a large batch of IPs ---> store them ---> fetch one (use it if valid; delete it and fetch the next if not) ---> use the valid IP you fetched

2. First, choose a container to hold the IPs:

    We'll store the IPs in a MySQL database: create a database (ippool) and a table (project_ip).


Column descriptions:

    ip: the IP address

    port: the port

    speed: the response speed

    proxy_type: the protocol type: http, https, etc.

    ID: an auto-increment id; it's a very simple table

The CREATE TABLE SQL:

/*
 Navicat Premium Data Transfer

 Source Server         : localhost
 Source Server Type    : MySQL
 Source Server Version : 80013
 Source Host           : localhost:3306
 Source Schema         : ippool

 Target Server Type    : MySQL
 Target Server Version : 80013
 File Encoding         : 65001
*/

SET NAMES utf8mb4;
SET FOREIGN_KEY_CHECKS = 0;

-- ----------------------------
-- Table structure for project_ip
-- ----------------------------
DROP TABLE IF EXISTS `project_ip`;
CREATE TABLE `project_ip`  (
  `ip` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci DEFAULT NULL,
  `port` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci DEFAULT NULL,
  `speed` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci DEFAULT NULL,
  `proxy_type` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci DEFAULT NULL,
  `ID` int(11) NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (`ID`) USING BTREE
) ENGINE = InnoDB CHARACTER SET = utf8mb4 COLLATE = utf8mb4_0900_ai_ci ROW_FORMAT = Dynamic;

SET FOREIGN_KEY_CHECKS = 1;
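
After running the SQL, a quick sanity check from Python confirms the table is in place (the credentials match the ones used in the scripts below):

import pymysql

conn = pymysql.connect(host='127.0.0.1', user='root', passwd='jason!2@li&*', db='ippool', charset='utf8mb4')
cursor = conn.cursor()
cursor.execute("DESCRIBE project_ip")
for column in cursor.fetchall():
    print(column)  # should list ip, port, speed, proxy_type, ID
cursor.close()
conn.close()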

3. Just in case, find a few working proxy IPs before you start scraping (advice from someone who has been there: you really must have these, otherwise...):


4. Taking the xicidaili proxy site as an example, first analyze it:


The analysis covers: how the data is requested, how it is loaded, how to reach the relevant content, and how to extract the loaded data precisely.

Conclusion: it's a plain GET request, and the data sits directly in the returned HTML page, so XPath extraction is all we need. The sketch below verifies the XPath against a hand-written sample first.
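
The snippet mimics the structure of the site's #ip_list table in simplified form (the real page has more columns, which is why the crawler below uses different td indices):

from scrapy.selector import Selector

# a simplified stand-in for the site's #ip_list table (assumed structure)
html = '''<table id="ip_list">
  <tr><th>IP</th><th>Port</th><th>Type</th><th>Speed</th></tr>
  <tr><td>1.2.3.4</td><td>9999</td><td>HTTP</td>
      <td><div class="bar" title="0.5秒"></div></td></tr>
</table>'''

sel = Selector(text=html)
for tr in sel.xpath('//table[@id="ip_list"]/tr')[1:]:
    print(tr.xpath('./td/text()').extract())                          # ['1.2.3.4', '9999', 'HTTP']
    print(tr.xpath('./td/div[@class="bar"]/@title').extract_first())  # '0.5秒'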


5. Write the code to scrape the IPs and insert them into the database:

crawlAllIp.py

import requests
from scrapy.selector import Selector
import pymysql
import random
from time import sleep

# connect to the database
conn = pymysql.connect(host='127.0.0.1', user='root', passwd='jason!2@li&*', db='ippool', charset='utf8mb4')
cursor = conn.cursor()

def crawl_ips():
    for i in range(1, 11):  # scrape the first 10 pages, roughly 1000 IPs
        sleeptime = random.choice([1, 2, 3, 4, 5, 6, 6, 7])  # pause between pages, or the site will quickly ban your IP
        print(sleeptime)
        sleep(sleeptime)
        
        # build a random request header; alternatively, use the fake_useragent library (see the sketch after this script)
        headers = {
            "User-Agent": random.choice(
                ['Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6',
                 'Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5',
                 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
                 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
                 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
                 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
                 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)',
                 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
                 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; 360SE)',
                 'Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)',
                 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20',
                 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6',
                 'Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10',
                 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
                 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1',
                 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)',
                 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12',
                 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)',
                 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1',
                 'Mozilla/5.0 (iPhone; CPU iPhone OS 10_3 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.3 Mobile/14E277 Safari/603.1.30',
                 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'])}

        # a few seed proxies: you really do need to find your own, since the ones below have most likely expired
        proxies = random.choice([
            {"http": "http://36.248.133.35:9999"},
            {"http": "http://125.123.120.109:9999"},
            {"http": "http://125.123.126.249:9999"},
        ])
        print(proxies)


        res = requests.get('https://www.xicidaili.com/nn/' + str(i), proxies=proxies, headers=headers)
        # print(res.text)
        selector = Selector(text=res.text)  # Selector needs the raw HTML text, not the requests response object
        all_trs = selector.xpath('//table[@id="ip_list"]/tr')
        ip_list = []
        for tr in all_trs[1:]:
            speed_str = tr.xpath('./td/div[@class="bar"]/@title').extract_first()  # extract the speed; None if the cell is missing
            if speed_str:
                speed = float(speed_str.split('秒')[0])  # the title looks like "0.5秒"
                all_text = tr.xpath('./td/text()').extract()
                print(all_text)
                ip = all_text[0]
                port = all_text[1]
                proxy_type = all_text[5]
                ip_list.append((ip, port, speed, proxy_type))
        for ip_info in ip_list:
            print(ip_info)
            # a parameterized query avoids SQL injection and quoting problems
            insert_sql = """INSERT INTO project_ip(ip, port, speed, proxy_type) VALUES (%s, %s, %s, %s)"""
            print(insert_sql)
            cursor.execute(insert_sql, ip_info)
            conn.commit()


if __name__ == "__main__":
    crawl_ips()
    cursor.close()  # close the cursor before the connection
    conn.close()
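
As the comment in the code notes, the hand-rolled User-Agent list can be swapped for the fake_useragent library. A minimal sketch (assumes pip install fake-useragent):

from fake_useragent import UserAgent

ua = UserAgent()
headers = {"User-Agent": ua.random}  # a fresh random User-Agent on every access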

That's it for the scraping; the IPs are now in the database.

6. Fetch one valid IP:

The idea: pull a random IP from the database and use it to request https://www.baidu.com/. If an exception is raised, the IP is dead, so call delete_ip to remove it. If there is no exception, check the returned status code: a code in the 2xx range means the IP works; anything else means it doesn't, and delete_ip removes it. Finally, return the valid IP that was found.

getOneIp.py

import requests
import pymysql

conn = pymysql.connect(host='127.0.0.1', user='root', passwd='jason!2@li&*', db='ippool', charset='utf8mb4')
cursor = conn.cursor()

class GetIP(object):
    def delete_ip(self, ip):
        # remove an invalid ip from the database
        delete_sql = """DELETE FROM project_ip WHERE ip = %s"""
        cursor.execute(delete_sql, (ip,))
        conn.commit()
        return True

    def judge_ip(self, ip, port, proxy_type):
        # check whether an ip is usable
        http_url = 'https://www.baidu.com/'
        proxy_url = '{0}://{1}:{2}'.format(str(proxy_type).lower(), ip, port)
        proxy_dict = {
            # register the proxy under both schemes so it also applies to the https test URL
            'http': proxy_url,
            'https': proxy_url,
        }
        try:
            response = requests.get(http_url, proxies=proxy_dict, timeout=10)
        except Exception:
            print("ip raised an exception")
            # the request failed, so delete this ip
            self.delete_ip(ip)
            return False
        else:
            code = response.status_code
            if 200 <= code < 300:
                print('effective ip')
                return True
            else:
                print('invalid ip')
                self.delete_ip(ip)
                return False

    def get_random_ip(self):
        # pull one random ip from the database and validate it
        random_sql = """SELECT ip, port, proxy_type FROM project_ip ORDER BY RAND() LIMIT 1"""
        cursor.execute(random_sql)
        for ip_info in cursor.fetchall():
            ip = ip_info[0]
            port = ip_info[1]
            proxy_type = ip_info[2]
            judge_re = self.judge_ip(ip, port, proxy_type)
            print(ip, port)
            if judge_re:  # True means the ip passed the check
                return "{0}://{1}:{2}".format(str(proxy_type).lower(), ip, port)
            else:
                return self.get_random_ip()  # try again until a valid ip turns up

if __name__ == "__main__":
    ip = GetIP().get_random_ip()
    print(ip)
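
With a valid proxy string in hand, plug it straight into requests' proxies parameter. A minimal sketch (http://example.com is just a placeholder target):

import requests
from getOneIp import GetIP

proxy = GetIP().get_random_ip()            # e.g. 'http://36.248.133.35:9999'
proxies = {"http": proxy, "https": proxy}  # use the proxy for both schemes
res = requests.get("http://example.com", proxies=proxies, timeout=10)
print(res.status_code)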

7. Once the IPs are in the database, if you need to make repeated calls, just fetch several valid IPs (see the sketch after the script below):

Last_Get_One_Effective_Ip.py

from getOneIp import GetIP

if __name__ == "__main__":
    get_ip = GetIP()
    effectiveIp = get_ip.get_random_ip()
    print(effectiveIp)
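
If one IP per run is not enough, a small hypothetical helper like get_n_ips below collects several validated proxies in one go (it assumes the pool currently holds at least n working IPs):

from getOneIp import GetIP

def get_n_ips(n=5):
    # each call revalidates a random ip against the pool before returning it
    getter = GetIP()
    return [getter.get_random_ip() for _ in range(n)]

if __name__ == "__main__":
    print(get_n_ips(3))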

For long-term use, deploy the code and schedule periodic crawls and re-checks to keep the pool's IPs valid; a bare-bones version of such a loop is sketched below.
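
A minimal scheduling sketch, assuming crawlAllIp.py from step 5 is importable (the one-hour interval is an arbitrary choice; cron or a library such as APScheduler would be the more robust option):

from time import sleep
from crawlAllIp import crawl_ips

# re-crawl the proxy site once an hour to keep the pool topped up
while True:
    crawl_ips()
    sleep(3600)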
