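Web crawling is an important technique. While collecting data, the client IP needs to change frequently so that the target site does not block the crawler (for example, with a 403 Forbidden response). This article explains how to obtain proxy IPs through an API and use them to fetch data.
The proxy IPs in the demo below come from the provider ipidea; other providers work in much the same way.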
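(1) Register an account
Register an account on the provider's website, http://sem.ipidea.net/, and complete verification.
(2) Add your IP to the whitelist as required (the public IP of your own server)
(3) Obtain an IP and port
You will receive one IP address and port.
(4) Replace the IP and port in the demo with the ones you obtained, then run it.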
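Step (3) can also be done in code. The sketch below assumes the provider's extraction API returns plain text with one ip:port pair per line; that format and the class name ProxyListParser are assumptions for illustration, not ipidea's documented response format. It covers only the parsing step, so downloading the response body from the API is left out.

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.List;

class ProxyListParser {

    // Parses a plain-text body containing one "ip:port" entry per line.
    // The line format is an assumption; adapt it to your provider's real response.
    static List<InetSocketAddress> parse(String body) {
        List<InetSocketAddress> proxies = new ArrayList<>();
        for (String line : body.split("\\r?\\n")) {
            String entry = line.trim();
            int colon = entry.lastIndexOf(':');
            if (colon <= 0 || colon == entry.length() - 1) {
                continue; // blank line, or host/port part missing
            }
            try {
                int port = Integer.parseInt(entry.substring(colon + 1));
                if (port < 1 || port > 65535) {
                    continue; // not a usable TCP port
                }
                // createUnresolved avoids a DNS lookup while parsing
                proxies.add(InetSocketAddress.createUnresolved(entry.substring(0, colon), port));
            } catch (NumberFormatException e) {
                // port is not a number: skip the line
            }
        }
        return proxies;
    }

    public static void main(String[] args) {
        // Sample body; the second address is a made-up placeholder
        String sample = "58.218.205.47:13706\r\n10.0.0.1:8080\r\n";
        for (InetSocketAddress proxy : parse(sample)) {
            System.out.println(proxy.getHostString() + " " + proxy.getPort());
        }
    }
}
```

Each parsed entry exposes getHostString() and getPort(), which are the two values the demo passes to new HttpHost(...).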
The required JARs can be downloaded from the Maven repository at https://mvnrepository.com/
Required JARs: httpcore5-5.1.jar and httpclient5-5.0.3.jar
package com.game.test;
import org.apache.hc.client5.http.classic.methods.HttpGet;
import org.apache.hc.client5.http.config.RequestConfig;
import org.apache.hc.client5.http.impl.classic.CloseableHttpClient;
import org.apache.hc.client5.http.impl.classic.CloseableHttpResponse;
import org.apache.hc.client5.http.impl.classic.HttpClientBuilder;
import org.apache.hc.core5.http.HttpEntity;
import org.apache.hc.core5.http.HttpHost;
import org.apache.hc.core5.http.ParseException;
import org.apache.hc.core5.http.io.entity.EntityUtils;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
/**
 * Created by ipidea on 2021/2/6
 * <p>
 * Dependency (Gradle): compile 'org.apache.httpcomponents.client5:httpclient5:5.0.3'
 *
 * @see <a href="http://hc.apache.org/httpcomponents-client-5.0.x/httpclient5/dependency-info.html">httpcomponents</a>
 */
class HttpProxy {

    public static void httpProxy() {
        HttpGet request = new HttpGet("http://httpbin.org/get");
        // Route the request through the proxy; replace the IP and port with the ones you obtained
        RequestConfig requestConfig = RequestConfig.custom()
                .setProxy(new HttpHost("58.218.205.47", 13706))
                .build();
        request.setConfig(requestConfig);

        // try-with-resources closes the client and the response even if the request fails
        try (CloseableHttpClient httpClient = HttpClientBuilder.create()
                .disableRedirectHandling()
                .build();
             CloseableHttpResponse response = httpClient.execute(request)) {
            // Print the protocol version, status code, and reason phrase
            System.out.println(response.getVersion());
            System.out.println(response.getCode());
            System.out.println(response.getReasonPhrase());

            HttpEntity entity = response.getEntity();
            if (entity != null) {
                // Read the response body as a UTF-8 string
                String result = EntityUtils.toString(entity, StandardCharsets.UTF_8);
                System.out.println(result);
            }
        } catch (ParseException | IOException e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        httpProxy();
    }
}
When the request succeeds, the printed response shows the proxy's IP address instead of your own.
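Since the IP has to change often, one simple option is to cycle through several proxies in round-robin order. The following is a minimal sketch of that idea, not part of the ipidea demo: the class name ProxyRotator is hypothetical, and the second address in the example is a made-up placeholder.

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

class ProxyRotator {

    private final List<InetSocketAddress> proxies;
    private final AtomicInteger counter = new AtomicInteger();

    ProxyRotator(List<InetSocketAddress> proxies) {
        if (proxies.isEmpty()) {
            throw new IllegalArgumentException("proxy list must not be empty");
        }
        this.proxies = new ArrayList<>(proxies); // defensive copy
    }

    // Returns the proxies in order and wraps around at the end of the list.
    // floorMod keeps the index non-negative even if the counter overflows.
    InetSocketAddress next() {
        int index = Math.floorMod(counter.getAndIncrement(), proxies.size());
        return proxies.get(index);
    }

    public static void main(String[] args) {
        List<InetSocketAddress> list = new ArrayList<>();
        list.add(InetSocketAddress.createUnresolved("58.218.205.47", 13706));
        list.add(InetSocketAddress.createUnresolved("10.0.0.1", 8080)); // placeholder address
        ProxyRotator rotator = new ProxyRotator(list);
        for (int i = 0; i < 3; i++) {
            InetSocketAddress proxy = rotator.next();
            System.out.println(proxy.getHostString() + ":" + proxy.getPort());
        }
    }
}
```

To use it with the demo, call next() before each request and pass getHostString() and getPort() to new HttpHost(...) in the RequestConfig. Whether round-robin is enough depends on the target site; removing proxies that fail is a natural extension.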