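It is well known that a crawler needs dynamic (rotating) proxies to avoid the rate limits that target sites impose on frequent visitors. In practice, though, a crawler can still receive anti-scraping responses such as 403, 503, or 429 even with dynamic proxies in place. Why does this happen? In our experience, the usual causes are the following: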
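1. Rotating the User-Agent
Every ordinary HTTP request a crawler sends should carry a realistic User-Agent (UA) header, because the UA is what identifies the browser. If a request has no UA at all, or the crawler openly labels itself as a scraper, the target site is very likely to reject it.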
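As a rough illustration of UA rotation, the sketch below picks a User-Agent string at random for each request. The class name `UserAgentRotator` and the three UA strings are assumptions made for this example, not part of the product demo; in real use you would copy current strings from the browsers you want to resemble.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper: returns a randomly chosen User-Agent for each request. */
final class UserAgentRotator {
    // Illustrative values only (assumption); replace with strings from real, current browsers.
    private static final List<String> USER_AGENTS = Collections.unmodifiableList(Arrays.asList(
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
        "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0"
    ));

    private UserAgentRotator() {}

    /** Picks one of the configured strings uniformly at random; safe to call from many threads. */
    static String next() {
        int index = ThreadLocalRandom.current().nextInt(USER_AGENTS.size());
        return USER_AGENTS.get(index);
    }

    /** Exposes the read-only candidate list, mainly for inspection. */
    static List<String> all() {
        return USER_AGENTS;
    }
}
```

With Apache HttpClient, as used in the demo further down, one way to apply it would be a line such as `httpReq.setHeader("User-Agent", UserAgentRotator.next());` inside `setHeaders`.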
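2. Limiting the request rate of each proxy IP
Even when the crawler uses dynamic proxies, poorly implemented multithreading control can let a single proxy IP send a large number of requests within a short time, and the target site then rate-limits that IP.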
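One possible way to keep several threads from hammering the same proxy IP is to enforce a minimum interval between requests per proxy. The following is a minimal sketch under that assumption; `PerProxyThrottle`, `reserveDelay`, and the `"host:port"` key format are names invented for this example.

```java
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper: spaces out requests that go through the same proxy. */
final class PerProxyThrottle {
    private final long minIntervalMillis;
    // For each proxy key, the earliest time (epoch millis) at which the next request may start.
    private final ConcurrentHashMap<String, Long> nextAllowed = new ConcurrentHashMap<>();

    PerProxyThrottle(long minIntervalMillis) {
        this.minIntervalMillis = minIntervalMillis;
    }

    /**
     * Atomically reserves the next free slot for the given proxy and returns
     * how many milliseconds the caller should wait before sending.
     */
    long reserveDelay(String proxyKey, long nowMillis) {
        long[] slot = new long[1];
        nextAllowed.compute(proxyKey, (key, next) -> {
            long start = (next == null) ? nowMillis : Math.max(next, nowMillis);
            slot[0] = start;
            return start + minIntervalMillis;
        });
        return slot[0] - nowMillis;
    }

    /** Blocks the calling thread until its reserved slot for this proxy arrives. */
    void acquire(String proxyKey) throws InterruptedException {
        long delay = reserveDelay(proxyKey, System.currentTimeMillis());
        if (delay > 0) {
            Thread.sleep(delay);
        }
    }
}
```

Each worker thread would call something like `throttle.acquire("203.0.113.7:8080")` just before sending a request through that proxy. Because the reservation happens inside `ConcurrentHashMap.compute`, two threads never receive the same slot for one proxy.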
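3. Managing how long each IP stays valid
Dynamic proxy IPs must be health-checked while they are in use. As soon as a proxy IP shows high latency or very low bandwidth, drop it proactively so that requests do not time out mid-crawl.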
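A health check along these lines could look roughly like the sketch below. It measures only TCP connect latency and checks an expiry timestamp; it does not measure bandwidth. The `ProxyHealth` class, its `Entry` type, and the threshold parameter are assumptions for illustration.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: tracks proxy expiry and latency, and filters out bad proxies. */
final class ProxyHealth {
    /** One proxy endpoint plus the bookkeeping needed to decide whether to keep it. */
    static final class Entry {
        final String host;
        final int port;
        final long expiresAtMillis;   // when the provider stops serving this IP
        long latencyMillis = -1;      // -1 means not probed yet, or the last probe failed

        Entry(String host, int port, long expiresAtMillis) {
            this.host = host;
            this.port = port;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private ProxyHealth() {}

    /** Measures the TCP connect time to the proxy; returns -1 if the connection fails or times out. */
    static long probeLatency(String host, int port, int timeoutMillis) {
        long start = System.nanoTime();
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return (System.nanoTime() - start) / 1_000_000;
        } catch (IOException e) {
            return -1;
        }
    }

    /** A proxy is usable if it has not expired and its last probe was fast enough. */
    static boolean isUsable(Entry entry, long nowMillis, long maxLatencyMillis) {
        return nowMillis < entry.expiresAtMillis
            && entry.latencyMillis >= 0
            && entry.latencyMillis <= maxLatencyMillis;
    }

    /** Returns a new list holding only the proxies that are still worth using. */
    static List<Entry> keepUsable(List<Entry> pool, long nowMillis, long maxLatencyMillis) {
        List<Entry> kept = new ArrayList<>();
        for (Entry entry : pool) {
            if (isUsable(entry, nowMillis, maxLatencyMillis)) {
                kept.add(entry);
            }
        }
        return kept;
    }
}
```

A background task could periodically set `entry.latencyMillis = ProxyHealth.probeLatency(entry.host, entry.port, 3000)` for each entry and then swap in the result of `keepUsable(pool, System.currentTimeMillis(), 300)`. The 3000 ms timeout and 300 ms threshold here are arbitrary example values.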
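If all of that sounds like too much work, consider the enhanced edition of the auto-forwarding crawler proxy. This product forwards each HTTP request through a different, automatically assigned proxy IP and manages the IP pool across threads for you. It claims a request success rate above 99% with latency under 300 ms, so you can start crawling quickly. The product demo below can be copied as-is: set the proxy parameters (proxyHost, proxyPort, proxyUser, proxyPass) and the target site (targetUrl), then run it: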
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.IOException;
import java.net.URI;
import java.util.Arrays;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.apache.http.Header;
import org.apache.http.HeaderElement;
import org.apache.http.HttpHost;
import org.apache.http.auth.AuthScope;
import org.apache.http.auth.UsernamePasswordCredentials;
import org.apache.http.client.AuthCache;
import org.apache.http.client.CredentialsProvider;
import org.apache.http.client.HttpRequestRetryHandler;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.config.AuthSchemes;
import org.apache.http.client.entity.GzipDecompressingEntity;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.client.methods.HttpRequestBase;
import org.apache.http.client.protocol.HttpClientContext;
import org.apache.http.config.Registry;
import org.apache.http.config.RegistryBuilder;
import org.apache.http.conn.socket.ConnectionSocketFactory;
import org.apache.http.conn.socket.LayeredConnectionSocketFactory;
import org.apache.http.conn.socket.PlainConnectionSocketFactory;
import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
import org.apache.http.impl.auth.BasicScheme;
import org.apache.http.impl.client.BasicAuthCache;
import org.apache.http.impl.client.BasicCredentialsProvider;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.client.ProxyAuthenticationStrategy;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.message.BasicHeader;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.NameValuePair;
import org.apache.http.util.EntityUtils;
public class Demo
{
    // Proxy server (product website: www.16yun.cn)
    final static String proxyHost = "t.16yun.cn";
    final static Integer proxyPort = 31000;

    // Proxy credentials
    final static String proxyUser = "username";
    final static String proxyPass = "password";

    private static PoolingHttpClientConnectionManager cm = null;
    private static HttpRequestRetryHandler httpRequestRetryHandler = null;
    private static HttpHost proxy = null;
    private static CredentialsProvider credsProvider = null;
    private static RequestConfig reqConfig = null;

    static {
        // Register socket factories for plain HTTP and for HTTPS
        ConnectionSocketFactory plainsf = PlainConnectionSocketFactory.getSocketFactory();
        LayeredConnectionSocketFactory sslsf = SSLConnectionSocketFactory.getSocketFactory();
        Registry<ConnectionSocketFactory> registry = RegistryBuilder.<ConnectionSocketFactory>create()
            .register("http", plainsf)
            .register("https", sslsf)
            .build();

        // Shared connection pool: at most 20 connections in total, 5 per route
        cm = new PoolingHttpClientConnectionManager(registry);
        cm.setMaxTotal(20);
        cm.setDefaultMaxPerRoute(5);

        proxy = new HttpHost(proxyHost, proxyPort, "http");

        // Credentials used to authenticate against the proxy
        credsProvider = new BasicCredentialsProvider();
        credsProvider.setCredentials(AuthScope.ANY, new UsernamePasswordCredentials(proxyUser, proxyPass));

        // 5-second timeouts, with every request routed through the proxy
        reqConfig = RequestConfig.custom()
            .setConnectionRequestTimeout(5000)
            .setConnectTimeout(5000)
            .setSocketTimeout(5000)
            .setExpectContinueEnabled(false)
            .setProxy(proxy)
            .build();
    }

    public static void doRequest(HttpRequestBase httpReq) {
        CloseableHttpResponse httpResp = null;

        try {
            setHeaders(httpReq);
            httpReq.setConfig(reqConfig);

            CloseableHttpClient httpClient = HttpClients.custom()
                .setConnectionManager(cm)
                .setDefaultCredentialsProvider(credsProvider)
                .build();

            // Pre-register Basic auth for the proxy in the auth cache
            AuthCache authCache = new BasicAuthCache();
            authCache.put(proxy, new BasicScheme());

            HttpClientContext localContext = HttpClientContext.create();
            localContext.setAuthCache(authCache);

            httpResp = httpClient.execute(httpReq, localContext);

            // Print the status code, then the response body line by line
            int statusCode = httpResp.getStatusLine().getStatusCode();
            System.out.println(statusCode);

            BufferedReader rd = new BufferedReader(new InputStreamReader(httpResp.getEntity().getContent()));
            String line = "";
            while((line = rd.readLine()) != null) {
                System.out.println(line);
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                if (httpResp != null) {
                    httpResp.close();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    /**
     * Sets the request headers.
     *
     * @param httpReq the request to modify
     */
    private static void setHeaders(HttpRequestBase httpReq) {
        // Set Proxy-Tunnel
        // Random random = new Random();
        // int tunnel = random.nextInt(10000);
        // httpReq.setHeader("Proxy-Tunnel", String.valueOf(tunnel));
        httpReq.setHeader("Accept-Encoding", null);
    }

    public static void doGetRequest() {
        // Target page to request
        String targetUrl = "https://httpbin.org/ip";

        try {
            HttpGet httpGet = new HttpGet(targetUrl);
            doRequest(httpGet);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        doGetRequest();
    }
}
This work is released under a CC license. Reposts must credit the author and link to this article.