使用爬蟲代理採集網站失敗的解決方法

jackcoden發表於2021-07-09

爬蟲程式採集網站必須使用動態代理,才能避免出現網站訪問頻繁的限制,這是眾所周知的。但是在具體採集網站的過程中,即使使用了動態代理依然會出現403、503或429的反爬錯誤,這是為什麼呢?根據以往的經驗,一般來說是因為以下幾個原因造成的:

1、動態User-Agent的修改

爬蟲程式採集網站,正常的HTTP請求都需要進行ua(User-Agent)優化,因為ua是瀏覽器標識,如果http請求沒有ua,甚至有些爬蟲程式主動標示為採集,那麼目標網站拒絕採集的可能性很高

2、控制單個代理IP的請求頻率

雖然爬蟲程式使用了動態代理,但是如果程式的多執行緒控制實現不好,會導致單個代理IP在短時間內發出大量的請求,導致該IP被訪問頻繁

3、IP有效時間的管理

動態代理IP使用過程中,必須進行存活檢查,一旦發現延遲較高、頻寬很低的代理IP,應該主動丟棄,避免使用過程中出現超時的情況

如果覺得上面的工作太麻煩,推薦使用自動轉發的爬蟲代理加強版,這種產品能實現每個http請求自動分配不同的代理IP轉發,同時進行IP池的自動多執行緒管理,確保了請求聯通率99%以上同時延遲低於300ms,可以快速上手採集網站,下面是產品demo可以直接複製使用,配置代理引數(proxyHost、proxyPort、proxyUser、proxyPass)和目標網站(targetUrl)就可以Run:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.IOException;
import java.net.URI;
import java.util.Arrays;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.http.Header;
import org.apache.http.HeaderElement;
import org.apache.http.HttpHost;
import org.apache.http.auth.AuthScope;
import org.apache.http.auth.UsernamePasswordCredentials;
import org.apache.http.client.AuthCache;
import org.apache.http.client.CredentialsProvider;
import org.apache.http.client.HttpRequestRetryHandler;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.config.AuthSchemes;
import org.apache.http.client.entity.GzipDecompressingEntity;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.client.methods.HttpRequestBase;
import org.apache.http.client.protocol.HttpClientContext;
import org.apache.http.config.Registry;
import org.apache.http.config.RegistryBuilder;
import org.apache.http.conn.socket.ConnectionSocketFactory;
import org.apache.http.conn.socket.LayeredConnectionSocketFactory;
import org.apache.http.conn.socket.PlainConnectionSocketFactory;
import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
import org.apache.http.impl.auth.BasicScheme;
import org.apache.http.impl.client.BasicAuthCache;
import org.apache.http.impl.client.BasicCredentialsProvider;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.client.ProxyAuthenticationStrategy;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.message.BasicHeader;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.NameValuePair;
import org.apache.http.util.EntityUtils;

public class Demo
{
    // 代理伺服器(產品官網 www.16yun.cn)
    final static String proxyHost = "t.16yun.cn";
    final static Integer proxyPort = 31000;

    // 代理驗證資訊
    final static String proxyUser = "username";
    final static String proxyPass = "password";


    private static PoolingHttpClientConnectionManager cm = null;
    private static HttpRequestRetryHandler httpRequestRetryHandler = null;
    private static HttpHost proxy = null;

    private static CredentialsProvider credsProvider = null;
    private static RequestConfig reqConfig = null;

    static {
        ConnectionSocketFactory plainsf = PlainConnectionSocketFactory.getSocketFactory();
        LayeredConnectionSocketFactory sslsf = SSLConnectionSocketFactory.getSocketFactory();

        Registry registry = RegistryBuilder.create()
            .register("http", plainsf)
            .register("https", sslsf)
            .build();

        cm = new PoolingHttpClientConnectionManager(registry);
        cm.setMaxTotal(20);
        cm.setDefaultMaxPerRoute(5);

        proxy = new HttpHost(proxyHost, proxyPort, "http");

        credsProvider = new BasicCredentialsProvider();
        credsProvider.setCredentials(AuthScope.ANY, new UsernamePasswordCredentials(proxyUser, proxyPass));

        reqConfig = RequestConfig.custom()
            .setConnectionRequestTimeout(5000)
            .setConnectTimeout(5000)
            .setSocketTimeout(5000)
            .setExpectContinueEnabled(false)
            .setProxy(new HttpHost(proxyHost, proxyPort))
            .build();
    }

    public static void doRequest(HttpRequestBase httpReq) {
        CloseableHttpResponse httpResp = null;

        try {
            setHeaders(httpReq);

            httpReq.setConfig(reqConfig);

            CloseableHttpClient httpClient = HttpClients.custom()
                .setConnectionManager(cm)
                .setDefaultCredentialsProvider(credsProvider)
                .build();

            AuthCache authCache = new BasicAuthCache();
            authCache.put(proxy, new BasicScheme());

            HttpClientContext localContext = HttpClientContext.create();
            localContext.setAuthCache(authCache);

            httpResp = httpClient.execute(httpReq, localContext);

            int statusCode = httpResp.getStatusLine().getStatusCode();

            System.out.println(statusCode);

            BufferedReader rd = new BufferedReader(new InputStreamReader(httpResp.getEntity().getContent()));

            String line = "";
            while((line = rd.readLine()) != null) {
                System.out.println(line);
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                if (httpResp != null) {
                    httpResp.close();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    /**
     * 設定請求頭
     *
     * @param httpReq
     */
    private static void setHeaders(HttpRequestBase httpReq) {

        // 設定Proxy-Tunnel
        // Random random = new Random();
        // int tunnel = random.nextInt(10000);
        // httpReq.setHeader("Proxy-Tunnel", String.valueOf(tunnel));

        httpReq.setHeader("Accept-Encoding", null);

    }

    public static void doGetRequest() {
        // 要訪問的目標頁面
        String targetUrl = "https://httpbin.org/ip";


        try {
            HttpGet httpGet = new HttpGet(targetUrl);

            doRequest(httpGet);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        doGetRequest();

    }
}
本作品採用《CC 協議》,轉載必須註明作者和本文連結

相關文章