A Brief Analysis of the WebMagic Crawler Framework

Posted by 升級之路 on 2019-02-13

Original URL: https://juejin.im/post/5c63fcc4e51d45014f61eb72

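Introduction

I first came across WebMagic a long time ago for a crawling task, but never went beyond basic usage. Recently a project at my company also needed a crawler, along with some custom development on top of the crawler framework, so I took the opportunity to study WebMagic's design and implementation in depth.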

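Overview

WebMagic is a Java crawler framework open-sourced by 黃億華, a well-known developer in China. Its architecture is modeled on Scrapy, with the goal of being as modular as possible while reflecting the functional characteristics of a crawler. WebMagic is structured around four components, Downloader, PageProcessor, Scheduler, and Pipeline, which the Spider ties together. These four components correspond to the downloading, processing, management, and persistence stages of the crawler lifecycle.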
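To make the division of labor concrete, here is a minimal single-threaded sketch of how a spider loop could connect the four roles. Every name in it (MiniSpider, MiniDownloader, MiniProcessor, MiniPipeline) is a hypothetical stand-in written for this illustration, not WebMagic's real API, and the "site" is an in-memory map rather than real HTTP.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Hypothetical, simplified stand-ins for the four components (not WebMagic's API).
class MiniSpider {
    interface MiniDownloader { String download(String url); }
    interface MiniProcessor {
        // Appends extracted items to 'items' and returns newly discovered URLs.
        List<String> process(String url, String body, List<String> items);
    }
    interface MiniPipeline { void persist(String url, List<String> items); }

    // The queue and the 'seen' set together play the Scheduler role.
    private final Queue<String> queue = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();
    private final MiniDownloader downloader;
    private final MiniProcessor processor;
    private final MiniPipeline pipeline;

    MiniSpider(MiniDownloader downloader, MiniProcessor processor, MiniPipeline pipeline) {
        this.downloader = downloader;
        this.processor = processor;
        this.pipeline = pipeline;
    }

    private void push(String url) {
        if (seen.add(url)) {          // skip URLs that were already scheduled
            queue.add(url);
        }
    }

    void run(String startUrl) {
        push(startUrl);
        String url;
        while ((url = queue.poll()) != null) {
            String body = downloader.download(url);              // download
            List<String> items = new ArrayList<>();
            for (String next : processor.process(url, body, items)) {
                push(next);                                        // schedule new links
            }
            pipeline.persist(url, items);                         // persist
        }
    }

    // Crawls a tiny in-memory "site" and returns the URLs in the order they were persisted.
    static List<String> demo() {
        Map<String, List<String>> links = new HashMap<>();
        links.put("a", Arrays.asList("b", "c", "a"));
        links.put("b", Arrays.asList("c"));
        List<String> visited = new ArrayList<>();
        MiniSpider spider = new MiniSpider(
                url -> String.join(",", links.getOrDefault(url, Collections.<String>emptyList())),
                (url, body, items) -> {
                    items.add(url + " -> " + body);
                    return body.isEmpty()
                            ? Collections.<String>emptyList()
                            : Arrays.asList(body.split(","));
                },
                (url, items) -> visited.add(url));
        spider.run("a");
        return visited;
    }
}
```

Calling MiniSpider.demo() visits a, b, and c once each, even though a links back to itself and both a and b link to c. The real framework runs this loop on a thread pool; the sketch leaves that out to keep the data flow visible.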

Scheduler

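Scheduler is WebMagic's URL scheduler. It collects (push) the URLs the Spider wants crawled (a Page's targetRequests) and hands (poll) the next URL to be processed back to the Spider. It also decides whether a URL is being retried after an error, removes duplicates, and keeps track of the total and remaining page counts. The main Scheduler implementations are DuplicateRemovedScheduler, PriorityScheduler, and QueueScheduler; the extension package adds RedisScheduler and FileCacheQueueScheduler. There are many implementations, but they all work on much the same principle, so the analysis here uses QueueScheduler, WebMagic's default.

The Scheduler interface defines the most basic Scheduler operations: adding a request and getting a request.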

public interface Scheduler {

    /**
     * add a url to fetch
     *
     * @param request request
     * @param task task
     */
    public void push(Request request, Task task);

    /**
     * get an url to crawl
     *
     * @param task the task of spider
     * @return the url to crawl
     */
    public Request poll(Task task);

}

The MonitorableScheduler interface defines methods for getting the number of remaining requests and the total number of requests.

public interface MonitorableScheduler extends Scheduler {

    public int getLeftRequestsCount(Task task);

    public int getTotalRequestsCount(Task task);

}

The abstract class DuplicateRemovedScheduler implements a generic push template method, which handles the error-retry check and duplicate removal internally.

public abstract class DuplicateRemovedScheduler implements Scheduler {

    protected Logger logger = LoggerFactory.getLogger(getClass());
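    // De-duplication strategy. The key part of HashSetDuplicateRemover is
    // private Set<String> urls = Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());
    // a Set backed by a ConcurrentHashMap provides both de-duplication and thread safety.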
    private DuplicateRemover duplicatedRemover = new HashSetDuplicateRemover();

    public DuplicateRemover getDuplicateRemover() {
        return duplicatedRemover;
    }

    public DuplicateRemovedScheduler setDuplicateRemover(DuplicateRemover duplicatedRemover) {
        this.duplicatedRemover = duplicatedRemover;
        return this;
    }

    // Generic push template method
    @Override
    public void push(Request request, Task task) {
        logger.trace("get a candidate url {}", request.getUrl());
        if (shouldReserved(request) || noNeedToRemoveDuplicate(request) || !duplicatedRemover.isDuplicate(request, task)) {
            logger.debug("push to queue {}", request.getUrl());
            pushWhenNoDuplicate(request, task);
        }
    }

    // Requests marked for cycle retry skip duplicate removal
    protected boolean shouldReserved(Request request) {
        return request.getExtra(Request.CYCLE_TRIED_TIMES) != null;
    }
    // POST requests skip duplicate removal because POST is not idempotent.
    // They are never added to the Set, so they are not counted in the request statistics either.
    protected boolean noNeedToRemoveDuplicate(Request request) {
        return HttpConstant.Method.POST.equalsIgnoreCase(request.getMethod());
    }

    protected void pushWhenNoDuplicate(Request request, Task task) {

    }
}
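The push decision above can be reproduced with nothing but the JDK. The following is a minimal sketch, assuming URLs are plain strings and using a boolean flag in place of the CYCLE_TRIED_TIMES extra; the class and method names (DedupPushSketch, shouldPush, totalSeen) are hypothetical.

```java
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative stand-in for the push() decision; names are hypothetical.
class DedupPushSketch {
    // A concurrent Set view backed by a ConcurrentHashMap.
    private final Set<String> urls =
            Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());

    // Set.add returns false when the element is already present, so one
    // atomic call both records the URL and reports whether it is a duplicate.
    boolean isDuplicate(String url) {
        return !urls.add(url);
    }

    // Same short-circuit order as push(): retries and POSTs are accepted
    // without ever touching the Set.
    boolean shouldPush(String method, boolean isRetry, String url) {
        return isRetry || "POST".equalsIgnoreCase(method) || !isDuplicate(url);
    }

    int totalSeen() {
        return urls.size();
    }

    public static void main(String[] args) {
        DedupPushSketch s = new DedupPushSketch();
        System.out.println(s.shouldPush("GET", false, "https://example.com/a"));  // true
        System.out.println(s.shouldPush("GET", false, "https://example.com/a"));  // false
        System.out.println(s.shouldPush("POST", false, "https://example.com/a")); // true
        System.out.println(s.totalSeen());                                        // 1
    }
}
```

Because of the short-circuit evaluation, a POST or a retried request never reaches the Set, which is why such requests do not show up in the total count.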

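QueueScheduler's implementation is simple: it maintains a LinkedBlockingQueue. The remaining request count is the queue's size, and the total request count is the size of the Set maintained by HashSetDuplicateRemover.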

@ThreadSafe
public class QueueScheduler extends DuplicateRemovedScheduler implements MonitorableScheduler {
    private BlockingQueue<Request> queue = new LinkedBlockingQueue<>();

    public QueueScheduler() {
    }

    public void pushWhenNoDuplicate(Request request, Task task) {
        this.queue.add(request);
    }

    public Request poll(Task task) {
        return this.queue.poll();
    }

    public int getLeftRequestsCount(Task task) {
        return this.queue.size();
    }

    public int getTotalRequestsCount(Task task) {
        return this.getDuplicateRemover().getTotalRequestsCount(task);
    }
}
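The difference between the two counters is easy to see in a small JDK-only sketch. It assumes string URLs and folds the de-duplication Set into the same class; QueueCountsSketch and its method names are hypothetical.

```java
import java.util.Collections;
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical miniature of a queue-backed scheduler with monitoring counters.
class QueueCountsSketch {
    private final Queue<String> queue = new LinkedBlockingQueue<>();
    private final Set<String> seen =
            Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());

    void push(String url) {
        if (seen.add(url)) {      // only URLs not seen before are queued
            queue.add(url);
        }
    }

    String poll() {
        return queue.poll();      // non-blocking: returns null when the queue is empty
    }

    int left() {
        return queue.size();      // requests still waiting
    }

    int total() {
        return seen.size();       // every distinct URL ever accepted
    }
}
```

After pushing a, b, and a again and then polling once, left() is 1 while total() stays at 2: polling shrinks the queue but never the Set.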

Downloader

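Downloader is the component that requests a URL and obtains the response (HTML, JSON, JSONP, and so on). It also handles POST redirects, HTTPS verification, IP proxies, failure-retry decisions, and the like.

The Downloader interface defines methods for downloading and for setting the thread count.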

public interface Downloader {

    /**
     * Downloads web pages and store in Page object.
     *
     * @param request request
     * @param task task
     * @return page
     */
    public Page download(Request request, Task task);

    /**
     * Tell the downloader how many threads the spider used.
     * @param threadNum number of threads
     */
    public void setThread(int threadNum);
}

The abstract class AbstractDownloader provides higher-level convenience implementations of download and defines the success and failure callback methods.

public abstract class AbstractDownloader implements Downloader {

    /**
     * A simple method to download a url.
     *
     * @param url url
     * @return html
     */
    public Html download(String url) {
        return download(url, null);
    }

    /**
     * A simple method to download a url.
     *
     * @param url url
     * @param charset charset
     * @return html
     */
    public Html download(String url, String charset) {
        Page page = download(new Request(url), Site.me().setCharset(charset).toTask());
        return (Html) page.getHtml();
    }

    protected void onSuccess(Request request) {
    }

    protected void onError(Request request) {
    }

}

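HttpClientDownloader is WebMagic's default Downloader implementation. Its main job is to build an HttpClient instance from the configuration, issue the request, wrap the request and result in a Page object, and invoke the corresponding callback.

It gets the domain from the Site and checks whether the httpClients map already holds an HttpClient instance for that domain. If so, the instance is reused; otherwise httpClientGenerator creates a new one, which is added to the httpClients map and returned. Note that thread safety here relies on double-checked locking.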

private CloseableHttpClient getHttpClient(Site site) {
    if (site == null) {
        return httpClientGenerator.getClient(null);
    }
    String domain = site.getDomain();
    CloseableHttpClient httpClient = httpClients.get(domain);
    if (httpClient == null) {
        synchronized (this) {
            httpClient = httpClients.get(domain);
            if (httpClient == null) {
                httpClient = httpClientGenerator.getClient(site);
                httpClients.put(domain, httpClient);
            }
        }
    }
    return httpClient;
}
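The same check, lock, check-again pattern can be isolated in a small generic cache. This is a sketch under assumptions: the map is a ConcurrentHashMap so that the unlocked first read is safe, and the factory is an arbitrary Function; PerDomainCacheSketch is a hypothetical name.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical per-domain instance cache using double-checked locking.
class PerDomainCacheSketch<V> {
    // Assumption: a concurrent map, so reading it outside the lock is safe.
    private final Map<String, V> instances = new ConcurrentHashMap<>();
    private final Function<String, V> factory;

    PerDomainCacheSketch(Function<String, V> factory) {
        this.factory = factory;
    }

    V get(String domain) {
        V value = instances.get(domain);           // first check, without the lock
        if (value == null) {
            synchronized (this) {
                value = instances.get(domain);     // second check, under the lock
                if (value == null) {
                    value = factory.apply(domain); // created at most once per domain
                    instances.put(domain, value);
                }
            }
        }
        return value;
    }
}
```

The first check keeps the common case (instance already cached) lock-free; the second check prevents two threads that both saw null from each creating an instance. On Java 8 and later, ConcurrentHashMap.computeIfAbsent expresses the same idea in a single call.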

In WebMagic, threadNum is both the thread pool's thread count and the HttpClient ConnectionManager's connection count; what is being set here is the connection count.

@Override
public void setThread(int thread) {
    httpClientGenerator.setPoolSize(thread);
}

public HttpClientGenerator setPoolSize(int poolSize) {
    connectionManager.setMaxTotal(poolSize);
    return this;
}

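HttpClientDownloader uses an explicitly configured charset first (the Request's charset, then the Site's, as the download() method further down shows). If none is set, it detects the encoding automatically. It first checks whether httpResponse.getEntity().getContentType().getValue() contains something like charset=utf-8. Otherwise it parses the content with Jsoup, extracts the meta tags, and reads the encoding from either the HTML4 or the HTML5 form. If detection still fails, it falls back to the platform default charset and logs a warning.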

private String getHtmlCharset(String contentType, byte[] contentBytes) throws IOException {
    String charset = CharsetUtils.detectCharset(contentType, contentBytes);
    if (charset == null) {
        charset = Charset.defaultCharset().name();
        logger.warn("Charset autodetect failed, use {} as charset. Please specify charset in Site.setCharset()", Charset.defaultCharset());
    }
    return charset;
}

public static String detectCharset(String contentType, byte[] contentBytes) throws IOException {
    String charset;
    // charset
    // 1、encoding in http header Content-Type
    charset = UrlUtils.getCharset(contentType);
    if (StringUtils.isNotBlank(contentType) && StringUtils.isNotBlank(charset)) {
        logger.debug("Auto get charset: {}", charset);
        return charset;
    }
    // use default charset to decode first time
    Charset defaultCharset = Charset.defaultCharset();
    String content = new String(contentBytes, defaultCharset);
    // 2、charset in meta
    if (StringUtils.isNotEmpty(content)) {
        Document document = Jsoup.parse(content);
        Elements links = document.select("meta");
        for (Element link : links) {
            // 2.1、html4.01 <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
            String metaContent = link.attr("content");
            String metaCharset = link.attr("charset");
            if (metaContent.indexOf("charset") != -1) {
                metaContent = metaContent.substring(metaContent.indexOf("charset"), metaContent.length());
                charset = metaContent.split("=")[1];
                break;
            }
            // 2.2、html5 <meta charset="UTF-8" />
            else if (StringUtils.isNotEmpty(metaCharset)) {
                charset = metaCharset;
                break;
            }
        }
    }
    logger.debug("Auto get charset: {}", charset);
    // 3、todo use tools as cpdetector for content decode
    return charset;
}
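The two-step lookup (HTTP header first, then meta tags) can be approximated without Jsoup. The sketch below swaps Jsoup for regular expressions, which is cruder than real HTML parsing, and decodes the bytes as ISO-8859-1 instead of the platform default so that the result does not depend on the machine. CharsetSniffSketch and both patterns are assumptions made for this illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, regex-based approximation of header-then-meta charset detection.
class CharsetSniffSketch {
    // Matches charset=utf-8, charset="UTF-8", charset='gbk', and similar.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([\\w\\-]+)", Pattern.CASE_INSENSITIVE);
    private static final Pattern META_TAG =
            Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Step 1: look for a charset parameter in a Content-Type style string.
    static String fromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    // Step 2: if the header has no charset, scan the <meta> tags of the body.
    static String detect(String contentType, byte[] body) {
        String charset = fromContentType(contentType);
        if (charset != null) {
            return charset;
        }
        // ISO-8859-1 maps every byte to one char, so ASCII markup survives intact.
        String html = new String(body, StandardCharsets.ISO_8859_1);
        Matcher meta = META_TAG.matcher(html);
        while (meta.find()) {
            // Covers both the HTML4 form (content="text/html; charset=...")
            // and the HTML5 form (<meta charset="...">).
            charset = fromContentType(meta.group());
            if (charset != null) {
                return charset;
            }
        }
        return null; // the caller decides on a fallback
    }
}
```

For example, detect("text/html", body) returns GBK when the body contains a meta tag with charset="GBK", and null when neither the header nor the markup names an encoding.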

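The download() method is ordinary HttpClient request code. handleResponse() wraps the request and result in a Page object, the corresponding callback is then invoked, and finally the HttpClient connection and the proxy are released.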

@Override
public Page download(Request request, Task task) {
    if (task == null || task.getSite() == null) {
        throw new NullPointerException("task or site can not be null");
    }
    CloseableHttpResponse httpResponse = null;
    CloseableHttpClient httpClient = getHttpClient(task.getSite());
    Proxy proxy = proxyProvider != null ? proxyProvider.getProxy(task) : null;
    HttpClientRequestContext requestContext = httpUriRequestConverter.convert(request, task.getSite(), proxy);
    Page page = Page.fail();
    try {
        httpResponse = httpClient.execute(requestContext.getHttpUriRequest(), requestContext.getHttpClientContext());
        page = handleResponse(request, request.getCharset() != null ? request.getCharset() : task.getSite().getCharset(), httpResponse, task);
        onSuccess(request);
        logger.info("downloading page success {}", request.getUrl());
        return page;
    } catch (IOException e) {
        logger.warn("download page {} error", request.getUrl(), e);
        onError(request);
        return page;
    } finally {
        if (httpResponse != null) {
            //ensure the connection is released back to pool
            EntityUtils.consumeQuietly(httpResponse.getEntity());
        }
        if (proxyProvider != null && proxy != null) {
            proxyProvider.returnProxy(proxy, page, task);
        }
    }
}
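The control flow of download() (a failure placeholder, a callback on each branch, cleanup in finally) can be traced in a JDK-only sketch. The Fetcher interface, the "FAILED" placeholder string, and the events list are all hypothetical stand-ins for the HTTP call, Page.fail(), and the real callbacks.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical trace of the try / catch / finally flow in a download method.
class DownloadFlowSketch {
    // Stand-in for the HTTP call; may fail with IOException.
    interface Fetcher {
        String fetch(String url) throws IOException;
    }

    final List<String> events = new ArrayList<>();

    String download(String url, Fetcher fetcher) {
        String page = "FAILED";            // plays the role of Page.fail()
        try {
            page = fetcher.fetch(url);
            events.add("onSuccess");
            return page;
        } catch (IOException e) {
            events.add("onError");
            return page;                   // still the failure placeholder
        } finally {
            // Runs after either return: the point where the connection
            // and the proxy would be handed back.
            events.add("release");
        }
    }
}
```

Whether the fetch succeeds or throws, "release" is always the last recorded event, which is the property the finally block in download() is there to guarantee.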

PageProcessor

The PageProcessor interface defines process(), which analyzes a page, and getSite(), which supplies the HttpClient request configuration.

public interface PageProcessor {

    /**
     * process the page, extract urls to fetch, extract the data and store
     *
     * @param page page
     */
    public void process(Page page);

    /**
     * get the site settings
     *
     * @return site
     * @see Site
     */
    public Site getSite();
}

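Page analysis here mainly means analyzing HTML pages, arguably the most complex part of a vertical crawler. Selector is a standalone module that WebMagic developed to simplify page extraction. It integrates CSS selectors, XPath, and regular expressions, and supports chained extraction, which makes powerful extraction logic easy to implement.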

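Interfaces:

Selector: defines methods that select a single element or multiple elements from a string.
ElementSelector: defines methods that select a single element or multiple elements from a Jsoup Element. Mainly used by the CSS and XPath selectors.

Abstract class:

BaseElementSelector implements the two interfaces above and serves as the base class for the CSS and XPath selectors. It implements the interface methods as templates and declares several element-selection methods for subclasses to implement.

Implementation classes:

CssSelector: the CSS selector implementation, extending BaseElementSelector. It is built on Jsoup's CSS selection API.
XpathSelector: the XPath selector implementation, extending BaseElementSelector. It is built on Xsoup, the author's own Jsoup-based XPath library.
RegexSelector: the regular-expression selector implementation; it implements only the Selector interface.

Source code:
The source is not analyzed here; it is essentially a thin wrapper around the standard Java regex API and the Jsoup API.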
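As a rough illustration of the select-one / select-many contract and of chaining, here is a JDK-only sketch of a regex selector. MiniSelector, MiniRegexSelector, and SelectorChainDemo are hypothetical names; the real module also covers CSS and XPath through Jsoup and Xsoup, which this sketch does not attempt.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical miniature of a string-based selector contract.
interface MiniSelector {
    String select(String text);            // first match, or null
    List<String> selectList(String text);  // all matches, in order
}

// Regex-backed implementation that returns one capture group per match.
class MiniRegexSelector implements MiniSelector {
    private final Pattern pattern;
    private final int group;

    MiniRegexSelector(String regex, int group) {
        this.pattern = Pattern.compile(regex);
        this.group = group;
    }

    @Override
    public String select(String text) {
        Matcher m = pattern.matcher(text);
        return m.find() ? m.group(group) : null;
    }

    @Override
    public List<String> selectList(String text) {
        List<String> results = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            results.add(m.group(group));
        }
        return results;
    }
}

class SelectorChainDemo {
    // Chains two selectors: first every href value, then the post id inside each.
    static List<String> postIds(String html) {
        MiniSelector hrefs = new MiniRegexSelector("href=\"([^\"]+)\"", 1);
        MiniSelector postId = new MiniRegexSelector("/post/(\\d+)", 1);
        List<String> ids = new ArrayList<>();
        for (String href : hrefs.selectList(html)) {
            String id = postId.select(href);
            if (id != null) {
                ids.add(id);
            }
        }
        return ids;
    }
}
```

Each selector takes text in and gives text out, so the output of one can feed the next; that is the property that makes chained extraction possible.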

Pipeline

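Pipeline is a part that is easily overlooked. Everyone knows persistence matters, yet many frameworks, crawler4j for example, persist the data directly during page extraction. The real benefit of Pipeline is that it separates online page analysis from offline processing: downloading can run on some threads while processing and persistence run on others.

The Pipeline interface is simple, with a single process() method whose parameters are the PageProcessor's extraction results and the task object. The main implementations include ConsolePipeline, FilePipeline, and ResultItemsCollectorPipeline, which format the extraction results and write them to the console, to a file, or into an in-memory collection. The source is simple, so it is not analyzed in detail.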

public interface Pipeline {

    /**
     * Process extracted results.
     *
     * @param resultItems resultItems
     * @param task task
     */
    public void process(ResultItems resultItems, Task task);
}

@ThreadSafe
public class FilePipeline extends FilePersistentBase implements Pipeline {

    private Logger logger = LoggerFactory.getLogger(getClass());

    /**
     * create a FilePipeline with default path"/data/webmagic/"
     */
    public FilePipeline() {
        setPath("/data/webmagic/");
    }

    public FilePipeline(String path) {
        setPath(path);
    }

    @Override
    public void process(ResultItems resultItems, Task task) {
        String path = this.path + PATH_SEPERATOR + task.getUUID() + PATH_SEPERATOR;
        try {
            PrintWriter printWriter = new PrintWriter(new OutputStreamWriter(new FileOutputStream(getFile(path + DigestUtils.md5Hex(resultItems.getRequest().getUrl()) + ".html")),"UTF-8"));
            printWriter.println("url:\t" + resultItems.getRequest().getUrl());
            for (Map.Entry<String, Object> entry : resultItems.getAll().entrySet()) {
                if (entry.getValue() instanceof Iterable) {
                    Iterable value = (Iterable) entry.getValue();
                    printWriter.println(entry.getKey() + ":");
                    for (Object o : value) {
                        printWriter.println(o);
                    }
                } else {
                    printWriter.println(entry.getKey() + ":\t" + entry.getValue());
                }
            }
            printWriter.close();
        } catch (IOException e) {
            logger.warn("write file error", e);
        }
    }
}
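The file layout FilePipeline produces (one file per URL, named after the MD5 of the URL, with one key per line) can be reproduced with the JDK alone. The sketch below is an assumption-laden miniature: FileSinkSketch is a hypothetical name, MessageDigest replaces DigestUtils, and try-with-resources is used so the writer is closed even if a write fails, which differs from the code above.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;

// Hypothetical JDK-only miniature of a file-writing pipeline.
class FileSinkSketch {
    // Lower-case hex MD5 of a string, used as a stable file name for a URL.
    static String md5Hex(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    // Writes the URL and the extracted fields to <dir>/<md5(url)>.html.
    static Path write(Path dir, String url, Map<String, Object> fields) throws IOException {
        Files.createDirectories(dir);
        Path file = dir.resolve(md5Hex(url) + ".html");
        try (PrintWriter out = new PrintWriter(
                Files.newBufferedWriter(file, StandardCharsets.UTF_8))) {
            out.println("url:\t" + url);
            for (Map.Entry<String, Object> entry : fields.entrySet()) {
                if (entry.getValue() instanceof Iterable) {
                    // Multi-valued fields: key on its own line, then one value per line.
                    out.println(entry.getKey() + ":");
                    for (Object o : (Iterable<?>) entry.getValue()) {
                        out.println(o);
                    }
                } else {
                    out.println(entry.getKey() + ":\t" + entry.getValue());
                }
            }
        }
        return file;
    }
}
```

Hashing the URL gives every page a fixed-length, filesystem-safe file name, at the cost of file names that are not human-readable; the first line of each file records the original URL to compensate.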

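Conclusion

WebMagic consists of four components (Downloader, PageProcessor, Scheduler, Pipeline). The core code is very simple; it mainly wires these components together and runs the work on multiple threads. Nearly any aspect of the crawler's behavior can be customized in WebMagic.

@123lxw123. Copyright of this article belongs to the 再惠 R&D team. Reposting is welcome, provided the source is credited.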
