Java Crawler Learning — Example: Fetching Novels from Qidian (起點中文網) and Saving Them as txt Files

Posted by Enemy丶 on 2020-09-30

Goal: use HttpClient to fetch page data, then parse it with Jsoup to extract the data we want.

Targets: book title, author, synopsis, and the book file itself (a txt file).

Example project: https://github.com/yeahmahao/-ReptilesBook

How to use HttpClient: https://blog.csdn.net/baidu_38688646/article/details/108883222

HttpClient connection pool: https://blog.csdn.net/baidu_38688646/article/details/108883458

How to use Jsoup: https://blog.csdn.net/baidu_38688646/article/details/108883606

Project structure:

1. HttpUtils — a utility class wrapping HttpClient

package cn.project.jd.util;

import org.apache.http.HttpResponse;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.util.EntityUtils;
import org.springframework.stereotype.Component;

import java.io.IOException;

@Component
public class HttpUtils {
    // Connection pool manager
    private PoolingHttpClientConnectionManager cm;

    public HttpUtils() {
        this.cm = new PoolingHttpClientConnectionManager();
        // Maximum total connections (pool size)
        cm.setMaxTotal(100);
        // Maximum connections per route (per host)
        cm.setDefaultMaxPerRoute(10);
    }

    /**
     * Fetch page data with a GET request.
     * @param url the page URL
     * @return the page HTML, or an empty string on failure
     */
    public String doGetHtml(String url){
        // Get an HttpClient backed by the connection pool
        CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(this.cm).build();
        // Create the HttpGet request for the given URL
        HttpGet httpGet = new HttpGet(url);
        httpGet.setHeader("Accept","text/html,application/xhtml+xml,application/xml;q=0.9," +
                "image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9");
        // Only advertise encodings HttpClient 4.x can decode itself (it does not support Brotli "br")
        httpGet.setHeader("accept-encoding","gzip, deflate");
        httpGet.setHeader("accept-language","zh-CN,zh;q=0.9");
        // Fill in if the site requires login cookies
        httpGet.setHeader("cookie","");
        httpGet.setHeader("user-agent","Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 " +
                "(KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36");
        // Apply the request configuration (timeouts)
        httpGet.setConfig(this.getConfig());
        CloseableHttpResponse response = null;
        try {
            // Execute the request and obtain the response
            response = httpClient.execute(httpGet);

            // Parse the response and return the result
            if (response.getStatusLine().getStatusCode() == 200){
                // EntityUtils may only be used if the response entity is not null
                if (response.getEntity() != null){
                    String con = EntityUtils.toString(response.getEntity(),"utf8");
                    return con;
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // Close the response
            if (response != null){
                try {
                    response.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
        return "";
    }

    private RequestConfig getConfig() {
        RequestConfig rf = RequestConfig.custom()
                .setConnectTimeout(1000)            // max time to establish the connection
                .setConnectionRequestTimeout(500)   // max time to get a connection from the pool
                .setSocketTimeout(10000)            // max time between data packets during transfer
                .build();
        return rf;
    }
}
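
A minimal usage sketch, assuming the HttpUtils class above is on the classpath (the demo class name is made up for illustration; outside of a Spring context the utility can simply be instantiated with new, since @Component only matters when Spring is scanning for beans):

package cn.project.jd.util;

public class HttpUtilsDemo {
    public static void main(String[] args) {
        HttpUtils hus = new HttpUtils();
        // Fetch the listing page; doGetHtml returns "" when the request fails
        String html = hus.doGetHtml("https://www.qidian.com/all");
        System.out.println(html.isEmpty() ? "request failed" : "fetched " + html.length() + " chars");
    }
}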

2. textTask — a scheduled task that fetches and parses the page data

Page to crawl: https://www.qidian.com/all

We grab all books on the first 2 pages of the suspense (懸疑) category. On Qidian the first page has page=1 and the second page=2, so a for loop fetches the pages one by one. The example only crawls the first 2 pages, 20 books per page.

    // Interval between task runs (100000*1000 ms, roughly 28 hours)
    @Scheduled(fixedDelay = 100000*1000)
    public void itemTask() throws Exception {
        // Initial URL to parse. Other category examples:
//        https://www.qidian.com/all?chanId=21&action=1&orderId=&page=1&vip=0&style=1&pageSize=20&siteid=1&pubflag=0&hiddenField=0   completed xuanhuan (玄幻)
//        https://www.qidian.com/all?chanId=22&action=1&orderId=&page=1&vip=0&style=1&pageSize=20&siteid=1&pubflag=0&hiddenField=0   completed xianxia (仙俠)
//        String url ="https://www.qidian.com/all?chanId=21&subCateId=8";
        // The query string must end with "&page=" so the loop below can append the page number
        String url = "https://www.qidian.com/all?chanId=10&orderId=&style=1&pageSize=20&siteid=1&pubflag=0&hiddenField=0&page=";
        for (int i = 1; i < 3; i++) {
            String html = hus.doGetHtml(url + i);
            // Parse the page data and save it
            this.parse(html);
        }

        System.out.println("Data crawl finished");
    }
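
For context, this method is assumed to sit in a Spring component roughly like the skeleton below; the class name TextTask and the injected hus field are inferred from the snippets, and @EnableScheduling must be present on the boot class for @Scheduled to fire:

package cn.project.jd.task;

import cn.project.jd.util.HttpUtils;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class TextTask {

    // The HttpUtils utility from step 1, injected by Spring
    @Autowired
    private HttpUtils hus;

    @Scheduled(fixedDelay = 100000*1000)
    public void itemTask() throws Exception {
        // body shown above
    }

    private void parse(String html) throws Exception {
        // body shown in steps 1) to 3) below
    }
}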

1) Get the book title and author

Analyze the page structure with the browser's F12 developer tools.

        Document doc = Jsoup.parse(html);
        Elements els = doc.select("div.all-book-list > div > ul > li");
        for (Element lis : els) {
            // Book title
            String bookName = lis.select(".book-mid-info > h4").text();
            System.out.println("Title: 《" + bookName + "》");
            // Author
            String author = lis.select(".book-mid-info > .author > [data-eid=qd_B59]").text();
            System.out.println("Author: " + author);
            // Synopsis
            String briefIntroduction = lis.select(".book-mid-info > .intro").text();
            System.out.println("Synopsis: " + briefIntroduction);
        }
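
To sanity-check these selectors without hitting the live site, they can be run against a small hand-written fragment. The markup below is only an approximation of Qidian's list structure inferred from the selectors themselves, not a copy of the real page:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectorSmokeTest {
    public static void main(String[] args) {
        // Hand-written approximation of the list markup (an assumption, not the live page)
        String html = "<div class=\"all-book-list\"><div><ul><li>"
                + "<div class=\"book-mid-info\">"
                + "<h4><a href=\"//book.qidian.com/info/1\">Example Book</a></h4>"
                + "<p class=\"author\"><a data-eid=\"qd_B59\">Example Author</a></p>"
                + "<p class=\"intro\">Example synopsis.</p>"
                + "</div></li></ul></div></div>";
        Document doc = Jsoup.parse(html);
        for (Element lis : doc.select("div.all-book-list > div > ul > li")) {
            System.out.println("Title: " + lis.select(".book-mid-info > h4").text());
            System.out.println("Author: " + lis.select(".book-mid-info > .author > [data-eid=qd_B59]").text());
            System.out.println("Synopsis: " + lis.select(".book-mid-info > .intro").text());
        }
    }
}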

2) Get the book's URL from the title hyperlink, then use the utility class to fetch that page as a String

            // The href is protocol-relative, so prepend "https:"
            String href = lis.select(".book-mid-info > h4 > a").attr("href");
            System.out.println("URL: https:" + href);
            // Fetch the single book's page
            String bookhtml = hus.doGetHtml("https:" + href);
            // Parse it into a Document with Jsoup's DOM API
            Document bookdoc = Jsoup.parse(bookhtml);

3) Fetch the chapters and their contents from that URL and write them to a file

            Elements boks = bookdoc.select(".wrap > .book-detail-wrap > #j-catalogWrap > .volume-wrap > div");
            // File name
            String txtname = bookName + ".txt";
            // Output path 桌面\懸疑 ("Desktop\Suspense"); the directory must already exist
            File file = new File("桌面\\懸疑\\" + txtname);
            // Stream for writing the book to the file
            PrintStream ps = new PrintStream(new FileOutputStream(file));
            for (Element bok : boks) {
                Elements boklis = bok.select("ul > li");
                for (Element bokli : boklis) {
                    // Chapter URL, again protocol-relative
                    String bokredurlz = bokli.select("a").attr("href");
                    String bokhtml = hus.doGetHtml("https:" + bokredurlz);
                    Document zdoc = Jsoup.parse(bokhtml);

                    // Chapter title
                    String zhangjie = zdoc.select(".wrap > .main-read-container > .read-main-wrap > #j_chapterBox > .text-wrap[data-purl] > .main-text-wrap > .text-head > h3").text();
                    System.out.println(zhangjie);
                    ps.append("\n\n\n");
                    ps.append(zhangjie);
                    ps.append("\n\n");
                    // Chapter content
                    String neirong = zdoc.select(".wrap > .main-read-container > .read-main-wrap > #j_chapterBox > .text-wrap[data-purl] > .main-text-wrap > .read-content").text();
                    ps.append(neirong);
                }
            }
            ps.close();
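
A few hardening tweaks worth considering, shown as a sketch under the same names as above (file, boks, hus): try-with-resources closes the stream even when a fetch throws, an explicit UTF-8 charset makes the output encoding platform-independent, and a short pause between chapter requests goes easier on the server (the 500 ms value is an arbitrary choice):

            // Sketch of the same write loop with try-with-resources, explicit UTF-8,
            // and a polite delay; runs inside a method declared "throws Exception"
            try (PrintStream ps = new PrintStream(new FileOutputStream(file), true, "UTF-8")) {
                String base = ".wrap > .main-read-container > .read-main-wrap > #j_chapterBox > .text-wrap[data-purl] > .main-text-wrap";
                for (Element bok : boks) {
                    for (Element bokli : bok.select("ul > li")) {
                        Document zdoc = Jsoup.parse(hus.doGetHtml("https:" + bokli.select("a").attr("href")));
                        ps.append("\n\n\n").append(zdoc.select(base + " > .text-head > h3").text()).append("\n\n");
                        ps.append(zdoc.select(base + " > .read-content").text());
                        Thread.sleep(500); // arbitrary pause between chapter requests
                    }
                }
            }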
