Java爬取先知論壇文章
0x00 前言
上篇文章寫了部分爬蟲程式碼,這裡給出一個完整的爬取先知論壇文章程式碼,用於技術交流。
0x01 程式碼實現
pom.xml加入依賴:
<!-- Maven dependencies for the crawler. NOTE(review): only jsoup is imported
     by the Java code shown below; httpclient, commons-io and commons-lang3
     are declared here but unused by that code. -->
<dependencies>
<!-- https://mvnrepository.com/artifact/org.apache.httpcomponents/httpclient -->
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpclient</artifactId>
<version>4.5.3</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
<!-- jsoup: HTML fetching/parsing — the only dependency the code below uses. -->
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.11.3</version>
</dependency>
<!-- https://mvnrepository.com/artifact/commons-io/commons-io -->
<dependency>
<groupId>commons-io</groupId>
<artifactId>commons-io</artifactId>
<version>2.4</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.commons/commons-lang3 -->
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-lang3</artifactId>
<version>3.7</version>
</dependency>
<!-- https://mvnrepository.com/artifact/junit/junit -->
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.12</version>
<scope>test</scope>
</dependency>
</dependencies>
實現程式碼
實現類:
package xianzhi;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
/**
 * Crawls article listing pages of the XZ forum and saves each article's HTML
 * to disk. One instance may be shared by several threads: a shared page
 * counter hands every thread a distinct listing page, so starting N threads
 * on the same instance splits the work instead of duplicating it.
 */
public class Climbimpl implements Runnable {
    /** Base URL of the forum, e.g. "https://xz.aliyun.com/". */
    private String url;
    /** Total number of listing pages to crawl (pages 1..pages). */
    private int pages;
    /** Output directory; article files are written as &lt;filename&gt;&lt;title&gt;.html. */
    private String filename;
    // Retained for source compatibility; the atomic page counter below makes
    // coarse locking (which previously serialized all threads) unnecessary.
    Lock lock = new ReentrantLock();
    // Next listing page to fetch; shared across threads so each page is
    // crawled exactly once.
    private final AtomicInteger nextPage = new AtomicInteger(1);

    /**
     * @param url      forum base URL
     * @param pages    number of listing pages to crawl
     * @param filename output directory (should end with a path separator)
     */
    public Climbimpl(String url, int pages, String filename) {
        this.url = url;
        this.pages = pages;
        this.filename = filename;
    }

    public void run() {
        File dir = new File(this.filename);
        // mkdirs() also creates missing parent directories; mkdir() fails
        // when the parent does not exist.
        if (dir.mkdirs()) {
            System.out.println("目錄已建立");
        }
        int page;
        // getAndIncrement gives each thread a distinct page number.
        // "<= pages" fixes the original off-by-one ("< pages" skipped the
        // last page even though 'pages' is the number of pages to read).
        while ((page = nextPage.getAndIncrement()) <= this.pages) {
            try {
                String requestUrl = this.url + "?page=" + page;
                Document doc = Jsoup.parse(new URL(requestUrl), 10000);
                // Each listing entry carries its article link on an element
                // with class "topic-title".
                Elements element = doc.getElementsByClass("topic-title");
                List<String> href = element.eachAttr("href");
                for (String s : href) {
                    try {
                        Document requests = Jsoup.parse(new URL(this.url + s), 100000);
                        String title = requests.getElementsByClass("content-title").first().text();
                        // Titles may contain characters that are illegal in
                        // (Windows) file names; replace them before use.
                        String safeTitle = title.replaceAll("[\\\\/:*?\"<>|]", "_");
                        System.out.println("已爬取" + title + "->" + this.filename + safeTitle + ".html");
                        // try-with-resources closes the stream even when
                        // write() throws (the original leaked it on error).
                        try (BufferedOutputStream out = new BufferedOutputStream(
                                new FileOutputStream(this.filename + safeTitle + ".html"))) {
                            // Write UTF-8 explicitly instead of relying on
                            // the platform default charset.
                            out.write(requests.toString().getBytes(StandardCharsets.UTF_8));
                        }
                    } catch (Exception e) {
                        // Best-effort: a single failed article must not stop
                        // the crawl of the remaining links.
                        System.out.println("爬取" + this.url + s + "報錯" + "報錯資訊" + e);
                    }
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
main類:
package xianzhi;
/**
 * Launcher: starts a fixed number of crawler threads that share one
 * {@link Climbimpl} instance.
 */
public class TestClimb {
    public static void main(String[] args) {
        int threadCount = 10;                    // number of worker threads
        String url = "https://xz.aliyun.com/";   // forum base URL
        int pages = 10;                          // listing pages to read
        String path = "D:\\paramss\\";           // output directory

        Climbimpl crawler = new Climbimpl(url, pages, path);
        for (int i = 0; i < threadCount; i++) {
            new Thread(crawler).start();
        }
    }
}
0x02 結尾
該爬蟲總體的程式碼都比較簡單。