Learning Java: A Simple Web Crawler

Posted by __折戟沉沙 on 2017-09-06

Original URL: https://blog.csdn.net/www_131374/article/details/77861677

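Yesterday I needed to collect some data, so I wrote a simple crawler in Java, and I am writing it up here. It uses two tools, jsoup and HttpClient. jsoup parses HTML documents ([jsoup website](https://jsoup.org)). HttpClient is a powerful, feature-rich toolkit for making network requests ([HttpClient website](http://hc.apache.org/httpclient-3.x/)).
Because the program is short, my code formatting is fairly casual. Below I show only parts of the code, with brief explanations. Here is the code: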

The following loop increments the page parameter and keeps requesting data:

    String url = "some link";

    for (int i = 1; i <= 1666; i++) {
        String data = ClientHandle.getContentForUrl(url + i);

        Business business = new Business();
        business.analysisDataTypeOne(data);
    }

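Fetch the page content for a given URL:

import java.io.IOException;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

// Method of the ClientHandle class called in the loop above.
// VisitedUrlQueue and FunctionUtils are helper classes that are not shown in this post.
public static String getContentForUrl(String url) {

    HttpClient client = new DefaultHttpClient();

    HttpGet getHttp = new HttpGet(url);

    getHttp.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36");

    String content = null;

    HttpResponse response;
    try {
        /* Get the response entity, which carries the page data */
        response = client.execute(getHttp);
        HttpEntity entity = response.getEntity();

        VisitedUrlQueue.addElem(url);

        if (entity != null) {
            /* Convert the entity to text */
            content = EntityUtils.toString(entity);

            /* Check whether the page source should be saved to a local file */
            if (FunctionUtils.isCreateFile(url)
                    && FunctionUtils.isHasGoalContent(content) != -1) {
                FunctionUtils.createFile(
                        FunctionUtils.getGoalContent(content), url);
            }
        }

    } catch (ClientProtocolException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        client.getConnectionManager().shutdown();
    }

    // null if the request failed or the response had no body
    return content;
}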
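
`DefaultHttpClient` is deprecated in later HttpClient 4.x releases. As a rough alternative, and not what the original post uses, the same GET request could be made with the JDK's built-in `java.net.http.HttpClient` (Java 11 or later). In the sketch below, `PageFetcher`, `buildRequest`, and `fetch` are hypothetical names chosen for illustration, and the 10-second timeout is an assumption.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Sketch only: PageFetcher is a hypothetical class, not part of the original post.
class PageFetcher {
    // Same User-Agent string as in getContentForUrl above.
    static final String USER_AGENT =
            "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
            + "(KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36";

    // Build a GET request for the URL with the browser-like User-Agent header.
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", USER_AGENT)
                .timeout(Duration.ofSeconds(10)) // assumed timeout, tune as needed
                .GET()
                .build();
    }

    // Send the request and return the body as text, or null for a non-200 status.
    static String fetch(HttpClient client, String url)
            throws IOException, InterruptedException {
        HttpResponse<String> response =
                client.send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
        return response.statusCode() == 200 ? response.body() : null;
    }
}
```

A call such as `PageFetcher.fetch(HttpClient.newHttpClient(), url + i)` could stand in for `ClientHandle.getContentForUrl(url + i)`. Unlike the original method, this sketch lets the checked exceptions propagate to the caller and leaves out the `FunctionUtils` file-saving step.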

Next, process the page content that was fetched:

public void analysisDataTypeOne(String content){
    this.dealTable(content);
}

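Parse the page content. This uses jsoup elements:

// Process the outermost links
public void dealTable(String str) {
    // getContentForUrl returns null when the request fails, so check for null first
    if (str != null && str.length() > 0) {
        // Build the page document
        Document doc = Jsoup.parseBodyFragment(str);

        // Parse the data
        Element content = doc.body();
        Elements links = content.getElementsByClass("newClass");

        for (Element link : links) {
            this.dealTr(link);
        }
    }
}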
Turn each table block found by the previous method into an entity:

public void dealTr(Element link) {
    Elements nameEle = link.getElementsByClass("class");

    ListModel model = new ListModel();

    if (nameEle.size() > 0) {
        Elements elem = nameEle.first().getElementsByTag("a");

        if (elem.size() > 0) {
            // Take the first <a> element inside the block
            Element element = elem.first();
            String linkHref = element.attr("href");

            // attr() returns an empty string when the attribute is missing
            if (!linkHref.isEmpty()) {
                String name = element.text();

                // Save the data
                model.setName(name);
                model.setHref(linkHref);
            }
        }

    }
}

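Next come the model class and the database class:

public class ListModel {
    private int id; // primary key
    private String name; // name
    private String href; // link

    public void setName(String name) { this.name = name; }
    public void setHref(String href) { this.href = href; }
}

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class db {

    private static final String URL = "jdbc:mysql://localhost:3306/ChaoshenTest?useUnicode=true&characterEncoding=UTF-8";
    private static final String NAME = "root";
    private static final String PASSWORD = "root123";

    private static Connection conn = null;
    static {
        try {
            // 1. Load the driver (Connector/J 5.x class name; 8.x uses com.mysql.cj.jdbc.Driver)
            Class.forName("com.mysql.jdbc.Driver");
            // 2. Get the database connection
            conn = DriverManager.getConnection(URL, NAME, PASSWORD);
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }

    // Expose a method for obtaining the database connection
    public static Connection getConnection() {
        return conn;
    }
}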
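
The `dealTr` method fills a `ListModel`, but the post does not show the step that writes it to the database. A minimal sketch of that step might look like the following. It assumes a table named `list_model` with `name` and `href` columns; the table, its columns, and the `ListModelDao` class are all hypothetical.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Sketch only: ListModelDao, the table name and the column names are hypothetical.
class ListModelDao {
    // Parameterized insert; the id column is assumed to be auto-generated.
    static final String INSERT_SQL =
            "INSERT INTO list_model (name, href) VALUES (?, ?)";

    // Insert one scraped row and return the number of rows affected.
    static int save(Connection conn, String name, String href) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            ps.setString(1, name);
            ps.setString(2, href);
            return ps.executeUpdate();
        }
    }
}
```

Inside `dealTr`, a call such as `ListModelDao.save(db.getConnection(), name, linkHref)` would then store each row. A `PreparedStatement` is used so that scraped text is passed as parameters rather than concatenated into the SQL string.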

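The principle behind the code above is fairly simple. The site being crawled does not use any anti-crawling measures, so I did not add any advanced techniques; I will add them when I need them. I have written this up so it is easy to review and learn from later.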
