Sometimes you need to gather information from the web. When the information has a single, well-defined way of being obtained but collecting it by hand would be tedious — for example, counting how many posts a site publishes each month and which tags it uses, gathering a corpus for a natural language processing project, or collecting images for a pattern recognition project — a crawler program is the right tool for the job. A web crawler is also an indispensable component of any search engine.
Most web crawlers are written in Python, Java or C#. What I present here is a Java crawler. To save time and space, the program is restricted to scanning only pages under this blog's address (that is, http://johnhany.net/ but excluding anything under http://johnhany.net/wp-content/), and it extracts from the URLs all the tags that are used. With small modifications — removing the restriction conditions in the code — it can be used to scan the wider web, or, with a slightly different output format, serve as a tool for generating a blog sitemap.
The code can also be downloaded here: johnhany/WPCrawler.
Environment Requirements
My development environment is Windows 7 + Eclipse.
XAMPP is needed to provide the port through which the MySQL database is accessed by URL.
Three open-source Java libraries are also used:
Apache HttpComponents 4.3 — provides the HTTP interface used to send HTTP requests to target URLs and fetch page content;
HTML Parser 2.0 — used to parse web pages and extract links from DOM nodes;
MySQL Connector/J 5.1.27 — connects the Java program to MySQL so the database can be manipulated from Java code.
The Code
The code is split across three files: crawler.java, httpGet.java and parsePage.java. The package name is net.johnhany.wpcrawler.
crawler.java
package net.johnhany.wpcrawler;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class crawler {

    public static void main(String args[]) throws Exception {
        String frontpage = "http://johnhany.net/";
        Connection conn = null;

        //connect the MySQL database
        try {
            Class.forName("com.mysql.jdbc.Driver");
            String dburl = "jdbc:mysql://localhost:3306?useUnicode=true&characterEncoding=utf8";
            conn = DriverManager.getConnection(dburl, "root", "");
            System.out.println("connection built");
        } catch (SQLException e) {
            e.printStackTrace();
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        }

        String sql = null;
        String url = frontpage;
        Statement stmt = null;
        ResultSet rs = null;
        int count = 0;

        if (conn != null) {
            //create database and table that will be needed
            try {
                sql = "CREATE DATABASE IF NOT EXISTS crawler";
                stmt = conn.createStatement();
                stmt.executeUpdate(sql);

                sql = "USE crawler";
                stmt = conn.createStatement();
                stmt.executeUpdate(sql);

                sql = "create table if not exists record (recordID int(5) not null auto_increment, URL text not null, crawled tinyint(1) not null, primary key (recordID)) engine=InnoDB DEFAULT CHARSET=utf8";
                stmt = conn.createStatement();
                stmt.executeUpdate(sql);

                sql = "create table if not exists tags (tagnum int(4) not null auto_increment, tagname text not null, primary key (tagnum)) engine=InnoDB DEFAULT CHARSET=utf8";
                stmt = conn.createStatement();
                stmt.executeUpdate(sql);
            } catch (SQLException e) {
                e.printStackTrace();
            }

            //crawl every link in the database
            while (true) {
                //get page content of link "url"
                httpGet.getByString(url, conn);
                count++;

                //set boolean value "crawled" to true after crawling this page
                sql = "UPDATE record SET crawled = 1 WHERE URL = '" + url + "'";
                stmt = conn.createStatement();

                if (stmt.executeUpdate(sql) > 0) {
                    //get the next page that has not been crawled yet
                    sql = "SELECT * FROM record WHERE crawled = 0";
                    stmt = conn.createStatement();
                    rs = stmt.executeQuery(sql);
                    if (rs.next()) {
                        url = rs.getString(2);
                    } else {
                        //stop crawling if reach the bottom of the list
                        break;
                    }

                    //set a limit of crawling count
                    if (count > 1000 || url == null) {
                        break;
                    }
                }
            }

            conn.close();
            conn = null;

            System.out.println("Done.");
            System.out.println(count);
        }
    }
}
httpGet.java
package net.johnhany.wpcrawler;

import java.io.IOException;
import java.sql.Connection;

import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.ResponseHandler;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class httpGet {

    public final static void getByString(String url, Connection conn) throws Exception {
        CloseableHttpClient httpclient = HttpClients.createDefault();

        try {
            HttpGet httpget = new HttpGet(url);
            System.out.println("executing request " + httpget.getURI());

            ResponseHandler<String> responseHandler = new ResponseHandler<String>() {

                public String handleResponse(final HttpResponse response) throws ClientProtocolException, IOException {
                    int status = response.getStatusLine().getStatusCode();
                    if (status >= 200 && status < 300) {
                        HttpEntity entity = response.getEntity();
                        return entity != null ? EntityUtils.toString(entity) : null;
                    } else {
                        throw new ClientProtocolException("Unexpected response status: " + status);
                    }
                }
            };
            String responseBody = httpclient.execute(httpget, responseHandler);

            /*
            //print the content of the page
            System.out.println("----------------------------------------");
            System.out.println(responseBody);
            System.out.println("----------------------------------------");
            */

            parsePage.parseFromString(responseBody, conn);

        } finally {
            httpclient.close();
        }
    }
}
parsePage.java
package net.johnhany.wpcrawler;

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

import java.net.URLDecoder;

public class parsePage {

    public static void parseFromString(String content, Connection conn) throws Exception {
        Parser parser = new Parser(content);
        HasAttributeFilter filter = new HasAttributeFilter("href");

        try {
            NodeList list = parser.parse(filter);
            int count = list.size();

            //process every link on this page
            for (int i = 0; i < count; i++) {
                Node node = list.elementAt(i);

                if (node instanceof LinkTag) {
                    LinkTag link = (LinkTag) node;
                    String nextlink = link.extractLink();
                    String mainurl = "http://johnhany.net/";
                    String wpurl = mainurl + "wp-content/";

                    //only save page from "http://johnhany.net"
                    if (nextlink.startsWith(mainurl)) {
                        String sql = null;
                        ResultSet rs = null;
                        PreparedStatement pstmt = null;
                        Statement stmt = null;
                        String tag = null;

                        //do not save any page from "wp-content"
                        if (nextlink.startsWith(wpurl)) {
                            continue;
                        }

                        try {
                            //check if the link already exists in the database
                            sql = "SELECT * FROM record WHERE URL = '" + nextlink + "'";
                            stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_UPDATABLE);
                            rs = stmt.executeQuery(sql);

                            if (rs.next()) {

                            } else {
                                //if the link does not exist in the database, insert it
                                sql = "INSERT INTO record (URL, crawled) VALUES ('" + nextlink + "',0)";
                                pstmt = conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS);
                                pstmt.execute();
                                System.out.println(nextlink);

                                //use substring for better comparison performance
                                nextlink = nextlink.substring(mainurl.length());
                                //System.out.println(nextlink);

                                if (nextlink.startsWith("tag/")) {
                                    tag = nextlink.substring(4, nextlink.length() - 1);
                                    //decode in UTF-8 for Chinese characters
                                    tag = URLDecoder.decode(tag, "UTF-8");
                                    sql = "INSERT INTO tags (tagname) VALUES ('" + tag + "')";
                                    pstmt = conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS);
                                    //if the links are different from each other, the tags must be different
                                    //so there is no need to check if the tag already exists
                                    pstmt.execute();
                                }
                            }
                        } catch (SQLException e) {
                            //handle the exceptions
                            System.out.println("SQLException: " + e.getMessage());
                            System.out.println("SQLState: " + e.getSQLState());
                            System.out.println("VendorError: " + e.getErrorCode());
                        } finally {
                            //close and release the resources of PreparedStatement, ResultSet and Statement
                            if (pstmt != null) {
                                try {
                                    pstmt.close();
                                } catch (SQLException e2) {}
                            }
                            pstmt = null;

                            if (rs != null) {
                                try {
                                    rs.close();
                                } catch (SQLException e1) {}
                            }
                            rs = null;

                            if (stmt != null) {
                                try {
                                    stmt.close();
                                } catch (SQLException e3) {}
                            }
                            stmt = null;
                        }
                    }
                }
            }
        } catch (ParserException e) {
            e.printStackTrace();
        }
    }
}
How the Program Works
The so-called "web" is a mesh structure: a path may exist between any two nodes. From a graph-theory point of view, a crawler's scan of the web is a traversal of a directed graph (a link points from one page to another, so the edges are directed). The common traversal strategies are depth-first and breadth-first; for the underlying theory, see the material on tree traversal: here and here. My program uses breadth-first traversal.
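As a rough illustration of breadth-first order, here is a minimal sketch using an in-memory queue and a hypothetical fetchLinks() helper; the actual program keeps its frontier in a MySQL table instead:

import java.util.ArrayDeque;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

public class BfsSketch {
    // hypothetical placeholder for "download the page and extract its links"
    static List<String> fetchLinks(String url) { return Collections.emptyList(); }

    public static void main(String[] args) {
        Queue<String> frontier = new ArrayDeque<String>();
        Set<String> visited = new HashSet<String>();
        frontier.add("http://johnhany.net/");

        while (!frontier.isEmpty() && visited.size() < 1000) {
            String url = frontier.poll();
            if (!visited.add(url)) continue;        // skip pages that were already crawled
            for (String next : fetchLinks(url)) {   // enqueue newly discovered links (FIFO = breadth-first)
                if (!visited.contains(next)) {
                    frontier.add(next);
                }
            }
        }
        System.out.println("pages visited: " + visited.size());
    }
}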
The program starts executing from main() in crawler.java.
Class.forName("com.mysql.jdbc.Driver");
String dburl = "jdbc:mysql://localhost:3306?useUnicode=true&characterEncoding=utf8";
conn = DriverManager.getConnection(dburl, "root", "");
System.out.println("connection built");
First, DriverManager is used to connect to the MySQL service. The default XAMPP MySQL port 3306 is used here; the port value can be seen on the XAMPP control panel:
Once both Apache and MySQL have been started, you can open "http://localhost/phpmyadmin/" in a browser to inspect the database. After the program finishes, check here to verify that it ran correctly.
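If you would rather verify from code than from phpMyAdmin, a quick sanity check might look like this (a sketch only, assuming the XAMPP defaults used in this post: user "root" with an empty password on port 3306):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CheckResults {
    public static void main(String[] args) throws Exception {
        Class.forName("com.mysql.jdbc.Driver");
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/crawler?useUnicode=true&characterEncoding=utf8", "root", "");
        Statement stmt = conn.createStatement();

        // count the rows the crawler has collected so far
        ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM record");
        if (rs.next()) System.out.println("URLs recorded: " + rs.getInt(1));
        rs = stmt.executeQuery("SELECT COUNT(*) FROM tags");
        if (rs.next()) System.out.println("Tags recorded: " + rs.getInt(1));

        conn.close();
    }
}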
sql = "CREATE DATABASE IF NOT EXISTS crawler";
stmt = conn.createStatement();
stmt.executeUpdate(sql);

sql = "USE crawler";
stmt = conn.createStatement();
stmt.executeUpdate(sql);

sql = "create table if not exists record (recordID int(5) not null auto_increment, URL text not null, crawled tinyint(1) not null, primary key (recordID)) engine=InnoDB DEFAULT CHARSET=utf8";
stmt = conn.createStatement();
stmt.executeUpdate(sql);

sql = "create table if not exists tags (tagnum int(4) not null auto_increment, tagname text not null, primary key (tagnum)) engine=InnoDB DEFAULT CHARSET=utf8";
stmt = conn.createStatement();
stmt.executeUpdate(sql);
After the database connection is established, the program creates a database named "crawler" containing two tables. One, "record", has the fields "recordID", "URL" and "crawled", which store the record number, the link address, and whether that address has already been scanned. The other, "tags", has the fields "tagnum" and "tagname", which store the tag number and the tag name.
while (true) {
    httpGet.getByString(url, conn);
    count++;

    sql = "UPDATE record SET crawled = 1 WHERE URL = '" + url + "'";
    stmt = conn.createStatement();

    if (stmt.executeUpdate(sql) > 0) {
        sql = "SELECT * FROM record WHERE crawled = 0";
        stmt = conn.createStatement();
        rs = stmt.executeQuery(sql);
        if (rs.next()) {
            url = rs.getString(2);
        } else {
            break;
        }
    }
}
Next, a while loop processes each address in the record table in turn. On each iteration, the address url is passed to httpGet.getByString(), and then crawled is set to true in the record table to mark that address as processed. The program then looks for the next address whose crawled value is still false and continues, until it reaches the end of the table.
One detail worth noting: after executeQuery() runs, you get a ResultSet rs containing all rows returned by the SQL query, together with a cursor that initially points just before the first row. You must call rs.next() once to move the cursor onto the first result, at which point it returns true. Each subsequent rs.next() moves the cursor to the next result and returns true, until there are no results left, at which point rs.next() returns false.
Another detail: when creating the database or tables, or executing INSERT or UPDATE, use executeUpdate(); when executing SELECT, use executeQuery(). executeQuery() always returns a ResultSet, while executeUpdate() returns the number of rows affected by the statement.
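A tiny sketch of both points, assuming an open Connection conn like the one created above (this helper is illustrative, not part of the crawler):

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

class JdbcDemo {
    // executeUpdate() for statements that modify data, executeQuery() for SELECT
    static void demo(Connection conn) throws Exception {
        Statement stmt = conn.createStatement();

        int changed = stmt.executeUpdate("UPDATE record SET crawled = 1 WHERE recordID = 1");
        System.out.println(changed + " row(s) updated");      // executeUpdate returns the affected-row count

        ResultSet rs = stmt.executeQuery("SELECT URL FROM record WHERE crawled = 0");
        while (rs.next()) {                                   // the cursor starts before the first row;
            System.out.println(rs.getString("URL"));          // each next() advances it, returning false at the end
        }
        rs.close();
        stmt.close();
    }
}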
The getByString() method in httpGet.java is responsible for sending a request to the given URL and downloading the page content.
HttpGet httpget = new HttpGet(url);
System.out.println("executing request " + httpget.getURI());

ResponseHandler<String> responseHandler = new ResponseHandler<String>() {

    public String handleResponse(final HttpResponse response) throws ClientProtocolException, IOException {
        int status = response.getStatusLine().getStatusCode();
        if (status >= 200 && status < 300) {
            HttpEntity entity = response.getEntity();
            return entity != null ? EntityUtils.toString(entity) : null;
        } else {
            throw new ClientProtocolException("Unexpected response status: " + status);
        }
    }
};
String responseBody = httpclient.execute(httpget, responseHandler);
This code comes from the sample shipped with the HttpClient component of HttpComponents and can be used as-is in many situations. It produces a string, responseBody, which holds the full text of the page.
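One optional hardening step that is not in the original code: HttpClient 4.3 also lets you attach connect and socket timeouts (and a custom User-Agent) so that a single slow page cannot stall the whole crawl. A sketch:

import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

class TimeoutSketch {
    static CloseableHttpClient buildClient() {
        // 10-second connect/read timeouts; the values are arbitrary examples
        RequestConfig config = RequestConfig.custom()
                .setConnectTimeout(10000)
                .setSocketTimeout(10000)
                .build();
        return HttpClients.custom().setDefaultRequestConfig(config).build();
    }

    static HttpGet buildRequest(String url) {
        HttpGet httpget = new HttpGet(url);
        // a User-Agent header makes the crawler easier to identify in server logs
        httpget.setHeader("User-Agent", "WPCrawler/0.1 (+http://johnhany.net/)");
        return httpget;
    }
}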
Next, responseBody is passed to the parseFromString() method of parsePage.java to extract the links.
Parser parser = new Parser(content);
HasAttributeFilter filter = new HasAttributeFilter("href");

try {
    NodeList list = parser.parse(filter);
    int count = list.size();

    //process every link on this page
    for (int i = 0; i < count; i++) {
        Node node = list.elementAt(i);
        if (node instanceof LinkTag) {
In an HTML document, links usually live in the href attribute of an a tag, so an attribute filter is created for "href". The NodeList holds all the DOM nodes of the HTML document; by walking through the nodes in the for loop and keeping only the tags we care about, every link on the page can be extracted.
nextlink.startsWith() then narrows things down further: only links beginning with "http://johnhany.net/" are processed, and links beginning with "http://johnhany.net/wp-content/" are skipped.
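Note that HasAttributeFilter("href") also matches tags such as link and base, which is why the instanceof LinkTag check is needed afterwards. If you prefer, the filtering can be pushed entirely into the filter itself; a sketch using the same HTML Parser library:

import org.htmlparser.Parser;
import org.htmlparser.filters.AndFilter;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;

class FilterSketch {
    // keep only <a ... href="..."> nodes, so no instanceof check is needed afterwards
    static NodeList anchorsWithHref(String content) throws Exception {
        Parser parser = new Parser(content);
        AndFilter filter = new AndFilter(new TagNameFilter("a"), new HasAttributeFilter("href"));
        return parser.parse(filter);
    }
}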
sql = "SELECT * FROM record WHERE URL = '" + nextlink + "'";
stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_UPDATABLE);
rs = stmt.executeQuery(sql);

if (rs.next()) {

} else {
    //if the link does not exist in the database, insert it
    sql = "INSERT INTO record (URL, crawled) VALUES ('" + nextlink + "',0)";
    pstmt = conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS);
    pstmt.execute();
The record table is queried to see whether the link already exists. If it does (rs.next() == true), nothing is done; if it does not (rs.next() == false), the address is inserted into the table with crawled set to false. Because recordID was declared AUTO_INCREMENT earlier, Statement.RETURN_GENERATED_KEYS is used so that the proper number is assigned.
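If the generated recordID is actually needed (say, for logging), it can be read back through getGeneratedKeys(). The following is only a sketch; it also shows the parameterized form of the INSERT, which sidesteps quoting problems in URLs:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

class GeneratedKeySketch {
    static int insertAndGetId(Connection conn, String nextlink) throws Exception {
        // a parameterized statement avoids problems with quotes in the URL
        String sql = "INSERT INTO record (URL, crawled) VALUES (?, 0)";
        PreparedStatement pstmt = conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS);
        pstmt.setString(1, nextlink);
        pstmt.executeUpdate();

        ResultSet keys = pstmt.getGeneratedKeys();   // holds the auto_increment value just assigned
        int recordID = keys.next() ? keys.getInt(1) : -1;
        pstmt.close();
        return recordID;
    }
}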
nextlink = nextlink.substring(mainurl.length());
if (nextlink.startsWith("tag/")) {
    tag = nextlink.substring(4, nextlink.length() - 1);
    tag = URLDecoder.decode(tag, "UTF-8");
    sql = "INSERT INTO tags (tagname) VALUES ('" + tag + "')";
    pstmt = conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS);
    pstmt.execute();
The leading "http://johnhany.net/" is stripped from the link to speed up string comparison. If the remainder starts with "tag/", the characters that follow are a tag name; this name is extracted and decoded as UTF-8 so that Chinese characters display correctly, then stored in the tags table. In the same way, you could add checks for "article/", "author/", "2013/11/" and so on to classify the other kinds of links.
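A sketch of how the same idea could be extended to the other prefixes mentioned above; the helper and its categories are purely illustrative:

class LinkClassifier {
    // classify a link by its path after "http://johnhany.net/"
    static String classify(String path) {
        if (path.startsWith("tag/")) {
            return "tag";
        } else if (path.startsWith("article/")) {
            return "article";
        } else if (path.startsWith("author/")) {
            return "author";
        } else if (path.matches("\\d{4}/\\d{2}/.*")) {   // e.g. "2013/11/..." date archives
            return "archive";
        }
        return "other";
    }
}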
Results
Here are two screenshots of the database showing part of the program's output:
The full output can be obtained here. You can compare it with the blog's sitemap to see what else would need to change to build a sitemap generator on top of this program.
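For reference, a minimal sketch of what a sitemap generator built on the record table might look like, assuming the same XAMPP defaults used earlier (this is not part of the original program):

import java.io.PrintWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SitemapSketch {
    public static void main(String[] args) throws Exception {
        Class.forName("com.mysql.jdbc.Driver");
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/crawler?useUnicode=true&characterEncoding=utf8", "root", "");
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery("SELECT URL FROM record WHERE crawled = 1");

        // dump every crawled URL as a <url> entry in a sitemap file
        PrintWriter out = new PrintWriter("sitemap.xml", "UTF-8");
        out.println("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
        out.println("<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">");
        while (rs.next()) {
            out.println("  <url><loc>" + rs.getString(1) + "</loc></url>");
        }
        out.println("</urlset>");
        out.close();
        conn.close();
    }
}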
Reposted from: http://johnhany.net/2013/11/web-crawler-using-java-and-mysql/