Implementing a Web Crawler with Java and MySQL

Posted by bzhxuexi on 2013-12-02
        A web crawler, also called a web spider (some projects refer to it as a "walker"), is defined by Wikipedia as "a program that systematically scans the Internet for the purpose of indexing." There are many open-source crawler projects; two of the better-known ones are Heritrix and Apache Nutch.

        Sometimes you need to gather information from the web. When the information is simple to fetch in a uniform way but tedious to collect by hand — say, counting how many posts a site publishes each month and which tags it uses, gathering a corpus for a natural language processing project, or collecting images for a pattern recognition project — a crawler is the right tool for the job. A web crawler is also one of the indispensable components of any search engine.

        Most web crawlers are written in Python, Java, or C#; the one given here is a Java version. To keep the run time and storage small, the program is restricted to scanning pages under this blog's address (that is, under http://johnhany.net/ but excluding anything under http://johnhany.net/wp-content/) and to collecting, from those URLs, every tag used on the site. With small changes — removing the restrictions in the code — it can be used to scan the web at large, or, with a small change to the output format, to generate a sitemap for the blog.

        The code can also be downloaded here: johnhany/WPCrawler


Environment

        My development environment is Windows 7 + Eclipse.

        XAMPP is needed to provide the MySQL service that the program reaches through a JDBC URL.

        Three open-source Java libraries are also required:

        Apache HttpComponents 4.3 provides the HTTP layer used to send HTTP requests to the target URLs and fetch the page content;

        HTML Parser 2.0 parses the pages and extracts the links from their DOM nodes;

        MySQL Connector/J 5.1.27 connects the Java program to MySQL so the database can be driven from Java code.


Code

        The code is split across three files — crawler.java, httpGet.java, and parsePage.java — all in the package net.johnhany.wpcrawler.

crawler.java

package net.johnhany.wpcrawler;
  
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
  
public class crawler {
      
    public static void main(String args[]) throws Exception {
        String frontpage = "http://johnhany.net/";
        Connection conn = null;
          
        //connect the MySQL database
        try {
            Class.forName("com.mysql.jdbc.Driver");
            String dburl = "jdbc:mysql://localhost:3306?useUnicode=true&characterEncoding=utf8";
            conn = DriverManager.getConnection(dburl, "root", "");
            System.out.println("connection built");
        } catch (SQLException e) {
            e.printStackTrace();
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        }
          
        String sql = null;
        String url = frontpage;
        Statement stmt = null;
        ResultSet rs = null;
        int count = 0;
          
        if(conn != null) {
            //create database and table that will be needed
            try {
                sql = "CREATE DATABASE IF NOT EXISTS crawler";
                stmt = conn.createStatement();
                stmt.executeUpdate(sql);
                  
                sql = "USE crawler";
                stmt = conn.createStatement();
                stmt.executeUpdate(sql);
                  
                sql = "create table if not exists record (recordID int(5) not null auto_increment, URL text not null, crawled tinyint(1) not null, primary key (recordID)) engine=InnoDB DEFAULT CHARSET=utf8";
                stmt = conn.createStatement();
                stmt.executeUpdate(sql);
                  
                sql = "create table if not exists tags (tagnum int(4) not null auto_increment, tagname text not null, primary key (tagnum)) engine=InnoDB DEFAULT CHARSET=utf8";
                stmt = conn.createStatement();
                stmt.executeUpdate(sql);
            } catch (SQLException e) {
                e.printStackTrace();
            }
              
            //crawl every link in the database
            while(true) {
                //get page content of link "url"
                httpGet.getByString(url,conn);
                count++;
                  
                //set boolean value "crawled" to true after crawling this page
                sql = "UPDATE record SET crawled = 1 WHERE URL = '" + url + "'";
                stmt = conn.createStatement();
                  
                if(stmt.executeUpdate(sql) > 0) {
                    //get the next page that has not been crawled yet
                    sql = "SELECT * FROM record WHERE crawled = 0";
                    stmt = conn.createStatement();
                    rs = stmt.executeQuery(sql);
                    if(rs.next()) {
                        url = rs.getString(2);
                    }else {
                        //stop crawling if reach the bottom of the list
                        break;
                    }
  
                    //set a limit of crawling count
                    if(count > 1000 || url == null) {
                        break;
                    }
                }
            }
            conn.close();
            conn = null;
              
            System.out.println("Done.");
            System.out.println(count);
        }
    }
}

httpGet.java

package net.johnhany.wpcrawler;
  
import java.io.IOException;
import java.sql.Connection;
  
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.ResponseHandler;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
  
public class httpGet {
  
    public final static void getByString(String url, Connection conn) throws Exception {
        CloseableHttpClient httpclient = HttpClients.createDefault();
          
        try {
            HttpGet httpget = new HttpGet(url);
            System.out.println("executing request " + httpget.getURI());
  
            ResponseHandler<String> responseHandler = new ResponseHandler<String>() {
  
                public String handleResponse(
                        final HttpResponse response) throws ClientProtocolException, IOException {
                    int status = response.getStatusLine().getStatusCode();
                    if (status >= 200 && status < 300) {
                        HttpEntity entity = response.getEntity();
                        return entity != null ? EntityUtils.toString(entity) : null;
                    } else {
                        throw new ClientProtocolException("Unexpected response status: " + status);
                    }
                }
            };
            String responseBody = httpclient.execute(httpget, responseHandler);
            /*
            //print the content of the page
            System.out.println("----------------------------------------");
            System.out.println(responseBody);
            System.out.println("----------------------------------------");
            */
            parsePage.parseFromString(responseBody,conn);
              
        } finally {
            httpclient.close();
        }
    }
}

parsePage.java

package net.johnhany.wpcrawler;
  
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
  
import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;
  
import java.net.URLDecoder;
  
public class parsePage {
      
    public static void parseFromString(String content, Connection conn) throws Exception {
        Parser parser = new Parser(content);
        HasAttributeFilter filter = new HasAttributeFilter("href");
          
        try {
            NodeList list = parser.parse(filter);
            int count = list.size();
              
            //process every link on this page
            for(int i=0; i<count; i++) {
                Node node = list.elementAt(i);
                  
                if(node instanceof LinkTag) {
                    LinkTag link = (LinkTag) node;
                    String nextlink = link.extractLink();
                    String mainurl = "http://johnhany.net/";
                    String wpurl = mainurl + "wp-content/";
  
                    //only save page from "http://johnhany.net"
                    if(nextlink.startsWith(mainurl)) {
                        String sql = null;
                        ResultSet rs = null;
                        PreparedStatement pstmt = null;
                        Statement stmt = null;
                        String tag = null;
                          
                        //do not save any page from "wp-content"
                        if(nextlink.startsWith(wpurl)) {
                            continue;
                        }
                          
                        try {
                            //check if the link already exists in the database
                            sql = "SELECT * FROM record WHERE URL = '" + nextlink + "'";
                            stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY,ResultSet.CONCUR_UPDATABLE);
                            rs = stmt.executeQuery(sql);
  
                            if(rs.next()) {
                                  
                            }else {
                                //if the link does not exist in the database, insert it
                                sql = "INSERT INTO record (URL, crawled) VALUES ('" + nextlink + "',0)";
                                pstmt = conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS);
                                pstmt.execute();
                                System.out.println(nextlink);
                                  
                                //use substring for better comparison performance
                                nextlink = nextlink.substring(mainurl.length());
                                //System.out.println(nextlink);
                                  
                                if(nextlink.startsWith("tag/")) {
                                    tag = nextlink.substring(4, nextlink.length()-1);
                                    //decode in UTF-8 for Chinese characters
                                    tag = URLDecoder.decode(tag,"UTF-8");
                                    sql = "INSERT INTO tags (tagname) VALUES ('" + tag + "')";
                                    pstmt = conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS);
                                    //if the links are different from each other, the tags must be different
                                    //so there is no need to check if the tag already exists
                                    pstmt.execute();
                                }
                            }
                        } catch (SQLException e) {
                            //handle the exceptions
                            System.out.println("SQLException: " + e.getMessage());
                            System.out.println("SQLState: " + e.getSQLState());
                            System.out.println("VendorError: " + e.getErrorCode());
                        } finally {
                            //close and release the resources of PreparedStatement, ResultSet and Statement
                            if(pstmt != null) {
                                try {
                                    pstmt.close();
                                } catch (SQLException e2) {}
                            }
                            pstmt = null;
                              
                            if(rs != null) {
                                try {
                                    rs.close();
                                } catch (SQLException e1) {}
                            }
                            rs = null;
                              
                            if(stmt != null) {
                                try {
                                    stmt.close();
                                } catch (SQLException e3) {}
                            }
                            stmt = null;
                        }
                          
                    }
                }
            }
        } catch (ParserException e) {
            e.printStackTrace();
        }
    }
}

How the program works

        The "web" is exactly that: a network, in which a path may exist between any two nodes. From the point of view of graph theory, crawling the web is a traversal of a directed graph (a link points from one page to another, so the edges are directed). The two common traversal strategies are depth-first and breadth-first; the usual material on tree traversal covers the theory. This program uses breadth-first traversal, as sketched below.
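        For illustration, here is a minimal in-memory sketch of the same breadth-first idea, using a queue and a visited set in place of the database-backed record table. The fetch-and-parse step is a placeholder, not the program's actual httpGet/parsePage calls:

import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

public class BfsSketch {
    public static void main(String[] args) {
        Queue<String> frontier = new ArrayDeque<String>();   //pages waiting to be crawled
        Set<String> seen = new HashSet<String>();            //pages already discovered

        frontier.add("http://johnhany.net/");
        seen.add("http://johnhany.net/");

        while (!frontier.isEmpty()) {
            String url = frontier.poll();                    //take the oldest entry: breadth-first
            for (String next : extractLinks(url)) {          //placeholder for fetch + parse
                if (seen.add(next)) {                        //true only the first time we see a link
                    frontier.add(next);
                }
            }
        }
    }

    //placeholder: in the real program this is httpGet.getByString() followed by parsePage.parseFromString()
    private static Set<String> extractLinks(String url) {
        return new HashSet<String>();
    }
}

        In the real program the record table plays the role of the queue and the visited set at the same time: a row is the "seen" mark, and crawled = 0 means the page is still in the frontier.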

        Execution starts from main() in crawler.java.

Class.forName("com.mysql.jdbc.Driver");
String dburl = "jdbc:mysql://localhost:3306?useUnicode=true&characterEncoding=utf8";
conn = DriverManager.getConnection(dburl, "root", "");
System.out.println("connection built");

        First, DriverManager is called to connect to the MySQL service. XAMPP's default MySQL port, 3306, is used; the port number can be seen on the XAMPP main panel:

[Screenshot: XAMPP control panel showing the Apache and MySQL services]

        Once Apache and MySQL have been started, entering "http://localhost/phpmyadmin/" in the browser's address bar brings up the database. After the program has finished you can check here whether it ran correctly.

[Screenshot: phpMyAdmin at http://localhost/phpmyadmin/]

 

sql = "CREATE DATABASE IF NOT EXISTS crawler";
stmt = conn.createStatement();
stmt.executeUpdate(sql);
  
sql = "USE crawler";
stmt = conn.createStatement();
stmt.executeUpdate(sql);
  
sql = "create table if not exists record (recordID int(5) not null auto_increment, URL text not null, crawled tinyint(1) not null, primary key (recordID)) engine=InnoDB DEFAULT CHARSET=utf8";
stmt = conn.createStatement();
stmt.executeUpdate(sql);
  
sql = "create table if not exists tags (tagnum int(4) not null auto_increment, tagname text not null, primary key (tagnum)) engine=InnoDB DEFAULT CHARSET=utf8";
stmt = conn.createStatement();
stmt.executeUpdate(sql);

        Once the connection is up, the program creates a database named "crawler" and two tables inside it. The first, "record", has the columns "recordID", "URL" and "crawled", holding the row number of each address, the link itself, and whether the address has been scanned yet; the second, "tags", has the columns "tagnum" and "tagname", holding the tag number and the tag name.
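        As a quick sanity check of this schema — a hypothetical helper, not part of the three source files — you can count how many addresses are still waiting to be crawled:

//hypothetical helper: count the rows in "record" that have not been crawled yet
static int pendingCount(java.sql.Connection conn) throws java.sql.SQLException {
    java.sql.Statement stmt = conn.createStatement();
    java.sql.ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM crawler.record WHERE crawled = 0");
    int pending = rs.next() ? rs.getInt(1) : 0;
    rs.close();
    stmt.close();
    return pending;
}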

 

while(true) {
    httpGet.getByString(url,conn);
    count++;
      
    sql = "UPDATE record SET crawled = 1 WHERE URL = '" + url + "'";
    stmt = conn.createStatement();
      
    if(stmt.executeUpdate(sql) > 0) {
        sql = "SELECT * FROM record WHERE crawled = 0";
        stmt = conn.createStatement();
        rs = stmt.executeQuery(sql);
        if(rs.next()) {
            url = rs.getString(2);
        }else {
            break;
        }
    }
}

        A while loop then works through every address in the record table. On each pass, the current url is handed to httpGet.getByString(), and afterwards crawled is set to true in the record table to mark that page as processed. The program then looks up the next address whose crawled is still false and carries on until it reaches the end of the table.

        One detail worth noting: executeQuery() returns a ResultSet, rs, holding every row matched by the SQL query together with a cursor that initially points just before the first row. A single call to rs.next() moves the cursor onto the first result and returns true; every further call moves it to the next result and returns true, until no results remain, at which point rs.next() returns false.

        Another detail: statements that create the database or tables, as well as INSERT and UPDATE statements, are run with executeUpdate(); SELECT queries are run with executeQuery(). executeQuery() always returns a ResultSet, while executeUpdate() returns the number of rows affected by the statement.
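        A minimal sketch of both calls, assuming the crawler database from above already exists and stmt is an open Statement:

//SELECT: executeQuery() returns a ResultSet whose cursor starts before the first row
ResultSet rs = stmt.executeQuery("SELECT URL FROM record WHERE crawled = 0");
while (rs.next()) {               //advances the cursor; returns false once no rows remain
    System.out.println(rs.getString("URL"));
}

//INSERT/UPDATE/DDL: executeUpdate() returns the number of rows affected (0 for DDL)
int affected = stmt.executeUpdate("UPDATE record SET crawled = 1 WHERE recordID = 1");
System.out.println(affected + " row(s) updated");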

 

        The getByString() method of httpGet.java sends a request to the given URL and downloads the page content.

HttpGet httpget = new HttpGet(url);
System.out.println("executing request " + httpget.getURI());
  
ResponseHandler<String> responseHandler = new ResponseHandler<String>() {
    public String handleResponse(
            final HttpResponse response) throws ClientProtocolException, IOException {
        int status = response.getStatusLine().getStatusCode();
        if (status >= 200 && status < 300) {
            HttpEntity entity = response.getEntity();
            return entity != null ? EntityUtils.toString(entity) : null;
        } else {
            throw new ClientProtocolException("Unexpected response status: " + status);
        }
    }
};
String responseBody = httpclient.execute(httpget, responseHandler);

        This code is the sample provided with the HttpClient module of HttpComponents and can be used as-is in many situations. It produces a string, responseBody, holding the entire text of the page.

        responseBody is then passed to the parseFromString() method of parsePage.java, which extracts the links.

Parser parser = new Parser(content);
HasAttributeFilter filter = new HasAttributeFilter("href");
  
try {
    NodeList list = parser.parse(filter);
    int count = list.size();
      
    //process every link on this page
    for(int i=0; i<count; i++) {
        Node node = list.elementAt(i);
        if(node instanceof LinkTag) {

        In an HTML document, links normally live in the href attribute of an a tag, so an attribute filter is built first. The NodeList holds every DOM node of the HTML document; walking it in a for loop and checking each node for the right tag type pulls out every link on the page.

        nextlink.startsWith() then narrows things down further: only links beginning with "http://johnhany.net/" are processed, and links beginning with "http://johnhany.net/wp-content/" are skipped.

sql = "SELECT * FROM record WHERE URL = '" + nextlink + "'";
stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY,ResultSet.CONCUR_UPDATABLE);
rs = stmt.executeQuery(sql);
  
if(rs.next()) {
      
}else {
    //if the link does not exist in the database, insert it
    sql = "INSERT INTO record (URL, crawled) VALUES ('" + nextlink + "',0)";
    pstmt = conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS);
    pstmt.execute();

        The record table is queried to see whether the link is already there. If it is (rs.next() returns true), nothing more is done; if it is not (rs.next() returns false), the address is inserted with crawled set to false. Because recordID was declared AUTO_INCREMENT, Statement.RETURN_GENERATED_KEYS is passed so that the number MySQL assigns can be retrieved.
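        If the generated recordID were actually needed, it could be read back through getGeneratedKeys(); a minimal sketch:

//after pstmt.execute() on an INSERT prepared with Statement.RETURN_GENERATED_KEYS
ResultSet keys = pstmt.getGeneratedKeys();
if (keys.next()) {
    int recordID = keys.getInt(1);    //the AUTO_INCREMENT value MySQL assigned to this row
    System.out.println("inserted record " + recordID);
}
keys.close();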

nextlink = nextlink.substring(mainurl.length());
  
if(nextlink.startsWith("tag/")) {
    tag = nextlink.substring(4, nextlink.length()-1);
    tag = URLDecoder.decode(tag,"UTF-8");
    sql = "INSERT INTO tags (tagname) VALUES ('" + tag + "')";
    pstmt = conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS);
    pstmt.execute();

        The leading "http://johnhany.net/" is stripped from the link so that the string comparisons that follow are faster. If the remainder starts with "tag/", the characters after it are a tag name: the name is extracted, URL-decoded as UTF-8 so that Chinese characters come out correctly, and stored in the tags table. In the same way you could test for "article/", "author/" or "2013/11/" to classify other kinds of link.
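        For example, a tag link whose name is Chinese arrives percent-encoded, and decoding it as UTF-8 restores the readable name. A standalone sketch (the sample link is hypothetical, not taken from the program's output):

import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class DecodeTagDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String nextlink = "tag/%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0/";   //hypothetical encoded tag link
        String tag = nextlink.substring(4, nextlink.length() - 1);        //strip "tag/" and the trailing "/"
        System.out.println(URLDecoder.decode(tag, "UTF-8"));              //prints the decoded Chinese tag name
    }
}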


Results

Two screenshots of the database show part of the program's output:

[Screenshot: contents of the record table]

[Screenshot: contents of the tags table]

        The full output can be obtained here. Compare it with the blog's sitemap to see what else would need to change to turn the program into a sitemap generation tool; a rough sketch follows.
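        As a hint of what that might look like, here is a rough sketch — the file name and the exact output details are assumptions, not part of the original program — that dumps every crawled URL from the record table in sitemap XML format:

//minimal sketch: write the crawled URLs from the "record" table to sitemap.xml
static void writeSitemap(java.sql.Connection conn) throws Exception {
    java.io.PrintWriter out = new java.io.PrintWriter("sitemap.xml", "UTF-8");
    out.println("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
    out.println("<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">");
    java.sql.Statement stmt = conn.createStatement();
    java.sql.ResultSet rs = stmt.executeQuery("SELECT URL FROM crawler.record WHERE crawled = 1");
    while (rs.next()) {
        out.println("  <url><loc>" + rs.getString("URL") + "</loc></url>");
    }
    rs.close();
    stmt.close();
    out.println("</urlset>");
    out.close();
}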

Reposted from: http://johnhany.net/2013/11/web-crawler-using-java-and-mysql/
