Sometimes you need to gather information from the web. When the information has a single, well-defined way of being obtained but collecting it by hand would be tedious — for example, counting how many posts a site publishes each month and which tags it uses, gathering a corpus for a natural language processing project, or collecting images for a pattern recognition project — a crawler program is the right tool for the job. A web crawler is also an indispensable component of any search engine.
Most web crawlers are written in Python, Java or C#. What I present here is a Java crawler. To save time and space, the program is restricted to scanning only pages under this blog's address (that is, http://johnhany.net/ but excluding anything under http://johnhany.net/wp-content/), and it extracts from the URLs all the tags that are used. With small modifications — removing the restriction conditions in the code — it can be used to scan the wider web, or, with a slightly different output format, serve as a tool for generating a blog sitemap.
The code can also be downloaded here: johnhany/WPCrawler.
Environment Requirements
My development environment is Windows 7 + Eclipse.
XAMPP is needed to provide the port through which the MySQL database is accessed by URL.
Three open-source Java libraries are also used:
Apache HttpComponents 4.3 — provides the HTTP interface used to send HTTP requests to target URLs and fetch page content;
HTML Parser 2.0 — used to parse web pages and extract links from DOM nodes;
MySQL Connector/J 5.1.27 — connects the Java program to MySQL so the database can be manipulated from Java code.
The Code
The code is split across three files: crawler.java, httpGet.java and parsePage.java. The package name is net.johnhany.wpcrawler.
crawler.java
package net.johnhany.wpcrawler;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class crawler {

    public static void main(String args[]) throws Exception {
        String frontpage = "http://johnhany.net/";
        Connection conn = null;

        //connect the MySQL database
        try {
            Class.forName("com.mysql.jdbc.Driver");
            String dburl = "jdbc:mysql://localhost:3306?useUnicode=true&characterEncoding=utf8";
            conn = DriverManager.getConnection(dburl, "root", "");
            System.out.println("connection built");
        } catch (SQLException e) {
            e.printStackTrace();
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        }

        String sql = null;
        String url = frontpage;
        Statement stmt = null;
        ResultSet rs = null;
        int count = 0;

        if (conn != null) {
            //create database and table that will be needed
            try {
                sql = "CREATE DATABASE IF NOT EXISTS crawler";
                stmt = conn.createStatement();
                stmt.executeUpdate(sql);

                sql = "USE crawler";
                stmt = conn.createStatement();
                stmt.executeUpdate(sql);

                sql = "create table if not exists record (recordID int(5) not null auto_increment, URL text not null, crawled tinyint(1) not null, primary key (recordID)) engine=InnoDB DEFAULT CHARSET=utf8";
                stmt = conn.createStatement();
                stmt.executeUpdate(sql);

                sql = "create table if not exists tags (tagnum int(4) not null auto_increment, tagname text not null, primary key (tagnum)) engine=InnoDB DEFAULT CHARSET=utf8";
                stmt = conn.createStatement();
                stmt.executeUpdate(sql);
            } catch (SQLException e) {
                e.printStackTrace();
            }

            //crawl every link in the database
            while (true) {
                //get page content of link "url"
                httpGet.getByString(url, conn);
                count++;

                //set boolean value "crawled" to true after crawling this page
                sql = "UPDATE record SET crawled = 1 WHERE URL = '" + url + "'";
                stmt = conn.createStatement();

                if (stmt.executeUpdate(sql) > 0) {
                    //get the next page that has not been crawled yet
                    sql = "SELECT * FROM record WHERE crawled = 0";
                    stmt = conn.createStatement();
                    rs = stmt.executeQuery(sql);
                    if (rs.next()) {
                        url = rs.getString(2);
                    } else {
                        //stop crawling if reach the bottom of the list
                        break;
                    }

                    //set a limit of crawling count
                    if (count > 1000 || url == null) {
                        break;
                    }
                }
            }

            conn.close();
            conn = null;

            System.out.println("Done.");
            System.out.println(count);
        }
    }
}
httpGet.java
package net.johnhany.wpcrawler;

import java.io.IOException;
import java.sql.Connection;

import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.ResponseHandler;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class httpGet {

    public final static void getByString(String url, Connection conn) throws Exception {
        CloseableHttpClient httpclient = HttpClients.createDefault();

        try {
            HttpGet httpget = new HttpGet(url);
            System.out.println("executing request " + httpget.getURI());

            ResponseHandler<String> responseHandler = new ResponseHandler<String>() {

                public String handleResponse(final HttpResponse response) throws ClientProtocolException, IOException {
                    int status = response.getStatusLine().getStatusCode();
                    if (status >= 200 && status < 300) {
                        HttpEntity entity = response.getEntity();
                        return entity != null ? EntityUtils.toString(entity) : null;
                    } else {
                        throw new ClientProtocolException("Unexpected response status: " + status);
                    }
                }
            };
            String responseBody = httpclient.execute(httpget, responseHandler);

            /*
            //print the content of the page
            System.out.println("----------------------------------------");
            System.out.println(responseBody);
            System.out.println("----------------------------------------");
            */

            parsePage.parseFromString(responseBody, conn);

        } finally {
            httpclient.close();
        }
    }
}
parsePage.java
package net.johnhany.wpcrawler;

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

import java.net.URLDecoder;

public class parsePage {

    public static void parseFromString(String content, Connection conn) throws Exception {
        Parser parser = new Parser(content);
        HasAttributeFilter filter = new HasAttributeFilter("href");

        try {
            NodeList list = parser.parse(filter);
            int count = list.size();

            //process every link on this page
            for (int i = 0; i < count; i++) {
                Node node = list.elementAt(i);

                if (node instanceof LinkTag) {
                    LinkTag link = (LinkTag) node;
                    String nextlink = link.extractLink();
                    String mainurl = "http://johnhany.net/";
                    String wpurl = mainurl + "wp-content/";

                    //only save page from "http://johnhany.net"
                    if (nextlink.startsWith(mainurl)) {
                        String sql = null;
                        ResultSet rs = null;
                        PreparedStatement pstmt = null;
                        Statement stmt = null;
                        String tag = null;

                        //do not save any page from "wp-content"
                        if (nextlink.startsWith(wpurl)) {
                            continue;
                        }

                        try {
                            //check if the link already exists in the database
                            sql = "SELECT * FROM record WHERE URL = '" + nextlink + "'";
                            stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_UPDATABLE);
                            rs = stmt.executeQuery(sql);

                            if (rs.next()) {

                            } else {
                                //if the link does not exist in the database, insert it
                                sql = "INSERT INTO record (URL, crawled) VALUES ('" + nextlink + "',0)";
                                pstmt = conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS);
                                pstmt.execute();
                                System.out.println(nextlink);

                                //use substring for better comparison performance
                                nextlink = nextlink.substring(mainurl.length());
                                //System.out.println(nextlink);

                                if (nextlink.startsWith("tag/")) {
                                    tag = nextlink.substring(4, nextlink.length() - 1);
                                    //decode in UTF-8 for Chinese characters
                                    tag = URLDecoder.decode(tag, "UTF-8");
                                    sql = "INSERT INTO tags (tagname) VALUES ('" + tag + "')";
                                    pstmt = conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS);
                                    //if the links are different from each other, the tags must be different
                                    //so there is no need to check if the tag already exists
                                    pstmt.execute();
                                }
                            }
                        } catch (SQLException e) {
                            //handle the exceptions
                            System.out.println("SQLException: " + e.getMessage());
                            System.out.println("SQLState: " + e.getSQLState());
                            System.out.println("VendorError: " + e.getErrorCode());
                        } finally {
                            //close and release the resources of PreparedStatement, ResultSet and Statement
                            if (pstmt != null) {
                                try {
                                    pstmt.close();
                                } catch (SQLException e2) {}
                            }
                            pstmt = null;

                            if (rs != null) {
                                try {
                                    rs.close();
                                } catch (SQLException e1) {}
                            }
                            rs = null;

                            if (stmt != null) {
                                try {
                                    stmt.close();
                                } catch (SQLException e3) {}
                            }
                            stmt = null;
                        }
                    }
                }
            }
        } catch (ParserException e) {
            e.printStackTrace();
        }
    }
}
How the Program Works
The so-called "web" is a mesh structure: a path may exist between any two nodes. From a graph-theory point of view, a crawler's scan of the web is a traversal of a directed graph (a link points from one page to another, so the edges are directed). The common traversal strategies are depth-first and breadth-first; for the underlying theory, see the material on tree traversal: here and here. My program uses breadth-first traversal.
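As a rough illustration of breadth-first order, here is a minimal sketch using an in-memory queue and a hypothetical fetchLinks() helper; the actual program keeps its frontier in a MySQL table instead:

import java.util.ArrayDeque;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

public class BfsSketch {
    // hypothetical placeholder for "download the page and extract its links"
    static List<String> fetchLinks(String url) { return Collections.emptyList(); }

    public static void main(String[] args) {
        Queue<String> frontier = new ArrayDeque<String>();
        Set<String> visited = new HashSet<String>();
        frontier.add("http://johnhany.net/");

        while (!frontier.isEmpty() && visited.size() < 1000) {
            String url = frontier.poll();
            if (!visited.add(url)) continue;        // skip pages that were already crawled
            for (String next : fetchLinks(url)) {   // enqueue newly discovered links (FIFO = breadth-first)
                if (!visited.contains(next)) {
                    frontier.add(next);
                }
            }
        }
        System.out.println("pages visited: " + visited.size());
    }
}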
The program starts executing from main() in crawler.java.
Class.forName("com.mysql.jdbc.Driver");
String dburl = "jdbc:mysql://localhost:3306?useUnicode=true&characterEncoding=utf8";
conn = DriverManager.getConnection(dburl, "root", "");
System.out.println("connection built");
First, DriverManager is used to connect to the MySQL service. The default XAMPP MySQL port 3306 is used here; the port value can be seen on the XAMPP control panel:
Once both Apache and MySQL have been started, you can open "http://localhost/phpmyadmin/" in a browser to inspect the database. After the program finishes, check here to verify that it ran correctly.
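If you would rather verify from code than from phpMyAdmin, a quick sanity check might look like this (a sketch only, assuming the XAMPP defaults used in this post: user "root" with an empty password on port 3306):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CheckResults {
    public static void main(String[] args) throws Exception {
        Class.forName("com.mysql.jdbc.Driver");
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/crawler?useUnicode=true&characterEncoding=utf8", "root", "");
        Statement stmt = conn.createStatement();

        // count the rows the crawler has collected so far
        ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM record");
        if (rs.next()) System.out.println("URLs recorded: " + rs.getInt(1));
        rs = stmt.executeQuery("SELECT COUNT(*) FROM tags");
        if (rs.next()) System.out.println("Tags recorded: " + rs.getInt(1));

        conn.close();
    }
}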
sql = "CREATE DATABASE IF NOT EXISTS crawler";
stmt = conn.createStatement();
stmt.executeUpdate(sql);

sql = "USE crawler";
stmt = conn.createStatement();
stmt.executeUpdate(sql);

sql = "create table if not exists record (recordID int(5) not null auto_increment, URL text not null, crawled tinyint(1) not null, primary key (recordID)) engine=InnoDB DEFAULT CHARSET=utf8";
stmt = conn.createStatement();
stmt.executeUpdate(sql);

sql = "create table if not exists tags (tagnum int(4) not null auto_increment, tagname text not null, primary key (tagnum)) engine=InnoDB DEFAULT CHARSET=utf8";
stmt = conn.createStatement();
stmt.executeUpdate(sql);
After the database connection is established, the program creates a database named "crawler" containing two tables. One, "record", has the fields "recordID", "URL" and "crawled", which store the record number, the link address, and whether that address has already been scanned. The other, "tags", has the fields "tagnum" and "tagname", which store the tag number and the tag name.
while (true) {
    httpGet.getByString(url, conn);
    count++;

    sql = "UPDATE record SET crawled = 1 WHERE URL = '" + url + "'";
    stmt = conn.createStatement();

    if (stmt.executeUpdate(sql) > 0) {
        sql = "SELECT * FROM record WHERE crawled = 0";
        stmt = conn.createStatement();
        rs = stmt.executeQuery(sql);
        if (rs.next()) {
            url = rs.getString(2);
        } else {
            break;
        }
    }
}
Next, a while loop processes each address in the record table in turn. On each iteration, the address url is passed to httpGet.getByString(), and then crawled is set to true in the record table to mark that address as processed. The program then looks for the next address whose crawled value is still false and continues, until it reaches the end of the table.
One detail worth noting: after executeQuery() runs, you get a ResultSet rs containing all rows returned by the SQL query, together with a cursor that initially points just before the first row. You must call rs.next() once to move the cursor onto the first result, at which point it returns true. Each subsequent rs.next() moves the cursor to the next result and returns true, until there are no results left, at which point rs.next() returns false.
Another detail: when creating the database or tables, or executing INSERT or UPDATE, use executeUpdate(); when executing SELECT, use executeQuery(). executeQuery() always returns a ResultSet, while executeUpdate() returns the number of rows affected by the statement.
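A tiny sketch of both points, assuming an open Connection conn like the one created above (this helper is illustrative, not part of the crawler):

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

class JdbcDemo {
    // executeUpdate() for statements that modify data, executeQuery() for SELECT
    static void demo(Connection conn) throws Exception {
        Statement stmt = conn.createStatement();

        int changed = stmt.executeUpdate("UPDATE record SET crawled = 1 WHERE recordID = 1");
        System.out.println(changed + " row(s) updated");      // executeUpdate returns the affected-row count

        ResultSet rs = stmt.executeQuery("SELECT URL FROM record WHERE crawled = 0");
        while (rs.next()) {                                   // the cursor starts before the first row;
            System.out.println(rs.getString("URL"));          // each next() advances it, returning false at the end
        }
        rs.close();
        stmt.close();
    }
}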
The getByString() method in httpGet.java is responsible for sending a request to the given URL and downloading the page content.
HttpGet httpget = new HttpGet(url);
System.out.println("executing request " + httpget.getURI());

ResponseHandler<String> responseHandler = new ResponseHandler<String>() {

    public String handleResponse(final HttpResponse response) throws ClientProtocolException, IOException {
        int status = response.getStatusLine().getStatusCode();
        if (status >= 200 && status < 300) {
            HttpEntity entity = response.getEntity();
            return entity != null ? EntityUtils.toString(entity) : null;
        } else {
            throw new ClientProtocolException("Unexpected response status: " + status);
        }
    }
};
String responseBody = httpclient.execute(httpget, responseHandler);
This code comes from the sample shipped with the HttpClient component of HttpComponents and can be used as-is in many situations. It produces a string, responseBody, which holds the full text of the page.
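One optional hardening step that is not in the original code: HttpClient 4.3 also lets you attach connect and socket timeouts (and a custom User-Agent) so that a single slow page cannot stall the whole crawl. A sketch:

import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

class TimeoutSketch {
    static CloseableHttpClient buildClient() {
        // 10-second connect/read timeouts; the values are arbitrary examples
        RequestConfig config = RequestConfig.custom()
                .setConnectTimeout(10000)
                .setSocketTimeout(10000)
                .build();
        return HttpClients.custom().setDefaultRequestConfig(config).build();
    }

    static HttpGet buildRequest(String url) {
        HttpGet httpget = new HttpGet(url);
        // a User-Agent header makes the crawler easier to identify in server logs
        httpget.setHeader("User-Agent", "WPCrawler/0.1 (+http://johnhany.net/)");
        return httpget;
    }
}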
Next, responseBody is passed to the parseFromString() method of parsePage.java to extract the links.
Parser parser = new Parser(content);
HasAttributeFilter filter = new HasAttributeFilter("href");

try {
    NodeList list = parser.parse(filter);
    int count = list.size();

    //process every link on this page
    for (int i = 0; i < count; i++) {
        Node node = list.elementAt(i);
        if (node instanceof LinkTag) {
In an HTML document, links usually live in the href attribute of an a tag, so an attribute filter is created for "href". The NodeList holds all the DOM nodes of the HTML document; by walking through the nodes in the for loop and keeping only the tags we care about, every link on the page can be extracted.
nextlink.startsWith() then narrows things down further: only links beginning with "http://johnhany.net/" are processed, and links beginning with "http://johnhany.net/wp-content/" are skipped.
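Note that HasAttributeFilter("href") also matches tags such as link and base, which is why the instanceof LinkTag check is needed afterwards. If you prefer, the filtering can be pushed entirely into the filter itself; a sketch using the same HTML Parser library:

import org.htmlparser.Parser;
import org.htmlparser.filters.AndFilter;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;

class FilterSketch {
    // keep only <a ... href="..."> nodes, so no instanceof check is needed afterwards
    static NodeList anchorsWithHref(String content) throws Exception {
        Parser parser = new Parser(content);
        AndFilter filter = new AndFilter(new TagNameFilter("a"), new HasAttributeFilter("href"));
        return parser.parse(filter);
    }
}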
sql = "SELECT * FROM record WHERE URL = '" + nextlink + "'";
stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_UPDATABLE);
rs = stmt.executeQuery(sql);

if (rs.next()) {

} else {
    //if the link does not exist in the database, insert it
    sql = "INSERT INTO record (URL, crawled) VALUES ('" + nextlink + "',0)";
    pstmt = conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS);
    pstmt.execute();
The record table is queried to see whether the link already exists. If it does (rs.next() == true), nothing is done; if it does not (rs.next() == false), the address is inserted into the table with crawled set to false. Because recordID was declared AUTO_INCREMENT earlier, Statement.RETURN_GENERATED_KEYS is used so that the proper number is assigned.
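If the generated recordID is actually needed (say, for logging), it can be read back through getGeneratedKeys(). The following is only a sketch; it also shows the parameterized form of the INSERT, which sidesteps quoting problems in URLs:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

class GeneratedKeySketch {
    static int insertAndGetId(Connection conn, String nextlink) throws Exception {
        // a parameterized statement avoids problems with quotes in the URL
        String sql = "INSERT INTO record (URL, crawled) VALUES (?, 0)";
        PreparedStatement pstmt = conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS);
        pstmt.setString(1, nextlink);
        pstmt.executeUpdate();

        ResultSet keys = pstmt.getGeneratedKeys();   // holds the auto_increment value just assigned
        int recordID = keys.next() ? keys.getInt(1) : -1;
        pstmt.close();
        return recordID;
    }
}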
nextlink = nextlink.substring(mainurl.length());
if (nextlink.startsWith("tag/")) {
    tag = nextlink.substring(4, nextlink.length() - 1);
    tag = URLDecoder.decode(tag, "UTF-8");
    sql = "INSERT INTO tags (tagname) VALUES ('" + tag + "')";
    pstmt = conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS);
    pstmt.execute();
The leading "http://johnhany.net/" is stripped from the link to speed up string comparison. If the remainder starts with "tag/", the characters that follow are a tag name; this name is extracted and decoded as UTF-8 so that Chinese characters display correctly, then stored in the tags table. In the same way, you could add checks for "article/", "author/", "2013/11/" and so on to classify the other kinds of links.
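A sketch of how the same idea could be extended to the other prefixes mentioned above; the helper and its categories are purely illustrative:

class LinkClassifier {
    // classify a link by its path after "http://johnhany.net/"
    static String classify(String path) {
        if (path.startsWith("tag/")) {
            return "tag";
        } else if (path.startsWith("article/")) {
            return "article";
        } else if (path.startsWith("author/")) {
            return "author";
        } else if (path.matches("\\d{4}/\\d{2}/.*")) {   // e.g. "2013/11/..." date archives
            return "archive";
        }
        return "other";
    }
}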
Results
Here are two screenshots of the database showing part of the program's output:
The full output can be obtained here. You can compare it with the blog's sitemap to see what else would need to change to build a sitemap generator on top of this program.
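For reference, a minimal sketch of what a sitemap generator built on the record table might look like, assuming the same XAMPP defaults used earlier (this is not part of the original program):

import java.io.PrintWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SitemapSketch {
    public static void main(String[] args) throws Exception {
        Class.forName("com.mysql.jdbc.Driver");
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/crawler?useUnicode=true&characterEncoding=utf8", "root", "");
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery("SELECT URL FROM record WHERE crawled = 1");

        // dump every crawled URL as a <url> entry in a sitemap file
        PrintWriter out = new PrintWriter("sitemap.xml", "UTF-8");
        out.println("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
        out.println("<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">");
        while (rs.next()) {
            out.println("  <url><loc>" + rs.getString(1) + "</loc></url>");
        }
        out.println("</urlset>");
        out.close();
        conn.close();
    }
}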
Reposted from: http://johnhany.net/2013/11/web-crawler-using-java-and-mysql/