Web Scraping for Python Beginners

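This article walks through a scraper I wrote recently while learning Python. I have tried to make it readable for programmers with no Python background at all. You can also skip to the end of the article to see the collected data and the complete code.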

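1. Defining the goals

This scraping exercise has two goals:

1) Collect information about popular streamers on huajiao.com: each streamer's following count, follower count, like count, experience points, and similar data

2) Collect each popular streamer's broadcast history, including the start time, viewer count, and like count of every broadcast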

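2. Working out the logic

First, browse the pages of www.huajiao.com and analyze the site's structure. This yields the following:

1) Every navigation item lists live streams, not streamers' profile pages

Take "Hot Recommendations" as an example. As the figure below shows, each live page has a URL of the form http://www.huajiao.com/l/liveId, where liveId uniquely identifies a live stream, for example http://www.huajiao.com/l/52860333
[Figure: live list]

2) The live page shows the streamer's user ID, nickname, and other information

Clicking the user's nickname opens the streamer's profile page
[Figure: live page]

3) The streamer's profile page has more complete personal information

This includes the following count, follower count, like count, experience points, and so on, as well as the streamer's broadcast history. Each profile page has a URL of the form http://www.huajiao.com/user/userId, where userId uniquely identifies a streamer, for example http://www.huajiao.com/user/50647288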

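4) Program logic

Based on this analysis, the scraper can start from a live list page and collect the live ID in every live URL, that is, the liveId mentioned above.
With a live ID it can open the live page and get the user ID, the userId mentioned above.
With a userId it can open the streamer's profile page, which holds the streamer's complete personal information and broadcast history.
The concrete steps are:

a) Fetch the HTML of a live list page; I chose the "Hot Recommendations" page, http://www.huajiao.com/category/1000
b) From that HTML, filter out all live addresses of the form http://www.huajiao.com/l/liveId
c) Use each live ID to fetch the HTML of the live page and filter out the streamer's userId
d) Use the userId to fetch the streamer's profile page and filter out the following count, follower count, like count, and experience points, as well as the broadcast history
e) Write the user data and broadcast history to MySQL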
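The URL patterns that steps a) to d) rely on can be written down as small helper functions. This is only a sketch: the function names and the BASE_URL constant are made up for illustration, while the URL formats themselves come from the analysis above.

```python
BASE_URL = "http://www.huajiao.com"


def category_url(category_id):
    """URL of a live list page, e.g. category 1000 for "Hot Recommendations"."""
    return "{}/category/{}".format(BASE_URL, category_id)


def live_url(live_id):
    """URL of a single live page."""
    return "{}/l/{}".format(BASE_URL, live_id)


def user_url(user_id):
    """URL of a streamer's profile page."""
    return "{}/user/{}".format(BASE_URL, user_id)


print(category_url(1000))  # http://www.huajiao.com/category/1000
print(live_url(52860333))  # http://www.huajiao.com/l/52860333
print(user_url(50647288))  # http://www.huajiao.com/user/50647288
```

Keeping the formats in one place means that a change in the site's URL scheme only needs to be fixed once.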

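The logic above follows directly from looking at the site's pages. In actual development there is more to consider, for example:

a) The scraper runs on a schedule, so an update strategy is needed for data that has already been collected
b) The broadcast history has to be requested from the corresponding ajax endpoint, and the response decoded from JSON
c) Streamer nicknames contain emoji, so writes fail if the database uses the common "utf8" encoding
d) Extracting the live ID from a live address requires regular expression matching; I use Python's "re" library
e) To parse HTML, I use "BeautifulSoup"
f) To read from and write to MySQL, I use "pymysql"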
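Point c) comes down to byte lengths: MySQL's legacy "utf8" character set stores at most 3 bytes per character, while emoji need 4 bytes in UTF-8. The sketch below checks this in plain Python; the helper name fits_mysql_utf8 and the sample strings are made up for illustration.

```python
# MySQL's legacy "utf8" charset stores at most 3 bytes per character,
# while utf8mb4 stores up to 4.
MYSQL_UTF8_MAX_BYTES = 3


def fits_mysql_utf8(text):
    """Return True if every character needs at most 3 bytes in UTF-8."""
    return all(len(ch.encode("utf-8")) <= MYSQL_UTF8_MAX_BYTES for ch in text)


emoji = "\U0001F600"  # a grinning-face emoji

print(len("\u4e2d".encode("utf-8")))         # 3 (a common Chinese character)
print(len(emoji.encode("utf-8")))            # 4
print(fits_mysql_utf8("\u4e3b\u64ad"))       # True
print(fits_mysql_utf8("\u4e3b\u64ad" + emoji))  # False
```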

With the logic worked out, what remains is coding: implementing the steps above in Python.

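3. Writing the Python code

1) Table design

[Figure: table design]
Tbl_Huajiao_User stores the streamers' personal data and Tbl_Huajiao_Live stores their broadcast history. The FScrapedTime field holds the time of each record's latest update, which makes a simple update strategy possible.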

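2) Extracting the list of live IDs from the live list page

    # Collect the live IDs linked from a live list page.
    def filterLiveIds(url):
        html = urlopen(url)
        liveIds = set()  # a set drops duplicate IDs
        bsObj = BeautifulSoup(html, "html.parser")
        # Only <a> tags whose href starts with "/l/" point to live pages.
        for link in bsObj.findAll("a", href=re.compile("^(/l/)")):
            if 'href' in link.attrs:
                newPage = link.attrs['href']
                liveId = re.findall("[0-9]+", newPage)
                if liveId:  # skip links that contain no numeric ID
                    liveIds.add(liveId[0])
        return liveIds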

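Defining a function in Python works as the code above shows: use "def" and a colon, with no braces; indentation marks the function body. urlopen(url) is a function from Python's standard library and must be imported first:

    from urllib.request import urlopen

BeautifulSoup is a third-party Python library that makes parsing HTML easy. Its findAll() method finds all a tags and accepts regular expressions, so I pass re.compile("^(/l/)") as the href argument to match every link address that starts with "/l/". The result of bsObj.findAll("a", href=re.compile("^(/l/)")) is a list, so a for loop iterates over its elements. Inside the loop, re.findall("[0-9]+", newPage) extracts the liveId, which is stored temporarily in liveIds; liveIds is then returned to the caller.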
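To try the link-filtering idea without a network connection or any third-party package, here is a minimal sketch that swaps BeautifulSoup for the standard library's html.parser. It is not the article's implementation: the class name, the function name, and the sample markup are all made up for illustration.

```python
import re
from html.parser import HTMLParser

# Matches hrefs such as "/l/52860333" and captures the numeric ID.
LIVE_HREF = re.compile(r"^/l/([0-9]+)")


class LiveLinkParser(HTMLParser):
    """Collects the numeric IDs from <a href="/l/..."> links."""

    def __init__(self):
        super().__init__()
        self.live_ids = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        match = LIVE_HREF.match(href)
        if match:
            self.live_ids.add(match.group(1))


def filter_live_ids(html):
    """Return the set of live IDs linked from the given HTML text."""
    parser = LiveLinkParser()
    parser.feed(html)
    return parser.live_ids


# Hypothetical markup, made up for this example.
sample = """
<a href="/l/52860333">live one</a>
<a href="/l/52860333">same live again</a>
<a href="/user/50647288">a profile link</a>
<a href="/l/123">live two</a>
"""
print(sorted(filter_live_ids(sample)))  # ['123', '52860333']
```

The set removes the duplicate link, and the profile link is ignored because its href does not start with "/l/".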

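3) Extracting the streamer ID from the live page

    # Get the user ID from a live page.
    def getUserId(liveId):
        html = urlopen("http://www.huajiao.com/" + "l/" + str(liveId))
        bsObj = BeautifulSoup(html, "html.parser")
        # The page title contains the streamer's ID as text.
        text = bsObj.title.get_text()
        res = re.findall("[0-9]+", text)
        return res[0]  # the first run of digits in the title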

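BeautifulSoup is again used to parse the HTML of the live page. bsObj.title.get_text() returns the text that contains the streamer's ID, and a regular expression extracts the final userId from it.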
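If a title ever contains no digits, indexing the result of re.findall() with [0] raises an IndexError. A possible defensive variant is sketched below; the function name first_number and the sample titles are hypothetical, and the real page titles may look different.

```python
import re


def first_number(text):
    """Return the first run of digits in text, or None if there is none."""
    match = re.search(r"[0-9]+", text)
    return match.group(0) if match else None


# Hypothetical title strings, made up for this example.
print(first_number("Streamer 50647288 is live"))  # 50647288
print(first_number("Page not found"))             # None
```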

4) Getting personal information from the streamer's profile page via userId

    # Get user data from the user's profile page.
    def getUserData(userId):
        html = urlopen("http://www.huajiao.com/user/" + str(userId))
        bsObj = BeautifulSoup(html, "html.parser")
        data = dict()
        try:
            userInfoObj = bsObj.find("div", {"id": "userInfo"})
            data['FAvatar'] = userInfoObj.find("div", {"class": "avatar"}).img.attrs['src']
            # Use a separate name so the userId parameter is not overwritten.
            userIdText = userInfoObj.find("p", {"class": "user_id"}).get_text()
            data['FUserId'] = re.findall("[0-9]+", userIdText)[0]
            # get_text('|', strip=True) joins the text of nested tags with '|'.
            tmp = userInfoObj.h3.get_text('|', strip=True).split('|')
            data['FUserName'] = tmp[0]
            data['FLevel'] = tmp[1]
            tmp = userInfoObj.find("ul", {"class": "clearfix"}).get_text('|', strip=True).split('|')
            # The counts sit at every other position in the list.
            data['FFollow'] = tmp[0]
            data['FFollowed'] = tmp[2]
            data['FSupported'] = tmp[4]
            data['FExperience'] = tmp[6]
            return data
        except AttributeError:
            # Raised when an expected tag is missing and find() returned None.
            print(str(userId) + ":html parse error in getUserData()")
            return 0

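The code above uses Python's try-except exception handling. When BeautifulSoup parses the HTML, an expected element is sometimes missing and an error is raised; unless the error is handled, the whole program stops. Here we print a log line that records the affected userId. The rest relies on BeautifulSoup's conveniences, such as get_text(), which extracts the text of several tags at once, lets you specify a separator, and strips whitespace.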
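The same handle-the-error-and-carry-on pattern can be tried on the '|'-separated text alone, without BeautifulSoup. The sketch below is an illustration, not the article's code: the function name parse_counts and the labels in the sample text are invented, and only the positions 0, 2, 4, 6 and the field names are taken from getUserData().

```python
def parse_counts(text):
    """Split '|'-separated text and read the numbers at positions 0, 2, 4, 6.

    Returns a dict, or None when a field is missing or is not a number.
    """
    parts = text.split("|")
    fields = ["FFollow", "FFollowed", "FSupported", "FExperience"]
    try:
        return {name: int(parts[i * 2]) for i, name in enumerate(fields)}
    except (IndexError, ValueError):
        # A short or malformed string is reported as None instead of crashing.
        return None


# Hypothetical text, shaped like the output of get_text('|', strip=True).
good = "12|Following|3400|Followers|560|Likes|7800|Experience"
print(parse_counts(good))
# {'FFollow': 12, 'FFollowed': 3400, 'FSupported': 560, 'FExperience': 7800}
print(parse_counts("12|Following"))  # None
```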

5) Writing the personal information to MySQL

    # Insert or update one user's data.
    def replaceUserData(data):
        conn = getMysqlConn()
        cur = conn.cursor()
        try:
            cur.execute("USE wanghong")
            cur.execute("set names utf8mb4")  # needed for emoji in nicknames
            cur.execute(
                "REPLACE INTO Tbl_Huajiao_User(FUserId,FUserName,FLevel,FFollow,"
                "FFollowed,FSupported,FExperience,FAvatar,FScrapedTime) "
                "VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s)",
                (int(data['FUserId']), data['FUserName'], int(data['FLevel']),
                 int(data['FFollow']), int(data['FFollowed']), int(data['FSupported']),
                 int(data['FExperience']), data['FAvatar'], getNowTime())
            )
            conn.commit()
        except pymysql.err.InternalError as e:
            print(e)

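This uses the third-party Python library pymysql to read from and write to MySQL. The utf8mb4 encoding is specified to avoid the emoji problem mentioned earlier: if the database uses the common setting "CHARSET=utf8 COLLATE=utf8_general_ci", the write fails with an error. Note that the SQL above also declares the utf8mb4 character set.
"INSERT" is not used here; "REPLACE" is used instead, so that when a record with the same FUserId is written it replaces the existing one (this relies on FUserId being a primary or unique key). That way each scheduled run of the scraper updates the table with the latest data.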
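The effect of REPLACE can be seen in a small experiment. The sketch below makes two loud assumptions: it uses the standard library's sqlite3 (which also accepts REPLACE INTO) in place of MySQL so that it runs anywhere, and it cuts the table down to three columns with guessed types. The function name replace_user and the timestamps are invented.

```python
import sqlite3

# Assumptions: sqlite3 stands in for MySQL, and the table is reduced to
# three columns with guessed types.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Tbl_Huajiao_User ("
    "FUserId INTEGER PRIMARY KEY, FFollowed INTEGER, FScrapedTime TEXT)"
)


def replace_user(conn, user_id, followed, scraped_time):
    """Insert the row, or replace the existing row with the same FUserId."""
    conn.execute(
        "REPLACE INTO Tbl_Huajiao_User (FUserId, FFollowed, FScrapedTime) "
        "VALUES (?, ?, ?)",
        (user_id, followed, scraped_time),
    )
    conn.commit()


# Two scraper runs see the same user with a changed follower count.
replace_user(conn, 50647288, 1000, "2017-01-01 10:00:00")
replace_user(conn, 50647288, 1200, "2017-01-01 10:01:00")

rows = conn.execute("SELECT * FROM Tbl_Huajiao_User").fetchall()
print(rows)  # [(50647288, 1200, '2017-01-01 10:01:00')]
```

Only one row remains after the second write, and it carries the newer values and the newer scrape time.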

6) Getting a streamer's broadcast history

    # Get a user's broadcast history from the ajax endpoint.
    def getUserLives(userId):
        try:
            url = "http://webh.huajiao.com/User/getUserFeeds?fmt=json&uid=" + str(userId)
            html = urlopen(url).read().decode('utf-8')
            jsonData = json.loads(html)
            # A non-zero errno means the endpoint reported an error.
            if jsonData['errno'] != 0:
                print(str(userId) + ": error occurred in getUserFeeds for: " + jsonData['msg'])
                return []  # an empty list keeps the caller's for loop safe
            return jsonData['data']['feeds']
        except Exception as e:
            print(e)
            return []

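As mentioned earlier, the broadcast history is fetched by requesting the ajax endpoint directly. The url in the code is that endpoint's address, which I found with the browser's developer tools. The response is decoded from JSON.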
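The JSON handling can be exercised offline by separating it from the network request. In the sketch below, the function name parse_feeds is made up, and so is the sample response: only the keys errno, msg, data, feeds, and feed appear in the article's code, while the field inside "feed" is invented.

```python
import json


def parse_feeds(body):
    """Decode a response body and return its list of feeds ([] on error)."""
    try:
        payload = json.loads(body)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return []
    if payload.get("errno") != 0:
        return []
    return payload.get("data", {}).get("feeds", [])


# Hypothetical response bodies, made up for this example.
sample = '{"errno": 0, "msg": "", "data": {"feeds": [{"feed": {"title": "demo"}}]}}'
print(parse_feeds(sample))                        # [{'feed': {'title': 'demo'}}]
print(parse_feeds('{"errno": 1, "msg": "bad"}'))  # []
print(parse_feeds("not json"))                    # []
```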

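7) Writing the streamer's broadcast history to MySQL

This is similar to step 5 above, so I will not go into detail. The complete code is available at the GitHub address at the end of the article.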

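8) Defining the skeleton functions

    # Scrape user data for every live stream on the recommended page.
    def spiderUserDatas():
        # getLiveIdsFromRecommendPage() is not shown in this article.
        for liveId in getLiveIdsFromRecommendPage():
            userId = getUserId(liveId)
            userData = getUserData(userId)
            if userData:  # getUserData() returns 0 on a parse error
                replaceUserData(userData)
        return 1

    # Scrape the broadcast history of stored users.
    def spiderUserLives():
        userIds = selectUserIds(100)
        for userId in userIds:
            liveDatas = getUserLives(userId[0])
            if not liveDatas:  # request failed or no history: skip this user
                continue
            for liveData in liveDatas:
                liveData['feed']['FUserId'] = userId[0]
                replaceUserLive(liveData['feed'])
        return 1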

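The skeleton functions drive the small single-purpose functions and implement the loop logic, collecting data page by page.
spiderUserDatas(): after obtaining the list of liveIds, loop over it to get the userId for each liveId, then fetch the userData and write it to MySQL.
spiderUserLives(): select from MySQL the 100 userIds with the latest last-scrape time, then loop over them to fetch each user's broadcast history and write it to MySQL.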
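The control flow of spiderUserDatas() can be tested without a network or a database by passing the small functions in as parameters. The sketch below is one possible arrangement, not the article's code; every name in it, including the stand-in functions, is invented for illustration. It also skips a live stream whose lookup fails instead of stopping the whole run.

```python
def spider_user_datas(live_ids, get_user_id, get_user_data, save):
    """Run the liveId -> userId -> userData -> save chain; return saved count."""
    saved = 0
    for live_id in live_ids:
        try:
            user_id = get_user_id(live_id)
            user_data = get_user_data(user_id)
        except Exception as exc:  # one bad page should not stop the run
            print("skipping live", live_id, "because of", repr(exc))
            continue
        if user_data:
            save(user_data)
            saved += 1
    return saved


# Stand-in functions, invented so the example runs without network or MySQL.
fake_users = {"11": "u1", "22": "u2"}
store = []


def fake_get_user_id(live_id):
    return fake_users[live_id]  # raises KeyError for an unknown live ID


def fake_get_user_data(user_id):
    return {"FUserId": user_id}


count = spider_user_datas(
    ["11", "33", "22"], fake_get_user_id, fake_get_user_data, store.append
)
print(count, store)  # 2 [{'FUserId': 'u1'}, {'FUserId': 'u2'}]
```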

9) Defining the entry function and command-line arguments

    import sys

    def main(argv):
        # argv[0] is the script name; argv[1] selects the action.
        if len(argv) < 2:
            print("Usage: python3 huajiao.py [spiderUserDatas|spiderUserLives|getUserCount|getLiveCount]")
            sys.exit()
        if argv[1] == 'spiderUserDatas':
            spiderUserDatas()
        elif argv[1] == 'spiderUserLives':
            spiderUserLives()
        elif argv[1] == 'getUserCount':
            print(getUserCount())
        elif argv[1] == 'getLiveCount':
            print(getLiveCount())
        else:
            print("Usage: python3 huajiao.py [spiderUserDatas|spiderUserLives|getUserCount|getLiveCount]")

    if __name__ == '__main__':
        main(sys.argv)

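First, you need to understand how Python receives arguments on the command line: through sys.argv.
Second, the meaning of __name__: when the file is executed as a script, the value of __name__ is "__main__".
With the code above, the script can be invoked from the command line and handle its arguments.
For example, to scrape the streamers' personal information, run:

    python3 huajiao.py spiderUserDatas

To see how many user records have been scraped, run:

    python3 huajiao.py getUserCount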
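As the number of commands grows, the if/elif chain in main() could be replaced by a lookup table. This is an alternative sketch rather than the article's code: the COMMANDS dictionary is made up, and the two command functions have placeholder bodies so that the example runs on its own.

```python
def spiderUserDatas():
    return "scraping user data"  # placeholder body for this example


def getUserCount():
    return 0  # placeholder body for this example


# Maps each command-line argument to the function that handles it.
COMMANDS = {
    "spiderUserDatas": spiderUserDatas,
    "getUserCount": getUserCount,
}


def main(argv):
    """Look up argv[1] in COMMANDS, run it, and return an exit status."""
    if len(argv) < 2 or argv[1] not in COMMANDS:
        print("Usage: python3 huajiao.py [" + "|".join(COMMANDS) + "]")
        return 1
    print(COMMANDS[argv[1]]())
    return 0


# Simulated command lines; a real script would pass sys.argv instead.
print(main(["huajiao.py", "getUserCount"]))  # prints 0 (the count), then 0 (the status)
print(main(["huajiao.py"]))                  # prints the usage line, then 1
```

Adding a command then means adding one dictionary entry, and the usage message stays in sync automatically.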

10) Adding the jobs to crontab

    */1 * * * * python3 /root/PythonPractice/spiderWanghong/huajiao.py spiderUserDatas >> /tmp/huajiao.py_spiderUserDatas.log
    */1 * * * * python3 /root/PythonPractice/spiderWanghong/huajiao.py spiderUserLives >> /tmp/huajiao.py_spiderUserLives.log

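4. Goals achieved

5. Improvements and next steps

Optimize the MySQL read and write code, which is rather bloated at the moment

Analyze other live streaming sites and collect their data

Aggregate the data from the various live streaming sites

6. Code repository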
