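【Python Study Notes 1】 A First Look at Web Scraping with Python

Posted by 工匠小能手 on 2018-10-28

Original URL: https://blog.csdn.net/qq_39295735/article/details/83472755

Tags: Python, notes, web scraping

The material in this post comes from Wei Wei's (韋瑋) book 《精通Python網路爬蟲》 and is kept only as personal study notes.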

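【Objective】

Scrape the title of a web page in order to understand the basic principles and code of a web crawler, and to learn the basics of regular expressions and XPath.

【Study Notes】

1. Regular expression basics

A detailed regular expression tutorial is available at http://www.runoob.com/regexp/regexp-syntax.html (it applies regardless of programming language).

Python's global matching call is re.compile(pattern).findall(source_string), which requires import re. An example follows below.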

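Item	Explanation
* : the preceding character occurs 0, 1, or many times (greedy)	Greedy mode: match as much as possible. Example with the source string "opennnn8765opeeennyourmind": the pattern "open*" matches 'opennnn' and 'ope'; the pattern "o.*n" matches 'opennnn8765opeeennyourmin'.
? : the preceding character occurs 0 or 1 times; combined as *? it makes the repetition lazy	Lazy mode: match as little as possible. Example with the same source string: the pattern "open?" matches 'open' and 'ope'; the pattern "o.*?n" matches 'open', 'opeeen', and 'ourmin', because each match stops at the first n it can reach.
.*? , often wrapped in parentheses as (.*?)	'.' matches any character except a newline (with the re.S flag it matches newlines too), so by default this expression matches content within a single line. Example: if one line of a page's source is <title>welcome to china</title> and pad = "<title>(.*?)</title>", the match result is "welcome to china".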

#!/usr/bin/python3
#-*- coding: utf-8 -*-

import re

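# Regular expression experiments
testStr="opennnn8765opeeennyourmind"
testpad1="open*"    # greedy: 'ope' followed by as many 'n' as possible
testpad2="open?"    # 'ope' followed by at most one 'n'
testpad3="open*?"   # lazy: 'ope' followed by as few 'n' as possible
testpad4="o.*n"     # greedy: from the first 'o' to the last 'n'
testpad5="o.*?n"    # lazy: from an 'o' to the nearest following 'n'
# re.S lets '.' match newline characters as well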
testRet1=re.compile(testpad1, re.S).findall(testStr)
testRet2=re.compile(testpad2, re.S).findall(testStr)
testRet3=re.compile(testpad3, re.S).findall(testStr)
testRet4=re.compile(testpad4, re.S).findall(testStr)
testRet5=re.compile(testpad5, re.S).findall(testStr)
print("test1:"+str(testRet1))
print("test2:"+str(testRet2))
print("test3:"+str(testRet3))
print("test4:"+str(testRet4))
print("test5:"+str(testRet5))

#============ Output ====================
'''
test1:['opennnn', 'ope']
test2:['open', 'ope']
test3:['ope', 'ope']
test4:['opennnn8765opeeennyourmin']
test5:['open', 'opeeen', 'ourmin']

'''
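
The table above says that (.*?) is usually paired with parentheses to pull out one piece of a line. The following is a minimal sketch of that idea, and of what the re.S flag used in the script changes. The HTML string is a made-up example, with the title deliberately split across two lines.

```python
import re

# Hypothetical HTML snippet; the title spans two lines on purpose.
html = "<html><head><title>welcome\nto china</title></head></html>"

pattern = "<title>(.*?)</title>"

# Without re.S, '.' does not match the newline, so no match is found.
without_flag = re.compile(pattern).findall(html)

# With re.S, '.' also matches newlines. Because the pattern has one
# capture group, findall returns only the text inside the parentheses.
with_flag = re.compile(pattern, re.S).findall(html)

print(without_flag)
print(with_flag)
```

When a pattern contains exactly one group, findall returns the group contents rather than the whole match, which is why the surrounding <title> tags do not appear in the result.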

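2. XPath basics

Besides regular expressions, there are other handy tools for extracting content from a web page, such as XPath expressions and Beautiful Soup. With limited time, I only tried the basic XPath expressions this round.

For a detailed introduction to XPath, consult this reference when needed: http://www.w3school.com.cn/xpath/index.asp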

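Expression	Purpose
/tagname	Selects level by level, e.g. /html/head/title
//tagname	Selects all matching tags, e.g. //div selects every <div> tag
text()	Selects the text between a tag's opening and closing tags; e.g. for <title>helloworld</title>, use /html/head/title/text() or //title/text()
//tagname[@attribute='value']	Selects tags whose attribute has the given value; e.g. for <div class="tool">the content i want</div>, to extract "the content i want" use //div[@class='tool']/text()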

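I have not yet learned how to use XPath in actual code, so I am leaving this as an open question: "How do you call XPath from Python to simplify the matching rules?"

Reference example: https://www.cnblogs.com/lei0213/p/7506130.html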
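
One possible answer to the open question, as a sketch rather than the book's method: the standard library module xml.etree.ElementTree understands a limited subset of XPath. It has no text() step, so the text is read from the .text attribute instead, and it only parses well-formed markup. The page string below is a made-up example. For real-world HTML a third-party parser such as lxml is commonly used, since it supports full XPath expressions like //title/text(); that is not shown here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page used only for illustration.
page = """<html>
  <head><title>helloworld</title></head>
  <body>
    <div class="tool">the content i want</div>
    <div class="other">something else</div>
  </body>
</html>"""

root = ET.fromstring(page)  # root is the <html> element

# Paths are relative to the root element, so /html/head/title
# becomes "head/title" here.
title = root.find("head/title").text

# Rough equivalent of //div[@class='tool']/text()
tool_texts = [div.text for div in root.findall(".//div[@class='tool']")]

# Rough equivalent of //div
all_divs = root.findall(".//div")

print(title)
print(tool_texts)
print(len(all_divs))
```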

3. Download a web page and extract its title

Things to note

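Keyword	Explanation
Regular expression <title>(.*?)</title>	Finds the content between <title> and </title>; the part to extract is wrapped in (). In ".*?", '.' matches any character, '*' repeats the preceding item zero or more times, and '?' makes the repetition lazy, so the match stops at the first </title> instead of the last one. Requires: import re
Fetching a page with urllib	The URL must be complete, including http or https. Example: jdUrl = "https://www.jd.com" and saveFile = "C:\\Users\\xiaoniu\\Desktop\\temp\\jd.html". Read the page into memory: Webfile=urllib.request.urlopen(jdUrl).read().decode("utf-8","ignore"). Save the page to a file: urllib.request.urlretrieve(jdUrl, filename=saveFile). Requires: import urllib.request
Which slash to use in paths	In Python, a Windows file path must be written with doubled backslashes (\\) or with forward slashes, the opposite of the slash in the copied path. If the Windows path is C:\Users\xiaoniu, the path variable should be path="C:\\Users\\xiaoniu" or path="C:/Users/xiaoniu". I recommend the first form because it makes it explicit that the string is a path; the second is easy to overlook and harder to track down.
Disguising the crawler as a browser	1. Press F12 (Fn+F12 on some laptops) to open the developer tools, open the Network tab, refresh the page, and click the request for the page URL. The request headers contain a "User-Agent" field that tells the site what kind of client opened the URL, so the crawler sets this field to look like a browser. 2. Rotate among several browser identities, switching to a different one on each request, so that the site is less likely to see a single client and identify it as a crawler. 3. Set the header with build_opener() or add one with add_header(). For details, see section 4.3, "瀏覽器模擬-header屬性", of Wei Wei's 《精通Python網路爬蟲》. Example code follows.

import random
import urllib.request

# Example target (the JD home page used elsewhere in these notes);
# the original snippet left url undefined.
url = "https://www.jd.com"

uapools=[
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.79 Safari/537.36 Edge/14.14393",
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.22 Safari/537.36 SE 2.X MetaSr 1.0",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)",
]

def UA():
    # Install a global opener that sends a randomly chosen User-Agent
    opener=urllib.request.build_opener()
    thisua=random.choice(uapools)
    ua=("User-Agent",thisua)
    opener.addheaders=[ua]
    urllib.request.install_opener(opener)
    print("Current UA: "+str(thisua))

for i in range(0,10):
    UA()  # switch User-Agent before every request
    data=urllib.request.urlopen(url).read().decode("utf-8","ignore")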
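
The note on path slashes in the table above can be checked without touching the file system. This sketch compares the two spellings from the notes with a third option, a raw string, which is not in the original notes. It also shows why a single backslash is risky. The short path C:\temp is a made-up example, chosen because \t is the escape for a tab.

```python
from pathlib import PureWindowsPath

# Spellings of the same Windows directory.
escaped = "C:\\Users\\xiaoniu\\Desktop\\temp"   # doubled backslashes
forward = "C:/Users/xiaoniu/Desktop/temp"       # forward slashes
raw = r"C:\Users\xiaoniu\Desktop\temp"          # raw string (not in the original notes)

# The doubled-backslash and raw-string forms are the same string.
same_string = escaped == raw

# PureWindowsPath treats '/' and '\' as equivalent separators,
# and works on any operating system because it never touches the disk.
same_location = PureWindowsPath(forward) == PureWindowsPath(escaped)

# A single backslash can silently turn into an escape: "\t" is a tab.
broken = "C:\temp"

print(same_string, same_location)
print(len(broken), len(r"C:\temp"))
```

The last line prints different lengths because the unescaped version contains a tab character in place of the backslash and the letter t.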

Hands-on example:

#!/usr/bin/python3
#-*- coding: utf-8 -*-

import re
import urllib.request

# Home page URLs of JD and Qiushibaike
jdUrl = "https://www.jd.com"
jsbkUrl = "https://www.qiushibaike.com/"


# Fetch the JD home page and extract its title
data=urllib.request.urlopen(jdUrl).read().decode("utf-8","ignore")
print(len(data))
pad1 = "<title>(.*?)</title>"
title = re.compile(pad1, re.S).findall(data)
print(title)


# Save the JD home page to a local file
jd = "C:\\Users\\xiaoniu\\Desktop\\temp\\jd.html"
urllib.request.urlretrieve(jdUrl, filename=jd)

# Disguise the crawler as a browser
opener=urllib.request.build_opener()
UA=("User-Agent","Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0")
opener.addheaders=[UA]
urllib.request.install_opener(opener)

# Download the Qiushibaike home page
jsbk = "C:\\Users\\xiaoniu\\Desktop\\temp\\jsbk.html"
urllib.request.urlretrieve(jsbkUrl, filename=jsbk)

# Download page 5 of the Qiushibaike posts. Compare the URLs of page 1 and page 5
# to work out the URL pattern needed to fetch page x
page5Url="https://www.qiushibaike.com/8hr/page/5/"
jsbkp5="C:\\Users\\xiaoniu\\Desktop\\temp\\jsbkp5.html"
urllib.request.urlretrieve(page5Url, filename=jsbkp5)

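Running the script saves the JD home page, the Qiushibaike home page, and page 5 of Qiushibaike's hot posts to local files at the specified paths.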
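
The script hard-codes the page-5 address and installs one global opener. As a sketch of the two ideas mentioned in the notes, working out the address of page x and using add_header(), the code below builds one Request object per page without sending anything. The URL template is inferred from the page-5 address and is an assumption (the site may use a different scheme for other pages or may have changed since). The names PAGE_URL_TEMPLATE, UA_POOL, and build_request are hypothetical.

```python
import random
import urllib.request

# Assumed pattern, inferred from the page-5 URL in the script above.
PAGE_URL_TEMPLATE = "https://www.qiushibaike.com/8hr/page/{page}/"

# User-Agent strings copied from the notes.
UA_POOL = [
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.22 Safari/537.36 SE 2.X MetaSr 1.0",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)",
]

def build_request(page):
    """Return a Request for the given page with a randomly chosen User-Agent."""
    req = urllib.request.Request(PAGE_URL_TEMPLATE.format(page=page))
    req.add_header("User-Agent", random.choice(UA_POOL))
    return req

# Build requests for pages 1 to 5; nothing is sent over the network here.
requests_to_send = [build_request(p) for p in range(1, 6)]
for req in requests_to_send:
    # Request stores header names capitalized, hence "User-agent".
    print(req.full_url, "|", req.get_header("User-agent"))

# To actually download a page (network access required):
# html = urllib.request.urlopen(requests_to_send[0]).read().decode("utf-8", "ignore")
```

Setting the header on each Request keeps the User-Agent choice local to that request, whereas install_opener() changes the default for every later urlopen() call in the process.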
