Building a Simple Crawler in Python (Study Notes)

Posted by wsAdmin on 2019-08-05

Original URL: https://learnku.com/articles/32086?order_by=created_at&


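1. The URL manager maintains the set of URLs waiting to be crawled and the set of URLs already crawled.
2. Purpose: prevents fetching the same URL twice and prevents crawl loops between pages that link to each other.
3. Implementations: in memory; in a relational database (MySQL); in a non-relational store (Redis).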
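The in-memory option can be sketched with two Python sets, one per collection. This is only a minimal illustration of the idea; the class name `UrlManager`, its method names, and the example.com URLs are assumptions made for the example, not something defined in these notes.

```python
class UrlManager:
    """In-memory URL manager: one set of pending URLs, one set of crawled URLs."""

    def __init__(self):
        self.new_urls = set()  # waiting to be crawled
        self.old_urls = set()  # already crawled

    def add_new_url(self, url):
        # Ignore empty values and anything already queued or already crawled.
        if not url or url in self.new_urls or url in self.old_urls:
            return False
        self.new_urls.add(url)
        return True

    def add_new_urls(self, urls):
        for url in urls:
            self.add_new_url(url)

    def has_new_url(self):
        return bool(self.new_urls)

    def get_new_url(self):
        # Move one URL from the pending set to the crawled set and return it.
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url


manager = UrlManager()
manager.add_new_urls([
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/a",  # duplicate, silently ignored
])
first = manager.get_new_url()
manager.add_new_url(first)  # rejected: this URL has already been crawled
```

Because a URL is checked against both sets before it is queued, a page that links back to an already crawled page cannot put the crawler into a loop. A MySQL or Redis backend would keep the same two collections but store them outside the process.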

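1. The web page downloader is a tool that downloads the page behind a URL from the internet to the local machine.
2. Purpose: fetches the specified page by its URL, then either saves it to a local file or keeps it in memory as a string.
3. Types:
    ① urllib2 (a basic module shipped with Python 2; Python 3 moved it into urllib.request);
    ② requests (a third-party package that offers more powerful features)
4. Three ways to fetch a page with urllib2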

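import cookielib
import urllib2  # Python 2 only

url = "https://www.baidu.com"

print "Method 1"
res_1 = urllib2.urlopen(url)  # request the URL directly
print res_1.getcode()  # print the response status code
print len(res_1.read())  # print the length of the fetched page

print "Method 2"
request = urllib2.Request(url)  # create a request object
request.add_header('user-agent', "Mozilla/5.0")  # add a spoofed browser User-Agent header
res_2 = urllib2.urlopen(request)  # send the request
print res_2.getcode()  # print the response status code
print len(res_2.read())  # print the length of the fetched page

print "Method 3"
cookie_content = cookielib.CookieJar()  # create a cookie container
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_content))  # build an opener backed by the cookie container
urllib2.install_opener(opener)  # install the opener so that urllib2.urlopen can handle cookies
res_3 = urllib2.urlopen(url)
print res_3.getcode()  # print the response status code
print res_3.read()  # print the content of the fetched page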
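The snippet above is Python 2. As a rough sketch of how the same three methods might look on Python 3, where `urllib2` became `urllib.request` and `cookielib` became `http.cookiejar`, consider the following. The helper names `build_request`, `build_cookie_opener`, `fetch` and `main` are hypothetical and exist only for this example.

```python
import http.cookiejar
import urllib.request

URL = "https://www.baidu.com"


def build_request(url):
    """Method 2: a Request object carrying a browser-like User-Agent."""
    request = urllib.request.Request(url)
    request.add_header("User-Agent", "Mozilla/5.0")
    return request


def build_cookie_opener():
    """Method 3: an opener that stores and resends cookies via a CookieJar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def fetch(target, opener=None, timeout=10):
    """Open a URL string or a Request; return (status code, body bytes)."""
    open_fn = opener.open if opener is not None else urllib.request.urlopen
    with open_fn(target, timeout=timeout) as response:
        return response.getcode(), response.read()


def main():
    """Run the three methods in turn (this performs real network requests)."""
    print(fetch(URL)[0])                      # method 1: plain URL
    print(fetch(build_request(URL))[0])       # method 2: custom header
    opener, jar = build_cookie_opener()
    status, body = fetch(URL, opener=opener)  # method 3: cookie-aware opener
    print(status, len(body), len(jar))
```

Calling `main()` sends the requests. Passing the opener to `fetch` explicitly, rather than installing it globally with `install_opener`, is a design choice of this sketch: it keeps the cookie handling local to the calls that need it.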

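1. The web page parser is a tool that extracts valuable data from a web page.
2. Kinds of parsers:
    ① regular expressions;
    ② html.parser (bundled with Python);
    ③ BeautifulSoup (third-party package);
    ④ lxml (third-party parser)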
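To make the parser's job concrete, here is a small sketch that uses only the bundled `html.parser` module to pull image addresses out of a page. The class name `ImgSrcParser` and the sample HTML are made up for illustration; the notes themselves go on to use BeautifulSoup instead.

```python
from html.parser import HTMLParser


class ImgSrcParser(HTMLParser):
    """Collects the src attribute of every <img> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


SAMPLE_HTML = """
<html><body>
  <img class="course-item-img" src="/static/a.png">
  <p>no image here</p>
  <img src="/static/b.png" alt="b">
</body></html>
"""

parser = ImgSrcParser()
parser.feed(SAMPLE_HTML)
print(parser.sources)
```

This event-driven style works without extra dependencies, but every kind of query needs its own handler logic, which is one reason a tree-based library such as BeautifulSoup tends to be more convenient.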

1. Install BeautifulSoup: pip install beautifulsoup4
2. Example:

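import urllib2  # Python 2 only
from bs4 import BeautifulSoup

# Fetch the page to be scraped
imooc = urllib2.Request("https://www.imooc.com/search/?words=python")  # create the request
imooc.add_header('user-agent', "Mozilla/5.0")  # add a spoofed browser User-Agent header
res_4 = urllib2.urlopen(imooc)  # send the request
print res_4.getcode()  # print the response status code
string_4 = res_4.read()  # read the page content into a string
# Create the BeautifulSoup object, which parses the page string into a DOM tree.
# Argument 1: the page string; argument 2: the parser to use; argument 3: the document encoding
soup = BeautifulSoup(string_4, 'html.parser', from_encoding='utf-8')
# Find all course images. Argument 1: tag name; argument 2: tag attributes to match
path = soup.find_all('img', {'class': "course-item-img"})
for link in path:
    # Print the tag name, the value of its src attribute, and its text content
    print link.name, link['src'], link.get_text()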
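The example above depends on a live page whose markup may change. For experimenting offline, the same lookups can be tried on an inline HTML string, as in this Python 3 sketch. The sample markup is invented for the example and only borrows the `course-item-img` class name from the code above.

```python
from bs4 import BeautifulSoup

SAMPLE_HTML = """
<div class="course-list">
  <a href="/learn/1"><img class="course-item-img" src="/img/1.jpg" alt="Python basics"></a>
  <a href="/learn/2"><img class="course-item-img" src="/img/2.jpg" alt="Crawlers"></a>
  <img class="banner" src="/img/banner.jpg">
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# class_ is bs4's keyword argument for filtering on the class attribute
images = soup.find_all("img", class_="course-item-img")
sources = [img["src"] for img in images]

# find() returns only the first match; .get() returns None for a missing attribute
first_link = soup.find("a")
print(sources, first_link["href"], images[0].get("title"))
```

Indexing a tag with `img["src"]` raises `KeyError` when the attribute is absent, so `.get()` is the safer choice for attributes that only some tags carry.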
