Web Scraping (03): Writing a Scraper in an Object-Oriented Style (Functions, Classes) 2020-12-14

Posted by 輝子2020 on 2020-12-14

Original URL: https://blog.csdn.net/m0_46738467/article/details/111185197

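1. A project that scrapes forum posts

We start by opening Baidu Tieba and searching for “海賊王” (One Piece), then open a couple of result pages, copy their URLs, and look for a pattern.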

https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=50
https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=100

These are the URLs of the second and third pages I opened. Comparing them reveals the following pattern:

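https://tieba.baidu.com/f?    # the base URL
kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B  # the keyword we typed, percent-encoded from its UTF-8 bytes
&ie=utf-8  # most likely declares the character encoding of the keyword
&pn=50  # a number tied to the page being viewed
# The pattern is pn = (page number - 1) × 50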
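
The pattern above can be checked with a short sketch. The names build_url and POSTS_PER_PAGE are made up for this illustration, and the step of 50 is inferred only from the two URLs shown, so treat it as an assumption rather than a documented rule.

```python
from urllib.parse import urlencode, unquote

BASE_URL = 'https://tieba.baidu.com/f?'
POSTS_PER_PAGE = 50  # assumed from the observed pn values (50 and 100)


def build_url(keyword, page):
    """Return the list-page URL for a 1-based page number."""
    params = {'kw': keyword, 'ie': 'utf-8', 'pn': (page - 1) * POSTS_PER_PAGE}
    # urlencode percent-encodes the UTF-8 bytes of each value.
    return BASE_URL + urlencode(params)


# Decoding the kw value from the copied URL recovers the keyword
# (in the simplified-Chinese spelling that the browser sent).
print(unquote('%E6%B5%B7%E8%B4%BC%E7%8E%8B'))

# Page 2 should reproduce the first URL copied from the browser.
print(build_url('海贼王', 2))
```

Passing all three parameters to urlencode in one dictionary avoids assembling the query string by hand, which is where a missing '&pn=' is easy to overlook.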

Now we can build the URL of any page we want to fetch. Here is the procedural version of the code:

# Baidu Tieba project
'''
https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=50
https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=100
'''
import urllib.request
import urllib.parse

base_url = 'https://tieba.baidu.com/f?'
start_page = int(input('Enter the start page: '))
end_page = int(input('Enter the end page: '))
key = input('Enter the topic to search for: ')
kw = {'kw':key}
kw = urllib.parse.urlencode(kw)   # percent-encode the keyword as kw=...
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36'}

for i in range(start_page,end_page+1):   # range() includes the start and excludes the stop, so add 1 to reach the end page
    pn = (i-1)*50
    url = base_url+kw+'&ie=utf-8'+'&pn='+str(pn)   # pn must be converted to a string before concatenation
    req = urllib.request.Request(url,headers=headers)
    res = urllib.request.urlopen(req)
    html = res.read().decode('utf-8')
    # The folder D:\spiderdocuments must already exist
    with open(r'D:\spiderdocuments\tieba_{}_page_{}.txt'.format(key,i),'w',encoding='utf-8') as f:
        f.write(html)
    print(f'Page {i} written successfully!')
print('Done')

After I enter the requested information, the run leaves several new files on the corresponding drive.
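
One practical detail: open() in 'w' mode creates the file but not its parent folder, so the script fails with FileNotFoundError if D:\spiderdocuments does not exist yet. A minimal sketch of a more forgiving writer follows; the helper name save_page is hypothetical and not part of the original code.

```python
from pathlib import Path


def save_page(directory, filename, html):
    """Write html to directory/filename, creating the directory if needed."""
    folder = Path(directory)
    # parents=True creates missing intermediate folders;
    # exist_ok=True makes the call a no-op when the folder is already there.
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / filename
    target.write_text(html, encoding='utf-8')
    return target
```

A call such as save_page(r'D:\spiderdocuments', 'tieba_page_1.txt', html) could then stand in for the with open(...) block.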

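2. Object-oriented programming

We can try rewriting the code above with an object-oriented mindset.

2.1 Structuring the code with functions

I roughly split the procedural code into the following steps:

Fetch the page and produce the content to be written
Write the content to a file
A main routine that builds the URLs of the pages to fetch
The entry-point check
So I can start from a skeleton like this: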

import urllib.request
import urllib.parse
def readpage(url):
    pass
def writepage(filename,html):
    pass
def main():
    pass
if __name__ == '__main__':
    main()

Next we move each piece of logic into its own function, which gives the following:

# Object-oriented style, function version
import urllib.request
import urllib.parse


def readpage(url):
    # Send a browser-like User-Agent and return the decoded page
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36'}
    req = urllib.request.Request(url, headers=headers)
    res = urllib.request.urlopen(req)
    html = res.read().decode('utf-8')
    return html


def writepage(filename, html):
    # Same name as in the skeleton above
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(html)
        print('Written successfully')


def main():
    base_url = 'https://tieba.baidu.com/f?'
    key = input('Enter what you want to search for: ')
    start_page = int(input('Enter the start page: '))
    end_page = int(input('Enter the end page: '))
    kw = {'kw': key}
    kw = urllib.parse.urlencode(kw)
    for i in range(start_page, end_page + 1):
        pn = (i - 1) * 50
        # '&pn=' is required; without it the number is glued onto 'utf-8'
        url = base_url + kw + '&ie=utf-8' + '&pn=' + str(pn)
        html = readpage(url)
        filename = r'D:\spiderdocuments\scrape_result_page_{}.txt'.format(i)
        writepage(filename, html)


if __name__ == '__main__':
    main()

After I enter the requested data, the run produces new files in the corresponding folder.
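
A side benefit of splitting the work into functions is that the loop can be exercised without touching the network or the disk. The sketch below is one possible arrangement, not the author's code: crawl, fake_fetch, and fake_save are hypothetical names, and the fetch and save parameters are assumed to have the same shape as the reader and writer functions in this section.

```python
from urllib.parse import urlencode

BASE_URL = 'https://tieba.baidu.com/f?'


def crawl(key, start_page, end_page, fetch, save):
    """Fetch pages start_page..end_page and hand each one to save."""
    kw = urlencode({'kw': key})
    for i in range(start_page, end_page + 1):
        pn = (i - 1) * 50
        url = BASE_URL + kw + '&ie=utf-8&pn=' + str(pn)
        save('page_{}.txt'.format(i), fetch(url))


# Dry run with stand-ins: nothing is downloaded or written to disk.
requested = []
saved = {}


def fake_fetch(url):
    requested.append(url)
    return '<html></html>'


def fake_save(filename, html):
    saved[filename] = html


crawl('test', 1, 3, fake_fetch, fake_save)
print(requested)
print(sorted(saved))
```

Swapping the stand-ins for the real reader and writer turns the dry run into an actual crawl without changing crawl itself.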

2.2 Structuring the code with a class

The code is below; take your time working through it:

import urllib.request
import urllib.parse

'''
Class attributes, instance attributes, class methods, instance methods (static methods).
Implement what the teacher showed in class, then optimize and extend it on your own.
'''
class BaiduSpider:
    # Put the values that are used often and never change into __init__
    def __init__(self):
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
        }
        self.base_url = 'https://tieba.baidu.com/f?'

    def readPage(self,url):
        # Request the page with the shared headers and return the decoded HTML
        req = urllib.request.Request(url, headers=self.headers)
        res = urllib.request.urlopen(req)
        html = res.read().decode('utf-8')
        return html

    def writePage(self,filename,html):
        with open(filename, 'w', encoding='utf-8') as f:
            f.write(html)
            print('Written successfully')

    def main(self):
        name = input('Enter the name of the forum: ')
        begin = int(input('Enter the start page: '))
        end = int(input('Enter the end page: '))
        kw = {'kw': name}
        result = urllib.parse.urlencode(kw)

        for i in range(begin, end + 1):
            pn = (i - 1) * 50
            url = self.base_url + result + '&pn=' + str(pn)
            # Call the instance methods
            html = self.readPage(url)
            filename = 'page_' + str(i) + '.html'   # saved in the current working directory
            self.writePage(filename, html)
if __name__ == '__main__':
    spider = BaiduSpider()
    spider.main()
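
The docstring in the class above lists class attributes, instance attributes, class methods, instance methods, and static methods, while BaiduSpider itself only uses instance attributes and instance methods. As a rough illustration of the other kinds, here is a small sketch; the class TiebaUrls and all of its members are hypothetical and only build URLs, they do not fetch anything.

```python
from urllib.parse import urlencode


class TiebaUrls:
    # Class attributes: one copy shared by every instance.
    base_url = 'https://tieba.baidu.com/f?'
    page_size = 50  # assumed from the observed pn step

    def __init__(self, name):
        # Instance attribute: each object keeps its own forum name.
        self.name = name

    @staticmethod
    def offset(page, page_size):
        """Static method: plain arithmetic, needs neither cls nor self."""
        return (page - 1) * page_size

    @classmethod
    def from_names(cls, names):
        """Class method: receives the class and builds one object per name."""
        return [cls(name) for name in names]

    def url(self, page):
        """Instance method: combines class-level and instance-level data."""
        query = {'kw': self.name, 'pn': self.offset(page, self.page_size)}
        return self.base_url + urlencode(query)


forums = TiebaUrls.from_names(['test', 'python'])
for forum in forums:
    print(forum.url(3))
```

Whether base_url belongs on the class or in __init__ is a design choice; a class attribute fits when every instance is meant to share the same value.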
