Web Scraping Example: Fetching Product Information from Taobao Pages

Posted by 夏蟲蟲 on 2020-10-08

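I. Complete code:

Song Tian's MOOC course includes an example that searches Taobao product pages. I worked through it and found that his original code no longer returns any results. The reason is that since 2019 the Taobao search page has required a login, so fetching product data means sending the request headers and cookie of a logged-in account.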

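To get the cookie, log in to Taobao, open the browser's developer tools, and copy it from the request headers shown there. (Remember to refresh the page first.)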
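One way to keep the copied cookie out of the source file is to read it from an environment variable. This is only a minimal sketch: the variable name TAOBAO_COOKIE and the helper build_headers are hypothetical names chosen for illustration and do not appear in the original code.

```python
import os

USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"
)


def build_headers(env_var="TAOBAO_COOKIE"):
    """Build request headers, reading the cookie from an environment variable.

    TAOBAO_COOKIE is a hypothetical name used only in this sketch.
    """
    cookie = os.environ.get(env_var, "")
    if not cookie:
        raise RuntimeError(f"Set {env_var} to the cookie copied from the browser")
    return {"user-agent": USER_AGENT, "cookie": cookie}
```

The returned dictionary could then be passed as the headers argument of requests.get in place of the hard-coded one.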

The complete code comes first:

import requests
import re


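def getHTMLText(url):
    """Fetch a page and return its text, or "" if the request fails."""
    try:
        # Browser identity plus the cookie of a logged-in Taobao session.
        header = {
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
            'cookie': '_samesite_flag_=t*********************kmn'
        }
        r = requests.get(url, timeout=30, headers=header)
        r.raise_for_status()  # raise HTTPError on a 4xx/5xx response
        r.encoding = r.apparent_encoding  # decode with the encoding detected from the body
        return r.text
    except requests.RequestException:
        return ""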


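def parsePage(ilt, html):
    """Extract [price, title] pairs from the page text and append them to ilt."""
    try:
        plt = re.findall(r'\"view_price\"\:\"[\d\.]*\"', html)
        tlt = re.findall(r'\"raw_title\"\:\".*?\"', html)
        for i in range(len(plt)):
            # Split on the first colon only, so a title containing ':' stays whole.
            # eval() then strips the surrounding double quotes.
            price = eval(plt[i].split(':', 1)[1])
            title = eval(tlt[i].split(':', 1)[1])
            ilt.append([price, title])
    except Exception as exc:
        print("Failed to parse page:", exc)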

def printGoodsList(ilt):
    """Print the collected [price, title] pairs as a numbered table."""
    tplt = "{:4}\t{:8}\t{:16}"
    print(tplt.format("No.", "Price", "Title"))
    count = 0
    for g in ilt:
        count = count + 1
        print(tplt.format(count, g[0], g[1]))


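def main():
    goods = '手錶'  # search keyword ("watch")
    depth = 2  # number of result pages to fetch
    start_url = 'https://s.taobao.com/search?q=' + goods
    infoList = []
    for i in range(depth):
        try:
            # Each page holds 44 items; s is the offset of the page's first item.
            url = start_url + '&s=' + str(44 * i)
            html = getHTMLText(url)
            parsePage(infoList, html)
        except Exception:
            continue
    printGoodsList(infoList)


if __name__ == '__main__':
    main()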

II. Code walkthrough:

The code falls into three parts:

1. Fetching the product page ==> getHTMLText

2. Parsing the product page ==> parsePage

3. Printing the product information ==> printGoodsList

(1) getHTMLText

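def getHTMLText(url):
    """Fetch a page and return its text, or "" if the request fails."""
    try:
        # Browser identity plus the cookie of a logged-in Taobao session.
        header = {
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
            'cookie': '_samesite_flag_=t*********************kmn'
        }
        r = requests.get(url, timeout=30, headers=header)
        r.raise_for_status()  # raise HTTPError on a 4xx/5xx response
        r.encoding = r.apparent_encoding  # decode with the encoding detected from the body
        return r.text
    except requests.RequestException:
        return ""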

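1. First, define the header: the login information (cookie) and the browser information (user-agent).

2. r.raise_for_status() raises an exception when the request fails (a 4xx or 5xx status code), so execution jumps from try to except and the function returns an empty string instead of crashing.

3. r.encoding = r.apparent_encoding sets the encoding used to decode the response to apparent_encoding, the encoding detected from the response body.

4. Finally, return r.text, the decoded text of the response.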
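Step 3 matters because the same bytes decode to different text under different encodings. The sketch below uses only the standard library and a made-up byte string standing in for a response body. It illustrates the idea and is not what requests does internally.

```python
# Bytes as a server might send them: UTF-8 encoded text (made-up sample data).
body = "手錶 129.00".encode("utf-8")

# Decoding with the wrong codec does not fail; it silently produces mojibake.
wrong = body.decode("iso-8859-1")

# Decoding with the encoding that matches the bytes recovers the text.
right = body.decode("utf-8")

print(repr(wrong))
print(repr(right))
```

requests falls back to ISO-8859-1 when a text response declares no charset in its headers, which is why copying apparent_encoding into encoding helps with Chinese pages.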

(2) parsePage

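def parsePage(ilt, html):
    """Extract [price, title] pairs from the page text and append them to ilt."""
    try:
        plt = re.findall(r'\"view_price\"\:\"[\d\.]*\"', html)
        tlt = re.findall(r'\"raw_title\"\:\".*?\"', html)
        for i in range(len(plt)):
            # Split on the first colon only, so a title containing ':' stays whole.
            # eval() then strips the surrounding double quotes.
            price = eval(plt[i].split(':', 1)[1])
            title = eval(tlt[i].split(':', 1)[1])
            ilt.append([price, title])
    except Exception as exc:
        print("Failed to parse page:", exc)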

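This is the most important part of the code, the core: it finds and parses the data in r.text.

First, the regular expressions:

r'\"view_price\"\:\"[\d\.]*\"'

Searching Taobao for 書包 (schoolbag) and inspecting the returned page shows that every product price is preceded by the key view_price, and likewise every product title by the key raw_title.

The many backslashes in \"view_price\"\:\"[\d\.]*\" are escape characters. The pattern matches text of the form "view_price":"129", that is, the quoted key view_price, a colon, and a quoted run of digits and dots. (The backslashes before the quotes and the colon are harmless but not required.) re.findall returns every such match in the page.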
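To see what the two patterns return, here is a small sketch run against a made-up fragment shaped like the data described above. The sample text and the names PRICE_RE and TITLE_RE are assumptions for illustration; the real page is far larger.

```python
import re

# Made-up fragment in the shape described above (not real Taobao output).
sample_html = (
    '{"raw_title":"Canvas schoolbag","view_price":"129.00"},'
    '{"raw_title":"Leather backpack","view_price":"88.5"}'
)

# Same patterns as in parsePage, written without the optional backslashes.
PRICE_RE = r'"view_price":"[\d.]*"'
TITLE_RE = r'"raw_title":".*?"'

prices = re.findall(PRICE_RE, sample_html)
titles = re.findall(TITLE_RE, sample_html)

print(prices)
print(titles)
```

Each match still includes the key and the quotes, which is why parsePage goes on to split and unquote it.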

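The product title can likewise be taken from either title or raw_title. raw_title was chosen because title appears twice within a single product entry.

In the end, plt and tlt hold the prices and titles of all products, and entries at the same index belong to the same product.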

for i in range(len(plt)):
    price = eval(plt[i].split(':', 1)[1])
    title = eval(tlt[i].split(':', 1)[1])
    ilt.append([price, title])

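Finally, the price and title of every product are appended to the list ilt.

Each entry of plt looks like "view_price":"129". The split method separates it at the colon, and the element at index [1] is the quoted price. eval() then evaluates that text as a Python string literal, which removes the surrounding double quotes. Note that eval() executes whatever expression it is given, so it is risky on text fetched from the network.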
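The same extraction can be done without eval(). The sketch below is one possible alternative, not the original author's method: it assumes the titles are JSON-style quoted strings, uses capture groups to skip the split step, and decodes the title with json.loads. The names parse_items, PRICE_RE and TITLE_RE are hypothetical.

```python
import json
import re

# Capture only the digits and dots of the price.
PRICE_RE = re.compile(r'"view_price":"([\d.]*)"')
# Capture the quoted title, allowing backslash-escaped characters inside it.
TITLE_RE = re.compile(r'"raw_title":("(?:[^"\\]|\\.)*")')


def parse_items(html):
    """Return [price, title] pairs without calling eval()."""
    prices = PRICE_RE.findall(html)
    # json.loads decodes the quoted string, including escaped quotes.
    titles = [json.loads(t) for t in TITLE_RE.findall(html)]
    # zip stops at the shorter list, so a length mismatch cannot raise IndexError.
    return [[p, t] for p, t in zip(prices, titles)]


# Made-up sample whose title contains a colon and escaped quotes.
sample = '{"raw_title":"Watch: steel \\"slim\\"","view_price":"129.00"}'
print(parse_items(sample))
```

Because the title is captured whole, a colon inside it no longer affects the result, and nothing from the page is ever executed.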

(3) Printing product information: printGoodsList

def printGoodsList(ilt):
    """Print the collected [price, title] pairs as a numbered table."""
    tplt = "{:4}\t{:8}\t{:16}"
    print(tplt.format("No.", "Price", "Title"))
    count = 0
    for g in ilt:
        count = count + 1
        print(tplt.format(count, g[0], g[1]))

First define the format template tplt, then use count as a running counter that serves as the serial number.
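The numbers in the template are minimum column widths. A short sketch with made-up row values shows how the padding falls out:

```python
tplt = "{:4}\t{:8}\t{:16}"

header = tplt.format("No.", "Price", "Title")
row = tplt.format(1, "129.00", "Canvas schoolbag")

# Strings are left-aligned and padded on the right; numbers are right-aligned.
print(repr(header))
print(repr(row))
```

Values longer than the width are not truncated, and Chinese characters usually occupy two terminal columns each, so real product titles may not line up perfectly.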

III. Running main()

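def main():
    goods = '手錶'  # search keyword ("watch")
    depth = 2  # number of result pages to fetch
    start_url = 'https://s.taobao.com/search?q=' + goods
    infoList = []
    for i in range(depth):
        try:
            # Each page holds 44 items; s is the offset of the page's first item.
            url = start_url + '&s=' + str(44 * i)
            html = getHTMLText(url)
            parsePage(infoList, html)
        except Exception:
            continue
    printGoodsList(infoList)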

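Looking at the URL of a Taobao search page, the search endpoint is search?q= followed by the keyword.

Each results page holds 44 items, and the s parameter is the offset of the page's first item: empty or 0 for the first page, 44 for the second, and 88 for the third.

So we define depth as the number of pages to fetch, and the loop runs once per page.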

https://s.taobao.com/search?q=%E4%B9%A6%E5%8C%85&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.2017.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306&bcoffset=6&ntoffset=6&p4ppushleft=1%2C48&s=0
https://s.taobao.com/search?q=%E4%B9%A6%E5%8C%85&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.2017.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306&bcoffset=3&ntoffset=3&p4ppushleft=1%2C48&s=44
https://s.taobao.com/search?q=%E4%B9%A6%E5%8C%85&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.2017.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306&bcoffset=3&ntoffset=0&p4ppushleft=1%2C48&s=88
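The q and s parameters from these URLs can also be assembled with the standard library, which percent-encodes the keyword explicitly. A minimal sketch, where BASE_URL, PAGE_SIZE and build_search_urls are hypothetical names and only the two parameters used by main() are included:

```python
from urllib.parse import urlencode

BASE_URL = "https://s.taobao.com/search"
PAGE_SIZE = 44  # items per results page, as observed in the URLs above


def build_search_urls(keyword, depth):
    """Return one search URL per page, with s as the offset of the first item."""
    return [
        BASE_URL + "?" + urlencode({"q": keyword, "s": PAGE_SIZE * page})
        for page in range(depth)
    ]


# 书包 ("schoolbag") is the keyword encoded in the sample URLs above.
for url in build_search_urls("书包", 3):
    print(url)
```

The encoded keyword comes out as %E4%B9%A6%E5%8C%85, the same value that appears in the URLs copied from the browser.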

Running the script prints a table with the serial number, price, and title of each product found.
