Python Crawler Project in Practice: Scraping JD.com Product Information (Price, Promotions, Ranking, Positive-Review Rate, etc.)

Posted by SpiderLQF on 2018-06-27

Python Crawler

Scraping JD.com product information with Splash

1. Environment

Windows 7
Python 3.5
PyCharm
Scrapy
scrapy-splash
MySQL

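2. Introduction

To try out the dynamic page rendering that scrapy-splash provides, I wrote a crawler that uses Splash to scrape product information from JD.com. That said, in terms of crawl efficiency and stability, reverse-engineering the requests behind a dynamic page should still be the first option to consider.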

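3. Setting up the environment

Only the Splash setup on Windows 7 is covered here:

Splash is a JavaScript rendering service built on WebKit. It is a lightweight browser that exposes an HTTP API, implemented in Python on top of Twisted and Qt. The official documentation recommends running it in a Docker container. The benefit of a container is that Splash arrives installed in one piece, so you do not have to work through its installation problems one at a time.

Most online tutorials recommend installing Docker on Linux, for the following reason:

Docker is written in Go and runs on Linux. To use it on Windows, you first have to run a Linux virtual machine on top of Windows and then run Docker inside that virtual machine.

Since I am on Windows 7, the only option was to download Docker Toolbox from the official site (https://docs.docker.com/toolbox/toolbox_install_windows/) and double-click the installer (look up the installation steps yourself if needed).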

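After installation, three shortcuts appear. Click Docker Quickstart Terminal to start it.

Pull the Splash image: $ docker pull scrapinghub/splash

Run it: $ docker run -p 8050:8050 -p 5023:5023 scrapinghub/splash. This publishes the connection port 8050 and the monitoring port 5023; publishing only port 8050 also works.

Finally, install the scrapy-splash package in the Scrapy project and add the following to settings.py: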

# Splash settings; SplashDeduplicateArgsMiddleware adds cache_args support (optional)
SPIDER_MIDDLEWARES = {
    # 'e_commerce.middlewares.ECommerceSpiderMiddleware': 543,
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DOWNLOADER_MIDDLEWARES = {
    # 'e_commerce.middlewares.ECommerceDownloaderMiddleware': 543,
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
# Splash-aware duplicate filter
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
SPLASH_URL = 'http://192.168.99.100:8050'

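The environment is now configured.

4. JD.com page analysis and crawler implementation

The crawler collects the following fields for each JD.com product:

product: product name

product_url: product link

initial_price: original price

price: actual price

shop: seller

tags: promotions

comment_nums: total number of comments

summary_order: JD.com ranking

praise: positive-review rate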
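As a small illustration, the fields above could be gathered in one record type before they are stored. This is only a sketch: the field names come from the list, but the types, the defaults, and the sample values are assumptions, and dataclasses need Python 3.7 or later (newer than the Python 3.5 listed in the environment).

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class ProductRecord:
    """One scraped product. Field names follow the list above; types are assumed."""
    product: str
    product_url: str
    initial_price: Optional[float] = None
    price: Optional[float] = None
    shop: str = ""
    tags: List[str] = field(default_factory=list)
    comment_nums: int = 0
    summary_order: Optional[int] = None
    praise: int = 0


# Hypothetical sample values, for illustration only.
record = ProductRecord(
    product="Example Python book",
    product_url="https://item.jd.com/12186192.html",
    price=59.0,
    tags=["example promotion"],
    praise=98,
)
row = asdict(record)  # plain dict, convenient for building a database insert
```

Converting the record to a dict keeps the storage step independent of how the fields were scraped.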

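Scraping product information needs an entry point. Taking books that match the keyword python as the example:

url: https://search.jd.com/Search?keyword=python&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq=python&page=159&s=4732&click=0
Simplified: https://search.jd.com/Search?keyword=python&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq=python&page=1

Comparing several pages shows the important query parameters in the URL:

keyword: the search keyword

wq: the search keyword

page: page number (it increases through the odd numbers: 1, 3, 5, ...)

So the request URL can be built from this template, where {0} is the page number and {1} is the keyword:

https://search.jd.com/Search?keyword={1}&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq={1}&page={0}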
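The URL template and the odd page numbering can be put together in a few lines. The helper below is a minimal sketch: the template is the one given above, while the function name `search_page_url` and the idea of counting visible pages from 1 are assumptions made for illustration.

```python
from urllib.parse import quote

# Template from the article: {0} is the page parameter, {1} the keyword.
SEARCH_URL = ('https://search.jd.com/Search?keyword={1}&enc=utf-8&qrst=1'
              '&rt=1&stop=1&vt=2&wq={1}&page={0}')


def search_page_url(keyword, visible_page):
    """Return the search URL for the n-th visible results page (1-based)."""
    if visible_page < 1:
        raise ValueError('visible_page must be >= 1')
    page_param = 2 * visible_page - 1  # visible pages 1, 2, 3 map to 1, 3, 5
    return SEARCH_URL.format(page_param, quote(keyword))


urls = [search_page_url('python', n) for n in range(1, 4)]
```

Percent-encoding the keyword with `quote` keeps the URL valid for non-ASCII search terms.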

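The page source contains the information for each product.

It is easy to extract with CSS or XPath, but not everything we want is there, so we also need each product's URL.

One problem is that JD.com loads the products on a page in batches. Watching the XHR requests in the Network tab (F12) shows that the newly loaded products come from a request to this URL:

https://search.jd.com/s_new.php?keyword=python&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq=python&page=160&s=4762&scrolling=y&log_id=1530076904.20294&tpl=2_M&show_items=28950919364,12082834672,28883808237,29290652302,29291203091,27895583715,29089172884,28884416666,28938172051,28938903145,26066239470,29090468719,29094523735,28949599627,29291004039,28041915738,26232273814,26400554950,28494562293,29361448498,26291836439,19942582293,15348519021,19580125836,29251686387,27859110321,27880583607,29185119165,28738281855,29184203799

Here the key parameter page is the original page value plus 1. Requesting this URL directly, imitating the browser, would probably be much simpler, but that reverse-engineering approach is not the topic here. Instead we use Splash's execute endpoint to render the dynamic page: it runs a JavaScript statement that scrolls the browser to the bottom of the page, so that all the products get loaded.

Request method: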
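The author's Lua script (`self.lua_script` in the key code later on) is not shown, so the following is only a sketch of what such a request could look like. The script body, the name `LUA_SCROLL_SCRIPT`, and the helper `execute_request_kwargs` are assumptions; the Splash calls used (`splash:go`, `splash:wait`, `splash:runjs`, `splash:html`, `splash.images_enabled`) are part of Splash's scripting API.

```python
# Lua script for Splash's execute endpoint (a sketch, not the author's script).
LUA_SCROLL_SCRIPT = """
function main(splash)
    splash.images_enabled = false
    assert(splash:go(splash.args.url))
    assert(splash:wait(splash.args.wait))
    -- scroll to the bottom so the second batch of products is requested
    splash:runjs("window.scrollTo(0, document.body.scrollHeight)")
    assert(splash:wait(splash.args.wait))
    return splash:html()
end
"""


def execute_request_kwargs(wait=1):
    """Build keyword arguments for scrapy_splash.SplashRequest(url, **kwargs)."""
    return {
        'endpoint': 'execute',
        'args': {'lua_source': LUA_SCROLL_SCRIPT, 'wait': wait},
        'cache_args': ['lua_source'],  # upload the script to Splash only once
    }


kwargs = execute_request_kwargs(wait=1)
# In a spider: yield SplashRequest(url, callback=self.crawl_commentInfo, **kwargs)
```

Scrolling with `window.scrollTo` avoids depending on any particular element of the JD.com page; a longer wait may be needed if the second batch loads slowly.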

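Each product's URL can be extracted from its listing item (the href of .gl-i-wrap .p-img a, as in the key code further down).

Build each product request as https: + url, since the extracted link is protocol-relative. Requesting that URL directly does not yield the price, promotions, book ranking, and other fields we need; analysis shows they only appear after dynamic rendering. So we render the page through Splash's render.html endpoint. Request method: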
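To make clear what the render.html endpoint is asked to do, here is the equivalent plain HTTP call, which Splash also accepts as a GET request. This is a sketch: `SPLASH_URL` is the value from settings.py, the parameter values mirror those in the key code later on, and the helper name `render_html_url` is made up for illustration.

```python
from urllib.parse import urlencode

SPLASH_URL = 'http://192.168.99.100:8050'  # same value as in settings.py


def render_html_url(page_url, wait=1, timeout=15, images=0):
    """Return the Splash render.html URL that renders `page_url`."""
    query = urlencode([
        ('url', page_url),     # page for Splash to load
        ('wait', wait),        # seconds to wait after the page loads
        ('timeout', timeout),  # overall render timeout in seconds
        ('images', images),    # 0 disables image downloads
    ])
    return '{}/render.html?{}'.format(SPLASH_URL, query)


api_url = render_html_url('https://item.jd.com/12186192.html')
```

Opening `api_url` in a browser while the Splash container is running is a quick way to check what the crawler will receive for a product page.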

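With the rendered page, the fields can now be extracted successfully.

One field still could not be extracted, though: the positive-review rate. In a browser it only becomes visible after clicking 商品評價 (Product Reviews), and the page source shows that this information is likewise fetched by a request sent to the server after that click.

Request URL:

https://sclub.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98vv1516&productId=12186192&score=0&sortType=5&page=0&pageSize=10&isShadowSku=0&fold=1
Simplified: https://sclub.jd.com/comment/productPageComments.action?&productId=12186192&score=0&sortType=5&page=0&pageSize=10&isShadowSku=0&fold=1
Request URL template: https://sclub.jd.com/comment/productPageComments.action?&productId={0}&score=0&sortType=5&page=0&pageSize=10&isShadowSku=0&fold=1

Passing in product_id is all it takes to get the positive-review rate from this URL.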
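The two steps, building the review URL from a product link and reading the rate from the response, might look like this. It is a sketch under assumptions: only the key name `goodRateShow` is taken from the key code later on; the helper names and the sample payload are hypothetical, and the real response structure is not reproduced here, which is why the lookup searches the parsed JSON instead of relying on a fixed path.

```python
import json
import re

COMMENT_URL = ('https://sclub.jd.com/comment/productPageComments.action?'
               '&productId={0}&score=0&sortType=5&page=0&pageSize=10'
               '&isShadowSku=0&fold=1')


def comment_url_for(product_url):
    """Build the review URL from a product link such as .../12186192.html."""
    match = re.search(r'/(\d+)\.html', product_url)
    return COMMENT_URL.format(match.group(1)) if match else None


def find_key(node, key):
    """Depth-first search for the first value stored under `key` in parsed JSON."""
    if isinstance(node, dict):
        if key in node:
            return node[key]
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


# Hypothetical payload; only the key name goodRateShow comes from the article.
sample = '{"summary": {"goodRateShow": 98}}'
praise = find_key(json.loads(sample), 'goodRateShow')
```

Parsing the JSON is a little more robust than a regular expression if the response ever spans several lines, at the cost of failing when the body is not valid JSON (for example when the callback parameter is kept and the response is wrapped in a function call).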

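You could go on to scrape the review text or the review breakdown as well. In my view the positive-review rate already says enough about a product's quality, although it should be read together with the number of comments.

Key code: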

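    # Module-level imports these spider methods need:
    #   import re
    #   import scrapy
    #   from scrapy_splash import SplashRequest

    def crawl_commentInfo(self, response):
        """Parse one search page: request review data per product, then the next page."""
        for product_info in response.css('.gl-item'):
            href = product_info.css('.gl-i-wrap .p-img a::attr(href)').extract_first()
            if not href:
                continue  # skip listing items without a product link
            product_url = 'https:' + href
            # Fetch the review data first; the product page itself is requested in parse_product
            headers = self.headers.copy()
            headers.update({
                'Referer': 'https://item.jd.com/12330816.html',
                'Host': 'sclub.jd.com',
            })
            match_obj = re.match(r'.*/(\d+)\.html', product_url)
            if match_obj:
                product_id = match_obj.group(1)
                comment_url = self.start_urls[1].format(product_id)
                yield scrapy.Request(comment_url, headers=headers,
                                     meta={'product_url': product_url}, callback=self.parse_product)
                print('Now crawl commentInfo url is', comment_url)

        # Search pages use odd page numbers, so the next page is the current one plus 2
        match_obj = re.match(r'.*page=(\d+)$', response.url)
        if match_obj:
            page_num = int(match_obj.group(1)) + 2
        else:
            page_num = 3

        # JD.com serves Simplified Chinese, so match the "next page" text in that form
        if '下一页' in response.text:
            next_url = self.start_urls[0].format(page_num, 'python')
            # Note: splash_headers are sent to the Splash server; use headers= for the target site.
            # Extra args such as 'image' and 'wait' are exposed to the Lua script as splash.args.
            yield SplashRequest(next_url, endpoint='execute',
                                callback=self.crawl_commentInfo, splash_headers=self.headers,
                                args={'lua_source': self.lua_script, 'image': 0, 'wait': 1},
                                cache_args=['lua_source'])
            print('Now page_url is :', next_url)

    def parse_product(self, response):
        """Read the positive-review rate from the review response, then render the product page."""
        product_url = response.meta.get('product_url')
        praise = 0
        # re.search finds the key anywhere in the body, including multi-line responses
        praise_match = re.search(r'"goodRateShow":(\d+)', response.text)
        if praise_match:
            praise = int(praise_match.group(1))
        headers = self.headers.copy()
        headers.update({
            'Upgrade-Insecure-Requests': 1,
            'Host': 'item.jd.com',
        })
        # render.html takes 'images' (not 'image') to turn off image downloads.
        # No callback is given, so the rendered response goes to the spider's parse method.
        yield SplashRequest(product_url, endpoint='render.html', splash_headers=headers,
                            args={'images': 0, 'timeout': 15, 'wait': 1}, meta={'praise': praise})
        print('Now request url is:', product_url)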

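Crawl results:

The crawler collected all the product information for the JD.com keyword "python", a little over five thousand records in total; switching the keyword lets it scrape any other product. The comment count comment_nums can be used to estimate each product's sales (the number of comments tracks the number of sales), the ranking summary_order gives another view of sales, and the positive-review rate praise indicates how good a product is.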
