Scraping Dangdang App Data with Python

Posted by 松鼠愛吃餅乾 on 2020-10-21

Original URL: https://blog.csdn.net/m0_48405781/article/details/109200626


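Goal

Scenario: when you scrape a web page or an app by conventional means, the target's anti-scraping measures sometimes make it hard to get the data you want. In that case, consider combining Appium with mitmproxy.

Appium drives the app through automated interactions, while mitmproxy intercepts the responses, parses them, and stores the results in a database.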

Today's goal is to scrape product data from search results in the Dangdang (噹噹網) app and store it in a MongoDB database.

Preparation

First, install Charles and Appium Desktop on the PC and set up the mitmproxy environment.

# Install mitmproxy
pip3 install mitmproxy

# Install pymongo
pip3 install pymongo

You also need an Android phone, and the Android development environment must be set up on the PC.

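Approach

1. With the manual proxy configured on the phone, start Charles to capture the client's network requests in real time.

Open the product search page in the Dangdang app and search for the keyword "Python". In Charles, the URL of the resulting request contains "word=Python".

Write the mitmproxy script and implement the response() hook: filter on the request URL, extract the useful fields, and store them in MongoDB.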

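import json

from pymongo import MongoClient


class DangDangMongo(object):
    """
    Initialize the MongoDB connection
    """
    def __init__(self):
        # Database.authenticate() was removed in PyMongo 4, so the
        # credentials are passed to MongoClient instead
        self.client = MongoClient('localhost', username='root', password='xag', authSource='admin')
        self.db = self.client['admin']
        self.dangdang_book_collection = self.db['dangdang_book']


# Create the connection once instead of once per response
mongo = DangDangMongo()


def response(flow):

    # Filter on the request URL
    if 'keyword=Python' in flow.request.url:

        data = json.loads(flow.response.text)

        # Books (fall back to an empty list when the key is missing)
        products = data.get('products') or []

        product_datas = []

        for product in products:
            # Book ID
            product_id = product.get('id')

            # Title
            product_name = product.get('name')

            # Price
            product_price = product.get('price')

            # Author
            authorname = product.get('authorname')

            # Publisher
            publisher = product.get('publisher')

            product_datas.append({
                'product_id': product_id,
                'product_name': product_name,
                'product_price': product_price,
                'authorname': authorname,
                'publisher': publisher
            })

        # insert_many() raises on an empty list, so insert only when there is data
        if product_datas:
            mongo.dangdang_book_collection.insert_many(product_datas)
            print('Data inserted successfully')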
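The script above matches the URL by substring, which would also match a value such as "keyword=Python3". As a hedged alternative, the query string can be parsed and the parameter compared exactly. This is only a sketch: the helper name is_search_request, the example.com URLs, and the candidate parameter names ("keyword" and "word", taken from the two spellings that appear in this article) are assumptions, not something confirmed against the Dangdang API.

```python
from urllib.parse import urlparse, parse_qs


def is_search_request(url, keyword, param_names=("keyword", "word")):
    """Return True if one of the candidate query parameters equals keyword.

    param_names is an assumption: the article mentions both "word=" and
    "keyword=", so both are checked here.
    """
    query = parse_qs(urlparse(url).query)
    return any(keyword in query.get(name, []) for name in param_names)


# Hypothetical URLs, used only to illustrate the matching behaviour
print(is_search_request("https://example.com/search?keyword=Python&page=2", "Python"))
print(is_search_request("https://example.com/search?keyword=Python3", "Python"))
```

Inside response(), the check would then read `if is_search_request(flow.request.url, 'Python'):`.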

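First point the phone's manual proxy at port 8080 (the port mitmproxy listens on by default), then run the mitmdump command, then scroll through the product list. The data is written to the database as you scroll.

mitmdump -s script_dangdang.py

2. Next, use Appium to automate the interaction.

First open Appium Desktop and start the server.

Open Android Studio, use Build > Analyze APK from the menu bar to inspect the Dangdang Android app, and open AndroidManifest.xml.

The application package name and the launch Activity are, respectively:

com.dangdang.buy2, com.dangdang.buy2.StartupActivity

With the package name and launch Activity in hand, you can use WebDriver to launch the Dangdang app.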

self.caps = {
    'automationName': DRIVER,
    'platformName': PLATFORM,
    'deviceName': DEVICE_NAME,
    'appPackage': APP_PACKAGE,
    'appActivity': APP_ACTIVITY,
    'platformVersion': ANDROID_VERSION,
    'autoGrantPermissions': AUTO_GRANT_PERMISSIONS,
    'unicodeKeyboard': True,
    'resetKeyboard': True
}
self.driver = webdriver.Remote(DRIVER_SERVER, self.caps)
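The constants used in the capabilities are not shown in the article. The sketch below fills them in with plausible values so the shape of the dictionary is concrete. Only the package name and Activity come from the article. Everything else is an assumption: the build_caps helper, the 'UiAutomator2' driver, the device name, the Android version, and the server URL (the usual default for an Appium 1.x server such as the one bundled with Appium Desktop).

```python
# Assumed address of the local Appium server
DRIVER_SERVER = 'http://localhost:4723/wd/hub'


def build_caps(device_name, platform_version):
    """Build the desired capabilities for launching the Dangdang app."""
    return {
        'automationName': 'UiAutomator2',   # assumption: the article only names DRIVER
        'platformName': 'Android',
        'deviceName': device_name,
        'appPackage': 'com.dangdang.buy2',                   # from the article
        'appActivity': 'com.dangdang.buy2.StartupActivity',  # from the article
        'platformVersion': platform_version,
        'autoGrantPermissions': True,       # assumption for AUTO_GRANT_PERMISSIONS
        'unicodeKeyboard': True,
        'resetKeyboard': True,
    }


# Hypothetical device name and Android version
caps = build_caps('my-android-phone', '10')
print(caps['appPackage'], caps['appActivity'])
```

The resulting dictionary would be passed to webdriver.Remote together with DRIVER_SERVER, as in the snippet above.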

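Next, use uiautomatorviewer, a tool that ships with the Android SDK, to look up element information, and use Appium's WebDriver to operate the UI elements.

The first time the app is opened, several dialogs may appear: a "red envelope rain" dialog, a new-user red envelope dialog, and a switch-city dialog. Find each dialog's close button by element ID and tap it to dismiss the dialog.

A separate thread is created here to handle these dialogs.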

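import threading


class ExtraJob(threading.Thread):
    """Background thread that dismisses pop-up dialogs."""

    def __init__(self, driver):
        super().__init__()
        self.driver = driver
        # While set, the loop runs; clearing it would pause the loop
        self.__flag = threading.Event()
        self.__flag.set()
        # While set, the thread stays alive; clearing it ends the loop
        self.__running = threading.Event()
        self.__running.set()

    def run(self):
        while self.__running.is_set():

            # Returns immediately when the flag is set; otherwise blocks until it is set
            self.__flag.wait()

            # is_element_exist is the author's helper (not shown in the article);
            # from its usage it returns the element, or a falsy value if absent

            # 1.0 "Red envelope rain" dialog
            red_packet_element = is_element_exist(self.driver, 'com.dangdang.buy2:id/close')
            if red_packet_element:
                red_packet_element.click()

            # 1.1 "New-user coupon" dialog
            new_welcome_page_sure_element = is_element_exist(self.driver, 'com.dangdang.buy2:id/dialog_cancel_tv')
            if new_welcome_page_sure_element:
                new_welcome_page_sure_element.click()

            # 1.2 "Switch location" dialog
            change_city_cancle_element = is_element_exist(self.driver, 'com.dangdang.buy2:id/left_bt')
            if change_city_cancle_element:
                change_city_cancle_element.click()

    def stop(self):
        # Wake the thread if it is paused, then let the loop exit
        self.__flag.set()
        self.__running.clear()


extra_job = ExtraJob(dangdang.driver)
extra_job.start()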
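The thread above relies on a helper named is_element_exist that the article does not show. One possible implementation, offered only as a sketch, uses find_elements, which returns an empty list instead of raising when nothing matches. The string "id" is the locator strategy behind Selenium's By.ID. FakeDriver is a hypothetical stand-in used here so the helper can be tried without a device.

```python
def is_element_exist(driver, element_id):
    """Return the first element with the given resource ID, or None."""
    # "id" is the locator strategy string that By.ID resolves to
    elements = driver.find_elements("id", element_id)
    return elements[0] if elements else None


class FakeDriver:
    """Hypothetical stand-in for an Appium driver, for illustration only."""

    def __init__(self, present_ids):
        self.present_ids = set(present_ids)

    def find_elements(self, by, value):
        # Pretend each present ID maps to a single element (the ID itself)
        if by == "id" and value in self.present_ids:
            return [value]
        return []


driver = FakeDriver(['com.dangdang.buy2:id/close'])
print(is_element_exist(driver, 'com.dangdang.buy2:id/close'))
print(is_element_exist(driver, 'com.dangdang.buy2:id/left_bt'))
```

With a real driver, the returned object would be a WebElement whose click() method dismisses the dialog.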

Next, tap the search box, enter the keyword, and tap the search button.

# 1. Search box
search_element_pro = self.wait.until(
    EC.presence_of_element_located((By.ID, 'com.dangdang.buy2:id/index_search')))
search_element_pro.click()

search_input_element = self.wait.until(
    EC.presence_of_element_located((By.ID, 'com.dangdang.buy2:id/search_text_layout')))
search_input_element.set_text(KEY_WORD)

# 2. Search button: start the search
search_btn_element = self.wait.until(
    EC.element_to_be_clickable((By.ID, 'com.dangdang.buy2:id/search_btn_search')))
search_btn_element.click()

# 3. Sleep for 3 seconds so the first page loads completely
time.sleep(3)

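Once the first page has loaded, keep swiping up until all of the data has loaded. mitmproxy stores the data in MongoDB automatically along the way.

while True:
    str1 = self.driver.page_source
    # Swipe up: from (x, y + distance) to (x, y)
    self.driver.swipe(FLICK_START_X, FLICK_START_Y + FLICK_DISTANCE, FLICK_START_X, FLICK_START_Y)
    time.sleep(1)
    str2 = self.driver.page_source
    if str1 == str2:
        print('Stopped swiping')
        # Stop the dialog-handling thread
        extra_job.stop()
        break
    print('Continuing to swipe')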
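The loop above stops when the page source no longer changes after a swipe. The same idea can be written as a small function that takes the "read page" and "swipe" actions as callables, which also makes it easy to add an upper bound on the number of swipes. This is a sketch: the function name scroll_until_stable and the max_swipes safeguard are additions for illustration, not part of the original script.

```python
import time


def scroll_until_stable(get_source, swipe, pause=1.0, max_swipes=200):
    """Swipe until the page source stops changing or max_swipes is reached.

    get_source: callable returning the current page source
    swipe: callable performing one swipe
    Returns the number of swipes performed.
    """
    for count in range(1, max_swipes + 1):
        before = get_source()
        swipe()
        time.sleep(pause)  # give the list time to load more items
        if get_source() == before:
            return count
    return max_swipes
```

With a real driver this might be called as `scroll_until_stable(lambda: driver.page_source, lambda: driver.swipe(x, y_start, x, y_end))`, followed by `extra_job.stop()`; the coordinate names here are placeholders.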

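Result

First start the request-listening service with mitmdump, then run the scraping script.

The app opens automatically, runs through the steps above to reach the product list, and then scrolls the list automatically while mitmproxy stores the useful data in MongoDB.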
