Python Crawler Tutorial for Beginners: Scraping Mobile App Data with pyspider

Posted by 程式設計師啟航 on 2019-07-20

Original URL: http://blog.itpub.net/69913713/viewspace-2651232/


1. Mobile App Data ---- Introduction

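This post continues my practice with pyspider. I recently looked up some usage tips for the framework and found its documentation surprisingly hard to follow, although using it has not caused me any trouble so far. I expect to write about five more tutorials on this framework. Today's tutorial adds image handling, which is worth paying particular attention to.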

2. Mobile App Data ---- Page Analysis

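The site we are going to crawl is http://www.liqucn.com/rj/new/. When I looked, it had roughly 20,000 pages with 9 records per page, so about 180,000 records in total. That is a manageable amount to scrape, and the data can later be used for analysis or for practicing database optimization.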

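The site has essentially no anti-crawling measures, so you can crawl it directly. Do throttle the concurrency a little, though, so as not to put too much load on someone else's server.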
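
pyspider's web console lets you set a rate/burst value per project to throttle the crawl. As a rough illustration of what such a setting means, here is a minimal token-bucket sketch. It is a generic model written for this article, not pyspider's own implementation; the class name `TokenBucket` and its methods are hypothetical.

```python
class TokenBucket:
    """Simplified token bucket: `rate` tokens are added per second, up to `burst`."""

    def __init__(self, rate, burst, now=0.0):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)  # start full, so an initial burst is allowed
        self.last = now

    def try_acquire(self, now):
        """Return True if a request may be sent at time `now` (in seconds)."""
        # Refill according to the time elapsed since the last call.
        elapsed = now - self.last
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


bucket = TokenBucket(rate=1, burst=3)
# Three requests go out immediately; the fourth has to wait for a refill.
immediate = [bucket.try_acquire(now=0.0) for _ in range(4)]
after_one_second = bucket.try_acquire(now=1.0)
```

With a low rate and a small burst, the crawler stays polite even when many pages are queued at once.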

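Analysis of the page shows that pagination is driven by the URL, which keeps things simple: read the total page count from the first page, then generate all the page URLs in bulk.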

http://www.liqucn.com/rj/new/?page=1
http://www.liqucn.com/rj/new/?page=2
http://www.liqucn.com/rj/new/?page=3
http://www.liqucn.com/rj/new/?page=4
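
The URL pattern above can be generated with a few lines of plain Python. This is a minimal sketch; the names `BASE_URL` and `build_page_urls` are made up for illustration, and the record estimate simply multiplies the approximate figures quoted earlier.

```python
BASE_URL = "http://www.liqucn.com/rj/new/?page={}"


def build_page_urls(total_pages):
    """Return the list-page URLs for pages 1..total_pages (inclusive)."""
    return [BASE_URL.format(page) for page in range(1, total_pages + 1)]


urls = build_page_urls(4)

# Rough size of the data set: about 20,000 pages with 9 records each.
estimated_records = 20000 * 9
```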

Code for getting the total page count:

from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://www.liqucn.com/rj/new/?page=1', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        # Read the page number of the last page
        total = int(response.doc(".current").text())
        for page in range(1, total + 1):
            self.crawl('http://www.liqucn.com/rj/new/?page={}'.format(page), callback=self.detail_page)
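
Calling `int()` directly on the pager text fails if the element is missing or contains extra characters. A more defensive parse could look like the sketch below. The helper name `parse_total_pages` and the sample strings are assumptions, not something taken from the site.

```python
import re


def parse_total_pages(text, default=1):
    """Extract the first integer from the pager text, falling back to `default`."""
    match = re.search(r"\d+", text or "")
    return int(match.group()) if match else default


pages_clean = parse_total_pages(" 20000 ")
pages_missing = parse_total_pages("")
```

Falling back to 1 means the crawler still processes the first page when the pager cannot be read.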

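Next, I am copying a passage from the official Chinese translation of the documentation here, as a constant reminder to myself.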

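Brief code walkthrough:
def on_start(self) is the entry point. It runs when you click the run button in the web console.
self.crawl(url, callback=self.index_page) is the API call that creates a new crawl task,
            which is added to the queue of pages waiting to be fetched.
def index_page(self, response) receives a Response object.
            response.doc is a pyquery object; pyquery provides a jQuery-like API for selecting elements.
def detail_page(self, response) returns a result object.
            By default the result is stored in resultdb (SQLite is used if no database is specified at startup). You can also override
            on_result(self, result) to decide where it is stored.
More details:
@every(minutes=24*60, seconds=0) tells the scheduler to run on_start once a day.
@config(age=10 * 24 * 60 * 60) tells the scheduler that this request stays valid for 10 days;
    if the same request comes up again within 10 days, it is ignored. This parameter can also be set in self.crawl(url, age=10*24*60*60) and in crawl_config.
@config(priority=2) sets the priority. The larger the number, the earlier the task runs.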
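
To make the `age` and `priority` rules concrete, here is a toy model of the two decisions. It is a simplified sketch, not pyspider's scheduler; `is_fresh` and `TinyQueue` are hypothetical names introduced only for this example.

```python
import heapq


def is_fresh(last_crawled_at, age, now):
    """True if the URL was crawled less than `age` seconds ago and can be skipped."""
    return last_crawled_at is not None and now - last_crawled_at < age


class TinyQueue:
    """Toy task queue: higher priority first, FIFO among equal priorities."""

    def __init__(self):
        self._heap = []
        self._counter = 0

    def put(self, url, priority=0):
        # heapq is a min-heap, so negate the priority to pop the largest first.
        heapq.heappush(self._heap, (-priority, self._counter, url))
        self._counter += 1

    def get(self):
        return heapq.heappop(self._heap)[2]


TEN_DAYS = 10 * 24 * 60 * 60
queue = TinyQueue()
queue.put("list-page", priority=0)
queue.put("detail-page", priority=2)
first = queue.get()  # the task with the larger priority comes out first
```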

The paginated URLs are now in the crawl queue. Next we parse the fetched pages, which is done in the detail_page function.

    @config(priority=2)
    def detail_page(self, response):
        docs = response.doc(".tip_blist li").items()
        dicts = []
        for item in docs:
            title = item(".tip_list>span>a").text()
            pubdate = item(".tip_list>i:eq(0)").text()
            info = item(".tip_list>i:eq(1)").text()
            # Phone type: the text after the full-width colon (empty if there is none)
            parts = info.split("：")
            category = parts[1] if len(parts) > 1 else ""
            # Size: the text after the slash, or "0MB" if it is missing
            size = info.split("/")
            if len(size) == 2:
                size = size[1]
            else:
                size = "0MB"
            app_type = item("p").text()
            mobile_type = item("h3>a").text()
            # Queue the logo image for download
            img_url = item(".tip_list>a>img").attr("src")
            if img_url:
                # File name: everything after the last slash
                filename = img_url[img_url.rindex("/")+1:]
                # Add the logo URL as a new crawl task
                self.crawl(img_url, callback=self.save_img, save={"filename": filename}, validate_cert=False)
            # Collect the record
            dicts.append({
                "title": title,
                "pubdate": pubdate,
                "category": category,
                "size": size,
                "app_type": app_type,
                "mobile_type": mobile_type
                })
        return dicts
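
The string handling in detail_page can be pulled out into small helpers that are easy to test without running the crawler. This is a sketch under assumptions: the "label：category/size" layout and the sample values are guesses at what the info text looks like, and `split_info` and `filename_from_url` are hypothetical names. Unlike the listing, `split_info` strips the size from the category.

```python
import posixpath
from urllib.parse import urlparse


def split_info(info, default_size="0MB"):
    """Split a string such as 'label：category/size' into (category, size)."""
    head, sep, tail = info.partition("：")
    if not sep:
        tail = head  # no full-width colon: treat the whole string as the payload
    category, slash, size = tail.partition("/")
    size = size.strip()
    return category.strip(), (size if slash and size else default_size)


def filename_from_url(url):
    """Return the last path component of a URL, ignoring any query string."""
    return posixpath.basename(urlparse(url).path)


category, size = split_info("類型：工具/12.5MB")
logo_name = filename_from_url("http://example.com/img/logo.png?v=2")
```

Parsing the URL before taking the base name keeps a query string such as `?v=2` out of the file name, which slicing at the last slash would not do.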

All the data is now returned in one batch. We override on_result to save it to MongoDB. Before writing that, first set up the MongoDB connection.

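import os
import pymongo
import pandas as pd
import json

DATABASE_IP = '127.0.0.1'
DATABASE_PORT = 27017
DATABASE_NAME = 'sun'
# Folder for downloaded logos. Assumed value: the listing uses DIR_PATH without defining it.
DIR_PATH = "./images"

# Database.authenticate() was removed in PyMongo 4.0; pass the credentials to MongoClient instead.
client = pymongo.MongoClient(DATABASE_IP, DATABASE_PORT,
                             username="dba", password="dba", authSource=DATABASE_NAME)
db = client[DATABASE_NAME]
collection = db.liqu  # collection the data will be inserted into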

Saving the data:

    def on_result(self, result):
        if result:
            self.save_to_mongo(result)

    def save_to_mongo(self, result):
        df = pd.DataFrame(result)
        # print(df)
        # One document per DataFrame row
        content = list(json.loads(df.T.to_json()).values())
        if collection.insert_many(content):
            print('Saved to MongoDB successfully')
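
Because on_result sees the return value of every callback, it can receive an empty result, a single dict, or a list of dicts. A small normalization step before the bulk insert keeps the storage code simple. This is a minimal standard-library sketch; `normalize_records` is a hypothetical helper and the sample records are invented.

```python
def normalize_records(result):
    """Return a list of dicts ready for a bulk insert, or [] if nothing is usable."""
    if not result:
        return []
    if isinstance(result, dict):
        result = [result]
    records = []
    for row in result:
        if not isinstance(row, dict):
            continue  # skip anything that is not a record
        # Strip surrounding whitespace from string fields; leave other values untouched.
        records.append({k: v.strip() if isinstance(v, str) else v for k, v in row.items()})
    return records


single = normalize_records({"title": " demo app "})
mixed = normalize_records([{"size": "1MB"}, "junk"])
```

The returned list could then be passed to `insert_many` only when it is non-empty.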

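The scraped data is shown in the table below. At this point most of the work is done; all that is left is to finish the image download, and then we can call it a day!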

3. Mobile App Data ---- Saving Images

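Downloading an image simply means saving a network image to a local path. DIR_PATH is a module-level constant naming the target folder.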

    def save_img(self, response):
        content = response.content
        file_name = response.save["filename"]
        # Create the folder if it does not exist
        os.makedirs(DIR_PATH, exist_ok=True)
        file_path = os.path.join(DIR_PATH, file_name)
        with open(file_path, "wb") as f:
            f.write(content)
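
The file-writing part of save_img can be tried outside pyspider with plain bytes. The sketch below assumes a hypothetical helper `save_bytes`; it also reduces the file name to its last path component, an extra precaution that is not in the listing above.

```python
import os
import tempfile


def save_bytes(content, dir_path, file_name):
    """Write `content` to dir_path/file_name and return the full path."""
    os.makedirs(dir_path, exist_ok=True)  # no error if the folder already exists
    # Keep only the last path component so a crafted name cannot escape dir_path.
    safe_name = os.path.basename(file_name) or "unnamed"
    file_path = os.path.join(dir_path, safe_name)
    with open(file_path, "wb") as f:
        f.write(content)
    return file_path


# Usage with a temporary folder that is cleaned up afterwards
with tempfile.TemporaryDirectory() as tmp:
    path = save_bytes(b"\x89PNG", os.path.join(tmp, "logos"), "icon.png")
    with open(path, "rb") as f:
        saved = f.read()
```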

That completes the task. Save the script, adjust the crawl speed, click run, and let the data roll in.

Source: ITPUB Blog, link: http://blog.itpub.net/69913713/viewspace-2651232/. Please credit the source when reposting; otherwise legal action may be pursued.
