Crawler project: analyzing Damai (damai.cn)

Posted by Kinoz郝 on 2019-08-22

Original URL: https://blog.csdn.net/weixin_44593310/article/details/100017163

How the Damai scraping code works:

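1: Libraries needed:
requests:
A simple, easy-to-use HTTP library for Python.
json:
JSON is commonly used to exchange data between web clients and servers. The json library converts JSON strings into basic Python data types, and basic Python data types back into JSON strings.
csv:
Used later to save the results as a CSV file.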
2: The crawler works in these steps:
1: Send the request
2: Get the response content
3: Parse the content
4: Save the data
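
As a minimal sketch of the conversion the json library performs, the snippet below round-trips a small payload. The sample payload is made up for illustration and is not Damai's real response.

```python
import json

# A made-up JSON string standing in for a server response.
raw = '{"city": "Nanjing", "count": 2, "items": ["a", "b"]}'

# JSON text -> Python objects (dict, list, str, int, ...).
parsed = json.loads(raw)
print(type(parsed).__name__, parsed["count"])

# Python objects -> JSON text; ensure_ascii=False keeps non-ASCII characters readable.
text = json.dumps({"cty": "南京"}, ensure_ascii=False)
print(text)
```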

--------------------------------------------------------------------------------------------------------------------------------

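The whole crawler is one large class (Python 2 distinguishes new-style and classic classes; in Python 3 every class is new-style). The class has four functions, and each function corresponds to one crawler step:
self refers to the instance and holds its attributes; it must be the first parameter of every instance method, even when the method uses no attributes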
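
To make the role of self concrete, here is a small sketch using a hypothetical Counter class (not part of the crawler). Each instance keeps its own attributes, and Python passes the instance as the first argument automatically.

```python
class Counter(object):
    """Hypothetical class used only to illustrate self."""

    def __init__(self):
        # self is the instance being initialised; attributes live on it.
        self.count = 0

    def increment(self):
        # Called as a.increment(); Python supplies a as self.
        self.count += 1
        return self.count


a = Counter()
b = Counter()
a.increment()
a.increment()
b.increment()
print(a.count, b.count)
```

Because the two instances do not share state, a and b end up with different counts.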

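1: We define __init__ (a magic method) to build the request:
headers dict: holds the User-Agent, the Damai cookie, and the referer (the Damai search page)
data dict: holds the name of the single city to query and the other query parameters
url: the address of the endpoint to crawl
data_key: starts as None and later holds the list of field names taken from the scraped data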

    def __init__(self):
        self.url = "https://search.damai.cn/searchajax.html"
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
            "cookie": "cna=KlOoFdYAbBwCAXpgKp9obZ4Y; isg=BOTkUmIpZQqUU5HMw-n25G4WteLWfQjnjj6Ltv4FY69yqYRzJo8Zdx9IbUkUcUA_; l=cBLM5Y6uqhvwqAAOBOCZourza779bIRAguPzaNbMi_5pa6L_MmQOkJ_0tFp6cfWd9ELB4VsrWwJ9-etlwIK40mLvCAQF",
            "referer": "https://search.damai.cn/search.htm?ctl=%20%20%20&order=1&cty="  # Damai category page
        }
        self.data = {
            "cty": "南京",
            "ctl": "演唱會",
            "tsg": "0",
            "order": "1"
        }
        self.data_key = None

2: We define the get function to obtain the response:

    def get(self):
        response = requests.post(url=self.url, headers=self.headers, data=self.data)
        # test: print(response.text)
        return response
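
When data= is given a dict, requests.post sends the fields as a URL-encoded form body. The sketch below uses only the standard library to show roughly what that body looks like for the data dict above. It does not contact Damai, and the exact bytes requests produces may differ in detail.

```python
from urllib.parse import urlencode, parse_qs

data = {"cty": "南京", "ctl": "演唱會", "tsg": "0", "order": "1"}

# Non-ASCII values are percent-encoded as UTF-8 bytes.
body = urlencode(data)
print(body)

# Decoding the body gives back the original fields.
decoded = {key: values[0] for key, values in parse_qs(body).items()}
print(decoded == data)
```

In a real run it may also be worth passing timeout= to requests.post and calling response.raise_for_status() before parsing, so that a hung connection or an HTTP error surfaces early.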

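3: We define the parse function to parse the data:
Call json.loads to convert the response text returned by get from a string into a dict, stored in the dict_data variable
Store the part of the dict we want to scrape in the need_spider_data variable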

    def parse(self):
        dict_data = json.loads(self.get().text)
        need_spider_data = dict_data["pageData"]["resultData"]
        # print(need_spider_data)  test data

Now data_key comes into play: we iterate over the first record in need_spider_data and append each of its keys to data_key (using the append method)

        data_key = []
        for item in need_spider_data[0]:
            data_key.append(item)
            
        self.data_key = data_key

        return need_spider_data
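
The parsing step can be tried offline with a made-up response. In the sketch below, the field names (name, venue, price) and the helper extract are assumptions for illustration; only the pageData/resultData nesting comes from the code above. It also guards against an empty result list, which would otherwise raise IndexError on need_spider_data[0].

```python
import json

# Made-up response text; the field names are assumptions for illustration.
sample_text = json.dumps({
    "pageData": {
        "resultData": [
            {"name": "Show A", "venue": "Hall 1", "price": "180"},
            {"name": "Show B", "venue": "Hall 2", "price": "280"},
        ]
    }
})


def extract(text):
    """Return (header_keys, rows) from a response shaped like the sample."""
    rows = json.loads(text)["pageData"]["resultData"]
    # Iterating over a dict yields its keys, so list(rows[0]) collects
    # the same keys as an explicit for/append loop would.
    keys = list(rows[0]) if rows else []
    return keys, rows


keys, rows = extract(sample_text)
print(keys)
print(len(rows))
```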

4: We define the save function to store the data:
Define list as the list of field names and assign data_key to it

    def save(self):
        list = self.data_key

The parsed data (returned by parse) is assigned to my_data

        my_data = self.parse()

A common way to save it is as follows:

        with open("damaiwang" + ".csv", "w", newline="", encoding='utf8') as f:
            writer = csv.DictWriter(f, list)   # write dict rows using list as the field names
            writer.writeheader()
            for row in my_data:   # iterate over my_data and write each row
                writer.writerow(row)
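
Taking the header from the first record only can fail when a later record has a field the first one lacks: csv.DictWriter raises ValueError in that case by default. The sketch below shows one possible way around it, collecting the keys from every row before writing. The rows are made up, and io.StringIO stands in for the real file.

```python
import csv
import io

# Made-up rows; the second row carries an extra field the first one lacks.
rows = [
    {"name": "Show A", "price": "180"},
    {"name": "Show B", "price": "280", "favourable": "10% off"},
]

# Collect every key seen in any row, preserving first-seen order.
fieldnames = []
for row in rows:
    for key in row:
        if key not in fieldnames:
            fieldnames.append(key)

buffer = io.StringIO()  # stands in for the opened CSV file
# restval fills in fields that a row does not have.
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# With only the first row's keys as the header, the extra field raises ValueError.
strict = csv.DictWriter(io.StringIO(), fieldnames=list(rows[0]))
try:
    strict.writerow(rows[1])
    raised = False
except ValueError:
    raised = True
print(raised)
```

An alternative is to pass extrasaction="ignore" to csv.DictWriter, which silently drops fields that are not in the header instead of raising.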

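5: Call all the functions to run the whole program:

    if __name__ == '__main__':
        # The methods cannot be called directly on the class; create an instance first and call them through it
        spider = Spider()
        spider.parse()
        spider.save()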

A small message printed when the program finishes:

        print("="*50)
        print("The program gone the file saved at the Root directory")
        print("="*50)

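Source code: the source code comes from the Alibaba Cloud Yunqi Community

The code with unnecessary parts and redundant comments removed: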

import requests
import json
import csv      # used later to save the file

def fozu():
    print("                            _ooOoo_                     ")
    print("                           o8888888o                    ")
    print("                           88  .  88                    ")
    print("                           (| -_- |)                    ")
    print("                            O\\ = /O                    ")
    print("                        ____/`---'\\____                ")
    print("                      .   ' \\| |// `.                  ")
    print("                       / \\||| : |||// \\               ")
    print("                     / _||||| -:- |||||- \\             ")
    print("                       | | \\\\\\ - /// | |             ")
    print("                     | \\_| ''\\---/'' | |              ")
    print("                      \\ .-\\__ `-` ___/-. /            ")
    print("                   ___`. .' /--.--\\ `. . __            ")
    print("                ."" '< `.___\\_<|>_/___.' >'"".         ")
    print("               | | : `- \\`.;`\\ _ /`;.`/ - ` : | |     ")
    print("                 \\ \\ `-. \\_ __\\ /__ _/ .-` / /      ")
    print("         ======`-.____`-.___\\_____/___.-`____.-'====== ")
    print("                            `=---='  ")
    print("                                                        ")
    print("         .............................................  ")
    print("            Buddha guards the code, no bugs            ")
    print("                  Zen of python:                       ")
    print("                  Beautiful is better than ugly.        ")
    print("                  Explicit is better than implicit.     ")
    print("                  Simple is better than complex.        ")
    print("                  Complex is better than complicated.   ")
    print("                  Flat is better than nested.           ")
    print("                  Sparse is better than dense.          ")
    print("                  Readability counts.                   ")
    print("                  Now is better than never.             ")
fozu()

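class Spider(object):   # new-style class
    # Build the request headers and parameters. self refers to the instance and must be the first parameter of every method, even if no attributes are used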
    def __init__(self):
        self.url = "https://search.damai.cn/searchajax.html"
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
            "cookie": "cna=KlOoFdYAbBwCAXpgKp9obZ4Y; isg=BOTkUmIpZQqUU5HMw-n25G4WteLWfQjnjj6Ltv4FY69yqYRzJo8Zdx9IbUkUcUA_; l=cBLM5Y6uqhvwqAAOBOCZourza779bIRAguPzaNbMi_5pa6L_MmQOkJ_0tFp6cfWd9ELB4VsrWwJ9-etlwIK40mLvCAQF",
            "referer": "https://search.damai.cn/search.htm?ctl=%20%20%20&order=1&cty="  # Damai category page
        }
        self.data = {
            "cty": "南京",
            "ctl": "演唱會",
            "tsg": "0",
            "order": "1"
        }
        self.data_key = None

    # Request the URL and get the response
    def get(self):
        response = requests.post(url=self.url, headers=self.headers, data=self.data)
        # test: print(response.text)
        return response

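    # Parse the data
    def parse(self):
        # Convert the string data into a dict
        dict_data = json.loads(self.get().text)

        # Store the dict data we want to scrape in a variable
        need_spider_data = dict_data["pageData"]["resultData"]
        # print(need_spider_data)
        # Build the list of header fields (first approach)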
        data_key = []
        for item in need_spider_data[0]:
            data_key.append(item)

        # print test
        # print(data_key)
        self.data_key = data_key

        return need_spider_data

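    # Save as CSV data
    def save(self):
        # Build the list of field names (data_key is set by an earlier call to parse)
        list = self.data_key

        # # Saving raised a "missing field" error here, so one more field was appended
        # list.append('favourable')

        # test: print(list)

        my_data = self.parse()
        # test: print(my_data)

        with open("damaiwang" + ".csv", "w", newline="", encoding='utf8') as f:
            # Pass in the header fields, which become the first row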
            writer = csv.DictWriter(f, list)
            writer.writeheader()
            for row in my_data:
                writer.writerow(row)



if __name__ == '__main__':
    spider = Spider()
    spider.parse()
    spider.save()

    print("="*50)
    print("The program gone the file saved at the Root directory")
    print("="*50)
