Python Crawler Tutorial for Beginners 40-100: Scraping 400K Python-Related Blog Posts from cnblogs (部落格園) with scrapy

Posted by 夢想橡皮擦 on 2019-02-25

Original URL: https://flycode.co/archives/231524

Python crawler

Before We Crawl

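Post number 40 in this series sounds the horn: we are crawling cnblogs posts. This article ends up collecting more than 370,000 articles published between 2010-01-01 and 2019-01-07, which leaves plenty to analyze later.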

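Regular cnblogs readers know that each category shows only 200 pages. Anything beyond that is not displayed, so at most 4,000 posts are visible per category. How to get as much blog data as possible is the small core question of this article; how much you take away from it is up to you.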

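Crawling category by category will not work, because the data past page 200 is never shown. Change the approach and look at the search page instead: it has a date filter. A date filter!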

Look closely at the URL:

https://zzk.cnblogs.com/s/blogpost?Keywords=python&datetimerange=Customer&from=2019-01-01&to=2019-01-01  

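Once you have this link, a fairly simple idea gets you every Python-related post: iterate over the dates. The core code follows; I have pulled out the few points that matter most.

The search page has verification in front of it, so you must supply the cookie from your own browser. It is easy to get.
It is time to review dict comprehension syntax.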

import scrapy
from scrapy import Request
import time
import datetime

class BlogsSpider(scrapy.Spider):
    name = 'Blogs'
    allowed_domains = ['zzk.cnblogs.com']
    start_urls = ['http://zzk.cnblogs.com/']
    # Search window: a single day, advanced one day at a time in parse()
    from_time = "2010-01-01"
    end_time = "2010-01-01"
    keywords = "python"
    page = 1
    url = "https://zzk.cnblogs.com/s/blogpost?Keywords={keywords}&datetimerange=Customer&from={from_time}&to={end_time}&pageindex={page}"
    custom_settings = {
        "DEFAULT_REQUEST_HEADERS": {
            "HOST": "zzk.cnblogs.com",
            "TE": "Trailers",
            "referer": "https://zzk.cnblogs.com/s/blogpost?w=python",
            "upgrade-insecure-requests": "1",
            "user-agent": "Mozilla/5.0 Gecko/20100101 Firefox/64.0"
        }
    }

    def start_requests(self):
        # Placeholder: paste the Cookie header string copied from your own browser
        cookie_str = "find a way to obtain this yourself"
        # Dict comprehension: "a=1; b=2" -> {"a": "1", "b": "2"}.
        # Split on the first "=" only, because cookie values may contain "=".
        self.cookies = {item.split("=", 1)[0]: item.split("=", 1)[1] for item in cookie_str.split("; ")}
        yield Request(self.url.format(keywords=self.keywords, from_time=self.from_time, end_time=self.end_time, page=self.page), cookies=self.cookies, callback=self.parse)
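The dict comprehension above can also be written as a small standalone helper, which makes it easier to try out before pasting a real cookie in. This is only a sketch: the function name `parse_cookie_header` and the cookie names in the example are made up for illustration and are not the cookies cnblogs actually sets.

```python
def parse_cookie_header(cookie_str):
    """Turn a Cookie header string such as 'a=1; b=2' into a dict."""
    cookies = {}
    for pair in cookie_str.split(";"):
        pair = pair.strip()
        if not pair or "=" not in pair:
            continue  # skip empty or malformed fragments
        # Split on the first "=" only, so values containing "=" stay intact
        name, value = pair.split("=", 1)
        cookies[name] = value
    return cookies


# Hypothetical cookie names, for illustration only
example = parse_cookie_header("session=abc==; theme=dark")
print(example)
```

The resulting dict can be passed to `Request(..., cookies=...)` in the same way as `self.cookies` in the spider.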


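Once a page has been fetched it has to be parsed to work out how many result pages there are, and the date is advanced by one day. In the code below, pay most attention to the part that increments the date.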

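    def parse(self, response):
        print("Crawling", response.url)
        # Total number of results for this day; a missing or non-numeric value counts as 0
        count_text = response.css('#CountOfResults::text').extract_first()
        count = int(count_text) if count_text and count_text.strip().isdigit() else 0
        if count > 0:
            # 10 results per page, rounded up so no empty trailing page is requested
            for page in range(1, (count + 9) // 10 + 1):
                # Fetch each result list page
                yield Request(self.url.format(keywords=self.keywords, from_time=self.from_time, end_time=self.end_time, page=page), cookies=self.cookies, callback=self.parse_detail, dont_filter=True)

        # Note: time.sleep blocks Scrapy's event loop; the DOWNLOAD_DELAY setting throttles without blocking
        time.sleep(2)
        # Move on to the next date
        d = datetime.datetime.strptime(self.from_time, '%Y-%m-%d')
        d = d + datetime.timedelta(days=1)
        if d.date() > datetime.date.today():
            return  # stop once the date passes today, otherwise the crawl never ends
        self.from_time = d.strftime('%Y-%m-%d')
        self.end_time = self.from_time
        yield Request(
            self.url.format(keywords=self.keywords, from_time=self.from_time, end_time=self.end_time, page=self.page),
            cookies=self.cookies, callback=self.parse, dont_filter=True)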
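The date stepping and the page arithmetic can be checked on their own, without Scrapy or a network connection. Here is a minimal sketch of that idea. The helper names (`iter_days`, `page_numbers`, `search_urls`) are my own, and the 10-results-per-page figure is an assumption inferred from the division by 10 in the spider, not something the site documents.

```python
import datetime
import math

# Same search URL as the spider, with one {day} used for both from and to
SEARCH_URL = (
    "https://zzk.cnblogs.com/s/blogpost?Keywords={keywords}"
    "&datetimerange=Customer&from={day}&to={day}&pageindex={page}"
)
RESULTS_PER_PAGE = 10  # assumption inferred from the spider's count/10


def iter_days(start, end):
    """Yield every date from start to end (inclusive) as 'YYYY-MM-DD'."""
    day = datetime.datetime.strptime(start, "%Y-%m-%d").date()
    last = datetime.datetime.strptime(end, "%Y-%m-%d").date()
    while day <= last:
        yield day.isoformat()
        day += datetime.timedelta(days=1)


def page_numbers(result_count, per_page=RESULTS_PER_PAGE):
    """Page indexes needed to cover result_count results, starting at 1."""
    return range(1, math.ceil(result_count / per_page) + 1)


def search_urls(keywords, day, result_count):
    """All result page URLs for one keyword on one day."""
    return [
        SEARCH_URL.format(keywords=keywords, day=day, page=page)
        for page in page_numbers(result_count)
    ]


for day in iter_days("2018-12-30", "2019-01-02"):
    print(day)
print(search_urls("python", "2019-01-01", 25))
```

Rounding up with `math.ceil` means a day with exactly 10 results asks for one page rather than two.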

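Parsing Pages and Storing the Data

There is nothing complicated in this part; just write it step by step. Run the code, let it work for a while, then check the count in MongoDB: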

db.getCollection('dict').count({}) 

372352 records


    def parse_detail(self, response):
        # Each search hit sits in its own div.searchItem
        items = response.xpath('//div[@class="searchItem"]')
        for item in items:
            # The title may be split across child nodes (keyword highlighting), so join the pieces
            title = item.xpath('h3[@class="searchItemTitle"]/a//text()').extract()
            title = "".join(title)

            author = item.xpath(".//span[@class='searchItemInfo-userName']/a/text()").extract_first()
            public_date = item.xpath(".//span[@class='searchItemInfo-publishDate']/text()").extract_first()
            pv = item.xpath(".//span[@class='searchItemInfo-views']/text()").extract_first()
            if pv:
                # Drop the 3-character prefix and the 1-character suffix around the view count
                pv = pv[3:-1]
            url = item.xpath(".//span[@class='searchURL']/text()").extract_first()
            # print(title, author, public_date, pv)
            yield {
                "title": title,
                "author": author,
                "public_date": public_date,
                "pv": pv,
                "url": url
            }
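The `pv[3:-1]` slice depends on the view-count text keeping exactly the same prefix and suffix. A slightly more forgiving variant pulls out the first run of digits with a regular expression. This is a sketch only: the helper name `extract_view_count` is mine, and the sample string is a guess at what the page shows (a three-character label followed by a closing parenthesis, which is what the slice implies), not something confirmed from the site.

```python
import re


def extract_view_count(text):
    """Return the first integer found in text, or None if there is none."""
    if not text:
        return None
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


# Sample input is an assumption about the page's format
print(extract_view_count("瀏覽(1234)"))
```

Unlike the slice, this returns an `int`, which is more convenient if the numbers are going to be sorted or summed later.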


Storing the Data
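The article does not show the storage code itself. One common way to do it is a Scrapy item pipeline that writes each yielded dict to MongoDB with pymongo. The sketch below is an assumption about how that could look, not the code actually used for this crawl: the class name `MongoPipeline`, the connection URI and the database name `cnblogs` are all hypothetical, and only the collection name `dict` comes from the count query shown earlier.

```python
class MongoPipeline:
    """Minimal sketch of a Scrapy item pipeline that stores items in MongoDB."""

    def __init__(self, mongo_uri="mongodb://localhost:27017",
                 db_name="cnblogs", collection_name="dict"):
        # URI and database name are assumptions for illustration
        self.mongo_uri = mongo_uri
        self.db_name = db_name
        self.collection_name = collection_name
        self.client = None
        self.collection = None

    def open_spider(self, spider):
        # Imported here so the class can be defined without pymongo installed
        import pymongo
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.collection = self.client[self.db_name][self.collection_name]

    def close_spider(self, spider):
        if self.client is not None:
            self.client.close()

    def process_item(self, item, spider):
        # Insert a copy so the _id added by MongoDB does not leak into the item
        self.collection.insert_one(dict(item))
        return item
```

To switch it on, the class would be listed under `ITEM_PIPELINES` in the project settings, for example `{"myproject.pipelines.MongoPipeline": 300}`, where the module path is again hypothetical and depends on your project layout.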

After that flurry of work, the data is in hand. Some simple data analysis can come next; see you in that post.
