Python Crawler in Practice, Part 1: Scraping Popular News from cnblogs

Posted by Python魔法师 on 2024-03-13

Original URL: https://www.cnblogs.com/meet/p/18068015

Python crawler

Hands-on case: scraping popular news from cnblogs

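1. Analyzing the page

Open cnblogs at https://www.cnblogs.com/, click [News], then click [This Week].

[Screenshot: today's news]

In this exercise we scrape the news titles shown on the page. One of the titles visible here is “李彥宏：以後不會存在“程式設計師”這種職業了” (roughly: Robin Li says the profession of "programmer" will no longer exist).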

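1.1 Analyzing the request

Press F12 to open the developer tools, click Network, and select any request. Press Ctrl+F to open the search box, enter the title 李彥宏：以後不會存在“程式設計師”這種職業了, and start the search.

[Screenshot: request analysis]

The matching request URL is https://news.cnblogs.com/n/digg?type=week, but the response is HTML source rather than JSON. This means the cnblogs back end assembles the HTML and returns it to the front end, so we cannot simply fetch the data from an API endpoint; we also have to parse the HTML.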
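The JSON-versus-HTML check described above can also be done in code instead of by eye. The following is only a minimal sketch: looks_like_json is a hypothetical helper, and the two sample bodies are made-up stand-ins for real responses, not actual cnblogs output.

```python
import json


def looks_like_json(body: str) -> bool:
    """Return True if the response body parses as a JSON object or array."""
    try:
        parsed = json.loads(body)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    # A bare number or string is valid JSON too, but an API payload is
    # normally an object or a list, so only those count here.
    return isinstance(parsed, (dict, list))


# Hypothetical sample bodies standing in for real responses.
api_style_body = '{"items": [{"title": "example"}]}'
html_style_body = '<div id="news_list"><div class="news_block"></div></div>'

print(looks_like_json(api_style_body))   # True
print(looks_like_json(html_style_body))  # False
```

If the check returns False for a page that clearly contains the data, the content is most likely rendered on the server and has to be extracted from the HTML.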

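1.2 Analyzing the page

Choose Inspect Element, then click the news title.

[Screenshot: page source]

The corresponding HTML is <a href="/n/766062/" target="_blank">李彥宏：以後不會存在“程式設計師”這種職業了</a>

From the source we can see that the title is wrapped in a div with id=news_list. Inside news_list, each item is wrapped in a news_block div, and the nesting continues level by level down to the a tag, which holds the data we want.

[Screenshot: title source analysis]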
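To make the nesting concrete, here is a small offline sketch that walks the same path from news_list down to the a tag. It makes two assumptions: the fragment below is a simplified, made-up sample modeled on the structure described above (the titles, the second link, and the dates are invented), and it uses the standard library's xml.etree.ElementTree so that it runs without extra packages. ElementTree only handles well-formed markup; the article's own code uses lxml, which copes with real-world HTML.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment; not copied from the live page.
SAMPLE = """
<div id="news_list">
  <div class="news_block">
    <div class="content">
      <h2 class="news_entry"><a href="/n/766062/" target="_blank">Sample title A</a></h2>
      <div class="entry_footer"><span class="gray">2024-03-10 10:00</span></div>
    </div>
  </div>
  <div class="news_block">
    <div class="content">
      <h2 class="news_entry"><a href="/n/000000/" target="_blank">Sample title B</a></h2>
      <div class="entry_footer"><span class="gray">2024-03-11 09:30</span></div>
    </div>
  </div>
</div>
"""

root = ET.fromstring(SAMPLE.strip())  # root is the news_list div

titles = []
# Step down one level at a time: news_block -> content -> news_entry -> a
for block in root.findall("div[@class='news_block']"):
    link = block.find("div[@class='content']/h2[@class='news_entry']/a")
    if link is not None:
        titles.append((link.text, link.get("href")))

print(titles)
```

Each tuple pairs a title with its relative link, which is the same information the a tag carries on the real page.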

1.3 Handling pagination

[Screenshot: pagination]

The page analysis shows that pagination is simple: pass the two parameters type: week and page: 2 in the Query String Parameters.
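As a minimal sketch of how those two query parameters turn into page URLs, the snippet below builds them with the standard library. The names BASE_URL and page_url are illustrative and do not come from the original code.

```python
from urllib.parse import urlencode

BASE_URL = "https://news.cnblogs.com/n/digg"


def page_url(page: int, period: str = "week") -> str:
    """Build the list URL for one page of the given period."""
    query = urlencode({"type": period, "page": page})
    return f"{BASE_URL}?{query}"


# URLs for the first three pages of the weekly list.
urls = [page_url(p) for p in range(1, 4)]
for u in urls:
    print(u)
```

With requests, the same dictionary can be passed through the params argument of requests.get instead of being formatted into the URL by hand.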

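1.4 Checking for anti-scraping measures and cookies

How do we determine which header and cookie parameters the request needs, or whether the site has any anti-scraping measures?

First copy the request as curl and run it on another machine. The curl command is as follows:

[Screenshot: curl command]

Remove the header parameters from the command one at a time to determine which ones are required. Start by removing the cookie parameter: the result is still returned. From this we conclude that the site does not require cookies for this request.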
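The elimination procedure above can be automated. The sketch below is one possible way to do it under stated assumptions: required_headers and fake_probe are hypothetical names, and fake_probe only simulates a server that insists on a user-agent. It says nothing about how cnblogs actually behaves. In real use the probe would send the request with the given headers and report whether the expected content came back.

```python
from typing import Callable, Dict, List


def required_headers(headers: Dict[str, str],
                     probe: Callable[[Dict[str, str]], bool]) -> List[str]:
    """Return the header names whose removal makes `probe` fail."""
    needed = []
    for name in headers:
        # Drop exactly one header and test the request without it.
        trimmed = {k: v for k, v in headers.items() if k != name}
        if not probe(trimmed):
            needed.append(name)
    return needed


def fake_probe(headers: Dict[str, str]) -> bool:
    # Stand-in for a real request: pretend only user-agent is mandatory.
    return "user-agent" in headers


# Made-up header values for illustration.
sample_headers = {
    "cookie": "session=abc",
    "referer": "https://news.cnblogs.com/",
    "user-agent": "Mozilla/5.0",
}

print(required_headers(sample_headers, fake_probe))  # ['user-agent']
```

Dropping one header at a time keeps each test independent, at the cost of missing cases where two headers are only required in combination.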

That makes things simple: send the request directly and parse the HTML.

2. Implementation

Create a Cnblogs class and set the default header parameters in __init__.

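class Cnblogs:
    def __init__(self):
        # Default request headers. USERAGENT is imported from the project
        # config (app.conf.conf_base) in the full code.
        self.headers = {
            'authority': 'news.cnblogs.com',
            'referer': 'https://news.cnblogs.com/n/digg?type=yesterday',
            'user-agent': USERAGENT
        }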

Add a get_news function to fetch the news.

def get_news(self):
    result = []
    # Fetch the first three pages of the weekly list.
    for i in range(1, 4):
        url = f'https://news.cnblogs.com/n/digg?type=week&page={i}'
        # Send the default headers defined in __init__.
        content = requests.get(url, headers=self.headers, timeout=10)
        html = etree.HTML(content.text)
        # Each news item is one news_block div under news_list.
        news_list = html.xpath('//*[@id="news_list"]/div[@class="news_block"]')
        for new in news_list:
            title = new.xpath('div[@class="content"]/h2[@class="news_entry"]/a/text()')
            push_date = new.xpath('div[@class="content"]/div[@class="entry_footer"]/span[@class="gray"]/text()')
            result.append({
                "news_title": str(title[0]),
                "news_date": str(push_date[0]),
                "source_en": spider_config['name_en'],
                "source_cn": spider_config['name_cn'],
            })
    return result

The code relies mainly on two libraries: requests and lxml.
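One detail worth noting in get_news: it indexes title[0] and push_date[0] directly, which raises IndexError if an XPath expression matches nothing, for example when the page layout changes. A minimal sketch of a guard follows; first_text is a hypothetical helper that is not part of the original code.

```python
from typing import Sequence


def first_text(values: Sequence[str], default: str = "") -> str:
    """Return the first XPath text result, stripped, or `default` if empty."""
    return str(values[0]).strip() if values else default


print(first_text(["  Sample title  "]))    # Sample title
print(first_text([], default="unknown"))   # unknown
```

In get_news this would replace str(title[0]) with first_text(title), so one malformed block yields an empty field instead of aborting the whole run.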

Test run

def main():
    cnblogs = Cnblogs()
    results = cnblogs.get_news()
    print(results)


if __name__ == '__main__':
    main()

[Screenshot: program output]

Full code

# -*- coding: utf-8 -*-

import os
import sys

import requests
from lxml import etree

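# Put the project root (three directories above this file's directory)
# on sys.path so that the app package can be imported.
opd = os.path.dirname
curr_path = opd(os.path.realpath(__file__))
proj_path = opd(opd(opd(curr_path)))
sys.path.insert(0, proj_path)

# Project-local modules from the author's repository.
# db and Print are imported here but not used in this listing.
from app.utils.util_mysql import db
from app.utils.util_print import Print
from app.conf.conf_base import USERAGENT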

spider_config = {
    "name_en": "https://news.cnblogs.com",
    "name_cn": "部落格園"
}


class Cnblogs:
    def __init__(self):
        self.headers = {
            'authority': 'news.cnblogs.com',
            'referer': 'https://news.cnblogs.com/n/digg?type=yesterday',
            'user-agent': USERAGENT
        }

    def get_news(self):
        result = []
        # Fetch the first three pages of the weekly list.
        for i in range(1, 4):
            url = f'https://news.cnblogs.com/n/digg?type=week&page={i}'
            # Send the default headers defined in __init__.
            content = requests.get(url, headers=self.headers, timeout=10)
            html = etree.HTML(content.text)
            # Each news item is one news_block div under news_list.
            news_list = html.xpath('//*[@id="news_list"]/div[@class="news_block"]')
            for new in news_list:
                title = new.xpath('div[@class="content"]/h2[@class="news_entry"]/a/text()')
                push_date = new.xpath('div[@class="content"]/div[@class="entry_footer"]/span[@class="gray"]/text()')
                result.append({
                    "news_title": str(title[0]),
                    "news_date": str(push_date[0]),
                    "source_en": spider_config['name_en'],
                    "source_cn": spider_config['name_cn'],
                })
        return result


def main():
    cnblogs = Cnblogs()
    results = cnblogs.get_news()
    print(results)


if __name__ == '__main__':
    main()

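Summary

With the code above, we have implemented scraping of cnblogs news.

The code in this article is for learning and exchange only; the author accepts no legal liability arising from its use.

If you found this helpful, feel free to like, bookmark, share, and follow the official account [Python魔法師] for more Python magic~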
