Scraping CSDN Python blog authors' information with selenium + xpath

Posted by 我愛工大先進演算法設計課 on 2020-12-19

Original URL: https://blog.csdn.net/cyj5201314/article/details/111409395

from selenium.webdriver import Chrome
from lxml import etree
import time
import requests
import json

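class CSDN_Spider():
    """Collect author profile data from posts listed on CSDN's Python channel."""

    def __init__(self):
        # Listing page; more posts are loaded as the page is scrolled.
        self.url = "https://www.csdn.net/nav/python"
        # Browser-like user agent for the plain requests calls.
        self.headers = {
            "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"
        }
        # Adjust the chromedriver path for your machine. executable_path is the
        # Selenium 3 style; Selenium 4 expects a Service object instead.
        self.browser = Chrome(executable_path="D:/chromedriver_win32/chromedriver.exe")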

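    def get_all_articles_url(self):
        """Scroll the listing page repeatedly and return the unique article URLs."""
        items = []
        self.browser.get(self.url)
        for _ in range(100):
            # Parse whatever the page has rendered so far.
            text = self.browser.page_source
            html = etree.HTML(text)
            tag_a_url = html.xpath("//div[@class='title']//a/@href")
            items += tag_a_url
            # Jump to the bottom so the page loads the next batch of posts.
            js = "document.documentElement.scrollTop = 100000"
            self.browser.execute_script(js)
            time.sleep(1)
        # quit() also shuts down the chromedriver process; close() only closes the window.
        self.browser.quit()
        return list(set(items))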

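    def get_detail_page_text(self, url):
        """Return the article page HTML, or a sentinel string if the request fails."""
        try:
            # The timeout keeps one stalled request from hanging the whole crawl.
            response = requests.get(url, headers=self.headers, timeout=10)
        except requests.RequestException:
            return "request not successfully"
        if response.status_code == 200:
            return response.text
        return "request not successfully"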

    def parse_detail_page(self, text):
        """Extract the author's profile fields from an article page."""
        html = etree.HTML(text)
        item = {}
        try:
            item["name"] = html.xpath("//a[@class='follow-nickName ']/text()")[0]
            item["code_age"] = html.xpath("//span[@class='personal-home-page personal-home-years']/text()")[0]
            item["authentication"] = html.xpath("//a[@class='personal-home-certification']/@title")[0]
            # The nine counters appear in a fixed order in the author sidebar.
            digital_data = html.xpath("//dl[@class='text-center']//span[@class='count']/text()")
            item["original"] = digital_data[0]
            item["week_rank"] = digital_data[1]
            item["all_rank"] = digital_data[2]
            item["visitor_num"] = digital_data[3]
            item["integral"] = digital_data[4]
            item["fans_num"] = digital_data[5]
            item["be_praised_num"] = digital_data[6]
            item["review_num"] = digital_data[7]
            item["collection"] = digital_data[8]
        except Exception:
            # Usually an IndexError: a missing element yields an empty XPath result.
            print("this item is erroneous")
            item["name"] = "invalidity"
        return item

    def save_data(self, item):
        # Append one JSON object per line; the ./data directory must already exist.
        with open("./data/csdn_authors.json", "a", encoding="utf-8") as fp:
            json.dump(item, fp, ensure_ascii=False)
            fp.write("\n")


    def main(self):
        articles_urls = self.get_all_articles_url()
        for url in articles_urls:
            text = self.get_detail_page_text(url)
            if text != "request not successfully":
                item = self.parse_detail_page(text)
                self.save_data(item)
                print(item["name"] + " saved to the local JSON file successfully")


if __name__ == '__main__':
    cs = CSDN_Spider()
    cs.main()

The scraping results are as follows:
[Image in the original post]
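Because save_data appends one JSON object per line, csdn_authors.json is a JSON Lines file and not a single JSON document, so json.load on the whole file would fail. Several posts can also belong to the same author, so the same record may appear more than once. The following is a minimal sketch of reading the file back, assuming the format written above; load_authors is a hypothetical helper that is not part of the original post, and the sample names are made up for illustration.

```python
import json


def load_authors(lines):
    """Parse JSON Lines text and return one record per author name.

    Hypothetical helper: `lines` is any iterable of strings, such as an open file.
    """
    authors = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        name = record.get("name")
        # "invalidity" is the placeholder the spider writes when parsing fails.
        if not name or name == "invalidity":
            continue
        # Keep only the first record seen for each author.
        authors.setdefault(name, record)
    return list(authors.values())


if __name__ == "__main__":
    # Made-up sample lines in the same shape save_data produces.
    sample = [
        '{"name": "alice", "fans_num": "10"}',
        '{"name": "invalidity"}',
        '{"name": "alice", "fans_num": "10"}',
        '{"name": "bob", "fans_num": "3"}',
    ]
    for author in load_authors(sample):
        print(author["name"], author["fans_num"])
```

To read the real output, open ./data/csdn_authors.json with encoding="utf-8" and pass the file object to load_authors. Deduplicating by name is an assumption made here for simplicity; the author's profile URL would be a more reliable key if it were also scraped.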
