Using a Crawler to Get the Current Number of Blog Posts and Their Word Count

Posted by SilenceHL on 2021-06-11

Original URL: https://learnku.com/articles/58039

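My personal blog has no built-in post statistics, so I wrote a small crawler to get the current number of posts and the total word count. The approach is to fetch every page of the article list first, then visit each article to add up the post count and the word count.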

import re

import requests
from lxml import etree


def post_statistic():
    # Fetch the first page of the post list.
    start_url = 'https://silencehuliang.github.io/posts/'
    response = requests.get(start_url).content.decode()
    html = etree.HTML(response)
    # This assumes the fifth pagination item holds the last page number.
    page = int(html.xpath('//ul[@class="pagination"]/li[5]//a/text()')[0])
    print('Total number of blog pages: {}'.format(page))
    archive_count = len(html.xpath('//article[@class="archive-item"]'))
    post_url_list = html.xpath('//a[@class="archive-item-link"]/@href')
    # The start URL already covers page 1, so continue from page 2
    # to avoid counting the first page twice.
    for i in range(2, page + 1):
        print('Visiting page {}'.format(i))
        next_url = 'https://silencehuliang.github.io/posts/page/{}/'.format(i)
        response = requests.get(next_url).content.decode()
        html = etree.HTML(response)
        archive_count += len(html.xpath('//article[@class="archive-item"]'))
        post_url_list.extend(html.xpath('//a[@class="archive-item-link"]/@href'))
    num = 0
    for p_url in post_url_list:
        post_url = 'https://silencehuliang.github.io' + p_url
        response = requests.get(post_url).content.decode()
        html = etree.HTML(response)
        # The post meta contains text such as "約 1234 字" ("about 1234 words");
        # the pattern stays in Chinese because it must match the page text.
        meta_text = html.xpath('//div[@class="post-meta"]/div[2]/text()[4]')[0]
        num += int(re.findall(r'約 (\d+) 字', meta_text)[0])
    print('Current number of posts: {}, total word count: {}'.format(archive_count, num))


if __name__ == '__main__':
    post_statistic()
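
The script above indexes straight into the XPath and regex results, so a post without the word-count label, or a pagination bar with a different number of items, would raise an IndexError. The following is a minimal sketch of two more tolerant helpers, not part of the original article. The names `parse_word_count` and `last_page_number` are hypothetical, and the fallback values (0 words, 1 page) are assumptions about what a reasonable default would be.

```python
import re

# Matches the "約 N 字" ("about N words") label shown in each post's meta block.
WORD_COUNT_RE = re.compile(r'約\s*(\d+)\s*字')


def parse_word_count(meta_text):
    """Return the word count found in meta_text, or 0 if the label is absent."""
    match = WORD_COUNT_RE.search(meta_text)
    return int(match.group(1)) if match else 0


def last_page_number(link_texts):
    """Return the largest number among pagination link texts, or 1 if none is numeric."""
    numbers = []
    for text in link_texts:
        text = text.strip()
        if text.isdecimal():  # skip arrows, ellipses and other non-numeric items
            numbers.append(int(text))
    return max(numbers, default=1)


if __name__ == '__main__':
    print(parse_word_count('約 1234 字'))
    print(last_page_number(['1', '2', '…', '7', '»']))
```

In the crawler, `last_page_number` could be fed the result of `html.xpath('//ul[@class="pagination"]//a/text()')` in place of the fixed `li[5]` index, and `parse_word_count` could be applied to the joined text of the post meta block.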

This work is licensed under a CC license; reposts must credit the author and link to this article.
