Scraping learnku's featured articles with a Python crawler

Posted by GA17 on 2020-04-17

Original URL: https://learnku.com/articles/43325

Building on the previous article, which logged in to learnku with requests, this one scrapes the featured articles of the learnku Python community.

The featured article list for the Python community is at https://learnku.com/python?filter=excellen...

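As usual, open the link and press F12 to open the developer tools. Click the element-picker arrow in the developer tools, then click the title of any article.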

You can see that the article URL sits in an a tag whose class is topic-title-wrap rm-tdu. The title is in the first child of that a tag, a span tag whose class is topic-title.
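To try the selectors offline before hitting the site, here is a minimal sketch that runs the same XPath expressions against a hand-written snippet. SAMPLE_HTML and the helper extract_links are hypothetical and only mimic the structure described above; the real page may differ.

```python
from lxml import etree

# Hypothetical snippet that imitates the structure described in the article.
SAMPLE_HTML = '''
<div>
  <a class="topic-title-wrap rm-tdu" href="https://learnku.com/articles/43325">
    <span class="topic-title">Sample title</span>
  </a>
  <a class="other-link" href="https://learnku.com/about">
    <span class="topic-title">Not an article</span>
  </a>
</div>
'''


def extract_links(page_html):
    """Return (url, title) pairs for every a tag with the exact class string."""
    tree = etree.HTML(page_html)
    results = []
    for a in tree.xpath('//a[@class="topic-title-wrap rm-tdu"]'):
        href = a.get('href')
        titles = a.xpath('./span[@class="topic-title"]/text()')
        # Skip anchors that lack an href or a title span instead of raising.
        if href and titles:
            results.append((href, titles[0].strip()))
    return results


print(extract_links(SAMPLE_HTML))
```

Note that @class="topic-title-wrap rm-tdu" is an exact string comparison, so it stops matching if the site adds or reorders classes on that tag.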

Continuing from the code in the previous article:

from lxml import etree
import requests


# Create a session so that cookies persist across requests
s = requests.Session()

headers = {
    'Host': 'learnku.com',
    'User-Agent': 'your User-Agent'
}
# Update the session's request headers
s.headers.update(headers)

# Request learnku to get the token. This also picks up the home page cookies.
response = s.get('https://learnku.com/')
html = etree.HTML(response.text)
# Parse the home page for the token; xpath() returns a list, so take the first match
token = html.xpath('//meta[@name="csrf-token"]/@content')[0]

# Define the data to submit for login
payload = {
    '_token': token,
    'remember': 'yes',
    'return_back': 'yes',
    'username': 'your username',
    'password': 'your password'
}
# Send the login request
s.post('https://learnku.com/auth/login', data=payload)

# Request the featured articles
url = 'https://learnku.com/python?filter=excellent'
response = s.get(url)
html = etree.HTML(response.text)

# Locate the a tags on the page that contain the titles
lis = html.xpath('//a[@class="topic-title-wrap rm-tdu"]')

# Iterate over these a tags
for li in lis:
    essay_url = li.xpath('./@href')[0]
    title = li.xpath('./span[@class="topic-title"]/text()')[0]

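This gets the link and title of each article. However, I found that the text inside an article page is split up by tags, which makes it awkward to extract. Images are hard to pull out too, and even if you manage it, the resulting article does not look good. Then it occurred to me that I could use the "improve" feature, whose page holds the article in Markdown format.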

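The original article link is the blog post "python crawler: scraping Kugou Music".
The improve-article link is https://learnku.com/topics/42536/patches/c...
As you can see, the number in it is the article ID.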

So we need to extract the ID from the article link and turn it into the improve-article link,
then request the improve-article link.

from lxml import etree
import requests

# Middle part of the code omitted
for li in lis:
    essay_url = li.xpath('./@href')[0]
    # The ID is everything after the last slash of the article URL
    ID = essay_url[essay_url.rfind('/')+1:]
    modify_url = 'https://learnku.com/topics/{}/patches/create'.format(ID)
    title = li.xpath('./span[@class="topic-title"]/text()')[0]
    # Request the improve-article page
    response = s.get(modify_url)
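The rfind approach above works when the ID is the last path segment. As a hedged alternative, the sketch below picks the first all-digit path segment, so it would also cope with a trailing slug or a query string. Whether learnku ever emits such URLs is an assumption, and the helper names extract_topic_id and build_patch_url are hypothetical.

```python
from urllib.parse import urlparse

# URL pattern taken from the article's own code.
PATCH_URL_TEMPLATE = 'https://learnku.com/topics/{}/patches/create'


def extract_topic_id(essay_url):
    """Return the first all-digit path segment of the URL, or None."""
    path = urlparse(essay_url).path
    for segment in path.split('/'):
        if segment.isdigit():
            return segment
    return None


def build_patch_url(essay_url):
    """Build the improve-article URL for an article URL."""
    topic_id = extract_topic_id(essay_url)
    if topic_id is None:
        raise ValueError('no numeric ID found in {!r}'.format(essay_url))
    return PATCH_URL_TEMPLATE.format(topic_id)


print(build_patch_url('https://learnku.com/articles/43325'))
```

Raising on a missing ID makes a changed URL scheme visible immediately, rather than silently requesting a malformed address.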

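On the improve-article page, if you keep locating elements with the developer tools, you will find that the HTML returned by the request does not contain that element. My guess is that what the developer tools show is rendered by JS. So I saved the response data as an HTML file, searched it for a few words from the article, and found that the article is inside a textarea tag whose name is body.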
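Given that finding, the Markdown can be pulled out of the response with one more XPath query. This is a minimal sketch, not the author's code: the helper extract_markdown and the sample HTML are hypothetical, and it assumes the page contains a single textarea named body.

```python
from lxml import etree


def extract_markdown(page_html):
    """Return the text of the textarea named "body", or '' if it is absent."""
    tree = etree.HTML(page_html)
    # string() concatenates all text inside the first matching node and
    # yields an empty string when nothing matches.
    return tree.xpath('string(//textarea[@name="body"])')


# Hypothetical page fragment for an offline check.
sample_html = (
    '<html><body><form>'
    '<textarea name="body"># Title\n\nHello &amp; welcome</textarea>'
    '</form></body></html>'
)
print(extract_markdown(sample_html))
```

In the loop, this would be called as extract_markdown(response.text) after requesting the improve-article page. HTML entities in the textarea are decoded by the parser, so the result is plain Markdown.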

Once you have the article content, you can save it.

from lxml import etree
import requests

# Middle part of the code omitted
for li in lis:
    essay_url = li.xpath('./@href')[0]
    ID = essay_url[essay_url.rfind('/')+1:]
    modify_url = 'https://learnku.com/topics/{}/patches/create'.format(ID)
    title = li.xpath('./span[@class="topic-title"]/text()')[0]
    # Request the improve-article page
    response = s.get(modify_url)
    # Save the response (the whole improve-article page, which contains the Markdown)
    with open('{}.html'.format(title), 'wb') as f:
        f.write(response.content)
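Using the title directly as a file name can fail when it contains characters such as / or ?. The sketch below shows one way to sanitize it first. The helper safe_filename, its replacement character, and its length limit are all hypothetical choices, not part of the original code.

```python
import re


def safe_filename(title, extension='.md', max_length=100):
    """Turn an article title into a file name that common filesystems accept."""
    # Replace path separators, Windows-reserved characters and control
    # whitespace with underscores, then trim surrounding spaces and dots.
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', '_', title).strip(' .')
    # Fall back to a fixed name if nothing usable is left.
    cleaned = cleaned[:max_length] or 'untitled'
    return cleaned + extension


print(safe_filename('python: crawl / save?'))
```

The returned name can be passed to open(name, 'w', encoding='utf-8') to write the extracted Markdown as text, or with extension='.html' to keep the original behaviour of saving the whole page.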

This work is licensed under a CC license; reposts must credit the author and link to this article.
