Scraping Jianshu data with Python

Posted by One of them on 2018-08-25

Original URL: https://blog.csdn.net/one_of_them/article/details/82049176

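Use Python to scrape the article titles on the Jianshu home page, along with the information on each article's detail page.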

"""
   get page_data of 'JianShu.com'.(rewrite)
"""
import json
import re

import requests
from bs4 import BeautifulSoup  # the 'lxml' parser used below still requires lxml to be installed


class PageTo:
    def __init__(self, url):
        self.url = url
        self.header = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
        }

        self.main_page = requests.get(self.url, headers=self.header)
        self.csrf_token = re.findall('<meta name="csrf-token" content="(.*?)" />', self.main_page.text)[0]
        self.note_id = re.findall('<li id=".*?" data-note-id="(.*?)"', self.main_page.text, re.S)
        print("Page 1")
        self.page_parse(self.main_page.text)

    def get_ajax_page(self, start_page, end_page):
        ajax_header = self.header.copy()
        ajax_header.update({
            'X-INFINITESCROLL': 'true',
            'X-CSRF-Token': self.csrf_token
        })

        for page in range(start_page, end_page + 1):
            print(f"Page {page}")
            # Send the IDs already seen so the server can return the next batch.
            params = {'seen_snote_ids[]': self.note_id}
            ajax_page = requests.get(self.url, params=params, headers=ajax_header)
            self.page_parse(ajax_page.text)
            # The closing quote is required; without it the lazy group only matches empty strings.
            self.note_id += re.findall('data-note-id="(.*?)"', ajax_page.text, re.S)
        print("It's over.")

    def page_parse(self, page):
        article_title_link = re.findall('<a class="title" target="_blank" href="(.*?)">(.*?)</a>', page, re.S)
        article_abstract = re.findall(' <p class="abstract">(.*?)</p>', page, re.S)

        article_header = self.header.copy()
        article_header.update({
            'Upgrade-Insecure-Requests': '1'
        })
        for info in range(len(article_abstract)):
            print(article_title_link[info][1])
            print(article_abstract[info])

            article_page = requests.get(f'{self.url}{article_title_link[info][0]}', headers=article_header).text
            self.article_parse(article_page)

    def article_parse(self, article_page):
        soup = BeautifulSoup(article_page, 'lxml')
        article_info = soup.find('div', class_='article')
        author = article_info.find('div', class_='author').find('span', class_='name').get_text()
        author_info = json.loads(soup.find('script', type='application/json').get_text())
        article = article_info.find_all('p')
        print('author:', author)
        print(f'"likes_count":{author_info["note"]["likes_count"]},'
              f'"views_count":{author_info["note"]["views_count"]},'
              f'"public_wordage":{author_info["note"]["public_wordage"]},'
              f'"comments_count":{author_info["note"]["comments_count"]}')
        for info in range(len(article)):
            print(article[info].get_text())
        print("\n")


if __name__ == '__main__':
    JianShu = PageTo('https://www.jianshu.com')
    JianShu.get_ajax_page(2, 3)
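
The extraction steps in the script can be tried offline before sending any requests. The following is a minimal sketch, not part of the original post: SAMPLE_HTML is a hypothetical fragment that only imitates the markup the regular expressions expect, and extract_listing and build_query are illustrative helper names. It uses urllib.parse.urlencode to show how a list of IDs turns into repeated seen_snote_ids[] keys in the query string, which is also how requests encodes a list-valued parameter.

```python
import re
from urllib.parse import urlencode

# Hypothetical HTML fragment (an assumption, not real Jianshu output).
SAMPLE_HTML = '''
<meta name="csrf-token" content="abc123" />
<ul>
  <li id="note-111" data-note-id="111" class="have-img">
    <a class="title" target="_blank" href="/p/aaa">First post</a>
    <p class="abstract">First abstract</p>
  </li>
  <li id="note-222" data-note-id="222" class="">
    <a class="title" target="_blank" href="/p/bbb">Second post</a>
    <p class="abstract">Second abstract</p>
  </li>
</ul>
'''


def extract_listing(html):
    """Return (csrf_token, note_ids, [(href, title), ...]) from listing HTML."""
    token = re.findall(r'<meta name="csrf-token" content="(.*?)" />', html)[0]
    note_ids = re.findall(r'data-note-id="(.*?)"', html)
    links = re.findall(
        r'<a class="title" target="_blank" href="(.*?)">(.*?)</a>', html, re.S
    )
    return token, note_ids, links


def build_query(note_ids):
    """Encode the IDs as repeated seen_snote_ids[] keys."""
    return urlencode({'seen_snote_ids[]': note_ids}, doseq=True)


token, note_ids, links = extract_listing(SAMPLE_HTML)
print(token)
print(note_ids)
print(links)
print(build_query(note_ids))
```

Running the helpers on a saved copy of the real page is a quick way to check whether the site's markup still matches these patterns, since regular expressions tied to exact attribute order break as soon as the HTML changes.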
