GB Standard Document Crawler and Download Program

Posted by babyfengfjx on 2024-04-11

Original URL: https://www.cnblogs.com/babyfengfjx/p/18128743

"""
author: babyfengfjx
"""
import requests
import re
from time import sleep
from bs4 import BeautifulSoup
import shelve
headers = {
  "Accept": "image/avif,image/webp,image/apng,image/svg+xml,image/*,*/*;q=0.8",
  # "Accept-Encoding": "gzip, deflate, br, zstd",
  "Accept-Language": "zh-CN,zh;q=0.9",
  "Connection": "keep-alive",
  "Cookie": "HMACCOUNT_BFESS=D1D8258E03E0A558; BDUSS_BFESS=dKNzJKaFhQQWFvcEpSZG9oRE5YR0Zod1l-VHE3ZVFLfnJTZWNJT3JKbGdiT3BsRVFBQUFBJCQAAAAAAAAAAAEAAAB~qcIBZmxvd2ludGhlcml2ZXIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGDfwmVg38JlMT; H_WISE_SIDS_BFESS=40008_40206_40212_40215_40080_40364_40352_40367_40374_40401_40311_40301_40467_40461_40471_40456_40317; BAIDUID_BFESS=A6E2AF276F85EFFB50804B65078FB44D:FG=1; ZFY=hyR2bKIUFoz76hVFPIVRUUHYScV4SOFL0yQP0ASJu4k:C",
  # "Host": "hm.baidu.com",
  "Referer": "https://0ppt.com/",
  # "Sec-Ch-Ua": "\"Chromium\";v=\"124\", \"Microsoft Edge\";v=\"124\", \"Not-A.Brand\";v=\"99\"",
  # "Sec-Ch-Ua-Mobile": "?0",
  # "Sec-Ch-Ua-Platform": "\"Windows\"",
  # "Sec-Fetch-Dest": "image",
  # "Sec-Fetch-Mode": "no-cors",
  # "Sec-Fetch-Site": "cross-site",
  "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36 Edg/124.0.0.0"
}
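def getcontent_list(html):
    # Fetch one listing page, e.g. https://0ppt.com/bz/index_927.html,
    # and return a list of (detail_url, title) tuples.
    res = requests.get(html, headers=headers)
    res.encoding = 'GBK'  # decode the response body as GBK
    html_content = res.text

    # print(html_content)

    # Match links that carry the title in a title="..." attribute,
    # and links that carry it as the anchor text.
    repat = re.compile(r'<a href="(https.+?)".*?title="(.*?)">.*?</a>')
    repat1 = re.compile(r'<a href="(https.+?)".*?target="_blank">(.*?)</a>')
    result = repat.findall(html_content)
    result1 = repat1.findall(html_content)
    return result + result1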

def downloadfiles(fileinfo):
    # fileinfo is a (detail_url, title) tuple as returned by getcontent_list
    url = fileinfo[0]
    name = fileinfo[1]
    download_url_info = requests.get(url, headers=headers).text
    # print(download_url_info)
    # The link text below is the label of the page's "online preview" button
    repat = re.compile(r'<a href="(https.*?)" target="_blank" rel="nofollow" '
                       r'class="bz-down-button">線上預覽</a>')
    download_url = repat.findall(download_url_info)
    # '/' and similar characters cannot be used in Windows file names, so replace them
    name = name.replace('/', '-').replace('∕', '-').replace(':', '-').replace('*', '-')

    return download_url[0], name

def download_file(url, name):
    # Skip URLs that are already recorded in the shelve database
    with shelve.open('download_list') as db:
        if url in db:
            # print(f'{name}: already downloaded')
            return
    print(f'{name} -- starting download: {url}')
    res = requests.get(url, headers=headers, stream=True)
    if res.status_code == 200:
        with open(f'{name}.pdf', 'wb') as f:
            for chunk in res.iter_content(chunk_size=16384):  # 16384 bytes = 16 KiB
                f.write(chunk)
        print(f'{name}: download complete')
        # Record the URL so it is skipped on later runs
        with shelve.open('download_list') as db:
            db[url] = name
    else:
        print(f'{name}: download failed')

if __name__ == '__main__':
    for base_page in range(45, 928):
        htmlbase = f'https://0ppt.com/bz/index_{base_page}.html'
        print(f"Current page: {htmlbase}")
        res = getcontent_list(htmlbase)
        # print(res)
        for i in res:
            # print(i)
            # if "軟體" in i[1]:  # optional filter: titles containing "software"
            try:
                download_url, name = downloadfiles(i)
            except IndexError:
                # No preview link matched on the detail page; skip this entry
                continue
            download_file(download_url, name)
            sleep(1)
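
The script above replaces four characters in the file name before saving. Windows also rejects `\ ? " < > |` and names that end in a space or a dot, so a title containing any of those would still fail at `open()`. Below is a minimal sketch of a more complete cleanup step. The names `sanitize_filename` and `_FORBIDDEN` are hypothetical helpers that are not part of the original post, and control characters and reserved device names such as `CON` are not handled here.

```python
import re

# Characters that Windows rejects in file names, plus the slash-like
# character '∕' that the script above also replaces.
_FORBIDDEN = re.compile(r'[\\/:*?"<>|∕]')


def sanitize_filename(name, replacement='-'):
    """Return `name` with characters that are invalid on Windows replaced."""
    cleaned = _FORBIDDEN.sub(replacement, name)
    # Windows also rejects names that end in a space or a dot.
    return cleaned.rstrip(' .')


print(sanitize_filename('GB/T 1.1-2020: Part 1?'))  # GB-T 1.1-2020- Part 1-
```

If this helper were used, the chained `replace` calls in `downloadfiles` could become a single `name = sanitize_filename(name)` line.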
