A Treat for Duanyou: Scraping Images and Short Videos from the Duanyou Zhijia (段友之家) Tieba with Python

Posted by gxcuizy on 2018-06-01

Original article: https://juejin.im/post/5b1158d4e51d4506d167f30a

Python

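Following the recent crackdown on online video, the Neihan Duanzi (內涵段子) app was forced to shut down, leaving its many fans, the Duanyou (段友), without a home. I recently came across an app called "Duanyou" that is updated fairly often and is calling on fans to come back (the original post shows a screenshot here). If you're interested, download it and take a look. (P.S. This is not an ad, and I'm not being paid for it.)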

A colleague also pointed me to a Tieba forum where Duanyou gather. Here is the link:
Duanyou Zhijia (段友之家): https://tieba.baidu.com/f?ie=utf-8&kw=段友之家

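Sure enough, there are plenty of Duanyou posting there, so I decided to scrape their images and short videos, which is what this article is about.

Scraping website data with Python is basic stuff and not hard, but I still want to share it so we can learn and exchange ideas together.

The modules used are bs4, requests, and os, all of them commonly used. bs4 (the beautifulsoup4 package) and requests are third-party packages, while os ships with Python.

The rough idea: request the page's HTML with requests, parse it with BeautifulSoup from bs4, then use CSS selectors to find the URLs of the images and short videos. The main code is as follows: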

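import bs4
import requests


def download_file(web_url):
    """Fetch a listing page and download the images and videos it links to."""
    # Download the page
    print('Downloading page: %s...' % web_url)
    result = requests.get(web_url)
    result.raise_for_status()
    soup = bs4.BeautifulSoup(result.text, "html.parser")
    # Look for image resources
    img_list = soup.select('.vpic_wrap img')
    if not img_list:
        print('No image resources found!')
    else:
        # Resources found, start writing
        for img_info in img_list:
            file_url = img_info.get('bpic')
            # Skip tags that lack the attribute
            if file_url:
                write_file(file_url, 1)
    # Look for video resources
    video_list = soup.select('.threadlist_video a')
    if not video_list:
        print('No video resources found!')
    else:
        # Resources found, start writing
        for video_info in video_list:
            file_url = video_info.get('data-video')
            # Skip tags that lack the attribute
            if file_url:
                write_file(file_url, 2)
    print('Finished downloading resources:', web_url)
    # Follow the "next page" link, if there is one
    next_link = soup.select('#frs_list_pager .next')
    if not next_link:
        print('All pages downloaded!')
    else:
        url = next_link[0].get('href')
        download_file('https:' + url)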
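To see what the two selectors pick out without hitting the network, here is a minimal sketch that runs them against a small, made-up HTML fragment. The fragment only imitates the structure the selectors expect (the real Tieba markup may differ and may change), and `extract_urls` and `SAMPLE_HTML` are hypothetical names that are not part of the original script.

```python
import bs4

# Made-up HTML that mimics the structure the selectors look for.
# The second <img> deliberately has no "bpic" attribute.
SAMPLE_HTML = """
<div class="vpic_wrap"><img src="thumb1.jpg" bpic="https://example.com/full/1.jpg"></div>
<div class="vpic_wrap"><img src="thumb2.jpg"></div>
<div class="threadlist_video"><a data-video="https://example.com/v/1.mp4"></a></div>
"""


def extract_urls(html):
    """Return (image_urls, video_urls) found in the given HTML string."""
    soup = bs4.BeautifulSoup(html, 'html.parser')
    # Tag.get() returns None for a missing attribute, so filter those out.
    images = [img.get('bpic') for img in soup.select('.vpic_wrap img')
              if img.get('bpic')]
    videos = [a.get('data-video') for a in soup.select('.threadlist_video a')
              if a.get('data-video')]
    return images, videos


images, videos = extract_urls(SAMPLE_HTML)
print(images)
print(videos)
```

The filter matters because a tag that matches the selector but lacks the attribute would otherwise pass `None` on to the download step.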

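Having the image and video URLs is not enough; the resources still have to be saved locally. The approach is to read each remote file as binary data and write it to a local folder chosen by file type. The main code is as follows: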

import os
import requests


def write_file(file_url, file_type):
    """Download a remote file and write it to a folder chosen by type."""
    res = requests.get(file_url)
    res.raise_for_status()
    # Pick the target folder by file type (1 = image, 2 = video)
    if file_type == 1:
        file_folder = os.path.join('nhdz', 'jpg')
    elif file_type == 2:
        file_folder = os.path.join('nhdz', 'mp4')
    else:
        file_folder = os.path.join('nhdz', 'other')
    # Create the folder if it does not exist yet
    os.makedirs(file_folder, exist_ok=True)
    # Derive the file name from the URL, dropping any query string
    file_name = os.path.basename(file_url)
    str_index = file_name.find('?')
    if str_index > 0:
        file_name = file_name[:str_index]
    file_path = os.path.join(file_folder, file_name)
    print('Writing resource file:', file_path)
    # Write the response body in binary mode, chunk by chunk
    with open(file_path, 'wb') as out_file:
        for chunk in res.iter_content(100000):
            out_file.write(chunk)
    print('Write complete!')
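The path handling and the chunked binary write can also be tried without a network connection. The sketch below is one possible way to split those two steps into small functions; `local_path`, `save_chunks`, `FOLDERS`, and the example.com URL are hypothetical and not part of the original script. It swaps the manual `find('?')` for `urllib.parse.urlsplit`, which separates the query string from the path.

```python
import os
import posixpath
import tempfile
from urllib.parse import urlsplit

# Sub-folder per file type, mirroring the 1 = image / 2 = video convention.
FOLDERS = {1: 'jpg', 2: 'mp4'}


def local_path(file_url, file_type, root='nhdz'):
    """Build the local file path for a remote URL."""
    folder = os.path.join(root, FOLDERS.get(file_type, 'other'))
    # urlsplit() keeps the query string out of .path, so only the
    # last path segment remains as the file name.
    name = posixpath.basename(urlsplit(file_url).path)
    return os.path.join(folder, name)


def save_chunks(chunks, file_path):
    """Write an iterable of byte chunks to file_path; return the file size."""
    os.makedirs(os.path.dirname(file_path), exist_ok=True)
    with open(file_path, 'wb') as out_file:
        for chunk in chunks:
            out_file.write(chunk)
    return os.path.getsize(file_path)


# Demo with in-memory chunks standing in for a real HTTP response.
with tempfile.TemporaryDirectory() as tmp:
    path = local_path('https://example.com/a/pic.jpg?size=big', 1, root=tmp)
    size = save_chunks([b'abc', b'de'], path)
    print(os.path.basename(path), size)
```

With requests, the chunks would come from `res.iter_content(100000)`, so the call would look like `save_chunks(res.iter_content(100000), local_path(file_url, file_type))`.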

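Finally, here is the complete code. Otherwise people will complain that I only told half the story and promised a treat without delivering all of it, which would be rather unfair. Hang on, here it comes...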

#!/usr/bin/env python
# -*- coding: utf-8 -*-

"""
Scrape images and videos from the Duanyou Zhijia forum on Baidu Tieba
author: cuizy
time: 2018-05-19
"""

import requests
import bs4
import os


def write_file(file_url, file_type):
    """Download a remote file and write it to a folder chosen by type."""
    res = requests.get(file_url)
    res.raise_for_status()
    # Pick the target folder by file type (1 = image, 2 = video)
    if file_type == 1:
        file_folder = os.path.join('nhdz', 'jpg')
    elif file_type == 2:
        file_folder = os.path.join('nhdz', 'mp4')
    else:
        file_folder = os.path.join('nhdz', 'other')
    # Create the folder if it does not exist yet
    os.makedirs(file_folder, exist_ok=True)
    # Derive the file name from the URL, dropping any query string
    file_name = os.path.basename(file_url)
    str_index = file_name.find('?')
    if str_index > 0:
        file_name = file_name[:str_index]
    file_path = os.path.join(file_folder, file_name)
    print('Writing resource file:', file_path)
    # Write the response body in binary mode, chunk by chunk
    with open(file_path, 'wb') as out_file:
        for chunk in res.iter_content(100000):
            out_file.write(chunk)
    print('Write complete!')


def download_file(web_url):
    """Fetch a listing page and download the images and videos it links to."""
    # Download the page
    print('Downloading page: %s...' % web_url)
    result = requests.get(web_url)
    result.raise_for_status()
    soup = bs4.BeautifulSoup(result.text, "html.parser")
    # Look for image resources
    img_list = soup.select('.vpic_wrap img')
    if not img_list:
        print('No image resources found!')
    else:
        # Resources found, start writing
        for img_info in img_list:
            file_url = img_info.get('bpic')
            # Skip tags that lack the attribute
            if file_url:
                write_file(file_url, 1)
    # Look for video resources
    video_list = soup.select('.threadlist_video a')
    if not video_list:
        print('No video resources found!')
    else:
        # Resources found, start writing
        for video_info in video_list:
            file_url = video_info.get('data-video')
            # Skip tags that lack the attribute
            if file_url:
                write_file(file_url, 2)
    print('Finished downloading resources:', web_url)
    # Follow the "next page" link, if there is one
    next_link = soup.select('#frs_list_pager .next')
    if not next_link:
        print('All pages downloaded!')
    else:
        url = next_link[0].get('href')
        download_file('https:' + url)


# Main entry point
if __name__ == '__main__':
    web_url = 'https://tieba.baidu.com/f?ie=utf-8&kw=段友之家'
    download_file(web_url)
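One thing to keep in mind: `download_file` calls itself once per page, so a forum with a very large number of pages could run into Python's default recursion limit (about 1000 frames). A minimal sketch of the same page-by-page traversal written as a loop is shown below. `crawl_pages`, `get_next_href`, `max_pages`, and the demo URLs are hypothetical and not part of the original script. The sketch also resolves the next link with `urllib.parse.urljoin` rather than prepending `'https:'`, on the assumption (implied by the original concatenation) that the `href` is protocol-relative.

```python
from urllib.parse import urljoin


def crawl_pages(start_url, get_next_href, max_pages=100):
    """Visit pages one after another without recursion.

    get_next_href(url) should process the page at url and return the
    href of the "next page" link, or None when there is no next page.
    """
    visited = []
    url = start_url
    while url and len(visited) < max_pages:
        visited.append(url)
        next_href = get_next_href(url)
        # urljoin resolves protocol-relative ("//host/path") and
        # relative hrefs against the current page URL.
        url = urljoin(url, next_href) if next_href else None
    return visited


# Demo with a fake two-page site instead of real HTTP requests.
fake_site = {
    'https://tieba.baidu.com/f?kw=demo': '//tieba.baidu.com/f?kw=demo&pn=50',
    'https://tieba.baidu.com/f?kw=demo&pn=50': None,
}
pages = crawl_pages('https://tieba.baidu.com/f?kw=demo', fake_site.get)
print(pages)
```

In the real script, `get_next_href` would fetch the page, hand its images and videos to `write_file`, and return `next_link[0].get('href')` or `None`. The `max_pages` cap is an extra safety net in case a "next" link ever points back to an earlier page.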
