Crawler source code for scraping videos

Posted by Edgar和他的演算法 on 2020-12-26

Source code for scraping short videos from Pear Video (梨視訊)


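First, here is the link to the Bilibili tutorial this post is based on. If you are interested in web scraping, go check out the original course; it is really well explained.
Link: https://www.bilibili.com/video/BV1Yh411o7Sz?t=13&p=44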

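Without further ado, let's get started.
Goal: scrape the currently most popular videos from the home page. Getting a single video involves three requests.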
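Difficulties:
1. The request chain is [home page URL] -> [detail page URL] -> [video URL]. The video URL is currently served through an AJAX request, and the video address in the AJAX response needs one more decoding step before it is usable.
Video link obtained with the browser developer tools: https://video.pearvideo.com/mp4/third/20201225/cont-1713375-10305425-120340-hd.mp4
Video link stored in the AJAX response:
https://video.pearvideo.com/mp4/third/20201225/1608982451558-10305425-120340-hd.mp4

2. To fetch the AJAX link, the POST request must carry, in addition to the usual headers, a Referer set to the URL of the page you came from (the detail page).

3. The mrd parameter of the AJAX POST request must be assigned a random number.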
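
The three points above can be sketched as two small helpers that need no network access. This is only an illustration under assumptions: the function names `build_status_request` and `to_real_video_url` are made up for this example, the short User-Agent string is a placeholder, and the detail page URL `https://www.pearvideo.com/video_1713375` is inferred from the `cont-1713375` part of the link shown above. The decoding rule (replace the first field of the file name with `cont-<id>`) is the one observed in this post; the site may change it at any time.

```python
import random


def build_status_request(detail_url):
    """Assemble the pieces of the AJAX request for one detail page.

    Returns (endpoint, headers, data). Nothing is sent here.
    """
    # 'https://www.pearvideo.com/video_1713375' -> '1713375'
    cont_id = detail_url.split("_")[1]
    headers = {
        "User-Agent": "Mozilla/5.0",  # placeholder; use a full browser string in practice
        "Referer": detail_url,        # difficulty 2: the page we came from
    }
    data = {
        "contId": cont_id,
        "mrd": str(random.random()),  # difficulty 3: a random number
    }
    return "https://www.pearvideo.com/videoStatus.jsp", headers, data


def to_real_video_url(src_url, cont_id):
    """Difficulty 1: swap the first '-'-separated field of the file name for cont-<id>."""
    head, _, filename = src_url.rpartition("/")
    _, sep, rest = filename.partition("-")
    if not sep:
        raise ValueError(f"unexpected file name: {filename!r}")
    return f"{head}/cont-{cont_id}-{rest}"


# Assumed detail page URL, inferred from the example links above.
endpoint, headers, data = build_status_request("https://www.pearvideo.com/video_1713375")
fake = "https://video.pearvideo.com/mp4/third/20201225/1608982451558-10305425-120340-hd.mp4"
real = to_real_video_url(fake, data["contId"])
print(real)
```

With the two links quoted above, the rewritten address matches the one seen in the developer tools.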

Home page: [screenshot]
Detail page: [screenshot]

import requests
from multiprocessing.dummy import Pool
from lxml import etree
import random
import re
headers = {
    "User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"
}

# Principle: a thread pool should handle operations that are blocking and time-consuming

url = 'https://www.pearvideo.com/category_5'
video_info = requests.get(url = url, headers=  headers).text

tree = etree.HTML(video_info)
li_list = tree.xpath('//div[@class = "category-top"]//li[@class = "categoryem "]')

urls = []  # holds the name and link of every video
for li in li_list:
    title = li.xpath('.//div[@class = "vervideo-title"]/text()')[0] + '.mp4'
    detail_url = 'https://www.pearvideo.com/' + li.xpath('.//a/@href')[0]
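    # Request the detail page
    detail_url_page = requests.get(url=detail_url, headers=headers).text
    # Try to parse the video address (url) out of the detail page.
    # This result is not used below: the address is taken from the AJAX response instead.
    detail_tree = etree.HTML(detail_url_page)
    content = detail_tree.xpath('//div[@class = "img prism-player play"]/video/@src')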

    post_url = "https://www.pearvideo.com/videoStatus.jsp?"
    post_headers = {
        "User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36",
        "Referer":detail_url
    }

    video_id = detail_url.split('_')[1]

    data = {
        'contId':str(video_id),
        'mrd':str(random.random())
    }

    response_page = requests.post(url=post_url, data=data, headers=post_headers).text
    # Pull the placeholder video address out of the AJAX response
    ex = 'srcUrl":"(.*?)"}}'
    video_fake_url = re.findall(ex, response_page)[0]
    video_fake_url_split = re.split('(/)', video_fake_url)
    # Rebuild the correct video url: replace the first field of the file name with cont-<video_id>
    target_str = re.split('(-)', video_fake_url_split[-1])
    target_str_process = [f'cont-{video_id}' if x == target_str[0] else x for x in target_str]
    target_str_process_link = "".join(target_str_process)
    video_fake_url_split_process = [target_str_process_link if x == video_fake_url_split[-1] else x for x in video_fake_url_split]
    video_url = "".join(video_fake_url_split_process)
    dic = {
        'name':title,
        'url':video_url
    }
    urls.append(dic)
print(urls)

def get_video_data(dic):
    url = dic['url']
    print(dic["name"], "downloading...")
    data = requests.get(url=url, headers=headers).content
    with open(dic['name'], 'wb') as fp:
        fp.write(data)
    print(dic["name"], "downloaded successfully!")

# Use a thread pool to request the video data
pool = Pool(4)
pool.map(get_video_data,urls)

pool.close()
pool.join()
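
To try out the thread pool pattern without hitting the site, here is a minimal sketch in which the download is replaced by a short sleep. `fake_download` and the item names are hypothetical stand-ins, not part of the script above. It also uses the pool as a context manager, which is an alternative to calling `close()` and `join()` by hand once `map` has returned.

```python
import time
from multiprocessing.dummy import Pool  # thread-backed Pool with the multiprocessing API


def fake_download(item):
    """Stand-in for a blocking download: sleep instead of doing network I/O."""
    time.sleep(0.1)
    return f"{item['name']} done"


# Hypothetical work items shaped like the dicts collected in `urls` above.
items = [{"name": f"video_{i}.mp4"} for i in range(4)]

# map() blocks until every item is processed and keeps the input order.
with Pool(4) as pool:
    results = pool.map(fake_download, items)

print(results)
```

Because the four sleeps overlap across threads, the whole batch should take roughly as long as a single call rather than four times as long, which is the reason for using the pool on blocking work such as downloads.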

That's all. Comments and discussion are welcome~
