A Python crawler for downloading iQiyi VIP videos

Posted by 行者劉6 on 2018-04-20

Original URL: https://blog.csdn.net/qq_38282706/article/details/80025406

Python crawlers

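I wrote this crawler by following @Jack-Cui's blog post, but the target site has since been updated, so the scraping code had to be updated as well.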

Original blog post: https://blog.csdn.net/c406495762/article/details/78123502

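Key points: 1. Capturing and analysing traffic with Charles, and using its Compose feature

2. Shallow copying with headers.copy()

3. Using re.findall

4. Decoding values produced by JavaScript's encodeURIComponent with parse.unquote from urllib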
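Points 2 to 4 can be tried offline with a minimal sketch. The header values, the `page` fragment and the variable names below are made up for illustration; only `dict.copy()`, `re.findall` and `urllib.parse.unquote` are the calls this post relies on.

```python
import re
from urllib import parse

# Shallow copy: the copy has its own top-level keys, so adding a
# per-request 'referer' does not leak back into the shared base headers.
base_headers = {'user-agent': 'Mozilla/5.0'}
request_headers = base_headers.copy()
request_headers['referer'] = 'http://api.xfsub.com/index.php?url=http://www.iqiyi.com/v_19rr7qhfg0.html'

# re.findall returns a list with one captured group per match.
# The page fragment is hypothetical and much shorter than the real response.
page = '$.post("url.php", {"time":"1523979626","key":"abc","ckey1":"x"},'
captured = re.findall(r'"url\.php", (.+),', page)

# encodeURIComponent produces percent-encoded text; unquote reverses it.
encoded = 'http%3A%2F%2Fwww.iqiyi.com%2Fv_19rrluypwg.html'
decoded = parse.unquote(encoded)

print('referer' in base_headers)
print(captured)
print(decoded)
```

Because the copy is shallow, this only protects top-level keys. That is enough here, since every header value is a plain string.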

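Note: Fiddler is quite limited, because pages that show up only as "Tunnel to" entries cannot be displayed. After asking several experienced crawler authors and searching Baidu, I switched to Charles (the one with the vase icon). It works well and is worth exploring yourself. It is paid software, and I used a cracked copy.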

###http://api.xfsub.com/index.php?url=http://www.iqiyi.com/v_19rr7qhfg0.html#vfrm=19-9-0-1

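This Xuanfeng (xfsub) parsing site has been updated yet again. It can no longer be played directly and now goes through an iframe, which is a real hassle.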

I put up with it and worked through this VIP playback site anyway, only to find that the last step, downloading the video, was even worse.

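The files turn out to be in .ts format, hundreds of them, each about 5 seconds long. I stopped at that point, so I am posting the code here for anyone who wants to see the approach.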

import json
import re

import requests
from bs4 import BeautifulSoup
from urllib import parse

class download_movie():
    def __init__(self,url):
        self.headers={'user-agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'}
        self.first_url='http://api.xfsub.com/index.php?url='+url.split('#')[0]
        self.main_ip='https://api.177537.com'

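    def second_url(self):
        '''The site now embeds the player in an iframe, so one extra step is needed.
        The returned url has the form
        https://api.177537.com/?url= + http://www.iqiyi.com/v_19rrkqj6kg.html
        so building that url directly would probably work as a shortcut.
        For iQiyi, the referer header appears to be a fixed url string.'''
        headers=self.headers.copy()
        if 'www.iqiyi.com' in self.first_url:
            headers['referer']='http://api.xfsub.com/index.php?url=http://www.iqiyi.com/v_19rr7qhfg0.html'
        # Send the request whether or not the referer was set, so that the
        # method never silently returns None for a non-iQiyi url.
        req = requests.get(self.first_url, headers=headers)
        bf = BeautifulSoup(req.text, 'lxml')
        api_url = bf.find('iframe')['src']
        return api_url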

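    def post_info(self):
        '''api_url=https://api.177537.com/?url=http://www.iqiyi.com/v_19rrkqj6kg.html
        Use regular expressions to extract the parameters for the POST request.
        The page contains five of them, but only four are actually needed.
        The referer header is api.xfsub.com/index.php?url= + url'''
        api_url=self.second_url()
        headers = self.headers.copy()
        headers['referer'] =self.first_url
        req = requests.get(api_url, headers=headers)
        # Grab the object literal passed to the "url.php" call (dot escaped so it matches literally).
        req_json = re.findall(r'"url\.php", (.+),', req.text)[0]
        # Cut the text off before "ckey1" and close the brace so it parses as JSON.
        req1 = re.findall(r'(.+),"ckey1"', req_json)[0] + '}'
        info = json.loads(req1)
        return info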

    def post_url(self):
        '''POST the four parameters to https://api.177537.com/url.php (apparently a fixed endpoint).
        The response contains the address of the movie. Pay attention to the play field:
        the kind of url returned depends on it.'''
        info=self.post_info()
        data = {'time': info['time'], 'key': info['key'],
                'url': info['url'], 'type': info['type']}
        req = requests.post(self.main_ip+'/url.php', headers=self.headers, data=data)
        result = json.loads(req.text)  # parse the response once and reuse it
        url = result['url']
        play = result['play']
        return play,url

    def movie_url(self):
        '''When play=m3u8, url="http%3A%2F%2Fvs1.baduziyuan.com%2F%2Fppvod%2F42070A3BCA22609F4655BE87FAAC49F8.m3u8"
            so it has to be decoded with unquote.
            When play=xml, url=/url.php?key=357350013cc120d659b1a4d5e1d7acfe&time=1523979626&url=http%3A%2F%2Fwww.iqiyi.com%2Fv_19rrluypwg.html&type=iqiyi&xml=1
            so the host has to be prepended before it can be fetched with GET.'''
        play,url=self.post_url()
        print(play,url)
        if play=='m3u8':
            url1=parse.unquote(url)
            self.m3u8_movie(url1)
        elif play=='xml':
            url1=self.main_ip+url
            self.xml_movie(url1)

    def xml_movie(self,url1):
        '''In this case the response is XML and the movie is split into segments.
        Each one looks like http://61.179.109.165/14/w/89aae12482708fc94010602ff1b34520.mp4?type=ppbox.launcher&key=a32e0dd656ea3cc2f31b98b102ce71c6&k=02ecf7808b8c478dca6b1a98cd4c595a-a396-1523991314
        Each segment is about 7 minutes long, so there are a few dozen files.'''
        req = requests.get(url1, headers=self.headers)
        bf = BeautifulSoup(req.text, 'xml')  # The response is XML, so parse it with the XML parser.
        video_url = bf.find_all('file')
        urls = []
        for ip in video_url:
            urls.append(ip.string)
        print(urls)

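    def m3u8_movie(self,url1):
        '''In this case the response is plain text. The movie is split into very small
        segments, and the host has to be prepended to each path by hand.
        Paths look like /20180108/lf0jrQw6/800kb/hls/H1iYGK7412000.ts
        Each segment is only a few seconds long, so there are hundreds of files.'''
        req = requests.get(url1, headers=self.headers)
        # Escape the dot so that only lines really ending in ".ts" are matched.
        text=re.findall(r'(.+\.ts)',req.text)
        # Scheme and host of the playlist url, e.g. http://vs1.baduziyuan.com
        head = re.findall(r'(^.+com)', url1)[0]
        urls=[]
        for path in text:
            urls.append(head+path)
        print(urls)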


url='http://www.iqiyi.com/v_19rrluypwg.html'
text=download_movie(url)
text.movie_url()
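
As an offline illustration of the m3u8 step, the sketch below turns the segment paths listed in a playlist into full URLs. It is not part of the crawler above: the function name `segment_urls` and the sample playlist text are hypothetical, and the second segment name is invented. It uses `urllib.parse.urljoin` instead of the regex-built host prefix, on the assumption that this is a more forgiving way to join the host and path.

```python
from urllib.parse import urljoin

def segment_urls(playlist_url, playlist_text):
    """Return absolute URLs for every .ts segment listed in an m3u8 playlist."""
    urls = []
    for line in playlist_text.splitlines():
        line = line.strip()
        # Lines starting with '#' are playlist tags; the rest are media URIs.
        if not line or line.startswith('#'):
            continue
        if line.endswith('.ts'):
            # urljoin resolves the path against the playlist's own URL.
            urls.append(urljoin(playlist_url, line))
    return urls

# Hypothetical playlist content in the shape described in m3u8_movie.
sample = """#EXTM3U
#EXTINF:5.0,
/20180108/lf0jrQw6/800kb/hls/H1iYGK7412000.ts
#EXTINF:5.0,
/20180108/lf0jrQw6/800kb/hls/H1iYGK7412001.ts
#EXT-X-ENDLIST
"""
playlist_url = 'http://vs1.baduziyuan.com//ppvod/42070A3BCA22609F4655BE87FAAC49F8.m3u8'
segments = segment_urls(playlist_url, sample)
for u in segments:
    print(u)
```

Unlike the `(^.+com)` pattern, `urljoin` does not depend on the host ending in "com", and it also copes with playlists that list relative paths or complete URLs.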