根據 savefrom條例
本例項及教程只用於學習交流用,權利歸savefrom.net所有
最後程式碼+註釋大概100行左右,具體程式碼以github程式碼為主(可以會在上面修復bug),本文只做具體講解
專案地址
思路
流程
1. post
根據思路里的第一步,我們首先需要用post
方式取到加密後的js欄位,筆者使用了requests
第三方庫來執行,關於爬蟲可以參考我之前的文章
i. 先把post中的headers格式化
# set the headers or the website will not return information
# the cookies in here you may need to change
headers = {
"cache-Control": "no-cache",
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,"
"*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"accept-encoding": "gzip, deflate, br",
"accept-language": "zh-CN,zh;q=0.9,en;q=0.8",
"content-type": "application/x-www-form-urlencoded",
"cookie": "lang=en; country=CN; uid=fd94a82a406a8dd4; sfHelperDist=72; reference=14; "
"clickads-e2=90; poropellerAdsPush-e=63; promoBlock=64; helperWidget=92; "
"helperBanner=42; framelessHdConverter=68; inpagePush2=68; popupInOutput=9; "
"_ga=GA1.2.799702638.1610248969; _gid=GA1.2.628904587.1610248969; "
"PHPSESSID=030393eb0776d20d0975f99b523a70d4; x-requested-with=; "
"PHPSESSUD=islilfjn5alth33j9j8glj9776; _gat_helperWidget=1; _gat_inpagePush2=1",
"origin": "https://en.savefrom.net",
"pragma": "no-cache",
"referer": "https://en.savefrom.net/1-youtube-video-downloader-4/",
"sec-ch-ua": "\"Google Chrome\";v=\"87\", \"Not;A Brand\";v=\"99\",\"Chromium\";v=\"87\"",
"sec-ch-ua-mobile": "?0",
"sec-fetch-dest": "iframe",
"sec-fetch-mode": "navigate",
"sec-fetch-site": "same-origin",
"sec-fetch-user": "?1",
"upgrade-insecure-requests": "1",
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/87.0.4280.88 Safari/537.36"}
其中cookie
部分可能要改,然後最好以你們瀏覽器上的為主,具體每個引數的含義不是本文範圍,可以自行去搜尋引擎搜
ii.然後把引數也格式化
# set the parameter, we can get from chrome
kv = {"sf_url": url,
"sf_submit": "",
"new": "1",
"lang": "en",
"app": "",
"country": "cn",
"os": "Windows",
"browser": "Chrome"}
其中sf_url
欄位是我們要下載的youtube視訊的url,其他引數都不變
iii. 最後再執行requests
庫的post請求
# do the POST request
r = requests.post(url="https://en.savefrom.net/savefrom.php", headers=headers,
data=kv)
r.raise_for_status()
注意是data=kv
iv. 封裝成一個函式
import requests
def gethtml(url):
# set the headers or the website will not return information
# the cookies in here you may need to change
headers = {
"cache-Control": "no-cache",
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,"
"*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"accept-encoding": "gzip, deflate, br",
"accept-language": "zh-CN,zh;q=0.9,en;q=0.8",
"content-type": "application/x-www-form-urlencoded",
"cookie": "lang=en; country=CN; uid=fd94a82a406a8dd4; sfHelperDist=72; reference=14; "
"clickads-e2=90; poropellerAdsPush-e=63; promoBlock=64; helperWidget=92; "
"helperBanner=42; framelessHdConverter=68; inpagePush2=68; popupInOutput=9; "
"_ga=GA1.2.799702638.1610248969; _gid=GA1.2.628904587.1610248969; "
"PHPSESSID=030393eb0776d20d0975f99b523a70d4; x-requested-with=; "
"PHPSESSUD=islilfjn5alth33j9j8glj9776; _gat_helperWidget=1; _gat_inpagePush2=1",
"origin": "https://en.savefrom.net",
"pragma": "no-cache",
"referer": "https://en.savefrom.net/1-youtube-video-downloader-4/",
"sec-ch-ua": "\"Google Chrome\";v=\"87\", \"Not;A Brand\";v=\"99\",\"Chromium\";v=\"87\"",
"sec-ch-ua-mobile": "?0",
"sec-fetch-dest": "iframe",
"sec-fetch-mode": "navigate",
"sec-fetch-site": "same-origin",
"sec-fetch-user": "?1",
"upgrade-insecure-requests": "1",
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/87.0.4280.88 Safari/537.36"}
# set the parameter, we can get from chrome
kv = {"sf_url": url,
"sf_submit": "",
"new": "1",
"lang": "en",
"app": "",
"country": "cn",
"os": "Windows",
"browser": "Chrome"}
# do the POST request
r = requests.post(url="https://en.savefrom.net/savefrom.php", headers=headers,
data=kv)
r.raise_for_status()
# get the result
return r.text
2. 呼叫解密函式
i. 分析
這其中的難點在於在python裡執行javascript程式碼,而晚上的解決方法有PyV8
等,本文選用execjs
。在思路部分我們可以發現js部分的最後幾行是解密函式,所以我們只需要在execjs
中先執行一遍全部,然後再單獨執行解密函式就好了
ii. 先取出js部分
# target(youtube address) url
url = "https://www.youtube.com/watch?v=YPvtz1lHRiw"
# get the target text
reo = gethtml(url)
# Remove the code from the head and tail (we need the javascript part, information store with encryption in js part)
reo = reo.split("<script type=\"text/javascript\">")[1].split("</script>")[0]
這裡其實可以用正則,不過由於筆者正規表示式還不太熟練就直接用split
了
iii. 取第一個解密函式作為我們用的解密函式
當你多取幾次不同視訊的結果,你就會發現每次的解密函式都不一樣,不過位置都是還是在固定行數
# split each line(help us find the decrypt function in last few line)
reA = reo.split("\n")
# get the depcrypt function
name = reA[len(reA) - 3].split(";")[0] + ";"
所以name
就是我們的解密函式了(變數名沒取太好hhh)
iv. 用execjs執行
# use execjs to execute the js code, and the cwd is the result of `npm root -g`(the path of npm in your computer)
ct = execjs.compile(reo)
# do the decryption
text = ct.eval(name.split("=")[1].replace(";", ""))
其中只取=
後面的和去掉分號是指指執行這個函式而不用賦值,當先執行賦值+解密然後取值也不是不可以
但是我們可以發現馬上就報錯了(要是有這麼簡單就好了)
1. this也就是window變數不存在
如果沒記錯是報錯this
或者$b
,筆者嘗試把全部this
去掉或者把全部框在一個class
裡面(這樣子this就變成那個class了)不過都沒有成功,然後發現在npm
下有個jsdom
可以在execjs
裡模擬window變數(其實應該有更好方法的),所以我們需要下載npm
和裡面的jsdom
,然後改寫以上程式碼
addition = """
const jsdom = require("jsdom");
const { JSDOM } = jsdom;
const dom = new JSDOM(`<!DOCTYPE html><p>Hello world</p>`);
window = dom.window;
document = window.document;
XMLHttpRequest = window.XMLHttpRequest;
"""
# use execjs to execute the js code, and the cwd is the result of `npm root -g`(the path of npm in your computer)
ct = execjs.compile(addition + reo, cwd=r'C:\Users\xxx\AppData\Roaming\npm\node_modules')
其中
cwd
欄位是npm root -g
的結果,也就是npm的modules路徑addition
是用來模擬window
的
但是我們又可以發現下一個錯誤
2. alert不存在
這個錯誤是因為在execjs
下執行alert
函式是沒有意義的,因為我們沒有瀏覽器讓他彈窗,且原本alert
函式的定義是來源window
而我們自定義了window
,所以我們要在程式碼前重寫覆蓋alert
函式(相當於定義一個alert)
# override the alert function, because in the code there has one place using
# and we cannot do the alerting in execjs(it is meaningless) however, if we donnot override, the code will raise a error
reo = reo.replace("(function(){", "(function(){\nthis.alert=function(){};")
v. 整合程式碼
# target(youtube address) url
url = "https://www.youtube.com/watch?v=YPvtz1lHRiw"
# get the target text
reo = gethtml(url)
# Remove the code from the head and tail (we need the javascript part, information store with encryption in js part)
reo = reo.split("<script type=\"text/javascript\">")[1].split("</script>")[0]
# override the alert function, because in the code there has one place using
# and we cannot do the alerting in execjs(it is meaningless) however, if we donnot override, the code will raise a error
reo = reo.replace("(function(){", "(function(){\nthis.alert=function(){};")
# split each line(help us find the decrypt function in last few line)
reA = reo.split("\n")
# get the depcrypt function
name = reA[len(reA) - 3].split(";")[0] + ";"
# add jsdom into the execjs because the code will use(maybe there is a solution without jsdom, but i have no idea)
addition = """
const jsdom = require("jsdom");
const { JSDOM } = jsdom;
const dom = new JSDOM(`<!DOCTYPE html><p>Hello world</p>`);
window = dom.window;
document = window.document;
XMLHttpRequest = window.XMLHttpRequest;
"""
# use execjs to execute the js code, and the cwd is the result of `npm root -g`(the path of npm in your computer)
ct = execjs.compile(addition + reo, cwd=r'C:\Users\19308\AppData\Roaming\npm\node_modules')
# do the decryption
text = ct.eval(name.split("=")[1].replace(";", ""))
3. 分析解密結果
i. 取關鍵json
執行完上面的部分,解密結果就存在text裡了,而我們在思路中可以發現,真正對我們重要的就是存在window.parent.sf.videoResult.show()
裡的json,所以用正規表示式取這一部分的json
# get the result in json
result = re.search('show\((.*?)\);;', text, re.I | re.M).group(0).replace("show(", "").replace(");;", "")
ii. 格式化json
python可以格式化json的庫有很多,這裡筆者用了json
庫(記得import)
# use `json` to load json
j = json.loads(result)
iii. 取下載地址
接下來就到了最後一步,根據思路里和json格式化工具我們可以發現j["url"][num]["url"]
就是下載連結,而num
是我們要的視訊格式(不同解析度和型別)
# the selection of video(in this case, num=1 mean the video is
# - 360p known from j["url"][num]["quality"]
# - MP4 known from j["url"][num]["type"]
# - audio known from j["url"][num]["audio"]
num = 1
downurl = j["url"][num]["url"]
# do some download
# thanks :)
# - EOF -
3. 全部程式碼
# -*- coding: utf-8 -*-
# @Time: 2021/1/10
# @Author: Eritque arcus
# @File: Youtube.py
# @License: MIT
# @Environment:
# - windows 10
# - python 3.6.2
# @Dependence:
# - jsdom in npm(windows also can use)
# - requests, execjs, re, json in python
import requests
import execjs
import re
import json
def gethtml(url):
# set the headers or the website will not return information
# the cookies in here you may need to change
headers = {
"cache-Control": "no-cache",
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,"
"*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"accept-encoding": "gzip, deflate, br",
"accept-language": "zh-CN,zh;q=0.9,en;q=0.8",
"content-type": "application/x-www-form-urlencoded",
"cookie": "lang=en; country=CN; uid=fd94a82a406a8dd4; sfHelperDist=72; reference=14; "
"clickads-e2=90; poropellerAdsPush-e=63; promoBlock=64; helperWidget=92; "
"helperBanner=42; framelessHdConverter=68; inpagePush2=68; popupInOutput=9; "
"_ga=GA1.2.799702638.1610248969; _gid=GA1.2.628904587.1610248969; "
"PHPSESSID=030393eb0776d20d0975f99b523a70d4; x-requested-with=; "
"PHPSESSUD=islilfjn5alth33j9j8glj9776; _gat_helperWidget=1; _gat_inpagePush2=1",
"origin": "https://en.savefrom.net",
"pragma": "no-cache",
"referer": "https://en.savefrom.net/1-youtube-video-downloader-4/",
"sec-ch-ua": "\"Google Chrome\";v=\"87\", \"Not;A Brand\";v=\"99\",\"Chromium\";v=\"87\"",
"sec-ch-ua-mobile": "?0",
"sec-fetch-dest": "iframe",
"sec-fetch-mode": "navigate",
"sec-fetch-site": "same-origin",
"sec-fetch-user": "?1",
"upgrade-insecure-requests": "1",
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/87.0.4280.88 Safari/537.36"}
# set the parameter, we can get from chrome
kv = {"sf_url": url,
"sf_submit": "",
"new": "1",
"lang": "en",
"app": "",
"country": "cn",
"os": "Windows",
"browser": "Chrome"}
# do the POST request
r = requests.post(url="https://en.savefrom.net/savefrom.php", headers=headers,
data=kv)
r.raise_for_status()
# get the result
return r.text
if __name__ == '__main__':
# target(youtube address) url
url = "https://www.youtube.com/watch?v=YPvtz1lHRiw"
# get the target text
reo = gethtml(url)
# Remove the code from the head and tail (we need the javascript part, information store with encryption in js part)
reo = reo.split("<script type=\"text/javascript\">")[1].split("</script>")[0]
# override the alert function, because in the code there has one place using
# and we cannot do the alerting in execjs(it is meaningless) however, if we donnot override, the code will raise a error
reo = reo.replace("(function(){", "(function(){\nthis.alert=function(){};")
# split each line(help us find the decrypt function in last few line)
reA = reo.split("\n")
# get the depcrypt function
name = reA[len(reA) - 3].split(";")[0] + ";"
# add jsdom into the execjs because the code will use(maybe there is a solution without jsdom, but i have no idea)
addition = """
const jsdom = require("jsdom");
const { JSDOM } = jsdom;
const dom = new JSDOM(`<!DOCTYPE html><p>Hello world</p>`);
window = dom.window;
document = window.document;
XMLHttpRequest = window.XMLHttpRequest;
"""
# use execjs to execute the js code, and the cwd is the result of `npm root -g`(the path of npm in your computer)
ct = execjs.compile(addition + reo, cwd=r'C:\Users\19308\AppData\Roaming\npm\node_modules')
# do the decryption
text = ct.eval(name.split("=")[1].replace(";", ""))
# get the result in json
result = re.search('show\((.*?)\);;', text, re.I | re.M).group(0).replace("show(", "").replace(");;", "")
# use `json` to load json
j = json.loads(result)
# the selection of video(in this case, num=1 mean the video is
# - 360p known from j["url"][num]["quality"]
# - MP4 known from j["url"][num]["type"]
# - audio known from j["url"][num]["audio"]
num = 1
downurl = j["url"][num]["url"]
# do some download
# thanks :)
# - EOF -
- 總計102行
- 開發環境
# @Environment:
# - windows 10
# - python 3.6.2
- 依賴
# @Dependence:
# - jsdom in npm(windows also can use)
# - requests, execjs, re, json in python
For 爬蟲
版權宣告:本文為博主原創文章,遵循 CC 4.0 BY-SA 版權協議,轉載請附上原文出處連結和本宣告。
本文作者: https://www.cnblogs.com/Eritque-arcus/ 或https://blog.csdn.net/qq_40832960