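Python for Beginners: Scraping Bilingual Example Sentences from Baidu Translate
At first I thought a simple POST request would be enough to pull the data down. It turned out I was too young, too naive.
Sites encrypting their request parameters is very common these days.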
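A few details need attention when scraping the Baidu Translate interface:
- The request parameters are encrypted and keep changing, including sign and Cookie.
- The returned JSON is deeply nested, so organizing the data took some effort.
- The number of requests is limited; without error handling, the script fails with the error shown after this list.
- Remember to close the output files once you have finished writing to them.
requests.exceptions.ConnectionError:
HTTPSConnectionPool(host='fanyi.baidu.com', port=443):
Max retries exceeded with url: / (Caused by
NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x11068e390>:
Failed to establish a new connection: [Errno 60] Operation timed out',))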
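One way to tolerate the connection error above is to retry with a growing pause between attempts. The following is only a minimal sketch: `call_with_retries` and its parameters are hypothetical names that do not come from the original script, and the delays are arbitrary assumptions rather than values Baidu documents.

```python
import time


def call_with_retries(func, max_attempts=3, base_delay=1.0,
                      exceptions=(Exception,), sleep=time.sleep):
    """Call func() until it succeeds or max_attempts is reached.

    Waits base_delay, 2 * base_delay, 4 * base_delay, ... between attempts
    and re-raises the last exception if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# Possible usage with requests (not executed here; url and headers are
# whatever the caller already has):
#   response = call_with_retries(
#       lambda: requests.get(url, headers=headers, timeout=10),
#       exceptions=(requests.exceptions.RequestException,))
```

Passing the `sleep` function as a parameter keeps the helper easy to exercise without actually waiting.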
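Everything else follows other people's methods for cracking the Baidu Translate interface, which I used as-is.
Baidu Interface Example Analysis v20181012
Finally, here is the code. There is not much else to say.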
import json
import re
import sys
import urllib.parse

import execjs
import requests

# The request headers matter a lot: they must be sent when requesting the fanyi.baidu.com page.
# In my tests, Baidu still returns a token and gtk without them, but those values
# cannot be used to fetch translation results.
#### gtk seems to always be 320305.131321201
#### token is fixed, but differs from machine to machine
#### sign is different each time; it is computed by a piece of JS code
#### Cookie changes every so often; error codes such as 998 and 997 are related to the Cookie
#### The key step is cracking the sign; after that, surprisingly, a plain GET returns the data

f_english = open("/Users/zhengguokai/Desktop/english.txt", "w+")
f_chinese = open("/Users/zhengguokai/Desktop/chinese.txt", "w+")
f_synonym = open("/Users/zhengguokai/Desktop/synonym.txt", "w+")


def getTranslation(source):
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
        "Cookie": "BAIDUID=4650B0B34048BBAA1E0B909B42F5A564:FG=1; BIDUPSID=4650B0B34048BBAA1E0B909B42F5A564; PSTM=1537177909; BDUSS=w0VmEzUFFWTTh0bld5VWVhNVo5MEEyV2ZKdTk3U2stMGZmWVQ1TTRuSnVkOHBiQVFBQUFBJCQAAAAAAAAAAAEAAAD0GzcNaG9uZ3F1YW4xOTkxAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAG7qoltu6qJbTk; pgv_pvi=6774493184; uc_login_unique=19e6fd48035206a8abe89f98c3fc542a; uc_recom_mark=cmVjb21tYXJrXzYyNDU4NjM%3D; MCITY=-218%3A; cflag=15%3A3; SIGNIN_UC=70a2711cf1d3d9b1a82d2f87d633bd8a02893452711; locale=zh; Hm_lvt_64ecd82404c51e03dc91cb9e8c025574=1539333192; from_lang_often=%5B%7B%22value%22%3A%22en%22%2C%22text%22%3A%22%u82F1%u8BED%22%7D%2C%7B%22value%22%3A%22zh%22%2C%22text%22%3A%22%u4E2D%u6587%22%7D%5D; REALTIME_TRANS_SWITCH=1; FANYI_WORD_SWITCH=1; HISTORY_SWITCH=1; SOUND_SPD_SWITCH=1; SOUND_PREFER_SWITCH=1; to_lang_often=%5B%7B%22value%22%3A%22zh%22%2C%22text%22%3A%22%u4E2D%u6587%22%7D%2C%7B%22value%22%3A%22en%22%2C%22text%22%3A%22%u82F1%u8BED%22%7D%5D; Hm_lpvt_64ecd82404c51e03dc91cb9e8c025574=1539333307",
    }
    # Fetch the page source
    html = requests.get('https://fanyi.baidu.com', headers=headers)
    html.encoding = 'utf-8'

    # Extract gtk with a regex (if there are several matches, the last one wins)
    matches = re.findall(r"window\.gtk = '(.*?)';", html.text, re.S)
    gtk = matches[-1] if matches else ""
    if gtk == "":
        print('Get gtk fail.')
        sys.exit(1)
    # print('gtk = ' + gtk)

    # Extract token with a regex
    matches = re.findall("token: '(.*?)'", html.text, re.S)
    token = matches[-1] if matches else ""
    if token == "":
        print('Get token fail.')
        sys.exit(1)
    # print('token = ' + token)

    # Compute sign by running Baidu's JS hash function
    signCode = 'function a(r,o){for(var t=0;t<o.length-2;t+=3){var a=o.charAt(t+2);a=a>="a"?a.charCodeAt(0)-87:Number(a),a="+"===o.charAt(t+1)?r>>>a:r<<a,r="+"===o.charAt(t)?r+a&4294967295:r^a}return r}var C=null;var hash=function(r,_gtk){var o=r.length;o>30&&(r=""+r.substr(0,10)+r.substr(Math.floor(o/2)-5,10)+r.substr(-10,10));var t=void 0,t=null!==C?C:(C=_gtk||"")||"";for(var e=t.split("."),h=Number(e[0])||0,i=Number(e[1])||0,d=[],f=0,g=0;g<r.length;g++){var m=r.charCodeAt(g);128>m?d[f++]=m:(2048>m?d[f++]=m>>6|192:(55296===(64512&m)&&g+1<r.length&&56320===(64512&r.charCodeAt(g+1))?(m=65536+((1023&m)<<10)+(1023&r.charCodeAt(++g)),d[f++]=m>>18|240,d[f++]=m>>12&63|128):d[f++]=m>>12|224,d[f++]=m>>6&63|128),d[f++]=63&m|128)}for(var S=h,u="+-a^+6",l="+-3^+b+-f",s=0;s<d.length;s++)S+=d[s],S=a(S,u);return S=a(S,l),S^=i,0>S&&(S=(2147483647&S)+2147483648),S%=1e6,S.toString()+"."+(S^h)}'
    sign = execjs.compile(signCode).call('hash', source, gtk)
    print('source = ' + source + ', sign = ' + sign)

    # Build the request URL
    fromLanguage = 'en'
    toLanguage = 'zh'
    v2transapi = 'https://fanyi.baidu.com/v2transapi?from=%s&to=%s&query=%s' \
                 '&transtype=translang&simple_means_flag=3&sign=%s&token=%s' % (
                     fromLanguage, toLanguage, urllib.parse.quote(source), sign, token)
    print(v2transapi)

    #### Retry the request up to 3 times
    attempts = 0
    translate_result = None
    while attempts < 3 and translate_result is None:
        try:
            translate_result = requests.get(v2transapi, headers=headers)
        except requests.exceptions.RequestException:
            attempts += 1
    if translate_result is None:
        # Every attempt failed: skip this word instead of crashing
        print('Request failed after 3 attempts: ' + source)
        return
    result = json.loads(translate_result.text)
    # print(translate_result.text)
    # print(result)

    print('----- Synonyms -----')
    # dict_result may be empty when there is no dictionary entry
    dict_result = result.get("dict_result") or {}
    if "sanyms" in dict_result:
        words = dict_result["sanyms"][0]["data"][0]["d"]
        for s in words:
            print(s, file=f_synonym)
            print(s)
    else:
        print("---- No synonyms ----")
    print("Translation: {}".format(result["trans_result"]["data"][0]["dst"]))

    # print(lines)
    ### line: one group of sentences, containing one English and one Chinese sentence
    ### sentences: one sentence
    ### sentence: one token of the sentence; format: ['The', 'w_0', 'w_0', 0, ' ']
    print('----- Bilingual example sentences -----')
    double = (result.get("liju_result") or {}).get("double", "")
    if double != "":
        lines = json.loads(double)
        for line in lines:
            # print(line)
            is_english = True  # each group starts with the English sentence
            for sentences in line:
                # print(sentences)
                if isinstance(sentences, list):
                    s = ""
                    for i, sentence in enumerate(sentences):
                        if is_english:
                            # No space after the last word and the closing punctuation
                            if i > len(sentences) - 3:
                                s += sentence[0]
                            else:
                                s += sentence[0] + " "
                        else:
                            s += sentence[0]
                    if is_english:
                        print(s, file=f_english)
                        is_english = False
                    else:
                        print(s, file=f_chinese)
                        is_english = True
                    # print(s)
                    # print()


try:
    with open("/Users/zhengguokai/Desktop/electronic.txt") as f_vob:
        for line in f_vob:
            source = line.strip()  # drop the trailing newline before signing
            if source:
                getTranslation(source)
finally:
    # Close the output files even if a request fails halfway through
    f_english.close()
    f_chinese.close()
    f_synonym.close()
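The sentence-rebuilding rule inside the script is easier to check when pulled out on its own. The sketch below assumes the token format noted in the code comments (`['The', 'w_0', 'w_0', 0, ' ']`, with the text in position 0). The names `join_tokens` and `split_pair` are hypothetical, and the sample data is made up for illustration rather than taken from a real response.

```python
def join_tokens(tokens, is_english):
    """Rebuild one sentence from a list of token records.

    English tokens are separated by spaces, except that no space follows the
    last two tokens (the final word and the closing punctuation). Chinese
    tokens are concatenated directly.
    """
    words = [token[0] for token in tokens]
    if not is_english:
        return "".join(words)
    parts = []
    for i, word in enumerate(words):
        parts.append(word)
        if i <= len(words) - 3:
            parts.append(" ")
    return "".join(parts)


def split_pair(line):
    """Return (english, chinese) from one group, ignoring non-list fields."""
    token_lists = [item for item in line if isinstance(item, list)]
    return join_tokens(token_lists[0], True), join_tokens(token_lists[1], False)


# Made-up sample group: English tokens, Chinese tokens, then a plain string field.
english = [['The', 'w_0', 'w_0', 0, ' '], ['machine', 'w_1', 'w_1', 0, ' '],
           ['works', 'w_2', 'w_2', 0, ''], ['.', 'w_3', 'w_3', 0, '']]
chinese = [['机器', 'w_0', 'w_0', 0, ''], ['运转', 'w_1', 'w_1', 0, ''],
           ['。', 'w_2', 'w_2', 0, '']]
print(split_pair([english, chinese, "example source"]))
```

Keeping the spacing rule in one small function makes it possible to test it without sending any request.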
# entry = result["dict_result"]["collins"]["entry"]
# # print(entry)
# i = 1
# for e in entry:
#     # print(e)
#     for value in e["value"]:
#         # print(value)
#         mean_type = value.get('mean_type', '')
#         # print(mean_type)
#         for examples in mean_type:
#             # print(examples)
#             for example in examples.get('example', ''):
#                 # print(example)
#                 print('----------- ' + str(i))
#                 ex = example["ex"]
#                 print(ex)
#                 tran = example["tran"]
#                 print(tran)
#                 i += 1
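Because the response is deeply nested and some levels can be missing, a small lookup helper can keep this kind of traversal from raising `KeyError`. This is a sketch only: `dig` and `iter_collins_examples` are hypothetical helpers, the key names are the ones used in the commented-out block above, and the sample dictionary is invented for illustration.

```python
def dig(data, *path, default=None):
    """Follow keys or indexes through nested dicts and lists.

    Returns default as soon as a level is missing, has the wrong type,
    or holds None.
    """
    current = data
    for key in path:
        try:
            current = current[key]
        except (KeyError, IndexError, TypeError):
            return default
    return default if current is None else current


def iter_collins_examples(result):
    """Yield (ex, tran) pairs from the Collins entries, skipping missing levels."""
    for entry in dig(result, "dict_result", "collins", "entry", default=[]):
        for value in dig(entry, "value", default=[]):
            for mean in dig(value, "mean_type", default=[]):
                for example in dig(mean, "example", default=[]):
                    yield (dig(example, "ex", default=""),
                           dig(example, "tran", default=""))


# Invented sample that mimics the nesting; real responses may differ.
sample = {"dict_result": {"collins": {"entry": [{"value": [{"mean_type": [
    {"example": [{"ex": "A machine hums.", "tran": "机器嗡嗡作响。"}]},
    {"info": "no examples here"},
]}]}]}}}
for ex, tran in iter_collins_examples(sample):
    print(ex, tran)
```

The same `dig` call also works for the synonym path, for example `dig(result, "dict_result", "sanyms", 0, "data", 0, "d", default=[])`.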