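Python anti-scraping: using JS reverse engineering and a woff file to scrape Maoyan movie rating information

Posted by il_持之以恆_li on 2022-01-30

First, let's see what the program produces when it runs.

1. Approach

The basic approach is as follows:

    1. Use JS reverse engineering to simulate the request that returns the movie-rating page. (Maoyan's rating information is not part of the page we saw above; it appears to be served from a separate page and inserted into the main one.)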



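Let's look at the request method and request parameters for this URL.

Clearly, the signKey parameter has been encrypted. (Section 2 below explains how to simulate this request.)

    2. With the simulated request above we finally get the rating data. However, the ratings are obfuscated with a custom font, so what we see is a series of character codes beginning with \u, as follows:

    Besides fetching the rating information as in step 1, we also need the woff file for this font.

    We submit the file's download link to this site: https://font.qqe2.com/
    It shows the following result:

    Comparing the image above with the character codes beginning with \u, we can observe the following:

    (Continued in section 3.)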
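The substitution idea in step 2 can be sketched in a few lines of Python. This is only an illustration: decode_stonefont is a hypothetical helper name, and the sample mapping values are made up rather than taken from Maoyan's actual font, which changes from one woff file to the next.

```python
def decode_stonefont(text, glyph_to_digit):
    """Replace each obfuscated character with the digit its glyph draws.

    glyph_to_digit maps one obfuscated character (for example U+E137) to
    the digit string it renders as. Characters that are not in the
    mapping, such as '.', pass through unchanged.
    """
    return ''.join(glyph_to_digit.get(ch, ch) for ch in text)


# Hypothetical mapping, for illustration only.
sample_mapping = {'\ue137': '9', '\uf4f3': '4'}
print(decode_stonefont('\ue137.\uf4f3', sample_mapping))
```

The rest of the article is about building that mapping automatically for whichever font the page happens to serve.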

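2. Using JS reverse engineering to simulate the request for the movie-rating page


In the browser's developer tools, click the link shown under the Initiator tab to reach the view below, then set breakpoints in this JS code and watch the variables, as follows:

Notice something familiar? The parameter values inside the variable x look very much like the GET request parameters mentioned above.
Click into the custom method i[_0xe30b("0x5")], set a breakpoint, refresh the page, and keep watching. We find the following:


Now click into the method r[a(_0x193d("0xe4"))], and we find the following:


The method r[a(_0x193d("0xe4"))] is in fact the method x[_0x193d("0x6")]. The two expressions look different, but they refer to the same function; there is no second candidate.
From here, keep clicking into the methods defined inside this one, setting breakpoints, refreshing, and watching. The steps repeat, so I won't walk through each of them.
The JS code this program needs is listed below. It may differ from what readers see when they do this themselves. (I changed parts of it, and the JS that generates the encrypted field on the Maoyan site has since been updated.)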

var getMD5Sign = function (x) {
    var c = x['method'],
        e = x['channelId'],
        _ = void 0 === e ? 40011 : e,
        t = x['sVersion'],
        n = x['type'],
        a = void 0 === n ? "qs" : n,
        i = Math['ceil'](10 * Math.random()),
        d = (new Date)['getTime'](),
        s = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36',
        u = 'method=' + c + '&timeStamp=' + d + '&User-Agent=' + s + '&index=' + i + '&channelId=' + _ + '&sVersion=' + t,
        f = '&key=A013F70DB97834C0A5492378BD76C53A';
    return {
        randomNum: i,
        timeStamp: d,
        md5sign: default2(u + f),
        channelId: _,
        sVersion: t,
        params: {webdriver: false}
    }
};
var default2 = function (x) {
    if (void 0 === x || null === x)
        throw new Error(x);
    var _ = wordsToBytes(a(x));
    return bytesToHex(_);
};
var _ff = function (x, c, e, _, t, n, a) {
    var r = x + (c & e | ~c & _) + (t >>> 0) + a;
    return (r << n | r >>> 32 - n) + c
};
var _gg = function (x, c, e, _, t, n, a) {
    var r = x + (c & _ | e & ~_) + (t >>> 0) + a;
    return (r << n | r >>> 32 - n) + c
};
var _hh = function (x, c, e, _, t, n, a) {
    var r = x + (c ^ e ^ _) + (t >>> 0) + a;
    return (r << n | r >>> 32 - n) + c
};
var _ii = function (x, c, e, _, t, n, a) {
    var r = x + (e ^ (c | ~_)) + (t >>> 0) + a;
    return (r << n | r >>> 32 - n) + c
};
var a = function (x) {
    x = stringToBytes(x);
    // x['constructor'] == String ? x = e && e['encoding'] === 'binary' ? stringToBytes(x) : stringToBytes(x) : t(x) ? x = Array.prototype.slice.call(x, 0) : Array['isArray'](x) || x.constructor === Uint8Array || (x = x['toString']());
    for (var r = bytesToWords(x), i = 8 * x['length'], d = 1732584193, s = -271733879, o = -1732584194, u = 271733878, f = 0; f < r['length']; f++)
        r[f] = 16711935 & (r[f] << 8 | r[f] >>> 24) | 4278255360 & (r[f] << 24 | r[f] >>> 8);
    r[i >>> 5] |= 128 << i % 32,
        r[14 + (i + 64 >>> 9 << 4)] = i;
    for (var l = _ff, m = _gg, h = _hh, b = _ii, f = 0; f < r.length; f += 16) {
        var y = d,
            p = s,
            L = o,
            M = u;
        d = l(d, s, o, u, r[f + 0], 7, -680876936),
            u = l(u, d, s, o, r[f + 1], 12, -389564586),
            o = l(o, u, d, s, r[f + 2], 17, 606105819),
            s = l(s, o, u, d, r[f + 3], 22, -1044525330),
            d = l(d, s, o, u, r[f + 4], 7, -176418897),
            u = l(u, d, s, o, r[f + 5], 12, 1200080426),
            o = l(o, u, d, s, r[f + 6], 17, -1473231341),
            s = l(s, o, u, d, r[f + 7], 22, -45705983),
            d = l(d, s, o, u, r[f + 8], 7, 1770035416),
            u = l(u, d, s, o, r[f + 9], 12, -1958414417),
            o = l(o, u, d, s, r[f + 10], 17, -42063),
            s = l(s, o, u, d, r[f + 11], 22, -1990404162),
            d = l(d, s, o, u, r[f + 12], 7, 1804603682),
            u = l(u, d, s, o, r[f + 13], 12, -40341101),
            o = l(o, u, d, s, r[f + 14], 17, -1502002290),
            s = l(s, o, u, d, r[f + 15], 22, 1236535329),
            d = m(d, s, o, u, r[f + 1], 5, -165796510),
            u = m(u, d, s, o, r[f + 6], 9, -1069501632),
            o = m(o, u, d, s, r[f + 11], 14, 643717713),
            s = m(s, o, u, d, r[f + 0], 20, -373897302),
            d = m(d, s, o, u, r[f + 5], 5, -701558691),
            u = m(u, d, s, o, r[f + 10], 9, 38016083),
            o = m(o, u, d, s, r[f + 15], 14, -660478335),
            s = m(s, o, u, d, r[f + 4], 20, -405537848),
            d = m(d, s, o, u, r[f + 9], 5, 568446438),
            u = m(u, d, s, o, r[f + 14], 9, -1019803690),
            o = m(o, u, d, s, r[f + 3], 14, -187363961),
            s = m(s, o, u, d, r[f + 8], 20, 1163531501),
            d = m(d, s, o, u, r[f + 13], 5, -1444681467),
            u = m(u, d, s, o, r[f + 2], 9, -51403784),
            o = m(o, u, d, s, r[f + 7], 14, 1735328473),
            s = m(s, o, u, d, r[f + 12], 20, -1926607734),
            d = h(d, s, o, u, r[f + 5], 4, -378558),
            u = h(u, d, s, o, r[f + 8], 11, -2022574463),
            o = h(o, u, d, s, r[f + 11], 16, 1839030562),
            s = h(s, o, u, d, r[f + 14], 23, -35309556),
            d = h(d, s, o, u, r[f + 1], 4, -1530992060),
            u = h(u, d, s, o, r[f + 4], 11, 1272893353),
            o = h(o, u, d, s, r[f + 7], 16, -155497632),
            s = h(s, o, u, d, r[f + 10], 23, -1094730640),
            d = h(d, s, o, u, r[f + 13], 4, 681279174),
            u = h(u, d, s, o, r[f + 0], 11, -358537222),
            o = h(o, u, d, s, r[f + 3], 16, -722521979),
            s = h(s, o, u, d, r[f + 6], 23, 76029189),
            d = h(d, s, o, u, r[f + 9], 4, -640364487),
            u = h(u, d, s, o, r[f + 12], 11, -421815835),
            o = h(o, u, d, s, r[f + 15], 16, 530742520),
            s = h(s, o, u, d, r[f + 2], 23, -995338651),
            d = b(d, s, o, u, r[f + 0], 6, -198630844),
            u = b(u, d, s, o, r[f + 7], 10, 1126891415),
            o = b(o, u, d, s, r[f + 14], 15, -1416354905),
            s = b(s, o, u, d, r[f + 5], 21, -57434055),
            d = b(d, s, o, u, r[f + 12], 6, 1700485571),
            u = b(u, d, s, o, r[f + 3], 10, -1894986606),
            o = b(o, u, d, s, r[f + 10], 15, -1051523),
            s = b(s, o, u, d, r[f + 1], 21, -2054922799),
            d = b(d, s, o, u, r[f + 8], 6, 1873313359),
            u = b(u, d, s, o, r[f + 15], 10, -30611744),
            o = b(o, u, d, s, r[f + 6], 15, -1560198380),
            s = b(s, o, u, d, r[f + 13], 21, 1309151649),
            d = b(d, s, o, u, r[f + 4], 6, -145523070),
            u = b(u, d, s, o, r[f + 11], 10, -1120210379),
            o = b(o, u, d, s, r[f + 2], 15, 718787259),
            s = b(s, o, u, d, r[f + 9], 21, -343485551),
            d = d + y >>> 0,
            s = s + p >>> 0,
            o = o + L >>> 0,
            u = u + M >>> 0
    }
    return endian([d, s, o, u]);
};
var rotl = function (x, c) {
    return x << c | x >>> 32 - c;
};
var rotr = function (x, c) {
    return x << 32 - c | x >>> c;
};
var endian = function (x) {
    if (x['constructor'] == Number)
        return 16711935 & rotl(x, 8) | 4278255360 & rotl(x, 24);
    for (var c = 0; c < x['length']; c++)
        x[c] = endian(x[c]);
    return x
};
var bytesToWords = function (x) {
    for (var c = [], e = 0, _ = 0; e < x['length']; e++,
        _ += 8)
        c[_ >>> 5] |= x[e] << 24 - _ % 32;
    return c
};
var wordsToBytes = function (x) {
    for (var c = [], e = 0; e < 32 * x['length']; e += 8)
        c['push'](x[e >>> 5] >>> 24 - e % 32 & 255);
    return c
};
var stringToBytes = function (x) {
    for (var c = [], e = 0; e < x.length; e++)
        c['push'](255 & x.charCodeAt(e));
    return c;
};
var bytesToHex = function (x) {
    for (var c = [], e = 0; e < x.length; e++)
        c['push']((x[e] >>> 4)['toString'](16)),
            c['push']((15 & x[e])['toString'](16));
    return c['join']("")
};

File name: common.js
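For readers who would rather not run JavaScript at all, the signature can probably be reproduced directly in Python. The sketch below rests on two assumptions: that the routine in common.js is a standard MD5 (its initial values and round constants match MD5), and that the signed string is plain ASCII. build_sign_params is a hypothetical helper name; the key and User-Agent are copied from the listing above. It has not been checked against the live site.

```python
import hashlib
import random
import time

KEY = 'A013F70DB97834C0A5492378BD76C53A'
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36')


def build_sign_params(method='GET', channel_id=40011, s_version=1,
                      index=None, timestamp=None):
    """Mirror getMD5Sign: build the string, append the key, take its MD5."""
    if index is None:
        index = random.randint(1, 10)        # stands in for Math.ceil(10 * Math.random())
    if timestamp is None:
        timestamp = int(time.time() * 1000)  # stands in for (new Date).getTime()
    raw = ('method=%s&timeStamp=%s&User-Agent=%s&index=%s&channelId=%s&sVersion=%s'
           % (method, timestamp, USER_AGENT, index, channel_id, s_version))
    # Assumption: the JS hash is plain MD5 over the single-byte string.
    sign = hashlib.md5((raw + '&key=' + KEY).encode('latin-1')).hexdigest()
    return {
        'index': index,
        'timeStamp': timestamp,
        'signKey': sign,
        'channelId': channel_id,
        'sVersion': s_version,
        'webdriver': 'false',
    }
```

If the assumption holds, the returned dictionary can be passed straight to requests.get as params, and the User-Agent header sent with the request should be the same string that was signed.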

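3. Using pytesseract, PIL, and cv2 to recover the original characters behind the obfuscated ones

Here we mainly use the pytesseract module to recognize the image below and obtain the original value behind each obfuscated field. This requires the Tesseract OCR engine to be installed (tesseract.exe on Windows).

To improve recognition accuracy, I use cv2 to crop the image into smaller pieces first.
Reference code: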

from PIL import Image
import pytesseract
import cv2
import time

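img = cv2.imread(filename='./input.png')
imgInfo = img.shape
# Each glyph tile in the screenshot is 125 px wide; walk the 12 tiles from left to right.
x_1 = 0
x_2 = 125
font_dict = {}
for i in range(12):
    dst = img[:,x_1:x_2]
    x_1 = x_2
    x_2 += 125
    dst2 = dst[6:100,15:120]     # upper part of the tile: the rendered digit
    dst3 = dst[105:148,15:111]   # lower part of the tile: the glyph's code label

    cv2.imwrite(filename='1.png',img=dst2)
    cv2.imwrite(filename='2.png',img=dst3)

    time.sleep(1)
    # --psm 10 treats the image as a single character; the whitelist limits the output to digits.
    text2 = pytesseract.image_to_string(Image.open(fp='1.png'),
                                       config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')
    # --psm 6 treats the image as a single uniform block of text.
    text3 = pytesseract.image_to_string(Image.open(fp='2.png'),lang='eng',config='--psm 6')
    if text2.strip() != '':
        text31 = text3.strip()
        # Key: recognized digit. Value: first line of the label, lowercased, with '$' rewritten as '\u'.
        font_dict[text2.strip()] = (text31.split('\n')[0]).lower().replace('$',r'\u')
print(font_dict)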

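That said, the recognition accuracy still leaves room for improvement. We can post-process the results with some extra code to make them as accurate as possible.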
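One possible form of such post-processing is to normalize each OCR'd glyph label before storing it. The sketch below is an assumption about what that cleanup could look like, not the author's code: normalize_glyph_label is a hypothetical helper, and it assumes labels have the form uniXXXX with four hexadecimal digits.

```python
import re


def normalize_glyph_label(raw):
    """Turn an OCR'd label such as 'uniE137.' into the 6-character text \\ue137.

    Returns None when no 'uni' + 4 hex digits pattern can be found,
    so the caller can skip or retry that tile.
    """
    cleaned = re.sub(r'[^0-9a-z]', '', raw.lower())  # drop punctuation and whitespace
    match = re.search(r'uni([0-9a-f]{4})', cleaned)
    if match is None:
        return None
    return '\\u' + match.group(1)


print(normalize_glyph_label(' uniE137. '))
print(normalize_glyph_label('un1E137'))
```

Returning None for unreadable labels keeps bad entries out of font_dict, which is usually easier to handle than a wrong mapping.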

4. Final implementation

import requests
import re
import execjs
from selenium import webdriver
from lxml import etree
from PIL import Image, ImageGrab
import pytesseract
import time
import cv2

main_url = input('Enter the Maoyan movie URL: ')
movie_id = re.findall('https://www.maoyan.com/films/(.*)', main_url)[0]
url = 'https://www.maoyan.com/ajax/films/%s' % (movie_id)
with open(file='./common.js', mode='r', encoding='utf-8') as f:
    js = f.read()
ctx = execjs.compile(js)  # compile the JS code so its functions can be called
flag = {
    'method': 'GET',
    'channelId': 40011,
    'sVersion': 1,
    'type': 'object'
}
n = ctx.call('getMD5Sign', flag)
e = n['randomNum']
i = n['timeStamp']
o = n['md5sign']
c = n['channelId']
s = n['sVersion']
# _ = n['params']
params = {
    'index': e,
    'timeStamp': i,
    'signKey': o,
    'channelId': c,
    'sVersion': s,
    'webdriver': 'false'
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36'
}
rsp = requests.get(url=url, params=params, headers=headers)
print('Response status code:', rsp.status_code)
font_dict = {}
infos = []
if len((rsp.text).strip()) == 0:
    pass
else:
    html = rsp.text
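    # The parentheses are literal characters in the CSS, so they must be escaped in the pattern.
    woff_href = re.findall(r"url\('(.*?)'\) format\('woff'\)", html)[0]  # protocol-relative href of the woff file
    woff_url = 'http:' + woff_href  # full download link for the woff file

    html2 = etree.HTML(html)
    infos = html2.xpath("//div[@class='movie-stats-container']//span[@class='stonefont']/text()")  # rating and related figures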

    driver = webdriver.Chrome()
    driver.get(url='https://font.qqe2.com/')
    driver.implicitly_wait(15)
    driver.maximize_window()  # maximize the window

    # The find_element(s)_by_xpath helpers are gone from recent Selenium 4 releases;
    # find_element(s)("xpath", ...) works on both older and newer versions.
    ele = driver.find_elements("xpath", "//button[@class='btn btn-flat btn-sm dropdown-toggle']")[0]  # import button
    ele.click()

    ele2 = driver.find_element("xpath", "//ul[@class='dropdown-menu']/li[@data-action='add-url']")  # "load font from URL" menu item
    ele2.click()

    ele3 = driver.find_elements("xpath", "//input[@class='form-control']")[1]  # URL input box
    ele3.send_keys(woff_url)

    ele4 = driver.find_element("xpath", "//button[@class='btn btn-flat btn-sm btn-confirm']")
    ele4.click()  # confirm button
    time.sleep(5)

    # Screen coordinates of the glyph grid; they depend on the screen resolution and window layout.
    bbox = (236, 278, 1502 + 236, 164 + 278)
    img = ImageGrab.grab(bbox)
    img.save('input.png')

    driver.close()

    img = cv2.imread(filename='./input.png')

    imgInfo = img.shape
    x_1 = 0
    x_2 = 125
    for i in range(12):
        dst = img[:, x_1:x_2]
        x_1 = x_2
        x_2 += 125
        dst2 = dst[6:100, 15:120]
        dst3 = dst[105:148, 10:111]

        cv2.imwrite(filename='1.png', img=dst2)
        cv2.imwrite(filename='2.png', img=dst3)

        time.sleep(1)
        text2 = pytesseract.image_to_string(Image.open(fp='1.png'),
                                            config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')
        text3 = pytesseract.image_to_string(Image.open(fp='2.png'),lang='eng',config='--psm 6')

        if text2.strip() != '':
            text31 = text3.strip()
            font_dict[text2.strip()] = (text31.split('\n')[1]).lower().replace('uni', r'\u').replace('.','').replace(',','').replace('-','')

print('Font mapping:')
print(font_dict)
print(infos)
mains_list = []
for i in range(len(infos)):
    infos[i] = infos[i].encode('unicode_escape').decode('utf-8')
    lists_u = []
    index = 0
    while index < len(infos[i])//6:
        str_u = infos[i][index*6:(index+1)*6]
        if '.' in str_u:
            lists_u.append('.')
            infos[i] = infos[i].replace('.','')
            str_u = infos[i][index * 6:(index + 1) * 6]
            lists_u.append(str_u)
        else:
            lists_u.append(str_u)
        index += 1
    mains_list.append(lists_u)

for i in range(len(mains_list)):
    for j in range(len(mains_list[i])):
        str_u2 = mains_list[i][j]
        if str_u2 != '.':
            bool_u = False
            for key in font_dict:
                str_u3 = font_dict[key]
                if str_u3 == str_u2:
                    mains_list[i][j] = key
                    bool_u = True
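            if not bool_u:
                for key in font_dict:
                    str_u3 = font_dict[key]
                    # Skip OCR results of a different length: comparing them index by
                    # index would raise IndexError or only compare a prefix.
                    if len(str_u3) != len(str_u2):
                        continue
                    error = 0
                    for k in range(len(str_u3)):
                        if str_u3[k] != str_u2[k]:
                            if str_u3[k] == 's' and str_u2[k] == '5':
                                pass
                            else:
                                error += 1
                    if error == 0 or error ==1:
                        mains_list[i][j] = key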
print(mains_list)
for i in range(len(mains_list)):
    mains_list[i] = ''.join(mains_list[i])
print(mains_list)
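
The tolerant matching in the second loop above can also be written as a small standalone function, which makes it easier to test. This is a sketch that follows the same rules as the code above (at most one mismatched character, with an OCR'd 's' accepted where the page has '5'), except that it returns the first acceptable entry instead of the last. fuzzy_lookup is a hypothetical name, and the sample dictionary is made up for illustration.

```python
def fuzzy_lookup(code, font_dict):
    """Return the digit whose OCR'd code matches `code`, or None.

    An exact match wins. Otherwise accept the first entry of equal length
    with at most one differing character, not counting positions where
    OCR read 's' and the page has '5'.
    """
    for digit, ocr_code in font_dict.items():
        if ocr_code == code:
            return digit
    for digit, ocr_code in font_dict.items():
        if len(ocr_code) != len(code):
            continue
        errors = sum(
            1 for a, b in zip(ocr_code, code)
            if a != b and not (a == 's' and b == '5')
        )
        if errors <= 1:
            return digit
    return None


sample = {'9': '\\ue137', '4': '\\uf4fs'}   # hypothetical OCR results
print(fuzzy_lookup('\\ue137', sample))      # exact match
print(fuzzy_lookup('\\uf4f5', sample))      # 's' was read in place of '5'
```

A tolerance of one character is a trade-off: it rescues labels with a single OCR slip, but two codes that differ in only one position could be confused with each other.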
