猿人學 (Yuanrenxue) Web Crawler Attack-and-Defense Contest, Problem 19: 烏拉烏拉烏拉 (Wula Wula Wula)

Posted by 死不悔改奇男子 on 2024-11-02

Problem URL: https://match.yuanrenxue.cn/match/19

Solution steps

  1. Inspect the request that the page triggers.

  2. What luck: there are no encrypted parameters and the URL is very simple, so write code to request it directly.

    import requests
    
    url = "https://match.yuanrenxue.cn/api/match/19?page=1"
    headers = {'Host': 'match.yuanrenxue.cn', 'Connection': 'keep-alive', 'Pragma': 'no-cache', 'Cache-Control': 'no-cache',
    	'sec-ch-ua-platform': '"Windows"', 'X-Requested-With': 'XMLHttpRequest',
    	'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36',
    	'Accept': 'application/json, text/javascript, */*; q=0.01',
    	'sec-ch-ua': '"Chromium";v="130", "Google Chrome";v="130", "Not?A_Brand";v="99"', 'sec-ch-ua-mobile': '?0',
    	'Sec-Fetch-Site': 'same-origin', 'Sec-Fetch-Mode': 'cors', 'Sec-Fetch-Dest': 'empty',
    	'Referer': 'https://match.yuanrenxue.cn/match/16', 'Accept-Encoding': 'gzip, deflate, br, zstd',
    	'Accept-Language': 'zh-CN,zh;q=0.9',
    	'Cookie': 'Hm_lvt_9bcbda9cbf86757998a2339a0437208e=1730505879; HMACCOUNT=5B8060E4DC36D34F; Hm_lvt_c99546cf032aaa5a679230de9a95c7db=1730505879; qpfccr=true; no-alert3=true; tk=-5370204167750759641; sessionid=39ahw8ftocq8eghmui7twey3qbw7lek8; Hm_lpvt_9bcbda9cbf86757998a2339a0437208e=1730505890; Hm_lpvt_c99546cf032aaa5a679230de9a95c7db=1730506377', }
    resp = requests.get(url, headers=headers)
    print(resp.text)
    

    Run it, and nothing comes back.

  3. Looking at the request again, it uses the HTTP/2 protocol, so try requesting it with the httpx library.

    import httpx
    
    url = "https://match.yuanrenxue.cn/api/match/19?page=1"
    headers = {'Host': 'match.yuanrenxue.cn', 'Connection': 'keep-alive', 'Pragma': 'no-cache', 'Cache-Control': 'no-cache',
    	'sec-ch-ua-platform': '"Windows"', 'X-Requested-With': 'XMLHttpRequest',
    	'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36',
    	'Accept': 'application/json, text/javascript, */*; q=0.01',
    	'sec-ch-ua': '"Chromium";v="130", "Google Chrome";v="130", "Not?A_Brand";v="99"', 'sec-ch-ua-mobile': '?0',
    	'Sec-Fetch-Site': 'same-origin', 'Sec-Fetch-Mode': 'cors', 'Sec-Fetch-Dest': 'empty',
    	'Referer': 'https://match.yuanrenxue.cn/match/16', 'Accept-Encoding': 'gzip, deflate, br, zstd',
    	'Accept-Language': 'zh-CN,zh;q=0.9',
    	'Cookie': 'Hm_lvt_9bcbda9cbf86757998a2339a0437208e=1730505879; HMACCOUNT=5B8060E4DC36D34F; Hm_lvt_c99546cf032aaa5a679230de9a95c7db=1730505879; qpfccr=true; no-alert3=true; tk=-5370204167750759641; sessionid=39ahw8ftocq8eghmui7twey3qbw7lek8; Hm_lpvt_9bcbda9cbf86757998a2339a0437208e=1730505890; Hm_lpvt_c99546cf032aaa5a679230de9a95c7db=1730506377', }
    client = httpx.Client(http2=True)
    resp = client.get(url, headers=headers)
    print(resp.text)
    

    Running it still returns nothing.

  4. Baffling. Try capturing the traffic with Fiddler.
    Fiddler also shows only a single request, and replaying it from Fiddler still returns the data normally.

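  5. After some research, the server is probably checking the TLS fingerprint.

    For background on TLS fingerprinting, see https://developer.baidu.com/article/details/3348512

    In Python, the curl_cffi library can imitate a browser's TLS fingerprint (just pass the impersonate keyword argument when making the request).
    Install command: pip install curl_cffi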
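To make "TLS fingerprint" more concrete, here is a minimal sketch of JA3, one widely used fingerprinting scheme. Whether this site uses JA3 or some other method is not known from the steps above, so treat the scheme as an assumption. The ClientHello values below are made up for illustration, and the helper names `ja3_string` and `ja3_hash` are hypothetical. The point is only that a client's TLS version, cipher list, extensions, and curves are hashed together, so two clients sending identical HTTP headers can still look different to the server.

```python
import hashlib


def ja3_string(version, ciphers, extensions, curves, point_formats):
    """Build a JA3 string: five fields joined by ',', list values joined by '-'."""
    fields = [
        str(version),
        "-".join(map(str, ciphers)),
        "-".join(map(str, extensions)),
        "-".join(map(str, curves)),
        "-".join(map(str, point_formats)),
    ]
    return ",".join(fields)


def ja3_hash(ja3):
    """JA3 fingerprint: MD5 hex digest of the JA3 string."""
    return hashlib.md5(ja3.encode("ascii")).hexdigest()


# Illustrative (made-up) ClientHello values, not captured from a real client.
browser_like = ja3_string(771, [4865, 4866, 4867], [0, 23, 65281, 10, 11], [29, 23, 24], [0])
script_like = ja3_string(771, [4866, 4867, 4865], [0, 10, 11], [29, 23, 24], [0])

print(browser_like)
print(ja3_hash(browser_like))
print(ja3_hash(script_like))
```

The two hashes differ even though nothing at the HTTP layer has changed, which is consistent with requests and httpx failing while sending the same headers and cookies as the browser.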

  6. Write code to try fetching the first page of data.

    from curl_cffi import requests
    
    url = "https://match.yuanrenxue.cn/api/match/19?page=1"
    headers = {'Host': 'match.yuanrenxue.cn', 'Connection': 'keep-alive', 'Pragma': 'no-cache', 'Cache-Control': 'no-cache',
    	'sec-ch-ua-platform': '"Windows"', 'X-Requested-With': 'XMLHttpRequest',
    	'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36',
    	'Accept': 'application/json, text/javascript, */*; q=0.01',
    	'sec-ch-ua': '"Chromium";v="130", "Google Chrome";v="130", "Not?A_Brand";v="99"', 'sec-ch-ua-mobile': '?0',
    	'Sec-Fetch-Site': 'same-origin', 'Sec-Fetch-Mode': 'cors', 'Sec-Fetch-Dest': 'empty',
    	'Referer': 'https://match.yuanrenxue.cn/match/16', 'Accept-Encoding': 'gzip, deflate, br, zstd',
    	'Accept-Language': 'zh-CN,zh;q=0.9',
    	'Cookie': 'Hm_lvt_9bcbda9cbf86757998a2339a0437208e=1730505879; HMACCOUNT=5B8060E4DC36D34F; Hm_lvt_c99546cf032aaa5a679230de9a95c7db=1730505879; qpfccr=true; no-alert3=true; tk=-5370204167750759641; sessionid=39ahw8ftocq8eghmui7twey3qbw7lek8; Hm_lpvt_9bcbda9cbf86757998a2339a0437208e=1730505890; Hm_lpvt_c99546cf032aaa5a679230de9a95c7db=1730506377', }
    resp = requests.get(url, headers=headers, impersonate="chrome110")
    print(resp.text)
    

    It runs and successfully returns the page data.

  7. So the approach was right. The complete crawler code follows.

    from curl_cffi import requests
    import re
    
    res_sum = 0
    pattern = '{"value": (.*?)}'
    
    for i in range(1, 6):
    	url = "https://match.yuanrenxue.cn/api/match/19?page={}".format(i)
    	headers = {'Host': 'match.yuanrenxue.cn', 'Connection': 'keep-alive', 'Pragma': 'no-cache',
    			   'Cache-Control': 'no-cache', 'sec-ch-ua-platform': '"Windows"', 'X-Requested-With': 'XMLHttpRequest',
    			   'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36',
    			   'Accept': 'application/json, text/javascript, */*; q=0.01',
    			   'sec-ch-ua': '"Chromium";v="130", "Google Chrome";v="130", "Not?A_Brand";v="99"',
    			   'sec-ch-ua-mobile': '?0', 'Sec-Fetch-Site': 'same-origin', 'Sec-Fetch-Mode': 'cors',
    			   'Sec-Fetch-Dest': 'empty', 'Referer': 'https://match.yuanrenxue.cn/match/16',
    			   'Accept-Encoding': 'gzip, deflate, br, zstd', 'Accept-Language': 'zh-CN,zh;q=0.9',
    			   'Cookie': 'Hm_lvt_9bcbda9cbf86757998a2339a0437208e=1730505879; HMACCOUNT=5B8060E4DC36D34F; Hm_lvt_c99546cf032aaa5a679230de9a95c7db=1730505879; qpfccr=true; no-alert3=true; tk=-5370204167750759641; sessionid=39ahw8ftocq8eghmui7twey3qbw7lek8; Hm_lpvt_9bcbda9cbf86757998a2339a0437208e=1730505890; Hm_lpvt_c99546cf032aaa5a679230de9a95c7db=1730506377', }
    	resp = requests.get(url, headers=headers, impersonate="chrome110")
    	string = resp.text
    	findall = re.findall(pattern, string)
    	for item in findall:
    		res_sum += int(item)
    print(res_sum)
    

    Run it to get the result.

  8. Submit the result; it is accepted.
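The summing step in the full crawler can also be sketched with a JSON parser instead of a regular expression. This is only a sketch under an assumption: that each page is a JSON object whose "data" key holds a list of {"value": n} items. The regex in step 7 implies the {"value": n} objects, but the "data" key is assumed here. `page_sum`, `crawl_sum`, and `fake_fetch` are hypothetical names; the fetch function is passed in so the logic can be exercised without touching the network.

```python
import json


def page_sum(text):
    """Sum the "value" fields of one page.

    Assumes a payload shaped like {"data": [{"value": n}, ...]}.
    """
    payload = json.loads(text)
    return sum(int(item["value"]) for item in payload["data"])


def crawl_sum(fetch, pages=range(1, 6)):
    """Call fetch(page) for each page number and add up the page sums."""
    return sum(page_sum(fetch(page)) for page in pages)


def fake_fetch(page):
    """Stand-in for the real request; returns made-up data for illustration."""
    return json.dumps({"data": [{"value": page}, {"value": 10 * page}]})


total = crawl_sum(fake_fetch)
print(total)
```

For the real site, fetch could wrap the call from step 7, for example a function that returns requests.get(url, headers=headers, impersonate="chrome110").text for the given page number.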
