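Python Web Scraping (Parts 5-10): Encoding and Decoding, ajax GET Requests, ajax POST Requests, URLError/HTTPError, Weibo Cookie Login, and Handler Processors

Posted by 夕瑶^ on 2024-07-17

Part 5: Encoding and Decoding (URL Percent-Encoding of Unicode Text)

(1) GET requests

All the functions mentioned here live in the urllib.parse module.

The quote() function for GET requests (suitable when only one or two parameter values are submitted)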

import urllib.parse

url='http://www.baidu.com/baidu?ie=utf-8&wd='

# Percent-encode the Chinese characters (their UTF-8 bytes become %XX escapes)
name=urllib.parse.quote('白敬亭')
url+=name
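To make the effect of quote() concrete, here is a minimal sketch that encodes the same name and then decodes it back with unquote(). It uses only the standard library; the variable names are illustrative and not part of the original post.

```python
import urllib.parse

name = '白敬亭'

# quote() encodes the text as UTF-8 and writes each byte as a %XX escape
encoded = urllib.parse.quote(name)

# unquote() reverses the process and restores the original text
decoded = urllib.parse.unquote(encoded)

print(encoded)
print(decoded == name)
```

The encoded form contains only ASCII characters, which is why it can be appended to the URL safely.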

The urlencode() function for GET requests (suitable when several parameters are submitted)

import urllib.parse

base_url='http://www.baidu.com/baidu?'

data={
    'wd':'白敬亭',
    'sex':'男',
    'address':'中國'
}

# urlencode() percent-encodes every key and value and joins the pairs with &
new_data=urllib.parse.urlencode(data)
url=base_url+new_data
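As a quick check of what urlencode() produces, the sketch below encodes a small dict and decodes the result again with parse_qs(). The dict contents are sample values chosen for illustration.

```python
import urllib.parse

params = {'wd': '白敬亭', 'sex': '男'}

# urlencode() turns the dict into key=value pairs joined by &
query = urllib.parse.urlencode(params)

# parse_qs() decodes the query string back; every value comes back as a list
restored = urllib.parse.parse_qs(query)

print(query)
print(restored)
```

Because parse_qs() allows a key to appear more than once, each restored value is a list even when it holds a single item.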

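(2) POST requests

Baidu Translate

1. Taking Baidu Translate as the example: after entering a word to translate, open "Inspect" > "Network" in the browser. Several requests named sug show up.

2. Open the last request named sug and look at its request and response. Its URL is the interface we are looking for.

3. The code is as follows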

import json
import urllib.parse
import urllib.request

# POST request to Baidu Translate
url='https://fanyi.baidu.com/sug'

headers={
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:128.0) Gecko/20100101 Firefox/128.0'
}

data={
    'kw':'spider'
}
# POST parameters must be URL-encoded and then converted to bytes
# POST parameters are not appended to the URL; they are passed as the data argument of the Request object
data=urllib.parse.urlencode(data).encode('utf-8')
request=urllib.request.Request(url=url,data=data,headers=headers)
response=urllib.request.urlopen(request)

content=response.read().decode('utf-8')

# Parse the JSON string into a Python object
obj=json.loads(content)
print(obj)

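4. Summary

The parameters of a POST request must be URL-encoded with urlencode().
After encoding, encode() must be called to turn the string into bytes.
The parameters are passed through the data argument when building the Request object, not appended to the URL.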
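The three rules can be checked offline. The following minimal sketch builds two Request objects for the same URL, one without data and one with an encoded body, and prints the HTTP method urllib would use for each. Nothing is sent over the network.

```python
import urllib.parse
import urllib.request

url = 'https://fanyi.baidu.com/sug'
form = {'kw': 'spider'}

# urlencode() returns str; encode() converts it to the bytes that POST needs
body = urllib.parse.urlencode(form).encode('utf-8')

get_request = urllib.request.Request(url=url)
post_request = urllib.request.Request(url=url, data=body)

# Building a Request does not touch the network, so this is safe to run offline
print(get_request.get_method())
print(post_request.get_method())
print(post_request.data)
```

Supplying data is what switches the request from GET to POST, which is why the parameters belong in the Request object rather than in the URL.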

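Part 6: ajax GET requests

(1) Fetching the first page of Douban Movies data

1. Open "Douban Movies" > "Rankings" > "Comedy" and find the interface, that is, the request whose response carries the data shown on the current page. The "Response" tab tells us which URL we need.

2. The code that saves the first page of Douban Movies data is as follows: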


import urllib.request

# ajax GET request
# Fetch the first page of Douban Movies data and save it
url='https://movie.douban.com/j/chart/top_list?type=24&interval_id=100:90&action=&start=0&limit=20'

headers={
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:128.0) Gecko/20100101 Firefox/128.0'
}

request=urllib.request.Request(url=url,headers=headers)
response=urllib.request.urlopen(request)
content=response.read().decode('utf-8')
# print(content)

# open() defaults to the platform's locale encoding (gbk on Chinese-language Windows),
# so pass encoding='utf-8' explicitly when opening/creating the .json files
# Method 1
fp=open('douban.json','w',encoding='utf-8')
fp.write(content)
fp.close()
# Method 2
with open('douban1.json','w',encoding='utf-8') as fp:
    fp.write(content)

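In method 2, the with statement manages opening and closing the file object: the file is closed automatically when the block exits, even if an exception is raised inside it. The as keyword binds the file object to a name, here fp. fp.write() is a method of the file object that writes content to the file.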
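The automatic close can be observed directly. This is a minimal sketch that writes to a throwaway temporary file (the path and sample text are made up for illustration) and records fp.closed inside and after the with block.

```python
import os
import tempfile

# Hypothetical file location inside a throwaway temporary directory
path = os.path.join(tempfile.mkdtemp(), 'demo.json')

with open(path, 'w', encoding='utf-8') as fp:
    fp.write('{"title": "喜劇"}')
    inside = fp.closed      # still open inside the block

after = fp.closed           # closed automatically once the block exits

with open(path, 'r', encoding='utf-8') as fp:
    text = fp.read()

print(inside, after, text)
```

Reading the file back with the same encoding confirms that the non-ASCII text was stored intact.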

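(2) Fetching the first ten pages of Douban Movies data

1. Comparing the URLs of successive pages reveals the pattern for start: start=(page-1)*20

2. The code is as follows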

import urllib.parse
import urllib.request

# Fetch the first ten pages of Douban Movies data and save them

def create_request(page):
    headers={
        'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:128.0) Gecko/20100101 Firefox/128.0'
    }
    url='https://movie.douban.com/j/chart/top_list?'
    data={
        "type": 24,
        "interval_id": "100:90",
        "action": "",
        # start follows the pattern observed above
        "start": (page-1)*20,
        "limit": 20
    }
    new_data=urllib.parse.urlencode(data)
    url+=new_data
    request=urllib.request.Request(url=url,headers=headers)
    return request

def get_content(request):
    response=urllib.request.urlopen(request)
    content=response.read().decode('utf-8')
    return content

def load_content(page,content):
    # page is an int, so convert it to a string before concatenating
    # Writing 'page' in quotes would give the literal string, not the variable page
    with open('douban_'+str(page)+'.json','w',encoding='utf-8') as fp:
        fp.write(content)

# Program entry point
if __name__=='__main__':
    start_page=int(input('Enter the start page: '))
    end_page=int(input('Enter the end page: '))
    # range() excludes its stop value, so add 1 to include end_page
    for page in range(start_page,end_page+1):
        request=create_request(page)
        content=get_content(request)
        load_content(page,content)
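The start=(page-1)*20 rule and the page loop can be tried without sending any request. The sketch below only builds the URLs; build_url is a hypothetical helper introduced here for illustration, and the page range 1 to 3 is a sample value.

```python
import urllib.parse

BASE_URL = 'https://movie.douban.com/j/chart/top_list?'

def build_url(page, limit=20):
    """Return the list URL for a 1-based page number (hypothetical helper)."""
    data = {
        'type': 24,
        'interval_id': '100:90',
        'action': '',
        'start': (page - 1) * limit,
        'limit': limit,
    }
    return BASE_URL + urllib.parse.urlencode(data)

# range() stops before its second argument, hence the +1 for an inclusive end
start_page, end_page = 1, 3
urls = [build_url(page) for page in range(start_page, end_page + 1)]

for u in urls:
    print(u)
```

Note that urlencode() writes the colon in interval_id as %3A, so the generated URL differs in spelling, though not in meaning, from the one copied out of the browser.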

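Part 7: ajax POST requests

1. Inspect the network requests to find the interface behind the "Restaurant Search" page of the KFC website

2. The code is as follows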

import urllib.parse
import urllib.request

# ajax POST request: KFC website

def create_request(page):
    url='http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=cname'
    data={
        'cname':'濮陽',
        'pid':'',
        'pageIndex':page,
        'pageSize':10
    }
    new_data=urllib.parse.urlencode(data).encode('utf-8')
    headers={
        'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:128.0) Gecko/20100101 Firefox/128.0'
    }
    # POST parameters cannot be appended to the URL; pass them as the data argument of the Request object
    request=urllib.request.Request(url=url,headers=headers,data=new_data)
    return request

def get_content(request):
    response=urllib.request.urlopen(request)
    content=response.read().decode('utf-8')
    return content

def load_content(page,content):
    with open('kendeji'+str(page)+'.json','w',encoding='utf-8') as fp:
        fp.write(content)


if __name__=='__main__':
    start_page=int(input('Enter the start page: '))
    end_page=int(input('Enter the end page: '))
    # range() excludes its stop value, so add 1 to include end_page
    for page in range(start_page,end_page+1):
        request=create_request(page)
        content=get_content(request)
        print(f"Content of page {page}: {content}")
        # load_content(page,content)
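Before sending paged POST requests, it can help to confirm that each request carries the expected form fields. The sketch below is one way to do that offline; form_fields is a hypothetical helper, and the User-Agent header is omitted here for brevity.

```python
import urllib.parse
import urllib.request

def create_request(page):
    url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=cname'
    data = {'cname': '濮陽', 'pid': '', 'pageIndex': page, 'pageSize': 10}
    body = urllib.parse.urlencode(data).encode('utf-8')
    return urllib.request.Request(url=url, data=body)

def form_fields(request):
    """Decode a request's POST body back into a dict (hypothetical helper)."""
    # keep_blank_values=True keeps empty fields such as pid
    return urllib.parse.parse_qs(request.data.decode('utf-8'), keep_blank_values=True)

for page in (1, 2):
    request = create_request(page)
    print(request.get_method(), form_fields(request))
```

Only pageIndex changes between pages, and it travels in the body, so the URL stays identical for every page.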

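Part 8: URLError/HTTPError

Overview:
1. The HTTPError class is a subclass of the URLError class.
2. Both are imported from urllib.error: urllib.error.HTTPError and urllib.error.URLError.
3. HTTP errors: an HTTP error is the error response returned when the server cannot serve a request. Its status code (for example 404 or 500) tells the visitor what went wrong with the page. URLError covers the remaining failures, such as a host that cannot be reached at all.
4. A request sent through urllib may fail. To make the code more robust, catch the exception with try-except. There are two kinds of exception to handle: URLError and HTTPError.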

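Catching these exceptions follows the usual try-except pattern: handle HTTPError first and URLError second, because HTTPError is the subclass.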
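A minimal sketch of such a try-except block is given here. It is not the code from the original post. The fetch helper, the opener parameter, and the two stand-in functions that raise errors are assumptions added so that the example runs without network access.

```python
import email.message
import io
import urllib.error
import urllib.request

def fetch(url, opener=urllib.request.urlopen):
    """Return the decoded body, or a short error description (hypothetical helper)."""
    try:
        response = opener(url)
        return response.read().decode('utf-8')
    except urllib.error.HTTPError as e:
        # Must come first: HTTPError is a subclass of URLError
        return 'HTTPError: ' + str(e.code)
    except urllib.error.URLError as e:
        return 'URLError: ' + str(e.reason)

# Stand-ins for urlopen that fail in a controlled way, so the sketch runs offline
def raise_http_error(url):
    raise urllib.error.HTTPError(url, 404, 'Not Found', email.message.Message(), io.BytesIO())

def raise_url_error(url):
    raise urllib.error.URLError('host not found')

print(fetch('http://example.invalid/a', opener=raise_http_error))
print(fetch('http://example.invalid/b', opener=raise_url_error))
```

In real use, fetch(url) would be called without the opener argument so that urllib.request.urlopen performs the request.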

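Part 9: Weibo cookie login

Log in at https://weibo.cn/5915756025/info

Note: replace the number in the URL with your own. After logging in to Weibo successfully, look at the URL of the page you land on to find the number you need.

The code is as follows: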

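import urllib.request

# Use a cookie to fetch a Weibo page that requires login
url='https://weibo.cn/5915756025/info'
headers={
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:128.0) Gecko/20100101 Firefox/128.0',
# Without the cookie, we only get a login page that stays stuck on loading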
'Cookie':'SCF=AuRE_quArNVh-RyQwgY_Mfyf1dQ9GXz5mpS5n6rPb4yrZ7pKGOaRqZ_TEjQo1ZKv0CUTDGSjai3FDr5cew5mnSM.; SUB=_2A25Lk7bsDeRhGeNH6lcW9SjMyTmIHXVo0LYkrDV6PUJbktAGLRj4kW1NSsw-pVvkP4MUFdoRPrVOGCv000hCGyH4; SUBP=0033WrSXqPxfM725Ws9jqgMF55529P9D9W5a32BT_aSyAG8TS1.P-hxi5NHD95Qf1K2fS0-cehzfWs4Dqcjai--Ri-8si-zNi--fi-2Xi-24i--fi-2Xi-24i--fi-ihiKn7i--fi-isiKn0PN.t; SSOLoginState=1721222844; ALF=1723814844; _T_WM=7dcb8e0c308d256e687e3e446561c97c'
}

request=urllib.request.Request(url=url,headers=headers)
response=urllib.request.urlopen(request)
content=response.read().decode('utf-8')
# print(content)
with open('weibo.html','w',encoding='utf-8') as fp:
    fp.write(content)

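The cookie carries your login information. With a cookie captured after logging in, we can send it along with the request and reach any page that requires that login.

Having learned to scrape the Weibo personal info page, we can also try scraping the HTML of a personal page in QQ Zone (Qzone). A small hint: you need to send the Referer header as well, which the server uses to check whether the current request came from the previous page.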
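For sites that check both headers, the Cookie and the Referer can be set on the same Request object. The sketch below uses placeholder values throughout: the example.com URLs and the cookie string are hypothetical and must be replaced with values from your own logged-in session.

```python
import urllib.request

# Hypothetical values: replace them with the URL and cookie from your own session
url = 'https://example.com/profile'
headers = {
    'User-Agent': 'Mozilla/5.0',
    'Cookie': 'name=value',
    'Referer': 'https://example.com/',
}

request = urllib.request.Request(url=url, headers=headers)

# The headers travel with the Request object; nothing is sent until urlopen() is called
print(request.get_header('Cookie'))
print(request.get_header('Referer'))
```

Passing this request to urllib.request.urlopen() would then send both headers with the request.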

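Part 10: Handler processors

Handler processors allow more advanced customization of requests. As the business logic grows more complex, customizing the Request object no longer meets our needs: dynamic cookies and proxies cannot be handled by customizing the Request object alone.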

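Basic usage takes three steps: create a handler object (for example urllib.request.HTTPHandler), pass it to urllib.request.build_opener() to get an opener, and send the request with the opener's open() method.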
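The handler workflow can be sketched with the two cases named above, cookies and proxies. This is an illustration under stated assumptions rather than the original post's code: the proxy address is hypothetical, and fetch is a helper defined only to show where the opener is used.

```python
import http.cookiejar
import urllib.request

# 1. Create the handlers
cookie_jar = http.cookiejar.CookieJar()
cookie_handler = urllib.request.HTTPCookieProcessor(cookie_jar)
# Hypothetical proxy address, used only for illustration
proxy_handler = urllib.request.ProxyHandler({'http': '127.0.0.1:8080'})

# 2. Build an opener that uses them
opener = urllib.request.build_opener(cookie_handler, proxy_handler)

# 3. Send requests through the opener instead of urllib.request.urlopen
def fetch(url):
    request = urllib.request.Request(url=url, headers={'User-Agent': 'Mozilla/5.0'})
    with opener.open(request) as response:
        return response.read().decode('utf-8')

# fetch() is defined but not called here, so the sketch runs without network access
print([type(h).__name__ for h in opener.handlers])
```

The cookie jar stores cookies set by responses and sends them back on later requests through the same opener, which is the dynamic behavior a fixed Cookie header cannot provide.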
