Web Scraping: The Requests Module

Dictator丶, published 2019-01-13

Original URL: https://juejin.im/post/5c3aa0cb6fb9a049c965e81c

Web Scraping

1. Getting Started

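1.1 Why use requests instead of urllib

requests is built on top of urllib3
requests offers the same API on Python 2 and Python 3 (recent releases support Python 3 only)
requests is simple and easy to use
requests decompresses response bodies (for example gzip) for us

1.2 What requests does

Sends network requests and returns the response data
Chinese-language API documentation

1.3 Example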

import requests

response = requests.get('http://www.baidu.com')
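# In Python 3, decode() without an argument defaults to UTF-8
print(response.content.decode())
# response.text is decoded with an encoding guessed from the HTTP headers; here the guess is wrong and the output is garbled
print(response.text)
# The encoding currently guessed is ISO-8859-1
print(response.encoding)
# Switch to utf-8
response.encoding = 'utf-8'
# The output is now correct and matches response.content.decode()
print(response.text)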

Differences between response.text and response.content
- response.text
  Type: str
  Decoding: uses an encoding guessed from the HTTP headers
  To change the encoding: response.encoding = 'utf-8'
- response.content
  Type: bytes
  Decoding: none applied
  To decode: response.content.decode('utf-8')
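
To see why response.text can come out garbled, here is a minimal offline sketch that decodes the same bytes with two different codecs. The sample string is an assumption chosen for illustration, and no request is made; raw simply plays the role of response.content.

```python
# A response body is bytes; fake one here by encoding a sample string.
original = '百度一下，你就知道'
raw = original.encode('utf-8')        # stands in for response.content

# Decoding with the wrong codec (as when ISO-8859-1 is guessed) gives mojibake.
garbled = raw.decode('ISO-8859-1')

# Decoding with the codec the bytes were written in recovers the text.
fixed = raw.decode('utf-8')

print(type(raw).__name__, type(fixed).__name__)
print(garbled == original)
print(fixed == original)
```

The bytes never change; only the codec used to interpret them does, which is what setting response.encoding controls.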

1.4 Saving an image with requests

import requests

response = requests.get('https://www.baidu.com/img/bd_logo1.png?where=super')

with open('img.png', 'wb') as f:
    f.write(response.content)

1.5 Common response attributes

response.text
response.content
response.status_code

The status code

response.request.url

The URL that was requested

response.request.headers

The request headers

{
  'User-Agent': 'python-requests/2.19.1',
  'Accept-Encoding': 'gzip, deflate',
  'Accept': '*/*',
  'Connection': 'keep-alive'
}

response.headers

The response headers

{
  'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform',
  'Connection': 'Keep-Alive',
  'Content-Encoding': 'gzip',
  'Content-Type': 'text/html',
  'Date': 'Sun, 13 Jan 2019 02:15:14 GMT',
  'Last-Modified': 'Mon, 23 Jan 2017 13:27:32 GMT',
  'Pragma': 'no-cache',
  'Server': 'bfe/1.0.8.18',
  'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/',
  'Transfer-Encoding': 'chunked'
}
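
The Content-Encoding: gzip header in the response above means the body travelled compressed, and requests decompresses it before exposing response.content. The sketch below imitates that step with the standard gzip module. The sample HTML is an assumption and nothing is fetched.

```python
import gzip

html = '<html><body>hello</body></html>' * 20
wire_body = gzip.compress(html.encode('utf-8'))   # what a server would send

# What a client does when it sees 'Content-Encoding: gzip'
body = gzip.decompress(wire_body)

print(len(wire_body) < len(body))
print(body.decode('utf-8') == html)
```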

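1.6 Sending requests with headers and parameters

Headers

Headers are sent so that the request looks like it comes from a browser and the server returns the same content a browser would receive

Form of headers: a dict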

headers = {
  "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0"
}

Usage: requests.get(url, headers=headers)

Parameters
- Example: www.baidu.com/s?wd=python
- Form of the parameters: a dict
```
params = {
    "wd": "python"
}
```
- Usage: requests.get(url, params=params)

Code example:

import requests

url = 'http://www.baidu.com/s?'

headers = {
  "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0"
}

params = {
  "wd": "python"
}

response = requests.get(url, headers=headers, params=params)

print(response.request.url)
print(response.status_code)

# Alternative: build the URL with str.format
url_param = 'http://www.baidu.com/s?wd={}'.format('python')
response1 = requests.get(url_param, headers=headers)
print(response1.request.url)
print(response1.status_code)
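
As a rough illustration of what passing a params dict achieves, the sketch below builds a query string with the standard library's urlencode. The helper name build_url is hypothetical; requests does a comparable encoding internally when given params.

```python
from urllib.parse import urlencode


def build_url(base, params):
    """Append params to base as a percent-encoded query string."""
    return base + '?' + urlencode(params)


print(build_url('http://www.baidu.com/s', {'wd': 'python'}))
print(build_url('https://tieba.baidu.com/f', {'kw': '李毅', 'pn': 50}))
```

Compared with plain str.format, encoding the dict escapes non-ASCII text and reserved characters, so a value such as 李毅 becomes %E6%9D%8E%E6%AF%85 in the URL.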

1.7 A Tieba scraper

import requests


class TiebaSpider(object):
    def __init__(self, tieba_name):
        self.name = tieba_name
        self.url_tmp = 'https://tieba.baidu.com/f?kw=' + tieba_name + '&ie=utf-8&pn={}'
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0"
        }

    def get_url_list(self):
        # Build one URL per page; pn advances by 50 for each page.
        # List comprehension, e.g. [i * 2 for i in range(3)] gives [0, 2, 4]
        return [self.url_tmp.format(i * 50) for i in range(1000)]

    def parse_url(self, url):
        print(url)
        response = requests.get(url, headers=self.headers)
        return response.content.decode()

    def save_html(self, html, page_num):
        file_path = '{}-page-{}.html'.format(self.name, page_num)
        with open(file_path, 'w', encoding='utf-8') as f:
            f.write(html)

    def run(self):
        # 1. Build url_list
        url_list = self.get_url_list()
        # 2. Request each URL in turn; enumerate supplies the page number
        for page_num, url in enumerate(url_list, start=1):
            html = self.parse_url(url)
            # 3. Save the page
            self.save_html(html, page_num)


if __name__ == '__main__':
    # tieba_spider = TiebaSpider('李毅')
    tieba_spider = TiebaSpider('lol')
    tieba_spider.run()

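2. Advanced

2.1 Sending POST requests

When POST is needed:

Login and registration: POST keeps the submitted data out of the URL, which makes it safer than GET
Transferring large amounts of text: POST places no limit on the length of the data

Usage: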

response = requests.post("http://www.baidu.com/", data = data,headers=headers)

Note the difference from GET: GET takes params=data while POST takes data=data, and in both cases data is a dict
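
The difference between params and data comes down to where the encoded dict ends up. The sketch below is a standard-library approximation, not the requests implementation; the function place_data, the URL, and the credentials are all made up for illustration.

```python
from urllib.parse import urlencode


def place_data(method, url, data):
    """Return (url, body) showing where a dict of data ends up."""
    encoded = urlencode(data)
    if method.upper() == 'GET':
        return url + '?' + encoded, None     # data is visible in the URL
    return url, encoded.encode('utf-8')      # data travels in the request body


data = {'user': 'alice', 'password': 's3cret'}
print(place_data('GET', 'http://example.com/login', data))
print(place_data('POST', 'http://example.com/login', data))
```

With GET the credentials become part of the URL, where they can be logged or cached; with POST the URL stays clean and the same string is sent as the body.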

Example:
Baidu Translate API

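2.2 Using proxies

Reasons:

Make the server think the requests are not all coming from the same client
Keep our real address from being exposed

How a proxy works:

The client sends its request to the proxy, which forwards it to the server and relays the response back.

Forward and reverse proxies:

Generally, when the client does not know the address of the final server, the proxy is a reverse proxy; when the client does know it, the proxy is a forward proxy.

Usage: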

requests.get("http://www.baidu.com", proxies = proxies)

proxies is a dict

proxies = { 
    "http": "http://12.34.56.79:9527", 
    "https": "https://12.34.56.79:9527", 
}

Private proxies

If the proxy requires HTTP Basic Auth, use the following format:

proxies = { 
  "http": "http://user:password@10.1.10.1:1234"
}

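Using proxy IPs:

Prepare a batch of IP addresses as an IP pool and pick one at random for each request
How to pick a proxy IP at random so that less-used addresses are more likely to be chosen
- Track each entry as {"ip": ip, "times": 0}
- Keep the entries in a list, [{},{},{},{},{}], and sort it by usage count
- Take the 10 least-used IPs and choose one of them at random
Check that each IP is usable
- Pass a timeout parameter to requests to gauge the quality of the IP
- Use an online proxy IP quality-checking site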
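
The selection steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name pick_proxy, its parameters, and the placeholder addresses are all hypothetical, and no proxy is actually contacted.

```python
import random


def pick_proxy(pool, candidates=10, rng=random):
    """Pick one proxy at random from the `candidates` least-used entries."""
    least_used = sorted(pool, key=lambda p: p['times'])[:candidates]
    chosen = rng.choice(least_used)
    chosen['times'] += 1      # record the use so the ordering shifts over time
    return chosen


# Hypothetical pool: 15 placeholder addresses, the first 5 already heavily used.
pool = [{'ip': '10.0.0.{}:8080'.format(i), 'times': 100 if i < 5 else 0}
        for i in range(15)]

proxy = pick_proxy(pool)
proxies = {'http': 'http://' + proxy['ip']}
print(proxies)
```

The resulting proxies dict has the shape expected by requests.get(url, proxies=proxies). Because the counter is incremented on every pick, heavily used addresses drop out of the candidate set until the others catch up.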

Example:

import requests
proxies = {
    "http": "http://119.101.113.180:9999"
}
response = requests.get("http://www.baidu.com", proxies=proxies)
print(response.status_code)

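2.3 Cookie and Session

Cookie data is stored in the client's browser; session data is stored on the server.
Cookies are not very secure: someone can inspect the cookies stored locally and forge them.
A session is kept on the server for a period of time, so as traffic grows it consumes more server resources.
A single cookie can hold no more than 4 KB, and many browsers limit a site to 20 cookies.

Three ways to fetch a page that requires login

Instantiate a session, send the login POST with it, then use it to fetch the logged-in page
- Instantiate a session
- Send the login request with the session so that the cookies are stored in the session
- Use the same session to request pages that require login; it automatically sends the cookies saved at login
Add a Cookie key to headers whose value is the cookie string
Pass a cookies parameter to the request method, which takes the cookies as a dict whose keys are cookie names and whose values are cookie values
- To send requests with many different cookies, collect them into a cookie pool

Session code example: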

import requests

session = requests.session()
url = 'http://www.renren.com/PLogin.do'
data = {
    "email": "****",
    "password": "*****"
}
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
}
session.post(url, headers=headers, data=data)
response = session.get('http://www.renren.com/969487434/profile', headers=headers)
html = response.content.decode()

with open('renren1.html', 'w', encoding='utf-8') as f:
    f.write(html)

Example of adding Cookie to headers:

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36",
    "Cookie": "anonymid=jr2yoyv4-l71jfd; depovince=SD; _r01_=1; JSESSIONID=abc5vZNDY5GXOfh79uKHw; ick_login=8e8d2154-31f7-47d6-afea-cd7da7f60cd7; ick=47b5a827-ecaf-4b44-ab57-c433e8f73b67; first_login_flag=1; ln_uact=13654252805; ln_hurl=http://head.xiaonei.com/photos/0/0/men_main.gif; jebe_key=cf5d55ad-7eb1-4b50-848a-25d4c8081154%7C07a531353a345fda40d3ab252602e2f6%7C1547871575690%7C1%7C1547871574048; wp_fold=0; jebecookies=9724d2c7-5e9c-4be9-92d0-bf9b6dffd455|||||; _de=D0539E08F82219B3A527C713E360D2ED; p=7f8736045559e52d93420c14f063d70e4; t=522278d3c40436e9d5e7b3dc2650e55a4; societyguester=522278d3c40436e9d5e7b3dc2650e55a4; id=969487434; ver=7.0; xnsid=9ba2d506; loginfrom=null"
}
response = requests.get('http://www.renren.com/969487434/profile', headers=headers)
html = response.content.decode()

with open('renren2.html', 'w', encoding='utf-8') as f:
    f.write(html)

Passing the cookies parameter to the request method:

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
}
Cookie = 'anonymid=jr2yoyv4-l71jfd; depovince=SD; _r01_=1; JSESSIONID=abc5vZNDY5GXOfh79uKHw; ick_login=8e8d2154-31f7-47d6-afea-cd7da7f60cd7; ick=47b5a827-ecaf-4b44-ab57-c433e8f73b67; first_login_flag=1; ln_uact=13654252805; ln_hurl=http://head.xiaonei.com/photos/0/0/men_main.gif; jebe_key=cf5d55ad-7eb1-4b50-848a-25d4c8081154%7C07a531353a345fda40d3ab252602e2f6%7C1547871575690%7C1%7C1547871574048; wp_fold=0; jebecookies=9724d2c7-5e9c-4be9-92d0-bf9b6dffd455|||||; _de=D0539E08F82219B3A527C713E360D2ED; p=7f8736045559e52d93420c14f063d70e4; t=522278d3c40436e9d5e7b3dc2650e55a4; societyguester=522278d3c40436e9d5e7b3dc2650e55a4; id=969487434; ver=7.0; xnsid=9ba2d506; loginfrom=null'
# Dict comprehension; split only on the first "=" so values containing "=" stay intact
cookies = {i.split("=", 1)[0]: i.split("=", 1)[1] for i in Cookie.split("; ")}
response = requests.get('http://www.renren.com/969487434/profile', headers=headers, cookies=cookies)
html = response.content.decode()

with open('renren3.html', 'w', encoding='utf-8') as f:
    f.write(html)
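
When a Cookie header string is turned into a dict, splitting each pair only on the first "=" keeps values that themselves contain "=" intact. The helper below is a small sketch of that idea; the name parse_cookie_header and the sample string are made up for illustration.

```python
def parse_cookie_header(cookie_string):
    """Turn 'a=1; b=2' into {'a': '1', 'b': '2'}."""
    cookies = {}
    for pair in cookie_string.split('; '):
        if '=' not in pair:
            continue                        # skip malformed fragments
        name, value = pair.split('=', 1)    # split once: values may contain '='
        cookies[name] = value
    return cookies


# Made-up cookie string; the token value ends with base64-style '=' padding.
sample = 'id=42; token=YWJjZA==; theme=dark'
print(parse_cookie_header(sample))
```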

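2.4 Finding the login POST URL

Inspect the HTML page and look for the URL in the action attribute of the form
- The POST data is a dict whose keys are the name attributes of the input tags and whose values are the real username and password; the POST URL is the URL given by action
Capture the traffic and look for the login URL
- Tick Preserve log so the URL is not lost when the page redirects
- Find the POST data and work out the parameters
  - Parameters that never change can be used as-is, for example when the password is not dynamically encrypted
  - Parameters that do change
    - may appear in the current response
    - may be generated by JS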
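
The form-inspection approach can be sketched with the standard library's html.parser: collect the form's action and the name of every input, keeping preset values such as hidden fields that arrive in the current response. The class name LoginFormParser and the sample markup are hypothetical; a real login page will differ.

```python
from html.parser import HTMLParser


class LoginFormParser(HTMLParser):
    """Collect the first form's action URL and the names of its inputs."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}
        self._in_form = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'form' and self.action is None:
            self.action = attrs.get('action')
            self._in_form = True
        elif tag == 'input' and self._in_form and 'name' in attrs:
            # Keep any preset value, e.g. from hidden fields
            self.fields[attrs['name']] = attrs.get('value') or ''

    def handle_endtag(self, tag):
        if tag == 'form':
            self._in_form = False


# Hypothetical login page markup
page = '''
<form action="/PLogin.do" method="post">
  <input type="text" name="email">
  <input type="password" name="password">
  <input type="hidden" name="origURL" value="/home">
</form>
'''

parser = LoginFormParser()
parser.feed(page)
print(parser.action)
print(parser.fields)
```

Filling in the empty values of parser.fields with the real credentials gives the dict to pass as data, and parser.action, joined with the site's base URL, gives the address to POST to.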

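2.5 Locating the JS you need

Select the button that triggers the JS event, open Event Listeners, and find where the JS is located
Use Chrome's "Search all files" to search for keywords from the URL
Set breakpoints to see what the JS does, then do the same thing in Python

2.6 requests tips

requests.utils.dict_from_cookiejar converts a cookie object into a dict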

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36",
    "Cookie": "anonymid=jr2yoyv4-l71jfd; depovince=SD; _r01_=1; JSESSIONID=abc5vZNDY5GXOfh79uKHw; ick_login=8e8d2154-31f7-47d6-afea-cd7da7f60cd7; ick=47b5a827-ecaf-4b44-ab57-c433e8f73b67; first_login_flag=1; ln_uact=13654252805; ln_hurl=http://head.xiaonei.com/photos/0/0/men_main.gif; jebe_key=cf5d55ad-7eb1-4b50-848a-25d4c8081154%7C07a531353a345fda40d3ab252602e2f6%7C1547871575690%7C1%7C1547871574048; wp_fold=0; jebecookies=9724d2c7-5e9c-4be9-92d0-bf9b6dffd455|||||; _de=D0539E08F82219B3A527C713E360D2ED; p=7f8736045559e52d93420c14f063d70e4; t=522278d3c40436e9d5e7b3dc2650e55a4; societyguester=522278d3c40436e9d5e7b3dc2650e55a4; id=969487434; ver=7.0; xnsid=9ba2d506; loginfrom=null"
}
response = requests.get('http://www.renren.com/969487434/profile', headers=headers)
print(response.cookies)
print(requests.utils.dict_from_cookiejar(response.cookies))

Skipping SSL certificate verification

response = requests.get("https://www.12306.cn/mormhweb/", verify=False)

Setting a timeout

response = requests.get("https://www.baidu.com", timeout=2)

Using the status code to check that the request succeeded

assert response.status_code == 200

Sample code (usable as a utility module):

import requests
from retrying import retry

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
}


@retry(stop_max_attempt_number=3)
def _request_url(url, method, data):
    if method == 'POST':
        response = requests.post(url, headers=headers, data=data, timeout=3)
    else:
        response = requests.get(url, headers=headers, params=data, timeout=3)
    # A failed assertion raises, which makes @retry try again
    assert response.status_code == 200
    return response.content.decode()


def request_url(url, method='GET', data=None):
    try:
        html = _request_url(url, method, data)
    except Exception:
        # All attempts failed
        html = None
    return html


if __name__ == '__main__':
    url = 'http://www.baidu.com'
    print(request_url(url))
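
To make the behaviour of a retry decorator concrete, here is a small standard-library stand-in. It is not the implementation of the retrying package, only a sketch of the idea: call the function again after any exception, up to a fixed number of attempts. The names retry, max_attempts, and flaky are assumptions for this example.

```python
import functools


def retry(max_attempts=3):
    """Re-run the wrapped function until it succeeds or attempts run out."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            last_error = None
            for _ in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except Exception as error:   # any failure triggers another attempt
                    last_error = error
            raise last_error                 # every attempt failed
        return wrapper
    return decorator


calls = []


@retry(max_attempts=3)
def flaky():
    """Fail twice, then succeed, to simulate an unreliable request."""
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError('simulated failure')
    return 'ok'


print(flaky(), len(calls))
```

This is why a failing assert on the status code inside the decorated function is useful: the AssertionError counts as a failure and triggers another attempt.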

2.7 Web client authentication

For web client authentication, pass auth=(username, password)

import requests
auth=('test', '123456')
response = requests.get('http://192.168.199.107', auth = auth)
print(response.text)
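
An auth tuple is sent as HTTP Basic Auth, which amounts to an Authorization header holding the base64 of "username:password". The sketch below builds that header by hand with the standard library; the function name basic_auth_header is hypothetical, and the credentials are the ones from the example above.

```python
import base64


def basic_auth_header(username, password):
    """Build the Authorization header value for HTTP Basic Auth."""
    token = base64.b64encode('{}:{}'.format(username, password).encode('utf-8'))
    return 'Basic ' + token.decode('ascii')


header = basic_auth_header('test', '123456')
print(header)
# Decoding the token shows the credentials are only encoded, not encrypted
print(base64.b64decode(header.split(' ', 1)[1]))
```

Since the token can be decoded by anyone who sees it, Basic Auth is only reasonable over HTTPS.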
