requests, BeautifulSoup

Published by 不愿透露姓名的小村村 on 2024-07-18

requests

Basic usage of requests.get()

# Import the library
import requests
# GET without parameters
response = requests.get('url')
# GET with custom headers

headers = {'referer': 'http://xxxxxx.net/',
           'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:47.0) Gecko/20100101 Firefox/47.0'}
response = requests.get('url', headers=headers)

# Or pass query parameters via params
r1 = requests.get(url='http://dict.baidu.com/s', params={'wd': 'python'})  
print(r1.url)

>>> http://dict.baidu.com/s?wd=python

# Or pass the parameters as a payload dict; requests encodes them into the URL
payload = {'keyword': '香港', 'salecityid': '2'}
r = requests.get("http://m.ctrip.com/webapp/tourvisa/visa_list", params=payload) 
print(r.url) 

>>> http://m.ctrip.com/webapp/tourvisa/visa_list?salecityid=2&keyword=香港
# Response body as text
response.text
# Binary content (images, video, etc.)
response.content
# Set the response encoding (reading response.encoding shows the detected one; assigning overrides it)
response.encoding = 'utf-8'
# Status code
response.status_code
# Set a timeout (in seconds)
r = requests.get('http://m.ctrip.com', timeout=0.001)
# The first GET/POST returns cookies from the site; some POST logins need them sent back.
# Grab the cookies as a dict:
response_cookie_dic = response.cookies.get_dict()
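
Putting these pieces together, a minimal sketch of a full GET (example.com is a placeholder URL; the status and encoding handling are assumptions about a typical text page):

import requests

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:47.0) Gecko/20100101 Firefox/47.0'}
response = requests.get('http://example.com', headers=headers, timeout=5)

if response.status_code == 200:
    response.encoding = 'utf-8'   # override if the detected encoding is wrong
    print(response.text[:200])    # first 200 characters of the page
    cookie_dic = response.cookies.get_dict()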

Basic usage of requests.post(); use it to submit login credentials or send data to the server

# Basic POST usage

# Parameters to submit
post_data = {
    "phone": '86' + '01234567890',
    'password': '123',
    'oneMonth': 1  # stay logged in for one month
}
# Always send a browser User-Agent, otherwise the site's firewall may block the request
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:58.0) Gecko/20100101 Firefox/58.0'}

response = requests.post(
    url='https://xxxxx.com/login',
    headers=headers,
    data=post_data,
)
# Cookies returned by the login
response_cookie_dic = response.cookies.get_dict()

# Make another request carrying the cookies obtained above
response = requests.post(
    url='https://xxxxx.com/login',
    headers=headers,
    data=post_data,
    cookies=response_cookie_dic
)
# Use proxies to avoid hitting the site from the same IP repeatedly.
# The format is proxy_dict = {'http': 'ip:port', 'https': 'ip:port'};
# you can keep a pool of proxy IPs and discard each one after use.
response = requests.get(target_url, headers=headers, proxies=proxy_dict, timeout=30)

requests.Session keeps cookies automatically, so there is no need to manage them by hand

s = requests.Session()

# GET request through the session; cookies set by earlier responses are sent automatically
target_response = s.get(url=target_url, headers=target_headers)
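
A short sketch of the login-then-fetch pattern with a Session (the login URL, post_data, and the /profile path are placeholders carried over from the POST example above):

import requests

s = requests.Session()
s.headers.update({'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:58.0) Gecko/20100101 Firefox/58.0'})

# Log in once; the session stores whatever cookies the server sets
s.post('https://xxxxx.com/login', data=post_data)

# Subsequent requests reuse those cookies automatically, no cookies= needed
profile = s.get('https://xxxxx.com/profile')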

Configuring timeouts and retry counts

import requests
from requests.adapters import HTTPAdapter

headers = dict()  # build the default headers
headers["User-Agent"] = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36"
headers["Accept"] = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"
headers["Accept-Encoding"] = "gzip, deflate, sdch"
headers["Accept-Language"] = "zh-CN,zh;q=0.8"
request_retry = HTTPAdapter(max_retries=3)  # retry failed connections up to 3 times

def my_get(url, refer=None):
    session = requests.Session()
    session.headers = headers
    if refer:
        headers["Referer"] = refer
    session.mount('https://', request_retry)  # use the retrying adapter for these URL prefixes
    session.mount('http://', request_retry)
    return session.get(url, timeout=10)  # the timeout is set per request, in seconds
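
Usage is then a one-liner (example.com as a stand-in URL):

r = my_get('http://example.com', refer='http://example.com/')
print(r.status_code)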

The simplest implementation: use the random module to pick a random IP and User-Agent from pools

import random

# User-Agent pool
header_list = ["Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0",
               "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3704.400 QQBrowser/10.4.3587.400"]
# IP pool
ip_pools = ["123.54.44.4:9999",
            "110.243.13.120:9999",
            "183.166.97.101:9999"]

random_ip = random.choice(ip_pools)
headers = {"User-Agent": random.choice(header_list)}
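
To use the random picks, build a proxies dict and pass both along (continuing from the pools above; example.com stands in for the real target, and the sample IPs are long dead):

import requests

proxy_dict = {'http': 'http://' + random_ip,
              'https': 'http://' + random_ip}
response = requests.get('http://example.com', headers=headers,
                        proxies=proxy_dict, timeout=10)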

BeautifulSoup

# Import
from bs4 import BeautifulSoup

# Parse; 'html.parser' can be swapped for 'lxml', and lxml is also the only parser that supports XML
soup = BeautifulSoup(response.text, 'html.parser')

# Find the first tag of a given name (a, div, or any other HTML tag)
tag1 = soup.find('a')

# Find the first tag whose id is d1
tag2 = soup.find(id='d1')

# Find the first div tag whose id is d1
tag3 = soup.find('div', id='d1')

# find_all returns every matching tag as an iterable; loop over it with for
# attrs takes the attributes the tags must match
all_div = soup.find_all('div', attrs={'class': 'card'})
for div in all_div:
    print(div.img['src'])

# select: CSS selectors
soup.select("title")
soup.select("body a")
soup.select("#link1")
soup.select('a[href]')
soup.select("p > a")
li_list = soup.select("div.postlist ul#pins li")
href = BeautifulSoup(response.content, "lxml").select_one("div.main-image img").attrs["src"]

Writing to a file

# os.mkdir creates the directory relative to the current working directory
import os

if not os.path.exists("img"):
    os.mkdir("img")
if not os.path.exists("img/" + str(start_num)):
    os.mkdir("img/" + str(start_num))
# src_response is the response for the image URL; start_num and img_name come from the surrounding script
with open("img/" + str(start_num) + "/" + img_name + ".jpg", 'wb') as fs:
    fs.write(src_response.content)
    print("download success!")
