基於bs4+requests的python爬蟲偽裝

瀟灑坤發表於2018-07-20

要匯入fake-useragent庫,需要先用pip安裝,安裝命令:pip install fake-useragent
params是爬蟲偽裝的引數,資料型別為字典dict,裡面有2個鍵值對,2個鍵:headers和proxies。
headers的資料型別是字典,裡面有1個鍵值對,鍵User-Agent對應的值資料型別為字串,User-Agent中文翻譯是使用者代理。
proxies的資料型別是字典,裡面有1個鍵值對,鍵http對應的值資料型別為字串,是代理伺服器的url。
匿名ip主要是從66ip.cn網站獲取。

import requests
from bs4 import BeautifulSoup as bs
from fake_useragent import UserAgent
import random

def getSoup(url, encoding="utf-8", **params):
    """Fetch *url* and return its body parsed as a BeautifulSoup object.

    :param url: page to request
    :param encoding: charset used to decode the response body (default utf-8)
    :param params: extra keyword arguments forwarded to ``requests.get``,
        e.g. ``headers=`` and ``proxies=`` for crawler disguise
    :return: BeautifulSoup of the response text, parsed with the lxml parser
    """
    print(params)  # debug: show which disguise parameters were actually used
    response = requests.get(url, **params)
    response.encoding = encoding
    # bug fix: `lxml` (backticks) is invalid Python — the parser name must be
    # a string literal; also fixed the "reponse" typo
    soup = bs(response.text, "lxml")
    return soup

def cssFind(movie, cssSelector, nth=1):
    """Return the stripped text of the nth (1-based) match of *cssSelector*.

    :param movie: any object exposing a ``select(selector)`` method
        (e.g. a BeautifulSoup tag)
    :param cssSelector: CSS selector string passed to ``select``
    :param nth: 1-based index of the desired match (default: first)
    :return: ``matches[nth-1].text.strip()``, or "" when there are fewer
        than *nth* matches
    """
    # hoist: the original called select() twice for the same selector
    matches = movie.select(cssSelector)
    if len(matches) >= nth:
        return matches[nth - 1].text.strip()
    # bug fix: the original returned `` (backticks) — a syntax error; the
    # intended sentinel is an empty string
    return ""

def getProxyList():
    """Scrape one randomly-chosen area page of 66ip.cn and return the
    proxies it lists as "http://ip:port" URL strings.
    """
    page_url = "http://www.66ip.cn/areaindex_2/{}.html".format(random.randint(1, 10))
    soup = getSoup(page_url)
    # skip the first two table rows (headers), then read ip/port per row:
    # 1st <td> holds the IP address, 2nd <td> holds the port
    rows = soup.select("table tr")[2:]
    return [
        "http://{}:{}".format(cssFind(row, "td"), cssFind(row, "td", 2))
        for row in rows
    ]

def getParams():
    """Build crawler-disguise keyword arguments for ``requests.get``.

    :return: dict with two keys:
        headers -- {"User-Agent": <random UA string from fake_useragent>}
        proxies -- {"http": <random proxy URL scraped from 66ip.cn>}
    """
    ua = UserAgent()
    ip_list = getProxyList()
    # bug fix: the dict keys were backtick-quoted (`User-Agent`, `http`),
    # which is invalid Python — they must be string literals
    params = dict(
        headers={"User-Agent": ua.random},
        proxies={"http": random.choice(ip_list)},
    )
    return params

if __name__ == "__main__":
    # Demo: fetch page 3 of the Douban Top250 list through randomized
    # disguise parameters (random User-Agent + random proxy).
    disguise = getParams()
    soup = getSoup("https://movie.douban.com/top250?start=50", **disguise)


相關文章