A small Python crawler: DrissionPage + BeautifulSoup

Posted by net郝 on 2024-06-16

Original URL: https://www.cnblogs.com/hkf100/p/18250083


Hello everyone, I'm starting to blog 💪 If anything in this post is wrong, please point it out. Thanks!

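1. Preface

Whenever I look for books or other resources online, I see resource sites stating that their "information comes from the Internet". So I decided to draw my own bow too, and called in my little buddy Python to help me work out how to crawl, crawl, crawl.

For the first version I used requests, a third-party HTTP library installed with pip (it is not bundled with Python). It's not that it can't work; it just didn't work here. Even with all kinds of proxies it kept falling over, so that was over.

Then I looked into selenium. It was a real pain to configure, since the browser driver version has to match the browser version, so in the end that was over as well.

Finally I found DrissionPage, which doesn't fall over. I don't know which expert had so much free time as to build something this handy...

Enough talk, let's get to work.

First, a screenshot of the final result (the image is in the original post).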

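2. Logic

The logic is actually very simple:

Open the browser
Go to Baidu
Search for the keyword
Collect the URLs of the result pages (set the number of result pages yourself)
Look through the fetched pages for the resources I want (here, net-disk share links)
Write the results to Excel
Over and out.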
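The steps above can be sketched as one small driver function. This is only an illustration of the flow, not the code from this post: `run_pipeline` and its four callable parameters are hypothetical names, and the stand-in functions at the bottom replace the real browser work so the flow can be run anywhere.

```python
from typing import Callable, Iterable, List


def run_pipeline(
    key: str,
    search: Callable[[str], Iterable[str]],
    fetch: Callable[[str], str],
    extract: Callable[[str, str], List[dict]],
    write: Callable[[List[dict]], None],
) -> List[dict]:
    """Search for `key`, fetch each result page, extract matches, write them out."""
    rows: List[dict] = []
    for url in search(key):
        html = fetch(url)
        if not html:  # a failed fetch yields an empty string, so skip it
            continue
        rows.extend(extract(html, url))
    write(rows)
    return rows


if __name__ == "__main__":
    # Stand-in functions (assumptions) so the flow runs without a browser.
    fake_pages = {"http://a.example": "some page text", "http://b.example": ""}
    run_pipeline(
        "some keyword",
        search=lambda key: list(fake_pages),
        fetch=lambda url: fake_pages[url],
        extract=lambda html, url: [{"url": url, "text": html}],
        write=lambda rows: print("would write %d row(s)" % len(rows)),
    )
```

In the real program the four callables would roughly correspond to the search, fetch, match and write functions described in the next section.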

3. Function definitions

First, the functions I have "loaded up":

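import queue

# Queue that stores the results
data = queue.Queue()

# Opens the browser, types the keyword into Baidu, runs the search and returns
# the result URLs (Baidu shows 10 results per page)
# key:  Baidu search keyword
# port: port offset. This is a pitfall: without a custom port the same port
#       gets reused, and with multiple threads that often fails with
#       "element not found" (presumably the threads end up attached to the
#       same browser)
def baiduGetUrl(key, port):
    # return ['http://....', *, *]
    pass

# Gets the HTML content, clicking a few close buttons along the way
# (open a page and it throws a pop-up at you, so of course you close it)
# url:  address given by Baidu
# port: port offset (custom and randomised; make it large)
def getHtml(url, port):
    # return page.html
    pass

# Finds the matching content
def searchHtml(html, url):
    # soup.find_all(....)
    # return result
    pass

# Handles both "get the HTML" and "find the matching content in the HTML"
def getHtmlProc(url, port):
    pass

# Writes the results to a file
def create_excel(filename):
    pass

# The main program starts several processes, each of which calls this function
# key: Baidu search keyword; port: port offset
def eachGetData(key, port):
    # Get the URLs found for key, i.e. call baiduGetUrl
    # Loop over the URLs, start a few threads and run getHtmlProc in them
    # Finally write the results to a file with create_excel
    pass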
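The body of create_excel is not shown in this post. As a rough sketch of the "drain the queue and write a file" step, the function below empties a queue of result batches into a CSV file that Excel can open. Note the swap: this writes CSV with the standard library, not a real .xlsx workbook. `drain_to_csv` is a hypothetical name, and the column names simply follow the dict keys used later in searchHtml.

```python
import csv
import queue


def drain_to_csv(result_queue, filename):
    """Empty the queue of result batches and write them to a CSV file.

    Each queue item is expected to be a list of dicts (one batch per page).
    Returns the number of rows written.
    """
    fields = ["url", "thist", "thish", "prev", "next"]
    count = 0
    # utf-8-sig adds a BOM so that Excel detects the encoding of Chinese text.
    with open(filename, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.DictWriter(f, fieldnames=fields, extrasaction="ignore")
        writer.writeheader()
        while True:
            try:
                batch = result_queue.get_nowait()
            except queue.Empty:
                break  # nothing left to write
            for row in batch:
                writer.writerow(row)
                count += 1
    return count
```

Calling `drain_to_csv(data, "result.csv")` after all worker threads have finished would write everything collected so far; the file name here is just an example.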

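4. Core code walkthrough

I don't know whether the function definitions above made sense to everyone, or whether defining them that way is even sound. Never mind, the fact that it runs is good enough...

Below is a brief walk through the code; a link to the complete code is given at the end of the post.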


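import logging
import time

from DrissionPage import ChromiumOptions, ChromiumPage

# Assumption: the original does not show how `logger` is created.
logger = logging.getLogger(__name__)


def baiduGetUrl(key, port):
    resultUrl = []
    co = ChromiumOptions().headless()
    co.auto_port()
    # Give every worker its own debugging port so browsers are not shared
    co.set_local_port(9432 + port)
    page = ChromiumPage(co)
    page.get("http://www.baidu.com", retry=999)
    page.wait.load_start()
    page('#kw').input(key)  # search box
    page("#su").click()     # search button
    page.wait.load_start()

    for i in range(1, 16):  # walk through up to 15 result pages
        try:
            if i != 1:
                # Baidu's "next page" link; the page text is Simplified Chinese
                next_page = page.ele("@text()=下一页 >")
                if not next_page:
                    next_page = page.ele("@text()=下一页 >")  # one retry
                print(next_page)
                next_page.click()
                page.wait.load_start()
                time.sleep(2)
            # Every result title is an <h3> wrapping the result link
            for h3 in page.eles("t:h3"):
                a = h3.ele("t:a")
                resultUrl.append(a.attrs.get("href"))
        except Exception as ex:
            logger.error("Error while paging: " + str(ex))
            print("Error while paging: " + str(ex))
    page.close()
    print("Number of page URLs collected = %d" % len(resultUrl))
    return resultUrl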
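Clicking "next page" fifteen times is the slow part of the function above. A possible alternative, which is not what the post does, is to build each result page's URL directly. The sketch below assumes Baidu's `wd` (query) and `pn` (result offset) query parameters and 10 results per page; `baidu_search_url` is a hypothetical helper, and the URL format may change on Baidu's side.

```python
from urllib.parse import urlencode


def baidu_search_url(key, page_no=1, per_page=10):
    """Build the search URL for a 1-based result page number.

    Assumption: Baidu takes the query in `wd` and the result offset in `pn`.
    """
    if page_no < 1:
        raise ValueError("page_no starts at 1")
    query = urlencode({"wd": key, "pn": (page_no - 1) * per_page})
    return "https://www.baidu.com/s?" + query


print(baidu_search_url("python", 3))
# https://www.baidu.com/s?wd=python&pn=20
```

Each URL could then be passed to `page.get(...)` in place of the click-and-wait step, at the cost of looking less like a human visitor.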


import random
import threading


def getHtml(url, port):
    try:
        co = ChromiumOptions().headless()
        co.auto_port()
        # Build a port that is unlikely to clash: base + offset + first three
        # digits of the thread id + a random number
        thread_id = threading.get_ident()
        num = random.randint(1, 40000)
        this_port = 9012 + port + int(str(thread_id)[0:3]) + num
        print("Port: %d" % this_port)
        co.set_local_port(this_port)
        page = ChromiumPage(co)

        page.get(url, retry=999)
        page.wait.load_start()
        time.sleep(1)
        # Pop-up close buttons seen on various sites; click whichever exists.
        # "关闭" ("close") stays in Simplified Chinese, as on the target pages.
        close_selectors = [
            "@aria-label:关闭",  # close button on some sites
            "@style^position: absolute;top: 16px;left: -20px;width: 16px;height: 16px",  # CSDN close
            "@class=close-btn",
            "@id=access",
            "@class=close",
        ]
        for selector in close_selectors:
            try:
                btn = page.ele(selector)
                if btn:
                    btn.click()
            except Exception:
                pass  # a missing or unclickable button is not fatal
        html = page.html
        page.close()
        print("Got the HTML content")
        return html
    except Exception as ex:
        logger.error("Error while getting HTML: " + str(ex))
        print("Error while getting HTML: " + str(ex))
        return ""
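The port built in getHtml is random, so two threads can still land on the same number now and then. One common alternative, which is not what the post does, is to ask the operating system for a port that is free at that moment. `find_free_port` is a hypothetical helper name; note that another program could still grab the port between this call and the browser start, so it narrows the problem without removing it completely.

```python
import socket


def find_free_port(host="127.0.0.1"):
    """Ask the OS for a TCP port that is unused right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))  # port 0 lets the OS choose a free port
        return s.getsockname()[1]


if __name__ == "__main__":
    print("free port:", find_free_port())
```

The returned number could be passed to `co.set_local_port(...)` in place of the hand-built value.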

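The matching used here is a bit crude: it picks up some wrong results, and some of the results have already expired (dead share links). This still needs work...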


from bs4 import BeautifulSoup
from bs4.element import Tag


def _text(node):
    # Text of a node; siblings may be tags, bare strings or None
    if node is None:
        return None
    return node.get_text() if isinstance(node, Tag) else str(node)


def _record(node, href, url):
    return {"prev": _text(node.previous_sibling), "next": _text(node.next_sibling),
            "thist": _text(node), "url": url, "thish": href or ''}


def searchHtml(html, url):
    findData = []
    soup = BeautifulSoup(html, 'html.parser')
    print("Got HTML")
    # Text nodes starting with "link", "Baidu net-disk address", "address" or
    # "net-disk address". The literals stay in Simplified Chinese, as on the
    # target pages.
    prefixes = ('链接', ' 链接', '百度网盘地址', '地址', '网盘地址')
    textFindEle = soup.find_all(string=lambda text: text and text.startswith(prefixes))
    # Links that point at known net-disk share URLs
    share_prefixes = ('https://pan.baidu.com/s',
                      'https://pan.baidu.com/share/init?',
                      'https://pan.quark.cn/s/',
                      'https://url98.ctfile.com/d')
    hrefFindEle = soup.find_all('a', href=lambda href: href and href.startswith(share_prefixes))
    print("Found results text=%d, href=%d" % (len(textFindEle), len(hrefFindEle)))
    for textf in textFindEle:
        # A text node has no attributes of its own (the original lookup always
        # fell back to ''), so take the href of the enclosing tag if it has one
        parent = textf.parent
        findData.append(_record(textf, parent.get('href') if parent is not None else '', url))
    for a in hrefFindEle:
        findData.append(_record(a, a.get('href'), url))
    if len(findData) != 0:
        print("Adding results to the queue")
        data.put(findData)  # `data` is the module-level queue.Queue()
    return findData
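One way to tighten the matching, offered only as a sketch and not as the post's method, is to pull share links straight out of the page text with a regular expression and drop duplicates. The prefixes are the ones checked in searchHtml; `SHARE_LINK_RE` and `extract_share_links` are hypothetical names, and the set of characters allowed after the prefix is an assumption about what these links look like.

```python
import re

# Prefixes taken from the href checks in searchHtml; the trailing character
# class is an assumption about which characters a share link may contain.
SHARE_LINK_RE = re.compile(
    r"https://(?:pan\.baidu\.com/s|pan\.quark\.cn/s/|url98\.ctfile\.com/d)"
    r"[A-Za-z0-9_\-=&?/.%#]*"
)


def extract_share_links(text):
    """Return the unique share links found in `text`, in order of appearance."""
    seen = set()
    links = []
    for match in SHARE_LINK_RE.finditer(text):
        link = match.group(0).rstrip(".")  # drop a sentence-ending period
        if link not in seen:
            seen.add(link)
            links.append(link)
    return links
```

This works on raw HTML as well as visible text, so it also catches links that are printed in a paragraph without being wrapped in an `<a>` tag. It does nothing about expired links; checking those would mean opening each one.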

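5. Wrapping up

Overall, the code does find some of the results I was after.

As for the overall pros and cons: I changed the code to use multiple processes (multiprocessing.Process, which Baidu says starts a child process, and that is indeed what it does), and each child process then starts several threads (limited to 3 or 4 running at a time). You can't push it much further: with too many, the computer gives up, and so does the third-party library.

It is a lot faster than a single thread, but looking up the results still takes quite a while.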
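The "only 3 or 4 threads at a time" limit can be expressed with a thread pool from the standard library. This is a minimal sketch of the idea and not the code behind this post: `fetch_all` and its `worker` parameter are hypothetical names, and in the real program the worker would be something like a call to getHtmlProc.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, worker, max_threads=3):
    """Run worker(url) for every URL with at most `max_threads` at once.

    Results come back in the same order as `urls`; a worker that raises
    contributes None instead of aborting the whole batch.
    """
    def safe(url):
        try:
            return worker(url)
        except Exception as ex:
            print("worker failed for %s: %s" % (url, ex))
            return None

    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        return list(pool.map(safe, urls))
```

Each child process started with multiprocessing.Process could call `fetch_all` on its own share of the URLs, which keeps the total number of open browsers at roughly processes times `max_threads`.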

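Then there is the accuracy of the results: how can they be matched more precisely? OK, that one still awaits an expert...

Finally

Thanks for reading. Follow the official account and reply “python小爬虫” to receive the link to the code. If you have questions, leave a comment any time. Oh, and drop a like before you go.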
