Python Web Crawler

Posted by Sol_9 on 2024-06-13

Original URL: https://blog.csdn.net/weixin_45714641/article/details/109007822

What is a crawler?

In short:

A crawler, also called a web spider, is a program that poses as a client and exchanges data with a server.

Code

from bs4 import BeautifulSoup         # HTML parsing
import urllib.request, urllib.error   # build the request and fetch page data
import re                             # regular expressions for text matching
import xlwt                           # writing Excel (.xls) files
from tqdm import trange               # progress bar

def main():
    baseurl = "https://movie.douban.com/top250?start="
    # 1. Fetch the pages
    # 2. Parse each entry
    datalist = getDate(baseurl)
    # 3. Save the data
    savepath = "豆瓣top250.xls"
    savedata(datalist, savepath)

# Link to the film's detail page
findLink = re.compile(r'<a href="(.*?)">')     # compiled pattern (the rule a string must match)
# Link to the film's poster image
findImagesrc = re.compile(r'<img[^>]+src=["\']([^"\']+)["\']')     # regex written by an AI assistant
# Chinese title of the film
findName = re.compile(r'<span class="title">(.*)</span>')
# Rating of the film
findRating = re.compile(r'<span class="rating_num" property="v:average">(.*)</span>')
# One-line featured review of the film
findComment = re.compile(r'<span class="inq">(.*?)</span>')

def getDate(baseurl):
    datalist = []
    # 1. Fetch the pages: 10 pages of 25 films each
    for i in trange(0, 10):
        url = baseurl + str(i * 25)
        html = askURL(url)      # source of the fetched page
        soup = BeautifulSoup(html, "html.parser")
        for item in soup.find_all('div', class_="item"):
            # 2. Parse each entry
            item = str(item)
            data = []
            name = re.findall(findName, item)[0]
            data.append(name)
            link = re.findall(findLink, item)[0]
            data.append(link)
            img = re.findall(findImagesrc, item)[0]
            data.append(img)
            # findall returns a list; take the first match so that a
            # string, not a list, is written to the sheet later
            rating = re.findall(findRating, item)[0]
            data.append(rating)
            comment = re.findall(findComment, item)
            if len(comment) != 0:
                comment = comment[0].replace("。", "")
                data.append(comment)
            else:
                data.append("  ")   # some films have no featured review
            datalist.append(data)

    return datalist

def askURL(url):
    head = {
        "User-Agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Mobile Safari/537.36 Edg/121.0.0.0"
    }
    request = urllib.request.Request(url, headers=head)
    html = ""
    try:
        response = urllib.request.urlopen(request)
        html = response.read().decode("utf-8")
    except urllib.error.URLError as e:
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
    return html

def savedata(datalist, savepath):
    book = xlwt.Workbook(encoding="utf-8", style_compression=0)   # style compression, set to 0
    sheet = book.add_sheet('top250', cell_overwrite_ok=True)  # allow a cell to be overwritten by a later write
    col = ('Chinese title', 'Detail page link', 'Image link', 'Rating', 'Featured review')
    for i in range(0, len(col)):
        sheet.write(0, i, col[i])   # column headers
    # Iterate over what was actually scraped instead of assuming 250 rows,
    # so a failed page does not cause an IndexError
    for i, data in enumerate(datalist):
        for j in range(0, len(col)):
            sheet.write(i + 1, j, data[j])

    book.save(savepath)

if __name__ == "__main__":
    main()
    print("Crawl complete")

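Libraries used:
from bs4 import BeautifulSoup         # HTML parsing
import urllib.request, urllib.error   # build the request and fetch page data
import re                             # regular expressions for text matching
import xlwt                           # writing Excel (.xls) files
from tqdm import trange               # progress bar

The progress-bar library is optional. To use it, just change range to trange in the for loop and you get a progress bar.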

Approach

1. Fetch the page source

def askURL(url):
    head={
        "User-Agent":"Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Mobile Safari/537.36 Edg/121.0.0.0"
    }
    request = urllib.request.Request(url,headers=head)
    html= ""
    try:
        response=urllib.request.urlopen(request)
        html=response.read().decode("utf-8")
        #print(html)
    except urllib.error.URLError as e:
        if hasattr(e,"code"):
            print(e.code)
        if hasattr(e,"reason"):
            print(e.reason)
    return html

Use a loop to build the URL of each list page from the site's URL pattern.
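
As a minimal sketch of that loop: each list page takes a start offset that grows by 25, so ten pages cover all 250 entries. The helper name build_page_urls and its parameters are hypothetical, introduced only for illustration; the base URL is the one used in the code above.

```python
def build_page_urls(baseurl, pages=10, page_size=25):
    """Return one list-page URL per page by appending the start offset."""
    return [baseurl + str(i * page_size) for i in range(pages)]


urls = build_page_urls("https://movie.douban.com/top250?start=")
print(urls[0])    # first page, offset 0
print(urls[-1])   # last page, offset 225
```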

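Find a User-Agent with which to pose as a client
Open the Network tab of the browser's developer tools, reload the page, and look at the request headers that were sent.

This is the User-Agent: Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Mobile Safari/537.36 Edg/121.0.0.0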
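
To check that the header is really attached, the request object can be inspected before anything is sent. This is only a sketch: build_request is a hypothetical helper name, and constructing a Request does not touch the network.

```python
import urllib.request

USER_AGENT = (
    "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 "
    "Mobile Safari/537.36 Edg/121.0.0.0"
)


def build_request(url):
    """Create a Request that carries the browser User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


request = build_request("https://movie.douban.com/top250?start=0")
# Request stores header names capitalized, so the lookup key is "User-agent".
print(request.get_header("User-agent"))
```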

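Use urllib to fetch the source of the URL you built
In the try block:

Call urllib.request.urlopen(request) to send the HTTP request and get the response object, response.
response.read() returns the raw bytes sent back by the server.
decode("utf-8") decodes those bytes as UTF-8 into a string, which is assigned to the variable html.
If urlopen raises urllib.error.URLError, control passes to the except block:

If the exception object has a .code attribute, print the HTTP status code.
If the exception object has a .reason attribute, print the reason for the error.
Finally, whether or not an exception occurred, the function returns html: the page's HTML on success, or the empty string it was initialized to on failure.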
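
The two hasattr checks exist because urllib.error.HTTPError (which carries a status code) is a subclass of urllib.error.URLError (which only carries a reason), so a single except clause sees both kinds. The sketch below builds the two exception objects by hand instead of making a request; describe_error is a hypothetical helper that collects what askURL would print.

```python
import io
import urllib.error


def describe_error(e):
    """Collect the fields askURL prints, in the same order, as strings."""
    parts = []
    if hasattr(e, "code"):
        parts.append(str(e.code))
    if hasattr(e, "reason"):
        parts.append(str(e.reason))
    return parts


# An HTTP-level failure: the server answered, but with an error status.
http_error = urllib.error.HTTPError(
    "https://example.com", 404, "Not Found", None, io.BytesIO(b"")
)
# A lower-level failure: no HTTP response at all, only a reason.
url_error = urllib.error.URLError("connection refused")

print(describe_error(http_error))
print(describe_error(url_error))
```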

2. Parse the data

def getDate(baseurl):
    datalist = []
    # 1. Fetch the pages: 10 pages of 25 films each
    for i in trange(0, 10):
        url = baseurl + str(i * 25)
        html = askURL(url)      # source of the fetched page
        soup = BeautifulSoup(html, "html.parser")
        for item in soup.find_all('div', class_="item"):
            # 2. Parse each entry
            item = str(item)
            data = []
            name = re.findall(findName, item)[0]
            data.append(name)
            link = re.findall(findLink, item)[0]
            data.append(link)
            img = re.findall(findImagesrc, item)[0]
            data.append(img)
            # findall returns a list; take the first match so that a
            # string, not a list, is written to the sheet later
            rating = re.findall(findRating, item)[0]
            data.append(rating)
            comment = re.findall(findComment, item)
            if len(comment) != 0:
                comment = comment[0].replace("。", "")
                data.append(comment)
            else:
                data.append("  ")   # some films have no featured review
            datalist.append(data)

    return datalist

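BeautifulSoup
bs4 is a powerful library for extracting data from HTML and XML files. It turns a complex HTML structure into a tree (an element tree), which makes it easy to search, traverse, and modify the page content.

"html.parser": this is the parser BeautifulSoup uses to parse the HTML document. Here it refers to the standard HTML parser that ships with Python. BeautifulSoup can also work with third-party parsers such as lxml.

Use bs4 and re to filter out the information we need.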
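
To see what the regular expressions pull out of one entry, they can be run against a small HTML fragment. The fragment below is hypothetical and heavily simplified, not the real page markup, and the lowercase names are local to this sketch; the patterns are the ones defined in the article's code. Only the re part is shown, since bs4's job in the article is to split the page into one such fragment per film.

```python
import re

find_link = re.compile(r'<a href="(.*?)">')
find_imgsrc = re.compile(r'<img[^>]+src=["\']([^"\']+)["\']')
find_name = re.compile(r'<span class="title">(.*)</span>')
find_rating = re.compile(r'<span class="rating_num" property="v:average">(.*)</span>')
find_comment = re.compile(r'<span class="inq">(.*?)</span>')

# Hypothetical stand-in for str(item) of one <div class="item">.
sample_item = """
<div class="item">
  <a href="https://example.com/subject/1/">
    <img alt="Example" src="https://example.com/poster.jpg">
  </a>
  <span class="title">Example Movie</span>
  <span class="title"> / Another Title</span>
  <span class="rating_num" property="v:average">9.7</span>
  <span class="inq">A hypothetical one-line review。</span>
</div>
"""

# findall always returns a list, so [0] picks the first match.
record = {
    "name": find_name.findall(sample_item)[0],
    "link": find_link.findall(sample_item)[0],
    "img": find_imgsrc.findall(sample_item)[0],
    "rating": find_rating.findall(sample_item)[0],
    "comment": find_comment.findall(sample_item)[0].replace("。", ""),
}
print(record)
```

Because "." does not match a newline by default, the greedy (.*) in the title pattern stops at the end of its own line, which is why the first of the two title spans comes back cleanly.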

3. Save the data to an Excel sheet

This requires the xlwt library.

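def savedata(datalist, savepath):
    book = xlwt.Workbook(encoding="utf-8", style_compression=0)   # style compression, set to 0
    sheet = book.add_sheet('top250', cell_overwrite_ok=True)  # allow a cell to be overwritten by a later write
    col = ('Chinese title', 'Detail page link', 'Image link', 'Rating', 'Featured review')
    for i in range(0, len(col)):
        sheet.write(0, i, col[i])   # column headers
    # Iterate over what was actually scraped instead of assuming 250 rows,
    # so a failed page does not cause an IndexError
    for i, data in enumerate(datalist):
        for j in range(0, len(col)):
            sheet.write(i + 1, j, data[j])

    book.save(savepath)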
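
The layout savedata produces is simple: the headers go on row 0 and the n-th scraped record goes on row n+1. As a sketch of just that indexing, without xlwt, the hypothetical generator table_cells below yields (row, col, value) triples; in the article's code each triple would correspond to one sheet.write(row, col, value) call. The sample rows are made up.

```python
def table_cells(header, rows):
    """Yield (row, col, value): header on row 0, data from row 1 onward."""
    for col, title in enumerate(header):
        yield 0, col, title
    for row, data in enumerate(rows, start=1):
        for col, value in enumerate(data):
            yield row, col, value


header = ("Title", "Rating")
rows = [["Example Movie", "9.7"], ["Another Movie", "9.5"]]
cells = list(table_cells(header, rows))
for cell in cells:
    print(cell)
```

Driving the loop from the data itself means the number of rows written always matches the number of records scraped, whether or not every page was fetched successfully.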
