Python Web Scraping (2)

Posted by Menq on 2024-03-24

Original URL: https://www.cnblogs.com/Menq/p/18088196

Python web scraping

Notes for this section
Scraping content from the Douban website

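Recording the file creation time
Go to File → Settings → Editor → File and Code Templates and select Python Script.
In the template box, enter:
# Date: ${DATE}
# File: ${NAME}

Every new Python file now starts with its creation date and file name.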

I. Scraping a single page

1. Getting the movie titles

Page URL: https://movie.douban.com/top250

First, fetch the page content.

import requests

url="https://movie.douban.com/top250"

headers={"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0"}
response=requests.get(url,headers=headers)
print(response.text)

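To make the page source easier to inspect, create a new HTML file and paste the fetched page into it.

Drag the HTML1 file to the other side of the editor so that it sits next to the script and is easy to consult.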
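Copying and pasting works, but the same file can also be written from the script. The sketch below is one possible way to do that, not the author's method; the helper name `save_html` and the file name `page.html` are made up for illustration, and the sample string stands in for `response.text`.

```python
import tempfile
from pathlib import Path


def save_html(text: str, path: str) -> Path:
    """Write the fetched page source to an HTML file and return its path."""
    target = Path(path)
    # Write UTF-8 explicitly so Chinese text survives whatever the OS default is.
    target.write_text(text, encoding="utf-8")
    return target


# Stand-in for response.text from the request above.
sample = '<html><body><span class="title">肖申克的救赎</span></body></html>'

# A temporary directory keeps the demo from touching the project folder.
with tempfile.TemporaryDirectory() as folder:
    saved = save_html(sample, str(Path(folder) / "page.html"))
    print(saved.read_text(encoding="utf-8") == sample)
```

In the real script the call would be something like `save_html(response.text, "page.html")`, after which the file can be opened in the editor.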

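Import the regular expression module (re). Find the HTML that contains a movie title, copy it, and replace the title in the middle with the non-greedy capture group "(.*?)", which matches any characters. Running this returns the movie titles, but the result also contains the alternative and English titles, which come with &nbsp; space entities.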

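We only want the Chinese titles, so we use the regular expression syntax for "anything except".

[^ ] is a negated character class: it matches one character that is not in the list. For example, [^&nbsp] matches one character that is none of &, n, b, s, p. It does not mean "anything except the string &nbsp", but it is enough here because the unwanted spans start with &.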
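To see the difference between excluding single characters and excluding a whole string, the following sketch runs both patterns on a small hand-written snippet. The snippet is a simplified stand-in for the Douban markup, the "spam" title is hypothetical, and the negative lookahead `(?!&nbsp;)` is an alternative shown for comparison, not what the notes use.

```python
import re

# Simplified stand-in for the page: a Chinese title span is followed by an
# alternative-title span that begins with the &nbsp; entity.
html = (
    '<span class="title">肖申克的救赎</span>'
    '<span class="title">&nbsp;/&nbsp;The Shawshank Redemption</span>'
    '<span class="title">霸王别姬</span>'
    '<span class="title">spam</span>'  # hypothetical title starting with "s"
)

# Negated character class: the first character must not be &, n, b, s or p.
by_class = re.findall(r'<span class="title">([^&nbsp].*?)</span>', html)

# Negative lookahead: the text must not start with the literal string "&nbsp;".
by_lookahead = re.findall(r'<span class="title">(?!&nbsp;)(.*?)</span>', html)

print(by_class)      # the title starting with "s" is dropped as well
print(by_lookahead)  # only the span starting with "&nbsp;" is dropped
```

Both patterns skip the English title. Only the character class also loses a title that happens to start with one of the listed letters, which is worth knowing if the pattern is reused on other pages.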

# Date: 2024/3/22
# File: spider1
import requests
import re

url = "https://movie.douban.com/top250"

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0"}
response = requests.get(url, headers=headers)

# Parse the data
# Get the movie titles (spans whose text starts with & are skipped)
movieName = re.findall('<span class="title">([^&nbsp].*?)</span>', response.text)

print(movieName)

2. Getting the movie titles and scores

Find the HTML that contains the score, copy it, and replace the value in the same way as above.

# Date: 2024/3/22
# File: spider1
import requests
import re

url = "https://movie.douban.com/top250"

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0"}
response = requests.get(url, headers=headers)

# Parse the data
# Get the movie titles
movieName = re.findall('<span class="title">([^&nbsp].*?)</span>', response.text)

# Get the movie scores
score = re.findall('<span class="rating_num" property="v:average">(.*?)</span>', response.text)

print(movieName)
print(score)

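Ways to print the results:

(1) Print the two lists separately

(2) Print them as a list of tuples

The page holds 25 movies, so the loop uses range(25). range(25) yields 0 to 24 and excludes 25, but the code stores i+1, so the ranks run from 1 to 25.

(3) Print the tuples one per line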

# Date: 2024/3/22
# File: spider1
import requests
import re

url = "https://movie.douban.com/top250"

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0"}
response = requests.get(url, headers=headers)

# Parse the data
# Get the movie titles
movieName = re.findall('<span class="title">([^&nbsp].*?)</span>', response.text)

# Get the movie scores
score = re.findall('<span class="rating_num" property="v:average">(.*?)</span>', response.text)

#print(movieName)
#print(score)

l = []         # create an empty list that will hold the tuples
# Build the list of tuples
for i in range(25):
    l.append((i + 1, movieName[i], score[i]))  # (rank, title, score)
print(l)
# Print one tuple per line
for i in l:
    print(i)
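The loop above relies on both lists having exactly 25 entries. As an alternative sketch (not the author's code), `zip` pairs the two lists and `enumerate(..., start=1)` supplies the rank, so no length has to be hard-coded. The function name `number_rows` and the sample values are made up for illustration.

```python
def number_rows(names, ratings):
    """Pair titles with scores and number the pairs starting from 1."""
    # zip stops at the shorter list, so a missing score cannot raise IndexError.
    return [
        (rank, name, rating)
        for rank, (name, rating) in enumerate(zip(names, ratings), start=1)
    ]


# Illustrative sample data standing in for movieName and score.
sample_names = ["肖申克的救赎", "霸王别姬", "阿甘正传"]
sample_scores = ["9.7", "9.6", "9.5"]

rows = number_rows(sample_names, sample_scores)
for row in rows:
    print(row)
```

The trade-off is that `zip` silently drops unmatched items, so a length check is still useful when the two lists are expected to line up.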

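II. Scraping multiple pages

Only a few places in the single-page version need to change: the URL of each page and the way the requested data is collected.
Create empty lists for the movie titles and scores, then construct the URL.
First, look at the URLs of a few pages:
Page 2

Page 3

Page 25

Page 1

From these, every page follows the pattern "https://movie.douban.com/top250?start=N&filter=", where N starts at 0 and increases by 25 each time.
Construct the URL: url="https://movie.douban.com/top250?start=" + str(i) + "&filter="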
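The URL pattern can be checked on its own before any request is sent. This is a small sketch under the assumption stated above (10 pages of 25 movies each); the helper name `page_urls` is hypothetical.

```python
BASE = "https://movie.douban.com/top250"


def page_urls(total=250, per_page=25):
    """Return one URL per page; start takes the values 0, 25, 50, ..."""
    return [
        BASE + "?start=" + str(start) + "&filter="
        for start in range(0, total, per_page)
    ]


urls = page_urls()
print(len(urls))   # number of pages
print(urls[0])     # first page, start=0
print(urls[-1])    # last page, start=225
```

Because range(0, 250, 25) excludes 250, the last value of start is 225, which is the page holding movies 226 to 250.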

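Change the data collection so that it accumulates inside the loop: each page's results are appended to the existing list, in the same way that i = i + 1 adds to i (for example, movieName += re.findall(...)).

Because this time the request returns 250 items, change the end value of the output loop to 250.

Run: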

import requests
import re

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0"}
movieName = []
score = []
# Construct the URL of each page
for i in range(0, 250, 25):   # range(start, stop, step)
    # str(i) turns the number into a string so that "+" can concatenate it
    url = "https://movie.douban.com/top250?start=" + str(i) + "&filter="

    # Send the request
    response = requests.get(url, headers=headers)

    # Parse the data
    # Movie titles (Chinese)
    movieName += re.findall('<span class="title">([^&nbsp].*?)</span>', response.text)

    # Movie scores (same effect as +=)
    score = score + re.findall('<span class="rating_num" property="v:average">(.*?)</span>', response.text)
# Print the plain lists
print(movieName)
print(score)

l = []
# Build the list of tuples
for i in range(250):
    l.append((i + 1, movieName[i], score[i]))
print(l)
# Print one tuple per line
for i in l:
    print(i)
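If a request is blocked or a page changes, the lists can end up shorter than 250 and the range(250) loop raises IndexError. The sketch below is one way to make that failure explicit and to try the logic without network access. It is not the author's code: `scrape_all`, the `fetch` parameter and `fake_fetch` are hypothetical names, and the fake pages are a simplified stand-in for the real markup.

```python
import re

TITLE_RE = re.compile(r'<span class="title">([^&nbsp].*?)</span>')
SCORE_RE = re.compile(r'<span class="rating_num" property="v:average">(.*?)</span>')


def scrape_all(fetch, total=250, per_page=25):
    """Collect (rank, title, score) tuples; fetch maps a URL to its HTML text."""
    names, scores = [], []
    for start in range(0, total, per_page):
        url = "https://movie.douban.com/top250?start=" + str(start) + "&filter="
        html = fetch(url)
        # Accumulate each page's results in the running lists.
        names += TITLE_RE.findall(html)
        scores += SCORE_RE.findall(html)
    if len(names) != len(scores):
        raise ValueError(f"got {len(names)} titles but {len(scores)} scores")
    return [(i + 1, n, s) for i, (n, s) in enumerate(zip(names, scores))]


def fake_fetch(url):
    """Return a tiny fake page so the loop can run offline."""
    start = url.split("start=")[1].split("&")[0]
    return (
        f'<span class="title">Movie {start}</span>'
        f'<span class="title">&nbsp;/&nbsp;Alt {start}</span>'
        '<span class="rating_num" property="v:average">9.0</span>'
    )


rows = scrape_all(fake_fetch, total=50)
print(rows)
```

For a real run, the fetch argument could be `lambda u: requests.get(u, headers=headers, timeout=10).text`, reusing the `headers` dictionary from the script above.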

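III. Downloading the movie posters and naming them after the movies

To save the images, first create a folder inside the project directory.

Debug and check the result.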
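The notes stop before showing code for this step, so the following is only a sketch of how the file names could be prepared. It assumes each poster appears as an img tag with the movie title in alt followed by the image address in src; that layout, the folder name "images", and the names `IMG_RE` and `poster_targets` are all assumptions, and the sample tag uses a placeholder address.

```python
import os
import re
from pathlib import Path

# Assumed layout: <img ... alt="TITLE" src="URL" ...> with alt before src.
IMG_RE = re.compile(r'<img[^>]*?alt="([^"]+)"[^>]*?src="([^"]+)"')


def poster_targets(html, folder="images"):
    """Return (file path, image URL) pairs, one per poster found in html."""
    targets = []
    for title, src in IMG_RE.findall(html):
        # Replace characters that are not allowed in file names on common systems.
        safe = re.sub(r'[\\/:*?"<>|]', "_", title).strip()
        # Keep the extension from the URL; fall back to .jpg if there is none.
        suffix = os.path.splitext(src.split("?")[0])[1] or ".jpg"
        targets.append((Path(folder) / (safe + suffix), src))
    return targets


sample = '<img width="100" alt="肖申克的救赎" src="https://example.com/p480747492.jpg" class="">'
for path, src in poster_targets(sample):
    print(path.name, src)
```

To actually download, the folder can be created with `Path("images").mkdir(exist_ok=True)`, and each file written with `path.write_bytes(requests.get(src, headers=headers).content)`.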
