Python Web Scraping: Study Notes 2

Posted by wind风语 on 2024-04-10

PS: I have been in the hospital for nearly a month, so this post was written by a teammate.

Task

Fetch content from the Douban website

Fetching a single page

URL: https://movie.douban.com/top250

Fetching the page source

Code:

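import requests

url = "https://movie.douban.com/top250"

# Send a browser-like User-Agent, since the site may reject the default one sent by requests
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:123.0) Gecko/20100101 Firefox/123.0"}
response = requests.get(url, headers=headers)
# response.text is the decoded HTML source of the page
print(response.text)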

To make the result easier to inspect, save the fetched source into a new HTML file and open it in a browser.
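A minimal sketch of that step, assuming the page source is already held in a string. The helper name `save_html` and the file name `douban_sample.html` are illustrative and do not come from the original notes; in the script above the string would be `response.text`.

```python
from pathlib import Path


def save_html(text: str, path: str = "douban.html") -> Path:
    """Write fetched page source to a local file and return its path."""
    target = Path(path)
    # Write as UTF-8 so the Chinese titles survive on every platform.
    target.write_text(text, encoding="utf-8")
    return target


# Stand-in string so the example runs without a network request.
sample = '<html><body><span class="title">肖申克的救赎</span></body></html>'
saved = save_html(sample, "douban_sample.html")
print(saved.read_text(encoding="utf-8") == sample)
```

Opening the saved file in a browser shows the same markup the regular expressions later have to match.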

Getting the movie titles on the first page

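Movie titles are contained in the tag matched by

<span class="title">([^&nbsp].*?)</span>

so a regular expression is needed to pull them out of the page source. Alternate titles use the same tag but begin with &nbsp;, which is why the pattern rejects them.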

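import requests
import re

url = "https://movie.douban.com/top250"

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:123.0) Gecko/20100101 Firefox/123.0"}
response = requests.get(url, headers=headers)

# Get the movie titles
movieName = re.findall('<span class="title">([^&nbsp].*?)</span>', response.text)
print(movieName)
# Inside [], ^ means "anything except".
# [^&nbsp] is a character class: it rejects a first character that is any one of
# &, n, b, s, p, which is enough to skip the alternate titles starting with &nbsp;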
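Because `[^&nbsp]` tests single characters rather than the whole string `&nbsp;`, it would also drop a title that happens to start with `n`, `b`, `s` or `p`. The sketch below compares it with a negative lookahead, `(?!&nbsp;)`, on a hand-written fragment. The fragment and the title `seven` are made up for illustration, and the lookahead is a suggested alternative, not what the original notes use.

```python
import re

# Hypothetical fragment shaped like Top 250 title spans.
sample_html = (
    '<span class="title">肖申克的救赎</span>'
    '<span class="title">&nbsp;/&nbsp;The Shawshank Redemption</span>'
    '<span class="title">seven</span>'
)

# Pattern from the notes: the first character must not be one of & n b s p.
char_class = re.findall('<span class="title">([^&nbsp].*?)</span>', sample_html)

# Alternative: reject only spans whose text starts with the literal "&nbsp;".
lookahead = re.findall(r'<span class="title">(?!&nbsp;)(.*?)</span>', sample_html)

print(char_class)  # ['肖申克的救赎']
print(lookahead)   # ['肖申克的救赎', 'seven']
```

For the Chinese primary titles on the list both patterns behave the same; the difference only matters for titles that begin with one of those five characters.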

Pairing movie titles with ratings
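One way to do the pairing is to extract both lists and combine them with `zip`. The sketch below runs on a hand-written fragment instead of the live page, and the rating values in it are placeholders. The rating pattern is the one used in the multi-page script later in these notes.

```python
import re

# Hypothetical fragment: two entries, each with a title span and a rating span.
sample_html = (
    '<span class="title">肖申克的救赎</span>'
    '<span class="rating_num" property="v:average">9.7</span>'
    '<span class="title">霸王别姬</span>'
    '<span class="rating_num" property="v:average">9.6</span>'
)

names = re.findall('<span class="title">([^&nbsp].*?)</span>', sample_html)
scores = re.findall('<span class="rating_num" property="v:average">(.*?)</span>', sample_html)

# zip pairs items by position: first title with first rating, and so on.
pairs = list(zip(names, scores))
for name, score in pairs:
    print(name, score)
```

The pairing relies on both lists being in page order and having the same length, so it is worth comparing `len(names)` and `len(scores)` when running against the real page.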

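Fetching multiple pages

Building the URL

First page: https://movie.douban.com/top250

Second page: https://movie.douban.com/top250?start=25&filter=

Third page: https://movie.douban.com/top250?start=50&filter=

...

Last page: https://movie.douban.com/top250?start=225&filter=

The pattern is clear: each page shows 25 movies, and start is the offset of the first movie on that page.

"https://movie.douban.com/top250?start=" + (a multiple of 25) + "&filter="

So the URL can be built like this: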

url="https://movie.douban.com/top250?start=" + str(i) + "&filter="
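As a quick check of that pattern, the sketch below builds every page URL up front without sending any request. The constant names `PAGE_SIZE` and `TOTAL` are illustrative; with 250 movies at 25 per page there are ten pages, so the last offset is 225.

```python
PAGE_SIZE = 25   # movies shown per page
TOTAL = 250      # movies in the whole list

# range(0, 250, 25) yields 0, 25, ..., 225: one offset per page.
page_urls = [
    f"https://movie.douban.com/top250?start={start}&filter="
    for start in range(0, TOTAL, PAGE_SIZE)
]

print(len(page_urls))
print(page_urls[0])
print(page_urls[-1])
```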

import requests
import re

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0"}
movieName = []
score = []
# Ten pages: start = 0, 25, ..., 225
for i in range(0, 250, 25):
    url = "https://movie.douban.com/top250?start=" + str(i) + "&filter="

    response = requests.get(url, headers=headers)

    # Collect the titles and ratings of this page
    movieName += re.findall('<span class="title">([^&nbsp].*?)</span>', response.text)
    score += re.findall('<span class="rating_num" property="v:average">(.*?)</span>', response.text)

print(movieName)
print(score)

# Build (rank, title, rating) tuples; zip stops at the shorter list,
# so a blocked or incomplete page cannot raise IndexError
l = []
for rank, (name, rating) in enumerate(zip(movieName, score), start=1):
    l.append((rank, name, rating))
print(l)

for item in l:
    print(item)

Downloading images from the page

First, find the tag that holds the image.
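The sketch below tries the image pattern on a hand-written `<img>` tag before using it on the live page. The tag layout (an `alt` attribute followed by `src` and an empty `class`) is an assumption about how the poster tags look, and the URL is a placeholder.

```python
import re

# Assumed shape of one poster tag; the image URL is a placeholder.
sample_html = (
    '<img width="100" alt="肖申克的救赎" '
    'src="https://img.example.com/poster1.jpg" class="">'
)

# Capture whatever sits between src=" and the trailing " class="">.
img_urls = re.findall('src="(.*?)" class="">', sample_html)

# If the tag carries the title in alt, it can be read the same way.
alt_names = re.findall('alt="(.*?)" src=', sample_html)

print(img_urls)
print(alt_names)
```

Anchoring the pattern on `class="">` keeps it from matching other `src` attributes, such as scripts, as long as only the poster images end that way.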

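import os
import re
import requests

url = "https://movie.douban.com/top250?start=0&filter="

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0"}
response = requests.get(url, headers=headers)

movieName = re.findall('<span class="title">([^&nbsp].*?)</span>', response.text)

# The poster <img> tags end with class="", so capture the src value just before it
imgurl = re.findall('src="(.*?)" class="">', response.text)

# Create the output folder first; open() does not create missing directories
os.makedirs("./images", exist_ok=True)

# zip pairs each title with its poster and avoids IndexError on a short page
for name, img in zip(movieName, imgurl):
    imgres = requests.get(img, headers=headers)
    filename = "./images/" + name + ".jpg"
    # Images are binary data, so write response.content in "wb" mode
    with open(filename, mode="wb") as f:
        f.write(imgres.content)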
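Since the movie title is used directly as the file name, a title containing a character such as `/` or `:` would make `open` fail or write to an unexpected path. The helper below is a hypothetical addition, not part of the original script, that replaces such characters before the name is used.

```python
import re


def safe_filename(title: str, ext: str = ".jpg") -> str:
    """Turn a movie title into a file name that common file systems accept."""
    # Replace characters that Windows or POSIX paths treat specially.
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", title).strip()
    # Fall back to a fixed name if nothing usable is left.
    return (cleaned or "untitled") + ext


print(safe_filename("肖申克的救赎"))
print(safe_filename("a/b:c"))
```

In the download loop, the file name would then be built as `"./images/" + safe_filename(title)`, where `title` stands for the current movie title.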

Note:

Do not fetch images too frequently. If possible, add a suitable delay between requests.
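A minimal sketch of such a delay, assuming a random pause of a few seconds is acceptable. The function name `fetch_politely` and its parameters are illustrative; the demonstration passes stand-ins for the download and sleep functions so that it runs instantly and downloads nothing.

```python
import random
import time


def fetch_politely(urls, fetch, min_delay=1.0, max_delay=3.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a random interval between calls."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one.
            sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results


# Demonstration with stand-ins: str.upper plays the role of the download,
# and pauses.append records the delays instead of actually sleeping.
pauses = []
demo = fetch_politely(
    ["u1", "u2", "u3"],
    fetch=str.upper,
    min_delay=0.1,
    max_delay=0.2,
    sleep=pauses.append,
)
print(demo, len(pauses))
```

In the image script, `fetch` could be something like `lambda u: requests.get(u, headers=headers)`, leaving `sleep` at its default so the pauses really happen.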
