Python Web Scraping: Downloading Images from a Website

Posted by BaiXuePrincess on 2020-11-19

Introduction

This post uses requests to fetch the page and Beautiful Soup to parse out the image URLs.

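Steps

1. Open the developer tools (F12), switch to the Network tab, and click the "Load more…" button on the page. The URL that the button requests then shows up in the Network panel.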
Now we can request that page with requests:

import requests
# The cookies and headers values are omitted here
cookies = {}
headers = {}
params = {'page': '2'}

# This is a GET request; with requests.get, query parameters are passed as params=<dict>
response = requests.get('https://github.com/topics', headers=headers, params=params, cookies=cookies)

print(response.text)
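
To make the role of `params` concrete: requests appends the dictionary to the URL as a query string, so `{'page': '2'}` ends up as `?page=2`. The sketch below reproduces that URL with the standard library only, so it runs without a network connection. The helper name `page_url` and the page range 1 to 3 are assumptions made for illustration; they are not part of the original post.

```python
from urllib.parse import urlencode

BASE_URL = "https://github.com/topics"

def page_url(page):
    """Build the URL that a GET request with params={'page': page} would target."""
    # urlencode turns {'page': '2'} into the query string 'page=2'
    return BASE_URL + "?" + urlencode({"page": str(page)})

# Hypothetical range of pages; the original post only requests page 2.
page_urls = [page_url(n) for n in range(1, 4)]
for url in page_urls:
    print(url)
```

Each of these URLs corresponds to one click on "Load more…", which is why changing the `page` value is enough to fetch a different batch of entries.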

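2. Click the element picker (the small arrow icon in the developer tools), then click one of the images on the page to see its address in the HTML.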
Parse the response with the BeautifulSoup module to extract the image addresses:

from bs4 import BeautifulSoup

soup = BeautifulSoup(response.content, "lxml")
# Each entry is an <li> inside the <ul class="list-style-none"> list
pngs = soup.find("ul", {"class": "list-style-none"}).find_all("li", {"class": "py-4 border-bottom"})
print(len(pngs))
for each in pngs:
    png_tag = each.find("img", {"class": "rounded-1 mr-3"})
    if not png_tag:
        # This entry has no image
        png_url = ""
    else:
        png_url = png_tag.get("src")
        print(png_url)
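
The same extraction can be tried offline on a small HTML fragment. In the sketch below the markup is made up to imitate the structure described above (the `example.com` URLs are placeholders), and it uses the built-in `html.parser` backend instead of `lxml` so that no extra parser is needed. It also swaps `find`/`find_all` for CSS selectors via `select`, which is a different technique from the one in the post: a filter such as `{"class": "py-4 border-bottom"}` matches the class attribute as an exact string, while a selector like `li.py-4.border-bottom` matches regardless of class order.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the list structure targeted in the post.
SAMPLE_HTML = """
<ul class="list-style-none">
  <li class="py-4 border-bottom">
    <img class="rounded-1 mr-3" src="https://example.com/img/python.png">
  </li>
  <li class="border-bottom py-4">
    <img class="mr-3 rounded-1" src="https://example.com/img/rust.png">
  </li>
  <li class="py-4 border-bottom">no image in this entry</li>
</ul>
"""

def extract_image_urls(html):
    """Return the src of every matching <img>, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    # CSS selectors match each class separately, so class order does not matter
    for img in soup.select("ul.list-style-none li.py-4.border-bottom img.rounded-1.mr-3"):
        src = img.get("src")
        if src:  # skip <img> tags without a src attribute
            urls.append(src)
    return urls

image_urls = extract_image_urls(SAMPLE_HTML)
print(image_urls)
```

Entries without an image are skipped because the selector finds no `<img>` inside them, which replaces the explicit `if not png_tag` check.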

3. Once you have an image address, you can save the image locally.
Here I save each image under its original file name:

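import urllib.request
# The last path segment of the URL is used as the file name
filename = png_url.split('/')[-1]
print(filename)
# The target directory (E://images/) must already exist
urllib.request.urlretrieve(png_url, 'E://images/'+filename)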
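
Splitting on `/` works for simple addresses, but it keeps any query string (for example `python.png?raw=true`) in the file name. The following is a minimal sketch of a more defensive variant that uses only the standard library. The helper names `filename_from_url` and `target_path`, the fallback name `image.png`, and the temporary demo directory are all assumptions for illustration, not part of the original post.

```python
import os
import tempfile
from urllib.parse import urlsplit, unquote

def filename_from_url(url, default="image.png"):
    """Take the last path segment of a URL, ignoring ?query and #fragment."""
    path = urlsplit(url).path
    name = os.path.basename(unquote(path))
    # URLs that end in '/' have no file name; fall back to a default
    return name or default

def target_path(url, directory):
    """Create the directory if needed and return the full local path."""
    os.makedirs(directory, exist_ok=True)
    return os.path.join(directory, filename_from_url(url))

# Demo in a temporary directory instead of the E://images/ folder from the post.
demo_dir = os.path.join(tempfile.gettempdir(), "images_demo")
demo_path = target_path("https://example.com/topics/python/python.png?raw=true", demo_dir)
print(demo_path)
```

The returned path can be passed as the second argument of `urllib.request.urlretrieve` in place of the hard-coded string concatenation.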

4. The complete code:

import requests
from bs4 import BeautifulSoup
import urllib.request

def main():
    cookies = {}
    headers = {}
    params = {'page': '2'}

    response = requests.get('https://github.com/topics', headers=headers, params=params, cookies=cookies)

    soup = BeautifulSoup(response.content, "lxml")
    pngs = soup.find("ul", {"class": "list-style-none"}).find_all("li", {"class": "py-4 border-bottom"})
    print(len(pngs))
    for each in pngs:
        png_tag = each.find("img", {"class": "rounded-1 mr-3"})
        if not png_tag:
            png_url = ""
        else:
            png_url = png_tag.get("src")
            print(png_url)
            filename = png_url.split('/')[-1]
            print(filename)
            urllib.request.urlretrieve(png_url, 'E://images/'+filename)
            # response = requests.get(png_url, stream=True)
            # with open('E://images/'+filename, 'wb') as fd:
            #     fd.write(response.content)
            #     print(filename + "download success")

if __name__ == '__main__':
    main()
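
The commented-out lines in the listing point to an alternative: downloading the image with requests itself rather than `urllib.request.urlretrieve`. One way this might look is sketched below. The writing step is separated into a hypothetical helper, `save_chunks`, so it can be demonstrated offline with fake data; the requests calls appear only in a comment, and the timeout and chunk size there are assumed values.

```python
import os
import tempfile

def save_chunks(chunks, path):
    """Write an iterable of bytes chunks to path and return the bytes written."""
    written = 0
    with open(path, "wb") as fd:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                fd.write(chunk)
                written += len(chunk)
    return written

# With requests, the chunks would come from a streamed response, e.g.:
#   resp = requests.get(png_url, stream=True, timeout=10)
#   resp.raise_for_status()
#   save_chunks(resp.iter_content(chunk_size=8192), 'E://images/' + filename)

# Offline demonstration: fake chunks stand in for a response body.
demo_file = os.path.join(tempfile.gettempdir(), "save_chunks_demo.bin")
demo_written = save_chunks([b"\x89PNG", b"", b"fake-body"], demo_file)
print(demo_written)
```

Writing chunk by chunk keeps memory use flat for large files, whereas `response.content` loads the whole body at once even when `stream=True` is set.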
