Python in Practice - Section 4: How to Fetch Dynamic Data from a Page

Posted by weixin_34120274 on 2016-11-01

Notes

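  • Watch the network traffic while the page loads its dynamic data, find the pattern in the requests that load more data, then construct matching requests yourself to obtain the responses.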
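The idea in the note can be sketched with the standard library alone. This is a minimal sketch, assuming the "load more" request differs only in a `page` query parameter (as in the homework URL) and that thumbnails carry the class `entry-thumbnail`. The names `build_page_url`, `ThumbnailCollector`, `extract_thumbnails` and `fetch_page` are made up for illustration, and the sample markup is hand-written, not captured from the site.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Listing URL taken from the homework; the page number is the only varying part.
BASE_URL = "http://weheartit.com/inspirations/taylorswift"


def build_page_url(page):
    """Return the URL that requests one page of results."""
    return BASE_URL + "?" + urlencode({"page": page})


class ThumbnailCollector(HTMLParser):
    """Collect the src of every <img> whose class list contains 'entry-thumbnail'."""

    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if "entry-thumbnail" in classes and attr.get("src"):
            self.srcs.append(attr["src"])


def extract_thumbnails(html):
    """Parse an HTML string and return the thumbnail URLs found in it."""
    collector = ThumbnailCollector()
    collector.feed(html)
    return collector.srcs


def fetch_page(page, user_agent="Mozilla/5.0"):
    """Download one page of results; needs network access, so it is not called here."""
    request = Request(build_page_url(page), headers={"User-Agent": user_agent})
    with urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


# Offline check on a hand-written fragment (hypothetical markup).
sample = '<a href="#"><img class="entry-thumbnail" src="http://example.com/images/1/superthumb.jpg"></a>'
print(build_page_url(2))
print(extract_thumbnails(sample))
```

With network access, `extract_thumbnails(fetch_page(n))` for increasing `n` would imitate what the browser does when it loads more results, provided the site still answers that URL pattern.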

Homework

  • Code:
from bs4 import BeautifulSoup
import requests
import urllib.request
import os

# Page URLs to crawl; range(1, 2) yields only page 1, widen the range to fetch more pages.
urls = ['http://weheartit.com/inspirations/taylorswift?page={}'.format(i) for i in range(1, 2)]
# Candidate proxies, currently unused:
# proxies = {"http": "122.96.59.99:3128"}
# proxies = {"http": "121.69.29.162:8118"}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'
}
base_path = 'F:\\workspace-python\\hw_02\\img_dl'


def download_img(img_url):
    # Name the file after the image ID (second-to-last path segment) plus the original extension.
    file_name = img_url.split("/")[-2] + "." + img_url.split(".")[-1]
    target = os.path.join(base_path, file_name)

    print('%s ==> %s' % (img_url, target))
    # The download itself is disabled: it raised urllib.error.URLError (WinError 10013),
    # as described at the end of this post.
    # urllib.request.urlretrieve(img_url, target)


def process_dynamic_page(url):
    web_data = requests.get(url, headers=headers)
    if web_data.status_code != 200:
        # Report the unexpected status code and skip this page.
        print(web_data.status_code)
        return

    soup = BeautifulSoup(web_data.text, 'lxml')
    web_data.close()

    # Each thumbnail is an <img class="entry-thumbnail"> nested inside a link.
    images = soup.select('div > div > div > a > img[class="entry-thumbnail"]')
    for image in images:
        img_url = image.get('src')
        download_img(img_url)


for url in urls:
    process_dynamic_page(url)

  • Output (partial):
"D:\Program Files\Python35\python.exe" F:/workspace-python/hw_02/hw_04.py
http://data.whicdn.com/images/201685162/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\201685162.jpg
http://data.whicdn.com/images/261819708/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\261819708.jpg
http://data.whicdn.com/images/262877209/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\262877209.jpg
http://data.whicdn.com/images/225569474/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\225569474.jpg
http://data.whicdn.com/images/264736360/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\264736360.jpg
http://data.whicdn.com/images/262204064/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\262204064.jpg
http://data.whicdn.com/images/254688840/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\254688840.jpg
http://data.whicdn.com/images/258279435/superthumb.png ==> F:\workspace-python\hw_02\img_dl\258279435.png
http://data.whicdn.com/images/261497975/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\261497975.jpg
http://data.whicdn.com/images/264710374/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\264710374.jpg
http://data.whicdn.com/images/264713023/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\264713023.jpg
http://data.whicdn.com/images/264706335/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\264706335.jpg
http://data.whicdn.com/images/264721633/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\264721633.jpg
http://data.whicdn.com/images/264721658/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\264721658.jpg
http://data.whicdn.com/images/264721683/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\264721683.jpg
http://data.whicdn.com/images/206651826/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\206651826.jpg
http://data.whicdn.com/images/264711782/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\264711782.jpg
http://data.whicdn.com/images/264715635/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\264715635.jpg
http://data.whicdn.com/images/264710414/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\264710414.jpg
http://data.whicdn.com/images/264697940/superthumb.png ==> F:\workspace-python\hw_02\img_dl\264697940.png
http://data.whicdn.com/images/264697906/superthumb.gif ==> F:\workspace-python\hw_02\img_dl\264697906.gif
http://data.whicdn.com/images/264705727/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\264705727.jpg
http://data.whicdn.com/images/264703283/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\264703283.jpg
http://data.whicdn.com/images/264703286/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\264703286.jpg
http://data.whicdn.com/images/261104252/superthumb.gif ==> F:\workspace-python\hw_02\img_dl\261104252.gif
http://data.whicdn.com/images/264695862/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\264695862.jpg
http://data.whicdn.com/images/264695929/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\264695929.jpg
http://data.whicdn.com/images/264695960/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\264695960.jpg
http://data.whicdn.com/images/173728739/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\173728739.jpg
http://data.whicdn.com/images/197006986/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\197006986.jpg
http://data.whicdn.com/images/264674428/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\264674428.jpg
http://data.whicdn.com/images/264579949/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\264579949.jpg
http://data.whicdn.com/images/264631087/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\264631087.jpg
http://data.whicdn.com/images/264644105/superthumb.png ==> F:\workspace-python\hw_02\img_dl\264644105.png
http://data.whicdn.com/images/264628123/superthumb.png ==> F:\workspace-python\hw_02\img_dl\264628123.png
http://data.whicdn.com/images/264634842/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\264634842.jpg
http://data.whicdn.com/images/259844486/superthumb.jpg ==> F:\workspace-python\hw_02\img_dl\259844486.jpg
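  • Open issues:
  • Downloading the images fails with “urllib.error.URLError: <urlopen error [WinError 10013] An attempt was made to access a socket in a way forbidden by its access permissions>”; see the discussion thread: http://study.163.com/forum/detail/1002726062.htm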
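While the cause of that error is unresolved, one option is to make each download fail softly so the crawl can continue and report which images were skipped. The following is only a sketch of that idea and does not fix WinError 10013 itself. The helper name `safe_download` and its `retrieve` parameter are made up for illustration; the parameter exists so the helper can be exercised offline with a stand-in for `urllib.request.urlretrieve`.

```python
import urllib.request
from urllib.error import URLError


def safe_download(img_url, target, retrieve=urllib.request.urlretrieve):
    """Try to save img_url to target.

    Returns None on success, or a short error description on failure,
    so the caller can log the failure and move on to the next image.
    """
    try:
        retrieve(img_url, target)
    except OSError as exc:  # URLError is a subclass of OSError
        return "%s: %s" % (type(exc).__name__, exc)
    return None


# Offline demonstration with stand-ins for urlretrieve (no network access needed).
def always_fails(url, filename):
    raise URLError("simulated failure")


def always_succeeds(url, filename):
    return filename, None


failed = safe_download("http://example.com/images/1/superthumb.jpg", "1.jpg", retrieve=always_fails)
succeeded = safe_download("http://example.com/images/1/superthumb.jpg", "1.jpg", retrieve=always_succeeds)
print(failed)
print(succeeded)
```

In the homework script, `download_img` could call such a helper in place of the disabled `urlretrieve` line and print the returned message whenever it is not `None`.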
