Scraping Weather Data with a Python Crawler

Posted by xunkhun on 2018-02-06

Original URL: https://blog.csdn.net/xunkhun/article/details/79266283

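I had been learning Python on and off, and recently I organized what I know more systematically, focusing on a series of courses on Python web crawlers. I also read many blog posts on the topic and tried out the methods from some of them. Combining that with my own learning experience, I wrote a small program that scrapes weather data.

# This program is based on Bo_wen's article 'http://blog.csdn.net/bo_wen_/article/details/50868339'. I found that article's approach easy to follow, so I wrote this post after working through it.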

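Preparation:

1. Set up a suitable Python editor. I use Jupyter Notebook from the Anaconda distribution, which bundles package management and makes code easy to write and organize, so I recommend it. Download: https://www.anaconda.com/download/

2. This article uses the requests library to fetch web pages and the BeautifulSoup4 library to parse them. Neither is part of the Python standard library, so both must be installed manually.

requests download page: https://pypi.python.org/pypi/requests/
Beautiful Soup4 download page: https://pypi.python.org/pypi/beautifulsoup

3. Install both libraries with pip. Press Win+R, type cmd, and click OK to open a command prompt, then enter: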

pip install requests

pip install beautifulsoup4
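
To confirm that both installs succeeded, a quick check from Python can help. The sketch below is not part of the original program, and the helper name check_installed is made up for illustration. It relies on importlib.util.find_spec from the standard library, which reports whether a module can be found without importing it. Note that the beautifulsoup4 package is imported under the name bs4.

```python
import importlib.util


def check_installed(module_names):
    """Return a dict mapping each module name to True if Python can find it."""
    return {name: importlib.util.find_spec(name) is not None
            for name in module_names}


# beautifulsoup4 is imported as "bs4"
status = check_installed(["requests", "bs4"])
for name, found in status.items():
    print(name, "OK" if found else "missing, run pip install again")
```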

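Page analysis:

Now for the main part of the crawler. We will scrape the 7-day forecast for Suzhou from China Weather (weather.com.cn):

http://www.weather.com.cn/weather/101190401.shtml

[Figure: screenshot of the 7-day forecast page]

Next, right-click and choose Inspect Element, or press F12 to open the developer tools, and search for the text "6日" (the 6th). This shows where the fields we need sit in the tag tree.

[Figure: screenshot of the element inspector]

All the fields we need are inside the ul within the div with id="7d". Each day is an li: the date is in its h1 tag, and the weather description is in its first p tag.

The high temperature is in the span tag of the second p tag, and the low temperature is in the i tag of the second p tag. The wind scale is in the i tag of the third p tag.

[Figure: screenshot of the tag structure for one day]

The other days follow the same structure, so we can now write the main program.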
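
To make the tag structure concrete, here is a minimal sketch that parses a hand-written fragment shaped like the description above. The HTML and its values are hypothetical; the real page carries more attributes and more days. Only the nesting (div id="7d" > ul > li > h1 and three p tags) is taken from the analysis.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that mimics the structure described above
SAMPLE_HTML = """
<div id="7d">
  <ul>
    <li>
      <h1>6日（今天）</h1>
      <p>多雲</p>
      <p><span>8</span>/<i>1℃</i></p>
      <p><i>3-4級</i></p>
    </li>
  </ul>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
day = soup.find("div", {"id": "7d"}).find("ul").find("li")
paragraphs = day.find_all("p")

date = day.find("h1").string               # date
weather = paragraphs[0].string             # weather description
high = paragraphs[1].find("span").string   # high temperature
low = paragraphs[1].find("i").string       # low temperature
wind = paragraphs[2].find("i").string      # wind scale

print(date, weather, high, low, wind)
```

Each lookup mirrors one sentence of the analysis, which makes it easy to adjust if the page layout turns out to differ.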

Program structure:

First, import the requests and Beautiful Soup4 libraries:

import requests
from bs4 import BeautifulSoup

Then write the skeleton of the program:

def getHTMLText(url):                              # fetch the page content
    return

def get_data(html):                                # extract the data from the page and store it in a list
    return

def print_data(final_list,num):                    # print the results stored in the list
    return

def main():                                        # main function that ties the others together
    url = ' '
    html = getHTMLText(url)
    final_list = get_data(html)
    print_data(final_list,7)
main()

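Now fill in the first function, getHTMLText():

def getHTMLText(url,timeout = 30):
    try:
        r = requests.get(url, timeout = timeout)  # fetch the page with requests
        r.raise_for_status()                      # raise an HTTPError for 4xx/5xx responses
        r.encoding = r.apparent_encoding          # use the encoding detected from the content
        return r.text
    except requests.RequestException:
        return 'request failed'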

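The get() method of the requests library fetches the page concisely. The try-except block catches request errors, including the HTTPError that raise_for_status() raises for a bad status code, so the function returns an error message instead of crashing. On success, the function returns the page content, r.text.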
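
The role of raise_for_status() can be seen without touching the network. The sketch below builds a requests.Response object by hand, which is done here only for demonstration (normally requests.get() creates it), and checks which status codes trigger an HTTPError. The helper name status_is_ok is hypothetical.

```python
import requests


def status_is_ok(status_code):
    """Return True if raise_for_status() accepts this status code."""
    response = requests.Response()     # built by hand only for demonstration
    response.status_code = status_code
    try:
        response.raise_for_status()    # raises HTTPError for 4xx and 5xx codes
        return True
    except requests.HTTPError:
        return False


for code in (200, 404, 500):
    print(code, status_is_ok(code))
```

Because HTTPError is a subclass of requests.RequestException, a single except clause for RequestException also covers timeouts and connection errors.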

Next, fill in the second function, get_data():

def get_data(html):
    final_list = []
    soup = BeautifulSoup(html,'html.parser')       # parse the page with BeautifulSoup
    data = soup.find('div',{'id':'7d'})
    if data is None:                               # fetch failed or the page layout changed
        return final_list
    ul = data.find('ul')
    lis = ul.find_all('li')

    for day in lis:
        temp_list = []

        date = day.find('h1').string             # date
        temp_list.append(date)

        info = day.find_all('p')                 # all p tags for this day
        temp_list.append(info[0].string)         # weather description

        if info[1].find('span') is None:          # span in the second p tag: high temperature
            temperature_highest = ' '             # the high temperature may be absent
        else:
            temperature_highest = info[1].find('span').string
            temperature_highest = temperature_highest.replace('℃',' ')

        if info[1].find('i') is None:              # i in the second p tag: low temperature
            temperature_lowest = ' '               # the low temperature may be absent
        else:
            temperature_lowest = info[1].find('i').string
            temperature_lowest = temperature_lowest.replace('℃',' ')

        temp_list.append(temperature_highest)       # add the high temperature to temp_list
        temp_list.append(temperature_lowest)        # add the low temperature to temp_list

        wind_scale = info[2].find('i').string      # i in the third p tag: wind scale
        temp_list.append(wind_scale)

        final_list.append(temp_list)              # add this day's list to final_list
    return final_list

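Following the tag tree worked out earlier, we parse the page with Beautiful Soup and use find() to locate the relevant tags. We then loop over the li elements, append each day's text fields to temp_list, append temp_list to final_list, and finally return final_list.

All the data is now stored in final_list. Next, fill in the third function, which prints the data: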
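
One detail worth knowing when reading text this way: a tag's .string attribute only returns text when the tag has a single child, and is None otherwise, while get_text() joins all nested text. The fragments below are hypothetical and only illustrate the difference.

```python
from bs4 import BeautifulSoup

# Hypothetical fragments, not taken from the real page
simple = BeautifulSoup("<h1>6日（今天）</h1>", "html.parser").h1
mixed = BeautifulSoup("<p><span>8</span>/<i>1℃</i></p>", "html.parser").p

print(simple.string)      # the single text child
print(mixed.string)       # None, because <p> has several children
print(mixed.get_text())   # all nested text joined together
```

This is why get_data() reads the temperatures from the inner span and i tags instead of from the enclosing p tag.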

# print the results with format()
def print_data(final_list,num):
    print("{:^10}\t{:^8}\t{:^8}\t{:^8}\t{:^8}".format('Date','Weather','High','Low','Wind'))
    for final in final_list[:num]:                 # slicing avoids an IndexError when fewer rows exist
        print("{:^10}\t{:^8}\t{:^8}\t{:^8}\t{:^8}".format(final[0],final[1],final[2],final[3],final[4]))

In Python, format() is a common way to lay out printed output.

[Figure: screenshot of the final output]
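
The format specification {:^10} centers a value in a field 10 characters wide. The sketch below wraps that idea in a small helper; the name format_row is hypothetical, and the default widths simply copy the ones used in print_data(). Values are converted with str() first so that numbers and None can be formatted too.

```python
def format_row(values, widths=(10, 8, 8, 8, 8)):
    """Center each value in its column and join the columns with tabs."""
    return "\t".join("{:^{w}}".format(str(v), w=w)
                     for v, w in zip(values, widths))


print(format_row(["Date", "Weather", "High", "Low", "Wind"]))
print(format_row(["6th", "Cloudy", 8, 1, "3-4"]))
```

Padding is counted in characters, so columns containing wide Chinese characters may not line up exactly in a terminal.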

Program code:

That completes the crawler. The full code is below:

import requests
from bs4 import BeautifulSoup


def getHTMLText(url,timeout = 30):
    try:
        r = requests.get(url, timeout = timeout)  # fetch the page with requests
        r.raise_for_status()                      # raise an HTTPError for 4xx/5xx responses
        r.encoding = r.apparent_encoding          # use the encoding detected from the content
        return r.text
    except requests.RequestException:
        return 'request failed'


def get_data(html):
    final_list = []
    soup = BeautifulSoup(html,'html.parser')       # parse the page with BeautifulSoup
    data = soup.find('div',{'id':'7d'})
    if data is None:                               # fetch failed or the page layout changed
        return final_list
    ul = data.find('ul')
    lis = ul.find_all('li')

    for day in lis:
        temp_list = []

        date = day.find('h1').string             # date
        temp_list.append(date)

        info = day.find_all('p')                 # all p tags for this day
        temp_list.append(info[0].string)         # weather description

        if info[1].find('span') is None:          # span in the second p tag: high temperature
            temperature_highest = ' '             # the high temperature may be absent
        else:
            temperature_highest = info[1].find('span').string
            temperature_highest = temperature_highest.replace('℃',' ')

        if info[1].find('i') is None:              # i in the second p tag: low temperature
            temperature_lowest = ' '               # the low temperature may be absent
        else:
            temperature_lowest = info[1].find('i').string
            temperature_lowest = temperature_lowest.replace('℃',' ')

        temp_list.append(temperature_highest)       # add the high temperature to temp_list
        temp_list.append(temperature_lowest)        # add the low temperature to temp_list

        wind_scale = info[2].find('i').string      # i in the third p tag: wind scale
        temp_list.append(wind_scale)

        final_list.append(temp_list)              # add this day's list to final_list
    return final_list

# print the results with format()
def print_data(final_list,num):
    print("{:^10}\t{:^8}\t{:^8}\t{:^8}\t{:^8}".format('Date','Weather','High','Low','Wind'))
    for final in final_list[:num]:                 # slicing avoids an IndexError when fewer rows exist
        print("{:^10}\t{:^8}\t{:^8}\t{:^8}\t{:^8}".format(final[0],final[1],final[2],final[3],final[4]))

# main() ties the functions together
def main():
    url = 'http://www.weather.com.cn/weather/101190401.shtml'
    html = getHTMLText(url)
    final_list = get_data(html)
    print_data(final_list,7)
main()

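Summary:

1. The requests and Beautiful Soup libraries are enough to scrape information from a web page concisely;

2. Lay out the main program skeleton first, then fill in each function according to its purpose: fetch the page > extract the data > print the output;

3. The most important step is analyzing the page structure. Ideally, map out the tag tree to locate the tags that hold each field, then loop over those tags to collect the data.

This is my first blog post, and I have not been learning Python for long, so the article may contain misunderstandings and mistakes. Please bear with me and point them out. Thank you!

I also strongly recommend the Python course series by Song Tian (嵩天) of Beijing Institute of Technology.

For more about me, visit my website http://www.o-xunkhun.com/ (currently under application and maintenance).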

