Web Scraping in Practice: From Out-of-Town Weather to Food Recommendations, Exploring the World of Food Lovers

Posted by 努力的小雨 on 2024-03-18

Original article: https://www.cnblogs.com/guoxiaoyu/p/18067058

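This is the second lesson, and we continue with web scraping. In the previous lesson we learned how to scrape cooking tutorials. As Lu Xun famously said (no, he didn't), once you have eaten your way through your hometown, go and eat somewhere else. That inspired me to scrape city weather information and look into the local food along the way. That idea is the very soul of a food lover.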

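Today's goal is to learn how to scrape city weather information: before deciding where to go, you need to know what the weather will be like. Our phones already have plenty of free weather apps, but that doesn't stop us from learning.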

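Before we start, we need a weather site that is easy to scrape. No particular site is required; any site that is easy to scrape will do. After all, we aren't trying to grab tickets or snap up products from a specific site right now. The main goal is to learn scraping.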

Weather scraper

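If you aren't sure whether a site is easy to scrape, first request its home page and see whether you can parse the returned HTML. If the parsed page matches what you see in the browser, the site probably has no anti-scraping measures in place. If it differs, the site very likely has such measures, or builds part of the page with JavaScript in the browser. While you are still learning, pick sites that are easy to scrape and avoid taking on difficult ones too early.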
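The check described above can be sketched roughly as follows. This is only an illustration: `missing_snippets` is a hypothetical helper, and the `fetched` string stands in for the decoded body you would get back from `urlopen`. The idea is to copy a few pieces of text you can see in the browser and confirm that they also appear in the fetched HTML.

```python
def missing_snippets(html_text, expected_snippets):
    """Return the expected snippets that do not appear in the fetched HTML."""
    return [s for s in expected_snippets if s not in html_text]


# Hypothetical stand-in for the decoded response body of the site's home page.
fetched = "<html><body><h2>Beijing</h2><a href='/beijing/'>Beijing</a></body></html>"

# Text copied by hand from the page as it is rendered in a browser.
visible_in_browser = ["Beijing", "Shanghai"]

missing = missing_snippets(fetched, visible_in_browser)
if missing:
    # Some visible text is absent, so the fetched page differs from the browser view.
    print("Not found in fetched HTML:", missing)
else:
    print("Fetched HTML contains everything we looked for")
```

If the list of missing snippets is empty, the page is a reasonable candidate for practice; if not, the content may be loaded by scripts or withheld from non-browser clients.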

All right, enough talk. Let's start by scraping every city listed on the site.

City list

Weather information is always tied to a city, so almost every weather site has a city list. Let's scrape that list first and keep it for later use. Here is the code:

# Import urlopen and Request from urllib
from urllib.request import urlopen, Request
# Import BeautifulSoup
from bs4 import BeautifulSoup as bf

# Send a browser-like User-Agent header with the request
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0'}
req = Request("https://www.tianqi.com/chinacity.html", headers=headers)
# Send the request; the response body is bytes, so decode it into a string
html = urlopen(req)
html_text = html.read().decode('utf-8')
obj = bf(html_text, 'html.parser')
# Use find_all to get every <h2> tag; each one holds a province name
province_tags = obj.find_all('h2')
for province_tag in province_tags:
    province_name = province_tag.text.strip()
    cities = []
    print(province_name)
    # The element right after the <h2> holds the links to that province's cities
    next_sibling = province_tag.find_next_sibling()
    if next_sibling is None:
        continue
    city_tags = next_sibling.find_all('a')
    for city_tag in city_tags:
        city_name = city_tag.text.strip()
        cities.append(city_name)
        print(city_name)

The main step above is to fetch the city list page and parse it into a mapping from provinces to their cities. For now the result is simply printed.
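Since the list is meant to be kept for later use, one option is to write the mapping to a JSON file instead of only printing it. The following is a minimal sketch under assumptions: `save_mapping`, `load_mapping`, the file name, and the sample data are all hypothetical and are not taken from the site or from the code above.

```python
import json
import os
import tempfile


def save_mapping(mapping, path):
    """Write a {province: [city, ...]} dict to a UTF-8 JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps Chinese names readable in the file
        json.dump(mapping, f, ensure_ascii=False, indent=2)


def load_mapping(path):
    """Read the mapping back from the JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Made-up sample data standing in for the scraped provinces and cities.
sample_mapping = {"北京": ["海淀", "朝阳"], "上海": ["浦东", "徐汇"]}

path = os.path.join(tempfile.gettempdir(), "city_province_mapping.json")
save_mapping(sample_mapping, path)
restored = load_mapping(path)
print(len(restored), "provinces restored")
```

Storing the result this way means later steps can reuse the city list without requesting the page again.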

City weather

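With the city list in hand, the next step is to fetch the weather for a city. Here we only handle the direct-administered municipalities; fetching weather for a province requires one extra hop through the province page, so we won't demonstrate it for now. Let's look at the code: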

# Import urlopen and Request from urllib
from urllib.request import urlopen, Request
# Import BeautifulSoup
from bs4 import BeautifulSoup as bf

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0'}
# The city page URL uses the city's pinyin name
city = "beijing"
req = Request(f"https://www.tianqi.com/{city}/", headers=headers)
# Send the request; the response body is bytes, so decode it into a string
html = urlopen(req)
html_text = html.read().decode('utf-8')
obj = bf(html_text, 'html.parser')
# District weather: every <a> inside div.mainWeather whose class is not 'd15'
city_tags = obj.find_all('div', class_='mainWeather')
for city_tag in city_tags:
    a_tags = city_tag.find_all('a', class_=lambda value: value != 'd15')
    for a_tag in a_tags:
        title = a_tag.get('title')
        print(title)
# Local food: print the link and title of each entry in the food ranking list
foods = obj.find_all('ul', class_='paihang_good_food')
for food in foods:
    a_tags = food.find_all('a')
    for a_tag in a_tags:
        href = a_tag.get('href')
        print(href)
        title = a_tag.get('title')
        print(title)
# Current weather for the city itself
weather_info = obj.find_all('dl', class_='weather_info')
for info in weather_info:
    city_name = info.find('h1').text
    date = info.find('dd', class_='week').text
    temperature = info.find('p', class_='now').text
    humidity = info.find('dd', class_='shidu').text
    air_quality = info.find('dd', class_='kongqi').h5.text
    print(f"Location: {city_name}")
    print(f"Time: {date}")
    print(f"Current temperature: {temperature}")
    print(humidity)
    print(air_quality)

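The code above parses not only the weather for the current location but also the weather of each district and the local food. Because the food links change, each link is kept paired with the name of its dish.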

City food

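Once the weather looks good, we usually want to know what local specialties there are. We can't live on fast food forever; local specialties are the soul of us foodies.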

Here is a sample:

# Import urlopen and Request from urllib
from urllib.request import urlopen, Request
# Import BeautifulSoup
from bs4 import BeautifulSoup as bf

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0'}
# Fixed food page used for the demo
req = Request("https://www.tianqi.com/meishi/1737.html", headers=headers)
# Send the request; the response body is bytes, so decode it into a string
html = urlopen(req)
html_text = html.read().decode('utf-8')
obj = bf(html_text, 'html.parser')
# The recommendation text lives in <span class="traffic">
span_tag = obj.find('span', class_='traffic')
if span_tag is not None:
    # Join all text fragments inside the span, stripped of surrounding whitespace
    text_content = ''.join(span_tag.stripped_strings)
    print(text_content)

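Here I mainly parse the reason the dish is recommended. In practice the link should come from the food links parsed on the weather page, but to keep the demo simple the sample uses a fixed value.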
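To connect the two steps, the scraped food links can be turned into full URLs before they are requested. The sketch below assumes the links are site-relative paths such as `/meishi/1737.html`; `food_urls`, `BASE_URL`, and the sample titles are hypothetical names used only for illustration. It relies on `urljoin` from the standard library.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.tianqi.com/"


def food_urls(food_links, base_url=BASE_URL):
    """Map each dish title to an absolute URL built from its scraped href."""
    return {title: urljoin(base_url, href) for href, title in food_links}


# Made-up (href, title) pairs; only the first path appears in the article.
sample_links = [
    ("/meishi/1737.html", "dish A"),
    ("/meishi/0000.html", "dish B"),
]

urls = food_urls(sample_links)
print(urls["dish A"])
```

Using `urljoin` rather than string concatenation means an href that is already absolute is left unchanged.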

Wrapping it up

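Keeping each piece as its own small example works well enough, but combining them into one simple application is more practical because it allows more flexible interaction. Here is the final code: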

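# Import urlopen and Request from urllib
from urllib.request import urlopen, Request
# Import BeautifulSoup
from bs4 import BeautifulSoup as bf
from random import choice
from colorama import init
from os import system, name as os_name
from termcolor import colored
from readchar import readkey
from xpinyin import Pinyin

# Converts Chinese names to pinyin, which is used to build the city URL
p = Pinyin()

# List of (province_name, [city_name, ...]) tuples
city_province_mapping = []

# District weather titles for the currently selected city
province_sub_weather = []

# List of (href, title) tuples for the local food
good_foods = []
# Colors picked at random for each printed line
FGS = ['green', 'yellow', 'blue', 'cyan', 'magenta', 'red']

def clear():
    # Clear the console: "cls" on Windows, "clear" elsewhere
    system("cls" if os_name == "nt" else "clear")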

def get_city_province_mapping():
    print(colored('Searching for cities', choice(FGS)))
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0'}
    req = Request("https://www.tianqi.com/chinacity.html", headers=headers)
    # Send the request; the response body is bytes, so decode it into a string
    html = urlopen(req)
    html_text = html.read().decode('utf-8')
    obj = bf(html_text, 'html.parser')
    # Use find_all to get every <h2> tag; each one holds a province name
    province_tags = obj.find_all('h2')
    for province_tag in province_tags:
        province_name = province_tag.text.strip()
        cities = []

        # The element right after the <h2> holds the links to that province's cities
        next_sibling = province_tag.find_next_sibling()
        if next_sibling is None:
            continue
        city_tags = next_sibling.find_all('a')
        for city_tag in city_tags:
            city_name = city_tag.text.strip()
            cities.append(city_name)

        city_province_mapping.append((province_name, cities))


def get_province_weather(province):
    print(colored(f'Selected: {province}', choice(FGS)))
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0'}
    req = Request(f"https://www.tianqi.com/{province}/", headers=headers)
    # Send the request; the response body is bytes, so decode it into a string
    html = urlopen(req)
    html_text = html.read().decode('utf-8')
    obj = bf(html_text, 'html.parser')
    city_tags = obj.find_all('div', class_='mainWeather')
    # city_tags = obj.find_all('ul',class_='raweather760')
    # Drop the results of the previously selected city before collecting new ones
    province_sub_weather.clear()
    good_foods.clear()
    print(colored('Parsing main districts', choice(FGS)))
    for city_tag in city_tags:
        # Keep every <a> whose class is not 'd15'
        a_tags = city_tag.find_all('a', class_=lambda value: value != 'd15')
        for a_tag in a_tags:
            title = a_tag.get('title')
            if title:
                province_sub_weather.append(title)
    foods = obj.find_all('ul', class_='paihang_good_food')
    print(colored('Parsing popular food', choice(FGS)))
    for food in foods:
        a_tags = food.find_all('a')
        for a_tag in a_tags:
            href = a_tag.get('href')
            title = a_tag.get('title')
            # Store the link together with the dish name
            good_foods.append((href, title))
    weather_info = obj.find_all('dl', class_='weather_info')
    print(colored('Parsing finished', choice(FGS)))
    for info in weather_info:
        city_name = info.find('h1').text
        date = info.find('dd', class_='week').text
        temperature = info.find('p', class_='now').text
        humidity = info.find('dd', class_='shidu').text
        air_quality = info.find('dd', class_='kongqi').h5.text

        print(colored(f"Location: {city_name}", choice(FGS)))
        print(colored(f"Time: {date}", choice(FGS)))
        print(colored(f"Current temperature: {temperature}", choice(FGS)))
        print(colored(humidity, choice(FGS)))
        print(colored(air_quality, choice(FGS)))

def search_food(link):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0'}
    req = Request(f"https://www.tianqi.com{link}", headers=headers)
    # Send the request; the response body is bytes, so decode it into a string
    html = urlopen(req)
    html_text = html.read().decode('utf-8')
    obj = bf(html_text, 'html.parser')
    # The recommendation text lives in <span class="traffic">
    span_tag = obj.find('span', class_='traffic')
    if span_tag is None:
        print(colored('No recommendation text found', choice(FGS)))
        return
    text_content = ''.join(span_tag.stripped_strings)
    print(colored(text_content, choice(FGS)))

def print_menu():
    # Only the first four entries are listed; the demo handles just the municipalities
    limit = min(4, len(city_province_mapping))
    for i in range(0, limit, 3):
        names = [f'{i + j}:{city_province_mapping[i + j][0]}' for j in range(3) if i + j < limit]
        print(colored('\t\t'.join(names), choice(FGS)))

def print_food():
    if not good_foods:
        print(colored('Select a city first to view this', choice(FGS)))
        return
    # Print three numbered dish names per row
    for i in range(0, len(good_foods), 3):
        names = [f'{i + j}:{good_foods[i + j][1]}' for j in range(3) if i + j < len(good_foods)]
        print(colored('\t\t'.join(names), choice(FGS)))

def print_hot(weather):
    if not weather:
        print(colored('Select a city first to view this', choice(FGS)))
        return
    # Print three numbered district entries per row
    for i in range(0, len(weather), 3):
        names = [f'{i + j}:{weather[i + j]}' for j in range(3) if i + j < len(weather)]
        print(colored('\t\t'.join(names), choice(FGS)))

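init()  # Enable colored console output before anything is printed
get_city_province_mapping()

# get_province_weather('beijing')
# search_food(good_foods[1][0])
print(colored('Search complete!', choice(FGS)))
print(colored('m: back to the home menu', choice(FGS)))
print(colored('h: show district weather for the current city', choice(FGS)))
print(colored('f: show local food', choice(FGS)))
print(colored('c: clear the console', choice(FGS)))
print(colored('q: quit', choice(FGS)))
my_key = ['q', 'm', 'c', 'h', 'f']
while True:
    # Wait until one of the supported keys is pressed
    while True:
        move = readkey()
        if move in my_key:
            break
    if move == 'q':  # 'q' quits
        break
    if move == 'c':  # 'c' clears the console
        clear()
    if move == 'h':
        print_hot(province_sub_weather)
    if move == 'f':
        print_food()
        try:
            num = int(input('Enter a food number: =====>'))
        except ValueError:
            num = -1  # Not a number, so treat it as out of range
        # Valid indexes run from 0 to len - 1
        if 0 <= num < len(good_foods):
            search_food(good_foods[num][0])
    if move == 'm':
        print_menu()
        try:
            num = int(input('Enter a city number: =====>'))
        except ValueError:
            num = -1  # Not a number, so treat it as out of range
        if 0 <= num < len(city_province_mapping):
            # Convert the Chinese name to pinyin without tone marks for the URL
            pinyin_without_tone = p.get_pinyin(city_province_mapping[num][0], '')
            get_province_weather(pinyin_without_tone)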

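I usually prefer printing to the console, which avoids unnecessary UI dependencies. The overall process isn't too complicated, though working out how to parse the data does take some time. Even so, we successfully finished scraping the weather information.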
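The menu reads a number from the keyboard and uses it as a list index, so it helps to validate that number in one place. The helper below is a small sketch with a hypothetical name, `parse_choice`, which is not part of the program above: it returns a usable index, or `None` when the input is not a number or falls outside the list.

```python
def parse_choice(text, size):
    """Return a valid 0-based index for a list of length `size`, or None."""
    try:
        num = int(text.strip())
    except ValueError:
        # The input was not an integer
        return None
    # Valid indexes run from 0 to size - 1
    return num if 0 <= num < size else None


# Made-up menu entries used only to exercise the helper.
options = ["first", "second", "third"]

for raw in ["1", "3", "abc"]:
    index = parse_choice(raw, len(options))
    if index is None:
        print(f"{raw!r} is not a valid choice")
    else:
        print(f"{raw!r} selects {options[index]}")
```

With a helper like this, both the city prompt and the food prompt could share the same check.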

Summary

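Today's material largely continues from the last lesson without much new ground. We parsed web pages, extracted and stored information, then used that information to build link addresses dynamically, and ended up with a simple interactive demo. I hope you'll try it hands-on as well. The process may be a bit painful, and it may not substantially raise your technical level, but it will at least broaden your technical range.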
