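[Python Crawler] Scraping Guizhou Agricultural Product Data with BeautifulSoup and Saving It to CSV

Posted by Eastmount on 2017-10-29

Original URL: https://blog.csdn.net/eastmount/article/details/78389201

When learning to scrape web data with regular expressions, BeautifulSoup, or Selenium, the scraped data is usually saved to TXT files. An earlier post also covered storing large volumes of data in a local MySQL database. This post adds a walkthrough of scraping Guizhou agricultural product data with BeautifulSoup and saving it to a local CSV file.

The core content covers the following points:
1. How to call BeautifulSoup to scrape web page data.
2. How to save data to a CSV file.
3. How to fix garbled Chinese characters when saving, by setting UTF-8 encoding.
4. How to run a crawl on a schedule and save timed screenshots.

I hope this introductory post helps, especially those who are just starting out with BeautifulSoup crawlers or are stuck on saving Chinese data to a CSV file. If there are mistakes or gaps in the article, please bear with me~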

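I. Working with CSV in Python

CSV (Comma-Separated Values) is a common plain-text storage format in which values are separated by commas. Python's built-in csv module only needs to be imported to write and read such files. The listings in this post are written for Python 2.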

1. Writing a CSV file

# -*- coding: utf-8 -*-
import csv
c = open("test-01.csv", "wb")  # open the file for writing
writer = csv.writer(c)
writer.writerow(['序號','姓名','年齡'])

tlist = []
tlist.append("1")
tlist.append("小明")
tlist.append("10")
writer.writerow(tlist)
print tlist,type(tlist)

del tlist[:]  # empty the list
tlist.append("2")
tlist.append("小紅")
tlist.append("9")
writer.writerow(tlist)
print tlist,type(tlist)

c.close()

Here writerow writes one row to the file, and in this example each row is passed in as a list. The output is shown in a figure in the original post.
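
The listing above relies on Python 2 behaviour (binary mode "wb" and the print statement). As a rough sketch of the same idea for Python 3, assuming the same three columns, the file can be opened in text mode with newline="" and an explicit encoding. The helper name write_rows and the temporary path are made up for illustration.

```python
import csv
import os
import tempfile


def write_rows(path, header, rows):
    """Write a header line followed by data rows to a CSV file (Python 3)."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)   # one row from a list
        writer.writerows(rows)    # several rows from a list of lists


# Hypothetical location: a temporary directory keeps the sketch self-contained.
path = os.path.join(tempfile.mkdtemp(), "test-01.csv")
write_rows(path, ["序號", "姓名", "年齡"], [["1", "小明", "10"], ["2", "小紅", "9"]])

with open(path, encoding="utf-8") as f:
    text = f.read()
print(text)
```

Using a with block also closes the file automatically, so the explicit close() call is no longer needed.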

2. Reading a CSV file

# -*- coding: utf-8 -*-
import csv
c = open("test-01.csv", "rb")  # open the file for reading
reader = csv.reader(c)
for line in reader:
    print line[0],line[1],line[2]
c.close()

The output is as follows:

序號 姓名 年齡
1 小明 10
2 小紅 9
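
For comparison, here is a small Python 3 sketch of the same read loop. It assumes the file content shown above and feeds it to csv.reader from an in-memory string instead of a file on disk, so it can be run anywhere. The names SAMPLE and read_rows are hypothetical.

```python
import csv
import io

# In-memory stand-in for test-01.csv (assumed content, matching the output above).
SAMPLE = "序號,姓名,年齡\n1,小明,10\n2,小紅,9\n"


def read_rows(text_stream):
    """Return every CSV row from a text stream as a list of strings."""
    return [row for row in csv.reader(text_stream)]


rows = read_rows(io.StringIO(SAMPLE))
for row in rows:
    print(row[0], row[1], row[2])
```

With a real file, the stream would come from open("test-01.csv", newline="", encoding="utf-8") instead of io.StringIO.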

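II. Scraping the Guizhou Agricultural Economy Website (貴州農經網) with BeautifulSoup

The site publishes daily updates on the prices and origins of all kinds of agricultural products, as shown in a figure in the original post.
URL: http://www.gznw.gov.cn/priceInfo/getPriceInfoByAreaId.jx?areaid=22572&page=1

The goal is to use BeautifulSoup to extract five fields: product, price, unit, place of origin, and publication time. The browser's "Inspect Element" tool shows that each data row sits in a <tr> node whose class attribute is "odd gradeX", so calling find_all is enough to retrieve the rows.

The code for crawling the first four pages is as follows: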

# -*- coding: utf-8 -*-
import urllib
from bs4 import BeautifulSoup

i = 1
while i<5:    # pages 1 to 4
    print "爬取第" + str(i) + "頁"
    url = "http://www.gznw.gov.cn/priceInfo/getPriceInfoByAreaId.jx?areaid=22572&page=" + str(i)
    print url
    content = urllib.urlopen(url).read()    # download the page
    soup = BeautifulSoup(content, "html.parser")
    print soup.title.get_text()

    # every data row is a <tr> whose class attribute is "odd gradeX"
    num = soup.find_all("tr",class_="odd gradeX")
    for n in num:
        con = n.get_text()
        num = con.splitlines()    # one list item per line of text in the row
        print num[1],num[2],num[3],num[4],num[5]

    i = i + 1

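Note the following points:
1. The site is paginated, and the URL shows that the page is selected by "page=xxx", so a loop is defined to crawl pages 1 to 4;
2. BeautifulSoup parses the downloaded page into a tree structure; soup.title returns the <title>xxxx</title> node, and get_text() then returns its text;
3. The core of the code is num = soup.find_all("tr",class_="odd gradeX"), which locates the data rows;
4. splitlines() splits the row text at line breaks and returns a list, whose items are then read one by one as num[n].
The output is as follows: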

爬取第1頁
http://www.gznw.gov.cn/priceInfo/getPriceInfoByAreaId.jx?areaid=22572&page=1
貴州農經網
龍蝦 260 元/公斤 貴陽市南明區新路口集貿市場 2017-10-30 11:28:39
豌豆 8 元/公斤 貴陽市南明區新路口集貿市場 2017-10-30 11:28:39
泥鰍 40 元/公斤 貴陽市南明區新路口集貿市場 2017-10-30 11:28:39
...
爬取第2頁
http://www.gznw.gov.cn/priceInfo/getPriceInfoByAreaId.jx?areaid=22572&page=2
貴州農經網
石斑魚 60 元/公斤 貴陽市南明區新路口集貿市場 2017-10-30 11:28:39
紫菜（幹） 34 元/公斤 貴陽市南明區新路口集貿市場 2017-10-30 11:28:39
紐荷爾 14 元/公斤 貴陽市南明區新路口集貿市場 2017-10-30 11:28:39
黃鱔 70 元/公斤 貴陽市南明區新路口集貿市場 2017-10-30 11:28:39
...
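
The parsing steps in the notes above can be tried offline. The sketch below is only an illustration under assumptions: SAMPLE_HTML is hypothetical markup modelled on the rows described in this section (the real page may be laid out differently), and parse_rows is a made-up helper. It is written for Python 3 and reads the <td> cells directly instead of splitting on line breaks, which is a swapped-in variation and not the approach used in the listing above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the rows described above; the real page may differ.
SAMPLE_HTML = """<html><head><title>貴州農經網</title></head><body><table>
<tr class="odd gradeX">
<td>龍蝦</td>
<td>260</td>
<td>元/公斤</td>
<td>貴陽市南明區新路口集貿市場</td>
<td>2017-10-30 11:28:39</td>
</tr>
<tr class="header">
<td>not a data row</td>
</tr>
</table></body></html>"""


def parse_rows(html):
    """Return one list of cell texts per <tr class="odd gradeX"> row."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # A class_ string containing a space is matched against the full class attribute value.
    for tr in soup.find_all("tr", class_="odd gradeX"):
        rows.append([td.get_text(strip=True) for td in tr.find_all("td")])
    return rows


rows = parse_rows(SAMPLE_HTML)
print(rows)
```

Reading the cells one by one does not depend on where the line breaks fall in the HTML, so the fields start at index 0 instead of index 1.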

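Next, the content needs to be saved to a CSV file. The most common error here is garbled Chinese text. On the one hand, the CSV file has to be marked as UTF-8, which is done by writing the UTF-8 BOM at the start of the file; on the other hand, each field has to be converted to UTF-8 bytes with encode('utf-8') before it is written. The code is as follows: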

# -*- coding: utf-8 -*-
"""
Created on Fri Oct 20 20:07:36 2017

@author: eastmount CSDN 楊秀璋
"""

import urllib
from bs4 import BeautifulSoup
import csv
import codecs

c = open("test.csv","wb")    # create the file
c.write(codecs.BOM_UTF8)     # write the UTF-8 BOM to prevent garbled text
writer = csv.writer(c)       # writer object
writer.writerow(['產品','價格','單位','批發地','時間'])

i = 1
while i <= 4:
    print "爬取第" + str(i) + "頁"
    url = "http://www.gznw.gov.cn/priceInfo/getPriceInfoByAreaId.jx?areaid=22572&page=" + str(i)
    content = urllib.urlopen(url).read()
    soup = BeautifulSoup(content,"html.parser")
    print soup.title.get_text()
    tt = soup.find_all("tr",class_="odd gradeX")
    for t in tt:
        content = t.get_text()
        num = content.splitlines()
        print num[0],num[1],num[2],num[3],num[4],num[5]
        # write to the file: encode fields 1 to 5 as UTF-8 bytes
        templist = [field.encode('utf-8') for field in num[1:6]]
        #print templist
        writer.writerow(templist)
    i = i + 1

c.close()

The result is shown in a figure in the original post: the first 4 pages are downloaded and written to the CSV file.
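
In Python 3 strings are already Unicode, so the per-field encode('utf-8') step is not needed there. A minimal sketch, assuming the same five columns: opening the file with the "utf-8-sig" codec writes the BOM automatically, which is what spreadsheet programs such as Excel look for when detecting UTF-8. The function name save_prices and the temporary path are hypothetical, and the single data row is copied from the sample output earlier in this post.

```python
import codecs
import csv
import os
import tempfile


def save_prices(path, rows):
    """Write price rows to a UTF-8 CSV file that starts with a BOM (Python 3)."""
    # "utf-8-sig" emits the same three bytes as codecs.BOM_UTF8 before the data.
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.writer(f)
        writer.writerow(["產品", "價格", "單位", "批發地", "時間"])
        writer.writerows(rows)


path = os.path.join(tempfile.mkdtemp(), "test.csv")
save_prices(path, [["龍蝦", "260", "元/公斤", "貴陽市南明區新路口集貿市場", "2017-10-30 11:28:39"]])

with open(path, "rb") as f:
    raw = f.read()
print(raw.startswith(codecs.BOM_UTF8))
```

Reading the file back with encoding="utf-8-sig" strips the BOM again, so the first header cell comes out clean.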

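Adding a scheduling mechanism so that the crawl runs at a fixed time every day would round this off nicely. To finish, here is a scheduled task written in Anaconda.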

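III. Scheduled Screenshots in Python

Scheduling is usually handled by the operating system's task scheduler, the same way software such as WPS or QQ updates itself automatically or shows timed reminders. Here it is enough to run the crawler code at a fixed time every day; I will write a separate post covering this in detail.
Reference: http://blog.csdn.net/wwy11/article/details/51100432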
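
If the system scheduler is not an option, a long-running script can work out for itself how long to wait until the next daily run. The sketch below is only an illustration under assumptions: seconds_until is a made-up helper, and the 08:30 run time is an arbitrary example, not something taken from this post.

```python
import datetime


def seconds_until(hour, minute, now):
    """Seconds from `now` until the next occurrence of hour:minute."""
    target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if target <= now:
        # Today's slot has already passed, so aim for tomorrow.
        target += datetime.timedelta(days=1)
    return (target - now).total_seconds()


# Fixed example time so the result is reproducible.
now = datetime.datetime(2017, 10, 30, 18, 0, 0)
wait = seconds_until(8, 30, now)
print(wait)
```

A script could call time.sleep(seconds_until(8, 30, datetime.datetime.now())) and then start the crawl, though the system scheduler is still the sturdier choice because it survives reboots.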

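The code below opens the web pages at 10-second intervals and takes a screenshot of each. It is kept here as an online backup, for reference only.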

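# -*- coding:utf-8 -*-
from PIL import ImageGrab
import webbrowser as web
import time

# convert hours, minutes and seconds into a number of seconds
def sleeptime(hour,minute,sec):
    return hour*3600 + minute*60 + sec

second = sleeptime(0,0,10)   # interval: 10 seconds
url = ["https://www.gzzzb.gov.cn/","http://www.gznw.gov.cn/"]
j = 1                        # screenshot counter, used as the file name
while True:                  # the original "while j==1" stopped after one round, because j is incremented below
    time.sleep(second)
    # open each page and take a screenshot
    i = 1
    while i<=len(url):
        web.open_new_tab(url[i-1])
        time.sleep(5)        # give the page time to load
        im = ImageGrab.grab()
        im.save(str(j)+'.jpg','JPEG')   # save the screenshot
        i = i+1
        j = j+1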

《勿忘心安》
勿要把酒倚寒窗，庭院枯葉已飛霜。
忘懷之前坎坷路，勸君一醉付流光。
心中愁苦漫翻滾，雪上寒鴉入畫堂。
安知我輩庸庸過，雙鬢飛白亦疏狂。

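I really like this poem, and I enjoy these days of preparing lessons on the bus; my heart is calm and at ease. I enjoy and look forward even more to the newly renovated home. Life is long, and I will keep walking on with her, with a hint of a smile. However busy things get, I will still squeeze in some time to look at distributed crawlers and deep learning. The October milestone is finally over. My students' notes are good and have a bit of my style, and everyone is working hard.
Remember you are born to live. Don't live because you are born! Don't go the way life takes you. Take life the way you go! Follow my heart and nana's footsteps forever.

Finally, I hope this article is helpful to you.
(By:Eastmount 2017-10-30 18:00 http://blog.csdn.net/eastmount/)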
