Python Crawler in Practice (2): A Targeted Crawler for Stock Data

Published 2017-08-12

Original URL: http://python.jobbole.com/88350/

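Overview

Goal: fetch the names and trading information of all stocks listed on the Shanghai and Shenzhen stock exchanges.
Output: saved to a file.
Technical route: requests, bs4, re
Language: Python 3.5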

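Notes

Site selection principle: the stock information must be present statically in the HTML page, not generated by JavaScript, and the site's robots protocol (robots.txt) must not restrict access.
Selection method: open the page, view its source, and search the source for the stock price shown on the page.
For example, open a stock page on Sina Finance, as shown in the figure below:

On the left of the figure is the rendered page, which shows the price of Tianshan shares (天山股份) as 13.06. On the right is the page source; searching it for 13.06 finds nothing. The data on this page is therefore generated by JavaScript, which makes the page unsuitable for this project, so we look for another site.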
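
The check described above can be sketched as a small helper. This is a minimal illustration rather than part of the original program; the function name price_in_source and the two sample HTML strings are assumptions made up for the example.

```python
def price_in_source(html, price_text):
    """Return True if the price string appears verbatim in the raw page source."""
    return price_text in html


# Hypothetical page sources, invented for illustration.
static_page = '<div class="price"><strong>13.06</strong></div>'
js_page = '<div id="price"></div><script src="quote.js"></script>'

print(price_in_source(static_page, "13.06"))  # True: the value is in the HTML
print(price_in_source(js_page, "13.06"))      # False: a script would fill it in later
```

In practice the html argument would be the text returned by an HTTP request, and a False result suggests the value is filled in by a script after the page loads.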

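Next, open a stock page on Baidu Stocks (百度股票), as shown in the figure below:

The figure shows that the Baidu Stocks data is present in the HTML itself, which meets the requirement of this project, so this project uses the Baidu Stocks site.

Baidu Stocks only provides pages for individual stocks, so we also need a list of all stocks currently on the market. For this we use East Money (東方財富網); its stock list page is shown in the figure below.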
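
The robots restriction named in the site selection principle can also be checked in code. The sketch below uses the standard library's urllib.robotparser on a made-up robots.txt; the rules and the example.com URLs are assumptions for illustration, not the actual rules of any site mentioned here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for illustration.
sample_robots = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(sample_robots.splitlines())

# can_fetch(useragent, url) reports whether the rules allow the request.
allowed = parser.can_fetch("*", "https://example.com/stock/sz300023.html")
blocked = parser.can_fetch("*", "https://example.com/private/data.html")
print(allowed, blocked)
```

For a real site, set_url() followed by read() would load the live robots.txt instead of the sample text.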

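How it works

Look at the URL of an individual stock on Baidu Stocks: https://gupiao.baidu.com/stock/sz300023.html. The number 300023 in the URL is the stock's code, and sz stands for the Shenzhen Stock Exchange. The program is therefore structured as follows:

Step 1: get the list of stocks from East Money;
Step 2: take each stock code in turn, append it to the Baidu Stocks base URL, and visit each resulting link to get that stock's information;
Step 3: save the results to a file.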
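
The URL scheme behind step 2 can be sketched in a few lines. The helper names build_stock_url and split_code are hypothetical and do not appear in the article's program; only the base URL and the sample code come from the text above.

```python
BASE_URL = "https://gupiao.baidu.com/stock/"  # base URL quoted in the article


def build_stock_url(code, base_url=BASE_URL):
    """Return the page URL for a code such as 'sz300023'."""
    return base_url + code + ".html"


def split_code(code):
    """Split 'sz300023' into the exchange prefix and the six-digit number."""
    return code[:2], code[2:]


print(build_stock_url("sz300023"))
print(split_code("sz300023"))
```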

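Next, look at the source of an individual stock page on Baidu Stocks. Each stock's information is stored in the HTML as shown in the figure below.

When saving each stock's information we can follow the way the HTML stores it: each field name corresponds to one value, that is, the data is stored as key-value pairs. In Python, key-value pairs map naturally onto the dictionary type. This project therefore uses one dictionary to hold the information for each stock, uses a dictionary again to record the information for all stocks, and finally writes the dictionary data to a file.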
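
As a rough picture of that layout, the sketch below builds one dictionary for a single stock and an outer dictionary keyed by stock code. The field names and values are invented for illustration; the real keys come from the page being parsed.

```python
# Hypothetical field names and values, invented for illustration.
stock_info = {
    "name": "ExampleCo",
    "open": "13.00",
    "volume": "12000",
}

# An outer dictionary keyed by stock code collects one record per stock.
all_stocks = {}
all_stocks["sz300023"] = stock_info

for code, info in all_stocks.items():
    print(code, info["name"], info["open"])
```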

Writing the code

First comes the function that fetches a page's HTML. It needs little explanation; the code is as follows:

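# Fetch the HTML text of a page
import requests

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        # Decode using the encoding guessed from the page content
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        # Network or HTTP error: return an empty string
        return ""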
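
The line r.encoding = r.apparent_encoding is worth a note. When a response's headers do not declare a charset, requests falls back to ISO-8859-1 for text content, which garbles Chinese pages; apparent_encoding is the encoding guessed from the body itself. The sketch below reproduces the effect with plain bytes and no network access. The sample string and the choice of GBK are assumptions made for illustration.

```python
text = "股票"                      # sample text, invented for illustration
raw = text.encode("gbk")           # bytes as a GBK-encoded page would send them

wrong = raw.decode("iso-8859-1")   # decoding with the wrong charset
right = raw.decode("gbk")          # decoding with the correct charset

print(repr(wrong))                 # unreadable characters, one per byte
print(right)                       # the original text
```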

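Next comes the HTML parsing code. The first page to parse is the East Money stock list. Its source is shown in the figure below.

The figure shows that the URL in each a tag's href attribute contains the code of the corresponding stock, so all we need to do is extract the stock code from each URL. The steps are as follows:

Step 1, fetch the page: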

html = getHTMLText(stockURL)

Step 2, parse the page and find all a tags:

soup = BeautifulSoup(html, 'html.parser') 
a = soup.find_all('a')

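Step 3, iterate over the a tags and process each one, as follows:

1. Find the href attribute of the a tag and take out the stock code at the end of the link, using a regular expression to match it. Shenzhen Stock Exchange codes start with sz, Shanghai Stock Exchange codes start with sh, and the numeric part has 6 digits, so the regular expression can be written as [s][hz]\d{6}. In other words, we build a regular expression, search each link for a string that matches it, and extract that string. The code is as follows: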

for i in a:
    href = i.attrs['href']
    lst.append(re.findall(r"[s][hz]\d{6}", href)[0])

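2. The HTML contains many a tags, and some of them have no href attribute, so the code above raises an exception when run. (Links that contain no stock code fail as well, because findall returns an empty list and [0] raises an IndexError.) The code therefore needs to be wrapped in try…except to handle these exceptions: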

for i in a:
    try:
        href = i.attrs['href']
        lst.append(re.findall(r"[s][hz]\d{6}", href)[0])
    except (KeyError, IndexError):
        # No href attribute, or no stock code in the link: skip this tag
        continue

As the code above shows, when an exception occurs we use continue to skip that tag and move on to the next one. With this code we can collect all the stock codes on the East Money page.
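
The regular expression can be tried on its own, without fetching any page. The sketch below is not the article's code: it uses re.search with an explicit None check instead of try…except, and the sample links are invented for illustration.

```python
import re

CODE_PATTERN = re.compile(r"[s][hz]\d{6}")

# Hypothetical href values, invented for illustration; None stands for an
# a tag that has no href attribute.
hrefs = [
    "http://example.com/sz300023.html",
    "http://example.com/sh600000.html",
    "http://example.com/center/",
    None,
]

codes = []
for href in hrefs:
    if href is None:
        continue  # tag without an href attribute
    match = CODE_PATTERN.search(href)
    if match:  # None when the link contains no stock code
        codes.append(match.group())

print(codes)  # ['sz300023', 'sh600000']
```

Both approaches skip the same links; the explicit checks simply make the two failure cases visible in the code.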

Wrapping the code above in a function, the complete code for parsing the East Money page is as follows:

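def getStockList(lst, stockURL):
    # Append every sh/sz stock code found in the page's links to lst
    html = getHTMLText(stockURL)
    soup = BeautifulSoup(html, 'html.parser')
    a = soup.find_all('a')
    for i in a:
        try:
            href = i.attrs['href']
            lst.append(re.findall(r"[s][hz]\d{6}", href)[0])
        except (KeyError, IndexError):
            # No href attribute, or no stock code in the link
            continue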

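Next we get the information for a single stock from Baidu Stocks. First look at the source of the page, shown in the figure below.

The stock's information sits in the HTML shown in the figure, so that HTML is what we need to parse. The process is as follows:

1. The base URL of Baidu Stocks is https://gupiao.baidu.com/stock/
The URL of a single stock's page is https://gupiao.baidu.com/stock/sz300023.html

So the page URL is just the Baidu Stocks base URL plus the stock's code plus .html. The codes have already been extracted from East Money by getStockList, so we only need to iterate over the list that function filled in. The code is as follows: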

for stock in lst:
    url = stockURL + stock + ".html"

2. With the URL in hand, fetch the page to get its HTML. The code is as follows:

html = getHTMLText(url)

3. With the HTML in hand, parse it. The figure above shows that a single stock's information is stored in a div tag whose class attribute is stock-bets, so we look that element up:

soup = BeautifulSoup(html, 'html.parser')
stockInfo = soup.find('div',attrs={'class':'stock-bets'})

4. The stock name is in the element whose class is bets-name. Find it and store the name in the dictionary:

infoDict = {}
name = stockInfo.find_all(attrs={'class':'bets-name'})[0]
# The key '股票名稱' means "stock name"
infoDict.update({'股票名稱': name.text.split()[0]})

Here split()[0] keeps only the text before the first whitespace; whatever follows the stock name after a space is not needed.

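5. The HTML also shows that the rest of the stock's information is stored in dt and dd tags, where each dt tag holds a field name (the key) and each dd tag holds the corresponding value. Get all the keys and values: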

keyList = stockInfo.find_all('dt')
valueList = stockInfo.find_all('dd')

Then store the keys and values in the dictionary as key-value pairs:

for i in range(len(keyList)):
    key = keyList[i].text
    val = valueList[i].text
    infoDict[key] = val

6. Finally, write the dictionary to an external file:

with open(fpath, 'a', encoding='utf-8') as f:
    f.write( str(infoDict) + '\n' )
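
Steps 4 to 6 can be tried without a live page. The sketch below works on plain strings standing in for the text of the bets-name, dt and dd elements. All sample values are invented, and zip is used in place of the index loop, which is a small variation on the article's code rather than what it does.

```python
# Hypothetical element texts, invented for illustration.
name_text = "ExampleCo (300023)\n closed"
key_texts = ["Open", "Volume", "High"]
value_texts = ["13.00", "12000", "13.50"]

info = {}
# split() with no argument splits on any whitespace, so [0] is the name only.
info["name"] = name_text.split()[0]

# zip pairs the first key with the first value, the second with the second, ...
for key, value in zip(key_texts, value_texts):
    info[key] = value

line = str(info) + "\n"  # the same one-record-per-line text the article writes
print(line, end="")
```

Unlike the index loop, zip stops quietly at the shorter list if the number of keys and values differs, so no IndexError is raised in that case.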

Wrapping the steps above into a complete function, the code is as follows:

def getStockInfo(lst, stockURL, fpath):
    for stock in lst:
        url = stockURL + stock + ".html"
        html = getHTMLText(url)
        try:
            if html == "":
                continue
            infoDict = {}
            soup = BeautifulSoup(html, 'html.parser')
            stockInfo = soup.find('div', attrs={'class': 'stock-bets'})

            # Stock name: the text before the first whitespace
            name = stockInfo.find_all(attrs={'class': 'bets-name'})[0]
            infoDict.update({'股票名稱': name.text.split()[0]})

            # Remaining fields: dt tags hold the keys, dd tags hold the values
            keyList = stockInfo.find_all('dt')
            valueList = stockInfo.find_all('dd')
            for i in range(len(keyList)):
                key = keyList[i].text
                val = valueList[i].text
                infoDict[key] = val

            # Append one record per line to the output file
            with open(fpath, 'a', encoding='utf-8') as f:
                f.write(str(infoDict) + '\n')
        except (AttributeError, IndexError):
            # The page does not have the expected structure: skip this stock
            continue

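Here try…except handles the exceptions raised while parsing, so that a page that cannot be parsed is skipped and the crawl continues.

Next, write the main function, which simply calls the functions above: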

def main():
    stock_list_url = 'http://quote.eastmoney.com/stocklist.html'
    stock_info_url = 'https://gupiao.baidu.com/stock/'
    output_file = 'D:/BaiduStockInfo.txt'
    slist=[]
    getStockList(slist, stock_list_url)
    getStockInfo(slist, stock_info_url, output_file)

Complete program

# -*- coding: utf-8 -*-

import re

import requests
from bs4 import BeautifulSoup


def getHTMLText(url):
    """Return the page text, or "" if the request fails."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""


def getStockList(lst, stockURL):
    """Append every sh/sz stock code found in the page's links to lst."""
    html = getHTMLText(stockURL)
    soup = BeautifulSoup(html, 'html.parser')
    a = soup.find_all('a')
    for i in a:
        try:
            href = i.attrs['href']
            lst.append(re.findall(r"[s][hz]\d{6}", href)[0])
        except (KeyError, IndexError):
            # No href attribute, or no stock code in the link
            continue


def getStockInfo(lst, stockURL, fpath):
    """Fetch each stock's page and append its fields to fpath, one dict per line."""
    count = 0
    for stock in lst:
        url = stockURL + stock + ".html"
        html = getHTMLText(url)
        try:
            if html == "":
                continue
            infoDict = {}
            soup = BeautifulSoup(html, 'html.parser')
            stockInfo = soup.find('div', attrs={'class': 'stock-bets'})

            # Stock name: the text before the first whitespace
            name = stockInfo.find_all(attrs={'class': 'bets-name'})[0]
            infoDict.update({'股票名稱': name.text.split()[0]})

            # Remaining fields: dt tags hold the keys, dd tags hold the values
            keyList = stockInfo.find_all('dt')
            valueList = stockInfo.find_all('dd')
            for i in range(len(keyList)):
                key = keyList[i].text
                val = valueList[i].text
                infoDict[key] = val

            with open(fpath, 'a', encoding='utf-8') as f:
                f.write(str(infoDict) + '\n')
        except (AttributeError, IndexError):
            # The page does not have the expected structure: skip this stock
            continue
        finally:
            # Runs for every stock, including skipped ones, so the
            # percentage reaches 100 at the end of the list
            count = count + 1
            print("\rProgress: {:.2f}%".format(count * 100 / len(lst)), end="")


def main():
    stock_list_url = 'http://quote.eastmoney.com/stocklist.html'
    stock_info_url = 'https://gupiao.baidu.com/stock/'
    output_file = 'D:/BaiduStockInfo.txt'
    slist = []
    getStockList(slist, stock_list_url)
    getStockInfo(slist, stock_info_url, output_file)


if __name__ == "__main__":
    main()

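The print statements in the code above report the progress of the crawl. After the program finishes, the file BaiduStockInfo.txt appears on drive D, containing the stock information.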
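
Because each line of the output file is the str() form of a dictionary, the records can be loaded back with ast.literal_eval from the standard library. The sketch below is an assumption about how one might read the file, not part of the original program; it writes two invented records to a temporary file so that it runs anywhere.

```python
import ast
import os
import tempfile

# Hypothetical records, invented for illustration.
records = [
    {"name": "ExampleCo", "Open": "13.00"},
    {"name": "SampleInc", "Open": "8.25"},
]

# Write the records the same way the article does: str(dict) plus a newline.
path = os.path.join(tempfile.mkdtemp(), "stocks.txt")
with open(path, "a", encoding="utf-8") as f:
    for record in records:
        f.write(str(record) + "\n")

# Read them back: literal_eval parses a Python literal without executing code.
loaded = []
with open(path, encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if line:
            loaded.append(ast.literal_eval(line))

print(loaded == records)  # True
```

Writing each record with json.dumps instead of str would make the file readable from other languages as well.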
