Scraping and Analyzing Online Lending (P2P) Data

Posted by zgbgx on 2018-01-24

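About the data sources

This project was written in early July 2017. It uses Python to scrape data from 網貸之家 and 人人貸 and then analyzes it.
網貸之家 is the largest P2P data platform in China, and 人人貸 ranks among the top twenty P2P platforms in China.
Source code link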

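Data scraping

Packet capture analysis

Traffic was inspected mainly with the Network panel of Chrome's developer tools. 網貸之家 returns all of its data as JSON over AJAX, while 人人貸 returns some data as JSON over AJAX and renders the rest directly into HTML pages.

Request example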

QQ截圖20180123205633.png

The captured request shows the request method (GET or POST), the request headers, and the request parameters.

QQ截圖20180123205843.png

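The captured response shows the format of the returned data (JSON in this example), its structure, and the actual values.
Note: this is the API endpoint 網貸之家 uses today. The endpoints were different when the crawler was written, so the 網貸之家 part of the crawler no longer works.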

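Constructing requests

Requests are constructed from the results of the packet capture analysis. This project uses Python's requests library to simulate HTTP requests.
The code: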

import requests


class SessionUtil():
    def __init__(self, headers=None, cookie=None):
        self.session = requests.Session()
        if headers is None:
            # Default headers copied from a browser AJAX request
            headersStr = {"Accept": "application/json, text/javascript, */*; q=0.01",
                "X-Requested-With": "XMLHttpRequest",
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36",
                "Accept-Encoding": "gzip, deflate, sdch, br",
                "Accept-Language": "zh-CN,zh;q=0.8"
                }
            self.headers = headersStr
        else:
            self.headers = headers
        self.cookie = cookie

    # Send a GET request and return the response body as text
    def getReq(self, url):
        return self.session.get(url, headers=self.headers).text

    def addCookie(self, cookie):
        self.headers['cookie'] = cookie

    # Send a POST request (with the same headers as GET) and return the response body as text
    def postReq(self, url, param):
        return self.session.post(url, param, headers=self.headers).text

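Of the request headers, the only key field set is "User-Agent". Neither 網貸之家 nor 人人貸 had any anti-scraping measures; it was not even necessary to set the "Referer" header to avoid cross-origin errors.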
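To check what a request will look like before sending it, requests can build a prepared request that exposes the final URL and headers. The following is a minimal sketch; the endpoint URL and the `platId` parameter value are placeholders chosen for illustration, not endpoints taken from the project.

```python
import requests

# Hypothetical endpoint, used only to show how the request is assembled.
URL = "http://example.com/api/plat/detail"

session = requests.Session()
# Headers set on the session are sent with every request made through it.
session.headers.update({
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "X-Requested-With": "XMLHttpRequest",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
})

# Build the request without sending it, so the final URL and headers can be inspected.
request = requests.Request("GET", URL, params={"platId": 59})
prepared = session.prepare_request(request)

print(prepared.method)
print(prepared.url)
print(prepared.headers["User-Agent"])
```

Comparing this output with the request shown in the browser's Network panel is a quick way to spot a missing header or parameter.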

Crawler example

The following is a crawler example:

import json
import traceback

from databaseUtil import DatabaseUtil
from sessionUtil import SessionUtil
from dictUtil import DictUtil
from logUtil import LogUtil

# Columns of problemPlatDetail that are copied one-to-one from the JSON response
FIELDS = (
    "actualCapital", "aliasName", "association", "associationDetail",
    "autoBid", "autoBidCode", "bankCapital", "bankFunds", "bidSecurity",
    "bindingFlag", "businessType", "companyName", "credit", "creditLevel",
    "delayScore", "delayScoreDetail", "displayFlg", "drawScore",
    "drawScoreDetail", "equityVoList", "experienceScore",
    "experienceScoreDetail", "fundCapital", "gjlhhFlag", "gjlhhTime",
    "gruarantee", "inspection", "juridicalPerson", "locationArea",
    "locationAreaName", "locationCity", "locationCityName", "manageExpense",
    "manageExpenseDetail", "newTrustCreditor", "newTrustCreditorCode",
    "officeAddress", "onlineDate", "payment", "paymode", "platBackground",
    "platBackgroundDetail", "platBackgroundDetailExpand",
    "platBackgroundExpand", "platEarnings", "platEarningsCode", "platName",
    "platStatus", "platUrl", "problem", "problemTime", "recordId",
    "recordLicId", "registeredCapital", "riskCapital", "riskFunds",
    "riskReserve", "riskcontrol", "securityModel", "securityModelCode",
    "securityModelOther", "serviceScore", "serviceScoreDetail",
    "startInvestmentAmout", "term", "termCodes", "termWeight",
    "transferExpense", "transferExpenseDetail", "trustCapital",
    "trustCreditor", "trustCreditorMonth", "trustFunds", "tzjPj",
    "vipExpense", "withTzj", "withdrawExpense",
)


def handleData(returnStr):
    # Parse the JSON response and return the platform detail object
    jsonData = json.loads(returnStr)
    platData = jsonData.get('data').get('platOuterVo')
    return platData


def storeData(jsonOne, conn, cur, platId):
    # Use a parameterized INSERT instead of concatenating values into the SQL
    # string, so quotes or None values in the data cannot break the statement.
    # The %s placeholder assumes a MySQL driver such as pymysql or MySQLdb.
    columns = ",".join(FIELDS + ("platId",))
    placeholders = ",".join(["%s"] * (len(FIELDS) + 1))
    sql = "insert into problemPlatDetail (" + columns + ") values (" + placeholders + ")"
    values = [jsonOne.get(name) for name in FIELDS] + [platId]
    cur.execute(sql, values)
    conn.commit()


conn, cur = DatabaseUtil().getConn()
session = SessionUtil()
logUtil = LogUtil("problemPlatDetail.log")
cur.execute('select platId from problemPlat')
data = cur.fetchall()
# Collect the ids of all problem platforms as strings
mylist = [str(row.get('platId')) for row in data]
print(mylist)
for i in mylist:
    url = 'http://wwwservice.wdzj.com/api/plat/platData30Days?platId=' + i
    try:
        data = session.getReq(url)
        platData = handleData(data)
        dictObject = DictUtil(platData)
        storeData(dictObject, conn, cur, i)
    except Exception:
        # Log the failure and continue with the next platform
        traceback.print_exc()
cur.close()
conn.close()

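Throughout the process we construct requests and then parse each response. JSON responses are parsed with the json library, and HTML pages with BeautifulSoup (for HTML pages with a complex structure, lxml is recommended). The parsed results are stored in a MySQL database.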
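The parse-then-store step can be sketched with the standard library alone. This is an illustration under stated assumptions, not the project's code: `sqlite3` stands in for MySQL so the example runs without a server, the table `plat_detail` and the three-field `SAMPLE` payload are made up, and the placeholder is `?` here whereas MySQL drivers such as pymysql use `%s`.

```python
import json
import sqlite3

# Hypothetical response body, shaped like data -> platOuterVo -> fields.
SAMPLE = '{"data": {"platOuterVo": {"platName": "Demo", "platStatus": "1", "onlineDate": null}}}'

# Hypothetical subset of fields kept for this sketch.
FIELDS = ("platName", "platStatus", "onlineDate")


def parse_plat(text):
    """Return the nested platOuterVo object, or an empty dict if it is missing."""
    payload = json.loads(text)
    return (payload.get("data") or {}).get("platOuterVo") or {}


def store_plat(conn, plat, plat_id):
    """Insert one row using placeholders so values are never spliced into the SQL."""
    columns = ", ".join(FIELDS + ("platId",))
    marks = ", ".join("?" for _ in range(len(FIELDS) + 1))
    values = [plat.get(name) for name in FIELDS] + [plat_id]
    conn.execute(f"INSERT INTO plat_detail ({columns}) VALUES ({marks})", values)
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE plat_detail (platName TEXT, platStatus TEXT, onlineDate TEXT, platId TEXT)"
)
store_plat(conn, parse_plat(SAMPLE), "59")
rows = conn.execute("SELECT platName, onlineDate, platId FROM plat_detail").fetchall()
print(rows)
```

Passing values as parameters lets the driver handle quoting and `None`, which string concatenation does not.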

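Crawler code

Crawler code link (note: the crawler code runs under both Python 2 and Python 3; I deployed it on an Alibaba Cloud server and ran it with Python 2)

Data analysis

The analysis mainly uses Python's numpy, pandas, and matplotlib, supplemented by 海致BDP.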

Time series analysis

Reading the data

The data is usually read into a pandas DataFrame for analysis.
The following example reads the problem-platform data:

problemPlat=pd.read_csv('problemPlat.csv',parse_dates=True)  # problem platforms; parse_dates=True parses the index as dates

Data structure

QQ截圖20180123212641.png

Time series analysis

e.g. number of problem platforms over time

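problemPlat['id'].loc['2012':'2017'].resample('M').count().plot(title='Problem P2P platforms')  # number of problem P2P platforms per month (resample's how= argument no longer exists in current pandas)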

Chart

QQ截圖20180123212803.png
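The monthly count can be tried on synthetic data without the original CSV. In this sketch the dates and ids are made up, and the `"MS"` (month start) frequency is used instead of the article's `"M"` because it is accepted across pandas versions; newer releases prefer `"ME"` for month-end labels.

```python
import pandas as pd

# Synthetic data: one row per problem platform, indexed by a made-up problem date.
dates = pd.to_datetime([
    "2016-01-05", "2016-01-20", "2016-02-11",
    "2016-04-02", "2016-04-15", "2016-04-30",
])
problem_plat = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6]}, index=dates)

# Bucket the rows by calendar month and count the platforms in each bucket.
# Months without any rows (March here) appear with a count of 0.
monthly = problem_plat["id"].resample("MS").count()
print(monthly)
```

Calling `monthly.plot()` on the result gives the same kind of trend chart as above.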

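Regional analysis

Done with 海致BDP (drawing map distributions in Python is fairly involved, and I had not learned it at the time).

Number of problem platforms by province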

下載.png

Platform transaction volume by province

全年成交額全國各省對比.png

Size distribution analysis

e.g. distribution of platform transaction volume nationwide in June
Code

juneData['amount'].hist(density=True)  # normed= was replaced by density= in matplotlib
juneData['amount'].plot(kind='kde',style='k--')  # probability distribution of June transaction volume

Kernel density plot

QQ截圖20180123213700.png

Kernel density of the log-transformed transaction volume

np.log10(juneData['amount']).hist(density=True)
np.log10(juneData['amount']).plot(kind='kde',style='k--')  # probability distribution after taking the base-10 logarithm

Chart

QQ截圖20180123213901.png

After taking the base-10 logarithm, the distribution is much closer to the usual symmetric, pyramid-like shape.
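The effect of the log transform can also be checked numerically rather than by eye. The sketch below uses synthetic log-normal "amounts" (an assumption standing in for the real June data) and compares the sample skewness before and after `np.log10`; a value near 0 indicates a symmetric distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic, heavily right-skewed amounts (log-normal), not real transaction data.
amount = rng.lognormal(mean=8.0, sigma=1.5, size=10_000)


def skewness(values):
    """Sample skewness: the third standardized moment (0 for a symmetric distribution)."""
    centered = values - values.mean()
    return float((centered ** 3).mean() / values.std() ** 3)


raw_skew = skewness(amount)
log_skew = skewness(np.log10(amount))
print(raw_skew, log_skew)
```

On data like this the raw skewness is large and positive, while the skewness of the log-transformed values is close to zero.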

Correlation analysis

e.g. trend of the correlation coefficient between 陸金所's transaction volume and that of all platforms

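lujinData=platVolume[platVolume['wdzjPlatId']==59]
# pd.rolling_corr no longer exists in current pandas; use Series.rolling(...).corr(...) with a 50-day window
lujinData['amount'].rolling(50,min_periods=50).corr(allPlatDayData['amount']).plot(title='Rolling correlation between 陸金所 and all-platform transaction volume')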

Chart

QQ截圖20180123214114.png
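A rolling correlation of this kind can be reproduced on synthetic series. In the sketch below both series are randomly generated (the "platform" series is the "market" series scaled down plus noise), so the numbers say nothing about the real platforms; only the window of 50 and `min_periods=50` follow the article.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
index = pd.date_range("2017-01-01", periods=120, freq="D")
# Synthetic daily amounts: the platform series follows the market plus noise.
market = pd.Series(rng.normal(100.0, 10.0, size=120), index=index)
platform = 0.1 * market + pd.Series(rng.normal(0.0, 0.5, size=120), index=index)

# 50-day rolling Pearson correlation. The first 49 values are NaN because
# min_periods=50 requires a full window before a value is produced.
rolling_corr = platform.rolling(50, min_periods=50).corr(market)
print(rolling_corr.dropna().head())
```

Both series must share the same date index, since pandas aligns them by index before computing each window.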

Category comparison

Transaction volume of car-loan platforms compared with all platforms

carFinanceDayData=carFinanceData.resample('D').sum()['amount']
fig,axes=plt.subplots(nrows=1,ncols=2,sharey=True,figsize=(14,7))
carFinanceDayData.plot(ax=axes[0],title='Car-loan platform transaction volume')
allPlatDayData['amount'].plot(ax=axes[1],title='All P2P platform transaction volume')
QQ截圖20180123214359.png

Trend forecasting

e.g. forecasting the trend of 陸金所's transaction volume (done with Facebook's Prophet library)

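# Prophet is imported with `from prophet import Prophet` (the package was named fbprophet before 1.0)
lujinAmount=platVolume[platVolume['wdzjPlatId']==59].copy()  # copy so the new columns are not written to a slice
lujinAmount['y']=lujinAmount['amount']  # Prophet reads the value from column y
lujinAmount['ds']=lujinAmount['date']  # and the date from column ds
m=Prophet(yearly_seasonality=True)
m.fit(lujinAmount)
future=m.make_future_dataframe(periods=365)  # extend the timeline by 365 days
forecast=m.predict(future)
m.plot(forecast)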

Forecast chart

QQ截圖20180123214653.png
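Prophet requires its input frame to have a `ds` (date) column and a `y` (value) column. The input preparation can be sketched with pandas alone; the rows below are made up, and only the column names `wdzjPlatId`, `date`, and `amount` and the platform id 59 come from the article.

```python
import pandas as pd

# Synthetic stand-in for the daily volume table; values are invented.
plat_volume = pd.DataFrame({
    "wdzjPlatId": [59, 59, 7, 59],
    "date": ["2017-06-01", "2017-06-02", "2017-06-02", "2017-06-03"],
    "amount": [120.5, 98.0, 10.0, 133.2],
})

# Keep one platform, rename the columns to the names Prophet looks for,
# and make sure ds holds real datetimes rather than strings.
one_plat = plat_volume[plat_volume["wdzjPlatId"] == 59]
prophet_input = (
    one_plat.rename(columns={"date": "ds", "amount": "y"})[["ds", "y"]]
    .assign(ds=lambda frame: pd.to_datetime(frame["ds"]))
    .reset_index(drop=True)
)
print(prophet_input)
```

Renaming into a new frame avoids assigning columns on a filtered slice of the original DataFrame.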

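Data analysis code

Data analysis code link (note: the analysis code runs only under Python 3)
Sample output after running the code (the code and its charts can be viewed without installing a Python environment)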

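Afterword

This is the first project I wrote after moving from Java web development to data work, and also my first Python project. I ran into few pitfalls along the way; overall, the barrier to entry for web scraping, data analysis, and Python itself is very low.
To get started with Python web scraping, I recommend Web Scraping with Python (《Python網路資料採集》).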

s29086659.jpg

To get started with Python data analysis, I recommend Python for Data Analysis (《利用Python進行資料分析》).

30adcbef76094b360e72e763a9cc7cd98c109d58.jpg
