About the Data Sources
This project was written in early July 2017. It mainly uses Python to crawl data from Wangdaizhijia (wdzj.com) and Renrendai and then analyze it. Wangdaizhijia is the largest P2P data portal in China, and Renrendai is one of the top-twenty P2P lending platforms in the country. Source code link
Data Crawling
Traffic Capture and Analysis
Traffic was captured mainly with the Network tab of Chrome's developer tools. All of Wangdaizhijia's data comes back as JSON via Ajax, while Renrendai serves data both through Ajax responses and directly rendered in its HTML pages.
Request Example
The captured request shows the request method (GET or POST), the request headers, and the request parameters. The captured response shows the response format (JSON in this example), its structure, and the actual data. Note: this is the interface Wangdaizhijia's API backend uses today; it differs from the interface the crawler was written against, so the Wangdaizhijia part of the crawler no longer works.
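For reference, the old response body had roughly the shape sketched below; the field names are taken from the parsing and storage code further down, and the example values are placeholders:
import json

returnStr='{"data": {"platOuterVo": {"platName": "...", "platStatus": "...", "problemTime": "..."}}}'
platData=json.loads(returnStr)['data']['platOuterVo']
print(platData['platName'])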
Constructing Requests
Based on the results of the traffic analysis, construct the request. This project uses Python's requests library to simulate the HTTP requests. The code:
import requests

class SessionUtil():
    # a thin wrapper around requests.Session that carries a default set of headers
    def __init__(self,headers=None,cookie=None):
        self.session=requests.Session()
        if headers is None:
            headersStr={"Accept":"application/json, text/javascript, */*; q=0.01",
                "X-Requested-With":"XMLHttpRequest",
                "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36",
                "Accept-Encoding":"gzip, deflate, sdch, br",
                "Accept-Language":"zh-CN,zh;q=0.8"
                }
            self.headers=headersStr
        else:
            self.headers=headers
        self.cookie=cookie

    # send a GET request and return the response body as text
    def getReq(self,url):
        return self.session.get(url,headers=self.headers).text

    # attach a cookie header to subsequent requests
    def addCookie(self,cookie):
        self.headers['cookie']=cookie

    # send a POST request with form parameters
    def postReq(self,url,param):
        return self.session.post(url, param).text
When setting the request headers, the only field that really matters is "User-Agent". Neither Wangdaizhijia nor Renrendai had any anti-crawling measures; it was not even necessary to set the "Referer" field to avoid being rejected as a cross-site request.
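A minimal usage sketch of the helper above; the URL is the Wangdaizhijia endpoint used later in the crawler and the platId value is just an example (as noted, this interface no longer returns the old format):
session=SessionUtil()
body=session.getReq('http://wwwservice.wdzj.com/api/plat/platData30Days?platId=59')
print(body[:200])   # peek at the raw response text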
Crawler Example
Below is a complete crawler example:
import json
import time
import traceback

from databaseUtil import DatabaseUtil
from sessionUtil import SessionUtil
from dictUtil import DictUtil
from logUtil import LogUtil

# pull the platform detail object out of the JSON response
def handleData(returnStr):
    jsonData=json.loads(returnStr)
    platData=jsonData.get('data').get('platOuterVo')
    return platData
# write one platform's detail record into the problemPlatDetail table
def storeData(jsonOne,conn,cur,platId):
    # columns of problemPlatDetail, in insert order
    fields=['actualCapital','aliasName','association','associationDetail','autoBid',
        'autoBidCode','bankCapital','bankFunds','bidSecurity','bindingFlag',
        'businessType','companyName','credit','creditLevel','delayScore',
        'delayScoreDetail','displayFlg','drawScore','drawScoreDetail','equityVoList',
        'experienceScore','experienceScoreDetail','fundCapital','gjlhhFlag','gjlhhTime',
        'gruarantee','inspection','juridicalPerson','locationArea','locationAreaName',
        'locationCity','locationCityName','manageExpense','manageExpenseDetail',
        'newTrustCreditor','newTrustCreditorCode','officeAddress','onlineDate',
        'payment','paymode','platBackground','platBackgroundDetail',
        'platBackgroundDetailExpand','platBackgroundExpand','platEarnings',
        'platEarningsCode','platName','platStatus','platUrl','problem','problemTime',
        'recordId','recordLicId','registeredCapital','riskCapital','riskFunds',
        'riskReserve','riskcontrol','securityModel','securityModelCode',
        'securityModelOther','serviceScore','serviceScoreDetail','startInvestmentAmout',
        'term','termCodes','termWeight','transferExpense','transferExpenseDetail',
        'trustCapital','trustCreditor','trustCreditorMonth','trustFunds','tzjPj',
        'vipExpense','withTzj','withdrawExpense']
    values=[jsonOne.get(field) for field in fields]+[platId]
    # build the insert with %s placeholders (the MySQL drivers' parameter style) so the
    # driver handles quoting, instead of concatenating every value into the SQL string
    sql=('insert into problemPlatDetail ('+','.join(fields)+',platId) values ('
        +','.join(['%s']*len(values))+')')
    cur.execute(sql,values)
    conn.commit()
conn,cur=DatabaseUtil().getConn()
session=SessionUtil()
logUtil=LogUtil("problemPlatDetail.log")

# collect the ids of the problem platforms gathered earlier
cur.execute('select platId from problemPlat')
data=cur.fetchall()
print(data)
mylist=list()
for i in range(0,len(data)):
    platId=str(data[i].get('platId'))
    mylist.append(platId)
print(mylist)

# fetch and store the detail record of every problem platform
for i in mylist:
    url='http://wwwservice.wdzj.com/api/plat/platData30Days?platId='+i
    try:
        data=session.getReq(url)
        platData=handleData(data)
        dictObject=DictUtil(platData)
        storeData(dictObject,conn,cur,i)
    except Exception:
        traceback.print_exc()
cur.close()
conn.close()
Throughout the whole process we construct a request and then parse each response: JSON responses are parsed with the json library, and HTML pages are parsed with BeautifulSoup (for HTML pages with a complex structure, the lxml library is recommended). The parsed results are stored in a MySQL database.
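The example above only exercises the JSON path; a minimal sketch of the HTML path, assuming a placeholder URL and CSS selector rather than Renrendai's real page structure, could look like this:
from bs4 import BeautifulSoup

html=SessionUtil().getReq('https://www.renrendai.com/')   # placeholder page URL
soup=BeautifulSoup(html,'lxml')            # the lxml parser also copes well with messy markup
for cell in soup.select('table td'):       # placeholder selector
    print(cell.get_text(strip=True))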
Crawler Code
Crawler code link (note: the crawler code runs under both Python 2 and Python 3; I deployed it on an Aliyun server and ran it with Python 2.)
Data Analysis
The analysis is done mainly with Python's numpy, pandas, and matplotlib, supplemented by Haizhi BDP.
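The snippets in this section assume the usual imports:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt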
Time Series Analysis
Data Reading
In general the data is read into a pandas DataFrame for analysis. Below is an example that reads the problem-platform data:
import pandas as pd

problemPlat=pd.read_csv('problemPlat.csv',parse_dates=True)  # problem platforms
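For the date-based slicing and resampling used below, the date column has to end up as the DataFrame's DatetimeIndex. A fuller version of the read might look like the following; the column name problemTime is only an assumption based on the crawler's field list:
problemPlat=pd.read_csv('problemPlat.csv',parse_dates=['problemTime'],index_col='problemTime')  # assumed date column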
Data structure
Time series analysis
e.g. how the number of problem platforms changes over time
problemPlat['id']['2012':'2017'].resample('M',how='count').plot(title='P2P platforms running into problems')  # number of problem P2P platforms over time
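On pandas 0.18 and later, the how= keyword is written as a method call instead:
problemPlat['id']['2012':'2017'].resample('M').count().plot(title='P2P platforms running into problems')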
Graphical display:
Geographic Analysis
Done with Haizhi BDP (the Python tooling for plotting map distributions is fairly involved, and I had not learned it at the time).
Number of problem platforms by province
Platform transaction volume by province
Scale Distribution Analysis
e.g. the nationwide distribution of platform transaction volume in June. The code:
juneData['amount'].hist(normed=True)
juneData['amount'].plot(kind='kde',style='k--')  # probability distribution of June transaction volume
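On newer matplotlib releases the normed keyword used here has been renamed to density, so the same plot would be written as:
juneData['amount'].hist(density=True)
juneData['amount'].plot(kind='kde',style='k--')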
Kernel density plot:
Kernel density distribution of the log-transformed transaction volume:
np.log10(juneData['amount']).hist(normed=True)
np.log10(juneData['amount']).plot(kind='kde',style='k--')  # probability distribution after taking log10
Graphical display:
After taking log10, the distribution looks much closer to a regular, roughly normal pyramid shape.
Correlation Analysis
e.g. the trend of the rolling correlation between Lufax's transaction volume and the transaction volume of all platforms
lujinData=platVolume[platVolume['wdzjPlatId']==59]
corr=pd.rolling_corr(lujinData['amount'],allPlatDayData['amount'],50,min_periods=50).plot(title='Rolling correlation between Lufax volume and all-platform volume')
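pd.rolling_corr is the pre-0.18 pandas spelling; the equivalent with the newer rolling API is:
corr=lujinData['amount'].rolling(50,min_periods=50).corr(allPlatDayData['amount'])
corr.plot(title='Rolling correlation between Lufax volume and all-platform volume')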
Graphical display:
Category Comparison
Comparing the transaction volume of car-loan platforms with that of all platforms
carFinanceDayData=carFinanceData.resample('D').sum()['amount']
fig,axes=plt.subplots(nrows=1,ncols=2,sharey=True,figsize=(14,7))
carFinanceDayData.plot(ax=axes[0],title='Car-loan platform transaction volume')
allPlatDayData['amount'].plot(ax=axes[1],title='All P2P platform transaction volume')
Trend Forecasting
e.g. forecasting the trend of Lufax's transaction volume (done with the Facebook Prophet library)
from fbprophet import Prophet  # Prophet's import name at the time of writing

lujinAmount=platVolume[platVolume['wdzjPlatId']==59].copy()  # copy so the new columns do not touch platVolume
lujinAmount['y']=lujinAmount['amount']   # Prophet expects the value column to be named 'y'
lujinAmount['ds']=lujinAmount['date']    # ...and the date column to be named 'ds'
m=Prophet(yearly_seasonality=True)
m.fit(lujinAmount)
future=m.make_future_dataframe(periods=365)
forecast=m.predict(future)
m.plot(forecast)
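The forecast DataFrame returned by predict() carries the point forecast and its uncertainty interval, which can be inspected directly:
print(forecast[['ds','yhat','yhat_lower','yhat_upper']].tail())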
Graphical display of the trend forecast:
Data Analysis Code
Data analysis code link (note: the analysis code only runs under Python 3). A sample of the executed code (you can view the code and its graphical output without installing a Python environment).
Afterword
This is the first project I wrote after moving from Java web development toward data work, and it is also my first Python project. I did not run into many pitfalls along the way; overall, the barrier to entry for crawling, for data analysis, and for the Python language itself is quite low. If you want to get started with Python crawling, I recommend "Web Scraping with Python".
If you want to get started with Python data analysis, I recommend "Python for Data Analysis".