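Scraping Job Listings from Lagou

Posted by dta0502 on 2018-08-26

Original article: https://blog.csdn.net/dta0502/article/details/82083391

I later analyzed the scraped data in a follow-up post, "Lagou Python Job Analysis".

Lagou's anti-scraping measures are fairly strict, so the request headers need a few extra fields for the requests not to be identified as coming from a crawler.
Once we find the real request URL, we see that it returns a JSON string, so all we need to do is parse that JSON. Note that the data is submitted via POST, and paging is controlled by changing the value of pn in the Form Data.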

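Background knowledge

AJAX: Asynchronous JavaScript and XML. It is not a new programming language but a new way of using existing standards. By exchanging small amounts of data with the server in the background, AJAX lets a web page update asynchronously, so part of a page can be refreshed without reloading the whole page, which is how data gets loaded dynamically. Lagou loads its job listings through AJAX requests.
XHR: the XMLHttpRequest object is used to exchange data with the server.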

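Analyzing the page

After opening the Lagou home page, we type the keyword Python into the search box to look for Python-related jobs. On the search results page, we proceed in the following order:

Right-click and choose Inspect.
The developer tools open on the Elements tab by default.
Switch to the Network tab and refresh the page; the individual requests will show up.
Because the site loads its data asynchronously, open the XHR filter under Network and analyze the JSON data in the responses.

When we click a page number on the page, for example page 2, a POST request appears on the right. It contains the real URL (the URL in the browser's address bar carries no job data, as viewing the page source confirms), the request headers (Headers), and the submitted form (Form Data), which includes the page number pn and the search keyword kd.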

The real URL

Below is the real URL:
[Screenshot: the real request URL]

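Request headers

Below are the request headers (Headers) we need to construct. If they are not set up properly, the site can easily identify the request as coming from a crawler and refuse it.
[Screenshot: request headers]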

Form data

Below is the form data (Form Data) that must be included when we send the POST request.
[Screenshot: Form Data]

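The returned JSON data

The job information we need sits under content -> positionResult -> result, which includes the work location, company name, position title, and so on. This is the only part we need to save.
[Screenshot: JSON response]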
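As a rough illustration of walking that path, the sketch below pulls a few fields out of a response-shaped dictionary. The sample payload is made up and heavily trimmed (only the nesting and the field names positionName, companyShortName, city, and salary follow the real response), and pick_fields is a hypothetical helper, not part of the original code.

```python
def pick_fields(result, fields=("positionName", "companyShortName", "city", "salary")):
    """Return one small dict per job, keeping only the requested fields."""
    # Follow content -> positionResult -> result to reach the list of jobs.
    rows = result["content"]["positionResult"]["result"]
    # Missing fields become None instead of raising KeyError.
    return [{name: row.get(name) for name in fields} for row in rows]


# Made-up sample shaped like the response; the values are not real data.
sample = {
    "content": {
        "positionResult": {
            "result": [
                {"positionName": "Python Developer", "companyShortName": "ExampleCo",
                 "city": "Hangzhou", "salary": "10k-15k", "positionId": 1},
            ]
        }
    }
}

jobs = pick_fields(sample)
print(jobs)
```

Keeping only a handful of fields is just for readability here; the post itself stores every field the site returns.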

That completes the page analysis; now we can start scraping.

Scraping a single page

import requests
from fake_useragent import UserAgent
from lxml import etree
import csv
import json
import time
import pandas as pd

Building the request headers and form

Below we build the request headers (headers).

Host = "www.lagou.com"
Origin =  "https://www.lagou.com"
Referer = "https://www.lagou.com/jobs/list_Python?px=default&gx=&isSchoolJob=1&city=%E6%9D%AD%E5%B7%9E"

ua = UserAgent()

headers = {
    'User-Agent':ua.random,
    'Host':Host,
    'Origin':Origin,
    'Referer':Referer
}

Below we build the form (Form Data).

data= {
    'first': False,
    'pn': "1",
    'kd': 'Python'
}
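Since paging is driven only by the pn field, a small helper can produce the form for any page. This is a minimal sketch: form_for_page is a hypothetical name, and urlencode is used only to show what a form-encoded body for each page would look like.

```python
from urllib.parse import urlencode


def form_for_page(base_form, page):
    """Copy the base form and point it at the given page number."""
    form = dict(base_form)   # copy, so the shared dict is not mutated
    form["pn"] = str(page)   # pn is the page-number field seen in Form Data
    return form


base_form = {"first": False, "pn": "1", "kd": "Python"}

for page in (1, 2, 3):
    print(urlencode(form_for_page(base_form, page)))
```

Working on a copy keeps the original dictionary unchanged, which makes it easier to reuse for other keywords or pages.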

Below is the real URL.

url = "https://www.lagou.com/jobs/positionAjax.json?px=default&city=%E6%9D%AD%E5%B7%9E&needAddtionalResult=false&isSchoolJob=1"
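The city value %E6%9D%AD%E5%B7%9E in this URL is the percent-encoded UTF-8 form of 杭州 (Hangzhou). The sketch below rebuilds the same URL from its parts with the standard library; build_url is a hypothetical helper, and whether the site accepts other city names in this parameter is an assumption that has not been checked here.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.lagou.com/jobs/positionAjax.json"


def build_url(city):
    """Assemble the request URL, percent-encoding the city name."""
    params = {
        "px": "default",
        "city": city,                    # non-ASCII text is encoded as UTF-8 bytes
        "needAddtionalResult": "false",  # spelled exactly as in the original URL
        "isSchoolJob": "1",
    }
    return BASE_URL + "?" + urlencode(params)


print(build_url("杭州"))
```

Building the query string this way avoids pasting encoded text by hand when the search is repeated for another city.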

Fetching the page with requests

response = requests.post(url = url,headers = headers,data = data)

response.status_code

Parsing the response

result = response.json()

position = result['content']['positionResult']['result']

df = pd.DataFrame(position)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 46 columns):
adWord                   15 non-null int64
appShow                  15 non-null int64
approve                  15 non-null int64
businessZones            7 non-null object
city                     15 non-null object
companyFullName          15 non-null object
companyId                15 non-null int64
companyLabelList         15 non-null object
companyLogo              15 non-null object
companyShortName         15 non-null object
companySize              15 non-null object
createTime               15 non-null object
deliver                  15 non-null int64
district                 15 non-null object
education                15 non-null object
explain                  0 non-null object
financeStage             15 non-null object
firstType                15 non-null object
formatCreateTime         15 non-null object
gradeDescription         0 non-null object
hitags                   4 non-null object
imState                  15 non-null object
industryField            15 non-null object
industryLables           15 non-null object
isSchoolJob              15 non-null int64
jobNature                15 non-null object
lastLogin                15 non-null int64
latitude                 15 non-null object
linestaion               6 non-null object
longitude                15 non-null object
pcShow                   15 non-null int64
plus                     0 non-null object
positionAdvantage        15 non-null object
positionId               15 non-null int64
positionLables           15 non-null object
positionName             15 non-null object
promotionScoreExplain    0 non-null object
publisherId              15 non-null int64
resumeProcessDay         15 non-null int64
resumeProcessRate        15 non-null int64
salary                   15 non-null object
score                    15 non-null int64
secondType               15 non-null object
stationname              6 non-null object
subwayline               6 non-null object
workYear                 15 non-null object
dtypes: int64(13), object(33)
memory usage: 5.5+ KB

type(result)

dict

Scraping all pages

There are 10 pages in total, and we scrape all of them here.

First attempt

The code is as follows:

for page in range(1,11):
    data['pn'] = str(page)
    response = requests.post(url,headers = headers,data = data)
    result = response.json()
    print(result)
    position = result['content']['positionResult']['result']
    df = pd.DataFrame(position)
    if page == 1:
        total_df = df
    else:
        total_df = pd.concat([total_df,df],axis = 0)

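This produces the following error (the message means "You are making requests too frequently, please try again later"):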

{'success': False, 'msg': '您操作太頻繁,請稍後再訪問', 'clientIp': '121.248.50.24'}

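This probably triggered the site's anti-scraping mechanism, so the code needs to be improved.

Improved version

The main change is an added delay that slows down the scraping.

    if result['success']:
        position = result['content']['positionResult']['result']
        time.sleep(1)  # on success, wait 1s before the next request
        return position
    else:
        # message: "You are making requests too frequently, please try again later"
        print("您操作太頻繁,請稍後再訪問")
        time.sleep(10)  # on failure, wait 10s before trying again
        position = getPosition(url,headers,data,page)  # retry recursively
        return position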

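Below is the function that fetches the job information. It retries recursively after a failed request so that no page is lost.

def getPosition(url,headers,data,page):

    data['pn'] = str(page)  # select the page to request
    response = requests.post(url,headers = headers,data = data)
    result = response.json()
    if result['success']:
        position = result['content']['positionResult']['result']
        time.sleep(1)  # on success, wait 1s before the next request
        return position
    else:
        # message: "You are making requests too frequently, please try again later"
        print("您操作太頻繁,請稍後再訪問")
        time.sleep(10)  # on failure, wait 10s before trying again
        position = getPosition(url,headers,data,page)  # retry recursively
        return position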
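The recursive version retries without any limit, so a permanently blocked request would eventually hit Python's recursion limit. A possible alternative, sketched here and not taken from the original post, is a loop with a fixed number of attempts. The names get_position_retry, fetch, and make_fake_fetch are hypothetical; fetch stands in for the real POST request so the snippet runs offline, and sleep is injectable so the demo does not actually wait.

```python
import time


def get_position_retry(fetch, page, max_attempts=5, ok_delay=1, fail_delay=10,
                       sleep=time.sleep):
    """Call fetch(page) until it succeeds, giving up after max_attempts tries."""
    for attempt in range(1, max_attempts + 1):
        result = fetch(page)
        if result.get('success'):
            sleep(ok_delay)       # same 1s pause as the original on success
            return result['content']['positionResult']['result']
        if attempt < max_attempts:
            sleep(fail_delay)     # same 10s pause as the original on failure
    raise RuntimeError('page %d still refused after %d attempts' % (page, max_attempts))


def make_fake_fetch(failures):
    """Build a stand-in fetch that is refused `failures` times, then succeeds."""
    state = {'left': failures}

    def fetch(page):
        if state['left'] > 0:
            state['left'] -= 1
            return {'success': False, 'msg': 'throttled'}
        return {'success': True,
                'content': {'positionResult': {'result': [{'positionId': page}]}}}

    return fetch


waits = []  # records the requested pauses instead of sleeping
rows = get_position_retry(make_fake_fetch(2), page=3, sleep=waits.append)
print(rows, waits)
```

Against the real site, fetch could be something like lambda page: requests.post(url, headers=headers, data=dict(data, pn=str(page))).json(), reusing the url, headers, and data defined earlier.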

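Below is the scraping loop. It calls the getPosition function defined above and finally merges the scraped job information into a single Pandas DataFrame, which is convenient for saving later.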

for page in range(1,11):
    position = getPosition(url,headers,data,page)
    df = pd.DataFrame(position)
    if page == 1:
        total_df = df
    else:
        total_df = pd.concat([total_df,df],axis = 0)

您操作太頻繁,請稍後再訪問
您操作太頻繁,請稍後再訪問
您操作太頻繁,請稍後再訪問
您操作太頻繁,請稍後再訪問
您操作太頻繁,請稍後再訪問
您操作太頻繁,請稍後再訪問

total_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 142 entries, 0 to 6
Data columns (total 46 columns):
adWord                   142 non-null int64
appShow                  142 non-null int64
approve                  142 non-null int64
businessZones            86 non-null object
city                     142 non-null object
companyFullName          142 non-null object
companyId                142 non-null int64
companyLabelList         142 non-null object
companyLogo              142 non-null object
companyShortName         142 non-null object
companySize              142 non-null object
createTime               142 non-null object
deliver                  142 non-null int64
district                 141 non-null object
education                142 non-null object
explain                  0 non-null object
financeStage             142 non-null object
firstType                142 non-null object
formatCreateTime         142 non-null object
gradeDescription         0 non-null object
hitags                   14 non-null object
imState                  142 non-null object
industryField            142 non-null object
industryLables           142 non-null object
isSchoolJob              142 non-null int64
jobNature                142 non-null object
lastLogin                142 non-null int64
latitude                 142 non-null object
linestaion               50 non-null object
longitude                142 non-null object
pcShow                   142 non-null int64
plus                     0 non-null object
positionAdvantage        142 non-null object
positionId               142 non-null int64
positionLables           142 non-null object
positionName             142 non-null object
promotionScoreExplain    0 non-null object
publisherId              142 non-null int64
resumeProcessDay         142 non-null int64
resumeProcessRate        142 non-null int64
salary                   142 non-null object
score                    142 non-null int64
secondType               142 non-null object
stationname              50 non-null object
subwayline               50 non-null object
workYear                 142 non-null object
dtypes: int64(13), object(33)
memory usage: 52.1+ KB
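The index line in the output above (142 entries, 0 to 6) shows that each page keeps its own row numbers after pd.concat, so index values repeat. One way to avoid that, sketched here under the assumption that each page arrives as a list of dicts, is to gather all rows in a single list first. collect_pages and fake_page are hypothetical stand-ins so the snippet runs without network access.

```python
def collect_pages(get_page, pages):
    """Fetch every page and return all job records in one flat list."""
    all_rows = []
    for page in pages:
        all_rows.extend(get_page(page))  # each call returns a list of dicts
    return all_rows


def fake_page(page):
    """Stand-in for getPosition: two made-up records per page."""
    return [{'page': page, 'row': i} for i in range(2)]


all_rows = collect_pages(fake_page, range(1, 4))
print(len(all_rows))
```

The combined list can then be passed to pd.DataFrame once, which yields a fresh index running from 0 to n-1. Passing ignore_index=True to pd.concat in the original loop would have a similar effect.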

Finally, we write the result to a CSV file.

total_df.to_csv('Python-School-Hangzhou.csv', sep = ',', header = True, index = False)
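For comparison, the same kind of export can be done without pandas using the csv module imported at the top of the post. This is only a sketch on made-up rows: rows_to_csv is a hypothetical helper, and it writes to an in-memory buffer instead of a file.

```python
import csv
import io


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    # extrasaction='ignore' drops keys that are not listed in fieldnames.
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, extrasaction='ignore')
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up rows; the field names follow the columns listed earlier.
rows = [
    {'positionName': 'Python Developer', 'salary': '10k-15k', 'city': 'Hangzhou'},
    {'positionName': 'Data Engineer', 'salary': '8k-12k', 'city': 'Hangzhou'},
]

text = rows_to_csv(rows, ['positionName', 'salary'])
print(text)
```

Like index = False in the to_csv call, this output has no row-number column, only the header and the data.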