Scraping 58.com Resume Data with Python

Posted by 卓傑傑 on 2016-05-08

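I recently got a task that required collecting resume information from 58.com (http://gz.58.com/qzyewu/). My first idea was to build the crawler with Python's scrapy framework. While building it, however, I found that the page content was not being stored in the local variable response. When I loaded the page through the shell, the content did end up in response, but every xpath query for the data I needed returned an empty value. Since the data is all present in the page source, I instead downloaded the source, extracted the data with Python's BeautifulSoup, and inserted it into a database.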

Required Python packages: urllib2, BeautifulSoup, MySQLdb, re

First, fetch the whole page

#coding:utf-8
import urllib2
from BeautifulSoup import BeautifulSoup
url='http://jianli.58.com/resume/91655325401100'
content = urllib2.urlopen(url).read()
soup=BeautifulSoup(content)
print soup

url is the page to download.
urllib2.urlopen() opens the page.
read() reads the data returned from the url.
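For readers on Python 3, where urllib2 no longer exists, the same fetch step might look roughly like the sketch below. It is only an illustration: the User-Agent value, the timeout, and the helper names fetch_html and decode_html are assumptions of mine, not part of the original script.

```python
import urllib.request

# Hypothetical header value; the original script sends no custom headers.
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0 (resume-scraper-example)"}


def decode_html(raw, charset=None):
    """Decode response bytes, falling back to UTF-8 with replacement characters."""
    try:
        return raw.decode(charset or "utf-8")
    except (LookupError, UnicodeDecodeError):
        # Unknown charset name or invalid bytes: degrade instead of crashing.
        return raw.decode("utf-8", errors="replace")


def fetch_html(url, timeout=10):
    """Download one page and return its source as text."""
    request = urllib.request.Request(url, headers=DEFAULT_HEADERS)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Use the charset declared in the Content-Type header, if any.
        charset = response.headers.get_content_charset()
        return decode_html(response.read(), charset)
```

Calling fetch_html('http://jianli.58.com/resume/91655325401100') would play the role of urllib2.urlopen(url).read() above, except that it returns decoded text rather than raw bytes.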

Second, extract the data you want

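This step relies on regular expressions, which Python supports well through the re module. If you are not familiar with them, see this reference (http://www.runoob.com/regexp/regexp-syntax.html).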

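For example, suppose we want the name.

The browser console shows where the name sits in the page.
[Screenshot: the name element shown in the browser console]

A regular expression can match it. The code is as follows: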

name = re.findall(r'(?<=class="name">).*?(?=</span>)',str(soup))

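Running the program returns an empty result.

The regular expression itself is correct. Looking back at the soup returned earlier, we find that the source it contains differs from the source shown in the browser. A regular expression written against the browser's source therefore cannot match anything in the returned source, so the expression has to be written against the returned source instead.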
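The mismatch can be reproduced on a small scale. In the sketch below both HTML strings are made-up stand-ins (the real 58.com markup is not reproduced here): one imitates what the browser shows, the other what the server returns. The same lookbehind/lookahead pattern finds the name in the first and nothing in the second.

```python
import re

# Hypothetical markup: what the browser inspector shows vs. what the server sends.
BROWSER_HTML = '<span class="name">Zhang San</span>'
RAW_HTML = '<li class="item">Zhang San</li><li class="item">Male</li>'

# Pattern written while looking at the browser's version of the page.
NAME_PATTERN = re.compile(r'(?<=class="name">).*?(?=</span>)')
# Pattern written while looking at the downloaded source.
ITEM_PATTERN = re.compile(r'(?<=class="item">).*?(?=</li>)')

browser_match = NAME_PATTERN.findall(BROWSER_HTML)  # the name is found
raw_miss = NAME_PATTERN.findall(RAW_HTML)           # nothing matches
raw_match = ITEM_PATTERN.findall(RAW_HTML)          # both items are found

print(browser_match, raw_miss, raw_match)
```

Printing the downloaded source and writing the pattern against that text, as done in the next step, avoids this trap.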

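[Screenshot: the source returned in soup]

In the source returned in soup it is easy to find all of this person's basic information, and it all sits inside <li class="item"> tags. The findAll() method retrieves it directly: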

data = soup.findAll('li',attrs={'class':'item'})

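The code above produces the following result, which is a list:

[Screenshot: the list returned by findAll()]

This gives us the person's name, sex, age, work experience, and education.

The same approach can retrieve any other data you need from the page.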
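If BeautifulSoup is not available, a comparable extraction can be sketched with the standard library's html.parser. This is a substitute technique rather than the findAll() call used above, and the sample markup and the names ItemCollector and extract_items are invented for illustration.

```python
from html.parser import HTMLParser


class ItemCollector(HTMLParser):
    """Collect the text of every <li class="item"> element (no nested <li> support)."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "li" and "item" in classes:
            self._inside = True
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == "li":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.items[-1] += data


def extract_items(html):
    """Return the stripped text of each matching list item, in document order."""
    parser = ItemCollector()
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.items]


# Hypothetical sample standing in for the downloaded resume page.
SAMPLE = ('<ul><li class="item">Zhang San</li>'
          '<li class="item"> Male </li>'
          '<li class="other">skip</li></ul>')
print(extract_items(SAMPLE))
```

Unlike the regex approach, this yields the text content directly, so no second pass is needed to strip the surrounding tags.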

Third, save the data to the database

I use MySQL, so the examples here use MySQL.

Connect to the database

conn = MySQLdb.Connect(
            host = '127.0.0.1',
            port = 3306,user = 'root',
            passwd = 'XXXXX',
            db = 'XXXXX',
            charset = 'utf8')
cursor = conn.cursor()

Because Chinese text will be stored, the connection charset is set to utf8.

Build the insert statement

sql_insert = """insert into resume(
ID,name,sex,age,experience,education,pay,ad
,job,job_experience,education_experience)
values(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)"""

Insert the data

cursor.execute(sql_insert,(ID,name,sex,age,experience,education
,pay,ad,job,job_experience,education_experience))
conn.commit()
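The placeholder-based insert can be tried without a MySQL server. The sketch below swaps in the standard library's sqlite3 module with an in-memory database, so the placeholder is ? instead of MySQLdb's %s. The table schema, the six-column subset, and the sample row are assumptions for illustration only.

```python
import sqlite3

# A subset of the columns used in the article; the real table has eleven.
COLUMNS = ("ID", "name", "sex", "age", "experience", "education")


def save_resume(conn, row):
    """Insert one resume row via placeholders; commit on success, roll back on error."""
    placeholders = ", ".join("?" for _ in COLUMNS)  # sqlite3 uses ?, MySQLdb uses %s
    sql = "INSERT INTO resume ({}) VALUES ({})".format(", ".join(COLUMNS), placeholders)
    try:
        # Values travel separately from the SQL text, so quotes need no escaping.
        conn.execute(sql, [row[name] for name in COLUMNS])
        conn.commit()
    except sqlite3.Error:
        conn.rollback()
        raise


conn = sqlite3.connect(":memory:")
# Assumed schema; the article does not show the real table definition.
conn.execute(
    "CREATE TABLE resume (ID TEXT PRIMARY KEY, name TEXT, sex TEXT, "
    "age TEXT, experience TEXT, education TEXT)"
)
save_resume(conn, {
    "ID": "example-1", "name": "O'Brien", "sex": "Male",
    "age": "25", "experience": "1-2 years", "education": "Bachelor's degree",
})
stored = conn.execute("SELECT name, education FROM resume").fetchall()
conn.close()
print(stored)
```

The apostrophes in the sample values are stored intact, which is the main reason to pass values as parameters instead of formatting them into the SQL string.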

Close the database connection

cursor.close()
conn.close()

Run the program

It fails with an error:

(1064, "You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '))' at line 1")

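When this error appears and the SQL syntax itself is correct, the cause is usually an encoding problem.
Our database uses utf8, so the inserted data is the likely culprit.
Re-encoding the returned data with decode() and encode() fixes it: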

name = data[0].decode('utf-8').encode('utf-8')

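This simple step resolved the error caused by the mismatch between the database encoding and the data encoding.

Why do the encodings differ?
BeautifulSoup holds the scraped text as Unicode, and Python 2 falls back to the ascii codec whenever such text is converted implicitly. The database expects utf8, so the insert fails unless the scraped values are explicitly re-encoded as UTF-8 first.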
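In Python 3 text (str) and bytes are separate types, which makes the encoding step easier to see. The sketch below is an illustration under that assumption, not the Python 2 code used in this article, and the helper name to_utf8_bytes and the sample value are made up.

```python
def to_utf8_bytes(value):
    """Return value as UTF-8 encoded bytes, whatever text-like type came in."""
    if isinstance(value, bytes):
        # Round trip validates the input: raises UnicodeDecodeError if not UTF-8.
        return value.decode("utf-8").encode("utf-8")
    return str(value).encode("utf-8")


name_text = "張三"                       # hypothetical scraped value
name_bytes = name_text.encode("utf-8")   # three bytes per character here

# Chinese text cannot be represented in ASCII at all.
try:
    name_text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(len(name_bytes), ascii_ok)
```

Note that for input that is already valid UTF-8 the decode/encode round trip returns the same bytes, so the step that matters is turning the scraped value into a plain UTF-8 string in the first place.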

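Result

[Screenshot: the scraped results]

These are the results of my scrape, and they look quite good. The speed is roughly one page per second. That is much slower than scrapy, but BeautifulSoup and urllib2 are simple to use and well suited to beginners for practice.

Appendix: code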

#coding:utf-8
import urllib2
from BeautifulSoup import BeautifulSoup
import re
import MySQLdb

# Download the resume page and parse it
url = 'http://jianli.58.com/resume/91655325401100'
content = urllib2.urlopen(url).read()
soup = BeautifulSoup(content)

# Basic information lives in <li class="item"> elements
basedata = str(soup.findAll('li',attrs={'class':'item'}))
basedata = re.findall(r'(?<=class="item">).*?(?=</li>)',basedata)

# The resume ID is taken from the inline JavaScript (global.ids)
ID = str(soup.findAll('script',attrs={'type':'text/javascript'}))
ID = re.findall(r'(?<=global.ids = ").*?(?=";)',ID)
ID = ID[0].decode('utf-8').encode('utf-8')

name = basedata[0].decode('utf-8').encode('utf-8')
sex = basedata[1].decode('utf-8').encode('utf-8')
age = basedata[2].decode('utf-8').encode('utf-8')
experience = basedata[3].decode('utf-8').encode('utf-8')
education = basedata[4].decode('utf-8').encode('utf-8')

# pay: the first <dd> whose text starts with digits
pay = str(soup.findAll('dd',attrs={None:None}))
pay = re.findall(r'(?<=<dd>)\d+.*?(?=</dd>)',pay)
pay = pay[0].decode('utf-8').encode('utf-8')

# ad and job: link text inside <dd> elements
expectdata = str(soup.findAll('dd',attrs={None:None}))
expectdata = re.findall(r'''(?<=["']>)[^<].*?(?=</a></dd>)''',expectdata)
ad = expectdata[0].decode('utf-8').encode('utf-8')
job = expectdata[1].decode('utf-8').encode('utf-8')

# Work history: all text nodes inside <div class="employed">
job_experience = str(soup.findAll('div',attrs={'class':'employed'}))
job_experience = re.findall(r'(?<=>)[^<].*?(?=<)',job_experience)
job_experience = ''.join(job_experience).decode('utf-8').encode('utf-8')

# Education history: text following <dd><p>
education_experience = str(soup.findAll('dd',attrs={None:None}))
education_experience = re.findall(r'(?<=<dd><p>).*\n.*\n?.*',education_experience)
education_experience = ''.join(education_experience).decode('utf-8').encode('utf-8')

# Connect to MySQL; charset utf8 is needed for Chinese text
conn = MySQLdb.Connect(
            host = '127.0.0.1',
            port = 3306,user = 'root',
            passwd = 'XXXXX',
            db = 'XXXX',
            charset = 'utf8')
cursor = conn.cursor()
sql_insert = "insert into resume(ID, name,sex,age,experience,education,pay,ad,job,job_experience,education_experience)\
values(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)"

# Insert the row; roll back on failure and always release the connection
try:
    cursor.execute(sql_insert, (ID, name,sex,age,experience,education,pay,ad,job,job_experience,education_experience))
    conn.commit()
except Exception as e:
    print e
    conn.rollback()
finally:
    cursor.close()
    conn.close()
