Web Scraping in Practice: Scraping Rental Listings from 58同城 (58.com)

Posted by XJTU_WJGao on 2019-12-04

Original article: https://blog.csdn.net/weixin_43333043/article/details/103395861

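Background

I taught myself Python over the summer and took some web-scraping courses on 中國大學MOOC (China University MOOC), which gave me a basic understanding of the requests, re, and BeautifulSoup libraries. I had never done a real scraping project, though. My family happened to need rental data, so I scraped the rental listings on 58同城 and treated it as my first hands-on scraper. My skills are still limited: it took me two days, a lot of borrowed code, and a lot of documentation. Enough talk, here is the code.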

Code and Results

from bs4 import BeautifulSoup
from fontTools.ttLib import TTFont
import requests
import json
import time
import re
import base64
import openpyxl

User_Agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11'
headers = {
    'User-Agent': User_Agent,
}


def convertNumber(html_page):
    # Replace the obfuscated digit entities in the page with real digits.
    base_fonts = ['uni9FA4', 'uni9F92', 'uni9A4B', 'uni9EA3', 'uni993C', 'uni958F', 'uni9FA5', 'uni9476', 'uni9F64',
                  'uni9E3A']
    base_fonts2 = ['&#x' + x[3:].lower() + ';' for x in base_fonts]  # build entities of the form &#x9e3a;
    pattern = '(' + '|'.join(base_fonts2) + ')'

    font_base64 = re.findall("base64,(AA.*AAAA)", html_page)[0]  # the base64-encoded font file embedded in the page
    font = base64.b64decode(font_base64)
    with open('58font2.ttf', 'wb') as tf:
        tf.write(font)
    onlinefont = TTFont('58font2.ttf')
    # convert_dict maps code points to glyph names, for example:
    # {40611: 'glyph00004', 40804: 'glyph00009', 40869: 'glyph00010', 39499: 'glyph00003', ...}
    convert_dict = onlinefont['cmap'].tables[0].cmap
    new_page = re.sub(pattern, lambda x: getNumber(x.group(), convert_dict), html_page)
    return new_page


def getNumber(g, convert_dict):
    # g looks like '&#x9ea3;'. Take the four hex digits and convert them to
    # an integer, which is the key into convert_dict.
    key = int(g[3:7], 16)
    # glyph00009 stands for the digit 8, glyph00008 for 7, and so on.
    number = int(convert_dict[key][-2:]) - 1
    return str(number)
               

def request_details(url):
    f = requests.get(url, headers=headers)
    f = convertNumber(f.text)
    soup = BeautifulSoup(f, 'lxml')
    name = soup.select('h1[class="c_333 f20 strongbox"]')
    price = soup.select('b[class="f36 strongbox"]')
    # The labels below are matched against the page text, which 58同城 serves
    # in Simplified Chinese, so they must be written in that form.
    # A missing field is stored as an empty string.
    try:
        lease = soup.find(text='租赁方式：').findNext('span').contents[0]
    except Exception:
        lease = ""
    try:
        housing_type = soup.find(text='房屋类型：').findNext('span').contents[0]
        housing_type = "".join(housing_type.split())  # strip all whitespace
    except Exception:
        housing_type = ""
    try:
        oritation = soup.find(text='朝向楼层：').findNext('span').contents[0]
        oritation = "".join(oritation.split())
    except Exception:
        oritation = ""
    try:
        community = soup.find(text='所在小区：').findNext('span').contents[0].contents[0]
    except Exception:
        community = ""
    try:
        building_type = soup.find(text='建筑类型：').findNext('span').contents[0]
    except Exception:
        building_type = ""
    try:
        address = soup.find(text='详细地址：').findNext('span').contents[0]
        address = "".join(address.split())
    except Exception:
        address = ""
    data = {
        '名稱': name[0].text,
        '價格': price[0].text.strip() + "元",
        '租賃方式': lease,
        '房屋型別': housing_type,
        '朝向樓層': oritation,
        '所在小區': community,
        '建築型別': building_type,
        '詳細地址': address,
        '網址': url
    }
    return data
  

def get_link(url):
    # Return the detail-page links found on one list page.
    f = requests.get(url, headers=headers)
    soup = BeautifulSoup(f.text, 'lxml')
    links = soup.select('div[class="des"]>h2>a')
    link_list = []
    for link in links:
        link_content = link.get('href')
        link_list.append(link_content)
    return link_list


def save_to_text(content, i):
    # Write one record into row i of the spreadsheet.
    mywb = openpyxl.load_workbook('SHEET2.xlsx')
    sheet = mywb.active
    # Cell values are written as str, so no manual encoding is needed.
    sheet['A' + str(i)].value = content['名稱']
    sheet['B' + str(i)].value = content['價格']
    sheet['C' + str(i)].value = content['租賃方式']
    sheet['D' + str(i)].value = content['房屋型別']
    sheet['E' + str(i)].value = content['朝向樓層']
    sheet['F' + str(i)].value = content['所在小區']
    sheet['G' + str(i)].value = content['建築型別']
    sheet['H' + str(i)].value = content['詳細地址']
    sheet['I' + str(i)].value = content['網址']
    mywb.save('SHEET2.xlsx')
    # Also append the record as one JSON line to the text file '58'.
    content = json.dumps(content, ensure_ascii=False)
    with open('58', 'a', encoding='utf-8') as f:
        f.write(content)
        f.write('\r\n')


def main():
    # Create the workbook and write the header row (row 1).
    mywb = openpyxl.Workbook()
    sheet = mywb.active
    sheet.append(["名稱", "價格", "租賃方式", "房屋型別", "朝向樓層",
                  "所在小區", "建築型別", "詳細地址", "網址"])
    mywb.save('SHEET2.xlsx')
    num = 1  # row 1 holds the header, so the first record goes to row 2
    link = 'https://fy.58.com/qhdlfyzq/chuzu/pn{}'
    start = 1
    end = 11
    urls = [link.format(i) for i in range(start, end)]  # list pages 1 to 10
    for url in urls:
        link_list = get_link(url)
        print(link_list)
        time.sleep(5)  # wait between requests to avoid triggering the CAPTCHA
        for detail_link in link_list:
            if detail_link != "https://e.58.com/all/zhiding.html":
                try:
                    content = request_details(detail_link)
                    num = num + 1
                    print(content)
                    save_to_text(content, num)
                except Exception as e:
                    print("error", e)
                time.sleep(5)


if __name__ == '__main__':
    main()


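Summary

Difficulties

Pagination: following other people's code, I append /pn{} (page number) to the list URL.
To avoid being shown a CAPTCHA, consecutive requests need to be at least 5 s apart. 58同城 identifies visitors by IP address, so once an IP is locked you have to open the page manually in a browser and complete the verification.
58同城 obfuscates digits in the page source, so many numbers show up as garbled characters. I referred to the post "58同城南京資料爬取" (scraping 58同城 data for Nanjing) and borrowed some of its code to solve this.
BeautifulSoup features such as find(), next_sibling, and contents made it much quicker to locate the useful information.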
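
The digit obfuscation is the least obvious of these points, so here is a minimal sketch of the decoding step on its own. It uses only the standard library and skips the font download: SAMPLE_CMAP is a hypothetical stand-in for the font's cmap table, filled with the four entries shown in the convert_dict comment in the code above, and decode_digits is an illustrative name rather than part of the original script. The rule that glyph000NN stands for the digit NN - 1 is taken from the article's code; the site may well use a different font or rule today.

```python
import re

# Hypothetical cmap excerpt (code point -> glyph name), copied from the
# sample in the article's convert_dict comment. A real page embeds its own
# font, so these values are illustrative only.
SAMPLE_CMAP = {
    40611: "glyph00004",  # 0x9EA3
    40804: "glyph00009",  # 0x9F64
    40869: "glyph00010",  # 0x9FA5
    39499: "glyph00003",  # 0x9A4B
}

# Matches numeric character references such as &#x9ea3;
ENTITY = re.compile(r"&#x([0-9a-fA-F]{4});")


def decode_digits(html, cmap):
    """Replace obfuscated entities with the digits their glyph names encode."""

    def replace(match):
        code_point = int(match.group(1), 16)
        glyph = cmap.get(code_point)
        if glyph is None:
            # Not one of the obfuscated digits: leave the entity untouched.
            return match.group(0)
        # Per the article: glyph00009 is the digit 8, glyph00008 is 7, ...
        return str(int(glyph[-2:]) - 1)

    return ENTITY.sub(replace, html)


decoded = decode_digits("&#x9ea3;&#x9f64;&#x9fa5;&#x9a4b; yuan", SAMPLE_CMAP)
print(decoded)
```

Looking the glyph up with cmap.get() instead of indexing means an entity that is not in the font is passed through unchanged rather than raising KeyError.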

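Areas Where I Need to Improve

As a Python beginner, character encoding is what I most need to work on. I hit encoding errors many times while debugging, and although the code runs in the end, parts of it are redundant.
The code uses the openpyxl module at the end, and I still need more practice with it.
Exception handling is very important, but I have not mastered it yet.
The code's tidiness and readability still need a lot of work.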
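
On the encoding point, much of the confusion in Python 3 tends to come from mixing str (text) with bytes (encoded data). The sketch below is an illustration under that assumption, not a diagnosis of the specific errors hit while debugging: it shows that encoding is only needed at the I/O boundary, and that passing encoding='utf-8' to open() handles it. The file name records.jsonl and the sample record are made up for the example.

```python
import json
import os
import tempfile

label = "租赁方式"

encoded = label.encode("utf-8")     # str -> bytes, 3 bytes per character here
restored = encoded.decode("utf-8")  # bytes -> str

print(type(label).__name__, len(label))
print(type(encoded).__name__, len(encoded))

# When writing text files, keep working with str and let open() encode.
record = {label: "整租"}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "records.jsonl")  # hypothetical file name
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    with open(path, encoding="utf-8") as f:
        loaded = json.loads(f.readline())

print(loaded == record)
```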

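Room for Improvement in the Code

Make the code more concise and cleaner.
Rework the exception handling.
In the main function, collect all the detail-page URLs in one pass first, so that duplicates can be dropped.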
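
The last idea can be sketched as follows. This is one possible shape, not the author's implementation: collect_links and FAKE_PAGES are hypothetical names, and the fetch_links argument stands in for get_link() so that the example runs without touching the network. Every list page is visited first, links already seen are dropped while the original order is kept, and the link that main() skips is filtered out in the same pass.

```python
# The link that main() skips in the article's code.
SKIPPED_LINK = "https://e.58.com/all/zhiding.html"


def collect_links(page_urls, fetch_links):
    """Gather detail links from every list page, keeping first occurrences only."""
    seen = set()
    unique = []
    for page_url in page_urls:
        for link in fetch_links(page_url):
            if link == SKIPPED_LINK or link in seen:
                continue
            seen.add(link)
            unique.append(link)
    return unique


# Stand-in for get_link(): canned results instead of real requests.
FAKE_PAGES = {
    "pn1": ["https://example.com/a", SKIPPED_LINK, "https://example.com/b"],
    "pn2": ["https://example.com/b", "https://example.com/c"],
}

links = collect_links(["pn1", "pn2"], FAKE_PAGES.__getitem__)
print(links)
```

In the real script, fetch_links would be get_link wrapped with the 5-second pause, and the resulting list would then be passed to request_details one link at a time.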
