001.01 General Web Scraping Example

Posted by Jason990420 on 2019-08-06

Original URL: https://learnku.com/articles/32178?order_by=created_at&

Created: 2019/07/28
Updated: 2019/03/28 (error corrections)

Win 10

Python 3.7.4

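Here we will not use BeautifulSoup for web scraping, and we will rely on other libraries as little as possible. Writing programs with someone else's library is simple and fast, the code is short, and you do not even need to understand the internals. The drawback is that, without knowing those details, it is easy to make mistakes and you never really learn how the work is actually done.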

Preparation for Web Scraping

Open the website, for example https://www.51job.com/, and choose the job search (職位搜尋) at the top.

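More than 100,000 job listings appear. We do not need that much data, so add a few conditions to filter first. Selecting the options as shown below gives a smaller result set; at the time of writing there were 491 positions spread over 10 pages.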

[Figure: filter options selected on the job search page]

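Compare the URLs of these ten pages to see how they differ, then create a UTF-8 text file, url.txt, containing the ten URLs. Some parameters differ from one URL to the next; you have to confirm these yourself (see the figure below).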

[Figure: the url.txt file listing the ten URLs]
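
If the ten URLs turn out to differ only in a page number, the list can be generated rather than copied by hand. The following is a minimal sketch under that assumption. URL_TEMPLATE is a hypothetical placeholder, not the real 51job address, and build_urls and write_url_file are names made up for this illustration; substitute the pattern you actually see in the browser's address bar.

```python
# Hypothetical template (an assumption, not the real 51job URL): paste the
# real search URL here and put {page} where the page number appears.
URL_TEMPLATE = "https://example.com/search?page={page}"


def build_urls(template, pages):
    """Return one URL per page number, from 1 to pages inclusive."""
    return [template.format(page=n) for n in range(1, pages + 1)]


def write_url_file(path, urls):
    """Write the URLs to a UTF-8 text file, one per line."""
    with open(path, "wt", encoding="utf-8") as url_file:
        url_file.write("\n".join(urls) + "\n")


urls = build_urls(URL_TEMPLATE, 10)
print(len(urls), "URLs generated; first one:", urls[0])
```

Calling write_url_file("url.txt", urls) would then produce the file. If more than one parameter changes between pages, it is probably safer to copy the URLs manually.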

Point the mouse at the first field of the table, right-click, and choose "Inspect Element" (檢查元素) to examine the HTML.

Note where the table starts and ends, and find the distinctive tags or other recognizable strings that mark where the table data sits.

Within the table, find the distinctive tags or other recognizable strings that precede each piece of data.

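Here we take '<div class="el title">' as the starting point of the table header and then read the data between '>' and '<'; these are the table's five column titles. For each row that follows, the starting points are two occurrences of 'a title' and three of 'span class'; the data between '>' and '<' after each one gives the five fields of that row.
Note: the HTML received can differ somewhat between systems and browsers. For example, in my case the first two fields do not start with 'a title' but with 'a target'; everything else is the same.
The steps to extract the data from this page are therefore:
- Extract the HTML from '<!--列表表格-->' to '<!--列表表格 END-->'
- Find '<div class="el title">' and take the five column titles from between '>' and '<'
- Read each record
  - Find 'a target' and take the data between '>' and '<'; repeat twice
  - Find 'span class' and take the data between '>' and '<'; repeat three times
- Repeat the above to read every following record
- When the starting string can no longer be found, the page has been fully read; move on to the next page until all pages are done.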
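
The find-then-slice idea in these steps can be tried on a small fragment before touching a real page. The sketch below is only an illustration: the helper text_after is a hypothetical name, and the sample row is invented to resemble the structure described above rather than copied from the site.

```python
def text_after(html, marker, start=0):
    """Find marker at or after start, then return the text between the next
    '>' and the following '<', together with the position to continue from.
    Returns (None, -1) if the marker or either delimiter is missing."""
    pos = html.find(marker, start)
    if pos < 0:
        return None, -1
    open_end = html.find(">", pos + len(marker))
    if open_end < 0:
        return None, -1
    close_start = html.find("<", open_end + 1)
    if close_start < 0:
        return None, -1
    return html[open_end + 1:close_start].strip(), close_start


# Invented sample row (an assumption about the general shape of the HTML)
sample = (
    '<div class="el">'
    '<a target="_blank" href="#"> Python Engineer </a>'
    '<a target="_blank" href="#">Example Co.</a>'
    '<span class="t3">Shanghai</span>'
    '<span class="t4">10-15k</span>'
    '<span class="t5">08-06</span>'
    '</div>'
)

# Two 'a target' fields followed by three 'span class' fields
row, pos = [], 0
for marker, times in (("a target", 2), ("span class", 3)):
    for _ in range(times):
        value, pos = text_after(sample, marker, pos)
        row.append(value)
print(row)
```

Returning the new position from the helper keeps the scan moving forward, so each search resumes after the previous field instead of starting over.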

Program Example

import urllib.request
import urllib.error
import datetime

# Use the current time as part of the output file name
date = datetime.datetime.now()
url_list_file = "url.txt"
data_file_name = "Web Data " + date.strftime("%Y%m%d%H%M%S") + ".csv"

# Encoding used both to decode the web pages and to save the file
decoder = "GB18030"

# Markers for the start and end of the table in the page
table_format = ['<!--列表表格-->','<!--列表表格 END-->']
area_start = table_format[0]
area_stop  = table_format[1]

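# Find a string at or after the current pointer and move the pointer past it
def find_skip(in_string, string1):
    global pointer, string_found
    found_at = in_string.find(string1, pointer)
    # Not found: the table has ended, so leave the pointer where it is
    if found_at < 0:
        string_found = False
        return
    pointer = found_at + len(string1)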

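# Starting from string1, read the data between two markers (usually '>' and
# '<') and repeat n times; data_follow is accepted but currently unused
def get_n_data(in_string, string1, string2, string3, n, data_follow):
    global pointer, string_found, table

    # Repeat n times
    for i in range(n):

        # Skip past the starting string
        find_skip(in_string, string1)

        # Stop if the starting string was not found (end of table)
        if not string_found:
            return

        # Find the first marker; the data starts right after it
        find_skip(in_string, string2)
        data_start = pointer

        # Find the second marker; the data ends right before it
        find_skip(in_string, string3)
        if not string_found:
            return
        data_stop  = pointer - len(string3)

        # Append the extracted data to the list
        data = in_string[data_start : data_stop]
        table.append(data.strip())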

# Write the list to a file in CSV text format
def save_table(data_file_name, table):

    # Nothing to save if the table is empty
    if not table:
        return

    # Five fields per row, separated by ',' and terminated by '\n'.
    # Fields are not quoted, so a value containing ',' will shift columns.
    rows = [table[i:i + 5] for i in range(0, len(table), 5)]
    table_data = "".join(",".join(row) + "\n" for row in rows)

    # Create the file and write the data; the with block closes it
    with open(data_file_name, 'wt', encoding=decoder) as datafile:
        datafile.write(table_data)

table = []

# Read every URL listed in the URL file
with open(url_list_file, mode='rt', encoding='utf-8-sig') as url_list:

    for url in url_list:

        # Skip blank lines in the URL file
        url = url.strip()
        if not url:
            continue

        # Load the page and decode its HTML
        try:
            web_page = urllib.request.urlopen(url)
        except urllib.error.URLError:
            print('Web page loading failed !!!')
            raise SystemExit(1)
        html = web_page.read().decode(decoder)

        # Keep only the table portion of the page
        data_area_start = html.find(area_start)
        data_area_stop  = html.find(area_stop)
        data_html       = html[data_area_start : data_area_stop]
        del html

        pointer = 0

        # The column titles are deliberately skipped so that the titles of
        # several pages do not get mixed in with the data:
        # find_skip(data_html, '<div class="el title">')
        # get_n_data(data_html, 'span class', '>', '<', 4, ',')
        # get_n_data(data_html, 'span class', '>', '<', 1, '\n')

        # Becomes False once a starting string is not found (end of table)
        string_found = True

        # Read every record: find the starting string, then take the data in >......<
        while string_found:

            # 'a target' twice, taking the data in >......<
            get_n_data(data_html, 'a target', '>', '<', 2, ',')

            # End of table?
            if string_found:

                # 'span class' three times, taking the data in >......<
                get_n_data(data_html, 'span class', '>', '<', 3, '\n')

# Save the table
save_table(data_file_name, table)

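Note: because local and global variables are easily confused, which leads to errors, I declared the frequently used variables with global. In a test run, the program read the ten pages and obtained 491 records, a file of about 34 KB, and took a full 30 seconds.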
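
One alternative to global variables is to keep the scan position inside a small object. This is a sketch of that design, not the approach the program above takes; the class name Cursor and its methods skip and between are hypothetical, and the sample string is invented.

```python
class Cursor:
    """Holds the text being scanned, the read position, and a found flag."""

    def __init__(self, text):
        self.text = text
        self.pos = 0
        self.found = True

    def skip(self, marker):
        """Move just past marker; clear self.found if marker is absent."""
        idx = self.text.find(marker, self.pos)
        if idx < 0:
            self.found = False
        else:
            self.pos = idx + len(marker)
        return self.found

    def between(self, start, open_mark=">", close_mark="<"):
        """Skip to start, then return the text between the two marks,
        or None once any of the three strings can no longer be found."""
        if not (self.skip(start) and self.skip(open_mark)):
            return None
        begin = self.pos
        if not self.skip(close_mark):
            return None
        return self.text[begin:self.pos - len(close_mark)].strip()


# Invented sample text for illustration
cur = Cursor('<a target="_blank">Job A</a><span class="t3">Beijing</span>')
first = cur.between("a target")
second = cur.between("span class")
third = cur.between("span class")   # nothing left to find
print(first, second, third, cur.found)
```

Because each page gets its own Cursor, the position and the found flag are reset simply by creating a new object, and no function needs a global statement.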

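The data obtained is saved as a .csv file with ',' separating the fields. Import it into Excel as external data and you get close to five hundred records. What you do with them next is up to you, for example using Excel to find the positions whose title contains '經理' (manager).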
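
The same kind of filtering can also be done in Python with the standard csv module. The sketch below assumes the job title is the first column, which the text above does not state explicitly; rows_with_keyword is a hypothetical helper and the two sample rows are invented stand-ins for the saved file.

```python
import csv
import io


def rows_with_keyword(lines, keyword, column=0):
    """Return the CSV rows whose given column contains keyword."""
    return [row for row in csv.reader(lines)
            if len(row) > column and keyword in row[column]]


# Invented sample data standing in for the saved .csv file
sample = io.StringIO(
    "銷售經理,Example Co.,Shanghai,10-15k,08-06\n"
    "Python Engineer,Sample Ltd.,Beijing,15-20k,08-05\n"
)
matches = rows_with_keyword(sample, "經理")
print(matches)
```

For the real output, the file would be opened with open(path, "rt", encoding="GB18030", newline="") and passed in place of sample, since the program saves it in that encoding.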

--- The End ---

This work is licensed under the 《CC 協議》 (CC license); reprints must credit the author and link to this article.

Jason Yang
