Python 3 | Analyzing Web Page Elements with a Simple Crawler

Posted by weixin_34393428 on 2018-11-30

Original URL: https://blog.csdn.net/weixin_34393428/article/details/88434833

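Today I was handed an ad-hoc task: check whether the elements on a set of web pages are complete, for example whether each page's main content exists. The input is an Excel file containing the URLs of the pages to check.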

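First, a quick analysis of the job: read each URL from the Excel file, request it to get the full page, use a tag id to check whether the corresponding content area actually contains anything, and then record the result next to that URL's row in Excel.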

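Packages involved

Reading Excel: xlrd 1.1.0
Writing Excel: xlwt 1.3.0
HTTP requests: urllib3 1.24.1 (the code below actually uses the standard-library urllib.request)
HTML parsing: BeautifulSoup from bs4 4.6.3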

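About the packages

The two Excel packages below differ slightly from openpyxl, which I introduced earlier: openpyxl only handles files from Excel 2007 onward, that is, files with the .xlsx extension. This task uses the older .xls format.

1. xlrd (official documentation: xlrd)

As before, let's start with the sample code.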


import xlrd


# Read an Excel file into a list of rows
def readexcel(path):
    wb = xlrd.open_workbook(path)
    sheet = wb.sheet_by_index(0)  # first sheet in the workbook
    rownum = sheet.nrows
    colnum = sheet.ncols
    listdata = []
    for i in range(rownum):
        rowlist = []
        for j in range(colnum):
            rowlist.append(sheet.cell_value(i, j))
        listdata.append(rowlist)
    return listdata

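The logic here: read the Excel file, store its contents in a list (one inner list per row), and return it.

Where:

xlrd.open_workbook(filename)  opens the workbook; remember to include the path in filename
workbook.sheet_by_index(index)  gets a sheet by its index
sheet.nrows  the total number of rows
sheet.ncols  the total number of columns
sheet.cell_value(row, col)  gets a cell's value by zero-based row and column index

Those are the basic operations. There are many more specific ones that I won't go through here; consult the documentation when you need them. It is in English, which makes for a fun read.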

2. xlwt (official documentation: xlwt)

Compared with xlrd, xlwt adds operations on Excel styles, and there are quite a lot of them.

Code first:

import xlwt


# Write a two-dimensional list to an Excel file
def writeexcel(path, rows):
    # Create a workbook in memory
    excel = xlwt.Workbook(encoding='utf-8')
    # Add a sheet
    sheet = excel.add_sheet('sheet1')

    # Set the width of the first four columns
    # (rows and columns are zero-based in xlwt)
    for col_index in range(4):
        sheet.col(col_index).width = 256 * 30

    # Set up the cell style
    style = xlwt.XFStyle()
    font = xlwt.Font()
    font.name = u'微軟雅黑'
    font.underline = False
    font.italic = False
    font.height = 240
    style.font = font
    for indexi, row in enumerate(rows):
        for indexj, value in enumerate(row):
            sheet.write(indexi, indexj, value, style)
    excel.save(path)

The logic here: accept the output path and a two-dimensional list, iterate over the list, and write each value into its cell.

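One function used here is worth a brief mention. enumerate() takes an iterable (such as a list, tuple, or string) and yields (index, item) pairs, so you get each element together with its position. It is most often used in for loops.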
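As a small illustration of how enumerate() drives the nested loop in writeexcel, the sketch below walks a two-dimensional list and collects the (row, column, value) triples that would be passed to sheet.write. The sample data and the helper name cell_triples are made up for this example.

```python
def cell_triples(rows):
    """Return (row_index, col_index, value) for every cell in a 2D list."""
    triples = []
    for row_index, row in enumerate(rows):
        for col_index, value in enumerate(row):
            triples.append((row_index, col_index, value))
    return triples


# Hypothetical sample data: one inner list per spreadsheet row.
sample = [["id", "url"], [1, "http://example.com"]]
for triple in cell_triples(sample):
    print(triple)
```

Each printed triple lines up with one sheet.write(row, col, value, style) call in the function above.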

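A brief introduction to the API used:

excel = xlwt.Workbook(encoding='ascii', style_compression=0)  creates a workbook held in memory and returns a Workbook instance ('ascii' is the default encoding; the code above passes 'utf-8')
excel.save(path)  writes the in-memory workbook to a file at the given path
excel.add_sheet(sheetname)  adds a sheet to the workbook and returns a sheet instance
col = sheet.col(index)  gets the column at the given index
col.width  sets the column width, in units of 1/256 of a character width, so 256 * 30 is roughly 30 characters; row height can be set as well
xlwt.XFStyle()  returns a cell style instance that styles are attached to
xlwt.Font()  returns a font settings instance, which takes effect once it is assigned to style.font

xlwt has a rich styling API that I won't cover item by item; check the documentation for the exact methods. They mostly follow the same pattern as above.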
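The numbers 256 * 30 and 240 in the code above are easier to read once the units are spelled out. To my understanding, xlwt measures column width in 1/256 of the width of the character '0' and font height in 1/20 of a point. The two helper names below are hypothetical and only do the arithmetic; they do not call xlwt.

```python
def col_width(chars):
    """Column width value for roughly `chars` characters (1/256-character units)."""
    return 256 * chars


def font_height(points):
    """Font height value for a size in points (1/20-point units)."""
    return 20 * points


print(col_width(30))    # the value assigned to each col.width above
print(font_height(12))  # 240, the font.height used above
```

With these, sheet.col(0).width = col_width(30) and font.height = font_height(12) would say the same thing as the literals in writeexcel.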

3. urllib3 (official documentation: urllib3)

Code first:

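from urllib import request


# Fetch a web page and return its HTML as text, or False on failure
def fetch(urladdr):
    try:
        res = request.urlopen(urladdr, data=None, timeout=10)
        page = res.read().decode('utf-8')
        return page
    except Exception as e:
        # Any failure (network error, HTTP error, decode error) ends up here
        print(e, urladdr)
        return False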

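Why import from urllib rather than urllib3? They are two different packages: urllib ships with the Python 3 standard library, while urllib3 is a separate third-party package, not a newer version of urllib. The standard-library urllib.request is enough for this task. urllib3 is still worth learning, because it adds connection pooling, which can help performance when many requests go to the same host.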
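Because urllib.request is part of the standard library, the fetch helper can also be made a little more informative without installing anything. The following is only a sketch, not the code used in this post: fetch_with_status and its status labels are hypothetical names, and it assumes that replacing undecodable bytes is acceptable for a content check.

```python
from urllib import error, request


def fetch_with_status(urladdr, timeout=10):
    """Return (page_text, status_label); page_text is None on failure."""
    try:
        with request.urlopen(urladdr, timeout=timeout) as res:
            # errors="replace" keeps going if the page is not valid UTF-8
            return res.read().decode("utf-8", errors="replace"), "ok"
    except error.HTTPError as e:
        # The server answered, but with an error status such as 404 or 500
        return None, "HTTP %d" % e.code
    except error.URLError as e:
        # No usable response: DNS failure, refused connection, unsupported scheme
        return None, "network error: %s" % e.reason
    except (ValueError, OSError) as e:
        # Malformed URL or a low-level socket error (including timeouts)
        return None, "error: %s" % e


# Example call (performs a real request, so it is left commented out):
# text, status = fetch_with_status("http://example.com")
```

HTTPError is caught before URLError because it is a subclass of it. Returning a label lets the caller write something more specific than a blanket "404" into the spreadsheet.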

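Let's look at the urllib3 API anyway, just because, even though the rest of this section is not used in this task.

urllib3.HTTPConnectionPool('google.com', maxsize=10, block=True)  # a connection pool for one host
http = urllib3.PoolManager()
r = http.request('GET', 'http://httpbin.org/bytes/1024', preload_content=False)  # or connect through a PoolManager

r.read()  # read the response body; the packages do have a lot in common
r.release_conn()  # with preload_content=False, return the connection to the pool when done

urllib3 feels somewhat harder to understand than urllib. Most tutorials online currently cover urllib, so urllib3 usage may be less widespread. For this task, getting the job done is what matters.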

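4. BeautifulSoup (official documentation: BeautifulSoup)

Conveniently, this one has Chinese documentation, and it is very detailed. The pip package is named beautifulsoup4, and the class is imported from bs4.

This parser seems to be widely used, given that someone has translated the documentation into Chinese.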

import re

from bs4 import BeautifulSoup


# Decide whether the page has content
def chargecontent(htmlcontent, content_selector, media_selector):
    soup = BeautifulSoup(htmlcontent, "html.parser")
    # Remove <script> and <a> tags so their text is not counted
    for tag in soup(['script', 'a']):
        tag.extract()

    # Text check: does the content area contain any Chinese characters?
    s = soup.select(content_selector)
    hascontent = False
    if len(s) > 0:
        strcontent = s[0].get_text()
        d = "[\u4e00-\u9fa5]+"
        li = re.findall(d, strcontent)
        hascontent = len(li) > 0

    # Media check: does the media selector match at least one element?
    hasmedia = len(soup.select(media_selector)) > 0

    return hascontent or hasmedia

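Here I put page parsing and the content check in one place. The flow: parse the page fetched with urllib, get the content of the target tag, then use a regular expression to decide whether it contains any content (specifically, any Chinese characters). The page also counts as having content if the media selector matches at least one element.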
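The regular-expression step can be tried on its own, without any HTML parsing. The sketch below uses the same character range as the code above; the names CHINESE_RUN and has_chinese_text are introduced only for this example.

```python
import re

# Runs of characters in the CJK Unified Ideographs range U+4E00 to U+9FA5
CHINESE_RUN = re.compile("[\u4e00-\u9fa5]+")


def has_chinese_text(text):
    """Return True if text contains at least one Chinese character."""
    return CHINESE_RUN.search(text) is not None


print(has_chinese_text("標題 title"))         # True
print(has_chinese_text("   \n  "))           # False
print(CHINESE_RUN.findall("今天, ok, 天氣"))   # ['今天', '天氣']
```

One consequence of this check: a content area that holds only Latin text or digits is reported as empty, which is fine for Chinese-language pages but would need a different pattern elsewhere.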

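If you learn this part well, you can analyze pretty much any element on a page. The API used here:

soup = BeautifulSoup(htmlcontent, "html.parser")  pass in the HTML and the parser name to get a soup instance
soup.select(selector)  selects elements with a CSS selector; it returns a list of matching Tag objects, and further operations go through those objects
s[0].get_text()  returns the text content of a Tag

There is far more than this, but you would lose interest if I kept going. This should be enough to get the general flow.

Beautiful Soup converts a complex HTML document into a tree structure. Every node is a Python object, and all objects fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.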

Full implementation

"""
Read URLs from an Excel file and check whether each page's main content is present
"""
from urllib import request
from bs4 import BeautifulSoup
import re
import xlrd
import xlwt


# Fetch a web page and return its HTML as text, or False on failure
def fetch(urladdr):
    try:
        res = request.urlopen(urladdr, data=None, timeout=10)
        page = res.read().decode('utf-8')
        return page
    except Exception as e:
        print(e, urladdr)
        return False


# Decide whether the page has content
def chargecontent(htmlcontent, content_selector, media_selector):
    soup = BeautifulSoup(htmlcontent, "html.parser")
    # Remove <script> and <a> tags so their text is not counted
    for tag in soup(['script', 'a']):
        tag.extract()

    # Text check: does the content area contain any Chinese characters?
    s = soup.select(content_selector)
    hascontent = False
    if len(s) > 0:
        strcontent = s[0].get_text()
        d = "[\u4e00-\u9fa5]+"
        li = re.findall(d, strcontent)
        hascontent = len(li) > 0

    # Media check: does the media selector match at least one element?
    hasmedia = len(soup.select(media_selector)) > 0

    return hascontent or hasmedia


# Read an Excel file into a list of rows
def readexcel(path):
    wb = xlrd.open_workbook(path)
    sheet = wb.sheet_by_index(0)
    rownum = sheet.nrows
    colnum = sheet.ncols
    listdata = []
    for i in range(rownum):
        rowlist = []
        for j in range(colnum):
            rowlist.append(sheet.cell_value(i, j))
        listdata.append(rowlist)
    return listdata


# Write a two-dimensional list to an Excel file
def writeexcel(path, rows):
    # Create a workbook in memory
    excel = xlwt.Workbook(encoding='utf-8')
    # Add a sheet
    sheet = excel.add_sheet('sheet1')

    # Set the width of the first four columns
    # (rows and columns are zero-based in xlwt)
    for col_index in range(4):
        sheet.col(col_index).width = 256 * 30

    # Set up the cell style
    style = xlwt.XFStyle()
    font = xlwt.Font()
    font.name = u'微軟雅黑'
    font.underline = False
    font.italic = False
    font.height = 240
    style.font = font
    for indexi, row in enumerate(rows):
        for indexj, value in enumerate(row):
            sheet.write(indexi, indexj, value, style)
    excel.save(path)


if __name__ == '__main__':
    rows = readexcel('../url.xls')

    for index, row in enumerate(rows):
        # The URL is in the second column of each row
        contentdata = fetch(row[1])
        if contentdata is not False:
            has_content = chargecontent(contentdata, '#ivs_content', '.media-video')
            if has_content:
                row.append('has content')
            else:
                row.append('no content')
        else:
            # fetch() returns False for any failure, not only HTTP 404
            row.append('request failed')
        print(index, row)

    writeexcel('../url-python.xls', rows)

Result screenshot

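Overall the logic is fairly clear. Execution may be somewhat slow, since the URLs are fetched one at a time, but it covers the basic requirement. Once the data grows to tens of thousands of rows, it will probably need adjusting. Error handling is also incomplete: an unexpected error brings the whole program down, so this part needs improvement.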
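One possible direction for both concerns, sketched with only the standard library: run the per-URL check in a thread pool and guard each row so that one failure is recorded instead of stopping the run. This is an untested idea for this particular workload, not a measured improvement; check_rows, check_url, and max_workers are hypothetical names, and check_url stands for any function that takes a URL and returns a status string.

```python
from concurrent.futures import ThreadPoolExecutor


def check_rows(rows, check_url, max_workers=8):
    """Return copies of rows with a status string appended to each.

    check_url is any callable taking a URL and returning a status string.
    """
    def safe_check(row):
        try:
            return check_url(row[1])  # URL is in the second column, as above
        except Exception as e:
            # One bad row is recorded instead of stopping the whole run
            return "error: %s" % e

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in the same order as rows
        statuses = list(pool.map(safe_check, rows))
    return [row + [status] for row, status in zip(rows, statuses)]
```

In this post's terms, check_url could wrap fetch and chargecontent and return the label to write, and the returned list could be passed straight to writeexcel.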

Feedback and discussion are welcome!
