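Python Static Web Crawler: A Hands-On Project

Posted by LMRzero on 2020-05-01

Original URL: https://lmrzero.blog.csdn.net/article/details/105882043

Python Web Crawler

This crawler is based on the book 《Python爬蟲開發與專案實戰》 (Python Crawler Development and Project Practice) and has been updated for the current version of the target web pages, so it can successfully scrape data.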

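Basic Crawler Architecture and Workflow

Description and figure from the book 《Python爬蟲開發與專案實戰》

The basic architecture and workflow of the crawler are shown in the figure below:

The basic crawler framework consists of five modules: the crawler scheduler, the URL manager, the HTML downloader, the HTML parser, and the data store. Their functions are as follows:

The crawler scheduler coordinates the work of the other four modules.

The URL manager manages URL links. It maintains the set of crawled URLs and the set of uncrawled URLs, and provides an interface for fetching new URL links.
The HTML downloader takes uncrawled URL links from the URL manager and downloads the corresponding HTML pages.
The HTML parser takes downloaded HTML pages from the HTML downloader, extracts new URL links and passes them to the URL manager, and extracts the useful data and passes it to the data store.
The data store saves the data extracted by the HTML parser to a file or a database.
Figure 6-3 (from the book) illustrates the framework's dynamic execution flow to make it easier to follow.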

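(1) URL Manager

The URL manager holds two variables: the set of crawled URLs and the set of uncrawled URLs. Both use Python's set type, mainly for its built-in deduplication, which prevents a link from being crawled twice; repeated links can easily send the crawler into an endless loop. Link deduplication is an essential feature in Python crawler development, and there are three main approaches: 1) in-memory deduplication, 2) relational-database deduplication, and 3) cache-database deduplication. Large, mature crawlers mostly use a cache database for deduplication, which avoids the memory-size limit as far as possible while performing much better than relational-database deduplication. Because this basic crawler fetches only a small number of pages, we use the in-memory approach with Python's set.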

# coding: utf-8

class UrlManager(object):
    def __init__(self):
        # Set of URLs not yet crawled
        self.new_urls = set()
        # Set of URLs already crawled
        self.old_urls = set()

    def has_new_url(self):
        """
        Check whether any uncrawled URL remains.
        :return: True if at least one uncrawled URL is left
        """
        return self.new_url_size() != 0

    def get_new_url(self):
        """
        Take one uncrawled URL and mark it as crawled.
        :return: the URL to crawl next
        """
        new_url = self.new_urls.pop()
        self.old_urls.add(new_url)
        return new_url

    def add_new_url(self, url):
        """
        Add a new URL to the set of uncrawled URLs.
        :param url: a single URL
        :return:
        """
        if url is None:
            return
        # Skip URLs that are already queued or already crawled
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        """
        Merge a collection of new URLs into the set of uncrawled URLs.
        :param urls: collection of URLs
        :return:
        """
        if urls is None or len(urls) == 0:
            return
        for url in urls:
            self.add_new_url(url)

    def new_url_size(self):
        """
        Size of the set of uncrawled URLs.
        :return:
        """
        return len(self.new_urls)

    def old_url_size(self):
        """
        Size of the set of crawled URLs.
        :return:
        """
        return len(self.old_urls)
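
The memory limit mentioned above comes from keeping every URL string in the set. As a hedged illustration only (this is a variation, not the code from the book or from this post), the sketch below keeps a fixed-size fingerprint of each URL instead of the string itself. The class name `DigestUrlSet` and the choice of a 16-byte truncated SHA-256 digest are assumptions made for this example.

```python
import hashlib


class DigestUrlSet:
    """Hypothetical helper: an in-memory 'seen' set that stores a
    fixed-size fingerprint of each URL instead of the URL string."""

    def __init__(self):
        self._seen = set()

    @staticmethod
    def _digest(url):
        # First 16 bytes of SHA-256, used only as a compact fingerprint.
        return hashlib.sha256(url.encode("utf-8")).digest()[:16]

    def add(self, url):
        """Record the URL; return True if it was new, False if already seen."""
        fingerprint = self._digest(url)
        if fingerprint in self._seen:
            return False
        self._seen.add(fingerprint)
        return True

    def __len__(self):
        return len(self._seen)


seen = DigestUrlSet()
for url in ["https://example.com/a", "https://example.com/a", "https://example.com/b"]:
    print(url, "new" if seen.add(url) else "duplicate")
```

Each entry then costs the same amount of memory no matter how long the URL is. The set still lives in memory, so this only stretches the in-memory approach; it does not replace a cache database for large crawls.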

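(2) HTML Downloader

The HTML downloader downloads web pages. Pay attention to the page encoding here so that the downloaded page is not garbled. The downloader uses the Requests module and only needs to implement one interface: download(url). The code for HtmlDownloader.py is as follows: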

# coding: utf-8
import requests

class HtmlDownloader(object):

    def download(self, url):
        if url is None:
            return None

        user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        headers = {'User-Agent': user_agent}
        # A timeout keeps one unresponsive server from stalling the whole crawl
        r = requests.get(url, headers=headers, timeout=10)
        if r.status_code == 200:
            # Force UTF-8 so that r.text is decoded without garbling
            r.encoding = 'utf-8'
            return r.text
        return None
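
To show why the encoding matters, here is a small offline sketch. It does not call Requests; `decode_page` is a hypothetical helper that decodes raw bytes with a chosen encoding, which is what setting `r.encoding` before reading `r.text` controls in the downloader above.

```python
def decode_page(raw_bytes, encoding="utf-8"):
    """Decode downloaded bytes; undecodable bytes become U+FFFD instead of raising."""
    return raw_bytes.decode(encoding, errors="replace")


original = "網路爬蟲"
raw = original.encode("utf-8")        # bytes as a UTF-8 server would send them

good = decode_page(raw, "utf-8")      # correct encoding: the text round-trips
bad = decode_page(raw, "latin-1")     # wrong encoding: garbled text (mojibake)

print(good == original)
print(bad == original)
print(len(good), len(bad))
```

With the wrong encoding, every byte is turned into a separate character, so the result is longer than the original and unreadable. Forcing `utf-8` is a reasonable choice only when the target site is known to serve UTF-8, as assumed in this post.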

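(3) HTML Parser

The HTML parser uses BeautifulSoup4 to parse HTML. There are two things to extract: the URLs of related entry pages, and the title and summary of the current entry.

You need to inspect the page structure yourself for each target site. This post uses Baidu Baike (百度百科) as the example:

We also need to extract hyperlinks from the current page for further crawling. Inspecting the page structure gives:

Tags like these can be matched with a regular expression: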

soup.find_all('a', href = re.compile(r'/item/\w+'))

The complete code is as follows:

# coding: utf-8
import re
from urllib.parse import urljoin  # Python 3 replacement for the Python 2 urlparse module
from bs4 import BeautifulSoup

class HtmlParser(object):

    def parser(self, page_url, html_cont):
        """
        Parse the page content.
        :param page_url: URL of the downloaded page
        :param html_cont: content of the downloaded page
        :return: (set of new URLs, data dict), or (None, None) if there is nothing to parse
        """
        if page_url is None or html_cont is None:
            # Return a pair so the caller can always unpack the result
            return None, None

        soup = BeautifulSoup(html_cont, 'html.parser')
        new_urls = self._get_new_urls(page_url, soup)
        new_data = self._get_new_data(page_url, soup)
        return new_urls, new_data

    def _get_new_urls(self, page_url, soup):
        """
        Extract the set of new URLs.
        :param page_url: URL of the downloaded page
        :param soup: soup object
        :return: set of absolute URLs
        """
        new_urls = set()
        # Select the <a> tags whose href matches the entry-link pattern
        links = soup.find_all('a', href=re.compile(r'/item/\w+'))
        for link in links:
            # Read the href attribute
            new_url = link['href']
            # Join it with the page URL to get an absolute URL
            new_full_url = urljoin(page_url, new_url)
            new_urls.add(new_full_url)
        return new_urls

    def _get_new_data(self, page_url, soup):
        """
        Extract the useful data.
        :param page_url: page URL
        :param soup: soup object
        :return: dict with the url, title, and summary
        """
        data = {}
        data['url'] = page_url
        title = soup.find('dd', class_='lemmaWgt-lemmaTitle-title').find('h1')
        data['title'] = title.get_text()
        summary = soup.find('div', class_='lemma-summary')
        data['summary'] = summary.get_text()

        return data
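
The link-selection step in `_get_new_urls` can be tried without downloading anything. The sketch below uses only the standard library; `select_item_links` and the sample hrefs are hypothetical and exist only for illustration. It applies the same pattern with `search()`, which is how BeautifulSoup applies a compiled regular expression passed as an attribute filter, and then resolves each match against the page URL.

```python
import re
from urllib.parse import urljoin

ITEM_LINK = re.compile(r'/item/\w+')


def select_item_links(page_url, hrefs):
    """Return absolute URLs for the hrefs that contain an /item/<word> path."""
    selected = set()
    for href in hrefs:
        # search() finds the pattern anywhere in the href, not only at the start
        if ITEM_LINK.search(href):
            selected.add(urljoin(page_url, href))
    return selected


page_url = "https://baike.baidu.com/item/Python/407313"
sample_hrefs = [
    "/item/Guido",                            # relative entry link
    "/item/%E7%88%AC",                        # percent-encoded name
    "/help",                                  # not an entry link
    "https://baike.baidu.com/item/Spider/1",  # already absolute
]
for url in sorted(select_item_links(page_url, sample_hrefs)):
    print(url)
```

One detail worth noticing: `%` is not a `\w` character, so an href whose entry name is fully percent-encoded is not selected by this pattern. Whether that matters depends on how the target pages write their links, which should be checked against the live page structure.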

(4) Data Storage

# coding: utf-8

class DataOutput(object):

    def __init__(self):
        self.datas = []

    def store_data(self, data):
        if data is None:
            return

        self.datas.append(data)

    def output_html(self):
        # The with block closes the file even if a write fails
        with open('baike.html', 'w', encoding='utf-8') as fout:
            fout.write("<html>")
            fout.write("<body>")
            fout.write("<table>")
            for data in self.datas:
                fout.write("<tr>")
                fout.write("<td>%s</td>" % data['url'])
                fout.write("<td>%s</td>" % data['title'])
                fout.write("<td>%s</td>" % data['summary'])
                fout.write("</tr>")
            fout.write("</table>")
            fout.write("</body>")
            fout.write("</html>")
        # Clear the buffer only after the loop: removing items from the list
        # while iterating over it would skip every other record
        self.datas.clear()
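
The architecture overview says the data store can write to a file or a database; the class above covers the file case. For the database case, the following is a rough sketch under stated assumptions: it uses the standard-library `sqlite3` module, and the function name `store_pages`, the table name `pages`, and the sample records are all hypothetical. The records have the same `url`, `title`, and `summary` keys as the dicts produced by the parser.

```python
import sqlite3


def store_pages(conn, pages):
    """Insert page records (dicts with url, title, summary) into a 'pages' table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "url TEXT PRIMARY KEY, title TEXT, summary TEXT)"
    )
    # INSERT OR REPLACE keeps one row per URL if the same page is stored twice
    conn.executemany(
        "INSERT OR REPLACE INTO pages (url, title, summary) "
        "VALUES (:url, :title, :summary)",
        pages,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database, for the demo only
store_pages(conn, [
    {"url": "https://example.com/a", "title": "A", "summary": "first"},
    {"url": "https://example.com/a", "title": "A", "summary": "updated"},
    {"url": "https://example.com/b", "title": "B", "summary": "second"},
])
rows = conn.execute("SELECT url, summary FROM pages ORDER BY url").fetchall()
print(rows)
```

Passing a file path instead of `":memory:"` to `sqlite3.connect` would persist the table on disk. Named placeholders are used so that the values are bound by the driver rather than formatted into the SQL string.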

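(5) Crawler Scheduler (Main Function)

The URL manager, HTML downloader, HTML parser, and data store are now implemented. The next step is to write the crawler scheduler that coordinates them. The scheduler first initializes each module, then takes the entry URL through the crawl(root_url) method, which drives the modules according to the execution flow. The code for SpiderMain.py is as follows: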

# coding: utf-8
from UrlManager import UrlManager
from HtmlParser import HtmlParser
from HtmlDownloader import HtmlDownloader
from DataOutput import DataOutput

class SpiderMain(object):

    def __init__(self):
        self.manager = UrlManager()
        self.downloader = HtmlDownloader()
        self.parser = HtmlParser()
        self.output = DataOutput()

    def crawl(self, root_url):
        # Add the entry URL
        self.manager.add_new_url(root_url)
        # Continue while the URL manager has new URLs and fewer than 100 have been crawled
        while (self.manager.has_new_url() and self.manager.old_url_size() < 100):
            try:
                # Get a new URL from the URL manager
                new_url = self.manager.get_new_url()
                # Download the page with the HTML downloader
                html = self.downloader.download(new_url)
                # Extract links and data with the HTML parser
                new_urls, data = self.parser.parser(new_url, html)
                # Add the extracted URLs to the URL manager
                self.manager.add_new_urls(new_urls)
                # Buffer the data in the data store
                self.output.store_data(data)
                print('Crawled %s links so far' % self.manager.old_url_size())
            except Exception as e:
                # Report the failure and move on to the next URL
                print("crawl failed: %s" % e)
        self.output.output_html()

if __name__=="__main__":
    spider_main = SpiderMain()
    spider_main.crawl("https://baike.baidu.com/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711?fromtitle=%E8%9C%98%E8%9B%9B&fromid=8135707")
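
The control flow of the scheduler loop can also be exercised offline. In the sketch below, a hypothetical `link_graph` dict stands in for the downloader and parser (it maps each URL to the links found on that page), and `crawl_graph` is an illustrative name, not part of the project. The loop keeps the same two conditions as crawl(): stop when no uncrawled URL is left, or when the page limit is reached.

```python
def crawl_graph(link_graph, root, max_pages):
    """Run the scheduler loop over an in-memory link graph and return the crawled URLs."""
    new_urls, old_urls = {root}, set()
    while new_urls and len(old_urls) < max_pages:
        url = new_urls.pop()          # take one uncrawled URL
        old_urls.add(url)             # mark it as crawled
        for found in link_graph.get(url, []):
            # Queue only URLs that are neither waiting nor already crawled
            if found not in new_urls and found not in old_urls:
                new_urls.add(found)
    return old_urls


graph = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a"], "d": []}
print(sorted(crawl_graph(graph, "a", max_pages=100)))
print(len(crawl_graph(graph, "a", max_pages=2)))
```

The graph contains cycles (b and c both link back to a), yet the loop terminates because already-seen URLs are never queued again. Since set.pop() returns an arbitrary element, the visiting order is not fixed; only the set of crawled pages and the limit are predictable.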

The crawl results are as follows:
