Python Web Crawler (9): Building a Basic Crawler Architecture

白夢偉, published 2019-06-09

Original URL: https://www.cnblogs.com/bai2018/p/10994735.html

Python crawler

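Purpose

The basic crawler is split into five modules spread over several files that work together. The result is a reasonably complete data-crawling setup that lays the groundwork for more capable crawlers later.

The goal here is to crawl 200 Baidu Baike entries and generate an HTML file that stores, for each one, the page URL, the entry title, and the explanation.

The approach in this article comes from a book, and so does part of the code: https://book.douban.com/subject/27061630/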

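Functional modules

Main file: the crawler scheduler, which calls the methods in the other files to deliver the final functionality.

Other files: URL manager, HTML downloader, HTML parser, data store.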

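Design approach

Define the SpiderMan class as the crawler scheduler. It takes a root URL, starts crawling from it, and stops when the crawl is finished.

During the crawl, pages have to be fetched and parsed.

Parsing a page needs an HTML parser, and fetching a page needs an HTML downloader.

The data to extract from each page includes URL, TITLE, CONTEXT, and so on, which is why a URL manager and a data store are needed as well.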

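Main file design

The main file adds the root URL, then takes that URL back out and downloads its content.

Based on the content, it calls the parser to:

    extract the new URLs found on that page and store them in the URL manager;

    extract the title, text, and other information on that page and store them in the data store.

Then the next iteration begins. By now the URL manager holds new URLs, so the next one is taken out, downloaded, and parsed, and the cycle simply repeats.

The loop ends once the number of URLs taken out reaches 200, or earlier if no unvisited URL remains.

The code is as follows: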

from BaseSpider.DataOutput import DataOutput
from BaseSpider.HtmlDownloader import HtmlDownloader
from BaseSpider.HtmlParser import HtmlParser
from BaseSpider.UrlManager import UrlManager


class SpiderMan:
    """Crawler scheduler: wires the four helper modules together."""

    def __init__(self):
        self.manager = UrlManager()
        self.downloader = HtmlDownloader()
        self.parser = HtmlParser()
        self.output = DataOutput()

    def crawl(self, root_url):
        self.manager.add_new_url(root_url)
        # Stop when no unvisited URL is left or 200 URLs have been taken out.
        while self.manager.has_new_url() and self.manager.old_url_size() < 200:
            new_url = self.manager.get_new_url()
            text = self.downloader.download(new_url)
            if text is None:
                # A failed download ends the whole crawl in this first version.
                print('None text')
                break
            new_urls, data = self.parser.parser(new_url, text)
            self.manager.add_new_urls(new_urls)
            self.output.store_data(data)
            print(self.manager.old_url_size())
        # Write everything collected so far to baike.html.
        self.output.output_html()


if __name__ == "__main__":
    spider_man = SpiderMan()
    spider_man.crawl("https://baike.baidu.com/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711?fr=aladdin")
    print('finish')

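As a first design, exceptions should be allowed to propagate, so that the reason the program stopped is visible and the error can be tracked down.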
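The control flow of the scheduler can be checked without touching the network by running the same loop against an in-memory link graph. The following is only a minimal sketch: FAKE_SITE and crawl are hypothetical names made up for illustration, and the two sets stand in for the URL manager.

```python
# Hypothetical in-memory "site": each page maps to the links it contains.
FAKE_SITE = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["d"],
    "d": [],
}


def crawl(root, limit):
    """Visit pages starting at root until `limit` pages have been taken out."""
    new_urls = {root}   # waiting to be visited
    old_urls = set()    # already visited
    visited = []
    while new_urls and len(old_urls) < limit:
        url = new_urls.pop()          # arbitrary element, as with set.pop()
        old_urls.add(url)
        visited.append(url)
        for link in FAKE_SITE.get(url, []):
            if link not in old_urls and link not in new_urls:
                new_urls.add(link)
    return visited


visited = crawl("a", limit=3)
print(visited)
```

The loop always starts at the root and never visits a page twice, but the order of the remaining pages depends on what set.pop() happens to return.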

HTML downloader design

Download the page and return its text. That is all.

import requests
import chardet


class HtmlDownloader(object):
    def download(self, url):
        if url is None:
            return None
        user_agent = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0'
        headers = {'User-Agent': user_agent}
        r = requests.get(url, headers=headers)
        # Compare by value: "is 200" tests object identity and is not reliable.
        if r.status_code == 200:
            # Let chardet guess the encoding from the raw bytes before decoding.
            r.encoding = chardet.detect(r.content)['encoding']
            return r.text
        return None
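The downloader above relies on chardet to guess the encoding. As a rough illustration of what that step is protecting against, here is a standard-library-only sketch that tries a list of candidate encodings in order. The function decode_body and its fallback list are assumptions made for this example, not part of the original code, and a fixed fallback chain is a simpler technique than statistical detection.

```python
def decode_body(content, declared=None, fallbacks=("utf-8", "gb18030")):
    """Decode raw bytes by trying the declared encoding, then the fallbacks.

    Hypothetical helper: `declared` would come from the response headers.
    """
    candidates = ([declared] if declared else []) + list(fallbacks)
    for encoding in candidates:
        try:
            return content.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown codec name: try the next candidate.
            continue
    # Last resort: keep going, marking undecodable bytes.
    return content.decode("utf-8", errors="replace")


gbk_bytes = "网络爬虫".encode("gbk")
print(decode_body(gbk_bytes))
print(decode_body(gbk_bytes, declared="no-such-codec"))
```

Both calls fall through to gb18030, which is a superset of GBK, so the Chinese text is recovered even though the bytes are not valid UTF-8.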

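HTML parser design

The HTML parser parses the downloaded text. It needs to extract two things: the new URLs on the page and the new data text on the page.

To build the parser, open the page source, compare it with the rendered page, analyze it, and then use BeautifulSoup to pull out the required information.

To keep calls from the main function simple, all data is returned through parser, which in turn calls one method that collects the URLs and another that collects the data text.

Some checks were added so that unexpected cases on differently structured pages do not terminate the program.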

import re
from urllib import parse
from bs4 import BeautifulSoup


class HtmlParser(object):
    def parser(self, page_url, html_cont):
        # Always return a 2-tuple so the caller can unpack the result.
        if page_url is None or html_cont is None:
            return set(), None
        soup = BeautifulSoup(html_cont, 'lxml')
        new_urls = self.getNewUrls(page_url, soup)
        new_data = self.getNewData(page_url, soup)
        return new_urls, new_data

    def getNewUrls(self, page_url, soup):
        new_urls = set()
        # Entry pages are linked through hrefs containing /item/.
        links = soup.find_all('a', href=re.compile(r'/item/.*'))
        for link in links:
            # Resolve relative hrefs against the current page URL.
            new_urls.add(parse.urljoin(page_url, link['href']))
        return new_urls

    def getNewData(self, page_url, soup):
        data = {'url': page_url}
        title = soup.find('dd', class_="basicInfo-item value")
        if title is not None:
            data['title'] = title.string
        else:
            # Fall back to the keywords meta tag when the info box is missing.
            title = soup.find('meta', attrs={"name": "keywords"})
            if title is None:
                data['title'] = "ERROR!"
                data['summary'] = "Please check the url for more information"
                return data
            data['title'] = title['content']
        summary = soup.find('meta', attrs={"name": "description"})
        # Guard against pages that have no description meta tag.
        data['summary'] = summary.get('content', '') if summary is not None else ''
        return data
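The URL-extraction step can be tried out without installing BeautifulSoup. The sketch below swaps in the standard library's html.parser to collect the same kind of links and resolve them with urljoin. ItemLinkCollector, the sample HTML, and the /item/Example/1 path are all made up for illustration; they are not taken from a real page.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

ITEM_HREF = re.compile(r'/item/.*')


class ItemLinkCollector(HTMLParser):
    """Collect absolute URLs of <a> tags whose href contains /item/."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.urls = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and ITEM_HREF.search(href):
            # Relative hrefs are resolved against the page they came from.
            self.urls.add(urljoin(self.page_url, href))


# Made-up sample markup: one duplicate link, one off-topic link, one bare <a>.
sample_html = '''
<a href="/item/Example/1">Example</a>
<a href="/item/Example/1">Example again</a>
<a href="https://example.com/about">About</a>
<a>no href</a>
'''

collector = ItemLinkCollector("https://baike.baidu.com/item/demo")
collector.feed(sample_html)
print(collector.urls)
```

Because the results go into a set, the duplicate link is stored once, and the link that does not contain /item/ is ignored.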

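URL manager design

To avoid duplicate URLs, the manager uses Python's set, and both collections are initialized as empty sets. See: https://www.runoob.com/python3/python3-set.html

old_urls holds the URLs that have already been visited, and new_urls holds the URLs waiting to be taken out.

Methods such as has_new_url are then written to support the main program. When new URLs are found, the main program calls a method to store them.

The other URL management operations the main program needs, such as taking out a URL and checking counts, are implemented here as well.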

class UrlManager:
    def __init__(self):
        self.old_urls = set()   # URLs already taken out and visited
        self.new_urls = set()   # URLs waiting to be visited

    def has_new_url(self):
        return self.new_url_size() != 0

    def new_url_size(self):
        return len(self.new_urls)

    def old_url_size(self):
        return len(self.old_urls)

    def get_new_url(self):
        # set.pop() removes an arbitrary element, so crawl order is not fixed.
        new_url = self.new_urls.pop()
        self.old_urls.add(new_url)
        return new_url

    def add_new_url(self, url):
        if url is None:
            return
        # Skip URLs that are already queued or already visited.
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        if urls is None or len(urls) == 0:
            return
        for url in urls:
            self.add_new_url(url)
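Since set.pop() returns an arbitrary element, the manager above visits pages in no particular order. If first-in, first-out order is wanted, one possible variation is to pair a deque with a set of seen URLs. FifoUrlManager below is a hypothetical sketch of that alternative, not the design used in this article.

```python
from collections import deque


class FifoUrlManager:
    """Hypothetical variant: visits URLs in the order they were added."""

    def __init__(self):
        self.queue = deque()   # URLs waiting to be visited, oldest first
        self.seen = set()      # every URL ever added, visited or not
        self.visited = 0       # how many URLs have been taken out

    def has_new_url(self):
        return bool(self.queue)

    def old_url_size(self):
        return self.visited

    def add_new_url(self, url):
        if url is None or url in self.seen:
            return
        self.seen.add(url)
        self.queue.append(url)

    def get_new_url(self):
        self.visited += 1
        return self.queue.popleft()


manager = FifoUrlManager()
for url in ["a", "b", "a", "c"]:
    manager.add_new_url(url)

order = []
while manager.has_new_url():
    order.append(manager.get_new_url())
print(order)
```

The duplicate "a" is dropped on insertion, and the remaining URLs come back in insertion order.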

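Data store design

The data obtained by the HTML parser is kept by the data store.

Writing the data from memory to the local disk at the end is also implemented in this file.

To keep the output readable while debugging, it is advisable to crawl one or two entries first as a test, settle the table width settings, and add the style='word-break:break-all;word-wrap:break-word;' attribute. See: https://zhidao.baidu.com/question/1385859725784504260.html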

class DataOutput(object):
    def __init__(self):
        self.datas = []

    def store_data(self, data):
        if data is None:
            return
        self.datas.append(data)

    def output_html(self):
        # The with-block closes the file even if a write fails.
        with open('baike.html', 'w', encoding='utf-8') as fout:
            fout.write("<html>")
            fout.write("<head><meta charset='utf-8'></head>")
            fout.write("<body>")
            fout.write("<table border='1' width=1800  style='word-break:break-all;word-wrap:break-word;'>")
            fout.write("<tr>")
            fout.write("<td width='300'>URL</td>")
            fout.write("<td width='100'>Title</td>")
            fout.write("<td width='1200'>Summary</td>")
            fout.write("</tr>")
            for data in self.datas:
                fout.write("<tr>")
                fout.write("<td><a href='%s'>%s</a></td>" % (data['url'], data['url']))
                fout.write("<td>%s</td>" % data['title'])
                fout.write("<td>%s</td>" % data['summary'])
                fout.write("</tr>")
            fout.write("</table>")
            fout.write("</body>")
            fout.write("</html>")
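The writer above inserts crawled text into the page as-is, so a title or summary containing characters such as < or & can break the table. One way to guard against that, shown here only as a sketch, is to pass each value through html.escape before writing it. The helper name render_row and the sample record are assumptions for this example.

```python
from html import escape


def render_row(data):
    """Build one table row with every value HTML-escaped (hypothetical helper)."""
    url = escape(str(data.get("url") or ""))
    title = escape(str(data.get("title") or ""))
    summary = escape(str(data.get("summary") or ""))
    return '<tr><td><a href="%s">%s</a></td><td>%s</td><td>%s</td></tr>' % (
        url, url, title, summary)


# Made-up record containing characters that need escaping.
row = render_row({
    "url": "https://example.com/?a=1&b=2",
    "title": "<b>x</b>",
    "summary": 'He said "hi"',
})
print(row)
```

html.escape also escapes quotes by default, which keeps the value safe inside the quoted href attribute.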

Final result:

Of course, some of the data is still not handled well.

End
