A Simple Python Crawler (Part 2)

Posted by weixin_33912246 on 2018-04-18

Original URL: https://blog.csdn.net/weixin_33912246/article/details/85983478

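Python crawler

The previous post implemented a simple fetch of the content returned by a URL. This post extracts data from that content and saves the results to an HTML file.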

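I. Requirements:

Start page to crawl: the Baidu Baike entry for Python, https://baike.baidu.com/item/Python/407313

Analyze the format of the page source so the data can be extracted:

Title: inside the first h1 tag of the dd element whose class is lemmaWgt-lemmaTitle-title

Summary: the text content of the div whose class is lemma-summary

Related entries: a tags whose href starts with /item/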

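II. Flowchart of the crawl process:

III. Analysis:

1. Web page downloader:

1. Purpose:

Downloads the web page behind a URL on the Internet to the local machine as HTML.

Commonly used downloaders:
  1. urllib2: the official Python 2 standard-library module (urllib.request in Python 3)
  2. requests: a third-party package with more features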

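2. Three ways to download a web page with urllib

The examples below use urllib.request, the Python 3 counterpart of Python 2's urllib2, to match the Python 3 code later in this post.

(1) Pass the URL to urllib.request.urlopen(url)

import urllib.request

# Request the URL directly
response = urllib.request.urlopen('http://www.baidu.com')

# Get the status code; 200 means success
code = response.getcode()

# Read the content
cont = response.read()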

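(2) Add data and HTTP headers

Pass the url, data, and headers to urllib.request.Request,
then call urllib.request.urlopen(request)

import urllib.parse
import urllib.request

url = 'http://www.baidu.com'

# Encode the form data as bytes; passing data turns the request into a POST
data = urllib.parse.urlencode({'a': '1'}).encode('utf-8')

# Create the Request object with the data attached
request = urllib.request.Request(url, data=data)

# Add an HTTP header so the crawler presents itself as a Mozilla browser
request.add_header('User-Agent', 'Mozilla/5.0')

# Send the request and get the result
response = urllib.request.urlopen(request)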

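(3) Add handlers for special scenarios

Pages that require a user login: add cookie handling
Pages that are reachable only through a proxy: use ProxyHandler
Pages that must be requested over HTTPS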
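
The cookie case can be sketched with the standard library as follows. This is a minimal sketch, not the post's own code: the helper name build_cookie_opener is made up for illustration, and it assumes Python 3, where HTTPCookieProcessor lives in urllib.request. A ProxyHandler can be passed to build_opener in the same way.

```python
import http.cookiejar
import urllib.request


def build_cookie_opener():
    """Return (opener, cookie_jar); the opener stores and resends cookies."""
    cookie_jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cookie_jar)
    )
    # Send a browser-like User-Agent with every request made through the opener
    opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
    return opener, cookie_jar


opener, cookie_jar = build_cookie_opener()
# opener.open(url) would now be used in place of urllib.request.urlopen(url)
```

Cookies set by a login response end up in cookie_jar and are sent again on later requests made through the same opener.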

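2. Web page parser

1. Purpose:

A tool that extracts valuable data from a web page.

It takes the HTML page string as input and outputs the valuable data plus a list of new URLs to crawl.

Types of web page parser:
  1. Regular expressions: match against the downloaded HTML string; fuzzy, string-level matching that suits simple pages
  2. html.parser: a module bundled with Python
  3. BeautifulSoup: a third-party library
  4. lxml: a third-party library

Structured parsing turns the page into a DOM tree that can then be traversed.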
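
As a small illustration of the bundled html.parser module, the sketch below subclasses HTMLParser to collect href values that start with /item/. The class name LinkCollector and the sample HTML are made up for this example. Note that html.parser itself reports tags as events while it reads the page; libraries such as BeautifulSoup build the tree on top of a parser like this.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags that start with /item/ (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened
        if tag != 'a':
            return
        href = dict(attrs).get('href')
        if href and href.startswith('/item/'):
            self.links.append(href)


sample_html = '<p><a href="/item/Guido">Guido</a> <a href="/help">Help</a></p>'
collector = LinkCollector()
collector.feed(sample_html)
print(collector.links)
```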

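2. Introduction to BeautifulSoup and how to use it:

1. Introduction:

BeautifulSoup: a third-party Python library for extracting data from HTML or XML

Install and test BeautifulSoup

Method 1: Install: pip install beautifulsoup4
          Test: import bs4

Method 2: PyCharm > File > Settings > Project Interpreter > add beautifulsoup4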

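2. Syntax:

Create a BeautifulSoup object from the HTML page string; once it is created, the DOM tree is already loaded.
You can then search for nodes with find_all and find, which return all matching nodes or the first matching node (searching by node name, attributes, or text).
Once you have a node, you can access its name, attributes, and text.

For example:
<a href="123.html" class="aaa">Python</a>
can be searched by:
Node name: a
Node attributes: href="123.html" class="aaa"
Node content: Python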

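Creating a BeautifulSoup object:

from bs4 import BeautifulSoup

# Create a BeautifulSoup object from the downloaded HTML page string
soup = BeautifulSoup(
    html_doc,               # HTML document string
    'html.parser',          # HTML parser
    from_encoding='utf-8'   # encoding of the HTML document (used when html_doc is bytes)
)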

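Searching for nodes:
Method: find_all(name, attrs, string)

import re

# Find all nodes whose tag is a
soup.find_all('a')

# Find all nodes whose tag is a and whose link is exactly /view/123.html
soup.find_all('a', href='/view/123.html')
# Find the first a node whose href matches a regular expression
soup.find('a', href=re.compile('aaa'))

# Find all nodes whose tag is div, class is abc, and text is Python
soup.find_all('div', class_='abc', string='Python')

Because class is a Python keyword, the class attribute is written with a trailing underscore (class_) to avoid a conflict.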

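Accessing node information:
Given the node: <a href="123.html" class="aaa">Python</a>

# Get the tag name of the node that was found
node.name
# Get the href attribute of the node that was found
node['href']
# Get the link text of the node that was found
node.get_text()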
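
Putting the search and access calls together, here is a minimal runnable sketch. The sample page is made up and only loosely modelled on the structure described in the requirements; the real Baidu Baike markup may differ. The pattern ^/item/ is anchored so that only hrefs starting with /item/ match.

```python
import re
from bs4 import BeautifulSoup

# Made-up sample page, loosely modelled on the structure described above
html_doc = """
<html><body>
<dd class="lemmaWgt-lemmaTitle-title"><h1>Python</h1></dd>
<div class="lemma-summary">Python is a programming language.</div>
<a href="/item/Guido" class="aaa">Guido</a>
<a href="/help">Help</a>
</body></html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# find returns the first match; find_all returns every match
title = soup.find('dd', class_='lemmaWgt-lemmaTitle-title').find('h1').get_text()
summary = soup.find('div', class_='lemma-summary').get_text()
item_links = soup.find_all('a', href=re.compile(r'^/item/'))

print(title)
print(summary)
for node in item_links:
    print(node.name, node['href'], node.get_text())
```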

IV. Code implementation:

spider.py

# Crawler entry point and scheduler
from baike import url_manager, html_downloader, html_parser, html_outputer


class SpiderMain(object):
    def __init__(self):
        self.urlManager = url_manager.UrlManager()
        self.downloader = html_downloader.HtmlDownLoader()
        self.parser = html_parser.HtmpParser()
        self.outputer = html_outputer.HtmlOutpter()


    def craw(self,url):
        count = 1 # counts crawled pages; the loop stops after 5
        self.urlManager.add_new_url(url)
        while self.urlManager.has_new_url():
            try:
                # Get one url
                new_url = self.urlManager.get_new_url()
                # Request the url and get the data returned by the site
                html_content = self.downloader.download(new_url)
                # Extract the new urls and the page data
                new_urls, new_datas = self.parser.parse(new_url, html_content)
                self.urlManager.add_new_urls(new_urls)
                self.outputer.collect_data(new_datas)
                print(count)
                if count == 5:
                    break
                count = count+1
            except Exception as e:
                print("An error occurred:",e)
        # Write the crawl results to html
        self.outputer.out_html()

if __name__=="__main__":
    url = 'https://baike.baidu.com/item/Python/407313'
    sm = SpiderMain()
    sm.craw(url)

url_manager.py

# URL manager
class UrlManager(object):
    def __init__(self):
        # Two sets: one for urls not yet crawled, one for urls already visited
        self.new_urls = set()
        self.old_urls = set()

    # Add a single url
    def add_new_url(self,url):
        if url is None:
            return None
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    # Check whether any urls are still waiting to be crawled (based on the size of new_urls)
    def has_new_url(self):
        return len(self.new_urls) != 0

    # Get one new url
    def get_new_url(self):
        if len(self.new_urls)>0:
            # Pop one url from new_urls and add it to old_urls
            new_url = self.new_urls.pop()
            self.old_urls.add(new_url)
            return new_url

    # Add urls in bulk
    def add_new_urls(self, new_urls):
        if new_urls is None:
            return
        for url in new_urls:
            self.add_new_url(url)
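
One property of the set-based design is that set.pop() removes an arbitrary element, so the crawl order is not predictable. If a first-in, first-out order is wanted, a variant along the following lines is one option. This is a sketch only: the class name OrderedUrlManager and its attributes are hypothetical and not part of the post's code.

```python
from collections import deque


class OrderedUrlManager:
    """Hypothetical UrlManager variant that hands out URLs in insertion order."""

    def __init__(self):
        self.queue = deque()   # URLs waiting to be crawled, oldest first
        self.seen = set()      # every URL ever added, crawled or not

    def add_new_url(self, url):
        if url is None or url in self.seen:
            return
        self.seen.add(url)
        self.queue.append(url)

    def add_new_urls(self, urls):
        for url in urls or []:
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.queue) > 0

    def get_new_url(self):
        return self.queue.popleft() if self.queue else None


manager = OrderedUrlManager()
manager.add_new_urls(['/item/a', '/item/b', '/item/a'])
first = manager.get_new_url()
manager.add_new_url('/item/a')  # already seen, so it is ignored
second = manager.get_new_url()
print(first, second, manager.has_new_url())
```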

html_downloader.py

# Class that downloads a web page
import urllib.request


class HtmlDownLoader(object):
    def download(self, url):
        if url is None:
            return
        # Request the url
        response = urllib.request.urlopen(url)
        # A status code other than 200 means something went wrong
        if response.getcode() != 200:
            return
        return response.read()
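
The downloader above sends no headers and sets no timeout. A possible variant that combines it with the User-Agent header from section III is sketched below. The function names build_request and download and the 10-second timeout are assumptions made for this example, not values from the post.

```python
import urllib.request


def build_request(url, user_agent='Mozilla/5.0'):
    """Return a Request that carries a browser-like User-Agent header."""
    request = urllib.request.Request(url)
    request.add_header('User-Agent', user_agent)
    return request


def download(url, timeout=10):
    """Return the page body as bytes, or None if the URL is missing or the fetch fails."""
    if url is None:
        return None
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
            if response.getcode() != 200:
                return None
            return response.read()
    except OSError:
        # URLError and HTTPError are subclasses of OSError, as are socket timeouts
        return None


request = build_request('https://baike.baidu.com/item/Python/407313')
# Request stores header names capitalized, hence 'User-agent' here
print(request.get_header('User-agent'))
```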

html_parser.py

# Web page parser class
import re
import urllib.parse

from bs4 import BeautifulSoup


class HtmpParser(object):
    # Parse the downloaded page
    def parse(self, new_url, html_content):
        if html_content is None:
            return
        soup = BeautifulSoup(html_content,'html.parser',from_encoding='utf-8')
        new_urls = self.get_new_urls(new_url,soup)
        new_datas = self.get_new_datas(new_url,soup)
        return new_urls, new_datas


    # Collect the new urls
    def get_new_urls(self, new_url, soup):
        new_urls = set()
        # Find the a tags whose href contains /item
        links = soup.find_all('a',href=re.compile(r'/item'))
        for link in links:
            # Get the href attribute of the a tag
            url = link['href']
            # Join the urls so the crawled path becomes a full http://... url
            new_full_url = urllib.parse.urljoin(new_url,url)
            new_urls.add(new_full_url)
        return new_urls



    # Collect the page data
    def get_new_datas(self, new_url, soup):
        new_datas = {}
        # Get the title
        title_node = soup.find('dd',class_='lemmaWgt-lemmaTitle-title').find('h1')
        new_datas['title'] = title_node.get_text()

        # Get the summary
        summary_node = soup.find('div',class_='lemma-summary')
        new_datas['summary'] = summary_node.get_text()

        new_datas['url'] = new_url

        return new_datas
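
To make the urljoin step concrete, the short example below shows how a root-relative href from the page is turned into a full URL. The href values are made up for illustration.

```python
from urllib.parse import urljoin

page_url = 'https://baike.baidu.com/item/Python/407313'

# A root-relative href replaces the whole path of the page URL
joined = urljoin(page_url, '/item/Guido')
print(joined)

# An href that is already absolute is returned unchanged
absolute = urljoin(page_url, 'https://example.com/item/x')
print(absolute)
```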

html_outputer.py
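
A listing that defines HtmlOutpter does not appear in this post. The following is a minimal sketch of what such a class could look like, assuming only what spider.py calls on it: collect_data(data) with the dict built by the parser (keys url, title, summary) and out_html(). The render helper, the table layout, and the file name output.html are assumptions made for this example.

```python
import html


class HtmlOutpter(object):
    def __init__(self):
        self.datas = []  # one dict per crawled page

    def collect_data(self, data):
        if data is None:
            return
        self.datas.append(data)

    def render(self):
        """Build the HTML table as a string (hypothetical helper)."""
        rows = []
        for data in self.datas:
            cells = ''.join(
                '<td>%s</td>' % html.escape(str(data[key]))
                for key in ('url', 'title', 'summary')
            )
            rows.append('<tr>%s</tr>' % cells)
        return ('<html><head><meta charset="utf-8"></head><body><table>\n'
                '%s\n</table></body></html>' % '\n'.join(rows))

    def out_html(self, path='output.html'):
        # Assumption: the file name output.html is not given in the post
        with open(path, 'w', encoding='utf-8') as f:
            f.write(self.render())


outputer = HtmlOutpter()
outputer.collect_data({'url': 'https://baike.baidu.com/item/Python/407313',
                       'title': 'Python', 'summary': 'a < b'})
print(outputer.render())
```

Escaping each value with html.escape keeps characters such as < in a summary from breaking the generated table.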


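Run the main function of spider.py (the extracted results are saved to an HTML file).

Summary:

Python classes are similar to Java classes; they inherit from object.

In Python, a bare return is the same as return None (None is similar to Java's null keyword).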
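
The second point can be checked directly. The two functions below are made up for illustration and differ only in how they return when there is no url.

```python
def bare_return(url):
    if url is None:
        return          # bare return
    return url


def explicit_none(url):
    if url is None:
        return None     # explicit None
    return url


print(bare_return(None) is None, explicit_none(None) is None)
```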

Below is the address of my own Java implementation of this crawler:
