[Python crawler] My first Python crawler: scraping every article from a Sina blog and saving them as a .doc file

Posted by Thorrrrrrrrrr on 2017-03-16

Original URL: https://blog.csdn.net/sinat_33487968/article/details/62420502

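I recently started learning to write crawlers in Python. I had planned to begin with the basic syntax, but while browsing Zhihu I came across a thread on how to get started with Python crawlers. One answer there (https://www.zhihu.com/question/20899988/answer/24923424) says: "'Getting started' is a good motivation, but it can be slow to pay off. If you have a project in hand or in mind, you will be driven by the goal when you practice, instead of learning slowly the way you would by studying modules." So I decided not to work through the basic modules one by one. Instead I would learn directly from small crawler programs, and go back to the fundamentals whenever I hit something I did not understand. I gradually found that much of the material is interconnected. When you study the basics in isolation, you do not think about what a feature is actually for; you only know that it exists. Once you put it into practice yourself, you understand it far more deeply.

Without further ado: I consulted a lot of material for my first crawler. The goal is to save every article from one person's Sina blog. A training institute's video online claims to do the same, but after watching it I found the title was just a gimmick: it only saved the HTML of every article into a folder, whereas I wanted editable text. Those are tasks of different difficulty (at least for me, having only just met Python), so I decided to write my own with the help of reference material.

I use Han Han's blog as the example.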

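The approach has three steps. First, fetch a single article successfully (its title and body). Next, fetch every article listed on the first page of the blog's catalog, which means getting each article's URL before fetching the article itself. Finally, compare the catalog pages for different page numbers, find what they have in common, and write a function that fetches every page. Then the job is done.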
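
The "common pattern" across catalog pages is simply the page number at the end of the URL. Here is a minimal sketch of that idea in Python 3 (the script in this post is Python 2). The helper names `parse_page_count` and `list_page_urls` and the sample HTML fragment are made up for illustration; only the URL shape and the blog code come from this post.

```python
import re

BASE = "http://blog.sina.com.cn/s/articlelist_"


def parse_page_count(html):
    """Return the page count from a '共N页' span, or 1 if it is absent."""
    match = re.search(r"<span style[^>]*>共\s*(\d+)\s*页</span>", html)
    return int(match.group(1)) if match else 1


def list_page_urls(blog_code, page_count):
    """Build the URL of every catalog page, e.g. ..._1866629225_0_1.html."""
    return [f"{BASE}{blog_code}{n}.html" for n in range(1, page_count + 1)]


sample = '<span style="color:#555">共7页</span>'  # made-up fragment
urls = list_page_urls("1866629225_0_", parse_page_count(sample))
print(urls[0])
print(len(urls))
```

Falling back to a single page when no pager is found mirrors what the script does in getPageNum.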

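    def getText(self,url):
        # Download the article page and decode it as UTF-8
        text = urlopen(url).read().decode('utf-8')
        # Sina pages are in Simplified Chinese, so the markers must be too
        start = text.find(u"<!-- 正文开始 -->")
        end = text.find(u"<!-- 正文结束 -->")
        if start == -1 or end == -1:
            return ''  # body markers not found
        text = text[start:end]
        # Turn each paragraph tag into a newline plus a four-space indent
        text = re.sub(r'<p.*?>', "\n    ", text)
        # Strip every remaining tag (including the start marker comment)
        text = re.sub(r'<[^>]+>', '', text)
        # Replace HTML entities such as &nbsp; with a space
        text = re.sub(r'&#?\w+;', ' ', text)
        return text.encode('utf-8')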

This function fetches the content of a single article.
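
The same idea can be tried offline on a string. The sketch below is Python 3 rather than the Python 2 of this post, and `extract_body` and the sample HTML are hypothetical. It also swaps the hand-written entity regex for the standard library's `html.unescape`, which converts entities to real characters instead of replacing them with spaces.

```python
import html
import re

START, END = "<!-- 正文开始 -->", "<!-- 正文结束 -->"


def extract_body(page):
    """Return the plain text between the two body markers, or '' if absent."""
    start = page.find(START)
    if start == -1:
        return ""
    end = page.find(END, start)
    if end == -1:
        return ""
    body = page[start + len(START):end]
    body = re.sub(r"<p\b[^>]*>", "\n    ", body)  # paragraph -> indented line
    body = re.sub(r"<[^>]+>", "", body)           # drop every remaining tag
    return html.unescape(body).strip()            # &amp; and friends -> characters


sample = ("<div><!-- 正文开始 --><p class='x'>Hello&amp;bye</p>"
          "<p>Second</p><!-- 正文结束 --></div>")
print(extract_body(sample))
```

Checking for a missing marker matters because `str.find` returns -1 in that case, and slicing with -1 would silently return the wrong text.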

    def getUrl(self,page):
        # Capture the href of every article link on a catalog page
        pattern = re.compile('<span class="atc_title">.*?<a.*?href="(.*?)">.*?</a>.*?</span>',re.S)
        items = re.findall(pattern,page)
        urls = []
        for item in items:
            urls.append(item.encode('utf-8'))
        return urls

This function collects the URL of every article on a catalog page.

    def getContent(self,page):
        # Capture the link text (the article title) of every catalog entry
        pattern = re.compile(r'<span class="atc_title">.*?<a.*?href.*?\.html">(.*?)</a>.*?</span>',re.S)
        items = re.findall(pattern,page)
        contents = []
        for item in items:
            # Clean leftover tags from the title and pad it with blank lines
            content = "\n"+self.tool.replace(item)+"\n"
            contents.append(content.encode('utf-8'))
        return contents

In the same way, this one collects the article titles.
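
Since titles and URLs come from the same `<a>` tag, one pattern with two capture groups can return them already paired. This is a Python 3 sketch, not the code used in this post; `list_entries` and the sample markup (including the two article URLs) are hypothetical, and the real catalog markup may differ.

```python
import re

# One pattern captures both the link target and the link text.
ENTRY = re.compile(
    r'<span class="atc_title">.*?<a[^>]*?href="([^"]+)"[^>]*>(.*?)</a>',
    re.S,
)


def list_entries(catalog_html):
    """Return (title, url) pairs in page order."""
    return [(title.strip(), url) for url, title in ENTRY.findall(catalog_html)]


sample = """
<span class="atc_title"><a title="" target="_blank" href="http://blog.sina.com.cn/s/blog_a.html">First post</a></span>
<span class="atc_title"><a title="" target="_blank" href="http://blog.sina.com.cn/s/blog_b.html">Second post</a></span>
"""
for title, url in list_entries(sample):
    print(title, url)
```

Keeping each title next to its URL avoids having to match two separate lists up again later.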

Finally, here is the full source code.

# -*- coding: utf-8 -*-
# Python 2 script; save this file as UTF-8

__author__ = 'Thor'
import urllib2
import re
from urllib import urlopen


class Tool:
    # Remove img tags and runs of seven spaces
    removeImg = re.compile('<img.*?>| {7}')
    # Remove hyperlink tags
    removeAddr = re.compile('<a.*?>|</a>')
    # Replace line-breaking tags with \n
    replaceLine = re.compile('<tr>|<div>|</div>|</p>')
    # Replace table cells <td> with \t
    replaceTD= re.compile('<td>')
    # Replace paragraph openings with \n plus an indent
    replacePara = re.compile('<p.*?>')
    # Replace single or double <br> with \n
    replaceBR = re.compile('<br><br>|<br>')
    # Remove any remaining tags
    removeExtraTag = re.compile('<.*?>')
    def replace(self,x):
        x = re.sub(self.removeImg,"",x)
        x = re.sub(self.removeAddr,"",x)
        x = re.sub(self.replaceLine,"\n",x)
        x = re.sub(self.replaceTD,"\t",x)
        x = re.sub(self.replacePara,"\n    ",x)
        x = re.sub(self.replaceBR,"\n",x)
        x = re.sub(self.removeExtraTag,"",x)
        # strip() removes leading and trailing whitespace
        return x.strip()

class XLBK:
    def __init__(self,baseUrl,articleTag,fileName):
        self.baseURL=baseUrl
        self.tool=Tool()
        self.file=None
        self.article=1
        self.defaultTitle=u'Sina Blog'
        self.articleTag=articleTag
        self.fileName=fileName

    def getPage(self,pageNum):
        # Fetch catalog page number pageNum; return None on failure
        try:
            url=self.baseURL+str(pageNum)+'.html'
            print url
            request= urllib2.Request(url)
            response=urllib2.urlopen(request)
            return response.read().decode('utf-8')

        except urllib2.URLError ,e:
            if hasattr(e,"reason"):
                print u"Failed to connect to Sina Blog, reason:",e.reason
            return None
    def getTitle(self,page):
        # Extract the blog's name from the page header
        pattern = re.compile('blogname.*?blognamespan.*?>(.*?)</span>', re.S)
        result = re.search(pattern,page)
        if result:
            # Only touch result after confirming the pattern matched
            print "title"+result.group(1).strip()
            return result.group(1).strip()
        else:
            return None

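    def getPageNum(self,page):
        # Read the page count from the pager span; the live page is in
        # Simplified Chinese, so the pattern must be too
        pattern= re.compile(ur'<span style.*?>共(.*?)页</span>',re.S)
        result = re.search(pattern,page)
        if result:
            return result.group(1).strip()
        else:
            # No pager found: treat the blog as having one catalog page
            return 1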

    def getContent(self,page):
        # Capture the link text (the article title) of every catalog entry
        pattern = re.compile(r'<span class="atc_title">.*?<a.*?href.*?\.html">(.*?)</a>.*?</span>',re.S)
        items = re.findall(pattern,page)
        contents = []
        for item in items:
            content = "\n"+self.tool.replace(item)+"\n"
            contents.append(content.encode('utf-8'))
        return contents

    def getUrl(self,page):
        # Capture the href of every article link on a catalog page
        pattern =re.compile('<span class="atc_title">.*?<a.*?href="(.*?)">.*?</a>.*?</span>',re.S)
        items = re.findall(pattern,page)
        urls = []
        for item in items:
            urls.append(item.encode('utf-8'))
        return urls

    def getText(self,url):
        # Download the article page and decode it as UTF-8
        text = urlopen(url).read().decode('utf-8')
        # Sina pages are in Simplified Chinese, so the markers must be too
        start = text.find(u"<!-- 正文开始 -->")
        end = text.find(u"<!-- 正文结束 -->")
        if start == -1 or end == -1:
            return ''  # body markers not found
        text = text[start:end]
        # Turn each paragraph tag into a newline plus a four-space indent
        text = re.sub(r'<p.*?>', "\n    ", text)
        # Strip every remaining tag (including the start marker comment)
        text = re.sub(r'<[^>]+>', '', text)
        # Replace HTML entities such as &nbsp; with a space
        text = re.sub(r'&#?\w+;', ' ', text)
        return text.encode('utf-8')

    def setFileTitle(self,title):
        # Open the output file. Plain text is written; only the extension is .doc
        if title is not None:
            self.file = open(title + ".doc","w")
        else:
            self.file = open(self.defaultTitle + ".doc","w")


    def writeData(self,contents,urls):
        # Pair each title with its URL. Looking the URL up with
        # contents.index(item) picks the wrong one when titles repeat.
        for item, url in zip(contents, urls):
            if self.articleTag == '1':
                articleLine = "\n" + str(self.article) + u"--------------------------------------------------------------------------------\n"
                self.file.write(articleLine)
            self.file.write(item)
            self.file.write(url)
            text=self.getText(url)
            print text
            self.file.write(str(text))
            self.article += 1


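    def start(self):
        indexPage = self.getPage(1)
        # getPage returns None when the download fails
        if indexPage is None:
            print "The URL is no longer valid, please try again"
            return
        pageNum = self.getPageNum(indexPage)
        # Printed for information only; the file name comes from user input
        title = self.getTitle(indexPage)
        self.setFileTitle(self.fileName)
        try:
            print "This blog has " + str(pageNum) + " catalog page(s)"
            for i in range(1,int(pageNum)+1):
                print "Writing page " + str(i)
                page = self.getPage(i)
                if page is None:
                    continue  # skip a page that failed to download
                contents = self.getContent(page)
                urls =self.getUrl(page)
                self.writeData(contents,urls)
        except IOError,e:
            print "Write error, reason: " + str(e)
        finally:
            # Close the file so buffered text is flushed to disk
            self.file.close()
            print "Write task finished"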



print u"Open the article catalog of a Sina blog,\ne.g. http://blog.sina.com.cn/s/articlelist_1866629225_0_1.html \nThe code of that blog is then 1866629225_0_   \nPlease enter the blog code"
baseURL = 'http://blog.sina.com.cn/s/articlelist_' + raw_input("").strip()
articleTag = raw_input("Write article numbers? Enter 1 for yes, 0 for no\n")
fileName=raw_input("Enter a name for the output file\n")
xlbk = XLBK(baseURL,articleTag,fileName)
xlbk.start()
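
For readers on Python 3, the writing step could look roughly like the sketch below. It is not the script above: `write_articles` and the sample entries are hypothetical, and the function takes ready-made (title, url, text) triples instead of downloading anything. A `with` block closes the file even if a write fails, and an explicit UTF-8 encoding keeps the Chinese text intact.

```python
import os
import tempfile


def write_articles(path, entries, numbered=True):
    """Write (title, url, text) triples to `path` as UTF-8 plain text."""
    with open(path, "w", encoding="utf-8") as out:  # closed even on error
        for number, (title, url, text) in enumerate(entries, start=1):
            if numbered:
                out.write(f"\n{number}{'-' * 80}\n")
            out.write(f"\n{title}\n{url}\n{text}\n")
    return len(entries)


# Two articles share a title on purpose: each still keeps its own URL.
entries = [("T1", "http://example.com/1", "body one"),
           ("T1", "http://example.com/2", "body two")]
path = os.path.join(tempfile.mkdtemp(), "demo.doc")
count = write_articles(path, entries)
with open(path, encoding="utf-8") as f:
    saved = f.read()
print(count)
```

The output is still plain text with a .doc extension, as in the original script, so a word processor opens it as unformatted text.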

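After that, you can sit back and wait for the articles to be collected.

[Image 1]

It suddenly felt like discovering a new continent.

[Image 2]

PS: This is my first blog post, and I did not realize that to insert an image you have to replace the placeholder text; otherwise neither uploading nor copy and paste works.

To sum up, this is the first little crawler script I have written myself. It uses only the most basic libraries, urllib and urllib2. I later learned that there are more convenient tools such as XPath and BeautifulSoup, which I am now studying in depth.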
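
To give a feel for why a parser is more convenient than regular expressions, here is a small Python 3 sketch that pulls titles and links out of catalog markup. It deliberately uses the standard library's `html.parser` as a stand-in, not BeautifulSoup or XPath, so that it runs without installing anything. The class name `TitleLinkParser` and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class TitleLinkParser(HTMLParser):
    """Collect (title, href) for every <a> inside <span class="atc_title">."""

    def __init__(self):
        super().__init__()
        self.in_title_span = False
        self.href = None      # href of the link currently being read
        self.text = []        # text pieces of that link
        self.entries = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and attrs.get("class") == "atc_title":
            self.in_title_span = True
        elif tag == "a" and self.in_title_span:
            self.href = attrs.get("href")
            self.text = []

    def handle_data(self, data):
        if self.href is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.href is not None:
            self.entries.append(("".join(self.text).strip(), self.href))
            self.href = None
        elif tag == "span":
            self.in_title_span = False


parser = TitleLinkParser()
parser.feed('<span class="atc_title"><a href="http://example.com/a.html">'
            'A <b>bold</b> title</a></span>')
print(parser.entries)
```

Nested tags inside a title are handled by the parser itself, which is exactly the kind of case that is awkward to cover with regular expressions.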

Corrections are welcome.
