Python Crawler Site-Building Notes for Beginners (3): Scraping into the Database

Published 2016-05-04

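Last time I got the crawlers roughly written.
I wrote two of them: a content crawler, and a crawler that collects the question links listed under a tag.
One is still missing: a crawler that collects the full list of tags. I'll leave that aside for now, because I forgot it last time and don't feel like doing it this time.
There is another reason: for real scraping, you can fetch every question directly from http://segmentfault.com/questions/newest?page=1 and crawl them one by one.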
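
As a rough illustration of walking that listing page by page, here is a minimal sketch that only builds the page URLs. The helper name newest_page_urls and the page range are my own assumptions; it does not fetch anything, and how many pages actually exist has to be discovered at crawl time.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2

NEWEST_URL = 'http://segmentfault.com/questions/newest'


def newest_page_urls(first_page, last_page):
    """Yield listing URLs for pages first_page..last_page (inclusive)."""
    for page in range(first_page, last_page + 1):
        # urlencode builds the "page=N" query string
        yield '%s?%s' % (NEWEST_URL, urlencode({'page': page}))


urls = list(newest_page_urls(1, 3))
```

Each URL in urls could then be handed to a listing spider, and every question link found there to the content crawler.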

On to the main topic.

Part 3: saving the scraped data to the database.

3.1 Defining the database (or model, or schema)

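To store the data, I need to define the database structure in Django. (I won't go into NoSQL here, or MongoDB, which is also NoSQL but feels a lot like a relational database.)
Remember the app named web? It contains a file called models.py, which I'm going to edit now.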

vim ~/python_spider/web/models.py

The contents are as follows:

# -*- coding: utf-8 -*-
from django.db import models

# Create your models here.


class Tag(models.Model):
    title = models.CharField(max_length=30)

    def __unicode__(self):
        return self.title


class Question(models.Model):
    title = models.CharField(max_length=255)
    content = models.TextField()
    tags = models.ManyToManyField(Tag, related_name='questions')
    sf_id = models.CharField(max_length=16, default='0')  # remembers the question's ID on SegmentFault, handy for later updates or other operations
    update_date = models.DateTimeField(auto_now=True)

    def __unicode__(self):
        return self.title


class Answer(models.Model):
    question = models.ForeignKey(Question, related_name='answers')
    content = models.TextField()

    def __unicode__(self):
        return 'To question %s' % self.question.title

It's all pretty straightforward; see the Django documentation for details on each field.
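
To get a feel for what these three models mean in relational terms, here is a rough sketch using the standard sqlite3 module. The table and column names are simplified assumptions for illustration. Django derives its own table names (by default prefixed with the app label) and generates the real schema during migration. The point is only that the ManyToManyField becomes a separate join table, while the ForeignKey becomes a column on the answer table.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
CREATE TABLE tag (id INTEGER PRIMARY KEY, title VARCHAR(30) NOT NULL);
CREATE TABLE question (
    id INTEGER PRIMARY KEY,
    title VARCHAR(255) NOT NULL,
    content TEXT NOT NULL,
    sf_id VARCHAR(16) NOT NULL DEFAULT '0',
    update_date DATETIME
);
-- join table standing in for Question.tags (many-to-many)
CREATE TABLE question_tags (
    question_id INTEGER NOT NULL REFERENCES question(id),
    tag_id INTEGER NOT NULL REFERENCES tag(id),
    UNIQUE (question_id, tag_id)
);
-- Answer.question (foreign key) is just a column
CREATE TABLE answer (
    id INTEGER PRIMARY KEY,
    question_id INTEGER NOT NULL REFERENCES question(id),
    content TEXT NOT NULL
);
""")
conn.execute("INSERT INTO tag (id, title) VALUES (1, 'demo-tag')")
conn.execute("INSERT INTO question (id, title, content) "
             "VALUES (1, 'demo question', 'body')")
conn.execute("INSERT INTO question_tags (question_id, tag_id) VALUES (1, 1)")
conn.execute("INSERT INTO answer (question_id, content) VALUES (1, 'demo answer')")

# Similar in spirit to tag.questions.all(): follow the join table.
titles = [row[0] for row in conn.execute(
    "SELECT q.title FROM question q "
    "JOIN question_tags qt ON qt.question_id = q.id "
    "WHERE qt.tag_id = ?", (1,))]
# Similar in spirit to question.answers.all().count()
answer_count = conn.execute(
    "SELECT COUNT(*) FROM answer WHERE question_id = 1").fetchone()[0]
default_sf_id = conn.execute("SELECT sf_id FROM question").fetchone()[0]
```

The related_name values in the models ('questions', 'answers') are what let the ORM walk these relations backwards without writing such joins by hand.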

Next, I need to tell my python_spider project to load the web app when it runs (a project does not automatically load the apps inside it).

vim ~/python_spider/python_spider/settings.py

Add web to INSTALLED_APPS:

INSTALLED_APPS = (
    'django.contrib.admin',
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.messages',
    'django.contrib.staticfiles',
    'web',
)

Now Django can generate the database schema automatically:

cd ~/python_spider
python manage.py makemigrations
python manage.py migrate

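Now a db.sqlite3 file has appeared in my ~/python_spider directory; this is my database.
Let's play with the models a bit in the Django shell (python manage.py shell):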

>>> from web.models import Answer, Question, Tag
>>> tag = Tag()
>>> tag.title = u'測試標籤'  # "test tag"
>>> tag.save()
>>> tag
<Tag: 測試標籤>
>>> question = Question(title=u'測試提問', content=u'提問內容')  # "test question", "question body"
>>> question.save()
>>> question.tags.add(tag)
>>> question.save()
>>> answer = Answer(content=u'回答內容', question=question)  # "answer body"
>>> answer.save()
>>> tag.questions.all()  # find questions by tag
[<Question: 測試提問>]
>>> question.tags.all()  # get the question's tags
[<Tag: 測試標籤>]
>>> question.answers.all()  # get the question's answers
[<Answer: To question 測試提問>]

All of the operations above behave as expected, so the models I defined are usable.

3.2 Saving to the database

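Next I need to store the scraped information. Put plainly, I use Django's ORM to save what my own spiders collect into the database Django is connected to, so that Django can read it back later when building the site.

There are many ways to do this; here is just one. I create a spider.py in the web app and define two spiders in it that inherit from the spiders I wrote earlier and add methods for saving to the database.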

vim ~/python_spider/web/spider.py

The code is as follows:

# -*- coding: utf-8 -*-
from sfspider import spider
from web.models import Answer, Question, Tag


class ContentSpider(spider.SegmentfaultQuestionSpider):

    def save(self):  # added save() method
        sf_id = self.url.split('/')[-1]  # 1
        tags = [Tag.objects.get_or_create(title=tag_title)[0] for tag_title in self.tags]  # 2
        question, created = Question.objects.get_or_create(
            sf_id=sf_id,
            defaults={'title':self.title, 'content':self.content}
        ) # 3
        question.tags.add(*tags) # 4
        question.save()
        for answer in self.answers:
            Answer.objects.get_or_create(content=answer, question=question)
        return question, created


class TagSpider(spider.SegmentfaultTagSpider):

    def crawl(self):  # crawl the current page
        sf_ids = [url.split('/')[-1] for url in self.questions]
        for sf_id in sf_ids:
            question, created = ContentSpider(sf_id).save()

    def crawl_all_pages(self):
        while True:
            print u'正在抓取TAG:%s, 分頁:%s' % (self.tag_name, self.page) # 5
            self.crawl()
            if not self.has_next_page:
                break
            else:
                self.next_page()

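1. This part is clumsy; I should have added this attribute to SegmentfaultQuestionSpider earlier.
2. Create or fetch the tags for this question.
3. Create or fetch the question, using sf_id to avoid duplicates.
4. Add all the tags to the question. The * is needed because the method takes its arguments as (tag1, tag2, tag3), while our tags is a list.
5. Makes it easy to watch progress while testing.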
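
The ideas behind notes 1, 3 and 4 can be tried without Django. The sketch below is only an in-memory imitation: TinyStore, add_tags, sf_id_from_url and the example URL are hypothetical names and values of my own, not part of Django or of the spiders above. It mimics the (object, created) pair that get_or_create returns, shows that a second call with the same key does not create a duplicate, and shows * spreading a list into separate arguments.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2


def sf_id_from_url(url):
    """Take the last path segment, ignoring a trailing slash or query string."""
    return urlparse(url).path.rstrip('/').split('/')[-1]


class TinyStore(object):
    """In-memory stand-in for a table with one unique key column."""

    def __init__(self):
        self.rows = {}

    def get_or_create(self, key, defaults=None):
        if key in self.rows:
            return self.rows[key], False  # found: defaults are ignored
        row = dict(defaults or {}, key=key, tags=set())
        self.rows[key] = row
        return row, True  # newly created


def add_tags(row, *tags):
    """Accepts tags as separate positional arguments, like add(t1, t2, t3)."""
    row['tags'].update(tags)


questions = TinyStore()
sf_id = sf_id_from_url('http://example.com/q/12345/?sort=new')  # made-up URL
first, created_first = questions.get_or_create(sf_id, {'title': 'demo'})
again, created_again = questions.get_or_create(sf_id, {'title': 'ignored'})

tags = ['wechat', 'python']
add_tags(first, *tags)  # same as add_tags(first, 'wechat', 'python')
```

Running the crawl twice over the same question therefore keeps one row per sf_id, which is the reason for storing sf_id on the model in the first place.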

Then, test the import script:

python manage.py shell

python>>> from web.spider import TagSpider
>>> t = TagSpider(u'微信')
>>> t.crawl_all_pages()
正在抓取TAG:微信, 分頁:1
正在抓取TAG:微信, 分頁:2
正在抓取TAG:微信, 分頁:3
KeyboardInterrupt # 用control-c中斷執行，測試一下就行:)
>>> from web.models import Tag, Question
>>> Question.objects.all()
[, , , , , , , , , , , , , , , , , , , , '...(remaining elements truncated)...']
>>> Question.objects.get(pk=5).tags.all() # 資料庫中id=5的question的tags
[, , , ]
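
If pressing Ctrl-C feels too crude for a quick test, the paging loop can also be given an optional page limit. The sketch below is an offline imitation under my own assumptions: FakeTagPages is a hypothetical stand-in that only mirrors the questions, has_next_page and next_page() members used by the tag spider above, and crawl_pages with its max_pages parameter is not part of the original code.

```python
class FakeTagPages(object):
    """Hypothetical stand-in for a tag spider: a fixed list of pages."""

    def __init__(self, pages):
        self._pages = pages
        self.page = 1

    @property
    def questions(self):
        return self._pages[self.page - 1]

    @property
    def has_next_page(self):
        return self.page < len(self._pages)

    def next_page(self):
        self.page += 1


def crawl_pages(spider, max_pages=None):
    """Visit pages in order; stop at the last page or after max_pages."""
    seen = []
    visited = 0
    while True:
        seen.extend(spider.questions)  # "crawl" the current page
        visited += 1
        if not spider.has_next_page:
            break
        if max_pages is not None and visited >= max_pages:
            break
        spider.next_page()
    return seen


pages = [['a', 'b'], ['c'], ['d', 'e']]
all_ids = crawl_pages(FakeTagPages(pages))
limited_ids = crawl_pages(FakeTagPages(pages), max_pages=2)
```

With a limit like this, a test run stops by itself after a couple of pages while a full crawl simply omits the argument.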

3.3 Setting up django.contrib.admin to view and edit the content

To look at the scraped data more directly, I can use the admin that ships with Django.
Edit the file:

vim ~/python_spider/web/admin.py

from django.contrib import admin
from web.models import Tag, Question, Answer

admin.site.register(Tag)
admin.site.register(Question)
admin.site.register(Answer)

Then create a superuser:

python manage.py createsuperuser  # follow the prompts

Start the test server:

python manage.py runserver 0.0.0.0:80  # I'm on RunAbove; locally, just run manage.py runserver

Then I visit http://192.99.71.91/admin/ and log in with the account I just created, and I can view and edit the content.

OK, that's it for today.
The next post covers writing Django views and applying a simple template to build the site.
