Install the Scrapy crawler framework
How to install Python and the Scrapy framework is not covered here; please search online for instructions (once Python is set up, typically running pip install Scrapy is enough).
Initialization
After installing Scrapy, run scrapy startproject myspider
This creates a myspider folder with the following directory structure:
- scrapy.cfg
- myspider
  - items.py
  - pipelines.py
  - settings.py
  - __init__.py
  - spiders
    - __init__.py
Write the spider
Create a new file users.py under the spiders directory (the code below is written for Python 2):
```python
# -*- coding: utf-8 -*-
import scrapy
import os
import time

from myspider.items import UserItem
from myspider.myconfig import UsersConfig  # spider configuration


class UsersSpider(scrapy.Spider):
    name = 'users'
    domain = 'https://www.zhihu.com'
    login_url = 'https://www.zhihu.com/login/email'
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language": "zh-CN,zh;q=0.8",
        "Connection": "keep-alive",
        "Host": "www.zhihu.com",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.109 Safari/537.36"
    }

    def __init__(self, url = None):
        self.user_url = url

    def start_requests(self):
        # open the homepage first to obtain the _xsrf token and a cookie jar
        yield scrapy.Request(
            url = self.domain,
            headers = self.headers,
            meta = {
                'proxy': UsersConfig['proxy'],
                'cookiejar': 1
            },
            callback = self.request_captcha
        )

    def request_captcha(self, response):
        # extract the _xsrf value
        _xsrf = response.css('input[name="_xsrf"]::attr(value)').extract()[0]
        # build the captcha URL
        captcha_url = 'http://www.zhihu.com/captcha.gif?r=' + str(time.time() * 1000)
        # download the captcha
        yield scrapy.Request(
            url = captcha_url,
            headers = self.headers,
            meta = {
                'proxy': UsersConfig['proxy'],
                'cookiejar': response.meta['cookiejar'],
                '_xsrf': _xsrf
            },
            callback = self.download_captcha
        )

    def download_captcha(self, response):
        # save the captcha image
        with open('captcha.gif', 'wb') as fp:
            fp.write(response.body)
        # open the captcha image with the default viewer
        # ('start' is Windows-only; use 'open' on macOS or 'xdg-open' on Linux)
        os.system('start captcha.gif')
        # ask for the captcha on the terminal
        print 'Please enter captcha: '
        captcha = raw_input()

        # submit the login form together with the captcha
        yield scrapy.FormRequest(
            url = self.login_url,
            headers = self.headers,
            formdata = {
                'email': UsersConfig['email'],
                'password': UsersConfig['password'],
                '_xsrf': response.meta['_xsrf'],
                'remember_me': 'true',
                'captcha': captcha
            },
            meta = {
                'proxy': UsersConfig['proxy'],
                'cookiejar': response.meta['cookiejar']
            },
            callback = self.request_zhihu
        )

    def request_zhihu(self, response):
        # seed user: scrape the profile, then walk the followees and followers
        yield scrapy.Request(
            url = self.user_url + '/about',
            headers = self.headers,
            meta = {
                'proxy': UsersConfig['proxy'],
                'cookiejar': response.meta['cookiejar'],
                'from': {
                    'sign': 'else',
                    'data': {}
                }
            },
            callback = self.user_item,
            dont_filter = True
        )
        yield scrapy.Request(
            url = self.user_url + '/followees',
            headers = self.headers,
            meta = {
                'proxy': UsersConfig['proxy'],
                'cookiejar': response.meta['cookiejar'],
                'from': {
                    'sign': 'else',
                    'data': {}
                }
            },
            callback = self.user_start,
            dont_filter = True
        )
        yield scrapy.Request(
            url = self.user_url + '/followers',
            headers = self.headers,
            meta = {
                'proxy': UsersConfig['proxy'],
                'cookiejar': response.meta['cookiejar'],
                'from': {
                    'sign': 'else',
                    'data': {}
                }
            },
            callback = self.user_start,
            dont_filter = True
        )

    def user_start(self, response):
        sel_root = response.xpath('//h2[@class="zm-list-content-title"]')
        # skip if the followee/follower list is empty
        if len(sel_root):
            for sel in sel_root:
                people_url = sel.xpath('a/@href').extract()[0]

                yield scrapy.Request(
                    url = people_url + '/about',
                    headers = self.headers,
                    meta = {
                        'proxy': UsersConfig['proxy'],
                        'cookiejar': response.meta['cookiejar'],
                        'from': {
                            'sign': 'else',
                            'data': {}
                        }
                    },
                    callback = self.user_item,
                    dont_filter = True
                )
                yield scrapy.Request(
                    url = people_url + '/followees',
                    headers = self.headers,
                    meta = {
                        'proxy': UsersConfig['proxy'],
                        'cookiejar': response.meta['cookiejar'],
                        'from': {
                            'sign': 'else',
                            'data': {}
                        }
                    },
                    callback = self.user_start,
                    dont_filter = True
                )
                yield scrapy.Request(
                    url = people_url + '/followers',
                    headers = self.headers,
                    meta = {
                        'proxy': UsersConfig['proxy'],
                        'cookiejar': response.meta['cookiejar'],
                        'from': {
                            'sign': 'else',
                            'data': {}
                        }
                    },
                    callback = self.user_start,
                    dont_filter = True
                )

    def user_item(self, response):
        # helper: first element of a list, or '' if the list is empty
        def value(list):
            return list[0] if len(list) else ''

        sel = response.xpath('//div[@class="zm-profile-header ProfileCard"]')

        item = UserItem()
        item['url'] = response.url[:-6]  # strip the trailing '/about'
        item['name'] = sel.xpath('//a[@class="name"]/text()').extract()[0].encode('utf-8')
        item['bio'] = value(sel.xpath('//span[@class="bio"]/@title').extract()).encode('utf-8')
        item['location'] = value(sel.xpath('//span[contains(@class, "location")]/@title').extract()).encode('utf-8')
        item['business'] = value(sel.xpath('//span[contains(@class, "business")]/@title').extract()).encode('utf-8')
        item['gender'] = 0 if sel.xpath('//i[contains(@class, "icon-profile-female")]') else 1  # 0 = female, 1 = male
        item['avatar'] = value(sel.xpath('//img[@class="Avatar Avatar--l"]/@src').extract())
        item['education'] = value(sel.xpath('//span[contains(@class, "education")]/@title').extract()).encode('utf-8')
        item['major'] = value(sel.xpath('//span[contains(@class, "education-extra")]/@title').extract()).encode('utf-8')
        item['employment'] = value(sel.xpath('//span[contains(@class, "employment")]/@title').extract()).encode('utf-8')
        item['position'] = value(sel.xpath('//span[contains(@class, "position")]/@title').extract()).encode('utf-8')
        item['content'] = value(sel.xpath('//span[@class="content"]/text()').extract()).strip().encode('utf-8')
        item['ask'] = int(sel.xpath('//div[contains(@class, "profile-navbar")]/a[2]/span[@class="num"]/text()').extract()[0])
        item['answer'] = int(sel.xpath('//div[contains(@class, "profile-navbar")]/a[3]/span[@class="num"]/text()').extract()[0])
        item['agree'] = int(sel.xpath('//span[@class="zm-profile-header-user-agree"]/strong/text()').extract()[0])
        item['thanks'] = int(sel.xpath('//span[@class="zm-profile-header-user-thanks"]/strong/text()').extract()[0])

        yield item
```
Add the spider configuration file
Create a new file myconfig.py under the myspider directory, add the following content, and fill in your own configuration values:
```python
# -*- coding: utf-8 -*-

UsersConfig = {
    # proxy
    'proxy': '',

    # Zhihu username and password
    'email': 'your email',
    'password': 'your password',
}

DbConfig = {
    # db config
    'user': 'db user',
    'passwd': 'db password',
    'db': 'db name',
    'host': 'db host',
}
```
Modify items.py
```python
# -*- coding: utf-8 -*-
import scrapy


class UserItem(scrapy.Item):
    # define the fields for your item here like:
    url = scrapy.Field()
    name = scrapy.Field()
    bio = scrapy.Field()
    location = scrapy.Field()
    business = scrapy.Field()
    gender = scrapy.Field()
    avatar = scrapy.Field()
    education = scrapy.Field()
    major = scrapy.Field()
    employment = scrapy.Field()
    position = scrapy.Field()
    content = scrapy.Field()
    ask = scrapy.Field()
    answer = scrapy.Field()
    agree = scrapy.Field()
    thanks = scrapy.Field()
```
Store the user data in a MySQL database
Modify pipelines.py
```python
# -*- coding: utf-8 -*-
import MySQLdb
import datetime

from myspider.myconfig import DbConfig


class UserPipeline(object):
    def __init__(self):
        self.conn = MySQLdb.connect(user = DbConfig['user'], passwd = DbConfig['passwd'],
                                    db = DbConfig['db'], host = DbConfig['host'],
                                    charset = 'utf8', use_unicode = True)
        self.cursor = self.conn.cursor()
        # empty the table
        # self.cursor.execute('truncate table weather;')
        # self.conn.commit()

    def process_item(self, item, spider):
        curTime = datetime.datetime.now()
        try:
            self.cursor.execute(
                """INSERT IGNORE INTO users (url, name, bio, location, business, gender, avatar,
                education, major, employment, position, content, ask, answer, agree, thanks, create_at)
                VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)""",
                (
                    item['url'],
                    item['name'],
                    item['bio'],
                    item['location'],
                    item['business'],
                    item['gender'],
                    item['avatar'],
                    item['education'],
                    item['major'],
                    item['employment'],
                    item['position'],
                    item['content'],
                    item['ask'],
                    item['answer'],
                    item['agree'],
                    item['thanks'],
                    curTime
                )
            )
            self.conn.commit()
        except MySQLdb.Error, e:
            print 'Error %d %s' % (e.args[0], e.args[1])

        return item
```
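The pipeline assumes a users table already exists in the configured database. The original write-up does not show its schema, so below is a minimal sketch of a helper script (the file name create_table.py, the column types, and the unique key on url are all assumptions) that creates a table matching the INSERT above. Run it once from the project root before starting the crawl.

```python
# -*- coding: utf-8 -*-
# create_table.py -- hypothetical helper, not part of the original tutorial.
# Creates a `users` table matching the columns used by UserPipeline.
# Column types and the unique key on `url` are assumptions; adjust as needed.
import MySQLdb

from myspider.myconfig import DbConfig

CREATE_USERS_TABLE = """
CREATE TABLE IF NOT EXISTS users (
    id         INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    url        VARCHAR(255) NOT NULL,
    name       VARCHAR(255) NOT NULL DEFAULT '',
    bio        VARCHAR(255) NOT NULL DEFAULT '',
    location   VARCHAR(255) NOT NULL DEFAULT '',
    business   VARCHAR(255) NOT NULL DEFAULT '',
    gender     TINYINT      NOT NULL DEFAULT 1,
    avatar     VARCHAR(255) NOT NULL DEFAULT '',
    education  VARCHAR(255) NOT NULL DEFAULT '',
    major      VARCHAR(255) NOT NULL DEFAULT '',
    employment VARCHAR(255) NOT NULL DEFAULT '',
    position   VARCHAR(255) NOT NULL DEFAULT '',
    content    TEXT,
    ask        INT NOT NULL DEFAULT 0,
    answer     INT NOT NULL DEFAULT 0,
    agree      INT NOT NULL DEFAULT 0,
    thanks     INT NOT NULL DEFAULT 0,
    create_at  DATETIME,
    UNIQUE KEY uniq_url (url)
) DEFAULT CHARSET = utf8
"""

if __name__ == '__main__':
    # connect with the same credentials the pipeline uses
    conn = MySQLdb.connect(user = DbConfig['user'], passwd = DbConfig['passwd'],
                           db = DbConfig['db'], host = DbConfig['host'], charset = 'utf8')
    cursor = conn.cursor()
    cursor.execute(CREATE_USERS_TABLE)
    conn.commit()
    conn.close()
```

With a unique key on url, the INSERT IGNORE in the pipeline silently skips users that have already been saved.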
Modify settings.py
Find ITEM_PIPELINES and change it to:
```python
ITEM_PIPELINES = {
    'myspider.pipelines.UserPipeline': 300,
}
```
At the end of the file, add the following to set the crawl depth:
```python
DEPTH_LIMIT = 10
```
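Optionally (this is not part of the original write-up), you can throttle the crawl a little so it puts less load on the site; these are standard Scrapy settings and can go in the same settings.py:

```python
# Optional settings (not from the original tutorial): slow the crawl down
# and keep session cookies between requests.
DOWNLOAD_DELAY = 1      # wait 1 second between requests
COOKIES_ENABLED = True  # default is already True; required for the login/cookiejar flow
```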
Crawl Zhihu user data
Make sure MySQL is running, open a terminal in the project root directory, and run scrapy crawl users -a url=https://www.zhihu.com/people/user, where user (the last segment of the URL) is the first user to crawl. Starting from that seed user, the spider keeps expanding through each user's followees and followers.
Next, the captcha image is downloaded; if it does not open automatically, open captcha.gif in the project root yourself, then type the captcha into the terminal.
The data crawl then runs and the results are written to the database.
Source code
The source code can be found here: github