Writing a Simple Weibo Crawler in Python

Posted on 2016-03-03

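I'm a heavy Weibo user. When I'm not working or studying I like to scroll through my timeline to see what's new, and along the way I've come across quite a few high-quality original big-V accounts: some share technical material, like 好東西傳送門; some hand out a bit of life experience now and then, like 石康; some are prolific joke writers, like 銀教授; and some are masters of dirty pictures and dirty jokes, like 阿良哥哥木木蘿希木初犬餅…

All right, I admit it: crawling the dirty pictures and dirty jokes was my real goal, and the first three were just a cover… (covers face, runs away)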

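A quick aside: I originally wanted to use the Sina Weibo API to fetch posts, but then found that it has far too many restrictions. See for yourself:

It can only fetch posts of the currently authorized user (that is, yourself), and it only returns the latest 5. WTF!
So I dropped that route and switched to crawling the pages directly. The PC version of Weibo loads its content dynamically via Ajax, which makes it awkward to crawl, so I backed off and crawled the mobile version instead. The mobile version is paginated, so all of a user's posts can be fetched by walking through the pages one by one, which simplifies the job a lot.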
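
As a rough sketch of what page-by-page crawling means here: the list of URLs to visit can be generated up front once the page count is known. The URL pattern is the one used by the script in this post; the helper name `page_urls` and the sample user_id are made up for illustration, and the snippet uses Python 3 syntax rather than the Python 2.7 of the original script.

```python
# Sketch only: page_urls is a hypothetical helper, not part of the original script.
# The URL pattern is copied from the script in this post.
BASE = 'http://weibo.cn/u/%d?filter=1&page=%d'


def page_urls(user_id, page_count):
    """Return the URL of every page, from page 1 to page_count."""
    return [BASE % (user_id, page) for page in range(1, page_count + 1)]


# 1234567890 is a made-up user_id used only for the demonstration
urls = page_urls(1234567890, 3)
for u in urls:
    print(u)
```

Each URL can then be fetched in turn, which is what the main loop of the script does.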

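What the finished script does:

Enter the user_id of the Weibo user you want to crawl and get all of that user's posts
Text is saved to a text file named after the user_id; all original high-resolution images are saved to the weibo_image folder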

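How to use it:
First you need to get your own cookie. Only the Chrome method is covered here.

Open the mobile version of Sina Weibo in Chrome
Press option+command+i to open the developer tools
Open the Network tab and check Preserve log
Enter your account and password and log in to Sina Weibo
Find m.weibo.cn->Headers->Cookie and copy the cookie into the code where it says #your cookie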
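
The copied Cookie header is a single string of `name=value` pairs separated by semicolons. As an optional alternative to pasting the whole string under one key, it can be split into a dictionary. This is a minimal sketch in Python 3 syntax; `parse_cookie_header` is a hypothetical helper and the cookie names and values below are invented, not real Weibo cookies.

```python
# Sketch only: parse_cookie_header is a hypothetical helper, not part of the original script.
def parse_cookie_header(raw):
    """Turn 'a=1; b=2' into {'a': '1', 'b': '2'}."""
    cookies = {}
    for pair in raw.split(';'):
        # Split on the first '=' only, so values containing '=' stay intact
        name, sep, value = pair.strip().partition('=')
        if sep and name:
            cookies[name] = value
    return cookies


# Invented example values; paste your own header string instead
cookies = parse_cookie_header('session=abc123; theme=dark; token=x=y')
print(cookies)
```

A dictionary of this shape can be passed to `requests.get` through its `cookies` parameter.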

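Then get the user_id of the user you want to crawl. This hardly needs explaining: open the user's profile page, and the number in the address bar is the user_id.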
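
If you would rather paste the whole profile address than pick the number out by hand, something like the following could extract it. This is only a sketch in Python 3 syntax: it assumes the address has the form `.../u/<digits>`, as in the URLs the script itself requests, and `extract_user_id` is a hypothetical helper. Profiles with a custom address would not match and return `None`.

```python
import re


# Sketch only: extract_user_id is a hypothetical helper, not part of the original script.
def extract_user_id(profile_url):
    """Return the numeric user_id from a '/u/<digits>' address, or None if absent."""
    match = re.search(r'/u/(\d+)', profile_url)
    return int(match.group(1)) if match else None


# 1234567890 is a made-up user_id
print(extract_user_id('http://weibo.cn/u/1234567890?filter=1&page=1'))
```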

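Save the Python code to a file named weibo_spider.py
cd to that directory and run python weibo_spider.py user_id on the command line
If you forget to add the user_id, the script will prompt you for it when it runs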
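
The "argument first, prompt as fallback" behaviour can be isolated in a small function. This is a sketch in Python 3 syntax (so `input` stands in for the `raw_input` of the Python 2.7 script); `read_user_id` and its `ask` parameter are hypothetical names introduced here so the prompt can be replaced in tests.

```python
# Sketch only: read_user_id is a hypothetical helper, not part of the original script.
def read_user_id(argv, ask=input):
    """Use argv[1] when present; otherwise ask the user. Returns an int."""
    raw = argv[1] if len(argv) >= 2 else ask('Enter user_id: ')
    return int(raw)


# Simulates: python weibo_spider.py 1234567890 (a made-up user_id)
print(read_user_id(['weibo_spider.py', '1234567890']))
```

In a real script the call would be `read_user_id(sys.argv)`.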

Finally, the script runs to completion.

[Screenshot: iTerm]

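A small problem: in my tests, image downloads sometimes failed. I'm not sure of the exact cause yet. It may be the network, since the connection in my dorm is really unstable, but it could also be something else. So the script also writes a text file named userid_imageurls in its root directory, containing the download links of all crawled images. If a large batch of downloads fails, you can dump those links into a download tool such as Xunlei (迅雷) and fetch them there.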
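
One way to soften flaky downloads, offered here as a suggestion rather than something the original script does, is to retry each link a few times and keep a list of the ones that never succeed. The sketch below uses Python 3 syntax; `download_all`, its `fetch` parameter, and the example.com links are all hypothetical, and the real download call is replaced by a fake that fails on purpose so the logic can be run offline.

```python
# Sketch only: download_all and flaky_fetch are hypothetical, not part of the original script.
def download_all(urls, fetch, retries=3):
    """Try each URL up to `retries` times; return the URLs that never succeeded."""
    failed = []
    for url in urls:
        for _ in range(retries):
            try:
                fetch(url)
                break                # success: stop retrying this URL
            except Exception:
                continue             # failure: try again
        else:
            failed.append(url)       # every attempt failed
    return failed


calls = {}


def flaky_fetch(url):
    """Fake downloader: fails once for every URL, and always for 'bad.jpg'."""
    calls[url] = calls.get(url, 0) + 1
    if url.endswith('bad.jpg') or calls[url] < 2:
        raise IOError('simulated network error')


failed = download_all(
    ['http://example.com/a.jpg', 'http://example.com/bad.jpg'], flaky_fetch)
print(failed)
```

The links that remain in `failed` are the ones worth handing to an external download tool.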

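Also, my system is OS X El Capitan 10.11.2 and my Python version is 2.7. The dependencies can be installed with sudo pip install XXXX. For specific setup problems, search Stack Overflow yourself; I won't go into them here.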

Here is the implementation (serious face)

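# -*- coding: utf-8 -*-

import re
import sys
import os
import urllib
import urllib2
from bs4 import BeautifulSoup
import requests
from lxml import etree

# Python 2 only: make utf-8 the default for implicit str/unicode conversions
reload(sys)
sys.setdefaultencoding('utf-8')

# Take user_id from the command line, or prompt for it if it is missing
if len(sys.argv) >= 2:
    user_id = int(sys.argv[1])
else:
    user_id = int(raw_input(u"Enter user_id: "))

# Paste the Cookie header copied from the browser here
cookie = {"Cookie": "#your cookie"}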
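# Fetch the first page to find out how many pages there are
url = 'http://weibo.cn/u/%d?filter=1&page=1' % user_id

html = requests.get(url, cookies=cookie).content
selector = etree.HTML(html)
# The input named "mp" carries the total page count
pageNum = int(selector.xpath('//input[@name="mp"]')[0].attrib['value'])

result = ""            # accumulated post text
urllist_set = set()    # de-duplicated original-image URLs
word_count = 1
image_count = 1

print u'Crawler ready...'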

for page in range(1, pageNum + 1):

    # Fetch the page
    url = 'http://weibo.cn/u/%d?filter=1&page=%d' % (user_id, page)
    lxml = requests.get(url, cookies=cookie).content

    # Extract the text of every <span class="ctt">
    selector = etree.HTML(lxml)
    content = selector.xpath('//span[@class="ctt"]')
    for each in content:
        text = each.xpath('string(.)')
        # The first three matches are written without a number;
        # numbering starts at 1 from the fourth match onward
        if word_count >= 4:
            text = "%d :" % (word_count - 3) + text + "\n\n"
        else:
            text = text + "\n\n"
        result = result + text
        word_count += 1

    # Collect images: follow each "oripic" link and record the URL it resolves to
    soup = BeautifulSoup(lxml, "lxml")
    urllist = soup.find_all('a', href=re.compile(r'^http://weibo.cn/mblog/oripic', re.I))
    for imgurl in urllist:
        urllist_set.add(requests.get(imgurl['href'], cookies=cookie).url)
        image_count += 1

# Save the text to a file named after user_id in the current directory
word_path = os.getcwd() + '/%d' % user_id
with open(word_path, "wb") as fo:
    fo.write(result)
print u'Text posts saved'

# Save every image URL, one per line, to <user_id>_imageurls
with open(word_path + "_imageurls", "wb") as fo2:
    fo2.write("".join(eachlink + "\n" for eachlink in urllist_set))
print u'Image links saved'

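# Images go into the weibo_image folder under the current directory.
# image_path is set before the check so the final message can always print it.
image_path = os.getcwd() + '/weibo_image'
if not urllist_set:
    print u'No images found'
else:
    if not os.path.exists(image_path):
        os.mkdir(image_path)
    x = 1
    for imgurl in urllist_set:
        temp = image_path + '/%s.jpg' % x
        print u'Downloading image %s' % x
        try:
            urllib.urlretrieve(urllib2.urlopen(imgurl).geturl(), temp)
        except Exception:
            print u"Image download failed: %s" % imgurl
        x += 1

print u'Finished crawling original posts: %d in total, saved to %s' % (word_count - 4, word_path)
print u'Finished crawling images: %d in total, saved to %s' % (image_count - 1, image_path)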
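
The numbering in the text loop is easy to misread, so here is the same rule pulled out into a standalone function. The script leaves the first three `ctt` matches unnumbered and starts counting at the fourth. The guess that those three are not real posts is an assumption, since the post does not say. `number_posts` and the sample strings are hypothetical, and the snippet uses Python 3 syntax.

```python
# Sketch only: number_posts is a hypothetical helper mirroring the loop in the script.
def number_posts(texts, skip=3):
    """Join texts with blank lines, numbering everything after the first `skip` items."""
    parts = []
    for index, text in enumerate(texts, start=1):
        if index > skip:
            # Same format as the script: "<n> :<text>"
            parts.append('%d :%s\n\n' % (index - skip, text))
        else:
            parts.append(text + '\n\n')
    return ''.join(parts)


# Made-up sample data: three leading items followed by two posts
sample = ['header 1', 'header 2', 'header 3', 'first post', 'second post']
output = number_posts(sample)
print(output)
```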

