Python Crawler: Scraping the Daily Quote and Picture from the One Website

Published on 2016-04-06

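First, the requirements:

Lately I have been planning to collect some source material to enrich my life. I recently came across the One app, which I quite like. Every day it pushes you one picture and one short passage. I like it a lot: simple, not complicated. I want to save all of the sentences, but I do not want to open every page by hand. Hence Python. I had some Python basics before, though nothing deep. It is still nothing deep now; I will learn whatever I need as I go.
The site's content looks like this; I want the picture and the passage: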

(1)

A Mac computer

(2) Setting up the Python environment (all commands are entered in the terminal)

Install Homebrew:

Shell

/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

Install pip: here I entered python -v in the terminal, and Homebrew upgraded Python to version 2.7.11 for me. Version 2.7.11 comes with pip bundled.
Install virtualenv:

Shell

pip install virtualenv

Install requests and beautifulsoup4:

Shell

pip install requests beautifulsoup4


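See here for reference.

(3) Analysis
Goal: locate the HTML tags that hold the three pieces of content, then extract them.
URL: http://wufazhuce.com/one/1293
In Google Chrome, right-click -> View Page Source, and a pile of HTML pops up, like this: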

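Page source

The content I want is this passage: "Even if the feelings of those passionately in love are an illusion, a phantom, or an act of narcissism, so what? What we call life is a journey of ceaselessly pursuing love. by Daido Moriyama (森山大道)". It sits where the red line is drawn in the figure, in a <meta> inside the <head> tag. We will use it later; read on for now.
Where is the picture's link? Clearly not in <head>. Scrolling down, inside <body> there are two links that look like image links. See the figure:

Image link address

Which one is it? Clicking through shows that it is the latter one, that is, the link in the img tag on line 67.
I also want to know which day the picture and text belong to. Back in the <head> tag there is obviously a <title>, and its content is what we want. Like this: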

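(4) Writing the Python code
I wanted to scrape content from a web page without parsing the HTML myself, so I turned to almighty Google and found the link referenced above. There are two main tools: requests loads the page, and BeautifulSoup4 parses the HTML.

First, grab the three pieces of content we need:
Enter the Python interpreter and type the following code: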

import requests
import bs4
response = requests.get('http://wufazhuce.com/one/1295')
soup = bs4.BeautifulSoup(response.text,"html.parser")

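Now the page content is stored in soup. You can type print soup to try it.

Next, we get the number 1271 out of <title>VOL.1271 – 「ONE · 一個」</title>. How? The beautifulsoup4 tutorial offers a nice way: look up the title by its tag, then slice the string. In the terminal, enter: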

soup.title.string[4:8]

title is the tag name, and string is the text of that tag, that is, the value between <title> and </title>. Since there is only one <title> tag, no check is needed; just read it directly.
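Fixed slice positions only work when the number has a known width. As a minimal sketch, assuming every title follows the "VOL.<number>" pattern shown above, a regular expression can pull out the digits regardless of their count. The helper name volume_number is hypothetical and not part of the original script; the snippet uses Python 3 syntax.

```python
import re

def volume_number(title):
    """Return the digits after 'VOL.' in a page title, or None if absent."""
    match = re.search(r'VOL\.(\d+)', title)
    return match.group(1) if match else None

print(volume_number('VOL.1271 – 「ONE · 一個」'))  # prints 1271
```

In the interpreter session above, this would be called as volume_number(soup.title.string).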

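Next, get the passage.

The passage is in a <meta> tag, but there are many <meta> tags here. What to do? This is where the select method comes in: it finds all <meta> tags and returns a list. We also need the get method, which reads an attribute of a tag; for example, for tag: <meta attr='abc'>, tag.get('attr') equals abc. The attribute we need here is name, and we pick out the right tag with name='description'.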

for meta in soup.select('meta'):
    if meta.get('name') == 'description':
        print meta.get('content')

Next, of the two img tags, we look up the link held by the second one. This uses the find_all method, which finds every tag that matches the request.

soup.find_all('img')[1]['src']

With that, we have found all the information we need.
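The three lookups can also be tried without touching the network. The sketch below runs them against a small inline HTML string, which is a hypothetical stand-in (the real page markup may differ), and uses find with an attrs filter as one alternative to looping over every <meta>. The function name extract and the example.com URLs are assumptions for illustration; the code uses Python 3 syntax.

```python
import bs4

# Hypothetical stand-in for a downloaded page; the real markup may differ.
SAMPLE_HTML = """
<html><head>
<title>VOL.1271 – ONE</title>
<meta name="keywords" content="one">
<meta name="description" content="A sentence for the day.">
</head><body>
<img src="http://example.com/logo.png">
<img src="http://example.com/daily.jpg">
</body></html>
"""

def extract(html):
    """Return the title text, description and second image URL of a page."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    # find() with an attrs filter returns the first matching tag, or None.
    meta = soup.find('meta', attrs={'name': 'description'})
    images = soup.find_all('img')
    return {
        'title': soup.title.string,
        'content': meta.get('content') if meta else None,
        # Guard against pages with fewer than two images.
        'imgUrl': images[1]['src'] if len(images) > 1 else None,
    }

print(extract(SAMPLE_HTML))
```

Returning None for a missing piece, rather than raising, is a design choice made here so that one malformed page does not stop a batch run.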

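Terminal example

Wait, we still need concurrency and saving to a file. Before that, let's look at something else. The map function takes two arguments: a function and a sequence. It passes each value of the sequence to the function as an argument and returns a list. See here for reference.
Example: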

def echoInfo(num):
    return num

data = map(echoInfo, range(0,10))
print data

Result: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
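The list result above is Python 2 behaviour. In Python 3, map returns a lazy iterator instead, so a sketch of the same example, assuming Python 3, wraps the call in list() before printing:

```python
def echo_info(num):
    return num

# map() is lazy in Python 3; list() forces evaluation into a list.
data = list(map(echo_info, range(0, 10)))
print(data)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```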
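Then concurrency. Python is cross-platform and ships with its own module for multi-process support: multiprocessing. Its Pool class manages a group of child processes and spreads the work across them.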
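As a rough illustration of the Pool interface, the sketch below maps a function over a list of inputs with four workers. It deliberately swaps in multiprocessing.dummy, which exposes the same Pool API backed by threads rather than processes; that is often sufficient for network-bound work and avoids pickling constraints. The names fetch_all and fake_fetch are hypothetical, and fake_fetch merely stands in for a real page download.

```python
from multiprocessing.dummy import Pool  # thread-backed Pool, same interface

def fake_fetch(url):
    # Placeholder for a real download; returns the length of the URL.
    return len(url)

def fetch_all(urls, worker, workers=4):
    """Apply worker to every URL concurrently, preserving input order."""
    pool = Pool(workers)
    try:
        return pool.map(worker, urls)
    finally:
        # Always release the workers, even if one call raised.
        pool.close()
        pool.join()

urls = ['http://wufazhuce.com/one/%d' % n for n in range(100, 103)]
print(fetch_all(urls, fake_fetch))
```

Replacing the import with from multiprocessing import Pool gives process-based workers; in that case the calls should live under an if __name__ == '__main__': guard.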
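Saving the data to a file. Here we parse the data, store it in a dictionary, serialize that to JSON, and finally save it to a file.
That is: dictionary -> JSON -> save to file.
Dictionary -> JSON: this uses the json module's json.dumps method, which takes the dictionary as its argument and returns a JSON string.
JSON -> file: the resulting string is written to a file. json.dump can also serialize a dictionary straight to an open file in one step (json.load is its inverse, for reading the file back), but it should not be applied to a string already produced by json.dumps, because that would encode the data twice.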
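A minimal round-trip sketch, assuming Python 3: json.dump writes a dictionary to an open file and json.load reads it back. The helper names save_json and read_json and the sample record are hypothetical, and a temporary directory is used so that nothing is left in the working directory.

```python
import json
import os
import tempfile

def save_json(data, path):
    """Serialize a dictionary to a JSON file in one step."""
    with open(path, 'w') as outfile:
        json.dump(data, outfile)

def read_json(path):
    """Read a JSON file back into Python objects."""
    with open(path) as infile:
        return json.load(infile)

# Hypothetical record shaped like the ones built in this article.
record = {'data': [{'index': '1271', 'imgUrl': 'http://example.com/daily.jpg'}]}
path = os.path.join(tempfile.mkdtemp(), 'data.txt')
save_json(record, path)
print(read_json(path) == record)  # True
```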

The full code example is as follows:

from multiprocessing import Pool
import json
import time

import bs4
import requests

root_url = 'http://wufazhuce.com'

def get_url(num):
    # Build the page URL for one entry number.
    return root_url + '/one/' + str(num)

def get_urls(num):
    # URLs for num consecutive pages, starting at page 100.
    return list(map(get_url, range(100, 100 + num)))

def get_data(url):
    data = {}
    response = requests.get(url)
    if response.status_code != 200:
        # Page missing or request failed: return a placeholder.
        return {'noValue': 'noValue'}
    soup = bs4.BeautifulSoup(response.text, "html.parser")
    # The title looks like "VOL.1271 – ..."; keep what follows "VOL."
    # up to the first whitespace, whatever the number of digits.
    data["index"] = soup.title.string.split()[0][4:]
    for meta in soup.select('meta'):
        if meta.get('name') == 'description':
            data["content"] = meta.get('content')
    # The second <img> on the page is the daily picture.
    data["imgUrl"] = soup.find_all('img')[1]['src']
    return data

if __name__ == '__main__':
    pool = Pool(4)  # four worker processes
    urls = get_urls(10)
    start = time.time()
    dataList = pool.map(get_data, urls)
    end = time.time()
    pool.close()
    pool.join()
    print('use: %.2f s' % (end - start))
    # dictionary -> JSON string -> file
    jsonData = json.dumps({'data': dataList})
    with open('data.txt', 'w') as outfile:
        outfile.write(jsonData)
