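Learning Web Crawlers: A Simple Web Crawler

Published 2016-07-11

Overview

This is a technical write-up from my study of web crawlers. It works through a practical example to analyze how a crawler operates, so that you come away with a basic understanding of crawlers and can fetch the data you need yourself. Once you have the data, you can analyze it or restructure and present it in some other form.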

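What Is a Web Crawler

A web crawler (also known as a web spider or web robot, and in the FOAF community more often called a web chaser) is a program or script that automatically fetches information from the World Wide Web according to a set of rules. Other, less common names include ant, automatic indexer, emulator, and worm. via Baidu Baike, "Web crawler"
A web spider, also called a web crawler [1], ant, automatic indexer, or (in the FOAF software context) web scutter, is a program that "browses the web automatically", in other words a kind of web robot. Crawlers are widely used by Internet search engines and similar sites to fetch or update the content and indexes of those sites. They can automatically collect every page they are able to access so that a search engine can process it further (sorting and organizing the downloaded pages), which lets users find the information they need more quickly. via Wikipedia, "Web spider"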

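Those are the definitions from Baidu Baike and Wikipedia. Put simply, a crawler is a tool that fetches content from a target website, usually automatically and according to predefined behavior. Smarter crawlers analyze the structure of the target site on their own, much as search engine crawlers do. Here we discuss only the basic principles.

How a Crawler Works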

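A web crawler framework consists of three main parts: a controller, a parser, and an index store. Most of a crawler's actual work happens in the parser: it downloads pages and processes them, mainly by stripping out JS script tags, CSS code, whitespace characters, HTML tags, and similar content. The parser's workflow is therefore:

Visit the entry point -> download the content -> analyze the structure -> extract the content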
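
The workflow above can be sketched with nothing but the standard library. This is a minimal illustration, not the crawler built later in this article: the `FAKE_SITE` dictionary, the `TextExtractor` class, and the function names are all made up for the example, and the download step is stubbed with an in-memory page so the sketch runs offline.

```python
import re
from html.parser import HTMLParser

# Hypothetical in-memory "site" so the sketch runs offline; a real crawler
# would download the page over HTTP at this step instead.
FAKE_SITE = {
    "http://example.com/": (
        "<html><head><style>p { color: red; }</style>"
        "<script>var x = 1;</script></head>"
        "<body><h1>Hello</h1><p>First   paragraph.</p></body></html>"
    ),
}


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def download(url):
    """Download step: return the raw HTML (stubbed with FAKE_SITE here)."""
    return FAKE_SITE[url]


def extract_text(html):
    """Analyze and extract steps: parse the markup, keep normalized text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse the runs of whitespace left behind once the tags are gone.
    return re.sub(r"\s+", " ", " ".join(parser.chunks)).strip()


def crawl(entry_url):
    """Entry step: start from the entry URL, then download and extract."""
    return extract_text(download(entry_url))


print(crawl("http://example.com/"))
```

Tracking a skip depth rather than a boolean is a small design choice that keeps the extractor from getting confused if a closing tag shows up without its opening tag.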

Analyzing the Structure of the Target

Here we analyze one website, Luoo (落網, http://luoo.net), and extract its content to understand the process better.

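Step 1: Define the goal
Fetch every track in one volume of the target site

Step 2: Analyze the page structure
Open one volume on Luoo and inspect the songs in the playlist with Chrome's developer tools. The parts outlined in red on the right are the semantic structures that deserve particular attention, as shown in the figure below:
Figure: The Luoo playlist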

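The areas outlined in red mainly contain the song titles and track numbers. The actual file URL of a song does not appear here, so we keep looking. Clicking a song starts playing it in the browser right away, and at that point the Network panel of Chrome's developer tools shows the actual request for the audio file, as shown in the figures below:

Figure: Request for the audio file

Figure: Inspecting the request URL

From this analysis we know where the playlist is located and what the path to each music file looks like. Next we implement the crawler in Python.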
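
As a rough illustration of what that analysis yields, the sketch below turns playlist labels into audio file URLs. The URL template is the one used in this article's crawler code. The helper names `parse_track_label` and `track_url`, the sample labels, and the assumption that each label looks like "NN. name" are made up for the example; the crawler itself slices the label by position instead.

```python
# URL template for the audio files; the two placeholders are the volume
# number and the track number.
MP3_URL = 'http://luoo-mp3.kssws.ks-cdn.com/low/luoo/radio%s/%s.mp3'


def parse_track_label(label):
    """Split a playlist label such as '01. Song Name' into (id, name).

    The 'NN. name' layout is an assumption about the playlist markup.
    """
    number, _, name = label.partition('.')
    return number.strip(), name.strip()


def track_url(vol, track_id):
    """Fill the URL template for one track of one volume."""
    return MP3_URL % (vol, track_id)


labels = ['01. First Song', '02. Second Song']  # made-up sample data
for label in labels:
    track_id, name = parse_track_label(label)
    print(name, '->', track_url('680', track_id))
```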

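Implementing the Crawler

For installing a Python environment, please search Google yourself.

Main third-party dependencies

Requests (http://www.python-requests.org) sends the HTTP requests
BeautifulSoup (bs4) parses the HTML structure and extracts content
faker (http://fake-factory.readthedocs.io/en/stable/) generates a fake request UA (User-Agent)

The main idea is to split the work into two parts. The first part sends requests, parses out the playlist, and puts it on a queue. The second part takes items off the queue one by one and downloads the files to local disk. Parsing the list is generally fast and downloading is slow, so several threads can download at the same time.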
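
The queue-based split can be sketched in isolation with the standard library. This is a simplified stand-in, not the crawler itself: `producer`, `worker`, `run`, the `STOP` sentinel, and the fake track lists are all hypothetical, and the "download" merely records a string so the sketch runs offline.

```python
import queue
import threading

NUM_WORKERS = 3
STOP = object()  # sentinel telling a worker to exit


def producer(q, volumes):
    """Stand-in for the page parser: queue one job per volume."""
    for vol in volumes:
        q.put({'phase': vol, 'tracks': ['01', '02']})


def worker(q, results, lock):
    """Stand-in for the downloader: handle jobs until STOP arrives."""
    while True:
        job = q.get()  # blocks until an item is available
        if job is STOP:
            break
        for track in job['tracks']:
            with lock:
                results.append('%s/%s' % (job['phase'], track))


def run(volumes):
    q = queue.Queue()
    results = []
    lock = threading.Lock()
    workers = [threading.Thread(target=worker, args=(q, results, lock))
               for _ in range(NUM_WORKERS)]
    for t in workers:
        t.start()
    producer(q, volumes)
    for _ in workers:
        q.put(STOP)  # one sentinel per worker
    for t in workers:
        t.join()
    return sorted(results)


print(run(['680', '721']))
```

A blocking `q.get()` lets idle workers sleep instead of polling the queue, and one sentinel per worker gives every thread a clean way to stop.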
The main code is as follows:

# -*- coding: utf-8 -*-
'''by sudo rm -rf  http://imchenkun.com'''
import os
import queue
import random
import threading

import requests
from bs4 import BeautifulSoup
from faker import Factory

fake = Factory.create()
luoo_site = 'http://www.luoo.net/music/'
luoo_site_mp3 = 'http://luoo-mp3.kssws.ks-cdn.com/low/luoo/radio%s/%s.mp3'

proxy_ips = ['27.15.236.236']  # replace with your own proxy IPs
headers = {
    'Connection': 'keep-alive',
    'User-Agent': fake.user_agent()  # random User-Agent generated by faker
}


def random_proxies():
    # Pick one proxy at random for each request.
    return {'http': random.choice(proxy_ips)}


def fix_characters(s):
    # Remove characters that are not allowed in Windows file names.
    for c in ['<', '>', ':', '"', '/', '\\', '|', '?', '*']:
        s = s.replace(c, '')
    return s


class LuooSpider(threading.Thread):
    def __init__(self, url, vols, queue=None):
        threading.Thread.__init__(self)
        print('[luoo spider]')
        print('=' * 20)
        self.url = url
        self.queue = queue
        self.vols = vols

    def run(self):
        for vol in self.vols:
            self.spider(vol)
        print('\ncrawl end\n\n')

    def spider(self, vol):
        url = self.url + vol
        print('crawling: ' + url + '\n')
        res = requests.get(url, proxies=random_proxies())
        soup = BeautifulSoup(res.content, 'html.parser')
        title = soup.find('span', attrs={'class': 'vol-title'}).text
        cover = soup.find('img', attrs={'class': 'vol-cover'})['src']
        desc = soup.find('div', attrs={'class': 'vol-desc'})
        track_names = soup.find_all('a', attrs={'class': 'trackname'})
        track_count = len(track_names)
        tracks = []
        for track in track_names:
            # Before volume 12, track numbers 1-9 are a single digit (1-9);
            # from volume 12 on they are zero-padded to two digits (01-09).
            _id = str(int(track.text[:2])) if (int(vol) < 12) else track.text[:2]
            _name = fix_characters(track.text[4:])
            tracks.append({'id': _id, 'name': _name})
        # Queue the volume once, after all of its tracks have been collected.
        phases = {
            'phase': vol,                # volume number
            'title': title,              # volume title
            'cover': cover,              # volume cover image
            'desc': desc,                # volume description
            'track_count': track_count,  # number of tracks
            'tracks': tracks             # track list (track number, track name)
        }
        self.queue.put(phases)


class LuooDownloader(threading.Thread):
    def __init__(self, url, dist, queue=None):
        threading.Thread.__init__(self)
        self.url = url
        self.queue = queue
        self.dist = dist

    def run(self):
        while True:
            phases = self.queue.get()  # blocks until a volume is queued
            try:
                self.download(phases)
            finally:
                self.queue.task_done()

    def download(self, phases):
        local_file_dict = '%s/%s' % (self.dist, phases['phase'])
        if not os.path.exists(local_file_dict):
            os.makedirs(local_file_dict)

        for track in phases['tracks']:
            file_url = self.url % (phases['phase'], track['id'])
            local_file = '%s/%s.%s.mp3' % (local_file_dict, track['id'], track['name'])
            if not os.path.isfile(local_file):
                print('downloading: ' + track['name'])
                res = requests.get(file_url, proxies=random_proxies(), headers=headers)
                with open(local_file, 'wb') as f:
                    f.write(res.content)
                print('done.\n')
            else:
                # The file already exists locally, so skip it.
                print('break: ' + track['name'])


if __name__ == '__main__':
    spider_queue = queue.Queue()

    luoo = LuooSpider(luoo_site, vols=['680', '721', '725', '720'], queue=spider_queue)
    luoo.daemon = True
    luoo.start()

    downloader_count = 5
    for i in range(downloader_count):
        luoo_download = LuooDownloader(luoo_site_mp3, 'D:/luoo', queue=spider_queue)
        luoo_download.daemon = True
        luoo_download.start()

    # Daemon threads die when the main thread exits, so wait until every
    # volume has been parsed and every queued volume has been downloaded.
    luoo.join()
    spider_queue.join()

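Running the code above produces the result shown in the figures below:
Figure: The crawler running
Figure: Result of the run
GitHub repository: https://github.com/imchenkun/ick-spider/blob/master/luoospider.py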
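
One detail of the download step is worth a note: reading `res.content` holds an entire MP3 in memory before it is written out. A possible alternative, sketched here under assumptions, is to copy the response to disk in chunks. The `download_to_file` helper is hypothetical and uses the standard library's `urllib` rather than Requests so that it runs without third-party packages; it also leaves out the proxy and User-Agent handling. With Requests, the comparable approach is to pass `stream=True` and iterate over `Response.iter_content`.

```python
import shutil
import urllib.request


def download_to_file(url, local_path, chunk_size=64 * 1024):
    """Copy the response body to disk chunk by chunk."""
    with urllib.request.urlopen(url) as response, open(local_path, 'wb') as out:
        # copyfileobj reads and writes chunk_size bytes at a time, so the
        # whole file is never held in memory at once.
        shutil.copyfileobj(response, out, chunk_size)
    return local_path
```

A call such as `download_to_file(file_url, local_file)` would take the place of the `requests.get(...)` and `f.write(res.content)` pair inside the downloader.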

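Summary

In this article we covered the basics of web crawlers, and while learning how a crawler works we implemented a real-world example. We relied on a few basic third-party Python libraries to build the crawler, which was enough to demonstrate the core concepts of a web crawler framework. In day-to-day work we usually turn to a mature crawler framework, such as Scrapy, to meet requirements quickly. Next, I will build a new crawler with a framework like Scrapy to deepen our understanding of web crawlers.

Disclaimer: Luoo, the site discussed in this article, is a music website I am personally very fond of. This article uses it only for technical discussion and study of crawlers, and I bear no responsibility for any infringement issues arising from readers' actions.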
