A Brief Introduction to Python Web Crawlers

Posted by 円.1024 on 2020-10-13

Original article: https://blog.csdn.net/weixin_45249694/article/details/109062175

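I. What Is a Crawler?

1.1 The Concept of a Crawler (Spider)

A crawler is used to crawl data and is also called a data collection program.
The data comes from the network and may be served by web servers (Nginx/Apache), database servers (MySQL, Redis), index stores (Elasticsearch), big-data systems (HBase/Hive), video/image stores (FTP), cloud storage (OSS), and so on.
The data being crawled should be public and used for non-commercial purposes.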

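1.2 Python Crawlers

A crawler script (program) written in Python can crawl data on a schedule, in fixed quantities, and from specified targets (websites). It mainly relies on multi- (or single-) threading and multiprocessing, network request libraries, data parsing, data storage, and task scheduling.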

# Request message (the header section and the body are separated by a blank line)
POST /s HTTP/1.1
Host: www.baidu.com
Content-Type: application/json
Content-Length: 38

{"name":"disen","phone":"17365488888"}

# Response message
HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
Content-Length: 300

<html>
.....
</html>
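
The layout of the request message above can be reproduced in a few lines of Python. This is only a sketch for illustration: `build_request` is a hypothetical helper, not part of any library, and a real crawler would let its HTTP library assemble the message. It shows the blank line between headers and body and how Content-Length is derived from the body.

```python
import json


def build_request(method, path, host, payload):
    """Assemble a raw HTTP/1.1 request message as text (illustrative helper)."""
    body = json.dumps(payload, separators=(",", ":"))
    headers = [
        f"{method} {path} HTTP/1.1",
        f"Host: {host}",
        "Content-Type: application/json",
        # Content-Length counts the bytes of the body, not the characters.
        f"Content-Length: {len(body.encode('utf-8'))}",
    ]
    # An empty line (CRLF CRLF) separates the header section from the body.
    return "\r\n".join(headers) + "\r\n\r\n" + body


message = build_request("POST", "/s", "www.baidu.com",
                        {"name": "disen", "phone": "17365488888"})
head, _, body = message.partition("\r\n\r\n")
print(head)
print()
print(body)
```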

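II. The Relationship Between Crawlers and Web Back-End Services

A crawler uses a network request library and therefore acts as a client; the web back-end service returns data in response to its requests.
In other words, the crawler sends an HTTP request to the web server, receives the response data correctly, and then parses and stores the data according to its type (Content-Type).
Before sending a request, a crawler usually needs to imitate a browser by setting the User-Agent request header; requests sent this way are much more likely to receive a 200 response.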
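
As a minimal sketch of setting the User-Agent header, the standard library's `urllib.request.Request` accepts a `headers` dictionary. The URL below is a placeholder and the User-Agent string is just one plausible browser value; neither comes from a specific site's requirements. The actual network call is left commented out.

```python
from urllib.request import Request

# Placeholder values chosen for illustration only.
URL = "https://www.example.com/"
USER_AGENT = ("Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36")

request = Request(URL, headers={"User-Agent": USER_AGENT})

# urllib stores header names capitalized as "User-agent".
print(request.get_header("User-agent"))

# Sending it requires network access, so it is left commented out here:
# from urllib.request import urlopen
# with urlopen(request, timeout=10) as response:
#     print(response.status, response.headers.get("Content-Type"))
```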

III. Python Libraries for Crawling

1. Network requests:

urllib
requests / urllib3
selenium (UI automation testing, dynamic JS rendering)
appium (crawling or UI testing of mobile apps)

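2. Data parsing:

1. re (regular expressions)

2. xpath

3. bs4

4. json

JSON serialization: converting an object to a string or bytes
JSON deserialization: converting a string or bytes back to an object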
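
A short sketch of JSON serialization and deserialization with the standard `json` module; the sample record is made up for illustration.

```python
import json

record = {"name": "disen", "phone": "17365488888"}

text = json.dumps(record)            # serialize: object -> str
raw = text.encode("utf-8")           # str -> bytes, e.g. for a request body
restored = json.loads(raw)           # deserialize: bytes (or str) -> object

print(text)
print(restored == record)
```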

3. Data storage:

pymysql
mongodb
elasticsearch

4. Multitasking libraries:

Multithreading (threading)
Thread-safe queues (queue)
Coroutines (asyncio, gevent/eventlet)
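
As a rough sketch of how `threading` and `queue` fit together in a crawler, the example below hands URLs to a small pool of worker threads. `fake_fetch` is a stand-in for a real HTTP request and the URLs are placeholders, so the example runs without network access.

```python
import queue
import threading


def fake_fetch(url):
    """Stand-in for a real HTTP request; returns a made-up result."""
    return f"content of {url}"


def worker(tasks, results):
    while True:
        url = tasks.get()
        if url is None:          # sentinel: no more work for this worker
            tasks.task_done()
            break
        results.put((url, fake_fetch(url)))
        tasks.task_done()


urls = [f"https://www.example.com/page/{i}" for i in range(5)]  # placeholder URLs
tasks, results = queue.Queue(), queue.Queue()
threads = [threading.Thread(target=worker, args=(tasks, results)) for _ in range(3)]
for t in threads:
    t.start()
for url in urls:
    tasks.put(url)
for _ in threads:
    tasks.put(None)              # one sentinel per worker
tasks.join()
for t in threads:
    t.join()

pages = dict(results.get() for _ in range(results.qsize()))
print(len(pages))
```

The queue handles the locking, so the workers never need to coordinate with each other directly.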

5. Crawler frameworks

scrapy
scrapy-redis (distributed, multi-machine crawling)

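IV. Common Anti-Crawler Strategies

User-Agent (UA) checks
Login restrictions (handled with cookies)
Request-frequency limits (handled with IP proxies)
CAPTCHAs (image CAPTCHAs via a cloud recognition service such as 雲打碼; text or object image selection; sliders)
Dynamic JS rendering (handled with Selenium, Splash, or the underlying API endpoints)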
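
One way to approach the User-Agent and request-frequency items is to rotate through several proxies and User-Agent strings. The sketch below uses only the standard library; the proxy addresses and User-Agent values are hypothetical, and `next_opener` is an illustrative helper rather than an established API.

```python
import itertools
from urllib.request import ProxyHandler, build_opener

# Hypothetical proxy addresses and User-Agent strings, for illustration only.
PROXIES = ["http://127.0.0.1:8001", "http://127.0.0.1:8002"]
USER_AGENTS = ["ExampleCrawler/1.0", "ExampleCrawler/2.0"]

proxy_cycle = itertools.cycle(PROXIES)
agent_cycle = itertools.cycle(USER_AGENTS)


def next_opener():
    """Return (proxy, opener) using the next proxy and User-Agent in rotation."""
    proxy = next(proxy_cycle)
    opener = build_opener(ProxyHandler({"http": proxy, "https": proxy}))
    opener.addheaders = [("User-Agent", next(agent_cycle))]
    return proxy, opener


used = [next_opener()[0] for _ in range(3)]
print(used)
# opener.open(url) would then send the request through the chosen proxy.
```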

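A first look at requesting data with urllib
urllib.request.urlopen(url): sends a GET request
urllib.parse.quote(): URL-encodes (percent-encodes) Chinese or other non-ASCII text
urllib.request.urlretrieve(url, filename): downloads url and saves it to filename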
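
A small sketch of these calls follows. The search URL is a made-up placeholder, and the network calls are shown only as comments so the example runs offline; `image_url` in the comments is likewise hypothetical.

```python
from urllib.parse import quote, unquote

keyword = "漫畫"                      # non-ASCII text to place in a URL
encoded = quote(keyword)              # percent-encode the UTF-8 bytes
# Hypothetical search URL, used only to show where the encoded text goes.
url = "https://www.example.com/search?q=" + encoded

print(url)
print(unquote(encoded) == keyword)

# The network calls themselves would look like this (not run here):
# from urllib.request import urlopen, urlretrieve
# html = urlopen(url).read().decode("utf-8")
# urlretrieve(image_url, "cover.jpg")   # image_url: some image address
```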

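Example: crawling comic images from the Manhuadao (漫畫島) website

https://www.manhuadao.cn/

Installing the libraries:

1. From the command line: pip install requests lxml

2. In PyCharm: File | Settings | Project: Pythonwork | Project Interpreter, then click the + button, search for the library you need, and add it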

# Import third-party libraries
import requests
from lxml import etree

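# Fetch the page data (the HTML source)
def get_urls(url):

    # Anti-crawler countermeasure: make the crawler look like a browser
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
    }

    # headers must be passed by keyword; the second positional argument is params
    response = requests.get(url, headers=headers)
    # print(response.text)  # the fetched HTML source

    return response.text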

# Extract the image links from the page (the images to be crawled)
def html_result(text):

    html = etree.HTML(text)
    # print(html)
    """
    <div class="main-content">
     <ul>
         ......
         <img src="http://img.manhuadao.cn/api/v1/bookcenter/scenethumbnail/1/168407/U_01f7e72e-c343-4688-8a5f-2049c77d73b9.jpg" alt="">
         ....
    </div>
    //  selects matching nodes at any depth below the current node
    @   selects an attribute
    /   selects a direct child tag
    """
    # Collect the src attribute of every img under the div whose class is main-content
    img_urls = html.xpath('//div[@class="main-content"]//img/@src')
    return img_urls

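# Download the data (an image) from the page
def get_img(url, name):
    # Request the download address
    response = requests.get(url)
    # Start the download: %s is a placeholder filled with name (passed in by get_img(u, image_name3)); "wb" writes the image in binary mode
    with open(r"D:\img\%s.jpg" % name, "wb") as f:
        f.write(response.content)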


# Define a function that calls the three functions above
def main():
    url = 'https://www.manhuadao.cn/Comic/ComicView?comicid=58ddb12627a7c1392c23c427&chapterid=2182076'

    # Call the first function
    result = get_urls(url)
    # Call the second function and keep the list of image URLs it returns
    image_urls = html_result(result)

    # Loop over the images
    for u in image_urls:
        '''
        <img src="http://img.manhuadao.cn/api/v1/bookcenter/scenethumbnail/1/168407/U_fea8cfad-f288-4364-83e8-348c9e3815ae.jpg" alt="">
        '''
        # Split on '/' into a list and take the last item
        image_name = u.split('/')[-1]
        # Split on '.' and take the first item
        image_name2 = image_name.split('.')[0]
        # Split on '-' and take the last item [348c9e3815ae]
        image_name3 = image_name2.split('-')[-1]
        get_img(u, image_name3)
        print(image_name3)

# Run the program: entry-point check
if __name__ == '__main__':
    main()
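
The three `split` calls in `main()` can also be written with the standard library's URL and path helpers. This is an alternative sketch, not the original author's approach; `short_name` is a hypothetical helper name. For file names containing more than one dot it behaves slightly differently, because `splitext` removes only the final extension.

```python
import posixpath
from urllib.parse import urlparse


def short_name(image_url):
    """Return the text after the last '-' in the image's file name, without extension."""
    filename = posixpath.basename(urlparse(image_url).path)  # file name with extension
    stem, _ext = posixpath.splitext(filename)                # drop the final extension
    return stem.rsplit("-", 1)[-1]


sample = ("http://img.manhuadao.cn/api/v1/bookcenter/scenethumbnail/1/168407/"
          "U_fea8cfad-f288-4364-83e8-348c9e3815ae.jpg")
print(short_name(sample))
```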