Web Scraping Basics

Posted by 追夢NAN on 2020-07-31

1. Overview of Web Crawlers

What is a crawler:

The process of writing a program that simulates a browser going online and then crawls/scrapes data from the internet.
Simulation: the browser itself is a natural, primitive crawling tool.

Types of crawlers:

General-purpose crawler: crawls the data of an entire page; the scraping system (crawler program).
Focused crawler: crawls a specific part of a page's data; always built on top of a general-purpose crawler.
Incremental crawler: monitors a site for data updates so that the newly published data can be crawled.

Risk analysis

Use crawlers reasonably.
Where the risk shows up:
the crawler interferes with the normal operation of the visited site;
the crawler collects specific types of data or information protected by law.
Avoiding the risk:
strictly follow the robots protocol set by the site;
when working around anti-crawling measures, optimize your own code so it does not disturb the normal operation of the visited site;
when using or distributing the scraped information, review it; if it contains users' personal information, privacy, or others' trade secrets, stop immediately and delete it.

Anti-crawling mechanisms

Counter-anti-crawling strategies
robots.txt protocol: a plain-text protocol that states which data may and may not be crawled.
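
As a quick illustration, a minimal sketch of checking a site's robots.txt with the standard library (the Taobao URL and path are only examples):

from urllib import robotparser

# load and query a site's robots.txt
rp = robotparser.RobotFileParser()
rp.set_url('https://www.taobao.com/robots.txt')
rp.read()
# may a generic crawler ('*') fetch this path?
print(rp.can_fetch('*', 'https://www.taobao.com/markets/'))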

Common request headers

User-Agent: identifies the carrier of the request
Connection: close
Content-Type

How do you tell whether a page contains dynamically loaded data?

Do a local search and a global search for the target data in the packet-capture tool.

What is the first thing to do before crawling an unfamiliar site?
Determine whether the data you want to crawl is dynamically loaded!!!

2. Basic Usage of the requests Module

The requests module
Concept: a module based on network requests; it is used to simulate a browser sending requests.
Coding workflow:
Specify the url
Send the request
Get the response data (the scraped data)
Persist it to storage
import requests
url = 'https://www.sogou.com'
# the return value is a response object
response = requests.get(url=url)
# .text returns the response data as a string
data = response.text
with open('./sogou.html', "w", encoding='utf-8') as f:
    f.write(data)

A simple web collector based on Sogou

Fix the garbled-text (encoding) problem

Fix the UA-detection problem

import requests

wd = input('Enter a keyword: ')
url = 'https://www.sogou.com/web'
# holds the dynamic request parameters
params = {
    'query': wd
}
# the params argument wraps the query parameters appended to the url
# headers defeats the UA check by spoofing the User-Agent
headers = {
    'User-Agent':"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
response = requests.get(url=url, params=params,headers=headers)
# manually set the response encoding to fix garbled Chinese text
response.encoding = 'utf-8'

data = response.text
filename = wd + '.html'
with open(filename, "w", encoding='utf-8') as f:
    f.write(data)
print(wd, "downloaded successfully")

1. Scraping detailed movie data from Douban

Analysis

When the page is scrolled to the bottom, an ajax request is fired and a batch of movie data is returned.
Dynamically loaded data: data obtained through another, additional request.
ajax-generated dynamically loaded data
js-generated dynamically loaded data
import requests
limit = input("How many entries from the top of the chart: ")
url = 'https://movie.douban.com/j/chart/top_list'
params = {
    "type": "5",
    "interval_id": "100:90",
    "action": "",
    "start": "0",
    "limit": limit
}

headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
response = requests.get(url=url, params=params, headers=headers)
# .json() returns the deserialized object
data_list = response.json()

with open('douban.txt', "w", encoding='utf-8') as f:
    for i in data_list:
        name = i['title']
        score = i['score']
        f.write(name + ":" + score + "\n")
print("done")

2. Scraping KFC store location data

import requests

url = "http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword"
params = {
    "cname": "",
    "pid": "",
    "keyword": "青島",
    "pageIndex": "1",
    "pageSize": "10"
}

headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
response = requests.post(url=url, params=params, headers=headers)
# .json() returns the deserialized object
data_list = response.json()
with open('kedeji.txt', "w", encoding='utf-8') as f:
    for i in data_list["Table1"]:
        name = i['storeName']
        addres = i['addressDetail']
        f.write(name + "," + addres + "\n")
print("done")

3. Scraping data from the drug administration (NMPA)

import requests

url = "http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsList"
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
with open('化妝品.txt', "w", encoding="utf-8") as f:
    for i in range(1, 5):
        params = {
            "on": "true",
            "page": str(i),
            "pageSize": "12",
            "productName": "",
            "conditionType": "1",
            "applyname": "",
            "applysn": ""
        }

        response = requests.post(url=url, params=params, headers=headers)
        data_dic = (response.json())

        for item in data_dic["list"]:
            id = item['ID']
            post_url = "http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsById"
            post_data = {
                "id": id
            }
            response2 = requests.post(url=post_url, params=post_data, headers=headers)
            data_dic2 = response2.json()
            title = data_dic2["epsName"]
            name = data_dic2['legalPerson']

            f.write(title + ":" + name + "\n")

3. Data Parsing

Parsing: extracting data according to specified rules.

Purpose: implementing a focused crawler.

Coding workflow of a focused crawler:

Specify the url
Send the request
Get the response data
Parse the data
Persist it to storage

Parsing approaches:

Regular expressions
bs4
xpath
pyquery (extension)

What is the general principle behind data parsing?

Data parsing operates on the page source (a collection of html tags).

What is the core purpose of html?

Displaying data.

How does html display data?

The data to be displayed is always placed inside html tags or in their attributes.

General principle:

1. Locate the tag
2. Extract its text or its attributes

1. Regex parsing

1. Scraping image posts from Qiushibaike

Scraping a single image

import requests

url = "https://pic.qiushibaike.com/system/pictures/12330/123306162/medium/GRF7AMF9GKDTIZL6.jpg"

headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
response = requests.get(url=url, headers=headers)
# .content returns the data as bytes
img_data = response.content
with open('./123.jpg', "wb") as f:
    f.write(img_data)
print("done")



Scraping a single page

<div class="thumb">

<a href="/article/123319109" target="_blank">
<img src="//pic.qiushibaike.com/system/pictures/12331/123319109/medium/MOX0YDFJX7CM1NWK.jpg" alt="糗事#123319109" class="illustration" width="100%" height="auto">
</a>
</div>

import re
import os
import requests

dir_name = "./img"
if not os.path.exists(dir_name):
    os.mkdir(dir_name)
url = "https://www.qiushibaike.com/imgrank/"

headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
img_text = requests.get(url, headers=headers).text
ex = '<div class="thumb">.*?<img src="(.*?)" alt=.*?</div>'
img_list = re.findall(ex, img_text, re.S)
for src in img_list:
    src = "https:" + src
    img_name = src.split('/')[-1]
    img_path = dir_name + "/" + img_name
    # request the image url to get the image bytes
    img_data = requests.get(src, headers=headers).content
    with open(img_path, "wb") as f:
        f.write(img_data)
print("done")


Scraping multiple pages

import re
import os
import requests

dir_name = "./img"
if not os.path.exists(dir_name):
    os.mkdir(dir_name)
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
for i in range(1,5):
    url = f"https://www.qiushibaike.com/imgrank/page/{i}/"
    print(f"正在爬取第{i}頁的圖片")
    img_text = requests.get(url, headers=headers).text
    ex = '<div class="thumb">.*?<img src="(.*?)" alt=.*?</div>'
    img_list = re.findall(ex, img_text, re.S)
    for src in img_list:
        src = "https:" + src
        img_name = src.split('/')[-1]
        img_path = dir_name + "/" + img_name
        response = requests.get(src, headers).content
        # 對圖片地址發請求獲取圖片資料
        with open(img_path, "wb") as f:
            f.write(response)
print("成功")

2. bs4 parsing

Environment setup

pip install bs4

How bs4 parsing works

Instantiate a BeautifulSoup object (soup) and load the page source to be parsed into it,
then call the object's properties and methods to locate tags and extract data.

How do you instantiate a BeautifulSoup object?

BeautifulSoup(fp, 'lxml'): used to parse data in a locally stored html document (fp is a file object).
BeautifulSoup(page_text, 'lxml'): used to parse page source fetched from the internet.

Tag positioning

soup.tagName: locates the first tagName tag; only the first one is returned.

Attribute positioning

soup.find('div', class_='s'): returns the first div tag whose class is s.
find_all: same usage as find, but the return value is a list.

Selector positioning

select('css selector'): the return value is a list.
	Tag, class, id, and level selectors (> means one level, a space means any number of levels).

Extracting data

Getting text

tag.string: the tag's direct text content
tag.text: all the text content inside the tag

Getting attributes

soup.find("a", id='tt')['href']

1. Scraping the novel Romance of the Three Kingdoms (三國演義)

http://www.shicimingju.com/book/sanguoyanyi.html

Scrape the chapter titles + chapter contents

1. Parse the chapter titles and each chapter's detail-page url from the homepage

from bs4 import BeautifulSoup
import requests

url = 'http://www.shicimingju.com/book/sanguoyanyi.html'
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
page_text = requests.get(url, headers=headers).text
soup = BeautifulSoup(page_text, 'lxml')
a_list = soup.select(".book-mulu a")
with open('./sanguo.txt', 'w', encoding='utf-8') as f:
    for a in a_list:
        new_url = "http://www.shicimingju.com" + a["href"]
        mulu = a.text
        print(mulu)
        # request each chapter's detail-page url and parse the chapter content from it
        new_page_text = requests.get(new_url, headers=headers).text
        new_soup = BeautifulSoup(new_page_text, 'lxml')
        neirong = new_soup.find('div', class_='chapter_content').text
        f.write(mulu+":"+neirong+"\n")

3. xpath parsing

Environment setup

pip install lxml

How xpath parsing works

Instantiate an etree object and load the page source data into it,
then call the object's xpath method with various forms of xpath expressions to locate tags and extract data.

Instantiating the etree object

tree = etree.parse(fileName)
tree = etree.HTML(page_text)
The xpath method always returns a list.

Tag positioning

tree.xpath("")
A leading / in an xpath expression means the tag must be located starting from the root node.
A leading // means the tag can be located from any position in the document.
A non-leading // means any number of levels.
A non-leading / means exactly one level.

Attribute positioning: //div[@class='ddd']

Index positioning: //div[@class='ddd']/li[3]   # indexes start at 1
Index positioning: //div[@class='ddd']//li[2]  # indexes start at 1

Extracting data

Getting text:
tree.xpath("//p[1]/text()"): the direct text content
tree.xpath("//div[@class='ddd']/li[2]//text()"): all the text content
Getting attributes:
tree.xpath('//a[@id="feng"]/@href')
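
A minimal sketch exercising the expressions above on a made-up snippet (the div/li structure and the example.com link are invented for illustration):

from lxml import etree

# small, made-up snippet used only to illustrate the expressions above
page_text = '''
<html>
  <body>
    <p>hello</p>
    <div class="ddd">
      <ul>
        <li>one</li>
        <li>two</li>
        <li><a id="feng" href="http://www.example.com">three</a></li>
      </ul>
    </div>
  </body>
</html>
'''
tree = etree.HTML(page_text)

print(tree.xpath('/html/body/p/text()'))               # leading /: locate from the root
print(tree.xpath('//div[@class="ddd"]//li'))            # attribute positioning, // crosses levels
print(tree.xpath('//div[@class="ddd"]//li[2]/text()'))  # index positioning, indexes start at 1
print(tree.xpath('//div[@class="ddd"]//text()'))        # all text content under the div
print(tree.xpath('//a[@id="feng"]/@href'))              # attribute extraction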

1. Scraping job listings from Boss Zhipin

from lxml import etree
import requests
import time


url = 'https://www.zhipin.com/job_detail/?query=python&city=101120200&industry=&position='
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
    'cookie':'__zp__pub__=; lastCity=101120200; __c=1594792470; __g=-; Hm_lvt_194df3105ad7148dcf2b98a91b5e727a=1594713563,1594713587,1594792470; __l=l=%2Fwww.zhipin.com%2Fqingdao%2F&r=&friend_source=0&friend_source=0; __a=26925852.1594713563.1594713586.1594792470.52.3.39.52; Hm_lpvt_194df3105ad7148dcf2b98a91b5e727a=1594801318; __zp_stoken__=c508aZxdfUB9hb0Q8ORppIXd7JTdDTF96U3EdCDgIHEscYxUsVnoqdH9VBxY5GUtkJi5wfxggRDtsR0dAT2pEDDRRfWsWLg8WUmFyWQECQlYFSV4SCUQqUB8yfRwAUTAyZBc1ABdbRRhyXUY%3D'
}
page_text = requests.get(url, headers=headers).text
tree = etree.HTML(page_text)
li_list = tree.xpath('//*[@id="main"]/div/div[2]/ul/li')
for li in li_list:
    # extract the relevant data from the partial page source represented by li
    # when an xpath expression is used inside a loop, it must start with ./ or .//
    detail_url = 'https://www.zhipin.com' + li.xpath('.//span[@class="job-name"]/a/@href')[0]
    job_title = li.xpath('.//span[@class="job-name"]/a/text()')[0]
    company = li.xpath('.//div[@class="info-company"]/div/h3/a/text()')[0]
    # request the detail-page url and parse out the job description
    detail_page_text = requests.get(detail_url, headers=headers).text
    tree = etree.HTML(detail_page_text)
    job_desc = tree.xpath('//div[@class="text"]/text()')
    # join the list into a string
    job_desc = ''.join(job_desc)
    print(job_title,company,job_desc)
    time.sleep(5)

2. Scraping Qiushibaike

Scrape the author and the post text. Note that authors can be anonymous or registered.

from lxml import etree
import requests


url = "https://www.qiushibaike.com/text/page/4/"
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
}
page_text = requests.get(url, headers=headers).text
tree = etree.HTML(page_text)
div_list = tree.xpath('//div[@class="col1 old-style-col1"]/div')
print(div_list)

for div in div_list:
    # usernames: anonymous users vs. registered users
    author = div.xpath('.//div[@class="author clearfix"]//h2/text() | .//div[@class="author clearfix"]/span[2]/h2/text()')[0]
    content = div.xpath('.//div[@class="content"]/span//text()')
    content = ''.join(content)
    print(author, content)


3. Scraping images from a website

from lxml import etree
import requests
import os
dir_name = "./img2"
if not os.path.exists(dir_name):
    os.mkdir(dir_name)
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
}
for i in range(1, 6):
    if i == 1:
        url = "http://pic.netbian.com/4kmeinv/"
    else:
        url = f"http://pic.netbian.com/4kmeinv/index_{i}.html"

    page_text = requests.get(url, headers=headers).text
    tree = etree.HTML(page_text)
    li_list = tree.xpath('//*[@id="main"]/div[3]/ul/li')
    for li in li_list:
        img_src = "http://pic.netbian.com/" + li.xpath('./a/img/@src')[0]
        img_name = li.xpath('./a/b/text()')[0]
        # fix the garbled Chinese in the image name
        img_name = img_name.encode('iso-8859-1').decode('gbk')
        response = requests.get(img_src).content
        img_path = dir_name + "/" + f"{img_name}.jpg"
        with open(img_path, "wb") as f:
            f.write(response)
    print(f"第{i}頁成功")

4. IP Proxies

Proxy servers

They forward requests, which makes it possible to change the ip address a request appears to come from.

Proxy anonymity levels

Transparent: the server knows you are using a proxy and also knows your real ip.
Anonymous: the server knows you are using a proxy but does not know your real ip.
Elite (high anonymity): the server does not know you are using a proxy, let alone your real ip.

Proxy types

http: this type of proxy can only forward http requests.

https: can only forward https requests.
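
As a minimal illustration, the proxies parameter of requests takes a dict keyed by protocol; the ip and port below are placeholders, not a working proxy:

import requests

# the key ('http' / 'https') must match the protocol of the target url
proxies = {
    'https': 'https://111.111.111.111:8888',  # placeholder ip:port
}
headers = {
    'User-Agent': "Mozilla/5.0"
}
page_text = requests.get('https://www.baidu.com/s?wd=ip',
                         headers=headers, proxies=proxies, timeout=5).text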

Websites offering free proxy ips

快代理 (kuaidaili)
西祠代理 (xicidaili)
goubanjia
代理精靈 (recommended): http://http.zhiliandaili.cn/

What should you do when your ip gets banned while crawling?

Use a proxy
Build a proxy pool
Use a dial-up (redial) server

import requests
import random
from lxml import etree

# proxy pool stored as a list
all_ips = []
proxy_url = "http://t.11jsq.com/index.php/api/entry?method=proxyServer.generate_api_url&packid=1&fa=0&fetch_key=&groupid=0&qty=5&time=1&pro=&city=&port=1&format=html&ss=5&css=&dt=1&specialTxt=3&specialJson=&usertype=15"
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
}
proxy_page_text = requests.get(url=proxy_url, headers=headers).text
tree = etree.HTML(proxy_page_text)
proxy_list = tree.xpath('//body//text()')
for ip in proxy_list:
    dic = {'https': ip}
    all_ips.append(dic)
# scrape the free proxy ips from kuaidaili
free_proxies = []
for i in range(1, 3):
    url = f"http://www.kuaidaili.com/free/inha/{i}/"
    page_text = requests.get(url, headers=headers,proxies=random.choice(all_ips)).text
    tree = etree.HTML(page_text)
    # tbody must not appear in the xpath expression (the browser adds it; the raw page source has none)
    tr_list = tree.xpath('//*[@id="list"]/table//tr')
    for tr in tr_list:
        ip = tr.xpath("./td/text()")[0]
        port = tr.xpath("./td[2]/text()")[0]
        dic = {
            "ip":ip,
            "port":port
        }
        print(dic)
        free_proxies.append(dic)
    print(f"第{i}頁")
print(len(free_proxies))

5. Handling Cookies

Video-parsing API endpoints

https://www.wocao.xyz/index.php?url=
https://2wk.com/vip.php?url=
https://api.47ks.com/webcloud/?v-

Video-parsing websites

牛巴巴     http://mv.688ing.com/
愛片網     https://ap2345.com/vip/
全民解析   http://www.qmaile.com/

Back to the main topic

Why do cookies need to be handled?

They store client-side state.

Requests need to carry the cookie. How do you deal with cookie-based anti-crawling in a crawler?

# Manual handling
Capture the cookie in the packet-capture tool and wrap it into headers (see the sketch below).


# Automatic handling
Use the session mechanism.
Use case: cookies that change dynamically.
Session object: used almost exactly like the requests module. If a request produces a cookie and that request was sent through the session, the cookie is automatically stored in the session.
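
A minimal sketch of the manual approach, assuming you copied the Cookie header out of the packet-capture tool (the cookie value below is a placeholder):

import requests

headers = {
    'User-Agent': "Mozilla/5.0",
    # placeholder cookie copied from the browser / packet-capture tool
    'Cookie': 'xq_a_token=xxxx; u=xxxx',
}
url = 'https://stock.xueqiu.com/v5/stock/hot_stock/list.json?size=8&_type=10&type=10'
data = requests.get(url, headers=headers).json()
print(data)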

Scraping data from Xueqiu (the session-based approach)

import requests

s = requests.Session()
main_url = "https://xueqiu.com"  # request the main page first so the session picks up the cookie
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
}
params = {
    "size": "8",
    '_type': "10",
    "type": "10"
}
s.get(main_url, headers=headers)
url = 'https://stock.xueqiu.com/v5/stock/hot_stock/list.json'

page_text = s.get(url, headers=headers, params=params).json()
print(page_text)

6. Captcha Recognition

Recognition via an online captcha-solving platform

1. Register and log in (identity verification in the user center).

2. After logging in:

Create a software entry: Software ID -> generates a software id

Download the sample code: Developer docs -> python -> download

Demo of the platform's sample code

import requests
from hashlib import md5


class Chaojiying_Client(object):
    def __init__(self, username, password, soft_id):
        self.username = username
        password = password.encode('utf8')
        self.password = md5(password).hexdigest()
        self.soft_id = soft_id
        self.base_params = {
            'user': self.username,
            'pass2': self.password,
            'softid': self.soft_id,
        }
        self.headers = {
            'Connection': 'Keep-Alive',
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
        }

    def PostPic(self, im, codetype):
        params = {
            'codetype': codetype,
        }
        params.update(self.base_params)
        files = {'userfile': ('ccc.jpg', im)}
        r = requests.post('http://upload.chaojiying.net/Upload/Processing.php', data=params, files=files,
                          headers=self.headers)
        return r.json()

    def ReportError(self, im_id):
        params = {
            'id': im_id,
        }
        params.update(self.base_params)
        r = requests.post('http://upload.chaojiying.net/Upload/ReportError.php', data=params, headers=self.headers)
        return r.json()


chaojiying = Chaojiying_Client('chaojiying username', 'chaojiying password', '96001')
im = open('a.jpg', 'rb').read()
print(chaojiying.PostPic(im, 1902)['pic_str'])

Recognizing the captcha on the Gushiwen site

zbb.py

import requests
from hashlib import md5


class Chaojiying_Client(object):
    def __init__(self, username, password, soft_id):
        self.username = username
        password = password.encode('utf8')
        self.password = md5(password).hexdigest()
        self.soft_id = soft_id
        self.base_params = {
            'user': self.username,
            'pass2': self.password,
            'softid': self.soft_id,
        }
        self.headers = {
            'Connection': 'Keep-Alive',
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
        }

    def PostPic(self, im, codetype):
        params = {
            'codetype': codetype,
        }
        params.update(self.base_params)
        files = {'userfile': ('ccc.jpg', im)}
        r = requests.post('http://upload.chaojiying.net/Upload/Processing.php', data=params, files=files,
                          headers=self.headers)
        return r.json()

    def ReportError(self, im_id):
        params = {
            'id': im_id,
        }
        params.update(self.base_params)
        r = requests.post('http://upload.chaojiying.net/Upload/ReportError.php', data=params, headers=self.headers)
        return r.json()


def www(path,type):
    chaojiying = Chaojiying_Client('5423', '521521', '906630')
    im = open(path, 'rb').read()
    return chaojiying.PostPic(im, type)['pic_str']

requests.py (note: a script actually named requests.py will shadow the requests library when import requests runs, so use a different filename in practice)

import requests
from lxml import etree
from zbb import www

headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
}
url = 'https://so.gushiwen.cn/user/login.aspx?from=http://so.gushiwen.cn/user/collect.aspx'
page_text = requests.get(url, headers=headers).text
tree = etree.HTML(page_text)
img_url = "https://so.gushiwen.cn/" + tree.xpath('//*[@id="imgCode"]/@src')[0]
img_data = requests.get(img_url,headers=headers).content
with open('./111.jpg','wb') as f:
    f.write(img_data)
img_text = www('./111.jpg',1004)
print(img_text)

7. Simulated Login

Why does a crawler need to implement simulated login?

Some data is only displayed after logging in.

The Gushiwen site

Anti-crawling mechanisms involved

1. Captcha
2. Dynamic request parameters: the request parameters change with every request.
	Dynamic capture: usually the dynamic request parameters are hidden in the front-end page source.
3. The cookie is tied to the captcha-image request.
 A real pain to deal with.

import requests
from lxml import etree
from zbb import www

headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
}
# obtain the cookie
s = requests.Session()
# s_url = "https://so.gushiwen.org/user/login.aspx?from=http://so.gushiwen.org/user/collect.aspx"
# s.get(s_url, headers=headers)

# get the captcha
url = 'https://so.gushiwen.cn/user/login.aspx?from=http://so.gushiwen.cn/user/collect.aspx'
page_text = requests.get(url, headers=headers).text
tree = etree.HTML(page_text)
img_url = "https://so.gushiwen.cn/" + tree.xpath('//*[@id="imgCode"]/@src')[0]
img_data = s.get(img_url, headers=headers).content
with open('./111.jpg', 'wb') as f:
    f.write(img_data)
img_text = www('./111.jpg', 1004)
print(img_text)

# dynamically capture the dynamic request parameters
__VIEWSTATE = tree.xpath('//*[@id="__VIEWSTATE"]/@value')[0]
__VIEWSTATEGENERATOR = tree.xpath('//*[@id="__VIEWSTATEGENERATOR"]/@value')[0]

# the url requested when the login button is clicked, captured with the packet-capture tool
login_url = 'https://so.gushiwen.cn/user/login.aspx?from=http%3a%2f%2fso.gushiwen.cn%2fuser%2fcollect.aspx'
data = {
    "__VIEWSTATE": __VIEWSTATE,
    "__VIEWSTATEGENERATOR": __VIEWSTATEGENERATOR,  # 變化的
    "from": "http://so.gushiwen.cn/user/collect.aspx",
    "email": "542154983@qq.com",
    "pwd": "zxy521",
    "code": img_text,
    "denglu": "登入"
}
main_page_text = s.post(login_url, headers=headers, data=data).text
with open('main.html', 'w', encoding='utf-8') as fp:
    fp.write(main_page_text)

8. Asynchronous Crawling with a Thread Pool

Asynchronously crawl the first ten pages of Qiushibaike using a thread pool

import requests
from multiprocessing.dummy import Pool

headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
}
# collect the urls into a list
urls = []
for i in range(1, 11):
    urls.append(f'https://www.qiushibaike.com/8hr/page/{i}/')

# the request function submitted to the pool
def get_request(url):
    # the callable passed to pool.map must take exactly one argument
    return requests.get(url, headers=headers).text
# instantiate a pool with 10 threads
pool = Pool(10)
response_text_list = pool.map(get_request,urls)
print(response_text_list)

9. Single Thread + Multi-Task Asynchronous Coroutines

1. Introduction

Coroutine: an object

# A coroutine can be thought of as a special function: if a function definition is modified by the async keyword, calling it does not immediately execute the statements inside; instead it returns a coroutine object.

Task object

# A task object is a further wrapper around a coroutine object; through it you can observe the coroutine's execution state.
# Task objects ultimately have to be registered with an event loop object.

Binding a callback

# A callback function is bound to a task object; it is executed only after the task's special function has finished running.

Event loop object

# An object that loops endlessly; you can also think of it as a container holding multiple task objects (a group of code blocks waiting to be executed).

Where the asynchrony shows up

# Once the event loop is started, it executes each task object in order;
    # when a task object blocks, the event loop does not wait but moves straight on to the next task object.

await: a suspending operation that gives up control of the cpu.

Single task

from time import sleep
import asyncio


# callback function:
# its default argument is the task object
def callback(task):
    print('i am callback!!1')
    print(task.result())  # result() returns the return value of the task's special function


async def get_request(url):
    print('requesting:', url)
    sleep(2)
    print('request finished:', url)
    return 'hello bobo'


# create a coroutine object
c = get_request('www.1.com')
# wrap it into a task object
task = asyncio.ensure_future(c)

# bind the callback function to the task object
task.add_done_callback(callback)

# create an event loop object
loop = asyncio.get_event_loop()
loop.run_until_complete(task)  # register the task with the event loop and start the loop


2. Multi-task asynchronous coroutines

import asyncio
from time import sleep
import time
start = time.time()
urls = [
    'http://localhost:5000/a',
    'http://localhost:5000/b',
    'http://localhost:5000/c'
]
# code from modules that do not support async must not appear in the code block to be executed
# any blocking operation inside this function must be modified with the await keyword
async def get_request(url):
    print('requesting:',url)
    # sleep(2)
    await asyncio.sleep(2)
    print('request finished:',url)
    return 'hello bobo'

tasks = [] # holds all the task objects
for url in urls:
    c = get_request(url)
    task = asyncio.ensure_future(c)
    tasks.append(task)

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))

print(time.time()-start) 

Notes:

1. Store the task objects in a list and register that list with the event loop; during registration the list has to be wrapped with the wait method.
2. Inside the task's special function, code from modules that do not support async must not appear, otherwise the whole asynchronous effect is broken. In addition, every blocking operation inside the function must be modified with the await keyword.
3. requests code must not appear inside the special function, because requests is a module that does not support async.

3. aiohttp

A network-request module that supports asynchronous operation.

- Environment setup: pip install aiohttp
import asyncio
import requests
import time
import aiohttp
from lxml import etree

urls = [
    'http://localhost:5000/bobo',
    'http://localhost:5000/bobo',
    'http://localhost:5000/bobo',
    'http://localhost:5000/bobo',
    'http://localhost:5000/bobo',
    'http://localhost:5000/bobo',
]


# requests cannot deliver the asynchronous effect because it does not support async, hence aiohttp
async def req(url):
    async with aiohttp.ClientSession() as s:
        async with await s.get(url) as response:
            # response.read(): bytes
            page_text = await response.text()
            return page_text

    # detail: put async before every with, and await before every blocking step


def parse(task):
    page_text = task.result()
    tree = etree.HTML(page_text)
    name = tree.xpath('//p/text()')[0]
    print(name)


if __name__ == '__main__':
    start = time.time()
    tasks = []
    for url in urls:
        c = req(url)
        task = asyncio.ensure_future(c)
        task.add_done_callback(parse)
        tasks.append(task)

    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.wait(tasks))

    print(time.time() - start)


10. selenium

Concept

A module based on browser automation.

Environment setup:

Install the selenium module.

How does selenium relate to crawling?

It conveniently grabs dynamically loaded data from a page.
     With requests, what you can see is not necessarily what you can get;
     with selenium, what you see is what you get.
It can also implement simulated login.

Basic operations:

Chrome driver download address:
http://chromedriver.storage.googleapis.com/index.html

Mapping table between chromedriver versions and Chrome versions:
https://blog.csdn.net/huilan_same/article/details/51896672

Action chains

A sequence of behavioral actions (see section 3 below).

Headless browsers

Browsers without a visible UI,
e.g. PhantomJS (now deprecated) or headless Chrome (see the sketch below).
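
A minimal headless-Chrome sketch, assuming the same local chromedriver path used elsewhere in this post:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# run Chrome without a visible window
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')

bro = webdriver.Chrome(executable_path=r'C:\Users\zhui3\Desktop\chromedriver.exe',
                       options=chrome_options)
bro.get('https://www.baidu.com')
# page_source works exactly as it does with a visible browser
print(bro.page_source)
bro.quit()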

1. Basic-operations demo on JD.com

from selenium import webdriver
from time import sleep
# 1. instantiate a browser object
bro = webdriver.Chrome(executable_path=r'C:\Users\zhui3\Desktop\chromedriver.exe')
# 2. simulate the user sending a request
url = 'https://www.jd.com'
bro.get(url)
# 3. locate the tag
search_input = bro.find_element_by_id('key')
# 4. interact with the located tag
search_input.send_keys('華為')
# 5. a series of actions
btn = bro.find_element_by_xpath('//*[@id="search"]/div/div[2]/button')
btn.click()
sleep(2)
# 6. execute js code
jsCode = 'window.scrollTo(0,document.body.scrollHeight)'
bro.execute_script(jsCode)
sleep(3)
# 7. close the browser
bro.quit()

2. Scraping the drug administration site with selenium

from selenium import webdriver
from lxml import etree
from time import sleep

page_text_list = []
# instantiate a browser object
bro = webdriver.Chrome(executable_path=r'C:\Users\zhui3\Desktop\chromedriver.exe')
url = 'http://125.35.6.84:81/xk/'
bro.get(url)
# must wait for the page to finish loading
sleep(2)
# page_source is the source code of the page currently open in the browser

page_text = bro.page_source
page_text_list.append(page_text)
# the Next button must be visible in the window before it can be clicked, so scroll down first
jsCode = 'window.scrollTo(0,document.body.scrollHeight)'
bro.execute_script(jsCode)
# open the next two pages
for i in range(2):
    bro.find_element_by_id('pageIto_next').click()
    sleep(2)

    page_text = bro.page_source
    page_text_list.append(page_text)

for p in page_text_list:
    tree = etree.HTML(p)
    li_list = tree.xpath('//*[@id="gzlist"]/li')
    for li in li_list:
        name = li.xpath('./dl/@title')[0]
        print(name)
sleep(2)
bro.quit()

3. Action chains

from lxml import etree
from time import sleep
from selenium import webdriver
from selenium.webdriver import ActionChains

# instantiate a browser object
page_text_list = []
bro = webdriver.Chrome(executable_path=r'C:\Users\zhui3\Desktop\chromedriver.exe')
url = 'https://www.runoob.com/try/try.php?filename=jqueryui-api-droppable'
bro.get(url)
# if the target tag lives in a sub-page inside an iframe, you must call switch_to before locating it
bro.switch_to.frame('iframeResult')
div_tag = bro.find_element_by_id('draggable')

# 1. instantiate the action-chain object
action = ActionChains(bro)
action.click_and_hold(div_tag)

for i in range(5):
    # perform() makes the action chain execute immediately
    action.move_by_offset(17, 0).perform()
    sleep(0.5)
# release the mouse
action.release()

sleep(3)

bro.quit()

4. Dealing with selenium detection

Many sites, such as Taobao, block selenium-based crawling.

In a normal browser, typing window.navigator.webdriver in the console returns undefined;

in a browser opened by code it returns true.

from selenium import webdriver
from selenium.webdriver import ChromeOptions
option = ChromeOptions()
option.add_experimental_option('excludeSwitches', ['enable-automation'])

# instantiate a browser object
bro = webdriver.Chrome(executable_path=r'C:\Users\zhui3\Desktop\chromedriver.exe',options=option)
bro.get('https://www.taobao.com/')


5. Simulating login to 12306

from selenium import webdriver
from selenium.webdriver import ActionChains
from PIL import Image  # used for cropping the screenshot (pillow)
from zbb import www
from time import sleep

bro =webdriver.Chrome(executable_path=r'C:\Users\zhui3\Desktop\chromedriver.exe')
bro.get('https://kyfw.12306.cn/otn/resources/login.html')
sleep(5)
zhdl = bro.find_element_by_xpath('/html/body/div[2]/div[2]/ul/li[2]/a')
zhdl.click()
sleep(1)

username = bro.find_element_by_id('J-userName')
username.send_keys('181873')
pwd = bro.find_element_by_id('J-password')
pwd.send_keys('zx1')
# capture (crop) the captcha image
bro.save_screenshot('main.png')
# locate the tag corresponding to the captcha image
code_img_ele = bro.find_element_by_xpath('//*[@id="J-loginImg"]')
location = code_img_ele.location  # top-left coordinate of the captcha image within the full page
size = code_img_ele.size  # width and height of the captcha image
# the crop rectangle (coordinates of the top-left and bottom-right corners)
rangle = (
int(location['x']), int(location['y']), int(location['x'] + size['width']), int(location['y'] + size['height']))

i = Image.open('main.png')
frame = i.crop(rangle)
frame.save('code.png')

# use the captcha-solving platform to recognize the click positions
result = www('./code.png',9004)
# x1,y1|x2,y2|x3,y3  ==> [[x1,y1],[x2,y2],[x3,y3]]
all_list = []  # each element is the coordinate of one point; (0, 0) is the top-left corner of the captcha image
if '|' in result:
    list_1 = result.split('|')
    count_1 = len(list_1)
    for i in range(count_1):
        xy_list = []
        x = int(list_1[i].split(',')[0])
        y = int(list_1[i].split(',')[1])
        xy_list.append(x)
        xy_list.append(y)
        all_list.append(xy_list)
else:
    x = int(result.split(',')[0])
    y = int(result.split(',')[1])
    xy_list = []
    xy_list.append(x)
    xy_list.append(y)
    all_list.append(xy_list)
print(all_list)
action = ActionChains(bro)
for l in all_list:
    x = l[0]
    y = l[1]
    action.move_to_element_with_offset(code_img_ele, x, y).click().perform()
    sleep(2)

btn = bro.find_element_by_xpath('//*[@id="J-login"]')
btn.click()


action.release()
sleep(3)
bro.quit()

