1. Web crawler overview
What a crawler is:
Writing a program that simulates a browser and crawls/fetches data from the internet.
Simulation: the browser itself is a natural, primitive crawling tool.
Crawler categories:
General-purpose crawler: fetches the data of an entire page. The fetching system (the crawler program).
Focused crawler: extracts only part of a page's data; it is always built on top of a general-purpose crawler.
Incremental crawler: monitors a site for updates so that only newly published data is crawled.
Risk analysis
Use crawlers responsibly
Where the risks lie:
the crawler interferes with the normal operation of the visited site;
the crawler fetches specific categories of data or information that are protected by law.
Avoiding the risks:
strictly follow the robots protocol published by the site;
while working around anti-crawling measures, optimize your own code so that it does not disturb the normal operation of the visited site;
when using or redistributing the fetched information, review its content; if it turns out to contain users' personal information, privacy, or someone else's trade secrets, stop immediately and delete it.
Anti-crawling mechanisms
Anti-anti-crawling strategies
robots.txt protocol: a plain-text protocol stating which data may and may not be crawled.
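As a quick illustration (my own sketch, not from the original notes), Python's standard-library urllib.robotparser can check whether a URL is allowed by a site's robots.txt; the Sogou URL below is only a placeholder:
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://www.sogou.com/robots.txt')  # placeholder site
rp.read()
# can_fetch(user_agent, url) -> True if that user agent may crawl the URL
print(rp.can_fetch('*', 'https://www.sogou.com/web?query=python'))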
Commonly used request headers
User-Agent: identifies the carrier sending the request
Connection: close
content-type
How do you tell whether a page contains dynamically loaded data?
local search vs. global search in the packet-capture tool
What is the first thing to do before crawling an unfamiliar site?
Determine whether the data you want to crawl is dynamically loaded!!!
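A minimal sketch of that check (my own example; the URL and the sample string are placeholders): request the page with requests and search the raw source for a piece of data you can see in the browser. If it is missing, the data is loaded dynamically and you must capture the extra request instead.
import requests

url = 'https://movie.douban.com/chart'  # placeholder page
headers = {'User-Agent': 'Mozilla/5.0'}
page_text = requests.get(url, headers=headers).text
# if the text visible in the browser is absent from the raw source, it was loaded via ajax/js
print('some text you saw in the browser' in page_text)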
2. Basic use of the requests module
requests module
Concept: a module built around network requests; its job is to simulate a browser sending requests.
Coding workflow:
specify the url
send the request
obtain the response data (the crawled data)
persist it to storage
import requests

url = 'https://www.sogou.com'
# the return value is a Response object
response = requests.get(url=url)
# .text returns the response data as a string
data = response.text
with open('./sogou.html', 'w', encoding='utf-8') as f:
    f.write(data)
Build a simple web-page collector on top of Sogou
fix the garbled-character (encoding) problem
get past the UA-detection check
import requests

wd = input('enter a keyword: ')
url = 'https://www.sogou.com/web'
# holds the dynamic request parameters
params = {
    'query': wd
}
# the params argument wraps the URL query parameters
# headers defeats the UA-detection anti-crawling mechanism (UA spoofing)
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
response = requests.get(url=url, params=params, headers=headers)
# manually set the response encoding to fix garbled Chinese characters
response.encoding = 'utf-8'
data = response.text
filename = wd + '.html'
with open(filename, 'w', encoding='utf-8') as f:
    f.write(data)
print(wd, 'saved successfully')
1. Crawl movie details from douban
Analysis
when the scroll bar hits the bottom of the page, an ajax request fires and returns a batch of movie data
dynamically loaded data: data obtained through an extra, separate request
ajax-generated dynamic data
js-generated dynamic data
import requests

limit = input('how many top-ranked movies: ')
url = 'https://movie.douban.com/j/chart/top_list'
params = {
    "type": "5",
    "interval_id": "100:90",
    "action": "",
    "start": "0",
    "limit": limit
}
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
response = requests.get(url=url, params=params, headers=headers)
# .json() returns the already-deserialized object
data_list = response.json()
with open('douban.txt', 'w', encoding='utf-8') as f:
    for movie in data_list:
        name = movie['title']
        score = movie['score']
        f.write(name + " " + score + "\n")
print('done')
2. Crawl KFC restaurant location data
import requests

url = "http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword"
params = {
    "cname": "",
    "pid": "",
    "keyword": "青島",
    "pageIndex": "1",
    "pageSize": "10"
}
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
response = requests.post(url=url, params=params, headers=headers)
# .json() returns the already-deserialized object
data_list = response.json()
with open('kedeji.txt', 'w', encoding='utf-8') as f:
    for store in data_list["Table1"]:
        name = store['storeName']
        address = store['addressDetail']
        f.write(name + "," + address + "\n")
print('done')
3. Crawl data from the drug administration
import requests

url = "http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsList"
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
with open('化妝品.txt', 'w', encoding='utf-8') as f:
    for page in range(1, 5):
        params = {
            "on": "true",
            "page": str(page),
            "pageSize": "12",
            "productName": "",
            "conditionType": "1",
            "applyname": "",
            "applysn": ""
        }
        response = requests.post(url=url, params=params, headers=headers)
        data_dic = response.json()
        for item in data_dic["list"]:
            company_id = item['ID']
            post_url = "http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsById"
            post_data = {
                "id": company_id
            }
            response2 = requests.post(url=post_url, params=post_data, headers=headers)
            data_dic2 = response2.json()
            title = data_dic2["epsName"]
            name = data_dic2['legalPerson']
            f.write(title + ":" + name + "\n")
3. Data parsing
Parsing: extracting data according to a specified rule
Purpose: it is what makes a focused crawler possible
Coding workflow of a focused crawler:
specify the url
send the request
obtain the response data
parse the data
persist the result
Data-parsing approaches:
regular expressions
bs4
xpath
pyquery (extension)
What is the general principle behind data parsing?
parsing operates on the page source (a collection of html tags)
What is the core purpose of html?
displaying data
How does html display data?
the data an html page displays always sits inside html tags or inside their attributes
General principle:
1. locate the tag
2. extract its text or its attribute
1. Regex-based parsing
1. Crawl the image posts from qiushibaike
Crawl a single image
import requests

url = "https://pic.qiushibaike.com/system/pictures/12330/123306162/medium/GRF7AMF9GKDTIZL6.jpg"
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
response = requests.get(url=url, headers=headers)
# .content returns the response data as bytes
img_data = response.content
with open('./123.jpg', 'wb') as f:
    f.write(img_data)
print('done')
Crawl a single page
<div class="thumb">
<a href="/article/123319109" target="_blank">
<img src="//pic.qiushibaike.com/system/pictures/12331/123319109/medium/MOX0YDFJX7CM1NWK.jpg" alt="糗事#123319109" class="illustration" width="100%" height="auto">
</a>
</div>
import re
import os
import requests

dir_name = "./img"
if not os.path.exists(dir_name):
    os.mkdir(dir_name)
url = "https://www.qiushibaike.com/imgrank/"
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
img_text = requests.get(url, headers=headers).text
ex = '<div class="thumb">.*?<img src="(.*?)" alt=.*?</div>'
img_list = re.findall(ex, img_text, re.S)
for src in img_list:
    src = "https:" + src
    img_name = src.split('/')[-1]
    img_path = dir_name + "/" + img_name
    # request the image URL to obtain the image bytes
    img_data = requests.get(src, headers=headers).content
    with open(img_path, "wb") as f:
        f.write(img_data)
print('done')
Crawl multiple pages
import re
import os
import requests

dir_name = "./img"
if not os.path.exists(dir_name):
    os.mkdir(dir_name)
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
for i in range(1, 5):
    url = f"https://www.qiushibaike.com/imgrank/page/{i}/"
    print(f"crawling the images on page {i}")
    img_text = requests.get(url, headers=headers).text
    ex = '<div class="thumb">.*?<img src="(.*?)" alt=.*?</div>'
    img_list = re.findall(ex, img_text, re.S)
    for src in img_list:
        src = "https:" + src
        img_name = src.split('/')[-1]
        img_path = dir_name + "/" + img_name
        # request the image URL to obtain the image bytes
        img_data = requests.get(src, headers=headers).content
        with open(img_path, "wb") as f:
            f.write(img_data)
print('done')
2. Parsing with bs4
Environment setup
pip install bs4
How bs4 parsing works
instantiate a BeautifulSoup object (soup) and load the page source to be parsed into it,
then call the object's attributes and methods to locate tags and extract data
How is a BeautifulSoup object instantiated?
BeautifulSoup(fp, 'lxml'): used for parsing a locally stored html document
BeautifulSoup(page_text, 'lxml'): used for parsing page source requested from the internet
Tag location
soup.tagName: locates the first tagName tag; only the first one is returned
Attribute-based location
soup.find('div', class_='s'): returns the div tag whose class is s
find_all: same usage as find, but returns a list
Selector-based location
select('selector'): returns a list
tag, class, id and hierarchy selectors (> one level, space: any number of levels)
Extracting data
text extraction
tag.string: only the text placed directly inside the tag
tag.text: all text content inside the tag
attribute extraction
soup.find("a", id='tt')['href']
1. Crawl the text of the novel Romance of the Three Kingdoms
http://www.shicimingju.com/book/sanguoyanyi.html
Crawl the chapter titles + chapter contents
1. Parse the chapter titles & each chapter's detail-page url from the home page
from bs4 import BeautifulSoup
import requests

url = 'http://www.shicimingju.com/book/sanguoyanyi.html'
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
page_text = requests.get(url, headers=headers).text
soup = BeautifulSoup(page_text, 'lxml')
a_list = soup.select(".book-mulu a")
with open('./sanguo.txt', 'w', encoding='utf-8') as f:
    for a in a_list:
        new_url = "http://www.shicimingju.com" + a["href"]
        title = a.text
        print(title)
        # request each chapter's detail page and parse the chapter text out of it
        new_page_text = requests.get(new_url, headers=headers).text
        new_soup = BeautifulSoup(new_page_text, 'lxml')
        content = new_soup.find('div', class_='chapter_content').text
        f.write(title + ":" + content + "\n")
3. Parsing with xpath
Environment setup
pip install lxml
How xpath parsing works
instantiate an etree object and load the page source data into it,
then call the object's xpath method with different xpath expressions to locate tags and extract data
Instantiating an etree object
tree = etree.parse(filename)
tree = etree.HTML(page_text)
the xpath method always returns a list
Tag location
tree.xpath("")
a / at the far left of an xpath expression means the tag must be located starting from the root node
a // at the far left means the tag can be located from any position
a // that is not at the far left means any number of levels
a / that is not at the far left means exactly one level
attribute-based location: //div[@class='ddd']
index-based location: //div[@class='ddd']/li[3]   # indices start at 1
index-based location: //div[@class='ddd']//li[2]  # indices start at 1
Extracting data
text extraction:
tree.xpath("//p[1]/text()"): only the direct text content
tree.xpath("//div[@class='ddd']/li[2]//text()"): all text content
attribute extraction:
tree.xpath('//a[@id="feng"]/@href')
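A minimal runnable sketch (the document below is my own example) showing the expressions above in action:
from lxml import etree

html = '''
<html><body>
  <div class="ddd">
    <ul>
      <li><a id="feng" href="/x">hello <b>world</b></a></li>
      <li>second</li>
    </ul>
  </div>
</body></html>'''
tree = etree.HTML(html)
print(tree.xpath('//div[@class="ddd"]//li'))            # // not at the far left: any depth; returns a list
print(tree.xpath('//div[@class="ddd"]//li[2]/text()'))  # indices start at 1 -> ['second']
print(tree.xpath('//li[1]//text()'))                    # all text under the first li -> ['hello ', 'world']
print(tree.xpath('//a[@id="feng"]/@href'))              # attribute extraction -> ['/x']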
1. Crawl job listings from Boss Zhipin
from lxml import etree
import requests
import time

url = 'https://www.zhipin.com/job_detail/?query=python&city=101120200&industry=&position='
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
    'cookie':'__zp__pub__=; lastCity=101120200; __c=1594792470; __g=-; Hm_lvt_194df3105ad7148dcf2b98a91b5e727a=1594713563,1594713587,1594792470; __l=l=%2Fwww.zhipin.com%2Fqingdao%2F&r=&friend_source=0&friend_source=0; __a=26925852.1594713563.1594713586.1594792470.52.3.39.52; Hm_lpvt_194df3105ad7148dcf2b98a91b5e727a=1594801318; __zp_stoken__=c508aZxdfUB9hb0Q8ORppIXd7JTdDTF96U3EdCDgIHEscYxUsVnoqdH9VBxY5GUtkJi5wfxggRDtsR0dAT2pEDDRRfWsWLg8WUmFyWQECQlYFSV4SCUQqUB8yfRwAUTAyZBc1ABdbRRhyXUY%3D'
}
page_text = requests.get(url, headers=headers).text
tree = etree.HTML(page_text)
li_list = tree.xpath('//*[@id="main"]/div/div[2]/ul/li')
for li in li_list:
    # extract the relevant data from the partial page source represented by li
    # when an xpath expression is used inside a loop it must start with ./ or .//
    detail_url = 'https://www.zhipin.com' + li.xpath('.//span[@class="job-name"]/a/@href')[0]
    job_title = li.xpath('.//span[@class="job-name"]/a/text()')[0]
    company = li.xpath('.//div[@class="info-company"]/div/h3/a/text()')[0]
    # request the detail page and parse out the job description
    detail_page_text = requests.get(detail_url, headers=headers).text
    detail_tree = etree.HTML(detail_page_text)
    job_desc = detail_tree.xpath('//div[@class="text"]/text()')
    # join the list into a string
    job_desc = ''.join(job_desc)
    print(job_title, company, job_desc)
    time.sleep(5)
2. Crawl qiushibaike
Crawl the author and the post text. Note that an author can be either anonymous or registered
from lxml import etree
import requests

url = "https://www.qiushibaike.com/text/page/4/"
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
}
page_text = requests.get(url, headers=headers).text
tree = etree.HTML(page_text)
div_list = tree.xpath('//div[@class="col1 old-style-col1"]/div')
print(div_list)
for div in div_list:
    # anonymous users and registered users are marked up differently
    author = div.xpath('.//div[@class="author clearfix"]//h2/text() | .//div[@class="author clearfix"]/span[2]/h2/text()')[0]
    content = div.xpath('.//div[@class="content"]/span//text()')
    content = ''.join(content)
    print(author, content)
3. Crawl images from a wallpaper site
from lxml import etree
import requests
import os

dir_name = "./img2"
if not os.path.exists(dir_name):
    os.mkdir(dir_name)
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
}
for i in range(1, 6):
    if i == 1:
        url = "http://pic.netbian.com/4kmeinv/"
    else:
        url = f"http://pic.netbian.com/4kmeinv/index_{i}.html"
    page_text = requests.get(url, headers=headers).text
    tree = etree.HTML(page_text)
    li_list = tree.xpath('//*[@id="main"]/div[3]/ul/li')
    for li in li_list:
        img_src = "http://pic.netbian.com/" + li.xpath('./a/img/@src')[0]
        img_name = li.xpath('./a/b/text()')[0]
        # fix the garbled Chinese characters in the name
        img_name = img_name.encode('iso-8859-1').decode('gbk')
        img_data = requests.get(img_src, headers=headers).content
        img_path = dir_name + "/" + f"{img_name}.jpg"
        with open(img_path, "wb") as f:
            f.write(img_data)
    print(f"page {i} done")
4. IP proxies
Proxy server
forwards requests on your behalf, which lets you change the IP a request appears to come from
Proxy anonymity levels
transparent: the server knows you used a proxy and also knows your real IP
anonymous: the server knows you used a proxy but does not know your real IP
elite (high anonymity): the server does not know you used a proxy, let alone your real IP
Proxy types (see the short sketch below for how a proxy is passed to requests)
http: this type of proxy can only forward http requests
https: can only forward https requests
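For reference, a minimal sketch of my own (the proxy address is a placeholder): requests takes a proxies dict keyed by protocol.
import requests

# placeholder address; substitute a live ip:port taken from one of the sites below
proxies = {'https': 'https://1.2.3.4:8888'}
response = requests.get('https://www.baidu.com', proxies=proxies, timeout=5)
print(response.status_code)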
Sites offering free proxy IPs
kuaidaili
xicidaili
goubanjia
daili jingling (recommended): http://http.zhiliandaili.cn/
What can you do when your IP gets banned while crawling?
use a proxy
build a proxy pool
use a dial-up server
import requests
import random
from lxml import etree

# the proxy pool, a list of dicts
all_ips = []
proxy_url = "http://t.11jsq.com/index.php/api/entry?method=proxyServer.generate_api_url&packid=1&fa=0&fetch_key=&groupid=0&qty=5&time=1&pro=&city=&port=1&format=html&ss=5&css=&dt=1&specialTxt=3&specialJson=&usertype=15"
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
}
proxy_page_text = requests.get(url=proxy_url, headers=headers).text
tree = etree.HTML(proxy_page_text)
proxy_list = tree.xpath('//body//text()')
for ip in proxy_list:
    dic = {'https': ip}
    all_ips.append(dic)
# crawl the free proxy IPs listed on kuaidaili
free_proxies = []
for i in range(1, 3):
    url = f"http://www.kuaidaili.com/free/inha/{i}/"
    page_text = requests.get(url, headers=headers, proxies=random.choice(all_ips)).text
    tree = etree.HTML(page_text)
    # a tbody inserted by the browser must not appear in an xpath expression
    tr_list = tree.xpath('//*[@id="list"]/table//tr')
    for tr in tr_list:
        td_text = tr.xpath("./td/text()")
        if not td_text:
            # skip the header row, which holds th instead of td
            continue
        ip = td_text[0]
        port = tr.xpath("./td[2]/text()")[0]
        dic = {
            "ip": ip,
            "port": port
        }
        print(dic)
        free_proxies.append(dic)
    print(f"page {i} done")
print(len(free_proxies))
5. Handling cookies
Video-resolving API endpoints
https://www.wocao.xyz/index.php?url=
https://2wk.com/vip.php?url=
https://api.47ks.com/webcloud/?v-
Video-resolving sites
牛巴巴 http://mv.688ing.com/
愛片網 https://ap2345.com/vip/
全民解析 http://www.qmaile.com/
Back to the main topic
Why do cookies need to be handled?
they store state for the client
The request has to carry the cookie; how do you deal with cookie-based anti-crawling?
#manual handling
capture the cookie with a packet-capture tool and put it into headers
#automatic handling
use the session mechanism
use case: cookies that change dynamically
session object: it is used almost exactly like the requests module. If a request produces a cookie and that request was sent through the session, the cookie is automatically stored in the session
Crawl data from xueqiu.com
import requests

s = requests.Session()
main_url = "https://xueqiu.com"  # request the home page first so the session picks up the cookie
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
}
params = {
    "size": "8",
    '_type': "10",
    "type": "10"
}
s.get(main_url, headers=headers)
url = 'https://stock.xueqiu.com/v5/stock/hot_stock/list.json'
page_text = s.get(url, headers=headers, params=params).json()
print(page_text)
6. Captcha recognition
Recognition via online captcha-solving platforms
- dama2 (打碼兔)
- yundama (雲打碼)
- Chaojiying: http://www.chaojiying.com/about.html
1. register and log in (identity verification in the user center)
2. after logging in
create a software entry: software ID -> generates a software id
download the sample code: developer docs -> python -> download
Demo of the platform's sample code
import requests
from hashlib import md5


class Chaojiying_Client(object):
    def __init__(self, username, password, soft_id):
        self.username = username
        password = password.encode('utf8')
        self.password = md5(password).hexdigest()
        self.soft_id = soft_id
        self.base_params = {
            'user': self.username,
            'pass2': self.password,
            'softid': self.soft_id,
        }
        self.headers = {
            'Connection': 'Keep-Alive',
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
        }

    def PostPic(self, im, codetype):
        params = {
            'codetype': codetype,
        }
        params.update(self.base_params)
        files = {'userfile': ('ccc.jpg', im)}
        r = requests.post('http://upload.chaojiying.net/Upload/Processing.php', data=params, files=files,
                          headers=self.headers)
        return r.json()

    def ReportError(self, im_id):
        params = {
            'id': im_id,
        }
        params.update(self.base_params)
        r = requests.post('http://upload.chaojiying.net/Upload/ReportError.php', data=params, headers=self.headers)
        return r.json()


chaojiying = Chaojiying_Client('chaojiying username', 'chaojiying password', '96001')
im = open('a.jpg', 'rb').read()
print(chaojiying.PostPic(im, 1902)['pic_str'])
Recognize the captcha on the gushiwen site
zbb.py
import requests
from hashlib import md5


class Chaojiying_Client(object):
    def __init__(self, username, password, soft_id):
        self.username = username
        password = password.encode('utf8')
        self.password = md5(password).hexdigest()
        self.soft_id = soft_id
        self.base_params = {
            'user': self.username,
            'pass2': self.password,
            'softid': self.soft_id,
        }
        self.headers = {
            'Connection': 'Keep-Alive',
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
        }

    def PostPic(self, im, codetype):
        params = {
            'codetype': codetype,
        }
        params.update(self.base_params)
        files = {'userfile': ('ccc.jpg', im)}
        r = requests.post('http://upload.chaojiying.net/Upload/Processing.php', data=params, files=files,
                          headers=self.headers)
        return r.json()

    def ReportError(self, im_id):
        params = {
            'id': im_id,
        }
        params.update(self.base_params)
        r = requests.post('http://upload.chaojiying.net/Upload/ReportError.php', data=params, headers=self.headers)
        return r.json()


def www(path, codetype):
    chaojiying = Chaojiying_Client('5423', '521521', '906630')
    im = open(path, 'rb').read()
    return chaojiying.PostPic(im, codetype)['pic_str']
main.py (the original notes name this file requests.py, but that name would shadow the requests library it imports, so use a different one)
import requests
from lxml import etree
from zbb import www

headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
}
url = 'https://so.gushiwen.cn/user/login.aspx?from=http://so.gushiwen.cn/user/collect.aspx'
page_text = requests.get(url, headers=headers).text
tree = etree.HTML(page_text)
img_url = "https://so.gushiwen.cn/" + tree.xpath('//*[@id="imgCode"]/@src')[0]
img_data = requests.get(img_url, headers=headers).content
with open('./111.jpg', 'wb') as f:
    f.write(img_data)
img_text = www('./111.jpg', 1004)
print(img_text)
7. Simulated login
Why does a crawler need to simulate logging in?
some data is only shown after you are logged in
gushiwen.cn
Anti-crawling mechanisms involved
1. captcha
2. dynamic request parameters: the request parameters change with every request
dynamic capture: such parameters are usually hidden in the source of the front-end page
3. the cookie is tied to the captcha-image request
a really nasty trap
import requests
from lxml import etree
from zbb import www

headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
}
# obtain the cookie through a session
s = requests.Session()
# s_url = "https://so.gushiwen.org/user/login.aspx?from=http://so.gushiwen.org/user/collect.aspx"
# s.get(s_url, headers=headers)
# fetch the captcha
url = 'https://so.gushiwen.cn/user/login.aspx?from=http://so.gushiwen.cn/user/collect.aspx'
page_text = requests.get(url, headers=headers).text
tree = etree.HTML(page_text)
img_url = "https://so.gushiwen.cn/" + tree.xpath('//*[@id="imgCode"]/@src')[0]
# the cookie is produced by the captcha-image request, so that request must go through the session
img_data = s.get(img_url, headers=headers).content
with open('./111.jpg', 'wb') as f:
    f.write(img_data)
img_text = www('./111.jpg', 1004)
print(img_text)
# dynamically capture the dynamic request parameters
__VIEWSTATE = tree.xpath('//*[@id="__VIEWSTATE"]/@value')[0]
__VIEWSTATEGENERATOR = tree.xpath('//*[@id="__VIEWSTATEGENERATOR"]/@value')[0]
# the url the login button posts to, captured with a packet-capture tool
login_url = 'https://so.gushiwen.cn/user/login.aspx?from=http%3a%2f%2fso.gushiwen.cn%2fuser%2fcollect.aspx'
data = {
    "__VIEWSTATE": __VIEWSTATE,
    "__VIEWSTATEGENERATOR": __VIEWSTATEGENERATOR,  # changes on every request
    "from": "http://so.gushiwen.cn/user/collect.aspx",
    "email": "542154983@qq.com",
    "pwd": "zxy521",
    "code": img_text,
    "denglu": "登入"
}
main_page_text = s.post(login_url, headers=headers, data=data).text
with open('main.html', 'w', encoding='utf-8') as fp:
    fp.write(main_page_text)
8. Asynchronous crawling with a thread pool
Use a thread pool to crawl the first ten pages of qiushibaike asynchronously
import requests
from multiprocessing.dummy import Pool

headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
}
# collect the urls into a list
urls = []
for i in range(1, 11):
    urls.append(f'https://www.qiushibaike.com/8hr/page/{i}/')

# the request function; map() calls it with exactly one argument
def get_request(url):
    return requests.get(url, headers=headers).text

# instantiate a pool of 10 threads
pool = Pool(10)
response_text_list = pool.map(get_request, urls)
print(response_text_list)
9. Single thread + multi-task asynchronous coroutines
1. Introduction
coroutine: an object
#A coroutine can be thought of as a special function. If a function definition is decorated with the async keyword, calling it does not run the statements in its body immediately; instead it returns a coroutine object.
task object
#A task object is a further wrapper around a coroutine object. Through the task object you can inspect how the coroutine is running.
#Task objects ultimately have to be registered with an event loop object.
binding a callback
#The callback is bound to the task object, and it only runs after the task's special function has finished executing
event loop object
#An object that loops forever. You can also think of it as a container holding multiple task objects (i.e. a group of code blocks waiting to run).
where the asynchrony shows up
#Once the event loop is started it runs each task object in order,
#and when a task hits a blocking operation the loop does not wait for it but moves straight on to the next task object
await: the suspend operation; it gives up control of the cpu
Single task
from time import sleep
import asyncio

# callback function:
# its default argument is the task object
def callback(task):
    print('i am callback!!1')
    print(task.result())  # result() returns the return value of the task's special function

async def get_request(url):
    print('requesting:', url)
    sleep(2)
    print('request finished:', url)
    return 'hello bobo'

# create a coroutine object
c = get_request('www.1.com')
# wrap it into a task object
task = asyncio.ensure_future(c)
# bind the callback to the task object
task.add_done_callback(callback)
# create an event loop object
loop = asyncio.get_event_loop()
loop.run_until_complete(task)  # register the task with the event loop and start the loop
2. Multi-task asynchronous coroutines
import asyncio
from time import sleep
import time

start = time.time()
urls = [
    'http://localhost:5000/a',
    'http://localhost:5000/b',
    'http://localhost:5000/c'
]

# the code block to be executed must not use modules that do not support async
# every blocking operation inside this function must be decorated with the await keyword
async def get_request(url):
    print('requesting:', url)
    # sleep(2)
    await asyncio.sleep(2)
    print('request finished:', url)
    return 'hello bobo'

tasks = []  # holds all the task objects
for url in urls:
    c = get_request(url)
    task = asyncio.ensure_future(c)
    tasks.append(task)
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))
print(time.time() - start)
Notes:
1. Store the task objects in a list and register that list with the event loop; when registering, the list must be wrapped with the wait method.
2. The special function behind a task must not contain code from modules that do not support async, otherwise the whole asynchronous effect is broken; and every blocking operation inside that function must be decorated with the await keyword.
3. Code using the requests module must not appear inside the special function, because requests does not support async.
3. aiohttp
a network-request module that supports asynchronous operation
- installation: pip install aiohttp
import asyncio
import time
import aiohttp
from lxml import etree

urls = [
    'http://localhost:5000/bobo',
    'http://localhost:5000/bobo',
    'http://localhost:5000/bobo',
    'http://localhost:5000/bobo',
    'http://localhost:5000/bobo',
    'http://localhost:5000/bobo',
]

# requests cannot be used here: it does not support async and would kill the asynchronous effect
async def req(url):
    async with aiohttp.ClientSession() as s:
        async with await s.get(url) as response:
            # response.read() would return bytes
            page_text = await response.text()
            return page_text
# detail: put async in front of every with, and await in front of every blocking step

def parse(task):
    page_text = task.result()
    tree = etree.HTML(page_text)
    name = tree.xpath('//p/text()')[0]
    print(name)

if __name__ == '__main__':
    start = time.time()
    tasks = []
    for url in urls:
        c = req(url)
        task = asyncio.ensure_future(c)
        task.add_done_callback(parse)
        tasks.append(task)
    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.wait(tasks))
    print(time.time() - start)
10. selenium
Concept
a module for browser automation.
Environment setup:
install the selenium module
What does selenium have to do with crawling?
it is a convenient way to get data that a page loads dynamically
scraping with the requests module: what you can see is not necessarily what you get
selenium: what you see is what you get
it can also be used to simulate logging in
Basic setup:
download address for the Chrome driver:
http://chromedriver.storage.googleapis.com/index.html
mapping table between chromedriver versions and Chrome versions:
https://blog.csdn.net/huilan_same/article/details/51896672
Action chains
a sequence of actions
Headless browser
a browser without a visual interface
PhantomJS
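PhantomJS is no longer maintained; as an alternative, here is a small sketch of my own (the chromedriver path is a placeholder) that runs Chrome itself headlessly through ChromeOptions:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')      # run Chrome without opening a window
options.add_argument('--disable-gpu')
bro = webdriver.Chrome(executable_path=r'C:\Users\zhui3\Desktop\chromedriver.exe', options=options)
bro.get('https://www.baidu.com')
print(bro.page_source[:200])
bro.quit()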
1. Basic operations, demonstrated on jd.com
from selenium import webdriver
from time import sleep

# 1. instantiate a browser object
bro = webdriver.Chrome(executable_path=r'C:\Users\zhui3\Desktop\chromedriver.exe')
# 2. simulate the user sending a request
url = 'https://www.jd.com'
bro.get(url)
# 3. locate the tag
search_input = bro.find_element_by_id('key')
# 4. interact with the located tag
search_input.send_keys('華為')
# 5. a series of actions
btn = bro.find_element_by_xpath('//*[@id="search"]/div/div[2]/button')
btn.click()
sleep(2)
# 6. execute js code
jsCode = 'window.scrollTo(0,document.body.scrollHeight)'
bro.execute_script(jsCode)
sleep(3)
# 7. close the browser
bro.quit()
2. Crawl the drug administration data with selenium
from selenium import webdriver
from lxml import etree
from time import sleep

page_text_list = []
# instantiate a browser object
bro = webdriver.Chrome(executable_path=r'C:\Users\zhui3\Desktop\chromedriver.exe')
url = 'http://125.35.6.84:81/xk/'
bro.get(url)
# wait until the page has finished loading
sleep(2)
# page_source is the source of the page currently open in the browser
page_text = bro.page_source
page_text_list.append(page_text)
# the next-page button must be visible in the window before it can be clicked, so scroll down
jsCode = 'window.scrollTo(0,document.body.scrollHeight)'
bro.execute_script(jsCode)
# open the next two pages
for i in range(2):
    bro.find_element_by_id('pageIto_next').click()
    sleep(2)
    page_text = bro.page_source
    page_text_list.append(page_text)
for p in page_text_list:
    tree = etree.HTML(p)
    li_list = tree.xpath('//*[@id="gzlist"]/li')
    for li in li_list:
        name = li.xpath('./dl/@title')[0]
        print(name)
sleep(2)
bro.quit()
3. Action chains
from time import sleep
from selenium import webdriver
from selenium.webdriver import ActionChains

# instantiate a browser object
bro = webdriver.Chrome(executable_path=r'C:\Users\zhui3\Desktop\chromedriver.exe')
url = 'https://www.runoob.com/try/try.php?filename=jqueryui-api-droppable'
bro.get(url)
# if the target tag lives inside an iframe's sub-page, you must call switch_to before locating it
bro.switch_to.frame('iframeResult')
div_tag = bro.find_element_by_id('draggable')
# 1. instantiate an action-chain object
action = ActionChains(bro)
action.click_and_hold(div_tag)
for i in range(5):
    # perform() makes the action chain execute immediately
    action.move_by_offset(17, 0).perform()
    sleep(0.5)
# release the mouse button
action.release().perform()
sleep(3)
bro.quit()
4. Dealing with sites that detect selenium
Many sites, taobao for instance, block selenium-driven visits
In a normal browser, evaluating window.navigator.webdriver in the console returns undefined
When the browser was opened by automation code it returns true
from selenium import webdriver
from selenium.webdriver import ChromeOptions

option = ChromeOptions()
option.add_experimental_option('excludeSwitches', ['enable-automation'])
# instantiate a browser object
bro = webdriver.Chrome(executable_path=r'C:\Users\zhui3\Desktop\chromedriver.exe', options=option)
bro.get('https://www.taobao.com/')
5. Simulate logging in to 12306
from selenium import webdriver
from selenium.webdriver import ActionChains
from PIL import Image  # used to crop the screenshot (pip install pillow)
from zbb import www
from time import sleep

bro = webdriver.Chrome(executable_path=r'C:\Users\zhui3\Desktop\chromedriver.exe')
bro.get('https://kyfw.12306.cn/otn/resources/login.html')
sleep(5)
# switch to the account-login tab
zhdl = bro.find_element_by_xpath('/html/body/div[2]/div[2]/ul/li[2]/a')
zhdl.click()
sleep(1)
username = bro.find_element_by_id('J-userName')
username.send_keys('181873')
pwd = bro.find_element_by_id('J-password')
pwd.send_keys('zx1')
# capture the captcha image by cropping a full-page screenshot
bro.save_screenshot('main.png')
# locate the tag that holds the captcha image
code_img_ele = bro.find_element_by_xpath('//*[@id="J-loginImg"]')
location = code_img_ele.location  # coordinates of the captcha image's top-left corner within the page
size = code_img_ele.size  # width and height of the captcha image
# the rectangle to crop (top-left and bottom-right corner coordinates)
rangle = (
    int(location['x']), int(location['y']), int(location['x'] + size['width']), int(location['y'] + size['height']))
img = Image.open('main.png')
frame = img.crop(rangle)
frame.save('code.png')
# use the captcha-solving platform to recognize the image
result = www('./code.png', 9004)
# x1,y1|x2,y2|x3,y3 ==> [[x1,y1],[x2,y2],[x3,y3]]
all_list = []  # each element is a point to click; (0,0) is the top-left corner of the captcha image
if '|' in result:
    list_1 = result.split('|')
    count_1 = len(list_1)
    for i in range(count_1):
        xy_list = []
        x = int(list_1[i].split(',')[0])
        y = int(list_1[i].split(',')[1])
        xy_list.append(x)
        xy_list.append(y)
        all_list.append(xy_list)
else:
    x = int(result.split(',')[0])
    y = int(result.split(',')[1])
    xy_list = []
    xy_list.append(x)
    xy_list.append(y)
    all_list.append(xy_list)
print(all_list)
action = ActionChains(bro)
for l in all_list:
    x = l[0]
    y = l[1]
    action.move_to_element_with_offset(code_img_ele, x, y).click().perform()
    sleep(2)
btn = bro.find_element_by_xpath('//*[@id="J-login"]')
btn.click()
action.release()
sleep(3)
bro.quit()