Getting Started with Python 3 Web Crawlers (Part 1)
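A web crawler, also called a web spider, fetches page content based on web addresses (URLs). A URL is simply the link you type into the browser to visit a site.
Enter a URL in the browser's address bar, right-click on the page, and choose Inspect. (The name differs between browsers: Chrome calls it Inspect and Firefox calls it Inspect Element, but the function is the same.)
- Most sites publish a crawler policy (for example https://www.baidu.com/robots.txt, which spells out what is allowed and what is not)
- If you can see it, you can crawl it (technically)
- Legally, though, crawling can fall into a gray area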
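As a rough illustration of the robots.txt point above, the standard library's `urllib.robotparser` can evaluate such rules. This is only a sketch: the rules in `SAMPLE_ROBOTS_TXT`, the `my-crawler` user agent, and the `example.com` URLs are made up for the demo and are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this illustration.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


public_ok = is_allowed(SAMPLE_ROBOTS_TXT, "my-crawler", "https://example.com/index.html")
private_ok = is_allowed(SAMPLE_ROBOTS_TXT, "my-crawler", "https://example.com/private/data.html")
print(public_ok, private_ok)
```

For a live site you would normally point the parser at the site's own robots.txt with `set_url()` and `read()` instead of passing text in by hand.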
I. A URL is more formally called a Uniform Resource Locator. Its general format is as follows (parts in square brackets [] are optional):
protocol :// hostname[:port] / path / [;parameters][?query]#fragment
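It consists mainly of the first three parts:
protocol: the first part is the protocol; Google, for example, uses https.
hostname[:port]: the second part is the host name, with the port number as an optional part. Sites served over http use port 80 by default, and sites served over https use port 443.
path: the third part is the location of the resource on the host, such as directories and file names, which is what we usually call the path. This part matters: requesting different paths means asking the server for different resources. For example, two slipper listings on JD.com have the paths
100006079301.html and 100003887822.html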
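To make the parts described above concrete, here is a small sketch that splits a URL with the standard library's `urllib.parse.urlparse`. The path `100006079301.html` comes from the example above; the host name `item.jd.com`, the query `page=1`, and the fragment `comment` are assumptions added only so that every field has something to show. `describe_url` is a helper name invented for this sketch.

```python
from urllib.parse import urlparse

# Default ports for the two protocols mentioned above.
DEFAULT_PORTS = {"http": 80, "https": 443}


def describe_url(url: str) -> dict:
    """Split a URL into the parts of the general format."""
    parts = urlparse(url)
    return {
        "protocol": parts.scheme,
        "hostname": parts.hostname,
        # Fall back to the protocol's default port when none is written out
        "port": parts.port or DEFAULT_PORTS.get(parts.scheme),
        "path": parts.path,
        "parameters": parts.params,
        "query": parts.query,
        "fragment": parts.fragment,
    }


# Host name, query, and fragment are illustrative assumptions.
info = describe_url("https://item.jd.com/100006079301.html?page=1#comment")
for name, value in info.items():
    print(f"{name}: {value}")
```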
II. The first step of a web crawler is to fetch a page's HTML from its URL. In Python 3 you can use either urllib.request or requests to do this.
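The rest of this article uses requests, so here is a minimal sketch of the urllib.request route for comparison. The helper name `fetch_html` and the 10-second timeout are choices made for this example. The demo fetches a `data:` URL so that it runs without network access; in practice you would pass an http or https address instead.

```python
import urllib.request


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Fetch a URL with urllib.request and return the decoded body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()
        # Use the charset from the Content-Type header, falling back to UTF-8
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")


# A data: URL keeps the demo offline; swap in a real page address to crawl.
demo = fetch_html("data:text/plain;charset=utf-8,hello%20crawler")
print(demo)
```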
1. The requests module
- Install: pip3 install requests. urllib ships with Python 3 (Python 2 had both urllib and urllib2); requests is a third-party package built on top of urllib3.
# **** Basic usage ****
# Import the module
import requests

# Send a GET request; a response object is returned
resp = requests.get('https://www.baidu.com')

# The body of the response
print(resp.text)

# Save the body to a file
with open('a.html', 'w', encoding='utf-8') as f:
    f.write(resp.text)

# The status code of the response
print(resp.status_code)
III. Simple examples
1. First, let's look at the requests.get() method, which sends a GET request to the server.
import requests

if __name__ == '__main__':
    url = "http://www.baidu.com/"
    req = requests.get(url=url)
    # Decode the body as UTF-8 so the Chinese text displays correctly
    req.encoding = 'utf-8'
    print(req.text)
Output:
<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>百度一下,你就知道</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=百度一下 class="bg s_btn"></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>新聞</a> <a href=http://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>地圖</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>視訊</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>貼吧</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>登入</a> </noscript> <script>document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" 
: "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">登入</a>');</script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">更多產品</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>關於百度</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>©2017 Baidu <a href=http://www.baidu.com/duty/>使用百度前必讀</a> <a href=http://jianyi.baidu.com/ class=cp-feedback>意見反饋</a> 京ICP證030173號 <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>
Paste the output into a text file and save it as 1.html; opening that file in a browser shows the page we crawled.
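The `req.encoding = 'utf-8'` line in the example matters more than it looks. When a server does not declare a charset for a text response, requests falls back to ISO-8859-1, and Chinese text then comes out garbled. The sketch below reproduces that effect offline with plain bytes; the title string is copied from the output above, and nothing here talks to the network.

```python
# The same bytes decoded with two different charsets.
page_title = "百度一下,你就知道"
raw_bytes = page_title.encode("utf-8")   # the bytes a server would send

wrong = raw_bytes.decode("iso-8859-1")   # wrong charset guess: garbled text
right = raw_bytes.decode("utf-8")        # correct charset: readable text

print(repr(wrong))
print(right)
```

Garbled output of this kind is usually a sign that the response was decoded with the wrong charset rather than that the download itself failed.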
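2. Scraping the reviews of a product on JD.com
Open the product page and find the content we want.
Right-click > Inspect > Network > click the product reviews tab > search for part of a review's text to find the matching request.
This gives us the URL of the request.
Now write the code: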
# -*- coding: utf-8 -*-
# Scrape reviews of a top-selling lipstick on JD.com (this request fetches one page of 10 reviews)
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
}
resp = requests.get('https://club.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98&productId=100006079301&score=0&sortType=5&page=0&pageSize=10&isShadowSku=0&fold=1', headers=headers)
sp = resp.text
print(sp)
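Note that JD.com applies anti-crawling checks here and verifies the User-Agent, which is why the request carries a headers dictionary.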
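To see why a User-Agent check catches naive scripts, it helps to look at what Python sends by default. The sketch below uses only the standard library and makes no network call. `build_request` is a helper name invented here, and the browser string is the one used in the code above. requests behaves similarly, announcing itself as `python-requests/<version>` unless you override the header.

```python
import urllib.request

BROWSER_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36"
)


def build_request(url: str, user_agent: str = BROWSER_UA) -> urllib.request.Request:
    """Create a request object that presents a browser-like User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


# What urllib sends when no header is supplied: it identifies itself as a script.
default_headers = dict(urllib.request.build_opener().addheaders)
print(default_headers["User-agent"])

# The request object is only built here, not sent.
request = build_request("https://club.jd.com/comment/productPageComments.action")
print(request.get_header("User-agent"))
```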
The response is too large to paste in full, so here we read only one page.
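If you later want more than one page, one option is to build the request URL from its query parameters and vary `page`. The sketch below does this with `urllib.parse.urlencode`; the parameter names and values are copied from the URL in the code above. `build_comment_url` is a hypothetical helper, and the idea that `page` simply counts up from 0 is an assumption about this endpoint, not something the response confirms.

```python
from urllib.parse import urlencode

BASE_URL = "https://club.jd.com/comment/productPageComments.action"


def build_comment_url(product_id: int, page: int, page_size: int = 10) -> str:
    """Build the review URL for one page, using the parameters seen above."""
    params = {
        "callback": "fetchJSON_comment98",
        "productId": product_id,
        "score": 0,
        "sortType": 5,
        "page": page,          # assumed to be a zero-based page index
        "pageSize": page_size,
        "isShadowSku": 0,
        "fold": 1,
    }
    return f"{BASE_URL}?{urlencode(params)}"


# URLs for the first 10 pages; nothing is sent here.
urls = [build_comment_url(100006079301, page) for page in range(10)]
print(urls[0])
print(len(urls))
```

requests can also do the encoding itself if you pass the same dictionary as `params=` to `requests.get()`.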
Output of the requests.get call above (one page):
fetchJSON_comment98({"productAttr":null,"productCommentSummary":{"skuId":100006079301,"averageScore":5,"defaultGoodCount":226069,"defaultGoodCountStr":"22萬+","commentCount":267810,"commentCountStr":"26萬+","goodCount":40390,"goodCountStr":"4萬+","goodRate":0.96,"goodRateShow":96,"generalCount":539,"generalCountStr":"500+","generalRate":0.012,"generalRateShow":1,"poorCount":812,"poorCountStr":"800+","poorRate":0.028,"poorRateShow":3,"videoCount":315,"videoCountStr":"300+","afterCount":538,"afterCountStr":"500+","showCount":6217,"showCountStr":"6200+","oneYear":0,"sensitiveBook":0,"fixCount":0,"plusCount":0,"plusCountStr":"0","buyerShow":0,"poorRateStyle":4,"generalRateStyle":2,"goodRateStyle":144,"installRate":0,"productId":100006079301,"score1Count":812,"score2Count":174,"score3Count":365,"score4Count":427,"score5Count":39963},"hotCommentTagStatistics":[{"id":"51197de19f0ba763","name":"高階大氣","count":83,"type":4,"canBeFiltered":true,"stand":1,"rid":"51197de19f0ba763","ckeKeyWordBury":"eid=100^^tagid=51197de19f0ba763^^pid=20006^^sku=100006079301^^sversion=1000^^token=cf387ae8a15eca59"},{"id":"8d5387624e70bbd2","name":"質地細膩","count":21,"type":4,"canBeFiltered":true,"stand":1,"rid":"8d5387624e70bbd2","ckeKeyWordBury":"eid=100^^tagid=8d5387624e70bbd2^^pid=20006^^sku=100006079301^^sversion=1000^^token=4060dfd99e9f9368"},{"id":"13bacd43c59bb7d8","name":"質感極佳","count":19,"type":4,"canBeFiltered":true,"stand":1,"rid":"13bacd43c59bb7d8","ckeKeyWordBury":"eid=100^^tagid=13bacd43c59bb7d8^^pid=20006^^sku=100006079301^^sversion=1000^^token=22259161e6f06a4a"},{"id":"4244676cbb4a9a7a","name":"質地柔軟舒適","count":18,"type":4,"canBeFiltered":true,"stand":1,"rid":"4244676cbb4a9a7a","ckeKeyWordBury":"eid=100^^tagid=4244676cbb4a9a7a^^pid=20006^^sku=100006079301^^sversion=1000^^token=d4c916443c62aa1d"},{"id":"eeb3d5553c5b4d96","name":"少女感十足","count":10,"type":4,"canBeFiltered":true,"stand":1,"rid":"eeb3d5553c5b4d96","ckeKeyWordBury":"eid=100^^tagid=eeb3d5553c5b4d96^^pid=20006^^sku=100006079301
^^sversion=1000^^token=0cfa91f4619d42cd"},{"id":"751bb69d96ad1a03","name":"不卡脣紋","count":3,"type":4,"canBeFiltered":true,"stand":1,"rid":"751bb69d96ad1a03","ckeKeyWordBury":"eid=100^^tagid=751bb69d96ad1a03^^pid=20006^^sku=100006079301^^sversion=1000^^token=13417f3acd3ee870"},{"id":"2f0435df19147d25","name":"沒有異味","count":2,"type":4,"canBeFiltered":true,"stand":1,"rid":"2f0435df19147d25","ckeKeyWordBury":"eid=100^^tagid=2f0435df19147d25^^pid=20006^^sku=100006079301^^sversion=1000^^token=d7844a0a37a59036"},{"id":"3a1d9b72c8f37e71","name":"烏黑亮麗","count":1,"type":4,"canBeFiltered":true,"stand":1,"rid":"3a1d9b72c8f37e71","ckeKeyWordBury":"eid=100^^tagid=3a1d9b72c8f37e71^^pid=20006^^sku=100006079301^^sversion=1000^^token=1aa00f26771093f3"},{"id":"3a57805849e14dbb","name":"安全可靠","count":1,"type":4,"canBeFiltered":true,"stand":1,"rid":"3a57805849e14dbb","ckeKeyWordBury":"eid=100^^tagid=3a57805849e14dbb^^pid=20006^^sku=100006079301^^sversion=1000^^token=633dd3c0804f145d"},{"id":"d1a61b29c4d11818","name":"分量足夠","count":1,"type":4,"canBeFiltered":true,"stand":1,"rid":"d1a61b29c4d11818","ckeKeyWordBury":"eid=100^^tagid=d1a61b29c4d11818^^pid=20006^^sku=100006079301^^sversion=1000^^token=04b8c858e9da8a54"}],"jwotestProduct":null,"maxPage":100,"testId":"cmt","score":0,"soType":5,"imageListCount":500,"vTagStatistics":null,"csv":"eid=100^^tagid=ALL^^pid=20006^^sku=100010958774^^sversion=1001^^pageSize=11","comments":[{"id":13868985681,"guid":"8bf8cff2-173c-45f7-a2cd-44e5fb295c13","content":"情人節的時候首發搶的,焦急的等待快遞的到來,很滿意,很不錯,不僅送了禮盒,還有面膜的小樣,瞬間感覺熬夜都是值得的,開心。口紅的包裝很有質感,設計的也很有檔次,顏色拿捏的也很細緻,非常不錯,沒有色差。而且保溼效果也很好,不會很乾,看上去很棒。非常的滿意。","vcontent":"情人節的時候首發搶的,焦急的等待快遞的到來,很滿意,很不錯,不僅送了禮盒,還有面膜的小樣,瞬間感覺熬夜都是值得的,開心。口紅的包裝很有質感,設計的也很有檔次,顏色拿捏的也很細緻,非常不錯,沒有色差。而且保溼效果也很好,不會很乾,看上去很棒。非常的滿意。","creationTime":"2020-03-03 
19:26:32","isDelete":false,"isTop":false,"userImageUrl":"misc.360buyimg.com/user/myjd-2015/css/i/peisong.jpg","topped":0,"replyCount":0,"score":5,"imageStatus":1,"title":"","usefulVoteCount":1,"userClient":2,"discussionId":688960275,"imageCount":4,"anonymousFlag":1,"plusAvailable":201,"mobileVersion":"","images":[{"id":1070279057,"imgUrl":"//img30.360buyimg.com/n0/s128x96_jfs/t1/86015/17/13903/221606/5e5e3ee8E7c2135bb/3015e0cc8d1dcf28.jpg","imgTitle":"","status":0},{"id":1070279058,"imgUrl":"//img30.360buyimg.com/n0/s128x96_jfs/t1/90879/28/13906/189795/5e5e3ee7E16c8d3fb/00e1e7f373c4c967.jpg","imgTitle":"","status":0},{"id":1070279059,"imgUrl":"//img30.360buyimg.com/n0/s128x96_jfs/t1/101831/8/13791/230984/5e5e3ee7E290fb10c/53a6be934b8cd090.jpg","imgTitle":"","status":0},{"id":1070279060,"imgUrl":"//img30.360buyimg.com/n0/s128x96_jfs/t1/107793/21/7596/62938/5e5e3ee7E1f6b32ea/88234ceb80ab0c4d.jpg","imgTitle":"","status":0}],"mergeOrderStatus":2,"productColor":"限量款196","productSize":"","textIntegral":40,"imageIntegral":40,"status":1,"referenceId":"100010958774","referenceTime":"2020-02-11 09:49:01","nickname":"j***0","replyCount2":0,"userImage":"misc.360buyimg.com/user/myjd-2015/css/i/peisong.jpg","orderId":0,"integral":80,"productSales":"[]","referenceImage":"jfs/t1/148778/34/16762/154478/5fc9e554E45dd107a/fa5f882090cf848e.jpg","referenceName":"蘭蔻(LANCOME)口紅196 3.4g 菁純絲絨霧面啞光脣膏 化妝品禮盒 胡蘿蔔色","firstCategory":1316,"secondCategory":1387,"thirdCategory":1425,"aesPin":null,"days":21,"afterDays":0}]});
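The response above is JSONP: a JSON document wrapped in a call to `fetchJSON_comment98(...)`. To work with the reviews, one approach is to strip the wrapper and hand the rest to the `json` module, as sketched below. The `sample` string is a heavily shortened, hand-made stand-in with the same shape as the real output; the field names come from that output, but the review text in it is invented. `parse_jsonp` is a helper name chosen for this example.

```python
import json


def parse_jsonp(text: str) -> dict:
    """Strip the `callback( ... );` wrapper and parse the JSON inside it."""
    start = text.index("(") + 1
    end = text.rindex(")")
    return json.loads(text[start:end])


# Shortened, hand-made sample in the same shape as the real response.
sample = (
    'fetchJSON_comment98({"maxPage":100,"comments":'
    '[{"nickname":"j***0","score":5,"content":"very satisfied"}]});'
)

data = parse_jsonp(sample)
for comment in data["comments"]:
    print(comment["nickname"], comment["score"], comment["content"])
```

With the real response you would pass `resp.text` to `parse_jsonp` in the same way.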