Python 3 Web Crawler Primer (Part 1)

Posted by god_mellon on 2020-12-05

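Python 3 Web Crawler Primer

A web crawler, also called a web spider, fetches the content of a web page from its address (URL). The URL is simply the site link we type into the browser.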

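Enter the URL in the browser's address bar, right-click on the page, and choose Inspect. (The name differs between browsers: Chrome calls it Inspect and Firefox calls it Inspect Element, but they do the same thing.)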

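  • Most sites publish a crawler protocol (for example https://www.baidu.com/robots.txt, which spells out what is allowed and what is not)
  • If you can see it, you can crawl it (technically)
  • Legality: crawling can fall into a gray area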
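
The crawler-protocol point above can also be checked in code. What follows is a minimal sketch using the standard library's urllib.robotparser. The robots.txt rules, the "my-crawler" user agent, and the example.com URLs are all made up for illustration and are not Baidu's actual rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


public_ok = is_allowed(SAMPLE_ROBOTS_TXT, "my-crawler", "https://example.com/index.html")
private_ok = is_allowed(SAMPLE_ROBOTS_TXT, "my-crawler", "https://example.com/private/data.html")
print(public_ok, private_ok)
```

For a live site, the same parser can load the real file with set_url() followed by read() instead of parse().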
I. URL: the more formal name is Uniform Resource Locator. Its general format is as follows (parts in square brackets [] are optional):
protocol :// hostname[:port] / path / [;parameters][?query]#fragment

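It consists mainly of the first three parts:

protocol: the first part is the protocol; for example, Google uses HTTPS.

hostname[:port]: the second part is the host name (the port number is optional). Sites served over HTTP default to port 80, and sites served over HTTPS default to port 443.
path: the third part is the location of the resource on the host, such as the directory and file name; this is what we usually call the path. It matters because different paths request different resources from the server. For example, the paths of these two pairs of slippers on JD.com are

100006079301.html and 100003887822.html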
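
To make the parts of a URL concrete, here is a small sketch that splits one with the standard library's urllib.parse. The host name item.jd.com is an assumption (the text only gives the path), and the second URL is invented purely to exercise the optional parts.

```python
from urllib.parse import urlparse

# Default ports for the two protocols discussed in the text.
DEFAULT_PORTS = {"http": 80, "https": 443}


def describe_url(url):
    """Split a URL into the components named in the general format."""
    parts = urlparse(url)
    return {
        "protocol": parts.scheme,
        "hostname": parts.hostname,
        # Fall back to the protocol's default port when none is written.
        "port": parts.port or DEFAULT_PORTS.get(parts.scheme),
        "path": parts.path,
        "parameters": parts.params,
        "query": parts.query,
        "fragment": parts.fragment,
    }


# Assumed host name; only the path comes from the article.
product = describe_url("https://item.jd.com/100006079301.html")
# Invented URL that uses every optional part.
full = describe_url("http://example.com:8080/a/b;v=1?x=2#top")
print(product)
print(full)
```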

II. The first step of a web crawler is to fetch a page's HTML from its URL. In Python 3, you can do this with either urllib.request or requests.
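
For the urllib.request route, a minimal sketch might look like the following. The helper name fetch_html, the 10-second timeout, and the UTF-8 fallback are assumptions made for illustration, not something the article prescribes.

```python
import urllib.request


def fetch_html(url, timeout=10):
    """Fetch url with urllib.request and return the response body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        raw = resp.read()
        # Use the charset declared in the response headers, else assume UTF-8.
        charset = resp.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")
```

Calling fetch_html("http://www.baidu.com/") would return the page source as a string, provided the site is reachable from your network.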

1. The requests module

- Install: pip3 install requests. urllib is built into Python 3 (urllib2 is its Python 2 predecessor); requests is a third-party package built on top of urllib3.

# **** Basic usage ****
# Import the module
import requests

# Send a GET request; it returns a response object
resp = requests.get('https://www.baidu.com')

# The content of the response
print(resp.text)

# Save the page to a local file
with open('a.html', 'w', encoding='utf-8') as f:
    f.write(resp.text)

# The status code of the response
print(resp.status_code)
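
The status code printed at the end is just an integer, so it helps to turn it into something readable before deciding whether to parse the body. Here is a small sketch using the standard library's http.HTTPStatus. The helper names describe_status and is_success are hypothetical.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short label such as '200 OK' for a numeric status code."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Codes the standard library does not know about.
        phrase = "Unknown"
    return f"{code} {phrase}"


def is_success(code):
    """True for 2xx codes, the range that signals a successful request."""
    return 200 <= code < 300


for code in (200, 403, 404):
    print(describe_status(code), is_success(code))
```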

III. Simple examples

1. First, let's look at the requests.get() method, which sends a GET request to the server.

import requests

if __name__ == '__main__':
    url = "http://www.baidu.com/"
    req = requests.get(url=url)
    # Decode the body as UTF-8 so the Chinese text prints correctly
    req.encoding = 'utf-8'
    print(req.text)

Output:

<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>百度一下,你就知道</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=百度一下 class="bg s_btn"></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>新聞</a> <a href=http://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>地圖</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>視訊</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>貼吧</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>登入</a> </noscript> <script>document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" 
: "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">登入</a>');</script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">更多產品</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>關於百度</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>使用百度前必讀</a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp-feedback>意見反饋</a>&nbsp;京ICP證030173號&nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>

Paste the result into a text file and save it as 1.html; opening that file shows the page we crawled.
2. Crawling the reviews of a JD.com product

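Open the product page and find the content we want.
Right-click > Inspect > Network > click the product reviews tab > search for part of a review's text to find the matching request.
This gives the URL of the matching request.

Start writing the code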

# -*- coding: utf-8 -*-
# Crawl reviews of a best-selling lipstick on JD.com
# (the goal is 10 pages; this request fetches page 0 only)
import requests

# JD.com checks the User-Agent, so send a browser-like one
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
}
resp = requests.get('https://club.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98&productId=100006079301&score=0&sortType=5&page=0&pageSize=10&isShadowSku=0&fold=1', headers=headers)

sp = resp.text
print(sp)

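Note that JD.com applies anti-crawling checks on the User-Agent, so the request sends a browser-like User-Agent header.
The response is too long to paste in full, so we fetch only one page here.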
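
The request URL in the code carries page=0 and pageSize=10 in its query string, so fetching more pages comes down to changing the page value. The following sketch only builds the URLs and sends nothing. The helper name build_page_urls is hypothetical, and whether the endpoint still accepts these parameters has not been verified.

```python
from urllib.parse import urlencode

# Endpoint taken from the request URL used in this article.
BASE_URL = "https://club.jd.com/comment/productPageComments.action"


def build_page_urls(product_id, pages=10, page_size=10):
    """Build one review URL per page, numbering pages from 0."""
    urls = []
    for page in range(pages):
        # Same parameters, in the same order, as the article's URL.
        params = {
            "callback": "fetchJSON_comment98",
            "productId": product_id,
            "score": 0,
            "sortType": 5,
            "page": page,
            "pageSize": page_size,
            "isShadowSku": 0,
            "fold": 1,
        }
        urls.append(f"{BASE_URL}?{urlencode(params)}")
    return urls


urls = build_page_urls(100006079301)
print(urls[0])
print(urls[-1])
```

Each URL could then be passed to requests.get() with the same headers, ideally with a short pause between requests so the crawl stays polite.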
Output of the single-page request:

fetchJSON_comment98({"productAttr":null,"productCommentSummary":{"skuId":100006079301,"averageScore":5,"defaultGoodCount":226069,"defaultGoodCountStr":"22萬+","commentCount":267810,"commentCountStr":"26萬+","goodCount":40390,"goodCountStr":"4萬+","goodRate":0.96,"goodRateShow":96,"generalCount":539,"generalCountStr":"500+","generalRate":0.012,"generalRateShow":1,"poorCount":812,"poorCountStr":"800+","poorRate":0.028,"poorRateShow":3,"videoCount":315,"videoCountStr":"300+","afterCount":538,"afterCountStr":"500+","showCount":6217,"showCountStr":"6200+","oneYear":0,"sensitiveBook":0,"fixCount":0,"plusCount":0,"plusCountStr":"0","buyerShow":0,"poorRateStyle":4,"generalRateStyle":2,"goodRateStyle":144,"installRate":0,"productId":100006079301,"score1Count":812,"score2Count":174,"score3Count":365,"score4Count":427,"score5Count":39963},"hotCommentTagStatistics":[{"id":"51197de19f0ba763","name":"高階大氣","count":83,"type":4,"canBeFiltered":true,"stand":1,"rid":"51197de19f0ba763","ckeKeyWordBury":"eid=100^^tagid=51197de19f0ba763^^pid=20006^^sku=100006079301^^sversion=1000^^token=cf387ae8a15eca59"},{"id":"8d5387624e70bbd2","name":"質地細膩","count":21,"type":4,"canBeFiltered":true,"stand":1,"rid":"8d5387624e70bbd2","ckeKeyWordBury":"eid=100^^tagid=8d5387624e70bbd2^^pid=20006^^sku=100006079301^^sversion=1000^^token=4060dfd99e9f9368"},{"id":"13bacd43c59bb7d8","name":"質感極佳","count":19,"type":4,"canBeFiltered":true,"stand":1,"rid":"13bacd43c59bb7d8","ckeKeyWordBury":"eid=100^^tagid=13bacd43c59bb7d8^^pid=20006^^sku=100006079301^^sversion=1000^^token=22259161e6f06a4a"},{"id":"4244676cbb4a9a7a","name":"質地柔軟舒適","count":18,"type":4,"canBeFiltered":true,"stand":1,"rid":"4244676cbb4a9a7a","ckeKeyWordBury":"eid=100^^tagid=4244676cbb4a9a7a^^pid=20006^^sku=100006079301^^sversion=1000^^token=d4c916443c62aa1d"},{"id":"eeb3d5553c5b4d96","name":"少女感十足","count":10,"type":4,"canBeFiltered":true,"stand":1,"rid":"eeb3d5553c5b4d96","ckeKeyWordBury":"eid=100^^tagid=eeb3d5553c5b4d96^^pid=20006^^sku=100006079301
^^sversion=1000^^token=0cfa91f4619d42cd"},{"id":"751bb69d96ad1a03","name":"不卡脣紋","count":3,"type":4,"canBeFiltered":true,"stand":1,"rid":"751bb69d96ad1a03","ckeKeyWordBury":"eid=100^^tagid=751bb69d96ad1a03^^pid=20006^^sku=100006079301^^sversion=1000^^token=13417f3acd3ee870"},{"id":"2f0435df19147d25","name":"沒有異味","count":2,"type":4,"canBeFiltered":true,"stand":1,"rid":"2f0435df19147d25","ckeKeyWordBury":"eid=100^^tagid=2f0435df19147d25^^pid=20006^^sku=100006079301^^sversion=1000^^token=d7844a0a37a59036"},{"id":"3a1d9b72c8f37e71","name":"烏黑亮麗","count":1,"type":4,"canBeFiltered":true,"stand":1,"rid":"3a1d9b72c8f37e71","ckeKeyWordBury":"eid=100^^tagid=3a1d9b72c8f37e71^^pid=20006^^sku=100006079301^^sversion=1000^^token=1aa00f26771093f3"},{"id":"3a57805849e14dbb","name":"安全可靠","count":1,"type":4,"canBeFiltered":true,"stand":1,"rid":"3a57805849e14dbb","ckeKeyWordBury":"eid=100^^tagid=3a57805849e14dbb^^pid=20006^^sku=100006079301^^sversion=1000^^token=633dd3c0804f145d"},{"id":"d1a61b29c4d11818","name":"分量足夠","count":1,"type":4,"canBeFiltered":true,"stand":1,"rid":"d1a61b29c4d11818","ckeKeyWordBury":"eid=100^^tagid=d1a61b29c4d11818^^pid=20006^^sku=100006079301^^sversion=1000^^token=04b8c858e9da8a54"}],"jwotestProduct":null,"maxPage":100,"testId":"cmt","score":0,"soType":5,"imageListCount":500,"vTagStatistics":null,"csv":"eid=100^^tagid=ALL^^pid=20006^^sku=100010958774^^sversion=1001^^pageSize=11","comments":[{"id":13868985681,"guid":"8bf8cff2-173c-45f7-a2cd-44e5fb295c13","content":"情人節的時候首發搶的,焦急的等待快遞的到來,很滿意,很不錯,不僅送了禮盒,還有面膜的小樣,瞬間感覺熬夜都是值得的,開心。口紅的包裝很有質感,設計的也很有檔次,顏色拿捏的也很細緻,非常不錯,沒有色差。而且保溼效果也很好,不會很乾,看上去很棒。非常的滿意。","vcontent":"情人節的時候首發搶的,焦急的等待快遞的到來,很滿意,很不錯,不僅送了禮盒,還有面膜的小樣,瞬間感覺熬夜都是值得的,開心。口紅的包裝很有質感,設計的也很有檔次,顏色拿捏的也很細緻,非常不錯,沒有色差。而且保溼效果也很好,不會很乾,看上去很棒。非常的滿意。","creationTime":"2020-03-03 
19:26:32","isDelete":false,"isTop":false,"userImageUrl":"misc.360buyimg.com/user/myjd-2015/css/i/peisong.jpg","topped":0,"replyCount":0,"score":5,"imageStatus":1,"title":"","usefulVoteCount":1,"userClient":2,"discussionId":688960275,"imageCount":4,"anonymousFlag":1,"plusAvailable":201,"mobileVersion":"","images":[{"id":1070279057,"imgUrl":"//img30.360buyimg.com/n0/s128x96_jfs/t1/86015/17/13903/221606/5e5e3ee8E7c2135bb/3015e0cc8d1dcf28.jpg","imgTitle":"","status":0},{"id":1070279058,"imgUrl":"//img30.360buyimg.com/n0/s128x96_jfs/t1/90879/28/13906/189795/5e5e3ee7E16c8d3fb/00e1e7f373c4c967.jpg","imgTitle":"","status":0},{"id":1070279059,"imgUrl":"//img30.360buyimg.com/n0/s128x96_jfs/t1/101831/8/13791/230984/5e5e3ee7E290fb10c/53a6be934b8cd090.jpg","imgTitle":"","status":0},{"id":1070279060,"imgUrl":"//img30.360buyimg.com/n0/s128x96_jfs/t1/107793/21/7596/62938/5e5e3ee7E1f6b32ea/88234ceb80ab0c4d.jpg","imgTitle":"","status":0}],"mergeOrderStatus":2,"productColor":"限量款196","productSize":"","textIntegral":40,"imageIntegral":40,"status":1,"referenceId":"100010958774","referenceTime":"2020-02-11 09:49:01","nickname":"j***0","replyCount2":0,"userImage":"misc.360buyimg.com/user/myjd-2015/css/i/peisong.jpg","orderId":0,"integral":80,"productSales":"[]","referenceImage":"jfs/t1/148778/34/16762/154478/5fc9e554E45dd107a/fa5f882090cf848e.jpg","referenceName":"蘭蔻(LANCOME)口紅196 3.4g 菁純絲絨霧面啞光脣膏 化妝品禮盒 胡蘿蔔色","firstCategory":1316,"secondCategory":1387,"thirdCategory":1425,"aesPin":null,"days":21,"afterDays":0}]});
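
The output above is JSON wrapped in a fetchJSON_comment98(...) callback, so it needs unwrapping before it can be used as data. Below is a minimal sketch of that step. The helper names are hypothetical, and SAMPLE is a shortened, made-up payload that reuses only field names visible in the real output.

```python
import json


def parse_jsonp(text):
    """Strip the callback wrapper 'name( ... );' and parse the JSON inside."""
    start = text.index("(") + 1   # just after the first opening parenthesis
    end = text.rindex(")")        # the last closing parenthesis
    return json.loads(text[start:end])


def extract_comments(payload):
    """Keep a few fields from each entry of the 'comments' list."""
    return [
        {
            "nickname": c.get("nickname"),
            "score": c.get("score"),
            "content": c.get("content"),
            "creationTime": c.get("creationTime"),
        }
        for c in payload.get("comments", [])
    ]


# Made-up sample shaped like the real response, for illustration only.
SAMPLE = (
    'fetchJSON_comment98({"maxPage":100,"comments":[{"nickname":"j***0",'
    '"score":5,"content":"sample review text (with parentheses)",'
    '"creationTime":"2020-03-03 19:26:32"}]});'
)

payload = parse_jsonp(SAMPLE)
comments = extract_comments(payload)
print(payload["maxPage"], comments)
```

In the crawler itself, the same functions would be applied to resp.text rather than to SAMPLE.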
