The simplest first step: download the HTML source of a web page.
from urllib.request import Request, urlopen

# If the site blocks scrapers, add a request header that imitates a browser
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
# Address of the site to scrape
url = "https://www.zhihu.com/"
# Timeout in seconds, so an unreachable or slow URL does not waste time
req_timeout = 5
req = Request(url=url, headers=headers)
f = urlopen(req, None, req_timeout)
s = f.read()
s = s.decode('utf-8')  # decode the bytes so Chinese text in the page is not garbled
ss = str(s)
print(ss)
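The timeout only saves time if the script also survives the exception that a failed request raises. Below is a minimal sketch of the same request wrapped in error handling. The function name `fetch_html`, the `HEADERS` constant, and the choice to return `None` on failure are illustrative assumptions, not part of the original script.

```python
import socket
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

# Same browser-like header as in the script above.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) "
                  "Gecko/20091201 Firefox/3.5.6"
}


def fetch_html(url, timeout=5):
    """Return the page text, or None if the request fails (hypothetical helper)."""
    req = Request(url=url, headers=HEADERS)
    try:
        # The with-block closes the connection even if read() fails.
        with urlopen(req, timeout=timeout) as resp:
            # Use the charset declared by the server, falling back to UTF-8.
            charset = resp.headers.get_content_charset() or "utf-8"
            return resp.read().decode(charset, errors="replace")
    except HTTPError as err:
        # The server answered, but with an error status such as 403 or 404.
        print("HTTP error:", err.code)
    except URLError as err:
        # DNS failure, refused connection, unsupported scheme, connect timeout.
        print("Could not reach the server:", err.reason)
    except socket.timeout:
        # The connection was made but reading the body took too long.
        print("Timed out while reading the response")
    return None
```

Calling `fetch_html("https://www.zhihu.com/")` would then give either the page text or `None`, so the caller can skip a bad URL instead of crashing.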
Problems encountered:
1. Most sites have anti-scraping measures, so we need to add this line:
headers = {'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
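This adds a request header that imitates a browser. Without it, urllib identifies itself as Python-urllib, which many sites reject. I tested this myself: both Baidu and Zhihu returned their HTML source with this header.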
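To see what the header actually changes, the request can be inspected before anything is sent. This is a small sketch that makes no network call; the variable names `plain`, `disguised`, and `default_agent` are made up for illustration.

```python
from urllib.request import Request, build_opener

url = "https://www.zhihu.com/"
browser_headers = {
    "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) "
                  "Gecko/20091201 Firefox/3.5.6"
}

plain = Request(url)                               # no custom headers
disguised = Request(url, headers=browser_headers)  # browser-like header

# The default opener adds its own User-Agent when the request has none.
default_agent = dict(build_opener().addheaders)["User-agent"]
print(default_agent)  # starts with "Python-urllib/"

# Request stores header names capitalized, so the lookup key is "User-agent".
print(plain.get_header("User-agent"))      # None: nothing set on the request
print(disguised.get_header("User-agent"))  # the Mozilla/5.0 string above
```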
2. The scraped source contains garbled characters
s=s.decode('utf-8')
read() returns raw bytes; decoding them with the page's encoding (UTF-8 here) fixes the problem.
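Hard-coding 'utf-8' works for Zhihu, but a page served in another encoding such as GBK would fail to decode. The sketch below shows one way to be more tolerant; the helper `decode_body` and its parameters are assumptions for illustration. With a real response, the declared charset could come from `f.headers.get_content_charset()`.

```python
def decode_body(raw, declared_charset=None, fallback="utf-8"):
    """Decode response bytes, preferring the charset the server declared."""
    charset = declared_charset or fallback
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name or wrong charset: keep going and mark the
        # undecodable bytes with U+FFFD instead of raising.
        return raw.decode(fallback, errors="replace")


utf8_text = decode_body("知乎".encode("utf-8"))
gbk_text = decode_body("知乎".encode("gbk"), declared_charset="gbk")
damaged = decode_body(b"\xff\xfe", declared_charset="no-such-codec")
print(utf8_text, gbk_text, repr(damaged))
```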
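3. The output needs to be of type str
Convert it to str. (In Python 3, decode() already returns a str, so the str() call leaves the text unchanged.)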
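The difference between bytes and str is easy to check without any network access. The snippet below is only an illustration with made-up variable names; it shows why the conversion has to go through decode() rather than calling str() on the raw bytes.

```python
raw = "知乎".encode("utf-8")   # what read() returns: a bytes object
text = raw.decode("utf-8")     # decode() turns it into a str

print(type(raw).__name__)      # bytes
print(type(text).__name__)     # str

# str() on a str gives back the same text, so it is harmless after decode().
same = str(text)

# str() on undecoded bytes does NOT decode them: it yields the printable
# representation, starting with b' and full of \x escapes.
wrong = str(raw)
print(wrong)
```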
Output of the code above (the source of the Zhihu home page):
<!DOCTYPE html> <html lang="zh-CN" class=""> <head> <meta charset="utf-8"> <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" /> <meta http-equiv="X-ZA-Response-Id" content="1b244bb1a32b4315"> <meta http-equiv="X-ZA-Experiment" content="default:None,ge3:ge3_9,ge2:ge2_1,nweb_sticky_sidebar:sticky,live_review_buy_bar:live_review_buy_bar_2,is_office:false,home_ui2:default,is_show_unicom_free_entry:unicom_free_entry_off,app_store_rate_dialog:close,qa_sticky_sidebar:sticky_sidebar,android_profile_panel:panel_b,live_store:ls_a2_b2_c1_f2,search_hybrid_tabs:without-tabs,answer_related_readings:qa_recommend_with_ads_and_article,asdfadsf:asdfad,new_mobile_column_appheader:new_header,fav_act:default,remix_one_key_play_button:headerButton,mobile_qa_page_proxy_heifetz:m_qa_page_nweb,nweb_write_answer:default,android_pass_through_push:getui,new_more:new,new_buy_bar:livenewbuy3,zcm-lighting:zcm,iOS_newest_version:4.2.0,qrcode_login:qrcode,wechat_share_modal:wechat_share_modal_show"> <meta name="renderer" content="webkit" /> <meta name="description" content="中文網際網路最大的知識平臺,幫助人們便捷地分享彼此的知識、經驗和見解。"/> <meta name="viewport" content="user-scalable=no, width=device-width, initial-scale=1.0, maximum-scale=1.0"/> <title>知乎 - 發現更大的世界</title> <link rel="apple-touch-icon" href="https://static.zhihu.com/static/revved/img/ios/touch-icon-152.87c020b9.png" sizes="152x152"> <link rel="apple-touch-icon" href="https://static.zhihu.com/static/revved/img/ios/touch-icon-120.496c913b.png" sizes="120x120"> <link rel="apple-touch-icon" href="https://static.zhihu.com/static/revved/img/ios/touch-icon-76.dcf79352.png" sizes="76x76"> <link rel="apple-touch-icon" href="https://static.zhihu.com/static/revved/img/ios/touch-icon-60.9911cffb.png" sizes="60x60"> <link rel="shortcut icon" href="https://static.zhihu.com/static/favicon.ico" type="image/x-icon" /> <link rel="dns-prefetch" href="p1.zhimg.com"/> <link rel="dns-prefetch" href="p2.zhimg.com"/> <link rel="dns-prefetch" href="p3.zhimg.com"/> <link 
rel="dns-prefetch" href="p4.zhimg.com"/> <link rel="dns-prefetch" href="comet.zhihu.com"/> <link rel="dns-prefetch" href="static.zhihu.com"/> <link rel="dns-prefetch" href="upload.zhihu.com"/> <link rel="stylesheet" href="https://static.zhihu.com/static/revved/-/css/pages/unlogin-index/main.f214513a.css"> <meta name="google-site-verification" content="FTeR0c8arOPKh8c5DYh_9uu98_zJbaWw53J-Sch9MTg" /> <meta name="baidu-site-verification" content="KPFppAFoYF4Kkdv9" /> <meta property="qc:admins" content="00544670776201056375" /> <link rel="canonical" href="http://www.zhihu.com" /> <meta id="znonce" name="znonce" content="d5e581328572473aad8501685dae174f"> <!--[if lt IE 9]> <script src="https://static.zhihu.com/static/components/respond/dest/respond.min.js"></script> <link href="https://static.zhihu.com/static/components/respond/cross-domain/respond-proxy.html" id="respond-proxy" rel="respond-proxy" /> <link href="/static/components/respond/cross-domain/respond.proxy.gif" id="respond-redirect" rel="respond-redirect" /> <script src="/static/components/respond/cross-domain/respond.proxy.js"></script> <![endif]--> <script src="https://static.zhihu.com/static/revved/-/js/instant.14757a4a.js"></script> </head> <body class="zhi "> <div class="index-main"> <div class="index-main-body"> <div class="index-header"> <h1 class="logo hide-text">知乎</h1> <h2 class="subtitle">與世界分享你的知識、經驗和見解</h2> </div> <div class="desk-front sign-flow sign-flow clearfix sign-flow-simple"> <div class="index-tab-navs"> <div class="navs-slider"> <a href="#signup" class="active">註冊</a> <a href="#signin">登入</a> <span class="navs-slider-bar"></span> </div> </div> <div class="view view-signin" data-za-module="SignInForm"> <form method="POST"> <input type="hidden" name="_xsrf" value="060f3dedb5b35e2ac6fe354bed716d04"/> <div class="group-inputs"> <div class="account input-wrapper"> <input type="text" name="account" aria-label="手機號或郵箱" placeholder="手機號或郵箱" required> </div> <div class="verification 
input-wrapper"> <input type="password" name="password" aria-label="密碼" placeholder="密碼" required /><button type="button" class="send-code-button">獲取驗證碼</button> </div> <div class="Captcha input-wrapper" data-type="cn" data-za-module="Captcha"> <div class="Captcha-operate"> <input type="hidden" name="captcha" required data-rule-required="true" data-msg-required="請點選圖中所有倒立的文字"> <input type="hidden" name="captcha_type" value="cn" required> <label class="Captcha-prompt">請點選圖中所有倒立的文字</label> <span class="Captcha-refresh js-refreshCaptcha sprite-index-icon-refresh"></span> </div> <div class="Captcha-imageConatiner"> <img class="Captcha-image" alt="驗證碼" > </div> </div> </div> <div class="button-wrapper command"> <button class="sign-button submit" type="submit">登入</button> </div> <div class="signin-misc-wrapper clearfix"> <button type="button" class="signin-switch-button">手機驗證碼登入</button> <a class="unable-login" href="#">無法登入?</a> </div> <div class="other-signup-wrapper" data-za-module="SNSSignIn"> <span class="name signin-switch-qrcode-buttons">二維碼登入</span> <span class="signup-footer-separate signup-footer-se"> · </span> <span class="name signup-social-buttons js-toggle-sns-buttons">社交帳號登入</span> <div class="sns-buttons"> <a title="微信登入" class="js-bindwechat" href="#"><i class="sprite-index-icon-wechat"></i></a> <a title="微博登入" class="js-bindweibo" href="#"><i class="sprite-index-icon-weibo"></i></a> <a title="QQ 登入" class="js-bindqq" href="#"><i class="sprite-index-icon-qq"></i></a> </div> </div> </form> <div class="qrcode-signin-container"> <div class="qrcode-signin-step1"> <div class="qrcode-signin-img-wrapper"> <img src="/static/img/spinner/grey-loading.gif" class="qrcode-signin-loading"/> </div> <p>開啟最新 <a href="https://www.zhihu.com/app/" target="_blank">知乎 App</a></p> <p>在「更多」頁面右上角開啟掃一掃</p> <div class="qrcode-signin-cut-button"> <span class="signin-switch-password">使用密碼登入</span> </div> </div> <div class="qrcode-signin-step2"> <div 
class="qrcode-signin-scan-status"></div> <p class="qrcode-signin-scan-tips">掃描成功</p> <p>請在手機上「確認登入」</p> <div class="qrcode-signin-cut-button"> <span class="qrcode-goto-scan">返回二維碼</span> </div> </div> <div class="qrcode-signin-failure"> <div class="qrcode-signin-failure-icon"></div> <p class="qrcode-signin-failure-message"></p> <div class="qrcode-signin-cut-button"> <span class="signin-switch-password">使用密碼登入</span> </div> </div> <div class="qrcode-signin-guide"></div> </div> <div class="QRCode"> <button class="QRCode-toggleButton"> <span class="sprite-global-icon-qrcode"></span> <span class="QRCode-toggleButtonText ">下載知乎 App</span> </button> <div class="QRCode-card"> <div class="QRCode-image"></div> <div class="sprite-index-icon-arrow"></div> </div> </div> </div> <div class="view view-signup selected" data-za-module="SignUpForm"> <form class="zu-side-login-box" action="/register/email" id="sign-form-1" autocomplete="off" method="POST"> <input type="password" hidden> <input type="hidden" name="_xsrf" value="060f3dedb5b35e2ac6fe354bed716d04"/> <div class="group-inputs"> <div class="name input-wrapper"> <input required type="text" name="fullname" aria-label="姓名" placeholder="姓名"> </div> <div class="email input-wrapper"> <input required type="text" class="account" name="phone_num" aria-label="手機號" placeholder="手機號"> </div> <div class="input-wrapper"> <input required type="password" name="password" aria-label="密碼" placeholder="密碼(不少於 6 位)" autocomplete="off"> </div> <div class="Captcha input-wrapper" data-type="cn" data-za-module="Captcha"> <div class="Captcha-operate"> <input type="hidden" name="captcha" required data-rule-required="true" data-msg-required="請點選圖中所有倒立的文字"> <input type="hidden" name="captcha_type" value="cn" required> <label class="Captcha-prompt">請點選圖中所有倒立的文字</label> <span class="Captcha-refresh js-refreshCaptcha sprite-index-icon-refresh"></span> </div> <div class="Captcha-imageConatiner"> <img class="Captcha-image" alt="驗證碼" > </div> </div> </div> 
<div class="button-wrapper command"> <button class="sign-button submit" type="submit">註冊知乎</button> </div> </form> <p class="agreement-tip">點選「註冊」按鈕,即代表你同意<a href="/terms" target="_blank">《知乎協議》</a></p> <a class="signup-entry--org" href="/org/signup">序號產生器構號</a> <div class="QRCode"> <button class="QRCode-toggleButton"> <span class="sprite-global-icon-qrcode"></span> <span class="QRCode-toggleButtonText ">下載知乎 App</span> </button> <div class="QRCode-card"> <div class="QRCode-image"></div> <div class="sprite-index-icon-arrow"></div> </div> </div> </div> </div> </div> </div> <div class="footer"> <a target="_blank" href="https://zhuanlan.zhihu.com">知乎專欄</a> <span class="dot">·</span> <a target="_blank" href="/roundtable">知乎圓桌</a> <span class="dot">·</span> <a target="_blank" href="/explore" data-za-c="explore" data-za-a="visit_explore" data-za-l="home_bottom_explore">發現</a> <span class="dot">·</span> <a target="_blank" href="/app">移動應用</a> <span class="dot">·</span> <a href="/contact" class="footer-mobile-show">聯絡我們</a> <span class="dot">·</span> <a target="_blank" href="/careers">來知乎工作</a> <br /> <span>© 2017 知乎</span> <span class="dot">·</span> <a href="http://www.miibeian.gov.cn/" target="_blank">京 ICP 證 110745 號</a> <span class="dot">·</span> <span>京公網安備 11010802010035 號</span> <span class="dot">·</span> <a href="http://zhstatic.zhihu.com/assets/zhihu/publish-license.jpg" target="_blank">出版物經營許可證</a> <br /> <a target="_blank" href="https://zhuanlan.zhihu.com/p/28852607">侵權舉報</a> <span class="dot">·</span> <a target="_blank" href="http://www.12377.cn">網上有害資訊舉報專區</a> <span class="dot">·</span> <a target="_blank" href="/jubao">兒童色情資訊舉報專區</a> <span class="dot">·</span> <span>違法和不良資訊舉報:010-82716601</span> <div class="chengxing"> <a id='___szfw_logo___' href='https://credit.szfw.org/CX20170607038331320388.html' target='_blank'> <img src="https://static.zhihu.com/static/revved/img/index/chengxing_logo@2x.65dc76e8.png" border='0' /> </a> <script 
type='text/javascript'>(function(){document.getElementById('___szfw_logo___').oncontextmenu = function(){return false;}})();</script> </div> </div> <script type="text/json" class="json-inline" data-name="disabled_components">["back_to_top"]</script> <script type="text/json" class="json-inline" data-name="current_user">["","","","-1","",0,0]</script> <script type="text/json" class="json-inline" data-name="env">["zhihu.com","comet.zhihu.com",false,null,false,false]</script> <script type="text/json" class="json-inline" data-name="ga_vars">{"user_created":0,"now":1509713487000,"abtest_mask":"------------------------------","user_attr":[0,0,0,"-","-"],"user_hash":0}</script> <script src="https://static.zhihu.com/static/revved/-/js/vendor.cb14a042.js"></script> <script src="https://static.zhihu.com/static/revved/-/js/closure/base.41bb3b24.js"></script> <script src="https://static.zhihu.com/static/revved/-/js/closure/common.ef6c9c27.js"></script> <script src="https://static.zhihu.com/static/revved/-/js/closure/page-index.f17f3a40.js"></script> <meta name="entry" content="ZH.entrySignPage" data-module-id="page-index"> <input type="hidden" name="_xsrf" value="060f3dedb5b35e2ac6fe354bed716d04"/> </body> </html>