My Web Crawler Notes (1)

Posted by weixin_34249678 on 2017-11-03

The simplest first step: fetch a web page's HTML source.

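from urllib.request import Request, urlopen

# If the site blocks crawlers, send a browser-like User-Agent request header
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
# Address of the site to crawl
url = "https://www.zhihu.com/"
# Timeout in seconds, so an unreachable or slow URL does not waste time
req_timeout = 5
req = Request(url=url, headers=headers)
f = urlopen(req, None, req_timeout)
s = f.read()  # read() returns the response body as bytes
s = s.decode('utf-8')  # decode the bytes so Chinese text on the page is not garbled
ss = str(s)  # s is already a str after decode(), so this leaves it unchanged
print(ss)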
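The timeout comment above hints at failures the script does not handle. As a rough sketch of one way to deal with them, the same request can be wrapped in a function that catches the errors urllib raises. The name `fetch_html` and the choice to return `None` on failure are my own assumptions, not part of the original notes.

```python
import socket
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

# Same browser-like User-Agent as in the notes
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}


def fetch_html(url, timeout=5):
    """Return the page at `url` decoded as UTF-8, or None if the request fails.

    fetch_html is a hypothetical helper name used only for this sketch.
    """
    req = Request(url=url, headers=HEADERS)
    try:
        with urlopen(req, None, timeout) as f:
            return f.read().decode('utf-8')
    except HTTPError as e:
        # The server answered, but with an error status such as 403 or 404.
        # HTTPError is a subclass of URLError, so it must be caught first.
        print('HTTP error:', e.code)
    except URLError as e:
        # The resource could not be reached (DNS failure, refused connection, ...)
        print('Could not reach the URL:', e.reason)
    except socket.timeout:
        # Reading the response took longer than `timeout` seconds
        print('Timed out')
    return None
```

Calling `fetch_html("https://www.zhihu.com/")` would then return either the page source or `None`, so the caller can skip a bad URL instead of crashing.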

Problems I ran into:

1. Most sites have anti-crawling measures, so we need to add this line:

headers = {'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}

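This adds a browser-like User-Agent request header, which gets around the problem. I tested it myself: the HTML of both Baidu and Zhihu can be fetched this way.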
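To see what the header changes, here is a small sketch that builds two `Request` objects without sending anything over the network. The variable names `plain` and `disguised` are made up for illustration.

```python
from urllib.request import Request

url = "https://www.zhihu.com/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}

# Without headers, urlopen() later fills in urllib's default
# "Python-urllib/<version>" User-Agent, which is easy for a site to reject.
plain = Request(url=url)
# With headers, the request carries the browser-like value instead.
disguised = Request(url=url, headers=headers)

# Request stores header names capitalized, so the key becomes "User-agent"
print(plain.get_header('User-agent'))      # None: nothing was set explicitly
print(disguised.get_header('User-agent'))  # the Mozilla/5.0 ... string
```

A User-Agent header only helps against checks on that one header. Sites that rely on cookies, logins, or captchas need more than this.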

2. The fetched source contains garbled characters

s=s.decode('utf-8')

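This call fixes it: read() returns raw bytes, and decoding them as UTF-8 (the charset the page declares) turns them into readable text.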
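The effect is easy to reproduce offline. The sketch below uses sample bytes in place of a real response, and `decode_body` is a hypothetical helper name. With a real response `f`, the charset declared in the Content-Type header can be read with `f.headers.get_content_charset()`, which returns `None` when the server does not declare one.

```python
# Sample bytes standing in for the result of f.read(); not a real response
raw = '<title>知乎 - 發現更大的世界</title>'.encode('utf-8')

# Decoding with the encoding the page was written in gives readable text
good = raw.decode('utf-8')
# Decoding with a different encoding gives garbled text
bad = raw.decode('latin-1')

print(good)
print(bad)


def decode_body(raw_bytes, charset=None):
    """Decode with the declared charset if known, otherwise assume UTF-8.

    Bytes that cannot be decoded become U+FFFD instead of raising
    UnicodeDecodeError. decode_body is a hypothetical helper name.
    """
    return raw_bytes.decode(charset or 'utf-8', errors='replace')


print(decode_body(raw))
print(decode_body('知乎'.encode('gbk'), 'gbk'))
```

Falling back to UTF-8 is only a guess. Pages served in GBK or another encoding need that charset passed in explicitly.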

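3. The output needs to be of type str

Convert it to str. In Python 3, decode() already returns a str, so the extra str() call is harmless but not strictly required.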
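The difference between the two types can be checked with a short sketch. The sample bytes stand in for the result of `f.read()` and are not taken from a real response.

```python
raw = '知乎'.encode('utf-8')   # stands in for the bytes returned by f.read()

print(type(raw))               # <class 'bytes'>
text = raw.decode('utf-8')
print(type(text))              # <class 'str'>

# str() on an already-decoded string returns it unchanged
print(str(text) == text)       # True

# str() on raw bytes without an encoding gives the bytes repr, not the text
print(str(raw))                # b'\xe7\x9f\xa5\xe4\xb9\x8e'

# str() with an encoding argument behaves like decode()
print(str(raw, 'utf-8') == text)  # True
```

So the step that actually matters is the decode. Calling `str()` directly on the bytes would print the escaped repr rather than the page text.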

Output of the code above (the source of the Zhihu home page):

<!DOCTYPE html>
<html lang="zh-CN" class="">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" />
<meta http-equiv="X-ZA-Response-Id" content="1b244bb1a32b4315">
<meta http-equiv="X-ZA-Experiment" content="default:None,ge3:ge3_9,ge2:ge2_1,nweb_sticky_sidebar:sticky,live_review_buy_bar:live_review_buy_bar_2,is_office:false,home_ui2:default,is_show_unicom_free_entry:unicom_free_entry_off,app_store_rate_dialog:close,qa_sticky_sidebar:sticky_sidebar,android_profile_panel:panel_b,live_store:ls_a2_b2_c1_f2,search_hybrid_tabs:without-tabs,answer_related_readings:qa_recommend_with_ads_and_article,asdfadsf:asdfad,new_mobile_column_appheader:new_header,fav_act:default,remix_one_key_play_button:headerButton,mobile_qa_page_proxy_heifetz:m_qa_page_nweb,nweb_write_answer:default,android_pass_through_push:getui,new_more:new,new_buy_bar:livenewbuy3,zcm-lighting:zcm,iOS_newest_version:4.2.0,qrcode_login:qrcode,wechat_share_modal:wechat_share_modal_show">
<meta name="renderer" content="webkit" />
<meta name="description" content="中文網際網路最大的知識平臺,幫助人們便捷地分享彼此的知識、經驗和見解。"/>
<meta name="viewport" content="user-scalable=no, width=device-width, initial-scale=1.0, maximum-scale=1.0"/>
<title>知乎 - 發現更大的世界</title>



<link rel="apple-touch-icon" href="https://static.zhihu.com/static/revved/img/ios/touch-icon-152.87c020b9.png" sizes="152x152">
<link rel="apple-touch-icon" href="https://static.zhihu.com/static/revved/img/ios/touch-icon-120.496c913b.png" sizes="120x120">
<link rel="apple-touch-icon" href="https://static.zhihu.com/static/revved/img/ios/touch-icon-76.dcf79352.png" sizes="76x76">
<link rel="apple-touch-icon" href="https://static.zhihu.com/static/revved/img/ios/touch-icon-60.9911cffb.png" sizes="60x60">

<link rel="shortcut icon" href="https://static.zhihu.com/static/favicon.ico" type="image/x-icon" />
<link rel="dns-prefetch" href="p1.zhimg.com"/>
<link rel="dns-prefetch" href="p2.zhimg.com"/>
<link rel="dns-prefetch" href="p3.zhimg.com"/>
<link rel="dns-prefetch" href="p4.zhimg.com"/>
<link rel="dns-prefetch" href="comet.zhihu.com"/>
<link rel="dns-prefetch" href="static.zhihu.com"/>
<link rel="dns-prefetch" href="upload.zhihu.com"/>
<link rel="stylesheet" href="https://static.zhihu.com/static/revved/-/css/pages/unlogin-index/main.f214513a.css">
<meta name="google-site-verification" content="FTeR0c8arOPKh8c5DYh_9uu98_zJbaWw53J-Sch9MTg" />
<meta name="baidu-site-verification" content="KPFppAFoYF4Kkdv9" />
<meta property="qc:admins" content="00544670776201056375" />
<link rel="canonical" href="http://www.zhihu.com" />
<meta id="znonce" name="znonce" content="d5e581328572473aad8501685dae174f">
<!--[if lt IE 9]>
<script src="https://static.zhihu.com/static/components/respond/dest/respond.min.js"></script>
<link href="https://static.zhihu.com/static/components/respond/cross-domain/respond-proxy.html" id="respond-proxy" rel="respond-proxy" />
<link href="/static/components/respond/cross-domain/respond.proxy.gif" id="respond-redirect" rel="respond-redirect" />
<script src="/static/components/respond/cross-domain/respond.proxy.js"></script>
<![endif]-->
<script src="https://static.zhihu.com/static/revved/-/js/instant.14757a4a.js"></script>
</head>
<body class="zhi ">




<div class="index-main">
<div class="index-main-body">
<div class="index-header">
<h1 class="logo hide-text">知乎</h1>

<h2 class="subtitle">與世界分享你的知識、經驗和見解</h2>

</div>

<div class="desk-front sign-flow sign-flow clearfix sign-flow-simple">


<div class="index-tab-navs">
<div class="navs-slider">
<a href="#signup" class="active">註冊</a>
<a href="#signin">登入</a>
<span class="navs-slider-bar"></span>
</div>
</div>



<div class="view view-signin" data-za-module="SignInForm">
<form method="POST">
<input type="hidden" name="_xsrf" value="060f3dedb5b35e2ac6fe354bed716d04"/>
<div class="group-inputs">

<div class="account input-wrapper">

<input type="text" name="account" aria-label="手機號或郵箱" placeholder="手機號或郵箱" required>
</div>
<div class="verification input-wrapper">
<input type="password" name="password" aria-label="密碼" placeholder="密碼" required /><button type="button" class="send-code-button">獲取驗證碼</button>
</div>

<div class="Captcha input-wrapper" data-type="cn" data-za-module="Captcha">
<div class="Captcha-operate">
<input type="hidden" name="captcha" required data-rule-required="true" data-msg-required="請點選圖中所有倒立的文字">
<input type="hidden" name="captcha_type" value="cn" required>
<label class="Captcha-prompt">請點選圖中所有倒立的文字</label>
<span class="Captcha-refresh js-refreshCaptcha sprite-index-icon-refresh"></span>
</div>
<div class="Captcha-imageConatiner">
<img class="Captcha-image" alt="驗證碼" >
</div>
</div>

</div>
<div class="button-wrapper command">
<button class="sign-button submit" type="submit">登入</button>
</div>
<div class="signin-misc-wrapper clearfix">

<button type="button" class="signin-switch-button">手機驗證碼登入</button>

<a class="unable-login" href="#">無法登入?</a>
</div>

<div class="other-signup-wrapper" data-za-module="SNSSignIn">

<span class="name signin-switch-qrcode-buttons">二維碼登入</span>
<span class="signup-footer-separate signup-footer-se"> · </span>

<span class="name signup-social-buttons js-toggle-sns-buttons">社交帳號登入</span>

<div class="sns-buttons">
<a title="微信登入" class="js-bindwechat" href="#"><i class="sprite-index-icon-wechat"></i></a>
<a title="微博登入" class="js-bindweibo" href="#"><i class="sprite-index-icon-weibo"></i></a>
<a title="QQ 登入" class="js-bindqq" href="#"><i class="sprite-index-icon-qq"></i></a>
</div>


</div>

</form>

<div class="qrcode-signin-container">
<div class="qrcode-signin-step1">
<div class="qrcode-signin-img-wrapper">
<img src="/static/img/spinner/grey-loading.gif" class="qrcode-signin-loading"/>
</div>
<p>開啟最新 <a href="https://www.zhihu.com/app/" target="_blank">知乎 App</a></p>
<p>在「更多」頁面右上角開啟掃一掃</p>
<div class="qrcode-signin-cut-button">
<span class="signin-switch-password">使用密碼登入</span>
</div>
</div>
<div class="qrcode-signin-step2">
<div class="qrcode-signin-scan-status"></div>
<p class="qrcode-signin-scan-tips">掃描成功</p>
<p>請在手機上「確認登入」</p>
<div class="qrcode-signin-cut-button">
<span class="qrcode-goto-scan">返回二維碼</span>
</div>
</div>
<div class="qrcode-signin-failure">
<div class="qrcode-signin-failure-icon"></div>
<p class="qrcode-signin-failure-message"></p>
<div class="qrcode-signin-cut-button">
<span class="signin-switch-password">使用密碼登入</span>
</div>
</div>
<div class="qrcode-signin-guide"></div>
</div>



<div class="QRCode">
<button class="QRCode-toggleButton">
<span class="sprite-global-icon-qrcode"></span>
<span class="QRCode-toggleButtonText ">下載知乎 App</span>
</button>
<div class="QRCode-card">
<div class="QRCode-image"></div>
<div class="sprite-index-icon-arrow"></div>
</div>
</div>


</div>
<div class="view  view-signup selected" data-za-module="SignUpForm">

<form class="zu-side-login-box" action="/register/email" id="sign-form-1" autocomplete="off" method="POST">
<input type="password" hidden> 
<input type="hidden" name="_xsrf" value="060f3dedb5b35e2ac6fe354bed716d04"/>

<div class="group-inputs">


<div class="name input-wrapper">
<input required type="text" name="fullname" aria-label="姓名" placeholder="姓名">
</div>
<div class="email input-wrapper">

<input required type="text" class="account" name="phone_num" aria-label="手機號" placeholder="手機號">

</div>
<div class="input-wrapper">
<input required type="password" name="password" aria-label="密碼" placeholder="密碼(不少於 6 位)" autocomplete="off">
</div>


<div class="Captcha input-wrapper" data-type="cn" data-za-module="Captcha">
<div class="Captcha-operate">
<input type="hidden" name="captcha" required data-rule-required="true" data-msg-required="請點選圖中所有倒立的文字">
<input type="hidden" name="captcha_type" value="cn" required>
<label class="Captcha-prompt">請點選圖中所有倒立的文字</label>
<span class="Captcha-refresh js-refreshCaptcha sprite-index-icon-refresh"></span>
</div>
<div class="Captcha-imageConatiner">
<img class="Captcha-image" alt="驗證碼" >
</div>
</div>

</div>
<div class="button-wrapper command">
<button class="sign-button submit" type="submit">註冊知乎</button>
</div>

</form>

<p class="agreement-tip">點選「註冊」按鈕,即代表你同意<a href="/terms" target="_blank">《知乎協議》</a></p>
<a class="signup-entry--org" href="/org/signup">註冊機構號</a>

<div class="QRCode">
<button class="QRCode-toggleButton">
<span class="sprite-global-icon-qrcode"></span>
<span class="QRCode-toggleButtonText ">下載知乎 App</span>
</button>
<div class="QRCode-card">
<div class="QRCode-image"></div>
<div class="sprite-index-icon-arrow"></div>
</div>
</div>



</div>
</div>
</div>


</div>

<div class="footer">
<a target="_blank" href="https://zhuanlan.zhihu.com">知乎專欄</a>
<span class="dot">·</span>
<a target="_blank" href="/roundtable">知乎圓桌</a>
<span class="dot">·</span>
<a target="_blank" href="/explore" data-za-c="explore" data-za-a="visit_explore" data-za-l="home_bottom_explore">發現</a>
<span class="dot">·</span>
<a target="_blank" href="/app">移動應用</a>
<span class="dot">·</span>
<a href="/contact" class="footer-mobile-show">聯絡我們</a>
<span class="dot">·</span>
<a target="_blank" href="/careers">來知乎工作</a>
<br />
<span>&copy; 2017 知乎</span>
<span class="dot">·</span>
<a href="http://www.miibeian.gov.cn/" target="_blank">京 ICP 證 110745 號</a>
<span class="dot">·</span>
<span>京公網安備 11010802010035 號</span>
<span class="dot">·</span>
<a href="http://zhstatic.zhihu.com/assets/zhihu/publish-license.jpg" target="_blank">出版物經營許可證</a>
<br />
<a target="_blank" href="https://zhuanlan.zhihu.com/p/28852607">侵權舉報</a>
<span class="dot">·</span>
<a target="_blank" href="http://www.12377.cn">網上有害資訊舉報專區</a>
<span class="dot">·</span>
<a target="_blank" href="/jubao">兒童色情資訊舉報專區</a>
<span class="dot">·</span>
<span>違法和不良資訊舉報:010-82716601</span>
<div class="chengxing">
<a id='___szfw_logo___' href='https://credit.szfw.org/CX20170607038331320388.html' target='_blank'>
<img src="https://static.zhihu.com/static/revved/img/index/chengxing_logo@2x.65dc76e8.png" border='0' />
</a>
<script type='text/javascript'>(function(){document.getElementById('___szfw_logo___').oncontextmenu = function(){return false;}})();</script>
</div>
</div>




<script type="text/json" class="json-inline" data-name="disabled_components">["back_to_top"]</script>
<script type="text/json" class="json-inline" data-name="current_user">["","","","-1","",0,0]</script>
<script type="text/json" class="json-inline" data-name="env">["zhihu.com","comet.zhihu.com",false,null,false,false]</script>

<script type="text/json" class="json-inline" data-name="ga_vars">{"user_created":0,"now":1509713487000,"abtest_mask":"------------------------------","user_attr":[0,0,0,"-","-"],"user_hash":0}</script>

<script src="https://static.zhihu.com/static/revved/-/js/vendor.cb14a042.js"></script>
<script src="https://static.zhihu.com/static/revved/-/js/closure/base.41bb3b24.js"></script>

<script src="https://static.zhihu.com/static/revved/-/js/closure/common.ef6c9c27.js"></script>
<script src="https://static.zhihu.com/static/revved/-/js/closure/page-index.f17f3a40.js"></script>
<meta name="entry" content="ZH.entrySignPage" data-module-id="page-index">


<input type="hidden" name="_xsrf" value="060f3dedb5b35e2ac6fe354bed716d04"/>
</body>
</html>

 

Reposted from: https://www.cnblogs.com/wssx/p/7780462.html
