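Writing a Web Crawler in Python

Posted by 覆手為雲p on 2017-08-11

Original URL: https://www.cnblogs.com/aland-1415/p/7347739.html

This article uses the urllib module that ships with Python 3 to write a simple, lightweight crawler. To find the URL of a specific element in a web page, look up the Firebug add-on for Firefox or the built-in developer tools in Chrome.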

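1. Accessing a URL

re = urllib.request.urlopen('URL')

The argument can also be a urllib.request.Request object, and a data argument can follow it; when data is passed, the request automatically becomes a POST.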
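As a minimal sketch of the point above, the following builds a Request with and without data and checks which HTTP method urllib would use. Nothing is sent over the network; the URL and form fields are placeholders and do not come from the original post.

```python
import urllib.parse
import urllib.request

# Placeholder URL and form fields, chosen only for illustration.
url = 'http://example.com/login'
form = {'user': 'alice', 'lang': 'zh'}

# In Python 3 the data argument must be bytes, so the urlencoded
# string is encoded before it is attached to the request.
data = urllib.parse.urlencode(form).encode('utf-8')

get_req = urllib.request.Request(url)
post_req = urllib.request.Request(url, data=data)

print(get_req.get_method())   # GET
print(post_req.get_method())  # POST

# Either object could then be passed to urllib.request.urlopen(...).
```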

2. Attributes and methods of a urllib.request.Request(url, data=None, headers={}) object

full_url
type
host
data
selector
method
get_method()
add_header(key, val)
add_unredirected_header(key, header)
has_header(header)
remove_header(header)
get_full_url()
set_proxy(host, type)
get_header(header_name, default=None)
header_items()
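A short sketch of a few of these attributes and methods in use. It only inspects a Request object and never opens a connection; the URL and header value are placeholders.

```python
import urllib.request

# Placeholder URL, for illustration only.
req = urllib.request.Request('http://example.com/path?q=1')
req.add_header('User-Agent', 'Mozilla/5.0')

print(req.full_url)      # http://example.com/path?q=1
print(req.type)          # http
print(req.host)          # example.com
print(req.selector)      # /path?q=1
print(req.get_method())  # GET, because no data was given

# add_header() stores the name via str.capitalize(), so 'User-Agent'
# has to be looked up as 'User-agent'.
print(req.has_header('User-agent'))          # True
print(req.get_header('User-agent'))          # Mozilla/5.0
print(req.get_header('Referer', 'not set'))  # not set

req.remove_header('User-agent')
print(req.header_items())                    # []
```

The capitalization detail is easy to trip over: has_header('User-Agent') returns False even though that is how the header was added.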

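3. Methods available on the opened response object:

re.read()       Reads the content. To save it, first create a file of the appropriate format, then write what was read into that file.
re.geturl()     Returns the URL of the opened object.
re.info()       Returns the response headers sent by the server.
re.getcode()    Returns the response status code.
urllib.parse.urlencode()    Converts a dict holding POST data into the format needed to open the page. (This is a module-level function, not a method of the response; in Python 3 the result must also be encoded to bytes before it is passed as data.)

json.loads() can convert JSON text into key-value pairs (a dict).

Headers can be passed in as a dict when the request is created, to disguise how you are accessing the site; they can also be appended to a Request object with add_header(key, val).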
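The three helpers mentioned above can be sketched together as follows. The form fields, URL, header values, and JSON text are all made up for illustration, and no request is actually sent.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical form fields, for illustration only.
form = {'name': 'crawler', 'q': 'python 3'}
body = urllib.parse.urlencode(form)
print(body)  # name=crawler&q=python+3

# Headers can be passed as a dict when the Request is created...
headers = {'User-Agent': 'Mozilla/5.0'}
req = urllib.request.Request('http://example.com/api',
                             data=body.encode('utf-8'),
                             headers=headers)
# ...or appended afterwards with add_header(key, val).
req.add_header('Referer', 'http://example.com/')

# json.loads() turns JSON text (for example a decoded response body)
# into a dict.
text = '{"status": "ok", "count": 2}'
result = json.loads(text)
print(result['status'], result['count'])  # ok 2
```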

4. When you know a file's URL, you can use this method to download it and save it locally

urllib.request.urlretrieve('http://wx1.sinaimg.cn/mw600/9bbc284bgy1ffkuafn4xtj20dw0jgh08.jpg','bc.jpg')
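When many files are downloaded, hard-coding each local name quickly gets tedious. One possible approach, sketched below, derives the name from the URL path. The helper names filename_from_url and download are hypothetical, not part of urllib, and download needs network access when called.

```python
import os
import urllib.parse
import urllib.request


def filename_from_url(url, default='download.bin'):
    """Return the last path segment of url, or default if there is none."""
    path = urllib.parse.urlparse(url).path
    name = os.path.basename(path)
    return name or default


def download(url, folder='.'):
    """Save url into folder under its own file name (needs network access)."""
    target = os.path.join(folder, filename_from_url(url))
    urllib.request.urlretrieve(url, target)
    return target


print(filename_from_url(
    'http://wx1.sinaimg.cn/mw600/9bbc284bgy1ffkuafn4xtj20dw0jgh08.jpg'))
# 9bbc284bgy1ffkuafn4xtj20dw0jgh08.jpg
```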

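5. Implementing login (POST)

(1) Using a session to keep the login state

# This snippet uses the third-party requests library rather than urllib.
# getXSRF, baseurl, url, headers_base, email and password are defined elsewhere.
import requests

login_data = {
    '_xsrf': getXSRF(baseurl),
    'password': password,
    'remember_me': 'true',
    'email': email,
}
session = requests.session()
content = session.post(url, headers=headers_base, data=login_data)
# The session keeps the cookies, so later requests stay logged in.
s = session.get("http://www.zhihu.com", verify=False)
print(s.text)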

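(2) Logging in with cookies

import http.cookiejar
import urllib.parse
import urllib.request

# Excerpt from a class method: self.ua, loginURL and loginHeaders are defined elsewhere.
post = {
    'ua': self.ua,
    'TPL_checkcode': '',
    'CtrlVersion': '1,0,0,7',
    'TPL_password': '',
}
# Encode the POST data; in Python 3 it must be bytes
postData = urllib.parse.urlencode(post).encode('utf-8')
cookie = http.cookiejar.LWPCookieJar()
cookieHandler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(cookieHandler, urllib.request.HTTPHandler)
# First login attempt, made to obtain the captcha: build the request
request = urllib.request.Request(loginURL, postData, loginHeaders)
# Get the response to the first login attempt
response = opener.open(request)
# Read its content
content = response.read().decode('gbk')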

Common encodings used by websites include utf8, gbk, gb2312, and gb18030.
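Because the encoding varies from site to site, a hard-coded decode('gbk') can fail on another page. A minimal sketch of one way to cope is to try a few encodings in turn. The helper name decode_page and the order of encodings are assumptions made for illustration; gb18030 is used as the fallback because it is a superset of gbk and gb2312.

```python
def decode_page(raw, encodings=('utf-8', 'gb18030')):
    """Decode raw bytes, trying each encoding in turn."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: decode with the first encoding and replace bad bytes.
    return raw.decode(encodings[0], errors='replace')


# Bytes as a gbk site and a utf-8 site would send them.
print(decode_page('中文'.encode('gbk')))    # 中文
print(decode_page('中文'.encode('utf-8')))  # 中文
```

Trying utf-8 first matters here: gb18030 will accept many byte sequences that are really utf-8 and decode them into the wrong characters, whereas gbk bytes are usually rejected by the utf-8 decoder.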

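6. Using proxies

If one IP address hits a server too many times in a short period, the server may block it, so a proxy is often needed to get around this. The method is as follows:

# The dict maps a scheme to the proxy's IP address and port
proxy_support = urllib.request.ProxyHandler({'http': 'proxy_ip:port'})
opener = urllib.request.build_opener(proxy_support)
urllib.request.install_opener(opener)  # optional: install it globally
opener.open(url)        # or call the proxied opener directly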
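The steps above can be wrapped in a small helper, sketched here. The function name make_proxy_opener and the proxy address 127.0.0.1:8080 are placeholders for illustration; building the opener does not contact the proxy, only opener.open() would.

```python
import urllib.request


def make_proxy_opener(proxy_url, scheme='http'):
    """Build an opener that routes requests for scheme through proxy_url."""
    handler = urllib.request.ProxyHandler({scheme: proxy_url})
    opener = urllib.request.build_opener(handler)
    opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
    return opener


# Placeholder proxy address; replace it with a real proxy before use.
opener = make_proxy_opener('http://127.0.0.1:8080')

# opener.open('http://example.com', timeout=5) would now go through the
# proxy. Calling the opener directly avoids install_opener(), which
# changes the behaviour of urlopen() for the whole process.
```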

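Note: for anything more complex, consider the more full-featured scrapy framework.

Appendix: a crawler I wrote to check whether proxies found online actually work. It first collects proxy addresses from a website, then uses each proxy to access Baidu and checks whether Baidu's page comes back; if it does, the proxy address is saved.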

import pickle
import re
import threading
import time
import urllib.request

USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36')


class ProxyCheck(threading.Thread):
    """Try each proxy against Baidu and keep the ones that work."""

    def __init__(self, proxylist):
        threading.Thread.__init__(self)
        self.proxylist = proxylist          # items: (type, ip, port, location)
        self.timeout = 5
        self.test_url = 'http://www.baidu.com'
        self.test_str = '11000002000001'    # string expected in the real Baidu page
        self.checkedProxyList = []

    def checkProxy(self):
        cookies = urllib.request.HTTPCookieProcessor()
        for proxy in self.proxylist:
            proxy_handler = urllib.request.ProxyHandler(
                {'http': '%s://%s:%s' % (proxy[0], proxy[1], proxy[2])})
            opener = urllib.request.build_opener(cookies, proxy_handler)
            opener.addheaders = [('User-Agent', USER_AGENT)]
            t1 = time.time()
            try:
                # Call this opener directly instead of installing it globally.
                req = opener.open(self.test_url, timeout=self.timeout)
                result = req.read().decode('utf-8')
            except Exception:
                # Dead or slow proxy, or an undecodable page: skip it.
                continue
            timeused = time.time() - t1
            if self.test_str in result:
                checked = (proxy[0], proxy[1], proxy[2], proxy[3], timeused)
                self.checkedProxyList.append(checked)
                print(checked)

    # def sort(self):
    #     self.checkedProxyList.sort(key=lambda p: p[4])  # fastest first

    def save(self, filename):
        with open("%s.txt" % filename, 'w') as f:
            for proxy in self.checkedProxyList:
                f.write("{}\t{}:{}\t{}\t{}\n".format(*proxy))
        with open("%s.pickle" % filename, 'wb') as fb:
            pickle.dump(self.checkedProxyList, fb)

    def run(self):
        self.checkProxy()
        self.save("checked-50")


class xiciProxy:
    """Collect proxy addresses from the xicidaili listing pages."""

    def __init__(self):
        self.alllist = []

    def grep(self, url):
        req = urllib.request.Request(url)
        req.add_header('User-Agent', USER_AGENT)

        result1 = urllib.request.urlopen(req)
        result2 = result1.read().decode('utf-8')

        # Groups: IP, port, location (link text), type (HTTP or HTTPS).
        regex = r"<td>(\d+\.\d+\.\d+\.\d+)</td>\n.*?" \
                r"<td>(\d+)</td>\n.*?" \
                r"\n.*?" \
                r"<a href=.*?>(.*?)</a>\n.*?" \
                r"\n.*?" \
                r"\n.*?" \
                r"<td>(HTTPS?)</td>"
        get = re.findall(regex, result2)
        proxylist = []
        for i in get:
            # Reorder to (type, ip, port, location).
            proxylist.append((i[3], i[0], i[1], i[2]))
        return proxylist

    def save(self, filename):
        with open("%s.txt" % filename, 'w') as f:
            for proxy in self.alllist:
                f.write("{}\t{}:{}\t{}\n".format(*proxy))
        with open("%s.pickle" % filename, 'wb') as fb:
            pickle.dump(self.alllist, fb)

    def run(self):
        for i in range(51, 1951):
            url = "http://www.xicidaili.com/nn/{}".format(i)
            print(url)
            proxylist = self.grep(url)
            self.alllist += proxylist
            # Write out a batch every 50 pages, then start a new one.
            if i % 50 == 0:
                self.save("xiciproxy-{}".format(i))
                self.alllist = []


# Assumes xiciproxy-50.pickle already exists from an earlier crawl
# (xiciProxy.save writes files in this format).
with open("xiciproxy-50.pickle", "rb") as fb:
    proxylist = pickle.load(fb)
# run() is called directly, so the check runs in the current thread;
# call start() instead to run it in a background thread.
ProxyCheck(proxylist).run()
