【Python Learning】Crawlers crawlers crawlers crawlers~

Posted by 愛夕 on 2018-05-03

Day 8
Most of the crawler examples online are for py2 (:з」∠)
Today I found a py3 crawler and tried scraping my school's portal.

import io
import sys
import urllib.request
web_header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
    'Cookie': 'iPlanetDirectoryPro=AQIC5wM2LY4SfczGmS5S1wsHjs3f8d%2FvQadvCPz780%2B9%2B1o%3D%40AAJTSQACMDI%3D%23; JSESSIONID=0000g4W05n040WWJHMgWwYK6u41:172u4qcnp'
}
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf8')
url_mh = 'http://xxxx.xxxx.xxx.xx/index.portal'
req = urllib.request.Request(url=url_mh, headers=web_header)
resp = urllib.request.urlopen(req)
data = resp.read()
print(data.decode('utf-8'))

Breaking this code down: web_header is the request-header dict, and it includes the cookie. Before, I always kept the cookie in a separate cookie module; doing it this way is much simpler = =(:з」∠)
sys.stdout is re-wrapped so the output encoding is utf8.
url_mh is the portal page you land on after a successful login.
req is the Request object that ties the URL to the modified headers.
resp is the response returned by urlopen (this is actually a GET request, since no data parameter is passed); read() returns the raw bytes,
which are then decoded as utf-8 for display.
Both utf-8 settings have to be right: one is the display encoding, the other is the decoding of the response.
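To make that distinction concrete, here is a small self-contained sketch (no network involved, with made-up data) of the two roles utf-8 plays:

```python
import io

# Stand-in for the raw bytes that resp.read() would return
data = '公告'.encode('utf-8')

# First utf-8: decoding the response bytes into a str
text = data.decode('utf-8')

# Second utf-8: making sure stdout writes that str back out as utf-8;
# a BytesIO stands in for sys.stdout.buffer here
buf = io.BytesIO()
out = io.TextIOWrapper(buf, encoding='utf8')
out.write(text)
out.flush()
```

If either encoding is wrong, you get mojibake on one side or a UnicodeDecodeError on the other.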

After a lot of debugging, I can now scrape the announcements:

import io
import sys
import urllib.request
import re
import requests
web_header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
    'Cookie': 'iPlanetDirectoryPro=AQIC5wM2LY4SfczGmS5S1wsHjs3f8d%2FvQadvCPz780%2B9%2B1o%3D%40AAJTSQACMDI%3D%23; JSESSIONID=0000g4W05n040WWJHMgWwYK6u41:172u4qcnp'
}
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf8')
url_mh = 'http://XXX/index.portal'
req = urllib.request.Request(url=url_mh, headers=web_header)
# resp = urllib.request.urlopen(req)
# data = resp.read()
# print(data.decode('utf8'))
resp = requests.get(url=url_mh, headers=web_header)
resp.encoding = 'utf-8'
# print(resp.text)
# broken regex: the quoting is mismatched, and 'οnclick' contains a non-ASCII character
gonggao = re.findall('<img src="images/s.gif" alt="" /></a><a   title='"(.*?)"' class="rss-title" οnclick=',resp.text,re.S)
for each in gonggao:
    print(each)

That regex wasn't written properly; written like this, it works:

import io
import sys
import urllib.request
import re
import requests
web_header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
    'Cookie': 'iPlanetDirectoryPro=AQIC5wM2LY4SfczGmS5S1wsHjs3f8d%2FvQadvCPz780%2B9%2B1o%3D%40AAJTSQACMDI%3D%23; JSESSIONID=0000g4W05n040WWJHMgWwYK6u41:172u4qcnp'
}
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf8')
url_mh = 'http://xxx.cn/index.portal'
req = urllib.request.Request(url=url_mh, headers=web_header)
# resp = urllib.request.urlopen(req)
# data = resp.read()
# print(data.decode('utf8'))
resp = requests.get(url=url_mh, headers=web_header)
resp.encoding = 'utf-8'
# print(resp.text)
gonggao = re.findall('<a   title=\'(.*?)\' class="rss-title"', resp.text, re.S)
for each in gonggao:
    print(each)
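The fixed pattern can be checked offline against a made-up HTML fragment (the fragment below is invented to mimic the portal's announcement markup, it is not the real page):

```python
import re

# Invented HTML fragment resembling the portal's announcement links
html = ("<a   title='First notice' class=\"rss-title\">x</a>\n"
        "<a   title='Second notice' class=\"rss-title\">y</a>")

# Same pattern as above: capture whatever sits between title='...' and class="rss-title"
titles = re.findall("<a   title='(.*?)' class=\"rss-title\"", html, re.S)
```

re.S makes `.` match newlines too, which matters when a title ever spans lines in the real page source.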

It's a bit late today, so I'll leave exporting to a text file for next time~
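That export step might look something like this (just a sketch: the file name is made up, and the list here stands in for the real scraped `gonggao`):

```python
# Stand-in for the list of announcement titles scraped above
gonggao = ['公告一', '公告二']

# Write each title on its own line, encoded as utf-8
with open('gonggao.txt', 'w', encoding='utf-8') as f:
    for each in gonggao:
        f.write(each + '\n')

# Read it back to check the round trip
with open('gonggao.txt', encoding='utf-8') as f:
    lines = f.read().splitlines()
```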
