Crawling a Website's Resource Files with Python

Posted by weixin_34085658 on 2015-06-29

Original URL: https://blog.csdn.net/weixin_34085658/article/details/86317494

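How a crawler works:

The following explanation comes from Zhihu.

First, you need to understand how a crawler works.
Imagine you are a spider that has been dropped onto the inter"net", and you need to look at every web page. What do you do? No problem: you start anywhere, say the home page of People's Daily. This is called the initial page; let's denote it by $.
On the People's Daily home page, you see the various links the page leads to, so you happily crawl from the home page to the "Domestic News" page. Great, you have now crawled two pages (the home page and Domestic News)! Never mind for now how the crawled pages get processed; just imagine you copied each page in full as HTML and carry it with you.
Suddenly you notice that the Domestic News page has a link back to the home page. As a smart spider, you know you don't need to crawl back there, because you have already seen it. So you need to use your brain to store the addresses of the pages you have already seen. Then, each time you see a new link that might need crawling, you first check whether you have already been to that address. If you have, skip it.
In theory, if every page is reachable from the initial page, then it can be shown that you will eventually crawl all of them.

Link: http://www.zhihu.com/question/20899988/answer/24923424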
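
The spider's routine above is a breadth-first traversal with a "visited" set. Here is a minimal sketch of that idea on an in-memory link graph, with no network access. The function name crawl, the get_links callback, and the three-page site are all hypothetical and only illustrate the control flow:

```python
from collections import deque


def crawl(start, get_links):
    """Visit every page reachable from start, each exactly once, in breadth-first order."""
    queue = deque([start])
    visited = {start}          # addresses the spider has already seen
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for link in get_links(page):
            if link not in visited:
                visited.add(link)   # remember it as soon as it is queued
                queue.append(link)
    return order


# Hypothetical site: each page maps to the pages it links to.
site = {
    'home': ['domestic', 'world'],
    'domestic': ['home', 'world'],
    'world': ['home'],
}
print(crawl('home', lambda page: site.get(page, [])))
```

Marking a page as visited when it is queued, rather than when it is fetched, keeps the same address from entering the queue twice.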

1. Crawling a site whose upload directory is anonymously accessible

import re,os
import urllib.request
import urllib
import ssl

ssl._create_default_https_context = ssl._create_unverified_context

from collections import deque

queue = deque()
visited = set()

origurl = url = 'http://www.***.cn/Upload/'  # entry page; can be changed to another one
path = 'C:/Users/Administrator/Desktop/a/'

queue.append(url)
cnt = 0

while queue:
    url = queue.popleft()  # take the URL at the head of the queue

    print('Fetched: ' + str(cnt) + '     Fetching <---    ' + url)
    cnt += 1
    try:
        urlop = urllib.request.urlopen(url, timeout=3)
    except:
        continue

    content_type = urlop.getheader('Content-Type') or ''  # the header may be missing
    if 'image' in content_type:
        # Mirror the URL's path below the entry page under the local directory
        xpath = url.replace(origurl, '')
        orig_list = xpath.split("/")
        orig_ext_file = orig_list[-1]
        path_sub = orig_list[:-1]
        new_path = path + '/'.join(path_sub)
        os.makedirs(new_path, exist_ok=True)  # no error if the directory already exists

        urllib.request.urlretrieve(url, new_path + '/' + orig_ext_file)

    if 'html' not in content_type:
        continue

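    # Skip pages that cannot be read or decoded
    try:
        data = urlop.read().decode('utf-8')
    except:
        continue

    # Use a regex to extract every link on the page, check whether it was visited, then add it to the crawl queue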
    linkre = re.compile('href="(.+?)"')
    for x in linkre.findall(data):
        if re.match(r"\?C=.", x):
            continue
        if re.match(r"/Upload/", x):
            continue

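        newurl = url + x  # hrefs here are taken as relative to the current page
        if newurl not in visited:
            queue.append(newurl)
            visited.add(newurl)  # mark the queued URL itself so it is not queued twice
            print('Queued --->    ' + x)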
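
The script above builds child URLs with url + x, which works when every href is relative and the page URL ends in a slash. A minimal sketch of a more general way to resolve links, using urllib.parse.urljoin from the standard library. The function name resolve_links, its root parameter, and the example.com URLs are hypothetical:

```python
from urllib.parse import urljoin


def resolve_links(page_url, hrefs, root):
    """Return absolute URLs for hrefs, keeping only those that stay under root."""
    result = []
    for href in hrefs:
        if href.startswith('?'):
            continue  # column-sorting links of a directory listing, e.g. ?C=N;O=D
        absolute = urljoin(page_url, href)  # handles relative, root-relative and absolute hrefs
        if absolute.startswith(root):
            result.append(absolute)
    return result


links = resolve_links(
    'http://example.com/Upload/2015/',
    ['?C=N;O=D', '/Upload/', 'a.jpg', 'sub/', 'http://other.example/x'],
    'http://example.com/Upload/',
)
print(links)
```

The parent-directory link is kept here on purpose: it resolves to an absolute URL, so a "visited" set of absolute URLs filters it out without a separate rule.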

2. Scraping an HD wallpaper site

import re
import urllib.request
import urllib
import ssl

ssl._create_default_https_context = ssl._create_unverified_context  # disable SSL certificate verification for https://

from collections import deque

queue = deque()
visited = set()

website = 'http://www.***.com/'
website_column = 'column/'
url = website + website_column + '80827.html'  # entry page
path = './images/'

queue.append(url)  # add to the queue
cnt = 0
while queue:
    url = queue.popleft()  # take the URL at the head of the queue
    visited |= {url}  # mark as visited

    print('Fetched: ' + str(cnt) + '     Fetching <---    ' + url)
    cnt += 1
    try:
        urlop = urllib.request.urlopen(url, timeout=3)
    except:
        continue
    current_num_re = re.compile(r'/' + website_column + r'(\d+)/')
    current_num = current_num_re.findall(url)
    if url == website + website_column:
        continue
    if 'html' not in urlop.getheader('Content-Type'):
        continue

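    # Read the body once (a second read() returns empty bytes), then try gbk and fall back to utf-8
    try:
        raw = urlop.read()
        try:
            data = raw.decode('gbk')
        except UnicodeDecodeError:
            data = raw.decode('utf-8')
    except Exception:
        continue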

    # Use regexes to extract every link on the page, check whether it was visited, then add it to the crawl queue
    linkre = re.compile('href="(.+?)"')
    inside1 = re.compile(r'/' + website_column + '(.*)')
    inside2 = re.compile(r'(\d+)\.htm')

    for x in linkre.findall(data):
        if 'http' not in x:
            resulturl = ''
            c = inside1.findall(x)
            if c:
                resulturl = website + website_column + c[0]
            else:
                c = inside2.findall(x)
                if c:
                    cnum = current_num[0] if current_num else ''
                    resulturl = website + website_column + cnum + '/' + c[0] + '.htm'

            # Compare the absolute URL against visited, and mark it when it is queued
            if resulturl and resulturl not in visited:
                visited.add(resulturl)
                queue.append(resulturl)
                print('Queued --->    ' + resulturl)

    linkrerr = re.compile(r'<p><img src="(.*)" onload="btnaddress\(1\);')
    src = linkrerr.findall(data)
    if src:
        print(src)
        req = urllib.request.Request(src[0], headers={
            'Connection': 'Keep-Alive',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.8,zh-Hans-CN;q=0.5,zh-Hans;q=0.3',
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko',
            'Referer': url
        })
        resource = urllib.request.urlopen(req, timeout=30)
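        orig_ext_file = src[0].split("/")[-1]  # the file name is the last URL segment
        # urllib.request.urlretrieve(src[0], path + orig_ext_file)  # urlretrieve cannot download when the site uses the Referer header to reject crawlers
        # Assumes the directory named by path already exists
        with open(path + orig_ext_file, "wb") as foo:
            # Do not name a variable str here: it would shadow the built-in that str(cnt) calls on the next pass
            foo.write(resource.read())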
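
A response body can be read only once, so any fallback between encodings has to work on bytes that were already read. A minimal sketch of such a fallback as a small helper. The name decode_page and its encodings parameter are hypothetical. The script above tries gbk first; this sketch tries utf-8 first, which is my assumption and not the original's choice, because many UTF-8 byte sequences also decode as GBK without raising an error and would come out as garbage:

```python
def decode_page(raw, encodings=('utf-8', 'gbk')):
    """Decode raw bytes with the first encoding that succeeds; return None if none does."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # try the next candidate encoding
    return None


# The same character encoded two ways decodes to the same text.
print(decode_page('中'.encode('gbk')) == decode_page('中'.encode('utf-8')))
```

In the crawl loop this would be used as data = decode_page(urlop.read()), skipping the page when the result is None.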

Reference: https://jecvay.com/2014/09/python3-web-bug-series1.html
