Using Python to automatically open a browser, download zip files, and write the extracted content to Excel

Posted by 拯救自己的小毛猴 on 2021-01-03

Original URL: https://blog.csdn.net/Savingme/article/details/112143595

Tags: Python, browser, Excel

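Experts, please go easy on me: some of this code was learned and written on the spot, so if I have handled any details poorly, please point them out~
First, the result screenshot. Some parts are not included here, such as launching the browser, but I trust that smart readers like you will pick that up in no time.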

Code approach

Download

Libraries used and overall approach

This part uses the time, selenium, urllib, re, requests, and os libraries.

Code

#!/usr/bin/python3
# coding=utf-8
import os
import re
import time

import requests
from selenium import webdriver
from selenium.webdriver.common.by import By
from urllib.parse import unquote

# The two dicts below are meant to get past anti-scraping checks. Other articles
# write them this way too, but this script does not actually use them.
headers = {
    'Accept': '*/*',
    'Accept-Language': 'en-US,en;q=0.5',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.116 Safari/537.36'
}
params = {
    'from': 'search',
    'seid': '9698329271136034665'
}


class Download_file():
    def __init__(self, url, order_number, file_path):
        self.url = url
        self.order_number = order_number
        self.file_path = file_path

    # Collect the download links of the files
    def _get_files_url(self):
        # Open the page in Chrome
        driver = webdriver.Chrome()
        url_list = []
        try:
            driver.get(self.url)
            print(driver.title)
            time.sleep(5)
            # Switch into the first frame, then find the input box by its id
            driver.switch_to.frame(0)
            driver.find_element(By.ID, 'search_id').send_keys(self.order_number)
            # Every page needs its own handling. The button I need has no id, only
            # ng-click="queryCheckRecordByTid(queryInfo.queryTid)", so I locate it by class.
            driver.find_element(By.CLASS_NAME, 'btn').click()
            time.sleep(3)
            # Tags written with AngularJS are annoying to locate, so I first find
            # the parent tags and then reach the target tag through each parent.
            dd = driver.find_elements(By.CLASS_NAME, 'col-xs-2')
            for i in dd:
                # Each parent holds two <a></a> tags and only the first is wanted.
                # find_element returns the first match, which is the download link.
                a = i.find_element(By.XPATH, './/a')
                url_list.append(a.get_attribute('href'))
            time.sleep(3)
        finally:
            # Always shut the browser down, even if a lookup above fails
            driver.quit()
        return url_list

    # Download the files
    def download_save(self):
        # Some of the collected hrefs may be None, so filter them out
        url_list = [u for u in self._get_files_url() if u is not None]
        if len(url_list) == 0:
            return False
        # Create a folder for the zip files under the user-specified directory
        save_dir = os.path.join(self.file_path, 'Download_Files')
        os.makedirs(save_dir, exist_ok=True)
        for url in url_list:
            # The link carries the author and file name, but percent-encoded,
            # so extract the target string with a regex and then decode it
            ret = re.search(r'_.*\.zip$', url)
            if ret is None:
                # Not a zip link in the expected form; skip it
                continue
            file_info = unquote(ret.group())
            file_author = file_info.split('_')[1]
            file_title = file_info.split('_')[2]
            file_object = requests.get(url)
            file_name = file_author + '_' + file_title
            print('Downloading: %s' % file_name)
            with open(os.path.join(save_dir, file_name), 'wb') as f:
                f.write(file_object.content)
        return True

    # def auto_fill(self):

if __name__ == '__main__':
    url = 'http://***'
    order_id = '***'
    file_path = 'D:/For discipline/Get_excel'
    test = Download_file(url, order_id, file_path)
    test.download_save()

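Explanation

The selenium library visits the target page. In the _get_files_url method I locate the input box and the hyperlinks, then return the hyperlink addresses. After that, the download_save method fetches each file with requests.get and saves it locally. For details such as the storage directory and the file-name handling, just read the code.
Note that this is only one example and does not generalize: every page's front end is written differently, so each page needs its own analysis. I am not posting my target site because it involves my girlfriend's work, so it is not suitable to share.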
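The file-name handling mentioned above can be sketched on its own, without a browser. This is only an illustration under assumptions: it supposes the link ends in _<author>_<title>.zip, the helper name parse_download_name is hypothetical, and the example URL is made up. Unlike the script above, it takes the last two underscore-separated fields and leaves the .zip suffix out of the title.

```python
import re
from urllib.parse import unquote


def parse_download_name(url):
    """Return (author, title) from a link ending in _<author>_<title>.zip, else None."""
    # Decode the last path segment first so percent-encoded characters are readable.
    tail = unquote(url.rsplit('/', 1)[-1])
    match = re.search(r'_([^_]+)_([^_]+)\.zip$', tail)
    if match is None:
        return None
    return match.group(1), match.group(2)


# Hypothetical link, for illustration only.
example = 'http://example.com/files/1_alice_report.zip'
print(parse_download_name(example))
```

Returning None for links that do not match lets the caller skip them instead of failing on a missing regex match.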

Extracting the content and filling in the spreadsheet

Libraries used

This part uses the time, xlwt, urllib, re, pickle, os, zipfile, and BeautifulSoup libraries.

Code

#!/usr/bin/python3
# coding=utf-8
import os
import time
import xlwt
import zipfile
import re
import pickle
from bs4 import BeautifulSoup
from Download_files import Download_file


class get_excel():
    def __init__(self, file_path):
        self.file_path = file_path

    # Extract the target file from each archive
    def _unzip_files(self):
        '''
        Extract the target report from every zip file and return
        the list of archives that need to be processed.
        :return: (files_list, title_list)
        '''
        # Filter with a regex first so that no unrelated file gets processed
        zip_names = [x for x in os.listdir(self.file_path) if re.search(r'\.zip$', x)]
        files_list = []
        title_list = []
        for file in zip_names:
            title = file.split('.')[0].split('_')[1]
            with zipfile.ZipFile(os.path.join(self.file_path, file), 'r') as z:
                # Pick the target report out of the archive members
                target_file = [x for x in z.namelist() if re.search(r'比對報告\.html$', x)]
                if not target_file:
                    # No report in this archive, nothing to process
                    continue
                # z.read is the flexible part: it reads a member without extracting it
                contentb = z.read(target_file[0])
            # The painful part is that the return value is bytes, and even after
            # decode the regex would not match. Some things in the page still could
            # not be converted. So I store it and read it back later.
            report_path = os.path.join(self.file_path, title + '_' + '比對報告.html')
            if not os.path.exists(report_path):
                with open(report_path, 'wb') as fb:
                    pickle.dump(contentb, fb)
            # with open(self.file_path+'/'+target_file[0],'r',encoding='utf-8') as fa:
            #     contenta=fa.read()
            #     print(contenta)
            #     sentence=str(re.search(r'<b [^"]*red tahoma.*</b>$',contenta))
            #     value=re.search(r'\d.*%', sentence)
            #     info=[author,title,value]
            #     repetition_rate.append(info)
            files_list.append(file)
            title_list.append(target_file[0])
        return files_list, title_list

    # Read the content of the html files
    def read_html(self):
        '''
        The previous function already pulled the target files out of the archives,
        but reading html is awkward, so BeautifulSoup is used here to pick out
        the information I want and return it in a list.
        :return: list of [author, title, value]
        '''
        files_list, title_list = self._unzip_files()
        repetition_rate = []
        for file in files_list:
            # Author and title both go into the Excel sheet
            name_parts = file.split('.')[0].split('_')
            author = name_parts[0]
            title = name_parts[1]
            # The report has already been stored, so just read it back
            report_path = os.path.join(self.file_path, title + '_比對報告.html')
            with open(report_path, 'rb') as f:
                # The file was written with pickle.dump, so load it with pickle.load
                content = pickle.load(f)
            # BeautifulSoup usage; see the official docs if this is unclear
            content = BeautifulSoup(content, "html.parser")
            # After a lot of searching, this finally finds the repetition rate
            tag = content.find('b', {"class": "red tahoma"})
            value = tag.string if tag is not None else None
            repetition_rate.append([author, title, value])
        return repetition_rate

    def write_excel(self):
        '''
        Generate the xls spreadsheet.
        :return:
        '''
        workbook = xlwt.Workbook(encoding='utf-8')
        booksheet = workbook.add_sheet('Sheet1')
        # Set the borders
        borders = xlwt.Borders()  # Create Borders
        borders.left = xlwt.Borders.THIN   # DASHED: dashed line, NO_LINE: none, THIN: solid line
        borders.right = xlwt.Borders.THIN  # borders.right = 1 also means a solid line
        borders.top = xlwt.Borders.THIN
        borders.bottom = xlwt.Borders.THIN
        borders.left_colour = 0x40
        borders.right_colour = 0x40
        borders.top_colour = 0x40
        borders.bottom_colour = 0x40
        style1 = xlwt.XFStyle()
        style1.borders = borders
        # Set the background colour; all of this feels a lot like js and css
        pattern = xlwt.Pattern()
        pattern.pattern = xlwt.Pattern.SOLID_PATTERN
        pattern.pattern_fore_colour = 44
        style = xlwt.XFStyle()  # Create the Pattern
        style.pattern = pattern
        repetition_rate = self.read_html()
        # Header row: author, title, repetition rate
        booksheet.write(0, 0, '作者', style)
        booksheet.write(0, 1, '標題', style)
        booksheet.write(0, 2, '重複率', style)
        # enumerate gives each record its own row, even when two records are identical
        for row, item in enumerate(repetition_rate, start=1):
            booksheet.write(row, 0, item[0], style1)
            booksheet.write(row, 1, item[1], style1)
            booksheet.write(row, 2, item[2], style1)
        s = '重複率.xls'
        workbook.save(os.path.join(self.file_path, s))


if __name__ == '__main__':
    file_path = 'D:/For discipline/Get_excel'
    url = 'http://***'
    order_number = '***'
    download_dir = os.path.join(file_path, 'Download_Files')
    # Download only when the Download_Files folder does not exist yet
    if not os.path.exists(download_dir):
        get_file = Download_file(url, order_number, file_path)
        get_file.download_save()
    test = get_excel(download_dir)
    test.write_excel()

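Explanation

Because what I download are zip files, they have to be unpacked first. The library for that is zipfile. Note that it only opens the archive at run time and does not actually extract it into the directory. The archive contains a lot of files, so I used a regular expression to pick the single most suitable one (the one that keeps the amount of code to a minimum). Most of the work in this part goes into getting my target values, namely (author, title, repetition rate). That includes converting between byte streams and strings, and it was only because that conversion failed that I took the redundant step of saving the html file and reading the information back. Writing the information to Excel, by contrast, is not very complicated. I have mostly not explained the individual methods, because a quick Baidu search or a look at the official docs covers them. The main point here is to lay out my approach.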
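For comparison, here is a rough sketch of the in-memory route that the explanation above describes as having failed: read the report from the archive, decode it, and match it with a regex, without saving anything to disk. It rests on assumptions that may not hold for the real reports: that they are UTF-8 encoded, and that the value sits in a b tag with class "red tahoma" as in the script. The helper name read_rate_from_zip and the tiny test archive are hypothetical. One possible reason the commented-out regex never matched is its trailing $ anchor, which only matches at the end of the text, so this sketch leaves it out.

```python
import io
import re
import zipfile


def read_rate_from_zip(zip_source, suffix='.html'):
    """Return the text inside <b class="red tahoma"> of the first member ending in suffix."""
    with zipfile.ZipFile(zip_source, 'r') as z:
        names = [n for n in z.namelist() if n.endswith(suffix)]
        if not names:
            return None
        raw = z.read(names[0])  # bytes; nothing is written to disk
    # Assumption: the report is UTF-8. Undecodable bytes are replaced, not fatal.
    text = raw.decode('utf-8', errors='replace')
    # No '$' anchor: the tag sits in the middle of the page, not at its end.
    match = re.search(r'<b[^>]*class="red tahoma"[^>]*>(.*?)</b>', text, re.S)
    return match.group(1).strip() if match else None


# Build a tiny archive in memory so the sketch runs without any real files.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as z:
    z.writestr('report.html', '<html><b class="red tahoma">12.5%</b></html>')
buffer.seek(0)
print(read_rate_from_zip(buffer))
```

If the real reports use another encoding, such as GBK, the decode call would need that encoding name instead. Handing the raw bytes to BeautifulSoup, as the script does, avoids having to choose one by hand.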

