zf_Scraping web page data with Selenium in feapder

Posted by 我爱你的 on 2024-06-03

Original URL: https://www.cnblogs.com/Lhptest/p/18229742

"http://www.ccgp.gov.cn/cggg/dfgg/"
# For personal study only; do not use for any other purpose


# Title: title
# Publication time: publish_time
# Region: location
# Purchaser: purchaser
# Procurement URL: url





# Prerequisites: configure the database and install the feapder library
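
The script below calls `MysqlDB()` with no arguments, so the connection details have to come from feapder's settings. A minimal sketch of a `setting.py` placed next to the spider script might look like the following. Every value here is a placeholder (the database name `spider_data` and the credentials are assumptions, not the author's configuration). Installing with `pip install "feapder[render]"` should pull in the Selenium dependencies needed for `render=True`.

```python
# setting.py: MySQL connection settings read by feapder's MysqlDB()
# All values are placeholders for illustration, not the author's real config.
MYSQL_IP = "localhost"
MYSQL_PORT = 3306
MYSQL_DB = "spider_data"       # hypothetical database name
MYSQL_USER_NAME = "root"       # placeholder user
MYSQL_USER_PASS = "change-me"  # placeholder password
```

The table `zf_table` presumably has to exist before the spider runs, with columns matching the item keys used later (`title`, `publish_time`, `location`, `purchaser`, `url`).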



import random
import time

import feapder
from feapder.db.mysqldb import MysqlDB
from feapder.utils.webdriver import WebDriver
from parsel import Selector



class TestRender(feapder.AirSpider):
    db = MysqlDB()

    __custom_setting__ = dict(
        WEBDRIVER=dict(
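            pool_size=1,  # number of browsers
            load_images=True,  # whether to load images
            user_agent=None,  # a string, or a no-argument function that returns the user agent
            proxy=None,  # xxx.xxx.xxx.xxx:xxxx, or a no-argument function that returns the proxy address
            headless=False,  # whether to run the browser headless
            driver_type="CHROME",  # CHROME, EDGE, PHANTOMJS, FIREFOX
            timeout=30,  # request timeout in seconds
            window_size=(1024, 800),  # window size
            executable_path=None,  # browser driver path; None uses the default path
            render_time=0,  # render time: wait this long after opening the page before reading the source
            custom_argument=["--ignore-certificate-errors"],  # custom browser arguments
            # xhr_url_regexes=[
            #     "/ad",
            # ],  # intercept the http://www.spidertools.cn/spidertools/ad endpoint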
        )
    )

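    def start_requests(self):
        base_url = 'https://www.ccgp.gov.cn/cggg/dfgg/'
        for i in range(5):
            # The first list page is index.htm; later pages are index_1.htm, index_2.htm, ...
            if i == 0:
                url = base_url + 'index.htm'
            else:
                url = base_url + f'index_{i}.htm'
            # base_url is a custom attribute carried along on the request
            yield feapder.Request(url, render=True, base_url=base_url)


    def parse(self, request, response):
        browser: WebDriver = response.browser
        # Pause a random 3-5 seconds before reading the rendered page
        time.sleep(random.randint(3, 5))
        hrefs = Selector(browser.page_source).xpath('//ul[@class="c_list_bid"]/li/a/@href').extract()
        for href in hrefs:
            # Detail links are relative ('./...'): strip the prefix and join with the base URL
            detail_url = request.base_url + href.replace('./', '')
            print(detail_url)
            # browser=browser passes the browser handle on to the next page's callback
            yield feapder.Request(url=detail_url, render=True, callback=self.parse_1, browser=browser)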


    def parse_1(self, request, response):
        time.sleep(random.randint(1, 2))
        # Get the browser handle passed along from parse()
        browser = request.browser


        # Title: title
        # Publication time: publish_time
        # Region: location
        # Purchaser: purchaser
        # Procurement URL: url
        # Notice body: text
        tbody = Selector(browser.page_source).xpath('//div[@class="table"]/table/tbody')
        item = {}

        # Full XPath: //*[@id="detail"]/div[2]/div/div[2]/div/div[2]/table/tbody/tr[2]
        item["title"] = tbody.xpath('./tr[2]/td[@colspan="3"]/text()').get('').strip()
        # Full XPath: //*[@id="detail"]/div[2]/div/div[2]/div/div[2]/table/tbody/tr[5]/td[4]
        item["publish_time"] = tbody.xpath('./tr[5]/td[4]/text()').get('').strip()

        # Full XPath: //*[@id="detail"]/div[2]/div/div[2]/div/div[2]/table/tbody/tr[11]
        item["location"] = tbody.xpath('./tr[11]/td[2]/text()').get('').strip()

        # Full XPath: //*[@id="detail"]/div[2]/div/div[2]/div/div[2]/table/tbody/tr[4]/td[2]
        item["purchaser"] = tbody.xpath('./tr[4]/td[2]/text()').get('').strip()
        item["url"] = request.url
        # item["text"] = tbody.xpath('./tr[2]/td[@colspan="3"]/text()').get('').strip()


        print(item)
        # Write to the database
        self.db.add_smart("zf_table", item)



if __name__ == "__main__":
    TestRender().start()
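
The spider builds each detail URL by removing `./` from the href and appending the rest to the base URL. That works for hrefs of the form `./...`, but an href such as `../x` would turn into `.x`. As a hedged alternative, the standard library's `urljoin` resolves relative links the way a browser would. The sketch below is independent of feapder; the helper names `list_page_url` and `detail_url` and the sample hrefs are made up for illustration.

```python
from urllib.parse import urljoin

# Base URL of the list pages, taken from the spider above
BASE_URL = "https://www.ccgp.gov.cn/cggg/dfgg/"


def list_page_url(page: int) -> str:
    """Return the URL of list page `page` (0-based), same pattern as start_requests."""
    name = "index.htm" if page == 0 else f"index_{page}.htm"
    return urljoin(BASE_URL, name)


def detail_url(href: str, base: str = BASE_URL) -> str:
    """Resolve an href from the list page against the base URL.

    Handles './x', '../x', and absolute URLs, unlike plain string replacement.
    """
    return urljoin(base, href)


if __name__ == "__main__":
    for page in range(5):
        print(list_page_url(page))
    # Hypothetical hrefs, not taken from the live site
    print(detail_url("./gkzb/202406/t20240603_1.htm"))
    print(detail_url("../zygg/a.htm"))
```

Inside the spider, the same call could replace the string concatenation in `parse`, provided `urljoin` is imported at the top of the script.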
