Data Collection Assignment 4

Posted by 小鹿的博客 on 2024-11-26

Assignment 1

Code and screenshots:

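import time
from selenium.webdriver.common.by import By


def spider(page_num):
    # Scrape page_num pages; driver and spiderOnePage() are defined elsewhere in the script
    cnt = 0
    while cnt < page_num:
        spiderOnePage()
        # Click the "next page" button to move on to the following page
        driver.find_element(By.XPATH, '//a[@class="next paginate_button"]').click()
        cnt += 1
    time.sleep(2)  # Short pause after the last click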


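Reflections:

This assignment gave me a deeper understanding of how Selenium and a database work together. It also made me realize how much handling timeout exceptions and tidying up the code structure matter for keeping the program running reliably.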
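The timeout handling mentioned here can be illustrated without a browser. The sketch below is a standard-library stand-in, not the assignment's code: `wait_until`, `WaitTimeout`, and `scrape_with_timeout` are hypothetical names. It polls a condition until it returns a truthy value or a deadline passes, which is the same idea Selenium's `WebDriverWait(...).until(...)` applies to page elements.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition does not become truthy in time."""


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Hypothetical helper: returns the value as soon as it is available,
    or raises WaitTimeout once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout} seconds")
        time.sleep(poll)


def scrape_with_timeout(condition, timeout=10.0, poll=0.5, default=None):
    """Return the awaited value, or `default` if the wait times out."""
    try:
        return wait_until(condition, timeout, poll)
    except WaitTimeout as exc:
        # Log and carry on instead of letting one slow page stop the run
        print(f"skipped: {exc}")
        return default
```

In a scraper, the condition would typically be a small function that looks up an element and returns its text. Catching the timeout lets the loop skip one slow page instead of aborting the whole crawl.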

Gitee folder link: click here

Assignment 2

Code and screenshots:

def spiderOnePage():
    time.sleep(5)  # Wait for the page to finish loading
    courses = driver.find_elements(By.XPATH, '//*[@id="channel-course-list"]/div/div/div[2]/div[1]/div')
    current_window_handle = driver.current_window_handle
    for course in courses:
        cCourse = course.find_element(By.XPATH, './/h3').text  # Course name
        cCollege = course.find_element(By.XPATH, './/p[@class="_2lZi3"]').text  # University name
        cTeacher = course.find_element(By.XPATH, './/div[@class="_1Zkj9"]').text  # Lead instructor
        cCount = course.find_element(By.XPATH, './/div[@class="jvxcQ"]/span').text  # Number of enrolled learners
        cProcess = course.find_element(By.XPATH, './/div[@class="jvxcQ"]/div').text  # Course progress

        course.click()  # Open the course detail page, which loads in a new tab
        Handles = driver.window_handles  # Handles of all tabs currently open in the browser
        driver.switch_to.window(Handles[1])  # Switch to the new tab
        time.sleep(5)  # Wait for the page to finish loading

        # Scrape the course details
        # cBrief = WebDriverWait(driver,10,0.48).until(EC.presence_of_element_located((By.ID,'j-rectxt2'))).text
        cBrief = driver.find_element(By.XPATH, '//*[@id="j-rectxt2"]').text
        if len(cBrief) == 0:
            # Fall back to concatenating the text of the content section when the brief is empty
            cBriefs = driver.find_elements(By.XPATH, '//*[@id="content-section"]/div[4]/div//*')
            cBrief = ""
            for c in cBriefs:
                cBrief += c.text

        # The INSERT below binds its values as parameters, so quotes need no manual escaping
        cBrief = cBrief.strip()

        # Scrape the teaching team
        nameList = []
        cTeachers = driver.find_elements(By.XPATH, '//div[@class="um-list-slider_con_item"]')
        for Teacher in cTeachers:
            name = Teacher.find_element(By.XPATH, './/h3[@class="f-fc3"]').text.strip()
            nameList.append(name)
        # While the slider shows a "next" button, click it and keep scraping
        nextButton = driver.find_elements(By.XPATH, '//div[@class="um-list-slider_next f-pa"]')
        while len(nextButton) != 0:
            nextButton[0].click()
            time.sleep(3)
            cTeachers = driver.find_elements(By.XPATH, '//div[@class="um-list-slider_con_item"]')
            for Teacher in cTeachers:
                name = Teacher.find_element(By.XPATH, './/h3[@class="f-fc3"]').text.strip()
                nameList.append(name)
            nextButton = driver.find_elements(By.XPATH, '//div[@class="um-list-slider_next f-pa"]')
        cTeam = ','.join(nameList)

        driver.close()  # Close the new tab
        driver.switch_to.window(current_window_handle)  # Switch back to the original page
        try:
            # Assumes a MySQL driver such as pymysql, whose parameter placeholder is %s
            cursor.execute('INSERT INTO courseMessage VALUES (%s,%s,%s,%s,%s,%s,%s)',
                           (cCourse, cCollege, cTeacher, cTeam, cCount, cProcess, cBrief))
            db.commit()
        except Exception as e:
            print(e)

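Reflections:

Through this assignment I became proficient with Selenium operations such as locating elements, simulating login, and scraping Ajax-loaded data. I also gained a better appreciation of how convenient and practical it is to store the scraped information in MySQL.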
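Storing scraped text safely mostly comes down to letting the database driver bind the values. The following is a minimal sketch, not the assignment's code: it uses Python's built-in `sqlite3` as a stand-in for MySQL so that it runs anywhere, the column names are assumed from the scraper's variable names, and the sample row is made up. Note that `sqlite3` uses `?` placeholders, whereas MySQL drivers such as pymysql use `%s`.

```python
import sqlite3


def save_course(conn, row):
    """Insert one course record using bound parameters."""
    conn.execute(
        "INSERT INTO courseMessage VALUES (?, ?, ?, ?, ?, ?, ?)",
        row,
    )
    conn.commit()


# In-memory database used purely for illustration
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE courseMessage ("
    "cCourse TEXT, cCollege TEXT, cTeacher TEXT, cTeam TEXT, "
    "cCount TEXT, cProcess TEXT, cBrief TEXT)"
)

# Made-up sample row; the brief contains both quote characters on purpose
sample = (
    "Python Basics", "Example University", "A. Teacher",
    "A. Teacher,B. Teacher", "1200 enrolled", "In progress",
    'It\'s a "hands-on" course',
)
save_course(conn, sample)

# The text comes back exactly as it was scraped, quotes included
stored = conn.execute("SELECT cBrief FROM courseMessage").fetchone()[0]
print(stored)
```

Because the values travel separately from the SQL string, single and double quotes in a course description cannot break the statement, and no backslash escaping is required beforehand.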

Gitee folder link: click here

Assignment 3

Screenshots:

  • Enable the MapReduce service:

  • Generate test data with a Python script:

  • Configure Kafka:

  • Install the Flume client:

  • Configure Flume to collect data:
