This uses selenium and BeautifulSoup.
First, as always, the most basic initialization code:
baseURL = "http://xwxmovie.cn/"
headers = {
    'Host': 'xwxmovie.cn',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/64.0.3282.167 Safari/537.36'
}

def browser_get():
    browser = webdriver.Chrome()
    browser.get(baseURL)
    html_text = browser.page_source
    page_count = get_page_count(html_text)
    get_page_data(html_text)
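Note that the headers dict is never actually passed to webdriver.Chrome(); selenium drives a real browser that sends its own headers. If you wanted the headers to take effect, a plain requests call would look roughly like the sketch below (using requests here is my own assumption, not part of the original script):

import requests

# Hypothetical alternative: fetch the page with requests so the headers dict is actually used
response = requests.get(baseURL, headers=headers)
response.encoding = response.apparent_encoding  # guard against mis-detected encoding
html_text = response.text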
At first I wanted to extract the fragments with BeautifulSoup alone, but since I had only just started learning it and was not yet familiar with many of its APIs, I ended up using a regular expression to match the region I wanted first, then using BeautifulSoup to pick out the movie title and other details:
items = re.findall(re.compile('<div id="post-.*?class="post-.*?style="position:.*?>'
                              '.*?<div class="pinbin-image">(.*?)</div>'
                              '.*?<div class="pinbin-category">(.*?)</div>'
                              '.*?<div class="pinbin-copy">(.*?)</div>'
                              '.*?</div>', re.S), html)
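For reference, the same region could also be selected directly with BeautifulSoup instead of a regular expression. A minimal sketch, assuming the class names pinbin-image / pinbin-category / pinbin-copy seen in the regex above (html is the page source from selenium):

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
# each movie card is a div whose id starts with "post-"
for post in soup.find_all('div', id=re.compile(r'^post-')):
    image_div = post.find('div', class_='pinbin-image')        # corresponds to group 1 of the regex
    category_div = post.find('div', class_='pinbin-category')  # corresponds to group 2
    copy_div = post.find('div', class_='pinbin-copy')          # corresponds to group 3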
Back to the regex result: now we loop over the matches one by one and pull out what we want:
for item in items:
    if item[0].strip():
        soup = BeautifulSoup(item[0].strip(), 'html.parser')
        img = soup.find('img', attrs={'class': 'attachment-detail-image wp-post-image'})
        # poster image
        print("Poster: " + img.get('src'))
    if item[1].strip():
        soup = BeautifulSoup(item[1].strip(), 'html.parser')
        categorys = soup.find_all('a')
        for category in categorys:
            print(category.get_text())
    if item[2].strip():
        soup = BeautifulSoup(item[2].strip(), 'html.parser')
        title = soup.find('a', attrs={'class': 'front-link'})
        print("Title: " + title.get_text())
        print("Link: " + title.get('href'))
        date = soup.find('p', attrs={'class': 'pinbin-date'})
        print("Date: " + date.get_text())
        brief = soup.find_all('p')
        print("Synopsis: " + brief[1].string)
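Instead of printing each field, the same loop could collect the results into a list of dicts for later use. A rough sketch with my own key names and some defensive None checks (not in the original code):

# Hypothetical variant: gather the fields instead of printing them
movies = []
for item in items:
    movie = {}
    if item[0].strip():
        soup = BeautifulSoup(item[0].strip(), 'html.parser')
        img = soup.find('img', attrs={'class': 'attachment-detail-image wp-post-image'})
        if img is not None:  # skip posts without a poster image
            movie['poster'] = img.get('src')
    if item[2].strip():
        soup = BeautifulSoup(item[2].strip(), 'html.parser')
        title = soup.find('a', attrs={'class': 'front-link'})
        if title is not None:
            movie['title'] = title.get_text()
            movie['url'] = title.get('href')
    movies.append(movie)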
Either way, that gives us the data for a single page.
If we want all of it, we first need the total number of pages and then loop over every page:
# Get the total page count
def get_page_count(html):
    soup = BeautifulSoup(html, 'html.parser')
    page_count = soup.find('span', attrs={'class': 'pages'})
    # slice the two-digit total out of the end of the pager text
    return int(page_count.get_text()[-4:-2])
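browser_get above only parses the first page and never actually uses page_count. To crawl every page you would loop from 2 to page_count and load each page in turn; a sketch assuming WordPress-style /page/N/ URLs (that URL pattern is my assumption, check it against the actual site):

def browser_get_all():
    browser = webdriver.Chrome()
    browser.get(baseURL)
    html_text = browser.page_source
    page_count = get_page_count(html_text)
    get_page_data(html_text)  # page 1
    for page in range(2, page_count + 1):
        # assumed pagination URL: http://xwxmovie.cn/page/2/, /page/3/, ...
        browser.get(baseURL + "page/%d/" % page)
        get_page_data(browser.page_source)
    browser.quit()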
The complete code is as follows:
# -*- coding: UTF-8 -*-
from selenium import webdriver
from bs4 import BeautifulSoup
import re

baseURL = "http://xwxmovie.cn/"
# kept for reference only; webdriver.Chrome() does not use this dict
headers = {
    'Host': 'xwxmovie.cn',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/64.0.3282.167 Safari/537.36'
}


def browser_get():
    browser = webdriver.Chrome()
    browser.get(baseURL)
    html_text = browser.page_source
    page_count = get_page_count(html_text)
    get_page_data(html_text)


# Get the total page count
def get_page_count(html):
    soup = BeautifulSoup(html, 'html.parser')
    page_count = soup.find('span', attrs={'class': 'pages'})
    return int(page_count.get_text()[-4:-2])


def get_page_data(html):
    items = re.findall(re.compile('<div id="post-.*?class="post-.*?style="position:.*?>'
                                  '.*?<div class="pinbin-image">(.*?)</div>'
                                  '.*?<div class="pinbin-category">(.*?)</div>'
                                  '.*?<div class="pinbin-copy">(.*?)</div>'
                                  '.*?</div>', re.S), html)
    for item in items:
        if item[0].strip():
            soup = BeautifulSoup(item[0].strip(), 'html.parser')
            img = soup.find('img', attrs={'class': 'attachment-detail-image wp-post-image'})
            # poster image
            print("Poster: " + img.get('src'))
        if item[1].strip():
            soup = BeautifulSoup(item[1].strip(), 'html.parser')
            categorys = soup.find_all('a')
            for category in categorys:
                print(category.get_text())
        if item[2].strip():
            soup = BeautifulSoup(item[2].strip(), 'html.parser')
            title = soup.find('a', attrs={'class': 'front-link'})
            print("Title: " + title.get_text())
            print("Link: " + title.get('href'))
            date = soup.find('p', attrs={'class': 'pinbin-date'})
            print("Date: " + date.get_text())
            brief = soup.find_all('p')
            print("Synopsis: " + brief[1].string)


if __name__ == '__main__':
    browser_get()