Scraping Biquge (筆趣閣) novel data with BeautifulSoup
# -*- coding: utf-8 -*-
# author : heart
# blog_url : https://www.cnblogs.com/ssrheart/
# time : 2024/3/30
import os
import random
import time

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

headers = {
    'User-Agent': UserAgent().random,
}
proxies = {
    'http': 'http://221.6.139.190:9002'
}
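# NOTE: the proxy above is a free public endpoint copied from the original
# post; it may well be dead by now. Swap in your own proxy, or drop the
# proxies= argument to connect directly.
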
def spider_title(url):
    """Fetch the chapter list page and return [{'href': ..., 'title': ...}]."""
    html = requests.get(url=url, headers=headers, proxies=proxies).text
    soup = BeautifulSoup(html, 'lxml')
    dd_list = soup.find('div', class_='listmain').find_all('dd')
    title_list = []
    for dd in dd_list:
        # Skip the "expand all chapters" placeholder row.
        if '<<---展開全部章節--->>' in dd.text:
            continue
        href = 'https://www.bqgbb.cc' + dd.a.get('href')
        title_list.append({
            'href': href,
            'title': dd.a.text,
        })
    return title_list

def spider_content(url):
    """Fetch a single chapter page and return its body text."""
    html = requests.get(url=url, headers=headers, proxies=proxies).text
    soup = BeautifulSoup(html, 'lxml')
    content = soup.find('div', class_='Readarea ReadAjax_content').text
    return content

def save(title, content):
    """Write one chapter to xiaoshuo/<title>.txt next to this script."""
    base_dir = os.path.dirname(__file__)
    folder = os.path.join(base_dir, 'xiaoshuo')
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, f'{title}.txt')
    with open(path, 'w', encoding='utf-8') as f:
        f.write(content)
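# Caveat (my addition, not in the original script): chapter titles scraped from
# the site can contain characters that are illegal in filenames (e.g. '/', '?').
# A defensive save() would sanitize the title first, along the lines of:
#     safe_title = re.sub(r'[\\/:*?"<>|]', '_', title)  # needs `import re`
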
def main():
    chapters = spider_title(url='https://www.bqgbb.cc/book/11174/')
    for index, data in enumerate(chapters, start=1):
        title = data['title']
        href = data['href']
        time.sleep(random.randint(1, 3))  # throttle requests politely
        content = spider_content(url=href)
        save(title, content)
        print(f'[{index}/{len(chapters)}] {title} downloaded')


if __name__ == '__main__':
    main()
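
Requests routed through a free public proxy fail often, and the script above dies on the first timeout. A minimal retry sketch (the fetch_html helper and its retries/delay parameters are my own assumptions, not part of the original post):

import time

import requests
from fake_useragent import UserAgent

def fetch_html(url, retries=3, delay=2):
    # Try a few times, backing off a little longer after each failure.
    for attempt in range(1, retries + 1):
        try:
            resp = requests.get(
                url,
                headers={'User-Agent': UserAgent().random},
                proxies={'http': 'http://221.6.139.190:9002'},
                timeout=10,
            )
            resp.raise_for_status()  # surface HTTP 4xx/5xx as exceptions
            return resp.text
        except requests.RequestException:
            if attempt == retries:
                raise
            time.sleep(delay * attempt)

spider_title and spider_content could then call fetch_html(url) instead of requests.get(...).text.
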
Scraping Douban Top 250 data with XPath
# -*- coding: utf-8 -*-
# author : heart
# blog_url : https://www.cnblogs.com/ssrheart/
# time : 2024/3/31
import requests
from fake_useragent import UserAgent
from lxml import etree
class SpiderDB:
    def __init__(self):
        self.headers = {
            'User-Agent': UserAgent().random,
        }
        self.proxies = {
            'http': 'http://221.6.139.190:9002'
        }

    def spider_tag(self):
        """Build the 10 paginated list URLs: start=0, 25, ..., 225."""
        tagurl_list = []
        for i in range(250 // 25):
            if i == 0:
                tag_url = 'https://movie.douban.com/top250'
            else:
                tag_url = f'https://movie.douban.com/top250?start={i * 25}'
            tagurl_list.append(tag_url)
        return tagurl_list

    def spider_info(self, url):
        """Parse one list page and return a list of per-movie dicts."""
        response = requests.get(url=url, headers=self.headers, proxies=self.proxies).text
        tree = etree.HTML(response)
        info = tree.xpath('//li/div[@class="item"]/div[@class="info"]')
        data_list = []
        for i in info:
            # Some fields are missing for a few entries, hence the fallbacks.
            try:
                title = i.xpath('./div[1]/a/span[1]/text()')[0].strip()
            except IndexError:
                title = ''
            try:
                title_eng = i.xpath('./div[1]/a/span[2]/text()')[0].replace('\xa0', '').strip()
            except IndexError:
                title_eng = ''
            try:
                other_title = i.xpath('./div[1]/a/span[3]/text()')[0].replace('\xa0', '').strip()
            except IndexError:
                other_title = ''
            actor = i.xpath('./div[2]/p/text()')[0].replace('\xa0', '').strip()
            publish_time = i.xpath('./div[2]/p/text()')[1].replace('\xa0', '').strip()
            score = i.xpath('./div[2]/div/span[2]/text()')[0]
            # Strip the trailing 3-character '人评价' ("people rated") suffix.
            pingjia_people = i.xpath('./div[2]/div/span[4]/text()')[0][0:-3]
            try:
                quote = i.xpath('./div[2]/p[@class="quote"]/span/text()')[0]
            except IndexError:
                quote = ''
            data_list.append({
                'title': title,
                'title_eng': title_eng,
                'other_title': other_title,
                'actor': actor,
                'publish_time': publish_time,
                'score': score,
                'pingjia_people': pingjia_people,
                'quote': quote,
            })
        return data_list

    def main(self):
        tag_urls = self.spider_tag()
        data_list_all = []
        for url in tag_urls:
            res = self.spider_info(url)
            data_list_all.extend(res)
        print(len(data_list_all))  # 250
        # Return the data so callers can persist it (the original only printed).
        return data_list_all

if __name__ == '__main__':
    spider = SpiderDB()
    spider.main()
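
As written, the script only prints the record count and discards the scraped data. A short persistence sketch (the save_csv helper and the douban_top250.csv filename are my assumptions, not in the original post), relying on main() returning data_list_all as above:

import csv

def save_csv(data_list, path='douban_top250.csv'):
    # utf-8-sig so spreadsheet apps detect the encoding and render Chinese titles
    with open(path, 'w', newline='', encoding='utf-8-sig') as f:
        writer = csv.DictWriter(f, fieldnames=data_list[0].keys())
        writer.writeheader()
        writer.writerows(data_list)

spider = SpiderDB()
save_csv(spider.main())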