Reading a file with irregular line lengths line by line from multiple C++ threads

Posted by 火眼观世界 on 2024-03-02

Original URL: https://www.cnblogs.com/zhaobulw/p/18049072

C++ threads

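Reading a very large file, for example one of several hundred GB, with a single thread is very slow, so it is worth considering multiple threads. Threads that only read the same file do not need a lock, provided each thread opens its own file handle and therefore has its own read position.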
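As a minimal sketch of the lock-free claim above: each thread below opens its own std::ifstream on the same path and reads it to the end, and no mutex is involved. The function names count_lines and count_lines_concurrently are hypothetical helpers made up for this illustration; they are not part of the original code.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>
#include <thread>
#include <vector>

// Counts the lines of `path` with a stream that only the calling thread uses.
std::size_t count_lines(const std::string &path)
{
    std::ifstream file(path); // private stream: its own buffer and read position
    std::size_t lines = 0;
    std::string text;
    while (std::getline(file, text))
    {
        ++lines;
    }
    return lines;
}

// Runs count_lines on `thread_num` threads at the same time, without any lock.
std::vector<std::size_t> count_lines_concurrently(const std::string &path, int thread_num)
{
    std::vector<std::size_t> results(thread_num, 0);
    std::vector<std::thread> threads;
    for (int i = 0; i < thread_num; ++i)
    {
        // Each thread writes only to its own element of `results`.
        threads.emplace_back([&results, &path, i] { results[i] = count_lines(path); });
    }
    for (auto &t : threads)
    {
        t.join();
    }
    return results;
}
```

Every thread should report the same line count, because none of them shares a stream or a read position with another.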

How to read one file from multiple threads

Approach 1

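Open a file handle and get the total file size, file_size.
Decide how many bytes to read with threads, read_size, and how many threads to use, thread_num. On average each thread then reads each_size = read_size / thread_num bytes.
For each thread, compute its starting position start_pos and the starting position of the next thread, next_pos.
Each thread then faces one of the following cases: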

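start_pos equals 0 (the whole file is read with threads): read with getline right away, until the read position after some line exceeds next_pos.
start_pos > 0 and the character at start_pos is \n: reading that character moves the pointer to the start of the next line, so read with getline from there until the read position after some line exceeds next_pos.
start_pos > 0 and the character at start_pos is not \n: call getline once to consume the rest of the current line and discard it, because that line is read by an earlier thread. Let cur_pos be the position after this call. If cur_pos > next_pos, the thread exits without reading anything, because the next line starts beyond its range and is read by a later thread. If cur_pos <= next_pos, read with getline from the next line until the read position after some line exceeds next_pos.
Finally, the code has to handle the leftover bytes: read_size is not necessarily divisible by thread_num, and whatever remains is read by the main thread.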
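The range arithmetic in the steps above can be sketched as follows. This is an illustration under assumptions, not the article's code: Chunk and split_chunks are hypothetical names, and the sketch lets the last range absorb the leftover bytes, which is a variation on handing the remainder to the main thread.

```cpp
#include <cassert>
#include <vector>

// One thread's byte range: it starts at start_pos, and the next thread starts at next_pos.
struct Chunk
{
    long long start_pos;
    long long next_pos;
};

// Splits file_size bytes into at most thread_num contiguous ranges.
std::vector<Chunk> split_chunks(long long file_size, int thread_num)
{
    std::vector<Chunk> chunks;
    if (file_size <= 0 || thread_num <= 0)
    {
        return chunks;
    }
    if (thread_num > file_size)
    {
        thread_num = static_cast<int>(file_size); // keep every range at least one byte long
    }
    const long long each_size = file_size / thread_num;
    long long start = 0;
    for (int i = 0; i < thread_num; ++i)
    {
        // The last range ends at file_size, so it also covers the remainder.
        const long long next = (i == thread_num - 1) ? file_size : start + each_size;
        chunks.push_back({start, next});
        start = next;
    }
    return chunks;
}
```

For a 103-byte file and 10 threads this gives nine ranges of 10 bytes and a final range from 90 to 103.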

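This approach is easy to get wrong: the boundary handling has to make sure every line is read by exactly one thread, and it is safest when each thread's range contains at least one complete line.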
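Because the boundary cases are easy to get wrong, it may help to check the rule in memory before touching a real file. The sketch below works on a std::string instead of a file and assumes '\n' line endings; lines_owned, read_in_ranges and read_sequentially are hypothetical helpers written for this check. A range reads a line if the line starts at offset 0 of the first range, or if the '\n' just before the line lies inside the range.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Lines read by the range [start_pos, next_pos) of `data`.
std::vector<std::string> lines_owned(const std::string &data, std::size_t start_pos, std::size_t next_pos)
{
    std::vector<std::string> lines;
    if (next_pos <= start_pos || start_pos >= data.size())
    {
        return lines;
    }
    std::size_t cur = 0;
    if (start_pos != 0)
    {
        // Move to just past the first '\n' at or after start_pos.
        const std::size_t nl = data.find('\n', start_pos);
        if (nl == std::string::npos)
        {
            return lines; // the range lies inside the last line
        }
        cur = nl + 1;
    }
    while (cur <= next_pos && cur < data.size())
    {
        const std::size_t nl = data.find('\n', cur);
        if (nl == std::string::npos)
        {
            lines.push_back(data.substr(cur)); // last line without a trailing '\n'
            break;
        }
        lines.push_back(data.substr(cur, nl - cur));
        cur = nl + 1;
    }
    return lines;
}

// Collects the lines of all ranges, in range order, for thread_num ranges.
std::vector<std::string> read_in_ranges(const std::string &data, std::size_t thread_num)
{
    std::vector<std::string> all;
    if (thread_num == 0 || data.empty())
    {
        return all;
    }
    if (thread_num > data.size())
    {
        thread_num = data.size(); // every range gets at least one byte
    }
    const std::size_t each_size = data.size() / thread_num;
    for (std::size_t i = 0; i < thread_num; ++i)
    {
        const std::size_t start_pos = i * each_size;
        const std::size_t next_pos = (i + 1 == thread_num) ? data.size() : start_pos + each_size;
        const auto part = lines_owned(data, start_pos, next_pos);
        all.insert(all.end(), part.begin(), part.end());
    }
    return all;
}

// Reference result: plain single-threaded getline.
std::vector<std::string> read_sequentially(const std::string &data)
{
    std::istringstream in(data);
    std::vector<std::string> lines;
    std::string text;
    while (std::getline(in, text))
    {
        lines.push_back(text);
    }
    return lines;
}
```

Comparing read_in_ranges(data, n) with read_sequentially(data) for every n from 1 to data.size() exercises ranges that start on a '\n', in the middle of a line, and exactly at a line start.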

Source code

It may still contain bugs, but the basic functionality works.

#include "spdlog/sinks/basic_file_sink.h"
#include "spdlog/spdlog.h"
#include <chrono>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <thread>
#include <vector>
using namespace std;

void init_log()
{
    try
    {
        auto new_logger = spdlog::basic_logger_mt("new_default_logger", "test.log", true);
        spdlog::set_default_logger(new_logger);
        spdlog::info("new logger log start");
    }
    catch (const spdlog::spdlog_ex &ex)
    {
        std::cout << "Log init failed: " << ex.what() << std::endl;
    }
}

// Reads the lines that belong to the range [start_pos, next_pos).
// A line belongs to the range that contains the '\n' ending the previous line;
// the line at offset 0 belongs to the first range.
void thread_read_file(int tid, const string &file_path, std::streamoff start_pos, std::streamoff next_pos, std::streamoff each_size)
{
    if (next_pos <= start_pos)
    {
        return; // empty range (for example when each_size is 0): nothing to read
    }

    ifstream file(file_path.c_str(), ios::in);
    if (!file.good())
    {
        spdlog::error("thread {} failed to open file {}", tid, file_path);
        return;
    }

    file.seekg(start_pos, ios::beg);
    string text;
    if (start_pos != 0)
    {
        char cur_ch = 0;
        file.read(&cur_ch, 1); // moves the read pointer forward by one byte
        if (cur_ch != '\n')
        {
            // The range starts inside a line that an earlier thread reads, so skip the rest of it.
            getline(file, text);
            spdlog::info("thread {} skipped {}", tid, text);
        }
        // If cur_ch is '\n', the pointer is already at the start of the next line. Seeking back
        // by one byte here would make getline return a spurious empty line.
    }

    std::streamoff cur_pos = file.tellg(); // start of the next unread line, or -1 after a failed read
    /*
    cur_pos always points to the start of the next line. A line that starts exactly at next_pos
    still belongs to this thread, because the next thread never reads the line that contains its
    own start_pos. Hence the condition is cur_pos <= next_pos rather than cur_pos < next_pos.
     */
    while (cur_pos >= 0 && cur_pos <= next_pos && getline(file, text))
    {
        spdlog::info("thread {} start_pos={},next_pos={},each_size={},line start pos={},line length={},text={}",
                     tid, start_pos, next_pos, each_size, cur_pos, text.size(), text);
        cur_pos = file.tellg(); // -1 once the end of the file has been reached
    }
    spdlog::info("thread {} start_pos={},next_pos={},each_size={} finished\n", tid, start_pos, next_pos, each_size);
    file.close();
}

void test_detach(const string &file_path)
{
    // for (int i = 0; i < 10; ++i)
    // {
    //     std::thread th(thread_read_file, i, file_path);
    //     th.detach();
    // }
}

void test_join(const string &file_path)
{
    // Determine the file length by moving the read pointer to the end of the file.
    ifstream file(file_path.c_str(), ios::in);
    if (!file.good())
    {
        spdlog::error("failed to open file {}", file_path);
        return;
    }
    // std::streamoff (typically 64-bit) instead of int, so that sizes above 2 GB do not overflow
    std::streamoff file_size = file.seekg(0, ios::end).tellg();
    file.close();

    int thread_nums = 50;                               // number of threads
    std::streamoff each_size = file_size / thread_nums; // average number of bytes per thread
    std::streamoff start_pos = 0, next_pos = 0;         // start of this thread's range and of the next thread's range
    vector<std::thread> vec_threads;                    // thread list
    spdlog::info("thread_nums={},each_size={},file_size={}", thread_nums, each_size, file_size);
    int t_id = 0; // thread id
    for (; t_id < thread_nums; ++t_id)
    {
        next_pos += each_size;
        std::thread th(thread_read_file, t_id, file_path, start_pos, next_pos, each_size);
        vec_threads.emplace_back(std::move(th)); // push_back() is also OK
        start_pos = next_pos;
    }
    if (file_size % thread_nums != 0)
    {
        // The main thread reads the leftover bytes at the end of the file.
        thread_read_file(t_id, file_path, start_pos, file_size, each_size);
    }

    for (auto &it : vec_threads)
    {
        it.join();
    }
}
int main()
{
    init_log();
    string file_path = "./1.txt";
    // test_detach(file_path);
    // std::this_thread::sleep_for(std::chrono::seconds(1)); // wait for detached threads done
    test_join(file_path);
    return 0;
}

Approach 2

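The overall idea is the same as in approach 1. The difference is that a thread does not decide when to stop by comparing file positions; it keeps a running count of the bytes it has consumed.
Every time the read pointer moves, the thread adds the distance moved to that count. Since the average number of bytes per thread is computed in advance, the thread stops as soon as the count exceeds that size or it reaches the end of the file.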
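The byte-counting idea can be tried on an in-memory stream first. The sketch below is an illustration under assumptions rather than the article's implementation: read_with_budget is a hypothetical name, the input is taken to use '\n' line endings with no newline translation, and each line is counted as its length plus one byte for the '\n'.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Reads the lines of a range of `budget` bytes that starts at `start_pos`.
// The stop condition uses a running byte count instead of tellg().
std::vector<std::string> read_with_budget(std::istream &in, long long start_pos, long long budget)
{
    std::vector<std::string> lines;
    if (budget <= 0)
    {
        return lines; // nothing assigned to this range
    }
    in.clear(); // drop eofbit/failbit left over from an earlier range
    in.seekg(start_pos, std::ios::beg);
    std::string text;
    long long consumed = 0; // bytes consumed since start_pos
    if (start_pos != 0)
    {
        char ch = 0;
        if (!in.get(ch))
        {
            return lines;
        }
        consumed = 1;
        if (ch != '\n')
        {
            // Skip the rest of a line that an earlier range reads.
            if (!std::getline(in, text))
            {
                return lines;
            }
            consumed += static_cast<long long>(text.size()) + 1;
        }
    }
    // A line is read only if it starts within the budget.
    while (consumed <= budget && std::getline(in, text))
    {
        lines.push_back(text);
        consumed += static_cast<long long>(text.size()) + 1;
    }
    return lines;
}
```

One reason to count bytes is that the count stays usable after the last line has been read, whereas tellg() reports -1 once the stream has hit the end of the file.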

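Source code

This version is meant to cope with several threads being assigned ranges inside the same line, but the number of bytes assigned to each thread must be greater than 0.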

#include <chrono>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <thread>
#include <vector>
using namespace std;

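// Reads the lines that belong to the range [start_pos, next_pos) and decides when to stop by
// counting the bytes consumed so far. Assumes '\n' line endings without newline translation.
void thread_read_file(int tid, const string &file_path, std::streamoff start_pos, std::streamoff next_pos, std::streamoff each_size)
{
    const std::streamoff budget = next_pos - start_pos; // bytes assigned to this thread
    if (budget <= 0)
    {
        return; // nothing assigned
    }

    ifstream file(file_path.c_str(), ios::in);
    if (!file.good())
    {
        stringstream ss;
        ss << "Thread " << tid << " failed to open file: " << file_path << endl;
        cout << ss.str();
        return;
    }

    file.seekg(start_pos, ios::beg);
    string text;
    std::streamoff consumed = 0; // bytes consumed since start_pos

    if (start_pos != 0)
    {
        char cur_ch = 0;
        if (!file.read(&cur_ch, 1))
        {
            return;
        }
        consumed = 1;
        if (cur_ch != '\n')
        {
            // Skip the rest of a line that an earlier thread reads.
            if (!getline(file, text))
            {
                return;
            }
            consumed += static_cast<std::streamoff>(text.size()) + 1;
        }
    }

    // A line is read only if it starts within the budget, so threads whose ranges lie inside
    // the same long line do not all read the line that follows it.
    while (consumed <= budget && getline(file, text))
    {
        stringstream ss; // a fresh stream per line, so earlier lines are not printed again
        ss << "Thread " << tid << ", start_pos=" << start_pos << ";next_pos="
           << next_pos << ";each_size=" << each_size << ": " << text << endl;
        cout << ss.str();
        consumed += static_cast<std::streamoff>(text.size()) + 1;
    }
    file.close();
}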

void test_detach(const string &file_path)
{
    // for (int i = 0; i < 10; ++i)
    // {
    //     std::thread th(thread_read_file, i, file_path);
    //     th.detach();
    // }
}

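void test_join(const string &file_path)
{
    // Determine the file length by moving the read pointer to the end of the file.
    ifstream file(file_path.c_str(), ios::in);
    if (!file.good())
    {
        cout << "Failed to open file: " << file_path << endl;
        return;
    }
    // std::streamoff (typically 64-bit) instead of int, so that sizes above 2 GB do not overflow
    std::streamoff file_size = file.seekg(0, ios::end).tellg();
    file.close();

    int thread_nums = 10;                               // number of threads
    std::streamoff each_size = file_size / thread_nums; // average number of bytes per thread
    std::streamoff start_pos = 0, next_pos = 0;

    vector<std::thread> vec_threads;
    int t_id = 0;
    for (; t_id < thread_nums; ++t_id)
    {
        next_pos += each_size;
        std::thread th(thread_read_file, t_id, file_path, start_pos, next_pos, each_size);
        vec_threads.emplace_back(std::move(th)); // push_back() is also OK
        start_pos = next_pos;
    }
    if (file_size % thread_nums != 0)
    {
        // The main thread reads the leftover bytes, up to the end of the file.
        thread_read_file(t_id, file_path, start_pos, file_size, each_size);
    }

    for (auto &it : vec_threads)
    {
        it.join();
    }
}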
int main()
{
    string file_path = "./1.txt";
    // test_detach(file_path);
    // std::this_thread::sleep_for(std::chrono::seconds(1)); // wait for detached threads done
    test_join(file_path);
    return 0;
}

Results

(Screenshot of the program output; image not preserved.)
