External Sorting

Posted by AH20 on 2024-10-06

Original URL: http://blog.itpub.net/29254281/viewspace-1167988/

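When the file to be sorted is larger than memory, it has to be sorted in chunks: each chunk is sorted and written to its own file, and the sorted chunk files are then merged.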

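Phase 1: split and sort
1. Split the original unsorted large file into sorted small files
    1.1 Determine where to split the large file
    1.2 The main thread reads the original file chunk by chunk and hands each chunk to the sorting thread
    1.3 The sorting thread sorts the chunk
    1.4 The writer thread writes the sorted data to a file

Phase 2: merge
    Merge the temporary files into the final result file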
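
The two phases can be sketched in miniature as follows. This is only an illustration under simplifying assumptions, not the implementation given later in this post: the class name ExternalSortSketch and its methods are hypothetical, each inner array stands in for one chunk of the large file, the merge uses a PriorityQueue, and the merged output is collected in a list instead of being streamed to a result file.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

class ExternalSortSketch {

    // Phase 1: sort one chunk in memory and spill it to a temporary run file of raw 8-byte longs.
    static Path writeRun(long[] chunk) throws IOException {
        Arrays.sort(chunk);
        Path run = Files.createTempFile("run", ".bin");
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(run)))) {
            for (long v : chunk) {
                out.writeLong(v);
            }
        }
        return run;
    }

    // Returns the next long of a run, or null once the run is exhausted.
    static Long next(DataInputStream in) throws IOException {
        try {
            return in.readLong();
        } catch (EOFException e) {
            return null;
        }
    }

    // Phase 2: k-way merge. The heap holds one {value, runIndex} pair per non-empty run.
    static List<Long> merge(List<Path> runs) throws IOException {
        List<DataInputStream> inputs = new ArrayList<>();
        PriorityQueue<long[]> heap = new PriorityQueue<>((a, b) -> Long.compare(a[0], b[0]));
        List<Long> merged = new ArrayList<>();
        try {
            for (int i = 0; i < runs.size(); i++) {
                DataInputStream in = new DataInputStream(
                        new BufferedInputStream(Files.newInputStream(runs.get(i))));
                inputs.add(in);
                Long first = next(in);
                if (first != null) {
                    heap.add(new long[] {first, i});
                }
            }
            while (!heap.isEmpty()) {
                long[] top = heap.poll();           // smallest head among all runs
                merged.add(top[0]);
                Long v = next(inputs.get((int) top[1]));
                if (v != null) {
                    heap.add(new long[] {v, top[1]}); // refill from the same run
                }
            }
        } finally {
            for (DataInputStream in : inputs) {
                in.close();
            }
        }
        return merged;
    }

    // Runs both phases; each inner array stands in for one chunk of the big file.
    static List<Long> sort(long[][] chunks) {
        List<Path> runs = new ArrayList<>();
        try {
            for (long[] chunk : chunks) {
                runs.add(writeRun(chunk));
            }
            return merge(runs);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            for (Path run : runs) {
                run.toFile().delete(); // clean up the temporary run files
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(sort(new long[][] {{5, 1, 9}, {4, 4, -2}, {7}}));
    }
}
```

With k runs, the heap makes each output element cost O(log k), which matters more as the number of chunks grows.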

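For a Java program, note that the first phase is I/O-intensive, so give the Old generation as much memory as possible,
while the second phase is CPU-intensive, so watch the Young generation to keep GC from consuming too much CPU.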

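Technical details worth knowing in advance:
1. Splitting large files in Java
http://blog.itpub.net/29254281/viewspace-1161173/

2. Memory-mapped files in Java
http://blog.itpub.net/29254281/viewspace-1162157/

3. Barriers in Java (CyclicBarrier)
http://blog.itpub.net/29254281/viewspace-1164727/

4. The observer pattern and the producer/consumer pattern

5. JVM monitoring and GC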

Test environment:
Dual-core CPU, 1 GB Java heap, sorting 200 million random long values. A file of 200 million random longs is typically about 4 GB.

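Overall design
The Main thread splits the original file: the blocking function returns the chunks of the original file according to the configured number of pieces.
The boundaries are chosen so that every chunk contains whole lines and no number is cut in half.
The unsorted data read from each chunk is wrapped in a Sorter object and submitted to the sorting thread pool.
The sorted data is wrapped in a Writer object and handed to the writer thread pool, which writes the sorted chunk to a temporary file.
After each chunk is written, its thread waits at the CyclicBarrier; once all chunks have been written, the merge starts.
Note in particular that all resources must be released before waiting at the barrier, so that the JVM GC can reclaim the memory.
In the write method of the Writer object, this means setting data and buffer to null and closing the channel before calling barrier.await().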
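
The barrier behaviour described above can be tried out in isolation. The following is a minimal sketch, not taken from the program below: BarrierSketch and runChunks are hypothetical names, a counter stands in for "chunk written to disk", and the barrier action merely records how many chunks were finished when it ran.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class BarrierSketch {

    // Runs `parties` tasks and returns how many had finished when the barrier action ran.
    static int runChunks(int parties) {
        AtomicInteger written = new AtomicInteger();
        AtomicInteger seenByAction = new AtomicInteger(-1);
        // The barrier action runs exactly once, in the last thread to arrive.
        CyclicBarrier barrier = new CyclicBarrier(parties, () -> seenByAction.set(written.get()));
        // One thread per party: a thread blocked in await() is not returned to the pool.
        ExecutorService pool = Executors.newFixedThreadPool(parties);
        for (int i = 0; i < parties; i++) {
            pool.submit(() -> {
                long[] chunk = new long[1024]; // stands in for a sorted chunk
                written.incrementAndGet();     // stands in for "chunk written to disk"
                chunk = null;                  // drop the reference BEFORE waiting
                try {
                    barrier.await();
                } catch (InterruptedException | BrokenBarrierException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return seenByAction.get();
    }

    public static void main(String[] args) {
        System.out.println(runChunks(4));
    }
}
```

If the pool had fewer threads than parties, the tasks still queued could never reach await() and the barrier would never trip, which is why the pool is sized to the number of parties.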

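Merging is very CPU-intensive.
Because every chunk is already sorted, a single thread is enough to pick the smallest number across the chunks and put it into a BlockingQueue,
while another thread keeps writing the queued data sequentially to the target file.
This is what the Merger object does. The process uses the observer pattern and the producer/consumer pattern.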
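
The hand-off between the merging thread and the writing thread can be sketched on its own. Note that this sketch swaps in a different termination technique from the program below: instead of a finished flag, the producer sends an end-of-stream marker (a "poison pill") through the queue. QueueHandoffSketch, handOff and END are hypothetical names, and a list stands in for the target file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class QueueHandoffSketch {

    // Hypothetical end-of-stream marker; compared by identity, so it can never clash with a data value.
    private static final Object END = new Object();

    // Passes already-sorted values from a producer to a consumer thread through a bounded queue.
    static List<Long> handOff(long[] sortedValues) {
        BlockingQueue<Object> queue = new LinkedBlockingQueue<>(1000);
        List<Long> written = new ArrayList<>();
        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    Object item = queue.take(); // blocks while the queue is empty
                    if (item == END) {
                        return;                 // producer is done and the queue is drained
                    }
                    written.add((Long) item);   // stands in for writing to the target file
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();
        try {
            for (long v : sortedValues) {
                queue.put(v);                   // blocks while the queue is full
            }
            queue.put(END);
            consumer.join();                    // join() makes the consumer's writes visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return written;
    }

    public static void main(String[] args) {
        System.out.println(handOff(new long[] {1, 2, 3}));
    }
}
```

Because the marker travels through the queue behind the last value, the consumer cannot miss the end of the stream or stop before the queue is drained.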

The implementation is as follows:

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileChannel.MapMode;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;


public class Controller {

    public static void main(String[] args) throws IOException {

        Controller c = new Controller(new File("/home/lihuilin/桌面/t.txt"), 15, "/home/lihuilin/桌面/");

    }

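    // Thread pool for sorting
    private final ExecutorService sortThread;

    // Thread pool for writing; writes each sorted chunk to a file
    private final ExecutorService writerThread;

    // Barrier: once every chunk has been written to a file, start the merge
    private final CyclicBarrier barrier;

    // The original unsorted large file
    private final File file;

    // Number of chunks
    private final int pieces;

    // Output directory
    private final String outDir;

    // Temporary files that the sorted chunks were written to
    private final List<File> outFileList = new ArrayList<File>();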

    public Controller(File file, int pieces, final String outDir) throws IOException {

        final long start = System.currentTimeMillis();

        sortThread = Executors.newFixedThreadPool(1);

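        // The writer pool must have at least as many threads as there are chunks.
        // A thread blocked in CyclicBarrier.await() is not released, so with fewer
        // threads the remaining chunks would never get a thread to run on.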

        writerThread = Executors.newFixedThreadPool(pieces + 1);

        this.file = file;

        this.pieces = pieces;

        this.outDir = outDir;

        this.barrier = new CyclicBarrier(pieces, new Runnable() {

            @Override

            public void run() {

                long end = System.currentTimeMillis();

                System.out.println("合併之前總用時:" + (end - start) / 1000 + "s");

                // Merge the sorted temporary chunk files

                Merger merger = new Merger(outFileList, outDir);

                writerThread.submit(merger);

                try {

                    merger.merge();

                } catch (IOException e) {

                    // TODO Auto-generated catch block

                    e.printStackTrace();

                } catch (InterruptedException e) {

                    // TODO Auto-generated catch block

                    e.printStackTrace();

                }

                writerThread.shutdown();

                sortThread.shutdown();

                end = System.currentTimeMillis();

                System.out.println("外部排序總用時:" + (end - start) / 1000 + "s");

            }

        });

        action();

    }

    private void action() throws IOException {

        List<Point> list = blocking(file, pieces);

        for (Point p : list) {

            Spilter spilter = new MappedByteBufferSpilter(file, p);

            long[] data = null;

            data = spilter.spilt();

            Sorter s = new Sorter(data, p, writerThread, barrier, outFileList);

            sortThread.submit(s);

        }

    }

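    // Splits the file into at most `piece` chunks whose boundaries fall on newlines,
    // so that no number is cut in half.
    private List<Point> blocking(File file, int piece) throws IOException {
        List<Point> result = new ArrayList<Point>();
        // Offsets of the newline that ends each chunk; -1 stands for "before the first byte".
        List<Long> list = new ArrayList<Long>();
        list.add(-1L);
        long length = file.length();
        long step = length / piece;
        long index = 0;
        // Open the file once. seek() is exact, whereas InputStream.skip() may skip
        // fewer bytes than requested.
        RandomAccessFile in = new RandomAccessFile(file, "r");
        try {
            for (int i = 0; i < piece && index + step < length; i++) {
                index = index + step;
                in.seek(index);
                int b;
                // Advance to the next newline (byte 10); stop at end of file (-1).
                while ((b = in.read()) != 10 && b != -1) {
                    index = index + 1;
                }
                if (b == -1) {
                    break; // no newline left: the rest belongs to the last chunk
                }
                list.add(index);
                index++;
            }
        } finally {
            in.close();
        }
        // The last chunk ends at the final byte, unless a boundary already sits there.
        if (list.get(list.size() - 1) != length - 1) {
            list.add(length - 1);
        }
        for (int i = 0; i < list.size() - 1; i++) {
            long skipSize = list.get(i) + 1;
            long l = list.get(i + 1) - list.get(i);
            result.add(new Point(skipSize, l, outDir));
        }
        return result;
    }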

}

class Merger implements Runnable {

    private final List<Worker> workerList = new ArrayList<Worker>();

    private String outDir = null;

    private BlockingQueue<Long> queue = new LinkedBlockingQueue<Long>(1000);

    private volatile boolean finished = false;

    public Merger(List<File> outFileList, String outDir) {

        for (File file : outFileList) {

            Worker worker = new Worker(file, workerList);

            workerList.add(worker);

        }

        this.outDir = outDir;

    }

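    @Override
    public void run() {
        System.out.println("讀取佇列,寫入目標檔案");
        // try-with-resources flushes and closes the stream even if writing fails
        try (BufferedOutputStream bos = new BufferedOutputStream(new FileOutputStream(outDir + "result.txt"), 50 * 1024 * 1024)) {
            while (!finished || !queue.isEmpty()) {
                // Poll with a timeout instead of take(): take() could block forever if
                // merge() sets finished right after this loop condition was checked.
                Long l = queue.poll(100, java.util.concurrent.TimeUnit.MILLISECONDS);
                if (l != null) {
                    bos.write((l + "\n").getBytes());
                }
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }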

    public void merge() throws IOException, InterruptedException {

        while (workerList.size() != 0) {

            Collections.sort(workerList);

            Worker worker = workerList.get(0);

            Long data = worker.poll();

            if (data == null) {

                workerList.remove(worker);

            } else {

                queue.put(data);

            }

        }

        finished = true;

    }

    private class Worker implements Comparable<Worker> {

        private long data;

        private MappedByteBuffer buffer = null;

        private List<Worker> workerList = null;

        private boolean eof = false;

        Worker(File file, List<Worker> workerList) {

            try {

                RandomAccessFile rFile = new RandomAccessFile(file, "r");

                FileChannel channel = rFile.getChannel();

                buffer = channel.map(MapMode.READ_ONLY, 0, channel.size());

                channel.close();

                rFile.close();

                this.workerList = workerList;

                data = buffer.getLong();

            } catch (IOException e) {

                // TODO Auto-generated catch block

                e.printStackTrace();

            }

        }

        public long peek() {

            return data;

        }

        public Long poll() {

            long result = data;

            if (buffer.position() != buffer.limit()) {

                data = buffer.getLong();

            } else {

                if (eof == false) {

                    eof = true;

                } else {

                    return null;

                }

            }

            return result;

        }

        @Override

        public int compareTo(Worker o) {

            if (this.peek() > o.peek()) {

                return 1;

            } else if (this.peek() < o.peek()) {

                return -1;

            } else {

                return 0;

            }

        }

    }

}

interface Spilter {

    public long[] spilt();

}

class Sorter implements Runnable {

    long[] data;

    Point p;

    ExecutorService writerThread;

    List<File> outFileList;

    CyclicBarrier barrier;

    public Sorter(long[] data, Point p, ExecutorService writerThread, CyclicBarrier barrier, List<File> outFileList) {

        this.data = data;

        this.p = p;

        this.outFileList = outFileList;

        this.barrier = barrier;

        this.writerThread = writerThread;

    }

    public long[] sort() {

        System.out.println("\t開始排序:" + p);

        long start = System.currentTimeMillis();

        Arrays.sort(this.data);

        long end = System.currentTimeMillis();

        System.out.println("\t結束排序:" + p + ",用時:" + (end - start) / 1000);

        return this.data;

    }

    @Override

    public void run() {

        Writer writer = new MappedByteBufferWriter(sort(), p, barrier, outFileList);

        writerThread.submit(writer);

    }

}

interface Writer extends Runnable {

    public void write();

}

class MappedByteBufferWriter implements Writer {

    private static int FLAG = 1;

    private CyclicBarrier barrier = null;

    private File outfile = null;

    private Point point = null;

    private long[] data = null;

    private List<File> outFileList = null;

    public MappedByteBufferWriter(long[] data, Point point, CyclicBarrier barrier, List<File> outFileList) {

        this.data = data;

        this.point = point;

        this.outfile = new File(point.getOutDir() + FLAG + ".txt");

        this.barrier = barrier;

        this.outFileList = outFileList;

        FLAG++;

    }

    @Override

    public void write() {

        try {

            System.out.println("\t\t開始寫入:" + point);

            long start = System.currentTimeMillis();

            FileChannel channel = new RandomAccessFile(this.outfile, "rw").getChannel();

            // 8L forces long arithmetic so the byte count cannot overflow int
            MappedByteBuffer buffer = channel.map(MapMode.READ_WRITE, 0, this.data.length * 8L);

            for (int i = 0; i < data.length; i++) {

                buffer.putLong(data[i]);

            }

            buffer.force();

            long end = System.currentTimeMillis();

            System.out.println("\t\t結束寫入:" + point + ",用時:" + (end - start) / 1000);

            synchronized (outFileList) {

                outFileList.add(outfile);

            }

            this.data = null;

            channel.close();

            buffer = null;

            barrier.await();

        } catch (IOException ex) {

            ex.printStackTrace();

        } catch (InterruptedException e) {

            // TODO Auto-generated catch block

            e.printStackTrace();

        } catch (BrokenBarrierException e) {

            // TODO Auto-generated catch block

            e.printStackTrace();

        }

    }

    @Override

    public void run() {

        this.write();

    }

}

class MappedByteBufferSpilter implements Spilter {

    private File file;

    private Point point;

    public MappedByteBufferSpilter(File file, Point p) {

        this.file = file;

        this.point = p;

    }

    @Override

    public long[] spilt() {

        System.out.println("開始讀入:" + point);

        long start = System.currentTimeMillis();

        long[] result = null;

        try {

            FileChannel in = new RandomAccessFile(file, "r").getChannel();

            MappedByteBuffer inBuffer = in.map(MapMode.READ_ONLY, point.getSkipSize(), point.getLength());

            byte[] data = new byte[inBuffer.limit()];

            inBuffer.get(data);

            result = new long[getObjectSize(data)];

            int resultIndex = 0;

            int index = 0;

            int first = 0;

            while (index < data.length) {

                if (data[index] == 10) {

                    byte[] tmpData = Arrays.copyOfRange(data, first, index);

                    String str = new String(tmpData);

                    // parseLong returns a primitive and avoids boxing a Long per value
                    result[resultIndex] = Long.parseLong(str);

                    resultIndex++;

                    first = index + 1;

                }

                index++;

            }

            in.close();

        } catch (IOException ex) {

            ex.printStackTrace();

        }

        long end = System.currentTimeMillis();

        System.out.println("結束讀入:" + point + ",用時:" + (end - start) / 1000);

        return result;

    }

    private int getObjectSize(byte[] data) {

        int size = 0;

        for (byte b : data) {

            if (b == 10) {

                size++;

            }

        }

        return size;

    }

}

class Point {

    public Point(long skipSize, long length, String outDir) {

        if (length > Integer.MAX_VALUE) {

            throw new RuntimeException("長度溢位");

        }

        this.skipSize = skipSize;

        this.length = (int) length;

        this.outDir = outDir;

    }

    @Override

    public String toString() {

        return "Point [skipSize=" + skipSize + ", length=" + length + "]";

    }

    private long skipSize;

    private int length;

    private String outDir;

    public String getOutDir() {

        return outDir;

    }

    public long getSkipSize() {

        return skipSize;

    }

    public int getLength() {

        return length;

    }

}

Run:

[lihuilin@lihuilin 桌面]$ java Controller

開始讀入:Point [skipSize=0, length=271726519]

結束讀入:Point [skipSize=0, length=271726519],用時:8

開始讀入:Point [skipSize=271726519, length=271726515]

開始排序:Point [skipSize=0, length=271726519]

結束排序:Point [skipSize=0, length=271726519],用時:3

開始寫入:Point [skipSize=0, length=271726519]

結束寫入:Point [skipSize=0, length=271726519],用時:2

結束讀入:Point [skipSize=271726519, length=271726515],用時:9

開始排序:Point [skipSize=271726519, length=271726515]

開始讀入:Point [skipSize=543453034, length=271726511]

結束排序:Point [skipSize=271726519, length=271726515],用時:4

開始寫入:Point [skipSize=271726519, length=271726515]

結束寫入:Point [skipSize=271726519, length=271726515],用時:3

結束讀入:Point [skipSize=543453034, length=271726511],用時:9

開始讀入:Point [skipSize=815179545, length=271726515]

開始排序:Point [skipSize=543453034, length=271726511]

結束排序:Point [skipSize=543453034, length=271726511],用時:3

開始寫入:Point [skipSize=543453034, length=271726511]

結束寫入:Point [skipSize=543453034, length=271726511],用時:5

結束讀入:Point [skipSize=815179545, length=271726515],用時:13

開始讀入:Point [skipSize=1086906060, length=271726524]

開始排序:Point [skipSize=815179545, length=271726515]

結束排序:Point [skipSize=815179545, length=271726515],用時:3

開始寫入:Point [skipSize=815179545, length=271726515]

結束寫入:Point [skipSize=815179545, length=271726515],用時:5

結束讀入:Point [skipSize=1086906060, length=271726524],用時:13

開始讀入:Point [skipSize=1358632584, length=271726507]

開始排序:Point [skipSize=1086906060, length=271726524]

結束排序:Point [skipSize=1086906060, length=271726524],用時:3

開始寫入:Point [skipSize=1086906060, length=271726524]

結束寫入:Point [skipSize=1086906060, length=271726524],用時:5

結束讀入:Point [skipSize=1358632584, length=271726507],用時:12

開始讀入:Point [skipSize=1630359091, length=271726523]

開始排序:Point [skipSize=1358632584, length=271726507]

結束排序:Point [skipSize=1358632584, length=271726507],用時:3

開始寫入:Point [skipSize=1358632584, length=271726507]

結束寫入:Point [skipSize=1358632584, length=271726507],用時:5

結束讀入:Point [skipSize=1630359091, length=271726523],用時:13

開始讀入:Point [skipSize=1902085614, length=271726514]

開始排序:Point [skipSize=1630359091, length=271726523]

結束排序:Point [skipSize=1630359091, length=271726523],用時:3

開始寫入:Point [skipSize=1630359091, length=271726523]

結束寫入:Point [skipSize=1630359091, length=271726523],用時:5

結束讀入:Point [skipSize=1902085614, length=271726514],用時:13

開始讀入:Point [skipSize=2173812128, length=271726519]

開始排序:Point [skipSize=1902085614, length=271726514]

結束排序:Point [skipSize=1902085614, length=271726514],用時:3

開始寫入:Point [skipSize=1902085614, length=271726514]

結束寫入:Point [skipSize=1902085614, length=271726514],用時:5

結束讀入:Point [skipSize=2173812128, length=271726519],用時:13

開始讀入:Point [skipSize=2445538647, length=271726516]

開始排序:Point [skipSize=2173812128, length=271726519]

結束排序:Point [skipSize=2173812128, length=271726519],用時:3

開始寫入:Point [skipSize=2173812128, length=271726519]

結束寫入:Point [skipSize=2173812128, length=271726519],用時:5

結束讀入:Point [skipSize=2445538647, length=271726516],用時:12

開始讀入:Point [skipSize=2717265163, length=271726517]

開始排序:Point [skipSize=2445538647, length=271726516]

結束排序:Point [skipSize=2445538647, length=271726516],用時:3

開始寫入:Point [skipSize=2445538647, length=271726516]

結束寫入:Point [skipSize=2445538647, length=271726516],用時:5

結束讀入:Point [skipSize=2717265163, length=271726517],用時:13

開始讀入:Point [skipSize=2988991680, length=271726517]

開始排序:Point [skipSize=2717265163, length=271726517]

結束排序:Point [skipSize=2717265163, length=271726517],用時:3

開始寫入:Point [skipSize=2717265163, length=271726517]

結束寫入:Point [skipSize=2717265163, length=271726517],用時:5

結束讀入:Point [skipSize=2988991680, length=271726517],用時:12

開始讀入:Point [skipSize=3260718197, length=271726516]

開始排序:Point [skipSize=2988991680, length=271726517]

結束排序:Point [skipSize=2988991680, length=271726517],用時:3

開始寫入:Point [skipSize=2988991680, length=271726517]

結束寫入:Point [skipSize=2988991680, length=271726517],用時:5

結束讀入:Point [skipSize=3260718197, length=271726516],用時:12

開始讀入:Point [skipSize=3532444713, length=271726515]

開始排序:Point [skipSize=3260718197, length=271726516]

結束排序:Point [skipSize=3260718197, length=271726516],用時:3

開始寫入:Point [skipSize=3260718197, length=271726516]

結束寫入:Point [skipSize=3260718197, length=271726516],用時:5

結束讀入:Point [skipSize=3532444713, length=271726515],用時:12

開始讀入:Point [skipSize=3804171228, length=271726376]

開始排序:Point [skipSize=3532444713, length=271726515]

結束排序:Point [skipSize=3532444713, length=271726515],用時:3

開始寫入:Point [skipSize=3532444713, length=271726515]

結束寫入:Point [skipSize=3532444713, length=271726515],用時:4

結束讀入:Point [skipSize=3804171228, length=271726376],用時:12

開始排序:Point [skipSize=3804171228, length=271726376]

結束排序:Point [skipSize=3804171228, length=271726376],用時:2

開始寫入:Point [skipSize=3804171228, length=271726376]

結束寫入:Point [skipSize=3804171228, length=271726376],用時:3

合併之前總用時:190s

讀取佇列,寫入目標檔案

外部排序總用時:398s

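JVM monitoring:

The monitoring shows that the first phase is I/O-intensive and needs a lot of memory;
in the second phase, merging the chunks puts heavy pressure on the CPU, so make sure the GC threads do not take too much CPU time, which means the Young generation must not be too small.

In the figure above, everything after FGC 55 belongs to the second phase, where Young GCs become noticeably more frequent.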

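Verification:
Start with test data of 1 to 100 to confirm that it is sorted correctly.
After sorting the large file, the result can be verified with the Linux sort command.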
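
As an alternative to the sort command, the result file can also be checked from Java. The following is a small sketch under the assumption that the file holds one decimal long per line, as result.txt does; SortedCheckSketch and firstUnsortedLine are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

class SortedCheckSketch {

    // Returns the 1-based line number of the first value that is smaller than
    // its predecessor, or -1 if the whole input is in non-decreasing order.
    static long firstUnsortedLine(BufferedReader reader) {
        try {
            long lineNo = 0;
            long previous = Long.MIN_VALUE;
            String line;
            while ((line = reader.readLine()) != null) {
                lineNo++;
                long value = Long.parseLong(line.trim());
                if (value < previous) {
                    return lineNo;
                }
                previous = value;
            }
            return -1;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // In-memory input for demonstration; for a real file, wrap a java.io.FileReader instead.
        BufferedReader demo = new BufferedReader(new StringReader("1\n2\n2\n5\n"));
        System.out.println(firstUnsortedLine(demo));
    }
}
```

The check reads the file once and keeps only the previous value in memory, so it works for results far larger than the heap.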

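On performance optimization
In essence, external sorting sorts the data and writes it into small files, then merges the small files into the sorted target file.
So the total time should be roughly twice the time needed to copy the file.

However...
I recall that when I was at Institute 15, Mr. Wu, on a machine roughly comparable to mine, needed only about 230 s.
Two points to watch when optimizing:
1. Memory-mapped files
    For file copying and for writing chunk files and reading them back, memory-mapped files avoid copying between kernel space and user space, and the operating system's off-heap memory can serve as a cache.
2. Keep GC from consuming CPU
    When reading and writing files, the conversions between byte and long both go through String as an intermediate type.
    Later, byte and long could be converted directly without String, which would avoid the extra GC work.

The first point is easy to think of. Mr. Wu did the second and I did not, which may be why my version is slower.
I will add this detail when I have time.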
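
One possible shape of that missing detail is sketched below. It is an illustration only, not Mr. Wu's code: LongCodecSketch, parseLong and writeLong are hypothetical names, and the input is assumed to be well-formed ASCII decimal digits with an optional leading minus sign (no validation is performed).

```java
import java.nio.charset.StandardCharsets;

class LongCodecSketch {

    // Parses the ASCII decimal number in data[from, to) without creating a String.
    // Assumes well-formed input: an optional '-' followed by digits only.
    static long parseLong(byte[] data, int from, int to) {
        boolean negative = data[from] == '-';
        long value = 0;
        for (int i = negative ? from + 1 : from; i < to; i++) {
            // Accumulate on the negative side so that Long.MIN_VALUE does not overflow.
            value = value * 10 - (data[i] - '0');
        }
        return negative ? value : -value;
    }

    // Writes value as ASCII decimal followed by '\n' into buf at pos and returns
    // the new position. The caller must leave at least 21 bytes of room.
    static int writeLong(long value, byte[] buf, int pos) {
        if (value < 0) {
            buf[pos++] = '-';
        }
        long n = value > 0 ? -value : value; // work on the negative side
        int digitsStart = pos;
        do {
            buf[pos++] = (byte) ('0' - (n % 10)); // n % 10 is <= 0 here
            n /= 10;
        } while (n != 0);
        // The digits were produced least-significant first; reverse them in place.
        for (int i = digitsStart, j = pos - 1; i < j; i++, j--) {
            byte t = buf[i];
            buf[i] = buf[j];
            buf[j] = t;
        }
        buf[pos++] = '\n';
        return pos;
    }

    public static void main(String[] args) {
        byte[] buf = new byte[32];
        int end = writeLong(-1234567890123L, buf, 0);
        System.out.print(new String(buf, 0, end, StandardCharsets.US_ASCII));
        System.out.println(parseLong(buf, 0, end - 1));
    }
}
```

Neither method allocates anything per number, so replacing the String round trip with calls like these would take the per-value garbage out of both the read path and the write path.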
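Because memory-mapped files are used, it is best to clear the operating system's file cache before each run to avoid measurement error.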

Source: ITPUB blog, link: http://blog.itpub.net/29254281/viewspace-1167988/. If you repost this, please credit the source; otherwise legal action may be taken.
