一、前言

最近幾天在開發公司業務時，遇到了需要往不資料庫中多個表中插入大量資料的一個場景，於是有了這篇文章：

在使用Mybatis批量插入資料時的注意事項，以及使用函數語言程式設計對分批次資料插入方法的簡單封裝。

對於包含我在內大部分Java後端開發的小夥伴們在平常的CURD開發工作中，一定是免不了使用Mybatis這個工具來對資料庫進行操作的。

在SpringBoot的工程中，引入Mybatis後，可使用mapper注入的方式來實現增刪改查。

比如如果要新增一條資料，那麼在mapper.xml檔案中可以這麼寫：

        <insert id="testInsert">
        insert into massive_data_insert_test
            (value1,value2)
        values
             (#{value1},#{value2})
    </insert>

然後在service層呼叫mapper.insertOneItem(insertItem);即可。

如果要新增多條資料，如果是剛學Java的同學可能會這麼寫：

for(int i = 0; i < insertList.size(); i++){
  mapper.insertOneItem(insertList.get(i));
}

就是簡單的放在一個迴圈中，多次呼叫mapper中的資料庫新增方法。這種方法編寫簡單易於理解，在資料量比較少的時候是不會有什麼問題的。

但是一旦資料過多就會有問題了。

實際上每次的mapper方法呼叫都是一次連線資料庫、預處理（PreparedStatement）、execute（執行SQL）的過程。

由此發現，如果放在for迴圈中的話，上述過程則會多次執行，而我們又清楚資料庫的連線數是有限的（建立連線的過程也是很耗費資源的），如果for迴圈的次數過多時，不僅效能會下降還會造成資料庫的堵塞甚至程式崩潰。當然，我們可以建立或配置資料庫連線池（比如HikariCP、Durid等）來複用連線，但這還是對資源的一種浪費。

總而言之，如果能有一種方法來一次性把要完成的事情做完就不要分多次去做。

大部分資料庫的新增語句是支援一次插入多條資料的。

insert into table
(value1, value2)
values ('v1','v2'),('v3','v4'),('v1','v5')

Mybatis也給出了一種批量運算元據的方法。使用動態SQL語句中的<foreach>標籤，將幫助我們拼接出形似上面的SQL語句。

我們在mapper.xml中編寫下面的內容：

    <insert id="testMassiveInsert">
        insert into massive_data_insert_test
            (value1,value2)
        values
        <foreach collection="list" item="item" separator=",">
             (#{item.value1},#{item.value2})
        </foreach>
    </insert>

這樣我們就只需要呼叫一次mapper中的方法就能達到上面for迴圈程式碼的效果。

並且實際上執行的時候也是執行一次SQL，這個動態SQL語句的作用就是將傳入的引數的內容拼接到插入語句的SQL中（預處理技術）。

這種方法很明顯要比一開始的for迴圈的實現要好一點了。

二、批量插入資料量達到上萬後報錯

但是，當我們用上述的拼接SQL的方式進行批量插入時，資料量過大也會出現問題！

我們可以先來實驗一下批量插入一個四萬條資料會怎樣。

先來新建一張表作為插入資料的目標表：massive_data_insert_test。

CREATE TABLE "supply"."massive_data_insert_test" (
  "value1" varchar(255) COLLATE "pg_catalog"."default",
  "value2" varchar(255) COLLATE "pg_catalog"."default"
);

隨便在一個SpringBoot的工程中連線資料庫並建立mapper，編寫插入語句（建立工程和掃描mapper等操作就不在此贅述了）：

下面是mapper介面和mapper.xml檔案（中的sql語句）。

TestMapper.java

@Repository
public interface TestMapper.java {
    void testMassiveInsert(List<HashMap> list);
}

TestMapper.xml

    <insert id="testMassiveInsert">
        insert into massive_data_insert_test
            (value1,value2)
        values
        <foreach collection="list" item="item" separator=",">
             (#{item.value1},#{item.value2})
        </foreach>
    </insert>

測試語句：

@Service
@Slf4j
public class TestService {
        // 批量新增的最大數量
    private static final int maxInsertItemNumPerTime = 500;

    private TestMapper mapper;

    @Autowired
    public TestService(TestMapper mapper) {
        this.mapper = mapper;
    }

    public Result testMassiveInsert() {
        long startTime = System.currentTimeMillis();    //獲取開始時間
        List<HashMap> list = new LinkedList<>();
          // 組裝資料 獲得一個長度為 500 * 80 = 40000的連結串列
        for (int i = 0; i < maxInsertItemNumPerTime * 80; i++) {
            HashMap map = new HashMap();
            map.put("value1", "value1" + i);
            map.put("value2", "value2" + i);
            list.add(map);
        }
        // 直接批量插入
        try {
            mapper.testMassiveInsert(list);
        } catch (RuntimeException e) {
            log.info("直接批量插入" + list.size() + "失敗", e.getMessage());
            throw new RuntimeException(e);
        }
        long endTime = System.currentTimeMillis();    //獲取結束時間
        return Result.ok().message("程式執行時間：" + (endTime - startTime) + "ms");
    }
}

當執行上面的直接批量插入時：

直接報出了I/O error，為什麼會這樣呢？

在前文中提到過這個動態SQL語句的作用就是將傳入的引數的內容拼接到插入語句的SQL中，所以發生這個錯誤的原因就是因為要拼接的內容過多，導致SQL語句過長從而導致了I/O的錯誤。所以當資料量過大時就會使得拼接過長從而導致程式報錯。

並且這個SQL的長度不光跟資料量有關，還跟插入語句的插入引數的數量有關。實際上，SQL的長度與二者的乘積呈正相關的線性變化，所以當插入引數過多時更要控制好批量插入資料量的大小。

那麼怎麼解決呢？最簡單的就是分批次插入了，這有點像文章一開始提到的for迴圈中依次插入的方式，只不過是這回for迴圈中的插入是批量插入了。

由於不是一次性的插入，所以需要加上事務的包裹，從而保證無論哪次插入出錯都能回滾。

    @Transactional
    public Result testMassiveInsert() {
        long startTime = System.currentTimeMillis();    //獲取開始時間
        List<HashMap> list = new LinkedList<>();
          // 組裝資料 獲得一個長度為 500 * 80 = 40000的連結串列
        for (int i = 0; i < maxInsertItemNumPerTime * 80; i++) {
            HashMap map = new HashMap();
            map.put("value1", "value1" + i);
            map.put("value2", "value2" + i);
            list.add(map);
        }
        // 分批次的批量插入
        try {
            if (list.size() > maxInsertItemNumPerTime) {
                List<List<HashMap>> all = new ArrayList<>();
                int i = 0;
                while (i < list.size()) {
                    List subList = list.subList(i, i + maxInsertItemNumPerTime);
                    i = i + maxInsertItemNumPerTime;
                    all.add(subList);
                }
                all.parallelStream().forEach(o -> mapper.testMassiveInsert(o));
            }
        } catch (RuntimeException e) {
            log.info("分批次批量插入" + list.size() + "失敗", e.getMessage());
            throw new RuntimeException(e);
        }
        long endTime = System.currentTimeMillis();    //獲取結束時間
        return Result.ok().message("程式執行時間：" + (endTime - startTime) + "ms");
    }

我們通過設定一個maxInsertItemNumPerTime來控制每次批量插入資料的長度。

三、簡單測試

下面我的簡單測試（即插入總數量為4w條，但設定不同maxInsertItemNumPerTime大小時，計算比較程式執行的耗時）。不過該測試沒有進行很多次測試取平均並又可能網路也有抖動，所以只能算是簡單測試。

2000

1000

500

250

最後我選擇了500作為分批次插入時每個批次的大小。正如我們上文所說的一樣，即使有資料庫連線池提供的連線複用，但是如果跟資料庫的互動多了還是會造成效能的下降，所以這裡的maxInsertItemNumPerTime的也不是越小越好。

同時，隨著maxInsertItemNumPerTime的變大，每一次for迴圈中的SQL的預處理過程（SQL拼接）耗時會變大，並且這種變大並不是一種線性的，而是往往呈現指數型變大(查了一些資料證實了我的這個猜測)，否則的話就不會是2000的時候要遠遠大於500了。

大家在實際業務中也需要簡單測測來選取一個比較合適的值，總比沒有測試的要好。

四、做點擴充套件

其實在Mybatis的官方文件中是提供了另外一種方式來支援批量插入的。

但由於公司的專案中都是用的掃描Mapper的方式來運算元據庫，加上這種大資料插入場景確實比較少，所以就沒有特意引進下面Mybatis提供的方式。

A multiple row insert is a single insert statement that inserts multiple rows into a table. This can be a convenient way to insert a few rows into a table, but it has some limitations:

Since it is a single SQL statement, you could generate quite a lot of prepared statement parameters. For example, suppose you wanted to insert 1000 records into a table, and each record had 5 fields. With a multiple row insert you would generate a SQL statement with 5000 parameters. There are limits to the number of parameters allowed in a JDBC prepared statement - and this kind of insert could easily exceed those limits. If you want to insert many records, you should probably use a JDBC batch insert instead (see below)
The performance of a giant insert statement may be less than you expect. If you have many records to insert, it will almost always be more efficient to use a JDBC batch insert (see below). With a batch insert, the JDBC driver can do some optimization that is not possible with a single large statement
Retrieving generated values with multiple row inserts can be a challenge. MyBatis currently has some limitations related to retrieving generated keys in multiple row inserts that require special considerations (see below)

Nevertheless, there are use cases for a multiple row insert - especially when you just want to insert a few records in a table and don’t need to retrieve generated keys. In those situations, a multiple row insert will be an easy solution.

翻譯：

    try (SqlSession session = sqlSessionFactory.openSession()) {
        GeneratedAlwaysAnnotatedMapper mapper = session.getMapper(GeneratedAlwaysAnnotatedMapper.class);
        List<GeneratedAlwaysRecord> records = getRecordsToInsert(); // not shown
            
        MultiRowInsertStatementProvider<GeneratedAlwaysRecord> multiRowInsert = insertMultiple(records)
                .into(generatedAlways)
                .map(id).toProperty("id")
                .map(firstName).toProperty("firstName")
                .map(lastName).toProperty("lastName")
                .build()
                .render(RenderingStrategies.MYBATIS3);
            
        int rows = mapper.insertMultiple(multiRowInsert);
    }

上面就是文件中的示例程式碼。

五、進一步優化程式碼

由於我在公司實際開發中要做的是要往多張表中匯入資料，如果按照上面的分批次的寫法的話，那我的每一次插入都要有這麼一段邏輯：

        // 分批次的批量插入
        try {
            if (list.size() > maxInsertItemNumPerTime) {
                List<List<HashMap>> all = new ArrayList<>();
                int i = 0;
                while (i < list.size()) {
                    List subList = list.subList(i, i + maxInsertItemNumPerTime);
                    i = i + maxInsertItemNumPerTime;
                    all.add(subList);
                }
                all.parallelStream().forEach(o -> mapper.testMassiveInsert(o));
            }
        } catch (RuntimeException e) {
            log.info("分批次批量插入" + list.size() + "失敗", e.getMessage());
            throw new RuntimeException(e);
        }

顯然這有點重複程式碼的壞味道了，看起來就很不好。所以接下來做個簡單封裝，將這段程式碼封裝到一個方法中，誰插入時就由誰呼叫。

首先，這段程式碼需要傳入的引數是什麼？

maxInsertItemNumPerTime 不需要傳入，因為顯然這是一個常量
list 插入的內容連結串列，需要傳入，並且型別不會全都是HashMap，而是不同表對應不同的實體類。需要用到泛型T。
mapper中的testMassiveInsert(HashMap map) 方法，很明顯不同表的插入mapper肯定不是同一個方法，所以這個也需要傳入，把一個方法當作引數傳入，那麼就需要用到Lambda表示式和函數語言程式設計。如果你瞭解過函式式介面，你很自然就會想到像這種只有輸入沒有輸出的函式應該由Consumer來傳入（反之對應Supplier，有輸入又有輸出時則是Function）。

所以最後抽象出的程式碼應該是這樣的：

public <T> void batchSplitInsert(List<T> list, Consumer insertFunc) {
    List<List<T>> all = new ArrayList<>();
    if (list.size() > maxInsertItemNumPerTime) {
        int i = 0;
        while (i < list.size()) {
            List subList = list.subList(i, i + maxInsertItemNumPerTime);
            i = i + maxInsertItemNumPerTime;
            all.add(subList);
        }
        all.parallelStream().forEach(insertFunc);
    } else {
        insertFunc.accept(list);
    }
}

這樣子我在做不同表的插入時：

// 待插入資料連結串列
List<TableDTO> needToInsert = ……；
// 進行新增
Consumer<List<TableDTO>> consumer = o -> mapper.insertTable(o);
batchSplitInsert(needToInsert, consumer);

現在整個世界都變優雅了！如果剛入職或者剛學Java的小夥伴還對函數語言程式設計不瞭解，不如先從從簡單的Lambda語句開始學起，這個真的是Java8中超好用的一個特性。

到此為止，本文就算是結束了，大家有啥疑問，請評論留言，我們們相互交流。

如果這篇文章對你有幫助請多多點贊，感謝！如果有大佬有更好的實現方式也請務必不吝賜教！

使用Mybatis批量插入大量資料的實踐

一、前言

二、批量插入資料量達到上萬後報錯

三、簡單測試

四、做點擴充套件

五、進一步優化程式碼

相關文章