RedisTemplate使用不當引起服務耗時激增

一縷陽光同志發表於2020-04-05

原文網址 : https://juejin.im/post/5e895f91f265da47fa28b68d

Redis

如果你們的redis部署為叢集模式，使用jedis作為驅動包，且專案中用到了Spring-data提供的RedisTemplate.delete(Collection<K> keys)方法，那麼，就需要注意了。

ps：想看原因的直接翻到最後的講解部分。

背景

上週通過監控發現系統突然出現大量的long-service，通過跟蹤日誌發現該請求中操作redis的耗時竟然高達500ms！

在進一步跟進日誌後發現，最近出現了很多操作redis耗時過長的記錄，而且奇怪的是，長耗時都是del操作，而set,get等都在0~1ms能完成。這裡，基本上可以排除網路問題了

接下來還有更詭異的：

聯絡DBA，DBA小哥一臉肯定的說：服務端單個操作超過10ms就會告警，現在叢集一切ok。

為了證實不是服務端的鍋，還到叢集的每個主節點上拉了redis的slowlog，很遺憾，裡面並沒有期待的del記錄。所以，不得不承認，服務端也是沒問題的

最小復現環境

為了更好的說明問題，我在本地搭建了一套最小的復現環境。

redis叢集

必須是叢集模式，我本地啟動的為3主3從，埠從6380~6385。

詳細資訊如下：

127.0.0.1:6380> cluster nodes
1e5df367880965ead5b7064a2e4880db21b7aa94 127.0.0.1:6385@16385 slave 3f8598d3ec376d6f732c9921c2da9d48c825b1ad 0 1586069946810 6 connected
a136c7ed8206f0fa6fe73c02bf5f5d864fbc0f29 127.0.0.1:6383@16383 slave f658a0138f70acbd880928930efa19eebba1098a 0 1586069945000 4 connected
f658a0138f70acbd880928930efa19eebba1098a 127.0.0.1:6381@16381 master - 0 1586069946000 2 connected 5461-10922
3872ab46cf6f316bc1d2a12c2d18e59e5d3a7e83 127.0.0.1:6384@16384 slave a3a7431d8bbaa1bc795433d59ab0dd8bc3fa2d2f 0 1586069947000 5 connected
3f8598d3ec376d6f732c9921c2da9d48c825b1ad 127.0.0.1:6380@16380 myself,master - 0 1586069945000 1 connected 0-5460
a3a7431d8bbaa1bc795433d59ab0dd8bc3fa2d2f 127.0.0.1:6382@16382 master - 0 1586069947821 3 connected 10923-16383
複製程式碼

測試專案

主要是針對redisTemplate裡面的刪除操作，redis驅動我已經配置為jedis(版本無關)。

為了比較，我這裡寫了三種不同的service方法：

1 直接呼叫RedisTemplate.delete(Collection<K> keys)方法。

    @Override
    public void batchCollDel(List<String> list){
        redisTemplate.delete(list);
    }
複製程式碼

2 使用普通的forEach方法，迴圈呼叫RedisTemplate.delete(K key)方法。

    @Override
    public void batchNormalDel(List< String> list){
        list.stream().forEach(redisTemplate::delete);
    }
複製程式碼

3 使用並行流，呼叫RedisTemplate.delete(K key)方法。

    @Override
    public void batchParallelDel(List<String> list){
        list.parallelStream().forEach(redisTemplate::delete);
    }
複製程式碼

為了觀察上面三種執行的耗時，寫了一個around切面，用來列印整個list執行完delete所消耗的時間：

 @Around(value = "methodPointCut()")
    public Object doAroundAdvice(ProceedingJoinPoint proceedingJoinPoint){
        String name = proceedingJoinPoint.getSignature().getName();
        StopWatch watch = new StopWatch();
        watch.start();
        Object proceed = null;
        try {
            proceed = proceedingJoinPoint.proceed();
            watch.stop();
            log.info("method:{} spend : {}",name,watch.getTime());
        } catch (Throwable throwable) {
            throwable.printStackTrace();
        }
        return proceed;
    }
複製程式碼

好了，剩下的程式碼就是把上面的三個方法包成controller暴露出去，文章裡面就不放了。

壓測

首先宣告一下，redis叢集效能是夠用的，rps輕輕鬆鬆上10w，大家可以自己用redis-benchmark跑一下。

ab命令，先壓一下普通的forEach迴圈刪除：

 ab -c 50 -n 500 http://localhost:8080/redis/batchNormalDel?size=1000
複製程式碼

下面是ab返回的統計資訊，算上生成隨機字串的時間，每個請求都在700ms以內完成：

Percentage of the requests served within a certain time (ms)
  50%    527
  66%    545
  75%    559
  80%    568
  90%    595
  95%    613
  98%    632
  99%    658
 100%    663 (longest request)
複製程式碼

再看一下日誌中列印的執行redis批量刪除都消耗的時間：

接下來，再壓一下直接使用RedisTemplate.delete(Collection<K> keys)批量刪除的效果：

ab -c 50 -n 500 http://localhost:8080/redis/batchCollDel?size=1000
複製程式碼

要炸了。。在壓測開始的時候耗時直線上升，最後每次批量刪除維持在1500ms左右。

Percentage of the requests served within a certain time (ms)
  50%   1576
  66%   1757
  75%   1785
  80%   1854
  90%   2172
  95%   2308
  98%   2340
  99%   2399
 100%   2446 (longest request)
複製程式碼

最後，再看看並行流的表現怎麼樣：

ab -c 50 -n 500 http://localhost:8080/redis/batchParallelDel?size=1000
複製程式碼

看來併發也不是萬能的哈哈。

Percentage of the requests served within a certain time (ms)
  50%    383
  66%    490
  75%    572
  80%    639
  90%    797
  95%    989
  98%   1213
  99%   1462
 100%   2206 (longest request)
複製程式碼

最後，一頓操作下來，檢查一下slowlog(這裡時間的單位是微妙)。

127.0.0.1:6380> config get slowlog-log-slower-than
1) "slowlog-log-slower-than"
2) "10000"
127.0.0.1:6380> slowlog get
(empty list or set)
127.0.0.1:6380>
複製程式碼

講解部分

spring-data:2.2.6.realease

jedis:3.2.0

下面看一下RedisTemplate相關的程式碼:

@Override
	public Long delete(Collection<K> keys) {
		if (CollectionUtils.isEmpty(keys)) {
			return 0L;
		}
		byte[][] rawKeys = rawKeys(keys);
		return execute(connection -> connection.del(rawKeys), true);
	}
複製程式碼

沒什麼特別的，繼續往下看：因為是叢集模式，所以對應的類是JedisClusterKeyCommands。

	@Override
	public Long del(byte[]... keys) {

		Assert.notNull(keys, "Keys must not be null!");
		Assert.noNullElements(keys, "Keys must not contain null elements!");

		if (ClusterSlotHashUtil.isSameSlotForAllKeys(keys)) {
			try {
				return connection.getCluster().del(keys);
			} catch (Exception ex) {
				throw convertJedisAccessException(ex);
			}
		}
		return (long) connection.getClusterCommandExecutor()
				.executeMultiKeyCommand((JedisMultiKeyClusterCommandCallback<Long>) (client, key) -> client.del(key),
						Arrays.asList(keys))
				.resultsAsList().size();
	}
複製程式碼

這裡判斷傳入的keys是否都是在redis節點中的同一個槽，在ClusterSlotHashUtil.isSameSlotForAllKeys裡面會計算傳入的所有key的slot值。

其實，因為叢集中slot一般都是均勻分佈的，批量刪除落在同一個槽內的概率很小(除非特地group過)，所有，大部分情況下都是走下面的executeMultiKeyCommand分支。

	public <S, T> MultiNodeResult<T> executeMultiKeyCommand(MultiKeyClusterCommandCallback<S, T> cmd,
			Iterable<byte[]> keys) {
		Map<RedisClusterNode, PositionalKeys> nodeKeyMap = new HashMap<>();
		int index = 0;
		for (byte[] key : keys) {
			for (RedisClusterNode node : getClusterTopology().getKeyServingNodes(key)) {
				nodeKeyMap.computeIfAbsent(node, val -> PositionalKeys.empty()).append(PositionalKey.of(key, index++));
			}
		}
		Map<NodeExecution, Future<NodeResult<T>>> futures = new LinkedHashMap<>();
		for (Entry<RedisClusterNode, PositionalKeys> entry : nodeKeyMap.entrySet()) {
			if (entry.getKey().isMaster()) {
				for (PositionalKey key : entry.getValue()) {
					futures.put(new NodeExecution(entry.getKey(), key),
							executor.submit(() -> executeMultiKeyCommandOnSingleNode(cmd, entry.getKey(), key.getBytes())));
				}
			}
		}
		return collectResults(futures);
	}
複製程式碼

接下來會根據每個key所在redis node節點進行分組，計算好的結果如下：

然後，使用Future向分別對應的主節點上非同步提交命令。

在全部命令都提交完成後，使用collectResults(futures)方法等待所有的結果返回，也就是非同步又轉會了同步。

這程式碼這裡就不放了，主要邏輯就是判斷futures是否已經執行完成，沒有完成的話就Thread.sleep(10)繼續等。

關鍵問題出在提交執行任務的地方

executor.submit(() -> executeMultiKeyCommandOnSingleNode(cmd, entry.getKey(), key.getBytes()))
複製程式碼

我們看一下這個executor物件：

他是一個AsyncTaskExecutor物件，那麼，他的corePoolSize是多少呢？

在ClusterCommandExecutor中，一共有兩個對方對executor進行初始化：

1 構造方法：

	public ClusterCommandExecutor(ClusterTopologyProvider topologyProvider, ClusterNodeResourceProvider resourceProvider,
			ExceptionTranslationStrategy exceptionTranslation, @Nullable AsyncTaskExecutor executor) {

		this(topologyProvider, resourceProvider, exceptionTranslation);
		this.executor = executor;
	}
複製程式碼

但是，這個構造方法沒有被任何一個地方使用。

2 程式碼塊

	{
		if (executor == null) {
			ThreadPoolTaskExecutor threadPoolTaskExecutor = new ThreadPoolTaskExecutor();
			threadPoolTaskExecutor.initialize();
			this.executor = threadPoolTaskExecutor;
		}
	}
複製程式碼

說實話，已經很少見到這種寫法了。

這樣建立出來的ThreadPoolTaskExecutor，他的幾個關鍵屬性是：

	private int corePoolSize = 1;
	private int maxPoolSize = Integer.MAX_VALUE;
	private int keepAliveSeconds = 60;
	private int queueCapacity = Integer.MAX_VALUE;
	private boolean allowCoreThreadTimeOut = false;
複製程式碼

核心執行緒數是1

等待佇列最大長度是Integer.MAX_VALUE

最大執行緒數是Integer.MAX_VALUE

在這種配置下，執行緒池始終只有一個執行緒在處理任務，只有等待佇列滿後才會建立第二個worker執行緒，然而，這個佇列太大了。。。

其實，更可氣的是，這個executor屬性我們是沒有辦法配置的，他所在的類ClusterCommandExecutor也不是一個被Spring容器管理的類。

所以，問題已經找到了。由於叢集模式下Spring對jedis的封裝，在使用批量方法刪除時會把任務都提交到僅有一個核心執行緒的executor中執行，在高併發場景下會造成應用內大量任務處於排隊狀態而得不到執行。

其他

相關的程式碼已上傳到github github.com/huifrank/re…

包含有redis建立叢集相關的指令碼，可以使用docker，或者是直接在宿主機上啟動。

如果是mac的話，還是不要用docker建立redis叢集了。。mac下不支援docker的host網路模式(僅linux支援)，如果使用橋接的話，由於spring執行cluster nodes拿到的是docker內的網路地址，儘管你配置的是執行過埠對映的地址，但是spring還是會用自己拿到的內網地址，一樣啟動不成功。

我搞了半天，放棄了。。。

Executors使用不當引起的記憶體溢位
2020-09-10
記憶體溢位
IO 密集型服務耗時優化
2022-04-01
優化
GC耗時高，原因竟是服務流量小？
2023-01-07
GC
【web服務】耗時任務基於API與worker模式
2024-10-15
WebAPI模式
bring on 導致；引起；使…發展
2018-08-31
一次排查線上介面偶發異常耗時引起的思考！
2020-07-11
當API成為服務
2021-12-07
API
Vue 服務端渲染實踐 ——Web應用首屏耗時最優化方案
2019-02-20
Vue服務端Web優化
Redis的KEYS命令引起當機事件
2019-04-19
Redis事件
服務執行過程中磁碟壞道引起的思考
2021-10-16
【Azure 應用服務】App Service/Azure Function的出站連線過多而引起了SNAT埠耗盡，導致一些新的請求出現超時錯誤(Timeout)
2021-04-24
APPFunction
Redis使用不當導致應用卡死
2022-12-08
Redis
上線影片推理服務使AI應用更高效
2020-07-15
AI
CRM服務管理是什麼？如何使企業受益？
2023-03-28
CSS中哪些屬性會引起GPU渲染，會增加耗電嗎？
2024-12-03
CSSGPU
臨近年關，兩個磁碟佔滿引起的服務下線
2020-12-29
DRM特性引起的RAC節點當機
2023-02-08
定時任務crond服務
2018-11-30
實時同步服務
2024-10-18
【Azure 應用服務】由 Azure Functions runtime is unreachable 的錯誤訊息推匯出 ASYNC(非同步）和 SYNC(同步）混用而引起ThreadPool耗盡問題
2021-06-05
Function非同步thread
Canalys：2020年Q4全球雲服務市場激增100億美元
2021-02-04
Yii1列印當前請求所有執行的SQL及耗時
2018-12-06
SQL
如何規避雲服務當機現象？
2024-01-11
當代服務型遊戲大盤點
2019-08-29
遊戲
Android 快速定位耗時方法
2019-12-31
Android
systrace 統計方法耗時
2022-06-07
chrony時間同步服務
2024-05-19
域名解析耗時是什麼？域名解析耗時影響因素有哪些？
2022-06-09
資本加持、需求激增，“體驗式”服務成遊戲行業新風口
2019-11-08
遊戲行業
建立sshd服務容器，並使宿主機與容器免密通訊
2024-07-29
日常Bug排查-系統失去響應-Redis使用不當
2021-05-20
Redis
PHP程式碼審計05之正則使用不當
2020-12-29
PHP
記一次 Mysql 日期使用不當造成的 bug
2020-11-27
MySql
iOS 定時器耗電探究
2018-09-14
iOS定時器
JVM 中配置服務時區
2021-11-16
JVM
從線上當機引起的問題談開去
2020-06-21
RedisTemplate讀取slowlog
2018-09-09
Redis
服務應用突然當機了？別怕，Dubbo 幫你自動搞定服務隔離！
2020-08-28

RedisTemplate使用不當引起服務耗時激增

背景

最小復現環境

redis叢集

測試專案

壓測

講解部分

其他

相關文章