記一次 .NET 某藥品倉儲管理系統 卡死分析

一線碼農發表於2022-01-17

一:背景

1. 講故事

這個月初,有位朋友wx上找到我,說他的api過一段時間後,就會出現只有請求,沒有響應的情況,截圖如下:

從朋友的描述中看樣子程式是被什麼東西卡住了,這種卡死的問題解決起來相對簡單,接下來我就用 windbg 給大家分析一下。

二:Windbg 分析

1. Request 請求正在幹嘛?

既然朋友說 api 有 request 無 response,那怎麼去驗證朋友的話對不對呢? 我們都知道 .NET 用 HttpContext 來表示一個請求,言外之意就是可以去抓 HttpContext 下的時長屬性,Netext 中有一個 !whttp 命令可以幫助我們。


0:000> !whttp
HttpContext    Thread Time Out Running  Status Verb     Url
000000563bf803b0   42 00:01:50 00:01:24    200 POST     http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=x-HN
000000563bf84660   -- 00:01:50 Finished    200 GET      http://xxx.com:30003/
000000563c4a0470   51 00:01:50 00:00:12    200 POST     http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=xxx2C
00000056bbf63590   30 00:01:50 00:02:41    200 POST     http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=xxx-B2C
00000056bc82a038   -- 00:01:50 Finished    200 GET      http://localhost:30003/
00000056bc84a3e8   44 00:01:50 00:00:51    200 POST     http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=x
00000056bc8671c8   46 00:01:50 00:00:45    200 POST     http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=xxx-B2C
000000573bf44698   35 00:01:50 00:02:39    200 POST     http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=x
000000573bf483c0   33 00:01:50 00:02:41    200 POST     http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=x-HN
000000573bf97e80   40 00:01:50 00:02:32    200 POST     http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=ZJB2C
000000573c583b08   -- 00:01:50 Finished    200 GET      http://localhost:30003/
000000573c589ec8   -- 00:01:50 Finished    200 GET      http://xxx.com:30003/Wms/xxx/xxx/xxx
000000573c760e28   -- 00:01:50 Finished    200 POST     http://xxx.com:30003/Wms/xxx/xxx/xxx
000000573c95f990   48 00:01:50 00:00:31    200 POST     http://xxx.com:30003/Wms/Common/xxx?xxx=xxx&xxx=x-HN
00000057bbf4f8e8   31 00:01:50 00:02:12    200 POST     http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=x
00000057bc080340   50 00:01:50 00:00:19    200 POST     http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=x
000000583c4aee80   43 00:01:50 00:01:11    200 POST     http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=xxx2B
000000583c4d0c50   53 00:01:50 00:00:01    200 POST     http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=xxx2B
00000058bbf8f1a0   34 00:01:50 00:02:22    200 POST     http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=xxx2B
000000593bfe1758   41 00:01:50 00:01:22    200 POST     http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=xxx2C
000000593c892160   -- 00:01:50 Finished    200 GET      http://xxx.com:30003/Wms/xxx/xxx/xxxJob
000000593ca813b0   45 00:01:50 00:00:30    200 POST     http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=xxx-HN
000000593caa45d8   -- 00:01:50 Finished    200 GET      http://xxx.com:30003/
00000059bc1ad808   32 00:01:50 00:01:45    200 POST     http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=xxx-B2C
00000059bc1c3d70   36 00:01:50 00:01:29    200 POST     http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=x

25 HttpContext object(s) found matching criteria

Running 列可以看到大多請求都已經達到1分鐘以上,這也驗證了朋友所說的卡死問題,按照經驗,可以取 Running 列中最大的 httpContext 所在的執行緒,也就是上面的 3033 號執行緒, 看看它們都在幹什麼?

2. 探究 Running 最長執行緒

接下來切到 3033 號執行緒,看看它們的執行緒棧。


0:000> ~30s
ntdll!NtWaitForSingleObject+0xa:
00007ffd`b81f024a c3              ret
0:030> !clrstack 
OS Thread Id: 0x29d0 (30)
        Child SP               IP Call Site
0000005acc3ac590 00007ffdb81f024a [PrestubMethodFrame: 0000005acc3ac590] xxx.xxx.RedisConnectionHelp.get_Instance()
0000005acc3ac850 00007ffd4dd78911 xxx.xxx.RedisCache..ctor(Int32, System.String)
0000005acc3ac8c0 00007ffd4dd78038 xxx.xxx.CacheByRedis.HashGet[[System.__Canon, mscorlib]](System.String, System.String, Int32)
0000005acc3ac968 00007ffdabef1f7c [StubHelperFrame: 0000005acc3ac968] 
0000005acc3ac9c0 00007ffd4dd77f18 xxx.xxx.Cache.xxx.GetCacheNotAreaDataEntity[[System.__Canon, mscorlib]](System.String, System.String, System.String)

...

0:030> ~33s
ntdll!NtWaitForMultipleObjects+0xa:
00007ffd`b81f07ba c3              ret
0:033> !clrstack 
OS Thread Id: 0x3ad4 (33)
        Child SP               IP Call Site
0000005accabae90 00007ffdb81f07ba [GCFrame: 0000005accabae90] 
0000005accabafb8 00007ffdb81f07ba [HelperMethodFrame_1OBJ: 0000005accabafb8] System.Threading.Monitor.ObjWait(Boolean, Int32, System.Object)
0000005accabb0d0 00007ffdaac60d64 System.Threading.ManualResetEventSlim.Wait(Int32, System.Threading.CancellationToken)
0000005accabb160 00007ffdaac5b4bb System.Threading.Tasks.Task.SpinThenBlockingWait(Int32, System.Threading.CancellationToken)
0000005accabb1d0 00007ffdab5a01d1 System.Threading.Tasks.Task.InternalWait(Int32, System.Threading.CancellationToken)
0000005accabb2a0 00007ffdab59cfa7 System.Threading.Tasks.Task`1[[System.__Canon, mscorlib]].GetResultxxx(Boolean)
0000005accabb2e0 00007ffd4d8d338f xxx.Config.xxx.Config`1[[System.__Canon, mscorlib]].GetConfig(xxx.Config.Model.ConfigListener, System.Func`2<xxx.Config.Request.GetConfigRequest,System.Threading.Tasks.Task`1<System.String>>)
0000005accabb340 00007ffd4d8d2f40 xxx.Config.xxx.Config`1[[System.__Canon, mscorlib]].get_Item(System.String, System.String)
0000005accabb3c0 00007ffd4dd78f7f xxx.Util.BaseConfig.get_GetRedisConn()
0000005accabb440 00007ffd4dd78e9c xxx.xxx.RedisConnectionHelp.GetConnectionString()
0000005accabb4a0 00007ffd4dd789cb xxx.xxx.RedisConnectionHelp..cctor()
0000005accabb940 00007ffdabef6953 [GCFrame: 0000005accabb940] 
0000005accabc5b0 00007ffdabef6953 [PrestubMethodFrame: 0000005accabc5b0] xxx.xxx.RedisConnectionHelp.get_Instance()
0000005accabc870 00007ffd4dd78911 xxx.xxx.RedisCache..ctor(Int32, System.String)
0000005accabc8e0 00007ffd4dd78038 xxx.xxx.CacheByRedis.HashGet[[System.__Canon, mscorlib]](System.String, System.String, Int32)
0000005accabc988 00007ffdabef1f7c [StubHelperFrame: 0000005accabc988] 
0000005accabc9e0 00007ffd4dd77f18 xxx.Core.Cache.xxx.GetCacheNotAreaDataEntity[[System.__Canon, mscorlib]](System.String, System.String, System.String)
...

上面的資訊不難發現 30 號執行緒正卡在 RedisConnectionHelp.get_Instance() 處,33 號線已經進入了 RedisConnectionHelp.get_Instance() 方法中,最後在 GetConfig() 處等待 Result 的結果,按經驗來說,30 號執行緒看樣子正在鎖等待, 33 號正在等待非同步結果,接下來的突破點就是探究下 RedisConnectionHelp.Instance 處程式碼。

3. 尋找問題程式碼

接下來用反編譯工具 ILSpy 找到問題程式碼。


public static class RedisConnectionHelp
{
    public static ConnectionMultiplexer Instance
    {
        get
        {
            if (_instance == null)
            {
                lock (Locker)
                {
                    if (_instance == null || !_instance.IsConnected)
                    {
                        _instance = GetManager();
                    }
                }
            }
            return _instance;
        }
    }
}

30 號執行緒果然是卡在 Locker 處,接下來深挖下 33 號執行緒所執行的 GetManager() 方法,簡化後程式碼如下:


public T this[string dataId, string key = ""]
{
    get
    {
        try
        {
            string config = GetConfig(configListener, new NacosConfigClient(Base.NacosConfiguration).DoGetConfigAsync);

            return JsonConvert.DeserializeObject<T>(config);
        }
        catch (Exception ex)
        {
            return default(T);
        }
    }
}

private string GetConfig(ConfigListener listener, Func<GetConfigRequest, Task<string>> action)
{
    var text2 = action(new GetConfigRequest
    {
        DataId = listener.DataId,
        Group = listener.Group,
        Tenant = text
    }).Result;

    return text2;
}

internal async Task<string> DoGetConfigAsync(GetConfigRequest request)
{
    IRestResponse restResponse = await HttpUtil.Request(currentServerAddr, Method.GET, request.ParamValues(), xxx);

    return restResponse.Content;
}

可以看到程式碼卡在了 Result 上無限期等待,到這裡我就想到了 同步上下文 ,我看他這個程式是 .NET 4.8 下的 ASP.NET MVC,記得不錯上下文應該是 AspNetSynchronizationContext,具體死鎖原因可參見我的這篇文章:一句 Task.Result 就死鎖, 這程式碼還怎麼寫?,解決辦法大概有四種。

  1. 使用 .ConfigureAwait(false)
  2. 改成全非同步
  3. 用 Task 再包一層。
  4. 改成全同步

三:總結

其實本次事故主要還是因為在 同步程式碼 中做了 非同步程式碼.Result 導致的死鎖問題,有非常多的文章都在抨擊這種現象,在 asp.net core 中已經移除了這種同步上下文的大坑,給到朋友的建議是改成全同步,死鎖問題也隨之消失。

哈哈,真替朋友開心!

相關文章