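1: Background
- The story
In September, during a 50-concurrent-user stress test of a company project deployed on a certain platform, the thread count and memory of one container grew abnormally until the feature stopped working. I applied what I had learned to analyze the thread and memory problem myself. Either lldb or WinDbg can be used for this kind of analysis; I prefer WinDbg for its graphical interface, so that is what I used.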
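2: WinDbg analysis
- Where exactly is the leak
On Windows, many readers will know to check with the !address -summary command. That command is specific to Windows, however, and is not useful on a dump taken on Linux, as the following output shows: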
0:000> !address -summary
Mapping file section regions...
Mapping module regions...
Mapping heap regions...
--- Usage Summary ---------------- RgnCount ----------- Total Size -------- %ofBusy %ofTotal
<unknown> 4062 ffffffff`f5638600 ( 16.000 EB) 100.00% 100.00%
Image 1282 0`09fc8a00 ( 159.784 MB) 0.00% 0.00%
--- Type Summary (for busy) ------ RgnCount ----------- Total Size -------- %ofBusy %ofTotal
2431 fffffffe`2b813000 ( 16.000 EB) 100.00%
MEM_PRIVATE 2913 1`d3dee000 ( 7.310 GB) 0.00% 0.00%
--- State Summary ---------------- RgnCount ----------- Total Size -------- %ofBusy %ofTotal
2431 fffffffe`2b813000 ( 16.000 EB) 100.00% 100.00%
MEM_COMMIT 2913 1`d3dee000 ( 7.310 GB) 0.00% 0.00%
--- Protect Summary (for commit) - RgnCount ----------- Total Size -------- %ofBusy %ofTotal
PAGE_READWRITE 2115 1`cb683000 ( 7.178 GB) 0.00% 0.00%
PAGE_EXECUTE_READ 175 0`03d49000 ( 61.285 MB) 0.00% 0.00%
PAGE_READONLY 585 0`03ce9000 ( 60.910 MB) 0.00% 0.00%
PAGE_EXECUTE_WRITECOPY 38 0`00d39000 ( 13.223 MB) 0.00% 0.00%
--- Largest Region by Usage ----------- Base Address -------- Region Size ----------
<unknown> 7ffc`011fa000 ffff8003`fe406000 ( 16.000 EB)
Image 7f45`fe4e9000 0`01b16000 ( 27.086 MB)
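The memory-region categories in this output are of little use as a reference. So what now? The coreclr team anticipated this situation and provides a maddress command, a cross-platform counterpart to !address. Its output is as follows: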
0:000> !sos maddress
Enumerating and tagging the entire address space and caching the result...
Subsequent runs of this command should be faster.
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Memory Kind | StartAddr | EndAddr-1 | Size | Type | State | Protect | Image |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Stack | 7f42d256e000 | 7f42d2d6e000 | 8.00mb | MEM_PRIVATE | MEM_COMMIT | PAGE_READWRITE | |
| Stack | 7f42d3570000 | 7f42d3d70000 | 8.00mb | MEM_PRIVATE | MEM_COMMIT | PAGE_READWRITE | |
| Stack | 7f42d3d71000 | 7f42d4571000 | 8.00mb | MEM_PRIVATE | MEM_COMMIT | PAGE_READWRITE | |
| Stack | 7f42d4572000 | 7f42d4d72000 | 8.00mb | MEM_PRIVATE | MEM_COMMIT | PAGE_READWRITE | |
| Stack | 7f42d4d73000 | 7f42d5573000 | 8.00mb | MEM_PRIVATE | MEM_COMMIT | PAGE_READWRITE | |
| Stack | 7f42d5574000 | 7f42d5d74000 | 8.00mb | MEM_PRIVATE | MEM_COMMIT | PAGE_READWRITE | |
| Stack | 7f42d5d75000 | 7f42d6575000 | 8.00mb | MEM_PRIVATE | MEM_COMMIT | PAGE_READWRITE | |
| Stack | 7f42d6d77000 | 7f42d7577000 | 8.00mb | MEM_PRIVATE | MEM_COMMIT | PAGE_READWRITE | |
| Stack | 7f42d7578000 | 7f42d7d78000 | 8.00mb | MEM_PRIVATE | MEM_COMMIT | PAGE_READWRITE | |
| Stack | 7f42d7d79000 | 7f42d8579000 | 8.00mb | MEM_PRIVATE | MEM_COMMIT | PAGE_READWRITE | |
| Stack | 7f42d857a000 | 7f42d8d7a000 | 8.00mb | MEM_PRIVATE | MEM_COMMIT | PAGE_READWRITE | |
...
+-------------------------------------------------------------------------+
| Memory Type | Count | Size | Size (bytes) |
+-------------------------------------------------------------------------+
| Stack | 788 | 6.28gb | 6,743,269,376 |
| GCHeap | 48 | 688.98mb | 722,448,384 |
| PAGE_READWRITE | 930 | 180.22mb | 188,977,152 |
| Image | 1,278 | 159.69mb | 167,447,040 |
| HighFrequencyHeap | 327 | 20.35mb | 21,336,064 |
| LowFrequencyHeap | 259 | 18.31mb | 19,202,048 |
| LoaderCodeHeap | 15 | 17.53mb | 18,378,752 |
| HostCodeHeap | 11 | 1.51mb | 1,581,056 |
| ResolveHeap | 1 | 348.00kb | 356,352 |
| PAGE_READONLY | 123 | 261.50kb | 267,776 |
| DispatchHeap | 1 | 196.00kb | 200,704 |
| IndirectionCellHeap | 3 | 152.00kb | 155,648 |
| LookupHeap | 3 | 144.00kb | 147,456 |
| CacheEntryHeap | 2 | 100.00kb | 102,400 |
| PAGE_EXECUTE_WRITECOPY | 5 | 96.00kb | 98,304 |
| StubHeap | 2 | 76.00kb | 77,824 |
| PAGE_EXECUTE_READ | 2 | 8.00kb | 8,192 |
+-------------------------------------------------------------------------+
| [TOTAL] | 3,798 | 7.34gb | 7,884,054,528 |
+-------------------------------------------------------------------------+
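The output shows that of the 7.34gb total, 6.28gb is taken up by thread stacks. More surprising still, each thread stack occupies 8 MB, which is really rather large. The dump also reports every one of these regions as MEM_COMMIT: unlike Windows, where stack space is first reserved and only committed later, the 8 MB here appears to be claimed up front. Inspecting the memory of one of these 8 MB regions shows that it is still all zero-initialized, which is hard to justify.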
0:000> dp 7f42d256e000 7f42d2d6e000
...
00007f42`d2d6dfa0 00000000`00000000 00000000`00000000
00007f42`d2d6dfb0 00000000`00000000 00000000`00000000
00007f42`d2d6dfc0 00000000`00000000 00000000`00000000
00007f42`d2d6dfd0 00000000`00000000 00000000`00000000
00007f42`d2d6dfe0 00000000`00000000 00000000`00000000
00007f42`d2d6dff0 00000000`00000000 00000000`00000000
00007f42`d2d6e000 ????????`????????
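To put a number on how dominant the stacks are, the maddress summary can be parsed and the stack share computed from the byte column. The following is a minimal Python sketch, assuming the summary is available as text in the layout shown above; `parse_maddress_summary` is a hypothetical helper written for this post, not part of SOS, and only three rows of the table are pasted in for brevity.

```python
def parse_maddress_summary(text):
    """Return {memory_type: (region_count, size_in_bytes)} from a maddress summary table."""
    rows = {}
    for line in text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) != 4:
            continue  # table borders and unrelated lines
        name, count, _pretty_size, size = cells
        try:
            rows[name] = (int(count.replace(",", "")), int(size.replace(",", "")))
        except ValueError:
            continue  # header row ("Count", "Size (bytes)")
    return rows


# Three rows copied from the summary above.
SUMMARY = """
| Stack | 788 | 6.28gb | 6,743,269,376 |
| GCHeap | 48 | 688.98mb | 722,448,384 |
| [TOTAL] | 3,798 | 7.34gb | 7,884,054,528 |
"""

rows = parse_maddress_summary(SUMMARY)
stack_count, stack_bytes = rows["Stack"]
total_bytes = rows["[TOTAL]"][1]
stack_share = stack_bytes / total_bytes
avg_stack_mib = stack_bytes / stack_count / 2**20

print(f"stack share of mapped memory: {stack_share:.1%}")
print(f"average size per stack region: {avg_stack_mib:.2f} MiB")
```

The average comes out slightly above 8 MiB, which suggests a few stack regions are larger than the default while most are exactly 8 MB.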
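- How to change the stack size
Different OS distributions generally ship with different default stack sizes, so the first step is to find out which distribution this is. One way is to search memory for a keyword from the operating system's name.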
0:000> s-a 0 L?0xffffffffffffffff "centos"
...
00005570`9cddbc18 63 65 6e 74 6f 73 2e 37-2d 78 36 34 00 00 00 00 centos.7-x64....
...
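The output shows that the operating system is centos7-x64. On Windows the stack size can be changed by editing the PE header; on Linux there are two approaches.
Change the ulimit -s setting (not recommended)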
root@ubuntu:/data# ulimit -s
8192
root@ubuntu:/data# ulimit -s 2048
root@ubuntu:/data# ulimit -s
2048
Set the DOTNET_DefaultStackSize environment variable (recommended; configure it in the environment of the affected container)
DOTNET_DefaultStackSize=180000
For more details, see: https://www.alexander-koepke.de/post/2023-10-18-til-dotnet-stack-size/
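The value 180000 is easy to misread as a decimal number. Assuming the runtime interprets DOTNET_DefaultStackSize as a hexadecimal byte count (an assumption here, though it matches the 1.50mb stack regions seen in the dump taken after the change), the conversion can be checked with a short Python sketch. Both helper names are hypothetical.

```python
def stack_size_bytes(hex_value):
    """Interpret a DOTNET_DefaultStackSize value as a hexadecimal byte count (assumption)."""
    return int(hex_value, 16)


def to_setting(size_bytes):
    """Format a byte count as the hex string to place in the environment variable."""
    return format(size_bytes, "x")


size = stack_size_bytes("180000")
print(size, "bytes =", size / 2**20, "MiB")
print("1 MiB would be written as", to_setting(1 * 2**20))
```

Under that assumption, 180000 corresponds to 0x180000 bytes, that is 1.5 MiB, rather than roughly 180 KB.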
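That covers the first direction for fixing the problem. Now for the other one: why are there so many threads in the first place, with 788 stack regions in the summary above?
- Why are there so many threads
To answer that, we need to look at what each thread is doing at this moment, which can be done with a WinDbg-specific command.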
0:000> ~*e !clrstack
...
OS Thread Id: 0x1b82 (225)
Child SP IP Call Site
00007F441B7FD660 00007f4cdbb69ad8 [HelperMethodFrame_1OBJ: 00007f441b7fd660] System.Threading.Monitor.ObjWait(Int32, System.Object)
00007F441B7FD790 00007f4c676318cd System.Threading.ManualResetEventSlim.Wait(Int32, System.Threading.CancellationToken) [/_/src/libraries/System.Private.CoreLib/src/System/Threading/ManualResetEventSlim.cs @ 570]
00007F441B7FD810 00007f4c676312e1 System.Net.Sockets.SocketAsyncContext.PerformSyncOperation[[System.__Canon, System.Private.CoreLib]](OperationQueue`1<System.__Canon> ByRef, System.__Canon, Int32, Int32) [/_/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncContext.Unix.cs @ 1330]
00007F441B7FD8A0 00007f4c67e26ff1 System.Net.Sockets.SocketAsyncContext.ReceiveFrom(System.Memory`1, System.Net.Sockets.SocketFlags ByRef, Byte[], Int32 ByRef, Int32, Int32 ByRef) [/_/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncContext.Unix.cs @ 1557]
00007F441B7FD920 00007f4c67e2ea6b System.Net.Sockets.SocketPal.Receive(System.Net.Sockets.SafeSocketHandle, Byte[], Int32, Int32, System.Net.Sockets.SocketFlags, Int32 ByRef)
00007F441B7FD9A0 00007f4c67e26c37 System.Net.Sockets.Socket.Receive(Byte[], Int32, Int32, System.Net.Sockets.SocketFlags, System.Net.Sockets.SocketError ByRef)
00007F441B7FDA20 00007f4c67e26929 System.Net.Sockets.NetworkStream.Read(Byte[], Int32, Int32) [/_/src/libraries/System.Net.Sockets/src/System/Net/Sockets/NetworkStream.cs @ 231]
00007F441B7FDA70 00007f4c69b85757 System.IO.BufferedStream.ReadByteSlow() [/_/src/libraries/System.Private.CoreLib/src/System/IO/BufferedStream.cs @ 771]
00007F441B7FDA90 00007f4c69b774e8 System.IO.BinaryReader.ReadByte() [/_/src/libraries/System.Private.CoreLib/src/System/IO/BinaryReader.cs @ 207]
00007F441B7FDAA0 00007f4c69b853ee RabbitMQ.Client.Impl.InboundFrame.ReadFrom(RabbitMQ.Util.NetworkBinaryReader)
00007F441B7FDAF0 00007f4c69b852c6 RabbitMQ.Client.Framing.Impl.Connection.MainLoopIteration()
00007F441B7FDB10 00007f4c69b57068 RabbitMQ.Client.Framing.Impl.Connection.MainLoop()
00007F441B7FDB50 00007f4c67590d19 System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object) [/_/src/libraries/System.Private.CoreLib/src/System/Threading/ExecutionContext.cs @ 183]
00007F441B7FDCF0 00007f4cdb1e3aa7 [DebuggerU2MCatchHandlerFrame: 00007f441b7fdcf0]
...
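The dump itself can be captured with a standard tool such as dotnet-dump or procdump. The output above shows a large number of stacks with frames from RabbitMQ.Client.Framing.Impl, which suggests that many threads are blocked inside RabbitMQ.Client.Framing.Impl.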
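Reading several hundred stacks by eye is error-prone, so it can help to save the `~*e !clrstack` output to a text file and count how many threads contain a given frame. Below is a minimal Python sketch, assuming each thread's section starts with an `OS Thread Id:` line as in the output above. The helper names are hypothetical, and the second thread in the sample is invented for contrast.

```python
import re
from collections import Counter

THREAD_HEADER = re.compile(r"^OS Thread Id: (0x[0-9a-fA-F]+)", re.MULTILINE)


def split_threads(dump_text):
    """Split `~*e !clrstack` output into {os_thread_id: stack_text}."""
    matches = list(THREAD_HEADER.finditer(dump_text))
    threads = {}
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(dump_text)
        threads[m.group(1)] = dump_text[m.end():end]
    return threads


def count_by_marker(threads, markers):
    """Count the threads whose stack text mentions each marker string."""
    counts = Counter()
    for stack in threads.values():
        for marker in markers:
            if marker in stack:
                counts[marker] += 1  # at most once per thread per marker
    return counts


# Thread 0x1b82 is abridged from the output above; thread 0x1b83 is an invented example.
SAMPLE = """\
OS Thread Id: 0x1b82 (225)
        Child SP               IP Call Site
00007F441B7FDAF0 00007f4c69b852c6 RabbitMQ.Client.Framing.Impl.Connection.MainLoopIteration()
00007F441B7FDB10 00007f4c69b57068 RabbitMQ.Client.Framing.Impl.Connection.MainLoop()
OS Thread Id: 0x1b83 (226)
        Child SP               IP Call Site
00007F441B7FE790 00007f4c676318cd System.Threading.ManualResetEventSlim.Wait(Int32, System.Threading.CancellationToken)
"""

MARKER = "RabbitMQ.Client.Framing.Impl.Connection.MainLoop"
threads = split_threads(SAMPLE)
counts = count_by_marker(threads, [MARKER])
print(len(threads), "threads,", counts[MARKER], "in the RabbitMQ main loop")
```

Run against the full output, a count close to the total number of threads would support the guess that the RabbitMQ connection loop accounts for most of them.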
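With this information, the final recommendations were as follows:
Set the DOTNET_DefaultStackSize variable
Follow the 1.5 MB default stack size that .NET Core uses on Windows: 8 MB is simply too large to sustain and is at odds with the low memory footprint expected on Linux. After the change, a dump taken during another stress test confirmed that the setting had taken effect: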
0:000> !sos maddress
Enumerating and tagging the entire address space and caching the result...
Subsequent runs of this command should be faster.
*** WARNING: Unable to verify timestamp for lttng-ust-wait-8-0
*** WARNING: Unable to verify timestamp for lttng-ust-wait-8
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Memory Kind | StartAddr | EndAddr-1 | Size | Type | State | Protect | Image |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
.......
| Stack | 7fabe4e8c000 | 7fabe500c000 | 1.50mb | MEM_PRIVATE | MEM_COMMIT | PAGE_READWRITE | |
| Stack | 7fabe500d000 | 7fabe518d000 | 1.50mb | MEM_PRIVATE | MEM_COMMIT | PAGE_READWRITE | |
| Stack | 7fabe518e000 | 7fabe530e000 | 1.50mb | MEM_PRIVATE | MEM_COMMIT | PAGE_READWRITE | |
| Stack | 7fabe530f000 | 7fabe548f000 | 1.50mb | MEM_PRIVATE | MEM_COMMIT | PAGE_READWRITE | |
| Stack | 7fabe5490000 | 7fabe5610000 | 1.50mb | MEM_PRIVATE | MEM_COMMIT | PAGE_READWRITE | |
| Stack | 7fabe5611000 | 7fabe5791000 | 1.50mb | MEM_PRIVATE | MEM_COMMIT | PAGE_READWRITE | |
| Stack | 7fabe5792000 | 7fabe5912000 | 1.50mb | MEM_PRIVATE | MEM_COMMIT | PAGE_READWRITE | |
| Stack | 7fabe5913000 | 7fabe5a93000 | 1.50mb | MEM_PRIVATE | MEM_COMMIT | PAGE_READWRITE | |
| Stack | 7fabe5a94000 | 7fabe5c14000 | 1.50mb | MEM_PRIVATE | MEM_COMMIT | PAGE_READWRITE | |
| Stack | 7fabe5c15000 | 7fabe5d95000 | 1.50mb | MEM_PRIVATE | MEM_COMMIT | PAGE_READWRITE | |
.......
+-------------------------------------------------------------------------+
| Memory Type | Count | Size | Size (bytes) |
+-------------------------------------------------------------------------+
| Stack | 766 | 1.41gb | 1,518,571,520 |
| GCHeap | 48 | 702.39mb | 736,509,952 |
| PAGE_READWRITE | 931 | 186.31mb | 195,358,720 |
| Image | 1,283 | 158.77mb | 166,480,384 |
| HighFrequencyHeap | 336 | 20.97mb | 21,991,424 |
| LowFrequencyHeap | 256 | 18.32mb | 19,214,336 |
| LoaderCodeHeap | 15 | 17.53mb | 18,378,752 |
| HostCodeHeap | 11 | 1.63mb | 1,703,936 |
| ResolveHeap | 1 | 348.00kb | 356,352 |
| PAGE_READONLY | 123 | 261.50kb | 267,776 |
| DispatchHeap | 1 | 196.00kb | 200,704 |
| IndirectionCellHeap | 3 | 152.00kb | 155,648 |
| LookupHeap | 3 | 144.00kb | 147,456 |
| PAGE_EXECUTE_WRITECOPY | 5 | 132.00kb | 135,168 |
| CacheEntryHeap | 2 | 100.00kb | 102,400 |
| StubHeap | 2 | 76.00kb | 77,824 |
| PAGE_EXECUTE_READ | 2 | 8.00kb | 8,192 |
+-------------------------------------------------------------------------+
| [TOTAL] | 3,788 | 2.50gb | 2,679,660,544 |
+-------------------------------------------------------------------------+
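The effect of the smaller stack size can be quantified from the Stack rows of the two summaries. The following is a small Python sketch using only the figures reported in this post:

```python
# Stack rows copied from the two maddress summaries in this post.
before = {"count": 788, "bytes": 6_743_269_376}
after = {"count": 766, "bytes": 1_518_571_520}

saved_bytes = before["bytes"] - after["bytes"]
reduction = saved_bytes / before["bytes"]

print(f"stack memory saved: {saved_bytes / 2**30:.2f} GiB ({reduction:.1%})")
print(f"stack regions: {before['count']} -> {after['count']}")
```

The number of stack regions barely moves between the two dumps, so this setting only shrinks each stack. The thread count itself still has to be dealt with separately.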
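Review the project code related to RabbitMQ.Client.Framing.Impl
The reference turned out to be unused in the code. After removing it and re-running the stress test, the thread count was back to normal.
3: Summary
The .NET debugging ecosystem on Linux keeps getting richer, which is exciting. Finally, a thumbs-up to my teacher 《一線碼農》 and to WinDbg.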