利用dotnet-dump分析docker容器記憶體洩露

找不到一個滿意的暱稱發表於2021-01-12

教程:官方文件https://docs.microsoft.com/zh-cn/dotnet/core/diagnostics/debug-memory-leak
環境:Linux、Docker、.NET Core 3.1 SDK及更高版本
示例程式碼:https://github.com/dotnet/samples/tree/master/core/diagnostics/DiagnosticScenarios

一 執行官方示例

示例中包含了洩漏記憶體、死鎖執行緒、CPU佔用過高等介面,方便學習。

1,Clone程式碼並編譯

[root@localhost ~]# cd Diagnostic_scenarios_sample_debug_target
[root@localhost Diagnostic_scenarios_sample_debug_target]# dotnet build -c Release
Microsoft (R) Build Engine version 16.7.0+7fb82e5b2 for .NET
...
Build succeeded.
    0 Warning(s)
    0 Error(s)

Time Elapsed 00:00:05.62

2,建立Dockerfile構建映象

FROM mcr.microsoft.com/dotnet/core/aspnet:3.1-buster-slim AS base
WORKDIR /app
EXPOSE 80

COPY  bin/Release/netcoreapp3.1 .

ENTRYPOINT ["dotnet", "DiagnosticScenarios.dll"]
[root@localhost Diagnostic_scenarios_sample_debug_target]# vim Dockerfile
[root@localhost Diagnostic_scenarios_sample_debug_target]# docker build -t dumptest .
Sending build context to Docker daemon   1.47MB
Step 1/5 : FROM mcr.microsoft.com/dotnet/core/aspnet:3.1-buster-slim AS base
...
Successfully built 928e512d9be4
Successfully tagged dumptest:latest
[root@localhost Diagnostic_scenarios_sample_debug_target]# docker images
REPOSITORY                                                              TAG                   IMAGE ID            CREATED             SIZE
dumptest                                                                latest                928e512d9be4        5 seconds ago       267MB

3,啟動容器

--privileged=true引數的作用是給容器賦予root許可權,後面createdump需要該支援。

[root@localhost Diagnostic_scenarios_sample_debug_target]# docker run --name diagnostic --privileged=true -p 888:80 -d dumptest
a68efbfbfb13117142bac9b0c6fb9f4c5124e4490eaf4850dd17fdc612e2cfda
[root@localhost Diagnostic_scenarios_sample_debug_target]# docker ps
CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS                  NAMES
a68efbfbfb13        dumptest            "dotnet DiagnosticSc…"   24 seconds ago      Up 23 seconds       0.0.0.0:888->80/tcp   diagnostic

看看能否訪問成功 http://192.168.0.161:888/api/values


二 生成dump轉儲檔案

很多時候問題出現在生產環境,而測試環境又很難重現,所以問題出現時優先生成一個dump檔案,然後去恢復服務(重啟解決大部分問題啊?),最後再來慢慢分析問題是一個比較合理的方案。

1,製造問題

通過示例中提供的介面生成一些死鎖執行緒和記憶體洩漏。
http://192.168.0.161:888/api/diagscenario/deadlock
http://192.168.0.161:888/api/diagscenario/memleak/101024


2,建立dump檔案

進入示例容器,通過find找到netcore自帶的createdump工具;
執行createdump路徑 PID命令建立dump檔案(如果容器內只有一個應用,一般PID預設為1,也可以使用top命令來檢視PID)
容器佔用越大createdump越慢,建立完之後退出容器,將coredump.1檔案拷貝到宿主機慢慢分析。

[root@localhost dumpfile]# docker exec -it diagnostic bash
root@6b6f0a7ebe10:/app# find / -name createdump
/usr/share/dotnet/shared/Microsoft.NETCore.App/3.1.10/createdump
root@6b6f0a7ebe10:/app# /usr/share/dotnet/shared/Microsoft.NETCore.App/3.1.10/createdump 1
Writing minidump with heap to file /tmp/coredump.1
Written 2920263680 bytes (712955 pages) to core file
root@6b6f0a7ebe10:/app# exit
[root@localhost dumpfile]# docker cp 6b6f0a7ebe10:/tmp/coredump.1 /mnt/dumpfile/coredump.1
[root@localhost dumpfile]# ls
coredump.1

三 分析dump檔案

dotnet-dump工具依賴dotnet的sdk,如果宿主機中安裝了sdk可以直接在宿主機中分析;
如果不想汙染宿主機環境可以拉取一個sdk映象mcr.microsoft.com/dotnet/core/sdk:3.1,建立一個臨時環境用於分析。

1,建立一個用於分析的臨時容器

需要把用於存放coredump.1檔案的目錄掛載到容器,或者自己cp進去。
建立好容器之後需要安裝dotnet-dump工具

[root@iZwz9883hajros0tZ ~]# docker run --rm -it -v /mnt/dumpfile:/tmp/coredump mcr.microsoft.com/dotnet/core/sdk:3.1
root@6e560e9f5d55:/# cd /tmp/coredump   
root@6e560e9f5d55:/tmp/coredump# ls
coredump.1
root@6e560e9f5d55:/tmp/coredump# dotnet tool install -g dotnet-dump
Tools directory '/root/.dotnet/tools' is not currently on the PATH environment variable.
...
export PATH="$PATH:/root/.dotnet/tools"
You can invoke the tool using the following command: dotnet-dump
Tool 'dotnet-dump' (version '5.0.152202') was successfully installed.
root@6e560e9f5d55:/tmp/coredump# export PATH="$PATH:/root/.dotnet/tools"
root@6e560e9f5d55:/tmp/coredump#

2,分析死鎖

利用clrstack命令輸出呼叫堆疊。

root@6e560e9f5d55:/tmp/coredump# dotnet-dump analyze coredump.1
Loading core dump: coredump.1 ...
Ready to process analysis commands. Type 'help' to list available commands or 'help [command]' to get detailed help on a command.
Type 'quit' or 'exit' to exit the session.
> clrstack -all
OS Thread Id: 0x311
        Child SP               IP Call Site
00007FCDE06DF530 00007fd1521e600c [GCFrame: 00007fcde06df530] 
00007FCDE06DF620 00007fd1521e600c [GCFrame: 00007fcde06df620] 
00007FCDE06DF680 00007fd1521e600c [HelperMethodFrame_1OBJ: 00007fcde06df680] System.Threading.Monitor.ReliableEnter(System.Object, Boolean ByRef)
00007FCDE06DF7D0 00007FD0DB4F765A testwebapi.Controllers.DiagScenarioController.<deadlock>b__3_1()
00007FCDE06DF800 00007FD0D7A25862 System.Threading.ThreadHelper.ThreadStart_Context(System.Object) [/_/src/System.Private.CoreLib/src/System/Threading/Thread.CoreCLR.cs @ 44]
00007FCDE06DF820 00007FD0DB03686D System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object) [/_/src/System.Private.CoreLib/shared/System/Threading/ExecutionContext.cs @ 201]
00007FCDE06DF870 00007FD0D7A2597E System.Threading.ThreadHelper.ThreadStart() [/_/src/System.Private.CoreLib/src/System/Threading/Thread.CoreCLR.cs @ 93]
00007FCDE06DFBD0 00007fd1512f849f [GCFrame: 00007fcde06dfbd0] 
00007FCDE06DFCA0 00007fd1512f849f [DebuggerU2MCatchHandlerFrame: 00007fcde06dfca0] 
OS Thread Id: 0x312
        Child SP               IP Call Site
00007FCDDFEDE530 00007fd1521e600c [GCFrame: 00007fcddfede530] 
00007FCDDFEDE620 00007fd1521e600c [GCFrame: 00007fcddfede620] 
00007FCDDFEDE680 00007fd1521e600c [HelperMethodFrame_1OBJ: 00007fcddfede680] System.Threading.Monitor.ReliableEnter(System.Object, Boolean ByRef)
00007FCDDFEDE7D0 00007FD0DB4F765A testwebapi.Controllers.DiagScenarioController.<deadlock>b__3_1()
00007FCDDFEDE800 00007FD0D7A25862 System.Threading.ThreadHelper.ThreadStart_Context(System.Object) [/_/src/System.Private.CoreLib/src/System/Threading/Thread.CoreCLR.cs @ 44]
00007FCDDFEDE820 00007FD0DB03686D System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object) [/_/src/System.Private.CoreLib/shared/System/Threading/ExecutionContext.cs @ 201]
00007FCDDFEDE870 00007FD0D7A2597E System.Threading.ThreadHelper.ThreadStart() [/_/src/System.Private.CoreLib/src/System/Threading/Thread.CoreCLR.cs @ 93]
00007FCDDFEDEBD0 00007fd1512f849f [GCFrame: 00007fcddfedebd0] 
00007FCDDFEDECA0 00007fd1512f849f [DebuggerU2MCatchHandlerFrame: 00007fcddfedeca0] 
...

300多個執行緒的呼叫堆疊大多數執行緒共享一個公共呼叫堆疊,
該呼叫堆疊似乎顯示請求傳入了死鎖方法,而死鎖方法繼而又呼叫了 Monitor.ReliableEnter,此方法表示這些執行緒正試圖進入鎖定,然而這個obj可能已被其他執行緒獲取了排它鎖了。
光看這個呼叫堆疊,太難看出問題... 但是結合原始碼就很容易看出DeadlockFunc這個方法早就已經將o1,o2兩個object造成了交叉死鎖,然後後面啟動的300個執行緒都試圖獲取獲取鎖定,結果就是無限等待。

00007FCDDEEDC530 00007fd1521e600c [GCFrame: 00007fcddeedc530] 
00007FCDDEEDC620 00007fd1521e600c [GCFrame: 00007fcddeedc620] 
00007FCDDEEDC680 00007fd1521e600c [HelperMethodFrame_1OBJ: 00007fcddeedc680] System.Threading.Monitor.ReliableEnter(System.Object, Boolean ByRef)
00007FCDDEEDC7D0 00007FD0DB4F765A testwebapi.Controllers.DiagScenarioController.<deadlock>b__3_1()
00007FCDDEEDC800 00007FD0D7A25862 System.Threading.ThreadHelper.ThreadStart_Context(System.Object) [/_/src/System.Private.CoreLib/src/System/Threading/Thread.CoreCLR.cs @ 44]
00007FCDDEEDC820 00007FD0DB03686D System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object) [/_/src/System.Private.CoreLib/shared/System/Threading/ExecutionContext.cs @ 201]
00007FCDDEEDC870 00007FD0D7A2597E System.Threading.ThreadHelper.ThreadStart() [/_/src/System.Private.CoreLib/src/System/Threading/Thread.CoreCLR.cs @ 93]
00007FCDDEEDCBD0 00007fd1512f849f [GCFrame: 00007fcddeedcbd0] 
00007FCDDEEDCCA0 00007fd1512f849f [DebuggerU2MCatchHandlerFrame: 00007fcddeedcca0] 

利用syncblk命令找出實際持有排它鎖的執行緒。

> syncblk                                                                                                                                                                                                                             
Index         SyncBlock MonitorHeld Recursion Owning Thread Info          SyncBlock Owner
   25 0000000000E59BB8          603         1 00007FCE8C00FB30 1e8  14   00007fcea81ffb98 System.Object
   26 0000000000E59C00            3         1 00007FD0C8001FA0 1e9  15   00007fcea81ffbb0 System.Object
-----------------------------
Total           326
Free            307

Owning Thread Info列下面有3個子列,第一列是地址,第二列是作業系統執行緒 ID,第三列是執行緒索引,也可以通過threads命令拿到執行緒索引。
通過setthread命令將執行緒切換到0x1e8上,然後用clrstack檢視它的呼叫堆疊。

> setthread 14                                                                                                                                                                                                                        
> clrstack                                                                                                                                                                                                                            
OS Thread Id: 0x1e8 (14)
        Child SP               IP Call Site
00007FCE77FFE500 00007fd1521e600c [GCFrame: 00007fce77ffe500] 
00007FCE77FFE5F0 00007fd1521e600c [GCFrame: 00007fce77ffe5f0] 
00007FCE77FFE650 00007fd1521e600c [HelperMethodFrame_1OBJ: 00007fce77ffe650] System.Threading.Monitor.Enter(System.Object)
00007FCE77FFE7A0 00007FD0DB4F5D0C testwebapi.Controllers.DiagScenarioController.DeadlockFunc()
00007FCE77FFE7E0 00007FD0DB4F5B27 testwebapi.Controllers.DiagScenarioController.<deadlock>b__3_0()
00007FCE77FFE800 00007FD0D7A25862 System.Threading.ThreadHelper.ThreadStart_Context(System.Object) [/_/src/System.Private.CoreLib/src/System/Threading/Thread.CoreCLR.cs @ 44]
00007FCE77FFE820 00007FD0DB03686D System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object) [/_/src/System.Private.CoreLib/shared/System/Threading/ExecutionContext.cs @ 201]
00007FCE77FFE870 00007FD0D7A2597E System.Threading.ThreadHelper.ThreadStart() [/_/src/System.Private.CoreLib/src/System/Threading/Thread.CoreCLR.cs @ 93]
00007FCE77FFEBD0 00007fd1512f849f [GCFrame: 00007fce77ffebd0] 
00007FCE77FFECA0 00007fd1512f849f [DebuggerU2MCatchHandlerFrame: 00007fce77ffeca0] 

這個執行緒的呼叫堆疊跟上面的300多個比起來多了一些資訊,可以看出它通過DiagScenarioController.DeadlockFunc中的Monitor.Enter方法獲得了某個object得排它鎖。

3,分析記憶體洩漏

SOS命令:https://docs.microsoft.com/zh-cn/dotnet/core/diagnostics/dotnet-dump#dotnet-dump-analyze
檢查當前所有託管型別的統計資訊 -min [byte]可選引數可以限制統計範圍

root@6e560e9f5d55:/tmp/coredump# dotnet-dump analyze coredump.1                                                                                                                                                                       
Loading core dump: coredump.1 ...
Ready to process analysis commands. Type 'help' to list available commands or 'help [command]' to get detailed help on a command.
Type 'quit' or 'exit' to exit the session.
> dumpheap -stat                                                                                                                                                                                                                      
Statistics:
              MT    Count    TotalSize Class Name
00007fd0dc2b3ab0        1           24 System.Collections.Generic.ObjectEqualityComparer`1[[System.Threading.ThreadPoolWorkQueue+WorkStealingQueue, System.Private.CoreLib]]
00007fd0dc29ca28        1           24 System.Collections.Generic.ObjectEqualityComparer`1[[Microsoft.AspNetCore.Mvc.ModelBinding.Metadata.DefaultModelMetadata, Microsoft.AspNetCore.Mvc.Core]]
...
00007fd0d7f85510      482       160864 System.Object[]
0000000000da9d80    19191      7512096      Free
00007fd0dc2b3718        2      8388656 testwebapi.Controllers.Customer[]
00007fd0dadbeb58  1010240     24245760 testwebapi.Controllers.Customer
00007fd0d7f90f90  1012406     95190854 System.String

可在此處看到大多數是String或Customer物件。String物件佔用了90MB左右的空間
使用方法表mt分析佔用最多的System.String型別

>dumpheap -mt 00007f61876a0f90
00007fcfabc376f8 00007fd0d7f90f90       94     
00007fcfabc37770 00007fd0d7f90f90       94     
00007fcfabc377e8 00007fd0d7f90f90       94     
00007fcfabc37860 00007fd0d7f90f90       94     
00007fcfabc378d8 00007fd0d7f90f90       94  
...

Statistics:
              MT    Count    TotalSize Class Name
00007fd0d7f90f90  1012406     95190854 System.String
Total 1012406 objects

一百萬個物件,大小都是94byte,使用gcroot命令隨便檢視幾個物件的根。

> gcroot -all 00007fcfabc378d8                                                                                                                                                                                                        
HandleTable:
    00007FD1508815F8 (pinned handle)
    -> 00007FD0A7FFF038 System.Object[]
    -> 00007FCEA830A230 testwebapi.Controllers.Processor
    -> 00007FCEA830A248 testwebapi.Controllers.CustomerCache
    -> 00007FCEA830A260 System.Collections.Generic.List`1[[testwebapi.Controllers.Customer, DiagnosticScenarios]]
    -> 00007FD0B821F0A8 testwebapi.Controllers.Customer[]
    -> 00007FCFABC378C0 testwebapi.Controllers.Customer
    -> 00007FCFABC378D8 System.String

Found 1 roots.
>

基本上就是Customer,CustomerCache,Processor這幾個物件造成的記憶體佔用過高。
下一篇利用perf分析容器中CPU佔用的問題。

參考:
除錯記憶體洩漏教程 | Microsoft Docs https://docs.microsoft.com/zh-cn/dotnet/core/diagnostics/debug-memory-leak
dotnet core除錯docker下生成的dump檔案 https://www.cnblogs.com/iamsach/p/10118628.html

相關文章