記一次 .NET 某WMS倉儲打單系統 記憶體暴漲分析

一線碼農發表於2021-08-13

一:背景

1. 講故事

七月中旬有一位朋友加wx求助,他的程式在生產上跑著跑著記憶體就飆起來了,貌似沒有回頭的趨勢,詢問如何解決,截圖如下:

和這位朋友聊下來,感覺像是自己在小縣城當了個小老闆,規律的生活,有當地資源,各種小關係,有一股財務自由的味道,這也是我一直嚮往的生活方式 ???。

既然朋友找到我了,我得想辦法給他解決問題,既然是記憶體暴漲,我就賭一把在託管層面吧,嘿嘿,上windbg說話。

二: windbg 分析

1. 託管還是非託管

一直在追這個系列的朋友應該知道,我無數次的用 !address -summary!eeheap -gc 這兩個命令來判斷當前記憶體屬於託管層還是非託管層。


0:000> !address -summary

--- State Summary ---------------- RgnCount ----------- Total Size -------- %ofBusy %ofTotal
MEM_FREE                                393     7dfe`f2105000 ( 125.996 TB)           98.43%
MEM_RESERVE                            1691      200`0f1e4000 (   2.000 TB)  99.81%    1.56%
MEM_COMMIT                             6191        0`fed07000 (   3.981 GB)   0.19%    0.00%


0:000> !eeheap -gc
Number of GC Heaps: 1
generation 0 starts at 0x000001D2E572BBC8
generation 1 starts at 0x000001D2E54F70E0
generation 2 starts at 0x000001D252051000
ephemeral segment allocation context: none
         segment             begin         allocated              size
000001D252050000  000001D252051000  000001D26204FFE0  0xfffefe0(268431328)
Large object heap starts at 0x000001D262051000
         segment             begin         allocated              size
000001D262050000  000001D262051000  000001D2655F3F80  0x35a2f80(56242048)
Total Size:              Size: 0xbf4dbf80 (3209543552) bytes.
------------------------------
GC Heap Size:    Size: 0xbf4dbf80 (3209543552) bytes.

卦象上程式指標為 3.98G ,GC堆指標為 3209543552 = 3G ,很顯然,本次事故屬於 託管層面

2. 尋找託管層上的大物件

我們都知道C#是託管語言,所以甭管有用沒用的物件都逃不出GC堆,言外之意就是看GC堆準沒錯,挑幾個大物件看看。


0:000> !dumpheap -stat
Statistics:
              MT    Count    TotalSize Class Name
00007ff98a68f090   391475     43869284 System.Int32[]
00007ff98b6adfa0  1902760     45666240 System.Collections.ObjectModel.ReadOnlyCollection`1[[System.Linq.Expressions.Expression, System.Linq.Expressions]]
00007ff98b6ac3c0  1951470     46835280 System.Linq.Expressions.ConstantExpression
00007ff98bc452e0  1681178     53797696 System.Linq.Expressions.TypedConstantExpression
00007ff98eacb6b8  1902708     60886656 System.Dynamic.Utils.ListArgumentProvider
00007ff98f236518  1774982     70999280 Microsoft.EntityFrameworkCore.Query.Expressions.ColumnExpression
00007ff98c650c58  1681142     80694816 System.Linq.Expressions.MethodCallExpression3
00007ff98a82bc38  3414094     81938256 System.RuntimeMethodHandle
00007ff98fd96fc0    17750     83936016 System.Collections.Generic.Dictionary`2+Entry[[System.Reflection.MemberInfo, System.Private.CoreLib],[System.Linq.Expressions.Expression, System.Linq.Expressions]][]
00007ff98e5ed5d8    35493    101740504 System.Collections.Generic.Dictionary`2+Entry[[Microsoft.Extensions.DependencyInjection.ServiceLookup.ServiceCacheKey, Microsoft.Extensions.DependencyInjection],[System.Object, System.Private.CoreLib]][]
00007ff98bcff6a8  3639389    116460448 System.Linq.Expressions.PropertyExpression
00007ff98b85cf00  5028347    160907104 System.Reflection.Emit.GenericFieldInfo
00007ff98a671e18  2178117    168395994 System.String
00007ff98a5b6610   160565    171498416 System.Object[]
00007ff98eaa8ab0  4981589    199263560 System.Linq.Expressions.MemberAssignment
00007ff98a672360   398740    391928469 System.Byte[]
00007ff98a746d68   181886    486150592 System.Char[] 

從託管堆上看,System.Linq.Expressions.MemberAssignment 物件高達 498w ,很明顯有問題,從類名看可能和 ExpressionTree 有關,那就抽幾個物件看看它的引用鏈上是否有過大的物件。


0:000> !gcroot 000001d25399f690
HandleTable:
    000001D251B715A8 (pinned handle)
    -> 000001D262068CF0 System.Object[]
    -> 000001D2531C3B78 Microsoft.EntityFrameworkCore.Internal.ServiceProviderCache
    -> 000001D25399E3D0 Remotion.Linq.QueryModel
    -> 000001D25399E3B8 Remotion.Linq.Clauses.SelectClause
    -> 000001D25442C068 System.Linq.Expressions.MemberInitExpression
    -> 000001D25442C050 System.Runtime.CompilerServices.TrueReadOnlyCollection`1[[System.Linq.Expressions.MemberBinding, System.Linq.Expressions]]
    -> 000001D2539A0290 System.Linq.Expressions.MemberBinding[]
    -> 000001D25399F690 System.Linq.Expressions.MemberAssignment

引用鏈特別長,這裡我就擷取一下,經過一頓排查,我發現大物件居然是 Remotion.Linq.Clauses.SelectClause,objsize 這個物件直接爆掉了,真的很奇葩,如下程式碼所示:


0:000> !objsize 000001D25399E3B8
sizeof(000001D25399E3B8) = -1187378032 (0xb93a0c90) bytes (Remotion.Linq.Clauses.SelectClause)

有點懵,這個物件居然是罪魁禍首,從引用鏈看它是 EF 下的一個構建表示式樹的小部件,可以肯定的是,朋友在用EF的時候出了什麼問題,不過還得硬著頭皮繼續挖 SelectClause,經過深挖,我發現這個類有大量這樣的 char[] 陣列,匯出來後大概是下面這樣。


Logistics.Text30), 
|           Text31 = string TryReadValue(t1.Outer.Outer, 42, WmsOutboundConfirmLogistics.Text31), 
|           Text32 = string TryReadValue(t1.Outer.Outer, 43, WmsOutboundConfirmLogistics.Text32), 
|           Text33 = string TryReadValue(t1.Outer.Outer, 44, WmsOutboundConfirmLogistics.Text33), 
|           Text34 = string TryReadValue(t1.Outer.Outer, 45, WmsOutboundConfirmLogistics.Text34), 
|           Text35 = string TryReadValue(t1.Outer.Outer, 46, WmsOutboundConfirmLogistics.Text35), 
|           IsQueue = Nullable<bool> TryReadValue(t1.Outer.Outer, 47, WmsOutboundConfirmLogistics.IsQueue), 
|           IsStop = Nullable<bool> TryReadValue(t1.Outer.Outer, 48, WmsOutboundConfirmLogistics.IsStop), 
|           CheckCode = string TryReadValue(t1.Outer.Outer, 49, WmsOutboundConfirmLogistics.CheckCode), 
|           ClientCode = string TryReadValue(t1.Outer.Inner, 50, WmsOutboundOrder.ClientCode), 
|           WarehouseCode = string TryReadValue(t1.Outer.Inner, 51, WmsOutboundOrder.WarehouseCode), 
|           ErpNumber = string TryReadValue(t1.Outer.Inner, 52, WmsOutboundOrder.ErpNumber), 
|           OrderCategory = string TryReadValue(t1.Outer.Inner, 53, WmsOutboundOrder.OrderCategory), 
|           OrderStatus = string TryReadValue(t1.Outer.Inner, 54, WmsOutboundOrder.OrderStatus), 
|           OrderType = string TryReadValue(t1.Outer.Inner, 55, WmsOutboundOrder.OrderType), 
|           SendCompany = string TryReadValue(t1.Outer.Inner, 56, WmsOutboundOrder.SendCompany), 
|           SendName = string TryReadValue(t1.Outer.Inner, 57, WmsOutboundOrder.SendName), 
|           SendTel = string TryReadValue(t1.Outer.Inner, 58, WmsOutboundOrder.SendTel), 
|           SendMobile = string TryReadValue(t1.Outer.Inner, 59, WmsOutboundOrder.SendMobile), 
|           SendProvince = string TryReadValue(t1.Outer.Inner, 60, WmsOutboundOrder.SendProvince), 
|           SendCity = string TryReadValue(t1.Outer.Inner, 61, WmsOutboundOrder.SendCity), 
|           SendArea = string TryReadValue(t1.Outer.Inner, 62, WmsOutboundOrder.SendArea), 
|           ...
|           CategoryName = string TryReadValue(t1.Outer.Inner, 88, WmsOutboundOrder.CategoryName), 
|           SourcePlatformCode = string TryReadValue(t1.Outer.Inner, 89, WmsOutboundOrder.SourcePlatformCode), 
|           PayMode = (string)string TryReadValue(t1.Outer.Outer, 90, null), 
|           List = List<WmsOutboundConfirmLogisticsLinesDTO> WmsOutboundConfirmLogisticsBusiness.GetOrderLines(string TryReadValue(t1.Outer.Outer, 5, WmsOutboundConfirmLogistics.OrderNumber)), 
|           ConfirmTime = DateTime TryReadValue(t1.Inner, 91, WmsOutboundOrderConfirmation.CreateTime), 
|           ReturnUrl = (string)string TryReadValue(t1.Outer.Outer, 92, null) 
|       }
|__ ), 
|__ contextType: Core.DataRepository.BaseDbContext, 
|__ logger: DiagnosticsLogger<Query>, 
|__ queryContext: Unhandled parameter: queryContext)                                                      

從內容看,應該是 select 語句的ExpressionTree表示,問了下朋友,說大概是報表業務,不過這些資訊給他貌似也沒有多大幫助,說實話到這裡我其實也不知道怎麼繼續往下排查了,陷入了絕望。

3. 從絕望中尋找希望

我在想,既然EF構建了大量這樣的 ExpressionTree,肯定有問題,但也想不出是什麼問題,隔了半天,我突然靈光一現,EF既然構建了樹,有可能sql也出來了,對,我何不直接在 heap 上搜尋 select 的sql語句。。。。


0:000> !strings /m:*select*
Address            Gen    Length   Value
000001d2e4de64e0    2       1964   SELECT a."Id", a."CreateTime" AS "CreateTime0", a."CreatorId", a."CreatorRealName", a."Deleted", a."OrderNumber", a."CarrierId",...
000001d2e4e11e78    2       1964   SELECT a."Id", a."CreateTime" AS "CreateTime0", a."CreatorId", a."CreatorRealName", a."Deleted", a."OrderNumber", a."CarrierId",...
000001d2e4e3d1f0    2       1964   SELECT a."Id", a."CreateTime" AS "CreateTime0", a."CreatorId", a."CreatorRealName", a."Deleted", a."OrderNumber", a."CarrierId",...
000001d2e4e673c8    2       1964   SELECT a."Id", a."CreateTime" AS "CreateTime0", a."CreatorId", a."CreatorRealName", a."Deleted", a."OrderNumber", a."CarrierId",...
000001d2e4e91760    2       1964   SELECT a."Id", a."CreateTime" AS "CreateTime0", a."CreatorId", a."CreatorRealName", a."Deleted", a."OrderNumber", a."CarrierId",...
000001d2e4ebb2e8    2       1964   SELECT a."Id", a."CreateTime" AS "CreateTime0", a."CreatorId", a."CreatorRealName", a."Deleted", a."OrderNumber", a."CarrierId",...
000001d2e4ee54f8    2       1964   SELECT a."Id", a."CreateTime" AS "CreateTime0", a."CreatorId", a."CreatorRealName", a."Deleted", a."OrderNumber", a."CarrierId",...
000001d2e4f10758    2       1964   SELECT a."Id", a."CreateTime" AS "CreateTime0", a."CreatorId", a."CreatorRealName", a."Deleted", a."OrderNumber", a."CarrierId",...
000001d2e4f398d0    2       1964   SELECT a."Id", a."CreateTime" AS "CreateTime0", a."CreatorId", a."CreatorRealName", a."Deleted", a."OrderNumber", a."CarrierId",...

---------------------------------------
18128 matching strings

果然發現了大量重複的 select 語句,而且從最左邊的記憶體地址看都是非常接近的,也就說明他們是在某一個操作中同時生成的,然後我們匯出幾個sql語句。


SELECT a."Id", ....
FROM "WmsOutboundConfirmLogistics" AS a
INNER JOIN "WmsOutboundOrder" AS b ON a."OrderNumber" = b."OrderNumber"
INNER JOIN "WmsOutboundOrderConfirmation" AS c ON a."OrderNumber" = c."OrderNumber"
WHERE (a."OrderNumber" = @__pagination_OrderNumber_0) AND (b."FreezeStatus" = FALSE)
ORDER BY a."Id"";

SELECT a."Id", ....
FROM "WmsOutboundConfirmLogistics" AS a
INNER JOIN "WmsOutboundOrder" AS b ON a."OrderNumber" = b."OrderNumber"
INNER JOIN "WmsOutboundOrderConfirmation" AS c ON a."OrderNumber" = c."OrderNumber"
WHERE (a."OrderNumber" = @__pagination_OrderNumber_0) AND (b."FreezeStatus" = FALSE)
ORDER BY a."Id""

拿到這 1.8w 重複的sql 給朋友看,朋友說這是查詢報表的sql。

4. 所有線索整合打通

那這裡就存在著很大問題,既然是查詢報表,為什麼會有 1.8w 相同的sql,唯一不同的就是 a."OrderNumber" = @__pagination_OrderNumber_0 中的訂單號,難道不應該是 a.OrderNumber in (xxxx) 或者是表關聯查詢嗎??? 整理一下就是下面這樣的猜想:


-- 理想
select * from a where a.id in (1,2,3)

-- 現實
select * from a where a.id=1;
select * from a where a.id=2;
select * from a where a.id=3;

加上每個sql記憶體地址相近,再結合爆表的 Remotion.Linq.Clauses.SelectClause 物件,整個流程大概就是:本該表關聯或者in操作,結果變成了無數個單條sql語句查詢,導致EF底層出現記憶體爆炸式增長。

三:總結

看了下朋友查詢ef的寫法,猜測大多都是人肉構建 ExpressionTree 去查詢資料庫,大寫的??,比如下面的這張圖:

解決方案就是讓朋友檢查下表示式樹的寫法問題,或者直接灌寫好的sql得了,說實話這個dump還是費了九牛二虎之力,本以為很簡單,實操起來還是碰到了一點小困難,就當歷練成長吧!

更多高質量乾貨:參見我的 GitHub: dotnetfly

相關文章