Why do multi-threaded programs on Linux consume so much virtual memory?

Published 2015-01-29

Our game recently went live, and while optimizing server memory I ran into a puzzling problem. Our authentication server (AuthServer) talks to third-party channel SDKs (login and payments). Because it uses curl in blocking mode, it runs 128 threads. Strangely, right after startup it occupies about 2.3 GB of virtual memory; then every round of message processing adds another 64 MB, until it levels off at 4.4 GB. Since we pre-allocate memory up front and the threads never allocate large blocks internally, where on earth was this memory coming from? I was baffled.

1. Investigation

First I ruled out a memory leak: it would be far too much of a coincidence to leak exactly 64 MB every time. To confirm this, I ran the server under valgrind.

I then started the test and let it run until memory stopped growing. Sure enough, valgrind reported no leaks at all. I repeated the experiment many times, always with the same result.

After valgrind turned up nothing, I began to suspect the program was using mmap or similar calls internally, so I used strace to watch mmap, brk, and other memory-related system calls:

The results were as follows:

Checking the trace file, I found no large mmap activity either, and the growth attributable to brk was small. I was starting to lose hope. Next I wondered whether file caching was eating the virtual memory, so I commented out all the log read/write code; virtual memory still grew, ruling that out as well.

2. A Flash of Insight

Later I started reducing the number of threads, and during testing stumbled on something very odd: if the program creates a thread and allocates a tiny 1 KB block inside that thread, the process's virtual memory immediately jumps by 64 MB; subsequent allocations don't grow it any further. The test code is roughly as follows:

Its output is shown below. At startup the program occupies 14 MB of virtual memory. Entering 0 creates a child thread and memory reaches 23 MB; the extra ~10 MB is the thread's stack (stack size can be viewed and set with ulimit -s). Entering 1 for the first time allocates 1 KB in the thread, and the process gains a full 64 MB of virtual memory. Entering 2 and then 3 allocates another 1 KB each, and memory no longer changes at all.

This result thrilled me. Having studied Google's tcmalloc, where each thread has its own cache to avoid contention in multi-threaded allocation, I guessed that newer glibc versions had learned the same trick. So I ran pmap $(pidof main) to inspect the memory map:

Note the line showing 65404: all signs indicate that this block, together with the line above it (132 here), is the extra 64 MB (65404 KB + 132 KB = 65536 KB = 64 MB). When I later increased the thread count, one additional 65404 KB block appeared for each new thread.

3. Getting to the Bottom of It

After some googling and code reading, I finally learned that glibc's malloc was the culprit. Every glibc from version 2.11 onward behaves this way. From Red Hat's official documentation: https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/6.0_Release_Notes/compiler.html

Red Hat Enterprise Linux 6 features version 2.11 of glibc, providing many features and enhancements, including… An enhanced dynamic memory allocation (malloc) behaviour enabling higher scalability across many sockets and cores. This is achieved by assigning threads their own memory pools and by avoiding locking in some situations. The amount of additional memory used for the memory pools (if any) can be controlled using the environment variables MALLOC_ARENA_TEST and MALLOC_ARENA_MAX. MALLOC_ARENA_TEST specifies that a test for the number of cores is performed once the number of memory pools reaches this value. MALLOC_ARENA_MAX sets the maximum number of memory pools used, regardless of the number of cores.

The developer, Ulrich Drepper, has a much deeper explanation on his blog: http://udrepper.livejournal.com/20948.html

Before, malloc tried to emulate a per-core memory pool. Every time when contention for all existing memory pools was detected a new pool is created. Threads stay with the last used pool if possible… This never worked 100% because a thread can be descheduled while executing a malloc call. When some other thread tries to use the memory pool used in the call it would detect contention. A second problem is that if multiple threads on multiple core/sockets happily use malloc without contention memory from the same pool is used by different cores/on different sockets. This can lead to false sharing and definitely additional cross traffic because of the meta information updates. There are more potential problems not worth going into here in detail.

The changes which are in glibc now create per-thread memory pools. This can eliminate false sharing in most cases. The meta data is usually accessed only in one thread (which hopefully doesn’t get migrated off its assigned core). To prevent the memory handling from blowing up the address space use too much the number of memory pools is capped. By default we create up to two memory pools per core on 32-bit machines and up to eight memory pools per core on 64-bit machines. The code delays testing for the number of cores (which is not cheap, we have to read /proc/stat) until there are already two or eight memory pools allocated, respectively.

While these changes might increase the number of memory pools which are created (and thus increase the address space they use) the number can be controlled. Because using the old mechanism there could be a new pool being created whenever there are collisions the total number could in theory be higher. Unlikely but true, so the new mechanism is more predictable.

… Memory use is not that much of a premium anymore and most of the memory pool doesn’t actually require memory until it is used, only address space… We have done internally some measurements of the effects of the new implementation and they can be quite dramatic.

New versions of glibc present in RHEL6 include a new arena allocator design. In several clusters we’ve seen this new allocator cause huge amounts of virtual memory to be used, since when multiple threads perform allocations, they each get their own memory arena. On a 64-bit system, these arenas are 64M mappings, and the maximum number of arenas is 8 times the number of cores. We’ve observed a DN process using 14GB of vmem for only 300M of resident set. This causes all kinds of nasty issues for obvious reasons.

Setting MALLOC_ARENA_MAX to a low number will restrict the number of memory arenas and bound the virtual memory, with no noticeable downside in performance – we’ve been recommending MALLOC_ARENA_MAX=4. We should set this in hadoop-env.sh to avoid this issue as RHEL6 becomes more and more common.

To summarize: for allocation performance, glibc uses a number of memory pools called arenas. By default on 64-bit, each arena reserves 64 MB of address space, and a process can have at most cores × 8 arenas. On a 4-core machine that means up to 4 × 8 = 32 arenas, i.e. 32 × 64 = 2048 MB of virtual memory. You can of course change the arena count via an environment variable, e.g. export MALLOC_ARENA_MAX=1.

Hadoop recommends setting this value to 4. Then again, since arenas were introduced precisely to reduce multi-threaded allocation contention on multi-core machines, setting it to the number of CPU cores is probably also a sensible choice. After changing it, be sure to load-test your program to check whether the new arena count affects performance.

If you prefer to set this from program code, call mallopt(M_ARENA_MAX, xxx). Since our AuthServer pre-allocates everything and the worker threads allocate no memory of their own, this optimization buys us nothing, so at initialization we call mallopt(M_ARENA_MAX, 1) to effectively turn it off. (A value of 0 means the system chooses automatically based on the CPU count.)

4. An Unexpected Discovery

Recalling that tcmalloc only serves small objects from the per-thread cache while large allocations still come from the central allocator, I wondered how glibc is designed. So I changed the per-allocation size in the test program above from 1 KB to 1 MB. Sure enough, after the initial 64 MB jump, virtual memory kept growing by 1 MB on every allocation. Evidently the new glibc borrowed tcmalloc's idea wholesale.

The problem that had kept me busy for days was finally solved, and my mood improved greatly. Today's problem taught me that a server programmer who doesn't understand the compiler and the operating system kernel is simply not qualified; I need to study these areas much harder.
