Linux memory management (1): how physical memory is organized and allocated

By 半山随笔, published 2024-06-07

Starting with this post, I'll record my notes from reading the kernel's memory-management code. The content is long and much of it is my own interpretation, so read with a critical eye.

The buddy system is built on how physical pages are organized; from small to large the structures are page -> zone -> node. Let's look in the source at how each is organized.

typedef struct pglist_data {
    struct zone node_zones[MAX_NR_ZONES];          /* the zones in this node */
    struct zonelist node_zonelists[MAX_ZONELISTS]; /* two entries: this node's zonelist, and the fallback zonelist covering other nodes */
    int nr_zones; /* number of populated zones in this node */
    unsigned long node_start_pfn;
    unsigned long node_present_pages; /* total number of physical pages */
    unsigned long node_spanned_pages; /* total size of physical page range, including holes */
    ...
} pg_data_t;

pglist_data describes a node's memory layout. It holds all of the node's zones, plus zonelists that also cover the other nodes. Allocation starts from the current node and falls back to other nodes only when the current one cannot satisfy the request.

struct zone {
    /* Read-mostly fields */

    /* zone watermarks, access with *_wmark_pages(zone) macros */
    /* watermark fields, used by memory reclaim */
    unsigned long _watermark[NR_WMARK];
    unsigned long watermark_boost;

    unsigned long nr_reserved_highatomic;

    /*
     * We don't know if the memory that we're going to allocate will be
     * freeable or/and it will be released eventually, so to avoid totally
     * wasting several GB of ram we must reserve some of the lower zone
     * memory (otherwise we risk to run OOM on the lower zones despite
     * there being tons of freeable ram on the higher zones). This array is
     * recalculated at runtime if the sysctl_lowmem_reserve_ratio sysctl
     * changes.
     */
    long lowmem_reserve[MAX_NR_ZONES];

#ifdef CONFIG_NUMA
    int node;
#endif
    struct pglist_data  *zone_pgdat;
    struct per_cpu_pages    __percpu *per_cpu_pageset;
    struct per_cpu_zonestat __percpu *per_cpu_zonestats;
    /*
     * the high and batch values are copied to individual pagesets for
     * faster access
     */
    int pageset_high_min;
    int pageset_high_max;
    int pageset_batch;

#ifndef CONFIG_SPARSEMEM
    /*
     * Flags for a pageblock_nr_pages block. See pageblock-flags.h.
     * In SPARSEMEM, this map is stored in struct mem_section
     */
    unsigned long       *pageblock_flags;
#endif /* CONFIG_SPARSEMEM */

    /* the members below describe the zone's page counts and position */
    /* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */
    unsigned long       zone_start_pfn;

    /*
     * spanned_pages = zone_end_pfn - zone_start_pfn
     *   (total pages spanned by the zone, including holes);
     * present_pages = spanned_pages - absent_pages(pages in holes);
     * present_early_pages is present pages available since early boot,
     *   excluding hotplugged memory;
     * managed_pages = present_pages - reserved_pages
     *   (pages managed by the buddy system; reserved_pages includes pages
     *   allocated by the bootmem allocator);
     * cma_pages is present pages assigned for CMA use (MIGRATE_CMA).
     *
     * present_pages may be used by memory hotplug or memory power
     * management logic to figure out unmanaged pages
     * (present_pages - managed_pages); managed_pages is what the page
     * allocator and vm scanner use to compute watermarks and thresholds.
     *
     * Locking rules: zone_start_pfn and spanned_pages are protected by
     * span_seqlock. It is a seqlock because it is read outside zone->lock
     * in the main allocator path but written quite infrequently. Write
     * access to present_pages at runtime should be protected by
     * mem_hotplug_begin/done(); readers that can't tolerate drift should
     * use get_online_mems() for a stable value.
     */
    atomic_long_t       managed_pages;
    unsigned long       spanned_pages;
    unsigned long       present_pages;
#if defined(CONFIG_MEMORY_HOTPLUG)
    unsigned long       present_early_pages;
#endif
#ifdef CONFIG_CMA
    unsigned long       cma_pages;
#endif

    const char      *name;

#ifdef CONFIG_MEMORY_ISOLATION
    /*
     * Number of isolated pageblocks. Used to solve incorrect freepage
     * counting due to racy retrieval of a pageblock's migratetype.
     * Protected by zone->lock.
     */
    unsigned long       nr_isolate_pageblock;
#endif
#ifdef CONFIG_MEMORY_HOTPLUG
    /* see spanned/present_pages for more description */
    seqlock_t       span_seqlock;
#endif
    int initialized;

    /* Write-intensive fields used from the page allocator */
    CACHELINE_PADDING(_pad1_);

    /* free areas of different sizes: the key structure for page allocation */
    struct free_area    free_area[MAX_ORDER + 1];
    ...
#if defined CONFIG_COMPACTION || defined CONFIG_CMA
    /* Set to true when the PG_migrate_skip bits should be cleared */
    bool            compact_blockskip_flush;
#endif
    bool            contiguous;

    CACHELINE_PADDING(_pad3_);

    /* Zone statistics */
    atomic_long_t       vm_stat[NR_VM_ZONE_STAT_ITEMS];
    atomic_long_t       vm_numa_event[NR_VM_NUMA_EVENT_ITEMS];
} ____cacheline_internodealigned_in_smp;

The most important data in a zone: its start position, given by zone_start_pfn, and its page counts, described jointly by present_pages, spanned_pages, and managed_pages. The buddy-related member is the free_area array, which describes how the zone's memory is organized for the buddy system; its length is the maximum order plus one.

struct free_area {
    struct list_head    free_list[MIGRATE_TYPES];
    unsigned long        nr_free;
};

free_area describes the memory the zone can currently hand out. It is effectively a two-dimensional array: each row is one order (the buddy system allocates in powers of two; the order is the exponent, in units of pages), and the columns are the different migrate types. Each element is a list of page blocks. The figure below is a screenshot from "Run Linux Kernel" (奔跑吧Linux) volume 1.

The chain start_kernel->setup_arch->bootmem_init->arch_numa_init->numa_init performs NUMA initialization.

The global array numa_distance stores the inter-node distances. By default the distance is 10 within a node and 20 between different nodes, though vendors can supply these numbers through the DTB or the ACPI tables. You can read each node's distance to the others from /sys/devices/system/node/nodeX/distance.

node_data is a global array of pglist_data pointers holding the information for every node. It is initialized in numa_register_nodes, which records each node's start pfn and length.

struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;

The global array node_states stores each node's state.

nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
    [N_POSSIBLE] = NODE_MASK_ALL,
    [N_ONLINE] = { { [0] = 1UL } },
#ifndef CONFIG_NUMA
    [N_NORMAL_MEMORY] = { { [0] = 1UL } },
#ifdef CONFIG_HIGHMEM
    [N_HIGH_MEMORY] = { { [0] = 1UL } },
#endif
    [N_MEMORY] = { { [0] = 1UL } },
    [N_CPU] = { { [0] = 1UL } },
#endif    /* NUMA */
};

mem_section stores the page pointers for the permanent sparse memory model; it is the SPARSEMEM counterpart of flat memory's mem_map. In other words, every physical page can be found through mem_section (or mem_map), which makes it a very useful structure.

struct mem_section **mem_section;
struct mem_section {
    /*
     * This is, logically, a pointer to an array of struct
     * pages.  However, it is stored with some other magic.
     * (see sparse.c::sparse_init_one_section())
     *
     * Additionally during early boot we encode node id of
     * the location of the section here to guide allocation.
     * (see sparse.c::memory_present())
     *
     * Making it a UL at least makes someone do a cast
     * before using it wrong.
     */
    unsigned long section_mem_map;

    struct mem_section_usage *usage;
#ifdef CONFIG_PAGE_EXTENSION
    /*
     * If SPARSEMEM, pgdat doesn't have page_ext pointer. We use
     * section. (see page_ext.h about this.)
     */
    struct page_ext *page_ext;
    unsigned long pad;
#endif
    /*
     * WARNING: mem_section must be a power-of-2 in size for the
     * calculation and use of SECTION_ROOT_MASK to make sense.
     */
};

mem_section is initialized in sparse_init; see "Physical Memory Model" in The Linux Kernel documentation.

free_area_init initializes the memory zones: every node's zones get their start pfn, spanned_pages, present_pages, per_cpu_pageset, and free_area, although at this point free_area is merely zeroed. The node and zone information comes from memblock.

start_kernel->mm_core_init->build_all_zonelists->build_all_zonelists_init->__build_all_zonelists

__build_all_zonelists builds the zonelist for every node.

static void __build_all_zonelists(void *data)
{
...
    for_each_node(nid) {
      pg_data_t *pgdat = NODE_DATA(nid);
       build_zonelists(pgdat);
...
    }
...
}

A node's zonelists hold two entries: one for the local node and one for the others, i.e. the fallback list. The fallback ordering is decided jointly by the node order and the zone order: the node order comes from the inter-node distances, and within each node zones go from high to low, e.g. from ZONE_NORMAL at the top down to ZONE_DMA.

start_kernel->mm_core_init->mem_init->memblock_free_all hands the memory released by memblock to the buddy system. The concrete path is memblock_free_all->free_low_memory_core_early->__free_memory_core->__free_pages_memory->memblock_free_pages->__free_pages_core: physical pages freed from memblock are added to the buddy system at the largest possible order, and the returned page count is added to _totalram_pages. At this point most memory has migrate type MIGRATE_MOVABLE, and order 10 dominates the buddy lists; in other words, memory is organized into the largest contiguous blocks possible.

static void __init memmap_init_zone_range(struct zone *zone,
                      unsigned long start_pfn,
                      unsigned long end_pfn,
                      unsigned long *hole_pfn)
{
    unsigned long zone_start_pfn = zone->zone_start_pfn;
    unsigned long zone_end_pfn = zone_start_pfn + zone->spanned_pages;
    int nid = zone_to_nid(zone), zone_id = zone_idx(zone);

    start_pfn = clamp(start_pfn, zone_start_pfn, zone_end_pfn);
    end_pfn = clamp(end_pfn, zone_start_pfn, zone_end_pfn);

    if (start_pfn >= end_pfn)
        return;

    memmap_init_range(end_pfn - start_pfn, nid, zone_id, start_pfn,
              zone_end_pfn, MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);

    if (*hole_pfn < start_pfn)
        init_unavailable_range(*hole_pfn, start_pfn, zone_id, nid);

    *hole_pfn = end_pfn;
}

That covers how memory is organized; now let's look at a few allocation APIs.

alloc_pages(gfp, order) allocates 2^order pages. Many factors influence allocation behavior, including gfp and current->flags. The core function is __alloc_pages.

/*
 * This is the 'heart' of the zoned buddy allocator.
 */
struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
                            nodemask_t *nodemask)
{
    ...
    gfp = current_gfp_context(gfp);
    alloc_gfp = gfp;
    if (!prepare_alloc_pages(gfp, order, preferred_nid, nodemask, &ac,
            &alloc_gfp, &alloc_flags))
        return NULL;

...

    /* First allocation attempt */
    page = get_page_from_freelist(alloc_gfp, order, alloc_flags, &ac);
    if (likely(page))
        goto out;

    ...
    page = __alloc_pages_slowpath(alloc_gfp, order, &ac);

out:
    ...
    return page;
}

get_page_from_freelist makes the first attempt; if it fails, __alloc_pages_slowpath keeps trying.

/*
 * get_page_from_freelist goes through the zonelist trying to allocate
 * a page.
 */
static struct page *
get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
                        const struct alloc_context *ac)
{
    ...
    for_next_zone_zonelist_nodemask(zone, z, ac->highest_zoneidx,
                    ac->nodemask) {
        ...
        if (zone_watermark_fast(zone, order, mark,
                    ac->highest_zoneidx, alloc_flags,
                    gfp_mask))
            goto try_this_zone;
        ...
try_this_zone:
        page = rmqueue(ac->preferred_zoneref->zone, zone, order,
                gfp_mask, alloc_flags, ac->migratetype);
        if (page) {
            prep_new_page(page, order, gfp_mask, alloc_flags);
            ...
            return page;
        }
    }
    ...
}

get_page_from_freelist first checks whether the current zone has enough free pages; if not it moves on until it finds a zone that does, and rmqueue then allocates from that zone. The zonelist is scanned starting from the preferred zone and then by zone type from high to low.

__no_sanitize_memory
static inline
struct page *rmqueue(struct zone *preferred_zone,
            struct zone *zone, unsigned int order,
            gfp_t gfp_flags, unsigned int alloc_flags,
            int migratetype)
{
    struct page *page;

    ...
    if (likely(pcp_allowed_order(order))) {
        page = rmqueue_pcplist(preferred_zone, zone, order,
                       migratetype, alloc_flags);
        if (likely(page))
            goto out;
    }

    page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
                            migratetype);

out:
    /* Separate test+clear to avoid unnecessary atomics */
    if ((alloc_flags & ALLOC_KSWAPD) &&
        unlikely(test_bit(ZONE_BOOSTED_WATERMARK, &zone->flags))) {
        clear_bit(ZONE_BOOSTED_WATERMARK, &zone->flags);
        wakeup_kswapd(zone, 0, 0, zone_idx(zone));
    }

    return page;
}

First check whether the requested order is small enough to be served from the pcp list (pcp_allowed_order; up to PAGE_ALLOC_COSTLY_ORDER, currently 3), and if so try the pcplist first. This is a per-cpu free list: no zone lock is needed, only the per-cpu list's lock, so it is fast. Pages seem to land on the pcplist when they are freed; I did not see it being filled during initialization. If the pcplist cannot satisfy the request, fall back to free_area.

rmqueue_buddy eventually calls __rmqueue_smallest to allocate the memory.

static __always_inline
struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
                        int migratetype)
{
    
    for (current_order = order; current_order <= MAX_ORDER; ++current_order) {
        area = &(zone->free_area[current_order]);
        page = get_page_from_free_area(area, migratetype);
        if (!page)
            continue;
        del_page_from_free_list(page, zone, current_order);
        expand(zone, page, order, current_order, migratetype);
        set_pcppage_migratetype(page, migratetype);
        
        return page;
    }

    return NULL;
}

If you have understood how the buddy system organizes memory, the code above is easy to follow. Free pages are stored by order (order 10 dominating right after init), and the search walks upward starting from the requested order. The free pages live in each zone's free_area, which can be viewed as a two-dimensional array: the first dimension is the order, the second the migrate type. get_page_from_free_area returns the first page of the matching migrate type. If the block was only found at a higher order than requested, part of it is left over, and expand puts the remainder back into free_area.

If the fast path fails to find free pages, we enter the slow path.

static inline struct page *
__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
                        struct alloc_context *ac)
{
    ...
}

The slow path tries every possible trick to reclaim memory and run compaction to obtain enough free pages, possibly scheduling away in the process, so it can take a long time or even fail. Memory size therefore has a big impact on system performance: when memory is tight, the system spends a lot of time just obtaining it, which degrades performance considerably. The slow path involves more advanced material; I'll analyze it later.

Now the page-freeing API, free_pages.

Its core function is free_the_page.

static inline void free_the_page(struct page *page, unsigned int order)
{
    if (pcp_allowed_order(order))        /* Via pcp? */
        free_unref_page(page, order);
    else
        __free_pages_ok(page, order, FPI_NONE);
}

First check whether the page order is small enough to go onto the pcplist; otherwise the page goes back to free_area. zone->per_cpu_pageset holds a flat array of lists, MIGRATE_PCPTYPES entries per order, each a page list. Freeing a page simply means adding it to the list for its order and migrate type, then updating the metadata; if the count rises above per_cpu_pageset->high, some pages are flushed back to the buddy system. See free_unref_page->free_unref_page_commit.

An important helper here is page_zone, which derives the zone pointer from a page. On a single-node system the page's flags field encodes the node id and zone id: the node id selects the node, then the zone id selects the zone. On a multi-node system, you first find the page's section and look the node up in the global section_to_node_table.

static inline struct zone *page_zone(const struct page *page)
{
    return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)];
}

Finding a page's zone is the precondition for freeing it: pages go back to where they came from.

/*
 * Free a pcp page
 */
void free_unref_page(struct page *page, unsigned int order)
{
    ...
    zone = page_zone(page);
    pcp_trylock_prepare(UP_flags);
    pcp = pcp_spin_trylock(zone->per_cpu_pageset);
    if (pcp) {
        free_unref_page_commit(zone, pcp, page, pcpmigratetype, order);
        pcp_spin_unlock(pcp);
    } else {
        free_one_page(zone, page, pfn, order, migratetype, FPI_NONE);
    }
    pcp_trylock_finish(UP_flags);
}
static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
                   struct page *page, int migratetype,
                   unsigned int order)
{
    ...
    pindex = order_to_pindex(migratetype, order);
    list_add(&page->pcp_list, &pcp->lists[pindex]);
    pcp->count += 1 << order;
    ...
}

Back to __free_pages_ok.

static void __free_pages_ok(struct page *page, unsigned int order,
                fpi_t fpi_flags)
{
    ...
    spin_lock_irqsave(&zone->lock, flags);
    ...
    __free_one_page(page, pfn, zone, order, migratetype, fpi_flags);
    spin_unlock_irqrestore(&zone->lock, flags);
    ...
}

Its core function is __free_one_page.

static inline void __free_one_page(struct page *page,
        unsigned long pfn,
        struct zone *zone, unsigned int order,
        int migratetype, fpi_t fpi_flags)
{
...
    while (order < MAX_ORDER) {
        ...
        buddy = find_buddy_page_pfn(page, pfn, order, &buddy_pfn);
        if (!buddy)
            goto done_merging;

        ...

        ...
            del_page_from_free_list(buddy, zone, order);
        combined_pfn = buddy_pfn & pfn;
        page = page + (combined_pfn - pfn);
        pfn = combined_pfn;
        order++;
    }

done_merging:
    set_buddy_order(page, order);
...
    if (to_tail)
        add_to_free_list_tail(page, zone, order, migratetype);
    else
        add_to_free_list(page, zone, order, migratetype);

   ...
}

__free_one_page is poorly named; it sounds like it frees only one page. The idea is not complicated: from the page's pfn and order, find its buddy, the adjacent block of the same size. If a free buddy is found, the two can merge into a larger block, so the order increments and the search continues until no buddy is found or the order hits the maximum; then it jumps to done_merging and inserts the merged block into the zone's free_area. The insert is safe because every buddy found along the way has already been removed from its list.

Now for slab. Here are two good write-ups: Linux 核心 | 記憶體管理——slab 分配器 - 知乎 (zhihu.com) and 記憶體管理-slab[原理] - DoOrDie - 部落格園 (cnblogs.com).

I've tried to understand slab before and never quite got it. I couldn't see what a "slab" actually was; there wasn't even a data structure by that name, just something called kmem_cache with slabs living underneath it. So slab is a sub-structure of another concept, the cache. And honestly, "cache" is badly overloaded in Linux; there's a pile of things calling themselves caches, which is confusing and unfriendly to newcomers. For a complex system, naming matters a great deal, and here it isn't done well.

Another complication is the variants: slab, slub, and slob. SLOB appears to have been removed in the 6.x kernels, leaving slab and slub; and since then SLAB has been removed too, leaving only SLUB. What follows still describes the old SLAB, though.

Linux is over 30 years old; I assumed it had settled down and would not change much, but it keeps reinventing itself; even core code like process scheduling and memory management is constantly evolving. Take slab: there was originally a slab struct, it was later folded into struct page, and in 2021 it was pulled back out again.

Why do we need slab at all?

The allocation APIs above work in pages, 4 KiB at minimum, which is wasteful for small allocations;

The kernel constantly needs certain structures, e.g. task_struct, fs_struct, inode. Going to the buddy system each time would be slow and cache-unfriendly. Pre-allocating a batch, handing them out on demand and taking them back for reuse is much more efficient;

slab was designed for exactly this. Understanding it requires two data structures, kmem_cache and slab. Simply put, a slab is a concrete block of memory, and kmem_cache manages slabs.

/* Reuses the bits in struct page */
struct slab {
    unsigned long __page_flags;

#if defined(CONFIG_SLAB)

    struct kmem_cache *slab_cache;
    union {
        struct {
            struct list_head slab_list;
            void *freelist;    /* array of free object indexes */
            void *s_mem;    /* first object */
        };
        struct rcu_head rcu_head;
    };
    unsigned int active;

#endif

    atomic_t __page_refcount;

};

With the slub parts stripped out, slab doesn't look complicated: a pointer back to the managing kmem_cache, plus a slab list.

/*
 * Definitions unique to the original Linux SLAB allocator.
 */

struct kmem_cache {
    struct array_cache __percpu *cpu_cache;

/* 1) Cache tunables. Protected by slab_mutex */
    unsigned int batchcount;
    unsigned int limit;
    unsigned int shared;

    unsigned int size;
    struct reciprocal_value reciprocal_buffer_size;
/* 2) touched by every alloc & free from the backend */

    slab_flags_t flags;        /* constant flags */
    unsigned int num;        /* # of objs per slab */

/* 3) cache_grow/shrink */
    /* order of pgs per slab (2^n) */
    unsigned int gfporder;

    /* force GFP flags, e.g. GFP_DMA */
    gfp_t allocflags;

    size_t colour;            /* cache colouring range: the number of colour
                     * slots; PAGE_SIZE << gfporder pages hold num objects,
                     * and colour = leftover bytes / colour_off */
    unsigned int colour_off;    /* colour offset: the L1 cache line size */
    unsigned int freelist_size;

    /* constructor func */
    void (*ctor)(void *obj);

/* 4) cache creation/removal */
    const char *name;
    struct list_head list;
    int refcount;
    int object_size;
    int align;

/* 5) statistics */
    ...
    struct kmem_cache_node *node[MAX_NUMNODES];
};

kmem_cache is more involved. The main members to look at are cpu_cache and node.

/*
 * struct array_cache
 *
 * Purpose:
 * - LIFO ordering, to hand out cache-warm objects from _alloc
 * - reduce the number of linked list operations
 * - reduce spinlock operations
 *
 * The limit is stored in the per-cpu structure to reduce the data cache
 * footprint.
 *
 */
struct array_cache {
    unsigned int avail;
    unsigned int limit;
    unsigned int batchcount;
    unsigned int touched;
    void *entry[];    /*
             * Must have this definition in here for the proper
             * alignment of array_cache. Also simplifies accessing
             * the entries.
             */
};

array_cache backs the per-cpu variable. Note the comment says LIFO ordering: handing back the most recently freed object makes the best use of cache-warm data. The entry[] pointer array holds the actual data, pointers to free objects; the rest is bookkeeping.

/*
 * The slab lists for all objects.
 */
struct kmem_cache_node {
#ifdef CONFIG_SLAB
    raw_spinlock_t list_lock;
    struct list_head slabs_partial;    /* partial list first, better asm code */
    struct list_head slabs_full;
    struct list_head slabs_free;
    unsigned long total_slabs;    /* length of all slab lists */
    unsigned long free_slabs;    /* length of free slab list only */
    unsigned long free_objects;
    unsigned int free_limit;
    unsigned int colour_next;    /* Per-node cache coloring */
    struct array_cache *shared;    /* shared per node */
    struct alien_cache **alien;    /* on other nodes */
    unsigned long next_reap;    /* updated without locking */
    int free_touched;        /* updated without locking */
#endif

#ifdef CONFIG_SLUB
    spinlock_t list_lock;
    unsigned long nr_partial;
    struct list_head partial;
#endif
};

slub is nicer: simple and clear. Why is slab so complicated, with three lists and all this assorted bookkeeping? The node array, at least, exists to make better use of the local memory node.

Now the important APIs for creating a cache and allocating from it.

struct kmem_cache *
kmem_cache_create(const char *name, unsigned int size, unsigned int align,
        slab_flags_t flags, void (*ctor)(void *))
{
    return kmem_cache_create_usercopy(name, size, align, flags, 0, 0,
                      ctor);
}
struct kmem_cache *
kmem_cache_create_usercopy(const char *name,
          unsigned int size, unsigned int align,
          slab_flags_t flags,
          unsigned int useroffset, unsigned int usersize,
          void (*ctor)(void *))
{
    struct kmem_cache *s = NULL;
    const char *cache_name;
    int err;


    mutex_lock(&slab_mutex);
    ...
    
    if (!usersize)
        s = __kmem_cache_alias(name, size, align, flags, ctor);
    if (s)
        goto out_unlock;

    ...

    s = create_cache(cache_name, size,
             calculate_alignment(flags, align, size),
             flags, useroffset, usersize, ctor, NULL);

    ...
out_unlock:
    mutex_unlock(&slab_mutex);

    return s;
}

First check whether a cache for this object type already exists; if so there is no need to create one.

struct kmem_cache *
__kmem_cache_alias(const char *name, unsigned int size, unsigned int align,
           slab_flags_t flags, void (*ctor)(void *))
{
    struct kmem_cache *cachep;

    cachep = find_mergeable(size, align, flags, name, ctor);
    if (cachep) {
        cachep->refcount++;

        /*
         * Adjust the object sizes so that we clear
         * the complete object on kzalloc.
         */
        cachep->object_size = max_t(int, cachep->object_size, size);
    }
    return cachep;
}
struct kmem_cache *find_mergeable(unsigned int size, unsigned int align,
        slab_flags_t flags, const char *name, void (*ctor)(void *))
{
    struct kmem_cache *s;

    ...
    list_for_each_entry_reverse(s, &slab_caches, list) {
        ...
        return s;
    }
    return NULL;
}

find_mergeable walks the global list slab_caches, to which every kmem cache is linked, looking for a matching cache object.

If none is found, create one.

static struct kmem_cache *create_cache(const char *name,
        unsigned int object_size, unsigned int align,
        slab_flags_t flags, unsigned int useroffset,
        unsigned int usersize, void (*ctor)(void *),
        struct kmem_cache *root_cache)
{
    struct kmem_cache *s;
    int err;

    s = kmem_cache_zalloc(kmem_cache, GFP_KERNEL);
    ...
    err = __kmem_cache_create(s, flags);

    s->refcount = 1;
    list_add(&s->list, &slab_caches);
    return s;
    ...
}

First allocate memory for the kmem_cache variable, build the cache object in it, initialize its refcount to 1, and link it into the global slab_caches list mentioned above.

int __kmem_cache_create(struct kmem_cache *cachep, slab_flags_t flags)
{
    ...
    size = ALIGN(size, BYTES_PER_WORD);

    /* 3) caller mandated alignment */
    if (ralign < cachep->align) {
        ralign = cachep->align;
    }
    cachep->align = ralign;
    cachep->colour_off = cache_line_size();   /* the L1 cache line size */
    size = ALIGN(size, cachep->align);

    /* compute gfporder, the object count num, and the colour count */
    if (set_objfreelist_slab_cache(cachep, size, flags)) {
        flags |= CFLGS_OBJFREELIST_SLAB;
        goto done;
    }
    if (set_off_slab_cache(cachep, size, flags)) {
        flags |= CFLGS_OFF_SLAB;
        goto done;
    }
    if (set_on_slab_cache(cachep, size, flags))
        goto done;

    return -E2BIG;

done:
    cachep->freelist_size = cachep->num * sizeof(freelist_idx_t);
    cachep->flags = flags;
    cachep->allocflags = __GFP_COMP;
    if (flags & SLAB_CACHE_DMA)
        cachep->allocflags |= GFP_DMA;
    if (flags & SLAB_CACHE_DMA32)
        cachep->allocflags |= GFP_DMA32;
    if (flags & SLAB_RECLAIM_ACCOUNT)
        cachep->allocflags |= __GFP_RECLAIMABLE;
    cachep->size = size;
    cachep->reciprocal_buffer_size = reciprocal_value(size);

    err = setup_cpu_cache(cachep, gfp);

    return 0;
}

A slab comes in different layouts depending on where the freelist-management array lives, on the slab or off it, handled by set_objfreelist_slab_cache and set_off_slab_cache respectively (with set_on_slab_cache as the plain on-slab case).

setup_cpu_cache sets up cpu_cache and the kmem_cache_node array; the name feels slightly off to me.

cpu_cache is initialized in setup_cpu_cache->enable_cpucache->do_tune_cpucache.

static int do_tune_cpucache(struct kmem_cache *cachep, int limit,
                int batchcount, int shared, gfp_t gfp)
{
    struct array_cache __percpu *cpu_cache, *prev;
    int cpu;

    cpu_cache = alloc_kmem_cache_cpus(cachep, limit, batchcount);
    prev = cachep->cpu_cache;
    cachep->cpu_cache = cpu_cache;
    check_irq_on();
    cachep->batchcount = batchcount;
    cachep->limit = limit;
    cachep->shared = shared;

    for_each_online_cpu(cpu) {
        LIST_HEAD(list);
        int node;
        struct kmem_cache_node *n;
        struct array_cache *ac = per_cpu_ptr(prev, cpu);

        node = cpu_to_mem(cpu);
        n = get_node(cachep, node);
        raw_spin_lock_irq(&n->list_lock);
        free_block(cachep, ac->entry, ac->avail, node, &list);
        raw_spin_unlock_irq(&n->list_lock);
        slabs_destroy(cachep, &list);
    }
    free_percpu(prev);

setup_node:
    return setup_kmem_cache_nodes(cachep, gfp);
}

First allocate a cpu_cache, a percpu variable with one instance per cpu, then initialize it.

static struct array_cache __percpu *alloc_kmem_cache_cpus(
        struct kmem_cache *cachep, int entries, int batchcount)
{
    
    size = sizeof(void *) * entries + sizeof(struct array_cache);
    cpu_cache = __alloc_percpu(size, sizeof(void *));

    for_each_possible_cpu(cpu) {
        init_arraycache(per_cpu_ptr(cpu_cache, cpu),
                entries, batchcount);
    }

    return cpu_cache;
}
static void init_arraycache(struct array_cache *ac, int limit, int batch)
{
    if (ac) {
        ac->avail = 0;
        ac->limit = limit;
        ac->batchcount = batch;
        ac->touched = 0;
    }
}

avail is 0, so the free_block call above has nothing to do. limit and batchcount are set.

Besides cpu_cache, the nodes are initialized too.

static int setup_kmem_cache_nodes(struct kmem_cache *cachep, gfp_t gfp)
{
    for_each_online_node(node) {
        ret = setup_kmem_cache_node(cachep, node, gfp, true);
    }
    return 0;
}
static int setup_kmem_cache_node(struct kmem_cache *cachep,
                int node, gfp_t gfp, bool force_change)
{
    LIST_HEAD(list);

    if (use_alien_caches) {
        new_alien = alloc_alien_cache(node, cachep->limit, gfp);
    }

    if (cachep->shared) {
        new_shared = alloc_arraycache(node,
            cachep->shared * cachep->batchcount, 0xbaadf00d, gfp);
    }

    ret = init_cache_node(cachep, node, gfp);

    n = get_node(cachep, node);
    raw_spin_lock_irq(&n->list_lock);
    if (n->shared && force_change) {
        free_block(cachep, n->shared->entry,
                n->shared->avail, node, &list);
        n->shared->avail = 0;
    }

    if (!n->shared || force_change) {
        old_shared = n->shared;
        n->shared = new_shared;
        new_shared = NULL;
    }

    if (!n->alien) {
        n->alien = new_alien;
        new_alien = NULL;
    }

    raw_spin_unlock_irq(&n->list_lock);
    slabs_destroy(cachep, &list);

    return ret;
}

This allocates and initializes an alien cache and an array cache. alien holds the caches for all other nodes; the array cache is assigned to shared, the cache shared within this node. Then a kmem cache node is initialized and the freshly created alien and shared caches are attached to it. One odd thing: what does 0xbaadf00d mean? (It is hexspeak for "bad food", a classic marker value.) Kernel code can be baffling sometimes; something this ugly deserves at least a comment.

Heh. After spending the whole day analyzing slab, I discovered in the evening that slab has been removed from the kernel. A day wasted. Tomorrow, slub. Good night.

Here is a blog post on slub with good diagrams: 圖解slub (wowotech.net).

Let's look at kmem_cache again, this time SLUB's.

/*
 * Slab cache management.
 */
struct kmem_cache {
#ifndef CONFIG_SLUB_TINY
    struct kmem_cache_cpu __percpu *cpu_slab;
#endif
    /* Used for retrieving partial slabs, etc. */
    slab_flags_t flags;
    unsigned long min_partial;
    unsigned int size;        /* Object size including metadata */
    unsigned int object_size;    /* Object size without metadata */
    struct reciprocal_value reciprocal_size;
    unsigned int offset;        /* Free pointer offset */
#ifdef CONFIG_SLUB_CPU_PARTIAL
    /* Number of per cpu partial objects to keep around */
    unsigned int cpu_partial;
    /* Number of per cpu partial slabs to keep around */
    unsigned int cpu_partial_slabs;
#endif
    struct kmem_cache_order_objects oo;

    /* Allocation and freeing of slabs */
    struct kmem_cache_order_objects min;
    gfp_t allocflags;        /* gfp flags to use on each alloc */
    int refcount;            /* Refcount for slab cache destroy */
    void (*ctor)(void *object);    /* Object constructor */
    unsigned int inuse;        /* Offset to metadata */
    unsigned int align;        /* Alignment */
    unsigned int red_left_pad;    /* Left redzone padding size */
    const char *name;        /* Name (only for display!) */
    struct list_head list;        /* List of slab caches */
    struct kmem_cache_node *node[MAX_NUMNODES];
};

slub does feel much leaner than slab. The colouring machinery is gone, the percpu cache is no longer mandatory, and the node cache is pared down. The key fields describing a cache: the size including metadata (size), the size without metadata (object_size), the free-pointer offset (offset), and oo, which packs the slab length and object count into a single unsigned int, a neat trick. There's also list, which links this kmem_cache into the global slab_caches list, the alignment (align), and node. Next, kmem_cache_node.

/*
 * The slab lists for all objects.
 */
struct kmem_cache_node {
    spinlock_t list_lock;
    unsigned long nr_partial;
    struct list_head partial;
#ifdef CONFIG_SLUB_DEBUG
    atomic_long_t nr_slabs;
    atomic_long_t total_objects;
    struct list_head full;
#endif
};

Clean and simple, much easier on the eyes than slab; debug fields aside, there is only the partial list. Good riddance to slab; I never liked it.

In the 6.7 kernel I'm reading, struct slab is its own type again, no longer squatting inside struct page.

/* Reuses the bits in struct page */
struct slab {
    unsigned long __page_flags;

    struct kmem_cache *slab_cache;
    union {
        struct {
            union {
                struct list_head slab_list;
#ifdef CONFIG_SLUB_CPU_PARTIAL
                struct {
                    struct slab *next;
                    int slabs;    /* Nr of slabs left */
                };
#endif
            };
            /* Double-word boundary */
            union {
                struct {
                    void *freelist;        /* first free object */
                    union {
                        unsigned long counters;
                        struct {
                            unsigned inuse:16;
                            unsigned objects:15;
                            unsigned frozen:1;
                        };
                    };
                };
#ifdef system_has_freelist_aba
                freelist_aba_t freelist_counter;
#endif
            };
        };
        struct rcu_head rcu_head;
    };
    unsigned int __unused;

    atomic_t __page_refcount;
#ifdef CONFIG_MEMCG
    unsigned long memcg_data;
#endif
};

(A figure borrowed from elsewhere went here; it still showed slab embedded inside struct page, so correct it mentally.)

On to the slub API: kmem_cache_create, which creates an object cache.

struct kmem_cache *
kmem_cache_create(const char *name, unsigned int size, unsigned int align,
        slab_flags_t flags, void (*ctor)(void *))
{
    return kmem_cache_create_usercopy(name, size, align, flags, 0, 0,
                      ctor);
}

The key function is kmem_cache_create_usercopy.

struct kmem_cache *
kmem_cache_create_usercopy(const char *name,
          unsigned int size, unsigned int align,
          slab_flags_t flags,
          unsigned int useroffset, unsigned int usersize,
          void (*ctor)(void *))
{
    ...
    mutex_lock(&slab_mutex);
    err = kmem_cache_sanity_check(name, size);
    ...
    /* Fail closed on bad usersize of useroffset values. */
    if (!usersize)
        s = __kmem_cache_alias(name, size, align, flags, ctor);
    if (s)
        goto out_unlock;

    cache_name = kstrdup_const(name, GFP_KERNEL);
    s = create_cache(cache_name, size,
             calculate_alignment(flags, align, size),
             flags, useroffset, usersize, ctor, NULL);

out_unlock:
    mutex_unlock(&slab_mutex);
    return s;
}

Simplified, it does two things. First it checks slab_caches for an existing compatible cache and returns it directly if found.

struct kmem_cache *find_mergeable(unsigned int size, unsigned int align,
        slab_flags_t flags, const char *name, void (*ctor)(void *))
{
    ...
    list_for_each_entry_reverse(s, &slab_caches, list) {
        ...
        return s;
    }
    ...
    return NULL;
}

If no existing cache is found, create one.

static struct kmem_cache *create_cache(const char *name,
        unsigned int object_size, unsigned int align,
        slab_flags_t flags, unsigned int useroffset,
        unsigned int usersize, void (*ctor)(void *),
        struct kmem_cache *root_cache)
{
    ...
    s = kmem_cache_zalloc(kmem_cache, GFP_KERNEL);
    s->name = name;
    s->size = s->object_size = object_size;
    s->align = align;
    s->ctor = ctor;
#ifdef CONFIG_HARDENED_USERCOPY
    s->useroffset = useroffset;
    s->usersize = usersize;
#endif

    err = __kmem_cache_create(s, flags);
    s->refcount = 1;
    list_add(&s->list, &slab_caches);
    return s;
    ...
}

First allocate memory for the kmem_cache, initialize it from the arguments, then hand it to the core function __kmem_cache_create. Once built, it is added to the global slab_caches list.

int __kmem_cache_create(struct kmem_cache *s, slab_flags_t flags)
{
    ...
    err = kmem_cache_open(s, flags);
    ...
    return 0;
}

kmem_cache_open does the important initialization work: calculate_sizes works out the slab geometry, i.e. kmem_cache->oo with its order and object count; set_cpu_partial computes the per-cpu partial list's object and slab counts; init_kmem_cache_nodes allocates and initializes the kmem cache nodes; alloc_kmem_cache_cpus allocates and initializes the percpu cpu_slab.

Now, how a cache is destroyed.

void kmem_cache_destroy(struct kmem_cache *s)
{
    ...
    s->refcount--;
    ...
    err = shutdown_cache(s);
    ...
    kmem_cache_release(s);
}

shutdown_cache releases all the slab memory; kmem_cache_release removes the related sysfs objects.

Creating the cache is only step one for slab allocation; to actually allocate memory you use another API, kmem_cache_alloc.

kmem_cache_alloc is somewhat involved; here is the flow in brief. Slabs live in three places: the percpu freelist, the percpu partial list, and the node cache. The allocator tries each in turn, returning the object's address as soon as one succeeds; if all fail, it gets a fresh slab from the buddy system. Newly acquired slabs are then kept in the same order of places.

Freeing a slab object follows the same order. This is why slab is called a cache: freed objects are not returned to the buddy system but to the cache, ready for next time.

slab is central to kernel memory allocation; many high-frequency allocators are built on it, most famously kmalloc. The usage pattern is now clear: create a kmem_cache with kmem_cache_create, then allocate objects from it with kmem_cache_alloc.

Here is an in-kernel example of slab usage, task_struct:

void __init fork_init(void)
{
  ...
        task_struct_cachep = kmem_cache_create_usercopy("task_struct",
                        arch_task_struct_size, align,
                        SLAB_PANIC|SLAB_ACCOUNT,
                        useroffset, usersize, NULL);
...
}

static inline struct task_struct *alloc_task_struct_node(int node)
{
        return kmem_cache_alloc_node(task_struct_cachep, GFP_KERNEL, node);
}

fork_init pre-creates the task_struct cache with kmem_cache_create_usercopy, and alloc_task_struct_node then uses kmem_cache_alloc_node to allocate a task_struct object.
