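Earlier we covered kmalloc, which allocates physically contiguous memory, along with how it works and the slab caches underneath it. The kernel has another allocation interface, vmalloc, which allocates a contiguous range of virtual addresses without guaranteeing that the backing physical memory is contiguous. This is reminiscent of malloc in user space: malloc is a standard function wrapped by glibc and is ultimately implemented on top of the brk and mmap system calls (its implementation will be analyzed in a later post). It also returns contiguous virtual memory with no guarantee of physical contiguity, and user programs rarely care about that. Physical contiguity usually matters only for device drivers or DMA. vmalloc, however, is relatively slow and can cause TLB thrashing, so kernel code normally uses kmalloc unless there is a special need, such as obtaining a large block of memory. One example is loading a .ko module into the kernel, which relies on vmalloc.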
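Release function: vfree
Reference kernel: 3.8.13
This discussion assumes a 32-bit processor, i.e. at most 4 GB of addressable virtual space (64-bit is now common, of course; that will be covered later). Translating virtual addresses to physical addresses efficiently requires hardware support, namely the MMU.
The OS must first set up the page tables (PT). In the Linux kernel, the 4 GB space is not given entirely to user space: the region starting at the high address 0xC0000000 (3 GB) is reserved for kernel space. With the default x86 configuration, physical memory 0-16 MB is ZONE_DMA and 16 MB-896 MB is ZONE_NORMAL, both directly mapped, while the remaining 128 MB of the 1 GB kernel space (from the 896 MB offset up to 1 GB) serves as the mapping area for high memory. This split is configurable.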
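To make the 3 GB / 896 MB split concrete, here is a minimal user-space sketch of the direct-map arithmetic for low memory. It assumes the default x86 layout described above; `PAGE_OFFSET` mirrors the kernel's name for the 3 GB boundary, while `LOWMEM_SIZE` and the three helper functions are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_OFFSET 0xC0000000u            /* start of kernel space (3 GB) */
#define LOWMEM_SIZE (896u * 1024u * 1024u) /* assumed size of the direct map */

/* Returns 1 if the kernel virtual address lies in the directly mapped region. */
static int is_lowmem_vaddr(uint32_t vaddr)
{
    return vaddr >= PAGE_OFFSET && vaddr - PAGE_OFFSET < LOWMEM_SIZE;
}

/* Direct-map translation: physical = virtual - PAGE_OFFSET. */
static uint32_t lowmem_virt_to_phys(uint32_t vaddr)
{
    assert(is_lowmem_vaddr(vaddr));
    return vaddr - PAGE_OFFSET;
}

/* Inverse translation, valid only for physical addresses below 896 MB. */
static uint32_t lowmem_phys_to_virt(uint32_t paddr)
{
    assert(paddr < LOWMEM_SIZE);
    return paddr + PAGE_OFFSET;
}
```

Addresses above the direct map (0xF8000000 and up in this layout) cannot be translated by a fixed offset; they need page-table entries, which is where vmalloc comes in.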
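kmalloc returns a virtual (linear) address. What makes kmalloc special is that the memory it allocates is physically contiguous, which matters a great deal for devices that perform DMA. Memory allocated with vmalloc is contiguous only in linear address space; its physical addresses are not necessarily contiguous, so it cannot be used directly for DMA. The accompanying figure shows the kernel virtual address layout on 32-bit ARM.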
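The difference between "virtually contiguous" and "physically contiguous" can be illustrated with a toy model. The sketch below is not kernel code: `struct toy_area`, `sample_pfn`, and both functions are hypothetical, and a flat array of page frame numbers stands in for real page tables.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1u << PAGE_SHIFT)

/* Hypothetical toy area: a virtual base plus one physical frame number per page. */
struct toy_area {
    uint32_t base;
    unsigned int nr_pages;
    const uint32_t *pfn;
};

/* Made-up frame numbers: three virtually adjacent pages, scattered physically. */
static const uint32_t sample_pfn[] = { 0x1234, 0x0042, 0x2000 };
static const struct toy_area sample_area = { 0xF0000000u, 3, sample_pfn };

/* Translate a virtual address inside the area by looking up its page's frame. */
static uint32_t toy_virt_to_phys(const struct toy_area *a, uint32_t vaddr)
{
    uint32_t off = vaddr - a->base;

    assert(vaddr >= a->base && (off >> PAGE_SHIFT) < a->nr_pages);
    return (a->pfn[off >> PAGE_SHIFT] << PAGE_SHIFT) | (off & (PAGE_SIZE - 1));
}

/* A DMA engine without an IOMMU would need this to return 1. */
static int toy_is_physically_contiguous(const struct toy_area *a)
{
    unsigned int i;

    for (i = 1; i < a->nr_pages; i++)
        if (a->pfn[i] != a->pfn[i - 1] + 1)
            return 0;
    return 1;
}
```

A kmalloc-style buffer corresponds to the case where every frame number is the previous one plus one; a vmalloc-style buffer makes no such promise.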
Now let's look at the vmalloc function (mm/vmalloc.c):

    /**
     *  vmalloc  -  allocate virtually contiguous memory
     *  @size:      allocation size
     *  Allocate enough pages to cover @size from the page level
     *  allocator and map them into contiguous kernel virtual space.
     *
     *  For tight control over page level allocator and protection flags
     *  use __vmalloc() instead.
     */
    void *vmalloc(unsigned long size)
    {
        return __vmalloc_node_flags(size, -1, GFP_KERNEL | __GFP_HIGHMEM);
    }
Only size matters here. vmalloc allocates from high memory first (__GFP_HIGHMEM), and it may sleep (GFP_KERNEL).
Continuing:

    static inline void *__vmalloc_node_flags(unsigned long size,
                                             int node, gfp_t flags)
    {
        return __vmalloc_node(size, 1, flags, PAGE_KERNEL,
                              node, __builtin_return_address(0));
    }
The function to focus on is __vmalloc_node:

    /**
     *  __vmalloc_node  -  allocate virtually contiguous memory
     *  @size:      allocation size
     *  @align:     desired alignment
     *  @gfp_mask:  flags for the page level allocator
     *  @prot:      protection mask for the allocated pages
     *  @node:      node to use for allocation or -1
     *  @caller:    caller's return address
     *
     *  Allocate enough pages to cover @size from the page level
     *  allocator with @gfp_mask flags.  Map them into contiguous
     *  kernel virtual space, using a pagetable protection of @prot.
     */
    static void *__vmalloc_node(unsigned long size, unsigned long align,
                                gfp_t gfp_mask, pgprot_t prot,
                                int node, const void *caller)
    {
        return __vmalloc_node_range(size, align, VMALLOC_START, VMALLOC_END,
                                    gfp_mask, prot, node, caller);
    }
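This mentions VMALLOC_START and VMALLOC_END. What are their values?
Here are the arm32 and mips32 definitions (they depend on how each architecture lays out its virtual address space; MIPS, for example, is rather special):
In arch/mips/include/asm/pgtable-32.h
First, look at the MIPS virtual address layout diagram:
From the diagram, user space is 2 GB (0x0-0x7fffffff). DMA and normal memory are mapped through kseg0 (512 MB) and kseg1, while virtual addresses handed out by vmalloc live in kseg2, alongside some other special mappings such as I/O.

    #define VMALLOC_START     MAP_BASE

    #define PKMAP_BASE        (0xfe000000UL)

    #ifdef CONFIG_HIGHMEM
    # define VMALLOC_END      (PKMAP_BASE-2*PAGE_SIZE)
    #else
    # define VMALLOC_END      (FIXADDR_START-2*PAGE_SIZE)
    #endif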
In arch/arm/include/asm/pgtable.h:

    /*
     * Just any arbitrary offset to the start of the vmalloc VM area: the
     * current 8MB value just means that there will be a 8MB "hole" after the
     * physical memory until the kernel virtual memory starts.  That means that
     * any out-of-bounds memory accesses will hopefully be caught.
     * The vmalloc() routines leaves a hole of 4kB between each vmalloced
     * area for the same reason. ;)
     */
    #define VMALLOC_OFFSET    (8*1024*1024)
    #define VMALLOC_START     (((unsigned long)high_memory + VMALLOC_OFFSET) & ~(VMALLOC_OFFSET-1))
    #define VMALLOC_END       0xff000000UL
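The ARM VMALLOC_START expression is easier to follow with numbers plugged in. The sketch below reproduces the same arithmetic in user space with 32-bit integers; `vmalloc_start_for` is a hypothetical helper, and the `high_memory` values used with it are made-up examples rather than values taken from a real board.

```c
#include <assert.h>
#include <stdint.h>

#define VMALLOC_OFFSET (8u * 1024u * 1024u)

/*
 * Same arithmetic as the ARM macro: step 8 MB past the end of directly
 * mapped memory, then round down to an 8 MB boundary.
 */
static uint32_t vmalloc_start_for(uint32_t high_memory)
{
    /* The mask trick only works when the offset is a power of two. */
    assert((VMALLOC_OFFSET & (VMALLOC_OFFSET - 1)) == 0);
    return (high_memory + VMALLOC_OFFSET) & ~(VMALLOC_OFFSET - 1);
}
```

With `high_memory` at 0xD0000000 (already 8 MB aligned) the result is 0xD0800000, a full 8 MB hole. With 0xCF100000 the result is 0xCF800000, a 7 MB hole. In general the hole is 8 MB minus `high_memory` modulo 8 MB, so it is never empty and never larger than 8 MB.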
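Here is another figure:
Physical memory is divided, roughly, into three zones: ZONE_NORMAL, ZONE_DMA, and ZONE_HIGHMEM.
As we saw, vmalloc allocates from ZONE_HIGHMEM by default. The addresses returned by kmalloc and vmalloc are nevertheless of the same kind: both lie in the kernel part of the 4 GB virtual address space. From the figure above we now know where the virtual addresses are taken from and where the corresponding physical memory is taken from.
Now for the core implementation of vmalloc: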

    /**
     *  __vmalloc_node_range  -  allocate virtually contiguous memory
     *  @size:      allocation size
     *  @align:     desired alignment
     *  @start:     vm area range start
     *  @end:       vm area range end
     *  @gfp_mask:  flags for the page level allocator
     *  @prot:      protection mask for the allocated pages
     *  @node:      node to use for allocation or -1
     *  @caller:    caller's return address
     *
     *  Allocate enough pages to cover @size from the page level
     *  allocator with @gfp_mask flags.  Map them into contiguous
     *  kernel virtual space, using a pagetable protection of @prot.
     */
    void *__vmalloc_node_range(unsigned long size, unsigned long align,
                               unsigned long start, unsigned long end, gfp_t gfp_mask,
                               pgprot_t prot, int node, const void *caller)
    {
        struct vm_struct *area;
        void *addr;
        unsigned long real_size = size;

        size = PAGE_ALIGN(size);
        if (!size || (size >> PAGE_SHIFT) > totalram_pages)
            goto fail;

        /*
         * Allocate the virtual address range and link the vm_struct
         * with its vmap_area (tracked in a red-black tree).
         */
        area = __get_vm_area_node(size, align, VM_ALLOC | VM_UNLIST,
                                  start, end, node, gfp_mask, caller);
        if (!area)
            goto fail;

        /*
         * Work out how many pages are needed, allocate them, then
         * update the page tables to complete the mapping.
         */
        addr = __vmalloc_area_node(area, gfp_mask, prot, node, caller);
        if (!addr)
            return NULL;

        /*
         * In this function, newly allocated vm_struct is not added
         * to vmlist at __get_vm_area_node(). so, it is added here.
         */
        insert_vmalloc_vmlist(area);    /* insert the vm_struct into the global vmlist */

        /*
         * A ref_count = 3 is needed because the vm_struct and vmap_area
         * structures allocated in the __get_vm_area_node() function contain
         * references to the virtual address of the vmalloc'ed block.
         */
        kmemleak_alloc(addr, real_size, 3, gfp_mask);    /* memory-leak tracking */

        return addr;

    fail:
        warn_alloc_failed(gfp_mask, 0,
                          "vmalloc: allocation failure: %lu bytes\n",
                          real_size);
        return NULL;
    }
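The basic idea of the implementation is simple:
1. Allocate a virtual address range.
2. Set up page-table mappings for that virtual address range.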
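Before either step runs, __vmalloc_node_range rounds the request up to whole pages and rejects requests that are empty or larger than total RAM. The sketch below mirrors that check in user space, assuming 4 KB pages; `vmalloc_pages_needed` is a hypothetical helper, and `totalram_pages` is passed in as a parameter rather than read from the kernel.

```c
#include <assert.h>

#define PAGE_SHIFT    12
#define PAGE_SIZE     (1UL << PAGE_SHIFT)
#define PAGE_ALIGN(x) (((x) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))

/*
 * Returns the number of pages a request would need, or 0 if the request
 * would be rejected (zero size, or more pages than the machine has).
 */
static unsigned long vmalloc_pages_needed(unsigned long size,
                                          unsigned long totalram_pages)
{
    unsigned long aligned = PAGE_ALIGN(size);

    assert(totalram_pages > 0);
    if (!aligned || (aligned >> PAGE_SHIFT) > totalram_pages)
        return 0;
    return aligned >> PAGE_SHIFT;
}
```

A request for 4097 bytes therefore costs two pages, which is one reason vmalloc suits large allocations better than small ones.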
You need to be familiar with the following two structures:
struct vmap_area

    struct vmap_area {
        unsigned long va_start;
        unsigned long va_end;
        unsigned long flags;
        struct rb_node rb_node;         /* address sorted rbtree */
        struct list_head list;          /* address sorted list */
        struct list_head purge_list;    /* "lazy purge" list */
        struct vm_struct *vm;
        struct rcu_head rcu_head;
    };

struct vm_struct (the type of the area pointer used above):

    struct vm_struct {
        struct vm_struct    *next;
        void                *addr;
        unsigned long       size;
        unsigned long       flags;
        struct page         **pages;
        unsigned int        nr_pages;
        phys_addr_t         phys_addr;
        const void          *caller;
    };
A note on where vmalloc_init is called during initialization:

    /*
     * Set up kernel memory allocators
     */
    static void __init mm_init(void)
    {
        /*
         * page_cgroup requires contiguous pages,
         * bigger than MAX_ORDER unless SPARSEMEM.
         */
        page_cgroup_init_flatmem();
        mem_init();
        kmem_cache_init();
        percpu_init_late();
        pgtable_cache_init();
        vmalloc_init();
    }

This was already covered in the discussion of the slab mechanism.

    void __init vmalloc_init(void)
    {
        struct vmap_area *va;
        struct vm_struct *tmp;
        int i;

        for_each_possible_cpu(i) {
            struct vmap_block_queue *vbq;

            vbq = &per_cpu(vmap_block_queue, i);
            spin_lock_init(&vbq->lock);
            INIT_LIST_HEAD(&vbq->free);
        }

        /* Import existing vmlist entries. */
        /* vmlist is usually empty this early in boot, so the loop body rarely runs. */
        for (tmp = vmlist; tmp; tmp = tmp->next) {
            va = kzalloc(sizeof(struct vmap_area), GFP_NOWAIT);
            va->flags = VM_VM_AREA;
            va->va_start = (unsigned long)tmp->addr;
            va->va_end = va->va_start + tmp->size;
            va->vm = tmp;
            __insert_vmap_area(va);
        }

        vmap_area_pcpu_hole = VMALLOC_END;

        vmap_initialized = true;
    }
Next, the function __get_vm_area_node:

    static struct vm_struct *__get_vm_area_node(unsigned long size,
                    unsigned long align, unsigned long flags, unsigned long start,
                    unsigned long end, int node, gfp_t gfp_mask, const void *caller)
    {
        struct vmap_area *va;
        struct vm_struct *area;

        BUG_ON(in_interrupt());
        if (flags & VM_IOREMAP) {    /* ioremap flag: the mapping targets device memory */
            int bit = fls(size);

            if (bit > IOREMAP_MAX_ORDER)
                bit = IOREMAP_MAX_ORDER;
            else if (bit < PAGE_SHIFT)
                bit = PAGE_SHIFT;

            align = 1ul << bit;
        }

        size = PAGE_ALIGN(size);
        if (unlikely(!size))
            return NULL;

        area = kzalloc_node(sizeof(*area), gfp_mask & GFP_RECLAIM_MASK, node);
        if (unlikely(!area))
            return NULL;

        /*
         * We always allocate a guard page.
         */
        /*
         * One extra page guards against out-of-bounds access: it is never
         * mapped, so touching it triggers a fault.
         */
        size += PAGE_SIZE;

        /* reserve the vmap_area, i.e. the virtual address range */
        va = alloc_vmap_area(size, align, start, end, node, gfp_mask);
        if (IS_ERR(va)) {
            kfree(area);
            return NULL;
        }

        /*
         * When this function is called from __vmalloc_node_range,
         * we do not add vm_struct to vmlist here to avoid
         * accessing uninitialized members of vm_struct such as
         * pages and nr_pages fields. They will be set later.
         * To distinguish it from others, we use a VM_UNLIST flag.
         */
        if (flags & VM_UNLIST)    /* always taken on the vmalloc path */
            setup_vmalloc_vm(area, va, flags, caller);    /* link the vm_struct and the vmap_area */
        else
            insert_vmalloc_vm(area, va, flags, caller);

        return area;
    }
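Two small computations in __get_vm_area_node are worth isolating: the alignment chosen for VM_IOREMAP requests and the extra guard page. The sketch below reproduces both in user space. `fls_ul`, `ioremap_align`, and `area_size_with_guard` are hypothetical names, and the value used for `IOREMAP_MAX_ORDER` is an assumption for illustration, since architectures may define their own.

```c
#include <assert.h>

#define PAGE_SHIFT        12
#define PAGE_SIZE         (1UL << PAGE_SHIFT)
#define IOREMAP_MAX_ORDER (7 + PAGE_SHIFT)    /* assumed value, may differ per architecture */

/* Like the kernel's fls(): 1-based index of the highest set bit, 0 for x == 0. */
static int fls_ul(unsigned long x)
{
    int bit = 0;

    while (x) {
        bit++;
        x >>= 1;
    }
    return bit;
}

/* Alignment picked for a VM_IOREMAP request, clamped to [PAGE_SIZE, 1 << IOREMAP_MAX_ORDER]. */
static unsigned long ioremap_align(unsigned long size)
{
    int bit = fls_ul(size);

    if (bit > IOREMAP_MAX_ORDER)
        bit = IOREMAP_MAX_ORDER;
    else if (bit < PAGE_SHIFT)
        bit = PAGE_SHIFT;
    return 1UL << bit;
}

/* Size of the reserved virtual range: the page-aligned request plus one unmapped guard page. */
static unsigned long area_size_with_guard(unsigned long size)
{
    assert(size && (size & (PAGE_SIZE - 1)) == 0);
    return size + PAGE_SIZE;
}
```

Because fls returns a 1-based bit index, a power-of-two size such as 64 KB yields an alignment of 128 KB, not 64 KB.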
The core of this function is alloc_vmap_area, and it is an interesting one. Earlier we discussed the virtual address range that vmalloc allocates from. The caller of vmalloc passes only size, while the range between VMALLOC_START and VMALLOC_END differs between mips, x86, and arm.

    /*
     * Allocate a region of KVA of the specified size and alignment, within the
     * vstart and vend.
     */
    static struct vmap_area *alloc_vmap_area(unsigned long size,
                                    unsigned long align,
                                    unsigned long vstart, unsigned long vend,
                                    int node, gfp_t gfp_mask)
    {
        struct vmap_area *va;
        struct rb_node *n;
        unsigned long addr;
        int purged = 0;
        struct vmap_area *first;

        BUG_ON(!size);
        BUG_ON(size & ~PAGE_MASK);
        BUG_ON(!is_power_of_2(align));

        va = kmalloc_node(sizeof(struct vmap_area),
                          gfp_mask & GFP_RECLAIM_MASK, node);
        if (unlikely(!va))
            return ERR_PTR(-ENOMEM);

    retry:
        spin_lock(&vmap_area_lock);
        /*
         * Invalidate cache if we have more permissive parameters.
         * cached_hole_size notes the largest hole noticed _below_
         * the vmap_area cached in free_vmap_cache: if size fits
         * into that hole, we want to scan from vstart to reuse
         * the hole instead of allocating above free_vmap_cache.
         * Note that __free_vmap_area may update free_vmap_cache
         * without updating cached_hole_size or cached_align.
         */
        /*
         * free_vmap_cache is NULL on the first call. Afterwards it is
         * normally set, because of the assignment
         * free_vmap_cache = &va->rb_node under the found: label.
         * The usual reason for dropping it is align < cached_align.
         * A large align can skip over a stretch of virtual address
         * space; once the cache is dropped, the next request has to
         * search again from the tree.
         */
        if (!free_vmap_cache ||
                size < cached_hole_size ||
                vstart < cached_vstart ||
                align < cached_align) {
    nocache:
            cached_hole_size = 0;
            free_vmap_cache = NULL;
        }
        /* record if we encounter less permissive parameters */
        cached_vstart = vstart;
        cached_align = align;

        /* find starting point for our search */
        /*
         * NULL on first use. Otherwise the cache holds the node of the
         * previous allocation, and addr starts at its va_end.
         */
        if (free_vmap_cache) {
            first = rb_entry(free_vmap_cache, struct vmap_area, rb_node);
            addr = ALIGN(first->va_end, align);
            if (addr < vstart)
                goto nocache;
            if (addr + size - 1 < addr)
                goto overflow;

        } else {
            addr = ALIGN(vstart, align);
            if (addr + size - 1 < addr)
                goto overflow;

            /* vmap_area_root.rb_node is also NULL on the very first allocation */
            n = vmap_area_root.rb_node;
            first = NULL;

            /*
             * Not the first allocation and no cached node: descend from
             * the root to the lowest area with va_end >= addr, stopping
             * early if an area contains addr (va_start <= addr).
             */
            while (n) {
                struct vmap_area *tmp;
                tmp = rb_entry(n, struct vmap_area, rb_node);
                if (tmp->va_end >= addr) {
                    first = tmp;
                    if (tmp->va_start <= addr)
                        break;
                    n = n->rb_left;
                } else
                    n = n->rb_right;
            }

            if (!first)
                goto found;
        }

        /* from the starting point, walk areas until a suitable hole is found */
        /* areas are visited in address order through the sorted list */
        while (addr + size > first->va_start && addr + size <= vend) {
            if (addr + cached_hole_size < first->va_start)
                cached_hole_size = first->va_start - addr;
            addr = ALIGN(first->va_end, align);
            if (addr + size - 1 < addr)
                goto overflow;

            /* first is the last area in the list, so nothing lies above it */
            if (list_is_last(&first->list, &vmap_area_list))
                goto found;

            first = list_entry(first->list.next,
                               struct vmap_area, list);
        }

    found:
        if (addr + size > vend)
            goto overflow;

        va->va_start = addr;
        va->va_end = addr + size;
        va->flags = 0;
        __insert_vmap_area(va);              /* add to the red-black tree vmap_area_root */
        free_vmap_cache = &va->rb_node;      /* set the cache; it steers later allocations */
        spin_unlock(&vmap_area_lock);

        BUG_ON(va->va_start & (align-1));
        BUG_ON(va->va_start < vstart);
        BUG_ON(va->va_end > vend);

        return va;

    overflow:
        spin_unlock(&vmap_area_lock);
        if (!purged) {
            purge_vmap_area_lazy();
            purged = 1;
            goto retry;
        }
        if (printk_ratelimit())
            printk(KERN_WARNING
                   "vmap allocation for size %lu failed: "
                   "use vmalloc=<size> to increase size.\n", size);
        kfree(va);
        return ERR_PTR(-EBUSY);
    }
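Stripped of the red-black tree, the free_vmap_cache shortcut, and the lazy purge retry, the search in alloc_vmap_area amounts to a first-fit scan over address-sorted busy ranges. The sketch below shows only that scan, in user space, over a plain sorted array. `struct toy_range`, `sample_busy`, and `toy_find_hole` are hypothetical, and returning 0 to signal failure is a simplification that assumes 0 is never a valid start address.

```c
#include <assert.h>
#include <stddef.h>

#define ALIGN_UP(x, a) (((x) + (a) - 1) & ~((a) - 1))

/* Busy range [start, end); the array must be sorted and non-overlapping. */
struct toy_range {
    unsigned long start, end;
};

/* Made-up occupied ranges used in the examples below. */
static const struct toy_range sample_busy[] = {
    { 0x1000, 0x3000 }, { 0x4000, 0x6000 }, { 0x9000, 0xA000 },
};

/* First-fit: lowest aligned address in [vstart, vend) with size free bytes, or 0. */
static unsigned long toy_find_hole(const struct toy_range *busy, size_t n,
                                   unsigned long size, unsigned long align,
                                   unsigned long vstart, unsigned long vend)
{
    unsigned long addr;
    size_t i;

    assert(size && align && (align & (align - 1)) == 0);
    addr = ALIGN_UP(vstart, align);
    for (i = 0; i < n; i++) {
        if (busy[i].end <= addr)
            continue;                         /* range lies entirely below the candidate */
        if (addr + size <= busy[i].start)
            break;                            /* the hole before this range is big enough */
        addr = ALIGN_UP(busy[i].end, align);  /* collision: retry just above this range */
    }
    if (addr + size < addr || addr + size > vend)
        return 0;                             /* wrapped around, or ran past the window */
    return addr;
}
```

In the kernel the size passed in already includes the guard page, so consecutive vmalloc areas end up separated by one unmapped page.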
Now that the virtual address range has been reserved, what remains is to map it, page by page, onto physical pages.
See the function __vmalloc_area_node:

    static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
                                     pgprot_t prot, int node, const void *caller)
    {
        const int order = 0;
        struct page **pages;
        unsigned int nr_pages, array_size, i;
        gfp_t nested_gfp = (gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO;

        nr_pages = (area->size - PAGE_SIZE) >> PAGE_SHIFT;    /* pages to allocate (guard page excluded) */
        array_size = (nr_pages * sizeof(struct page *));      /* space needed for the page pointers */

        area->nr_pages = nr_pages;
        /* Please note that the recursion is strictly bounded. */
        /*
         * With 4 KB pages and 32-bit pointers, exceeding one page means
         * more than 1024 pages, i.e. a request larger than 4 MB.
         */
        if (array_size > PAGE_SIZE) {
            pages = __vmalloc_node(array_size, 1, nested_gfp|__GFP_HIGHMEM,
                                   PAGE_KERNEL, node, caller);
            area->flags |= VM_VPAGES;
        } else {
            /* fits in one page: take the pointer array from the slab allocator */
            pages = kmalloc_node(array_size, nested_gfp, node);
        }
        area->pages = pages;
        area->caller = caller;
        if (!area->pages) {
            remove_vm_area(area->addr);
            kfree(area);
            return NULL;
        }

        /* allocate the physical pages one at a time from the page allocator */
        for (i = 0; i < area->nr_pages; i++) {
            struct page *page;
            gfp_t tmp_mask = gfp_mask | __GFP_NOWARN;

            if (node < 0)
                page = alloc_page(tmp_mask);
            else
                page = alloc_pages_node(node, tmp_mask, order);

            if (unlikely(!page)) {
                /* Successfully allocated i pages, free them in __vunmap() */
                area->nr_pages = i;
                goto fail;
            }
            area->pages[i] = page;    /* record the page in the pointer array */
        }

        /*
         * Update the page tables, mapping one page at a time, and flush
         * caches to keep data consistent. Readers interested in page
         * mapping can dig into map_vm_area.
         */
        if (map_vm_area(area, prot, &pages))
            goto fail;
        return area->addr;

    fail:
        warn_alloc_failed(gfp_mask, order,
                          "vmalloc: allocation failure, allocated %ld of %ld bytes\n",
                          (area->nr_pages*PAGE_SIZE), area->size);
        vfree(area->addr);
        return NULL;
    }
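The sizing logic at the top of __vmalloc_area_node decides whether the array of page pointers itself comes from kmalloc or from a nested vmalloc. The sketch below reproduces that decision in user space. `struct toy_plan` and `plan_page_array` are hypothetical, and the pointer size is passed as a parameter so that the 32-bit case from the text can be modeled on any host.

```c
#include <assert.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Outcome of the sizing step for one vmalloc area. */
struct toy_plan {
    unsigned long nr_pages;      /* data pages, guard page excluded */
    unsigned long array_size;    /* bytes needed for the page pointer array */
    int array_from_vmalloc;      /* 1 if the array is too big for one page */
};

/* area_size is the reserved virtual range, including the one guard page. */
static struct toy_plan plan_page_array(unsigned long area_size,
                                       unsigned long ptr_size)
{
    struct toy_plan p;

    assert(area_size >= 2 * PAGE_SIZE && (area_size & (PAGE_SIZE - 1)) == 0);
    p.nr_pages = (area_size - PAGE_SIZE) >> PAGE_SHIFT;
    p.array_size = p.nr_pages * ptr_size;
    p.array_from_vmalloc = p.array_size > PAGE_SIZE;
    return p;
}
```

With 4-byte pointers, exactly 1024 data pages (4 MB) still fit the pointer array into one page; the nested vmalloc is only needed from 1025 pages onward.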
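insert_vmalloc_vmlist, as its name suggests, inserts the vm_struct into vmlist.
That completes the whole process. It is less complicated than one might expect, and it gives a better understanding of memory along the way. One more point: in general, a system with high memory fares better here than one without, because vmalloc takes its pages from ZONE_HIGHMEM first and therefore disturbs the normal zone less, which also helps limit problems such as the TLB thrashing that vmalloc allocations can cause.
Some information about vmalloc can be inspected through proc: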

    cat /proc/vmallocinfo
    0xc0002000-0xc0045000  274432 jffs2_zlib_init+0x24/0xa4 pages=66 vmalloc
    0xc0045000-0xc0051000   49152 jffs2_zlib_init+0x40/0xa4 pages=11 vmalloc
    0xc0051000-0xc0053000    8192 brcmnand_create_cet+0x244/0x788 pages=1 vmalloc
    0xc0053000-0xc0055000    8192 ebt_register_table+0x98/0x39c pages=1 vmalloc
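The listing above also confirms the guard page: each range is exactly one page longer than its pages= count. The sketch below parses one line in the format shown and computes that difference. `parse_vmallocinfo_line` and `guard_pages_in_line` are hypothetical helpers, they assume 4 KB pages, and they only handle lines that carry a pages= field directly after the caller, as in the sample output.

```c
#include <assert.h>
#include <stdio.h>

#define PAGE_SHIFT 12

struct vminfo {
    unsigned long start, end, size;
    unsigned int pages;
};

/* Parse "start-end size caller pages=N ..."; returns 0 on success, -1 otherwise. */
static int parse_vmallocinfo_line(const char *line, struct vminfo *out)
{
    char caller[128];

    assert(line && out);
    if (sscanf(line, "%lx-%lx %lu %127s pages=%u",
               &out->start, &out->end, &out->size, caller, &out->pages) != 5)
        return -1;
    return 0;
}

/* Pages in the virtual range that have no backing page, or -1 on a parse error. */
static long guard_pages_in_line(const char *line)
{
    struct vminfo v;

    if (parse_vmallocinfo_line(line, &v) != 0)
        return -1;
    return (long)((v.end - v.start) >> PAGE_SHIFT) - (long)v.pages;
}
```

For the first sample line, the range 0xc0002000-0xc0045000 spans 67 pages (274432 bytes) while pages=66, leaving one guard page.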
Also:

    # cat /proc/vmstat
    # cat /proc/meminfo