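Linux Memory Management: Memory Mapping

Published 2015-09-24

Everything covered so far about memory depends on mapping, whether from virtual addresses to physical addresses or between user-space and kernel-space addresses. In user space, the most common mapping call is mmap, which maps a device's I/O space so that it can be accessed directly, improving I/O efficiency. In the kernel, ioremap maps a device's I/O address space for kernel access, and kmap maps allocated high memory.

There is also DMA, which is used most heavily by the ring-buffer mechanism in network card drivers.

Let's start with mmap: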

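Function prototype: void *mmap(void *start, size_t length, int prot, int flags, int fd, off_t offset)

Parameters:

start: the starting address of the mapping. Passing 0 (NULL) lets the system choose the address.
length: the length of the mapping in bytes. A length that is not a whole number of pages is rounded up to the next page.
prot: the desired memory protection, which must not conflict with the file's open mode. It is either PROT_NONE or the bitwise OR of one or more of the other values:
PROT_EXEC // pages may be executed
PROT_READ // pages may be read
PROT_WRITE // pages may be written
PROT_NONE // pages may not be accessed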
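flags: selects the type of the mapped object, mapping options, and whether the mapped pages are shared. It is the bitwise OR of one or more of the following:
MAP_FIXED // Use exactly the given start address. If the range given by start and length overlaps existing mappings, the overlapped part of those mappings is discarded. If the address cannot be used, the call fails. The address must be page-aligned.
MAP_SHARED // Share the mapping with every other process that maps the same object. Writes to the region are carried through to the file, but the file may not actually be updated until msync() or munmap() is called.
MAP_PRIVATE // Create a private copy-on-write mapping. Writes to the region do not affect the underlying file. This flag and MAP_SHARED are mutually exclusive; exactly one of them must be given.
MAP_DENYWRITE // This flag is ignored.
MAP_EXECUTABLE // Also ignored.
MAP_NORESERVE // Do not reserve swap space for this mapping. When swap space is reserved, modifications to the mapping are guaranteed to succeed. Without a reservation, a write may raise SIGSEGV if no physical memory is available.
MAP_LOCKED // Lock the pages of the mapping so that they are not swapped out.
MAP_GROWSDOWN // Used for stacks; tells the kernel VM system that the mapping may grow downward.
MAP_ANONYMOUS // Anonymous mapping; the region is not backed by any file.
MAP_ANON // Synonym for MAP_ANONYMOUS; deprecated.
MAP_FILE // Compatibility flag; ignored.
MAP_32BIT // Place the mapping in the first 2 GB of the process address space. Ignored when MAP_FIXED is set. Currently supported only on x86-64.
MAP_POPULATE // Prefault the page tables for the mapping. For a file mapping this triggers read-ahead, so later accesses are not blocked by page faults.
MAP_NONBLOCK // Meaningful only together with MAP_POPULATE. Skips read-ahead and creates page table entries only for pages already in RAM.
MAP_HUGETLB (since Linux 2.6.32) // Allocate the mapping using "huge pages". See the kernel source file Documentation/vm/hugetlbpage.txt for further information.
fd: a valid file descriptor, normally returned by open(). It may be -1 when MAP_ANONYMOUS (or MAP_ANON) is set in flags, which requests an anonymous mapping.
offset: where the mapping starts within the mapped object. It is usually 0, meaning the mapping starts at the beginning of the file, and it must be a multiple of the page size.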

Return value:

On success, mmap() returns a pointer to the mapped region. On failure it returns MAP_FAILED, that is (void *)-1, and sets errno.

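The above only covers mmap's basic parameters. User space reaches devices through files, and file_operations contains an mmap function pointer, so a call to mmap normally needs an fd.

From include/linux/fs.h: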

struct file_operations {
    struct module *owner;
    loff_t (*llseek) (struct file *, loff_t, int);
    ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
    ...
    long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
    long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
    int (*mmap) (struct file *, struct vm_area_struct *);
    ...
};

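From this structure we can see that the mmap system call ends up calling the file's mmap operation.
Next, look at how the mmap system call is implemented, in mm/mmap.c: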

SYSCALL_DEFINE6(mmap_pgoff, unsigned long, addr, unsigned long, len,
        unsigned long, prot, unsigned long, flags,
        unsigned long, fd, unsigned long, pgoff)
{
    struct file *file = NULL;
    unsigned long retval = -EBADF;

    if (!(flags & MAP_ANONYMOUS)) {                     /* file-backed (non-anonymous) mapping */
        audit_mmap_fd(fd, flags);
        if (unlikely(flags & MAP_HUGETLB))
            return -EINVAL;
        file = fget(fd);
        if (!file)
            goto out;
        if (is_file_hugepages(file))
            len = ALIGN(len, huge_page_size(hstate_file(file)));
    } else if (flags & MAP_HUGETLB) {                          /* anonymous hugetlb (huge page) mapping */
        struct user_struct *user = NULL;

        len = ALIGN(len, huge_page_size(hstate_sizelog(
            (flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK)));
        /*
         * VM_NORESERVE is used because the reservations will be
         * taken when vm_ops->mmap() is called
         * A dummy user value is used because we are not locking
         * memory so no accounting is necessary
         */
        file = hugetlb_file_setup(HUGETLB_ANON_FILE, len,
                VM_NORESERVE,
                &user, HUGETLB_ANONHUGE_INODE,
                (flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
        if (IS_ERR(file))
            return PTR_ERR(file);
    }

    flags &= ~(MAP_EXECUTABLE | MAP_DENYWRITE);

    retval = vm_mmap_pgoff(file, addr, len, prot, flags, pgoff);
    if (file)
        fput(file);
out:
    return retval;
}


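Before going further, two pieces of background are needed: the layout of user-space memory and struct vm_area_struct.

The layout was covered in an earlier article.
The structure is defined in include/linux/mm_types.h: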

/*
 * This struct defines a memory VMM memory area. There is one of these
 * per VM-area/task. A VM area is any part of the process virtual memory
 * space that has a special rule for the page-fault handlers (ie a shared
 * library, the executable area etc).
 */
struct vm_area_struct {
    /* The first cache line has the info for VMA tree walking. */

    unsigned long vm_start;        /* Our start address within vm_mm. */
    unsigned long vm_end;        /* The first byte after our end address
                     within vm_mm. */

    /* linked list of VM areas per task, sorted by address */
    struct vm_area_struct *vm_next, *vm_prev;

    struct rb_node vm_rb;

    /*
     * Largest free memory gap in bytes to the left of this VMA.
     * Either between this VMA and vma->vm_prev, or between one of the
     * VMAs below us in the VMA rbtree and its ->vm_prev. This helps
     * get_unmapped_area find a free area of the right size.
     */
    unsigned long rb_subtree_gap;

    /* Second cache line starts here. */

    struct mm_struct *vm_mm;    /* The address space we belong to. */
    pgprot_t vm_page_prot;        /* Access permissions of this VMA. */
    unsigned long vm_flags;        /* Flags, see mm.h. */

    /*
     * For areas with an address space and backing store,
     * linkage into the address_space->i_mmap interval tree, or
     * linkage of vma in the address_space->i_mmap_nonlinear list.
     */
    union {
        struct {
            struct rb_node rb;
            unsigned long rb_subtree_last;
        } linear;
        struct list_head nonlinear;
    } shared;

    /*
     * A file's MAP_PRIVATE vma can be in both i_mmap tree and anon_vma
     * list, after a COW of one of the file pages.    A MAP_SHARED vma
     * can only be in the i_mmap tree. An anonymous MAP_PRIVATE, stack
     * or brk vma (with NULL file) can only be in an anon_vma list.
     */
    struct list_head anon_vma_chain; /* Serialized by mmap_sem &
                     * page_table_lock */
    struct anon_vma *anon_vma;    /* Serialized by page_table_lock */

    /* Function pointers to deal with this struct. */
    const struct vm_operations_struct *vm_ops;

    /* Information about our backing store: */
    unsigned long vm_pgoff;        /* Offset (within vm_file) in PAGE_SIZE
                     units, *not* PAGE_CACHE_SIZE */
    struct file * vm_file;        /* File we map to (can be NULL). */
    void * vm_private_data;        /* was vm_pte (shared mem) */

#ifndef CONFIG_MMU
    struct vm_region *vm_region;    /* NOMMU mapping region */
#endif
#ifdef CONFIG_NUMA
    struct mempolicy *vm_policy;    /* NUMA policy for the VMA */
#endif
};


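vm_area_struct objects are managed with a red-black tree. This looks much like some of the structures used by vmalloc, but the two should not be confused.
Each such object in the kernel represents one region of a user process's address space.
When Linux runs an application, the exec system call loads the ELF file into the user virtual address space through load_elf_binary. The stack and heap were covered earlier, and the text segment needs little explanation.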

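The basic flow is:
1. User space calls the mmap system call.
2. The kernel finds a free range in the process's mmap area and allocates a vm_area_struct object to describe it.
3. The page table entries are then set up so that this range of addresses corresponds to the device memory.

In user space, the mmap prototype is:
void *mmap(void *start, size_t length, int prot, int flags, int fd, off_t offset);

For a device file, this works only if the driver behind the opened file implements mmap.

The kernel side of the mmap system call does two things:
1. Look up the struct file that corresponds to fd.
2. Call do_mmap_pgoff to do the mapping.

do_mmap_pgoff in detail:
(1) Call get_unmapped_area to obtain an unused address range.
(2) Continue in mmap_region, which allocates the vm_area_struct.
(3) mmap_region calls the driver's implementation through file->f_op->mmap.
(4) The driver's mmap does the device-specific mapping.

To help with implementing mmap in a driver, the Linux kernel already provides an API for setting up page table mappings:

remap_pfn_range (mm/memory.c), along with other related interfaces.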

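mmap can ignore the fd argument: MAP_ANONYMOUS creates an anonymous mapping. In that case fd is ignored, no file is involved, and the region cannot be shared with unrelated processes (with MAP_SHARED it can still be shared with children created by fork).
fd is the descriptor of the file to map into memory. For an anonymous mapping, that is when MAP_ANONYMOUS is set in flags, fd should be -1. Some systems do not support anonymous mappings; there, opening /dev/zero with open() and mapping that file achieves the same effect.
MAP_HUGETLB is an mmap flag introduced in kernel 2.6.32 that backs the mapping with huge pages.

Huge pages reduce the CPU overhead of managing large amounts of memory. They are effective at cutting TLB misses, especially for memory-intensive programs with large working sets, such as databases.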
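The /dev/zero fallback mentioned above can be sketched as follows. This is a minimal illustration, assuming /dev/zero exists and is readable and writable; map_dev_zero is a hypothetical helper name.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Obtain zero-filled private memory by mapping /dev/zero instead of
 * using MAP_ANONYMOUS. Returns NULL on failure (hypothetical helper). */
static void *map_dev_zero(size_t len)
{
    int fd = open("/dev/zero", O_RDWR);   /* open(), not fopen(): mmap needs an fd */
    if (fd < 0)
        return NULL;

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    close(fd);                            /* the mapping survives close() */
    return p == MAP_FAILED ? NULL : p;
}
```

The descriptor can be closed right after mmap() returns, because the mapping keeps its own reference to the underlying file.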

An ordinary file-backed mmap call takes the first if branch and obtains the file pointer:

if (!(flags & MAP_ANONYMOUS)) {
        audit_mmap_fd(fd, flags);
        if (unlikely(flags & MAP_HUGETLB))
            return -EINVAL;
        file = fget(fd);
        if (!file)
            goto out;
        if (is_file_hugepages(file))
            len = ALIGN(len, huge_page_size(hstate_file(file)));
    }


It then calls:

unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
    unsigned long len, unsigned long prot,
    unsigned long flag, unsigned long pgoff)
{
    unsigned long ret;
    struct mm_struct *mm = current->mm;

    ret = security_mmap_file(file, prot, flag);
    if (!ret) {
        down_write(&mm->mmap_sem);
        ret = do_mmap_pgoff(file, addr, len, prot, flag, pgoff);
        up_write(&mm->mmap_sem);
    }
    return ret;
}


vm_mmap_pgoff takes mmap_sem for writing and then calls do_mmap_pgoff:

/*
 * The caller must hold down_write(&current->mm->mmap_sem).
 */

unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
            unsigned long len, unsigned long prot,
            unsigned long flags, unsigned long pgoff)
{
    struct mm_struct * mm = current->mm;
    struct inode *inode;
    vm_flags_t vm_flags;

    /*
     * Does the application expect PROT_READ to imply PROT_EXEC?
     *
     * (the exception is when the underlying filesystem is noexec
     * mounted, in which case we dont add PROT_EXEC.)
     */
    if ((prot & PROT_READ) && (current->personality & READ_IMPLIES_EXEC))
        if (!(file && (file->f_path.mnt->mnt_flags & MNT_NOEXEC)))
            prot |= PROT_EXEC;

    if (!len)
        return -EINVAL;

    if (!(flags & MAP_FIXED))
        addr = round_hint_to_min(addr);

    /* Careful about overflows.. */
    len = PAGE_ALIGN(len);
    if (!len)
        return -ENOMEM;

    /* offset overflow? */
    if ((pgoff + (len >> PAGE_SHIFT)) < pgoff)
               return -EOVERFLOW;

    /* Too many mappings? */
    if (mm->map_count > sysctl_max_map_count)
        return -ENOMEM;

    /* Obtain the address to map to. we verify (or select) it and ensure
     * that it represents a valid section of the address space.
     */
    /* Pick a free range in the process's mmap area. The returned start
     * address is later stored in the vma (struct vm_area_struct). */
    addr = get_unmapped_area(file, addr, len, pgoff, flags);
    if (addr & ~PAGE_MASK)
        return addr;

    /* Do simple checking here so the lower-level routines won't have
     * to. we assume access permissions have been handled by the open
     * of the memory object, so we don't do any here.
     */
    vm_flags = calc_vm_prot_bits(prot) | calc_vm_flag_bits(flags) |
            mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;

    if (flags & MAP_LOCKED)
        if (!can_do_mlock())
            return -EPERM;

    /* mlock MCL_FUTURE? */
    if (vm_flags & VM_LOCKED) {
        unsigned long locked, lock_limit;
        locked = len >> PAGE_SHIFT;
        locked += mm->locked_vm;
        lock_limit = rlimit(RLIMIT_MEMLOCK);
        lock_limit >>= PAGE_SHIFT;
        if (locked > lock_limit && !capable(CAP_IPC_LOCK))
            return -EAGAIN;
    }

    inode = file ? file->f_path.dentry->d_inode : NULL;             /* inode of the backing file, if any */

    if (file) {
        switch (flags & MAP_TYPE) {
        case MAP_SHARED:
            if ((prot&PROT_WRITE) && !(file->f_mode&FMODE_WRITE))
                return -EACCES;

            /*
             * Make sure we don't allow writing to an append-only
             * file..
             */
            if (IS_APPEND(inode) && (file->f_mode & FMODE_WRITE))
                return -EACCES;

            /*
             * Make sure there are no mandatory locks on the file.
             */
            if (locks_verify_locked(inode))
                return -EAGAIN;

            vm_flags |= VM_SHARED | VM_MAYSHARE;
            if (!(file->f_mode & FMODE_WRITE))
                vm_flags &= ~(VM_MAYWRITE | VM_SHARED);

            /* fall through */
        case MAP_PRIVATE:
            if (!(file->f_mode & FMODE_READ))
                return -EACCES;
            if (file->f_path.mnt->mnt_flags & MNT_NOEXEC) {
                if (vm_flags & VM_EXEC)
                    return -EPERM;
                vm_flags &= ~VM_MAYEXEC;
            }

            if (!file->f_op || !file->f_op->mmap)
                return -ENODEV;
            break;

        default:
            return -EINVAL;
        }
    } else {
        switch (flags & MAP_TYPE) {
        case MAP_SHARED:
            /*
             * Ignore pgoff.
             */
            pgoff = 0;
            vm_flags |= VM_SHARED | VM_MAYSHARE;
            break;
        case MAP_PRIVATE:
            /*
             * Set pgoff according to addr for anon_vma.
             */
            pgoff = addr >> PAGE_SHIFT;
            break;
        default:
            return -EINVAL;
        }
    }

    return mmap_region(file, addr, len, flags, vm_flags, pgoff);
}


do_mmap_pgoff first calls get_unmapped_area to pick a free range in the process's mmap area, then calls mmap_region to do the actual mapping.
In mmap_region:

unsigned long mmap_region(struct file *file, unsigned long addr,
             unsigned long len, unsigned long flags,
             vm_flags_t vm_flags, unsigned long pgoff)
{
    struct mm_struct *mm = current->mm;
    struct vm_area_struct *vma, *prev;
    int correct_wcount = 0;
    int error;
    struct rb_node **rb_link, *rb_parent;
    unsigned long charged = 0;
    struct inode *inode = file ? file->f_path.dentry->d_inode : NULL;

    ...

    /*
     * Can we just expand an old mapping?
     */
    vma = vma_merge(mm, prev, addr, addr + len, vm_flags, NULL, file, pgoff, NULL);
    if (vma)
        goto out;

    /*
     * Determine the object being mapped and call the appropriate
     * specific mapper. the address has already been validated, but
     * not unmapped, but the maps are removed from the list.
     */
    vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);                  /* allocate a zeroed vma; it is initialized below */
    if (!vma) {
        error = -ENOMEM;
        goto unacct_error;
    }

    vma->vm_mm = mm;
    vma->vm_start = addr;
    vma->vm_end = addr + len;
    vma->vm_flags = vm_flags;
    vma->vm_page_prot = vm_get_page_prot(vm_flags);
    vma->vm_pgoff = pgoff;
    INIT_LIST_HEAD(&vma->anon_vma_chain);

    error = -EINVAL;    /* when rejecting VM_GROWSDOWN|VM_GROWSUP */

    if (file) {
        if (vm_flags & (VM_GROWSDOWN|VM_GROWSUP))
            goto free_vma;
        if (vm_flags & VM_DENYWRITE) {
            error = deny_write_access(file);
            if (error)
                goto free_vma;
            correct_wcount = 1;
        }
        vma->vm_file = get_file(file);
        error = file->f_op->mmap(file, vma);               /* call the mmap implementation of the opened file */
        if (error)
            goto unmap_and_free_vma;

        /* Can addr have changed??
         *
         * Answer: Yes, several device drivers can do it in their
         * f_op->mmap method. -DaveM
         * Bug: If addr is changed, prev, rb_link, rb_parent should
         * be updated for vma_link()
         */
        WARN_ON_ONCE(addr != vma->vm_start);

        addr = vma->vm_start;
        pgoff = vma->vm_pgoff;
        vm_flags = vma->vm_flags;
    } else if (vm_flags & VM_SHARED) {
        if (unlikely(vm_flags & (VM_GROWSDOWN|VM_GROWSUP)))
            goto free_vma;
        error = shmem_zero_setup(vma);
        if (error)
            goto free_vma;
    }

   ...


}


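Two things in this function matter most: it allocates and initializes the vma, and it then calls file->f_op->mmap(file, vma).
That makes the whole flow clear: a driver developer only needs to care about the mmap implementation in the device driver's file operations.

Several figures help in understanding how an executable is mapped:

Every process has one mm_struct:

mm_struct contains
struct vm_area_struct * mmap; /* list of VMAs */

This field holds all of the process's mapped regions. As noted earlier, each vma (a struct vm_area_struct) represents one mapping in user space, and this is where they are linked together.

mmap_region contains this line:

vma_link(mm, vma, prev, rb_link, rb_parent); which adds the newly allocated vma to these management structures.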
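To make the address-ordered VMA list more concrete, here is a simplified user-space model. It is not kernel code: struct region stands in for vm_area_struct, region_link and region_find are hypothetical names, and the red-black tree that the kernel keeps alongside the list is left out. region_find follows the documented lookup rule of the kernel's find_vma(), which returns the first area whose end lies above the address.

```c
#include <stddef.h>

/* Simplified stand-in for struct vm_area_struct: a half-open range
 * [start, end) linked in ascending address order. */
struct region {
    unsigned long start;
    unsigned long end;
    struct region *next;
};

/* Insert r into the sorted list at *head. Ranges are assumed not to
 * overlap, as is the case for the VMAs of one address space. */
static void region_link(struct region **head, struct region *r)
{
    while (*head && (*head)->start < r->start)
        head = &(*head)->next;
    r->next = *head;
    *head = r;
}

/* Return the first region with addr < end, or NULL if there is none.
 * As with find_vma(), the result may start above addr. */
static struct region *region_find(struct region *head, unsigned long addr)
{
    for (; head; head = head->next)
        if (addr < head->end)
            return head;
    return NULL;
}
```

A linear walk like this is what makes a tree worthwhile in the real kernel: a process with many mappings would otherwise pay for a full list traversal on every lookup.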

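Note that mapping a library file differs from mapping a device in a driver: the former does not need physically contiguous memory, while the latter does, because device I/O space is contiguous by default.

For an ordinary file, what is the mmap operation in its struct file?

That depends on the filesystem; many use:
.mmap=generic_file_mmap // filemap.c

The mappings can also be inspected through proc:
# cat /proc/<pid>/maps

A static binary can be examined with nm and objdump: nm lists the binary's symbols and objdump shows ELF information. file and readelf work as well.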
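The same information is available to a program about itself through /proc/self/maps. The sketch below checks whether an address lies inside one of the listed ranges; addr_is_mapped is a hypothetical helper name, and the code relies only on each line starting with a hexadecimal "start-end" pair.

```c
#include <stdio.h>

/* Return 1 if addr falls inside a range listed in /proc/self/maps,
 * 0 if it does not, and -1 if the file cannot be opened. */
static int addr_is_mapped(const void *addr)
{
    unsigned long a = (unsigned long)addr, start, end;
    char line[512];
    int found = 0;
    FILE *f = fopen("/proc/self/maps", "r");

    if (!f)
        return -1;
    /* Each line begins with "start-end" in hexadecimal. */
    while (!found && fgets(line, sizeof line, f))
        if (sscanf(line, "%lx-%lx", &start, &end) == 2)
            found = (a >= start && a < end);
    fclose(f);
    return found;
}
```

Both a stack variable and a string literal should be reported as mapped, since the stack and the read-only data of the executable each appear as a line in the file.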

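Now for what mmap can be used for:

1. Shared memory with mmap:

(1) Memory mapping backed by an ordinary file:
This works between any processes. Open or create a file, then call mmap().

Typical code:
fd=open(name, flag, mode); if(fd<0) …
ptr=mmap(NULL, len , PROT_READ|PROT_WRITE, MAP_SHARED , fd , 0);
Shared-memory communication through mmap() has many characteristics and pitfalls; see UNIX Network Programming, Volume 2.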
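A slightly fuller version of that typical call sequence might look like the sketch below. write_through_mapping is a hypothetical helper name; the sketch assumes the file behind fd is already at least len bytes long (for example after ftruncate()), because a mapping cannot extend a file.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Copy len bytes of msg to the start of the file behind fd through a
 * MAP_SHARED mapping. Returns 0 on success, -1 on failure. */
static int write_through_mapping(int fd, const char *msg, size_t len)
{
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return -1;

    memcpy(p, msg, len);
    int rc = msync(p, len, MS_SYNC);   /* flush the dirty pages to the file */
    if (munmap(p, len) != 0)
        rc = -1;
    return rc;
}
```

Any other process that maps the same file with MAP_SHARED sees the bytes as soon as they are stored; msync() only matters for when the data reaches the file itself.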

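(2) Anonymous memory mapping provided by a special file:
This works between related processes. Because of the parent-child relationship, the parent calls mmap() first and then fork(). After fork(), the child inherits the parent's address space including the anonymous mapping, and therefore also the address returned by mmap(), so parent and child can communicate through the mapped region. Note that this is not ordinary inheritance. Normally a child keeps its own private copies of the variables it inherits from the parent, whereas the region returned by mmap() is shared by both. For related processes, anonymous mapping is probably the best way to implement shared memory: no file has to be named, only the appropriate flags (MAP_SHARED | MAP_ANONYMOUS) need to be set.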

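2. Improving file access efficiency

3. Mapping devices

Implementing a device's mmap function requires remap_pfn_range.
Traditionally, remap_pfn_range does not map ordinary RAM; it only handles reserved pages and physical addresses above the top of physical memory, because those lie outside what the memory management subsystems manage. The range between 640 KB and 1 MB is reserved and can be mapped, as can device I/O memory. To map memory obtained from kmalloc() into user space, first mark it reserved with mem_map_reserve() (on older kernels; later kernels use SetPageReserved(), as the test module below does).

remap_pfn_range is typically used for device memory, while nopage() (replaced by the fault() method in later kernels) is typically used for RAM.
The size of a mapping is fixed when mmap() is called and cannot grow afterward. In other words, a mapping cannot change the size of the file. Conversely, the range of memory the process can access is determined by the mapped portion of the file, not by the file's size. (The offset given when mapping must be a multiple of the page size.)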

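mmap() is typically used in three situations: improving I/O efficiency, anonymous memory mapping, and shared-memory communication between processes.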
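As an illustration of the I/O case, the sketch below scans a whole file through a read-only mapping instead of copying it into a buffer with read(). count_byte is a hypothetical helper name, and whether this is actually faster than read() depends on the workload.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Count occurrences of byte c in a file by scanning a read-only
 * mapping of it. Returns the count, or -1 on failure. */
static long count_byte(const char *path, unsigned char c)
{
    struct stat st;
    long count = 0;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {           /* mmap rejects a zero length */
        close(fd);
        return 0;
    }

    const unsigned char *p = mmap(NULL, (size_t)st.st_size, PROT_READ,
                                  MAP_PRIVATE, fd, 0);
    close(fd);                       /* the mapping stays valid */
    if (p == MAP_FAILED)
        return -1;

    for (off_t i = 0; i < st.st_size; i++)
        if (p[i] == c)
            count++;
    munmap((void *)p, (size_t)st.st_size);
    return count;
}
```

The pages are faulted in from the page cache on first touch, so no intermediate user-space copy of the file contents is made.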

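The kernel usually allocates memory in one of three ways: vmalloc, kmalloc, and alloc_pages. kmalloc and alloc_pages are similar in that both return physically contiguous memory. vmalloc, in contrast, can allocate physically non-contiguous memory and map it to a contiguous range of linear addresses. After each vmalloc, the kernel creates a vm_struct that describes the mapping of the non-contiguous memory. vm_struct resembles a vma, but they are not the same thing: a vma maps physical memory into a process's virtual address space, while a vm_struct maps physical memory into the kernel's linear address space.

Since the memory returned by vmalloc is not physically contiguous, it cannot be mapped into a vma directly with remap_pfn_range(). There are two alternatives. One is to implement the fault() method of vm_operations_struct and map each page when the page fault occurs, which is more work. The other is to call remap_vmalloc_range() directly. Its prototype is:

int remap_vmalloc_range(struct vm_area_struct *vma, void *addr, unsigned long pgoff)

Here vma is passed down from the mmap call and addr is the start address of the memory allocated by vmalloc(). pgoff is the page offset, which normally comes from the offset argument of the mmap() system call and is available as vma->vm_pgoff. The function returns 0 on success. A negative return value indicates an error, usually caused by incorrect arguments.

Note that memory intended to be mapped into user space must not be allocated with plain vmalloc(); use vmalloc_user() instead. Besides allocating the memory, that function marks the corresponding vm_struct as VM_USERMAP. Without that mark, remap_vmalloc_range returns an error.

Below is my own test code for device mapping. (Since it is only a test, it maps kernel memory, in two ways: one using kmalloc and one using vmalloc. When mapping a real device, pass the device's I/O address instead.)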

User-space program:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
    int fd;
    char *p;

    fd = open("/dev/my_mmap", O_RDWR);
    if (fd < 0) {
        perror("open /dev/my_mmap");
        return 1;
    }

    /* Map one page of the device; the driver stored a string in it. */
    p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return 1;
    }
    printf("p..is %s.uuu..\n", p);

    munmap(p, 4096);
    close(fd);
    return 0;
}


Kernel module code:

#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/module.h>

#include <linux/device.h>
#include <linux/cdev.h>
#include <linux/fs.h>
#include <linux/fcntl.h>
#include <linux/string.h>
#include <linux/gfp.h>
#include <linux/mm_types.h>
#include <linux/mm.h>
#include <linux/highmem.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>
#include <linux/io.h>

static struct cdev *my_dev;
static dev_t md;
static void *mp;    /* one page of kernel memory exported to user space */

static int my_open(struct inode *inode, struct file *filp)
{
    return 0;
}

static int my_mmap(struct file *filp, struct vm_area_struct *vma)
{
    unsigned long size = vma->vm_end - vma->vm_start;
    unsigned long pfn;

    /* Only one page was allocated, so refuse larger mappings. */
    if (size > PAGE_SIZE)
        return -EINVAL;

    /* Method 1 (vmalloc): use this line instead of the code below. */
    /* return remap_vmalloc_range(vma, mp, 0); */

    /* Method 2 (kmalloc): map the physical page with remap_pfn_range. */
    pfn = virt_to_phys(mp) >> PAGE_SHIFT;
    return remap_pfn_range(vma, vma->vm_start, pfn, size, vma->vm_page_prot);
}

static const struct file_operations mmap_fops = {
    .owner = THIS_MODULE,
    .open  = my_open,
    .mmap  = my_mmap,
};

static int __init hello_init(void)
{
    int err;

    printk("hello ko ..\n");

    /* Method 1: vmalloc */
    /* mp = vmalloc_user(4096); */

    /* Method 2: kmalloc; mark the page reserved so it can be remapped */
    mp = kmalloc(4096, GFP_KERNEL);
    if (!mp)
        return -ENOMEM;
    SetPageReserved(virt_to_page(mp));
    /* Method 2 end */

    strcpy(mp, "hello");
    printk("p is %s....\n", (char *)mp);

    /* Allocate a device number and register the character device. */
    err = alloc_chrdev_region(&md, 0, 1, "mmap_dev");
    if (err)
        goto free_mem;
    printk("major=%d,minor=%d...\n", MAJOR(md), MINOR(md));

    my_dev = cdev_alloc();
    if (!my_dev) {
        err = -ENOMEM;
        goto unregister;
    }
    my_dev->ops = &mmap_fops;    /* cdev_init() is meant for embedded cdevs */
    my_dev->owner = THIS_MODULE;
    err = cdev_add(my_dev, md, 1);
    if (err)
        goto put_cdev;
    return 0;

put_cdev:
    kobject_put(&my_dev->kobj);  /* frees the cdev from cdev_alloc() */
unregister:
    unregister_chrdev_region(md, 1);
free_mem:
    ClearPageReserved(virt_to_page(mp));
    kfree(mp);
    return err;
}

static void __exit hello_exit(void)
{
    printk("hello exit...\n");
    cdev_del(my_dev);            /* my_dev is already a pointer */
    unregister_chrdev_region(md, 1);
    ClearPageReserved(virt_to_page(mp));
    kfree(mp);
}

module_init(hello_init);
module_exit(hello_exit);
MODULE_LICENSE("GPL");


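Makefile:
obj-m:=hello.o

Build:
make -C /usr/src/linux M=$(pwd) modules // /usr/src/linux is the path to the kernel source or kernel headers
Install: insmod hello.ko // you still need to look up the device number yourself and create the device file, for example with mknod.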
