Preface
Shared memory is mainly used for inter-process communication. Linux provides two shared memory mechanisms:

(1) **System V shared memory (shmget/shmat/shmdt)**: the original shared memory mechanism, still widely used; it allows sharing between unrelated processes.

(2) **POSIX shared memory (shm_open/shm_unlink)**: also for sharing between unrelated processes, without the overhead of filesystem I/O; intended to be simpler and better than the older API.
In addition, memory mappings on Linux deserve a mention (they, too, can be used for inter-process communication):

**Shared mappings – mmap(2)**

- Shared anonymous mappings: sharing between related processes only (related via fork())
- Shared file mappings: sharing between unrelated processes, backed by a file in the filesystem
System V shared memory has a long history, is widely used, and is supported by most Unix-like systems; it is usually the mechanism we reach for when writing programs. How to use these APIs is not covered here; for a detailed introduction to POSIX shared memory, see here (1) and here (2).
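Before relating the two mechanisms to tmpfs, here is a minimal sketch of what each API looks like in code (error handling omitted; the key 0x1234 and the object name /myshm are arbitrary illustrations, not from any standard):

```c
/* gcc -o shmdemo shmdemo.c -lrt */
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

#define SIZE 4096

int main(void)
{
    /* System V: key -> segment id -> attached address */
    int shmid = shmget((key_t)0x1234, SIZE, IPC_CREAT | 0644);
    void *sysv = shmat(shmid, NULL, 0);
    shmdt(sysv);

    /* POSIX: named object -> file descriptor -> mmap'ed address */
    int fd = shm_open("/myshm", O_RDWR | O_CREAT, 0644);
    ftruncate(fd, SIZE);
    void *psm = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    munmap(psm, SIZE);
    shm_unlink("/myshm");
    return 0;
}
```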
**With all that said, a question arises: what does shared memory have to do with tmpfs?**
The POSIX shared memory object implementation on Linux 2.4 makes use of a dedicated filesystem, which is normally mounted under /dev/shm.
From this we can see that POSIX shared memory is implemented on top of tmpfs. In fact, going a step further, not only PSM (POSIX shared memory) but also SSM (System V shared memory) is implemented in the kernel on top of tmpfs.
Introduction to tmpfs
Below is the introduction to tmpfs from the kernel documentation:
tmpfs has the following uses:
1) There is always a kernel internal mount which you will not see at all. This is used for shared anonymous mappings and SYSV shared memory.
This mount does not depend on CONFIG_TMPFS. If CONFIG_TMPFS is not set, the user visible part of tmpfs is not built. But the internal mechanisms are always present.
2) glibc 2.2 and above expects tmpfs to be mounted at /dev/shm for POSIX shared memory (shm_open, shm_unlink). Adding the following line to /etc/fstab should take care of this:
tmpfs /dev/shm tmpfs defaults 0 0
Remember to create the directory that you intend to mount tmpfs on if necessary.
This mount is not needed for SYSV shared memory. The internal mount is used for that. (In the 2.3 kernel versions it was necessary to mount the predecessor of tmpfs (shm fs) to use SYSV shared memory)
From this we can see that tmpfs has two main uses:

(1) Backing SYSV shared memory and shared anonymous mappings; this part is managed by the kernel and is invisible to users.

(2) Backing POSIX shared memory; the user is responsible for mounting it, normally at /dev/shm; this part depends on CONFIG_TMPFS.
At this point we understand the difference between SSM and PSM, as well as the role of /dev/shm.
Let us now run a few tests.
Tests
Set the tmpfs at /dev/shm to 64M:
# mount -o remount,size=64M /dev/shm
# df -lh
Filesystem Size Used Avail Use% Mounted on
tmpfs 64M 0 64M 0% /dev/shm
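Incidentally, the remount above corresponds roughly to the following mount(2) call. This is just an illustrative sketch; the shell command is the normal way to do it:

```c
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
    /* Remount the tmpfs at /dev/shm with a 64M capacity */
    if (mount("tmpfs", "/dev/shm", "tmpfs", MS_REMOUNT, "size=64M") != 0) {
        perror("mount");
        return 1;
    }
    return 0;
}
```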
The maximum size of a single SYSV shared memory segment is 32M:
# cat /proc/sys/kernel/shmmax
33554432
(1) Creating a 65M System V shared memory segment fails:
# ipcmk -M 68157440
ipcmk: create share memory failed: Invalid argument
This is expected.
(2) Raise shmmax to 65M
# echo 68157440 > /proc/sys/kernel/shmmax
# cat /proc/sys/kernel/shmmax
68157440
# ipcmk -M 68157440
Shared memory id: 0
# ipcs -m
------ Shared Memory Segments --------
key shmid owner perms bytes nattch status
0xef46b249 0 root 644 68157440 0
As you can see, the size of a System V shared memory segment is not constrained by /dev/shm.
(3) Create a POSIX shared memory object
```c
/* gcc -o shmopen shmopen.c -lrt */
#include <unistd.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>

#define MAP_SIZE 68157440

int main(int argc, char *argv[])
{
    int fd;

    fd = shm_open("/shm1", O_RDWR | O_CREAT, 0644);
    if (fd < 0) {
        printf("shm_open failed\n");
        exit(1);
    }
    /* Size the object to 65M; without this the new object stays empty */
    if (ftruncate(fd, MAP_SIZE) < 0) {
        printf("ftruncate failed\n");
        exit(1);
    }
    return 0;
}
```
# ./shmopen
# ls -lh /dev/shm/shm1
-rw-r--r-- 1 root root 65M Mar 3 06:19 /dev/shm/shm1
Although /dev/shm is only 64M, creating a 65M POSIX shared memory object still succeeds.
(4) Write data to the POSIX shared memory
```c
/* gcc -o shmwrite shmwrite.c -lrt */
#include <unistd.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>

#define MAP_SIZE 68157440

int main(int argc, char *argv[])
{
    int fd;
    void *result;

    fd = shm_open("/shm1", O_RDWR | O_CREAT, 0644);
    if (fd < 0) {
        printf("shm_open failed\n");
        exit(1);
    }
    if (ftruncate(fd, MAP_SIZE) < 0) {
        printf("ftruncate failed\n");
        exit(1);
    }
    result = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (result == MAP_FAILED) {
        printf("mmap failed\n");
        exit(1);
    }
    /* ... operate on the mapped region via the result pointer */
    printf("memset\n");
    memset(result, 0, MAP_SIZE);
    //shm_unlink("/shm1");
    return 0;
}
```
# ./shmwrite
memset
Bus error
As you can see, writing 65M of data triggers a Bus error.
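Because the physical page is only allocated at fault time, the failure shows up as SIGBUS on the faulting write rather than as an error from mmap. A process that wants to fail gracefully can catch it; below is a minimal sketch (the program name shmprobe and the overall structure are my own illustration, not part of the original test):

```c
/* gcc -o shmprobe shmprobe.c -lrt */
#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

#define MAP_SIZE 68157440

static sigjmp_buf env;

static void on_sigbus(int sig)
{
    siglongjmp(env, 1); /* jump back out of the faulting write */
}

int main(void)
{
    int fd = shm_open("/shm1", O_RDWR | O_CREAT, 0644);
    if (fd < 0 || ftruncate(fd, MAP_SIZE) < 0) {
        perror("shm_open/ftruncate");
        exit(1);
    }
    char *p = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        exit(1);
    }
    signal(SIGBUS, on_sigbus);
    if (sigsetjmp(env, 1) == 0) {
        memset(p, 0, MAP_SIZE); /* faults once past the tmpfs capacity */
        printf("write completed\n");
    } else {
        printf("SIGBUS: tmpfs ran out of space\n");
    }
    return 0;
}
```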
Nevertheless, new files can still be created in /dev/shm:
# ls -lh /dev/shm/
total 64M
-rw-r--r-- 1 root root 65M Mar 3 15:23 shm1
-rw-r--r-- 1 root root 65M Mar 3 15:24 shm2
This is expected: ls shows inode->size, which ftruncate set, while the backing blocks are allocated lazily (note Blocks: 0 below).
# stat /dev/shm/shm2
  File: "/dev/shm/shm2"
  Size: 68157440  Blocks: 0  IO Block: 4096  regular file
Device: 10h/16d  Inode: 217177  Links: 1
Access: (0644/-rw-r--r--)  Uid: (0/root)  Gid: (0/root)
Access: 2015-03-03 15:24:28.025985167 +0800
Modify: 2015-03-03 15:24:28.025985167 +0800
Change: 2015-03-03 15:24:28.025985167 +0800
(5) Write data to the SYS V shared memory
Raise the System V shared memory maximum to 65M (/dev/shm remains 64M).
# cat /proc/sys/kernel/shmmax
68157440
```c
/* gcc -o shmv shmv.c */
#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/types.h>
#include <unistd.h>

#define MAP_SIZE 68157440

int main(int argc, char **argv)
{
    int shm_id;
    key_t key;
    char *p_map;
    char *name = "/dev/shm/shm3"; /* the file must exist for ftok() */

    key = ftok(name, 0);
    if (key == -1)
        perror("ftok error");
    shm_id = shmget(key, MAP_SIZE, IPC_CREAT | 0644);
    if (shm_id == -1) {
        perror("shmget error");
        return -1;
    }
    p_map = (char *)shmat(shm_id, NULL, 0);
    memset(p_map, 0, MAP_SIZE);
    if (shmdt(p_map) == -1)
        perror("detach error");
    return 0;
}
```
# ./shmv
It runs without any error.
(6) Conclusion
Although System V and POSIX shared memory are both implemented via tmpfs, the limits they are subject to differ: /proc/sys/kernel/shmmax affects only SYS V shared memory, while /dev/shm affects only POSIX shared memory. In fact, System V and POSIX shared memory use two different tmpfs instances.
Kernel analysis
During initialization the kernel automatically mounts a tmpfs filesystem, kept as shm_mnt:
```c
// mm/shmem.c
static struct file_system_type shmem_fs_type = {
    .owner   = THIS_MODULE,
    .name    = "tmpfs",
    .get_sb  = shmem_get_sb,
    .kill_sb = kill_litter_super,
};

int __init shmem_init(void)
{
    ...
    error = register_filesystem(&shmem_fs_type);
    if (error) {
        printk(KERN_ERR "Could not register tmpfs\n");
        goto out2;
    }
    /* Mount the internal tmpfs instance (used for SYS V) */
    shm_mnt = vfs_kern_mount(&shmem_fs_type, MS_NOUSER,
                             shmem_fs_type.name, NULL);
```
Mounting /dev/shm follows the same flow as any ordinary filesystem mount and is not discussed here. It is worth noting, however, that /dev/shm defaults to half the size of physical memory:
shmem_get_sb --> shmem_fill_super
```c
// mm/shmem.c
int shmem_fill_super(struct super_block *sb, void *data, int silent)
{
    ...
#ifdef CONFIG_TMPFS
    /*
     * Per default we only allow half of the physical ram per
     * tmpfs instance, limiting inodes to one per page of lowmem;
     * but the internal instance is left unlimited.
     */
    if (!(sb->s_flags & MS_NOUSER)) { /* the kernel sets MS_NOUSER on its internal mount */
        sbinfo->max_blocks = shmem_default_max_blocks();
        sbinfo->max_inodes = shmem_default_max_inodes();
        if (shmem_parse_options(data, sbinfo, false)) {
            err = -EINVAL;
            goto failed;
        }
    }
    sb->s_export_op = &shmem_export_ops;
#else
    ...

#ifdef CONFIG_TMPFS
static unsigned long shmem_default_max_blocks(void)
{
    return totalram_pages / 2;
}
```
As you can see, because the kernel specifies MS_NOUSER when mounting its internal tmpfs, that instance has no size limit, so the memory available to SYS V shared memory is constrained only by /proc/sys/kernel/shmmax; the user-mounted /dev/shm, on the other hand, defaults to half of physical memory. Note also CONFIG_TMPFS: the user-visible tmpfs depends on it, while the internal mount does not.
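To see the effective capacity of a mounted tmpfs instance from a program, something like the following statvfs(3) sketch works (an illustrative helper, not part of the original tests):

```c
#include <stdio.h>
#include <sys/statvfs.h>

int main(void)
{
    struct statvfs sv;

    if (statvfs("/dev/shm", &sv) != 0) {
        perror("statvfs");
        return 1;
    }
    /* f_blocks * f_frsize = total bytes this tmpfs instance may use */
    printf("/dev/shm size: %llu MB\n",
           (unsigned long long)sv.f_blocks * sv.f_frsize / (1024 * 1024));
    return 0;
}
```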
Additionally, creating a file under /dev/shm goes through the VFS interface, whereas SYS V shared memory and anonymous mappings are set up via shmem_file_setup:
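For example, in kernels of that era the SYS V path looks roughly like this (abridged and paraphrased from ipc/shm.c; exact code varies by kernel version):

```c
// ipc/shm.c (abridged)
static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
{
    ...
    /* A SYS V segment is a file on the kernel-internal tmpfs mount */
    file = shmem_file_setup(name, size, acctflag);
    ...
}
```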
SIGBUS
When an application touches an address in the shared memory mapping whose physical page has not yet been allocated, the fault method is invoked; if the allocation fails, it returns an OOM or SIGBUS error:
```c
static const struct vm_operations_struct shmem_vm_ops = {
    .fault = shmem_fault,
#ifdef CONFIG_NUMA
    .set_policy = shmem_set_policy,
    .get_policy = shmem_get_policy,
#endif
};

static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
    struct inode *inode = vma->vm_file->f_path.dentry->d_inode;
    int error;
    int ret = VM_FAULT_LOCKED;

    error = shmem_getpage(inode, vmf->pgoff, &vmf->page, SGP_CACHE, &ret);
    if (error)
        return ((error == -ENOMEM) ? VM_FAULT_OOM : VM_FAULT_SIGBUS);
    return ret;
}
```

shmem_getpage --> shmem_getpage_gfp:

```c
/*
 * shmem_getpage_gfp - find page in cache, or get from swap, or allocate
 *
 * If we allocate a new one we do not mark it dirty. That's up to the
 * vm. If we swap it in we mark it dirty since we also free the swap
 * entry since a page cannot live in both the swap and page cache
 */
static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
        struct page **pagep, enum sgp_type sgp, gfp_t gfp, int *fault_type)
{
    ...
    if (sbinfo->max_blocks) { /* set for /dev/shm, unlimited for the internal mount */
        if (percpu_counter_compare(&sbinfo->used_blocks,
                                   sbinfo->max_blocks) >= 0) {
            error = -ENOSPC;
            goto unacct;
        }
        percpu_counter_inc(&sbinfo->used_blocks);
    }

    /* Allocate a physical page */
    page = shmem_alloc_page(gfp, info, index);
    if (!page) {
        error = -ENOMEM;
        goto decused;
    }
    SetPageSwapBacked(page);
    __set_page_locked(page);
    /* mem_cgroup check */
    error = mem_cgroup_cache_charge(page, current->mm,
                                    gfp & GFP_RECLAIM_MASK);
    if (!error)
        error = shmem_add_to_page_cache(page, mapping, index, gfp, NULL);
```
Shared memory and cgroups
Currently, shared memory is charged to the first cgroup that touches it; see:
- http://lwn.net/Articles/516541/
- https://www.kernel.org/doc/Documentation/cgroups/memory.txt
POSIX shared memory and Docker
Docker currently limits /dev/shm to 64M without providing any parameter to change it, which is rather unfortunate: an application that uses a large amount of POSIX shared memory is bound to run into problems. See:
- https://github.com/docker/docker/issues/2606
- https://github.com/docker/docker/pull/4981
Summary
(1) POSIX and SYS V shared memory are both implemented in the kernel via tmpfs, but they correspond to two separate tmpfs instances that are independent of each other.
(2) /proc/sys/kernel/shmmax limits the maximum size of a single SYS V shared memory segment, while /dev/shm limits the total size of all POSIX shared memory.