Java服務.問題排查.問題復現

vbcjnmkk發表於2024-09-01

最近有使用者反饋測試環境Java服務總在凌晨00:00左右掛掉,使用者反饋Java服務沒有定時任務,也沒有流量突增的情況,Jvm配置也合理,莫名其妙就掛了

問題排查
問題復現
為了復現該問題,寫了個springboot的demo部署在測試環境,其中demo裡只做了hello world功能,應用型別為web_tomcat (war包部署),基礎映象是base_tomcat/java-centos6-jdk18-60-tom8050-ngx197,映象使用的Java版本是1.8.0_60,有了上次MySQL被kill的經驗,盲猜是linux limit惹的禍,因此將打好的映象分別部署了兩批不同的機器,果不其然,新機器當晚掛掉了,老機器服務正常

看一下掛掉的limit設定

排查過程
Java程序會受到limits影響?
按理說Java程序是不會受到系統limit open files(系統最大控制代碼數)影響的,但是為了驗證這個問題,我們將他修改為正常機器的值,由於demo是web_tomcat應用,沒法修改啟動指令碼,因此我們透過prlimit修改java程序的limit

prlimit -p 32672 --nofile=1048576

結果當晚00:00左右還是掛了,看來open files和java程序掛掉沒關係,看dmesg也沒發現什麼問題

Java版本過低導致記憶體分配不合理?
透過尋求jdos研發組的幫助,jdos研發組的同學認為是java版本的問題,低版本可能沒有限制住申請的記憶體大小,具體原因如下

https://blog.softwaremill.com/docker-support-in-new-java-8-finally-fd595df0ca54?gi=a0cc6736ed14

異常機器java記憶體情況

正常機器java記憶體情況

按照這個文件描述,使用docker cgroups限制記憶體可能會導致JVM程序被終止,原因是Java讀取的還是宿主機的CPU,而不是docker cgroups限制的CPU,高版本的Java解決了這個問題,文件解決方案截圖如下:

對此我們表示懷疑,因為我們的程式裡設定了JVM引數

保持著試一試的心態,我們增加了一個實驗組,實驗組使用的Java版本是11.0.8

結果當晚實驗組的Java程序還是死了,看來和Java版本也沒關係

容器上存在定時任務導致的?
由於基礎映象是jdos官方提供的映象,所以之前從來沒有懷疑過是定時任務的問題,但是現在別無他法了,檢查下容器的定時任務

雖然有定時任務,但是這個執行的時間點和Java掛掉的時間對不上,為此我們決定刪除定時任務試試

結果當晚Java程序還是掛了,並且這次有dmesg的日誌,發現Java被kill的同時crond也被kill了,被kill的原因是crond記憶體過高導致oom

難道還有系統級cron任務?於是查了一下/etc/crontab,發現果然還有cron任務(這是誰打的映象!!!)

這個時間點和Java程序掛掉的時間點吻合,但是問題來了,執行的任務並沒有logrotate.sh這個指令碼,應該不會出現問題才對

到底是不是定時任務的問題,我們修改下cron的時間驗證下,調整時間為中午11:00,驗證下Java程序是否會掛,同時使用strace列印程序trace log

果然Java程序在中午11.00掛了,看來真的是cron任務導致的,讓我們一起看一下strace
//程式碼效果參考:http://www.ningluan.com/sitemap/post.html
19:59:01 close(3) = 0
19:59:01 stat("/etc/pam.d", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
19:59:01 open("/etc/pam.d/crond", O_RDONLY) = 3
19:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=293, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd770804000
19:59:01 read(3, "#\n# The PAM configuration file f"..., 4096) = 293
19:59:01 open("/lib64/security/pam_access.so", O_RDONLY) = 5
19:59:01 read(5, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0000\17\0\0\0\0\0\0"..., 832) = 832
19:59:01 fstat(5, {st_mode=S_IFREG|0755, st_size=18552, ...}) = 0
19:59:01 mmap(NULL, 2113800, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 5, 0) = 0x7fd769322000
19:59:01 mprotect(0x7fd769325000, 2097152, PROT_NONE) = 0
19:59:01 mmap(0x7fd769525000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 5, 0x3000) = 0x7fd769525000
19:59:01 close(5) = 0
19:59:01 open("/etc/ld.so.cache", O_RDONLY) = 5
19:59:01 fstat(5, {st_mode=S_IFREG|0644, st_size=16203, ...}) = 0
19:59:01 mmap(NULL, 16203, PROT_READ, MAP_PRIVATE, 5, 0) = 0x7fd7707f8000
19:59:01 close(5) = 0
19:59:01 open("/lib64/libnsl.so.1", O_RDONLY) = 5
19:59:01 read(5, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0p@\0\0\0\0\0\0"..., 832) = 832
19:59:01 fstat(5, {st_mode=S_IFREG|0755, st_size=113432, ...}) = 0
19:59:01 mmap(NULL, 2198192, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 5, 0) = 0x7fd769109000
19:59:01 mprotect(0x7fd76911f000, 2093056, PROT_NONE) = 0
19:59:01 mmap(0x7fd76931e000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 5, 0x15000) = 0x7fd76931e000
19:59:01 mmap(0x7fd769320000, 6832, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fd769320000
19:59:01 close(5) = 0
19:59:01 mprotect(0x7fd76931e000, 4096, PROT_READ) = 0
19:59:01 mprotect(0x7fd769525000, 4096, PROT_READ) = 0
19:59:01 munmap(0x7fd7707f8000, 16203) = 0
19:59:01 open("/etc/pam.d/password-auth", O_RDONLY) = 5
19:59:01 fstat(5, {st_mode=S_IFREG|0644, st_size=692, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd770803000
19:59:01 read(5, "#%PAM-1.0\n# This file is auto-ge"..., 4096) = 692
19:59:01 open("/lib64/security/pam_unix.so", O_RDONLY) = 6
19:59:01 read(6, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\240&\0\0\0\0\0\0"..., 832) = 832
19:59:01 fstat(6, {st_mode=S_IFREG|0755, st_size=51960, ...}) = 0
19:59:01 mmap(NULL, 2196352, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 6, 0) = 0x7fd768ef0000
19:59:01 mprotect(0x7fd768efc000, 2093056, PROT_NONE) = 0
19:59:01 mmap(0x7fd7690fb000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 6, 0xb000) = 0x7fd7690fb000
19:59:01 mmap(0x7fd7690fd000, 45952, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fd7690fd000
19:59:01 close(6) = 0
19:59:01 mprotect(0x7fd7690fb000, 4096, PROT_READ) = 0
19:59:01 read(5, "", 4096) = 0
19:59:01 close(5) = 0
19:59:01 munmap(0x7fd770803000, 4096) = 0
19:59:01 open("/lib64/security/pam_loginuid.so", O_RDONLY) = 5
19:59:01 read(5, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\220\t\0\0\0\0\0\0"..., 832) = 832
19:59:01 fstat(5, {st_mode=S_IFREG|0755, st_size=10240, ...}) = 0
19:59:01 mmap(NULL, 2105480, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 5, 0) = 0x7fd768ced000
19:59:01 mprotect(0x7fd768cef000, 2093056, PROT_NONE) = 0
19:59:01 mmap(0x7fd768eee000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 5, 0x1000) = 0x7fd768eee000
19:59:01 close(5) = 0
19:59:01 mprotect(0x7fd768eee000, 4096, PROT_READ) = 0
19:59:01 open("/etc/pam.d/password-auth", O_RDONLY) = 5
19:59:01 fstat(5, {st_mode=S_IFREG|0644, st_size=692, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd770803000
19:59:01 read(5, "#%PAM-1.0\n# This file is auto-ge"..., 4096) = 692
19:59:01 open("/lib64/security/pam_keyinit.so", O_RDONLY) = 6
19:59:01 read(6, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0`\10\0\0\0\0\0\0"..., 832) = 832
19:59:01 fstat(6, {st_mode=S_IFREG|0755, st_size=10224, ...}) = 0
19:59:01 mmap(NULL, 2105488, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 6, 0) = 0x7fd768aea000
19:59:01 mprotect(0x7fd768aec000, 2093056, PROT_NONE) = 0
19:59:01 mmap(0x7fd768ceb000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 6, 0x1000) = 0x7fd768ceb000
19:59:01 close(6) = 0
19:59:01 mprotect(0x7fd768ceb000, 4096, PROT_READ) = 0
19:59:01 open("/lib64/security/pam_limits.so", O_RDONLY) = 6
19:59:01 read(6, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\320\20\0\0\0\0\0\0"..., 832) = 832
19:59:01 fstat(6, {st_mode=S_IFREG|0755, st_size=18600, ...}) = 0
19:59:01 mmap(NULL, 2113848, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 6, 0) = 0x7fd7688e5000
19:59:01 mprotect(0x7fd7688e9000, 2093056, PROT_NONE) = 0
19:59:01 mmap(0x7fd768ae8000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 6, 0x3000) = 0x7fd768ae8000
19:59:01 close(6) = 0
19:59:01 mprotect(0x7fd768ae8000, 4096, PROT_READ) = 0
19:59:01 open("/lib64/security/pam_succeed_if.so", O_RDONLY) = 6
19:59:01 read(6, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\340\v\0\0\0\0\0\0"..., 832) = 832
19:59:01 fstat(6, {st_mode=S_IFREG|0755, st_size=14384, ...}) = 0
19:59:01 mmap(NULL, 2109624, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 6, 0) = 0x7fd7686e1000
19:59:01 mprotect(0x7fd7686e4000, 2093056, PROT_NONE) = 0
19:59:01 mmap(0x7fd7688e3000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 6, 0x2000) = 0x7fd7688e3000
19:59:01 close(6) = 0
19:59:01 mprotect(0x7fd7688e3000, 4096, PROT_READ) = 0
19:59:01 read(5, "", 4096) = 0
19:59:01 close(5) = 0
19:59:01 munmap(0x7fd770803000, 4096) = 0
19:59:01 open("/etc/pam.d/password-auth", O_RDONLY) = 5
19:59:01 fstat(5, {st_mode=S_IFREG|0644, st_size=692, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd770803000
19:59:01 read(5, "#%PAM-1.0\n# This file is auto-ge"..., 4096) = 692
19:59:01 open("/lib64/security/pam_env.so", O_RDONLY) = 6
19:59:01 read(6, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\300\r\0\0\0\0\0\0"..., 832) = 832
19:59:01 fstat(6, {st_mode=S_IFREG|0755, st_size=18592, ...}) = 0
19:59:01 mmap(NULL, 2113776, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 6, 0) = 0x7fd7684dc000
19:59:01 mprotect(0x7fd7684e0000, 2093056, PROT_NONE) = 0
19:59:01 mmap(0x7fd7686df000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 6, 0x3000) = 0x7fd7686df000
19:59:01 close(6) = 0
19:59:01 mprotect(0x7fd7686df000, 4096, PROT_READ) = 0
19:59:01 open("/lib64/security/pam_deny.so", O_RDONLY) = 6
19:59:01 read(6, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0000\5\0\0\0\0\0\0"..., 832) = 832
19:59:01 fstat(6, {st_mode=S_IFREG|0755, st_size=5952, ...}) = 0
19:59:01 mmap(NULL, 2101272, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 6, 0) = 0x7fd7682da000
19:59:01 mprotect(0x7fd7682db000, 2093056, PROT_NONE) = 0
19:59:01 mmap(0x7fd7684da000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 6, 0) = 0x7fd7684da000
19:59:01 close(6) = 0
19:59:01 mprotect(0x7fd7684da000, 4096, PROT_READ) = 0
19:59:01 read(5, "", 4096) = 0
19:59:01 close(5) = 0
19:59:01 munmap(0x7fd770803000, 4096) = 0
19:59:01 read(3, "", 4096) = 0
19:59:01 close(3) = 0
19:59:01 munmap(0x7fd770804000, 4096) = 0
19:59:01 open("/etc/pam.d/other", O_RDONLY) = 3
19:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=154, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd770804000
19:59:01 read(3, "#%PAM-1.0\nauth required "..., 4096) = 154
19:59:01 read(3, "", 4096) = 0
19:59:01 close(3) = 0
19:59:01 munmap(0x7fd770804000, 4096) = 0
19:59:01 open("/etc/passwd", O_RDONLY|O_CLOEXEC) = 3
19:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=1057, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd770804000
19:59:01 read(3, "root❌0:0:root:/root:/bin/bash\n"..., 4096) = 1057
19:59:01 close(3) = 0
19:59:01 munmap(0x7fd770804000, 4096) = 0
19:59:01 uname({sys="Linux", node="host-11-159-73-176", ...}) = 0
19:59:01 open("/etc/security/access.conf", O_RDONLY) = 3
19:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=4620, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd770804000
19:59:01 read(3, "# Login access control table.\n#\n"..., 4096) = 4096
19:59:01 read(3, " should get access from ipv4 net"..., 4096) = 524
19:59:01 read(3, "", 4096) = 0
19:59:01 close(3) = 0
19:59:01 munmap(0x7fd770804000, 4096) = 0
19:59:01 getuid() = 0
19:59:01 open("/etc/passwd", O_RDONLY|O_CLOEXEC) = 3
19:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=1057, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd770804000
19:59:01 read(3, "root❌0:0:root:/root:/bin/bash\n"..., 4096) = 1057
19:59:01 close(3) = 0
19:59:01 munmap(0x7fd770804000, 4096) = 0
19:59:01 geteuid() = 0
19:59:01 open("/etc/shadow", O_RDONLY|O_CLOEXEC) = 3
19:59:01 fstat(3, {st_mode=S_IFREG, st_size=901, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd770804000
19:59:01 read(3, "root:$6$4.53VPrJ$1wxMpbsWYp4VKea"..., 4096) = 901
19:59:01 close(3) = 0
19:59:01 munmap(0x7fd770804000, 4096) = 0
19:59:01 socket(PF_NETLINK, SOCK_RAW, 9) = 3
19:59:01 fcntl(3, F_SETFD, FD_CLOEXEC) = 0
19:59:01 readlink("/proc/self/exe", "/usr/sbin/crond", 4096) = 15
19:59:01 sendto(3, "p\0\0\0M\4\5\0\1\0\0\0\0\0\0\0op=PAM:accountin"..., 112, 0, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 112
19:59:01 poll([{fd=3, events=POLLIN}], 1, 500) = 1 ([{fd=3, revents=POLLIN}])
19:59:01 recvfrom(3, "$\0\0\0\2\0\0\1\1\0\0\0\227\7\0\0\0\0\0\0p\0\0\0M\4\5\0\1\0\0\0"..., 8988, MSG_PEEK|MSG_DONTWAIT, {sa_family=AF_NETLINK, pid=0, groups=00000000}, [12]) = 36
19:59:01 recvfrom(3, "$\0\0\0\2\0\0\1\1\0\0\0\227\7\0\0\0\0\0\0p\0\0\0M\4\5\0\1\0\0\0"..., 8988, MSG_DONTWAIT, {sa_family=AF_NETLINK, pid=0, groups=00000000}, [12]) = 36
19:59:01 close(3) = 0
19:59:01 open("/etc/security/pam_env.conf", O_RDONLY) = 3
19:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=2980, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd770804000
19:59:01 read(3, "#\n# This is the configuration fi"..., 4096) = 2980
19:59:01 read(3, "", 4096) = 0
19:59:01 close(3) = 0
19:59:01 munmap(0x7fd770804000, 4096) = 0
19:59:01 open("/etc/environment", O_RDONLY) = 3
19:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd770804000
19:59:01 read(3, "", 4096) = 0
19:59:01 close(3) = 0
19:59:01 munmap(0x7fd770804000, 4096) = 0
19:59:01 socket(PF_NETLINK, SOCK_RAW, 9) = 3
19:59:01 fcntl(3, F_SETFD, FD_CLOEXEC) = 0
19:59:01 sendto(3, "p\0\0\0O\4\5\0\2\0\0\0\0\0\0\0op=PAM:setcred a"..., 112, 0, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 112
19:59:01 poll([{fd=3, events=POLLIN}], 1, 500) = 1 ([{fd=3, revents=POLLIN}])
19:59:01 recvfrom(3, "$\0\0\0\2\0\0\1\2\0\0\0\227\7\0\0\0\0\0\0p\0\0\0O\4\5\0\2\0\0\0"..., 8988, MSG_PEEK|MSG_DONTWAIT, {sa_family=AF_NETLINK, pid=0, groups=00000000}, [12]) = 36
19:59:01 recvfrom(3, "$\0\0\0\2\0\0\1\2\0\0\0\227\7\0\0\0\0\0\0p\0\0\0O\4\5\0\2\0\0\0"..., 8988, MSG_DONTWAIT, {sa_family=AF_NETLINK, pid=0, groups=00000000}, [12]) = 36
19:59:01 close(3) = 0
19:59:01 open("/etc/passwd", O_RDONLY|O_CLOEXEC) = 3
19:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=1057, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd770804000
19:59:01 read(3, "root❌0:0:root:/root:/bin/bash\n"..., 4096) = 1057
19:59:01 close(3) = 0
19:59:01 munmap(0x7fd770804000, 4096) = 0
19:59:01 open("/proc/self/loginuid", O_WRONLY|O_TRUNC|O_NOFOLLOW) = 3
19:59:01 write(3, "0", 1) = 1
19:59:01 close(3) = 0
19:59:01 open("/etc/passwd", O_RDONLY|O_CLOEXEC) = 3
19:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=1057, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd770804000
19:59:01 read(3, "root❌0:0:root:/root:/bin/bash\n"..., 4096) = 1057
19:59:01 close(3) = 0
19:59:01 munmap(0x7fd770804000, 4096) = 0
19:59:01 getuid() = 0
19:59:01 getgid() = 0
19:59:01 keyctl(0, 0xfffffffd, 0, 0, 0) = 496466385
19:59:01 keyctl(0, 0xfffffffb, 0, 0, 0x30) = 785702132
19:59:01 open("/etc/passwd", O_RDONLY|O_CLOEXEC) = 3
19:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=1057, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd770804000
19:59:01 read(3, "root❌0:0:root:/root:/bin/bash\n"..., 4096) = 1057
19:59:01 close(3) = 0
19:59:01 munmap(0x7fd770804000, 4096) = 0
19:59:01 getrlimit(RLIMIT_CPU, {rlim_cur=RLIM_INFINITY, rlim_max=RLIM_INFINITY}) = 0
19:59:01 getrlimit(RLIMIT_FSIZE, {rlim_cur=RLIM_INFINITY, rlim_max=RLIM_INFINITY}) = 0
19:59:01 getrlimit(RLIMIT_DATA, {rlim_cur=RLIM_INFINITY, rlim_max=RLIM_INFINITY}) = 0
19:59:01 getrlimit(RLIMIT_STACK, {rlim_cur=81921024, rlim_max=RLIM_INFINITY}) = 0
19:59:01 getrlimit(RLIMIT_CORE, {rlim_cur=RLIM_INFINITY, rlim_max=RLIM_INFINITY}) = 0
19:59:01 getrlimit(RLIMIT_RSS, {rlim_cur=RLIM_INFINITY, rlim_max=RLIM_INFINITY}) = 0
19:59:01 getrlimit(RLIMIT_NPROC, {rlim_cur=RLIM_INFINITY, rlim_max=RLIM_INFINITY}) = 0
19:59:01 getrlimit(RLIMIT_NOFILE, {rlim_cur=1073741816, rlim_max=1073741816}) = 0
19:59:01 getrlimit(RLIMIT_MEMLOCK, {rlim_cur=64
1024, rlim_max=641024}) = 0
19:59:01 getrlimit(RLIMIT_AS, {rlim_cur=RLIM_INFINITY, rlim_max=RLIM_INFINITY}) = 0
19:59:01 getrlimit(RLIMIT_LOCKS, {rlim_cur=RLIM_INFINITY, rlim_max=RLIM_INFINITY}) = 0
19:59:01 getrlimit(RLIMIT_SIGPENDING, {rlim_cur=883632, rlim_max=883632}) = 0
19:59:01 getrlimit(RLIMIT_MSGQUEUE, {rlim_cur=800
1024, rlim_max=8001024}) = 0
19:59:01 getrlimit(RLIMIT_NICE, {rlim_cur=0, rlim_max=0}) = 0
19:59:01 getrlimit(RLIMIT_RTPRIO, {rlim_cur=0, rlim_max=0}) = 0
19:59:01 getpriority(PRIO_PROCESS, 0) = 20
19:59:01 open("/etc/security/limits.conf", O_RDONLY) = 3
19:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=1835, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd770804000
19:59:01 read(3, "# /etc/security/limits.conf\n#\n#E"..., 4096) = 1835
19:59:01 read(3, "", 4096) = 0
19:59:01 close(3) = 0
19:59:01 munmap(0x7fd770804000, 4096) = 0
19:59:01 open("/etc/security/limits.d", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
19:59:01 getdents(3, /
3 entries /, 32768) = 88
19:59:01 open("/usr/lib64/gconv/gconv-modules.cache", O_RDONLY) = 5
19:59:01 fstat(5, {st_mode=S_IFREG|0644, st_size=26060, ...}) = 0
19:59:01 mmap(NULL, 26060, PROT_READ, MAP_SHARED, 5, 0) = 0x7fd7707f5000
19:59:01 close(5) = 0
19:59:01 getdents(3, /
0 entries */, 32768) = 0
19:59:01 close(3) = 0
19:59:01 open("/etc/security/limits.d/90-nproc.conf", O_RDONLY) = 3
19:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=193, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd770804000
19:59:01 read(3, "# Default limit for number of us"..., 4096) = 193
19:59:01 read(3, "", 4096) = 0
19:59:01 close(3) = 0
19:59:01 munmap(0x7fd770804000, 4096) = 0
19:59:01 setrlimit(RLIMIT_NPROC, {rlim_cur=RLIM_INFINITY, rlim_max=RLIM_INFINITY}) = 0
19:59:01 setpriority(PRIO_PROCESS, 0, 0) = 0
19:59:01 getuid() = 0
19:59:01 open("/etc/passwd", O_RDONLY|O_CLOEXEC) = 3
19:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=1057, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd770804000
19:59:01 read(3, "root❌0:0:root:/root:/bin/bash\n"..., 4096) = 1057
19:59:01 close(3) = 0
19:59:01 munmap(0x7fd770804000, 4096) = 0
19:59:01 socket(PF_NETLINK, SOCK_RAW, 9) = 3
19:59:01 fcntl(3, F_SETFD, FD_CLOEXEC) = 0
19:59:01 sendto(3, "t\0\0\0Q\4\5\0\3\0\0\0\0\0\0\0op=PAM:session_o"..., 116, 0, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 116
19:59:01 poll([{fd=3, events=POLLIN}], 1, 500) = 1 ([{fd=3, revents=POLLIN}])
19:59:01 recvfrom(3, "$\0\0\0\2\0\0\1\3\0\0\0\227\7\0\0\0\0\0\0t\0\0\0Q\4\5\0\3\0\0\0"..., 8988, MSG_PEEK|MSG_DONTWAIT, {sa_family=AF_NETLINK, pid=0, groups=00000000}, [12]) = 36
19:59:01 recvfrom(3, "$\0\0\0\2\0\0\1\3\0\0\0\227\7\0\0\0\0\0\0t\0\0\0Q\4\5\0\3\0\0\0"..., 8988, MSG_DONTWAIT, {sa_family=AF_NETLINK, pid=0, groups=00000000}, [12]) = 36
19:59:01 close(3) = 0
19:59:01 setgid(0) = 0
19:59:01 open("/proc/sys/kernel/ngroups_max", O_RDONLY) = 3
19:59:01 read(3, "65536\n", 31) = 6
19:59:01 close(3) = 0
19:59:01 socket(PF_FILE, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
19:59:01 connect(3, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
19:59:01 close(3) = 0
19:59:01 socket(PF_FILE, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
19:59:01 connect(3, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
19:59:01 close(3) = 0
19:59:01 open("/etc/group", O_RDONLY|O_CLOEXEC) = 3
19:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=497, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd770804000
19:59:01 lseek(3, 0, SEEK_CUR) = 0
19:59:01 read(3, "root❌0:\nbin❌1:bin,daemon\ndae"..., 4096) = 497
19:59:01 read(3, "", 4096) = 0
19:59:01 close(3) = 0
19:59:01 munmap(0x7fd770804000, 4096) = 0
19:59:01 setgroups(1, [0]) = 0
19:59:01 setreuid(0, 4294967295) = 0
19:59:01 rt_sigaction(SIGCHLD, {SIG_DFL, [CHLD], SA_RESTORER|SA_RESTART, 0x7fd76fa316a0}, {0x558826e03b80, [], SA_RESTORER|SA_RESTART, 0x7fd76fa316a0}, 8) = 0
19:59:01 pipe([3, 5]) = 0
19:59:01 pipe([6, 7]) = 0
19:59:01 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fd7707fca70) = 1946
19:59:01 gettid() = 1943
19:59:01 open("/proc/self/task/1943/attr/exec", O_RDWR) = 8
19:59:01 write(8, NULL, 0) = -1 EINVAL (Invalid argument)
19:59:01 close(8) = 0
19:59:01 close(3) = 0
19:59:01 close(7) = 0
19:59:01 close(5) = 0
19:59:01 fcntl(6, F_GETFL) = 0 (flags O_RDONLY)
19:59:01 fstat(6, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd770804000
19:59:01 lseek(6, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek)
19:59:01 read(6, "/bin/bash: ./logrotate.sh: \346\262\241\346\234"..., 4096) = 55
19:59:01 uname({sys="Linux", node="host-11-159-73-176", ...}) = 0
19:59:01 getrlimit(RLIMIT_NOFILE, {rlim_cur=1073741816, rlim_max=1073741816}) = 0
19:59:01 mmap(NULL, 4294967296, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd6682da000
19:59:01 --- SIGCHLD (Child exited) @ 0 (0) ---
19:59:06 +++ killed by SIGKILL +++

可以看到最後用 mmap 一次分配了 4G 記憶體,然後就被kill了。

mmap前呼叫了getrlimit,和上次MySQL的問題一樣,都是根據系統資源限制來分配記憶體

為了確定就是cron導致java掛掉的元兇,我們把cron程序手動kill掉,這樣就不會執行定時任務了,這次我們在驗證下Java程序是否會掛掉

果不其然,Java程序並沒有掛掉,看來真的是cron任務導致的

高版本CentOS是否也會出現類似問題?
按理說oom killer應該只kill掉佔用記憶體最高的才對,Java程序佔用記憶體又不是最高的,高版本的CentOS系統oom killer策略會不會有升級?

讓我們來一起驗證下高版本的CentOS系統是否有這個問題

當前映象的CentOS版本是CentOS release 6.6 (Final),為了驗證高版本的CentOS是否也有類似的問題,我們將增加兩個實驗組,分別升級基礎映象至CentOS release 6.10 (Final)和CentOS Linux release 7.9.2009 (Core),也新增相同的cron任務

結果發現CentOS release 6.10 (Final)和CentOS Linux release 7.9.2009 (Core)都沒有kill掉Java程序,只kill掉了cron的子程序

結論

由於容器limit open files(系統最大控制代碼數)設定不合理導致cron執行任務時使容器記憶體飆升,存在記憶體溢位的風險,linux由於保護機制會kill掉佔用記憶體高的程序,導致cron子任務程序和Java程序一起被kill(但是問題來了,這個jdos基礎映象為什麼會執行一個完全不存在的shell指令碼,而且還是執行兩次???),高版本的CentOS系統不會kill java程序,猜測不同版本的CentOS的kill選擇策略略有不同

問題分析
Cron任務執行邏輯
在Linux中,crontab工具是由croine軟體包提供的,讓我們一起看下cron的執行過程

其中child_process()執行了cron子程序,cron執行子程序時會有傳送mail的動作

cron_popen在執行時會按照open files(系統最大控制代碼數)清除記憶體

綜上,cron oom的原因找到了,是由於open files設定過大且cron任務沒有標準輸出,導致執行了傳送mail邏輯,而清除的記憶體大小超出了容器本身記憶體的大小,導致oom。

croine 1.5.4 版本之後修復了該問題,如果想檢視當前容器croine版本可執行如下命令:

rpm -q cronie

Linux核心OOM killer機制
Linux 核心有個機制叫OOM killer(Out Of Memory killer),該機制會監控那些佔用記憶體過大,尤其是瞬間佔用記憶體很快的程序,然後防止記憶體耗盡而自動把該程序殺掉。核心檢測到系統記憶體不足、挑選並殺掉某個程序的過程可以參考核心原始碼linux/mm/oom_kill.c,當系統記憶體不足的時候,out_of_memory()被觸發,然後呼叫select_bad_process()選擇一個”bad”程序殺掉。

以下是一些主要的程序選擇策略:

記憶體使用情況:OOM Killer首先傾向於選擇佔用記憶體最多的程序,因為終止這些程序可以釋放最多的記憶體。

OOM分數:每個程序都有一個OOM分數,該分數是基於其記憶體使用情況和其他因素計算出來的。OOM Killer傾向於終止OOM分數最高的程序。

程序優先順序:在選擇要終止的程序時,OOM Killer通常會避免終止對系統至關重要的系統程序。這些程序通常具有較高的優先順序,因此它們更不容易成為終止目標。

程序資源需求:OOM Killer還會考慮程序的資源需求。它傾向於終止那些請求較少資源的程序,以最小化影響其他程序的執行。

程序屬性:某些程序可能被標記為不可終止,例如透過設定/proc/[PID]/oom_score_adj的值來調整OOM分數。這些程序通常不容易被OOM Killer終止。

注:不同版本的Linux oom killer機制可能會存在一些差異

解決方案
使用高版本穩定的CentOS系統,如果業務無法升級CentOS,則需要設定合理的limit open files數量,application_worker型別應用可以在啟動指令碼中手動修改limit,web_tomcat型別應用沒法修改啟動指令碼,可以選擇kill掉cron程序或刪除系統cron任務,也可以手動升級cronie的版本至1.5.7-5

寫在後面
open files這個坑很大,栽這個坑兩次了,大家一定要檢查自己服務對應容器的CentOS版本和limit設定是否合理,本次案例發生在測試環境,尚不會引起事故,如果在生產出現類似情況,後果不堪設想

由於測試環境新增的這批機器都存在這個問題,我們團隊已經聯絡機器提供方上報了該問題,後續這批機器會由提供方統一修改系統最大控制代碼數,如果當前問題影響到了業務的正常使用,可以臨時刪除容器中/etc/crontab中的任務

相關文章