Tools for Analyzing and Troubleshooting Linux System and Application Problems

Posted by 瓦力瓦力 on 2016-02-19

Linux

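System and application problems come up regularly on Linux servers, and analyzing them takes the right tools. The lists below summarize commonly used commands and trace tools; the final section covers the trace tools for distributed systems that the hadoop community has recently been developing.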

Overview:

The figure from linux-performance-analysis-and-tools shows the layer of the system at which each of these tools applies.

OS commands

System information (RHEL/Fedora)

uname -a or cat /proc/version #print system information
- Linux hadoopst2.cm6 2.6.18-164.el5 #1 SMP Tue Aug 18 15:51:48 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
uptime
- 15:42:46 up 674 days, 6 min, 35 users, load average: 1.30, 5.97, 11.53
cat /etc/redhat-release
- Red Hat Enterprise Linux Server release 5.4 (Tikanga)
lsb_release
- LSB Version: :core-3.1-amd64:core-3.1-ia32:core-3.1-noarch:graphics-3.1-amd64:graphics-3.1-ia32:graphics-3.1-noarch
cat /proc/cpuinfo
cat /proc/meminfo
lspci – list all PCI devices
lsusb – list USB devices
last, lastb – show listing of last logged in users
lsmod – show the status of modules in the Linux Kernel
modprobe – add and remove modules from the Linux Kernel
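The load averages in the uptime output above are the quickest health signal in this list, so here is a minimal sketch of reading them programmatically. The function names `parse_load_average` and `load_trend` are illustrative and not part of any tool; the sample line is the uptime output quoted above.

```python
import re


def parse_load_average(uptime_line):
    """Return the 1-, 5- and 15-minute load averages from one line of `uptime` output."""
    match = re.search(
        r"load averages?:\s*([\d.]+),?\s+([\d.]+),?\s+([\d.]+)", uptime_line
    )
    if match is None:
        raise ValueError("no load average found in: %r" % uptime_line)
    return tuple(float(value) for value in match.groups())


def load_trend(uptime_line):
    """Classify the load as rising, falling or mixed by comparing the three averages."""
    one, five, fifteen = parse_load_average(uptime_line)
    if one > five > fifteen:
        return "rising"
    if one < five < fifteen:
        return "falling"
    return "mixed"


# The uptime line quoted in the list above.
sample = "15:42:46 up 674 days, 6 min, 35 users, load average: 1.30, 5.97, 11.53"
print(parse_load_average(sample))
print(load_trend(sample))
```

For the sample line the 1-minute value is lower than the 5- and 15-minute values, which suggests the machine is recovering from an earlier spike rather than heading into one.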

Common commands/tools

ps
- To print a process tree: ps -ejH / ps axjf
- To get info about threads: ps -eLf / ps axms
ulimit -a
lsof – list open files (on UNIX, everything is a file)
- lsof -p PID
rpm/yum
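- rpm -qf FILE #the rpm package that owns a file
- rpm -ql RPM #the files contained in an rpm package
- /var/log/yum.log #log of yum package updates
/etc/XXX #system-wide configuration directories, e.g.
- /etc/yum.repos.d/ yum repository configuration
/var/log/XXX #log directories, e.g.
- /var/log/cron #crontab log, for checking how scheduled jobs ran
ntpd – Network Time Protocol (NTP) daemon; keeps the clocks of the machines in a cluster in sync
squid – proxy caching server; used as the proxy for the cluster WebUI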

System monitoring

mpstat – Report processors related statistics. Watch the %sys and %iowait values.
vmstat – Report virtual memory statistics
iostat – Report Central Processing Unit (CPU) statistics and input/output statistics for devices and partitions.
netstat – Print network connections, routing tables, interface statistics, masquerade connections, and multicast memberships
- netstat -atpn | grep PID
ganglia – a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids.
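sar/tsar – Collect, report, or save system activity information; tsar is Taobao's own improved version
- Samples on a schedule (every minute) and lets you look up history (5-minute intervals by default), complementing ganglia with more detailed information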
iftop – the “top” bandwidth consumers shown. iftop wiki
iotop
vmtouch, Portable file system cache diagnostics and control

Network

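telnet/nc IP PORT – check whether the target port is reachable; a successful ping does not guarantee that the port is reachable, since a firewall or similar may block it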
ifconfig/ifup/ifdown – configure a network interface
traceroute – print the route packets trace to network host
nslookup – query Internet name servers interactively
tcpdump – dump traffic on a network; similar open-source tools include wireshark and netsniff-ng (see the comparison of further tools)
lynx – a general purpose distributed information browser for the World Wide Web
tcpcp – allows cooperating applications to pass ownership of TCP connection endpoints from one Linux host to another one.
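The telnet/nc port check at the top of this list can be scripted when many hosts or ports need checking. This is a minimal sketch using only the Python standard library; `port_is_reachable` is an illustrative name, and the demo talks to a throwaway listener on 127.0.0.1 so that it runs without any external host.

```python
import socket


def port_is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


# Demo against a temporary local listener (an assumption made for the demo only).
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(1)
demo_port = listener.getsockname()[1]

open_result = port_is_reachable("127.0.0.1", demo_port)
listener.close()
closed_result = port_is_reachable("127.0.0.1", demo_port, timeout=1.0)

print("while listening:", open_result)
print("after close:", closed_result)
```

Like telnet and nc, this only proves that the TCP handshake completes. It does not show that the service behind the port is healthy.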

Processes/programs

Static information

ldconfig – configure dynamic linker run time bindings
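- ldconfig -p | grep SO checks whether a shared object is in the link cache
ldd – print shared library dependencies; shows the shared objects that an executable or shared object depends on
nm – list symbols from object files; grep the output to check whether a symbol exists and whether it is Undefined.
readelf – Displays information about ELF files, such as 32/64-bit, target OS and processor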

Dynamic information

gdb
cat /proc/$PID/[cmdline|environ|limits|status|…] – information about the process
pstack – print a stack trace of a running process
pmap – report memory map of a process
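The files under /proc/$PID are plain text, so they are easy to read from a script. This is a minimal sketch of turning /proc/$PID/status into a dictionary. The helper names are illustrative, and the sample text is made up for the demo (not captured from a real process) so that the example also runs on machines without /proc.

```python
def parse_proc_status(text):
    """Parse the "Key:<tab>value" lines of /proc/<pid>/status into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        key, separator, value = line.partition(":")
        if separator:  # skip lines without a colon
            fields[key.strip()] = value.strip()
    return fields


def read_proc_status(pid="self"):
    """Read and parse /proc/<pid>/status on a Linux machine ("self" is the caller)."""
    with open("/proc/%s/status" % pid) as handle:
        return parse_proc_status(handle.read())


# Hypothetical sample in the same layout as /proc/<pid>/status.
sample = (
    "Name:\tjava\n"
    "State:\tS (sleeping)\n"
    "Pid:\t4242\n"
    "VmRSS:\t  204800 kB\n"
    "Threads:\t57\n"
)
info = parse_proc_status(sample)
print(info["Name"], info["VmRSS"], info["Threads"])
```

On a Linux host, `read_proc_status(pid)` gives the same dictionary for a live process. VmRSS and Threads are usually the first fields to check when a process looks bloated or stuck.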

Java

JDK Tools and Utilities
Java Troubleshooting Tools
jinfo – print java process information, such as the classpath and java.library.path (the directory of JNI shared objects)
jstack – print a stack trace of a running java process; can be used to check for deadlocks
jmap – report memory map of a java process
- jmap -histo:live can trigger a full GC
- jmap -dump:live,file=$FILE dumps the heap, so that tools such as jhat can analyze how much heap each kind of object occupies
jhat – Heap Dump Browser – Starts a web server on a heap dump file (eg, produced by jmap -dump), allowing the heap to be browsed.
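- Starts an HTTP service; browse it from a web browser
- -J-mxXXXm raises the heap size, which is needed when analyzing large dump files
- If some objects hold an unusually large amount of data or memory, a memory leak is very likely
Memory Analyzer (MAT) – eclipse plugin, Java heap analyzer
- A visual tool, but it is limited by the memory of the machine it runs on and cannot analyze very large heap dump files
jdb – can run as a server so that tools such as eclipse can connect remotely for debugging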
jstat – Java Virtual Machine Statistics Monitoring Tool
jstatd – Virtual Machine jstat Daemon; can be used together with jvisualvm
jvisualvm – Java Virtual Machine Monitoring, Troubleshooting, and Profiling Tool; a visual tool that can connect remotely to jstatd/jmx (see the demo)
jvmtop – In a top-like manner, displays JVM internal metrics (e.g. memory information) of running java processes.
JVM performance optimization – optimization articles written by a JVM developer
HPROF – Heap Profiler: java -agentlib:hprof

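Trace/Debug/Profiling tools

General-purpose tools

Write logs; when the system is already live in production or the source code is unavailable, fall back on the tools below
strace – trace system calls and signals
- Example: practical uses of strace/ltrace
- Example: measuring the time spent in system calls, e.g. to investigate high CPU %sys on a machine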

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 67.90 3966.320849         496   7992161   3050250 futex
 25.80 1507.326693      127093     11860           epoll_wait
....................
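A summary table like the one above is easier to act on once it is parsed, for example to work out the error rate of the dominant system call. This is a minimal sketch, assuming the six-column layout shown above in which the errors column may be empty; `parse_strace_summary` is an illustrative helper, not part of strace.

```python
def parse_strace_summary(text):
    """Parse the rows of an `strace -c` style summary into a list of dicts."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 5 or parts[-1] == "total":
            continue  # header fragments, separators, elided rows, the total line
        try:
            percent = float(parts[0])
            seconds = float(parts[1])
            usecs_per_call = int(parts[2])
            calls = int(parts[3])
            # The errors column is blank when a syscall never failed.
            errors = int(parts[4]) if len(parts) == 6 else 0
        except ValueError:
            continue  # not a data row (e.g. the "% time ..." header)
        rows.append({
            "syscall": parts[-1],
            "percent": percent,
            "seconds": seconds,
            "usecs_per_call": usecs_per_call,
            "calls": calls,
            "errors": errors,
        })
    return sorted(rows, key=lambda row: row["seconds"], reverse=True)


# The rows quoted in the table above.
sample = """\
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 67.90 3966.320849         496   7992161   3050250 futex
 25.80 1507.326693      127093     11860           epoll_wait
"""
rows = parse_strace_summary(sample)
for row in rows:
    error_rate = row["errors"] / row["calls"]
    print("%-12s %12.2fs  error rate %.2f" % (row["syscall"], row["seconds"], error_rate))
```

In the quoted sample, futex accounts for most of the system-call time and a large share of its calls return an error. That pattern is often associated with lock contention, though the table alone does not prove it.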

blktrace, generate traces of the i/o traffic on block devices
ltrace – A library call tracer
xtrace
gprof – a performance analysis tool, sampling and call-graph profiling
valgrind – an instrumentation framework for building dynamic analysis tools. automatically detect many memory management and threading bugs, and profile your programs in detail
systemtap – a simple command line interface and scripting language for writing instrumentation for a live running kernel plus user-space applications for complex tasks that may require live analysis, programmable on-line response, and whole-system symbolic access.
- The Linux counterpart of DTrace (which Sun developed on Solaris)
- Powerful: covers the kernel, user-space apps, multiple languages (java, perl, python, ruby) and built-in markers (pg, mysql)
- can write and reuse simple scripts to deeply examine the activities of a live system
- Data can be extracted, filtered, and summarized quickly and safely, to enable diagnoses of complex performance or functional problems
- A rich “tapset” script library

Java trace tools

btrace – dynamic tracing tool for the Java platform (see the UserGuide)
- Uses dynamic bytecode modification (HotSwap) to trace and replace code in a running Java program; see: how it is implemented
- BTrace usage summary
- Detailed introduction
byteman – simplifies tracing and testing of Java programs. Can modify a running application without needing to stop and restart it.
- You define rules specifying the side effects you want to inject, whereas BTrace uses a Java-like syntax

Distributed Tracing Tools

Dapper, a Large-Scale Distributed Systems Tracing Infrastructure
x-trace, a network diagnostic tool designed to provide users and network operators with better visibility into increasingly complex Internet applications.
HTrace, a tracing framework intended for use with distributed systems written in java
- Add Tracing to HDFS
- Update HTrace for HBase

Some of the content draws on posts by other people on Weibo; if there is any problem, please get in touch.
