Linux System Performance Monitoring Metrics

Published by 大搜車-自娛 on 2018-03-28


1. Basic Linux operations metrics

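In operations work, a problem is not what you fear. What you fear is a problem whose scene you cannot capture, leaving you in the dark. That is why it matters to rely on a capable monitoring system that collects as many metrics as practical. But which metrics are actually meaningful? In keeping with the idea that knowledge comes from practice, the experience engineers have distilled from long hands-on work is the most valuable guide.

From that day-to-day practice, we have summarized the metrics that operations engineers consult most often when running systems. They fall into the following categories: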

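CPU
Load
Memory
Disk
IO
Network
Kernel parameters
ss statistics output
Port collection
Liveness of core service processes
Resource consumption of key business processes
NTP offset collection
DNS resolution collection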

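The detailed metrics in each category are listed below. All of them are supported directly by the agent component of open-falcon. falcon-agent collects these metrics at a fixed interval (currently 60 seconds) and reports them to the server.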

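2. CPU metrics

How it is computed: the values are derived from /proc/stat. The statistics printed by the sar command are a useful reference for understanding them.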

cpu.idle: Percentage of time that the CPU or CPUs were idle and the system did not have an outstanding disk I/O request.
cpu.busy: The complement of cpu.idle; its value equals 100 minus cpu.idle.
cpu.guest: Percentage of time spent by the CPU or CPUs to run a virtual processor.
cpu.iowait: Percentage of time that the CPU or CPUs were idle during which the system had an outstanding disk I/O request.
cpu.irq: Percentage of time spent by the CPU or CPUs to service hardware interrupts.
cpu.softirq: Percentage of time spent by the CPU or CPUs to service software interrupts.
cpu.nice: Percentage of CPU utilization that occurred while executing at the user level with nice priority.
cpu.steal: Percentage of time spent in involuntary wait by the virtual CPU or CPUs while the hypervisor was servicing another virtual processor.
cpu.system: Percentage of CPU utilization that occurred while executing at the system level (kernel).
cpu.user: Percentage of CPU utilization that occurred while executing at the user level (application).
cpu.cnt: Number of CPU cores.
cpu.switches: Number of CPU context switches; counter type.
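
The percentages above come from the difference between two samples of the aggregate "cpu" line in /proc/stat. The following is a minimal sketch of that calculation, not falcon-agent's actual code: the function names are hypothetical, and it assumes the guest column is already counted inside user time (as the kernel reports it), so guest is left out of the total.

```python
# Column order of the aggregate "cpu" line in /proc/stat.
FIELDS = ("user", "nice", "system", "idle", "iowait",
          "irq", "softirq", "steal", "guest")


def parse_cpu_line(line):
    """Parse one 'cpu ...' line into a dict of jiffy counters."""
    parts = line.split()
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    # Older kernels expose fewer columns; treat missing ones as 0.
    values += [0] * (len(FIELDS) - len(values))
    return dict(zip(FIELDS, values))


def cpu_percentages(prev_line, curr_line):
    """Turn two samples of the 'cpu' line into cpu.* percentages."""
    prev = parse_cpu_line(prev_line)
    curr = parse_cpu_line(curr_line)
    delta = {name: curr[name] - prev[name] for name in FIELDS}
    # Assumption: guest time is already included in user time.
    total = sum(delta[name] for name in FIELDS if name != "guest")
    if total <= 0:
        return {}
    result = {"cpu." + name: 100.0 * delta[name] / total for name in FIELDS}
    result["cpu.busy"] = 100.0 - result["cpu.idle"]
    return result
```

In practice the two lines would be read from /proc/stat one collection interval apart.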

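3. Disk metrics

How it is computed: the agent first reads /proc/mounts to get all mount points, then obtains block and inode usage for each one through syscall.Statfs_t. Every metric carries a set of tags such as mount=$mount,fstype=$fstype, where $mount is the mount point (for example /home) and $fstype is the filesystem type (for example ext4).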

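df.bytes.free: available disk space, int64
df.bytes.free.percent: available disk space as a percentage of the total, float64, for example 32.1
df.bytes.total: total disk size, int64
df.bytes.used: used disk space, int64
df.bytes.used.percent: used disk space as a percentage of the total, float64
df.inodes.total: total number of inodes, int64
df.inodes.free: number of available inodes, int64
df.inodes.free.percent: percentage of inodes available, float64
df.inodes.used: number of used inodes, int64
df.inodes.used.percent: percentage of inodes used, float64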
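
As an illustration of how the df.* values can be derived from statfs counters, here is a small sketch using Python's os.statvfs. The function names are hypothetical, and the choice of which counters feed "free" and "used" (f_bavail and f_blocks minus f_bfree) is an assumption for this example; falcon-agent may compute them differently.

```python
import os


def df_metrics(blocks, bfree, bavail, frsize, files, ffree):
    """Derive df.* values from raw statfs counters (all integers)."""
    total = blocks * frsize
    free = bavail * frsize            # assumption: space usable by non-root users
    used = (blocks - bfree) * frsize  # assumption: everything that is not free
    inodes_used = files - ffree

    def pct(part, whole):
        # Pseudo filesystems can report 0 blocks or 0 inodes.
        return 100.0 * part / whole if whole else 0.0

    return {
        "df.bytes.total": total,
        "df.bytes.free": free,
        "df.bytes.used": used,
        "df.bytes.free.percent": pct(free, total),
        "df.bytes.used.percent": pct(used, total),
        "df.inodes.total": files,
        "df.inodes.free": ffree,
        "df.inodes.used": inodes_used,
        "df.inodes.free.percent": pct(ffree, files),
        "df.inodes.used.percent": pct(inodes_used, files),
    }


def df_metrics_for(path):
    """Collect the same values for one mount point, e.g. '/home'."""
    st = os.statvfs(path)
    return df_metrics(st.f_blocks, st.f_bfree, st.f_bavail,
                      st.f_frsize, st.f_files, st.f_ffree)
```

Calling df_metrics_for on each mount point listed in /proc/mounts would give one set of values per mount tag.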

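4. megacli tool output

The megacli tool is used to read RAID information. Every metric carries a set of tags identifying the PD or VD it belongs to. The PD format is PD=Enclosure_ID:SLOT_ID; for example, PD=32:0 denotes the first disk, and VD=0 denotes the first logical disk.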

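sys.disk.lsiraid.pd.Media_Error_Count: this metric and the three below it are currently collected as data only; they do not necessarily mean the disk is damaged (only that the probability of damage is higher)
sys.disk.lsiraid.pd.Other_Error_Count
sys.disk.lsiraid.pd.Predictive_Failure_Count
sys.disk.lsiraid.pd.Drive_Temperature
sys.disk.lsiraid.pd.Firmware_state: a non-zero value means this physical disk has a problem
sys.disk.lsiraid.vd.cache_policy: a non-zero value means the cache policy of this logical disk does not match the configured setting
sys.disk.lsiraid.vd.state: a non-zero value means this logical disk has a problem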

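5. SMART tool output

The smartctl tool is used to read disk SMART information. For now, all of these metrics are collected as data only; they do not necessarily mean the disk is damaged (only that the probability is higher). Every metric carries a set of tags identifying the device, for example device=/dev/sda.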

sys.disk.smart.Reallocated_Sector_Ct
sys.disk.smart.Spin_Retry_Count
sys.disk.smart.Reallocated_Event_Count
sys.disk.smart.Current_Pending_Sector
sys.disk.smart.Offline_Uncorrectable
sys.disk.smart.Temperature_Celsius

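6. Partition read/write monitoring

Tests whether every mounted partition is readable and writable. Every metric carries a set of tags identifying the mount point, for example mount=/home.

sys.disk.rw: a non-zero value means reads or writes on this partition are failing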

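7. IO metrics

How it is computed: /proc/diskstats is sampled once per second and the differences are calculated; all of these are counter-type metrics. Every metric carries a set of tags of the form device=$device identifying the specific device, for example sda1 or sdb. The iostat documentation is a useful reference for what each metric means.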

disk.io.ios_in_progress: Number of actual I/O requests currently in flight.
disk.io.msec_read: Total number of ms spent by all reads.
disk.io.msec_total: Amount of time during which ios_in_progress >= 1.
disk.io.msec_weighted_total: Measure of recent I/O completion time and backlog.
disk.io.msec_write: Total number of ms spent by all writes.
disk.io.read_merged: Adjacent read requests merged in a single req.
disk.io.read_requests: Total number of reads completed successfully.
disk.io.read_sectors: Total number of sectors read successfully.
disk.io.write_merged: Adjacent write requests merged in a single req.
disk.io.write_requests: Total number of writes completed successfully.
disk.io.write_sectors: Total number of sectors written successfully.
disk.io.read_bytes: a number, in bytes
disk.io.write_bytes: a number, in bytes
disk.io.avgrq_sz: this and the values below are the ones shown by iostat -x 1
disk.io.avgqu-sz
disk.io.await
disk.io.svctm
disk.io.util: a percentage; for example, 56.43 means 56.43%
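
To make the "sample and diff" approach concrete, the sketch below parses two /proc/diskstats lines for the same device and derives a few iostat-style values from the differences. The function names are hypothetical and the formulas follow the usual iostat definitions, which is an assumption about what the agent does. Sectors in /proc/diskstats are counted in 512-byte units.

```python
# Counter columns that follow "major minor device" in /proc/diskstats.
NAMES = ("read_requests", "read_merged", "read_sectors", "msec_read",
         "write_requests", "write_merged", "write_sectors", "msec_write",
         "ios_in_progress", "msec_total", "msec_weighted_total")

SECTOR_BYTES = 512  # /proc/diskstats always counts 512-byte sectors


def parse_diskstats_line(line):
    """Return (device, counters) for one /proc/diskstats line."""
    parts = line.split()
    values = [int(v) for v in parts[3:3 + len(NAMES)]]
    return parts[2], dict(zip(NAMES, values))


def derived_io(prev, curr, interval_ms):
    """Derive iostat-like values from two counter dicts for one device."""
    d = {name: curr[name] - prev[name] for name in NAMES}
    ios = d["read_requests"] + d["write_requests"]
    return {
        "disk.io.read_bytes": d["read_sectors"] * SECTOR_BYTES,
        "disk.io.write_bytes": d["write_sectors"] * SECTOR_BYTES,
        # Average time per completed request, in ms.
        "disk.io.await": (d["msec_read"] + d["msec_write"]) / ios if ios else 0.0,
        # Share of the interval during which the device was busy.
        "disk.io.util": 100.0 * d["msec_total"] / interval_ms,
    }
```

With a one-second sampling interval, interval_ms would be 1000.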

8. Machine load metrics

How it is computed: read from /proc/loadavg; all of these are raw-value metrics:

load.1min
load.5min
load.15min
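
Reading these three values is straightforward: they are the first three whitespace-separated fields of /proc/loadavg. A minimal sketch (the function name is hypothetical):

```python
def parse_loadavg(text):
    """Map the first three fields of /proc/loadavg to load.* metrics."""
    one, five, fifteen = (float(v) for v in text.split()[:3])
    return {"load.1min": one, "load.5min": five, "load.15min": fifteen}


def read_loadavg(path="/proc/loadavg"):
    """Read and parse the live file on a Linux host."""
    with open(path) as f:
        return parse_loadavg(f.read())
```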

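9. Memory metrics

How it is computed: read from the contents of /proc/meminfo, where mem.memfree is free+buffers+cached and mem.memused=mem.memtotal-mem.memfree. The output and documentation of the free command are a useful reference for what each metric means.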

mem.memtotal: total memory size
mem.memused: amount of memory in use
mem.memused.percent: percentage of memory in use
mem.memfree
mem.memfree.percent
mem.swaptotal: total swap size
mem.swapused: amount of swap in use
mem.swapused.percent: percentage of swap in use
mem.swapfree
mem.swapfree.percent
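
The mem.memfree = free+buffers+cached rule can be sketched as follows. The function names are hypothetical, and reporting sizes in bytes is an assumption of this example; the text above does not state the unit the agent uses.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo text into {key: value}, converting kB to bytes."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if not fields:
            continue
        scale = 1024 if len(fields) > 1 and fields[1] == "kB" else 1
        values[key.strip()] = int(fields[0]) * scale
    return values


def mem_metrics(info):
    """Apply the free+buffers+cached rule to parsed meminfo values."""
    total = info["MemTotal"]
    free = info["MemFree"] + info["Buffers"] + info["Cached"]
    used = total - free
    swap_total = info["SwapTotal"]
    swap_free = info["SwapFree"]
    swap_used = swap_total - swap_free

    def pct(part, whole):
        return 100.0 * part / whole if whole else 0.0  # hosts may have no swap

    return {
        "mem.memtotal": total,
        "mem.memfree": free,
        "mem.memused": used,
        "mem.memfree.percent": pct(free, total),
        "mem.memused.percent": pct(used, total),
        "mem.swaptotal": swap_total,
        "mem.swapfree": swap_free,
        "mem.swapused": swap_used,
        "mem.swapfree.percent": pct(swap_free, swap_total),
        "mem.swapused.percent": pct(swap_used, swap_total),
    }
```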

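10. Network metrics

How it is computed: read from the contents of /proc/net/dev. Every metric carries a set of tags of the form iface=$iface identifying the specific interface, for example eth0. Metrics with "in" in the name describe inbound traffic, "out" describes outbound traffic, and "total" is the sum in+out. The supported metrics are: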

net.if.in.bytes
net.if.in.compressed
net.if.in.dropped
net.if.in.errors
net.if.in.fifo.errs
net.if.in.frame.errs
net.if.in.multicast
net.if.in.packets
net.if.out.bytes
net.if.out.carrier.errs
net.if.out.collisions
net.if.out.compressed
net.if.out.dropped
net.if.out.errors
net.if.out.fifo.errs
net.if.out.packets
net.if.total.bytes
net.if.total.dropped
net.if.total.errors
net.if.total.packets
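
The metric names above line up with the sixteen counter columns of /proc/net/dev (eight receive, eight transmit). The sketch below shows one way to map them and to build the total values as in+out; the function name is hypothetical and this is not the agent's own code.

```python
# Receive and transmit column order in /proc/net/dev.
IN_COLS = ("bytes", "packets", "errors", "dropped",
           "fifo.errs", "frame.errs", "compressed", "multicast")
OUT_COLS = ("bytes", "packets", "errors", "dropped",
            "fifo.errs", "collisions", "carrier.errs", "compressed")


def parse_net_dev(text):
    """Return {iface: {metric: value}} from /proc/net/dev text."""
    result = {}
    for line in text.splitlines()[2:]:      # skip the two header lines
        iface, sep, rest = line.partition(":")
        if not sep:
            continue
        nums = [int(v) for v in rest.split()]
        if len(nums) < 16:
            continue
        metrics = {}
        for name, value in zip(IN_COLS, nums[:8]):
            metrics["net.if.in." + name] = value
        for name, value in zip(OUT_COLS, nums[8:16]):
            metrics["net.if.out." + name] = value
        for name in ("bytes", "packets", "errors", "dropped"):
            metrics["net.if.total." + name] = (
                metrics["net.if.in." + name] + metrics["net.if.out." + name])
        result[iface.strip()] = metrics
    return result
```

The values in the file are cumulative counters, so rates would come from the difference between two samples.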

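11. Port metrics

How it is computed: ss -ln is used to determine whether the specified port is in the listen state. This is a raw-value metric: the value is either 1, meaning the port is listening, or 0, meaning it is not. Every metric carries a set of tags of the form port=$port, where $port is the specific port.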

net.port.listen
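
A rough sketch of the check, working on the text that ss -ln prints. The function name is hypothetical, and the parsing assumes the common column layout in which the local address is the third column after the State column; real output can vary between iproute2 versions.

```python
def port_is_listening(ss_output, port):
    """Return 1 if some LISTEN socket's local address ends in :port, else 0."""
    wanted = str(port)
    for line in ss_output.splitlines():
        fields = line.split()
        if "LISTEN" not in fields:
            continue
        idx = fields.index("LISTEN")
        # Assumed layout: State Recv-Q Send-Q Local-Address:Port ...
        if len(fields) <= idx + 3:
            continue
        local = fields[idx + 3]
        if local.rsplit(":", 1)[-1] == wanted:
            return 1
    return 0
```

On a live host the text could come from subprocess.run(["ss", "-ln"], capture_output=True, text=True).stdout.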

12. Kernel configuration

kernel.maxfiles: read from /proc/sys/fs/file-max
kernel.files.allocated: the first field of /proc/sys/fs/file-nr
kernel.files.left: value = kernel.maxfiles - kernel.files.allocated
kernel.maxproc: read from /proc/sys/kernel/pid_max
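
The arithmetic for kernel.files.left can be sketched in a few lines. The function name is hypothetical; the inputs are the raw contents of /proc/sys/fs/file-max and /proc/sys/fs/file-nr.

```python
def kernel_file_metrics(file_max_text, file_nr_text):
    """Compute the kernel.* file handle metrics from the two proc files."""
    maxfiles = int(file_max_text.split()[0])
    allocated = int(file_nr_text.split()[0])  # first field of file-nr
    return {
        "kernel.maxfiles": maxfiles,
        "kernel.files.allocated": allocated,
        "kernel.files.left": maxfiles - allocated,
    }
```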

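13. NTP metrics

ntpq -pn is used to obtain the offset of the local clock relative to the NTP server.

sys.ntp.offset: offset of the local clock, in ms. A value that is too large, or a value of 0, indicates an anomaly and should trigger an alert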
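
One way to pull the offset out of ntpq -pn output is to look for the peer currently selected for synchronization, which ntpq marks with a leading "*", and read its offset column (in ms). This is a sketch under that assumption; the function name is hypothetical and the agent may select the peer differently.

```python
def parse_ntp_offset(ntpq_output):
    """Return the offset (ms) of the '*' peer in `ntpq -pn` output, or None."""
    for line in ntpq_output.splitlines():
        if not line.startswith("*"):
            continue
        # Columns: remote refid st t when poll reach delay offset jitter
        fields = line.split()
        if len(fields) >= 10:
            return float(fields[8])
    return None  # no peer selected yet
```

Returning None when no peer is selected lets the caller distinguish "not synchronized" from a genuine offset value.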

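14. Process monitoring

proc.num: the number of instances of a given process. There are two cases. One matches on the process name, for example name=sshd. The other matches on cmdline: Java applications, for instance, may all have the process name java, so the first approach cannot tell them apart. In that case you can configure cmdline, for example cmdline=./falcon_agent-c./cfg.ini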

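15. Process resource monitoring

process.cpu.all: sys+user CPU used by the process and its child processes, in jiffies
process.cpu.sys: sys CPU used by the process and its child processes, in jiffies
process.cpu.user: user CPU used by the process and its child processes, in jiffies
process.swap: swap used by the process and its child processes, in pages
process.fd: number of file descriptors used by the process
process.mem: memory occupied by the process, in bytes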
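
For readers unfamiliar with jiffies, the sketch below reads the CPU counters from a /proc/[pid]/stat line and shows how to convert them to seconds. The function names are hypothetical, and treating "the process and its child processes" as utime+cutime and stime+cstime (the process plus its waited-for children) is an assumption of this example.

```python
import os


def process_cpu_jiffies(stat_text):
    """Extract CPU time in jiffies from the text of /proc/[pid]/stat."""
    # The command name sits in parentheses and may contain spaces or ')',
    # so split on the last ')' before splitting the remaining fields.
    rest = stat_text.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state); utime..cstime are fields 14..17.
    utime, stime, cutime, cstime = (int(v) for v in rest[11:15])
    user = utime + cutime
    system = stime + cstime
    return {
        "process.cpu.user": user,
        "process.cpu.sys": system,
        "process.cpu.all": user + system,
    }


def jiffies_to_seconds(jiffies):
    """Convert jiffies to seconds using the system clock tick rate."""
    return jiffies / os.sysconf("SC_CLK_TCK")
```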

16. ss command output

ss.orphaned
ss.closed
ss.timewait
ss.slabinfo.timewait
ss.synrecv
ss.estab
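
These names resemble the TCP summary line printed by ss -s, for example "TCP: 12 (estab 3, closed 2, orphaned 0, synrecv 0, timewait 1/0), ports 0". Assuming that is the source (the text above does not say which ss invocation is used), and that the second number after "timewait" is the slab count, a parser could look like the sketch below. The function name is hypothetical, and newer iproute2 releases print a shorter line, so the parser tolerates missing items.

```python
import re


def parse_ss_summary(text):
    """Parse the 'TCP:' line of `ss -s` output into ss.* metrics."""
    match = re.search(r"^TCP:\s+\d+\s+\((.*?)\)", text, re.MULTILINE)
    if not match:
        return {}
    metrics = {}
    for item in match.group(1).split(","):
        name, _, value = item.strip().partition(" ")
        if name == "timewait" and "/" in value:
            # Older iproute2 prints "timewait <count>/<slab count>".
            timewait, slab = value.split("/", 1)
            metrics["ss.timewait"] = int(timewait)
            metrics["ss.slabinfo.timewait"] = int(slab)
        else:
            metrics["ss." + name] = int(value)
    return metrics
```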
