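Conntrack monitoring: don't wait for an outage before adding it

Posted by 技术颜良 on 2024-06-12

This is the 8th article in the column, and it covers the conntrack collector in node-exporter. Most people rarely look at this collector, but on firewalls, NAT gateways, and similar machines you need to monitor how full the conntrack table is. I once went through a production incident in which the conntrack table filled up and new connections could not be established, so this collector is well worth having.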

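Which metrics the conntrack collector exposes

An ordinary machine does not necessarily have conntrack enabled by default, so you may not see any metrics from this collector. Here I started the firewall with systemctl start firewalld, after which the conntrack metrics appeared.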

[root@aliyun-2c2g40g3m tarball]# curl -s localhost:9100/metrics | grep "node_nf_conntrack_"
# HELP node_nf_conntrack_entries Number of currently allocated flow entries for connection tracking.
# TYPE node_nf_conntrack_entries gauge
node_nf_conntrack_entries 44
# HELP node_nf_conntrack_entries_limit Maximum size of connection tracking table.
# TYPE node_nf_conntrack_entries_limit gauge
node_nf_conntrack_entries_limit 65536
# HELP node_nf_conntrack_stat_drop Number of packets dropped due to conntrack failure.
# TYPE node_nf_conntrack_stat_drop gauge
node_nf_conntrack_stat_drop 0
# HELP node_nf_conntrack_stat_early_drop Number of dropped conntrack entries to make room for new ones, if maximum table size was reached.
# TYPE node_nf_conntrack_stat_early_drop gauge
node_nf_conntrack_stat_early_drop 0
# HELP node_nf_conntrack_stat_found Number of searched entries which were successful.
# TYPE node_nf_conntrack_stat_found gauge
node_nf_conntrack_stat_found 0
# HELP node_nf_conntrack_stat_ignore Number of packets seen which are already connected to a conntrack entry.
# TYPE node_nf_conntrack_stat_ignore gauge
node_nf_conntrack_stat_ignore 0
# HELP node_nf_conntrack_stat_insert Number of entries inserted into the list.
# TYPE node_nf_conntrack_stat_insert gauge
node_nf_conntrack_stat_insert 0
# HELP node_nf_conntrack_stat_insert_failed Number of entries for which list insertion was attempted but failed.
# TYPE node_nf_conntrack_stat_insert_failed gauge
node_nf_conntrack_stat_insert_failed 0
# HELP node_nf_conntrack_stat_invalid Number of packets seen which can not be tracked.
# TYPE node_nf_conntrack_stat_invalid gauge
node_nf_conntrack_stat_invalid 2751
# HELP node_nf_conntrack_stat_search_restart Number of conntrack table lookups which had to be restarted due to hashtable resizes.
# TYPE node_nf_conntrack_stat_search_restart gauge
node_nf_conntrack_stat_search_restart 6261

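What is Conntrack

To understand these metrics, you first need to know what Conntrack is. Conntrack is a Linux kernel module that tracks the state of connections. For example, if your machine is a NAT gateway, Conntrack records the mapping from internal IP and port to external IP and port. When a reply packet arrives from the external network, the kernel looks up the Conntrack table, finds the matching internal IP and port, and forwards the packet to the internal machine. You can view the contents of the Conntrack table with the conntrack -L command.

The Conntrack table has a finite size, so once it is full, new connections cannot be established. At that point the kernel reports an nf_conntrack: table full error, and you have a production outage.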

Common alerting rules

Typically we configure an alerting rule like this:

100 * node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 85

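This fires once Conntrack entry usage exceeds 85%, which leaves time to respond. The usual responses are to enlarge the Conntrack table (nf_conntrack_max), to adjust the Conntrack timeouts, or to exempt certain connections from Conntrack altogether.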
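To make the arithmetic behind the rule concrete, here is a minimal Go sketch of the same calculation. The function names usagePercent and shouldAlert are hypothetical helpers written for this illustration; they are not part of node-exporter or Prometheus.

```go
package main

import "fmt"

// usagePercent returns how full the conntrack table is, as a percentage.
// It mirrors the alert expression: 100 * entries / limit.
// (Hypothetical helper, for illustration only.)
func usagePercent(entries, limit uint64) float64 {
	if limit == 0 {
		return 0 // avoid dividing by zero if the limit is missing or unset
	}
	return 100 * float64(entries) / float64(limit)
}

// shouldAlert reports whether usage is strictly above the threshold (in percent).
func shouldAlert(entries, limit uint64, thresholdPercent float64) bool {
	return usagePercent(entries, limit) > thresholdPercent
}

func main() {
	// Values from the sample metrics output earlier in this article.
	fmt.Printf("usage: %.3f%%\n", usagePercent(44, 65536))
	fmt.Println("alert:", shouldAlert(44, 65536, 85))

	// A made-up, nearly full table, to show the rule firing.
	fmt.Println("alert:", shouldAlert(60000, 65536, 85))
}
```

With the sample values (44 entries against a limit of 65536) the table is far below the threshold; the made-up value of 60000 entries is roughly 91.6% and would trigger the alert.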


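Collection logic of the conntrack collector

The logic lives in conntrack_linux.go; the collector exists only on Linux, not on other systems. Like the other node-exporter collectors, it provides functions such as init, NewConntrackCollector, and Update. The collection logic is in Update. The code is as follows: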

func (c *conntrackCollector) Update(ch chan<- prometheus.Metric) error {
	// Current number of entries in the conntrack table.
	value, err := readUintFromFile(procFilePath("sys/net/netfilter/nf_conntrack_count"))
	if err != nil {
		return c.handleErr(err)
	}
	ch <- prometheus.MustNewConstMetric(
		c.current, prometheus.GaugeValue, float64(value))

	// Maximum number of entries the table can hold.
	value, err = readUintFromFile(procFilePath("sys/net/netfilter/nf_conntrack_max"))
	if err != nil {
		return c.handleErr(err)
	}
	ch <- prometheus.MustNewConstMetric(
		c.limit, prometheus.GaugeValue, float64(value))

	// Statistics counters for the conntrack table.
	conntrackStats, err := getConntrackStatistics()
	if err != nil {
		return c.handleErr(err)
	}

	ch <- prometheus.MustNewConstMetric(
		c.found, prometheus.GaugeValue, float64(conntrackStats.found))
	ch <- prometheus.MustNewConstMetric(
		c.invalid, prometheus.GaugeValue, float64(conntrackStats.invalid))
	ch <- prometheus.MustNewConstMetric(
		c.ignore, prometheus.GaugeValue, float64(conntrackStats.ignore))
	ch <- prometheus.MustNewConstMetric(
		c.insert, prometheus.GaugeValue, float64(conntrackStats.insert))
	ch <- prometheus.MustNewConstMetric(
		c.insertFailed, prometheus.GaugeValue, float64(conntrackStats.insertFailed))
	ch <- prometheus.MustNewConstMetric(
		c.drop, prometheus.GaugeValue, float64(conntrackStats.drop))
	ch <- prometheus.MustNewConstMetric(
		c.earlyDrop, prometheus.GaugeValue, float64(conntrackStats.earlyDrop))
	ch <- prometheus.MustNewConstMetric(
		c.searchRestart, prometheus.GaugeValue, float64(conntrackStats.searchRestart))
	return nil
}

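It first reads the files /proc/sys/net/netfilter/nf_conntrack_count and /proc/sys/net/netfilter/nf_conntrack_max to get the current number of entries in the Conntrack table and the maximum number of entries. It then calls getConntrackStatistics to fetch the Conntrack statistics: found, invalid, ignore, insert, insertFailed, drop, earlyDrop, searchRestart, and so on. Finally it sends these metrics to the ch channel.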

getConntrackStatistics reads and parses the file /proc/net/stat/nf_conntrack. Let's look at the contents of /proc/net/stat/nf_conntrack:

[root@aliyun-2c2g40g3m tarball]# cat /proc/net/stat/nf_conntrack
entries clashres found new invalid ignore delete delete_list insert insert_failed drop early_drop icmp_error expect_new expect_create expect_delete search_restart
00000022 00000000 00000000 00000000 00000b0c 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000012 00000000 00000000 00000000 00001591
00000022 00000000 00000000 00000000 0000008b 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 0000034f

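For the meaning of each field, let's consult the authoritative documentation: https://man7.org/linux/man-pages/man8/rtstat.8.html (note that the man page lists a searched column, whereas the output above shows clashres in that position).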

/proc/net/stat/ip_conntrack, /proc/net/stat/nf_conntrack
Conntrack related counters. ip_conntrack is for backwards
compatibility with older userspace only and shows the same
data as nf_conntrack.

entries Number of entries in conntrack table.

searched Number of conntrack table lookups performed.

found Number of searched entries which were successful.

new Number of conntrack entries added which were not
expected before.

invalid Number of packets seen which can not be tracked.

ignore Number of packets seen which are already connected
to a conntrack entry.

delete Number of conntrack entries which were removed.

delete_list Number of conntrack entries which were put to
dying list.

insert Number of entries inserted into the list.

insert_failed Number of entries for which list insertion
was attempted but failed (happens if the same entry is
already present).

drop Number of packets dropped due to conntrack failure.
Either new conntrack entry allocation failed, or protocol
helper dropped the packet.

early_drop Number of dropped conntrack entries to make
room for new ones, if maximum table size was reached.

icmp_error Number of packets which could not be tracked
due to error situation. This is a subset of invalid.

expect_new Number of conntrack entries added after an
expectation for them was already present.

expect_create Number of expectations added.

expect_delete Number of expectations deleted.

search_restart Number of conntrack table lookups which had
to be restarted due to hashtable resizes.

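The values in this file are hexadecimal, so node-exporter converts them to decimal. Also, on my system the file contains several rows of values (one row per CPU), and the node-exporter code adds these rows together.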
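The hex-to-decimal conversion and the summing across rows can be sketched as follows. This is only an illustration of the idea, not node-exporter's actual getConntrackStatistics; the function name sumStatColumns is hypothetical, and the sample constant is the file content shown above.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// sumStatColumns parses text laid out like /proc/net/stat/nf_conntrack:
// a header line of column names, then rows of hexadecimal values.
// It returns, per column name, the sum of that column over all rows.
// Hypothetical helper; not the node-exporter implementation.
func sumStatColumns(text string) (map[string]uint64, error) {
	sc := bufio.NewScanner(strings.NewReader(text))
	if !sc.Scan() {
		return nil, fmt.Errorf("empty input")
	}
	header := strings.Fields(sc.Text())
	totals := make(map[string]uint64, len(header))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) == 0 {
			continue // skip blank lines
		}
		if len(fields) != len(header) {
			return nil, fmt.Errorf("row has %d fields, header has %d", len(fields), len(header))
		}
		for i, f := range fields {
			v, err := strconv.ParseUint(f, 16, 64) // base 16: values are hexadecimal
			if err != nil {
				return nil, err
			}
			totals[header[i]] += v
		}
	}
	return totals, sc.Err()
}

// sample is the file content shown earlier in this article.
const sample = `entries clashres found new invalid ignore delete delete_list insert insert_failed drop early_drop icmp_error expect_new expect_create expect_delete search_restart
00000022 00000000 00000000 00000000 00000b0c 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000012 00000000 00000000 00000000 00001591
00000022 00000000 00000000 00000000 0000008b 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 0000034f
`

func main() {
	totals, err := sumStatColumns(sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println("invalid:", totals["invalid"])               // 0xb0c + 0x8b = 2967
	fmt.Println("search_restart:", totals["search_restart"]) // 0x1591 + 0x34f = 6368
}
```

These totals differ slightly from the metrics output at the top of the article (2751 and 6261) simply because the two snapshots were taken at different moments. The sketch sums every column for simplicity; in the Update code above, the entry count comes from nf_conntrack_count and not from this file.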

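Summary

This article introduced the conntrack collector in node-exporter, which monitors how full the Conntrack table is and is typically used on firewalls, NAT gateways, and similar machines. The Conntrack table has a finite size, and once it is full, new connections cannot be established, so you need to monitor its usage and respond in time. If you have questions, feel free to leave a comment and discuss.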


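Further reading

  • Column: Mastering machine monitoring metrics through the node-exporter source code
  • A handy SRE troubleshooting tool: try httpstat when API requests time out
  • Alert recovery notifications can finally show the value at recovery time
  • Building an observability system oriented toward incident handling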
