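In the previous article, "The Three Core Components of Linux Routing" (《Linux 路由三大件》), we mentioned that iptables can modify a packet's attributes and thereby influence how it is routed. Traditional firewalls rely on this capability, and so do cloud-native features such as service routing (k8s service) and network policy (calico network policy).
Although the industry is actively rolling out newer, more efficient technologies such as ipvs and ebpf, iptables is still very widely used for all sorts of reasons (technical, cost, operational), so it is worth understanding thoroughly. Today we look at one important piece: debugging and tracing iptables. More specifically: how a packet travels through the iptables chains, tables, and rules.
Note: readers unfamiliar with iptables may want to read this wiki first: https://wiki.archlinux.org/title/Iptables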
trace
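As we know, every iptables rule ends in a target, and that target can be set to TRACE. Every chain, table, and rule that a packet matching this rule subsequently passes through is then logged.
The TRACE rule itself can only be placed in the raw table, in either the PREROUTING or the OUTPUT chain. Its match criteria work the same way as for any other rule.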
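Because a TRACE rule logs every packet it matches, it is usually added only for the duration of a test and deleted afterwards. The following is a minimal sketch of that workflow, not a definitive tool: the helper names run and trace_rule and the DRY_RUN switch are hypothetical, and the iptables options are the same ones used by the commands in this article, plus -D to delete a rule.

```shell
#!/bin/sh
# Hypothetical helpers (not part of iptables): add or delete a TRACE rule
# in the raw table. With DRY_RUN=1 the command is printed instead of executed.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# usage: trace_rule <A|D> <PREROUTING|OUTPUT> <destination> <dport>
#   A appends the rule, D deletes the rule with the same specification.
trace_rule() {
    case "$1" in
        A|D) ;;
        *) echo "action must be A (append) or D (delete), got: $1" >&2; return 2 ;;
    esac
    case "$2" in
        PREROUTING|OUTPUT) ;;
        *) echo "TRACE belongs in raw PREROUTING or OUTPUT, got: $2" >&2; return 1 ;;
    esac
    run iptables -t raw "-$1" "$2" -p tcp --destination "$3" --dport "$4" -j TRACE
}
```

With DRY_RUN=1, calling trace_rule A OUTPUT 172.19.0.0/16 80 only prints the command so it can be reviewed first. Calling trace_rule D with the same arguments removes the rule once the test is done, which stops the log from filling up.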
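Practice
We use a docker setup for the exercise, as shown in the figure below:
That is, port 10000 on the host is mapped to port 80 in the container. The figure covers two test scenarios:
- Accessing the container IP from the local node
- Accessing the host IP from another node
To keep the analysis simple, we test with telnet (curl would generate many follow-up HTTP packets, which is noisier).
OK, let's start testing!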
First, enable iptables logging on the host:
# check first
# cat /proc/net/netfilter/nf_log
modprobe nf_log_ipv4
sysctl net.netfilter.nf_log.2=nf_log_ipv4
Then add the trace rules:
# for accessing the container IP
iptables -t raw -A OUTPUT -p tcp --destination 172.19.0.0/16 --dport 80 -j TRACE
# for accessing the node IP
iptables -t raw -A PREROUTING -p tcp --destination 192.168.64.4 --dport 10000 -j TRACE
To view the results, use this command: grep "TRACE:" /var/log/messages
Access from the local node: telnet 172.19.0.9 80, result:
fedora kernel: [ 2350.411619] TRACE: raw:OUTPUT:policy:2 IN= OUT=br-00ea7870520a SRC=172.19.0.1 DST=172.19.0.9 LEN=60 TOS=0x10 PREC=0x00 TTL=64 ID=11432 DF PROTO=TCP SPT=43256 DPT=80 SEQ=2316959420 ACK=0 WINDOW=64240 RES=0x00 SYN URGP=0 OPT (020405B40402080AAF8EE4320000000001030307) UID=0 GID=0
fedora kernel: [ 2350.411634] TRACE: nat:OUTPUT:policy:2 IN= OUT=br-00ea7870520a SRC=172.19.0.1 DST=172.19.0.9 LEN=60 TOS=0x10 PREC=0x00 TTL=64 ID=11432 DF PROTO=TCP SPT=43256 DPT=80 SEQ=2316959420 ACK=0 WINDOW=64240 RES=0x00 SYN URGP=0 OPT (020405B40402080AAF8EE4320000000001030307) UID=0 GID=0
fedora kernel: [ 2350.411639] TRACE: filter:OUTPUT:policy:1 IN= OUT=br-00ea7870520a SRC=172.19.0.1 DST=172.19.0.9 LEN=60 TOS=0x10 PREC=0x00 TTL=64 ID=11432 DF PROTO=TCP SPT=43256 DPT=80 SEQ=2316959420 ACK=0 WINDOW=64240 RES=0x00 SYN URGP=0 OPT (020405B40402080AAF8EE4320000000001030307) UID=0 GID=0
fedora kernel: [ 2350.411644] TRACE: nat:POSTROUTING:policy:4 IN= OUT=br-00ea7870520a SRC=172.19.0.1 DST=172.19.0.9 LEN=60 TOS=0x10 PREC=0x00 TTL=64 ID=11432 DF PROTO=TCP SPT=43256 DPT=80 SEQ=2316959420 ACK=0 WINDOW=64240 RES=0x00 SYN URGP=0 OPT (020405B40402080AAF8EE4320000000001030307) UID=0 GID=0
fedora kernel: [ 2350.411778] TRACE: raw:OUTPUT:policy:2 IN= OUT=br-00ea7870520a SRC=172.19.0.1 DST=172.19.0.9 LEN=52 TOS=0x10 PREC=0x00 TTL=64 ID=11433 DF PROTO=TCP SPT=43256 DPT=80 SEQ=2316959421 ACK=357062108 WINDOW=502 RES=0x00 ACK URGP=0 OPT (0101080AAF8EE4320C2ACEC2) UID=0 GID=0
fedora kernel: [ 2350.411784] TRACE: filter:OUTPUT:policy:1 IN= OUT=br-00ea7870520a SRC=172.19.0.1 DST=172.19.0.9 LEN=52 TOS=0x10 PREC=0x00 TTL=64 ID=11433 DF PROTO=TCP SPT=43256 DPT=80 SEQ=2316959421 ACK=357062108 WINDOW=502 RES=0x00 ACK URGP=0 OPT (0101080AAF8EE4320C2ACEC2) UID=0 GID=0
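Trace lines like the ones above are dense, and the interesting part is the table:chain:type:number token right after "TRACE:". As a rough sketch, the helpers below reduce each line to that token and then join the hops of each packet into one path. The function names trace_hops and trace_paths are hypothetical, and using SEQ alone as the packet key is a simplifying assumption that is only reasonable for a short single-connection test like this one.

```shell
#!/bin/sh
# Hypothetical helpers for reading TRACE lines from the kernel log on stdin.

# Print "<SEQ> <table:chain:type:num>" for every TRACE line.
trace_hops() {
    sed -n 's/.*TRACE: \([^ ]*\) .* SEQ=\([0-9]*\) .*/\2 \1/p'
}

# Join the hops that share a SEQ value into one path per packet,
# keeping packets in the order in which they first appear.
trace_paths() {
    trace_hops | awk '
        {
            if ($1 in path) {
                path[$1] = path[$1] " -> " $2
            } else {
                order[++n] = $1
                path[$1] = $2
            }
        }
        END {
            for (i = 1; i <= n; i++)
                print "SEQ=" order[i] ": " path[order[i]]
        }'
}
```

A possible invocation is grep "TRACE:" /var/log/messages | trace_paths, which prints one line per packet and makes it easy to see which packet skipped the nat table.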
Access from another node: telnet 192.168.64.4 10000, result:
fedora kernel: [ 2631.978281] TRACE: raw:PREROUTING:policy:2 IN=enp0s1 OUT= MAC=fa:f3:dc:b4:4e:ef:3a:f9:d3:9e:aa:64:08:00 SRC=192.168.64.1 DST=192.168.64.4 LEN=64 TOS=0x10 PREC=0x00 TTL=64 ID=0 PROTO=TCP SPT=52842 DPT=10000 SEQ=1184533180 ACK=0 WINDOW=65535 RES=0x00 CWR ECE SYN URGP=0 OPT (020405B4010303060101080A6EF8750A0000000004020000)
fedora kernel: [ 2631.978585] TRACE: nat:PREROUTING:rule:1 IN=enp0s1 OUT= MAC=fa:f3:dc:b4:4e:ef:3a:f9:d3:9e:aa:64:08:00 SRC=192.168.64.1 DST=192.168.64.4 LEN=64 TOS=0x10 PREC=0x00 TTL=64 ID=0 PROTO=TCP SPT=52842 DPT=10000 SEQ=1184533180 ACK=0 WINDOW=65535 RES=0x00 CWR ECE SYN URGP=0 OPT (020405B4010303060101080A6EF8750A0000000004020000)
fedora kernel: [ 2631.979055] TRACE: nat:DOCKER:rule:3 IN=enp0s1 OUT= MAC=fa:f3:dc:b4:4e:ef:3a:f9:d3:9e:aa:64:08:00 SRC=192.168.64.1 DST=192.168.64.4 LEN=64 TOS=0x10 PREC=0x00 TTL=64 ID=0 PROTO=TCP SPT=52842 DPT=10000 SEQ=1184533180 ACK=0 WINDOW=65535 RES=0x00 CWR ECE SYN URGP=0 OPT (020405B4010303060101080A6EF8750A0000000004020000)
fedora kernel: [ 2631.979077] TRACE: filter:FORWARD:rule:1 IN=enp0s1 OUT=br-00ea7870520a MAC=fa:f3:dc:b4:4e:ef:3a:f9:d3:9e:aa:64:08:00 SRC=192.168.64.1 DST=172.19.0.9 LEN=64 TOS=0x10 PREC=0x00 TTL=63 ID=0 PROTO=TCP SPT=52842 DPT=80 SEQ=1184533180 ACK=0 WINDOW=65535 RES=0x00 CWR ECE SYN URGP=0 OPT (020405B4010303060101080A6EF8750A0000000004020000)
fedora kernel: [ 2631.979585] TRACE: filter:DOCKER-USER:return:1 IN=enp0s1 OUT=br-00ea7870520a MAC=fa:f3:dc:b4:4e:ef:3a:f9:d3:9e:aa:64:08:00 SRC=192.168.64.1 DST=172.19.0.9 LEN=64 TOS=0x10 PREC=0x00 TTL=63 ID=0 PROTO=TCP SPT=52842 DPT=80 SEQ=1184533180 ACK=0 WINDOW=65535 RES=0x00 CWR ECE SYN URGP=0 OPT (020405B4010303060101080A6EF8750A0000000004020000)
fedora kernel: [ 2631.979585] TRACE: filter:FORWARD:rule:2 IN=enp0s1 OUT=br-00ea7870520a MAC=fa:f3:dc:b4:4e:ef:3a:f9:d3:9e:aa:64:08:00 SRC=192.168.64.1 DST=172.19.0.9 LEN=64 TOS=0x10 PREC=0x00 TTL=63 ID=0 PROTO=TCP SPT=52842 DPT=80 SEQ=1184533180 ACK=0 WINDOW=65535 RES=0x00 CWR ECE SYN URGP=0 OPT (020405B4010303060101080A6EF8750A0000000004020000)
fedora kernel: [ 2631.979585] TRACE: filter:DOCKER-ISOLATION-STAGE-1:return:3 IN=enp0s1 OUT=br-00ea7870520a MAC=fa:f3:dc:b4:4e:ef:3a:f9:d3:9e:aa:64:08:00 SRC=192.168.64.1 DST=172.19.0.9 LEN=64 TOS=0x10 PREC=0x00 TTL=63 ID=0 PROTO=TCP SPT=52842 DPT=80 SEQ=1184533180 ACK=0 WINDOW=65535 RES=0x00 CWR ECE SYN URGP=0 OPT (020405B4010303060101080A6EF8750A0000000004020000)
fedora kernel: [ 2631.980070] TRACE: filter:FORWARD:rule:8 IN=enp0s1 OUT=br-00ea7870520a MAC=fa:f3:dc:b4:4e:ef:3a:f9:d3:9e:aa:64:08:00 SRC=192.168.64.1 DST=172.19.0.9 LEN=64 TOS=0x10 PREC=0x00 TTL=63 ID=0 PROTO=TCP SPT=52842 DPT=80 SEQ=1184533180 ACK=0 WINDOW=65535 RES=0x00 CWR ECE SYN URGP=0 OPT (020405B4010303060101080A6EF8750A0000000004020000)
fedora kernel: [ 2631.980070] TRACE: filter:DOCKER:rule:1 IN=enp0s1 OUT=br-00ea7870520a MAC=fa:f3:dc:b4:4e:ef:3a:f9:d3:9e:aa:64:08:00 SRC=192.168.64.1 DST=172.19.0.9 LEN=64 TOS=0x10 PREC=0x00 TTL=63 ID=0 PROTO=TCP SPT=52842 DPT=80 SEQ=1184533180 ACK=0 WINDOW=65535 RES=0x00 CWR ECE SYN URGP=0 OPT (020405B4010303060101080A6EF8750A0000000004020000)
fedora kernel: [ 2631.980070] TRACE: nat:POSTROUTING:policy:4 IN=enp0s1 OUT=br-00ea7870520a MAC=fa:f3:dc:b4:4e:ef:3a:f9:d3:9e:aa:64:08:00 SRC=192.168.64.1 DST=172.19.0.9 LEN=64 TOS=0x10 PREC=0x00 TTL=63 ID=0 PROTO=TCP SPT=52842 DPT=80 SEQ=1184533180 ACK=0 WINDOW=65535 RES=0x00 CWR ECE SYN URGP=0 OPT (020405B4010303060101080A6EF8750A0000000004020000)
fedora kernel: [ 2631.981288] TRACE: raw:PREROUTING:policy:2 IN=enp0s1 OUT= MAC=fa:f3:dc:b4:4e:ef:3a:f9:d3:9e:aa:64:08:00 SRC=192.168.64.1 DST=192.168.64.4 LEN=52 TOS=0x10 PREC=0x00 TTL=64 ID=0 PROTO=TCP SPT=52842 DPT=10000 SEQ=1184533181 ACK=3245308350 WINDOW=2058 RES=0x00 ACK URGP=0 OPT (0101080A6EF8750D8E0CA28F)
fedora kernel: [ 2631.981299] TRACE: filter:FORWARD:rule:1 IN=enp0s1 OUT=br-00ea7870520a MAC=fa:f3:dc:b4:4e:ef:3a:f9:d3:9e:aa:64:08:00 SRC=192.168.64.1 DST=172.19.0.9 LEN=52 TOS=0x10 PREC=0x00 TTL=63 ID=0 PROTO=TCP SPT=52842 DPT=80 SEQ=1184533181 ACK=3245308350 WINDOW=2058 RES=0x00 ACK URGP=0 OPT (0101080A6EF8750D8E0CA28F)
fedora kernel: [ 2631.981304] TRACE: filter:DOCKER-USER:return:1 IN=enp0s1 OUT=br-00ea7870520a MAC=fa:f3:dc:b4:4e:ef:3a:f9:d3:9e:aa:64:08:00 SRC=192.168.64.1 DST=172.19.0.9 LEN=52 TOS=0x10 PREC=0x00 TTL=63 ID=0 PROTO=TCP SPT=52842 DPT=80 SEQ=1184533181 ACK=3245308350 WINDOW=2058 RES=0x00 ACK URGP=0 OPT (0101080A6EF8750D8E0CA28F)
fedora kernel: [ 2631.981308] TRACE: filter:FORWARD:rule:2 IN=enp0s1 OUT=br-00ea7870520a MAC=fa:f3:dc:b4:4e:ef:3a:f9:d3:9e:aa:64:08:00 SRC=192.168.64.1 DST=172.19.0.9 LEN=52 TOS=0x10 PREC=0x00 TTL=63 ID=0 PROTO=TCP SPT=52842 DPT=80 SEQ=1184533181 ACK=3245308350 WINDOW=2058 RES=0x00 ACK URGP=0 OPT (0101080A6EF8750D8E0CA28F)
fedora kernel: [ 2631.981312] TRACE: filter:DOCKER-ISOLATION-STAGE-1:return:3 IN=enp0s1 OUT=br-00ea7870520a MAC=fa:f3:dc:b4:4e:ef:3a:f9:d3:9e:aa:64:08:00 SRC=192.168.64.1 DST=172.19.0.9 LEN=52 TOS=0x10 PREC=0x00 TTL=63 ID=0 PROTO=TCP SPT=52842 DPT=80 SEQ=1184533181 ACK=3245308350 WINDOW=2058 RES=0x00 ACK URGP=0 OPT (0101080A6EF8750D8E0CA28F)
fedora kernel: [ 2631.981316] TRACE: filter:FORWARD:rule:7 IN=enp0s1 OUT=br-00ea7870520a MAC=fa:f3:dc:b4:4e:ef:3a:f9:d3:9e:aa:64:08:00 SRC=192.168.64.1 DST=172.19.0.9 LEN=52 TOS=0x10 PREC=0x00 TTL=63 ID=0 PROTO=TCP SPT=52842 DPT=80 SEQ=1184533181 ACK=3245308350 WINDOW=2058 RES=0x00 ACK URGP=0 OPT (0101080A6EF8750D8E0CA28F)
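Compare with this figure
and with the actual rules on the host node (docker0-related rules removed, so the rule numbers in the trace, which also count those rules, can be higher than the positions in this listing)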
*raw
:PREROUTING ACCEPT [312:33617]
:OUTPUT ACCEPT [335:24610]
-A PREROUTING -d 192.168.64.4/32 -p tcp -m tcp --dport 10000 -j TRACE
-A OUTPUT -d 172.19.0.0/16 -p tcp -m tcp --dport 80 -j TRACE
COMMIT
*nat
:PREROUTING ACCEPT [4:903]
:INPUT ACCEPT [4:903]
:OUTPUT ACCEPT [222:16027]
:POSTROUTING ACCEPT [224:16155]
:DOCKER - [0:0]
-A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER
-A OUTPUT ! -d 127.0.0.0/8 -m addrtype --dst-type LOCAL -j DOCKER
-A POSTROUTING -s 172.19.0.0/16 ! -o br-00ea7870520a -j MASQUERADE
-A POSTROUTING -s 172.19.0.9/32 -d 172.19.0.9/32 -p tcp -m tcp --dport 80 -j MASQUERADE
-A DOCKER -i br-00ea7870520a -j RETURN
-A DOCKER ! -i br-00ea7870520a -p tcp -m tcp --dport 10000 -j DNAT --to-destination 172.19.0.9:80
COMMIT
*filter
:INPUT ACCEPT [308:34040]
:FORWARD DROP [0:0]
:OUTPUT ACCEPT [351:26291]
:DOCKER - [0:0]
:DOCKER-ISOLATION-STAGE-1 - [0:0]
:DOCKER-ISOLATION-STAGE-2 - [0:0]
:DOCKER-USER - [0:0]
-A FORWARD -j DOCKER-USER
-A FORWARD -j DOCKER-ISOLATION-STAGE-1
-A FORWARD -o br-00ea7870520a -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -o br-00ea7870520a -j DOCKER
-A FORWARD -i br-00ea7870520a ! -o br-00ea7870520a -j ACCEPT
-A FORWARD -i br-00ea7870520a -o br-00ea7870520a -j ACCEPT
-A DOCKER -d 172.19.0.9/32 ! -i br-00ea7870520a -o br-00ea7870520a -p tcp -m tcp --dport 80 -j ACCEPT
-A DOCKER-ISOLATION-STAGE-1 -i br-00ea7870520a ! -o br-00ea7870520a -j DOCKER-ISOLATION-STAGE-2
-A DOCKER-ISOLATION-STAGE-1 -j RETURN
-A DOCKER-ISOLATION-STAGE-2 -o br-00ea7870520a -j DROP
-A DOCKER-ISOLATION-STAGE-2 -j RETURN
-A DOCKER-USER -j RETURN
COMMIT
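From the output we can tell:
- Both tests capture two packets, the first and third steps of the TCP three-way handshake. The SEQ field confirms this (the second packet's SEQ is the first packet's plus one), and so does the change in the pkts counter of the iptables rule: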
iptables -v -t raw -L PREROUTING --line-number
Chain PREROUTING (policy ACCEPT 239 packets, 24919 bytes)
num pkts bytes target prot opt in out source destination
1 8 445 TRACE tcp -- any any anywhere fedora tcp dpt:ndmp
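- For a TCP connection, only the first packet traverses the nat table. Later packets belong to an established connection (RELATED,ESTABLISHED) and are let through directly; conntrack applies the translation recorded for the connection, so the nat rules are not evaluated again.
- For cross-node access, the first packet arrives at the PREROUTING chain and hits the DNAT rule added by docker, nat:DOCKER (rule 3). The effect shows between the third and fourth log lines: DST changes from 192.168.64.4 to 172.19.0.9, and DPT from 10000 to 80.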
iptables -v -t nat -L DOCKER --line-number
Chain DOCKER (2 references)
num pkts bytes target prot opt in out source destination
1 0 0 RETURN all -- docker0 any anywhere anywhere
2 0 0 RETURN all -- br-00ea7870520a any anywhere anywhere
3 2 128 DNAT tcp -- !br-00ea7870520a any anywhere anywhere tcp dpt:ndmp to:172.19.0.9:80
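After DNAT, the packet enters the FORWARD chain, where DOCKER-USER and DOCKER-ISOLATION-STAGE-1 both return. It then goes through filter:FORWARD:rule:8 into filter:DOCKER, whose ACCEPT hands the packet to the docker bridge br-00ea7870520a and on into the container 172.19.0.9. The second packet again belongs to RELATED,ESTABLISHED, so filter:FORWARD:rule:7 accepts it directly and it never reaches filter:DOCKER.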
iptables -v -t filter -L DOCKER --line-number
Chain DOCKER (2 references)
num pkts bytes target prot opt in out source destination
1 2 128 ACCEPT tcp -- !br-00ea7870520a br-00ea7870520a anywhere 172.19.0.9 tcp dpt:http
iptables -v -t filter -L FORWARD --line-number
Chain FORWARD (policy DROP 0 packets, 0 bytes)
num pkts bytes target prot opt in out source destination
1 18 1290 DOCKER-USER all -- any any anywhere anywhere
2 18 1290 DOCKER-ISOLATION-STAGE-1 all -- any any anywhere anywhere
3 0 0 ACCEPT all -- any docker0 anywhere anywhere ctstate RELATED,ESTABLISHED
4 0 0 DOCKER all -- any docker0 anywhere anywhere
5 0 0 ACCEPT all -- docker0 !docker0 anywhere anywhere
6 0 0 ACCEPT all -- docker0 docker0 anywhere anywhere
7 8 421 ACCEPT all -- any br-00ea7870520a anywhere anywhere ctstate RELATED,ESTABLISHED
8 2 128 DOCKER all -- any br-00ea7870520a anywhere anywhere
9 8 741 ACCEPT all -- br-00ea7870520a !br-00ea7870520a anywhere anywhere
10 0 0 ACCEPT all -- br-00ea7870520a br-00ea7870520a anywhere anywhere
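The DNAT step described above can also be located mechanically: for each packet, find the first hop whose DST or DPT differs from the previous hop. The sketch below does that under a few assumptions: the function name trace_rewrites is hypothetical, the log lines follow the TCP layout shown in this article (DPT immediately followed by SEQ), and SEQ is used as the packet key.

```shell
#!/bin/sh
# Hypothetical helper: report where the destination of a traced packet changes.
# Reads TRACE lines on stdin.
trace_rewrites() {
    # Reduce each line to "<SEQ> <table:chain:type:num> <DST>:<DPT>".
    sed -n 's/.*TRACE: \([^ ]*\) .* DST=\([^ ]*\) .* DPT=\([0-9]*\) SEQ=\([0-9]*\) .*/\4 \1 \2:\3/p' |
    awk '
        # Same packet as before, but the destination differs from the last hop.
        ($1 in last) && last[$1] != $3 {
            print "SEQ=" $1 ": " last[$1] " -> " $3 " (first seen at " $2 ")"
        }
        # Remember the destination seen at this hop.
        { last[$1] = $3 }'
}
```

Run over the cross-node log, this should report the change from 192.168.64.4:10000 to 172.19.0.9:80 for both packets, with filter:FORWARD:rule:1 as the first hop showing the new destination. For the second packet there is no nat hop in between, which matches the point that the translation comes from the existing conntrack entry.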
Addendum
One last note: on newer kernels, iptables may be backed by nf_tables rather than iptables-legacy:
# older
iptables -V
iptables v1.8.8 (legacy)
# newer
iptables -V
iptables v1.8.8 (nf_tables)
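In that case, use nft monitor trace (or xtables-monitor --trace) to observe the results. This is friendlier than reading messages: each packet gets its own id, which makes the output easier to follow. The way tracing is set up also differs, which we do not cover here; interested readers can explore the reference links.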
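Which of the two viewing methods applies depends on the backend reported by iptables -V. A small sketch of that decision follows; the function name trace_viewer is hypothetical, and it only recognizes the two version strings shown above, reporting anything else as unknown.

```shell
#!/bin/sh
# Hypothetical helper: read the output of "iptables -V" on stdin and print
# the command this article uses to view TRACE results for that backend.
trace_viewer() {
    read -r version_line
    case "$version_line" in
        *"(nf_tables)"*) echo "xtables-monitor --trace" ;;
        *"(legacy)"*)    echo 'grep "TRACE:" /var/log/messages' ;;
        *)               echo "unknown backend: $version_line" >&2; return 1 ;;
    esac
}
```

It would be used as iptables -V | trace_viewer. The function only prints the suggested command and leaves running it to the reader.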
References
- https://serverfault.com/questions/122157/debugger-for-iptable...
- https://man7.org/linux/man-pages/man8/iptables.8.html#TARGETS
- https://man7.org/linux/man-pages/man8/iptables-extensions.8.html
- https://stackoverflow.com/questions/41748330/how-to-read-iptables-trace-logs-policy-numbers#
- https://wiki.nftables.org/wiki-nftables/index.php/Ruleset_deb...
- https://unix.stackexchange.com/questions/614413/how-to-proper...