How Do Packets Travel Through iptables Rules?

Published on 2023-09-18

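In the previous post, "Linux 路由三大件" (the three core pieces of Linux routing), we mentioned that iptables can modify a packet's attributes and thereby influence how it is routed. This capability is relied upon both in traditional settings such as firewalls and in cloud-native settings such as service routing (k8s service) and network policy (calico network policy).

Although the industry is actively pushing newer, more efficient technologies such as ipvs and eBPF, iptables is still very widely used for all sorts of reasons (technical, cost, operational), so it is worth understanding it thoroughly. Today we look at one important piece: debugging and tracing iptables. More specifically: how does a packet travel through the chains, tables, and rules of iptables?

Note: readers unfamiliar with iptables may want to read this wiki first: https://wiki.archlinux.org/title/Iptables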

trace

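As we know, every iptables rule ends with a target, and that target can be set to TRACE. Once a packet matches such a rule, every chain/table/rule it subsequently passes through is logged.

The TRACE rule itself can only be placed in the raw table, in either the PREROUTING or the OUTPUT chain. Its match criteria work the same way as for any other rule.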

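Hands-on

We use a Docker setup for the exercise, as shown in the figure below:

That is, port 10000 on the host is mapped to port 80 in the container. The figure shows two test scenarios:

  1. Accessing the container IP from the host itself
  2. Accessing the host IP from another node

To keep the analysis simple, we test with telnet (curl would generate many follow-up HTTP packets, which adds noise).

OK, let's start testing!

First, enable iptables logging on the host: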

# check first
# cat /proc/net/netfilter/nf_log

modprobe nf_log_ipv4
sysctl net.netfilter.nf_log.2=nf_log_ipv4

Then add the TRACE rules:

# for the container-IP test
iptables -t raw -A OUTPUT -p tcp --destination 172.19.0.0/16 --dport 80 -j TRACE

# for the node-IP test
iptables -t raw -A PREROUTING -p tcp --destination 192.168.64.4 --dport 10000 -j TRACE
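
TRACE logs every matching packet, so it is worth removing the rules once the investigation is over. The following is a minimal sketch, assuming the two rules were added exactly as above. The helper name trace_rule is hypothetical; it only prints the commands so they can be reviewed before being run as root.

```shell
# Hypothetical helper: print the iptables command that adds (-A) or
# deletes (-D) one of the two TRACE rules used in this article.
#   $1: action, -A or -D
#   $2: scenario, "container" or "node"
# Nothing is executed here; the command is only printed.
trace_rule() {
  case "$2" in
    container)
      printf '%s\n' "iptables -t raw $1 OUTPUT -p tcp --destination 172.19.0.0/16 --dport 80 -j TRACE" ;;
    node)
      printf '%s\n' "iptables -t raw $1 PREROUTING -p tcp --destination 192.168.64.4 --dport 10000 -j TRACE" ;;
    *)
      printf '%s\n' "unknown scenario: $2" >&2
      return 1 ;;
  esac
}

# Print the cleanup commands for both scenarios.
trace_rule -D container
trace_rule -D node
```

Deleting with -D and the same rule specification avoids having to look up rule numbers in the raw table.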

To view the results, run: grep "TRACE:" /var/log/messages

Access from the host itself: telnet 172.19.0.9 80. Result:

fedora kernel: [ 2350.411619] TRACE: raw:OUTPUT:policy:2 IN= OUT=br-00ea7870520a SRC=172.19.0.1 DST=172.19.0.9 LEN=60 TOS=0x10 PREC=0x00 TTL=64 ID=11432 DF PROTO=TCP SPT=43256 DPT=80 SEQ=2316959420 ACK=0 WINDOW=64240 RES=0x00 SYN URGP=0 OPT (020405B40402080AAF8EE4320000000001030307) UID=0 GID=0 
fedora kernel: [ 2350.411634] TRACE: nat:OUTPUT:policy:2 IN= OUT=br-00ea7870520a SRC=172.19.0.1 DST=172.19.0.9 LEN=60 TOS=0x10 PREC=0x00 TTL=64 ID=11432 DF PROTO=TCP SPT=43256 DPT=80 SEQ=2316959420 ACK=0 WINDOW=64240 RES=0x00 SYN URGP=0 OPT (020405B40402080AAF8EE4320000000001030307) UID=0 GID=0 
fedora kernel: [ 2350.411639] TRACE: filter:OUTPUT:policy:1 IN= OUT=br-00ea7870520a SRC=172.19.0.1 DST=172.19.0.9 LEN=60 TOS=0x10 PREC=0x00 TTL=64 ID=11432 DF PROTO=TCP SPT=43256 DPT=80 SEQ=2316959420 ACK=0 WINDOW=64240 RES=0x00 SYN URGP=0 OPT (020405B40402080AAF8EE4320000000001030307) UID=0 GID=0 
fedora kernel: [ 2350.411644] TRACE: nat:POSTROUTING:policy:4 IN= OUT=br-00ea7870520a SRC=172.19.0.1 DST=172.19.0.9 LEN=60 TOS=0x10 PREC=0x00 TTL=64 ID=11432 DF PROTO=TCP SPT=43256 DPT=80 SEQ=2316959420 ACK=0 WINDOW=64240 RES=0x00 SYN URGP=0 OPT (020405B40402080AAF8EE4320000000001030307) UID=0 GID=0 
fedora kernel: [ 2350.411778] TRACE: raw:OUTPUT:policy:2 IN= OUT=br-00ea7870520a SRC=172.19.0.1 DST=172.19.0.9 LEN=52 TOS=0x10 PREC=0x00 TTL=64 ID=11433 DF PROTO=TCP SPT=43256 DPT=80 SEQ=2316959421 ACK=357062108 WINDOW=502 RES=0x00 ACK URGP=0 OPT (0101080AAF8EE4320C2ACEC2) UID=0 GID=0 
fedora kernel: [ 2350.411784] TRACE: filter:OUTPUT:policy:1 IN= OUT=br-00ea7870520a SRC=172.19.0.1 DST=172.19.0.9 LEN=52 TOS=0x10 PREC=0x00 TTL=64 ID=11433 DF PROTO=TCP SPT=43256 DPT=80 SEQ=2316959421 ACK=357062108 WINDOW=502 RES=0x00 ACK URGP=0 OPT (0101080AAF8EE4320C2ACEC2) UID=0 GID=0 

Access from another node: telnet 192.168.64.4 10000. Result:

fedora kernel: [ 2631.978281] TRACE: raw:PREROUTING:policy:2 IN=enp0s1 OUT= MAC=fa:f3:dc:b4:4e:ef:3a:f9:d3:9e:aa:64:08:00 SRC=192.168.64.1 DST=192.168.64.4 LEN=64 TOS=0x10 PREC=0x00 TTL=64 ID=0 PROTO=TCP SPT=52842 DPT=10000 SEQ=1184533180 ACK=0 WINDOW=65535 RES=0x00 CWR ECE SYN URGP=0 OPT (020405B4010303060101080A6EF8750A0000000004020000) 
fedora kernel: [ 2631.978585] TRACE: nat:PREROUTING:rule:1 IN=enp0s1 OUT= MAC=fa:f3:dc:b4:4e:ef:3a:f9:d3:9e:aa:64:08:00 SRC=192.168.64.1 DST=192.168.64.4 LEN=64 TOS=0x10 PREC=0x00 TTL=64 ID=0 PROTO=TCP SPT=52842 DPT=10000 SEQ=1184533180 ACK=0 WINDOW=65535 RES=0x00 CWR ECE SYN URGP=0 OPT (020405B4010303060101080A6EF8750A0000000004020000) 
fedora kernel: [ 2631.979055] TRACE: nat:DOCKER:rule:3 IN=enp0s1 OUT= MAC=fa:f3:dc:b4:4e:ef:3a:f9:d3:9e:aa:64:08:00 SRC=192.168.64.1 DST=192.168.64.4 LEN=64 TOS=0x10 PREC=0x00 TTL=64 ID=0 PROTO=TCP SPT=52842 DPT=10000 SEQ=1184533180 ACK=0 WINDOW=65535 RES=0x00 CWR ECE SYN URGP=0 OPT (020405B4010303060101080A6EF8750A0000000004020000) 
fedora kernel: [ 2631.979077] TRACE: filter:FORWARD:rule:1 IN=enp0s1 OUT=br-00ea7870520a MAC=fa:f3:dc:b4:4e:ef:3a:f9:d3:9e:aa:64:08:00 SRC=192.168.64.1 DST=172.19.0.9 LEN=64 TOS=0x10 PREC=0x00 TTL=63 ID=0 PROTO=TCP SPT=52842 DPT=80 SEQ=1184533180 ACK=0 WINDOW=65535 RES=0x00 CWR ECE SYN URGP=0 OPT (020405B4010303060101080A6EF8750A0000000004020000) 
fedora kernel: [ 2631.979585] TRACE: filter:DOCKER-USER:return:1 IN=enp0s1 OUT=br-00ea7870520a MAC=fa:f3:dc:b4:4e:ef:3a:f9:d3:9e:aa:64:08:00 SRC=192.168.64.1 DST=172.19.0.9 LEN=64 TOS=0x10 PREC=0x00 TTL=63 ID=0 PROTO=TCP SPT=52842 DPT=80 SEQ=1184533180 ACK=0 WINDOW=65535 RES=0x00 CWR ECE SYN URGP=0 OPT (020405B4010303060101080A6EF8750A0000000004020000) 
fedora kernel: [ 2631.979585] TRACE: filter:FORWARD:rule:2 IN=enp0s1 OUT=br-00ea7870520a MAC=fa:f3:dc:b4:4e:ef:3a:f9:d3:9e:aa:64:08:00 SRC=192.168.64.1 DST=172.19.0.9 LEN=64 TOS=0x10 PREC=0x00 TTL=63 ID=0 PROTO=TCP SPT=52842 DPT=80 SEQ=1184533180 ACK=0 WINDOW=65535 RES=0x00 CWR ECE SYN URGP=0 OPT (020405B4010303060101080A6EF8750A0000000004020000) 
fedora kernel: [ 2631.979585] TRACE: filter:DOCKER-ISOLATION-STAGE-1:return:3 IN=enp0s1 OUT=br-00ea7870520a MAC=fa:f3:dc:b4:4e:ef:3a:f9:d3:9e:aa:64:08:00 SRC=192.168.64.1 DST=172.19.0.9 LEN=64 TOS=0x10 PREC=0x00 TTL=63 ID=0 PROTO=TCP SPT=52842 DPT=80 SEQ=1184533180 ACK=0 WINDOW=65535 RES=0x00 CWR ECE SYN URGP=0 OPT (020405B4010303060101080A6EF8750A0000000004020000) 
fedora kernel: [ 2631.980070] TRACE: filter:FORWARD:rule:8 IN=enp0s1 OUT=br-00ea7870520a MAC=fa:f3:dc:b4:4e:ef:3a:f9:d3:9e:aa:64:08:00 SRC=192.168.64.1 DST=172.19.0.9 LEN=64 TOS=0x10 PREC=0x00 TTL=63 ID=0 PROTO=TCP SPT=52842 DPT=80 SEQ=1184533180 ACK=0 WINDOW=65535 RES=0x00 CWR ECE SYN URGP=0 OPT (020405B4010303060101080A6EF8750A0000000004020000) 
fedora kernel: [ 2631.980070] TRACE: filter:DOCKER:rule:1 IN=enp0s1 OUT=br-00ea7870520a MAC=fa:f3:dc:b4:4e:ef:3a:f9:d3:9e:aa:64:08:00 SRC=192.168.64.1 DST=172.19.0.9 LEN=64 TOS=0x10 PREC=0x00 TTL=63 ID=0 PROTO=TCP SPT=52842 DPT=80 SEQ=1184533180 ACK=0 WINDOW=65535 RES=0x00 CWR ECE SYN URGP=0 OPT (020405B4010303060101080A6EF8750A0000000004020000) 
fedora kernel: [ 2631.980070] TRACE: nat:POSTROUTING:policy:4 IN=enp0s1 OUT=br-00ea7870520a MAC=fa:f3:dc:b4:4e:ef:3a:f9:d3:9e:aa:64:08:00 SRC=192.168.64.1 DST=172.19.0.9 LEN=64 TOS=0x10 PREC=0x00 TTL=63 ID=0 PROTO=TCP SPT=52842 DPT=80 SEQ=1184533180 ACK=0 WINDOW=65535 RES=0x00 CWR ECE SYN URGP=0 OPT (020405B4010303060101080A6EF8750A0000000004020000) 
fedora kernel: [ 2631.981288] TRACE: raw:PREROUTING:policy:2 IN=enp0s1 OUT= MAC=fa:f3:dc:b4:4e:ef:3a:f9:d3:9e:aa:64:08:00 SRC=192.168.64.1 DST=192.168.64.4 LEN=52 TOS=0x10 PREC=0x00 TTL=64 ID=0 PROTO=TCP SPT=52842 DPT=10000 SEQ=1184533181 ACK=3245308350 WINDOW=2058 RES=0x00 ACK URGP=0 OPT (0101080A6EF8750D8E0CA28F) 
fedora kernel: [ 2631.981299] TRACE: filter:FORWARD:rule:1 IN=enp0s1 OUT=br-00ea7870520a MAC=fa:f3:dc:b4:4e:ef:3a:f9:d3:9e:aa:64:08:00 SRC=192.168.64.1 DST=172.19.0.9 LEN=52 TOS=0x10 PREC=0x00 TTL=63 ID=0 PROTO=TCP SPT=52842 DPT=80 SEQ=1184533181 ACK=3245308350 WINDOW=2058 RES=0x00 ACK URGP=0 OPT (0101080A6EF8750D8E0CA28F) 
fedora kernel: [ 2631.981304] TRACE: filter:DOCKER-USER:return:1 IN=enp0s1 OUT=br-00ea7870520a MAC=fa:f3:dc:b4:4e:ef:3a:f9:d3:9e:aa:64:08:00 SRC=192.168.64.1 DST=172.19.0.9 LEN=52 TOS=0x10 PREC=0x00 TTL=63 ID=0 PROTO=TCP SPT=52842 DPT=80 SEQ=1184533181 ACK=3245308350 WINDOW=2058 RES=0x00 ACK URGP=0 OPT (0101080A6EF8750D8E0CA28F) 
fedora kernel: [ 2631.981308] TRACE: filter:FORWARD:rule:2 IN=enp0s1 OUT=br-00ea7870520a MAC=fa:f3:dc:b4:4e:ef:3a:f9:d3:9e:aa:64:08:00 SRC=192.168.64.1 DST=172.19.0.9 LEN=52 TOS=0x10 PREC=0x00 TTL=63 ID=0 PROTO=TCP SPT=52842 DPT=80 SEQ=1184533181 ACK=3245308350 WINDOW=2058 RES=0x00 ACK URGP=0 OPT (0101080A6EF8750D8E0CA28F) 
fedora kernel: [ 2631.981312] TRACE: filter:DOCKER-ISOLATION-STAGE-1:return:3 IN=enp0s1 OUT=br-00ea7870520a MAC=fa:f3:dc:b4:4e:ef:3a:f9:d3:9e:aa:64:08:00 SRC=192.168.64.1 DST=172.19.0.9 LEN=52 TOS=0x10 PREC=0x00 TTL=63 ID=0 PROTO=TCP SPT=52842 DPT=80 SEQ=1184533181 ACK=3245308350 WINDOW=2058 RES=0x00 ACK URGP=0 OPT (0101080A6EF8750D8E0CA28F) 
fedora kernel: [ 2631.981316] TRACE: filter:FORWARD:rule:7 IN=enp0s1 OUT=br-00ea7870520a MAC=fa:f3:dc:b4:4e:ef:3a:f9:d3:9e:aa:64:08:00 SRC=192.168.64.1 DST=172.19.0.9 LEN=52 TOS=0x10 PREC=0x00 TTL=63 ID=0 PROTO=TCP SPT=52842 DPT=80 SEQ=1184533181 ACK=3245308350 WINDOW=2058 RES=0x00 ACK URGP=0 OPT (0101080A6EF8750D8E0CA28F) 
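
The raw log lines are long, and only a few fields matter when following a packet's path. The sketch below condenses each TRACE line to its hop (table:chain:type:number) plus source, destination, and destination port. The helper name trace_summary is hypothetical, and the field layout is assumed to match the log lines shown above.

```shell
# Hypothetical helper: read kernel TRACE lines on stdin and print
# "<table:chain:type:num> <SRC>-><DST>:<DPT>" for each of them.
trace_summary() {
  awk '/TRACE: / {
    hop = src = dst = dpt = ""
    for (i = 1; i <= NF; i++) {
      if ($i == "TRACE:")    hop = $(i + 1)      # field right after the marker
      else if ($i ~ /^SRC=/) src = substr($i, 5)
      else if ($i ~ /^DST=/) dst = substr($i, 5)
      else if ($i ~ /^DPT=/) dpt = substr($i, 5)
    }
    print hop, src "->" dst ":" dpt
  }'
}

# Usage (not executed here):
#   grep "TRACE:" /var/log/messages | trace_summary
```

With the cross-node trace, the destination shown in the summary switches from 192.168.64.4:10000 to 172.19.0.9:80 at the first filter:FORWARD hop, which makes the DNAT step easy to spot.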

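Compare the traces with this diagram

and with the actual rules on the host (docker0-related rules removed, which is why the rule numbers in the traces are higher than the positions in this dump):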

*raw
:PREROUTING ACCEPT [312:33617]
:OUTPUT ACCEPT [335:24610]
-A PREROUTING -d 192.168.64.4/32 -p tcp -m tcp --dport 10000 -j TRACE
-A OUTPUT -d 172.19.0.0/16 -p tcp -m tcp --dport 80 -j TRACE
COMMIT

*nat
:PREROUTING ACCEPT [4:903]
:INPUT ACCEPT [4:903]
:OUTPUT ACCEPT [222:16027]
:POSTROUTING ACCEPT [224:16155]
:DOCKER - [0:0]
-A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER
-A OUTPUT ! -d 127.0.0.0/8 -m addrtype --dst-type LOCAL -j DOCKER
-A POSTROUTING -s 172.19.0.0/16 ! -o br-00ea7870520a -j MASQUERADE
-A POSTROUTING -s 172.19.0.9/32 -d 172.19.0.9/32 -p tcp -m tcp --dport 80 -j MASQUERADE
-A DOCKER -i br-00ea7870520a -j RETURN
-A DOCKER ! -i br-00ea7870520a -p tcp -m tcp --dport 10000 -j DNAT --to-destination 172.19.0.9:80
COMMIT

*filter
:INPUT ACCEPT [308:34040]
:FORWARD DROP [0:0]
:OUTPUT ACCEPT [351:26291]
:DOCKER - [0:0]
:DOCKER-ISOLATION-STAGE-1 - [0:0]
:DOCKER-ISOLATION-STAGE-2 - [0:0]
:DOCKER-USER - [0:0]
-A FORWARD -j DOCKER-USER
-A FORWARD -j DOCKER-ISOLATION-STAGE-1
-A FORWARD -o br-00ea7870520a -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -o br-00ea7870520a -j DOCKER
-A FORWARD -i br-00ea7870520a ! -o br-00ea7870520a -j ACCEPT
-A FORWARD -i br-00ea7870520a -o br-00ea7870520a -j ACCEPT
-A DOCKER -d 172.19.0.9/32 ! -i br-00ea7870520a -o br-00ea7870520a -p tcp -m tcp --dport 80 -j ACCEPT
-A DOCKER-ISOLATION-STAGE-1 -i br-00ea7870520a ! -o br-00ea7870520a -j DOCKER-ISOLATION-STAGE-2
-A DOCKER-ISOLATION-STAGE-1 -j RETURN
-A DOCKER-ISOLATION-STAGE-2 -o br-00ea7870520a -j DROP
-A DOCKER-ISOLATION-STAGE-2 -j RETURN
-A DOCKER-USER -j RETURN
COMMIT

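From this analysis we can see:

  1. In both tests, the two traced packets are the first and third steps of the TCP three-way handshake (the SYN and the final ACK). The second step, the SYN-ACK, travels in the opposite direction and does not match the TRACE rules. This can be verified from the SEQ field, which increases by exactly 1 between the two packets, or by comparing the pkts counter of the iptables rule before and after a test (without -n, iptables -L prints port 10000 as its service name, ndmp):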
iptables -v -t raw -L PREROUTING --line-number
Chain PREROUTING (policy ACCEPT 239 packets, 24919 bytes)
num   pkts bytes target     prot opt in     out     source               destination         
1        8   445 TRACE      tcp  --  any    any     anywhere             fedora               tcp dpt:ndmp
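  2. For a TCP connection, only the first packet traverses the nat table. Subsequent packets belong to an already tracked connection (RELATED,ESTABLISHED) and are passed straight through with the existing translation applied, without evaluating the nat rules again.
  3. For cross-node access, the first packet arrives at the PREROUTING chain and hits the DNAT rule added by Docker (nat:DOCKER, rule 3). The effect is visible between the third and fourth log lines: DST changes from 192.168.64.4 to 172.19.0.9 (and DPT from 10000 to 80).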
iptables -v  -t nat -L DOCKER --line-number
Chain DOCKER (2 references)
num   pkts bytes target     prot opt in     out     source               destination         
1        0     0 RETURN     all  --  docker0 any     anywhere             anywhere            
2        0     0 RETURN     all  --  br-00ea7870520a any     anywhere             anywhere            
3        2   128 DNAT       tcp  --  !br-00ea7870520a any     anywhere             anywhere             tcp dpt:ndmp to:172.19.0.9:80
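  4. After DNAT the packet enters the FORWARD chain, where DOCKER-USER and DOCKER-ISOLATION-STAGE-1 both return. It finally reaches filter:DOCKER via filter:FORWARD:rule:8, where the ACCEPT rule lets it through to the Docker bridge br-00ea7870520a and thus into the container 172.19.0.9. The second packet, again, is RELATED,ESTABLISHED, so it is accepted directly at filter:FORWARD:rule:7 and never needs to visit filter:DOCKER.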
iptables -v  -t filter  -L DOCKER  --line-number
Chain DOCKER (2 references)
num   pkts bytes target     prot opt in     out     source               destination         
1        2   128 ACCEPT     tcp  --  !br-00ea7870520a br-00ea7870520a  anywhere             172.19.0.9           tcp dpt:http

iptables -v  -t filter  -L FORWARD  --line-number
Chain FORWARD (policy DROP 0 packets, 0 bytes)
num   pkts bytes target     prot opt in     out     source               destination         
1       18  1290 DOCKER-USER  all  --  any    any     anywhere             anywhere            
2       18  1290 DOCKER-ISOLATION-STAGE-1  all  --  any    any     anywhere             anywhere            
3        0     0 ACCEPT     all  --  any    docker0  anywhere             anywhere             ctstate RELATED,ESTABLISHED
4        0     0 DOCKER     all  --  any    docker0  anywhere             anywhere            
5        0     0 ACCEPT     all  --  docker0 !docker0  anywhere             anywhere            
6        0     0 ACCEPT     all  --  docker0 docker0  anywhere             anywhere            
7        8   421 ACCEPT     all  --  any    br-00ea7870520a  anywhere             anywhere             ctstate RELATED,ESTABLISHED
8        2   128 DOCKER     all  --  any    br-00ea7870520a  anywhere             anywhere            
9        8   741 ACCEPT     all  --  br-00ea7870520a !br-00ea7870520a  anywhere             anywhere            
10       0     0 ACCEPT     all  --  br-00ea7870520a br-00ea7870520a  anywhere             anywhere    
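
The SEQ check from the first point can be done mechanically. The sketch below prints the distinct SEQ values found in a set of TRACE lines, in order of first appearance. The helper name trace_seqs is hypothetical, and the log format is assumed to match the output shown earlier.

```shell
# Hypothetical helper: read kernel TRACE lines on stdin and print each
# distinct SEQ value once, in the order it first appears.
trace_seqs() {
  awk '/TRACE: / {
    for (i = 1; i <= NF; i++)
      if ($i ~ /^SEQ=/) {
        seq = substr($i, 5)                  # keep as a string, no rounding
        if (!(seq in seen)) { seen[seq] = 1; print seq }
      }
  }'
}

# Usage (not executed here):
#   grep "TRACE:" /var/log/messages | trace_seqs
```

For the cross-node test this yields 1184533180 and 1184533181: two packets, one sequence number apart. The SYN consumes one sequence number, so the client's final handshake ACK carries the initial sequence number plus one.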

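Addendum

One last note, for newer kernels where iptables uses the nf_tables backend rather than iptables-legacy:

# older
iptables -V
iptables v1.8.8 (legacy)
# newer
iptables -V
iptables v1.8.8 (nf_tables)

In that case, use nft monitor trace or xtables-monitor --trace to observe the results. This is friendlier than grepping messages: each packet gets its own id, which makes the output easier to follow. The way tracing is set up also differs, but we will not go into that here; interested readers can dig into it on their own.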
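
Picking the right viewer can be scripted from the version string. The sketch below assumes the iptables -V output carries one of the two suffixes shown above; the helper name trace_viewer_for is hypothetical, and a version string without either suffix is simply reported as unknown.

```shell
# Hypothetical helper: read the output of `iptables -V` on stdin and
# print the command this article uses to view traces for that backend.
trace_viewer_for() {
  case "$(cat)" in
    *"(nf_tables)"*) printf '%s\n' 'xtables-monitor --trace' ;;
    *"(legacy)"*)    printf '%s\n' 'grep "TRACE:" /var/log/messages' ;;
    *)               printf '%s\n' 'unknown backend' >&2
                     return 1 ;;
  esac
}

# Usage (not executed here):
#   iptables -V | trace_viewer_for
```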
