TCP Implementation in Linux: A Brief Tutorial
Translated by 核心小王子 (WeChat public account). Original: Helali Bhuiyan, Mark McGinley, Tao Li, and Malathi Veeraraghavan, University of Virginia.
A. Introduction
This document provides a brief overview of how TCP is implemented in Linux. It is not meant to be comprehensive, nor do we assert that it is without inaccuracies.
B. TCP implementation in Linux
Figures 1 and 2 show the internals of the TCP implementation in the Linux kernel. Fig. 1 shows the path taken by a new packet from the wire to a user application. The Linux kernel uses an sk_buff data structure to describe each packet. When a packet arrives at the NIC, it invokes the DMA engine to place the packet into kernel memory via empty sk_buffs stored in a ring buffer called rx_ring. An incoming packet is dropped if the ring buffer is full. When a packet is processed at higher layers, packet data remains in the same kernel memory, avoiding any extra memory copies.
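As a rough illustration of the drop-when-full behavior of the rx_ring, here is a minimal userspace sketch in C. It is a toy model under assumed sizes, not the kernel's actual sk_buff or driver code:

    /* Toy model of a fixed-size receive ring that drops packets when
     * full, mirroring the rx_ring behavior described above. */
    #include <string.h>

    #define RING_SIZE 8                 /* illustrative; real rings are larger */

    struct pkt { int len; unsigned char data[1500]; };

    struct rx_ring {
        struct pkt slots[RING_SIZE];
        int head, tail, count;
    };

    /* Returns 0 on success, -1 if the ring is full (packet dropped). */
    int rx_ring_enqueue(struct rx_ring *r, const struct pkt *p)
    {
        if (r->count == RING_SIZE)
            return -1;                  /* full: incoming packet is dropped */
        memcpy(&r->slots[r->tail], p, sizeof(*p));
        r->tail = (r->tail + 1) % RING_SIZE;
        r->count++;
        return 0;
    }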
Once a packet is successfully received, the NIC raises an interrupt to the CPU, which processes each incoming packet and passes it to the IP layer. The IP layer performs its processing on each packet, and passes it up to the TCP layer if it is a TCP packet. The TCP process is then scheduled to handle received packets. Each packet in TCP goes through a series of complex processing steps. The TCP state machine is updated, and finally the packet is stored inside the TCP recv buffer.
A critical parameter for tuning TCP is the size of the recv buffer at the receiver. The number of packets a TCP sender is able to have outstanding (unacknowledged) is the minimum of the congestion window (cwnd) and the receiver's advertised window (rwnd). The maximum size of the receiver's advertised window is the TCP recv buffer size. Hence, if the size of the recv buffer is smaller than the bandwidth-delay product (BDP) of the end-to-end path, the achievable throughput will be low. On the other hand, a large recv buffer allows a correspondingly large number of packets to remain outstanding, possibly exceeding the number of packets an end-to-end path can sustain. The size of the recv buffer can be set by modifying the /proc/sys/net/ipv4/tcp_rmem variable. It takes three different values, i.e., min, default, and max. The min value defines the minimum receive buffer size even when the operating system is under hard memory pressure. The default is the default size of the receive buffer, which is used together with the TCP window scaling factor to calculate the actual advertised window. The max defines the maximum size of the receive buffer.
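To make the BDP comparison concrete, the sketch below reads tcp_rmem and checks its max value against the BDP of an assumed path; the 1 Gbit/s bottleneck rate and 50 ms RTT are illustrative numbers, not measurements:

    #include <stdio.h>

    int main(void)
    {
        long min, def, max;
        FILE *f = fopen("/proc/sys/net/ipv4/tcp_rmem", "r");

        if (!f || fscanf(f, "%ld %ld %ld", &min, &def, &max) != 3) {
            perror("tcp_rmem");
            return 1;
        }
        fclose(f);

        /* Assumed path: 1 Gbit/s bottleneck, 50 ms RTT. */
        double bdp_bytes = (1e9 / 8.0) * 0.050;   /* = 6.25 MB */

        printf("tcp_rmem: min=%ld default=%ld max=%ld\n", min, def, max);
        printf("BDP = %.0f bytes; recv buffer max %s the BDP\n", bdp_bytes,
               max >= bdp_bytes ? "covers" : "is smaller than");
        return 0;
    }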
Also at the receiver, the parameter netdev_max_backlog dictates the maximum number of packets queued at a device, waiting to be processed by the TCP receiving process. If a newly received packet, when added to the queue, would cause the queue to exceed netdev_max_backlog, then it is discarded.
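A quick way to check this limit is to read the corresponding /proc file; a minimal sketch (raising the value requires writing the same file as root):

    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/proc/sys/net/core/netdev_max_backlog", "r");
        int backlog;

        if (f && fscanf(f, "%d", &backlog) == 1)
            printf("netdev_max_backlog = %d packets\n", backlog);
        if (f)
            fclose(f);
        return 0;
    }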
On the sender, as shown in Fig. 2, a user application writes the data into the TCP send buffer by calling the write() system call. Like the TCP recv buffer, the send buffer is a crucial parameter for getting maximum throughput. The maximum size of the congestion window is related to the amount of send buffer space allocated to the TCP socket. The send buffer holds all outstanding packets (for potential retransmission) as well as all data queued to be transmitted. Therefore, the congestion window can never grow larger than the send buffer can accommodate. If the send buffer is too small, the congestion window will not fully open, limiting the throughput. On the other hand, a large send buffer allows the congestion window to grow to a large value. If not constrained by the TCP recv buffer, the number of outstanding packets will also grow as the congestion window grows, causing packet loss if the end-to-end path cannot hold the large number of outstanding packets. The size of the send buffer can be set by modifying the /proc/sys/net/ipv4/tcp_wmem variable, which also takes three different values, i.e., min, default, and max.
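Besides the system-wide tcp_wmem setting, an application can size an individual socket's buffer with the SO_SNDBUF socket option (SO_RCVBUF on the receive side). The sketch below is illustrative; the 4 MB target is an assumption, and an explicit setsockopt() disables the kernel's autotuning for that socket and is capped by net.core.wmem_max:

    #include <stdio.h>
    #include <sys/socket.h>

    int main(void)
    {
        int s = socket(AF_INET, SOCK_STREAM, 0);
        int sndbuf = 4 * 1024 * 1024;   /* assumed 4 MB target */
        int actual;
        socklen_t len = sizeof(actual);

        if (setsockopt(s, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf)) < 0)
            perror("SO_SNDBUF");

        /* The kernel doubles the requested value to account for overhead. */
        getsockopt(s, SOL_SOCKET, SO_SNDBUF, &actual, &len);
        printf("send buffer: requested %d, kernel set %d\n", sndbuf, actual);
        return 0;
    }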
The analogue to the receiver's netdev_max_backlog is the sender's txqueuelen. The TCP layer builds packets when data is available in the send buffer, or ACK packets in response to data packets received. Each packet is pushed down to the IP layer for transmission. The IP layer enqueues each packet in an output queue (qdisc) associated with the NIC. The size of the qdisc can be modified by assigning a value to the txqueuelen variable associated with each NIC device. If the output queue is full, the attempt to enqueue a packet generates a local-congestion event, which is propagated upward to the TCP layer. The TCP congestion-control algorithm then enters the Congestion Window Reduced (CWR) state, and reduces the congestion window by one every other ACK (known as rate halving). After a packet is successfully queued inside the output queue, the packet descriptor (sk_buff) is placed in the output ring buffer tx_ring. When packets are available inside the ring buffer, the device driver invokes the NIC DMA engine to transmit packets onto the wire.
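The txqueuelen of a device can be read programmatically with the SIOCGIFTXQLEN ioctl (SIOCSIFTXQLEN sets it, given sufficient privileges). A minimal sketch, assuming an interface named eth0:

    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <net/if.h>
    #include <linux/sockios.h>

    int main(void)
    {
        int s = socket(AF_INET, SOCK_DGRAM, 0);
        struct ifreq ifr;

        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);   /* assumed device name */

        if (ioctl(s, SIOCGIFTXQLEN, &ifr) < 0) {
            perror("SIOCGIFTXQLEN");
            return 1;
        }
        printf("eth0 txqueuelen = %d packets\n", ifr.ifr_qlen);
        return 0;
    }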
While the above parameters dictate the flow-control profile of a connection, the congestion-control behavior can also have a large impact on the throughput. TCP uses one of several congestion control algorithms to match its sending rate with the bottleneck-link rate. Over a connectionless network, a large number of TCP flows and other types of traffic share the same bottleneck link. As the number of flows sharing the bottleneck link changes, the available bandwidth for a certain TCP flow varies. Packets get lost when the sending rate of a TCP flow is higher than the available bandwidth. On a circuit, by contrast, packets are not lost to competition with other flows, since bandwidth is reserved. However, when a fast sender is connected to a circuit with a lower rate, packets can get lost due to buffer overflow at the switch.
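On Linux, the congestion control algorithm can also be selected per socket with the TCP_CONGESTION socket option, avoiding system-wide changes. A sketch; whether a given algorithm such as "bic" is accepted depends on which modules the running kernel provides:

    #include <stdio.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>

    int main(void)
    {
        int s = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
        char cc[16];
        socklen_t len = sizeof(cc);

        if (getsockopt(s, IPPROTO_TCP, TCP_CONGESTION, cc, &len) == 0)
            printf("default congestion control: %s\n", cc);

        /* Fails if the bic module is not available on this kernel. */
        if (setsockopt(s, IPPROTO_TCP, TCP_CONGESTION, "bic", 3) < 0)
            perror("TCP_CONGESTION");
        return 0;
    }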
When a TCP connection is set up, a TCP sender uses ACK packets as a 'clock', known as ACK-clocking, to inject new packets into the network [1]. Since TCP receivers cannot send ACK packets faster than the bottleneck-link rate, a TCP sender's transmission rate while under ACK-clocking is matched with the bottleneck-link rate. In order to start the ACK-clock, a TCP sender uses the slow-start mechanism. During the slow-start phase, for each ACK packet received, a TCP sender transmits two data packets back-to-back. Since ACK packets are coming at the bottleneck-link rate, the sender is essentially transmitting data twice as fast as the bottleneck link can sustain. The slow-start phase ends when the size of the congestion window grows beyond ssthresh. In many congestion control algorithms, such as BIC [2], the initial slow-start threshold (ssthresh) can be adjusted, as can other factors such as the maximum increment, to make BIC more or less aggressive. However, like changing the buffers via the sysctl function, these are system-wide changes which could adversely affect other ongoing and future connections. A TCP sender is allowed to send the minimum of the congestion window and the receiver's advertised window, in packets. Therefore, the number of outstanding packets is doubled in each round-trip time, unless bounded by the receiver's advertised window. As packets are being forwarded at the bottleneck-link rate, doubling the number of outstanding packets in each round-trip time will also double the buffer occupancy inside the bottleneck switch. Eventually, there will be packet losses inside the bottleneck switch once the buffer overflows.
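One way to observe this behavior on a live connection is the TCP_INFO socket option, which exposes the sender's current congestion window and slow-start threshold. A minimal sketch; sock is assumed to be an already-connected TCP socket:

    #include <stdio.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>

    /* Print cwnd and ssthresh (in segments) for a connected TCP socket. */
    void print_cwnd(int sock)
    {
        struct tcp_info ti;
        socklen_t len = sizeof(ti);

        if (getsockopt(sock, IPPROTO_TCP, TCP_INFO, &ti, &len) == 0)
            printf("cwnd=%u ssthresh=%u\n",
                   ti.tcpi_snd_cwnd, ti.tcpi_snd_ssthresh);
    }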
After packet loss occurs, a TCP sender enters the congestion avoidance phase. During congestion avoidance, the congestion window is increased by one packet in each round-trip time. As ACK packets are coming at the bottleneck-link rate, the congestion window keeps growing, as does the number of outstanding packets. Therefore, packets will get lost again once the number of outstanding packets grows larger than the buffer size in the bottleneck switch plus the number of packets on the wire.
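The two growth regimes can be summarized with a toy model: below ssthresh the window doubles each round-trip time, above it the window grows by one packet per round-trip time, until the path's capacity is exceeded and loss occurs. All numbers are illustrative:

    #include <stdio.h>

    int main(void)
    {
        unsigned cwnd = 1, ssthresh = 64, path_capacity = 100; /* segments */

        for (int rtt = 0; cwnd <= path_capacity; rtt++) {
            printf("RTT %2d: cwnd=%3u (%s)\n", rtt, cwnd,
                   cwnd < ssthresh ? "slow start" : "congestion avoidance");
            cwnd = cwnd < ssthresh ? cwnd * 2 : cwnd + 1;
        }
        printf("cwnd exceeds path capacity -> loss expected\n");
        return 0;
    }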
There are many other parameters that are relevant to the operation of TCP in Linux, and each is at least briefly explained in the documentation included in the distribution (Documentation/networking/ip-sysctl.txt). An example of a configurable parameter in the TCP implementation is the RFC 2861 congestion window restart function. RFC 2861 proposes restarting the congestion window if the sender is idle for a period of time (one RTO). The purpose is to ensure that the congestion window reflects the current state of the network. If the connection has been idle, the congestion window may reflect an obsolete view of the network and so is reset. This behavior can be disabled using the sysctl tcp_slow_start_after_idle but, again, this change affects all connections system-wide.
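For completeness, a sketch of turning the idle restart off system-wide, equivalent to sysctl -w net.ipv4.tcp_slow_start_after_idle=0 (requires root, and affects every connection on the host):

    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/proc/sys/net/ipv4/tcp_slow_start_after_idle", "w");

        if (!f) {
            perror("tcp_slow_start_after_idle");
            return 1;
        }
        fputs("0\n", f);   /* 0 disables the RFC 2861 idle restart */
        fclose(f);
        return 0;
    }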