Notes on TCP/IP Illustrated, Volume 1, and implementation ideas for ping and traceroute

Posted by s1mba on 2013-10-21
I. Notes on the TCP protocol

Normally TCP does not send an ACK the instant it receives data. Instead, it delays the ACK, hoping to have data going in the same direction as the ACK, so the ACK can be sent along with the data. (This is sometimes called having the ACK piggyback with the data.) Most implementations use a 200-ms delay; that is, TCP will delay an ACK up to 200 ms to see if there is data to send with the ACK.

The Nagle algorithm says that a TCP connection can have only one outstanding small segment that has not yet been acknowledged. No additional small segments can be sent until the acknowledgment is received. Instead, small amounts of data are collected by TCP and sent in a single segment when the acknowledgment arrives. The beauty of this algorithm is that it is self-clocking: the faster the ACKs come back, the faster the data is sent. But on a slow WAN, where it is desired to reduce the number of tinygrams, fewer segments are sent.

This shows that TCP can acknowledge received data before the application has read and processed that data. The TCP acknowledgment just means TCP has correctly received the data. We also have an indication that the server process has not read these 3 bytes of data because the advertised window in the final segment is 8189, not 8192.

There are times when the Nagle algorithm needs to be turned off. The classic example is the X Window System server: small messages (mouse movements) must be delivered without delay to provide real-time feedback for interactive users doing certain operations.
The sockets API uses the TCP_NODELAY socket option to disable the Nagle algorithm.
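
As a minimal sketch of the TCP_NODELAY option mentioned above, the helpers below set and read the option with setsockopt and getsockopt. The function names disable_nagle and nagle_disabled are made up for this note; only the socket calls and the option name come from the sockets API.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Disable the Nagle algorithm on a TCP socket (hypothetical helper name).
 * Returns 0 on success, -1 on error with errno set by setsockopt. */
int disable_nagle(int fd)
{
    int on = 1;
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
}

/* Report whether TCP_NODELAY is set: 1 = set, 0 = not set, -1 = error. */
int nagle_disabled(int fd)
{
    int value = 0;
    socklen_t len = sizeof(value);

    if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &value, &len) < 0)
        return -1;
    return value != 0;
}
```

A caller would typically invoke disable_nagle(fd) right after creating or accepting the socket, before any small interactive writes are made.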

Here we'll show another example that's easier to demonstrate: typing one of the terminal's special function keys during an interactive login. The function keys normally generate multiple bytes of data, often beginning with the ASCII escape character. If TCP gets the data 1 byte at a time, it's possible for it to send the first byte (the ASCII ESC) and then hold the remaining bytes of the sequence waiting for the ACK of this byte. But when the server receives this first byte it doesn't generate an echo until the remaining bytes are received. This often triggers the delayed ACK algorithm on the server, meaning that the remaining bytes aren't sent for up to 200 ms. This can lead to noticeable delays to the interactive user.

**************************************************************************************
With TCP's sliding-window protocol the receiver does not have to acknowledge every received packet. With TCP, the ACKs are cumulative: they acknowledge that the receiver has correctly received all bytes up through the acknowledged sequence number minus one.

The ordering of the packets that we see on the wire depends on many factors, most of which we have no control over: the sending TCP implementation, the receiving TCP implementation, the reading of data by the receiving process (which depends on the process scheduling by the operating system), and the dynamics of the network (i.e., Ethernet collisions and backoffs). There is no single correct way for two TCPs to exchange a given amount of data.

The sockets API allows a process to set the sizes of the send buffer and the receive buffer. The size of the receive buffer is the maximum size of the advertised window (that is, the window size field in the TCP header) for that connection. Some applications change the socket buffer sizes to increase performance.
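
A small sketch of how a process might set these buffer sizes, using the standard SO_SNDBUF and SO_RCVBUF socket options. The helper names are invented for this note, and the sizes a caller passes are only requests: the kernel may round or clamp them, which is why the second helper reads the value back.

```c
#include <sys/socket.h>

/* Request new send and receive buffer sizes (hypothetical helper name).
 * Returns 0 on success, -1 on error. The kernel may adjust the values. */
int set_socket_buffers(int fd, int send_bytes, int recv_bytes)
{
    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &send_bytes, sizeof(send_bytes)) < 0)
        return -1;
    return setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &recv_bytes, sizeof(recv_bytes));
}

/* Read back the receive buffer size actually in effect, or -1 on error. */
int get_recv_buffer(int fd)
{
    int bytes = 0;
    socklen_t len = sizeof(bytes);

    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, &len) < 0)
        return -1;
    return bytes;
}
```

Since the receive buffer bounds the advertised window, it is usually safest to set it before the connection is established.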

*****************************************************************************************************
In all the examples we've seen so far in this chapter, the sender starts off by injecting multiple segments into the network, up to the window size advertised by the receiver. While this is OK when the two hosts are on the same LAN, if there are routers and slower links between the sender and the receiver, problems can arise. Some intermediate router must queue the packets, and it's possible for that router to run out of space.

TCP is now required to support an algorithm called slow start. It operates by observing that the rate at which new packets should be injected into the network is the rate at which the acknowledgments are returned by the other end.

Slow start adds another window to the sender's TCP: the congestion window, called cwnd. When a new connection is established with a host on another network, the congestion window is initialized to one segment (i.e., the segment size announced by the other end). Each time an ACK is received, the congestion window is increased by one segment. (cwnd is maintained in bytes, but slow start always increments it by the segment size.) The sender can transmit up to the minimum of the congestion window and the advertised window. The congestion window is flow control imposed by the sender, while the advertised window is flow control imposed by the receiver.

The sender starts by transmitting one segment and waiting for its ACK. When that ACK is received, the congestion window is incremented from one to two, and two segments can be sent. When each of those two segments is acknowledged, the congestion window is increased to four. This provides an exponential increase. At some point the capacity of the internet can be reached, and an intermediate router will start discarding packets. This tells the sender that its congestion window has gotten too large; how this is handled, and what happens to the congestion window, is covered in the congestion avoidance notes below.
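
The exponential growth described above can be sketched with a small simulation. It is an idealized model, not a TCP implementation: it assumes no loss, no delayed ACKs, and that every full segment sent in one round trip is acknowledged individually. The function name slow_start_cwnd is made up for this note.

```c
/* Idealized slow start: return cwnd in bytes after `rounds` round trips.
 * Assumptions (not from any real stack): no loss, one ACK per full
 * segment, and the sender always fills min(cwnd, adv_window). */
unsigned long slow_start_cwnd(unsigned long mss, unsigned long adv_window,
                              unsigned int rounds)
{
    unsigned long cwnd = mss;                 /* start with one segment */
    unsigned int r;

    for (r = 0; r < rounds; r++) {
        /* The sender may transmit up to min(cwnd, advertised window). */
        unsigned long window = cwnd < adv_window ? cwnd : adv_window;
        unsigned long acks = window / mss;    /* one ACK per segment sent */

        cwnd += acks * mss;                   /* each ACK adds one segment */
    }
    return cwnd;
}
```

With a 512-byte segment and a large advertised window, cwnd goes 512, 1024, 2048, 4096 over three round trips. Once the advertised window is the limit, the number of ACKs per round stops growing and so does the doubling.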
********************************************************************
Congestion can occur when data arrives on a big pipe (a fast LAN) and gets sent out a smaller pipe (a slower WAN). Congestion can also occur when multiple input streams arrive at a router whose output capacity is less than the sum of the inputs.
*********************************************************************
TCP itself says little more about urgent data. There is no way to specify where the urgent data starts in the data stream. The only information sent across the connection by TCP is that urgent mode has begun (the URG bit in the TCP header) and the pointer to the last byte of urgent data. Everything else is left to the application.

Unfortunately many implementations incorrectly call TCP's urgent mode out-of-band data. If an application really wants a separate out-of-band channel, a second TCP connection is the easiest way to accomplish this. (Some transport layers do provide what most people consider true out-of-band data: a logically separate data path using the same connection as the normal data path. This is not what TCP provides.)

The confusion between TCP's urgent mode and out-of-band data is also because the predominant programming interface, the sockets API, maps TCP's urgent mode into what sockets calls out-of-band data.
******************************************************************************
TCP manages four different timers for each connection.
1. A retransmission timer is used when expecting an acknowledgment from the other end.
2. A persist timer keeps window size information flowing even if the other end closes its receive window.
3. A keepalive timer detects when the other end on an otherwise idle connection crashes or reboots.
4. A 2MSL timer measures the time a connection has been in the TIME_WAIT state.
*****************************************************************************************
TCP calculates the round-trip time and then uses these measurements to keep track of a smoothed RTT estimator and a smoothed mean deviation estimator. These two estimators are then used to calculate the next retransmission timeout value. Many implementations only measure a single RTT per window. Karn's algorithm removes the retransmission ambiguity problem by preventing us from measuring the RTT when a packet is lost.
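
To make the two estimators concrete, here is a sketch of a Jacobson-style update using gains of 1/8 for the smoothed RTT and 1/4 for the mean deviation, with RTO = srtt + 4 * rttvar. The struct and function names are made up for this note, and seeding the estimators from the first sample (srtt = sample, rttvar = sample / 2) is an assumption borrowed from later practice rather than something stated above.

```c
/* Smoothed RTT state, all values in seconds (hypothetical names). */
struct rtt_estimator {
    double srtt;    /* smoothed round-trip time */
    double rttvar;  /* smoothed mean deviation */
    int valid;      /* 0 until the first sample has been taken */
};

void rtt_init(struct rtt_estimator *e)
{
    e->srtt = 0.0;
    e->rttvar = 0.0;
    e->valid = 0;
}

/* Feed one RTT measurement. Following Karn's algorithm, the caller must
 * not pass samples taken from segments that were retransmitted. */
void rtt_sample(struct rtt_estimator *e, double measured)
{
    double err, abs_err;

    if (!e->valid) {                 /* assumed seeding from first sample */
        e->srtt = measured;
        e->rttvar = measured / 2.0;
        e->valid = 1;
        return;
    }
    err = measured - e->srtt;
    abs_err = err < 0.0 ? -err : err;
    e->srtt += err / 8.0;                        /* gain 1/8 */
    e->rttvar += (abs_err - e->rttvar) / 4.0;    /* gain 1/4 */
}

/* Next retransmission timeout derived from the two estimators. */
double rtt_rto(const struct rtt_estimator *e)
{
    return e->srtt + 4.0 * e->rttvar;
}
```

A steady RTT shrinks the deviation term and pulls the RTO toward the RTT itself, while a sudden jump in RTT widens the deviation and raises the RTO quickly.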
*****************************************************************************************
Indeed, Berkeley-derived implementations count the number of duplicate ACKs received, and when the third one is received, assume that a segment has been lost and retransmit only one segment, starting with that sequence number. This is Jacobson's fast retransmit algorithm, which is followed by his fast recovery algorithm.
*****************************************************************************************
Slow start is the way to initiate data flow across a connection. But at some point we'll reach the limit of an intervening router, and packets can be dropped. Congestion avoidance is a way to deal with lost packets.

The assumption of the algorithm is that packet loss caused by damage is very small (much less than 1%), therefore the loss of a packet signals congestion somewhere in the network between the source and destination. There are two indications of packet loss: a timeout occurring and the receipt of duplicate ACKs. (If we are using a timeout as an indication of congestion, we can see the need for a good RTT algorithm to compute the retransmission timeout (RTO).)

Congestion avoidance and slow start are independent algorithms with different objectives. But when congestion occurs we want to slow down the transmission rate of packets into the network, and then invoke slow start to get things going again. In practice they are implemented together.

Congestion avoidance and slow start require that two variables be maintained for each connection: a congestion window, cwnd, and a slow start threshold size, ssthresh. The combined algorithm operates as follows:

1. Initialization for a given connection sets cwnd to one segment and ssthresh to 65535 bytes.
2. The TCP output routine never sends more than the minimum of cwnd and the receiver's advertised window.

Congestion avoidance is flow control imposed by the sender, while the advertised window is flow control imposed by the receiver. The former is based on the sender's assessment of perceived network congestion; the latter is related to the amount of available buffer space at the receiver for this connection.

3. When congestion occurs (indicated by a timeout or the reception of duplicate ACKs), one-half of the current window size (the minimum of cwnd and the receiver's advertised window, but at least two segments) is saved in ssthresh. Additionally, if the congestion is indicated by a timeout, cwnd is set to one segment (i.e., slow start).

4. When new data is acknowledged by the other end, we increase cwnd, but the way it increases depends on whether we're performing slow start or congestion avoidance. If cwnd is less than or equal to ssthresh, we're doing slow start; otherwise we're doing congestion avoidance. Slow start continues until we're halfway to where we were when congestion occurred (since we recorded half of the window size that got us into trouble in step 3), and then congestion avoidance takes over.

Slow start has cwnd start at one segment, and be incremented by one segment every time an ACK is received. This opens the window exponentially: send one segment, then two, then four, and so on.

Congestion avoidance dictates that cwnd be incremented by 1/cwnd each time an ACK is received (with cwnd counted in segments; when cwnd is kept in bytes the increment is segsize * segsize / cwnd). This is an additive increase, compared to slow start's exponential increase. We want to increase cwnd by at most one segment each round-trip time (regardless of how many ACKs are received in that RTT), whereas slow start will increment cwnd by the number of ACKs received in a round-trip time.
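
The four steps of the combined algorithm can be sketched as a small state machine. This is an illustration of the rules as written in these notes, not code from any real TCP stack; the struct and function names are made up, cwnd is kept in bytes, and the congestion avoidance increment is computed as mss * mss / cwnd.

```c
#include <stdint.h>

/* Per-connection congestion state in bytes (hypothetical names). */
struct cc_state {
    uint32_t mss;       /* segment size */
    uint32_t cwnd;      /* congestion window */
    uint32_t ssthresh;  /* slow start threshold */
};

static uint32_t min_u32(uint32_t a, uint32_t b)
{
    return a < b ? a : b;
}

/* Step 1: cwnd starts at one segment, ssthresh at 65535 bytes. */
void cc_init(struct cc_state *s, uint32_t mss)
{
    s->mss = mss;
    s->cwnd = mss;
    s->ssthresh = 65535;
}

/* Step 2: never send more than min(cwnd, advertised window). */
uint32_t cc_send_limit(const struct cc_state *s, uint32_t adv_window)
{
    return min_u32(s->cwnd, adv_window);
}

/* Step 3: on congestion, save half the current window (at least two
 * segments) in ssthresh; on a timeout also drop cwnd to one segment. */
void cc_on_congestion(struct cc_state *s, uint32_t adv_window, int timeout)
{
    uint32_t half = min_u32(s->cwnd, adv_window) / 2;

    s->ssthresh = half < 2 * s->mss ? 2 * s->mss : half;
    if (timeout)
        s->cwnd = s->mss;
}

/* Step 4: an ACK of new data grows cwnd, exponentially per round trip
 * in slow start and by about one segment per round trip afterwards. */
void cc_on_new_ack(struct cc_state *s)
{
    if (s->cwnd <= s->ssthresh)
        s->cwnd += s->mss;                       /* slow start */
    else
        s->cwnd += s->mss * s->mss / s->cwnd;    /* congestion avoidance */
}
```

For example, with a 1000-byte segment, a timeout at cwnd = 4000 sets ssthresh to 2000 and cwnd to 1000; the next two ACKs add 1000 bytes each, and the third adds only 333 because cwnd has passed ssthresh.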

Before describing the change, realize that TCP is required to generate an immediate acknowledgment (a duplicate ACK) when an out-of-order segment is received. This duplicate ACK should not be delayed. The purpose of this duplicate ACK is to let the other end know that a segment was received out of order, and to tell it what sequence number is expected.
*****************************************************************************************
Since we don't know whether a duplicate ACK is caused by a lost segment or just a reordering of segments, we wait for a small number of duplicate ACKs to be received. It is assumed that if there is just a reordering of the segments, there will be only one or two duplicate ACKs before the reordered segment is processed, which will then generate a new ACK. If three or more duplicate ACKs are received in a row, it is a strong indication that a segment has been lost. We then perform a retransmission of what appears to be the missing segment, without waiting for a retransmission timer to expire. This is the fast retransmit algorithm. Next, congestion avoidance, but not slow start, is performed. This is the fast recovery algorithm.

1. When the third duplicate ACK is received, set ssthresh to one-half the current congestion window, cwnd. Retransmit the missing segment. Set cwnd to ssthresh plus 3 times the segment size.

2. Each time another duplicate ACK arrives, increment cwnd by the segment size and transmit a packet (if allowed by the new value of cwnd).

3. When the next ACK arrives that acknowledges new data, set cwnd to ssthresh (the value set in step 1). This should be the ACK of the retransmission from step 1, one round-trip time after the retransmission. Additionally, this ACK should acknowledge all the intermediate segments sent between the lost packet and the receipt of the first duplicate ACK. This step is congestion avoidance, since we're slowing down to one-half the rate we were at when the packet was lost.
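
The three steps above can be sketched as a duplicate-ACK counter that adjusts cwnd and ssthresh. The names are made up for this note, the sketch follows the steps literally (ssthresh is simply half of cwnd), and it leaves normal window growth on new ACKs to whatever slow start or congestion avoidance code the caller already has.

```c
#include <stdint.h>

/* Fast retransmit / fast recovery state in bytes (hypothetical names). */
struct fr_state {
    uint32_t mss;
    uint32_t cwnd;
    uint32_t ssthresh;
    unsigned int dup_acks;   /* consecutive duplicate ACKs seen */
};

/* Call for every duplicate ACK. Returns 1 when the caller should
 * retransmit the missing segment immediately, 0 otherwise. */
int fr_on_dup_ack(struct fr_state *s)
{
    s->dup_acks++;
    if (s->dup_acks < 3)
        return 0;                          /* may just be reordering */
    if (s->dup_acks == 3) {
        /* Step 1: halve, retransmit, inflate by the three dup ACKs. */
        s->ssthresh = s->cwnd / 2;
        s->cwnd = s->ssthresh + 3 * s->mss;
        return 1;
    }
    /* Step 2: each further duplicate ACK opens cwnd by one segment. */
    s->cwnd += s->mss;
    return 0;
}

/* Call when an ACK for new data arrives. */
void fr_on_new_ack(struct fr_state *s)
{
    if (s->dup_acks >= 3)
        s->cwnd = s->ssthresh;             /* step 3: deflate the window */
    s->dup_acks = 0;
}
```

Starting from cwnd = 8000 with a 1000-byte segment, the third duplicate ACK sets ssthresh to 4000 and cwnd to 7000, a fourth raises cwnd to 8000, and the ACK of the retransmission brings cwnd back down to 4000.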
*****************************************************************************************
If an acknowledgment carrying a window update is lost, we could end up with both sides waiting for the other: the receiver waiting to receive data (since it provided the sender with a nonzero window) and the sender waiting to receive the window update allowing it to send. To prevent this form of deadlock from occurring the sender uses a persist timer that causes it to query the receiver periodically, to find out if the window has been increased. These segments from the sender are called window probes.

TCP's persist timer is set by one end of a connection when it has data to send, but has been stopped because the other end has advertised a zero-sized window. The sender keeps probing the closed window using a retransmission interval. This probing of the closed window continues indefinitely.
*****************************************************************************************
There are times, however, when a server wants to know if the client's host has either crashed and is down, or crashed and rebooted. The keepalive timer, a feature of many implementations, provides this capability.

If there is no activity on a given connection for 2 hours, the server sends a probe segment to the client.
The client host must be in one of four states.

1. The client host is still up and running and reachable from the server. The client's TCP responds normally and the server knows that the other end is still up. The server's TCP will reset the keepalive timer for 2 hours in the future. If there is application traffic across the connection before the next 2-hour timer expires, the timer is reset for 2 hours in the future, following the exchange of data.

2. The client's host has crashed and is either down or in the process of rebooting. In either case, its TCP is not responding. The server will not receive a response to its probe and it times out after 75 seconds. The server sends a total of 10 of these probes, 75 seconds apart, and if it doesn't receive a response, the server considers the client's host as down and terminates the connection.

3. The client's host has crashed and rebooted. Here the server will receive a response to its keepalive probe, but the response will be a reset, causing the server to terminate the connection.

4. The client's host is up and running, but unreachable from the server. This is the same as scenario 2, because TCP can't distinguish between the two. All it can tell is that no replies are received to its probes.
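
For reference, a server opts in to this behavior per socket with the standard SO_KEEPALIVE option. The sketch below only turns the option on and reads it back; the helper names are made up, and the 2-hour idle time and 75-second probe interval described above are left at whatever the system defaults are.

```c
#include <sys/socket.h>

/* Enable keepalive probes on a socket (hypothetical helper name).
 * Returns 0 on success, -1 on error. */
int enable_keepalive(int fd)
{
    int on = 1;
    return setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on));
}

/* Report whether SO_KEEPALIVE is set: 1 = set, 0 = not set, -1 = error. */
int keepalive_enabled(int fd)
{
    int value = 0;
    socklen_t len = sizeof(value);

    if (getsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &value, &len) < 0)
        return -1;
    return value != 0;
}
```

With the option set, scenarios 2 to 4 above surface to the application as an error on a later read or write of the socket.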
*****************************************************************************************************
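II. Implementation ideas for ping and traceroute

(1) Implementing ping

1. Overview of the relevant ICMP messages
  Only the ICMP message types related to ping are covered here: the host sends an echo message (Type = 8), and the target host returns an echo reply message (Type = 0). Both share the same basic format:
  Echo message [ECHO]: Type = 8, Code, Checksum, Identifier, Sequence Number, Data
  Echo reply message [ECHOREPLY]: Type = 0, Code, Checksum, Identifier, Sequence Number, Data

   where
•Code = 0
•Checksum is the checksum. Note that it covers everything from the start of the ICMP header (the Type field) to the end of the data (that is, to the end of the whole packet); see the code for the exact calculation.
•Identifier is set by the sending host, usually to the process ID; the echo reply carries the same Identifier as the echo message.
•Sequence Number is set by the sending host, usually counting up from 0; the echo reply carries the same Sequence Number as the echo message.
•Data is set by the sending host; the echo reply carries the same data as the echo message.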
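
As an illustration of the fields above, the sketch below builds an echo message in a byte buffer and computes the standard Internet checksum (16-bit one's complement sum) over the ICMP header and data. It is not the code from the referenced blog post: the function names are made up, and writing Identifier and Sequence Number in network byte order is a choice made for this sketch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ICMP_HDR_LEN 8   /* Type, Code, Checksum, Identifier, Sequence */

/* Internet checksum over len bytes, from the Type field to the end of
 * the data. The result is meant to be stored high byte first. */
uint16_t icmp_checksum(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
    if (i < len)                         /* odd length: pad with zero */
        sum += (uint32_t)buf[i] << 8;
    while (sum >> 16)                    /* fold carries back in */
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Fill pkt with an echo message (Type = 8, Code = 0). Returns the
 * packet length, or 0 if the buffer is too small. */
size_t build_echo_request(uint8_t *pkt, size_t cap, uint16_t id,
                          uint16_t seq, const void *data, size_t data_len)
{
    size_t total = ICMP_HDR_LEN + data_len;
    uint16_t ck;

    if (cap < total)
        return 0;
    pkt[0] = 8;                          /* Type: echo */
    pkt[1] = 0;                          /* Code */
    pkt[2] = 0;                          /* Checksum is zero while summing */
    pkt[3] = 0;
    pkt[4] = (uint8_t)(id >> 8);         /* Identifier */
    pkt[5] = (uint8_t)(id & 0xFF);
    pkt[6] = (uint8_t)(seq >> 8);        /* Sequence Number */
    pkt[7] = (uint8_t)(seq & 0xFF);
    if (data_len > 0)
        memcpy(pkt + ICMP_HDR_LEN, data, data_len);
    ck = icmp_checksum(pkt, total);
    pkt[2] = (uint8_t)(ck >> 8);
    pkt[3] = (uint8_t)(ck & 0xFF);
    return total;
}
```

A handy property for checking a received packet: running icmp_checksum over a packet whose checksum field is already filled in yields 0 when the packet is intact.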


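2. Ping flow
Ping simply uses ICMP ECHO and ICMP ECHOREPLY packets to probe whether a host is present, so the flow of a ping program is very simple: send ICMP ECHO ----> receive ICMP ECHOREPLY
  When sending an ICMP ECHO packet, set Identifier to the process ID, set Sequence Number to a counter increasing from 0, and fill data with the send time.
  When receiving an ICMP ECHOREPLY packet, check that Identifier and Sequence Number are correct, and use the source address field of the IP header to check that the reply came from the expected host.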
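
The receive-side checks can be sketched as one function over the raw bytes read from the socket. It assumes the buffer starts with the IPv4 header, as is usual for a raw ICMP socket, and that Identifier and Sequence Number were sent in network byte order. The function name and parameter layout are made up for this note.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Return 1 if buf holds an IPv4 datagram from expected_src carrying an
 * ICMP echo reply (Type = 0, Code = 0) with the given Identifier and
 * Sequence Number, otherwise 0. */
int is_matching_echo_reply(const uint8_t *buf, size_t len,
                           const uint8_t expected_src[4],
                           uint16_t id, uint16_t seq)
{
    size_t ihl;
    const uint8_t *icmp;
    uint16_t got_id, got_seq;

    if (len < 20 || (buf[0] >> 4) != 4)
        return 0;                          /* not a complete IPv4 header */
    ihl = (size_t)(buf[0] & 0x0F) * 4;     /* header length in bytes */
    if (ihl < 20 || len < ihl + 8)
        return 0;                          /* no room for an ICMP header */
    if (buf[9] != 1)
        return 0;                          /* protocol field is not ICMP */
    if (memcmp(buf + 12, expected_src, 4) != 0)
        return 0;                          /* reply from another host */

    icmp = buf + ihl;
    if (icmp[0] != 0 || icmp[1] != 0)
        return 0;                          /* not an echo reply */
    got_id = (uint16_t)((icmp[4] << 8) | icmp[5]);
    got_seq = (uint16_t)((icmp[6] << 8) | icmp[7]);
    return got_id == id && got_seq == seq;
}
```

Filtering on Identifier matters because a raw ICMP socket may also see replies meant for other ping processes on the same host.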

 

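3. A ping clone: pingy
The basic flow of ping has been covered above.

  Because we build the ICMP packet ourselves, we need to create a raw socket (that is, we fill in the header ourselves):
sockfd = socket(AF_INET, SOCK_RAW, IPPROTO_ICMP);
SOCK_RAW gives direct access to the network layer, and the application is responsible for building its own protocol header; IPPROTO_ICMP means the ICMP header is built by the program.
  Build the ICMP header, taking care with each field, especially the checksum (see the definition of the icmp structure):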
icmp->icmp_type
icmp->icmp_code
icmp->icmp_cksum
icmp->icmp_id
icmp->icmp_seq
icmp->icmp_data;

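  A complete system ping command also needs statistics, such as bytes sent, bytes received, packets sent, packets received, send time, and TTL. It also needs signal handling, so that the summary is printed when the user exits with Control-C or similar.

The implementation is mainly based on http://blog.csdn.net/qy532846454/article/details/5429700, with parts of the code modified; it can now ping domain names.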

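(2) Implementing traceroute

1. Overview of the relevant ICMP messages

Only the ICMP message types related to traceroute are covered here: a gateway sends a time exceeded message (type = 11), and the destination host sends a destination unreachable message (type = 3). The basic formats are as follows:

Time exceeded message

        where code = 0 indicates that the message was sent by a gateway

Note that code = 1 indicates a timeout during IP fragment reassembly.

The one we've been describing is generated when the TTL reaches 0, and is specified by a code of 0.

It's also possible for a host to send an ICMP "time exceeded during reassembly" when it times out during the reassembly of a fragmented datagram. This error is specified by a code of 1.

Destination unreachable message

        where code = 3 indicates that the port is unavailable on the destination host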

 

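2. Traceroute flow

First, note that TTL is a field in the IP header. It gives the datagram's time to live, that is, how many more hops it may traverse, so a datagram sent with TTL = 1 is discarded by the next router it reaches, which sends back a time exceeded message.

Second, to detect that the destination host has been reached, the destination port is set to a port that is unlikely to be in use (58127), so that the host replies with a destination unreachable message when it receives the datagram.
The traceroute flow is as follows:
     1> Build a UDP datagram and set TTL = 1.
     2> Send the UDP datagram and record the send time t1.
     3> Receive the ICMP error packet. If it is a time exceeded message, it came from an intermediate router: record the router information and the receive time t2, and compute the elapsed time (t2 - t1). If it is a destination unreachable message, the destination host has been reached: record the receive time t2, print the information, and exit.
     4> Build a UDP datagram with TTL += 1 and return to step 2.
Three UDP packets are sent for each TTL value (TTL = 1, 2, ...), that is, steps 2 and 3 are repeated three times.
   If a receive times out, print "*" to indicate that the packet was lost.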
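
Setting the TTL in steps 1 and 4 does not require building the IP header by hand: on an ordinary UDP socket it can be done with the standard IP_TTL socket option. The helpers below are a sketch with made-up names, not the code from the referenced post.

```c
#include <netinet/in.h>
#include <sys/socket.h>

/* Set the TTL used for datagrams sent on this socket (hypothetical
 * helper name). Returns 0 on success, -1 on error. */
int set_ttl(int fd, int ttl)
{
    return setsockopt(fd, IPPROTO_IP, IP_TTL, &ttl, sizeof(ttl));
}

/* Read back the TTL currently in effect, or -1 on error. */
int get_ttl(int fd)
{
    int ttl = 0;
    socklen_t len = sizeof(ttl);

    if (getsockopt(fd, IPPROTO_IP, IP_TTL, &ttl, &len) < 0)
        return -1;
    return ttl;
}
```

A probe loop would call set_ttl(fd, ttl) once per hop, send the three UDP probes with sendto, and wait for the ICMP replies on a separate raw ICMP socket with a receive timeout.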


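The implementation is mainly based on http://blog.csdn.net/qy532846454/article/details/5443718, but testing revealed some problems, so I modified parts of the code and added more type checks on the returned ICMP messages. For example, a type 3 unreachable message actually has 16 possible codes; even when the code is not 3 (port unreachable), the other unreachable cases must also break out of the loop and print a notice, otherwise the program hangs there. If the received packet is not an ICMP message, the program keeps receiving until recvfrom times out and then prints '*'.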
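
The type checks described above could look roughly like the following. This is a sketch of the decision logic only, with made-up names, and not the author's actual code: it maps the ICMP type and code of a reply to what the probe loop should do next.

```c
/* What the traceroute loop should do with a received ICMP message
 * (hypothetical names). */
enum probe_result {
    PROBE_IGNORE,        /* unrelated message: keep receiving */
    PROBE_HOP,           /* time exceeded in transit: intermediate router */
    PROBE_DEST_REACHED,  /* port unreachable: destination host reached */
    PROBE_UNREACHABLE    /* other unreachable code: report it and stop */
};

enum probe_result classify_icmp(int type, int code)
{
    if (type == 11 && code == 0)
        return PROBE_HOP;
    if (type == 3)
        return code == 3 ? PROBE_DEST_REACHED : PROBE_UNREACHABLE;
    return PROBE_IGNORE;
}
```

Treating every type 3 message as terminal is what keeps the loop from hanging when, for example, a router answers with host unreachable instead of the expected port unreachable.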

Reference:
TCP/IP Illustrated, Volume 1: The Protocols

