Understanding How to Use epoll Through a Complete Example

Posted by LynnShaw on 2015-10-29

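Network servers are traditionally implemented with a separate process or thread per connection. For high-performance applications that have to handle a very large number of clients simultaneously, this approach does not work well, because factors such as resource usage and context-switching time limit how many clients can be served at once. An alternative is to perform non-blocking I/O in a single thread, together with some readiness notification method that tells you when more data can be read from or written to a socket.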

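This article is an introduction to Linux's epoll(7) facility, which is the best readiness notification facility in Linux. We will write sample code in C for a complete TCP server. I assume you have some C programming experience, know how to compile and run programs on Linux, and can read the manual pages of the various C functions that are used.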

epoll was introduced in Linux 2.6 and is not available on other UNIX-like operating systems. It provides functionality similar to the select(2) and poll(2) functions:

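select(2) can monitor up to FD_SETSIZE descriptors at a time; FD_SETSIZE is typically a small number fixed when libc is compiled (1024 with glibc).
poll(2) has no fixed limit on the number of descriptors it can monitor at a time, but, other factors aside, every call requires a linear scan of all the passed descriptors to check for readiness notifications, which is O(n) and slow.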

epoll has no such fixed limits and does not perform any linear scans. Hence it can perform better and handle a larger number of events.
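For contrast, here is a minimal sketch of the linear scan that poll(2) forces on the caller. The helper name count_readable and its 64-descriptor cap are assumptions made for illustration; they are not part of the article's server.

```c
#include <poll.h>

/* Hypothetical helper: polls up to 64 descriptors for input and returns
   how many of them are readable, or -1 on error. */
int
count_readable (const int *fds, int nfds, int timeout_ms)
{
  struct pollfd pfds[64];
  int i, ready, readable = 0;

  if (nfds < 0 || nfds > 64)
    return -1;

  for (i = 0; i < nfds; i++)
    {
      pfds[i].fd = fds[i];
      pfds[i].events = POLLIN;
      pfds[i].revents = 0;
    }

  ready = poll (pfds, (nfds_t) nfds, timeout_ms);
  if (ready == -1)
    return -1;

  /* poll() reports how many entries are ready, not which ones, so the
     whole array has to be walked after every call: O(n). */
  for (i = 0; i < nfds; i++)
    if (pfds[i].revents & POLLIN)
      readable++;

  return readable;
}
```

With epoll_wait(2), by comparison, the kernel fills the output array with ready descriptors only, so the loop runs over the events that actually occurred.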

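An epoll instance is created by epoll_create(2) or epoll_create1(2) (they take different arguments), which return a file descriptor referring to the epoll instance. epoll_ctl(2) is used to add descriptors to, or remove them from, the set watched by the epoll instance. epoll_wait(2) waits for events on the watched descriptors, blocking until events are available. See the respective manual pages for more information.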
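The three calls fit together roughly as in the sketch below. It builds a throwaway epoll instance to wait for a single descriptor; the helper name wait_readable is an assumption for illustration, and a real server would keep one long-lived instance instead.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: waits up to timeout_ms for fd to become readable.
   Returns 1 if it is readable, 0 on timeout, and -1 on error or if the
   descriptor reported something other than input. */
int
wait_readable (int fd, int timeout_ms)
{
  struct epoll_event event = { 0 }, ready;
  int efd, n;

  efd = epoll_create1 (0);              /* create the epoll instance */
  if (efd == -1)
    return -1;

  event.events = EPOLLIN;               /* level-triggered input events */
  event.data.fd = fd;
  if (epoll_ctl (efd, EPOLL_CTL_ADD, fd, &event) == -1)
    {
      close (efd);
      return -1;
    }

  n = epoll_wait (efd, &ready, 1, timeout_ms);
  close (efd);

  if (n == 1 && ready.data.fd == fd && (ready.events & EPOLLIN))
    return 1;
  return n == 0 ? 0 : -1;
}
```

Passing -1 as the timeout makes epoll_wait(2) block indefinitely, which is what the server in this article does.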

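When descriptors are added to an epoll instance, they can be added in two modes: level-triggered and edge-triggered (translator's note: the terms are borrowed from electronic circuits). In level-triggered mode, as long as data is available to read, epoll_wait(2) always returns with a ready event. If you have not read the data completely and call epoll_wait(2) again on the epoll instance watching the descriptor, it returns with a ready event again because data is still available. In edge-triggered mode, you get a readiness notification only once. If you do not read the data fully and call epoll_wait(2) again on the epoll instance watching the descriptor, it blocks until new data arrives, because the ready event was already delivered.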
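The difference can be observed with a small experiment on a pipe, sketched below. The function name second_wait_after_partial_read is an assumption for illustration. It uses a zero timeout so that the edge-triggered case returns immediately instead of blocking.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical experiment: write two bytes to a pipe, watch its read end
   with the given trigger flags (0 for level-triggered, EPOLLET for
   edge-triggered), read only one byte after the first notification, then
   wait again. Returns the result of the second epoll_wait(), or -1 if
   the setup fails. */
int
second_wait_after_partial_read (unsigned int trigger_flags)
{
  struct epoll_event event = { 0 }, ready;
  int p[2], efd, n = -1;
  char c;

  if (pipe (p) == -1)
    return -1;
  efd = epoll_create1 (0);
  if (efd == -1)
    goto out_pipe;

  event.events = EPOLLIN | trigger_flags;
  event.data.fd = p[0];
  if (epoll_ctl (efd, EPOLL_CTL_ADD, p[0], &event) == -1)
    goto out;
  if (write (p[1], "ab", 2) != 2)
    goto out;

  /* First wait: both modes report the read end as ready. */
  if (epoll_wait (efd, &ready, 1, 0) != 1)
    goto out;

  /* Consume only half of the pending data. */
  if (read (p[0], &c, 1) != 1)
    goto out;

  /* Second wait: one byte is still unread. */
  n = epoll_wait (efd, &ready, 1, 0);

out:
  close (efd);
out_pipe:
  close (p[0]);
  close (p[1]);
  return n;
}
```

Called with 0, the second wait should report one ready descriptor because unread data remains. Called with EPOLLET, it should report none, since no new data arrived after the first notification.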

The epoll event structure passed to epoll_ctl(2) is shown below. With every descriptor being watched, you can associate an integer or a pointer as user data.

typedef union epoll_data
{
  void        *ptr;
  int          fd;
  __uint32_t   u32;
  __uint64_t   u64;
} epoll_data_t;

struct epoll_event
{
  __uint32_t   events; /* Epoll events */
  epoll_data_t data;   /* User data variable */
};
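The server in this article stores the descriptor itself in data.fd. As a sketch of the pointer alternative, the code below attaches a per-connection structure through data.ptr. The struct connection type and the watch_with_state helper are assumptions for illustration and do not appear in the article's code.

```c
#include <stdlib.h>
#include <sys/epoll.h>

/* Hypothetical per-connection state. */
struct connection
{
  int fd;
  unsigned long bytes_seen;
};

/* Hypothetical helper: watches fd for input on efd and attaches a freshly
   allocated struct connection as user data. Returns the structure, or
   NULL on failure. The caller owns the returned memory. */
struct connection *
watch_with_state (int efd, int fd)
{
  struct epoll_event event = { 0 };
  struct connection *conn = calloc (1, sizeof *conn);

  if (conn == NULL)
    return NULL;
  conn->fd = fd;

  event.events = EPOLLIN;
  /* data is a union: use either ptr or fd, not both. */
  event.data.ptr = conn;
  if (epoll_ctl (efd, EPOLL_CTL_ADD, fd, &event) == -1)
    {
      free (conn);
      return NULL;
    }

  return conn;
}
```

When epoll_wait(2) later reports an event for this descriptor, events[i].data.ptr holds the same pointer, so the handler can reach its state without a lookup table.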

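Let's write code now. We'll implement a tiny TCP server that prints everything sent to the socket on standard output. We'll begin by writing a function create_and_bind(), which creates and binds a TCP socket: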

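/* Headers needed by the complete program */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/epoll.h>
#include <errno.h>

static int
create_and_bind (char *port)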
{
  struct addrinfo hints;
  struct addrinfo *result, *rp;
  int s, sfd;

  memset (&hints, 0, sizeof (struct addrinfo));
  hints.ai_family = AF_UNSPEC;     /* Return IPv4 and IPv6 choices */
  hints.ai_socktype = SOCK_STREAM; /* We want a TCP socket */
  hints.ai_flags = AI_PASSIVE;     /* All interfaces */

  s = getaddrinfo (NULL, port, &hints, &result);
  if (s != 0)
    {
      fprintf (stderr, "getaddrinfo: %s\n", gai_strerror (s));
      return -1;
    }

  for (rp = result; rp != NULL; rp = rp->ai_next)
    {
      sfd = socket (rp->ai_family, rp->ai_socktype, rp->ai_protocol);
      if (sfd == -1)
        continue;

      s = bind (sfd, rp->ai_addr, rp->ai_addrlen);
      if (s == 0)
        {
          /* We managed to bind successfully! */
          break;
        }

      close (sfd);
    }

  if (rp == NULL)
    {
      fprintf (stderr, "Could not bind\n");
      return -1;
    }

  freeaddrinfo (result);

  return sfd;
}

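create_and_bind() contains a standard block of code for getting an IPv4 or IPv6 socket in a portable way. It accepts a port argument as a string, for which argv[1] can be passed. The getaddrinfo(3) function returns a list of addrinfo structures in result that are compatible with the hints passed in the hints argument. The addrinfo structure looks like this: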

struct addrinfo
{
  int              ai_flags;
  int              ai_family;
  int              ai_socktype;
  int              ai_protocol;
  socklen_t        ai_addrlen;
  struct sockaddr *ai_addr;
  char            *ai_canonname;
  struct addrinfo *ai_next;
};

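We walk through the structures one by one and try to create sockets with them, until we are able to both create and bind a socket. If we succeed, create_and_bind() returns the socket descriptor. If it fails, it returns -1.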
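To see what the list returned by getaddrinfo(3) looks like before any socket is created, the walk can be isolated as in the sketch below. The helper name count_bind_candidates is an assumption for illustration; it uses the same hints as create_and_bind() and only counts the entries.

```c
#include <netdb.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: returns how many candidate addresses
   getaddrinfo() offers for a passive TCP socket on the given port,
   or -1 if the lookup fails. */
int
count_bind_candidates (const char *port)
{
  struct addrinfo hints, *result, *rp;
  int count = 0;

  memset (&hints, 0, sizeof hints);
  hints.ai_family = AF_UNSPEC;     /* IPv4 and IPv6 */
  hints.ai_socktype = SOCK_STREAM; /* TCP */
  hints.ai_flags = AI_PASSIVE;     /* wildcard address, for bind() */

  if (getaddrinfo (NULL, port, &hints, &result) != 0)
    return -1;

  /* Same traversal as in create_and_bind(), following ai_next. */
  for (rp = result; rp != NULL; rp = rp->ai_next)
    count++;

  freeaddrinfo (result);
  return count;
}
```

On a machine with both IPv4 and IPv6 configured, the list usually contains one entry per address family, which is why create_and_bind() keeps trying until one of them binds.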

Next, let's write a function to make a socket non-blocking. make_socket_non_blocking() sets the O_NONBLOCK flag on the descriptor passed in the sfd argument:

static int
make_socket_non_blocking (int sfd)
{
  int flags, s;

  flags = fcntl (sfd, F_GETFL, 0);
  if (flags == -1)
    {
      perror ("fcntl");
      return -1;
    }

  flags |= O_NONBLOCK;
  s = fcntl (sfd, F_SETFL, flags);
  if (s == -1)
    {
      perror ("fcntl");
      return -1;
    }

  return 0;
}
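The effect of O_NONBLOCK can be checked without a network connection. The sketch below applies the same fcntl(2) sequence to the read end of an empty pipe and verifies that read(2) fails at once with EAGAIN instead of blocking. The function name is an assumption for illustration.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical check: returns 1 if reading from an empty pipe whose read
   end has O_NONBLOCK set fails immediately with EAGAIN or EWOULDBLOCK,
   0 if it behaves differently, and -1 if the setup fails. */
int
empty_nonblocking_read_gives_eagain (void)
{
  int p[2], flags, result = -1;
  ssize_t count;
  char c;

  if (pipe (p) == -1)
    return -1;

  /* Same two steps as make_socket_non_blocking(). */
  flags = fcntl (p[0], F_GETFL, 0);
  if (flags == -1 || fcntl (p[0], F_SETFL, flags | O_NONBLOCK) == -1)
    goto out;

  count = read (p[0], &c, 1);
  result = (count == -1 && (errno == EAGAIN || errno == EWOULDBLOCK));

out:
  close (p[0]);
  close (p[1]);
  return result;
}
```

This EAGAIN result is the signal the event loop relies on to know that a descriptor has been drained.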

Now, on to the main() function of the program, which contains the event loop. This is the bulk of the program:

#define MAXEVENTS 64

int
main (int argc, char *argv[])
{
  int sfd, s;
  int efd;
  struct epoll_event event;
  struct epoll_event *events;

  if (argc != 2)
    {
      fprintf (stderr, "Usage: %s [port]\n", argv[0]);
      exit (EXIT_FAILURE);
    }

  sfd = create_and_bind (argv[1]);
  if (sfd == -1)
    abort ();

  s = make_socket_non_blocking (sfd);
  if (s == -1)
    abort ();

  s = listen (sfd, SOMAXCONN);
  if (s == -1)
    {
      perror ("listen");
      abort ();
    }

  efd = epoll_create1 (0);
  if (efd == -1)
    {
      perror ("epoll_create");
      abort ();
    }

  event.data.fd = sfd;
  event.events = EPOLLIN | EPOLLET;
  s = epoll_ctl (efd, EPOLL_CTL_ADD, sfd, &event);
  if (s == -1)
    {
      perror ("epoll_ctl");
      abort ();
    }

  /* Buffer where events are returned */
  events = calloc (MAXEVENTS, sizeof event);

  /* The event loop */
  while (1)
    {
      int n, i;

      n = epoll_wait (efd, events, MAXEVENTS, -1);
      for (i = 0; i < n; i++)
        {
          if ((events[i].events & EPOLLERR) ||
              (events[i].events & EPOLLHUP) ||
              (!(events[i].events & EPOLLIN)))
            {
              /* An error has occurred on this fd, or the socket is not
                 ready for reading (why were we notified then?) */
              fprintf (stderr, "epoll error\n");
              close (events[i].data.fd);
              continue;
            }

          else if (sfd == events[i].data.fd)
            {
              /* We have a notification on the listening socket, which
                 means one or more incoming connections. */
              while (1)
                {
                  struct sockaddr in_addr;
                  socklen_t in_len;
                  int infd;
                  char hbuf[NI_MAXHOST], sbuf[NI_MAXSERV];

                  in_len = sizeof in_addr;
                  infd = accept (sfd, &in_addr, &in_len);
                  if (infd == -1)
                    {
                      if ((errno == EAGAIN) ||
                          (errno == EWOULDBLOCK))
                        {
                          /* We have processed all incoming
                             connections. */
                          break;
                        }
                      else
                        {
                          perror ("accept");
                          break;
                        }
                    }

                  s = getnameinfo (&in_addr, in_len,
                                   hbuf, sizeof hbuf,
                                   sbuf, sizeof sbuf,
                                   NI_NUMERICHOST | NI_NUMERICSERV);
                  if (s == 0)
                    {
                      printf ("Accepted connection on descriptor %d "
                              "(host=%s, port=%s)\n", infd, hbuf, sbuf);
                    }

                  /* Make the incoming socket non-blocking and add it to the
                     list of fds to monitor. */
                  s = make_socket_non_blocking (infd);
                  if (s == -1)
                    abort ();

                  event.data.fd = infd;
                  event.events = EPOLLIN | EPOLLET;
                  s = epoll_ctl (efd, EPOLL_CTL_ADD, infd, &event);
                  if (s == -1)
                    {
                      perror ("epoll_ctl");
                      abort ();
                    }
                }
              continue;
            }
          else
            {
              /* We have data on the fd waiting to be read. Read and
                 display it. We must read whatever data is available
                 completely, as we are running in edge-triggered mode
                 and won't get a notification again for the same
                 data. */
              int done = 0;

              while (1)
                {
                  ssize_t count;
                  char buf[512];

                  count = read (events[i].data.fd, buf, sizeof buf);
                  if (count == -1)
                    {
                      /* If errno == EAGAIN, that means we have read all
                         data. So go back to the main loop. */
                      if (errno != EAGAIN)
                        {
                          perror ("read");
                          done = 1;
                        }
                      break;
                    }
                  else if (count == 0)
                    {
                      /* End of file. The remote has closed the
                         connection. */
                      done = 1;
                      break;
                    }

                  /* Write the buffer to standard output */
                  s = write (1, buf, count);
                  if (s == -1)
                    {
                      perror ("write");
                      abort ();
                    }
                }

              if (done)
                {
                  printf ("Closed connection on descriptor %d\n",
                          events[i].data.fd);

                  /* Closing the descriptor will make epoll remove it
                     from the set of descriptors which are monitored. */
                  close (events[i].data.fd);
                }
            }
        }
    }

  free (events);

  close (sfd);

  return EXIT_SUCCESS;
}

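main() first calls create_and_bind(), which sets up the socket. It then makes the socket non-blocking and calls listen(2). Next it creates an epoll instance in efd and adds the listening socket sfd to it, to watch for input events in edge-triggered mode (EPOLLIN | EPOLLET).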

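The outer while loop is the main event loop. It calls epoll_wait(2), where the thread remains blocked waiting for events. When events are available, epoll_wait(2) returns them in the events argument, which is an array of epoll_event structures.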

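The epoll instance in efd is continuously updated in the event loop as we add new incoming connections to watch and remove existing connections when they terminate.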

When events are available, they can be of three types:

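Errors: when an error condition occurs, or the event is not a notification that data is available for reading, we simply close the associated descriptor. Closing the descriptor automatically removes it from the watch list of the epoll instance efd (as long as no duplicate of the descriptor is still open).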
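New connections: when the listening descriptor sfd is ready for reading, one or more new connections have arrived. For each new connection, we accept it with accept(2), print a message about it, make the incoming socket non-blocking and add it to the watch list of the epoll instance efd.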
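Client data: when data is available for reading on any of the client descriptors, we use read(2) to read it in pieces of up to 512 bytes in an inner while loop. We must read all the data that is available right now, because the descriptor is watched in edge-triggered mode and we will not get another event for the same data. The data read is written to standard output (fd=1) using write(2). If read(2) returns 0, it means EOF and we can close the client's connection. If it returns -1 with errno set to EAGAIN, all the data for this event has been read and we can go back to the main loop.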

That's it. It goes around in a loop, adding and removing descriptors in the watch list.
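The article's code removes descriptors implicitly by closing them. If a descriptor may have been duplicated, for example with dup(2) or across fork(2), the entry stays registered until every duplicate is closed, so an explicit removal can be safer. The sketch below shows one way to do it. The helper name unwatch_and_close is an assumption for illustration, and passing NULL as the event argument to EPOLL_CTL_DEL assumes Linux 2.6.9 or later.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: removes fd from the watch list of efd explicitly,
   then closes it. Returns 0 on success and -1 if either step fails. */
int
unwatch_and_close (int efd, int fd)
{
  /* The event argument is ignored for EPOLL_CTL_DEL. */
  int rc = epoll_ctl (efd, EPOLL_CTL_DEL, fd, NULL);

  if (close (fd) == -1)
    return -1;

  return rc;
}
```

In the event loop this would be called in place of the bare close() on a client descriptor.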

Download the epoll-example.c code.

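Update 1: The definitions of level-triggered and edge-triggered were mistakenly reversed (though the code was correct). This was noticed by Reddit user bodski. The article has been corrected now. I should have proofread it before posting. Apologies, and thank you for pointing out the mistake. :)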

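Update 2: The code has been modified to keep calling accept(2) until it would block, so that if multiple connections have arrived, we accept all of them. This was suggested by Reddit user pitchford. Thank you for your comments. :)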

通過示例學習使用 netstat
2017-12-11
如何通過同理心地圖（情景地圖）來更好的理解你的使用者？
2018-01-07
地圖
通過例項來理解MySQL索引薦
2014-09-25
MySql索引
理解select、epoll
2020-10-28
通過BitSet原始碼來理解BitMap演算法
2019-03-19
原始碼演算法
Webpack.devServer 配置項如何使用？附devServer完整示例
2023-11-19
WebdevServer
Docker 下開發 hyperf 完整使用示例
2020-01-08
Docker
通過示例學習PYTORCH
2022-02-11
PyTorch
如何通過程式來查詢表名
2013-08-04
開發者談如何通過遊戲社群更好地理解玩家
2021-10-11
遊戲
通過示例瞭解Vue過渡和動畫
2022-01-28
Vue動畫
如何在 Linux 系統中通過使用者組來管理使用者
2017-12-06
Linux
C# 通過ServiceStack 操作Redis——Set型別的使用及示例
2021-03-14
C#Redis型別
epoll使用與原理
2024-06-12
通過c++示例解釋回撥
2018-10-06
C++
Spring MVC 完整示例
2016-06-21
SpringMVC
通過一個案例理解 JWT
2018-09-20
JWT
通過實現仿照FeignClient框架原理的示例來看清FeignClient的本質
2021-11-14
client框架
Spark SQL 教程：通過示例瞭解 Spark SQL
2021-12-29
SparkSQL
通過示例學習Python列表推導
2015-01-05
Python
如何通過emca來修改Database Control HTTP 埠
2013-05-27
DatabaseHTTP
使用emca命令列配置EM並通過瀏覽器訪問EM示例
2013-11-15
命令列瀏覽器
c#12 實驗特性Interceptor如何使用的一個簡單但完整的示例
2024-08-06
C#
Linux下EPoll通訊模型簡析
2015-06-30
Linux模型
通過MOVE PARTITION來回收已經使用的空間
2019-06-04
通過OpenGL理解前端渲染原理（1）
2019-07-31
前端
使用Laravel框架，怎麼通過訪問/xxxx/ooo.php也通過路由來使用
2021-05-25
Laravel框架PHP路由
[譯] Swift：通過示例避免記憶體洩漏
2019-02-18
Swift記憶體
Android通過startService實現批量下載示例
2015-08-27
Android
通過開發 Babel 外掛來理解什麼是抽象語法樹（AST）
2019-06-24
Babel抽象語法樹AST
jmeter通過cookies來登入
2018-06-23
JMeterCookie
通過shell來比較oracle和java中的字串使用
2015-02-17
OracleJava字串
通過使用 IBM Rational來測試 SIP 應用程式
2009-04-14
IBM
PoweJob高階特性-MapReduce完整示例
2022-07-08
通過實驗理解PG邏輯結構：1 使用者（角色）
2022-03-04
通過 bilibili 的 discovery 理解下 cap
2021-06-08
對epoll機制的學習理解v1
2021-10-20
如何通過以太坊智慧合約來進行眾籌（ICO）
2018-03-01

通過完整示例來理解如何使用 epoll

相關文章