There is not much more to say about how to use poll and epoll. When the number of fds is large, epoll is far more efficient than poll; let's work through the kernel source to see exactly why.
Dissecting poll
The poll system call:
int poll(struct pollfd *fds, nfds_t nfds, int timeout);
The corresponding implementation is:
[fs/select.c -->sys_poll]
asmlinkage long sys_poll(struct pollfd __user * ufds, unsigned int nfds, long timeout)
{
struct poll_wqueues table;
int fdcount, err;
unsigned int i;
struct poll_list *head;
struct poll_list *walk;
/* Do a sanity check on nfds ... */ /* the nfds value given by the user must not exceed the
maximum number of fds supported by the process's file table (256 by default) */
if (nfds > current->files->max_fdset && nfds > OPEN_MAX)
return -EINVAL;
if (timeout) {
/* Careful about overflow in the intermediate values */
if ((unsigned long) timeout < MAX_SCHEDULE_TIMEOUT / HZ)
timeout = (unsigned long)(timeout*HZ+999)/1000+1;
else /* Negative or overflow */
timeout = MAX_SCHEDULE_TIMEOUT;
}
poll_initwait(&table);
poll_initwait is the interesting part here. As the name suggests, it initializes the variable table; note that table is a key variable throughout the entire poll run. And struct poll_table actually contains nothing but a function pointer:
[fs/poll.h]
/*
* structures and helpers for f_op->poll implementations
*/
typedef void (*poll_queue_proc)(struct file *, wait_queue_head_t *, struct poll_table_struct *);
typedef struct poll_table_struct {
poll_queue_proc qproc;
} poll_table;
Now let's see what poll_initwait actually does:
[fs/select.c]
void __pollwait(struct file *filp, wait_queue_head_t *wait_address, poll_table *p);
void poll_initwait(struct poll_wqueues *pwq)
{
pwq->pt.qproc = __pollwait; /* the original line is init_poll_funcptr(&pwq->pt, __pollwait); it has been "expanded" here for readability */
pwq->error = 0;
pwq->table = NULL;
}
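For reference, the struct poll_wqueues behind the table variable looks roughly like this in kernels of this era (a trimmed sketch based on 2.6-era include/linux/poll.h; later kernels add more fields, so treat it as an approximation rather than the exact definition):
struct poll_wqueues {
	poll_table pt;                  /* pt.qproc is set to __pollwait above */
	struct poll_table_page *table;  /* pages of poll_table_entry, filled by __pollwait */
	int error;
};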
Clearly, the main thing poll_initwait does is point the callback inside table's poll_table member at __pollwait. This __pollwait is not only needed by the poll system call; select uses exactly the same __pollwait. Put bluntly, it is the kernel's "house" callback for this kind of asynchronous waiting. epoll, of course, does not use it; it registers a different callback to achieve its efficiency, but that is a story for later. Let's postpone the details of __pollwait for now and continue with sys_poll:
[fs/select.c -->sys_poll]
head = NULL;
walk = NULL;
i = nfds;
err = -ENOMEM;
while(i!=0) {
struct poll_list *pp;
pp = kmalloc(sizeof(struct poll_list)+
sizeof(struct pollfd)*
(i>POLLFD_PER_PAGE?POLLFD_PER_PAGE:i),
GFP_KERNEL);
if(pp==NULL)
goto out_fds;
pp->next=NULL;
pp->len = (i>POLLFD_PER_PAGE?POLLFD_PER_PAGE:i);
if (head == NULL)
head = pp;
else
walk->next = pp;
walk = pp;
if (copy_from_user(pp->entries, ufds + nfds-i,
sizeof(struct pollfd)*pp->len)) {
err = -EFAULT;
goto out_fds;
}
i -= pp->len;
}
fdcount = do_poll(nfds, head, &table, timeout);
All this code does is build a linked list. Each list node is at most one page in size (usually 4 KB), each node is managed through a pointer to struct poll_list, and the many struct pollfd entries are reached through poll_list's entries member. The loop above copies the user-space struct pollfd array into those entries. A typical program's poll call watches only a few fds, so the list usually needs just one node, i.e. one page. But when the user passes in a large number of fds, poll must copy every struct pollfd into the kernel on every single call, so argument copying and page allocation become one of poll's performance bottlenecks. The last statement calls do_poll; a quick sketch of struct poll_list follows, and then we step into do_poll:
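struct poll_list itself is not quoted above; roughly (a sketch from 2.6-era fs/select.c, an approximation rather than the authoritative definition) it looks like this:
struct poll_list {
	struct poll_list *next;    /* next page-sized node in the chain */
	int len;                   /* number of pollfd entries stored in this node */
	struct pollfd entries[0];  /* the user's pollfds, copied in by the loop above */
};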
[fs/select.c-->sys_poll()-->do_poll()]
static void do_pollfd(unsigned int num, struct pollfd * fdpage,
poll_table ** pwait, int *count)
{
int i;
for (i = 0; i < num; i++) {
int fd;
unsigned int mask;
struct pollfd *fdp;
mask = 0;
fdp = fdpage+i;
fd = fdp->fd;
if (fd >= 0) {
struct file * file = fget(fd);
mask = POLLNVAL;
if (file != NULL) {
mask = DEFAULT_POLLMASK;
if (file->f_op && file->f_op->poll)
mask = file->f_op->poll(file, *pwait);
mask &= fdp->events | POLLERR | POLLHUP;
fput(file);
}
if (mask) {
*pwait = NULL;
(*count)++;
}
}
fdp->revents = mask;
}
}
static int do_poll(unsigned int nfds, struct poll_list *list,
struct poll_wqueues *wait, long timeout)
{
int count = 0;
poll_table* pt = &wait->pt;
if (!timeout)
pt = NULL;
for (;;) {
struct poll_list *walk;
set_current_state(TASK_INTERRUPTIBLE);
walk = list;
while(walk != NULL) {
do_pollfd( walk->len, walk->entries, &pt, &count);
walk = walk->next;
}
pt = NULL;
if (count || !timeout || signal_pending(current))
break;
count = wait->error;
if (count)
break;
timeout = schedule_timeout(timeout); /* suspend current and let other processes run;
current is scheduled again once the timeout expires (or it is woken earlier) */
}
__set_current_state(TASK_RUNNING);
return count;
}
Note set_current_state and signal_pending: together they guarantee that when the user process is suspended inside poll, sending it a signal makes it leave the poll call promptly, something a sleep written without this pattern would not allow.
Looking over do_poll as a whole, it mostly just waits inside the loop, breaking out only once count is greater than 0, and count is produced by do_pollfd. Note this piece of code:
while(walk != NULL) {
do_pollfd( walk->len, walk->entries, &pt, &count);
walk = walk->next;
}
When the user passes in many fds (say 1000), do_pollfd gets called again and again; this is the other source of poll's performance bottleneck. do_pollfd simply invokes, for each fd passed in, that fd's own poll method. Simplified, the call boils down to:
struct file* file = fget(fd);
file->f_op->poll(file, &(table->pt));
If the fd refers to a socket, do_pollfd ends up calling the poll implemented by the network code; if the fd is an open file on an ext3 filesystem, it ends up calling ext3's poll. In one sentence: file->f_op->poll is implemented by the device driver. So what does a driver's poll usually look like? The standard pattern is to call poll_wait, i.e. to invoke the struct poll_table callback with the device's own wait queue as an argument (a device normally has its own wait queue; a device with no way to wait on it asynchronously would be rather depressing). As a representative driver, let's look at a socket running TCP:
[net/ipv4/tcp.c-->tcp_poll]
unsigned int tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
{
unsigned int mask;
struct sock *sk = sock->sk;
struct tcp_opt *tp = tcp_sk(sk);
poll_wait(file, sk->sk_sleep, wait);
That is all the code we need to see; the rest merely checks state and assembles the return mask. The heart of tcp_poll is poll_wait, and poll_wait simply invokes the callback stored in the struct poll_table. For the poll system call that callback is __pollwait, so here tcp_poll can almost be read as a single statement:
__pollwait(file, sk->sk_sleep, wait);
This also shows that every socket carries its own wait queue, sk_sleep, so the "device wait queue" we spoke of above is in fact not just a single queue. Now let's look at the implementation of __pollwait:
[fs/select.c-->__pollwait()]
void __pollwait(struct file *filp, wait_queue_head_t *wait_address, poll_table *_p)
{
struct poll_wqueues *p = container_of(_p, struct poll_wqueues, pt);
struct poll_table_page *table = p->table;
if (!table || POLL_TABLE_FULL(table)) {
struct poll_table_page *new_table;
new_table = (struct poll_table_page *) __get_free_page(GFP_KERNEL);
if (!new_table) {
p->error = -ENOMEM;
__set_current_state(TASK_RUNNING);
return;
}
new_table->entry = new_table->entries;
new_table->next = table;
p->table = new_table;
table = new_table;
}
/* Add a new entry */
{
struct poll_table_entry * entry = table->entry;
table->entry = entry+1;
get_file(filp);
entry->filp = filp;
entry->wait_address = wait_address;
init_waitqueue_entry(&entry->wait, current);
add_wait_queue(wait_address,&entry->wait);
}
}
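The poll_table_page and poll_table_entry types used here are not quoted either; roughly (again a 2.6-era sketch, not the exact definitions) they are:
struct poll_table_page {
	struct poll_table_page *next;     /* previously allocated page */
	struct poll_table_entry *entry;   /* next free entry slot in this page */
	struct poll_table_entry entries[0];
};
struct poll_table_entry {
	struct file *filp;                /* file we took a reference on */
	wait_queue_t wait;                /* wait-queue entry pointing at current */
	wait_queue_head_t *wait_address;  /* the device wait queue it was added to */
};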
The job of poll_wait, then, is to build this bookkeeping structure (one poll_wait, i.e. one device poll call, creates exactly one poll_table_entry) and, through the wait member of struct poll_table_entry, hang current on the device's wait queue; the wait queue here is wait_address, which corresponds to sk->sk_sleep in tcp_poll. Now we can recap how the poll system call works: register the callback __pollwait, initialize the table variable (of type struct poll_wqueues), copy in the struct pollfd array the user passed (mostly just the fds), then call every fd's poll in turn (hanging current on every fd's device wait queue). Once a device receives a message (a network device) or finishes filling in file data (a disk device), it wakes up the processes on its wait queue, and current wakes up with them. What current does after waking, on its way out of sys_poll, is comparatively simple and will not be traced line by line here.
epoll
The analysis above has exposed poll's two performance bottlenecks; the question now is how to fix them. First, copying, say, 1000 fds into the kernel on every poll call is absurd. Why shouldn't the kernel simply remember the fds it has already been given? Exactly: epoll stores the copied-in fds itself, and its API already says as much. The fds are not passed in at epoll_wait time; instead, all fds are handed to the kernel through epoll_ctl and then "waited on" together, which removes the needless repeated copying. Second, at epoll_wait time current is not queued onto each fd's device wait queue in turn; instead, when a device wait queue is woken a callback is invoked (which of course requires a "wake-up callback" mechanism) that puts the fd producing the event onto a list, and epoll_wait simply returns the fds on that list.
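To make the contrast concrete, here is a minimal user-space sketch of that API shape: the fd is registered once with epoll_ctl and epoll_wait is then called repeatedly without passing any fd array. The names event_loop, sock_fd and handle_io are made up for this illustration, and error handling is omitted:
#include <sys/epoll.h>
#include <unistd.h>

/* Sketch: hand the fd to the kernel once, then wait as often as you like. */
static void event_loop(int sock_fd, void (*handle_io)(int fd, unsigned events))
{
	int epfd = epoll_create(1024);                /* the size argument is only a hint */
	struct epoll_event ev, ready[64];

	ev.events = EPOLLIN;
	ev.data.fd = sock_fd;
	epoll_ctl(epfd, EPOLL_CTL_ADD, sock_fd, &ev); /* fd copied into the kernel once */

	for (;;) {
		int n = epoll_wait(epfd, ready, 64, -1);  /* no fd list passed each time */
		for (int i = 0; i < n; i++)
			handle_io(ready[i].data.fd, ready[i].events);
	}
	close(epfd);                                  /* not reached in this endless sketch */
}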
Dissecting epoll
epoll is a module, so let's start with the module's entry point, eventpoll_init:
[fs/eventpoll.c-->eventpoll_init()]
static int __init eventpoll_init(void)
{
int error;
init_MUTEX(&epsem);
/* Initialize the structure used to perform safe poll wait head wake ups */
ep_poll_safewake_init(&psw);
/* Allocates slab cache used to allocate "struct epitem" items */
epi_cache = kmem_cache_create("eventpoll_epi", sizeof(struct epitem),
0, SLAB_HWCACHE_ALIGN|EPI_SLAB_DEBUG|SLAB_PANIC,
NULL, NULL);
/* Allocates slab cache used to allocate "struct eppoll_entry" */
pwq_cache = kmem_cache_create("eventpoll_pwq",
sizeof(struct eppoll_entry), 0,
EPI_SLAB_DEBUG|SLAB_PANIC, NULL, NULL);
/*
* Register the virtual file system that will be the source of inodes
* for the eventpoll files
*/
error = register_filesystem(&eventpoll_fs_type);
if (error)
goto epanic;
/* Mount the above commented virtual file system */
eventpoll_mnt = kern_mount(&eventpoll_fs_type);
error = PTR_ERR(eventpoll_mnt);
if (IS_ERR(eventpoll_mnt))
goto epanic;
DNPRINTK(3, (KERN_INFO "[%p] eventpoll: successfully initialized.\n",
current));
return 0;
epanic:
panic("eventpoll_init() failed\n");
}
Interestingly, at initialization this module registers a new filesystem called "eventpollfs" (described by the eventpoll_fs_type structure) and then mounts it. It also creates two kernel caches (in kernel programming, if you need to allocate small objects frequently you should create a kmem_cache to act as a "memory pool"), one for struct epitem and one for struct eppoll_entry. If you ever develop a new filesystem, this code is worth referring to. Now, why does epoll_create return a new fd? Because it creates a new file inside this "eventpollfs" filesystem! As follows:
[fs/eventpoll.c-->sys_epoll_create()]
asmlinkage long sys_epoll_create(int size)
{
int error, fd;
struct inode *inode;
struct file *file;
DNPRINTK(3, (KERN_INFO "[%p] eventpoll: sys_epoll_create(%d)\n",
current, size));
/* Sanity check on the size parameter */
error = -EINVAL;
if (size <= 0)
goto eexit_1;
/*
* Creates all the items needed to setup an eventpoll file. That is,
* a file structure, and inode and a free file descriptor.
*/
error = ep_getfd(&fd, &inode, &file);
if (error)
goto eexit_1;
/* Setup the file internal data structure ( "struct eventpoll" ) */
error = ep_file_init(file);
if (error)
goto eexit_2;
The function is simple. ep_getfd looks like a "get", but on the first call to epoll_create it actually creates a new inode, a new file and a new fd. ep_file_init then creates a struct eventpoll and stores it in file->private_data; note that this private_data will be used again later. At this point someone may ask: why didn't the epoll developers just build one huge in-kernel map of the epoll handles users create and have epoll_create return a pointer? That seems intuitive enough. But look around: how many Linux system calls return pointers? You will find almost none! (To be precise, malloc is not a system call; the brk that malloc calls is.) As the most distinguished heir of Unix, Linux follows one of Unix's great virtues: everything is a file. Input and output are files, sockets are files; "everything is a file" means programs on this OS can be very simple, because everything reduces to file operations! (Unix does not achieve this completely; Plan 9 comes closer.) Using a filesystem has a further advantage: epoll_create returns an fd rather than a wretched pointer. If a pointer points at the wrong thing you have practically no way to tell, whereas an fd can be checked against current->files->fd_array[]. With epoll_create done, on to epoll_ctl; the validation code is omitted:
[fs/eventpoll.c-->sys_epoll_ctl()]
asmlinkage long
sys_epoll_ctl(int epfd, int op, int fd, struct epoll_event __user *event)
{
int error;
struct file *file, *tfile;
struct eventpoll *ep;
struct epitem *epi;
struct epoll_event epds;
....
epi = ep_find(ep, tfile, fd);
error = -EINVAL;
switch (op) {
case EPOLL_CTL_ADD:
if (!epi) {
epds.events |= POLLERR | POLLHUP;
error = ep_insert(ep, &epds, tfile, fd);
} else
error = -EEXIST;
break;
case EPOLL_CTL_DEL:
if (epi)
error = ep_remove(ep, epi);
else
error = -ENOENT;
break;
case EPOLL_CTL_MOD:
if (epi) {
epds.events |= POLLERR | POLLHUP;
error = ep_modify(ep, epi, &epds);
} else
error = -ENOENT;
break;
}
So it just does an ep_find in some big structure (never mind for now what that structure is). If a struct epitem is found and the user's operation is ADD, it returns -EEXIST; if the operation is DEL, it calls ep_remove. If no struct epitem is found and the operation is ADD, ep_insert creates and inserts one. Very straightforward. So what is this "big structure"? From the way ep_find is called, the ep argument should point to it, and since ep = file->private_data we realize that the "big structure" is precisely the struct eventpoll created back in epoll_create. Looking further into ep_find's implementation, it works on eventpoll's rbr member (a struct rb_root): this is the root of a red-black tree! And what hangs on the tree are struct epitem objects. Now it is clear: a newly created epoll file carries a struct eventpoll, that structure carries a red-black tree, and the red-black tree is where the fds handed over by each epoll_ctl are stored. A rough sketch of these two structures follows, and after that we turn to the most central piece, sys_epoll_wait:
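For orientation, the two structures look roughly like this in 2.6-era fs/eventpoll.c, trimmed to the fields discussed here (the real definitions carry more members such as locks and semaphores, so treat this as an approximation):
struct eventpoll {
	wait_queue_head_t wq;         /* processes sleeping inside epoll_wait */
	wait_queue_head_t poll_wait;  /* used when the epoll fd is itself polled */
	struct list_head rdllist;     /* ready epitems, filled by ep_poll_callback */
	struct rb_root rbr;           /* red-black tree of all registered epitems */
};
struct epitem {
	struct rb_node rbn;           /* node in eventpoll.rbr */
	struct list_head rdllink;     /* node in eventpoll.rdllist when ready */
	struct list_head pwqlist;     /* the eppoll_entry wait hooks for this fd */
	struct list_head fllink;      /* node in the target file's f_ep_links */
	struct eventpoll *ep;         /* the eventpoll this item belongs to */
	struct epoll_event event;     /* events the user asked for, plus user data */
};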
[fs/eventpoll.c-->sys_epoll_wait()]
asmlinkage long sys_epoll_wait(int epfd, struct epoll_event __user *events,
int maxevents, int timeout)
{
int error;
struct file *file;
struct eventpoll *ep;
DNPRINTK(3, (KERN_INFO "[%p] eventpoll: sys_epoll_wait(%d, %p, %d, %d)\n",
current, epfd, events, maxevents, timeout));
/* The maximum number of event must be greater than zero */
if (maxevents <= 0)
return -EINVAL;
/* Verify that the area passed by the user is writeable */
if ((error = verify_area(VERIFY_WRITE, events, maxevents * sizeof(struct epoll_event))))
goto eexit_1;
/* Get the "struct file *" for the eventpoll file */
error = -EBADF;
file = fget(epfd);
if (!file)
goto eexit_1;
/*
* We have to check that the file structure underneath the fd
* the user passed to us _is_ an eventpoll file.
*/
error = -EINVAL;
if (!IS_FILE_EPOLL(file))
goto eexit_2;
/*
* At this point it is safe to assume that the "private_data" contains
* our own data structure.
*/
ep = file->private_data;
/* Time to fish for events ... */
error = ep_poll(ep, events, maxevents, timeout);
eexit_2:
fput(file);
eexit_1:
DNPRINTK(3, (KERN_INFO "[%p] eventpoll: sys_epoll_wait(%d, %p, %d, %d) = %d\n",
current, epfd, events, maxevents, timeout, error));
return error;
}
The same trick again: fetch the struct eventpoll from file->private_data, then call ep_poll:
[fs/eventpoll.c-->sys_epoll_wait()->ep_poll()]
static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
int maxevents, long timeout)
{
int res, eavail;
unsigned long flags;
long jtimeout;
wait_queue_t wait;
/*
* Calculate the timeout by checking for the "infinite" value ( -1 )
* and the overflow condition. The passed timeout is in milliseconds,
* that why (t * HZ) / 1000.
*/
jtimeout = timeout == -1 || timeout > (MAX_SCHEDULE_TIMEOUT - 1000) / HZ ?
MAX_SCHEDULE_TIMEOUT: (timeout * HZ + 999) / 1000;
retry:
write_lock_irqsave(&ep->lock, flags);
res = 0;
if (list_empty(&ep->rdllist)) {
/*
* We don't have any available event to return to the caller.
* We need to sleep here, and we will be wake up by
* ep_poll_callback() when events will become available.
*/
init_waitqueue_entry(&wait, current);
add_wait_queue(&ep->wq, &wait);
for (;;) {
/*
* We don't want to sleep if the ep_poll_callback() sends us
* a wakeup in between. That's why we set the task state
* to TASK_INTERRUPTIBLE before doing the checks.
*/
set_current_state(TASK_INTERRUPTIBLE);
if (!list_empty(&ep->rdllist) || !jtimeout)
break;
if (signal_pending(current)) {
res = -EINTR;
break;
}
write_unlock_irqrestore(&ep->lock, flags);
jtimeout = schedule_timeout(jtimeout);
write_lock_irqsave(&ep->lock, flags);
}
remove_wait_queue(&ep->wq, &wait);
set_current_state(TASK_RUNNING);
}
Another big loop, but this one is better than poll's: look closely and you will see that it does nothing at all except sleep and check whether ep->rdllist is empty! Doing nothing is of course efficient, but then who makes ep->rdllist non-empty? The answer is the callback set up at ep_insert time:
[fs/eventpoll.c-->sys_epoll_ctl()-->ep_insert()]
static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
struct file *tfile, int fd)
{
int error, revents, pwake = 0;
unsigned long flags;
struct epitem *epi;
struct ep_pqueue epq;
error = -ENOMEM;
if (!(epi = EPI_MEM_ALLOC()))
goto eexit_1;
/* Item initialization follow here ... */
EP_RB_INITNODE(&epi->rbn);
INIT_LIST_HEAD(&epi->rdllink);
INIT_LIST_HEAD(&epi->fllink);
INIT_LIST_HEAD(&epi->txlink);
INIT_LIST_HEAD(&epi->pwqlist);
epi->ep = ep;
EP_SET_FFD(&epi->ffd, tfile, fd);
epi->event = *event;
atomic_set(&epi->usecnt, 1);
epi->nwait = 0;
/* Initialize the poll table using the queue callback */
epq.epi = epi;
init_poll_funcptr(&epq.pt, ep_ptable_queue_proc);
/*
* Attach the item to the poll hooks and get current event bits.
* We can safely use the file* here because its usage count has
* been increased by the caller of this function.
*/
revents = tfile->f_op->poll(tfile, &epq.pt);
Note the line init_poll_funcptr(&epq.pt, ep_ptable_queue_proc); it amounts to epq.pt.qproc = ep_ptable_queue_proc;. The tfile->f_op->poll(tfile, &epq.pt) that follows calls the monitored file's (the "target file", in epoll's terms) poll method, that poll in turn calls poll_wait (remember poll_wait? every device driver that supports poll has to call it), and poll_wait finally calls ep_ptable_queue_proc. This chain is somewhat hard to untangle because it is not a direct, language-level call. ep_insert also puts the struct epitem onto the f_ep_links list in struct file to make lookups easier; the fllink member of struct epitem exists for exactly that purpose.
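For reference, the helper structures involved in this chain look roughly like this in 2.6-era fs/eventpoll.c (a trimmed sketch; the comment at the end summarizes the indirect call chain just described):
struct ep_pqueue {
	poll_table pt;        /* pt.qproc = ep_ptable_queue_proc */
	struct epitem *epi;   /* lets ep_ptable_queue_proc find its epitem back */
};
struct eppoll_entry {
	struct list_head llink;    /* linked into epi->pwqlist */
	void *base;                /* points back to the owning epitem */
	wait_queue_t wait;         /* wait entry whose func is ep_poll_callback */
	wait_queue_head_t *whead;  /* the device wait queue it was added to */
};
/* The chain during ep_insert (function pointers, not direct C calls):
 *   ep_insert
 *     -> tfile->f_op->poll(tfile, &epq.pt)          e.g. tcp_poll
 *          -> poll_wait(file, sk->sk_sleep, &epq.pt)
 *               -> epq.pt.qproc, i.e. ep_ptable_queue_proc
 *                    -> add_wait_queue(whead, &pwq->wait), with wait.func == ep_poll_callback
 */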
[fs/eventpoll.c-->ep_ptable_queue_proc()]
static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
poll_table *pt)
{
struct epitem *epi = EP_ITEM_FROM_EPQUEUE(pt);
struct eppoll_entry *pwq;
if (epi->nwait >= 0 && (pwq = PWQ_MEM_ALLOC())) {
init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
pwq->whead = whead;
pwq->base = epi;
add_wait_queue(whead, &pwq->wait);
list_add_tail(&pwq->llink, &epi->pwqlist);
epi->nwait++;
} else {
/* We have to signal that an error occurred */
epi->nwait = -1;
}
}
The code above is the most important thing ep_insert does: create a struct eppoll_entry, set its wake-up callback to ep_poll_callback, and add it to the device's wait queue (the whead here is exactly the per-device wait queue discussed earlier). Only with this in place will ep_poll_callback be invoked when the device becomes ready and wakes the waiters on its queue. Every call to the poll system call has to hang current (the calling process) on the wait queues of all the fds' devices; you can imagine how laborious that "hanging" becomes when there are thousands of fds. epoll_wait involves none of that fuss: epoll does the hanging only once, at epoll_ctl time (that first time is unavoidable), and leaves each fd a standing order, "call the callback when something happens". When a device has an event, the callback puts the fd onto rdllist, and each epoll_wait merely collects the fds sitting in rdllist. By using callbacks cleverly, epoll implements a far more efficient event-driven model. By now we can guess what ep_poll_callback does: it must take the epitem on the red-black tree that received the event (each epitem represents an fd) and insert it into ep->rdllist, so that when epoll_wait returns, rdllist holds exactly the ready fds!
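The "wake-up callback" mechanism works because a wait-queue entry carries a function pointer: wake_up walks the queue and invokes each entry's func, which normally just wakes the stored task, but epoll installs ep_poll_callback there instead. A rough 2.6-era sketch of the entry (include/linux/wait.h), for orientation only:
struct __wait_queue {
	unsigned int flags;
	struct task_struct *task;    /* the sleeper for ordinary waiters; unused by epoll's entries */
	wait_queue_func_t func;      /* the default wake function normally; ep_poll_callback here */
	struct list_head task_list;  /* links the entry into the device wait queue */
};
typedef struct __wait_queue wait_queue_t;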
[fs/eventpoll.c-->ep_poll_callback()]
static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *key)
{
int pwake = 0;
unsigned long flags;
struct epitem *epi = EP_ITEM_FROM_WAIT(wait);
struct eventpoll *ep = epi->ep;
DNPRINTK(3, (KERN_INFO "[%p] eventpoll: poll_callback(%p) epi=%p ep=%p\n",
current, epi->file, epi, ep));
write_lock_irqsave(&ep->lock, flags);
/*
* If the event mask does not contain any poll(2) event, we consider the
* descriptor to be disabled. This condition is likely the effect of the
* EPOLLONESHOT bit that disables the descriptor when an event is received,
* until the next EPOLL_CTL_MOD will be issued.
*/
if (!(epi->event.events & ~EP_PRIVATE_BITS))
goto is_disabled;
/* If this file is already in the ready list we exit soon */
if (EP_IS_LINKED(&epi->rdllink))
goto is_linked;
list_add_tail(&epi->rdllink, &ep->rdllist);
is_linked:
/*
* Wake up ( if active ) both the eventpoll wait list and the ->poll()
* wait list.
*/
if (waitqueue_active(&ep->wq))
wake_up(&ep->wq);
if (waitqueue_active(&ep->poll_wait))
pwake++;
is_disabled:
write_unlock_irqrestore(&ep->lock, flags);
/* We have to call this outside the lock */
if (pwake)
ep_poll_safewake(&psw, &ep->poll_wait);
return 1;
}
The only line that really matters is list_add_tail(&epi->rdllink, &ep->rdllist); it puts the struct epitem onto the struct eventpoll's rdllist. At this point the core data structures of epoll fit together: the eventpoll holds the red-black tree of registered epitems and the rdllist of ready ones, each epitem keeps its eppoll_entry wait hooks, and those hooks sit on the device wait queues.
EPOLLET, unique to epoll
EPOLLET is a flag unique to the epoll system calls; ET stands for Edge Trigger, and its precise meaning and applications are easy to look up. With EPOLLET, repeated events no longer keep popping up to pester the program's logic, which is why it is so commonly used. So how does EPOLLET work? epoll hangs a callback on every fd; when the fd's device has something to report, the fd is put onto the rdllist, so epoll_wait only has to check rdllist to know which fds have events. Let's look at the last few lines of ep_poll:
[fs/eventpoll.c->ep_poll()]
/*
* Try to transfer events to user space. In case we get 0 events and
* there's still timeout left over, we go trying again in search of
* more luck.
*/
if (!res && eavail &&
!(res = ep_events_transfer(ep, events, maxevents)) && jtimeout)
goto retry;
return res;
}
Copying the fds on rdllist out to user space is the job of ep_events_transfer:
[fs/eventpoll.c->ep_events_transfer()]
static int ep_events_transfer(struct eventpoll *ep,
struct epoll_event __user *events, int maxevents)
{
int eventcnt = 0;
struct list_head txlist;
INIT_LIST_HEAD(&txlist);
/*
* We need to lock this because we could be hit by
* eventpoll_release_file() and epoll_ctl(EPOLL_CTL_DEL).
*/
down_read(&ep->sem);
/* Collect/extract ready items */
if (ep_collect_ready_items(ep, &txlist, maxevents) > 0) {
/* Build result set in userspace */
eventcnt = ep_send_events(ep, &txlist, events);
/* Reinject ready items into the ready list */
ep_reinject_items(ep, &txlist);
}
up_read(&ep->sem);
return eventcnt;
}
Not much code: ep_collect_ready_items moves the fds on rdllist over to txlist (leaving rdllist empty), ep_send_events then copies the fds on txlist out to user space, and finally ep_reinject_items "hands back" some of the fds from txlist to rdllist so that they can be found on rdllist again next time. Here is ep_send_events:
[fs/eventpoll.c->ep_send_events()]
static int ep_send_events(struct eventpoll *ep, struct list_head *txlist,
struct epoll_event __user *events)
{
int eventcnt = 0;
unsigned int revents;
struct list_head *lnk;
struct epitem *epi;
/*
* We can loop without lock because this is a task private list.
* The test done during the collection loop will guarantee us that
* another task will not try to collect this file. Also, items
* cannot vanish during the loop because we are holding "sem".
*/
list_for_each(lnk, txlist) {
epi = list_entry(lnk, struct epitem, txlink);
/*
* Get the ready file event set. We can safely use the file
* because we are holding the "sem" in read and this will
* guarantee that both the file and the item will not vanish.
*/
revents = epi->ffd.file->f_op->poll(epi->ffd.file, NULL);
/*
* Set the return event set for the current file descriptor.
* Note that only the task task was successfully able to link
* the item to its "txlist" will write this field.
*/
epi->revents = revents & epi->event.events;
if (epi->revents) {
if (__put_user(epi->revents,
&events[eventcnt].events) ||
__put_user(epi->event.data,
&events[eventcnt].data))
return -EFAULT;
if (epi->event.events & EPOLLONESHOT)
epi->event.events &= EP_PRIVATE_BITS;
eventcnt++;
}
}
return eventcnt;
}
There is nothing remarkable about the copying itself, but note the line revents = epi->ffd.file->f_op->poll(epi->ffd.file, NULL); this poll call is sly: it passes NULL as the second argument. Let's first recall how a device driver usually implements poll:
static unsigned int scull_p_poll(struct file *filp, poll_table *wait)
{
struct scull_pipe *dev = filp->private_data;
unsigned int mask = 0;
/*
* The buffer is circular; it is considered full
* if "wp" is right behind "rp" and empty if the
* two are equal.
*/
down(&dev->sem);
poll_wait(filp, &dev->inq, wait);
poll_wait(filp, &dev->outq, wait);
if (dev->rp != dev->wp)
mask |= POLLIN | POLLRDNORM; /* readable */
if (spacefree(dev))
mask |= POLLOUT | POLLWRNORM; /* writable */
up(&dev->sem);
return mask;
}
The code above is taken from Linux Device Drivers (3rd edition), an absolute classic. The device first hangs current (the calling process) on its two queues, inq and outq (the "hanging" is done by the callback behind the wait pointer), then waits for the device to wake it; once woken, it can report the event bits through mask (note that mask value: it is what carries the event mask out). So what does poll_wait do when wait is NULL?
[include/linux/poll.h->poll_wait]
static inline void poll_wait(struct file * filp, wait_queue_head_t * wait_address,
poll_table *p)
{
if (p && wait_address)
p->qproc(filp, wait_address, p);
}
If the poll_table is NULL, it does nothing at all. Going back to ep_send_events, that poll call with a NULL table really means "I don't want to sleep, I just want the event mask". The mask obtained this way is then copied out to user space. Once ep_send_events is done, it is ep_reinject_items' turn:
[fs/eventpoll.c->ep_reinject_items]
static void ep_reinject_items(struct eventpoll *ep, struct list_head *txlist)
{
int ricnt = 0, pwake = 0;
unsigned long flags;
struct epitem *epi;
write_lock_irqsave(&ep->lock, flags);
while (!list_empty(txlist)) {
epi = list_entry(txlist->next, struct epitem, txlink);
/* Unlink the current item from the transfer list */
EP_LIST_DEL(&epi->txlink);
/*
* If the item is no more linked to the interest set, we don't
* have to push it inside the ready list because the following
* ep_release_epitem() is going to drop it. Also, if the current
* item is set to have an Edge Triggered behaviour, we don't have
* to push it back either.
*/
if (EP_RB_LINKED(&epi->rbn) && !(epi->event.events & EPOLLET) &&
(epi->revents & epi->event.events) && !EP_IS_LINKED(&epi->rdllink)) {
list_add_tail(&epi->rdllink, &ep->rdllist);
ricnt++;
}
}
if (ricnt) {
/*
* Wake up ( if active ) both the eventpoll wait list and the ->poll()
* wait list.
*/
if (waitqueue_active(&ep->wq))
wake_up(&ep->wq);
if (waitqueue_active(&ep->poll_wait))
pwake++;
}
write_unlock_irqrestore(&ep->lock, flags);
/* We have to call this outside the lock */
if (pwake)
ep_poll_safewake(&psw, &ep->poll_wait);
}
ep_reinject_items puts some of the fds in txlist back onto rdllist. Which ones? Look at the test if (EP_RB_LINKED(&epi->rbn) && !(epi->event.events & EPOLLET) && ... above: the fds that are not marked EPOLLET and whose events are still of interest (i.e. (epi->revents & epi->event.events) is non-zero) are the ones placed back onto rdllist. Naturally, the next epoll_wait will then pick them up from rdllist and copy them to the user again. An example: take a socket that has only connected and has not sent or received any data. Its poll event mask always contains POLLOUT (see the driver example above), so every epoll_wait keeps returning a POLLOUT event (rather annoying), because its fd keeps being put back onto rdllist. Now suppose someone writes a pile of data into this socket and it clogs up (becomes unwritable); then the (epi->revents & epi->event.events) && !EP_IS_LINKED(&epi->rdllink) part of the test no longer holds (no more POLLOUT), the fd is not put back onto rdllist, and epoll_wait stops reporting POLLOUT to the user. Now give the same socket EPOLLET, connect it, and send or receive nothing: this time the !(epi->event.events & EPOLLET) part of the test fails instead, so epoll_wait returns the POLLOUT notification to the user exactly once (the fd never goes back onto rdllist), and subsequent epoll_wait calls report no event at all.
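The behaviour just described can be observed from user space. Below is a minimal hedged sketch (the socketpair stand-in for an idle, writable socket and all names here are mine, not from the original text) that registers the same fd for EPOLLOUT first level-triggered and then with EPOLLET:
#include <stdio.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

static void probe(int use_et)
{
	int sp[2];
	socketpair(AF_UNIX, SOCK_STREAM, 0, sp);   /* sp[0] starts out writable */

	int epfd = epoll_create(1);
	struct epoll_event ev = { 0 };
	ev.events = EPOLLOUT | (use_et ? EPOLLET : 0);
	ev.data.fd = sp[0];
	epoll_ctl(epfd, EPOLL_CTL_ADD, sp[0], &ev);

	for (int i = 0; i < 2; i++) {
		struct epoll_event out;
		int n = epoll_wait(epfd, &out, 1, 100);  /* 100 ms timeout */
		printf("%s wait #%d -> %d event(s)\n", use_et ? "ET" : "LT", i + 1, n);
	}
	close(epfd);
	close(sp[0]);
	close(sp[1]);
}

int main(void)
{
	probe(0);  /* level-triggered: EPOLLOUT reported on both waits */
	probe(1);  /* edge-triggered: EPOLLOUT reported on the first wait only */
	return 0;
}
With the plain (level-triggered) registration both waits report EPOLLOUT; with EPOLLET only the first does, and nothing more is reported until writability is lost and regained, which is exactly what the ep_reinject_items test above implies.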