Basic Data Structures of epoll

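I. Basic data structures of epoll

Before reading the source code, let's look at the three data structures epoll relies on: eventpoll, epitem, and eppoll_entry.

1. eventpoll

We start with eventpoll. The kernel creates this structure when we call epoll_create, and it represents one epoll instance. Later calls to epoll_ctl and epoll_wait all operate on this eventpoll object, which is stored in the private_data field of the anonymous file created by epoll_create.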

/*
 * This structure is stored inside the "private_data" member of the file
 * structure and represents the main data structure for the eventpoll
 * interface.
 */
struct eventpoll {
    /* Protect the access to this structure */
    spinlock_t lock;

    /*
     * This mutex is used to ensure that files are not removed
     * while epoll is using them. This is held during the event
     * collection loop, the file cleanup path, the epoll file exit
     * code and the ctl operations.
     */
    struct mutex mtx;

    /* Wait queue used by sys_epoll_wait() */
    // Processes that called epoll_wait and are waiting sleep on this queue
    wait_queue_head_t wq;

    /* Wait queue used by file->poll() */
    // Waiters that poll this eventpoll itself are queued here.
    // An eventpoll is also a file, so it supports the poll operation too.
    wait_queue_head_t poll_wait;

    /* List of ready file descriptors */
    // List of fds whose events are ready; each element is an epitem (see below)
    struct list_head rdllist;

    /* RB tree root used to store monitored fd structs */
    // Red-black tree used to look up a monitored fd quickly
    struct rb_root_cached rbr;

    /*
     * This is a single linked list that chains all the "struct epitem" that
     * happened while transferring ready events to userspace w/out
     * holding ->lock.
     */
    struct epitem *ovflist;

    /* wakeup_source used when ep_scan_ready_list is running */
    struct wakeup_source *ws;

    /* The user that created the eventpoll descriptor */
    struct user_struct *user;

    // The anonymous file that backs this eventpoll: "everything is a file" on Linux
    struct file *file;

    /* used to optimize loop detection check */
    int visited;
    struct list_head visited_list_link;

#ifdef CONFIG_NET_RX_BUSY_POLL
    /* used to track busy poll napi_id */
    unsigned int napi_id;
#endif
};

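2. epitem

Each time we call epoll_ctl to add an fd, the kernel creates an epitem instance and inserts it as a node into the red-black tree held in the eventpoll structure (the rbr field). From then on, every operation on that fd, including recording whether it has pending events, goes through its epitem in the tree.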

/*
 * Each file descriptor added to the eventpoll interface will
 * have an entry of this type linked to the "rbr" RB tree.
 * Avoid increasing the size of this struct, there can be many thousands
 * of these on a server and we do not want this to take another cache line.
 */
struct epitem {
    union {
        /* RB tree node links this structure to the eventpoll RB tree */
        struct rb_node rbn;
        /* Used to free the struct epitem */
        struct rcu_head rcu;
    };

    /* List header used to link this structure to the eventpoll ready list */
    // List node that links this epitem into the rdllist of its eventpoll
    struct list_head rdllink;

    /*
     * Works together "struct eventpoll"->ovflist in keeping the
     * single linked chain of items.
     */
    struct epitem *next;

    /* The file descriptor information this item refers to */
    // The fd monitored by epoll (file pointer plus descriptor number)
    struct epoll_filefd ffd;

    /* Number of active wait queue attached to poll operations */
    // Number of wait queues this item is attached to;
    // ep_ptable_queue_proc sets it to -1 when an allocation fails
    int nwait;

    /* List containing poll wait queues */
    struct list_head pwqlist;

    /* The "container" of this item */
    // The eventpoll that owns this epitem
    struct eventpoll *ep;

    /* List header used to link this item to the "struct file" items list */
    struct list_head fllink;

    /* wakeup_source used when EPOLLWAKEUP is set */
    struct wakeup_source __rcu *ws;

    /* The structure that describe the interested events and the source fd */
    struct epoll_event event;
};

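3. eppoll_entry

Each time an fd is attached to an epoll instance, an eppoll_entry is created (one for every wait queue that the target file registers during its poll call). The structure of eppoll_entry is as follows: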

/* Wait structure used by the poll hooks */
struct eppoll_entry {
    /* List header used to link this structure to the "struct epitem" */
    struct list_head llink;

    /* The "base" pointer is set to the container "struct epitem" */
    struct epitem *base;

    /*
     * Wait queue item that will be linked to the target file wait
     * queue head.
     */
    wait_queue_entry_t wait;

    /* The wait queue head that linked the "wait" wait queue item */
    wait_queue_head_t *whead;
};
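
The embedded wait member and the base pointer are what let a wake-up callback, which receives only a pointer to the wait queue entry, find its way back to the owning epitem. The following is a minimal user-space sketch of that pointer arithmetic. All toy_ names are hypothetical and the structures are heavily simplified; it illustrates the idea and is not kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* User-space stand-in for the kernel's container_of() macro. */
#define toy_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical, simplified versions of the kernel structures. */
struct toy_wait_entry {
    int flags;
};

struct toy_epitem {
    int fd;
};

struct toy_eppoll_entry {
    struct toy_epitem *base;      /* owning item, like eppoll_entry.base */
    struct toy_wait_entry wait;   /* embedded entry, like eppoll_entry.wait */
};

/* Given only the embedded wait entry, recover the item it belongs to. */
static struct toy_epitem *toy_item_from_wait(struct toy_wait_entry *w)
{
    struct toy_eppoll_entry *pwq =
        toy_container_of(w, struct toy_eppoll_entry, wait);

    return pwq->base;
}

/* Builds one entry and returns the fd reached through its wait member. */
static int toy_demo_fd(void)
{
    struct toy_epitem item = { .fd = 7 };
    struct toy_eppoll_entry pwq = { .base = &item, .wait = { 0 } };

    return toy_item_from_wait(&pwq.wait)->fd;
}
```

Calling toy_demo_fd() walks from &pwq.wait back to pwq and then to item, so a check such as assert(toy_demo_fd() == 7) should hold.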

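II. How epoll works under the hood

Consider a high-concurrency scenario in which one million users hold TCP connections to a single process, but only a few dozen or a few hundred of those connections are active (receiving TCP packets) at any moment. At each instant the process therefore needs to handle only a small fraction of the one million connections.

With select or poll, the process must hand all one million connections to the operating system on every call so that the kernel can poll them for events. The obvious problem is that, at any given moment, most of those connections have no events at all. Copying the full set of sockets from user space to kernel space on every call is very expensive, and having the kernel scan every connection for unhandled events wastes even more resources. Yet this is exactly how select and poll work, which is why they can handle only a few thousand concurrent connections.

epoll takes a different approach. It sets up a simple file system inside the Linux kernel and splits what used to be a single select or poll call into three parts: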

int epoll_create(int size);
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);

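Call epoll_create to create an epoll object (the kernel allocates resources for this handle in the epoll file system);
Call epoll_ctl to add the one million connection sockets to the epoll object;
Call epoll_wait to collect the connections on which events have occurred.

With this design, the process creates one epoll object at startup and adds or removes connections only when needed. Collecting events with epoll_wait is therefore very efficient: the call does not pass in the one million connections, and the kernel does not have to traverse all of them.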
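
The three steps can be tried from user space with a pipe standing in for a TCP connection. This is a minimal sketch, assuming a Linux system; the function name epoll_three_steps is made up for illustration, and epoll_create1(0) is used as the modern equivalent of epoll_create.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Creates an epoll instance, registers the read end of a pipe, makes it
 * readable, and waits. Returns 1 if epoll_wait reported that fd,
 * 0 if it reported something else, -1 on any failure.
 */
static int epoll_three_steps(void)
{
    int pipefd[2];
    int epfd;
    int result = -1;
    struct epoll_event ev = { .events = EPOLLIN };
    struct epoll_event out;

    if (pipe(pipefd) < 0)
        return -1;

    epfd = epoll_create1(0);                      /* step 1: create */
    if (epfd < 0)
        goto close_pipe;

    ev.data.fd = pipefd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) < 0)   /* step 2: register */
        goto close_epoll;

    if (write(pipefd[1], "x", 1) != 1)            /* make the fd readable */
        goto close_epoll;

    if (epoll_wait(epfd, &out, 1, 1000) == 1)     /* step 3: collect */
        result = (out.data.fd == pipefd[0]);

close_epoll:
    close(epfd);
close_pipe:
    close(pipefd[0]);
    close(pipefd[1]);
    return result;
}
```

A caller can verify the round trip with assert(epoll_three_steps() == 1). Note that the registration in step 2 happens once, while step 3 can be repeated without passing the descriptor set again.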

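1. epoll_create

When we call epoll_create, the kernel creates a file node in the epoll file system and a red-black tree in kernel memory to hold the sockets later passed in through epoll_ctl. It also creates the rdllist doubly linked list, which holds the events that are ready. When epoll_wait is called, it only needs to check whether rdllist contains anything: if it does, epoll_wait returns; if not, it sleeps, and it returns once the timeout expires even if the list is still empty. This is why epoll_wait is so efficient.

Red-black tree operations are protected by a mutex, which must be taken for insertions and deletions.

Doubly linked list operations are protected by a spinlock. A thread that fails to get a spinlock does not sleep, which keeps list operations fast; insertions and deletions must take the lock.

In short, the red-black tree stores the nodes for all monitored file descriptors, and the ready list stores the nodes for the file descriptors that are ready.

epoll_create workflow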

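First, epoll_create performs a simple check on the flags argument.

/* Check the EPOLL_* constant for consistency. */
BUILD_BUG_ON(EPOLL_CLOEXEC != O_CLOEXEC);

if (flags & ~EPOLL_CLOEXEC)
    return -EINVAL;

Next, the kernel allocates the memory needed for the eventpoll.

/*
 * Create the internal data structure ("struct eventpoll").
 */
error = ep_alloc(&ep);
if (error < 0)
    return error;

After that, epoll_create allocates an anonymous file and a file descriptor for the epoll instance, where fd is the file descriptor and file is the anonymous file. This reflects the UNIX idea that everything is a file. Note that the eventpoll instance keeps a reference to the anonymous file, and that the call to fd_install binds the anonymous file to the file descriptor.

One detail deserves special attention: when calling anon_inode_getfile, epoll_create stores the eventpoll as the private_data of the anonymous file. Later, a lookup that starts from the file descriptor of the epoll instance can reach the eventpoll object quickly.

Finally, the file descriptor is returned to the caller of epoll_create as the handle of the epoll instance.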

/*
 * Creates all the items needed to setup an eventpoll file. That is,
 * a file structure and a free file descriptor.
 */
fd = get_unused_fd_flags(O_RDWR | (flags & O_CLOEXEC));
if (fd < 0) {
    error = fd;
    goto out_free_ep;
}
file = anon_inode_getfile("[eventpoll]", &eventpoll_fops, ep,
                          O_RDWR | (flags & O_CLOEXEC));
if (IS_ERR(file)) {
    error = PTR_ERR(file);
    goto out_free_fd;
}
ep->file = file;
fd_install(fd, file);
return fd;
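
The "[eventpoll]" name passed to anon_inode_getfile is visible from user space. The sketch below assumes a Linux system with /proc mounted, where the symlink /proc/self/fd/N for an epoll descriptor typically reads anon_inode:[eventpoll]. It also checks that EPOLL_CLOEXEC ends up as FD_CLOEXEC on the descriptor. The helper names are hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Copies the symlink target of /proc/self/fd/<fd> into buf; returns 0 on success. */
static int fd_link_target(int fd, char *buf, size_t len)
{
    char path[64];
    ssize_t n;

    snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
    n = readlink(path, buf, len - 1);
    if (n < 0)
        return -1;
    buf[n] = '\0';
    return 0;
}

/*
 * Returns 1 if a fresh epoll fd is named like an "[eventpoll]" anonymous
 * inode and carries FD_CLOEXEC, 0 if not, -1 on error.
 */
static int check_epoll_fd(void)
{
    char target[128];
    int flags, ok;
    int epfd = epoll_create1(EPOLL_CLOEXEC);

    if (epfd < 0)
        return -1;
    flags = fcntl(epfd, F_GETFD);
    if (flags < 0 || fd_link_target(epfd, target, sizeof(target)) < 0) {
        close(epfd);
        return -1;
    }
    ok = (flags & FD_CLOEXEC) && strstr(target, "[eventpoll]") != NULL;
    close(epfd);
    return ok;
}
```

On a typical Linux machine check_epoll_fd() should return 1. Environments that emulate the kernel or hide /proc may behave differently.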

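2. epoll_ctl

Next, let's see how a socket is added to an epoll instance. To do that we need to walk through the implementation of epoll_ctl.

Looking up the epoll instance

First, epoll_ctl uses the epoll instance handle to get the corresponding anonymous file. This is easy to understand: on UNIX everything is a file, and an epoll instance is an anonymous file too.

// Get the anonymous file that backs the epoll instance
f = fdget(epfd);
if (!f.file)
    goto error_return;

Next, it gets the file that corresponds to the socket being added. Here tf stands for target file, the file to be processed.

/* Get the "struct file *" for the target file */
// Get the real file, such as a listening socket or a connected socket
tf = fdget(fd);
if (!tf.file)
    goto error_fput;

Then comes a series of checks to make sure the arguments passed by the user are valid, for example that epfd really is an epoll instance handle and not an ordinary file descriptor.

/* The target file descriptor must support poll */
// A file descriptor that does not support poll cannot be monitored
error = -EPERM;
if (!tf.file->f_op->poll)
    goto error_tgt_fput;
...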
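
These checks can be observed from user space through the errno values of epoll_ctl. The sketch below assumes Linux behaviour as documented in the epoll_ctl manual page: a target that does not support poll (a directory is used here because opening "/" needs no write access) yields EPERM, and an epfd that is not an epoll instance yields EINVAL. The helper names are made up for this example.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Calls epoll_ctl(EPOLL_CTL_ADD) and returns 0 on success or the errno value. */
static int try_add(int epfd, int fd)
{
    struct epoll_event ev = { .events = EPOLLIN };

    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0 ? errno : 0;
}

/* The target is a directory, which has no poll method. */
static int add_directory_errno(void)
{
    int epfd = epoll_create1(0);
    int dfd = open("/", O_RDONLY);
    int err = -1;

    if (epfd >= 0 && dfd >= 0)
        err = try_add(epfd, dfd);
    if (dfd >= 0)
        close(dfd);
    if (epfd >= 0)
        close(epfd);
    return err;
}

/* The "epfd" argument is a pipe end rather than an epoll instance. */
static int add_to_non_epoll_errno(void)
{
    int p[2], err;

    if (pipe(p) < 0)
        return -1;
    err = try_add(p[0], p[1]);
    close(p[0]);
    close(p[1]);
    return err;
}
```

On Linux, add_directory_errno() is expected to return EPERM and add_to_non_epoll_errno() is expected to return EINVAL.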

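Red-black tree lookup

Next, epoll_ctl uses the target file and its descriptor to search the red-black tree and find out whether the socket is already registered. This lookup is one of the reasons epoll is efficient. The red-black tree (RB-tree) is a common data structure; eventpoll uses one to track every file descriptor it currently monitors, and the root of the tree is stored in the eventpoll structure.

Every monitored file descriptor has a corresponding epitem, which is stored as a node in the red-black tree.

A red-black tree is a binary search tree, so its nodes must be comparable in order to build an ordered tree. For epitem, the ordering is provided by the epoll_filefd structure, which pairs the file with its descriptor number. epoll_filefd can be thought of simply as the monitored file descriptor, and it acts as the key of the node in the tree.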
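
A comparison over such a key can be sketched in user space as follows. The struct toy_filefd and toy_cmp_ffd names are hypothetical; the sketch assumes the key is ordered by file pointer first and by descriptor number second, which keeps two descriptors that refer to the same open file (for example after dup) as distinct entries.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical user-space mirror of the epoll_filefd key. */
struct toy_filefd {
    const void *file;   /* stands in for struct file * */
    int fd;             /* descriptor number */
};

/*
 * Three-way comparison suitable for a binary search tree:
 * negative if a sorts before b, zero if equal, positive otherwise.
 */
static int toy_cmp_ffd(const struct toy_filefd *a, const struct toy_filefd *b)
{
    uintptr_t fa = (uintptr_t)a->file;
    uintptr_t fb = (uintptr_t)b->file;

    if (fa != fb)
        return fa > fb ? 1 : -1;
    /* Same file: fall back to the descriptor number without overflow. */
    return (a->fd > b->fd) - (a->fd < b->fd);
}
```

With two keys that share a file pointer, the descriptor number decides the order; with different file pointers, the pointer value decides and the descriptor number is ignored.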

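ep_insert

ep_insert first checks whether the number of monitored files exceeds the maximum preset in /proc/sys/fs/epoll/max_user_watches, and returns an error immediately if it does. It then allocates resources and initializes the new item.

if (!(epi = kmem_cache_alloc(epi_cache, GFP_KERNEL)))
    return -ENOMEM;

/* Item initialization follow here ... */
INIT_LIST_HEAD(&epi->rdllink);
INIT_LIST_HEAD(&epi->fllink);
INIT_LIST_HEAD(&epi->pwqlist);
epi->ep = ep;
ep_set_ffd(&epi->ffd, tfile, fd);
epi->event = *event;
epi->nwait = 0;
epi->next = EP_UNACTIVE_PTR;

What comes next is very important: ep_insert sets up a callback for every file descriptor that is added. The callback is installed by the function ep_ptable_queue_proc. What is the callback for? When an event occurs on the file descriptor, for example when data arrives in a socket buffer, the kernel invokes it. The callback itself is ep_poll_callback. The kernel's own design, it turns out, is also built around event callbacks.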

/*
 * This is the callback that is used to add our wait queue to the
 * target file wakeup lists.
 */
static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
                                 poll_table *pt)
{
    struct epitem *epi = ep_item_from_epqueue(pt);
    struct eppoll_entry *pwq;

    if (epi->nwait >= 0 && (pwq = kmem_cache_alloc(pwq_cache, GFP_KERNEL))) {
        init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
        pwq->whead = whead;
        pwq->base = epi;
        if (epi->event.events & EPOLLEXCLUSIVE)
            add_wait_queue_exclusive(whead, &pwq->wait);
        else
            add_wait_queue(whead, &pwq->wait);
        list_add_tail(&pwq->llink, &epi->pwqlist);
        epi->nwait++;
    } else {
        /* We have to signal that an error occurred */
        epi->nwait = -1;
    }
}
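
What the registered callback achieves can be pictured with a small user-space model: when an event fires, the item is appended to the ready list, unless it is already there. This is only a toy sketch under that assumption. The toy_ structures and functions are hypothetical and leave out locking, wake-ups of sleeping waiters, and the overflow list.

```c
#include <assert.h>
#include <stddef.h>

struct toy_item {
    int fd;
    int on_ready_list;            /* plays the role of a linked rdllink */
    struct toy_item *next;
};

struct toy_eventpoll {
    struct toy_item *ready_head;  /* plays the role of rdllist */
    struct toy_item *ready_tail;
    int ready_count;
};

/* Callback type, in the spirit of a wait-queue function entry. */
typedef void (*toy_wake_fn)(struct toy_eventpoll *ep, struct toy_item *it);

/* Queues the item at the tail of the ready list, at most once. */
static void toy_poll_callback(struct toy_eventpoll *ep, struct toy_item *it)
{
    if (it->on_ready_list)
        return;                   /* already queued, nothing to do */
    it->on_ready_list = 1;
    it->next = NULL;
    if (ep->ready_tail)
        ep->ready_tail->next = it;
    else
        ep->ready_head = it;
    ep->ready_tail = it;
    ep->ready_count++;
}

/* Simulates two wake-ups on fd 4 and one on fd 9; returns the ready count. */
static int toy_demo_ready_count(void)
{
    struct toy_eventpoll ep = { 0 };
    struct toy_item a = { .fd = 4 };
    struct toy_item b = { .fd = 9 };
    toy_wake_fn wake = toy_poll_callback;

    wake(&ep, &a);
    wake(&ep, &a);                /* duplicate event: not queued twice */
    wake(&ep, &b);
    return ep.ready_count;
}
```

In this model toy_demo_ready_count() returns 2, because the second wake-up on the same item finds it already queued.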

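To sum up, when we register a socket with epoll_ctl(), the kernel does the following:

Allocates an epitem, the red-black tree node object
Adds a wait entry to the wait queue of the socket
Inserts the epitem into the red-black tree of the epoll object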
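
The tree lookup that precedes the insertion is visible to user space through errno. The following sketch assumes the behaviour documented in the epoll_ctl manual page: adding a descriptor that is already registered fails with EEXIST, and deleting one that is not registered fails with ENOENT. The function names are invented for the example.

```c
#include <assert.h>
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Runs one epoll_ctl operation and returns 0 on success or the errno value. */
static int ctl_errno(int epfd, int op, int fd)
{
    struct epoll_event ev = { .events = EPOLLIN };

    return epoll_ctl(epfd, op, fd, &ev) < 0 ? errno : 0;
}

/*
 * Performs ADD, ADD, DEL, DEL on the read end of a pipe and stores the
 * four results. Returns 0 if the sequence ran, -1 on setup failure.
 */
static int add_del_sequence(int results[4])
{
    int p[2];
    int epfd = epoll_create1(0);

    if (epfd < 0)
        return -1;
    if (pipe(p) < 0) {
        close(epfd);
        return -1;
    }

    results[0] = ctl_errno(epfd, EPOLL_CTL_ADD, p[0]);  /* not in tree yet */
    results[1] = ctl_errno(epfd, EPOLL_CTL_ADD, p[0]);  /* already in tree */
    results[2] = ctl_errno(epfd, EPOLL_CTL_DEL, p[0]);  /* found, removed */
    results[3] = ctl_errno(epfd, EPOLL_CTL_DEL, p[0]);  /* no longer there */

    close(p[0]);
    close(p[1]);
    close(epfd);
    return 0;
}
```

The expected results are 0, EEXIST, 0, and ENOENT, in that order.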

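3. epoll_wait

When epoll_wait is called, it checks whether the eventpoll->rdllist list contains anything. If it does, epoll_wait returns. If not, it creates a wait queue entry, adds it to the wait queue of the eventpoll (the wq field of type wait_queue_head_t described in section I.1), and then blocks.

Looking up the epoll instance

epoll_wait first performs a series of checks, for example that the maxevents argument is greater than 0. As in epoll_ctl described earlier, it uses the epoll instance to find the corresponding anonymous file and descriptor, and then checks and validates them.

The eventpoll instance is again obtained by reading the private_data of the anonymous file that backs the epoll instance.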

/* The maximum number of event must be greater than zero */
if (maxevents <= 0 || maxevents > EP_MAX_EVENTS)
    return -EINVAL;

/* Verify that the area passed by the user is writeable */
if (!access_ok(VERIFY_WRITE, events, maxevents * sizeof(struct epoll_event)))
    return -EFAULT;

/* Get the "struct file *" for the eventpoll file */
f = fdget(epfd);
if (!f.file)
    return -EBADF;

/*
 * We have to check that the file structure underneath the fd
 * the user passed to us _is_ an eventpoll file.
 */
error = -EINVAL;
if (!is_file_epoll(f.file))
    goto error_fput;
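
Each of these checks maps to an errno value that a caller can observe. The sketch below assumes Linux and uses invented helper names: it calls epoll_wait with a zero maxevents, with a descriptor that is not open, with a pipe end in place of an epoll descriptor, and finally with valid arguments and a timeout of 0.

```c
#include <assert.h>
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Calls epoll_wait with timeout 0; returns 0 on success or the errno value. */
static int wait_errno(int epfd, int maxevents)
{
    struct epoll_event ev;

    return epoll_wait(epfd, &ev, maxevents, 0) < 0 ? errno : 0;
}

/* Stores the outcome of four calls; returns 0 if they ran, -1 on setup failure. */
static int run_wait_checks(int results[4])
{
    int p[2];
    int epfd = epoll_create1(0);

    if (epfd < 0)
        return -1;
    if (pipe(p) < 0) {
        close(epfd);
        return -1;
    }

    results[0] = wait_errno(epfd, 0);   /* maxevents must be greater than 0 */
    results[1] = wait_errno(-1, 1);     /* not an open descriptor */
    results[2] = wait_errno(p[0], 1);   /* open, but not an eventpoll file */
    results[3] = wait_errno(epfd, 1);   /* valid: ready list empty, returns at once */

    close(p[0]);
    close(p[1]);
    close(epfd);
    return 0;
}
```

The expected results are EINVAL, EBADF, EINVAL, and 0. The last call returns immediately because the timeout is 0 and nothing is on the ready list.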

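4. Summary

epoll_create() creates the red-black tree and the ready list;
epoll_ctl(), when adding a socket handle, checks whether the handle is already in the red-black tree. If it is, the call returns immediately; if not, the handle is added to the tree and a callback is registered with the kernel, which inserts the item into the ready list when an interrupt-driven event arrives;
epoll_wait() only has to return the data that is already on the ready list.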

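III. The two trigger modes of epoll

epoll has two trigger modes, level-triggered (LT) and edge-triggered (ET). LT is the default mode, and ET, which is selected with the EPOLLET flag, is the "high-speed" mode.

In LT (level-triggered) mode, as long as the file descriptor still has data to read, every call to epoll_wait returns its event to remind the program to handle it;

Compared with ET, LT needs the extra step of switching the EPOLLOUT event on and off, which costs system calls and context switches. For the listening sockfd, level-triggered mode is the better choice (see nginx): with edge-triggered mode, some clients may fail to connect under high concurrency. LT is well suited to urgent events. For a connfd used for reading and writing, blocking and non-blocking I/O behave the same in level-triggered mode, but non-blocking is still recommended to guard against corner cases. Programming with LT is close to poll/select, matches long-standing habits, and is less error-prone;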

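In ET (edge-triggered) mode, when an I/O event is detected, epoll_wait returns the file descriptors that have event notifications. For each notified descriptor, for example a readable one, the program must keep reading until the descriptor is empty, that is, until read fails with errno set to EAGAIN. Otherwise the next epoll_wait will not report the remaining data and the event is lost. If the descriptor used in ET mode is not non-blocking, this read-until-empty or write-until-full loop will inevitably block on its last call.

Edge-triggered mode greatly reduces the number of times the same epoll event is reported, so it is more efficient. For a connfd used for reading and writing in edge-triggered mode, non-blocking I/O is mandatory, and all data must be read or written in one go. ET code can be more concise and, in some scenarios, more efficient, but it is also easier to miss events and introduce bugs;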
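
The difference between the two modes, and the read-until-EAGAIN loop that ET requires, can be sketched in user space with a pipe. The function names are made up for illustration. The first function leaves the data unread and calls epoll_wait twice with a timeout of 0: in LT the second call is expected to report the descriptor again, in ET it is not.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Registers the read end of a pipe with EPOLLIN plus `extra` (0 for LT,
 * EPOLLET for ET), writes 4 bytes, and calls epoll_wait twice without
 * reading. Returns the event count of the second call, or -1 on error.
 */
static int second_wait_count(uint32_t extra)
{
    int p[2], epfd, n = -1;
    struct epoll_event ev = { .events = EPOLLIN | extra };
    struct epoll_event out;

    if (pipe(p) < 0)
        return -1;
    epfd = epoll_create1(0);
    if (epfd < 0)
        goto out_pipe;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) < 0)
        goto out_ep;
    if (write(p[1], "data", 4) != 4)
        goto out_ep;
    if (epoll_wait(epfd, &out, 1, 0) != 1)   /* first report, both modes */
        goto out_ep;
    n = epoll_wait(epfd, &out, 1, 0);        /* data still unread */
out_ep:
    close(epfd);
out_pipe:
    close(p[0]);
    close(p[1]);
    return n;
}

/* Reads a non-blocking fd until EAGAIN or EOF; returns total bytes or -1. */
static ssize_t drain_fd(int fd)
{
    char buf[4096];
    ssize_t total = 0;

    for (;;) {
        ssize_t r = read(fd, buf, sizeof(buf));

        if (r > 0) {
            total += r;
            continue;
        }
        if (r == 0)
            return total;                    /* end of file */
        if (errno == EINTR)
            continue;                        /* interrupted, retry */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return total;                    /* drained */
        return -1;                           /* real error */
    }
}
```

second_wait_count(0) is expected to return 1 and second_wait_count(EPOLLET) is expected to return 0. drain_fd must only be used on a descriptor set to O_NONBLOCK, otherwise the final read blocks exactly as described above.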

In short, ET and LT each have strengths and weaknesses, and the right mode depends on the workload.

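IV. Limitations of epoll

1. Timer precision is coarse: the epoll_wait timeout is given in milliseconds and is accurate only to about 5 ms, while select can reach about 0.1 ms;

2. When there are few connections and all of them are very active, select and poll may perform better than epoll;

3. epoll_ctl can modify only one fd per call. kevent can change several at once, whereas epoll needs one system call per change and cannot batch them, which may hurt performance;

4. epoll_wait may return before the timeout expires, which then requires another epoll_wait call;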
