I/O multiplexing in depth: a look at the epoll implementation

After two days of grinding through algorithm problems I felt completely burnt out; I couldn't even follow other people's answers to the questions I had asked. So I decided to set the algorithms aside for a while and find something more interesting to dig into. On a whim I picked the epoll source code, and I've been buried in it all day, trying to understand how it is actually implemented.


In network programming there are three common I/O multiplexing mechanisms worth going over: select, poll, and epoll. select and poll are not the protagonists of this article; below I mainly walk through the implementation principles of epoll, using select and poll as points of contrast.

Let's start with a question:

  • Why does I/O multiplexing exist? What does it buy a single-threaded program?

I never found a satisfying answer to this online; maybe the "why" is not even well posed. The answer is actually fairly simple: multiplexing is a mechanism that lets a single thread track state changes on many sockets at once. Without I/O multiplexing, a single thread has no way to detect state changes on multiple sockets. (This is just my personal take, for reference only.)

select, poll, and epoll are the system calls we use to implement I/O multiplexing. With the first two, every time we wait for ready events, all of the fds have to be passed to the select or poll call, which means all of them are copied from user space into kernel space. Most of these copies are wasted: descriptors with no pending I/O get copied into the kernel simply because some other descriptor has an event.

It is also worth distinguishing select from poll. select collects the descriptors to monitor in a bit array (fd_set), and a kernel macro limits the number of monitored descriptors to 1024.
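
To make the per-call copying concrete, here is a minimal user-space sketch of a select loop (select_loop, listen_fd and conns are placeholder names for this illustration): every iteration rebuilds the fd_set, hands the whole set to the kernel, and then scans the whole set again to find the ready descriptors.

#include <sys/select.h>
#include <unistd.h>

/* Hypothetical helper: listen_fd is a listening socket, conns[] holds the
   accepted connections. */
void select_loop(int listen_fd, int *conns, int nconns)
{
    for (;;) {
        fd_set rfds;
        FD_ZERO(&rfds);                     /* the set is rebuilt from scratch every time */
        FD_SET(listen_fd, &rfds);
        int maxfd = listen_fd;
        for (int i = 0; i < nconns; i++) {
            FD_SET(conns[i], &rfds);        /* every monitored fd is copied into the kernel */
            if (conns[i] > maxfd)
                maxfd = conns[i];
        }
        if (select(maxfd + 1, &rfds, NULL, NULL, NULL) <= 0)
            continue;
        for (int i = 0; i < nconns; i++) {  /* the whole set is scanned again afterwards */
            if (FD_ISSET(conns[i], &rfds)) {
                char buf[4096];
                read(conns[i], buf, sizeof(buf));
            }
        }
    }
}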

poll places no such limit on the number of monitored descriptors; whenever a new connection arrives, it is simply appended to the array of pollfd structures that the caller maintains and passes to poll.
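
A comparable poll sketch (again with placeholder names) shows that the 1024 limit is gone, but the whole pollfd array is still copied into the kernel and scanned on every call:

#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: pfds[] is the caller-maintained array of monitored fds. */
void poll_loop(struct pollfd *pfds, int nfds)
{
    for (;;) {
        for (int i = 0; i < nfds; i++)
            pfds[i].events = POLLIN;        /* the whole interest set is copied in on each call */
        int n = poll(pfds, nfds, -1);
        if (n <= 0)
            continue;
        for (int i = 0; i < nfds; i++) {    /* still an O(nfds) scan to find the ready fds */
            if (pfds[i].revents & POLLIN) {
                char buf[4096];
                read(pfds[i].fd, buf, sizeof(buf));
            }
        }
    }
}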

Neither of these two mechanisms avoids the useless copying of not-yet-ready descriptors into the kernel. With only a few connections, all three multiplexing mechanisms perform about the same, but as the number of connections grows, the performance gap between them becomes obvious.

So why is epoll so efficient?

epoll exposes three related system calls: epoll_create, epoll_ctl and epoll_wait.
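
Before diving into the kernel side, here is a minimal user-space sketch of how the three calls fit together (epoll_loop is a placeholder name, listen_fd is assumed to be a listening socket, and error handling is omitted):

#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_EVENTS 64

void epoll_loop(int listen_fd)
{
    int epfd = epoll_create1(0);                      /* create the eventpoll instance */

    struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);   /* register interest (insert into the tree) */

    struct epoll_event events[MAX_EVENTS];
    for (;;) {
        int n = epoll_wait(epfd, events, MAX_EVENTS, -1);   /* only ready fds are returned */
        for (int i = 0; i < n; i++) {
            if (events[i].data.fd == listen_fd) {
                int conn = accept(listen_fd, NULL, NULL);
                struct epoll_event cev = { .events = EPOLLIN, .data.fd = conn };
                epoll_ctl(epfd, EPOLL_CTL_ADD, conn, &cev); /* track the new connection */
            } else {
                char buf[4096];
                read(events[i].data.fd, buf, sizeof(buf));  /* handle the ready connection */
            }
        }
    }
}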

On Linux, epoll registers an eventpoll file system in the kernel, which is used to store the file descriptors being monitored.

When epoll_create is called, a file node is created in this epoll file system that serves epoll alone. When epoll is initialized in the kernel, the operating system also sets up a kernel cache for it (the slab caches in the code below, backed by contiguous physical memory pages), which is used to store the red-black-tree structure.

static int __init eventpoll_init(void)
{
    mutex_init(&pmutex);
    ep_poll_safewake_init(&psw);
    // set up the slab caches for epitem and eppoll_entry objects
    epi_cache = kmem_cache_create("eventpoll_epi", sizeof(struct epitem),
            0, SLAB_HWCACHE_ALIGN | EPI_SLAB_DEBUG | SLAB_PANIC, NULL);
    pwq_cache = kmem_cache_create("eventpoll_pwq", sizeof(struct eppoll_entry),
            0, EPI_SLAB_DEBUG | SLAB_PANIC, NULL);

    return 0;
}

When I first started learning epoll I had already heard that the core data structure underlying its implementation is a red-black tree, a self-balancing binary search tree with efficient lookup, insertion and deletion.

After epoll_create is called, the epoll implementation builds a red-black tree in the slab cache mentioned above to hold the connections added later, and it also creates a ready list to store the fds whose events are ready:

// When a new connection arrives, an epitem object is allocated from the slab cache
// (a memory-allocation mechanism that is very efficient for objects that are created
// and destroyed frequently), filled in with the fields described below, and then
// inserted into the red-black tree.

struct epitem {
    struct rb_node       rbn;      // node in the red-black tree managed by the main structure
    struct list_head     rdllink;  // link into the ready-event list
    struct epitem       *next;     // used for the linked list in the main structure (ovflist)
    struct epoll_filefd  ffd;      // information about the monitored file descriptor
    int                  nwait;    // number of wait queues hooked during the poll operation
    struct list_head     pwqlist;  // doubly linked list of the monitored file's wait queues,
                                   // similar in role to the poll_table in select/poll
    struct eventpoll    *ep;       // which main structure this item belongs to (many epitems per eventpoll)
    struct list_head     fllink;   // doubly linked list linking to the monitored fd's struct file;
                                   // the file's f_ep_links holds every epoll node watching it
    struct epoll_event   event;    // the registered events of interest, i.e. the user-space epoll_event
};



// The main data structure associated with each epoll fd (epfd):

struct eventpoll {
    spinlock_t          lock;       // protects access to this structure
    struct mutex        mtx;        // prevents the structure from being removed while in use
    wait_queue_head_t   wq;         // wait queue used by sys_epoll_wait()
    wait_queue_head_t   poll_wait;  // wait queue used by file->poll()
    struct list_head    rdllist;    // list of fds whose events have been satisfied
    struct rb_root      rbr;        // root of the red-black tree managing all monitored fds
    struct epitem      *ovflist;    // chains the fds whose events arrive while events are being sent to user space
};

// The kernel creates an fd associated with this structure to manage the red-black
// tree; this is what epoll_create1 implements:

  SYSCALL_DEFINE1(epoll_create1, int, flags)
  {
  	int error, fd;
  	struct eventpoll *ep = NULL;
  	struct file *file;
   
  	/* Check the EPOLL_* constant for consistency.  */
  	BUILD_BUG_ON(EPOLL_CLOEXEC != O_CLOEXEC);
   
  	if (flags & ~EPOLL_CLOEXEC)
  		return -EINVAL;
  	/*
  	 * Create the internal data structure ("struct eventpoll").
  	 */
  	error = ep_alloc(&ep);
  	if (error < 0)
  		return error;
  	/*
  	 * Creates all the items needed to setup an eventpoll file. That is,
  	 * a file structure and a free file descriptor.
  	 */
  	fd = get_unused_fd_flags(O_RDWR | (flags & O_CLOEXEC));
  	if (fd < 0) {
  		error = fd;
  		goto out_free_ep;
  	}
  	file = anon_inode_getfile("[eventpoll]", &eventpoll_fops, ep,
  				 O_RDWR | (flags & O_CLOEXEC));
  	if (IS_ERR(file)) {
  		error = PTR_ERR(file);
  		goto out_free_fd;
  	}
  	ep->file = file;
  	fd_install(fd, file);
  	return fd;
   
  out_free_fd:
  	put_unused_fd(fd);
  out_free_ep:
  	ep_free(ep);
  	return error;
  }
  // The eventpoll instance is created and initialized by ep_alloc (below):
  // kzalloc allocates the memory in kernel space and zeroes it.
  static int ep_alloc(struct eventpoll **pep)
  {
  	int error;
  	struct user_struct *user;
  	struct eventpoll *ep;
   
  	user = get_current_user();
  	error = -ENOMEM;
  	ep = kzalloc(sizeof(*ep), GFP_KERNEL);
  	if (unlikely(!ep))
  		goto free_uid;
   
  	spin_lock_init(&ep->lock);
  	mutex_init(&ep->mtx);
  	init_waitqueue_head(&ep->wq);
  	init_waitqueue_head(&ep->poll_wait);
  	INIT_LIST_HEAD(&ep->rdllist);
  	ep->rbr = RB_ROOT;
  	ep->ovflist = EP_UNACTIVE_PTR;
  	ep->user = user;
   
  	*pep = ep;
  	return 0;
  free_uid:
  	free_uid(user);
  	return error;
  }

When an I/O event arrives on a descriptor registered in the red-black tree, the corresponding node (epitem) is linked into the ready list rdllist, and epoll_wait then returns those ready events to the user process, which carries out its own I/O and processing logic. So all epoll has to care about when reporting readiness is a short linked list of ready descriptors; unlike poll and select, it does not have to walk the entire set or list just because a few fds have I/O activity.
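
To make that concrete, the following is a heavily simplified sketch, not the actual kernel code, of what the wakeup callback registered on the monitored file's wait queue roughly does when an event arrives; the real ep_poll_callback() in fs/eventpoll.c additionally handles locking, the ovflist and the event mask:

/* Simplified sketch only -- not the real kernel function. */
static int ep_poll_callback_sketch(struct epitem *epi)
{
    struct eventpoll *ep = epi->ep;

    if (!ep_is_linked(&epi->rdllink))               /* not already on the ready list */
        list_add_tail(&epi->rdllink, &ep->rdllist); /* link the epitem into rdllist */

    if (waitqueue_active(&ep->wq))                  /* someone is sleeping in epoll_wait() */
        __wake_up_locked(&ep->wq, TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE);

    return 1;
}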

  • Some articles on the Internet claim that epoll also uses mmap to share memory between user space and kernel space. Is that true?

Judging from the code we just walked through, this claim does not hold up. The source shows that epoll does keep a cache, but it is the slab cache used to store the red-black-tree structure against which newly added connections are checked, and that is already perfectly reasonable on its own. If user space and kernel space really shared a mapped region, wouldn't that cache be superfluous? Both the epoll fd itself and every epitem are allocated from kernel memory, so there is no mmap involved.

The following code is the core of EPOLL_CTL_ADD: it adds a node to the red-black tree, and it also shows how a ready event gets placed on the ready list:

static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
                     struct file *tfile, int fd)
{
   int error, revents, pwake = 0;
   unsigned long flags;
   struct epitem *epi;
   /*
      struct ep_pqueue {
         poll_table pt;
         struct epitem *epi;
      };
   */
   struct ep_pqueue epq;

   // allocate an epitem from the slab cache for the fd being added
   if (!(epi = kmem_cache_alloc(epi_cache, GFP_KERNEL)))
      goto error_return;

   // initialize the epitem
   ep_rb_initnode(&epi->rbn);
   INIT_LIST_HEAD(&epi->rdllink);
   INIT_LIST_HEAD(&epi->fllink);
   INIT_LIST_HEAD(&epi->pwqlist);
   epi->ep = ep;
   ep_set_ffd(&epi->ffd, tfile, fd);
   epi->event = *event;
   epi->nwait = 0;
   epi->next = EP_UNACTIVE_PTR;

   epq.epi = epi;
   // install the poll callback
   init_poll_funcptr(&epq.pt, ep_ptable_queue_proc);

   /* Call the file's poll function to get the current event bits; in fact this is
      how the registered callback ep_ptable_queue_proc gets invoked (from inside
      poll_wait). If fd is a socket, f_op is socket_file_ops and the poll function
      is sock_poll(); for a TCP socket this ends up in tcp_poll(). The current
      state of the descriptor is stored in revents. Inside the poll handler
      (tcp_poll()) sock_poll_wait() is called, which calls the function pointed to
      by epq.pt.qproc, i.e. ep_ptable_queue_proc(). */
   revents = tfile->f_op->poll(tfile, &epq.pt);

   spin_lock(&tfile->f_ep_lock);
   list_add_tail(&epi->fllink, &tfile->f_ep_links);
   spin_unlock(&tfile->f_ep_lock);

   ep_rbtree_insert(ep, epi);   // insert this epitem into ep's red-black tree

   spin_lock_irqsave(&ep->lock, flags);

   /* revents & event->events: the events reported by f_op->poll above include
      events the user registered interest in.
      !ep_is_linked(&epi->rdllink): the epitem is not yet on the ready list
      (ep_is_linked tells whether the list node is linked).
      If the monitored file is already ready and not yet on the ready list, add
      the epitem to the ready list; if a process is waiting for this file to
      become ready, wake one waiter. */
   if ((revents & event->events) && !ep_is_linked(&epi->rdllink)) {
      list_add_tail(&epi->rdllink, &ep->rdllist);   // add this epi to ep's ready list

      /* If a process is waiting for the file to become ready, i.e. a process is
         sleeping in epoll_wait, wake one waiter.
         waitqueue_active(q) returns 1 if queue q has waiters, 0 otherwise. */
      if (waitqueue_active(&ep->wq))
         __wake_up_locked(&ep->wq, TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE);

      /* If a process is waiting for events on the eventpoll file itself, bump
         the temporary counter pwake; when pwake is non-zero, the waiter is
         woken after the lock is released. */
      if (waitqueue_active(&ep->poll_wait))
         pwake++;
   }
   spin_unlock_irqrestore(&ep->lock, flags);

   if (pwake)
      ep_poll_safewake(&psw, &ep->poll_wait);   // wake processes waiting on the eventpoll file
   return 0;
}

init_poll_funcptr(&epq.pt, ep_ptable_queue_proc);
revents = tfile->f_op->poll(tfile, &epq.pt);

These two lines register ep_ptable_queue_proc as the qproc member of epq.pt:

typedef struct poll_table_struct {
    poll_queue_proc qproc;
    unsigned long key;
} poll_table;

 

When f_op->poll(tfile, &epq.pt) is executed, the concrete XXX_poll(tfile, &epq.pt) implementation calls poll_wait(), and poll_wait() in turn calls the function stored in epq.pt.qproc, namely ep_ptable_queue_proc.

ep_ptable_queue_proc looks like this:

 

 

/* Called from the file's poll operation; adds epoll's callback to the monitored
   file's wakeup queue. If the monitored file is a socket, whead is the address
   of the sock structure's sk_sleep member. */
static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead, poll_table *pt) {
    /*
       struct ep_pqueue {
          poll_table pt;
          struct epitem *epi;
       };
    */
    struct epitem *epi = ep_item_from_epqueue(pt);  // recover the epi field of struct ep_pqueue from pt
    struct eppoll_entry *pwq;

    if (epi->nwait >= 0 && (pwq = kmem_cache_alloc(pwq_cache, GFP_KERNEL))) {
        init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
        pwq->whead = whead;
        pwq->base = epi;
        add_wait_queue(whead, &pwq->wait);
        list_add_tail(&pwq->llink, &epi->pwqlist);
        epi->nwait++;
    } else {
        /* We have to signal that an error occurred: if the allocation fails,
           nwait is set to -1 to record that an error (out of memory) happened. */
        epi->nwait = -1;
    }
}

Because epoll is built on a red-black tree, operations such as ep_find, ep_insert, ep_modify and ep_remove are red-black-tree lookups and insertions that run in O(log n). That is a big part of why epoll is fast and efficient.

The following figure summarizes the epoll event-handling flow.

[Figure: the epoll event-handling process]

epoll supports two trigger modes: edge-triggered (ET) and level-triggered (LT). In edge-triggered mode, an event is reported only when the socket changes from one state to another, for example when a buffer that was empty suddenly has data, the transition from "nothing" to "something". In ET mode you are expected to drain the data in one go. If you do not finish reading the buffer, what happens next depends on whether the socket is blocking or non-blocking. With a blocking socket, the read event fires once; if you do not read everything, ET will not fire again for that leftover data, and the socket just sits there blocked, unable to pick up new data. With a non-blocking socket, you put the read in a loop: the event is triggered only once, but how many times you read is up to you, and you keep calling read() on the socket until it returns EAGAIN. This is exactly why sockets used with epoll in ET mode are set to non-blocking!
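
A minimal sketch of an ET read handler along those lines (handle_et_readable is a placeholder name; the fd is assumed to be non-blocking):

#include <errno.h>
#include <unistd.h>

void handle_et_readable(int fd)          /* fd must be non-blocking in ET mode */
{
    char buf[4096];
    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n > 0) {
            /* process n bytes ... */
        } else if (n == 0) {
            close(fd);                   /* peer closed the connection */
            break;
        } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
            break;                       /* buffer drained; wait for the next edge */
        } else {
            close(fd);                   /* a real error */
            break;
        }
    }
}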

The LT model is much simpler: whatever data you do not take away now is still reported next time, so you just pick up the rest on the next wakeup.
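
For contrast, a sketch of the same handler under LT (again a placeholder name), where reading once per wakeup is enough because epoll_wait keeps reporting the fd while unread data remains:

#include <unistd.h>

void handle_lt_readable(int fd)
{
    char buf[4096];
    ssize_t n = read(fd, buf, sizeof(buf));
    if (n > 0) {
        /* process n bytes; anything left in the buffer will trigger another wakeup */
    }
}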

Since ET is so much more troublesome and LT so much less demanding, why is ET the commonly used mode? Because in LT mode the program keeps issuing both epoll-related system calls and I/O-related system calls, while in ET mode, with the socket set to non-blocking, it mostly issues only the I/O-related ones. On Linux a system call is far more expensive than an ordinary user-mode function call, so a program that finds ways to reduce the frequency of system calls will perform better. (PS: this paragraph is my personal understanding; if you see it differently, please share.)

That wraps up this summary of I/O multiplexing for now. I will keep reading the source code and will keep updating this post. If anything here is wrong, please correct me!

References:

http://www.pandademo.com/2016/11/the-discrimination-of-linux-kernel-epoll-et-and-lt/

http://blog.chinaunix.net/uid-20687780-id-2105154.html
