Saturday, January 2, 2010

epoll - Small tutorial

For a long time, programmers have used the 'select' and 'poll' calls to wait for events on multiple sockets, pipes and other file descriptors. Though they served their purpose well, they no longer scale or perform well enough for programs that watch many descriptors. The 'select' mechanism expects all descriptors to be represented in three different bit masks (one for read events, one for write events and one for exceptional conditions). The size of these masks is fixed at compile time (FD_SETSIZE, 1024 on Linux), which limits both how many descriptors can be watched and how large their numbers may be. The 'select' call waits until some event happens. Once it returns, the program needs to go through all descriptors (that is, test every bit in the corresponding masks) to identify the ones that actually have events ready. Normally only a few descriptors are ready at a time, so checking bit by bit wastes CPU cycles. Because 'select' overwrites the masks with its results, they also have to be rebuilt before every call. Finally, programs typically need to store some data on a per-descriptor basis and want that data handed back when an event occurs on the descriptor; 'select' provides no such facility.
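To make the scanning cost concrete, here is a minimal C sketch of a 'select'-based wait, assuming POSIX 'select' and using pipes as stand-ins for sockets. The helper names wait_readable_select and select_demo are hypothetical. Note the two loops over every descriptor: one to rebuild the mask before the call and one to find the ready bits after it.

```c
#include <sys/select.h>
#include <unistd.h>

/* Blocks until at least one of the 'n' descriptors in 'fds' is readable,
 * then copies the readable ones into 'ready'. Returns how many were copied,
 * or -1 on error. Every value in 'fds' must be below FD_SETSIZE. */
int wait_readable_select(const int *fds, int n, int *ready)
{
    fd_set readset;
    int maxfd = -1;

    /* The bit mask has to be rebuilt before every call, because select()
     * overwrites it with the result. */
    FD_ZERO(&readset);
    for (int i = 0; i < n; i++) {
        FD_SET(fds[i], &readset);
        if (fds[i] > maxfd)
            maxfd = fds[i];
    }

    /* NULL timeout: wait indefinitely. */
    if (select(maxfd + 1, &readset, NULL, NULL, NULL) < 0)
        return -1;

    /* select() reports only how many bits are set, not which ones, so
     * every descriptor that was passed in has to be tested again. */
    int count = 0;
    for (int i = 0; i < n; i++)
        if (FD_ISSET(fds[i], &readset))
            ready[count++] = fds[i];
    return count;
}

/* Hypothetical usage: two pipes, only the second one has data. Returns 1
 * if select() reported exactly that pipe as readable, 0 otherwise. */
int select_demo(void)
{
    int a[2], b[2], ready[2];
    if (pipe(a) != 0 || pipe(b) != 0)
        return 0;
    if (write(b[1], "x", 1) != 1)
        return 0;

    int fds[2] = { a[0], b[0] };
    int n = wait_readable_select(fds, 2, ready);
    int ok = (n == 1 && ready[0] == b[0]);

    close(a[0]); close(a[1]);
    close(b[0]); close(b[1]);
    return ok;
}
```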

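The 'poll' mechanism is certainly an improvement over 'select'. It does not use bit masks; instead the program passes an array of 'struct pollfd' entries, so there is no FD_SETSIZE-style limit on the number or value of the descriptors it can wait on. Because each entry has separate input ('events') and output ('revents') fields, the array also does not need to be rebuilt after every call. However, 'poll' does not let the program attach its own per-descriptor data either: 'struct pollfd' holds only 'fd', 'events' and 'revents'. And it has the same performance problem as 'select': the whole array is passed to the kernel on every call, and every time 'poll' returns, the program has to go through all the descriptors it passed in to find the ones with events ready. So although the scalability problem is addressed, the performance problem remains.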
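The scan described above can be sketched in C with 'poll', again assuming pipes in place of sockets; the helper names wait_readable_poll and poll_demo are hypothetical. The array can be reused between calls, but finding the ready entries is still a loop over all of them.

```c
#include <poll.h>
#include <unistd.h>

/* Blocks until at least one entry in 'pfds' is readable and copies the
 * ready descriptors into 'ready'. Returns how many, or -1 on error. */
int wait_readable_poll(struct pollfd *pfds, nfds_t n, int *ready)
{
    /* -1 timeout: wait indefinitely. The whole array is handed to the
     * kernel on every call. */
    if (poll(pfds, n, -1) < 0)
        return -1;

    /* poll() returns only a count, so the program still scans every
     * entry, looking for a non-zero 'revents'. */
    int count = 0;
    for (nfds_t i = 0; i < n; i++)
        if (pfds[i].revents & POLLIN)
            ready[count++] = pfds[i].fd;
    return count;
}

/* Hypothetical usage: two pipes, only the second one has data. Returns 1
 * if poll() reported exactly that pipe as readable, 0 otherwise. */
int poll_demo(void)
{
    int a[2], b[2], ready[2];
    if (pipe(a) != 0 || pipe(b) != 0)
        return 0;
    if (write(b[1], "x", 1) != 1)
        return 0;

    /* 'events' is input and 'revents' is output, so unlike select()'s bit
     * masks this array could be passed to the next call unchanged. */
    struct pollfd pfds[2] = {
        { .fd = a[0], .events = POLLIN },
        { .fd = b[0], .events = POLLIN },
    };
    int n = wait_readable_poll(pfds, 2, ready);
    int ok = (n == 1 && ready[0] == b[0]);

    close(a[0]); close(a[1]);
    close(b[0]); close(b[1]);
    return ok;
}
```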

The 'epoll' mechanism solves both the scalability and the performance problems. Descriptors are added to and removed from an epoll instance by descriptor value, not by bit position, and the kernel remembers this set between calls, so it does not have to be passed in again each time. Programs can also associate their own data with each descriptor (an integer, a file descriptor or a pointer, stored in the 'data' field of 'struct epoll_event'); that data is returned along with the ready event when 'epoll_wait' returns. More importantly, 'epoll_wait' fills the caller's array with only the descriptors that have events ready, so the program does not need to walk the complete descriptor list to find them.
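The following sketch, assuming Linux and a hypothetical 'struct conn' record for application state, stores a pointer in 'data.ptr' when registering a descriptor and gets it back directly from 'epoll_wait'. The helper names watch_conn and epoll_data_demo are illustrative only.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical per-descriptor state that an application might keep. */
struct conn {
    int fd;
    int id;   /* e.g. a connection number */
};

/* Registers 'c' for read events, storing a pointer to it as the user data.
 * epoll_wait() hands that pointer back unchanged. Returns 0 or -1. */
int watch_conn(int epfd, struct conn *c)
{
    struct epoll_event ev = { 0 };
    ev.events = EPOLLIN;
    ev.data.ptr = c;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, c->fd, &ev);
}

/* Hypothetical usage: two pipes, only the second has data. Returns the id
 * of the connection that epoll reported as ready, or -1 on failure. */
int epoll_data_demo(void)
{
    int a[2], b[2];
    if (pipe(a) != 0 || pipe(b) != 0)
        return -1;
    if (write(b[1], "x", 1) != 1)
        return -1;

    struct conn conns[2] = { { a[0], 1 }, { b[0], 2 } };
    int result = -1;

    int epfd = epoll_create(1);
    if (epfd >= 0 && watch_conn(epfd, &conns[0]) == 0 &&
        watch_conn(epfd, &conns[1]) == 0) {
        struct epoll_event out[2];
        /* Only the ready descriptor comes back, already carrying our
         * pointer, so no lookup table or full scan is needed. */
        if (epoll_wait(epfd, out, 2, 1000) == 1) {
            struct conn *c = out[0].data.ptr;
            result = c->id;
        }
    }

    if (epfd >= 0)
        close(epfd);
    close(a[0]); close(a[1]);
    close(b[0]); close(b[1]);
    return result;
}
```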

Usage is also simple. Create an epoll instance using 'epoll_create'. Add, delete and modify descriptors, and the events you are interested in for each, using 'epoll_ctl'. Then call 'epoll_wait' to wait for events to occur. 'epoll_wait' outputs the descriptors that are ready with some events, and the program can walk through just these ready descriptors to do whatever it needs to do.
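The create / ctl / wait sequence above can be sketched in C as follows. This is a minimal example assuming Linux: the function epoll_round and the MAX_EVENTS batch size are hypothetical, and for compactness it creates and closes the instance within a single call, whereas a real server would create it once and call 'epoll_wait' in a loop.

```c
#include <sys/epoll.h>
#include <unistd.h>

#define MAX_EVENTS 16   /* assumed batch size for one epoll_wait() call */

/* Watches 'n' descriptors for readability, waits once for up to
 * 'timeout_ms' milliseconds, and copies the ready descriptors into
 * 'ready_fds' (room for MAX_EVENTS entries). Returns the number of ready
 * descriptors, or -1 on error. */
int epoll_round(const int *fds, int n, int *ready_fds, int timeout_ms)
{
    /* 1. Create the instance; the size hint is ignored but must be > 0. */
    int epfd = epoll_create(1);
    if (epfd < 0)
        return -1;

    /* 2. Register every descriptor with the events of interest. The fd
     *    itself is stored as user data so it comes back with the event. */
    for (int i = 0; i < n; i++) {
        struct epoll_event ev = { 0 };
        ev.events = EPOLLIN;
        ev.data.fd = fds[i];
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[i], &ev) != 0) {
            close(epfd);
            return -1;
        }
    }

    /* 3. Wait. The kernel fills 'events' with ready descriptors only. */
    struct epoll_event events[MAX_EVENTS];
    int ready = epoll_wait(epfd, events, MAX_EVENTS, timeout_ms);

    /* 4. Walk just the ready entries; no scan over the full list. */
    for (int i = 0; i < ready; i++)
        ready_fds[i] = events[i].data.fd;

    close(epfd);
    return ready;
}
```

For example, with two pipes where only the first has data, epoll_round(fds, 2, out, 1000) returns 1 and out[0] holds the first pipe's read end.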

It provides the following API functions:
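  • int epoll_create(int size): Creates a new epoll instance and returns a file descriptor referring to it. A program typically calls it once per epoll instance it needs, for example once in each thread that runs its own event loop. The 'size' parameter was a hint that older Linux kernels used to size internal kernel structures; it has been ignored since Linux 2.6.8, but it must still be greater than zero. Newer code can use epoll_create1(int flags) (Linux 2.6.27 and later), which drops 'size' and accepts flags such as EPOLL_CLOEXEC.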
  • int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event): Adds a new descriptor together with the events of interest (op = EPOLL_CTL_ADD), removes an existing descriptor from the epoll instance (EPOLL_CTL_DEL), or modifies the events and user data registered for an existing descriptor (EPOLL_CTL_MOD). Programs can do this dynamically, at any point after the instance has been created.
  • int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout): Used by threads to wait for events on the descriptors that were added to the epoll instance identified by 'epfd'. The call blocks until at least one registered descriptor has an event ready or until 'timeout' milliseconds have elapsed (-1 waits indefinitely, 0 returns immediately). It stores up to 'maxevents' ready descriptors, each with its user data, in the first entries of 'events' and returns how many it stored; it returns 0 on timeout and -1 on error.
  • Use the ordinary 'close' call on 'epfd' to close the epoll instance, typically when the thread that owns it exits or no longer needs it.
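The following sketch walks a single descriptor through each operation listed above, assuming Linux 2.6.27 or later (for epoll_create1) and using a pipe as the watched descriptor; epoll_lifecycle_demo is a hypothetical name. It uses a 0 ms timeout throughout, so 'epoll_wait' never blocks, and it shows that EPOLL_CTL_MOD replaces the stored user data as well as the event mask.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Takes one pipe through EPOLL_CTL_ADD, EPOLL_CTL_MOD and EPOLL_CTL_DEL,
 * polling with a zero timeout after each step. Returns 0 if every step
 * behaved as the comments describe, or the number of the first step that
 * did not. */
int epoll_lifecycle_demo(void)
{
    int p[2];
    if (pipe(p) != 0)
        return 1;

    /* epoll_create1() replaces the ignored size argument with flags;
     * EPOLL_CLOEXEC closes the instance automatically across exec(). */
    int epfd = epoll_create1(EPOLL_CLOEXEC);
    if (epfd < 0) {
        close(p[0]);
        close(p[1]);
        return 2;
    }

    int failed = 0;
    struct epoll_event ev = { 0 }, out[4];

    /* ADD the read end. The pipe is empty, so a 0 ms timeout returns 0
     * immediately instead of blocking. */
    ev.events = EPOLLIN;
    ev.data.fd = p[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) != 0 ||
        epoll_wait(epfd, out, 4, 0) != 0) {
        failed = 3;
        goto done;
    }

    /* After a write, the descriptor is reported with its user data. */
    if (write(p[1], "x", 1) != 1 ||
        epoll_wait(epfd, out, 4, 0) != 1 || out[0].data.fd != p[0]) {
        failed = 4;
        goto done;
    }

    /* MOD replaces the event mask and user data of the existing entry.
     * The byte is still unread, so it is reported again (level-triggered),
     * this time carrying the new data. */
    ev.data.u64 = 42;
    if (epoll_ctl(epfd, EPOLL_CTL_MOD, p[0], &ev) != 0 ||
        epoll_wait(epfd, out, 4, 0) != 1 || out[0].data.u64 != 42) {
        failed = 5;
        goto done;
    }

    /* DEL removes the descriptor, so nothing is reported any more. The
     * event argument is ignored (kernels before 2.6.9 needed non-NULL). */
    if (epoll_ctl(epfd, EPOLL_CTL_DEL, p[0], NULL) != 0 ||
        epoll_wait(epfd, out, 4, 0) != 0) {
        failed = 6;
        goto done;
    }

done:
    /* close() releases the epoll instance itself. */
    close(epfd);
    close(p[0]);
    close(p[1]);
    return failed;
}
```

A return value of 0 means every step behaved as the comments describe.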
