Monday, June 21, 2010

User space IO - Some Challenges & Mitigations

There is plenty of good information about UIO on the Internet.  This link provides a good introduction to the subject.

What is UIO (User space I/O) framework?

The UIO framework is part of the Linux kernel and enables device driver development in user space.

Which applications require User space drivers?

Zero-copy drivers are becoming necessary for performance reasons.  Network packet processing applications have traditionally been developed in Linux kernel space; firewall, NAT and IPsec are some examples found there.  Increasingly, these applications are being moved to user space for reasons such as the availability of a large memory space, easier debugging, and faster image upgrade and restart.  Moving these applications without also moving the Ethernet driver or acceleration drivers reduces performance.  Even though there are some efficient mechanisms to transfer packets between kernel and user space, they still take some core cycles.  Having access to the hardware from user space eliminates the need for any mechanism to transfer packets back and forth between user space and kernel space.  The UIO framework allows user space applications to own the device.  It does this by letting the application's kernel driver map the hardware I/O memory into the user space process.  It also allows the application's kernel driver to register an interrupt handler for the hardware IRQ, and it wakes up the user space daemon upon a hardware interrupt.  The user space application, upon getting an indication from the UIO framework, reads the packets or acceleration results directly from the hardware memory without involving the kernel.

What components are involved in UIO?

The UIO framework is part of the kernel itself.  Application developers need to develop one simple kernel module and a user space application.  The kernel module, as indicated above, is expected to register an interrupt handler with the UIO framework and to describe the memory ranges (address, size pairs) to it.  The user space application opens the appropriate UIO device /dev/uioX (X being the minor number), obtains the memory map ranges from the 'sysfs' file system, maps the memory with mmap(), and waits for interrupt events using either the read() or epoll_wait() system call.  When read() or epoll_wait() returns, it can read the content from the memory mapped area and do the actual application processing on the packets.
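As a rough sketch of the wait step: a read() on the UIO file descriptor blocks until the next interrupt and returns a 4-byte event count.  The helper name uio_wait_event() is my own invention for illustration and is not part of UIO.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical helper: block until the next interrupt event on an open
 * UIO fd.  On success stores the running event count and returns 0;
 * returns -1 on error or short read. */
int uio_wait_event(int fd, uint32_t *event_count)
{
    uint32_t count;
    ssize_t n;

    assert(event_count != NULL);

    do {
        /* UIO expects a read of exactly 4 bytes. */
        n = read(fd, &count, sizeof(count));
    } while (n < 0 && errno == EINTR);   /* retry if a signal interrupted us */

    if (n != (ssize_t)sizeof(count))
        return -1;

    *event_count = count;
    return 0;
}
```

A caller would typically open /dev/uioX once, then call this in a loop and process the device after each successful return.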

API exposed by UIO framework for application kernel modules:

uio_register_device(struct device *parent,  struct uio_info *info)

This function is expected to be called by the application kernel module, with 'info' filled in with the right values.  When this function completes, the UIO framework has created the device /dev/uioX, where X is a dynamically assigned minor number.  This is the device the user space program is expected to open to read the interrupt events.

struct uio_info {
    const char        *name;
    const char        *version;
    struct uio_mem        mem[MAX_UIO_MAPS];
    struct uio_port        port[MAX_UIO_PORT_REGIONS];
    long            irq;
    unsigned long        irq_flags;
    struct uio_device *uio_dev;
    irqreturn_t (*handler)(int irq, struct uio_info *dev_info);
    int (*mmap)(struct uio_info *info, struct vm_area_struct *vma);
    int (*open)(struct uio_info *info, struct inode *inode);
    int (*release)(struct uio_info *info, struct inode *inode);
    int (*irqcontrol)(struct uio_info *info, s32 irq_on);
};

name, version:  The application driver can provide any strings here.  Since the 'name' field is used by the user space application to figure out the device name (/dev/uioX), it should be chosen so that it is specific to your application and unique across UIO devices.  Note that the value X in /dev/uioX is chosen dynamically by the UIO framework and can differ across restarts of the Linux system.  So if the user space application hardcodes the device file in its code, it can break when the system restarts and the UIO devices register with the UIO framework in a different order.  'name' is what stays constant across restarts, as it is given by the application driver.  Since the device name is not constant, the user space application should, during its initialization, find the UIO device name based on the value of 'name'.

The UIO framework creates a set of files under the /sys/class/uio/ directory, with one sub directory per UIO device.  If there are two UIO devices, there are two sub directories: /sys/class/uio/uio0/ and /sys/class/uio/uio1/.  Under each uioX directory there is a set of files ('name', 'version' and 'event') and a set of directories ('maps' and 'device').  The 'name' file contains, in its first line, the name of the device given by the application driver.  The 'version' file contains the version string given by the application driver.  The 'event' file contains the number of times the interrupt service routine has been called so far.

The user space application is expected to find the right device name by scanning the directory entries (using scandir()) in the /sys/class/uio/ directory.  For each directory entry, it needs to open the file 'name', read the first line and compare it with the name the application is looking for.  If it matches, note down the matching directory entry (uioX) and use it to form the /dev/uioX string to open the UIO device.  The FD returned by opening the device can be used to read the interrupt events.  This FD can also be added to an epoll() set, which is useful if your application needs to wait for events from multiple file descriptors.
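The lookup described above could be sketched as follows.  This is only an illustration under my own assumptions: the function name uio_find_device() and its parameters are hypothetical, it uses opendir()/readdir() rather than scandir() for brevity, and the sysfs directory is passed in so that the code can be tried against a test directory.

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical helper: look under sysfs_uio_dir (normally "/sys/class/uio")
 * for the uioX entry whose 'name' file matches 'wanted'.  On success writes
 * "/dev/uioX" into dev_path and returns 0; returns -1 if nothing matches. */
int uio_find_device(const char *sysfs_uio_dir, const char *wanted,
                    char *dev_path, size_t dev_path_len)
{
    DIR *dir;
    struct dirent *ent;
    int found = -1;

    assert(wanted != NULL && dev_path != NULL);

    dir = opendir(sysfs_uio_dir);
    if (dir == NULL)
        return -1;

    while (found != 0 && (ent = readdir(dir)) != NULL) {
        char entry[512], name_path[sizeof(entry) + 8], line[128];
        struct stat st;
        FILE *fp;

        if (strncmp(ent->d_name, "uio", 3) != 0)
            continue;                      /* also skips "." and ".." */

        snprintf(entry, sizeof(entry), "%s/%s", sysfs_uio_dir, ent->d_name);
        if (stat(entry, &st) != 0 || !S_ISDIR(st.st_mode))
            continue;                      /* only uioX directories matter */

        snprintf(name_path, sizeof(name_path), "%s/name", entry);
        fp = fopen(name_path, "r");
        if (fp == NULL)
            continue;

        if (fgets(line, sizeof(line), fp) != NULL) {
            line[strcspn(line, "\n")] = '\0';   /* strip trailing newline */
            if (strcmp(line, wanted) == 0) {
                snprintf(dev_path, dev_path_len, "/dev/%s", ent->d_name);
                found = 0;
            }
        }
        fclose(fp);
    }

    closedir(dir);
    return found;
}
```

The returned path can then be passed to open() to obtain the FD used for read(), mmap() and epoll.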

struct uio_mem mem[MAX_UIO_MAPS]:   If your user space application needs to map the register space of the hardware, the application kernel driver is expected to fill this in.  Since multiple memory ranges may be required to access the hardware, UIO accepts an array of mappings.

struct uio_mem {
    const char *name;
    unsigned long addr;
    unsigned long size;
    int memtype;
    void  __iomem *internal_addr;
};

The application kernel driver is expected to fill in the array of memory ranges using the above structure at registration time.  The user space application is expected to read the memory ranges from the /sys/class/uio/uioX/maps/ directory and do the memory mapping using mmap().  If the application kernel driver fills in four memory ranges, there are four sub directories under /sys/class/uio/uioX/maps/: map0, map1, map2 and map3.  Under each 'mapX' sub directory there are three files: name, addr and size.  The 'addr' file contains the address and the 'size' file contains the size; see below for an explanation.  The user space application is expected to read all pairs of 'addr' and 'size' and use mmap() to map them into its virtual address space.  First, some explanation of the fields of uio_mem before going into further details.
  • addr:  This could be a physical, logical or virtual address.  Mostly it is a physical address, as hardware device memory is exposed here.
  • size:  Size of the memory that needs to be exposed to user space.
  • name:  Name given to each memory range.
  • internal_addr:  This is not meant for use by user space programs.  The kernel driver can initialize it for its own later use by the interrupt service routine or the irqcontrol function.  Typically, it holds the address obtained by mapping the device memory with ioremap().
One thing to note is that memory mapping always works on page boundaries.  Very often, the device memory does not start at a page boundary.  Hence the user space application must add the right offset to the address returned by mmap() to point to the right locations in the device.  The user space application is expected to keep track of this 'offset' for each memory range, identified by its 'name'.
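Reading the 'addr' and 'size' files described above could look roughly like this.  The function names are hypothetical, and the sketch assumes both files hold a single hexadecimal value such as 0xfe000000.  The map directory is passed in by the caller, e.g. "/sys/class/uio/uio0/maps/map0".

```c
#include <assert.h>
#include <stdio.h>

/* Read one hexadecimal value (with or without a 0x prefix) from a file. */
static int read_hex_attr(const char *path, unsigned long *value)
{
    FILE *fp = fopen(path, "r");
    int ok;

    if (fp == NULL)
        return -1;
    ok = (fscanf(fp, "%lx", value) == 1);
    fclose(fp);
    return ok ? 0 : -1;
}

/* Hypothetical helper: read the 'addr' and 'size' files found in map_dir.
 * Returns 0 on success, -1 if either file is missing or malformed. */
int uio_read_map(const char *map_dir, unsigned long *addr, unsigned long *size)
{
    char path[512];

    assert(addr != NULL && size != NULL);

    snprintf(path, sizeof(path), "%s/addr", map_dir);
    if (read_hex_attr(path, addr) != 0)
        return -1;

    snprintf(path, sizeof(path), "%s/size", map_dir);
    return read_hex_attr(path, size);
}
```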

mmap() takes an 'offset' parameter (note that this offset has nothing to do with the offset explained above), which must be a multiple of the page size.  The UIO framework uses this offset field to determine which memory range the user space program intends to map.  Note that the Linux I/O infrastructure allows the UIO framework to have only one mmap() function per device: whenever mmap() is called in user space, the mmap() function of the UIO framework in the kernel is called, and it internally calls remap_pfn_range() to map the memory.  With only one mmap() function, how does UIO know which memory range to map?  To solve this, UIO expects user space programs to pass an offset of N * getpagesize(), where N is the memory map index.  UIO internally derives the memory map index from the offset field and uses the corresponding 'addr' and 'size' values.
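The two different offsets are easy to mix up, so here is a small sketch of both calculations.  The helper names are mine and purely illustrative.

```c
#include <assert.h>
#include <sys/types.h>
#include <unistd.h>

/* Offset to pass to mmap() in order to select UIO map number map_index. */
off_t uio_mmap_offset(int map_index)
{
    assert(map_index >= 0);
    return (off_t)map_index * getpagesize();
}

/* Distance of a device address from the start of the page containing it.
 * This is what has to be added to the pointer returned by mmap() when the
 * device memory does not start on a page boundary. */
unsigned long uio_page_offset(unsigned long addr)
{
    return addr & ((unsigned long)getpagesize() - 1);
}
```

A caller would then do something like mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, uio_mmap_offset(N)) and add uio_page_offset(addr) to the returned pointer before touching device registers.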

irq:  If your hardware device needs to interrupt the user process, the application kernel driver is expected to register the IRQ number with the UIO framework.  If the hardware device does not have this facility, or an interrupt is not required, then '0' needs to be passed here.  The UIO framework also provides an API function, uio_event_notify(), to wake up the user process.  This can be used from a timer or other facility to wake up the user space process if the hardware device does not support interrupts.

irq_flags: The kernel driver is expected to pass these flags, which the UIO framework hands on to the request_irq() function.  Typically, the IRQF_SHARED flag is passed if the IRQ is shared across more than one hardware device.

uio_dev:   This is filled in by the UIO framework, which keeps its own private information there.  For every registration, the UIO framework creates an instance of uio_dev and stores it in this field.  It is not meant to be interpreted by the application kernel driver.  Any further calls to the UIO framework from the application kernel driver are expected to pass uio_info.  The UIO framework gets its instance from uio_info->uio_dev and uses the information in it to do its processing.

irqreturn_t (*handler)(int irq, struct uio_info *dev_info):   This is the main interrupt handler.  It is expected to be provided by the application kernel driver, which implements the interrupt service routine as required by the device.  Waking up the user process is taken care of by the UIO framework itself.  The UIO framework sets its own function as the interrupt handler when calling request_irq().  That is, when there is an interrupt, the UIO framework gets control first.  It calls the application driver's handler function and then does whatever is necessary to wake up the user process.  Hence the application driver's handler need not worry about waking up the user process.  My observation is that the application driver's interrupt handler usually does not do much.  Mostly, it just disables interrupt generation by programming the device registers.  What should be done in the application driver's handler depends on the hardware device capabilities.
  •  Hardware devices typically let software mask and unmask interrupt generation.  They also let software acknowledge previous events, so that the hardware generates interrupts only for new events; once interrupts are enabled, the hardware raises one if new packets have come in.  If the hardware has these capabilities, the kernel handler typically just disables interrupt generation.  The user space process, upon being woken up, acknowledges the previous events so that interrupts are generated only for new events from then on, reads all device events in a loop (packets, results and so on), and then enables the interrupt again.  This method automatically provides coalescing: the user space process is woken up on the first event and the kernel handler disables interrupts, so by the time the process runs it handles not only the event that woke it up but also any events that arrived after it.
  • Note that the user space process or thread may be processing packets from multiple UIO devices.  In this case, if the user process handles all the packets coming from one device in a loop until all events are read, there is a chance that packets from other devices are not handled in a timely fashion.  All devices should be given a fair chance.  One way to take care of this is to have one thread per device, but that may not be efficient.  It appears that performance is best when the number of packet processing threads equals the number of cores/HW threads.  There could be more devices than threads, so for efficiency reasons one thread may need to work with multiple devices.  In these cases, to be fair across devices, a thread should handle only a 'quota' of packets from each device before moving on and revisiting the devices again.  This concept is similar to the NAPI model adopted in Linux Ethernet drivers.
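The quota idea can be illustrated with a small simulation.  Nothing here touches real hardware: struct sim_dev and its 'pending' counter are stand-ins I made up for a device ring, and 'quota' is the per-device budget.

```c
#include <assert.h>
#include <stddef.h>

/* Simulated device: 'pending' stands in for events waiting in a ring. */
struct sim_dev {
    int pending;   /* events not yet processed */
    int handled;   /* events processed so far */
};

/* Process at most 'quota' events from one device; returns how many. */
static int dev_poll(struct sim_dev *dev, int quota)
{
    int done = 0;

    while (done < quota && dev->pending > 0) {
        dev->pending--;    /* stands in for reading one packet or result */
        dev->handled++;
        done++;
    }
    return done;
}

/* One fair pass over all devices; returns the total number processed.
 * A return value of 0 means every device is drained. */
int poll_round(struct sim_dev *devs, size_t ndevs, int quota)
{
    int total = 0;
    size_t i;

    assert(quota > 0);
    for (i = 0; i < ndevs; i++)
        total += dev_poll(&devs[i], quota);
    return total;
}
```

A thread would call poll_round() repeatedly until it returns 0, then re-enable interrupts on its devices and go back to waiting in epoll.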
int (*irqcontrol)(struct uio_info *info, s32 irq_on):  This function pointer is filled in by the application kernel driver to allow the user space process to explicitly enable or disable interrupt generation by the hardware device.  The UIO infrastructure calls this function when the user space process calls write() on the UIO device fd.  Normally, the irq handler disables interrupt generation and the user space process enables interrupts through the mapped memory.  Some hardware devices can have race conditions if two contexts update interrupt mask related registers, for example when the mask register is also used for other purposes.  In these cases, central control of enable and disable is necessary.  Many modern hardware devices do not have this issue, in which case registering this function is not required.
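From user space, the write() that triggers irqcontrol() carries a 32-bit value, 1 to enable and 0 to disable.  A minimal sketch, with a helper name of my own choosing:

```c
#include <assert.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical helper: ask the driver's irqcontrol() hook to enable
 * (non-zero) or disable (zero) interrupts.  Returns 0 on success. */
int uio_irq_control(int fd, int enable)
{
    int32_t irq_on = enable ? 1 : 0;
    ssize_t n;

    assert(fd >= 0);
    n = write(fd, &irq_on, sizeof(irq_on));   /* exactly 4 bytes */
    return (n == (ssize_t)sizeof(irq_on)) ? 0 : -1;
}
```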

int (*mmap)(struct uio_info *info, struct vm_area_struct *vma):  In the usual case, the application kernel driver need not set this pointer.  The UIO infrastructure has its own mmap() function, which does the memory mapping from the uio_mem array when user space calls mmap().  At times, the number of entries that need to be mapped is more than MAX_UIO_MAPS, and the UIO infrastructure cannot do the mapping itself.  In that case, the application kernel driver needs to provide the mmap() function pointer and do the necessary mapping.

Even though the UIO framework lets the application kernel driver either indicate the memory ranges to map or register an application specific mmap() function pointer, I more often see that neither is used.  The UIO framework is predominantly used only for registering the interrupt handler that wakes up the application user process.  Many times, the application kernel driver itself is written as a character device driver with its own ioctl() and mmap() functions in addition to open() and close().  There are multiple reasons for doing this.  One of the reasons is given below.
  • Applications need to map not only the device specific memory locations, but also kernel memory for packet and acceleration-result buffers.  The UIO infrastructure does not provide this.  Ethernet hardware devices typically expose rings of descriptors to receive packets.  The application is expected to provide a buffer in each descriptor, and the Ethernet controller fills the buffers in the descriptors with incoming packets.  Buffers given to the Ethernet controller must be physical addresses.  Current-generation multicore SoCs cannot translate virtual addresses to physical addresses internally, hence physical addresses need to be provided for the buffers that go into receive descriptors.  Since Linux user space has no physical memory of its own, it needs to get this memory from kernel space, and the application kernel driver does this job.  The user space program asks the kernel driver to allocate the memory and map it into user space.  When mmap() returns in user space, the program has the virtual address.  It gets the physical address of the allocated buffer from the kernel driver, uses the physical address when programming the hardware, and uses the virtual address when accessing the buffer in its own code.  User space programs typically ask the kernel driver to allocate one big chunk of memory and map it; packet buffers are then allocated and freed from this big chunk.
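The virtual/physical bookkeeping for such a chunk can be sketched as below.  This assumes the chunk is physically contiguous (so a single linear offset converts between the two address spaces) and uses a deliberately simple bump allocator; struct buf_pool and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One big chunk obtained from the kernel driver and mapped with mmap(). */
struct buf_pool {
    uint8_t  *virt_base;   /* address returned by mmap() in user space */
    uint64_t  phys_base;   /* physical address reported by the driver */
    size_t    buf_size;    /* size of each packet buffer */
    size_t    nbufs;       /* number of buffers in the chunk */
    size_t    next_free;   /* index of the next unused buffer */
};

/* Hand out the next unused buffer, or NULL when the chunk is exhausted. */
void *pool_alloc(struct buf_pool *p)
{
    if (p->next_free >= p->nbufs)
        return NULL;
    return p->virt_base + p->buf_size * p->next_free++;
}

/* Address to program into a hardware descriptor for a given buffer. */
uint64_t pool_virt_to_phys(const struct buf_pool *p, const void *virt)
{
    const uint8_t *v = virt;

    assert(v >= p->virt_base && v < p->virt_base + p->buf_size * p->nbufs);
    return p->phys_base + (uint64_t)(v - p->virt_base);
}

/* Address the program can dereference for a buffer the hardware reported. */
void *pool_phys_to_virt(const struct buf_pool *p, uint64_t phys)
{
    assert(phys >= p->phys_base);
    return p->virt_base + (size_t)(phys - p->phys_base);
}
```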

Applications may require big chunks of memory for several reasons: packet buffers, acceleration results and even local contexts.  But there is only one mmap() function, and it has no special argument by which the user process can indicate the purpose to the application driver.  Hence some kind of protocol is necessary between the user space process and the kernel driver.  One method that is typically followed is to indicate the purpose via one IOCTL command, then do mmap(), and then use another IOCTL command to learn the base address of the allocated memory.  Let us say that there are two different memory chunks to be allocated: Chunk1 of size 128 Kbytes for packets and Chunk2 of size 64 Kbytes for acceleration results.  Then the sequence by which user space calls the kernel driver through the FD is:

ioctl(fd,  SET_PURPOSE,  argument consisting of  type 'CHUNK1',  size '128K')
mmap() on fd with length '128K'
ioctl(fd,  GET_MMAP_RESULT, argument consisting of 'physical address')

A similar sequence needs to be followed whenever Chunk2 is required.

The kernel driver needs to keep the information given via SET_PURPOSE in its private data.  When the kernel driver's mmap() function is invoked, it allocates memory using kmalloc() and calls remap_pfn_range() to map it.  It stores the address returned by kmalloc() in its private data, and the corresponding physical address is given back to user space when user space issues the GET_MMAP_RESULT command.  All three operations need to happen atomically.  The kernel driver may want to enforce the sequence and return an error if a new sequence is started before the old one has completed.
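The sequence enforcement can be modelled as a tiny state machine.  The sketch below is plain C that can be run in user space; it is not driver code, and the state names, struct map_session and the error codes chosen are my own assumptions.  In a real driver, these checks would sit inside the ioctl() and mmap() handlers and be protected by a lock.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum seq_state { SEQ_IDLE, SEQ_PURPOSE_SET, SEQ_MAPPED };

/* Per-open-file bookkeeping for one SET_PURPOSE / mmap / GET_MMAP_RESULT run. */
struct map_session {
    enum seq_state state;
    int            chunk_type;   /* purpose given in SET_PURPOSE */
    size_t         size;         /* size given in SET_PURPOSE */
    unsigned long  phys_addr;    /* recorded when mmap() allocates memory */
};

/* Step 1: SET_PURPOSE.  Refused while an earlier sequence is unfinished. */
int session_set_purpose(struct map_session *s, int chunk_type, size_t size)
{
    if (s->state != SEQ_IDLE)
        return -EBUSY;
    s->chunk_type = chunk_type;
    s->size = size;
    s->state = SEQ_PURPOSE_SET;
    return 0;
}

/* Step 2: mmap().  Only valid right after SET_PURPOSE, with matching size. */
int session_mmap(struct map_session *s, size_t len, unsigned long phys_addr)
{
    if (s->state != SEQ_PURPOSE_SET || len != s->size)
        return -EINVAL;
    s->phys_addr = phys_addr;
    s->state = SEQ_MAPPED;
    return 0;
}

/* Step 3: GET_MMAP_RESULT.  Reports the address and ends the sequence. */
int session_get_result(struct map_session *s, unsigned long *phys_addr)
{
    assert(phys_addr != NULL);
    if (s->state != SEQ_MAPPED)
        return -EINVAL;
    *phys_addr = s->phys_addr;
    s->state = SEQ_IDLE;
    return 0;
}
```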

int (*open)(struct uio_info *info, struct inode *inode),  int (*release)(struct uio_info *info, struct inode *inode):   These function pointers can be set by the application kernel driver to get control whenever a user space application opens or closes the UIO device.  The driver can do any necessary setup or cleanup there.

Example Program

