There are two types of devices that get mapped to user space for zero-copy drivers: Ethernet devices and accelerator devices such as crypto engines and pattern-matching engines. Normally, a given Ethernet device is completely owned by one user process. Accelerator devices, however, are usually needed by multiple processes and also by kernel applications. Hence acceleration device usage is more challenging.
To enable usage by multiple user processes, acceleration devices normally support multiple individual descriptor rings. I know of some acceleration devices that support four descriptor rings, each working independently of the others; that is, one descriptor ring is sufficient for issuing a command and reading its result. In this scenario, a user process should own at least one descriptor ring for its zero-copy driver. If the acceleration device contains four descriptor rings, then up to four user processes can use the device without involving the kernel. Since a typical system contains more processes than descriptor rings, at least one descriptor ring must be reserved for the kernel and for the remaining application processes. In the four-ring example, one could choose three critical user processes that require a zero-copy driver and give each of them one descriptor ring. All other user processes and the kernel then share the fourth ring.
Each process requiring a zero-copy driver should memory map the descriptor ring space. Since many chip vendors provide Linux kernel drivers for their acceleration engines, my suggestion is to modify the vendor's acceleration engine driver to provide additional API functions that detach and attach descriptor rings on demand. When a user process requires a descriptor ring, the associated application kernel module can call the acceleration driver's 'detach' function for that descriptor ring. When the process dies, the associated kernel module should attach the descriptor ring back to the kernel driver. This way, a user process does not have to deal with initialization of the acceleration engine. It only needs to worry about filling the descriptors with commands and reading the responses. This approach also has the benefit that descriptor rings can be allocated and freed dynamically, based on the applications running at that point in time.
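The detach/attach bookkeeping described above can be sketched as a small ownership table. This is only a user-space model under stated assumptions, not a vendor driver API: the names `ring_table`, `ring_detach` and `ring_attach_all` are hypothetical, the device is assumed to have four rings, and ring 0 is assumed to be the one that always stays with the kernel driver for shared use.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_RINGS 4          /* assumption: the device exposes four rings */
#define RING_OWNER_KERNEL 0  /* owner value meaning "attached to the kernel driver" */

/* Hypothetical bookkeeping: which process currently owns each ring. */
struct ring_table {
    int owner[NUM_RINGS];    /* owning pid, or RING_OWNER_KERNEL */
};

/* At start-up every ring belongs to the kernel driver. */
void ring_table_init(struct ring_table *t)
{
    for (size_t i = 0; i < NUM_RINGS; i++)
        t->owner[i] = RING_OWNER_KERNEL;
}

/*
 * Detach a free ring from the kernel driver on behalf of `pid`.
 * Ring 0 is never handed out, so the kernel and all other processes
 * always have one ring to share. Returns the ring index, or -1 if
 * every dedicated ring is already taken.
 */
int ring_detach(struct ring_table *t, int pid)
{
    assert(pid > 0);
    for (int i = 1; i < NUM_RINGS; i++) {
        if (t->owner[i] == RING_OWNER_KERNEL) {
            t->owner[i] = pid;
            return i;
        }
    }
    return -1;
}

/*
 * Attach every ring owned by `pid` back to the kernel driver, for
 * example when the process dies. Returns the number of rings returned.
 */
int ring_attach_all(struct ring_table *t, int pid)
{
    int released = 0;

    assert(pid > 0);
    for (int i = 1; i < NUM_RINGS; i++) {
        if (t->owner[i] == pid) {
            t->owner[i] = RING_OWNER_KERNEL;
            released++;
        }
    }
    return released;
}
```

In a real system this table would live inside the modified acceleration driver, and the application kernel module would call the detach function before exposing the ring memory to its process and the attach function from its release path. A process that gets -1 back would fall back to the shared kernel driver path.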
If there are as many interrupts as descriptor rings, then each process's zero-copy driver can have its own interrupt line. At times, even though there are multiple descriptor rings, the number of interrupts is smaller than the number of descriptor rings. In this case, an interrupt needs to be shared across multiple descriptor rings. Fortunately, the Linux kernel and the UIO framework provide a mechanism for multiple application kernel modules to register different interrupt handlers for the same interrupt line. The irq_flags field of the uio_info structure that is registered with the UIO framework should have the IRQF_SHARED bit set. The Linux kernel and the UIO framework then call the interrupt handlers one by one in sequence. A handler that has data pending to be read from its descriptor ring should return IRQ_HANDLED; a handler with nothing pending should return IRQ_NONE. This means the device must be able to report pending data without the data being read out. Note that reading the acceleration result should be done by user space. When the handler returns IRQ_HANDLED, the UIO framework wakes up the user process.
Since one IRQ line is shared by multiple processes, the interrupt handler and the user process cannot mask and unmask the interrupt in the way described in the earlier post. And since interrupts can't be disabled, one can't use the natural coalescing capability described in the earlier post either. Fortunately, many acceleration devices provide hardware interrupt coalescing. The hardware can be programmed to generate an interrupt after X events or after Y amount of time, whichever comes first. If the hardware device you have chosen does not have coalescing capability and requires the IRQ to be shared across multiple user processes, then you are out of luck: either don't use the UIO facilities or live with too many interrupts coming in.
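To make the shared-interrupt flow concrete, here is a minimal user-space model of it. It is a sketch under assumptions, not kernel code: `MODEL_IRQ_NONE` and `MODEL_IRQ_HANDLED` stand in for the kernel's IRQ_NONE and IRQ_HANDLED, `ring_ctx`, `shared_irq_fire` and `ring_user_read` are hypothetical names, and the `pending` flag models a device status bit that can be checked without consuming the result.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for the kernel's IRQ_NONE / IRQ_HANDLED return values. */
enum model_irqreturn { MODEL_IRQ_NONE = 0, MODEL_IRQ_HANDLED = 1 };

/* Hypothetical per-ring state seen by that ring's interrupt handler. */
struct ring_ctx {
    bool pending;      /* device status bit: results waiting in this ring */
    unsigned wakeups;  /* how often the owning process would be woken */
};

/*
 * Per-ring handler: only peeks at the status bit. It does not read the
 * results; that is left to the user process that owns the ring.
 */
static enum model_irqreturn ring_handler(const struct ring_ctx *ring)
{
    return ring->pending ? MODEL_IRQ_HANDLED : MODEL_IRQ_NONE;
}

/*
 * One firing of the shared line: every registered handler is called in
 * sequence, and each ring that reports "handled" gets its process woken.
 * Returns how many handlers claimed the interrupt.
 */
unsigned shared_irq_fire(struct ring_ctx *rings, size_t n)
{
    unsigned handled = 0;

    assert(rings != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (ring_handler(&rings[i]) == MODEL_IRQ_HANDLED) {
            rings[i].wakeups++;
            handled++;
        }
    }
    return handled;
}

/* User-space side: draining the ring clears the pending condition. */
void ring_user_read(struct ring_ctx *ring)
{
    ring->pending = false;
}
```

The point of the model is that no handler touches the interrupt mask: each one only decides whether the interrupt was meant for its own ring, so processes whose rings are idle are not woken.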
All other user processes, those without dedicated descriptor rings, should work with the accelerator kernel driver provided by the OS or chip vendor. That is, they send the command buffer to the kernel driver and read the result back from the kernel driver. Kernel drivers are normally intelligent enough to service multiple consumers, so many user processes can use the acceleration engine this way.