Thursday, December 30, 2010

What are Traffic Monitoring Enabler Switches?

There is an increasing trend of deploying Traffic Monitoring Enabler Switches (TMES) in Enterprise, Data Center and Service Provider environments.

Need for TMES:

Traffic monitoring devices are increasingly becoming a requirement for networks in Enterprise, Data Center and Service Provider environments. Multiple types of monitoring devices are being deployed in these networks.
  • Traffic Monitoring for intrusion detection: Security is a very important aspect of Enterprise networks, and intrusion detection is one component of comprehensive network security. IDS devices listen to the traffic passively and perform intrusion analysis on it. Intrusion attempts and intrusion events are sent to administrators for out-of-band action. IDS devices can also be configured to send TCP resets when an intrusion is detected, to stop any further traffic on that TCP connection, or to block certain traffic for a certain amount of time by informing local firewall devices.
  • Surveillance: Due to government regulations, all important data needs to be recorded. Surveillance monitoring devices again listen to the traffic passively and record it in persistent storage for later interpretation. They also provide the capability to recreate sessions - Email conversations, file transfer conversations, voice and video conversations - from the recorded traffic, and some provide run-time recreation of conversations as well.
  • Network Visibility: These monitoring devices capture the traffic passively and provide complete visibility of network traffic. They provide capabilities such as identification of applications (P2P, IM, social networking and many more) and bandwidth usage per application and network, giving network administrators the information they need to keep networks, bandwidth and Enterprise-critical applications working at all times.
  • Traffic Trace: Traffic trace devices help network administrators find bottlenecks in different network segments. These devices tap the traffic at multiple locations in the network and provide trace capability for finding issues such as misconfiguration of devices in the network, choke points etc.
Network administrators face the following challenges when deploying multiple monitoring devices.
  • Few SPAN ports in existing L2 switch infrastructure: Many L2 switch vendors provide one or at most two SPAN ports. L2 switches replicate packets to the SPAN ports. Since there are at most two SPAN ports, only two types of monitoring devices can be connected. This is one big limitation network administrators face.
  • Multiple network tap points: In a complex network infrastructure, there are multiple points where monitoring is required. Placing multiple monitoring devices at each point is too expensive. Network administrators would like to use the same monitoring devices to capture traffic at multiple locations.
  • Capacity limitations of monitoring devices: With increasing bandwidth in networks, it is possible that one monitoring device may not be able to cope with the traffic. Administrators would like to use multiple monitoring devices of the same type to capture the traffic, with some external component load balancing the sessions across them.
  • High capacity monitoring devices: There could be instances where a monitoring device can take more load. In these cases, one monitoring device can take the load from several tap points. Administrators look for a facility to aggregate the traffic from multiple points to one or a few monitoring devices of the same type.
  • Non-switch capture points: Network administrators may want to monitor traffic at a point where there is no switch - Router to Server, Wireless LAN Access Point to Access Concentrator etc. Since there is no switch, there are no SPAN ports. Network administrators look for some mechanism, such as inline TAP functionality, to capture the traffic for monitoring.
What is TMES?

A TMES is a switch device with monitoring-enabling intelligence that allows connecting multiple monitoring devices of different types without any major changes to the existing network infrastructure.

This device taps the traffic from SPAN ports of existing switches in the network and directs the traffic to the attached monitoring devices.

TMES allows:
  • Centralization of monitoring devices.
  • Filtering of the traffic.
  • Balancing of traffic to multiple monitoring devices of a given type.
  • Replication of traffic to different types of monitoring devices.
  • Aggregation of traffic from multiple points to same set of monitoring devices.
  • Truncation of data 
  • Data manipulation & masking  of the sensitive content of the traffic being sent to monitoring devices.
  • Inline TAP functionality to allow capture points where there are no SPAN ports.
  • Time Stamp functionality
  • Stripping off  Layer 2 and Tunnel headers that are unrecognized by monitoring devices.
  • Conditioning of the burst traffic going to the monitoring devices.
Centralization of Monitoring Devices: 

Without a TMES, monitoring devices need to be placed at different locations in the network. With a TMES, the TAP points are connected to the TMES and the monitoring devices are connected only to TMES ports.

Filtering of Traffic:

This feature of TMES filters unwanted traffic away from a given monitoring device. Monitoring devices normally listen for traffic in promiscuous mode; that is, a monitoring device gets all the traffic going on the wire. But not all of that traffic is interesting to the device, and typically the monitoring device itself does the filtering. Offloading the filtering saves the device valuable cycles otherwise spent receiving the traffic (interrupts) and filtering it. The TMES takes this load off the monitoring device and thereby increases its capacity.

Filtering of traffic should not be restricted to unicast; it should also be available for multicast and broadcast packets.

Balancing the traffic to multiple monitoring devices of a given type:

If the amount of traffic that needs to be recorded is very high, then multiple monitoring devices will be deployed. TMES allows multiple monitoring devices to share the load. TMES load balances sessions (not packets) across the monitoring devices based on the performance of each device. By balancing based on sessions, TMES ensures that all the traffic for a given connection goes to one monitoring device.
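As an illustration of session-based balancing, here is a minimal C sketch of a symmetric 5-tuple hash: XORing the two endpoints makes both directions of a connection hash to the same value, so they land on the same monitoring device. The structure and function names are mine, not from any TMES product, and a real implementation would also weight the choice by each device's capacity.

```c
#include <stdint.h>

/* Hypothetical 5-tuple key; field names are illustrative only. */
struct flow_key {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
};

/* Symmetric hash: both directions of a connection produce the same value. */
static uint32_t flow_hash(const struct flow_key *k)
{
    uint32_t h = (k->src_ip ^ k->dst_ip)
               ^ ((uint32_t)(k->src_port ^ k->dst_port) << 16)
               ^ k->proto;
    h ^= h >> 16;          /* mix the high bits into the low bits */
    h *= 0x45d9f3bu;       /* cheap integer mixing constant */
    h ^= h >> 16;
    return h;
}

/* Pick one of n_devices monitoring devices of a given type for this flow. */
static unsigned pick_monitor(const struct flow_key *k, unsigned n_devices)
{
    return flow_hash(k) % n_devices;
}
```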

Replication of Traffic

When there are different types of monitoring devices, each device is expected to get the traffic. As discussed above, traditional L2 switches have at most two SPAN ports. TMES is expected to replicate the traffic as many times as there are different monitoring device types and send the replicated traffic to the monitoring devices.


Combining the replication feature with load balancing: Assume that a deployment requires the traffic to be sent to two types of monitoring devices - IDS and Surveillance - and that 6Gbps of traffic must be analyzed and recorded. If IDS and Surveillance devices can each handle only 2Gbps, then the deployment requires 3 IDS devices and 3 Surveillance devices. In this case, TMES is expected to replicate the original traffic twice - once for the IDS devices and once for the Surveillance devices - and then balance the first set of replicated packets across the 3 IDS devices and the second set across the 3 Surveillance devices.

Aggregation of traffic from multiple points to same set of monitoring devices

As discussed in 'Centralization of Monitoring devices', traffic from different locations of the network can be tapped. TMES is expected to provide multiple ports to receive the traffic from multiple locations in the network, then filter, replicate and balance the traffic across monitoring devices. It is possible that the traffic of a given connection passes through multiple tap points, and hence duplicate traffic could arrive at the TMES; that duplicate traffic might even go to the same monitoring device, which could confuse it. To avoid this scenario, TMES is expected to mark the packets based on the incoming TMES port (that is, the capture point), for example by adding a different VLAN ID per capture point or by adding an IP option. This allows a monitoring device to distinguish the same connection's traffic across different capture points.
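A minimal C sketch of the VLAN-based marking idea, assuming the frame buffer has four spare bytes of room: an 802.1Q tag whose VLAN ID encodes the capture point is inserted after the source MAC. This only illustrates the approach; an actual TMES does this in hardware.

```c
#include <stdint.h>
#include <string.h>

#define ETH_ALEN   6
#define VLAN_TPID  0x8100   /* 802.1Q tag protocol identifier */
#define VLAN_HLEN  4

/* Insert an 802.1Q tag after the source MAC so that the VLAN ID encodes
 * the capture point. 'cap' is the capacity of the frame buffer, which
 * must be at least len + VLAN_HLEN. Returns the new length, or 0 on error. */
static size_t tag_capture_point(uint8_t *frame, size_t len, size_t cap,
                                uint16_t capture_point_id)
{
    if (cap < len + VLAN_HLEN || len < 2 * ETH_ALEN + 2)
        return 0;

    /* Make room: shift everything after the two MAC addresses by 4 bytes. */
    memmove(frame + 2 * ETH_ALEN + VLAN_HLEN,
            frame + 2 * ETH_ALEN,
            len - 2 * ETH_ALEN);

    uint16_t tci = capture_point_id & 0x0FFF;   /* PCP/DEI left as 0 */
    frame[2 * ETH_ALEN]     = VLAN_TPID >> 8;
    frame[2 * ETH_ALEN + 1] = VLAN_TPID & 0xFF;
    frame[2 * ETH_ALEN + 2] = tci >> 8;
    frame[2 * ETH_ALEN + 3] = tci & 0xFF;
    return len + VLAN_HLEN;
}
```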




Truncation or Slicing of packets


Some monitoring device types, such as traffic measuring devices, don't require the complete data of the packets. Slicing the packets down to a smaller size increases the performance of those monitoring devices, and TMESes are expected to provide this functionality before sending the packets on. The truncate value is with respect to the payload of TCP, UDP etc. Some monitoring devices are only interested in headers up to layer 4; in this case the truncate value can be 0. Other monitoring devices may expect to see a few bytes of payload. TMESes are expected to provide the flexibility of configuring the truncate value.

Truncation of packet content should not be reflected in the IP payload size field. That field should be kept intact to ensure that monitoring devices can figure out the original data length even though they receive truncated packets.
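A small C sketch of how the truncate value might be applied: only the wire length forwarded to the monitoring device is shortened, while the IP total-length field is deliberately left untouched so the device still sees the original datagram size. The function name and parameters are hypothetical.

```c
#include <stddef.h>

/* 'l4_payload_off' is the offset of the TCP/UDP payload from the start
 * of the frame (L2 + IP header + L4 header), as found by the parser;
 * 'payload_bytes' is the configured truncate value (0 = headers only).
 * The IP total-length field in the packet is NOT rewritten. */
static size_t truncated_wire_len(size_t frame_len,
                                 size_t l4_payload_off,
                                 size_t payload_bytes)
{
    size_t want = l4_payload_off + payload_bytes;
    return want < frame_len ? want : frame_len;   /* never grow the frame */
}
```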



Data masking

Based on the type of monitoring device, an administrator may want to mask some sensitive information, such as credit card numbers, user names and passwords, from being recorded. TMES is expected to provide this functionality of pattern matching and masking the content, thereby removing privacy concerns.

TMES might also support Data Replacement (DR). The DR feature might increase the size of the data. Though this is not a big issue for UDP sessions, it requires a good amount of handling for TCP connections. TCP sequence numbers represent bytes, not packets, so any change in data size requires a sequence number update - not only in the affected packet, but also in all further packets on the session. Similarly, the ACK sequence numbers of packets in the reverse direction must also be updated while sending the packets to the monitoring devices.

When the DR feature is combined with the 'Replication' feature, this delta sequence number update can be different for different replicated packets. The delta sequence number update is required to ensure that monitoring devices see packets that are consistent with respect to sequence numbers and data.
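To make the bookkeeping concrete, here is a simplified C sketch of delta sequence number tracking for one TCP connection, assuming edits never straddle retransmissions and that a single cumulative per-direction delta is enough. With replication, one such state pair would be kept per replica stream that is edited differently, and a real implementation would also recompute the TCP checksum after any rewrite.

```c
#include <stdint.h>

/* Cumulative bytes added (positive) or removed (negative) by all
 * replacements so far in one direction of the connection. */
struct dr_seq_state {
    int32_t delta;
};

/* Adjust a packet flowing in direction 'dir' (reverse direction 'rev')
 * before it is sent to the monitoring device. 'size_change' is how many
 * bytes this packet's payload grew or shrank (0 if untouched). */
static void dr_adjust(struct dr_seq_state *dir, const struct dr_seq_state *rev,
                      uint32_t *seq, uint32_t *ack, int32_t size_change)
{
    *seq += (uint32_t)dir->delta;   /* shift by edits made in earlier packets          */
    *ack += (uint32_t)rev->delta;   /* stay consistent with the edited reverse stream  */
    dir->delta += size_change;      /* later packets inherit this packet's edit        */
}
```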

Some TMES vendors market this as part of their DPI feature set.


Inline TAP functionality:



Many places in the network might not have an L2 switch from which to get traffic via SPAN ports. If the traffic needs to be monitored at those points, one choice is to place an L2 switch there and pass the traffic to the monitoring devices through its SPAN ports. If there are many capture points, multiple L2 switches would be needed. Instead, inline TAP functionality is expected to be part of the TMES. Two TMES ports are required to tap the traffic at each of these capture points; these two ports act as an L2 switch while replicating the traffic for monitoring devices. Basically, TMES is expected to act as an L2 switch for these capture points. Since there are many capture points, the TMES essentially becomes a multi-switch device, with each logical switch having two ports.

Time Stamping of Packets

Analysis of traffic recorded across multiple monitoring devices is a common requirement. It means that the recording devices should have the same clock reference so that the analysis engine knows the order in which the packets were received. At times, it is not practical to assume that the monitoring devices will have expensive clock synchronization mechanisms. Since the TMES is becoming a central location for capturing traffic and redirecting it to monitoring devices, the TMES is expected to add a timestamp to each packet being sent to the monitoring devices.

The IP protocol provides an option called 'Internet Timestamp'. This option expects the TMES to fill in its IP address and a timestamp in milliseconds since midnight UT.
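For reference, here is a hedged C sketch of building that option as defined in RFC 791 (type 68, one address+timestamp entry, flag 1). Inserting the option also requires adjusting the IP header length, total length and checksum, which is omitted here; the helper names are mine.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

#define IPOPT_TIMESTAMP     68   /* RFC 791 Internet Timestamp option type    */
#define IPOPT_TS_TSANDADDR   1   /* flag: each entry is "address + timestamp" */

/* Milliseconds since midnight UT, as the option requires. */
static uint32_t ms_since_midnight_ut(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);
    return (uint32_t)(ts.tv_sec % 86400) * 1000u
         + (uint32_t)(ts.tv_nsec / 1000000);
}

/* Build a 12-byte timestamp option carrying one (address, timestamp) pair
 * for the TMES itself. 'own_ip_be' is the TMES address in network order;
 * 'buf' must have room for 12 bytes. Returns the option length. */
static size_t build_ts_option(uint8_t *buf, uint32_t own_ip_be)
{
    uint32_t ts_be = htonl(ms_since_midnight_ut());

    buf[0] = IPOPT_TIMESTAMP;
    buf[1] = 12;                    /* option length                         */
    buf[2] = 13;                    /* pointer: just past the one full entry */
    buf[3] = IPOPT_TS_TSANDADDR;    /* overflow = 0, flag = 1                */
    memcpy(buf + 4, &own_ip_be, 4);
    memcpy(buf + 8, &ts_be, 4);
    return 12;
}
```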

Stripping of L2 and Tunnel headers


Many monitoring devices don't understand complicated L2 headers such as MPLS and PPPoE, or tunnel headers such as PPTP (GRE), GTP (in wireless core networks), L2TP data, IP-in-IP, IPv6-over-IPv4 mechanisms (Teredo, 6to4, 6in4) and many more. Monitoring devices are primarily interested in the inner packets. TMESes are expected to provide stripping functionality and deliver basic IP packets to the monitoring devices. Since monitoring devices expect to see some known L2 header, TMESes typically strip off the tunnel headers and complicated L2 headers while keeping the Ethernet header intact. If an Ethernet header is not present, TMESes are expected to add a dummy Ethernet header to satisfy the monitoring device's receive path.
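A minimal C sketch of the last step, assuming a tunnel-aware parser has already located the inner IP header: everything before that header is dropped and a dummy Ethernet header is prepended so the monitoring device's receive path is satisfied. The dummy MAC addresses are placeholders; a real TMES might encode the capture point or tunnel identity in them.

```c
#include <stdint.h>
#include <string.h>

#define ETH_HLEN    14
#define ETH_P_IP    0x0800
#define ETH_P_IPV6  0x86DD

/* Copy only the inner packet (starting at 'inner_off') into 'out',
 * prefixed by a dummy Ethernet header. 'out' must be able to hold
 * ETH_HLEN + (len - inner_off) bytes. Returns the new length. */
static size_t strip_to_inner(const uint8_t *pkt, size_t len, size_t inner_off,
                             int inner_is_ipv6, uint8_t *out)
{
    static const uint8_t dummy_dst[6] = { 0x02, 0, 0, 0, 0, 0x01 };
    static const uint8_t dummy_src[6] = { 0x02, 0, 0, 0, 0, 0x02 };
    uint16_t etype = inner_is_ipv6 ? ETH_P_IPV6 : ETH_P_IP;

    memcpy(out, dummy_dst, 6);
    memcpy(out + 6, dummy_src, 6);
    out[12] = etype >> 8;
    out[13] = etype & 0xFF;
    memcpy(out + ETH_HLEN, pkt + inner_off, len - inner_off);
    return ETH_HLEN + (len - inner_off);
}
```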


Traffic Conditioning

Monitoring devices are normally rated for a certain number of Mbps. At times, there can be bursts in the traffic even though the overall average traffic rate is within the device rating. To avoid packet drops due to bursts, the TMES is expected to condition the traffic going to the monitoring device.

Players:

I came across a few vendors who provide solutions meeting most of the above requirements.

Gigamon :  http://www.gigamon.com/
Anue Systems:  http://www.anuesystems.com/
NetOptics:  http://www.netoptics.com


I believe this market is yet to mature and there is a lot of good upside potential.

There is a good need for monitoring devices, and hence the need for TMES will only go up in the coming years.

Sunday, December 26, 2010

User space Packet processing applications - Execution Engine differences with processors

Please read this post to understand the Execution Engine.

Many processors with descriptor-based IO devices have per-device interrupts. For each device, there is a corresponding UIO device. Hence a software-poll-based EE provides a 'file descriptor' based interface to register, deregister and receive indications through callbacks. EE applications are expected to read the packets from the hardware by themselves and do the rest of the processing.

As discussed in the UIO-related posts, there are ways to share interrupts across devices. As long as the UIO-related application kernel driver knows the type of event for which the interrupt was generated, the appropriate UIO FD is woken up and things work fine.

Non-descriptor-based IO is becoming quite common in recent Multicore processors. Hardware events (packets from the Ethernet controllers, acceleration results from the acceleration engines) are given to the software through a set of HW interfaces. Selection of the HW interface by the hardware is based on some load balancing algorithm or on some software input. The point is that the events given to the software through one HW interface come from multiple hardware IO sources. Each HW interface is normally associated with one interrupt. One might say that this can be treated as an interrupt being shared across multiple devices. But some Multicore processors have no facility to know the reason for the HW interrupt, nor do they have a facility to know the event type of the first pending event in the HW interface. Unless the event is dequeued from the HW interface, it is impossible to know the type of event. Also, due to interrupt coalescing requirements, a given interrupt instance might represent multiple events from different IO source devices. Due to this behavior, there may be only one UIO device for multiple IO devices. Hence the responsibility of demultiplexing these events to the right EE application falls on the EE itself. The EE needs to read the event, find the right application and call the appropriate callback function registered with it. Let us call this functionality in the EE the 'EE Event DeMux'.

In descriptor-based systems, each EE application reads its own HW events (packets & acceleration results). The callback invocation only indicates to the EE application that it should read the events from the associated hardware descriptors. In the case of the 'EE Event DeMux', the event has already been read by the EE itself; hence, the event is expected to be passed to the callback function.
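The difference shows up directly in the callback signatures an EE might expose. A small, hypothetical C illustration (these typedef names are mine, not from any framework):

```c
#include <stddef.h>

/* Descriptor-based device: the callback is only a wake-up; the EE
 * application dequeues packets/results from its own descriptor ring. */
typedef void (*ee_ready_cb_t)(void *app_arg);

/* 'EE Event DeMux': the EE has already dequeued the event from the shared
 * HW interface, so the event itself is handed to the application. */
typedef void (*ee_event_cb_t)(void *app_arg, void *event, size_t event_len);
```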

The 'EE Event DeMux' submodule registers itself with the rest of the EE module to get the UIO indication in the case of the software poll method. In the case of hardware poll, the 'EE Event DeMux' is invoked by the hardware poll function.

Multicore processors normally provide a single HW interface for the multiple IO devices that are part of the Multicore processor. External devices, such as PCI and other HW-bus-based IO devices, are still implemented using the descriptor-based mechanism. A software-poll-based EE should not assume that all devices are served via the 'EE Event DeMux'. As far as the core software poll system is concerned, the 'EE Event DeMux' is just another EE application. In the hardware-poll-based method, if descriptor-based HW interfaces need to be used, then the hardware poll should also poll those descriptor-based HW interfaces.

When the 'EE Event DeMux' is used by EE applications (such as the Ethernet driver or accelerator drivers), the 'EE Event DeMux' needs to consider the following requirements (a minimal sketch follows this list).
  • It should have its own 'quota': the maximum number of events it will read from the HW interface per callback invocation by the EE core. Once it has read 'quota' events, or if there are no more events, it should return to the 'Core EE' module.
  • Since this is the module which demuxes events to some EE applications, it should provide its own register/deregister functions.
  • When the 'Core EE' module invokes this module's callback function due to interrupt generation or due to a hardware poll, it is expected, as described above, to read at most 'quota' events. While giving control back to the 'Core EE', it should also tell its EE applications that there are no more events in this iteration; some EE applications might register for this indication. For example, the Ethernet driver application might register for it to implement 'Generic Receive Offload' (GRO), since GRO needs to know when to give up while doing its TCP coalescing. In the case of descriptor-based drivers this issue does not arise, as each Ethernet driver reads the events itself as part of the callback invocation and knows when to give up.
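A minimal sketch of such an 'EE Event DeMux' callback in C, under the assumptions that hw_iface_dequeue() and hw_event_source() stand in for the SoC-specific dequeue and event-classification primitives and that applications are indexed by IO source; none of these names come from a real SDK.

```c
/* Illustrative types and helpers; assumed, not from a real framework. */
struct hw_event;                               /* dequeued HW event       */
struct hw_event *hw_iface_dequeue(void);       /* returns NULL when empty */
unsigned hw_event_source(const struct hw_event *ev);

#define DEMUX_MAX_SOURCES 32

struct demux_app {
    void (*on_event)(void *arg, struct hw_event *ev);
    void (*on_no_more)(void *arg);   /* e.g. lets an Ethernet driver flush GRO */
    void *arg;
};

static struct demux_app demux_apps[DEMUX_MAX_SOURCES];

/* Callback invoked by the Core EE (interrupt wake-up or hardware poll).
 * Returns non-zero if events may still be pending, so the Core EE knows
 * to call us again before going back to epoll()/the HW poll. */
int ee_event_demux_run(unsigned quota)
{
    unsigned handled = 0;

    while (handled < quota) {
        struct hw_event *ev = hw_iface_dequeue();
        if (!ev)
            break;
        struct demux_app *app = &demux_apps[hw_event_source(ev)];
        if (app->on_event)
            app->on_event(app->arg, ev);
        handled++;
    }

    /* Tell interested applications that this burst is over. */
    for (unsigned i = 0; i < DEMUX_MAX_SOURCES; i++)
        if (demux_apps[i].on_no_more)
            demux_apps[i].on_no_more(demux_apps[i].arg);

    return handled == quota;   /* quota exhausted: possibly more pending */
}
```

Returning non-zero when the quota is exhausted lets the Core EE apply the same 'call me again before sleeping' protocol it uses for its other applications.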
Thanks for reading my earlier post.  I hope this helps.

Sunday, December 19, 2010

User space Packet processing applications - Execution Engine

If you plan to port your data plane network processing application from Linux kernel space to user space, the first thing you would think about is how to port your software with minimal changes. The Execution Engine is the first thing one would think of.

Many kernel-based networking applications don't create their own threads. They work with the threads which are already present in the kernel. For example, packet processing applications such as Firewall, IDS/IPS and IPsec VPN work in the context of the kernel TCP/IP stack. This is mainly done for performance reasons. Additional threads for these applications would result in multiple context switches. They would also result in pipeline processing, as packets are handed over from one execution context to another. Pipelining requires queues, with enqueue and dequeue operations that take core cycles. It also results in flow control issues when one thread's processing takes longer than the others'.

Essentially, the Linux kernel itself provides execution contexts and networking packet processing applications work within these contexts. The Linux TCP/IP stack itself works in softirq context. SoftIRQ processing in a normal kernel runs from both IRQ context and the softirqd context; I would say 90% of the time SoftIRQ processing happens in the IRQ context. In a PREEMPT_RT patched kernel, network IRQs are mapped to IRQ threads. In any case, the context in which network packet processing applications run is unknown to the applications. Since Linux kernel execution contexts are per core, there are fewer shared data structures and hence fewer locking requirements. The kernel and underlying hardware also provide mechanisms to balance the traffic across different execution contexts at flow granularity. In cases where the hardware does not provide any load balancing functionality, IRQs are dedicated to different execution contexts. If there are 4 Ethernet devices and 2 cores (hence two execution contexts), the four receive interrupts of the Ethernet controllers are assigned equally to the two execution contexts. If the traffic from all four Ethernet devices is the same or similar, then both cores are used effectively.
If the Execution Engine in user space packet processing applications is made similar to the kernel execution contexts, then application porting becomes simpler. The Execution Engine (EE) can be considered part of the infrastructure that enables DP processing in user space. EE design should consider the following.
  • There can be multiple data plane processing applications in user space. Each DP daemon may be assigned to run on a fixed set of cores - a core mask may be provided at startup time.
  • If a DP daemon is not associated with any core mask, it should assume that the daemon may be run by all cores; that is, it should behave as if the core mask has all core bits set.
  • A set of cores may be dedicated to the daemon; that is, those cores do nothing other than the DP processing of that daemon. This facility is typically used with Multicore processors that provide a hardware poll. Recent generations of Multicore processors can deliver incoming events and acceleration results through a single portal (or station or work group). Since the core is dedicated, no software polling is required; that is, hardware polling can be used if the underlying hardware supports it and if the core(s) are dedicated to the process.
It appears that having the number of threads in the process equal to the number of cores assigned to the process gives the best performance. It also provides great similarity with the kernel execution contexts. With the above background, I believe the EE needs the following capabilities:
  • Provide capability to assign the core mask. 
  • Provide capability to indicate whether the cores are dedicated or assigned.
  • If no core mask is provided, it should have the capability to read the number of cores in the system and assume that all cores are given in the core mask.
  • Provide capability to use software poll or hardware poll. Hardware poll should be validated and accepted only if the underlying hardware supports it and only if the cores are dedicated. Hardware polling has performance advantages as it does not require interrupt generation and interrupt processing; the disadvantage is that the core cannot be used for anything else. One should weigh the options based on the application performance requirements.
  • The API it exposes to its applications should be the same whether the execution engine uses software poll (such as epoll()) or hardware poll.
Typically these capabilities are provided through command line parameters or via some configuration file. The EE is expected to create as many threads as there are cores in the core mask. Each thread should provide the following functionality:
  • Software timers functionality - EE should provide the following functionality:
    • Creation and deletion of timer block
    • Starting, stopping, restarting timers in each timer block.
    • Each application can create one or more timer blocks and use a large number of timers in each timer block.
    • As in Kernel,  it is required that EE provides cascaded timer wheels for each timer block.
  • Facility for applications to register/deregister for events and to receive the events.
    • API function (EEGetPollType()) to return the type of poll - software or hardware. EE applications would use this to decide whether to use file descriptors (such as UIO and other application-oriented kernel drivers) for software poll, or to use hardware facilities for hardware poll.
    • Register Event Receiver: EE applications use this function to register the FD, the READ/Write/Error interest, the associated callback function pointer and the callback argument.
    • Deregister Event Receiver: EE applications call this to deregister an event receiver that was registered using the 'Register' function.
    • Variations of the above API will need to be provided by the EE if it is configured with hardware poll. Since each hardware device has its own way of representing this, there may be as many API sets. Some part of each EE application has hardware-specific initialization code and calls the right set of API functions.
    • Note that one thread handles multiple devices (multiple file descriptors in the case of software poll). Every time epoll() returns, the callback functions of the ready FDs need to be called. These functions, provided by the EE applications, are expected to get hold of packets in the case of Ethernet controllers, acceleration results in the case of acceleration devices, or other kinds of events from other devices. From the UIO discussion: if the applications use UIO-based interrupts to wake up the thread, then all events are expected to be read from the device to reduce the number of wakeups (UIO coalescing capability). Some EE applications might read a lot of events, and for each event an application reads, it calls its own application function, which can be heavy too. Because of this, if multiple FDs are ready, one EE application may take a very long time before it returns to the EE, resulting in unfair assignment of the EE thread to the other FDs that are also ready. This unfairness might even result in packet drops or increased jitter if high priority traffic is pending to be read from other devices. To ensure fairness, EE applications are expected to process only a 'quota' of events at a time before returning to the EE. 'Quota' is a tunable parameter and can be different for different types of devices. The EE is expected to call back the same application after it runs through all the other ready file descriptors. Until all ready EE applications indicate that they have nothing left to process, the EE should not call epoll() again. To let the EE know whether to call the application callbacks again, there should be some protocol: each EE application indicates to the EE, on return from its callback function, whether it processed all of its events, and the EE uses this indication to decide whether to call the application again before going back to epoll(). Note that epoll() is an expensive call, so it is better if all events are processed, in a fair fashion, before epoll() is called again (a minimal sketch of such a loop follows this list). In a hardware-poll-based configuration, this kind of facility is not required, since polling is not expensive and Multicore SoCs implementing a single portal for all events have fairness capabilities built in. Since the callback function definition is the same for both software- and hardware-poll-based systems, these parameters exist but are not used by hardware-poll-based systems.
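Here is a minimal C sketch of the fairness loop described above for the software-poll case, assuming each registered application exposes a process(arg, quota) callback that returns true when it stopped because its quota was exhausted. The structure and names are illustrative only.

```c
#include <stdbool.h>
#include <sys/epoll.h>

#define MAX_READY 64

struct ee_app {
    /* Returns true if the application may still have unread events
     * (it stopped only because its quota was exhausted). */
    bool (*process)(void *arg, unsigned quota);
    void *arg;
    unsigned quota;
};

/* One iteration of an EE thread's software-poll loop: wait once in
 * epoll(), then service the ready applications round-robin until none
 * of them reports pending work, so the relatively expensive epoll_wait()
 * is not re-entered needlessly. */
static void ee_poll_iteration(int epfd, int timeout_ms)
{
    struct epoll_event evs[MAX_READY];
    bool still[MAX_READY];

    int n = epoll_wait(epfd, evs, MAX_READY, timeout_ms);
    if (n <= 0)
        return;                    /* timeout or error: timers run elsewhere */

    for (int i = 0; i < n; i++)
        still[i] = true;

    bool any = true;
    while (any) {
        any = false;
        for (int i = 0; i < n; i++) {
            if (!still[i])
                continue;
            struct ee_app *app = evs[i].data.ptr;   /* set at registration */
            still[i] = app->process(app->arg, app->quota);
            if (still[i])
                any = true;        /* quota hit: give it another turn */
        }
    }
}
```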
Before creating threads, the EE should initialize itself and then create the threads. Once they are created, it should load the shared libraries of its applications one by one. For each EE application library, it is expected to call the 'init' function by getting hold of the address of the 'init()' symbol; init() is expected to initialize its own module. Each EE packet processing thread is then expected to call another function of the EE application - let us call this symbol 'EEAppContextInit()'. The EEAppContextInit function does the real per-context initialization, such as opening UIO and other character device drivers and registering with the software poll() system.

The EE also needs to call the 'EEAppFinish()' function when the EE is killed. EEAppFinish does whatever graceful shutdown is required for its module.
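A hedged sketch of the library-loading step using dlopen()/dlsym(), with the symbol names used in the text ('init', 'EEAppContextInit'); error handling is minimal, and a real EE would keep the handles so 'EEAppFinish' can be resolved and called at shutdown.

```c
#include <dlfcn.h>
#include <stdio.h>

typedef int (*ee_app_fn_t)(void);

/* Load one EE application library and call its init() entry point. */
static void *ee_load_app(const char *path)
{
    void *handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (!handle) {
        fprintf(stderr, "dlopen(%s): %s\n", path, dlerror());
        return NULL;
    }

    ee_app_fn_t init = (ee_app_fn_t)dlsym(handle, "init");
    if (!init || init() != 0) {
        fprintf(stderr, "%s: missing or failing init()\n", path);
        dlclose(handle);
        return NULL;
    }
    return handle;
}

/* Called from each EE packet-processing thread after loading: per-context
 * initialization (open UIO devices, register FDs with the poll system). */
static int ee_call_context_init(void *handle)
{
    ee_app_fn_t ctx_init = (ee_app_fn_t)dlsym(handle, "EEAppContextInit");
    return ctx_init ? ctx_init() : -1;
}
```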

Each thread, if it uses software-based poll, does epoll on all the FDs registered so far. Polling happens in a while() loop. epoll() can take a timeout argument; the timeout must be the lowest next timer expiry across all software timer blocks. When epoll() returns, the thread should call the software timer library for any timer expiry processing. In the case of hardware poll, the hardware-specific poll function would need to be used.
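A small sketch of deriving the epoll() timeout from the timer blocks, assuming a helper that reports each block's nearest expiry in milliseconds (negative when the block has no armed timer); the helper is hypothetical.

```c
#include <stdint.h>

struct timer_block;   /* opaque; provided by the EE timer library */

/* Assumed helper: earliest pending expiry of one timer block, in
 * milliseconds from now, or a negative value if nothing is armed. */
int64_t timer_block_next_expiry_ms(const struct timer_block *tb);

/* epoll_wait() timeout = nearest expiry over all timer blocks of this
 * thread; -1 (block indefinitely) if no timer is armed anywhere. */
static int compute_epoll_timeout(struct timer_block *const blocks[], unsigned n)
{
    int64_t best = -1;
    for (unsigned i = 0; i < n; i++) {
        int64_t t = timer_block_next_expiry_ms(blocks[i]);
        if (t >= 0 && (best < 0 || t < best))
            best = t;
    }
    return (int)best;
}
```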

In addition to the above functions, the EE typically needs to emulate other capabilities provided by the Linux kernel for its applications, such as a memory pool library, a packet descriptor buffer library, and mutual exclusion facilities using futexes and user space RCU, etc.

With the above capabilities, the EE can jump-start application development. This kind of functionality requires changes in only a few places in the applications.

Hope it helps.

Saturday, December 18, 2010

UIO - Acceleration Device mapping in user space

Please see this post on how to use the UIO framework to implement device drivers in user space. As noted in that post, the UIO framework is predominantly used to install the interrupt handler and to wake up the user space process implementing the device driver. Please read the earlier post before going further.

There are two types of devices that get mapped to user space for zero-copy drivers - Ethernet devices and accelerator devices such as the Crypto Engine, Pattern Matching Engine etc. Normally, a given Ethernet device is completely owned by one user process. But accelerator devices are normally needed across multiple processes and are also needed by kernel applications. Hence acceleration device usage is more challenging.

To enable usage of acceleration devices by multiple user processes, acceleration devices normally support multiple independent descriptor rings. I know of some acceleration devices supporting four descriptor rings, with each descriptor ring working independently of the others; that is, one descriptor ring is sufficient for issuing a command and reading the result. In this scenario, a given user process should own at least one descriptor ring for a zero-copy driver. If the acceleration device contains four descriptor rings, then four user processes can use the acceleration device without involving the kernel. Since a typical system contains more processes than descriptor rings, it is necessary that at least one descriptor ring is reserved for kernel usage and other application processes. In the example where one acceleration device supports four descriptor rings, one can choose three critical user processes that require a zero-copy driver; each of these critical processes uses one descriptor ring, and all other user processes and the kernel share the remaining one.

Each process requiring a zero-copy driver should memory map the descriptor ring space. Since many chip vendors provide Linux kernel drivers for acceleration engines, my suggestion is to extend the acceleration engine driver with additional API functions to detach and attach descriptor rings on demand. When a user process requires a descriptor ring, the associated application kernel module can call the acceleration driver's 'detach' function for that descriptor ring. When the process dies, the associated kernel module should attach the descriptor ring back to the kernel driver. This way, each user process need not handle the initialization of the security engine; it only needs to worry about filling the descriptors with commands and reading responses. It also provides the benefit that descriptor rings can be dynamically allocated and freed based on the applications running at that point in time.
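The user-space side of such an arrangement might look like the following C sketch: the process opens the UIO device that represents its detached ring and memory maps UIO map 0 to reach the descriptor ring memory. The device path and map size are assumptions; in practice the size comes from /sys/class/uio/uioX/maps/map0/size.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

struct ring_map {
    int            uio_fd;   /* kept open: read() on it blocks for interrupts */
    volatile void *ring;     /* mapped descriptor ring                        */
};

/* Map UIO map 0 of the given UIO device (offset = N * page size selects
 * map N). Returns 0 on success, -1 on failure. */
static int map_ring(const char *uio_path, size_t map_size, struct ring_map *out)
{
    out->uio_fd = open(uio_path, O_RDWR);
    if (out->uio_fd < 0)
        return -1;

    out->ring = mmap(NULL, map_size, PROT_READ | PROT_WRITE,
                     MAP_SHARED, out->uio_fd, 0 /* map 0 */);
    if (out->ring == MAP_FAILED) {
        close(out->uio_fd);
        return -1;
    }
    return 0;
}
```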

If there are as many interrupts as descriptor rings, then each process's zero-copy driver can have its own interrupt line. At times, even though there are multiple descriptor rings, the number of interrupts is less than the number of descriptor rings; in this case, an interrupt needs to be shared across multiple descriptor rings. Fortunately, the Linux kernel and the UIO framework provide a mechanism for multiple application kernel modules to register different interrupt handlers for the same interrupt line: the irq_flags field of the uio_info structure that is registered with the UIO framework should have the IRQF_SHARED bit set. The Linux kernel and UIO framework call the interrupt handlers one by one in sequence. The interrupt handler that has data pending to be read from its corresponding descriptor ring should return IRQ_HANDLED; this means the device should have the capability to check for pending data without reading it out. Note that reading the acceleration result should be done by user space. When the handler returns IRQ_HANDLED, the UIO framework wakes up the user process. Since one IRQ line is shared by multiple processes, as described in the earlier post, masking and unmasking the interrupts can't be done by the interrupt handler and the user process. Since interrupts can't be disabled, one can't use the natural coalescing capability described in the earlier post. But fortunately, many acceleration devices provide hardware interrupt coalescing: the hardware can be programmed to generate an interrupt for every X events or within Y amount of time. If the hardware device you have chosen does not have the coalescing capability and requires the IRQ to be shared across multiple user processes, then you are out of luck: either don't use the UIO facilities or live with too many interrupts coming in.
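A kernel-side sketch of that registration, assuming a device-specific ring_has_pending() check that can look at the ring's status without consuming the result (which is read later from user space); the probe code that fills in the memory maps and the actual IRQ number before uio_register_device() is omitted.

```c
#include <linux/interrupt.h>
#include <linux/uio_driver.h>

static int ring_has_pending(void);   /* assumed device-specific helper */

static irqreturn_t ring_irq_handler(int irq, struct uio_info *info)
{
    /* Shared line: only claim the interrupt (and so wake our user
     * process) if our own descriptor ring actually has work pending. */
    if (!ring_has_pending())
        return IRQ_NONE;
    return IRQ_HANDLED;   /* UIO core wakes the process blocked on read() */
}

static struct uio_info ring_uio_info = {
    .name      = "accel-ring0",
    .version   = "0.1",
    /* .irq is filled in by the probe code with the device IRQ number. */
    .irq_flags = IRQF_SHARED,     /* multiple rings share this IRQ line */
    .handler   = ring_irq_handler,
};
```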

All other user processes, without dedicated descriptor rings, should work with the accelerator kernel driver provided by the OS/chip vendors. That is, they need to send the command buffer to the kernel driver and read the result from the kernel driver. Kernel drivers are normally intelligent enough to service multiple consumers, and hence many user processes can use the acceleration engine.


Comments?