Thursday, December 30, 2010

What are Traffic Monitoring Enabler Switches?

There is an increasing trend of deploying Traffic Monitoring Enabler Switches (TMES) in Enterprise, Data Center and Service Provider environments.

Need for TMES:

Traffic monitoring devices are increasingly becoming a requirement for networks in Enterprise, Data Center and Service Provider environments.  Multiple types of monitoring devices are being deployed in these networks.
  • Traffic Monitoring for intrusion detection:  Security is a very important aspect of Enterprise networks.  Intrusion detection is one component of comprehensive network security.  IDS devices listen to the traffic passively and do intrusion analysis on it.  Intrusion attempts and intrusion events are sent to administrators for out-of-band action.  IDS devices can also be configured to send TCP resets upon intrusion detection to stop any further traffic on the TCP connection, and to block certain traffic for a certain amount of time by informing local firewall devices.
  • Surveillance:  Due to government regulations, all important data needs to be recorded.  Surveillance monitoring devices again listen to the traffic passively and record it in persistent storage for later interpretation.  Surveillance devices also provide the capability to recreate sessions such as Email conversations, file transfer conversations, and voice and video conversations from the recorded traffic.  Some surveillance devices provide run-time recreation of conversations as well.
  • Network Visibility:  These monitoring devices capture the traffic passively and provide complete visibility of the traffic on the network.  They provide capabilities such as identification of applications (P2P, IM, social networking and many more) and bandwidth usage by application and network, giving network administrators the information they need to manage networks and bandwidth so that Enterprise-critical applications always work.
  • Traffic Trace:  Traffic trace devices help network administrators find the bottlenecks in different network segments.  These devices tap the traffic at multiple locations in the network and provide trace capability for finding issues such as misconfiguration of devices in the network, choke points etc.
Network administrators face the following challenges in deploying multiple monitoring devices.
  • Few SPAN ports in existing L2 switch infrastructure:  Many L2 switch vendors provide one or at most two SPAN ports.  L2 switches replicate the packets to SPAN ports.  Since there are at most two SPAN ports, only two types of monitoring devices can be connected.  This is one big limitation network administrators face.
  • Multiple network tap points:  In a complex network infrastructure, there are multiple points where monitoring is required.  Placing multiple monitoring devices at each point is too expensive.  Network administrators would like to use the same monitoring devices to capture traffic at multiple locations.
  • Capacity limitations of monitoring devices:  With increasing bandwidth in networks, it is possible that one monitoring device may not be able to cope with the traffic.  Administrators would like to use multiple monitoring devices of the same type to capture the traffic, with some external component load balancing the sessions across them.
  • High capacity monitoring devices:  There could be instances where a monitoring device can take more load.  In these cases, one monitoring device can take the load from several tap points.  Administrators look for a facility to aggregate the traffic from multiple points to one or a few monitoring devices of the same type.
  • Non-switch capture points:  Network administrators may want to monitor traffic at a point where there are no switches - Router to Server, Wireless LAN Access Point to Access Concentrator etc.  Since there is no switch, there are no SPAN ports.  Network administrators look for some mechanism, such as Inline TAP functionality, to capture the traffic for monitoring.
What is TMES?

TMES is a switch device with monitoring-enabling intelligence that allows connectivity of multiple monitoring devices of different types without any major changes to the existing network infrastructure.

This device taps the traffic from SPAN ports of existing switches in the network and directs the traffic to the attached monitoring devices.

TMES allows:
  • Centralization of monitoring devices.
  • Filtering of the traffic.
  • Balancing of traffic to multiple monitoring devices of a given type.
  • Replication of traffic to different types of monitoring devices.
  • Aggregation of traffic from multiple points to same set of monitoring devices.
  • Truncation of data 
  • Data manipulation & masking  of the sensitive content of the traffic being sent to monitoring devices.
  • Inline TAP functionality to allow capture points where there are no SPAN ports.
  • Time Stamp functionality
  • Stripping off  Layer 2 and Tunnel headers that are unrecognized by monitoring devices.
  • Conditioning of the burst traffic going to the monitoring devices.
Centralization of Monitoring Devices: 

Without TMES, monitoring devices need to be placed at different locations in the network.  With TMES, TAP points are connected to the TMES and monitoring devices are connected only to TMES ports.

Filtering of Traffic:

This feature of TMES filters out unwanted traffic for a given monitoring device.  Monitoring devices normally listen for traffic in promiscuous mode; that is, the monitoring device gets all the traffic going on the wire.  But not all of that traffic is interesting to the monitoring device.  Typically the monitoring device itself does the filtering.  Offloading the filtering saves the valuable cycles spent receiving the traffic (interrupts) and filtering it.  TMES takes this load off the monitoring device and thereby increases the capacity of monitoring devices.

Filtering should not be restricted to unicast traffic; it should also be available for multicast and broadcast packets.
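To make the idea concrete, here is a minimal sketch of what a TMES filter rule check could look like, assuming a simple 5-tuple plus destination-MAC classification (the rule format and all names are illustrative, not any vendor's actual API):

    #include <stdint.h>
    #include <stdbool.h>
    #include <string.h>

    /* Hypothetical filter rule: 0 (or all-zero MAC) in a field means "match any". */
    struct tmes_filter {
        uint32_t src_ip, dst_ip;      /* IPv4 addresses */
        uint16_t src_port, dst_port;  /* L4 ports, 0 = any */
        uint8_t  proto;               /* 6 = TCP, 17 = UDP, 0 = any */
        uint8_t  dst_mac[6];          /* all-zero = any; allows multicast/broadcast filtering too */
    };

    struct pkt_meta {
        uint32_t src_ip, dst_ip;
        uint16_t src_port, dst_port;
        uint8_t  proto;
        uint8_t  dst_mac[6];
    };

    static const uint8_t zero_mac[6];

    /* Return true if the packet should be forwarded to the monitoring port. */
    static bool tmes_filter_match(const struct tmes_filter *f, const struct pkt_meta *m)
    {
        if (f->src_ip   && f->src_ip   != m->src_ip)   return false;
        if (f->dst_ip   && f->dst_ip   != m->dst_ip)   return false;
        if (f->src_port && f->src_port != m->src_port) return false;
        if (f->dst_port && f->dst_port != m->dst_port) return false;
        if (f->proto    && f->proto    != m->proto)    return false;
        if (memcmp(f->dst_mac, zero_mac, 6) &&
            memcmp(f->dst_mac, m->dst_mac, 6))         return false;
        return true;
    }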

Balancing the traffic across multiple monitoring devices of a given type:

If the amount of traffic that needs to be recorded is very high, then multiple monitoring devices will be deployed.  TMES allows multiple monitoring devices to share the load.  TMES load balances sessions (not packets) across multiple monitoring devices based on the performance of each device.  By balancing on sessions, TMES ensures that all traffic for a given connection goes to one monitoring device.
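A common way to achieve this is a symmetric hash over the 5-tuple, so that both directions of a connection map to the same device.  A minimal sketch, assuming IPv4 and a simple modulo selection (names are illustrative):

    #include <stdint.h>

    /* Symmetric hash over the 5-tuple: XOR the two endpoints first so that
     * client->server and server->client packets of the same session produce
     * the same hash and hence land on the same monitoring device. */
    static uint32_t session_hash(uint32_t sip, uint32_t dip,
                                 uint16_t sport, uint16_t dport, uint8_t proto)
    {
        uint32_t h = (sip ^ dip) ^ ((uint32_t)(sport ^ dport) << 16) ^ proto;
        return h * 2654435761u;     /* simple multiplicative mixing step */
    }

    /* Pick one of n_devices monitoring devices of a given type. */
    static unsigned pick_monitor_device(uint32_t sip, uint32_t dip,
                                        uint16_t sport, uint16_t dport,
                                        uint8_t proto, unsigned n_devices)
    {
        return session_hash(sip, dip, sport, dport, proto) % n_devices;
    }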

Replication of Traffic

When there are different types of monitoring devices, each device is expected to get the traffic.  As discussed above, traditional L2 switches have at most two SPAN ports.  TMES is expected to replicate the traffic as many times as there are monitoring device types and send the replicated traffic to the monitoring devices.


Combining the replication feature with load balancing:  Assume that a deployment requires the traffic to be sent to two types of monitoring devices - IDS and Surveillance - and that 6 Gbps of traffic must be analyzed and recorded.  If the IDS and Surveillance devices can each handle only 2 Gbps, then the deployment requires 3 IDS devices and 3 Surveillance devices.  In this case, TMES is expected to replicate the original traffic twice - once for the IDS devices and once for the Surveillance devices.  TMES is then expected to balance one set of replicated packets across the 3 IDS devices and the second set across the 3 Surveillance devices.

Aggregation of traffic from multiple points to same set of monitoring devices

As discussed in 'Centralization of Monitoring Devices', traffic from different locations of the network can be tapped.  TMES is expected to provide multiple ports to receive the traffic from multiple locations in the network, filter the traffic, and replicate and balance the traffic across monitoring devices.  It is possible that the traffic of a given connection passes through multiple tap points, so duplicate traffic may arrive at the TMES, and that duplicated traffic may even go to the same monitoring device.  The monitoring device might get confused by duplicated traffic.  To avoid this, TMES is expected to mark packets based on the incoming TMES port (that is, the capture point), for example by adding a different VLAN ID per capture point or by adding an IP option.  This allows the monitoring device to distinguish the same connection's traffic across different capture points.
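One possible way to mark packets per capture point is to insert an 802.1Q tag whose VLAN ID encodes the TMES ingress port.  A rough sketch, assuming the frame buffer has four bytes of headroom (function and parameter names are illustrative):

    #include <stdint.h>
    #include <string.h>

    #define ETH_ALEN     6
    #define ETH_P_8021Q  0x8100

    /* Insert an 802.1Q tag after the source MAC, carrying a VLAN ID that
     * identifies the TMES ingress port (capture point).  'frame' must have
     * 4 bytes of headroom beyond 'len'; returns the new frame length. */
    static int tag_with_capture_point(uint8_t *frame, int len, uint16_t capture_vlan)
    {
        uint16_t tci = capture_vlan & 0x0FFF;          /* PCP/DEI left as 0 */
        uint8_t  tag[4] = { ETH_P_8021Q >> 8, ETH_P_8021Q & 0xFF,
                            tci >> 8, tci & 0xFF };

        /* shift everything after the two MAC addresses by 4 bytes */
        memmove(frame + 2 * ETH_ALEN + 4, frame + 2 * ETH_ALEN, len - 2 * ETH_ALEN);
        memcpy(frame + 2 * ETH_ALEN, tag, 4);
        return len + 4;
    }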




Truncation or Slicing of packets


Some monitoring device types, such as traffic measuring devices, don't require the complete data of each packet.  Slicing packets down to a smaller size increases the performance of those monitoring devices.  TMESs are expected to provide this functionality before sending the packets to the monitoring devices.  The truncate value is with respect to the payload of TCP, UDP etc.  Some monitoring devices are only interested in headers up to layer 4; in this case, the truncate value can be 0.  Some monitoring devices may expect to see a few bytes of payload.  TMESes are expected to provide the flexibility of configuring the truncate value.

Truncation of packet content should not be reflected in the IP total length field.  That field should be kept intact so that monitoring devices can determine the original data length even though they receive truncated packets.
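A sketch of the truncation calculation, assuming IPv4 and a configured 'payload_keep' value; the minimal header structs are defined locally for illustration, and the IP total length field is intentionally not modified:

    #include <stdint.h>

    #define PROTO_TCP 6
    #define PROTO_UDP 17

    /* Minimal IPv4/TCP header layouts, just enough for header-length math. */
    struct ipv4_hdr { uint8_t ver_ihl, tos; uint16_t tot_len, id, frag_off;
                      uint8_t ttl, proto; uint16_t csum; uint32_t saddr, daddr; };
    struct tcp_hdr  { uint16_t sport, dport; uint32_t seq, ack_seq;
                      uint8_t  doff_rsvd; uint8_t flags;
                      uint16_t window, csum, urg_ptr; };

    /* Return how many bytes of the packet to send to the monitoring device:
     * L3 + L4 headers plus 'payload_keep' bytes of payload.  tot_len in the
     * IP header is left untouched so the device still sees the original length. */
    static int truncate_len(const uint8_t *l3, int l3_len, int payload_keep)
    {
        const struct ipv4_hdr *ip = (const struct ipv4_hdr *)l3;
        int ihl = (ip->ver_ihl & 0x0F) * 4;
        int l4hdr = 0;

        if (ip->proto == PROTO_TCP) {
            const struct tcp_hdr *tcp = (const struct tcp_hdr *)(l3 + ihl);
            l4hdr = (tcp->doff_rsvd >> 4) * 4;
        } else if (ip->proto == PROTO_UDP) {
            l4hdr = 8;                      /* UDP header is always 8 bytes */
        }

        int keep = ihl + l4hdr + payload_keep;
        return keep < l3_len ? keep : l3_len;   /* never grow the packet */
    }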



Data  masking 

Based on the type of monitoring device, the administrator may want to mask sensitive information such as credit card numbers, user names and passwords from being recorded.  TMES is expected to provide this pattern-match-and-mask functionality, thereby removing privacy concerns.

TMES might also support Data Replacement (DR).  The DR feature might increase the size of the data.  Though this is not a big issue for UDP sessions, it requires a good amount of handling for TCP connections.  As we all know, TCP sequence numbers represent bytes, not packets.  So any change in the data size requires a sequence number update - not only in the affected packet, but also in all further packets on the session.  Similarly, the ACK sequence numbers of packets in the reverse direction should also be updated while sending the packets to the monitoring devices.

When the DR feature is combined with the 'Replication' feature, this delta sequence number update can be different for different replicated copies.  The delta sequence number update is required to ensure that monitoring devices see packets that are consistent with respect to sequence numbers and data.
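The following sketch illustrates the delta sequence number update, assuming a per-session structure that accumulates the byte delta per direction (all names are illustrative):

    #include <stdint.h>
    #include <arpa/inet.h>

    /* Minimal TCP header layout (fields in network byte order on the wire). */
    struct tcp_hdr { uint16_t sport, dport; uint32_t seq, ack_seq;
                     uint8_t  doff_rsvd; uint8_t flags;
                     uint16_t window, csum, urg_ptr; };

    #define TCP_FLAG_ACK 0x10

    /* Bytes added (positive) or removed (negative) so far in each direction
     * of the session by data replacement. */
    struct seq_fixup { int32_t delta_c2s; int32_t delta_s2c; };

    /* Adjust sequence/ACK numbers of a copy about to be sent to a monitoring
     * device.  'c2s' is non-zero for client-to-server packets.  The sequence
     * number shifts by this direction's delta; the ACK number must reflect
     * the opposite direction's delta.  (The TCP checksum would also need
     * recomputation after this edit.) */
    static void fixup_tcp_seq(struct tcp_hdr *tcp, const struct seq_fixup *fx, int c2s)
    {
        int32_t seq_delta = c2s ? fx->delta_c2s : fx->delta_s2c;
        int32_t ack_delta = c2s ? fx->delta_s2c : fx->delta_c2s;

        tcp->seq = htonl(ntohl(tcp->seq) + (uint32_t)seq_delta);
        if (tcp->flags & TCP_FLAG_ACK)
            tcp->ack_seq = htonl(ntohl(tcp->ack_seq) + (uint32_t)ack_delta);
    }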

Some TMES vendors position this as part of their DPI feature set.


Inline TAP functionality:



Many places in the network might not have an L2 switch from which to get traffic via SPAN ports.  If traffic needs to be monitored from those points, one choice is to place an L2 switch and pass the traffic to the monitoring devices through its SPAN ports.  If there are many capture points, multiple L2 switches would be needed.  Instead, Inline TAP functionality is expected to be present in the TMES.  Two TMES ports are required to tap the traffic from each of these capture points, and these two ports act as an L2 switch while replicating traffic for the monitoring devices.  Basically, TMES is expected to act as an L2 switch for these capture points.  Since there are many capture points, TMES essentially becomes a multi-switch device, with each logical switch having two ports.

Time Stamping of Packets

Analysis of traffic that was recorded across multiple monitoring devices is a common requirement.  It means the recording devices should have the same clock reference so that the analysis engine knows the order in which packets were received.  At times, it is not practical to assume that the monitoring devices have expensive clock synchronization mechanisms.  Since TMES is becoming a central location for capturing traffic and redirecting it to monitoring devices, TMES is expected to add a timestamp to each packet that is sent to the monitoring devices.

The IP protocol provides an option called 'Internet Timestamp'.  This option expects TMES to fill in its IP address and a timestamp in milliseconds since midnight UT.
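For reference, a rough C layout of this option as defined in RFC 791 (option type 68), assuming the 'address and timestamp' flag is used:

    #include <stdint.h>

    /* IP 'Internet Timestamp' option layout (RFC 791, option type 68).
     * With flag 1, each entry is the recording node's IPv4 address followed
     * by a 32-bit timestamp in milliseconds since midnight UT. */
    #define IPOPT_TS            68
    #define IPOPT_TS_TSANDADDR  1    /* address + timestamp entries */

    struct ip_ts_option {
        uint8_t  type;       /* IPOPT_TS */
        uint8_t  length;     /* total option length in bytes */
        uint8_t  pointer;    /* offset (1-based) of the next free entry slot */
        uint8_t  oflw_flg;   /* upper 4 bits: overflow count, lower 4 bits: flag */
        struct {
            uint32_t addr;   /* TMES IPv4 address (network byte order) */
            uint32_t ts_ms;  /* milliseconds since midnight UT */
        } entry[1];          /* more entries may follow, up to 'length' */
    } __attribute__((packed));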

Stripping of L2 and Tunnel headers


Many monitoring devices don't understand complicated L2 headers such as MPLS and PPPoE, or tunnel headers such as PPTP (GRE), GTP (in wireless core networks), L2TP data, IP-in-IP, IPv6-in-IPv4 (Teredo, 6to4, 6in4) and many more.  Monitoring devices are primarily interested in the inner packets.  TMESs are expected to provide stripping functionality and deliver basic IP packets to the monitoring devices.  Since monitoring devices expect to see some known L2 header, TMESes typically strip off the tunnel headers and complicated L2 headers while keeping the Ethernet header intact.  If an Ethernet header is not present, TMESes are expected to add a dummy Ethernet header to satisfy the monitoring device's receive path.
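A minimal sketch of the 'dummy Ethernet header' step, assuming the inner packet has already been located and the buffer has 14 bytes of headroom (the locally administered MAC address is just a placeholder):

    #include <stdint.h>
    #include <string.h>

    #define ETH_HLEN    14
    #define ETH_P_IP    0x0800
    #define ETH_P_IPV6  0x86DD

    /* Given a pointer to the inner IP packet recovered after stripping the
     * outer L2/tunnel headers, prepend a dummy Ethernet header so that the
     * monitoring device sees a conventional Ethernet/IP frame.
     * The buffer must have ETH_HLEN bytes of headroom before 'inner_ip'. */
    static uint8_t *prepend_dummy_eth(uint8_t *inner_ip, int is_ipv6)
    {
        static const uint8_t dummy_mac[6] = { 0x02, 0, 0, 0, 0, 0x01 }; /* locally administered */
        uint8_t *eth = inner_ip - ETH_HLEN;
        uint16_t etype = is_ipv6 ? ETH_P_IPV6 : ETH_P_IP;

        memcpy(eth, dummy_mac, 6);          /* destination MAC */
        memcpy(eth + 6, dummy_mac, 6);      /* source MAC */
        eth[12] = etype >> 8;
        eth[13] = etype & 0xFF;
        return eth;
    }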


Traffic Conditioning

Monitoring devices are normally rated for a certain number of Mbps.  At times, there can be bursts in the traffic even though the overall average traffic rate is within the device rating.  To avoid packet drops due to bursts, TMES is expected to condition the traffic going to the monitoring device.
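A simple token bucket is one way to implement such conditioning.  A sketch, assuming the device rating is expressed in bytes per second and timestamps are in nanoseconds:

    #include <stdint.h>

    /* Token-bucket conditioner: packets are released to the monitoring port
     * only when enough tokens (bytes) have accumulated, smoothing bursts
     * down to the device's rated bandwidth. */
    struct conditioner {
        uint64_t rate_bytes_per_sec;  /* monitoring device rating */
        uint64_t bucket_depth;        /* maximum burst allowed, in bytes */
        uint64_t tokens;              /* current tokens, in bytes */
        uint64_t last_ns;             /* timestamp of the last refill */
    };

    /* Returns 1 if the packet may be sent now, 0 if it should be queued. */
    static int conditioner_allow(struct conditioner *c, uint32_t pkt_len, uint64_t now_ns)
    {
        uint64_t elapsed = now_ns - c->last_ns;

        c->tokens += (elapsed * c->rate_bytes_per_sec) / 1000000000ull;
        if (c->tokens > c->bucket_depth)
            c->tokens = c->bucket_depth;
        c->last_ns = now_ns;

        if (c->tokens >= pkt_len) {
            c->tokens -= pkt_len;
            return 1;
        }
        return 0;
    }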

Players :

I came across a few vendors who are providing solutions meeting most of the above requirements.

Gigamon :  http://www.gigamon.com/
Anue Systems:  http://www.anuesystems.com/
NetOptics:  http://www.netoptics.com


I believe this market is yet to mature and there is a lot of upside potential.

There is a strong need for monitoring devices, and hence the need for TMES will only grow in the coming years.

Sunday, December 26, 2010

User space Packet processing applications - Execution Engine differences with processors

Please read this post to understand Execution Engine.

Many processors with descriptor-based IO devices have their own interrupts.  For each device, there is a corresponding UIO device.  Hence a software-poll-based EE provides a 'file descriptor' based interface to register, deregister and receive indications through callbacks.  EE applications are expected to read the packets from the hardware by themselves and do the rest of the processing.

In the UIO-related posts, we discussed ways to share interrupts across devices.  As long as the UIO-related application kernel driver knows the type of event for which the interrupt was generated, the appropriate UIO FD is woken up and things work fine.

Non-descriptor-based IO is becoming quite common in recent Multicore processors.  Hardware events (packets from the Ethernet controllers, acceleration results from the acceleration engines) are given to the software through a set of HW interfaces.  Selection of the HW interface by the hardware is based on some load balancing algorithm or on software inputs.  The point is that the events delivered to software through one HW interface come from multiple hardware IO sources.  Each HW interface is normally associated with one interrupt.  One might say that this can be treated as an interrupt shared across multiple devices.  But some Multicore processors have no facility to know the reason for a HW interrupt, nor a facility to know the event type of the first pending event in the HW interface.  Unless the event is dequeued from the HW interface, it is impossible to know the type of event.  Also, due to interrupt coalescing requirements, a given interrupt instance might represent multiple events from different IO source devices.  Due to this behavior, there may be only one UIO device for multiple IO devices.  Hence the responsibility of demultiplexing these events to the right EE application falls on the EE itself.  EE needs to read each event, find the right application and call the appropriate callback function registered with it.  Let us call this functionality in EE the 'EE Event DeMux'.

In descriptor-based systems, each EE application is expected to read its own HW events (packets and acceleration results).  The callback invocation only provides an indication for the EE application to read the events from the associated hardware descriptors.  In the case of 'EE Event DeMux', the event is already read by the EE itself.  Hence, the event is expected to be passed to the callback function.

The 'EE Event DeMux' submodule registers itself with the rest of the EE module to get the UIO indication in the case of the software poll method.  In the case of hardware poll, 'EE Event DeMux' is invoked by the hardware poll function.

Multicore processors normally provide these HW interfaces for the IO devices that are part of the Multicore processor itself.  External devices, such as PCI and other HW-bus-based IO devices, are still implemented using the descriptor-based mechanism.  A software-poll-based EE should not assume that all devices are served through 'EE Event DeMux'.  As far as the core software poll system is concerned, 'EE Event DeMux' is just another EE application.  In the hardware-poll-based method, if descriptor-based HW interfaces need to be used, then the hardware poll should also poll the descriptor-based HW interfaces.

When 'EE Event DeMux' is used by EE applications (such as the Ethernet driver and accelerator drivers), it is necessary that 'EE Event DeMux' considers the following requirements (a sketch follows the list below).
  • It should have its own 'quota' - the maximum number of events it reads from the HW interface per callback invocation by the Core EE.  Once it reads 'quota' events, or if there are no more events, it should return to the 'Core EE' module.
  • Since this is the module that demuxes to some EE applications, it should provide its own register/deregister functions.
  • When the 'Core EE' module invokes this module's callback function, due to interrupt generation or hardware poll, it is expected to read at most 'quota' events, as described above.  Before giving control back to the 'Core EE', it is expected to notify EE applications that there are no more events in this iteration.  Some EE applications might register for this indication.  For example, the Ethernet driver application might register for it to implement 'Generic Receive Offload'; GRO needs to know when to stop while coalescing TCP segments.  In descriptor-based drivers, this issue does not arise, since each Ethernet driver reads the events itself as part of the callback invocation by the EE and therefore knows when to give up.
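Below is a rough sketch of how such an 'EE Event DeMux' might look, assuming hypothetical hw_if_dequeue()/hw_event_source() helpers standing in for the SoC SDK calls (everything here is illustrative, not a specific vendor API):

    #include <stdint.h>

    #define DEMUX_MAX_APPS  32

    struct hw_event;                      /* opaque event dequeued from the HW interface */

    /* Per-IO-source handler registered with the demux; 'last_in_batch' tells the
     * application (e.g. a GRO-capable Ethernet driver) that no more events will
     * follow in this iteration.  'ev' is NULL for that final notification. */
    typedef void (*demux_cb_t)(void *arg, struct hw_event *ev, int last_in_batch);

    struct ee_event_demux {
        demux_cb_t cb[DEMUX_MAX_APPS];    /* indexed by event source type */
        void      *cb_arg[DEMUX_MAX_APPS];
        unsigned   quota;                 /* max events read per invocation by Core EE */
    };

    /* Assumed HW helpers; real Multicore SoCs expose equivalents in their SDKs. */
    extern struct hw_event *hw_if_dequeue(void);        /* NULL when empty */
    extern unsigned hw_event_source(struct hw_event *ev);

    /* Called by Core EE on UIO wakeup (software poll) or from the hardware poll
     * loop.  Returns non-zero if events may still be pending, so Core EE knows
     * to call it again before sleeping in epoll(). */
    int ee_event_demux_run(struct ee_event_demux *d)
    {
        unsigned processed = 0;
        struct hw_event *ev;

        while (processed < d->quota && (ev = hw_if_dequeue()) != NULL) {
            unsigned src = hw_event_source(ev);
            if (src < DEMUX_MAX_APPS && d->cb[src])
                d->cb[src](d->cb_arg[src], ev, 0);
            processed++;
        }

        /* Tell interested applications the batch is over (e.g. for a GRO flush). */
        for (unsigned i = 0; i < DEMUX_MAX_APPS; i++)
            if (d->cb[i])
                d->cb[i](d->cb_arg[i], NULL, 1);

        return processed == d->quota;     /* quota exhausted => possibly more pending */
    }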
Thanks for reading my earlier post.  I hope this helps.

Sunday, December 19, 2010

User space Packet processing applications - Execution Engine

If you plan to port your data plane network processing application from Linux kernel space to user space, the first thing you would think about is how to port your software with minimal changes.  The Execution Engine is the first piece to consider.

Many kernel-based networking applications don't create their own threads.  They work with the threads that are already present in the kernel.  For example, packet processing applications such as Firewall, IDS/IPS and IPsec VPN work in the context of the kernel TCP/IP stack.  This is mainly done for performance reasons.  Additional threads for these applications would result in multiple context switches.  They would also result in pipeline processing as packets are handed over from one execution context to another.  Pipelining requires queues, and the enqueue and dequeue operations take core cycles.  It also creates flow control issues when one thread's processing is heavier than the others'.

Essentially, the Linux kernel itself provides execution contexts and networking packet processing applications work within these contexts.  The Linux TCP/IP stack itself works in softirq context.  SoftIRQ processing in a normal kernel runs from both IRQ context and the softirqd context; I would say 90% of the time SoftIRQ processing happens in IRQ context.  In a PREEMPT_RT patched kernel, network IRQs are mapped to IRQ threads.  In any case, the context in which network packet processing applications run is unknown to the applications.  Since Linux kernel execution contexts are per core, there are fewer shared data structures and hence fewer locking requirements.  The kernel and underlying hardware also provide mechanisms to balance the traffic across different execution contexts with flow granularity.  In cases where hardware does not provide any load balancing functionality, IRQs are dedicated to different execution contexts.  If there are 4 Ethernet devices and 2 cores (hence two execution contexts), the four receive interrupts of the Ethernet controllers are assigned equally to the two execution contexts.  If the traffic from all four Ethernet devices is the same or similar, then both cores are used effectively.
If the Execution Engine for user space packet processing applications is made similar to the kernel execution contexts, then application porting becomes simpler.  The Execution Engine (EE) can be considered part of the infrastructure that enables DP processing in user space.  EE design should consider the following.
  • There could be multiple data plane processing applications in user space.  Each DP daemon may be assigned to run on a fixed set of cores - a core mask may be provided at startup time.
  • If a DP daemon is not associated with any core mask, then it should assume that the daemon may be run by all cores.  That is, it should be treated as if the core mask has all core bits set.
  • A set of cores may be dedicated to the daemon.  That is, those cores do nothing other than the DP processing of that DP daemon.  This facility is typically used with Multicore processors that provide hardware poll.  Recent generations of Multicore processors have a facility to deliver incoming events and acceleration results through a single portal (or station or work group).  Since the core is dedicated, no software polling is required; that is, hardware polling can be used if the underlying hardware supports it and if the core(s) are dedicated to the process.
It appears that having as many threads in the process as there are cores assigned to it provides the best performance.  It also closely mirrors the kernel execution contexts.  With the above background, I believe EE needs to have the following capabilities:
  • Provide capability to assign the core mask. 
  • Provide capability to indicate whether the cores are dedicated or assigned.
  • If no core mask is provided, it should have the capability to read the number of cores in the system and should assume that all cores are given in the core mask.
  • Provide capability to use software poll or hardware poll.  Hardware poll should be validated and accepted only if underlying hardware supports it and only if the cores are dedicated to it. Hardware polling has performance advantages as it does not require interrupt generation and interrupt processing. But the disadvantage is that the core is not used for anything else.  One should weigh the options based on the application performance requirements. 
  • API it exposes for its applications should be same whether the execution engine uses software poll (such as epoll()) or hardware poll.
Typically these capabilities are provided through command line parameters or via some configuration file.  EE is expected to create as many threads as there are cores in the core mask.  Each thread should provide the following functionality:
  • Software timers functionality - EE should provide the following functionality.
    • Creation and deletion of timer block
    • Starting, stopping, restarting timers in each timer block.
    • Each application can create one or more timer blocks and use a large number of timers in each timer block.
    • As in Kernel,  it is required that EE provides cascaded timer wheels for each timer block.
  • Facility for applications to register/deregister for events and to receive the events.
    • API function (EEGetPollType()) to return the type of poll - software or hardware.  This function would be used by EE applications to decide whether to use file descriptors such as UIO and other application-oriented kernel drivers (software poll) or to use hardware facilities (hardware poll).
    • Register Event Receiver:  EE applications use this function to register the FD, the Read/Write/Error interest, the associated callback function pointer and the callback argument.
    • Deregister Event Receiver:  EE applications call this to deregister an event receiver that was registered using the 'Register' function.
    • Variations of the above API will need to be provided by EE if it is configured with hardware poll.  Since each hardware device has its own way of representing this, there may be as many API sets as devices.  Some part of each EE application has hardware-specific initialization code and calls the right set of API functions.
    • Note that one thread handles multiple devices (multiple file descriptors in the case of software poll).  Every time epoll() returns, the callback functions of the ready FDs need to be called.  These functions, provided by the EE applications, are expected to get hold of packets in the case of Ethernet controllers, acceleration results in the case of acceleration devices, or other kinds of events from other devices.  From the UIO discussion, if the applications use UIO-based interrupts to wake up the thread, then all events should be read from the device to reduce the number of wakeups (UIO coalescing capability).  Some EE applications might read a lot of events, and for each event read, the application calls its own handler, which can be heavy too.  Due to this, if multiple FDs are ready, one EE application may take a very long time before it returns to the EE.  This results in unfair assignment of the EE thread to the other FDs that are also ready.  This unfairness might even cause packet drops, or increase jitter if high priority traffic is pending in other devices.  To ensure fairness, EE applications are expected to process only 'quota' events at a time before returning to the EE.  'Quota' is a tunable parameter and can be different for different types of devices.  EE is expected to call the same application back after it runs through all other ready file descriptors.  Until all ready EE applications indicate that they have nothing left to process, EE should not call epoll() again.  To allow EE to know whether to call the application callbacks again, there should be some protocol:  each EE application indicates to the EE, while returning from the callback function, whether it processed all of its events.  Based on this indication, EE decides whether to call the EE application again before it goes back to epoll().  Note that epoll() is an expensive call, so it is better if all events are processed fairly before epoll() is called again.  A sketch of this loop follows the list below.  In hardware-poll-based configurations, this kind of facility is not required, as polling is not expensive; also, Multicore SoCs implementing a single portal for all events have fairness capabilities built in.  Since the callback function definition is the same for both software and hardware poll based systems, these parameters exist but are not used by hardware-poll-based systems.
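Here is a sketch of such a software-poll loop with per-application quotas and the 'more pending' protocol, assuming hypothetical ee_next_timer_timeout_ms() and ee_run_expired_timers() hooks into the software timer blocks (all names are illustrative):

    #include <stdbool.h>
    #include <sys/epoll.h>

    #define EE_MAX_EVENTS 64

    /* Each EE application's callback processes at most its 'quota' of events
     * and returns true if it may still have more pending. */
    typedef bool (*ee_app_cb_t)(void *arg);

    struct ee_reg   { ee_app_cb_t cb; void *arg; };  /* stored in epoll data.ptr at register time */
    struct ee_ready { struct ee_reg *reg; bool more; };

    extern void ee_run_expired_timers(void);         /* assumed hook into the software timer blocks */
    extern int  ee_next_timer_timeout_ms(void);      /* assumed: next expiry across timer blocks */

    /* One EE thread's software-poll loop (error handling omitted). */
    void ee_thread_loop(int epfd)
    {
        struct epoll_event evs[EE_MAX_EVENTS];
        struct ee_ready ready[EE_MAX_EVENTS];

        for (;;) {
            int n = epoll_wait(epfd, evs, EE_MAX_EVENTS, ee_next_timer_timeout_ms());

            for (int i = 0; i < n; i++) {
                ready[i].reg  = evs[i].data.ptr;
                ready[i].more = true;
            }

            /* Round-robin over the ready FDs, each callback limited to its quota,
             * until nobody reports pending events; only then go back to epoll(). */
            bool pending = (n > 0);
            while (pending) {
                pending = false;
                for (int i = 0; i < n; i++) {
                    if (ready[i].more) {
                        ready[i].more = ready[i].reg->cb(ready[i].reg->arg);
                        pending = pending || ready[i].more;
                    }
                }
            }

            ee_run_expired_timers();    /* run software timer blocks after each poll cycle */
        }
    }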
EE should initialize itself and then create the threads.  Once the threads are created, it should load the shared libraries of its applications one by one.  For each EE application library, it is expected to call the 'init' function by getting hold of the address of the 'init()' symbol.  The init() function is expected to initialize its own module.  Each EE packet processing thread is expected to call another function of the EE application; let us call this symbol 'EEAppContextInit()'.  The EEAppContextInit function does the real initialization, such as opening UIO and other character device drivers and registering with the software poll() system.

EE would also need to call the 'EEAppFinish()' function when the EE is killed.  EEAppFinish does whatever graceful shutdown is required for its module.

Each thread, if it uses software-based poll, does epoll on all the FDs registered so far.  Polling happens in a while() loop.  epoll() takes a timeout argument, which must be the next lowest timer expiry of all software timer blocks.  When epoll() returns, the thread should call the software timer library for any timer expiry processing.  In the case of hardware poll, the hardware-specific poll function would need to be used.

In addition to above functions,  EE typically needs to emulate other capabilities provided by Linux Kernel for its applications such as - Memory Pool library,  Packet descriptor buffer library,  Mutual exclusion facilities using Futexes and user space RCU  etc.. 

With the above capabilities, EE can jump-start application development.  This kind of functionality requires changes in only a few places in the applications.

Hope it helps.

Saturday, December 18, 2010

UIO - Acceleration Device mapping in user space

Please see this post on how to use the UIO framework to implement device drivers in user space.  As noted in that post, the UIO framework is predominantly used to install the interrupt handler and to wake up the user space process implementing the device driver.  Please read the earlier post before going further.

There are two types of devices that get mapped to user space for zero-copy drivers - Ethernet devices and accelerator devices such as the Crypto Engine, Pattern Matching Engine etc.  Normally, a given Ethernet device is completely owned by one user process.  But accelerator devices are normally needed across multiple processes and are also needed by kernel applications.  Hence acceleration device usage is more challenging.

To enable usage of acceleration devices by multiple user processes, acceleration devices normally support multiple independent descriptor rings.  I know of some acceleration devices supporting four descriptor rings, with each descriptor ring working independently of the others; that is, one descriptor ring is sufficient for issuing commands and reading results.  In this scenario, a given user process should own at least one descriptor ring for a zero-copy driver.  If the acceleration device contains four descriptor rings, then four user processes can use the acceleration device without involving the kernel.  Since a typical system contains more processes than descriptor rings, it is necessary that at least one descriptor ring is reserved for kernel usage and for the other application processes.  In the example where one acceleration device supports four descriptor rings, one could choose three critical user processes that require a zero-copy driver; each of these critical processes uses one descriptor ring, and all other user processes and the kernel share the remaining one.

Each process requiring a zero-copy driver should memory map the descriptor ring space.  Since many chip vendors provide Linux kernel drivers for acceleration engines, my suggestion is to modify the acceleration engine driver to provide additional API functions to detach and attach descriptor rings on demand.  When a user process requires a descriptor ring, the associated application kernel module can call the acceleration driver's 'detach' function for that descriptor ring.  When the process dies, the associated kernel module should attach the descriptor ring back to the kernel driver.  This way, each user process need not handle the initialization of the acceleration engine; it only needs to worry about filling the descriptors with commands and reading responses.  It also provides the benefit that descriptor rings can be dynamically allocated and freed based on the applications running at any point in time.
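The attach/detach extension suggested above might look roughly like this; accel_ring_detach()/accel_ring_attach() are hypothetical additions to the vendor driver, not an existing interface:

    #include <linux/fs.h>
    #include <linux/errno.h>

    /* Hypothetical extension to a vendor's acceleration-engine kernel driver:
     * the application kernel module borrows a descriptor ring for its user
     * process and returns it when the process exits. */
    int  accel_ring_detach(int ring_id);     /* take the ring away from the kernel driver */
    void accel_ring_attach(int ring_id);     /* give it back */

    /* Typical usage from the application kernel module's open()/release() paths: */
    static int myapp_open(struct inode *inode, struct file *filp)
    {
        int ret = accel_ring_detach(2);      /* reserve ring #2 for this user process */
        if (ret < 0)
            return ret;
        filp->private_data = (void *)(long)2;
        return 0;
    }

    static int myapp_release(struct inode *inode, struct file *filp)
    {
        accel_ring_attach((int)(long)filp->private_data);
        return 0;
    }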

If there are as many interrupts as descriptor rings, then each process's zero-copy driver can have its own interrupt line.  At times, even though there are multiple descriptor rings, the number of interrupts is smaller than the number of descriptor rings.  In this case, an interrupt needs to be shared across multiple descriptor rings.  Fortunately, the Linux kernel and the UIO framework provide a mechanism for multiple application kernel modules to register different interrupt handlers on the same interrupt line:  the irq_flags field of the uio_info structure registered with the UIO framework should have the IRQF_SHARED bit set.  The Linux kernel and UIO framework call the interrupt handlers one by one in sequence.  The interrupt handler that has data pending to be read from its descriptor ring should return IRQ_HANDLED.  This means the device must have the capability to check for pending data without reading it out - note that reading the acceleration result is done by user space.  When the handler returns IRQ_HANDLED, the UIO framework wakes up the user process.  Since one IRQ line is shared by multiple processes, as described in the earlier post, masking and unmasking of interrupts can't be done by the interrupt handler and the user process.  Since interrupts can't be disabled, one can't use the natural coalescing capability described in the earlier post.  Fortunately, many acceleration devices provide hardware interrupt coalescing:  the hardware can be programmed to generate an interrupt for every X events or within Y amount of time.  If the hardware device you have chosen does not have the coalescing capability and requires the IRQ to be shared across multiple user processes, then you are out of luck - either don't use the UIO facilities or live with too many interrupts coming in.
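A rough sketch of such a shared-IRQ UIO registration for one descriptor ring; my_ring_has_pending() is an assumed device-specific helper and the IRQ number is just an example:

    #include <linux/types.h>
    #include <linux/interrupt.h>
    #include <linux/uio_driver.h>

    /* Assumed device-specific helper: returns true if this descriptor ring has
     * results pending, without dequeuing them (user space reads them later). */
    extern bool my_ring_has_pending(struct uio_info *info);

    static irqreturn_t my_ring_irq_handler(int irq, struct uio_info *info)
    {
        /* Shared IRQ line: claim the interrupt (and let UIO wake the owning
         * user process) only if our ring actually has data pending. */
        if (!my_ring_has_pending(info))
            return IRQ_NONE;
        return IRQ_HANDLED;     /* UIO core wakes the process blocked on read()/poll() */
    }

    static struct uio_info my_ring_uio = {
        .name      = "accel-ring2",
        .version   = "0.1",
        .irq       = 25,               /* example IRQ number of the acceleration engine */
        .irq_flags = IRQF_SHARED,      /* multiple rings share this interrupt line */
        .handler   = my_ring_irq_handler,
    };

    /* Registered from the module init path with:
     *   uio_register_device(parent_dev, &my_ring_uio);
     */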

All other user processes without dedicated descriptor rings should work with the accelerator kernel driver provided by the OS/chip vendor.  That is, they need to send the command buffer to the kernel driver and read the result from the kernel driver.  Kernel drivers are normally intelligent enough to service multiple consumers, and hence many user processes can use the acceleration engine.


Comments?

Sunday, October 31, 2010

Multicore Networking applications - Mitigating the Performance bottlenecks

I gave this talk at the 2010 Multicore Expo in San Jose.  The presentation document was in concise form and I voiced most of the details during the talk.  Many people requested the details in written form, so I have tried to provide them here.  I hope this post gives enough detail on 'New techniques to improve software performance with increasing number of cores'.

Before going further, I would like to differentiate two kinds of applications - Packet processing applications and Stream processing applications.

Packet processing applications, in my definition, are the ones which take packets one by one, work on each packet and send out the same packet, possibly with some minor modifications.  In packet processing applications, there is a one-to-one correspondence between input and output packets, with very few exceptions.  One example where there is no one-to-one correspondence is IP reassembly or fragmentation; another is when the packet is dropped by the application.  Example applications in this category are IP forwarding, L2 bridging, Firewall/NAT, IPsec and even some portions of IDS/IPS.

Stream processing applications are the ones which may take packets or a stream of data, work on the data, and send out data or different packets.  Most TCP-socket-based proxy applications come under this category.  Examples:  HTTP Proxy, SMTP Proxy, FTP Proxy etc.

This post tries to aid programmers in debugging their software to find the performance bottlenecks in Multicore networking applications.

Always Ensure Flow/Session Parallelization

Ensure that only one core is processing a given session at any time.  If multiple packets from the same session are processed by more than one core at the same time, then mutual exclusion would be required on the session variables, which is very expensive.  Multicore SoCs actually help you do flow parallelization in packet processing applications.  Many Multicore SoCs support parsing fields from the packets, calculating a hash on software-defined fields and distributing the packets across multiple queues based on the hash value.  They then provide a provision for software threads to dequeue the packets from the queues.  These SoCs also provide a provision to stop dequeuing of packets by other threads until control of the queue is given up explicitly.  This ensures that a given flow is processed by only one software thread at any time.

Many Multicore SoCs also have a facility to bind queues to software threads and each software thread to a core.  If the number of flows is small, there is a possibility of the cache being warm with contexts from previous packets.  This reduces data movement from DDR.  Also, many Multicore SoCs provide a facility to stash the context as part of the dequeue operation, which reduces the cache thrashing issue even if binding of queues to cores is not done.

Flow parallelization not only eliminates the need for Mutexes, it also ensures that there is no packet mis-ordering in the flows.

Many stateful packet processing applications require not only flow parallelization but also session parallelization.  A session typically consists of two flows - client-to-server traffic and server-to-client traffic.  It is possible that packets from both flows arrive at the device and two separate software threads process them.  Stateful applications share many state variables across these two flows, so you may require mutual exclusion if both packets are allowed to be processed at the same time.  Session parallelization as described here eliminates the need for mutual exclusion.  Unlike flow parallelization, session parallelization is not available in many Multicore SoCs for cases where the tuple values differ between the two flows, and hence it needs to be done in software.  Packet tuples differ when NAT is applied.  Note that many Multicore SoCs enqueue the packets of both flows to the same queue if there is no NAT; they are intelligent enough to generate the same hash value even though the tuple positions are swapped - that is, the source IP in one flow is the destination IP in the reverse flow, and the same is true for the destination IP, source port and destination port.

Stream processing modules such as proxies need to ensure that both the client side and server side sockets are processed by the same software thread, so that no mutual exclusion operations are required to protect the sanctity of the state variables.  Stream processing modules typically create many software threads - worker threads.  A master thread terminates the client side connections and hands over each connection descriptor to one of the less loaded worker threads.  The worker thread is expected to create a new connection to the server and do the rest of the application processing.  Worker threads typically implement an FSM for processing multiple sessions.  More often than not, the number of worker threads is the same as the number of cores dedicated to that application.  In cases where the threads need to block for some operations, such as waiting for accelerator results, more threads - in multiples of the number of cores - would be created to take advantage of the full power of the accelerators.

Eliminate the Mutual Exclusion Operation while Searching for Session/Flow Context

This technique is also expected to ensure that there are no mutual exclusion operations in the packet path.  Any networking application does some search operations on data structures to figure out the operations and actions to be performed on the packet/data.  Upon an incoming packet, a search is done to get hold of the session/flow context, and further packet processing happens based on the state variables in the session.  For example, IP routing does a search on the routing table to figure out the destination port, PMTU and other information needed for operations such as fragmentation, TTL decrement and packet transmit.  Similarly, firewall/IPsec packet processing applications maintain the sessions in easy-to-search data structures such as RB trees, hash lists etc.  Since sessions are created and removed dynamically from these structures, it is necessary to protect the data structure while doing add/delete/search operations.  Mutual exclusion using spinlocks, futexes or up/down is one way to do this.  RCU (Read-Copy-Update) is another method, and it eliminates the mutex operation during search.  RCU operation is described in an earlier post; please check that here and here.  RCU lock/unlock operations in many operating systems are very cheap.  Note that mutex operations are still required for add/delete even in RCU-based usage.
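A sketch of an RCU-protected session lookup using the userspace RCU library (liburcu); a single list stands in for a hash bucket and the tuple fields are illustrative:

    #include <urcu.h>            /* liburcu; threads must call rcu_register_thread() */
    #include <urcu/rculist.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <pthread.h>

    struct session {
        struct cds_list_head node;
        uint32_t sip, dip;
        uint16_t sport, dport;
        /* ... state variables ... */
    };

    static CDS_LIST_HEAD(session_list);                 /* one hash bucket, for brevity */
    static pthread_mutex_t writer_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Packet-path lookup: no mutex, only an RCU read-side critical section.
     * The caller holds rcu_read_lock() across the lookup and any use of the
     * returned session. */
    static struct session *session_lookup(uint32_t sip, uint32_t dip,
                                          uint16_t sport, uint16_t dport)
    {
        struct session *s;

        cds_list_for_each_entry_rcu(s, &session_list, node)
            if (s->sip == sip && s->dip == dip &&
                s->sport == sport && s->dport == dport)
                return s;
        return NULL;
    }

    /* Delete still needs the writer lock; freeing waits for all readers. */
    static void session_delete(struct session *s)
    {
        pthread_mutex_lock(&writer_lock);
        cds_list_del_rcu(&s->node);
        pthread_mutex_unlock(&writer_lock);

        synchronize_rcu();        /* no reader can still hold a reference after this */
        free(s);
    }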

Eliminate Reference Counting 

One of the other bottlenecks in Multicore programming is the need to keep a session safe from deletion while it is being used by other software threads.  Traditionally this is achieved by 'reference counting'.  Reference counting is used in two cases - during packet processing, and when a neighbor module stores a reference.

In the first case, the reference count of the session context is incremented as part of the session lookup operation.  During packet processing, the session is referred to many times to read state variable values and to set new values in the state variables of the session.  It is expected that if the session is deleted, it is not freed until the current thread is done with its operation; otherwise, some other memory would get corrupted if the session memory were freed and allocated to somebody else during packet processing.  To ensure that session ownership is not given away, the reference count is checked as part of the 'delete' operation.  If it is not zero, then the session is marked for deletion but not freed until the reference count becomes zero.  If the value is zero, there is no reference to this session and the session gets freed.
 
Since the RCU operation postpones the delete until the current processing cycles of all other threads complete, reference counting becomes redundant.  Eliminating the reference count not only improves performance, but also reduces maintenance complexity.  Note that reference counting requires atomic usage of the count variable, and atomic operations are not inexpensive.

The second use case of reference counting is when neighbor modules store a reference (pointer) to the session in their own session contexts.  By eliminating the storage of the pointer, this reference count usage can be eliminated.  This post helps you understand how this can be done.

Linux user space programs also can take advantage of RCUs. See this post for more details.

Use the Cache Effectively

Once the matching session is found for incoming data/packet, the processing functionality uses many variables in the session.  If these variables are together in a cache line, a cache fill due to access of one variable makes all the other variables in that cache line available in the cache; access to those variables does not go to DDR.  But all variables may not fit in one cache line.  In those cases, it is necessary to group related variables together to reduce trips to DDR.

To effectively use instruction cache, always arrange your code with likely/unlikely compiler directives. Compilers will try to arrange the likely() code together. 
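For example, with GCC-style builtins (this is the standard __builtin_expect idiom; the helper functions and packet struct are illustrative):

    /* The kernel's likely()/unlikely() macros wrap __builtin_expect; the same
     * idea works in user space with GCC/Clang.  The compiler keeps the expected
     * path contiguous, which helps the instruction cache. */
    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    #define MIN_HDR_LEN 20
    #define PROTO_TCP   6

    struct pkt { int len; int proto; };

    int drop_packet(struct pkt *p);      /* assumed helpers, for illustration only */
    int process_tcp(struct pkt *p);
    int process_other(struct pkt *p);

    int process_packet(struct pkt *p)
    {
        if (unlikely(p == NULL || p->len < MIN_HDR_LEN))
            return drop_packet(p);       /* rare error path kept out of the hot path */

        if (likely(p->proto == PROTO_TCP))
            return process_tcp(p);       /* the common case stays on the fall-through path */

        return process_other(p);
    }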

Reduce Cache Thrashing due to Statistics variables

Almost all networking applications update statistics variables.  Some variables are global and some are session-context-specific.  There are two types of statistics variables - increment variables and add variables.  Increment variables are typically used to maintain packet counts; add variables are used to maintain byte counts.  Updating these variables requires reading the current value and then applying the add or increment operation.  If these variables are updated by multiple threads (with each thread running on a specific core), then every time a variable is updated, the cached copy of that variable in the other cores is invalidated.  When another core needs to do the same operation, it needs to fetch the current value from DDR first and then apply the operation.  In the worst case, where packets go in round-robin fashion to different software threads (and hence cores), the cache thrashing due to statistics variables would be very high and would reduce performance dramatically.

Always use 'per core/thread statistics counters'  whenever possible.  Please see this post for more details. 
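A minimal sketch of per-core counters, with each core's block padded to a cache line so that updates never share a line; MAX_CORES and the counter names are illustrative:

    #include <stdint.h>

    #define MAX_CORES  64
    #define CACHELINE  64

    /* One counter block per core, padded to a cache line so that updates from
     * different cores never touch the same line (no cache-line ping-pong). */
    struct pkt_stats {
        uint64_t pkts;
        uint64_t bytes;
    } __attribute__((aligned(CACHELINE)));

    static struct pkt_stats stats[MAX_CORES];

    /* Fast path: each thread updates only its own slot - no atomics, no sharing. */
    static inline void stats_update(unsigned core_id, uint32_t pkt_len)
    {
        stats[core_id].pkts  += 1;
        stats[core_id].bytes += pkt_len;
    }

    /* Slow path (CLI/SNMP): sum across cores only when the totals are read. */
    static void stats_read(uint64_t *pkts, uint64_t *bytes)
    {
        *pkts = *bytes = 0;
        for (unsigned i = 0; i < MAX_CORES; i++) {
            *pkts  += stats[i].pkts;
            *bytes += stats[i].bytes;
        }
    }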

Some Multicore SoCs provide a special feature which eliminates even the need for 'per core' statistics maintenance.  These SoCs provide a facility to allocate a memory block for statistics and to fire an operation and forget about it.  Firing the operation involves the operation type (increment, decrement, add X or subtract Y etc.) and the memory address (32-bit or 64-bit).  The SoC does the operation internally without cache thrashing.  I strongly suggest using this feature if it is available in your SoC.

Use LRO/GRO facilities

Many networking applications' performance depends more on the number of packets processed than on the number of bytes processed.  Examples: IP forwarding, Firewall/NAT and IPsec with hardware acceleration.  So, reducing the number of packets processed becomes key to improving performance.

LRO/GRO facilities, provided by the operating system in Ethernet drivers or by Multicore SoCs, reduce the number of TCP packets if multiple packets from the same TCP flow are pending to be processed.  Since TCP is a byte-oriented stream protocol, it does not matter whether or not the processing happens on packet boundaries.  Please see this post for more information on the LRO feature in the Linux operating system.  If it is supported by your operating system or Multicore SoC, always make use of it.

Process Multiple Packets together


Each packet processing module does a set of operations on the packets/data - such as search, process and packet out.  If the packet goes through multiple modules, many C functions get called.  Each C function invocation has its own overhead, such as pushing variables onto the stack and initializing some local variables.  Bunching multiple packets of the same flow together can reduce the search/packet-out overhead and the overhead associated with C function calls.


Some Multicore SoCs provide a facility to coalesce packets on a per-queue basis with coalescing parameters - a packet threshold and a time threshold.  The queue does not let the target thread dequeue until one of the conditions is reached - either the number of packets in the queue exceeds the packet threshold, or no packet was dequeued for the 'time threshold' duration.  If this facility is available, ensure that your software dequeues multiple packets together and processes them together.

At times, there is no one-to-one correspondence between queues and sessions.  In that case, one might argue that the search overhead can't be reduced, as there is no guarantee that packets in the same queue belong to the same session.  Though that is correct, there may still be some improvement due to cache warming if there is more than one packet belonging to the same session in the bunch.

As a software developer, strive for one-to-one correspondence between queues and sessions.  This can be done easily among the modules running in software.  Some Multicore SoCs provide queues not only for access to hardware blocks, but also for inter-module communication.  Software can take advantage of this to create a one-to-one mapping between queues and the destination module's sessions.

It is true that when packets are being read from the Ethernet controllers, there is no way to ensure that a queue only holds packets of one session, as queue selection happens based on the hash value of packet fields; two different sessions may fall into the same queue.  In those cases, as mentioned above, you might not see improvement in the 'search' functionality, but you would still see improvements due to fewer C function invocations.

Many Multicore SoCs also have functionality to take multiple packets together for acceleration and for sending packets out.  This reduces the number of invocations of acceleration functions and of the transmit path.  If this facility is available in your Multicore SoC, take advantage of it.

Eliminate usage of software queues

Some Multicore applications need to send packets/data/control-data to other modules.  If multiple threads send data to a queue, then mutual exclusion is needed to protect these queue data structures.

Many Multicore SoCs provide queues for software usage.  These queues eliminate the need for software queues and hence eliminate the mutual exclusion problem, thereby improving performance.  Some Multicore SoCs also provide a facility to group multiple queues into a queue group, which allows sending and receiving applications to enqueue priority items and dequeue based on priority.  These queues can be used even among different processes or virtual machines as long as shared memory is used for the items that get enqueued and dequeued.  Some Multicore SoCs even went a step further to provide a 'copy' feature which avoids shared memory and thereby provides good isolation:  the SoC copies the items from the source process into internally managed memory and copies them into the destination process's memory as part of the dequeue operation.

Always use this feature if it is available in your Multicore SoC.

Eliminate the usage of Software Free pools 

Networking applications use free pools of memory blocks for memory management.  These free pools are used to allocate and free session contexts, buffers etc.  Many software threads require these facilities at different times.  Software typically maintains the memory pools on a per-core basis to avoid mutual exclusion on every allocation.  Since there is a possibility of asymmetric usage of pools by different threads, at times memory allocation can fail even though there are free memory blocks in other threads' pools.  To avoid this, software performs complex operations in these scenarios, moving memory blocks from one pool to another through global queues.  Many Multicore SoCs provide 'free pool' functionality in hardware:  allocation and free can be done by any thread at any time without mutual exclusion.  Use this facility whenever it is available.  It saves some core cycles and, more than that, it provides efficient usage of memory blocks.

Use Multicore SoC acceleration features to improve performance

There are many acceleration features that are available in Multicore SoCs.  Try to take advantage of them.  I classify acceleration functions in Multicore SoCs into three buckets -  Ingress In-flow acceleration,  In-flight acceleration and Egress in-flow acceleration.

Ingress In-flow acceleration:  Acceleration functions that are performed by Multicore SoCs in hardware on the packets before they are handed over to software are called Ingress In-flow accelerations.  Some of the features I am aware of in Multicore SoCs are:
  • Parsing of packet fields:  Some Multicore SoCs parse the headers and make those fields available to the software along with the packet.  Software needing the fields can then skip parsing them.  These SoCs also provide a facility for software to choose which fields are made available along with the packet, and facilities to create parsers that extract fields from proprietary or non-pre-defined headers.  Try to take advantage of this feature.
  • Distribution of packets across threads:  This is a basic feature required in Multicore environments.  Packets need to be distributed to different software threads.  Many Multicore SoCs also ensure that packets belonging to one flow go to one software thread at any time, so that packets do not get mis-ordered within a flow.  As described above, multiple queues are used by the hardware to place the packets, and queue selection is based on a hash value calculated over a set of software-programmable fields.  As a software developer, take advantage of this feature rather than implementing the distribution in software.
  • Packet integrity checks & processing offloads:  Many Multicore SoCs do quite a few integrity checks and offloads on the packet, as listed below.  Ensure that your software doesn't do them again, to save core cycles.
    • IP Checksum verification.
    • TCP, UDP checksum verification.
    • Ensuring that the headers are there in full.
    • Ensure that size of packet is not less than the size indicated in the headers.
    • Invalid field values.
    • IPsec inbound processing.
    • Reassembly of fragments
    • LRO/GRO as described above.
    • Packet coalescing as described above.
    • Many more.
  • Policing :  This feature can police the traffic and reduce the amount of traffic that is seen by the software.  If your software requires policing of some particular traffic to stop cores from getting overwhelmed, this feature can be used rather than doing it in the lowest layers of software.
  • Congestion Management:  This feature ensures that the number of buffers used up by the hardware won't grow uncontrollably.  Without it, cores may not find buffers to send out packets if all buffers are used up by the receiving hardware.  This situation typically happens when a core is doing a lot of processing and is therefore slow in dequeuing while a lot more packets are coming in.  Many Multicore SoCs also have a facility to generate pause frames in case of congestion.
Egress In-flow acceleration:  Acceleration functions that are performed in hardware once the packets are handed over to it by software for transmission are called Egress in-flow acceleration functions.  Some of them are given below.  If these are available, take advantage of them in your software, as they can save a significant number of core cycles.
  • Shaping and Scheduling :  High priority packets are sent out within the shaped bandwidth.  Many Multicore SoCs provide facilities to program the effective bandwidth. These SoCs shape the traffic with this bandwidth. Packets which are queued to it by software would be scheduled based on the priority of the packets.  Some SoCs even provide multiple scheduling algorithms and provide facility for software to choose the algorithm on per physical or logical port.  Some SoCs even provide hierarchical scheduling and shaping.  Take advantage of this in your software if you require shaping and scheduling of the traffic.
  • Checksum Generation for IP and TCP/UDP transport packets :  Checksum generation, especially for locally generated TCP and UDP packets is very expensive.   Use the facilities provided by hardware.  
  • Ipsec Outbound processing :  Some Multicore SoCs provide this functionality in hardware.  If you require Ipsec processing,  use this facility to save large number of cycles on per packet basis.
  • TCP Segmentation and IP Fragmentation:  Some Multicore SoCs provide this functionality.  TCP segmentation works well for locally generated packets.  Use this functionality to get the best out of your Multicore.
In-flight Acceleration:  Acceleration functions provided by hardware that can be used during packet processing are called In-flight acceleration functions.  Crypto, crypto with protocol offload, pattern matching and XML acceleration are some of the acceleration functions in this category.  Here the packet/data is handed over to the hardware acceleration function by software, and software reads the results later when they are ready.  Take advantage of these features in your software wherever they are available.  Some Multicore SoCs differentiate themselves by doing a lot more in the acceleration functions; for example, some do protocol offload along with crypto - IPsec ESP, SSL record layer protocol, SRTP and MACsec offloads - which go beyond plain crypto offload.

People often ask me how to use these acceleration functions. I detailed this a long time back; please see those earlier posts for the details.

Software Directed Ingress In-flow accelerations:

As described before, ingress in-flow acceleration is applied before the packets are given to the software. Packets received on integrated Ethernet controllers go through this acceleration. But many times this acceleration is needed from software too. Take the example of IPsec, SSL or any tunneling protocol: once the software processes these packets, that is, once it gets hold of the inner packets, it would like ingress in-flow acceleration to be applied on those inner packets for distribution across cores and for the other acceleration functions. To facilitate these scenarios, some Multicore SoCs provide the concept of an 'offline port', which allows software to reserve an offline port and send traffic through it for ingress in-flow acceleration (a hedged sketch of this usage is given after the list below). Some software features that can take advantage of this are:
  • Tunneled traffic, as described above, to let the inner packets go through ingress in-flow acceleration.
  • Reassembled IP traffic - once the fragments are reassembled, the packet carries the full 5-tuple, which can be used to distribute the traffic through the offline port.
  • L2 encapsulated packets - such as IP packets extracted from PPP, FR etc.
  • Ethernet controllers on PCI and traffic from wireless interfaces - here the traffic is first read by software, and ingress in-flow acceleration is not applied on that receive path. After getting hold of the packets, software can direct them to the in-flow acceleration functions through offline ports.
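
The offline-port programming model differs from SoC to SoC, so the following is only a minimal sketch under assumed names: offline_port_reserve(), offline_port_enqueue() and the packet buffer layout are hypothetical, not a real vendor API. The idea is simply that, after (say) IPsec decapsulation, the inner packet is re-injected so that the hardware parser, classifier and distributor see it as if it had arrived on a physical port.

    /* Hypothetical offline-port API; real SoCs expose this differently. */
    struct offline_port;                           /* assumed opaque handle    */
    struct pkt_buf { void *data; unsigned len; };  /* simplified packet buffer */

    extern struct offline_port *offline_port_reserve(int id);            /* assumed */
    extern int offline_port_enqueue(struct offline_port *p,
                                    struct pkt_buf *pkt);                /* assumed */

    /* Called after IPsec inbound processing has produced the inner packet. */
    static int reinject_inner_packet(struct offline_port *op, struct pkt_buf *inner)
    {
        /* The inner packet now carries its own 5-tuple, so the hardware
         * distributor can spread these flows across cores and apply the
         * usual ingress checks, just as for packets from a physical port. */
        return offline_port_enqueue(op, inner);
    }
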
Use Multicore core features wherever they are available

Multicore SoCs from different vendors have different core architectures. Some are based on Power Architecture cores, some on MIPS cores, and Intel Multicore processors are based on x86. Multicore SoC vendors provide different features to improve the performance of Multicore applications. Whenever they are available, software should make use of them to get the best performance out of the cores. Some of the features I am aware of are listed below.

Single Instruction, Multiple Data (SIMD) instructions

Multicore SoCs from Freescale and Intel have this block in their cores. It allows software to perform a given operation on multiple data elements at once; this kind of parallelism is called 'data level parallelism'. An 'add' operation in typical cores is performed on 32-bit or at most 64-bit operands. The current generation of SIMD units operates on 128-bit operands and also provides the flexibility to do multiple 16-bit or 32-bit add operations on different parts of the data simultaneously. SIMD greatly helps operations that involve arithmetic, bit manipulation, copy and compare on large amounts of data. Many operations done in a loop can be accelerated using SIMD. In the networking world, SIMD is helpful in the following cases:
  • Memory compare, copy,  clear operations.
  • String compare, copy, tokenization and other string operations.
  • WFQ scheduling in QoS, where multiple queues need to be checked to figure out which queue should be scheduled next based on the sequence number property of the queues. If the sequence numbers are arranged in an array, SIMD can be used very effectively.
  • Crypto operations.
  • Big Number arithmetic which is useful in RSA, DSA and DH operations.
  • XML Parsing and schema validations.
  • Search algorithms -  Accelerating compare operation to find matching entry from collision elements in a hash list.
  • Checksum verification and generation: In some cases the ingress and egress in-flow accelerators can't be used to verify and generate checksums. One example is TCP and UDP packets arriving inside an IPsec tunnel: since the packets are encrypted, the ingress and egress accelerators cannot verify or generate the checksums of the inner packets. Similarly, packets that get encapsulated in tunnels cannot take advantage of ingress and egress in-flow acceleration. The checksum verification and generation then has to be done in software by the cores, and SIMD helps tremendously in those cases.
  • CRC verification and generation: These algorithms are not expensive enough to justify in-flight (look-aside) acceleration, yet they are not cheap for the core either. SIMD helps here because it requires no architectural changes to the software and still gives much better performance than cores without SIMD.
Normally, SIMD-capable cores give at least 50% performance improvement on these kinds of workloads. So, as a software developer, figure out which parts of your code can be improved using SIMD and modify them to improve the performance of your application.
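
As a concrete (if simplified) example, here is a 16-bytes-at-a-time memory compare using the SSE2 intrinsics available on x86 Multicore processors; comparable vector intrinsics exist on other core architectures. This is only a sketch to show the idea - a production routine would also handle alignment and the non-multiple-of-16 tail.

    #include <emmintrin.h>   /* SSE2 intrinsics */
    #include <stdbool.h>
    #include <stddef.h>

    /* Compare two buffers 16 bytes at a time.  'len' is assumed to be a
     * multiple of 16 for brevity; a real implementation handles the tail. */
    static bool mem_equal_sse2(const void *a, const void *b, size_t len)
    {
        const __m128i *pa = (const __m128i *)a;
        const __m128i *pb = (const __m128i *)b;

        for (size_t i = 0; i < len / 16; i++) {
            __m128i va = _mm_loadu_si128(pa + i);   /* unaligned 128-bit loads */
            __m128i vb = _mm_loadu_si128(pb + i);
            __m128i eq = _mm_cmpeq_epi8(va, vb);    /* per-byte compare        */

            /* movemask yields one bit per byte; 0xFFFF means all 16 matched. */
            if (_mm_movemask_epi8(eq) != 0xFFFF)
                return false;
        }
        return true;
    }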

Speculative Hardware Data Prefetching & Software Directed Prefetching

Speculative hardware prefetching fetches the next cache line beyond the current memory access in the hope that software will use it. Many cores provide control to enable and disable this at run time. Software can take advantage of it while doing memory copy, set and compare operations; any data arranged linearly in memory (such as arrays) can get a good performance boost from it. Note that if this feature is not used selectively and carefully, it can even degrade performance, so use it with care.

Many cores also provide a special instruction to warm the cache given a memory address. Software developers know which module will process the packet next, and many times the next module's session context is also known. In those cases, software can be written so that the next module's session context is prefetched while packet processing happens in the current module. When the next module gets control of the packet, the session context is already in the cache, which avoids fetching it from DDR serially. My experience is that software directed prefetching gives very good results and ensures that performance does not drop even with a large number of sessions.
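
On GCC-based toolchains, software directed prefetch is usually expressed with __builtin_prefetch(), which maps to the core's cache-touch instruction (dcbt on Power cores, the prefetch instructions on x86 and MIPS). The sketch below assumes a hypothetical pipeline in which a firewall module already holds a reference to the IPsec SA the packet will need next; the structure and function names are mine, not any stack's API.

    #include <stddef.h>

    struct ipsec_sa;                     /* next module's session context (opaque) */

    struct fw_session {                  /* current module's session context       */
        struct ipsec_sa *next_sa;        /* cached reference to the next context   */
        /* ... firewall/NAT state ... */
    };

    struct packet;                       /* opaque packet handle (assumed)         */

    static void firewall_process(struct fw_session *s, struct packet *pkt)
    {
        /* Warm the cache with the next module's context before it is needed.
         * Arguments: address, 0 = prefetch for read, 3 = high temporal locality. */
        if (s->next_sa)
            __builtin_prefetch(s->next_sa, 0, 3);

        (void)pkt;  /* firewall/NAT processing of 'pkt' elided in this sketch;
                     * by the time the IPsec module runs, its SA should already
                     * be in cache instead of being fetched from DDR serially. */
    }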

Some Multicore SoCs provide support for cache warming on incoming packets. As part of making packets ready for the software, these SoCs warm the cache with part of the packet content, annotation data containing the parsed fields, and software-issued context data. When the software dequeues the packet, most of the information required by the module that picks it up is already in the cache, thereby avoiding on-demand DDR accesses. Software can program its context on a per-queue basis. Note that this feature helps the first module that receives the packet; that is good enough, because this module can prefetch the next module's context while the packet is being processed, as described above. As long as each module does this, there is no performance degradation even at high capacity.

As described before, hardware queues may not have a one-to-one correspondence with the receiving module's session contexts; a queue might hold packets for multiple session contexts. Many times software maintains the sessions in a hash table with a large number of buckets, with colliding sessions arranged in a linked list or RB tree. Software can ensure that there are as many queues as hash buckets and program the first collision element as the per-queue context. If the matching context is not the one that was programmed, the full benefit of hardware cache warming is not obtained; but if there are four collision elements with roughly equal traffic, cache warming still helps about 25% of the time. Some developers even store the collision elements in an array and program the array as the queue context.

Software directed prefetch works very well as long as there is a one-to-one correspondence between the current module's session context and the next module's session context: the current context caches a reference to the next context and uses it for the prefetch. The scheme also works fine if the next module's context is a superset of multiple current module contexts. But it does not work well if the next module's context is finer grained - for example, when an IPsec SA carries packets belonging to multiple firewall/NAT sessions and the packet moves from the IPsec module to the firewall module. In that case the 'Software Directed Ingress In-flow acceleration' method can be used to have the hardware deliver the packet to the next module; this not only provides cache warming but also distributes the processing across cores.


Hardware Page Table walk:

Some cores provide a nested hardware page table walk to find the physical address for a given virtual address. This is really useful for user space applications on Linux-like operating systems. Support for hardware page table walk is expected to come from the operating system vendor, but unfortunately many OS vendors do not take advantage of it. As a software developer, if your Multicore SoC provides this feature, don't forget to ask your OS vendor to use it. This ensures that your performance does not drop when you move your application from a bare-metal environment (where TLB entries are fixed and no page walk is required) to Linux user space.

I hope it helps.

Sunday, October 10, 2010

Fast path IPsec implementations - Developer integration tips on the inbound policy check

The basic purpose of IPsec fast path implementations is to reduce the IPsec processing load on the main processing cores. Since most of the IPsec processing is the same across different kinds of packets, offloading this processing to hardware makes sense.

There are companies today that provide fast path implementations - either as a software component or as an add-on card, such as a PCIe card that plugs into the main processing unit (an x86-based motherboard, for example).

Software-based fast path implementations are becoming quite popular in Multicore processing environments. The fast path runs on some cores and the rest of the cores are used for other applications.

Ipsec fast path implementations typically work as follows:

  • The fast path typically owns the Ethernet and other L2 ports. That is, all packets come to the fast path plane first.
  • If there is enough state information to process the packet, the fast path acts on it without involving the normal path running on the main cores; the packet might even be transmitted out directly. If the packet requires some application processing that is not present in the fast path, it is handed over to the normal path. In the case of an IPsec fast path, decrypted packets are given to the normal path in the inbound direction; in the outbound direction, the fast path does the IPsec processing before the packet is sent out.
The basic purpose of the fast path is to save CPU cycles on the main cores so that they can do other processing.
Not all fast path implementations from different vendors are created equal.

In this post I would specifically like to concentrate on the 'inbound policy check'. Some fast path implementations skip this check, and the reason vendors typically give is performance. Some people believe it can be skipped without any security implications. Unfortunately, that is not true.


What is inbound policy check?

The inbound policy check ensures that a decapsulated IPsec packet arrived on the SA that was negotiated for that traffic, and also that the inbound policy rules allow the traffic to come through.
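
In pseudo-code terms the check is small; the sketch below is my own illustration, not any particular fast path API. spd_lookup_inbound(), struct spd_rule and the flat selector fields are assumed names (real selectors are ranges/subnets, and SAs usually come in bundles). After decapsulation, the inner packet's selectors are matched against the inbound SPD, and the packet is accepted only if the matching rule's action is 'apply' and the SA attached to that rule is the SA that actually decapsulated the packet.

    #include <stdint.h>
    #include <stdbool.h>

    struct ipsec_sa;                   /* opaque SA handle (assumed)             */

    struct selectors { uint32_t sip, dip; uint8_t proto; uint16_t sport, dport; };

    enum spd_action { SPD_BYPASS, SPD_DISCARD, SPD_APPLY };

    struct spd_rule {                  /* simplified inbound SPD entry (assumed) */
        struct selectors match;        /* selectors; ranges/subnets in reality   */
        enum spd_action action;
        struct ipsec_sa *sa;           /* SA associated with this rule           */
    };

    /* Assumed helper: ordered, first-match lookup in the inbound SPD. */
    extern struct spd_rule *spd_lookup_inbound(const struct selectors *sel);

    /* Returns true if the decapsulated inner packet may be forwarded. */
    static bool inbound_policy_check(const struct selectors *inner_sel,
                                     const struct ipsec_sa *decap_sa)
    {
        struct spd_rule *rule = spd_lookup_inbound(inner_sel);

        if (!rule || rule->action != SPD_APPLY)
            return false;              /* no rule, bypass or discard: drop       */

        /* Core of the check: the SA that decapsulated the packet must be the
         * SA attached to the matching inbound policy. */
        return rule->sa == decap_sa;
    }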


What are the issues if the inbound policy check is not done?

I can think of two issues - a DoS attack, and allowing traffic that is supposed to be denied (an access control violation).

DoS attack:

Let us assume that a corporate gateway has two tunnels to two different partners - Partner1 and Partner2. Without the inbound policy check, it is possible for Partner1 to interfere with the sessions/traffic between the corporate gateway and Partner2; that is, Partner1 can mount a denial of service attack on Partner2's traffic. Even though I have taken the example of partners, this kind of attack is possible among IPsec remote access users too.

Let us assume this scenario:

10.1.10.0/24-----------SGW---------Internet-----------PSGW1-----------10.1.11.0/24
                                                             |
                                                             |-----------------PSGW2------------10.1.12.0/24


SGW:  Security Gateway of a corporation - It is protecting network 10.1.10.0/24
PSGW1:  Partner 1 Security Gateway - Its LAN is 10.1.11.0/24
PSGW2:  Partner 2 Security Gateway - Its LAN is 10.1.12.0/24

There are two security tunnels from SGW - One to PSGW1 and another to PSGW2. Let us call them Tunnel1 and Tunnel2 respectively.

Tunnel1 is negotiated to secure traffic between 10.1.10.0/24 and 10.1.11.0/24. Tunnel2 is negotiated to secure traffic between 10.1.10.0/24 and 10.1.12.0/24. Let us also assume that the Tunnel1 SPI at the SGW is SPI1 and the Tunnel2 SPI at the SGW is SPI2.

Any packet coming from PSGW1 is expected to have SPI1 in its ESP header, an inner source IP address in 10.1.11.0/24 and an inner destination IP address in 10.1.10.0/24. Similarly, any packet coming from PSGW2 is expected to have SPI2 in its ESP header, an inner source IP address in 10.1.12.0/24 and an inner destination IP address in 10.1.10.0/24.

Now to the attack scenario:

If the PSGW1 network sends inner packets whose IP addresses are other than 10.1.11.0/24 and 10.1.10.0/24, SGW is expected to drop those packets - but it can do so only if it performs the inbound policy check. If PSGW1 is allowed to send arbitrary inner packets, PSGW1 and its network can misuse this by sending inner packets carrying IP addresses of the PSGW2 LAN and the SGW LAN. Since it sends the traffic on a valid SA using its own SPI, SGW's IPsec packet processing completes without error, and if no further check is done, this traffic reaches the SGW LAN. Depending on the type of traffic, different attacks are possible, for example:
  • If an attacker behind PSGW1 guesses the TCP ports of some long-lived sessions between the PSGW2 network and the SGW network, it can send RST packets or ICMP error messages to terminate those connections.
  • An attacker behind PSGW1 can send ICMP Echo messages to a broadcast or multicast address of the SGW LAN with the source IP of a PSGW2 LAN machine. Replies from all machines on the SGW LAN then go to the victim machine on the PSGW2 LAN and overwhelm it.
If SGW checks the inbound policy on the inner IP packet after IPsec decapsulation, it finds that the SA associated with the matching inbound policy is not the SA that was used to decapsulate the packet. Whenever there is such a mismatch, the packet is expected to be dropped; hence no malicious traffic reaches the SGW LAN in the above scenario. Also, by logging these events, the administrator can identify the misbehaving peer security gateway and take appropriate out-of-band action.

Access Control Violation:

This is another problem that arises if the inbound policy check is not done.
Many IPsec normal path implementations let administrators add multiple rules, with different actions, to the security policy database (SPD). Rules normally have 5-tuple selectors, with ranges/subnets/exact IP addresses for source and destination and ranges/exact values for UDP/TCP ports. The action can be one of 'Bypass', 'Discard' and 'Apply'. Rules are arranged in an ordered list; during packet processing, the rule search stops on the first match and the action of the matching rule is taken. 'Bypass' forwards the packet without any IPsec processing, 'Discard' drops the packet, and 'Apply' means IPsec processing is applied. Administrators normally configure the rules with respect to outbound traffic; the inbound policy rules are created automatically by the system from the outbound rules by reversing the selectors - SIP becomes DIP and vice versa, and similarly SP becomes DP and vice versa (a small sketch of this reversal follows).
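
As a small illustration of the reversal just described, the sketch below derives an inbound rule from an outbound one by swapping the source and destination selectors. The types reuse the simplified selector layout from the earlier sketch and are not a real implementation's data structures.

    #include <stdint.h>

    struct selectors { uint32_t sip, dip; uint8_t proto; uint16_t sport, dport; };
    enum spd_action { SPD_BYPASS, SPD_DISCARD, SPD_APPLY };
    struct spd_rule { struct selectors match; enum spd_action action; };

    /* Derive the inbound rule for an outbound rule by reversing the selectors. */
    static struct spd_rule make_inbound_rule(const struct spd_rule *out)
    {
        struct spd_rule in = *out;          /* the action is carried over        */

        in.match.sip   = out->match.dip;    /* SIP becomes DIP and vice versa    */
        in.match.dip   = out->match.sip;
        in.match.sport = out->match.dport;  /* SP becomes DP and vice versa      */
        in.match.dport = out->match.sport;
        /* the protocol selector stays the same */
        return in;
    }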

Now let us look at a possible access policy violation with the following example:

Let us take these two policy rules in outbound list:

Rule 1:  SIP:  10.1.10.0/24  DIP 10.1.11.0/24  Protocol UDP   Action :  Discard
Rule 2:  SIP:  10.1.10.0/24  DIP 10.1.11.0/24  All Protocols   Action : Apply.

Inbound policy rule list would look like this:

Rule 1:  SIP:  10.1.11.0/24  DIP:  10.1.10.0/24  Protocol:  UDP   Action : Discard
Rule 2:  SIP: 10.1.11.0/24  DIP: 10.1.10.0/24  Protocol: All   Action: Apply.

The administrator creates the rules in the above fashion to discard any UDP traffic between the networks, but to allow everything else by securing the traffic.

Assume that above policy rules are created in SGW1.

10.1.10.0/24----------SGW1----------Internet-----------SGW2-------10.1.11.0/24

When a TCP packet is sent between the SGW1 LAN and the SGW2 LAN, the second rule is matched and an SA is created to carry traffic between 10.1.10.0/24 and 10.1.11.0/24 for all protocols. If SGW2, either through misconfiguration or intentionally, sends UDP traffic in the SGW1-SGW2 tunnel, SGW1 is expected to drop the packet even if it successfully decrypts and decapsulates it. This can only happen if SGW1 does the inbound policy check on the inner IP packet.

If SGW1 does not do any inbound policy check, the UDP traffic is passed to the 10.1.10.0/24 network, thereby violating the access rules configured by the administrator.

I hope this gives good reasoning on why the inbound policy check is required. Some fast path implementations don't do it. So, as a development/integration engineer, please ensure that not only your implementation but also the fast path implementation does all the required checks.

Comments?

Fragmentation before IPsec Encapsulation - Red side fragmentation and more use cases

I am finding more and more benefits of doing 'red side' (pre-encapsulation) fragmentation in the IPsec world.

One use case is given here: with red side fragmentation, the switches/routers between the tunnel security gateways don't see fragmented packets. The problem of some service providers' routers giving lower priority to fragmented packets therefore doesn't arise.

A second use case is given here: when the majority of the traffic goes over IPsec tunnels, LAG can't distribute the traffic across ports, since the resulting traffic has the same 5-tuple information. As described in that post, multiple IPsec tunnels are normally created with forceful NAT-T, and all packets coming out of the IPsec engine are expected to carry 5-tuple information. If fragmentation is done after encapsulation, the LAG sees some packets (non-initial fragments) without the UDP header and hence without a full 5-tuple, which results in uneven distribution. Red side fragmentation ensures that the LAG sees a 5-tuple in every packet.

Third use case:  Avoid mis-ordering of the packets:

There can be both big and small packets in the traffic. Big packets may get fragmented after IPsec encapsulation if the resulting size exceeds the MTU of the outgoing interface; small packets may not get fragmented even after encapsulation.

The gateway receiving the IPsec packets is expected to process them in order, and with fragmented packets this may not happen. Say the gateway receives the first fragment of packet 1, then full packet 2, and then the second and final fragment of packet 1, in that order. Ideally it would process them in that same order, but since packet 1 has to wait for its second fragment, full packet 2 gets processed first. Gateways do not hold up full packets, because they cannot know whether, or when, the remaining fragment of packet 1 will arrive.

So, this leads to packet mis-order.

This can be avoided if the encapsulated packets are never fragmented. Solution: red side fragmentation.
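
A minimal sketch of the decision involved, under assumed numbers: the sender estimates the ESP tunnel-mode overhead and fragments the clear (red) packet before encapsulation whenever the encapsulated size would exceed the outgoing MTU. The overhead constant below is rough and illustrative only - it depends on the cipher, padding, ICV length and whether NAT-T is in use, and real code derives it from the SA.

    #include <stdint.h>
    #include <stdbool.h>

    /* Rough ESP tunnel-mode overhead (illustrative): outer IP (20) +
     * UDP for NAT-T (8) + ESP header (8) + IV (16) + padding/trailer (~18) +
     * ICV (12..16).  Real code computes this from the negotiated SA. */
    #define ESP_TUNNEL_OVERHEAD  86u

    /* Must the red (inner) packet be fragmented before encapsulation so
     * that the black (outer) packet fits the link MTU? */
    static bool needs_red_side_fragmentation(uint32_t inner_len, uint32_t mtu)
    {
        return inner_len + ESP_TUNNEL_OVERHEAD > mtu;
    }

    /* The red-side fragment size is chosen so that each fragment, once
     * encapsulated, still fits in one MTU-sized outer packet. */
    static uint32_t red_fragment_size(uint32_t mtu)
    {
        uint32_t size = mtu - ESP_TUNNEL_OVERHEAD;
        return size & ~7u;   /* IP fragment offsets are multiples of 8 bytes */
    }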

Comments?

Tuesday, September 28, 2010

Look-aside acceleration & Application Usage scenarios

Performance and flexibility are two factors that play a role in how applications use look-aside accelerators. As described in an earlier post, applications use accelerators in synchronous or asynchronous fashion. In this post I give my view of the different types of applications and their usage of look-aside accelerators.

I assume that all applications are running in Linux user space and that they use the HW accelerators by memory mapping the accelerator registers into user space. Based on these assumptions, I categorize applications into these types:

  • Per-packet processing applications with a dedicated core for the user process and HW polling mode: In this type, the application runs in a user process, and a core or set of cores is dedicated to that process; these cores do nothing other than execute this process. Since the cores are dedicated, they can wait on a HW interface until some event is ready to be processed. In this mode the Multicore hardware is expected to provide a single interface to wait for events. The application waits in a loop forever for events; when an event is ready, it takes the appropriate action based on the event type and then goes back to waiting (a hedged sketch of such a loop is given after this list). This model suits per-packet processing applications such as IP forwarding, Firewall/NAT, IPsec, MACSec, SRTP etc.
    • Per-packet processing applications use look-aside accelerators in asynchronous fashion. Incoming packets from Ethernet or other L2 interfaces and the results from the look-aside accelerators are delivered through the same common HW interface.
    • The typical flow is something like this: when an incoming packet is ready on an Ethernet port, the polling function returns with a 'new packet' event. The packet is processed in user space until, at some point, it needs to be sent to a HW accelerator; it is handed to the accelerator and the application goes back to polling. The accelerator later returns the result through the same HW interface. When the polling function returns with an 'acceleration result' event, the user process processes the result and may send the packet out on some other Ethernet port. More packets may well be processed before the acceleration results for earlier packets come back. Due to this asynchronous operation, the cores are well utilized and system throughput is very good.
    • IPsec, MACSec and SRTP use crypto algorithms in asynchronous fashion.
    • PPP and IPsec IPCOMP use compression/decompression accelerators in asynchronous fashion.
    • Some portions of DPI use pattern matching acceleration in asynchronous fashion.
  • Per-packet processing applications with non-dedicated cores for the user process and SW polling mode: This is similar to the type above, except that the core(s) are not dedicated to the user process. HW polling is therefore not used, as it would stop the core from relinquishing control for other work. SW polling is used instead, typically via the epoll() call, with interrupts delivered to user space through the UIO facilities provided by Linux. An interrupt is raised whenever a packet or an accelerator result is ready; UIO wakes up the epoll() call in user space, and when epoll() returns the application reads the event from the HW interface and executes a different function based on the event type.
    • All per-packet processing applications such as IPsec, SRTP, MACSec and Firewall/NAT can also work in this fashion.
    • IPsec, MACSec and SRTP use crypto algorithms in asynchronous fashion.
    • PPP and IPsec IPCOMP use compression/decompression accelerators in asynchronous fashion.
    • Some portions of DPI use pattern matching acceleration in asynchronous fashion.
  • Stream-based applications: Stream-based applications normally work at a higher level, away from packet reception and transmission. For example, proxies and servers work on BSD sockets - the data they receive is TCP data, not individual packets. A crypto file system is another kind of stream application; it works on data, not on packets. These applications collect data from several packets, and sometimes this data gets transformed, for example decoded into some other form. HW accelerators are used on top of this data, and in almost all cases they are used in synchronous fashion. In this type of application, synchronous mode is used in two ways - waiting for the result in a tight loop without relinquishing control, or waiting for the result in a loop that yields to the operating system. The first sub-mode (tight loop) is used when the HW acceleration function takes very little time and the second (yield mode) when it takes longer.
    • Public key acceleration such as RSA sign/verify, RSA encrypt/decrypt, DH operations and DSA sign/verify works in yield mode, as these operations take a significant number of cycles. Applications that require this acceleration: IKEv1/v2, SSL/TLS based applications, EAP server etc.
    • Symmetric cryptography such as AES in its various modes, hashing algorithms and PRF algorithms is used in the tight-loop sub-mode, as these operations take fewer cycles. Note that yielding can cost anywhere between 20,000 and 200,000 cycles depending on the number of other ready processes, which is not an acceptable latency for these operations. Applications: those based on SSL/TLS, IKEv1/v2, EAP server etc.
    • I would put compression/decompression HW accelerator usage in a slightly different sub-mode. Compression/decompression works in the following fashion for each context:
      • The software thread issues the operation.
      • It immediately reads any pending result (from previous operations); note that the thread does not wait for the result.
      • It works on the result if one is available.
      • The above steps happen in a loop until there is no more input data.
      • At the end, it waits (in yield mode) until all remaining results are returned by the accelerator.
    • Applications that can use compression accelerators: HTTP proxy, HTTP server, Crypto FS, WAN optimization etc.
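
Below is a hedged sketch of the dedicated-core polling loop described in the first category above. The event interface differs from SoC to SoC and SDK to SDK, so hw_poll(), the event structure and the handler names are assumptions of mine, not a real API; the point is only the shape of the loop - one wait point, dispatch on event type, and no blocking calls.

    enum hw_event_type { EV_NEW_PACKET, EV_ACCEL_RESULT };

    struct hw_event {                        /* simplified event descriptor      */
        enum hw_event_type type;
        void *desc;                          /* packet or accelerator result     */
    };

    /* Assumed single HW wait interface mapped into user space (hypothetical). */
    extern int hw_poll(struct hw_event *ev);

    extern void process_new_packet(void *pkt_desc);      /* may enqueue to accel */
    extern void process_accel_result(void *result_desc); /* may transmit packet  */

    /* Dedicated-core event loop: never blocks, never yields. */
    static void packet_main_loop(void)
    {
        struct hw_event ev;

        for (;;) {
            if (hw_poll(&ev) < 0)            /* nothing ready yet: keep polling  */
                continue;

            switch (ev.type) {
            case EV_NEW_PACKET:
                process_new_packet(ev.desc);
                break;
            case EV_ACCEL_RESULT:
                process_accel_result(ev.desc);
                break;
            }
        }
    }
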
 Any comments?

Sunday, September 26, 2010

LAG, Load Rebalancing & QoS Shaping

The LAG feature exposes only one L2 interface to the IP stack for each LAG instance and hides all the links underneath it. This is good in the sense that the IP stack and other applications are completely unaware of the number of links being added and removed.

Though many applications and the IP stack don't care about the LAG links and their properties, one application - QoS - does need to worry about link properties, specifically link bandwidth (shaping bandwidth). In an ideal world even QoS would not need to worry about links and their properties. As we all know, to ensure that there is no mis-ordering of packets within a given conversation, the distributor function of the LAG module distributes conversations across the links, not individual packets (a small hash-based link selection sketch is given below). When there is a large number of conversations compared to links, the traffic usually distributes fairly evenly across the links. But when there are only a small number of conversations, which by the way is not uncommon, unequal distribution of traffic is possible: some conversations carry more traffic than others, and if the high-traffic conversations land on a few links, the distribution becomes unequal. Let me cover QoS and the changes required in QoS to work with LAG.
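
For concreteness, here is the usual shape of conversation-based distribution. The hash is deliberately simple and illustrative; real distributors use better mixing functions (CRC32, Toeplitz and the like) and may hash on MAC addresses or only a subset of these fields.

    #include <stdint.h>

    struct conv_key {                  /* fields identifying a conversation */
        uint32_t sip, dip;
        uint16_t sport, dport;
        uint8_t  proto;
    };

    /* Toy mixing function, for illustration only. */
    static uint32_t conv_hash(const struct conv_key *k)
    {
        uint32_t h = k->sip ^ k->dip;
        h ^= ((uint32_t)k->sport << 16) | k->dport;
        h ^= k->proto;
        h ^= h >> 16;
        h *= 0x45d9f3b;                /* arbitrary odd constant for mixing */
        h ^= h >> 16;
        return h;
    }

    /* All packets of a conversation map to the same link, so ordering is
     * preserved; distribution quality depends on the number of conversations. */
    static unsigned select_link(const struct conv_key *k, unsigned num_links)
    {
        return conv_hash(k) % num_links;
    }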

Load Rebalancing:

The LAG distributor normally implements the concept of 'load rebalancing'. Load rebalancing happens in three cases:
  •  When LAG observes that there is unequal distribution.
  •  When new link is added to the LAG instance.
  •  When existing link is removed, disabled or broken.
Though adding a new link to and removing an existing link from the LAG instance are not the focus of this article, let me give a gist of the issues that need to be taken care of. The packet mis-ordering issue must be handled well. When a new link is added, if the hash distribution is changed immediately, some existing conversations may be rebalanced onto other links; if this is done arbitrarily, the collector may receive packets out of order for a brief period. To make sure the new link is used effectively, two methods can be used (they can also be used together):
  •  New conversations use the new hash distribution.
  •  Existing conversations are moved onto other links only after being idle for X milliseconds - a time by which we know their packets have been collected by the collector.
When a link is no longer active, packet mis-ordering is not a big concern: the traffic has to keep flowing, so the new distribution can take effect immediately and the conversations that belonged to the old port can be moved to the remaining ports right away.

Now on to redistribution due to unequal utilization of links:

Redistribution can be done in two ways. The first is to change the hash algorithm or the fields fed to it. The second is to somehow increase the number of conversations; this works only in cases where tunnels (such as IPsec) are the conversations. By increasing the number of tunnels, the distribution can be improved, with the actual 5-tuple flows spread over multiple tunnels. See this link here on how LAG and IPsec work together.

Changing the hash algorithm, or adding/removing fields to it, has mis-ordering implications. In some deployments occasional mis-ordering is acceptable, and in those cases this method can be used, but rebalancing should not happen very frequently. A typical method is: if a link's utilization is more than X% (typically 5 to 10%, a configurable parameter) away from the average usage of the trunk, it becomes a candidate for redistribution; after a redistribution, hold off further redistributions for a configurable number of seconds so that they don't happen too often.

QoS:

Typically the QoS shaping and scheduling function runs on top of L2 interfaces, and the trunk link is given the shaping bandwidth. Shaping is typically implemented using a token bucket algorithm: whenever tokens are available, the scheduling function is invoked; it selects the next packet and sends it out.

The LAG instance, acting as the L2 interface, has a shaping bandwidth that is the sum of all the links. If the scheduling decision is taken purely on the basis of the LAG trunk bandwidth, a scheduled packet may get dropped if it lands on a link that is already fully utilized; this happens when the traffic across conversations is uneven. Rebalancing helps, but it takes some time to rebalance the traffic. Hence the QoS shaping and scheduling function should consider not only the LAG instance bandwidth but also the individual link bandwidth when making the scheduling decision. By considering both, a packet from a high-traffic conversation is simply not scheduled and stays in its queue, thereby avoiding a packet drop.
At the same time, it is not good to under-utilize the other links; scheduling can, in that case, move on to other traffic that falls on the under-utilized links.

LAG is an important feature, but it has its own challenges. IPsec and QoS implementations need to work with LAG properly to utilize it effectively.

Comments?

eNodeB and IPsec

An eNodeB secures its traffic to the Serving Gateway (SGW) over the backhaul network using IPsec tunnels. It also creates many tunnels to peer eNBs for X2 and handover traffic. Though all IPsec features are relevant in eNB scenarios, some features are worth mentioning specifically in the eNB context.

LAG and IPsec:

Please see this link here to understand the issues and solutions related to LAG and IPsec in general. This scenario is very much valid for eNB-to-SGW communication. Note that, in the non-handoff scenario, traffic from all GTP tunnels goes between the eNB and the SGW on one or a few (with DSCP-based tunnels) IPsec tunnels. When LAG is used between the eNB and the SGW, the same issue of not utilizing more than one link arises. Both solutions suggested in the earlier article are valid in this scenario too. Where it is difficult to get multiple public IP addresses for the LAG link, scenario 2 - forceful NAT-T - is the only option I can think of.


Capabilities expected in eNB and SGW:

Using LAG effectively requires many tunnels. It is good to have 1 + (number of links - 1) * 32 IPsec tunnels for good distribution across links. User traffic, in this case GTP traffic, should be balanced across these IPsec tunnels.

Typically there are two GTP tunnels for each cell phone user - one for data traffic and another for voice traffic. Without LAG, normally two IPsec tunnels are created - one carrying the data traffic of all cell users and another carrying the voice traffic of all cell users. GTP traffic is distributed across these two IPsec tunnels based on DSCP value.

Now we have a lot more IPsec tunnels, so there should be additional logic in the eNB and the SGW that distributes the GTP traffic across them. This logic should send the traffic of a given conversation to one IPsec tunnel, and each GTP tunnel's traffic can be viewed as one conversation; that is, GTP tunnels are distributed across the IPsec tunnels. One way is to look at the TEID (Tunnel Endpoint Identifier) and hash on it to distribute the traffic across the IPsec tunnels, as sketched below.
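
A minimal sketch of that mapping, assuming the TEID has already been extracted from the GTP-U header; the hash and the tunnel-table layout are illustrative only.

    #include <stdint.h>

    struct ipsec_tunnel;                      /* outbound tunnel/SA (opaque here) */

    /* Simple TEID mixing; a real implementation may use CRC32 or similar. */
    static uint32_t teid_hash(uint32_t teid)
    {
        teid ^= teid >> 16;
        teid *= 0x7feb352d;                   /* arbitrary odd mixing constant    */
        teid ^= teid >> 15;
        return teid;
    }

    /* All packets of one GTP tunnel (one TEID) map to the same IPsec tunnel,
     * so per-bearer ordering is preserved while the GTP tunnels spread across
     * the IPsec tunnels (and hence across the LAG links). */
    static struct ipsec_tunnel *select_ipsec_tunnel(uint32_t teid,
                                                    struct ipsec_tunnel **tunnels,
                                                    unsigned num_tunnels)
    {
        return tunnels[teid_hash(teid) % num_tunnels];
    }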

The IPsec implementation on the eNB and the SGW should be able to create multiple tunnels, with the number of tunnels configurable. The implementations should also be able to bring the tunnels up on demand; that is, they should ensure that this number of tunnels is up and running as long as there is traffic. Note that all these tunnel negotiations carry the same selectors, so the IPsec implementations must not apply any 'intelligence' that removes old tunnels with the same selectors.

If the persistent-tunnel feature is selected on the SPD rules, the implementations should ensure that all IPsec tunnels are up and running all the time.

As described in the earlier article, the IPsec implementation needs the capability to do 'red side fragmentation' so that the LAG always sees the UDP header in every packet, which is required for its distribution.

DSCP based IPsec tunnels:

LTE uses a packet-based network for voice, streaming, interactive and non-interactive data, so the IPsec tunnels must honor these priorities to ensure that voice and other real-time traffic takes precedence. If both data and voice are sent on the same tunnel, traffic can get dropped by the receiver's anti-replay sequence number checks. Even though packets are given increasing sequence numbers for both data and voice as they are encapsulated in the IPsec tunnel, local QoS and QoS in intermediate devices may send voice traffic ahead of data traffic; that is, the traffic gets reordered. As you know, the right edge of the receive window moves with newer sequence numbers; because of this, some data packets with lower sequence numbers get dropped if they fall below the lower edge of the window. To avoid unnecessary drops there are two methods: increase the anti-replay window size, or use different SAs (tunnels) for different kinds of traffic. The second method is normally used (a sketch of the anti-replay window check is given below).
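
To make the drop mechanism concrete, here is the standard sliding-window anti-replay check in the style of RFC 4303. The 64-bit bitmap and window size are illustrative choices, extended sequence numbers are ignored, and integrity verification is assumed to have passed before this check is applied.

    #include <stdint.h>
    #include <stdbool.h>

    #define REPLAY_WIN 64u              /* window size in packets (illustrative) */

    struct replay_state {
        uint32_t right_edge;            /* highest sequence number seen so far   */
        uint64_t bitmap;                /* bit i set => (right_edge - i) seen    */
    };

    /* Returns true if the sequence number is acceptable: not a replay and not
     * older than the window. */
    static bool replay_check_and_update(struct replay_state *st, uint32_t seq)
    {
        if (seq == 0)
            return false;                               /* sequence 0 is invalid */

        if (seq > st->right_edge) {                     /* newer packet: slide   */
            uint32_t shift = seq - st->right_edge;
            st->bitmap = (shift < REPLAY_WIN) ? (st->bitmap << shift) : 0;
            st->bitmap |= 1;                            /* mark 'seq' as seen    */
            st->right_edge = seq;
            return true;
        }

        uint32_t offset = st->right_edge - seq;
        if (offset >= REPLAY_WIN)
            return false;                               /* too old: dropped      */
        if (st->bitmap & (1ull << offset))
            return false;                               /* duplicate: dropped    */

        st->bitmap |= 1ull << offset;                   /* mark as seen          */
        return true;
    }

This is exactly why reordering across traffic classes hurts: a voice packet delivered early moves the right edge forward, and a data packet delayed by more than the window size is then dropped even though it is legitimate.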

Due to this feature and the LAG feature above, the number of tunnels that need to be created on the eNB and the SGW can go up significantly. Hence both the eNB and the SGW should have enough memory and computational power to handle many tunnels.

Persistent tunnels

This feature is needed to reduce the latency of the initial traffic: tunnels are kept up and running all the time, even when there is no traffic. It is a good fit when the links are always on and there is no usage-based cost.


DSCP and ECN Copy settings:


IPsec implementations are expected to copy the DSCP and ECN bits from the inner header to the outer header. The inner header's DSCP value is set by applications, and this marking should be preserved even when the traffic is tunneled; it ensures that the nodes between the eNB and the SGW also give the traffic proper QoS treatment. Hence it is necessary to copy the DSCP bits from the inner header to the outer header.

The ECN bits signal congestion to the peer so that the receiving end can tell the source to apply congestion control. TCP has a way to inform the sender when the receiver gets IP packets with the CE (congestion experienced) code point set in the ECN bits of the IP header. Intermediate nodes, including the eNB and the SGW, should honor this by copying the bits from the inner header to the outer header when encapsulating, and from the outer header to the inner header when decapsulating.
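
A minimal sketch of the copy for IPv4, using the Linux struct iphdr layout in which DSCP is the upper six bits and ECN the lower two bits of the 'tos' byte. The encapsulation side simply propagates the byte; the decapsulation side folds a CE mark back into an ECN-capable inner packet (the full decapsulation rules are in RFC 6040 and are more detailed than shown here). Remember that the IPv4 header checksum must be recomputed after the field is changed.

    #include <stdint.h>
    #include <netinet/ip.h>       /* struct iphdr: 'tos' carries DSCP + ECN  */

    #define ECN_MASK    0x03u     /* lower 2 bits of the TOS byte            */
    #define ECN_CE      0x03u     /* Congestion Experienced code point       */
    #define ECN_NOT_ECT 0x00u     /* not ECN-capable transport               */

    /* Encapsulation: copy inner DSCP and ECN into the outer header so that
     * backhaul nodes apply QoS and can mark congestion on the outer packet. */
    static void encap_copy_dscp_ecn(const struct iphdr *inner, struct iphdr *outer)
    {
        outer->tos = inner->tos;                  /* DSCP + ECN copied as-is */
    }

    /* Decapsulation: if the outer header was marked CE in transit and the
     * inner packet is ECN-capable, propagate CE to the inner header. */
    static void decap_propagate_ce(const struct iphdr *outer, struct iphdr *inner)
    {
        if ((outer->tos & ECN_MASK) == ECN_CE &&
            (inner->tos & ECN_MASK) != ECN_NOT_ECT)
            inner->tos |= ECN_CE;
    }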

Peer IP address adoption 

The eNodeB gets its IP address from the backhaul provider dynamically, and the provider may change that address while traffic is flowing. IPsec tunnels are expected to stay up and running even if a gateway's IP address changes. This internet draft discusses a mechanism to adopt a peer gateway address change. The feature is needed to ensure that voice traffic does not see too much jitter and latency: re-establishing a tunnel takes hundreds of milliseconds, as it involves IKE negotiation of the keys, and this introduces jitter and latency for voice traffic in progress. Implementations should support this draft to avoid such issues when the remote gateway's IP address changes.

IP Fragmentation and Reassembly

Many eNB vendors quote performance for UDP traffic without involving IP fragmentation and reassembly. Even though this is one data point, it can be misleading to customers. Most Internet traffic today is TCP, and the same is true in the LTE world. The TCP MSS is chosen such that a TCP data packet with IP and TCP headers is MTU-sized; when this traffic undergoes IPsec encapsulation, it is almost certain that the packets need to be fragmented, as the result exceeds the link MTU. Though the DF bit facility allows end points to discover the path MTU, it is not commonly used by IPv4 end points today. Since packets are fragmented, reassembly is required on the other side.

Fragmentation and reassembly are implemented in IPsec stacks, but I am afraid that many implementations, though they have very good IPsec performance on non-fragmented packets, are not optimized for the case where fragmentation and reassembly are required. Customers need to watch out for this, as a significant amount of traffic will be fragmented and reassembled.

IPv6 Support

IPv6 is fast becoming the choice of service providers and the LTE core network, so IPv6 support is expected in IPsec implementations. The eNB and SGW must support both IPv4 and IPv6 tunnels, and they should be able to carry both IPv4 and IPv6 traffic over IPv4 or IPv6 tunnels.

TFC

The Traffic Flow Confidentiality feature is normally given little importance. I was told that LTE networks require this feature in IPsec tunnels so that anybody who gets hold of the backhaul traffic cannot guess the type of traffic going through the tunnels from traffic characteristics such as frequency of packets, packet sizes, packet distribution etc.


AES-GCM

AES-GCM combines encryption and integrity in one algorithm, which is why it is called a combined-mode algorithm. It is said to be about 2x faster than AES-CBC (with a separate integrity algorithm) and to have roughly half its latency, so it is good for both throughput and the low latency required by voice traffic. Hence it is becoming a popular algorithm choice on the eNB and SGW.

Validation engineers and customers, I believe, should look for the above features in the eNB and SGW.

Comments?