Sunday, October 31, 2010

Multicore Networking applications - Mitigating the Performance bottlenecks

I gave this talk at the 2010 Multicore Expo in San Jose. The presentation slides were concise, and I voiced most of the details during the talk. Many people asked me to provide the details in written form, so I have tried to do that in this post. I hope it gives enough detail on 'New techniques to improve software performance with increasing number of cores'.

Before going further, I would like to differentiate two kinds of applications - Packet processing applications and Stream processing applications.

Packet processing applications, in my definition, are the ones which take packets one by one, work on each packet and send out the same packet, possibly with minor modifications. In packet processing applications, there is a one-to-one correspondence between input and output packets, with a small number of exceptions. One exception is IP reassembly or fragmentation. Another is when the packet is dropped by the application. Example applications in this category are: IP forwarding, L2 Bridging, Firewall/NAT, IPsec and even some portions of IDS/IPS.

Stream processing applications are the ones which take packets or a stream of data, work on the data and send out the data, possibly as different packets. Most TCP socket based proxy applications come under this category. Examples: HTTP Proxy, SMTP Proxy, FTP Proxy etc.

This post aims to help programmers who are debugging software to find the performance bottlenecks in Multicore networking applications.

Always Ensure Flow/Session Parallelization

Ensure that only one core is processing a session at any given time. If multiple packets from the same session are processed by more than one core at the same time, mutual exclusion is required on the session variables, and that is very expensive. Multicore SoCs actually help you do flow parallelization in packet processing applications. Many Multicore SoCs can parse fields from the packets, calculate a hash on software-defined fields and distribute the packets across multiple queues based on the hash value. They then let software threads dequeue the packets from the queues. These SoCs also provide a way to stop other threads from dequeuing from a queue until the thread that holds the queue explicitly gives up control. This ensures that a given flow is processed by only one software thread at any time.

Many Multicore SoCs also have a facility to bind queues to software threads and each software thread to a core. If the number of flows is small, there is a good chance that the cache is already warm with the contexts touched by previous packets, which reduces data movement from DDR. Also, many Multicore SoCs can stash the context into the cache as part of the dequeue operation, which reduces cache thrashing even if queues are not bound to cores.

Flow parallelization not only eliminates the need for Mutexes, it also ensures that there is no packet mis-ordering in the flows.

Many stateful packet processing applications require not only flow parallelization, but also session parallelization. A session typically consists of two flows - Client to Server traffic and Server to Client traffic. Packets from both flows may arrive at the device at the same time, and two separate software threads might process them. Stateful applications share many state variables across these two flows, so mutual exclusion would be required if both packets were allowed to be processed at the same time. Session parallelization, that is, ensuring that both flows of a session are processed by the same thread, eliminates the need for mutual exclusion. Unlike flow parallelization, session parallelization is not available in many Multicore SoCs for cases where the tuple values differ between the two flows, and hence it needs to be done in software. The tuples differ when NAT is applied. Note that many Multicore SoCs enqueue the packets of both flows to the same queue if there is no NAT. They are intelligent enough to generate the same hash value even though the positions of the tuple fields are swapped, that is, the source IP in one flow is the destination IP in the reverse flow, and the same is true for the destination IP, source port and destination port.
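Where the hardware does not produce a direction-independent hash, the same idea can be sketched in software. The following is only an illustration for the no-NAT case, assuming an IPv4 5-tuple; the struct, the function names and the mixing constants are hypothetical and not taken from any SoC SDK. The two endpoints are put in a fixed order before hashing, so both flows of a session hash to the same queue.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical IPv4 5-tuple; field names are illustrative. */
struct five_tuple {
    uint32_t src_ip;
    uint32_t dst_ip;
    uint16_t src_port;
    uint16_t dst_port;
    uint8_t  proto;
};

/* Simple 32-bit integer mixer (an arbitrary choice for this sketch). */
static uint32_t mix32(uint32_t x)
{
    x ^= x >> 16;
    x *= 0x7feb352dU;
    x ^= x >> 15;
    x *= 0x846ca68bU;
    x ^= x >> 16;
    return x;
}

/*
 * Order the two endpoints (ip, port) before hashing, so that swapping
 * source and destination gives the same hash for both flows of a session.
 */
static uint32_t session_hash(const struct five_tuple *t)
{
    uint32_t a_ip = t->src_ip, b_ip = t->dst_ip;
    uint16_t a_port = t->src_port, b_port = t->dst_port;

    if (a_ip > b_ip || (a_ip == b_ip && a_port > b_port)) {
        uint32_t tmp_ip = a_ip;
        uint16_t tmp_port = a_port;
        a_ip = b_ip;
        b_ip = tmp_ip;
        a_port = b_port;
        b_port = tmp_port;
    }

    uint32_t h = mix32(a_ip);
    h = mix32(h ^ b_ip);
    h = mix32(h ^ (((uint32_t)a_port << 16) | b_port));
    h = mix32(h ^ t->proto);
    return h;
}

/* Map a packet to one of num_queues queues (and hence to one thread). */
static unsigned session_queue(const struct five_tuple *t, unsigned num_queues)
{
    assert(num_queues > 0);
    return session_hash(t) % num_queues;
}
```

With NAT, the reverse flow carries translated addresses, so this trick alone is not enough; the software would have to look up the session first and then hand the packet to the thread that owns it.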

Stream processing modules such as proxies need to ensure that both the client side and server side sockets of a session are processed by the same software thread, so that no mutual exclusion is required to protect the state variables. Stream processing modules typically create many software threads - worker threads. A master thread terminates (accepts) the client side connections and hands over each connection descriptor to one of the less loaded worker threads. The worker thread is expected to create a new connection to the server and do the rest of the application processing. Worker threads typically implement an FSM so that each can process multiple sessions. More often than not, the number of worker threads is the same as the number of cores dedicated to the application. In cases where threads need to block for some operations, such as waiting for accelerator results, more threads, in multiples of the number of cores, are created to take advantage of the full power of the accelerators.

Eliminate the Mutual Exclusion Operation while Searching for Session/Flow Context

This technique also aims to ensure that there are no mutual exclusion operations in the packet path. Every networking application does some search operations on its data structures to figure out the actions to be performed on the packet/data. For each incoming packet/data, a search is done to get hold of the session/flow context, and further processing happens based on the state variables in the session. For example, IP routing searches the routing table to figure out the destination port, PMTU and other information needed for operations such as fragmentation, TTL decrement and packet transmit. Similarly, firewall/IPsec packet processing applications maintain their sessions in easy-to-search data structures such as RB trees, hash lists etc. Since sessions are added to and removed from these structures dynamically, it is necessary to protect the data structure during add/delete/search operations. Mutual exclusion using spinlocks, futexes or up/down semaphores is one way to do this. RCU (Read-Copy-Update) is another method, and it eliminates the Mutex operation during search. RCU operation is described in an earlier post. Please check that here and here. RCU read lock/unlock operations are very cheap in many operating systems. Note that Mutex operations are still required for add/delete even with RCU.
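The split between a lock-free read side and a serialized write side can be sketched with C11 atomics. This is not an RCU implementation: it shows only lookup and insert, the names (`session_lookup`, `session_insert`, `NUM_BUCKETS`) are made up for the illustration, and safe deletion is left out because it needs a grace-period mechanism such as the one the Linux kernel or a userspace RCU library provides.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_BUCKETS 256u   /* hypothetical table size */

struct session {
    uint32_t key;                     /* stand-in for the flow tuple */
    _Atomic(struct session *) next;   /* collision chain */
};

static _Atomic(struct session *) buckets[NUM_BUCKETS];
static atomic_flag writer_lock = ATOMIC_FLAG_INIT;

/* Read side (packet path): no lock is taken. The acquire loads pair
 * with the writer's release store, so a reader never sees a node
 * that is only partly initialized. */
static struct session *session_lookup(uint32_t key)
{
    struct session *s = atomic_load_explicit(&buckets[key % NUM_BUCKETS],
                                             memory_order_acquire);
    while (s != NULL && s->key != key)
        s = atomic_load_explicit(&s->next, memory_order_acquire);
    return s;
}

/* Write side: writers still serialize among themselves (a spinlock here). */
static void session_insert(struct session *s)
{
    assert(s != NULL);
    _Atomic(struct session *) *head = &buckets[s->key % NUM_BUCKETS];

    while (atomic_flag_test_and_set_explicit(&writer_lock, memory_order_acquire))
        ;   /* spin until the lock is free */

    /* Link the node to the current chain first, then publish it. */
    atomic_store_explicit(&s->next,
                          atomic_load_explicit(head, memory_order_relaxed),
                          memory_order_relaxed);
    atomic_store_explicit(head, s, memory_order_release);

    atomic_flag_clear_explicit(&writer_lock, memory_order_release);
}
```

The point of the sketch is that the search performs no store to shared memory at all, so concurrent lookups on different cores do not bounce a lock cache line between them.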

Eliminate Reference Counting 

One of the other bottlenecks in Multicore programming is the need to keep a session safe from deletion while it is being used by other software threads. Traditionally this is achieved by 'reference counting'. Reference counting is used in two cases - during packet processing, and when a neighbor module stores a reference to the session.

In the first case, the reference count of the session context is incremented as part of the session lookup operation. During packet processing, the session is referred to many times to read and update its state variables. If the session is deleted in the meantime, it must not be freed until the current thread is done with it. Otherwise, the session memory could be freed and allocated to somebody else during packet processing, and the current thread would corrupt that memory. To ensure this, the reference count is checked as part of the 'delete operation'. If it is not zero, the session is marked for deletion, but not freed until the reference count becomes zero. If the value is zero, there is no reference to the session and it gets freed.

Since RCU postpones the actual free until all other threads have finished their current processing cycles, reference counting becomes redundant. Eliminating the reference count not only improves performance, but also reduces maintenance complexity. Note that reference counting requires atomic updates of the count variable, and atomic operations are not cheap.

The second use case of reference counting is when neighbor modules store a reference (pointer) to the session in their own session contexts. By not storing the pointer, reference counting can be eliminated here too. This post helps you understand how this can be done.

Linux user space programs also can take advantage of RCUs. See this post for more details.

Use the Cache Effectively

Once the matching session is found for the incoming data/packet, the processing code uses many variables in the session. If these variables sit together in one cache line, the cache fill triggered by accessing one variable makes all the other variables in that line available in the cache. That is, accesses to the other variables do not go to DDR. But all the variables may not fit in one cache line. In those cases, group the variables that are used together so that as few cache lines as possible are touched.
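As an illustration of such grouping, here is a sketch of a session structure with the per-packet fields packed into the first cache line and the rarely used fields pushed to the next one. The structure, its field names and the 64-byte line size are assumptions made for the example; check the line size of your own core.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64   /* assumption: verify for your core */

struct fw_session {
    /* Hot fields: touched on every packet, kept in the first cache line. */
    uint32_t state;
    uint32_t flags;
    uint32_t next_seq;
    uint32_t last_ack;
    uint64_t last_seen_ticks;
    void    *next_module_ctx;

    /* Cold fields: used only at setup, teardown or by management code.
     * The alignment pushes them to the start of the next cache line. */
    _Alignas(CACHE_LINE_SIZE) uint64_t created_ticks;
    uint32_t policy_id;
    char     description[32];
};

/* Compile-time checks that the layout is what the comments claim. */
static_assert(offsetof(struct fw_session, next_module_ctx) + sizeof(void *)
                  <= CACHE_LINE_SIZE,
              "hot fields must fit in one cache line");
static_assert(offsetof(struct fw_session, created_ticks) == CACHE_LINE_SIZE,
              "cold fields must start on the second cache line");
```

The alignment requirement applies to the structure as a whole, so statically allocated sessions start on a cache line boundary. Sessions taken from a pool or heap need an allocator that honours the same alignment, otherwise the hot fields may still straddle two lines.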

To use the instruction cache effectively, annotate your branches with the likely/unlikely compiler directives. Compilers will try to lay out the likely() code together.
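A small sketch of how these hints are commonly written, assuming GCC or Clang, where `__builtin_expect` is available. The macro names follow the Linux kernel convention; the `process_packet` function and its checks are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Branch hints built on the GCC/Clang builtin. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

enum verdict { PKT_FORWARD, PKT_DROP };

static enum verdict process_packet(const unsigned char *pkt, size_t len)
{
    /* Error paths are rare: mark them unlikely so the compiler can
     * keep the common path contiguous in the instruction stream. */
    if (unlikely(pkt == NULL || len < 20))
        return PKT_DROP;

    /* The version nibble of a well-formed IPv4 header is 4. */
    if (likely((pkt[0] >> 4) == 4))
        return PKT_FORWARD;

    return PKT_DROP;
}
```

The hints do not change what the code computes, only how the compiler is encouraged to lay it out, so a wrong hint costs performance but not correctness.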

Reduce Cache Thrashing due to Statistics variables

Almost all networking applications update statistics variables. Some are global and some are session context specific. There are two types of statistics variables - increment variables and add variables. Increment variables are typically used to maintain packet counts. Add variables are used to maintain byte counts. Updating these variables requires reading the current value and then applying the add or increment operation. If a variable is updated by multiple threads (with each thread running on a specific core), then every time it is updated, the cached copy of this variable in the other cores is no longer valid. When one of the other cores needs to do the same operation, it first has to fetch the current value again, from the updating core's cache or from DDR, before applying the operation. In the worst case, where packets go to different software threads (hence cores) in round robin fashion, the cache thrashing due to statistics variables is very high and reduces performance dramatically.

Always use 'per core/thread statistics counters'  whenever possible.  Please see this post for more details. 
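A minimal sketch of per-core counters, assuming each thread is bound to one core and knows its core index. `MAX_CORES`, the 64-byte cache line size and the function names are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_CORES       8    /* hypothetical core count */
#define CACHE_LINE_SIZE 64   /* assumption: verify for your core */

/* One slot per core, aligned and padded to a full cache line, so an
 * update on one core never invalidates a line used by another core. */
struct percore_stats {
    _Alignas(CACHE_LINE_SIZE) uint64_t packets;   /* increment variable */
    uint64_t bytes;                               /* add variable */
};

static struct percore_stats stats[MAX_CORES];

/* Packet path: called only by the thread bound to core_id, so plain
 * (non-atomic) updates are sufficient. */
static void stats_update(unsigned core_id, uint32_t pkt_len)
{
    assert(core_id < MAX_CORES);
    stats[core_id].packets += 1;
    stats[core_id].bytes += pkt_len;
}

/* Management path: sum the slots only when a total is requested. */
static uint64_t stats_total_packets(void)
{
    uint64_t total = 0;
    for (unsigned i = 0; i < MAX_CORES; i++)
        total += stats[i].packets;
    return total;
}

static uint64_t stats_total_bytes(void)
{
    uint64_t total = 0;
    for (unsigned i = 0; i < MAX_CORES; i++)
        total += stats[i].bytes;
    return total;
}
```

The cost moves from the packet path to the reader, which has to walk all the slots. Totals read while traffic is flowing are a snapshot and can be slightly stale, which is normally acceptable for statistics.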

Some Multicore SoCs provide a special feature which eliminates the need for 'per core' statistics. These SoCs let software allocate a memory block for statistics and then fire an operation and forget about it. Firing the operation involves specifying the operation type (increment, decrement, add X, subtract Y etc.) and the memory address (32 bit or 64 bit). The SoC performs the operation internally without cache thrashing. I strongly suggest using this feature if it is available in your SoC.

Use LRO/GRO facilities

The performance of many networking applications depends more on the number of packets processed than on the number of bytes processed. Examples: IP Forwarding, Firewall/NAT and IPsec with hardware acceleration. So, reducing the number of packets processed becomes key to improving performance.

LRO/GRO facilities, provided by the operating system in Ethernet drivers or by Multicore SoCs, reduce the number of TCP packets by merging multiple pending packets of the same TCP flow into one larger packet. Since TCP is a byte oriented stream protocol, it does not matter to the application how the bytes were split into packets. Please see this post for more information on the LRO feature in the Linux operating system. If it is supported by your operating system or Multicore SoC, always make use of it.

Process Multiple Packets together

Each packet processing module does a set of operations on the packets/data - such as Search, Process and Pkt out. If the packet goes through multiple modules, many C functions get called. Each invocation of a C function has its own overhead, such as pushing variables onto the stack, initializing local variables etc. Bunching multiple packets of the same flow together reduces the search/pkt out overhead and the overhead associated with the C function calls.

Some Multicore SoCs provide a facility to coalesce packets on a per queue basis with two coalescing parameters - packet threshold and time threshold. The queue does not let the target thread dequeue until one of the conditions is reached - either the number of packets in the queue exceeds the packet threshold, or no packet has been dequeued for the 'time threshold' duration. If this facility is available, ensure that your software dequeues multiple packets together and processes them together.
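The saving from processing a bunch together can be sketched as follows. This is a toy model, not a driver: the packet and session structures, the tiny session table and the `lookups` counter are all made up to show that one function call handles the whole burst and that the session search is repeated only when the flow changes within the burst.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct packet  { uint32_t flow_id; uint32_t len; };
struct session { uint32_t flow_id; uint64_t packets; uint64_t bytes; };

#define MAX_SESSIONS 4
static struct session sessions[MAX_SESSIONS] = {
    { 1, 0, 0 }, { 2, 0, 0 }, { 3, 0, 0 }, { 4, 0, 0 }
};
static unsigned lookups;   /* counts full searches, for illustration only */

/* Stand-in for the real (expensive) session search. */
static struct session *session_search(uint32_t flow_id)
{
    lookups++;
    for (size_t i = 0; i < MAX_SESSIONS; i++)
        if (sessions[i].flow_id == flow_id)
            return &sessions[i];
    return NULL;
}

/* Process a dequeued burst in one call. The search is repeated only
 * when a packet belongs to a different flow than the previous one. */
static void process_burst(const struct packet *pkts, size_t n)
{
    struct session *s = NULL;

    assert(pkts != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (s == NULL || s->flow_id != pkts[i].flow_id)
            s = session_search(pkts[i].flow_id);
        if (s == NULL)
            continue;   /* no session: this sketch simply skips the packet */
        s->packets++;
        s->bytes += pkts[i].len;
    }
}
```

For a burst of four packets where the first three belong to one flow, the search runs twice instead of four times, and the per-call overhead of the module is paid once instead of four times.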

At times, there is no one-to-one correspondence between queues and sessions. In that case, one might argue that the search overhead can't be reduced, as there is no guarantee that the packets in the same queue belong to the same session. Though that is correct, there may still be some improvement due to cache warming if more than one packet in the bunch belongs to the same session.

As a software developer, you should strive for one-to-one correspondence between queues and sessions. This can be done easily among the modules running in software. Some Multicore SoCs provide queues not only for accessing hardware blocks, but also for inter-module communication. Software can take advantage of this to create a one-to-one mapping between queues and the destination module's sessions.

It is true that when packets are read from the Ethernet controllers, there is no way to ensure that a queue holds packets of only one session, as the queue selection is based on the hash value of packet fields. Two different sessions may fall into the same queue. In those cases, as mentioned above, you might not see improvement in the 'search' functionality, but you would still see improvement due to fewer invocations of C functions.

Many Multicore SoCs also accept multiple packets together for acceleration and for sending packets out. This reduces the number of invocations of the acceleration and transmit functions. If this facility is available in your Multicore SoC, take advantage of it.

Eliminate usage of software queues

Some Multicore applications need to send packets/data/control-data to other modules, typically through queues. If multiple threads enqueue to the same queue, mutual exclusion is needed to protect the queue data structure.

Many Multicore SoCs provide queues for software usage. These queues eliminate the need for software queues and hence the mutual exclusion problem, thereby improving performance. Some Multicore SoCs also provide a facility to group multiple queues together into a queue group, which allows the sending application to enqueue items with a priority and the receiving application to dequeue based on priority. These queues can be used even among different processes or virtual machines, as long as shared memory is used for the items that get enqueued and dequeued. Some Multicore SoCs go a step further and provide a 'copy' feature which avoids shared memory, thereby providing good isolation. This feature copies the items from the source process into memory managed internally by the Multicore SoC, and then copies them to the destination process memory as part of the dequeue operation.

Always use this feature if it is available in your Multicore SoC.

Eliminate the usage of Software Free pools 

Networking applications use free pools of memory blocks for memory management. These free pools are used to allocate/free session contexts, buffers etc. Many software threads require these facilities at different times. Software typically maintains the memory pools on a per core basis to avoid mutual exclusion on every allocation. Since different threads may use their pools asymmetrically, allocations can at times fail in one thread even though there are free memory blocks in other threads' pools. To avoid this, software has to do complex operations such as moving memory blocks from one pool to another through global queues. Many Multicore SoCs provide 'free pool' functionality in hardware. Allocation and free can be done by any thread at any time without mutual exclusion operations. Use this facility whenever it is available. It saves some core cycles. More than that, it provides efficient usage of memory blocks.
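For comparison, this is roughly what a software per-core free pool looks like. It is a simplified sketch with hypothetical names: each thread would own one `struct pool`, so allocation and free need no lock, but the blocks are also stranded in that pool, which is the asymmetry problem described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* A free block stores the link to the next free block in its own memory. */
struct free_block { struct free_block *next; };

struct pool {
    void              *arena;       /* backing memory for all blocks */
    struct free_block *head;        /* singly linked list of free blocks */
    size_t             free_count;
};

/* Carve nblocks fixed-size blocks out of one allocation. Returns 0 on success. */
static int pool_init(struct pool *p, size_t block_size, size_t nblocks)
{
    assert(nblocks > 0);
    assert(block_size >= sizeof(struct free_block));
    assert(block_size % sizeof(void *) == 0);   /* keeps blocks pointer-aligned */

    p->head = NULL;
    p->free_count = 0;
    p->arena = malloc(block_size * nblocks);
    if (p->arena == NULL)
        return -1;

    for (size_t i = 0; i < nblocks; i++) {
        struct free_block *b =
            (struct free_block *)((char *)p->arena + i * block_size);
        b->next = p->head;
        p->head = b;
        p->free_count++;
    }
    return 0;
}

/* No lock: only the owning thread touches its pool. NULL when empty. */
static void *pool_alloc(struct pool *p)
{
    struct free_block *b = p->head;
    if (b == NULL)
        return NULL;
    p->head = b->next;
    p->free_count--;
    return b;
}

static void pool_free(struct pool *p, void *blk)
{
    struct free_block *b = blk;
    b->next = p->head;
    p->head = b;
    p->free_count++;
}

static void pool_destroy(struct pool *p)
{
    free(p->arena);
    p->arena = NULL;
    p->head = NULL;
    p->free_count = 0;
}
```

When `pool_alloc` returns NULL here, other threads' pools may still be full of free blocks. Rebalancing between pools is exactly the complex part that a hardware free pool removes.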

Use Multicore SoC acceleration features to improve performance

There are many acceleration features that are available in Multicore SoCs.  Try to take advantage of them.  I classify acceleration functions in Multicore SoCs into three buckets -  Ingress In-flow acceleration,  In-flight acceleration and Egress in-flow acceleration.

Ingress In-flow acceleration: Acceleration functions that are done by Multicore SoCs in hardware on the packets before they are handed over to software are called Ingress In-flow accelerations. Some of the features that I am aware of in Multicore SoCs are:
  • Parsing of Packet fields : Some Multicore SoCs parse the headers and make those fields available to the software along with the packet. Software that needs these fields can skip its own parsing. These SoCs also let software choose the fields to be made available along with the packet. They also provide facilities for software to create a parser to extract fields from proprietary headers or from headers that are not pre-defined. Try to take advantage of this feature.
  • Distribution of packets across threads: This is a basic feature required in Multicore environments. Packets need to be distributed to different software threads. Many Multicore SoCs also ensure that packets belonging to one flow go to only one software thread at any time, so that packets do not get mis-ordered within a flow. As described above, hardware places the packets in multiple queues. The queue is selected based on a hash value calculated on a set of software programmable fields. As a software developer, take advantage of this feature rather than implementing the distribution in software.
  • Packet Integrity checks & Processing offloads: Many Multicore SoCs do quite a few integrity checks and processing offloads on the packet, as listed below. Ensure that your software doesn't do them again, to save some core cycles.
    • IP Checksum verification.
    • TCP, UDP checksum verification.
    • Ensuring that the headers are there in full.
    • Ensuring that the size of the packet is not less than the size indicated in the headers.
    • Checking for invalid field values.
    • IPsec inbound processing.
    • Reassembly of fragments
    • LRO/GRO as described above.
    • Packet coalescing as described above.
    • Many more.
  • Policing :  This feature can police the traffic and reduce the amount of traffic that is seen by the software.  If your software requires policing of some particular traffic to stop cores from getting overwhelmed, this feature can be used rather than doing it in the lowest layers of software.
  • Congestion Management : This feature ensures that the number of buffers used up by the receive hardware does not grow unchecked. Without this feature, cores may not find buffers to send out packets if all buffers are used up by the receiving hardware. This situation typically happens when the cores are doing a lot of processing, and hence are slow in dequeuing, while many more packets keep coming in. Many Multicore SoCs can also generate pause frames in case of congestion.

Egress In-flow acceleration: Acceleration functions that are done in hardware after software hands over the packets for transmission are called Egress In-flow acceleration functions. Some of them are given below. If these are available, take advantage of them in your software, as they can save a significant number of core cycles.
  • Shaping and Scheduling : Many Multicore SoCs let software program the effective bandwidth and then shape the outgoing traffic to this bandwidth. Packets queued by software are scheduled based on their priority, so high priority packets are sent out first within the shaped bandwidth. Some SoCs even provide multiple scheduling algorithms and let software choose the algorithm per physical or logical port. Some SoCs even provide hierarchical scheduling and shaping. Take advantage of this in your software if you require shaping and scheduling of the traffic.
  • Checksum Generation for IP and TCP/UDP transport packets : Checksum generation, especially for locally generated TCP and UDP packets, is very expensive. Use the facilities provided by hardware.
  • IPsec Outbound processing : Some Multicore SoCs provide this functionality in hardware. If you require IPsec processing, use this facility to save a large number of cycles per packet.
  • TCP Segmentation and IP Fragmentation : Some Multicore SoCs provide this functionality. TCP segmentation offload works well for locally generated packets. Use this functionality to get the best out of your Multicore SoC.

In-flight Acceleration : Acceleration functions provided by hardware that can be used during packet processing are called In-flight acceleration functions. Crypto, Crypto with protocol offload, Pattern Matching and XML acceleration are some of the acceleration functions in this category. Here the packet/data is handed over to the hardware acceleration function by software, and software reads the results at a later time when they are ready. Take advantage of these features in your software wherever they are available. Some Multicore SoCs differentiate themselves by doing a lot more in the acceleration functions. For example, some Multicore SoCs do protocol offload along with crypto, such as IPsec ESP, SSL record layer, SRTP and MACsec offloads, which go beyond plain crypto offload.

Many times people ask me how to use the acceleration functions. I detailed this a long time back here. Please see the details there and there.

Software Directed Ingress In-flow accelerations:

As described before, Ingress in-flow acceleration is applied before the packets are given to the software. Packets received on the integrated Ethernet controllers go through this acceleration. But many times software needs this acceleration too. Take the example of IPsec, SSL or any tunneling protocol. Once the software has processed these packets, that is, once it has got hold of the inner packets, it would like ingress in-flow acceleration to be applied to the inner packets, for distribution across cores and for other acceleration functions. To facilitate these kinds of scenarios, some Multicore SoCs provide the concept of an 'offline port', which allows software to reserve offline ports and send traffic through them for ingress in-flow acceleration. Some software features that can take advantage of this are:
  • Tunneled traffic as described above, to let the inner packets go through the ingress in-flow acceleration.
  • IP reassembled traffic - Once the fragments are reassembled, the packet has the full 5-tuple, which can be used to distribute the traffic through an offline port.
  • L2 encapsulated packets - Such as IP packets carried in PPP, FR etc.
  • Ethernet controllers on PCI and traffic from Wireless interfaces : Here the traffic might need to be read by software, as Ingress in-flow acceleration might not be implemented for these interfaces. After getting hold of the packets, software can direct them to the in-flow acceleration functions through offline ports.

Use Multicore core features wherever they are available

Multicore SoCs from different vendors have different core architectures. Some Multicore SoCs are based on PowerPC cores, some on MIPS cores, and Intel Multicore processors are based on x86. Vendors provide different core features to improve the performance of Multicore applications. Whenever they are available, software should make use of them to get the best performance out of the cores. Some of the features that I am aware of are listed below.

Single Instruction & Multiple Data instructions (SIMD)

Multicore SoCs from Freescale and Intel have a SIMD unit in their cores. It allows software to apply a given operation to multiple data elements at once. This kind of parallelism is called 'Data level parallelism'. An 'Add' operation in typical cores is performed on 32 bit or at most 64 bit operands. Current generation SIMD units do this operation on 128 bit operands. They also provide the flexibility to do multiple 16 bit or 32 bit add operations on different parts of the data simultaneously. SIMD greatly helps in arithmetic, bit, copy and compare operations on large amounts of data. Many operations that are done in a loop over independent data elements can be accelerated using SIMD. In the networking world, SIMD is helpful in the following cases:
  • Memory compare, copy,  clear operations.
  • String compare, copy, tokenization and other string operations.
  • WFQ scheduling of QoS, where multiple queues need to be checked to figure out which queues need to be scheduled based on sequence number property of queues.  If the sequence numbers are arranged in array form, then SIMD can be used very effectively.
  • Crypto operations.
  • Big Number arithmetic which is useful in RSA, DSA and DH operations.
  • XML Parsing and schema validations.
  • Search algorithms -  Accelerating compare operation to find matching entry from collision elements in a hash list.
  • Checksum verification and generation: In some cases Ingress and Egress In-flow accelerations can't be used to verify and generate checksums. One example is TCP and UDP packets that arrive in an IPsec tunnel. Since the packets are in encrypted form, the ingress and egress accelerators cannot verify or generate checksums of the inner packets. Packets encapsulated in other tunnels also cannot take advantage of Ingress & Egress in-flow accelerations. Checksum verification and generation then need to be done in software by the cores. SIMD helps tremendously in those cases.
  • CRC verification and generation: These algorithms are not expensive enough to justify In-flight acceleration, yet not cheap for the core to do. SIMD helps in these cases, as it requires no architectural changes to the software and still gives much better performance than cores which don't have SIMD.

Normally, SIMD-capable cores give at least 50% better performance for typical workloads. So, as a software developer, figure out which operations can be improved using SIMD and modify the code to improve the performance of your application.
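To make the checksum case from the list concrete, here is the plain scalar form of the Internet checksum (the one's complement sum defined in RFC 1071) that a core would run on inner packets. This is only the baseline loop, not a SIMD implementation; the function name is an assumption for the example. A SIMD version would accumulate several 16 bit words per instruction and fold the partial sums at the end.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Internet checksum over a byte buffer (RFC 1071). The data is treated
 * as big-endian 16 bit words. The result is returned as a number, so
 * 0x220d means the bytes 0x22 0x0d go on the wire.
 */
static uint16_t inet_checksum(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint64_t sum = 0;   /* wide accumulator: no overflow for any real packet */

    assert(data != NULL || len == 0);

    while (len > 1) {
        sum += ((uint32_t)p[0] << 8) | p[1];
        p += 2;
        len -= 2;
    }
    if (len == 1)
        sum += (uint32_t)p[0] << 8;   /* pad the odd trailing byte with zero */

    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)~sum;
}
```

Verification uses the same routine: running it over data that already contains a correct checksum yields 0.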

Speculative Hardware Data Prefetching & Software Directed Prefetching

This feature fetches the cache line following the current memory access, in the hope that software will use the next line of memory. Many cores allow enabling and disabling this at run time. Software can take advantage of this while doing memory copy, set and compare operations. Any data that is arranged linearly in memory (such as arrays) can get a good performance boost from this feature. Note that if this feature is not used selectively and carefully, it might even degrade performance. Be careful in using this feature.

Many cores also provide a special instruction to warm the cache for a given memory address. Software developers know the kind of processing that comes next (the next module), and many times the next module's session context is also known. In those cases, software can be written in such a way that the next module's session context is prefetched while packet processing happens in the current module. When the next module gets control of the packet, it already has its session context in the cache, which avoids fetching it from DDR serially. My experience is that software directed prefetching gives very good results. It also ensures that performance does not go down even with a large number of sessions.
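A sketch of this pattern, assuming GCC or Clang, where `__builtin_prefetch(addr, rw, locality)` emits the core's prefetch instruction if one exists. The two context structures and the module functions are hypothetical stand-ins for real modules.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical session contexts of two consecutive modules. */
struct next_ctx { uint64_t packets; };
struct cur_ctx  { uint64_t packets; struct next_ctx *next; };

static void next_module_process(struct next_ctx *n)
{
    n->packets++;   /* ideally a cache hit thanks to the earlier prefetch */
}

static void cur_module_process(struct cur_ctx *c)
{
    assert(c != NULL);

    /* Start pulling the next module's context into the cache now.
     * Arguments: address, 1 = we intend to write, 3 = keep in all
     * cache levels. The call is only a hint and never faults. */
    if (c->next != NULL)
        __builtin_prefetch(c->next, 1, 3);

    /* The current module's real work goes here; it overlaps with the
     * memory fetch started above. */
    c->packets++;

    if (c->next != NULL)
        next_module_process(c->next);
}
```

The prefetch pays off only if the current module has enough work to hide the memory latency, so it should be issued as early as the next context's address is known.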

Some Multicore SoCs support cache warming on incoming packets. As part of making packets ready for the software, these SoCs warm the cache with some part of the packet content, annotation data containing the parsed fields, and software-supplied context data. When the software dequeues the packet, most of the information that the receiving module needs to process the packet is already in the cache, thereby avoiding on-demand DDR access. Software can program its context on a per queue basis. Note that this feature is useful only for the first module that receives the packet. Actually that is good enough, as this module can prefetch the next module's context while the packet is being processed in the current module. As long as each module does this, there is no performance degradation even with a high session capacity.

As described before, hardware queues may not have one-to-one correspondence with the receiving module's session contexts. A queue might hold packets for multiple session contexts. Many times, software maintains the sessions in a hash table with a large number of hash buckets. All colliding sessions in a bucket are arranged in a linked list or RB tree. Software can ensure that there are as many queues as hash buckets and program the first collision element as the queue's context. If the matching context is not the one that was programmed, one might not get the full benefit of cache warming by the hardware. But if there are 4 collision elements and traffic is spread equally across these four, cache warming helps 25% of the time. Some software developers might even store the collision elements in an array and program the array to the queue.

Software directed prefetch works very well as long as there is one-to-one correspondence between the current module's session context and the next module's session context. In this case, the current module's session context can cache a reference to the next module's session context and use it for the prefetch operation. This scheme also works fine if the next module's context is a superset of multiple current module contexts. But it does not work well if the next module's context is of finer granularity. Example: an IPsec SA carries packets from multiple firewall/NAT sessions. In this case, the 'Software Directed Ingress In-flow acceleration' method can be used to direct the hardware to send the packet to the next module. This method not only provides cache warming, but also distributes the processing across multiple cores.

Hardware Page Table walk:

Some cores provide a nested hardware page table walk to find the physical address for a given virtual address. This is really useful for user space applications in Linux-like operating systems. Taking advantage of the hardware page table walk is the job of the operating system vendor. Unfortunately, many OS vendors are not taking advantage of this feature. As a software developer, if your Multicore SoC provides this feature, don't forget to ask your OS vendor to use it. This helps ensure that your performance does not go down when you move your application from a Bare-metal environment (where the TLB entries are fixed and no page walk is required) to Linux user space.

I hope it helps.
