Tuesday, September 28, 2010

Look-aside acceleration & Application Usage scenarios

Performance and flexibility are two factors that influence how applications use look-aside accelerators.  As described in an earlier post, applications use accelerators in synchronous or asynchronous fashion.  In this post, I give my view of the different types of applications and how they use look-aside accelerators.

I assume that all applications run in Linux user space, and that they use the HW accelerators by memory-mapping the accelerator registers into user space.  Based on these assumptions, I categorize applications into the following types:

  • Per-packet processing applications with a dedicated core and HW polling mode:  In this type, the application runs in a user process, and a core or set of cores is dedicated to that process; that is, these cores are not used for anything other than executing this process.  Since the cores are dedicated, they can busy-wait on a HW interface until an event is ready to be processed.  In this mode, the Multicore hardware is expected to provide a single interface on which to wait for events.  The application waits in a loop forever; when an event is ready, it takes action based on the event type and then goes back to waiting for new events.  This type is most suitable for per-packet processing applications such as IP forwarding, Firewall/NAT, IPsec, MACSec, SRTP etc.
    • Per-packet processing applications use look-aside accelerators in asynchronous fashion.  Incoming packets from Ethernet or other L2 interfaces and the results from the look-aside accelerators are delivered through the same common HW interface.
    • A typical flow looks like this:  when an incoming packet is ready on an Ethernet port, the polling function returns with a 'New packet' event.  The user process works on the packet and, at some point, decides it needs to go to the HW accelerator; it submits the packet to the accelerator and goes back to polling.  At some later time, the accelerator returns the result through the same HW interface.  When the polling function returns with an 'Acceleration result' event, the user process handles the result and may send the packet out on some other Ethernet port.  More packets may well be processed by the user process before the acceleration result for earlier packets comes back.  Due to this asynchronous operation, the cores are well utilized and system throughput is very good.
    • IPsec, MACSec and SRTP use Crypto algorithms in asynchronous fashion.
    • PPP and IPsec IPCOMP use compression/decompression accelerators in asynchronous fashion.
    • Some portions of DPI use Pattern Matching acceleration in asynchronous fashion.
  • Per-packet processing applications with a non-dedicated core and SW polling mode:  This is similar to the type above, except that the core(s) are not dedicated to the user process.  Hence HW polling is not used, as it would stop the core from relinquishing control often enough to do other work.  Instead, SW polling is used, typically via the epoll() call, with interrupts delivered through the UIO facilities provided by Linux.  When an interrupt is raised because a packet or an accelerator result is ready, UIO wakes up the epoll() call in user space.  When epoll() returns, the process reads the event from the HW interface and executes a different function based on the event type.
    • All per-packet processing applications such as IPsec, SRTP, MACSec, Firewall/NAT can also work in this fashion.
    • IPsec, MACSec and SRTP use Crypto algorithms in asynchronous fashion.
    • PPP and IPsec IPCOMP use compression/decompression accelerators in asynchronous fashion.
    • Some portions of DPI use Pattern Matching acceleration in asynchronous fashion.
  • Stream-based applications:  Stream-based applications normally work at a higher level, away from packet reception and transmission.  For example, proxies/servers work on BSD sockets - the data they receive is TCP data, not individual packets.  A crypto file system is another kind of stream application; it works on data, not on packets.  These applications collect data from several packets, and sometimes this data gets transformed, for example packet data gets decoded into some other form.  HW accelerators are used on top of this data, in almost all cases in synchronous fashion.  In this type of application, synchronous mode is used in two ways - waiting for the result in a tight loop without relinquishing control, or waiting for the result in a loop that yields to the operating system.  The first sub-mode (tight-loop mode) is used when the HW acceleration function takes very little time; the second (yield mode) is used when the acceleration function takes long.
    • Public Key acceleration such as RSA sign/verify, RSA encrypt/decrypt, DH operations and DSA sign/verify works in yield mode, as these operations take a significant number of cycles.  Applications that require this acceleration include IKEv1/v2, SSL/TLS based applications, EAP servers etc.
    • Symmetric Cryptography such as AES in its different modes, hashing algorithms and PRF algorithms would be used in the tight-loop sub-mode, as these operations take fewer cycles.  Note that yielding might cost anywhere between 20,000 and 200,000 cycles depending on the number of other ready processes, and that is not acceptable latency for these operations.  Applications include those based on SSL/TLS, IKEv1/v2, EAP servers etc.
    • I would put compression/decompression HW accelerator usage in a slightly different sub-mode.  For each context, compression/decompression works in this fashion:
      • The software thread issues the operation.
      • It immediately reads any pending result (from previous operations).  Note that the thread does not wait for the result.
      • It works on the result if one is available.
      • The above steps repeat in a loop until there is no more input data.
      • At the end, it waits (in yield mode) until all the results have been returned by the accelerator.
    • Applications that can use compression accelerators:  HTTP Proxy, HTTP Server, Crypto FS, WAN optimization etc.
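
The asynchronous per-packet flow of the first type can be sketched as an event loop.  This is a minimal illustration in Python, not a real driver API: the single HW event interface and the accelerator are simulated with queues, and names such as wait_for_event() and accel_submit() are my own placeholders.

```python
from collections import deque

events = deque()          # stands in for the single HW event interface
accel_backlog = deque()   # results the simulated accelerator returns later

def accel_submit(pkt):
    # Fire-and-forget: the result shows up later as an ACCEL_RESULT event.
    accel_backlog.append(("ACCEL_RESULT", pkt + "-encrypted"))

def wait_for_event():
    # A real implementation would poll a HW portal; here we drain the queues.
    if events:
        return events.popleft()
    if accel_backlog:
        return accel_backlog.popleft()
    return None

def run(incoming):
    sent = []
    for pkt in incoming:
        events.append(("NEW_PACKET", pkt))
    while True:
        ev = wait_for_event()
        if ev is None:
            break
        kind, data = ev
        if kind == "NEW_PACKET":
            accel_submit(data)   # hand off to the look-aside accelerator
        else:
            sent.append(data)    # 'Acceleration result': send on egress port
    return sent
```

Note how both packets are submitted before the first result is consumed, which is exactly why the asynchronous mode keeps the core busy and the throughput high.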
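
The SW polling mode of the second type can be sketched with the real Linux epoll() call.  A UIO driver would normally expose an interrupt file descriptor such as /dev/uioN; in this illustration a pipe stands in for it, and the one-byte event codes are my own invention.

```python
import os
import select

def demo():
    # A pipe plays the role of the UIO interrupt file descriptor.
    r, w = os.pipe()
    ep = select.epoll()
    ep.register(r, select.EPOLLIN)
    os.write(w, b"N")                 # pretend the UIO interrupt fired
    handled = []
    for fd, _ in ep.poll(timeout=1):  # sleeps in the kernel; core is free
        ev_type = os.read(fd, 1)      # read the event from the HW interface
        if ev_type == b"N":
            handled.append("new_packet")
        elif ev_type == b"R":
            handled.append("accel_result")
    ep.close()
    os.close(r)
    os.close(w)
    return handled
```

Unlike the HW polling loop, the core gives control back to the scheduler inside epoll(), so other processes can run while no events are pending (Linux-specific: select.epoll is only available there).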
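
The two synchronous wait styles for stream-based applications differ only in what the loop does between checks.  A sketch, where accel_done() is a stand-in for reading a completion flag from a memory-mapped accelerator register:

```python
import time

def wait_tight(accel_done):
    # Busy-wait: suitable when the operation (e.g. one AES block or hash)
    # finishes in far fewer cycles than a context switch would cost.
    while not accel_done():
        pass

def wait_yield(accel_done):
    # Yield to the OS between checks: suitable for long operations such as
    # RSA/DH public-key math, where the 20,000-200,000 cycle cost of being
    # rescheduled is dwarfed by the operation itself.
    while not accel_done():
        time.sleep(0)   # relinquish the CPU, sched_yield-style
```

The choice is purely a latency trade-off: tight-loop mode burns cycles to avoid scheduler latency, yield mode accepts scheduler latency to free the core.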
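
The pipelined compression sub-mode above (issue, poll for earlier results without waiting, drain at the end) can be sketched as follows.  The deque models the accelerator's in-flight queue with a one-operation latency; the compress function is a trivial stand-in.

```python
from collections import deque

def compress_stream(chunks, compress=lambda c: c.lower()):
    in_flight = deque()   # operations issued but not yet collected
    out = []
    for chunk in chunks:
        in_flight.append(compress(chunk))  # issue the operation
        if len(in_flight) > 1:             # immediately poll: pick up any
            out.append(in_flight.popleft())  # result from an earlier issue
    while in_flight:                       # end of input: wait (yield mode)
        out.append(in_flight.popleft())    # until all results are returned
    return out
```

The point of the pattern is that the thread never blocks while input remains, so the accelerator and the core stay busy at the same time.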
 Any comments?