Sunday, September 28, 2008

Look Aside Accelerators Versus In-Core Acceleration

The majority of Multicore processor vendors implement their acceleration functions as look-aside accelerators. Some vendors, such as Cavium and Intel, implement some functions in-core and others as look-aside accelerators. Cryptography is one function that Cavium and Intel implement in-core, while other vendors provide it as a look-aside accelerator. Other acceleration functions, such as compression/decompression and regular expression search, are provided as look-aside accelerators by most vendors, so I will not discuss them here. I will concentrate mostly on crypto acceleration.

How does software use them?

Software uses the accelerators in two ways: synchronously and asynchronously. In synchronous usage, the software thread issues a request to the accelerator and waits for the result. That is, it uses the accelerator like a C function: by the time the function returns, the result is available. In asynchronous usage, the software thread issues the request and goes off to do something else. Once the result is ready, the thread picks it up and does the rest of the processing that depends on it. The result is indicated to the thread in one of two ways:
  • If the thread is polling for events, the result is read via polling. Many Multicore processors provide a facility for software to listen for external events through a single HW interface. Incoming packets from Ethernet controllers, results from look-aside accelerators and other external events are all delivered through this common interface. A run-to-completion program waits for an event from the common interface and takes action based on the type of event received.
  • If the thread is not polling the HW interface, the events are notified to the thread via interrupts.
The point is that asynchronous usage of look-aside crypto acceleration allows the thread to do something else while the accelerator is doing its job.
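To make the two usage models concrete, here is a minimal sketch in Python. It is only a simulation under my own assumptions: `LookAsideAccelerator`, `encrypt_sync` and `run_async` are hypothetical names, time is counted in abstract ticks, and the XOR "cipher" is a placeholder rather than real cryptography. It does not correspond to any vendor's driver API.

```python
from collections import deque


class LookAsideAccelerator:
    """Toy stand-in for a look-aside crypto engine (hypothetical, not a real driver)."""

    def __init__(self, latency_ticks=3):
        self.latency_ticks = latency_ticks
        self.now = 0
        self._in_flight = deque()  # entries: (ready_tick, request_id, result)

    def submit(self, request_id, data):
        # Placeholder transform standing in for encryption.
        result = bytes(b ^ 0x5A for b in data)
        self._in_flight.append((self.now + self.latency_ticks, request_id, result))

    def tick(self):
        # Advance simulated time by one unit.
        self.now += 1

    def poll(self):
        """Return one finished (request_id, result), or None if nothing is ready."""
        if self._in_flight and self._in_flight[0][0] <= self.now:
            _, request_id, result = self._in_flight.popleft()
            return request_id, result
        return None


def encrypt_sync(acc, request_id, data):
    """Synchronous use: submit, then wait until the result is back."""
    acc.submit(request_id, data)
    wasted_ticks = 0
    while True:
        done = acc.poll()
        if done is not None:
            return done[1], wasted_ticks
        acc.tick()
        wasted_ticks += 1  # the core does nothing useful during this tick


def run_async(acc, packets):
    """Asynchronous use: a run-to-completion loop that handles crypto results
    and new packets as events from one common source."""
    pending = deque(enumerate(packets))
    results = {}
    idle_ticks = 0
    while len(results) < len(packets):
        done = acc.poll()
        if done is not None:
            results[done[0]] = done[1]      # finish processing a completed request
        elif pending:
            request_id, data = pending.popleft()
            acc.submit(request_id, data)    # start a new request instead of waiting
        else:
            idle_ticks += 1                 # nothing to do in this tick
        acc.tick()
    return results, idle_ticks
```

With four packets and a latency of three ticks, the synchronous path waits three ticks for every packet, while the asynchronous loop spends most of those ticks submitting further requests and is idle far less often.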

Look-aside accelerators can be used either synchronously or asynchronously.  In-core accelerators are always used synchronously.

Which applications use the accelerators synchronously, and which asynchronously?

If the performance were the same, software would prefer to use any accelerator synchronously, because that is the simpler programming model. But asynchronous usage gives the best performance for per-packet processing applications. Before going further, let us look at some of the applications that use cryptography.

Applications such as IPsec, MACSec and SRTP would use crypto accelerators asynchronously, since they work on a per-packet basis and are simple to change to take advantage of look-aside accelerators in asynchronous fashion.

SSL and DTLS based applications, in my view, would always use accelerators in synchronous fashion. SSL and DTLS are libraries, not applications by themselves. Applications such as HTTP, SMTP and POP3 servers and proxies use SSL internally. To use accelerators in asynchronous fashion, changes are required not only in the SSL/DTLS library but also, in a major way, in the applications that sit on top of it. When those applications are developed in high-level languages, it becomes nearly impossible to make such changes.

What are the advantages of look-aside accelerators?

Look-aside accelerators give software applications the option of using the accelerators in asynchronous fashion. Any algorithm that takes a significant number of cycles is better used in look-aside fashion, and many crypto algorithms fall into this category. When the accelerator is used asynchronously, the core is not idling while it waits for the result. The core can do something else, thereby improving overall system throughput. As described above, per-packet processing applications such as IPsec, MACSec, PDCP and SRTP work very well in this mode. Many software applications that used software crypto algorithms are being changed, or have already been changed, by their developers to use look-aside accelerators asynchronously.

Another big advantage concerns the processing of high-priority traffic in the poll-mode model, where the core spins waiting for incoming events such as packets. If the application uses the crypto accelerator in synchronous mode, the core does nothing for a long time, on the order of tens of microseconds. For a 1500-byte packet, it might take around 10 microseconds to encrypt or decrypt the data, and the core does nothing for those 10 microseconds. For a jumbo frame of 9K bytes, the core may be doing nothing for 60 microseconds. If any high-priority traffic such as PTP (Precision Time Protocol) arrives during this time, it does not get processed for up to 60 microseconds. If crypto is used in asynchronous fashion, the high-priority traffic is processed promptly, because the core does not babysit the crypto operation. It also reduces jitter: irrespective of the size of the packet being encrypted or decrypted, high-priority traffic is processed within the same time.
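The arithmetic behind those figures can be sketched as follows. The only number taken from the text is roughly 10 microseconds for 1500 bytes; the linear scaling with packet size, the 9000-byte jumbo frame size and the `submit_overhead_us` value are my assumptions for illustration, not measurements from any processor.

```python
# Assumption taken from the figures above: about 10 microseconds for 1500 bytes.
BYTES_PER_MICROSECOND = 1500 / 10


def crypto_time_us(packet_bytes):
    """Estimated accelerator time for one packet, in microseconds,
    assuming the time scales linearly with packet size."""
    return packet_bytes / BYTES_PER_MICROSECOND


def worst_case_wait_us(packet_bytes, synchronous, submit_overhead_us=1.0):
    """Worst-case delay for a high-priority packet that arrives just after
    the core starts a crypto request for a packet of packet_bytes.

    submit_overhead_us is a hypothetical cost of queuing a request.
    """
    if synchronous:
        # The core is stuck until the whole crypto operation completes.
        return crypto_time_us(packet_bytes)
    # The core only has to finish submitting the request.
    return submit_overhead_us


for size in (1500, 9000):
    print(size, worst_case_wait_us(size, True), worst_case_wait_us(size, False))
```

Under these assumptions the synchronous wait grows from 10 to 60 microseconds with the packet size, while the asynchronous wait stays constant, which is the jitter argument in numbers.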

Having said that, SSL/TLS based applications use an SSL library (for example, OpenSSL). Since the SSL library works in synchronous fashion, look-aside accelerators will be used synchronously. High-priority traffic is not an issue here, because SSL based applications typically run in user space on Linux and use software threads. Even if one thread is waiting for a result from the crypto accelerator, other software threads are scheduled by the operating system, so the core is well utilized and high-priority traffic can be processed by those other threads. Often, though, the waiting thread has to wait in a tight loop, watching for a change in the state variable that indicates the result is ready. In those cases other software threads may not be scheduled very well, and Multicore processors with hardware SMT (Simultaneous Multi Threading) work well.

What are the advantages of in-core Crypto?

In-core crypto is normally faster than look-aside crypto when used in synchronous fashion, so it is good for SSL kinds of applications. It is not as good for per-packet processing applications. In-core crypto has other advantages. Since the data is sent to the in-core accelerator via core registers (normally big registers, 128 or 256 bits wide), it works with virtual memory and is therefore very suitable for user space applications. Another advantage is that in-core acceleration can be used in virtual machines without worrying about whether the host operating system exposes the accelerator drivers efficiently. Since these are instructions like any other core instructions, they work without any additional effort. And since in-core crypto behaves just like software crypto, software applications can be ported to it without major changes to the SW architecture.

In-core crypto has some disadvantages too. If the cores are divided across multiple applications and some of those applications don't require crypto acceleration, the in-core crypto units on their cores go unused and the resulting system throughput is lower. As indicated above, in-core crypto acceleration does not help with high-priority traffic, because it can only be used synchronously. For per-packet processing applications, the performance of in-core crypto would be less than that of look-aside crypto used in asynchronous fashion.
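A rough per-core throughput model illustrates the per-packet comparison. Everything here is a hypothetical sketch: the function names, the microsecond figures and the idea of an accelerator with several parallel engines are assumptions of mine, chosen only to show the shape of the trade-off.

```python
def pps_in_core_sync(sw_us, crypto_us):
    """Packets per second for one core when the core itself spends
    crypto_us on crypto for every packet."""
    return 1e6 / (sw_us + crypto_us)


def pps_look_aside_async(sw_us, submit_us, accel_crypto_us, engines=1):
    """Packets per second for one core when crypto is offloaded.

    The core pays only sw_us plus a submit/collect overhead per packet;
    throughput is capped by whichever is slower, the core or the accelerator.
    """
    core_rate = 1e6 / (sw_us + submit_us)
    accel_rate = engines * 1e6 / accel_crypto_us
    return min(core_rate, accel_rate)


# Hypothetical figures: 2 us of packet processing, 8 us of in-core crypto,
# 0.5 us to submit and collect a request, 10 us per request on each of
# 4 accelerator engines.
print(pps_in_core_sync(2.0, 8.0))
print(pps_look_aside_async(2.0, 0.5, 10.0, engines=4))
```

With these made-up numbers the asynchronous look-aside path handles four times as many packets per core, even though each individual crypto operation is slower on the accelerator. The model also shows the limit of the argument: with too few engines, the accelerator rather than the core becomes the bottleneck.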

My take:

Since Multicore processors are expected to be used in many scenarios, the acceleration functions need to be designed in such a way that they can serve both per-packet processing applications and stream-based applications (such as SSL). If the performance of look-aside crypto acceleration in synchronous mode is made similar to that of in-core crypto acceleration, then look-aside acceleration is the preferable choice. As I described in my earlier post, if look-aside accelerators are enhanced with a v-to-p kind of broker hardware module that understands the virtual address space, then I believe it is possible to bring the performance of synchronous look-aside acceleration very close to that of in-core acceleration.
