Saturday, August 30, 2008

Are Multicore processors Linux friendly?

Multicore processors have been the trend for the past three years.  Intel, Cavium and RMI all offer Multicore processors.

But are they Linux friendly?

Since Linux has SMP functionality, they are in theory Linux friendly.  And to a large extent they are, as long as the Kernel controls the hardware accelerators and Ethernet ports.

There are many reasons why Network device vendors don't like to deal with the hardware from Kernel space. Let us take two accelerators as examples - the Regex accelerator and the Crypto accelerator.  Device vendors providing security functionality mostly implement it in Linux user space daemons, for several reasons such as:
  • User space programming is easier to debug.
  • Some security functions require proxies which work on top of socket interface - hence user space applications.
  • More memory can be accessed from user space.
  • User space applications can take advantage of swap.
An application is not limited to one user space daemon. A typical network infrastructure device may run many user space daemons. There may also be processes that are created dynamically to take on more load.  Let us take UTM as an example:
  • IDS/IPS - One user space process, multiple threads.
  • Network AV - HTTP/S Proxy -  One user space process, multiple threads.
  • Network AV - SMTP/S proxy - One user space process, multiple threads.
  • Network AV - POP3/S Proxy - One user space process, multiple threads.
  • Network AV - IMAP/S proxy - One user space process, multiple threads.
  • ClamAV or any equivalent Anti Virus package -  Multiple user processes created dynamically at run time.
  • SpamAssassin or equivalent Anti Spam package - Multiple user processes.
  • IPsec :  Kernel level function.
Crypto acceleration is required by:
  • Network AV - HTTP/S proxy, SMTP/S proxy, POP3/S proxy and IMAP/S proxy.
  • IPsec in kernel space.
Regex acceleration is required by:
  • ClamAV daemons.
  • SpamAssassin daemon.
  • IDS/IPS daemon
  • Content Security daemon (HTTP Proxy).
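To make the sharing problem concrete, the two lists above can be written down as a small table and queried. This is only an illustrative sketch: the dictionary layout and the `sharers` helper are made up for this example, and the entries are transcribed from the lists above.

```python
# Which accelerators each UTM component needs, transcribed from the lists above.
# "space" records whether the component runs in user space or in the kernel.
UTM_COMPONENTS = {
    "IDS/IPS daemon": {"space": "user", "needs": {"regex"}},
    "HTTP/S proxy": {"space": "user", "needs": {"crypto"}},
    "SMTP/S proxy": {"space": "user", "needs": {"crypto"}},
    "POP3/S proxy": {"space": "user", "needs": {"crypto"}},
    "IMAP/S proxy": {"space": "user", "needs": {"crypto"}},
    "Content Security daemon (HTTP proxy)": {"space": "user", "needs": {"regex"}},
    "ClamAV daemons": {"space": "user", "needs": {"regex"}},
    "SpamAssassin daemon": {"space": "user", "needs": {"regex"}},
    "IPsec": {"space": "kernel", "needs": {"crypto"}},
}


def sharers(accelerator):
    """Return the names of all components that need the given accelerator."""
    return sorted(
        name for name, info in UTM_COMPONENTS.items()
        if accelerator in info["needs"]
    )


for accel in ("crypto", "regex"):
    print(accel, "is shared by", len(sharers(accel)), "components:", sharers(accel))
```

Even in this small example, each accelerator has several independent consumers, and the crypto accelerator is needed from both user space and kernel space.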
To improve performance and also to have isolation, many vendors would like to deal with the hardware directly from user space, without the Kernel de-multiplexing requests and responses. That means the accelerator device needs to be shared by multiple daemons, with each daemon seeing its own copy of the accelerator.   If one process dies, it should not affect the other daemons using the device.

Unfortunately, today's Multicore processors don't have that capability.  I hope new Multicore processors will have it.

Let us look at what software expects from hardware accelerator devices.
  • The accelerator device should be instantiable.
  • Each instance should be memory mappable by the appropriate user space daemon.
  • Only the owning process/thread should be able to submit requests and retrieve responses.
  • Each instance of the hardware device should have its own interrupt, and this interrupt line should wake up the appropriate thread.
  • When a user process dies, it should not affect other processes using different instances of the same device.
  • When a user process dies, software should be able to stop that process's instance of the device.
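The expectations above can be sketched as a small software model. This is a toy simulation, not a driver and not any vendor's API: the `Accelerator` and `Instance` classes, their method names and the upper-casing "job" are all hypothetical, and memory mapping and per-instance interrupts are not modelled. It only shows the ownership and isolation rules.

```python
from collections import deque


class Instance:
    """One virtual copy of the accelerator, owned by a single process."""

    def __init__(self, owner_pid):
        self.owner_pid = owner_pid
        self.requests = deque()
        self.responses = deque()


class Accelerator:
    """Hypothetical accelerator that can be split into independent instances."""

    def __init__(self, max_instances):
        self.max_instances = max_instances
        self.instances = {}  # instance id -> Instance
        self.next_id = 0

    def open_instance(self, pid):
        if len(self.instances) >= self.max_instances:
            raise RuntimeError("no free accelerator instance")
        instance_id = self.next_id
        self.next_id += 1
        self.instances[instance_id] = Instance(pid)
        return instance_id

    def _owned(self, instance_id, pid):
        instance = self.instances.get(instance_id)
        if instance is None:
            raise LookupError("instance is stopped or was never opened")
        if instance.owner_pid != pid:
            raise PermissionError("only the owning process may use this instance")
        return instance

    def submit(self, instance_id, pid, job):
        self._owned(instance_id, pid).requests.append(job)

    def collect(self, instance_id, pid):
        instance = self._owned(instance_id, pid)
        results = list(instance.responses)
        instance.responses.clear()
        return results

    def run(self):
        # Stand-in for the hardware: "processing" a job just upper-cases it.
        for instance in self.instances.values():
            while instance.requests:
                instance.responses.append(instance.requests.popleft().upper())

    def process_died(self, pid):
        # Stop only the instances owned by the dead process.
        dead = [i for i, inst in self.instances.items() if inst.owner_pid == pid]
        for instance_id in dead:
            del self.instances[instance_id]


regex = Accelerator(max_instances=4)
ids = regex.open_instance(pid=100)      # e.g. the IDS/IPS daemon
clamav = regex.open_instance(pid=200)   # e.g. a ClamAV process
regex.submit(ids, 100, "packet payload")
regex.submit(clamav, 200, "file chunk")
regex.process_died(100)                 # the IDS/IPS daemon crashes
regex.run()
survivor_results = regex.collect(clamav, 200)
```

In this model the crash of process 100 removes only its own instance, and the request queued by process 200 still completes. A process that tries to use an instance it does not own gets a `PermissionError`.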
Intel and VIA implemented crypto as instructions, and hence they may not have the above issues. But many Multicore processors implement crypto as asynchronous accelerators, and they will have these issues if they don't support 'instances'.

Comments?

3 comments:

Ravi said...

Cavium Octeon also has crypto at instruction level. So, Octeon does not have this issue.

What about NetLogic XLP and Freescale P4080? Since they don't have crypto instructions, do they have the issues you have detailed in your post?

Ravi said...

Crypto at instruction level works very well in a virtualization environment. No hassles and headaches in providing drivers in kernel space.

I wonder why other Multicore processor vendors don't implement crypto at Core level. What advantages do these other Multicore vendors have?

Srini said...

It is a balance between performance & flexibility.

Asynchronous Crypto accelerators in general give better performance for IPsec, PDCP, MACSec and SRTP. These packet processing modules are in general simple to modify to take advantage of asynchronous acceleration. While the crypto engine executes the crypto algorithm, the processors can do something else. Some Multicore processors, such as the P4080 from Freescale, do more than crypto. They also do protocol offload.

Crypto at instruction level sacrifices performance for flexibility.

My view is that eventually Multicore processors will support both modes - Synchronous & Asynchronous. Synchronous for SSL based applications and Asynchronous for packet processing based applications.

Recently Freescale announced that its future SoCs would have AltiVec SIMD in their cores. See here: http://media.freescale.com/phoenix.zhtml?c=196520&p=irol-newsArticle&ID=1474744.
I think that would provide best of both worlds.