Sunday, May 18, 2008

Proxy-based networking applications - Multicore and developer considerations

Many networking applications, such as anti-virus, anti-spam and IPS to name a few, are implemented as proxies in network infrastructure devices. A proxy terminates the connection from the external client and makes its own connection to the destination server. It receives data from the client and the server and forwards it to the other end after application-specific processing. To ease development, proxies are usually implemented on top of sockets in user space. Each client-to-server connection therefore uses two socket descriptors: one created when the proxy accepts the connection from the client, and another created when the proxy connects to the server.

On single-core processors, it is typical to have one process per proxy. The process handles multiple connections using non-blocking sockets, typically with poll() or epoll_wait(). This allows one process to work on many connections at the same time.

Many of the networking applications listed above run their proxies in transparent mode. Transparent proxies require no changes to the end client and server applications, and no changes to the DNS server configuration. A system with transparent proxies intercepts the packets from both clients and servers. The forwarding layer of the system (of Linux, for example) is expected to intercept the packets and redirect them to the proxies. Redirection typically works by overwriting the IP addresses and TCP/UDP ports of each packet so that the packets reach the proxies running in user space, without developers having to change the TCP, UDP or any other stack component.
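
Once the addresses are rewritten, the proxy still has to find out which server the client was really trying to reach. The sketch below shows one way this can be done, assuming the redirection is performed by Linux netfilter NAT (something this post does not prescribe) and assuming IPv4 only. The function names are mine, and Python is used only to keep the sketch short.

```python
import socket
import struct

# Value from <linux/netfilter_ipv4.h>; Python's socket module does not export it.
SO_ORIGINAL_DST = 80


def parse_sockaddr_in(raw):
    """Decode a 16-byte struct sockaddr_in into (ip, port)."""
    # Layout: 2 bytes family (skipped), 2 bytes port in network order,
    # 4 bytes IPv4 address, 8 bytes padding.
    port, packed_ip = struct.unpack("!2xH4s8x", raw)
    return socket.inet_ntoa(packed_ip), port


def original_destination(client_sock):
    """Return the (ip, port) the client addressed before NAT rewrote it.

    Assumption: the accepted connection was redirected by netfilter NAT.
    """
    raw = client_sock.getsockopt(socket.SOL_IP, SO_ORIGINAL_DST, 16)
    return parse_sockaddr_in(raw)
```

The proxy would call original_destination() on the socket returned by accept() and use the result as the target of its server-side connection.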

The process skeleton looks something like this:

     Initialization, daemonize and load configuration.
     Create a listening socket.
     while (not terminated)
            Do any timeout processing.
            Wait for ready socket descriptors using poll() or epoll_wait().
            for (all ready socket descriptors)
                 if (listening socket)
                        Accept the connection and create application specific context.
                        Might initiate the server connection.
                        May add socket fds to epoll list.
                 if (socket is ready with new data)
                        Application specific processing();
                        As part of this, the socket fd may get added to the epoll list again.
                 if (socket has space to send more data)
                        Application specific processing() which sends the data.
                        If more data is to be sent, socket might be kept in the epoll list again.
                 if (socket has exception)
                        Application specific processing();
                        Connection may be closed as part of application processing.
     Do any graceful shutdown activities.
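
The skeleton above can be sketched as runnable code. This is a minimal illustration, not a real proxy: to stay short it leaves out the server-side connection and timeout processing, and simply sends the processed data back to the client. The names make_listener, app_process, close_conn and run_once are made up for this example, and Python stands in for C only for brevity.

```python
import select
import socket


def make_listener(host="127.0.0.1", port=0):
    """Create a non-blocking listening socket (port 0 lets the OS pick one)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind((host, port))
    s.listen(128)
    s.setblocking(False)
    return s


def app_process(data):
    """Placeholder for application specific processing (here: pass through)."""
    return data


def close_conn(ep, conns, fd):
    ep.unregister(fd)
    conns.pop(fd)["sock"].close()


def run_once(ep, listener, conns, timeout=1.0):
    """One pass of the loop body; returns the number of events handled."""
    events = ep.poll(timeout)
    for fd, mask in events:
        if fd == listener.fileno():
            try:
                client, _addr = listener.accept()
            except BlockingIOError:
                continue
            client.setblocking(False)
            # Application specific context for this connection.
            conns[client.fileno()] = {"sock": client, "out": b""}
            ep.register(client.fileno(), select.EPOLLIN)
            continue
        ctx = conns.get(fd)
        if ctx is None:
            continue
        if mask & (select.EPOLLHUP | select.EPOLLERR):
            close_conn(ep, conns, fd)  # socket has exception
            continue
        if mask & select.EPOLLIN:  # socket is ready with new data
            data = ctx["sock"].recv(4096)
            if not data:
                close_conn(ep, conns, fd)  # peer closed the connection
                continue
            ctx["out"] += app_process(data)
            ep.modify(fd, select.EPOLLIN | select.EPOLLOUT)
        if mask & select.EPOLLOUT and ctx["out"]:  # space to send more data
            sent = ctx["sock"].send(ctx["out"])
            ctx["out"] = ctx["out"][sent:]
            if not ctx["out"]:
                ep.modify(fd, select.EPOLLIN)  # nothing left to send
    return len(events)
```

A caller would create the listener, register it with select.epoll() for EPOLLIN, and call run_once() in a loop until termination is requested.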

Multicore processors are increasingly used in network infrastructure devices to increase the performance of the solution. Linux SMP is a popular operating system choice among developers. What should be considered when moving to multicore processors?

Usage of POSIX compliant threads in the proxy processes:
Create as many threads as there are cores in the processor. Each thread can be given an affinity to one core. At times one might want to create more threads than cores, to take advantage of asynchronous accelerators such as symmetric and public key crypto acceleration, regular expression search acceleration etc. In those cases a thread might wait for the response from the accelerator. To allow the core to process other connections in the meantime, multiple threads per core are required.
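
As a rough illustration of the thread-per-core idea, the sketch below starts a configurable number of threads per core and pins each one. The helper names are hypothetical. Note that CPython threads do not run Python code in parallel, so this only demonstrates the pinning calls; a C proxy would do the same with pthreads and pthread_setaffinity_np().

```python
import os
import threading


def _pinned(cpu, worker):
    # On Linux, pid 0 in sched_setaffinity() refers to the calling thread,
    # so this pins only the current worker, not the whole process.
    os.sched_setaffinity(0, {cpu})
    worker(cpu)


def start_pinned_workers(worker, threads_per_core=1):
    """Start worker threads, each pinned to one CPU this process may use.

    threads_per_core > 1 models the case where threads block on an
    asynchronous accelerator and another thread should use the core.
    """
    cpus = sorted(os.sched_getaffinity(0))
    threads = []
    for cpu in cpus:
        for _ in range(threads_per_core):
            t = threading.Thread(target=_pinned, args=(cpu, worker), daemon=True)
            t.start()
            threads.append(t)
    return threads
```

Here worker is whatever function runs the per-thread event loop; it receives the CPU number it was pinned to.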

Mapping of thread to the application processing:  

Developers use different techniques. Some use a pipelining model. In this model, one thread gets hold of the packets for all connections and passes them to the next thread in the pipeline for further processing. The last thread in the pipeline sends the packet out on the other socket. Though this can use multiple cores, it may not be the ideal choice for all applications. I feel the run-to-completion model is a good choice for many applications. The run-to-completion model is simple. Each thread waits for connections. Once a connection is accepted, the thread does everything the proxy needs to do for it, including sending out the processed data. The structure is similar to the process model, but the loop is executed by each thread. That is, connections are shared out across the threads, with each thread processing its own set of connections. Advantages of this approach are:
  • Better utilization of dedicated caches in the cores.
  • Few or no mutex operations, as one thread does all the processing.
  • Fewer context switches.
  • Lower latency, as it avoids multiple enqueue/dequeue operations to pass packets from one pipeline stage to another.

Load balancing of the incoming connections to the threads:

There are multiple ways of doing this. One way is to let a master thread accept the connections and hand each one to a worker thread, which does the rest of the processing on that connection. The master thread can use different load balancing techniques to assign a new connection to the least loaded thread. Though this approach is simple and confined to the process, it has some drawbacks. When a large number of short-lived connections are established and terminated in quick succession, the master thread can become the bottleneck. Cache utilization may also suffer, as the master and worker threads might run on different cores. And since the master thread is not expected to do much other than accept connections, a core may not be dedicated to it; that is, the master thread may not be affined to a core.
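
The master-thread approach can be sketched as follows. This is only the bookkeeping side, with "load" assumed to mean the number of active connections per worker; the Dispatcher class and its method names are invented for the example. The lock is there because workers report closed connections from their own threads, which is exactly the kind of shared state this approach brings with it.

```python
import queue
import threading


class Dispatcher:
    """Master-side bookkeeping: give each new connection to the least loaded worker."""

    def __init__(self, n_workers):
        self.queues = [queue.Queue() for _ in range(n_workers)]  # one hand-off queue per worker
        self.active = [0] * n_workers                            # active connections per worker
        self.lock = threading.Lock()

    def assign(self, conn):
        """Called by the master thread after accept(); returns the chosen worker index."""
        with self.lock:
            idx = min(range(len(self.active)), key=self.active.__getitem__)
            self.active[idx] += 1
        self.queues[idx].put(conn)
        return idx

    def closed(self, idx):
        """Called by worker idx when one of its connections is closed."""
        with self.lock:
            self.active[idx] -= 1
```

Each worker thread would take connections from its own queue and add them to its own epoll list.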

Another technique is to let each thread listen on its own socket, accept connections and process them. By default, only one listening socket can be bound to a given IP address and port combination. So this technique uses as many ports as there are threads: each thread listens on a socket bound to a unique port. External clients should not need to know about this; client connections to the server always use the one standard port. Hence this technique requires an additional feature in the intercept layer. The intercept layer (typically in the kernel in the case of Linux) is already expected to do IP address translation to ensure that packets are redirected to the proxies. In addition, it can do port translation too. The port to translate to can be chosen based on the load on each port. For example, port selection can be round-robin, or it can be based on the least number of connections on each port.
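
To make the port translation idea concrete, here is a small user-space model of what the intercept layer would do with round-robin selection. It is only a model: the real logic would live in the kernel, and the class name, the flow key and the port numbers are all assumptions made for illustration. The point it shows is that the port is chosen once per connection and then remembered, so every later packet of that connection reaches the same thread.

```python
import itertools


class PortTranslator:
    """Model of an intercept layer mapping each new flow to one worker's port."""

    def __init__(self, worker_ports):
        self._next_port = itertools.cycle(worker_ports)  # round-robin selection
        self.flows = {}  # (client_ip, client_port) -> chosen worker port

    def translate(self, client_ip, client_port):
        """Return the destination port to rewrite this packet to."""
        key = (client_ip, client_port)
        if key not in self.flows:
            # First packet of a new connection: pick the next worker port.
            self.flows[key] = next(self._next_port)
        return self.flows[key]

    def connection_closed(self, client_ip, client_port):
        """Forget the mapping once the connection is gone."""
        self.flows.pop((client_ip, client_port), None)
```

For example, with worker ports 9001, 9002 and 9003, the first three new connections would be sent to those ports in order and the fourth would go to 9001 again.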

Do all relevant application processing in one proxy:

Network infrastructure devices are complex. At times, several kinds of application processing are required on the same connection. For example, on an HTTP connection the device may be expected to do HTTP acceleration (such as compression), TCP acceleration and attack checks. If these are implemented as proxies in different processes, latency increases considerably, because each proxy terminates the connection and makes a new connection to the next proxy. Overall system performance also goes down. Separate proxies do have one advantage, which is maintainability of the code. But performance-wise, it is better to do all application processing in a single process/thread context.

Considerations for choosing Multicore processor

Cost is certainly a factor. Besides cost, other things to look for are:
  • Frequency of the core is very important: As discussed above, a connection is handled by one thread. Since a thread can execute on only one core at any time, the performance of a single connection scales with the core frequency (speed). For proxy based applications, higher frequency cores are a better choice than a larger number of low powered cores.
  • Cache: Proxy based applications typically do a lot more processing than typical per-packet applications such as firewall, IPsec VPN etc. With a larger cache, more of the instruction memory can be cached. So the larger the cache, the better the performance.
  • Division of cache across cores: Since threads can be affined to cores, it is good to ensure that cores do not compete for the same data cache. Any facility to divide the shared cache into core-specific portions would be preferable.
  • Memory mapping of accelerator devices into the process virtual memory: With access to the hardware accelerator from user space, one can avoid memory copies between user space and kernel space.
  • Hardware based connection distribution across cores: This ensures that the traffic is distributed across cores. The intercepting software in the kernel forwarding layer then need not make any load balancing decisions to distribute the traffic across threads. The intercept layer only needs to translate the port so that the packets go to the right thread.

Other important considerations that apply to any application are:
  • Facility in hardware to prioritize management traffic at ingress: To ensure that the management application is always accessible, even when the device is under a flood attack.
  • Congestion management in hardware at ingress: To ensure that buffers are not exhausted by applications that do a lot of processing.
  • Hardware acceleration for crypto, regular expressions and compression/decompression.

Programming considerations for performance
  • Each poll() or epoll_wait() call is expensive, so call it as rarely as possible: Once epoll_wait() returns, read as much data as possible from each ready socket. Similarly, write as much data as possible to each ready socket.
  • Avoid locking as much as possible.
  • Avoid pipelining. Adopt the run-to-completion model.
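
The first point in the list above, reading as much as possible once a socket is reported ready, can be sketched like this. The function name drain and the buffer size are my own choices; the socket is assumed to be non-blocking.

```python
import socket


def drain(sock, bufsize=65536):
    """Read from a non-blocking socket until it would block or the peer closes.

    Returns (data, peer_closed).
    """
    chunks = []
    peer_closed = False
    while True:
        try:
            chunk = sock.recv(bufsize)
        except BlockingIOError:
            break  # nothing more to read right now; go back to epoll_wait()
        if not chunk:
            peer_closed = True  # orderly shutdown by the peer
            break
        chunks.append(chunk)
    return b"".join(chunks), peer_closed
```

The write side would mirror this: keep calling send() on the pending data until it is all written or send() would block.
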
I hope this helps developers who are building proxy based applications.
