Wednesday, January 7, 2009

Linux SoftIRQ based applications & throttling & yield

Some common packet processing applications are implemented in the Linux kernel in softirq context. Examples include Linux iptables, IPSec VPN, etc. Based on some questions in Check Point firewall product forums, it appears that the Check Point firewall and IPSec VPN were also implemented in the Linux kernel. Netfilter hooks are normally used to get hold of packets from the Linux TCP/IP stack. To keep the programming simple, packet handling by the firewall and IPSec VPN happens within the functions registered as netfilter hooks. Since netfilter hooks are called in softirq context for received packets, essentially the majority of firewall and IPSec VPN processing happens in softirq context.
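For illustration, here is a minimal sketch of such a hook using the 2.6-era netfilter API (the exact hook prototype and constants vary across kernel versions):

#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>

/* Runs in softirq context for every received IPv4 packet. */
static unsigned int fw_hook(unsigned int hooknum, struct sk_buff *skb,
                            const struct net_device *in,
                            const struct net_device *out,
                            int (*okfn)(struct sk_buff *))
{
    /* firewall / IPSec VPN processing would happen here */
    return NF_ACCEPT;
}

static struct nf_hook_ops fw_ops = {
    .hook     = fw_hook,
    .pf       = PF_INET,
    .hooknum  = NF_INET_PRE_ROUTING,   /* NF_IP_PRE_ROUTING on older kernels */
    .priority = NF_IP_PRI_FIRST,
};

/* registered once at module init with nf_register_hook(&fw_ops) */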

The do_softirq() function is called from the IRQ exit routine, from local_bh_enable() and from ksoftirqd context. Since it runs from the IRQ exit routine and local_bh_enable(), softirq processing takes priority over all threads and user level processes. Basically, a softirq can preempt any user level or kernel level thread. Though this is good for packet processing in terms of latency and jitter, it also has the problem of consuming all the CPU under high packet load. In lab environments (and I am sure it can happen in real deployments too, for example during a flood attack) one can ensure that no user level process gets a chance to execute simply by sending a large number of packets to the device.

This is really a pain for many network equipment software developers. As of this writing, I have not found any good mechanism in the Linux kernel to throttle the incoming traffic and yield so that other processes get a chance to do their work. __do_softirq() does have some protection to ensure that it does not keep processing pending softirq events forever: it breaks out of its loop after 10 iterations even if events are still pending, and wakes up the ksoftirqd kernel thread to process them. ksoftirqd, being a low priority thread, is supposed to give other user level threads a chance. Without this check, no other kernel task would ever run: by the time the pending softirq events are processed, external events (such as packets) would have generated new ones, and this could go on forever. Though this protection is available, it is apparently not sufficient, probably because new interrupts arrive right after __do_softirq() gives up control.
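For reference, the relevant logic in kernel/softirq.c looks roughly like this (simplified from the 2.6-era source; details vary by kernel version):

#define MAX_SOFTIRQ_RESTART 10

asmlinkage void __do_softirq(void)
{
    int max_restart = MAX_SOFTIRQ_RESTART;
    __u32 pending = local_softirq_pending();

restart:
    /* ... run the handler of every pending softirq ... */

    pending = local_softirq_pending();
    if (pending && --max_restart)
        goto restart;        /* loop at most 10 times */

    if (pending)
        wakeup_softirqd();   /* defer the rest to the low priority ksoftirqd */
}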

Any method that provides user level process execution under high load conditions should require no major changes to the application code, and it should not adversely impact application performance. That is only possible if packet scheduling happens in thread context. To keep the application code unchanged, the application's packet processing itself stays in tasklet context: the thread hands control of the packets over to a tasklet for processing. Let us call this module 'floodways'.

Floodways consists of as many threads as there are CPUs in the system, each thread affined to the corresponding CPU. The module also needs as many queues as there are CPUs (threads). In addition, there is one low priority thread; let us call it the 'idle' thread.
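Something like the following per-CPU state could back this design (all the floodways_* names here are hypothetical, invented for this sketch):

#include <linux/kthread.h>
#include <linux/interrupt.h>
#include <linux/list.h>
#include <linux/skbuff.h>
#include <linux/percpu.h>

#define FLOODWAYS_SKBS_PER_NODE 16

struct floodways_node {                 /* one batch of queued packets */
    struct list_head list;
    unsigned int nr_skbs;
    struct sk_buff *skbs[FLOODWAYS_SKBS_PER_NODE];
};

struct floodways_cpu {                  /* per-CPU state */
    struct list_head queue;             /* list of floodways_node */
    atomic_t qlen;                      /* total number of queued skbs */
    struct floodways_node *cur_node;    /* node handed to the tasklet */
    struct tasklet_struct tasklet;
    struct task_struct *thread;         /* worker thread affined to this CPU */
};

static DEFINE_PER_CPU(struct floodways_cpu, floodways_cpus);
static struct task_struct *idle_task;            /* the low priority 'idle' thread */
static unsigned long long last_idle_timestamp;   /* written only by the idle thread */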

To improve performance, each node in the queue holds a batch of up to 16 skbs, so one tasklet run processes many packets.

static int threadEntryPoint(void *data)
{
    struct floodways_cpu *fc = &per_cpu(floodways_cpus, (long)data);

    /* kthread_should_stop() takes care of the thread exit scenario
       when the application terminates. */
    while (!kthread_should_stop()) {
        while (atomic_read(&fc->qlen) != 0) {
            /* Remove the first node (floodways_dequeue() is sketched
               near the end) and let the tasklet process its skbs. The
               voluntary schedule() gives the tasklet, and everyone
               else, a chance to run before the next node is dequeued. */
            fc->cur_node = floodways_dequeue(fc);
            if (fc->cur_node)
                tasklet_schedule(&fc->tasklet);
            schedule();   /* give up control voluntarily */
        }
        set_current_state(TASK_INTERRUPTIBLE);
        /* Re-check to avoid sleeping through a wakeup that raced with us. */
        if (atomic_read(&fc->qlen) == 0 && !kthread_should_stop())
            schedule();   /* sleep until the softirq path wakes us up */
        __set_current_state(TASK_RUNNING);
    }
    return 0;
}

static void floodways_tasklet_fn(unsigned long data)
{
    struct floodways_cpu *fc = (struct floodways_cpu *)data;
    struct floodways_node *node = fc->cur_node;
    unsigned int i;

    /* Call the application input function for all skbs in the node
       (application_input() stands in for whatever entry point the
       firewall/IPSec code exposes). */
    for (i = 0; i < node->nr_skbs; i++)
        application_input(node->skbs[i]);

    kfree(node);   /* free the node */
}

static int idleThread(void *unused)
{
    while (!kthread_should_stop()) {
        last_idle_timestamp = sched_clock();   /* note down the timestamp */
        schedule();   /* stay runnable; runs only when the CPU is otherwise free */
    }
    return 0;
}
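Putting it together, initialization might look like this (again a sketch; error unwinding and CPU hotplug are ignored for brevity):

static int __init floodways_init(void)
{
    int cpu;

    for_each_online_cpu(cpu) {
        struct floodways_cpu *fc = &per_cpu(floodways_cpus, cpu);

        INIT_LIST_HEAD(&fc->queue);
        atomic_set(&fc->qlen, 0);
        tasklet_init(&fc->tasklet, floodways_tasklet_fn, (unsigned long)fc);

        fc->thread = kthread_create(threadEntryPoint, (void *)(long)cpu,
                                    "floodways/%d", cpu);
        if (IS_ERR(fc->thread))
            return PTR_ERR(fc->thread);
        kthread_bind(fc->thread, cpu);   /* affine the thread to its CPU */
        wake_up_process(fc->thread);
    }

    idle_task = kthread_run(idleThread, NULL, "floodways_idle");
    if (IS_ERR(idle_task))
        return PTR_ERR(idle_task);
    set_user_nice(idle_task, 19);   /* lowest priority */
    return 0;
}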

At the right place in the packet processing path in softirq context (perhaps right after the netfilter hook gets hold of the packet), the code can do the following:

int cpu = smp_processor_id();        /* get hold of the executing CPU id */
struct floodways_cpu *fc = &per_cpu(floodways_cpus, cpu);

/* Control comes here if the CPU specific queue is not empty or the idle
   thread has not executed for 2 seconds. sched_clock() counts in
   nanoseconds, hence the NSEC_PER_SEC scaling. */
if (atomic_read(&fc->qlen) != 0 ||
    sched_clock() - last_idle_timestamp > 2ULL * NSEC_PER_SEC) {

    /* Keep the skb in a node with free space, or create a node and
       add the skb to it (floodways_enqueue() is sketched below). */
    floodways_enqueue(fc, skb);

    /* Wake up the thread bound to this CPU. */
    if (fc->thread->state != TASK_RUNNING)
        wake_up_process(fc->thread);
    /* the thread and tasklet will finish the processing later */
} else {
    /* Go through normal processing. Since the condition above is not
       expected to hit below 100% CPU load, performance is not expected
       to degrade. */
}
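The enqueue helper used above could be as simple as the following. It runs in softirq context, where no spin lock is needed: softirq processing does not nest on a CPU, and the thread side disables bottom halves (see below):

static int floodways_enqueue(struct floodways_cpu *fc, struct sk_buff *skb)
{
    struct floodways_node *node = NULL;

    /* Reuse the last node if it still has room, else allocate a new one. */
    if (!list_empty(&fc->queue))
        node = list_entry(fc->queue.prev, struct floodways_node, list);
    if (!node || node->nr_skbs == FLOODWAYS_SKBS_PER_NODE) {
        node = kmalloc(sizeof(*node), GFP_ATOMIC);
        if (!node)
            return -ENOMEM;
        node->nr_skbs = 0;
        list_add_tail(&node->list, &fc->queue);
    }
    node->skbs[node->nr_skbs++] = skb;
    atomic_inc(&fc->qlen);
    return 0;
}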

Since the thread is the one scheduling the tasklet, I am hoping that other threads will get a chance to run.

Since the addition and removal of queue entries happen on the same CPU, there is no need for spin locks; the thread only needs to disable softirqs around the queue manipulation using local_bh_disable()/local_bh_enable(). No extra protection is required for checking the queue length as long as it is an atomic variable.
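For completeness, the matching dequeue used by threadEntryPoint() could look like this:

static struct floodways_node *floodways_dequeue(struct floodways_cpu *fc)
{
    struct floodways_node *node = NULL;

    /* Enqueue (softirq) and dequeue (this thread) run on the same CPU,
       so disabling bottom halves is all the mutual exclusion we need. */
    local_bh_disable();
    if (!list_empty(&fc->queue)) {
        node = list_entry(fc->queue.next, struct floodways_node, list);
        list_del(&node->list);
        atomic_sub(node->nr_skbs, &fc->qlen);
    }
    local_bh_enable();
    return node;
}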

I hope it helps.

1 comment:

rat said...

While packets are flooding { since softirq is reentrant }, even though we voluntarily yield the CPU (using schedule), won't the scheduler again choose ksoftirqd??