One of the main challenges with software timers is ensuring that packet jitter and latency do not increase while timer-block operations are running. Packet latency rises, or packets even get dropped, when the CPU spends too long processing timer-block functions. Any timer function that walks through timers in a tight loop affects packet processing if the number of timer elements checked or acted upon in that loop is large. The threshold number of elements beyond which the tight loop disrupts packet latency depends on the CPU frequency. Depending on the software timer block implementation, different operations require traversing some of the timers. Let us look at some of the challenges/problems with software timer modules.
- Software timers depend on the hardware timer interrupt. In Linux, the timer interrupt occurs every jiffy (typically 1 msec or 2 msec). Because of this, any software timer can have an error of up to one jiffy. If an application requires a smaller error, say in terms of microseconds, the only method I can think of is to make the timer interrupt occur at microsecond intervals. This may not work on all processors: the interrupt-processing overhead on the cores becomes too high and reduces the performance of the system. Fortunately, many applications tolerate millisecond error in firing the timers, but some applications, such as QoS scheduling on multi-gigabit links running general-purpose operating systems such as Linux, require finer-grained and more accurate timers.
- Many networking applications require a large number of software timers, as described in the earlier post. This leads to traversing many timers every jiffy. For example, if an application creates 500K timers/sec, then about 500 timers come due per jiffy. Every millisecond, the implementation needs to traverse 500 timers and may have to fire all 500 of them. This can take a significant amount of time, depending on how long each application timer callback takes. If callback processing takes long enough, you get packet drops, increased packet latency, or both. Some software implementations maintain the timers on a per-core basis. With 8 cores, each core would process 62 or 63 timers every millisecond. That is the ideal case, but what if the traffic workload causes only a few cores to start timers? Only those few cores would be loaded with processing the expired timers. In short, the load may not get balanced across the cores.
- To reduce the number of timers traversed on every hardware timer interrupt, software implementations normally use cascaded timer wheels. This approach maintains different timer wheels for different timer granularities; when a timer is started, it goes into the appropriate wheel and bucket, so any given bucket contains only timers that are about to expire. Though this reduces the number of timers traversed per timer interrupt, it may involve moving a large number of timers from one timer wheel to another, as described in the earlier post. This movement of timers may take a significant amount of time and, again, can cause packet drops and increased latency.
- If there are periodic timers, or timers that need to be restarted based on activity, the software timer implementation spends a good amount of time restarting them.
Do hardware timer blocks in Multi-core processors help?
In my view, a hardware timer block can help when your applications demand a large number of timers, periodic timers, or very accurate timers. If your application requires 'Zero Loss Throughput', then a hardware block will certainly help, as it takes away the CPU cycles that software implementations spend traversing timer lists or moving timers between wheels.
What are the features expected by network infrastructure applications from hardware timer block in Multi-core processors?
- A large number of timers, ranging into the millions, is expected to be supported.
- A decent number of timer groups (say 1K) is expected to be supported. Multiple applications running on the cores require timers, and an application that is being shut down, or terminated due to some error condition, should be able to clear all the timers it had started.
- Timer groups should be accessible to applications running in different execution contexts, with good isolation among groups. There should be a provision to program the number of timers that can be added to a timer group, and a provision to read the number of timers currently in a group. The execution contexts include:
- Applications running in Linux user space
- Applications running in Kernel space.
- Applications running in virtual machines.
- Applications should be able to perform the following operations. All operations are expected to complete synchronously.
- Start a new timer. The application should be able to provide:
- Timer identification: timer group and a unique timer identifier within the group.
- Timeout value (microsecond granularity).
- Timer type: one-shot, periodic, or inactivity timer.
- Priority of the timeout event (upon expiry): this helps prioritize timer events with respect to other events such as packets.
- If there are multiple ways or queues to deliver the timer event to the cores upon expiry, then the application should be able to give its choice of way/queue as part of starting the timer. This helps steer the timer event to a specific core or distribute the timer events across cores.
- Stop an existing timer: stopping a timer should free it as soon as possible. One existing hardware implementation of a timer block in a multi-core processor has this problem today: if an application starts and stops timers continuously, it eventually runs out of memory, because memory is freed only at the timers' original timeout. If those timeouts are in the tens of minutes, the memory is not released for minutes on end. A good hardware implementation of a timer block should not exhibit this kind of runaway memory usage in any situation. Timer stop attributes typically involve:
- Timer identification
- Restart an existing timer, providing:
- Timer identification
- New timeout value
- Get the remaining timeout value at any time, synchronously, by giving the timer identification.
- Set activity on the timer: this should be very fast, as applications might use it on a per-packet basis.
Firewall/NAT/ADC appliances targeting large enterprise and data center markets would benefit greatly from hardware-based timer blocks. All hardware timer blocks are not created equal; hence, check the functionality and efficacy of the hardware implementation.
- Measure the latency, packet drop, and jitter of packets over a long period. One scenario that can be tested is given below.
- Without timers, measure the throughput of 1M sessions by pumping traffic across all sessions using equipment such as IXIA or Smartbits. Let us say this throughput is B1.
- Create 1M sessions, and hence 1M timers, with a 10-minute timeout value.
- Pump traffic from the IXIA or Smartbits for 30 minutes.
- Check whether the throughput stays almost the same as B1 across all 30 minutes. Also ensure that there is no packet drop or increase in packet latency, specifically around the 10-, 20-, and 30-minute marks.
- Measure the memory usage:
- Do a connection-rate test with each connection's inactivity timeout value set to 10 minutes.
- Ensure that upon a TCP Reset or TCP FIN sequence the session is removed and hence its timer is stopped.
- Continue this for 10 or more minutes.
- Ensure that memory usage does not grow beyond reason.
- Ensure that timers can be started successfully throughout the test.