- Firewall and server load balancing (SLB) devices require an inactivity timer for each session they create. Once a session is established, activity on it is observed; if no packets arrive for a certain time (the inactivity timeout), the session is removed. These functions also remove a session upon observing TCP RST or TCP FIN packets. Non-TCP sessions are removed only upon inactivity.
- IPsec VPN, ARP, and QoS functions start a timer for each session/flow. These timers are typically called 'life expiry' timers, and they almost always expire. Once a timer expires, the session is removed or revalidated. In the case of QoS, most of these timers are periodic and are restarted immediately after expiry.
- Protocol state machine timers: these timers are started on a per-session basis, sometimes more than one per session, during session establishment as part of the signaling protocol. A timer is started when a message is expected from the peer and stopped upon receiving it. So, in normal cases, these timers are always stopped.
Operating systems like Linux provide a timer module. At a high level, the typical API functions provided by a timer module are (a usage sketch follows the list):
- add_timer: starts a timer with the given timeout value. This function also takes the application's function pointer and a word-sized callback argument. When the timeout occurs, the function pointer is called with the callback argument.
- del_timer: stops a timer that was started earlier. Many applications expect that once the timer is stopped, the application's function pointer is never called.
- mod_timer: restarts the timer with a new timeout value.
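To make this concrete, here is a minimal sketch against the classic Linux kernel timer API (the older style where the callback receives a word argument, matching the description above; newer kernels moved to timer_setup() with a struct timer_list * callback). The session structure and handler names are mine, for illustration:

```c
#include <linux/timer.h>
#include <linux/jiffies.h>

/* Hypothetical per-session state, for illustration only. */
struct session {
    struct timer_list inactivity_timer;
    /* ... 5-tuple, counters, and so on ... */
};

/* Runs in softirq context when the timeout fires; the word given
 * in .data is passed back here as 'data'. */
static void session_expired(unsigned long data)
{
    struct session *sess = (struct session *)data;
    /* ... remove or revalidate the session ... */
}

static void session_arm_timer(struct session *sess)
{
    init_timer(&sess->inactivity_timer);
    sess->inactivity_timer.function = session_expired;
    sess->inactivity_timer.data     = (unsigned long)sess;
    sess->inactivity_timer.expires  = jiffies + 60 * HZ;  /* 60 s */
    add_timer(&sess->inactivity_timer);
}

static void session_on_activity(struct session *sess)
{
    /* Restart the timer with a fresh deadline. */
    mod_timer(&sess->inactivity_timer, jiffies + 60 * HZ);
}

static void session_on_rst_or_fin(struct session *sess)
{
    /* Stop the timer; use del_timer_sync() if a concurrently
     * running callback on another CPU must be waited for. */
    del_timer(&sess->inactivity_timer);
}
```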
Some other characteristics and requirements of networking functions with respect to timers are:
- Any device targeting the enterprise, data center, or service provider market can have sessions numbering in the millions, and hence requires millions of timers. I have observed:
- 2M sessions in firewall and SLB devices.
- 50K IPsec VPN tunnels.
- 500K QoS sessions (one session per subscriber and type of traffic).
- Session establishment rate is very important in addition to throughput. Some firewall and server load balancing devices support connection rates in the range of 250K to 500K connections/sec. IPsec VPN functions in the high-end market segment require SA establishment rates in the range of 20K/sec. Hence it is very important that 'adding the timer' and 'removing the timer' consume very few CPU cycles.
- Firing the timer (timer expiry) happens most of the time. Let us see some scenarios.
- As discussed above, the timers of some networking functions such as IPsec VPN, ARP, and QoS always expire.
- In the case of firewall and SLB, it may appear that expiry rarely happens. For non-TCP sessions, timer expiry always happens. Even with TCP, expiry is not uncommon: many TCP connections are non-interactive, so the inactivity timeout is normally up to 60 seconds. In lab environments, connection-rate and hybrid (connection rate with realistic payload) measurements always result in timers being stopped by TCP RST or FIN; that is, each connection ends before the inactivity timeout expires. In real workloads, connections can stay beyond the inactivity timeout. In those cases timers expire and are restarted if there was activity, where activity is normally detected using the last-packet timestamp (see the sketch after this list).
- Jitter is one important item that these applications should keep to a minimum. This means a core should not do too much work while traversing timers to check for expiration, nor fire too many timers in one loop. If these issues are not taken care of, incoming packets will see large latency and hence jitter. The timer subsystem is expected to divide the work evenly, thereby giving uniform latency and low jitter.
- High-end network devices normally have a large amount of memory, but they also support millions of sessions, and the timer mechanism is just one consumer, so it should not take too much memory. If the system supports Y million sessions and each timer takes X bytes, the amount of memory consumed should not exceed X*Y. It is okay to expect some minimal amount of memory beyond this for housekeeping, but memory should not be a multiple of the X*Y product.
- Some implementations, upon deleting a timer, keep some state information until the timer would actually have expired. I guess these implementations do this to improve performance, for example by avoiding locks between add and delete. Though this state information is smaller than the timer block, it can still lead to runaway memory usage. Say this state information is 8 bytes, the system supports 250K connections/sec, and it has a 2M session capacity; that is, the system can start and stop 250K timers every second. Assume the inactivity timeout is 60 seconds: within those 60 seconds, 250K*60 timers are added and deleted. If 8 bytes of state must be kept until expiry, that requires an additional 250K*60*8 = 120 Mbytes of memory, on top of the memory required for the active timers of 2M sessions. That would be unacceptable; think of the case where the inactivity timeout is more than 60 seconds.
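Here is a minimal sketch of the last-packet-timestamp technique mentioned in the scenarios above, again in the older kernel timer style; all names are illustrative. The fast path just stamps the session, and only the expiry handler decides whether the session is really idle:

```c
#include <linux/timer.h>
#include <linux/jiffies.h>

#define INACTIVITY_TIMEOUT (60 * HZ)   /* illustrative 60 s timeout */

/* Hypothetical session with a last-activity timestamp. */
struct session {
    struct timer_list inactivity_timer;
    unsigned long last_packet;         /* jiffies of most recent packet */
};

/* Fast path, run per packet: a single store, no timer manipulation. */
static void session_touch(struct session *sess)
{
    sess->last_packet = jiffies;
}

/* Expiry handler: only here do we decide whether the session is
 * really idle. */
static void inactivity_expired(unsigned long data)
{
    struct session *sess = (struct session *)data;

    if (time_before(jiffies, sess->last_packet + INACTIVITY_TIMEOUT)) {
        /* There was activity: restart for the remaining interval. */
        mod_timer(&sess->inactivity_timer,
                  sess->last_packet + INACTIVITY_TIMEOUT);
    } else {
        /* Really idle: tear the session down. */
        /* session_destroy(sess);  -- hypothetical */
    }
}
```

The design point is that the per-packet cost is a single store; the timer itself is touched at most once per timeout interval, no matter the packet rate.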
Based on the above, all timer operations (addition, deletion, and expiry) need to perform well, with low jitter, and without inflating the memory requirement. On multicore processors, it is also a requirement to avoid locks as much as possible, to get performance that scales linearly with the number of cores.
The newer Linux timer implementation, based on the 'cascaded timer wheel', satisfies many of these requirements. It does not take care of two items: minimizing jitter and avoiding locks.
Linux implements multiple timer wheels with different granularities. The lowest-level wheel's granularity is one jiffy. Each next-level wheel's granularity is the previous level's granularity multiplied by the number of buckets in the previous wheel; for example, if the first-level wheel has 256 buckets, then each bucket of the second-level wheel covers 256 jiffies. When a timer is started, it is placed in the appropriate wheel. Each timer wheel has a running index. On every jiffy interrupt, the timers in the bucket indexed by the lowest-level wheel's running index are expired, and the index moves to the next bucket. Every time the current wheel's index wraps around to the beginning, all timers in the next wheel's current bucket are moved down into this wheel, distributed by their timeout values, and the next wheel's running index is incremented. Timers are kept in doubly linked lists, so addition and removal of timers is very fast. A simplified sketch of this structure follows.
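Below is a simplified, self-contained sketch of that cascading structure. It is not the actual kernel code; sizes and names are illustrative. Each level covers 256 times the span of the level below it, timers sit in doubly linked lists for O(1) removal, and a wrap of the lowest wheel pulls one bucket down from the level above:

```c
#include <stddef.h>

#define WHEEL_BITS 8
#define WHEEL_SIZE (1UL << WHEEL_BITS)      /* 256 buckets per wheel */
#define WHEEL_MASK (WHEEL_SIZE - 1)
#define NUM_WHEELS 4                        /* covers 2^32 jiffies   */

struct wtimer {
    struct wtimer *next, **pprev;  /* doubly linked: O(1) removal    */
    unsigned long expires;         /* absolute expiry time (jiffies) */
    void (*fn)(void *arg);
    void *arg;
};

struct wheels {
    unsigned long now;             /* current jiffy; buckets assumed
                                      zero-initialized by the caller */
    struct wtimer *bucket[NUM_WHEELS][WHEEL_SIZE];
};

/* Pick the wheel level by how far in the future the timer is (expiry
 * assumed not in the past), then index the bucket by the matching
 * bits of the absolute expiry time. */
static void enqueue(struct wheels *w, struct wtimer *t)
{
    unsigned long delta = t->expires - w->now;
    int lvl = 0;

    while (lvl < NUM_WHEELS - 1 &&
           delta >= (1UL << (WHEEL_BITS * (lvl + 1))))
        lvl++;

    struct wtimer **head =
        &w->bucket[lvl][(t->expires >> (WHEEL_BITS * lvl)) & WHEEL_MASK];
    t->next = *head;
    if (t->next)
        t->next->pprev = &t->next;
    *head = t;
    t->pprev = head;
}

/* O(1) stop: unlink the timer from whatever bucket it sits in. */
static void detach(struct wtimer *t)
{
    *t->pprev = t->next;
    if (t->next)
        t->next->pprev = t->pprev;
    t->pprev = NULL;               /* no longer queued */
}

/* Move the current bucket of wheel 'lvl' down one level, spreading
 * its timers by exact expiry; this is the cascade step. */
static void cascade(struct wheels *w, int lvl)
{
    if (lvl >= NUM_WHEELS)
        return;
    unsigned long idx = (w->now >> (WHEEL_BITS * lvl)) & WHEEL_MASK;
    if (idx == 0)
        cascade(w, lvl + 1);       /* this wheel wrapped as well */

    struct wtimer *t = w->bucket[lvl][idx];
    w->bucket[lvl][idx] = NULL;
    while (t) {
        struct wtimer *next = t->next;
        enqueue(w, t);             /* lands in a finer-grained wheel */
        t = next;
    }
}

/* Called once per jiffy: cascade on wrap, then fire the due bucket. */
static void tick(struct wheels *w)
{
    w->now++;
    if ((w->now & WHEEL_MASK) == 0)
        cascade(w, 1);

    struct wtimer *t = w->bucket[0][w->now & WHEEL_MASK];
    w->bucket[0][w->now & WHEEL_MASK] = NULL;
    while (t) {
        struct wtimer *next = t->next;
        t->pprev = NULL;           /* mark as no longer queued */
        t->fn(t->arg);             /* expire */
        t = next;
    }
}
```

Note that the per-jiffy cost is O(1) except on wraps; the wrap is exactly the cascading burst discussed next.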
This works fine except for the case where timers must be moved from the next wheel to the current wheel and distributed. That case may not arise if the timers are stopped (deleted) first, but in some workloads timers are not stopped before they expire. Note that cascading may involve moving a large number of timers, which adds latency to the packets being processed; if it takes long enough, it may even result in packets being dropped. I do have some unproven ideas and I will get back when I crystallize those ideas in my mind.
The Linux implementation also uses locks to protect the bucket lists, as different cores may add and remove a given timer. Locks can be avoided with some simple changes: maintain timer wheels on a per-core basis, and if a timer is stopped by another core, indicate this to the original core. This ensures that only the core which started the timer deletes it from the bucket, and that it is the one which fires the application callback upon timer expiry. To pass the deleted-timer indications to the original core without locks, ring-based arrays (one array for each core) can be used, as in the sketch below.
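Here is a minimal sketch of that per-core scheme, building on the wheel sketch above; all names are assumptions, and struct wtimer is assumed to carry a 'cancelled' flag that the owner's expiry path checks before firing the callback. Each core owns one wheel plus one single-producer/single-consumer ring per remote core, so no list is ever touched by two cores and no lock is needed:

```c
#include <stdatomic.h>
#include <stdbool.h>

#define RING_SLOTS 4096                 /* power of two                   */
#define MAX_CORES  16

struct wtimer;                          /* from the wheel sketch above    */
void detach(struct wtimer *t);          /* O(1) bucket unlink, ditto      */
void set_cancelled(struct wtimer *t);   /* assumed: marks the timer so
                                           the owner's expiry skips it    */
bool still_queued(struct wtimer *t);    /* assumed: t->pprev != NULL      */

/* Single-producer / single-consumer cancel ring: the remote core
 * writes head, the owning core writes tail, so no lock is needed. */
struct cancel_ring {
    _Atomic unsigned long head;
    _Atomic unsigned long tail;
    struct wtimer *slot[RING_SLOTS];
};

struct percpu_timers {
    /* struct wheels wheel;  -- per-core wheel from the sketch above */
    struct cancel_ring inbox[MAX_CORES];    /* indexed by sender core */
};

/* Called on core 'my_core' to stop a timer owned by another core. */
static bool timer_stop_remote(struct percpu_timers *owner, int my_core,
                              struct wtimer *t)
{
    struct cancel_ring *r = &owner->inbox[my_core];
    unsigned long head = atomic_load_explicit(&r->head, memory_order_relaxed);

    if (head - atomic_load_explicit(&r->tail, memory_order_acquire)
            >= RING_SLOTS)
        return false;                       /* ring full: caller retries */

    set_cancelled(t);                       /* expiry path will skip it  */
    r->slot[head & (RING_SLOTS - 1)] = t;
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

/* Run by the owning core every tick, before expiring timers. */
static void drain_cancellations(struct percpu_timers *st, int ncores)
{
    for (int c = 0; c < ncores; c++) {
        struct cancel_ring *r = &st->inbox[c];
        unsigned long tail = atomic_load_explicit(&r->tail,
                                                  memory_order_relaxed);
        unsigned long head = atomic_load_explicit(&r->head,
                                                  memory_order_acquire);
        while (tail != head) {
            struct wtimer *t = r->slot[tail & (RING_SLOTS - 1)];
            if (still_queued(t))   /* may already have expired */
                detach(t);         /* safe: only this core owns buckets */
            tail++;
        }
        atomic_store_explicit(&r->tail, tail, memory_order_release);
    }
}
```

One caveat: a cancellation that races with expiry on the owning core may still see the callback run once; accepting that relaxation is what buys the lock-free fast path.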