Sunday, May 18, 2008

Hardware timer block in Multicore processors for network infrastructure devices

Some use-case scenarios of timers for different functions of network infrastructure devices are given here.



One of the main challenges with software timers is to ensure that jitter and latency of packets don't go up while timer-block-related operations occur. Packet latency, or even packet drop, happens when the CPU takes too long to process timer block functions. Any timer block function that goes through the timers in a tight loop affects packet processing if the number of timer elements checked or acted on in that loop is large. The threshold number of elements that can be checked in a tight loop before packet latency is disrupted depends on the CPU frequency. Depending on the software timer block implementation, several operations involve traversing timers. Let us see some of the challenges/problems with software timer modules.
  • Software timers depend on the hardware timer interrupt. In Linux, the timer interrupt occurs once every jiffy (typically 1 msec or 2 msec). Due to this, any software timer can have an error of up to one jiffy. If an application requires a smaller error, say in microseconds, the only method I can think of is to have the timer interrupt occur at microsecond intervals. This may not work on all processors: the interrupt processing overhead on the cores becomes too high and reduces the performance of the system. Fortunately, many applications tolerate a millisecond error in firing the timers, but some applications, such as QoS scheduling on multi-gig links running general purpose operating systems such as Linux, require finer-grained and more accurate timers.
  • Many networking applications require a large number of software timers, as described in an earlier post. This leads to traversing many timers every jiffy. For example, if an application creates 500K timers/sec, then there would be 500 timers per 1-msec jiffy. Every millisecond, the implementation needs to traverse 500 timers and may have to fire all 500 of them. This can take a significant amount of time depending on how long the application timer callbacks take. If they take a good amount of time to process, you get packet drop, increased packet latency, or both. Some software implementations maintain the timers on a per-core basis. If there are 8 cores, each core may be processing 62 or 63 timers every millisecond. That is the ideal case, but what if the traffic workload causes only a few cores to start timers? Only those few cores would be loaded with processing the expired timers. Basically, the load may not get balanced across the cores.
  • To reduce the number of timers to traverse on every hardware timer interrupt, software implementations normally use cascaded timer wheels. This implementation has different timer wheels for different timer granularities, and when timers are started they go to the appropriate wheel and bucket. Due to this, any given bucket of a timer wheel contains the timers that expire at that tick. Though this reduces the number of timers to traverse on every timer interrupt, it may involve movement of a large number of timers from one timer wheel to another, as described in the earlier post. This movement of timers may take a significant amount of time and, again, can cause packet drop and increased latency. (A minimal single-level sketch follows this list.)
  • If there are periodic timers, or timers that need to be restarted based on activity, the software timer implementation spends a good amount of time restarting them.
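To make the per-tick traversal cost concrete, here is a minimal single-level timer wheel in C. It is an illustrative sketch, not code from any particular kernel: real implementations cascade several wheels of different granularities, but even this simple version shows how every timer hashed into the current bucket is visited on a tick whether or not it is due.

    /* Minimal single-level timer wheel (illustrative sketch only).
     * Real kernels use cascaded wheels; this shows why per-tick cost
     * grows with the number of timers landing in a bucket. */
    #define WHEEL_SIZE 256            /* one bucket per tick, wraps */

    struct timer {
        struct timer *next;
        unsigned long expires;        /* absolute tick of expiry */
        void (*callback)(void *arg);  /* application callback */
        void *arg;
    };

    static struct timer *wheel[WHEEL_SIZE];
    static unsigned long current_tick;

    /* Start a timer 'ticks' jiffies from now. */
    void timer_start(struct timer *t, unsigned long ticks)
    {
        unsigned long slot = (current_tick + ticks) % WHEEL_SIZE;
        t->expires = current_tick + ticks;
        t->next = wheel[slot];
        wheel[slot] = t;
    }

    /* Called from the per-jiffy timer interrupt. Every timer hashed
     * into this bucket is visited, even ones that are not yet due. */
    void timer_tick(void)
    {
        struct timer **pp;
        current_tick++;
        pp = &wheel[current_tick % WHEEL_SIZE];
        while (*pp) {
            struct timer *t = *pp;
            if (t->expires <= current_tick) {
                *pp = t->next;        /* unlink and fire */
                t->callback(t->arg);
            } else {
                pp = &t->next;        /* wrapped-around timer: skip,
                                         but the traversal cost is paid */
            }
        }
    }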
Do hardware timer blocks in Multi-core processors help?  

In my view a hardware timer block can help when your applications demand a large number of timers, periodic timers, or very accurate timers. If your application requires 'Zero Loss Throughput', then a hardware block certainly helps, as it takes away the CPU cycles that software implementations spend traversing timer lists or moving timers between wheels.

What are the features expected by network infrastructure applications from hardware timer block in Multi-core processors?
  • A large number of timers is expected to be supported, ranging into the millions.
  • A decent number (say 1K) of timer groups is expected to be supported. There are multiple applications running on the cores that require timers. Applications that are being shut down, or terminated due to some error condition, should be able to clear all the timers they had started.
  • Accessibility of timer groups by applications running in different execution contexts, with good isolation among timer groups. There should be a provision to program the number of timers that can be added to a timer group, and a provision to read the number of timers currently in a timer group. The execution contexts include:
    • Applications running in Linux user space 
    • Applications running in Kernel space.
    • Applications running in virtual machines. 
  • Applications should be able to do the following operations, all of which are expected to complete synchronously (a hypothetical API sketch follows this list):
    • Start a new timer: the application should be able to provide:
      • Timer identification: timer group & a unique timer identification within the group.
      • Timeout value (Microsecond granularity)
      • One shot or periodic timer or inactivity timer
      • Priority of timeout event (upon expiry) : This would help in prioritizing the timer events with respect to other events such as packets.
      • If there are multiple ways or queues to deliver the timer event to the cores upon expiry, then the application should be able to give its choice of way/queue when starting the timer. This helps in steering the timer event to a specific core or distributing the timer events across cores.
    • Stop an existing timer: Stopping the timer should free the timer as soon as possible. One existing hardware implementation of a timer block in a multi-core processor today has this problem: if the application starts and stops timers continuously, it eventually runs out of memory, because memory is freed only at the original timeout of the timers. If the timeout of these timers is in the tens of minutes, the memory is not released for minutes on end. A good hardware implementation of a timer block should not show this unbounded memory usage in any situation. Timer stop attributes typically involve:
      • Timer identification
    • Restart an existing timer:
      • Timer identification
      • New timeout value
    • Get the remaining timeout value at any time, synchronously, by giving the 'Timer identification'.
    • Set activity on the timer - should be very fast, as applications might use this on a per-packet basis.
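Put together, the operations above could look like the following C API. This is a hypothetical sketch of what a driver interface for such a block might expose; every name here (the hw_timer_* functions, the attribute structure, the enums) is my own invention for illustration, not a real vendor SDK.

    /* Hypothetical driver API for a hardware timer block.
     * All names are illustrative assumptions, not a real vendor SDK. */
    #include <stdint.h>

    typedef enum {
        HW_TIMER_ONE_SHOT,
        HW_TIMER_PERIODIC,
        HW_TIMER_INACTIVITY
    } hw_timer_type_t;

    struct hw_timer_attrs {
        uint32_t group_id;        /* timer group, for isolation and cleanup */
        uint32_t timer_id;        /* unique within the group */
        uint64_t timeout_us;      /* microsecond granularity */
        hw_timer_type_t type;     /* one-shot, periodic or inactivity */
        uint8_t  event_priority;  /* priority of expiry event vs. packets */
        uint32_t event_queue;     /* queue/core to steer the expiry event to */
    };

    /* All calls are expected to complete synchronously. */
    int hw_timer_start(const struct hw_timer_attrs *attrs);
    int hw_timer_stop(uint32_t group, uint32_t id);      /* frees memory ASAP */
    int hw_timer_restart(uint32_t group, uint32_t id, uint64_t new_timeout_us);
    int hw_timer_remaining(uint32_t group, uint32_t id, uint64_t *remaining_us);
    int hw_timer_touch(uint32_t group, uint32_t id);     /* per-packet activity */
    int hw_timer_group_flush(uint32_t group);            /* clear a whole group */

hw_timer_group_flush covers the application shutdown case mentioned earlier, and hw_timer_touch maps to the per-packet "set activity" operation.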
Firewall/NAT/ADC appliances targeting large enterprise and data center markets would greatly benefit from hardware-based timer blocks. Not all hardware timer blocks are created equal, so check the functionality and efficacy of the hardware implementation.
  • Measure the latency, packet drop and jitter of the packets over a long time. One scenario that can be tested is given below.
    • Without timers, measure the throughput of 1M sessions by pumping traffic across all sessions using equipment such as IXIA or SmartBits. Let us say this throughput is B1.
    • Create 1M sessions, hence 1M timers, with a 10-minute timeout value.
    • Pump the traffic from IXIA or SmartBits for 30 minutes.
    • Check whether the throughput stays almost the same as B1 across all 30 minutes. Also ensure that there is no packet drop or increase in packet latency, specifically at the 10, 20 and 30 minute marks.
  • Measure the memory usage:
    • Do a connection rate test with a 10-minute inactivity timeout on each connection.
    • Ensure that upon a TCP Reset or TCP FIN sequence the session is removed and hence its timer is stopped.
    • Continue this for 10 or more minutes.
    • Ensure that the memory usage did not go up beyond reason. 
    • Ensure that timers could be started successfully during the test.



Tuesday, May 13, 2008

Assurance of firewall availability for critical resources : TR-069 support

I guess I have been harping on the fact that network security devices are stateful in nature. Let me say it again here :-) They create session entries for 5-tuple connections. DDOS attacks can consume these resources. There are several techniques used to maximize firewall availability. Some of them I discussed before: session inactivity timeout functionality, TCP SYN flood detection, the SYN cookie mechanism to prevent SYN floods, and connection rate limiting.

The above techniques do not guarantee that legitimate connections are not dropped. Rate throttling does not distinguish genuine connections from DDOS connections. But some resources are very important, and access to/from these resources must be available all the time. That is, some assurance of firewall availability for these critical resources is required.

During DDOS attacks and worm outbreaks, systems in the corporate network should have access to the central virus database server to get new virus updates. Even if some systems in the corporate network are compromised and participating in DDOS attacks, other systems should continue to access critical resources while the problem is being fixed. Similarly, access to corporate servers should be maximized during a DDOS outbreak.

Though all issues can't be solved, enough facilities should be there to assure firewall availability for these critical accesses.

Many firewalls today support a feature called 'Session Reservation and Session Limits'. Using this feature, a certain number of sessions can be reserved for individual machines/systems. This feature also limits the number of simultaneous sessions for non-critical systems/machines.
One use case example: let us say a medium enterprise has 500 systems, and that this company bought a UTM firewall with 50,000 session entries. The administrator can reserve 20 sessions and set a limit of 100 sessions for each PC. That is, 10,000 entries are reserved. The remaining 40,000 sessions are free for all. When all 40,000 sessions are used up, the reserved sessions are still available to the PCs: each PC can use its reserved 20 session entries. Thereby, during a DDOS attack, even after the 40,000 shared session entries are used, each PC continues to have access to 20 more session entries. No other system can occupy these reserved sessions.
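A minimal sketch of the admission logic implied by this example follows. The structure and function names are mine, the counters mirror the 50,000/10,000/40,000 numbers above, and a real firewall would guard these counters with a lock or atomics.

    /* Sketch of session admission with reservation and limit counting.
     * Field and function names are illustrative; a real implementation
     * needs atomics or a lock around these counters. */
    #include <stdbool.h>

    struct host_quota {
        unsigned reserved;     /* e.g. 20: entries only this host may use */
        unsigned limit;        /* e.g. 100: cap on simultaneous sessions  */
        unsigned in_use;       /* sessions currently held by this host    */
        unsigned from_shared;  /* how many of those came from the pool    */
    };

    static unsigned shared_pool = 40000;   /* 50000 total - 10000 reserved */

    bool admit_session(struct host_quota *h)
    {
        if (h->in_use >= h->limit)
            return false;                  /* per-host session limit hit */
        if (shared_pool > 0) {
            shared_pool--;                 /* normal case: free-for-all pool */
            h->from_shared++;
            h->in_use++;
            return true;
        }
        if (h->in_use - h->from_shared < h->reserved) {
            h->in_use++;                   /* pool exhausted: fall back to
                                              this host's reserved entries */
            return true;
        }
        return false;                      /* nothing left for this host */
    }

On session teardown the counters are decremented the same way in reverse: a session taken from the shared pool returns there, and a reserved entry returns to the host's reservation.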

The session reservation database is a set of rules. Each rule contains the following information:
  • Rule ID: Identification of the rule.
  • Description: description of this record.
  • IP Address information: IP addresses for which this rule applies. All the action information in this rule is specific to each IP address.
  • Connection Direction: Outgoing or incoming. Indicates whether the sessions are reserved for connections made by the machines represented by 'IP addresses' or for connections terminated by these IP addresses. 'Outgoing' indicates this rule is applied to connections originated by them, and 'incoming' indicates this rule is applied to incoming connections.
  • Zone : Indicates the zone ID. If 'Connection Direction' is outbound, then zone indicates the destination zone. If 'Connection Direction' is inbound, then zone indicates the source zone.
  • ReserveCount: Number of sessions reserved for this rule.
The session limits database also contains a set of rules. Each rule contains the following information:
  • Rule ID: Identification of the rule.
  • Description
  • IP address Information: IP addresses for which this rule applies.
  • Connection Direction: Outgoing or Incoming.
  • Zone: Zone ID
  • Limit Count: Number of maximum sessions for each of IP addresses.

TR-069 data profile:
  • internetGatewayDevice.security.VirtualInstance.{i}.firewall.maxSessionReservationRules: R, unsigned Int
  • internetGatewayDevice.security.VirtualInstance.{i}.firewall.maxSessionLimitRules: R, unsigned Int
  • internetGatewayDevice.security.VirtualInstance.{i}.firewall.sessionReservations.{i} PC
    • ruleID: RW, Unsigned Int, value between 1 and maxSessionReservationRules.
    • description: RW, String(128)
    • ipAddressType: RW, String(32). It takes values such as 'immediate', 'ipobject'. Immediate indicates that IP addresses are given as values and 'ipobject' indicates the IP address information points to one of the IPObjects.
    • ipAddresses: RW, String(64) - If the type is 'immediate', then it can be a single IP address in dotted decimal form, a subnet given as a network IP address and prefix length, or a range of IP addresses with '-' between the low and high values. If the type is 'ipobject', then it holds one of the ipobject names from the internetGatewayDevice.security.VirtualInstance.{i}.NetworkObjects.IPValueObject.{i} table or the internetGatewayDevice.security.VirtualInstance.{i}.NetworkObjects.IPFQDNObject.{i} table. 'any' is a special value indicating all source IP values. Examples: 10.1.5.10 or 10.1.5.0/24 or 10.1.5.1-10.1.5.254
    • connectionDirection: RW, String(16). It takes values 'outgoing', 'incoming'.
    • zoneID: String(32), RW - One of the Zone IDs. It takes value of ZoneName from internetGatewayDevice.securityDomains.VirtualInstance.{i}.Zone.{i} table.
    • reserveCount: RW, Unsigned Int.
  • internetGatewayDevice.security.VirtualInstance.{i}.firewall.sessionLimits.{i} PC
    • ruleID: RW, Unsigned Int, value between 1 and maxSessionLimitRules.
    • description: RW, String(128)
    • ipAddressType: RW, String(32). It takes values such as 'immediate', 'ipobject'. Immediate indicates that IP addresses are given as values and 'ipobject' indicates the IP address information points to one of the IPObjects.
    • ipAddresses: RW, String(64) - If the type is 'immediate', then it can be a single IP address in dotted decimal form, a subnet given as a network IP address and prefix length, or a range of IP addresses with '-' between the low and high values. If the type is 'ipobject', then it holds one of the ipobject names from the internetGatewayDevice.security.VirtualInstance.{i}.NetworkObjects.IPValueObject.{i} table or the internetGatewayDevice.security.VirtualInstance.{i}.NetworkObjects.IPFQDNObject.{i} table. 'any' is a special value indicating all source IP values. Examples: 10.1.5.10 or 10.1.5.0/24 or 10.1.5.1-10.1.5.254
    • connectionDirection: RW, String(16). It takes values 'outgoing', 'incoming'.
    • zoneID: String(32), RW - One of the Zone IDs. It takes value of ZoneName from internetGatewayDevice.securityDomains.VirtualInstance.{i}.Zone.{i} table.
    • limitCount: RW, Unsigned Int.

Common mistakes by TR-069 CPE developers - Tips for testers

There are some mistakes that can go undetected for a long time. Testers and certification agencies need to watch out for them. I have tried to list some of these mistakes in TR-069 based CPE devices.

Problems associated with instance numbers: It is expected that instance numbers are persistent across reboots. But this does not always seem to happen. Typical problems observed are:
  • Instance numbers are not stored in persistent memory: If instance numbers are not stored in and retrieved from persistent memory, then the CPE device may provide duplicate instance numbers in AddObject RPC responses the next time it comes up. The ACS may reject them, and new records will not get created. On a side note: surprisingly, some ACS systems don't even care to check for duplicate instance numbers.
  • Instance numbers are stored and retrieved, but the relationship with the actual table records is not maintained: That is, the ACS has one view of the instance-number-to-row mapping and the CPE device has a different one. This causes problems when modifying values: the ACS thinks it is modifying the parameters of a specific row, but the CPE modifies some other record.
  • Instance numbers are stored and retrieved for records which were created and populated with values, but not saved for unpopulated records: This scenario occurs when AddObject is successful, but the CPE is restarted before the record is populated with values. If the CPE does not restore these instance numbers, configuration of that record will fail when the CPE comes back, as the ACS sends the configuration with the instance number the CPE returned before.
Problems associated with ParameterKey: ParameterKey is also expected to be stored in persistent memory across reboots. Some ACS systems use it as a configuration checkpoint: the ACS learns the CPE's configuration state by reading the parameterKey value from the CPE, and uses it to figure out the 'configuration diff' between what it thinks the CPE has and the latest configuration it holds for the CPE. Only this difference is typically sent to the CPE. Due to this, it is expected that the parameterKey is stored in persistent memory along with the rest of the configuration. Typical problems observed are:
  • ParameterKey and configuration are not stored in persistent memory: The ACS then reconfigures the device from scratch every time the device restarts. This is a problem because the device is unavailable until the ACS finishes configuring the box.
  • Configuration is saved and retrieved, but not the ParameterKey: Once the device restarts, the ACS thinks the CPE has no configuration and reconfigures the device from scratch. Since the configuration was actually retrieved, there would be many duplication errors. The ACS might get confused and stop pushing configuration until an admin intervenes. In some cases I have observed an intelligent ACS reset the device to factory defaults and reconfigure it by sending the entire configuration, without any manual intervention from the ACS administrator.
  • Configuration and parameterKey are saved at different times: This is dangerous. Essentially, the mapping between configuration and parameterKey is broken. When the device comes back, the ACS view and the device view of the configuration differ. (A sketch of saving both atomically follows.)
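The last pitfall suggests that configuration and ParameterKey should be committed as one atomic unit. A common way to get that on flash-backed devices is write-temp-then-rename; the sketch below assumes illustrative file paths and a trivial serialization format.

    /* Persist configuration and ParameterKey together, atomically:
     * write both to a temporary file, flush it to stable storage,
     * then rename over the old copy. Paths and the serialization
     * format are illustrative assumptions. */
    #include <stdio.h>
    #include <unistd.h>

    int save_checkpoint(const char *config_blob, const char *parameter_key)
    {
        FILE *f = fopen("/flash/config.tmp", "w");
        if (!f)
            return -1;
        fprintf(f, "%s\nPARAMETERKEY=%s\n", config_blob, parameter_key);
        fflush(f);
        fsync(fileno(f));   /* make sure the bytes hit flash */
        if (fclose(f) != 0)
            return -1;
        /* rename() is atomic: after a reboot either the old
         * config+key pair or the new pair exists, never a mix. */
        return rename("/flash/config.tmp", "/flash/config.dat");
    }
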
Problems associated with Access Control and Notifications: Access control is one feature CPE vendors forget to provide. Typical problems observed are:
  • No support provided for Access Control: This is one of the important features for managed service providers. If the end user changes important configuration and makes a mistake, debugging may take the service provider significant time. Due to this, service providers would like to allow subscribers to change only specific configuration. Not having this support in the device makes that intention difficult to realize.
  • Support provided for Access control and notification only for some variables, but not all.
  • CPE sends notifications for variables it was not asked to report on.
Many problems, I believe, are the result of configuration parameter mismatches between local configuration methods and the TR-069 method. For example, the web interface on a CPE might have been developed long before the TR-069 based remote configuration method was implemented. Some CPE developers forget to integrate their web interface backend functionality with TR-069. In some cases, they do a poor job of the integration.

Cloud computing security is going to catch up.

Please see this article in InformationWeek.

Google and IBM are teaming up to provide cloud services. Google is already providing email and storage services, and they want to go beyond that.

One interesting thing that was mentioned in the article is
"With the exception of security requirements, "there's not that much difference between the enterprise cloud and the consumer cloud," Google CEO Eric Schmidt said earlier this month during an appearance in Los Angeles with IBM chief Sam Palmisano."

One more quote from the article:
"The cloud has higher value in business. That's the secret to our collaboration."

Another thing I observed in the article is their planned usage of Xen.

Putting all of this together:

  • Cloud computing requires security. Otherwise, enterprises may not be able to offload their servers to the cloud.
  • Cloud computing makes use of Virtualization.

I was giving choices in my earlier blog on *Cloud computing and Security*. Though the InformationWeek article does not give enough information on how the security services are going to be offered, they will start thinking about it soon.

I am beginning to think that both kinds of models which I suggested earlier would be used.

  • Flexibility for Enterprises to put their preferred vendor security products as virtual appliances.
  • Providing security using one mega security appliance.

My prediction is that a mega security appliance will be required to provide typical infrastructure security, while virtual appliance flexibility will be provided for specialized security.

Sunday, May 11, 2008

Packet processing applications - Updating to Multicore processors

I described the need for session parallelization and some programming tips here. There are many network packet processing applications which don't take advantage of SMP systems. I have tried to describe the steps required to convert these applications to run on SMP systems with as few modifications as possible.

Target packet processing applications:

  • Packet processing applications that maintain sessions (flows), such as firewall, IPsec VPN, intrusion prevention software, traffic management, etc.
  • Packet processing applications where a significant number of cycles is spent on each packet. The session parallelization technique adds some checks and queuing (enqueue and dequeue) operations on the packets of a session. It only pays off if the number of CPU cycles required for packet processing is much more than the CPU cycles taken by those few checks and queue operations. Otherwise, you are better off taking locks during packet processing: for example, a simple IP forwarding application spends few cycles per packet and is better parallelized with locks than with the session parallelization technique.

Let us examine the typical packet processing application requirements.

  • Upon the first packet of a session, they create a context - the session context. These session contexts are kept in some run-time data structure such as hash buckets, linked lists, arrays, or some other container - I call it the home base structure.
  • Subsequent packets of the session get the session context from the home base data structure.
  • Session contexts are deleted either by timers, by user actions, or due to some special packets.
  • Session contexts are typically 'C' structures with multiple members (states)
    • Some members are set during session context creation time and never changed. I call them "SessionConstants".
    • Some members are manipulated during packet processing, and those values are needed by subsequent packets; they are not required by any function other than packet processing. I call them "Session Packet Variables".
    • Some members are manipulated by both packet processing and non-packet processing contexts. I call those "Session Mutex Variables".
    • Some members are used as containers by other modules. That is, storage and retrieval are the responsibility of those modules. I call them "Session Container Variables".
  • Any packet processing application might contain multiple modules, and each module has its own sessions (control blocks). To improve performance, among other reasons, modules at times store other modules' control block references in their sessions. Likewise, the current module's session might be referred to by other modules. When a session is actively used by other modules holding a pointer to it, the session must not be freed until all active references are dereferenced. At the same time, the delete operation can't be postponed until every other module dereferences it. To satisfy both requirements, each control block typically contains two variables - "Reference Count" and "Delete Flag". The reference count is incremented when external modules store a reference to the session. When the session is being deleted (due to an inactivity timer, special packets, user actions, etc.) and references are still held, the module sets the "Delete Flag" of the session. Any search on the sessions ignores sessions having the "Delete Flag" set. The module is also expected to inform external modules that the session is being deleted; upon this notification, external modules are expected to remove their references, decrementing the reference count of the session as part of it. The session is freed only after all references are removed and the Delete Flag is set. (A skeletal session context is sketched below.)
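As a reference for the steps that follow, here is a skeletal session context with the four member categories marked. All field names are placeholders of my choosing.

    /* Skeletal session context showing the member categories above.
     * All field names are illustrative placeholders. */
    #include <stdint.h>
    #include <pthread.h>

    struct session_ctx {
        /* Home base linkage (e.g. hash bucket chain) */
        struct session_ctx *next;

        /* Session constants: set at creation, never changed */
        uint32_t src_ip, dst_ip;
        uint16_t src_port, dst_port;
        uint8_t  protocol;

        /* Session packet variables: touched only in packet context;
         * safe without locks under session parallelization */
        uint64_t pkt_count, byte_count;
        uint32_t tcp_state;

        /* Session mutex variables: shared between packet and
         * non-packet contexts, protected by their own lock */
        pthread_mutex_t stats_lock;
        uint64_t policy_hits;

        /* Session container variables: references owned by other
         * modules, each guarded by its own lock */
        pthread_mutex_t container_lock;
        void *ipsec_sa_ref;

        /* Lifetime management */
        pthread_mutex_t ref_lock;
        int ref_count;
        int delete_flag;
    };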

Process to follow to make applications SMP aware (Example: Applications running in Kernel space)

  • Identify the Session or Control block of the target module.
  • Define two additional variables in the control block - Reference Count and Delete flag.
  • Identify the home base structure. Since sessions can be created in multiple CPU contexts, define a lock for this structure. It is better to define a read/write lock: the read lock can be taken when a session is searched in the home base data structure, and the write lock needs to be taken while adding or deleting a control block to/from it. Ensure the reference count is incremented while adding to the home base structure. Remove a block from the home base structure only if the reference count is 1 and the Delete Flag is set to TRUE.
    • Sometimes the home base could be another module's control block. In this case, it is the responsibility of that other module (the container module) to store, retrieve and reset this control block reference atomically.
    • Always ensure to initialize the control block completely before adding it to the home base data structure.
    • Ensure that control block reference count is incremented within the home base lock for both add and search operations.
  • Identify the session constants. Nothing needs to be done for these for SMP.
  • Identify Session Packet Variables. There is no need to lock these variables if "Session Parallelization" technique is employed.
  • Identify the Session Mutex variables. Further identify the logical groupings of these variables. For each of these logical groupings:
    • Define a lock. This lock can be global lock or session lock. Session locks are good for performance reasons, but it requires more memory.
    • Ensure to define MACROS/inline/functions to manipulate and access variables of the group.
  • Identify the container variables. If this control block is the home base structure for other modules' control blocks, then define MACROs/inlines/functions to add/retrieve/reset. These macros would be used by the other modules.
    • Define this set of MACROS for each container variable.
    • Define a session lock for each container variable. Above MACROS use this lock to protect the integrity of the container value.
    • Since the ADD macro keeps a reference to a foreign module's control block, ensure the reference count of the foreign control block is incremented. To allow this, the ADD macro should expect a function pointer passed to it by the foreign module; the function pointed to by it is used to increment the reference count.
    • Expect the "Increment Reference Count" function pointer to be passed to the RETRIEVE macro as well. Since the RETRIEVE macro returns a pointer into the foreign module, it is necessary to increment the reference count to ensure the pointer is not freed prematurely.
    • Ensure that complete functionality of ADD/RETRIEVE/RESET MACROS work under the lock defined for each container variable.
  • Some guidelines on the Reference Count and Delete Flag variables (a minimal sketch follows this list):
    • Define "IncRefCount", "DecRefCount", "SetDelete" and "GetDelete" macros.
    • Define a lock to manipulate and access "Reference Count" and "Delete flag" variables.
    • Use this lock only in above macros/inline functions.
    • Use above macros for manipulating and accessing these variables.
    • Do the first level of cleanup of the control block when the "SetDelete" macro is called. This cleanup involves informing external modules of the intention to go down and also removing any foreign module references. Foreign references are the ones this module stored as part of its packet/session processing; removing them involves decrementing the reference count of each foreign reference and invalidating the reference in the control block.
    • Do the final cleanup, such as freeing the control block memory, only when the "Reference Count" is 1 and the "Delete Flag" is set to TRUE. Before freeing the memory, remove the block from the home base structure. This condition typically triggers as part of the "Decrement Reference Count" macro.
    • In non-SMP scenarios, reference counts are incremented only when external modules store a reference to the session. In SMP scenarios, while one processor is processing the packets of a session, another processor can delete the session. Due to this, the reference count must be incremented even while packet processing is happening. That is, whenever a "search" for a session is done, or the session is retrieved from a container module, the reference count must be incremented. As indicated above, retrieval of the session, including incrementing the reference count, must be done under the data structure/container lock. Since incrementing the reference count happens within its own lock, you get a lock within a lock in this scenario: the reference count lock is taken while the data structure or container lock is held. This is OK, as the data structure or container lock is never taken under the reference count lock, even in cases where a session is added to or removed from the data structure/container.
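Here is a minimal sketch of the reference count/delete flag helpers described in the guidelines above, written against the session_ctx skeleton shown earlier; homebase_remove() and free_session() are assumed to exist elsewhere in the module.

    /* Reference count / delete flag helpers for the session_ctx
     * sketched earlier. homebase_remove() and free_session() are
     * assumed to exist elsewhere in the module. */
    #include <stdbool.h>
    #include <pthread.h>

    void homebase_remove(struct session_ctx *s);  /* takes write lock inside */
    void free_session(struct session_ctx *s);

    void inc_ref_count(struct session_ctx *s)
    {
        pthread_mutex_lock(&s->ref_lock);
        s->ref_count++;
        pthread_mutex_unlock(&s->ref_lock);
    }

    void set_delete(struct session_ctx *s)
    {
        pthread_mutex_lock(&s->ref_lock);
        s->delete_flag = 1;
        pthread_mutex_unlock(&s->ref_lock);
        /* First-level cleanup happens here: notify external modules
         * and drop the foreign references this session holds. */
    }

    void dec_ref_count(struct session_ctx *s)
    {
        bool destroy;
        pthread_mutex_lock(&s->ref_lock);
        s->ref_count--;
        /* Final cleanup only when the home base holds the last
         * reference (count == 1) AND the session is marked deleted. */
        destroy = (s->ref_count == 1 && s->delete_flag);
        pthread_mutex_unlock(&s->ref_lock);
        if (destroy) {
            homebase_remove(s);
            free_session(s);
        }
    }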

Saturday, May 10, 2008

DDOS Mitigation functionality - Tips for Admin

There is some discussion going on in the focus-ids mailing list on DDOS attack mitigation. That discussion prompted me to write this article. ISIC, UDPSIC, TCPSIC and ICMPSIC are some tools used to measure a network security device's effectiveness in detecting and mitigating DDOS attacks. As we all know, one of the main intentions of DDOS attacks is to make a service, network or target unavailable. By looking at the packets, you can't tell the difference between normal genuine packets and packets generated by DDOS attacks. This makes it difficult to stop these attacks with signature based methods.

One of the properties of many DDOS attacks is that they try to make discovery of the attack source difficult. Two types of DDOS attacks are common.
  • Spoofing of the source IP address in packets: DDOS attacks are generated by spoofing the source IP address of the packets. The ISIC, UDPSIC, TCPSIC and ICMPSIC tools simulate these kinds of attacks. Any packet sent back to the source does not reach the attacker. Due to this, TCP based sessions don't get established. Note that non-TCP sessions don't have a connection establishment phase.
  • Botnets: The attacker instructs agents installed on compromised hosts across the globe to bombard the target. The attacker keeps changing the hosts that attack the target, in effect making source discovery ineffective.
Now, in addition to botnets, there is a third kind of DDOS attack. The recent DDOS attack on cnn.com by Chinese hackers is one example. Here too, the sources are known, but there are many. Note that the attack on cnn.com was not generated by botnets but, supposedly, out of patriotism: some in China felt that CNN was biased in its reporting on the Olympic torch and its linkage to Tibetan religious freedom. As I understand it, it was a simple attack which connects to the cnn.com website and accesses some URL every 2 seconds. This attack executable was distributed, and many home users in China, out of patriotism, executed it.

DDOS attack incident detection may be easy, but mitigation is difficult. If the intention of the attack is to consume the bandwidth of the target site, there is not much the target network administrator can do; the target company/organization needs to depend on its ISP to block the flood of packets. Gathering as much information as possible and providing it to the ISP is one of the things administrators can do.

The current trend of DDOS attacks goes beyond consuming link bandwidth. With a smaller number of hosts participating in the attack, these attacks consume the CPU and memory bandwidth of target networks/servers. I feel that network security appliances providing DDOS attack mitigation functionality can help in this scenario: they not only provide detection, but can also stop the bombardment of servers.

There are multiple products, *DDOS mitigators*, in the market claiming to solve some of the above problems. Many IPS boxes also support this feature.

If you are hosting servers, you can be a victim. As an administrator, I would look for the following features in these appliances.

A DDOS attack can consume a 1 Mbps link by making approximately 512 connections/sec. Any DDOS mitigator should ideally be able to process 512 connections every second per 1 Mbps of link. If each connection is maintained for 20 seconds (which is typical), then the connection capacity needs to be about 10K. For a 100 Mbps link, a DDOS mitigation appliance needs to support 51,200 connections/sec and should have 1M session capacity. With this capacity and connection rate, it can do a better job of protecting internal networks/servers/other stateful security devices without itself getting bogged down.
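Working that arithmetic through (the per-connection size is my back-calculation from the numbers above):

    1 Mbps / 512 connections/sec  = ~1953 bits = ~244 bytes per connection attempt
    512 connections/sec x 20 sec  = 10,240 = ~10K concurrent sessions (1 Mbps link)
    100 Mbps: 512 x 100 = 51,200 connections/sec; 51,200 x 20 sec = ~1M sessions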

DDOS mitigators are expected to limit the amount of traffic that goes to internal servers/machines/networks. Each resource in the network has some limit on how much traffic, how many connections, and how many connections/sec it can take. Admins, once they make a list of resources and their limitations, should be able to configure the DDOS mitigators accordingly. DDOS mitigators must ensure that the resources are not flooded, shaping the traffic as needed. DDOS mitigators need to provide features like:
  • Ability to configure
    • connections/sec
    • Packets/sec
    • Bytes/sec
      • On per resource basis - Server/machine basis, Network basis
      • From a given source with respect to IP address range, Subnet.
  • Ability to configure traffic filters on any combination of the 5 tuples.
As you may have observed, admins not only would like to shape the traffic to internal resources with respect to connections/sec, maximum number of connections and throughput, but would also like to apply these limits per source. At times, I also observe a requirement to limit the amount of traffic within each 5-tuple connection, between any combination of IP addresses. Mitigators need to provide this flexibility without expecting admins to create many rules; it is often not even possible to create rules for all combinations of IP addresses. DDOS mitigators need to allow creating rules with ranges and subnets, along with a provision to configure the granularity at which the specified traffic rates apply. For example, an admin should be able to configure ANY to 'internal HTTP server IP addresses' with 10 connections/sec for every combination of source IP and destination IP. If 100 different sources are trying to access the internal HTTP servers, the DDOS functionality should be able to rate limit the connections to 10/sec for each source independently. (A simplified sketch of this kind of per-pair rate limiting follows.)
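The per-pair granularity in that example is typically implemented as a counting/rate bucket keyed by the (source IP, destination IP) pair. A deliberately simplified sketch, where the hash sizing, the fixed 1-second window and the "collision resets the slot" behavior are all illustrative shortcuts:

    /* Simplified per-(source,destination) connection-rate limiter.
     * Hash sizing, the fixed 1-second window and the collision
     * handling are illustrative simplifications. */
    #include <stdint.h>
    #include <stdbool.h>
    #include <time.h>

    #define BUCKETS 65536

    struct rate_entry {
        uint32_t src_ip, dst_ip;
        time_t   window_start;   /* start of the current 1-second window */
        unsigned count;          /* connections seen in this window */
    };

    static struct rate_entry table[BUCKETS];

    /* Returns true if a new connection from src to dst is allowed
     * under 'max_per_sec' (e.g. 10 connections/sec per pair). */
    bool allow_connection(uint32_t src_ip, uint32_t dst_ip, unsigned max_per_sec)
    {
        struct rate_entry *e = &table[(src_ip ^ dst_ip) % BUCKETS];
        time_t now = time(NULL);

        if (e->src_ip != src_ip || e->dst_ip != dst_ip || e->window_start != now) {
            e->src_ip = src_ip;          /* new pair or new window: reset */
            e->dst_ip = dst_ip;
            e->window_start = now;
            e->count = 0;
        }
        if (e->count >= max_per_sec)
            return false;                /* over the configured rate */
        e->count++;
        return true;
    }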

As with any security device, it must also support multiple zones and provide flexibility with respect to zones. In hosting environments, the provider may be servicing multiple customers, so virtual instances are needed, with each instance belonging to a customer. In Enterprise environments, normally only one virtual instance would be used.

Flexibility is expected to be provided to disable traffic limiting for some source networks, for example networks belonging to remote offices. This feature is called white listing.

Of course, it is expected that DDOS mitigators provide facilities to stop half-open connections via TCP SYN flood protection, UDP session exhaustion protection, facilities to configure service inactivity timeouts for interactive protocols, etc.

Thursday, May 8, 2008

UDP Broadcast Relay : TR-069 Support

UDP broadcast relay functionality became very popular due to NetBIOS. Broadcast packets are used by NetBIOS for name resolution. Windows Network Neighborhood is one feature that makes use of the NetBIOS name service. Due to the broadcast mechanism, the NetBIOS name service works only within a subnet; if there are multiple subnets, then a WINS server is required. Broadcast relay functionality in the routers separating the subnets eliminates the need for WINS servers. Name resolution using the UDP broadcast relay function can even be extended to networks in remote offices by relaying broadcast packets over VPN tunnels.

UDP broadcast relay functionality in routers receives broadcast packets and sends them to other subnets by replacing the destination IP of the original packet with the destination subnet's broadcast address.
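The core of such a relay is small. Below is a user-space C sketch for a single relay rule; the NetBIOS name service port 137 and the target broadcast address 192.168.2.255 are example values, and a real router implementation would do this in the forwarding path and filter its own relayed packets to avoid loops.

    /* User-space sketch of a UDP broadcast relay for one rule:
     * receive a broadcast datagram and re-send its payload to
     * another subnet's directed broadcast address. Port 137 and
     * 192.168.2.255 are example values. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <sys/types.h>

    int main(void)
    {
        char buf[2048];
        int on = 1;
        int s = socket(AF_INET, SOCK_DGRAM, 0);
        setsockopt(s, SOL_SOCKET, SO_BROADCAST, &on, sizeof(on));

        struct sockaddr_in local = { 0 };
        local.sin_family = AF_INET;
        local.sin_port = htons(137);              /* NetBIOS name service */
        local.sin_addr.s_addr = htonl(INADDR_ANY);
        bind(s, (struct sockaddr *)&local, sizeof(local));

        struct sockaddr_in relay = { 0 };
        relay.sin_family = AF_INET;
        relay.sin_port = htons(137);
        inet_pton(AF_INET, "192.168.2.255", &relay.sin_addr);

        for (;;) {
            ssize_t n = recv(s, buf, sizeof(buf), 0);
            if (n > 0)   /* "rewrite" the destination by re-sending */
                sendto(s, buf, (size_t)n, 0,
                       (struct sockaddr *)&relay, sizeof(relay));
        }
    }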

Since firewall/VPN gateways are also routers, this functionality is implemented in many firewall/VPN gateways. These gateways give administrators control over the type of broadcast packets to relay and the destination subnets to relay to. Multiple such rules can be created for different types of broadcast addresses.

The configuration consists of a set of rules, each containing the incoming broadcast IP address/interface and the relay subnets/interfaces. Rules can be created and deleted by the administrator.

TR-069 profile:
  • internetGatewayDevice.security.VirtualInstance.{i}.UDPBroadcastRelay.{i} PC
    • name: String(32), RW, Mandatory - Identification of the broadcast relay rule. Once the rule is created, this value can't be changed.
    • description: String(128), RW, Optional
    • enable: Boolean, RW, Mandatory: Value 1 indicates the record is enabled and 0 is used to disable.
    • incomingBroadcastAddressType: String(16), RW, Mandatory - Indicates whether the broadcast address is represented as an IP address or Interface identifier. Takes one of the values "ipaddress", "interface".
    • incomingbroadcastAddress: String(128), RW, Mandatory - Either dotted IP address or Fully qualified TR-069 instance of VLAN, LANDevice, WANPPPConnection or WANIPConnection etc.
    • incomingbroadcastPort: Integer, RW, Mandatory - Destination Port of incoming broadcast packet.
    • internetGatewayDevice.security.VirtualInstance.{i}.UDPBroadcastRelay.{i}.relayTo.{i} PC
      • relayBroadcastAddressType: String(16), RW, Mandatory - Indicates whether the relayTo broadcast address is specified as an IP address or an interface - Takes one of the values "ipaddress", "interface".
      • relayBroadcastAddress: String(128), RW, Mandatory - Either a dotted IP address or a fully qualified instance of an interface from VLAN, LANDevice, WANPPPConnection or WANIPConnection.
In case of remote subnets, relayBroadcastAddress is specified as the remote subnet's broadcast IP address. If the subnets are directly attached to the router, then interface names can be used in the relayBroadcastAddress field.

Monday, May 5, 2008

Packet Ordering requirements in network infrastructure devices

One of the goals of the Internet is to maintain packet ordering. This goal requires that infrastructure devices don't change the order of packets. That is, ingress packets from a port go out in the same order on the egress ports after they go through processing.

There are many types of infrastructure devices in the path of packets. Some infrastructure devices now not only do routing or switching, but also perform many other functions such as deep packet inspection, firewall, application detection, IPS, IPsec VPN, etc. So it becomes difficult for network infrastructure devices to keep up with this requirement. Based on my understanding of some deployments, this requirement is indeed relaxed. My understanding of the packet ordering requirements now is:

  • If the infrastructure device is a pure router or bridge (switch), it is expected that all ingress packets from a port go out in the same order on the egress ports. Routers and switches might not send all ingress packets from one port to a single egress port; that is, there is no one-to-one correspondence between ingress port and egress port. Packet order is expected to be maintained across the ingress packets of a port which are going to a given egress port. That is, if a set of packets from port1 is going to port2, then the order of this set of packets is expected to be maintained. Routers and switches are not expected to maintain packet order across packets going to different egress ports.
  • It is difficult to find any router/switch without a traffic prioritization function on the egress port. Routers classify packets into different priority bands based on the DSCP value, and switches do this based on either the DSCP value or the CoS value found in 802.1q headers. The traffic prioritization function sends higher priority packets before lower priority packets. So the packet ordering requirement does not extend to packets belonging to different priorities. But packets from an ingress port belonging to the same priority and going to the same egress port must go out in the order they were received.
  • Firewall, IPS, DPI and other stateful applications work on 5-tuple sessions. Here the packet ordering is expected to be kept intact within a session; there is no requirement to keep the ordering across sessions. This works fine for VOIP and other real-time traffic scenarios, where it is important to keep jitter low: since jitter buffering is done on a per-session basis by the VOIP endpoints, ensuring the packet order is not changed within a session is sufficient.
  • IPsec is a tunneling protocol, and one tunnel may carry many sessions. Since the sessions are not visible inside the tunnel, it is required that the IPsec function maintains packet order within each security association (tunnel).
Based on comments from some service providers, my understanding is that 0.001% of packets going out of order is acceptable.

Any comments?

Thursday, May 1, 2008

Configuration Synchronization between TR-069 devices and ACS

Most of the time, configuration for TR-069 based devices is done at the provider end (the ACS end). The ACS pushes the configuration when the device connects to it. TR-069 also provides a facility for subscribers to change the configuration locally, but providers have control over which functions of the device can be changed by the subscriber. The 'SetParameterAttributes' RPC method defined by the TR-069 specification carries the 'access control' attributes of parameters.

When subscribers are allowed to change the configuration, if no caution is taken, the configuration view can become vastly different between the device and the ACS over time. That might lead to wrong conclusions by the people administering the device. Fortunately, TR-069 provides a way for the ACS to be notified of changes made by local subscribers. As with 'access control', notification attributes are also sent using the SetParameterAttributes RPC method. If synchronization of configuration is required, it is necessary that the ACS sets notification attributes on the parameters which local subscribers are allowed to change.

It appears that notification of configuration changes works smoothly between devices and ACS for modifications of existing parameters. But it runs into rough weather when a local subscriber adds a record to a table object or deletes a record from one. Based on my not-so-extensive experience with different devices and ACS systems, it seems that this was not taken care of well, and TR-069 is not helping in this regard either.

Let us examine the different approaches that are possible.
  • Approach 1: Definition of one parameter, 'Number Of Entries', for every table object. The ACS can set the notification attribute on this parameter. Whenever a local subscriber adds/deletes an instance of the table object, this parameter gets incremented/decremented. Since notification is set on it, the ACS gets informed. This has a performance disadvantage: when the 'Number Of Entries' parameter changes, the ACS only gets notified of the new value, and it is then the ACS's responsibility to find the difference between its configuration and the configuration in the device. That might require walking through the table in the device and making modifications to its database. Another disadvantage is that this approach does not work with any existing data profile whose table objects lack this special 'Number Of Entries' parameter.
  • Approach 2: Setting the notification attributes on the 0th instance. It is observed that many devices don't use the 0th instance for records in table objects, so this instance can be used to carry the notification. If notification is set on the 0th instance, it can be treated as indicating addition/deletion of records to/from the table. Whenever a record is added/deleted by the subscriber, the device checks the notification on the 0th instance. If it is set, the device sends the instance number of the newly created or deleted record as a parameter, with value 1 for 'Add' and 2 for 'Delete', as part of the 'ParameterList' of the Inform method. By this, the ACS knows the instance number of the newly added or deleted record. For a newly added record, it can issue a 'Get' operation on that specific instance number to read the parameter values; for a deleted record, it can remove the record from its database.
Approach 1 does not require any changes to the TR-069 specification, but has the disadvantage of performance issues if the number of instances in a given table is high. Approach 2 is clean, but requires an addition or clarification to the TR-069 specification.

If anybody chooses to use TR-069 based device management for Enterprise devices, I strongly suggest going with Approach 2.

TR-069 Protocol and applicability in Network security devices - Opinion

Recently a security software developer asked me a question. He wanted to know whether the TR-069 protocol is suitable for managing network security devices. Further, he wanted to know what enhancements I would like to see in the TR-069 protocol, if any.

I am sure that quite a bit of thought went into defining the RPC methods and their usage. It is certainly easier to create new data models and the associated software in device and ACS implementations when you compare it with SNMP based management.

TR-069 is certainly suitable for network security application configuration. But there are some points developers should be aware of.

AddObject method:
One thing I have difficulty understanding is the intention behind defining the 'AddObject' method the way TR-069 does. 'AddObject' does not take any parameter values. To add a row to a table, two methods are required from the ACS - AddObject followed by SetParameterValues. Due to this, TR-069 implementations in security devices need some complications and additional logic.

Configuration of many security functions, such as firewall, IPsec, AntiX and IPS, consists of multiple rules (rows). Each row has its identification, and the values of this identification parameter (let us call it the Instance Key) are typically configured by administrators. This identification parameter must be unique within the table. Some configuration databases identify the rows by integers and some by strings. Once a record is created, the security software does not allow changes to the identification values. Of course, other parameters of the rule (row) are allowed to be changed.

TR-069 defines its own identification for each row in a table, called the instance number. The instance number is of integer type and is returned in the AddObject response. It can't be used to identify the record in a table as far as security applications are concerned; I guess this instance number exists mostly for the TR-069 protocol itself. Since the record name/ID (key) is not part of AddObject, the record can't be created in the security application upon reception of AddObject by the device. It needs to wait until the SetParameterValues method is sent by the ACS with the record identification value.

Though this is not a major hurdle to using TR-069 for configuring security applications, it could have been avoided if AddObject allowed setting parameter values.

I don't see any reason why a separate instance number is required as defined in TR-069. Avoiding the instance number and replacing it with the application Instance Key provides many advantages:

  • The device does not need to maintain state between the "AddObject" and "SetParameterValues" RPC methods to create the instance (row) in the applications.
  • Device complexity is reduced: today the device needs to maintain the mapping between instance numbers and application instance key values, even across reboots, to keep the device and ACS consistent.
  • The ACS does not need to maintain the mapping of its instances to each device's instance numbers. Since an ACS is expected to manage thousands of devices, maintaining this run-time mapping information limits ACS scalability.

I would like to see the TR-069 protocol enhanced to avoid the instance number approach for rows. As a matter of fact, local management engines don't have special instance numbers for each table row; one of the parameters itself is used as the key to identify the instances in the table. With that in mind, I would like to see the following approach (an illustration follows the list).

  • One of the parameters in each table object in the data model can be identified as the instance key. Only one parameter is required to identify the instance at that level, due to the tree structure of the data model and nested table objects: a table object is already identified uniquely by its upper (parent) table objects, so one parameter is good enough to identify an instance within the table object. By having one of the data model parameters as the key parameter, the ACS can also identify the row by this parameter value.
  • Today a configuration transaction is limited to one SetParameterValues method, and adding a row falls outside the transaction. To make adding an instance part of the transaction too, row creation, along with its parameters and values, should be made part of the SetParameterValues RPC method, and the AddObject RPC method should be eliminated.
  • Any modification to the instance at a later time needs to send the key value along with the parameter names. The existing instance number position in the parameter name can be replaced with the key value.
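As an illustration, using the sessionReservations table from the earlier post, and assuming ruleID were declared as its instance key, the parameter name would carry the application's own key instead of a protocol-assigned number:

    Today (TR-069-assigned instance number, unrelated to the application's key):
        internetGatewayDevice.security.VirtualInstance.1.firewall.sessionReservations.3.reserveCount
    Proposed (application instance key, here the rule whose ruleID is 7):
        internetGatewayDevice.security.VirtualInstance.1.firewall.sessionReservations.7.reserveCount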

Mandatory parameters in one RPC method:

Security applications make some configuration parameters 'must have'. Some of these configuration parameters can have default values, but not all; my experience shows that many mandatory parameters don't have any default value. Due to this, security application software typically expects all the mandatory parameter values as part of record creation. As discussed before, the actual object creation in the security software module is done when the first 'SetParameterValues' RPC method is received (after AddObject). So it is expected that the ACS sends a SetParameterValues method with all mandatory parameters and their values. The TR-069 protocol does not specify any rules in this regard, so ACS systems have the choice of sending these parameters in multiple RPC methods. That complicates the device implementation, which needs to wait until all mandatory parameters are received. This is not practical, as the device does not know when to give up waiting for the mandatory parameters. I believe there is an understanding that ACS systems always send most of the parameter values (including the mandatory parameters) in one RPC method.

Since this information is not documented, you might often hear from your QA folks that it is not working as expected. I wish the TR-069 specification clarified this explicitly. By the way, if the approach indicated in the section above (under 'AddObject method') is implemented, then this becomes a moot point.

Default Values for Parameters

The TR-069 data model guidelines don't clearly specify that all optional parameters of table objects, and all normal parameters, must have default values.

Many security devices, and I am sure other devices too, have a requirement of resetting to factory defaults upon user action. In addition, some devices provide an option for administrators/users to set all or some optional parameters of the existing configuration to their default values. To facilitate this, I think the TR-069 data model definition must mandate defining default values for non-mandatory parameters.

Factory Defaults

Many device applications are shipped with a default configuration. Many devices also support resetting the device to factory defaults via both hardware and software means. TR-069/TR-104 does not specify a way to define the factory defaults. Due to this, when a device resets its configuration to factory defaults, the ACS does not know the configuration of the device until it reads the configuration from the device by traversing the data model tree.

An additional benefit of defining factory defaults in a standard fashion is that it helps the device do the factory reset from a central place (the TR-069 client in the device) without requiring each application to do its own factory reset.

Validations

The TR-069/TR-104 data model definition facilitates parameter value validations such as data type, possible enumeration values, minimum and maximum length in case of string and base64 data types, and range of values in case of integers. But there are no validation definitions for the number of instances that can be created. Many applications limit the number of instances on a per-table-object basis. Having this number known to the ACS facilitates validation at the ACS itself. Note that this limit might differ from one device type to another: some generic software applications tune this number based on the amount of memory available on the device they are installed on, and sometimes this limit is also configurable by the local administrator.

TR-104 mandates the definition of a "Number of Instances" parameter for each table object. In addition, mandating one more parameter, "Max Instances Supported", for each table object would solve the above problem.

Nested Table Objects and Transactions

The TR-069 protocol dictates that all parameters within a "SetParameterValues" method are either set in their entirety or everything is rejected. For all practical purposes, a configuration transaction is limited to one instance of the RPC method. Device implementations ensure this by first checking all configuration parameters with each application using the applications' "test" functions, and then setting the configuration in the applications only if all "test" functions succeed - basically tests followed by sets, as in the SNMP world. The applications' "test" functions have their own limitations, though, especially when parameters of a nested table object instance are being checked while the parent table object instance is not yet available in the application, even though it is created within the same transaction. Since "set" functions are expected to be called only after all "test" functions succeed, the parent table instance would not yet have been created in the application, and hence the "test" function of the nested table object instance would return FAILURE. To avoid complicated "test" functions, I feel that TR-069/TR-104 should add a guideline for ACS developers to avoid setting parameters of nested table object instances along with parent table object instances in the same RPC method, even when preceded by the appropriate "AddObject" methods. (A sketch of the two-pass pattern follows.)
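The device-side pattern described above is essentially two passes over the parameter list. A minimal sketch, where the types and callbacks are assumed names and not any particular TR-069 stack's API:

    /* Two-pass SetParameterValues handling: "test" everything first,
     * "set" only if all tests pass. Types and callbacks are assumed
     * names, not any particular TR-069 stack's API. */
    #include <stdbool.h>

    struct param { const char *name; const char *value; };

    struct app_ops {
        bool (*test)(const struct param *p);   /* validate only */
        bool (*set)(const struct param *p);    /* commit the value */
    };

    /* Returns true if the whole transaction was applied. */
    bool set_parameter_values(const struct app_ops *ops,
                              const struct param *params, int n)
    {
        int i;
        /* Pass 1: reject the entire RPC on the first failed test
         * (SetParameterValues is all-or-nothing). */
        for (i = 0; i < n; i++)
            if (!ops->test(&params[i]))
                return false;
        /* Pass 2: commit. Note the limitation from the text: a test
         * on a nested-table parameter can wrongly fail in pass 1 if
         * its parent instance is only created during this pass. */
        for (i = 0; i < n; i++)
            ops->set(&params[i]);
        return true;
    }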

Nested Table Objects and Special cases

Even though it is true in all cases that a nested table instance can't exist without its parent table object instance, in some cases it is required that instances of some table objects are created together with at least one instance of a child table object. This is true specifically for security rules such as firewall ACL rules and IPsec SPD rules. The ACL and IPsec SPD can be represented as table objects in the data model. These rules contain 5-tuple selectors, with the source IP and destination IP represented by multiple sets of network objects. That is, the source IP (and destination IP) field of an ACL rule is itself a table object, to represent many networks. The administrator is allowed to add more networks to the "source IP" and "destination IP" tables at a later time, once the ACL rule is created. But when the ACL rule is created, the security application expects at least one network in the "source IP" and "destination IP" fields of the ACL rule object.

Today the data model does not allow expressing this kind of relationship. Due to this, ACS developers don't know about the dependency, and if it is not taken care of by the ACS, rule creation will be rejected by the device continuously.

It is necessary that the data model definition allows specifying this special relationship. One simple proposal is to indicate, for each table object, whether at least one instance of it is needed to create an instance of its immediate parent object.

    Ordered Lists

Rules in many security applications such as firewall, IPsec VPN, web application firewall etc. are ordered. That is, the rules are searched from the top rule to the bottom rule for a match in the table object, and hence the rules must be ordered in the list with the top rule being the highest priority rule and the bottom rule being the lowest priority rule. In the course of administration, administrators normally need to add rules in the middle of the list, at the top or bottom of the list, or change the priority of rules in the list. TR-069 does not provide any RPC methods to change the priority of the rules. Due to this, data models defining ordered lists are expected to have a 'priority' parameter. The administrator changes the value of this parameter to change the priority of a rule. Devices are expected to form the ordered list based on the value of the priority parameter across multiple instances in the table object. Though this works fine in cases where the configuration changes rarely, it poses performance penalties on the device. Let us get into the details.

• Security applications revalidate run time states every time there is a change in the rule list. Revalidation can be costly in systems where millions of states are created. For example, firewall applications in the Enterprise/Carrier market segments support more than a million sessions, where each session corresponds to a TCP/UDP/IP connection. If the rules are modified or their order is changed, the firewall application is expected to revalidate the sessions and remove sessions that are no longer valid with respect to the new rule base.

If a rule needs to be added in the middle, then the priority values of the rules below it need to be changed to accommodate the new rule, as sketched below. That is, if there are 500 rules and one rule is added at position 250, then the bottom 250 rules undergo a change. This results in 250 additional parameters being modified, hence 250 changes to the rule base in the device - and therefore 250 revalidations of millions of sessions. To avoid this performance penalty, it is necessary that each ordered list (table object) has one parameter outside the table object: "Revalidate Now". The ACS sets this value after all priority parameter values in the SetParameterValues method. When this parameter is set to 1, the device is expected to initiate the revalidation process. This avoids revalidation logic in the device for each rule change. The parameter value is not expected to be persistent; its purpose is only to trigger the revalidation action. Reading this parameter should always return 0, even after it is set.
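To illustrate the renumbering cost, here is a small sketch in C, assuming (hypothetically) that priorities are kept consecutive with no gaps:

```c
/* Inserting a rule at position 'pos' in a gap-free priority scheme:
 * every rule at or below 'pos' gets its priority bumped, and each bump
 * is one more parameter change sent over TR-069. */
struct rule { int priority; /* ... rule fields ... */ };

void insert_rule(struct rule *rules[], int count, struct rule *nr, int pos)
{
    int i;

    for (i = count - 1; i >= pos; i--) {
        rules[i + 1] = rules[i];
        rules[i + 1]->priority++;   /* 250 bumps in the example above */
    }
    nr->priority = pos + 1;         /* priorities run 1..count+1      */
    rules[pos] = nr;
}
```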

The ACS also needs to know which objects are ordered in nature, and the parameter name used to order the list. If the ACS does not have this information, then the administrator is forced to change the priority values of all 250 rules in the above example manually. That is not something administrators (users) would enjoy. With this information, the ACS can change the priority values itself based on simple user actions and communicate with the device appropriately. At the end of any action on the ordered list, the ACS should set the "Revalidate Now" parameter. To facilitate this, the data model definition should identify the ordered lists. My proposal is to introduce an "Ordered List" attribute on table objects. This attribute would not be present for non-ordered table objects.

The above guidelines on the data model work fine with the existing TR-069 protocol. Future TR-069 enhancements need to consider a different approach for ACS scaling by introducing new commands in the "SetParameterValues" RPC method. As explained above, the SetParameterValues RPC method also needs to take an 'Add' command in addition to the implicit 'Modify' command supported today. My proposal is to introduce a "Move" command for changing the order of instances in a table. The "Move" command takes the identification value of the target row and a command attribute with values such as "first", "last", "before" or "after". If the command attribute value is "before" or "after", it also takes the identification value of a relative row. "First" places the target row at the beginning of the list; "last" places it at the end of the list; "before" places the target row immediately before the relative row; and "after" places it immediately after the relative row. This proposal eliminates the need for a "priority" parameter. It also avoids sending multiple parameters when a row is added in the middle. Note that the "Revalidate Now" parameter still needs to be present for cases where more than one row is added or more than one "move" is initiated by the administrator.
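As a sketch of how a device might implement the proposed "Move" semantics internally, using the standard Linux list.h helpers (the enum and rule structure are illustrative):

```c
#include <linux/list.h>

enum move_where { MOVE_FIRST, MOVE_LAST, MOVE_BEFORE, MOVE_AFTER };

struct rule {
    struct list_head entry;
    /* ... rule fields ... */
};

/* Reposition 'target' within the ordered rule list; 'relative' is used
 * only for MOVE_BEFORE/MOVE_AFTER, mirroring the proposed command. */
void move_rule(struct list_head *head, struct rule *target,
               struct rule *relative, enum move_where where)
{
    list_del(&target->entry);
    switch (where) {
    case MOVE_FIRST:  list_add(&target->entry, head);                  break;
    case MOVE_LAST:   list_add_tail(&target->entry, head);             break;
    case MOVE_BEFORE: list_add_tail(&target->entry, &relative->entry); break;
    case MOVE_AFTER:  list_add(&target->entry, &relative->entry);      break;
    }
}
```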

    Log Export

One of the main features of network security devices is logging. Detection/protection of internal resources is as important as notifying administrators of any violations. When central management is used to configure network security devices, it is also expected that the central management console provides facilities to show alerts and notifications, analyze logs, and generate reports. The TR-069 protocol does not specify any protocol elements to send log messages to the ACS.

You may need to devise your own protocol (such as syslog over IPsec) for sending logs. Also, you need to work with ACS vendors to correlate the logs with the configuration.

    Wednesday, April 30, 2008

    Firewall session inactivity configuration - TR-069 Support

Firewalls maintain session entries for each 5-tuple connection. Session entries are created upon the first packet of the connection. TCP sessions are removed whenever a TCP RST packet is observed or ACKs for TCP FINs are observed in both directions. Sessions are also removed when there is no traffic for some period of time; this period is called the inactivity timeout period. Non-TCP sessions such as UDP and ICMP sessions are removed only due to inactivity, as they don't have any connection boundaries.

There are multiple application protocols (services) running on TCP and UDP transports, such as FTP, Telnet, SSH, HTTP, LDAP, RADIUS, SMTP, POP3, IMAP etc. Some application protocols are interactive and many are non-interactive. Telnet, SSH and FTP are interactive applications; HTTP, HTTPS and many others are non-interactive. In non-interactive applications, once the connection is made to the server, there is no user input until the connection is terminated - the entire user input is taken before making the connection. In interactive applications, user input is taken after the connection is established. The inactivity timeout period for non-interactive protocols can be on the order of tens of seconds. Since interactive applications wait for user input, a small inactivity timeout value may remove the session if the user does not provide any input for a long duration. So, interactive application protocols require a longer inactivity timeout. If a longer inactivity timeout is configured for non-interactive application protocols, there is a danger of keeping stale sessions for a long time, which may result in session entry exhaustion. So, there is a need to provide different inactivity timeout values for different protocols.

Please refer to the document on maximizing firewall availability. One of the techniques suggested there was 'INIT flow timer optimization'. This approach suggests having a separate inactivity timeout period during the connection establishment phase (3-way TCP handshake). This inactivity timeout value can be far less than the inactivity timeout needed after the connection is established.

Keeping both of the above points in mind, the session inactivity configuration includes the following:
• TCP Pre Connection inactivity timeout: Inactivity timeout during the connection establishment phase.
• UDP Pre Connection inactivity timeout: UDP does not have a connection establishment phase. For this discussion, the UDP connection establishment phase is considered complete when at least one packet has been seen in each direction of the connection (client to server and server to client).
• TCP Inactivity timeout: Inactivity timeout value after the TCP connection is established.
• UDP Inactivity timeout: Inactivity timeout value after the UDP session is established.
• Generic IP inactivity timeout: Inactivity timeout value for non-TCP and non-UDP sessions.
• TCP FIN timeout: Inactivity timeout after TCP FINs are observed in both directions.
• Application protocol specific timeout records, with each record containing:
      • Application protocol information : Protocol and Port
      • Inactivity timeout value: This inactivity timeout value is used after connection is established. If there is no matching application protocol specific inactivity timeout record, then TCP, UDP or generic IP inactivity timeout value is used.
    TR-069 based configuration:
    • internetGatewayDevice.security.VirtualInstance.{i}.firewall.serviceInactivityTimeout P
      • tcpPreConnTimeOut: RW, Unsigned Int, Default : 10 seconds - Value in seconds.
      • udpPreConnTimeOut: RW, Unsigned int, Default: 10 seconds - Value in seconds.
      • tcpTimeOut: RW, Unsigned Int, Default: 60 seconds - Value in seconds.
      • tcpFinTimeOut: RW, unsigned int, default: 10 seconds - Value in seconds.
      • udpTimeOut: RW, Unsigned Int, Default: 60 seconds - Value in seconds.
      • IPTimeOut: RW, Unsigned Int, Default: 60 seconds - Value in seconds.
      • internetGatewayDevice.security.VirtualInstance.{i}.firewall.serviceInactivityTimeout.applicationTimeout.{i} PC
        • name: String(32), RW, Name of the record. Once the record is created, this can't be changed.
        • Description: String(128), RW, Optional - Description of the record.
        • protocol: String(8), RW, Mandatory - Takes values "tcp", "udp"
        • port: String(8), RW, Mandatory - Takes port value
• inactivityTimeout: Unsigned Int, RW, Mandatory - Inactivity timeout in seconds.
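To make the fallback order concrete, here is a hedged sketch in C of how a firewall might pick the post-establishment inactivity timeout from the configuration above; the structures and field names are illustrative, not the actual data model:

```c
#include <netinet/in.h>   /* IPPROTO_TCP, IPPROTO_UDP */

struct app_timeout {
    unsigned char  proto;               /* IPPROTO_TCP or IPPROTO_UDP */
    unsigned short port;
    unsigned int   inactivity_timeout;  /* seconds */
};

struct fw_cfg {
    unsigned int tcp_timeout, udp_timeout, ip_timeout;  /* seconds */
    int num_app_records;
    struct app_timeout app_records[64];
};

/* A matching applicationTimeout record wins; else the per-transport default. */
unsigned int session_inactivity_timeout(const struct fw_cfg *cfg,
                                        unsigned char proto,
                                        unsigned short port)
{
    int i;

    for (i = 0; i < cfg->num_app_records; i++) {
        const struct app_timeout *r = &cfg->app_records[i];
        if (r->proto == proto && r->port == port)
            return r->inactivity_timeout;
    }
    if (proto == IPPROTO_TCP)
        return cfg->tcp_timeout;        /* default: 60 seconds */
    if (proto == IPPROTO_UDP)
        return cfg->udp_timeout;        /* default: 60 seconds */
    return cfg->ip_timeout;             /* generic IP sessions */
}
```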

    Tuesday, April 29, 2008

    Software Timers in Network infrastructure devices

Most of the networking functions in network infrastructure devices require timers on a per-session basis. Some examples:
• Firewall and server load balancing devices require an inactivity timer for each session they create. Once the session is established, activity on the session is observed; if there are no packets for a certain time (the inactivity timeout), the session is removed. In addition, these functions also remove a session upon observing TCP RST or TCP FIN packets. Non-TCP sessions are removed only upon inactivity.
• IPsec VPN, ARP and QoS functions start a timer for each session/flow. These timers are typically called 'life expiry' timers, and they almost always expire. Once they expire, the sessions are removed or revalidated. In case of QoS, most of these timers are periodic timers which are restarted immediately after expiry.
• Protocol state machine timers: These timers are started on a per-session basis, sometimes with more than one timer per session, during session establishment as part of the signalling protocol. They are started when a message is expected from the peer and stopped upon receiving the peer's message. So, in normal cases, these timers are always stopped.
Operating systems like Linux provide a timer module. The typical API functions provided by a timer module, at a high level, are (see the sketch after this list):
• add_timer: Starts the timer with the given timeout value. It also takes the application's callback function pointer and a word argument (callback argument). The function pointer is called with the callback argument when the timeout occurs.
• del_timer: Stops a timer that was started earlier. Many applications expect that once the timer is stopped, the application callback is never called.
• mod_timer: Restarts the timer with a new timeout value.
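For reference, this is roughly how the 2.6-era Linux kernel timer API is used for a per-session inactivity timer; the session structure and callback body are illustrative:

```c
#include <linux/timer.h>
#include <linux/jiffies.h>

struct session {
    struct timer_list inactivity_timer;
    /* ... session state ... */
};

static void inactivity_expired(unsigned long data)
{
    struct session *s = (struct session *)data;
    /* remove the session here, or mod_timer() again if there was activity */
}

static void session_start_timer(struct session *s, unsigned int secs)
{
    init_timer(&s->inactivity_timer);
    s->inactivity_timer.function = inactivity_expired;
    s->inactivity_timer.data     = (unsigned long)s;
    s->inactivity_timer.expires  = jiffies + secs * HZ;
    add_timer(&s->inactivity_timer);
}

/* On packet activity: push the deadline out again. */
static void session_touch(struct session *s, unsigned int secs)
{
    mod_timer(&s->inactivity_timer, jiffies + secs * HZ);
}

/* On TCP RST/FIN: stop the timer before freeing the session. */
static void session_stop(struct session *s)
{
    del_timer(&s->inactivity_timer);
}
```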

    Some other characteristics and requirements of networking functions with respect to timers are:
• Any device targeted at Enterprise, data center and service provider markets would have sessions numbering in the millions, hence millions of timers are required. I have observed:
      • 2 M Sessions in case of firewall and SLB devices.
      • 50K IPsec VPN tunnels.
      • 500K of QoS Session (One session per subscriber & type of traffic )
• The session establishment rate is very important in addition to throughput. Some firewall and server load balancing devices support connection rates in the range of 250K to 500K connections/sec. The IPsec VPN function in the high-end market segment requires SA establishment rates in the range of 20K/sec. Hence it is very important that very few CPU cycles are spent in adding and removing timers.
    • Firing the timer (timer expiry) happens most of the time.  Let us see some scenarios.
• As discussed above, the timers of some networking functions such as IPsec VPN, ARP and QoS always expire.
• In case of firewall and SLB, it may appear that expiry does not happen most of the time. For non-TCP sessions, timer expiry always happens. Even with TCP, timer expiry is not uncommon. Many TCP connections are non-interactive, hence the inactivity timeout is normally up to 60 seconds. In lab environments, connection rate and hybrid (connection rate with realistic payload) measurements always result in the timer being stopped by TCP RSTs or FINs; that is, each connection ends before the inactivity timeout expires. In real workloads, connections can stay beyond the inactivity timeout. In these cases, timers expire and are restarted if there was activity; activity is normally determined from the last-packet-processed timestamp.
• Jitter is one important item that these applications should keep to a minimum. This means that a core should not do too much work while traversing the timers to check for expiration, nor fire too many timers in one loop. If these issues are not taken care of, incoming packets will see high latency and hence jitter. The timer subsystem is expected to divide the work evenly, thereby giving uniform latency and low jitter.
• High end network devices normally have a large amount of memory, but they also support millions of sessions, and the timer is just one part of each session. The timer mechanism should not take too much memory. If the system supports Y million sessions and each timer takes X bytes, the amount of memory consumed should not significantly exceed X*Y. It is okay to expect some minimal amount of memory beyond this for housekeeping, but it should not be a multiple of the X*Y product.
• Some implementations, upon deleting a timer, keep some state information around until the timer would actually have expired. I guess these implementations do this to improve performance, for example by avoiding locks between add and delete. Though this state information is smaller than the timer block itself, it can still lead to runaway memory usage. Say this state information is 8 bytes, the system supports 250K connections/sec with a 2M session capacity (that is, it can start and stop 250K timers every second), and the inactivity timeout is 60 seconds. Then in any 60-second window, 250K*60 timers are added and deleted. Keeping 8 bytes of state until expiry requires an additional 250K*60*8 = 120 Mbytes of memory, on top of the memory required for the active timers of 2M sessions. That would be unacceptable - and think of the case where the inactivity timeout is more than 60 seconds.
Based on the above, all timer operations - addition, deletion and expiry - need to perform well, with low jitter, and without inflating the memory requirement. On multicore processors, it is also a requirement to avoid locks as much as possible to get linear performance with the number of cores.

The newer Linux timer implementation, based on the 'cascaded timer wheel', satisfies many of these requirements. It does not take care of two items - minimizing jitter and avoiding locks.

Linux implements multiple timer wheels with different granularities. The lowest level wheel's granularity is jiffy based. The next level wheel's granularity is the previous level's granularity multiplied by the number of buckets in the previous wheel. For example, if the first level wheel has 256 buckets, then each bucket of the second level wheel has a granularity of 256 jiffies. When a timer is started, it is kept in the appropriate wheel. Each timer wheel has a running index. Every time the jiffy interrupt occurs, the timers in the bucket indexed by the 'running index' of the lowest level wheel are expired, and then the index moves to the next bucket. When the index wraps around to the beginning of the wheel, all timers in the next wheel's current bucket are moved into this wheel and the next wheel's running index is incremented. While moving the timers from the next level to the current level, it distributes them across buckets based on their timeout values. Timers are kept in a doubly linked list, hence addition and removal of timers is very fast.
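A condensed sketch of the bucket selection, loosely modeled on the classic kernel/timer.c layout (256 slots at level 0, 64 slots at each higher level); this is illustrative, not the exact kernel code:

```c
#define TVR_BITS 8                      /* level-0 wheel: 256 jiffy slots */
#define TVN_BITS 6                      /* higher wheels: 64 slots each   */
#define TVR_SIZE (1 << TVR_BITS)
#define TVN_SIZE (1 << TVN_BITS)

/* Choose the wheel level and slot for a timer given the current time. */
static int pick_slot(unsigned long expires, unsigned long base, int *level)
{
    unsigned long delta = expires - base;
    int lvl, shift = TVR_BITS;

    if (delta < TVR_SIZE) {
        *level = 0;                          /* jiffy granularity */
        return expires & (TVR_SIZE - 1);
    }
    for (lvl = 1; lvl <= 4; lvl++) {         /* 256, 16K, 1M, 64M jiffies */
        if (delta < 1UL << (shift + TVN_BITS) || lvl == 4) {
            *level = lvl;
            return (expires >> shift) & (TVN_SIZE - 1);
        }
        shift += TVN_BITS;
    }
    return -1;                               /* not reached */
}
```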

This works fine except for the case where it needs to move timers from the next wheel into the current wheel and distribute them. This case may not arise if timers are stopped (deleted) before they expire, but in some workloads timers are not stopped before expiry. Note that it may involve moving a large number of timers, and this adds latency to the packets being processed. If it takes long enough, it may even result in packet drops. I do have some unproven ideas and I will get back when I crystalise those ideas in my mind.

The Linux implementation also uses locks to protect the bucket lists, as different cores may add and remove a given timer. Locks can be avoided with some simple changes - maintain timer wheels on a per-core basis. If a timer is being stopped by another core, then indicate this to the original core. This ensures that only the core which started the timer deletes it from its bucket, and that it is the one which fires the application callback upon timer expiry. To ensure that no locks are needed for passing the deleted-timer indications to the original core, ring based arrays (one array for each core) can be used.
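A minimal sketch of that idea, assuming each timer records its owning core and a remote core only flags the timer instead of touching the wheel (the structure and helpers are hypothetical; a per-core ring of pointers can replace the flag to batch the indications):

```c
#include <linux/list.h>
#include <linux/smp.h>
#include <asm/atomic.h>

struct pc_timer {
    struct list_head entry;        /* lives in the owning core's wheel */
    int              owner_cpu;    /* core that started the timer      */
    atomic_t         cancelled;    /* remote-delete indication         */
    void           (*fn)(void *);
    void            *arg;
};

/* Delete: lock-free local unlink on the owning core; a remote core only
 * sets a flag and lets the owner clean the timer up on expiry. */
static void pc_timer_del(struct pc_timer *t)
{
    if (t->owner_cpu == smp_processor_id())
        list_del(&t->entry);           /* only the owner touches its wheel */
    else
        atomic_set(&t->cancelled, 1);  /* owner will skip the callback     */
}

/* Expiry path, always executed on the owning core. */
static void pc_timer_expire(struct pc_timer *t)
{
    list_del(&t->entry);
    if (!atomic_read(&t->cancelled))
        t->fn(t->arg);                 /* fire only if nobody cancelled it */
}
```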

      Tuesday, April 15, 2008

      Network security computing with Session parallelization - Development tips

Until 2005, the performance of network security was a function of processor power. Processors just kept getting better and faster, and network security performance improved with them. In the recent past, processors are not getting faster; instead, more processors (cores) are being added to each chip. I guess physics has limited how much faster processor technology can get. Manufacturers are concentrating on increasing the gate count in their chips rather than the clock rate, and chip vendors are adding more processing cores to one die. These chips are called 'Multi Core Processors' (MCPs).

Before any application can take advantage of MCPs, the operating system must support multiple cores. Linux and other operating systems have supported MCPs for the last few years; Linux calls this feature 'SMP' (Symmetric Multi-Processing). In this mode, all cores share the same memory, i.e. the code and data segments are the same across all cores. Since any core can access any part of the code and data, it is necessary to protect critical data by serializing access to the code that touches it.

Applications are expected to take advantage of the multiple cores in the system, so the mantra for applications now is parallelization. There is no denying that parallelization of software is not easy: it is time consuming and takes significant investment.

One serialization technique is to stop cores from entering critical code while another core is executing it. This is typically done in Linux kernel space using spin locks, for example as shown below. Too much serialization reduces performance, and then application performance does not scale well with the number of cores. At the same time, critical data and data structures (such as binary trees, linked lists etc.) must be protected by serialization.
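Guarding a shared structure with a spin lock in the kernel looks like this (the tree type and insert function are placeholders):

```c
#include <linux/spinlock.h>

struct policy_tree;                                 /* placeholder types */
struct policy;
void tree_insert(struct policy_tree *, struct policy *);

static DEFINE_SPINLOCK(policy_lock);

void policy_insert(struct policy_tree *tree, struct policy *p)
{
    spin_lock(&policy_lock);     /* other cores spin here in the meantime */
    tree_insert(tree, p);        /* the serialized critical section       */
    spin_unlock(&policy_lock);
}
```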

Other difficulties in parallelization of code are:
• Parallelization is subject to many errors such as deadlocks and race conditions, and more importantly it is difficult to debug.
• Problems are difficult to reproduce, hence problem identification takes longer development cycles.
• Maintainability of the code: the code might initially have been written with all the parallelization problems in mind, but as time progresses, different developers work on and maintain the code. They may not be aware of the parallelization issues and may make mistakes.
• Single-CPU test coverage is not sufficient; the updated test suite should be able to find as many race conditions as possible.

Any parallelization approach should make it simple to develop and maintain code, yet remain efficient. 'Session parallelization' is one approach that makes this task simple for networking applications running in kernel space.

Network security primarily consists of firewall, IPsec VPN, Intrusion Prevention, URL filtering and Anti-Virus/Spam/Spyware functions. To improve performance, the firewall, IPsec VPN and in some cases IPS functions are typically run in Linux kernel space. Each security function has a notion of a session. In case of firewall and IPS, sessions are nothing but 5-tuple flows; in case of IPsec VPN, the session is the 'SA bundle'. Network security functions maintain state information within the sessions. State information sometimes changes on a per-packet basis, and processing of a new packet depends on the current state. There are two ways to ensure packet synchronization with respect to states. One way is to serialize the code paths that update and check the state information. If there are multiple places where states are checked and modified in the packet path, then there are multiple serializations; depending on their number, the performance impact is smaller or larger.

Another way is 'session parallelization'. In session parallelization, packet synchronization happens at the session level. At any time, only one core owns a session. If any other core receives a packet, the packet is queued in the session for later processing by the core owning the session. If no core owns the session, then the core that received the packet starts processing it after stamping its ownership. Once the core processes the packet, it checks whether there are any packets pending; if so, it processes them, and if not, it disowns the session. Using this method, after session identification the rest of the packet processing does not need to take any locks for serialization. This not only improves performance, but is also less error prone. Having said that, locking can't be avoided in some cases such as session establishment and inter-session state maintenance.
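Here is a hedged sketch of that ownership protocol in kernel C. Only the short owner/backlog manipulation is under the per-session lock; the actual packet processing runs lock-free (process_packet() is a placeholder):

```c
#include <linux/spinlock.h>
#include <linux/skbuff.h>
#include <linux/smp.h>

struct session {
    spinlock_t          lock;     /* guards owner and backlog only       */
    int                 owner;    /* CPU id, or -1 when unowned          */
    struct sk_buff_head backlog;  /* packets that arrived on other cores */
};

void process_packet(struct session *s, struct sk_buff *skb); /* placeholder */

void session_input(struct session *s, struct sk_buff *skb)
{
    spin_lock(&s->lock);
    if (s->owner != -1) {                    /* another core owns it:   */
        __skb_queue_tail(&s->backlog, skb);  /* queue for that core     */
        spin_unlock(&s->lock);
        return;
    }
    s->owner = smp_processor_id();           /* stamp our ownership     */
    spin_unlock(&s->lock);

    do {
        process_packet(s, skb);              /* lock-free: we own it    */
        spin_lock(&s->lock);
        skb = __skb_dequeue(&s->backlog);    /* queued while busy?      */
        if (!skb)
            s->owner = -1;                   /* disown before unlocking */
        spin_unlock(&s->lock);
    } while (skb);
}
```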

During session establishment, access to the configuration information is required. A configuration update can happen in one core's context while session establishment happens in another core's context. Data integrity must be maintained to ensure that a core does not read wrong configuration information, and data structure integrity must be maintained too. Think of a case where the configuration engine is removing a configuration record from a linked list while session establishment code is traversing that list: if care is not taken, the session establishment code might access an invalid pointer, resulting in a null pointer exception. Since the session establishment phase requires read-only access to the configuration structures, locks can be taken in read-only mode. In read-only mode, multiple cores can do session establishment for different sessions simultaneously, so this does not affect session establishment performance. Since configuration updates are infrequent, usage of read-only locks does not degrade performance.
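In the Linux kernel this maps naturally onto reader-writer locks; a minimal sketch, with the lookup and record types as placeholders (real code would also take a reference count before dropping the read lock):

```c
#include <linux/spinlock.h>

struct cfg_record;
struct cfg_record *cfg_lookup(const void *key);   /* placeholder */

static DEFINE_RWLOCK(cfg_lock);

/* Session establishment: many cores may read concurrently. */
struct cfg_record *cfg_get(const void *key)
{
    struct cfg_record *rec;

    read_lock(&cfg_lock);
    rec = cfg_lookup(key);        /* safe traversal of the config lists */
    read_unlock(&cfg_lock);
    return rec;
}

/* Configuration update: exclusive, but infrequent. */
void cfg_update(void (*apply)(void))
{
    write_lock(&cfg_lock);
    apply();                      /* mutate the lists/trees safely */
    write_unlock(&cfg_lock);
}
```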

Another area where locks must be taken during packet processing is where inter-session state is maintained, such as statistics counter updates, session rate control state etc. Since many cores work on different sessions simultaneously, it is necessary that inter-session state information is protected using write locks around the code that manipulates it. It is very important that the amount of code executed under locks is as small as possible for better performance.

Though there are cases where locks are used, such as inter-session state and configuration data access, these instances are few in number; session state updates and accesses are far more common. As long as session parallelization is done, the SMP problem is manageable. Of course, one could argue that a given session's performance is limited by the power of one core. But in a typical environment there are many sessions, and hence the system throughput will be proportional to the number of cores.

      Cloud Computing and security

Cloud computing has become a popular term in the recent past. Cloud computing providers have a large number of interconnected cloud servers. They provide services to end users - renting virtual servers with the required CPU power, storage, and some specialized services such as PHP, Java and Ruby-on-Rails based servers.

Since these servers are outside of offices, it is required that you have very good Internet connectivity. Cheap bandwidth and reliable connectivity favor the cloud computing model. From the cloud computing provider's perspective, this is becoming possible with very high speed, high density multicore processors and virtualization, with its inherent ability to provide isolation and run multiple services on one physical machine.

The advantages of cloud computing for users (Enterprises) are the same ones you get with data centers, such as:
      • Reduce system and network infrastructure administration burden.
      • Save on Electricity cost by selecting data center with lower cost of electricity.
      • Save on real estate.
      Cloud computing provides additional advantages such as
      • Handle peak loads by provisioning computing power with a click of a button.
      • Isolation of application servers from physical machines.

There are some concerns which have not yet been fully worked out.

• Who is going to take care of the security aspects of user applications? Is it the cloud computing provider, or is it the responsibility of the users?
• Who monitors the vulnerabilities of the different applications and takes care of patching them?
• Will the user be given any visibility into exploits and attacks?
• Who takes responsibility for provisioning the security infrastructure? Who takes responsibility for tuning IPS/IDS signatures?
• Who takes responsibility for compliance requirements such as PCI-DSS etc.?
• Who takes responsibility for auditing systems, applications etc.?
• If you have remote users that need access to these services, what kind of on-the-wire security is required, and who provides the VPN connectivity?

When the cloud computing provider provides specialized services such as email, I feel that it is the provider's responsibility to check for vulnerabilities, harden and patch the systems, check for spam, prevent phishing attacks and so on. Do they do that today? What kind of guarantees are provided?

When cloud computing providers provide generic services such as renting out virtual servers, I have a feeling that the responsibility for securing them may fall on the users' shoulders. Now questions arise such as:

- Do cloud computing SPs provide a *Cloud Security* service?
- Do SPs give users the flexibility to select their own security vendor?
- Do SPs expect the security appliance to be provisioned as a virtual service? If so, what kind of virtualization technology do SPs provide?
- Do SPs provide network visibility for the user to link the security service with the application servers?


It is not possible for cloud computing providers to provide security for applications they don't know. Many security problems are specific to each application, and Enterprises typically have their own applications in addition to standard ones. As you can see from the questions, there is a lot of tuning of security applications, such as adding new signatures in IPS, that happens over time. So, it makes sense for cloud computing providers to give users the flexibility to create their own security environment. Enterprises also typically provide secure remote connectivity for their employees to access critical services. Securing Enterprise services involves not only exploit detection, tuning, hardening and patching, but also providing VPN service to employees.

I have a feeling that, like the way computing services are provided in the cloud, security services will also be provided by cloud computing providers. Cloud security service provisioning involves not only the security application, but also the connectivity between the security service and the application servers. To provide complete security, it may involve provisioning multiple security services - a VPN service, IPS service, firewall service and web application firewall service - or one UTM service.

If service providers are going to give end users the flexibility to provision their choice of security application, then SPs would offer the choice of running virtual security appliances.

At times, a service provider may not want to offer a choice of security application and may instead provide security as a specialized service of its own. In that case, SPs may go for mega security appliances supporting multiple instances, with each instance provisioned for one customer.

In both cases, though, the computational power needed for security services is very high. Multicore processors are going to fill this gap.

Let us see how this market turns out.

      Central Management Systems - Critical Missing features

Central Management Systems (CMS) / Network Management Systems are used to configure and monitor multiple network elements from a central location. Some of the features supported by many CMS solutions are:
      • Ability to configure multiple network elements.
      • Ability to collect log information.
      • Ability to analyze log and generation of periodic reports.
      • Ability to monitor critical events.
      • Ability to issue diagnostic commands to network elements and getting the results.
      • Ability to allow multiple administrators to use CMS solutions - Role based Management.
      • Ability to view the Audits - Who changed what and when.
      CMS solutions architecturally contain
      • UI console: Allows users to configure network elements and also allows users to view the reports/analyze logs.
      • Policy Server with Policy repository: Stores the configuration for each network element.
• Element Adapter: Converts configuration information into a device-understandable format and sends it via the protocol supported by the devices.
      • Log Collector and Report Generator: Collects logs from network elements and also can generate reports.
To provide scalability, i.e. to support a large number of network elements, multiple element adapters and log collectors are used, with each of them supporting a fixed number of network elements. Typically one policy server is used, but many UI consoles may be in use at any time.

There are some pieces which CMS vendors tend to ignore, but in my view they are very important. One of them has to do with the 'configuration session'. Traditionally, each administrative change in the configuration results in a command for the device. That is, if an administrator changes the configuration X times, then X different commands are prepared for the device. As observed, if the administrator changes a rule (let us say a firewall rule) and then un-modifies the change, two commands are generated. These commands are eventually sent to the device in the order of the changes. Many times this is not a problem, but it could be a problem in instances where:
• The first modification, when applied, changes the state of the device, such as removing some run time states. The second command, which un-modifies the previous change, does not bring back the run time states that were destroyed.
• Sometimes, the first modification might even stop traffic through the network element.
Administrators should be given a chance to make configuration changes (additions, modifications or deletions), review them and commit them. Only when the changes are committed should the consolidated changes be sent to the device. Configuration modifications and commands must be de-linked: when the commit is issued, commands should be generated from the configuration differences. Some examples help in understanding the feature better.
• The administrator added a rule, changed his mind, and deleted the rule before committing. In this case, no command should be generated for this rule.
• The administrator deleted a rule, changed his mind, and revoked the configuration change. In this case, no command should be generated for this rule.
• The administrator changed a few parameters of a rule and then changed them again to newer values. Only the final values should go to the device.
This feature requires the following support from CMS solutions:
• A configuration session should have a start and an end. The end can be a complete revoke or a commit.
• At any time, the administrator can check for errors during the configuration session. The CMS solution is expected to provide a 'Validate' operation.
• Checking for duplicate configuration sessions: at any time, only one configuration session is allowed. Note that multiple users can still view the committed configuration. If a new configuration session is started, the CMS solution should warn the user that a configuration session from 'user' at 'date&time' from 'ipaddress' is already in progress. It can give options such as 'take over the existing configuration session' or 'start from scratch by revoking the previous session'.
• At the time of 'commit', the CMS solution can take a new version string, prepended with the date&time.
• At the time of 'commit', the CMS solution should generate the commands to be sent to the device. It should do this by reading the information from the configuration session, not from the sequence of actions the user performed (see the sketch after this list).
• A new configuration session should be allowed even if the commands generated from the previous configuration session have not yet been synchronized with the device.
• The UI should clearly show which configuration is part of the current configuration session.
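A compact sketch of diff-on-commit in C, assuming each rule carries a stable id and a comparable body (find_rule(), rule_equal() and emit() are hypothetical helpers): walk the session copy against the committed copy and emit only the net changes.

```c
struct rule { int id; /* ... rule fields ... */ };
struct cfg  { int count; struct rule *rules; };

struct rule *find_rule(struct cfg *c, int id);            /* placeholder */
int rule_equal(const struct rule *, const struct rule *); /* placeholder */
enum cmd { CMD_ADD, CMD_MODIFY, CMD_DELETE };
void emit(enum cmd, const struct rule *);                 /* placeholder */

/* Generate device commands from configuration differences, not from the
 * sequence of user actions during the session. */
void commit_session(struct cfg *committed, struct cfg *session)
{
    int i;

    for (i = 0; i < session->count; i++) {
        struct rule *n = &session->rules[i];
        struct rule *o = find_rule(committed, n->id);

        if (!o)
            emit(CMD_ADD, n);            /* added and kept: one ADD     */
        else if (!rule_equal(o, n))
            emit(CMD_MODIFY, n);         /* net change only: one MODIFY */
        /* identical rules produce no command; add-then-delete never
         * appears in the session copy at all */
    }

    for (i = 0; i < committed->count; i++)
        if (!find_rule(session, committed->rules[i].id))
            emit(CMD_DELETE, &committed->rules[i]);
}
```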

In addition, the CMS solution should support:
• Listing the configuration versions of a particular network element.
• A facility to migrate network elements to previous versions.
• A facility to remove very old configuration versions to preserve space in the database.