Sunday, February 22, 2009

Advice to Network Device testers - Simulate Capacity/Stress related faults

Capacity in network devices such as UTMs is specified per function: simultaneous connections for the firewall, ALGs and intrusion prevention functionality; tunnels for IPsec VPN; sessions for the anti-virus and anti-spam functions; and many more limits for smaller functions. Normally, not all functions are used at the same time. Even when they are, not every session goes through every function. Because of this, network device vendors typically oversubscribe memory. That is, the memory needed to run all functions at their specified capacity is a lot more than the memory available in the device.

This can pose an interesting problem in the field. In a deployment where multiple functions are used by a large number of connections, memory and other resources can run short, and errors are then returned when a resource is being allocated. If error detection, propagation and recovery are not handled well by the software, this can lead to instability, leaks, crashes and lockups. It is the tester's job to ensure that these kinds of problems do not happen in the field. Typically testers simulate different conditions and ensure that the system is stable. At times, though, it is not possible for testers to test all the different combinations or simulate all the different conditions.

I believe testers should be able to cover all possible combinations by simulating all kinds of error conditions. As part of this, testers should ask the development team to provide facilities to inject faults. In particular, testers should ask for facilities to inject the following faults:
  • Memory allocation failures: Almost all functions in the software allocate memory, whether at connection establishment, on a per-packet basis, or to queue packets and control data. Testers should have the ammunition to inject a memory fault into a specific function.
  • Socket/File open failures
  • Semaphore creation failures
  • Thread/Tasklet creation failures
  • Fault simulation of any other OS resource that gets allocated after software is completely initialized.
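A fault injection facility of this kind can be quite small. The following is a minimal sketch in Python, not a real product interface: the names `FaultInjector`, `alloc_buffer` and `open_session` and the function label "ftp_alg" are all assumptions made for illustration. The idea is that every resource allocation goes through a wrapper that asks the injector whether it should fail for this function.

```python
class FaultInjector:
    """Registry of armed faults, keyed by (function, resource). Hypothetical."""

    def __init__(self):
        self._armed = {}  # (function, resource) -> number of failures still to inject

    def arm(self, function, resource, count=1):
        """Make the next `count` allocations of `resource` in `function` fail."""
        self._armed[(function, resource)] = count

    def should_fail(self, function, resource):
        key = (function, resource)
        remaining = self._armed.get(key, 0)
        if remaining <= 0:
            return False
        self._armed[key] = remaining - 1
        return True


injector = FaultInjector()


def alloc_buffer(function, size):
    """Allocation wrapper: fails on demand, otherwise allocates normally."""
    if injector.should_fail(function, "memory"):
        raise MemoryError(f"injected allocation failure in {function}")
    return bytearray(size)


def open_session(function):
    """Example caller that has to recover cleanly when the allocation fails."""
    try:
        return alloc_buffer(function, 2048)
    except MemoryError:
        return None  # reject the session without leaking anything
```

A tester would arm a fault with something like `injector.arm("ftp_alg", "memory")`, run traffic through that function, and check that the session is rejected cleanly while other functions keep working. The same wrapper pattern applies to sockets, semaphores and threads.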
Testers should go about testing in a methodical way:

  • Keep a list of all functions and the OS resource allocations they do.
  • For each one of them, create a test case.
  • Before running the test case, configure the fault to be injected.
  • Run the test and ensure that the system works as expected.
  • Run the test without the fault and ensure that the system is stable.
I believe that this kind of testing should happen for every release, whether it is a feature release or a maintenance release. If these tests are done manually, they take a very long time. My suggestion is to automate them.
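The steps above lend themselves to a simple loop. Here is a rough sketch of such an automation harness, assuming the device exposes some way to arm a fault (`set_fault`) and some way to run a test case and report whether the system behaved (`run_test`). Both callbacks, and the function and resource names in the inventory, are hypothetical placeholders.

```python
from itertools import product

# Hypothetical inventory: functions under test and the OS resources they allocate.
FUNCTIONS = ["firewall", "ftp_alg", "ipsec_vpn"]
RESOURCES = ["memory", "socket", "semaphore", "thread"]


def run_campaign(run_test, set_fault):
    """Run every (function, resource) case with the fault armed, then without.

    run_test(function) -> bool       True if the system behaved as expected
    set_fault(function, resource, on) arms or disarms one injected fault
    Returns the list of cases where either run failed.
    """
    failures = []
    for function, resource in product(FUNCTIONS, RESOURCES):
        set_fault(function, resource, True)
        ok_with_fault = run_test(function)   # must degrade gracefully
        set_fault(function, resource, False)
        ok_clean = run_test(function)        # must be stable again afterwards
        if not (ok_with_fault and ok_clean):
            failures.append((function, resource, ok_with_fault, ok_clean))
    return failures
```

Because the loop is driven by the inventory, adding a new function or resource type to the lists automatically adds the corresponding test cases to every release run.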

Sunday, February 15, 2009

RSYNC - Applicability for WAN Deduplication

Good old rsync can be a key to WAN deduplication.

The purpose of WAN deduplication is to reduce the amount of traffic on WAN links by removing duplicate data. It is also called DRE (Data Redundancy Elimination).

'rsync' is a utility that has been provided in Unix variants for a long time. It is mainly used to mirror data across multiple servers and is also used for backing up data. It recursively goes through all files and directories and updates the data on the backup or mirror server. One feature that is interesting for WAN deduplication is its ability to do 'delta encoding'. 'rsync' can send only the 'differences' to the destination machine, which in turn creates the new file from its existing file and the 'delta' information it gets from the origin server. 'rsync' can also compress the data using 'zlib', thereby saving even more WAN bandwidth.

To understand the algorithms used by 'rsync' for delta encoding, check this technical report. It uses a mechanism called 'rolling checksum' (modeled on the Adler-32 algorithm) to figure out the differences between the file the mirror machine has and the file the origin server has. Why is this 'rolling checksum' required? When a file is updated on the origin server, the changes can be anywhere in the file: at the beginning, in the middle or at the end. 'Delta' generation should work, and should not resend any common data, irrespective of where in the original file the changes were made. The mirror server breaks the file it has into non-overlapping chunks of some size S (typically S = 1K), calculates both the rolling checksum and the MD5 checksum of each chunk, and sends them to the origin server. The origin server also computes rolling checksums over chunks of size S but, since the file might have undergone changes, it computes one at every byte offset. Because the origin server generates such a large number of checksums, the rolling checksum algorithm needs to be very fast. The Adler-32 style checksum can be generated incrementally: when the window slides by one byte, the new checksum is derived from the previous one without going through all the bytes of the chunk again. The origin server then compares the received checksums with the checksums it generated locally. If a rolling checksum matches, it verifies the match by comparing the MD5 checksum as well. If the MD5 checksum also matches, the origin server assumes that the mirror server has this chunk. Once it has found all the duplicate data, it sends only the matching block information plus any new or modified data for the non-matching regions. For detailed information on this algorithm, please check this link.
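To make the rolling property concrete, here is a small Python sketch of the matching step described above. It is an illustration under stated assumptions, not rsync's actual code: the weak checksum follows the two-sum form commonly given for rsync (both sums modulo 2^16), MD5 is used as the strong checksum as in the description above, and the function names are my own.

```python
import hashlib

M = 1 << 16  # both running sums are kept modulo 2^16


def weak_checksum(block):
    """Compute the (a, b) pair of the rolling checksum from scratch."""
    n = len(block)
    a = sum(block) % M
    b = sum((n - i) * x for i, x in enumerate(block)) % M
    return a, b


def roll(a, b, out_byte, in_byte, n):
    """Slide an n-byte window one byte to the right in constant time."""
    a = (a - out_byte + in_byte) % M
    b = (b - n * out_byte + a) % M
    return a, b


def signatures(old, n):
    """Mirror side: checksums of the non-overlapping n-byte chunks of `old`."""
    table = {}
    for start in range(0, len(old) - n + 1, n):
        block = old[start:start + n]
        entry = (hashlib.md5(block).digest(), start // n)
        table.setdefault(weak_checksum(block), []).append(entry)
    return table


def find_matches(new, table, n):
    """Origin side: return (offset_in_new, chunk_index_in_old) for known chunks."""
    matches = []
    if len(new) < n:
        return matches
    pos = 0
    a, b = weak_checksum(new[:n])
    while True:
        hit = None
        for strong, index in table.get((a, b), ()):
            # A weak match is only a hint; confirm it with the strong checksum.
            if hashlib.md5(new[pos:pos + n]).digest() == strong:
                hit = index
                break
        if hit is not None:
            matches.append((pos, hit))
            pos += n                      # skip past the matched chunk
            if pos + n > len(new):
                break
            a, b = weak_checksum(new[pos:pos + n])
        else:
            if pos + n >= len(new):
                break
            a, b = roll(a, b, new[pos], new[pos + n], n)
            pos += 1
    return matches
```

With `old = b"AAAABBBBCCCCDDDD"` and a chunk size of 4, inserting a few bytes at the start and in the middle of the new file still lets `find_matches` locate all four chunks, so only the inserted bytes would need to cross the WAN.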

Another utility that is useful for WAN deduplication is 'rdiff'. 'rdiff' uses the rsync algorithm to generate a delta file containing the differences between an old file and a new file. This delta can then be applied to the old file on another machine to get the new file.

Saturday, February 14, 2009

Firewall and NAT ALG Testing Recommendations


Overview:

Stateful inspection firewalls open temporary holes to allow data connections based on information they read from the control connection. Some protocols such as FTP, SIP, RTSP, H.323 and MGCP open a control connection and exchange IP address and port information with the peer end point for data transfer. The ports exchanged in control connections are not well known ports; they are ephemeral. Because of this, administrators can't configure firewall rules to allow these connections without allowing everything. Application Level Gateways (ALGs) are software modules within the firewall that interpret the protocol packets, extract the ephemeral port information and open temporary holes to allow data connections to pass through the firewall between protocol end points. Since each protocol is different, multiple ALG modules are required - one for each protocol.


ALGs also do the address and port translation in the protocol data if the firewall supports NAT functionality. If the IP addresses or ports are specified in ASCII form, there is a good chance that the data length of the packet changes after translation. In the case of a TCP based ALG, this results in sequence number modifications in the TCP header. Firewalls typically take care of maintaining the delta sequence numbers and modify further packets with this delta in both the "Sequence number" and "Ack number" fields to ensure that integrity is maintained with the client and server end points of the connection. It is also important to note that ALGs modify different packets during the life of a session, and the firewall software is expected to keep updating the delta sequence numbers appropriately. Note too that firewalls need to keep a history of deltas with respect to the original sequence numbers, so that the appropriate delta can be applied to retransmitted packets, which are older.
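The length change and the delta history can be illustrated with a short Python sketch. It uses the FTP PORT command, where the address is carried as comma-separated decimal numbers, as the example. The helper names are mine, the rewrite is a naive string replacement, and sequence number wraparound and the reverse-direction "Ack number" adjustment are left out to keep the sketch short.

```python
import bisect


def rewrite_port_command(payload, private_ip, nat_ip):
    """Replace the private IP in an FTP PORT command.

    Returns (new_payload, length_change). Naive substring replacement,
    for illustration only.
    """
    old = private_ip.replace(".", ",").encode("ascii")
    new = nat_ip.replace(".", ",").encode("ascii")
    rewritten = payload.replace(old, new, 1)
    return rewritten, len(rewritten) - len(payload)


class SeqDeltaTracker:
    """History of cumulative sequence deltas for one direction of a connection."""

    def __init__(self):
        self._starts = []  # original sequence numbers where a new delta begins
        self._deltas = []  # cumulative delta in effect from that point onwards

    def record(self, seq_after_packet, length_change):
        """Note that the packet ending at seq_after_packet changed in length."""
        if length_change == 0:
            return
        previous = self._deltas[-1] if self._deltas else 0
        self._starts.append(seq_after_packet)
        self._deltas.append(previous + length_change)

    def translate(self, seq):
        """Map an original sequence number to the one sent on the other side."""
        i = bisect.bisect_right(self._starts, seq) - 1
        delta = self._deltas[i] if i >= 0 else 0
        return (seq + delta) & 0xFFFFFFFF
```

Rewriting `PORT 10,0,0,5,4,1` to carry 203.0.113.200 grows the packet by 5 bytes, so every later byte on that connection is shifted by 5. A retransmission of an older packet must still be translated with the delta that was in effect at its original sequence number, which is why the tracker keeps a history rather than a single value.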

To apply translation to the data, the ALG needs the complete PDU. In some protocols such as H.323 and SIP, a PDU can be large. If there is congestion in the network, the end point does not send the PDU in one TCP packet and waits for an acknowledgment before sending the rest of the PDU. Because of this, the newer generation of firewalls send the acknowledgment themselves to make the end point send the rest of the protocol data.

Many routers lower the TCP MSS value in the SYN and SYN+ACK packets of transit traffic whenever there is multimedia traffic, to ensure that VoIP packets do not get stuck. As we all know, voice traffic is delay sensitive and should be transmitted as soon as possible. If a router has a slow link, it takes significant time to transmit a 1500 byte packet. If the link bandwidth is 256 kbps, it takes around 47 msec to transmit a 1500 byte packet. If the WAN controller of the router picks a 1500 byte packet and a VoIP packet arrives right after that, the VoIP packet may need to wait up to 47 msec, thereby increasing the delay of the real time traffic. By lowering the TCP MSS value, the size of the TCP packets generated by the end points can be controlled. Broadband routers that set the MSS value of transit TCP packets to 256 bytes are quite common. In these cases, the protocol data of complex protocols requiring an ALG arrives in many packets. Firewalls and ALGs must be able to extract the relevant data for opening holes and for translation even when the protocol data arrives in multiple TCP packets.
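The delay figure is simple arithmetic: packet size in bits divided by link speed. A quick Python check follows, assuming 256 kbps means 256,000 bits per second, a 40 byte IPv4 plus TCP header with no options, and ignoring link-layer framing.

```python
def serialization_delay_ms(packet_bytes, link_bps):
    """Time to clock one packet onto the wire, in milliseconds."""
    return packet_bytes * 8 * 1000 / link_bps


LINK_BPS = 256_000  # assumed meaning of "256 kbps"

full_size = serialization_delay_ms(1500, LINK_BPS)      # 46.875 ms
small_mss = serialization_delay_ms(256 + 40, LINK_BPS)  # 9.25 ms with MSS 256

print(f"1500 byte packet: {full_size} ms")
print(f"MSS 256 packet:   {small_mss} ms")
```

So with an MSS of 256, a voice packet queued behind a TCP packet waits under 10 msec in the worst case instead of nearly 47 msec.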

As discussed before, ALGs open temporary holes - pin holes. If ALGs are not implemented well, attackers can make a control connection and send PDUs with data crafted in such a way that pin holes are created to access critical internal services. An attacker can also DoS the firewall by sending a large number of PDUs that create a large number of pin holes, thereby causing service disruption to genuine users/connections.

ALG implementations can become very complex. Vulnerabilities increase with complexity. Buffer overflows and mishandled boundary conditions are typical problems associated with complex protocol implementations. That is one place validation should concentrate on.

Many protocol specifications (standards) don't specify the maximum length of protocol messages - especially text based protocols such as SIP and HTTP. Protocol implementations (end points and ALGs) typically assume typical sizes when allocating buffers for the data and don't allow traffic that exceeds this limit. Since these sizes are not universally adopted by different implementations, this can pose interoperability problems if the size assumed by the ALG implementation is different from the size assumed by the end point implementations. This is one area validation should concentrate on. In my view, ALG implementations should not assume any size restriction for PDUs that they do not need to interpret for their operation. For PDUs that need to be buffered, the size limit should be as large as it can be. Sometimes the protocol message is prepended with size information. In those cases, ALG implementations must allocate the buffer based on this size information. Validation should concentrate on ensuring that the ALG does not cause any functional problems.

As said before, ALGs also do the translations in the protocol data. An ALG implementation should ensure that the translations happen in all three cases - Source NAT, Destination NAT and Source & Destination NAT. Validation engineers often concentrate on testing with one session, but many NAT related problems can't be found if only one session is used. Validation engineers should test ALG based firewall implementations with multiple sessions.


Recommendations:

As you can see, validation testing of ALGs is not as simple as running standard applications on both ends of the firewall device. For example, running standard FTP client and server applications on the two sides of the firewall device and ensuring that the file transfer succeeds is necessary, but is not enough to validate the FTP ALG. I recommend that validation engineers consider the following for each ALG before certifying it.

Functional testing:
  • Test with standard applications: Make a list of popular applications. Ensure that different combinations of applications as client and server succeed in the following cases. Configure the firewall to allow control connections (initial connections) only.
    • Without NAT
    • Source NAT : With clients behind internal network and servers in external network.
      • With NAT IP address whose length in dotted decimal form is more than that of source IP address.
      • With NAT IP address whose length in dotted decimal form is equal to that of source IP address.
      • With NAT IP address whose length in dotted decimal form is less than that of source IP address.
    • Destination NAT: With servers in internal network and clients in external network.
      • With DNAT IP address whose length in dotted decimal form is equal to that of destination IP address.
      • With DNAT IP address whose length in dotted decimal form is more than that of destination IP address.
      • With DNAT IP address whose length in dotted decimal form is less than that of destination IP address.
    • Source NAT and destination NAT together: With servers in internal network and clients in external network.
  • Explore all options of the applications and ensure that all options work with the above NAT combinations.
  • Ensure that the private IP address (client IP address in case of SNAT, destination IP address in case of DNAT) does not appear in the packets after translation. This requires capturing the packets and searching for the IP address in both binary and dotted decimal form.
  • Understand the protocol and get familiar with the messages and fields and their lengths. If a size is not mentioned in the protocol specification, get familiar with the realistic maximum length of messages and fields. Test to ensure that the ALG does not drop messages when messages with maximum sized fields are sent.
  • Ensure that ALGs perform well even when there is temporary packet loss. The FragRoute tool can be used to drop packets. Be sure to test with all NAT combinations.
  • Ensure that ALGs perform well when the TCP packet sizes are smaller than the protocol messages. The FragRoute tool can be used to change the TCP packet sizes. Be sure to test with all NAT combinations.
  • Ensure that ALGs perform well when TCP packets are smaller than the protocol messages and also reordered. The FragRoute tool can be used to reorder the TCP segments. Be sure to test with all NAT combinations.
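The private address check in the list above is easy to script once the packet payloads have been extracted from a capture. Below is a minimal Python sketch; the function name is my own, and it assumes the payloads are already available as byte strings (reading the capture file itself is not shown).

```python
import ipaddress


def find_private_ip(payload, private_ip):
    """Return the forms ('binary', 'dotted') in which private_ip occurs in payload."""
    addr = ipaddress.IPv4Address(private_ip)
    forms = {
        "binary": addr.packed,                # 4 raw bytes in network order
        "dotted": str(addr).encode("ascii"),  # e.g. b"10.0.0.5"
    }
    return [name for name, needle in forms.items() if needle in payload]
```

For example, `find_private_ip(b"c=IN IP4 10.0.0.5\r\n", "10.0.0.5")` reports a leak in dotted form. A plain substring search can give false positives (the 4 byte binary pattern may occur by chance, and "10.0.0.5" also matches inside "110.0.0.55"), so each hit should be looked at in the capture. Some protocols use other textual forms, such as the comma-separated form in FTP, which would need their own search pattern.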
Negative Testing: This testing is required to ensure that systems don't crash when invalid packets are sent.
  • Ensure that ALGs don't misbehave (crash or lock up) when protocol messages and fields of different lengths and values are sent. Make a note of all messages in the protocol specification. Send messages of different lengths either by writing your own client and server protocol implementation or by instrumenting an existing client and server implementation. I prefer the latter to reduce the effort required for this kind of testing. It is relatively simple to generate messages and fields with different lengths and values for the first message of a protocol, but messages that are deep down in the protocol require a successful initial protocol phase. Hence I prefer instrumenting existing open source client/server protocol implementations.
  • Go through some of the common vulnerabilities found in client and server implementations by searching through the CERT repository. Since ALG is also interpreting the protocol messages, ensure that these kinds of vulnerabilities are not present in the ALG.
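For the length variations, it helps to generate the test values systematically around likely buffer boundaries rather than picking them by hand. The following is a small Python sketch; the `{FIELD}` placeholder convention, the function names and the example limits are assumptions for illustration, and the generated messages would be sent by the instrumented client or server.

```python
def boundary_lengths(limits):
    """Lengths just below, at, and just above each suspected buffer limit."""
    lengths = {0, 1}
    for limit in limits:
        lengths.update({limit - 1, limit, limit + 1})
    return sorted(n for n in lengths if n >= 0)


def build_messages(template, limits, filler=b"A"):
    """Yield (field_length, message) pairs with the placeholder stretched.

    `template` is a protocol message containing the marker b"{FIELD}"
    where the variable-length field goes (a convention made up here).
    """
    for n in boundary_lengths(limits):
        yield n, template.replace(b"{FIELD}", filler * n)
```

For example, `build_messages(b"USER {FIELD}\r\n", [256, 1024])` produces FTP USER commands whose argument is 0, 1, 255, 256, 257, 1023, 1024 and 1025 bytes long. The limits to try are guesses at the buffer sizes an implementation might use, so powers of two and any sizes mentioned in the protocol specification are good candidates.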
Stress testing: This is an important step to ensure that the system under test can cope with the specified capacity. It is also important to ensure that the system meets its performance requirements.
  • Use IXIA/SmartBits kind of tools to simulate large number of client and servers to ensure that system works as specified with respect to capacity.
  • Use IXIA/SmartBits to test the connection rate and ensure it satisfies the specifications of the box.
  • Use IXIA/SmartBits to test the throughput requirements.
  • Use IXIA/SmartBits to test throughput and connection rate combination requirements.
  • Repeat above test cases for 12 hours to ensure that the system is stable.