Random technical bits and thoughts: WAN optimization

Thursday, May 20, 2010

Performance considerations in Proxy based network applications

Many details on performance considerations on proxy based networking applications are given here.

There are some more performance considerations in developing proxy based applications. Here they are:

Use the hugetlbfs file system for running code and for application context memory (for connections): Please see this for an understanding of this technique.
Use user-space RCU wherever possible: See the RCU related information here.
Use futexes as part of the RCU implementation for add/delete operations. See about futexes here.
Use POSIX spinlock-style mutexes only to protect small portions of the code.
Use UIO-based interrupt indication to user-space processes when dealing with memory-mapped hardware accelerators.

Wednesday, May 19, 2010

Responses on the post related to WAN optimization and Infineta systems

I have received many responses to this post. Some asked for clarification on what I meant by deduplication efficiency across restarts. Some questioned how I could state that the Infineta solution keeps no persistent data across restarts. For the record, I did not make that statement. To clarify further, I don't know whether the solution is capable of keeping data intact across power cycles. It is only my reading, based on the claim that the performance of the solution will be in multiples of gigabits per second, up to 10Gbps. I guess we will find out when the product is released.

One response suggested that the solution might well have multiple hard drives, each hanging on a different hardware bus, to achieve multi-gigabit performance. It seems like a possibility, though I am not sure whether it is practical.

One possibility is that multiple blades are combined into one solution, with each blade working almost independently as far as dedup is concerned. This type of solution suits deployments with multiple servers that need to be synchronized (for backup, replication and so on), where the content of each server or set of servers is optimized by a different blade. If there are fewer servers than blades, the capacity of this type of solution can't be fully utilized in those deployments.

One response asked whether I know of any performance benchmarking criteria for evaluating WAN optimization solutions. I am not aware of any standards or de facto standards. If I come across any, I will certainly post them on this blog.

Monday, May 17, 2010

One more WAN Optimization company - Infineta systems

It is good to know that WAN optimization technology is attracting VC money. I believe that this market will continue to grow for a few more years before it saturates. Some research reports put this market at $5 billion by the end of 2014, but I feel that it will be much larger than that. Any organization with multiple branch offices and consolidated central servers benefits from WAN optimization.

Okay, coming to Infineta: what is different about this technology compared to existing WAN optimization? I tried going through the Forrester report. In the end it all comes down to 'performance'. According to this report, existing WAN optimization products peak at 1Gbps, while the performance of Infineta Systems seems to be in multiples of Gbps, up to 10Gbps. According to the report, this kind of performance is required to connect the data centers of a given organization. The reasons given for this kind of performance are:

Replication, mirroring and backup of data and VM images among data centers, for reasons such as business continuity and disaster recovery.
Amount of data exceeding petabytes.
Reducing the latency of the above operations.

According to the job descriptions, it is clear that they are using multi-core processors and FPGAs. I did not find any mention of hard disk capacity. I have a feeling that persistent storage is not used, since it would be difficult to achieve multiple Gbps of throughput with disk access. If that is the case, it will be interesting to see how the de-dup efficiency compares with that of established WAN optimization vendors. Two factors matter here:

Amount of DDR memory: Dedup efficiency depends directly on this. The larger the memory, the higher the dedup efficiency. Without hard drives, storage is limited, and in my view the solution may have a difficult time matching the de-dup efficiency of others in the market.
Loss of cached blocks across power restarts: If there is no persistent storage, data stored in DDR memory is lost across power cycles or when the system is taken out for maintenance. The data cache then has to be rebuilt, which reduces de-dup efficiency right after a power cycle.

Sunday, April 4, 2010

Data Center Switch requirements for new Data Center Architectures

Traditionally data centers have three tiers of switches - Core switches, Aggregate switches and Access switches.

Core Switches : These switches connect to the network that is connected to the WAN links. This is the switch layer farthest from the servers.
Access Switches : These switches are also called top-of-rack switches. Servers (web servers, email servers, application servers, database servers and the others for which the data center is built) are connected to the ports of these switches.
Aggregation Switches: The aggregation switches form the intermediate layer sandwiched between the core and access layers. An aggregation switch aggregates the traffic between the core and access layers. Note that there can be a lot of traffic among servers (specifically among application, web and database servers). This traffic need not be seen by the core switches; it only needs to flow among the access layer switches. The aggregation layer keeps this traffic from being seen by every switch. Core switches see only the traffic going to or coming from the WAN or corporate network. The aggregation layer also reduces the traffic among access layer switches.

It was necessary to have three tiers in earlier data center architectures for the following reason:

A large number of physical machines serving content requires a large number of Ethernet ports. Because of the poor port density of switches, multiple access layer switches were necessary. Multiple switches mean a lot more traffic across access layer switches. One more tier of switches enables good throughput by eliminating the need for a mesh of access layer switches to carry inter-switch traffic.

What are some of the changes in data centers? One big change is the collapse of three tiers into two: the aggregation layer is disappearing. Let us see what is driving this change.

Virtualization technology is reducing the number of physical machines: This implies that fewer ports are needed.
Traffic on each port is increasing: Virtualization and multicore processors are enabling multiple applications on one physical machine. It is not uncommon to see a requirement for multi-gigabit traffic on a single port.
10G ports, and in future 40G/100G ports, are facilitating a unified fabric for both kinds of traffic, application traffic and SAN traffic, thus reducing the number of ports and interconnects.

These technologies reduce cost by reducing the equipment, the interconnects, and the amount of power and cooling required. They also reduce maintenance, and hence cost.

What kind of features would one expect in the switches of new data centers?

Latency should be very low: Eliminating the aggregation layer by itself reduces latency, but that is not good enough for SAN, video and voice workloads. Non-blocking switching and cut-through switching are expected in order to support real-time traffic such as video and voice. Traditionally, switches oversubscribe the bandwidth; that is, they are not capable of receiving and transmitting traffic on all ports at full port bandwidth at the same time, so packets get blocked. Non-blocking switches are expected to send and receive traffic equal to the number of ports multiplied by the bandwidth of each port. If there are ten 1G ports, the switch is expected to receive 10G of traffic and send 10G of traffic.

802.1Qbb (Priority-based Flow Control): When there is congestion at the receiving node, an 802.3x pause frame is normally generated. This pauses all traffic for some time. The 802.1Qbb standard allows pause frames to be generated per 802.1p priority level, which lets high-priority traffic keep flowing. Switches are expected to honor and generate these kinds of frames.
802.1Qaz (Enhanced Transmission Selection): This standard allows bandwidth to be allocated to different priority levels or groups of priority levels. It lets bandwidth reserved for higher-priority traffic be consumed by lower-priority traffic when there is no higher-priority traffic. SAN traffic would need to go at higher priority levels. This feature is also expected to be supported by data center switches.
802.1Qau (Congestion Notification): This standard allows congestion notifications to be communicated to end nodes. An end node that receives a congestion notification can apply rate limiting to its outgoing traffic. This feature is also expected to be supported by data center switches.

Port density should be high.
Multi-path support is required: I am not sure whether there is any standard at this time, but spanning tree is not used in these cases as it provides only one path.
VEPA support would be required eventually. Due to VEPA, the switch may need to support C-VLAN and P-VLANs.
Support for a large number of VLANs is required to work with other network services such as ADCs, WAN optimization and network security (firewall, IPS, IPsec VPN etc.).
Ability to redirect traffic based not only on L2 and L3 fields, but also on L4 fields such as TCP and UDP source and destination ports.
Any switch architecture should work with VM migration from one physical server to another.
Public data center networks require a virtual-instance concept within the switches so that VLANs can be reused across different subscribers, because the number of VLAN IDs is limited.
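
The non-blocking requirement described above is simple arithmetic, and a small sketch may make it concrete. The function names and the fabric capacities below are illustrative assumptions, not taken from any switch data sheet.

```python
def required_fabric_gbps(port_speeds_gbps):
    """Aggregate rate a non-blocking switch must sustain in each direction."""
    return sum(port_speeds_gbps)


def oversubscription_ratio(port_speeds_gbps, fabric_gbps):
    """A ratio above 1.0 means the ports can offer more than the fabric carries."""
    return required_fabric_gbps(port_speeds_gbps) / fabric_gbps


ports = [1] * 10                           # ten 1G ports, as in the example above
print(required_fabric_gbps(ports))         # 10
print(oversubscription_ratio(ports, 10))   # 1.0, non-blocking
print(oversubscription_ratio(ports, 4))    # 2.5, oversubscribed (hypothetical 4G fabric)
```

A ratio of 1.0 or less corresponds to the non-blocking case; anything above it means some packets will have to wait or be dropped when all ports are busy.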

Sunday, March 14, 2010

Linux TCP Large Receive Offload optimization to increase performance

In some network packet processing applications, the number of packets processed determines the performance. TCP is a streaming protocol with no packet boundaries, so consecutive packets of a connection can be aggregated into fewer, larger packets as soon as they are received at the lowest level. The more packets that can be aggregated, the higher the performance. Applications that can benefit are:

Any Proxy based applications (Application Delivery controller, WAN optimization, Network Anti Virus)
IDS/IPS
Firewall ALGs.
Server Applications

I found one excellent paper describing two techniques to improve TCP connection throughput: receive aggregation and acknowledgment offload. Please find it here. The paper also compares performance with and without these optimization techniques: throughput improved from 3.4Gbps to 4.6Gbps, a 35% increase.

The receive aggregation technique is already implemented in the Linux 2.6 kernel, where it is called the Large Receive Offload (LRO) feature. It is implemented in net/ipv4/inet_lro.c.

The receive aggregation technique is simple. It is used only when the Ethernet driver uses NAPI. In NAPI-enabled Ethernet drivers, a softirq receives the packets from the descriptors. Typically NAPI reads out all the packets from the receive descriptors (or until some threshold, the quota, is reached).

An Ethernet driver normally sends the packet up to the stack using netif_receive_skb if NAPI is enabled. In the case of LRO, the packet is given to the LRO library using the lro_receive_skb function.
The LRO module finds the matching flow. If there is no match, it creates a new flow.
The LRO module figures out whether this packet is eligible for aggregation. A packet is not eligible if any of the following conditions applies:

Padded frame (the IP total length must equal the received packet length)
Non-TCP packet
IP options are present
IP ECN CE is set
TCP segment has no data
CWR (Congestion Window Reduced) flag is set
ECE (ECN Echo) flag is set
SYN flag is set
FIN flag is set
URG flag is set
PUSH flag is set
RST flag is set
ACK flag is not set
A TCP option other than Timestamp is present

If the next expected sequence number matches the sequence number of this packet, the packet is added to the existing packet sequence. If not, the packet is not eligible for aggregation.
When a packet is found not to be eligible for aggregation, the buffered packets must be sent to the stack before the current packet. This is done using the lro_flush() function.
If the packet is eligible for aggregation, it is associated with the existing packets by manipulating the skb.
When the aggregation stops, the LRO module does the following before sending the aggregated packet to the stack:

Changes the ACK to the last packet ACK.
Keeps the timestamp option of the last packet.
Recalculates the IP checksum (now the packet became bigger).
Partial Checksum update of TCP payload.

When does the aggregation stop:

When the configured aggregation limit is reached.
When the total packet size is more than (64K-MTU).
When NAPI finishes all the packets in the receive descriptors or reaches its quota.

The Ethernet driver is expected to send up all the packets buffered so far at the end of the current NAPI poll. It does so by calling lro_flush_all.
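
The flow described above can be modelled in a few lines of Python. This is only a sketch of the logic, not the kernel code: the Segment and Lro classes, the 64-segment limit and the 1500-byte MTU are assumptions made for illustration, and the frame-length and non-TCP checks are left out because the model only carries TCP segments.

```python
from dataclasses import dataclass

MAX_SEGMENTS = 64            # assumed per-flow aggregation limit (hypothetical value)
MAX_BYTES = 65535 - 1500     # 64K minus an assumed 1500-byte MTU
BLOCKING_FLAGS = {"CWR", "ECE", "SYN", "FIN", "URG", "PSH", "RST"}


@dataclass
class Segment:
    flow: tuple                          # (src_ip, dst_ip, src_port, dst_port)
    seq: int
    ack: int
    payload: bytes
    flags: frozenset = frozenset({"ACK"})
    ip_options: bool = False
    ecn_ce: bool = False
    non_timestamp_option: bool = False


def eligible(seg):
    """Per-packet eligibility checks from the list above."""
    return (not seg.ip_options and not seg.ecn_ce
            and len(seg.payload) > 0
            and "ACK" in seg.flags
            and not (seg.flags & BLOCKING_FLAGS)
            and not seg.non_timestamp_option)


class Lro:
    def __init__(self, deliver):
        self.deliver = deliver           # callback standing in for the network stack
        self.flows = {}                  # flow -> aggregation state

    def receive(self, seg):
        st = self.flows.get(seg.flow)
        in_order = st is None or st["next_seq"] == seg.seq
        if not eligible(seg) or not in_order:
            self.flush(seg.flow)         # buffered data goes up first
            self.deliver(seg)            # then the current packet, unmodified
            return
        if st is None:
            st = self.flows[seg.flow] = {"seq": seg.seq, "data": bytearray(), "count": 0}
        st["data"] += seg.payload
        st["next_seq"] = (seg.seq + len(seg.payload)) & 0xFFFFFFFF
        st["ack"] = seg.ack              # keep the ACK of the last packet
        st["count"] += 1
        if st["count"] >= MAX_SEGMENTS or len(st["data"]) > MAX_BYTES:
            self.flush(seg.flow)

    def flush(self, flow):
        st = self.flows.pop(flow, None)
        if st:
            self.deliver(Segment(flow, st["seq"], st["ack"], bytes(st["data"])))

    def flush_all(self):                 # called at the end of a NAPI poll
        for flow in list(self.flows):
            self.flush(flow)


delivered = []
lro = Lro(delivered.append)
flow = ("10.0.0.1", "10.0.0.2", 40000, 80)
for i in range(3):
    lro.receive(Segment(flow, 1000 + 100 * i, 1, b"x" * 100))
lro.flush_all()
print(len(delivered), len(delivered[0].payload))
```

The real module also rewrites the IP and TCP headers and checksums of the merged skb; the sketch only keeps the fields needed to show when packets are merged and when they are flushed.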

Monday, January 4, 2010

Data Redundancy Elimination across files

HTTP, CIFS and other application-level proxies typically apply DRE within the scope of the file being requested. That is, if the content changes under the same file name, these proxies do DRE effectively. As part of the request, the client proxy sends the signature information of the file it has, and the server proxy computes the delta from those signatures and the new file it gets from the origin server. The delta is then sent to the client proxy, which in turn reconstructs the new content from the old file and the delta it receives. That is, the scope of the delta information is limited to a given file.

It is my observation that enterprise users, when they modify documents, presentation files and so on, tend to make a new copy and give it a new file name; that is, they version their changes by keeping different files. One might argue that this is a bad way of keeping versions, but it happens more frequently than one would imagine. In those cases, file-based delta information does not work. One may argue that such instances are few, but they are not insignificant.

A good WAN optimization device should be able to do DRE on blocks across files. I am not sure whether WAN optimization vendors are doing this already, but as an end user, you should look for this feature.
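
One way to picture DRE across files is a block index keyed by content rather than by file name. The sketch below is my own illustration, assuming fixed 4096-byte blocks and SHA-256 block identifiers; the BlockStore class is hypothetical and does not describe any vendor's implementation.

```python
import hashlib
import os

BLOCK_SIZE = 4096   # assumed fixed block size


class BlockStore:
    """Content-addressed store shared by every file that passes through."""

    def __init__(self):
        self.blocks = {}                     # digest -> block bytes

    def encode(self, data):
        """Return (tokens, literal_bytes) for one file, whatever its name."""
        tokens, literal_bytes = [], 0
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            digest = hashlib.sha256(block).digest()
            if digest in self.blocks:
                tokens.append(("ref", digest))       # only the identifier is sent
            else:
                self.blocks[digest] = block
                tokens.append(("raw", block))        # new data goes on the wire
                literal_bytes += len(block)
        return tokens, literal_bytes


store = BlockStore()
report_v1 = os.urandom(10 * BLOCK_SIZE)              # stands in for "report.doc"
report_v2 = report_v1 + b"new paragraph"             # saved as "report_v2.doc"
print(store.encode(report_v1)[1])                    # all ten blocks are new
print(store.encode(report_v2)[1])                    # only the appended bytes
```

Because the lookup never consults the file name, the renamed copy deduplicates against the original. Fixed-size blocks only line up when the change comes at the end of the file; handling insertions in the middle needs an rsync-style sliding match or content-defined block boundaries.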

Wednesday, December 30, 2009

WAN Optimization Case Studies (Public information)

Updated on 01/08/2010

1. Sony Ericsson case study of WAN Optimization: http://www.microsoft.com/casestudies/Case_Study_Detail.aspx?casestudyid=4000004790

Vendor/product: Cisco WAAS on Windows Server 2008
Savings: 24000 per branch office.
Use cases described :

Speed of the project life cycle management application increased by 20 times.
Video file transfer performance improved by 400%
Time taken to transfer data worth 12GBytes reduced to 1.5 hours from 12 hours.

2. Studley Inc., a real estate services firm based in New York City case study of WAN optimization: http://searchenterprisewan.techtarget.com/news/article/0,289142,sid200_gci1378242,00.html

Vendor : Riverbed WAN Accelerator
Use cases described:

SharePoint application acceleration: Loading time improved from 1-2 minutes to 4-5 seconds.

Monday, December 28, 2009

librsync usage

Please see the rsync algorithm here, in the blog post on rsync. librsync is a library that can be linked with any application. It has all the features required for de-duplication. Any de-duplication functionality requires that the new file content be generated from the old content and delta information. It requires the following features.

Generation of signatures by the entity that is expected to receive the new file. Signatures are nothing but weak checksums (an Adler-32 style rolling checksum) and strong checksums (MD4 in librsync) of all non-overlapping blocks of the file it has.
The entity that is supposed to send the new file should be able to take the signature file and the new file it has, and generate a delta file. The delta file consists of a set of instructions. Instructions include COPY a block from the old file (the delta file holds only a reference to the block) and use literal content carried in the delta file for ranges that the old file does not have.
Ability to merge the delta file with the old content file to generate the new file. This is typically done by the entity that receives the delta file.

The librsync library provides all these capabilities. If you look at the whole.c file of the librsync library, you will find the following functions:

rs_sig_file: This function can be used to generate the signature file. It is called by the entity that has the old content file.
rs_loadsig_file: This function can be used to load the signature file contents into an in-memory hash list. It is called by the entity that has the new content file.
rs_delta_file: This function can be used to create the delta file from the signatures loaded earlier and the new content file. It is called by the entity that has the new content file.
rs_patch_file: This function can be used to generate the new content file from the delta file and the existing content file from which the signature file was created. It is called by the receiver of the content.
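
To show how the three steps fit together, here is a toy Python model of signature, delta and patch. It does not call librsync and does not reproduce its file formats: zlib.adler32 and hashlib.md5 are stand-ins for the weak and strong checksums, the 2048-byte block size is an assumption, and the weak checksum is recomputed at every offset instead of being rolled, which is slow but keeps the sketch short.

```python
import hashlib
import zlib

BLOCK = 2048   # assumed block size


def signature(old):
    """Receiver side: weak -> {strong: offset} for each non-overlapping block."""
    sig = {}
    for off in range(0, len(old), BLOCK):
        block = old[off:off + BLOCK]
        sig.setdefault(zlib.adler32(block), {})[hashlib.md5(block).digest()] = off
    return sig


def delta(sig, new):
    """Sender side: list of ("copy", offset, length) and ("literal", bytes)."""
    ops, literal, pos = [], bytearray(), 0
    while pos < len(new):
        window = new[pos:pos + BLOCK]
        strongs = sig.get(zlib.adler32(window))          # cheap check first
        off = strongs.get(hashlib.md5(window).digest()) if strongs else None
        if off is None:
            literal.append(new[pos])                     # no match, slide by one byte
            pos += 1
            continue
        if literal:
            ops.append(("literal", bytes(literal)))
            literal = bytearray()
        ops.append(("copy", off, len(window)))           # reference into the old file
        pos += len(window)
    if literal:
        ops.append(("literal", bytes(literal)))
    return ops


def patch(old, ops):
    """Receiver side: rebuild the new file from the old file and the delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += old[op[1]:op[1] + op[2]]
        else:
            out += op[1]
    return bytes(out)


old = b"".join(hashlib.md5(str(i).encode()).digest() for i in range(625))
new = b"hdr" + old[:5000] + b"XYZ" + old[5000:]
ops = delta(signature(old), new)
print(patch(old, ops) == new)
print(sum(len(op[1]) for op in ops if op[0] == "literal"), "literal bytes of", len(new))
```

In librsync terms, signature() plays the role of rs_sig_file, delta() the role of rs_loadsig_file plus rs_delta_file, and patch() the role of rs_patch_file, although the real functions work on files and have different signatures.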

The 'librsync' library keeps all state information local to a control block called a 'job'. Because of this, librsync can be used to run multiple operations at the same time. Each operation has its own 'job' and hence has no impact on other operations. This makes it suitable for proxy applications where one user process handles multiple connections at the same time.

Saturday, December 26, 2009

WAN Optimization Devices - Features that should be expected by buyers

Many WAN optimization solutions are available today. What are the features one should look for? This list is my view of the features the box should provide. Some features might not be used in some deployments, though.

WAN optimization is all about utilizing WAN resources effectively. At a very high level, the main features one should look for are:

Usability and Experience
Application Detection and providing differential QoS based on applications.
Load balancing and failover of the connections across multiple WAN links
De-Duplication of the data among offices
Compression
Caching
Security
Reliability

WAN optimization has been, and for at least a year or two will continue to be, a special device in enterprise networks. As the value of WAN optimization is realized, more and more routing, switching, ADC and network security vendors are adding it as an additional feature in their offerings, as a blade or as a software component running on some cores of multicore processors. Buyers should not just look for a tick mark next to WAN optimization, but should check the details.

Usability features:

No configuration changes to client and server machines or their applications should be needed as part of installing a WAN optimization device in the network: Client machines (desktops, laptops, mobiles etc.) and server machines (running HTTP, email or CIFS servers) should not even know that a WAN optimization device is installed in the network. No changes should have to be made to the machines, or to the applications running on them, when these devices are added or removed.
No changes to the existing network infrastructure devices should be needed when WAN optimization devices are installed in the network, except where there is asymmetric routing. It is understandable that, in the case of asymmetric routing, routers need to be configured to redirect packets to the WAN optimization devices using the WCCPv2 protocol.
The WAN optimization device should not appear as an L3 hop: WAN optimization devices are expected to provide Layer 2 transparency, intercepting packets at Layer 2 by acting as an L2 bridge.
IPv4 and IPv6 support: During this period of migration from IPv4 to IPv6, both types of networks, clients and servers are possible. The WAN optimization device is expected to support IPv4 and IPv6 networks, clients and servers.
When the WAN optimization device is inline with the traffic, it is expected to have a high availability feature using the Active-Backup method at the minimum: If one device fails, the other device should be able to take over the WAN optimization functionality. UDP connections must continue to work after the backup device takes over. Existing TCP connections might break, but new connections must go through the backup device successfully.
The device is expected to provide a GUI for configuration.

WAN optimization devices should be able to learn about other devices and their reachability information dynamically. They are also expected to provide a configuration facility to add and remove reachability information for other WAN optimization devices statically.
Any configuration made on a device should be propagated to peer devices if some form of that configuration is required on them.
It is normally expected that any configuration change takes effect immediately; that is, no restart should be necessary for the configuration to become effective.
Configuration through a secure mechanism is expected: SSH for CLI access, HTTPS for the GUI.
Configuration consistency across device restarts is expected.
For Common Criteria and other certifications, the configuration facility is expected to have role-based management, with multiple roles and multiple users belonging to each role. It is also expected that an audit trail is created upon configuration changes. Audit logs are expected to contain all the configuration information that changed in the modified records.
Any configuration update on the active device should be reflected on the backup device without any additional effort by the administrator.
A centralized management system to configure multiple WAN optimization devices from a single console is normally expected when there are more than a few WAN optimization devices (for example, more than 4).

The device is expected to provide multiple kinds of reports and statistics to the admin.

Reports on the amount of WAN bandwidth saved over a specific time period:

Due to de-duplication, due to compression, due to caching etc.
On different protocols (HTTP, NFS, CIFS etc..)

Reports on the integrity of dedup repositories.
Reports on the possible savings if more memory or hard drive capacity is added.
Reports on traffic belonging to different applications over a specific period of time.
Reports on the amount of WAN utilization and under-utilization.
Multiple types of statistics collected over a significant amount of time and presented for specific time periods such as hours, weeks and months.
Debug statistics which aid in field debugging.
Tracing facilities in field with different levels of traces.

Application Detection and providing differential QoS based on applications:

One of the features to utilize the WAN links effectively is to identify the applications and apply traffic management facilities such as scheduling, marking and bandwidth control. Lower-priority applications, such as P2P and non-interactive or non-real-time applications, can be restricted to less bandwidth when data from higher-priority applications is waiting to be sent on the WAN links.

Many applications can be detected based on the destination port of the TCP or UDP protocol. The WAN optimization device is also expected to detect applications that do port hopping, for example P2P and IM applications. Application detection is also required to recognize HTTP connections being used for social networking, DDL (Direct Download Links) and so on. Application detection determines an application ID for each connection. QoS policies need to have the application ID as one of the match criteria, so that policy-specific actions such as bandwidth control and prioritization of traffic onto the WAN link can be applied.

Load balancing and fail-over of the connections across multiple WAN links

It appears that many deployments meet their bandwidth requirements with a larger number of WAN links rather than one bigger pipe. I believe it is less expensive. It also lets the organization scale its bandwidth as it grows. Having multiple WAN links rather than a single big pipe also avoids loss of connectivity if one link fails. These WAN links are normally taken from different ISPs so as to avoid an outage if one ISP fails.

WAN optimization devices are expected to support multiple links going towards the WAN. They are expected to balance traffic across the WAN links (based on a hash of IP packet header fields such as source IP, destination IP etc.) and to move existing connections from a failed WAN link to the remaining links. They are also expected to maintain the order of packets within a flow, so administrators should be able to configure the balancing criteria.
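
The hash-based balancing and fail-over described above could look like the following sketch. It uses rendezvous (highest-random-weight) hashing, which is my choice for the illustration rather than something any particular product is known to use; the WanLinkBalancer class and the link names are hypothetical.

```python
import hashlib


class WanLinkBalancer:
    """Pick a WAN link per flow so that all packets of a flow use one link."""

    def __init__(self, links):
        self.links = list(links)
        self.up = set(self.links)

    def set_state(self, link, is_up):
        if is_up:
            self.up.add(link)
        else:
            self.up.discard(link)

    @staticmethod
    def _score(flow, link):
        key = "|".join(map(str, flow)) + "|" + link
        return hashlib.sha256(key.encode()).digest()

    def pick(self, flow):
        """flow is a tuple of header fields, e.g. (src_ip, dst_ip, sport, dport)."""
        live = [link for link in self.links if link in self.up]
        if not live:
            raise RuntimeError("no WAN link is up")
        # The same flow always scores the links the same way, so packet order
        # within a flow is preserved as long as its link stays up.
        return max(live, key=lambda link: self._score(flow, link))


lb = WanLinkBalancer(["isp-a", "isp-b", "isp-c"])
flow = ("10.1.1.5", "192.0.2.10", 51000, 443)
first = lb.pick(flow)
lb.set_state(first, False)            # simulate failure of that link
print(first, "->", lb.pick(flow))     # the flow moves to a surviving link
```

With this scheme only the flows that were on the failed link are moved; flows on the surviving links keep their link, which avoids reordering them during fail-over. The tuple passed as the flow key is the configurable balancing criterion.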

De-Duplication of the data among offices

This is one important feature to reduce the amount of traffic exchanged (on WAN links) among offices of an organization. Basic purpose of de-duplication is to ensure duplicate data is not seen on the wire. Peer WAN optimization devices are expected to hide these details from clients and servers which are exchanging the data. Handling and processing of deduplication happens among WAN optimization devices. Different vendors may have implemented this in different ways. It is important to check the de-dup efficiency. Some of the features to look for are:

Block-level deduplication with a configurable block size is to be expected.
De-duplication must work across protocols. That is, if a client downloads data from a server using HTTP the first time, and then downloads the same data from the same server using CIFS, the actual data is not expected to be seen on the WAN link.
Dedup efficiency when the data has not changed on the server but is downloaded by the client again: In this case, it is expected that no data, only block identifiers, is sent on the WAN link; that is, 100% dedup efficiency is expected.
Minimal changes to the data on the server should also lead to near 100% dedup efficiency. For example, if a few bytes are added at the beginning of a file on the server, only the changed data, or at most one or two additional blocks, is expected to be seen on the WAN link when the client downloads the changed file. Such a change should not lead to transfer of the complete file content. Further downloads of the same file should again see 100% dedup efficiency, assuming the file has not changed on the server.
Blocks stored by the WAN optimization device should be persistent. This data should be available after any device restart. Example scenario: A file is downloaded by the client machine from the server machine, and the WAN optimization devices cache the blocks of data. Restart the device and download the same file from the server. There should be 100% dedup efficiency. It is understandable that a device takes some time to recreate its internal search lists from the disk after a restart, and a download during this time will not achieve 100% dedup. But once the system is ready with its internal lists, dedup efficiency should be 100%.
Look at the amount of disk space and memory the device has. Dedup efficiency depends directly on this. Some devices don't support disk drives for storing the dedup data. That causes multiple problems:

Dedup efficiency will not be good as DDR space is limited to store both search lists and data blocks.
When device restarts, there is no data in the DDR which leads to learning the data afresh.

Dedup efficiency when tools like fragroute are used should be as good as when they are not used. It appears that some WAN optimization devices don't work well in these scenarios. One might argue that fragroute is a lab tool, but it is necessary to remember that fragroute simulates some real network conditions. For example, it is normal practice to break TCP segments into smaller segments to reduce the head-of-line blocking that large TCP packets create for VoIP RTP traffic. The fragroute tool can be used to do multiple things, such as:

Breaking the TCP segments to smaller segments.
Breaking IP datagrams to multiple IP fragments.
Reordering of TCP segments and IP datagrams/fragments.

Whatever disk size the WAN optimization devices have, it may not be sufficient compared with the amount of new data flowing on the WAN links. WAN optimization devices are expected to discard blocks that have not been used for a long time to make space for new data. New data should be given higher priority all the time. One way to test this is to fill the disk by sending unique blocks of data, then let the client download a big file from the server, and ensure that there is 100% dedup efficiency when the client downloads the same file again.
Support for protocol adapters: Expect protocol adapters for different protocols for the following reasons.

Some protocols such as CIFS are chatty. To reduce the chattiness, some knowledge of the protocol is required to do operations such as 'read-ahead'.
Knowing the data boundaries makes deduplication more efficient. The WAN optimization device can wait for data up to a known boundary and then do dedup processing, which gives a higher chance of finding duplicate blocks.
ALG functionality to identify data connections such as RTP, so that special processing can be applied to reduce latency and jitter.
Decoding and decompression of data before dedup processing occurs.
SSL support: For SSL termination and SSL connection establishment.

Support for dedup on real-time and/or streaming traffic: When reliable channels are used between WOC devices, the quality of real-time traffic can suffer. Hence WAN optimization devices are expected to use a non-reliable channel for real-time traffic.
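
The eviction test described in the list above (fill the store with unique blocks, download a big file, then download it again) can be modelled with a small least-recently-used block cache. This is a sketch under my own assumptions: capacity is counted in blocks, blocks are a fixed 1024 bytes, SHA-256 is the block identifier, and the LruBlockCache class is hypothetical.

```python
import hashlib
import os
from collections import OrderedDict


class LruBlockCache:
    """Dedup block store that evicts the least recently used block when full."""

    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()          # digest -> block, oldest first

    def transfer(self, data, block_size=1024):
        """Simulate one transfer and return dedup efficiency in [0.0, 1.0]."""
        saved = 0
        for i in range(0, len(data), block_size):
            block = data[i:i + block_size]
            digest = hashlib.sha256(block).digest()
            if digest in self.blocks:
                saved += len(block)                  # only the identifier is sent
                self.blocks.move_to_end(digest)      # mark as recently used
            else:
                self.blocks[digest] = block
                if len(self.blocks) > self.capacity:
                    self.blocks.popitem(last=False)  # drop the oldest block
        return saved / len(data) if data else 0.0


cache = LruBlockCache(capacity_blocks=100)
cache.transfer(os.urandom(100 * 1024))       # fill the store with unique blocks
big_file = os.urandom(50 * 1024)
print(cache.transfer(big_file))              # first download: nothing to reuse
print(cache.transfer(big_file))              # repeat download: fully deduplicated
```

A store that refused new blocks once full would show no savings on the repeat download, which is the behaviour the test is meant to catch. A file larger than the whole store would defeat plain LRU as well, so the result also depends on sizing.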

Compression of data before sending it on the WAN links

WAN optimization devices are expected to provide a compression feature to reduce the data on the WAN link. Dedup reduces the data by not sending duplicates; compression shrinks the data that does get sent. Hence both functions are expected in WAN optimization products.

Compression is expected to go beyond packet-level compression. It should operate across the reliable channel (connection) normally established between WAN optimization devices. Compression that maintains a history can do a better job of reducing the data over time.

Caching

The caching feature completely eliminates any data, including dedup block identifiers, going on the wire if the file has not changed between downloads. It is only possible to do this with the HTTP protocol.

The WAN optimization device should act as an HTTP proxy supporting HTTP/1.0 and HTTP/1.1.
SSL termination, so that both caching and deduplication can be done across WAN optimization devices. The peer WAN optimization device can make the SSL connection to the server.

Security

Security of data at rest: A WAN optimization device stores the dedup data on its hard drive. This could be confidential data. Expect WAN optimization devices to support secure storage of the data.

Crypto file system versus normal file system: Devices must ensure that confidential data is stored in a crypto file system.
The encryption key used by the crypto file system must not be stored on the same device. Expect the device to provide KMIP or equivalent functionality to get the keys from a key management server. This ensures that when a device is stolen, thieves don't get their hands on the clear data.

Security of data on the wire: WAN optimization devices are expected to provide secure transfer of data among themselves. Note that these devices terminate the SSL connection at one device and make a new SSL connection from the remote device to the ultimate server. Hence the communication between these devices must itself be secured. IPsec is one popular technology for securing data on the wire. Expect IPsec-like functionality in WAN optimization devices.

Reliability & Data Consistency:

As part of dedup and caching, data is stored on disk. Expect some functionality to ensure that the data read back from the disk is the same as the data that was written. RAID is one method of maintaining that kind of integrity in the face of disk-related errors.

As always, it is good to get the devices and evaluate them in your network for a significant amount of time before buying them.

Friday, December 25, 2009

ecryptfs - some thoughts on network device appliance usage

There are many cryptographic file systems for Linux, but 'ecryptfs' has made it into the mainline kernel at kernel.org. Almost all distributions support ecryptfs today. Because of this, my belief is that many appliance vendors use this file system to store files in encrypted form.

Network appliances such as WAN optimization devices requiring storage of confidential information in secure form can use this file system to store files in encrypted form.

There are multiple utilities provided to configure the file system with different security parameters. It supports X.509 certificates (RSA) for encrypting and decrypting the encryption key of each file written to this file system. This file system is really a wrapper (stacked) file system on top of existing file systems such as EXT2 and EXT3. You can have multiple ecryptfs file systems on one existing file system. Create a directory and mount it as 'ecryptfs'. Any file written to this directory is encrypted. You can get ecryptfs utilities here.

It appears from the source code that each file written to the crypto file system can have its own encryption key. It makes use of the keyring facility provided by the Linux kernel to store keying material.

How can crypto file systems be used by appliances? Unlike typical user machines, appliances are not attended all the time. They are expected to be up all the time and to restart by themselves after any failure.

When 'ecryptfs' is started, it expects to be given key information such as the RSA private key. To ensure that this private key is not visible to a thief if, say, a laptop is stolen, the private key is normally encrypted with a passphrase. This passphrase is a secret that the user is expected to enter when the laptop is powered up. Having the user provide the passphrase works fine for laptops, but not for appliances, which are unattended. If no passphrase is used, the RSA private key is stored in the clear. If the appliance is stolen, the private key can be read, and therefore so can the complete file system. The basic purpose of the encrypted file system is then lost. So the RSA private key should not be stored on the appliance itself. I suggest keeping the RSA key pair on an external (but physically protected) machine and mounting it on the local machine. Since the RSA key pair is not on the appliance, the information remains secure even if the appliance is stolen.

Usage of 'ecryptfs' can be found here: http://maketecheasier.com/create-a-private-encrypted-folder-on-ubuntu-hardy-with-ecryptfs/2008/09/25

Thursday, December 24, 2009

Secure Storage of cached and de-duplication data

WAN optimization devices store the data (content) on their own hard drives to serve future requests. Some of this data can be confidential. The more places the data is stored in the clear, the more opportunities there are for it to leak. In the WAN optimization case, if a device is stolen, thieves can retrieve confidential data easily. To protect the privacy of the data, the devices need to store the content in encrypted form. One thing to ensure is that DRE (Data Redundancy Elimination) efficiency does not go down when encryption is applied. Encryption algorithms, like compression algorithms, create dependencies across the data in a file: some portion of the previously encrypted block is used to encrypt the next block. This breaks de-duplication efficiency dramatically. Every time the file is modified by the user application, the encrypted content from the point of the change onward is completely different, even though the change to the clear file was small.

'rsyncrypto' encrypts a file in such a way that a small change in the clear file does not propagate through the rest of the encrypted file. Typically the IV (Initialization Vector) for a block is taken from the previous encrypted block and used along with the key to encrypt the new block. 'rsyncrypto' limits this chaining: instead of always taking the IV from the previous block, it periodically restarts the chain from the same per-file IV, so a small change in the clear file changes only a small part of the encrypted file. Though this reduces the security somewhat, it still provides enough security for many uses.
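The effect of chaining on de-duplication can be seen with a toy example. The sketch below is not real cryptography and is not rsyncrypto's algorithm: it builds a small Feistel cipher from SHA-256 purely for illustration, and all names in it are invented for this example. It compares a chained mode, where each block's input is mixed with the previous encrypted block, with a simplified mode that reuses the same IV for every block.

```python
import hashlib

BLOCK = 16  # toy block size in bytes

def _xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def toy_encrypt_block(key, block):
    """Four-round Feistel network over SHA-256. For illustration only, not secure."""
    left, right = block[:BLOCK // 2], block[BLOCK // 2:]
    for rnd in range(4):
        f = hashlib.sha256(key + bytes([rnd]) + right).digest()[:BLOCK // 2]
        left, right = right, _xor(left, f)
    return left + right

def encrypt_chained(key, iv, data):
    """Each block is mixed with the previous ciphertext block before encryption."""
    out, prev = [], iv
    for off in range(0, len(data), BLOCK):
        prev = toy_encrypt_block(key, _xor(data[off:off + BLOCK], prev))
        out.append(prev)
    return out

def encrypt_unchained(key, iv, data):
    """Every block is mixed with the same IV, so blocks do not depend on each other."""
    return [toy_encrypt_block(key, _xor(data[off:off + BLOCK], iv))
            for off in range(0, len(data), BLOCK)]

def changed_blocks(a, b):
    return sum(x != y for x, y in zip(a, b))

if __name__ == "__main__":
    key, iv = b"demo key", bytes(BLOCK)
    old = bytes(range(256))            # 16 blocks of clear data
    new = bytearray(old)
    new[3] ^= 0xFF                     # modify one byte in the first block
    new = bytes(new)
    print("chained  :", changed_blocks(encrypt_chained(key, iv, old),
                                       encrypt_chained(key, iv, new)))
    print("unchained:", changed_blocks(encrypt_unchained(key, iv, old),
                                       encrypt_unchained(key, iv, new)))
```

With chaining, the one-byte change alters every encrypted block after it, so a delta or dedup pass over the encrypted files finds nothing in common. Without chaining, only the modified block differs, at the cost of revealing which clear blocks are identical.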

The backup market can certainly use this feature to secure data at rest while maintaining de-duplication efficiency. It is particularly useful when external backup storage providers are used to back up the data. The users must have control over the keys used to encrypt the files, and backup storage providers must never have access to these keys, including while applying delta changes. This requirement means that the delta must be computed between the old and new encrypted files. So the 'rsyncrypto' utility is really useful here. When the data needs to be retrieved from the backup storage provider, the keys known to the user are used to decrypt the files.

I am not sure whether this technique is applicable to the WAN de-duplication market. WAN optimization devices need to serve the content locally without downloading the data from a central WAN optimization device. That is, all WAN optimization devices should be able to get hold of the clear data. Hence, I feel that WAN optimization devices would use 'crypto file systems'. These file systems transparently encrypt all files in the file system. The WAN optimization feature in the device needs no knowledge of this. This kind of secure storage appears to be fine, as these WAN devices are typically administered by the same entity that serves the confidential data.

Sunday, February 15, 2009

RSYNC - Applicability for WAN Deduplication

The good old rsync can be key to WAN deduplication.

The purpose of WAN deduplication is to reduce the amount of traffic on the WAN links by removing duplicate data. It is also called DRE (Data Redundancy Elimination).

'rsync' is a utility that has been provided in Unix variants for a long time. It is mainly used to mirror data across multiple servers and is also used for backing up data. It recursively goes through all files and directories and updates the data on the backup or mirror server. One feature that is interesting for WAN deduplication is its ability to do 'delta encoding'. 'rsync' can send only the 'differences' to the destination machine, which in turn creates the new file from its existing file and the 'delta' information it gets from the origin server. 'rsync' can also compress the data using 'zlib', thereby saving even more WAN bandwidth.

To understand the algorithms used by 'rsync' for delta encoding, check this technical report. It uses a mechanism called 'rolling checksum' (based on the Adler-32 algorithm) to figure out the differences between the file the mirror machine has and the file the origin server has. Why is this 'rolling checksum' required? When a file gets updated on the origin server, the changes could be anywhere in the file: at the beginning, in the middle or at the end. 'Delta' generation should work, and should not resend any common data, irrespective of where the changes were made.

The mirror server breaks the file it has into chunks of some size (typically S = 1K) and calculates both a rolling checksum and an MD5 checksum on each chunk. Then it sends them to the origin server. The origin server also computes rolling checksums over chunks of size S. Since the file might have changed, the origin server computes a rolling checksum at every byte offset, whereas the mirror server computes them only for non-overlapping chunks. Since the origin server generates a large number of checksums, the rolling checksum algorithm needs to be very fast. The rolling checksum can be updated incrementally: when the window slides by one byte, the new checksum is derived from the previous one without going through all the bytes of the chunk again.

The received checksums are then compared with the checksums generated locally by the origin server. If a rolling checksum matches, the origin server verifies the match by comparing the MD5 checksum. If the MD5 checksum also matches, the origin server assumes that the mirror server has this chunk. Once it has found all the duplicate data, it sends only references to the matching blocks plus the new or modified data for the non-matching regions. For detailed information on this algorithm, please check this link.

Another utility that is useful for WAN deduplication is 'rdiff'. 'rdiff' uses the rsync algorithm to generate a delta file containing the differences between an old file and a new file. This delta can then be applied to the old file on another machine to get the new file.
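The exchange described above can be sketched in a few lines of Python. This is a simplified illustration rather than rsync's actual implementation: the function names, the ('copy', index) / ('data', bytes) instruction format and the 64-byte block size used in the demo are assumptions made for this example. The weak checksum follows the two-part sum from the rsync technical report, and MD5 is used as the strong checksum as in the text.

```python
import hashlib

M = 1 << 16

def weak_checksum(block):
    """Two-part weak checksum (a, b) of a block, each part modulo 2^16."""
    n = len(block)
    a = sum(block) % M
    b = sum((n - i) * x for i, x in enumerate(block)) % M
    return a, b

def roll(a, b, out_byte, in_byte, n):
    """Slide an n-byte window by one byte without re-reading the window."""
    a = (a - out_byte + in_byte) % M
    b = (b - n * out_byte + a) % M
    return a, b

def signature(old, size):
    """Mirror side: checksums of the non-overlapping full blocks it already has."""
    table = {}
    for index in range(len(old) // size):
        block = old[index * size:(index + 1) * size]
        entry = (hashlib.md5(block).digest(), index)
        table.setdefault(weak_checksum(block), []).append(entry)
    return table

def delta(new, table, size):
    """Origin side: scan every byte offset and emit copy/data instructions."""
    ops, literal, pos, ab = [], bytearray(), 0, None
    while pos + size <= len(new):
        window = new[pos:pos + size]
        if ab is None:
            ab = weak_checksum(window)
        match = None
        if ab in table:  # cheap test first, strong checksum only on a weak hit
            strong = hashlib.md5(window).digest()
            match = next((i for s, i in table[ab] if s == strong), None)
        if match is not None:
            if literal:
                ops.append(("data", bytes(literal)))
                literal = bytearray()
            ops.append(("copy", match))
            pos, ab = pos + size, None
        else:
            literal.append(new[pos])
            if pos + size < len(new):
                ab = roll(ab[0], ab[1], new[pos], new[pos + size], size)
            pos += 1
    literal.extend(new[pos:])
    if literal:
        ops.append(("data", bytes(literal)))
    return ops

def patch(old, ops, size):
    """Mirror side: rebuild the new file from its old file plus the delta."""
    out = bytearray()
    for kind, value in ops:
        out += old[value * size:(value + 1) * size] if kind == "copy" else value
    return bytes(out)

if __name__ == "__main__":
    old = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(128))
    new = old[:1000] + b"INSERTED" + old[1000:]
    ops = delta(new, signature(old, 64), 64)
    sent = sum(len(v) for k, v in ops if k == "data")
    print("rebuilt correctly:", patch(old, ops, 64) == new, "literal bytes:", sent)
```

Only the region around the insertion travels as literal data; everything else is sent as block references, which is exactly the property WAN deduplication is after.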

Sunday, May 18, 2008

Hardware timer block in Multicore processors for network infrastructure devices

Some use case scenarios of timers for different functions of network infrastructure devices are given here.

One of the main challenges with software timers is to ensure that the jitter and latency of packets don't go up while timer block operations are running. Packets are delayed or even dropped when the CPU takes too long to process some timer block function. Any timer block function that walks the timers in a tight loop affects packet processing if the number of timer elements checked or acted on in that loop is large. The number of elements at which the tight loop starts to disrupt packet latency depends on the CPU frequency. Depending on the software timer block implementation, different operations involve traversing timers. Let us see some of the challenges/problems with software timer modules.

Software timers depend on the hardware timer interrupt. In Linux, the timer interrupt occurs every jiffy (typically 1 msec or 2 msec). Due to this, any software timer can have an error of up to one jiffy. If an application requires a smaller error, say in terms of microseconds, then the only method I can think of is to make the timer interrupt occur at microsecond intervals. This may not work on all processors: the interrupt processing overhead on the cores becomes too high and reduces the performance of the system. Fortunately many applications tolerate millisecond error in firing the timers, but some applications, such as QoS scheduling on multi-gigabit links on general purpose operating systems such as Linux, require finer-grained and more accurate timers.
Many networking applications require a large number of software timers, as described in the earlier post. This leads to traversing many timers on every jiffy. For example, if an application creates 500K timers/sec, then with a 1 msec jiffy there would be 500 timers per jiffy. Every millisecond, it needs to traverse 500 timers and may have to fire all 500 of them. This can take a significant amount of time, depending on how long the application timer callback takes. If the callbacks take a good amount of time, you get packet drops, increased packet latency or both. Some software implementations maintain the timers on a per-core basis. If there are 8 cores, each core may be processing 62 or 63 timers every millisecond. This is the ideal case, but what if the traffic workload causes only a few cores to start the timers? Then only those few cores are loaded with processing the expired timers. Basically, the load may not get balanced across the cores.
To reduce the number of timers to traverse on every hardware timer interrupt, software implementations normally use cascaded timer wheels. Such an implementation has different timer wheels for different timer granularities, and when timers are started, they go to the appropriate wheel and bucket. Due to this, the bucket of the finest wheel that is processed on a tick contains only timers that are about to expire. Though this reduces the number of timers to traverse on every timer interrupt, it may involve moving a large number of timers from one timer wheel to another, as described in the earlier post. This movement of timers may take a significant amount of time and again could be the cause of packet drops and increased latency.
If there are periodic timers, or timers that need to be restarted based on activity, the software timer implementation spends a good amount of time restarting them.
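The cascading cost mentioned in the list above can be made concrete with a small two-level timer wheel. This is a minimal sketch, not the Linux kernel's implementation: the class name, the two-level layout and the 'moved' counter are assumptions made for illustration. The fine wheel has one slot per tick, the coarse wheel has one slot per full turn of the fine wheel, and every time the fine wheel wraps around, one coarse slot is emptied and its timers are moved down.

```python
class CascadedTimerWheel:
    """Two-level timer wheel: a fine wheel (1 tick per slot) and a coarse
    wheel (`slots` ticks per slot). Delays must be below slots * slots."""

    def __init__(self, slots=8):
        self.slots = slots
        self.now = 0
        self.fine = [[] for _ in range(slots)]
        self.coarse = [[] for _ in range(slots)]
        self.moved = 0  # timers moved from the coarse to the fine wheel

    def start(self, delay, callback):
        if not 1 <= delay < self.slots * self.slots:
            raise ValueError("delay out of range for this wheel")
        self._place(self.now + delay, callback)

    def _place(self, expiry, callback):
        if expiry - self.now < self.slots:
            self.fine[expiry % self.slots].append((expiry, callback))
        else:
            slot = (expiry // self.slots) % self.slots
            self.coarse[slot].append((expiry, callback))

    def tick(self):
        """Advance time by one tick (one hardware timer interrupt)."""
        self.now += 1
        if self.now % self.slots == 0:
            # The fine wheel wrapped around: cascade one coarse slot down.
            # This loop is the timer movement that costs CPU time.
            slot = (self.now // self.slots) % self.slots
            pending, self.coarse[slot] = self.coarse[slot], []
            for expiry, callback in pending:
                self.moved += 1
                self._place(expiry, callback)
        # Only the timers in the current fine slot are traversed and fired.
        slot = self.now % self.slots
        due, self.fine[slot] = self.fine[slot], []
        for _expiry, callback in due:
            callback()

if __name__ == "__main__":
    wheel = CascadedTimerWheel(slots=8)
    for delay in (3, 8, 20, 63):
        wheel.start(delay, lambda d=delay: print("fired", d, "at tick", wheel.now))
    for _ in range(64):
        wheel.tick()
    print("timers moved between wheels:", wheel.moved)
```

Each tick touches only one bucket, but every timer with a delay of one full turn or more is handled twice, once when it is started and once when it is cascaded. With a large number of long timers, the cascade loop is where the bursts of CPU time come from.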

Do hardware timer blocks in Multi-core processors help?

In my view, a hardware timer block can help when your applications demand a large number of timers, periodic timers or very accurate timers. If your application requires 'Zero Loss Throughput', then a hardware block will certainly help, as it takes over the CPU cycles that software implementations spend traversing timer lists or moving timers.

What are the features expected by network infrastructure applications from hardware timer block in Multi-core processors?

A large number of timers is expected to be supported, ranging into the millions.
A decent number (say 1K) of timer groups is expected to be supported. There are multiple applications running on the cores that require timers. An application that is being shut down, or that is being terminated due to some error condition, should be able to clear all the timers that it had started.
Accessibility of timer groups by applications running in different execution contexts, listed below. There should be good isolation among timer groups. There should be some provision to program the number of timers that can be added to a timer group, and to read the number of timers currently in a timer group.

Applications running in Linux user space
Applications running in Kernel space.
Applications running in virtual machines.

Application should be able to do following operations. All operations are expected to be completed synchronously.

Start a new timer: The application should be able to provide

Timer identification: Timer group and unique timer identification within the group.
Timeout value (microsecond granularity)
One-shot, periodic or inactivity timer
Priority of the timeout event (upon expiry): This would help in prioritizing timer events with respect to other events such as packets.
If there are multiple ways or queues to deliver the timer event to the cores upon expiry, then the application should be able to give its choice of way/queue as part of starting the timer. This would help in steering the timer event to a specific core or distributing the timer events across cores.

Stop an existing timer: Stopping the timer should free the timer as soon as possible. One existing hardware implementation of a timer block in a Multi-core processor today has this problem: if the application starts and stops timers continuously, it eventually runs out of memory, because the memory is freed only when the original timeout of each timer is reached. If the timeouts of these timers are in tens of minutes, then the memory is not released for minutes on end. A good hardware implementation of a timer block should not have this kind of unbounded memory growth in any situation. Timer stop attributes typically involve

Timer identification

Restart the existing timer.

Timer identification
New timeout value

Get the remaining timeout value at any time, synchronously, by giving the 'Timer identification'.
Set the activity on the timer: this should be very fast, as applications might use it on a per-packet basis.
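The operations listed above can be summarized as a small software model. This is only a sketch of the interface an application might see, not any vendor's hardware API: the class, method and constant names are hypothetical, and the inactivity timer is simplified so that it re-arms for a full timeout from its expiry time rather than from the time of the last activity.

```python
from dataclasses import dataclass

ONE_SHOT, PERIODIC, INACTIVITY = "one_shot", "periodic", "inactivity"

@dataclass
class Timer:
    timeout_us: int
    kind: str
    queue: int
    priority: int
    deadline_us: int
    active: bool = False

class TimerBlockModel:
    """Software model of the timer block interface (names are hypothetical)."""

    def __init__(self):
        self.now_us = 0
        self.limits = {}   # group -> maximum number of timers
        self.timers = {}   # group -> {timer_id: Timer}
        self.events = []   # (time_us, queue, priority, group, timer_id)

    def create_group(self, group, limit):
        self.limits[group], self.timers[group] = limit, {}

    def count(self, group):
        return len(self.timers[group])

    def start(self, group, timer_id, timeout_us, kind=ONE_SHOT, queue=0, priority=0):
        timers = self.timers[group]
        if timeout_us <= 0 or timer_id in timers or len(timers) >= self.limits[group]:
            return False
        timers[timer_id] = Timer(timeout_us, kind, queue, priority,
                                 self.now_us + timeout_us)
        return True

    def stop(self, group, timer_id):
        # The entry is released immediately, not at its original timeout.
        return self.timers[group].pop(timer_id, None) is not None

    def restart(self, group, timer_id, timeout_us):
        if timeout_us <= 0:
            raise ValueError("timeout must be positive")
        t = self.timers[group][timer_id]
        t.timeout_us, t.deadline_us, t.active = timeout_us, self.now_us + timeout_us, False

    def remaining_us(self, group, timer_id):
        return self.timers[group][timer_id].deadline_us - self.now_us

    def set_activity(self, group, timer_id):
        self.timers[group][timer_id].active = True  # cheap per-packet operation

    def clear_group(self, group):
        self.timers[group].clear()

    def advance(self, delta_us):
        """Move time forward and deliver expiry events in deadline order.
        Group and timer identifiers must be mutually comparable (e.g. ints)."""
        target = self.now_us + delta_us
        while True:
            due = [(t.deadline_us, g, tid) for g, ts in self.timers.items()
                   for tid, t in ts.items() if t.deadline_us <= target]
            if not due:
                break
            deadline, g, tid = min(due)
            t = self.timers[g][tid]
            if t.kind == INACTIVITY and t.active:
                t.active, t.deadline_us = False, deadline + t.timeout_us
                continue
            self.events.append((deadline, t.queue, t.priority, g, tid))
            if t.kind == PERIODIC:
                t.deadline_us = deadline + t.timeout_us
            else:
                del self.timers[g][tid]
        self.now_us = target
```

A session table would typically start one inactivity timer per session, call set_activity on every packet, and call stop when the session is torn down. The linear scan in advance is only there to keep the model short; it says nothing about how hardware would find expired timers.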

Firewall/NAT/ADC appliances targeting large enterprise and data center markets would greatly benefit from hardware-based timer blocks. Not all hardware timer blocks are created equal. Hence check the functionality and efficacy of the hardware implementation.

Measure the latency, packet drop and jitter of the packets over long time. One scenario that can be tested is given below.

Without timers, measure the throughput of 1M sessions by pumping traffic across all sessions using equipment such as IXIA or smartbits. Let us say this throughput is B1.
Create 1M sessions, hence 1M timers with 10 minutes timeout value.
Pump the traffic from IXIA or smartbits for 30 minutes.
Check whether the throughput stays almost the same as B1 across all 30 minutes. Also ensure that there is no packet drop or increase in packet latency, specifically around the 10, 20 and 30 minute marks.

Measure the memory usage:

Do connection rate test with each connection inactivity timeout value 10 minutes.
Ensure that upon TCP Reset or TCP FIN sequence the session is removed and hence timer is stopped.
Continue this for 10 or more minutes.
Ensure that the memory usage did not go up beyond reason.
Ensure that timers could be started successfully during the test.
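To get a feel for what "beyond reason" means in the memory test above, the steady-state number of timer entries can be estimated as the arrival rate multiplied by the time each entry is held (Little's law). The sketch below uses made-up numbers: the connection rate and connection lifetime are assumptions for illustration, not measurements.

```python
def outstanding_timer_entries(conn_per_sec, conn_lifetime_sec, timeout_sec, freed_on_stop):
    """Steady-state number of timer entries held by the timer block.

    If stopping a timer frees it at once, an entry lives as long as the
    connection (or until it times out). Otherwise it lives for the full timeout.
    """
    if freed_on_stop:
        held_for = min(conn_lifetime_sec, timeout_sec)
    else:
        held_for = timeout_sec
    return conn_per_sec * held_for

if __name__ == "__main__":
    rate, lifetime, timeout = 100_000, 1, 600   # assumed: 100K conn/sec, 1 s each, 10 min timeout
    print("freed on stop   :", outstanding_timer_entries(rate, lifetime, timeout, True))
    print("freed on timeout:", outstanding_timer_entries(rate, lifetime, timeout, False))
```

With these assumed numbers, a timer block that frees entries on stop holds about 100 thousand entries, while one that waits for the original timeout holds about 60 million. That gap is what the memory measurement is meant to expose.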
