Thursday, February 3, 2011

Data Center/Enterprises - Clustering of Network devices

Throughput requirements of data center and enterprise network equipment are going up with the increased traffic in data centers and enterprises.  In addition, the computational requirements of network equipment are also going up.  Some examples of why more computational power is required:
  • Intrusion Detection/Prevention now requires almost 3 to 4 times the computational power per Mbps of traffic compared to a few years back.  I guess it is mainly due to the sophisticated nature of attacks and the evasion techniques adopted by attackers.  Javascript analysis alone takes about 10 times the computational power of typical DPI-based pattern matching:  it requires proxy-based functionality to get hold of the Javascript, plus script analysis for attack detection, and both tasks require far more CPU cycles than plain pattern matching.
  • Traditional Server Load Balancers (SLBs) used to select the internal server based on the IP and UDP/TCP header values.  Next generation server load balancers, called ADCs, do deep packet inspection, looking at the HTTP or SIP URL and the HTTP request headers, to select the internal server that gets the load (a sketch of this kind of content-based selection follows this list).  DPI requires more CPU cycles.
  • Application firewalls such as Web Application Firewalls and SIP firewalls not only do deep packet inspection but also deep data inspection, and that requires more horsepower from the CPUs.
  • DDOS prevention requires real-time analysis not only packet by packet, but also at the session and application protocol level across sessions to identify attacks.  Many DDOS attacks look exactly like normal traffic on a per-session basis, hence analysis across sessions is required to detect the anomaly.  This capability requires not only a lot of memory but also a good amount of computational power.
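
To make the ADC example above concrete, here is a minimal sketch in Python of content-based server selection.  The pool names, server addresses, and routing rules are assumptions invented for illustration; a real ADC implements this logic on its fast path, not in a script:

import zlib

BACKEND_POOLS = {
    "static":  ["10.0.1.10", "10.0.1.11"],   # images, css, js
    "app":     ["10.0.2.10", "10.0.2.11"],   # dynamic pages / API
    "default": ["10.0.3.10"],
}

def select_pool(request_line, headers):
    """Pick a backend pool by inspecting the HTTP request line and headers (DPI)."""
    _method, url, _version = request_line.split()
    if url.startswith("/static/") or url.endswith((".png", ".css", ".js")):
        return "static"
    if headers.get("Host", "").startswith("api."):   # example of a header-based rule
        return "app"
    return "default"

def pick_server(pool_name, client_ip):
    """Deterministic choice within the pool, keyed on the client IP."""
    pool = BACKEND_POOLS[pool_name]
    return pool[zlib.crc32(client_ip.encode()) % len(pool)]

pool = select_pool("GET /static/logo.png HTTP/1.1", {"Host": "www.example.com"})
print(pick_server(pool, "192.0.2.7"))     # prints one of the "static" servers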

Multicore processors are helping to some extent in solving these performance issues.  Clustering of multiple multicore SoCs is becoming necessary to meet the above performance requirements in the data center and large enterprise markets.  Typically, multiple blades, each using multicore SoCs and running the same application, are clustered to take up the load.  L2 switches are increasingly used to front-end the cluster, and they can now be configured to balance the load across the devices of the cluster.  One might see the cluster and the L2 switch in one enclosure, giving the feeling that it is one big box providing tens of gigabits of performance.

What features of the L2 switch enable clustering?
  • Distribution of sessions across multiple devices in the cluster:  The majority of L2 switches have the capability to distribute traffic coming in on the network-facing ports (data ports) across multiple ports (device ports).  With the cluster devices connected to these ports, each device gets the traffic that was redirected to its port.  However, many network devices expect all packets of a given session to go to the same device; for example, all packets belonging to one HTTP connection should go to one device.  If the packets of a session are distributed across multiple devices, the devices will not be able to do their analysis, proxying, etc.  A connection's traffic involves both client-to-server (C-S) and server-to-client (S-C) packets.  Though L2 switches don't have session intelligence, the hash-based distribution mechanism they adopt generates the same hash value for a session's traffic whether it is the C-S or the S-C direction of the connection, so both directions land on the same device (a small sketch of such a symmetric hash follows this list).   Some cautions:
    • L2 switches don't do IP reassembly.  Because of this, the hash generated for the first fragment of a packet can differ from that of the non-initial fragments if the hash generation block is configured with L4 fields (TCP/UDP source and destination ports), since those fields are present only in the first fragment.  So it is advisable to configure the hash block with only the IP addresses and IP protocol.  This may give rise to unequal distribution, but with the large number of sessions in a data center, this may not be a big limitation.
    • Some application sessions require multiple connections; SIP (Session Initiation Protocol) is an example.  A SIP voice call typically involves three connections:  the SIP control connection, RTP for voice/video data, and RTCP for control frames.  Many devices expect all three connections to land on the same device.  If all three connections had the same source, destination, and protocol fields, the switch would send all packets of the SIP application session to the same device; but the RTP and RTCP IP addresses may differ from the IP addresses of the SIP control connection.  If your device needs to support this, the responsibility falls on the cluster devices:  they need the intelligence to track ownership of these kinds of application sessions, and if a device receives packets belonging to an application session owned by some other device, it needs to redirect that traffic to the device that owns the SIP session (see the second sketch after this list).
  • As implied above, the L2 switch ports are divided into network ports (data ports) that connect to the DC/enterprise network and device ports where the cluster devices are connected.  With the large port density of current-generation switches, some ports can even be dedicated to inter-device communication, thereby avoiding any separate backplane such as InfiniBand or another L2 switch fabric.  L2 switches and devices supporting ETS (802.1Qaz) and 10G ports can use the same port for both inter-cluster communication and network traffic.
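
Below is a minimal sketch of the hash-based distribution described above, assuming the switch's hash block is configured with only the IP addresses and IP protocol.  Sorting the two addresses before hashing is one simple way to make the hash direction-independent, so C-S and S-C packets of a connection map to the same device port; the port numbering and the CRC32 choice are illustrative assumptions, not a description of any particular switch:

import zlib

DEVICE_PORTS = [0, 1, 2, 3]   # switch ports that connect to the cluster devices

def device_port_for(src_ip, dst_ip, ip_proto):
    """Return the device port for a packet, using a symmetric L3-only hash."""
    # Sorting the addresses makes the hash direction-independent (C-S == S-C).
    a, b = sorted([src_ip, dst_ip])
    key = f"{a}|{b}|{ip_proto}".encode()
    return DEVICE_PORTS[zlib.crc32(key) % len(DEVICE_PORTS)]

# Both directions of the same TCP connection land on the same device:
assert device_port_for("192.0.2.10", "203.0.113.5", 6) == \
       device_port_for("203.0.113.5", "192.0.2.10", 6)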
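
And here is a hedged sketch of the application-session ownership logic the cluster devices themselves would need for cases like SIP, where RTP/RTCP traffic may hash to a device other than the one holding the call.  The class, the Call-ID key, and the redirect/process placeholders are assumptions for illustration only:

class SipCallTable:
    """Maps an application-session key (e.g. the SIP Call-ID) to the device that owns it."""
    def __init__(self, my_device_id):
        self.my_device_id = my_device_id
        self.owners = {}      # call_id -> owning device id; in practice synchronized across the cluster

    def claim(self, call_id):
        """Called by the device that handles the SIP control connection for this call."""
        self.owners[call_id] = self.my_device_id

    def handle_media_packet(self, call_id, packet):
        """Process RTP/RTCP locally if we own the call, otherwise redirect to the owner."""
        owner = self.owners.get(call_id, self.my_device_id)
        if owner == self.my_device_id:
            process_locally(packet)
        else:
            redirect_to_device(owner, packet)      # sent over the inter-device (cluster) port

def process_locally(packet):
    pass    # placeholder for the device's normal RTP/RTCP processing

def redirect_to_device(device_id, packet):
    pass    # placeholder for forwarding the packet to the owning device
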
New Generation Configuration Framework

Even though there are multiple devices in the cluster, the admin user should only need to configure the cluster once.  Admin users should not be expected to configure each device in the cluster individually.  Fortunately, the new generation of configuration frameworks is designed to handle cluster configuration.

New-generation configuration frameworks support mechanisms to ensure that the configuration is the same across the devices in the cluster.  Increasingly, the configuration architecture includes a central management system which takes care of synchronizing the configuration across devices on a per-operation basis.
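
A minimal sketch of per-operation synchronization, assuming a central management process that pushes each admin operation to every device in the cluster; the device names, the apply_config() transport, and the rollback behaviour are all hypothetical:

CLUSTER_DEVICES = ["blade-1", "blade-2", "blade-3"]

def apply_config(device, operation):
    """Placeholder for the transport that delivers one configuration operation to one device."""
    print(f"{device}: applying {operation}")
    return True

def configure_cluster(operation):
    """Apply a single admin operation to every device; undo on partial failure so all devices stay identical."""
    applied = []
    for device in CLUSTER_DEVICES:
        if apply_config(device, operation):
            applied.append(device)
        else:
            for done in applied:
                apply_config(done, {"undo": operation})
            raise RuntimeError(f"configuration failed on {device}")

configure_cluster({"cmd": "add-rule", "dst": "10.1.1.0/24", "action": "drop"})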

Network devices maintain several statistics.  With multiple devices in the cluster, each device maintains its own set of statistics, while the admin user typically expects to see a consolidated view of the statistic counter values across all devices in the cluster.  Again, the new configuration frameworks read the statistics from each device, consolidate them, and show the consolidated output.
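
A sketch of the consolidation step, where fetch_stats() stands in for whatever per-device query mechanism the framework actually uses, and the counter names and values are made up:

from collections import Counter

def fetch_stats(device):
    """Hypothetical per-device query; returns a mapping of counter name to value."""
    return {"sessions_active": 1200, "packets_dropped": 7}    # illustrative numbers

def consolidated_stats(devices):
    total = Counter()
    for device in devices:
        total.update(fetch_stats(device))     # sum each counter across devices
    return total

print(consolidated_stats(["blade-1", "blade-2", "blade-3"]))
# Counter({'sessions_active': 3600, 'packets_dropped': 21})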

On image upgrades:  when a new image version is available, new configuration frameworks allow admin users to upgrade the image only once for the cluster.  All devices in the cluster then get the image from the central configuration framework.
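
A small sketch of such a single-step cluster upgrade, under the same assumption of a central framework; push_image() and activate_image() are placeholders, and the image file name and version are invented:

def push_image(device, image_path):
    print(f"copying {image_path} to {device}")      # stand-in for the real file distribution

def activate_image(device, version):
    print(f"{device}: rebooting into {version}")    # stand-in for the real activation step

def upgrade_cluster(devices, image_path, version):
    for device in devices:            # stage the image on every device first
        push_image(device, image_path)
    for device in devices:            # then activate one device at a time
        activate_image(device, version)

upgrade_cluster(["blade-1", "blade-2", "blade-3"], "fw-image-2.1.bin", "2.1")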

With these advancements in L2 switches and configuration frameworks, clustering is back in networks again.

