Monday, December 31, 2012

L4 Switching & IPSec Requirements & Openflow extensions


L4 Switching - Introduction

L4 switching typically involves connection tracking,  NAT and common attack checks.  Stateful inspection firewalls,  NAT, URL filtering  and SLB (Server Load Balancing) are some of the middle-box functions that take advantage of L4 switching.  These middle-box functions inspect first few packets of every connection (TCP/UDP/ICMP etc..) and offload the connection processing to some fast path entity to do L4 switching.  Normally both fastpath and normal path functionality reside in the same box/device.

Openflow, promoted by industry big names,  is one of the protocols that separates control plane from data plane where control planes of multiple switches implemented in one logical central controller and leaving the data plane alone at devices, which are programmable to work in a specified way,  thereby making devices simple.

Middle-box functions described above can also be separated - Normal path (service plane) and fast path (L4 switchiing).  Normal path works on first packet or at the most first few packets and fast path works on the connection for rest of the packets in the connection. By implementing normal path (Service plane) in centralized logical controllers and leaving the fastpath (L4 switching) at the physical/logical device level,  similar benefits of CP/DP separation can be achieved.  Benefits include
  • Programmable devices where device personality can be changed by controller applications.  Now,  Openflow switch can be made as L2 switch, L3 switch and/or L4 switch.
  • By removing major software logic (in this case, it is service plane) from device to the central location, cost of ownership goes down for end customers.
  • By centralizing the service plane,  operation efficiency can be improved significantly
    • Ease-of software upgrades
    • Configure/Manage from a central location.
    • Granular flow control.
    • Visibility across all devices.
    • Comprehensive traffic visualization

L4 Switching - Use cases:

  • Branch office connectivity with corporate head quarters :   Instead of having firewall, URL filtering and Policy control on every branch office device,  it can be centralized at corporate office. First few packets of every connection could go to main office.  Main office controller decides on the connection. If the connection is allowed,  it lets the openflow switch in the branch office to forward rest of the traffic on that connection.  This method only requires simple openflow switches in the branch office and all intelligence can be centralized at one location, that is, at main office. Thus, it reduces the need for skilled administrator at every branch office.
  • Server Load Balancing in Data Centers:   Today data-centers have expensive server load balancers to distribute the incoming connections to multiple servers.  SLB devices are big boxes today for two reasons -  Normal path processing that selects the best server for every new connection and  fastpath packet processing are combined into one box/device.   By offloading packet processing to inexpensive openflow switches,  SLB devices only need to worry about normal path processing, which could be done in lesser expensive boxes and even using commodity PC hardware.  
  • Managed Service providers providing URL filtering capability to home & SME offices :  Many homes today use PC based software to do URL filtering  to provide safe internet experience to kids.  SME administrators require URL filtering service to increase the productivity of their employees by preventing them to access recreational sites and also to prevent any malware contamination.  Again, instead of deploying URL filtering service at customer office premises,   service providers like to host this service centrally and manage centrally across multiple of their customers for operational efficiency.  Openflow controller implementing URL filtering service can get hold of packets until URL is fetched, find the category of URL, apply policy and decide on whether to continue the connection.  If the connection is to  be continued, then it can program the openflow switch in customer premises to forward rest of the packets.  
  • Though URL filtering service is one use case given,  this concept of centralizing the intelligence at one central place and programming the openflow switches for fast path is equally applicable for "Session Border Controller" and "Application Detection using DPI" and other services. Whichever service needs to inspect first few packets of the connection is candidate for Service Plane/L4 Switching  separation using Openflow.

L4 Switching - Processing

L4 switching or fastpath entity are used to do connection level processing. Any connection consists of two flows -  Client-to-Server (C-S) flow and Server-to-Client (S-C) flow.  

Connections are typically 5-tuple based.  

Functionality of L4 switching:
  • Connection Management :  
    • Acts on the connection Add/Remove/Query/Modify requests from the service plane.
    • Inactivity timeout :  Activity is observed on the connection. If no activity for some time (programmed during the connection entry creation),   connection entry is removed and notification is sent to the normal path.
    • TCP Termination : L4 switching entity can remove the connection entry if it observes the both endpoints of TCP connections exchange FINs and ACKed. It can also remove the entry if TCP packet with RST is seen.
    • Periodic notifications of the state collected to Service plane.
  •  Packet processing logic involves
    • Pre-flow lookup actions typically performed:
      • Packet integrity checks :  Checksum verification of both IP and transport headers,  Ensuring that the length field values are consistent with packet sizes etc..
      • IP Reassembly:  As the flows are identified by 5-tuples, it is required that packets are reassembled to determine the flow across fragments. 
      • IP Reassembly related attack checks and remediation
        • Ping-of-Death attack checks (IP fragment Overrun) - Checking for full IP packet never exceeds 64K.
        • Normalization of the IP fragments to remove overlaps to protect from teardrop related vulnerabilities.
        • Too many overlap fragments - Remove the IP reassembly context when too many overlaps observed.
        • Too many IP fragments  & Very small IP fragment size:  Drop the reassembly context and associated IP fragments.
        • Incomplete IP fragments resulting Denial of Service attacks - Limit the number of reassembly contexts on per IP address pair,  IP address etc..
    • TCP Syn flood protection using TCP Syn Cookie mechanism
    • Flow & Connection match:  Find the flow entry and corresponding connection entry.
    • Post lookup actions
      • Network Address Translation :  Translation of SIP, DIP, SP, DP.
      • Sequence number translation in TCP connections :  This is normally needed when the service plane updates the TCP payload that decreases or increases the size or when the TCP syn cookie processing is applied on the TCP connection.
      • Delta Sequence number updates in TCP connections :  This is normally required to ensure that right sequence number translation is applied, especially for retransmitted TCP packets.
      • Sequence number attack checks :  Ensuring that sequence numbers of the packets going in one direction of the connection are within some particular range. This is to ensure that any attacker injecting the traffic is not honored.  This check is mainly required to stop TCP RST packets that are generated and sent by attackers.  As RST packet terminates the connection and thereby creating DoS attack, it is required that this check is done.
      • Forwarding action :  To send the packet out by referring to routing tables and ARP tables.

IPSec - Introduction

IPSec is used to secure the IP packets. IP packets are encrypted to avoid leakage of data on the wire and also authenticated to ensure that the packets are indeed sent by the trusted party.  It runs in two modes - Tunnel mode and transport mode.   IP packets are encapsulated in another IP packet in tunnel mode.  No additional IP header in involved in transport mode. IPSec applies the security on per packet basis - Hence it is datagram oriented protocol.  Multiple encryption and authentication algorithms can be used to secure the packets.   Encryption keys and authentication keys are established on per tunnel basis by the Internet Key Exchange Protocol (IKE).   Essentially,  IPSec protocol suite contains IKE protocol and IPSec-PP (Packet processing).  

Traditionally,  implementations use proprietary mechanism between IKE and IPSec-PP as both of them sit in the same box.  New way of networking (SDN) centralizes control plane with distributed data paths.  IPSec is one good candidate for this.  When the control plane (IKE) is separated from the data path(IPSec-PP),   then there is a need for some standardization for communication between IKE and IPSec-PP.  Openflow is thought to be one of the protocols to separate out control plane and data plane.   But Openflow as defined in OF 1.3.x specification did not keep IPSec CP-DP separation in mind.  This blog post tries to provide extensions required in OF specification to enable IPsec CP-DP separation.


ONF (Open Networking Foundataion) is standardizing Openflow protocol.  Controllers using OF protocol programs the OF switches with flows and instructions/actions to be performed on packets.  Openflow 1.3+ specification supports multiple tables which simplifies the controller programming as each table in the switch can be used for a purpose.

Contrary to what everybody says,  Openflow protocol is not really friendly let alone to SP/FP separation, but also to L3 CP/DP separation.  The way I look at OF today is that it is good for L2 CP/DP separation and also good for traffic steering kind of use cases, but it is not sufficient for L3 and L4 switching.  This post tries to give some of the features that are required in OF specifications to support L4 switching in particular and L3 switching to some extent.

Extensions Required to realize L4 switching & IPSec CP-DP separation

Generic extensions:

Table selection on packet-out message:

Problem Description
Current OF 1.3 specification does not give provision for controllers to start the processing of packet-out message from a specific table in the switch.  Currently,  controller can send the packet-out message with set of actions.    These actions are expected to be executed in the switch.  Typically 'OUTPUT' action is expected to be specified by the controller in the action list.  This action sends the packets out on the port given in "OUTPUT" action.  One facility is provided to run through OF pipeline though.  If the port in 'OUTPUT' action is specified to be a reserved port OFPP_TABLE,  then the packet-out message execution starts from table 0.

Normally controllers takes advantage of multiple tables in the OF switch by dedicating tables to purposes.  For example, when L4 switching  with L3 forwarding are used, multiple tables are used - A table to classify packets,  A table for Traffic Policing entries,  A table for PBR and few tables for routing tables,  a table for L4 connections and a table for next hop entries and a table for traffic shaping rules.   Some tables are populated pro-actively by the OF controller , hence miss packets are not expected by controllers.  Entries in some tables are populated re-actively.  Tables such as 'L4 connections',  'Next hop' entries are created reactively by the controllers.   Processing of the miss packets (packet-in) can result into a flow being created in the table which was the cause of miss packet. Then controllers likes to send back the packet to the OF switch and would like OF switch to start processing the packet as OF switch does when there is a matching entry. Controller programming does not get complicated  if that was possible.  As indicated before,  due to limitations of the specification,  controller can't ask switch to start from a specific table.   Due to lack of this feature in the switches,   controllers are forced to figure out the actions that would need to be performed and then program the action list in the packet-out message.  In our view,  that is quite complex task for the controller programmers.  As I understand,  many controller applications simply using packet-in message to create the fow, but drop the packet-in message.  This is not good at all.  Think of SYN packet getting lost.  It would take some time for client to retransmit the TCP SYN packet and that delay results into very bad user experience.

Also, in few cases,  applications (CP protocols such as OSPF, BGP)  generate packets that needs to be sent out on the data network.  Controllers typically sit in management network, not on the data network.  Hence these packets are to be sent to the data network via OF switches.  This can only be achieved by sending these packets as packet-out messages to the switch.  Yet times, these packets need to be exposed to only part of the table pipeline.  In these cases, it would be required to have ability for controllers to start the packet processing in the switch for these packets from a specific table given by controller.


Freescale Extension :

struct ofp_action_fsl_experimenter_goto_table {
      uint16_t type; /* OFPAT_EXPERIMENTER. */
      uint16_t len; /* Length is a multiple of 8. */
      uint32_t experimenter; /** FSL Experimenter ID **/
      uint16_t fsl_action_type; /** OFPAT_FSL_GOTO_TABLE **/
      uint16_t table_id;

OFP_ASSERT(sizeof(struct ofp_action_fsl_experimenter_goto_table == 8);

table_id :  When switch encounters this action,  switch starts processing the packet from the table specified by 'table_id'.

Though this action is normally present in the action_list of e packet-out messages, this action can be used in flow_mod actions too.  When there is GOTO_TABLE instruction and GOTO_TABLE action,  then the GOTO_TABLE action takes precedence and GOTO_TABLE instruction is ignored.

Notice that 'table_id' size is 2 bytes.  This was intentional.  We believe that OF specifications in future would need to support more than 256 tables.

Saturday, December 29, 2012

L2 Network Virtualization & Is there a role for Openflow controllers?




Current Method of Network Virtualization

IaaS (Cloud Service Providers) providers do provide network isolation among their tenants. Even Enterprise private cloud operators are increasingly expected to provide network isolation among tenants - tenants being departments,  divisions,  test networks, lab networks etc..  This allows tenants to have their own IP addressing space and possibly overlapping with other tenants' address space.

Currently VLANs are used by network operators to create tenant specific networks.  Some of the issues related to VLAN are:
  • VLAN IDs are limited to 4K.  If tenants require 4 networks each on average,   only 1K customers can be satisfied on a physical network.  Network operators are forced to create additional physical networks when more tenants sign up.
  • Performance bottlenecks associated with the VLANs :  Even though many physical switches support 4K VLANs,  many physical switches don't provide line rate performance when the number of VLAN IDs go beyond certain limit ( some switches don't work well beyond 256 VLANs)
  • VLAN based networks have operational headaches -  VLAN based network isolation requires  all L2 switches are configured when a new VLAN is created or an existing VLAN is deleted.  Though many L2 switch vendors provide central console to work with their brand L2 switches,  it is operational  difficulty when switches from multiple vendors are present.
  • Loop convergence time is very high
  • Extending VLANs across data center sites or extending VLANs to customer premise has operational issues with respect to interoperable protocols,  Out-of-band understanding among network operators is required to avoid VLAN ID collisions.
To avoid issues associated with  capabilities  of  L2 switches,  networks having switches from multiple vendors  and limitations associated with VLANs,  increasingly overlays are used to virtualize physical networks to create multiple logical networks.

Overlay based Network Virtualization

Any L2 network requires the preservation of L2 packet from source to destination.  Any broadcast packet should go to all network nodes attached to the L2 network.  All multicast packets should go to network nodes that are willing to receive multicast packets of groups of their choice.

Overlay based network virtualization provides above functionality by overlaying the Ethernet packets using outer IP packet - Essentially tunneling Ethernet packetsfrom one place to another.

VxLAN, NVGRE are two of the most popular overlay protocols that are being standardized.  Please see my blog post on VxLAN here.

VxLAN provides 24 bits of VNI (Virtual Network Identifier).  In theory around 16M virtual networks can be created.  Assuming that each tenant may have 4 networks on average,  in theory, 4M tenants can be supported by CSP using one physical network.  That is, there is no bottleneck with respect to identifier space. 


Openstack is one of the popular open source cloud orchestration tools.  It is becoming formidable alternative to VMWare vCenter and VCD.   Many operators are using Openstack and  KVM hypervisor as a secondary source of cloud virtualization in their networks.  Reliability of Openstack came long way and many vendors are providing support for a fee. Due to these changes,  adoption of openstack+KVM is going up as a primary source of virtualization. Openstack mainly has four components -  'Nova' for VM management across multiple physical servers,  'Cinder' for storage management,  'Quantum' for network topology management and 'Horizon' to provide front end user experience to operators (administrators and tenants).

Quantum consists of set of plugins -  Core plugin and multiple extension plugins.  Quantum defines API for plugins and let various vendors to create backend for the plugins.  Quantum core plugin API defines the management API of virtual networks - Virtual networks can be created using VLAN,  GRE and being upgraded to support VxLAN too.
Quantum allows operators to create virtual networks.  As part of VM provisioning,  Openstack NOVA provides facility for operators to choose the virtual networks on which this VM needs to be placed on. 
When 'Nova scheduler' chooses a physical server to place the VM, it asks the quantum to provide MAC address, IP address and other information to be assigned to VM using 'create_port'  API.  Nova asks quantum as many times as number of virtual networks that VM belongs to.   Quanutm provides required information to NOVA back.  As part of this call,  Quantum comes to know about the physical server and the virtual networks that needs to be extended to the physical server.  It, then, informs the quantum agent (that sits in host Linux of each physical server) the virtual networks it needs to create.   Agent on the physical server gets more information on virtual networks from quantum and then create the needed resources.   Agent uses OVS (Open Virtual Switch ) package that is present in each physical server to do the job.  Please see some description of OVS below.   Quantum agent in each physical server creates two openflow bridges - integration bridge (br-int) and tunnel  bridge (br-tun).  Agent also connects south side of br-int to north side of br-tun  using loopback port pair.   Virtual network port creation and association to br-tun is done by quantum agent for every new virtual network or when the virtual network is deleted.  North side of br-int towards VMs is handed by libvirtd and associated drivers as part of VM management. See below.

Nova talks to 'nova-compute' package in the physical server to bring up/down VMs.  'Nova-compute' in the physical server uses 'libvirtd' package to bring up VMs,  create ports and associate them with openflow switches using OVS package.  Brief description of some of the work, libvirtd does with the help of OVS driver are:
  • Creates a Linux bridge for each port that is associated with the VM.
  • North side of this bridge is associated with VM Ethernet port (using tun/tap technology).
  • Configures ebtables to provide isolation among the VMs.
  • South side of this bridge is associated with Openflow integration bridge (br-int).  This is achieved by creating loopback port pair with one port attached to Linux bridge and another port associated with the Openflow switch, br-int.

Openvswitch (OVS)

OVS is openflow based switch implementation.  It is now part of Linux distribution.  Traditionally Linux bridges are used to provide virtual network functionality in KVM based host Linux.  In Linux 3.x kernels,  OVS has taken that responsibility and Linux bridge is used for the purposes of enabling 'ebtables'.
OVS provides set of utilities :  ovs-vsctl and ovs-ofctl.   "ovs-vsctl" utility is used by OVS quantum agent in physical servers to create openflow datapath entities (br-int, br-tun),   initialize Openflow tables and add both north and south bound ports to the br-int and br-tun.   "ovs-ofctl" is command line to create openflow flow entries in the openflow tables of br-int and br-tun.  It is used by OVS quantum agent to create default flow entries to enable typical L2 switching (802.1D) functionality.  Since OVS is openflow based,  external openflow controllers can manipulate the traffic forwarding by creating flows in br-int and br-tun.  Note that external controllers are required only to add 'redirect' functionality AND virtual switching functionality can be achieved even without external openflow controller. 

Just to outline various components in the physical server:
  • OVS  package - Creates Openflow switches,  Openflow ports and associate them to various switches and ofcourse provides ability for external controllers to control the traffic to/from VMs to external physical networks.
  • Quantum OVS Agent :  Communicates with the Quantum plugin in Openstack tool to get to know the virtual networks and configure OVS to realize those networks in the physical server.
  • OVS Driver in Libvirtd :  Allows connecting VMs to virtual networks and configures 'ebtables' to provide isolation among VMs.

Current VLAN based Network Virtualization solution

Openstack and OVS together can create VLAN based networks.  L2 switching is happening with no external openflow controller.   OVS Quantum agent with the help of plugin knows the VMs, their vports and corresponding network ports. Using this information,  agent associates the VLAN ID to each vport connected to the VMs.  This information is used by OVS to know which VLAN to use when packets come from VMs.  Also,  agent creates one rule to do the LEARNING for packets coming in from the network. 

Overlay based Virtual Networks

Companies like Nicira and bigswitch networks are promoting overlay based virtual networks. OVS in each compute node (Edges of the physical network) is used as starting point of overlays.  All L2 and L3 switches connecting compute nodes are only used for transporting the tunneled packets.  They don't need to participate in the virtual networks.   Since OVS in compute nodes is encapsulating and decapsulating inner ethernet packets into/from another IP packet,  in-between switches transfer the packets using outer IP header addresses and outer MAC headers.  Essentially,  overlay tunnels start and end at compute nodes.  With this,  network operators can configure the switches in L3 mode instead of problematic L2 mode.  May be, in future, one might not see any L2 switches in the data center networks.

Typical packet flow would be something like this:

- A VM sends a packet and it lands on the OVS in the host Linux.
- OVS applies actions based on the matching flows in br-int and packet is sent to br-tun.
- OVS applies actions based on the matching flows in br-tun and packet is sent out on the port (overlay port)
- OVS sends the packet to overlay protocol layer.
- Overlay protocol layer encapsulates the packet and sends out the packet.

In reverse direction,  packet flow would look like this:

- Overlay protocol layer gets hold of the incoming packet.
- Decapsulates the packets and exposes the packet with right port to the OVS br-tun.
- After applying any actions on the packet using matching OF flows in br-tun,  packet is sent to br-int.
- OVS applies the actions on the matching flows and figures out the destination port (one-to-one mapping with VM port)
- OVS sends the inner packet to the VM.

Note that:

- Inner packet is only seen by OVS and VM.
- Physical switches only see encapsulated packet.

VxLAN based Overlay networks using Openstack

OVS and VxLAN:

There are many open source implementations of VxLAN in OVS and integration of this with openstack.   Some details about one VxLAN implementation in OVS.

  • Creates as many  vports in OVS as number of VxLAN networks in the compute node.  Note that,  even though  there could be large number of VxLAN based overlay networks,  only networks to which local VMs belong are created in OVS as vports.  For example,  If there are VMs corresponding to two overlay networks,  then two vports are created. 
  • VxLAN implementation depends on VTEP entries to find out the remote tunnel endpoint address for a destination MAC address of the packet received from the VMs.  IP address is used as DIP of the outer IP header.
  • If there is no matching VTEP entries,  Multicast learning happens as per VxLAN.
  • VTEP entries can be created manually too.   A separate command line utility is provided to create VTEP entries to vports.  
  • Since Openstack has knowledge of VMs and physical servers they are hosting,  Openstack with the help of quantum agent in each compute node can create VTEP entries pro-actively.

Commercial Products

Openstack and OVS provide fantastic facilities to manage virtual networks using VLAN and overlay protocols.  Some commercial products seem to be doing following:
  • Provide their own Quantum Plugin in the openstack.
  • This plugin communicates with their central controller (OFCP/OVSDB and Openflow controllers).
  • Central controller is used to communicate with OVS in physical servers to manage virtual networks and flows.
Essentially,  these commercial products are adding one more controller layer between Quantum in openstack and physical servers.

My views:

In my view it is not necessary.  Openstack, OVS,  OVS plugin, OVS agent, and  OVS libvirtd driver are becoming mature and there is no need for one more layer of abstraction.  It is a matter of time where these open source components would be feature rich, reliable and supported by vendors such as redhat.   With OVS being part of Linux distributions and with ubuntu providing all of above components,  operators are better of sticking with these components instead of going for proprietary software.

Since OVS is openflow based,  there could be Openflow controller to add value with respect to traffic steeering and traffic flow redirections.  It should provide value, but one should make sure that default configuration is good enough to realize virtual networks without need for openflow controller.

In summary, I believe that Openflow controllers are not required to manage virtual networks in physical servers, but are required to add value added services such as traffic steering,  traffic visualization etc..

Sunday, January 22, 2012

IP Fragmentation versus TCP segmentation

Ethernet Controllers are increasingly becoming more intelligent with every generation of NICs.  Intel and Broadcom have added many features in Ethernet NIC chips in recent past.  Multicore SoC vendors are adding large number of features into Ethernet IO hardware blocks.

TCP GRO (Generic Receive Offload - It used to be called Large Receive offload too) and GSO  (Generic Segmentation Offload and it is used to be called Transport Segmentation Offload)  are two new features (in addition to FCoE offloads) one can see from Intel NICs and many Multicore SoCs.  These two features are  good for any TCP termination applications on the host processors/cores.  These two features reduces the number of packets traversing the host TCP/IP stack. 

TCP GRO works across multiple TCP flows where it aggregates multiple consecutive TCP segments (based on TCP sequence number) of a flow into one or few TCP packets in the hardware itself, there by sending very few packets to the host processor.  Due to this,  TCP/IP stack sees  fewer inbound packets.  Since the packet overhead is significant in TCP/IP stacks, lesser packets uses lesser number of CPU cycles, thereby leaving more CPU cycles for applications, essentially increasing the performance of overall system.

TCP GSO intention is similar to TCP GRO,but for outbound packets.  TCP layer typically segments the packets based on  MSS value. The MSS value is typically determined from PMTU (Path MTU) value.  Since TCP and IP headers take 40 bytes of data,  MSS is typically ( PMTU -  40 ) bytes.  If PMTU is 1500 bytes, then the result MSS value is 1460. When the application tries to send large amount of data,  then the data is segmented into multiple TCP packets where each TCP payload carries up to 1460 bytes.  TCP GSO feature in the hardware eliminates the need for TCP layer to do the segmentation and thereby reduces the number of packets that traverse between TCP layer and to the hardware NIC.  TCP GSO feature in the hardware typically expect the MSS value along with the packet and it does everything necessary internally to segment and send the segments out.

Ethernet Controllers are increasingly providing support for IP level fragmentation and reassembly.  Main reason is being  increasing popularity of tunnels.

With increasing usage of tunnels (IPsec, GRE, IP-in-IP,  Mobile IP, GTP-U and futuristic VXLAN and LISP), the packet size is going up.  Though these tunnel protocol specifications provides guidelines to avoid fragmentation using DF bit and PMTU discovery,  it does not happen in reality.  There are very few deployments where DF (Don't Fragment bit) , which is required for PMTU discovery, is used.   As far as I know,  almost all IPv4 deployments fragment the packets during tunneling.  Some deployments configure network devices to do red-side fragmentation (fragmentation before tunneling so that the tunneled packets appear whole IP packet) and some deployments go for black-side fragmentation (fragmentation after tunneling is done).   On receive direction, reassembly happens either before detunneling or after detunneling. 

It used to be the case where fragmented packets are given lesser priority by service providers during network congestion.  With high throughput connectivity and increasing customer base for networks, service providers are competing for the business by providing very good reliability and high throughput connectivity. Due to popularity of tunnels,  service providers are also realizing that dropping fragmented packets may result in bad experience to their customers.  It appears that service providers are not treating the fragmented packets in a step-motherly fashion anymore.

IP fragmentation and TCP segmentation offload methods can be used to reduce the number of packets traversing the TCP/IP stack in the host.  Next question that comes to mind is how to tune the TCP/IP stack to use these features and how to divide the work  between these two HW features. 

First thing to tune in the TCP/IP stack is to remove the MSS dependency on PMTU.  As described above, today MSS is calculated based on PMTU value. Due to this, IP fragmentation is not used by TCP stack for outbound TCP traffic. 

TCP Segmentation adds the both TCP and IP header to each segment.  That is, for every 1460 bytes, there would be overhead of 20 bytes of IP header and 20 bytes of TCP header.  In case of IP fragmentation,  each fragment would have its own IP header (20 bytes of overhead).  Since TCP segmentation has more overheads,  one can say IP fragmentation is better.  Here, MSS can be set to a bigger value such as 16K and let IP layer fragment the packet if the MTU value is less than 16K.   This is certainly a good argument and it works fine in networks where the reliability is good.  Where the reliability is not good,  if one fragment gets dropped, TCP layer needs to send entire 16K bytes in retransmission.  If TCP had done the segmentation, it would only need to send fewer bytes. 

There are advantages and disadvantages with both approaches. 

With increased reliability of networks and with no special treatment on fragmented traffic by service providers,  IP fragmentation is not a bad thing to do.  And ofcourse, one should worry about retransmissions too. 

I hear few tunings based on the deployments.  Warehouse data center deployments where the TCP client and servers in a controlled environment are tuning MSS to 32K and more with 9K (jumbo frame) of MTU.  I think that , for 1500 bytes MTU,  going with 8K of MSS may work good.

Saturday, January 21, 2012

SAMLv2 mini-tutorial and some good resources

Why SAML  (Security Assertion Markup Language)?

Single Sign-On:  Many organizations have multiple intranet servers with web front end.  Though all servers interact with common authentication database such as LDAP, SQL databases,  each server used to take authentication credentials from employee.  That is, if employee logs into server1 and then goes to server2 employees are expected to provide credentials again to the server2.  It may sound okay if the user is deliberately going to different servers.  But,  this is not good experience if there are hyperlinks from one server pages to other servers.  Single Sign-On solves this issue. Once the user logs into a server successfully,  the user does not need to provide his/her credentials when he/she accesses other servers.  Single authentication with single sign-on is becoming common requirement in organizations.

Cloud Services:  Organizations are increasingly adopting Cloud Services for cost reasons.  Cloud Services are provided by third party companies.  For example, provides Cloud service for CRM.  Taleo provides cloud service for talent acquisition systems.  Companies now have intranet servers,  some cloud services from different cloud service vendors.  Employees accessing the cloud service should not be asked to create accounts in the cloud service.  It is expected that cloud service provider uses the its subscriber organizations authentication database to validate the users signing in.  Many companies would not like to provide the access to their authentication database to cloud services directly.  Also,  many companies would not like their employees to provide their organization login credentials to cloud service for the fear of compromising the passwords, especially, senior and financial executives.  This requires the need for cloud services to redirect the employees to login using companies web servers and use single sign on mechanisms to allow the users to access cloud services.

SAMLv2 (version 2 is the latest version) facilitates the single sign-on not only across intranet servers of the company, but also across services provided by cloud providers.

How does SAMLv2 work?

There are two participants in SAML world -  Service Provider (SP) and Identity Provider (IDP).  Service provider is referred to represent the intranet servers and cloud servers.  Service provider provides some services to the authenticated users.  Identify provider refers to the system which takes login credentials from the user and authenticates the user using local database or LDAP database or any other authentication database.

When the user tries to access the resources in service providers,  SAMLv2 component of Service provider (SP)  first checks whether the user is already authenticated. If so, it allows the access to its resources. If the user is not authenticated, then rather than providing login screen and taking credentials, it instructs the user browser (by sending HTTP Redirect response with new location header containing the identity provider URL) to redirect the request to the identity provider.   Ofcourse SAMLv2 component of the SP needs to be configured with the IDP Provider URL beforehand by administrator.  SAMLv2 component of SP generates SAML request (in XML form) and 'Relay State".  It then adds these two to the location URL of the HTTP redirect response.   Browser then redirects the request as per location header to the IDP.   Now, IDP has the SAML request and Relaystate information. 

IDP first checks whether the user is already authenticated with it.  If not,  it sends the login page to the user as part of HTTP response.  Once the credentials are validated based on the authentication database,  SAMLv2 component of IDP generates the SAML response with the result of authentication,  principal name and  attests the response using private key of its certificate by signing the SAML response.  It also can add some more information about the user via attribute statements.  It then sends both SAML response and relay state to the browser as response to the last HTTP request it got (which is typically the credentials request).  IDP normally makes a HTTP page with embedded POST request  with  URL  which is SP URL (which it got from the SAMP request), SAML response and relaystate (which it got before) and javascript which is used by browser to send the POST request.   Browser gets this response and due to javascript it automatically posts the request to SP.  This POST request contains the SAML response and RelayState.

Now SAMLv2 component of SP ensures that SAML response is valid by verifying the digital signature.  Note that the public key (certificate) of IDP is expected to be configured in the SP beforehand by the administrator.  Then it checks the subject name of principal and result of authentication.  If result of authentication indicates 'success',  then it sends HTTP redirect response to the browser with original URL (from relaystate) as location header.  This makes the browser go the original URL to get hold of response.

Now onwards,  user would be able to access the pages in the SP.  If the user goes to another SP of the company,  same process as above would be repeated.  Since the user is already authenticated with the IDP,  IDP does not take login credentials again.  Since user does not see all the transactions that are happening underneath,  he/she gets single sign-on experience.

By keeping the IDP in Enterprise premises,  one can be sure that passwords are not given to any SPs whether they are intranet SPs or Cloud Service SPs.

SAMLv2 - Other SSO use cases:

The use case described above is called 'SP initiated SSO'. It is called SP initiated because user is trying to access the SP first.   Other use case in SSO is 'IdP initiated SSO' .   User access the IdP first.  IdP authenticates the user and then user is presented with links to different SPs that user can access.  When the user clicks on these special links, SAMLv2 component on the IDP generates the HTTP response with HTML page having POST binding to the SAMLv2 component of SP (It is called Assertion Consumer Service URL).  SAMLv2 response is provided in the HTML page as one of the post fields.  It adds 'RelayState' parameter with the URL that the user is supposed to be shown on the SP.   Note that 'RelayState" in this case is unsolicited.  There is no guarantee that the ACS in the SP will honor this parameter and also there is no guarantee that all ACSes treat the relayState as URL.   But many systems expect the URL in the relayState (including Google Apps Cloud Service) and hence sending URL is not a bad thing.

Above two cases 'SP Initiated SSO' and 'IdP initiated SSO' are mechanisms to access the resource by single authentication.  'Logout' is another scenario.  When user logs out (whether on a SP or on IdP),  the user should be logged out on all SPs.   This is called "Single Logout Profile". IdP is supposed to maintain all the SP information for which IdP was contacted for all users.  When the user logs out on IDP,  for all SPs in the user login session, it sends a logout request to 'Logout Service' component of SP.    If user logs out on SP and indicates his willingness to logout on all SPs, then the SP in addition to destroying the user authentication context on the SP also redirects the user browser to IDP. As part of redirection, it also frames the logout request too to the IDP.  IDP then destroys the context in the authentication context.  If the user indicated that he would be like to be logged out from all SPs, then the IdP would send the logout requests over SOAP to all other SP in the user authentication context.

For more detailed information about SAMLv2, start with SAMLv2 Technical overview and SAMLv2 specifications. You can find them here:

IdP Proxy use case

So far the use cases described above talk about SP and IdP interaction via browser or directly.   SAML does not prohibit IdP proxy use case where IdP proxy work as IdP for SPs and as SP for IdPs.   Please see some links on this topic:

Open Source SAMLv2 SP,  IdP and IdP Proxy:

I find that OpenAM (fork of Sun OpenSSO) as one of the popular open source SSO authentication framework.  You can find more information here:

IdP Proxy configuration is detailed here:

One more good site I found which details out IDP proxy and source code in Python:

Some good links, though not related IDP Proxy:

Google Apps as Service Provider and OpenAM as IDP:

OpenAM as Service provider and Windows ADFS as IDP:

Need for Pattern Matching Accelerators in UTM devices

Network security term typically refers to Threat prevention and Security on the wire.

Threat protection is normally achieved with multiple security technologies.  Basic protection is achieved from firewall technology.  IDS/IPS (Intrusion Detection/Prevention System),  Anti-Virus, Web application firewalls are some of the security technologies that are increasingly being used to protect networks (Network devices, Servers and Client machines).  Application Detection is another technology that is increasingly being used along with firewall to stop/allow traffic that can't be identified using ports in TCP/UDP headers, but requiring deep packet inspection.

Other than firewall,  all technologies listed above require deep packet and deep data inspection.  IDS/IPS technology adopts multiple techniques to identify the attack traffic. One of the techniques is to match the traffic data with known attack patterns.  Application detection also relays on pattern matching on the data as one of the techniques to detect the applications.  Anti Virus technology too depends on some pattern matching to detect viruses.

In almost all technologies above,  patterns get added to the deployed systems on continuous basis by device vendors as more attacks are discovered.  For example,  IPS devices, nowadays have around10,000 patterns (signatures) to detect the known attacks.  It keeps increasing every year.  Additionally, Some of these patterns are checked on every packet that goes through IPS.  This adds to number of CPU cycles requires to do IPS protection.

Many software algorithms are used to speed up the pattern matching performance.  Some of the techniques inlcude:
  • DFA (Deterministic Finite State Automata)
  • Bloom filters - Filters formed from the hashes of patterns can be used on the traffic to determine whether further analysis is required.
  • PCRE algorithms to search for patterns of regular expression type.

IPS and other technologies also use techniques to reduce the number of patterns to be matched using protocol level intelligence and classifying the patterns in multiple buckets (protocol basis,  port basis,  even on the basis of application protocol stages such as URL based pattern database,  HTTP Request header,  Response header pattern databases etc..).

Due to above techniques,  some device vendors  think that there is no need for pattern matching hardware accelerators.  There is a reason for that too as some early developments of snort (popular open source IDS/IPS software) did not find much performance improvement with hardware accelerators.  But I believe HW accelerators are required for following reasons.

Performance Determinism:  IPS, Anti Virus,  Web application firewall and application detection technologies depend on the regular signature updates. Hardware deployed in the fields might have X number of signatures a day of purchase and they might go up to 2X or 3X over the years.  Performance determinism is expected by CSOs.  To maintain performance levels,  CPUs should be avoided in doing pattern matching.  Hardware accelerators specialized in pattern matching help in maintaining performance levels even with increasing number of signatures.

Protection from CPU hogging attacks:  With software based pattern matching, it is possible to hog the CPUs by crafting the packets with each data that matches a patterns multiple times.  Consider that there is a signature rule which tries to match a pattern "abc123def" and if there is 1Mbytes of data is sent with all the data having "abc123def" repeated,  then the CPU would take forever as it matches every packet multiple times.  CPU will not only spend time in matching the patterns, but also spends significant number of cycles in doing  further analysis.   Hardware accelerators normally designed such a way that the performance does not go down even if there are multiple matches.

Next question would be the what capabilities of hardware accelerators one should look for to mitigate performance issues  - One associated with explosive growth of attack patterns (signatures) and avoid CPU hogging by deliberate attempts by attackers.  I believe one should look for following capabilities.
  • Accelerators should be programmable with decent number of patterns.
  • Accelerators should be able to perform well even with large number of patterns.
  • Accelerators should be able to perform well even if there are large number of matches.
  • Accelerators should be able to perform pattern matches based on context information such as 'relative offset', 'Depth of the data to look' while doing pattern matching.  This will reduce the number of results being returned by the accelerator.  Smaller the number of results to software, lesser the post processing.
  • Accelerators should be able to return results only when multiple patterns match on the data.  This also is required to reduce the number of results. 
In summary,  pattern matching hardware accelerators are required to reduce the CPU hogs either due to increase in signatures or intelligently crafted data by attackers. I feel that end customers should buy the UTM/IPS devices that take advantage of these accelerators to ensure that devices can be used at least for few years (future proof).