Friday, August 21, 2015

Some informational links on HSM

Some good information/links on HSM and related technologies

Safenet Luna, which is one of the popular HSMs in the market :

Cloud HSM (Based on Safenet Luna) :

Very good blog entry on mapping OpenSSL with HSMs using OpenSSL PKCS11 engine interface :  More details on actual steps with an example can be found here:

One more resource describing the OpenSSL integration :  One more place to get the same document:

PKCS11 standard :

Openstack Barbican provides key management functionality and can be enhanced to use an HSM internally. More information can be found at :

Wednesday, March 5, 2014

Openflow switches - flow setup & termination rate

I attended ONS yesterday.  This is the first time I have seen many OF hardware switch vendors (based on NPUs, FPGAs or ASICs) advertising performance numbers.  Though the throughput numbers are impressive, the flow setup/termination rates are, in my view, disappointing.  The flow setup rate claims I saw are anywhere between 100 and 2000/sec.

Software based virtual switches support flow setup rates of up to 10K/sec per core. If a 4-core part is used, one can easily get 40K flow setups/sec.

In my view,  unless the flow setup rate is improved, it is very unlikely the hardware based OF switches would be popular as the market addressability is limited.

From talking to a few vendors, I understand that the poor flow setup rate is mainly due to the way TCAMs are used.  As I understand it, every flow update (add/delete) requires rearranging the existing entries, and that leads to bad performance. Many of these vendors also say that they intend to support algorithmic search accelerators to improve the performance of flow insertions and deletions, and that this could raise the rate to hundreds of thousands of flow setups/sec.
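To make the rearrangement cost concrete, here is a toy model. It assumes (my simplification, not any vendor's actual design) that the TCAM keeps entries physically sorted by priority, so that every insert shifts all lower-priority entries down by one slot. The class name `ToyTcam` and its counters are made up for illustration.

```python
import bisect


class ToyTcam:
    """Toy model: entries are kept physically sorted by descending priority."""

    def __init__(self):
        self._keys = []     # negated priorities, ascending, so bisect can be used
        self._entries = []  # (priority, match) in physical order
        self.moves = 0      # number of existing entries shifted so far

    def insert(self, priority, match):
        # Place the new entry after existing entries of the same priority.
        pos = bisect.bisect_right(self._keys, -priority)
        # Every entry below the insertion point moves down one slot.
        self.moves += len(self._entries) - pos
        self._keys.insert(pos, -priority)
        self._entries.insert(pos, (priority, match))

    def lookup_order(self):
        return [match for _, match in self._entries]


tcam = ToyTcam()
for prio in range(1, 1001):  # worst case: each new flow outranks all existing ones
    tcam.insert(prio, f"flow-{prio}")
print(tcam.moves)  # prints 499500
```

In this worst case, 1000 inserts cause roughly half a million entry moves, which is the kind of quadratic behavior that would explain low flow setup rates if the hardware really worked this way.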

Unless the following are taken care of, hardware based OF switch adoption would be limited.
  • Higher flow setup rate. (Must be better than software based switches)
  • Ability to maintain millions of flows.
  • Support for multiple tables (All 256 tables)
  • VxLAN Support
  • VxLAN with IPSec
Throughput performance is very important and hardware based OF switches are very good at that.  In my view, all of the above are also important for any deployment to seriously consider hardware based OF switches in place of software based switches.

Tuesday, March 4, 2014

OVS Acceleration - Are you reading between the lines?

OVS, part of Linux distributions, is an Openflow based software switch implementation.  OVS has two components - a user space component and a kernel space component.  The user space component consists of the OVSDB (configuration) sub-component, the Openflow agent that communicates with Openflow controllers, and the OF normal path processing logic.  The kernel component implements the fastpath.  OVS calls the kernel component 'datapath'.  The OVS datapath maintains one table.  When a packet comes into the datapath logic, it looks for the matching flow in the table.  If there is no entry, the packet is sent to the user space 'normal path processing' module.  OVS, in user space, checks whether there are matching entries in all the OF tables of that pipeline. If there are, it pushes the aggregate flow to the kernel datapath.  Further packets of the aggregate flow get processed in the kernel itself.  Just to complete this discussion, if OVS does not find a matching flow in the OF pipeline, then, based on the miss entry, it sends the packet to the controller. The controller then pushes flow mod entries to OVS.
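The fastpath/normal-path split described above can be sketched as a cache in front of a multi-table lookup. This is a deliberately simplified model, not OVS code: the names `ToyDatapath` and `make_slow_path` are hypothetical, and it assumes every OF table is keyed by the same opaque flow key.

```python
class ToyDatapath:
    """Kernel-style fast path: a single table keyed by a flow key."""

    def __init__(self, slow_path):
        self.cache = {}
        self.slow_path = slow_path
        self.hits = 0
        self.misses = 0

    def receive(self, key):
        actions = self.cache.get(key)
        if actions is not None:
            self.hits += 1
            return actions
        self.misses += 1
        actions = self.slow_path(key)   # "upcall" to the user space normal path
        if actions is not None:
            self.cache[key] = actions   # install the aggregate flow
        return actions


def make_slow_path(tables):
    """tables: list of dicts, each mapping a flow key to a list of actions."""
    def slow_path(key):
        aggregate = []
        for table in tables:
            actions = table.get(key)
            if actions is None:
                return None             # table miss: would go to the controller
            aggregate.extend(actions)
        return aggregate
    return slow_path


tables = [{"vm1->vm2": ["set_vlan:10"]}, {"vm1->vm2": ["output:2"]}]
dp = ToyDatapath(make_slow_path(tables))
first = dp.receive("vm1->vm2")    # miss: walks both tables, installs the aggregate
second = dp.receive("vm1->vm2")   # hit: served from the single datapath table
```

Only the first packet of the flow pays for the full pipeline walk; later packets hit the single aggregate entry.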

It is believed that the Linux kernel based datapath is not efficient for the following reasons.
  • Linux kernel overhead before the packet is handed over to the datapath.  Before a packet is handed over to the OVS datapath, the following processing occurs on it.
    • Interrupt processing
    • Softirq processing
    • Dev Layer processing
  •  Interrupt overhead 
  • Fast Path to Normal path overhead for first few packets of a flow.

A few software/networking vendors have implemented a user space datapath using the OVS 'dpif' provider interface. Many of these vendors have implemented the user space datapath using Intel DPDK in poll mode.  Poll mode dedicates a few cores to receiving packets from the Ethernet controller directly, eliminating interrupts and thereby avoiding the overhead associated with interrupt processing. Also, since the Ethernet ports are owned by this custom user space process, there is no need for any complex hook processing as needed in the Linux kernel.  This user space datapath can start OF processing on the packets immediately upon reception from the Ethernet controllers, thereby avoiding any intermediate processing overhead. There is no doubt that these two features by themselves provide a good performance boost.

Many of the OVS acceleration implementations, as far as I know today, don't perform well in an Openstack environment.  Openstack orchestration with KVM not only uses OVS to implement the virtual switch, but also uses OVS to realize network virtualization (using VxLAN, NVGRE). In addition, it uses Linux IPTables to implement isolation among VMs using security groups configuration.  Moreover, some deployments even use IPSec over VxLAN to protect the traffic from eavesdropping.  Based on some literature I have read, user space datapath packages also don't perform well if VMs are connected to OVS via tuntap interfaces.   They work well if there is no need for isolation among VMs and no need for overlay based network virtualization.  That is, these acceleration packages work if VMs are trusted and VLAN interfaces are good enough.  But we know that Openstack integrated OVS requires VxLANs, isolation among VMs, usage of tuntap interfaces and even VxLAN over IPSec.

Note that isolation using firewall, tuntap interfaces & VxLAN-over-IPsec today use Linux kernel capabilities.  If the OVS datapath is in user space, packets have to traverse the kernel anyway (sometimes twice) to use these capabilities, which may cause even more performance issues than having the OVS datapath in the kernel.  I will not be surprised if the performance is a lot lower than with the native OVS kernel datapath.

Hence, while evaluating OVS acceleration, one should ask the following questions:

  • What features of OVS were exercised in measuring the performance?
  • How many OF tables are hit in this performance test?  Make sure that at least 5 tables are used in OVS, as this is a very realistic number.
  • Does it use VxLAN?
  • Does it implement firewall?
  • Does it implement VxLAN over IPSec?
  • Does it assume that VMs are untrusted?
If there is a performance boost with all features enabled, then that implementation is good. Otherwise, one should be very careful.

SDN Controllers - Be aware of Some critical interoperability & Robustness issues

Are you evaluating the robustness of an SDN solution?  There are a few tests that I believe one should run to ensure that an SDN solution is robust.  Since the SDN controller is the brain behind traffic orchestration, it must be robust enough to handle various Openflow switches from various vendors.  Also, the SDN controller must not assume that all Openflow switches behave well.

Connectivity tests

Though initial connections to OF controllers are normally successful, my experience is that connectivity fails in very simple non-standard cases.  The following tests would bring out any issues associated with connectivity. Ensure that they are all successful.

  • Controller restart :  Once the switch is connected successfully, restart the controller and ensure that the switch connects successfully with the controller again. It has been observed a few times that either the switch does not initiate the connection, or the controller loses some important configuration during restart and does not honor connections from some switches.  In some cases, the controller gets a dynamic IP address but has a fixed domain name, and the switches are incapable of taking an FQDN as the controller address. Hence it is important to ensure that this test suite is successful.
  • Controller restart & controller creating pro-active flows in the switches :  Once the switch is connected successfully and traffic is flowing normally, restart the controller and ensure that the switch connects to the controller successfully and traffic continues to flow.   It is observed that many switch implementations remove all the flows when they lose the main connection to the master controller. When the controller comes back up, it is normally the responsibility of the controller to establish the basic flows (pro-active flows).  Reactive flows are okay as they can get established again upon the packet-in.  To ensure that the controller is working properly across restarts, it is important to ensure that this test is successful.
  • Controller application restarts, but not the entire controller:  This is very critical as applications typically restart more often due to upgrades and due to bug fixes.  One must ensure that there is no issue with traffic related to other applications on the controller. Also,  one must ensure that there are no issues when the application is restarted either with new image or with existing image.  Also, one should ensure that there are no memory leaks.
  • Switch restart :  In this test, one should ensure that switch reconnects back successfully with the controller once it is restarted.  Also, one must ensure that the pro-active flows are programmed successfully by observing that the traffic continues to flow even after the switch restarts.
  • Physical wire disconnection :  One must also ensure that a temporary discontinuity does not affect the traffic flow.  It is observed that in a few cases the switch notices the TCP termination and the controller does not. Some switches seem to remove the flows immediately after the main connection breaks, and when they connect again they do not get the flow state back immediately, because at times the controller itself may not yet know that the connection was terminated. It is observed that a new connection coming from the switch is then assumed to be a duplicate and hence the controller does not initiate the flow setup process. The controller must assume that any new main connection from a known switch means the switch is connecting back after some blackout.  It should treat it as two events - disconnection of the switch followed by a new connection.
  • Change the IP address of the switch after successful connection with the controller :  Similar to above.  One must ensure that connectivity is restored and traffic flows smoothly.
  • Run above tests while traffic is flowing :  Just to ensure the system is stable even when the controller restarts while traffic is flowing through the switches.  And also ensure that controller is good when the switch restarts while packet-ins are pending in the controller and while controller is waiting for responses from the switch, especially during multi-part message response.  One black-box way of doing this is to wait until the application creates large number of flows in a switch (Say 100K flows).  Then issue "flow list" command (typically from CLI - many controllers provide mechanism to list out the flows) and immediately restart the switch and observe the behavior of controller. Once the switch is reconnected, let the flows be created and issue "flow list" command and ensure that this list is successful.
  • Check for memory leaks :  Run 24 hour tests by restarting the switch continuously.  Ensure that the script restarts the switch software only after it has connected to the controller successfully and traffic flows.  You would be surprised by the number of problems you can find with this simple test.  After 24 hours of testing, ensure that there are no memory or fd (file descriptor) leaks by observing the process statistics.
  • Check for robustness of the connectivity by running 24 hour test with flow listing from the controller.  Let the controller application create large number of flows in a switch (say 100K) and in a loop execute a script which issues various commands including "flow list".  Ensure that this test is successful and there are no memory or fd leaks.
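The behavior asked for in the physical wire disconnection test, treating a new main connection from an already-known switch as a disconnection followed by a fresh connection, can be sketched as below. `ToyController` and its event names are hypothetical and stand in for whatever session handling a real controller has.

```python
class ToyController:
    """Tracks one main connection per switch, keyed by datapath ID."""

    def __init__(self):
        self.sessions = {}  # datapath_id -> connection id
        self.events = []

    def on_connect(self, datapath_id, conn_id):
        if datapath_id in self.sessions:
            # Do not reject this as a duplicate: assume the old TCP session
            # died without the controller noticing, and clean it up first.
            self.on_disconnect(datapath_id)
        self.sessions[datapath_id] = conn_id
        self.events.append(("connected", datapath_id, conn_id))
        self.push_proactive_flows(datapath_id)

    def on_disconnect(self, datapath_id):
        conn_id = self.sessions.pop(datapath_id, None)
        if conn_id is not None:
            self.events.append(("disconnected", datapath_id, conn_id))

    def push_proactive_flows(self, datapath_id):
        # Placeholder for re-establishing the basic (pro-active) flows.
        self.events.append(("flows_pushed", datapath_id))


controller = ToyController()
controller.on_connect(1, "conn-a")
controller.on_connect(1, "conn-b")  # switch reconnects after a blackout
```

After the second `on_connect`, the event log shows the old session being torn down and the pro-active flows being pushed again, which is what the test above is meant to verify.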

Keep Alive tests

Controllers share memory across all the Openflow switches they control.  At times, one misbehaving Openflow switch might consume a lot of OF message buffers or memory in the controller, leaving the controller unable to respond to keep alive messages from other switches.  SDN controllers are expected to reserve some message buffers and memory to receive keep alive messages and respond to them. This reservation is required not only for keep alive messages, but also for each Openflow switch.  Some tests that can be run to ensure proper fairness by the controller are:
  • Overload the controller by running cbench, which sends a lot of packet-in messages and expects the controller to create flows.  While cbench is running, ensure that a normal switch's connectivity does not suffer.  Keep this test running for 1/2 hour to ensure that the controller is robust enough.
  • Same test, but tweak the cbench to generate and flood keep alive messages towards the controller.  While cbench is running,  ensure a normal switch connectivity to the controller does not suffer. Keep running this test continuously for 1/2 hour to ensure that the controller is robust enough.
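One possible way to implement the reservation mentioned above is a buffer pool in which ordinary messages can never consume the last few buffers. This is only a sketch under my own assumptions; `BufferPool` and its parameters are illustrative and not taken from any controller. A per-switch quota could be layered on top in the same way.

```python
class BufferPool:
    """Message buffer pool with a share reserved for keep alive (echo) messages."""

    def __init__(self, total, reserved_for_echo):
        if reserved_for_echo > total:
            raise ValueError("reservation exceeds pool size")
        self.free = total
        self.reserved = reserved_for_echo

    def allocate(self, is_echo):
        # Ordinary messages may not dip into the reserved share.
        floor = 0 if is_echo else self.reserved
        if self.free > floor:
            self.free -= 1
            return True
        return False

    def release(self):
        self.free += 1


pool = BufferPool(total=10, reserved_for_echo=2)
# A misbehaving switch floods packet-ins: only 8 of 100 get a buffer.
granted = sum(pool.allocate(is_echo=False) for _ in range(100))
# Keep alive messages from other switches can still be received.
echo_ok = pool.allocate(is_echo=True)
```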

DoS Attack Checks

There are a few tests that I believe need to be performed to ensure that the controller code is robust enough.  Buggy switches might send truncated packets or badly corrupted packets with a wrong (very big) length value.
  • Check all the messages & fields in the OF specifications that have a length field.  Generate messages (using cbench?) with a wrong length value (higher than the message size) and ensure that the controller does not crash or misbehave.  It is expected that the controller eventually terminates the connection with the switch with no memory leaks.
  • Run above test continuously to ensure that there are no leaks.
  • Messages with maximum length value in the header:  It is possible that controllers allocate a big chunk of memory for such messages, leading to memory exhaustion.  The right thing for the controller to do is to enforce a maximum message length and drop (drain) an oversized message without storing it in memory.
I would like to hear from others on what kind of tests you believe must be performed to ensure the robustness of controllers.
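A minimal sketch of such a length check on the controller side follows. Every Openflow message starts with an 8-byte header (version, type, length, xid) in network byte order, where length covers the whole message. The `MAX_MESSAGE_LEN` limit and the return convention of `check_header` are my own assumptions for illustration, not something the specification mandates.

```python
import struct

OFP_HEADER = struct.Struct("!BBHI")  # version, type, length, xid
MAX_MESSAGE_LEN = 16 * 1024          # assumed local policy limit, not from the spec


def check_header(data, max_len=MAX_MESSAGE_LEN):
    """Return ("ok", length), ("need_more", None) or ("drop", reason)."""
    if len(data) < OFP_HEADER.size:
        return ("need_more", None)
    version, msg_type, length, xid = OFP_HEADER.unpack_from(data)
    if length < OFP_HEADER.size:
        return ("drop", "length smaller than header")
    if length > max_len:
        # Do not allocate 'length' bytes; drain or close the connection instead.
        return ("drop", "length exceeds local limit")
    return ("ok", length)


good = OFP_HEADER.pack(4, 2, 8, 1)       # 8-byte message, header only
short = OFP_HEADER.pack(4, 2, 4, 2)      # claims to be shorter than its header
huge = OFP_HEADER.pack(4, 10, 65535, 3)  # maximum value of the length field
```

The point of the design is that the decision is made from the 8 header bytes alone, before any buffer sized by the untrusted length field is allocated.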

Monday, March 3, 2014

SDN Controller - What to look for?


ONF is making the Openflow specification one of the standards enabling non-proprietary communication between the central control plane entity & distributed data plane entities. SDN Controllers are the ones which implement the control plane for various data path entities.  OVS, being part of the Linux distributions, is becoming the de facto virtual switch entity in data center and service provider market segments.  The OVS virtual switch, sitting in the Linux host, acts as a switch (data path entity) between virtual machines on the Linux host and the rest of the network.

As with virtualization of compute and storage, networks are also being virtualized. VLAN used to be one of the techniques to realize virtual networks. With the limit on the number of VLANs and the inability to extend VLAN based virtual networks over L3 networks, overlay based virtual network technology is replacing VLAN technology.   The VxLAN overlay protocol is becoming the virtual network technology of choice.  Today virtual switches (such as OVS) support VxLAN, and it is becoming the de facto overlay protocol in data center and service provider networks.

Another important technology that is becoming popular is Openstack.  Openstack is virtual resource orchestration technology to manage virtualization of compute, storage and network resources.  Neutron component of openstack takes care of configuration & management of virtual networks,  network services such as router,  DHCP, Firewall, IPSec VPN and Load balancers.  Neutron provides API to configure these network resources.  Horizon, which is GUI of openstack provides user interface for these services.

Network Nodes (a term used by the Openstack community) are the ones which normally sit at the edge of the data centers. They provide firewall capability between the Internet & data center networks, IPSec capability to terminate IPSec tunnels with the peer networks, SSL offload capability and load balancing capability to distribute the incoming connections to various servers.  Network nodes also act as routers between external networks & internal virtual networks.  In today's networks, network nodes are self-contained devices.  They have both control plane and data plane in each node.  Increasingly, it is being felt that SDN concepts can be used to separate out control plane & normal path software from data plane & fast path software.

Network nodes are also being used as routers across virtual networks within data centers for east-west traffic.  Some even  use them as firewall and load balancers for east-west traffic.  Increasingly,  it is being realized that network nodes should not be burdened with the east-west traffic and rather use virtual switches within each compute node to do this job.  That is, virtual switches are being thought to be used as distributed router, firewall and load balancer for east-west traffic.

Advanced network services, which do deep inspection of packets and data, such as Intrusion Prevention,  Web application firewalls,  SSL offload are being deployed in L2 transparent mode to avoid reconfiguration of networks and also to enable vmotion easily.  When deployed as virtual appliances, it also provides agility and scale-out functions.  It requires traffic steering capability to steer the traffic across required virtual appliances.  Though most of the network services are required for north-south traffic, some of them (such as IPS) are equally needed for east-west traffic.


As one can see from the above introduction, operators would like to see the following supported by the centralized control plane entity (SDN Controllers)
  • Realization of virtual networks
  • Control plane for network nodes 
  • Normal path software for network nodes.
  • Traffic Steering capability to steer the traffic across advanced network services
  • Distributed routing, firewall & Load balancing capability for east-west traffic.
  • Integration with Openstack Neutron

At no time should the centralized entity become a bottleneck, hence the following additional requirements come into play.

  • Scale-out of control plane entity (Clustered Controllers) - Controller Manager.
  • Performance of each control plane entity.
  • Capacity of each control plane entity.
  • Security of control plane entity

Let us dig through each one of the above.

Realization of Virtual Networks:

SDN Controller  is expected to provide following:


  • Ability to program the virtual switches in compute nodes.
  • No special agent in compute nodes.
  • Ability to use OVS  using Openflow 1.3+ 
  • Ability to realize VxLAN based virtual networks using flow  based tunneling mechanism provided by OVS.
  • Ability to realize broadcast & unicast traffic using OF groups.
  • Ability to  integrate with Openstack to come to know about VM MAC addresses and the compute nodes on which they are present.
  • Ability to use the above repository to program the flow entries in virtual switches without resorting to broadcasting the traffic to all peer compute nodes.
  • Ability to auto-learn VTEP entries.
  • Ability to avoid multiple data path entities in a compute node - one single data path for each compute node.
  • Ability to honor security groups configured in Openstack Nova. That is, ability to program flows based on security groups configuration without using 'IP tables' in the compute node.
  • Ability to use the 'Connection tracking' feature to enable stateful firewall functionality.
  • Ability to support IPSec in virtual networks across compute nodes.


Capacity is entirely based on the deployment scenario.  To the best of my knowledge, these parameters are realistic from a deployment perspective and also based on the capability of hardware.
  • Ability to support 256 compute nodes by one controller entity.  If there are more than 256 compute nodes, then more controllers in the cluster should be able to take care of the rest.
  • Ability to support multiple controllers - Ability to distribute controllers across the virtual switches.
  • Support for 16K Virtual networks.
  • Support for 128K Virtual ports
  • Support for 256K VTEP entries.
  • Support for 16K IPSec transport mode tunnels


  • 100K Connections/sec per SDN Controller node (Due to firewall being taken care in the controllers).  With new feature, that is being thought in ONF, called connection template,  this requirement of 100K connections/sec can go down dramatically.  I think 50K connections/sec or connection templates/sec would be good enough.
  • 512 IPSec tunnels/sec.

Control Plane & Normal Path software for network nodes

Functionality such as router control plane,  Firewall normal path,  Load balancer normal path & control plane for IPSec (IKE) are the requirements to implement control plane for network nodes.


  • Ability to integrate with Neutron configuration of routers,  firewalls,  load balancers & IPSec.
  • Support for IPv4 & IPv6 unicast routing protocols - OSPF, BGP, RIP and IS-IS.
  • Support for IPv4 & IPv6 Multicast routing protocols - PIM-SM
  • Support for IP-tables kind of firewall normal path software.
  • Support for IKE with public key based authentication.
  • Support for LVS kind of L4 load balancing software.
  • Ability to support multiple router, firewall and load balancer instances.
  • Ability to support multiple Openflow switches that implement datapath/fastpath functionality of network nodes.
  • Ability to receive exception packets from Openflow switches, process them through control plane/normal-path software & programming the resulting flows in openflow switches.
  • Ability to support various extensions to Openflow specifications such as
    • Bind Objects 
      • To bind client-to-server & Server-to-client flows.
      • To realize IPSec SAs
      • To bind multiple flow together for easy revalidation in case of firewalls.
    • Multiple actions/instructions to support:
      • IPSec outbound/inbound SA processing.
      • Attack checks - Sequence number checks.
      • TCP sequence number NAT with delta history table.
      • Generation of ICMP error messages.
      • Big Metadata
      • LPM table support
      • IP Fragmentation
      • IP Reassembly on per table basis.
      • Ability to go back to the tables whose ID is less than the current table ID.
      • Ability to receive all pipe line fields via packet-in and sending them back via packet-out.
      • Ability for controller to set the starting table ID along with the packet-out.
      • Ability to define actions when the flow is created or bind object is created.
      • Ability to define actions when the flow is  being deleted or bind object is being deleted.
      • Connection template support to auto-create the flows within the virtual switches.


  • Ability to support multiple network node switches - Minimum 32.
  • Ability to support multiple routers -  256 per controller node,  that is, 256 name spaces per controller node.
  • Ability to support 10K firewall rules per router.
  • Ability to support 256 IPSec policy rules per router.
  • Ability to support 1K pools in LVS on a per router basis.
  • Ability to support 4M firewall/Load balancer sessions.
  • Ability to support 100K IPSec SAs. (If you need to support mobile users coming in via IPSec)


  • 100K Connections or Connection templates/sec on per controller node basis.
  • 10K IPSec SAs/sec on per controller node basis.

Traffic Steering 


  • Ability to support network service chains
  • Ability to define multiple network services in a chain.
  • Ability to define bypass rules - to bypass some services for various traffic types.
  • Ability to associate multiple network service chains to a virtual network.
  • Ability to define service chain selection rules - Select a service chain based on the type of traffic.
  • Ability to support multiple virtual networks.
  • Ability to establish rules in virtual switches that are part of the chain.
  • Ability to support scale-out of network services.
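The chain selection and bypass rules in the list above could be modeled roughly as follows. The data layout, the `Chain` type and the traffic type labels are all hypothetical; a real controller would match on packet fields rather than on a string label.

```python
from dataclasses import dataclass, field


@dataclass
class Chain:
    name: str
    services: list                              # ordered network services
    bypass: dict = field(default_factory=dict)  # traffic type -> services to skip


def select_chain(selection_rules, traffic_type, default=None):
    """selection_rules: ordered list of (traffic_type, Chain) pairs."""
    for rule_type, chain in selection_rules:
        if rule_type == traffic_type:
            return chain
    return default


def services_for(chain, traffic_type):
    """Services the traffic actually traverses, after applying bypass rules."""
    skipped = chain.bypass.get(traffic_type, set())
    return [s for s in chain.services if s not in skipped]


web = Chain("web", ["firewall", "ips", "waf"], bypass={"https": {"ips"}})
rules = [("http", web), ("https", web)]

http_path = services_for(select_chain(rules, "http"), "http")
https_path = services_for(select_chain(rules, "https"), "https")
```

Here HTTP traffic traverses all three services, while HTTPS traffic bypasses the IPS. The controller would then program the virtual switches along the resulting path.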


  • Support for 4K virtual networks.
  • Support for 8 network services in each chain.
  • Support for 4K chains.
  • Support for 32M flows.


  • 256K Connections Or connection templates/sec.

Distributed Routing/Firewall/Load balancing for East-West traffic

As indicated before, virtual switches in the compute nodes should be used as data plane entity for these functions. As a controller, in addition to programming the flows to realize virtual networks and traffic steering capabilities,  it should also program flows to control the traffic based on firewall rules and direct the east-west traffic based on the routing information and load balancing decisions.


  • Ability to integrate with Openstack to get to know the routers, firewall & LB configurations.
  • Ability to act as control plane/normal-path entity for firewall & LB (Similar to network node except that it programs the virtual switches of compute nodes).
  • Ability to add routes in multiple virtual switches (Unlike in network node where the routes are added to only corresponding data plane switch).
  • Ability to support many extensions (as specified in network node section).
  • Ability to collect internal server load (For load balancing decisions).


  •  Support for 512 virtual switches.
  •  8M+ firewall/SLB entries.


  • 100K Connections/sec by one SDN controller node.

SDN Controller Manager

When there are multiple controller nodes in a cluster or multiple clusters of controllers,  I believe there is a need for a manager to manage these controller nodes.


  • Ability to on-board new clusters 
  • Ability to on-board new controller nodes and assigning them to clusters.
  • Ability to recognize virtual switches - Automatically wherever possible.  Where not possible, via on-boarding.
  • Ability to associate virtual switches to controller nodes and ability  to inform controller nodes on the virtual switches that would be connected to it.
  • Ability to schedule virtual switches to controller nodes based on controller node capabilities to take in more virtual switches.
  • Ability to act as a bridge between Openstack Neutron & SDN controller nodes in synchronizing the resources & configuration of Neutron with all SDN controller nodes.  Configuration includes:
    • Ports & Networks.
    • Routers
    • Firewall, SLB & IPSec VPN configuration.
  • Ensuring that configuration in appropriate controller node is set to avoid any race conditions.
  • Ability to set backup relations.
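As an illustration of scheduling virtual switches to controller nodes based on remaining capacity, here is a least-loaded placement sketch. The dictionary layout and the function `schedule_switch` are assumptions made for this example, and capacities are assumed to be positive.

```python
def schedule_switch(switch_id, controllers):
    """Assign a switch to the least-loaded controller node that has room.

    controllers: dict mapping node name -> {"capacity": int, "switches": set}.
    Returns the chosen node name, or None if every node is full.
    """
    candidates = [
        (len(node["switches"]) / node["capacity"], name)
        for name, node in controllers.items()
        if len(node["switches"]) < node["capacity"]
    ]
    if not candidates:
        return None
    _, name = min(candidates)  # lowest load ratio; ties broken by node name
    controllers[name]["switches"].add(switch_id)
    return name


nodes = {
    "ctrl-1": {"capacity": 2, "switches": set()},
    "ctrl-2": {"capacity": 2, "switches": set()},
}
placements = [schedule_switch(f"sw-{i}", nodes) for i in range(1, 6)]
```

With two nodes of capacity 2, the fifth switch cannot be placed, which is the point where the manager would need to on-board another controller node.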

Securing the SDN Controller

Since the SDN Controller is the brain behind the realization of virtual networks and network services, it is required that it is highly available and not prone to attacks. Some of the security features it should implement, in my view, are:
  • Support SSL/TLS based OF connections.
  • Accept connections only from authorized virtual switches.
  • Always work with  backup controller.
  • Synchronize state information with backup controller.
  • DDOS Prevention 
    • Enable Syn-Cookie mechanism.
    • Enable host based Firewall
    • Allow traffic that is of interest to SDN Controller. Drop all other traffic.
    • Enable rate limiting of the traffic.
    • Enable rate  limiting on the exception packets from virtual switches.
    • Control number of flow setups/sec.
  • Constant vulnerability assessment.
  • Running fragroute tools and isic tools to ensure that no known vulnerabilities are present.
  • Always authenticate the configuration users.
  • Provide higher priority to the configuration traffic.
Note: If one SDN controller node is implementing all the functions listed above,  it is required to combine all performance and capacity requirements.

Check out the SDN controller from Freescale: it has a comprehensive feature set and takes advantage of multiple cores to provide a very high performance system. Check the details here:

Saturday, July 6, 2013

VxLAN & Openflow Controller Role



In one of the previous posts here, I argued that there is no need for an Openflow controller to realize virtual networks.  In the same post, I also mentioned that an Openflow controller is required if intelligent traffic redirection is needed to realize features such as network service chaining and advanced traffic visualization.  These advanced features require controlling the traffic path on a per flow basis (5-tuple based connections) across physical/virtual appliances.

VxLAN, by default, tries to discover the remote endpoint using a discovery mechanism. This discovery mechanism involves sending the packet using multicast VxLAN encapsulation.   The compute node or endpoint that owns the DMAC address of the inner packet is expected to consume the packet.  Using learning mechanisms, each VxLAN endpoint creates VTEP entries.  Once VTEPs are learnt, packets are sent only to the intended endpoint.  Please see this VxLAN tutorial to understand the VxLAN functionality.

As discussed in the post about OVS & VxLAN based networks, OVS supports flow based overlay parameters such as remote endpoint address, VNI etc.
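The learning behavior described above can be sketched as a small table keyed by VNI and inner MAC address. This is a toy model with hypothetical names (`VtepTable`, `learn`, `next_hop`); it leaves out entry aging and everything else a real VxLAN endpoint does.

```python
class VtepTable:
    """Maps (VNI, inner MAC) to the IP address of the remote VxLAN endpoint."""

    def __init__(self):
        self.entries = {}

    def learn(self, vni, inner_src_mac, outer_src_ip):
        # Called for every received VxLAN packet: remember where the sender lives.
        self.entries[(vni, inner_src_mac)] = outer_src_ip

    def next_hop(self, vni, inner_dst_mac, multicast_group):
        # Unknown destination: fall back to the multicast group for discovery.
        return self.entries.get((vni, inner_dst_mac), multicast_group)


table = VtepTable()
before = table.next_hop(5000, "fa:16:3e:00:00:02", "239.1.1.1")
table.learn(5000, "fa:16:3e:00:00:02", "10.0.0.2")
after = table.next_hop(5000, "fa:16:3e:00:00:02", "239.1.1.1")
```

Before learning, the packet goes to the multicast group (`before` is "239.1.1.1"); once the entry is learnt, it is sent only to the intended endpoint (`after` is "10.0.0.2").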

Use cases requiring control from Openflow controller

One use case, as briefly touched upon, is flow-control (FC): a central entity across Openflow virtual switches that controls the traffic path, possibly on a per connection basis. The FC functionality, running on top of the OF controller, gets the packet-in (miss packet) from OVS, decides the next destination (an appliance implementing a network service) and then programs the flow in the OF switch with appropriate overlay parameters.

Another use case of the OF controller, in realizing VxLAN based virtual networking support, is sending and receiving multicast & broadcast packets.  VxLAN specifications support multicast transport for carrying inner multicast and broadcast packets.  As discussed above, the FC functionality is expected to get hold of any inner flows, whether they are unicast, multicast or broadcast based.   Also, I hear that network operators often don't allow multicast packets in their network infrastructure, but at the same time they don't want to stop VMs from using multicast based protocols. The OF controller can provide a feature to duplicate the broadcast and multicast inner packets as many times as there are possible destinations in the virtual network and send them over VxLAN using unicast VxLAN encapsulation.
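The duplication feature described above amounts to producing one unicast copy per remote endpoint of the virtual network. A sketch, assuming the controller already knows which VTEPs participate in each VNI; the function name and the dictionary standing in for an encapsulated packet are made up for illustration.

```python
def replicate_to_unicast(vni, frame, members, local_vtep):
    """Return one unicast VxLAN copy of an inner broadcast/multicast frame
    per remote endpoint of the virtual network.

    members: dict mapping VNI -> set of VTEP IP addresses in that network.
    """
    return [
        {"outer_dst": vtep, "vni": vni, "payload": frame}
        for vtep in sorted(members.get(vni, set()))
        if vtep != local_vtep  # never send a copy back to ourselves
    ]


members = {5000: {"10.0.0.1", "10.0.0.2", "10.0.0.3"}}
copies = replicate_to_unicast(5000, b"arp-request", members, local_vtep="10.0.0.1")
```

For a network with N participating servers this yields N-1 copies per inner broadcast packet, so the cost grows with the size of the virtual network.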

Support available in Cloud Orchestration tools (Openstack)

Fortunately, in cloud computing world, Openstack already maintains inventory information such as
  • Physical servers and their IP addresses.
  • VMs on each physical server and MAC addresses of the VM Ethernet Ports.
  • Virtual networks (VLAN based, VxLAN based or other overlay based).
  • Physical servers that are participating in a given virtual network.
  • Ports on OVS that are connected to the VMs and which virtual network they belong to.
Openstack provides APIs to get hold of the above information.  Also, Openstack has the ability to notify interested parties when the repository changes.  The Openflow controller can have a service called 'Cloud Resource Discovery' whose job is to keep the cloud resource repository in the OF controller and make it available to OF controller applications such as FC.
  • When it comes up, it can read the above information and keep the repository with it. Also, it can register to get the notifications.
  • Upon any notification,  update local repository information. 
Note that there could be multiple OF controllers working in a cluster to take care of the load among them.  Since Openstack is one entity across OF controllers, responsibility of ensuring that the cloud repository is consistent across OF controllers is with the Openstack entity.
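
The CRD behaviour described above (read the inventory once at start-up, then apply change notifications) can be sketched roughly as follows.  This is only an illustration and not an Openstack API: the fetch_inventory callable, the event dictionary layout and the event type names are all hypothetical.

```python
from typing import Callable, Dict


class CloudResourceDiscovery:
    """Keeps a local copy of the cloud inventory for OF controller applications."""

    def __init__(self, fetch_inventory: Callable[[], Dict[str, dict]]):
        # fetch_inventory is a hypothetical stand-in for the Openstack query
        # APIs; it returns {resource_id: record}.
        self._fetch_inventory = fetch_inventory
        self.repository: Dict[str, dict] = {}

    def start(self) -> None:
        # At start-up, read the complete inventory once.
        self.repository = dict(self._fetch_inventory())

    def on_notification(self, event: dict) -> None:
        # Hypothetical event layout: {"type": ..., "id": ..., "record": ...}.
        kind = event["type"]
        if kind in ("created", "updated"):
            self.repository[event["id"]] = event["record"]
        elif kind == "deleted":
            self.repository.pop(event["id"], None)


crd = CloudResourceDiscovery(lambda: {"vm-1": {"host": "node1"}})
crd.start()
crd.on_notification({"type": "updated", "id": "vm-1", "record": {"host": "node2"}})
crd.on_notification({"type": "created", "id": "vm-2", "record": {"host": "node1"}})
crd.on_notification({"type": "deleted", "id": "vm-2"})
```

Applications such as FC would only read crd.repository; all writes go through the notification handler.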

Details of OF Controller Role

Now, let us discuss the capabilities required in OF controllers to realize VxLAN based virtual networks and take care of two functions: enabling FC and handling multicast/broadcast packets.  Here, I am assuming that the OF controller is used alongside Openstack.  I am also assuming that VMs generate the traffic and that it is destined for another VM.  OVS is used to implement OF virtual switches and VxLAN overlays.  Essentially, OVS provides a virtual switch with VMs connected on its north side and VxLAN ports connected on its south side.

The OF Controller needs three main components:
  • Cloud Resource Discovery (CRD) component
  • Flow Control (FC) component
  • OF Controller Transport (OCT) component (such as Opendaylight,  Freescale Openflow Controller Transport,  NOX Controller etc..).

Cloud Resource Discovery component:

Discovers the following from Openstack:
  •  Virtual Networks configured
  •  Physical Servers (OF Capable switches)
  •  OF Logical switches within each OF Capable switch (For example, in the Openstack world, there are always two OF logical switches, br-int and br-tun).
  •  Names of ports attached to the OF logical switches
  • Qualification of Ports (Network Port, Port that is connected to a VM, or Patch Port, and the virtual network it corresponds to).
  • Local VLAN IDs used in br-int of each OF Capable switch and their mapping to VxLAN network identifiers (VNIs).
  • VMs - MAC Addresses and the corresponding physical servers.


OF Controller components typically maintain a repository, which is populated by the CRD.  Many OCTs already have repositories.  For example, the Freescale OCT can store the following via its DPRM module:
  • OF Capable switches
  • OF logical switches in each OF capable switch
  • Containers (Domains)
  • OF Logical switches that belong to domains.
  • Port names associated with OF logical switch.
  • Table names associated with each domain.
  • It also has ability to put various attributes to each of above elements for extensibility.
Additional repositories are required : 
  • Virtual Networks.  For each virtual network
    • Fields:
      • Virtual Network Name
      • Virtual Network Description.
      • Type of virtual network
        • In case of VLAN:
          • VLAN ID
        • In case of VxLAN
          • VNI
      • Attributes
        • To store the Openstack UUID.
      • List of references to OF Capable switches.
        • In each reference  node
          • Array of OF Logical switch references. (Example:  br-int, br-tun)
          • For br-int logical switch
            • List of references of Ports that are connected towards VMs.
              • For each reference,  also  store the reference to VM.
            • List of references of ports that are connected to br-tun.
          • For  br-tun logical switch
            • List of references of ports that are connected to br-int
            • List of references of ports that are connected to network
      • List of references to OF VMs
      • List of references to Ports that are 
    • Views:
      • Based on name
      • Based on Openstack UUID
  •  VM Repository - Each record consists of
    • Fields:
      • VM Name
      • VM Description
      • Attributes:
        • Openstack VM UUID
        • Kind of VM -  Normal Application VM or  NS VM.
      • List of references of ports connected towards this VM on br-int (the MAC address is supposed to be part of the Port repository in DPRM).  For each reference:
        • VLAN associated with it (in br-int)
        • Reference to Virtual Network.
      • Reference to physical server (OF Capable switch)
    • Views
      • Based on VM name
      • Based on VM UUID
  •  Cell Repository
  •  Availability Zone repository
  •  Host Aggregate  repository
Existing repositories would be updated with the following:
  •  Port repository in each OF logical switch: Each port entry is updated with the following attributes
    •  Port Qualification attribute
      • VM Port or Network Port or Patch Port.
    • MAC Address attribute 
      • MAC Address (in case this is connected to VM port)
    • VM Reference attribute:
      • Reference to VM record.
    •  Virtual Network Reference attribute
      • Reference to Virtual network record.
  • OF Switch Repository :  I am not sure whether this is useful anywhere, but it is good to have following attributes 
    • VM Reference attribute
      • List of references to VMs that are being hosted by this switch.
    • VN References attribute:
      • List of references to VNs of which this switch is a part.
    • Availability Zone Reference attribute
      • Reference to  availability Zone record
    • Host Aggregate Zone Reference Attribute
      • Reference to Host Aggregate Zone record.
    • Cell Reference Attribute
      • Reference to Cell record 
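
To make the repository layout above a little more concrete, here is a minimal sketch of the Virtual Network and VM repositories with their two views (by name and by Openstack UUID).  The class and field names are my own illustrative choices, not the DPRM data model, and only a few of the fields listed above are included.

```python
from dataclasses import dataclass, field
from typing import Dict, Generic, List, Optional, TypeVar


@dataclass
class VirtualNetwork:
    name: str
    uuid: str            # Openstack UUID, kept as an attribute
    net_type: str        # "vlan" or "vxlan"
    segment_id: int      # VLAN ID or VNI, depending on net_type
    switches: List[str] = field(default_factory=list)  # OF capable switches


@dataclass
class VMRecord:
    name: str
    uuid: str            # Openstack VM UUID
    kind: str            # "application" or "ns" (network service VM)
    server: str          # physical server (OF capable switch)
    port_names: List[str] = field(default_factory=list)


T = TypeVar("T")


class Repository(Generic[T]):
    """Stores each record once and exposes two views: by name and by UUID."""

    def __init__(self) -> None:
        self._by_name: Dict[str, T] = {}
        self._by_uuid: Dict[str, T] = {}

    def add(self, record: T) -> None:
        # Both views point at the same record object.
        self._by_name[record.name] = record
        self._by_uuid[record.uuid] = record

    def by_name(self, name: str) -> Optional[T]:
        return self._by_name.get(name)

    def by_uuid(self, uuid: str) -> Optional[T]:
        return self._by_uuid.get(uuid)


networks: Repository[VirtualNetwork] = Repository()
networks.add(VirtualNetwork("tenant-a", "uuid-1", "vxlan", 2000, ["node1", "node2"]))
vms: Repository[VMRecord] = Repository()
vms.add(VMRecord("vm1", "uuid-vm1", "application", "node1", ["tap-vm1"]))
```

Keeping one dictionary per view makes both lookups constant time, which matters because FC consults these repositories on every packet-miss.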

Flow Control

This component creates flows based on the information available in the repositories.  Assuming the flow granularity is at the connection level, the first packet of any connection from a VM results in a packet-miss (packet-in) in br-int.  The OF Controller Transport receives it from the OF switch and hands it over to the FC.  The FC application knows the DPID of the switch and the in_port of the packet.  Using this information, it looks up the port repository and finds the virtual network information, such as the local VLAN ID used in br-int and the VxLAN VNI.  It also finds the remote IP address of the overlay endpoint based on the destination VM MAC (DMAC of the packet).

Assuming that the communicating VMs are on different compute nodes, the FC module needs to create a total of 8 flows to let the rest of the packets of the connection pass between the two VMs.
Any connection has a client-to-server side and a server-to-client side.  There are two OF logical switches in each compute node.  Since there are two compute nodes, the total number of flows is 2 (client-to-server and server-to-client) * 2 (two Openflow logical switches per node) * 2 (two compute nodes) = 8.

Let us assume that the compute nodes, Node1 and Node2, have IP addresses outIP1 and outIP2.
Let us also assume that VM1 on Node1 is making an HTTP connection (TCP, source port 30000, destination port 80) to VM2 on Node2.  The VM1 IP address is inIP1 and the VM2 IP address is inIP2.  Let us also assume that VxLAN uses port number 5000.

When the VxLAN based network is created, let us assume that the OVS agent created local VLAN 100 on Node1 and local VLAN 101 on Node2 for the VxLAN network whose VNI is 2000.

FC establishes the following flows:
  • Node1
    • BR-INT:
      • Client to Server flow
        • Match fields
          • Input Port ID:  < Port ID to which VM is connected to>
          • Source IP:  inIP1
          • Destination IP:  inIP2
          • Protocol:  TCP
          • source port : 30000
          • destination port: 80
        • Actions:
          • Add VLAN tag 100 to the packet.
          • Output port :  Patch Port that is connected between br-int and br-tun.
      • Server to Client flow
        • Match fields:
          • Input Port ID: 
          • VLAN ID: 100
          • Source IP: inIP2
          • Destination IP: inIP1
          • Protocol:  TCP
          • Source port: 80
          • Destination Port: 30000
        • Actions
          • Remove VLAN tag 100
          • Output port
    • BR-TUN
      • Client to Server flow
        • Match fields:
          • Input Port ID:  Patch Port 
          • VLAN ID:  100
          • Source IP:  inIP1
          • Destination IP:  inIP2
          • Protocol:  TCP
          • source port : 30000
          • destination port: 80
        • Actions
          • Set Field:
            • Tunnel ID: 2000
            • Tunnel IP : outIP2
          • Remove VLAN tag 100
          • Output Port: 
      • Server to Client flow:
        • Match fields:
          • Input Port ID: 
          • Tunnel ID:  2000
          • Source IP: inIP2
          • Destination IP: inIP1
          • Protocol:  TCP
          • Source port: 80
          • Destination Port: 30000
        • Actions
          •  Add VLAN tag 100
          • Output port: < br-tun end of patch port pair>
  • Node 2:
    • BR-INT:
      • Server to Client flow:
        • Match fields
          • Input Port ID:  < Port ID to which VM is connected to>
          • Source IP:  inIP2
          • Destination IP:  inIP1
          • Protocol:  TCP
          • source port :80
          • destination port: 30000
        • Actions:
          • Add VLAN tag 101 to the packet.
          • Output port :  .
      • Client to Server flow
        • Match fields
          • Input Port ID:
          • VLAN ID: 101
          • Source IP: inIP1
          • Destination IP: inIP2
          • Protocol:  TCP
          • source port: 30000
          • destination port: 80
        • Actions:
          • Remove VLAN tag 101
          • Output port: 
    • BR-TUN :
      • Client to Server flow:
        • Match fields:
          • Input Port ID: 
          • Tunnel ID:  2000
          • Source IP:  inIP1
          • Destination IP:  inIP2
          • Protocol:  TCP
          • source port : 30000
          • destination port: 80
        •  Actions
          • Add VLAN tag 101
          • Output port:
      • Server to Client flow:
        • Match fields:
          • Input port ID:
          • VLAN ID: 101
          • Source IP:  inIP2
          • Destination IP:  inIP1
          • Protocol:  TCP
          • source port :80
          • destination port: 30000
        • Actions:
          • Set field:
            • Tunnel ID: 2000
            • Remote IP :  outIP1
          •  Remove vlan 101.
          •  Output port:
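
The eight flows listed above follow a regular pattern, which the sketch below tries to capture.  It is only an illustration: the flow dictionaries are not Openflow messages, and the action names and all port names (vm_port, int_patch, tun_patch, vxlan_port) are hypothetical stand-ins for values the FC would read from its repositories.

```python
def unicast_flows(c_ip, s_ip, c_port, s_port, vni, nodes):
    """Return the flows for one TCP connection between two compute nodes.

    nodes holds two dicts, client node first. Every key in them is a
    hypothetical name for a value taken from the repositories.
    """
    c2s = {"src_ip": c_ip, "dst_ip": s_ip, "proto": "tcp",
           "src_port": c_port, "dst_port": s_port}
    s2c = {"src_ip": s_ip, "dst_ip": c_ip, "proto": "tcp",
           "src_port": s_port, "dst_port": c_port}
    flows = []
    for node, peer, outbound, inbound in (
        (nodes[0], nodes[1], c2s, s2c),  # client node: C-S leaves, S-C arrives
        (nodes[1], nodes[0], s2c, c2s),  # server node: S-C leaves, C-S arrives
    ):
        name, vlan = node["name"], node["vlan"]
        # br-int, VM towards br-tun: add the local VLAN tag.
        flows.append({"node": name, "bridge": "br-int",
                      "match": {"in_port": node["vm_port"], **outbound},
                      "actions": [("push_vlan", vlan),
                                  ("output", node["int_patch"])]})
        # br-int, br-tun towards VM: remove the local VLAN tag.
        flows.append({"node": name, "bridge": "br-int",
                      "match": {"in_port": node["int_patch"], "vlan": vlan,
                                **inbound},
                      "actions": [("pop_vlan", vlan),
                                  ("output", node["vm_port"])]})
        # br-tun, outgoing: set tunnel ID and remote IP, remove the VLAN tag.
        flows.append({"node": name, "bridge": "br-tun",
                      "match": {"in_port": node["tun_patch"], "vlan": vlan,
                                **outbound},
                      "actions": [("set_tunnel_id", vni),
                                  ("set_tunnel_dst", peer["out_ip"]),
                                  ("pop_vlan", vlan),
                                  ("output", node["vxlan_port"])]})
        # br-tun, incoming: match on the tunnel ID, add the local VLAN tag.
        flows.append({"node": name, "bridge": "br-tun",
                      "match": {"in_port": node["vxlan_port"], "tun_id": vni,
                                **inbound},
                      "actions": [("push_vlan", vlan),
                                  ("output", node["tun_patch"])]})
    return flows


node1 = {"name": "node1", "out_ip": "outIP1", "vlan": 100, "vm_port": "vm1-port",
         "int_patch": "patch-tun", "tun_patch": "patch-int", "vxlan_port": "vxlan0"}
node2 = {"name": "node2", "out_ip": "outIP2", "vlan": 101, "vm_port": "vm2-port",
         "int_patch": "patch-tun", "tun_patch": "patch-int", "vxlan_port": "vxlan0"}
flows = unicast_flows("inIP1", "inIP2", 30000, 80, 2000, [node1, node2])
```

The loop body produces four flows per node (two per logical switch), which gives the 2 * 2 * 2 = 8 flows counted earlier.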
In the case of broadcast and multicast inner packets, FC could use OF 1.3 groups of type 'ALL' with as many buckets as there are destinations.  Each bucket must carry its own set of 'set-field' actions.

Assume that FC sees a multicast flow from VM1 on Node1 (outIP1) with destination IP inMIP, UDP protocol, destination port 2222 and source port 40000.  Assume also that it should go to compute nodes Node2 (outIP2), Node3 (outIP3) and Node4 (outIP4), as they host VMs that are interested in these multicast packets.  FC would then generate the following flows in Node1 and Node2.  Flows in Node3 and Node4 look similar.

  • Node 1:
    • br-int
      • Match fields
        • Input port: 
        • Source IP :  inIP1
        • Destination IP:  inMIP
        • Protocol: UDP
        • Source port: 40000
        • Destination Port:  2222
      • Actions
        • Add VLAN tag : 100
        • Output port: 
    •  br-tun
      • Group object (G1):
        • Type: ALL
        • Bucket 1: 
          • Set-fields:
            • Tunnel ID: 2000
            • Remote IP: outIP2
          • Remove VLAN tag 100
          • Output port:
        • Bucket 2
          • Set-fields:
            • Tunnel ID: 2000
            • Remote IP: outIP3
          • Remove VLAN tag 100
          • Output port:
        • Bucket 3
          • Set-fields:
            • Tunnel ID: 2000
            • Remote IP: outIP4
          • Remove VLAN tag 100
          • Output port:
      • Match fields:
        • Input port :
        • VLAN ID: 100
        • Source IP :  inIP1
        • Destination IP:  inMIP
        • Protocol: UDP
        • Source port: 40000
        • Destination Port:  2222
      • Actions
        • Group Object: G1
  •  Node 2
    • br-tun
      • Match fields:
        • Input port:
        • Tunnel ID: 2000
        • Source IP :  inIP1
        • Destination IP:  inMIP
        • Protocol: UDP
        • Source port: 40000
        • Destination Port:  2222
      • Actions:
        • Push VLAN tag 101
        • Output port:
    • br-int
      • Match fields:
        • Input port:
        • VLAN ID: 101
        • Source IP :  inIP1
        • Destination IP:  inMIP
        • Protocol: UDP
        • Source port: 40000
        • Destination Port:  2222
      • Actions
        • Remove VLAN tag 101
        • Output port:
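
The group object G1 above has one bucket per remote compute node, and the buckets differ only in the remote IP.  A small sketch of how FC might derive the buckets from the list of interested nodes is given here; the dictionary layout, the action names and the port name "vxlan0" are hypothetical and only mirror the list above.

```python
def all_group(group_id, vni, local_vlan, vxlan_port, remote_ips):
    """Describe an OF 1.3 group of type ALL with one bucket per remote node."""
    buckets = [
        {"actions": [("set_tunnel_id", vni),       # same VNI in every bucket
                     ("set_tunnel_dst", ip),       # differs per destination
                     ("pop_vlan", local_vlan),
                     ("output", vxlan_port)]}
        for ip in remote_ips
    ]
    return {"group_id": group_id, "type": "ALL", "buckets": buckets}


g1 = all_group(1, 2000, 100, "vxlan0", ["outIP2", "outIP3", "outIP4"])
```

Because the group type is ALL, the switch executes every bucket on its own copy of the packet, so the inner multicast packet leaves Node1 as three unicast VxLAN packets.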
FC uses the repository information to create the above flows.  Hence it is important to arrange the repository information in a data structure that allows fast lookup upon packet-miss.
FC also keeps a local copy of every flow that it creates in the OF logical switches.  This handles cases where OF switches evict flows.  When a packet-miss occurs because of an eviction, FC pushes the locally stored flow into the OF switch.  That is, whenever there is a packet-in, FC first needs to check its local run-time flow store before creating new flows by referring to the repository.
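
The packet-in handling order described above (local flow store first, repositories second) could look roughly like this.  It is a sketch under assumptions: build_flow and push_flow are hypothetical hooks for "derive a flow from the repositories" and "send a flow to the switch".

```python
class FlowControl:
    def __init__(self, build_flow, push_flow):
        # build_flow(key) derives a new flow from the repositories;
        # push_flow(flow) installs it in the OF switch. Both are hypothetical.
        self._build_flow = build_flow
        self._push_flow = push_flow
        self._store = {}  # local run-time flow store: key -> flow

    def on_packet_in(self, dpid, in_port, five_tuple):
        key = (dpid, in_port, five_tuple)
        flow = self._store.get(key)
        if flow is None:
            # Genuinely new connection: consult the repositories.
            flow = self._build_flow(key)
            self._store[key] = flow
        # New flow or flow evicted by the switch: install it either way.
        self._push_flow(flow)
        return flow


built, pushed = [], []


def build(key):
    built.append(key)
    return {"key": key}


fc = FlowControl(build, pushed.append)
conn = ("inIP1", "inIP2", "tcp", 30000, 80)
fc.on_packet_in(1, 5, conn)   # first packet of the connection
fc.on_packet_in(1, 5, conn)   # packet-in again, e.g. after an eviction
```

In the example the flow is built once but pushed twice, which is the behaviour wanted when the switch has evicted an entry.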

A few more considerations that the FC and CRD components need to take care of:
  • VM Movement:  When a VM is moved, the flows created in the OVS OF switches should also be moved accordingly.  The CRD component is expected to listen for VM movement events from Openstack and update the repositories internally.  The FC component, in turn, should update the OF flows accordingly, removing flows from the old compute node and installing them in the new compute node.

Thursday, July 4, 2013

Linux OVS & VxLAN based virtual networks

OVS is the Openflow switch implementation in Linux.  It implements various OF versions including 1.3.  OVS has long supported virtual networks using VLAN and GRE.  In the recent past, OVS was enhanced to support overlay based virtual networks.  In this post, I give some commands that can be used to realize virtual networks using VxLAN.

For more information about VxLAN,  please see this tutorial.

Recently, there was a very good development in OVS on overlays.  It is no longer required to have as many 'vports' as there are compute servers to realize a virtual network across multiple compute servers.  OVS now implements flow based selection of the overlay protocol values.  Due to this, one VxLAN port is enough in an OVS OF switch, irrespective of the number of remote compute nodes and irrespective of the number of virtual networks.

OVS introduced new extensions to the Openflow protocol (an action and a set of OXM fields that can be set using the set_field action) with which the OF controller specifies tunnel/overlay specific information in the flow.

The VxLAN protocol layer adds the overlay header and needs the following information: the source and destination IP addresses of the outer IP header, the source and destination ports of the UDP header, and the VNI for the VxLAN header.  OVS provides facilities for the Openflow controller to set the source IP, destination IP and VNI using the set_field action.  OVS introduced the following NXM fields:

NXM_NX_TUN_ID :  To specify VNI (VxLAN Network Identifier).
NXM_NX_TUN_IPV4_SRC :  To specify source IP of the outer IP header.
NXM_NX_TUN_IPV4_DST :  To specify the destination IP of the outer IP header.

The VxLAN protocol layer knows the UDP destination port from the 'vport'.  The ovs-vsctl command can be used to create VxLAN ports, including multiple VxLAN ports on the same VNI with a different destination port on each of them.  The VxLAN protocol layer obtains the rest of the information required to frame the outer IP and UDP headers by itself and with the help of the Linux TCP/IP stack.

Similarly, the VxLAN protocol layer informs the OVS OF switches by filling in the above fields after decapsulating packets.  Due to this, the Openflow controller can use the above fields as match fields.

Essentially, OVS provides mechanisms to set the tunnel field values for outgoing packets in Openflow flows, and also mechanisms to use these tunnel fields as match fields in OF tables for incoming packets.

The following commands can be used to create VxLAN ports using 'ovs-vsctl' without explicitly mentioning the tunnel destination and tunnel ID, letting the Openflow controller specify these field values in OF flows.

Creation of VxLAN port with default UDP service port:

  ovs-vsctl add-port br-tun vxlan0 -- set Interface vxlan0   type=vxlan  options:remote_ip=flow options:key=flow

The above command creates VxLAN port 'vxlan0' on OF switch 'br-tun' and tells this port to get the tunnel ID (VNI) and the tunnel remote IP from the OF flow.  "key=flow" means the tunnel ID is taken from the flow, and "remote_ip=flow" means the tunnel destination IP address is taken from the flow.

A small variation of the above command creates the VxLAN port with a different UDP destination port, 5000:

ovs-vsctl add-port br-tun vxlan1 -- set Interface vxlan1 type=vxlan options:remote_ip=flow options:key=flow options:dst_port=5000

OVS provides a mechanism to create Openflow flows without needing an external Openflow controller.  'ovs-ofctl' is the mechanism provided by OVS to do this.

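Either of the following commands can be used to create a flow that sets the tunnel fields (REMOTE_IP stands for the IP address of the remote tunnel endpoint):
ovs-ofctl add-flow br-tun "in_port=LOCAL actions=set_tunnel:1,set_field:REMOTE_IP->tun_dst,output:1"  (OR)
ovs-ofctl add-flow br-tun "in_port=LOCAL actions=set_field:REMOTE_IP->tun_dst,set_field:1->tun_id,output:1"
"set_tunnel" is used to specify the VNI.  "set_field" is used to specify the tunnel destination and, in the second form, the tunnel ID as well.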

Other commands of interest are:

To see the openflow port numbers:
        ovs-ofctl show br-tun
To dump flows:
        ovs-ofctl dump-flows br-tun
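
When flows like the add-flow examples above are generated from a script, it helps to validate the tunnel values before building the flow string.  The helper below is a sketch: tunnel_flow is a hypothetical function name and 192.0.2.10 is just a documentation address, while the set_field, tun_dst and tun_id syntax is the one used in the commands above.

```python
import ipaddress


def tunnel_flow(in_port, vni, remote_ip, out_port):
    """Build the flow argument for 'ovs-ofctl add-flow' with tunnel fields."""
    ipaddress.IPv4Address(remote_ip)   # raises ValueError on a bad address
    if not 0 <= vni < 2 ** 24:         # a VNI is a 24-bit value
        raise ValueError("VNI out of range")
    return (f"in_port={in_port} "
            f"actions=set_field:{remote_ip}->tun_dst,"
            f"set_field:{vni}->tun_id,output:{out_port}")


flow = tunnel_flow("LOCAL", 1, "192.0.2.10", 1)
# Argument list suitable for subprocess.run(); not executed here.
command = ["ovs-ofctl", "add-flow", "br-tun", flow]
```

Passing the command as a list avoids shell quoting problems with the "->" in the flow string.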

Monday, December 31, 2012

L4 Switching & IPSec Requirements & Openflow extensions


L4 Switching - Introduction

L4 switching typically involves connection tracking,  NAT and common attack checks.  Stateful inspection firewalls,  NAT, URL filtering  and SLB (Server Load Balancing) are some of the middle-box functions that take advantage of L4 switching.  These middle-box functions inspect first few packets of every connection (TCP/UDP/ICMP etc..) and offload the connection processing to some fast path entity to do L4 switching.  Normally both fastpath and normal path functionality reside in the same box/device.

Openflow, promoted by big industry names, is one of the protocols that separate the control plane from the data plane.  The control planes of multiple switches are implemented in one logical central controller, and only the data plane is left at the devices, which are programmable to work in a specified way.  This makes the devices simple.

The middle-box functions described above can also be separated into a normal path (service plane) and a fast path (L4 switching).  The normal path works on the first packet, or at most the first few packets, and the fast path handles the rest of the packets in the connection.  By implementing the normal path (service plane) in centralized logical controllers and leaving the fastpath (L4 switching) at the physical/logical device level, benefits similar to those of CP/DP separation can be achieved.  Benefits include
  • Programmable devices whose personality can be changed by controller applications.  An Openflow switch can now be made to act as an L2 switch, L3 switch and/or L4 switch.
  • By moving major software logic (in this case, the service plane) from the device to a central location, cost of ownership goes down for end customers.
  • By centralizing the service plane, operational efficiency can be improved significantly
    • Ease-of software upgrades
    • Configure/Manage from a central location.
    • Granular flow control.
    • Visibility across all devices.
    • Comprehensive traffic visualization

L4 Switching - Use cases:

  • Branch office connectivity with corporate headquarters:  Instead of having firewall, URL filtering and policy control on every branch office device, these can be centralized at the corporate office.  The first few packets of every connection go to the main office.  The main office controller decides on the connection.  If the connection is allowed, it lets the openflow switch in the branch office forward the rest of the traffic on that connection.  This method requires only simple openflow switches in the branch offices, and all intelligence can be centralized at one location, the main office.  Thus, it reduces the need for a skilled administrator at every branch office.
  • Server Load Balancing in Data Centers:  Today data centers have expensive server load balancers to distribute incoming connections to multiple servers.  SLB devices are big boxes today because two functions are combined into one box/device: normal path processing, which selects the best server for every new connection, and fastpath packet processing.  By offloading packet processing to inexpensive openflow switches, SLB devices only need to worry about normal path processing, which can be done in less expensive boxes and even on commodity PC hardware.
  • Managed service providers providing URL filtering capability to home & SME offices:  Many homes today use PC based software for URL filtering to give kids a safe internet experience.  SME administrators require a URL filtering service to increase the productivity of their employees by preventing them from accessing recreational sites, and also to prevent malware contamination.  Again, instead of deploying the URL filtering service at customer premises, service providers like to host this service centrally and manage it centrally across many of their customers for operational efficiency.  An Openflow controller implementing the URL filtering service can get hold of packets until the URL is fetched, find the category of the URL, apply policy and decide whether to continue the connection.  If the connection is to be continued, it can program the openflow switch at the customer premises to forward the rest of the packets.
  • Though URL filtering is the use case given here, this concept of centralizing the intelligence in one place and programming the openflow switches for the fast path is equally applicable to "Session Border Controller", "Application Detection using DPI" and other services.  Any service that needs to inspect the first few packets of a connection is a candidate for Service Plane/L4 Switching separation using Openflow.

L4 Switching - Processing

The L4 switching (fastpath) entity does connection level processing.  Any connection consists of two flows: the Client-to-Server (C-S) flow and the Server-to-Client (S-C) flow.

Connections are typically 5-tuple based.  

Functionality of L4 switching:
  • Connection Management :  
    • Acts on the connection Add/Remove/Query/Modify requests from the service plane.
    • Inactivity timeout :  Activity is observed on the connection.  If there is no activity for some time (programmed during connection entry creation), the connection entry is removed and a notification is sent to the normal path.
    • TCP Termination : The L4 switching entity can remove the connection entry once it observes that both endpoints of the TCP connection have exchanged FINs and these have been ACKed.  It can also remove the entry if a TCP packet with RST is seen.
    • Periodic notifications of the state collected to Service plane.
  •  Packet processing logic involves
    • Pre-flow lookup actions typically performed:
      • Packet integrity checks :  Checksum verification of both IP and transport headers,  Ensuring that the length field values are consistent with packet sizes etc..
      • IP Reassembly:  As the flows are identified by 5-tuples, it is required that packets are reassembled to determine the flow across fragments. 
      • IP Reassembly related attack checks and remediation
        • Ping-of-Death attack checks (IP fragment overrun) - Checking that the reassembled IP packet never exceeds 64K.
        • Normalization of the IP fragments to remove overlaps, to protect against teardrop related vulnerabilities.
        • Too many overlapping fragments - Remove the IP reassembly context when too many overlaps are observed.
        • Too many IP fragments & very small IP fragment size:  Drop the reassembly context and associated IP fragments.
        • Incomplete IP fragments resulting in Denial of Service attacks - Limit the number of reassembly contexts per IP address pair, per IP address etc..
    • TCP Syn flood protection using TCP Syn Cookie mechanism
    • Flow & Connection match:  Find the flow entry and corresponding connection entry.
    • Post lookup actions
      • Network Address Translation :  Translation of SIP, DIP, SP, DP.
      • Sequence number translation in TCP connections :  This is normally needed when the service plane updates the TCP payload that decreases or increases the size or when the TCP syn cookie processing is applied on the TCP connection.
      • Delta Sequence number updates in TCP connections :  This is normally required to ensure that right sequence number translation is applied, especially for retransmitted TCP packets.
      • Sequence number attack checks :  Ensuring that the sequence numbers of packets going in one direction of the connection are within a particular range, so that traffic injected by an attacker is not honored.  This check is mainly required to stop TCP RST packets that are generated and sent by attackers.  As an RST packet terminates the connection, and thereby creates a DoS attack, this check is required.
      • Forwarding action :  To send the packet out by referring to routing tables and ARP tables.
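
As a rough illustration of the connection management part of the list above, the sketch below keeps one connection entry reachable through both of its flows (C-S and S-C) and removes it after an inactivity timeout.  All names are hypothetical; NAT, TCP state tracking and the attack checks are left out, and time is passed in explicitly to keep the example deterministic.

```python
from dataclasses import dataclass
from typing import Dict, List, NamedTuple, Optional


class FiveTuple(NamedTuple):
    src_ip: str
    dst_ip: str
    proto: int
    src_port: int
    dst_port: int

    def reverse(self) -> "FiveTuple":
        # The S-C flow is the C-S flow with the endpoints swapped.
        return FiveTuple(self.dst_ip, self.src_ip, self.proto,
                         self.dst_port, self.src_port)


@dataclass
class Connection:
    c2s: FiveTuple
    timeout: float     # inactivity timeout programmed by the service plane
    last_seen: float


class ConnectionTable:
    def __init__(self) -> None:
        self._flows: Dict[FiveTuple, Connection] = {}

    def add(self, c2s: FiveTuple, timeout: float, now: float) -> Connection:
        conn = Connection(c2s, timeout, now)
        self._flows[c2s] = conn            # C-S flow
        self._flows[c2s.reverse()] = conn  # S-C flow, same connection entry
        return conn

    def lookup(self, pkt: FiveTuple, now: float) -> Optional[Connection]:
        conn = self._flows.get(pkt)
        if conn is not None:
            conn.last_seen = now           # activity in either direction counts
        return conn

    def expire(self, now: float) -> List[Connection]:
        # Collect each idle connection once, then drop both of its flows.
        idle = {id(c): c for c in self._flows.values()
                if now - c.last_seen >= c.timeout}
        for conn in idle.values():
            self._flows.pop(conn.c2s, None)
            self._flows.pop(conn.c2s.reverse(), None)
        return list(idle.values())         # to be notified to the normal path


table = ConnectionTable()
c2s = FiveTuple("10.0.0.1", "10.0.0.2", 6, 30000, 80)
table.add(c2s, timeout=30.0, now=0.0)
hit = table.lookup(c2s.reverse(), now=10.0)   # a server-to-client packet
```

The list returned by expire() is what the fastpath would report to the normal path as inactivity notifications.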

IPSec - Introduction

IPSec is used to secure IP packets.  IP packets are encrypted to avoid leakage of data on the wire, and authenticated to ensure that the packets were indeed sent by the trusted party.  It runs in two modes: tunnel mode and transport mode.  In tunnel mode, IP packets are encapsulated in another IP packet.  No additional IP header is involved in transport mode.  IPSec applies security on a per-packet basis, hence it is a datagram oriented protocol.  Multiple encryption and authentication algorithms can be used to secure the packets.  Encryption keys and authentication keys are established on a per-tunnel basis by the Internet Key Exchange protocol (IKE).  Essentially, the IPSec protocol suite contains the IKE protocol and IPSec-PP (packet processing).

Traditionally, implementations use a proprietary mechanism between IKE and IPSec-PP, as both sit in the same box.  The new way of networking (SDN) centralizes the control plane and distributes the data paths.  IPSec is a good candidate for this.  When the control plane (IKE) is separated from the data path (IPSec-PP), the communication between IKE and IPSec-PP needs some standardization.  Openflow is thought to be one of the protocols for separating the control plane and data plane, but Openflow as defined in the OF 1.3.x specification did not keep IPSec CP-DP separation in mind.  This blog post tries to describe the extensions required in the OF specification to enable IPSec CP-DP separation.


ONF (Open Networking Foundation) is standardizing the Openflow protocol.  Controllers using the OF protocol program the OF switches with flows and with instructions/actions to be performed on packets.  The Openflow 1.3+ specification supports multiple tables, which simplifies controller programming as each table in the switch can be used for a specific purpose.

Contrary to what everybody says, the Openflow protocol is not really friendly to L3 CP/DP separation, let alone to SP/FP separation.  The way I look at OF today is that it is good for L2 CP/DP separation and for traffic steering kinds of use cases, but it is not sufficient for L3 and L4 switching.  This post tries to describe some of the features that are required in the OF specification to support L4 switching in particular and L3 switching to some extent.

Extensions Required to realize L4 switching & IPSec CP-DP separation

Generic extensions:

Table selection on packet-out message:

Problem Description
The current OF 1.3 specification does not let controllers start the processing of a packet-out message from a specific table in the switch.  Currently, a controller can send the packet-out message with a set of actions, which are executed in the switch.  Typically the controller specifies an 'OUTPUT' action in the action list, which sends the packet out on the given port.  One facility is provided to run the packet through the OF pipeline, though: if the port in the 'OUTPUT' action is the reserved port OFPP_TABLE, then the packet-out message execution starts from table 0.

Normally controllers take advantage of multiple tables in the OF switch by dedicating tables to purposes.  For example, when L4 switching is used with L3 forwarding, multiple tables are used: a table to classify packets, a table for traffic policing entries, a table for PBR, a few tables for routing, a table for L4 connections, a table for next hop entries and a table for traffic shaping rules.  Some tables are populated proactively by the OF controller, hence miss packets are not expected from them.  Entries in other tables, such as 'L4 connections' and 'Next hop', are created reactively by the controllers.  Processing of a miss packet (packet-in) can result in a flow being created in the table that caused the miss.  The controller then likes to send the packet back to the OF switch and have the switch process it just as it would when there is a matching entry.  If that were possible, controller programming would stay simple.  As indicated before, due to limitations of the specification, the controller can't ask the switch to start from a specific table.  Lacking this feature, controllers are forced to figure out the actions that would need to be performed and then program the action list in the packet-out message.  In our view, that is quite a complex task for controller programmers.  As I understand, many controller applications simply use the packet-in message to create the flow, but then drop the packet.  This is not good at all.  Think of a SYN packet getting lost: it takes some time for the client to retransmit the TCP SYN packet, and that delay results in a very bad user experience.

Also, in a few cases, applications (CP protocols such as OSPF and BGP) generate packets that need to be sent out on the data network.  Controllers typically sit in the management network, not on the data network.  Hence these packets have to be sent to the data network via OF switches.  This can only be achieved by sending these packets as packet-out messages to the switch.  At times, these packets need to be exposed to only part of the table pipeline.  In these cases, controllers need the ability to start the packet processing in the switch from a specific table given by the controller.


Freescale Extension :

struct ofp_action_fsl_experimenter_goto_table {
      uint16_t type;            /* OFPAT_EXPERIMENTER. */
      uint16_t len;             /* Length is a multiple of 8. */
      uint32_t experimenter;    /* FSL Experimenter ID. */
      uint16_t fsl_action_type; /* OFPAT_FSL_GOTO_TABLE. */
      uint16_t table_id;        /* Table from which processing starts. */
      uint8_t pad[4];           /* Assumed padding: the fields above take 12
                                   bytes, so 4 more give a multiple of 8. */
};

OFP_ASSERT(sizeof(struct ofp_action_fsl_experimenter_goto_table) == 16);

table_id :  When switch encounters this action,  switch starts processing the packet from the table specified by 'table_id'.

Though this action is normally present in the action_list of packet-out messages, it can be used in flow_mod actions too.  When both a GOTO_TABLE instruction and a GOTO_TABLE action are present, the GOTO_TABLE action takes precedence and the GOTO_TABLE instruction is ignored.

Notice that 'table_id' size is 2 bytes.  This was intentional.  We believe that OF specifications in future would need to support more than 256 tables.
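
For anyone building this action from a controller written in a scripting language, the layout can be packed as shown below.  Treat it as a sketch under assumptions: OFPAT_EXPERIMENTER (0xffff) comes from the OF 1.3 specification, but the experimenter ID and the OFPAT_FSL_GOTO_TABLE value used here are placeholders, and the 4 trailing pad bytes are my assumption to make the action length a multiple of 8 (the listed fields alone take 12 bytes).

```python
import struct

OFPAT_EXPERIMENTER = 0xFFFF        # from the Openflow 1.3 specification
FSL_EXPERIMENTER_ID = 0x00000000   # placeholder, real ID not given in this post
OFPAT_FSL_GOTO_TABLE = 1           # placeholder value

# type, len, experimenter, fsl_action_type, table_id, then 4 assumed pad
# bytes, all in network byte order.
_ACTION = struct.Struct("!HHIHH4x")


def pack_goto_table_action(table_id: int) -> bytes:
    """Pack the experimenter GOTO_TABLE action for the given table."""
    return _ACTION.pack(OFPAT_EXPERIMENTER, _ACTION.size,
                        FSL_EXPERIMENTER_ID, OFPAT_FSL_GOTO_TABLE, table_id)


# A table ID above 255 only fits because table_id is 2 bytes wide.
action = pack_goto_table_action(300)
```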

Saturday, December 29, 2012

L2 Network Virtualization & Is there a role for Openflow controllers?




Current Method of Network Virtualization

IaaS providers (Cloud Service Providers) provide network isolation among their tenants.  Even Enterprise private cloud operators are increasingly expected to provide network isolation among tenants, with tenants being departments, divisions, test networks, lab networks etc..  This allows tenants to have their own IP addressing space, possibly overlapping with other tenants' address space.

Currently, network operators use VLANs to create tenant-specific networks.  Some of the issues with VLANs are:
  • VLAN IDs are limited to 4K.  If tenants require 4 networks each on average, only 1K tenants can be served on a physical network.  Network operators are forced to create additional physical networks when more tenants sign up.
  • Performance bottlenecks associated with VLANs :  Even though many physical switches support 4K VLANs, many of them don't provide line-rate performance when the number of VLAN IDs goes beyond a certain limit (some switches don't work well beyond 256 VLANs).
  • VLAN based networks have operational headaches :  VLAN based network isolation requires all L2 switches to be configured whenever a VLAN is created or deleted.  Though many L2 switch vendors provide a central console for their own brand of L2 switches, operations become difficult when switches from multiple vendors are present.
  • Loop convergence time is very high.
  • Extending VLANs across data center sites or to customer premises has operational issues with respect to interoperable protocols.  Out-of-band understanding among network operators is required to avoid VLAN ID collisions.
To avoid the issues associated with the capabilities of L2 switches, with networks having switches from multiple vendors, and with the limitations of VLANs, overlays are increasingly used to virtualize physical networks and create multiple logical networks.

Overlay based Network Virtualization

Any L2 network must preserve the L2 packet from source to destination.  Any broadcast packet should go to all network nodes attached to the L2 network.  Multicast packets should go to the network nodes that are willing to receive multicast packets of the groups of their choice.

Overlay based network virtualization provides the above functionality by carrying the Ethernet packets inside an outer IP packet, essentially tunneling Ethernet packets from one place to another.

VxLAN and NVGRE are two of the most popular overlay protocols that are being standardized.  Please see my blog post on VxLAN here.

VxLAN provides 24 bits of VNI (Virtual Network Identifier).  In theory, around 16M virtual networks can be created.  Assuming that each tenant has 4 networks on average, a CSP can in theory support 4M tenants using one physical network.  That is, there is no bottleneck with respect to identifier space.
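The identifier-space arithmetic for VLAN (12-bit ID) and VxLAN (24-bit VNI) can be checked with a small sketch.  The function names are made up for illustration, and the calculation ignores the couple of reserved VLAN ID values (0 and 4095).

```c
#include <assert.h>
#include <stdint.h>

#define VLAN_ID_BITS   12  /* width of the 802.1Q VLAN ID */
#define VXLAN_VNI_BITS 24  /* width of the VxLAN VNI */

/* Number of distinct identifiers available with the given field width. */
static uint32_t id_space(unsigned bits)
{
    assert(bits < 32);
    return (uint32_t)1 << bits;
}

/* Tenants that fit if each tenant needs nets_per_tenant networks on average. */
static uint32_t max_tenants(unsigned bits, uint32_t nets_per_tenant)
{
    assert(nets_per_tenant > 0);
    return id_space(bits) / nets_per_tenant;
}
```

With 4 networks per tenant, max_tenants(VLAN_ID_BITS, 4) gives 1024 and max_tenants(VXLAN_VNI_BITS, 4) gives 4194304, which matches the 1K and 4M figures.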


Openstack is one of the popular open source cloud orchestration tools.  It is becoming a formidable alternative to VMware vCenter and VCD.  Many operators are using Openstack and the KVM hypervisor as a secondary source of cloud virtualization in their networks.  The reliability of Openstack has come a long way and many vendors provide support for a fee.  Due to these changes, adoption of Openstack+KVM as a primary source of virtualization is going up.  Openstack mainly has four components -  'Nova' for VM management across multiple physical servers, 'Cinder' for storage management, 'Quantum' for network topology management and 'Horizon' to provide the front end user experience to operators (administrators and tenants).

Quantum consists of a set of plugins -  a core plugin and multiple extension plugins.  Quantum defines the API for plugins and lets various vendors create backends for the plugins.  The Quantum core plugin API defines the management API of virtual networks.  Virtual networks can be created using VLAN or GRE, and support for VxLAN is being added.
Quantum allows operators to create virtual networks.  As part of VM provisioning, Openstack Nova lets operators choose the virtual networks on which the VM needs to be placed.
When the 'Nova scheduler' chooses a physical server to place the VM, it asks Quantum, using the 'create_port' API, to provide the MAC address, IP address and other information to be assigned to the VM.  Nova calls Quantum once for each virtual network that the VM belongs to, and Quantum returns the required information to Nova.  As part of this call, Quantum comes to know about the physical server and the virtual networks that need to be extended to it.  It then informs the Quantum agent (which sits in the host Linux of each physical server) of the virtual networks it needs to create.  The agent on the physical server gets more information on the virtual networks from Quantum and then creates the needed resources.  The agent uses the OVS (Open vSwitch) package that is present in each physical server to do the job.  Please see some description of OVS below.

The Quantum agent in each physical server creates two openflow bridges - an integration bridge (br-int) and a tunnel bridge (br-tun).  The agent also connects the south side of br-int to the north side of br-tun using a loopback port pair.  The agent creates the virtual network port and associates it with br-tun for every new virtual network, and removes it when the virtual network is deleted.  The north side of br-int towards the VMs is handled by libvirtd and associated drivers as part of VM management.  See below.

Nova talks to the 'nova-compute' package in the physical server to bring VMs up and down.  'Nova-compute' in the physical server uses the 'libvirtd' package to bring up VMs, create ports and associate them with openflow switches using the OVS package.  Briefly, some of the work libvirtd does with the help of the OVS driver:
  • Creates a Linux bridge for each port that is associated with the VM.
  • Associates the north side of this bridge with the VM Ethernet port (using tun/tap technology).
  • Configures ebtables to provide isolation among the VMs.
  • Associates the south side of this bridge with the Openflow integration bridge (br-int).  This is achieved by creating a loopback port pair with one port attached to the Linux bridge and the other port attached to the Openflow switch, br-int.

Openvswitch (OVS)

OVS is an openflow based switch implementation.  It is now part of Linux distributions.  Traditionally, Linux bridges were used to provide virtual network functionality in KVM based host Linux.  In Linux 3.x kernels, OVS has taken over that responsibility and the Linux bridge is used only for the purpose of enabling 'ebtables'.
OVS provides a set of utilities, including ovs-vsctl and ovs-ofctl.  The "ovs-vsctl" utility is used by the OVS quantum agent in physical servers to create openflow datapath entities (br-int, br-tun), initialize the Openflow tables and add both north and south bound ports to br-int and br-tun.  "ovs-ofctl" is a command line tool to create openflow flow entries in the openflow tables of br-int and br-tun.  It is used by the OVS quantum agent to create default flow entries that enable typical L2 switching (802.1D) functionality.  Since OVS is openflow based, external openflow controllers can manipulate traffic forwarding by creating flows in br-int and br-tun.  Note that external controllers are required only to add 'redirect' functionality; virtual switching functionality can be achieved even without an external openflow controller.

Just to outline the various components in the physical server:
  • OVS package :  Creates Openflow switches and Openflow ports, associates the ports with the switches and, of course, lets external controllers control the traffic between VMs and external physical networks.
  • Quantum OVS Agent :  Communicates with the Quantum plugin in Openstack to learn about the virtual networks and configures OVS to realize those networks in the physical server.
  • OVS Driver in Libvirtd :  Connects VMs to virtual networks and configures 'ebtables' to provide isolation among VMs.

Current VLAN based Network Virtualization solution

Openstack and OVS together can create VLAN based networks.  L2 switching happens with no external openflow controller.  The OVS Quantum agent, with the help of the plugin, knows the VMs, their vports and the corresponding network ports.  Using this information, the agent associates a VLAN ID with each vport connected to the VMs.  OVS uses this information to know which VLAN to use when packets come from the VMs.  The agent also creates one rule to do the LEARNING for packets coming in from the network.

Overlay based Virtual Networks

Companies like Nicira and Big Switch Networks are promoting overlay based virtual networks.  OVS in each compute node (the edges of the physical network) is used as the starting point of the overlays.  All L2 and L3 switches connecting the compute nodes are used only for transporting the tunneled packets.  They don't need to participate in the virtual networks.  Since OVS in the compute nodes encapsulates inner Ethernet packets into, and decapsulates them from, an outer IP packet, the in-between switches forward the packets using the outer IP and outer MAC headers.  Essentially, overlay tunnels start and end at compute nodes.  With this, network operators can configure the switches in L3 mode instead of the problematic L2 mode.  Maybe, in the future, one might not see any L2 switches in data center networks.

Typical packet flow would be something like this:

- A VM sends a packet and it lands on the OVS in the host Linux.
- OVS applies actions based on the matching flows in br-int and the packet is sent to br-tun.
- OVS applies actions based on the matching flows in br-tun and the packet is sent out on the overlay port.
- OVS sends the packet to the overlay protocol layer.
- The overlay protocol layer encapsulates the packet and sends it out.

In the reverse direction, the packet flow would look like this:

- The overlay protocol layer gets hold of the incoming packet.
- It decapsulates the packet and presents the inner packet, with the right port, to the OVS br-tun.
- After applying any actions from the matching OF flows in br-tun, the packet is sent to br-int.
- OVS applies the actions of the matching flows in br-int and figures out the destination port (one-to-one mapping with the VM port).
- OVS sends the inner packet to the VM.

Note that:

- Inner packet is only seen by OVS and VM.
- Physical switches only see encapsulated packet.
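To make the encapsulation and decapsulation steps in the two packet flows more concrete, here is a minimal sketch of what the overlay protocol layer might do for VxLAN.  It handles only the 8-byte VxLAN header (one flags byte with the "VNI valid" bit, a 24-bit VNI and reserved bytes, as laid out in the VxLAN specification).  The outer Ethernet, IP and UDP headers are left out, and the function names are my own, not those of any particular implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define VXLAN_HDR_LEN        8
#define VXLAN_FLAG_VNI_VALID 0x08  /* "I" flag: the VNI field is valid */

/* Prepend a VxLAN header carrying 'vni' to an inner Ethernet frame.
 * Returns the total length written to 'out', or 0 on error. */
static size_t vxlan_encap(uint8_t *out, size_t outlen, uint32_t vni,
                          const uint8_t *inner, size_t innerlen)
{
    assert(out != NULL && inner != NULL);
    if (vni > 0xffffff || outlen < VXLAN_HDR_LEN + innerlen)
        return 0;
    memset(out, 0, VXLAN_HDR_LEN);        /* reserved fields are zero */
    out[0] = VXLAN_FLAG_VNI_VALID;
    out[4] = (uint8_t)(vni >> 16);        /* VNI occupies bytes 4..6 */
    out[5] = (uint8_t)(vni >> 8);
    out[6] = (uint8_t)vni;
    memcpy(out + VXLAN_HDR_LEN, inner, innerlen);
    return VXLAN_HDR_LEN + innerlen;
}

/* Parse the VxLAN header of a received packet.
 * On success, stores the VNI and returns a pointer to the inner frame.
 * Returns NULL if the packet is too short or the VNI flag is not set. */
static const uint8_t *vxlan_decap(const uint8_t *pkt, size_t len, uint32_t *vni)
{
    assert(pkt != NULL && vni != NULL);
    if (len < VXLAN_HDR_LEN || !(pkt[0] & VXLAN_FLAG_VNI_VALID))
        return NULL;
    *vni = ((uint32_t)pkt[4] << 16) | ((uint32_t)pkt[5] << 8) | pkt[6];
    return pkt + VXLAN_HDR_LEN;
}
```

On the receive side, the VNI returned by vxlan_decap is what would let the overlay layer pick the right port when handing the inner packet to br-tun.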

VxLAN based Overlay networks using Openstack

OVS and VxLAN:

There are many open source implementations of VxLAN in OVS, along with their integration with Openstack.  Here are some details about one VxLAN implementation in OVS:

  • It creates as many vports in OVS as there are VxLAN networks in the compute node.  Note that, even though there could be a large number of VxLAN based overlay networks, only the networks to which local VMs belong are created in OVS as vports.  For example, if the local VMs belong to two overlay networks, then two vports are created.
  • The VxLAN implementation depends on VTEP entries to find the remote tunnel endpoint address for the destination MAC address of a packet received from the VMs.  That endpoint's IP address is used as the DIP (destination IP) of the outer IP header.
  • If there is no matching VTEP entry, multicast learning happens as per VxLAN.
  • VTEP entries can be created manually too.  A separate command line utility is provided to add VTEP entries to vports.
  • Since Openstack knows the VMs and the physical servers hosting them, Openstack, with the help of the quantum agent in each compute node, can create VTEP entries proactively.
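The VTEP lookup described in the list above can be sketched as a small per-vport table that maps an inner destination MAC to the remote tunnel endpoint, falling back to the multicast group when nothing matches.  This is only an illustration: the structure names, the fixed table size and the linear search are my assumptions, not how any particular OVS VxLAN implementation does it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define VTEP_MAX 64  /* arbitrary table size for this sketch */

struct vtep_entry {
    uint8_t  mac[6];     /* inner destination MAC address */
    uint32_t remote_ip;  /* IPv4 address of the remote tunnel endpoint */
};

struct vtep_table {
    struct vtep_entry e[VTEP_MAX];
    size_t   count;
    uint32_t mcast_ip;   /* multicast group used when the MAC is unknown */
};

/* Add or refresh an entry, whether learned, configured manually or
 * pushed proactively.  Returns 0 on success, -1 if the table is full. */
static int vtep_add(struct vtep_table *t, const uint8_t mac[6], uint32_t ip)
{
    assert(t != NULL);
    for (size_t i = 0; i < t->count; i++) {
        if (memcmp(t->e[i].mac, mac, 6) == 0) {
            t->e[i].remote_ip = ip;
            return 0;
        }
    }
    if (t->count == VTEP_MAX)
        return -1;
    memcpy(t->e[t->count].mac, mac, 6);
    t->e[t->count].remote_ip = ip;
    t->count++;
    return 0;
}

/* Outer destination IP for a frame: the matching endpoint if one exists,
 * otherwise the multicast group of the virtual network. */
static uint32_t vtep_outer_dip(const struct vtep_table *t, const uint8_t dmac[6])
{
    assert(t != NULL);
    for (size_t i = 0; i < t->count; i++)
        if (memcmp(t->e[i].mac, dmac, 6) == 0)
            return t->e[i].remote_ip;
    return t->mcast_ip;
}
```

If Openstack populates the table through vtep_add before a VM sends its first packet, vtep_outer_dip never has to fall back to the multicast group for known VMs.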

Commercial Products

Openstack and OVS provide fantastic facilities to manage virtual networks using VLAN and overlay protocols.  Some commercial products seem to be doing the following:
  • They provide their own Quantum plugin in Openstack.
  • This plugin communicates with their central controller (OFCP/OVSDB and Openflow controllers).
  • The central controller communicates with OVS in the physical servers to manage virtual networks and flows.
Essentially, these commercial products add one more controller layer between Quantum in Openstack and the physical servers.

My views:

In my view, this is not necessary.  Openstack, OVS, the OVS plugin, the OVS agent and the OVS libvirtd driver are becoming mature and there is no need for one more layer of abstraction.  It is a matter of time before these open source components become feature rich, reliable and supported by vendors such as Red Hat.  With OVS being part of Linux distributions and with Ubuntu providing all of the above components, operators are better off sticking with these components instead of going for proprietary software.

Since OVS is openflow based, an Openflow controller could add value with respect to traffic steering and traffic flow redirection.  It should provide value, but one should make sure that the default configuration is good enough to realize virtual networks without the need for an openflow controller.

In summary, I believe that Openflow controllers are not required to manage virtual networks in physical servers, but are required to add value-added services such as traffic steering, traffic visualization etc.

Sunday, January 22, 2012

IP Fragmentation versus TCP segmentation

Ethernet controllers are becoming more intelligent with every generation of NICs.  Intel and Broadcom have added many features to their Ethernet NIC chips in the recent past.  Multicore SoC vendors are adding a large number of features to their Ethernet IO hardware blocks.

TCP GRO (Generic Receive Offload, which used to be called Large Receive Offload) and GSO (Generic Segmentation Offload, which used to be called TCP Segmentation Offload) are two new features (in addition to FCoE offloads) one can see in Intel NICs and many Multicore SoCs.  These two features are good for any TCP termination application on the host processors/cores.  They reduce the number of packets traversing the host TCP/IP stack.

TCP GRO works across multiple TCP flows.  For each flow, it aggregates multiple consecutive TCP segments (based on TCP sequence numbers) into one or a few TCP packets in the hardware itself, thereby sending far fewer packets to the host processor.  Due to this, the TCP/IP stack sees fewer inbound packets.  Since the per-packet overhead is significant in TCP/IP stacks, fewer packets use fewer CPU cycles, leaving more CPU cycles for applications and essentially increasing the performance of the overall system.
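The core of the aggregation, merging a segment only when its sequence number is exactly the one expected next for the flow, can be sketched as below.  This is a deliberately simplified model with names of my own choosing.  Real GRO implementations also compare TCP flags, options, timestamps and so on before merging, and they operate on packet buffers rather than copying payloads.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GRO_MAX_BYTES 65535  /* upper bound on one aggregated packet in this sketch */

struct gro_flow {
    uint32_t next_seq;             /* sequence number expected next */
    size_t   len;                  /* payload bytes aggregated so far */
    uint8_t  data[GRO_MAX_BYTES];  /* aggregated payload */
};

/* Start aggregating a flow at sequence number 'seq'. */
static void gro_init(struct gro_flow *f, uint32_t seq)
{
    assert(f != NULL);
    f->next_seq = seq;
    f->len = 0;
}

/* Try to merge one received segment into the aggregate.
 * Returns 1 if merged, or 0 if the segment is not the next one in
 * sequence or does not fit, in which case the caller should hand the
 * aggregate to the stack and start over. */
static int gro_merge(struct gro_flow *f, uint32_t seq,
                     const uint8_t *payload, size_t n)
{
    assert(f != NULL && payload != NULL);
    if (seq != f->next_seq || f->len + n > GRO_MAX_BYTES)
        return 0;
    memcpy(f->data + f->len, payload, n);
    f->len += n;
    f->next_seq += (uint32_t)n;    /* wraps modulo 2^32, like TCP sequence numbers */
    return 1;
}
```

Two back-to-back 1460-byte segments of the same flow would be delivered to the host stack as one 2920-byte packet, so the stack pays the per-packet cost once instead of twice.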

The intention of TCP GSO is similar to that of TCP GRO, but for outbound packets.  The TCP layer typically segments data based on the MSS value.  The MSS value is typically determined from the PMTU (Path MTU) value.  Since the TCP and IP headers take 40 bytes, the MSS is typically (PMTU - 40) bytes.  If the PMTU is 1500 bytes, then the resulting MSS value is 1460.  When the application tries to send a large amount of data, the data is segmented into multiple TCP packets where each TCP payload carries up to 1460 bytes.  The TCP GSO feature in the hardware eliminates the need for the TCP layer to do the segmentation and thereby reduces the number of packets that travel between the TCP layer and the hardware NIC.  The TCP GSO feature in the hardware typically expects the MSS value along with the packet and does everything necessary internally to segment the data and send the segments out.
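The MSS arithmetic above is easy to put into code.  The sketch below assumes IPv4 and TCP headers without options (20 bytes each); the function names are only illustrative.

```c
#include <assert.h>
#include <stdint.h>

#define IPV4_HDR_LEN 20  /* IPv4 header without options */
#define TCP_HDR_LEN  20  /* TCP header without options */

/* MSS derived from the path MTU: whatever is left after both headers. */
static uint32_t mss_from_pmtu(uint32_t pmtu)
{
    assert(pmtu > IPV4_HDR_LEN + TCP_HDR_LEN);
    return pmtu - IPV4_HDR_LEN - TCP_HDR_LEN;
}

/* Number of TCP segments needed to carry data_len bytes (rounded up). */
static uint32_t segments_needed(uint32_t data_len, uint32_t mss)
{
    assert(mss > 0);
    return (data_len + mss - 1) / mss;
}
```

For a PMTU of 1500, mss_from_pmtu returns 1460, and a 64K (65536 byte) write needs 45 segments.  With GSO, the TCP layer hands that write to the hardware as one large packet plus the MSS value instead of 45 separate packets.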

Ethernet controllers are increasingly providing support for IP level fragmentation and reassembly.  The main reason is the increasing popularity of tunnels.

With increasing usage of tunnels (IPsec, GRE, IP-in-IP, Mobile IP, GTP-U and the upcoming VXLAN and LISP), packet sizes are going up.  Though the tunnel protocol specifications provide guidelines to avoid fragmentation using the DF bit and PMTU discovery, it does not happen in reality.  There are very few deployments where the DF (Don't Fragment) bit, which is required for PMTU discovery, is used.  As far as I know, almost all IPv4 deployments fragment the packets during tunneling.  Some deployments configure network devices to do red-side fragmentation (fragmentation before tunneling, so that each tunneled packet appears as a whole IP packet) and some deployments go for black-side fragmentation (fragmentation after tunneling is done).  In the receive direction, reassembly happens either before or after detunneling.

It used to be the case that service providers gave fragmented packets lower priority during network congestion.  With high throughput connectivity and a growing customer base, service providers are competing for business by providing very good reliability and high throughput.  Due to the popularity of tunnels, service providers are also realizing that dropping fragmented packets may result in a bad experience for their customers.  It appears that service providers are not treating fragmented packets in a step-motherly fashion anymore.

IP fragmentation and TCP segmentation offload methods can be used to reduce the number of packets traversing the TCP/IP stack in the host.  The next question that comes to mind is how to tune the TCP/IP stack to use these features and how to divide the work between these two HW features.

The first thing to tune in the TCP/IP stack is to remove the MSS dependency on the PMTU.  As described above, the MSS is today calculated from the PMTU value.  Due to this, IP fragmentation is not used by the TCP stack for outbound TCP traffic.

TCP segmentation adds both a TCP and an IP header to each segment.  That is, for every 1460 bytes, there is an overhead of 20 bytes of IP header and 20 bytes of TCP header.  In the case of IP fragmentation, each fragment has its own IP header (20 bytes of overhead), while the TCP header is carried only once.  Since TCP segmentation has more overhead, one can say IP fragmentation is better.  Here, the MSS can be set to a bigger value such as 16K, letting the IP layer fragment the packet if the MTU value is less than 16K.  This is certainly a good argument and it works fine in networks where the reliability is good.  Where the reliability is not good, if one fragment gets dropped, the TCP layer needs to send the entire 16K bytes again in retransmission.  If TCP had done the segmentation, it would only need to resend the lost segment.
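The header overhead of the two approaches can be compared with a short calculation.  The sketch below assumes IPv4 and TCP headers without options and takes into account that every IP fragment except the last must carry a payload that is a multiple of 8 bytes.  The function names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

#define IPV4_HDR_LEN 20  /* IPv4 header without options */
#define TCP_HDR_LEN  20  /* TCP header without options */

static uint32_t div_round_up(uint32_t a, uint32_t b)
{
    assert(b > 0);
    return (a + b - 1) / b;
}

/* Header bytes sent when TCP itself cuts data_len bytes into
 * MTU-sized segments: one IP and one TCP header per segment. */
static uint32_t tcp_segmentation_overhead(uint32_t data_len, uint32_t mtu)
{
    uint32_t mss = mtu - IPV4_HDR_LEN - TCP_HDR_LEN;
    return div_round_up(data_len, mss) * (IPV4_HDR_LEN + TCP_HDR_LEN);
}

/* Header bytes sent when TCP emits one large segment and the IP layer
 * fragments it: one TCP header in total, one IP header per fragment. */
static uint32_t ip_fragmentation_overhead(uint32_t data_len, uint32_t mtu)
{
    /* Fragment payloads (except the last) must be multiples of 8 bytes. */
    uint32_t frag_payload = (mtu - IPV4_HDR_LEN) & ~7u;
    uint32_t frags = div_round_up(TCP_HDR_LEN + data_len, frag_payload);
    return TCP_HDR_LEN + frags * IPV4_HDR_LEN;
}
```

For 16K (16384 bytes) of data on a 1500 byte MTU, both approaches put 12 packets on the wire, but TCP segmentation spends 480 bytes on headers while IP fragmentation spends 260.  The price, as noted, is that losing any one of those 12 fragments means resending all 16384 bytes instead of at most 1460.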

There are advantages and disadvantages with both approaches. 

With the increased reliability of networks and with no special treatment of fragmented traffic by service providers, IP fragmentation is not a bad thing to do.  And of course, one should worry about retransmissions too.

I hear of a few tunings based on the deployments.  Warehouse-scale data center deployments, where the TCP clients and servers are in a controlled environment, are tuning the MSS to 32K and more with a 9K (jumbo frame) MTU.  I think that, for a 1500 byte MTU, going with an 8K MSS may work well.