Sunday, November 27, 2011

Software Defined Networking (SDN)

Before going into the details of SDN, it helps to revisit today's networks and the different devices within them.  This post then covers the problems with network devices as seen by service providers and network operators, and describes how SDN is expected to solve them.

Current Networks

Current networks consist of multiple types of network elements - Layer 2 switches, Layer 3 switches/routers and network service elements such as network security devices, Application Delivery Controllers (ADCs), WAN optimization devices and many more.

L2 Switch devices:

L2 switch devices connect multiple computers in an L2 network.  They have a large concentration of ports, which connect to servers, laptops, access points, printers and so on.  Each L2 switch device typically has between 8 and 64 ports.  Since there can be more than 64 computers in one L2 domain, multiple L2 switches are used and inter-connected.  The main function of an L2 switch is to provide connectivity among the computers.  L2 switches are rated on how many Ethernet ports they support, the speed of those ports (1G, 10G etc.), the bandwidth of the switch (the rate at which the device can switch traffic), the latency the switch introduces on packets and the features applied to the switched packets.

Though switching traffic is the main function of an L2 switch, the embedded processor in the device also runs some control plane protocols.  Spanning Tree Protocol (STP) and its variations such as Rapid STP and Multiple STP are used by L2 switch devices to detect loops and break them by disabling some ports.  Multiple MAC Registration Protocol (MMRP), built on top of Multiple Registration Protocol (MRP), propagates multicast memberships across switch devices so that multicast packets are not flooded on every port.  Multiple VLAN Registration Protocol (MVRP) is used across switch devices to share the VLAN-to-port relationship and so create logical LANs that span multiple switches.  There are many other control plane protocols used for interoperability among switch devices.

As discussed above, L2 switch devices contain two main functions - Data plane functionality and Control plane functionality.  The Data plane switches the traffic.  The Control plane ensures that the right context is established in the Data plane, based either on configuration or on the protocol state established with peer switch devices.

Let us look at some of the critical "Data Plane" functionality of switches.

  • Packet switching:  A packet received on one port is switched to another port based on its destination MAC address.  The switch does this by looking up the "MAC address versus port" table it maintains, called the "Learning table".  If there is no matching entry in the table, the switch forwards the packet to all ports except the one on which the packet arrived.
  • Population of Learning Table:  Every time the switch receives a packet on any port, it extracts the source MAC address and populates the table with that address and the ingress port.  In other words, the switch assumes that the host owning the packet's source MAC address is reachable through the ingress port.
  • Shaping and scheduling on the ports in egress direction:  L2 switch devices can shape traffic when the connected computer or network device cannot accept traffic above a certain rate.  When the packet load exceeds the shaping bandwidth, L2 switches queue the packets and schedule them using algorithms such as priority-based scheduling, Deficit Round Robin, strict priority scheduling or a combination of multiple algorithms.
  • Access Control Functionality:  Switch devices also provide access control functionality to filter packets out or to apply different actions based on the type of traffic.
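
The packet switching and learning behaviour described in the list above can be sketched in a few lines of Python.  This is only an illustration: the class name `LearningSwitch`, the `receive` method and the MAC addresses are made up for this post, and a real switch does all of this in hardware, with entry ageing and VLAN awareness that are left out here.

```python
from typing import Dict, List


class LearningSwitch:
    """Toy L2 switch: learns source MACs and forwards by destination MAC."""

    def __init__(self, ports: List[int]) -> None:
        self.ports = list(ports)
        self.mac_table: Dict[str, int] = {}  # the "Learning table": MAC -> port

    def receive(self, in_port: int, src_mac: str, dst_mac: str) -> List[int]:
        """Return the ports on which the frame is sent out."""
        # Learn: the sender is assumed reachable through the ingress port.
        self.mac_table[src_mac] = in_port

        out_port = self.mac_table.get(dst_mac)
        if out_port is None:
            # Unknown destination: flood to all ports except the ingress port.
            return [p for p in self.ports if p != in_port]
        if out_port == in_port:
            return []  # destination sits behind the ingress port; nothing to do
        return [out_port]


switch = LearningSwitch(ports=[1, 2, 3, 4])
# Host 01 on port 1 talks to host 02, which the switch has not seen yet.
print(switch.receive(1, "aa:aa:aa:aa:aa:01", "aa:aa:aa:aa:aa:02"))  # flooded
# Host 02 replies from port 2; host 01 is now in the table.
print(switch.receive(2, "aa:aa:aa:aa:aa:02", "aa:aa:aa:aa:aa:01"))  # one port
```

The first frame is flooded because the table is empty, while the reply goes out on a single port because the first frame already taught the switch where host 01 lives.
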
Layer 3 (L3) switch devices/Routers:

Routers or L3 switch devices are used to separate different L2 domains.  Layer 3 switching is typically based on IPv4/IPv6 addresses.  At times, it can also be based on other fields of the IPv4/IPv6 headers, such as the DSCP value.

Like L2 switch devices, L3 switch devices have Data Plane and Control Plane functionality.  The Data Plane forwards packets based on the routing database, and the Control Plane manages that database.  Control plane protocols such as RIP, OSPF and BGP for unicast, and PIM-SM and MLDv2/IGMP for multicast, populate the unicast and multicast routing databases respectively.  The control plane protocols in a device work with other L3 switch devices to figure out the best routes to reach each destination.
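
To make the Data Plane side concrete, here is a minimal sketch of a lookup against a unicast routing database.  It assumes the usual IP forwarding rule that the most specific (longest) matching prefix wins, which this post does not spell out.  The function name `lookup`, the prefixes and the next-hop addresses are illustrative only, and real devices use tries or TCAMs rather than a linear scan.

```python
import ipaddress
from typing import List, Optional, Tuple

# One routing database entry: (destination prefix, next hop).
Route = Tuple[ipaddress.IPv4Network, str]


def lookup(routes: List[Route], destination: str) -> Optional[str]:
    """Return the next hop of the longest prefix that contains destination."""
    addr = ipaddress.ip_address(destination)
    best: Optional[Route] = None
    for prefix, next_hop in routes:
        if addr in prefix and (best is None or prefix.prefixlen > best[0].prefixlen):
            best = (prefix, next_hop)
    return best[1] if best is not None else None


# In a real device the control plane (RIP, OSPF, BGP ...) would populate this.
routes: List[Route] = [
    (ipaddress.ip_network("0.0.0.0/0"), "192.0.2.1"),    # default route
    (ipaddress.ip_network("10.0.0.0/8"), "192.0.2.2"),
    (ipaddress.ip_network("10.1.0.0/16"), "192.0.2.3"),
]

print(lookup(routes, "10.1.2.3"))  # matches all three; the /16 wins
print(lookup(routes, "10.9.9.9"))  # matches the /8 and the default route
print(lookup(routes, "8.8.8.8"))   # only the default route matches
```
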

Network Service Devices:

Network Service Devices also have two higher level functions - Control Plane and Data Plane.  In network services, people tend to use the term "Service Plane" instead of "Control Plane", and "Fast Path" instead of "Data Plane".  Unlike in L2/L3 switch devices, data traffic is processed by both the Service Plane and the Fast Path.  After processing a certain amount of data in a given flow, the Service Plane typically decides to offload the rest of the processing to the Fast Path.  From then on, any traffic on the offloaded flow is handled by the Fast Path.  I am not going into the details of when the Service Plane decides to offload a flow, as that is a subject in itself.  To give one example, an ADC might process the HTTP packets of a connection until it has seen the HTTP request headers, and then offload the rest of the connection to the Fast Path.
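
The hand-off between the two planes can be sketched as follows.  The offload decision here is a simple packet-count threshold, which is purely a stand-in: as noted above, the real criteria are a subject in themselves.  The names `ServiceDevice`, `process` and `offload_after` are hypothetical.

```python
from typing import Dict, Set, Tuple

# A flow is identified here by (src IP, src port, dst IP, dst port).
FlowKey = Tuple[str, int, str, int]


class ServiceDevice:
    """Toy network service device with a Service Plane and a Fast Path."""

    def __init__(self, offload_after: int = 3) -> None:
        # Hypothetical rule: offload a flow once the Service Plane has
        # inspected this many of its packets.
        self.offload_after = offload_after
        self.inspected: Dict[FlowKey, int] = {}
        self.fast_path_flows: Set[FlowKey] = set()

    def process(self, flow: FlowKey) -> str:
        """Return which plane handled this packet."""
        if flow in self.fast_path_flows:
            return "fast path"
        # Not offloaded yet: the Service Plane does the full processing.
        self.inspected[flow] = self.inspected.get(flow, 0) + 1
        if self.inspected[flow] >= self.offload_after:
            # Hand the rest of the flow to the Fast Path.
            self.fast_path_flows.add(flow)
            del self.inspected[flow]
        return "service plane"


device = ServiceDevice(offload_after=3)
flow = ("198.51.100.7", 40000, "192.0.2.80", 80)
for _ in range(5):
    print(device.process(flow))
```

With a threshold of 3, the first three packets of the flow are handled by the Service Plane and every later packet by the Fast Path.
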

Gist of Network Device functionality:

In essence, almost all network devices have Control plane and Data Plane functionality.  Most of the intelligence resides in the Control plane.  The Data Plane does not need much intelligence, but its processing is expected to happen at very high speed.  The Control plane creates flow contexts in the Data plane as part of its processing, and the Data plane uses these flow contexts to act on the traffic.

Packaging of Network Devices today:

Cisco, Juniper, HP, Brocade and Dell are some of the vendors of L2/L3 switch devices and network service devices.  Network operators in enterprises, service providers and data centers are the customers for these devices.

Vendors provide self-contained devices with easy-to-use configuration mechanisms.  These devices come with both Control and Data Plane functionality.  The separation of Control and Data Plane, and the interfaces between them, are proprietary to each vendor's architecture.

In summary, vendors provide equipment to network operators, and operators configure these devices using a command line interface, an HTTP-based GUI or centralized management systems to suit their deployments.

Network Operator Challenges:

I have heard of the following challenges from network operators (information based on conferences and reports):

  • Addition of new control plane protocols:  One operator indicated that they wanted to introduce a new routing protocol suited to their deployment.  The operator felt that existing routing protocols such as IS-IS, OSPF and RIP are inadequate or overly complex.  Network devices today are not programmable, so operators have the following choices - pay network device vendors to implement the new protocol, standardize the protocol through standards bodies and hope that vendors implement it, or build the network devices themselves.  All of these are costly or time consuming.  It appears that one vendor asked for millions of dollars to implement the protocol and maintain it for the life of the product, which is prohibitive for the operator.  Going through standards bodies takes a few years at minimum and was not an option given the urgency of the request.  Building their own device is also costly for the operator, as it requires a large number of engineering resources.
  • Cost of interoperability is very high:  Another operator indicated that network devices implement so many protocols that maintaining interoperability is a huge task.  It appears that this operator's network has thousands of network devices from different vendors.  Whenever a new device is purchased or a new image is installed on some devices, the operator spends a few man-years of effort to ensure that the new device or image continues to inter-operate with existing devices from other vendors for all supported protocols.  As the number of control plane protocols grows, the cost of interoperability goes up too.  This operator indicated that there are hundreds of protocols for which they need to verify interoperability with every new device purchase or image upgrade.  I heard a figure of a few million dollars being spent on this every year by this operator.
  • Reduction of Control plane and discovery protocol traffic on the wire:  A Data Center operator indicated that a large amount of discovery protocol traffic is observed in their network.  It appears that a large number of ARP packets are seen in a network consisting of hundreds of thousands of virtual servers.  ARP requests are sent to the broadcast MAC address, so every network device and computer in the L2 network receives all of them.  A device or computer responds to ARP requests for its own IP address and ignores all the others, but the processing cycles spent deciding which requests to act on are not insignificant.  As I understand it, around 10-20% of the CPU cycles of each virtual server are used in processing ARP requests this way.  This operator indicated that their network repository system knows all the devices and their associated MAC addresses.  The operator wanted to use this information to populate the network devices facing the relevant computers and let those devices respond to ARP requests without propagating them to other network devices, thereby reducing the ARP packets on the network.  Many devices support Proxy ARP functionality, but the scale required was the issue for this operator.  The operator wanted each network device to support at least 64K proxy ARP records, but did not find many devices supporting more than 2K.
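
The proxy ARP idea in the last item can be sketched as a table that an access device consults before letting a request spread.  This is a toy model: the class `ProxyArpTable`, its methods and the addresses are illustrative, and the `capacity` argument only mimics the record limit that the operator ran into.

```python
from typing import Dict, Optional


class ProxyArpTable:
    """Toy proxy ARP table, pre-populated from an operator's repository."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity            # maximum number of records
        self.entries: Dict[str, str] = {}   # IP address -> MAC address

    def add(self, ip: str, mac: str) -> bool:
        """Install a record; return False if the table is already full."""
        if ip not in self.entries and len(self.entries) >= self.capacity:
            return False
        self.entries[ip] = mac
        return True

    def handle_request(self, target_ip: str) -> Optional[str]:
        """Return the MAC to answer with, or None if the request has to be
        broadcast as usual because no record exists."""
        return self.entries.get(target_ip)


table = ProxyArpTable(capacity=2)
table.add("192.0.2.10", "02:00:00:00:00:0a")
table.add("192.0.2.11", "02:00:00:00:00:0b")
print(table.add("192.0.2.12", "02:00:00:00:00:0c"))  # table is full
print(table.handle_request("192.0.2.10"))            # answered locally
print(table.handle_request("192.0.2.99"))            # no record
```

Every request answered from the table is one broadcast that the other devices and servers never have to look at, which is why the number of records a device can hold matters so much at this scale.
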
What do Network Operators want?

Network operators would like flexibility beyond the configuration provided by network device vendors.  Operators do understand the value of the "Data Plane" in supporting multi-gigabit throughput.  What they believe they need is programmability in the Control plane part of the software.

As indicated before, network devices today have both Control Plane and Data Plane functionality, with a proprietary interface between the planes.  To allow customization of the control plane and the addition of new control plane protocols, network operators would like a standardized interface between Control Plane and Data Plane.

Network device vendors use different types of processors to run the control plane.  To customize control plane software today, network operators may have to develop and test it on multiple processors.  Operators would like to develop the control plane only once for all network devices, and to write control plane applications in higher level languages such as Java because of the large pool of Java developers.  That lets network operators use the engineering pool they already have.

Is SDN the answer?

SDN is expected to solve these network operator challenges by enabling operators to develop or customize control plane applications themselves.  The OpenFlow protocol is one of the first steps of SDN.

  • OpenFlow protocol definition:  OpenFlow protocol version 1.1 is defined to address the following desires of network operators:
    • Separation of the Control Plane from the Data Plane.
    • Implementation of the Control plane at a central place, serving multiple devices and running on high end processors.
    • Facility to develop control plane applications in higher level languages such as Java and Python.
How is SDN expected to address some of the challenges of network operators?
  • With the control plane separated from the Data plane, the control plane software vendor can be different from the network device vendors implementing the Data plane.
  • Since there is one vendor for control plane software in a given deployment,  interoperability issues would get reduced dramatically.  
  • If one control plane computer (Controller) manages multiple network devices, there is no need to run control plane protocols among them.  The control plane instances of all those devices live within the same controller, which can use its own proprietary mechanisms to work out the results without running a protocol.  This reduces the amount of protocol traffic in a network managed by one controller.  Control plane protocols need to run only where devices are managed by different controllers.
  • Since there are only a few controllers in a given deployment, network operators can afford to run the controller software on very high end systems such as Multicore processors.  This makes it practical to implement the controller, including the control plane protocols, in a high level language such as Python or Java.
  • SDN is also trying to standardize the different layers in the controller, similar to server side programming.  This allows even network operators to customize the controller and add newer control plane software without depending on the controller software vendors.
  • The OpenFlow protocol, standardized as part of SDN, allows the creation of flows in the network devices (Data Plane), which gives granular control over the traffic going through the network.  It becomes possible to create logical networks without adding physical network devices.  For example, research networks or multi-tenant networks can be created on top of production networks with higher confidence that the production network continues to run.
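
To give a feel for what "creating flows in the Data Plane" means, here is a minimal match/action table that a controller could program.  This is not the OpenFlow wire protocol or any controller's API.  The classes `FlowEntry` and `FlowTable`, the field names (`vlan`, `tcp_dst`) and the action strings are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class FlowEntry:
    priority: int            # higher priority entries are checked first
    match: Dict[str, object] # header field -> required value; absent = wildcard
    action: str              # e.g. "output:1"; free-form in this sketch


class FlowTable:
    """Toy Data Plane flow table, populated by a (remote) controller."""

    def __init__(self) -> None:
        self.entries: List[FlowEntry] = []

    def install(self, entry: FlowEntry) -> None:
        """Called on behalf of the controller to add a flow."""
        self.entries.append(entry)
        self.entries.sort(key=lambda e: e.priority, reverse=True)

    def lookup(self, packet: Dict[str, object]) -> Optional[str]:
        """Return the action of the best matching flow, or None on a miss."""
        for entry in self.entries:
            if all(packet.get(k) == v for k, v in entry.match.items()):
                return entry.action
        return None  # table miss; a real device might ask the controller


table = FlowTable()
# Two logical networks (tenants) sharing one physical device, split by VLAN.
table.install(FlowEntry(10, {"vlan": 100}, "output:1"))
table.install(FlowEntry(10, {"vlan": 200}, "output:2"))
# A more specific, higher priority flow for web traffic of tenant 100.
table.install(FlowEntry(20, {"vlan": 100, "tcp_dst": 80}, "output:3"))

print(table.lookup({"vlan": 100, "tcp_dst": 80}))  # hits the specific flow
print(table.lookup({"vlan": 100, "tcp_dst": 22}))  # falls back to tenant flow
print(table.lookup({"vlan": 300}))                 # no flow installed
```

The device itself only matches and acts.  Which flows exist, and therefore which logical networks exist, is decided entirely by whoever installs the entries.
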

