Saturday, December 31, 2011

Locator and Identifier Separation Protocol (LISP) - One more tunnel protocol

In 2012, I think there will be focus on two technologies in the network infrastructure market - SDN and LISP (Locator and Identifier Separation Protocol).  LISP work has been going on for a few years, and it has been talked about quite often in the recent past.


The reasons for LISP are detailed very well in RFC 4984.  Some points of RFC 4984 are worth noting down, and I mention them here.


Internet presence is now part of the business model of many organizations. Hence, high availability of connectivity to the Internet is becoming very important to organizations.   High availability is achieved by having multiple links to an ISP and also links to multiple different ISPs.   Multiple links are used for load balancing the traffic as well as for redundancy.

Customers (companies) get an IP address block (subnet) from each ISP, and this address block is used by organizations to assign IP addresses to the machines that need to be reachable from external networks.  Since each ISP assigns a different block,  critical machines are provided with multiple IP addresses - one from each ISP-assigned block.  Operating systems and routing protocols running in the machines and routers ensure that the right IP addresses of the active links are used.  Each machine's operating system should have this intelligence so that connections from applications running on the operating system's TCP/IP stack are assigned active IP addresses.  Since multiple IP addresses are assigned to a machine,  the machine is termed a multihomed machine. This concept is called multihoming.

Even though the above scheme works in general, existing active connections get terminated if the link associated with the IP addresses of the connections goes down.  This could result in lost voice calls and termination of very important TCP/IP connections.  This is one problem with provider-assigned (PA) IP addresses (also called provider-aggregatable addresses).   There is no issue for new connections, though, as routing protocols propagate this information.  Note that service providers don't accept packets from their customer networks having a source IP address outside the IP address block they assigned.  Due to this, packets belonging to active connections can't be sent onto links of other service providers.  This is one of the challenges organizations have with multihoming.

The second challenge with multihoming is the propagation of active routes and links to each machine.  All machines that need to be reachable from external networks would need to have routing protocols implemented.  As you all know, end nodes typically don't have routing protocols enabled, so as not to increase the maintenance headache for the IT department.

The third challenge is multihoming for inbound connections to the organization. When a link is down, remote systems somehow should not use the addresses associated with the down link.   Typically, this is achieved by assigning an FQDN (fully qualified domain name) to each internal server machine and updating the DNS Server with the IP addresses of active links.  That is, the DNS Server can't just use static information; it should be informed of changes as soon as possible.  Even though this can be done at the local DNS Server level,  many DNS resolvers on the Internet might have cached this information, and remote systems would continue to use the down IP addresses for some time.  This can be mitigated by sending DNS responses with a TTL of 0, but that would increase the load on the local DNS Servers.

Finally,  organizations would like certain types of traffic (both inbound and outbound connections) to prefer some links over others for several reasons such as cost of the link,  time of day etc..

Basically, traffic engineering for outbound and inbound connections has a good number of challenges, and the techniques to overcome these challenges have limitations as described above.

Provider Independent Addresses:

Finally,  to address the issues of multihoming and traffic engineering,  the RIRs (Regional Internet Registries) introduced policy documents allowing organizations to request provider-independent addresses (PI addresses). Provider-independent addresses are expected to be routed by all service providers.  That is, packets coming from their customers with these IP addresses as the source IP address are expected to be honored by the service providers.

Benefits of the PI addresses are obvious with the above background,  but consolidated reasons are given below:
  • No need for multihoming support in end nodes; hence no need for enabling routing protocols in the end nodes.
  • Traffic engineering is simpler - no need for dynamically updating DNS Servers.
  • Simple for organizations to move to new service providers - no renumbering of machines every time the service provider is changed.
  • With acquisitions and mergers,  consolidation of networks is simple.
With PA (provider-assigned) addresses,  addresses are aggregatable.  Hence, the routing entries used to be small in number in the routers.  Provider-independent routes can't be aggregated, and hence the routing table size increases dramatically. Routing table sizes of DFZ (Default Free Zone) routers are going up dramatically due to PI addresses.  According to the BGP Routing Analysis Report, the number of routes in DFZ routers went from 5000 in the year 2000 to 400,000 in the year 2011.  With IPv6 popularity and more liberal assignment of PI addresses in IPv6, it will not be a surprise if the number of routes in DFZ routers goes into the millions in the next few years.  Since the routing table is referred to by DFZ and service provider routers for every packet that comes in,  more routes in the table reduce the performance of the router and hence the performance of the overall Internet.

LISP was mainly born to address this scaling issue in DFZ routers.

Basic concept of LISP:

LISP tries to keep the advantages of provider-independent addresses for organizations while keeping the routing table at a reasonable size by using aggregatable addresses.  To address this,  LISP proposes two address spaces - an identifier address space and a locator address space.  The identifier address space is similar to provider-independent address space; organizations are expected to assign addresses from the allocated space to individual network elements.  Locator space is also assigned to the organization, but it is provider-aggregatable address space.  Hence, this space should not be used to assign addresses to all network elements; it should be assigned to tunnel routers (LISP tunnel routers) only.   When the organization changes its service provider,  it should only need to worry about IP address assignment to the LISP routers, and no other change is expected.

LISP standards call the identifier address an EID (Endpoint ID) and the locator IP address an RLOC (Routing LOCator).

A LISP router contains two functions - Ingress Tunnel Router (ITR) and Egress Tunnel Router (ETR). The ingress and egress terms are with respect to the endpoint network.  Since the endpoint identifier space is not expected to be visible to the core network routers,  the ITR encapsulates the traffic coming from the endpoint network in a tunnel with LISP, UDP and IP headers and sends it out onto the Internet.  The ETR decapsulates the traffic coming from the Internet and passes the inner packet to the endpoint network.   Typically, ITR and ETR are implemented in customer edge routers. Initially,  enterprises might expect service providers to provide LISP service, and eventually enterprise routers will have this functionality.

The IP addresses used in the tunnel's IP header are from the RLOC space.  Since RLOC space is provider-aggregatable,  the routing table size will not increase dramatically.  Please see the LISP draft for more information on the tunnel header formats.
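To make the encapsulation concrete, here is a minimal Python sketch of LISP data-plane encapsulation (outer IPv4 header + UDP on port 4341 + an 8-byte LISP header in front of the inner packet). The header layout is simplified, the UDP source port is fixed and checksums are left at zero, so treat it as illustrative rather than a conformant implementation.

```python
import struct

LISP_DATA_PORT = 4341  # LISP data-plane UDP port

def lisp_encapsulate(inner_packet: bytes, itr_rloc: bytes, etr_rloc: bytes,
                     nonce: int = 0) -> bytes:
    """Prepend outer IPv4 + UDP + LISP headers to an inner IP packet.

    itr_rloc / etr_rloc are 4-byte packed IPv4 addresses from RLOC space.
    """
    # Simplified 8-byte LISP data header: flags byte (N bit set => nonce
    # present), 3-byte nonce, 4 bytes of instance-ID / locator-status-bits.
    lisp_hdr = struct.pack("!B3sI", 0x80, nonce.to_bytes(3, "big"), 0)

    udp_len = 8 + len(lisp_hdr) + len(inner_packet)
    # Source port fixed for the sketch; checksum left as 0.
    udp_hdr = struct.pack("!HHHH", LISP_DATA_PORT, LISP_DATA_PORT, udp_len, 0)

    total_len = 20 + udp_len
    ip_hdr = struct.pack("!BBHHHBBH4s4s",
                         0x45, 0, total_len,  # ver/IHL, TOS, total length
                         0, 0,                # ID, flags/fragment offset
                         64, 17, 0,           # TTL, protocol=UDP, checksum 0
                         itr_rloc, etr_rloc)  # outer addresses are RLOCs
    return ip_hdr + udp_hdr + lisp_hdr + inner_packet
```

The point to notice is that only the outer (RLOC) addresses are visible to core routers; the inner EID-addressed packet rides through unchanged.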

How does it solve the issues/challenges discussed above?

Multihoming is no longer required in end nodes, but it is still required on LISP routers - that is, there would still be a requirement for multiple links from different providers for redundancy and traffic engineering.  Active connections will not suffer if traffic is redirected to other links, as end nodes always work with the EID space, and those IP addresses continue to work, similar to provider-independent addresses.  The outer IP header addresses of the LISP tunnel change when links go down and come back up.  That is okay, as these addresses are only used to get to the LISP ETR.

EID to RLOC mapping:

The ITR needs to know the source IP and destination IP to be used for the tunnel header. The ITR uses the destination IP (EID) of packets coming in from the local network to determine the remote ETR's RLOC IP address. It does this using the mapping database.  Each ITR is expected to maintain an EID-to-RLOC cache.  If it does not find a matching entry in the cache, it talks to the map resolvers.  Map resolvers use the mapping database to figure out the destination ETR and let the destination ETR send the actual EID-to-RLOC mapping to the requesting ITR.   Basically, map resolvers and the mapping database are only used to find the ETR; the ETR is the one which gives the EID-to-RLOC mapping to the ITRs.

ETRs are expected to register their RLOCs with the mapping database for the EID prefixes they control.  This is done using the Map-Register message.  ITRs send a Map-Request message to the map resolvers to get the EID-to-RLOC mapping.  Map resolvers use the registration database to find the RLOC of the ETR and rewrite the Map-Request's destination IP address with that RLOC to redirect the Map-Request packet to the ETR.  The ETR then replies with a Map-Reply message carrying the actual RLOCs to be used by the ITR.  One might ask why the map resolvers themselves can't send the Map-Reply to the ITR.  The ETR is given this opportunity to do inbound traffic engineering:  it can give different RLOC IP addresses for different types of traffic, use different links at different times, etc..
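The register/request/reply exchange above can be modeled as a toy Python sketch. The class and method names are my own, the whole mapping system is collapsed into one object, and replies are plain strings, so this only illustrates the control flow: the resolver merely locates the ETR, and the ETR itself produces the mapping.

```python
import ipaddress

class MapSystem:
    """Toy mapping database: EID prefix -> registered ETR (illustrative)."""
    def __init__(self):
        self.registry = {}                  # EID prefix -> ETR object

    def map_register(self, eid_prefix: str, etr: "ETR"):
        self.registry[ipaddress.ip_network(eid_prefix)] = etr

    def map_request(self, eid: str) -> str:
        # Longest-prefix match locates the owning ETR; the request is then
        # handed to the ETR, which (not the resolver) produces the Map-Reply.
        addr = ipaddress.ip_address(eid)
        matches = [p for p in self.registry if addr in p]
        prefix = max(matches, key=lambda p: p.prefixlen)
        return self.registry[prefix].map_reply(eid)

class ETR:
    def __init__(self, rloc: str):
        self.rloc = rloc
    def map_reply(self, eid: str) -> str:
        # A real ETR could return different RLOCs per traffic class or
        # time of day - this is where inbound traffic engineering lives.
        return self.rloc

class ITR:
    def __init__(self, resolver: MapSystem):
        self.resolver, self.cache = resolver, {}   # EID -> RLOC map-cache
    def lookup(self, eid: str) -> str:
        if eid not in self.cache:                  # cache miss: ask resolver
            self.cache[eid] = self.resolver.map_request(eid)
        return self.cache[eid]
```

Once the ITR has cached the mapping, subsequent packets to the same EID are encapsulated without touching the mapping system at all.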

The mapping database, map resolver servers and associated message formats are described in the IETF LISP Map-Server Interface draft.

In summary,  map resolvers and mapping database servers are used to index ETRs, and ETRs are the ones which actually provide the EID-to-RLOC mapping.

The challenge really is how the index database is implemented.  Note that this database can become big, as all EID prefixes would be maintained in it.  The database is updated by multiple ETRs, so the update operation needs to be fast; of course, the search operation needs to be very fast as well. To take care of scalability issues, multiple database servers would need to be used, which requires dividing the database across multiple servers.  One proposal I see uses a DHT (Distributed Hash Table).
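One way such a distributed index could be partitioned is consistent hashing, sketched below in Python; the server names and virtual-node count are illustrative, and this is only one possible DHT-style placement scheme, not the one any LISP draft mandates. The benefit is that adding or removing a server moves only a fraction of the prefixes.

```python
import bisect
import hashlib

class ConsistentHashIndex:
    """Partition EID prefixes across mapping-database servers (sketch)."""
    def __init__(self, servers, vnodes=64):
        self.ring = []                      # sorted (hash, server) points
        for s in servers:
            for v in range(vnodes):         # virtual nodes smooth the split
                self.ring.append((self._h(f"{s}#{v}"), s))
        self.ring.sort()

    @staticmethod
    def _h(key: str) -> int:
        return int(hashlib.sha1(key.encode()).hexdigest(), 16)

    def server_for(self, eid_prefix: str) -> str:
        # The first ring point clockwise from the prefix's hash owns it.
        i = bisect.bisect(self.ring, (self._h(eid_prefix),))
        return self.ring[i % len(self.ring)][1]
```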

For more details, please see the LISP Alternative Logical Topology (LISP+ALT) draft; there is also a DHT-based alternative (LISP-DHT).


The year 2012 would see LISP-based routers.  The initial set of routers would have software implementations of the LISP routing functionality.  Once the standards achieve a certain level of maturity,  one would see Ethernet controller vendors (standalone or Multicore based) adopting this technology in hardware.

Monday, December 19, 2011

Table Centric Processing and Openflow

Software in embedded network appliances consists of multiple entities - Management Plane (MP), Control Plane (CP) and Data Plane (DP).  Some examples of network appliances include routers, switches,  load balancers,  security devices, WAN optimization devices etc..

The MP entity typically consists of management engines such as CLI and GUI engines and persistent storage engines.

The CP mainly consists of several sub-entities for different functions. For example, if a device provides firewall,  IPsec VPN and routing functionality,  then there could be three sub-entities.  Each sub-entity might consist of multiple protocols and modules.  The routing function might have multiple control plane protocols such as OSPF, BGP, IS-IS etc..  The CP also typically consists of exception packet processing modules.  For example, if no flow processing context is found for a given packet in the DP,  the DP gives the packet to the CP to process it and lets the CP create the flow context in the DP.

The DP entity consists of the actual packet processing logic. The DP entity is typically implemented in Network Processor Units or ASICs.  It is increasingly being implemented in Multicore processor SoCs, with some cores dedicated to the CP and the rest of the cores to the DP.

In this post, I am going to concentrate on the DP (also called the datapath).

As indicated, the DP implements the "datapath" processing.  Based on the type of network device, the datapath consists of 'routing',  'bridging',  'nat', 'firewall', 'ipsec' and 'dpi' modules and more.  Some background on each datapath function is listed below; then we will see how the table-driven processing model helps in implementing datapaths.


The routing module typically involves the following processing upon packet reception:
  • Parse and check integrity of the packet
    • L2 (Ethernet, VLAN etc..)  header parsing.
    • Identification of packet type 
    • If it is IP,  proceed further, otherwise do non-IP specific processing.
    • Integrity check involves ensuring the IP checksum is valid and packet size is at least as big as the 'total length' in the IP header.
  • Extraction of fields - Source IP address,  Destination IP address,  ToS fields from the IP header.
  • IP Spoofing Check:  This processing ensures that the packet came in on the right interface (Ethernet port).  Normally this is done by doing a route lookup on the source IP address.  The interface returned from the route lookup operation is compared with the port on which the packet came in. If both are the same, the IP spoofing check is treated as successful and further packet processing happens.  Basically, this check ensures that a packet destined to the current packet's source IP address would go out on the interface on which the current packet came in. This requires a "RouteLookup" operation using the packet's source IP address on the routing table, which is populated by the CP.
  • Determination of outgoing interface and next-hop gateway:   This processing step involves a "route lookup" operation on the destination IP of the packet and optionally on the TOS value of the packet.  The "route lookup" operation indicates whether the packet needs to be consumed by the local device (that is, the DIP is one of the local IP addresses) or whether the packet needs to be forwarded.  If the packet needs to be sent to the next hop, then the gateway IP address (or the destination IP itself if the host is in the same subnet as the router) and the outbound interface are returned.   If the packet is meant for the local device itself, the packet is handed over to the CP. The rest of the processing assumes that the packet is forwarded to the next hop.
  • TTL Decrement:  At this processing step,  the DP decrements the TTL and does an incremental checksum update of the IP header checksum field.  If the TTL becomes 0, the packet gets dropped.
  • Packet Fragmentation:  If the packet size is more than the PMTU value of the route or the MTU of the outbound interface, the packet gets fragmented at this step.
  • MAC Address Determination:  If the outbound interface is of Ethernet type, this processing step finds the MAC address for the gateway IP address determined in one of the previous steps.  It refers to the "address resolution table" populated by the CP, using the IP address as input to get the MAC address.  If there is no matching entry,  the packet is sent to the CP for ARP resolution.  In case the outbound interface is a point-to-point interface,  the L2 header can be found from the interface table, which is populated by the CP.
  • Packet update with L2 header:  The DP frames the Layer 2 header (Ethernet, VLAN, PPP etc..) and prepends it at the beginning of the packet, right before the IP header.
  • Packet Out:  The packet is sent out the outbound interface determined as part of the "route lookup" operation.
At each step,  appropriate statistics counters are incremented. There are two types of statistics - normal and error statistics.  These statistics are typically maintained globally,  on a per-interface basis,  per routing entry or per ARP table entry.
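The forwarding steps above can be condensed into a short Python sketch. The table contents, interface names and dictionary-based packet representation are hypothetical; a real DP would use an LPM trie and hardware tables, but the control flow (spoof check, route lookup, TTL, ARP, punt-to-CP on misses) is the same.

```python
import ipaddress

class RoutingDatapath:
    def __init__(self):
        self.routes = {}   # ip_network -> (next_hop, out_interface), CP-populated
        self.arp = {}      # next-hop IP -> MAC address, CP-populated

    def route_lookup(self, ip: str):
        """Longest-prefix-match lookup, linear scan for the sketch."""
        addr = ipaddress.ip_address(ip)
        matches = [p for p in self.routes if addr in p]
        if not matches:
            return None
        return self.routes[max(matches, key=lambda p: p.prefixlen)]

    def forward(self, pkt: dict):
        """pkt is a toy packet: {'sip', 'dip', 'ttl', 'in_if'}."""
        spoof = self.route_lookup(pkt["sip"])          # reverse-path spoof check
        if spoof is None or spoof[1] != pkt["in_if"]:
            return ("drop", "spoofed")
        route = self.route_lookup(pkt["dip"])
        if route is None:
            return ("punt-to-cp", "no-route")          # exception path
        pkt["ttl"] -= 1                                # TTL decrement
        if pkt["ttl"] == 0:
            return ("drop", "ttl-expired")
        next_hop, out_if = route
        mac = self.arp.get(next_hop)                   # MAC determination
        if mac is None:
            return ("punt-to-cp", "arp-miss")          # CP resolves ARP
        return ("tx", (out_if, mac))                   # packet out
```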

Upon analysis of the above steps, one can find that the IP routing datapath refers to a few tables:
  • IP Routing Database Table, which is LPM (Longest Prefix Match), for route lookup operation.
  • MAC Address resolution table, which is exact match table,  to find MAC address for a given IP address.
  • Interface Table, which is an index table, to find the L2 header to be used in case of point-to-point interfaces.  The interface index is typically returned by the "route lookup" operation.
IP routing in some devices also implements "Policy Based Routing" (PBR).  PBR is an ACL-type table with each entry identified by several fields, including IP header fields (SIP, DIP, TOS),  input interface and output interface, and even transport header fields.  The output of each ACL rule is a routing table.  Basically, "route lookup" involves two steps: first, finding the matching ACL rule in the PBR table, which gives the routing table to use; second, searching that routing table with the IP address and TOS to get the routing information.  ACL rules can have field values represented as subnets, ranges or single values.   An ACL is typically an ordered list.
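A minimal sketch of the two-step PBR lookup follows; the rules, table names and interfaces are purely illustrative, and the second step is collapsed to a default route for brevity.

```python
import ipaddress

# Step 1 input: ordered PBR ACL - the first matching rule selects which
# routing table to search (rules here are made up for illustration).
pbr_rules = [
    (lambda pkt: pkt["tos"] == 0x10, "low-latency"),
    (lambda pkt: ipaddress.ip_address(pkt["sip"]) in
                 ipaddress.ip_network("10.1.0.0/16"), "guest"),
]

# Step 2 input: per-policy routing tables (collapsed to a default route).
routing_tables = {
    "low-latency": {"default": "eth2"},
    "guest":       {"default": "eth3"},
    "main":        {"default": "eth1"},
}

def pbr_route(pkt: dict) -> str:
    # Step 1: ordered ACL search selects the routing table ("main" if no hit).
    table = next((t for match, t in pbr_rules if match(pkt)), "main")
    # Step 2: normal route lookup within the selected table.
    return routing_tables[table]["default"]
```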

Bridging:  Bridging switches packets without any modification to the packet.  Bridging is typically done among Ethernet interfaces.  Bridging typically involves the following steps upon receiving a packet from one of the Ethernet interfaces that belong to a bridge:
  • Ethernet Header Parsing: In this step,  the Ethernet header and VLAN headers are extracted from the packet.
  • Bridge Instance Determination:  There could be multiple bridge instances supported by the datapath.  This step determines the bridge instance based on the interface on which the packet came in.  A given Ethernet or VLAN interface belongs to only one bridge.  The CP populates the interface table with entries, each identified by an "interface index" and having parameters such as the bridge instance index (an index into the bridge table) and whether this interface allows learning,  forwarding etc..
  • Learning Table Update:   Each bridge maintains a learning table, also called the FDB (Forwarding Database).  This table provides interface information for a given MAC address.  It is normally used by the bridging datapath to determine the outbound interface for received packets, but it is updated with entries based on incoming packets: the source MAC address and the interface on which the packet came in are used to populate the learning table. Basically,  it learns the machines on the physical network attached to each Ethernet port.   As part of this processing step,  if there is no learning table entry, a new entry gets added.  Note that learning is done only if the Ethernet interface status in the "interface table" indicates that this interface can be used to learn machines. In some systems,  the population of the learning table is done by the CP:  whenever the DP finds that there is no entry matching the SMAC, it sends the packet to the CP, and the CP creates the table entry.
  • Determination of Outbound Interface:  In this step,  the DP does a lookup on the FDB with the DMAC address as key.  If there is a matching entry, it knows the outbound interface on which the packet needs to be sent.  In the case of a multicast packet, it refers to the multicast FDB to find the set of interfaces on which the packet needs to be sent out.  In the case of a broadcast packet (DMAC = ff:ff:ff:ff:ff:ff), all interfaces in the bridge are selected to send the packet out.  Note that the multicast FDB is populated by the CP; the CP does this by interpreting IGMP, MLD and PIM-SM packets.
  • Packet Out:  In this step, the packet is sent out.  In the case of a unicast packet,  if the interface is known, the packet is sent on it.  If the outbound interface is unknown (there is no matching FDB entry), the packet is sent out on all ports except the incoming port; the packet is duplicated multiple times to achieve this.   Multicast packets are also sent on all ports if there is no matching entry; otherwise, the packet is sent to the interfaces indicated by the multicast FDB.  Broadcast packets are always sent to all ports in the bridge. Note that, in all these cases, the packet is never sent on the interface on which it came in. Also,  the datapath does not send the packet if the interface is blocked for sending packets out; this information is known from the interface table.
The bridging module maintains the following tables:
  • Unicast FDB, which is an exact match table. The match field is 'MAC address'.  One FDB is maintained for each bridge instance.
  • Multicast FDB, which is also an exact match table.  The match field is 'Multicast MAC address'. One Multicast FDB is maintained for each bridge instance.
  • Interface Index Table, which is indexed by the incoming interface identifier. Typically global to the entire datapath.
  • Bridge Instance Table, which is indexed by the bridge instance. Typically global to the entire datapath.
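The learning and forwarding behavior above can be sketched as a toy learning bridge in Python (one bridge instance, no interface-table learning/forwarding flags; port names are made up).

```python
class LearningBridge:
    """Toy learning bridge: FDB keyed by MAC address (sketch)."""
    def __init__(self, ports):
        self.ports = set(ports)
        self.fdb = {}                            # MAC -> port (learning table)

    def receive(self, smac: str, dmac: str, in_port: str):
        """Process one frame; return the sorted list of output ports."""
        self.fdb[smac] = in_port                 # learn source MAC -> port
        if dmac == "ff:ff:ff:ff:ff:ff":
            out = self.ports - {in_port}         # broadcast: flood
        elif dmac in self.fdb:
            out = {self.fdb[dmac]} - {in_port}   # known unicast
        else:
            out = self.ports - {in_port}         # unknown unicast: flood
        return sorted(out)                       # never the incoming port
```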
NAT/Firewall Datapath: NAT and firewalls typically require session processing.  The session management part of NAT/firewalls is typically implemented in the datapath.  Sessions are typically 5-tuple sessions (source IP, destination IP, protocol, source port and destination port) in the case of TCP and UDP.  A session is initiated by the client and terminated by the server.  NAT/firewall devices, being in between the client and server machines,  see both sides of the connection - client-to-server and server-to-client.  Hence, in these devices,  sessions consist of two flows - C-to-S and S-to-C flows (client-to-server and server-to-client flows).   The 5-tuple values of the C-to-S and S-to-C flows are the same in the case of a firewall session, except that the values are swapped between source and destination.  That is, the source IP of the C-S flow is the destination IP of the S-C flow, and the destination IP of the C-S flow is the source IP of the S-C flow. The same is true of the source and destination ports.  In the case of NAT, it is not as simple: NAT functionality translates the client's SIP, DIP, SP and DP to new values.  Hence, the NAT session manager's flows would have two different sets of 5-tuple values, but both belong to one session.  In summary,  a session maintained by NAT/firewall devices is pointed to by two different flows.

Session processing involves the following steps in the datapath:
  • Packet Parsing and Integrity of headers (L2 header, IPv4/IPv6 header, TCP/UDP/ICMP/SCTP headers) :  Similar step as described in "Routing" section. 
  • IP Reassembly:  Non-first IP fragments will not have a transport header. Since session lookup requires all 5 tuples to match,  if reassembly is not done, non-initial fragments will not get matched to the right flows. As part of IP reassembly, many checks are made for anomalies, and fragments might get dropped as part of these checks.  Once the full transport packet is reassembled,  further processing happens.  Some of the checks done as part of this processing step are:
    • Ensuring that initial and middle fragments are at least of some configured size - this is to detect deliberate fragmentation by a sender intended to make the device spend a large number of CPU cycles, thereby creating a DoS condition.
    • Ensuring that the total IP packet size never exceeds 64K.
    • Taking care of overlapping IP fragments.
    • Ensuring that the data in overlapping fragments is the same. If not, it is considered a deliberate attack, and this processing step drops the packets.
  • Transport Header Integrity Checks:  At this step, transport header integrity is ensured, for example:
    • Ensuring that the packet holds a transport header.
    • Some datapaths may also verify transport checksum.
  • Extraction of fields from the parsed headers:  Mainly SIP, DIP, P, SP and DP are extracted from the network and transport headers.
  • Security Zone Lookup:  Every incoming interface belongs to one of the security zones.  The CP programs a table of entries, with each entry holding an interface ID and the security zone ID it belongs to.  The security zone ID is used in the flow lookup function.
  • Flow Lookup to Get Hold of the Session:  As discussed before, a session consists of two flows.  Flows are typically arranged in hash lists.  When a session is created by the CP,  the CP is expected to put two flows in the hash list, with each flow pointing to its session context.  Sessions are arranged in an array (index table), and a reference to the session (the array index) is put in each flow.   Basically,  there is one hash list for flows and an index table for sessions.   The 5 tuples and the inbound security zone ID are used to match the flow.  If no matching flow is found,  the packet is sent to the CP.  If the flow is found,  processing continues with the next step.
  • Anomaly Checks:  Once the flow, the associated session context and the complementary flow are determined,  this processing step checks for anomalies using packet header contents and state variables maintained in the session.  A few things that come to mind:
    • Ensuring the TCP sequence number of the current packet is within a reasonable range (example: 128K) of the previous packet's sequence number.  This is typically done to ensure that a man-in-the-middle did not generate the TCP packet, mainly a TCP packet with the RST bit set.   A TCP RST packet can drop the session not only in the device, but also on the receiving client/server machine.
  • Packet Modification:  This step involves modification of the packet.  Packets belonging to sessions created by the firewall CP may not undergo many changes. In the case of NAT, several modifications are possible based on the session state created by the CP.  Some of them are:
    • Source NAT and Destination NAT modifies the SIP and DIP of the packet.
    • NAPT may modify both SP and DP of TCP and UDP packets.
    • To ensure that the modified packet has a unique IPID,  the IPID also gets translated.
    • The TCP MSS value might be modified to a smaller value to avoid fragmentation in later steps.
    • TCP sequence numbers may also get translated. This typically happens when the CP has done some packet processing before, such as SYN flood protection using the SYN cookie mechanism, or due to some ALG processing.
  • Packet Routing and Packet-out:  These steps are similar to IP Routing module.
Upon analysis of the above steps, one can find the following tables in session processing:
  • Interface to security zone mapping table, which is an index table.
  • Flow hash table, which is an exact match hash table.
  • Flow to session ID table, which is an index table.
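A toy Python sketch of these tables follows, showing one session referenced by its two flows. The addresses, ports and zone names are made up, and a Python dict stands in for the flow hash lists a real datapath would use.

```python
# Toy NAT/firewall session tables: one session context pointed to by two
# flow entries (C-to-S and S-to-C), matched on 5-tuple plus security zone.

sessions = []        # index table: session ID -> session context
flow_table = {}      # exact-match flow table: (5-tuple, zone) -> session ID

def create_session(c2s, s2c, zone_in, zone_out):
    """c2s and s2c are the 5-tuples as seen on the wire in each direction.

    For a plain firewall session s2c is just c2s with source/destination
    swapped; for NAT the two tuples differ, as discussed above.
    """
    sid = len(sessions)
    sessions.append({"c2s": c2s, "s2c": s2c, "packets": 0})
    # Both flow entries point at the same session context.  A real
    # datapath would want this pair-insert to be atomic.
    flow_table[(c2s, zone_in)] = sid
    flow_table[(s2c, zone_out)] = sid
    return sid

def flow_lookup(five_tuple, zone):
    sid = flow_table.get((five_tuple, zone))
    if sid is None:
        return None                      # miss: punt the packet to the CP
    sessions[sid]["packets"] += 1
    return sessions[sid]
```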
Observations & suggestions for a new version of the Openflow specification:

All the datapaths listed above use tables heavily.  Contexts created in each table are used by the actions being performed.  It is not clear whether the Openflow specification group is targeting coverage of any type of datapath.  If so, there are a few things to notice and improve.

Packet Reinjection:  Each datapath might have more than one table.  Typically, a table miss results in the packet going to the control plane (the controller in the case of Openflow).  Once the control plane acts on the packet and creates the appropriate context in the table,  it would like packet processing in the datapath to resume from where the miss occurred.  Sometimes processing should restart at the table where the miss occurred; sometimes the CP has already done the operations on the packet, and processing needs to start in the datapath from the next table as specified in the newly created context.  The protocol between the control plane and the datapath (Openflow) needs to provide this flexibility to the control plane.

Atomic Update and Linkage of Flows: The control plane at times needs to create multiple contexts in tables atomically.  For example, a NAT/firewall control plane would need to create the contexts for two flows in an atomic fashion.  A commit-based facility would help in achieving this.  I think the Openflow protocol needs to be enhanced to enable this operation.
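A commit-based batch insert could look like the following sketch (illustrative Python, not Openflow syntax): validate the whole batch first, then install it, so the two flows of a session appear together or not at all.

```python
class FlowTable:
    """Flow table with an all-or-nothing batch insert (sketch)."""
    def __init__(self):
        self.entries = {}

    def commit(self, new_entries: dict):
        # Validate the entire batch before touching the table, so a
        # partially-inserted session can never be observed.
        for key in new_entries:
            if key in self.entries:
                raise ValueError(f"conflict on {key}")
        self.entries.update(new_entries)
```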

Parsing of Headers & Extraction of Fields:  Since different datapaths may need to be exercised on a packet, one-time parsing of the packet may not be sufficient. Each datapath, or even each table, might be preceded by a parsing & extraction unit. Different fields are required in different paths.  There could be tunnel headers which need to be parsed in some datapaths to get to the inner packet.  Each datapath might have its own requirements on which fields to use for further processing.  That is, some datapaths might need to work on the IP header of the tunnel, and others might need to use the inner IP packet's fields.  Hence, I believe it is good to have a parsing & extraction unit for each table. In cases where the datapaths don't require different parsing & extraction units,  the controller simply would not configure them.  Some hardware devices (Multicore SoCs) I am familiar with support multiple parsing & extraction unit instances.  So, this is not a new thing for hardware ASICs/Multicore SoCs.

Pre-Table-Lookup Actions:  Many times, there is a requirement to ensure that the packet is good.  Packet integrity checks are very important during packet processing. Having some actions defined on a per-table basis (not per flow) is a good way to do these kinds of checks.  There may be a requirement in the future to do even some modification to the packet before further parsing/extraction and the eventual table lookup.  A pre-table action list is one way to provide these facilities.  One can argue that it is possible to use another table before the packet is processed at the current table and have context-specific actions on that table.  Though that is a good thought,  it might result in inefficient usage of table resources in hardware.

Flexible Metadata:  Metadata is used to communicate information among different tables and their associated actions. There are two types of metadata required in datapaths: some metadata is used as a lookup key in further tables, and some metadata is required by actions. The Openflow specification defines metadata, but it is limited to one word.  I believe this is insufficient, and fixing the metadata size in the standard is also not good. The controller could define the maximum metadata size required and program that size into the datapath; the datapath can then allocate additional space to store metadata on a per-packet basis using the size information programmed by the controller.  Since there could be multiple types of datapaths and multiple tables within each type,  the controller may categorize metadata for different purposes, and each metadata field and field size can be referred to later by tables for lookups and associated actions.  I believe Openflow should provide that kind of flexibility: set up the size of the metadata required once in the datapath,  a language to use different fields of the metadata in table lookups, and a language to define actions which can use different fields of the metadata or set/reset some of them.

Actions:  Each type of datapath has its own actions - for example, firewall session management datapath actions are different from the types of actions routing or bridging require.  Hence, there would eventually be a good number of actions in the datapath, and more would be required in the future.  In addition to defining more actions in Openflow or associated specifications,  it should also provide the flexibility to add new actions without having to create new hardware.  That is, Openflow might need to mandate that the datapath hardware or software be able to take on newer actions in the future.  To enable this,  it is required to define some kind of language for defining new actions.  Since datapaths may be implemented in hardware ASICs,  the language should not have too many constructs/machine codes; a simple language is what is needed.  I thought a simplified LLVM could be one option.

VXLAN (Virtual eXtensible LAN) - Virtual Data Centers - Tutorial

VMWare, with contributions from Cisco, Citrix, Broadcom and Arista Networks, released an IETF VXLAN draft, a protocol to enable multiple L2 virtual networks over a physical infrastructure.  Please find that draft here.

The draft document clearly defines the problem statement and how VXLAN solves those problems.  I will not repeat all of it here, but some important background points are worth mentioning.


It is well known that Data Center and Service Provider networks are increasingly being enabled for multiple tenants. Even enterprises are supporting multiple tenant networks for isolation - some examples being isolation between divisions, for research & development, and for trying out new services/networks.

Employees of a particular division used to be confined to one building in the 1990s.  With globalization, that is no longer true.  Employees of one business unit are spread not only across multiple buildings, but also across countries and continents, and each building or location has employees from multiple business units. Hence it is necessary to create a virtual LAN over the same physical network, where the virtual LAN spans buildings, countries and continents.  A virtual LAN gives related machines the locality of an L2 network even though they are distributed across multiple physical networks.

Virtual LANs are such a common concept by now that there is no need to emphasize the need for them for tenants in Data Center, Service Provider and multi-dwelling environments.  Operators would not like to create new physical infrastructure, or modify existing infrastructure, every time they sign up a new tenant. Hence operators of these networks support virtual networks for tenants for the purposes of isolation, flexibility and effective usage of physical resources.  Virtualization of servers solves this issue on the compute side.

Virtualization of physical servers is now well understood by network operators and tenants, and it provides elasticity on the compute side.  Based on tenant requirements and usage, the number of virtual machines can be increased or decreased.  The virtual machines hosting one tenant service can be spread across multiple physical machines, and a single physical machine may hold virtual machines belonging to multiple tenants.  That is, a tenant service can span multiple virtual machines, which in turn can span multiple physical servers.  Basically, the virtual server is now treated the way the physical server was in non-virtualized architectures.

If a Data Center supports isolated networks and resources, then it is enabled for multi-tenancy, and such Data Centers are called Virtual Data Centers (VDCs).  Implementation of a VDC requires multi-tenancy across all active devices in the Data Center. Compute server and storage virtualization is well understood and already done to a great extent.  Network service devices too have multi-tenancy today; the physical network service devices popular in Data Center markets are also enabled for multi-tenancy.  These service devices tend not to implement multi-tenancy using virtualization the way compute servers do today.  Rather, they tend to support multi-tenancy using "data virtualization" or, at most, container-based virtualization, for scalability purposes.

Tenant-ID communication among Data Center equipment traditionally happens using VLAN IDs.  A VLAN ID, or a set of VLAN IDs, is assigned to each tenant.  The front-end equipment figures out the tenant identification (VLAN ID) based on IP addresses (the destination IP address of incoming packets from the Internet).  From then on, communication with the compute servers (web servers, application servers, database servers) and storage devices happens via this VLAN ID. Reverse traffic (outbound traffic to the Internet) also travels on VLAN IDs until the packets reach the front-end device, which then removes the VLAN header and sends the packets out onto the Internet.

L2 switches, L3 switches and other network service devices use the VLAN ID to identify the tenant and apply the appropriate tenant-specific policies.

Why VXLAN if VLANs are good?

VLANs are not good enough, for the following reasons:
  • VLANs are fixed in number - the VLAN header defines 12 bits for the VLAN ID, which means only 4K VLAN IDs are possible.  Even with the best-case assumption of one VLAN ID per tenant, a Data Center can support at most 4K tenants.
  • VLAN is mostly an L2 concept.  Keeping a VLAN intact across L2 networks separated by L3 routers is not straightforward and requires some intelligence in the L3 devices.  Especially when tenant networks need to be extended to multiple geographic locations, extending a VLAN across the Internet requires newer protocols (such as TRILL).
  • If tenant traffic itself requires VLANs for other reasons, double or triple tagging may be required. Though 802.1ad tagging can be used for tenant identification with 802.1Q tagging for tenant-specific VLANs, this too may require changes to existing devices.
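The ID-space limitation in the first bullet is simple arithmetic - compare the 12-bit VLAN ID with the 24-bit network identifier that VXLAN introduces:

```python
vlan_ids = 2 ** 12    # 12-bit VLAN ID field
vxlan_vnis = 2 ** 24  # 24-bit VXLAN Network Identifier

print(vlan_ids)     # 4096 - at most ~4K tenants with one VLAN per tenant
print(vxlan_vnis)   # 16777216 - the "16M virtual networks" figure
```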
VXLAN is a new tunneling protocol that works on top of UDP/IP.  It does require changes to existing infrastructure to understand the new protocol, but it does not have the limitations of VLAN-based tenant identification.  Since an L2 network is created over an L3 network, a VDC can now extend not only within one Data Center/Enterprise location, but across different locations of Data Center/Enterprise networks.

Some important aspects of VXLAN protocol:
  • VXLAN tunnels L2 packets between VTEPs (VXLAN Tunnel End Points).
  • VXLAN encapsulates L2 packets in UDP/IP.
  • VXLAN defines its own VXLAN header.
  • The UDP destination port indicates the VXLAN protocol; the port number is yet to be assigned.
  • 16M virtual networks are possible.
  • VTEPs are typically end-point servers (compute servers, storage servers) and L3-based network service devices.  L2 switches need not be aware of VXLAN.  Some L2 switches (ToRs) may add this intelligence to proxy VXLAN functionality on the ports connected to compute servers, mainly to support non-VXLAN-based servers/resources.  The VXLAN IETF draft calls this a VXLAN Gateway.
  • VXLAN Gateways are expected to translate tenant traffic between non-VXLAN networks and VXLAN networks.  The front-end device described above is one example of a VXLAN Gateway; it might map a public (destination) IP address to a VXLAN tenant (VNI - VXLAN Network Identifier).
  • VXLAN defines the VNI (VXLAN Network Identifier), which identifies the VDC instance.  This takes the place of the VLANs used in Data Center networks today.
  • It is expected that one management entity within an administrative domain creates the Virtual Data Centers (tenants).  This involves assigning a unique VNI to the tenant network (VDC instance), associating the tenant's public IP addresses, and assigning a multicast IP address for ARP resolution across the VTEPs in a VDC.
As with VLAN-based tenant identification, VXLAN-based tenant networks can have overlapping internal IP addresses, but the IP addresses assigned to virtual resources within a single VDC must be unique.

Some aspects of packet flow:

Let us assume following scenario:

A tenant is assigned VNI "n" and multicast IP address "m".  Two VMs (VM1 and VM2) are provisioned on two different physical servers (P1 and P2) located in two different cities: VM1 on P1 and VM2 on P2.  P1 and P2 are reachable via IP addresses P1a and P2a, and the VMs have private IP addresses VM1a and VM2a.

An Ethernet packet from VM1 to VM2 would be encapsulated in UDP/IP with a VXLAN header as follows:

The VXLAN header carries VNI "n" and some flags.
The UDP header has a source port assigned by the system and the standard destination port.
The IP header is generated with P1a as SIP and P2a as DIP.
The outer Ethernet header has the local MAC address as SMAC, and as DMAC either the MAC corresponding to P2a from the ARP resolution table or the local gateway's MAC address.
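As a rough sketch, based on my reading of the draft's 8-byte VXLAN header (an 8-bit flags field whose I bit marks a valid VNI, reserved bits, and a 24-bit VNI), building and parsing the header could look like this:

```python
import struct

def build_vxlan_header(vni):
    # First 32-bit word: flags in the top byte (0x08 = I bit, VNI valid),
    # remaining bits reserved. Second word: VNI in the top 24 bits,
    # low byte reserved.
    return struct.pack("!II", 0x08 << 24, (vni & 0xFFFFFF) << 8)

def parse_vxlan_header(data):
    word1, word2 = struct.unpack("!II", data[:8])
    has_vni = bool(word1 & (0x08 << 24))
    return has_vni, word2 >> 8

hdr = build_vxlan_header(5000)   # VNI "n" = 5000 for the example tenant
```

The real encapsulation would prepend outer Ethernet/IP/UDP headers in front of this, with the inner Ethernet frame following it.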

VTEP VXLAN functionality would be implemented in the NIC cards of servers or in hypervisors. It would also be part of L2 switches, to support existing servers and their VMs.  VMs within the servers need not be aware of VXLAN, so existing VMs just work.  VTEP functionality typically maintains a database (created by the management entity) mapping the virtual NICs of VMs to VNIs - this could be as simple as a table of vNIC MAC address versus VNI.  The VTEP is also provisioned with the associated multicast IP address (a VNI to multicast address table).  When the VTEP gets a packet from a VM's vNIC, it figures out all the information required to frame the UDP/IP/VXLAN headers.  The VNI is known from the table provisioned by the management entity.  The DIP of the tunnel IP header is determined from the learning table (entries mapping remote VM MAC addresses to remote VTEP tunnel IP addresses).  The VTEP is expected to keep this table updated based on packets coming from remote VTEPs, using the SIP of the tunnel header and the SMAC of the inner Ethernet packet.
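The learning table described above is essentially a MAC-to-VTEP mapping. A minimal sketch (class and method names are my own):

```python
class VtepLearningTable:
    """Maps inner (remote VM) MAC addresses to remote VTEP tunnel IPs."""
    def __init__(self):
        self.mac_to_vtep = {}

    def learn(self, inner_smac, tunnel_sip):
        # Called on every packet decapsulated from a remote VTEP: the inner
        # SMAC paired with the outer tunnel SIP teaches us where that VM lives.
        self.mac_to_vtep[inner_smac] = tunnel_sip

    def lookup(self, inner_dmac):
        # Returns the remote VTEP IP, or None (triggering the ARP/flood path).
        return self.mac_to_vtep.get(inner_dmac)

table = VtepLearningTable()
table.learn("00:00:00:00:00:02", "203.0.113.2")   # VM2's MAC behind P2a
```

A real implementation would also age entries out and bound the table size, which is exactly where the security concerns discussed later come in.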

This looks simple, but there are two things to consider.  First, how do VMs get hold of the DMAC address corresponding to a peer VM's IP address?  A local ARP request does not work, as it can't cross the local physical L2 domain.  Second, how does the local VTEP get to know the remote VTEP IP address if there is no matching learning table entry?

Let us first discuss how an ARP request generated by a VM gets satisfied.  A broadcast ARP request generated by a VM should somehow reach all VMs and devices in the virtual network.  That is where multicast tunnels are used by the VTEP.  The source VTEP, upon getting an ARP request from a local VM, tunnels it in a multicast packet whose address is derived from the VNI-to-multicast-IP-address table. In fact, the VTEP encapsulates any broadcast/multicast Ethernet packet sent by a local VM in a multicast tunnel.  All VTEPs are expected to subscribe to the multicast address to receive multicast tunnel packets.  A receiving VTEP decapsulates the packet, finds all the VMs (vNICs) corresponding to the VNI, and sends the inner packet onto those vNICs. The right VM responds with an ARP reply, and the remote VTEP sends the ARP reply back to the source VTEP.

The second complication is discovering the remote VTEP IP address when there is no matching entry in the learning table.  This can happen when an entry ages out.  If the VMs are configured with static ARP tables, ARP requests will not be generated by the VMs, so there may be no opportunity to learn the remote VTEP IP address for remote MAC addresses. In this case, the source VTEP, upon receiving a unicast packet from a local VM, may need to generate an ARP request itself, just as a VM normally would. This ARP request is sent in the multicast tunnel as described above, triggering an ARP reply from the remote VM that gets encapsulated by the remote VTEP.  The local VTEP can use this reply to update its learning table.  Since this process may take some time, the VTEP may also need to buffer the packet until the learning is done.
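The miss path described above - buffer the packet, send a VTEP-generated ARP over the multicast tunnel, flush the buffer once the reply teaches us the remote VTEP - can be sketched as follows; all names are illustrative:

```python
from collections import defaultdict

class VtepMissHandler:
    def __init__(self, learning_table, send_multicast):
        self.table = learning_table          # dict: inner DMAC -> remote VTEP IP
        self.send_multicast = send_multicast # callback into the multicast tunnel
        self.pending = defaultdict(list)     # DMAC -> packets buffered on a miss

    def send(self, dmac, packet):
        vtep_ip = self.table.get(dmac)
        if vtep_ip is None:
            self.pending[dmac].append(packet)         # buffer until learned
            self.send_multicast(("proxy-arp", dmac))  # VTEP-generated ARP
            return None
        return (vtep_ip, packet)                      # unicast-tunnel this

    def learned(self, dmac, vtep_ip):
        # Called when the encapsulated ARP reply updates the learning table.
        self.table[dmac] = vtep_ip
        return [(vtep_ip, p) for p in self.pending.pop(dmac, [])]

sent = []
handler = VtepMissHandler({}, sent.append)
handler.send("00:00:00:00:00:02", "pkt1")   # miss: buffered, proxy ARP sent
flushed = handler.learned("00:00:00:00:00:02", "203.0.113.2")
```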

Problems and possible solutions to VXLAN based VDCs:
  • I believe IPv6 is needed for tunnels; IPv4 tunnels are okay in the short term.  As you will have gathered by now, each tenant requires one multicast address, and this address needs to be unique across the Internet - that is, unique across network/Data Center operators. If this mechanism becomes popular, multicast addresses may run out very soon.
  • Security is very critical.  It is now possible for man-in-the-middle or even external attackers to corrupt the learning tables.  VNIs will become known to attackers eventually.  Multicast or unicast packets can be generated to corrupt the learning table, or to overwhelm it, thereby creating a DoS condition.  I think IPsec (at least authentication with NULL encryption) must be mandated among the VTEPs. It is understandable that IKE is expensive in NIC cards, hypervisors and VXLAN gateways, but IPsec itself is now available in many NIC cards and multicore processors.  The management entity can take on the job of populating the keys in the VTEP end points, similar to its provisioning of the multicast address for each VNI.  I believe all VTEPs will be controlled by some management entity, so it is within the realm of possibility to expect it to populate the IPsec keys in each VTEP.  For multicast tunnels, the key needs to be the same across all VTEPs.  The management entity may rotate the key often to ensure security is not compromised if attackers get hold of it. For unicast VXLAN tunnels, the management entity can either use the same key for all VTEPs, as with multicast, or use pairwise keys.
I think the above problems are real.  It would be good if the next draft of VXLAN provides some solutions to them.

Thursday, December 15, 2011

Embrane - Is this SDN Play?

Recently, I came across a company called Embrane while doing a Google search on SDN.  Then I saw a press release that Embrane had announced a product. I thought I would check it out and see how far it goes in SDN.  I went through the whitepaper published on Embrane's website.  If you are interested, you can find that paper here.

My understanding of Embrane solution:

When I first read the white paper, I was not sure whether the Embrane product is a platform/framework to instantiate any type of network service virtual appliance from any vendor, or whether Embrane provides network services as virtual appliances itself. After reading the whitepaper and going through their website, it appears that Embrane's main focus is to deliver the framework for any virtual network service appliances, including third-party virtual appliances.

Embrane's architecture mainly consists of four components, each installed as a separate VM.
  • Elastic Service Manager (ESM):  Typically there would be one VM of this type. The Data Center operator works on this VM to provision Distributed Virtual Appliances.
  • Distributed Virtual Appliances (DVA):  Each DVA is a logical set of VMs, and there are three kinds of VMs within each DVA.  Even though there are multiple VMs within one logical DVA, it can be treated as one appliance for all practical purposes.  As I understand it, the Data Center operator needs to instantiate one DVA per tenant per network service.  If two types of network service are required, there would be 2 DVAs.  So, if there are X tenants in a Data Center and each tenant requires Y network services (ADC, Firewall, Web Application Firewall, WAN Optimization etc.), there would be a need for X * Y DVAs.  Now, coming to the three kinds of VMs within each DVA:
    • Network Service Virtual Appliances (NSVA):  A DVA can have multiple virtual appliances.  These implement the actual functionality of a network service such as ADC, Firewall, WOC etc.  Obviously, there must be at least one network service VA in a DVA, and multiple VAs can be instantiated by the ESM for scaling performance (scale-out).
    • Data Plane Dispatcher (DPD):  There is one DPD in each DVA.  The DPD is what actually distributes traffic across the multiple NSVAs for linear performance scaling.
    • Data Plane Manager (DPM):  One DPM VM in each DVA.  The DPM is expected to configure the NSVAs and DPD in the DVA on behalf of the ESM.  Though it is not clear, I am assuming it ensures that configuration integrity is maintained across all NSVAs.  It appears this is the only VM that requires persistent storage, so I am guessing it might store the audit and system logs generated by the DPD and NSVAs in persistent memory.
If the network service is an ADC (Application Delivery Controller), the DVA can be viewed as providing one more level of load balancing: the DPD acts as a load balancer in front of multiple ADCs, and the ADC itself, as we know, acts as a load balancer in front of servers.  This makes sense, as ADCs have become complex in the recent past and their compute power requirements have gone up, so one more layer of load balancing is indeed required.  In current Data Center deployments this is achieved using L2 switches, which can balance load across multiple external devices based on the hash of defined fields in the L2/L3/L4 headers.
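The hash-based spreading that L2 switches do can be sketched in a few lines; the choice of CRC32 over the 5-tuple is illustrative, since real switches hash configurable header fields in hardware:

```python
import zlib

def pick_appliance(five_tuple, n_appliances):
    # A stable hash of the 5-tuple picks one appliance, so every packet of a
    # given session deterministically lands on the same device.
    key = ",".join(str(f) for f in five_tuple).encode()
    return zlib.crc32(key) % n_appliances

ft = ("10.0.0.1", "10.0.0.2", 6, 40000, 80)   # SIP, DIP, proto, SP, DP
idx = pick_appliance(ft, 4)
```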

I have detailed how L2 switches can be used to distribute traffic across multiple devices of a cluster.  Please check that out here.

My views:

It appears that the DPD functionality is similar to what I described in my earlier post.
Since the DPD is a software-based distributor, I expect it will not have the limitations of L2-switch-based load distribution. As we all know, many network services work with sessions (typically 5-tuple based - SIP, DIP, P, SP, DP) to store state across packets. Any load distributor is expected to take care of this by sending all packets of a session to only one device in the cluster.  If this is not done, there would be a lot of communication among the devices (virtual appliances) within the cluster, which may eliminate the benefit of having multiple virtual appliances.  In my view, the DPD should distribute sessions (not packets blindly) across multiple virtual appliances.  Since it is a software-based solution, it can go one step further and ensure that all sessions belonging to one application session are sent to the same virtual appliance.  VoIP based on SIP is one example, where there can be 3 UDP sessions corresponding to one application session; DPD-like devices need to ensure that the traffic of all three sessions goes to one device (virtual appliance).  Detection of the 5-tuples of data connections is only possible if the DPD supports ALGs (Application Level Gateways).  Since there will be more ALG requirements in future, the challenge is for the "load distributor" vendor to keep providing these ALGs, and/or to open up the DPD architecture for third-party vendors to install their own ALGs, thereby maintaining the SDN spirit.

As described in the same earlier post, configuration synchronization among the network service devices (NSVAs) is one important aspect of cluster-based systems.  I guess the DPM is what takes care of this in the Embrane solution.

Overall, this architecture is good - it replicates the physical solution as a cloud solution.  It is well suited to environments where Data Center operators don't allow customers to deploy physical appliances.

It does not appear to be Openflow based, but it can still be considered part of SDN, as it allows third-party network service virtual appliances in its framework.

Challenges I see in Embrane solution:

Embrane might have the following features already; since I did not find any information on them, I thought it worthwhile to mention them.  I believe the following features are required in DPD-like load distributors.

As I described above, classifying packets into application sessions and ensuring that all packets of an application session go to the same network service virtual appliance is one big challenge for this kind of equipment.  I know from personal experience that it is quite challenging to support multiple ALGs, especially ensuring interoperability with both clients and servers.

Some network deployments may see traffic on tunnels such as IP-in-IP, GRE, GTP-U etc., where traffic corresponding to many sessions is sent on very few tunnels.  To ensure distribution across multiple NSVAs, the "load distributor" needs the flexibility to dig deep into the tunnels and classify packets based on the inner packets.

Performance, I believe, would be the biggest challenge for a virtual-machine-based "load distributor". Classifying packets, distributing sessions, and sending subsequent packets of each session to the selected virtual appliance requires maintaining millions of sessions in the load distributor.  What I hear is that single-CPU virtual machines on VMware or Xen-type hypervisors typically give 1 Gbps of performance for small packets.  More processors can be added to the load distributor VM, but achieving performance in the tens of gigabits may be very challenging.

My 2 Cents:

I believe that load distributors need to go beyond virtual machines. Taking advantage of Openflow-based switches to forward packets would be the solution of choice in my view.  The load distributor VM can do the ALG-type functionality, select the network service device for new sessions, and create the appropriate flows in the Openflow switches, leaving the switches to forward subsequent packets to the network service devices/virtual appliances.  That is, the Openflow switch forwards packets to the load distributor only when there is no matching flow.  The traffic to the load distributor is then small, and one virtual machine can process the load.  Since Data Center operators normally have switches (eventually Openflow switches), this mechanism would work just fine in cloud environments.

Sunday, December 11, 2011

Software Defined Networking - Java Role

The controller component of Openflow-based SDN is supposed to hold most of the networking intelligence.  One might have gathered by now that SDN is expected to provide programmability of network devices rather than simple configurability.  The controller is going to be a complex entity in SDN, and software engineering principles tell us that complexity is only manageable with modularity.  Complexity and the modularity it demands are not new in web-based (server-side) programming projects.  So much innovation has already happened in programming and modularity in server operating systems and middleware packages; those lessons can be used to manage the controller's complexity too.

Many programming languages and associated frameworks are popular in server-side programming - Python, Perl, PHP, Ruby and Java among them.  Many layers of frameworks have been defined to ease application programming.  One thing to observe is that very few complex server-side projects are implemented in C or C++.  That tells us something.

The modularity, the many frameworks, and the large number of libraries available in these languages make application development fast, maintainable and easily extendable.  This allows agile development and faster implementation of innovations at lower development cost.

This has been realized in the SDN world too, and one can already see controller network operating systems in Java and Python today.

In my view, Java will be the software platform of choice on controllers.  The main reasons: a big pool of talent, great middleware packages, a large number of libraries, the backing of many big companies, wide usage, good performance, and access to hardware accelerators.

I have found one controller by the name of BEACON, a controller network operating system based on Java.  It uses the modular architecture of Java frameworks - the Spring framework and OSGi for developing applications as service bundles.  Even the controller side of the Openflow protocol is implemented in Java. See the BEACON link for more information.

I already hear that some vendors are developing MPLS LDP and routing protocols in Java.

Do 'C' network programmers need to worry?

Not in the short term.  But if Openflow-based SDN picks up, there will be more work on the controller side than on the device side, and if controllers are not based on C, there will be less demand for C programmers.  Since network programmers tend to have a better understanding of the networking field, low-level programming and protocol knowledge, their expertise will continue to be sought after - but networking programmers now need to know much more than C.  Employers will expect future engineers to know as much Java/Python as C.

Wednesday, December 7, 2011

ForCES (Forwarding and Control Element Separation) and Openflow 1.1 - Contrasting them

At a very high level, both ForCES and Openflow protocols separate the control plane and the data plane.  Both protocols are intended to drive Software Defined Networking.

Some terminology differences:

  • Software Driven Networking versus Software Defined Networking:  Both are the same; just different terms are used.  ForCES uses Software Driven Networking; Openflow was created in the context of Software Defined Networking.
  • ForCES uses the terms Control Element and Forwarding Element.  Openflow tends to use Controller (which contains the control plane), and Switch or Datapath for the forwarding element.  Some people also call the datapath the fastpath or dataplane.

Though at a high level both ForCES and Openflow can be considered part of Software Defined Networking, there is no need for two different protocols.  Eventually they need to be consolidated into one, or one of them will die.  I think the two protocol definitions came into existence due to conceptual differences over how functionality is separated between control plane and data plane.

I think it may be difficult to bridge the conceptual differences, but I believe there are some good things in the ForCES protocol that could be adopted into Openflow to make it more complete.

First, my view of the conceptual differences between the ForCES and Openflow protocols:

ForCES expects the datapath to have a set of LFBs (Logical Functional Blocks); one forwarding element can contain multiple LFBs.  Each LFB is defined by its inputs, its outputs, and the components within it that the Control Element can configure.  Components within an LFB might include multiple tables of flows, configuration information etc.  Each LFB is identified by a class ID, and since there can be multiple instances of the same LFB, a unique LFB instance is identified by the LFB class ID plus an instance ID.  The ForCES suite of protocols will define several standard LFBs.  As long as vendors develop LFBs to the standards, a CE will be able to work with datapaths from different vendors.

The ForCES CE mainly connects several LFBs to create a packet-flow topology that achieves the needed functionality.  The CE also programs each LFB, creating flows etc.  ForCES lets LFBs output metadata, and subsequent LFBs can make use of that metadata in addition to their other inputs.
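My mental model of this LFB wiring, as a toy sketch - the Lfb class and the two example blocks are invented for illustration, and real LFBs are far richer:

```python
class Lfb:
    """One Logical Functional Block instance, identified by class+instance."""
    def __init__(self, class_id, instance_id, fn):
        self.class_id = class_id
        self.instance_id = instance_id
        self.fn = fn          # per-LFB packet handler: (packet, metadata) pair
        self.next_lfb = None  # set by the CE when it builds the topology

    def process(self, packet, metadata):
        packet, metadata = self.fn(packet, metadata)
        if self.next_lfb:
            return self.next_lfb.process(packet, metadata)
        return packet, metadata

# An LFB that attaches metadata, and one downstream that consumes it.
classifier = Lfb(1, 1, lambda p, m: (p, dict(m, vlan=p.get("vlan", 0))))
forwarder = Lfb(2, 1, lambda p, m: (dict(p, out_port=m["vlan"] % 4), m))
classifier.next_lfb = forwarder   # CE-programmed topology: classifier -> forwarder

pkt, meta = classifier.process({"vlan": 10}, {})
```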

The main point above is that ForCES expects logical functional units to be defined by the datapath.

Openflow differs conceptually on this point.  The Openflow protocol does not expect any LFBs; it expects the datapath to have several tables, and the controller creates flows in those tables with different actions.

It appears that Openflow expects even lower-level implementation from the datapath.  LFB-like functionality is expected to be done by the controller, which has the flexibility to define logical functional units itself by programming tables and flows.  That is, the controller might divide the tables into multiple logical units, and each set of tables with flows and associated actions results in what ForCES calls an LFB.

So, in essence, Openflow-based SDN expects very low-level support from datapaths.

Having said that, I feel there will be a requirement to define more and more actions in the Openflow protocol.  To extend the functionality of the datapath, ForCES expects more LFB specifications to be created; in the case of Openflow, I expect more actions to be defined in the protocol in future.

Some Pros and Cons:

In Openflow, tables can be used by the controller for whatever purposes, and the purpose of a table can easily be changed by the controller based on network deployment requirements.  In ForCES, if a particular deployment does not require certain LFBs, the resources occupied by those LFBs may not be usable by other LFBs.  So with ForCES there is a possibility of the controller under-utilizing datapath resources.

But ForCES gives structure to the datapath, which can come in handy for modularity, debugging and maintainability.  Maybe there is something to be learnt from the ForCES conceptual model.

One lesson I think the Openflow specification can take from ForCES is to have some sort of action components.  The controller should be able to find out the types of actions a datapath supports.  Also, it would be great if datapaths had some programmability by which more actions could be uploaded from the controller without creating new datapath hardware revisions.  Of course, this requires a common language to represent actions, so as to keep the controller separate from the datapath; datapaths would need to understand this language and program themselves.

Some good things in the ForCES protocol that could be adopted in some fashion into future Openflow protocol specifications:
  • 2PC commit protocol:  This is quite powerful.  In my old job, we did exactly the same thing for a similar problem.  I am sure there are many instances where a controller needs to create several flows in different tables atomically, following the ACID (Atomicity, Consistency, Isolation, Durability) properties.  In addition, I think there will also be a need for ACID support across multiple datapaths, where the controller needs to create flows across multiple datapaths/switches.  That is where a 2PC commit protocol is going to help.
  • SCTP transport rather than TCP:  I personally like SCTP over TCP for message-based protocols.  SCTP is as reliable as TCP and maintains message boundaries.  It is true that SSL over SCTP is not that popular, but other security can be applied (IPsec).  SCTP also makes it easy to provide batching and command pipelining.
  • Extensibility:  ForCES follows a TLV approach in messages between CE and FE.  The nice thing is that TLVs can be nested, giving an XML-like nested architecture in binary form.  That is very extensible without major revisions to the protocol and datapath implementations.  One can argue that the Openflow binary protocol takes less bandwidth - that is true.  I wonder whether a hybrid approach could work: known items in a fixed header and unknown/future items in TLV format.
  • Selected fields per table:  Openflow today has no granularity on a per-table basis; it expects flows in any table to be based on all 15 match fields.  I wish a newer Openflow protocol would specify the fields relevant to each table, so that datapath implementations could allocate appropriately sized flow blocks and utilize memory more effectively.
  • Table type:  Openflow specifies that flows have priority, giving the impression that tables maintain flows in order.  There are several types of functions that don't need an ordered list - routing tables, for example, can use tries, and some tables can be exact-match tables (hash tables can be used there).  In ForCES this issue is not present, as each LFB can implement its own tables in its own fashion as long as the outside behavior is the same.  In Openflow, LFBs are logically part of the controller and the controller only sees tables.  I would like to see the Openflow table definition include an "ACL table" (similar to what Openflow 1.1 defines today), an "LPM table" and an "exact match" table.
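The 2PC idea from the first bullet can be sketched as follows: the controller commits a set of flow modifications only if every affected datapath votes yes in the prepare phase. The FakeDatapath stub is purely illustrative.

```python
class FakeDatapath:
    """Stand-in for a switch that can stage flow changes before applying them."""
    def __init__(self, accepts=True):
        self.accepts = accepts
        self.state = "idle"

    def prepare(self, flow_mods):
        self.state = "prepared" if self.accepts else "refused"
        return self.accepts

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "rolled-back"

def two_phase_commit(datapaths, flow_mods):
    prepared = []
    for dp in datapaths:
        if dp.prepare(flow_mods):   # phase 1: stage the changes and vote
            prepared.append(dp)
        else:                       # any "no" vote aborts the whole set
            for p in prepared:
                p.rollback()
            return False
    for dp in prepared:             # phase 2: make the changes visible
        dp.commit()
    return True
```

Either all datapaths end up committed, or every prepared one is rolled back, which is the atomicity property the bullet argues for.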

Sunday, December 4, 2011

Openflow 1.1 protocol tutorial

The Openflow protocol is a TCP/SSL-based protocol between controllers and switches.  A switch is expected to initiate a connection to the controller - one connection per datapath instance.  If the switch supports X datapath instances, then X connections are established.

As you might have guessed by this time,  Openflow is a protocol that separates the Control Plane and the Data Plane.  The Data Plane is also called the "datapath".

The controller typically tries to get information from the datapath about its hardware/virtual switch configuration, such as the number of tables supported,  the number of ports, port information and the QoS queues supported.  The controller gets this information every time an Openflow connection is accepted from a switch.

Controllers, based on configuration by administrators and on the output of protocols, know what to program in the datapath.  Datapaths maintain tables for storing the flows.  The controller has the flexibility to arrange the tables and use different tables for various purposes. Table traversal for a packet is controlled by the controller.  Each flow the controller establishes in a table can have various actions such as packet header modifications,  the next table to jump to etc.   I will talk about how controllers arrange the tables and do flow management for various applications such as L2 switching, L3 switching and L4 switching in later posts.  Essentially,  tables contain flows. A flow has matching fields and action fields.  Matching fields are used to match the flow in the datapath.  Packet processing continues to the next table, or the packet gets sent out, based on the matching flow's action fields.  Some important items to notice are:
  • Tables are ordered lists, that is, each flow has a priority.
  • Metadata information can be collected across various matching flows in different tables.
  • The next table's flow match can have the metadata information as one of the matching fields.
  • Flows can point to a group of action buckets.
  • Groups can be set up to
    • Duplicate the packets.
    • Load share the traffic across multiple next tables.
In essence,  the idea is that the controller has entire control over the packet path across multiple tables. The datapath need not have any intelligence about L2 switching or IP routing.  As long as it blindly performs the operations specified in the flows in the tables,  things should work fine, as the controller takes care of L2/L3 protocol level processing.
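The multi-table traversal described above can be sketched as below. The dictionaries and field names are my own illustration, not the wire format: each matching flow contributes actions and either jumps to another table or lets the packet go out.

```python
# Hypothetical sketch of multi-table pipeline processing.

def process(tables, pkt, start=0):
    """Walk the tables starting at `start`; return the accumulated actions."""
    table_id = start
    actions = []
    while table_id is not None:
        flows = sorted(tables[table_id], key=lambda f: -f["priority"])
        for flow in flows:                       # highest priority first
            if all(pkt.get(k) == v for k, v in flow["match"].items()):
                actions += flow.get("actions", [])
                table_id = flow.get("goto")      # None => stop, packet goes out
                break
        else:
            return ["to-controller"]             # table miss
    return actions
```

Note how the datapath itself knows nothing about L2 or L3 here; all the intelligence is in how the controller populated the tables.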

Flows in the tables can be programmed by the control plane in two ways - proactive and reactive.  It is also possible that some flows are set up proactively and some reactively.  Proactive flows are set up without any packets.  Reactive flows are set up only when there is a packet.  Initially, there may not be any flows in the tables.  When the datapath does not find a matching flow in a table, and the miss property of the table indicates to send the packet to the controller,  the datapath sends the packet to the controller. The controller is expected to push a flow and send the packet back to the datapath for the datapath to process the packet appropriately.
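The reactive path can be sketched like this. The message names (PACKET_IN, PACKET_OUT, flow install) follow the spec, but the classes and the action decision are my own simplified stand-ins:

```python
# Hypothetical sketch of reactive flow setup on a table miss.

class Datapath:
    def __init__(self, controller):
        self.flows = {}                 # exact key -> action, for simplicity
        self.controller = controller

    def receive(self, pkt):
        key = (pkt["src"], pkt["dst"])
        if key in self.flows:
            return self.flows[key]      # fast path: flow hit
        # Table miss: hand the packet to the controller (PACKET_IN).
        return self.controller.packet_in(self, pkt)

class Controller:
    def packet_in(self, dp, pkt):
        action = "output:" + pkt["dst"]              # decision logic stand-in
        dp.flows[(pkt["src"], pkt["dst"])] = action  # flow install (FLOW_MOD)
        return action                                # PACKET_OUT for this pkt
```

The first packet of a flow takes the slow path through the controller; subsequent packets hit the installed flow directly in the datapath.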

With that background, let us see various protocol messages between controller and datapath:
  • Symmetric messages :  These are the messages that can be initiated by either party (Switch or Controller).
    • Hello message:   Exchanged right after connection setup.
    • Echo Message:  Request/Reply messages - mainly used to check the liveness of the connection and its latency and bandwidth.  Any Echo request message is replied to with an Echo reply message.
    • Experimenter messages:  For future extensibility.
  • Controller to Switch Messages:  These messages are initiated by the controller and switch is expected to reply. 
    • Feature request:   The controller sends this message to the switch to inquire about the switch capabilities.  The feature request is used to get the following information from the switch:
      • Data path ID :  A switch may support multiple data paths - multiple switches. The switch is expected to make as many connections as the number of datapaths (switch instances) it supports.  Based on the connection used to request features,  the switch device is expected to send the corresponding datapath ID.
      • Maximum number of buffers that can be stored in the switch:  Normally, switch devices are expected to send a packet for which there is no matching table entry as a PACKET_IN message to the controller.  To save bandwidth between the switch device and the controller,  the switch can optionally store the buffer in its memory and send a reference to this buffer along with some portion of the packet to the controller. As you might see from the specification,  this buffer reference is returned by the controller in the PACKET_OUT message for the switch device to take action. This parameter indicates the number of buffers the switch device can store while communicating with the controller.
      • Number of tables supported by the switch device on this instance:  As you might see from the Openflow specification,  the controller creates flows in various tables.  This parameter informs the controller how many tables the switch device supports for this datapath (instance).
      • Capabilities supported by the switch instance such as
        • Statistics collection on per flow, table, port, group and Queue basis.
        • Whether switch can do IP reassembly before it extracts the source port and destination ports in case of UDP, TCP and SCTP transport protocols.
        • Whether the switch can extract source and target IP addresses from the ARP payloads.
      • Information about ports that are attached to this datapath (switch instance):  For each port (physical or VLAN), the following information is sent by the switch.
        • Port Name (string), Port ID (integer), HW address (in case the port is used in Layer 3).
        • Port configuration information in a bitmap:  Administratively down,  Drop all packets received from the port,  Drop packets forwarded to this port,  Don't send packets received on this port back onto it.
        • Port state in a bitmap:  Port link down,  Port blocked due to STP, RSTP etc.,  Alive.
        • Port features in a bitmap:  speeds supported (10Mb full duplex, 10Mb half duplex, 100Mb full duplex, 100Mb half duplex, 1Gb full duplex, 1Gb half duplex, 10Gb full duplex, 40Gb full duplex, 100Gb full duplex, 1Tb full duplex and other speeds),  link type (copper/fiber) and link features (auto-negotiation supported, pause and asymmetric pause).  Port features are given in multiple bitmaps:  Current - current features, Advertised - advertised features of the port,  Supported - supported features of the port and Peer - features supported by the peer, that is, the other side of the link.
        • Current bit rate of the port and maximum bit rate of the port.
    •  Configure Switch and Get Switch Configuration messages:  The SET_CONFIG message is used to set the configuration and the GET_CONFIG_REQUEST message is used to get the configuration from the switch.  The types of configuration that can be set by the controller are listed below.
      • IP fragment treatment (no special handling,  drop fragments,  perform reassembly):  The switch is supposed to take the action set by the controller when it receives IP fragments.
      • Action on invalid TTL:  The controller can ask the switch to send packets with invalid TTL to the controller.  TTL value 0 is invalid for an L2 switch and TTL value 1 is invalid for forwarding packets.
      • Length of the packet that needs to be sent to the controller as part of the PACKET-IN message.
    •  Table Modification Message and Get Table Status message:  The modification message is used to modify the properties of a specific table.  The property of a table mainly deals with the action to be taken when there is a "MISS" in the table for an incoming packet. The 'MISS' action can be defined as one of the following:
      • Send the packet to controller.
      • Continue with next table processing.
      • Drop the packet.
    • Flow Entry Management commands (Add/Modify/Delete) of flow entries in a given table: A flow here is not the same as a traditional flow.  To me,  a flow in the Openflow context has the flexibility to use ternary comparisons (using masks).  Each table is also a kind of ordered list, with each flow having a priority.  Normally when one thinks of flows, they are exact match entries, and hence one would think that flows can be arranged in a hash table.  Since these flows are not traditional flows,  a hash table can't be used.   Every entry that gets added has the following information coming from the controller.
      • Table ID: Table to which entry is getting added to.
      • Command: Add, Delete and Modify 
      • Cookie and Cookie mask: Valid for Modify and Delete commands. This is used to update or delete multiple entries at once.
      • Idle Timeout :  Inactivity timeout.  If no packets match this flow for this timeout period,  the flow gets deleted.
      • Hard timeout:   The flow gets deleted after this timeout even if there were packets matching the flow.
      • Buffer Identification:  This is sent by the controller, typically for the add command.  As discussed above,  to save bandwidth between the switch and the controller,  the actual packet is not sent along with the PACKET-IN message; rather, a buffer reference is sent along with truncated packet content.  The controller, while adding the flow, can inform the switch to process the packet referenced by the buffer_id.
      • Output port and out group :  These variables are meant for the DELETE command, which deletes the flows that match these two parameters.
      • Inform Flow Removal:  This flag is set by the controller on this flow to get informed whenever the flow gets deleted or expires.
      • Check for overlapping entries: This flag indicates that the flow should not be added if it overlaps with an existing flow's match information.
      • Match fields:  A flow is identified by a set of match fields.  Match fields include  input_port, Ethernet source MAC address and mask,  Ethernet destination MAC address and mask,  VLAN ID,  VLAN priority (PCP),   Ether-type,  IP TOS (DSCP field),  IP protocol,  source IP address and mask,  destination IP address and mask,  source port, destination port, MPLS label,  MPLS TC,  metadata and metadata mask. There are 15 tuples in total. Except for metadata, everything is part of the packet.  Of course, some packets don't have some fields and those fields are normally ignored during the flow match process.   Some fields can be specified as wildcards. They are: input port,  VLAN ID, VLAN priority,  Ether type,  IP TOS,  IP protocol,  source port, destination port, MPLS label and MPLS TC.
      • Set of instructions to be applied. Each instruction can contain multiple actions.  Following instructions can be associated with the flow.
        • Next Table to lookup :  This instruction of the flow indicates to the datapath that it should continue matching in the next table, whose identity is given along with the instruction.
        • Set Metadata:   This instruction is used to set the metadata and mask for that packet in the datapath.  This metadata may be used by the datapath to match an entry in the next table.
        • Actions on the packet - Three types of action instructions are possible -  "Apply Actions", where actions are applied immediately to the packet,  "Write Actions", where the actions are collected and applied at the end before the packet is sent out, and  "Clear Actions", which clears any actions collected so far.   Note that a given flow can have both "Write Actions" and "Apply Actions". There are many actions that can be collected or applied.  The actions defined by the specification are listed below.
          • Output to Switch Port :  Port on which packet has to be sent out.  This port can be logical port. If the logical port is "CONTROLLER",  then max_len field can be specified in the flow.  This size is used to truncate the packet while sending the packet to the controller using PACKET-IN message.
          • Set VLAN ID :  Replace the existing VLAN ID. Applies to packets that have an existing VLAN tag.  If there are multiple VLAN tags, this action is applied to the outermost VLAN header.
          • Set VLAN Priority:  Replace the existing VLAN priority. If the packet does not have any VLAN tags, this action is ignored by the datapath.  If there are multiple VLAN tags, the outermost VLAN tag's priority is replaced.
          • Set Ethernet Source MAC Address:  Replace the existing Ethernet source MAC address.   If there are multiple Ethernet headers, the outermost Ethernet header is selected for modification.
          • Set Ethernet Destination MAC Address:  Replace the existing outermost Ethernet Destination MAC address.
          • Set IPv4 source address,  Set IPv4 destination address, Set IPv4 ToS bits, Set IPv4 ECN bits:    Replace the appropriate fields in the outermost IP header and update the checksum. UDP, TCP and SCTP checksums are also updated.
          • Set transport source port,  Set transport destination port:  Replace the existing transport source port and destination port with the values given in the action descriptor.  Also updates the checksums.
          • Copy TTL Outwards - Copy TTL from the next-to-outermost header to the outermost header.  The copy can be from IP to IP,  MPLS to MPLS or IP to MPLS.
          • Copy TTL Inwards:  Copy TTL from the outermost header to the next-to-outermost header.  The copy can be from IP to IP,  MPLS to MPLS or MPLS to IP.
          • Decrement IPv4 TTL :  Decrement the TTL of outermost IP header.
          • Set MPLS Label:  Replace the existing outermost MPLS label.
          • Set MPLS Traffic Class:  Replace the existing outermost MPLS TC.
          • Set MPLS TTL:  Set the TTL value of outermost MPLS header.
          • Decrement MPLS TTL:  Decrement the MPLS TTL.
          • Push and Pop VLAN tag:  Push a VLAN tag or pop the outermost VLAN tag.
          • Push MPLS header and POP MPLS header 
          • Apply Group Actions:  Group ID of the group is mentioned along with the action.  Group actions are also applied along with the explicit actions specified in the flow.
          • Set Queue:  This action applies QoS to the packet.  A queue_id reference is passed along with the action by the controller.  The datapath is expected to queue the packet to this queue.  Note that this action can be set not only at the last table, but also at an intermediate or the first table.  If this action is set in "Apply Actions",  then it is very important that QoS is applied and the resulting packet continues processing from where it left off.
    • Group Entry Management Commands:  There is one group table for each datapath (switch instance).  The group table contains multiple groups, each identified by a group ID which is set by the controller as part of group creation.  Each group is a collection of buckets, with each bucket having a set of actions to be applied.  The types of actions that can be set on a bucket are the same as those that can be set on a flow.   Since the group ID is referenced from a flow instruction,  the associated actions of the group are applied based on which instruction it is - apply immediately,  collect or clear.   Each group record contains the selection logic for which bucket to use.  There are four group types - All,  Select, Indirect and Fast failover.  At times you require packets to be both duplicated and load balanced. In those cases,  two groups are required: the first group has type "All" and the bucket actions in each of its buckets point to separate groups of type "Select".
      •  All: The packet is duplicated as many times as the number of buckets.  On each duplicated copy,  the bucket actions are applied. Packet processing of a duplicated copy is similar to the original packet. That is,  the copy jumps to the next table or gets egressed like the original packet.
      • Select:  Packet processing selects one bucket.  This is mainly used for load balancing.  Different packet flows might use different buckets.
      • Indirect:  One bucket is selected for all flows that refer to this group.  This is similar to having one bucket in the group.
      • Fast Failover:  Executes the first live bucket in priority order.  To select the bucket, each bucket is associated with a weight (priority) and a port/port group which indicates whether the bucket is alive.
    •  Port Modification Message : This is used by the controller to modify the behavior of a port. The controller sends the message with the port ID and the associated modification information. Since it is a modification,  the fields being modified are indicated using a corresponding mask.
      • Port configuration bits and mask bits - Administratively down, Drop all packets received from the port,  Drop packets forwarded to this port,  Don't send packets received on this port back onto it.
      • Port features that the controller asks the port to advertise.
    • Queue Configuration Message:  This is meant for doing QoS in the datapath, but the capabilities expected by the controller from the switch are minimal.  It appears that the number of queues is a property of the datapath.  The controller can only configure the shaping bandwidth on a per queue basis.  It is understandable that classification happens elsewhere, but there is no flexibility to create groups of queues, or to set up the scheduling or queue management algorithms.
    • Read State Messages:  This message is used by the controller to get the current state of the datapath, mainly statistics.  The 'type' in the request indicates the type of information requested.
      • Description Statistics:  Data path replies with following information:
        • Manufacturer  Description,  Hardware Description,  Software Description,  Serial number and Readable description of datapath.
      • Flow Statistics: The controller requests a given flow using "Table ID",  "Out Port", "Cookie and its mask"  and flow match fields. It is not very clear what happens if multiple flows match; I think the first matching flow would be selected for the reply. The datapath returns statistics for the given flow, such as
        • How long flow is alive in seconds/nanoseconds.
        • Priority of the flow entry: 
        • Number of seconds before expiration.
        • Packet count, Byte count
        • Match fields and instructions that are part of the flow.
      • Aggregate flow statistics:   Similar to the above, but in this case aggregate statistics are sent. The aggregation is based on the flow statistics of all flows that match.
        • Packet Count, Byte Count.
        • Flow Count - Number of flows.
      • Table Statistics: The controller requests the statistics of a table. The reply is sent with the following information
        • Fields that are used to match this table.
        • Wildcards supported to match this table.
        • Instructions that are supported by this table.
        • Write Actions
        • Apply Actions
        • Miss Action configuration.
        • Maximum number of entries supported in the table
        • Active entries
        • Number of packets looked up in the table.
        • Number of packets that have entry hit in the table.
      • Port Statistics:  Controller requests for port statistics by giving port number.  Reply information contains following:
        • Number of Received packets,  Number of transmitted packets.
        • Number of received bytes, Number of transmitted bytes.
        • Number of packets dropped in receive,  Number of packets dropped by transmit.
        • Number of receive errors, Number of tx errors
        • Number of rx frame errors,  Number of overrun errors
        • Number of rx CRC errors,  Number of collisions.
      • Queue Statistics:  Controller requests statistics by giving port number and queue ID.  Results sent back are:
        • Transmit bytes,   Transmit packets and transmit errors.
      • Group Statistics:  Controller requests statistics by giving group ID.   Reply information consists of
        • Reference Count - Number of flow entries or other group entries that refer to this group ID.
        • Number of packets and bytes  processed by this group
        • Bucket Statistics are also returned.  For each bucket in the group, following information is sent back
          • Packet count and byte count of packet processed by this bucket.
      • Group Description:  Controller can request the buckets and associated actions by giving group ID.  Reply information consists of:
        • Number of buckets and information on each bucket including actions.
    • PACKET-OUT message:  This message is sent by the controller, typically after creating a flow in the datapath.  I am puzzled by the description of the PACKET-OUT message; I am not sure whether it is a problem with the specification or my misunderstanding.  I see that the PACKET-OUT message has action headers. I would expect the PACKET-OUT message to start processing from the table where the miss occurred before.  Note that starting from the first table is not an option in many cases, as the packet may already have been modified by the "Apply Actions" in the matched flows of previous tables. I would expect the following information to be sent as part of PACKET-OUT:
      • Table ID:  Where to start the search from.
      • Buffer ID : In case the entire packet was not sent to the controller with PACKET-IN message as part of TABLE MISS.
      • Metadata information:  Note that the table miss condition would have occurred after processing some tables.  Due to the actions on the flows in those tables,  certain metadata would have been collected.  This metadata is sent back so that the processing is consistent.   I also think that metadata of one 32-bit integer is not good enough; it should be significantly larger (up to 128 bytes).  Again, to save bandwidth, the metadata need not come to the controller via the PACKET-IN message and be sent back in PACKET-OUT. It can be stored along with the stored buffer in the datapath, with the buffer ID as the reference. If the datapath is not storing buffers, then the metadata should be expected by the controller as part of PACKET-IN and sent back in PACKET-OUT.
  • Asynchronous Messages :  These messages are sent from the datapath to the controller without any command message from the controller.
    • PACKET-IN message:  This message is sent whenever there is no matching flow in the sequence of tables; any table miss will result in a PACKET-IN message. My comments in the PACKET-OUT message section above are valid here too.  I would expect the following information to go to the controller:
      • Table ID:  ID of the table where miss occurred.
      • Action type:  Is it due to MISS action or due to explicit action to send the packet to controller.
      • Buffer ID:  Sent if the datapath can store the packet, in which case it sends a reference to this buffer and the same buffer ID is expected as part of PACKET-OUT.   What happens to the stored buffer if there is no PACKET-OUT message?  If there is no PACKET-OUT message for a certain amount of time, the packet gets dropped. I guess if a PACKET-OUT message arrives after the buffer is dropped, it will be ignored by the datapath.
      • Packet Data:  If a buffer ID is sent, the entire packet is not sent.  It just needs to send enough bytes for the controller to understand the kind of packet (typically, up to the TCP/UDP header is good enough).  The amount of data to be sent along with a buffer ID is configurable by the controller.  By default, miss_send_len is 128 bytes.
      • Metadata:   Specification does not talk about this. I believe that it should be sent.
    • Flow Removed Message:  Data path sends this message whenever flow is removed due to timeout.   This message contains following:
      • Flow specific information:  Priority,  Match fields etc..
      • Some statistics information:  Byte count, Packet count.
      • Duration of the flow:  How much time the flow was alive.
      • Reason for flow removal :  Hard timeout, Idle timeout,  DELETE command,  Group Delete command.
      • Table ID:  Placement of current flow.
    • Port Status Message:  Whenever ports are added, modified or deleted, this information is sent to the controller.  The information typically contains:
      • Reason for this message:  ADD, DELETE, MODIFY.
      • Port specific information.
    • ERROR Message:  Datapath informs the controller whenever there are errors observed.
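The ternary (value plus mask) matching used by the match fields described in the flow entry section can be sketched as follows. The field names and structures here are my own illustration, not the ofp_match wire format:

```python
# Hypothetical sketch of ternary matching: a field matches when the packet
# bits agree with the flow's value on every bit set in the mask; a zero
# mask makes the field a wildcard.

def field_matches(pkt_value, flow_value, mask):
    return (pkt_value & mask) == (flow_value & mask)

def flow_matches(pkt, match):
    # match: field name -> (value, mask); fields absent from `match`
    # are wildcards.
    return all(field_matches(pkt.get(f, 0), v, m)
               for f, (v, m) in match.items())
```

This is exactly why a plain hash table cannot hold these flows: two different packet values can match the same masked entry, so the datapath needs TCAM-style or ordered-scan lookup.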

Dual Stack Support in LTE eNodeB - Technical bit

This technical bit summarizes the type of functionality expected of dual stack support in the user plane of eNodeB.


eNodeB connects to UEs on the air interface side and connects to multiple types of devices in the core network over the backhaul network.  It communicates with the MME and S1 Gateways via Security Gateways.

The GTP-U layer is the relay module which transfers packets between the UE and the wireless core network.  The GTP-U tunnel is normally terminated on the S1 Gateway for normal traffic.  GTP-U tunnels are also terminated with other eNodeBs in handover cases.
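The relay role of GTP-U can be sketched as below. The header layout (flags 0x30, message type 0xFF for a G-PDU, UDP port 2152) follows the GTP-U specification, but the functions themselves are my own simplified illustration that ignores sequence numbers and extension headers:

```python
# Hypothetical sketch of GTP-U encapsulation/decapsulation of a UE packet.
import struct

GTPU_PORT = 2152    # registered UDP port for GTP-U
GTPU_GPDU = 0xFF    # message type carrying user payload (G-PDU)

def gtpu_encap(teid, inner_packet):
    # Flags 0x30: version 1, protocol type GTP, no optional fields.
    # Header: flags, message type, payload length, tunnel endpoint ID.
    header = struct.pack("!BBHI", 0x30, GTPU_GPDU, len(inner_packet), teid)
    return header + inner_packet

def gtpu_decap(frame):
    flags, msg_type, length, teid = struct.unpack("!BBHI", frame[:8])
    assert msg_type == GTPU_GPDU
    return teid, frame[8:8 + length]
```

The TEID identifies the bearer on the peer side, which is how one eNodeB/S1-GW pair can carry many UE bearers over the same transport addresses.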

GTP-U packets can optionally be secured using IPsec.  Though transport mode is good enough,  tunnel mode is commonly used.  IPsec tunnels are terminated at the Security Gateway in the core network.  Security Gateways are typically placed at the edge of the core network, between the eNodeB and the S1 and other gateways in the core network.

eNodeB Dual Stack Requirements:

eNodeB must ensure that it works with all of the following types of peers.

UEs:
  • IPv4 only UEs :  These UEs generate and consume IPv4 packets only.
  • IPv6 only UEs:  These UEs generate and consume IPv6 packets only.
  • Dual Stack UEs:  These UEs can generate and consume both IPv4 and IPv6 packets at the same time, over
    • IPv4 Radio Bearers
    • IPv6 Radio Bearers
    • IPv4 and IPv6 Radio Bearers, where both IPv4 and IPv6 packets can be seen on the same RB.
S1 Gateways:
  • IPv4 Only S1 Gateways  :  GTP-U tunnels are IPv4 tunnels 
  • IPv6 only S1  Gateways -  GTP-U tunnels are IPv6 tunnels.
  • Dual Stack S1 Gateways  -  GTP-U tunnels to these gateways could be either IPv4 or IPv6.
  • In a given deployment, there is a possibility of having any of above types of S1 Gateways.
IPsec VPN Gateways:

The eNB typically contains the IPsec function as part of the eNB itself.  On the core network side though,  the IPsec gateway is normally not combined with the S1 Gateway.  It is a separate device/blade that sits in the core network.  IPsec tunnels from the eNB are terminated at this gateway whereas GTP tunnels are terminated at the S1 Gateway.

With the above background,  the following functionality is typically expected in the eNB:

GTP-U Layer:
  • Since there are different types of S1 gateways (IPv6 only,   IPv4 only and dual stack),  the MME can decide to put different UEs on various types of S1 gateways.  Hence the GTP-U layer must be able to support both IPv4 tunnels and IPv6 tunnels.
  • GTP-U must be able to transport both IPv4 and IPv6 packets between the UE and the core network, on both IPv4 based GTP tunnels and IPv6 based GTP tunnels.
  • The GTP-U layer must be able to set DSCP values on the outer IP header (IPv4 or IPv6) and should be able to copy DSCP values from the inner IP packet (IPv4 or IPv6).  Hence, for uplink packets, the GTP layer should be able to copy the DSCP value from
    • IPv4 to IPv6
    • IPv6 to IPv6
    • IPv4 to IPv4
    • IPv6 to IPv4
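The DSCP copy cases listed above can be sketched in Python. The field offsets follow the IPv4/IPv6 header formats (IPv4 TOS byte, IPv6 Traffic Class); the function names are my own, and a real implementation must also update the IPv4 header checksum after rewriting the TOS byte:

```python
# Hypothetical sketch of reading the DSCP from any inner IP version and
# writing it to an outer IPv4 header.

def read_dscp(ip_packet):
    version = ip_packet[0] >> 4
    if version == 4:
        tc = ip_packet[1]                      # IPv4 TOS byte
    elif version == 6:
        # IPv6 Traffic Class straddles the first two bytes.
        tc = ((ip_packet[0] & 0x0F) << 4) | (ip_packet[1] >> 4)
    else:
        raise ValueError("not an IP packet")
    return tc >> 2                             # top 6 bits are the DSCP

def write_dscp_ipv4(ip_packet, dscp):
    # Set the DSCP on an outer IPv4 header, preserving the 2 ECN bits.
    # (A real datapath must also recompute the IPv4 header checksum.)
    pkt = bytearray(ip_packet)
    pkt[1] = (dscp << 2) | (pkt[1] & 0x03)
    return bytes(pkt)
```

Because the read side is version-agnostic, the same code path handles all four inner-to-outer combinations (IPv4/IPv6 inner to IPv4/IPv6 outer).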
IPSec Layer:

As noted above,  one security gateway might be fronting multiple S1 Gateways with respect to the base stations.  Since there could be multiple GTP-U tunnels inside one IPsec tunnel,  the IPsec layer must support the following:
  • Must be able to work with the Security Gateway in the core network whether that gateway supports an IPv4 or IPv6 tunnel.
  • Must be able to transport both IPv4 and IPv6 packets (GTP-U tunnel packets) on one tunnel.
  • Must be able to do DSCP copy from GTP-U header to Outer IP header (could be IPv4 or IPv6). 
QoS Layer:

Since both IPv4 and IPv6 packets traverse the same Ethernet port or VLAN port, it is necessary that shaping and scheduling do not require two different types of configuration. It should be possible to create
  • QoS ACLs having both IPv4 and IPv6 rules.
  • A rule with both IPv4 and IPv6 address tuples.
  • Multiple rules pointing to the same Queues.
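The kind of combined classification described above can be sketched as below. The rule structure and names are my own illustration: one rule list carries both IPv4 and IPv6 prefixes, and multiple rules can point to the same queue, so the shaping/scheduling configuration stays single.

```python
# Hypothetical sketch of a dual stack QoS ACL: mixed v4/v6 prefixes per
# rule, all resolving to shared queues.
import ipaddress

def classify(rules, pkt):
    src = ipaddress.ip_address(pkt["src"])
    for rule in rules:
        for net in rule["prefixes"]:           # v4 and v6 mixed in one rule
            if src.version == net.version and src in net:
                return rule["queue"]
    return "best-effort"
```

With this shape, adding IPv6 subscribers is a matter of appending prefixes to existing rules rather than duplicating the queue and scheduler configuration.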