Sunday, January 22, 2012

IP Fragmentation versus TCP segmentation

Ethernet controllers are becoming more intelligent with every generation of NICs. Intel and Broadcom have added many features to their Ethernet NIC chips in the recent past, and multicore SoC vendors are adding a large number of features to their Ethernet I/O hardware blocks.

TCP GRO (Generic Receive Offload, which used to be called Large Receive Offload) and GSO (Generic Segmentation Offload, which used to be called TCP Segmentation Offload) are two new features (in addition to FCoE offloads) one can see in Intel NICs and many multicore SoCs. These two features are good for any TCP termination application on the host processors/cores, because they reduce the number of packets traversing the host TCP/IP stack.

TCP GRO works across multiple TCP flows. For each flow, it aggregates consecutive TCP segments (based on the TCP sequence number) into one or a few larger TCP packets in the hardware itself, thereby sending far fewer packets to the host processor. As a result, the TCP/IP stack sees fewer inbound packets. Since per-packet overhead is significant in TCP/IP stacks, fewer packets consume fewer CPU cycles, leaving more CPU cycles for applications and improving the performance of the overall system.

The intention of TCP GSO is similar to that of TCP GRO, but for outbound packets. The TCP layer typically segments data based on the MSS (Maximum Segment Size) value, which is typically derived from the PMTU (Path MTU). Since the TCP and IP headers together take 40 bytes (without options), the MSS is typically (PMTU - 40) bytes. If the PMTU is 1500 bytes, the resulting MSS is 1460. When the application sends a large amount of data, the data is segmented into multiple TCP packets, each carrying up to 1460 bytes of payload. The TCP GSO feature in the hardware eliminates the need for the TCP layer to do the segmentation and thereby reduces the number of packets that travel between the TCP layer and the NIC. The hardware typically expects the MSS value along with the packet and does everything necessary internally to segment the data and send the segments out.
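As a rough illustration of the arithmetic above, here is a small Python sketch that derives the MSS from the PMTU and shows how a large send gets cut into segments. It assumes IPv4 and TCP headers without options (20 bytes each), and the function names are made up for this example.

```python
IP_HEADER = 20   # bytes, IPv4 header without options (assumption)
TCP_HEADER = 20  # bytes, TCP header without options (assumption)

def mss_from_pmtu(pmtu):
    """MSS is what is left of the path MTU after the IP and TCP headers."""
    return pmtu - IP_HEADER - TCP_HEADER

def segment_sizes(data_len, mss):
    """Payload sizes of the TCP segments needed to carry data_len bytes."""
    full, rest = divmod(data_len, mss)
    return [mss] * full + ([rest] if rest else [])

mss = mss_from_pmtu(1500)
sizes = segment_sizes(64 * 1024, mss)
print(mss, len(sizes), sizes[-1])  # prints: 1460 45 1296
```

Without GSO, each of those 45 segments travels down the stack as a separate packet. With GSO, the stack can hand over one large buffer plus the MSS value.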

Ethernet controllers are increasingly providing support for IP-level fragmentation and reassembly. The main reason is the increasing popularity of tunnels.

With increasing usage of tunnels (IPsec, GRE, IP-in-IP, Mobile IP, GTP-U and the upcoming VXLAN and LISP), packet sizes are going up. Though these tunnel protocol specifications provide guidelines to avoid fragmentation using the DF bit and PMTU discovery, that does not happen in reality. There are very few deployments where the DF (Don't Fragment) bit, which is required for PMTU discovery, is used. As far as I know, almost all IPv4 deployments fragment packets during tunneling. Some deployments configure network devices to do red-side fragmentation (fragmentation before tunneling, so that each tunneled packet appears as a whole IP packet) and some deployments go for black-side fragmentation (fragmentation after tunneling is done). In the receive direction, reassembly happens either before or after detunneling.

It used to be the case that fragmented packets were given lower priority by service providers during network congestion. With high-throughput connectivity and a growing customer base, service providers are competing for business by providing very good reliability and high throughput. Due to the popularity of tunnels, service providers are also realizing that dropping fragmented packets may result in a bad experience for their customers. It appears that service providers are no longer treating fragmented packets in a step-motherly fashion.

Both IP fragmentation and TCP segmentation offload can be used to reduce the number of packets traversing the TCP/IP stack in the host. The next question that comes to mind is how to tune the TCP/IP stack to use these features and how to divide the work between these two HW features.

The first thing to tune in the TCP/IP stack is to remove the MSS dependency on the PMTU. As described above, today the MSS is calculated from the PMTU value. Due to this, IP fragmentation is never used by the TCP stack for outbound TCP traffic.

TCP segmentation adds both a TCP header and an IP header to each segment. That is, for every 1460 bytes of data, there is an overhead of 20 bytes of IP header and 20 bytes of TCP header. With IP fragmentation, each fragment has its own IP header (20 bytes of overhead), but the TCP header is carried only once, in the first fragment. Since TCP segmentation has more overhead, one can say IP fragmentation is better. Here, the MSS can be set to a bigger value such as 16K, letting the IP layer fragment the packet if the MTU is less than 16K. This is certainly a good argument, and it works fine in networks where reliability is good. Where reliability is not good, if one fragment gets dropped, the receiver cannot reassemble the packet and the TCP layer needs to resend the entire 16K bytes. If TCP had done the segmentation, it would only need to resend the lost segment.
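To put numbers on this comparison, here is a small Python sketch for 16K of application data on a 1500-byte MTU. It assumes 20-byte IPv4 and TCP headers without options, and it rounds the per-fragment payload down to a multiple of 8 bytes, as IPv4 fragment offsets require. The function names are only for illustration.

```python
import math

IP_HEADER = 20   # bytes, IPv4 header without options (assumption)
TCP_HEADER = 20  # bytes, TCP header without options (assumption)

def tcp_segmentation_overhead(data_len, mtu):
    """Every segment carries its own IP and TCP header."""
    mss = mtu - IP_HEADER - TCP_HEADER
    segments = math.ceil(data_len / mss)
    return segments, segments * (IP_HEADER + TCP_HEADER)

def ip_fragmentation_overhead(data_len, mtu):
    """One TCP header in total, one IP header per fragment."""
    ip_payload = TCP_HEADER + data_len
    # Fragment payloads (except the last) must be a multiple of 8 bytes.
    per_fragment = (mtu - IP_HEADER) // 8 * 8
    fragments = math.ceil(ip_payload / per_fragment)
    return fragments, fragments * IP_HEADER + TCP_HEADER

print(tcp_segmentation_overhead(16 * 1024, 1500))  # (12, 480)
print(ip_fragmentation_overhead(16 * 1024, 1500))  # (12, 260)
```

Both approaches put 12 packets on the wire in this case. Fragmentation saves 220 header bytes per 16K, but a single lost packet costs a 16K retransmission instead of a 1460-byte one.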

There are advantages and disadvantages with both approaches. 

With the increased reliability of networks and with no special treatment of fragmented traffic by service providers, IP fragmentation is not a bad thing to do. And of course, one should worry about retransmissions too.

I hear of a few tunings based on the deployment. Warehouse-scale data center deployments, where the TCP clients and servers are in a controlled environment, are tuning the MSS to 32K and more with a 9K (jumbo frame) MTU. I think that, for a 1500-byte MTU, going with an MSS of 8K may work well.


Saturday, January 21, 2012

SAMLv2 mini-tutorial and some good resources

Why SAML  (Security Assertion Markup Language)?

Single Sign-On: Many organizations have multiple intranet servers with a web front end. Though all servers interact with a common authentication database such as LDAP or an SQL database, each server used to take authentication credentials from the employee. That is, if an employee logs into server1 and then goes to server2, the employee is expected to provide credentials again to server2. This may sound okay if the user is deliberately going to different servers, but it is not a good experience if there are hyperlinks from pages on one server to other servers. Single sign-on solves this issue. Once the user logs into one server successfully, the user does not need to provide his/her credentials when accessing other servers. Single authentication with single sign-on is becoming a common requirement in organizations.

Cloud Services: Organizations are increasingly adopting cloud services for cost reasons. Cloud services are provided by third-party companies. For example, salesforce.com provides a cloud service for CRM, and Taleo provides a cloud service for talent acquisition. Companies now have intranet servers as well as cloud services from different vendors. Employees accessing a cloud service should not be asked to create accounts in that service. Instead, the cloud service provider is expected to use the subscriber organization's authentication database to validate the users signing in. Many companies do not want to give cloud services direct access to their authentication database. Also, many companies do not want their employees, especially senior and financial executives, to give their organization login credentials to a cloud service, for fear of compromising the passwords. This creates the need for cloud services to redirect employees to log in on the company's own web servers and to use single sign-on mechanisms to let the users access the cloud services.

SAMLv2 (version 2 is the latest version) facilitates single sign-on not only across the intranet servers of a company, but also across services provided by cloud providers.


How does SAMLv2 work?

There are two participants in the SAML world - the Service Provider (SP) and the Identity Provider (IDP). The service provider represents the intranet servers and cloud servers; it provides some service to authenticated users. The identity provider is the system which takes login credentials from the user and authenticates the user using a local database, an LDAP database or any other authentication database.

When the user tries to access resources in a service provider, the SAMLv2 component of the service provider (SP) first checks whether the user is already authenticated. If so, it allows access to its resources. If the user is not authenticated, then rather than presenting a login screen and taking credentials, it instructs the user's browser to redirect the request to the identity provider, by sending an HTTP redirect response whose Location header contains the identity provider URL. Of course, the SAMLv2 component of the SP needs to be configured with the IDP URL beforehand by the administrator. The SAMLv2 component of the SP generates a SAML request (in XML form) and a 'RelayState' value, and adds both to the Location URL of the HTTP redirect response. The browser then follows the Location header to the IDP. Now the IDP has the SAML request and the RelayState information.
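The redirect step can be sketched in Python as follows. With the SAML HTTP-Redirect binding, the request XML is compressed with raw DEFLATE, base64-encoded and then URL-encoded into the `SAMLRequest` query parameter. The XML below is only a placeholder (a real AuthnRequest carries more attributes), and the IDP and SP URLs and function names are hypothetical.

```python
import base64
import zlib
from urllib.parse import urlencode, urlparse, parse_qs

def build_redirect_url(idp_sso_url, authn_request_xml, relay_state):
    """SP side: build the Location header value for the HTTP redirect."""
    # Raw DEFLATE stream (negative wbits = no zlib header or checksum).
    compressor = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION, zlib.DEFLATED, -15)
    deflated = compressor.compress(authn_request_xml.encode("utf-8"))
    deflated += compressor.flush()
    params = {
        "SAMLRequest": base64.b64encode(deflated).decode("ascii"),
        "RelayState": relay_state,
    }
    return idp_sso_url + "?" + urlencode(params)

def decode_redirect_url(url):
    """IDP side: recover the request XML and RelayState from the URL."""
    query = parse_qs(urlparse(url).query)
    deflated = base64.b64decode(query["SAMLRequest"][0])
    xml = zlib.decompress(deflated, -15).decode("utf-8")
    return xml, query["RelayState"][0]

# Placeholder request and hypothetical URLs, for illustration only.
request_xml = '<samlp:AuthnRequest ID="_example" Version="2.0"/>'
location = build_redirect_url(
    "https://idp.example.com/sso", request_xml, "https://sp.example.com/reports"
)
print(location)
```

The SP would send `location` in the Location header of a 302 response. Request signing, which the binding also allows, is left out of this sketch.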

The IDP first checks whether the user is already authenticated with it. If not, it sends the login page to the user as part of the HTTP response. Once the credentials are validated against the authentication database, the SAMLv2 component of the IDP generates the SAML response with the result of authentication and the principal name, and attests to the response by signing it with the private key of its certificate. It can also add more information about the user via attribute statements. It then sends both the SAML response and the RelayState to the browser as the response to the last HTTP request it got (which is typically the credentials request). The IDP normally builds an HTML page with an embedded POST form whose target is the SP URL (which it got from the SAML request), with the SAML response and the RelayState (which it got earlier) as fields, plus JavaScript that makes the browser submit the form. The browser gets this page and, due to the JavaScript, automatically posts the request to the SP. This POST request contains the SAML response and the RelayState.
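A minimal sketch of the auto-submitting page the IDP sends back might look like this in Python. The form field names `SAMLResponse` and `RelayState` are the ones the SAML POST binding uses. The function name, URLs and response XML are placeholders, and the signing of the response is deliberately left out.

```python
import base64
from html import escape

def build_auto_post_page(sp_url, saml_response_xml, relay_state):
    """IDP side: HTML page that makes the browser POST the response to the SP."""
    # The POST binding carries the response base64-encoded in a form field.
    encoded = base64.b64encode(saml_response_xml.encode("utf-8")).decode("ascii")
    return (
        '<html><body onload="document.forms[0].submit()">'
        '<form method="post" action="{action}">'
        '<input type="hidden" name="SAMLResponse" value="{response}"/>'
        '<input type="hidden" name="RelayState" value="{relay}"/>'
        '<noscript><input type="submit" value="Continue"/></noscript>'
        "</form></body></html>"
    ).format(
        action=escape(sp_url, quote=True),
        response=encoded,
        relay=escape(relay_state, quote=True),
    )

# Placeholder values, for illustration only.
page = build_auto_post_page(
    "https://sp.example.com/saml/acs",
    "<samlp:Response/>",
    "https://sp.example.com/reports?a=1&b=2",
)
print(page)
```

The `noscript` button is a common fallback for browsers with JavaScript disabled. The RelayState is HTML-escaped because it ends up inside an attribute value.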

Now the SAMLv2 component of the SP ensures that the SAML response is valid by verifying the digital signature. Note that the public key (certificate) of the IDP is expected to be configured in the SP beforehand by the administrator. Then it checks the subject name of the principal and the result of authentication. If the result of authentication indicates 'success', it sends an HTTP redirect response to the browser with the original URL (from the RelayState) in the Location header. This makes the browser go to the original URL to get the response.

From now on, the user is able to access the pages in the SP. If the user goes to another SP of the company, the same process as above is repeated. Since the user is already authenticated with the IDP, the IDP does not ask for login credentials again. Since the user does not see all the transactions that are happening underneath, he/she gets a single sign-on experience.

By keeping the IDP in Enterprise premises,  one can be sure that passwords are not given to any SPs whether they are intranet SPs or Cloud Service SPs.

SAMLv2 - Other SSO use cases:

The use case described above is called 'SP initiated SSO'. It is called SP initiated because the user tries to access the SP first. The other SSO use case is 'IdP initiated SSO', where the user accesses the IdP first. The IdP authenticates the user and then presents links to the different SPs that the user can access. When the user clicks on one of these special links, the SAMLv2 component on the IdP generates an HTTP response with an HTML page that posts to the SAMLv2 component of the SP (its URL is called the Assertion Consumer Service, or ACS, URL). The SAMLv2 response is provided in the HTML page as one of the POST fields. The IdP also adds a 'RelayState' parameter with the URL that the user is supposed to be shown on the SP. Note that 'RelayState' in this case is unsolicited. There is no guarantee that the ACS in the SP will honor this parameter, and there is also no guarantee that all ACSes treat the RelayState as a URL. But many systems (including the Google Apps cloud service) expect a URL in the RelayState, and hence sending a URL is not a bad thing.

The above two cases, 'SP initiated SSO' and 'IdP initiated SSO', are mechanisms to access resources with a single authentication. 'Logout' is another scenario. When the user logs out (whether on an SP or on the IdP), the user should be logged out on all SPs. This is called the "Single Logout Profile". The IdP is supposed to keep track, for every user, of all the SPs on whose behalf it was contacted. When the user logs out on the IdP, the IdP sends a logout request to the 'Logout Service' component of every SP in the user's login session. If the user logs out on an SP and indicates willingness to log out on all SPs, then the SP, in addition to destroying the user's authentication context locally, redirects the user's browser to the IdP and includes a logout request in the redirection. The IdP then destroys its own authentication context for the user. If the user indicated that he would like to be logged out from all SPs, the IdP also sends logout requests over SOAP to all the other SPs in the user's authentication context.

For more detailed information about SAMLv2, start with SAMLv2 Technical overview and SAMLv2 specifications. You can find them here:

http://www.oasis-open.org/committees/download.php/27819/sstc-saml-tech-overview-2.0-cd-02.pdf
http://saml.xml.org/saml-specifications

IdP Proxy use case

The use cases described so far involve the SP and IdP interacting via the browser or directly. SAML does not prohibit an IdP proxy use case, where the IdP proxy works as an IdP for SPs and as an SP for IdPs. Please see this link on the topic:

https://spaces.internet2.edu/display/GS/SAMLIdPProxy

Open Source SAMLv2 SP,  IdP and IdP Proxy:

I find OpenAM (a fork of Sun OpenSSO) to be one of the popular open source SSO authentication frameworks. You can find more information here:

http://forgerock.com/openam.html

IdP Proxy configuration is detailed here:

https://wikis.forgerock.org/confluence/display/openam/SAMLv2+IDP+Proxy+Part+1.+Setting+up+a+simple+Proxy+scenario

https://wikis.forgerock.org/confluence/display/openam/SAMLv2+IDP+Proxy+Part+2.+Using+an+IDP+Finder+and+LOAs

One more good site I found, which details an IdP proxy and provides source code in Python:

http://www.swami.se/english/startpage/products/idpproxysocial2saml.131.html


Some good links, though not related to IdP Proxy:

Google Apps as Service Provider and OpenAM as IDP:  https://wikis.forgerock.org/confluence/display/openam/Integrate+With+Google+Apps

OpenAM as Service provider and Windows ADFS as IDP:  https://wikis.forgerock.org/confluence/display/openam/OpenAM+and+ADFS2+configuration





Need for Pattern Matching Accelerators in UTM devices

The term "network security" typically refers to threat prevention and security on the wire.

Threat protection is normally achieved with multiple security technologies. Basic protection comes from firewall technology. IDS/IPS (Intrusion Detection/Prevention Systems), anti-virus and web application firewalls are some of the security technologies that are increasingly being used to protect networks (network devices, servers and client machines). Application detection is another technology that is increasingly being used along with the firewall to stop or allow traffic that can't be identified using the ports in TCP/UDP headers and instead requires deep packet inspection.

Other than the firewall, all the technologies listed above require deep packet and deep data inspection. IDS/IPS technology adopts multiple techniques to identify attack traffic. One of the techniques is to match the traffic data against known attack patterns. Application detection also relies on pattern matching on the data as one of its techniques to detect applications. Anti-virus technology too depends on pattern matching to detect viruses.

In almost all the technologies above, device vendors add patterns to the deployed systems on a continuous basis as more attacks are discovered. For example, IPS devices nowadays have around 10,000 patterns (signatures) to detect known attacks, and the number keeps increasing every year. Additionally, some of these patterns are checked on every packet that goes through the IPS. This adds to the number of CPU cycles required to do IPS protection.

Many software algorithms are used to speed up pattern matching. Some of the techniques include:
  • DFA (Deterministic Finite State Automata)
  • Bloom filters - Filters formed from the hashes of patterns can be used on the traffic to determine whether further analysis is required.
  • PCRE algorithms to search for patterns of regular expression type.
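To make the Bloom filter idea concrete, here is a minimal Python sketch. It hashes a fixed-length prefix of every pattern into a bit array and then slides a window over the traffic data. Only if some window hits the filter does the data go on to the (more expensive) full pattern matching. The window length, filter size, number of hashes and the use of SHA-256 are arbitrary choices for this illustration, not what any particular IPS does.

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1 << 16, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

def build_prefilter(patterns, window):
    """Insert the first `window` bytes of each pattern (patterns must be at least that long)."""
    bloom = BloomFilter()
    for pattern in patterns:
        bloom.add(pattern[:window])
    return bloom

def needs_full_scan(data, bloom, window):
    """True if any window of the data might be the start of a known pattern."""
    return any(data[i:i + window] in bloom
               for i in range(len(data) - window + 1))

WINDOW = 4
prefilter = build_prefilter([b"abc123def", b"/etc/passwd"], WINDOW)
print(needs_full_scan(b"GET /etc/passwd HTTP/1.1", prefilter, WINDOW))  # True
```

A Bloom filter never misses data that really contains a pattern prefix, but it can occasionally flag clean data (a false positive), so it only works as a pre-filter in front of exact matching.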

IPS and other technologies also reduce the number of patterns to be matched by using protocol-level intelligence and classifying the patterns into multiple buckets (by protocol, by port, or even by application protocol stage, such as separate pattern databases for URLs, HTTP request headers, HTTP response headers, etc.).

Due to the above techniques, some device vendors think that there is no need for pattern matching hardware accelerators. There is a reason for that too, as some early work with snort (the popular open source IDS/IPS software) did not find much performance improvement with hardware accelerators. But I believe HW accelerators are required, for the following reasons.

Performance Determinism: IPS, anti-virus, web application firewall and application detection technologies depend on regular signature updates. Hardware deployed in the field might have X signatures on the day of purchase, and this might go up to 2X or 3X over the years. CSOs expect performance determinism. To maintain performance levels, CPUs should be relieved of pattern matching. Hardware accelerators specialized in pattern matching help maintain performance levels even with an increasing number of signatures.

Protection from CPU hogging attacks: With software-based pattern matching, it is possible to hog the CPUs by crafting packets whose data matches a pattern many times. Consider a signature rule which tries to match the pattern "abc123def". If 1 Mbyte of data is sent with "abc123def" repeated throughout, the CPU would spend a very long time on it, as every packet matches many times. The CPU not only spends time matching the patterns, but also spends a significant number of cycles on further analysis of each match. Hardware accelerators are normally designed in such a way that performance does not go down even if there are many matches.
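A quick Python sketch shows how many matches such crafted data produces. It takes 1 Mbyte to mean 2^20 bytes, which is my assumption, and the helper name is made up.

```python
PATTERN = b"abc123def"
ONE_MB = 1 << 20  # assumption: 1 Mbyte = 2**20 bytes

def count_matches(data, pattern):
    """Count every position in data where pattern starts."""
    count = 0
    pos = data.find(pattern)
    while pos != -1:
        count += 1
        pos = data.find(pattern, pos + 1)
    return count

# Crafted payload: the pattern repeated back to back, cut to 1 Mbyte.
crafted = (PATTERN * (ONE_MB // len(PATTERN) + 1))[:ONE_MB]
benign = bytes(ONE_MB)  # 1 Mbyte of zero bytes

print(count_matches(crafted, PATTERN))  # 116508
print(count_matches(benign, PATTERN))   # 0
```

Each of those 116,508 hits would trigger the software's post-match analysis, which is where the CPU time goes.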

The next question is which hardware accelerator capabilities one should look for to mitigate these two performance issues - the explosive growth of attack patterns (signatures) and deliberate CPU hogging attempts by attackers. I believe one should look for the following capabilities.
  • Accelerators should be programmable with decent number of patterns.
  • Accelerators should be able to perform well even with large number of patterns.
  • Accelerators should be able to perform well even if there are large number of matches.
  • Accelerators should be able to perform pattern matches based on context information such as 'relative offset' and 'depth of the data to look at' while doing pattern matching. This reduces the number of results returned by the accelerator. The smaller the number of results passed to software, the less post-processing is needed.
  • Accelerators should be able to return results only when multiple patterns match on the data.  This also is required to reduce the number of results. 
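The effect of offset/depth context can be modeled in software with a short Python sketch. Here `offset` is where the search starts and `depth` is how many bytes from there are examined. This is my own simplified reading of those terms and not the exact semantics of any particular accelerator or rule language.

```python
def find_in_window(data, pattern, offset=0, depth=None):
    """Return start positions of pattern lying fully inside data[offset:offset+depth]."""
    end = len(data) if depth is None else min(len(data), offset + depth)
    hits = []
    pos = data.find(pattern, offset, end)
    while pos != -1:
        hits.append(pos)
        pos = data.find(pattern, pos + 1, end)
    return hits

payload = b"GET /index.html HTTP/1.1\r\nHost: example.com\r\n\r\nGET GET GET"

# Without context, every occurrence is reported back to software.
print(len(find_in_window(payload, b"GET")))           # 4
# A rule that only cares about the request method looks at the first 3 bytes.
print(find_in_window(payload, b"GET", offset=0, depth=3))  # [0]
```

An accelerator that applies such constraints itself returns one result instead of four here, and the saving grows with data crafted to contain many repeats.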

In summary, pattern matching hardware accelerators are required to avoid CPU hogging, whether due to the increase in signatures or due to data intelligently crafted by attackers. I feel that end customers should buy UTM/IPS devices that take advantage of these accelerators, to ensure that the devices remain usable for at least a few years (future proofing).

 

Saturday, December 31, 2011

Locator and Identifier Separation Protocol (LISP) - One more tunnel protocol


In 2012, I think there will be a focus on two technologies in the network infrastructure market - SDN and LISP (Locator and Identifier Separation Protocol). Work on LISP has been going on for a few years, and it has been talked about quite often in the recent past.

Why LISP?

The reasons for LISP are detailed very well in RFC 4984. Some points from RFC 4984 are worth noting, and I mention them here.

Multihoming

Internet presence is now part of the business model of many organizations. Hence high availability of Internet connectivity is becoming very important to them. High availability is achieved by having multiple links to an ISP and also links to different ISPs. Multiple links are used for load balancing the traffic as well as for redundancy.

Customers (companies) get an IP address block (subnet) from each ISP, and this address block is used to assign IP addresses to the machines that need to be reachable from external networks. Since each ISP assigns a different block, critical machines are given multiple IP addresses - one from each ISP-assigned block. Operating systems and routing protocols running on the machines and routers ensure that the IP addresses of the active links are used. Each machine's operating system needs this intelligence so that connections from applications running on its TCP/IP stack are assigned active IP addresses. Since multiple IP addresses are assigned to a machine, the machine is termed a multihomed machine, and the concept is called multihoming.

Even though the above scheme works in general, existing active connections get terminated if the link associated with the IP addresses of those connections goes down. This could result in lost voice calls or the termination of very important TCP/IP connections. This is one problem with provider assigned (PA) IP addresses (also called Provider Aggregatable addresses). There is no issue for new connections, though, as routing protocols propagate the link status. Note that service providers don't accept packets from their customer networks whose source IP address is outside the address block they assigned. Due to this, packets belonging to active connections can't simply be sent onto the links of other service providers. This is one of the challenges organizations have with multihoming.

The second challenge with multihoming is the propagation of active routes and links to each machine. All machines that are reachable from external networks need to run routing protocols. As you all know, end nodes typically don't have routing protocols enabled, to avoid increasing the maintenance headache for the IT department.

The third challenge is multihoming for inbound connections to the organization. When a link is down, remote systems should somehow stop using the addresses associated with that link. Typically, this is achieved by giving each internal server machine an FQDN (fully qualified domain name) and updating the DNS server with the IP addresses of the active links. That is, the DNS server can't just use static information; it should be informed of changes as soon as possible. Even though this can be done at the local DNS server level, many DNS resolvers in the Internet might have cached the old information, and remote systems continue to use the down IP addresses for some time. This can be mitigated by sending DNS responses with a TTL of 0, but that would increase the load on the local DNS servers.

Finally,  organizations would like certain type of traffic (both inbound and outbound connections) to use some links over other links for several reasons such as cost of the link,  time of day etc..  

Basically, traffic engineering for outbound and inbound connections has a good number of challenges, and the techniques to overcome these challenges have limitations, as described above.

Provider Independent Addresses:

To address the issues of multihoming and traffic engineering, RIRs (Regional Internet Registries) introduced policies allowing organizations to request provider independent addresses (PI addresses). Provider independent addresses are expected to be routed by all service providers. That is, packets coming from their customers with these IP addresses as the source IP address are expected to be honored by the service providers.

The benefits of PI addresses are obvious from the above background, but they are consolidated below:
  • No need for multihoming support in end nodes, hence no need for enabling routing protocols in the end nodes.
  • Traffic Engineering is simple - No need for dynamically updating DNS Servers.
  • Simple for organizations to move to new service providers.  No renumbering of machines every time the service provider is changed.
  • With acquisitions and mergers,  consolidation of networks is simple.

With PA (Provider Assigned) addresses, addresses are aggregatable. Hence the routing entries in the routers used to be small in number. Provider independent addresses can't be aggregated, and hence the routing table size increases dramatically. Routing table sizes of DFZ (Default Free Zone) routers are going up dramatically due to PI addresses. According to the BGP Routing Analysis Report, the number of routes in the DFZ routers went from 5000 in year 2000 to 400,000 in year 2011. With the growing popularity of IPv6 and the more liberal assignment of PI addresses in IPv6, it will not be a surprise if the number of routes in DFZ routers goes into the millions in the next few years. Since DFZ and service provider routers consult the routing table for every incoming packet, more routes in the table reduce the performance of the router and hence the performance of the overall Internet.
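The difference between aggregatable and non-aggregatable prefixes can be seen with Python's standard `ipaddress` module. In this sketch, four customers of one provider hold adjacent /24 blocks carved out of that provider's space, while four PI holders have unrelated /24 blocks. All prefixes are made-up example values, and the helper name is hypothetical.

```python
import ipaddress

def dfz_routes(prefixes):
    """Collapse prefixes into the smallest set of routes that covers exactly them."""
    networks = [ipaddress.ip_network(p) for p in prefixes]
    return list(ipaddress.collapse_addresses(networks))

# Provider-assigned: adjacent blocks from one provider's allocation.
pa_prefixes = ["10.1.0.0/24", "10.1.1.0/24", "10.1.2.0/24", "10.1.3.0/24"]
# Provider-independent: unrelated blocks, nothing to aggregate.
pi_prefixes = ["192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24", "10.200.7.0/24"]

print(dfz_routes(pa_prefixes))       # [IPv4Network('10.1.0.0/22')]
print(len(dfz_routes(pi_prefixes)))  # 4
```

The provider can announce its four PA customers as a single /22, whereas every PI block needs its own entry in the DFZ routers.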

LISP was born mainly to address this scaling issue in DFZ routers.


Basic concept of LISP:

LISP tries to keep the advantages of provider independent addresses for organizations while keeping the routing table at a reasonable size by using aggregatable addresses. To do this, LISP proposes two address spaces - the Identifier address space and the Locator address space. The identifier address space is similar to the provider independent address space. Organizations are expected to assign addresses from their allocated identifier space to individual network elements. Locator space is also assigned to the organization, but it is provider aggregatable address space. Hence, this space should not be used to assign addresses to all network elements; it should be assigned to tunnel routers (LISP tunnel routers) only. When the organization changes its service provider, it only needs to worry about the IP address assignment of the LISP routers, and no other change is expected.

LISP standards call the identifier address an EID (Endpoint ID) and the locator IP address an RLOC (Routing Locator).

A LISP router contains two functions - Ingress Tunnel Router (ITR) and Egress Tunnel Router (ETR). The terms ingress and egress are with respect to the tunnel: the ITR is where traffic from the endpoint network enters the tunnel, and the ETR is where it leaves. Since the endpoint identifier space is not expected to be visible to the core network routers, the ITR encapsulates the traffic coming from the endpoint network in a tunnel with LISP, UDP and IP headers and sends it out onto the Internet. The ETR is expected to decapsulate the traffic coming from the Internet and pass the inner packet to the endpoint network. Typically, ITR and ETR are implemented in customer edge routers. Initially, enterprises might expect service providers to provide the LISP service; eventually, enterprise routers will have this functionality.

The IP addresses used in the IP header of the tunnel are from the RLOC space. Since the RLOC space is provider aggregatable, the routing table size will not increase dramatically. Please see the LISP draft for more information on the tunnel header formats.


How does it solve the issues/challenges discussed above?

Multihoming is no longer required in end nodes. It is still required on LISP routers, though - that is, there is still a need for multiple links from different providers for redundancy and traffic engineering. Active connections do not suffer if traffic is redirected to other links, as end nodes always work with the EID space and those IP addresses continue to work, similar to provider independent addresses. The outer IP header addresses of the LISP tunnel change when links go down and come back up. That should be okay, as these addresses are only used to get to the LISP ETR.

EID to RLOC mapping:

The ITR needs to know the source IP and destination IP to use in the tunnel header. The ITR uses the destination IP (EID) of the packets coming in from the local network to determine the RLOC IP address of the remote ETR. It does this using a mapping database. Each ITR is expected to maintain an EID-to-RLOC cache. If it does not find a matching entry in the cache, it talks to the mapping resolvers. The mapping resolvers use the mapping database to figure out the destination ETR, and let the destination ETR send the actual EID-to-RLOC mapping to the requesting ITR. Basically, mapping resolvers and mapping databases are only used to find the ETR; the ETR is the one which gives the EID-to-RLOC mapping to the ITRs.

ETRs are expected to register their RLOCs with the mapping database for the EID prefixes they control. This is done using the Map-Register message. ITRs send a Map-Request message to the mapping resolvers to get the EID-to-RLOC mapping. The mapping resolvers use the map register database to find the RLOC of the ETR, and rewrite the destination IP address of the Map-Request with that RLOC to redirect the Map-Request packet to the ETR. The ETR then replies with a Map-Reply message containing the actual RLOCs to be used by the ITR. One might ask why the mapping resolvers can't send the Map-Reply to the ITR themselves. The ETR is given this opportunity so that it can do inbound traffic engineering: it can give different RLOC IP addresses for different types of traffic, or use different links at different times, etc.
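The ITR side of this exchange can be sketched as a small cache with a fallback. In the Python sketch below, `resolver` is a hypothetical callable standing in for the whole Map-Request/Map-Reply exchange; it returns the EID prefix and the RLOC list that the ETR would put in its reply. Entry expiry and IPv6 are left out, and all addresses are made-up examples.

```python
import ipaddress

class MapCache:
    """EID-to-RLOC cache of an ITR (sketch; no expiry handling)."""

    def __init__(self, resolver):
        self._entries = {}         # EID prefix (ip_network) -> list of RLOCs
        self._resolver = resolver  # hypothetical stand-in for Map-Request/Map-Reply

    def lookup(self, dest_eid):
        addr = ipaddress.ip_address(dest_eid)
        # Longest-prefix match over the cached EID prefixes.
        candidates = [prefix for prefix in self._entries if addr in prefix]
        if candidates:
            best = max(candidates, key=lambda prefix: prefix.prefixlen)
            return self._entries[best]
        # Cache miss: ask the mapping system and remember the answer.
        prefix, rlocs = self._resolver(dest_eid)
        self._entries[ipaddress.ip_network(prefix)] = rlocs
        return rlocs

requests_sent = []

def fake_resolver(dest_eid):
    # Pretend the ETR for 10.20.0.0/16 answered with two RLOCs.
    requests_sent.append(dest_eid)
    return "10.20.0.0/16", ["203.0.113.1", "198.51.100.1"]

cache = MapCache(fake_resolver)
print(cache.lookup("10.20.1.5"))  # miss: one Map-Request goes out
print(cache.lookup("10.20.9.9"))  # hit: served from the cache
print(len(requests_sent))         # 1
```

The RLOCs returned by the lookup become the outer destination address of the tunnel header. Because the ETR supplies the list, it can steer inbound traffic by changing what it returns.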

Mapping database, mapping resolver servers and associated message formats are described in IETF draft LISP MAP Server interface.

In summary, MAP resolvers and MAP database servers are used to index ETRs, and the ETRs are the ones which actually provide the EID-to-RLOC mapping.

The real challenge is how the index database is implemented. Note that this database can become big, as all EID prefixes are maintained in it. The database is updated by multiple ETRs, so update operations need to be fast, and of course search operations need to be very fast. To take care of scalability issues, multiple database servers would need to be used, with the database divided across them. One proposal I see is the usage of a DHT (Distributed Hash Table).

Please see following links:

Alternate network:  http://tools.ietf.org/html/draft-ietf-lisp-alt-1
There is a DHT alternative to this.

Summary:

The year 2012 should see LISP-based routers. The initial set of routers will have software implementations of the LISP routing functionality. Once the standards achieve a certain level of maturity, one would see vendors of Ethernet controllers (standalone or multicore based) adopting this technology in hardware.

Monday, December 19, 2011

Table Centric Processing and Openflow

Software in embedded network appliances consists of multiple entities - Management Plane (MP), Control Plane (CP) and Data Plane (DP). Examples of network appliances include routers, switches, load balancers, security devices and WAN optimization devices.

MP entity typically consists of management engines such as CLI,  GUI Engines and Persistent storage Engines.

The CP mainly consists of several sub-entities for different functions. For example, if a device provides firewall, IPsec VPN and routing functionality, then there could be three sub-entities. Each sub-entity might consist of multiple protocols and modules. The routing function might have multiple control plane protocols such as OSPF, BGP, IS-IS etc. The CP also typically contains exception packet processing modules. For example, if no flow processing context is found for a given packet in the DP, the DP gives the packet to the CP, which processes it and creates the flow context in the DP.

The DP entity consists of the actual packet processing logic. It is typically implemented in Network Processor Units or ASICs. It is increasingly being implemented in Multicore processor SoCs, with some cores dedicated to the CP and the rest of the cores to the DP.

In this post, I am going to concentrate on DP (also called as datapath). 

As indicated DP implements the "datapath" processing.  Based on the type of network device, datapath consists of 'routing',  'bridging',  'nat', 'firewall', 'ipsec' and 'dpi'  modules and more.  Some background on each of the datapath functionality is listed below. Then we will see how table driven processing model helps in implementing datapaths.

Routing

Routing module typically involves following processing, upon packet reception:
  • Parse and check integrity of the packet
    • L2 (Ethernet, VLAN etc..)  header parsing.
    • Identification of packet type 
    • If it is IP,  proceed further, otherwise do non-IP specific processing.
    • Integrity check involves ensuring the IP checksum is valid and packet size is at least as big as the 'total length' in the IP header.
  • Extraction of fields - Source IP address,  Destination IP address,  ToS fields from the IP header.
  • IP Spoofing Check: This processing ensures that the packet came in on the right interface (Ethernet port). Normally this is done with a "RouteLookup" operation on the source IP address of the packet, using the routing table populated by the CP. The interface returned from the lookup is compared with the port on which the packet came in. If both are the same, the IP spoofing check is treated as successful and further packet processing happens. In other words, the check ensures that a packet destined to the current packet's source IP address would go out on the interface on which the current packet came in.
  • Determination of outgoing interface and next hop gateway:   This processing step involves "route lookup" operation on the "Destination IP" of the packet and optionally on TOS value of the packet.  "Route Lookup" operation indicates whether the packet needs to be consumed by local device (that is DIP is one of the local IP addresses) or whether the packet needs to be forwarded.  If the packet needs to be sent to the next hop, then the gateway IP address (or destination IP itself if the host is in the same subnet of the router) and outbound interface would be returned.   If the packet is meant to the local device itself, then packet is handed over to the CP. Rest of the processing assumes that packet is forwarded to the next hop.
  • TTL Decrement :  At this processing step,  DP decrements the TTL and does incremental checksum update to the IP header checksum field.  If the TTL becomes 0, then the packet gets dropped.
  • Packet Fragmentation: If the packet size is more than the PMTU value of the route or the MTU of the outbound interface, then the packet gets fragmented at this step.
  • MAC Address Determination: If the outbound interface is of Ethernet type, then this processing step finds the MAC address for the gateway IP address determined in one of the previous steps. It refers to the "Address resolution table" populated by the CP, using the IP address as input to get the MAC address. If there is no matching entry, the packet is sent to the CP for ARP resolution. In case the outbound interface is a Point-to-Point interface, the L2 header can be found from the interface table, which is populated by the CP.
  • Packet update with L2 header: DP frames the Layer 2 header (Ethernet, VLAN, PPP etc.) and prepends it to the packet, right before the IP header.
  • Packet Out:  Packet is sent out the outbound interface determined as part of "routelookup" operation.
At each step,  appropriate statistics counters are incremented. There are two types of statistics - Normal and Error statistics.  These statistics are typically  maintained globally,  on per interface basis,  on per routing entry or per ARP table entry.
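The TTL decrement step above mentions an incremental checksum update. A minimal sketch of that step follows, assuming a plain IPv4 header held in a `bytearray` and using the ones' complement update formula from RFC 1624 (HC' = ~(~HC + ~m + m')). The function names are illustrative, and a real datapath would do this in hardware or in optimized C.

```python
import struct


def ipv4_checksum(header: bytes) -> int:
    """Full ones' complement checksum; the checksum field must be zeroed first."""
    total = sum(word for (word,) in struct.iter_unpack("!H", header))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def decrement_ttl(header: bytearray) -> bool:
    """Decrement TTL in place and patch the checksum incrementally.

    Returns False (header untouched) when the TTL would reach 0,
    meaning the packet should be dropped.
    """
    ttl = header[8]
    if ttl <= 1:
        return False
    old_word = (header[8] << 8) | header[9]   # TTL and protocol share one 16-bit word
    header[8] = ttl - 1
    new_word = (header[8] << 8) | header[9]
    old_csum = (header[10] << 8) | header[11]
    total = (~old_csum & 0xFFFF) + (~old_word & 0xFFFF) + new_word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    header[10:12] = struct.pack("!H", ~total & 0xFFFF)
    return True
```

The point of the incremental form is that only the changed 16-bit word is touched, so the cost does not depend on the header length.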

Upon analysis of the above steps, the IP routing datapath refers to a few tables:
  • IP Routing Database Table, which is LPM (Longest Prefix Match), for route lookup operation.
  • MAC Address resolution table, which is exact match table,  to find MAC address for a given IP address.
  • Interface Table, which is an index table, to find the L2 header to be used in case of Point-to-Point interfaces. The interface index is typically returned by the "route lookup" operation.
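As a rough illustration of the LPM table and how the spoofing check reuses it, here is a minimal sketch for IPv4. It assumes one exact-match dictionary per prefix length, probed from /32 down to /0. Real datapaths typically use tries or TCAMs, and the `RouteTable` and `spoof_check` names are made up for this example.

```python
import ipaddress


class RouteTable:
    """Toy IPv4 longest prefix match table."""

    def __init__(self):
        self._by_len = [dict() for _ in range(33)]  # index = prefix length

    def add(self, prefix, out_if, gateway=None):
        net = ipaddress.ip_network(prefix)
        self._by_len[net.prefixlen][int(net.network_address)] = (out_if, gateway)

    def lookup(self, ip):
        """Return (out_if, gateway) for the longest matching prefix, or None."""
        addr = int(ipaddress.ip_address(ip))
        for plen in range(32, -1, -1):
            mask = (0xFFFFFFFF << (32 - plen)) & 0xFFFFFFFF
            hit = self._by_len[plen].get(addr & mask)
            if hit is not None:
                return hit
        return None


def spoof_check(table, src_ip, in_if):
    """Pass only if the route back to src_ip points at the arrival interface."""
    route = table.lookup(src_ip)
    return route is not None and route[0] == in_if
```

A gateway of `None` stands for a directly connected subnet, where the destination IP itself is used for MAC address resolution.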
IP routing in some devices also implements "Policy Based Routing" (PBR). PBR is an ACL type of table with each entry identified by several fields, including IP header fields (SIP, DIP, TOS), input interface, output interface and even transport header fields. The output of each ACL rule is the routing table to use. Basically, "route lookup" functionality then involves two steps. The first step finds the matching ACL rule in the PBR table, which gives the routing table to use. The second step searches this routing table with the IP address and TOS to get the routing information. ACL rules can have field values represented as subnets, ranges or single values. The ACL is typically an ordered list.
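The first step of the two-step PBR lookup can be sketched as an ordered, first-match rule list. This is only an illustration: the `PbrRule` fields, the packet dictionary layout and the "main" default table are assumptions, and only a subset of the match fields mentioned above is modeled.

```python
import ipaddress


class PbrRule:
    """Hypothetical PBR rule: subnet, single-value and range matches."""

    def __init__(self, table, sip="0.0.0.0/0", dip="0.0.0.0/0", tos=None, dport=None):
        self.table = table                    # name of the routing table to use
        self.sip = ipaddress.ip_network(sip)  # subnet match
        self.dip = ipaddress.ip_network(dip)  # subnet match
        self.tos = tos                        # single value, or None for any
        self.dport = dport                    # (low, high) range, or None for any

    def matches(self, pkt):
        return (
            ipaddress.ip_address(pkt["sip"]) in self.sip
            and ipaddress.ip_address(pkt["dip"]) in self.dip
            and (self.tos is None or pkt["tos"] == self.tos)
            and (self.dport is None or self.dport[0] <= pkt["dport"] <= self.dport[1])
        )


def select_table(rules, pkt, default="main"):
    """Walk the ordered rule list; the first matching rule picks the routing table."""
    for rule in rules:
        if rule.matches(pkt):
            return rule.table
    return default
```

The second step would then run a normal route lookup in whichever table `select_table` returns.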

Bridging:  Bridging switches the packets without any modification to the packet.  Bridging is typically done among the Ethernet interfaces.  Bridging typically involves following steps upon receiving the packet from one of the Ethernet interfaces that belong to a bridge.
  • Ethernet  Header Parsing: In this step,  Ethernet header and VLAN headers are extracted from the packet.
  • Bridge Instance determination: There could be multiple bridge instances supported by the datapath. This step determines the bridge instance based on the interface on which the packet came in. A given Ethernet or VLAN interface belongs to only one bridge. The CP populates the Interface table with entries, each identified by an "Interface Index" and having parameters such as the Bridge Instance Index (index to the bridge table), whether this interface allows learning, forwarding etc.
  • Learning Table Update: Each bridge maintains a learning table, also called the FDB (Forwarding Database). This table provides interface information given a MAC address, and the bridging datapath normally uses it to determine the outbound interface for received packets. The table itself is populated from incoming packets: the source MAC address of a packet and the interface on which it came in are used to create an entry. Basically, the bridge learns the machines on the physical network attached to each Ethernet port. As part of this processing step, if there is no learning table entry for the source MAC, a new entry gets added. Note that learning is done only if the Ethernet interface status in the "interface table" indicates that this interface can be used to learn the machines. In some systems, the population of the learning table is done by the CP: whenever the DP finds no entry that matches the SMAC, it sends the packet to the CP and the CP creates the table entry.
  • Determination of outbound interface: In this step, DP does the lookup on the FDB with the DMAC address as key. If there is a matching entry, it knows the outbound interface on which the packet needs to be sent. In case of a Multicast packet, it refers to the Multicast FDB to find the set of interfaces on which the packet needs to be sent out. In case of a broadcast packet (DMAC = ff:ff:ff:ff:ff:ff), all interfaces in the bridge are selected to send the packet out. Note that the Multicast FDB is populated by the CP, which does this by interpreting IGMP, MLD and PIM-SM packets.
  • Packet Out:  In this step, packet is sent out.  In case of unicast packet,  if the interface is known, packet is sent out.  In case the outbound interface is unknown (there is no matching FDB entry), then packet is sent out on all ports except the incoming port.   Packet is duplicated multiple times to achieve this.  Multicast packets are also sent on all ports if there is no matching  entry. Otherwise, it sends the packet to interfaces as indicated by Multicast FDB.  Broadcast packets are always sent to all ports in the bridge. Note that, in all these cases, packet is never sent on the interface on which packet came in. Also,  datapath does not send the packet if the interface is blocked for sending packet out.  This information is known from the interface table.
Bridging module maintains following tables.
  • Unicast FDB, which is exact match table. Match field is 'MAC address'.  One FDB is maintained for each bridge instance.
  • Multicast FDB, which is also an exact match table. Match field is 'Multicast MAC Address'. One Multicast FDB is maintained for each bridge instance.
  • Interface Index Table - Which is indexed by the incoming interface identifier. Typically global to the entire datapath.
  • Bridge Instance Table - Which is indexed by bridge instance. Typically global to the entire datapath.
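The unicast part of the bridging steps above can be sketched as follows. This is a simplified model of a single bridge instance: the Multicast FDB, VLANs, entry aging and the per-interface learning/forwarding flags are left out, and the class and port names are illustrative.

```python
BROADCAST = "ff:ff:ff:ff:ff:ff"


class Bridge:
    """Toy learning bridge with one unicast FDB (MAC -> port)."""

    def __init__(self, ports):
        self.ports = set(ports)
        self.fdb = {}

    def receive(self, in_port, smac, dmac):
        """Learn the source MAC, then return the sorted list of output ports."""
        self.fdb[smac] = in_port                      # learning table update
        out = None if dmac == BROADCAST else self.fdb.get(dmac)
        if out is None:
            # Broadcast or unknown unicast: flood on every port except the input.
            return sorted(self.ports - {in_port})
        # Never send a frame back out of the port it arrived on.
        return [] if out == in_port else [out]
```

For example, the first frame between two hosts is flooded, and the reply is sent only on the port learned from that first frame.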
NAT/Firewall Datapath: NAT and firewalls typically require session processing. The session management part of NAT/firewalls is typically implemented in the datapath. Sessions are typically 5-tuple sessions (Source IP, Destination IP, Protocol, Source Port and Destination Port) in case of TCP and UDP. A session is initiated by a client and terminated by a server. Since NAT/firewall devices sit between client and server machines, they see both sides of the connection - Client to Server and Server to Client. Hence, in these devices, a session consists of two flows - the C-to-S and S-to-C flows (Client-to-Server and Server-to-Client flows). In a firewall session, the 5-tuple values of the C-to-S and S-to-C flows are the same except that source and destination are swapped. That is, the source IP of the C-to-S flow is the destination IP of the S-to-C flow, and the destination IP of the C-to-S flow is the source IP of the S-to-C flow. The same is true of source and destination ports. In case of NAT, it is not as simple, because NAT translates the client SIP, DIP, SP and DP to new values. Hence the two flows of a NAT session have different 5-tuple values, but both belong to one session. In summary, a session maintained by NAT/firewall devices is pointed to by two different flows.

Session Processing involves following steps in datapath:
  • Packet Parsing and Integrity of headers (L2 header, IPv4/IPv6 header, TCP/UDP/ICMP/SCTP headers) :  Similar step as described in "Routing" section. 
  • IP Reassembly: Non-first IP fragments do not carry the transport header. Since session matching requires the full 5-tuple, non-initial fragments would not be matched to the right flows if reassembly is not done. As part of IP reassembly, many checks are made for anomalies, and fragments might get dropped as a result. Once the full transport packet is reassembled, further processing happens. Some of the checks that are done as part of this processing step are:
    • Ensuring that initial and middle fragments are at least of some configured size - This catches deliberate fragmentation by a sender trying to make the device spend a large number of CPU cycles, thereby creating a DoS condition.
    • Ensures that the total IP packet size never exceeds 64K size.
    • Taking care of overlapping IP fragments.
    • Ensuring that the data in overlapping IP fragments is the same. If not, it is considered a deliberate attempt and this processing step drops these packets.
  • Transport Header integrity checks:  At this step, transport header integrity is ensured such as
    • Ensures that the packet holds transport header.
    • Some datapaths may also verify transport checksum.
  • Extraction of fields from the parsed headers :  Mainly SIP, DIP, P,SP and DP are extracted from network and transport headers.
  • Security Zone lookup :  Every incoming interface belongs to one of the security zones.  CP programs the table of entries with each entry having Interface ID and security Zone ID it belongs to.  Security Zone ID is used in flow lookup function.
  • Flow Lookup to get hold of session:  As discussed before, a session consists of two flows.  Flows are typically arranged in hash lists.  When the session is created by CP,  CP is expected to put two flows in the hash list  with each flow pointing to its session context.  Sessions are arranged in an array (Index Table).  Reference to session (array index) is put in the flow.   Basically,  there is one hash list for flows and Index table for sessions.   5-tuples and inbound security Zone ID are used to match the flow.  If there is no matching flow found,  the packet is sent to the CP.  If the flow is found,  then it goes onto next processing step.
  • Anomaly Checks: Once the flow, the associated session context and the complementary flow are determined, this processing step looks for anomalies using the packet header contents and the state variables maintained in the session. One check that comes to my mind is:
    • Ensuring that the TCP sequence number of the current packet is within a reasonable range (Example: 128K) of the previous packet's sequence number. This is typically done to ensure that a MITM did not generate the TCP packet, mainly a TCP packet with the RST bit set. A forged TCP RST packet can tear down the session not only in the device, but also on the receiving client/server machine.
  • Packet Modification: This step involves packet modification. Packets belonging to sessions created by the firewall CP may not undergo many changes. In case of NAT, several modifications are possible based on the session state created by the CP. Some of them are:
    • Source NAT and Destination NAT modifies the SIP and DIP of the packet.
    • NAPT may modify both SP and DP of TCP and UDP packets.
    • To ensure that modified packet has unique IPID,  IPID also gets translated with unique IPID.
    • TCP MSS value might be modified with lesser MSS value to avoid any fragmentation in future steps.
    • TCP sequence numbers may also get translated. This typically happens when the CP has done some packet processing beforehand, such as SYN flood protection using the SYN-COOKIE mechanism, or due to some ALG processing.
  • Packet Routing and Packet-out:  These steps are similar to IP Routing module.
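The TCP sequence number check listed under anomaly checks has to cope with 32-bit wraparound. A minimal sketch follows, assuming the session stores the last seen sequence number and that a symmetric 128K window is acceptable. Real firewalls usually track the advertised TCP window as well, so treat this only as an illustration of the modular arithmetic.

```python
SEQ_MOD = 1 << 32  # TCP sequence numbers are 32-bit and wrap around


def seq_in_window(last_seq: int, seq: int, window: int = 128 * 1024) -> bool:
    """True if seq lies within `window` bytes of last_seq, in either direction."""
    diff = (seq - last_seq) % SEQ_MOD
    return diff <= window or diff >= SEQ_MOD - window
```

A packet (for instance an RST) whose sequence number fails this check would be dropped instead of being allowed to tear down the session.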
 Upon analysis of above steps, one can find that there are following tables in Session Processing.
  • Interface to Security Zone mapping table, which is an index table.
  • Flow hash table, which is an exact match hash table.
  • Flow to Session ID table, which is an index table.
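The relationship between the flow hash table and the session index table can be sketched as below. A Python dict stands in for the hash list and a Python list for the index table. The `FlowKey` layout (zone plus 5-tuple) follows the description above, while the class and function names are assumptions.

```python
from collections import namedtuple

FlowKey = namedtuple("FlowKey", "zone sip dip proto sport dport")


def reverse_key(key, zone):
    """S-to-C key of a plain firewall session: source and destination swapped."""
    return FlowKey(zone, key.dip, key.sip, key.proto, key.dport, key.sport)


class SessionTable:
    """Toy session manager: two flow entries point at one session slot."""

    def __init__(self):
        self.flows = {}      # FlowKey -> session index (exact match)
        self.sessions = []   # index table of session contexts

    def create_session(self, c2s, s2c, state=None):
        """Called on behalf of the CP: install both flows of a new session."""
        self.sessions.append({"c2s": c2s, "s2c": s2c, "state": state})
        idx = len(self.sessions) - 1
        self.flows[c2s] = idx
        self.flows[s2c] = idx
        return idx

    def lookup(self, key):
        """Return the session context, or None on a miss (packet goes to the CP)."""
        idx = self.flows.get(key)
        return None if idx is None else self.sessions[idx]
```

For a NAT session the CP would pass an S-to-C key built from the translated addresses and ports rather than using `reverse_key`.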
Observations & Suggestions for new version of Openflow specifications:

All datapaths listed above use tables heavily.  Contexts created in each table are used by actions being performed.  It is not clear whether Openflow specification group is targeting to cover any type of datapath.  If so, few things to notice and improve. 

Packet Reinjection : Each datapath might have more than one table. Typically a table miss results in the packet going to the control plane (the controller in case of Openflow). Once the control plane acts on the packet and creates the appropriate context in the table, it would like packet processing in the datapath to resume. Sometimes processing should restart from the table where the miss occurred. At other times the CP has already done that table's operations on the packet, and processing needs to start from the next table as specified in the newly created context. The protocol between control plane and datapath (Openflow) needs to provide this flexibility to the control plane.

Atomic Update and Linkage of flows: The control plane at times needs to create multiple contexts in tables atomically. For example, a NAT/firewall control plane would need to create the two contexts for the two flows of a session in atomic fashion. A commit based facility would help in achieving this. I think the openflow protocol needs to be enhanced to enable this operation.

Parsing headers & Extraction of fields:  Since there could be different datapaths that need to be exercised on a packet, one time parsing of the packet may not be sufficient. Each datapath or even each table might be preceded by parsing & extraction unit. There are different fields which are required in different paths.  There could be tunnel headers which need to be parsed for some datapaths to get to inner packet.  Each datapath might have its own requirement on which fields to be used for further processing.  That is, some datapath might need to work on IP header of tunnel and some other datapath might have requirement to use inner IP packet fields.  Hence I believe it is good to have parsing & extraction unit for each table. In cases, where the datapaths don't require different parsing & extraction units,  controller would not configure associated parsing & extraction units.  Some hardware devices (Multicore SoCs) I am familiar with support multiple parsing & extraction unit instances.  So, this is not a new thing for hardware ASIC/Multicore-SoCs. 

Pre Table lookup Actions:  Many times, there is a requirement to ensure that packet is good.  Packet integrity checks are very important during packet processing. Having some actions defined on per table basis (not on per flow basis) is good way to do these kinds of checks.  There may be a requirement in future to do even some modification to the packet before further parsing/extraction & eventual table lookup.  Pre-Table action list is one way to provide these facilities.  One can argue that, it is possible to use another table before packet is processed at current table and have context specific action on that table.  Though it is good thought,  it might result into inefficient usage of table resources in hardware.

Flexible Metadata: Metadata is used to communicate information among different tables and their associated actions. There are two types of metadata required in datapaths - some metadata is used as a lookup field in later tables and some is required by actions. The Openflow specification defines metadata, but it is limited to one word. I believe that this is insufficient. Fixing the metadata size in the standard is also not good. The controller can define the maximum metadata size required and program the size information in the datapath. The datapath can then allocate additional space to store metadata on a per packet basis using the size information programmed by the controller. Since there could be multiple types of datapaths and multiple tables within each type of datapath, the controller may categorize metadata for different purposes. Each metadata field and field size can be referred to later on by tables for lookup and by associated actions. I believe that openflow should provide that kind of flexibility - a one-time setup of the required metadata size in the datapath, a language to use different metadata fields for table lookup, and a language to define actions which can use different fields in the metadata or set/reset some fields in the metadata.

Actions :  Each type of datapath has its own actions - For example firewall session management datapath actions are different from the type of actions routing or bridging require.  Hence, there would be good number of actions eventually in the datapath.  And more actions would be required in future.  In addition to defining more actions in the openflow or associated specifications,  it should also provide flexibility to add new actions without having to create new hardware.  That is, openflow might need to mandate that the datapath hardware or software should be able to take newer actions in future.  To enable this,  it is required to define some kind of language to define new actions.  Since datapaths may be implemented in hardware ASIC,  the language should not have too many constructs/machine-codes.  Simple language is what is needed.  I thought simplified LLVM can be one option.

VXLAN (Virtual eXtensible LAN) - Virtual Data Centers - Tutorial

VMWare with contributions from Cisco, Citrix, Broadcom, Arista networks released IETF VXLAN draft which is a protocol to enable multiple L2 virtual networks over a physical infrastructure.  Please find that draft here.

The draft document clearly defines the problem statement and how VXLAN solves the problems. I will not repeat all of them here. Some important background points are mentioned here though.

Background

It is well known that Data Center and Service Provider networks are increasingly being enabled for multiple tenants. Even Enterprises are supporting multiple tenant networks for isolation - Some examples being isolation for different divisions, for research & Development, for trying out new services/networks.

Employees of a particular division used to be confined to a building in 1990s.   With globalization, it is no longer true.  Employees of one particular business unit are not only spread across multiple buildings, but also across countries and continents.  Each building or location have employees from multiple business units. Hence, it is required to create virtual LAN over same physical network where virtual LAN spans across the buildings, countries and continents.  Virtual LAN gives locality of L2 networks for related machines even though they are distributed across multiple physical networks. 

It is very common concept in recent past and hence there is no need to emphasize the need  for  virtual LANs for tenants in Data Center, Service Provider and multi-dwelling environments.  Operators would not like to create  new physical infrastructure or modify existing physical infrastructure every time they sign up a new tenant. Hence, operators of these networks support virtual networks for tenants for the purpose of isolation,  flexibility and effective usage of physical resources.   Virtualization of Servers solves this issue on compute side.

Virtualization in physical servers is now understood by Network Operators and tenants. Elasticity is provided on the compute side: the number of virtual machines can be expanded or reduced based on tenant requirements and usage. Virtual machines used to host one tenant service can spread across multiple physical machines. It is also true that a physical machine may hold virtual machines corresponding to multiple tenants. That is, a tenant service can span multiple virtual machines, which can in turn span multiple physical servers. Basically, a virtual server is now treated the way a physical server is treated in non virtualized architectures.

If a Data Center supports isolated networks and resources, then the Data Center is enabled for multi-tenancy, and such Data Centers are called Virtual Data Centers (VDCs). Implementation of a VDC requires multi-tenancy across all active devices in the Data Center. Compute Server and Storage Virtualization is well understood and already being done to a great extent. Network Service devices too have multi-tenancy today. Physical network service devices which are popular today in Data Center markets are also enabled for multi-tenancy. These service devices tend not to implement multi-tenancy using virtualization as compute servers do today. They tend to support multi-tenancy using "Data Virtualization" or at the most "Container based Virtualization" for scalability purposes.

Tenant-ID communication among the Data Center equipment traditionally happens using VLAN ID.  A VLAN ID or set of VLAN IDs are assigned to each tenant.  The front end equipment figures out the tenant Identification (VLAN ID) based on IP addresses (Destination IP address of the incoming packets from the Internet).  Then onwards, the communication to compute servers (Web Servers, Application Servers, Data base Servers) and Storage devices happens via this VLAN ID. Reverse traffic (Outbound traffic to the Internet) also happens on VLAN IDs until the packets reach the front end device.  Front end device, then would remove VLAN header from the packet and send the packets out onto the Internet.

L2 switches,  L3 switches and any other network service devices use VLAN ID to identify the tenant and apply appropriate tenant specific policies.

Why VXLAN if VLANs are good?

VLANs are not good enough for following reasons:
  • VLANs are fixed in number - The VLAN header defines 12 bits for the VLAN ID, which means that only 4K VLAN IDs are possible. Even with the best case assumption of 1 VLAN ID for each tenant, a Data Center can support at most 4K tenants.
  • VLAN is mostly an L2 concept. Keeping a VLAN intact across L2 networks separated by L3 routers is not straightforward and requires some intelligence in L3 devices. Especially when tenant networks need to be expanded to multiple geographic locations, extending VLANs across the Internet requires newer protocols (such as TRILL).
  • If tenant traffic requires VLANs themselves for different reason,  double tagging and triple tagging may be required. Though 802.1ad tagging can be used for tenant identification and use 802.1Q tagging for tenant specific VLANs,  this may also require changes to existing devices.
VXLAN is a new tunneling protocol that works on top of UDP/IP. It does require changes to existing infrastructure to understand this new protocol, but it does not have the limitations of VLAN based tenant identification. Since the L2 network is created over an L3 network, a VDC can now extend not only within one Data Center/Enterprise location, but across different locations of Data Center/Enterprise networks.

Some important aspects of VXLAN protocol:
  • VXLAN tunnels L2 packets across networks between VTEPs (VXLAN Tunnel End Points)
  • VXLAN encapsulates L2 packets in UDP/IP.
  • VXLAN defines the VXLAN header.  
  • UDP Destination Port indicates the VXLAN protocol.  Port number is yet to be assigned.
  • 16M virtual networks are possible.
  • VTEPs are typically End Point Servers (Compute Servers, Storage Servers) and Layer 3 based Network Service Devices. L2 switches need not be aware of VXLAN. Some L2 switches (ToRs) may add this intelligence to proxy VXLAN functionality on behalf of the compute servers connected to their ports. This would mainly be done to support non VXLAN based servers/resources. The VXLAN IETF draft calls this a VXLAN Gateway.
  • VXLAN Gateways are expected to translate tenant traffic from non-VXLAN networks to VXLAN Networks.  Front End device as described above is one example of VXLAN gateway.  This device might convert from Public IP address (Destination IP address) to VXLAN tenant (VNI - VXLAN Network Identifier).  
  • VXLAN defines VNI (VXLAN Network Identifier) which identifies the DC instance (VDC).  This is in place of VLANs that are used in the Data Center networks today.
  • It is expected that there is one management entity within one administrative domain to create the Virtual Data Centers (Tenant).  This involves assigning unique VNI to the tenant network (VDC instance),  associating Public IP addresses of the tenant,  Multicast IP address for ARP resolution across VTEPs in a VDC.
Like VLAN based tenant identification, VXLAN based tenant networks can have overlapping internal IP addresses. But the IP addresses assigned to virtual resources within a single VDC must be unique.

Some aspects of packet flow:

Let us assume following scenario:

A tenant is assigned with VNI "n" and Multicast IP address "m".  Two VMs  (VM1 and VM2) are provisioned on two different physical Servers (P1 and P2) located in two different cities.  VM1 is installed on P1 and VM2 is installed on P2.  P1 and P2 are reachable via IP addresses P1a and P2a.  VMs have private IP address VM1a and VM2a. 


Ethernet Packet from VM1 to VM2 would be encapsulated in UDP/IP with VXLAN header having following:

VXLAN header:  VNI  "n" and some flags.
UDP Header will have "source port"  assigned by system and standard "destination port".
IP header is generated with P1a as SIP and P2a as DIP.
Ethernet Header would have SMAC as local MAC address and DMAC corresponding to P2a from ARP resolution table or local gateway MAC address.
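To make the VXLAN header concrete, here is a minimal sketch that packs and parses it. It assumes the 8-byte layout as I understand it from the VXLAN specification: 8 flag bits with the "VNI valid" bit (0x08) set, 24 reserved bits, the 24-bit VNI and 8 more reserved bits. The function names are illustrative. The 24-bit VNI is also where the 16M figure comes from (2^24 = 16,777,216, against 2^12 = 4,096 VLAN IDs).

```python
import struct

VXLAN_FLAG_VNI_VALID = 0x08  # assumed flag layout, see the note above


def pack_vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header that precedes the inner Ethernet frame."""
    if not 0 <= vni < (1 << 24):
        raise ValueError("VNI must fit in 24 bits")
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def unpack_vni(header: bytes) -> int:
    """Extract the VNI, rejecting headers without the VNI-valid flag."""
    flags_word, vni_word = struct.unpack("!II", header[:8])
    if not (flags_word >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("VNI-valid flag not set")
    return vni_word >> 8
```

The outer UDP, IP and Ethernet headers described above would be prepended to this header by the VTEP.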

VTEP VXLAN functionality would be implemented in the NIC cards of servers or in hypervisors. It would also be part of L2 switches to support existing servers and associated VMs. VMs within the servers need not be aware of VXLAN and hence existing VMs will just work fine. VTEP functionality typically maintains a database (created by the management entity) of virtual NICs (of VMs) versus VNI - it could be as simple as a table of vNIC MAC address and VNI. The VTEP is also provisioned with the Multicast IP address associated with each VNI (a table of VNI versus Multicast IP address). When the VTEP gets a packet from the vNIC of a VM, it figures out all the information required to frame the UDP/IP/VXLAN headers. The VNI is known from the table provisioned by the management entity. The DIP of the tunnel IP header is determined from the learning table (the VTEP learning table, whose entries map the MAC addresses of remote VMs to remote VTEP tunnel IP addresses). The VTEP is expected to keep this table updated based on the packets coming from remote VTEPs. It uses the SIP of the tunnel header and the SMAC of the inner Ethernet packet to update this table.

This looks simple. But there are two things to be considered. First, how do VMs get hold of the DMAC address corresponding to a peer VM's IP address? A local ARP request does not work as it can't cross the local physical L2 domain. Second, how does the local VTEP get to know the remote VTEP IP address if there is no matching learning table entry?

Let us first discuss how ARP requests generated by VMs get satisfied. A broadcast ARP request generated by a VM should somehow go to all VMs and devices in the virtual network. That is where Multicast tunnels are used by the VTEP. The source VTEP, upon getting hold of an ARP request from the local VM, tunnels the ARP request in a Multicast packet whose address is taken from the VNI to Multicast IP address table. As a matter of fact, the VTEP encapsulates any broadcast/Multicast Ethernet packet sent by a local VM in the multicast tunnel. All VTEPs are expected to subscribe to the multicast address to receive multicast tunnel packets. Receiving VTEPs decapsulate to get hold of the inner packet, find all VMs (vNICs) corresponding to the VNI and send the inner packet onto those vNICs. The right VM responds to the ARP request with an ARP reply, and the remote VTEP sends the ARP reply to the source VTEP.

Second complication is discovering the remote VTEP IP address when there is no matching entry in the learning table.  This could happen when the entry gets aged out.  If the VMs are configured with static ARP table,  ARP requests also will not be generated by the VM and hence there may not be any opportunity to learn the remote VTEP IP address for remote MAC addresses.  In this case, source VTEP upon receiving any unicast packet from the local VM may need to generate the ARP packet as normally generated by VM. This ARP request is sent in Multicast tunnel as described above. This will trigger the ARP reply from the remote VM which gets encapsulated by the remote VTEP.   This message can be used by local VTEP to update the learning table.  Since this process may take some time, VTEP may also need to buffer the packet until the learning is done.
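The VTEP behavior described in the last few paragraphs (provisioned tables, the learning table, multicast for broadcast frames and buffering on a learning table miss) can be sketched as below. This is a toy model: the `Vtep` class, the action tuples and the choice to key the learning table by (VNI, MAC) are assumptions. It also treats only the broadcast MAC as multicast traffic and leaves out aging and actual packet I/O.

```python
BROADCAST = "ff:ff:ff:ff:ff:ff"


class Vtep:
    """Toy VTEP control logic; all names here are illustrative."""

    def __init__(self, vnic_to_vni, vni_to_group):
        self.vnic_to_vni = dict(vnic_to_vni)    # provisioned: vNIC MAC -> VNI
        self.vni_to_group = dict(vni_to_group)  # provisioned: VNI -> multicast IP
        self.learned = {}                       # (VNI, remote VM MAC) -> remote VTEP IP
        self.pending = {}                       # (VNI, remote VM MAC) -> buffered frames

    def on_local_frame(self, vnic_mac, dmac, frame):
        """Decide how a frame from a local vNIC is tunneled."""
        vni = self.vnic_to_vni[vnic_mac]
        if dmac == BROADCAST:
            return ("multicast", vni, self.vni_to_group[vni])
        remote_ip = self.learned.get((vni, dmac))
        if remote_ip is not None:
            return ("unicast", vni, remote_ip)
        # Unknown destination: hold the frame. The caller is expected to send
        # an ARP request for the peer over the multicast tunnel.
        self.pending.setdefault((vni, dmac), []).append(frame)
        return ("buffer", vni, self.vni_to_group[vni])

    def on_tunnel_packet(self, vni, tunnel_sip, inner_smac):
        """Learn from a received tunnel packet; return frames that can now be sent."""
        self.learned[(vni, inner_smac)] = tunnel_sip
        return self.pending.pop((vni, inner_smac), [])
```

Once the ARP reply from the remote VM arrives through its VTEP, `on_tunnel_packet` fills the learning table and releases the buffered frames, after which traffic to that MAC goes out as unicast tunnel packets.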

Problems and possible solutions to VXLAN based VDCs:
  • I believe IPv6 is needed for tunnels.  IPv4 tunnel is okay in short term.  As you would have gathered by now, each tenant requires one multicast address.  This multicast address needs to be unique in the Internet.  That is, this address needs to be unique across network/Data-Center operators. If this mechanism gets popular, it is possible that multicast addresses may run out very soon.
  • Security is very critical.  It is now possible for a man-in-the-middle or even an external attacker to corrupt the learning tables.  VNIs will be known to attackers eventually.  Multicast or unicast packets can be generated to corrupt the learning table as well as to overwhelm it, thereby creating a DoS condition.  I think IPsec (at least authentication with NULL encryption) must be mandated among the VTEPs.  It is understandable that IKE is expensive in NIC cards, hypervisors and VXLAN gateways.  But IPsec is now available in many NIC cards and multicore processors.  A management entity can take the job of populating the keys in the VTEP end points, similar to the management entity provisioning the multicast address for each VNI.  I believe that all VTEPs would be controlled by some management entity.  Hence it is within the realm of possibility to expect the management entity to populate the IPsec keys in each VTEP.  For multicast tunnels, the key needs to be the same across all the VTEPs.  The management entity may recycle the key often to limit the damage when attackers get hold of the keys.  For unicast VXLAN tunnels, the management entity can either use the same key for all VTEPs as for multicast, or it could use pairwise keys.
I think above problems are real.  It would be good if next draft of VXLAN provides some solutions to above problems. 

Thursday, December 15, 2011

Embrane - Is this SDN Play?

Recently, I came across a company called Embrane while doing some google search on  SDN.  Then I saw a press release that Embrane made a product release announcement. I thought I would check this out and see how far it goes in SDN.  I had gone through the whitepaper published in Embrane website.  If you are interested,  you can find that paper here.

My understanding of Embrane solution:

When I first read the white paper, I was not sure about  Embrane product - Whether it is a platform/framework  to instantiate any type of network service virtual appliances from any vendor or whether the Embrane provides some network services as virtual appliances. By end of reading the whitepaper and after going through their website, it appears that Embrane's main focus is to deliver the framework for any virtual network service appliances including third party virtual appliances.

Embrane architecture mainly consists of four components.  Each component is installed as separate VMs.
  • Elastic Service Manager (ESM):  Typically, there would be one VM of this type.  The Data Center operator works on this VM to provision Distributed Virtual Appliances.
  • Distributed Virtual Appliances (DVA):  Each DVA is a logical set of VMs.  There are three kinds of VMs within each DVA.  Even though there are multiple VMs within one logical DVA, it can be treated as one appliance for all practical purposes.  As I understand it, the Data Center operator needs to instantiate as many DVAs as there are tenants.  If two types of network services are required, then there would be 2 DVAs.  So, if there are X tenants in a Data Center and each tenant requires Y network services (ADC, Firewall, Web Application Firewall, WAN Optimization etc.), then there would be a need for X * Y DVAs.  Now, coming to the three kinds of VMs within each DVA:
    • Network Service Virtual Appliances (NSVA):  A DVA can have multiple virtual appliances.  These appliances implement the actual functionality of the network service such as ADC, Firewall, WOC etc.  Obviously, there must be at least one network service VA in a DVA.  Multiple VAs can be instantiated by the ESM for scaling performance (scale-out).
    • Data Plane Dispatcher (DPD):  There will be 1 DPD in each DVA.  The DPD is the one which actually distributes the traffic across multiple NSVAs for linear performance scaling.
    • Data Plane Manager (DPM):  One DPM VM in each DVA.  The DPM is expected to configure the NSVAs and DPD in the DVA on behalf of the ESM.  Though it is not clear, I am assuming that this will ensure that configuration integrity is maintained across all NSVAs.  It appears that this is the only VM that requires persistent storage, and hence I am guessing that it might be storing the audit and system logs generated by the DPD and NSVAs in persistent storage.
If the network service is ADC (Application Delivery Controller), then the DVA can be viewed as providing one more level of load balancing.  That is, the DPD acts as a load balancer to multiple ADCs.  As we know, an ADC itself acts as a load balancer to servers.  This makes sense as ADCs have become complex in the recent past and their computation power requirements have gone up.  Hence, one more layer of load balancing is indeed required.  In current Data Center deployments, this is achieved using L2 switches.  L2 switches have the capability to balance the load across multiple external devices based on a hash of defined fields in the L2/L3 and L4 headers.

I have detailed out how L2 switches can be used to distribute the traffic across multiple devices of a cluster.  Please check that out here.

My views:

It appears that DPD functionality is similar to what I described in my earlier post.
Since the DPD is a software based distributor, I expect that it will not have the limitations of L2 switch based load distribution. As we all know, many network services work with sessions (typically 5-tuple based - source IP, destination IP, protocol, source port, destination port) to store state across packets. Any load distributor is expected to take care of this by sending the packets of a session to only one device in the cluster. If this is not done, there would be a lot of communication across the devices (virtual appliances) within the cluster. This may eliminate the benefit of having multiple virtual appliances in the cluster. In my view, the DPD should be distributing sessions (not packets blindly) across multiple virtual appliances. Since it is a software based solution, it can go one step further and ensure that all sessions belonging to one application session are sent to the same virtual appliance. VoIP based on SIP is one example where there can be 3 UDP sessions corresponding to one application session. DPD kind of devices need to ensure that the traffic of all three sessions in this example is sent to one device (virtual appliance). Detection of the 5-tuples of data connections is only possible if the DPD supports ALGs (Application Level Gateways). Since there could be more ALG requirements in future, the challenge is for the "Load Distributor" vendor to keep providing these ALGs on a continuing basis and/or to open up the DPD architecture for third party vendors to install their own ALGs, thereby maintaining the SDN spirit.
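Here is a small Python sketch of what I mean by distributing sessions rather than packets. This is my own illustration and not Embrane's implementation; the class and method names are hypothetical. The session key is normalized so that both directions of a 5-tuple land on the same appliance, and `bind_related` stands in for what an ALG would do when it learns the 5-tuple of a data connection (for example an RTP stream negotiated over SIP).

```python
import hashlib


class SessionDistributor:
    """Hypothetical session-aware load distributor."""

    def __init__(self, appliances):
        self.appliances = list(appliances)
        self.sessions = {}  # normalized 5-tuple -> appliance

    @staticmethod
    def _key(sip, dip, proto, sport, dport):
        # Order the two endpoints so that forward and reverse packets
        # of the same session produce the same key.
        a, b = (sip, sport), (dip, dport)
        return (proto,) + (a + b if a <= b else b + a)

    def _pick(self, key):
        # Stable hash of the key selects the appliance for a new session.
        digest = hashlib.sha256(repr(key).encode()).digest()
        index = int.from_bytes(digest[:4], "big") % len(self.appliances)
        return self.appliances[index]

    def dispatch(self, sip, dip, proto, sport, dport):
        """Return the appliance for a packet, creating the session if new."""
        key = self._key(sip, dip, proto, sport, dport)
        if key not in self.sessions:
            self.sessions[key] = self._pick(key)
        return self.sessions[key]

    def bind_related(self, parent, child):
        """Pin a related session (learned by an ALG) to the parent's appliance."""
        self.sessions[self._key(*child)] = self.dispatch(*parent)
```

With this, the SIP signalling session and the media sessions bound to it through `bind_related` always reach the same virtual appliance, whichever direction the packets flow in. Session ageing and appliance failure handling are left out.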

As I described in the same earlier post, configuration synchronization among the network service devices (NSVAs) is one important aspect of cluster based systems.  I guess the DPM is the one which takes care of it in the Embrane solution.

Overall this architecture is good and replicates the physical solution in the cloud.  It is good for environments where Data Center operators don't allow physical appliances to be deployed by their customers.

It does not appear to be Openflow based. But it can be still considered as part of SDN as it allows third party network service virtual appliances in their framework.


Challenges I see in Embrane solution:

Embrane might be having following features already.   Since I did not find any information on this, I thought it is worthwhile mentioning. I believe that following features are required in DPD kind of load distributors.

As I described above,  classifying the packets across multiple application sessions and ensuring that all packets corresponding to an application session go to the same network service virtual appliance is one big challenge for these kinds of equipment.  I know this personally and it is quite challenging to support multiple ALGs, mainly ensuring interoperability with both clients and servers. 

Some network deployments might see traffic on tunnels such as IP-in-IP, GRE, GTP-U etc.  Traffic corresponding to many sessions is sent on very few tunnels.  To ensure distribution across multiple NSVAs, the "Load Distributor" needs the flexibility to dig deep into the tunnels and classify the packets based on the inner packets.

Performance, I believe, would be the biggest challenge in a virtual machine based "Load Distributor". Classifying the packets, distributing sessions and sending subsequent packets of each session to the selected virtual appliance requires maintaining millions of sessions in the "Load Distributor".  What I hear is that a single-CPU virtual machine on hypervisors such as VMware and Xen typically gives about 1Gbps of performance for small packets.  More processors can be added to the "Load Distributor" virtual machine, but achieving performance in the 10s of Gbps may be very challenging.

My 2 Cents:

I believe that the "Load Distributors" need to go beyond virtual machines. Taking advantage of Openflow based switches to forward the packets would be the solution of choice in my view.  The Load Distributor virtual machine can do the ALG kind of functionality and the selection of the network service device for new sessions, create appropriate flows in the Openflow switches, and leave the Openflow switches to forward subsequent packets to the network service devices/virtual appliances.  That is, the Openflow switch forwards a packet to the "Load Distributor" only if there is no matching flow.  Hence the traffic to the Load Distributor is small and one virtual machine would be able to process the load.  Since Data Center operators normally have switches (eventually Openflow switches), this mechanism would work just fine in cloud environments.
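The division of work I have in mind can be sketched in a few lines of Python. This is a toy model under my own assumptions, not real Openflow code: `OffloadSwitch` stands in for an Openflow switch with an exact-match flow table, and `RoundRobinDistributor` stands in for the Load Distributor virtual machine (both names are hypothetical). Only the first packet of a session is punted to the distributor. Every later packet is forwarded by the switch itself.

```python
class RoundRobinDistributor:
    """Hypothetical Load Distributor VM: picks a device for each new session."""

    def __init__(self, devices):
        self.devices = list(devices)
        self.next_index = 0

    def new_session(self, five_tuple):
        device = self.devices[self.next_index % len(self.devices)]
        self.next_index += 1
        return device


class OffloadSwitch:
    """Toy stand-in for an Openflow switch with an exact-match flow table."""

    def __init__(self, distributor):
        self.distributor = distributor
        self.flows = {}   # five_tuple -> network service device
        self.punted = 0   # packets that had to go to the distributor

    def receive(self, five_tuple):
        device = self.flows.get(five_tuple)
        if device is None:
            # No matching flow: punt to the distributor, which selects a
            # device and installs a flow for the rest of the session.
            self.punted += 1
            device = self.distributor.new_session(five_tuple)
            self.flows[five_tuple] = device
        return device
```

In this model the `punted` counter grows with the number of sessions and not with the number of packets, which is why a single virtual machine could be enough for the distributor role.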


Sunday, December 11, 2011

Software Defined Networking - Java Role

The Controller component of Openflow based SDN is supposed to have most of the networking intelligence.  One might have gathered by this time that SDN is expected to provide programmability of network devices rather than simple configurability.  The SDN Controller component is going to be a complex entity in SDN. Software engineering principles tell us that complexity is only manageable with modularity.  Complexity and the associated modularity are not new in web based programming projects (server side programming).  A lot of innovation has already happened on programming and modularity in server operating systems and middleware packages. These lessons can be used to manage the controller complexity too.

Many programming languages and associated frameworks are popular in Server Side programming.  Python, Perl, PHP, Ruby and Java are some of the popular programming languages. There are many layers of frameworks defined to ease the application programming.  One thing to observe is that, there are very few server side complex projects implemented in C or C++.    That tells us something.

Modularity, several frameworks and large number of libraries available in these programming languages makes the application development  fast, maintainable and easily extendable.  This allows agile development, faster implementations of innovations with less cost of development.

In SDN world too, this was realized and one would see Controller Network Operating Systems in Java and Python already today.

In my view, Java would be the software platform of choice on controllers.  The main reasons are a big pool of talent, great middleware packages, a large number of libraries, the backing of many big companies, wide usage, performance and access to hardware accelerators.

I have found one controller by name BEACON which is Controller Networking Operating system based on Java.  It uses modular architecture of java frameworks.  It uses spring framework,  OSGi for developing applications as service bundles.  Even controller side of openflow protocol was implemented in Java language. See the BEACON link for more information.

I already hear that some vendors are developing MPLS LDP,  Routing Protocols in Java.

Do 'C' network programmers need to worry?

Not in the short term.  If Openflow based SDN is going to pick up, there would be more work on the controller side than on the device. If Controllers are not based on the C language, then there would be less demand for C programmers.  Since network programmers tend to have a better understanding of the networking field, low level programming and protocols, their expertise would continue to be sought after, but networking programmers now need to know much more than C.  Employers will expect future engineers to know as much Java/Python as C.


Wednesday, December 7, 2011

ForCES (Forwarding and Control Element Separation) and Openflow 1.1 - Contrasting them

At very high level, both ForCES and Openflow protocols separate Control plane and Data plane.  Both protocols are intended to drive Software Defined Networking.

Some terminology differences:

  • Software Driven Networking versus Software Defined Networking :  Both are same, but different terms are used.  ForCES uses Software Driven Networking.  Openflow is created in the context of Software Defined Networking. 
  • Control Element and Forwarding Element terms are used by ForCES.  Openflow tend to use the terms Controller (which contains Control plane) and Switch or Datapath for Forwarding Element.  Some people also call the datapath as Fastpath and Dataplane.

Though at high level, both ForCES and Openflow can be considered  part of Software Defined Networking, there is no need for two different protocols.  Eventually, they need to get consolidated into one or one of them would die.  I think two protocol definitions have come into existence due to some conceptual differences on the functionality separation between "Control Plane" and "Data Plane". 

I think it may be difficult to bridge the conceptual differences, but I believe that there are some good things in the ForCES protocol that can be adopted into Openflow to make the Openflow protocol more complete.

First my view of conceptual differences between ForCES and Openflow protocol:

ForCES expects the datapath to have set of LFBs (Logical Functional Blocks).  That is, one forwarding element can contain multiple LFBs.  Each LFB is expected to be defined using inputs to the LFB,  output of the LFB and different components that can be configured in the LFB by the Control Element.  Components within LFB might have multiple tables of flows,  configuration information etc..  Each LFB is identified by ClassID.  Since there can be multiple instances of same LFB,  unique LFB instance is identified by LFB Class ID and LFB instance.  ForCES suite of protocols is going to define several standard LFBs.  As long as vendors develop LFBs as per the standards,  CE will be able to work with any datapath from different vendors. 

A ForCES CE mainly connects several LFBs to create a packet flow (topology) to achieve the needed functionality.  CEs also program each LFB for creating flows etc.  ForCES defines the LFBs to output metadata, and subsequent LFBs can make use of the metadata in addition to other inputs.

The main point above is that ForCES expects the datapath to be defined in terms of logical functional units.

Openflow differs conceptually on this point.  Openflow protocol does not expect any LFBs.  Openflow expects datapath to have several tables.  Controller creates the flows in those tables with different actions.

It appears that Openflow expects an even lower level implementation at the datapath.  LFB kind of functionality is expected to be done by the controller.  The controller has the flexibility to define the logical functional units by itself by programming tables and flows.  That is, the controller might divide the tables into multiple logical units, and each set of tables with its flows and associated actions results in what ForCES calls LFBs.

So, in essence,  Openflow based SDN expects very low level support by datapaths. 

Having said that, I feel that there would be requirement to define more and more actions in Openflow protocol.  To extend the functionality of datapath,  ForCES expect to create more LFB specifications.  In case of Openflow, I would expect that there would be more actions defined in the openflow protocol in future.

Some Pros and Cons:

In Openflow, tables can be used for whatever purposes by the Controller.  The purpose of the tables can be changed easily by the Controller based on network deployment requirements.  In the case of ForCES, if a particular deployment does not require certain LFBs, resources occupied by those LFBs may not be usable by other LFBs.  So there is a possibility of under-utilization of datapath resources in the case of ForCES.

But ForCES gives structure to the datapath.  That might come in handy for modularity, debugging and maintainability.  Maybe there is something that can be learnt from the ForCES conceptual model.

One lesson I think the Openflow specification can take from ForCES is to have some sort of action components.  The controller should be able to find out the types of actions the datapath supports.  Also, it would be great if datapaths had some programmability by which more actions can be uploaded from the controller without creating new datapath hardware revisions. Of course, this requires a common language to represent the actions so as to keep the controller separated from the datapath.  Datapaths need to understand this language and program themselves.

Some good things in the ForCES protocol that can be adopted in some fashion into future Openflow protocol specifications are:
  • 2PC Commit Protocol:  This is quite powerful.  In my old job, we did exactly same for similar problem.  I am sure that there would be many instances where a controller need to create several flows in different tables atomically and follow the ACID (Atomicity, Consistency,  Isolation and Durability) properties.  In addition, I also think that there would be a need for ACID support across multiple datapaths where controller needs to create flows across multiple datapaths/switches.  That is where, 2PC Commit protocol is going to help.
  • SCTP transport rather than TCP transport:   I personally like SCTP over TCP for message based protocols.  SCTP is as reliable as TCP and maintains message boundaries.  It is true that SSL based SCTP is not that popular, but some other security (IPsec) can be used.  SCTP makes it easy to provide "Batching" and "Command Pipelining".
  • Extendability:  ForCES followed TLV approach in messages between CE and FE.  Nice thing is that, there are nested TLVs and hence follows XML based nested architecture in binary form.   That is very extendable in future without having major revision to protocols and datapath implementations.   One can argue that Openflow binary protocol takes less bandwidth. That is true.  I was wondering whether we can follow hybrid approach - Known items in fixed header and unknown/future items in TLV format.
  • Selected Fields in LFBs:  Openflow today does not have any granularity on per table basis.  Openflow expects the flows in any table be based on all 15 fields.  I wish newer openflow protocol would specify the fields that are relevant for each table so that the datapath implementation can allocate appropriate flow blocks and utilize memory more effectively. 
  • Table Type:   Openflow specifies that the flows have priority.  This gives the impression that these tables maintain flows in order. There are several types of functions that don't need an ordered list, such as routing tables, which can use Trie tables.  Some tables can be exact match tables (hash tables can be used there).  In the case of ForCES, this issue is not present as each LFB can implement its own tables in its own fashion as long as the outside behavior is the same.  In Openflow, LFBs are logically part of the controller and the controller only sees the tables.  I would like to see the Openflow table definition have an "ACL table" (similar to what Openflow 1.1 defines today), an "LPM table" and an "Exact Match" table.
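To show why the three table types in the last bullet call for different data structures, here is a small Python sketch of their lookup semantics. The function names and the rule layouts are my own assumptions for illustration. The LPM lookup is written as a linear scan only to keep the example short. A real datapath would use a trie or similar structure.

```python
import ipaddress


def exact_match(table, key):
    """Exact match table: a hash lookup on the full key."""
    return table.get(key)


def lpm_lookup(routes, addr):
    """LPM table: among the prefixes containing addr, the longest wins.

    routes is a list of (prefix, next_hop) pairs.
    """
    ip = ipaddress.ip_address(addr)
    best = None
    for prefix, next_hop in routes:
        net = ipaddress.ip_network(prefix)
        if ip in net and (best is None or net.prefixlen > best[0]):
            best = (net.prefixlen, next_hop)
    return None if best is None else best[1]


def acl_lookup(rules, value):
    """ACL table: ternary (value/mask) rules, highest priority match wins.

    rules is a list of (priority, match_value, mask, action) tuples.
    """
    for priority, want, mask, action in sorted(rules, key=lambda r: -r[0]):
        if (value & mask) == (want & mask):
            return action
    return None
```

Only the ACL style lookup needs the priority ordering that Openflow attaches to every flow. The other two get their answer from the key alone, which is what lets a datapath use hash tables and tries for them.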
 

Sunday, December 4, 2011

Openflow 1.1 protocol tutorial

Openflow protocol is a TCP/SSL based protocol between controllers and switches.  Switch is expected to initiate a connection to the Controller.  For each datapath instance, switch device is expected to make a connection.  If the switch supports X number of datapaths (instances), then X number of connections are established.

As you might have guessed by this time,  Openflow is a protocol that separates Control Plane and Data Plane.  Data Plane is called as "datapath" too.

The Controller typically tries to get information from the datapath about its hardware/virtual switch configuration, such as the number of tables supported, number of ports, port information and QoS queues supported.  The Controller tries to get this information every time an Openflow connection from a switch is successfully accepted.

Controllers, based on the configuration by the administrators and on the output of protocols, know what to program in the datapath.  Datapaths maintain the tables for storing the flows.  The Controller has the flexibility to arrange the tables and use different tables for various purposes. Table traversal for a packet is controlled by the Controller.  Each flow the controller establishes in a table can have various actions such as packet header modifications, next table to jump to etc.  I will talk about how controllers arrange the tables and do flow management for various applications such as L2 switching, L3 switching and L4 switching in later posts.  Essentially, tables contain flows. Flows have matching fields and action fields.  Matching fields are used to match the flow in the datapath.  Packet processing continues to the next table, or the packet gets sent out, based on the action fields of the matching flow.  Some important items to notice are:
  • Tables are ordered lists, that is, each flow has a priority.
  • Meta data information can be collected across various matching flows in different tables.
  • Next table flow match can have the metadata information as one of the matching fields
  • Flows can point to group of action buckets.
  • Groups can be setup to 
    • Duplicate the packets.
    • Load share the traffic across multiple next tables.
In essence, the idea is that the controller has entire control over the packet path across multiple tables. The datapath need not have any intelligence about L2 switching or IP routing.  As long as it blindly does the operations specified in the flows in the tables, things should work fine, as the Controller takes responsibility for L2/L3 protocol level processing.

Flows in the tables can be programmed by the control plane in two ways - proactive and reactive.  It is also possible that some flows are set up proactively and some reactively.  Proactive flows are set up without waiting for any packets.  Reactive flows are set up only when there is a packet.  Initially, there may not be any flows in the tables.  When the datapath does not find a matching flow in a table, and the miss property of the table says to send the packet to the controller, the datapath sends the packet to the controller. The controller is expected to push a flow and send the packet back to the datapath for the datapath to process the packet appropriately.
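The table traversal, metadata and table-miss behaviour described so far can be put together in a short Python sketch. This is a simplified model of my own and not code from the specification: the names `Flow`, `Table` and `process` are hypothetical, matches are exact (no masks), and only three instructions are modelled (write metadata, go to a later table, output).

```python
CONTROLLER, CONTINUE, DROP = "controller", "continue", "drop"


class Flow:
    def __init__(self, priority, match, metadata=None, goto=None, output=None):
        self.priority = priority
        self.match = match        # field name -> required value
        self.metadata = metadata  # metadata to set for later tables
        self.goto = goto          # next table id, must be a later table
        self.output = output      # where to send the packet


class Table:
    def __init__(self, miss=CONTROLLER):
        self.miss = miss          # table-miss behaviour
        self.flows = []

    def add(self, flow):
        self.flows.append(flow)
        self.flows.sort(key=lambda f: -f.priority)  # highest priority first

    def lookup(self, fields):
        for flow in self.flows:
            if all(fields.get(k) == v for k, v in flow.match.items()):
                return flow
        return None


def process(tables, packet):
    """Walk the tables for one packet and return its fate."""
    fields = dict(packet, metadata=None)
    table_id = 0
    while table_id < len(tables):
        flow = tables[table_id].lookup(fields)
        if flow is None:
            if tables[table_id].miss == CONTINUE:
                table_id += 1
                continue
            return tables[table_id].miss  # "controller" or "drop"
        if flow.metadata is not None:
            fields["metadata"] = flow.metadata
        if flow.goto is not None:
            if flow.goto <= table_id:
                raise ValueError("goto must point to a later table")
            table_id = flow.goto
            continue
        return flow.output
    return DROP
```

With empty tables and the first table's miss property set to send to the controller, the first packet is reported as "controller". That is the reactive case: the controller would then call `add()` on the tables and hand the packet back. Calling `add()` before any packet arrives corresponds to the proactive case.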

With that background, let us see various protocol messages between controller and datapath:
  • Symmetric messages :  Are the messages that can be initiated by any party (Switch or Controller).  
    • Hello message:   Exchanged right after connection setup.
    • Echo Message:  Request/Reply messages - Mainly used for checking the liveness, latency and bandwidth of the connection.  Any Echo request message is answered with an Echo reply message.
    • Experimenter messages:  For future extendability.
  • Controller to Switch Messages:  These messages are initiated by the controller and switch is expected to reply. 
    • Feature request:   Controller sends this message to the switch to inquire the switch capabilities.  This feature request is used to get following information from the switch:
      • Data path ID :  A switch may support multiple data paths - Multiple switches. Switch is expected to make as many connections as number of datapaths (switch instances) it supports.  Based on the connection used to request features,  the switch device is expected to send corresponding datapath ID.
      • Maximum number of buffers that can be stored in the switch:  Normally, switch devices are expected to send the packet to which there is no matching table entry as PACKET_IN message to the controller.  To save bandwidth between the switch device and controller,  switch can optionally store the buffer in its memory, send reference to this buffer and some portion of the packet buffer to the controller. As you might see from the specifications,  this reference to buffer is returned by the controller using PACKET_OUT message for switch device to take action. In any case,   this particular parameter indicates the number of buffers the switch device can store while communicating with the controller.
      • Number of tables supported by the switch device on this instance:  As you might see from the openflow specification,  controller creates the flows in various tables.  This parameter informs the controller on how many tables the switch device supports for this datapath (instance).  
      • Capabilities supported by the switch instance such as
        • Statistics collection on per flow, table, port, group and Queue basis.
        • Whether switch can do IP reassembly before it extracts the source port and destination ports in case of UDP, TCP and SCTP transport protocols.
        • Whether the switch can extract source and target IP addresses from the ARP payloads.
      • Information about ports that were attached to this datapath (Switch instance):  For each port (physical or VLAN), following information is sent by the switch.
        • Port Name (string), Port ID (integer), HW address (in case the port is used in Layer 3), Port configuration in a bitmap (administratively down, drop all packets received from the port, drop packets forwarded to the port, do not send PACKET_IN messages for the port), Port state in a bitmap (link down, port blocked by STP, RSTP etc., alive), Port features in a bitmap (speed support - 10Mb half duplex, 10Mb full duplex, 100Mb half duplex, 100Mb full duplex, 1Gb half duplex, 1Gb full duplex, 10Gb full duplex, 40Gb full duplex, 100Gb full duplex, 1Tb full duplex and other speeds; link type - copper/fiber; link features - auto-negotiation, pause and asymmetric pause).  Port features are given in multiple bitmaps: Current - current features, Advertised - features advertised by the port, Supported - features supported by the port, and Peer - features advertised by the peer, that is, the other side of the link.  The current bit rate of the port and the maximum bit rate of the port are also sent.
    •  Configure Switch and Get Switch Configure messages:  SET_CONFIG message is used to set the configuration and GET_CONFIG_REQUEST message is used to get the configuration from the switch.  Type of configuration that can be set by controller is listed below.
      • IP Fragment treatment (No special handling,  Drop fragments,  Perform reassembly):  Switch is supposed to take action as set by the controller when it receives the IP fragments.
      • Action on Invalid TTL:  Controller can ask switch to send the packets with invalid TTL to controller.  TTL value 0 is invalid for L2 switch and TTL value 1 is invalid for forwarding packets.
      • Length of the packet that need to be sent to the controller as part of PACKET-IN message. 
    •  Table Modification Message and Get Table Status message:  Modification message is used to modify the properties of specific table.  Property of table mainly deals with the action to be taken when there is a "MISS" in the table for the incoming packet. 'MISS' actions can be defined as one of the following
      • Send the packet to controller.
      • Continue with next table processing.
      • Drop the packet.
    • Flow Entry Management commands (Add/Modify/Delete) of flow entries in a given table. Flow is not the same as traditional flow.  To me,  flow in Openflow context has flexibility to use ternary comparisons (using mask).  Each table is also kind of ordered list with each flow having priority.  Normally when one thinks of flows, they are exact match entries and hence think that flows can be arranged in hash table fashion.  Since these flows are not traditional flows,  hash table can't be used.   Every entry that gets added has following information coming from controller.
      • Table ID: Table to which entry is getting added to.
      • Command: Add, Delete and Modify 
      • Cookie and Cookie mask: Valid for Modify and Delete commands. This is used to update or delete multiple entries at once.
      • Idle Timeout :  Inactivity timeout.  If there are no packets matching this flow for this timeout period,  flow gets deleted.
      • Hard timeout:   Flow gets deleted after this timeout even if there were packets matching this flow.
      • Buffer Identification:  This is sent by the controller typically for add command.  As discussed above,  to save the bandwidth between switch and controller,  actual packet is not sent along with the PACKET-IN message. Rather buffer reference is sent along with truncated packet content.  Controller, while adding the flow can inform the switch to process the packet referenced by the buffer_id.
      • output port and out group :  Variables are meant for DELETE command.  DELETE command deletes the flows that match these two parameters.
      • Inform Flow Removal:  This flag is set by controller on this flow to get informed whenever this flow gets deleted when the flow expires.
      • Check for overlapping entries: This flag indicates that no other flow with the same matching information should be added in future.
      • Match fields:  A flow is identified by a set of match fields.  Match fields include input_port, Ethernet Source MAC address and Mask,  Ethernet Destination MAC address and Mask,  VLAN ID,  VLAN Priority (PCP),  Ether-type,  IP TOS (DSCP field),  IP Protocol,  Source IP address and Mask,  Destination IP address and Mask,  Source port, Destination Port, MPLS Label,  MPLS TC,  Meta Data and Meta Data Mask. There are 15 fields in total. Except for Meta Data, everything is part of the packet.  Of course some packets don't have some fields, and those fields are normally ignored during the flow match process.  Some fields can be specified as wild cards. They are: Input Port,  VLAN ID, VLAN Priority,  Ether Type,  IP TOS,  IP Protocol,  Source Port, Destination Port, MPLS Label and MPLS TC.
      • Set of instructions to be applied. Each instruction can contain multiple actions.  Following instructions can be associated with the flow.
        • Next Table to lookup :  This instruction of the flow indicates to the datapath that it should start matching the next table whose identity is given along with the instruction.
        • Setup Meta Data:   This instruction is used to set the meta data and mask to the data path for that packet.  This meta data might be used by datapath to match the entry in the next table.
        • Actions on the packet - There are three types of instructions that are possible -  "Apply Actions" where actions are applied immediately to the packet,  "Write Actions" where the actions are collected and these collected actions are applied at the end before packet is sent out,  "Clear Actions" is used to clear any collected actions so far.   Note that a given flow can have both "Write Actions" and "Apply Actions". There are many actions that can be collected or applied.  Actions defined by specification are listed below.
          • Output to Switch Port :  Port on which packet has to be sent out.  This port can be logical port. If the logical port is "CONTROLLER",  then max_len field can be specified in the flow.  This size is used to truncate the packet while sending the packet to the controller using PACKET-IN message.
          • Set VLAN ID: Replace the existing VLAN ID. Applies to packets that have an existing VLAN tag. If there are multiple VLAN tags, this action is applied to the outermost VLAN header.
          • Set VLAN Priority: Replace the existing VLAN Priority. If the packet does not have any VLAN tags, then this action is ignored by the datapath. If there are multiple VLAN tags, the outermost VLAN tag's priority is replaced.
          • Set Ethernet Source MAC Address:  Replace the existing Ethernet source MAC address.   If there are multiple Ethernet headers, outermost Ethernet header is selected for modification.
          • Set Ethernet Destination MAC Address:  Replace the existing outermost Ethernet Destination MAC address.
          • Set IPv4 Source Address, Set IPv4 Destination Address, Set IPv4 ToS bits, Set IPv4 ECN bits: Replace the appropriate fields in the outermost IP header and update the IP checksum. For TCP, UDP and SCTP packets, the transport checksum is also updated.
          • Set transport source port,  Set transport destination port:  Replace the existing transport source port and destination port with the values given in the action descriptor.  Also updates the checksums.
          • Copy TTL Outwards: Copy TTL from next-to-outermost header to outermost header. Copy can be from IP to IP, MPLS to MPLS or IP to MPLS.
          • Copy TTL Inwards: Copy TTL from outermost header to next-to-outermost header. Copy can be from IP to IP, MPLS to MPLS or MPLS to IP.
          • Decrement IPv4 TTL :  Decrement the TTL of outermost IP header.
          • Set MPLS Label: Replace the existing outermost MPLS label.
          • Set MPLS Traffic Class:  Replace the existing outermost MPLS TC.
          • Set MPLS TTL:  Set the TTL value of outermost MPLS header.
          • Decrement MPLS TTL:  Decrement the MPLS TTL.
          • Push and Pop VLAN tag: Push a VLAN tag or pop the VLAN tag.
          • Push MPLS header and POP MPLS header 
          • Apply Group Actions:  Group ID of the group is mentioned along with the action.  Group actions are also applied along with the explicit actions specified in the flow.
          • Set Queue: This action is used to apply QoS to packets. The controller passes a Queue_ID reference along with the action, and the data path is expected to queue the packet to this queue. Note that this action can be set not only at the last table, but also at an intermediate table or the first table. If this action is set under "Apply Actions", then it is very important that, once QoS has been applied, processing of the resulting packet resumes from where it left off.
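The difference between "Apply Actions", "Write Actions" and "Clear Actions" described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the names `Packet`, `run_instructions` and `finish_pipeline` are hypothetical, every action is modeled as a plain "set this field to this value", and instructions are processed in the order they are listed rather than in whatever order a real datapath uses.

```python
from dataclasses import dataclass, field


@dataclass
class Packet:
    # Header fields of the packet, modeled as a simple name -> value map.
    fields: dict
    # Collected actions ("Write Actions"); at most one entry per action name.
    action_set: dict = field(default_factory=dict)


def run_instructions(packet, instructions):
    """Process the instructions of one matched flow entry.

    `instructions` is a list of (kind, actions) pairs, where kind is
    "apply", "write" or "clear" and actions is a list of (name, value).
    """
    for kind, actions in instructions:
        if kind == "apply":
            # Applied immediately: later tables see the modified packet.
            for name, value in actions:
                packet.fields[name] = value
        elif kind == "write":
            # Collected only: a later write of the same action overrides it.
            for name, value in actions:
                packet.action_set[name] = value
        elif kind == "clear":
            # Drop everything collected so far.
            packet.action_set.clear()
        else:
            raise ValueError(f"unknown instruction kind: {kind}")


def finish_pipeline(packet):
    """Apply the collected actions at the end, before the packet goes out."""
    for name, value in packet.action_set.items():
        packet.fields[name] = value
    packet.action_set.clear()
```

For example, a flow in the first table could apply a VLAN ID change immediately and collect an output port, and a flow in the next table could clear the collected actions and collect a different output port; only the second output port would take effect when `finish_pipeline` runs.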
    • Group Entry Management Commands: There is one group table for each datapath (Switch instance). The group table contains multiple groups, with each group identified by a group-id which is set by the Controller as part of group creation. Each group is a collection of buckets, with each bucket having a set of actions to be applied. The types of actions that can be set on a bucket are the same types of actions that can be set on a flow. Since the Group ID is referenced from the flow instruction, how the associated actions of the group are handled depends on which instruction it is - Apply immediately, Collect or Clear. Each group record contains the selection logic for which bucket to use. There are four group types - All, Select, Indirect, Fast failover. At times, you require packets to be both duplicated and load balanced. In those cases, two levels of groups are required. The first group would have its type set to "All", and the bucket action in each of its buckets would point to a separate group whose type is "Select".
      •  All: The packet is duplicated as many times as the number of buckets. On each duplicated copy, bucket actions are applied. Packet processing of the duplicated copy is similar to that of the original packet. That is, the copy either jumps to the next table or egresses, just as the original packet would.
      • Select:  Packet processing selects one bucket.  This is mainly used for load balancing purposes.  Different flows of packets might use different buckets. 
      • Indirect:  One bucket is selected for all flows referred to this group.  This is similar to having one bucket in the group.
      • Fast Failover: Executes the first high-priority live bucket. To select the bucket, each bucket is associated with a weight (priority) and with a port and a port group, whose state tells whether the bucket is alive or not.
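The four group types above differ only in which buckets get executed for a packet. A minimal sketch in Python, with several assumptions: `select_buckets` is a hypothetical helper, each bucket is a dict with hypothetical keys "actions" and "watch_port", the "Select" type is shown with a CRC32 hash of a flow key (the actual selection algorithm is up to the datapath), and liveness for "Fast Failover" is reduced to checking a single watched port in bucket order.

```python
import zlib


def select_buckets(group_type, buckets, flow_key=b"", live_ports=frozenset()):
    """Return the list of buckets whose actions should run for one packet."""
    if not buckets:
        return []
    if group_type == "all":
        # One copy of the packet per bucket.
        return list(buckets)
    if group_type == "indirect":
        # Behaves like a group with a single bucket.
        return buckets[:1]
    if group_type == "select":
        # Assumption: hash the flow key so packets of one flow stay together.
        index = zlib.crc32(flow_key) % len(buckets)
        return [buckets[index]]
    if group_type == "fast_failover":
        # First bucket, in order, whose watched port is currently alive.
        for bucket in buckets:
            if bucket["watch_port"] in live_ports:
                return [bucket]
        return []
    raise ValueError(f"unknown group type: {group_type}")
```

Duplicating and load balancing at the same time, as described above, would correspond to calling this once with "all" on the outer group and then once with "select" on the group that each outer bucket points to.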
    •  Port Modification Message : is used by controller to modify the behavior of the port. Controller sends the message with the port ID and associated modification information. Since it is modification,  the fields which are modified are indicated using corresponding mask.
      • Port Configuration bits and mask bits - Administratively Down, Drop all packets received from the port, Drop packets forwarded to this port, Don't send PACKET-IN messages for packets received on this port.
      • Port features that the controller asks the port to advertise.
    • Queue Configuration Message: It is meant to do QoS in the datapath. But the capabilities expected by the controller from the switch are minimal. It appears that the number of queues is a property of the datapath. The controller can only configure the shaping bandwidth on a per-queue basis. It is understandable that classification is already happening, but there is no flexibility to create groups of queues, set up the scheduling algorithms or set up the queue management algorithms.
    • Read State Messages: This message is used by the controller to get the current state of the data path. It is mainly used to get statistics. The "type" in the request indicates the type of information requested.
      • Description Statistics:  Data path replies with following information:
        • Manufacturer  Description,  Hardware Description,  Software Description,  Serial number and Readable description of datapath.
      • Flow Statistics: The controller requests statistics for a given flow using "Table ID", "Out Port", "Cookie and its mask" and the flow match fields. It was not very clear what happens if multiple flows match. I think the first flow that matches would be selected for the reply. The data path returns statistics for the given flow such as
        • How long flow is alive in seconds/nanoseconds.
        • Priority of the flow entry.
        • Number of seconds before expiration.
        • Packet count, Byte count
        • Match fields and instructions that are part of the flow.
      • Aggregate flow statistics: Similar to above, but in this case aggregate statistics are sent. The aggregation is based on the flow statistics of all flows that are matched.
        • Packet Count, Byte Count.
        • Flow Count - Number of flows.
      • Table Statistics: The controller requests statistics of a table. The reply is sent with the following information:
        • Fields that are used to match this table.
        • Wildcards supported to match this table.
        • Instructions that are supported by this table.
        • Write Actions
        • Apply Actions
        • Miss Action configuration.
        • Maximum number of entries supported in the table
        • Active entries
        • Number of packets looked up in the table.
        • Number of packets that have entry hit in the table.
      • Port Statistics:  Controller requests for port statistics by giving port number.  Reply information contains following:
        • Number of Received packets,  Number of transmitted packets.
        • Number of received bytes, Number of transmitted bytes.
        • Number of packets dropped in receive,  Number of packets dropped by transmit.
        • Number of receive errors, Number of tx errors
        • Number of rx frame errors,  Number of overrun errors
        • Number of rx CRC errors,  Number of collisions.
      • Queue Statistics:  Controller requests statistics by giving port number and queue ID.  Results sent back are:
        • Transmit bytes,   Transmit packets and transmit errors.
      • Group Statistics:  Controller requests statistics by giving group ID.   Reply information consists of
        • Reference Count - Number of flow entries or other group entries that refer to this group ID.
        • Number of packets and bytes  processed by this group
        • Bucket Statistics are also returned.  For each bucket in the group, following information is sent back
          • Packet count and byte count of packet processed by this bucket.
      • Group Description:  Controller can request the buckets and associated actions by giving group ID.  Reply information consists of:
        • Number of buckets and information on each bucket including actions.
    • PACKET-OUT message:  This message is sent by the controller, typically after creating the flow in the datapath.  I am puzzled by the description of PACKET-OUT message. I am not sure whether it is a problem with the specifications or my misunderstanding.  I see that PACKET-OUT message has action headers. I am expecting that PACKET-OUT message would start with the table where the miss occurred before.  Note that starting from first table is not an option in many cases where the packet is already morphed due to "Apply Actions" in the matched flows of previous tables. I would expect following information to be sent as part of PACKET-OUT:
      • Table ID:  Where to start the search from.
      • Buffer ID : In case the entire packet was not sent to the controller with PACKET-IN message as part of TABLE MISS.
      • Meta Data information: Note that the TABLE miss condition would have occurred after processing some tables. Due to actions on the flows in those tables, certain meta information would have been collected. This meta data information is sent back so that the processing stays consistent. I also think that meta data information of one 32bit integer is not good enough. It should be significantly larger (up to 128 bytes). Again, to save bandwidth, meta data information need not come to the controller via the PACKET-IN message and be sent back using the PACKET-OUT message. It can be stored in the datapath along with the stored buffer, where the BufferID is the reference to the stored buffer. In case the data path is not storing the buffers, then the metadata information should be expected by the controller as part of PACKET-IN and sent back using PACKET-OUT.
  • Asynchronous Messages :  These messages are sent from the datapath to the controller without any command message from the controller.
    • PACKET-IN message:  This message is sent whenever there is no matching flow in the sequence of tables. Any table miss will result in PACKET-IN message. My comments above in PACKET-OUT message section are valid here.  I would expect following information to go  to controller:
      • Table ID:  ID of the table where miss occurred.
      • Action type:  Is it due to MISS action or due to explicit action to send the packet to controller.
      • Buffer ID: If the datapath can store the packet, it can send a reference to this buffer, and the same buffer ID is expected as part of PACKET-OUT. What happens to this stored buffer if there is no PACKET-OUT message? If there is no PACKET-OUT message for a certain amount of time, the packet gets dropped. I guess if a PACKET-OUT message arrives after the packet is dropped, then the PACKET-OUT message will be ignored by the datapath.
      • Packet Data: In case a buffer ID is sent, the entire packet will not be sent. The datapath just needs to send enough bytes for the controller to understand the kind of packet (typically, up to the TCP/UDP header is good enough). The amount of data to be sent along with a buffer ID is configurable by the controller. By default, miss_send_len is 128 bytes.
      • Metadata:   Specification does not talk about this. I believe that it should be sent.
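The buffering behavior discussed above (send only the first miss_send_len bytes, keep the full packet under a buffer ID, drop it if no PACKET-OUT arrives in time) can be sketched as follows. This is a rough illustration, not what the specification mandates: the class `PacketBuffer`, its method names, the dictionary layout of the message and the one-second timeout are all assumptions, and storing the table ID and metadata next to the buffer reflects the suggestion made in this post rather than the spec.

```python
import itertools


class PacketBuffer:
    """Toy model of datapath-side packet buffering for PACKET-IN/PACKET-OUT."""

    def __init__(self, miss_send_len=128, timeout=1.0):
        self.miss_send_len = miss_send_len
        self.timeout = timeout          # hypothetical expiry, in seconds
        self._ids = itertools.count(1)  # buffer ID generator
        self._stored = {}               # buffer_id -> (packet, table_id, metadata, expiry)

    def packet_in(self, packet, table_id, metadata, now):
        """Store the packet and build the truncated PACKET-IN payload."""
        buffer_id = next(self._ids)
        self._stored[buffer_id] = (packet, table_id, metadata, now + self.timeout)
        return {
            "buffer_id": buffer_id,
            "table_id": table_id,
            "total_len": len(packet),
            "data": packet[: self.miss_send_len],  # only the leading bytes
        }

    def packet_out(self, buffer_id, now):
        """Return (packet, table_id, metadata), or None if unknown or expired.

        Expired entries are discarded lazily, when they are looked up.
        """
        entry = self._stored.pop(buffer_id, None)
        if entry is None or now > entry[3]:
            return None  # a late or unknown PACKET-OUT is simply ignored
        return entry[:3]
```

The current time is passed in explicitly so that the expiry logic is easy to follow; a real datapath would use its own clock and would also reclaim expired buffers in the background.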
    • Flow Removed Message:  Data path sends this message whenever flow is removed due to timeout.   This message contains following:
      • Flow specific information: Priority, Match fields etc.
      • Some statistics information: Byte count, Packet Count.
      • Duration of the flow:  How much time the flow was alive.
      • Reason for flow removal :  Hard timeout, Idle timeout,  DELETE command,  Group Delete command.
      • Table ID:  Placement of current flow.
    • Port Status Message: Whenever ports are added, modified or deleted, this information is sent to the controller. Information typically contains:
      • Reason for this message:  ADD, DELETE, MODIFY.
      • Port specific information.
    • ERROR Message:  Datapath informs the controller whenever there are errors observed.