Monday, December 19, 2011

VXLAN (Virtual eXtensible LAN) - Virtual Data Centers - Tutorial

VMware, with contributions from Cisco, Citrix, Broadcom and Arista Networks, released an IETF draft for VXLAN, a protocol that enables multiple L2 virtual networks over a shared physical infrastructure. Please find that draft here.

The draft clearly defines the problem statement and how VXLAN solves the problems. I will not repeat all of that here, but some important background points are worth mentioning.


It is well known that Data Center and Service Provider networks are increasingly being enabled for multiple tenants. Even Enterprises are supporting multiple tenant networks for isolation - some examples being isolation for different divisions, for research and development, and for trying out new services or networks.

Employees of a particular division used to be confined to one building in the 1990s. With globalization, that is no longer true. Employees of one business unit are spread not only across multiple buildings, but also across countries and continents, and each building or location has employees from multiple business units. Hence, it is necessary to create virtual LANs over the same physical network, where each virtual LAN spans buildings, countries and continents. A virtual LAN gives related machines the locality of an L2 network even though they are distributed across multiple physical networks.

This has become a very common concept in the recent past, so there is no need to emphasize the need for virtual LANs for tenants in Data Center, Service Provider and multi-dwelling environments. Operators would not like to create new physical infrastructure or modify existing physical infrastructure every time they sign up a new tenant. Hence, operators of these networks support virtual networks for tenants for the purpose of isolation, flexibility and effective usage of physical resources. Virtualization of servers solves the same issue on the compute side.

Virtualization of physical servers is now well understood by network operators and tenants. It provides elasticity on the compute side: based on tenant requirements and usage, the number of virtual machines can be increased or reduced. The virtual machines hosting one tenant service can be spread across multiple physical machines, and a physical machine may hold virtual machines belonging to multiple tenants. Basically, a virtual server is now treated the way a physical server was in non-virtualized architectures.

If a Data Center supports isolated networks and resources, then it is enabled for multi-tenancy, and such Data Centers are called Virtual Data Centers (VDCs). Implementing a VDC requires multi-tenancy across all active devices in the Data Center. Compute server and storage virtualization is well understood and already being done to a great extent. The physical network service devices that are popular in Data Center markets today are also enabled for multi-tenancy. These service devices tend not to implement multi-tenancy using virtualization the way compute servers do. For scalability, they tend to support multi-tenancy using "data virtualization" or at most "container based virtualization".

Communication of the tenant ID among Data Center equipment traditionally happens using the VLAN ID. A VLAN ID or a set of VLAN IDs is assigned to each tenant. The front end equipment determines the tenant identification (VLAN ID) from the destination IP address of packets arriving from the Internet. From then on, communication with compute servers (web servers, application servers, database servers) and storage devices happens on this VLAN ID. Reverse traffic (outbound traffic to the Internet) also travels on VLAN IDs until the packets reach the front end device, which removes the VLAN header and sends the packets out onto the Internet.

L2 switches, L3 switches and any other network service devices use the VLAN ID to identify the tenant and apply the appropriate tenant specific policies.

Why VXLAN if VLANs are good?

VLANs are not good enough for the following reasons:
  • VLANs are fixed in number - the VLAN header defines 12 bits for the VLAN ID, so only 4K VLAN IDs are possible. Even with the best case assumption of one VLAN ID per tenant, a Data Center can support at most 4K tenants.
  • VLAN is mostly an L2 concept. Keeping a VLAN intact across L2 networks separated by L3 routers is not straightforward and requires some intelligence in the L3 devices. Especially when tenant networks need to be expanded to multiple geographic locations, extending a VLAN across the Internet requires newer protocols (such as TRILL).
  • If tenant traffic itself needs VLANs for a different reason, double tagging or triple tagging may be required. Though 802.1ad tagging can be used for tenant identification and 802.1Q tagging for tenant specific VLANs, this may also require changes to existing devices.
VXLAN is a new tunneling protocol that works on top of UDP/IP. It does require changes to existing infrastructure so that the new protocol is understood, but it does not have the limitations of VLAN based tenant identification. Since the L2 network is created over an L3 network, a VDC can now extend not only within one Data Center/Enterprise location, but across different locations of Data Center/Enterprise networks.
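
To put numbers on the ID space, here is a small sketch of the arithmetic. The 12-bit VLAN ID width is the one from the list above; the 24-bit width of VXLAN's network identifier (the VNI, introduced below) is taken from the draft. The constant names are mine.

```python
# ID-space comparison between VLAN IDs and VXLAN network identifiers.
VLAN_ID_BITS = 12   # width of the VLAN ID field in the VLAN header
VNI_BITS = 24       # width of the VNI field in the VXLAN header (per the draft)

vlan_ids = 2 ** VLAN_ID_BITS   # the "4K" limit
vnis = 2 ** VNI_BITS           # roughly 16M

# How many times larger the VXLAN ID space is.
growth = vnis // vlan_ids

print(vlan_ids, vnis, growth)
```

In other words, every one of today's 4K VLAN IDs could be replaced by another 4K virtual networks before the VNI space runs out.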

Some important aspects of the VXLAN protocol:
  • VXLAN tunnels L2 packets across networks between VTEPs (VXLAN Tunnel End Points).
  • VXLAN encapsulates L2 packets in UDP/IP.
  • VXLAN defines the VXLAN header.
  • The UDP destination port indicates the VXLAN protocol. The port number is yet to be assigned.
  • About 16M virtual networks are possible, since the network identifier is 24 bits wide.
  • VTEPs are typically end point servers (compute servers, storage servers) and Layer 3 based network service devices. L2 switches need not be aware of VXLAN. Some L2 switches (ToRs) may be given this intelligence so that they proxy VXLAN functionality on the ports connected to compute servers, mainly to support servers and resources that are not VXLAN aware. The VXLAN IETF draft calls this a VXLAN Gateway.
  • VXLAN Gateways are expected to translate tenant traffic from non-VXLAN networks to VXLAN networks. The front end device described above is one example of a VXLAN gateway. This device might map the public IP address (destination IP address) to the VXLAN tenant (VNI - VXLAN Network Identifier).
  • VXLAN defines the VNI (VXLAN Network Identifier), which identifies the DC instance (VDC). It takes the place of the VLANs used in Data Center networks today.
  • It is expected that one management entity within an administrative domain creates the Virtual Data Centers (tenants). This involves assigning a unique VNI to the tenant network (VDC instance), associating the public IP addresses of the tenant, and assigning the multicast IP address used for ARP resolution across VTEPs in a VDC.
As with VLAN based tenant identification, VXLAN based tenant networks can have internal IP addresses that overlap with those of other tenants. Within a single VDC, however, the IP addresses assigned to virtual resources must be unique.
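
One way to picture the addressing rule is a registry keyed by the pair (VNI, IP address). This is only a sketch: `AddressRegistry`, the VNI values and the resource names are all made up for illustration and do not come from the draft.

```python
# Hypothetical registry: the same private IP may appear under different
# VNIs (different tenants), but only once within a single VNI.
class AddressRegistry:
    def __init__(self):
        self._owners = {}  # (vni, ip) -> resource name

    def assign(self, vni, ip, resource):
        key = (vni, ip)
        if key in self._owners:
            raise ValueError(
                f"{ip} is already used in VNI {vni} by {self._owners[key]}"
            )
        self._owners[key] = resource


registry = AddressRegistry()
registry.assign(100, "10.0.0.5", "tenantA-web")
registry.assign(200, "10.0.0.5", "tenantB-web")  # same IP, other VNI: allowed

try:
    registry.assign(100, "10.0.0.5", "tenantA-db")  # duplicate inside VNI 100
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
```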

Some aspects of packet flow:

Let us assume the following scenario:

A tenant is assigned VNI "n" and multicast IP address "m". Two VMs (VM1 and VM2) are provisioned on two different physical servers (P1 and P2) located in two different cities: VM1 is installed on P1 and VM2 on P2. P1 and P2 are reachable via IP addresses P1a and P2a. The VMs have private IP addresses VM1a and VM2a.

An Ethernet packet from VM1 to VM2 would be encapsulated in UDP/IP with a VXLAN header as follows:

VXLAN header: VNI "n" and some flags.
UDP header: a "source port" assigned by the system and the standard VXLAN "destination port".
IP header: generated with P1a as SIP and P2a as DIP.
Ethernet header: SMAC is the local MAC address, and DMAC is the MAC address corresponding to P2a from the ARP resolution table, or the local gateway MAC address.
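
A minimal sketch of the encapsulation, assuming the 8-byte header layout from the draft (one flags byte with the "I" bit marking a valid VNI, 24 reserved bits, the 24-bit VNI, 8 reserved bits). The function names are mine, and the addresses and port numbers in the usage lines are placeholders. Port 4789 is the value IANA assigned later; the draft discussed here leaves the port unassigned.

```python
import struct

VXLAN_FLAG_VNI_VALID = 0x08  # the "I" flag in the draft's header layout


def pack_vxlan_header(vni):
    """Build the 8-byte VXLAN header for a given VNI."""
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 0: flags in the top byte, then 24 reserved bits.
    # Word 1: VNI in the top 24 bits, then 8 reserved bits.
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def unpack_vni(header):
    """Extract the VNI from a VXLAN header, checking the I flag."""
    word0, word1 = struct.unpack("!II", header[:8])
    if not (word0 >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("VNI-valid flag is not set")
    return word1 >> 8


def outer_headers(vni, sip, dip, src_port, dst_port):
    """Describe the tunnel headers, outermost first (after outer Ethernet)."""
    return [
        ("ip", {"sip": sip, "dip": dip}),
        ("udp", {"sport": src_port, "dport": dst_port}),
        ("vxlan", pack_vxlan_header(vni)),
    ]


# VM1 -> VM2 in the scenario above, with placeholder values for P1a and P2a.
layers = outer_headers(5000, "192.0.2.1", "198.51.100.2", 49152, 4789)
header = layers[2][1]
```

The inner Ethernet packet from VM1 would follow the VXLAN header unchanged.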

VTEP functionality would be implemented in the NIC cards of servers or in hypervisors. It would also be part of L2 switches in order to support existing servers and their VMs. VMs within the servers need not be aware of VXLAN, so existing VMs will just work.

The VTEP typically maintains a database (created by the management entity) that maps the virtual NICs of VMs to VNIs. This could be as simple as a table of vNIC MAC address versus VNI. The VTEP is also provisioned with the associated multicast IP address (a table of VNI versus multicast IP address). When the VTEP gets a packet from the vNIC of a VM, it works out all the information required to build the UDP/IP/VXLAN headers. The VNI is known from the table provisioned by the management entity. The DIP of the tunnel IP header is determined from the VTEP learning table, whose entries map the MAC addresses of remote VMs to the tunnel IP addresses of their VTEPs. The VTEP is expected to keep this table updated from the packets coming from remote VTEPs, using the SIP of the tunnel header and the SMAC of the inner Ethernet packet.
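
The two tables can be sketched roughly as below. This is not how any particular hypervisor or NIC does it. The `Vtep` class and its method names are hypothetical, and keying the learning table by (VNI, MAC) rather than by MAC alone is my assumption, so that tenants with clashing MAC addresses stay separate.

```python
class Vtep:
    """Toy VTEP state: one provisioned table and one learned table."""

    def __init__(self, vnic_to_vni):
        # Provisioned by the management entity: vNIC MAC -> VNI.
        self.vnic_to_vni = dict(vnic_to_vni)
        # Learned from received tunnel packets:
        # (VNI, remote VM MAC) -> remote VTEP tunnel IP.
        self.learned = {}

    def learn(self, vni, outer_sip, inner_smac):
        """Update the learning table from a packet received over a tunnel."""
        self.learned[(vni, inner_smac)] = outer_sip

    def lookup(self, vnic_mac, inner_dmac):
        """Return (VNI, tunnel DIP) for a packet sent by a local vNIC.

        The tunnel DIP is None when nothing has been learned yet.
        """
        vni = self.vnic_to_vni[vnic_mac]
        return vni, self.learned.get((vni, inner_dmac))


# P1's VTEP, with made-up MAC addresses for VM1 and VM2.
vtep1 = Vtep({"02:00:00:00:00:01": 5000})
before = vtep1.lookup("02:00:00:00:00:01", "02:00:00:00:00:02")

# A packet from VM2 arrives over the tunnel from P2 (placeholder address).
vtep1.learn(5000, "198.51.100.2", "02:00:00:00:00:02")
after = vtep1.lookup("02:00:00:00:00:01", "02:00:00:00:00:02")
```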

This looks simple, but there are two things to consider. First, how do VMs get hold of the DMAC address corresponding to a peer VM's IP address? A local ARP request does not work, as it cannot cross the local physical L2 domain. Second, how does the local VTEP get to know the remote VTEP IP address if there is no matching learning table entry?

Let us first discuss how ARP requests generated by VMs get satisfied. A broadcast ARP request generated by a VM should somehow reach all VMs and devices in the virtual network. That is where the VTEP uses multicast tunnels. The source VTEP, upon receiving an ARP request from a local VM, tunnels it in a multicast packet whose destination address is looked up in the VNI to multicast IP address table. As a matter of fact, the VTEP encapsulates any broadcast or multicast Ethernet packet sent by a local VM in the multicast tunnel. All VTEPs are expected to subscribe to the multicast address in order to receive multicast tunnel packets. A receiving VTEP decapsulates the packet to get hold of the inner packet, finds all VMs (vNICs) corresponding to the VNI and sends the inner packet to those vNICs. The right VM responds to the ARP request with an ARP reply, and the remote VTEP sends the ARP reply back to the source VTEP.
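
The decision described above can be sketched as two small functions. The names are hypothetical and the multicast group 239.1.1.1 is just a placeholder. Broadcast and multicast Ethernet destinations are recognised by the group bit, the least significant bit of the first octet of the MAC address.

```python
def is_broadcast_or_multicast(dmac):
    """True if the group bit of the destination MAC is set."""
    first_octet = int(dmac.split(":")[0], 16)
    return bool(first_octet & 0x01)


def tunnel_destination(vni, dmac, vni_to_group, learned):
    """Pick the outer DIP: the tenant's multicast group or a learned VTEP."""
    if is_broadcast_or_multicast(dmac):
        return vni_to_group[vni]
    return learned.get((vni, dmac))  # None if nothing learned yet


def local_receivers(vni, vnic_to_vni):
    """vNICs on the receiving VTEP that belong to the packet's VNI."""
    return sorted(mac for mac, v in vnic_to_vni.items() if v == vni)


vni_to_group = {5000: "239.1.1.1"}                      # provisioned
learned = {(5000, "02:00:00:00:00:02"): "198.51.100.2"}  # learned earlier

arp_dip = tunnel_destination(5000, "ff:ff:ff:ff:ff:ff", vni_to_group, learned)
unicast_dip = tunnel_destination(5000, "02:00:00:00:00:02", vni_to_group, learned)

# On the receiving side, only vNICs in VNI 5000 see the ARP request.
receivers = local_receivers(
    5000, {"02:00:00:00:00:02": 5000, "02:00:00:00:00:09": 6000}
)
```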

The second complication is discovering the remote VTEP IP address when there is no matching entry in the learning table. This could happen when an entry ages out. If the VMs are configured with static ARP tables, the VMs will not generate ARP requests either, so there may be no opportunity to learn the remote VTEP IP address for a remote MAC address. In this case, the source VTEP, upon receiving a unicast packet from the local VM, may need to generate the ARP request that the VM would normally have generated. This ARP request is sent in the multicast tunnel as described above. It triggers an ARP reply from the remote VM, which is encapsulated by the remote VTEP, and the local VTEP can use that reply to update its learning table. Since this process may take some time, the VTEP may also need to buffer the original packet until the learning is done.
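
Here is a rough sketch of that buffering, under my own assumptions: a small bounded queue per unknown destination, and a probe sent only on the first miss. The class name, the queue limit and the action tuples are all invented for illustration. The draft does not prescribe any of this.

```python
from collections import defaultdict, deque


class PendingSender:
    """Buffers unicast frames until the remote VTEP has been learned."""

    def __init__(self, max_per_dest=8):
        self.max_per_dest = max_per_dest
        self.learned = {}                   # (vni, mac) -> remote VTEP IP
        self.pending = defaultdict(deque)   # (vni, mac) -> buffered frames

    def send(self, vni, dmac, frame):
        """Return the list of actions the VTEP should take for this frame."""
        key = (vni, dmac)
        vtep_ip = self.learned.get(key)
        if vtep_ip is not None:
            return [("unicast", vtep_ip, frame)]
        queue = self.pending[key]
        first_miss = not queue
        if len(queue) < self.max_per_dest:
            queue.append(frame)  # frames beyond the limit are dropped
        # Probe (for example an ARP request in the multicast tunnel)
        # only once per unknown destination.
        return [("probe", vni, dmac)] if first_miss else []

    def on_learned(self, vni, smac, remote_vtep_ip):
        """Record the mapping and release any frames that were waiting."""
        key = (vni, smac)
        self.learned[key] = remote_vtep_ip
        return [("unicast", remote_vtep_ip, f) for f in self.pending.pop(key, ())]


sender = PendingSender()
first = sender.send(5000, "02:00:00:00:00:02", b"frame-1")
second = sender.send(5000, "02:00:00:00:00:02", b"frame-2")
released = sender.on_learned(5000, "02:00:00:00:00:02", "198.51.100.2")
later = sender.send(5000, "02:00:00:00:00:02", b"frame-3")
```

A real implementation would also need timeouts for the buffered frames and aging of learned entries, which this sketch leaves out.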

Problems and possible solutions to VXLAN based VDCs:
  • I believe IPv6 is needed for tunnels. IPv4 tunnels are okay in the short term. As you would have gathered by now, each tenant requires one multicast address. This multicast address needs to be unique in the Internet, that is, unique across network and Data Center operators. If this mechanism gets popular, multicast addresses may run out very soon.
  • Security is very critical. It is now possible for a man-in-the-middle or even an external attacker to corrupt the learning tables. VNIs will eventually be known to attackers. Multicast or unicast packets can be generated to corrupt the learning table as well as to overwhelm it, thereby creating a DoS condition. I think IPsec (at least authentication with NULL encryption) must be mandated among the VTEPs. It is understandable that IKE is expensive in NIC cards, hypervisors and VXLAN gateways, but IPsec is now available in many NIC cards and multicore processors. The management entity can take on the job of populating the keys in the VTEPs, just as it provisions the multicast address for each VNI. I believe that all VTEPs would be controlled by some management entity, so it is within the realm of possibility to expect that entity to populate the IPsec keys in each VTEP. For multicast tunnels, the key needs to be the same across all the VTEPs. The management entity may rotate the key often to limit the damage when attackers get hold of a key. For unicast VXLAN tunnels, the management entity can either use the same key for all VTEPs, as for multicast, or use pairwise keys.
I think the above problems are real. It would be good if the next draft of VXLAN provided some solutions to them.
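
To give a feel for the two keying options for unicast tunnels, here is a small sketch that counts the keys the management entity would have to hand out. The VTEP names are made up, and this only counts keys. It does not implement IPsec.

```python
from itertools import combinations


def pairwise_key_slots(vteps):
    """Every unordered pair of VTEPs that would need its own key."""
    return list(combinations(sorted(vteps), 2))


vteps = ["vtep-a", "vtep-b", "vtep-c", "vtep-d"]

shared_keys = 1                       # one group key, as for multicast
pairs = pairwise_key_slots(vteps)     # pairwise keys for unicast tunnels
n = len(vteps)

# The pair count follows n * (n - 1) / 2.
expected_pairs = n * (n - 1) // 2

print(shared_keys, len(pairs), expected_pairs)
```

With a single shared key, one compromised VTEP exposes every tunnel in the VDC. With pairwise keys the exposure is limited, but the number of keys grows quadratically with the number of VTEPs.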


deep said...

nice post

Srini said...

Based on feedback I received by email, I thought I would clarify some terms in my post.

I have used the terms VDC, Virtual Network and Tenant Network in the post. All of them represent the same thing and are used interchangeably.