Saturday, January 8, 2011

Data Center Equipment & Offload card requirements

I have been attending some web conferences focusing on data center equipment. Over the last few years, there has been more and more discussion about offloading some processing from current data center equipment. Before getting into the details of offload functions, it is good to revisit the types of equipment found in a data center.

Data centers, whether they are private, public or cloud, have the following equipment:

  • L2/L3 Switches and Routers
  • Server Load balancers or Application Delivery controllers.
  • WAN Optimization devices
  • Network Security Devices.
  • Monitoring Servers for network visibility, Surveillance and others.
  • Application Servers (Web, email, etc.)
  • Database Servers
  • Storage Devices.

At this time, all servers are x86 servers. I gather that much of the network equipment, such as SLBs, ADCs, WOCs and UTMs, is also implemented on x86 servers, with crypto, compression and pattern matching done on add-on PCIe cards. Some vendors of this equipment are seriously considering moving to embedded multicore SoCs. There are two camps. One camp wants to continue with x86 for its applications, but use embedded multicore SoCs to offload functions beyond the algorithmic offloads. The other camp would like to replace multiple x86 processors with multiple embedded multicore SoCs.

This post talks about the networking functions that equipment vendors are looking for in offload cards. Two types of offload cards are becoming popular: PCIe cards that go into x86-based network equipment, and proprietary cards that go into L2/L3 switches. Both kinds of cards are expected to take some processing away from the networking equipment. With a PCIe card, more offload functions are possible because of its proximity to the target application. In this post, I list the offload functions that are possible and that networking equipment vendors expect.

PCIe offload cards act as a NIC to the x86 equipment while offloading some networking functions. Hence some call them intelligent NIC cards. The offload card receives all packets from the network, performs the offload functions, and then sends the resulting packets to the server over the PCIe interface. These packets are presented to the server TCP/IP stack via a PCIe Ethernet driver. Similarly, outbound packets come to the offload card from the server. The offload card performs the offload functions required on egress traffic and sends the resulting packets out on the network.

The offload functions are listed below by category.

Algorithmic Acceleration:  This is already being done and requires no further explanation.  Algorithmic acceleration now typically involves:
  • Public-key crypto and symmetric crypto acceleration
  • Symmetric crypto acceleration with protocol offload, such as the SSL record layer: Though this is provided by offload cards, network applications requiring SSL are increasingly using the AES-NI instructions available on newer Intel processors. As a result, many PCIe-based crypto accelerators are used mainly for public-key crypto acceleration.
  • Compression/Decompression  acceleration.
  • Regular expression Search acceleration.

Basic networking offload functions:
  • Offloading of ARP for IPv4 and ICMPv6-based neighbor discovery for IPv6:  As networks in the data center become flatter, there is more and more ARP broadcast traffic in the network.  The TCP/IP stack of the network equipment receives all broadcast packets.  Because of the increased broadcast traffic, CPU cycles are spent on interrupt handling and ARP processing.  Much of the time, the ARP request cannot even be satisfied because there is no matching neighbor entry.  Offload cards are expected to remove this overhead by responding to neighbor requests themselves, without involving the server TCP/IP stack.  How does it work?
    • The server TCP/IP stack is expected to create and delete neighbor entries, consisting of local IP addresses and their associated MAC addresses, in the offload card whenever such entries are created or deleted locally.  The offload card maintains the neighbor entry database on a per-physical-port and per-logical-port basis (the latter for VLAN-based interfaces).
    • When the offload card receives a neighbor request, it checks its database first. If there is no matching entry, the ARP packet is dropped.  If there is a matching entry, the card processes the request and sends the reply to the network without involving the server.
  • Offloading of packet integrity checks and dropping of invalid inbound packets:  Any TCP/IP stack, including the server's TCP/IP stack, spends some CPU cycles checking whether a received packet is good.  These cycles can be saved if the offload card does the checks.  Some of the checks that the offload card can do are:
    • Check the SMAC of the packet:  If the source MAC is one of the local MAC addresses, the packet can be dropped by the offload engine.
    • Check the IP length against the packet size:  If the IP length is more than the packet size, it is a bad packet and can be dropped.  If the IP length in the header is less than the packet size, the packet needs to be truncated to the length in the IP header.
    • Check the checksum of the IP packet:  If the checksum is bad, the offload card can drop the packet.
    • Check the TTL:  If the TTL is 0, the packet can be dropped by the offload engine.
    • Check the SIP and DIP of the IP header:  If both addresses are the same, the packet can be dropped by the offload card.
    • Check the SIP of the IP header against the local IP addresses:  If the source IP is one of the local server IP addresses, the packet can be dropped by the offload card. Similarly, if the source IP address is a local broadcast address or a multicast address, the packet can be dropped by the offload card.
    • Transport header length checks:  If the length of the packet does not agree with the length in the transport header, the packet can be dropped by the offload card.
    • Transport header checksum checks:  If the checksum computed over the packet does not match the checksum in the header, the packet can be dropped by the offload card.
    • When the checksums are verified, the offload card can pass a "checksum verified" indication along with each packet to the server TCP/IP stack, so that the stack does not repeat the checks.  Many server TCP/IP stacks, such as Linux, can carry this information in their packet buffers.  For example, the Linux TCP/IP stack skips checksum verification when the driver marks the SKB accordingly (the ip_summed field set to CHECKSUM_UNNECESSARY).
  • Offloading of IP reassembly on inbound packets:  By doing IP reassembly and sending a single packet to the server TCP/IP stack, offload engines can save a good number of CPU cycles. Any server terminating any kind of tunnel (such as IPsec, GTP, GRE, IPinIP, etc.) would benefit tremendously from this offload.  Offload cards are expected to reassemble, at a minimum, IP packets that arrive as two fragments, since many tunnel protocols split an IP packet into two fragments. The IP reassembly function is required not only to save CPU cycles in the server, but also to enable other offloads such as 'traffic policing', 'stateless and stateful filtering' and 'inline DPI offload', because these need access to transport selectors.  These offloads will not work well on fragments that do not carry the transport selectors.
  • Offloading of IP fragmentation on outbound packets:  This offload again saves CPU cycles, this time by moving fragmentation of outbound packets from the server TCP/IP stack to the offload engine.  The server TCP/IP stack can pass the PMTU value along with the packet.  The offload card can then fragment the packet as per the PMTU and send the fragments out. With this function on the offload card, the card can also do a better job of traffic shaping when shaping depends on transport selectors: IP fragmentation can be done after traffic shaping, thereby allowing the traffic shaper to look at the full IP packet.
  • Offloading of hash calculation on inbound packets:  Many networking applications running on the x86 server need a hash result to index into the hash buckets of their session blocks.  Bob Jenkins' hash, CRC-32 and CRC-64 are some of the common hash algorithms applied to packet fields.  Offload cards are expected to take configuration from the networking applications on the x86 server.  The configuration includes the hash algorithm, the packet fields and their order. Offload cards are then expected to extract the fields, compute the hash over them and send the hash result along with the packet to the server TCP/IP stack.
  • Packet parsing and field extraction on inbound packets:  Almost all networking applications need to extract some common fields from the packet.  The cycles required to extract the fields can be saved by offloading this work to the offload card.  Server networking applications can program the offload card with the fields they are interested in.  Based on that configuration, the offload card extracts the relevant fields and provides them along with the received packet.
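Going back to the packet integrity checks listed above, the IPv4-level checks (length, header checksum, TTL, SIP/DIP sanity) can be sketched in a few lines. This is only an illustrative sketch in Python, not how an offload card would actually implement it. The names `check_ipv4`, `ipv4_checksum` and `build_ipv4`, and the verdict strings, are made up for this example.

```python
import struct

def ipv4_checksum(header: bytes) -> int:
    """One's-complement checksum over the 16-bit words of an IPv4 header."""
    total = sum(struct.unpack("!%dH" % (len(header) // 2), header))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def check_ipv4(packet: bytes, local_ips=frozenset()):
    """Return (verdict, packet); verdict is 'ok' or a drop reason.

    A packet longer than the IP total length is truncated to that length.
    local_ips is a set of 4-byte addresses (hypothetical card configuration).
    """
    if len(packet) < 20:
        return "drop: short packet", packet
    ihl = (packet[0] & 0x0F) * 4
    if packet[0] >> 4 != 4 or ihl < 20 or len(packet) < ihl:
        return "drop: bad header", packet
    total_len = struct.unpack("!H", packet[2:4])[0]
    if total_len > len(packet) or total_len < ihl:
        return "drop: bad IP length", packet
    # A valid header (checksum field included) sums to zero after complement.
    if ipv4_checksum(packet[:ihl]) != 0:
        return "drop: bad IP checksum", packet
    if packet[8] == 0:
        return "drop: TTL is 0", packet
    src, dst = packet[12:16], packet[16:20]
    if src == dst:
        return "drop: SIP equals DIP", packet
    if src in local_ips:
        return "drop: SIP is a local address", packet
    if 224 <= src[0] <= 239:
        return "drop: SIP is multicast", packet
    return "ok", packet[:total_len]

def build_ipv4(src: bytes, dst: bytes, ttl: int = 64, payload: bytes = b"") -> bytes:
    """Build a minimal IPv4/UDP-protocol packet for exercising check_ipv4."""
    hdr = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 20 + len(payload),
                      0, 0, ttl, 17, 0, src, dst)
    hdr = hdr[:10] + struct.pack("!H", ipv4_checksum(hdr)) + hdr[12:]
    return hdr + payload
```

A frame that carries trailing padding comes back truncated to the IP total length, while a frame shorter than the IP total length is dropped. The transport-level checks would follow the same pattern.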

Inbound Packet filtering

Network equipment other than security devices normally services only certain protocols.  Unrelated traffic that reaches these devices only decreases their performance. Offload cards can filter such packets out.  Device software can program rules that allow only the required traffic to reach the device, and the offload card rejects traffic that does not match the programmed rules.
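As a rough illustration of this allow-list behavior, here is a small Python sketch. The rule format (IP protocol number plus optional destination port) and the names `AllowRule` and `InboundFilter` are assumptions made for the example. A real card would likely match on more selectors.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class AllowRule:
    protocol: int                    # IP protocol number, e.g. 6 = TCP, 17 = UDP
    dst_port: Optional[int] = None   # None matches any destination port

class InboundFilter:
    """Default-deny filter: only traffic matching a programmed rule passes."""

    def __init__(self) -> None:
        self.rules: List[AllowRule] = []

    def add_rule(self, rule: AllowRule) -> None:
        # In this sketch, "programming the card" is just appending to a list.
        self.rules.append(rule)

    def accept(self, protocol: int, dst_port: int) -> bool:
        return any(
            r.protocol == protocol and r.dst_port in (None, dst_port)
            for r in self.rules
        )
```

For example, a device that only proxies HTTP and HTTPS would program two rules, TCP/80 and TCP/443, and everything else would be rejected on the card.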

Traffic Policing of inbound packets

Network equipment devices run complex operations on packets or data. For example, ADC/WOC/security devices do SSL, TCP, HTTP and other application protocol proxying in order to act on the data.  Proxy operations are expensive by nature.  Even the best equipment cannot process more than a few Gbps. If more traffic than that is sent towards the device, packets are dropped at some point, and this only worsens the performance of the device.  Devices are typically rated for a certain performance under different types of traffic workloads.  Offload cards are expected to police the different types of traffic and discard packets that exceed the policing parameters.  Device software is expected to program the offload card with the protocol types and the corresponding traffic policing parameters.
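The post does not say which policing algorithm is used. A single-rate token bucket is one common choice, so the sketch below assumes it. The class name `TokenBucketPolicer` and its parameters are hypothetical, and the caller supplies timestamps explicitly so the behavior is easy to follow.

```python
class TokenBucketPolicer:
    """Single-rate token bucket: packets that exceed the profile are dropped."""

    def __init__(self, rate_bps: float, burst_bytes: int) -> None:
        self.rate = rate_bps / 8.0        # refill rate in bytes per second
        self.burst = float(burst_bytes)   # bucket depth in bytes
        self.tokens = float(burst_bytes)  # start with a full bucket
        self.last = 0.0                   # time of the previous update

    def allow(self, pkt_len: int, now: float) -> bool:
        """Return True to pass the packet, False to discard it."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if pkt_len <= self.tokens:
            self.tokens -= pkt_len
            return True
        return False
```

Device software would keep one such policer per traffic type (for example, one for SSL and one for plain HTTP), each with the rate the device is rated for.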

TCP stateless offload

Based on conversations with DC equipment vendors as well as offload card vendors, it appears that the majority of them do not want TCP termination in offload cards with a corresponding TCP/IP stack bypass in the x86 device.  Device vendors seem happy with the TCP/IP stack of the device's Linux or NetBSD operating system.  Peripheral changes to the TCP/IP stack and a new Ethernet driver are the only changes these vendors will accept to accommodate offload cards.  The reasons cited include the maturity of the Linux TCP/IP stack, its constant updates to the latest standards (RFCs), and avoiding application porting and system retesting.

Due to this, offload cards are expected to provide only TCP stateless offload. TCP stateless offload functions increase the performance of proxy applications.
  • Offload of hash function:  This is described above under 'Basic networking offload'.  The TCP layer of the device TCP/IP stack uses a hash function to find the TCP control block (TCB).  If the offload card computes the hash, the device TCP/IP stack can skip this step and save some CPU cycles.
  • Large Receive Offload:  As described in the post "Linux TCP Large Receive Offload optimization to increase performance", the LRO feature reduces the number of packets seen by the device TCP/IP stack. The fewer the packets, the better the performance of the TCP/IP stack. Offload cards can implement LRO by aggregating multiple consecutive TCP segments of a connection into a single large packet.
  • Transmit Segmentation Offload:  Similar to LRO, this feature reduces the number of packets that traverse the device TCP/IP stack, but this time in the transmit direction.  The device TCP stack normally segments data based on the MSS, which is PMTU - IP header length - TCP header length.  This increases the number of packets in the transmit direction, and cycles are also spent segmenting the large data.  TSO functionality in offload cards reduces the number of packets and saves the CPU cycles otherwise spent on segmentation in the device.  The device TCP stack can hand the large packet, along with the segment size and PMTU, to the offload card. Using its TSO and IP fragmentation functionality, the offload card can segment and, if required, fragment the packets before sending them out on the network.
  • ACK offload:  As we know, TCP/IP stacks are expected to send at least one ACK for every two TCP segments they receive.  Hence there is a large number of ACKs.  Similar to TSO, reducing the number of ACKs traversing the device TCP/IP stack can increase the performance of the device.  The device TCP/IP stack passes a template ACK to the offload card, together with the number of ACKs to generate and the corresponding sequence/ACK number information.  The offload card can then generate multiple ACKs from the template ACK.
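To make the TSO arithmetic above concrete, here is a minimal Python sketch of the MSS calculation and the segmentation step. It assumes 20-byte IP and TCP headers (no options), and the function names `mss_from_pmtu` and `tso_segment` are invented for the example. A real card would also replicate and fix up the headers of each segment.

```python
def mss_from_pmtu(pmtu: int, ip_hdr_len: int = 20, tcp_hdr_len: int = 20) -> int:
    """MSS = PMTU - IP header length - TCP header length."""
    return pmtu - ip_hdr_len - tcp_hdr_len

def tso_segment(payload: bytes, start_seq: int, mss: int):
    """Split one large TCP payload into (sequence number, chunk) pairs.

    Sequence numbers advance by the chunk length and wrap at 2**32.
    """
    if mss <= 0:
        raise ValueError("mss must be positive")
    segments = []
    for offset in range(0, len(payload), mss):
        seq = (start_seq + offset) & 0xFFFFFFFF
        segments.append((seq, payload[offset:offset + mss]))
    return segments
```

With a PMTU of 1500 the MSS is 1460, so a 4000-byte payload handed down by the device TCP stack leaves the card as three segments of 1460, 1460 and 1080 bytes.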

Traffic Shaping of outbound packets

Many types of network equipment do some sort of traffic shaping on outbound packets.  Traffic shaping is almost the last step in outbound packet processing.  QoS or traffic shaping is much the same in all devices, and the device software adds little value here.  Hence this function can be done by the offload card without impacting any device functionality.  Device software can program the traffic shaping policies into the offload card, and the offload card acts on outbound packets based on that configuration.  It is important that offload cards support multiple scheduling algorithms and hierarchical shaping.  Moving this function to the offload card can save a good number of CPU cycles, making more CPU power available to the device applications.

IPsec termination offload

Some network equipment, such as network security devices, implements IPsec VPN.  For inbound packets, IPsec processing is done in the early stages of packet processing. For outbound packets, it is done one step before traffic shaping.  Hence this is also a good candidate for offloading along with traffic shaping.  With IPsec inbound SA processing offloaded, the device software sees only clear packets.  Similarly, with IPsec outbound SA processing offloaded, the device software does not need to do the expensive outbound IPsec processing.  Device software is expected to program the tunnels (pairs of inbound and outbound SAs) upon IKE negotiation, and to delete them when their lifetime expires or when there is an explicit delete indication from the remote gateway.

Offload cards are expected to provide the IPsec termination offload function.  Offload cards are expected not only to do outbound and inbound SA processing, but also to do inbound policy verification, both on decapsulated inbound packets and on packets that arrived in the clear.  On the outbound side, the offload function can expect the device software to provide an outbound SA reference along with the packet.

Inline DPI offload on inbound packets

Network equipment with IDS, IPS and application detection functionality requires some kind of pattern matching on packets.  These applications analyze TCP, UDP and IP payloads for some detections, while other detections require the processed data of L7 protocols.  Offload cards have access to TCP, UDP and IP payloads but may not have L7 intelligence.  Offload cards can therefore be used at least for detection based on TCP, UDP and IP payloads.  Device software programs the patterns into the offload card and its regular expression engine.  After doing all its other operations on inbound packets, such as 'IP reassembly', 'LRO' and 'packet filtering', the offload card runs the pattern match on the payload and sends the results along with the packet.  Each result is expected to carry the pattern ID and the offset in the payload.  Device software can then skip pattern matching on basic protocol payloads and use the saved CPU cycles for other processing.
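The result format described above (pattern ID plus offset in the payload) can be illustrated with Python's standard `re` module standing in for the card's regular expression engine. The class name `PatternMatcher` and the sample patterns are hypothetical.

```python
import re

class PatternMatcher:
    """Holds programmed patterns and reports (pattern_id, offset) matches."""

    def __init__(self) -> None:
        self._patterns = {}

    def add_pattern(self, pattern_id: int, regex: bytes) -> None:
        # Device software "programs" a pattern under an ID of its choosing.
        self._patterns[pattern_id] = re.compile(regex)

    def scan(self, payload: bytes):
        """Return all non-overlapping matches, ordered by offset then ID."""
        results = []
        for pattern_id, rx in self._patterns.items():
            for match in rx.finditer(payload):
                results.append((pattern_id, match.start()))
        return sorted(results, key=lambda r: (r[1], r[0]))
```

The list returned by `scan` is what would accompany the packet to the device software, which can then decide what to do with each hit without rescanning the payload.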

Multicore Distribution

Network equipment devices normally use multicore x86 processors.  Without any support from offload cards, packets are distributed to multiple cores either via software-based interrupt distribution or by affining interrupts to different cores.  Some NICs can distribute the load across a large number of descriptor rings, with each descriptor ring associated with one interrupt. When such NICs are replaced with an offload card, the offload card is expected to provide a similar capability of distributing the inbound packet load across multiple descriptor rings, each with its associated interrupt.

In addition, offload cards are expected to place all traffic of a given connection (whether client-to-server or server-to-client) on the same descriptor ring.  That is, the hash value calculated on the 5-tuple should be the same for both the C-S and S-C traffic of a connection.  As we know, the SIP and DIP appear in opposite order in the IP header for the C-S and S-C traffic of a connection. The same is true of the TCP and UDP ports.  Some hash algorithms generate different values if the order of the input to the hash algorithm is different. To ensure that the same hash is generated, offload cards are expected to arrange the input to the hash so that it is the same for both C-S and S-C traffic.  One method of doing this is to order the input by field value.  That is, the lesser of SIP and DIP is always passed to the hash algorithm first, followed by the greater of the two.  The same needs to be done for SP and DP.  This ensures that the same hash value is generated in both directions.

The device Ethernet driver and the offload card should agree on the number of receive descriptor rings, so that the offload card can use the hash result to map each packet to one of the descriptor rings.
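The ordering trick and the ring mapping can be sketched as follows. This is a sketch only: CRC-32 (via Python's `zlib.crc32`) is used as the hash because it is one of the algorithms mentioned earlier, and a simple modulo is assumed for the ring mapping. The function names `symmetric_flow_hash` and `select_ring` are made up for the example.

```python
import ipaddress
import struct
import zlib

def symmetric_flow_hash(sip: str, dip: str, sport: int, dport: int, proto: int) -> int:
    """Hash a 5-tuple so that both directions of a connection hash alike.

    The lesser of SIP/DIP is fed to the hash first, then the greater;
    the same is done for the two ports.
    """
    lo_ip, hi_ip = sorted((int(ipaddress.IPv4Address(sip)),
                           int(ipaddress.IPv4Address(dip))))
    lo_port, hi_port = sorted((sport, dport))
    key = struct.pack("!IIHHB", lo_ip, hi_ip, lo_port, hi_port, proto)
    return zlib.crc32(key)

def select_ring(hash_value: int, num_rings: int) -> int:
    """Map a hash result to one of the receive descriptor rings."""
    return hash_value % num_rings
```

Because the inputs are ordered before hashing, the client-to-server packet (10.0.0.1:40000 to 192.168.1.5:80) and the server-to-client reply land on the same ring regardless of how many rings the driver and the card have agreed on.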

Virtualization Distribution

Some network equipment devices support virtualization, with each VM serving the traffic of a different DC customer. VLANs are typically used to distinguish the traffic of different customers.  Using the IOMMU functionality of x86 and the PCIe SR-IOV functionality, offload cards are expected to send inbound traffic directly to the VM's memory without involving the hypervisor layer.  The IOMMU also allows different descriptor rings to be mapped directly to user space processes.  For more information on how the IOMMU works, please refer to the AMD paper on the IOMMU specification.

A VM may have multiple cores assigned to it.  In that case, this distribution should be combined with the 'Multicore Distribution' described above.


Since offload cards are intelligent NICs and are part of the data center environment, they are also expected to provide data center bridging functionality, even though these are not offload functions.  Some of the functions expected are:

802.1Qaz: Enhanced Transmission Selection
802.1Qbb: Priority-based flow control
802.3bd: MAC control frame for priority-based flow control
802.1Qau: Congestion notification for end points to take action
802.1Qbg and 802.1Qbh: Virtual bridging (VEPA, VN-Tag)

I hope this helps. If you know of any other functions that can be offloaded, please drop a comment.
