Sunday, January 22, 2012

IP Fragmentation versus TCP segmentation

Ethernet controllers are becoming more intelligent with every generation of NICs.  Intel and Broadcom have added many features to their Ethernet NIC chips in the recent past, and multicore SoC vendors are adding a large number of features to their Ethernet I/O hardware blocks.

TCP GRO (Generic Receive Offload; the older hardware form is Large Receive Offload, LRO) and GSO (Generic Segmentation Offload; the hardware form is TCP Segmentation Offload, TSO) are two newer features (in addition to FCoE offloads) found in Intel NICs and many multicore SoCs.  Both are good for any TCP termination application running on the host processors/cores, because they reduce the number of packets traversing the host TCP/IP stack.

TCP GRO works across multiple TCP flows: it aggregates consecutive TCP segments of a flow (based on TCP sequence numbers) into one or a few larger TCP packets in the hardware itself, thereby sending far fewer packets to the host processor.  As a result, the TCP/IP stack sees fewer inbound packets.  Since per-packet overhead is significant in TCP/IP stacks, fewer packets consume fewer CPU cycles, leaving more cycles for applications and improving the performance of the overall system.
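To make the aggregation idea concrete, here is a minimal sketch in Python. It is only an illustration of the coalescing rule, not how any NIC or kernel implements it: the `Segment` record and the `coalesce` function are hypothetical names, and real receive offload also checks flags, options, timers and size limits that are ignored here.

```python
from collections import namedtuple

# Hypothetical, simplified view of a received TCP segment.
Segment = namedtuple("Segment", ["flow", "seq", "payload"])

def coalesce(segments):
    """Merge in-order, back-to-back segments of the same flow."""
    merged = []
    open_idx = {}  # flow -> index in `merged` of the aggregate being built
    for seg in segments:
        idx = open_idx.get(seg.flow)
        if idx is not None:
            agg = merged[idx]
            # Append only if this segment starts exactly where the
            # aggregate ends (consecutive sequence numbers).
            if agg.seq + len(agg.payload) == seg.seq:
                merged[idx] = Segment(agg.flow, agg.seq, agg.payload + seg.payload)
                continue
        # First segment of the flow, or a gap: start a new aggregate.
        merged.append(seg)
        open_idx[seg.flow] = len(merged) - 1
    return merged
```

With three consecutive 100-byte segments of one flow interleaved with a segment of another flow, the stack would see two packets instead of four.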

The intention of TCP GSO is similar to that of TCP GRO, but for outbound packets.  The TCP layer typically segments data based on the MSS value, which is typically derived from the PMTU (Path MTU) value.  Since the IPv4 and TCP headers together take 40 bytes (without options), the MSS is typically (PMTU - 40) bytes.  If the PMTU is 1500 bytes, the resulting MSS is 1460.  When the application sends a large amount of data, the data is segmented into multiple TCP packets, each carrying up to 1460 bytes of payload.  The TCP GSO feature in the hardware eliminates the need for the TCP layer to do the segmentation, and thereby reduces the number of packets passed from the TCP layer to the NIC.  The hardware typically expects the MSS value along with the packet, and it does everything necessary internally to segment the data and send the segments out.
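The MSS arithmetic above can be written down in a few lines. This is just a sketch of the calculation, assuming IPv4 and TCP headers of 20 bytes each with no options; the function names are made up for illustration.

```python
IP_HEADER = 20   # IPv4 header without options
TCP_HEADER = 20  # TCP header without options

def mss_from_pmtu(pmtu):
    """MSS that fits one TCP segment into a single packet of size PMTU."""
    return pmtu - IP_HEADER - TCP_HEADER

def segment_count(data_len, mss):
    """Number of TCP segments needed to carry data_len bytes."""
    return (data_len + mss - 1) // mss  # integer ceiling division

mss = mss_from_pmtu(1500)
print(mss, segment_count(64 * 1024, mss))
```

For a 64 KB write with a 1500-byte PMTU, the stack would have to push 45 packets to the NIC without segmentation offload, versus a single large buffer plus the MSS value with it.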

Ethernet controllers are also increasingly providing support for IP-level fragmentation and reassembly.  The main reason is the increasing popularity of tunnels.

With the increasing usage of tunnels (IPsec, GRE, IP-in-IP, Mobile IP, GTP-U, and the emerging VXLAN and LISP), packet sizes are going up.  Though the tunnel protocol specifications provide guidelines to avoid fragmentation using the DF bit and PMTU discovery, this does not happen in reality.  There are very few deployments where the DF (Don't Fragment) bit, which is required for PMTU discovery, is used.  As far as I know, almost all IPv4 deployments fragment packets during tunneling.  Some deployments configure network devices to do red-side fragmentation (fragmentation before tunneling, so that each tunneled packet appears as a whole IP packet) and some go for black-side fragmentation (fragmentation after tunneling is done).  In the receive direction, reassembly correspondingly happens either after or before detunneling.
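The difference between red-side and black-side fragmentation is easy to see with numbers. The sketch below is only an illustration under stated assumptions: IPv4 with 20-byte headers and no options, a tunnel overhead of 24 bytes (a 20-byte outer IPv4 header plus a basic 4-byte GRE header), and hypothetical function names. It returns the on-wire IP packet lengths for each approach.

```python
IP_HEADER = 20  # IPv4 header without options

def fragment_lengths(total_len, mtu):
    """Total lengths of the IPv4 fragments of a packet of total_len bytes."""
    if total_len <= mtu:
        return [total_len]
    payload = total_len - IP_HEADER
    # Every fragment except the last carries a multiple of 8 payload bytes.
    step = (mtu - IP_HEADER) // 8 * 8
    lengths = []
    while payload > 0:
        chunk = min(step, payload)
        lengths.append(chunk + IP_HEADER)
        payload -= chunk
    return lengths

def black_side(inner_len, overhead, mtu):
    """Tunnel first, then fragment the outer packet."""
    return fragment_lengths(inner_len + overhead, mtu)

def red_side(inner_len, overhead, mtu):
    """Fragment the inner packet first, then tunnel each fragment."""
    inner = fragment_lengths(inner_len, mtu - overhead)
    return [length + overhead for length in inner]

print(black_side(1500, 24, 1500))  # outer fragments; need reassembly before detunneling
print(red_side(1500, 24, 1500))    # each packet is a complete tunneled packet
```

In both cases a full-size 1500-byte inner packet turns into two packets on the wire; what changes is whether the receiver must reassemble before or after removing the tunnel header.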

It used to be the case that service providers gave fragmented packets lower priority during network congestion.  With high-throughput connectivity and a growing customer base, service providers now compete for business by offering very good reliability and high throughput.  Due to the popularity of tunnels, service providers are also realizing that dropping fragmented packets may result in a bad experience for their customers.  It appears that service providers no longer treat fragmented packets as second-class traffic.

Both IP fragmentation offload and TCP segmentation offload can be used to reduce the number of packets traversing the TCP/IP stack in the host.  The next question is how to tune the TCP/IP stack to use these features, and how to divide the work between the two hardware features.

The first thing to tune in the TCP/IP stack is to remove the dependency of the MSS on the PMTU.  As described above, the MSS is calculated from the PMTU value today.  Because of this, the TCP stack never uses IP fragmentation for outbound TCP traffic.

TCP segmentation adds both a TCP and an IP header to each segment.  That is, for every 1460 bytes of data, there is an overhead of 20 bytes of IP header and 20 bytes of TCP header.  In the case of IP fragmentation, each fragment has only its own IP header (20 bytes of overhead), and the TCP header is carried once, in the first fragment.  Since TCP segmentation has more overhead, one can say IP fragmentation is better.  Here, the MSS can be set to a bigger value such as 16K, letting the IP layer fragment the packet if the MTU is less than 16K.  This is certainly a good argument, and it works fine in networks where reliability is good.  Where reliability is not good, if even one fragment gets dropped, the TCP layer needs to resend the entire 16K bytes in retransmission.  If TCP had done the segmentation, it would only need to resend the lost segment.
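The trade-off can be put into numbers with a short sketch. The assumptions are mine: IPv4 and TCP headers of 20 bytes each with no options, "16K" taken as 16384 bytes, a 1500-byte MTU, and made-up function names. Each function returns the packet count, the total header bytes, and the most data TCP must resend if a single packet is lost.

```python
IP_HEADER = 20   # IPv4 header without options
TCP_HEADER = 20  # TCP header without options

def ceil_div(a, b):
    return -(-a // b)

def tcp_segmentation_cost(data_len, mtu):
    """Data split by TCP into MSS-sized segments, one per packet."""
    mss = mtu - IP_HEADER - TCP_HEADER
    packets = ceil_div(data_len, mss)
    header_bytes = packets * (IP_HEADER + TCP_HEADER)
    max_resend = min(mss, data_len)  # only the lost segment is resent
    return packets, header_bytes, max_resend

def ip_fragmentation_cost(data_len, mtu):
    """Data sent as one big TCP segment, fragmented by the IP layer."""
    ip_payload = TCP_HEADER + data_len  # TCP header travels once
    if ip_payload + IP_HEADER <= mtu:
        packets = 1
    else:
        per_fragment = (mtu - IP_HEADER) // 8 * 8
        packets = ceil_div(ip_payload, per_fragment)
    header_bytes = packets * IP_HEADER + TCP_HEADER
    max_resend = data_len  # losing any fragment loses the whole segment
    return packets, header_bytes, max_resend

print(tcp_segmentation_cost(16384, 1500))
print(ip_fragmentation_cost(16384, 1500))
```

For 16384 bytes over a 1500-byte MTU, both approaches put 12 packets on the wire. Segmentation spends 480 bytes on headers and resends at most 1460 bytes after a single loss, while fragmentation spends 260 bytes on headers but resends all 16384 bytes.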

There are advantages and disadvantages with both approaches. 

With the increased reliability of networks, and with service providers no longer giving fragmented traffic special treatment, IP fragmentation is not a bad thing to do.  Of course, one should still worry about retransmissions.
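How much to worry about retransmissions depends on the loss rate. As a rough sketch, assume each fragment is lost independently with the same probability (a simplification; real losses are often bursty). A segment split into n fragments is then lost whenever any one of them is lost. The function name below is hypothetical.

```python
def segment_loss_probability(fragments, fragment_loss_rate):
    """Chance that at least one of `fragments` packets is lost,
    assuming independent losses at fragment_loss_rate each."""
    return 1.0 - (1.0 - fragment_loss_rate) ** fragments

# Compare a single-packet segment with one spread over 12 fragments.
for loss_rate in (0.0001, 0.001, 0.01):
    single = segment_loss_probability(1, loss_rate)
    fragmented = segment_loss_probability(12, loss_rate)
    print(f"loss={loss_rate}: 1 packet {single:.5f}, 12 fragments {fragmented:.5f}")
```

At low loss rates the chance of losing a 12-fragment segment is roughly 12 times the per-packet loss rate, and each such loss costs a retransmission of the whole segment.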

I have heard of a few tunings based on the deployment.  Warehouse-scale data center deployments, where the TCP clients and servers are in a controlled environment, are tuning the MSS to 32K and more with a 9K (jumbo frame) MTU.  I think that, for a 1500-byte MTU, going with an MSS of 8K may work well.
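To see what these tunings mean on the wire, the sketch below counts the IP fragments per full-size TCP segment. The assumptions are mine: IPv4 and TCP headers of 20 bytes each with no options, "32K" and "8K" taken as 32768 and 8192 bytes, and a "9K" MTU taken as 9000 bytes. The function name is hypothetical.

```python
IP_HEADER = 20   # IPv4 header without options
TCP_HEADER = 20  # TCP header without options

def fragments_per_segment(mss, mtu):
    """IP fragments needed to carry one full-size TCP segment."""
    datagram = IP_HEADER + TCP_HEADER + mss
    if datagram <= mtu:
        return 1
    # Non-final fragments carry a multiple of 8 payload bytes.
    per_fragment = (mtu - IP_HEADER) // 8 * 8
    ip_payload = datagram - IP_HEADER
    return -(-ip_payload // per_fragment)  # ceiling division

print(fragments_per_segment(32768, 9000))  # 32K MSS over jumbo frames
print(fragments_per_segment(8192, 1500))   # 8K MSS over standard Ethernet
```

Under these assumptions a 32K segment becomes 4 fragments on a 9000-byte MTU, and an 8K segment becomes 6 fragments on a 1500-byte MTU.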


悟空 said...

Great article! Learned much from it :)

Ultra-low latency MAC said...

It is extremely interesting for me to read the article. Thank you for it. I like such topics and everything connected to them. I would like to read a bit more on that blog soon.


Intilop said...

I have heard really great things about these 10G tcp offload tips! I can’t wait to try all of them till I find the one that is perfect for me!

Intilop said...

This amazing advance technology has capabilities such as direct data sourcing, application layer data advance integrity check, traffic management, direct data advance placement and many more beneficial capabilities.

Patrick Petersen said...

Full kernel bypass technology is used for optimizing network performance. It decodes the network packets and passes the data from kernel space to user space by copying it.