Sunday, March 14, 2010

Linux TCP Large Receive Offload optimization to increase performance

In some network packet processing applications, the number of packets processed determines the performance. TCP is a stream protocol with no packet boundaries, so consecutive packets of a connection can be aggregated into fewer, larger packets when they are received at the lowest level. The more packets that can be aggregated, the higher the performance. Applications that can benefit include:
  • Any proxy-based application (application delivery controller, WAN optimization, network antivirus)
  • IDS/IPS
  • Firewall ALGs
  • Server applications

I found an excellent paper describing two techniques for improving TCP connection throughput: receive aggregation and acknowledgment offload. Please find it here. The paper also compares performance with and without these optimization techniques: throughput improved from 3.4 Gbps to 4.6 Gbps, a 35% increase.

The receive aggregation technique is already implemented in the Linux 2.6 kernel, where it is called Large Receive Offload (LRO). The feature is implemented in net/ipv4/inet_lro.c.

The receive aggregation technique is simple. It is used only when the Ethernet driver uses NAPI. In NAPI-enabled Ethernet drivers, a softirq reads the packets from the receive descriptors. Typically NAPI reads out all the packets in the receive descriptors, or stops when it reaches a threshold (its quota). The steps are:
  • With NAPI enabled, the Ethernet driver normally passes each packet up the stack using netif_receive_skb. With LRO, the packet is instead given to the LRO library using the lro_receive_skb function.
  • The LRO module looks for a matching flow. If there is no match, it creates a new flow.
  • The LRO module determines whether the packet is eligible for aggregation. A packet is not eligible if any of the following conditions applies:
    • Frame is padded (the IP total length must equal the received packet length)
    • Non-TCP packet.
    • IP options are present
    • IP ECN CE is set
    • TCP segment carries no data
    • CWR (Congestion Window Reduced) flag is set
    • ECE (ECN Echo) flag is set
    • SYN flag is set
    • FIN flag is set
    • URG flag is set
    • PUSH flag is set
    • RST flag is set
    • ACK flag is not set
    • A TCP option other than the Timestamp option is present
  • If the sequence number of this packet matches the next expected sequence number, the packet is added to the existing packet sequence. If not, the packet is not eligible for aggregation.
  • When a packet is found to be ineligible for aggregation, the buffered packets must be sent to the stack before the current packet. This is done using the lro_flush() function.
  • If the packet is eligible for aggregation, it is associated with the existing packets by manipulating the skb.
  • When aggregation stops, the LRO module does the following before sending the aggregated packet to the stack:
    • Sets the ACK number to that of the last packet.
    • Keeps the timestamp option of the last packet.
    • Recalculates the IP checksum (the packet is now bigger).
    • Updates the partial checksum of the TCP payload.
  • Aggregation stops:
    • When the configured aggregation limit is reached.
    • When the total packet size exceeds (64K - MTU).
    • When NAPI has processed all the packets in the receive descriptors or has reached its quota.
  • The Ethernet driver is expected to send up all the packets buffered so far at the end of the current NAPI poll. It does so by calling lro_flush_all.
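
To make the flow above concrete, here is a small user-space model in Python. It is only a sketch of the logic described in this post, not the kernel code: the names Segment, Aggregate, LroModel and eligible are made up for illustration, packets are reduced to a few fields, the size limit is checked against the payload length only, and the checksum updates are left out.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

Flow = Tuple[str, int, str, int]           # (src IP, src port, dst IP, dst port)
BLOCKING_FLAGS = frozenset({"CWR", "ECE", "SYN", "FIN", "URG", "PSH", "RST"})
MAX_IP_LEN = 0xFFFF                        # largest IPv4 total length ("64K")


@dataclass
class Segment:
    """Hypothetical, simplified view of one received frame."""
    flow: Flow
    seq: int
    ack: int
    payload: bytes
    flags: frozenset = frozenset({"ACK"})
    tsval: Optional[int] = None            # TCP Timestamp value, if present
    is_tcp: bool = True
    padded: bool = False                   # frame length != IP total length
    ip_options: bool = False
    ecn_ce: bool = False
    other_tcp_options: bool = False        # any TCP option besides Timestamp


def eligible(seg: Segment) -> bool:
    """Apply the per-packet ineligibility conditions listed above."""
    if not seg.is_tcp or seg.padded or seg.ip_options or seg.ecn_ce:
        return False
    if not seg.payload or seg.other_tcp_options:
        return False
    return "ACK" in seg.flags and not (seg.flags & BLOCKING_FLAGS)


@dataclass
class Aggregate:
    seq: int                               # sequence number of the first segment
    next_seq: int                          # sequence number expected next
    ack: int
    tsval: Optional[int]
    payload: bytearray
    count: int = 1


class LroModel:
    def __init__(self, deliver: Callable[[tuple], None],
                 max_aggr: int = 64, mtu: int = 1500):
        self.deliver = deliver             # stands in for passing a packet to the stack
        self.max_aggr = max_aggr           # configured aggregation limit (assumed default)
        self.mtu = mtu
        self.flows: Dict[Flow, Aggregate] = {}

    def receive(self, seg: Segment) -> None:
        agg = self.flows.get(seg.flow)
        in_order = agg is None or seg.seq == agg.next_seq
        if not eligible(seg) or not in_order:
            # Flush buffered data first so the stack sees packets in order.
            self.flush(seg.flow)
            self.deliver((seg.flow, seg.seq, seg.ack, seg.tsval, bytes(seg.payload), 1))
            return
        if agg is None:                    # first eligible packet starts a new flow
            self.flows[seg.flow] = Aggregate(
                seg.seq, (seg.seq + len(seg.payload)) & 0xFFFFFFFF,
                seg.ack, seg.tsval, bytearray(seg.payload))
            return
        agg.payload += seg.payload
        agg.next_seq = (agg.next_seq + len(seg.payload)) & 0xFFFFFFFF
        agg.ack, agg.tsval = seg.ack, seg.tsval   # keep the last packet's ACK and timestamp
        agg.count += 1
        if agg.count >= self.max_aggr or len(agg.payload) > MAX_IP_LEN - self.mtu:
            self.flush(seg.flow)

    def flush(self, flow: Flow) -> None:
        agg = self.flows.pop(flow, None)
        if agg is not None:
            self.deliver((flow, agg.seq, agg.ack, agg.tsval, bytes(agg.payload), agg.count))

    def flush_all(self) -> None:
        """What the driver would call at the end of a NAPI poll."""
        for flow in list(self.flows):
            self.flush(flow)
```

With this model, feeding two in-order data segments of the same flow to receive() and then calling flush_all() delivers a single tuple that carries both payloads back to back, the ACK and timestamp of the second segment, and a count of 2. A segment with the PUSH flag set forces the buffered data out first and is then delivered on its own.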
