Wednesday, March 31, 2010

IPv6 deprecates NAT, which is popular in the IPv4 world - Is this really true?

I am not so sure. There is no doubt that a very large number of IP addresses is available in the IPv6 world. Though NAT was created in response to IPv4 address exhaustion, it has been used for many other purposes, such as:
  • Avoiding renumbering of the majority of a corporate network when the service provider changes or when companies merge. In those cases, only a few devices such as routers, firewalls, NAT devices, IPsec VPN devices and servers such as DNS need to be updated. The rest of the network addressing can remain the same, as the NAT devices front-end the network to the external world.
  • When corporate networks have multiple ISP connections, two scenarios are possible: load balancing of connections across the ISP links, and failover of connections to another ISP link when the primary fails. In the IPv4 world, NAT devices take care of balancing and routing the traffic to the appropriate ISP link in both cases; end hosts are not aware of it. End hosts are always assigned fixed private IP addresses, and the NAT device translates those addresses to the IP address of the ISP link on which the packets will be sent. In an IPv6 network without NAT, each end host would have to take care of this on its own. That may be possible if every host runs a routing protocol, but I am not sure it is practical. With NAT in the IPv6 world, this works just as in IPv4, and the burden of balancing and failover stays with the NAT devices.
  • Security by obscurity is another benefit IPv4 NAT provides, which would be lost in IPv6 if every end host communicated with the external world on its own.
To me, the above problems can be solved in some fashion or other, for example with Provider Independent IP addresses. That might increase the burden on Internet core routers, as there would be a large number of routes. I am not sure whether this is practical today, but I am sure some solution will be found.

But the main reason I believe NAT will continue to exist is Server Load Balancing devices (and ADC devices). SLB devices are expected to find the least loaded server and award the connection to that server. Connections are destined to one IP address (the Virtual IP address); the SLB device identifies the least loaded real server and redirects the connection to it. Redirecting the connection typically involves destination NAT, where the destination IP of the original connection's packets is replaced with the real server's IP address, and vice versa on the source IP for packets going from the real server to the client. Also, to ensure that packets from the server pass through the same SLB device, source NAT is applied, with the source IP replaced with the IP address of the SLB device. Since the source IP seen by the real server is the same even though connections came from multiple clients, two different clients may use the same source port; hence source port translation is also required at the SLB device. Due to this, I believe all considerations applicable to IPv4 NAT are valid in the IPv6 world too.
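As a rough sketch of the full-NAT operation just described (the VIP, server addresses and function names here are illustrative, not from any real SLB product):

```python
import itertools

VIP = "2001:db8::100"            # virtual IP the clients connect to
SLB_IP = "2001:db8:1::1"         # SLB-owned source IP seen by real servers
REAL_SERVERS = ["2001:db8:1::10", "2001:db8:1::11"]

_ports = itertools.count(40000)  # translated source ports, avoids client clashes
_sessions = {}                   # SLB source port -> (client_ip, client_port)

def client_to_server(client_ip, client_port, loads):
    """DNAT the VIP to the least-loaded real server and SNAT the client
    address to the SLB IP with a fresh source port."""
    real = min(REAL_SERVERS, key=lambda s: loads[s])
    slb_port = next(_ports)
    _sessions[slb_port] = (client_ip, client_port)
    return (SLB_IP, slb_port, real, 80)   # tuple as seen by the real server

def server_to_client(slb_port):
    """Reverse translation: the client sees the VIP, never the real server."""
    client_ip, client_port = _sessions[slb_port]
    return (VIP, 80, client_ip, client_port)
```

Note how the per-session port table is what lets two clients share a source port - exactly the reason source port translation is unavoidable, in IPv6 just as in IPv4.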

One can say that there are alternatives to address translation, such as DNS-based load balancing, where different DNS requests are answered with different real server IP addresses. But this is not precise, and some real servers can end up more loaded than others. Another suggested solution is to do only MAC translation. Though this works, it works only if the SLB device and the real servers are on the same link.
Due to the limitations of these approaches, load balancing with NAT will continue to be required in many deployments. So, don't rule out NAT even when we all move to IPv6.

If you believe I am wrong here, please drop a comment.

Thursday, March 25, 2010

Linux based fast path - Why is this needed?

I found two excellent blog posts on this subject. Please see them here:

http://www.multicorepacketprocessing.com/the-need-a-fast-path-on-multicore-cpu/
http://www.multicorepacketprocessing.com/os-networking-stacks-like-linux-are-not-well-adapted-to-multicore-packet-processing/

I see this statement in one of the blog entries: "However, the stack of any OS cannot go under ~2000 cycles per packets due to its design." I believe it is true. Linux and other operating systems are generic and need to cater to many different types of applications, and generality always comes at a price. The IP forwarding path in Linux, for example, goes through many layers of software from reception of a packet to its transmission, and these layers add cycles. As the blogs above excellently list, the higher layers of the operating system don't have full control over the hardware (processor and accelerator) features. Because many layers of software come into the picture, cache utilization is not very good. On top of that, context switching and locks (on multicore processors) in these generic layers use up more core cycles.

Fastpath is nothing new; it has been adopted by networking vendors for a long time. A fastpath is specific to the applications running in the networking device. For example, fastpath layers are implemented for firewall, IPsec VPN, QoS and forwarding. The fastpath concept is very simple, as described below.
  • When a packet is received, the fastpath first checks for a matching context.
  • If there is no context, the packet is given to the normal-path application modules.
    • The normal path, based on its policy rules, may decide to handle the session by creating its own context in its application. In addition, it either creates the context in the fastpath, or decides to handle the session itself by not populating the fastpath context.
  • If there is a matching context, the packet is handled within the fastpath and routed without involving the normal path.
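The steps above can be sketched in a few lines; the flow key, the stand-in policy check and the packet-as-dict model are all illustrative assumptions, not any vendor's actual API.

```python
fast_contexts = {}   # flow key -> per-flow state installed by the normal path

def flow_key(pkt):
    return (pkt["src"], pkt["dst"], pkt["proto"], pkt["sport"], pkt["dport"])

def policy_allows(pkt):
    return pkt["dport"] == 80        # stand-in for a real policy rule lookup

def slow_path(pkt):
    """Normal path: consult policy, then decide whether to install a
    fastpath context so later packets of the flow bypass this code."""
    if not policy_allows(pkt):
        return "drop"
    fast_contexts[flow_key(pkt)] = {"pkts": 1}
    return "forward"

def receive(pkt):
    ctx = fast_contexts.get(flow_key(pkt))
    if ctx is None:
        return slow_path(pkt)        # no matching context: normal path
    ctx["pkts"] += 1                 # matching context: stay in fastpath
    return "forward"
```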
How/Why do fastpath modules get good performance?
  • It runs right above the hardware and has access to all hardware features.
  • Since the fastpath is specific to the given hardware, it can take advantage of all its features without worrying about losing generality. Since the fastpath module for any given application is expected to have a very small footprint, not being generic across all hardware devices is considered acceptable.
  • Fastpath implementations follow a run-to-completion model.
  • Due to its small footprint, most of the fastpath code can stay in the processors' L1 cache.
  • They can lock critical data blocks in processor caches.
  • Lock-free implementation - performance scales linearly with cores.
As the 'Multicore packet processing' blogs have indicated, it is no surprise to see a 3x to 10x performance improvement depending on the type of application.

There are different types of fastpath implementations in the industry today.
  • ASIC based fast path implementations.
  • Network processor based fastpath implementations (remember the NPF? They even published a set of API documents for the fast path. You can still find them here.)
  • Control plane and Data Plane with some cores running CP and some cores running FP in DP.
  • Linux Ethernet Driver based Fastpath for devices that use Linux SMP.
There are pros and cons among the different fastpath approaches, but the basic idea is the same across them: do the routine jobs of packet processing in the fast path and let the normal path handle special connections and packets.

I believe many network device vendors will need to adopt some kind of fast path if they want to support high volumes of traffic with cost-effective processors.

Sunday, March 21, 2010

LAN & WAN Interfaces with IPv4 and IPv6 - Prelude to Data Model definition

In any network routing device, you have LAN interfaces and WAN interfaces. LAN interfaces are connected to networks with desktops, servers etc. WAN interfaces are connected to the network that leads to edge routers or devices towards the Internet. In an edge router, WAN interfaces are used to connect to the service provider network. Non-edge routers may have WAN interfaces, or in some cases none at all.

So far, interfaces have been used to connect to IPv4 networks and hence accepted only IPv4 addressing. IPv6 networks in enterprises are becoming common. Now interfaces need to be configured with IPv6 addresses and to provide IPv6 addresses to local network machines. Configuration is becoming complex, but understanding the concepts makes it easier. This article tries to lay out requirements from both IPv4 and IPv6 perspectives on LAN and WAN interfaces. I hope it is useful for both developers and administrators.

LAN Interfaces: Routers have the following types of LAN interfaces that take IP addresses.
  • Ethernet Interfaces: Some Ethernet interfaces may become part of a bridge interface. If an interface becomes part of a bridge, i.e. a bridge port, it is no longer called a LAN interface; it is simply called a bridge port. Similarly, if an interface becomes part of a bonding interface, it is not called a LAN interface either.
  • Bonding Interfaces: Multiple Ethernet interfaces are bonded together into one interface.
  • VLAN Interfaces: This is the reverse of a bonding interface. Here one Ethernet interface is divided into multiple LAN interfaces, with the VLAN ID used to de-multiplex incoming traffic to the different interfaces. Since VLANs are used, these are called VLAN interfaces.
  • Bridge Interfaces: Multiple Ethernet interfaces become a bridge interface using the 802.1D protocol. These are also LAN interfaces.
LAN interface requirements:
  • Multiple IPv4 addresses (address, subnet) can be configured to enable multiple IPv4 networks on the same physical LAN.
  • DHCP IPv4 Server configuration: A LAN interface can be configured to serve IP addresses to machines on the LAN. There are multiple requirements here.
    • There are many different types of LAN machines - VoIP phones, media servers, desktops, laptops, smart phones etc. Each type of machine might have different QoS requirements. By providing IP addresses from a different range to each type of LAN machine, QoS policies can be configured easily, with rules matching the appropriate address range. It is my understanding that different types of machines send the 'Vendor Class Identifier' option differently; this can be used to select the IP address range from which to serve an address, and other options can be used for selection as well. So, the DHCP server configuration on the LAN interface should be able to take multiple IP address ranges with associated DHCP option values. Of course, it should also take a default IP address range, used when a DHCP client sends options and values that don't match any of the conditions set on the server side.
    • A DHCP server is used not only to assign the IP address but also other IP configuration such as DNS server addresses, WINS server address, default router addresses etc. Some of these can be configured manually at the server, but some might need to be learnt from the WAN connections. Note that the WAN connections may not be up when the LAN machines contact the DHCP server. In these cases, it is necessary to set the lease time very short (a few minutes) so that the client initiates DHCP again soon. When all the information is available to the DHCP server (i.e. when the WAN connection is up), it can give a longer lease to the DHCP clients. In the IPv4 world, DNS servers learnt from the WAN are typically not propagated to DHCP clients; instead they are configured in the local DNS relay, and the local LAN interface IP address is given as the DNS server to the DHCP clients. Thereby, there is no dependency on when the WAN connection comes up.
    • At times, there may be a DHCP server elsewhere. In this case, it should be possible to set up a DHCP relay on the LAN interface.
    • Some network devices also have a DNS proxy/relay. In these cases, it is expected that the DHCP server, upon giving a lease to a machine, configures the FQDN with the given IP address in the DNS proxy/relay.
    • In addition to configuration, the device should provide statistics and a listing of attached devices (DHCP leases).
  • Dynamic routing configuration: In enterprises, configuring static routes in each device is discouraged; typically RIP or OSPF is used to learn routes. So, it should be possible to configure whatever is necessary to enable the routing protocol on the interface.
  • Some operating systems don't give the flexibility of configuring the name of an interface. So it is good to provide a facility for the administrator to configure an interface label and let the operating system choose the interface name. This label can be an intuitive name, and any other configuration (such as creating routes) that requires the LAN interface can refer to it by interface label.
  • Multiple IPv6 addresses can be configured statically.
  • IPv6 address assignment: In the IPv4 world, a DHCP server is the only way to serve addresses and other networking information to the machines on the LAN. In the IPv6 world, address information is served in two ways - a DHCPv6 server and SLAAC (Stateless Address Autoconfiguration).
    • SLAAC: The IPv6 prefixes to be advertised can be configured. Machines are expected to create their own addresses from an advertised prefix plus an interface identifier derived from the MAC address of the interface. The router advertises the prefixes in RA (Router Advertisement) messages. The rest of the networking information (DNS servers etc.) is normally served via a DHCPv6 server; since the DHCP server is not assigning addresses, this scheme is called DHCP stateless configuration, as described in RFC 3736. In some deployments, the prefixes advertised to local clients need to be derived from the prefixes the WAN connections get from the ISP. Since there can be many WAN interfaces, there can be a requirement to configure the WAN interface label from which to derive the prefixes.
    • DHCP Server configuration: This is similar to the IPv4 DHCP server, with some minor differences. In IPv4 the address ranges are always configured by the administrator; here, IPv6 prefixes are learnt from the WAN connections. As indicated for SLAAC above, it may be necessary to configure the WAN interface label from which to derive the prefixes and other information. The DHCPv6 server is specified in RFC 3315.
    • As discussed above, if the WAN connection is not up, the internal machines will not be advertised any prefixes and hence may not be able to communicate among themselves. Note that link-local addresses are not expected to be used by applications; they are meant only for Neighbor Discovery and router discovery protocols. It is not good if local machines can't communicate among themselves when there is no WAN connectivity. Of course, there is no issue if the global prefixes are known and configured statically. For the other cases, where WAN connectivity provides the prefixes, a provision is made to assign a ULA (Unique Local Address) prefix, as described in RFC 4193. The configured ULA prefix should be the same across reboots of the CPE device: though the prefix is generated using a random number, it should be saved so that it stays the same across reboots. Due to the randomization this prefix is likely to be unique, but there is no guarantee; hence it is necessary that addresses starting with FC00::/7 are filtered out at the site boundary router towards the Internet. Note that these addresses can still be used for inter-site VPN. For all practical purposes, a ULA prefix is like any global unicast prefix, and it can co-exist with the other global unicast addresses the router advertises to the local LAN machines.
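The ULA generation mentioned above can be sketched with the Python standard library; RFC 4193 section 3.2.2 specifies hashing the current time and an EUI-64 with SHA-1 and keeping the low 40 bits as the Global ID.

```python
import hashlib
import ipaddress
import time

def generate_ula_prefix(eui64: bytes) -> ipaddress.IPv6Network:
    """Build an fd00::/8 ULA /48 prefix per RFC 4193 section 3.2.2:
    SHA-1 over time + EUI-64, keeping the least significant 40 bits."""
    seed = time.time_ns().to_bytes(8, "big") + eui64
    global_id = hashlib.sha1(seed).digest()[-5:]      # low 40 bits
    packed = bytes([0xFD]) + global_id + bytes(10)    # fd + Global ID + zeros
    return ipaddress.ip_network((packed, 48))
```

The resulting prefix would then be persisted so it survives reboots, as noted above.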
 WAN Interface requirements:
  • Multiple physical interfaces can be WAN devices.
  • Some kinds of WAN connections require a physical interface, such as PPPoE and normal IP connections. Other WAN connections send data based on routing information, such as IPsec-IRAC and PPTP.
  • Each WAN device might have multiple WAN connections. Each WAN connection itself becomes an interface. 
  • Each WAN Connection can be  configured to make connections to ISP using one of following:
    • IP Connection
    • PPP - PPPoE, PPTP
    • IPsec - IRAC
  • Each WAN device mostly has statistics information and very little configuration. Statistics mainly contain packets or bytes sent/received, the interface label etc.
    • IP Connection Mode:
      • WAN Device :  Identified by Interface label.  This connection uses this  WAN Device.
      • IPv4 Addressing
        • Sub Modes:  Static,  Dynamic.
        • Static:  Multiple IPv4 addresses with each IPv4 address having associated Subnet prefix.
        • Dynamic (DHCP Client) Mode: It should request an IPv4 address, prefix, DNS servers, WINS servers and SNTP servers. It should also be possible for the administrator to enter other options (to send and receive), such as Vendor Class Identifier. The DNS servers it gets are typically programmed into the DNS relay.
      • IPv6 Addressing: This is somewhat complex compared to IPv4.
        • Sub Modes: Static, Dynamic. In static mode, static IPv6 addresses can be configured.
        • In dynamic mode, it starts with SLAAC (RFC 4862). If the upstream router indicates that addresses must be obtained statefully (M flag), a DHCPv6 client is initiated with the IA_NA option. DHCPv6 stateful addressing (RFC 3315) and DHCPv6 Prefix Delegation (RFC 3633) are always required to get other networking information (DNS servers, SNTP servers, SIP servers etc.). The prefixes it gets would be divided across multiple LAN interfaces; the division to use can be configured. As in IPv4, it should also take configuration for options that need to be sent or received. Note that the DNS server information may be used by the LAN-side DHCP server, so the CPE device should be able to program the LAN-side DHCP server or SLAAC with the learnt prefixes and DNS servers. Also note that if the IA_NA option is not fulfilled by the server, the device should assign one address from the delegated prefixes to the WAN interface.
    • PPP Connection:
      • Sub Modes:  PPPoE,  PPTP
      • WAN Device Interface Label: The WAN device to use. Valid only for PPPoE. In the case of PPTP, the interface is identified by the routing entry, which itself is found using the PPTP server IP address.
      • Generic PPP Configuration required:
        • User name and password in case of PAP/CHAP
        • Other PPP information (MTU, MRU, compression control etc.)
      • Sub Mode Specific configuration:
        • In case of PPPoE:  AC Name, Service Name etc..
        • In case of PPTP:   PPTP Server IP address and other information.
      • IPv4 Addressing:
        • Static or dynamic.
        • It can also get DNS server IP addresses. As in 'IP Connection' mode, these addresses can be programmed into the DNS relay.
      • IPv6 Addressing
        • Using PPP, only link-local addresses are negotiated.
        • Using RA (SLAAC), it can get the prefixes. If the O flag is set, it gets DNS and other information via DHCPv6.
        • If the RA indicates M=1, it tries to get an address using DHCPv6 IA_NA.
        • In any case, it initiates DHCPv6 PD to get the prefixes.
        • If IA_NA is not successful, it takes one address from the prefixes and assigns it to its interface.
        • The only configuration required for the above operation is the DHCP options to send and receive.
    • IPsec-IRAC Mode (RFC 5739) :
      • IPv4 Addressing:
        • If enabled, it gets an IP address, which is used as the NAT IP address.
      • IPv6 Addressing:
        • All IKE and SPD policy rule configuration is required.
        • As part of IRAC, it is expected to get the IPv6 prefixes and the DHCP server IP address.
        • Using  DHCP Stateless configuration, it gets other networking information.
        • As in other modes, it assigns the prefixes to LAN devices (SLAAC and DHCP server configuration of the LAN devices) and also programs the DNS servers into the DHCP servers of the LAN devices.
    • Like any good data model, enough information should be provided to the administrator:
      • Statistics
      • Dynamic information that is learnt
    This is only a prelude to defining the data model. Expect a detailed data model soon.
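As an illustration of the prefix division mentioned in the WAN IPv6 addressing requirements above, here is a sketch of carving a DHCPv6-PD delegated prefix into /64s, one per LAN interface label (the labels and the fixed /64 size are assumptions):

```python
import ipaddress

def assign_lan_prefixes(delegated, lan_labels):
    """Carve one /64 per LAN interface label out of the delegated prefix.
    Raises StopIteration if the delegation is too small for all labels."""
    subnets = ipaddress.ip_network(delegated).subnets(new_prefix=64)
    return {label: str(next(subnets)) for label in lan_labels}
```

For example, a delegated /56 yields 256 /64s, plenty for a typical CPE deployment.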

Sunday, March 14, 2010

Linux TCP Large Receive Offload optimization to increase performance

In some network packet processing applications, the number of packets processed determines the performance. TCP is a streaming protocol, so there is no packet boundary; consecutive packets can be aggregated into fewer packets when the TCP packets are received at the lowest level. The more packets that can be aggregated, the higher the performance. Applications that can benefit are:
• Any proxy-based applications (Application Delivery Controllers, WAN optimization, network anti-virus)
• IDS/IPS
• Firewall ALGs
• Server applications

I found one excellent paper describing two techniques to improve TCP connection throughput performance - receive aggregation and acknowledgment offload. Please find it here. The paper also gives performance numbers with and without these optimization techniques: throughput improved from 3.4 Gbps to 4.6 Gbps, a 35% increase.

The receive aggregation technique is already implemented in the Linux 2.6 kernel, where it is called the Large Receive Offload feature. It is implemented in net/ipv4/inet_lro.c.

The receive aggregation technique is simple. It is used only when NAPI is enabled on the Ethernet driver. In NAPI-enabled Ethernet drivers, a softirq receives the packets from the descriptors; typically NAPI reads out all the packets from the receive descriptors (or up to some threshold - the quota).
• An Ethernet driver normally sends a packet up the stack using netif_receive_skb when NAPI is enabled. With LRO, the packet is instead given to the LRO library using the lro_receive_skb function.
• The LRO module finds the matching flow; if there is no match, it creates a new flow.
• The LRO module then decides whether the packet is eligible for aggregation. A packet is ineligible if any of the following conditions apply:
  • The frame is padded (the IP total length must match the received packet length)
  • Non-TCP packet
  • IP options are present
  • IP ECN CE is set
  • The TCP segment has no data
  • CWR (Congestion Window Reduced) flag is set
  • ECE (ECN Echo) flag is set
  • SYN flag is set
  • FIN flag is set
  • URG flag is set
  • PUSH flag is set
  • RST flag is set
  • ACK flag is not set
  • A TCP option other than Timestamp is present
• If the expected next sequence number matches the sequence number of this packet, the packet is added to the existing aggregation. If not, the packet is not eligible for aggregation.
• When a packet is found ineligible for aggregation, the packets buffered so far must be sent up to the stack first, before the current packet. This is done using the lro_flush() function.
• If the packet is eligible for aggregation, it is associated with the existing packets by manipulating the skb.
• When aggregation stops, the following is done before sending the aggregated packet up the stack:
  • The ACK is changed to that of the last packet.
  • The timestamp option of the last packet is kept.
  • The IP checksum is recalculated (the packet has become bigger).
  • The TCP checksum is partially updated for the payload.
• When does aggregation stop?
  • When the configured aggregation limit is reached.
  • When the total packet size exceeds (64K - MTU).
  • When NAPI finishes all the packets in the receive descriptors or reaches its quota.
• The Ethernet driver is expected to send up all the packets buffered so far at the end of the current NAPI instance. It does so by calling lro_flush_all.
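The eligibility rules above can be condensed into one predicate. The dict-based packet model here is an assumption for illustration, not the kernel's actual skb interface:

```python
# TCP flags that disqualify a segment from aggregation, per the list above
DISQUALIFYING_FLAGS = {"SYN", "FIN", "URG", "PSH", "RST", "CWR", "ECE"}

def lro_eligible(pkt, expected_seq):
    """Return True if the packet may be merged into the current aggregation."""
    return (pkt["proto"] == "TCP"
            and pkt["ip_total_len"] == pkt["rcvd_len"]    # no frame padding
            and not pkt["ip_options"]
            and not pkt["ecn_ce"]
            and pkt["payload_len"] > 0                    # segment carries data
            and "ACK" in pkt["flags"]
            and not (pkt["flags"] & DISQUALIFYING_FLAGS)
            and pkt["tcp_options"] <= {"TS"}              # timestamp option only
            and pkt["seq"] == expected_seq)               # in sequence
```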

Saturday, March 13, 2010

NAT and IPsec - Application sequence

To my surprise, after so many years of IPsec deployments, I keep hearing questions related to NAT and IPsec. It appears there is still some confusion about the sequence of operations between NAT and IPsec. That is why I thought I would write a blog post on this subject.

Before going further into the technical aspects, it is good to introduce two scenarios.
• Branch Office VPN: This term typically refers to connecting the offices of an organization over the public Internet with IPsec VPN security.
• Partner VPN: This term typically refers to securely connecting some part of a network, or individual machines, with a partner network using IPsec VPN.
If IPsec VPN is used to secure the networks of different offices of the same organization, one can assume that the private IP addresses of the different networks are unique within the organization. Hence IPsec VPN can be applied directly to the private networks of the offices.

While connecting with partners, one can't assume that the partners' private networks are unique; it is quite possible that both networks use the same private IP address ranges. Hence the IPsec VPN must always be established with public IP addresses. In this case, source NAT is typically applied before IPsec is applied to the packet: source NAT translates the local network IP addresses to public addresses, and the IPsec sessions are negotiated with the public addresses.

Note that a given branch office VPN router may be used not only to connect to the other offices of the organization but also to connect to partner networks. So, VPN routers must support multiple site-to-site VPNs within the organization as well as site-to-site VPNs with partner networks.

Let us take a simple scenario where one security gateway secures the 10.1.5.0/24 network. It is expected to connect securely to its head office VPN router, which secures the 10.1.6.0/24 network. Let us also say this gateway must secure traffic to/from two machines on its local network (say 10.1.5.5 and 10.1.5.6) with three partner machines in the partner network. For this purpose, two public IP addresses are allocated to the local security gateway - 190.1.2.2 and 190.1.2.3. Let us also say the partner provided the IP addresses of the partner machines - 191.1.2.2, 191.1.2.3 and 191.1.2.4.

On the security gateway, the following NAT rules are required:
• Source Range: 10.1.5.5 - 10.1.5.6, Destination Range: 191.1.2.2 - 191.1.2.4, apply one-to-one source NAT with 190.1.2.2 - 190.1.2.3: This rule does source NAT on connections originated from the local network. It replaces the source IP of a packet received from the local network if it falls in the 10.1.5.5 - 10.1.5.6 range and the destination IP is in the 191.1.2.2 - 191.1.2.4 range. The NAT addresses are 190.1.2.2 - 190.1.2.3. Since it is one-to-one NAT, it replaces 10.1.5.5 with 190.1.2.2 and 10.1.5.6 with 190.1.2.3.
• Source Range: 191.1.2.2 - 191.1.2.4, Destination Range: 190.1.2.2 - 190.1.2.3, apply one-to-one destination NAT with 10.1.5.5 - 10.1.5.6: This rule is applied to connections originated by the partner network. It translates the destination IP so that the connections land at the right machines in the local network.
IPsec policy rules would look like this:
• Branch office VPN rule: Source: 10.1.5.0/24, Destination: 10.1.6.0/24, apply security (algorithms and proposals are not shown here).
• Partner VPN rule: Source: 190.1.2.2 - 190.1.2.3, Destination: 191.1.2.2 - 191.1.2.4, apply security.
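A minimal sketch of the two one-to-one NAT rules, using the addresses from the example (the helper names are mine, not any firewall's API):

```python
import ipaddress

# Fixed 1:1 pairing: LOCAL[i] always maps to PUBLIC[i] and back
LOCAL = [ipaddress.ip_address("10.1.5.5"), ipaddress.ip_address("10.1.5.6")]
PUBLIC = [ipaddress.ip_address("190.1.2.2"), ipaddress.ip_address("190.1.2.3")]

def snat_one_to_one(src):
    """Outbound rule: fixed mapping from local source to public address."""
    return str(PUBLIC[LOCAL.index(ipaddress.ip_address(src))])

def dnat_one_to_one(dst):
    """Inbound rule: reverse mapping for partner-originated connections."""
    return str(LOCAL[PUBLIC.index(ipaddress.ip_address(dst))])
```

Because the mapping is one-to-one, no port translation is needed here, unlike the many-to-one NAT case.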
I hope this makes it clear how NAT and IPsec can be combined to connect with a partner network. Note that private IP addresses can be used to talk to a partner network too, as long as the administrators of both organizations ensure there are no duplicate private IP addresses in the networks that need to communicate.

Saturday, March 6, 2010

PCIe Endpoint developer considerations

See my earlier post, "PCIe End Point Developer Techniques", for some of the considerations a developer should keep in mind. This post gives some more expectations from an Endpoint.

Interrupt Coalescing:

An Endpoint interrupts the host whenever it puts data in the receive descriptor ring(s), and also whenever it transmits a packet out from the transmit descriptor ring(s). Interrupt coalescing is normally used to reduce the number of interrupts to the host. With interrupt coalescing configured by the host on the Endpoint, the Endpoint interrupts the host only after filling a configured number of descriptors, or after a configurable amount of time has passed since the previous interrupt. The time parameter ensures the host gets an interrupt even if the configured number of descriptors has not been filled; it should be chosen carefully to keep packet latency low. Interrupt coalescing can also be configured by the host to reduce the number of interrupts when descriptors are removed from the descriptor rings.

As an Endpoint developer, always make sure to provide interrupt coalescing parameter configuration to the host for every interrupt the Endpoint generates.

Now that processors are used in Endpoints, similar functionality is required in the reverse direction. The Endpoint should be able to indicate the interrupt numbers it assigns to the host. In addition, the Endpoint can expose interrupt coalescing parameters in configuration registers for each interrupt; the host reads them and uses this information to reduce the number of interrupts it raises towards the Endpoint. Since these are registers, the host can also write different configuration values.

As an Endpoint developer, make sure to provide this configuration to the host. If you are developing the host driver, please make sure to use the interrupt coalescing functionality.
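The frames-or-time decision described above can be sketched as follows; the parameter names are illustrative, not actual register names, and real hardware would run the time side as an independent timer rather than checking it only at descriptor arrival.

```python
class Coalescer:
    """Raise an interrupt after max_frames descriptors, or after
    max_usecs have elapsed since the first un-notified descriptor."""

    def __init__(self, max_frames, max_usecs):
        self.max_frames = max_frames
        self.max_usecs = max_usecs
        self.pending = 0      # descriptors filled since last interrupt
        self.first_ts = 0     # timestamp of the first pending descriptor

    def on_descriptor(self, now_usecs):
        """Called for each filled descriptor; True means interrupt the host."""
        if self.pending == 0:
            self.first_ts = now_usecs
        self.pending += 1
        if (self.pending >= self.max_frames
                or now_usecs - self.first_ts >= self.max_usecs):
            self.pending = 0
            return True
        return False
```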

Multiple Receive Descriptor Rings and Multiple Transmit Descriptor Rings:

In intelligent NIC cards, multiple receive descriptor rings are used to pass different kinds of traffic. For example, packets related to configuration and management are given the highest priority compared to other data traffic; within data traffic, voice and video traffic are normally given higher priority. Different descriptor rings are used to pass different priority traffic. The Endpoint, upon receiving packets from the wire, classifies them and puts them in the appropriate rings. The host driver is expected to select the descriptor ring from which to dequeue the next packet; basically, scheduling of rings is used to select the ring. To enable scheduling, rings are given weightage, and some important rings may also be placed in strict priority. The scheduler in the host is expected to select the strict-priority rings first if they have packets; if not, a ring is selected from the weighted ring set for dequeuing.

Similarly, multiple transmit descriptor rings are used by the host for multiple reasons. One reason is similar to the one described above for receive descriptor rings: multiple rings may be used to send different priority traffic. The Endpoint software is expected to schedule across the rings to select one, then dequeue the packet for transmission on the wire. This scheduling can be similar to the one described above.
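A sketch of the dequeue scheduling just described: strict-priority rings are always served first, then a weighted round-robin over the rest. The ring names and weights are illustrative assumptions.

```python
class RingScheduler:
    def __init__(self, strict, weighted):
        self.strict = list(strict)            # ring names, highest priority first
        self.weights = dict(weighted)         # ring -> packets per round
        self.credits = dict(weighted)         # remaining quantum this round

    def pick(self, backlog):
        """backlog maps ring -> queued packet count; returns the ring to
        dequeue from next, or None if all rings are empty."""
        for ring in self.strict:              # strict rings always win
            if backlog.get(ring):
                return ring
        for _ in range(2):                    # second pass after a refill
            for ring, credit in self.credits.items():
                if credit > 0 and backlog.get(ring):
                    self.credits[ring] -= 1
                    return ring
            self.credits = dict(self.weights) # start a new round
        return None
```

With weights voice=2, data=1, a busy period dequeues in the pattern voice, voice, data, which is the weighted sharing described above.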

      A set of descriptor rings meant for different kinds of traffic can be termed a 'Descriptor Ring Group'.

      As an Endpoint developer, one should provide this kind of flexibility so that the host can process high-priority traffic ahead of low-priority traffic.

      Multiple Descriptor Ring Groups:


      In Multicore processor environments, avoiding locks is very important for achieving performance. The Endpoint is expected to provide facilities in such a way that the host never needs to take a lock while dequeuing packets from receive descriptor rings or while enqueuing packets onto transmit descriptor rings. That is where multiple descriptor ring groups are required.  If the host has 4 cores, then at least 4 receive and 4 transmit groups are needed to avoid locks, by assigning each group to a different core.  To ensure that the right core is woken up, it is also necessary that each group has its own interrupt, affined to the appropriate core.  That is, when a core is interrupted, it knows exactly which group of descriptor rings to look at. Since a group is accessed by only one core, no lock is required.

      Now that Endpoints are also being implemented on processors, it becomes important to reduce locking on the endpoint as well.   This makes the problem a little more complicated.  Assume the host has a 4-core processor and the endpoint has an 8-core processor.  As discussed above, 4 groups are good enough for the host, but then the 8 cores in the endpoint would be updating only 4 groups.  Since more than one endpoint core would be manipulating a group, a lock would be required on the endpoint.  To avoid locks in both places, the number of groups needs to be at least the maximum of the number of cores in the host and in the endpoint, with the groups divided equally across the cores on each side.  Even though this formula avoids locks in both places, traffic distribution becomes a problem if the groups cannot be divided equally across the cores; some cores would then be utilized more than others.  Take an example where the host has 4 cores and the endpoint has 7 cores: to make the distribution of groups equal on each side, one needs the least common multiple (LCM) of 4 and 7, that is, 28 groups.  On the host side each core is assigned 7 groups, and on the endpoint side each core is assigned 4 groups.
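The group-count arithmetic above is just the least common multiple of the two core counts. A small sketch (helper names are mine, not from any driver):

```c
/* Greatest common divisor via Euclid's algorithm; only a stepping stone
 * toward the LCM below. */
unsigned gcd_u(unsigned a, unsigned b)
{
    while (b) {
        unsigned t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Smallest number of descriptor-ring groups that both the host and the
 * endpoint can split evenly across their cores: LCM(host, endpoint). */
unsigned groups_needed(unsigned host_cores, unsigned ep_cores)
{
    return host_cores / gcd_u(host_cores, ep_cores) * ep_cores;
}
```

For the 4-core host / 7-core endpoint example this yields 28 groups; for a 4-core host and 8-core endpoint, 8 groups suffice.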

      If one core has more than one group, there are two challenges: interrupt assignment and distribution across groups.   Even though PCIe supports MSI and MSI-X, there can be practical limitations on the number of interrupts that can be used, so each group cannot have its own interrupt.  Since interrupts exist to wake a core, one interrupt per core is good enough.

      The producer core is expected to distribute the traffic equally among its groups; round robin on a per-packet basis would be one acceptable method. Once the group is chosen, a descriptor ring within the group is selected, based on the classification criteria described in the previous section, to place the packet. The consumer core is expected to use a similar round-robin mechanism to select the group to dequeue from.  Once the group is selected, a scheduler as described in the previous section selects the descriptor ring to dequeue the packet from.
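The per-packet round robin over groups is simple precisely because the cursor lives in per-core state, so no lock is needed. A minimal sketch (the function and its parameters are illustrative):

```c
/* Per-core round-robin group selection.  '*next' is part of this core's
 * private state, so producer cores never contend on it and no lock is
 * required.  Returns the group index to use for the current packet. */
unsigned next_group(unsigned *next, unsigned num_groups)
{
    unsigned g = *next;
    *next = (g + 1) % num_groups;   /* advance cursor for the next packet */
    return g;
}
```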

      As an Endpoint developer, one should ensure support for the group concept to avoid locks.

      Command Descriptor rings:


      So far we have talked about packet transmit and receive descriptor rings and groups.  Command rings are also required, to pass commands to the Endpoint and for the Endpoint to respond back.  There are many cases where command rings are needed.  For example, an Ethernet endpoint might need information from the host such as:

      • MAC addresses
      • Multicast Addresses
      • Local IP addresses
      • PHY configuration.
      • Any other offload specific information.
      Unlike packet descriptor rings, a response is expected here.  Hence the descriptor should allow the host to provide both a command buffer and a response buffer.  The Endpoint is expected to act on the command and put its response in the response buffer.
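A command descriptor along these lines might look like the following. This is a hypothetical layout to show the command/response pairing; the field names, widths, and the DONE flag are assumptions, not any device's actual format:

```c
#include <stdint.h>

/* Hypothetical command-ring descriptor: unlike a packet descriptor, it
 * carries both a command buffer (host -> endpoint) and a response buffer
 * (endpoint -> host). */
struct cmd_desc {
    uint64_t cmd_addr;    /* DMA address of the command buffer            */
    uint32_t cmd_len;     /* length of the command, in bytes              */
    uint64_t resp_addr;   /* DMA address where the endpoint writes back   */
    uint32_t resp_len;    /* size of the response buffer                  */
    uint16_t opcode;      /* e.g. set MAC address, set multicast list     */
    uint16_t flags;       /* completion status bits                       */
};

#define CMD_DESC_F_DONE 0x0001  /* set by the endpoint when the response
                                 * buffer has been filled in             */

/* Host-side poll: has the endpoint completed this command? */
int cmd_done(const struct cmd_desc *d)
{
    return (d->flags & CMD_DESC_F_DONE) != 0;
}
```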

      As an endpoint developer, one should provide a facility to send commands and receive responses.


      Thursday, March 4, 2010

      IPsec as WAN protocol for IPv6? Read on...

      In the recent past, I see a trend of using one Security Association (SA) for both IPv4 and IPv6 selectors. That is, if two sites having both IPv4 and IPv6 networks are secured using IPsec, traffic from both networks goes over the same SA. Of course, this is possible only with IKEv2, which has the facility to send both IPv4 and IPv6 selectors together as part of CHILD_SA negotiation.

      The Remote Access Client (IRAC) is traditionally used on mobiles and desktop/laptop endpoints. These clients connect to an IRAS in the corporate office, get a private IP address, and access corporate networks.

      In IPv6, the IRAC is used not only on endpoints but also, increasingly, in small-office IPsec boxes. More interestingly, IPsec is being used to get IP addresses for internal LAN machines from the ISP; in this case the ISP runs the IRAS.  That is, IPsec is being used as a WAN protocol.  The flow is something like this:

      • The IRAC in the CPE makes an IKEv2 connection to the IRAS in the ISP.
      • The IRAC requests IPv4 and IPv6 information from the IRAS using configuration payloads in one transaction.
        • IPv4 configuration attributes typically include the IP address, DNS Servers, Remote Networks etc.
        • The IPv6 configuration described in the IKEv2 RFC is not sufficient for this kind of deployment. RFC 5739 defines attributes that provide:
          • Multiple IPv6 prefixes
          • The DHCPv6 Server address at the IRAS end.
          • This standard expects the IRAC to get the rest of the information, such as DNS Servers, via DHCPv6 from the DHCPv6 Server address it learns through this exchange.
      • How the IPv4 address is used:
        • It is typically a public IP address.
        • It is used for NAPT.
        • The DNS Servers and WINS Servers the IRAC receives are configured dynamically into the local DHCPv4 Server.
        • Local IPv4 machines are assigned private IP addresses via the DHCPv4 Server.
      • How the IPv6 information is applied:
        • Since the IRAC gets multiple prefixes, it can assign one address from each prefix to the IRAC interface (virtual link interface), in addition to the link-local address.
        • Prefixes are configured into the DHCPv6 Server on the LAN interfaces if stateful addressing is advertised via RA on the LAN, or into the RA proxy which sends the prefixes in RA messages. If there are more LAN interfaces than prefixes, a prefix needs to be subdivided further and a subset assigned to each LAN interface.
        • Any information obtained through the DHCP transaction with the IRAS DHCPv6 Server may also need to be populated in the DHCPv6 Server on the LANs (stateful or stateless).
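The prefix-subdivision step above is simple length arithmetic: extend the delegated prefix by enough bits to cover all LAN interfaces. A minimal sketch, with a hypothetical helper name, computing only the new prefix length (not the full address derivation):

```c
/* Given a delegated IPv6 prefix length and the number of LAN interfaces,
 * compute the sub-prefix length so each LAN gets an equal-sized
 * sub-prefix.  Illustrative helper only. */
int subprefix_len(int plen, int num_lans)
{
    int extra = 0;                  /* extra bits appended to the prefix */
    while ((1 << extra) < num_lans)
        extra++;
    return plen + extra;            /* e.g. a /56 with 4 LANs -> /58s   */
}
```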

      Even though the above deployment is described as WAN access, a similar transaction can happen in the corporate world.  A small home office, or small sales offices in multiple locations, can use a similar mechanism to assign IPv6 addresses to local machines so they can communicate with IPv6 networks in corporate offices.