Saturday, July 17, 2010

Large Scale NAT with DS-Lite & A+P

Dual-Stack Lite (DS-Lite) and Address plus Port (A+P) assignment to CPE devices are two of the most important mechanisms being adopted by ISPs to provide IPv4 Internet connectivity to smart phones, residential CPE devices, and femto CPEs.

Both mechanisms are intended to keep the IPv6 transition smooth as the Internet becomes IPv6-addressable over time.

ISPs are facing a shortage of public IPv4 addresses. Demand has increased with the popularity of the Internet in general and, in particular, the explosive growth of smart phones. Many ISPs are no longer in a position to provide a dynamic public IP address to each CPE device and smart phone. Note that CPE devices and smart phones have become always-on, so a dynamic IP address is effectively static.

Until the world moves to IPv6, the only way forward is to share IPv4 addresses across multiple subscribers.

Many mobile service providers already give only a private IP address to smart phones, and this trend is continuing even for CPE devices as their numbers explode. ISPs maintain mega NAT boxes which translate the traffic from CPE devices and smart phones to a limited pool of public IP addresses. CPE devices already do their own NAT between the IP addresses they assign to local machines in the LAN and the IP address provided by the ISP. The result is double NAT. Though it works in the majority of cases, there are some limitations which could be problematic for end customers, and hence for the ISP's business.
  • Connectivity could be lost if the dynamic private IP address assigned by the ISP falls within the private subnet the CPE is configured to assign to local machines.
  • Though not a big concern immediately, bigger ISPs might have more customers than the private IP address space allows. The 10.x.x.x network can address about 2^24 (roughly 16.7 million) subscribers. With smart phones expected in the range of 120M by 2012, it is possible that ISPs will not even have enough unique private IP addresses to assign.
  • Applications requiring a special ALG will not work unless both NAT devices (the CPE as well as the carrier NAT) support that ALG. The ISP's carrier NAT box may not entertain proprietary ALGs, or may support only a few ALGs.
  • Two internal machines that need to communicate with each other (peer-to-peer applications) may not work in double-NAT scenarios (hairpinning).
  • Many peer-to-peer applications expect the same IP and port to be used for SNAT even when the destination machines differ (endpoint-independent mapping). If the carrier NAT does not support this, many peer-to-peer applications may not work.
Large Scale NAT (LSN) solves some of the above limitations by doing NAT in only one place. In this model, the CPE is given an IPv6 address on its WAN interface. If IPv6 machines are communicating with IPv6 destinations, no IPv4 is involved and everything works fine. If an IPv4 machine in the private network is communicating with the public IPv4 Internet, the packets are tunneled over IPv6 to the LSN box sitting in the provider network, and the LSN box does the NAPT. LSN eliminates double NAT. LSN also takes care of overlapping private IP addresses among multiple subscribers by using the IPv6 tunnel endpoint address as one of the identification parameters for the NAT entry. It still has problems related to ALGs: if the LSN does not support the ALG for some application, that application never works. Also, the LSN needs to be a high-performing box with respect to throughput, latency, and jitter. Though this can be solved with multiple LSN boxes, the ALG problem is a big obstacle to adoption of this technology. Also, hosting any application becomes tough, as port forwarding is controlled at the carrier NAT rather than at the CPE gateway, as we are normally accustomed to.

The A+P (Address plus Port) specifications provide the flexibility of doing NAT at the CPE. In this case, multiple CPEs are given the same public IP address, but with different port ranges. The CPE NAT is expected to use only its assigned ports for source port NAT. The CPE can decide not to NAT some connections, in which case the LSN in the provider network does the NAT for them. Based on various people's experience, I believe only a few ports are necessary at the CPE due to a technique called 'dense NAT': the same source port can be used across multiple connections as long as the 5-tuple is different for each connection in the external realm. Some web sites using AJAX may make many connections at the same time; I believe some sites make 60+ connections at once. A 128-port range is good enough for many cases. What this means is that, even without LSN, the same IP address can be shared across 128 subscribers, assuming each subscriber requires 128 ports (which uses only a quarter of the 65,536 ports). With A+P alone, an ISP can increase its customer base 128-fold with the public IPv4 addresses it already has. Since NAT is done at the CPE, all the facilities of current CPE boxes remain possible: each CPE can have port forwarding, its own ALGs, and port triggering. Having said that, A+P has its own limitations: IPsec without UDP NAT traversal does not work, and ICMP Echo Request/Reply needs a little more care as it has no port concept.
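To make the dense-NAT idea concrete, here is a minimal sketch of how a CPE might allocate SNAT ports strictly from its delegated A+P range. The structure and helper names are hypothetical, and a real dense-NAT implementation would additionally key allocations by the full 5-tuple so that one port can serve many simultaneous connections:

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical A+P delegation: e.g. ports 10240-10367, a 128-port
 * slice of a public IPv4 address shared with other subscribers. */
struct apm_port_range {
    uint16_t base;          /* first delegated port */
    uint16_t count;         /* number of delegated ports (<= 128 here) */
    bool     in_use[128];   /* per-port allocation map */
};

/* Pick a SNAT source port from the delegated range only. Returns 0 on
 * exhaustion; the CPE could then request another range from the PRR or
 * leave the connection to the provider LSN to translate. */
static uint16_t apm_alloc_port(struct apm_port_range *r)
{
    for (uint16_t i = 0; i < r->count; i++) {
        if (!r->in_use[i]) {
            r->in_use[i] = true;
            return r->base + i;
        }
    }
    return 0;
}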

I believe Comcast is in advanced stages of deploying LSN. Many ISPs will need to deploy some kind of solution very soon. In my view, the solution will be a combination of LSN and A+P. ISPs will differentiate subscribers using the following subscription types.
  • Subscribers requiring static public IP address.
    • Subscribers hosting any servers on standard ports which are expected to be reached via their own Domain name.
  • Subscribers requiring dynamic public IP address.
    • Subscribers hosting servers on standard ports with DDNS.
  • Subscribers with shared public IP address and dedicated ports for SNAT.
    • Subscribers hosting games or hosting servers on non-standard ports.
  • Subscribers with shared public IP address and shared ports (LSN).
    • Subscribers just requiring outbound access.
CPE devices and smart phones require some support if they intend to take advantage of LSN and A+P. When I was going through the specifications, I had one doubt about why CPE-NATted packets need to go through the IPv6 tunnel to the PRR (Port Range Router). I think the reason is to ensure that the CPE device really did NAT with the IP address and ports that were allocated to it by the PRR. This is required to ensure that CPE devices are behaving well and do not damage the connectivity of other CPE devices.

Let us see the kind of changes required in CPE devices.

Features expected in CPE:

  • CPE must support IPv6 addressing on its WAN Interfaces. 
  • Learning of the LSN IPv6 address using DHCP extensions or PPP extensions; alternatively, the CPE should have a facility for static configuration.
  • In case the CPE supports A+P, it should also learn the public IPv4 address and port range(s) from DHCP, PPP, or local configuration. In some cases, additional port ranges can be requested dynamically when ports are getting exhausted; if the CPE no longer requires some port ranges, it should be able to free them back to the PRR.
  • CPE must be able to provide IPv6 addresses to the IPv6 capable hosts in its internal network.
  • CPE must be able to provide IPv4 private addressing to the local hosts in its internal network.
  • CPE must be able to tunnel packets from private IP hosts to the LSN in the provider network.
  • In case of A+P, it should have the intelligence to figure out which connections are to be NATted at the CPE and which ones are allowed to be NATted at the LSN.
  • CPE can optionally also support providing A+P to local hosts if they are A+P aware, in which case the CPE also acts as a local PRR.
Features expected in PRR:
  • It should be able to do Address+Port Management via signaling protocols (DHCP, PPP or web based management).
  • It should figure out which packets need to go through the LSN and, if so, send those packets to the LSN.
  • It should ensure that the packets are NATted by CPE with its delegated addresses and ports. If not, it should discard the packets.
  • It should provide facilities such as the WCCP protocol for security checks (AV, AS, IPS, etc.).
  • It should terminate IPv6 tunnels and should be prepared to make IPv6 tunnels to the LSN.
  • It should be scalable, with good algorithms to:
    • Search the tunnel for incoming packets from the CPE.
    • Search Address+Port based entries for packets coming from the Internet to identify the CPE device and hence the tunnel.
Features expected in LSN:
  • It should be able to terminate large number of tunnels.
  • It should be able to maintain large number of NAT entries.
  • It should be able to work with CPE devices having same private IP address space.
  • It should support stateful failover if a device fails.
  • It should support popular application ALGs such as:
    • FTP, RTSP, SIP, H.323, MGCP, PPTP, L2TP, and more.
I hope it helps.

Monday, June 21, 2010

User space IO - Some Challenges & Mitigations

There is pretty good information about UIO on the Internet. This link provides a good introduction to the subject.

What is UIO (User space I/O) framework?

The UIO framework is part of the Linux kernel and enables device driver development in user space.


Which applications require User space drivers?

Zero-copy drivers are becoming necessary for performance reasons. Many network packet processing applications have traditionally been developed in Linux kernel space; firewall, NAT, and IPsec are some examples you find there. Increasingly, these applications are moving to user space for multiple reasons, such as availability of large memory space, ease of debugging, faster image upgrade and restart, and many more. Moving these applications without also moving the Ethernet driver or acceleration drivers reduces performance. Even though there are some efficient mechanisms to transfer packets between kernel and user space, they still consume core cycles. Having access to the hardware from user space eliminates the need to transfer packets back and forth between user space and kernel space. The UIO framework allows user space applications to own the device. It does this by letting the application's kernel driver map the hardware IO into the user space process. It also allows the application kernel driver to register an interrupt handler for the hardware IRQ and wake up the user space daemon upon a hardware interrupt. Upon getting the indication from the UIO framework, the user space application reads the packets or acceleration results from the hardware memory directly, without involving the kernel.


What are components involved in UIO?

The UIO framework is part of the kernel itself. Application developers need to develop one simple kernel module and a user space application. The kernel module, as indicated above, is expected to register an interrupt handler with the UIO framework and also indicate the memory ranges (address, size pairs) to the UIO framework. The user space application opens the appropriate UIO device /dev/uioX (X being the minor number), gets hold of the memory map ranges from the 'sysfs' file system, does the memory mapping, and waits for interrupt events using either the read() or epoll() system calls. When read()/epoll() returns, it can read the content from the memory mapped area and do the actual application processing on the packets.

API exposed by UIO framework for application kernel modules:

uio_register_device(struct device *parent,  struct uio_info *info)

This function is expected to be called by the application kernel module, with 'info' filled up with the right values. At the end of this function, the UIO framework creates the device /dev/uioX, where X is a dynamically assigned minor number. This is the device that the user space program is expected to open to read interrupt events.

struct uio_info {
    const char        *name;
    const char        *version;
    struct uio_mem        mem[MAX_UIO_MAPS];
    struct uio_port        port[MAX_UIO_PORT_REGIONS];
    long            irq;
    unsigned long        irq_flags;
    struct uio_device *uio_dev;
    irqreturn_t (*handler)(int irq, struct uio_info *dev_info);
    int (*mmap)(struct uio_info *info, struct vm_area_struct *vma);
    int (*open)(struct uio_info *info, struct inode *inode);
    int (*release)(struct uio_info *info, struct inode *inode);
    int (*irqcontrol)(struct uio_info *info, s32 irq_on);
};

name, version: The application driver can provide any strings here. Since the 'name' field is used by the user space application to figure out the device name (/dev/uioX), it is necessary to choose the name such that it is specific to your application and unique across UIO devices. Note that the value X in /dev/uioX is chosen dynamically by the UIO framework and can be different across restarts of the Linux system. So, if the user space application hardcodes the device file in its code, there could be a problem when the system restarts and the UIO devices register with the UIO framework in a different order. 'name' is the one thing that is constant across restarts, as it is given by the application driver. Since the device name is not constant, the user space application, upon its initialization, should find the UIO device based on the value of 'name'. The UIO framework creates a set of files under the /sys/class/uio/ directory: there are as many subdirectories as there are UIO devices. If there are two UIO devices, there would be two subdirectories, /sys/class/uio/uio0/ and /sys/class/uio/uio1/. Under each uioX directory there are files 'name', 'version', and 'event', and directories 'maps' and 'device'. The 'name' file contains the name given by the application driver in its first line, the 'version' file contains the version string given by the application driver, and the 'event' file contains the number of times the interrupt service routine has been called so far.

The user space application software is expected to find the right device by scanning the directory entries (for example with scandir()) in the /sys/class/uio/ directory. For each directory entry, it needs to open the 'name' file, read the first line, and compare. If it matches the name the application is looking for, note that directory entry uioX and use it to form the /dev/uioX string to open the UIO device. The FD returned by opening the device can be used to read interrupt events. This FD can even be kept in epoll(), which is useful if your application needs to wait for events from multiple file descriptors.
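A minimal sketch of that lookup, using readdir() rather than scandir() for brevity; 'wanted' is whatever string the application kernel driver put in uio_info->name:

#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Find the /dev/uioX whose sysfs 'name' file matches 'wanted'.
 * Returns 0 on success and fills devpath (e.g. "/dev/uio1"). */
static int uio_find_device(const char *wanted, char *devpath, size_t len)
{
    DIR *dir = opendir("/sys/class/uio");
    struct dirent *de;
    char path[256], line[64];

    if (!dir)
        return -1;
    while ((de = readdir(dir)) != NULL) {
        if (strncmp(de->d_name, "uio", 3) != 0)
            continue;
        snprintf(path, sizeof(path), "/sys/class/uio/%s/name", de->d_name);
        FILE *f = fopen(path, "r");
        if (!f)
            continue;
        if (fgets(line, sizeof(line), f)) {
            line[strcspn(line, "\n")] = '\0';   /* strip newline */
            if (strcmp(line, wanted) == 0) {
                fclose(f);
                closedir(dir);
                snprintf(devpath, len, "/dev/%s", de->d_name);
                return 0;
            }
        }
        fclose(f);
    }
    closedir(dir);
    return -1;   /* no matching UIO device */
}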

struct uio_mem mem[MAX_UIO_MAPS]: If your application requires mapping the hardware register space into your user space application, the application kernel driver is expected to fill this up. Since multiple memory ranges could be required to access the hardware, UIO provides the facility to give an array of memory ranges.

struct uio_mem {
    const char *name;
    unsigned long addr;
    unsigned long size;
    int memtype;
    void  __iomem *internal_addr;
   ...
}

It is expected that the application kernel driver fills up the array of memory ranges using the above structure at registration time. The user space application is expected to read the memory ranges from the /sys/class/uio/uioX/maps/ directory and do the memory mapping using mmap(). If the application kernel driver fills up four memory ranges, there would be four subdirectories under /sys/class/uio/uioX/maps/: map0, map1, map2, and map3. Under each 'mapX' subdirectory there are three files: name, addr, and size. The 'addr' file contains the address and the 'size' file contains the size. The user space application is expected to read all pairs of 'addr' and 'size' and use mmap() to map them into its virtual space. Some explanation of the fields of uio_mem before going into further details:
  • addr: It could be a physical, logical, or virtual address. Mostly it would be a physical address, as hardware device memory is exposed here.
  • size: Size of the memory that needs to be exposed to user space.
  • name: Name given to each memory range.
  • internal_addr: This is not meant for user space programs. The kernel driver can initialize this for its own later use by the interrupt service routine or the irqcontrol function. Typically, this memory is mapped using ioremap().
One thing to note is that memory mapping is always done on page boundaries, and very often the device memory does not start at a page boundary. Hence the user space application must add the right offset to the address returned by mmap() to point to the right locations in the device. The user space application is expected to keep this 'offset' for each memory range, keyed by 'name'.

mmap() takes an 'offset' parameter (note that this offset has nothing to do with the offset explained above). This offset is normally given in multiples of the page size, and the UIO framework uses it to determine which memory range the user space program intends to map. Note that the Linux IO infrastructure allows the UIO framework to have only one mmap() function: whenever mmap() is called in user space, the UIO framework's mmap() function in the kernel is called, which internally calls remap_pfn_range() to map the memory. Since there is only one mmap() function, how does UIO know which memory range to map? To solve this, UIO expects user space programs to pass an offset of N * getpagesize(), where N is the memory map index. UIO internally recovers the memory map index from the offset field and uses the corresponding 'addr' and 'size' values.
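A small sketch of the call; note that the offset only selects the map index, and any sub-page offset of the device address (addr % pagesize) still has to be added to the returned pointer by the caller:

#include <sys/mman.h>
#include <unistd.h>

/* Map UIO memory region 'idx' (mapX in sysfs) of 'size' bytes. */
static void *uio_map_region(int fd, int idx, size_t size)
{
    /* The UIO framework decodes idx back out of the offset and uses
     * the idx-th uio_mem entry's addr/size for remap_pfn_range(). */
    off_t off = (off_t)idx * getpagesize();
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, off);

    return (p == MAP_FAILED) ? NULL : p;
}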

irq: If your hardware device needs to interrupt the user process, the application kernel driver is expected to register the IRQ number with the UIO framework. If the hardware device does not have this facility or an interrupt is not required, '0' should be passed. The UIO framework also provides an API function, uio_event_notify(), to wake up the user process; this can be used by a timer or other facility to wake up the user space process if the hardware device does not support interrupts.

irq_flags: The kernel driver is expected to pass these flags; they are given to request_irq() by the UIO framework. Typically, IRQF_SHARED is set if the IRQ is shared by more than one hardware device.

uio_dev: This is filled in by the UIO framework, which puts its own private information there. For every registration, the UIO framework creates an instance of uio_device and keeps it there; it is not meant to be interpreted by the application kernel driver. Any further calls from the application kernel driver to the UIO framework pass uio_info; the UIO framework gets its instance from uio_info->uio_dev and uses the information there to do its processing.

irqreturn_t (*handler)(int irq, struct uio_info *dev_info): This is the main interrupt handler, to be provided by the application kernel driver, which implements the interrupt service routine as required by the device. Waking up the user process is taken care of by the UIO framework itself: the framework registers its own function as the interrupt handler when calling request_irq(), so when there is an interrupt, the UIO framework gets control first, calls the application driver's handler, and then does whatever is necessary to wake up the user process. Hence the application driver handler need not worry about waking up the user process. More often than not, my observation is that the application driver interrupt handler does not do much; mostly it just disables further interrupt generation by programming the device registers. What should be done in the handler depends on the hardware device's capabilities.
  • Hardware devices typically allow software to mask/unmask interrupt generation. They also provide the ability for software to acknowledge previous events so that the hardware generates interrupts only for new events; the hardware then generates an interrupt when new packets come in and interrupts are enabled. If the hardware has these capabilities, the kernel handler typically disables interrupt generation. The user space process, upon being woken up, indicates to the hardware to generate interrupts only for new events from now on, reads all device events in a loop (packets, results, etc.), and then re-enables the interrupt. This method automatically provides coalescing: the user space process is woken up on the first event and interrupts are disabled by the kernel handler; by the time the user space process runs, it processes not only the event that woke it up, but also any other events that have arrived since.
  • Note that a user space process or thread may be processing packets from multiple UIO devices. If the user process processes all the packets coming from one device in a loop until all events are read, there is a chance that packets from other devices are not handled in a timely fashion; all devices should get a fair chance. One way to take care of this is to have one thread per device, but that may not be efficient. Performance appears to be best when the number of packet processing threads equals the number of cores/HW threads; there can be more devices than threads, so one thread may need to serve multiple devices. In that case, for fairness across devices, the thread should handle only a 'quota' of packets from each device before revisiting the devices again. This concept is similar to the NAPI model adopted in Linux Ethernet drivers; a sketch of such a loop follows this list.
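A rough sketch of such a loop for one device; device_rx_one() and device_ack_and_unmask() are hypothetical helpers that operate on the mmap()ed device registers:

#include <stdint.h>
#include <unistd.h>

#define RX_QUOTA 32   /* max packets handled per device per round */

extern int  device_rx_one(void);         /* returns 0 when ring is empty */
extern void device_ack_and_unmask(void); /* ack events, re-enable IRQs */

/* read() on a UIO fd blocks until an interrupt fires and then returns
 * the running interrupt count as a 32-bit value. */
static void uio_event_loop(int fd)
{
    uint32_t events;

    while (read(fd, &events, sizeof(events)) == sizeof(events)) {
        int n = 0;

        /* Drain at most RX_QUOTA packets so that other devices served
         * by this thread get a fair share (NAPI-style budget). */
        while (n < RX_QUOTA && device_rx_one())
            n++;

        device_ack_and_unmask();
    }
}
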
 int (*irqcontrol)(struct uio_info *info, s32 irq_on): This function pointer is filled in by the application kernel driver to allow the user space process to explicitly enable/disable interrupt generation by the hardware device. It gets called by the UIO infrastructure when the user space process calls write() on the UIO device fd. Normally, the irq handler disables interrupt generation and the user space process re-enables interrupts through the mapped memory. Some hardware devices might have race conditions if two contexts update interrupt-mask related registers; this can happen when the mask register is used for other purposes too. In those cases, central control of enable and disable is necessary. Modern hardware devices don't have this issue, though, and then this function registration is not required.

 int (*mmap)(struct uio_info *info, struct vm_area_struct *vma): In the usual case, the application kernel driver need not set this pointer. The UIO infrastructure has its own mmap() function which does the memory mapping, using the uio_mem mapping array, when user space calls mmap(). At times, however, the number of entries to map could be more than MAX_UIO_MAPS, in which case the UIO infrastructure cannot do the mapping, and the application kernel driver needs to provide its own mmap() function pointer and do the necessary mapping.

Even though the UIO framework lets the application kernel driver indicate the memory ranges to map, or register an application specific mmap() function pointer, more often I see that neither is used. The UIO framework is predominantly used only for registering the interrupt handler that wakes up the application user process. Many times, the application kernel driver is itself made a character device driver with its own ioctl() and mmap() functions in addition to open() and close(). There are multiple reasons for doing this; one of them is given below.
  • Applications not only need to map device-specific memory locations, but also need kernel memory mapped for packet/acceleration-result buffers, and the UIO infrastructure does not provide this. Ethernet hardware devices typically expose rings of descriptors to receive packets; the application is expected to provide a buffer in each descriptor, and the Ethernet controller fills the buffers with incoming packets. Buffers given to the Ethernet controller must be physical addresses: the current generation of multicore SoCs cannot convert from virtual to physical addresses internally, so physical addresses must be provided for the buffers that go into receive descriptors. Since Linux user space does not deal in physical memory, it needs to get this memory from kernel space, and the application kernel driver does this job. The user space program asks the kernel driver to allocate memory and map it to user space. When mmap() returns in user space, it has the virtual address; it gets the physical address of the allocated buffer from the kernel driver, uses the physical address when programming the hardware, and uses the virtual address in its own code. User space programs typically ask the kernel driver to allocate one big chunk of memory and map it; packet buffer allocation/free is then done from this big chunk.

Applications may require big chunks of memory for several purposes - packet buffers, acceleration results, and even local contexts. But there is only one mmap() function and no special argument by which the user process can indicate the purpose to the application driver. Hence, some kind of protocol is needed between the user space process and the kernel driver. One method typically followed is to indicate the purpose via an IOCTL command, then do mmap(), and then issue another IOCTL command to learn the base address of the allocated memory. Say there are two different memory chunks to be allocated: Chunk1 of size 128 Kbytes for packets and Chunk2 of size 64 Kbytes for acceleration results. The sequence by which user space calls the kernel driver through the FD is:

ioctl(fd,  SET_PURPOSE,  argument consisting of  type 'CHUNK1',  size '128K')
mmap()
ioctl(fd,  GET_MMAP_RESULT, argument consisting of 'physical address').

A similar sequence needs to be followed whenever Chunk2 is required.

The kernel driver needs to keep the information given via SET_PURPOSE in its private data. When the kernel driver's mmap() function gets invoked, it allocates memory using kmalloc() and calls remap_pfn_range(). It stores the address returned by kmalloc() in its private data and gives it back to user space when the GET_MMAP_RESULT command is issued. These three operations need to happen atomically; the kernel driver may want to enforce the sequence and return an error if a new sequence is started before the old one completes.
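A sketch of the user space side of this protocol; the command numbers, argument struct, and chunk types are assumptions that must match whatever the application kernel driver actually defines:

#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

/* Hypothetical commands agreed between user space and the driver. */
struct chunk_req {
    uint32_t type;   /* e.g. CHUNK1_PACKETS or CHUNK2_RESULTS */
    uint32_t size;   /* bytes requested */
};
#define SET_PURPOSE      _IOW('u', 1, struct chunk_req)
#define GET_MMAP_RESULT  _IOR('u', 2, uint64_t)

/* Allocate one chunk: declare the purpose, map it, then fetch the
 * physical base address so hardware descriptors can be programmed. */
static void *alloc_chunk(int fd, uint32_t type, uint32_t size,
                         uint64_t *phys)
{
    struct chunk_req req = { .type = type, .size = size };
    void *va;

    if (ioctl(fd, SET_PURPOSE, &req) < 0)
        return NULL;
    va = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (va == MAP_FAILED)
        return NULL;
    if (ioctl(fd, GET_MMAP_RESULT, phys) < 0)
        return NULL;   /* real code would munmap() here */
    return va;         /* CPU uses va; the hardware uses *phys */
}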

 int (*open)(struct uio_info *info, struct inode *inode), int (*release)(struct uio_info *info, struct inode *inode): These function pointers can be set by the application kernel driver to get control whenever a user space application opens or closes the UIO device, so it can do any setup or cleanup necessary.

Example Program
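Below is a minimal sketch of the application kernel module side, registering one physical memory region and an interrupt handler with the UIO framework. The device base address, IRQ number, and interrupt-mask register offset are made-up values for illustration, and error unwinding is omitted for brevity.

#include <linux/module.h>
#include <linux/platform_device.h>
#include <linux/interrupt.h>
#include <linux/io.h>
#include <linux/uio_driver.h>

#define MYDEV_PHYS 0xfe000000UL  /* assumed device register base */
#define MYDEV_SIZE 0x1000
#define MYDEV_IRQ  42            /* assumed hardware IRQ line */

static struct platform_device *pdev;
static struct uio_info mydev_info;

/* ISR: just mask further interrupt generation in the device; the UIO
 * core wakes up the user process after this handler returns. */
static irqreturn_t mydev_handler(int irq, struct uio_info *info)
{
    void __iomem *regs = info->mem[0].internal_addr;

    writel(0, regs + 0x10);   /* assumed interrupt-mask register */
    return IRQ_HANDLED;
}

static int __init mydev_init(void)
{
    pdev = platform_device_register_simple("mydev", -1, NULL, 0);
    if (IS_ERR(pdev))
        return PTR_ERR(pdev);

    mydev_info.name = "my-packet-dev";   /* matched by user space */
    mydev_info.version = "0.1";
    mydev_info.irq = MYDEV_IRQ;
    mydev_info.handler = mydev_handler;
    mydev_info.mem[0].addr = MYDEV_PHYS;
    mydev_info.mem[0].size = MYDEV_SIZE;
    mydev_info.mem[0].memtype = UIO_MEM_PHYS;
    mydev_info.mem[0].internal_addr = ioremap(MYDEV_PHYS, MYDEV_SIZE);

    /* Creates /dev/uioX and /sys/class/uio/uioX/... */
    return uio_register_device(&pdev->dev, &mydev_info);
}

static void __exit mydev_exit(void)
{
    uio_unregister_device(&mydev_info);
    iounmap(mydev_info.mem[0].internal_addr);
    platform_device_unregister(pdev);
}

module_init(mydev_init);
module_exit(mydev_exit);
MODULE_LICENSE("GPL");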



  

Saturday, June 19, 2010

IPv6 and eNodeB in LTE world - Technical bit

Does IPv6 migration of UEs and PDN Gateways require any support from eNodeBs? The eNodeB, MME, HLR, AAA, and SGW need not be upgraded immediately along with the PDN gateway to support IPv6 external connectivity for the UEs.

eNodeBs communicate with the other 3GPP network elements for signaling and for transporting UE traffic over GTP tunnels. They relay traffic from the PDCP layer to the GTP tunnels and vice versa, and as such they don't look deep into the data: it could be an IPv4 or IPv6 packet. The GTP layer in the eNodeB is expected to pick up the data, add the GTP-U/UDP/IP header, and send it over to the SGW, secured with IPsec. GTP-U can use IPv4 connectivity to talk to SGWs, eNBs, etc., and similarly the signaling communication can happen over an IPv4 network.

PDCP RoHC (Robust Header Compression) is the only mandatory component in the eNodeB which tries to interpret the packets to/from UEs. I believe even the first generation of eNBs would be expected to support both IPv4 and IPv6 in PDCP RoHC. Some eNBs have more sophisticated functionality such as 'application detection'; any component that interprets user traffic would need IPv6 intelligence. I believe eventually all the components, such as GTP-U, IPsec, and QoS, will require IPv6 support as operators move to IPv6-only networks.

By the way, I saw one IETF draft on IPv6 in 3GPP EPS.  Please find it here.  

Sunday, June 13, 2010

Avoiding double IP reassembly in eNodeB and IPsec Gateway in LTE - Red Side Fragmentation

I gave one use case for the red side fragmentation feature a long time back; please find it here. There is one more use case, where it avoids double IP reassembly.

In the LTE world, eNodeBs and SGWs (Serving Gateways) communicate using IPsec over the backhaul network. IPsec functionality is normally part of the eNodeB, but on the SGW side, IPsec is normally deployed in a separate network element (an IPsec gateway). Due to this, there can be double IP reassembly in the eNodeB. Avoiding one reassembly can improve eNodeB performance, and this can be achieved by configuring 'red side fragmentation' on the IPsec GW in the core network near the SGW.

Let us step back and look at the packet processing. Any packet between the eNodeB and SGW is first tunneled using GTP, then tunneled using IPsec to secure the traffic. Let us assume that the MTU of all links is 1500; this is not a bad assumption, as the interfaces of these network elements are Ethernet in many deployments. Let us also assume that a 1500-byte packet is being sent from the SGW to the eNodeB (a downlink packet). GTP in the SGW adds the GTP/UDP/IP header to the packet. Since the MTU of the transmitting interface is 1500, this packet is broken into two: one packet of 1500 bytes and another with the rest of the data. These packets then go to the IPsec gateway, which adds an ESP/IP header to each fragment. The first fragment, after IPsec is done, exceeds the MTU and gets fragmented again, so there are now three packets in total for the original 1500-byte packet. The eNodeB, upon receiving the packets, first needs to reassemble the outer fragments for IPsec inbound processing, and then, for GTP-U consumption, reassemble the inner fragments that emerged from IPsec inbound processing.

How do we avoid the double reassembly?  I can think of three options.

  • Combine GTP and IPsec in the same network element. Packets then get fragmented only after both GTP and IPsec outbound processing are done, so only two fragments are generated and the peer needs to do only one reassembly. Note that this option may not be possible on the SGW side for scalability reasons; GTP and IPsec are always together in the eNodeB, though.
  • Configure the IPsec gateway to do reassembly before IPsec outbound processing: here GTP-U might have broken the packet into two; the IPsec gateway reassembles them, passes the reassembled packet through IPsec outbound processing, and fragments the result if necessary. The peer (eNodeB) sees only two fragments and needs only one reassembly. Note that this adds overhead on the IPsec gateway, which may be acceptable since the IPsec gateway is normally a separate element whose CPU may be entirely dedicated to this.
  • The better option, in my view, is to enable 'red side fragmentation' in the IPsec gateway. The 1500-byte GTP-U packet it receives gets fragmented before IPsec processing is done; three packets in total go through IPsec processing and are sent out, with no further fragmentation. On the eNodeB, the three fragments get reassembled after IPsec inbound processing and before GTP-U detunneling, so only one reassembly is required. The disadvantages are that it requires a three-fragment reassembly on the eNodeB, and more packets go through the IPsec engine on the gateway; since many IPsec gateways use hardware accelerators, I don't think this additional processing will affect overall throughput.
Comments?

Saturday, June 12, 2010

LTE Network Sharing (MOCN) - eNodeB

Typically, eNodeBs and core network components are owned and operated by the same operator. It is quite common to share physical infrastructure among multiple operators, specifically the cell sites (physical location, building, etc.), but each operator used to have its own base stations, transport cards, and so on. As I understand it, that is called passive sharing. This sharing is now being extended to active components such as the eNodeB, which is called network sharing or active sharing.

Network sharing allows one frequency spectrum and one eNodeB to be shared by multiple operators. Dedicated cells per operator, while still sharing the rest of the E-UTRAN infrastructure, are also possible.

There are multiple business reasons for network sharing, the main one being cost savings.
  • Cost savings by sharing infrastructure - eNodeB and frequency spectrum.
    • LTE deployments will require major investments, as new eNBs and new antennas need to be installed by operators. Sharing reduces the number of eNBs each operator needs to own, in some cases (such as rural areas) providing connectivity while reducing costs.
The 3GPP body, as part of the LTE effort, visualized these scenarios and created specifications to handle network sharing. 3GPP specifications 22.951 and 23.251 describe the network sharing requirements and the architecture & functional description of network sharing. The main feature that allows network sharing in LTE is that the eNodeB broadcasts multiple PLMN IDs (Public Land Mobile Network ID, a combination of Mobile Country Code and Mobile Network Code; each operator has a unique PLMN ID) to the UE using the SIB (System Information Block). The UE is expected to select a PLMN ID based on its selection process and use the selected PLMN ID to make the RRC connection with the eNodeB. The eNodeB uses this PLMN ID to select the core network and, in turn, the MME.

This feature of an eNodeB serving multiple operators is also called 'Multi Operator Core Network' (MOCN).

Whenever hardware is shared by multiple operators, fairness comes into the picture. Due to this feature, identification of contexts (whether PDCP, GTP, RLC, MAC, IPsec, or QoS) needs one additional parameter: the operator ID. Note that the TEID used to terminate GTP tunnels may not be unique across multiple operators. In addition, I believe the following features would be expected from an eNodeB to support MOCN.
  • Additional identification parameter, Operator ID in user plane modules.
  • Fairness in allocating resources in eNodeB (Buffers,  Contexts etc.. ) and Radio resource management if same cell is used by multiple operators.
  • VLAN Support - One or few dedicated VLANs for each operator.  
  • DHCP client (IPv4, IPv6), multiple instances: eNodeBs typically get their IP address from a DHCP server. Since there are multiple VLANs due to multiple operators, the DHCP client also needs to be capable of getting multiple IP addresses, one for each operator (VLAN).
  • If other L3 connectivity protocols, such as PPP, are used instead of DHCP, one needs to ensure that these protocols too get multiple IP addresses, one for each operator.
  • eNodeB should ensure that the right source IP addresses are used for GTP tunneling and IPsec tunneling.
  • Fairness to ensure that one operator's traffic does not overwhelm the CPU:
    • Radio bandwidth is normally taken care of as part of radio resource management on a per-operator basis.
    • Incoming traffic from the backhaul network is expected to be policed at the ingress port. Each operator VLAN may be configured to police the traffic, or traffic coming from different VLANs may be scheduled to the CPU fairly (weighted, if configured). Due to scheduling, packets may sit in queues awaiting future scheduling; this can eat up buffers so that new packets cannot be received. So, there should be limits on the number of buffers each VLAN can occupy at any time.
    • Outgoing traffic to the backhaul network also needs to be controlled on a per-operator (VLAN) basis. Note that all VLANs share the same physical link, hence outgoing traffic needs to be controlled per VLAN to ensure the physical link is not overwhelmed. Traffic shaping and scheduling on a per-VLAN basis is expected; within each VLAN, priority-based queuing may also require traffic shaping & scheduling. Hence, the eNodeB is expected to provide hierarchical shaping and scheduling.

Saturday, June 5, 2010

LTE PDCP from eNodeB perspective

Packet Data Convergence Protocol (PDCP) is one of the user plane protocols in LTE. It is present in the UE and eNodeB. This protocol sends and receives packets between the UE and eNodeB over the air interface, working along with the other L2 protocols, RLC (Radio Link Control) and MAC (Medium Access Control).

The PDCP layer works on top of RLC. It transfers uplink packets to the GTP layer, which in turn tunnels them to the core network (Evolved Packet Core - EPC), and it receives downlink packets from the GTP layer and sends them to RLC, which in turn sends them to the UE. That is, the PDCP layer sits between the RLC and GTP layers.

This particular post talks about PDCP layer details in the eNodeB. PDCP is a user plane protocol; the control plane protocol RRC configures the PDCP entities in the user plane.

The PDCP layer is described in the 3GPP 36.323 specification.

PDCP functions : 

The PDCP layer is expected to do the following:

  • Security function over the air interface:
    • Ciphering and deciphering of user plane and control plane data.
    • Integrity protection and verification for control plane data. Note that there is no integrity protection offered to user plane data.
    • The sequence number is used to detect replays.
  • Header compression and decompression for user plane data: Note that there is no header compression for control plane data. RoHC (RFC 4995) is used to shrink the headers to save air interface bandwidth, and is mandatory for voice traffic. Note that in LTE, both voice and data use packet switching. Typically, for every 32 bytes of voice data, around 40 bytes of headers (RTP, UDP, IP) are added in the case of IPv4 and around 60 bytes in the case of IPv6; that is quite a bit of overhead. RoHC is expected to reduce the overhead to a few bytes. For data traffic, RoHC is not mandatory, but it is good to have.
  • Handover: As discussed in an earlier post, there are two types of handover - seamless and lossless. Seamless handover is typically used for radio bearers carrying control plane data and user plane data mapped to RLC UM (Unacknowledged Mode). In seamless handover, header compression contexts are reset and sequence number (COUNT) values are set to 0 in the target eNB; PDCP SDUs that have not been transmitted are sent over the X2 interface (or the S1 interface if there is no X2 connectivity) GTP tunnel to the target eNB. Lossless handover is typically applied to radio bearers mapped to RLC AM. In this handover mode too, the header compression context is reset; that is, the RoHC context is not transferred to the target eNB. In lossless handover, pending downlink packets - PDCP SDUs for which no acks were received from the UE, PDCP SDUs which were not transmitted, and new GTP packets arriving on the S1 interface at the source eNB - are sent to the target eNB. Similarly, uplink packets received out of order are also sent to the target eNB. The control plane in the source eNB sends the 'next transmit sequence number' and 'next expected receive sequence number' to the target eNB for each RAB; optionally it also sends a bitmap of the PDCP sequence numbers of the packets it would expect the UE to retransmit. This information is passed to the target eNB via SN-STATUS-TRANSFER. I guess this information is used by the target eNB PDCP to send the PDCP status report control message.
  • Discard function: This allows packets to be discarded if the PDCP layer could not successfully send them within the 'discard timeout' period.
  • Duplicate discarding: If the PDCP layer receives duplicate packets (packets with the same sequence number), it discards them and does not send them to upper layers.
Some points to note :

The PDCP specification goes to great lengths on PDCP data transfer procedures and details internal implementation such as the state variables to be maintained for receive and transmit operations. These state variables are used to assign sequence numbers at transmit time and to verify/discard/deliver packets received from the RLC layer. I will not go into those details here, as the specification is very clear on them. One thing to note is that these procedures are described from the UE's perspective; the same are valid for an eNB PDCP implementation too, but note that UE PDCP transmits UL packets to RLC and receives DL packets from RLC, while in the eNodeB, the PDCP layer transmits DL packets to RLC and receives UL packets from RLC. Keep that in mind while going through the spec document.

There are two kinds of PDCP bearers: SRB (Signalling Radio Bearer) and DRB (Data Radio Bearer). There are only two SRBs, SRB1 and SRB2, which are used by the control plane protocol to send packets to the UE. DRBs are used for sending voice and data; there are as many DRBs as there are QoS streams.

    There are two kinds of packets in PDCP: data packets and control packets. Packets sent on SRBs and DRBs use the data packet format. Control packets are used by RoHC to provide feedback from decompressors to compressors, and by the PDCP layer to send PDCP sequence number status to the peer (sequence numbers of packets received out of order).

    As discussed before, sequence numbers are sent along with data packets so the peer can do in-order delivery of packets to its user entity. To preserve bandwidth on the air, only the least significant bits of the sequence number are sent; the most significant bits are called the HFN (Hyper Frame Number). Based on the window size, two sizes are used for the sequence number sent with the packet: 7 bits (user plane short SN) and 12 bits (user plane long SN). Typically the short SN is used for UM mode and the long SN for AM mode.
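    In code form, each side builds the full COUNT (the input to ciphering) by concatenating the locally maintained HFN with the SN carried in the PDU header; a small sketch (HFN/window maintenance itself follows the procedures in section 5.1 of 36.323):

    #include <stdint.h>

    /* COUNT = HFN in the upper bits, PDCP SN in the lower bits. Only the
     * SN travels on the air; each side maintains the HFN locally and
     * increments it when the SN wraps. sn_bits is 5, 7, or 12. */
    static inline uint32_t pdcp_count(uint32_t hfn, uint32_t sn,
                                      unsigned sn_bits)
    {
        return (hfn << sn_bits) | (sn & ((1u << sn_bits) - 1u));
    }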

    There is one PDCP context per radio bearer. The PDCP context is identified by a four-tuple: virtual instance ID, sector ID, C-RNTI, and LCI (logical channel identifier); please see this post for more details on virtual instance ID and sector ID. Both sides of PDCP - RLC and GTP - use the same identifiers to identify the PDCP context, hence a single search table is good enough (implementation note).
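    As an implementation sketch, the lookup key could be a small fixed struct used as the hash key on both the RLC and GTP sides (the field widths here are illustrative, not from the spec):

    #include <stdint.h>

    /* Key identifying one PDCP context (one per radio bearer). The same
     * four identifiers are used by both neighbors, so a single hash
     * table can serve lookups from either direction. */
    struct pdcp_ctx_key {
        uint8_t  vinst_id;   /* virtual instance ID */
        uint8_t  sector_id;  /* sector ID */
        uint16_t c_rnti;     /* Cell Radio Network Temporary Identifier */
        uint8_t  lci;        /* logical channel identifier */
    };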

    There is a one-to-one correspondence between PDCP SDU and PDCP PDU; that is, there are no segmentation or concatenation functions in the PDCP layer. Adding the PDCP header and applying compression and security to the PDCP SDU makes the PDCP PDU; similarly, deciphering, decompression, and removal of the PDCP header make the PDCP SDU from the PDCP PDU.

    A PDCP status report is expected to be generated as part of PDCP re-establishment if the RB is set up to send status reports. This report is sent for PDCP PDUs received from RLC (uplink packets). The fields to be sent in the status report are described in section 5.3.1 of the 36.323 spec.

    The PDCP layer in the eNB may also receive a PDCP status report from the UE indicating the out-of-order packets it has received. The PDCP layer is expected to remove the PDCP SDUs that are pending in the transmit queue and acknowledged by the peer via the status report. Also, it is expected to pass the status to the control plane, as the CP may need to send it to the target eNB if that UE is in handover.

    I always wondered how and when PDCP generates status reports. From the UL & DL data transfer procedures in section 5.1 of 36.323, one can observe that PDCP PDUs in the receive direction (from RLC) are not stored in the PDCP layer: they are given to the upper layers immediately after security and RoHC processing. The PDCP layer assumes that packets are given in order by RLC and hence does not need to store packets for in-order delivery to the upper layers. Yet the description of the 'PDCP status report' in section 5.3 of 36.323 says that the status report is sent for PDUs received out of order, which gives the impression that packets are stored in the PDCP layer for in-order delivery; if not, how does the PDCP layer send the bitmap indicating out-of-order packets? So I thought sections 5.1 and 5.3 were contradictory. Then came some enlightenment: this only happens at PDCP re-establishment time. When the control plane indicates re-establishment to the local RLC, the RLC layer sends the PDCP PDUs (RLC SDUs) it has to the PDCP layer on an as-is basis; that is, some packets could be missing at that time. This is the only case where the PDCP layer gets packets with some PDCP PDUs missing. The PDCP layer is also informed of the re-establishment by the control plane. Upon receiving these packets from the RLC layer, PDCP is expected to send the status report to the peer with a bitmap of packet sequence numbers, so that the peer PDCP can remove from its transmit side the SDUs acknowledged in the status message.

    I also had some confusion about PDCP discard for some time. The PDCP specification says (section 5.4 of 36.323) that a timer is started for every packet submitted to the PDCP layer by upper layers; if there is no successful transmission acknowledgment from the local RLC within the 'discardTimer' timeout, PDCP can drop the packet. I wondered for some time how the remote PDCP layer knows about this drop; I thought the remote peer would wait endlessly for that sequence number. But from the UL/DL data transfer procedures, it is clear that the PDCP receiver does not wait on any missing sequence number, as its window keeps moving right.

    PDCP interfaces:
    The PDCP layer interfaces with three neighboring modules: the RRC control plane, RLC, and GTP. Of course, there is also initialization, configuration, monitoring, etc. The following sections describe the interfaces with RRC, RLC, and GTP.

    RRC to PDCP interface :  Following interface would need to be exposed by PDCP user plane layer to the RRC in control plane.
    •  Interface to create PDCP Contexts for SRB and DRB in PDCP layer :  Parameters for this function at high level are:
      • Virtual Instance ID,  Sector ID, C-RNTI (Cell - Radio network Temporary Identifier).
      • LCI (logical channel identifier).
      • Reference to the control plane:  Some opaque information to pass along with indications to the control plane.
      • SRBOrDRB boolean flag
      • If SRB:
        • Unacknowledged or Acknowledged Mode
        • If it is Unacknowledged, then the direction of the packets (transmit only, receive only, or both). In Acknowledged mode, the direction is assumed to be 'both' always.
        • Integrity Information (Y/N)
          • Algorithm 
          • Key
        • Cipher information (Y/N):
          • Algorithm
          • Key.
        • There is no RoHC for SRB.
        • Sequence number size is not configured for SRB by RRC. Use configuration 'Default SN Size for SRBs' to find the sequence number size.
      • If DRB:
        • Active/Inactive: Normally it is active, but in handover cases the target eNB might create the PDCP context as part of handover preparation and program the PDCP starting TX sequence number, expected receive sequence number, and bitmap when the X2 protocol sends the SN_STATUS_TRANSFER message (refer to 36.300 Figure 10.1.2.1.1-1: Intra MME/Serving Gateway HO). Since these two events happen at different times, PDCP must not start processing packets until the PDCP sequence numbers are known to it. To facilitate this, I believe the control plane first creates the PDCP context in 'Inactive' mode and activates it later using some other API function. If PDCP receives SDUs from the upper layer while it is inactive, it is expected to hold them until it is activated.
        • Unacknowledged or Acknowledged Mode
        • If it is Unacknowledged, then the direction of the packets (transmit only, receive only, or both). In Acknowledged mode, the direction is assumed to be 'both' always.
        • Integrity Information is not valid for DRB packets.
        • Cipher information (Y/N):
          • Algorithm
          • Key.
        • RoHC (Y/N)
          • Profile IDs:  Bit mask of compression profiles (RTP/UDP/IP,  UDP/IP,  ESP/IP, IP, TCP/IP , v2 RTP/UDP/IP, v2 UDP/IP, v2 ESP/IP and v2 IP)
          • maxCID:  Maximum flows.  
          • Large CID is derived from the max CID.  If maxCID > 15, large_cid is true else large_cid is false.
        • Sequence number size : 5 bits, 7 bits and 12 bits.
        • Handover case: if this PDCP context is being established in the target eNB, RRC also sends the PDCP sequence numbers and the bitmap of packets that were not received by the source eNB.
    • Interface to terminate PDCP contexts: I am not sure whether there is any need to provide deletion of each individual bearer; I have a feeling the control plane does not delete them one by one. So, a terminate function covering all bearers belonging to a UE is required. Parameters:
      • Virtual Instance ID,  Sector ID, C-RNTI
    • Interface to prepare a PDCP context for re-establishment: As part of this, PDCP is expected to wait for the packets sent by RLC which came out of order from the UE. These packets are processed and given to GTP-U.
    • Interface to set PDCP re-establishment on a per-SRB and per-DRB basis in the PDCP layer: This interface function is expected to be called by the control plane when it requires re-establishment of a PDCP context. The function is expected to send the PDCP status report to the UE and also to set the cipher and integrity information in the context. If there are any pending DL packets, they get retransmitted with the new cipher context. Parameters:
      • Virtual Instance ID, Sector ID, C-RNTI, LCI to identify the bearer.
      •  Cipher Information (Y/N): As part of reestablishment new keys may be established.
        • Algorithm
        • Key
      • In case of SRB, integrity information (Y/N):
        • Algorithm
        • Key 
      • Please refer to the Section 5.2 of 36.323 to understand how to setup transmit and receive sequence number for different modes.
      • ROHC context is reset if it is applicable.
    • Interface to indicate handover of DRBs: This function is expected to be called by the control plane in the source eNB as part of the handover execution phase. When the PDCP layer gets this indication from the control plane, it should start forwarding the unacknowledged downlink packets and the uplink packets received out of order. It is expected that this function is called by the control plane after it instructs (re-establishes) RLC.
    •  Interface to send control messages via SRBs:  SRBs are used by control plane. This function can be called by control plane to send the packets on SRB.
    PDCP to RRC interface: 
    • Interface to inform RRC of a received status report from the peer: Using this interface point, PDCP informs the control plane of the contents of the status report. It sends the 'reference information' that was set in the PDCP context by the control plane during context creation, which helps the control plane correlate its own context easily. Information from this indication is used by RRC during the handover execution phase.
    • Interface to inform RRC of SRB data indications: This interface function gives the messages received on SRBs from the peer PDCP to the control plane.
    PDCP to RLC interface:
    • Interface to send PDCP PDUs, including PDCP control and data PDUs: This interface point is needed to send PDCP PDUs to RLC. RLC uses the same identification parameters to match its context as PDCP does. Parameters include the bearer identification (virtual instance ID, sector ID, C-RNTI, LCI), the PDCP PDU packet buffer, and a message ID. The message ID is expected to be returned when RLC calls the ack function to report success or failure of delivery; it helps the PDCP implementation find the matching SDU, stop the discard timer, and remove the entry.
    RLC to PDCP interface :
    • Interface to indicate new PDCP PDUs (new packets), multiple at a time: This function can be used by RLC to give PDCP PDUs to the PDCP layer. Multiple packets can be given at one time: RLC might be buffering packets that arrive out of order, and when the missing packet comes along, all those packets can be given at once.
    • Interface to indicate the pending PDCP PDUs at re-establishment time (multiple packets can be sent using one call): This function can be used by RLC to hand over the pending PDCP PDUs in the RLC. Along with the last packet, it can indicate that it is the last packet.
    • Interface to indicate acknowledgment of PDUs sent using the PDCP-to-RLC interface functions: This function is expected to be used by RLC to give a success/failure ack for the PDCP PDUs that were given to RLC earlier.
    PDCP to GTP interface:
    • Interface to send PDCP SDUs to the GTP : This function is used by PDCP to give PDCP SDU (IP packet) to the GTP layer.
    • Interface to send downlink forwarding packets (upon handover) to the GTP layer & interface to send uplink forwarding packets (upon handover) to the GTP layer: These functions are called when the PDCP layer is informed of handover for a given context. Both UL and DL packets are sent to GTP along with their sequence numbers. The last packet is expected to be indicated explicitly; since GTP waits for the last-packet indication, GTP must be given a 'no more packets' indication even if there are no packets to forward to the target eNB.
    GTP to PDCP interface:
    • Interface for the GTP layer to send new packets to the PDCP layer in the downlink direction: This function is called by PDCP's upper layers to send packets to PDCP.
    • Interface for the GTP layer to send DL-forwarded packets (during handover) & interface for the GTP layer to send UL-forwarded packets (during handover): These functions are normally called in the target eNB during the handover execution phase. These packets are sent to the PDCP layer with their sequence numbers.
    I have written this description based on my understanding of the PDCP specifications. If somebody finds any inconsistency or if I have made any mistakes, please drop a comment or send an email.

Monday, May 31, 2010

Key Management System - Required features for network elements' security

Every network element in an Enterprise network requires access to security keys at some time or other. Mobiles and laptops typically have security keys for securing emails, for storing confidential data, and for backing up data to remote locations. Servers in data centers have private keys and certificates for providing authentication credentials to external users. Network infrastructure devices also require symmetric keys for storing data securely in the file system, and private/public key pairs and certificates to provide authentication credentials to remote VPN users, etc. A Key Management Server provides a central mechanism to store keys and retrieve them on demand.


The PCI (Payment Card Industry) standard specifically requires that keys not be stored in a network element's persistent memory. It is expected that the keys are downloaded every time the element is powered up, and that the downloaded keys are kept in temporary memory which gets erased when power goes off. In the case of battery-powered devices, keys are expected to be stored in Hardware Security Modules (HSMs). This mechanism protects the data when the media or device is stolen.

      Centralization of all key management operations gives Enterprises control over the life cycle of keys.  This kind of control is required now more than before due to cloud computing and smart phones.  As we all know, cloud computing allows Enterprises to rent VMs on demand, and virtual appliances are typically installed on these VMs.  Since the rented VMs are hosted in the cloud provider's data centers, Enterprises would like to ensure that not only the data but also the keys are secure.  Having the Key Management Server within the Enterprise trust boundary lets the administrator perform key management operations such as revocation (expiry) and key rotation.  When a VM is no longer required, all keys associated with the VM can be either revoked or archived.  This kind of control also helps to revoke the keys in case of a VM break-in.


      Devices and applications that require keys and key life cycle management:


      End network elements such as laptops, desktops and smartphones :  

      • Secure Email:  PGP is normally used to secure emails on the wire.  PGP requires a private/public key pair or certificates to secure the emails.
      • Crypto file systems :  These storage systems (within laptop or on dedicated remote file systems) are used to secure the confidential information.
      • Data backup:  Backup data should be secured and decryptable only by the user.
      • IPsec VPN Client
      Servers and Network infrastructure devices:
      • HTTP servers/SIP servers/other SSL-based servers
      • SSL-based proxy applications such as network anti-virus and SSL VPN, and management interfaces such as HTTPS, SSH etc.
      • Crypto file systems to store the confidential data.
      • IPsec VPN Servers
      Centralized key management consists of two components - a Key Management Server and its clients.  Typically there is one key management server and as many clients as there are network elements requiring centrally managed keys.

      Generic requirements of Key Management Server:
      • It is expected to provide storage of different types of keys.  Some applications generate keys on their own and store them in the Key Management Server; some applications expect the Key Management Server to create and store the keys.
        • Private key,  public key or certificate and certificate chain.  
        • Symmetric key.
      • Get operation: The Key Management Server is expected to serve keys to requesting clients.
      • Get Attributes: This operation is expected to be provided by the Key Management Server to return the different attributes associated with a key.  The keys themselves are not served as part of this operation.
      • Backup of keys in case of any trouble in the Key Management Server:  Two operations are expected - export and import.  Export backs up the keys from the Key Management Server; import restores the keys from a backup.
      • Key expiry: Typically keys are rotated every one or two years.  Sometimes keys must be expired earlier in case of key compromise.
      • Key rotation:  As part of expiring a key, it is required to create a new key to replace the old one.  To ensure application continuity, both keys may need to be active for some time (the overlap time).  During the overlap time, the old key is typically used only for the 'process' operation (decryption), while the new key is used for both protect and process operations.
      • Archival of keys:  Once a key is expired, it can be archived.  Archiving is different from backup: backup is done on active keys, archiving on inactive keys.  A resume operation is expected in case archived keys need to be re-activated.
      • Deletion of keys:  Once keys are determined to be no longer required, they can be deleted.  Typically deletion happens a few years after a key is archived, to ensure that there is no useful data still encrypted under the keys targeted for deletion.
      • Key Management client and server communication must be secure (SSL is one option).  The KMS is expected to scale to large enterprises.
      • Network element (device) registration:  A device might have multiple applications requiring key management facilities.  If the device is stolen, all keys of all applications on the device need to be revoked.  Revocation might involve actively communicating with the device and also with other devices using the same keys or parts of them.  To facilitate this, the Key Management Server is expected to provide a revoke operation covering all keys corresponding to a device.  The KMS is expected to provide facilities to register devices, revoke all keys on a device and deregister devices.  There could be many devices in the Enterprise, and it may be expensive for IT to register them all, so there should be a facility for the Key Management client on a device to auto-register the device if it does not already exist in the KMS.  The KMS is also expected to let users themselves revoke the keys on their devices: if a laptop or smart phone is stolen, the user can go to the KMS and revoke the keys rather than depending on IT, which may take longer, during which time the thief could get hold of the data.
      • Binding among the keys:  
        • In public key cryptography, a pair of keys is required together - the private and public key pair.  At times, the certificate chain is also required along with this pair.  Any key management operation is typically done on these keys together; that is, when a key is revoked, its counterpart is also expected to be revoked.  This binding is 'peer binding'.
        • There is another kind of binding - key derivation binding.  Here the binding is a parent-child binding.  Some operations on the parent are applied to the derived keys, while actions on derived keys may not have any effect on the parent key.
        • Sometimes keys are used for wrapping and unwrapping other keys.  This is normally used in export and import operations: when keys are exported, they are encrypted using a wrap key, and as part of the import operation they are unwrapped using the same wrap key.  For lack of a better term, let me call this 'wrap binding'.
      • Key deployment in clients:
        • Key Management Systems are expected to provide the keys requested by clients as long as the ACL rules allow it.
        • At times, multiple objects may be retrieved by clients together.  The Key Management System should provide a mechanism to bind the objects and retrieve all objects corresponding to that binding.  Retrieval of a private/public key pair with its certificates is one example.
        • When new keys are created or registered by one client, these keys may need to be deployed on other systems (clients).  There are two common cases:
          • VPN server creating or registering keys:  The public key and certificate objects are expected to be installed on all known VPN clients.  How does the KMS know which devices to install the keys on?  The KMS can use 'application registration' in this case, in addition to device registration.  The VPN server, as part of installing the keys in the KMS, can indicate the 'application identification' of the clients and the keys that need to be installed.  The KMS can use this information to find the devices from the device/application registrations and deploy the keys to those devices and applications.
          • Cluster of servers:  In Enterprises, multiple servers are used to serve content, sharing the load among them.  It is necessary that the same keys are used across all servers.  Even if only one server registers the keys, they need to be deployed on all other servers in the cluster.  The KMS is expected to allow KM clients to indicate this so that it can deploy the keys to all other servers; again, the device and application registration facility enables this.
      • Access Control Lists:  Access control lists play a very important role in giving IT departments control over key management operations.  They let trusted users perform key management operations - Create/Register, Backup, Revoke, Key Rotation, Archive, Delete, Get, Get Attributes - by associating user groups with permissions (key management operations).  Each ACL rule is really a combination of a 'user group' and 'permissions' on a given key object.  It is normally accepted in the industry to create the ACL rules as part of creation/registration of keys.  If the creator's group is not present in the KMS at key creation time, the creator group is also created with the creator as a member.  There are two types of ACLs - key-object-based and non-key-based.  A key-object-based ACL is part of each key object and is normally created by the creator.  Non-key-based ACLs are not specific to any key and are normally created by the administrator.
        • Non-key-based ACLs:  These ACL rules control access when the keys are not yet present, and are normally created by administrators.  Each ACL rule contains a 'user group' and different types of permissions, such as: create/register keys, create a user group as part of key object creation, allow auto device registration by the creator of a key object, allow application registration by the creator, allow creation of ACL rules on an object by the creator, allow archive, allow resume, allow import, allow export, allow derive by the creator, etc.  If the administrator does not grant 'allow user group', 'allow auto device registration' or 'allow ACL rules on objects', then the administrator is expected to perform these operations.  In addition to creating non-key-based ACL rules, the KMS should provide facilities for 'admin' users to be added automatically to the user groups created by creators, to ensure that the Enterprise always has control of the keys.  This is useful for processing the data when IT gets hold of laptops and smart phones after employees leave the company.  At times, administrators might delegate some operations to other IT staff, so non-key-based ACLs should allow multiple IT personnel to perform key operations on existing key objects.  These rules may be delegated based on 'application'; each rule should therefore contain a 'user group', an 'application identifier' and 'permissions'.  The permissions are the same as in key-based ACL rules, and the 'application identifier' can be 'Any'.
        • Key-based ACLs:  These rules are normally created by the creators.  Each rule contains a 'key object', a 'user group' and permissions.  These ACL rules are sent to the KMS as part of the create/register operation.  As discussed, in addition to the ACL rules, the KM client can also send deployment information such as 'send the keys'.  The KMS is expected to send the keys to clients based on the application identification.  The ACL rules indicate what information can be retrieved by which clients.  In the VPN case, ACL rules are created by the creator in such a way that only the public key and certificates can be retrieved; in the cluster-of-servers case, the ACL rule is created such that the private key, public key and certificates can be retrieved by all servers in the cluster.  Permissions in an ACL rule can be Get, Get Attributes, Revoke, Derive, Expire, Archive, Backup, Delete, etc.  A sketch of one possible data model covering keys, bindings and ACL rules is given below.
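
      To tie the above requirements together, here is one possible C-style data model for a key object with its bindings and key-based ACL rules.  All type and field names are my own assumptions for illustration; they are not taken from KMIP or any product schema.

      /* Illustrative data model only -- names are assumptions */
      enum key_state  { KEY_ACTIVE, KEY_EXPIRED, KEY_ARCHIVED, KEY_DELETED };
      enum key_type   { KEY_SYMMETRIC, KEY_PRIVATE, KEY_PUBLIC, KEY_CERTIFICATE };

      enum permission {                       /* bit flags for ACL rules      */
          PERM_GET     = 1 << 0, PERM_GET_ATTRS = 1 << 1,
          PERM_REVOKE  = 1 << 2, PERM_DERIVE    = 1 << 3,
          PERM_EXPIRE  = 1 << 4, PERM_ARCHIVE   = 1 << 5,
          PERM_BACKUP  = 1 << 6, PERM_DELETE    = 1 << 7
      };

      struct acl_rule {
          char         user_group[64];        /* who                          */
          unsigned int permissions;           /* what (PERM_* flags)          */
      };

      struct key_object {
          unsigned int    id;
          enum key_type   type;
          enum key_state  state;
          unsigned int    peer_id;            /* 'peer binding': the other
                                                 half of a key pair, or 0     */
          unsigned int    parent_id;          /* 'derivation binding': key
                                                 this one was derived from    */
          unsigned int    wrap_key_id;        /* 'wrap binding': key used to
                                                 wrap this one on export      */
          struct acl_rule acl[16];            /* key-object-based ACL rules   */
          int             acl_count;
      };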

      Thursday, May 20, 2010

      Performance considerations in Proxy based network applications

      Many details on performance considerations for proxy based networking applications are given here.


      There are some more performance considerations in developing proxy based applications. Here they are:


      • Use hugetlbfs for running code and for application context memory (for connections):  Please see this to get an understanding of this technique.
      • Use user-space RCU wherever possible (a minimal sketch follows this list):  See the RCU related information here.
      • Use futexes as part of the RCU implementation for add/delete operations.  See more about futexes here.
      • Use POSIX spinlock-style mutexes only for small portions of the code.
      • Use UIO-based interrupt indication to user space processes when dealing with memory-mapped hardware accelerators.
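
      As an example of the user-space RCU item above, here is a minimal read-side/update sketch using the liburcu library.  The config structure and its use are illustrative assumptions; only the urcu calls themselves are real API.

      #include <stdlib.h>
      #include <urcu.h>            /* liburcu: userspace RCU (link -lurcu) */

      struct config {
          int max_conns;
      };

      static struct config *cur_cfg;   /* RCU-protected pointer */

      /* Reader (hot path): lock-free; each reader thread must have
       * called rcu_register_thread() once at startup */
      int get_max_conns(void)
      {
          struct config *c;
          int v;

          rcu_read_lock();
          c = rcu_dereference(cur_cfg);
          v = c ? c->max_conns : 0;
          rcu_read_unlock();
          return v;
      }

      /* Writer (slow path, assuming a single writer): publish a new
       * version, reclaim the old one only after all pre-existing
       * readers are done */
      void set_max_conns(int max_conns)
      {
          struct config *newc = malloc(sizeof(*newc));
          struct config *oldc;

          newc->max_conns = max_conns;
          oldc = cur_cfg;
          rcu_assign_pointer(cur_cfg, newc);
          synchronize_rcu();       /* wait for readers of oldc to finish */
          free(oldc);
      }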

      Wednesday, May 19, 2010

      Responses to the post related to WAN optimization and Infineta Systems

      I have got many responses to this post.  Some were asking for more clarification on what I meant by deduplication efficiency across restarts.  Some questioned how I could make the statement that there is no persistent data across restarts in the Infineta solution.  By the way, I did not make such a statement.  To clarify further, I don't know whether the solution has the capability of keeping data intact across power cycles.  I just want to say for the record that it is only my reading, based on the statement that the performance of the solution will be in multiples of gigabits/sec, up to 10 Gbps.  I guess we will come to know eventually when the product is out.

      One response suggested that the solution might very well have multiple hard drives, each hanging off a different hardware bus, to achieve multi-gigabit performance.  It seems like a possibility, though I am not sure whether it is practical.

      One possibility is that multiple blades are combined into one solution, with each blade working almost independently as far as dedup is concerned.  This type of solution suits deployments having multiple servers that need to be synchronized (for backup, replication and others), where the content of a server or set of servers is optimized by different blades.  If there are fewer servers than blades, then this type of solution's capacity can't be fully utilized in those specific deployments.

      One response asked whether I know of any performance benchmarking criteria to evaluate WAN optimization solutions.  I am not aware of any standards or de facto standards.  If I come across any, I will certainly post them on this blog.

      Monday, May 17, 2010

      One more WAN Optimization company - Infineta systems

      It is good to know that WAN optimization technology is attracting VC money.  I believe this market will continue to grow for a few more years before it saturates.  Though some research reports place this market at $5 billion by the end of 2014, I feel it is much more than that.  Any organization having multiple branch offices with consolidated central servers benefits from WAN optimization.

      Okay.  Coming to Infineta.  What is different about this technology compared to existing WAN optimization?  I tried going through the Forrester report.  Finally it all comes down to 'performance'.  According to this report, existing WAN optimization products peak at 1 Gbps, while Infineta Systems' performance seems to be in multiples of Gbps, up to 10 Gbps.  According to the report, this kind of performance is required to connect the data centers of a given organization.  The reasons given for needing this kind of performance are:

      • Replication, Mirroring,  Backup of data and VM images among Data centers for reasons such as Business continuity/Disaster-recovery. 
      • Amount of data exceeding petabytes.
      • Reducing the latency of the above operations.
      According to the job descriptions, it is clear that they are using multi-core processors and FPGAs.  I did not find any mention of hard disk capacity.  I have a feeling that persistent storage is not used; it would be difficult to achieve multiple Gbps of throughput with disk access.  If that is the case, it is interesting to ask how the de-dup efficiency compares with other established WAN optimization vendors:
      • Amount of DDR memory:  This is directly proportional to dedup efficiency - the larger the amount of memory, the higher the dedup efficiency.  Without hard drive capacity, storage would be limited, and it may be difficult to achieve de-dup efficiency comparable to others in the market, in my view.
      • Loss of cached blocks across power restarts:  If there is no persistent memory, data stored in DDR memory would be lost across power cycles or when the system is taken out for maintenance.  This requires rebuilding the data cache, which reduces de-dup efficiency right after a power cycle.

      Thursday, April 15, 2010

      SMB Evasions by attackers - Tips to prevent them in IDS/IPS devices

      DCE RPC packets can also come over SMB.  This article talked about some of the DCE RPC evasions used by attackers and ways to detect the attacks even with these evasion techniques.  Since DCE RPC packets can come over SMB, it is important to understand some of the evasion techniques attackers use on the SMB protocol itself.


      DCE RPC messages are predominantly embedded in SMB messages such as the SMB_COM_READ response, SMB_COM_WRITE and the ANDX versions of them.  DCE RPC messages are also sent with SMB_COM_TRANSACT command and response messages.  Note that these evasion techniques are relevant not only for detecting attacks in DCE RPC based applications, but also in CIFS (SMB) itself.


      Protocol details of SMB are described very well here.  


      Many IDS/IPS devices don't have protocol intelligence for SMB and DCE RPC.  IDS/IPS systems that depend on generic pattern matching can be bypassed by attackers with simple evasion (obfuscation) techniques.  Let us examine some of them.


      1. ANDx messages:


      As indicated above, there are commands with an ANDX version.  Any command/response ending with ANDx has the following structure in the packet after the SMB header.




      SMB_Parameters
        {
        UCHAR  WordCount;               /* count of 16-bit Words that follow; 10,
                                           or 12 when OffsetHigh is present     */
        Words
          {
          UCHAR  AndXCommand;           /* next command in the chain; 0xFF
                                           means no further commands            */
          UCHAR  AndXReserved;
          USHORT AndXOffset;            /* offset of the next command, measured
                                           from the start of the SMB header     */
          USHORT FID;
          ULONG  Offset;
          USHORT MaxCountOfBytesToReturn;
          USHORT MinCountOfBytesToReturn;
          ULONG  Timeout;
          USHORT Remaining;
          ULONG  OffsetHigh (optional); /* upper 32 bits of a 64-bit offset     */
          }
        }
      SMB_Data
        {
        USHORT ByteCount;
        }





      Variable size of SMB Parameters:  Many IDS/IPS devices assume that the WordCount is constant for a given ANDx message.  For example, for the SMB_COM_READ_ANDX message, the word count is assumed to be 10 words, but it can be 12 words in the case of a 64-bit offset (OffsetHigh).  IDS/IPS devices assuming 10 words and interpreting the data accordingly would get the detection wrong.  Attackers deliberately set the WordCount to 12 words even though OffsetHigh is 0.  IDS/IPS devices must interpret the WordCount to move to the data section.


      Multiple ANDx messages under one SMB message (with one SMB Header): 


      Many IDS/IPS devices assume that there is only one command (or response) in an SMB message, but the SMB protocol allows multiple ANDx commands (or responses) in one single message.  Every command/response has its own 'SMB Parameters' and 'SMB Data' blocks.  Attackers can put the malicious command/response as a non-first command/response in the SMB message to bypass detection by security devices.  IDS/IPS devices must interpret the AndXCommand field to figure out whether any more commands/responses are present in the message; AndXCommand is normally set to 0xFF if there are no additional commands.


      Filler between ANDx messages:  The AndXOffset field indicates the position of the next command in the SMB message.  Since the offset is explicit, it is protocol-wise legal to send some additional filler data between AndX commands.  Attackers can take advantage of this and insert data to confuse security devices.  Security devices assuming that all commands are adjacent to each other would fail to detect the attacks.  Security devices must be aware of this and interpret AndXOffset the way end systems do.


      Out-of-order ANDx messages:  AndX commands can refer to data placed in the SMB message before the AndX header.  Note that AndXOffset is an offset from the beginning of the SMB header, so it can point to any place in the SMB message.  This is tricky for IDS/IPS devices as they need to store the complete SMB message before analyzing it, which increases memory requirements.  But it is necessary to do this to mitigate such evasion techniques.
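
      A rough C sketch of AndX chain walking that is robust against the four techniques above might look like the following.  The constants and helper structure are illustrative assumptions, and real code needs much stricter bounds checking.

      /* Walk an AndX chain safely.  'msg' points to the start of the SMB
       * header, 'msg_len' is the whole message length, 'first_cmd_off' is
       * the offset of the first SMB_Parameters block.  Illustrative only. */
      #define SMB_HDR_LEN   32
      #define ANDX_NONE     0xFF
      #define MAX_ANDX_HOPS 16

      int walk_andx_chain(const unsigned char *msg, unsigned int msg_len,
                          unsigned int first_cmd_off)
      {
          unsigned int off = first_cmd_off;
          int hops = 0;

          for (;;) {
              unsigned char wordcount, next_cmd;
              unsigned int  next_off;

              if (off + 5 > msg_len)
                  return -1;                    /* truncated message        */

              wordcount = msg[off];             /* never assume a constant! */
              if (off + 1 + (unsigned int)wordcount * 2 > msg_len)
                  return -1;
              if (wordcount < 2)
                  return 0;                     /* no AndX header present   */

              next_cmd = msg[off + 1];          /* AndXCommand              */
              next_off = msg[off + 3] |
                         ((unsigned int)msg[off + 4] << 8);  /* AndXOffset  */

              /* ... inspect this command's parameters and data here ...   */

              if (next_cmd == ANDX_NONE)
                  return 0;                     /* end of chain             */

              /* AndXOffset is measured from the SMB header; it may skip
               * filler bytes and may even point backwards (out-of-order)  */
              if (next_off < SMB_HDR_LEN || next_off >= msg_len)
                  return -1;                    /* malformed / evasive      */
              if (++hops > MAX_ANDX_HOPS)
                  return -1;                    /* guard against offset loops */
              off = next_off;
          }
      }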


      2.  Transaction Messages


      Transaction command messages have the following structure.  Responses have a similar structure, but some fields don't exist in them, so be careful when analyzing commands versus responses.




      SMB_Parameters
        {
        UCHAR  WordCount;
        Words
          {
          USHORT TotalParameterCount;   /* total parameter bytes across all
                                           fragments of this transaction   */
          USHORT TotalDataCount;        /* total data bytes across all
                                           fragments of this transaction   */
          USHORT MaxParameterCount;
          USHORT MaxDataCount;
          UCHAR  MaxSetupCount;
          UCHAR  Reserved1;
          USHORT Flags;
          ULONG  Timeout;
          USHORT Reserved2;
          USHORT ParameterCount;        /* parameter bytes in this message  */
          USHORT ParameterOffset;       /* offset of Trans_Parameters from
                                           the start of the SMB header      */
          USHORT DataCount;             /* data bytes in this message       */
          USHORT DataOffset;            /* offset of Trans_Data from the
                                           start of the SMB header          */
          UCHAR  SetupCount;
          UCHAR  Reserved3;
          USHORT Setup[SetupCount];
          }
        }
      SMB_Data
        {
        USHORT ByteCount;
        Bytes
          {
          SMB_STRING Name;
          UCHAR      Pad1[];
          UCHAR      Trans_Parameters[ParameterCount];
          UCHAR      Pad2[];
          UCHAR      Trans_Data[DataCount];
          }
        }



      Fragmentation:  If the application payload is bigger than the 'MaxBufferSize' negotiated during the setup phase, it is divided across multiple SMB messages, with the first message carrying the SMB_COM_TRANSACTION command/response and further messages sent as SMB_COM_TRANSACTION_SECONDARY.  Attackers take advantage of this to evade detection by security devices that do not reassemble data sent across multiple 'TRANSACTION' messages, even when the real application data is less than 'MaxBufferSize'.  Security devices must ensure that all messages have arrived before reassembling, by checking 'TotalParameterCount' and 'TotalDataCount': if the parameter counts and data counts of all transaction messages (with the same PID, MID, TID, UID in the SMB header), each with a distinct ParameterOffset and DataOffset, add up to TotalParameterCount and TotalDataCount, then the security device can assume all fragments have been received.  Note that some attackers try to fool security devices by sending duplicate SMB messages.  Security devices that blindly add the 'ParameterCount' and 'DataCount' of all matching SMB messages up to 'TotalParameterCount' and 'TotalDataCount', without checking for unique 'ParameterOffset' and 'DataOffset' values, can be evaded by sending duplicate SMB messages.


      Out-of-order transaction fragments:  As seen before, 'ParameterOffset' and 'DataOffset' indicate the position of the parameter and data sections of a message within the overall application payload.  End SMB systems honor these values while reassembling, so the order in which the fragments arrive is not important.  Security devices that assume the packets arrive in order can be evaded by an attacker sending these messages in a different order.  A duplicate-aware, order-independent completeness check is sketched below.
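
      One way to decide whether all TRANSACTION fragments have arrived, regardless of order and duplicates, is sketched below in C.  The fragment-table layout and names are my own assumptions for illustration; only the data side is shown, and parameter bytes would be tracked the same way.

      /* Track unique data fragments of one transaction (matched on
       * PID/MID/TID/UID) and decide when reassembly is complete. */
      #define MAX_FRAGS 64

      struct frag { unsigned int off, cnt; };

      struct txn_state {
          unsigned int total_data_count;   /* TotalDataCount, 1st message */
          struct frag  data[MAX_FRAGS];
          int          nfrags;
      };

      /* Returns 1 when the unique fragments together cover TotalDataCount */
      int txn_add_fragment(struct txn_state *t, unsigned int data_off,
                           unsigned int data_cnt)
      {
          unsigned int sum = 0;
          int i, dup = 0;

          /* A repeated (offset, count) pair must not be counted twice,
           * or duplicate SMB messages would fake completion */
          for (i = 0; i < t->nfrags; i++)
              if (t->data[i].off == data_off && t->data[i].cnt == data_cnt)
                  dup = 1;

          if (!dup && t->nfrags < MAX_FRAGS) {
              t->data[t->nfrags].off = data_off;
              t->data[t->nfrags].cnt = data_cnt;
              t->nfrags++;
          }

          /* Complete only when the unique fragments add up to the total;
           * production code should also reject overlapping ranges */
          for (i = 0; i < t->nfrags; i++)
              sum += t->data[i].cnt;
          return sum >= t->total_data_count;
      }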