Monday, September 29, 2008

What does Virtualization of Hardware IO devices and HW Accelerators mean?

Virtualization is becoming popular in the server market.  Multiple under-utilized physical servers are consolidated into one machine without losing binary compatibility of operating systems and applications.

Hypervisors from VMware, and Xen and KVM on Linux, are enabling this trend.  A hypervisor gives the impression that there are multiple CPUs and multiple IO devices, which allows different operating systems and their applications to run unchanged on a single physical server.  Virtualization uses the terms host and guest.  The host is the main operating system where the hypervisor runs; guests are the virtual machines.  Multiple guests can be installed on a host.  Just as each physical machine can run a different operating system and set of applications, each guest can be installed with a different operating system and its associated applications, and no change is required in those operating systems or applications.  That is the real beauty: you buy an operating system CD/DVD from Microsoft or Ubuntu and follow the same installation steps on a hypervisor to create a guest machine as you would on a physical machine.

The hypervisor virtualizes almost everything on the hardware - CPU, IO devices such as Ethernet controllers, keyboard and mouse, and accelerators such as cryptography accelerators, pattern matching accelerators, compression accelerators and so on.  The hypervisor creates multiple instances of each physical device and assigns one or more instances to guests.  It exposes these software instances as pseudo physical devices to guests, so the existing drivers in the guest operating systems work with no additional software changes.

The current generation of hypervisors deals with the hardware through software drivers; guests don't interact with hardware directly.  The host hypervisor software virtualizes each device internally by creating multiple instances, and guest drivers deal with these instances.  Guests think they are talking to hardware directly, but they actually reach the hardware via the host operating system by connecting to the virtual instances created by the host driver.

Direct IO/Accelerator connectivity
The traditional way of virtualizing hardware devices requires guests to go through the hypervisor.  This adds an extra copy of data and extra context switching overhead.  To reduce the performance impact of this indirection through the hypervisor, direct IO connectivity is being pursued by both CPU and hardware device vendors.  Intel and AMD are enhancing their CPUs to allow direct connectivity to hardware devices from guest operating systems.

Intel and AMD x86 processors provide a feature called the IOMMU.  Hardware IO devices traditionally work only with physical memory, while guests, and even user space processes in host operating systems such as Linux, deal with virtual address space.  CPUs translate virtual addresses to physical addresses dynamically using MMU translation tables; the IOMMU is expected to do a similar translation for IO devices.  With an IOMMU, IO devices can be given buffers and commands in the virtual address space of guests or user space processes.  Before reading or writing data at a virtual address, the device works with the IOMMU to translate it into a physical address and then performs the read/write operation on that physical address.
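
As a rough mental model only (real IOMMUs such as Intel VT-d and AMD-Vi use their own in-memory table formats programmed by the hypervisor/OS), the translation described above can be pictured as a table lookup keyed by the requesting device's domain and the virtual address.  The structure and function names below are purely illustrative assumptions.

------------------------------------------------------------------------
/* Conceptual model of an IOMMU lookup - not any vendor's programming
 * interface.  Names, sizes and the flat table layout are assumptions. */
#include <stdint.h>
#include <stddef.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

struct iommu_domain {                 /* one per guest (or per device owner) */
    uint64_t *page_table;             /* virtual page number -> physical page address */
    size_t    num_pages;
};

/* Translate a device-visible virtual address into a physical address.
 * Returns 0 on success, -1 if no mapping exists (the real device would
 * raise an IOMMU fault instead of touching memory). */
static int iommu_translate(const struct iommu_domain *dom,
                           uint64_t vaddr, uint64_t *paddr)
{
    uint64_t vpn = vaddr >> PAGE_SHIFT;
    if (vpn >= dom->num_pages || dom->page_table[vpn] == 0)
        return -1;                    /* unmapped page */
    *paddr = (dom->page_table[vpn] & ~(uint64_t)(PAGE_SIZE - 1))
             | (vaddr & (PAGE_SIZE - 1));
    return 0;
}
------------------------------------------------------------------------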

To avoid hypervisor intervention, another feature is also required: direct interrupt delivery to the guests from IO devices.  Interrupts are typically used by IO and accelerator devices to signal new input or completion of a command issued earlier by the CPU.  CPU vendors are also providing this feature by virtualizing the PIC (Programmable Interrupt Controller).

These two CPU features allow direct connectivity of IO devices.  Hypervisors were also doing one more job: creating multiple instances of IO and accelerator devices.  To avoid hypervisor intervention, the instantiation of devices needs to be taken care of by the devices themselves.  Unfortunately, this can't be done at a central place such as the CPU; it needs to be handled by each individual IO/accelerator device.

Instantiation of IO & accelerator devices within the hardware need to ensure that it satisfies the virtualization requirements as hypervisors are doing.  Some of them given below.

Isolation:  Each guest is independent of the others.  Isolation should exist just as it does between physical servers.
  • Failure isolation:  In the physical world, failure of a physical server or of an IO device within it does not affect the operation of another physical server.  Similarly, a guest failure should not affect the operation of other guests.  Since IO/accelerator devices are a shared resource among guests, the device must provide isolation such that if one instance of the device fails, it does not affect the operation of other instances.  Any fault introduced by a guest, deliberately or unintentionally, should only affect its own instance, not others.  A fault should be correctable by resetting that instance alone, without resetting all instances or the entire device.
  • Performance isolation:  When applications run on different physical servers, all devices in a physical server are exclusively available to that server's operating system and applications.  In a shared environment, where multiple guests or user space processes work with the same IO/accelerator device, the device must ensure that one guest does not hog the entire accelerator or IO bandwidth.  IO/accelerator devices are therefore expected to have some sort of scheduling to share the device bandwidth across instances.  One method is to schedule commands using round-robin or weighted round-robin schedulers, but this may not be sufficient for accelerator devices.  For example, a 2048-bit RSA sign operation takes several times (at least four times) the crypto accelerator bandwidth of a 1024-bit RSA sign operation.  Consider a scenario where one guest sends 2048-bit RSA sign operations to its instance of the accelerator while another guest uses its instance for 1024-bit RSA sign operations.  If plain round-robin is used to schedule requests across instances, the guest sending 2048-bit operations consumes more crypto bandwidth than the other guest, which may be considered unfair.  A guest could even deliberately send compute-heavy operations to deny crypto bandwidth to other guests, creating a denial of service condition.  Device schedulers are therefore expected to take into account the processing cost consumed by each instance (a small sketch of such a cost-aware scheduler follows this list).
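
A minimal sketch of a cost-aware, deficit-style scheduler across device instances.  The "cost" would come from the device's own estimate of how many engine cycles an operation needs (e.g. a 2048-bit RSA sign costing several times a 1024-bit one).  The queue hooks, names and numbers are all illustrative assumptions, not any vendor's interface.

------------------------------------------------------------------------
#include <stdint.h>
#include <stdbool.h>

#define NUM_INSTANCES 8

struct request {
    uint32_t cost;                  /* estimated engine cycles for this command */
    /* ... command payload ... */
};

struct instance {
    uint32_t weight;                /* share assigned by hypervisor/driver */
    int64_t  credit;                /* accumulated right-to-run, in cost units */
    bool (*peek)(struct instance *, struct request **);  /* next queued request, if any */
    void (*dispatch)(struct instance *, struct request *);
};

void schedule_round(struct instance inst[NUM_INSTANCES], uint32_t quantum)
{
    for (int i = 0; i < NUM_INSTANCES; i++) {
        struct request *req;
        inst[i].credit += (int64_t)quantum * inst[i].weight;
        /* Dispatch only while the instance has earned enough credit to pay
         * for its next request; expensive operations drain credit faster,
         * so a guest issuing heavy requests cannot exceed its share.
         * (A real design would also cap credit accumulated while idle.) */
        while (inst[i].peek(&inst[i], &req) && inst[i].credit >= req->cost) {
            inst[i].credit -= req->cost;
            inst[i].dispatch(&inst[i], req);
        }
    }
}
------------------------------------------------------------------------
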
Access Control:

As discussed, devices are expected to provide multiple instances.  The hypervisor is expected to assign instances to guests.  Guests (typically in kernel space) can in turn assign their device instances to their user space processes, and user space processes in turn assign device resources to different tenants.  In the public data center server market, tenant services are created using guests, with each tenant owning one or more guests that are not shared with other tenants.  In network services (load balancers, security devices), a host or guest can process traffic from multiple tenants.  In some systems each tenant is isolated to a user space process within a host or guest, and in some systems multiple tenants might even share a user space process.  There are also hybrid systems: based on the service level agreement with its customers, the data center operator either creates a VM (guest) for a tenant's traffic, creates a user space process within a guest for that tenant, or shares a user space process among several similar tenants.  Of course, the isolation granularity differs across these three installation types: a VM per tenant provides the best isolation, a dedicated user space process provides better isolation than a shared user space process, and with a shared user space process all tenants' traffic is affected if that process dies.  In every case, one thing must be ensured: the hardware IO/accelerator device resources are not hogged by one tenant's traffic.

IO/accelerator hardware devices need to provide multiple instances to support multiple tenants.  Typically, each tenant is associated with one virtual instance of a peripheral device.  Associating a partition, user space process and tenant combination with a virtual instance is the job of a trusted party.  The hypervisor is typically under the control of the network operator and is assigned the task of assigning virtual instances to guests.  The kernel space component of a host or guest internally assigns its own device instances to its user space daemons, and to tenants within those daemons.

Ownership assignment as described above is one part of access control.  The other part is to ensure that guests and user space processes can see and access only their assigned resources, not anybody else's.  If guests or user space processes are allowed even to look at instances not owned by them, it is a security hole.

Another important aspect is that multiple guests or user space processes can be instantiated from the same image, for example for load sharing (active-active).  Even though the same binary image is used to bring up multiple guest/user space instances, each instance must not end up using the same peripheral virtual instance.  Since the same binary image is used, this becomes relatively easy if the hypervisor exposes each guest's device virtual instances at the same virtual address.

Most peripheral devices are memory mappable by the CPU.  Once the configuration space is mapped, accessing the peripheral's configuration space works the same way as accessing DDR.  Peripherals supporting virtualization typically divide the configuration space into a global configuration space and per-instance configuration spaces.  If a peripheral supports X virtual instances and each virtual instance configuration space is M bytes, there would be (X * M) + G bytes of configuration space, where G is the size of the global configuration space of that peripheral device.  The global configuration space is normally meant to initialize the entire device.  Instance specific configuration space is meant for run time operations such as sending commands and getting responses in the case of accelerator devices, and sending and receiving packets in the case of Ethernet controllers.  The global configuration space is normally controlled only by the hypervisor, whereas each virtual instance's configuration space is accessible to the guest it is assigned to.
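
To make the layout above concrete, here is a minimal sketch of how software might compute where instance i's register window sits, assuming a hypothetical layout of a global region followed by X equal-sized instance regions.  Actual devices define their own layouts in their data sheets; the sizes below are illustrative.

------------------------------------------------------------------------
#include <stdint.h>

/* Hypothetical layout: [ global (G bytes) | instance 0 (M) | instance 1 (M) | ... ] */
#define GLOBAL_SPACE_SIZE    0x1000u   /* G: illustrative */
#define INSTANCE_SPACE_SIZE  0x0400u   /* M: illustrative */
#define NUM_INSTANCES        16u       /* X */

/* Byte offset of instance i's configuration space from the device base. */
static inline uint32_t instance_offset(uint32_t i)
{
    return GLOBAL_SPACE_SIZE + i * INSTANCE_SPACE_SIZE;
}

/* Total configuration space the device decodes: (X * M) + G. */
enum { TOTAL_CONFIG_SPACE = NUM_INSTANCES * INSTANCE_SPACE_SIZE + GLOBAL_SPACE_SIZE };
------------------------------------------------------------------------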

To satisfy the access control requirements, the virtual instances assigned to each guest should appear at the same address in every guest.  Typically, TLBs are used for this purpose, since TLBs both translate addresses and enforce access restrictions.  A TLB entry takes a source address (the input to the translation), a destination address (the start of the translated range), and a size.  Since the hypervisor normally assigns sets of virtual instance resources to guests, it can take on the job not only of tracking free virtual instances and assigning them to guests, but also of creating the page table entry (or multiple PTEs, if the assigned virtual instances are not contiguous) that maps the guest's virtual space onto the physical addresses where the instance configuration space appears in the CPU address map.  The hypervisor's virtual address for a given guest is treated as physical space by that guest.  When a guest kernel assigns virtual instances to its user space processes, it does the same thing: it creates page table entries that map the instance space into the user space virtual memory.  Once the PTEs are created, the TLBs are used dynamically at run time in a hierarchical fashion - a PTE lookup in the guest followed by a PTE lookup in the host.
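
A minimal user space sketch of the mapping described above, assuming the kernel (or hypervisor) has granted access to one instance's configuration space through a device node.  /dev/mem, the physical base and the register meanings are stand-ins; a real system would use a dedicated driver that exposes only the instance owned by the caller.

------------------------------------------------------------------------
/* Map one virtual instance's configuration space into a user process.
 * Device node, addresses and register usage are illustrative assumptions. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define INSTANCE_PHYS_BASE  0xfd001000UL   /* illustrative physical address */
#define INSTANCE_SPACE_SIZE 0x1000UL       /* one page per instance, illustrative */

int main(void)
{
    int fd = open("/dev/mem", O_RDWR | O_SYNC);      /* stand-in for a real device node */
    if (fd < 0) { perror("open"); return 1; }

    volatile uint32_t *regs = mmap(NULL, INSTANCE_SPACE_SIZE,
                                   PROT_READ | PROT_WRITE, MAP_SHARED,
                                   fd, INSTANCE_PHYS_BASE);
    if (regs == MAP_FAILED) { perror("mmap"); return 1; }

    /* From here on, register accesses are ordinary loads/stores; the MMU/TLB
     * both translates them and prevents access outside this one instance. */
    regs[0] = 0x1;                                    /* e.g. kick a doorbell register */
    printf("status = 0x%x\n", (unsigned)regs[1]);

    munmap((void *)regs, INSTANCE_SPACE_SIZE);
    close(fd);
    return 0;
}
------------------------------------------------------------------------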

The description above mainly covers how cores access the peripheral devices.  There is another direction: peripheral devices need to access the cores' memory.  As discussed earlier, CPUs with IOMMU capability (nested IOMMU capability) allow buffers to be given to peripheral devices by virtual address.  Virtual address spaces overlap across guests and user space processes, so different virtual instances of a device will see the same addresses for different buffers.  To get the physical address for a virtual address, the device must consult the IOMMU, which requires the guest ID, user space process ID and virtual address to return the associated physical address.  Hence the hypervisor, while assigning a virtual instance of a device to a guest, also needs to associate the guest ID with that instance and inform the device through the global configuration space.  Similarly, the host or guest kernel, while assigning a virtual instance to user space, associates the user space process ID with the instance.  The kernel may program that ID through the instance specific configuration space, or it may ask the hypervisor to do it through the global configuration space.

Since IOMMUs are consulted by devices, the buffers given to peripheral devices must never get swapped out.  Hence it is very important that applications lock those buffers in physical memory (operating systems provide facilities to lock memory in DDR) before handing them to peripheral devices.
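
On Linux the locking facility referred to above is mlock()/munlock().  A minimal sketch, where the buffer size and the eventual hand-off to the device are illustrative:

------------------------------------------------------------------------
/* Lock a buffer in RAM before handing it to a device.  The hand-off itself
 * is device specific and only sketched as a comment. */
#define _POSIX_C_SOURCE 200112L
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 64 * 1024;
    void *buf;

    if (posix_memalign(&buf, 4096, len) != 0) {       /* page-aligned allocation */
        perror("posix_memalign");
        return 1;
    }
    if (mlock(buf, len) != 0) {                       /* pin: pages can no longer be swapped out */
        perror("mlock");
        return 1;
    }

    /* ... pass 'buf' (by virtual address, if the device/IOMMU supports it)
     *     to the accelerator and wait for completion ... */

    munlock(buf, len);
    free(buf);
    return 0;
}
------------------------------------------------------------------------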

Summary

Multicore processors and Multicore SoCs (processors + peripherals) are increasingly being used in network equipment.  Network equipment vendors selecting Multicore SoCs for virtualization would look for the following features in their next designs.
  •  Features expected in cores:
    • IOMMU capability that walks the PTE tables created by operating systems, similar to the 'hardware page table walk' support in cores.
    • Nested IOMMU capability, as being discussed by Intel and AMD on the core side, to allow hierarchical PTE tables so that peripheral devices can access buffers by virtual memory address in user space processes of guest operating systems.
    • Virtualization of PIC.
  • Features expected in peripheral devices
    • Ability to provide multiple virtual instances.
    • Ability to provide good isolation among the virtual instances - reset of a virtual instance should not affect the operation of other instances, and performance isolation should be enforced by scheduling mechanisms that account for the peripheral device bandwidth.  Peripherals are also expected to let software assign device bandwidth to virtual instances.
    • Ability for peripheral devices to take virtual memory addresses for buffers, and to work with the IOMMU to resolve those virtual addresses.

Sunday, September 28, 2008

Look Aside Accelerators Versus In-Core Acceleration

The majority of Multicore processor vendors implement many acceleration functions as look-aside accelerators.  Some vendors such as Cavium and Intel implement some functions in-core and some as look-aside accelerators.  Cryptography is one acceleration function that Cavium and Intel implement in-core while others provide it as a look-aside accelerator.  Other acceleration functions such as compression/decompression and regular expression search are provided as look-aside accelerators by most vendors, so I will not discuss them here.  I will concentrate mostly on crypto acceleration.

How does software use them?

Software uses the accelerators in two fashions - synchronously and asynchronously.  In synchronous usage, the software thread issues a request to the accelerator and waits for the result; it uses the accelerator like a C function, and by the time the function returns, the result is available.  In asynchronous usage, the software thread issues the request, goes off and does something else, and once the result is ready it picks up the result and finishes the processing.  The result can be indicated to the thread in several ways.
  • If the thread is polling for events, the result is read via polling.  Many Multicore processors provide a single hardware interface through which software listens for external events: incoming packets from Ethernet controllers, results from look-aside accelerators and other external events all arrive through this common interface.  Run-to-completion programs wait for events on this interface and take action based on the type of event received.
  • If the thread is not polling the hardware interface, events are delivered to the thread via interrupts.
The point is that asynchronous usage of look-aside crypto acceleration lets the thread do something else while the acceleration function does its job.
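
A sketch of the two usage styles with a hypothetical accelerator API.  accel_submit(), accel_poll() and the request/result types are invented purely for illustration (implementations omitted); real SDK interfaces differ.

------------------------------------------------------------------------
#include <stdbool.h>

/* Hypothetical accelerator API - names are illustrative only. */
struct accel_req;     /* command + input buffer */
struct accel_res;     /* output buffer + status  */
int  accel_submit(struct accel_req *req);           /* queue a command         */
bool accel_poll(struct accel_res **res);            /* non-blocking completion */

/* Synchronous style: the core spins until this one request completes. */
void encrypt_sync(struct accel_req *req, struct accel_res **res)
{
    accel_submit(req);
    while (!accel_poll(res))
        ;                      /* core idles here for the whole crypto latency */
}

/* Asynchronous style: submit, keep processing other events, pick up
 * completions whenever they show up in the common event loop. */
void packet_loop(void)
{
    for (;;) {
        struct accel_res *res;

        /* 1. handle newly received packets, possibly submitting crypto
         *    requests with accel_submit() and moving on immediately */

        /* 2. drain any finished crypto results */
        while (accel_poll(&res)) {
            /* finish per-packet processing, transmit, free buffers */
        }
    }
}
------------------------------------------------------------------------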

Look-aside accelerators can be used synchronously and asynchronously.  In-core accelerators are always used synchronously.

Which applications use the accelerators synchronously, and which asynchronously?

If the performance were the same, software would prefer to use any accelerator synchronously.  But we all know that asynchronous usage gives the best performance for per-packet processing applications.  Before going further, let us look at some of the applications that use cryptography.

IPsec, MACsec and SRTP kinds of applications would use crypto accelerators asynchronously: these applications work on a per-packet basis, and it is simple to change them to take advantage of look-aside accelerators in asynchronous fashion.

SSL and DTLS based applications, in my view, will almost always use accelerators in synchronous fashion.  SSL and DTLS are libraries, not applications by themselves.  Applications such as HTTP, SMTP and POP3 servers and proxies use SSL internally.  To use accelerators asynchronously, changes are required not only in the SSL/DTLS library, but also major modifications to the applications such as HTTP, SMTP and POP3 proxies.  When the applications are developed in high level languages, it becomes nearly impossible to make such changes.

What are the advantages of look-aside accelerators?

Look-aside accelerators give software the option of using the accelerator asynchronously.  Any algorithm that takes a significant number of cycles is better used in look-aside fashion, and many crypto algorithms fall into this category.  When used asynchronously, the core does not idle waiting for the result; it can do something else, thereby improving overall system throughput.  As described above, per-packet processing applications such as IPsec, MACsec, PDCP and SRTP work very well in this mode.  Many software applications that were using software crypto algorithms have been changed, or are being changed, by developers to take advantage of asynchronous use of look-aside accelerators.

Another big advantage is in processing high priority traffic in a poll mode model, where the core spins waiting for incoming events such as packets.  If the application uses the crypto accelerator in synchronous mode, the core does nothing for a long time, on the order of tens of microseconds.  A 1500 byte packet might take around 10 microseconds to encrypt/decrypt, during which the core does nothing; a jumbo frame of 9K bytes might leave the core idle for 60 microseconds.  If high priority traffic such as PTP (Precision Time Protocol) arrives during this time, it does not get processed for up to 60 microseconds.  If crypto is used asynchronously, this high priority traffic gets processed promptly, because the core does not babysit the crypto operation.  It also improves jitter: irrespective of the size of the packet being encrypted/decrypted, high priority traffic is processed within the same time.


Having said that, SSL/TLS based applications use an SSL library (for example, OpenSSL).  Since the SSL library works synchronously, look-aside accelerators will be used synchronously.  High priority traffic is less of an issue here, because SSL based applications typically run in user space in Linux and use software threads: even if one thread is waiting for a result from the crypto accelerator, other software threads get scheduled by the operating system, so the core is utilized well and high priority traffic can be processed by other threads.  But many times the thread waiting for the result waits in a tight loop on a state variable that indicates readiness of the result; in those cases other software threads may not get scheduled well, and Multicore processors with hardware SMT (Simultaneous Multi-Threading) work better.

What are the advantages of in-core Crypto?

In-core crypto is normally faster than look-aside crypto when used synchronously, so it is good for SSL kinds of applications, but not for per-packet processing applications.  In-core crypto has other advantages.  Since the data is passed to the in-core accelerator through core registers (normally wide registers, 128 or 256 bits), it works with virtual memory and is therefore very suitable for user space applications.  Another advantage is that it can be used in virtual machines without worrying about whether the host operating system exposes drivers efficiently: since these are instructions like any other core instructions, they just work without additional effort.  And since in-core crypto behaves just like software crypto, it is very easy to port software applications without major changes to the software architecture.

In-core crypto has some disadvantages too.  If the cores are divided across multiple applications and some applications don't require crypto acceleration, their in-core crypto units go unused and the resulting system throughput is lower.  As indicated above, in-core crypto acceleration does not help high priority work, since it can only be used synchronously.  And for per-packet processing applications, the performance of in-core crypto is less than that of look-aside crypto used asynchronously.

My take:

Since Multicore processors are expected to be used in many scenarios, acceleration functions should be designed so that they can be used both for per-packet processing applications and for stream based applications (such as SSL).  If look-aside crypto performance in synchronous mode can be made similar to in-core crypto, then look-aside acceleration is the preferable choice.  As I described in my earlier post, if look-aside accelerators are enhanced with a v-to-p kind of broker hardware module that understands virtual address space, then I believe the performance of synchronous look-aside acceleration can be made as close as possible to in-core acceleration.

Saturday, September 27, 2008

Are Multicore processors Linux friendly - Part 2

As described in an earlier post, software architects of network equipment are moving their applications to, or developing them in, user space.  I believe Multicore vendors should keep this in mind while designing the accelerators and Ethernet controllers.  Unfortunately, the current crop of Multicore devices is not well designed for user space applications.

Let me first list some ways in which user space programming differs from kernel level and bare-metal programming.

  • Virtual memory:  Kernel level programs and bare-metal programs work with physical memory.  User space programs in Linux work with virtual memory; virtual-to-physical mappings are created by the operating system on a per process basis.
    • Some background on virtual address space and how it works:  each process's virtual space runs from 0 to 4 Gbytes on a 32 bit operating system.  When the process executes, the core, with the help of the MMU, resolves virtual addresses to physical addresses to fetch instructions and to read or write data.  The core maintains a cache of virtual-to-physical mappings in the Translation Lookaside Buffer (TLB); many processor types have 64 TLB entries per core.  When there is no matching TLB entry for a virtual address (a TLB miss), the operating system (or a hardware page table walker) goes through the per process mapping table for a match; if a match with a physical address is found, it is installed in one of the TLB entries.  If there is no physical page, a page fault occurs, which may require reading the data from secondary storage.  Since many processes can be scheduled on the same core, the TLB content is flushed on a context switch and gets refilled with the new process's translations as it executes.
  • Operating system scheduling:  In bare-metal programming, the application developer has full control over the cores and the logic that runs on each core.  User processes may get scheduled on different cores, and the same core may be used to schedule multiple processes.  Since the operating system gives each process a time slice, a user program can be preempted at any time.
  • Restart capability:  Bare-metal and kernel level programs get initialized when the system comes up; if there is any problem, the whole system restarts.  User space programs provide the flexibility of graceful restart, and of restart after a crash.  That is, user space programs should be able to reinitialize themselves when they are restarted, even if the OS and the rest of the system are not restarted.

What challenges do user programs face with today's Multicore processors?


Virtual Memory:

Many Multicore processors are blind to virtual memory; they understand only physical memory.  Because of that, application software has to do extra work to get the best out of the accelerators.  Some of the techniques used by software are:
  • Use physical addresses for buffers that are sent to the HW accelerators and Ethernet controllers:  software allocates memory in kernel space, which reserves physical memory, and memory maps it into user space using mmap().  Though it looks simple, it is not practical for all applications.
    • Applications need to be changed to allocate buffers from this memory mapped region.  Memory allocation in some applications goes through multiple layers of software; some software is written in high level languages such as Java, Perl or Python, and mapping the allocation routines of these programs onto the memory mapped area could be nearly impossible, requiring major software development.
    • Applications allocate memory for many reasons, often through the same allocation function.  To take advantage of the memory mapped space, either the application must provide new memory allocation routines, or all allocations must be satisfied from the mapped area.  The first case requires software changes that could be significant if applications have multiple layers built on top of the basic allocation library.  The second case may not work at all: kernel space is limited, and the amount of memory that can be mapped is not infinite.
  • Implement hardware drivers in kernel space and copy the data between virtual memory and physical memory using copy_from_user and copy_to_user.  This method obviously has performance problems (memory copy overhead) and requires a driver in the kernel, which many software developers prefer to avoid; the preference is to memory map the hardware and use it directly from user space.
  • Use virtual addresses for all buffers, convert virtual memory to physical memory, and hand the physical addresses to the HW accelerators.  Though this is better, it also has performance issues:  locking the memory and getting the physical pages is not inexpensive.  A get_user_pages() equivalent in user space has to walk the process page table to get the physical pages for virtual pages, and all the physical pages need to be locked with mlock(), which is not terribly expensive but still takes a good number of CPU cycles.  On top of that, the result is a set of physical pages that may not be contiguous; if the accelerator doesn't support scatter-gather buffers, the data has to be flattened, which is again very expensive.  (A small sketch of this virtual-to-physical lookup follows the list.)
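
One way a user space program on Linux can do the virtual-to-physical lookup described in the last bullet is via /proc/self/pagemap (available in kernels from around 2.6.25, and requiring sufficient privileges on newer kernels).  A minimal sketch, assuming the page has already been touched and locked with mlock():

------------------------------------------------------------------------
/* Resolve the physical frame behind a virtual address via /proc/self/pagemap.
 * Sketch only: error handling is minimal; bit layout per the kernel's pagemap
 * documentation (bit 63 = present, bits 0-54 = physical frame number). */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static uint64_t virt_to_phys(const void *vaddr)
{
    long page_size = sysconf(_SC_PAGESIZE);
    uint64_t vpn = (uint64_t)(uintptr_t)vaddr / page_size;
    uint64_t entry = 0;

    int fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0)
        return 0;
    if (pread(fd, &entry, sizeof(entry), vpn * sizeof(entry)) != sizeof(entry))
        entry = 0;
    close(fd);

    if (!(entry & (1ULL << 63)))               /* page not present in RAM */
        return 0;
    uint64_t pfn = entry & ((1ULL << 55) - 1); /* physical frame number */
    return pfn * page_size + ((uintptr_t)vaddr % page_size);
}

int main(void)
{
    int x = 42;                                /* touch the page so it is mapped */
    printf("virtual %p -> physical 0x%llx\n",
           (void *)&x, (unsigned long long)virt_to_phys(&x));
    return 0;
}
------------------------------------------------------------------------
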
I expect that at least future versions of Multicore processors will understand virtual memory and relieve software of doing anything special.  The Multicore processor should take virtual addresses for both input and output buffers, in addition to the acceleration specific input, convert the virtual addresses to physical memory, and run the acceleration function on the accelerator engines.  There is a possibility that the virtual-to-physical mapping changes while the V-to-P conversion or the acceleration algorithm is running.  To avoid this, the V-to-P conversion module should do two things.

A. Input side:

1.  Copy the input data to internal RAM of the accelerator.
2.  While copying, if it finds that there is no TLB entry, it returns an error to the software.
3.  Software accesses the virtual space, which fills in the TLB entry, and then reissues the command to continue from where it left off.

B. Returning the result:

1.  Copy the output from internal RAM to the virtual memory using TLB.
2.  If it finds there is no mapping, it returns to the software if the software is using the accelerator synchronously.  If the software is not waiting, it generates an interrupt, which wakes the process to issue the command to read the result.
3.  Software accesses the virtual space, which fills in the TLB entry, and issues the command again.  The v-to-p conversion module then continues writing from the place it left off.

Many times, the TLB entry will still be present for the input buffer; it is more likely that the entry has been lost by the time the accelerator finishes its job and writes the output.  Even so, this scheme avoids quite a bit of the processing that software otherwise has to do, as indicated above.

Since the v-to-p hardware module needs access to the TLB, it needs to be part of the core.  So the commands to issue a request to the accelerator and to read the result should look more like instructions.

Note that TLB content gets overwritten every time there is a process context switch.  While doing the memory copy operation, specifically for the output buffer, the v-to-p module is expected to check that the current process ID for which the TLB is valid matches the process ID it carries as part of the command.

The v-to-p module is also expected to do the memory copy from DDR to internal SRAM or high speed RAM for input data, and from SRAM to DDR for output data.  This copy should be fast enough not to add latency, so the v-to-p module is expected to work with the core caches for coherency and performance reasons.

Operating System Scheduling:

The same core may be used by the operating system to run multiple independent processes, all of which may need to use accelerators by memory mapping them into their user space.  Since these are independent applications, there should be no expectation that they use the accelerators in a cooperative fashion.

Current Multicore processors do not allow multiple user processes running on a core to access the hardware accelerators independently.  Because of this, software puts the driver in kernel space; each user process talks to the driver, which services the requests and returns results to the appropriate user process.  Again, this has the performance cost of copying buffers between user space and kernel space.

The limitation of today's Multicore processors stems from two things:

A.  Multiple virtual instances can't be created in the acceleration engines.

B.  Interface points to a HW accelerator are limited to one per core.

If those two limitations are removed, multiple user processes can use the hardware accelerators by mapping them directly into user space.

Restart

A user process can be restarted at any time, either due to graceful shutdown or due to a crash.  When there is a crash, there may be buffers still pending in the accelerator device.  Linux, upon a user process crash or graceful shutdown, frees the physical pages associated with the process, and those physical pages can then be used by anybody else.  If the accelerator keeps working on a physical page thinking it is still owned by the user process that issued the command, it may write data that corrupts some other process.

I believe that if the solution described in the 'Virtual memory' section above is followed, there is no issue, since the accelerators work on internal SRAM, and the v-to-p module always checks the TLB while writing into memory, so it should not corrupt any other process's memory.

I hope my ramblings are making sense.

Hardware Traffic Management Functionality - What should system designers look for?

There are many chip vendors coming out with built-in traffic management solutions, mainly for traffic shaping and scheduling.  I happened to review some of them as part of my job at Intoto.

Traffic management in hardware is typically the last step in egress packet processing.  Packets scheduled by the traffic manager go directly onto the wire; that is, once a packet is submitted by software to the hardware traffic manager, it is not seen by the software again.

In theory, anything done in hardware is good, since it saves precious CPU cycles for something else.  In practice, the hardware traffic management feature set is limited, and it may not be useful in Enterprise markets.  As I understand it, these HW traffic management solutions are designed for particular market segments such as Metro Ethernet.

If you are designing network equipment, you may want to look for the following functionality in hardware traffic management (HTM).

Traffic classification:  Many HTMs don't support classification in hardware.  They expect classification to be done by software running on the cores, which then enqueues the packet to the right hardware queue; HTMs typically perform only the shaping and scheduling portion of the traffic management function on those queues.  I can understand that there are multiple ways packets can be classified, so leaving it to software gives system designers flexibility.  But as I understand it, the number of cycles software spends on classification is the same as or more than shaping and scheduling put together.  At the least, I would expect HTMs to do simple classification based on L2 and L3 header fields, taking that task off the cores.

Traffic shaping and scheduling:  Traffic shaping is the basic functionality of HTMs and is supported well.  The token bucket algorithm is the common algorithm used by traffic managers for shaping.  Some systems require dual rate traffic shaping (Committed Information Rate and Excess Information Rate), so system designers may need to look for a 'dual rate' feature.  In addition, it is important to know how the EIR is used by the HTM.  The systems I am familiar with treat EIR similarly to CIR, but EIR shaping is expected to be done only after the CIR requirements of all queues are met.  If more bandwidth is available after meeting the CIR of all queues, then the EIRs of the queues are considered; if all EIRs are also met and there is still bandwidth available, then round-robin or some other queue selection mechanism can be used for scheduling the remaining traffic.  One should look for these features to ensure the link is not under-utilized.
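
A minimal sketch of the dual-rate token bucket idea described above.  This is closer to a two-rate meter than a full shaper (a shaper would queue non-conforming packets rather than just classify them), and the token units, tick granularity and conformance levels are illustrative; real HTMs implement this per queue in hardware.

------------------------------------------------------------------------
#include <stdint.h>

enum conform { CONFORM_CIR, CONFORM_EIR, NONCONFORM };

struct dual_rate_shaper {
    uint64_t cir_bps, eir_bps;        /* committed / excess rates in bits per second */
    uint64_t cbs, ebs;                /* bucket depths in bits */
    uint64_t c_tokens, e_tokens;      /* current bucket fill */
    uint64_t last_ns;                 /* last refill timestamp */
};

static void refill(struct dual_rate_shaper *s, uint64_t now_ns)
{
    uint64_t delta = now_ns - s->last_ns;
    s->c_tokens += s->cir_bps * delta / 1000000000ull;
    s->e_tokens += s->eir_bps * delta / 1000000000ull;
    if (s->c_tokens > s->cbs) s->c_tokens = s->cbs;
    if (s->e_tokens > s->ebs) s->e_tokens = s->ebs;
    s->last_ns = now_ns;
}

/* Classify a packet of pkt_bits against the two buckets: within CIR,
 * within EIR only, or non-conforming (eligible only for leftover bandwidth). */
static enum conform shape(struct dual_rate_shaper *s, uint64_t now_ns, uint32_t pkt_bits)
{
    refill(s, now_ns);
    if (s->c_tokens >= pkt_bits) { s->c_tokens -= pkt_bits; return CONFORM_CIR; }
    if (s->e_tokens >= pkt_bits) { s->e_tokens -= pkt_bits; return CONFORM_EIR; }
    return NONCONFORM;
}
------------------------------------------------------------------------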

Another feature to look for in an HTM is the flexibility to enable only CIR, only EIR or both, and a flag indicating whether a queue should participate in scheduling beyond its EIR.

From the scheduling perspective, different systems require different scheduling algorithms.  Systems require scheduling of both strict priority queues and non-strict priority queues; for non-strict priority queues, a scheduling algorithm applies.  Common scheduling algorithms expected are DRR, CRR, WFQ, RR and WRR.
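
Of these, DRR (Deficit Round Robin) is perhaps the easiest to picture.  A small sketch, with the queue interface (peek/dequeue hooks) invented purely for illustration:

------------------------------------------------------------------------
#include <stdint.h>
#include <stddef.h>

struct pkt { uint32_t len; /* bytes */ };

struct drr_queue {
    uint32_t quantum;                           /* bytes added each round (sets the weight) */
    uint32_t deficit;                           /* unused allowance carried forward         */
    struct pkt *(*peek)(struct drr_queue *);    /* next packet or NULL  (illustrative)      */
    struct pkt *(*dequeue)(struct drr_queue *); /* remove next packet   (illustrative)      */
};

/* One DRR round over the queues: each queue may send as many packets as its
 * accumulated deficit allows, so bandwidth shares follow the quanta even when
 * packet sizes differ wildly between queues. */
void drr_round(struct drr_queue *q, size_t nq, void (*xmit)(struct pkt *))
{
    for (size_t i = 0; i < nq; i++) {
        struct pkt *p = q[i].peek(&q[i]);
        if (!p) { q[i].deficit = 0; continue; }   /* empty queue loses its deficit */
        q[i].deficit += q[i].quantum;
        while (p && p->len <= q[i].deficit) {
            q[i].deficit -= p->len;
            xmit(q[i].dequeue(&q[i]));
            p = q[i].peek(&q[i]);
        }
        if (!p)
            q[i].deficit = 0;                     /* queue emptied during its turn */
    }
}
------------------------------------------------------------------------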

Traffic marking:  Marking is another important feature of traffic management functionality.  Packet marking is meant for an upstream router to make allow/drop decisions if it observes congestion.  Different markings need to be applied based on the classification criteria and on the rate band used (within CIR, between CIR and EIR, and beyond EIR).  Marking based on classification criteria is normally expected to be done by software when classification is done in software, but marking based on the shaping rate needs to be done by the HTM, since software does not get hold of packets after traffic management.  Typically the marking is limited to the DSCP value of the IP header or the CoS field of the 802.1Q header.  I have seen some HTMs expect the software to pass the DSCP and CoS locations along with the packet so that the HTM can write the right value into those locations.

So, the features to look for on the marking side are:  the ability for the HTM to mark packets, and the ability for software to configure marking values (DSCP, CoS or both) on a per queue basis according to the shaping band used to schedule the packet (CIR, EIR or beyond EIR).

Congestion management:  Shaping and scheduling always lead to queue management.  Queue management is required to limit queue sizes and to keep packet latency from growing: in some cases it is better to drop packets than to deliver them late.  Different traffic types require different congestion management.  Typical congestion management algorithms expected are tail drop, RED (Random Early Detection), WRED (Weighted Random Early Detection) and head-of-queue drop.  In addition, the queue size, in terms of the number of packets it can hold, is expected to be configurable.  When there is congestion, there will be packet drops, and how packets are dropped and how the drops are reported to software has performance implications.  Software needs to know which packets were dropped from the queues so that it can free them.  To reduce the number of interrupts for dropped packets, the HTM is expected to implement interrupt coalescing, and to maintain a list of dropped packets so that software can read the whole batch in one go when the interrupt occurs.
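
For reference, the core of RED (which WRED applies with separate parameters per drop-precedence class) is a drop probability that ramps linearly between a minimum and maximum threshold of the average queue depth.  A simplified sketch; the averaging step is omitted and the parameter values would be tuned per deployment:

------------------------------------------------------------------------
#include <stdbool.h>
#include <stdlib.h>

struct red_params {
    double min_th;      /* below this average depth: never drop   */
    double max_th;      /* above this average depth: always drop  */
    double max_p;       /* drop probability at max_th (e.g. 0.1)  */
};

/* Decide whether to drop an arriving packet given the (EWMA) average queue
 * depth.  WRED simply keeps a separate parameter set per traffic class. */
bool red_drop(const struct red_params *p, double avg_depth)
{
    if (avg_depth < p->min_th)
        return false;
    if (avg_depth >= p->max_th)
        return true;
    double prob = p->max_p * (avg_depth - p->min_th) / (p->max_th - p->min_th);
    return ((double)rand() / RAND_MAX) < prob;
}
------------------------------------------------------------------------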


Hierarchical shaping and scheduling:  This feature is critical for many deployments.  Shaping parameters are normally configured at the physical or logical port level based on the effective bandwidth.  On the port there could be multiple subscribers (for example, server farms of different customers in a data center, different divisions in an Enterprise, or subscribers of a Metro Ethernet provider), each with its own shaping (CIR, EIR).  Different traffic flows within each subscriber might also be shaped: MEF, for example, does not rule out shaping traffic based on sets of DSCP values beyond shaping at the port and VLAN level, and in the Enterprise, shaping might be needed based on IP addresses or transport protocol services.  Scheduling is always associated with each shaper: whenever bandwidth becomes available, the scheduler is invoked to pick the next packet.  When the physical/logical port level shaper finds bandwidth to send traffic, it invokes the port scheduler, which tries to get hold of a packet from one of the subscribers.  If the subscriber is itself another QoS instance (with its own shaping and scheduling), that subscriber's scheduler is called to pick the packet, and that scheduler might in turn call another internal scheduler to select among the queues holding traffic.  Since one scheduler calls another scheduler, it is called hierarchical.  Typically eight hierarchy levels are expected to be supported.  As a system designer, one needs to verify this feature and ensure that the number of levels supported by the HTM suits the requirement.

It is also important to ensure that hierarchical shaping and scheduling does not require queues at each level; if it did, HTM performance would suffer.  It is acceptable for the HTM to expect software to enqueue packets only at the innermost levels.  Note that all queues may not be at the innermost level: an intermediate or first-level QoS node might contain either a further QoS instance or a queue.  If the scheduler selects a QoS instance, the next inner-level scheduler is called; otherwise it selects the packet from the selected queue.  Classification in software is expected to put packets into the appropriate queues according to the scheduler hierarchy.
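
A tiny sketch of the recursive structure described above, where each node in the hierarchy is either a leaf queue or another scheduler.  The node layout and pick function are illustrative, and pick_child is assumed to choose only among children that actually have traffic.

------------------------------------------------------------------------
#include <stddef.h>

struct pkt;

/* A node is either a leaf queue holding packets, or an inner scheduler
 * that owns child nodes and a policy for choosing among them. */
struct qos_node {
    int is_leaf;
    /* leaf */
    struct pkt *(*dequeue)(struct qos_node *);
    /* inner node */
    struct qos_node **children;
    size_t nchildren;
    struct qos_node *(*pick_child)(struct qos_node *);   /* DRR/WRR/strict priority... */
};

/* Called when the port-level shaper has bandwidth: walk down the hierarchy
 * until a leaf queue is reached and take a packet from it. */
struct pkt *hier_schedule(struct qos_node *node)
{
    while (node && !node->is_leaf)
        node = node->pick_child(node);
    return node ? node->dequeue(node) : NULL;
}
------------------------------------------------------------------------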

Support for multiple ports:  Some Enterprise edge devices have multiple interfaces (multiple WAN interfaces), and each interface might require its own QoS traffic treatment.  As a system designer, one thing to look for in an HTM is how many ports/logical ports can be configured with QoS.  Logical ports are also required because some systems use a built-in switch to expose multiple physical interfaces from one 10G interface connected to the CPU board, with VLANs used internally to communicate between the 10G interface and the switch.  For all practical purposes, this scenario should be treated as if there were multiple interfaces on the CPU card itself.

Support for LAG:  The LAG feature aggregates multiple links, with each link having its own shaping parameters.  As a system designer, you may want to ensure that traffic assigned to a link (port or logical port) by software LAG or hardware LAG is scheduled on that same link.  Also, ensure that the schedule operation is invoked by the shaper of the appropriate link; that is, the HTM should not implement a shaper on the LAG instance but on each member link.

At no time should the HTM drop packets because of LAG operations.  Some misordering during LAG operations is acceptable, but no packet should be dropped.  Two LAG operations in particular - rebalancing and adding/deleting a link - should not give rise to packet drops from the HTM queues.  One may want to verify this.


Srini

Saturday, August 30, 2008

Are Multicore processors Linux friendly?

Multicore processors have been the trend for the past three years.  Intel, Cavium and RMI all have Multicore processors.

But are they Linux friendly?

Since Linux has SMP functionality, they are in theory Linux friendly, and they are to a large extent, as long as the kernel controls the hardware accelerators and Ethernet ports.

There are many reasons why network device vendors don't like to deal with the hardware from kernel space.  Let us take the two accelerators in question - the regex accelerator and the crypto accelerator.  Most vendors providing security functionality implement it using Linux user space daemons, for several reasons such as:
  • User space programming is easier to debug.
  • Some security functions require proxies which work on top of socket interface - hence user space applications.
  • More memory can be accessed from user space.
  • Taking advantage of SWAP.
A typical product is not limited to one user space daemon; there could be many user space daemons in a typical network infrastructure device, and processes may also be created dynamically to take more load.  Let us take UTM as an example:
  • IDS/IPS - One user space process, multiple threads.
  • Network AV - HTTP/S Proxy -  One user space process, multiple threads.
  • Network AV - SMTP/S proxy - One user space process, multiple threads.
  • Network AV - POP3/S Proxy - One user space process, multiple threads.
  • Network AV - IMAP/S proxy - One user space process, multiple threads.
  • ClamAV or any equivalent Anti Virus package -  Multiple user processes created dynamically at run time.
  • SpamAssassin or equivalent Anti Spam package - Multiple user processes.
  • IPsec :  Kernel level function.
Crypto acceleration is required by 
  •  Network AV - HTTPS proxy,  SMTP/S Proxy, POP3S Proxy and IMAP/S proxy.
  •  IPsec in kernel space.
Regex acceleration is required by:
  • ClamAV daemons.
  • SpamAssassin daemon
  • IDS/IPS daemon
  • Content Security daemon (HTTP Proxy).
To improve performance and also to have isolation, many vendors would like to deal with the hardware directly from user space, without the kernel de-multiplexing requests and responses.  That is, the accelerator device needs to be shared by multiple daemons, with each daemon looking at its own copy of the accelerator.  If one process dies, it should not affect the other processes.

Unfortunately, the Multicore processors of today don't have that capability.  I hope new Multicore processors will.

Let us look at what software expects from hardware accelerator devices:
  • The accelerator device should be instantiable.
  • Each instance can be memory mapped by the appropriate user space daemon.
  • Only the owning process/thread should be able to submit requests and pick up responses.
  • Each instance of the hardware device should have its own interrupt, and this interrupt should wake up the appropriate thread.
  • When a user process dies, it should not affect other processes using the same device through different instances.
  • When a user process dies, software should be able to stop that instance of the device.
Intel and VIA implemented crypto as instructions, so they may not have the above issues.  But many Multicore processors implemented crypto acceleration asynchronously, and they will have issues if they don't have support for 'instances'.

Comments?

Thursday, August 28, 2008

Techniques to defend against DNS Cache Poisoning attacks

This subject is covered very well in many forums, IETF drafts and RFCs.  The purpose of this article is to give some idea of how network security devices can play a role in defending against DNS cache poisoning attacks.  Before that, let me give some background.

Many of the DNS attacks discussed recently are on DNS caching servers and DNS resolvers; there are fewer attacks on zone authority DNS servers.  The attacker sends DNS response packets to a DNS caching server before the authoritative DNS server responds to the query sent by the caching server, thereby poisoning the cache with IP addresses for domain names of the attacker's choosing.  DNS caching servers are typically located on company premises and at ISPs.  These servers cache responses until the TTL expires.  When a query arrives at a caching server, it replies immediately if the corresponding entry is present in its cache; if not, it sends the query to a pre-defined uplink DNS server or to the authoritative DNS server for the domain in question.

Attackers send DNS responses as if they were from the uplink DNS server or the authoritative DNS server.  Since DNS works over UDP, it is quite easy for an attacker to spoof a DNS response.  The difficulty for the attacker lies in making the response acceptable to the caching server.  DNS caching servers typically accept a response only if:
  • they have sent the query.
  • response contains the same transaction ID as the query it had sent.
  • response contains the destination port same as the source port it used to send the query.
  • response contains the source port same as the destination port it used to send the query.
  • response contains the source IP same as the destination IP it used to send the query.
But attackers have been able to penetrate the above defenses.  According to recent vulnerability reports, many DNS caching servers already randomize the transaction ID, but that alone is not good enough.  One of the suggestions to defend against this attack is to make the source port of the query random as well.  This makes the attacker's life harder, as it requires more responses and hence more time.  It appears that many DNS caching servers and resolvers have patched their software and now randomize the source port, which is good news.  But it appears that even this defense can be broken; see this article: http://tservice.net.ru/~s0mbre/blog/2008/08/08.  As described in that blog, it takes more time, but an attacker can still make the exploit successful.
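
A sketch of the correlation checks listed above, as a resolver (or stateful firewall) might apply them to an incoming response.  The structures are simplified stand-ins for real DNS/UDP parsing; a real implementation would also compare the Question section.

------------------------------------------------------------------------
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

/* Simplified stand-ins for resolver state and parsed packet fields. */
struct pending_query {
    uint32_t server_ip;     /* where we sent the query            */
    uint16_t server_port;   /* normally 53                        */
    uint16_t local_port;    /* our randomized source port         */
    uint16_t txid;          /* randomized DNS transaction ID      */
};

struct incoming_response {
    uint32_t src_ip;
    uint16_t src_port;
    uint16_t dst_port;
    uint16_t txid;
};

/* Accept a response only if it correlates with a query we actually sent. */
bool response_acceptable(const struct pending_query *q,
                         const struct incoming_response *r)
{
    return q != NULL &&                       /* we did send a query            */
           r->src_ip   == q->server_ip &&     /* came from the server we asked  */
           r->src_port == q->server_port &&
           r->dst_port == q->local_port &&    /* hit our randomized source port */
           r->txid     == q->txid;            /* matching transaction ID        */
}
------------------------------------------------------------------------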

Many DNS caching servers have implemented additional defenses such as:
  • Additional correlation checks between query and response, such as checking that the "domain name" in the Question section of the response matches the Question (and Answer) section of the query: many attackers get around this defense by sending their own queries with arbitrary domain names and sending the same names in their forged responses.  Sending the queries themselves gives attackers even more control: they need not wait for subscribers of the DNS caching server to initiate a query, which greatly increases the chance of poisoning the cache.  Attackers send queries with unknown subdomains, which certainly triggers a query from the caching server, and can time their responses right after sending a query with an unresolvable, randomly created domain name.  One might ask how the cache gets poisoned if the attacker is sending random subdomain names - at best it looks like a DoS attack.  From the recent attack exploit scripts (script1 and script2), one can understand that cache poisoning happens not through the information in the Answer section of the DNS response, but through the NS resource records (Authority section) and the Additional resource records.  Attackers ensure that the domain name in the Question section of both query and response is the same.  To understand more about this kind of attack, read section 2.3 (Name Chaining) of RFC 3833.
  • Protection from birthday attacks - again, this defense is being overcome by attackers sending a random domain name in each query.
  • Ensuring that the NS record contains some portion of the domain name from the Question section, so that an arbitrary NS record is not honored: this defense is also being broken.  Attackers craft the full domain name in the Question section as a random string followed by the victim domain name, and craft the response with the victim domain name in the Authority section, thereby bypassing the check.  For example, if an attacker wants to serve his/her own IP addresses for www.veryimportantsite.com, the queries and responses he/she sends to the victim caching server look like this:
-----------------------------------------------------------------------
Query:
QUESTION SECTION:
.www.veryimportantsite.com. IN A

Response:

Question section is same as Query.

AUTHORITY SECTION:
www.veryimportantsite.com. 6000 IN NS attacker.veryimportantsite.com.

;; ADDITIONAL SECTION:
attacker.veryimportantsite.com. 6000 IN A 2.3.4.5

----------------------------------------------------------------------------

As you see above, the defenses used by DNS caching servers are not going to work.  In this case, the DNS caching server will use 2.3.4.5 as the authoritative DNS server for the www.veryimportantsite.com domain.  Any queries to this caching server for www.veryimportantsite.com or any other subdomain within it will be sent to the 2.3.4.5 server.  Since this server is hosted by the attacker, he/she can serve arbitrary IP addresses for the victim domain names.  Basically, the NS list is corrupted with a poisoned entry.  Replace veryimportantsite.com with google.com and this has a devastating effect if the DNS caching server belongs to a popular ISP: many users behind that ISP will be directed to sites of the attacker's choosing when they visit www.google.com or any subdomain under it.

Many people agree that the right solution is getting rid of UDP and using DNS over TCP, or using DNSSEC.  It is going to take time for everybody to adopt these.  Until then, DNS caching servers need to improve their randomization, and security administrators should install security devices to prevent the attack or get early signs of attempts.

How can security devices help?

Due to the stateful nature of network security devices, a DNS response packet is not accepted by the firewall function unless the 5-tuple of the response matches the 5-tuple of the DNS query.  That is, the DNS response packet must have its source IP equal to the destination IP of the query, destination IP equal to the query's source IP, source port equal to the query's destination port and destination port equal to the query's source port.  In addition, many firewall devices also check for a matching transaction ID.  One might think that the firewall adds no value here, since many DNS caching servers already do these checks, but the firewall goes one step further: it only allows a DNS response from the security zone to which the DNS query was sent.  It is also good practice to set up rules that allow DNS traffic only as needed.  Typically in Enterprises, only outbound DNS is allowed; if there are authoritative DNS servers inside, DNS requests are allowed only to those particular servers.  This reduces the probability of successful poisoning attacks.  Let us analyze.

Assume that a company has a DNS caching server in its "Intranet-DMZ" zone and the ISP DNS server, of course, is in the untrusted (External) zone.  The Enterprise administrator creates a rule from the 'Intranet-DMZ' zone to the 'External' zone on destination port 53, with the destination IP of the ISP DNS server and an "Allow" action.  Due to this rule, any DNS queries from the attacker will be dropped, so the attacker has to depend on queries generated by the DNS caching server in response to genuine queries from local users.  That dramatically reduces the success of the attack.  If there are internal attackers, they need to send both queries and responses.

Some firewall devices come with local DNS caching server (DNS resolver) functionality.  In these cases, the device should take the same precautions, such as randomizing the source port and transaction ID while sending DNS queries to uplink DNS servers.

I feel that security devices implementing firewall and IPS can add more defensive measures, such as:

1. Many DNS resolvers and caching servers have already been enhanced to not send DNS queries on persistent sessions; that is, each DNS query is sent from a different randomized source port.  A security device can remove its session entry once the corresponding DNS response is received, and then count the number of DNS responses received for which there is no session entry.

As you can observe from the attack scripts, for every query a large number of responses are sent by the attacker.  Counting the number of orphan responses in a particular quantum of time indicates that some attacker is trying to exploit the DNS caching server.  The security device can warn administrators for further analysis.
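
A trivial sketch of that heuristic: count responses with no matching session in a fixed time window and alert past a threshold.  The window length, threshold and alerting mechanism are illustrative only.

------------------------------------------------------------------------
#include <stdint.h>
#include <stdio.h>

#define WINDOW_SECS      10
#define ORPHAN_THRESHOLD 100       /* illustrative */

struct orphan_counter {
    uint64_t window_start;
    uint32_t count;
};

/* Call whenever a DNS response arrives that matches no outstanding query/session. */
void note_orphan_response(struct orphan_counter *c, uint64_t now_secs)
{
    if (now_secs - c->window_start >= WINDOW_SECS) {
        c->window_start = now_secs;           /* start a new window */
        c->count = 0;
    }
    if (++c->count == ORPHAN_THRESHOLD)
        printf("ALERT: possible DNS cache poisoning attempt (%u orphan responses in %d s)\n",
               c->count, WINDOW_SECS);
}
------------------------------------------------------------------------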

2. Once an attack attempt is detected, the security device can send one more query with the same information as the original query and match the responses.  If the content looks the same, it can be cached and the response sent to the original querier.  Of course, this requires some DNS resolution functionality in the security device.  Note that this functionality could also be implemented within the DNS caching servers themselves.

3. Security devices implementing NAT should not make guessing of the transaction ID and source port easier: when a NAT port is selected, it must be a random port.

Thursday, August 21, 2008

Data Center Firewall features

What makes firewall a good data center firewall?

Before going further into what features data center IT/security professionals expect, it is good to revisit data centers.  Data center providers are mainly hosting providers: they host their customers' applications and machines.  Some customers of a data center share a machine, some host their applications in a virtual system, and some host their applications on dedicated machine(s)/blades.  To provide availability and share the load, server applications are installed on multiple machines with "load balancers" distributing the load across the server farm.  As we all know, HTTP/HTTPS servers are by far the most common server applications in data centers.  Most of the time, the services provided by hosted servers are meant for the general public.

Increasingly, there is a trend of Enterprises offloading the hosting of Intranet servers to external data center providers. Intranet servers typically provide access to employees and limited access to partners. For example, many email services, SharePoint sites and wikis are being offloaded to data center providers by many small and medium Enterprises. Many of these services require user authentication. Enterprises don't like to duplicate user databases across multiple machines/applications. So, you also see the trend of a 'Central Authentication Database' shared across internal servers and servers hosted outside. Many web applications are providing SAML based authentication for federated identity. Since web services need to talk to outside identity providers, there can be outbound connections. Note that, traditionally, servers in data centers only see inbound connections.

Enterprise administrators also require facilities to upload content and perform other administrative activities on the hosted servers. FTP and SSH are typically some of the services required by administrators. Some applications might have a web interface running on port 80/443 for administration. To provide added security beyond user authentication, data center providers like to restrict admin access to particular network(s), typically the Enterprise's networks.

With more and more services (both Intranet and Extranet) being hosted in external data centers, the need for securing them is high. Collaborative services/servers such as wikis, SharePoint, CRMs and other workflow servers used to be part of Enterprise networks and accessible only to local users. They are being hosted in external data centers for reasons such as providing access from anywhere for employees, partners, contractors etc.. and also to reduce the administration headache. Since they can be accessed from anywhere, they are open to attacks. So, the need for detection and prevention of exploits becomes much greater than what data centers are used to. A quick look at the vulnerabilities published by NIST (nvd.nist.gov) indicates that SQL injection, XSS, LFI and RFI attacks are on the rise. You can also see that a number of wikis, blogs and other collaborative applications are targets of attackers.

When Intranet servers are placed in an external hosting provider's network, Enterprises would like the communication channels to be secure to protect the data from eavesdropping. HTTP over SSL/TLS is one common method used to achieve data confidentiality on the wire. For security devices placed in front of these servers to do a better job of access control, intrusion detection and detection of malicious injections, it is necessary for these devices to see the traffic in the clear. To achieve this, security devices should have the capability to decrypt the SSL, do traffic/data analysis and, if required, redo the SSL. Since security devices are expected to sit right in front of the servers, there may not be any need for redoing SSL, but the important takeaway is that the security device should be able to terminate SSL connections.
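
A very small sketch of that termination capability, assuming the OpenSSL 1.1-style API and an already accepted TCP socket (the certificate/key paths are placeholders; error handling and re-encryption towards the server are omitted):

#include <openssl/ssl.h>

/* Build a server-side context using the protected server's certificate and key
   (placeholder paths; the device would be provisioned with the real ones). */
SSL_CTX *make_inspection_ctx(void)
{
    SSL_CTX *ctx = SSL_CTX_new(TLS_server_method());
    SSL_CTX_use_certificate_file(ctx, "/etc/device/server.crt", SSL_FILETYPE_PEM);
    SSL_CTX_use_PrivateKey_file(ctx, "/etc/device/server.key", SSL_FILETYPE_PEM);
    return ctx;
}

/* Terminate the client's TLS session so the payload can be inspected in clear. */
SSL *terminate_tls(SSL_CTX *ctx, int client_fd)
{
    SSL *ssl = SSL_new(ctx);
    SSL_set_fd(ssl, client_fd);
    if (SSL_accept(ssl) <= 0) {      /* TLS handshake with the client failed */
        SSL_free(ssl);
        return NULL;
    }
    return ssl;                      /* SSL_read() on this handle now yields cleartext */
}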

For the last few years, many web applications have been using SOA (Service Oriented Architecture), which is built upon XML standards. Traditional plain POST requests, JSON and PHP objects are fast becoming a thing of the past. Any security device doing intrusion and data analysis needs to move beyond POST, JSON and PHP objects and start interpreting SOAP and XML.

Data center providers provide services to many customers, and each customer's security requirements are different. One generic security policy does not fit these environments. You could deploy as many firewalls as there are customers, but that does not scale from a cost, space and cooling perspective. This is where virtualization in firewall/security devices comes in handy. Virtualization with VMware/Xen does not scale well here; old-style traditional virtualization scales well and suits data center providers.

Since the security device sits in the path of traffic, its performance should be high enough to sustain the traffic rate that the servers/services it is securing can process. Latency, stability, availability and failover capabilities are some more important factors data center providers consider while selecting security devices.

With the above background, it is very easy to map to the features expected by data center providers from a security device protecting their application and server infrastructure.
  • Access Control : As you see above, access control sometimes needs to go beyond IP addresses and TCP/UDP ports. Some web applications provide administrator and normal user access via the same TCP/UDP port, so it is not possible to distinguish administrators from normal users using IP addresses and ports alone. Since many data center providers don't like admin access to be allowed from any IP address (for better security), but only from specific networks, access control needs to extend to application level information such as the URL, query parameters etc..
  • Intrusion Detection and Prevention at L3-L7: As explained above, typical traditional intrusion detection systems without web application intelligence will not be able to detect intrusions all the time. There are many evasions being employed by attackers. Some evasions are at the IP and TCP level and more evasions are at the HTTP protocol level. Hence protocol intelligence is required. In addition, with SOA based web services, intrusion detection systems need to have intelligence to extract data from SOAP/XML messages. In addition to web application intelligence, they also need to have intelligence of other common services provided by hosting providers such as DNS, FTP, SIP etc..
  • SSL Proxy: Network device should be able to terminate the SSL for further analysis on the protocol data.
  • Virtualization: One physical hardware box is expected to support multiple virtual instances to reduce the number of security devices in the deployment. Each virtual instance would need to have its own security policy configuration, and it should be as good as having different physical firewall devices. I personally don't prefer VMware/Xen/KVM based virtualization for these environments; I prefer traditional virtualization where only configuration data and run time states are instantiated for every context.
  • DDOS attack detection and prevention.
  • Traffic Anomaly detection and traffic Control.
  • Performance: To achieve multi gigabit speeds, look for hardware architecture which is scalable.
  • Stateful failover and high availability
  • Logging & Auditing capabilities
  • Intuitive central Management system
Optional features: Though they are not required, some data centers might find them useful
  • Server side NAC: Provides a facility for user based access control. NAC does user authentication and provides controlled access to the different features of an application based on the URL and other fields in the protocol. It also helps in correlating user actions and might be useful in auditing.
My intent here is not to go into many details, but provide some ideas on the features security vendors would need to think while providing security device solutions to data center market.

Guidelines for defining data models

TR-106 defines data model guidelines for creating new data models. For interoperability and to provide better clarity, I have defined some more guidelines.

Before you proceed further, please read this.


Guidelines:

Guidelines that are being followed today in defining data models.

  • Data type: such as string, base64, integer, unsigned integer, Boolean, char etc..
  • Range: In case of integer, unsigned integer values.
  • Enumerations: Set of values that this parameter can take. Valid for both integers and strings.
  • Min/Max length of string: In case of string and base64 types.
  • Read/RW attributes for parameters
  • Create/Delete for table nodes.

Additional guidelines:

General:

  1. Values of some parameters don't reflect changes done by the ACS. These are called action parameters. Each action parameter should have an associated result parameter. Action parameters are always strings and take the value "apply"; the ACS sets this value to perform the action. These parameters should be defined with an attribute called "ActionParameter", which should be set to 1 for action parameters. The fully qualified name of the associated result parameter should also be given via the attribute "AssociatedResultParameterName".
  2. For each action parameter, there should be a result parameter. Result parameters are always strings. "success" and "not available" are the two values defined by these guidelines; any other string value is considered an error in performing the action by the application. Possible error strings are defined by the applications defining the data model. Whenever an action parameter value is set, the associated result string value should automatically be set to "not available". When the application processing completes successfully, this parameter value is set to "success" by the device. If the application processing returns an error, an appropriate string value is set on this parameter. (A small illustrative fragment follows this list.)
  3. Always define default values for all parameters except for mandatory parameters of table objects.
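
To make the action/result convention concrete, a hypothetical data model fragment (the parameter names are invented purely for illustration) could look like:

Device.X_Example_Diag.PingTest          (string, RW)
    ActionParameter = 1
    AssociatedResultParameterName = "Device.X_Example_Diag.PingTestResult"

Device.X_Example_Diag.PingTestResult    (string, READONLY)
    Values: "success", "not available", or an application defined error string

The ACS sets PingTest to "apply" to trigger the action; the device resets PingTestResult to "not available" and later sets it to "success" or an error string once processing completes.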

Table objects:

  1. Identify the parameter that uniquely identifies an instance in the table object. Indicate it with an attribute "KeyParameter". The ACS can check for the uniqueness of this value whenever an instance is created.
  2. For each parameter that can't be changed after the instance is created, indicate "DontChangeAfterCreate" in the DM. The key parameter of the table object typically has this attribute set; there could be other such parameters too, depending on the application.
  3. Identify all parameters that are mandatory during instance creation. Use the "Mandatory" attribute for these parameters. The ACS can then ensure that values for all mandatory parameters are sent in the "SetParameterValues" method that follows the "AddObject" RPC method.
  4. Indicate the default value in the data model if the parameter is not mandatory. This information is needed when an ACS user asks it to reset some specific optional parameter to its default value.
  5. For every table object, one "Number Of Entries" parameter must be present as per TR-106. I suggest having one more parameter, "Number Of Entries Supported", to represent the number of rows the device can hold. Both parameters take integer values and both are READONLY. Note that the names of these parameters could be different for different data models (table nodes), hence there is a need to associate these parameters with the appropriate table nodes. To associate them with the table object (table node), these parameters should have an attribute indicating the fully qualified table object; I call this attribute "AssociatedTableNode". In addition, two more attributes are needed - "CurrentEntriesIndicator" and "MaxNumEntriesIndicator". The CurrentEntriesIndicator attribute should be set on the parameter indicating the current number of entries, and the MaxNumEntriesIndicator attribute should be set on the parameter indicating the maximum number of entries the device can take for the corresponding table. (See the illustrative fragment after this list.)
  6. If the table represents an ordered list, it should have a special attribute on the table object called "OrderList". It should also have an associated parameter name that indicates the priority within the ordered list; call this attribute "PriorityParameterName". This information is used by the ACS as described in a previous blog here. The priority parameter takes integer values; lower values indicate higher priority instances.
  7. If the table is an ordered list, it should also have one pair of action and result parameters for revalidation purposes. In addition to the attributes defined earlier for these two parameters, they should have an additional attribute, "AssociatedTableNode", giving the table object name. The ACS uses these attributes to know which parameters to use to revalidate the states in the device.
  8. As described under the section "Nested table objects and special cases" of an earlier blog here, there is a need for creating a table object instance with at least one instance of a child table. The ACS needs to know this special relationship so that it generates the screens in such a way that it takes at least one child instance's parameters as part of the parent instance creation screen. This relationship information is also useful for the ACS to validate the configuration before sending it to devices. Each child table of this type should have one attribute, "OneInstanceNeededByParent", set to 1.
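
Putting several of the table-object guidelines together, a hypothetical table node annotated with the proposed attributes (all names invented for illustration) might look like:

Device.X_Example_Firewall.Rule.{i}.                     (table object)
    KeyParameter = "Name"
    OrderList = 1
    PriorityParameterName = "Priority"

Device.X_Example_Firewall.Rule.{i}.Name                 (string, Mandatory, DontChangeAfterCreate)
Device.X_Example_Firewall.Rule.{i}.Priority             (unsigned int; lower value = higher priority)

Device.X_Example_Firewall.RuleNumberOfEntries           (int, READONLY)
    AssociatedTableNode = "Device.X_Example_Firewall.Rule."
    CurrentEntriesIndicator = 1

Device.X_Example_Firewall.RuleNumberOfEntriesSupported  (int, READONLY)
    AssociatedTableNode = "Device.X_Example_Firewall.Rule."
    MaxNumEntriesIndicator = 1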

Wednesday, August 20, 2008

Jericho forum - Is network security device market dead?

In one of the meetings I participated in a few weeks earlier, a person asked me a very interesting question - will there be any security device market in the future? When I asked him why that question, he referred me to the Jericho forum. Though I had some idea about the Jericho forum before, it got me interested in knowing more details about it.

When I first browsed through the forum publications, I thought that the question was fair. At first glance, it appeared that the Jericho forum was proposing to add security along with the application and data. But after spending a few hours on the position papers, brochure and FAQ, it appears that the Forum is not advising people to throw away their firewalls and security devices, but to extend security down to applications and data. Having said that, the position papers still confuse readers with some inconsistent statements. I think that the Jericho forum did not position its security concerns and resulting architecture very well, hence the confusion and mis-characterization in the security industry.

The Jericho forum described two main challenges - business transactions that tunnel over HTTP/HTTPS and exploits/malware escaping traditional firewalls/security devices. I would add one more challenge beyond HTTPS: the data itself may not be in the clear - it may be encoded, encrypted or compressed.

It is true that traditional network address/service level firewalls are not good enough to protect resources from data level attacks and data misuse. Many applications are being developed on top of Port 80/443 (HTTP/HTTPS). Web Services (SOA) architecture is being used to develop multiple applications on a single machine with HTTP/HTTPS as transport. Any application service level filtering is possible only by devices having HTTP/HTTPS and web services intelligence.

It is also true that much newer malware evades traditional signature detection - either by delivering malware executables via HTTPS or by constantly morphing to avoid detection. One of the detection techniques, behavioral analysis, requires gathering run time information such as registry entry modifications, listening ports, outbound connections and files being modified by running the executable on the appropriate operating system.

With the challenges described and the positioning it is doing, the first impression I got was that the Jericho forum is advocating putting all security along with each application on the same machine. It took me a while to get rid of this impression. I guess the term "de-perimeterization" is confusing. I would like to think that the Jericho Forum is proposing that security at the Enterprise boundary is not good enough and that security is additionally needed closer to the applications/resources. So, there are multiple perimeters, with some perimeters having a few machines or even one machine or one application. By the way, traditional firewalls and IPsec VPN devices do a very good job of providing access control to desktop systems based on the type of user and of providing secure connectivity to other branches of an organization.

Though adding all security functions along with the application on the same machine provides better security, there are complexities:

There could be multiple machines running the same application in cluster mode. In some deployments, it is observed that hundreds of machines are used to share the load. In those cases, it is wise to move security functions such as "L4-L7 access control" and "Intrusion Detection functions" to specialized security devices. It saves CPU cycles on the application servers. It provides a single point of control for administrators to manage the security functions of the applications or set of servers, hence management becomes easier. Some security functions, such as terminating wireless connectivity and mobile device management, don't really belong to one specific LOB (Line Of Business) application; they need to be outside of the application servers.

Having said that, some security functions can't be done well outside the LOB machines, such as behavioral detection of malware, or when the data is encrypted or compressed with proprietary algorithms. They are better done in the end systems.

There is a cost to applying some security functions outside the LOB servers. For example, many LOB servers implement security protocols such as SSL, XML Security etc.. Any access control device providing control at the XML field level must terminate the SSL connection, authenticate the user, and decrypt and validate the XML documents before doing access control. There is an inherent benefit too - the LOB servers save CPU cycles as they see clear traffic. But some CSOs may be concerned that a network element has access to the clear data. If it is a micro perimeter, then there may not be any concern. I guess the Jericho forum is driving this point, where the security perimeter is as close as possible to the applications and data.

Security device vendors would like to make their solutions as generic as possible. They don't like to tie the device functionality to one or a few applications. That is where standardization helps. I am happy to see that the Jericho forum, in its COA (Collaboration Oriented Architecture) position paper, chose SOA and XACML. Both of these architectures depend heavily on XML messaging. This provides a common understanding for network elements outside of LOB servers, thereby creating an eco-system of vendors comprising security vendors and application vendors.

Having said that, I feel that the LOB applications must have their own security based on the application - such as authentication, multiple roles, role based access control, auditing etc..

In summary, CSOs need to understand that Enterprise boundary security with traditional network level firewalls is not good enough to protect data and resources. Application specific security is a must. Some security functions can be done outside of LOB servers, but the security device must be as close to the LOB servers as possible. So, I don't see the network security device vendor market drying up.

Tuesday, June 17, 2008

OpenDNS - Domain filtering Cloud computing Service

I recently came across the OpenDNS service. Visit www.opendns.com to find out more details on this service. "OpenDNS" is a confusing name given the type of service they are providing; initially, I thought it was something similar to dynamic DNS.

Operation:
OpenDNS mainly provides a domain blocking capability. Domains are arranged in multiple categories. It provides a facility for users to configure which categories are to be blocked, and a facility to create white lists and block lists of domain names.

This service uses the DNS protocol. It expects user machines or routers to use OpenDNS servers for domain name resolution. As part of DNS resolution, it appears to extract the domain name from the DNS request packet, search its local database, get the category and look at the user's preferences. If the category is configured to be blocked, or if the domain is in the block list, then the OpenDNS server seems to send a DNS response with its own IP address. Due to this, the user's browser session lands on this IP address. OpenDNS seems to do a search on the domain name (Host field of the HTTP request header) again to determine the category, and it shows a nice page indicating why the site was blocked.

Comments:
This service is good for residential users and even for business users. Residential users benefit by blocking adult sites for kids and by being stopped from visiting phishing sites. Businesses also benefit as it stops users from going to phishing sites. Having said that, this works well only when CPE devices work in conjunction with the OpenDNS service. Before going into the capabilities required in CPE devices, let me list some limitations/issues in using the OpenDNS service.

  • Privacy issues: Some businesses find it difficult to trust the OpenDNS provider due to privacy issues, since the OpenDNS provider comes to know which sites business users are visiting. A business may like to have a facility for some users to bypass this service and for others to mandate it. Also, businesses like to have a facility to bypass OpenDNS based domain name resolution for some specific domain names.
  • User or group based white list/block list/category selection: There are different types of employees in businesses. Also, there are different types of home users - kids, parents, visitors, teens etc.. OpenDNS provides only one profile for all users. This may not be sufficient for many businesses and residential users.
  • Evasion: Kids can evade these filters if they use IP address in their browsers.
  • Updating Dynamic Public IP address with the opendns account
How can CPEs help?

User/Group based lists: User/group based list support is only possible if OpenDNS updates its functionality. One possible way is to have a special DNS request with added information such as a Group ID. The OpenDNS service can then provide a facility in the OpenDNS portal to create category selections/block lists/white lists on a per-group basis. Since one can't expect all PCs to support this special enhancement to the DNS protocol, this kind of support is possible with CPEs implementing a DNS proxy that adds the Group ID to DNS requests.

Privacy: CPE devices can help mitigate privacy issues by providing support for 'skip' lists - a source skip list and a domain skip list. If the source IP address of a DNS request packet from an internal PC matches an entry in the 'Source Skip' list, then the CPE bypasses OpenDNS based resolution. It can do this by sending the DNS request to one of the ISP's Domain Name Servers. The 'Domain Skip' list is checked against domain names inside DNS requests sent by local machines; if there is a match, the CPE bypasses OpenDNS resolution.

Evasion: CPE devices can monitor HTTP requests and check the 'Host' header line. If the 'Host' header line does not have a domain name, but an IP address, then we can certainly say that a domain name was not used while browsing the site. CPE devices can provide configuration on the type of action to take, with options like 'Inform' and 'Deny'. The 'Inform' action informs the parent in an RG environment or the admin in a business environment. The 'Deny' action drops the connection and might even present a local HTML error page to the user. Here too, we should have 'skip' lists to handle scenarios where some sites are only reachable via IP addresses - for example Intranet sites or partner sites.
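
A small sketch of that Host header check (IPv4 only for brevity; the host string is assumed to come from the CPE's HTTP parsing code):

#include <arpa/inet.h>
#include <string.h>

/* Return 1 if the HTTP Host header carries a raw IP address rather than a
   domain name, which suggests the user is bypassing DNS based filtering. */
int host_is_ip_literal(const char *host)
{
    char buf[64];
    struct in_addr v4;

    /* Copy and strip an optional ":port" suffix before checking. */
    strncpy(buf, host, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';
    char *colon = strchr(buf, ':');
    if (colon)
        *colon = '\0';

    return inet_pton(AF_INET, buf, &v4) == 1;
}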

There is another kind of evasion possible too: local users using their own DNS server or some public DNS servers. The CPE can check all DNS requests and ensure that only the specified DNS servers are used. It could even do destination NAT to the required DNS server IP address.

Dynamic IP address update: Today it is expected that a special program runs on a PC behind the CPE router. This does not work well if there are many machines, or machines that do not run the software provided by the OpenDNS folks. The CPE device can help here by updating the dynamic public IP address in the OpenDNS servers. CPE devices are already equipped to update dynamic public IP addresses in DynDNS servers; they could do the additional job of updating the OpenDNS servers too.

SME IPS and Cloud Computing

Cloud computing providers are betting on small and medium businesses flocking to them. A large number of SME businesses are already using email services provided by cloud computing providers. It appears that this trend is spreading to other services such as file services, backup services and web application services.

Businesses offloading their intranet and extranet services to providers would be left with desktops and some minimal servers in their networks. I have my own doubts about the merits of moving Intranet services to providers, but that discussion belongs to some other topic.

Desktops normally don't provide any services, i.e. they don't run any servers. Printers and other networking equipment may have some services, but they are limited to internal machines. Hence firewall protection allowing only internal machines is good enough.

Basically, the requirement for server side security functions beyond the firewall is going to be lower in these environments. In addition, many hackers are now moving towards softer targets, i.e. desktops and the applications running on them such as browsers, viewers etc..

Many IPS/IDS devices in the market today protect servers better than clients. Due to movement of services to providers and with increase of client side attacks, IDS/IPS vendors must support better client side detection to survive.

IDS/IPS vendors have realized this and are moving in this direction, but not as fast as one would like to see. By mid-2009, I believe that many IDS/IPS boxes in the market will have sophisticated engines to support client side attack detection and prevention.

Sunday, May 18, 2008

Proxy based Networking applications - Multicore and developer considerations

Many networking applications such as Anti Virus, Anti Spam and IPS, to name a few, are being implemented as proxies in network infrastructure devices.  Proxy implementations terminate the connections from external clients and make connections to the destination servers. They get hold of the data from the client and server and send it to the other end after doing application specific processing.  To ease development, proxies are implemented on top of sockets in user space.  For each client to server connection, two socket descriptors are created - one socket is created as part of accepting the connection from the client and another socket is created as part of making the connection to the server.

On unicore processors, it is typical practice to have one process per proxy.  The process handles multiple connections using non-blocking sockets, typically using either the poll() or epoll_wait() mechanism.  This allows the process to work on multiple connections at the same time.

Many networking applications listed above use proxies in transparent mode.  Transparent proxies avoid any changes in the end client and server applications and also avoid any changes to the DNS Server configuration.  Systems with transparent proxies intercept the packets from both clients and servers.  The forwarding layer (of Linux) of the system is expected to intercept the packets and redirect them to the proxies.  Redirection typically happens by overwriting the IP addresses and TCP/UDP ports of the packet in such a way that the packets go to the proxies running in user space, without the developers making any changes to the TCP or UDP or any other stack component.


The process skeleton looks something like this:

main()
{
     Initialization, Daemonize & configuration load.
     Create a listening socket.
     while(forever until termination)
     {
            epoll_wait();
            Do any timeout processing.
            for ( all ready socket descriptors )
            {
                 if ( listening socket)
                 {
                        accept();
                        Create application specific context.
                        Might initiate the server connection.
                        May add socket fds to epoll list.
                 }
                 if ( socket is ready with new data )
                 {
                      Application specific processing();
                      As part of this, the socket fd may get added to the epoll list again.
                 }
                 if ( socket has space to send more data )
                 {
                      Application specific processing() which sends the data.
                      If more data is to be sent, the socket might be kept in the epoll list again.
                 }
                if ( socket has exception )
                {
                        Application specific processing ();
                        Connection may be closed as part of application processing.
                }
  
            }          
     }
     Do any graceful shutdown activities.
     exit(0);
}

Increasingly, multicore processors are being used in network infrastructure devices to increase the performance of the solution.  Linux SMP is one of the popular operating system choices for developers.  What are the things to be considered while moving to multicore processors?

Usage of POSIX compliant threads in the proxy processes:
Create as many threads as there are cores in the processor.  Core affinity can be set for each thread.  At times one might like to create more threads than the number of cores in the processor to take advantage of asynchronous accelerators such as symmetric & public key crypto acceleration, regular expression search acceleration etc..  In those cases, a thread might wait for the response.  To allow the core to process other connections in the meantime, multiple threads per core are required.
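
A minimal sketch of creating one worker thread per core and pinning it, assuming Linux/glibc (pthread_setaffinity_np); worker_loop stands for whatever per-thread proxy loop the design uses:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <unistd.h>

extern void *worker_loop(void *arg);   /* per-thread proxy loop (application specific) */

void start_workers(void)
{
    long ncores = sysconf(_SC_NPROCESSORS_ONLN);

    for (long i = 0; i < ncores; i++) {
        pthread_t tid;
        cpu_set_t set;

        pthread_create(&tid, NULL, worker_loop, (void *)i);

        CPU_ZERO(&set);
        CPU_SET((int)i, &set);                       /* pin worker i to core i */
        pthread_setaffinity_np(tid, sizeof(set), &set);
    }
}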

Mapping of thread to the application processing:  

Developers use different techniques.  Some use a pipelining model. In this model, one thread gets hold of the packets for all connections and passes them to the next thread in the pipeline for further processing.  The last thread in the pipeline sends the packet out on the other socket.  Though this might use multiple cores in the processor, it may not be the ideal choice for all applications.  I feel the run-to-completion model is a good choice for many applications.  The run-to-completion model is simple: each thread waits for connections, and once a connection is accepted, it does everything related to the application in the proxy, including sending out the processed data.   The structure is similar to the process model, but the loop is executed by each thread (see the per-thread skeleton after this list).  That is, connections get shared across the threads, with each thread processing its own set of connections.  Advantages of this approach are:
  • Better utilization of dedicated caches in the cores.
  • No or less number of Mutex operations as one thread does all processing.
  • Less number of context switches.
  • Less latency, as it avoids multiple enqueue/dequeue operations to pass packets from one pipeline stage to another.
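
A rough per-thread skeleton for the run-to-completion model (each thread owns a private epoll instance and handles the entire life of its connections; handle_event is a placeholder for the application specific work):

#include <stddef.h>
#include <sys/epoll.h>

extern void handle_event(int epfd, struct epoll_event *ev);  /* app specific work */

void *worker_loop(void *arg)
{
    (void)arg;
    int epfd = epoll_create1(0);           /* private epoll instance per thread */
    struct epoll_event evs[256];

    /* This thread's listening socket and all connections it accepts are added
       to epfd; nothing is handed off to other threads. */
    for (;;) {
        int n = epoll_wait(epfd, evs, 256, 1000 /* ms, allows timer processing */);
        for (int i = 0; i < n; i++)
            handle_event(epfd, &evs[i]);   /* accept/read/process/write all here */
    }
    return NULL;
}
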
Load balancing of the incoming connections to the threads: 

There could be multiple ways of doing this.  One way is to let a master thread accept the connections and hand each one to a worker thread for the rest of the processing on that connection.  The master thread can use different load balancing techniques to assign a new connection to the least loaded thread.  Though this approach is simple and confined to the process, it has some drawbacks.  When there are short-lived but large numbers of connections being established and terminated in quick succession, the master thread can become the bottleneck.  Also, cache utilization may not be that good, as the master and worker threads might be running on different cores.  Since the master thread is not expected to do much other than accepting connections, a core may not be dedicated to it, that is, no core may be affined with the master thread.

Another technique is to let each thread listen on its own socket, accept connections and process them.  As we know, we can't have more than one listening socket for a given IP address and port combination.  So, this technique uses as many ports as there are threads; each thread listens on a socket created with a unique port.  It should be noted that external clients should not know about this - client connections to the server will always use the one standard port.  Hence this technique requires an additional feature in the intercept layer.  The intercept layer (typically in the kernel in the case of Linux) is already expected to do IP address translation to ensure the packets are redirected to the proxies.  In addition, it can do port translation too. The port to translate to can be chosen based on the load on each port: for example, the port selection can be round-robin, or it could be based on the least number of connections per port.
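
A sketch of the per-thread listener approach (BASE_PROXY_PORT is a made-up internal port range; the kernel intercept layer is assumed to rewrite the destination port of redirected packets to one of these):

#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

#define BASE_PROXY_PORT 8000    /* hypothetical internal base port */

/* Worker 'idx' listens on BASE_PROXY_PORT + idx.  External clients still connect
   to the standard service port; the intercept layer translates it according to
   its load balancing decision (round-robin, least connections, ...). */
int make_worker_listener(int idx)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    struct sockaddr_in addr;

    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

    memset(&addr, 0, sizeof(addr));
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);       /* redirected packets land here */
    addr.sin_port        = htons(BASE_PROXY_PORT + idx);

    bind(fd, (struct sockaddr *)&addr, sizeof(addr));
    listen(fd, 128);
    return fd;
}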

Do all relevant application processing in one proxy:

Network infrastructure devices are complex. At times, multiple kinds of application processing are required on the same connection. For example, on an HTTP connection, the device may be expected to do HTTP acceleration (such as compression), TCP acceleration and attack checks.   If these are implemented as separate proxy processes, latency increases dramatically, as each proxy terminates the connection and makes a new connection to the next proxy, and the performance of the system goes down. It certainly has one advantage, maintainability of the code, but performance wise it is good to do all application processing in one single process/thread context.

Considerations for choosing Multicore processor

Certainly cost is a factor. Besides cost, other things to look for are -
  • Frequency of the core is very important:  As discussed above, a connection is handled by one thread.  Since a thread can execute on only one core at any time, the performance of a connection is proportional to the processor frequency (speed).  For proxy based applications, higher frequency cores are a better choice than many low powered cores.
  • Cache:  Proxy based applications typically do a lot more processing than typical per-packet applications such as firewall, IPsec VPN etc..   If the cache size is larger, more instruction memory can be cached.   So, the higher the cache size, the better the performance.
  • Division of cache across cores:  Since threads can be affined with cores, it is good to ensure that the data cache is not shared across the cores. Any facility to divide the shared cache into core specific caches would be preferable.
  • Memory mapping of accelerator devices into the process virtual memory:  By having access to the hardware accelerators from user space, one can avoid memory copies between user space and kernel space.
  • Hardware based connection distribution across cores:  This is to ensure that traffic is distributed across cores. The intercepting software in the kernel forwarding layer then need not make any load balancing decisions to distribute the traffic across threads; the intercept layer only needs to translate the port so that the packets go to the right thread.
Other important considerations, needed for any application, are:
  • A facility in hardware to prioritize management traffic at the ingress level: to ensure that the management application is always accessible even when the device is under a flood attack.
  • Congestion management in hardware at the ingress level:  to ensure that buffers are not exhausted by applications that do a lot of processing.
  • Hardware acceleration for crypto, regular expressions and compression/decompression.
Programming considerations for performance
  • poll() and epoll_wait() calls are expensive, so call them as infrequently as possible: once epoll_wait() returns, read as much data as possible from each ready socket, and similarly write as much data as possible to the ready sockets (see the sketch after this list).
  • Avoid locking as much as possible.
  • Avoid pipelining - Adopt run to completion model.
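
For the first point above, a small sketch of draining a ready non-blocking socket in one go rather than returning to epoll_wait() after every read (process_data stands for the application specific handling):

#include <errno.h>
#include <unistd.h>

extern void process_data(const char *buf, ssize_t len);  /* app specific handling */

void drain_socket(int fd)
{
    char buf[16384];

    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n > 0) {
            process_data(buf, n);           /* keep reading while data is available */
        } else if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            break;                          /* nothing more for now; back to epoll */
        } else {
            break;                          /* peer closed (n == 0) or real error */
        }
    }
}
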
I hope this helps developers who are developing proxy based applications.