Wednesday, November 11, 2009

Per CPU data structures and statistics - Improving performance

To improve performance, developers typically define statistics on a per-core basis. They also tend to define a module's internal structures per core wherever possible. Based on the core on which processing is happening, the appropriate "per core" memory block is used to read and update the data.

In the case of statistics, counters are typically updated during event (for example, packet) processing. Since each core updates its own 'per core' counter, no atomic add/increment operations are needed. Reading the statistics is a rare operation, typically initiated by some management function; it requires reading the counters across all per-core blocks and adding them up to give comprehensive data. While the counters of multiple blocks are being summed, some of them may be updated concurrently, so the result can have a small error, but this is normally acceptable.

One example of a module's per-core internal structure is a 'free list'; another is 'timers'. In both examples, a set of blocks is arranged on a per-core basis, typically as an array of linked lists (or some other data structure). Whenever the module requires a free block, it takes a node from the linked list of the core it is running on. When the module frees the block, the freed node is returned to that core-specific linked list. Thus, no lock operations are required when allocating and freeing nodes.

Let me take a scenario to explain this better. Assume a module contains core-specific free lists and one global active list. On an 'add' operation, the module gets a control block node from the free lists and puts it in the active list; on a 'delete' operation, the control block is freed. Also assume that on each event (for example, a packet), which happens very frequently, the control block is first looked up in the active list and then several statistics are updated during event processing. To improve performance, these counters are kept on a 'per core' basis, and based on the core processing the event, the appropriate core-specific counters are incremented.

struct modStatistics {
    int Counter1;
    int Counter2;
};

typedef struct modCb_s {
    struct modCb_s *pNext;
    /* ... some module specific state variables ... */
} modCb_t;

struct modStatistics statsArray[NR_CPUS];

modCb_t *pActiveListHead;  /* the active list itself is not important for this discussion */

Free lists:

struct freeQ_s {
    int Count;
    modCb_t *pHead;
    modCb_t *pTail;
};

struct freeQ_s freeLists[NR_CPUS];
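
Building on the free-list idea above, here is a minimal sketch of lock-free allocation and free routines. The function names modAllocCb/modFreeCb are illustrative (not from any kernel API), and the sketch assumes the caller has preemption disabled, or runs in softirq context, so that smp_processor_id() stays stable across the operation:

/* Allocate a control block from the executing core's free list. */
static inline modCb_t *modAllocCb(void)
{
    struct freeQ_s *q = &freeLists[smp_processor_id()];
    modCb_t *node = q->pHead;

    if (node != NULL) {
        q->pHead = node->pNext;          /* unlink from this core's list */
        if (q->pHead == NULL)
            q->pTail = NULL;
        q->Count--;
    }
    return node;                         /* NULL: this core's list is empty */
}

/* Return a control block to the executing core's free list. */
static inline void modFreeCb(modCb_t *node)
{
    struct freeQ_s *q = &freeLists[smp_processor_id()];

    node->pNext = NULL;
    if (q->pTail != NULL)
        q->pTail->pNext = node;          /* append at the tail */
    else
        q->pHead = node;
    q->pTail = node;
    q->Count++;
}

Note that neither path takes a spin lock or uses an atomic operation.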

Though this 'per core' mechanism gives good performance by avoiding atomic operations and locks, it can have cache related performance implications.

In the above example, let us take the statistics counters. Typically they get incremented using statsArray[current_cpu].Counter1++. This operation involves:
  • reading the counter value,
  • incrementing it, and
  • storing back the new value.
Since all the counters sit together, some of them are bound to fall into the same cache line. As you know, CPU caches work in units of cache lines: a core typically fetches a full cache line from memory on a read. When one core updates its counter, it invalidates the cache line holding that counter in the other cores' caches. So when another core tries to increment a counter belonging to its own statistics block, it may have to fetch the line from memory again because of the invalidation by the first core, even though the first core did not touch any memory location belonging to the second core. This is the classic false-sharing problem.

Similarly, whenever the pHead and pTail pointers in the above example get updated, other cores might end up reading from DDR even though, as per the module logic, only core-specific memory is being updated. As you know, fetching from DDR is far more expensive than reading from the caches.

Note that the L1 cache, and in some processors the L2 cache, is provided per core. This problem may not occur with shared caches, but a shared cache inherently has higher latency than a dedicated cache. Moreover, the L1 cache in every processor family I know of is dedicated, and processors with a three-level cache hierarchy (which I think is a very good design) typically make even the L2 dedicated per core. Wherever there are dedicated caches, the above cache-invalidation problem can occur and degrade performance.

To avoid this degradation, programmers typically make each core-specific block cache-line aligned, as in the sketch below. Though this looks good, it has memory implications: at most 'cache line size - 1' bytes can be wasted per core. With, say, 32 cores (quite common today), that can amount to 32 * 127 = ~4K bytes (assuming a 128-byte cache line). Of course this is the worst-case scenario, but it is still wasted memory for every 'per core' instance.
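
As a concrete illustration, one common way to do this in the kernel is the ____cacheline_aligned_in_smp attribute; the wrapper struct below is only a sketch (modStatisticsAligned is an illustrative name) showing both the alignment and where the padding cost comes from:

/* Pad each per-core statistics block out to a full cache line so that
 * two cores' counters never share a line. The cost is up to
 * (cache line size - sizeof(struct modStatistics)) wasted bytes per core. */
struct modStatisticsAligned {
    struct modStatistics stats;
} ____cacheline_aligned_in_smp;

struct modStatisticsAligned statsArray[NR_CPUS];

static inline void modIncCounter1(void)
{
    statsArray[smp_processor_id()].stats.Counter1++;   /* no false sharing now */
}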

In addition, each module needs to declare its array large enough to hold all possible cores; that is, the array must be big enough to accommodate CPUs that may be hot-plugged at run time. This wastes memory in deployments that do not have any hot-pluggable CPUs.

I have also heard that in some processor families the core IDs are not contiguous, and hence per-CPU arrays need to be sized as large as the largest core ID value.

The Linux 'per cpu' infrastructure takes care of the above issues while still providing the benefits of per-core blocks. A module can define its per-CPU blocks using the following facilities provided by Linux.

  • DEFINE_PER_CPU(struct modStatistics, statsArray): This is in place of 'struct modStatistics statsArray[NR_CPUS]'.
  • DEFINE_PER_CPU(struct freeQ_s, freeLists): This is in place of 'struct freeQ_s freeLists[NR_CPUS]'.
Access functions:
  • __raw_get_cpu_var(statsArray) or __get_cpu_var(statsArray) evaluates to the executing CPU's instance of statsArray.
  • per_cpu(statsArray, cpuid) evaluates to the statsArray instance of CPU 'cpuid'.
Using the above, a statistic can be incremented as __get_cpu_var(statsArray).Counter1++. To compute the comprehensive statistics, one could use the following code block.

totalcount1 = 0;
for (core = 0; core < NR_CPUS; core++)
{
    totalcount1 += per_cpu(statsArray, core).Counter1;
}
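
The fixed 0..NR_CPUS loop assumes dense CPU numbering; in the kernel one would more typically iterate only over the CPUs that can actually exist. A sketch of the same aggregation using the for_each_possible_cpu() iterator (same variable names as above):

int core, totalcount1 = 0;

for_each_possible_cpu(core)
    totalcount1 += per_cpu(statsArray, core).Counter1;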

Let us analyze what is happening under the hood.

Linux keeps all the per-CPU blocks defined using DEFINE_PER_CPU, across all modules, in one memory region. This memory region is duplicated for every CPU in the system. When a CPU is hot-plugged at a later time, the region is duplicated again for that CPU.

At link time there is only one copy of these per-CPU blocks. During init, Linux duplicates this linker-built block as many times as there are CPUs.

The __get_cpu_var, per_cpu() and __raw_get_cpu_var macros, based on the CPU ID, get hold of the base of the memory region corresponding to that CPU and add the offset of the per-CPU variable within the region to arrive at the CPU-specific memory block.
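
Conceptually (this is a simplified model, not the exact kernel macro definitions), the address calculation looks like the following, where __per_cpu_offset[cpu] records how far that CPU's duplicated region sits from the linker-built copy; my_per_cpu is just an illustrative name:

/* Simplified model of per_cpu(): take the link-time address of the
 * variable and shift it by that CPU's region offset. */
#define my_per_cpu(var, cpu) \
    (*(typeof(&(var)))((char *)&(var) + __per_cpu_offset[(cpu)]))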

Since Linux groups all modules' instances together on a per-CPU basis, one CPU's region never shares a cache line with another CPU's region, so updates to one CPU's variables do not invalidate cache lines held by other cores. And since all modules' core-specific blocks are packed together, there is no need for per-block cache alignment, which saves memory.

I referred to a few online write-ups on this topic and, of course, the Linux source code itself.

I hope it is useful. Happy multicore programming.


Sunday, November 8, 2009

RCUs - Eliminate reference counts & increase maintainability

Reference counts in application-specific blocks (records) are used for many purposes in a multicore environment. The main reason for a reference count variable in a record is to ensure that the memory block is freed only after every module and core has stopped using that record. External modules normally increment the record's reference count before storing its pointer in their own storage. Cores using the record likewise increment the reference count, do their processing, and decrement it once the pointer is no longer required. This ensures that a core deleting the record does not free it to the heap prematurely; the record gets freed only when its reference count drops to zero. But bookkeeping records with reference counts has many problems associated with it.

Let us take a dummy packet-processing module 'dmod' running on a multicore SMP operating system. It maintains its records in a hash list, drecList; its record type is called 'drec'. Let us also assume there is one module external to it, called 'emod'. 'emod' has its own records of type 'erec', maintained in erecList. For discussion's sake, assume that every time a 'drec' is created, 'dmod' calls a function in 'emod' which in turn stores the 'drec' instance in its 'erec' instance. There are two paths for packets: 'dmod' gets packets from the link layer, and it also gets packets from 'emod'. 'emod' calls the 'dmod' packet-processing function with the packet and the 'drec' pointer that was stored in its record.

Traditionally with spinlocks, it would be implemented as follows:

dmod:

The 'drec' structure contains two bookkeeping variables, 'delete_flag' and 'refCount'. When delete_flag is true and refCount is 0, the record gets freed.
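
For reference, the traditional 'drec' bookkeeping could look roughly like this; the struct layout and lock declarations are illustrative, just to match the pseudocode that follows:

typedef struct drec_s {
    struct drec_s *pNext;        /* linkage in the drecList hash bucket */
    int            delete_flag;  /* set when deletion of this record starts */
    int            refCount;     /* list + emod + in-flight packet references */
    /* ... module specific state ... */
} drec_t;

rwlock_t   drecListLock;         /* read/write lock protecting drecList */
spinlock_t drecRefCntLock;       /* protects refCount and delete_flag */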

CreateDrecInstance()
{
  • Allocate memory for drec.
  • write_lock_bh(drecListLock);
  • Add the drec to the drecList
  • Set the RefCount to 1 and Delete_flag to 0;
  • IncRefCount(drec), because the record is going to be referred to by the caller beyond write_unlock_bh
  • write_unlock_bh(drecListLock);
  • CreateERecInstance( drec );
  • return 'drec' instance to the caller.
}


DeleteDrecInstance( drec )
{
  • write_lock_bh(drecListLock);
  • Remove from linked list
  • DecRefCount(drec), releasing the reference that was set to 1 when the record was added to the linked list in 'CreateDrecInstance'.
  • write_unlock_bh(drecListLock);
  • DeleteDrec(drec);
}


SearchDrec(parameters)
{
  • read_lock_bh(drecListLock);
  • Search with parameters for matching drec;
  • if found, IncRefCountIfNotDeleted(drec) - since it is used by caller, reference count needs to be incremented here. If the record is already set for deletion, it is treated as record unmatched.
  • read_unlock_bh(drecListLock);
  • return matching 'drec' instance pointer.
}

IncRefCount(drec)
{
  • spin_lock_bh(drecRefCntLock);
  • increment drec->refCount
  • spin_unlock_bh(drecRefCntLock);
}


IncRefCountIfNotDeleted(drec)
{
  • spin_lock_bh(drecRefCntLock);
  • if ( drec->delete_flag is true ) drec = null;
  • else increment drec->refCount;
  • spin_unlock_bh(drecRefCntLock);
  • return drec;
}

DecRefCount(drec)
{
  • freeflag = false;
  • spin_lock_bh(drecRefCntLock);
  • decrement drec->refCount;
  • if ( drec->delete_flag is true and drec->refCount is 0 ) freeflag = true;
  • spin_unlock_bh(drecRefCntLock);
  • if ( freeflag is true) Free drec memory to heap.

}

DeleteDrec(drec)
{
  • freeflag = false;
  • Call EmodDereference(drec): ask the external module to remove its reference to this record. Note that the reference may not be removed within this function call; the function might just generate an event and return, and that event, when processed later in some other thread context, removes the reference to drec.
  • spin_lock_bh(drecRefCntLock);
  • set drec->delete_flag to true;
  • if ( drec->delete_flag is true and drec->refCount is 0 ) freeflag = true;
  • spin_unlock_bh(drecRefCntLock);
  • if ( freeflag is true) Free drec memory to heap;
}


DmodPacketProcess(pkt)
{
  • Get search parameters from 'pkt'.
  • drec = SearchDrec(parameters);
  • if ( drec is null ) drec = CreateDrecInstance(parameters);
  • DoPktProcessing(drec, pkt);
  • DecRefCount(drec);
}

It is expected that the 'emod' function stores the 'drec' instance pointer under its own spin lock, and that it increments the drec reference count before using the stored pointer.

Two spin locks are taken on a per-packet basis: one list lock and another to increment the reference count. As you may know, locks are expensive even when there is no contention; the number of instructions executed in the lock/unlock functions is significant.

RCUs alone work fine and can eliminate the list locks, and also the need for reference counts, as long as no external module stores the records by reference. In the earlier example, 'emod' stores the drec pointer, and since the external module may not remove its reference to the drec within the current processing cycle, reference counts are required. Using an array with magic numbers for external-module references eliminates the need for reference counts.

Reference count mechanisms are one major source of bugs in multicore environments, and eliminating the need for them would be welcomed by many software developers. It is not good practice to let an external module store a reference (pointer) to another module's records. The rest of this article describes one technique to avoid this.

RCUs with array-level indirection eliminate the reference counts in records. Let me take 'dmod' to explain the concept. In addition to maintaining the hash list, the module also needs one array whose size is the maximum number of records possible in the module. Each array element holds a 'drec pointer' and a 'magic number'. External modules are never expected to store the pointer to a drec; instead they store the index into the array and the magic number. To validate an element, an external module compares the magic number it stored with the magic number in the array element. If they match, it can assume the 'drec' pointer in the array is valid and use it for the rest of its processing.

struct RCU_indirection {
    unsigned int magic;   /* value 0 means unused; any other value means rec is valid */
    void *rec;
};

dmod:

struct RCU_indirection drecArray[MAX_DRECS];

'drec' need not contain the 'refCount' and 'delete_flag' variables any more, but one new variable, 'index', is required.

Maintain a global magic counter: drecMagic = 1;
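
A sketch of helpers for managing the indirection array may make the pseudocode below easier to follow. The helper names drecSlotAlloc/drecSlotFree are illustrative, and both are assumed to be called with drecListLock held for writing:

/* Grab a free slot, publish the drec pointer and a fresh magic number.
 * Returns the slot index (stored in drec and handed to emod), or -1 if the
 * array is full. The caller reads drecArray[index].magic for the value to
 * pass to CreateERecInstance(). */
static int drecSlotAlloc(void *drec)
{
    int i;

    for (i = 0; i < MAX_DRECS; i++) {
        if (drecArray[i].magic == 0) {        /* 0 means the slot is free */
            drecArray[i].rec   = drec;
            drecArray[i].magic = drecMagic++;
            if (drecMagic == 0)               /* never hand out magic 0 */
                drecMagic = 1;
            return i;
        }
    }
    return -1;
}

/* Invalidate a slot; stale (index, magic) handles held by emod
 * will no longer match. */
static void drecSlotFree(int index)
{
    drecArray[index].magic = 0;
    drecArray[index].rec   = NULL;
}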

CreateDrecInstance()
{
  • Allocate memory for drec.
  • write_lock_bh(drecListLock);
  • Add the drec to the drecList
  • Add the drec and drecMagic in a free array element.
  • Increment drecMagic
  • Put the index into array element in 'drec'
  • write_unlock_bh(drecListLock);
  • CreateERecInstance( index, magic );
  • return 'drec' instance to the caller.
}


DeleteDrecInstance( drec )
{
  • write_lock_bh(drecListLock);
  • Remove from linked list
  • Set the magic to zero in the array element pointed by index in drec.
  • Set the drec to NULL in the array element.
  • wmb();
  • write_unlock_bh(drecListLock)
  • call_rcu ( ) : Callback function associated with it will free the drec.
}


SearchDrec(parameters)
{
  • Search with parameters for matching drec
  • return matching 'drec' instance pointer.
}

DmodPacketProcess(pkt)
{
  • rcu_read_lock_bh();
  • Get search parameters from 'pkt'.
  • drec = SearchDrec(parameters);
  • if ( drec is null ) drec = CreateDrecInstance(parameters);
  • DoPktProcessing(drec, pkt);
  • rcu_read_unlock_bh();
}

When 'emod' calls the 'dmod' packet processing function, it should do the following:

{
  • rcu_read_lock_bh()
  • drec = null; if ( drecArray[erec->drecIndex].magic == erec->magic ) drec = drecArray[erec->drecIndex].rec;
  • if ( drec is not null ) DoPktProcessing(drec, pkt);
  • rcu_read_unlock_bh();
}

As you can see, it is simple and eliminates two locks on a per-packet basis. The scheme works even if there are more external modules. If record pointers need to be stored in other data structures within the module, there is still no need for reference counts as long as those data structures are also manipulated under the same spin lock ('drecListLock' in the example above).

Monday, November 2, 2009

Read-copy-update (RCU) - What are they and how they can be used

There is a very good explanation of RCUs on Wikipedia under the heading "Read-copy-update".

RCU locks are a replacement for read-write spin locks, but they are very fast for read operations. They have overhead similar to write locks for 'add' operations and a little additional overhead for 'delete' operations. Networking applications typically maintain many context blocks (records) in data structures such as hash lists, linked lists and arrays. In many cases these data structures are searched for a matching record on a per-packet basis, while records are created and removed infrequently. RCUs are the best way to protect these kinds of data structures in multicore SMP environments.

The read operation of RW locks in SMP environments is heavy due to the large number of instructions in the lock functions; they consume CPU cycles even when there is no lock contention. That is why RCUs are preferred.

There are mainly five API functions in RCU:

1. rcu_read_lock
2. rcu_read_unlock
3. rcu_assign_pointer
4. rcu_dereference
5. call_rcu/synchronize_rcu

rcu_read_lock and rcu_read_unlock are used while searching for a record in any data structure (linked list, array etc.). While searching, you typically walk the list via its next/prev pointers, and these pointers must be dereferenced using rcu_dereference. In general, the pointers that are modified with rcu_assign_pointer in add and delete operations are the ones that need to be dereferenced with rcu_dereference while traversing the list. These two functions mainly take care of weakly ordered processors, which execute instructions and write to memory out of order; if you look at their bodies, they execute memory barrier instructions. Many RCU implementations also have very few instructions in rcu_read_lock and rcu_read_unlock; in Linux, the RCU read lock mostly just disables preemption (if preemption is enabled). A writer task is expected to use normal spin locks to protect the data structure from other writers. A write can be either the addition or the deletion of a record. If readers can run in softirq context, writers must use the 'bh' versions of the spin locks. A minimal list sketch follows.
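
To make the API concrete, here is a minimal sketch of an RCU-protected singly linked list; the 'rec' structure, recListHead, recLock and the function names are illustrative, not taken from the kernel:

#include <linux/rcupdate.h>
#include <linux/spinlock.h>

struct rec {
    struct rec *next;
    int         key;
    /* ... payload ... */
};

static struct rec *recListHead;     /* written only under recLock */
static DEFINE_SPINLOCK(recLock);    /* serializes writers only */

/* Reader side: the caller is expected to hold rcu_read_lock() (or the
 * _bh variant) around the lookup and the subsequent use of the record. */
static struct rec *rec_lookup(int key)
{
    struct rec *r;

    for (r = rcu_dereference(recListHead); r != NULL;
         r = rcu_dereference(r->next)) {
        if (r->key == key)
            return r;
    }
    return NULL;
}

/* Writer side: insert at head. rcu_assign_pointer() makes sure the node
 * is fully initialized before readers can reach it through recListHead. */
static void rec_insert(struct rec *r)
{
    spin_lock_bh(&recLock);
    r->next = recListHead;
    rcu_assign_pointer(recListHead, r);
    spin_unlock_bh(&recLock);
}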

RCU's read side depends on the fact that an aligned word assignment to a memory location is atomic, which is true on practically all processors. The reason for the memory barrier in rcu_assign_pointer is to ensure that everything the new pointer refers to is visible in cache/memory before other cores can dereference the assigned value.

Whenever a writer removes a record from the application data structure, it needs to call synchronize_rcu or call_rcu. synchronize_rcu blocks until the grace period completes; call_rcu is the asynchronous mechanism, used when the application wants to get control back after all cores have completed their current scheduling cycle, which means the record can no longer be accessed by any core. call_rcu takes a callback function pointer and an argument (in Linux, an rcu_head embedded in the record), and the callback is invoked after all cores have completed their current scheduling cycle; it can then take further action such as freeing the record. If the record is still referred to by pointers stored in other application records, the callback cannot free the memory until all such references are removed; in those cases a reference count needs to be maintained in the record in addition to RCU.
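
Continuing the sketch above, deletion would unlink the record under the writer lock and then hand it to call_rcu(); the record embeds a struct rcu_head so the callback can recover it with container_of() and free it only after the grace period. Again, the names are illustrative:

#include <linux/rcupdate.h>
#include <linux/slab.h>

struct rec {
    struct rec     *next;
    int             key;
    struct rcu_head rcu;    /* needed by call_rcu() */
};

/* Runs after every CPU has gone through a quiescent state, so no
 * reader can still be holding a pointer to this record. */
static void rec_free_rcu(struct rcu_head *head)
{
    kfree(container_of(head, struct rec, rcu));
}

/* 'prev' is the element preceding 'r' in the list, or NULL if 'r'
 * is at the head. */
static void rec_delete(struct rec *prev, struct rec *r)
{
    spin_lock_bh(&recLock);
    if (prev)
        prev->next = r->next;                 /* readers see either old or new */
    else
        rcu_assign_pointer(recListHead, r->next);
    spin_unlock_bh(&recLock);

    call_rcu(&r->rcu, rec_free_rcu);          /* free after the grace period */
}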

Sunday, November 1, 2009

Application Delivery Controllers versus Server Load balancing devices



Server Load Balancing (SLB):


Let us first revisit the "Server Load Balancing" concept and its features. Server load balancing was invented during the initial stages of the web, in the late 1990s and early 2000s. At that time the pages served were mostly static, so simple Layer 4 load balancing was good enough to balance TCP connections across multiple servers in the server farm. The load balancing device is typically assigned a public IP address. When a client connection lands on this public IP address, the device figures out the best server to serve the connection and hands the connection over to the server by doing "Destination NAT", that is, the destination IP address of packets generated by the remote client is translated to the IP address of the selected server. To ensure that packets going from the server to the client are accepted by the client, the response packets undergo source IP address translation, so clients only ever see the public IP address in response packets. Now, let us discuss some SLB features.
  • MAC Level Redirection : When the servers in the server farm have public IP addresses, this option can be used instead of "Destination NAT". The Server Load Balancing (SLB) device should be placed in the traffic path for this mode to work. In this mode, the SLB device selects a server in the farm and replaces the destination MAC address with the server's MAC address. Optionally it can also change the source MAC to the SLB device's MAC.
  • Additional Source NAT : This mode is useful in asymmetric routing topologies. When destination NAT is done on client packets, source IP translation must be done on the server-to-client traffic, and for that to happen the server-to-client traffic must pass through the same SLB device that did the DNAT on the client traffic. Source NAT is therefore also done on the client traffic, using the SLB device's IP address, to ensure that the return traffic comes back to the SLB device even if the servers' routing tables are configured with some other default gateway.
  • Multiple different algorithms to select the server: Based on SLB deployment and need, several different ways of scheduling connections to server can be chosen. Following are some of the methods.
  • Round Robin : Each server is treated the same from a capability perspective and servers are chosen in round-robin fashion. If there are 2 servers in the server farm, alternate connections are given to each server.
  • Weighted Round Robin : Typically each server is given a weight from 1 to 10; a weight of 10 is assigned to a server that can serve 10 connections for every 1 connection served by a server with weight 1. The next server is not chosen until 'weight' consecutive connections have been assigned to the current server.
  • Least Connections : Both the Round Robin and Weighted Round Robin methods assume that all connections are roughly equal in duration, in the amount of traffic sent, and in the processing power they consume on the server. In reality this is not true. In the 'Least Connections' method, a new connection is given to the server that is serving the least number of connections at that time. The SLB device maintains a count of active connections on each server to facilitate this scheduling.
  • Weighted Least Connections : Not all servers can be assumed to be equally powerful, and this feature allows deploying servers with different processing capabilities. Each server is given a weight; if server1 and server2 are given weights of 2 and 5, server2 is expected to serve 5/2 times as many simultaneous active connections as server1. The SLB device keeps these weights in mind when deciding the best server for a new connection.
  • Fastest Response Time : Though 'Least Connections' and 'Weighted Least Connections' do their best to distribute connections, there can be situations where many high-processing connections land on one server. Any new connection going to that server can overwhelm it, and those connections may not be served well; a single connection doing a deep data-mining function could overwhelm the server. To handle these situations, it is better to monitor the processing power left in each server for new connections, and what better way than monitoring the servers' response times? Though response time is not a 100% reliable indicator of the least-utilized server, it comes close as long as it is measured continuously. Some SLB devices monitor response time as part of health checking; that is not good enough, since health checks happen only once in a while. Response time measurements must be made on responses to real client traffic.
  • Fill-and-go-next : Green is the buzzword now - save power and save on cooling costs. Round Robin and the other methods above distribute traffic across all servers, so even under light load every server is in use and none gets a chance to enter a power-saving mode. When this mode is chosen, the SLB device drives one server to full utilization before selecting the next one. Typical configuration includes the number of connections a server can handle and when to start warming up the next server. For example, with the numbers 10000 and 8000, once the existing server reaches 8000 active connections the next connection is given to the next server to bring it up to full power; both servers are used up to 18000 active connections, then another server is warmed up, and so on. This mode can be combined with Round Robin, which is then used to balance the traffic across all warmed-up servers.
  • SYN Flood Detection and Prevention (DDoS Prevention): This feature has been common in servers for the last few years, but detection and prevention at a central place, the SLB, means administrators need not worry about the feature's availability in every server or about enabling it on each one. It prevents attackers from sending large numbers of SYN packets without completing the TCP connection establishment phase. SLB devices detect the SYN flood attack without consuming their own resources by using the SYN-cookie mechanism; only when the three-way handshake is complete is the connection handed to a server in the server farm.
  • Traffic Anomaly Detection and Traffic Policing: DDoS attacks such as SYN floods are really a thing of the past. Newer kinds of attacks complete TCP connections and issue requests that consume CPU power, thereby creating denial of service for genuine users. Since these attacks complete the connection, it is possible to track the attacking users by their IP addresses. The traffic anomaly detection feature allows administrators to configure baseline traffic characteristics based on normal-day activity. Based on this configuration, SLB devices detect any anomaly and can take corrective actions such as:
  • Throttling the traffic coming from identified IP addresses : Throttling the traffic can be on "Simultaneous active connections", "Connection Rate", "Packet/Byte Rate".
  • Blocking the traffic from identified IP addresses for certain time.
  • Health Checking: SLBs are expected to send traffic only to servers that are online. To ensure this, SLBs typically run health checks against the servers. Some of the health checks that can be configured are:
  • ICMP Ping : Using this check SLB knows that the server machine is active.
  • TCP Connection Checks : These checks let SLBs know that a given TCP service is running. Note that the machine may be alive while the user process listening for TCP connections is dead; this check verifies the TCP server's liveness.
  • Application Health Checks: These checks let the SLB know whether the application on top of the TCP service is working well. Note that the TCP server may be running, but the application that acts on the data could be a different user process from the one listening for TCP connections. Hence it is important to enable these health checks in SLB devices.
  • Persistence: For some applications it is necessary that all connections coming from one client go to the same server, typically because the application maintains state across that client's connections. This requires temporary, dynamic binding state in the SLB device: when a connection has no matching binding, a server is chosen for it and the client IP address is immediately bound to the chosen server's IP address, and any new connection from this client goes to the same server without running the server-selection process. To ensure these bindings do not consume large amounts of memory on the SLB device, they are removed after a certain period of inactivity. As described later, for HTTP connections the bindings can be implemented using cookies, without the need to store binding state in the SLB; the cookie received in each request indicates the bound server.
Application Delivery Controllers (ADC):

Server load balancing based on L3-L4 information (source IP, destination IP and ports) does not solve all the issues of the different kinds of applications running on servers; L7 intelligence is required in the SLB device to meet the load balancing requirements of newer applications. Since they use L7 intelligence in making the load balancing decision, I guess these devices came to be called 'Application Delivery Controllers', or ADCs for short.

First-generation ADCs tried to solve challenges associated with e-commerce and session-persistent HTTP-based applications. One challenge is to ensure that all of a client's connections go to the server chosen for that client's first connection; this is necessary because the application on the server maintains the user's state in memory and updates it with the user's selections until the user logs out. The challenge has a further twist when these connections are SSL encrypted.

First generation ADCs solve these challenges using HTTP protocol intelligence in ADC devices.
  • Persistence using HTTP Cookies : Though IP-based persistence works decently, it fails when a user's IP address changes across connections. This can happen if the user's router gets a new IP address during the life of the session; another possibility is the use of different NAT addresses by some ISPs, where one NAT IP address is used for one connection and another for other connections of the same user. Because of these issues IP address persistence does not work in all cases, and it also occupies memory in the device. The alternative for HTTP-based applications is HTTP cookies. Cookies let servers relate connections belonging to the same session, and ADCs use the same mechanism to relate user connections: the ADC defines its own cookie that stores the selected server in encrypted/encoded form. When a new HTTP connection is received, the ADC checks whether its cookie is present. If not, it assumes this is the first connection of the session and adds a cookie, carrying the encoded/encrypted selected-server information, to the response going to the client. Further connections from the same user to this site carry the cookie, and its value is used to map the connection to the same server. Since this cookie belongs to the ADC, it is removed from the request before the request is sent to the server.
  • SSL Termination : E-commerce web sites require SSL/TLS security as well as session persistence. To facilitate this, ADCs provide SSL termination capability, which additionally offloads CPU-intensive SSL processing from the servers, saving server CPU cycles for application processing.
Second-generation ADCs extend the above functionality with HTTP optimization, server selection based on L7 protocol data, threat security features and multiple virtual instances.

HTTP Optimization features:
  • HTTP Connection Pooling : The HTTP protocol allows multiple transactions in one TCP connection. This feature multiplexes many client connections onto a few connections to the server. Without it, the ADC opens as many TCP connections to the server as the client connections it is terminating; with it, there are only a few TCP connections towards the server, thereby saving some CPU cycles and memory on the servers. Though the amount of saving is debatable, this is one feature many ADCs support.
  • HTTP Compression: HTTP compression using gzip and other algorithms saves bandwidth by compressing response data. ADC devices offload this capability from the servers, thereby making server CPU cycles available for application processing.
  • HTTP Caching & Delta Encoding: HTTP caching between clients and servers avoids downloading a file again if its content has not changed on the server. But there are many cases where the content changes only slightly; delta encoding sends only the difference (delta) to the client rather than the complete content, thereby saving bandwidth. Since delta encoding requires significant CPU cycles, ADCs offload this functionality from the HTTP servers.
Programmatic L7 Protocol content based Server Selection:

SLBs traditionally don't look at the application protocol data to make server selection. ADCs augment the criteria for selecting the server with protocol data such as HTTP data. Some examples where protocol intelligence is required:
  • Multiple server farms, each farm having different content.
  • Different server farms for Mobile users and Normal users.
  • Different response on errors based on browser type.
  • Prioritization/throttling/shaping of requests to specific URL resources upon congestion.
  • Addition/Modification of request/responses based on the type of applications running on the server farm.
  • Combination of above.
Since each deployment has its own requirements, a single configuration solution does not fit all. Hence ADCs provide a flexible 'rules' mechanism, similar to the way IPS devices provide rules for detecting vulnerabilities. Administrators create rules using the application-protocol (e.g. HTTP) keywords exposed by the ADC; typically each protocol element is exposed as a keyword. Multiple rules can be created, each with keywords and associated values. The ADC internally matches the connection data against the rules, and the first rule whose keyword values match is used to take the actions specified in that rule. Actions can be selection of a server farm, traffic policing, and so on.

Threat Security:

Since ADCs are becoming a central location for all requests going to the data center servers, they are a natural place to enforce threat security, thereby offloading that functionality from the servers. Some of the threat security functions ADCs support are:
  • Traffic Policing based on protocol intelligence: ADCs with protocol intelligence can do a better job of DDoS prevention than SLBs. The baseline traffic profile can go well beyond the typical 5-tuple; it can include URL patterns, User-Agent, cookie values and more.
  • Web Application Firewall: Attackers exploit many vulnerabilities in web server applications, including SQL injection, XSS (Cross-Site Scripting), LFI (Local File Include) and RFI (Remote File Include). ADCs are increasingly offloading this protection from the servers; it not only improves server performance, but also lets administrators configure signatures in one place, the ADC.
Virtual Instances:

This feature is not common in ADCs today, but I believe it will become critical going forward, so that multiple server farms belonging to multiple domains of different customers can be supported by one or a few ADCs. One ADC device should be able to support multiple customers' server farms in public data centers. Two types of virtual instances are possible. In the first type, one executable image supports multiple instances: configuration data and runtime data are stored separately per instance, and the UI/CLI provides role-based access to each virtual instance's configuration. In the second type, there are multiple images, one per virtual instance; even if one virtual instance has a problem, the other instances are not affected, providing good isolation. The first type can provide a large number of instances, whereas the second is limited to a few tens. I personally think the Linux container approach is better for the second type, as it is lean and uses a common operating system and TCP/IP stack image; only the user processes are instantiated once per container.

Third generation ADCs:

What features would one expect from third-generation ADCs? I can think of a few, based on where the data center market is heading.
  • Cloud Computing : Traditionally ADCs are configured with all servers in a given server farm. It works well when manual intervention is required to setup/take-off servers from server farm. Administrators are used to configure the ADCs accordingly. In Cloud computing, servers are no longer physical. They are virtual. They can be added and removed dynamically and programmatically. They also can be disabled and enabled dynamically without manual intervention. Some use cases - virtual instances are brought up and down based on the load on the servers or based on time of day, holidays etc.. Since they are coming up and going down dynamically, it is expected that ADCs also would get this information dynamically.
  • Beyond HTTP Servers (SIP, CIFS, SMTP, IMAP etc..): HTTP had been and is dominant in data centers today. It is changing with cloud computing. Many organizations are expected to host their internal servers too in the cloud. They include mail servers, File Servers etc.. Third generation ADCs are expected to balance the traffic across Email Servers, File Servers etc.. In addition it is expected to have optimization features to save network bandwidth. File Servers today dominantly support CIFS. ADCs are expected to have CIFS proxy that can save bandwidth by doing delta encoding, caching and de-duplication etc..