Sunday, January 22, 2012

IP Fragmentation versus TCP segmentation

Ethernet controllers are becoming more intelligent with every generation of NICs.  Intel and Broadcom have added many features to their Ethernet NIC chips in the recent past, and multicore SoC vendors are adding a large number of features to their Ethernet I/O hardware blocks.

TCP GRO (Generic Receive Offload, formerly called Large Receive Offload) and GSO (Generic Segmentation Offload, formerly called TCP Segmentation Offload) are two newer features (in addition to FCoE offloads) one can see in Intel NICs and many multicore SoCs.  These two features are good for any TCP termination application on the host processors/cores, because they reduce the number of packets traversing the host TCP/IP stack.

TCP GRO works across multiple TCP flows:  it aggregates consecutive TCP segments (based on TCP sequence number) of a flow into one or a few larger TCP packets in the hardware itself, thereby sending far fewer packets to the host processor.  As a result, the TCP/IP stack sees fewer inbound packets.  Since per-packet overhead is significant in TCP/IP stacks, fewer packets use fewer CPU cycles, leaving more CPU cycles for applications and increasing the performance of the overall system.
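To illustrate the idea, here is a minimal Python sketch of the kind of coalescing GRO does in hardware; the packet representation and function names are mine, purely for illustration, and consecutive in-order segments of a flow are merged by sequence number before the stack ever sees them.

```python
# Minimal sketch of GRO-style coalescing: merge consecutive in-order TCP
# segments of one flow into a single larger segment before handing it to
# the stack.  The (seq, payload) packet representation is illustrative only.

def coalesce_segments(segments):
    """segments: list of (tcp_seq, payload_bytes) for a single TCP flow,
    in arrival order.  Returns a shorter list of merged segments."""
    merged = []
    for seq, payload in segments:
        if merged:
            last_seq, last_payload = merged[-1]
            # Consecutive segment: its sequence number continues the previous
            # one, so append the payload instead of queuing another packet.
            if seq == last_seq + len(last_payload):
                merged[-1] = (last_seq, last_payload + payload)
                continue
        merged.append((seq, payload))
    return merged

# Four 1460-byte wire segments collapse into one 5840-byte packet.
segs = [(1000 + i * 1460, b"x" * 1460) for i in range(4)]
print(len(coalesce_segments(segs)))   # -> 1
```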

The intention of TCP GSO is similar to TCP GRO, but for outbound packets.  The TCP layer typically segments packets based on the MSS value, which is in turn derived from the PMTU (Path MTU) value.  Since the TCP and IP headers take 40 bytes, the MSS is typically (PMTU - 40) bytes; if the PMTU is 1500 bytes, the resulting MSS value is 1460.  When an application tries to send a large amount of data, the data is segmented into multiple TCP packets where each TCP payload carries up to 1460 bytes.  The TCP GSO feature in the hardware eliminates the need for the TCP layer to do this segmentation, reducing the number of packets that travel between the TCP layer and the NIC.  The hardware typically expects the MSS value along with the packet and does everything necessary internally to segment the data and send the segments out.
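For concreteness, here is a tiny Python sketch of the arithmetic involved, assuming the usual 20-byte IPv4 and 20-byte TCP headers without options (the function names are mine):

```python
def mss_from_pmtu(pmtu, ip_hdr=20, tcp_hdr=20):
    """MSS is the path MTU minus the IP and TCP header sizes."""
    return pmtu - ip_hdr - tcp_hdr

def segment(payload, mss):
    """Split an application write into MSS-sized TCP segments, the way the
    TCP layer (or a GSO-capable NIC given the MSS) would."""
    return [payload[i:i + mss] for i in range(0, len(payload), mss)]

mss = mss_from_pmtu(1500)            # -> 1460
chunks = segment(b"a" * 64 * 1024, mss)
print(mss, len(chunks))              # a 64 KB write becomes 45 segments
```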

Ethernet controllers are also increasingly providing support for IP-level fragmentation and reassembly.  The main reason is the increasing popularity of tunnels.

With the increasing usage of tunnels (IPsec, GRE, IP-in-IP, Mobile IP, GTP-U, and the upcoming VXLAN and LISP), packet sizes are going up.  Though these tunnel protocol specifications provide guidelines to avoid fragmentation using the DF bit and PMTU discovery, that does not happen in reality.  There are very few deployments where the DF (Don't Fragment) bit, which is required for PMTU discovery, is used.  As far as I know, almost all IPv4 deployments fragment packets during tunneling.  Some deployments configure network devices to do red-side fragmentation (fragmentation before tunneling, so that each tunneled packet carries a whole inner IP packet) and some go for black-side fragmentation (fragmentation after tunneling is done).  In the receive direction, reassembly happens either before or after detunneling.

It used to be the case that fragmented packets were given lower priority by service providers during network congestion.  With high-throughput connectivity and a growing customer base, service providers are competing for business by providing very good reliability and high throughput.  Due to the popularity of tunnels, service providers are also realizing that dropping fragmented packets may result in a bad experience for their customers.  It appears that service providers are no longer treating fragmented packets in a step-motherly fashion.

Both IP fragmentation and TCP segmentation offload can be used to reduce the number of packets traversing the TCP/IP stack in the host.  The next question is how to tune the TCP/IP stack to use these features and how to divide the work between the two hardware features.

The first thing to tune in the TCP/IP stack is to remove the MSS dependency on the PMTU.  As described above, today the MSS is calculated from the PMTU value, and because of this the TCP stack never uses IP fragmentation for outbound TCP traffic.

TCP segmentation adds both a TCP and an IP header to each segment.  That is, for every 1460 bytes there is an overhead of 20 bytes of IP header and 20 bytes of TCP header.  In the case of IP fragmentation, each fragment carries only its own IP header (20 bytes of overhead).  Since TCP segmentation has more overhead, one can argue that IP fragmentation is better:  set the MSS to a bigger value such as 16K and let the IP layer fragment the packet if the MTU is less than 16K.  This is certainly a good argument, and it works fine in networks where reliability is good.  Where reliability is not good, if one fragment gets dropped, the TCP layer needs to retransmit the entire 16K bytes.  If TCP had done the segmentation, it would only need to retransmit the one lost segment.
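A back-of-the-envelope comparison of the two approaches, as a Python sketch under the simplifying assumptions of 20-byte IPv4/TCP headers, a 1500-byte MTU, and ignoring the 8-byte alignment rule for fragment payloads:

```python
def tcp_segmentation_overhead(payload, mtu=1500, ip=20, tcp=20):
    """Every segment carries both an IP and a TCP header."""
    mss = mtu - ip - tcp                         # 1460
    segments = -(-payload // mss)                # ceiling division
    return segments * (ip + tcp)

def ip_fragmentation_overhead(payload, mtu=1500, ip=20, tcp=20):
    """One TCP header for the whole packet; each fragment carries only an
    IP header.  Fragment payload alignment is ignored for simplicity."""
    per_frag = mtu - ip                          # 1480
    fragments = -(-(payload + tcp) // per_frag)
    return tcp + fragments * ip

payload = 16 * 1024
print(tcp_segmentation_overhead(payload))        # 12 segments * 40 = 480 bytes
print(ip_fragmentation_overhead(payload))        # 20 + 12 fragments * 20 = 260 bytes
```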

There are advantages and disadvantages with both approaches. 

With the increased reliability of networks and no special treatment of fragmented traffic by service providers, IP fragmentation is not a bad thing to do.  And of course, one should still worry about retransmissions.

I hear of a few tunings based on the deployment.  Warehouse-scale data center deployments, where the TCP clients and servers are in a controlled environment, tune the MSS to 32K or more with a 9K (jumbo frame) MTU.  I think that, for a 1500-byte MTU, going with an 8K MSS may work well.


Saturday, January 21, 2012

SAMLv2 mini-tutorial and some good resources

Why SAML  (Security Assertion Markup Language)?

Single Sign-On:  Many organizations have multiple intranet servers with web front ends.  Though all servers interact with a common authentication database such as LDAP or a SQL database, each server used to take authentication credentials from the employee separately.  That is, if an employee logs into server1 and then goes to server2, the employee is expected to provide credentials again to server2.  That may sound okay if the user is deliberately going to different servers, but it is not a good experience when there are hyperlinks from one server's pages to other servers.  Single Sign-On solves this issue:  once the user logs into a server successfully, the user does not need to provide his/her credentials when accessing other servers.  Single authentication with single sign-on is becoming a common requirement in organizations.

Cloud Services:  Organizations are increasingly adopting cloud services for cost reasons.  Cloud services are provided by third-party companies; for example, salesforce.com provides a cloud service for CRM and Taleo provides a cloud service for talent acquisition.  Companies now have intranet servers plus cloud services from different vendors.  Employees accessing a cloud service should not be asked to create accounts in the cloud service; it is expected that the cloud service provider uses its subscriber organization's authentication database to validate the users signing in.  Many companies would not like to give cloud services direct access to their authentication database.  Also, many companies would not like their employees to provide their organization login credentials to a cloud service for fear of compromising the passwords, especially those of senior and financial executives.  This creates the need for cloud services to redirect employees to log in using the company's own web servers and to use single sign-on mechanisms to allow the users to access cloud services.

SAMLv2 (version 2 is the latest version) facilitates single sign-on not only across the intranet servers of a company, but also across services provided by cloud providers.


How does SAMLv2 work?

There are two participants in the SAML world:  the Service Provider (SP) and the Identity Provider (IdP).  Service Provider refers to the intranet servers and cloud servers; the service provider provides services to authenticated users.  Identity Provider refers to the system which takes login credentials from the user and authenticates the user against a local database, an LDAP database, or any other authentication database.

When the user tries to access resources on a service provider, the SAMLv2 component of the Service Provider (SP) first checks whether the user is already authenticated.  If so, it allows access to its resources.  If the user is not authenticated, then rather than presenting a login screen and taking credentials, it instructs the user's browser (by sending an HTTP redirect response whose Location header contains the identity provider URL) to redirect the request to the identity provider.  Of course, the SAMLv2 component of the SP needs to be configured with the IdP URL beforehand by the administrator.  The SAMLv2 component of the SP generates a SAML request (in XML form) and a 'RelayState', and adds these two to the Location URL of the HTTP redirect response.  The browser then redirects the request to the IdP as per the Location header.  Now the IdP has the SAML request and the RelayState information.
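For illustration, here is a rough Python sketch of how the SP might assemble that redirect (per the HTTP-Redirect binding, the SAML request is deflated, base64-encoded, and URL-encoded); the URLs and the request body below are placeholders, not from any particular implementation:

```python
import base64, urllib.parse, zlib

IDP_SSO_URL = "https://idp.example.com/sso"      # configured beforehand (placeholder)

def build_redirect_url(saml_request_xml, relay_state):
    """Assemble the Location header for the HTTP-Redirect binding:
    raw-DEFLATE the XML, base64-encode it, then URL-encode the parameters."""
    deflated = zlib.compress(saml_request_xml.encode())[2:-4]   # strip zlib header/trailer
    saml_request = base64.b64encode(deflated).decode()
    query = urllib.parse.urlencode({
        "SAMLRequest": saml_request,
        "RelayState": relay_state,     # e.g. the originally requested SP URL
    })
    return f"{IDP_SSO_URL}?{query}"

# The SP answers the browser with "302 Found" and this Location header.
print(build_redirect_url("<samlp:AuthnRequest .../>", "https://sp.example.com/app"))
```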

The IdP first checks whether the user is already authenticated with it.  If not, it sends the login page to the user as part of the HTTP response.  Once the credentials are validated against the authentication database, the SAMLv2 component of the IdP generates the SAML response with the result of authentication and the principal name, and attests the response by signing it with the private key of its certificate.  It can also add more information about the user via attribute statements.  It then sends both the SAML response and the RelayState to the browser as the response to the last HTTP request it got (which is typically the credentials request).  The IdP normally builds an HTML page with an embedded POST request whose target URL is the SP URL (which it got from the SAML request), the SAML response and the RelayState (which it got before), plus JavaScript which the browser uses to send the POST request.  The browser gets this response and, because of the JavaScript, automatically posts the request to the SP.  This POST request contains the SAML response and the RelayState.
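A minimal sketch of the kind of self-submitting page the IdP returns for the POST binding (the URLs and field contents are placeholders):

```python
import base64

def build_autopost_page(acs_url, saml_response_xml, relay_state):
    """Sketch of the self-submitting HTML page the IdP returns; the browser's
    script posts SAMLResponse and RelayState to the SP's Assertion Consumer
    Service URL without any user involvement."""
    saml_response = base64.b64encode(saml_response_xml.encode()).decode()
    return f"""<html><body onload="document.forms[0].submit()">
  <form method="POST" action="{acs_url}">
    <input type="hidden" name="SAMLResponse" value="{saml_response}"/>
    <input type="hidden" name="RelayState" value="{relay_state}"/>
  </form>
</body></html>"""

# acs_url and relay_state come from the earlier SAML request (placeholders here).
page = build_autopost_page("https://sp.example.com/acs",
                           "<samlp:Response .../>", "https://sp.example.com/app")
```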

Now the SAMLv2 component of the SP ensures that the SAML response is valid by verifying the digital signature.  Note that the public key (certificate) of the IdP is expected to be configured in the SP beforehand by the administrator.  The SP then checks the subject name of the principal and the result of authentication.  If the result indicates 'success', it sends an HTTP redirect response to the browser with the original URL (from the RelayState) as the Location header.  This makes the browser go to the original URL to get the response.

From then on, the user is able to access the pages on the SP.  If the user goes to another SP of the company, the same process as above is repeated.  Since the user is already authenticated with the IdP, the IdP does not ask for login credentials again.  Since the user does not see all the transactions happening underneath, he/she gets a single sign-on experience.

By keeping the IdP on enterprise premises, one can be sure that passwords are not given to any SPs, whether they are intranet SPs or cloud service SPs.

SAMLv2 - Other SSO use cases:

The use case described above is called 'SP-initiated SSO'; it is so called because the user tries to access the SP first.  The other SSO use case is 'IdP-initiated SSO', where the user accesses the IdP first.  The IdP authenticates the user, and the user is then presented with links to the different SPs the user can access.  When the user clicks on one of these special links, the SAMLv2 component on the IdP generates an HTTP response containing an HTML page with a POST binding to the SAMLv2 component of the SP (called the Assertion Consumer Service, or ACS, URL).  The SAMLv2 response is provided in the HTML page as one of the POST fields.  The IdP adds a 'RelayState' parameter with the URL the user is supposed to be shown on the SP.  Note that the 'RelayState' in this case is unsolicited:  there is no guarantee that the ACS in the SP will honor this parameter, and there is no guarantee that all ACSes treat the RelayState as a URL.  But many systems do expect a URL in the RelayState (including the Google Apps cloud service), so sending a URL is not a bad thing.

The above two cases, 'SP-initiated SSO' and 'IdP-initiated SSO', are mechanisms to access resources with a single authentication.  'Logout' is another scenario:  when the user logs out (whether on an SP or on the IdP), the user should be logged out on all SPs.  This is called the 'Single Logout Profile'.  The IdP is supposed to maintain, for each user, the list of SPs for which it was contacted.  When the user logs out on the IdP, it sends a logout request to the 'Logout Service' component of every SP in the user's login session.  If the user logs out on an SP and indicates the wish to be logged out from all SPs, then the SP, in addition to destroying the user's authentication context on the SP, also redirects the user's browser to the IdP, and as part of the redirection it frames a logout request to the IdP.  The IdP then destroys its own authentication context for the user and sends logout requests over SOAP to all the other SPs in the user's authentication context.

For more detailed information about SAMLv2, start with the SAMLv2 Technical Overview and the SAMLv2 specifications.  You can find them here:

http://www.oasis-open.org/committees/download.php/27819/sstc-saml-tech-overview-2.0-cd-02.pdf
http://saml.xml.org/saml-specifications

IdP Proxy use case

The use cases described so far involve the SP and IdP interacting via the browser or directly.  SAML does not prohibit the IdP proxy use case, where the IdP proxy works as an IdP for SPs and as an SP for IdPs.  Please see some links on this topic:

https://spaces.internet2.edu/display/GS/SAMLIdPProxy

Open Source SAMLv2 SP,  IdP and IdP Proxy:

I find OpenAM (a fork of Sun OpenSSO) to be one of the more popular open source SSO authentication frameworks.  You can find more information here:

http://forgerock.com/openam.html

IdP Proxy configuration is detailed here:

https://wikis.forgerock.org/confluence/display/openam/SAMLv2+IDP+Proxy+Part+1.+Setting+up+a+simple+Proxy+scenario

https://wikis.forgerock.org/confluence/display/openam/SAMLv2+IDP+Proxy+Part+2.+Using+an+IDP+Finder+and+LOAs

One more good site I found, which details an IdP proxy with source code in Python:

http://www.swami.se/english/startpage/products/idpproxysocial2saml.131.html


Some good links, though not related to IdP proxy:

Google Apps as Service Provider and OpenAM as IDP:  https://wikis.forgerock.org/confluence/display/openam/Integrate+With+Google+Apps

OpenAM as Service provider and Windows ADFS as IDP:  https://wikis.forgerock.org/confluence/display/openam/OpenAM+and+ADFS2+configuration





Need for Pattern Matching Accelerators in UTM devices

The term "network security" typically refers to threat prevention and security on the wire.

Threat protection is normally achieved with multiple security technologies.  Basic protection comes from firewall technology.  IDS/IPS (Intrusion Detection/Prevention Systems), anti-virus, and web application firewalls are some of the security technologies increasingly being used to protect networks (network devices, servers, and client machines).  Application detection is another technology increasingly used alongside the firewall to stop/allow traffic that can't be identified using ports in TCP/UDP headers and that therefore requires deep packet inspection.

Other than the firewall, all the technologies listed above require deep packet and deep data inspection.  IDS/IPS technology adopts multiple techniques to identify attack traffic; one of them is to match the traffic data against known attack patterns.  Application detection also relies on pattern matching over the data as one of its techniques to detect applications.  Anti-virus technology too depends on pattern matching to detect viruses.

In almost all of the technologies above, patterns are added to deployed systems on a continuous basis by device vendors as more attacks are discovered.  For example, IPS devices nowadays have around 10,000 patterns (signatures) to detect known attacks, and the number keeps increasing every year.  Additionally, some of these patterns are checked on every packet that goes through the IPS.  This adds to the number of CPU cycles required to do IPS protection.

Many software algorithms are used to speed up pattern matching performance.  Some of the techniques include:
  • DFA (Deterministic Finite Automata)
  • Bloom filters - filters formed from the hashes of patterns can be used on the traffic to determine whether further analysis is required (a minimal sketch follows this list).
  • PCRE-style algorithms to search for regular-expression patterns.
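As a rough illustration of the Bloom filter idea above, here is a minimal Python sketch (the hash choices and sizes are arbitrary, not from any real IPS) of a pre-filter that lets most innocent traffic skip the expensive full pattern match:

```python
import hashlib

class BloomPrefilter:
    """Tiny Bloom filter over fixed-length pattern prefixes.  A negative
    answer is definitive; a positive answer only means 'run the full,
    expensive pattern matcher'.  Sizes and hashes are illustrative."""
    def __init__(self, patterns, nbits=1 << 16, prefix_len=4):
        self.nbits, self.prefix_len = nbits, prefix_len
        self.bits = bytearray(nbits // 8)
        for p in patterns:
            for h in self._hashes(p[:prefix_len]):
                self.bits[h // 8] |= 1 << (h % 8)

    def _hashes(self, chunk):
        d = hashlib.sha256(chunk).digest()
        return [int.from_bytes(d[i:i + 4], "big") % self.nbits for i in (0, 4, 8)]

    def maybe_contains(self, data):
        """Slide over the data; True means a full match is *possible*."""
        for i in range(len(data) - self.prefix_len + 1):
            chunk = data[i:i + self.prefix_len]
            if all((self.bits[h // 8] >> (h % 8)) & 1 for h in self._hashes(chunk)):
                return True
        return False

pre = BloomPrefilter([b"abc123def", b"cmd.exe"])
print(pre.maybe_contains(b"GET /index.html"))   # likely False: skip deep inspection
print(pre.maybe_contains(b"run cmd.exe now"))   # True: hand off to full matcher
```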

IPS and other technologies also use techniques to reduce the number of patterns to be matched, using protocol-level intelligence and classifying the patterns into multiple buckets (per protocol, per port, and even per application protocol stage, such as separate pattern databases for URLs, HTTP request headers, and HTTP response headers).
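A trivial sketch of such bucketing (the bucket keys and patterns are made up for illustration), where the matcher consults only the signatures relevant to the current protocol context:

```python
# Group signatures into buckets keyed by protocol context, so only the
# relevant subset is matched against a given piece of traffic.
# Bucket names and patterns are illustrative, not a real signature set.
PATTERN_BUCKETS = {
    ("http", "uri"):             [b"../..", b"cmd.exe"],
    ("http", "request_header"):  [b"User-Agent: evil-bot"],
    ("http", "response_header"): [b"X-Backdoor:"],
    ("dns", "query"):            [b"dnscat"],
}

def patterns_for(protocol, stage):
    """Return only the signatures that apply to this protocol stage."""
    return PATTERN_BUCKETS.get((protocol, stage), [])

def scan(protocol, stage, data):
    return [p for p in patterns_for(protocol, stage) if p in data]

print(scan("http", "uri", b"/download/../../etc/passwd"))   # [b'../..']
```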

Because of the above techniques, some device vendors think there is no need for pattern matching hardware accelerators.  There is a reason for that too, as some early developments of snort (the popular open source IDS/IPS software) did not find much performance improvement with hardware accelerators.  But I believe hardware accelerators are required for the following reasons.

Performance determinism:  IPS, anti-virus, web application firewall, and application detection technologies depend on regular signature updates.  Hardware deployed in the field might have X signatures on the day of purchase, and that might go up to 2X or 3X over the years.  CSOs expect performance determinism.  To maintain performance levels, the CPUs should be kept out of pattern matching; hardware accelerators specialized in pattern matching help maintain performance levels even with an increasing number of signatures.

Protection from CPU-hogging attacks:  with software-based pattern matching, it is possible to hog the CPUs by crafting packets whose data matches a pattern many times.  Consider a signature rule which tries to match the pattern "abc123def":  if 1 Mbyte of data is sent consisting entirely of "abc123def" repeated, the CPU will be tied up matching every packet many times over.  The CPU not only spends time matching the patterns, but also spends a significant number of cycles doing further analysis on each match.  Hardware accelerators are normally designed such that performance does not degrade even when there are many matches.
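A small Python sketch of why this hurts a software matcher: the number of hits, and therefore the post-match analysis work, grows linearly with attacker-controlled data:

```python
def count_matches(data, pattern):
    """Naive software scan that reports every occurrence; each hit would
    normally trigger further (expensive) rule evaluation."""
    hits, start = 0, 0
    while True:
        idx = data.find(pattern, start)
        if idx < 0:
            return hits
        hits += 1
        start = idx + 1          # overlapping matches also count

pattern = b"abc123def"
crafted = pattern * (1_000_000 // len(pattern))   # ~1 MB of back-to-back matches
print(count_matches(crafted, pattern))            # 111,111 hits from a single payload
```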

The next question is which capabilities of hardware accelerators one should look for to mitigate these performance issues - the one associated with the explosive growth of attack patterns (signatures), and CPU hogging through deliberate attempts by attackers.  I believe one should look for the following capabilities.
  • Accelerators should be programmable with a decent number of patterns.
  • Accelerators should perform well even with a large number of patterns.
  • Accelerators should perform well even if there are a large number of matches.
  • Accelerators should be able to match patterns based on context information such as 'relative offset' and 'depth of data to look at'.  This reduces the number of results returned by the accelerator; the smaller the number of results handed to software, the less post-processing is required.
  • Accelerators should be able to return results only when multiple patterns match on the data.  This also reduces the number of results.
In summary, pattern matching hardware accelerators are required to reduce CPU hogging, whether it comes from the growth in signatures or from intelligently crafted data sent by attackers.  I feel that end customers should buy UTM/IPS devices that take advantage of these accelerators to ensure that the devices remain usable for at least a few years (future-proofing).