Tuesday, March 4, 2014

SDN Controllers - Be aware of Some critical interoperability & Robustness issues

Are you evaluating robustness of a SDN solution?  There are few tests, I believe one should run to ensure that SDN solution is robust.  Since SDN controller is brain behind traffic orchestration,  it should and must be robust enough to handle various openflow switches from various vendors.  Also, SDN controller must not assume that all openflow switches behave well.

Connectivity tests

Though initial connections to OF controllers are normally successful,  my experience is that connectivity fails in very simple non-standard cases.  Following tests would bring out any issues associated with connectivity. Ensure that the following tests are successful.

  • Controller restart :  Once switch is connected successfully,  restart the controller and ensure that switch connects successfully with the controller. It is observed few times that either switch is not initiating the connection or controller loses some important configuration during restart and does not honor connections from some switches.  In some cases, it is also found that the controller gets the dynamic IP address, but has a fixed domain  name. But switches are incapable of taking the FQDN as the controller address. Hence it is important to ensure that this test suite is successful.
  • Controller restart & controller creating pro-active flows in the switches :  Once the switch is connected successfully and traffic is flowing normally, restart the controller and ensure that switch connects to the controller successfully and traffic continues to flow.   It is observed that many switch implementations remove all the flows when it loses the main connection to the master controller. When the controller comes back up,  it is, normally, responsibility of controller to establish the basic flows (pro-active flows).  Reactive flows are okay as they can get established again upon the packet-in.  To ensure that the controller is working properly across restarts,  it is important to ensure that this test is successful.
  • Controller application restarts, but not the entire controller:  This is very critical as applications typically restart more often due to upgrades and due to bug fixes.  One must ensure that there is no issue with traffic related to other applications on the controller. Also,  one must ensure that there are no issues when the application is restarted either with new image or with existing image.  Also, one should ensure that there are no memory leaks.
  • Switch restart :  In this test, one should ensure that switch reconnects back successfully with the controller once it is restarted.  Also, one must ensure that the pro-active flows are programmed successfully by observing that the traffic continues to flow even after the switch restarts.
  • Physical wire disconnection :  One must also ensure that the temporary discontinuity does not affect the traffic flow.  It is observed that in few cases switch realizes TCP termination and controller does not. Some switches seems to be removing the flows immediately after the main connection breaks and when connected again,  either it does not get the flow states immediately as yet times controller itself may not have known the connection termination. It is observed that any new connection coming from the switch is assumed to be duplicate and hence does not initiate flow setup process. It is required that the controller must need to assume any new main connection from the switch is connecting back after some blackout.  It should treat it as two events - Disconnection with the switch followed by new connection.  
  • Change the IP address of the switch after successful connection with the controller :  Similar to above.  One must ensure that connectivity is restored and traffic flows smoothly.
  • Run above tests while traffic is flowing :  Just to ensure the system is stable even when the controller restarts while traffic is flowing through the switches.  And also ensure that controller is good when the switch restarts while packet-ins are pending in the controller and while controller is waiting for responses from the switch, especially during multi-part message response.  One black-box way of doing this is to wait until the application creates large number of flows in a switch (Say 100K flows).  Then issue "flow list" command (typically from CLI - many controllers provide mechanism to list out the flows) and immediately restart the switch and observe the behavior of controller. Once the switch is reconnected, let the flows be created and issue "flow list" command and ensure that this list is successful.
  • Check for memory leaks :  Run 24 hour tests by restarting the switch continuously.  Ensure that the script restarts the switch software only after it connects to the controller successful and traffic flows.  You should be surprised number of problems you could find with this simple test.  After 24 hours of test, ensure that there are no memory or fd (file descriptor) leaks by observing the process statistics.
  • Check for robustness of the connectivity by running 24 hour test with flow listing from the controller.  Let the controller application create large number of flows in a switch (say 100K) and in a loop execute a script which issues various commands including "flow list".  Ensure that this test is successful and there are no memory or fd leaks.

Keep Alive tests

Controllers share the memory across all openflow switch it controls.  Yet times, one misbehaving openflow switch might consume lot of OF message buffers or memory in the controller leaving controller not respond to keep alive messages from other switches.  SDN controllers are expected to reserve some message buffers and memory to receive keep alive messages and respond to them. This reservation not only required for keep alive messages, but also for each openflow switch.  Some tests that can be run to ensure that proper fairness by the controllers are:
  • Overload the controller by running cbench, which sends lot of packet-in messages and expect controller to crate flows.  While cbench is running, ensure that ta normal switch connectivity does not suffer.  Keep this test for 1/2 hour to ensure that controller is robust enough.
  • Same test, but tweak the cbench to generate and flood keep alive messages towards the controller.  While cbench is running,  ensure a normal switch connectivity to the controller does not suffer. Keep running this test continuously for 1/2 hour to ensure that the controller is robust enough.

DoS Attack Checks

There are few tests,  I believe need to be performed to ensure that the controller code is robust enough.  Buggy switches might send truncated packets or very corrupted packets which wrong (very big) length value.
  • Check all the messages & fields in the OF specifications that have length field.  Generate message (using cbench?) with wrong length value (higher value than the message size) and ensure that the controller does not crash or controller does not stop misbehaving.  It is expected that the controller eventually terminates the connection with the switch with no memory leaks.
  • Run above test continuously to ensure that there are no leaks.
  • Messages with maximum length value in the header:  It is possible that controllers might be allocating big chunk of memory and leading to memory exhaustion.  Right thing for controller is to do is to have maximum message length and drop the message (drain) without storing it in the memory.   
I would like to hear from others on what kind of tests you believe must be performed to ensure that robustness of controllers.


No comments: