Software in embedded network appliances consists of multiple entities: the Management Plane (MP), the Control Plane (CP), and the Data Plane (DP). Examples of network appliances include routers, switches, load balancers, security devices, and WAN optimization devices.
The MP typically consists of management engines such as the CLI engine, GUI engine, and persistent storage engine.
The CP mainly consists of several sub-entities for different functions. For example, if a device provides firewall, IPsec VPN, and routing functionality, there could be three sub-entities. Each sub-entity might consist of multiple protocols and modules; the routing function, for instance, might have multiple control plane protocols such as OSPF, BGP, and IS-IS. The CP also typically contains exception packet processing modules. For example, if no flow processing context is found for a given packet in the DP, the DP hands the packet to the CP, which processes it and creates the flow context in the DP.
The DP contains the actual packet processing logic. It is typically implemented in Network Processor Units (NPUs) or ASICs, and is increasingly being implemented in multicore processor SoCs, with some cores dedicated to the CP and the rest to the DP.
In this post, I am going to concentrate on the DP (also called the datapath).
As indicated, the DP implements the datapath processing. Based on the type of network device, the datapath consists of 'routing', 'bridging', 'NAT', 'firewall', 'IPsec', 'DPI' and other modules. Some background on each datapath function is given below; then we will see how a table-driven processing model helps in implementing datapaths.
Routing
The routing module typically involves the following processing upon packet reception:
- Parsing and integrity checking of the packet:
- L2 (Ethernet, VLAN, etc.) header parsing.
- Identification of the packet type.
- If it is IP, proceed further; otherwise, do non-IP-specific processing.
- The integrity check involves ensuring that the IP checksum is valid and that the packet size is at least as large as the 'total length' field in the IP header.
- Extraction of fields: the source IP address, destination IP address, and ToS fields from the IP header.
- IP Spoofing Check: This processing ensures that the packet came in on the right interface (Ethernet port). Normally this is done by doing a route lookup on the source IP address; the interface returned from the route lookup operation is compared with the port on which the packet came in. If both are the same, the IP spoofing check is treated as successful and further packet processing happens. Basically, this check ensures that a packet destined to the current packet's source IP address would go out on the interface on which the current packet came in. This requires a "RouteLookup" operation on the routing table (populated by the CP) using the packet's source IP address.
- Determination of outgoing interface and next hop gateway: This processing step involves a "route lookup" operation on the destination IP of the packet, and optionally on the ToS value of the packet. The "route lookup" operation indicates whether the packet needs to be consumed by the local device (that is, the DIP is one of the local IP addresses) or forwarded. If the packet needs to be sent to the next hop, the gateway IP address (or the destination IP itself if the host is in the same subnet as the router) and the outbound interface are returned. If the packet is meant for the local device itself, it is handed over to the CP. The rest of the processing assumes the packet is forwarded to the next hop.
- TTL Decrement: At this processing step, the DP decrements the TTL and does an incremental update of the IP header checksum field. If the TTL becomes 0, the packet is dropped. (A sketch of this step and the spoofing check appears after this list.)
- Packet Fragmentation: If the packet size is more than the PMTU value of the route or the MTU of the outbound interface, the packet gets fragmented at this step.
- MAC Address Determination: If the outbound interface is of Ethernet type, this processing step finds the MAC address for the gateway IP address determined in a previous step. It refers to the "address resolution table" populated by the CP, using the IP address as input to get the MAC address. If there is no matching entry, the packet is sent to the CP for ARP resolution. If the outbound interface is a point-to-point interface, the L2 header can be found in the interface table, which is also populated by the CP.
- Packet update with L2 header: The DP frames the Layer 2 header (Ethernet, VLAN, PPP, etc.) and prepends it at the beginning of the packet, right before the IP header.
- Packet Out: The packet is sent out on the outbound interface determined as part of the "route lookup" operation.
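To make these steps concrete, here is a minimal C sketch of the spoofing check and TTL decrement, assuming a hypothetical route_lookup() helper that fronts the CP-populated LPM table; the incremental checksum update follows RFC 1624.

```c
#include <stdint.h>
#include <arpa/inet.h>

struct ipv4_hdr {                      /* abbreviated IPv4 header */
    uint8_t  ver_ihl, tos;
    uint16_t total_len, id, frag_off;
    uint8_t  ttl, proto;
    uint16_t checksum;
    uint32_t sip, dip;
} __attribute__((packed));

/* Hypothetical LPM lookup over the CP-populated routing table. */
struct rt_info { uint32_t gw_ip; uint16_t out_ifindex; };
extern struct rt_info *route_lookup(uint32_t dip, uint8_t tos);

/* Spoofing check: the route back to the source must point at the
 * interface the packet arrived on. */
static int spoof_check(uint32_t sip, uint16_t in_ifindex)
{
    struct rt_info *r = route_lookup(sip, 0);
    return r && r->out_ifindex == in_ifindex;
}

/* TTL decrement with RFC 1624 incremental checksum update:
 * HC' = ~(~HC + ~m + m'), where m is the 16-bit word holding TTL/proto. */
static int decrement_ttl(struct ipv4_hdr *ip)
{
    if (ip->ttl <= 1)
        return -1;                     /* TTL would become 0: caller drops */
    uint16_t old_word = ((uint16_t)ip->ttl << 8) | ip->proto;
    ip->ttl--;
    uint16_t new_word = ((uint16_t)ip->ttl << 8) | ip->proto;
    uint32_t sum = (uint16_t)~ntohs(ip->checksum);
    sum += (uint16_t)~old_word + new_word;
    sum = (sum & 0xffff) + (sum >> 16);
    sum = (sum & 0xffff) + (sum >> 16);    /* fold carries twice */
    ip->checksum = htons((uint16_t)~sum);
    return 0;
}
```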
At each step, appropriate statistics counters are incremented. There are two types of statistics: normal and error statistics. These statistics are typically maintained globally, per interface, per routing entry, or per ARP table entry.
Upon analysis of the above steps, one can see that the IP routing datapath refers to a few tables:
- IP Routing Database Table, which is an LPM (Longest Prefix Match) table, used for the route lookup operation.
- MAC Address Resolution Table, which is an exact match table, used to find the MAC address for a given IP address.
- Interface Table, which is an index table, used to find the L2 header in the case of point-to-point interfaces. The interface index is typically returned by the "route lookup" operation.
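As an illustration, in a software datapath these three table kinds could be shaped as below (hardware datapaths typically use TCAM or dedicated LPM engines instead); the names are illustrative, and rt_info comes from the earlier sketch.

```c
#include <stdint.h>

struct rt_info;                        /* routing result, as sketched earlier */

/* LPM routing table: here a simple binary trie on destination-IP bits. */
struct lpm_node {
    struct lpm_node *child[2];
    struct rt_info  *info;             /* non-NULL if a prefix ends here */
};

/* Exact-match address resolution entry: IP -> MAC. */
struct arp_entry { uint32_t ip; uint8_t mac[6]; };

/* Index table: a dense array indexed by interface index. */
struct iface_entry {
    uint8_t  l2_hdr[16];               /* prebuilt L2 header for P2P links */
    uint8_t  l2_hdr_len;
    uint16_t mtu;
};
extern struct iface_entry iface_table[];
```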
IP routing in some devices also implements "Policy Based Routing" (PBR). The PBR table is an ACL-type table, with each entry matched on several fields, including IP header fields (SIP, DIP, ToS), input interface, output interface, and even transport header fields. The output of each ACL rule is a routing table. Basically, the "route lookup" functionality then involves two steps: the first step finds the matching ACL rule in the PBR table, which gives the routing table to use; the second step searches that routing table with the IP address and ToS to get the routing information. ACL rule field values can be expressed as subnets, ranges, or single values, and an ACL is typically an ordered list.
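A minimal sketch of this two-step lookup, assuming hypothetical acl_match() and lpm_lookup() helpers over the CP-populated tables:

```c
#include <stdint.h>

struct pkt_fields { uint32_t sip, dip; uint8_t tos; };
struct rt_table;                       /* an LPM routing table */
struct rt_info;                        /* routing result, as sketched earlier */
struct pbr_rule { struct rt_table *route_table; };

/* Hypothetical lookups; hardware datapaths back these with TCAM/LPM engines. */
extern const struct pbr_rule *acl_match(const struct pkt_fields *f);
extern struct rt_info *lpm_lookup(struct rt_table *t, uint32_t dip, uint8_t tos);
extern struct rt_table *default_rt;

struct rt_info *pbr_route_lookup(const struct pkt_fields *f)
{
    /* Step 1: ordered ACL search in the PBR table; the first matching
     * rule wins and selects the routing table to use. */
    const struct pbr_rule *rule = acl_match(f);
    struct rt_table *rt = rule ? rule->route_table : default_rt;

    /* Step 2: LPM lookup on the destination IP (and optionally ToS)
     * in the selected routing table. */
    return lpm_lookup(rt, f->dip, f->tos);
}
```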
Bridging: Bridging switches packets without any modification to the packet. Bridging is typically done among Ethernet interfaces, and typically involves the following steps upon receiving a packet from one of the Ethernet interfaces that belong to a bridge.
- Ethernet Header Parsing: In this step, the Ethernet and VLAN headers are extracted from the packet.
- Bridge Instance Determination: There could be multiple bridge instances supported by the datapath. This step determines the bridge instance based on the interface on which the packet came in; a given Ethernet or VLAN interface belongs to only one bridge. The CP populates the interface table with entries, each identified by an "Interface Index" and carrying parameters such as the bridge instance index (an index into the bridge table) and whether the interface allows learning, forwarding, and so on.
- Learning Table Update: Each bridge maintains a learning table, also called the FDB (Forwarding Database). This table provides the interface information for a given MAC address. It is normally used by the bridging datapath to determine the outbound interface for received packets, but it is updated with entries based on incoming packets: the source MAC address and the interface on which the packet came in are used to populate the learning table. Basically, it learns the machines on the physical network attached to each Ethernet port. As part of this processing step, if there is no matching learning table entry, a new entry gets added. Note that learning is done only if the Ethernet interface status in the interface table indicates that this interface can be used to learn machines. In some systems, the population of the learning table is done by the CP: whenever the DP finds no entry matching the SMAC, it sends the packet to the CP, and the CP creates the table entry.
- Determination of outbound interface: In this step, the DP does a lookup in the FDB with the DMAC address as the key. If there is a matching entry, it knows the outbound interface on which the packet needs to be sent. For a multicast packet, it refers to the multicast FDB to find the set of interfaces on which the packet needs to be sent out. For a broadcast packet (DMAC = ff:ff:ff:ff:ff:ff), all interfaces in the bridge are selected to send the packet out. Note that the multicast FDB is populated by the CP, which builds it by interpreting IGMP, MLD, and PIM-SM packets.
- Packet Out: In this step, the packet is sent out. For a unicast packet, if the interface is known, the packet is sent out on it. If the outbound interface is unknown (there is no matching FDB entry), the packet is sent out on all ports except the incoming port; the packet is duplicated multiple times to achieve this. Multicast packets are likewise sent on all ports if there is no matching entry; otherwise, the packet is sent to the interfaces indicated by the multicast FDB. Broadcast packets are always sent to all ports in the bridge. Note that, in all these cases, the packet is never sent on the interface on which it came in. Also, the datapath does not send the packet out on an interface that is blocked for sending; this information is known from the interface table. (A sketch of the learning and forwarding logic follows.)
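A minimal sketch of the learning and forwarding steps, assuming a hypothetical exact-match FDB API (fdb_lookup()/fdb_insert()) and helpers over the interface table; multicast FDB handling is omitted for brevity:

```c
#include <stdint.h>
#include <string.h>
#include <stdbool.h>

struct bridge;                         /* one instance per bridge */
struct fdb_entry { uint8_t mac[6]; uint16_t ifindex; };

extern struct fdb_entry *fdb_lookup(struct bridge *br, const uint8_t mac[6]);
extern void fdb_insert(struct bridge *br, const uint8_t mac[6], uint16_t ifindex);
extern bool iface_may_learn(uint16_t ifindex);
extern bool iface_may_send(uint16_t ifindex);
extern void send_on(uint16_t ifindex, const void *pkt, int len);
extern void flood(struct bridge *br, uint16_t in_if, const void *pkt, int len);

void bridge_rx(struct bridge *br, uint16_t in_if,
               const uint8_t smac[6], const uint8_t dmac[6],
               const void *pkt, int len)
{
    /* Learning: remember which port the source MAC was seen on. */
    if (iface_may_learn(in_if) && !fdb_lookup(br, smac))
        fdb_insert(br, smac, in_if);

    /* Forwarding: known unicast goes to one port; unknown unicast and
     * broadcast are flooded to all ports except the incoming one. */
    static const uint8_t bcast[6] = {0xff,0xff,0xff,0xff,0xff,0xff};
    struct fdb_entry *e = NULL;
    if (memcmp(dmac, bcast, 6) == 0 || !(e = fdb_lookup(br, dmac)))
        flood(br, in_if, pkt, len);    /* flood() skips in_if and blocked ports */
    else if (e->ifindex != in_if && iface_may_send(e->ifindex))
        send_on(e->ifindex, pkt, len);
}
```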
The bridging module maintains the following tables:
- Unicast FDB, which is an exact match table; the match field is the MAC address. One FDB is maintained per bridge instance.
- Multicast FDB, which is also an exact match table; the match field is the multicast MAC address. One multicast FDB is maintained per bridge instance.
- Interface Index Table, which is indexed by the incoming interface identifier. Typically global to the entire datapath.
- Bridge Instance Table, which is indexed by the bridge instance. Typically global to the entire datapath.
NAT/Firewall Datapath: NAT and firewalls typically require session processing, and the session management part of NAT/firewalls is typically implemented in the datapath. Sessions are typically 5-tuple sessions (source IP, destination IP, protocol, source port, and destination port) in the case of TCP and UDP. A session is initiated by the client and terminated by the server. Since NAT/firewall devices sit between the client and server machines, they see both sides of the connection: client-to-server and server-to-client. Hence, in these devices, a session consists of two flows: the C-to-S and S-to-C flows (client-to-server and server-to-client flows). The 5-tuple values of the C-to-S and S-to-C flows are the same in the case of a firewall session, except that the values are swapped between source and destination. That is, the source IP of the C-S flow is the destination IP of the S-C flow, and the destination IP of the C-S flow is the source IP of the S-C flow; the same is true of the source and destination ports. In the case of NAT, this may not be as simple: NAT functionality translates the client's SIP, DIP, SP, and DP to new values, so the two flows of a NAT session would have different 5-tuple values, yet both belong to one session. In summary, a session maintained by NAT/firewall devices is pointed to by two different flows, as sketched below.
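An illustrative layout of this two-flow session arrangement (the field names are assumptions, not from any particular implementation):

```c
#include <stdint.h>

struct flow_key {                      /* 5-tuple plus security zone */
    uint32_t sip, dip;
    uint16_t sport, dport;
    uint8_t  proto;
    uint8_t  zone_id;
};

struct flow {                          /* one direction of a session */
    struct flow_key key;
    uint32_t session_id;               /* index into the session table */
    struct flow *hash_next;            /* hash bucket chaining */
};

struct session {
    struct flow c2s;                   /* client-to-server flow */
    struct flow s2c;                   /* server-to-client flow: the reversed
                                        * 5-tuple for a plain firewall, the
                                        * translated addresses/ports for NAT */
    uint32_t state;                    /* TCP state, timers, flags, etc. */
};
```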
Session processing involves the following steps in the datapath:
- Packet parsing and header integrity checks (L2 header, IPv4/IPv6 header, TCP/UDP/ICMP/SCTP headers): a similar step to the one described in the "Routing" section.
- IP Reassembly: Non-first IP fragments do not carry a transport header. Since session matching requires all five tuples, non-initial fragments would not get matched to the right flows if reassembly were not done. As part of IP reassembly, many checks are made for anomalies, and fragments might get dropped by these checks. Once the full transport packet is reassembled, further processing happens. Some of the checks done as part of this processing step are:
- Ensuring that initial and middle fragments are at least of some configured size. This is to catch deliberate fragmentation by a sender trying to make the device spend a large number of CPU cycles, thereby creating a DoS condition.
- Ensuring that the total IP packet size never exceeds 64K.
- Taking care of overlapping IP fragments.
- Ensuring that the data in overlapping fragments is the same. If not, it is considered a deliberate evasion attempt, and this processing step drops such packets.
- Transport Header Integrity Checks: At this step, transport header integrity is ensured. For example:
- Ensuring that the packet holds a complete transport header.
- Some datapaths may also verify the transport checksum.
- Extraction of fields from the parsed headers: mainly the SIP, DIP, protocol, SP, and DP are extracted from the network and transport headers.
- Security Zone Lookup: Every incoming interface belongs to one of the security zones. The CP programs a table of entries, each having an interface ID and the security zone ID it belongs to. The security zone ID is used in the flow lookup function.
- Flow Lookup to get hold of the session: As discussed before, a session consists of two flows. Flows are typically arranged in hash lists. When a session is created by the CP, the CP is expected to put two flows in the hash list, with each flow pointing to its session context. Sessions are arranged in an array (an index table), and a reference to the session (the array index) is stored in the flow. Basically, there is one hash list for flows and one index table for sessions. The 5-tuple and the inbound security zone ID are used to match the flow. If no matching flow is found, the packet is sent to the CP; if the flow is found, processing goes on to the next step. (A lookup sketch follows this list.)
- Anomaly Checks: Once the flow, the associated session context, and the complementary flow are determined, this processing step checks for anomalies using the packet header contents and the state variables maintained in the session. A few examples that come to mind:
- Ensuring that the TCP sequence number of the current packet is within a reasonable range (example: 128K) of the previous packet's sequence number. This is typically done to ensure that a man-in-the-middle did not generate the TCP packet, mainly a TCP packet with the RST bit set. A TCP RST packet can tear down the session not only in the device, but also on the receiving client/server machine.
- Packet Modification: This step involves packet modification. Packets belonging to sessions created by the firewall CP may not undergo many changes. In the case of NAT, several modifications are possible based on the session state created by the CP. Some of them are:
- Source NAT and destination NAT modify the SIP and DIP of the packet.
- NAPT may modify both the SP and DP of TCP and UDP packets.
- To ensure that the modified packet has a unique IP ID, the IP ID also gets translated.
- The TCP MSS value might be rewritten to a smaller value to avoid fragmentation in later steps.
- TCP sequence numbers may also get translated. This typically happens when the CP has done some prior packet processing, such as SYN flood protection using the SYN cookie mechanism, or due to some ALG processing.
- Packet Routing and Packet Out: These steps are similar to those in the IP routing module.
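A minimal sketch of the flow lookup step, reusing the flow_key/flow structures from the earlier sketch; the hash function and table size are illustrative, and keys are assumed to be zero-initialized so that struct padding compares equal under memcmp().

```c
#include <stdint.h>
#include <string.h>

/* struct flow and struct flow_key as defined in the earlier sketch. */

#define FLOW_HASH_BUCKETS 65536        /* illustrative table size */
extern struct flow *flow_hash[FLOW_HASH_BUCKETS];

static uint32_t flow_hash_fn(const struct flow_key *k)
{
    /* Simple illustrative mix; real datapaths often use hardware hashing
     * or a keyed hash to resist hash-flooding attacks. */
    uint32_t h = k->sip ^ k->dip
               ^ ((uint32_t)k->sport << 16 | k->dport)
               ^ ((uint32_t)k->proto << 8 | k->zone_id);
    h ^= h >> 16;
    return h & (FLOW_HASH_BUCKETS - 1);
}

struct flow *flow_lookup(const struct flow_key *k)
{
    for (struct flow *f = flow_hash[flow_hash_fn(k)]; f; f = f->hash_next)
        if (memcmp(&f->key, k, sizeof(*k)) == 0)
            return f;                  /* hit: session_id leads to the session */
    return NULL;                       /* miss: punt the packet to the CP */
}
```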
Upon analysis of the above steps, one can find the following tables in session processing:
- Interface-to-security-zone mapping table, which is an index table.
- Flow hash table, which is an exact match hash table.
- Flow-to-session-ID table, which is an index table.
Observations & Suggestions for new versions of the OpenFlow specification:
All the datapaths listed above use tables heavily, and the contexts created in each table are used by the actions being performed. It is not clear whether the OpenFlow specification group is aiming to cover any type of datapath. If so, there are a few things to note and improve.
Packet Reinjection: Each datapath might have more than one table. Typically, a table miss results in the packet going to the control plane (the controller, in the case of OpenFlow). Once the control plane acts on the packet and creates the appropriate context in the table, it would like packet processing in the datapath to start from where the miss occurred. Sometimes the processing should start from the table where the miss occurred; at other times, the CP might have already performed the operations on the packet, so packet processing needs to resume in the datapath from the next table, as specified in the newly created context. The protocol between the control plane and the datapath (OpenFlow) needs to provide this flexibility to the control plane.
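A hypothetical reinjection message shape illustrating this flexibility (this is not part of any OpenFlow specification; the field names are invented for illustration):

```c
#include <stdint.h>

struct pkt_reinject {
    uint32_t buffer_id;                /* packet buffered in the DP at miss time */
    uint8_t  resume_table;             /* table at which processing resumes:
                                        * either the table that missed, or the
                                        * next table named by the new context */
    uint8_t  flags;                    /* e.g. "packet data follows inline" */
    /* ... packet data follows if the DP did not buffer it ... */
};
```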
Atomic Update and Linkage of Flows: The control plane at times needs to create multiple contexts in tables atomically. For example, a NAT/firewall control plane would need to create the contexts for two flows in an atomic fashion. A commit-based facility would help in achieving this. I think the OpenFlow protocol needs to be enhanced to enable this operation.
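A hypothetical commit-based sequence for creating the two flows of a session atomically (these calls are invented for illustration and do not exist in OpenFlow today):

```c
#include <stdint.h>

struct dp_txn;                         /* opaque transaction handle */
struct flow_key;                       /* as sketched earlier */

extern struct dp_txn *dp_txn_begin(void);
extern int  dp_txn_add_flow(struct dp_txn *t, int table_id,
                            const struct flow_key *key, uint32_t session_id);
extern int  dp_txn_commit(struct dp_txn *t);   /* all-or-nothing */
extern void dp_txn_abort(struct dp_txn *t);

int create_session_flows(const struct flow_key *c2s,
                         const struct flow_key *s2c, uint32_t session_id)
{
    struct dp_txn *t = dp_txn_begin();
    if (dp_txn_add_flow(t, 0, c2s, session_id) ||
        dp_txn_add_flow(t, 0, s2c, session_id) ||
        dp_txn_commit(t)) {
        dp_txn_abort(t);               /* neither flow becomes visible */
        return -1;
    }
    return 0;                          /* both flows appear atomically */
}
```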
Parsing Headers & Extraction of Fields: Since different datapaths may need to be exercised on a packet, one-time parsing of the packet may not be sufficient. Each datapath, or even each table, might be preceded by a parsing & extraction unit. Different fields are required in different paths, and there could be tunnel headers which need to be parsed in some datapaths to get to the inner packet. Each datapath might have its own requirements on which fields to use for further processing; that is, some datapaths might need to work on the IP header of the tunnel, while others might need to use the inner IP packet fields. Hence, I believe it is good to have a parsing & extraction unit for each table. In cases where the datapaths don't require different parsing & extraction units, the controller simply would not configure the associated units. Some hardware devices (multicore SoCs) I am familiar with support multiple parsing & extraction unit instances, so this is not a new thing for hardware ASICs/multicore SoCs.
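A hypothetical per-table parse-and-extract configuration of this kind, where each table names the header layer and fields that feed its lookup key (purely illustrative):

```c
#include <stdint.h>

enum hdr_layer { OUTER_IP, INNER_IP, OUTER_L4, INNER_L4 };

struct extract_spec {
    enum hdr_layer layer;              /* which parsed header to pull from */
    uint16_t       field_id;           /* e.g. SIP, DIP, SP, DP, ToS */
};

struct table_parser_cfg {              /* programmed once by the controller */
    uint8_t  table_id;
    uint8_t  nfields;
    struct extract_spec key_fields[8]; /* fields forming this table's key */
};
```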
Pre-Table Lookup Actions: Many times, there is a requirement to ensure that the packet is good; packet integrity checks are very important during packet processing. Having some actions defined on a per-table basis (not per flow) is a good way to do these kinds of checks. There may also be a future requirement to modify the packet before further parsing/extraction and the eventual table lookup. A pre-table action list is one way to provide these facilities. One can argue that it is possible to use another table before the packet is processed at the current table and attach context-specific actions to that table; though that is a reasonable thought, it might result in inefficient usage of table resources in hardware.
Flexible Metadata: Metadata is used to communicate information among different tables and their associated actions. There are two types of metadata required in datapaths: some of the metadata is used as a lookup key in further tables, and some is required by actions. The OpenFlow specification defines metadata, but it is limited to one word, which I believe is insufficient; fixing the metadata size in the standard is also not good. Instead, the controller could define the maximum metadata size required and program that size into the datapath; the datapath can then allocate additional space to store metadata on a per-packet basis using the size information programmed by the controller. Since there could be multiple types of datapaths and multiple tables within each type, the controller may categorize metadata for different purposes. Each metadata field and its size can then be referred to by tables for lookups and by associated actions. I believe OpenFlow should provide that kind of flexibility: set up the size of the required metadata once in the datapath, provide a language to use different metadata fields for table lookups, and provide a language to define actions that can use different metadata fields or set/reset some of them.
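A hypothetical metadata layout descriptor along these lines, where the controller defines named fields once and tables/actions refer to them by index (purely illustrative):

```c
#include <stdint.h>

struct md_field {
    uint16_t offset_bits;              /* position within per-packet metadata */
    uint16_t width_bits;
};

struct md_layout {                     /* programmed once by the controller */
    uint16_t total_bytes;              /* per-packet scratch the DP allocates */
    uint16_t nfields;
    struct md_field fields[32];        /* referenced by table lookups/actions */
};
```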
Actions: Each type of datapath has its own actions; for example, firewall session management datapath actions are different from the types of actions routing or bridging require. Hence, there will eventually be a good number of actions in the datapath, and more actions will be required in the future. In addition to defining more actions in OpenFlow or associated specifications, the specification should also provide the flexibility to add new actions without having to create new hardware. That is, OpenFlow might need to mandate that datapath hardware or software be able to take on newer actions in the future. To enable this, some kind of language for defining new actions is required. Since datapaths may be implemented in hardware ASICs, the language should not have too many constructs/machine codes; a simple language is what is needed. I thought a simplified LLVM could be one option.