Sunday, January 24, 2010

PCIe End Point Developer techniques

Some UTM functions, such as Intrusion Prevention and Anti-Malware, take significant CPU time. Equipment vendors are looking for ways to offload some processing from the main CPU to NIC cards. That is how Intelligent NIC cards were born. Some equipment vendors created their own ASICs or FPGAs to do some offload functions at ingress and egress time. Increasingly, vendors are showing interest in using Multicore processors in offload cards. The Multicore approach has several advantages over the ASIC/FPGA approach, because ordinary programming languages can be used to create the offload engines: faster time-to-market, images that can be upgraded in the field, and simpler maintenance. Offload cards are attached to the main CPU over PCIe and in essence replace traditional PCIe NIC cards in security equipment.

With PCIe cards increasingly becoming processor based, network programmers can now program the offload cards using traditional programming languages such as 'C'. An Intelligent NIC card has Ethernet ports attached to it for external connectivity. Towards the CPU, it communicates via the PCIe interface. When the card receives packets from the Ethernet port, it does some operations on them and sends the resulting packets to the CPU over PCIe. When packets are given to the card over PCIe, it can do some operations before sending them out on the Ethernet port. Basically, the PCIe card as a whole appears as one NIC interface to the host even though it does some offload functions. In this post, I will not cover the typical offload functions that can be done in offload cards; I will try to do that in later posts. Here, I would like to concentrate on efficient packet transfer over PCIe.

A brief introduction to PCIe:

Intelligent NIC cards use PCIe End Point functionality to talk to the host CPU. The host CPU uses PCIe in RC (Root Complex) mode. As you all know, PCIe devices are arranged in a tree. The root of the tree is the RC (host). The leaves of the tree are PCIe devices (end points). Between the root and the leaves, there can be PCIe switches. The host uses services provided by the end points. PCIe switches are used to increase the number of end point devices that can be attached to the host. Examples of end point devices are Ethernet controllers, SCSI devices, video devices, etc.

Every end point and root complex processor has a PCIe controller hardware block to talk to the PCIe bus. In the model used in this post, end points talk only to the RC and not directly to each other; if two end points need to exchange data, the RC acts as a relay. (PCIe does permit peer-to-peer transactions routed through switches, but that is not considered here.) The RC can talk to many end points at any time. A PCIe tree has only one RC. In a typical PC system, the x86 processor is the RC and all other PCIe devices are end points.

The RC and the end points share memory-mapped IO regions, or memory itself, with each other. End points advertise the regions they expose to the RC through the BAR registers in configuration space. An end point's configuration header provides up to six BARs. Each BAR encodes both a memory address and the size of the region. The host maps these regions into its system address space. On x86 hosts, the BIOS enumerates the PCIe devices and, as part of enumeration, maps the device regions into the host address space. Hosts typically set aside part of the system address space for memory exposed by PCIe devices. Many host processors today have at least 36 address bits, so they can address up to 64 GB. Some of that address space goes to DDR, and part of the remainder is reserved for mapping PCIe devices.

Hosts normally also want to expose their own memory to PCIe devices. No standard defines how a host tells the end point which memory regions it wants to share. End points typically set aside part of their DDR memory as command and response memory, and this memory is part of the space exposed to the RC host through one of the BARs. One of the commands an end point offers could let host drivers share host memory with the end point. The end point then maps the memory exposed by the RC into its own reserved system address space (the end point processor's PCI address space).
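
To illustrate how a BAR can carry both an address and a size: during enumeration, the host conventionally writes all 1s to a BAR and reads it back. The device hard-wires the low address bits to zero, so the size falls out of the read-back value. Here is a minimal sketch for a 32-bit memory BAR. The helper name bar32_mem_size is made up for this example, and real enumeration code also has to handle 64-bit and IO BARs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: size in bytes of a 32-bit memory BAR, given the
 * value read back after writing 0xFFFFFFFF to it.
 * Bits 3:0 of a memory BAR are type/prefetch flags, not address bits. */
uint32_t bar32_mem_size(uint32_t readback)
{
    uint32_t mask = readback & 0xFFFFFFF0u;   /* drop the flag bits */
    uint32_t size;

    if (mask == 0)
        return 0;                             /* BAR not implemented */

    size = ~mask + 1u;                        /* lowest writable bit = size */
    assert((size & (size - 1u)) == 0);        /* BAR sizes are powers of two */
    return size;
}
```

For example, a read-back value of 0xFFF00000 corresponds to a 1 MB region.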

PCIe controllers support multiple transaction types. The most important ones are Memory Read, Memory Write, Configuration Read and Configuration Write. There are others, such as Messages and Memory Read/Write with Lock. There are also transaction types related to the PCIe internal layers (the transaction layer), such as Completion.

When a processor (either host or end point) reads or writes data in the PCIe address space, the PCIe controller intercepts the operation and creates a PCIe read or write transaction towards the other side (the destination), carrying the other side's memory address. The PCIe transaction message contains the other side's address, the transaction type and, for a write, the data to be written. For a read, one more message comes back from the other side with the data. All this happens automatically. As far as the processor is concerned, it is reading and writing some address (in this case a PCIe address) the same way it reads and writes normal memory; the PCIe transaction is created internally. PCIe controllers normally have something called 'Address Translation Registers'. For each PCIe address space range, they store the original memory address on the other side. They are normally programmed whenever the other side gives the memory ranges to be mapped. That is how the PCIe controller knows which memory address to put in the read/write transactions.
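
The translation step can be pictured with a small software model. This is only a sketch of the idea: the struct layout and the names atr_window and atr_translate are assumptions made for illustration, and real controllers do this lookup in hardware with vendor-specific registers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one outbound translation window: a range of the
 * local PCIe address space and the remote address it maps onto. */
struct atr_window {
    uint64_t local_base;   /* start of the window in local address space */
    uint64_t size;         /* window length in bytes */
    uint64_t remote_base;  /* matching address on the other side */
};

/* Returns 1 and stores the remote address if 'local' falls inside one of
 * the n windows, otherwise returns 0 and leaves *remote untouched. */
int atr_translate(const struct atr_window *w, size_t n,
                  uint64_t local, uint64_t *remote)
{
    assert(w != NULL && remote != NULL);
    for (size_t i = 0; i < n; i++) {
        if (local >= w[i].local_base &&
            local - w[i].local_base < w[i].size) {
            *remote = w[i].remote_base + (local - w[i].local_base);
            return 1;
        }
    }
    return 0;
}
```

The offset inside the window is preserved, so a write to local_base + 0x40 becomes a memory write transaction addressed to remote_base + 0x40.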

PCIe defines two transaction models called 'Posted' and 'Non-Posted'. Posted transactions are fire-and-forget: the processor does not wait to learn whether the transaction completed. Memory write transactions are posted. Read transactions are non-posted, because the data has to come back from the other side's memory address. Since a read is a synchronous operation, the processor simply waits for the result. Hence non-posted transactions are expensive. As I understand it, a read transaction typically takes around 1 to 2 microseconds even on PCIe Gen2 hardware. Hence PCIe read transactions for small amounts of data (less than 16 bytes) should be avoided. For large transfers, PCIe read transactions are acceptable, since the latency is spread over more data.

So, any programming interface exposed by the PCIe card must ensure that the host processor does not do any  PCIe read transactions for data movement (such as packet transfer).   

Descriptor rings:

This post gives one technique by which both the host and the end point avoid PCIe read transactions. The descriptor ring approach is used to send and receive packets and commands between hosts and end points. Normally, PCI devices design their descriptor rings such that hosts never need to do a PCIe read transaction either to receive packets or to send packets. Now that PCIe devices are also being implemented on processors to allow flexible offload mechanisms, PCIe read transactions should be avoided in end point implementations as well.

A descriptor ring consists of a set of descriptors. Each descriptor contains some control and status bits and pointers to buffers holding the packets or commands/results.

Every descriptor ring has one producer and one consumer. In the case of an Ethernet based PCIe device, on the transmit side the host is the producer and the Ethernet controller end point is the consumer. For received packets, the Ethernet controller is the producer and the host is the consumer. That is, an Ethernet controller end point uses at least two descriptor rings: a transmit descriptor ring and a receive descriptor ring. In addition to these two rings, there may be command rings to send commands to the end point and receive responses from it.

Extending the Ethernet example, the Ethernet driver in the host typically initializes both packet transfer rings (receive and transmit). The host Ethernet driver normally fills the descriptors of the receive ring with buffers to receive packets into. When the Ethernet controller receives a packet from the wire, it puts the packet into a buffer from the receive descriptor ring and raises an interrupt on the host. On the transmit side too, the host puts the buffers of the packets into the descriptors and informs the end point. In the case of processor based end points, the host can even raise an interrupt on the end point processor.

A descriptor ring consists of descriptors arranged in a ring, together with a Producer Index and a Consumer Index. The Producer Index (PI) points to the next descriptor the producer can write into, and the Consumer Index (CI) points to the next descriptor the consumer is expected to read. Since there has to be a way to tell a full ring from an empty one, one descriptor is normally left unused: the ring is empty when PI equals CI, and it is full when advancing PI would make it equal to CI. In the full case, the producer does not use that descriptor until CI moves up.
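
The full and empty conditions can be written down directly. This is a minimal sketch; the ring size of 8 and the function names are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define RING_SIZE 8u   /* assumed ring size for this example */

/* Empty: nothing produced that has not been consumed. */
int ring_is_empty(uint32_t pi, uint32_t ci)
{
    assert(pi < RING_SIZE && ci < RING_SIZE);
    return pi == ci;
}

/* Full: advancing PI would make it equal to CI, so one slot stays unused. */
int ring_is_full(uint32_t pi, uint32_t ci)
{
    assert(pi < RING_SIZE && ci < RING_SIZE);
    return (pi + 1u) % RING_SIZE == ci;
}

/* Number of descriptors waiting for the consumer. */
uint32_t ring_used(uint32_t pi, uint32_t ci)
{
    assert(pi < RING_SIZE && ci < RING_SIZE);
    return (pi + RING_SIZE - ci) % RING_SIZE;
}
```

With this convention a ring of 8 descriptors holds at most 7 outstanding entries.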

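To avoid PCIe read transactions, the descriptor ring has to be part of the consumer processor's memory space. Each side keeps its own copies of PI and CI. The consumer exposes its ring and its copy of PI to the producer, so that the producer can write them. The producer's PI need not be exposed to the consumer, but the producer's copy of CI must be writable by the consumer, because the consumer pushes CI updates into it (see the consumer flow). Here is the flow: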

Initialization time:

  • The descriptor ring is part of the memory the consumer exposes to the producer.
  • The consumer's copies of PI and CI for the ring are also part of that exposed memory.
  • Initially, both PI and CI are set to 0 on both sides.

Producer flow:

The producer has a buffer to hand over to the consumer.

if ( (PI + 1) % ring size != CI )
  • Write the buffer pointer into the descriptor indexed by PI.
  • Write the status and control bits accordingly.
  • Increment the local PI:  PI = (PI + 1) % ring size
  • Write the new PI into the other side's PI.
  • Generate an interrupt (an MSI interrupt is also a write transaction). Of course, it needs to honor the coalescing criteria.
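
The producer steps above can be sketched in C as follows. This is an illustration under stated assumptions, not a driver: the descriptor layout, the DESC_VALID flag, the ring size and all names are made up, and plain memory stands in for the PCIe-mapped window into consumer memory.

```c
#include <assert.h>
#include <stdint.h>

#define TX_RING_SIZE 4u          /* assumed ring size */
#define DESC_VALID   0x1u        /* assumed "descriptor filled" flag */

struct desc {
    uint64_t buf_addr;           /* pointer to the packet buffer */
    uint32_t len;
    uint32_t flags;              /* control and status bits */
};

/* Memory owned by the consumer and exposed to the producer.
 * The producer only ever writes here, so only posted writes are issued. */
struct consumer_mem {
    volatile struct desc ring[TX_RING_SIZE];
    volatile uint32_t pi;        /* consumer's copy of PI */
};

/* Producer-local state. 'ci' is updated by the consumer across PCIe. */
struct producer {
    uint32_t pi;
    volatile uint32_t ci;
    struct consumer_mem *remote; /* mapped window into consumer memory */
};

/* Returns 0 on success, -1 if the ring is full. */
int producer_post(struct producer *p, uint64_t buf_addr, uint32_t len)
{
    volatile struct desc *d;

    assert(p != 0 && p->remote != 0);
    if ((p->pi + 1u) % TX_RING_SIZE == p->ci)
        return -1;                       /* full: decided from local reads */

    d = &p->remote->ring[p->pi];
    d->buf_addr = buf_addr;              /* writes into the other side */
    d->len = len;
    d->flags = DESC_VALID;

    p->pi = (p->pi + 1u) % TX_RING_SIZE; /* advance local PI */
    p->remote->pi = p->pi;               /* publish PI before any interrupt */
    /* An MSI write would be issued here, subject to coalescing. */
    return 0;
}
```

Note that the full check reads only producer-local PI and CI; nothing in this function reads through the remote pointer.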

Consumer flow would be:

if ( PI != CI )
  • Read the descriptor indexed by the local CI and process it.
  • Update the local CI:  CI = (CI + 1) % ring size
  • Write the new CI into the other side's CI, that is, update the producer's CI value.
  • Generate an interrupt (if needed) to tell the producer that the descriptor has been consumed.
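
A matching sketch of the consumer side is shown here. Again, the descriptor layout, ring size and names are assumptions for illustration. It drains every pending descriptor in a loop rather than handling one per call, and "processing" is reduced to summing the descriptor lengths.

```c
#include <assert.h>
#include <stdint.h>

#define RX_RING_SIZE 4u          /* assumed ring size */

struct rdesc {
    uint64_t buf_addr;
    uint32_t len;
    uint32_t flags;
};

/* Producer-side memory that the consumer is allowed to write. */
struct producer_mem {
    volatile uint32_t ci;        /* producer's copy of CI */
};

/* Consumer-local state. The ring and 'pi' are written by the producer
 * across PCIe; the consumer reads them as ordinary local memory. */
struct consumer {
    volatile struct rdesc ring[RX_RING_SIZE];
    volatile uint32_t pi;
    uint32_t ci;
    struct producer_mem *remote; /* mapped window into producer memory */
};

/* Consumes all pending descriptors. Adds their lengths to *bytes and
 * returns how many descriptors were consumed. */
unsigned consumer_poll(struct consumer *c, uint64_t *bytes)
{
    unsigned n = 0;

    assert(c != 0 && c->remote != 0 && bytes != 0);
    while (c->pi != c->ci) {                 /* local reads only */
        *bytes += c->ring[c->ci].len;        /* "process" the descriptor */
        c->ci = (c->ci + 1u) % RX_RING_SIZE; /* advance local CI */
        c->remote->ci = c->ci;               /* posted write to producer */
        n++;
    }
    /* An interrupt to the producer could be raised here if needed. */
    return n;
}
```

A real implementation might push the CI update once per batch instead of once per descriptor to reduce the number of PCIe writes.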

As the flows above show, neither the producer nor the consumer ever reads the other side's memory. Every read above is from local memory.

Taking the Ethernet controller as an example, the receive descriptor ring is defined in host memory and the transmit descriptor ring is defined in end point memory.

Note that packet buffers are always expected to be in host memory, so that no changes are required in the host TCP/IP stack or host applications. Because of this, on the transmit side the end point has to issue PCIe read transactions on the packet buffer, using the address found in the descriptor. To reduce the read latency across multiple packets, it is good to collect buffer pointers from several descriptors and program multiple DMA controllers (if the device has them) at once. Since packets are expected to be hundreds of bytes long and multiple packets can be read at the same time (if multiple DMA controllers are available), the read latency is amortized to a small value per packet. Read transactions on transmit packets are unavoidable for Ethernet controllers, but descriptor reads should never result in PCIe read transactions.

Another point to note is that the consumer's PI must already be updated by the time the interrupt reaches the consumer. Otherwise there is a race condition: the interrupt arrives first and the consumer finds nothing to read. If PI is updated only after the peer has seen the interrupt, the filled descriptors are not read until another interrupt is generated for some later descriptor. That race condition is very dangerous, because the next descriptor update might come only after a long time. Since a PCIe MSI interrupt is itself a write transaction, there is no race condition as long as the PI update is issued before the interrupt.
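
The effect of the ordering can be shown with a toy model in which the link delivers posted writes strictly in the order they were issued. Everything here (names, ring size, the idea of modelling the MSI as a queue entry) is an assumption made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define SIM_RING_SIZE 8u               /* assumed ring size */

enum wr_kind { WR_PI_UPDATE, WR_MSI };

/* One posted write travelling from producer to consumer. */
struct posted_wr {
    enum wr_kind kind;
    uint32_t value;                    /* new PI; unused for WR_MSI */
};

/* Consumer-side view: its PI copy (written by the producer) and its CI. */
struct sim_consumer {
    uint32_t pi;
    uint32_t ci;
};

/* Delivers n writes in FIFO order. On each MSI the consumer handles
 * whatever lies between CI and PI. Returns the descriptors handled. */
unsigned sim_deliver(struct sim_consumer *c,
                     const struct posted_wr *q, unsigned n)
{
    unsigned handled = 0;

    assert(c != 0 && q != 0);
    for (unsigned i = 0; i < n; i++) {
        if (q[i].kind == WR_PI_UPDATE) {
            c->pi = q[i].value % SIM_RING_SIZE;
        } else {
            handled += (c->pi + SIM_RING_SIZE - c->ci) % SIM_RING_SIZE;
            c->ci = c->pi;
        }
    }
    return handled;
}
```

If the queue is { PI update, MSI }, the interrupt handler sees one descriptor. If it is { MSI, PI update }, the handler sees none, and the descriptor stays pending (PI is 1, CI is 0) until some later interrupt.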


Avinash said...

Thanks for this valuable information and description. I was exactly looking for this.

Anonymous said...

Thanks for sharing your knowledge,

I got stuck writing a simple driver in which the endpoint tries to write data to host memory without using DMA.

I have not been able to find out how the endpoint learns the memory location in the RC.