Pipeline programming divides a task into multiple stages and links those stages together. Each stage typically runs in a different execution context, such as a tasklet, a thread, or even a separate processor. In network pipelining, a packet passes through multiple stages, each doing some application-specific work, before the packet is sent out or consumed. When one stage finishes, it enqueues the packet to the next stage. The next stage picks the packet up from its input queue, processes it, and enqueues it to the stage after that, and so on until all stages are complete.
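The stage-to-stage handoff described above can be sketched in a few lines of C. This is a minimal user-space illustration, not kernel code: the names `stage_queue`, `stage1_process` and `stage2_run`, and the queue depth of 8, are all assumptions made for the example.

```c
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_CAPACITY 8   /* hypothetical inter-stage queue depth */

struct packet {
    int id;
    int stages_done;       /* how many stages have handled this packet */
};

/* Fixed-size ring buffer used as the queue between two stages. */
struct stage_queue {
    struct packet *slots[QUEUE_CAPACITY];
    size_t head;           /* next slot to dequeue */
    size_t count;          /* packets currently queued */
};

bool queue_put(struct stage_queue *q, struct packet *p)
{
    if (q->count == QUEUE_CAPACITY)
        return false;      /* queue full: caller must drop or retry */
    q->slots[(q->head + q->count) % QUEUE_CAPACITY] = p;
    q->count++;
    return true;
}

struct packet *queue_get(struct stage_queue *q)
{
    if (q->count == 0)
        return NULL;
    struct packet *p = q->slots[q->head];
    q->head = (q->head + 1) % QUEUE_CAPACITY;
    q->count--;
    return p;
}

/* Stage 1: do its work, then hand the packet to stage 2's input queue. */
bool stage1_process(struct packet *p, struct stage_queue *to_stage2)
{
    p->stages_done++;
    return queue_put(to_stage2, p);
}

/* Stage 2: drain its input queue; returns the number of packets finished. */
int stage2_run(struct stage_queue *in)
{
    int finished = 0;
    struct packet *p;

    while ((p = queue_get(in)) != NULL) {
        p->stages_done++;
        finished++;
    }
    return finished;
}
```

In a real system `stage2_run` would be the body of a tasklet or thread, and the queue would need whatever locking or per-core separation the execution contexts require.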
If care is not taken, pipeline programming can introduce major bottlenecks. Inter-stage queues grow when the stages need unequal numbers of CPU cycles, which leads to memory depletion and dropped packets. Packet jitter increases if later stages don't pick up packets soon enough. The problems are worse when control information is also passed to the next stage through the queues. Unlike packets, control events that are dropped, whether due to memory allocation failures or queue size limits, result in functional issues.
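A toy simulation makes the effect of asymmetric stages concrete. The sketch below assumes a fast stage that emits one packet per tick and a slow stage that needs `slow_cost` ticks per packet; the function name `simulate`, the tick model, and the queue limit of 4 are all invented for illustration.

```c
#include <stddef.h>

#define QUEUE_LIMIT 4      /* hypothetical cap on the inter-stage queue */

struct sim_result {
    size_t delivered;      /* packets the slow stage finished */
    size_t dropped;        /* packets lost because the queue was full */
    size_t max_depth;      /* deepest the queue ever got */
};

/*
 * Each tick the fast stage emits one packet. The slow stage needs
 * slow_cost ticks per packet, so it dequeues only on every
 * slow_cost-th tick. slow_cost must be at least 1.
 */
struct sim_result simulate(size_t ticks, size_t slow_cost)
{
    struct sim_result r = {0, 0, 0};
    size_t depth = 0;

    for (size_t t = 1; t <= ticks; t++) {
        if (depth < QUEUE_LIMIT)
            depth++;               /* packet queued for the slow stage */
        else
            r.dropped++;           /* queue full: fast stage's work is wasted */

        if (depth > r.max_depth)
            r.max_depth = depth;

        if (t % slow_cost == 0 && depth > 0) {
            depth--;
            r.delivered++;
        }
    }
    return r;
}
```

With `simulate(20, 2)` the queue reaches its limit within a few ticks, after which roughly every second packet is dropped; with `simulate(20, 1)` the stages are balanced and nothing is lost.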
At times, software applications need to send additional information to the next stage along with the packet. Some implementations allocate a new memory block to hold this meta information alongside the packet buffer. If control data and packets are allocated from the same memory pool, allocations for control data can fail when a large number of packets is received.
My recommendation is to avoid pipeline programming as much as possible and use the run-to-completion model. At times, though, pipelining cannot be avoided. One such case is the stack size restriction in the Linux kernel. In many Linux distributions, the kernel stack is limited to 4K. Some complex applications with multiple logical modules need more than 4K of stack under the run-to-completion model, and splitting the work into multiple logical pipeline stages is one way around the stack size problem.
Another case is asynchronous hardware accelerators. To take full advantage of accelerators such as cryptography and pattern-matching engines, the application running on the cores hands the job to the accelerator at the right time and works on something else, such as the next packet. When the accelerator completes the job, it reports the result asynchronously (mostly through interrupts). The application picks up the result and does the rest of the required processing on the packet. In effect, application processing is divided into a pre-acceleration stage, an acceleration stage (done in hardware) and a post-acceleration stage, which is a pipeline model. As in any pipeline, there is an input queue to the hardware engine and an output queue back to the processors.
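The pre-acceleration, acceleration and post-acceleration split can be sketched as follows. Everything here is a stand-in: `fake_engine_run` plays the role of the hardware (it just XORs the payload), and `job_ring`, `pre_accel` and `post_accel` are hypothetical names, not the API of any real accelerator driver.

```c
#include <stddef.h>

#define JOB_RING 8   /* hypothetical depth of the accelerator rings */

struct packet {
    unsigned char data[4];
    int pre_done, accel_done, post_done;
};

/* Simple ring used for both the engine's input and output queues. */
struct job_ring {
    struct packet *jobs[JOB_RING];
    size_t head, count;
};

int ring_put(struct job_ring *r, struct packet *p)
{
    if (r->count == JOB_RING)
        return -1;
    r->jobs[(r->head + r->count) % JOB_RING] = p;
    r->count++;
    return 0;
}

struct packet *ring_get(struct job_ring *r)
{
    if (r->count == 0)
        return NULL;
    struct packet *p = r->jobs[r->head];
    r->head = (r->head + 1) % JOB_RING;
    r->count--;
    return p;
}

/* Pre-acceleration stage: prepare the packet and hand it to the engine.
 * Returns at once, so the core is free to start on the next packet. */
int pre_accel(struct packet *p, struct job_ring *hw_in)
{
    p->pre_done = 1;
    return ring_put(hw_in, p);
}

/* Stand-in for the hardware engine; real hardware runs concurrently. */
void fake_engine_run(struct job_ring *hw_in, struct job_ring *hw_out)
{
    struct packet *p;

    /* Only take a job when there is room to post its result. */
    while (hw_out->count < JOB_RING && (p = ring_get(hw_in)) != NULL) {
        for (size_t i = 0; i < sizeof p->data; i++)
            p->data[i] ^= 0x5a;
        p->accel_done = 1;
        (void)ring_put(hw_out, p);
    }
}

/* Post-acceleration stage: what a completion interrupt would trigger.
 * Returns the number of results picked up. */
int post_accel(struct job_ring *hw_out)
{
    int n = 0;
    struct packet *p;

    while ((p = ring_get(hw_out)) != NULL) {
        p->post_done = 1;
        n++;
    }
    return n;
}
```

The point of the sketch is the shape of the control flow: `pre_accel` never waits for the result, and `post_accel` only runs once results have appeared on the output ring.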
One of the main issues with the pipeline model is bloating of inter-stage queues. If packets arrive at a very high rate and the next stage's tasklet does not get a chance to run for a long time, the queues eat up memory for their nodes, and partially processed packets get dropped, wasting the CPU cycles already spent on them. As in the run-to-completion model, the better way to avoid this is to stop reading new packets until the packets already read have been processed and sent. As far as possible, simulate the run-to-completion model even in pipeline programming. This can be achieved by making the cores process the queues before a new packet is read. Depending on the system architecture, packets are read from the hardware either in poll mode or in interrupt mode. Many multicore processors on the market today provide a poll mode, in which packets stay in the hardware until the software reads them. In interrupt mode, the operating system reads packets without the application's knowledge and hands them over to the application.
In poll mode, my suggestion is to force processing of the queues of all pipeline stages before dequeuing the next packet. In interrupt mode, my suggestion is to process the pipeline stage queues before processing the new packet. For this to work, pipelining should not dedicate cores to specific stages. The core that receives a packet should be able to execute any stage of the application, and any core should be able to receive a new packet. In essence, go for the SMP model on multicore processors. There, my suggestion is to divide the sessions across the cores to limit contention, while making sure that all application code can run on every core.
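For the poll-mode case, the suggestion amounts to a loop shaped roughly like the sketch below: drain every stage queue first, and only then read one new packet from the hardware. The receive ring is modeled as a plain array, and `pipeline`, `drain_stages` and `poll_once` are hypothetical names chosen for this example.

```c
#include <stddef.h>

#define NUM_STAGES  3   /* stage 0 is the receive stage, run inline */
#define QUEUE_DEPTH 4   /* hypothetical depth of each stage queue */

struct packet { int id; int stage; };   /* stage = next stage to run */

struct stage_queue {
    struct packet *slots[QUEUE_DEPTH];
    size_t head, count;
};

struct pipeline {
    struct stage_queue q[NUM_STAGES];   /* q[s] feeds stage s; q[0] is unused */
    size_t max_depth;                   /* deepest any queue has been */
    int completed, dropped;
};

int enqueue(struct pipeline *pl, int s, struct packet *p)
{
    struct stage_queue *q = &pl->q[s];

    if (q->count == QUEUE_DEPTH)
        return -1;
    q->slots[(q->head + q->count) % QUEUE_DEPTH] = p;
    q->count++;
    if (q->count > pl->max_depth)
        pl->max_depth = q->count;
    return 0;
}

struct packet *dequeue(struct pipeline *pl, int s)
{
    struct stage_queue *q = &pl->q[s];

    if (q->count == 0)
        return NULL;
    struct packet *p = q->slots[q->head];
    q->head = (q->head + 1) % QUEUE_DEPTH;
    q->count--;
    return p;
}

/* Run every queued packet through all of its remaining stages. */
void drain_stages(struct pipeline *pl)
{
    for (int s = 1; s < NUM_STAGES; s++) {
        struct packet *p;

        while ((p = dequeue(pl, s)) != NULL) {
            p->stage = s + 1;           /* the work of stage s would go here */
            if (s + 1 == NUM_STAGES)
                pl->completed++;
            else if (enqueue(pl, s + 1, p) != 0)
                pl->dropped++;
        }
    }
}

/* One poll iteration: finish queued work first, then read one new packet.
 * Returns 1 if a packet was read, 0 if the hardware had none waiting. */
int poll_once(struct pipeline *pl, struct packet *rx, size_t rx_len,
              size_t *rx_next)
{
    drain_stages(pl);
    if (*rx_next == rx_len)
        return 0;

    struct packet *p = &rx[(*rx_next)++];
    p->stage = 1;                       /* receive stage done inline */
    if (enqueue(pl, 1, p) != 0)
        pl->dropped++;
    return 1;
}
```

Because the queues are emptied before each read, no queue in this sketch ever holds more than one packet, which is the run-to-completion behaviour the pipeline is trying to imitate.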
To avoid contention, it is advisable to have as many queues per stage as there are cores. A stage that outputs a packet to the next stage enqueues it to the queue belonging to the current core. As many tasklets as cores are needed to process the queues. Each tasklet takes nodes from its own queue and processes them. Tasklets are scheduled by the operating system as usual. In addition to processing queues through these tasklets, my earlier recommendation applies: force queue processing before a new packet is processed. Since the SMP model requires certain packets to be processed on specific cores, forced processing should handle only the current core's queues for each pipeline stage. One might think that forced processing is enough and that the tasklets are unnecessary. Remember, though, that packets must pass through all stages even when no new packets arrive. And since the forced processing function is called for every new packet, it must take very few cycles, especially when there are no nodes to process.
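One way to keep the forced call cheap is to track, per core, which stage queues are non-empty, so that the empty case costs a single load and a branch. The sketch below is single-threaded user-space C under that assumption; `core_ctx`, the `pending` bitmask, `stage_enqueue` and `process_core_queues` are names invented here, and a real kernel version would use per-CPU data and guard against the tasklet and the forced call interleaving on the same core.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_CORES   4
#define NUM_STAGES  3   /* must not exceed 32 for the bitmask below */
#define QUEUE_DEPTH 8   /* hypothetical depth of each per-core queue */

struct packet { int id; int done_stages; };

struct core_queue {
    struct packet *slots[QUEUE_DEPTH];
    size_t head, count;
};

/* One context per core: a private queue for every stage. */
struct core_ctx {
    struct core_queue q[NUM_STAGES];
    uint32_t pending;   /* bit s set => q[s] is non-empty */
    int completed;
};

struct core_ctx cores[NUM_CORES];

int stage_enqueue(int core, int stage, struct packet *p)
{
    struct core_ctx *c = &cores[core];
    struct core_queue *q = &c->q[stage];

    if (q->count == QUEUE_DEPTH)
        return -1;
    q->slots[(q->head + q->count) % QUEUE_DEPTH] = p;
    q->count++;
    c->pending |= 1u << stage;
    return 0;
}

/* Shared by the per-core tasklet and by the forced call made before each
 * new packet. Returns the number of stage executions performed. */
int process_core_queues(int core)
{
    struct core_ctx *c = &cores[core];
    int runs = 0;

    if (c->pending == 0)
        return 0;                       /* fast path when nothing is queued */

    for (int s = 0; s < NUM_STAGES; s++) {
        struct core_queue *q = &c->q[s];
        int last = (s + 1 == NUM_STAGES);

        /* Stop early if the next stage's queue has no room. */
        while (q->count > 0 && (last || c->q[s + 1].count < QUEUE_DEPTH)) {
            struct packet *p = q->slots[q->head];

            q->head = (q->head + 1) % QUEUE_DEPTH;
            q->count--;
            p->done_stages++;           /* the work of stage s would go here */
            runs++;
            if (last)
                c->completed++;
            else
                (void)stage_enqueue(core, s + 1, p);   /* room was checked */
        }
        if (q->count == 0)
            c->pending &= ~(1u << s);
    }
    return runs;
}
```

Since each core only touches its own `core_ctx`, the queues themselves need no cross-core locking in this arrangement; only the mapping of sessions to cores has to be consistent.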
Another recommendation is to avoid memory allocations for holding meta information alongside the packet when queuing it to the next stage. Typically, each packet buffer already carries its own meta information in its buffer header. In Linux, the skbuff holds buffer-specific information. My advice is to provide additional space in the buffer header for the queue-node-specific meta information. Avoiding these allocations reduces memory fragmentation and improves overall system performance.
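A sketch of the idea: the queue link and a small scratch area live inside the buffer header itself, so enqueuing to the next stage allocates nothing. The `pkt_buf` layout, the 16-byte `stage_meta` area and the helper names are assumptions for illustration and do not reflect the real `sk_buff` definition, although Linux's `sk_buff` does carry a comparable per-layer scratch area in its `cb` field.

```c
#include <stddef.h>
#include <string.h>

#define STAGE_META_BYTES 16   /* hypothetical scratch space per buffer */

/* Example of meta information one stage passes to the next. */
struct stage_meta {
    int session_id;
    int next_action;
};

_Static_assert(sizeof(struct stage_meta) <= STAGE_META_BYTES,
               "stage_meta must fit in the reserved header space");

/* Hypothetical packet buffer header with room for queueing built in. */
struct pkt_buf {
    struct pkt_buf *next;                        /* queue link, no separate node */
    unsigned char stage_meta[STAGE_META_BYTES];  /* per-stage meta information */
    unsigned char *data;
    size_t len;
};

struct pkt_queue { struct pkt_buf *head, *tail; };

/* Copy in and out with memcpy to stay clear of aliasing problems. */
void pkt_set_meta(struct pkt_buf *b, const struct stage_meta *m)
{
    memcpy(b->stage_meta, m, sizeof *m);
}

void pkt_get_meta(const struct pkt_buf *b, struct stage_meta *m)
{
    memcpy(m, b->stage_meta, sizeof *m);
}

/* Enqueue by linking the buffer itself: no allocation, so it cannot fail. */
void pkt_enqueue(struct pkt_queue *q, struct pkt_buf *b)
{
    b->next = NULL;
    if (q->tail)
        q->tail->next = b;
    else
        q->head = b;
    q->tail = b;
}

struct pkt_buf *pkt_dequeue(struct pkt_queue *q)
{
    struct pkt_buf *b = q->head;

    if (!b)
        return NULL;
    q->head = b->next;
    if (!q->head)
        q->tail = NULL;
    b->next = NULL;
    return b;
}
```

A side effect of this layout is that handing a packet to the next stage can no longer fail for lack of memory, which removes one of the drop paths discussed earlier.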