Saturday, September 27, 2008

Are Multicore processors Linux friendly - Part 2

As described in an earlier post, software architects of network equipment are moving their applications to user space, or developing them there from the start. I believe Multicore vendors should keep this in mind while designing accelerators and Ethernet controllers. Unfortunately, the current crop of Multicore devices is not well designed for user space applications.

Let me first list some ways in which user space programs differ from kernel level and bare-metal programs.

  • Virtual Memory: Kernel level programs and bare-metal programs work with physical memory. User space programs in Linux work with virtual memory. Virtual-to-physical memory mappings are created by the operating system on a per-process basis.
    • Some information on virtual address space and how it works: In a 32-bit operating system, each process has its own virtual address space running from 0 to 4 Gbytes. When the process executes, the core uses the MMU to translate each virtual address to a physical address, both for fetching instructions and for reading or writing data in memory. The core keeps a cache of virtual-to-physical mappings in the Translation Lookaside Buffer (TLB). Many processor types have 64 TLB entries per core. When there is no matching TLB entry for a virtual address (a TLB miss), the operating system walks the mapping table created for that process to find a match (on some architectures the hardware performs this walk). If there is a match and a physical address is available, the mapping is added to one of the TLB entries. If there is no physical address, a page fault occurs, which may require reading the data from secondary storage. Since the operating system can schedule many processes on the same core, the TLB is flushed on a context switch (on cores that do not tag entries with an address space ID) and is refilled with the mappings of the new process as it executes.
  • Operating system scheduling: In bare-metal programming, the application developer has full control over the cores and the logic that runs on each core. User processes may be scheduled on different cores, and the same core may run multiple processes. Since the operating system gives each process a time slice, a user program can be preempted at any time.
  • Restart capability: Bare-metal and kernel level programs are initialized when the system comes up. If there is a problem in the program, the whole system is restarted. User space programs offer the flexibility of a graceful restart, and of being restarted after a crash. That is, a user space program should be able to reinitialize itself when it starts, even though the OS and the rest of the system have not been restarted.
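
The TLB behaviour described in the virtual memory item above can be sketched as a toy model. This is only an illustration, not how any particular core works: the 4 KiB page size, the least-recently-used replacement policy, and the names ToyTLB, PageFault and translate are all assumptions made for the example. The page table is a plain dictionary from virtual page number to physical frame number.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed 4 KiB pages


class PageFault(Exception):
    """Raised when a virtual page has no physical frame in the page table."""


class ToyTLB:
    """A tiny per-core TLB with least-recently-used replacement (assumed policy)."""

    def __init__(self, capacity=64):
        self.capacity = capacity
        self.entries = OrderedDict()  # virtual page number -> physical frame number
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr, page_table):
        """Return the physical address for vaddr, filling the TLB on a miss."""
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.entries:
            self.hits += 1
            self.entries.move_to_end(vpn)  # mark as most recently used
        else:
            self.misses += 1
            if vpn not in page_table:
                # No physical page behind this address: the OS must handle it.
                raise PageFault(hex(vaddr))
            if len(self.entries) >= self.capacity:
                self.entries.popitem(last=False)  # evict least recently used
            self.entries[vpn] = page_table[vpn]
        return self.entries[vpn] * PAGE_SIZE + offset

    def flush(self):
        """Model a context switch on a core without address space tags."""
        self.entries.clear()
```

With a page table such as {0: 7, 1: 3}, the first access to an address in page 0 is counted as a miss and the second as a hit. After flush(), the same access misses again, which is the cost a process pays after every context switch.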

What challenges do user programs face when programming Multicore processors?

Virtual Memory:

Many Multicore processors are blind to virtual memory. They understand only physical memory. Because of that, application software has to take extra steps to get the best out of the accelerators. Some of the techniques software uses are:
  • Use physical addresses for buffers that are sent to the HW accelerators and Ethernet controllers: Software allocates memory in kernel space, which reserves the physical memory, and maps it into user space using mmap(). Though it looks simple, it is not practical for all applications.
    • Applications need to be changed to allocate their buffers from this memory-mapped region. Memory allocation in some applications goes through multiple layers of software. For example, some software might be written in high level languages such as Java, Perl or Python. Mapping the allocation routines of these programs onto the memory-mapped area could be nearly impossible, and at best requires major software development.
    • Applications allocate memory for many reasons, and often call the same allocation function for all of them. To take advantage of the memory-mapped space, either the application has to provide new memory allocation routines, or all allocations have to be satisfied from the mapped area. The first case requires software changes, which could be significant if the application has built multiple layers on top of the basic allocation library routines. The second case may not work, because the mapped area may be too small to satisfy every allocation. Note that kernel space is limited, and the amount of memory that can be mapped is not infinite.
  • Implement hardware drivers in kernel space and copy the data from virtual memory to physical memory and vice versa using the copy_from_user and copy_to_user routines. This method obviously has a performance problem: memory copy overhead. It also requires a driver in the kernel, which many software developers prefer to avoid. Their preference would be to memory map the hardware and use it directly from user space software.
  • Use virtual space for all buffers. Convert the virtual addresses to physical addresses and give the physical addresses to the HW accelerators. Though this is better, it also has performance issues: locking the memory and getting the physical pages is not cheap. First, a user space equivalent of get_user_pages() needs to walk the process-specific page table to get the physical pages for the virtual pages. Second, all physical pages need to be locked using the mlock() function, which is not very expensive but still takes a good number of CPU cycles. On top of that, the result of get_user_pages is a set of physical pages which may not be contiguous. If the accelerators don't support scatter-gather buffers, the data has to be flattened into a contiguous buffer, which is again very expensive.
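
The non-contiguity problem in the last technique can be illustrated with a small model. The sketch below is not a real driver interface: build_sg_list is a hypothetical helper, the page table is a plain dictionary from virtual page number to physical frame number, and 4 KiB pages are assumed. It shows how a buffer that is contiguous in virtual space turns into a list of (physical address, length) segments, which is what an accelerator with scatter-gather support would consume.

```python
PAGE_SIZE = 4096  # assumed 4 KiB pages


def build_sg_list(vaddr, length, page_table):
    """Return [(phys_addr, nbytes), ...] covering [vaddr, vaddr + length).

    page_table maps virtual page number -> physical frame number.
    A KeyError means the page is not resident (not handled in this sketch).
    """
    segments = []
    remaining = length
    while remaining > 0:
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        chunk = min(PAGE_SIZE - offset, remaining)  # stay inside one page
        paddr = page_table[vpn] * PAGE_SIZE + offset
        if segments and segments[-1][0] + segments[-1][1] == paddr:
            # Physically adjacent to the previous segment: merge the two.
            prev_addr, prev_len = segments[-1]
            segments[-1] = (prev_addr, prev_len + chunk)
        else:
            segments.append((paddr, chunk))
        vaddr += chunk
        remaining -= chunk
    return segments
```

If the list comes back with a single segment, the buffer happens to be physically contiguous and can be handed to any accelerator. With more than one segment, hardware without scatter-gather support forces the copy into a flat buffer mentioned above.
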
I expect that at least future versions of Multicore processors will be able to understand virtual memory and spare software from doing anything special. I expect that a Multicore processor takes virtual addresses for both input and output buffers, in addition to the acceleration-specific input, converts the virtual addresses to physical addresses, and gets the accelerator function executed by the accelerator engines. There is a possibility that the virtual-to-physical mapping changes while the V-to-P conversion or the acceleration algorithm is running. To avoid this, the V-to-P conversion module should do two things.

A. Input side:

1.  Copy the input data to the internal RAM of the accelerator.
2.  While copying, if it finds that there is no TLB entry, it returns an error to the software.
3.  Software accesses the virtual address, which fills the TLB entry, and then issues the command again to continue from where it left off.

B. Returning the result:

1.  Copy the output from internal RAM to virtual memory using the TLB.
2.  If it finds there is no mapping, it returns to the software, if the software is using the accelerator synchronously. If the software is not waiting, it generates an interrupt, which wakes the processor so that software can issue the command to read the result.
3.  Software accesses the virtual address, which creates the TLB entry, and issues the command again. The v-to-p conversion module resumes writing from the place it left off.
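
The input-side steps can be sketched as a small simulation. Everything here is hypothetical: ToyCopyEngine stands in for the proposed v-to-p module, the TLB is a dictionary from virtual page number to physical frame number, DDR is a bytearray, and touch is a callback that models software accessing the missing address so that the TLB entry gets filled. The output side would follow the same pattern with the copy direction reversed.

```python
PAGE_SIZE = 4096  # assumed 4 KiB pages


class ToyCopyEngine:
    """Hypothetical v-to-p module that may only use translations already in the TLB."""

    def __init__(self, tlb, memory):
        self.tlb = tlb            # dict: virtual page number -> physical frame number
        self.memory = memory      # bytearray standing in for DDR
        self.sram = bytearray()   # internal accelerator RAM

    def copy_in(self, vaddr, length, done=0):
        """Copy input into SRAM, resuming at byte offset `done`.

        Returns (bytes_done, None) when finished, or
        (bytes_done, missing_vaddr) when a TLB entry is absent.
        """
        while done < length:
            cur = vaddr + done
            vpn, offset = divmod(cur, PAGE_SIZE)
            if vpn not in self.tlb:
                return done, cur  # step 2: report the miss to software
            chunk = min(PAGE_SIZE - offset, length - done)
            paddr = self.tlb[vpn] * PAGE_SIZE + offset
            self.sram += self.memory[paddr:paddr + chunk]
            done += chunk
        return done, None


def submit_with_retry(engine, vaddr, length, touch):
    """Software side: touch the missing page, then reissue from where it stopped."""
    done, miss = engine.copy_in(vaddr, length)
    retries = 0
    while miss is not None:
        touch(miss)               # step 3: the access refills the TLB entry
        retries += 1
        done, miss = engine.copy_in(vaddr, length, done)
    return retries
```

The point of the design is that the engine never walks page tables itself. A miss costs one ordinary memory access by software plus a reissued command, and data already copied into SRAM is not copied again.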

In most cases the TLB entry for the input buffer will already be present. It is possible that an entry has been lost by the time the accelerator finishes its job. Even so, this scheme avoids much of the software processing described above.

Since the v-to-p hardware module needs access to the TLB, it needs to be part of the core. So the commands to issue work to the accelerator and to read the result should look more like instructions.

Note that the TLB gets overwritten every time there is a process context switch. While doing the memory copy, specifically for the output buffer, the v-to-p module is expected to check that the process ID for which the TLB is currently valid matches the process ID it received as part of the command.
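
A minimal sketch of that check follows. The names are made up for illustration: core is a dictionary with the assumed fields current_pid and tlb, and StaleContextError is a hypothetical error the module would report instead of writing through a TLB that now belongs to another process.

```python
class StaleContextError(Exception):
    """The TLB no longer belongs to the process that issued the command."""


def translate_for_command(core, command_pid, vpn):
    """Return the frame number for vpn, or None on a TLB miss.

    core is a dict with the hypothetical fields 'current_pid' and 'tlb'.
    The translation is refused when another process owns the TLB.
    """
    if core["current_pid"] != command_pid:
        raise StaleContextError(
            f"command from pid {command_pid}, TLB holds pid {core['current_pid']}"
        )
    return core["tlb"].get(vpn)
```

Without the check, the same virtual page number could resolve to a frame of whichever process is running when the accelerator finishes, and the result would be written into the wrong process.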

The v-to-p module is also expected to copy input data from DDR to internal SRAM (or other high speed RAM) and output data from SRAM back to DDR. This copy is expected to be very fast and not add to the latency. Hence the v-to-p module is expected to work with the core caches, for both coherency and performance reasons.

Operating System Scheduling:

The operating system may use the same core to run multiple independent processes. All of these processes may need to use the accelerators by memory mapping them into user space. Since these are independent applications, there should be no expectation that they use the accelerators in a cooperative fashion.

Current Multicore processors do not allow multiple user processes running on a core to access the hardware accelerators independently. Because of this, software puts drivers in kernel space to access the hardware. Each user process talks to the driver, which in turn services the requests and returns the results to the appropriate user process. Again, this has performance issues, because buffers are copied from user space to kernel space and back.

This limitation of today's Multicore processors stems from two things:

A.  Multiple virtual instances can't be created in the acceleration engines.

B.  Interface points to the HW accelerator are limited to one per core.

If those two limitations are mitigated, then multiple user processes can use hardware accelerators by directly mapping them into user space. 


Restart Capability:

A user process can be restarted at any time, either through a graceful shutdown or because of a crash. When there is a crash, some buffers may still be pending in the accelerator device. Whenever a user process crashes or is gracefully shut down, Linux frees the physical pages associated with that process. Those physical pages can then be reused by anyone else. If the accelerator keeps working on a physical page, thinking that it is still owned by the user process that issued the command, it may write data that corrupts some other process.

I believe that if the solution described in the 'Virtual Memory' section above is followed, there is no issue, because the accelerators work on internal SRAM. Since the v-to-p module always checks the TLB while writing into memory, it should not corrupt any memory.

I hope my ramblings are making sense.
