Saturday, April 5, 2008

SMT threading in Multicore processors

I found a good article on the concepts of Simultaneous Multi-threading (SMT). See here.

As a software engineer, it is easy to confuse threading implemented in hardware with threading implemented in software. Though the concepts are similar, there is no one-to-one correspondence. Even single-core chips that have no hardware threads can support software threads. Software threads are really an operating system concept. Wherever I say 'thread' in the rest of this article, I mean a hardware thread.

Hardware threads take far less die space than adding a new core to the die. You are mistaken if you think that each hardware thread is equal to one core. Note that operating systems such as Linux expose a virtual CPU for each hardware thread, but its performance can't be equated with that of a physical CPU. Operating systems do this for convenience.

Hardware threads are part of each core. If a Multicore processor with 8 cores supports 2 threads per core, then there are 16 threads in total.
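To make the counting concrete, here is a minimal sketch that groups the virtual CPUs Linux exposes back into physical cores. It relies on the sysfs file thread_siblings_list, which lists the logical CPUs sharing a core. The 8-core, 2-thread example data and its numbering (CPU n paired with CPU n+8) are my assumptions for illustration; real numbering differs from chip to chip.

```python
import glob
import re


def parse_cpu_list(text):
    """Expand a Linux CPU list string such as "0-1" or "0,8" into sorted ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def group_by_core(sibling_lists):
    """Collapse per-CPU sibling lists into one tuple of logical CPUs per core."""
    return sorted({tuple(parse_cpu_list(s)) for s in sibling_lists.values()})


def read_sibling_lists():
    """Read sibling lists from Linux sysfs; returns {} if the files are absent."""
    pattern = "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"
    result = {}
    for path in glob.glob(pattern):
        cpu = int(re.search(r"cpu(\d+)/topology", path).group(1))
        with open(path) as handle:
            result[cpu] = handle.read()
    return result


# Hypothetical 8-core, 2-thread part: logical CPU n and n+8 share a core.
example = {cpu: f"{cpu % 8},{cpu % 8 + 8}" for cpu in range(16)}
cores = group_by_core(example)
print(len(example), "logical CPUs on", len(cores), "cores")
```

On a Linux machine, group_by_core(read_sibling_lists()) should give the same kind of grouping for the real hardware.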

Basic idea behind SMT: Cores typically have multiple execution units for different operations, such as arithmetic units, branch units, floating point units and load/store units. Each unit is pipelined, that is, it does its work in multiple stages. Any work given to a unit takes multiple processing cycles, with each stage normally taking one cycle. All stages operate at the same time, each on a different piece of work. Processor utilization is highest when all stages are full all the time. Typical workloads running on a single thread don't fill all stages of any given unit. Stages left unused by one thread can be used to execute the software running on other threads. The probability of keeping all execution stages busy increases with the number of threads.
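The last point can be illustrated with a toy probability model. This is only a sketch under a strong assumption of mine: each thread independently has work ready for a given pipeline slot with probability p_ready. Real threads compete for shared resources, so treat the numbers as an upper-bound intuition, not a prediction.

```python
def slot_utilization(p_ready, num_threads):
    """Probability that at least one of num_threads can fill a pipeline slot.

    Assumes each thread is ready independently with probability p_ready,
    which is a simplification made for illustration.
    """
    return 1.0 - (1.0 - p_ready) ** num_threads


for n in (1, 2, 4):
    print(n, "thread(s):", round(slot_utilization(0.5, n), 4))
```

With p_ready = 0.5, the model gives 0.5 for one thread, 0.75 for two and 0.9375 for four, so each extra thread helps less than the previous one.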

What are other situations where hardware threads are useful?

Processor delay during a cache miss: L1 and L2 caches are much smaller than DDR memory. When instructions are executed or data operations are performed, the target information may not be in the cache, a condition called a cache miss. A cache miss leads to reading the data from DDR. It takes many cycles (around 100 or so) to bring the data into the cache. During this time, if there is only one thread per core, the core literally waits and does nothing. If there are other threads in the core, they can perform other operations that do not require reading from external memory.
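A small cycle-by-cycle simulation shows how a second thread can hide this miss latency. It is a simplified sketch, not a model of any real core: I assume the core runs one ready thread per cycle, every thread does a fixed number of compute cycles between misses (compute_cycles, a made-up parameter), and each miss costs the roughly 100 cycles mentioned above.

```python
def simulate(num_threads, compute_cycles, miss_penalty, total_cycles):
    """Return the fraction of cycles in which the core did useful work."""
    ready_at = [0] * num_threads                 # first cycle each thread may run
    remaining = [compute_cycles] * num_threads   # work left before the next miss
    busy = 0
    for cycle in range(total_cycles):
        for t in range(num_threads):
            if ready_at[t] <= cycle:
                busy += 1
                remaining[t] -= 1
                if remaining[t] == 0:
                    # Cache miss: the thread stalls until the data arrives.
                    ready_at[t] = cycle + 1 + miss_penalty
                    remaining[t] = compute_cycles
                break  # only one thread executes per cycle in this model
    return busy / total_cycles


for threads in (1, 2, 4):
    print(threads, "thread(s):", simulate(threads, 100, 100, 2000))
```

With 100 compute cycles per 100-cycle miss, one thread keeps the core busy half the time, while two threads cover each other's stalls completely. When misses are more frequent, two threads are not enough to hide them, and when misses are rare there is little to hide in the first place.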

Processor waiting for results from accelerator devices: Software programs often use accelerator devices synchronously. That is, the program issues a command to the accelerator and then waits for the result in a tight loop. During this time, until the result comes back from the accelerator, the hardware thread does no useful work. If there are multiple hardware threads, the other threads can use the core's resources, thereby utilizing the core to its maximum.
  • One might say that accelerators should not be used synchronously, and that software programs should use them asynchronously and do something else until the results come in. In theory this sounds good, but in many cases it is not possible to change the software architecture, either because the software is written in interpreted languages or because developers don't want to change the software.
  • One might also say that software programs, rather than making the processor spin while waiting for the result, should yield to the operating system and get control back through an interrupt when the result is ready. Again, this sounds good, but it is not an option in many cases where hardware is accessed directly from a user space daemon (via mmap) in Linux-like operating systems, because interrupt handling, waking up the user process and getting it scheduled can add significant latency.
  • Some Multicore processor vendors, including Intel, have avoided these problems by introducing a co-processor model with additional instructions. For example, crypto acceleration is implemented as a new set of instructions in the core. Though the situation is better here, multithreading still helps, as the co-processing unit might also have pipeline stages. Multithreading helps in utilizing these stages as well.

What performance improvement can be expected with Multithreading?

It depends on the workload. You might even see some performance degradation if the application uses a single software thread. In general there is a performance improvement with threading, but mileage varies. Networking applications such as firewall and IPsec see an improvement ranging from very minor to 30% when moving from single-threaded cores to dual-threaded cores, depending on the number of sessions. If the number of sessions/tunnels used to measure performance is very small, the improvement with dual-threaded cores is not very high. With a larger number of sessions, one can see a significant improvement. When there are few sessions, most of the session state may fit within the cache and DDR accesses may be rare. I guess that could be the reason why performance does not improve with the number of threads in that case. But in real life, DDR accesses and accelerator accesses will be there, and performance improvements will be seen.

How many threads per core are ideal?

Based on the network applications I mentioned above (Firewall, NAT and IPsec), more than two threads per core does not provide enough improvement to justify the cost of adding threads to the core. In the best case, we saw a 30% performance improvement with two threads and only a 40% improvement with 4 threads in our experiments. I personally expected more than 40%, and I can't explain why it is only 10% more with 4 threads. It could be that some other factors are playing a role.
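The diminishing return is easier to see when the improvement is divided by the number of threads added at each step. The short sketch below uses only the two figures quoted above (30% with 2 threads, 40% with 4), both taken relative to one thread per core; the helper name marginal_gains is mine.

```python
def marginal_gains(results):
    """Return (threads, % gain per added thread) for each measured step.

    results maps threads-per-core to % improvement over a single thread.
    """
    points = sorted(results.items())
    gains = []
    for (prev_n, prev_gain), (n, gain) in zip(points, points[1:]):
        gains.append((n, (gain - prev_gain) / (n - prev_n)))
    return gains


measured = {1: 0.0, 2: 30.0, 4: 40.0}  # best-case figures quoted in this post
for threads, per_thread in marginal_gains(measured):
    print(f"up to {threads} threads: {per_thread:.1f}% per added thread")
```

The second thread bought 30%, while the third and fourth threads bought about 5% each.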

What kind of role can the OS play?

Note that all threads in a core share the L1 and L2 caches. If the threads run different software programs, there could be a lot of cache thrashing. If the same program runs on both threads of a given core, there is a possibility of sharing cached data. Linux SMP keeps this in mind while scheduling tasks onto cores and hardware threads. As much as possible, OSes tend to assign a similar software thread to the other hardware thread of the same core.

What I would like to say from an application's perspective:

  • 2 hardware threads per core.
  • A larger L2 cache, because of Multithreading.
  • Coprocessor-based accelerators.

Make no mistake: a hardware thread is not a replacement for a core.

2 comments:

Ravi said...

A 30% performance improvement from one additional thread per core is very good. If I go by the Multiprocessor report on the Netlogic XLP processor, one can get a 100% performance improvement on applications such as snort and clamav. Seems very good to me.

Netlogic, Intel and IBM Multicore processors have multithreading. I wonder why other Multicore vendors did not take this approach.

Srini said...

I think it is a matter of time before you see multi-threading in every Multicore processor.

The instruction execution engine stalls for multiple reasons, as given in the post: cache misses, not enough independent instructions available for out-of-order execution to keep the execution units busy, and waiting for accelerator results when accelerators are used synchronously. I am sure there are other reasons that I am not aware of.

For realistic workloads, one can expect around a 25% improvement in performance. As long as this average improvement is greater than the percentage of die area required to add a thread, it is worth it. That is, if adding a thread requires 15% more die area per core and gives a 25% performance boost, then adding a thread to each of 6 cores (roughly 100/15) costs about the die area of one core but yields 6 x 25% = 150% of one core's throughput, which is more than adding one core would give.

There is a lot of debate in the industry on whether to add Multithreading or more cores to Multicore processors. I say that it should be a combination of both.