Simultaneous Multithreading: Multiplying Alpha Performance

1 Simultaneous Multithreading: Multiplying Alpha Performance
Dr. Joel Emer, Principal Member of Technical Staff, Alpha Development Group, Compaq Computer Corporation

To begin, I must say that I am honored to represent the large number of outstanding engineers on the Alpha design team, who have produced and are dedicated to continuing to produce the fastest microprocessors in the world. And, as is obvious from the title, I'm going to discuss the introduction of multithreading into future Alpha processors.

2 Outline
Alpha Processor Roadmap
Motivation for Introducing SMT
Implementation of an SMT CPU
Performance Estimates
Architectural Abstraction

I'll begin by putting the work into the context of the Alpha CPU roadmap, spend some time on our motivation, show how SMT can be implemented in the context of an out-of-order execution processor, show some preliminary performance results, and then conclude with a bit on the software model.

3 Alpha Microprocessor Overview
[Roadmap diagram: major new core designs across the top - 21264/EV6 at 0.35µm, EV7 at 0.18µm, EV8 at 0.125µm - with shrinks and derivatives below: 21264/EV67 at 0.28µm, 21264/EV68 at 0.18µm, EV78 at 0.125µm. Axes indicate higher performance and lower cost; first system ship dates begin in 1998-1999.]

This is the requisite processor evolution diagram, with major designs across the top and shrinks and derivatives below. EV7 was described here at Microprocessor Forum last year, and EV8 is the next fully new core CPU design.

4 EV8 Technology Overview
Leading edge process technology - GHz-class clock speeds
0.125µm CMOS
SOI-compatible
Cu interconnect
low-k dielectrics
Chip characteristics:
~1.2V Vdd
~250 million transistors
~1100 signal pins in flip-chip packaging

A brief summary of the technology going into EV8: we will have the most advanced process technology available at the time of introduction. That means we will be using 1/8-micron CMOS. We will also be designing to be SOI-compatible so we can take advantage of that technology. Furthermore, we will be using all the other state-of-the-art technologies, such as Cu interconnect and low-k dielectrics. Gross chip characteristics include a 1.2V Vdd, about a quarter billion transistors, and around 1100 signal pins in flip-chip packaging.

5 EV8 Architecture Overview
Enhanced out-of-order execution
8-wide superscalar
Large on-chip L2 cache
Direct RAMBUS interface
On-chip router for system interconnect: glueless, directory-based ccNUMA with up to 512-way multiprocessing
4-way simultaneous multithreading (SMT)

Some people have speculated that there's no performance to be gained going beyond 4-wide issue, but EV8 will demonstrate that you can get a significant performance gain by going wider. This will be accomplished with numerous enhancements to the out-of-order pipeline, and, as was done with the 21264, without sacrificing cycle time. EV8 will also follow the model used in EV7: a large on-chip level 2 cache, on-chip Direct RAMBUS controllers, and on-chip network routers for glueless multiprocessor systems. Finally, EV8 will introduce simultaneous multithreading, which is the principal focus of this talk.

6 Goals
Leadership single stream performance
Extra multistream performance with multithreading
Without major architectural changes
Without significant additional cost

Given those characteristics, I want to review our goals. Alpha's distinguishing characteristic is leadership single stream performance, and we're not going to forget that objective. So first and foremost, in both architecture and cycle time, EV8 will aim at being the highest performing microprocessor when running a single thread. Second, we significantly improve that performance when running multiple threads, without having to change the Alpha architecture and without large additional die area. Although there certainly is additional design effort, as you will see later we avoid replicating most structures (especially large ones) to support multithreading. Or, put the other way, when running a single thread, all the chip resources can be dedicated to running that thread at peak speed.

7 Reduced function unit utilization due to dependencies
[Diagram: "Instruction Issue" - single issue slots plotted against time, with red boxes marking instruction execution and gaps between them.]

To motivate how we came to incorporate multithreading into EV8, let's return to the early 1980s and take a very high level view of instruction execution. This very abstract diagram illustrates the activity in just the execute stage of a single-issue machine, with the red boxes showing instruction execution. Note the gaps due to multi-cycle operation latency and inter-instruction dependencies, which result in less than perfect utilization of the function units.
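To make the picture concrete, here is a small C fragment of my own (an illustration, not from the talk): each statement depends on the previous result, so the operations cannot overlap no matter how many function units exist, and multi-cycle multiply latency leaves empty issue slots between them.

```c
/* Illustrative only: a serial dependence chain.  With a multi-cycle
 * multiplier, the execute stage sits idle between these operations
 * regardless of how many multipliers the chip has. */
double chain(double a, double b, double c, double d) {
    double x = a * b;   /* issues immediately          */
    double y = x * c;   /* stalls until x is available */
    double z = y * d;   /* stalls until y is available */
    return z;
}
```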

8 Superscalar leads to more performance, but lower utilization
[Diagram: "Superscalar Issue" - several issue slots per cycle plotted against time, with many slots left unused.]

If that weren't bad enough, for more performance we now issue in parallel (or wide) to use more function units in a cycle; for example, the peak sustainable execution rate of the 21264 is 4. Even with sophisticated techniques like out-of-order (or dynamic) issue and sophisticated branch prediction, this leads to more waste, since there aren't always enough instructions to issue in a cycle. A wide variety of studies have shown this function unit utilization to be under 50%, and getting worse as machines continue to get wider. Note, I'm not saying that it's bad to keep on this trajectory, because continuing to make the machine wider still provides an absolute single stream performance benefit. Furthermore, the key point is not keeping the function units busy - they really aren't that expensive - but that busy function units represent more work getting done. So what can we do?

9 Adds to function unit utilization, but results are thrown away
[Diagram: "Predicated Issue" - issue slots over time, with extra slots filled by operations whose results are discarded.]

Some approaches to dealing with the idle function units attempt to improve performance by executing operations whose results are never going to be committed; an example is predication. These operations can improve performance somewhat by shortening latency, at the expense of using more function units and throwing those results away. This can go only a small distance toward improving useful occupancy of the function units: with predication we get some performance gain in exchange for significant use of function units by operations whose results are discarded. In fact, predication practically guarantees that much of the time half the results generated will be thrown away. We wanted to find a technique where we could get more performance by actually using the results from these function units.
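As a hedged sketch of the idea (my example, not EV8 code), here is classic if-conversion: the compiler computes both arms of a conditional and selects one, trading wasted function unit work for the removal of a hard-to-predict branch.

```c
/* Branchy form: only one arm executes, at the cost of a
 * conditional branch that can mispredict. */
int branchy(int p, int a, int b) {
    if (p)                       /* conditional branch */
        return a + b;
    return a - b;
}

/* If-converted form: both arms always execute and a conditional
 * move picks the survivor, so one result is always thrown away. */
int predicated(int p, int a, int b) {
    int t_then = a + b;          /* always computed */
    int t_else = a - b;          /* always computed */
    return p ? t_then : t_else;  /* select, e.g. via cmov */
}
```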

10 Limited utilization when only running one thread
[Diagram: "Chip Multiprocessor" - the issue slots of two narrower CPUs over time, each with its own gaps.]

Another approach to getting more efficiency is to use two smaller CPUs, but this sacrifices single stream performance. And Amdahl's law tells us that sometimes you have parallelism and sometimes you don't, so one can't always use multiple streams - which means you will still see reduced function unit utilization. So as long as you can keep getting cost-effective absolute performance gains from making the processor wider without sacrificing cycle time, it makes sense to do so.

11 Fine Grained Multithreading
[Diagram: issue slots over time, filled on different cycles by instructions from different threads; many slots remain empty. Intra-thread dependencies still limit performance.]

The classic answer to the problem of dependencies is to take instructions from multiple threads. This still leaves a lot of wasted slots, and some variants of this style of multithreading result in a design where every thread goes much slower. I actually worked on this style of multithreading for my Ph.D. ages ago, but had abandoned the idea until now.

12 Simultaneous Multithreading
[Diagram: issue slots over time, with any available slot filled by any available thread. Maximum utilization of function units by independent operations.]

What changed my mind was the work by Dean Tullsen and his advisors Susan Eggers and Hank Levy at the University of Washington, who addressed the waste issue by proposing simultaneous multithreading. What they suggested was simply using any available slot for any available thread. But this work was somewhat incomplete, so we then collaborated on achieving the goal of uncompromised single stream performance - and multithreading.
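To see why filling any slot from any thread helps, here is a toy Monte Carlo model of my own - emphatically not the Alpha performance model - in which each thread offers a random number of ready instructions per cycle and an 8-wide machine drains them greedily. Utilization climbs with thread count because one thread's stall cycles become another thread's issue opportunities.

```c
#include <stdio.h>
#include <stdlib.h>

#define WIDTH  8          /* issue slots per cycle, EV8-like  */
#define CYCLES 1000000    /* cycles to simulate               */

/* Average fraction of issue slots filled per cycle when `threads`
 * independent threads each offer 0..4 ready instructions. */
static double utilization(int threads) {
    long issued = 0;
    for (long c = 0; c < CYCLES; c++) {
        int slots = WIDTH;
        for (int t = 0; t < threads && slots > 0; t++) {
            int ready = rand() % 5;              /* 0..4 ready */
            int take  = ready < slots ? ready : slots;
            issued += take;
            slots  -= take;
        }
    }
    return (double)issued / ((double)CYCLES * WIDTH);
}

int main(void) {
    for (int t = 1; t <= 4; t++)
        printf("%d thread(s): %2.0f%% of issue slots used\n",
               t, 100.0 * utilization(t));
    return 0;
}
```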

13 Basic Out-of-order Pipeline
Fetch → Decode/Map → Queue → Reg Read → Execute → Dcache/Store Buffer → Reg Write → Retire
[Diagram: the PC feeds the Icache in Fetch, the Register Map sits in Decode/Map, and the Regs and Dcache serve the execute and memory stages; everything after mapping is labeled "thread-blind".]

Here is a high level picture of the normal flow of the pipeline (see diagram). If we assume that instructions can come from different threads, we observe that after renaming, instructions are "thread-blind": the instruction dependencies are fully reflected in the register dependencies, so instructions from different threads could be mixed at will. This partitions the pipeline into two pieces - a thread-dependent part and a thread-independent part.

14 SMT Pipeline
Fetch → Decode/Map → Queue → Reg Read → Execute → Dcache/Store Buffer → Reg Write → Retire
[Diagram: the same pipeline, now with multiple PCs and a replicated Register Map per thread at the front end; the Icache, instruction queue, Regs, and Dcache remain shared.]

Now we enhance the pipeline by allowing fetch of instructions from multiple program counters. Each cycle we pick a thread to fetch from. The instructions are fetched out of a completely shared instruction cache and sent on to the mapper, where separate maps are maintained for each thread. Then the instructions are dropped into the queue, and because the dependencies are all determined by the physical register dependencies, instructions from different threads can be mixed in all subsequent stages of the pipeline, as illustrated by the diagram showing all the units uniformly shared by all the threads. Note, however, that the register file, although shared, must be made larger to accommodate the register state of all the threads.
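Here is a hedged C sketch (structure names and sizes are my illustrative assumptions, not EV8's) of why renaming makes the back end thread-blind: the per-thread state is just a PC and a map, and a renamed instruction carries only shared physical register numbers.

```c
#include <stdint.h>

#define ARCH_REGS 32   /* Alpha integer architectural registers   */
#define PHYS_REGS 256  /* assumed size, grown to hold all threads */

struct thread_front_end {        /* replicated per thread (TPU) */
    uint64_t pc;
    uint16_t map[ARCH_REGS];     /* architectural -> physical   */
};

struct renamed_uop {             /* thread-blind past this point */
    uint16_t src1, src2, dst;    /* physical register numbers    */
    uint8_t  opcode;
};

/* Toy allocator standing in for a real physical-register free list. */
static uint16_t free_head = 128;
static uint16_t alloc_phys(void) { return free_head++ % PHYS_REGS; }

/* Rename one instruction from one thread.  After this returns, no
 * later stage needs to know which thread the uop came from: its
 * dependencies are fully encoded in src1/src2/dst. */
struct renamed_uop rename(struct thread_front_end *fe, uint8_t op,
                          uint8_t rs1, uint8_t rs2, uint8_t rd) {
    struct renamed_uop u = { fe->map[rs1], fe->map[rs2], 0, op };
    u.dst = alloc_phys();        /* fresh physical destination   */
    fe->map[rd] = u.dst;         /* later readers of rd see it   */
    return u;
}
```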

15 Changes for SMT
Basic pipeline - unchanged
Replicated resources:
Program counters
Register maps
Shared resources:
Register file (size increased)
Instruction queue
First and second level caches
Translation buffers
Branch predictor

So, to recap what we needed to do. First - nothing with respect to the basic pipeline. Then some resources are replicated, in particular the PCs and register maps; note that these are pretty small. Then we provided the means to share all the other, mostly large, resources. In terms of size, only the register file actually needed to be made larger. Other resources were primarily sized at the optimal point for single threaded performance; only when the cost was small were some enlarged to enhance SMT performance. In some cases the sweet spot on the performance curve moves up, encouraging larger structures, but we actually didn't make anything much larger as a result of multithreading. In some cases the designers opted for replicated structures rather than designing the control necessary to mingle information associated with different threads in the same structure. So now I'm telling you that we're going to have all this contention in the shared resources - isn't performance going to be abysmal?
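Restating the slide's inventory as a data structure - a sketch under my own naming, with sizes assumed rather than taken from EV8 - makes the cost argument visible: the replicated state per thread is tiny next to the shared structures.

```c
#include <stdint.h>

struct tpu_regs {                /* replicated: one per thread */
    uint64_t pc;                 /* program counter            */
    uint16_t reg_map[32];        /* register map               */
};

struct smt_core {                /* shared by all four threads    */
    struct tpu_regs tpu[4];      /* the only per-thread state     */
    uint64_t phys_reg_file[256]; /* register file: size increased */
    /* Also shared, essentially unchanged in size: instruction
       queue, first and second level caches, translation buffers,
       branch predictor. */
};
```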

16 Multiprogrammed workload
Here's the result of running multiple independent tasks on our SMT performance model. This is relative performance, starting from a base that is significantly faster than the 21264. In each case aggregate performance goes up significantly when run with multiple threads. So in fact there is more contention, but the latency tolerance of the SMT more than makes up for the additional latency. But these are just throughput improvements…

17 Decomposed SPEC95 Applications
Here we illustrate blind application of the SUIF compiler to decompose SPECfp95 applications. Some show no benefit, but some show significant gain - up to 2x. This demonstrates a real increase in the SPECfp metric. So here we see that SMT can be used to get more performance on single applications without changing the architecture. Use of SMT gives us more registers, allowing techniques such as loop fusion that need more registers, and provides the opportunity for fast communication through shared caches. SMT also benefits from the ability to dynamically shift between thread level parallelism, when there are many threads to run, and peak instruction level parallelism, when there is only one thread to run.

18 Multithreaded Applications
This same phenomenon is demonstrated on explicitly threaded applications. Again, speedup varies with application. The first few are hand-coded benchmark programs, but I think the last one is most significant, showing a dramatic speedup on a transaction processing workload.

19 Architectural Abstraction
1 CPU with 4 Thread Processing Units (TPUs)
[Diagram: four TPUs (TPU 0 - TPU 3) sharing hardware resources: Icache, TLB, Dcache, Scache.]

Stepping back for a moment, I want to talk about the architectural abstraction used by the SMT processor. In an SMT CPU, each thread appears to be a fully architecturally compatible member of the Alpha processor line. To help talk about this, we use the term "Thread Processing Unit", or TPU, to refer to the programmer's perception of a thread. Note that there is no specific TPU on the die; it is just an abstraction provided to the users. Software will be able to treat the TPUs as the nodes of a standard SMP system. Thus the architectural aspects of the machine are replicated, while some non-architectural aspects are shared. The result is that we end up with a CPU chip with what appears to be multiple processors on it, each of which appears to be a fully compliant architectural CPU. These multiple TPUs give a performance benefit over a single threaded processor, but at the hardware cost of a single CPU.
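Because a TPU is architecturally just another Alpha CPU, ordinary shared-memory threading code needs no changes. Here is a minimal POSIX sketch of my own (generic; the EV8-era operating systems would have been the likes of Tru64 UNIX, but the point is that no SMT-specific API is required):

```c
#include <pthread.h>
#include <stdio.h>

/* Four software threads: on a 4-TPU EV8 the OS scheduler could run
 * all of them at once on what it sees as four ordinary CPUs. */
static void *worker(void *arg) {
    printf("thread %ld running on some TPU\n", (long)arg);
    return NULL;
}

int main(void) {
    pthread_t tid[4];
    for (long t = 0; t < 4; t++)        /* one thread per TPU */
        pthread_create(&tid[t], NULL, worker, (void *)t);
    for (int t = 0; t < 4; t++)
        pthread_join(tid[t], NULL);
    return 0;
}
```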

20 System Block Diagram
[Diagram: a grid of EV8 chips, each with attached memory (M) and I/O, connected through their on-chip routers.]

Combining multiple EV8 chips results in an EV7-style multiprocessor with many more processing units. Because each TPU is the same as any Alpha processor, the OS can just treat this system as one having four times as many nodes. SMT is good for such configurations because it is more tolerant of the longer memory latencies.

21 Quiescing Idle Threads
Problem: a spin-looping thread consumes resources
Solution: provide a quiescing operation that allows a TPU to sleep until a memory location changes

Although, as I said, each TPU is a fully compliant member of the Alpha family, we are adding one extension for efficient locking. Spin looping is a common technique for software lock acquisition. In a multiprocessor this is a perfectly reasonable thing to do, but in an SMT the instructions in the spin loop are consuming resources, doing nothing, that could better be used by other threads… So we add quiesce…
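A hedged sketch of the difference (my code; `quiesce_until_changed` is a hypothetical stand-in for the EV8 quiesce operation, which puts a TPU to sleep until a watched memory location is written):

```c
#include <stdatomic.h>

extern atomic_int lock_word;     /* 0 = free, 1 = held */

/* Classic test-and-set spin loop: fine on a conventional SMP, but
 * on an SMT core every trip around this loop consumes fetch and
 * issue slots the other three threads could have used. */
void spin_acquire(void) {
    while (atomic_exchange(&lock_word, 1) != 0)
        ;                        /* burn resources doing nothing */
}

/* Hypothetical primitive standing in for the EV8 quiesce
 * operation: the TPU sleeps until *addr is written. */
void quiesce_until_changed(volatile void *addr);

void quiescing_acquire(void) {
    while (atomic_exchange(&lock_word, 1) != 0)
        quiesce_until_changed(&lock_word);  /* sleep, don't spin */
}
```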

22 Summary
Alpha will maintain single stream performance leadership
SMT will significantly enhance multistream performance:
Across a wide range of applications,
Without significant hardware cost, and
Without major architectural changes

In summary, with EV8, Alpha will maintain single stream performance leadership. Furthermore, SMT will significantly enhance multistream performance across a wide range of applications, without significant hardware cost, and without major architectural changes.

23 References
"Simultaneous Multithreading: Maximizing On-Chip Parallelism" by Tullsen, Eggers, and Levy, in ISCA '95.
"Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor" by Tullsen, Eggers, Emer, Levy, Lo, and Stamm, in ISCA '96.
"Converting Thread-Level Parallelism to Instruction-Level Parallelism via Simultaneous Multithreading" by Lo, Eggers, Emer, Levy, Stamm, and Tullsen, in ACM Transactions on Computer Systems, August 1997.
"Simultaneous Multithreading: A Platform for Next-Generation Processors" by Eggers, Emer, Levy, Lo, Stamm, and Tullsen, in IEEE Micro, October 1997.

References are available on the web.

