Simultaneous Multithreading: Multiplying Alpha Performance

Presentation transcript:

Simultaneous Multithreading: Multiplying Alpha Performance

Dr. Joel Emer, Principal Member of Technical Staff, Alpha Development Group, Compaq Computer Corporation

To begin, I must say that I am honored to represent the large number of outstanding engineers of the Alpha design team, who have produced, and are dedicated to continuing to produce, the fastest microprocessors in the world. And as is obvious from the title, I'm going to discuss the introduction of multithreading into future Alpha processors.

Outline
- Alpha Processor Roadmap
- Motivation for Introducing SMT
- Implementation of an SMT CPU
- Performance Estimates
- Architectural Abstraction

I'll begin by putting the work into the context of the Alpha CPU roadmap, spend some time on our motivation, show how SMT can be implemented in the context of an out-of-order execution processor, show some preliminary performance results, and then conclude with a bit on the software model.

Alpha Microprocessor Overview

[Roadmap diagram: major designs across the top, 21264/EV6 (0.35µm), EV7 (0.18µm), and EV8 (0.125µm), with shrinks and derivatives below, 21264/EV67 (0.28µm), 21264/EV68 (0.18µm), and EV78 (0.125µm); the axes indicate higher performance, lower cost, and first system ship dates from 1998 through 2003.]

This is the requisite processor evolution diagram, with major designs across the top and shrinks and derivatives below. EV7 was described here at Microprocessor Forum last year, and EV8 is the next fully new core CPU design.

EV8 Technology Overview
- Leading edge process technology: 1.2-2.0GHz
  - 0.125µm CMOS
  - SOI-compatible
  - Cu interconnect
  - low-k dielectrics
- Chip characteristics
  - ~1.2V Vdd
  - ~250 million transistors
  - ~1100 signal pins in flip-chip packaging

A brief summary of the technology going into EV8: we will have the most advanced process technology available at the time of introduction, which means we will be using 1/8-micron CMOS. We are also designing to be SOI-compatible so we can take advantage of that technology, and we will be using all the other state-of-the-art technologies, such as Cu interconnect and low-k dielectrics. Gross chip characteristics include a 1.2V Vdd, about a quarter billion transistors, and around 1100 signal pins in flip-chip packaging.

EV8 Architecture Overview
- Enhanced out-of-order execution
- 8-wide superscalar
- Large on-chip L2 cache
- Direct RAMBUS interface
- On-chip router for system interconnect: glueless, directory-based, ccNUMA with up to 512-way multiprocessing
- 4-way simultaneous multithreading (SMT)

Some people have speculated that there's no performance to be gained going beyond 4-wide issue, but EV8 will demonstrate that you can get a significant performance gain going wider. This will be accomplished with numerous enhancements to the out-of-order pipeline, and, as was done with the 21264, without sacrificing cycle time. EV8 will also follow the model used in EV7 of a large on-chip level 2 cache, on-chip Direct RAMBUS controllers, and on-chip network routers for glueless multiprocessor systems. Finally, EV8 will introduce simultaneous multithreading, which is the principal focus of this talk.

Goals
- Leadership single stream performance
- Extra multistream performance with multithreading
- Without major architectural changes
- Without significant additional cost

Given those characteristics, I want to review our goals. Alpha's distinguishing characteristic is leadership single stream performance, and we're not going to forget that objective. So first and foremost, in both architecture and cycle time, EV8 will aim at being the highest performing microprocessor when running a single thread. Second, we significantly improve that performance when running multiple threads, without having to change the Alpha architecture and without large additional die area. Although there certainly is additional design effort, as you will see later we avoid replicating most structures (especially large ones) to support multithreading. Or, put the other way, when running a single thread all the chip resources can be dedicated to running that thread at peak speed.

Instruction Issue

[Diagram: execution slots over time in the execute stage of a single-issue machine; red boxes mark cycles in which an instruction executes, separated by idle gaps.]

To motivate how we came to incorporate multithreading into EV8, let's return to the early 1980s and take a very high level view of instruction execution. This very abstract diagram illustrates the activity in just the execute stage of a single-issue machine, with the red boxes showing instruction execution. Note the gaps due to multi-cycle operation latency and inter-instruction dependencies, which result in less than perfect utilization of the function units.

Reduced function unit utilization due to dependencies.
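A dependence chain of this kind is easy to write down; here is a minimal C sketch, with arbitrary values and an assumed multi-cycle multiplier, of code whose statements each wait on the previous result:

    /* Each statement depends on the one before it, so on a single-issue
     * machine with, say, a multi-cycle multiplier, the function units
     * sit idle while each result is awaited. */
    double chain(double a, double b, double c) {
        double t1 = a * b;   /* multiply in flight for several cycles */
        double t2 = t1 + c;  /* must wait for t1: idle execute slots  */
        double t3 = t2 * a;  /* must wait for t2: more idle cycles    */
        return t3;
    }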

Superscalar Issue

[Diagram: execution slots over time in a wide-issue machine; each cycle offers several slots, many of which remain empty.]

If that weren't bad enough, for more performance we now issue in parallel (or wide) to use more function units in a cycle; for example, the peak sustainable execution rate of the 21264 is 4. Even with sophisticated techniques like out-of-order (or dynamic) issue and sophisticated branch prediction, this leads to more waste, since there aren't always enough instructions to issue in a cycle. A wide variety of studies have shown this function unit utilization to be under 50%, and getting worse as machines continue to get wider. Note, I'm not saying that it's bad to keep on this trajectory, because continuing to make the machine wider still provides an absolute single stream performance benefit. Furthermore, the key point is not keeping the FUs busy for their own sake (they really aren't that expensive) but that busy FUs represent more work getting done. So what can we do?

Superscalar leads to more performance, but lower utilization.

Predicated Issue

[Diagram: issue slots over time; some slots are filled by operations whose results are discarded.]

Some approaches to dealing with the idle function units attempt to improve performance by executing operations whose results are never going to be committed. An example is predication. These operations can improve performance somewhat by shortening latency, at the expense of using more function units and throwing those results away. This can go only a small distance toward improving useful occupancy of the function units. With predication we get some performance gain in exchange for significant use of function units by operations whose results are discarded; in fact, predication practically guarantees that often half the results generated will be thrown away. We wanted to find a technique that could get more performance by actually using the results from these function units.

Adds to function unit utilization, but results are thrown away.
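Predication is a compiler transformation (if-conversion) as much as a hardware feature; here is a hedged C illustration, in which the predicated form computes both candidate values and keeps one, typically lowered to a conditional move instruction (Alpha provides CMOVxx for exactly this):

    /* Branchy form: only one arm is taken, but the branch can be
     * mispredicted. */
    long pick_branch(long x, long a, long b) {
        if (x > 0)
            return a;
        return b;
    }

    /* If-converted form: both candidate values are available and one is
     * discarded; busier function units, but no branch to mispredict. */
    long pick_predicated(long x, long a, long b) {
        return (x > 0) ? a : b;
    }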

Chip Multiprocessor

[Diagram: two narrower CPUs side by side; each has fewer issue slots per cycle, still with idle gaps.]

Another approach to getting more efficiency is to use two smaller CPUs. But this sacrifices single stream performance, and Amdahl's law tells us that sometimes you have parallelism and sometimes you don't, so one can't always use multiple streams; you will still see reduced function unit utilization. So as long as you can keep getting cost-effective absolute performance gains by making the processor wider without sacrificing cycle time, it makes sense to do so.

Limited utilization when only running one thread.
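To make the Amdahl's law point concrete: if a fraction p of the work can run in parallel on n processors, the speedup is bounded by

    S(n) = \frac{1}{(1 - p) + p/n}

so with p = 0.5, even as n grows without bound the speedup never exceeds 2x, and during the serial half of the run the second CPU of such a chip multiprocessor sits idle.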

Fine Grained Multithreading

[Diagram: issue slots over time, filled by interleaving instructions from several threads cycle by cycle; many slots are still empty.]

The classic answer to the problem of dependencies is to take instructions from multiple threads. This still leaves a lot of wasted slots, and some variants of this style of multithreading result in a design where every thread goes much slower. I actually worked on this style of multithreading for my Ph.D. ages ago, but had abandoned the idea until now.

Intra-thread dependencies still limit performance.

Simultaneous Multithreading

[Diagram: issue slots over time, with any available slot filled by any available thread; few slots remain empty.]

What changed my mind was the work by Dean Tullsen and his advisors Susan Eggers and Hank Levy at U. Washington, who addressed the waste issue by proposing simultaneous multithreading. What they suggested was simply using any available slot for any available thread. But this work was somewhat incomplete, so we then collaborated on achieving the goal of uncompromised single stream performance, and multithreading as well.

Maximum utilization of function units by independent operations.
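The slot-filling idea is simple enough to capture in a toy model. This C sketch uses invented parameters (4 threads, 4 issue slots, random per-thread readiness) and a naive fixed-priority fill rather than any real fetch heuristic; it only shows how slots one thread cannot use are taken by another:

    #include <stdio.h>
    #include <stdlib.h>

    #define THREADS 4
    #define SLOTS   4
    #define CYCLES  10000

    int main(void) {
        int used = 0;
        srand(1);
        for (int cyc = 0; cyc < CYCLES; cyc++) {
            /* Instructions each thread could issue this cycle; kept
             * small because dependencies and misses limit each thread. */
            int ready[THREADS];
            for (int t = 0; t < THREADS; t++)
                ready[t] = rand() % 3;               /* 0..2 ready */
            /* Offer every slot to any thread with work (fixed priority
             * here; a real machine would use a smarter policy). */
            int filled = 0;
            for (int t = 0; t < THREADS && filled < SLOTS; t++) {
                int take = ready[t];
                if (take > SLOTS - filled)
                    take = SLOTS - filled;
                filled += take;
            }
            used += filled;
        }
        printf("slot utilization: %.0f%%\n",
               100.0 * used / ((double)CYCLES * SLOTS));
        return 0;
    }

With THREADS set to 1, the same model reports roughly a quarter of the slots used, which is the single-thread waste the earlier slides illustrate.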

Basic Out-of-order Pipeline

Fetch -> Decode/Map -> Queue -> Reg Read -> Execute -> Dcache/Store Buffer -> Reg Write -> Retire

[Diagram: the PC feeding the Icache, the register map at decode, the register file, and the Dcache.]

Here is a high level picture of the normal pipeline; instructions flow through the stages listed above. If we assume that instructions can come from different threads, we observe that after renaming, instructions are "thread-blind": the instruction dependencies are fully reflected in the register dependencies, so instructions from different threads could be mixed at will. This partitions the pipeline into two pieces, a thread-dependent part and a thread-independent part.

SMT Pipeline

Fetch -> Decode/Map -> Queue -> Reg Read -> Execute -> Dcache/Store Buffer -> Reg Write -> Retire

[Diagram: multiple PCs feeding a shared Icache, per-thread register maps at decode, an enlarged shared register file, and a shared Dcache.]

Now we enhance the pipeline by allowing fetch from multiple program counters; each cycle we pick a thread to fetch from. The instructions are fetched out of a completely shared instruction cache and sent on to the mapper, where separate maps are maintained for each thread. Then the instructions are dropped into the queue, and because the dependencies are all determined by the physical register dependencies, instructions from different threads can be mixed in all subsequent stages of the pipeline, as illustrated by showing all the units uniformly shared by all the threads. Note, however, that the register file, although shared, must be made larger to accommodate the register state of all the threads.
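Why the back end can stay thread-blind comes down to renaming; here is a minimal sketch with hypothetical table sizes (4 threads, 32 architectural registers, 512 physical registers, none of them EV8's actual numbers):

    /* One architectural-to-physical map per thread, all pointing into a
     * single shared physical register file. */
    #define NTHREADS 4
    #define NARCH    32
    #define NPHYS    512

    static int reg_map[NTHREADS][NARCH];  /* replicated per thread */

    /* After this lookup an instruction carries only physical register
     * numbers (all < NPHYS), so the issue queue and function units never
     * need the thread ID again. */
    int rename_src(int thread, int arch_reg) {
        return reg_map[thread][arch_reg];
    }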

Changes for SMT
- Basic pipeline: unchanged
- Replicated resources
  - Program counters
  - Register maps
- Shared resources
  - Register file (size increased)
  - Instruction queue
  - First and second level caches
  - Translation buffers
  - Branch predictor

So, to recap what we needed to do: first, nothing with respect to the basic pipeline. Then some resources are replicated, in particular the program counters and register maps; note that these are pretty small. Then we provided the means to share all the other, mostly large, resources. In terms of size, only the register file actually needed to be made larger. Other resources were primarily sized at the optimal point for single threaded performance, and only where the cost was small were some enlarged to enhance SMT performance. In some cases the sweet spot on the performance curve moves up, encouraging larger structures, but we actually didn't make anything much larger as a result of multithreading. In some cases the designers opted for replicated structures rather than designing the control necessary to mingle information associated with different threads in the same structure. So now I'm telling you that we're going to have all this contention in the shared resources; isn't performance going to be abysmal?
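The replicated-versus-shared split above can be summarized in a schematic C declaration; the layout and sizes here are illustrative only, not EV8's actual organization:

    #include <stdint.h>

    struct tpu_state {              /* replicated per thread: small    */
        uint64_t pc;                /* program counter                 */
        int      reg_map[32];       /* architectural-to-physical map   */
    };

    struct smt_core {               /* shared by all threads           */
        struct tpu_state tpu[4];
        uint64_t phys_regs[512];    /* register file, enlarged for SMT */
        /* The instruction queue, L1/L2 caches, translation buffers,
         * and branch predictor are likewise shared (omitted here). */
    };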

Multiprogrammed Workload

Here are the results of running multiple independent tasks on our SMT performance model. This is relative performance, starting from a base that is significantly faster than the 21264. In each case aggregate performance goes up significantly when run with multiple threads. So there is in fact more contention, but the latency tolerance of the SMT more than makes up for the additional latency. But these are just throughput improvements.

Decomposed SPEC95 Applications

Here we illustrate blind application of the SUIF compiler to decompose the SPEC FP95 applications. Some show no benefit, but some show significant gain, up to 2x, and this is a real increase in the SPECfp metric. So here we see that SMT can be used to get more performance on single applications without changing the architecture. Use of SMT gives us more registers, allowing techniques such as loop fusion that need more registers, and an opportunity for fast communication through shared caches. SMT also benefits from the ability to shift dynamically to thread level parallelism when there are many threads to run, and to peak instruction level parallelism when there is only one thread to run.
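The kind of decomposition such a parallelizing compiler performs amounts to splitting loop iterations across threads. Here is a hand-written POSIX threads sketch with invented sizes (a 4-way block split of a 4096-element loop), one software thread per TPU:

    #include <pthread.h>
    #include <stdio.h>

    #define N        4096
    #define NTHREADS 4

    static double a[N], b[N], c[N];

    struct span { int lo, hi; };

    /* Each thread computes a contiguous block of the loop's iterations. */
    static void *daxpy_block(void *arg) {
        struct span *s = arg;
        for (int i = s->lo; i < s->hi; i++)
            c[i] = 2.0 * a[i] + b[i];
        return NULL;
    }

    int main(void) {
        pthread_t tid[NTHREADS];
        struct span spans[NTHREADS];
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 1.0; }
        for (int t = 0; t < NTHREADS; t++) {
            spans[t].lo = t * (N / NTHREADS);
            spans[t].hi = (t + 1) * (N / NTHREADS);
            pthread_create(&tid[t], NULL, daxpy_block, &spans[t]);
        }
        for (int t = 0; t < NTHREADS; t++)
            pthread_join(tid[t], NULL);
        printf("c[N-1] = %f\n", c[N - 1]);
        return 0;
    }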

Multithreaded Applications

The same phenomenon is demonstrated on explicitly threaded applications; again, the speedup varies with the application. The first few are hand-coded benchmark programs, but I think the last one is the most significant, showing a dramatic speedup on a transaction processing workload.

Architectural Abstraction
- 1 CPU with 4 Thread Processing Units (TPUs)
- Shared hardware resources

[Diagram: TPU 0 through TPU 3 sharing the Icache, TLB, Dcache, and Scache.]

Stepping back a moment, I want to talk about the architectural abstraction used by the SMT processor. In an SMT CPU, each thread appears to be a fully architecturally compatible member of the Alpha processor line. To help talk about this we use the term "Thread Processing Unit," or TPU, to refer to the programmer's perception of a thread. Note that there is no specific TPU on the die; it is just an abstraction provided to the users. Software will be able to treat the TPUs as the nodes of a standard SMP. Thus the architectural aspects of the machine are replicated while some non-architectural aspects are shared. The result is a CPU chip with what appear to be multiple processors on it, each of which appears to be a fully compliant architectural CPU. These multiple TPUs give a performance benefit over a single threaded processor, but at the hardware cost of a single CPU.

System Block Diagram

[Diagram: a grid of EV8 chips, each with attached memory (M) and I/O, connected through their on-chip routers.]

Combining multiple EV8 chips results in an EV7-style multiprocessor with many more processing units. Because each TPU is the same as any Alpha processor, the OS can just treat this system as one having four times as many nodes. SMT is good for such configurations because it is more tolerant of the latency of the longer memory operations.

Quiescing Idle Threads
- Problem: a spin-looping thread consumes resources
- Solution: provide a quiescing operation that allows a TPU to sleep until a memory location changes

Although, as I said, each TPU is a fully compliant member of the Alpha family, we are adding one extension for efficient locking. Spin looping is a common technique for software lock acquisition. In a multiprocessor this is a perfectly reasonable thing to do, but in an SMT the instructions in the spin loop consume resources while accomplishing nothing, resources that could better be used by other threads. So we add quiesce.
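A sketch of the idea in C follows; quiesce_until_change() is an invented stand-in for the actual hardware primitive, implemented here with a yield so the sketch runs anywhere:

    #include <stdatomic.h>
    #include <sched.h>

    /* Stand-in for the hardware quiesce operation on the slide: block
     * until *addr changes. A real TPU would sleep, consuming no fetch
     * or issue slots; this portable stub just yields the processor. */
    static void quiesce_until_change(atomic_int *addr) {
        int seen = atomic_load(addr);
        while (atomic_load(addr) == seen)
            sched_yield();
    }

    /* Conventional spin lock: while waiting, the loop's load and branch
     * instructions occupy slots other threads could have used. */
    void spin_acquire(atomic_int *lock) {
        while (atomic_exchange(lock, 1) != 0)
            ;                                   /* burns SMT resources */
    }

    /* Quiescing version: the waiting TPU sleeps until the lock word
     * changes, then retries the atomic exchange. */
    void quiesce_acquire(atomic_int *lock) {
        while (atomic_exchange(lock, 1) != 0)
            quiesce_until_change(lock);
    }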

Summary
- Alpha will maintain single stream performance leadership
- SMT will significantly enhance multistream performance
  - across a wide range of applications,
  - without significant hardware cost, and
  - without major architectural changes

In summary: with EV8, Alpha will maintain single stream performance leadership, and SMT will significantly enhance multistream performance across a wide range of applications, without significant hardware cost and without major architectural changes.

References "Simultaneous Multithreading: Maximizing On-Chip Parallelism" by Tullsen, Eggers and Levy in ISCA95. "Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreaded Processor" by Tullsen, Eggers, Emer, Levy, Lo and Stamm in ISCA96. “Converting Thread-Level Parallelism to Instruction-Level Parallelism via Simultaneous Multithreading” by Lo, Eggers, Emer, Levy, Stamm and Tullsen in ACM Transactions on Computer Systems, August 1997. “Simultaneous Multithreading: A Platform for Next-Generation Processors” by Eggers, Emer, Levy, Lo, Stamm and Tullsen in IEEE Micro, October, 1997. References are available on the web.