ECE 511, University of Illinois
Lecture 4: Microarchitecture: Overview and General Trends
© Wen-mei Hwu and S. J. Patel, 2005
Outline
Microarchitecture
–State of the art
–Future trends, Ronen et al.
Microarchitecture: Overview
[Diagram: Instruction Supply, Execution Mechanism, Data Supply]
"Highest performance means generating the highest instruction and data bandwidth you can, and effectively consuming that bandwidth in execution." – paraphrased from M. Alsup, AMD Fellow
Microarchitecture, 1990
Short pipelines
On-chip I- and D-caches, blocking
Simple prediction
Microarchitecture, 2000
Mechanisms to find parallel instructions
–dynamic scheduling
–static scheduling
On-chip cache hierarchies, with non-blocking, higher-bandwidth caches
Sophisticated branch prediction
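As a concrete illustration of dynamic branch prediction, here is a minimal sketch of a 2-bit saturating-counter predictor indexed by low PC bits. The table size and the names predict_branch/train_branch are illustrative assumptions, not taken from any particular machine:

    #include <stdint.h>
    #include <stdbool.h>

    /* 2-bit saturating counters: 0,1 = predict not-taken; 2,3 = predict taken. */
    #define PHT_ENTRIES 4096
    static uint8_t pht[PHT_ENTRIES];            /* pattern history table */

    static unsigned pht_index(uint32_t pc)
    {
        return (pc >> 2) & (PHT_ENTRIES - 1);   /* drop byte offset, mask to table size */
    }

    bool predict_branch(uint32_t pc)
    {
        return pht[pht_index(pc)] >= 2;         /* weakly or strongly taken */
    }

    void train_branch(uint32_t pc, bool taken)
    {
        uint8_t *ctr = &pht[pht_index(pc)];
        if (taken  && *ctr < 3) (*ctr)++;       /* saturate at strongly taken */
        if (!taken && *ctr > 0) (*ctr)--;       /* saturate at strongly not-taken */
    }

Real front ends layer much more on top of this (global history, hybrid predictors, BTBs), but the counter table captures the basic prediction/training loop.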
Future Microarchitecture: One Perspective
[Diagram: Instruction Supply, Execution Mechanism, Data Supply]
"Highest performance means generating the highest instruction and data bandwidth you can, and effectively consuming that bandwidth in execution." – paraphrased from M. Alsup, AMD Fellow
Where are we headed? [with influences from Ronen et al.]
More ILP: even wider, deeper
–enabling technology: speculation, predication, compiler transformations, binary re-optimization, complexity-effective design
Multithreading
–enabling technology: speculation, subordinate threads, discovery of thread-level parallelism
Chip Multiprocessors
–enabling technology: speculation, discovery of thread-level, coarse-grained parallelism
More ILP
Instruction Supply
–Branches, cache misses, partial fetches
Data Supply
–Higher bandwidth, lower latency, memory ordering, non-blocking caches (see the sketch after this list)
Execution
–Reduction of redundant work, design complexity, and partitioning
Tolerating Latency
–Can some things just take a long time?
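A rough sketch of how a non-blocking cache tolerates miss latency: miss status holding registers (MSHRs) track outstanding misses so later accesses can keep going, and secondary misses to the same line are merged. The structure, sizes, and field names below are assumptions for illustration only:

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_MSHRS   8
    #define LINE_SHIFT  6              /* assumed 64-byte cache lines */

    /* One miss status holding register: records an outstanding miss so the
     * cache can keep servicing other requests while the fill is in flight. */
    typedef struct {
        bool     valid;
        uint64_t line_addr;            /* block address of the outstanding miss */
        int      pending_loads;        /* loads waiting on this line */
    } mshr_t;

    static mshr_t mshrs[NUM_MSHRS];

    /* Returns true if the miss was accepted (allocated or merged);
     * false means all MSHRs are busy and this request must stall. */
    bool handle_miss(uint64_t addr)
    {
        uint64_t line = addr >> LINE_SHIFT;
        int free_slot = -1;

        for (int i = 0; i < NUM_MSHRS; i++) {
            if (mshrs[i].valid && mshrs[i].line_addr == line) {
                mshrs[i].pending_loads++;      /* secondary miss: merge */
                return true;
            }
            if (!mshrs[i].valid && free_slot < 0)
                free_slot = i;
        }
        if (free_slot < 0)
            return false;                      /* structural stall */

        mshrs[free_slot] = (mshr_t){ .valid = true, .line_addr = line,
                                     .pending_loads = 1 };
        /* primary miss: a fill request for 'line' goes to the next level here */
        return true;
    }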
Multithreading [Burton Smith, 1978]
[Diagram: pipeline stages Fetch, Execute, WriteBack]
This is a snapshot of the pipeline during a single cycle. Each color represents instructions from a different thread. B. Smith's original concept was for a single-wide pipeline, but it extends naturally to a multiple-issue pipeline.
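A toy sketch of Smith-style interleaved multithreading: each cycle the fetch stage picks a different ready thread in round-robin order, so one thread's stalls do not leave the pipeline idle. The context structure and stall flag are hypothetical simplifications:

    #include <stdio.h>
    #include <stdbool.h>

    #define NUM_THREADS 4

    typedef struct {
        unsigned pc;
        bool     stalled;     /* e.g., waiting on a long-latency memory access */
    } context_t;

    static context_t ctx[NUM_THREADS];
    static int last_fetched = NUM_THREADS - 1;

    /* Round-robin thread selection: pick the next thread that is not stalled.
     * Returns the thread id to fetch from this cycle, or -1 if all are stalled. */
    int select_fetch_thread(void)
    {
        for (int i = 1; i <= NUM_THREADS; i++) {
            int t = (last_fetched + i) % NUM_THREADS;
            if (!ctx[t].stalled) {
                last_fetched = t;
                return t;
            }
        }
        return -1;
    }

    int main(void)
    {
        ctx[2].stalled = true;    /* pretend thread 2 is waiting on memory */
        for (int cycle = 0; cycle < 8; cycle++) {
            int t = select_fetch_thread();
            if (t >= 0)
                printf("cycle %d: fetch from thread %d (pc=%u)\n", cycle, t, ctx[t].pc++);
            else
                printf("cycle %d: all threads stalled\n", cycle);
        }
        return 0;
    }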
Simultaneous Multithreading [W. Yamamoto, 1994 / D. Tullsen, 1995]
[Diagram: pipeline stages Fetch, Execute, WriteBack]
Simultaneous Multithreading, possible implementation
[Diagram: Front End / Back End]
Intel Hyper-Threading in the Pentium 4 [Hot Chips 14] is the first realization, with two threads
Small ISA register file minimizes the effect of replication
Replicated retirement logic
Minimal hardware overhead, but a major increase in verification cost
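To contrast with the round-robin scheme above, here is a rough sketch of the central SMT idea: in a single cycle, issue slots are filled with ready instructions drawn from any thread, so one thread's stalls do not waste issue bandwidth. The data structures below are illustrative only and do not describe the Pentium 4 implementation:

    #include <stdbool.h>

    #define NUM_THREADS   2
    #define WINDOW_SIZE   16
    #define ISSUE_WIDTH   4

    typedef struct {
        bool valid;
        bool ready;        /* all source operands available */
        int  thread_id;
        int  rob_tag;      /* illustrative tag for the instruction */
    } window_entry_t;

    /* A shared issue window holding instructions from all threads. */
    static window_entry_t window[NUM_THREADS * WINDOW_SIZE];

    /* Fill up to ISSUE_WIDTH slots this cycle with ready instructions,
     * regardless of which thread they belong to.  Returns number issued. */
    int issue_cycle(int issued_tags[ISSUE_WIDTH])
    {
        int issued = 0;
        for (int i = 0; i < NUM_THREADS * WINDOW_SIZE && issued < ISSUE_WIDTH; i++) {
            if (window[i].valid && window[i].ready) {
                issued_tags[issued++] = window[i].rob_tag;
                window[i].valid = false;      /* instruction leaves the window */
            }
        }
        return issued;
    }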
Chip Multiprocessor [K. Olukotun, 1996]
[Diagram: ProcA, ProcB, ProcC, ProcD, each with its own Fetch/Execute/WriteBack pipeline, around a Shared L2 Cache]
A single processor die contains multiple CPUs, all of which share some amount of resources, such as an L2 cache and chip pins.
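A small, hedged example of the CMP usage model from software's point of view: independent threads scheduled onto separate cores communicate implicitly through the shared memory hierarchy (here via POSIX threads; the shared L2 is invisible to the program, it simply makes the sharing cheap):

    #include <pthread.h>
    #include <stdio.h>

    #define NUM_CORES 4
    #define N         1000000

    static long long partial_sum[NUM_CORES];   /* one slot per worker */

    /* Each worker sums a disjoint slice; on a CMP the workers run on
     * different cores and share only the L2 cache and memory behind it. */
    static void *worker(void *arg)
    {
        long id = (long)arg;
        long start = id * (N / NUM_CORES), end = start + N / NUM_CORES;
        long long sum = 0;
        for (long i = start; i < end; i++)
            sum += i;
        partial_sum[id] = sum;
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NUM_CORES];
        long long total = 0;

        for (long i = 0; i < NUM_CORES; i++)
            pthread_create(&tid[i], NULL, worker, (void *)i);
        for (long i = 0; i < NUM_CORES; i++) {
            pthread_join(tid[i], NULL);
            total += partial_sum[i];
        }
        printf("total = %lld\n", total);
        return 0;
    }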
Hardware Accelerators
Existing Solutions
[Block diagrams: Intel IXP1200 Network Processor (ARM core, microengines, access control); Philips Nexperia (Viper) (MIPS core, MPEG, VLIW video MSP); IBM Cell]
… what's next? …
Strategic Memory Data Delivery
Data transfer network managed by a memory transfer module (MTM) (see the sketch after this list)
–A smart, global manager
–Strategic allocation of network bandwidth
–Has some idea of data priority in the application
–Scalability challenges exist
Works hand-in-hand with compartmentalization
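Since the MTM is described only at a high level, the following is a speculative sketch of one possible interface: software posts transfer descriptors tagged with application-level priority, and the module grants bandwidth to the highest-priority pending transfer first. All names here (transfer_desc, mtm_post, mtm_grant_next) are hypothetical:

    #include <stdint.h>
    #include <stddef.h>

    #define MTM_QUEUE_DEPTH 32

    /* Hypothetical transfer descriptor handed to the memory transfer module. */
    typedef struct {
        uint64_t src;        /* source address      */
        uint64_t dst;        /* destination address */
        uint32_t bytes;      /* transfer size       */
        uint8_t  priority;   /* 0 = lowest, 255 = highest; set from app knowledge */
        uint8_t  in_use;
    } transfer_desc;

    static transfer_desc queue[MTM_QUEUE_DEPTH];

    /* Post a transfer; returns 0 on success, -1 if the queue is full. */
    int mtm_post(uint64_t src, uint64_t dst, uint32_t bytes, uint8_t priority)
    {
        for (int i = 0; i < MTM_QUEUE_DEPTH; i++) {
            if (!queue[i].in_use) {
                queue[i] = (transfer_desc){ src, dst, bytes, priority, 1 };
                return 0;
            }
        }
        return -1;
    }

    /* Bandwidth allocation policy: grant the highest-priority pending transfer.
     * Returns the chosen descriptor, or NULL if none are pending. */
    transfer_desc *mtm_grant_next(void)
    {
        transfer_desc *best = NULL;
        for (int i = 0; i < MTM_QUEUE_DEPTH; i++) {
            if (queue[i].in_use && (!best || queue[i].priority > best->priority))
                best = &queue[i];
        }
        if (best)
            best->in_use = 0;    /* hand it to the data transfer network */
        return best;
    }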
Adding Point-to-Point Communication
Neighbor-to-neighbor interconnects added
–Explicitly scheduled communication
–Tight coupling between processing elements
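One way to picture explicitly scheduled neighbor-to-neighbor communication is as a small FIFO channel between adjacent processing elements that the producer writes and the consumer reads on a known schedule. This is an illustrative model under assumed depths and names, not a description of a specific interconnect:

    #include <stdbool.h>
    #include <stdint.h>

    #define CHANNEL_DEPTH 4

    /* A point-to-point channel between two adjacent processing elements. */
    typedef struct {
        uint32_t data[CHANNEL_DEPTH];
        int head, tail, count;
    } channel_t;

    /* Producer side: returns false if the channel is full (the sender must wait,
     * or the compiler's schedule must guarantee this never happens). */
    bool channel_send(channel_t *ch, uint32_t value)
    {
        if (ch->count == CHANNEL_DEPTH)
            return false;
        ch->data[ch->tail] = value;
        ch->tail = (ch->tail + 1) % CHANNEL_DEPTH;
        ch->count++;
        return true;
    }

    /* Consumer side: returns false if no data has arrived yet. */
    bool channel_recv(channel_t *ch, uint32_t *value)
    {
        if (ch->count == 0)
            return false;
        *value = ch->data[ch->head];
        ch->head = (ch->head + 1) % CHANNEL_DEPTH;
        ch->count--;
        return true;
    }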
Discussion/Thought Exercise
What are the essential differences between the SMT model of execution and the CMP model?
–What resources are shared, and in what manner?
–What type of data movement exists in one but not the other?
–What types of applications/situations are the best cases for each model?