Dynamic Management of Microarchitecture Resources in Future Processors Rajeev Balasubramonian Dept. of Computer Science, University of Rochester.

Talk Outline
- Trade-offs in future microprocessors
- Dynamic resource management
  - On-chip cache hierarchy
  - Clustered processors
  - Pre-execution threads
- Future work
University of Rochester

Design Goals in Modern Processors
Microprocessor designs strive for:
- High performance: high clock speed, high parallelism
- Low power
- Low design complexity: short, simple pipelines
Unfortunately, not all of these can be achieved simultaneously.
University of Rochester

Trade-Off in the Cache Size

                            CPU + 32KB L1 (2 cycles)   CPU + 128KB L1 (4 cycles)
  "sort 4000"  miss rate:   very low                   very low
  "sort 4000"  exec time:   t                          t + x
  "sort 16000" miss rate:   high                       very low
  "sort 16000" exec time:   T                          T - X

The small input fits in either cache, so the smaller, faster cache wins; the large input thrashes the 32KB cache, so the larger cache wins despite its longer access time.
University of Rochester

Trade-Off in the Register File Size
The register file stores results for all active instructions in the processor.
Large register file → more active instructions → high parallelism
  → long access times → slow clock speed / more pipeline stages
  → high power, design complexity
University of Rochester

Trade-Offs Involving Resource Sizes
Trade-offs influence the design of the cache, register file, issue queue, etc.
Large resource size → high parallelism, ability to support more threads
  → long latency → long pipelines / low clock speed
  → high power, high design complexity
University of Rochester

Parallelism-Latency Trade-Off
For each resource, performance depends on:
- the parallelism it can help extract
- the negative impact of its latency
Every program has different parallelism and latency needs.
University of Rochester

Limitations of Conventional Designs
- Resource sizes are fixed at design time: the size that works best, on average, for all programs.
- This average size is often too small or too large for many programs.
- For optimal performance, the hardware should match the program's parallelism needs.
University of Rochester

Dynamic Resource Management
- Reconfigurable memory hierarchy (MICRO'00, IEEE TOC, PACT'02)
- Trade-offs in clusters (ISCA'03)
- Selective pre-execution (ISCA'01)
- Efficient register file design (MICRO'01)
- Dynamic voltage/frequency scaling (HPCA'02)
University of Rochester

Talk Outline Trade-offs in future microprocessors Dynamic resource management  On-chip cache hierarchy  Clustered processors  Pre-execution threads Future work University of Rochester

Conventional Cache Hierarchies
CPU → L1 → L2 → Main Memory
  L1: 32KB, 2-way set-associative, 2-cycle access, miss rate 2.3%
  L2: 2MB, 8-way, 20-cycle access, miss rate 0.2%
(Speed decreases and capacity grows moving away from the CPU.)
University of Rochester

Conventional Cache Layout
[Figure: cache array with an address decoder driving wordlines, bitlines feeding output drivers, and two ways (way 0, way 1) side by side]
University of Rochester

Wire Delays
Delay is a quadratic function of wire length; by inserting repeaters/buffers, delay grows roughly linearly with length.
  Length = x  → delay ~ t
  Length = 2x → delay ~ 4t (unrepeated)
  Length = 2x → delay ~ 2t + logic_delay (with a repeater)
Repeaters electrically isolate the wire segments and are commonly used today in long wires.
University of Rochester
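A toy model of the two scaling regimes sketched above (illustrative constants only, not real process parameters):

```python
import math

def unrepeated_delay(length, k=1.0):
    # RC delay of an unbuffered wire grows quadratically with its length.
    return k * length ** 2

def repeated_delay(length, segment=1.0, k=1.0, repeater_delay=0.2):
    # Repeaters split the wire into fixed-length segments; each segment
    # contributes a constant delay, so the total grows roughly linearly.
    n = math.ceil(length / segment)
    return n * (k * segment ** 2 + repeater_delay)
```

Doubling the length quadruples the unrepeated delay but only doubles the repeated one (the slide's "2t + logic_delay", with the repeater delay playing the role of the logic delay).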

Exploiting Technology
[Figure: the same cache array, with repeaters inserted along its long wires]
University of Rochester

The Reconfigurable Cache Layout
[Figure: cache array partitioned by repeaters into four ways (way 0 through way 3)]
University of Rochester

The Reconfigurable Cache Layout
[Figure: one way enabled, forming a 32KB 1-way cache with a 2-cycle access time]
University of Rochester

The Reconfigurable Cache Layout
[Figure: two ways enabled, forming a 64KB 2-way cache with a 3-cycle access time]
The disabled portions of the cache are used as the non-inclusive L2.
University of Rochester
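The configurations can be summarized in a small lookup. The 32KB/2-cycle and 64KB/3-cycle points come from these slides, and the 128KB/4-cycle point from the earlier trade-off slide; the 32KB way size and the 3-way latency are assumptions consistent with those numbers.

```python
TOTAL_WAYS, WAY_SIZE_KB = 4, 32

def l1_config(enabled_ways):
    """Return (L1 size in KB, L1 latency in cycles, leftover L2 KB)
    for a given number of enabled ways."""
    size_kb = enabled_ways * WAY_SIZE_KB
    latency = {1: 2, 2: 3, 3: 4, 4: 4}[enabled_ways]   # assumed for 3 ways
    # Disabled ways are not wasted: they serve as a non-inclusive L2.
    l2_kb = (TOTAL_WAYS - enabled_ways) * WAY_SIZE_KB
    return size_kb, latency, l2_kb
```

For example, `l1_config(1)` gives a 32KB, 2-cycle L1 with 96KB left over for the L2.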

Changing the Boundary between L1 and L2
[Figure: CPU with an L1/L2 boundary that can shift, trading L1 capacity against L2 capacity]
University of Rochester

Salient Features
- Low-cost: exploits the benefits of repeaters
- Optimizes the access time/capacity trade-off
- Can reduce energy: most efficient when cache size equals working-set size
University of Rochester

Control Mechanism
Gather statistics at periodic intervals (every 10K instructions). Inspect the stats: is there a phase change?
  yes → exploration: run each configuration for an interval, then pick the best configuration
  no  → remain at the selected configuration
University of Rochester

Metrics
Optimizing performance: the metric for the best configuration is simply instructions per cycle (IPC).
Detecting a phase change: a change in branch frequency, a change in miss rate, or a sudden change in IPC signals a change in program phase.
To avoid unnecessary explorations, the thresholds can be adapted at run-time.
University of Rochester
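A minimal sketch of this interval-based controller. The 10K-instruction intervals, IPC as the selection metric, and the explore-then-settle policy come from the slides; the class and method names, and the two-configuration example below, are assumptions.

```python
class IntervalController:
    """Explore every configuration after a phase change, then stick
    with the best one (highest IPC) until the next phase change."""

    def __init__(self, configs):
        self.configs = list(configs)
        self.exploring = True
        self.trial = 0              # index of the config being tried
        self.scores = {}
        self.current = self.configs[0]

    def end_of_interval(self, ipc, phase_changed):
        """Called once per interval (e.g. every 10K instructions) with that
        interval's IPC and whether the stats flagged a phase change."""
        if self.exploring:
            self.scores[self.current] = ipc
            self.trial += 1
            if self.trial < len(self.configs):
                self.current = self.configs[self.trial]
            else:
                # Exploration done: settle on the config with the best IPC.
                self.current = max(self.scores, key=self.scores.get)
                self.exploring = False
        elif phase_changed:
            # Program behavior changed: restart exploration from scratch.
            self.exploring = True
            self.trial = 0
            self.scores = {}
            self.current = self.configs[0]
        return self.current
```

For instance, with configs `['32KB/2cyc', '64KB/3cyc']`, two exploration intervals are run, the higher-IPC configuration is kept, and any later phase change triggers a fresh exploration.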

Simulation Methodology
- A modified version of Simplescalar that models bus contention in detail
- Programs drawn from various benchmark sets (a mix of many program types)
University of Rochester

Performance Results University of Rochester Overall harmonic mean (HM) improvement: 17%

Energy Results University of Rochester Overall energy savings: 42%

Talk Outline
- Trade-offs in future microprocessors
- Dynamic resource management
  - On-chip cache hierarchy
  - Clustered processors
  - Pre-execution threads
- Future work
University of Rochester

Conventional Processor Design
[Figure: monolithic pipeline: I-cache, branch predictor, rename & dispatch, issue queue, register file, functional units]
Large structures → slower clock speed.
University of Rochester

The Clustered Processor
[Figure: shared I-cache, branch predictor, and rename & dispatch feeding four clusters, each with its own issue queue (IQ), register file, and functional units]
Example: r1 ← r3 + r4 and r41 ← r43 + r44 execute in separate clusters; r2 ← r1 + r41 needs both results and pays a cross-cluster communication delay.
Small structures → faster clock speed, but high latency for some instructions.
University of Rochester

Emerging Trends
- Wire delays and faster clocks will make each cluster smaller.
- Larger transistor budgets and low design cost will enable the implementation of many clusters on the chip.
- The support of many threads will require many resources and clusters.
→ Numerous, small clusters will be a reality!
University of Rochester

Communication Costs
[Figure: a 4-cluster layout and an 8-cluster layout, each cluster containing registers, an issue queue, and functional units]
More clusters → more communication.
University of Rochester

Communication vs. Parallelism

  4 clusters → 100 active instrs:    8 clusters → 200 active instrs:
    r1 ← r2 + r3                       r1 ← r2 + r3
    r5 ← r1 + r3                       r5 ← r1 + r3
    ...                                ...
    r7 ← r2 + r3                       r7 ← r2 + r3
    r8 ← r7 + r3                       r8 ← r7 + r3
                                       ...
                                       r5 ← r1 + r7
                                       ...
                                       r9 ← r2 + r3

Distant parallelism: distant instructions (such as r9 ← r2 + r3 above) that are already ready to execute.
University of Rochester

Communication-Parallelism Trade-Off
More clusters → more communication, but also more parallelism.
Selectively use more clusters:
- if the communication is tolerable
- if there is additional distant parallelism
University of Rochester

IPC with Many Clusters (ISCA’03) University of Rochester

Trade-Off Management The clustered processor abstraction exposes the trade-off between communication and parallelism It also simplifies the management of resources -- we can disable a cluster by simply not dispatching instructions to it University of Rochester
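One way to picture that abstraction: disabling a cluster simply means the dispatch stage never steers instructions to it. The load-balancing heuristic below is a hypothetical stand-in; the actual steering policies are not specified here.

```python
def dispatch(instructions, num_enabled, occupancy):
    """Steer each instruction to the least-occupied *enabled* cluster.
    occupancy[i] = instructions currently waiting in cluster i's issue queue.
    Clusters with index >= num_enabled are disabled: they are simply
    never considered, so no hardware reconfiguration is needed."""
    placement = []
    for instr in instructions:
        cluster = min(range(num_enabled), key=lambda c: occupancy[c])
        occupancy[cluster] += 1
        placement.append((instr, cluster))
    return placement
```

Shrinking `num_enabled` trades distant parallelism for shorter communication distances, which is exactly the knob the interval-based mechanism can tune.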

Control Mechanism
Gather statistics at periodic intervals (every 10K instructions). Inspect the stats: is there a phase change?
  yes → exploration: run each configuration for an interval, then pick the best configuration
  no  → remain at the selected configuration
University of Rochester

The Interval Length
- Success depends on the ability to repeat behavior across successive intervals.
- Every program is likely to have phase changes at different granularities.
- The interval length must therefore also be picked at run-time.
University of Rochester

Picking the Interval Length
- Start with the minimum allowed interval length.
- If phase changes are too frequent, double the interval length: find a granularity coarse enough that behavior is consistent.
- Repeat the search every 10 billion instructions.
Small interval lengths can result in noisy measurements.
University of Rochester
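The doubling search above can be sketched as follows. The 10K minimum matches the slides, while the 5% instability threshold and the 40M cap are assumptions read off the table on the next slide; in hardware the instability factor would be measured over real intervals rather than looked up.

```python
def pick_interval_length(instability, min_len=10_000, max_len=40_000_000,
                         threshold=0.05):
    """Start at the minimum interval length and keep doubling while the
    instability factor (fraction of intervals flagging a phase change)
    stays above the acceptable threshold.

    instability: maps a candidate length to its measured instability factor.
    """
    length = min_len
    while length < max_len and instability(length) > threshold:
        length *= 2
    return length
```

With a program that is unstable at fine grain but settles at 320K-instruction intervals (like crafty in the table), the search doubles from 10K up to 320K and stops there.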

Varied Interval Lengths

  Benchmark   Instability factor      Minimum acceptable interval
              (10K interval length)   length / its instability factor
  gzip         4%                     10K / 4%
  vpr         14%                     320K / 5%
  crafty      30%                     320K / 4%
  parser      12%                     40M / 5%
  swim         0%                     10K / 0%
  mgrid        0%                     10K / 0%
  galgel       1%                     10K / 1%
  cjpeg        9%                     40K / 4%
  djpeg       31%                     1280K / 1%

Instability factor: percentage of intervals that flag a phase change.
University of Rochester

Results with Interval-Based Scheme University of Rochester Overall improvement: 11%

Talk Outline
- Trade-offs in future microprocessors
- Dynamic resource management
  - On-chip cache hierarchy
  - Clustered processors
  - Pre-execution threads
- Future work
University of Rochester

Pre-Execution
Executing a subset of the program in advance helps warm up processor structures such as the caches and the branch predictor.
University of Rochester

The Future Thread (ISCA’01)
The main program thread executes every single instruction. Some registers are reserved for the future thread so it can jump ahead.
[Figure: main thread and pre-execution (future) thread advancing through the instruction stream]
University of Rochester

Key Innovations
- Ability to advance much further: eager recycling of registers; skipping idle instructions
- Integrating pre-executed results: re-using register results; correcting branch mispredicts; prefetching into the caches
- Allocation of resources
University of Rochester

Trade-Offs in Resource Allocation
[Figure: register file partitioned between the main thread and the future thread]
Allocating more registers to the main thread favors nearby parallelism; allocating more registers to the future thread favors distant parallelism. The interval-based mechanism can pick the optimal allocation.
University of Rochester

Pre-Execution Results University of Rochester Overall improvement with 12 registers: 11% Overall improvement with dynamic allocation: 18%

Conclusion
- Emerging technologies will make trade-off management very vital.
- Approaches to hardware adaptation: the cache hierarchy, clustered processors, and pre-execution threads.
- The interval-based mechanism with exploration is robust and applies to most problem domains.
University of Rochester

Talk Outline
- Trade-offs in future microprocessors
- Dynamic resource management
  - On-chip cache hierarchy
  - Clustered processors
  - Pre-execution threads
- Future work
University of Rochester

Future Scenarios
- Clustered designs can be used to produce all classes of processors.
- A library of simple cluster cores, with different energy, clock speed, latency, and parallelism characteristics.
- The role of the architect: putting these cores together on the chip and exploiting them to maximize performance.
University of Rochester

Heterogeneous Clusters
Having different clusters on the chip provides many options for instruction steering. For example, a program limited by communication will benefit from large, slow cluster cores. Non-critical instructions of a program could be steered to slow, energy-efficient clusters; such clusters can also help reduce processor hot-spots.
University of Rochester

Other Critical Problems
How does one build a highly clustered processor?
- Where does the cache go?
- What interconnect topology do we use?
- How does multithreading affect these choices?
University of Rochester

More Details…
Research synopses and papers available at:
University of Rochester
