Microarchitectural Wire Management for Performance and Power in Partitioned Architectures
Rajeev Balasubramonian, Naveen Muralimanohar, Karthik Ramani, Venkatanand Venkatachalapathy
University of Utah, Feb 14th 2005

Overview / Motivation
- Wire delays hamper performance.
- Power is incurred in the movement of data:
  - 50% of dynamic power is spent in interconnect switching (Magen et al., SLIP '04).
  - The MIT Raw processor's on-chip network consumes 36% of total chip power (Wang et al., 2003).
- An abundant number of metal layers is available.

Wire characteristics
- Wire resistance and capacitance per unit length determine delay and bandwidth.
- Increasing width: R decreases, C increases slightly.
- Increasing spacing: C decreases.
- Net effect of wider, more widely spaced wires: delay decreases (since delay is proportional to RC), while bandwidth decreases because fewer wires fit in the same metal area (a first-order model is sketched below).
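These trends follow from the usual first-order wire model; the sketch below is a textbook approximation rather than anything stated in the talk, with rho the metal resistivity, w the wire width, t its thickness, s the spacing to neighboring wires, h the dielectric height, and L the wire length.

```latex
% Textbook first-order wire model (illustrative assumption, not from the talk).
% Resistance per unit length falls as the wire widens:
R_{wire} = \frac{\rho}{w \, t}
% Capacitance per unit length has a parallel-plate and a sidewall term;
% the sidewall term falls as the spacing s grows:
C_{wire} \approx \varepsilon \left( \frac{2w}{h} + \frac{2t}{s} \right)
% Elmore delay of an unrepeated wire is quadratic in its length:
T_{wire} \approx 0.38 \, R_{wire} \, C_{wire} \, L^{2}
```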

Design space exploration: tuning wire width and spacing
[Figure: wire cross-sections drawn at width/spacing d (the baseline B wires) and 2d, annotated with the resulting resistance, capacitance, and bandwidth]

Transmission lines
- Similar to L wires: extremely low delay.
- But constraining implementation requirements:
  - Large width.
  - Large spacing between wires.
  - Design of sensing circuits.
- Implemented in test CMOS chips.

Design space exploration: tuning repeater size and spacing
- Traditional wires: large repeaters at the delay-optimal spacing; minimum delay, but high power.
- Power-optimal wires: smaller repeaters with increased spacing; higher delay, lower power (sketched below).
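A rough way to see this trade-off is the standard repeated-wire model; the relations below are textbook approximations and not taken from the talk, where R_w, C_w are per-unit-length wire parasitics, R_0, C_0 per-unit-size repeater resistance and capacitance, k the number of repeaters of relative size s on a wire of length L, and alpha the switching activity.

```latex
% With repeaters sized and spaced for minimum delay, delay grows linearly in L:
T_{opt} \propto L \sqrt{R_w C_w \, R_0 C_0}
% Dynamic energy is set by the total switched capacitance (wire plus repeaters):
E_{dyn} \approx \alpha \, (C_w L + k \, s \, C_0) \, V_{dd}^{2}
% Using fewer and smaller repeaters than the delay-optimal point shrinks the
% k*s*C_0 term (and repeater leakage) at the cost of extra delay.
```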

Design space exploration
- B wires: delay optimized.
- W wires: bandwidth optimized.
- P wires: power optimized.
- PW wires: power and bandwidth optimized.
- L wires: fast, low bandwidth.

Heterogeneous interconnects: the inter-cluster global interconnect
- 72 B wires: repeaters sized and spaced for optimum delay.
- 18 L wires: wide wires with large spacing; occupy more area, but provide low latencies.
- 144 PW wires: poor delay, high bandwidth, low power.

Outline
- Overview
- Design space exploration
- Heterogeneous interconnects
  - Employing L wires for performance
  - PW wires: the power optimizers
- Evaluation
- Results
- Conclusion

L1 cache pipeline
[Figure: baseline LSQ and L1 D-cache pipeline: effective-address transfer (10 cycles), memory dependence resolution (5 cycles), cache access (5 cycles); data returns at cycle 20]

Exploiting L wires
[Figure: the 8 least significant bits of the effective address are sent ahead on L wires (5 cycles), allowing partial memory dependence resolution (3 cycles) and an early cache access (5 cycles) to overlap the full 10-cycle address transfer; data returns at cycle 14]

L wires: accelerating cache access
- Transmit the LSB bits of the effective address on L wires.
- Partial comparison of loads and stores in the LSQ (a small sketch follows this list):
  - Faster memory disambiguation.
  - Introduces false dependences (< 9%).
- Early indexing of the data and tag RAM arrays:
  - The LSB bits can prefetch data out of the L1 cache.
  - Reduces the access latency of loads.
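To make the partial comparison concrete, here is a minimal Python sketch of the conservative check an LSQ could perform once only the low-order address bits have arrived on L wires; the function names and example addresses are illustrative, not taken from the paper.

```python
PARTIAL_BITS = 8  # low-order address bits carried early on the L wires

def partial_match(addr_a, addr_b, bits=PARTIAL_BITS):
    """Compare only the low-order bits that have arrived so far."""
    mask = (1 << bits) - 1
    return (addr_a & mask) == (addr_b & mask)

def may_depend(load_addr_lsb, older_store_lsbs):
    """Conservative disambiguation with partial addresses.

    If no older store matches in the low bits, the load provably does not
    alias any of them and can proceed early.  A match may still be a false
    dependence (different full addresses sharing low bits), which is why
    the talk reports fewer than 9% false dependences.
    """
    return any(partial_match(load_addr_lsb, s) for s in older_store_lsbs)

# A load whose low byte is 0x40 checked against older stores:
print(may_depend(0x40, [0x13, 0x7F]))    # False -> the load may issue early
print(may_depend(0x40, [0x140 & 0xFF]))  # True  -> wait (possibly a false dependence)
```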

L wires: narrow bit-width operands
- Transfer 10-bit integer results on L wires.
- Schedule wake-up operations.
- Reduction in the branch mispredict penalty.
- A predictor table of 8K two-bit counters (sketched after this list):
  - Identifies 95% of all narrow bit-width results.
  - With an accuracy of 98%.
- Implemented in the PowerPC!
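The 8K-entry table of two-bit counters can be pictured as below; this is only a sketch, and the indexing hash, confidence threshold, and signed 10-bit narrowness test are assumptions rather than details given in the paper.

```python
TABLE_SIZE = 8192   # 8K two-bit saturating counters
NARROW_BITS = 10    # results that fit in 10 bits can travel on L wires

class NarrownessPredictor:
    """Predicts, per static instruction, whether the result will be narrow."""

    def __init__(self):
        self.table = [0] * TABLE_SIZE   # 2-bit counters, values 0..3

    def _index(self, pc):
        return (pc >> 2) % TABLE_SIZE   # illustrative PC hash

    def predict_narrow(self, pc):
        return self.table[self._index(pc)] >= 2

    def update(self, pc, result):
        narrow = -(1 << (NARROW_BITS - 1)) <= result < (1 << (NARROW_BITS - 1))
        i = self._index(pc)
        self.table[i] = min(3, self.table[i] + 1) if narrow else max(0, self.table[i] - 1)

# After a few narrow results, this instruction's result is routed on L wires.
p = NarrownessPredictor()
for r in (5, -3, 200):
    p.update(0x400100, r)
print(p.predict_narrow(0x400100))   # True
```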

PW wires: power/bandwidth efficient
- Idea: steer non-critical data through the energy-efficient PW interconnect (illustrated below).
- Transfers initiated at instruction dispatch:
  - Input operands copied to a remote register file.
  - The long dispatch-to-issue latency hides the slower wires.
- Store data.
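A hedged sketch of that steering decision: the transfer categories mirror the slide, but the function itself and its names are only illustrative.

```python
# Illustrative steering of inter-cluster transfers onto wire classes:
# non-critical traffic rides the PW wires, narrow critical values the
# L wires, and everything else the delay-optimized B wires.
NON_CRITICAL = {"dispatch_operand_copy", "store_data"}

def pick_wires(transfer_type, is_narrow=False):
    if transfer_type in NON_CRITICAL:
        return "PW"   # slow, but power- and bandwidth-efficient
    if is_narrow:
        return "L"    # fast, low-bandwidth wires for narrow operands
    return "B"        # default delay-optimized wires

print(pick_wires("store_data"))                     # PW
print(pick_wires("bypass_result", is_narrow=True))  # L
```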

Evaluation methodology: four-cluster system
- A dynamically scheduled clustered processor with 4 clusters, modeled in Simplescalar-3.0.
- Crossbar interconnect between the clusters and the centralized front end (I-cache, D-cache, LSQ, branch predictor).
- Crossbar latencies: B wires 2 cycles, L wires 1 cycle, PW wires 3 cycles.

Evaluation methodology: sixteen-cluster system
- A dynamically scheduled processor with 16 clusters, modeled in Simplescalar-3.0.
- Clusters reached through crossbars and a ring interconnect; centralized I-cache, D-cache, and LSQ.
- Ring latencies: B wires 4 cycles, PW wires 6 cycles, L wires 2 cycles.

IPC improvements: L wires
- L wires improve performance by 4% on the four-cluster system and by 7.1% on the sixteen-cluster system.

Four-cluster system: ED^2 gains
[Table: for each link configuration (the 144 B-wire baseline and combinations of B, PW, and 36 L wires), the relative metal area, IPC, relative processor energy (10%), relative ED^2 (10%), and relative ED^2 (20%)]
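For reference, ED^2 in these tables is the standard energy-delay-squared product (a standard metric, not something specific to this talk); lower is better, and the relative values are presumably normalized to the homogeneous 144 B-wire baseline.

```latex
% Energy-delay-squared product for a fixed workload:
%   E = total energy consumed, D = execution time (1/IPC at a fixed frequency)
ED^{2} = E \times D^{2}
```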

Sixteen-cluster system: ED^2 gains
[Table: for each link configuration (the 144 B-wire baseline and combinations of B, PW, and 36 L wires), the IPC, relative processor energy (20%), and relative ED^2 (20%)]

Conclusions
- Exposing the wire design space to the architecture: a case for microarchitectural wire management.
- A low-latency, low-bandwidth network alone improves performance by up to 7%.
- ED^2 improvements of about 11% compared to a baseline processor with a homogeneous interconnect.
- The approach entails hardware complexity.

Future work
- The preliminary evaluation looks promising.
- A heterogeneous interconnect entails complexity.
- Design of heterogeneous clusters.
- Energy-efficient interconnects.

Questions and comments? Thank you!

Backup slides

L wires: accelerating cache access (backup)
- TLB access for page lookup:
  - Transmit a few bits of the virtual page number on L wires.
  - Prefetch data out of the L1 cache and TLB.
- 18 L wires carry 6 tag bits, 8 L1 index bits, and 4 TLB index bits (see the sketch below).

Wire type | Crossbar delay | Ring hop delay
PW wires  | 3              | 6
B wires   | 2              | 4
L wires   | 1              | 2
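As an illustration of those 18 bits, the hypothetical breakdown below splits a virtual address into the 6 tag, 8 L1-index, and 4 TLB-index bits the slide mentions; the block-offset width and page size are assumptions (64 B lines, 4 KB pages), not values given in the paper.

```python
# Hypothetical bit positions (assumed 64 B cache lines and 4 KB pages).
BLOCK_OFFSET_BITS = 6
L1_INDEX_BITS = 8
TAG_LSB_BITS = 6
PAGE_OFFSET_BITS = 12
TLB_INDEX_BITS = 4

def l_wire_bits(vaddr):
    """Return the (tag LSBs, L1 index, TLB index) fields sent early on L wires."""
    l1_index = (vaddr >> BLOCK_OFFSET_BITS) & ((1 << L1_INDEX_BITS) - 1)
    tag_lsbs = (vaddr >> (BLOCK_OFFSET_BITS + L1_INDEX_BITS)) & ((1 << TAG_LSB_BITS) - 1)
    tlb_index = (vaddr >> PAGE_OFFSET_BITS) & ((1 << TLB_INDEX_BITS) - 1)
    return tag_lsbs, l1_index, tlb_index

print(l_wire_bits(0x7FFF2A64))   # (60, 169, 2)
```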

Model parameters
- Simplescalar-3.0 with separate integer and floating-point queues.
- 32 KB 2-way instruction cache.
- 32 KB 4-way data cache.
- 128-entry, 8-way I-TLB and D-TLB.

Overview/Motivation: three wire implementations employed in this study
- B wires (traditional): optimal delay, but high power consumption.
- L wires: faster than B wires, lower bandwidth.
- PW wires: reduced power consumption and higher bandwidth than B wires, but increased delay.