1 Wire-driven Microarchitectural Design Space Exploration School of Electrical and Computer Engineering Georgia Institute of Technology Atlanta, GA 30332,

Presentation transcript:

1 Wire-driven Microarchitectural Design Space Exploration School of Electrical and Computer Engineering Georgia Institute of Technology Atlanta, GA 30332, USA Mongkol Ekpanyapong Sung Kyu Lim Chinnakrishnan Ballapuram Hsien-Hsin “Sean” Lee ISCAS 2005, Kobe, Japan

2 Microarchitecture Design Trend
Transistors are almost free → billions of billions [Pat Gelsinger keynote in DAC-42].
Processor architects tend to:
increase module capacity to improve performance (e.g. caches, BTB, ROB, etc.);
increase the die dimension;
assume communications are free, too.
But ...
[Figure: Delay = 80 ns for a 1 mm wire; Delay = 20 ns for a 0.5 mm wire]
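The two delay figures on this slide are consistent with unbuffered wire delay growing with the square of wire length, as in a distributed RC model. The following is a minimal sketch under that assumption; the function name and the calibration constant are illustrative and do not come from the presentation.

```python
# Assumption: unbuffered wire delay follows a distributed RC model,
# so delay grows with the square of the wire length.
# The constant is calibrated from the slide's data point (80 ns at 1 mm).
DELAY_PER_MM2_NS = 80.0  # hypothetical calibration constant, in ns per mm^2


def unbuffered_wire_delay_ns(length_mm: float) -> float:
    """Estimate the delay of an unbuffered wire of the given length."""
    if length_mm < 0:
        raise ValueError("length must be non-negative")
    return DELAY_PER_MM2_NS * length_mm ** 2


if __name__ == "__main__":
    # Halving the length quarters the delay under the quadratic model.
    for length in (0.5, 1.0, 2.0):
        print(f"{length} mm -> {unbuffered_wire_delay_ns(length)} ns")
```

Under this model, halving a wire from 1 mm to 0.5 mm quarters its delay, which is why longer global wires on a growing die become disproportionately expensive.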

3 Alleviating Wire Delay
Buffer insertion to speed up wires. In reality, chip size is growing; issues include many via cuts, area, power, ...
Flip-flop insertion to meet cycle time (P4 dedicates 2 pipe stages to communication).
[Figure: Module 1 connected to Module 2 through flip-flops (FF)]
Latency is not scalable!
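To make the flip-flop insertion point concrete, the sketch below converts a wire delay into a whole number of clock cycles. It assumes the simple rule that a wire needs as many cycles as it takes to cover its delay, with one flip-flop at each cycle boundary; the function names are hypothetical and not taken from the slides.

```python
import math


def wire_latency_cycles(wire_delay_ns: float, clock_ghz: float) -> int:
    """Cycles needed to cross a wire: ceil(delay / cycle time), at least 1.

    delay / cycle_time equals delay_ns * clock_ghz because the cycle time
    in ns is 1 / clock_ghz.
    """
    if wire_delay_ns < 0 or clock_ghz <= 0:
        raise ValueError("delay must be >= 0 and clock frequency > 0")
    return max(1, math.ceil(wire_delay_ns * clock_ghz))


def flip_flops_needed(wire_delay_ns: float, clock_ghz: float) -> int:
    """Flip-flops inserted on the wire: one per extra cycle boundary."""
    return wire_latency_cycles(wire_delay_ns, clock_ghz) - 1


if __name__ == "__main__":
    # The same physical wire costs more cycles as the clock gets faster.
    for ghz in (1.0, 2.0, 4.0):
        print(ghz, "GHz ->", wire_latency_cycles(0.75, ghz), "cycle(s)")
```

The usage loop shows why the slide calls latency "not scalable": for a fixed wire delay, the cycle count rises with clock frequency.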

4 Motivation
Wires, in particular global wires, are a problem in deep-submicron processor design.
Conventional architecture techniques that increase module sizes (e.g. caches) will no longer guarantee performance improvement.
Early design space exploration (DSE) at the microarchitecture level needs to take “wire impact” into account.
A highly efficient DSE framework is imperative.

5 Algorithms

6 Dynamic Communication-aware Profile-guided Floorplanning [DAC-42]
[Flow diagram. Inputs: Technology Parameter, Architecture Description, Application, Target Frequency. Stages: CACTI, GENESYS, Profiling, Floorplanning, Cycle-based Simulator. Intermediate data: Module-level Netlist + Profile, Module-level Layout + Wire Latency.]
Use the traffic profile for floorplanning.

7 AMPLE → Adaptive Microarchitectural PLanning Engine
Wire-driven Automated Design Space Exploration
[Flow diagram. Inputs: Technology Parameter, Architecture Description, Application, Target Frequency. Stages: CACTI, GENESYS, Profiling, Floorplanning, Cycle-based Simulator, Adaptive Parameter Tuning. Intermediate data: Module-level Netlist + Profile, Module-level Layout + Wire Latency.]

8 Adaptive Parameter Tuning Algorithm
[Flowchart of Adaptive Parameter Tuning: Initialization; for each uarch parameter: Gradient Search; End]

9 AMPLE → Initialization
[Flowchart of Initialization:
Smart Start;
Profile-Guided Microarch_Planning(): for N uarch parameters, (N+1) iterations;
Priority_search(), based on the Microarch_Planning results;
optional: Profile-Guided Microarch_Planning(): for N uarch parameters, (N+1) iterations]

10 Smart Start: Initial Microarchitecture Configurations
Good starting points can reduce design space exploration time.
Applications are classified into three categories: processor-bound, cache-sensitive, and bandwidth-bound applications.
[Table: one initial configuration per class. Columns: Class, width, BTB, RUU, LSQ, IL1, DL1, L2, L3, ALU, FPU. Rows as extracted: Processor bound K8K128K064; Cache sensitive K16K512K042; Bandwidth bound K8K128K042]

11 Priority Search
Prioritize microarchitectural parameters: high-impact parameters are tuned first.
A correlation metric can be used to identify critical parameters, but it requires a large runtime.
The Gradient First-order Ratio (GFR) is proposed here instead.
[Equation: GFR, with annotations "a uarch parameter (e.g. BTB)" and "the uarch parameter that has the max IPC gain"]
Higher GFR → higher priority.
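The GFR equation itself did not survive in this transcript, so the following is only a sketch of how a ratio-based priority ordering could work. It assumes, based on the two surviving annotations, that each parameter's first-order IPC gain is normalized by the largest IPC gain among all parameters; that formula, the function name, and the sample numbers are all assumptions rather than the authors' definition.

```python
from typing import Dict, List


def prioritize_by_gain_ratio(ipc_gain: Dict[str, float]) -> List[str]:
    """Order parameters from highest to lowest normalized IPC gain.

    ASSUMPTION: the ratio is gain(p) / max gain over all parameters.
    The real GFR definition is missing from the source transcript.
    """
    if not ipc_gain:
        return []
    max_gain = max(ipc_gain.values())
    # Guard against a non-positive maximum: every ratio is then treated as 0.
    ratio = {
        name: (gain / max_gain if max_gain > 0 else 0.0)
        for name, gain in ipc_gain.items()
    }
    # Higher ratio means higher priority; ties keep insertion order.
    return sorted(ratio, key=lambda name: ratio[name], reverse=True)


if __name__ == "__main__":
    # Hypothetical first-order IPC gains measured once per parameter.
    gains = {"BTB": 0.02, "L2": 0.10, "RUU": 0.05}
    print(prioritize_by_gain_ratio(gains))
```

A single measurement per parameter is what keeps this cheaper than a correlation study, which is the trade-off the slide points to.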

12 Adaptive Parameter Tuning Algorithm
[Flowchart of Adaptive Parameter Tuning: Initialization; for each uarch parameter: Gradient Search; End]

13 Gradient Search Algorithm
[Flowchart of Gradient Search: Update Parameter and Prune; Profile-Guided Microarch_Planning(); Compute Gain; repeat while Gain > Threshold && Acyclic; Return]
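The loop in this flowchart can be sketched as follows. This is not the authors' implementation: `evaluate` is a hypothetical stand-in for Profile-Guided Microarch_Planning() plus simulation, the gain is assumed to be the relative improvement in the returned score, the update is assumed to be a fixed additive step, and "Acyclic" is interpreted as never revisiting a parameter value.

```python
from typing import Callable, Dict

Config = Dict[str, int]


def gradient_search(
    config: Config,
    param: str,
    step: int,
    evaluate: Callable[[Config], float],  # hypothetical: returns a positive score such as IPC
    prune: Callable[[str, int], int],     # hypothetical: clamps or rounds unrealistic values
    threshold: float = 0.01,
) -> Config:
    """Tune one parameter while the gain exceeds the threshold and no value repeats."""
    best = dict(config)
    best_score = evaluate(best)  # assumed > 0 so the relative gain is defined
    visited = {best[param]}
    while True:
        # Update Parameter and Prune
        candidate = dict(best)
        candidate[param] = prune(param, best[param] + step)
        if candidate[param] in visited:  # "Acyclic" check: stop on a repeated value
            break
        visited.add(candidate[param])
        # Stand-in for Profile-Guided Microarch_Planning() and simulation
        score = evaluate(candidate)
        # Compute Gain (assumed: relative score improvement)
        gain = (score - best_score) / best_score
        if gain <= threshold:
            break
        best, best_score = candidate, score
    return best


if __name__ == "__main__":
    # Toy score that saturates once the RUU reaches 64 entries.
    toy_score = lambda cfg: 1.0 + min(cfg["ruu"], 64) / 64
    cap = lambda name, value: min(value, 128)
    print(gradient_search({"ruu": 16}, "ruu", 16, toy_score, cap))
```

In the toy run the search stops as soon as enlarging the RUU no longer improves the score, so the remaining evaluations of larger sizes are never paid for.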

14 Compute Gain and New Parameters
Let [p,i] be microarchitecture parameter p at iteration i, and let a second symbol denote the step size.
Gain Equation:
Parameter Calculation Equation:
Parameters are pruned or rounded if unrealistic.

15 Search Pruning Rationale
Reduce search time by pruning unrealistic parameters:
Cache size order: L1 < L2 < L3
Issue width ≥ number of ALUs
No search over floating-point units for integer applications
Upper and lower bounds on the number of modules and module size
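The pruning rules listed on this slide translate naturally into a feasibility check. The sketch below is illustrative only: the configuration keys, the bounds format, and the choice to treat a cache size of 0 as "level absent" are assumptions, not details from the presentation.

```python
from typing import Dict, List, Tuple

Config = Dict[str, int]
Bounds = Dict[str, Tuple[int, int]]


def is_realistic(cfg: Config, bounds: Bounds) -> bool:
    """Return True if the configuration passes the slide's pruning rules."""
    # Upper and lower bounds on module counts and module sizes.
    for name, (low, high) in bounds.items():
        if name in cfg and not low <= cfg[name] <= high:
            return False
    # Cache size order L1 < L2 < L3.
    # ASSUMPTION: a size of 0 means the level is absent and is skipped.
    levels = [max(cfg["il1"], cfg["dl1"]), cfg["l2"], cfg["l3"]]
    present = [size for size in levels if size > 0]
    if any(a >= b for a, b in zip(present, present[1:])):
        return False
    # Issue width must be at least the number of ALUs.
    return cfg["width"] >= cfg["alu"]


def searchable_params(params: List[str], integer_app: bool) -> List[str]:
    """Skip the floating-point unit count for integer applications."""
    return [p for p in params if not (integer_app and p == "fpu")]


if __name__ == "__main__":
    cfg = {"width": 4, "alu": 4, "fpu": 2, "il1": 8, "dl1": 8, "l2": 128, "l3": 0}
    print(is_realistic(cfg, {"width": (1, 8)}))
    print(searchable_params(["btb", "fpu", "l2"], integer_app=True))
```

Running a check like this before each planning and simulation step avoids spending evaluation time on configurations that would never be built.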

16 Experimental Results

17 DSE Runtime Comparison
[Table: Time and Iteration columns for Brute Force, Simulated Annealing, and AMPLE; benchmark rows 164.gzip, vpr, mcf, gap, twolf, swim, art, lucas; final row Normalized Avg. Time]

18 Performance Comparison
Best: best pick from brute force
SA: simulated annealing
Gra: AMPLE with design goal of “performance”
Gra II: AMPLE with design goal of “performance + area”
1.0 = brute force average

19 Area Comparison
Best: best pick from brute force
SA: simulated annealing
Gra: AMPLE with design goal of “performance”
Gra II: AMPLE with design goal of “performance + area”
1.0 = brute force average

20 Contributions and Conclusion
We propose the AMPLE DSE framework:
Wire-delay conscious
Goal-directed: high performance, cost effectiveness
Highly efficient: an order of magnitude faster than time-limited (incomplete) brute force, and 1.43x faster than simulated annealing
We show that AMPLE outperforms prior art in DSE turnaround time and DSE quality.

21 Q & A
That’s All Folks!