Compiler-in-the-Loop ADL-driven Early Architectural Exploration
Aviral Shrivastava, Nikil Dutt, Alex Nicolau, Eugene Earlie

Presentation transcript:

Compiler-in-the-Loop ADL-driven Early Architectural Exploration
Aviral Shrivastava (1), Nikil Dutt (1), Alex Nicolau (1), Eugene Earlie (2)
(1) Center For Embedded Computer Systems, University of California, Irvine, CA, USA
(2) Strategic CAD Labs, Intel, Hudson, MA, USA

TechCon 2005 Copyright © 2005 UCI ACES Laboratory

Bypassing Improves Performance
- Pipelining improves performance
  - Limited by pipeline hazards
- Bypasses eliminate certain data hazards
  - Further improve performance
[Figure: pipeline F, D, OR, X1, X2, WB with register file RF, executing R1 ← R2 + R3 followed by R4 ← R4 + R1, shown without and with a bypass that forwards R1.]

Impact of Bypassing
- Area and power consumption
  - Wide multiplexers
  - Bypass control logic
  - Bypass wires
- Cycle time
  - Bypasses may be a part of the timing-critical path
- Wiring congestion
- Overall chip complexity
- Deeply pipelined out-of-order processors
[Figure: pipeline F, D, OR, X1, X2, WB with register file RF and multiplexers M1, M2.]
References:
P. Ahuja et al., "The Performance Impact of Incomplete Bypassing in Processor Pipelines," MICRO 1995.
A. Abnous and N. Bagherzadeh, "Pipelining and Bypassing in a VLIW Processor," IEEE Trans.

Problem, Solution and Problem
- Problem: How do I customize bypasses?
  - Important for embedded systems
- Solution: Keep only the most beneficial bypasses
  - Area, power and performance trade-off
- Problem: How to compile for a processor with partial bypassing?
  - Requires Compiler-in-the-Loop exploration
[Figure: pipeline F, D, OR, X1, X2, WB with register file RF.]

Compiler-in-the-Loop Exploration
- How to compile for partial bypassing
- Compiler in the exploration loop
- Power-Performance-Area trade-off

Bypass Sensitive Scheduling
- Bypasses transfer data between dependent operations
- Missing bypasses cause pipeline hazards
- A bypass-sensitive compiler should be able to detect and avoid pipeline hazards
[Figure: pipeline F, D, OR, X1, X2, WB with register file RF, executing R1 ← R2 + R3 followed by R4 ← R4 + R1; one case is labeled "No Hazard" and the other "Hazard".]
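
The hazard detection named on this slide can be illustrated with a small stall model. This is only a sketch under assumptions that the slide does not state: a six-stage pipeline F, D, OR, X1, X2, WB; operands are read in OR; a bypass from a stage forwards the result while the producer occupies that stage; and the register file holds the result from the cycle after WB. The name `stall_cycles` and its parameters are hypothetical.

```python
STAGES = ("F", "D", "OR", "X1", "X2", "WB")


def stall_cycles(bypasses, distance=1):
    """Stalls seen by a consumer issued `distance` cycles after its producer.

    bypasses: set of producer stages ("X1", "X2", "WB") that have a bypass
    to the operand-read stage OR. This timing model is an assumption made
    for illustration; it is not taken from the slides.
    """
    if distance < 1:
        raise ValueError("the consumer must issue after the producer")
    read_stage = STAGES.index("OR")
    last_stage = len(STAGES) - 1
    # The producer enters F in cycle 0, so it is in stage i during cycle i.
    earliest = distance + read_stage  # cycle in which the consumer would reach OR
    cycle = earliest
    while True:
        if cycle > last_stage:
            break  # the producer has already written the register file
        if STAGES[cycle] in bypasses:
            break  # the producer sits in a stage that has a bypass
        cycle += 1
    return cycle - earliest


if __name__ == "__main__":
    print(stall_cycles({"X1", "X2", "WB"}))        # full bypassing
    print(stall_cycles(set()))                     # no bypasses at all
    print(stall_cycles({"X1", "WB"}, distance=2))  # X2 bypass missing
```

Under this model the stall count depends on both the bypass set and the distance between the two operations, which is the information a bypass-sensitive scheduler would need when ordering dependent operations.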

Operation Table
- An Operation Table (OT) is a binding between an operation and the processor resources and registers
- Can detect resource hazards
  - OTs model processor resources
- Can detect data hazards
  - OTs model processor registers
Operation Table for ADD R1 R2 R3:
1. F
2. D
3. OR
   ReadOperands: R2 via C1 from RF; R3 via C2 from RF, or via C5 from BRF
   DestOperands: R1 in RF
4. X1
   WriteOperands: R1 via C4 to BRF
5. X2
6. XWB
   WriteOperands: R1 via C3 to RF
[Figure: pipeline F, D, OR, X1, X2, WB with register file RF, bypass register file BRF and connections C1 to C5.]
Details are in the paper.
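
One possible way to hold the Operation Table above in a program is sketched below. The class `OTRow`, the two helper functions and the idea of counting cycles from an issue cycle are illustrative assumptions; only the stage names, registers, connections (C1 to C5) and the RF/BRF targets come from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class OTRow:
    """One cycle of an Operation Table (hypothetical representation)."""
    stage: str
    # register -> list of alternative (connection, source) read paths
    reads: dict = field(default_factory=dict)
    # register -> (connection, destination) write path
    writes: dict = field(default_factory=dict)


# Operation Table for ADD R1 R2 R3, transcribed from the slide.
ADD_R1_R2_R3 = [
    OTRow("F"),
    OTRow("D"),
    OTRow("OR", reads={"R2": [("C1", "RF")],
                       "R3": [("C2", "RF"), ("C5", "BRF")]}),
    OTRow("X1", writes={"R1": ("C4", "BRF")}),
    OTRow("X2"),
    OTRow("XWB", writes={"R1": ("C3", "RF")}),
]


def stage_conflicts(ot_a, start_a, ot_b, start_b):
    """(cycle, stage) pairs in which both operations need the same stage."""
    busy = {(start_a + i, row.stage) for i, row in enumerate(ot_a)}
    return sorted((start_b + i, row.stage)
                  for i, row in enumerate(ot_b)
                  if (start_b + i, row.stage) in busy)


def write_cycle(ot, start, register, destination):
    """Cycle in which `register` is written to `destination`, else None."""
    for i, row in enumerate(ot):
        path = row.writes.get(register)
        if path is not None and path[1] == destination:
            return start + i
    return None


if __name__ == "__main__":
    # Two ADDs issued one cycle apart never share a stage.
    print(stage_conflicts(ADD_R1_R2_R3, 0, ADD_R1_R2_R3, 1))
    # R1 reaches the bypass register file before the register file.
    print(write_cycle(ADD_R1_R2_R3, 0, "R1", "BRF"),
          write_cycle(ADD_R1_R2_R3, 0, "R1", "RF"))
```

The stage check corresponds to the resource-hazard use of OTs, and the write-cycle lookup is the kind of query a data-hazard check would build on.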

Up to 20% Performance Improvement on MiBench

Compiler-in-the-Loop Exploration
- How to compile for partial bypassing
- Compiler in the exploration loop
- Power-Performance-Area trade-off

Compiler-in-the-Loop Exploration
- Traditional exploration: the application is compiled with gcc -O3; the executable and the bypass configuration go to the cycle-accurate simulator, which reports the traditional cycle count
- Bypass-sensitive Compiler-in-the-Loop exploration: the application and the bypass configuration go to the OT-based compiler; the executable runs on the cycle-accurate simulator, which reports the CIL cycle count
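
The difference between the two flows can be sketched as two small loops. The function names and the callables passed in (`compile_once`, `compile_for`, `simulate`) are hypothetical placeholders for the gcc build, the OT-based compiler and the cycle-accurate simulator; none of them is a real tool interface.

```python
def traditional_exploration(app, configs, compile_once, simulate):
    """Compile once without knowledge of the bypasses, then simulate
    the same executable on every bypass configuration."""
    executable = compile_once(app)
    return {cfg: simulate(executable, cfg) for cfg in configs}


def cil_exploration(app, configs, compile_for, simulate):
    """Compiler-in-the-Loop: recompile for each bypass configuration
    before simulating on that same configuration."""
    cycles = {}
    for cfg in configs:
        executable = compile_for(app, cfg)
        cycles[cfg] = simulate(executable, cfg)
    return cycles
```

Both functions return a mapping from bypass configuration to cycle count, so the two explorations can be compared configuration by configuration.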

Bypass Exploration
- 7 pipeline stages can bypass a result
- We vary which pipeline stages bypass a result
  - 2^7 = 128 bypass configurations
- Encode the bypass configuration
  - Configuration 28: bypass paths from MWB, M2 and XWB are present
[Figure: XScale pipeline with stages F1, F2, ID, RF, X1, X2, XWB; M1, M2, MWB; D1, D2, DWB.]
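
The encoding can be illustrated with a 7-bit mask. The bit order below (X1 as the most significant bit, DWB as the least significant) is an assumption chosen so that configuration 28 decodes to XWB, M2 and MWB as on the slide; the slide itself does not list the order, and `decode` and `encode` are hypothetical names.

```python
# Assumed order of bypass sources, most significant bit first.
BYPASS_SOURCES = ("X1", "X2", "XWB", "M2", "MWB", "D2", "DWB")


def decode(config):
    """Return the bypass sources present in a configuration number."""
    n = len(BYPASS_SOURCES)
    if not 0 <= config < 2 ** n:
        raise ValueError("configuration must be in the range 0..127")
    return [src for i, src in enumerate(BYPASS_SOURCES)
            if (config >> (n - 1 - i)) & 1]


def encode(sources):
    """Return the configuration number for a collection of bypass sources."""
    n = len(BYPASS_SOURCES)
    return sum(1 << (n - 1 - BYPASS_SOURCES.index(src)) for src in set(sources))


if __name__ == "__main__":
    print(decode(28))                      # ['XWB', 'M2', 'MWB']
    print(encode(["MWB", "M2", "XWB"]))    # 28
```

Iterating `decode` over `range(128)` enumerates every bypass configuration in the design space.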

Bypass Explorations on XScale
- The CIL compiler can effectively exploit the bypass configuration
- Significant performance difference
[Figure: bitcount; execution cycles versus bypass source configuration, Traditional and CIL.]

X-bypass explorations in XScale
- Difference in trends
[Figure: bitcount; execution cycles versus X-bypass configuration (combinations of X1, X2 and XWB), Traditional and CIL. Inset: XScale pipeline with stages F1, F2, ID, RF, X1, X2, XWB; M1, M2, MWB; D1, D2, DWB.]

M-bypass explorations in XScale
- Difference in trends
[Figure: XScale pipeline with stages F1, F2, ID, RF, X1, X2, XWB; M1, M2, MWB; D1, D2, DWB.]

D-bypass exploration in XScale
- Difference in trends
[Figure: bitcount; execution cycles versus D-bypass configuration (combinations of D2 and DWB), Traditional and CIL. Inset: XScale pipeline with stages F1, F2, ID, RF, X1, X2, XWB; M1, M2, MWB; D1, D2, DWB.]

Compiler-in-the-Loop Exploration
- How to compile for partial bypassing
- Compiler in the exploration loop
- Power-Performance-Area trade-off

Performance-Energy-Area Trade-off
- Design Point 1
  - No bypass from MWB and XWB to the first operand
  - 18% less area and 14% less energy consumption of bypass control logic
  - 2% performance loss
- Design Point 2
  - Only D2 and X2 bypass to the first operand
  - 25% less area and 16% less energy consumption of bypass control logic
  - 6% performance loss
[Figure: trade-off plot with Point 1 and Point 2 marked.]
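
A Pareto filter over such design points can be sketched as follows. The two design points use the figures from the slide; the full-bypassing baseline at (0, 0, 0) is an assumption about what the percentages are measured against, and the names `DesignPoint`, `dominates` and `pareto_front` are hypothetical.

```python
from typing import NamedTuple


class DesignPoint(NamedTuple):
    name: str
    area_saving: float    # percent, bypass control logic, higher is better
    energy_saving: float  # percent, bypass control logic, higher is better
    perf_loss: float      # percent, lower is better


def dominates(a, b):
    """True if `a` is at least as good as `b` everywhere and better somewhere."""
    no_worse = (a.area_saving >= b.area_saving
                and a.energy_saving >= b.energy_saving
                and a.perf_loss <= b.perf_loss)
    better = (a.area_saving > b.area_saving
              or a.energy_saving > b.energy_saving
              or a.perf_loss < b.perf_loss)
    return no_worse and better


def pareto_front(points):
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


POINTS = [
    DesignPoint("full bypassing (assumed baseline)", 0, 0, 0),
    DesignPoint("design point 1", 18, 14, 2),
    DesignPoint("design point 2", 25, 16, 6),
]

if __name__ == "__main__":
    for point in pareto_front(POINTS):
        print(point.name)
```

With only these three points none dominates another, so all of them remain on the front: each one trades some performance for savings in area and energy.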

Summary
- Bypassing improves performance but is costly in terms of area and power
- Partial bypassing presents valuable trade-offs but poses challenges in compilation
- We propose a compilation approach for partial bypassing
  - Up to 20% performance improvement with a bypass-sensitive compiler
- We propose Compiler-in-the-Loop exploration of partial bypasses
  - More meaningful exploration of the design space
- CIL exploration of bypasses is able to discover interesting Pareto-optimal design points