Bundled Execution of Recurring Traces for Energy-Efficient General Purpose Processing

Presentation transcript:

University of Michigan Electrical Engineering and Computer Science
1 Bundled Execution of Recurring Traces for Energy-Efficient General Purpose Processing
Shantanu Gupta, Shuguang Feng, Amin Ansari, Scott Mahlke, and David August
University of Michigan (Intel, Northrop Grumman, UIUC, Princeton)
MICRO-44, December 6, 2011

2 Computational Efficiency Landscape
[Figure: computational efficiency of current processors. Labeled points: Pentium M, Core 2, Core i7, GTX 280, GTX 295, S1070, IBM Cell, AMD, Embedded Processors, AMD Opteron]
Energy dilemma:
► More gates can fit on a die
► But power constraints limit their use
To scale performance, we need to increase efficiency

3 Where Does The Energy Go?
[Figure: breakdown of energy used in a single-issue RISC in-order core] [Dally’08]
► Instruction fetch and decode energy dominates
► Actual execution barely consumes 10%
Plenty of opportunities to save energy...

4 Increasing Efficiency with Accelerators
Accelerators can give 10–50X efficiency
[Figure: Efficiency, Performance versus Flexibility. Labeled design points: Loop Accelerators, ASICs, ASIPs, DSPs, SIMD, FPGAs, General Purpose Processors]
Application regularity defines success:
1. Small dominant code segments
2. Little control flow
3. Narrow application set
4. Data parallelism

5 Utility Factor for Accelerators
[Figure: Efficiency, Performance versus Flexibility. Labeled design points: Loop Accelerators, ASICs, ASIPs, DSPs, SIMD, FPGAs, General Purpose Processors, plus an open position marked "???"]
What fraction of the code gets accelerated?
Most solutions fail for “irregular” or “general-purpose” code
Goal: A design to target irregular codes
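The utility-factor question can be made concrete with an Amdahl-style estimate. The sketch below is not from the slides: it assumes the accelerated fraction is measured as a share of baseline energy, and that the accelerator's efficiency gain applies uniformly to that share. The function name and parameters are illustrative.

```python
def relative_energy(fraction_accelerated: float, efficiency_gain: float) -> float:
    """Energy relative to running everything on the general-purpose core.

    Assumed model (Amdahl-style, not taken from the slides):
    the non-accelerated share keeps its baseline energy, and the
    accelerated share shrinks by `efficiency_gain`.
    """
    if not 0.0 <= fraction_accelerated <= 1.0:
        raise ValueError("fraction_accelerated must be in [0, 1]")
    if efficiency_gain <= 0.0:
        raise ValueError("efficiency_gain must be positive")
    return (1.0 - fraction_accelerated) + fraction_accelerated / efficiency_gain


# A 50X accelerator that covers only 10% of the baseline energy
narrow = relative_energy(0.10, 50.0)
# A modest 10X accelerator that covers 90% of the baseline energy
broad = relative_energy(0.90, 10.0)
print(f"narrow coverage: {narrow:.3f}, broad coverage: {broad:.3f}")
```

Under these assumptions the narrow accelerator leaves about 90% of the original energy in place, while the broad one leaves about 19%, which is why coverage of irregular code matters as much as peak efficiency.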

6 The BERET Architecture
BERET: Bundled Execution of REcurring Traces
A compute engine for “hot regular regions” in irregular codes
Key insights:
1. Exploits recurring instructions (traces) to save on redundant fetches and decodes
2. Uses a bundled execution model to save on redundant register reads/writes
[Figure: system view. Labels: Program, Hot Regions, CPU, BERET, L1 I$, L1 D$. Arrows between CPU and BERET: copy live-ins, copy live-outs]

7 Insight 1: Recurring Instructions
How about loops?
► Typical loops in irregular codes are large and control intensive!
[Figure: Control Flow Graph (CFG) with basic blocks BB 0 to BB 7 and BB 20, edge probabilities 85%, 15%, 90%, 10%, 50%, and the hot basic blocks highlighted. A looping trace is extracted from it: BB 1, BB 2, BB 5, BB 20, with "exit?" checks where the trace would branch to BB 3 or BB 4]
We leverage such looping traces for savings:
1. Straight-line code → simple hardware
2. Typically short → easy to buffer
3. Significant fetch / decode savings for buffered instructions
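One way to picture how a looping trace falls out of a profiled CFG is a greedy walk along the most likely successor. This is a minimal sketch and not the detection method used in the work: the function name, the `min_prob` threshold, and the CFG below are all hypothetical (a few probabilities echo the slide's percentages, but the edge structure is a guess).

```python
def select_looping_trace(cfg, head, min_prob=0.5):
    """Greedily follow the most probable successor from `head`.

    cfg maps a block name to a list of (successor, probability) pairs.
    Returns (trace, loop_back_probability) if the walk returns to
    `head`, otherwise None.
    """
    trace, seen = [head], {head}
    current, path_prob = head, 1.0
    while True:
        successors = cfg.get(current, [])
        if not successors:
            return None                      # fell out of the profiled region
        nxt, prob = max(successors, key=lambda edge: edge[1])
        if prob < min_prob:
            return None                      # no dominant direction here
        path_prob *= prob
        if nxt == head:
            return trace, path_prob          # closed the loop
        if nxt in seen:
            return None                      # inner cycle that skips the head
        trace.append(nxt)
        seen.add(nxt)
        current = nxt


# Hypothetical CFG, invented for illustration only
cfg = {
    "BB1": [("BB2", 0.85), ("BB3", 0.15)],
    "BB2": [("BB5", 0.90), ("BB4", 0.10)],
    "BB3": [("BB5", 1.0)],
    "BB4": [("BB5", 1.0)],
    "BB5": [("BB20", 0.5), ("BB6", 0.3), ("BB7", 0.2)],
    "BB6": [("BB20", 1.0)],
    "BB7": [("BB20", 1.0)],
    "BB20": [("BB1", 0.95), ("exit", 0.05)],
}
trace, loop_back = select_looping_trace(cfg, "BB1")
print(trace, round(loop_back, 4))
```

The less likely edges that the walk skips are the places where the slide's trace shows "exit?" checks: the trace stays straight-line, and any departure from it becomes a side exit.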

8 Frequency of Recurring Instructions
Offload stable traces in irregular loops

9 Insight 2: Bundled Execution
Traditional processors issue and execute instructions in isolation...
[Figure: the same data flow graph (operations >>, ST, LD, +, /, >>, &, <<, ST, +, LD) shown twice, once executed instruction by instruction and once grouped into bundles]
Isolated execution: 11 instrs, 14 reads, 10 writes
Bundled execution: 3 instrs, 6 reads, 2 writes
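The read/write savings come from values that are produced and consumed inside one bundle and so never touch the register file. The sketch below counts both cases for a small trace. It is an illustration under stated assumptions, not the slide's example: the four-instruction trace and its bundling are invented, names are assumed to be single-assignment, and a value read several times within a bundle is assumed to be fetched once.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    dest: Optional[str]      # None for instructions with no register result (e.g. ST)
    op: str
    srcs: Tuple[str, ...]


def isolated_cost(trace):
    """Every instruction reads all its sources and writes its result."""
    reads = sum(len(ins.srcs) for ins in trace)
    writes = sum(1 for ins in trace if ins.dest is not None)
    return len(trace), reads, writes


def bundled_cost(trace, bundles, live_out=frozenset()):
    """Only values that cross a bundle boundary use the register file."""
    reads = writes = 0
    for bundle in bundles:
        produced, external = set(), set()
        for idx in bundle:
            ins = trace[idx]
            external.update(s for s in ins.srcs if s not in produced)
            if ins.dest is not None:
                produced.add(ins.dest)
        reads += len(external)
        outside = [trace[j] for j in range(len(trace)) if j not in bundle]
        for value in produced:
            if value in live_out or any(value in ins.srcs for ins in outside):
                writes += 1
    return len(bundles), reads, writes


# Hypothetical trace: load, add, shift, store
trace = [
    Instr("t1", "LD", ("a",)),
    Instr("t2", "+", ("t1", "b")),
    Instr("t3", ">>", ("t2", "c")),
    Instr(None, "ST", ("t3", "d")),
]
bundles = [[0, 1], [2, 3]]
print(isolated_cost(trace))           # (4, 7, 3)
print(bundled_cost(trace, bundles))   # (2, 5, 1)
```

In this toy case `t1` and `t3` stay inside their bundles, so only `t2` is written back, which is the same effect the slide reports at larger scale.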

10 Efficiency of Bundled Execution
[Chart: all results normalized to a bundle length of 1]
Bundled execution increases datapath efficiency by more than 2x

11 BERET Hardware Design
Hardware design objectives:
► Capable of executing straight-line code in a loop (traces)
► Support for bundled execution of trace instructions
► Handle trace side-exits, and transfer control to the main processor
[Figure: BERET datapath. Labels: I$, Configuration RAM (CRAM), SEB config., Index bits, config. bits, Internal Register File, SEB 1, SEB 2, SEB N, MUX, Writeback Bus, Store Buffer, D$. Detail of one SEB: Input Latch, ALU, LD, <<, ALU, Output Latch]
Stages: Configure SEB (1–2 cycles), Execute SEB (1–5 cycles), Writeback (1–2 cycles)
SEB: Subgraph Execution Block

12 Compiler Support
1. Trace Detection: Program → Hot Traces (with high loop back probability)
2. Mapping traces to SEBs: Hot Trace → Data flow subgraphs → SEB 0, SEB 1, SEB 2, SEB 3
[Figure: example hot trace MPY, ADD, SUB, BR, LD, AND, SHIFT, ST, ADD, OR, BR with a trace exit. Its operations are regrouped into data flow subgraphs (| & << ST × - BR LD + + BR), with the label Assert, and mapped onto BERET with SEBs. Other labels: Configuration, Control, RF]
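The mapping step can be approximated by a much simpler procedure than real subgraph matching. The sketch below is only a greedy stand-in, assuming bundles are contiguous in trace order and capped at a hypothetical `max_len`; an actual compiler would match data flow subgraphs against the shapes the SEBs support, which this does not attempt. The trace and register names are invented.

```python
def greedy_bundles(trace, max_len=3):
    """Split a straight-line trace into dependence-connected bundles.

    trace is a list of (dest, op, srcs) tuples with single-assignment
    names. An instruction joins the current bundle only if it consumes
    a value produced in that bundle and the bundle has room left.
    Returns a list of bundles, each a list of instruction indices.
    """
    bundles, current, produced = [], [], set()
    for idx, (dest, _op, srcs) in enumerate(trace):
        depends = any(s in produced for s in srcs)
        if current and (not depends or len(current) >= max_len):
            bundles.append(current)          # close the bundle, start a new one
            current, produced = [], set()
        current.append(idx)
        if dest is not None:
            produced.add(dest)
    if current:
        bundles.append(current)
    return bundles


# Hypothetical trace; the final branch is already converted to an assert
example = [
    ("t1", "LD", ("a",)),
    ("t2", "ADD", ("t1", "b")),
    ("t3", "SHIFT", ("t2", "c")),
    (None, "ST", ("t3", "d")),
    ("t4", "MPY", ("e", "f")),
    ("t5", "SUB", ("t4", "g")),
    (None, "ASSERT", ("t5",)),
]
print(greedy_bundles(example))   # [[0, 1, 2], [3], [4, 5, 6]]
```

The store ends up alone because the length cap closes the first bundle, which hints at why the choice of supported subgraph shapes affects how much of a trace can be bundled.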

13 CPU-BERET Execution Flow
[Figure: execution timeline split between CPU and BERET. Labels: Copy Live-Ins, repeated Header and Body iterations, RF-0, RF-1, RF-0, RF-1, Assert, Side Exit, Copy Live-Outs, Execution Time]
Execution sequence:
1. Registers copied to BERET
2. Program executes on BERET
3. Assert discovered, last iteration squashed
4. Registers copied back to main processor
5. Program executes on main processor
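The five steps above can be sketched as a small software model. This is an illustration and not the hardware protocol: it assumes the RF-0 and RF-1 labels in the figure mean that each iteration works on a second register-file copy so a failed assert can be discarded, it ignores the store buffer, and every name in it is hypothetical.

```python
def run_on_beret(cpu_regs, live_ins, live_outs, body, max_iters=1000):
    """Model one offload of a looping trace.

    body(regs) mutates `regs` for one trace iteration and returns
    False if an assert (side exit) fires. Returns the number of
    committed iterations.
    """
    committed = {r: cpu_regs[r] for r in live_ins}    # 1. copy live-ins
    iterations = 0
    while iterations < max_iters:                     # 2. execute on BERET
        working = dict(committed)                     # assumed second RF copy
        if not body(working):
            break                                     # 3. assert: squash this iteration
        committed = working                           # iteration commits
        iterations += 1
    for r in live_outs:                               # 4. copy live-outs
        cpu_regs[r] = committed[r]
    return iterations                                 # 5. CPU resumes from here


def accumulate(regs):
    """Hypothetical trace body: acc += i; i += 1; assert i <= n."""
    regs["acc"] += regs["i"]
    regs["i"] += 1
    return regs["i"] <= regs["n"]


cpu = {"acc": 0, "i": 0, "n": 3}
done = run_on_beret(cpu, ["acc", "i", "n"], ["acc", "i"], accumulate)
print(done, cpu)   # 3 {'acc': 3, 'i': 3, 'n': 3}
```

The fourth iteration updates `acc` and `i` before its assert fails, and those updates are thrown away, so the main processor restarts from the state of the last committed iteration.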

14 Energy Savings
[Chart: energy savings, with benchmarks grouped into a training set and a test set]

15 Performance Impact

16 Concluding Remarks
Scaling program performance in an energy-constrained environment requires improving computational efficiency
Most accelerators exploit program regularity for savings
BERET is a configurable engine that saves energy by:
► Exploiting hot traces to avoid redundant fetches and decodes
► Using a bundled execution model to reduce temporary variable reads and writes
Energy Saving: ~35%
Performance Enhancement: ~10%
Area Overhead: 20%

17 Questions
For more ► See

18 Fine Grain Program Phase Behavior
Traditional phases too coarse-grained to match accelerator
[Figure: program behavior over an interval marked 0M to 10M, shown as traditional phases and as fine-grain phases]
Hypothesis of This Work:
Irregular programs are composed of fine-grain periods of high degrees of regularity. We can identify these periods and run them on an accelerator customized for “simple” execution.
Accelerate the pink portions