Sanghyun Park, §Aviral Shrivastava and Yunheung Paek


Hiding Cache Miss Penalty Using Priority-based Execution for Embedded Processors

Sanghyun Park, §Aviral Shrivastava and Yunheung Paek
SO&R Research Group, Seoul National University, Korea
§Compiler Microarchitecture Lab, Arizona State University, USA

Memory Wall Problem
- Increasing disparity between processor and memory speeds.
- In many applications, memory operations account for 30-40% of the total instructions (e.g., streaming input data).
- The Intel XScale spends on average 35% of its total execution time on cache misses (from Sun's page: www.sun.com/processors/throughput/datasheet.html).
- Critical need for reducing the memory latency.

Sanghyun Park : DATE 2008, Munich, Germany

Hiding Memory Latency
- High-end processors use multiple issue, value prediction, speculative mechanisms, and out-of-order (OoO) execution.
- These HW solutions execute independent instructions from the reservation stations even if a cache miss occurs.
- They are very effective techniques for hiding memory latency.
- But are they proper solutions for embedded processors?

Hiding Memory Latency
- In embedded processors, these are not viable solutions: they incur significant overheads in area, power, and chip complexity.
- In-order vs. out-of-order execution: a 46% performance gap.*
- OoO is too expensive in terms of complexity and design cycle.
- Most embedded processors are single-issue and non-speculative (e.g., all the implementations of ARM).
- There is a need for alternative mechanisms that hide the memory latency with minimal power and area cost.

* S. Hily and A. Seznec. Out-of-order execution may not be cost-effective on processors featuring simultaneous multithreading. In HPCA'99.

Basic Idea
- Place the analysis complexity in the compiler's custody: a HW/SW cooperative approach.
- The compiler identifies the low-priority instructions.
- The microarchitecture supports a buffer that suspends the execution of low-priority instructions.
- Use the memory latencies for meaningful work!

(Timeline figure: in the original execution, the pipeline stalls after a load misses the cache; with priority-based execution, the suspended low-priority instructions execute during the miss, shortening total execution time.)

Outline
- Previous work in reducing memory latency
- Priority based execution for hiding cache miss penalty
- Experiments
- Conclusion

Previous Work
- Prefetching: analyze the memory access pattern and prefetch the memory object before the actual load is issued.
  - Software prefetching [ASPLOS'91], [ICS'01], [MICRO'01]
  - Hardware prefetching [ISCA'97], [ISCA'90]
  - Thread-based prefetching [SIGARCH'01], [ISCA'98]
- Run-ahead execution: speculatively execute independent instructions during the cache miss [ICS'97], [HPCA'03], [SIGARCH'05].
- Out-of-order processors can inherently tolerate the memory latency using the ROB, but cost/performance studies show OoO mechanisms are very expensive for embedded processors [HPCA'99], [ICCD'00].

Priority of Instructions
- High-priority instructions: instructions that can cause cache misses.
  - Loads
  - Parents (data dependence): instructions that generate the source operands of a high-priority instruction.
  - Branches (control dependence): branches that high-priority instructions are control-dependent on.
- All other instructions are low-priority: they can be suspended until a cache miss occurs.

Finding Low-priority Instructions

Innermost loop of the Compress benchmark:

  01: .L19: ldr   r1, [r0, #-404]
  02:       ldr   ip, [r0, #-400]
  03:       ldmda r0, {r2, r3}
  04:       add   ip, ip, r1, asl #1
  05:       add   r1, ip, r2
  06:       rsb   r2, r1, r3
  07:       subs  lr, lr, #1
  08:       str   r2, [r0]
  09:       add   r0, r0, #4
  10:       bpl   .L19

(The slide also shows the loop's data-dependence graph.)

1. Mark all load and branch instructions of a loop.
2. Use UD chains to find the instructions that define the operands of already-marked instructions, and mark them (the parent instructions).
3. Recursively continue Step 2 until no more instructions can be marked.

Instructions 4, 5, 6 and 8 are low-priority.

Scope of the Analysis
- Candidates for instruction categorization: instructions in loops; at the end of the loop, all pending low-priority instructions are executed.
- Memory disambiguation:* a static memory disambiguation approach, orthogonal to our priority-based execution.
- ISA enhancement: 1-bit priority information for every instruction, plus a flushLowPriority instruction for the pending low-priority instructions.

* Memory disambiguation to facilitate instruction…, Gallagher, UIUC Ph.D. Thesis, 1995

Architectural Model
- Two execution modes, high- and low-priority, indicated by a 1-bit 'P' flag.
- Low-priority instructions: their operands are renamed, they reside in the ROB, and they cannot stall the processor pipeline.
- The priority selector compares the source registers of the issuing instruction with the register that will miss the cache.

(Block diagram: the decode unit feeds instructions, with their 'P' bit, into the ROB and the rename table; the priority selector and a MUX steer either high- or low-priority instructions onto the operation bus toward the FU and the memory unit, using the cache-missing register as input.)
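The selector's decision can be sketched in a few lines. This is a minimal model under my own assumptions (the names `Insn` and `may_issue` are illustrative, not from the paper): in normal, high-priority mode, low-priority instructions are parked in the ROB and nothing that reads the cache-missing register may issue; once a miss is outstanding, the machine drains the parked instructions.

```python
from dataclasses import dataclass

@dataclass
class Insn:
    text: str
    srcs: frozenset   # source registers
    p: bool           # the 1-bit 'P' flag: True = low-priority

def may_issue(insn, mode, miss_reg=None):
    """mode 'high': normal execution; park low-priority instructions and
    anything reading the register that will miss the cache.
    mode 'low': a miss is outstanding; drain the parked instructions."""
    if mode == "high":
        return not insn.p and (miss_reg is None or miss_reg not in insn.srcs)
    return True

add = Insn("add ip, ip, r1, asl #1", frozenset({"ip", "r1"}), p=True)
ldr = Insn("ldr ip, [r0, #-400]",    frozenset({"r0"}),       p=False)
print(may_issue(add, "high"))        # False: low-priority, parked in the ROB
print(may_issue(ldr, "high", "r1"))  # True: high-priority, does not read r1
print(may_issue(add, "low"))         # True: executed during the miss
```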

Execution Example
- All the parent instructions reside in the ROB, with their destinations renamed (rename table: ip → r17, r1 → r18), e.g.:
  - 01: ldr r1, [r0, #-404] → ldr r18, [r0, #-404]
  - 02: ldr ip, [r0, #-400] → ldr r17, [r0, #-400]
  - 04: add ip, ip, r1, asl #1 → add ip, r17, r18, asl #1 (low-priority)
- If a parent instruction has already been issued, a 'mov' instruction shifts the value from the real register to the rename register (e.g., mov r18, r1).
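The rename step can be modeled concretely. The data structures and names below are my assumptions for illustration: when a low-priority instruction is parked, its sources are mapped to rename registers, and if a parent has already issued and written its architectural register, an injected 'mov' copies that value into the rename register.

```python
import itertools

arch_regs    = {"r1": 42, "ip": 7}  # architectural register file (live values)
rename_regs  = {}                    # ROB-backed rename registers
rename_table = {}                    # architectural reg -> rename reg
free_regs    = (f"r{n}" for n in itertools.count(17))

def rename_src(reg):
    """Map a source register of a parked instruction to a rename register,
    injecting a mov if the parent already produced its value."""
    if reg not in rename_table:
        phys = next(free_regs)
        rename_table[reg] = phys
        if reg in arch_regs:                    # parent already issued
            rename_regs[phys] = arch_regs[reg]  # acts like: mov r17, ip
    return rename_table[reg]

# Parking "add ip, ip, r1, asl #1": both sources get rename registers,
# reproducing the slide's renamed form.
parked = f"add ip, {rename_src('ip')}, {rename_src('r1')}, asl #1"
print(parked)        # add ip, r17, r18, asl #1
print(rename_regs)   # values of ip and r1 captured in r17 and r18
```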

We can achieve the performance improvement by:
- executing low-priority instructions during a cache miss, and
- reducing the number of effective instructions in a loop.

Experimental Setup
- Intel XScale: 7-stage, single-issue, non-speculative; 100-entry ROB; 75-cycle memory latency.
- Cycle-accurate simulator validated against the 80200 EVB; power model from PTscalar.
- Benchmarks: innermost loops from MultiMedia, MiBench, SPEC2K and DSPStone.
- Flow: Application → GCC -O3 → Assembly → Compiler Technique for PE → Assembly with Priority Information → Cycle-Accurate Simulator → Report.

Effectiveness of PE (1)
- Up to 39% and on average 17% performance improvement.
- In the GSR benchmark, 50% of the instructions are low-priority, so PE can efficiently utilize the memory latency.

Effectiveness of PE (2)
- Measures how much of the memory stall time can be hidden: on average, 75% of the memory latency is hidden.
- The utilization of the memory latency depends on the ROB size and the memory latency: how many low-priority instructions can be held, and how many cycles can be hidden using PE.

Varying ROB Size
- Average reduction over all the benchmarks we used; memory latency = 75 cycles.
- A larger ROB can hold more low-priority instructions; a small ROB holds only a very limited number of them.
- Beyond 100 entries the benefit saturates, due to the fixed memory latency.

Varying Memory Latency
- Average reduction over all the benchmarks we used, with a 100-entry ROB.
- The fraction of the latency that can be hidden by PE keeps decreasing as the memory latency increases: the fixed supply of low-priority instructions covers a smaller share of a longer miss.
- There is a mutual dependence between the ROB size and the memory latency.
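That mutual dependence can be captured in a back-of-the-envelope model. This is my illustration, not a result from the paper: the cycles hidden per miss are capped both by how much low-priority work the ROB can park and by the miss latency itself.

```python
def hidden_fraction(memory_latency, rob_entries, parked_per_miss, cpi=1.0):
    """Fraction of one miss's latency overlapped with low-priority work."""
    usable = min(parked_per_miss, rob_entries)  # the ROB caps parked work
    hidden = min(memory_latency, usable * cpi)  # the latency caps overlap
    return hidden / memory_latency

# Small ROB: the ROB, not the 75-cycle latency, is the bottleneck.
print(hidden_fraction(75, rob_entries=16, parked_per_miss=60))
# ROB holding a latency's worth of work: the benefit saturates.
print(hidden_fraction(75, rob_entries=100, parked_per_miss=120))
# Longer latency at a fixed 100-entry ROB: the hidden fraction shrinks again.
print(hidden_fraction(200, rob_entries=100, parked_per_miss=120))
```

This reproduces both trends on the slides: growing the ROB past the latency-equivalent point stops helping, and growing the latency at a fixed ROB size lowers the hidden fraction.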

Power/Performance Trade-offs
- Anagram benchmark from SPEC2000.
- 1F-1D-1I in-order processor: much lower performance, but consumes much less power.
- 2F-2D-2I in-order processor: less performance than OoO, with more power consumption.
- 2F-2D-2I out-of-order processor: very good performance, but consumes too much power.
- A 1F-1D-1I pipeline with priority-based execution is an attractive design alternative for embedded processors.

Conclusion
- The memory gap is continuously widening, so latency-hiding mechanisms become ever more important.
- High-end processors rely on multiple issue, out-of-order execution, speculative execution, and value prediction; these are not suitable solutions for embedded processors.
- Compiler-architecture cooperative approach: the compiler classifies the priority of the instructions, and the architecture supports HW for priority based execution.
- Priority-based execution on a typical embedded processor design (1F-1D-1I) is an attractive design alternative for embedded processors.

Thank You!!