Post-Pass Binary Adaptation for Software-Based Speculative Precomputation
Steve S. Liao, Perry H. Wang, Hong Wang, Gerolf Hoflehner, Daniel Lavery, John P. Shen

SMT vs. Superscalar

Speculative Precomputation
Reduces cache misses
Spawns a thread that precomputes the address of a load
Does not modify architectural state, so it need not be correct
Basically another prefetcher
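A toy model can make the prefetching effect concrete. The sketch below is not the SSP mechanism itself: it assumes an infinite cache, ignores timing, and models the speculative thread as simply touching the address the main thread will need a fixed number of loads later. The function name `count_misses` and the `lookahead` parameter are made up for illustration.

```python
def count_misses(addresses, lookahead=0):
    """Count main-thread cache misses over a stream of load addresses.

    lookahead > 0 models a speculative thread running that many loads
    ahead of the main thread (toy model: infinite cache, no timing).
    """
    cache = set()
    misses = 0
    for i, addr in enumerate(addresses):
        # Speculative thread: it only warms the cache and never writes
        # program-visible state, so a wrong address costs no correctness.
        if lookahead and i + lookahead < len(addresses):
            cache.add(addresses[i + lookahead])
        # Main thread performs the real load.
        if addr not in cache:
            misses += 1
            cache.add(addr)
    return misses


# 100 loads, each to a different 64-byte line (hypothetical addresses).
stream = [0x1000 + 64 * i for i in range(100)]
baseline = count_misses(stream)
with_helper = count_misses(stream, lookahead=4)
```

In this model only the first few loads miss once the helper runs ahead, because every later address has already been touched by the time the main thread reaches it.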

Post Pass Tool
Identify delinquent loads
  – A small number of loads account for the majority of cache misses
Slicing
Scheduling
Trigger identification
SSP-enabled binary generation
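The first step, picking delinquent loads, can be sketched as a ranking over a miss profile. This is a minimal illustration under assumptions: the profile format (a dict from load label to miss count), the 90% coverage threshold, and the function name `select_delinquent_loads` are all hypothetical and not taken from the slides.

```python
def select_delinquent_loads(miss_counts, coverage=0.9):
    """Return the fewest loads, ranked by miss count, whose misses
    reach the requested share of all profiled cache misses."""
    total = sum(miss_counts.values())
    if total == 0:
        return []
    selected, covered = [], 0
    # Rank from most to fewest misses; break ties by label for determinism.
    ranked = sorted(miss_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for load, misses in ranked:
        if covered >= coverage * total:
            break
        selected.append(load)
        covered += misses
    return selected


# Hypothetical profile: static load site -> number of cache misses.
profile = {
    "ld@0x4010": 700,
    "ld@0x4058": 220,
    "ld@0x40a0": 50,
    "ld@0x4100": 20,
    "ld@0x4148": 10,
}
delinquent = select_delinquent_loads(profile)
```

With this profile, two of the five loads already cover more than 90% of the misses, which is the skew the slide relies on.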

Slicing
Region based
  – Loop, loop body, or procedure
  – Graph with regions as nodes
  – Edges connect parent and child regions
Speculative slicing
  – To reduce the slice length
  – Memory disambiguation
  – Data speculation
  – Removing unexecuted paths and unrealised calls
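The idea of a slice for a delinquent load can be sketched as a backward walk over register dependences. The example assumes a single straight-line region, a made-up `Instr` record, and a profile flag `executed` that stands in for "removing unexecuted paths"; a real slicer would also handle control flow, memory dependences, and calls.

```python
from collections import namedtuple

# Hypothetical instruction record: label, destination register,
# source registers, and whether the profile ever saw it execute.
Instr = namedtuple("Instr", "name dest srcs executed")


def backward_slice(region, target_index):
    """Return (slice, live_ins) for the instruction at target_index.

    The slice lists, in program order, the instructions needed to
    compute the target's source registers; live_ins are registers
    that must be copied in from the main thread.
    """
    needed = set(region[target_index].srcs)
    picked = [region[target_index].name]
    for instr in reversed(region[:target_index]):
        if not instr.executed:
            continue  # speculative slicing: drop never-executed code
        if instr.dest in needed:
            needed.discard(instr.dest)
            needed.update(instr.srcs)
            picked.append(instr.name)
    picked.reverse()
    return picked, needed


region = [
    Instr("i1", "r1", ("r0",), True),   # r1 = load [r0]
    Instr("i2", "r5", ("r5",), True),   # r5 = r5 + 1 (unrelated counter)
    Instr("i3", "r2", ("r1",), True),   # r2 = r1 + 8
    Instr("i4", "r2", ("r9",), False),  # r2 = r9 (path never executed)
    Instr("i5", "r3", ("r2",), True),   # r3 = load [r2], the delinquent load
]
slice_instrs, live_ins = backward_slice(region, 4)
```

Here the slice keeps `i1`, `i3`, and `i5`, skips the unrelated counter update, and ignores the never-executed definition of `r2`, leaving `r0` as the only live-in.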

Scheduling
For Chain SP
  – Graph partitioning
      Forward dependencies
      Level sort
  – Schedule the resulting acyclic graph
  – Include synchronisation
  – Dependence reduction
      Loop rotation
        – Reduce loop-carried dependency
      Branch prediction
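The level-sort step can be sketched on a small dependence graph. This assumes the graph is already acyclic, that is, any dependence cycles have been collapsed into single nodes beforehand; the function name `level_sort` and the node labels are illustrative only.

```python
from collections import defaultdict, deque


def level_sort(nodes, edges):
    """Group the nodes of an acyclic dependence graph into levels.

    A node's level is one more than the highest level among its
    predecessors, so every forward dependence points to a later level.
    """
    succs = defaultdict(list)
    indegree = {n: 0 for n in nodes}
    for src, dst in edges:
        succs[src].append(dst)
        indegree[dst] += 1

    level = {n: 0 for n in nodes}
    ready = deque(n for n in nodes if indegree[n] == 0)
    processed = 0
    while ready:
        n = ready.popleft()
        processed += 1
        for s in succs[n]:
            level[s] = max(level[s], level[n] + 1)
            indegree[s] -= 1
            if indegree[s] == 0:
                ready.append(s)
    if processed != len(nodes):
        raise ValueError("dependence graph has a cycle; collapse it first")

    groups = [[] for _ in range(max(level.values(), default=-1) + 1)]
    for n in nodes:
        groups[level[n]].append(n)
    return groups


# A feeds B and C, which both feed D.
levels = level_sort(["A", "B", "C", "D"],
                    [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")])
```

Nodes in the same level have no dependences on one another, so a scheduler is free to order them within the level.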

Trigger Identification
Why can't you move far ahead?
  – Copying overhead vs. slack
SSP-enabled binary generation
Choose precomputation model
  – Chain or Basic?

Results