John-Paul Fryckman CSE 231: Paper Presentation 23 May 2002


Dynamic Hot Data Stream Prefetching for General-Purpose Programs
Paper by Chilimbi and Hirzel (PLDI 2002)

Why Prefetching?
- Memory latencies keep increasing
- A single thread rarely has enough ILP to hide memory latency
- Minimizing stall cycles due to cache misses raises IPC (i.e., performance)
- Idea: fetch data before it is needed!
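The IPC claim can be made concrete with a back-of-the-envelope calculation. All numbers below are assumed for illustration; they are not from the paper:

```python
# Hypothetical machine parameters (assumed, not from the paper).
base_cpi = 1.0      # CPI with a perfect cache
miss_rate = 0.02    # cache misses per instruction
miss_penalty = 300  # stall cycles per miss to main memory

# Without prefetching, every miss stalls the pipeline.
cpi_no_prefetch = base_cpi + miss_rate * miss_penalty          # 1.0 + 6.0 = 7.0

# If prefetching hides (say) 80% of the misses, only 20% still stall.
hidden = 0.8
cpi_prefetch = base_cpi + (1 - hidden) * miss_rate * miss_penalty  # 1.0 + 1.2 = 2.2

speedup = cpi_no_prefetch / cpi_prefetch
print(round(speedup, 2))  # 3.18
```

Even a modest miss rate dominates CPI when the miss penalty is hundreds of cycles, which is why hiding misses pays off so much.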

Target Hot Data
- Highly repetitious memory reference sequences ("hot data streams")
- Hot sequences are 15-20 objects long
- Hot data accounts for ~90% of program references and ~80% of cache misses

Why Dynamic?
- Dynamic prefetching yields a general-purpose solution
- Much is unknown at compile time:
  - Pointer-chasing code
  - Irregular strides

Dynamic Hot Data Stream Prefetching
- Profile memory references
- Detect hot data streams
- Create and insert prefetch triggers for these streams
- And repeat!

Profiling and Detection
- Must minimize profiling overhead: use sampling
- Periodically switch into instrumented code and collect traces
- Find hot data streams in the traces
- Generate context-free grammars for the hot sequences

DFSM Prefetching Engine
- Merge the context-free grammars into one large deterministic finite-state machine (DFSM)
- The DFSM detects a prefix of a hot sequence, then issues fetches for the rest of that stream's data
- Prefetching code is inserted at the matched points
- Presumably, states are removed once their streams are no longer hot
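The prefix-match-then-prefetch idea can be sketched as follows. The paper builds a single DFSM over all hot-stream prefixes; here a dictionary keyed on fixed-length prefixes plays that role, and all names are hypothetical:

```python
def build_prefix_matcher(hot_streams, prefix_len=2):
    """Map each hot stream's first prefix_len accesses to its suffix.
    The paper merges all prefixes into one deterministic FSM; a dict of
    prefix tuples is a simplified equivalent for illustration."""
    table = {}
    for stream in hot_streams:
        table[tuple(stream[:prefix_len])] = list(stream[prefix_len:])
    return table

def run(trace, table, prefix_len=2):
    """Slide a prefix_len window over the access trace; on a full prefix
    match, 'prefetch' the remainder of that hot stream."""
    prefetched = []
    window = []
    for addr in trace:
        window.append(addr)
        window = window[-prefix_len:]          # keep only the last prefix_len accesses
        suffix = table.get(tuple(window))
        if suffix:
            prefetched.extend(suffix)          # would issue prefetch instructions here
    return prefetched

table = build_prefix_matcher([('a', 'b', 'c', 'd')])
print(run(['x', 'a', 'b', 'z'], table))  # ['c', 'd']
```

Seeing 'a' then 'b' is enough to predict that 'c' and 'd' follow, so they are fetched early; this is exactly the prefix/suffix split the DFSM encodes.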

The Good and the Not So Good
Good:
- Even including overhead, 5-19% speedups
Questionable / open questions:
- How does it affect code whose accesses are already easy to predict?
- Worst-case DFSM state count is O(2^n); they did not study this. Is it reachable in practice?
- Do the prefetches always arrive in time?
- What about phase changes and cache pollution?