Stream-based Memory Specialization for General Purpose Processors

Slides:



Advertisements
Similar presentations
Tuning of Loop Cache Architectures to Programs in Embedded System Design Susan Cotterell and Frank Vahid Department of Computer Science and Engineering.
Advertisements

Computer Architecture Instruction-Level Parallel Processors
School of EECS, Peking University “Advanced Compiler Techniques” (Fall 2011) Parallelism & Locality Optimization.
Idempotent Code Generation: Implementation, Analysis, and Evaluation Marc de Kruijf ( ) Karthikeyan Sankaralingam CGO 2013, Shenzhen.
Breaking SIMD Shackles with an Exposed Flexible Microarchitecture and the Access Execute PDG Venkatraman Govindaraju, Tony Nowatzki, Karthikeyan Sankaralingam.
1 Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers By Sreemukha Kandlakunta Phani Shashank.
Data Marshaling for Multi-Core Architectures M. Aater Suleman Onur Mutlu Jose A. Joao Khubaib Yale N. Patt.
Zhiguo Ge, Weng-Fai Wong, and Hock-Beng Lim Proceedings of the Design, Automation, and Test in Europe Conference, 2007 (DATE’07) April /4/17.
UPC Microarchitectural Techniques to Exploit Repetitive Computations and Values Carlos Molina Clemente LECTURA DE TESIS, (Barcelona,14 de Diciembre de.
Department of Computer Science iGPU: Exception Support and Speculative Execution on GPUs Jaikrishnan Menon, Marc de Kruijf Karthikeyan Sankaralingam Vertical.
Helper Threads via Virtual Multithreading on an experimental Itanium 2 processor platform. Perry H Wang et. Al.
2/15/2006"Software-Hardware Cooperative Memory Disambiguation", Alok Garg, HPCA Software-Hardware Cooperative Memory Disambiguation Ruke Huang, Alok.
Stored Program Concept: The Hardware View
UPC Trace-Level Speculative Multithreaded Architecture Carlos Molina Universitat Rovira i Virgili – Tarragona, Spain Antonio González.
Interactions Between Compression and Prefetching in Chip Multiprocessors Alaa R. Alameldeen* David A. Wood Intel CorporationUniversity of Wisconsin-Madison.
The Vector-Thread Architecture Ronny Krashinsky, Chris Batten, Krste Asanović Computer Architecture Group MIT Laboratory for Computer Science
Korea Univ B-Fetch: Branch Prediction Directed Prefetching for In-Order Processors 컴퓨터 · 전파통신공학과 최병준 1 Computer Engineering and Systems Group.
Programmer's view on Computer Architecture by Istvan Haller.
DASX : Hardware Accelerator for Software Data Structures Snehasish Kumar, Naveen Vedula, Arrvindh Shriraman (Simon Fraser University), Vijayalakshmi Srinivasan.
Electrical and Computer Engineering University of Wisconsin - Madison Prefetching Using a Global History Buffer Kyle J. Nesbit and James E. Smith.
1/36 by Martin Labrecque How to Fake 1000 Registers Oehmke, Binkert, Mudge, Reinhart to appear in Micro 2005.
MadCache: A PC-aware Cache Insertion Policy Andrew Nere, Mitch Hayenga, and Mikko Lipasti PHARM Research Group University of Wisconsin – Madison June 20,
Architecture Selection of a Flexible DSP Core Using Re- configurable System Software July 18, 1998 Jong-Yeol Lee Department of Electrical Engineering,
Adaptive GPU Cache Bypassing Yingying Tian *, Sooraj Puthoor†, Joseph L. Greathouse†, Bradford M. Beckmann†, Daniel A. Jiménez * Texas A&M University *,
1 CPRE 585 Term Review Performance evaluation, ISA design, dynamically scheduled pipeline, and memory hierarchy.
1 November 11, 2015 A Massively Parallel, Hybrid Dataflow/von Neumann Architecture Yoav Etsion November 11, 2015.
Prefetching Techniques. 2 Reading Data prefetch mechanisms, Steven P. Vanderwiel, David J. Lilja, ACM Computing Surveys, Vol. 32, Issue 2 (June 2000)
On the Importance of Optimizing the Configuration of Stream Prefetches Ilya Ganusov Martin Burtscher Computer Systems Laboratory Cornell University.
PINTOS: An Execution Phase Based Optimization and Simulation Tool) PINTOS: An Execution Phase Based Optimization and Simulation Tool) Wei Hsu, Jinpyo Kim,
1 Lecture 5a: CPU architecture 101 boris.
Translation Lookaside Buffer
CS 352H: Computer Systems Architecture
Data Prefetching Smruti R. Sarangi.
Adaptive Cache Partitioning on a Composite Core
Ph.D. in Computer Science
Multiscalar Processors
Stash: Have Your Scratchpad and Cache it Too
Less is More: Leveraging Belady’s Algorithm with Demand-based Learning
5.2 Eleven Advanced Optimizations of Cache Performance
Architecture Background
Prefetch-Aware Cache Management for High Performance Caching
Flow Path Model of Superscalars
Ke Bai and Aviral Shrivastava Presented by Bryce Holton
DynaMOS: Dynamic Schedule Migration for Heterogeneous Cores
Gwangsun Kim Niladrish Chatterjee Arm, Inc. NVIDIA Mike O’Connor
Chen-Han Ho, Sung Jin Kim, and Karthikeyan Sankaralingam
Accelerating Dependent Cache Misses with an Enhanced Memory Controller
Milad Hashemi, Onur Mutlu, Yale N. Patt
Using Dead Blocks as a Virtual Victim Cache
Lecture 14: Reducing Cache Misses
Tony Nowatzki Vinay Gangadhar Karthikeyan Sankaralingam
EE 382N Guest Lecture Wish Branches
Address-Value Delta (AVD) Prediction
Adapted from slides by Sally McKee Cornell University
Hyesoon Kim Onur Mutlu Jared Stark* Yale N. Patt
Mengjia Yan† , Jiho Choi† , Dimitrios Skarlatos,
Data Prefetching Smruti R. Sarangi.
Analyzing Behavior Specialized Acceleration
José A. Joao* Onur Mutlu‡ Yale N. Patt*
15-740/ Computer Architecture Lecture 14: Prefetching
The Vector-Thread Architecture
ECE 463/563, Microprocessor Architecture, Prof. Eric Rotenberg
CSC3050 – Computer Architecture
Mapping DSP algorithms to a general purpose out-of-order processor
rePLay: A Hardware Framework for Dynamic Optimization
Cache Performance Improvements
Lois Orosa, Rodolfo Azevedo and Onur Mutlu
Introduction to Optimization
Prof. Onur Mutlu Carnegie Mellon University
Prof. Onur Mutlu Carnegie Mellon University
Presentation transcript:

Stream-based Memory Specialization for General Purpose Processors Zhengrong Wang Prof. Tony Nowatzki

Computation & Memory Specialization + - SIMD + / Dataflow + Core Acc. New ISA abstraction for certain computation pattern. - Mem Acc. b[0] a[b[0]] b[i] New ISA abstraction for memory access pattern? b[1] a[b[1]] a[b[i]] b[2] a[b[2]] … … Stream

Stream: A New ISA Memory Abstraction Stream: A decoupled memory access pattern. Higher level abstraction in ISA. Decouple memory access. Enable efficient prefetching. Leverage stream information in cache policies. 60% memory accesses  streams. 1.37× speedup over a traditional O3 processor. Core Acc. b[0] a[b[0]] b[i] b[1] a[b[1]] a[b[i]] b[2] a[b[2]] Mem Acc. … … Stream

Outline Insight & Opportunities. Stream Characteristics. Stream ISA Extension. Stream-Aware Policies. Microarchitecture Extension. Evaluation.

Outline Insight & Opportunities. Stream Characteristics. Stream ISA Extension. Stream-Aware Policies. Microarchitecture Extension. Evaluation.

Conventional Memory Abstraction while (i < N) { if (cond) v += a[i]; i++; } O3 Core L1 Cache L2 Cache if Overhead 3: Assumption on reuse. addr addr load if add br Overhead 2: Similar address computation/loads. Addr. load Miss Hit Addr. Miss Hit Overhead 1: Hard to prefetch with control flow. Val. add Resp. Resp. Val. br Resp. Resp.

Opportunity 1: Prefetch with Ctrl. Flow cfg(a[i]); while (i < N) { if (cond) v += a[i]; i++; } O3 Core L1 Cache L2 Cache Prefetch. cfg. SE. Miss Hit Before loop. if Resp. Resp. addr if Addr. load addr Hit Miss Hit Val. Addr. load Hit Miss Hit Opportunity 1: Prefetch with control flow. Overhead 1: Hard to prefetch with control flow. add br Resp. Val. Resp. Resp.

Opportunity 2: Semi-Binding Prefetch s_a = cfg(); while (i < N) { if (cond) v += s_a; i++; } O3 Core L1 Cache L2 Cache Prefetch. cfg. SE. Miss Hit Before loop. if FIFO Resp. Resp. Hit addr load Addr. Resp. Val. if Opportunity 2: Semi-binding prefetch. Overhead 2: Similar address computation/loads. add br Opportunity 1: Prefetch with control flow.

Opportunity 3: Stream-Aware Policies s_a = cfg(); while (i < N) { if (cond) v += s_a; } O3 Core L1 Cache L2 Cache Prefetch. cfg. SE. Miss Resp. Hit Before loop. if FIFO Resp. add br if Overhead 2: Repeated address computation/loads. Opportunity 2: Semi-binding prefetch. Opportunity 3: Better policies, e.g. bypass a cache level if no locality. Overhead 3: Assumption on reuse. Opportunity 1: Prefetch with control flow.

Related Work Decouple access execute. Prefetching. Outrider [ISCA’11], DeSC [MICRO’15], etc. Ours: New ISA abstraction for the access engine. Prefetching. Stride, IMP [MICRO’15], etc. Ours: Explicit access pattern in ISA. Cache bypassing policy. Counter-based [ICCD’05], LLC bypassing [ISCA’11], etc. Ours: Incorporate static stream information. Ssp is inspired by many other related and built on… Why is it better than DESC? Ours encodes information about the access patterns in a way that is accessable by the microarchitecture.

Outline Insight & Opportunities. Stream Characteristics. Stream ISA Extension. Stream-Aware Policies. Microarchitecture Extension. Evaluation.

Stream Characteristics – Stream Type Trace analysis on CortexSuite/SPEC CPU 2017. 51.49% affine, 10.19% indirect. Indirect streams can be as high as 40%. Support indirect stream.

Stream Characteristics – Stream Length 51% stream accesses from stream longer than 1k. Some benchmarks contain short streams. Support longer stream to capture long term behavior. Low overhead to support short streams.

Stream Characteristics – Control Flow 53% stream accesses from loop with control flow. Decouple from control flow.

Outline Insight & Opportunities. Stream Characteristics. Stream ISA Extension. Stream-Aware Policies. Microarchitecture Extension. Evaluation.

Stream ISA Extension – Basic Example Color to be matched. Step, stream, user instruction. IV stream and mem stream have the same color. Original C Code Stream Decoupled Pseudo Code Stream Dependence Graph int i = 0; while (i < N) { sum += a[i]; i++; } stream_cfg(s_i, s_a); while (s_i < N) { sum += s_a; stream_step(s_i); } stream_end(s_i, s_a); s_i s_a Iter. Step. User Pseudo-Reg Memory 0x408 … Memory 0x400 Memory 0x404 Stream a[i] i++ s_a 1 i++ 2

Stream ISA Extension – Control Flow Original C Code Stream Decoupled Pseudo Code Stream Dependence Graph int i = 0, j = 0; while (cond) { if (a[i] < b[j]) i++; else j++; } stream_cfg(s_i, s_a, s_j, s_b); while (cond) { if (s_a < s_b) stream_step(s_i); else stream_step(s_j); } stream_end(s_i, s_a, s_j, s_b); s_i s_a s_j s_b Iter. 1 Step User Pseudo-Reg Memory 0x408 … Memory 0x400 Memory 0x404 Stream a[i] i++ s_a 2 i++

Stream ISA Extension – Indirect Stream Original C Code Stream Decoupled Pseudo Code Stream Dependence Graph int i = 0; while (i < N) { sum += a[b[i]]; i++; } stream_cfg(s_i, s_a, s_b); while (s_i < N) { sum += s_a; stream_step(s_i); } stream_end(s_i, s_a, s_b); s_i s_b s_a Iter. 1 2 Step User Pseudo-Reg Memory 0x86c … Memory 0x888 Memory 0x668 a[b[i]] Pseudo-Reg Memory 0x408 … Memory 0x400 Memory 0x404 b[i] i++ i++ s_a s_b

Stream ISA Extension – ISA Semantic New architectural states: Stream configuration. Current iteration’s data. New speculation in ISA: Stream elements will be used. Streams are long. Maintain the memory order. Load  first use of the pseudo-register after configured/stepped. Store  every write to the pseudo-register.

Outline Insight & Opportunities. Stream Characteristics. Stream ISA Extension. Stream-Aware Policies. Microarchitecture Extension. Evaluation.

Stream-Aware Policies Rich Information Better Policies Compiler (ISA) /Hardware Memory Footprint Reuse Distance Modified? Conditional Used? Indirect … Prefetch Throttling Cache Replacement Cache Bypassing Sub-Line Transfer

Stream-Aware Policies – Cache Bypass Stream: Access Pattern  Precise Memory Footprint. Core while (i < N) while (j < N) while (k < N) sum += a[k][i] * b[k][j]; L1$ s_b s_a L2$ s_b s_a Reuse Dist. 𝑁 Reuse Dist. 𝑁×𝑁 a[N][N] b[N][N]

Outline Insight & Opportunities. Stream Characteristics. Stream ISA Extension. Stream-Aware Policies. Microarchitecture Extension. Evaluation.

Microarchitecture Memory 0x408 Memory 0x40c Memory 0x400 Memory 0x404 Stream Pseudo-Reg

Microarchitecture – Misspeculation Control misspeculated stream_step. Decrement the iteration map. No need to flush the FIFO and re-fetch data (decoupled) ! Other misspeculation. Revert the stream states, including stream FIFO. Memory fault delayed until the use of the element.

Outline Insight & Opportunities. Stream Characteristics. Stream ISA Extension. Microarchitecture Extension. Stream-Aware Policies. Evaluation.

Methodology Compiler in LLVM: Gem5 + McPAT simulation. 33 Benchmarks: Identify stream candidates. Generate stream configuration. Transform the program. Gem5 + McPAT simulation. 33 Benchmarks: SPEC2017 C/CPP benchmarks. CortexSuite. SimPoint: 10 million instructions’ simpoints. ~10 simpoints per benchmark.

Configurations Baseline. Stream Specialized Processor. Baseline O3. Unrealistic (highly) ideal configuration. Baseline. Baseline O3. Pf-Stride: Table-based prefetcher. Pf-Helper: SMT-based ideal helper thread. Requires no HW resources (ROB, etc.). Exactly 1k instruction before the main thread. Stream Specialized Processor. SSP-Non-Bind: Prefetch only. SSP-Semi-Bind: + Semi-binding prefetch. SSP-Cache-Aware: + Stream-Aware cache bypassing.

Results – Overall Performance In part due to decoupling and specialized cache policies, ssp can outperform …

Results – Semi-Binding Prefetching

Results – Design Space Interaction

Conclusion Stream as a new memory abstraction in ISA. ISA/Microarchitecture extension. Stream-aware cache bypassing. New paradigm of memory specialization. New direction for improving cache architectures. Combine memory and computation specialization.