IMP: Indirect Memory Prefetcher


IMP: Indirect Memory Prefetcher
Xiangyao Yu¹, Christopher J. Hughes², Nadathur Satish², Srinivas Devadas¹
¹CSAIL, MIT; ²Parallel Computing Lab, Intel

Sparse Operations
- Operations on sparse datasets (e.g., graphs, sparse matrices) are increasingly important.
- Reason #1: The rise of Big Data analytics. Many analytics problems are inherently irregular and are represented by graphs or sparse systems of equations.
- Reason #2: HPC. Big gains in dense applications have drawn attention to the lousy efficiency of sparse applications. Example: High-Performance Conjugate Gradient (HPCG) in the TOP500; ALU utilization is ~90% for DGEMM but only ~1%-5% for SpMV.

Challenge from Sparsity
- Indirect access pattern. Example: sparse matrix-vector multiplication (SpMV).
- The access pattern depends on the input data.
- Traditional hardware prefetchers cannot capture the pattern.
[Figure: matrix non-zeros × input vector]
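To make the indirect pattern concrete, here is a minimal CSR SpMV sketch (the array names row_ptr, col, val, x, y are illustrative, not taken from the slides). The index array col is streamed sequentially, but the load x[col[j]] jumps to a data-dependent address, which is exactly what defeats stride and stream prefetchers.

```c
/* Minimal CSR sparse matrix-vector multiply: y = M * x.
 * Illustrative sketch; array names are assumptions, not from the paper. */
void spmv_csr(int n, const int *row_ptr, const int *col,
              const double *val, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {            /* one row at a time */
        double sum = 0.0;
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; j++)
            sum += val[j] * x[col[j]];       /* x[col[j]]: the indirect,
                                                data-dependent access     */
        y[i] = sum;
    }
}
```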

Execution Time Impact
- 62% of execution time is spent waiting for indirect accesses.

Observation
- Indirect accesses are generally of the form A[B[i]].
- Sparse matrix-dense vector multiplication: A is the input vector, B holds the column ids of the non-zeros.
- Graph algorithms: A is the vertex list, B is the edge list.
- B is pre-computed and accessed contiguously; the unpredictable factor is the content of B.
- Key idea (indirect prefetch): on an access to B[i], read B[i + Δ] and prefetch A[B[i + Δ]].
- indirect_address = &A[B[i]] = coeff × B[i] + base_address
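A minimal software analog of this key idea, assuming GCC/Clang's __builtin_prefetch and a hypothetical prefetch distance D (IMP performs the same peek-ahead in hardware, without extra instructions):

```c
#include <stddef.h>

#define D 16  /* prefetch distance; an assumed value for illustration */

/* Sum A[B[i]]: while consuming B[i], peek ahead in the index stream
 * and prefetch the indirect target A[B[i + D]]. Sketch only. */
double sum_indirect(const double *A, const int *B, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + D < n)                        /* read B[i + D] early ...  */
            __builtin_prefetch(&A[B[i + D]],  /* ... prefetch A[B[i + D]] */
                               0 /* read */, 1 /* low temporal locality */);
        sum += A[B[i]];                       /* &A[B[i]] = coeff*B[i]+base */
    }
    return sum;
}
```

Doing this in software costs extra instructions per iteration, which is the overhead the SW-prefetch comparison later quantifies.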

IMP Architecture
- Prefetch Table (PT): sits alongside the cache and watches the cache access stream; it holds a Stream Table and an Indirect Table, and an Address Generator produces the indirect prefetch address.
- Indirect prefetching: &A[B[i + Δ]] = coeff × B[i + Δ] + base_address
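One possible shape for a Prefetch Table entry and its Address Generator, sketched from the block diagram above; field names and widths are assumptions, not the paper's actual layout:

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch of a PT entry: a stream part (tracking the index array B) plus
 * an indirect part (the pattern learned by the IPD). Assumed layout. */
typedef struct {
    uint64_t stream_pc;       /* PC of the load streaming through B      */
    uint64_t last_index_addr; /* last address touched in B               */
    bool     indirect_valid;  /* set once the IPD confirms a pattern     */
    int8_t   coeff_shift;     /* log2(coeff); negative encodes coeff=1/8 */
    uint64_t base_addr;       /* base address of A                       */
} pt_entry_t;

/* Address Generator: prefetch address for an index value read ahead,
 * i.e. coeff * B[i + delta] + base_address. */
static inline uint64_t indirect_prefetch_addr(const pt_entry_t *e,
                                              int64_t index_ahead)
{
    int64_t scaled = (e->coeff_shift >= 0)
                         ? (index_ahead << e->coeff_shift)
                         : (index_ahead >> -e->coeff_shift);
    return e->base_addr + (uint64_t)scaled;
}
```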

Indirect Pattern Detector (IPD)
- Observes cache misses and detects indirect patterns for stream accesses.
- Writes detected indirect patterns back to the Prefetch Table.
- Indirect pattern detection: learn coeff and base_address for A[B[i]].

Indirect Pattern Detection
- &A[B[i]] = coeff × B[i] + base_address
- Two (&A[B[i]], B[i]) pairs are needed to solve for coeff and base_address.
- Candidates for &A[B[i]]: the first several cache misses after accessing B[i]. Programs tend to perform the indirect access soon after reading the index, and there is no need to prefetch for cache hits.
- Restrict coeff to a small set of powers of 2: 4, 8, 16, or 1/8 (the 1/8 case handles bit vectors). The multiply becomes a shift, and the small set of values allows exhaustive search.

Indirect Pattern Detector (IPD): Example
&A[B[i]] = coeff × B[i] + base_address. Given B[i] and B[i+1], for every following miss, compute base_address for every coeff.

Events:
1. Read idx1 (= 1)
2. Miss addr = 0x100
3. Miss addr = 0x120
4. Read idx2 (= 16)
5. Miss addr = 0x13C

IPD entry "X" (ptr to stream PF entry; index1 = 1, index2 = 16), base_address array:

coeff | from miss 0x100 | from miss 0x120 | from miss 0x13C
  4   |      0xFC       |      0x11C      |      0xFC
  8   |      0xF8       |      0x118      |      0xBC
 16   |      0xF0       |      0x110      |      0x3C
 1/8  |      0x100      |      0x120      |      0x13A

Match found! coeff = 4, base_address = 0xFC (0x100 − 4×1 = 0x13C − 4×16 = 0xFC).
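The exhaustive search is small enough to sketch directly. The standalone program below replays the example above (illustrative code, not the paper's hardware): for each candidate coefficient it derives the base address implied by each miss after idx1 and checks it against the base implied by the miss after idx2; only coeff = 4 agrees, yielding base 0xFC.

```c
#include <stdint.h>
#include <stdio.h>

/* coeff encoded as a shift amount: 2, 3, 4 => 4, 8, 16; -3 => 1/8 */
static int64_t scale(int64_t idx, int shift)
{
    return shift >= 0 ? (idx << shift) : (idx >> -shift);
}

int main(void)
{
    const int shifts[] = {2, 3, 4, -3};
    const int64_t idx1 = 1, idx2 = 16;
    const int64_t miss1[] = {0x100, 0x120}; /* misses after reading idx1 */
    const int64_t miss2 = 0x13C;            /* first miss after idx2     */

    for (int s = 0; s < 4; s++) {
        int64_t base2 = miss2 - scale(idx2, shifts[s]);
        for (int m = 0; m < 2; m++) {
            int64_t base1 = miss1[m] - scale(idx1, shifts[s]);
            if (base1 == base2)
                printf("match: coeff=2^%d, base=0x%llX\n", shifts[s],
                       (unsigned long long)base2);
            /* prints: match: coeff=2^2, base=0xFC */
        }
    }
    return 0;
}
```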

Other Optimizations
- Multi-way indirection: A[B[i]] and C[B[i]].
- Multi-level indirection: A[B[C[i]]].
- Load partial cachelines for indirect accesses: they have little spatial locality, and IMP can help predict the access granularity.
- See the paper for details; illustrative loops for the two indirection patterns follow.
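Sketches of the two extended patterns (hypothetical function and array names; how IMP tracks them in hardware is in the paper):

```c
/* Multi-way indirection: one index stream B drives two target arrays. */
void multi_way(double *out, const double *A, const double *C,
               const int *B, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = A[B[i]] + C[B[i]];   /* B[i] yields two prefetch targets */
}

/* Multi-level indirection: the index itself is loaded indirectly. */
double multi_level(const double *A, const int *B, const int *C, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += A[B[C[i]]];            /* chain: C[i+D] -> B[...] -> A[...] */
    return sum;
}
```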

Evaluation
Graphite simulator. System configuration:
- Cores: 64, in-order, single-issue
- L1 cache: 32 KB D-cache, 16 KB I-cache
- Shared L2 cache: 256 KB per tile (16 MB total)
- Memory: DRAMSim2
- On-die network: mesh, 2 cycles/hop, 64-bit flits

IMP configuration:
- Prefetch table: 16 entries, max prefetch distance = 16; storage overhead under 256 bytes; energy overhead under 3%
- Indirect pattern detector: 4 entries; storage overhead 450 bytes

Performance vs. Perfect Prefetcher
- IMP gives a 1.56x speedup on average.
- Software prefetching incurs 29% (up to 2x) more instructions.

Summary
- Indirect access (A[B[i]]) is a key pattern for workloads operating on sparse data structures.
- IMP can detect and prefetch these accesses at low hardware cost.
- IMP provides significant performance benefits.

Backup Slides
- Partial Cacheline Accessing
- Coverage
- Accuracy
- Latency vs. Perfect Prefetch