Trading Cache Hit Rate for Memory Performance Wei Ding, Mahmut Kandemir, Diana Guttman, Adwait Jog, Chita R. Das, Praveen Yedlapalli The Pennsylvania State.

Presentation transcript:

Trading Cache Hit Rate for Memory Performance Wei Ding, Mahmut Kandemir, Diana Guttman, Adwait Jog, Chita R. Das, Praveen Yedlapalli The Pennsylvania State University

Summary
Problem
– Most data locality optimizations target cache locality exclusively. "Row-buffer locality" is also important.
– The problem is especially challenging in the case of irregular programs (sparse data).
Proposal
– A compiler-runtime cooperative data layout optimization that improves row-buffer locality in irregular programs
– ~17% improvement in overall application performance

Outline
– Background
– Motivation
– Conservative Layout
– Fine-grain Layout
– Related Work
– Evaluation
– Conclusion

DRAM Organization
[Figure: labeled components are Processor, MC, Channel, DIMM, Rank, DRAM chip, Bank, and Row Buffer; the figure illustrates row-buffer locality.]

Irregular Programs
Real X(num_nodes), Y(num_edges);
Integer IA(num_edges, 2);
for (t = 1; t < T; t++) {
  /* If it is time to update the interaction list */
  for (i = 0; i < num_edges; i++) {
    X(IA(i, 1)) = X(IA(i, 1)) + Y(i);
    X(IA(i, 2)) = X(IA(i, 2)) - Y(i);
  }
}
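The loop above can be written in runnable form. This is an illustrative Python sketch, not the authors' code: the function name `update_nodes`, the zero-based indexing, and the tiny input are assumptions made for the example.

```python
def update_nodes(x, y, ia, steps):
    """Apply the edge-based update from the slide `steps` times.

    x:  per-node values
    y:  per-edge values
    ia: one (node_a, node_b) pair per edge, zero-based (an assumption;
        the slide uses one-based column indices)
    """
    x = list(x)  # work on a copy so the caller's data is untouched
    for _ in range(steps):
        for i, (a, b) in enumerate(ia):
            # Indirect accesses: which elements of x are touched depends
            # on the contents of ia, so the pattern is known only at run time.
            x[a] += y[i]
            x[b] -= y[i]
    return x


if __name__ == "__main__":
    print(update_nodes([0.0, 0.0, 0.0], [1.0, 2.0], [(0, 2), (2, 1)], steps=1))
```

Because the subscripts of X come from IA, a compiler cannot determine the access order statically, which is why irregular programs are handled with a run-time component.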

Inspector/Executor model
Used for identifying parallelism or improving cache locality

/* Executor */
Real X(num_nodes), Y(num_edges);
Real X'(num_nodes), Y'(num_edges);
Integer IA(num_edges, 2);
for (t = 1; t < T; t++) {
  X', Y' = Trans(X, Y);
  for (i = 0; i < num_edges; i++) {
    X'(IA(i, 1)) = X'(IA(i, 1)) + Y'(i);
    X'(IA(i, 2)) = X'(IA(i, 2)) - Y'(i);
  }
}

/* Inspector */
Trans(X, Y):
  for (i = 0; i < num_edges; i++) {
    /* data reordering algorithms */
  }
  return (X', Y')
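A minimal Python sketch of the inspector/executor split follows. It is not the reordering proposed in the talk: the inspector here uses first-touch ordering as a stand-in, it reorders only X and remaps the index array (the slide's Trans also returns Y'), and the names `inspector`, `executor`, and `new_index` are hypothetical.

```python
def inspector(x, ia):
    """Scan the index array once and build a new layout for x.

    Stand-in policy (an assumption): nodes are placed in the order in
    which the index array first touches them.
    """
    new_index = {}
    for a, b in ia:
        for node in (a, b):
            if node not in new_index:
                new_index[node] = len(new_index)
    # Nodes never referenced by ia are appended in their original order.
    for node in range(len(x)):
        if node not in new_index:
            new_index[node] = len(new_index)
    x_new = [0.0] * len(x)
    for old, new in new_index.items():
        x_new[new] = x[old]
    ia_new = [(new_index[a], new_index[b]) for a, b in ia]
    return x_new, ia_new, new_index


def executor(x_new, y, ia_new):
    """Run the original computation on the reordered data."""
    for i, (a, b) in enumerate(ia_new):
        x_new[a] += y[i]
        x_new[b] -= y[i]
    return x_new


if __name__ == "__main__":
    x = [10.0, 20.0, 30.0, 40.0]
    y = [1.0, 2.0]
    ia = [(3, 1), (1, 0)]
    x_new, ia_new, new_index = inspector(x, ia)
    executor(x_new, y, ia_new)
    # Map the results back to the original node numbering.
    print([x_new[new_index[n]] for n in range(len(x))])
```

The executor computes the same values as the untransformed loop; only the placement of the data in memory changes.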


Row-buffer Locality
Prior works that target irregular applications focus exclusively on improving cache locality
– No efforts to improve row-buffer locality
Typical latencies (based on AMD architecture)
– Last Level Cache (LLC) hit = 28 cycles
– Row-buffer hit = 90 cycles
– Row-buffer miss = 350 cycles
Application performance is dictated not only by the cache hit rate, but also by the row-buffer hit rate.
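The effect of these latencies can be illustrated with a simple average-latency calculation. This is a back-of-the-envelope sketch, not a model from the talk: it assumes every LLC miss goes to DRAM and costs either the row-buffer hit or the row-buffer miss latency, and the hit rates in the example are hypothetical.

```python
# Latencies in cycles, taken from the slide.
LLC_HIT = 28
RB_HIT = 90
RB_MISS = 350


def avg_latency(cache_hit_rate, rb_hit_rate):
    """Average cycles per access under the simplified two-level model."""
    dram = rb_hit_rate * RB_HIT + (1.0 - rb_hit_rate) * RB_MISS
    return cache_hit_rate * LLC_HIT + (1.0 - cache_hit_rate) * dram


if __name__ == "__main__":
    # Hypothetical layouts: the second gives up some cache hits
    # in exchange for a higher row-buffer hit rate.
    print(avg_latency(0.80, 0.50))
    print(avg_latency(0.78, 0.80))
```

Under these assumed rates the first layout averages about 66.4 cycles and the second about 53.1 cycles, so a slightly lower cache hit rate can still yield a lower average latency.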

Example
Layout (b) eliminates the row-buffer miss caused by accessing 'y', assuming this move does not cause any additional cache misses.
Layout (c) eliminates the row-buffer misses caused by accessing 'v', even at the cost of an additional cache miss.


Notations
– Seq: the sequence of data elements obtained by traversing the index array
– α_x: the access to a particular data element x in Seq
– time(α_x): the "logical time stamp" of x in Seq
– β_x: the memory block where data element x resides
– α_x': the "most recent access" to β_x before α_x
– Caches(β_x): the set of cache blocks to which β_x can be mapped in a k-way set-associative cache

Definition
Block Distance: Given Caches(β_x) = Caches(β_y), the block distance between α_x and α_y, denoted as Δ(α_y, α_x), is the number of "distinct" memory blocks that are mapped to Caches(β_x) and accessed during the time period between time(α_x) and time(α_y).
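The definition can be made concrete with a short Python sketch. Several details are assumptions, since the slide does not fix them: elements are integers, a block holds `elems_per_block` consecutive elements, blocks map to cache sets by modulo, logical time is the position in Seq, and "between" is taken to exclude the two endpoint accesses.

```python
def block_distance(seq, i, j, elems_per_block, num_sets):
    """Count distinct memory blocks that map to the same cache set as
    seq[i] and are accessed strictly between logical times i and j."""

    def block(x):
        return x // elems_per_block  # assumed element-to-block mapping

    def cache_set(b):
        return b % num_sets  # assumed block-to-set mapping

    target = cache_set(block(seq[i]))
    if cache_set(block(seq[j])) != target:
        raise ValueError("both accesses must map to the same cache set")
    between = {block(x) for x in seq[i + 1:j]}
    return sum(1 for b in between if cache_set(b) == target)


if __name__ == "__main__":
    seq = [0, 8, 16, 9, 4, 1]
    print(block_distance(seq, 0, 5, elems_per_block=4, num_sets=2))
```

In the example, elements 8 and 9 share one block, so the accesses between positions 0 and 5 touch three distinct blocks, two of which map to the same set as the endpoints.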

Lemma

Conservative Layout
Objective:
– Increase the row-buffer hit rate
– Without affecting the cache performance
Algorithm
1. Identifying the locality sets
2. Constructing the interference graph
3. Assigning rows in memory

1. Identifying the Locality Sets

2. Constructing the Interference Graph
Each node represents a locality set.
If α_x and α_y are two accesses that incur successive cache misses in Seq, and x and y are located in different rows, then an edge is added between the locality sets of x and y
– The weight on this edge represents the total number of such α_x and α_y pairs
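The edge rule above can be sketched in Python as follows. The inputs are assumptions for illustration: `misses` is the ordered list of elements whose accesses missed in the cache (obtaining it would require a cache model), and `row_of` and `set_of` are hypothetical lookup tables for each element's memory row and locality set. Skipping pairs that fall in the same locality set is also an assumption; the slide does not address that case.

```python
from collections import Counter


def build_interference_graph(misses, row_of, set_of):
    """Return a Counter mapping an unordered pair of locality sets to
    the number of successive cache-miss pairs that hit different rows."""
    edges = Counter()
    for x, y in zip(misses, misses[1:]):
        if row_of[x] == row_of[y]:
            continue  # same row: the slide's rule adds no edge
        sx, sy = set_of[x], set_of[y]
        if sx == sy:
            continue  # assumption: no self-edges within one locality set
        edges[tuple(sorted((sx, sy)))] += 1
    return edges


if __name__ == "__main__":
    misses = ["a", "c", "a", "d", "b"]
    row_of = {"a": 0, "b": 0, "c": 1, "d": 2}
    set_of = {"a": "S0", "b": "S0", "c": "S1", "d": "S2"}
    print(dict(build_interference_graph(misses, row_of, set_of)))
```

A heavy edge means that two locality sets are often accessed back to back from different rows, so placing them in the same row is likely to convert row-buffer misses into hits.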

3. Assigning Rows in Memory
Sort the edges in the interference graph in decreasing order of weight.
Assign the same row to the locality sets connected by the edge with the largest weight.
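A greedy version of this step might look like the Python sketch below. The slide gives only the sort-and-merge idea, so the rest is assumed: a row holds at most `sets_per_row` locality sets (a hypothetical capacity parameter), ties are broken by edge name, and any locality set left unplaced gets a row of its own.

```python
def assign_rows(edges, all_sets, sets_per_row):
    """Greedily co-locate locality sets joined by the heaviest edges.

    edges:    dict mapping (set_u, set_v) to an edge weight
    all_sets: every locality set that needs a row
    Returns a dict mapping each locality set to a row number.
    """
    row_of = {}
    members = {}  # row number -> locality sets placed in that row
    next_row = 0
    ordered = sorted(edges.items(), key=lambda kv: (-kv[1], kv[0]))
    for (u, v), _weight in ordered:
        if u in row_of and v in row_of:
            continue  # both already placed by heavier edges
        if u not in row_of and v not in row_of:
            if sets_per_row >= 2:
                row_of[u] = row_of[v] = next_row
                members[next_row] = [u, v]
                next_row += 1
            continue
        placed, free = (u, v) if u in row_of else (v, u)
        row = row_of[placed]
        if len(members[row]) < sets_per_row:
            row_of[free] = row
            members[row].append(free)
    for s in all_sets:  # assumption: leftovers each get a fresh row
        if s not in row_of:
            row_of[s] = next_row
            members[next_row] = [s]
            next_row += 1
    return row_of


if __name__ == "__main__":
    edges = {("S0", "S1"): 5, ("S1", "S2"): 3, ("S2", "S3"): 1}
    print(assign_rows(edges, ["S0", "S1", "S2", "S3"], sets_per_row=2))
```

With a capacity of two sets per row, the heaviest edge places S0 and S1 together, and S2 is paired with S3 because the row holding S1 is already full.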


Fine-grain Layout

Algorithm
1. Constructing the Interference Graph
2. Constructing the Locality Graphs
3. Finding Partitions
4. Assigning Rows in Memory

1. Constructing the Interference Graph
Each node in the interference graph represents a data element.
If α_x and α_y are two accesses that incur successive cache misses in Seq, and x and y are located in different rows, then we set up an edge between x and y
– The weight on the edge represents the number of such α_x and α_y pairs

2. Constructing the Locality Graphs

3. Finding Partitions

4. Assigning Rows in Memory
Each partition is assigned to a memory block in a row.

Example

Related Work
Inspector/Executor model
– Typically used for parallelism (Lawrence Rauchwerger [1]) and cache locality (Chen Ding [2])
– We use it to improve row-buffer locality, and our approach is complementary to them
Row-buffer locality
– Compiler approach: Mary W. Hall [3]
– Hardware approach: Al Davis [4]
– Our work specifically targets irregular applications


Evaluation
Platform (modeled in GEM5)
CPU:    12 cores; 2.6 GHz; 4 memory controllers
Caches: 64KB per core L1 (3 cycles); 512KB per core L2 (12 cycles); 12MB per socket shared L3 (28 cycles)
Memory: DDR3-1866; 8 banks per channel; 8KB row-buffers

Benchmarks
Name     Input Size   L3 Miss Rate   RB Miss Rate
PSST     427.6 MB     18.1%          29.6%
PaSTiX   511.6 MB     24.3%          41.7%
SSIF     129.3 MB     13.7%          24.4%
PPS      738.2 MB     21.4%          33.1%
REACT    1.2 GB       28.6%          46.9%

Simulation Results
[Figure: chart with data labels 6%, 15%, 27%, 12%, 17%]


Conclusion
Exploiting row-buffer locality is critical for application performance.
We proposed two compiler-directed data layout organizations with the goal of improving row-buffer locality in irregular applications
– Without affecting cache performance (conservative layout)
– Trading cache performance for row-buffer locality (fine-grain layout)

Thank You
Questions?

References
1. "Improving Cache Performance in Dynamic Applications through Data and Computation Reorganization at Run Time", ICPP
2. "Sensitivity Analysis for Automatic Parallelization on Multi-Cores", ICS
3. "A compiler algorithm for exploiting page-mode memory access in embedded dram devices", MSP '02
4. "Micro-Pages: Increasing DRAM Efficiency with Locality-Aware Data Placement", ASPLOS

BACKUP SLIDES

Results with AMD-based system

Memory Scheduling