Near-Optimal Cache Block Placement with Reactive Nonuniform Cache Architectures. Nikos Hardavellas, Northwestern University. Team: M. Ferdman, B. Falsafi, A. Ailamaki (Northwestern, Carnegie Mellon, EPFL)
Moore’s Law Is Alive And Well. [Figure: 90nm transistor (Intel, 2005) alongside Swine Flu A/H1N1 virus (CDC); process roadmap: 65nm 2007, 45nm 2010, 32nm 2013, 22nm 2016, 16nm 2019] Device scaling continues for at least another 10 years. © Hardavellas
Moore’s Law Is Alive And Well. The good days ended Nov. 2002 [Yelick09]. The “new” Moore’s Law: 2x cores with every generation. On-chip cache grows commensurately to supply all cores with data. © Hardavellas
Larger Caches Are Slower. [Figure: access latency vs. cache size; large caches have slow access] Increasing access latency forces caches to be distributed. © Hardavellas
Cache Design Trends. As caches become bigger, they get slower, so split the cache into smaller “slices”: balance cache slice access latency with network latency. © Hardavellas
Modern Caches: Distributed. [Figure: tiled multicore, one core plus one L2 slice per tile] Split the cache into “slices” and distribute them across the die. © Hardavellas
Data Placement Determines Performance. [Figure: tiled multicore; the latency of reaching a cache slice depends on its distance from the requesting core] Goal: place data on chip close to where they are used. © Hardavellas
Our proposal: R-NUCA (Reactive Nonuniform Cache Architecture). Data may exhibit arbitrarily complex behaviors, but few that matter! Learn the behaviors at run time and exploit their characteristics: make the common case fast and the rare case correct; resolve conflicting requirements. © Hardavellas
Reactive Nonuniform Cache Architecture [Hardavellas et al, ISCA 2009] [Hardavellas et al, IEEE Micro Top Picks 2010]. Cache accesses can be classified at run time, and each class is amenable to a different placement. Per-class block placement: simple, scalable, transparent; no need for HW coherence mechanisms at the LLC. Up to 32% speedup (17% on average); within 5% on average of an ideal cache organization. Rotational interleaving: data replication and fast single-probe lookup. © Hardavellas
Outline Introduction Why do Cache Accesses Matter? Access Classification and Block Placement Reactive NUCA Mechanisms Evaluation Conclusion © Hardavellas
Cache accesses dominate execution [Hardavellas et al, CIDR 2007]. [Figure: execution-time breakdown (lower is better), 4-core CMP, DSS: TPC-H on DB2, 1GB database, compared against an ideal cache] Bottleneck shifts from memory to L2-hit stalls. © Hardavellas
How much do we lose? [Figure: throughput (higher is better), 4-core CMP, DSS: TPC-H on DB2, 1GB database] We lose half the potential throughput. © Hardavellas
Outline Introduction Why do Cache Accesses Matter? Access Classification and Block Placement Reactive NUCA Mechanisms Evaluation Conclusion © Hardavellas
Terminology: Data Types. [Figure: three access patterns] Private: read or written by a single core. Shared Read-Only: read by multiple cores. Shared Read-Write: read and written by multiple cores. © Hardavellas
Distributed shared L2. [Figure: tiled multicore; blocks address-interleaved across slices, home slice = address mod <#slices>] Unique location for any block (private or shared). Maximum capacity, but slow access (30+ cycles). © Hardavellas
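To make the interleaving concrete, here is a minimal C sketch of the home-slice computation for a distributed shared L2; the 64-byte block size and 16-slice count are illustrative assumptions, not parameters from the talk.

```c
#include <stdint.h>

#define BLOCK_BITS 6   /* assumed 64-byte cache blocks */
#define N_SLICES   16  /* assumed 16 tiles/slices      */

/* Home slice of a physical address in a distributed shared L2:
 * strip the block offset, then take the block address modulo the
 * slice count ("address mod <#slices>" in the slide). */
static inline unsigned shared_home_slice(uint64_t paddr) {
    return (unsigned)((paddr >> BLOCK_BITS) % N_SLICES);
}
```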
Distributed private L2. [Figure: tiled multicore] On every access, allocate the data at the local L2 slice. Private data: allocated at the local L2 slice, giving fast access to core-private data. © Hardavellas
Distributed private L2: shared-RO access. [Figure: tiled multicore] On every access, allocate the data at the local L2 slice. Shared read-only data: replicated across L2 slices, wasting capacity due to replication. © Hardavellas
Distributed private L2: shared-RW access. [Figure: tiled multicore with a directory tile] On every access, allocate the data at the local L2 slice. Shared read-write data: coherence maintained via indirection through a directory. Slow for shared read-write; wastes capacity (directory overhead) and bandwidth. © Hardavellas
Conventional Multi-Core Caches. Shared: address-interleave blocks; high capacity, but slow access. Private: each block cached locally; fast (local) access, but low capacity (replicas) and coherence via indirection (distributed directory). We want: high capacity (shared) + fast access (private). © Hardavellas
Where to Place the Data? Close to where they are used! Accessed by a single core: migrate locally. Accessed by many cores: replicate (?): if read-only, replication is OK; if read-write, coherence becomes a problem. Low reuse: evenly distribute across sharers. [Figure: placement regions on a read-write vs. number-of-sharers plane: migrate, replicate, share] © Hardavellas
Methodology. Flexus: full-system cycle-accurate timing simulation [Hardavellas et al, SIGMETRICS-PER 2004; Wenisch et al, IEEE Micro 2006]. Workloads: OLTP: TPC-C 3.0, 100 WH, IBM DB2 v8 and Oracle 10g; DSS: TPC-H Qry 6, 8, 13; SPECweb99 on Apache 2.0; Multiprogrammed: SPEC2K; Scientific: em3d. Model parameters: tiled, LLC = L2; server/scientific workloads: 16 cores, 1MB/core; multiprogrammed workloads: 8 cores, 3MB/core; OoO, 2GHz, 96-entry ROB; folded 2D torus, 2-cycle router, 1-cycle link; 45ns memory. © Hardavellas
Cache Access Classification Example. [Figure: bubble plot; each bubble is the set of cache blocks shared by x cores, bubble size proportional to % of L2 accesses, y-axis: % of blocks in the bubble that are read-write] © Hardavellas
Cache Access Clustering. [Figure: % read-write blocks vs. number of sharers for Server and Scientific/MP apps; private blocks cluster at one sharer (migrate locally), shared read-write blocks at many sharers (share, address-interleave), read-only blocks at many sharers (replicate)] Accesses naturally form 3 clusters. © Hardavellas
Instruction Replication. Instruction working set too large for one cache slice. [Figure: tiled multicore divided into clusters of neighboring slices] Distribute instructions within a cluster of neighbors, replicate across clusters. © Hardavellas
Reactive NUCA in a nutshell. To place cache blocks, we first need to classify them. Classify accesses, then place per class: private data, like the private scheme (migrate); shared data, like the shared scheme (interleave); instructions, controlled replication (middle ground), as sketched below. © Hardavellas
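As a sketch of the per-class policy (not the actual hardware), the placement decision can be viewed as a three-way switch. `shared_home_slice` and `cluster_slice` are hypothetical stand-ins for the address interleaving and rotational interleaving mechanisms described later; the block size, slice count, and cluster grouping are assumptions.

```c
#include <stdint.h>

#define BLOCK_BITS 6   /* assumed 64-byte blocks */
#define N_SLICES   16  /* assumed slice count    */

/* Toy stand-ins for the two interleaving functions: address
 * interleaving across all slices, and interleaving within a small
 * cluster of neighboring slices (rotational interleaving). */
static unsigned shared_home_slice(uint64_t paddr) {
    return (unsigned)((paddr >> BLOCK_BITS) % N_SLICES);
}
static unsigned cluster_slice(uint64_t paddr, unsigned local_slice) {
    /* toy: pick one of 4 neighboring slices by two address bits */
    return (local_slice & ~3u) | ((unsigned)(paddr >> BLOCK_BITS) & 3u);
}

typedef enum { CLASS_PRIVATE, CLASS_SHARED_DATA, CLASS_INSTRUCTION } access_class_t;

/* R-NUCA in a nutshell: one placement rule per access class. */
unsigned place_block(access_class_t cls, uint64_t paddr, unsigned local_slice) {
    switch (cls) {
    case CLASS_PRIVATE:      /* like the private scheme: keep it at the core */
        return local_slice;
    case CLASS_SHARED_DATA:  /* like the shared scheme: address-interleave   */
        return shared_home_slice(paddr);
    case CLASS_INSTRUCTION:  /* middle ground: interleave within a cluster   */
        return cluster_slice(paddr, local_slice);
    }
    return local_slice;
}
```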
Outline Introduction Access Classification and Block Placement Reactive NUCA Mechanisms Evaluation Conclusion © Hardavellas
Classification Granularity. Per-block classification: high area/power overhead (cuts L2 size in half), high latency (indirection through a directory). Per-page classification (utilizing the OS page table): a persistent structure the core already consults on every access (via the TLB), so it reuses existing SW/HW structures and events; page classification is accurate (<0.5% error). Classify entire data pages, with the page table/TLB for bookkeeping. © Hardavellas
Classification Mechanisms. Instruction classification: all accesses from the L1-I (per block). Data classification: private/shared, per page, at TLB-miss time. [Figure: on the 1st access (core i, TLB miss), the OS marks page A private to i; on a later access by another core j (TLB miss), the OS re-marks A as shared] Bookkeeping through the OS page table and TLB; a sketch follows. © Hardavellas
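A minimal sketch of the OS-side bookkeeping at TLB-miss time, assuming a hypothetical `page_info_t` record in the page table. The TLB shootdown and block eviction on the private-to-shared transition are elided here (see the backup timeline slide).

```c
#include <stdint.h>

typedef enum { PAGE_INVALID, PAGE_PRIVATE, PAGE_SHARED } page_class_t;

typedef struct {
    page_class_t cls;    /* current classification of the page          */
    unsigned     owner;  /* first-touch core/slice id, valid if PRIVATE */
} page_info_t;

/* Called from the OS TLB-miss handler when core `core_id` touches a
 * data page. First touch marks the page private to the toucher; a
 * touch by any other core re-classifies it as shared. */
static void classify_on_tlb_miss(page_info_t *page, unsigned core_id) {
    switch (page->cls) {
    case PAGE_INVALID:                 /* 1st access: private to core i */
        page->cls   = PAGE_PRIVATE;
        page->owner = core_id;
        break;
    case PAGE_PRIVATE:
        if (page->owner != core_id)    /* access by another core j      */
            page->cls = PAGE_SHARED;   /* re-classify the page as shared */
        break;
    case PAGE_SHARED:                  /* stays shared                   */
        break;
    }
}
```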
Page Table and TLB Extensions. The core accesses the page table on every access anyway (via the TLB), so we pass information from the “directory” to the core using already existing SW/HW structures and events. TLB entry: P/S (1 bit), vpage, ppage. Page table entry: P/S/I (2 bits), L2 id (log2(n) bits), vpage, ppage. Page granularity allows simple and practical HW. © Hardavellas
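One plausible C encoding of these entries, matching the slide's bit budget (1 added bit in the TLB entry; 2 bits plus log2(n) bits in the page table entry); all other field widths are assumptions for illustration.

```c
#include <stdint.h>

/* Extended TLB entry: one added P/S bit, per the slide. */
struct tlb_entry {
    uint64_t ppage  : 40;  /* physical page number             */
    uint64_t vtag   : 23;  /* virtual-page tag (width assumed) */
    uint64_t shared : 1;   /* P/S: 0 = private, 1 = shared     */
};

/* Extended page table entry: 2 classification bits (P/S/I) plus a
 * log2(n)-bit L2 slice id; 5 bits shown for an assumed n = 32. */
struct pte {
    uint64_t ppage : 40;   /* physical page number                  */
    uint64_t l2_id : 5;    /* owner slice (valid for private pages) */
    uint64_t cls   : 2;    /* 0 = private, 1 = shared, 2 = instr    */
};
```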
Data Class Bookkeeping and Lookup. Private data: placed in the local L2 slice; page table entry: P, L2 id, vpage, ppage; TLB entry: P, vpage, ppage. Shared data: placed in the aggregate L2 (address-interleaved); page table entry: S, L2 id, vpage, ppage; TLB entry: S, vpage, ppage; physical address: tag | L2 id | cache index | offset. © Hardavellas
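A sketch of the resulting lookup: the class bit from the TLB either routes the access to the local slice or extracts the home slice directly from the physical address (the "L2 id" field between the cache index and the tag), so no directory is consulted. The bit positions (block offset, index width, slice-id width) are illustrative assumptions.

```c
#include <stdint.h>

#define BLOCK_BITS 6   /* assumed 64-byte blocks             */
#define INDEX_BITS 10  /* assumed sets-per-slice index width */
#define SLICE_BITS 4   /* log2(#slices) for an assumed 16    */

/* L2 slice selection using the class bit returned by the TLB. Private
 * pages always map to the local slice; shared pages take their home
 * slice from the L2-id bits embedded in the physical address. */
static inline unsigned l2_slice_for_access(int page_is_shared,
                                           unsigned local_slice,
                                           uint64_t paddr) {
    if (!page_is_shared)
        return local_slice;
    return (unsigned)((paddr >> (BLOCK_BITS + INDEX_BITS))
                      & ((1u << SLICE_BITS) - 1u));
}
```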
Coherence: No Need for HW Mechanisms at the LLC. The Reactive NUCA placement guarantee: each read-write datum lives in a unique and known location (shared data: address-interleaved; private data: the local slice). [Figure: tiled multicore] Fast access, eliminates HW overhead, SIMPLE. © Hardavellas
Instructions Lookup: Rotational Interleaving. [Figure: tile grid with rotational IDs (RIDs); RIDs advance by +1 in one dimension and by +log2(k) in the other; size-4 clusters = local slice + 3 neighbors; address bits of the PC select the RID] Each slice caches the same blocks on behalf of any cluster that overlaps it. Fast access (nearest neighbor, simple lookup); balances access latency with capacity constraints; equal capacity pressure at overlapped slices. © Hardavellas
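A toy model of the lookup for size-4 clusters, under stated assumptions: RIDs assigned as (x + 2y) mod 4, one concrete realization of the slide's +1-per-column, +2-per-row pattern, which guarantees every 2x2 window of tiles contains all four RIDs; the choice of interleaving bits and the specific neighborhood shape are also assumptions, not the paper's exact layout.

```c
#include <stdint.h>

#define BLOCK_BITS 6  /* assumed 64-byte blocks */

/* RID assignment realizing +1 per column and +log2(4) = +2 per row.
 * Every 2x2 window of tiles then holds all four RIDs exactly once. */
static unsigned rid_of_tile(unsigned x, unsigned y) {
    return (x + 2u * y) % 4u;
}

/* Tile that services instruction block `paddr` for the core at (x, y):
 * the neighbor (or self) whose RID matches the block's two
 * interleaving bits, giving a single-probe, nearest-neighbor lookup. */
static void instr_lookup_tile(unsigned x, unsigned y, uint64_t paddr,
                              unsigned *dst_x, unsigned *dst_y) {
    unsigned want = (unsigned)(paddr >> BLOCK_BITS) & 3u; /* target RID */
    /* candidate cluster: self, east, south, south-east (a simplified
     * size-4 neighborhood; any 2x2 window covers all RIDs) */
    unsigned cand[4][2] = { {x, y}, {x + 1, y}, {x, y + 1}, {x + 1, y + 1} };
    for (int i = 0; i < 4; i++) {
        if (rid_of_tile(cand[i][0], cand[i][1]) == want) {
            *dst_x = cand[i][0];
            *dst_y = cand[i][1];
            return;
        }
    }
}
```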
Outline Introduction Access Classification and Block Placement Reactive NUCA Mechanisms Evaluation Conclusion © Hardavellas
Evaluation. [Figure: speedup of Shared (S), R-NUCA (R), and Ideal (I) across workloads] R-NUCA delivers robust performance across workloads. vs. Shared: same for Web and DSS, 17% better for OLTP and MIX. vs. Private: 17% better for OLTP, Web, and DSS, same for MIX. © 2009 Hardavellas
Conclusions. Data may exhibit arbitrarily complex behaviors, but few that matter! Learn the behaviors that matter at run time; make the common case fast, the rare case correct. Reactive NUCA: near-optimal cache block placement; simple, scalable, low-overhead, transparent, no coherence. Robust performance: matches the best alternative or beats it by 17%, with speedups of up to 32%; near-optimal placement (within 5% on average of ideal). © Hardavellas
Thank You! For more information: http://www.eecs.northwestern.edu/~hardav/ N. Hardavellas, M. Ferdman, B. Falsafi and A. Ailamaki. Near-Optimal Cache Block Placement with Reactive Nonuniform Cache Architectures. IEEE Micro Top Picks, Vol. 30(1), pp. 20-28, January/February 2010. N. Hardavellas, M. Ferdman, B. Falsafi and A. Ailamaki. Reactive NUCA: Near-Optimal Block Placement and Replication in Distributed Caches. ISCA 2009. © Hardavellas
BACKUP SLIDES © 2009 Hardavellas
Why Are Caches Growing So Large? Increasing number of cores: cache grows commensurately; fewer but faster cores have the same effect. Increasing datasets: growing faster than Moore’s Law! Power/thermal efficiency: caches are “cool”, cores are “hot”, so it’s easier to fit more cache in a power budget. Limited bandwidth: a large cache means more data on chip, so off-chip pins are used less frequently. © Hardavellas
Backup Slides ASR © 2009 Hardavellas
ASR vs. R-NUCA Configurations. [Table: Core Type: In-Order / OoO; L2 Size (MB): 4 / 16; Memory: 150 / 500 / 90; Local L2: 12 / 20; Avg. Shared L2: 25 / 44 / 22. Chart annotations: 12.5x, 25.0x, 5.6x, 2.1x, 2.2x, 38%] © 2009 Hardavellas
ASR design space search © Hardavellas
Backup Slides Prior Work © 2009 Hardavellas
Prior Work. Several proposals for CMP cache management: ASR, cooperative caching, victim replication, CMP-NuRapid, D-NUCA. But they suffer from shortcomings: complex, high-latency lookup/coherence; don’t scale; lower effective cache capacity; optimize only for a subset of accesses. We need a simple, scalable mechanism for fast access to all data. © Hardavellas
Shortcomings of prior work. L2-Private: wastes capacity; high latency (3 slice accesses + 3 hops on shared data). L2-Shared: high latency. Cooperative Caching: doesn’t scale (centralized tag structure). CMP-NuRapid: high latency (pointer dereference, 3 hops on shared data). OS-managed L2: wastes capacity (migrates all blocks); spilling to neighbors is useless (all cores run the same code). © Hardavellas
Shortcomings of Prior Work. D-NUCA: no practical implementation (how to do lookup?). Victim Replication: high latency (like L2-Private); wastes capacity (the home slice always stores the block). Adaptive Selective Replication (ASR): capacity pressure (replicates at slice granularity); complex (4 separate HW structures to bias the coin). © Hardavellas
Classification and Lookup Backup Slides Classification and Lookup © 2009 Hardavellas
Data Classification Timeline. [Figure: core i loads A, takes a TLB miss, and the OS marks the page private to i (P, i, vpage, ppage); later, core j (j ≠ i) loads A and takes a TLB miss, so the OS invalidates i’s TLB entry, core i evicts A, the OS re-marks the page shared (S, x, vpage, ppage) and replies to j, and j allocates A] Fast and simple lookup for data; a sketch follows. © 2009 Hardavellas
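The transition in the timeline might look like this in OS pseudo-C; all primitive names (`tlb_invalidate`, `l2_evict_page_blocks`, `tlb_fill`) are hypothetical labels for the operations shown in the figure, stubbed so the sketch compiles.

```c
#include <stdint.h>

typedef enum { PAGE_PRIVATE, PAGE_SHARED } page_class_t;
typedef struct { page_class_t cls; unsigned owner; } page_info_t;

/* Hypothetical OS/TLB primitives, stubbed for illustration. */
static void tlb_invalidate(unsigned core, page_info_t *p) { (void)core; (void)p; }
static void l2_evict_page_blocks(unsigned core, page_info_t *p) { (void)core; (void)p; }
static void tlb_fill(unsigned core, page_info_t *p) { (void)core; (void)p; }

/* Private-to-shared transition when core j (j != i) touches a page
 * currently private to core i, following the slide's timeline. */
static void transition_private_to_shared(page_info_t *page,
                                         unsigned i, unsigned j) {
    tlb_invalidate(i, page);        /* "inval A": shoot down i's TLB entry     */
    l2_evict_page_blocks(i, page);  /* "evict A": flush A from i's local slice */
    page->cls = PAGE_SHARED;        /* page table: (P, i) becomes (S)          */
    tlb_fill(j, page);              /* "reply A": j's TLB miss completes; j    */
                                    /* then allocates A at its interleaved     */
                                    /* shared home slice                       */
}
```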
Misclassifications at Page Granularity. [Figure: accesses from pages with multiple access types vs. actual access misclassifications] A page may service multiple access types, but one type always dominates its accesses. Classification at page granularity is accurate. © Hardavellas
Backup Slides Placement © 2009 Hardavellas
Private Data Placement. Spill to neighbors if the working set is too large? NO!!! Each core runs similar threads, so neighbors have no spare capacity. Store in the local L2 slice (as in a private cache). © Hardavellas
Private Data Working Set. OLTP: small per-core working set (3MB / 16 cores ≈ 200KB/core). Web: primary working set <6KB/core; the remainder accounts for <1.5% of L2 references. DSS: policy doesn’t matter much (>100MB working set, <13% of L2 references, very low reuse on private data). © Hardavellas
Shared Data Placement. Read-write data with a large working set and low reuse: unlikely to still be in the local slice on reuse, and the next sharer is random [WMPI’04]. So: address-interleave across the aggregate L2 (like a shared cache). © Hardavellas
Shared Data Working Set © Hardavellas
Instruction Placement. Working set too large for one slice, and slices store private and shared data too! Sufficient capacity with 4 L2 slices. Share within clusters of neighbors, replicate across clusters. © Hardavellas
Instructions Working Set © Hardavellas
Rotational Interleaving Backup Slides Rotational Interleaving © 2009 Hardavellas
Instruction Classification and Lookup. Identification: all accesses from the L1-I. But the working set is too large to fit in one cache slice. [Figure: tiled multicore divided into clusters of neighboring slices] Share within a cluster of neighbors, replicate across clusters. © 2009 Hardavellas
Rotational Interleaving. [Figure: 32-tile grid; each tile has a TileID (0-31) and a RotationalID; RotationalIDs advance by +1 in one dimension and by +log2(k) in the other] Fast access (nearest neighbor, simple lookup); equalizes capacity pressure at overlapping slices. © 2009 Hardavellas
Nearest-neighbor size-8 clusters © Hardavellas