Presentation transcript:

Undersubscribed Threading on Clustered Cache Architectures
Wim Heirman 1,2, Trevor E. Carlson 1, Kenzo Van Craeynest 1, Ibrahim Hur 2, Aamer Jaleel 2, Lieven Eeckhout 1
1 Ghent University, 2 Intel Corporation
HPCA 2014, Orlando, FL

Context
- Many-core processors with tens to hundreds of cores (e.g. Intel Xeon Phi, Tilera, GPGPUs)
- Running scalable, data-parallel workloads (SPEC OMP, NAS Parallel Benchmarks, ...)
- Fixed processor area/power budget
  – How much to spend on cores vs. caches?
  – Which cache topology?

Overview
- Cache topology
  – Why clustered caches?
  – Why undersubscription?
- Dynamic undersubscription
  – CRUST algorithms for automatic adaptation
- CRUST and future many-core design

Many-core Cache Architectures
[Diagram: three cache organizations for N cores — private (one cache per core), clustered (one cache shared by each group of C cores, giving N/C clusters), and shared NUCA (one cache shared by all N cores).]

Many-core Cache Architectures
[Diagram: private, clustered, and shared organizations on a spectrum — hit latency grows and sharing improves as more cores share each cache.]

Undersubscribing for Cache Capacity
- Run fewer than C active cores/threads per cluster
- Useful when the working set does not fit in the cache
- All cache capacity remains accessible to the active threads
[Diagram: a 4-core cluster at full subscription and at 3/4, 2/4, and 1/4 undersubscription.]
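This idea can be made concrete in the runtime: give every cluster the same reduced number of threads and pin each thread to a distinct core of its cluster, so the idle cores' cache capacity stays reachable through the shared cluster L2. The following is a minimal, hypothetical C/OpenMP sketch (not the paper's code), assuming Linux CPU numbering where each consecutive group of C cores shares one clustered L2.

```c
/* Hypothetical sketch, not the authors' code: run k < C threads per cluster
 * and pin each one onto a distinct core of its cluster, assuming cores
 * 0..N-1 are numbered so that every consecutive group of C cores shares
 * one clustered L2. Linux-specific (sched_setaffinity). */
#define _GNU_SOURCE
#include <sched.h>
#include <omp.h>

static void pin_to_cluster(int tid, int cores_per_cluster, int threads_per_cluster)
{
    int cluster = tid / threads_per_cluster;           /* which cluster this thread uses */
    int slot    = tid % threads_per_cluster;           /* which core inside the cluster  */
    int core    = cluster * cores_per_cluster + slot;  /* global core id (assumed layout) */

    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(core, &mask);
    sched_setaffinity(0, sizeof(mask), &mask);         /* pin the calling thread */
}

void run_undersubscribed(int n_clusters, int cores_per_cluster, int threads_per_cluster)
{
    #pragma omp parallel num_threads(n_clusters * threads_per_cluster)
    {
        pin_to_cluster(omp_get_thread_num(), cores_per_cluster, threads_per_cluster);
        /* ... parallel work: every thread still sees the full L2 of its cluster ... */
    }
}
```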

Many-core Cache Architectures
[Diagram: the same latency/sharing spectrum, annotated with the sharing ratio of each organization — private (1:1), clustered (1:C), shared (1:N) — and with undersubscription applied to the clustered case.]

Performance & Energy: Working Set vs. Cache Size
- Baseline architecture: 128 cores, private L1, clustered L2 with 1 MB shared per 4 cores
[Plot: normalized performance (1/execution time) and energy efficiency (1/energy) for N-cg/A at 1/4 through 4/4 subscription.]

Performance & Energy: Working Set vs. Cache Size
- Compute bound (N-cg/A): use all cores for highest performance
- Bandwidth bound (N-cg/C): disable cores for better energy efficiency
- Capacity bound (N-ft/C): reduce thread count to optimize hit rate; 1/4 undersubscription gives 3.5x performance and 80% energy savings
[Plots: performance and energy efficiency at 1/4 through 4/4 subscription for each benchmark.]

Cluster-aware Undersubscribed Scheduling of Threads (CRUST)
- Dynamic undersubscription
- Integrated into the OpenMP runtime library
  – Adapts to each #pragma omp parallel section individually
- Optimize for performance first, save energy when possible
  – Compute bound: full subscription
  – Bandwidth bound: no performance degradation (<5% vs. full subscription)
  – Capacity bound: highest performance
- Two CRUST heuristics for on-line adaptation: CRUST-descend and CRUST-predict
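A minimal sketch of the per-region integration described above (names and structure are illustrative, not the paper's implementation): each parallel section keeps its own subscription decision, supplied by one of the heuristics on the following slides, and the runtime sets the thread count just before the region starts.

```c
/* Illustrative sketch of per-region adaptation: each #pragma omp parallel
 * site keeps its own threads-per-cluster decision, so different sections of
 * the same program can run at different subscription levels. pick_tpc is a
 * stand-in for either CRUST heuristic. */
#include <omp.h>

#define MAX_REGIONS 64
static int region_tpc[MAX_REGIONS];   /* 0 = not decided yet */

void crust_enter_region(int region_id, int n_clusters, int cores_per_cluster,
                        int (*pick_tpc)(int region_id, int cores_per_cluster))
{
    if (region_tpc[region_id] == 0)    /* first visit: run the heuristic once */
        region_tpc[region_id] = pick_tpc(region_id, cores_per_cluster);
    omp_set_num_threads(n_clusters * region_tpc[region_id]);
}
```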

CRUST-descend
- Start with full subscription
- Reduce the thread count while performance keeps increasing
[Plot: performance vs. threads per cluster, with the selected operating point marked.]
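As a concrete illustration (a sketch of the heuristic as stated on this slide, not the authors' code), the descend search is a simple downward walk over the thread count; measure_region_perf is a hypothetical hook that runs one invocation of the parallel region at the given subscription level and reports its throughput.

```c
/* Sketch of CRUST-descend: start at full subscription (C threads per
 * cluster) and step down while measured performance keeps improving. */
int crust_descend(int cores_per_cluster,
                  double (*measure_region_perf)(int threads_per_cluster))
{
    int best_tpc = cores_per_cluster;                  /* full subscription */
    double best_perf = measure_region_perf(best_tpc);

    for (int tpc = cores_per_cluster - 1; tpc >= 1; tpc--) {
        double perf = measure_region_perf(tpc);
        if (perf <= best_perf)
            break;                                     /* performance stopped rising */
        best_perf = perf;
        best_tpc  = tpc;
    }
    return best_tpc;                                   /* selected threads per cluster */
}
```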

CRUST-predict
- Reduce the number of steps required by being smarter
- Start with heterogeneous undersubscription
  – Measure the LLC miss rate for each threads-per-cluster option
  – Predict the performance of each option using a PIE-like model
- Select the best predicted option
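A sketch of the selection step, with the caveat that the performance model below is a deliberately simplified stand-in for the PIE-like model mentioned on the slide: given the LLC miss rate observed for each threads-per-cluster option during the heterogeneous sampling run, estimate per-cluster throughput and pick the best. All parameters are assumptions, not values from the paper.

```c
/* Simplified stand-in for the PIE-like predictor (illustrative only):
 * estimate CPI from the sampled LLC miss rate of each option, convert to
 * per-cluster throughput, and return the best threads-per-cluster choice. */
int crust_predict(int cores_per_cluster,
                  const double *llc_mpki,     /* sampled LLC misses per kilo-instruction,
                                                 indexed by (threads per cluster - 1) */
                  double base_cpi,            /* assumed CPI without LLC misses */
                  double miss_penalty)        /* assumed average miss latency in cycles */
{
    int best_tpc = 1;
    double best_ipc = 0.0;

    for (int tpc = 1; tpc <= cores_per_cluster; tpc++) {
        double cpi = base_cpi + llc_mpki[tpc - 1] / 1000.0 * miss_penalty;
        double cluster_ipc = tpc / cpi;       /* aggregate instructions per cycle */
        if (cluster_ipc > best_ipc) {
            best_ipc = cluster_ipc;
            best_tpc = tpc;
        }
    }
    return best_tpc;
}
```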

Methodology
- Generic many-core architecture
  – 128 cores, 2-issue
  – 2x 32 KB private L1 I+D, L1-D stride prefetcher
  – 1 MB shared L2 per 4 cores
  – 2-D mesh NoC
  – 64 GB/s total DRAM bandwidth
- Sniper simulator, McPAT for power
- SPEC OMP and NAS Parallel Benchmarks
  – Reduced iteration counts from ref; class A inputs

Results: Oracle (static)
[Plot: results for the best static undersubscription, with benchmarks grouped into capacity-bound, bandwidth-bound, and compute-bound.]

Results: Linear Bandwidth Models
[Plot: same benchmark groups.] Linear bandwidth models (e.g. BAT) save energy but do not exploit capacity effects on clustered caches.

Results: CRUST
[Plot: same benchmark groups.] CRUST saves energy when bandwidth-bound and exploits capacity effects on clustered caches.

Undersubscription vs. Future Designs
- Finite chip area, spent on cores or caches
  – Increasing maximum compute vs. keeping the cores fed with data
- Undersubscription can adapt workload behavior to the architecture
- Does this allow us to build a higher-performance design?
- Sweep the core vs. cache area ratio for a 14-nm design
  – Fixed 600 mm² area, core = 1.5 mm², L2 cache = 3 mm²/MB
  – Clustered L2 shared by 4 cores, latency ~ log2(size)
  – GB/s on-package, 64 GB/s off-package

Variant            A    B    C    D    E    F
Cores
L2 size (MB/core)
Core area          25%  33%  40%  50%  58%  64%
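The sweep can be reproduced from the numbers the slide gives; the sketch below derives a core count and L2 capacity per core for each core-area fraction. The output only approximates columns A-F, whose exact core counts and L2 sizes are not preserved in this transcript.

```c
/* Reproduces the slide's area sweep from its stated parameters:
 * 600 mm^2 total, 1.5 mm^2 per core, 3 mm^2 per MB of L2, 4-core clusters.
 * The output approximates variants A-F; it is derived here, not copied
 * from the paper. */
#include <stdio.h>

int main(void)
{
    const double total_area = 600.0;   /* mm^2                  */
    const double core_area  = 1.5;     /* mm^2 per core         */
    const double mb_area    = 3.0;     /* mm^2 per MB of L2     */
    const double core_fraction[] = { 0.25, 0.33, 0.40, 0.50, 0.58, 0.64 };

    for (int i = 0; i < 6; i++) {
        int cores = (int)(total_area * core_fraction[i] / core_area);
        cores -= cores % 4;                                   /* whole 4-core clusters */
        double l2_total_mb = (total_area - cores * core_area) / mb_area;
        printf("variant %c: %3d cores, %.2f MB L2 per core\n",
               'A' + i, cores, l2_total_mb / cores);
    }
    return 0;
}
```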

Undersubscription for Future Designs
- Compute bound: linear relation between active cores and performance
- Capacity bound (N-ft/C): reduce the thread count until the combined working set fits the available cache

Undersubscription for Future Designs
- Build one design with the best average performance
- Full subscription: conservative option C has the highest average performance
- Dynamic undersubscription: prefer more cores
  – Higher maximum performance for compute-bound benchmarks
  – Use undersubscription to accommodate capacity-bound workloads
- E vs. C: 40% more cores, 15% higher performance
[Plot: relative performance of variants A-F under full vs. dynamic subscription.]

Conclusions
- Use clustered caches for future many-core designs
  – Balance hit rate and hit latency
  – Exploit sharing to avoid duplication
  – Allow for undersubscription (use all of the cache, not all of the cores)
- CRUST for dynamic undersubscription
  – Adapts the thread count per OpenMP parallel section
  – Performance and energy improvements of up to 50%
- Take the undersubscription usage model into account when designing future many-core processors
  – CRUST-aware design: 40% more cores, 15% higher performance