
Threads vs. Caches: Modeling the Behavior of Parallel Workloads
Zvika Guz 1, Oved Itzhak 1, Idit Keidar 1, Avinoam Kolodny 1, Avi Mendelson 2, and Uri C. Weiser 1
1 Technion – Israel Institute of Technology, 2 Microsoft Corporation

Chip-Multiprocessor Era
Challenges:
 Single-core performance trend is gloomy
 Exploit chip multiprocessors with multithreaded applications
 The memory gap is paramount: latency, bandwidth, power
[Figure: Hennessy and Patterson, Computer Architecture: A Quantitative Approach]
Two basic remedies:
 Cache – reduce the number of out-of-die memory accesses
 Multithreading – hide memory access latency behind the execution of other threads
How do they play together? How do we make the most of them?

Outline
 The many-core span: Cache-Machines ↔ MT-Machines
 A high-level analytical model
 Performance curves study: a few examples
 Summary


Cache-Machines vs. MT-Machines
Many-Core – CMP with many, simple cores:  tens to hundreds of Processing Elements (PEs)
[Figure: cache/thread vs. # of threads, spanning the uni-processor, multi-core, cache-architecture, and MT-architecture regions; example machines include Intel’s Larrabee, Nvidia’s GT200, and Nvidia’s Fermi]
What are the basic tradeoffs? How will workloads behave across the range?  Predicting performance


A Unified Machine Model
 Use both cache and many threads to shield memory accesses
 The uniform framework renders the comparison meaningful
 We derive simple, parameterized equations for performance, power, BW, …
[Figure: a grid of PEs with per-thread architectural state, behind a shared cache connected to memory]

Cache Machines
Many cores (each may have its private L1) behind a shared cache
[Figure: performance vs. # threads, annotated with the Cache Non-Effective (CNE) point]

Multi-Thread Machines
Memory latency is shielded by executing multiple threads per PE.
[Figure: PEs with many thread contexts (architectural states) connected to memory; performance vs. # threads rises toward the maximum as execution hides memory accesses, until bandwidth limitations cap the curve]

Analysis (1/3)
Given a ratio of memory-access instructions r_m (0 ≤ r_m ≤ 1), every 1/r_m-th instruction accesses memory:
 A thread executes 1/r_m instructions
 Then stalls for t_avg cycles
t_avg = Average Memory Access Time (AMAT) [cycles]

Analysis (2/3)
A PE stays idle unless it is filled with instructions from other threads. Each thread occupies the PE for CPI_exe/r_m cycles before stalling for t_avg additional cycles, so N₁ = 1 + t_avg · r_m / CPI_exe threads are needed to fully utilize each PE.

Analysis (3/3)
Machine utilization: η = min(1, n / (N_PE · N₁)), where n is the number of available threads and N₁ = 1 + t_avg · r_m / CPI_exe is the number of threads needed to utilize a single PE.
Performance in operations per second [OPS]: Performance = η · N_PE · f / CPI_exe, i.e., utilization times peak performance.

Performance Model
[Equations for PE utilization, off-chip bandwidth, and power; the parameters are listed in the Model Parameters backup slides]
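The equations on this slide appeared as images in the original. The sketch below is a minimal Python rendering of the model as described on the Analysis slides; the AMAT form t_avg = t_$ + (1 − P_hit)·t_m and the bandwidth cap (each miss moving b_reg bytes off-chip) are assumptions consistent with the parameter list, not formulas copied verbatim from the deck:

```python
def t_avg(t_cache, t_mem, p_hit):
    """Average memory access time (AMAT) [cycles]: hit latency plus miss penalty."""
    return t_cache + (1.0 - p_hit) * t_mem

def utilization(n, n_pe, cpi_exe, r_m, t_avg_cycles):
    """Fraction of time each PE executes instructions.
    A thread runs CPI_exe/r_m cycles, then stalls t_avg cycles, so each PE
    needs 1 + t_avg*r_m/CPI_exe in-flight threads to hide all stalls."""
    threads_needed = 1.0 + t_avg_cycles * r_m / cpi_exe
    return min(1.0, (n / n_pe) / threads_needed)

def performance_ops(n, n_pe, f, cpi_exe, r_m, t_avg_cycles, p_hit,
                    bw_max=None, b_reg=8):
    """Operations per second: utilization times peak, optionally BW-capped."""
    ops = utilization(n, n_pe, cpi_exe, r_m, t_avg_cycles) * n_pe * f / cpi_exe
    if bw_max is not None and p_hit < 1.0:
        # assumed BW model: every miss moves b_reg bytes off-chip, so
        # sustained ops cannot exceed bw_max / (bytes moved per operation)
        ops = min(ops, bw_max / (r_m * (1.0 - p_hit) * b_reg))
    return ops
```

With a perfect cache (p_hit = 1, t_avg = t_$), utilization saturates with very few threads; as the hit rate drops, the threads needed per PE grow linearly with the stall time.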


Unified Machine Performance
The performance vs. # threads curve exhibits 3 regions: the cache-efficiency region, the valley, and the MT-efficiency region.
[Figure: performance vs. # threads – cache region, the valley, MT region]
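To see where the three regions come from, here is a small self-contained sweep. The power-law hit-rate function and every parameter value below are illustrative assumptions chosen to expose the shape, not numbers from the deck:

```python
def p_hit(n, s_cache=1024.0, per_thread_data=64.0, alpha=0.5):
    # hypothetical hit-rate model: the cache is split among n threads,
    # so the per-thread hit rate falls as threads are added
    return min(1.0, (s_cache / (n * per_thread_data)) ** alpha)

def perf(n, n_pe=16, cpi_exe=1.0, r_m=0.2, t_cache=1.0, t_mem=200.0):
    """Performance in operations per cycle for n concurrent threads."""
    t_avg = t_cache + (1.0 - p_hit(n)) * t_mem
    threads_needed = 1.0 + t_avg * r_m / cpi_exe
    util = min(1.0, (n / n_pe) / threads_needed)
    return util * n_pe / cpi_exe

# few threads: high hit rate (cache region); mid: hit rate collapses but
# there are too few threads to hide the stalls (the valley); many threads:
# stalls are fully hidden (MT region)
curve = {n: perf(n) for n in (16, 32, 64, 256, 1024)}
```

Under these assumptions performance dips at intermediate thread counts and only recovers once there are enough threads per PE to hide the now-long average memory stall.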

HW/SW Assumptions
Hardware:
 N_PE = 1024
 S_$ = 16 MByte
 CPI_exe = 1
 f = 1 GHz
 t_m = 200 cycles
 r_m = 0.2
Workloads:
 Can be parallelized into a large number of threads
 No serial part
 Threads are independent of each other  no wait time/synchronization
 No data sharing:  cache capacity is divided among all running threads
 Cache hit rate function: P_hit(S_$, n), given per workload
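Plugging these numbers into the threads-per-PE expression from the Analysis slides gives a feel for the scale. This is a back-of-the-envelope calculation (not from the slides), assuming the worst case where every memory access misses the cache:

```python
n_pe, cpi_exe, r_m, t_m = 1024, 1.0, 0.2, 200.0

# Worst case: every memory access goes to memory, so t_avg ~= t_m.
threads_per_pe = 1.0 + t_m * r_m / cpi_exe   # threads to hide all stalls on one PE
machine_threads = n_pe * threads_per_pe      # threads needed machine-wide
```

With a 200-cycle memory and r_m = 0.2, each PE needs 41 in-flight threads to stay busy, i.e., roughly 42K threads machine-wide; a good cache shrinks this dramatically.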

Cache Size Impact
An increase in cache size:
 The cache suffices for more in-flight threads  extends the $ region
 …AND is also valuable in the MT region: caches reduce off-chip bandwidth  delay the BW saturation point

Memory Latency Impact (unlimited BW to memory)
An increase in memory latency:
 Hinders the MT region
 Emphasizes the importance of caches

Hit Rate Function Impact
Simulation results from the PARSEC benchmark suite.
Swaptions:  a perfect valley

Hit Rate Function Impact
Simulation results from the PARSEC benchmark suite.
Raytrace:  monotonically increasing performance

Hit Rate Dependency – 3 Classes
Three application families, based on how the cache miss rate depends on the number of threads:
 A “strong” function of the number of threads – f(N^q) with q > 1
 A “weak” function of the number of threads – f(N^q) with q ≤ 1
 Not a function of the number of threads
[Figure: performance vs. # threads for the three classes]
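The three classes can be expressed with a toy miss-rate model; the functional form and the constant below are illustrative assumptions, not values from the paper:

```python
def miss_rate(n, q, m0=0.001):
    """Toy miss-rate model: a function of N^q, capped at 1."""
    return min(1.0, m0 * float(n) ** q)

# strong (q > 1): miss rate explodes as threads are added
strong = miss_rate(1024, 1.5)   # 1.0 (capped)
# weak (q <= 1): miss rate grows slowly with thread count
weak = miss_rate(1024, 0.5)     # 0.032
# independent (q = 0): miss rate does not depend on N
flat = miss_rate(1024, 0.0)     # 0.001
```

A strong dependency drives the machine into the valley as threads are added, while a thread-independent miss rate lets performance grow monotonically.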

Workload Parallelism Impact
Simulation results from the PARSEC benchmark suite.
Canneal:  not enough parallelism available


Summary
A high-level model for many-core engines:
 A unified framework for machines and workloads from across the range
A vehicle to derive intuition:
 Qualitative study of the tradeoffs
 A tool to understand the impact of parameters
 Identifies new behaviors and the applications that exhibit them
 Enables reasoning about complex phenomena
A first step towards escaping the valley
Thank You!

Backup


Model Parameters
Machine parameters:
 N_PE – number of PEs (in-order processing elements)
 S_$ – cache size [Bytes]
 N_max – maximal number of thread contexts in the register file
 CPI_exe – average number of cycles required to execute an instruction assuming a perfect (zero-latency) memory system [cycles]
 f – processor frequency [Hz]
 t_$ – cache latency [cycles]
 t_m – memory latency [cycles]
 BW_max – maximal off-chip bandwidth [GB/sec]
 b_reg – operand size [Bytes]

Model Parameters
Workload parameters:
 n – number of threads that execute or are in ready state (not blocked) concurrently
 r_m – fraction of instructions accessing memory out of the total number of instructions [0 ≤ r_m ≤ 1]
 P_hit(s, n) – cache hit rate for each thread, when n threads are using a cache of size s

Model Parameters
Power parameters:
 e_ex – energy per operation [J]
 e_$ – energy per cache access [J]
 e_mem – energy per memory access [J]
 Power_leakage – leakage power [W]

PARSEC Workloads

Model Validation, PARSEC Workloads

Related Work

Similar approach of using high-level models:
 Morad et al., CA Letters, 2005
 Hill and Marty, IEEE Computer, 2008
 Eyerman and Eeckhout, ISCA 2010
 Agarwal, TPDS, 1992
 Saavedra-Barrera and Culler, Berkeley, 1991
 Sorin et al., ISCA 1998
 Hong and Kim, ISCA 2009
 Baghsorkhi et al., PPoPP 2010
[Figure: the architecture-regions span – uni-processor, multi-core, cache, and MT regions]