Theory: Asleep at the Switch to Many-Core

Presentation transcript:

Theory: Asleep at the Switch to Many-Core
Phillip B. Gibbons, Intel Research Pittsburgh
Workshop on Theory and Many-Core, May 29, 2009
Slides are © Phillip B. Gibbons

Two decades after the peak of Theory's interest in parallel computing…
- The Age of Many-Core is finally underway
- Fueled by Moore's Law: 2X cores per chip every 18 months
- All aboard the parallelism train! (Almost) the only way to faster apps

All Aboard the Parallelism Train?
The switch to Many-Core brings many challenges. Theory's interest waned long ago, yet the problems were NOT solved.
Research is needed in all aspects of Many-Core. Who has answered the call?
- Computer Architecture: YES!
- Programming Languages & Compilers: YES!
- Operating & Runtime Systems: YES!
- Theory: ?

Theory: Asleep at the Switch “Engineer driving derailed Staten Island train may have fallen asleep at the switch.” (12/26/08) Theory needs to wake-up & regain a leadership role in parallel computing Theory: Asleep at the Switch to Many-Core

Theory’s Strengths New Algorithmic Paradigms Provable Correctness Conceptual Models Abstract models of computation New Algorithmic Paradigms New algorithms, new protocols Provable Correctness Safety, liveness, security, privacy,… Provable Performance Guarantees Approximation, probabilistic, new metrics Inherent Power/Limitations Of primitives, features,… …among others Theory: Asleep at the Switch to Many-Core

Five Areas in Which Theory Can (Should) Have an Important Impact
- Parallel Thinking
- Memory Hierarchy
- Asymmetry/Heterogeneity
- Concurrency Primitives
- Power
[Photo: Gare Montparnasse train derailment, 1895]

Impact Area: Parallel Thinking
Key: a good model of parallel computation that
- expresses parallelism
- is a good parallel programmer's model
- is good for teaching, i.e., teaching "how to think" in parallel
- can be engineered to good performance
(A small sketch follows below.)
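To make "expresses parallelism" and "can be engineered to good performance" concrete, here is a minimal sketch (my illustration, not code from the talk) of the nested divide-and-conquer parallelism that a work/depth-style programmer's model describes: a parallel sum with Θ(n) work and Θ(log n) depth. The 4096-element cutoff and the use of std::async are arbitrary choices.

```cpp
// Divide-and-conquer parallel sum: Theta(n) work, Theta(log n) depth.
#include <cstddef>
#include <future>
#include <iostream>
#include <vector>

long long parallel_sum(const std::vector<int>& a, std::size_t lo, std::size_t hi) {
    if (hi - lo < 4096) {                           // small base case: sum sequentially
        long long s = 0;
        for (std::size_t i = lo; i < hi; ++i) s += a[i];
        return s;
    }
    std::size_t mid = lo + (hi - lo) / 2;
    // Fork the left half; the right half runs in the current thread.
    auto left = std::async(std::launch::async, parallel_sum, std::cref(a), lo, mid);
    long long right = parallel_sum(a, mid, hi);
    return left.get() + right;                      // join
}

int main() {
    std::vector<int> a(1 << 20, 1);
    std::cout << parallel_sum(a, 0, a.size()) << "\n";   // prints 1048576
}
```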

Impact Area: Memory Hierarchy
- Deep cache/storage hierarchy
- Need a conceptual model
- Need smart thread schedulers

Impact Area: Asymmetry/Heterogeneity
- Fat/thin cores
- SIMD extensions
- Multiple coherence domains
- Mixed-mode parallelism
- Virtual machines
- ...

Impact Area: Concurrency Primitives
- Parallel prefix
- Hash map [Herlihy08] (Maurice Herlihy, Nir Shavit, Moran Tzafrir, DISC'08)
- Map reduce [Karloff09] (Howard Karloff, Siddharth Suri, Sergei Vassilvitskii, this workshop)
- Transactional memory
- Memory block transactions [Blelloch08] (Guy E. Blelloch, Phillip B. Gibbons, S. Harsha Vardhan, SPAA'08)
- Graphics primitives [Ha08] (Phuong Ha, Philippas Tsigas, Otto Anshus, DISC'08)
Theory's roles: make the case that Many-Core should (or should not) support a primitive, improve the algorithms, and recommend new primitives (prescriptive).
(A parallel-prefix sketch follows below.)
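As a concrete instance of the first primitive listed above, here is a minimal two-pass parallel prefix sum (exclusive scan) sketch. It is my illustration, not code from any of the cited papers; the chunk-per-thread decomposition and the example input are arbitrary.

```cpp
// Two-pass parallel exclusive scan: per-chunk sums, a small sequential
// scan over the chunk totals, then a second parallel pass with offsets.
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <thread>
#include <vector>

void parallel_exclusive_scan(std::vector<long long>& a, unsigned p) {
    const std::size_t n = a.size();
    const std::size_t chunk = (n + p - 1) / p;
    std::vector<long long> totals(p, 0), offset(p, 0);
    std::vector<std::thread> workers;

    // Pass 1: each thread sums its own chunk.
    for (unsigned t = 0; t < p; ++t)
        workers.emplace_back([&, t] {
            std::size_t lo = t * chunk, hi = std::min(n, lo + chunk);
            for (std::size_t i = lo; i < hi; ++i) totals[t] += a[i];
        });
    for (auto& w : workers) w.join();
    workers.clear();

    // Sequential exclusive scan over the p chunk totals (p is small).
    for (unsigned t = 1; t < p; ++t) offset[t] = offset[t - 1] + totals[t - 1];

    // Pass 2: each thread rescans its chunk, shifted by its offset.
    for (unsigned t = 0; t < p; ++t)
        workers.emplace_back([&, t] {
            std::size_t lo = t * chunk, hi = std::min(n, lo + chunk);
            long long running = offset[t];
            for (std::size_t i = lo; i < hi; ++i) {
                long long v = a[i];
                a[i] = running;          // exclusive: prefix before a[i]
                running += v;
            }
        });
    for (auto& w : workers) w.join();
}

int main() {
    std::vector<long long> a{3, 1, 4, 1, 5, 9, 2, 6};
    parallel_exclusive_scan(a, 4);
    for (auto v : a) std::cout << v << " ";   // 0 3 4 8 9 14 23 25
    std::cout << "\n";
}
```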

Impact Area: Power
Many-cores provide features for reducing power:
- Voltage scaling [Albers07] (Susanne Albers, Fabian Muller, Swen Schmelzer, SPAA'07)
- Dynamically running on fewer cores, fewer banks
A fertile area for Theory to help.

Deep Dive: Memory Hierarchy
- Deep cache/storage hierarchy
- Need a conceptual model
- Need smart thread schedulers

Good Performance Requires Effective Use of the Memory Hierarchy
Performance: running/response time, throughput, power.
[Diagram: CPU → L1 → L2 Cache → Main Memory → Magnetic Disks]
Two new trends, pervasive Multicore and pervasive Flash, bring new challenges and opportunities.

New Trend 1: Pervasive Multicore
[Diagram: multiple CPUs with private L1s → shared L2 cache → main memory → magnetic disks]
Challenges:
- Cores compete for the hierarchy
- Hard to reason about parallel performance
- Hundred cores coming soon
- Cache hierarchy design in flux
- Hierarchies differ across platforms
Opportunity:
- Rethink apps & systems to take advantage of more CPUs on chip
Makes effective use of the hierarchy much harder.

New Trend 2: Pervasive Flash
[Diagram: multiple CPUs with private L1s → shared L2 cache → main memory → flash devices → magnetic disks]
A new type of storage in the hierarchy.
Challenges:
- Performance quirks of Flash
- Technology in flux, e.g., the Flash Translation Layer (FTL)
Opportunity:
- Rethink apps & systems to take advantage

How Hierarchy is Treated Today
Algorithm designers & application/system developers often tend toward one of two extremes:
- Ignorant: API view of memory + I/O; parallelism often ignored [performance iffy]
- (Pain)fully aware: hand-tuned to the platform [effort high, not portable, limited sharing scenarios]
Or they focus on one or a few aspects, but without a comprehensive view of the whole.

The Hierarchy-Savvy Parallel Algorithm Design (Hi-Spade) project seeks to enable a hierarchy-savvy approach to algorithm design & systems for emerging parallel hierarchies:
- Hide what can be hidden
- Expose what must be exposed for good performance
- Robust: many platforms, many resource-sharing scenarios
"Hierarchy-savvy": the sweet spot between ignorant and (pain)fully aware.
http://www.pittsburgh.intel-research.net/projects/hi-spade/

Hi-Spade Research Scope
A hierarchy-savvy approach to algorithm design & systems for emerging parallel hierarchies. The research agenda includes:
- Theory: conceptual models, algorithms, analytical guarantees
- Systems: runtime support, performance tools, architectural features
- Applications: databases, operating systems, application kernels

Cache Hierarchies: Sequential
External Memory (EM) algorithms: the External Memory model [see Vitter's survey: Jeffrey S. Vitter, ACM Computing Surveys, 2001]
[Diagram: Main Memory (size M) ↔ External Memory, block transfers of size B]
+ Simple model
+ Minimize I/Os
– Only 2 levels
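For context (these are the standard Aggarwal–Vitter bounds in this model; they are not stated on the slide), scanning and sorting N items cost:

```latex
\[
\mathrm{Scan}(N) = \Theta\!\left(\frac{N}{B}\right) \ \text{I/Os},
\qquad
\mathrm{Sort}(N) = \Theta\!\left(\frac{N}{B}\,\log_{M/B}\frac{N}{B}\right) \ \text{I/Os}.
\]
```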

Cache Hierarchies: Sequential, continued
Alternative: cache-oblivious algorithms [Frigo99] (Matteo Frigo, Charles E. Leiserson, Harald Prokop, Sridhar Ramachandran, FOCS'99)
The cache-oblivious model is a twist on the EM model: M & B are unknown to the algorithm.
+ Simple model
+ Key goal: good performance for any M & B, which guarantees good cache performance at all levels of the hierarchy
– Single CPU only
(A recursive sketch follows below.)
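Here is a minimal cache-oblivious sketch (my illustration, not from the talk): an out-of-place matrix transpose by recursive splitting. Neither M nor B appears in the code, yet once a submatrix fits in a cache at any level, the remaining work on it incurs no further misses at that level. The 16×16 base case and the 512×512 test matrix are arbitrary.

```cpp
// Cache-oblivious out-of-place transpose of a row-major n x n matrix.
#include <cstddef>
#include <iostream>
#include <vector>

using Matrix = std::vector<double>;   // row-major, n x n

void transpose_rec(const Matrix& a, Matrix& b, std::size_t n,
                   std::size_t r0, std::size_t c0, std::size_t rows, std::size_t cols) {
    if (rows <= 16 && cols <= 16) {                  // small base case
        for (std::size_t i = r0; i < r0 + rows; ++i)
            for (std::size_t j = c0; j < c0 + cols; ++j)
                b[j * n + i] = a[i * n + j];
        return;
    }
    if (rows >= cols) {                              // split the longer dimension
        transpose_rec(a, b, n, r0, c0, rows / 2, cols);
        transpose_rec(a, b, n, r0 + rows / 2, c0, rows - rows / 2, cols);
    } else {
        transpose_rec(a, b, n, r0, c0, rows, cols / 2);
        transpose_rec(a, b, n, r0, c0 + cols / 2, rows, cols - cols / 2);
    }
}

int main() {
    const std::size_t n = 512;
    Matrix a(n * n), b(n * n);
    for (std::size_t i = 0; i < n * n; ++i) a[i] = double(i);
    transpose_rec(a, b, n, 0, 0, n, n);
    std::cout << b[1 * n + 0] << "\n";   // element (1,0) of b == element (0,1) of a == 1
}
```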

Cache Hierarchies: Parallel
Explicit multi-level hierarchy: the Multi-BSP model [Valiant08] (Leslie G. Valiant, ESA'08; also this workshop)
Goal: approach the simplicity of the cache-oblivious model, the hierarchy-savvy sweet spot.

Challenge: the theory of cache-oblivious algorithms falls apart once parallelism is introduced.
Good performance for any M & B on 2 levels DOES NOT imply good performance at all levels of the hierarchy.
Key reason: caches are not fully shared.
[Diagram: CPU1, CPU2, CPU3, each with a private L1, sharing an L2 cache]
What's good for CPU1 is often bad for CPU2 & CPU3, e.g., all want to write block B at ≈ the same time.
– Parallel cache-obliviousness is too strict a goal.
(A small contention experiment follows below.)
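The "all CPUs want to write block B at about the same time" problem can be reproduced with a tiny experiment (my illustration, not from the talk): threads updating counters that share a cache line contend for that line, while padding each counter onto its own line removes the contention. The thread count, iteration count, and assumed 64-byte line size are arbitrary; absolute timings depend on the machine, and a C++17 compiler is assumed for the over-aligned vector elements.

```cpp
// Threads incrementing neighboring counters: shared cache line vs. padded.
#include <chrono>
#include <cstdint>
#include <iostream>
#include <thread>
#include <vector>

constexpr int kThreads = 4;
constexpr long long kIters = 20000000;

struct Packed { volatile std::uint64_t v; };                 // neighbors share a 64-byte line
struct alignas(64) Padded { volatile std::uint64_t v; };     // one counter per line

template <typename Slot>
double run_seconds() {
    std::vector<Slot> slots(kThreads);                       // value-initialized to zero
    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> ts;
    for (int t = 0; t < kThreads; ++t)
        ts.emplace_back([&slots, t] {
            for (long long i = 0; i < kIters; ++i)
                slots[t].v = slots[t].v + 1;                 // volatile read + write
        });
    for (auto& th : ts) th.join();
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
}

int main() {
    std::cout << "shared cache line: " << run_seconds<Packed>() << " s\n";
    std::cout << "padded lines:      " << run_seconds<Padded>() << " s\n";
}
```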

Key new dimension: the scheduling of parallel threads has a LARGE impact on cache performance.
Recall our problem scenario: all CPUs want to write block B at ≈ the same time.
This can be mitigated (but not solved) if we can schedule the writes to be far apart in time.

Existing Parallel Cache Models (slide from Rezaul Chowdhury)
- Parallel Shared-Cache Model: CPUs 1..p share a single cache (size C) in front of main memory; block transfers of size B.
- Parallel Private-Cache Model: each of CPUs 1..p has its own private cache (size C) in front of main memory; block transfers of size B.

Competing Demands of Private and Shared Caches (slide from Rezaul Chowdhury)
- A shared cache wants cores to work on the same set of cache blocks.
- Private caches want cores to work on disjoint sets of cache blocks.
Experimental results have shown that on CMP architectures:
- work stealing, the state-of-the-art scheduler for the private-cache model, can suffer from excessive shared-cache misses;
- parallel depth-first, the best scheduler for the shared-cache model, can incur excessive private-cache misses.

Private vs. Shared Caches [Blelloch08] (Guy E. Blelloch, Rezaul A. Chowdhury, Phillip B. Gibbons, Vijaya Ramachandran, Shimin Chen, Michael Kozuch, SODA'08)
- Parallel all-shared hierarchy: + provably good cache performance for cache-oblivious algorithms
- 3-level multi-core model (private L1s, shared L2): insights on private vs. shared; + designed a new scheduler with provably good cache performance for a class of divide-and-conquer algorithms
- – Results require exposing the working-set size of each recursive subproblem
(A sketch of such an interface follows below.)
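A hypothetical sketch of that last point: the algorithm annotates each recursive subproblem with its working-set size so a space-aware scheduler could decide where and when to run it. The Scheduler interface below is invented purely for illustration (it simply runs tasks inline); it is not the SODA'08 scheduler.

```cpp
// Divide-and-conquer sum where each subproblem reports its working-set size.
#include <cstddef>
#include <functional>
#include <iostream>
#include <vector>

struct Scheduler {
    // Hypothetical interface: run `task`, told it touches roughly `bytes` of data.
    void spawn(std::size_t bytes, const std::function<void()>& task) {
        (void)bytes;   // a real scheduler would use this to pick a core/cache
        task();        // here we just run the task inline
    }
};

long long dc_sum(Scheduler& sched, const std::vector<int>& a, std::size_t lo, std::size_t hi) {
    if (hi - lo <= 1024) {                           // base case: sum sequentially
        long long s = 0;
        for (std::size_t i = lo; i < hi; ++i) s += a[i];
        return s;
    }
    std::size_t mid = lo + (hi - lo) / 2;
    long long left = 0, right = 0;
    sched.spawn((mid - lo) * sizeof(int), [&] { left  = dc_sum(sched, a, lo, mid); });
    sched.spawn((hi - mid) * sizeof(int), [&] { right = dc_sum(sched, a, mid, hi); });
    return left + right;
}

int main() {
    Scheduler sched;
    std::vector<int> a(100000, 2);
    std::cout << dc_sum(sched, a, 0, a.size()) << "\n";   // prints 200000
}
```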

Parallel Tree of Caches
Approach [Blelloch09] (Guy E. Blelloch, Phillip B. Gibbons, Harsha Vardhan Simhadri, CMU tech report, 2009): design a low-depth cache-oblivious algorithm.
Theorem: for each level i, only O(M_i P D / B_i) misses more than the sequential schedule.
Low depth D ⇒ good miss bound.
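Restating the slide's bound with the symbols spelled out (Q_i is my notation for the number of misses at the level-i caches; P = number of processors, D = depth of the computation, M_i and B_i = level-i cache size and block size):

```latex
\[
Q_i(\text{parallel schedule}) \;\le\; Q_i(\text{sequential schedule}) \;+\; O\!\left(\frac{M_i \, P \, D}{B_i}\right)
\]
```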

Five Areas in Which Theory Can (Should) Have an Important Impact
- Parallel Thinking
- Memory Hierarchy
- Asymmetry/Heterogeneity
- Concurrency Primitives
- Power