Theory: Asleep at the Switch to Many-Core
Phillip B. Gibbons, Intel Research Pittsburgh
Workshop on Theory and Many-Core, May 29, 2009
Slides are © Phillip B. Gibbons
Abstract
Two decades after the peak of Theory's interest in parallel computing, the age of many-core is finally underway. Moore's Law is driving a steady doubling of cores per chip, forcing everyone to jump aboard the parallelism train (or be left behind). To overcome the challenges in realizing the potential of many-core, research is needed in all aspects of parallel computing. Yet, judging by the lack of parallel computing papers in FOCS, SODA, and STOC, Theory is asleep at the switch. This talk provides a wake-up call for Theory to regain a leadership role in parallel computing, before it's too late. I will advocate five areas in which Theory can (and should) have an important impact, by playing to its strengths. Illustrative examples of initial progress in several of these areas will be drawn from our work on the Hierarchy-savvy Parallel Algorithm Design (Hi-Spade) project.
Two Decades After the Peak of Theory's Interest in Parallel Computing…
The Age of Many-Core is finally underway
- Fueled by Moore's Law: 2X cores per chip every 18 months
All aboard the parallelism train!
- (Almost) the only way to faster apps
All Aboard the Parallelism Train?
Switch to Many-Core… Many Challenges
- Interest waned long ago, yet the problems were NOT solved
- Research needed in all aspects of Many-Core
Who has answered the call?
- Computer Architecture: YES!
- Programming Languages & Compilers: YES!
- Operating & Runtime Systems: YES!
- Theory?
Theory: Asleep at the Switch
Theory needs to wake up & regain a leadership role in parallel computing
"Engineer driving derailed Staten Island train may have fallen asleep at the switch." (12/26/08)
Theory's Strengths
- Conceptual Models: abstract models of computation
- New Algorithmic Paradigms: new algorithms, new protocols
- Provable Correctness: safety, liveness, security, privacy, …
- Provable Performance Guarantees: approximation, probabilistic, new metrics
- Inherent Power/Limitations: of primitives, features, …
…among others
Five Areas in Which Theory Can (Should) Have an Important Impact
- Parallel Thinking
- Memory Hierarchy
- Asymmetry/Heterogeneity
- Concurrency Primitives
- Power
[Image: Montparnasse train derailment, 1895]
Impact Area: Parallel Thinking
Key: a good model of parallel computation
- Expresses parallelism
- A good parallel programmer's model
- Good for teaching, and for teaching "how to think" in parallel
- Can be engineered to good performance
Impact Area: Memory Hierarchy
- Deep cache/storage hierarchy
- Need a conceptual model
- Need smart thread schedulers
Impact Area: Asymmetry/Heterogeneity
- Fat/Thin cores
- SIMD extensions
- Multiple coherence domains
- Mixed-mode parallelism
- Virtual Machines
- …
Impact Area: Concurrency Primitives
- Parallel prefix
- Hash map [Herlihy08]
- Map reduce [Karloff09]
- Transactional memory
- Memory block transactions [Blelloch08]
- Graphics primitives [Ha08]
What Theory can do:
- Make the case that Many-Core should (or should not) support a primitive
- Improve the algorithms
- Recommend new primitives (prescriptive)
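As a concrete instance of the first primitive above, here is a minimal sketch (my example, not from the talk) of the classic work-efficient parallel prefix, i.e. the Blelloch exclusive scan, simulated sequentially. Each inner loop body is independent across j, so on a many-core machine every pass could run in parallel, giving O(n) work and O(log n) depth:

```python
import operator

def exclusive_scan(a, op=operator.add, identity=0):
    """Work-efficient parallel prefix (Blelloch scan), simulated sequentially.

    Assumes len(a) is a power of two. Each for-loop over j is fully
    parallel, so the span (depth) of the whole scan is O(log n).
    """
    n = len(a)
    t = list(a)
    # Up-sweep (reduce) phase: build partial sums up an implicit tree.
    d = 1
    while d < n:
        for j in range(0, n, 2 * d):
            t[j + 2 * d - 1] = op(t[j + d - 1], t[j + 2 * d - 1])
        d *= 2
    # Down-sweep phase: push prefixes back down the tree.
    t[n - 1] = identity
    d = n // 2
    while d >= 1:
        for j in range(0, n, 2 * d):
            left = t[j + d - 1]
            t[j + d - 1] = t[j + 2 * d - 1]
            t[j + 2 * d - 1] = op(left, t[j + 2 * d - 1])
        d //= 2
    return t

print(exclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))  # [0, 3, 4, 11, 11, 15, 16, 22]
```

Because `op` is any associative operator, the same skeleton serves radix sort, stream compaction, and many of the other primitives listed above.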
Impact Area: Power
Many-cores provide features for reducing power:
- Voltage scaling [Albers07]
- Dynamically run on fewer cores, fewer banks
A fertile area for Theory to help
Deep Dive: Memory Hierarchy
- Deep cache/storage hierarchy
- Need a conceptual model
- Need smart thread schedulers
Good Performance Requires Effective Use of the Memory Hierarchy
[Diagram: CPU + L1 → L2 Cache → Main Memory → Magnetic Disks]
Performance:
- Running/response time
- Throughput
- Power
Two new trends, Pervasive Multicore & Pervasive Flash, bring new challenges and opportunities
New Trend 1: Pervasive Multicore
[Diagram: multiple CPUs with private L1 caches above a shared L2 cache, Main Memory, and Magnetic Disks]
Makes effective use of the hierarchy much harder
Challenges:
- Cores compete for the hierarchy
- Hard to reason about parallel performance
- Hundred cores coming soon
- Cache hierarchy design in flux
- Hierarchies differ across platforms
Opportunity:
- Rethink apps & systems to take advantage of more CPUs on chip
New Trend 2: Pervasive Flash
[Diagram: Flash devices join the hierarchy between Main Memory and Magnetic Disks]
A new type of storage in the hierarchy
Challenges:
- Performance quirks of Flash
- Technology in flux, e.g., the Flash Translation Layer (FTL)
Opportunity:
- Rethink apps & systems to take advantage
How the Hierarchy is Treated Today
Algorithm designers & application/system developers often tend toward one of two extremes:
- Ignorant: API view of Memory + I/O; parallelism often ignored [performance iffy]
- (Pain)fully aware: hand-tuned to the platform [effort high, not portable, limited sharing scenarios]
Or they focus on one or a few aspects, but without a comprehensive view of the whole
The Hierarchy-Savvy Parallel Algorithm Design (Hi-Spade) project seeks to enable:
A hierarchy-savvy approach to algorithm design & systems for emerging parallel hierarchies
"Hierarchy-Savvy": the sweet spot between ignorant and (pain)fully aware
- Hide what can be hidden
- Expose what must be exposed for good performance
- Robust: many platforms, many resource-sharing scenarios
Hi-Spade Research Scope
A hierarchy-savvy approach to algorithm design & systems for emerging parallel hierarchies
Research agenda includes:
- Theory: conceptual models, algorithms, analytical guarantees
- Systems: runtime support, performance tools, architectural features
- Applications: databases, operating systems, application kernels
Cache Hierarchies: Sequential
External Memory (EM) Algorithms [see Vitter's ACM Computing Surveys article]
External Memory Model: Main Memory (size M), External Memory, transfers in blocks of size B
+ Simple model
+ Minimize I/Os
– Only 2 levels
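For concreteness (these classic bounds are from the EM literature, not stated on the slide), the model's figure of merit is the number of block transfers, and the canonical results for scanning and sorting N items are:

```latex
\mathrm{scan}(N) \;=\; \Theta\!\left(\frac{N}{B}\right)\ \text{I/Os},
\qquad
\mathrm{sort}(N) \;=\; \Theta\!\left(\frac{N}{B}\,\log_{M/B}\frac{N}{B}\right)\ \text{I/Os}.
```

That is, the model counts I/Os rather than CPU operations, which is exactly what makes it both simple and predictive for disk-bound workloads.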
Cache Hierarchies: Sequential
Alternative: Cache-Oblivious Algorithms [Frigo99]
Cache-Oblivious Model: a twist on the EM Model in which M & B are unknown to the algorithm
+ Simple model
+ Key goal: good performance for any M & B, which guarantees good cache performance at all levels of the hierarchy
– Single CPU only
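To make the cache-oblivious idea concrete, here is a small sketch (my example, not from the talk) of the classic recursive matrix transpose. Note that no cache parameter appears anywhere in the code: the recursion keeps splitting into quadrants, so for any cache size M and line size B the subproblems eventually fit in cache, and the standard analysis gives O(n²/B) misses:

```python
def co_transpose(A, B, ri=0, ci=0, n=None):
    """Cache-oblivious out-of-place transpose: B = A^T.

    A and B are n x n lists of lists, n a power of two. The four
    quadrant calls touch disjoint data, so they could also run in
    parallel -- which is exactly where the sequential theory gets
    into trouble, as discussed next.
    """
    if n is None:
        n = len(A)
    if n <= 4:  # base case: block is tiny, copy directly
        for r in range(ri, ri + n):
            for c in range(ci, ci + n):
                B[c][r] = A[r][c]
        return
    h = n // 2
    co_transpose(A, B, ri,     ci,     h)  # top-left quadrant
    co_transpose(A, B, ri,     ci + h, h)  # top-right
    co_transpose(A, B, ri + h, ci,     h)  # bottom-left
    co_transpose(A, B, ri + h, ci + h, h)  # bottom-right

A = [[r * 8 + c for c in range(8)] for r in range(8)]
B = [[0] * 8 for _ in range(8)]
co_transpose(A, B)
assert B == [list(row) for row in zip(*A)]
```

The base-case cutoff of 4 is only to amortize Python's call overhead; the asymptotic cache behavior does not depend on it.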
Cache Hierarchies: Parallel
- Explicit multi-level hierarchy: the Multi-BSP Model [Valiant08]
- Hierarchy-savvy sweet-spot goal: approach the simplicity of the cache-oblivious model
Challenge: the theory of cache-oblivious algorithms falls apart once parallelism is introduced
- Good performance for any M & B on 2 levels DOES NOT imply good performance at all levels of the hierarchy
- Key reason: caches are not fully shared
[Diagram: CPU1, CPU2, CPU3 with private L1 caches above a shared L2 cache]
- What's good for CPU1 is often bad for CPU2 & CPU3, e.g., all want to write block B at ≈ the same time
– Parallel cache-obliviousness is too strict a goal
Key new dimension: scheduling of parallel threads
- Has a LARGE impact on cache performance
- Recall our problem scenario: all CPUs want to write block B at ≈ the same time
- Can mitigate (but not solve) the problem if the scheduler places the writes far apart in time
Existing Parallel Cache Models
- Parallel Private-Cache Model: p CPUs, each with its own private cache (size C), with block transfers (size B) between each cache and main memory
- Parallel Shared-Cache Model: p CPUs sharing a single cache (size C), with block transfers (size B) to main memory
Slide from Rezaul Chowdhury
Competing Demands of Private and Shared Caches
- Shared cache: cores should work on the same set of cache blocks
- Private cache: cores should work on disjoint sets of cache blocks
Experimental results have shown that on CMP architectures:
- Work-stealing, the state-of-the-art scheduler for the private-cache model, can suffer from excessive shared-cache misses
- Parallel depth-first, the best scheduler for the shared-cache model, can incur excessive private-cache misses
Slide from Rezaul Chowdhury
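The shared-cache half of this tension can be illustrated with a toy simulation (my sketch, not from the slides; the two "schedules" are deliberate caricatures of parallel depth-first and work-stealing). Two cores each make two passes over 4 blocks through a shared 4-block LRU cache. When the cores sweep the same blocks in lockstep, the second core and the second pass hit in cache; when they sweep disjoint ranges, the combined 8-block working set overflows the shared cache and every access misses:

```python
from collections import OrderedDict

def lru_misses(accesses, capacity):
    """Count cache misses of a block-access trace on an LRU cache."""
    cache, misses = OrderedDict(), 0
    for blk in accesses:
        if blk in cache:
            cache.move_to_end(blk)       # refresh recency on a hit
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict least-recently-used
            cache[blk] = True
    return misses

def interleave(t0, t1):
    """Round-robin schedule of two cores' access sequences."""
    return [b for pair in zip(t0, t1) for b in pair]

CAP = 4
# Parallel-depth-first-like: both cores sweep the SAME 4 blocks, twice.
shared = interleave([0, 1, 2, 3] * 2, [0, 1, 2, 3] * 2)
# Work-stealing-like: cores sweep DISJOINT 4-block ranges, twice.
disjoint = interleave([0, 1, 2, 3] * 2, [4, 5, 6, 7] * 2)

print(lru_misses(shared, CAP), lru_misses(disjoint, CAP))  # prints: 4 16
```

The real experiments cited on the slide are of course far richer, but the toy trace captures why co-scheduling threads with overlapping footprints helps a shared cache (and, symmetrically, hurts private ones).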
Private vs. Shared Caches [Blelloch08]
+ Parallel all-shared hierarchy: provably good cache performance for cache-oblivious algorithms
+ 3-level multi-core model (private L1s, shared L2): insights on private vs. shared
+ Designed a new scheduler with provably good cache performance for a class of divide-and-conquer algorithms
– Results require exposing the working-set size of each recursive subproblem
Parallel Tree of Caches
[Diagram: tree of caches, with cores at the leaves]
Approach [Blelloch09]: design a low-depth cache-oblivious algorithm
- Low depth D
- Good miss bound
Theorem: at each level i, only O(M_i · P · D / B_i) more misses than the sequential schedule
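Written out with the symbols spelled loudly (my reading of the slide's compressed notation: M_i and B_i are the capacity and block size of a level-i cache, P the number of processors, D the algorithm's depth, and Q_i counts level-i misses), the bound is:

```latex
Q_i(\text{parallel schedule}) \;\le\; Q_i(\text{sequential schedule}) \;+\; O\!\left(\frac{M_i \, P \, D}{B_i}\right).
```

Since a cache-oblivious algorithm's sequential miss count is already good at every level, keeping the depth D polylogarithmic makes the additive overhead term negligible at every level of the tree at once.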
Five Areas in Which Theory Can (Should) Have an Important Impact
- Parallel Thinking
- Memory Hierarchy
- Asymmetry/Heterogeneity
- Concurrency Primitives
- Power