Programming Many-Core Systems with GRAMPS Jeremy Sugerman 14 May 2010

2 The single fast core era is over Trends: changing metrics (‘scale out’, not just ‘scale up’); increasing diversity (many different mixes of ‘cores’). Today’s (and tomorrow’s) machines: commodity, heterogeneous, many-core. Problem: How does one program all this complexity?!

3 High-level programming models Two major advantages over threads & locks –Constructs to express/expose parallelism –Scheduling support to help manage concurrency, communication, and synchronization Widespread in research and industry: OpenGL/Direct3D, SQL, Brook, Cilk, CUDA, OpenCL, StreamIt, TBB, …

4 My biases (workloads) Interesting applications have irregularity. Large bundles of coherent work are efficient. The producer-consumer idiom is important. Goal: Rebuild coherence dynamically by aggregating related work as it is generated.

5 My target audience Highly informed, but (good) lazy –Understands the hardware and best practices –Dislikes rote work, prefers power over constraints Goal: Let systems-savvy developers efficiently develop programs that efficiently map onto their hardware.

6 Contributions: Design of GRAMPS Programs are graphs of stages and queues Queues: –Maximum capacities, packet sizes Stages: –No, limited, or total automatic parallelism –Fixed, variable, or reduction (in-place) outputs [Figure: Simple Graphics Pipeline]
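
The stage and queue parameters listed on this slide map naturally onto a small set of descriptors. Below is a minimal illustrative sketch in C++; the type and field names are hypothetical stand-ins, not the actual GRAMPS API.

```cpp
// Hypothetical descriptors for the degrees of freedom named on this slide.
// These names are illustrative only; they are not the real GRAMPS interface.
#include <cstddef>
#include <string>

enum class StageParallelism { None, Limited, Total };      // no / limited / total automatic parallelism
enum class StageOutput     { Fixed, Variable, Reduction }; // fixed / variable / in-place reduction outputs

struct QueueDesc {
    std::string name;
    std::size_t packet_size;    // granularity of queue operations, in bytes
    std::size_t max_capacity;   // bound on packets in flight
    bool        preserve_order; // optional FIFO ordering guarantee
};

struct StageDesc {
    std::string      name;
    StageParallelism parallelism;
    StageOutput      output;
};
```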

7 Contributions: Implementation Broad application scope: –Rendering, MapReduce, image processing, … Multi-platform applicability: –GRAMPS runtimes for three architectures Performance: –Scale-out parallelism, controlled data footprint –Compares well to schedulers from other models (Also: Tunable)

8 Outline GRAMPS overview Study 1: Future graphics architectures Study 2: Current multi-core CPUs Comparison with schedulers from other parallel programming models

GRAMPS Overview

10 GRAMPS Programs are graphs of stages and queues –Expose the program structure –Leave the program internals unconstrained

11 Writing a GRAMPS program Design the application graph and queues. Design the stages. Instantiate and launch. [Figure: Cookie Dough Pipeline]
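
Concretely, the three steps might look like the following sketch, reusing the hypothetical QueueDesc/StageDesc types from the sketch under slide 6. The Graph type and launch() call are likewise invented for illustration and are not the real GRAMPS interface.

```cpp
// Hypothetical end-to-end setup: declare queues, declare stages, wire them into
// a graph, then hand the graph to the runtime. All identifiers here are
// illustrative stand-ins, not the actual GRAMPS API.
#include <vector>

struct Graph {
    std::vector<QueueDesc> queues;   // from the earlier descriptor sketch
    std::vector<StageDesc> stages;
    // edges (producer stage -> queue -> consumer stage) omitted for brevity
};

void launch(const Graph&) { /* stub: the real runtime would schedule the stages */ }

int main() {
    QueueDesc dough  {"dough",   /*packet_size=*/4096, /*max_capacity=*/64,  /*preserve_order=*/false};
    QueueDesc cookies{"cookies", /*packet_size=*/1024, /*max_capacity=*/128, /*preserve_order=*/false};

    StageDesc mixer {"mixer",  StageParallelism::None,  StageOutput::Fixed};    // Thread stage
    StageDesc shaper{"shaper", StageParallelism::Total, StageOutput::Variable}; // Shader stage

    Graph g{{dough, cookies}, {mixer, shaper}};
    launch(g);
    return 0;
}
```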

12 Queues Bounded size, operate at “packet” granularity –“Opaque” and “Collection” packets GRAMPS can optionally preserve ordering –Required for some workloads, adds overhead

13 Thread (and Fixed) stages Preemptible, long-lived, stateful –Often merge, compare, or repack inputs Queue operations: Reserve/Commit (Fixed: Thread stages in custom hardware)
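
As a rough illustration of the Reserve/Commit idiom, a Thread stage’s body is a long-lived loop that reserves windows of packets on its queues, works on them, and commits them. Everything below, the Window and Queue types and the merge example, is a hypothetical sketch, not the actual GRAMPS thread-stage interface.

```cpp
// Hypothetical Thread-stage body: preemptible, stateful loop that merges two
// input streams into one output stream, packet by packet.
#include <cstddef>

template <typename T>
struct Window {                       // a reserved span of packets
    T*          data  = nullptr;
    std::size_t count = 0;
    bool valid() const { return data != nullptr; }
};

template <typename T>
struct Queue {                        // interface a Thread stage programs against
    virtual Window<T> reserve(std::size_t n) = 0;  // may block; a preemption point
    virtual void      commit(Window<T> w)    = 0;  // publish (or consume) the window
    virtual ~Queue() = default;
};

template <typename T>
void merge_stage(Queue<T>& in_a, Queue<T>& in_b, Queue<T>& out) {
    for (;;) {
        Window<T> a = in_a.reserve(1);             // preemption point
        Window<T> b = in_b.reserve(1);
        if (!a.valid() && !b.valid()) break;       // both producers have finished
        Window<T> o = out.reserve(1);
        // ... merge the contents of a and b into o, carrying any leftover
        //     elements as stage-local state across iterations ...
        out.commit(o);                             // publish the merged packet
        if (a.valid()) in_a.commit(a);             // release the consumed inputs
        if (b.valid()) in_b.commit(b);
    }
}
```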

14 Shader stages: Automatically parallelized: –Horde of non-preemptible, stateless instances –Pre-reserve / post-commit Push: Variable/conditional output support –GRAMPS coalesces elements into full packets
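
A Shader stage, by contrast, is a stateless per-packet function: the runtime pre-reserves its input and post-commits its output, and variable-rate output goes through push so the runtime can coalesce elements into full packets. The sketch below is hypothetical; the PushBuffer type and the shader signature are invented for illustration.

```cpp
// Hypothetical Shader-stage body: non-preemptible and stateless, so the runtime
// can run one instance per input packet, in parallel. The runtime pre-reserves
// the input and post-commits the fixed output; variable-rate output goes
// through push(), and the runtime coalesces pushed elements into full packets.
#include <cstddef>

struct Ray { float origin[3], dir[3]; };
struct Hit { float t; int prim; };

struct PushBuffer {                                    // illustrative stand-in
    void push(const Ray&) { /* coalesced into packets by the runtime (stub) */ }
};

// One instance per input packet; no state survives between invocations.
void intersect_shader(const Ray* in, std::size_t count,
                      Hit* out, PushBuffer& secondary_rays) {
    for (std::size_t i = 0; i < count; ++i) {
        out[i] = Hit{/*t=*/-1.0f, /*prim=*/-1};        // fixed 1:1 output slot
        bool spawned_bounce = false;                   // ... real intersection/shading here ...
        if (spawned_bounce)
            secondary_rays.push(in[i]);                // conditional, variable-rate output
    }
}
```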

15 Queue sets: Mutual exclusion Independent exclusive (serial) subqueues –Created statically or on first output –Densely or sparsely indexed Bonus: Automatically instanced Thread stages [Figure: Cookie Dough Pipeline]

16 Queue sets: Mutual exclusion Independent exclusive (serial) subqueues –Created statically or on first output –Densely or sparsely indexed Bonus: Automatically instanced Thread stages [Figure: Cookie Dough Pipeline (with queue set)]
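
One way to picture a queue set: the producer routes each element to an exclusive subqueue chosen by a key, so the runtime can instance the consuming Thread stage, one instance per subqueue, with mutual exclusion guaranteed per key. A hypothetical sketch of the producer side follows; the QueueSet type and its indexing are invented for illustration.

```cpp
// Hypothetical producer writing into a queue set: each element is routed to an
// exclusive subqueue chosen by a key (here, which cookie sheet), so the
// consuming Thread stage can be instanced per subqueue without locks.
#include <cstddef>

struct Dough { int sheet_id; float grams; };

struct QueueSet {                                                     // illustrative stand-in
    void push(std::size_t /*subqueue*/, const Dough&) { /* stub */ }  // dense or sparse index
};

void portion_shader(const Dough* in, std::size_t count, QueueSet& sheets) {
    for (std::size_t i = 0; i < count; ++i)
        sheets.push(static_cast<std::size_t>(in[i].sheet_id), in[i]); // mutual exclusion per sheet
}
```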

17 A few other tidbits Instanced Thread stages; Queues as barriers / read all-at-once; In-place Shader stages / coalescing inputs

18 Formative influences The Graphics Pipeline, early GPGPU; “Streaming”; Work-queues and task-queues

Study 1: Future Graphics Architectures (with Kayvon Fatahalian, Solomon Boulos, Kurt Akeley, Pat Hanrahan; appeared in ACM Transactions on Graphics, January 2009)

20 Graphics is a natural first domain Table stakes for commodity parallelism GPUs are full of heterogeneity Poised to transition from fixed/configurable pipeline to programmable We have a lot of experience in it

21 The Graphics Pipeline in GRAMPS Graph and setup are (application) software –Can be customized or completely replaced Like the transition to programmable shading –Not (unthinkably) radical Fits current hardware: FIFOs, cores, rasterizer, …

22 Reminder: Design goals Broad application scope Multi-platform applicability Performance: scale-out, footprint-aware

23 The Experiment Three renderers: –Rasterization, Ray Tracer, Hybrid Two simulated future architectures –Simple scheduler for each

24 Scope: Two(-plus) renderers [Figures: Rasterization Pipeline (with ray tracing extension); Ray Tracing Graph]

25 Platforms: Two simulated systems –CPU-Like: 8 Fat Cores, Rast –GPU-Like: 1 Fat Core, 4 Micro Cores, Rast, Sched

26 Performance— Metrics “Maximize machine utilization while keeping working sets small” Priority #1: Scale-out parallelism –Parallel utilization Priority #2: ‘Reasonable’ bandwidth / storage –Worst case total footprint of all queues –Inherently a trade-off versus utilization
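
The footprint metric follows directly from the graph’s queue declarations: the static worst case is every queue filled to capacity. A small sketch using the hypothetical QueueDesc from the slide 6 sketch; the studies themselves report the worst case actually observed during execution, which this merely bounds.

```cpp
// Static worst-case data footprint of a graph: every queue filled to capacity.
// Uses the hypothetical QueueDesc from the earlier sketch.
#include <cstddef>
#include <vector>

std::size_t worst_case_footprint_bytes(const std::vector<QueueDesc>& queues) {
    std::size_t total = 0;
    for (const QueueDesc& q : queues)
        total += q.max_capacity * q.packet_size;   // packets in flight * bytes per packet
    return total;
}
```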

27 Performance— Scheduling Simple prototype scheduler (both platforms): static stage priorities; only preempt on Reserve and Commit; no dynamic weighting of current queue sizes. [Figure: pipeline stages annotated from lowest to highest priority]
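
The policy on this slide is simple enough to sketch in full. The following is a self-contained miniature of static-priority scheduling with switching only at queue-operation boundaries; it illustrates the idea rather than reproducing the simulators’ code, and the priority assignment mentioned in the comment is only a plausible choice.

```cpp
// Miniature of the prototype policy: static per-stage priorities, and a core
// only reconsiders what to run at Reserve/Commit boundaries. The priority
// assignment (e.g., later stages higher so the graph drains) is only one
// plausible choice; the real assignment belongs to the runtime.
#include <functional>
#include <queue>
#include <vector>

struct Task {
    int priority;                          // static, assigned per stage
    std::function<void()> run;             // runs until its next Reserve/Commit
};

struct ByPriority {
    bool operator()(const Task& a, const Task& b) const { return a.priority < b.priority; }
};

using ReadyQueue = std::priority_queue<Task, std::vector<Task>, ByPriority>;

void scheduler_loop(ReadyQueue& ready) {
    while (!ready.empty()) {
        Task t = ready.top();              // highest-priority runnable work
        ready.pop();
        t.run();                           // if it blocks in Reserve/Commit, the
                                           // queue operation re-enqueues it later
    }
}
```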

28 Performance— Results Utilization: 95+% for all but the rasterized fairy scene (~80%). Footprint: < 600KB CPU-like, < 1.5MB GPU-like. Surprised how well the simple scheduler worked. Maintaining order costs footprint.

Study 2: Current Multi-core CPUs (with (alphabetically) Christos Kozyrakis, David Lo, Daniel Sanchez, Richard Yoo; submitted to PACT 2010)

30 Reminder: Design goals Broad application scope Multi-platform applicability Performance: scale-out, footprint-aware

31 The Experiment 9 applications, 13 configurations One (more) architecture: multi-core x86 –It’s real (no simulation here) –Built with pthreads, locks, and atomics Per-pthread task-priority-queues with work-stealing –More advanced scheduling

32 Scope: Application bonanza –GRAMPS: Ray tracer (0, 1 bounce), Spheres (no rasterization, though) –MapReduce: Hist (reduce / combine), LR (reduce / combine), PCA –Cilk(-like): Mergesort –CUDA: Gaussian, SRAD –StreamIt: FM, TDE

33 Scope: Many different idioms [Figures: GRAMPS graphs for FM, Merge Sort, Ray Tracer, SRAD, MapReduce]

34 Platform: 2x Quad-core Nehalem (native: 8 Hyper-Threaded Core i7 cores) –Queues: copy in/out, global (shared) buffer –Threads: user-level scheduled contexts –Shaders: create one task per input packet

35 Performance— Metrics (Reminder) “Maximize machine utilization while keeping working sets small” Priority #1: Scale-out parallelism Priority #2: ‘Reasonable’ bandwidth / storage

36 Performance– Scheduling Static per-stage priorities (still) Work-stealing task-priority-queues Eagerly create one task per packet (naïve) Keep running stages until a low watermark –(Limited dynamic weighting of queue depths)
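
A rough picture of this policy: each worker keeps per-stage task queues ordered by the static stage priority, pulls from the highest-priority non-empty queue, and steals from another worker when its own queues are empty. The sketch below is a simplified, self-contained illustration, not the actual GRAMPS runtime; the low-watermark logic and the packet-to-task mapping are omitted, and the coarse per-worker mutex is only to keep it short.

```cpp
// Simplified work-stealing over per-stage task queues, in priority order.
// Each worker scans its own stage queues from highest priority down; if all
// are empty it steals from another worker.
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <optional>
#include <vector>

using Task = std::function<void()>;

struct Worker {
    std::vector<std::deque<Task>> stage_queues;   // index 0 = highest-priority stage
    std::mutex                    lock;

    std::optional<Task> pop_local() {
        std::lock_guard<std::mutex> g(lock);
        for (auto& q : stage_queues)
            if (!q.empty()) { Task t = std::move(q.front()); q.pop_front(); return t; }
        return std::nullopt;
    }
    std::optional<Task> steal() {                 // victims give away their oldest, lowest-priority work
        std::lock_guard<std::mutex> g(lock);
        for (auto it = stage_queues.rbegin(); it != stage_queues.rend(); ++it)
            if (!it->empty()) { Task t = std::move(it->back()); it->pop_back(); return t; }
        return std::nullopt;
    }
};

void worker_loop(Worker& self, std::vector<Worker*>& others) {
    for (;;) {
        std::optional<Task> t = self.pop_local();
        for (std::size_t v = 0; !t && v < others.size(); ++v)
            t = others[v]->steal();
        if (!t) break;                            // nothing anywhere: quiesce (simplified)
        (*t)();
    }
}
```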

37 Performance– Good Scale-out (Footprint: Good; detail a little later) [Chart: parallel speedup versus hardware threads]

38 Performance– Low Overheads ‘App’ and ‘Queue’ time are both useful work. [Chart: execution time breakdown, percentage of execution, 8 cores / 16 hyperthreads]

Comparison with Other Schedulers (with (alphabetically) Christos Kozyrakis, David Lo, Daniel Sanchez, Richard Yoo; submitted to PACT 2010)

40 Three archetypes Task-Stealing (Cilk, TBB): low overhead with fine-granularity tasks; but no producer-consumer, priorities, or data-parallel support. Breadth-First (CUDA, OpenCL): simple scheduler (one stage at a time); but no producer-consumer, no pipeline parallelism. Static (StreamIt / Streaming): no runtime scheduler, complex schedules; but cannot adapt to irregular workloads.

41 GRAMPS is a natural framework [Table: which archetypes provide Shader Support, Producer-Consumer, Structured ‘Work’, and Adaptive scheduling, comparing GRAMPS, Task-Stealing, Breadth-First, and Static; GRAMPS covers all four]

42 The Experiment Re-use the exact same application code Modify the scheduler per archetype: –Task-Stealing: Unbounded queues, no priority, (amortized) preempt to child tasks –Breadth-First: Unbounded queues, stage at a time, top-to-bottom –Static: Unbounded queues, offline per-thread schedule using SAS / SGMS

43 Seeing is believing (ray tracer) [Figure: execution visualizations for GRAMPS, Breadth-First, Static (SAS), and Task-Stealing]

44 Comparison: Execution time Mostly similar: good parallelism, load balance. [Chart: time breakdown, percentage of time, for GRAMPS, Task-Stealing, Breadth-First]

45 Comparison: Execution time Breadth-First can exhibit load imbalance. [Chart: time breakdown, percentage of time, for GRAMPS, Task-Stealing, Breadth-First]

46 Comparison: Execution time Task-Stealing can ping-pong and cause contention. [Chart: time breakdown, percentage of time, for GRAMPS, Task-Stealing, Breadth-First]

47 Comparison: Footprint Breadth-First is pathological (as expected). [Chart: relative packet footprint versus GRAMPS, log scale]

48 Footprint: GRAMPS & Task-Stealing [Charts: relative packet footprint; relative task footprint]

49 Footprint: GRAMPS & Task-Stealing GRAMPS gets insight from the graph: (application-specified) queue bounds; group tasks by stage for priority and preemption. [Charts: MapReduce and Ray Tracer footprints]

50 Static scheduling is challenging Generating good Static schedules is *hard*. Static schedules are fragile: –Small mismatches compound –Hardware itself is dynamic (cache traffic, IRQs, …) Limited upside: dynamic scheduling is cheap! [Charts: execution time; packet footprint]

51 Discussion (for multi-core CPUs) Adaptive scheduling is the obvious choice. –Better load-balance / handling of irregularity Semantic insight (app graph) gives a big advantage in managing footprint. More cores, development maturity → more complex graphs and thus more advantage.

Conclusion

53 Contributions Revisited GRAMPS programming model design –Graph of heterogeneous stages and queues Good results from actual implementation –Broad scope: Wide range of applications –Multi-platform: Three different architectures –Performance: High parallelism, good footprint

54 Anecdotes and intuitions Structure helps: an explicit graph is handy. Simple (principled) dynamic scheduling works. Queues impedance match heterogeneity. Graphs with cycles and push both paid off. (Also: Paired instrumentation and visualization help enormously)

55 Conclusion: Future trends revisited Core counts are increasing –Parallel programming models Memory and bandwidth are precious –Working set, locality (i.e., footprint) management Power, performance driving heterogeneity –All ‘cores’ need to communicate, interoperate  GRAMPS fits them well.

56 Thanks Eric, for agreeing to make this happen. Christos, for throwing helpers at me. Kurt, Mendel, and Pat, for, well, a lot. John Gerth, for tireless computer servitude. Melissa (and Heather and Ada before her)

57 Thanks My practice audiences My many collaborators Daniel, Kayvon, Mike, Tim Supporters at NVIDIA, ATI/AMD, Intel Supporters at VMware Everyone who entertained, informed, challenged me, and made me think

58 Thanks My funding agencies: –Rambus Stanford Graduate Fellowship –Department of the Army Research –Stanford Pervasive Parallelism Laboratory

59 Q&A Thank you for listening! Questions?

Extra Material (Backup)

61 Data: CPU-Like & GPU-Like

62 Footprint Data: Native

63 Tunability Diagnosis: –Raw counters, statistics, logs –Grampsviz Optimize / Control: –Graph topology (e.g., sort-middle vs. sort-last) –Queue watermarks (e.g., 10x win for ray tracing) –Packet size: Match SIMD widths, share data

64 Tunability– Grampsviz (1) GPU-Like: Rasterization pipeline

65 Tunability– Grampsviz (2) CPU-Like: Histogram (MapReduce) [Screenshot: Reduce and Combine stage views]

66 Tunability– Knobs Graph topology/design: Sort-Middle vs. Sort-Last. Sizing critical queues. [Figures: sort-middle and sort-last graph variants]

Alternatives

68 Alternate Contribution Formulation Design of the GRAMPS model –Structure computation as a graph of heterogeneous stages –Communication via programmer-sized queues Many applications written in GRAMPS GRAMPS runtime for three platforms (dynamic scheduling) Evaluation of GRAMPS scheduler against Task-Stealing, Breadth-First, Static

69 A few other tidbits In-place Shader stages / coalescing inputs; Instanced Thread stages; Queues as barriers / read all-at-once [Figure: Image Histogram Pipeline]

70 Performance– Good Scale-out (Footprint: Good; detail a little later) [Chart: parallel speedup versus hardware threads]

71 Seeing is believing (ray tracer) [Figure: execution visualizations for GRAMPS, Static (SAS), Task-Stealing, and Breadth-First]

72 Comparison: Execution time Small ‘Sched’ time, even with large graphs. [Chart: time breakdown, percentage of time, for GRAMPS, Task-Stealing, Breadth-First]