Programming Many-Core Systems with GRAMPS
Jeremy Sugerman
14 May 2010
2 The single fast core era is over
Trends:
–Changing metrics: ‘scale out’, not just ‘scale up’
–Increasing diversity: many different mixes of ‘cores’
–Today’s (and tomorrow’s) machines: commodity, heterogeneous, many-core
Problem: How does one program all this complexity?!
3 High-level programming models
Two major advantages over threads & locks:
–Constructs to express/expose parallelism
–Scheduling support to help manage concurrency, communication, and synchronization
Widespread in research and industry: OpenGL/Direct3D, SQL, Brook, Cilk, CUDA, OpenCL, StreamIt, TBB, …
4 My biases (workloads)
Interesting applications have irregularity
Large bundles of coherent work are efficient
Producer-consumer idiom is important
Goal: Rebuild coherence dynamically by aggregating related work as it is generated.
5 My target audience
Highly informed, but (good) lazy:
–Understands the hardware and best practices
–Dislikes rote; prefers power versus constraints
Goal: Let systems-savvy developers efficiently develop programs that efficiently map onto their hardware.
6 Contributions: Design of GRAMPS
Programs are graphs of stages and queues
Queues:
–Maximum capacities, packet sizes
Stages:
–No, limited, or total automatic parallelism
–Fixed, variable, or reduction (in-place) outputs
(Figure: Simple Graphics Pipeline)
7 Contributions: Implementation
Broad application scope:
–Rendering, MapReduce, image processing, …
Multi-platform applicability:
–GRAMPS runtimes for three architectures
Performance:
–Scale-out parallelism, controlled data footprint
–Compares well to schedulers from other models
(Also: Tunable)
8 Outline
GRAMPS overview
Study 1: Future graphics architectures
Study 2: Current multi-core CPUs
Comparison with schedulers from other parallel programming models
GRAMPS Overview
10 GRAMPS
Programs are graphs of stages and queues:
–Expose the program structure
–Leave the program internals unconstrained
11 Writing a GRAMPS program
Design the application graph and queues
Design the stages
Instantiate and launch (see the sketch below)
(Figure: Cookie Dough Pipeline. Credit: http://www.foodnetwork.com/recipes/alton-brown/the-chewy-recipe/index.html)
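As a rough illustration of those three steps, here is a hypothetical C++ sketch of building and launching a small two-stage pipeline. The gramps:: types and calls (Graph, addQueue, addThreadStage, addShaderStage, launch) are invented for this example and are not the actual GRAMPS API; DoughBall and the stage bodies are placeholders.

#include "gramps.h"   // assumed header for the invented gramps:: types used below

struct DoughBall { /* one element of work */ };

// Stage bodies, defined elsewhere in the application.
void MixerMain(gramps::ThreadContext& ctx);
void ShapeOne(gramps::ShaderContext& ctx, const DoughBall& ball);

int main() {
    gramps::Graph graph;

    // 1. Design the application graph and queues: bounded capacity, fixed packet size.
    gramps::Queue& dough   = graph.addQueue("dough",   /*capacityPackets=*/16, /*packetBytes=*/4096);
    gramps::Queue& cookies = graph.addQueue("cookies", /*capacityPackets=*/16, /*packetBytes=*/4096);

    // 2. Design the stages: a stateful Thread stage feeding a data-parallel Shader stage.
    graph.addThreadStage("Mixer",  /*in=*/nullptr, /*out=*/&dough,   MixerMain);
    graph.addShaderStage("Shaper", /*in=*/&dough,  /*out=*/&cookies, ShapeOne);

    // 3. Instantiate and launch: the runtime schedules stages until the graph drains.
    gramps::launch(graph);
    return 0;
}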
12 Queues
Bounded size, operate at “packet” granularity
–“Opaque” and “Collection” packets
GRAMPS can optionally preserve ordering
–Required for some workloads, adds overhead
13 Thread (and Fixed) stages
Preemptible, long-lived, stateful
–Often merge, compare, or repack inputs
Queue operations: Reserve / Commit (see the sketch below)
(Fixed: Thread stages in custom hardware)
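A hypothetical sketch of what a Thread stage body might look like with reserve/commit-style queue operations; the gramps:: calls and the MixerState/mergeAndRepack helpers are illustrative placeholders, not the real GRAMPS interface.

// Illustrative only: a long-lived, stateful Thread stage that repacks inputs.
void MixerMain(gramps::ThreadContext& ctx) {
    MixerState state;                                   // Thread stages may carry state across packets
    while (gramps::Packet* in = ctx.reserveInput(0)) {  // block until an input packet is available
        gramps::Packet* out = ctx.reserveOutput(0);     // reserve space in the (bounded) output queue
        mergeAndRepack(state, *in, *out);               // placeholder for the stage's real work
        ctx.commitOutput(out);                          // publish the output packet to consumers
        ctx.commitInput(in);                            // release the input packet
    }                                                   // reserveInput returns null once upstream is done
}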
14 Shader stages
Automatically parallelized:
–Horde of non-preemptible, stateless instances
–Pre-reserve / post-commit
Push: variable/conditional output support (see the sketch below)
–GRAMPS coalesces elements into full packets
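A hypothetical sketch of a Shader stage using push for conditional, variable-sized output; ctx.push, Ray, Hit, trace, and makeShadowRay are all placeholders invented for illustration, not GRAMPS's actual interface.

// Illustrative only: one stateless Shader instance runs per input element.
void ShadeRay(gramps::ShaderContext& ctx, const Ray& ray) {
    Hit hit;
    if (trace(ray, &hit)) {                                // placeholder intersection test
        for (int i = 0; i < hit.numLights; ++i) {          // conditional, variable-sized output
            ctx.push(/*queue=*/0, makeShadowRay(hit, i));  // GRAMPS coalesces pushed elements into full packets
        }
    }
}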
15 Queue sets: Mutual exclusion
Independent exclusive (serial) subqueues
–Created statically or on first output
–Densely or sparsely indexed
Bonus: Automatically instanced Thread stages
(Figure: Cookie Dough Pipeline)
16 Queue sets: Mutual exclusion
Independent exclusive (serial) subqueues (see the sketch below)
–Created statically or on first output
–Densely or sparsely indexed
Bonus: Automatically instanced Thread stages
(Figure: Cookie Dough Pipeline, with queue set)
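One hypothetical sketch of how a producer might target a queue set: each element is routed to a subqueue chosen by a key, so the consumer for that key runs serially without explicit locking. The ctx.pushKeyed call and Element type are invented for illustration.

// Illustrative only: bin elements into exclusive subqueues of a queue set.
void BinElement(gramps::ShaderContext& ctx, const Element& e) {
    unsigned tray = e.trayIndex;              // densely indexed subqueue key
    ctx.pushKeyed(/*queueSet=*/0, tray, e);   // same key -> same exclusive subqueue, so that
                                              // tray's consumer instance never runs
                                              // concurrently with itself
}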
17 A few other tidbits
Instanced Thread stages
Queues as barriers / read all-at-once
In-place Shader stages / coalescing inputs
18 Formative influences
The Graphics Pipeline, early GPGPU
“Streaming”
Work-queues and task-queues
Study 1: Future Graphics Architectures
(with Kayvon Fatahalian, Solomon Boulos, Kurt Akeley, Pat Hanrahan; appeared in ACM Transactions on Graphics, January 2009)
20 Graphics is a natural first domain
Table stakes for commodity parallelism
GPUs are full of heterogeneity
Poised to transition from fixed/configurable pipeline to programmable
We have a lot of experience in it
21 The Graphics Pipeline in GRAMPS
Graph, setup are (application) software
–Can be customized or completely replaced
Like the transition to programmable shading
–Not (unthinkably) radical
Fits current hardware: FIFOs, cores, rasterizer, …
22 Reminder: Design goals
Broad application scope
Multi-platform applicability
Performance: scale-out, footprint-aware
23 The Experiment
Three renderers:
–Rasterization, Ray Tracer, Hybrid
Two simulated future architectures
–Simple scheduler for each
24 Scope: Two(-plus) renderers
(Figures: Rasterization Pipeline (with ray tracing extension); Ray Tracing Extension; Ray Tracing Graph)
25 Platforms: Two simulated systems
CPU-Like: 8 Fat Cores, Rast
GPU-Like: 1 Fat Core, 4 Micro Cores, Rast, Sched
26 Performance— Metrics
“Maximize machine utilization while keeping working sets small”
Priority #1: Scale-out parallelism
–Parallel utilization
Priority #2: ‘Reasonable’ bandwidth / storage
–Worst case total footprint of all queues
–Inherently a trade-off versus utilization
27 Performance— Scheduling
Simple prototype scheduler (both platforms):
–Static stage priorities (figure labels stages from Lowest to Highest)
–Only preempt on Reserve and Commit
–No dynamic weighting of current queue sizes
(See the sketch below.)
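A minimal sketch of the static-priority policy described above, assuming a Stage type with a hasRunnableWork query; this is an illustration of the idea, not the simulator's actual scheduler code.

#include <vector>

struct Stage { bool hasRunnableWork() const; /* ... */ };

// Pick work for an idle core: scan stages from highest to lowest static
// priority and run the first one with a runnable packet. Running stages are
// only preempted when they call Reserve or Commit.
Stage* pickNextStage(const std::vector<Stage*>& stagesByDescendingPriority) {
    for (Stage* s : stagesByDescendingPriority) {
        if (s->hasRunnableWork())
            return s;
    }
    return nullptr;   // nothing runnable right now; the core idles
}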
28 Performance— Results
Utilization: 95+% for all but rasterized fairy (~80%)
Footprint: < 600KB CPU-like, < 1.5MB GPU-like
Surprised how well the simple scheduler worked
Maintaining order costs footprint
Study 2: Current Multi-core CPUs (with (alphabetically) Christos Kozyrakis, David Lo, Daniel Sanchez, Richard Yoo; submitted to PACT 2010)
30 Reminder: Design goals
Broad application scope
Multi-platform applicability
Performance: scale-out, footprint-aware
31 The Experiment
9 applications, 13 configurations
One (more) architecture: multi-core x86
–It’s real (no simulation here)
–Built with pthreads, locks, and atomics
Per-pthread task-priority-queues with work-stealing
–More advanced scheduling
32 Scope: Application bonanza
GRAMPS: Ray tracer (0, 1 bounce), Spheres (No rasterization, though)
MapReduce: Hist (reduce / combine), LR (reduce / combine), PCA
Cilk(-like): Mergesort
CUDA: Gaussian, SRAD
StreamIt: FM, TDE
33 Scope: Many different idioms
(Figures: application graphs for FM, Merge Sort, Ray Tracer, SRAD, MapReduce)
34 Platform: 2x Quad-core Nehalem
Native: 8 HyperThreaded Core i7 cores
Queues: copy in/out, global (shared) buffer
Threads: user-level scheduled contexts
Shaders: create one task per input packet
35 Performance— Metrics (Reminder)
“Maximize machine utilization while keeping working sets small”
Priority #1: Scale-out parallelism
Priority #2: ‘Reasonable’ bandwidth / storage
36 Performance– Scheduling
Static per-stage priorities (still)
Work-stealing task-priority-queues (see the sketch below)
Eagerly create one task per packet (naïve)
Keep running stages until a low watermark
–(Limited dynamic weighting of queue depths)
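A rough sketch of the per-pthread task-priority-queue with work-stealing described above, using standard C++ containers; the Task fields, the victim-selection order, and the (omitted) synchronization are simplifications, not the runtime's actual implementation.

#include <optional>
#include <queue>
#include <vector>

struct Task { int stagePriority; /* plus the packet and stage entry point */ };

struct ByPriority {
    bool operator()(const Task& a, const Task& b) const {
        return a.stagePriority < b.stagePriority;   // higher-priority stages pop first
    }
};

struct Worker {
    std::priority_queue<Task, std::vector<Task>, ByPriority> tasks;  // per-pthread queue
};

// Get the next task for a worker: prefer local, highest-priority work,
// otherwise steal from another worker's queue (synchronization omitted).
std::optional<Task> getWork(Worker& self, std::vector<Worker>& all) {
    if (!self.tasks.empty()) {
        Task t = self.tasks.top();
        self.tasks.pop();
        return t;
    }
    for (Worker& victim : all) {
        if (&victim != &self && !victim.tasks.empty()) {
            Task t = victim.tasks.top();
            victim.tasks.pop();
            return t;
        }
    }
    return std::nullopt;   // no work anywhere; spin or block until new packets arrive
}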
37 Performance– Good Scale-out
(Footprint: Good; detail a little later)
(Chart: parallel speedup versus hardware threads)
38 Performance– Low Overheads
‘App’ and ‘Queue’ time are both useful work.
(Chart: execution time breakdown as a percentage of execution, 8 cores / 16 hyperthreads)
Comparison with Other Schedulers (with (alphabetically) Christos Kozyrakis, David Lo, Daniel Sanchez, Richard Yoo; submitted to PACT 2010)
40 Three archetypes
Task-Stealing (Cilk, TBB):
–Low overhead with fine granularity tasks
–No producer-consumer, priorities, or data-parallel
Breadth-First (CUDA, OpenCL):
–Simple scheduler (one stage at a time)
–No producer-consumer, no pipeline parallelism
Static (StreamIt / Streaming):
–No runtime scheduler; complex schedules
–Cannot adapt to irregular workloads
41 GRAMPS is a natural framework
(Table: GRAMPS, Task-Stealing, Breadth-First, and Static compared on Shader Support, Producer-Consumer, Structured ‘Work’, and Adaptive)
42 The Experiment
Re-use the exact same application code
Modify the scheduler per archetype:
–Task-Stealing: Unbounded queues, no priority, (amortized) preempt to child tasks
–Breadth-First: Unbounded queues, stage at a time, top-to-bottom
–Static: Unbounded queues, offline per-thread schedule using SAS / SGMS
43 Seeing is believing (ray tracer)
(Figure panels: GRAMPS, Task-Stealing, Breadth-First, Static (SAS))
44 Comparison: Execution time
Mostly similar: good parallelism, load balance
(Chart: time breakdown as a percentage of time for GRAMPS, Task-Stealing, Breadth-First)
45 Comparison: Execution time
Breadth-first can exhibit load-imbalance
(Chart: time breakdown as a percentage of time for GRAMPS, Task-Stealing, Breadth-First)
46 Comparison: Execution time
Task-stealing can ping-pong, cause contention
(Chart: time breakdown as a percentage of time for GRAMPS, Task-Stealing, Breadth-First)
47 Comparison: Footprint
Breadth-First is pathological (as expected)
(Chart: relative packet footprint versus GRAMPS, log scale)
48 Footprint: GRAMPS & Task-Stealing
(Charts: relative packet footprint; relative task footprint)
49 Footprint: GRAMPS & Task-Stealing
GRAMPS gets insight from the graph:
–(Application-specified) queue bounds
–Group tasks by stage for priority, preemption
(Charts: MapReduce and Ray Tracer)
50 Static scheduling is challenging
Generating good Static schedules is *hard*.
Static schedules are fragile:
–Small mismatches compound
–Hardware itself is dynamic (cache traffic, IRQs, …)
Limited upside: dynamic scheduling is cheap!
(Charts: execution time; packet footprint)
51 Discussion (for multi-core CPUs)
Adaptive scheduling is the obvious choice.
–Better load-balance / handling of irregularity
Semantic insight (app graph) gives a big advantage in managing footprint.
More cores, development maturity → more complex graphs and thus more advantage.
Conclusion
53 Contributions Revisited
GRAMPS programming model design
–Graph of heterogeneous stages and queues
Good results from actual implementation
–Broad scope: Wide range of applications
–Multi-platform: Three different architectures
–Performance: High parallelism, good footprint
54 Anecdotes and intuitions
Structure helps: an explicit graph is handy.
Simple (principled) dynamic scheduling works.
Queues impedance-match heterogeneity.
Graphs with cycles and push both paid off.
(Also: Paired instrumentation and visualization help enormously)
55 Conclusion: Future trends revisited
Core counts are increasing
–Parallel programming models
Memory and bandwidth are precious
–Working set, locality (i.e., footprint) management
Power, performance driving heterogeneity
–All ‘cores’ need to communicate, interoperate
GRAMPS fits them well.
56 Thanks
Eric, for agreeing to make this happen.
Christos, for throwing helpers at me.
Kurt, Mendel, and Pat, for, well, a lot.
John Gerth, for tireless computer servitude.
Melissa (and Heather and Ada before her)
57 Thanks
My practice audiences
My many collaborators
–Daniel, Kayvon, Mike, Tim
Supporters at NVIDIA, ATI/AMD, Intel
Supporters at VMware
Everyone who entertained, informed, challenged me, and made me think
58 Thanks
My funding agencies:
–Rambus Stanford Graduate Fellowship
–Department of the Army Research
–Stanford Pervasive Parallelism Laboratory
59 Q&A
Thank you for listening! Questions?
Extra Material (Backup)
61 Data: CPU-Like & GPU-Like
62 Footprint Data: Native
63 Tunability
Diagnosis:
–Raw counters, statistics, logs
–Grampsviz
Optimize / Control:
–Graph topology (e.g., sort-middle vs. sort-last)
–Queue watermarks (e.g., 10x win for ray tracing)
–Packet size: Match SIMD widths, share data
64 Tunability– Grampsviz (1)
(Figure: GPU-Like: Rasterization pipeline)
65 Tunability– Grampsviz (2)
(Figure: CPU-Like: Histogram (MapReduce); Reduce / Combine)
66 Tunability– Knobs
Graph topology/design (Figure: Sort-Middle vs. Sort-Last)
Sizing critical queues
Alternatives
68 Alternate Contribution Formulation
Design of the GRAMPS model
–Structure computation as a graph of heterogeneous stages
–Communication via programmer-sized queues
Many applications written in GRAMPS
GRAMPS runtime for three platforms (dynamic scheduling)
Evaluation of GRAMPS scheduler against Task-Stealing, Breadth-First, Static
69 A few other tidbits
In-place Shader stages / coalescing inputs
Instanced Thread stages
Queues as barriers / read all-at-once
(Figure: Image Histogram Pipeline)
70 Performance– Good Scale-out
(Footprint: Good; detail a little later)
(Chart: parallel speedup versus hardware threads)
71 Seeing is believing (ray tracer)
(Figure panels: GRAMPS, Static (SAS), Task-Stealing, Breadth-First)
72 Comparison: Execution time
Small ‘Sched’ time, even with large graphs
(Chart: time breakdown as a percentage of time for GRAMPS, Task-Stealing, Breadth-First)