1
Tools and Libraries for Manycore Computing
Kathy Yelick, U.C. Berkeley and LBNL
2
Is a Single Programming Model Enough? No
- Higher-level productivity programming model
  - Usable by ~90% of programmers
  - Principles:
    - Determinism: behavioral determinism (not a deterministic memory sequence); no nondeterminism from concurrency
    - Independence: identified and verified
    - Virtualized: the machine (#cores, memory hierarchy) is hidden
- Lower-level efficiency layer
  - Usable by ~10% of all programmers (current C programmers)
  - Used for writing libraries, frameworks, and domain-specific runtimes
  - Abstract machine model: cores, thread contexts, and memory structures are visible
  - Virtualization by choice
- Portability is critical to the productivity of the community
- We should have multiple models at both levels
3
Strawman: Minimal Efficiency Layer
- Static parallelism
- Locality and resource management done by the user
  - Structure of the machine can be queried: #cores, #thread contexts, #memory spaces, etc.
- Pay for what you need
  - Direct load/store or bulk DMAs between spaces
  - Separate transfer & synch (see the sketch below)
  - Fast barriers, locks, atomics
- Supports systems with cache coherence or trivial coherence
[Figure: memory model with private on-chip, shared partitioned on-chip, and shared off-chip DRAM spaces]
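Below is a minimal, self-contained sketch of the split-phase "separate transfer & synch" idiom: initiate a bulk copy, overlap independent work, then synchronize explicitly. A pthread worker and memcpy stand in for a hardware DMA engine, and the dma_put_async/dma_wait names are hypothetical, not part of any particular efficiency layer.

```c
#include <pthread.h>
#include <string.h>
#include <stdio.h>

typedef struct {
    void *dst, *src;
    size_t nbytes;
    pthread_t tid;
} dma_handle_t;

static void *dma_worker(void *arg) {
    dma_handle_t *h = arg;
    memcpy(h->dst, h->src, h->nbytes);   /* stands in for a bulk DMA */
    return NULL;
}

/* Start the transfer and return immediately (split-phase). */
static void dma_put_async(dma_handle_t *h, void *dst, void *src, size_t nbytes) {
    h->dst = dst; h->src = src; h->nbytes = nbytes;
    pthread_create(&h->tid, NULL, dma_worker, h);
}

/* Separate synchronization point: block until the transfer completes. */
static void dma_wait(dma_handle_t *h) {
    pthread_join(h->tid, NULL);
}

int main(void) {
    static double src[1 << 20], dst[1 << 20];
    for (size_t i = 0; i < (size_t)(1 << 20); i++) src[i] = (double)i;

    dma_handle_t h;
    dma_put_async(&h, dst, src, sizeof src);   /* transfer overlaps the work below */

    double acc = 0.0;                          /* independent compute on other data */
    for (int i = 0; i < 1000; i++) acc += i * 0.5;

    dma_wait(&h);                              /* explicit, separate synch */
    printf("acc=%g, dst[42]=%g\n", acc, dst[42]);
    return 0;
}
```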
4
Virtualization by Choice
- Efficiency layer has virtualization add-ons
  - Virtualize the cores
    - Add a runtime with dynamic scheduling for a dynamic task tree (see the sketch below)
    - Add a semi-static schedule for a given graph
  - Virtualize the memory space (& add functionality)
    - Add caching (or ignore partitioning)
    - Add transactions
- Note:
  - These are just examples; the general strategy is to provide mechanisms and pay for what you need
  - Need serious support for verification and testing
[Figures: a general task graph with weights & structure; a divide-and-conquer tree with dynamic weights/structure]
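As an illustration of the "virtualize the cores" add-on, the following sketch uses OpenMP tasks as a stand-in dynamic scheduler for a divide-and-conquer tree whose work per subtree is only known at run time. OpenMP is an assumed example here, not the runtime the slides describe.

```c
#include <omp.h>
#include <stdio.h>

/* Divide-and-conquer sum; the scheduler maps spawned subtrees to cores. */
static long tree_sum(const long *a, long n) {
    if (n < 1024) {                      /* small leaves run sequentially */
        long s = 0;
        for (long i = 0; i < n; i++) s += a[i];
        return s;
    }
    long left, right;
    #pragma omp task shared(left)        /* dynamic task for the left subtree */
    left = tree_sum(a, n / 2);
    right = tree_sum(a + n / 2, n - n / 2);
    #pragma omp taskwait                 /* join before combining results */
    return left + right;
}

int main(void) {
    enum { N = 1 << 20 };
    static long a[N];
    for (long i = 0; i < N; i++) a[i] = 1;
    long total;
    #pragma omp parallel
    #pragma omp single
    total = tree_sum(a, N);
    printf("sum = %ld (expected %d)\n", total, N);
    return 0;
}
```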
5
Tools for Efficiency: Autotuning
- Automatic performance tuning
  - Use machine time in place of human time for tuning
  - Search over possible implementations (see the sketch below)
  - Use performance models to restrict the search space
- Autotuned libraries for dwarfs (up to 10x speedup)
  - Spectral (FFTW, Spiral)
  - Dense (PHiPAC, Atlas)
  - Sparse (Sparsity, OSKI)
  - Stencils/structured grids
- Are these compilers?
  - They don't transform source
  - There are compilers that use this kind of search
  - But not for the sparse case (transform the matrix)
[Figure: mflop/s vs. register block size (n0 x m0) for dense matrix-matrix multiply; best: 4x2 blocks, which uses a different data structure with more zeros]
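A minimal sketch of the search step, in the spirit of PHiPAC/ATLAS: time a handful of candidate register-block sizes for dense matrix multiply and keep the fastest. The candidate list, problem size, and timing loop are illustrative assumptions, not the actual search those libraries perform.

```c
#include <stdio.h>
#include <time.h>

#define N 256

static double A[N][N], B[N][N], C[N][N];

/* Blocked C += A*B, computing a (bi x bj) tile of the output at a time. */
static void matmul_blocked(int bi, int bj) {
    for (int ii = 0; ii < N; ii += bi)
        for (int jj = 0; jj < N; jj += bj)
            for (int k = 0; k < N; k++)
                for (int i = ii; i < ii + bi && i < N; i++)
                    for (int j = jj; j < jj + bj && j < N; j++)
                        C[i][j] += A[i][k] * B[k][j];
}

static double seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { A[i][j] = 1.0; B[i][j] = 2.0; }

    int candidates[][2] = { {1,1}, {2,2}, {4,2}, {2,4}, {4,4}, {8,4} };
    int best = 0; double best_t = 1e30;

    for (size_t c = 0; c < sizeof candidates / sizeof candidates[0]; c++) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) C[i][j] = 0.0;
        double t0 = seconds();
        matmul_blocked(candidates[c][0], candidates[c][1]);
        double t = seconds() - t0;
        printf("%dx%d block: %.3f s (%.1f Mflop/s)\n",
               candidates[c][0], candidates[c][1], t, 2.0 * N * N * N / t / 1e6);
        if (t < best_t) { best_t = t; best = (int)c; }   /* keep the fastest variant */
    }
    printf("best block size: %dx%d\n", candidates[best][0], candidates[best][1]);
    return 0;
}
```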
6
Sparse Matrices on Multicore
Autotuning boosts the performance of single and multiple cores.
7
Sparse Matrices on Multicore
Autotuning boosts the performance of multiple-socket SMPs.
8
Sparse Matrices on Multicore
- Autotuning is more important than parallelism! (It's the memory system, stupid.)
- And don't think running one MPI process per core is good enough for performance.
9
Productivity: Coordination and Composition
- The leverage of expert programmers comes from abstraction and composition
  - Calling traditional libraries is easy
  - Calling parallel frameworks is harder
    - Divide-and-conquer parallelism
    - Guarded atomic commands
    - Usual serial composition
- Two key challenges
  - Correctness of the composition: ensuring independence; limited sharing
  - Performance: resource management during composition; domain-specific OS/runtime support
10
Productivity: Coordination and Composition
- The leverage of expert programmers comes from abstraction and composition
  - Calling traditional libraries is easy
  - Calling parallel frameworks is harder (avoid concurrency errors, not just races on memory cells)
- Shared regions of partitioned data
  - Often implemented as copies (ghost regions)
  - The semantic property is phases: in a read-only phase, tasks can read a larger set; in a read-write phase, they can write a smaller set
- Branch-and-bound search (see the sketch below)
  - Divide-and-conquer framework
  - The bound is shared state, updated by all tasks
  - Special properties: the bound is monotonic and controls pruning; tasks are non-destructive
- In all cases there are resource management problems
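A small sketch of the monotonic shared bound in a branch-and-bound framework: tasks may prune against a stale (but safe) bound, and updates only ever tighten it. The C11 compare-and-swap idiom below is illustrative; it is not the framework's actual interface.

```c
#include <stdatomic.h>
#include <stdio.h>

static _Atomic long best_bound = 1000000;   /* current best solution (lower is better) */

/* Tighten the bound only if 'candidate' improves on it; monotonic by construction. */
static void update_bound(long candidate) {
    long cur = atomic_load(&best_bound);
    while (candidate < cur &&
           !atomic_compare_exchange_weak(&best_bound, &cur, candidate)) {
        /* cur was reloaded by the failed CAS; retry while we still improve */
    }
}

/* A task prunes its subproblem against a possibly stale but always safe bound. */
static int should_prune(long subproblem_lower_bound) {
    return subproblem_lower_bound >= atomic_load(&best_bound);
}

int main(void) {
    update_bound(420);
    update_bound(999);            /* worse value: the bound stays at 420 */
    printf("bound = %ld, prune(500) = %d\n",
           atomic_load(&best_bound), should_prune(500));
    return 0;
}
```

Because the bound is monotonic and pruning only errs toward doing extra work, readers never need to synchronize with writers, which is the "special property" that makes the composition safe.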
11
How Much Bandwidth?
- Simple bandwidth model based on double buffering (see the worked example below)
  - Fill ½ of memory while computing on the other ½
  - Count the essential data movement and divide by time
  - Measure the absolute (required) bandwidth and compare to peak
- A new "Top BW 500"
- If latency is physics, bandwidth is money, and compute cores are ~free
  - Design hardware, algorithms, and software to peg bandwidth
  - Turn off cores when you can't utilize them
  - Observe Little's Law
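A worked instance of the double-buffering model and Little's Law. All numbers (8 GB working set, 0.05 s compute phase, 100 ns memory latency, 64-byte lines) are purely illustrative assumptions.

```c
#include <stdio.h>

int main(void) {
    double bytes_moved  = 4.0e9;    /* fill 1/2 of an 8 GB memory per phase */
    double phase_time_s = 0.05;     /* time spent computing on the other half */
    double required_bw  = bytes_moved / phase_time_s;    /* B/s needed to keep up */

    double latency_s    = 100e-9;   /* memory latency: the "physics" term */
    double line_bytes   = 64.0;
    /* Little's Law: outstanding requests = bandwidth * latency / request size */
    double concurrency  = required_bw * latency_s / line_bytes;

    printf("required bandwidth: %.1f GB/s\n", required_bw / 1e9);   /* 80.0 GB/s  */
    printf("outstanding cache lines needed: %.0f\n", concurrency);  /* 125 lines  */
    return 0;
}
```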
12
But We’re Not Saturating Bandwidth Today Controlled experiment: dual core has ½ memory bandwidth Negligible effect from cutting bandwidth in ½ Many of these (large apps) are though to be bw-limited Result from NERSC/SDSA: Shan, Wasserman, and He
13
Small Kernel, Known to Be Memory Bound
- Sparse matrix-vector multiply has ≤ 2 flops per word (usually less!); see the sketch below
- Not at peak bandwidth, even with multiple cores
- The hardware didn't observe Little's Law
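To make the arithmetic intensity concrete, here is a plain CSR sparse matrix-vector multiply: each nonzero costs one multiply and one add (2 flops) but requires reading a value, a column index, and an entry of x, which is why the kernel is bandwidth bound. The layout and names are conventional CSR, not a specific library's.

```c
#include <stdio.h>

/* y = A*x for a matrix stored in compressed sparse row (CSR) format */
static void spmv_csr(int nrows, const int *rowptr, const int *colind,
                     const double *val, const double *x, double *y) {
    for (int i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
            sum += val[k] * x[colind[k]];   /* 2 flops per nonzero, 3+ words read */
        y[i] = sum;
    }
}

int main(void) {
    /* 3x3 example: [[2,0,1],[0,3,0],[4,0,5]] */
    int rowptr[] = {0, 2, 3, 5};
    int colind[] = {0, 2, 1, 0, 2};
    double val[] = {2, 1, 3, 4, 5};
    double x[]   = {1, 1, 1};
    double y[3];
    spmv_csr(3, rowptr, colind, val, x, y);
    printf("y = [%g, %g, %g]\n", y[0], y[1], y[2]);   /* expect [3, 3, 9] */
    return 0;
}
```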
14
Multicore Impact on the Parallel Computing Ecosystem
- More education in parallelism for experts
- More programmers familiar with parallelism
- More access to parallel machines
- A rare opportunity for a new language or model
[Figure: the 1988 ecosystem vs. the 2000 ecosystem; one ecosystem or two?]