1
Tools and Libraries for Manycore Computing
Kathy Yelick, U.C. Berkeley and LBNL
2
Is a Single Programming Model Enough? No
- Higher-level productivity programming model
  - Usable by ~90% of programmers
  - Principles:
    - Determinism: behavioral determinism (not a deterministic memory sequence); no nondeterminism from concurrency
    - Independence: identified and verified
    - Virtualized: the machine (#cores, memory hierarchy) is hidden
- Lower-level efficiency layer
  - Usable by ~10% of all programmers (current C programmers)
  - Used for writing libraries, frameworks, and domain-specific runtimes
  - Abstract machine model: cores, thread contexts, and memory structures are visible
  - Virtualization by choice
- Portability is critical to the productivity of the community
- We should have multiple models at both levels
3
Strawman: Minimal Efficiency Layer
- Static parallelism
- Locality and resource management done by the user
  - Structure of the machine can be queried: #cores, #thread contexts, #memory spaces, etc.
- Pay for what you need
  - Direct load/store or bulk DMAs between spaces
  - Separate transfer & synch (see the sketch below)
  - Fast barriers, locks, atomics
- Supports systems with cache coherence or trivial coherence
[Figure: memory model with private on-chip, shared partitioned on-chip, and shared off-chip DRAM spaces]
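Below is a minimal, self-contained sketch of the split-phase "separate transfer & synch" idiom: initiate a bulk copy, overlap independent work, then synchronize explicitly. A pthread worker and memcpy stand in for a hardware DMA engine, and the dma_put_async/dma_wait names are hypothetical, not part of any particular efficiency layer.

```c
#include <pthread.h>
#include <string.h>
#include <stdio.h>

typedef struct {
    void *dst, *src;
    size_t nbytes;
    pthread_t tid;
} dma_handle_t;

static void *dma_worker(void *arg) {
    dma_handle_t *h = arg;
    memcpy(h->dst, h->src, h->nbytes);   /* stands in for a bulk DMA */
    return NULL;
}

/* Start the transfer and return immediately (split-phase). */
static void dma_put_async(dma_handle_t *h, void *dst, void *src, size_t nbytes) {
    h->dst = dst; h->src = src; h->nbytes = nbytes;
    pthread_create(&h->tid, NULL, dma_worker, h);
}

/* Separate synchronization point: block until the transfer completes. */
static void dma_wait(dma_handle_t *h) {
    pthread_join(h->tid, NULL);
}

int main(void) {
    static double src[1 << 20], dst[1 << 20];
    for (size_t i = 0; i < (size_t)(1 << 20); i++) src[i] = (double)i;

    dma_handle_t h;
    dma_put_async(&h, dst, src, sizeof src);   /* transfer overlaps the work below */

    double acc = 0.0;                          /* independent compute on other data */
    for (int i = 0; i < 1000; i++) acc += i * 0.5;

    dma_wait(&h);                              /* explicit, separate synch */
    printf("acc=%g, dst[42]=%g\n", acc, dst[42]);
    return 0;
}
```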
4
Virtualization by Choice
- Efficiency layer has virtualization add-ons
  - Virtualize the cores
    - Add a runtime with dynamic scheduling for a dynamic task tree (see the sketch below)
    - Add a semi-static schedule for a given graph
  - Virtualize the memory space (& add functionality)
    - Add caching (or ignore partitioning)
    - Add transactions
- Note:
  - These are just examples; the general strategy is to provide mechanisms and pay for what you need
  - Need serious support for verification and testing
[Figures: a general task graph with weights & structure; a divide-and-conquer tree with dynamic weights/structure]
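As an illustration of the "virtualize the cores" add-on, the following sketch uses OpenMP tasks as a stand-in dynamic scheduler for a divide-and-conquer tree whose work per subtree is only known at run time. OpenMP is an assumed example here, not the runtime the slides describe.

```c
#include <omp.h>
#include <stdio.h>

/* Divide-and-conquer sum; the scheduler maps spawned subtrees to cores. */
static long tree_sum(const long *a, long n) {
    if (n < 1024) {                      /* small leaves run sequentially */
        long s = 0;
        for (long i = 0; i < n; i++) s += a[i];
        return s;
    }
    long left, right;
    #pragma omp task shared(left)        /* dynamic task for the left subtree */
    left = tree_sum(a, n / 2);
    right = tree_sum(a + n / 2, n - n / 2);
    #pragma omp taskwait                 /* join before combining results */
    return left + right;
}

int main(void) {
    enum { N = 1 << 20 };
    static long a[N];
    for (long i = 0; i < N; i++) a[i] = 1;
    long total;
    #pragma omp parallel
    #pragma omp single
    total = tree_sum(a, N);
    printf("sum = %ld (expected %d)\n", total, N);
    return 0;
}
```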
5
Tools for Efficiency: Autotuning
- Automatic performance tuning
  - Use machine time in place of human time for tuning
  - Search over possible implementations (see the sketch below)
  - Use performance models to restrict the search space
- Autotuned libraries for dwarfs (up to 10x speedup)
  - Spectral (FFTW, Spiral)
  - Dense (PHiPAC, Atlas)
  - Sparse (Sparsity, OSKI)
  - Stencils/structured grids
- Are these compilers?
  - They don't transform source
  - There are compilers that use this kind of search
  - But not for the sparse case (transform the matrix)
[Figure: mflop/s vs. register block size (n0 x m0) for dense matrix-matrix multiply; best: 4x2 blocks, which uses a different data structure with more zeros]
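A minimal sketch of the search step, in the spirit of PHiPAC/ATLAS: time a handful of candidate register-block sizes for dense matrix multiply and keep the fastest. The candidate list, problem size, and timing loop are illustrative assumptions, not the actual search those libraries perform.

```c
#include <stdio.h>
#include <time.h>

#define N 256

static double A[N][N], B[N][N], C[N][N];

/* Blocked C += A*B, computing a (bi x bj) tile of the output at a time. */
static void matmul_blocked(int bi, int bj) {
    for (int ii = 0; ii < N; ii += bi)
        for (int jj = 0; jj < N; jj += bj)
            for (int k = 0; k < N; k++)
                for (int i = ii; i < ii + bi && i < N; i++)
                    for (int j = jj; j < jj + bj && j < N; j++)
                        C[i][j] += A[i][k] * B[k][j];
}

static double seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { A[i][j] = 1.0; B[i][j] = 2.0; }

    int candidates[][2] = { {1,1}, {2,2}, {4,2}, {2,4}, {4,4}, {8,4} };
    int best = 0; double best_t = 1e30;

    for (size_t c = 0; c < sizeof candidates / sizeof candidates[0]; c++) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) C[i][j] = 0.0;
        double t0 = seconds();
        matmul_blocked(candidates[c][0], candidates[c][1]);
        double t = seconds() - t0;
        printf("%dx%d block: %.3f s (%.1f Mflop/s)\n",
               candidates[c][0], candidates[c][1], t, 2.0 * N * N * N / t / 1e6);
        if (t < best_t) { best_t = t; best = (int)c; }   /* keep the fastest variant */
    }
    printf("best block size: %dx%d\n", candidates[best][0], candidates[best][1]);
    return 0;
}
```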
6
Sparse Matrices on Multicore
Autotuning boosts the performance of single and multiple cores.
7
Sparse Matrices on Multicore
Autotuning boosts the performance of multiple-socket SMPs.
8
Sparse Matrices on Multicore
- Autotuning is more important than parallelism! (It's the memory system, stupid.)
- And don't think running one MPI process per core is good enough for performance.
9
Productivity: Coordination and Composition
- The leverage of expert programmers comes from abstraction and composition
  - Calling traditional libraries is easy
  - Calling parallel frameworks is harder
    - Divide-and-conquer parallelism
    - Guarded atomic commands
    - Usual serial composition
- Two key challenges
  - Correctness of the composition: ensuring independence; limited sharing
  - Performance: resource management during composition; domain-specific OS/runtime support
10
Productivity: Coordination and Composition
- The leverage of expert programmers comes from abstraction and composition
  - Calling traditional libraries is easy
  - Calling parallel frameworks is harder (avoid concurrency errors, not just races on memory cells)
- Shared regions of partitioned data
  - Often implemented as copies (ghost regions)
  - The semantic property is phases: in a read-only phase, tasks can read a larger set; in a read-write phase, they can write a smaller set
- Branch-and-bound search (see the sketch below)
  - Divide-and-conquer framework
  - The bound is shared state, updated by all tasks
  - Special properties: the bound is monotonic and controls pruning; tasks are non-destructive
- In all cases there are resource management problems
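A small sketch of the monotonic shared bound in a branch-and-bound framework: tasks may prune against a stale (but safe) bound, and updates only ever tighten it. The C11 compare-and-swap idiom below is illustrative; it is not the framework's actual interface.

```c
#include <stdatomic.h>
#include <stdio.h>

static _Atomic long best_bound = 1000000;   /* current best solution (lower is better) */

/* Tighten the bound only if 'candidate' improves on it; monotonic by construction. */
static void update_bound(long candidate) {
    long cur = atomic_load(&best_bound);
    while (candidate < cur &&
           !atomic_compare_exchange_weak(&best_bound, &cur, candidate)) {
        /* cur was reloaded by the failed CAS; retry while we still improve */
    }
}

/* A task prunes its subproblem against a possibly stale but always safe bound. */
static int should_prune(long subproblem_lower_bound) {
    return subproblem_lower_bound >= atomic_load(&best_bound);
}

int main(void) {
    update_bound(420);
    update_bound(999);            /* worse value: the bound stays at 420 */
    printf("bound = %ld, prune(500) = %d\n",
           atomic_load(&best_bound), should_prune(500));
    return 0;
}
```

Because the bound is monotonic and pruning only errs toward doing extra work, readers never need to synchronize with writers, which is the "special property" that makes the composition safe.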
11
How Much Bandwidth?
- Simple bandwidth model based on double buffering (see the worked example below)
  - Fill ½ of memory while computing on the other ½
  - Count the essential data movement and divide by time
  - Measure the absolute (required) bandwidth and compare to peak
- A new "Top BW 500"
- If latency is physics, bandwidth is money, and compute cores are ~free
  - Design hardware, algorithms, and software to peg bandwidth
  - Turn off cores when you can't utilize them
  - Observe Little's Law
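A worked instance of the double-buffering model and Little's Law. All numbers (8 GB working set, 0.05 s compute phase, 100 ns memory latency, 64-byte lines) are purely illustrative assumptions.

```c
#include <stdio.h>

int main(void) {
    double bytes_moved  = 4.0e9;    /* fill 1/2 of an 8 GB memory per phase */
    double phase_time_s = 0.05;     /* time spent computing on the other half */
    double required_bw  = bytes_moved / phase_time_s;    /* B/s needed to keep up */

    double latency_s    = 100e-9;   /* memory latency: the "physics" term */
    double line_bytes   = 64.0;
    /* Little's Law: outstanding requests = bandwidth * latency / request size */
    double concurrency  = required_bw * latency_s / line_bytes;

    printf("required bandwidth: %.1f GB/s\n", required_bw / 1e9);   /* 80.0 GB/s  */
    printf("outstanding cache lines needed: %.0f\n", concurrency);  /* 125 lines  */
    return 0;
}
```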
12
But We’re Not Saturating Bandwidth Today Controlled experiment: dual core has ½ memory bandwidth Negligible effect from cutting bandwidth in ½ Many of these (large apps) are though to be bw-limited Result from NERSC/SDSA: Shan, Wasserman, and He
13
Small Kernel, Known to Be Memory Bound
- Sparse matrix-vector multiply has ≤ 2 flops per word (usually less!); see the sketch below
- Not at peak bandwidth, even with multiple cores
- The hardware didn't observe Little's Law
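To make the arithmetic intensity concrete, here is a plain CSR sparse matrix-vector multiply: each nonzero costs one multiply and one add (2 flops) but requires reading a value, a column index, and an entry of x, which is why the kernel is bandwidth bound. The layout and names are conventional CSR, not a specific library's.

```c
#include <stdio.h>

/* y = A*x for a matrix stored in compressed sparse row (CSR) format */
static void spmv_csr(int nrows, const int *rowptr, const int *colind,
                     const double *val, const double *x, double *y) {
    for (int i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
            sum += val[k] * x[colind[k]];   /* 2 flops per nonzero, 3+ words read */
        y[i] = sum;
    }
}

int main(void) {
    /* 3x3 example: [[2,0,1],[0,3,0],[4,0,5]] */
    int rowptr[] = {0, 2, 3, 5};
    int colind[] = {0, 2, 1, 0, 2};
    double val[] = {2, 1, 3, 4, 5};
    double x[]   = {1, 1, 1};
    double y[3];
    spmv_csr(3, rowptr, colind, val, x, y);
    printf("y = [%g, %g, %g]\n", y[0], y[1], y[2]);   /* expect [3, 3, 9] */
    return 0;
}
```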
14
Multicore Impact on the Parallel Computing Ecosystem
- More education in parallelism for experts
- More programmers familiar with parallelism
- More access to parallel machines
- A rare opportunity for a new language or model
[Figure: the 1988 ecosystem vs. the 2000 ecosystem; one ecosystem or two?]