Are New Languages Necessary for Manycore?
David I. August, Department of Computer Science, Princeton University
THIS is the Problem!
[Figure: SPEC CPU integer performance plotted over time, with 2004 and a "?" marked.]
Why New Multicore Languages Will Fail
Money is earned by relieving customer pain
The Market
Legacy, Legacy, Legacy
Programmers adopt new programming models
Parallel programming is more difficult
Parallel programming models have longevity issues
Automatic Thread Extraction (ATE)
Automatic Thread Extraction
“That isn't to say we are parallelizing arbitrary C code, that's a fool's errand!” – Richard Lethin
“Compiler can’t determine a tree from a graph…” – Burton Smith
“Compiler can’t determine dependences without type information. Even then…” – Burton Smith
“Decades of automatic parallelization work has been a failure…” – James Larus
“All that icky pointer chasing code...” – Tim Mattson
How To Get Parallelism For Multicore?
Nine months ago, with an open mind…
A priori select ALL C programs from SPEC CINT 2000
Our objective function (in priority order):
1. Extract meaningful parallelism
2. Prefer automatic over manual
3. Minimize impact to the programmer when manual
Our Results
[Table: columns Benchmark, Threads at Peak, Speedup, LOCs Changed; rows 164.gzip, vpr, gcc, mcf, crafty, parser, perlbmk, gap, vortex, bzip, twolf, GEOMEAN, ARITHMEAN. Cell values are missing from this transcript.]
M.L.O.P.: 5 Generations, 32 Cores, 5.3x Speedup
Our Recipe
Recent Compiler Technology: Decoupled Software Pipelining (DSWP) [MICRO 05]; Parallel-Stage DSWP (PS-DSWP); Speculative DSWP (Spec-DSWP) [PACT 07]
Existing Technology: Speculative DOALL, TLS; Targeted Memory Profiling; Procedure Boundary Elimination [PLDI 06]
Hardware Support: Compiler-Controlled Speculation; Streaming Communication [MICRO 06]
Typical Example: 197.parser
Threads run on multicore model with Itanium 2 cores.
Stages: Find English Sentences → Parse Sentences (95%) → Emit Results
DSWP; PS-DSWP (Spec DOALL Middle Stage)
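The three-stage shape on this slide can be sketched by hand as a small threaded pipeline. This is only an illustration of the structure, not the compiler transformation itself: DSWP and PS-DSWP are applied automatically to C code, whereas here the stages are written out manually. The names `find_sentences`, `parse_worker`, `emit_results`, and `run_pipeline` are hypothetical, the "parse" step is a stand-in that only splits a sentence into words, and the speculation in the slide's "Spec DOALL" middle stage is not modeled. In standard CPython the global interpreter lock also limits real speedup, so read this as a sketch of the dataflow only.

```python
import queue
import threading

SENTINEL = None  # end-of-stream marker passed between stages


def find_sentences(text, out_q, n_workers):
    """Stage 1 (sequential): find sentences and tag each with its position."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    for index, sentence in enumerate(sentences):
        out_q.put((index, sentence))
    for _ in range(n_workers):  # one end marker per middle-stage worker
        out_q.put(SENTINEL)


def parse_worker(in_q, out_q):
    """Stage 2 (replicated): stand-in for parsing; iterations are independent."""
    while True:
        item = in_q.get()
        if item is SENTINEL:
            out_q.put(SENTINEL)
            return
        index, sentence = item
        out_q.put((index, sentence.split()))


def emit_results(in_q, n_workers, results):
    """Stage 3 (sequential): restore the original order before emitting."""
    pending = {}
    next_index = 0
    finished_workers = 0
    while finished_workers < n_workers:
        item = in_q.get()
        if item is SENTINEL:
            finished_workers += 1
            continue
        index, parsed = item
        pending[index] = parsed
        while next_index in pending:
            results.append(pending.pop(next_index))
            next_index += 1


def run_pipeline(text, n_workers=4):
    """Wire the stages together with queues; data only flows forward."""
    q12, q23 = queue.Queue(), queue.Queue()
    results = []
    threads = [threading.Thread(target=find_sentences, args=(text, q12, n_workers))]
    threads += [
        threading.Thread(target=parse_worker, args=(q12, q23))
        for _ in range(n_workers)
    ]
    threads.append(threading.Thread(target=emit_results, args=(q23, n_workers, results)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because the queues only carry data forward, the first and last stages stay sequential while the middle stage, which the slide labels with 95%, is the one that is replicated across workers.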
What We Learned
1. A new way of thinking about dependences: Go With the Flow
2. TLP is easier to extract than ILP
3. A holistic approach is better
4. A limitation exists in the sequential model: Determinism
Determinism: A Double-Edged Sword
while( ):
    x = Rand()
int Rand():
    state = f2(state)
    return f1(state)
DOALL / SEQUENTIAL
56 LOCs in 11 programs: 22 annotations
Only 2 programs needed more
Most common culprit: Custom Allocators
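One reading of the Rand() example is that the hidden `state` update ties every iteration to the one before it, so a strictly deterministic loop must run sequentially, while a loop that accepts the random values in any order can run as a DOALL. The sketch below illustrates that reading under stated assumptions: the bodies of `f1` and `f2` are invented here (a simple linear congruential step), and `Rng`, `sequential`, and `doall` are hypothetical names. It is not the annotation mechanism used in the talk.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def f2(state):
    """Assumed state update: one linear congruential step (illustrative only)."""
    return (1103515245 * state + 12345) % 2**31


def f1(state):
    """Assumed output function: keep the high bits of the state."""
    return state >> 16


class Rng:
    """Rand() from the slide, with its state made explicit."""

    def __init__(self, seed):
        self.state = seed
        self._lock = threading.Lock()

    def rand(self):
        # Each call is atomic, but the order of calls across threads is not fixed.
        with self._lock:
            self.state = f2(self.state)
            return f1(self.state)


def sequential(n, seed):
    """Deterministic: iteration i always receives the i-th random value."""
    rng = Rng(seed)
    return [rng.rand() for _ in range(n)]


def doall(n, seed, workers=4):
    """Iterations may call rand() in any order: same values, different assignment."""
    rng = Rng(seed)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda _: rng.rand(), range(n)))
```

In this sketch both versions draw the same set of values from the generator; what the parallel version gives up is the guarantee about which iteration receives which value.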
What about Manycore?
Multicore
  New languages aren’t necessary
  Legacy code easily adjusted
Manycore
  Implicitly Parallel Sequential Programming
  No optimization for sequential (custom allocators)
  Points of non-determinism specified
  Parallel algorithms in sequential codes
  Debuggability, Understandability, Sanity
The Answer Originates with ATE
The Old Way: PL folks would write languages, Architecture folks would make HW, and Compiler folks would dutifully connect the two.
This will fail for Manycore:
  Unduly burden the programmer
  Performance will suffer
There’s a New Way…
How Code Was Transformed
Benchmark | LOC (All) | LOC (Model) | Model Techniques | Compiler Techniques Applied
164.gzip | 26 | 2 | Y-Branch | TLS Memory, DSWP
175.vpr | 1 | 1 | PURE | Alias, Value, & Control Spec, TLS Mem, DSWP
176.gcc | 17 | 7 | PURE | Alias & Control Spec, TLS Mem, DSWP
181.mcf | 0 | 0 | - | Alias, Silent Store, & Control Spec, TLS Mem, DSWP, Nested
186.crafty | 9 | 9 | PURE | TLS Mem, DSWP, Nested
197.parser | 2 | 2 | PURE | TLS Mem, DSWP
253.perlbmk | 0 | 0 | - | Alias, Control, & Value Spec, DSWP
254.gap | 1 | 1 | PURE | TLS Memory, DSWP, Alias Spec
255.vortex | 0 | 0 | - | Alias & Value Spec, TLS Mem, DSWP
256.bzip2 | 0 | 0 | - | TLS Memory, DSWP
300.twolf | 1 | 1 | PURE | Alias & Control Spec, TLS Mem, DSWP
PURE
Y-Branch
SPEC 2006: 403.gcc
Threads run on multicore model with Itanium 2 cores.