EECS Electrical Engineering and Computer Sciences, BERKELEY PAR LAB, PARALLEL COMPUTING LABORATORY
Efficiency Programming for the (Productive) Masses
  Armando Fox, Bryan Catanzaro, Shoaib Kamil, Yunsup Lee, Ben Carpenter, Erin Carson, Krste Asanovic, Dave Patterson, Kurt Keutzer
  UC Berkeley Parallel Computing Lab/UPCRC
Make productivity programmers efficient, and efficiency programmers productive?
  Productivity level language (PLL): Python, Ruby
    high-level abstractions well-matched to application domain => 5x faster development and 3-10x fewer lines of code
    >90% of programmers
  Efficiency level language (ELL): C/C++, CUDA, OpenCL
    >5x longer development time
    potential 10x-100x performance by exposing HW model
    <10% programmers, yet their work is poorly reused
  5x development time, 10x-100x performance!
  Raise level of abstraction and get performance?
Capture patterns instead of domains?
  Efficiency programmers know how to target computation patterns to hardware
    stencil/SIMD codes => GPUs
    sparse matrix => communication-avoiding algos on multicore
    Big finance Monte Carlo sim => MapReduce
  Libraries? Useful, but don't raise abstraction level
  How to make ELL work accessible to more PLL programmers?
Stovepipes: Connect Pattern to Platform
  [Figure, Traditional Layers: App domains (Virt. worlds, Data viz., Robotics, Music); Computation domains (Rendering, Probabilistic, Physics, Lin. Alg.); Language (Common language substrate); Thick Runtime (Runtime & OS); Hardware (OOO, GPU, SIMD, FPGA, Cloud)]
  [Figure, Stovepipes: Applications (Virt. worlds, Data viz., Robotics, Music); Motifs/Patterns (Sparse Matrix, Dense Matrix, Stencil); stovepipes (Dense to GPU, Stencil to SIMD, Stencil to FPGA, Dense to OoO); Thin Runtime (Runtime & OS); Hardware (OOO, GPU, SIMD, FPGA, Cloud)]
  Humans must produce these
SEJITS: Selective, Embedded Just-in-Time Specialization
  Productivity programmers write in general purpose, modern, high level PLL
  SEJITS infrastructure specializes computation patterns selectively at runtime
  Specialization uses runtime info to generate and JIT-compile ELL code targeted to hardware
  Embedded because the PLL's own machinery enables it (vs. extending PLL interpreter)
Specifically...
  When specializable function is called:
    determine if specializer available for current platform
    if no: continue executing normally in PLL
  If a specializer is found, it can:
    manipulate/traverse AST of the function
    emit & JIT-compile ELL source code
    dynamically link compiled code to PLL interpreter
  Specializers written in PLL
  Necessary features present in modern PLLs, but absent from older widely-used PLLs
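The call-time behavior described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the SEJITS implementation: the `SPECIALIZERS` registry, `current_platform`, the `specializable` decorator, and `fake_dot_specializer` are all hypothetical names, and the stand-in specializer returns an ordinary Python callable where a real one would emit, compile, and link ELL code.

```python
import functools
import platform

SPECIALIZERS = {}  # hypothetical registry: (function name, platform tag) -> specializer


def current_platform():
    """Hypothetical platform tag; a real system would probe for GPU, SIMD, etc."""
    return platform.machine() or "unknown"


def specializable(func):
    """Mark func as specializable; look for a specializer on the first call."""
    chosen = {}

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if "impl" not in chosen:
            specializer = SPECIALIZERS.get((func.__name__, current_platform()))
            impl = specializer(func) if specializer else None
            # No specializer, or it declined: keep executing the plain PLL code.
            chosen["impl"] = impl if impl is not None else func
        return chosen["impl"](*args, **kwargs)

    wrapper.is_specialized = lambda: chosen.get("impl", func) is not func
    return wrapper


@specializable
def scale(xs, a):
    # No specializer is registered for this function, so it always runs as-is.
    return [a * x for x in xs]


def fake_dot_specializer(func):
    """Stand-in: a real specializer would emit, compile, and link ELL code here."""
    def compiled_dot(xs, ys):
        return sum(x * y for x, y in zip(xs, ys))
    return compiled_dot


SPECIALIZERS[("dot", current_platform())] = fake_dot_specializer


@specializable
def dot(xs, ys):
    total = 0
    for x, y in zip(xs, ys):
        total += x * y
    return total
```

Calling `scale` falls back to the original Python body, while `dot` is routed to the registered stand-in. The decision is made per function, so one app can mix specialized and unspecialized code.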
Selective Embedded JIT Specialization
  [Figure: Productivity app (.py) -> PLL -> SEJITS Specializer -> .c -> cc/ld -> .so, all on OS/HW]
  SEJITS makes tuning decisions per-function (not per-app)
Example: Stencil Computation in Ruby
  class LaplacianKernel < Kernel
    def kernel(in_grid, out_grid)
      in_grid.each_interior do |point|
        point.neighbors(1).each do |x|
          out_grid[point] += 0.2*x.val
        end
      end
    end
  end

  Use introspection to grab parameters, inspect AST of computation
  Specializer emits OpenMP:

  VALUE kern_par(int argc, VALUE* argv, VALUE self) {
    /* unpack arrays into in_grid and out_grid */
    #pragma omp parallel for default(shared) private (t_6,t_7,t_8)
    for (t_8=1; t_8<256-1; t_8++) {
      for (t_7=1; t_7<256-1; t_7++) {
        for (t_6=1; t_6<256-1; t_6++) {
          int center = INDEX(t_6,t_7,t_8);
          out_grid[center] = (out_grid[center] + (0.2*in_grid[INDEX(t_6-1,t_7,t_8)]));
          ...
          out_grid[center] = (out_grid[center] + (0.2*in_grid[INDEX(t_6,t_7,t_8+1)]));
        }
      }
    }
    return Qtrue;
  }

  1000x-2000x faster than Ruby
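The introspection step can be illustrated with Python's standard `ast` module. This is only a sketch of the idea, not the Ruby specializer from the slide: `KERNEL_SOURCE` is a hypothetical Python analog of the Laplacian kernel (the `interior()` and `neighbors()` methods are assumed names), and `describe_kernel` is an illustrative helper. The kernel source is parsed, never executed.

```python
import ast

# Hypothetical Python analog of the Ruby kernel above; it is only parsed.
KERNEL_SOURCE = '''
def kernel(in_grid, out_grid):
    for point in in_grid.interior():
        for x in point.neighbors(1):
            out_grid[point] += 0.2 * x.val
'''


def describe_kernel(source):
    """Return parameter names, loop nesting depth, and numeric constants."""
    func = ast.parse(source).body[0]
    params = [arg.arg for arg in func.args.args]
    constants = [
        node.value
        for node in ast.walk(func)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float))
    ]

    def depth(node):
        # Deepest chain of nested `for` loops at or below this node.
        inner = max((depth(child) for child in ast.iter_child_nodes(node)), default=0)
        return inner + (1 if isinstance(node, ast.For) else 0)

    return {"params": params, "loop_depth": depth(func), "constants": constants}


info = describe_kernel(KERNEL_SOURCE)
```

A specializer could use facts like these (two grid parameters, a doubly nested loop, a neighbor distance of 1, a weight of 0.2) to decide what loop nest to emit in the ELL.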
Example: Sparse Matrix-Vector Multiply in Python
  # Gather nonzero entries,
  # multiply them by vector,
  # do for each column
  Specializer outputs CUDA for nvcc: SEJITS leverages downstream toolchains
  B. Catanzaro et al., joint work with NVIDIA Research
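Only the comments of the slide's Python code survive, so the following is a guess at the shape of such a kernel rather than the authors' code. It assumes the matrix is stored in ELL format (padded nonzeros laid out column by column), which is an assumption made here to match "do for each column"; the name `spmv_ell` and its parameters are hypothetical. It is plain Python with no specializer attached.

```python
def spmv_ell(data, idx, x):
    """Compute y = A*x for A in ELL format (storage format assumed for this sketch).

    data[k][i] is the k-th stored entry of row i (0.0 where a row is padded),
    idx[k][i] is that entry's column index in A.
    """
    def column_product(col_vals, col_idx):
        # Gather the vector entries this column of nonzeros needs...
        gathered = [x[j] for j in col_idx]
        # ...and multiply them elementwise by the stored matrix entries.
        return [a * xj for a, xj in zip(col_vals, gathered)]

    # Do this for each stored column, then sum the partial products per row.
    partials = [column_product(vals, cols) for vals, cols in zip(data, idx)]
    return [sum(row) for row in zip(*partials)]


# A = [[1, 0, 2],
#      [0, 3, 0],
#      [4, 0, 5]]   stored with two entries per row
data = [[1.0, 3.0, 4.0], [2.0, 0.0, 5.0]]
idx = [[0, 1, 0], [2, 0, 2]]
y = spmv_ell(data, idx, [1.0, 2.0, 3.0])
```

The functional style (comprehensions over columns, no explicit index arithmetic) is the kind of structure a specializer can map onto data-parallel ELL code.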
SEJITS in the Cloud
  [Figure: Productivity app (.py) -> PLL -> SEJITS Specializer -> .scala -> scalac -> Spark worker, on Nexus on Eucalyptus or EC2]
  Spark & Nexus
    Spark enables cloud-distributed, persistent, fault-tolerant shared parallel data structures
    Relies on Scala runtime and data-parallel abstractions
    Relies on Nexus (cloud resource management) layer
Example: Logistic regression using Spark/Scala (in progress)
  M. Zaharia et al., Spark: Cluster Computing With Working Sets, HotCloud '09
  B. Hindman et al., Nexus: A Common Substrate for Cluster Computing, HotCloud '09
SEJITS in the Cloud
  [Figure: Productivity app (.py) -> PLL -> SEJITS Specializer -> .java -> javac -> Hadoop master, on Nexus on Cloud]
SEJITS for Cloud Computing
  Idea: same Python app runs on desktop, on manycore, and in cloud
  Cloud/multicore synergy: specialize intra-node as well as generate cloud code
    Cloud: Emit JIT-able code for Spark (Scala), Hadoop (Java), MPI (C),...
    Single node: Emit JIT-able code for OpenCL, CUDA, OpenMP,...
  Combine abstractions in one app
  Remember...can always fall back to PLL
Questions
  Won't we need lots & lots of specializers?
    if ParLab motifs bet is correct, ~10s of specializers will go a long way
  What about libraries, frameworks, etc.?
    SEJITS is complementary to frameworks
    Most libraries are written for ELLs, which lack features that promote code reuse; libraries also don't raise the abstraction level
  Why isn't this just as hard as a magic compiler?
    Specializers written by human experts
    SEJITS allows crowdsourcing them
  Will programmers accustomed to Matlab/Fortran learn functional style, list comprehensions, etc.?
Conclusion
  SEJITS enables code-generation strategy per-function, not per-app
  Uniform approach to productive programming
    same app on cloud, multicore, autotuned libraries
    Combine multiple frameworks/abstractions in same app
  Research enabler
    Incrementally develop specializers for different motifs or prototype HW
    Don't need full compiler & toolchain just to get started
Questions