Presentation transcript:

Efficiency Programming for the (Productive) Masses
Armando Fox, Bryan Catanzaro, Shoaib Kamil, Yunsup Lee, Ben Carpenter, Erin Carson, Krste Asanovic, Dave Patterson, Kurt Keutzer
UC Berkeley Parallel Computing Lab / UPCRC

Make productivity programmers efficient, and efficiency programmers productive?
- Productivity-level languages (PLLs): Python, Ruby
  - High-level abstractions well matched to the application domain => 5x faster development and 3-10x fewer lines of code
  - Used by >90% of programmers
- Efficiency-level languages (ELLs): C/C++, CUDA, OpenCL
  - >5x longer development time
  - Potential 10x-100x performance by exposing the hardware model
  - <10% of programmers, yet their work is poorly reused
- Goal: combine the two, 5x faster development and 10x-100x performance. Can we raise the level of abstraction and still get performance?

Capture patterns instead of domains?
- Efficiency programmers know how to target computation patterns to hardware:
  - stencil/SIMD codes => GPUs
  - sparse matrix => communication-avoiding algorithms on multicore
  - big-finance Monte Carlo simulation => MapReduce
- Libraries? Useful, but they don't raise the abstraction level
- How do we make ELL work accessible to more PLL programmers?

Stovepipes: Connect Pattern to Platform
[Diagram: the traditional layered stack routes app domains (virtual worlds, data visualization, robotics, music) through computation domains (rendering, probabilistic, physics, linear algebra), a common language substrate, and a thick runtime & OS down to hardware (OoO, GPU, SIMD, FPGA, cloud). The "stovepipes" alternative connects applications through motifs/patterns (dense matrix, sparse matrix, stencil) directly to platforms over a thin runtime, via mappings such as Dense-to-GPU, Dense-to-OoO, Stencil-to-SIMD, and Stencil-to-FPGA. Humans must produce these mappings.]

SEJITS: Selective, Embedded Just-in-Time Specialization
- Productivity programmers write in a general-purpose, modern, high-level PLL
- The SEJITS infrastructure specializes computation patterns selectively at runtime
- Specialization uses runtime information to generate and JIT-compile ELL code targeted to the hardware
- Embedded because the PLL's own machinery enables it (vs. extending the PLL interpreter)

Specifically...
When a specializable function is called:
- determine whether a specializer is available for the current platform
- if not: continue executing normally in the PLL
If a specializer is found, it can:
- manipulate/traverse the AST of the function
- emit & JIT-compile ELL source code
- dynamically link the compiled code to the PLL interpreter
Specializers are written in the PLL itself. The necessary features (introspection, AST access) are present in modern PLLs, but absent from older widely used PLLs.
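
A minimal sketch of this dispatch logic in Python, not the actual SEJITS implementation; the decorator, registry, and emit_and_compile names are assumptions for illustration. With an empty registry it simply falls back to the interpreted function:

    import ast
    import inspect

    # Hypothetical registry mapping (pattern, platform) -> specializer object.
    SPECIALIZERS = {}

    def current_platform():
        return "openmp"  # placeholder; a real system would probe the hardware

    def specializable(pattern):
        """Mark a function as a candidate for runtime specialization."""
        def wrap(fn):
            def dispatch(*args, **kwargs):
                spec = SPECIALIZERS.get((pattern, current_platform()))
                if spec is None:
                    # No specializer for this platform: run normally in the PLL.
                    return fn(*args, **kwargs)
                # Grab the function's source and parse it into an AST...
                tree = ast.parse(inspect.getsource(fn))
                # ...then emit and JIT-compile ELL code, link it, and call it.
                native = spec.emit_and_compile(tree, args)
                return native(*args, **kwargs)
            return dispatch
        return wrap

    @specializable("stencil")
    def laplacian(in_grid, out_grid):
        ...  # plain-PLL fallback implementation goes here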

[Diagram: Selective Embedded JIT Specialization. A productivity app (.py) runs on the PLL interpreter; for a specializable function, the specializer emits .c source, the cc/ld toolchain compiles it into a cached .so, and the result is dynamically linked back into the running app on the OS/HW. SEJITS makes tuning decisions per-function (not per-app).]

Example: Stencil Computation in Ruby

Productivity-level Ruby source:

    class LaplacianKernel < Kernel
      def kernel(in_grid, out_grid)
        in_grid.each_interior do |point|
          point.neighbors(1).each do |x|
            out_grid[point] += 0.2*x.val
          end
        end
      end
    end

The specializer emits OpenMP C code:

    VALUE kern_par(int argc, VALUE* argv, VALUE self) {
      /* unpack arrays into in_grid and out_grid */
      #pragma omp parallel for default(shared) private(t_6,t_7,t_8)
      for (t_8=1; t_8<256-1; t_8++) {
        for (t_7=1; t_7<256-1; t_7++) {
          for (t_6=1; t_6<256-1; t_6++) {
            int center = INDEX(t_6,t_7,t_8);
            out_grid[center] = out_grid[center] + 0.2*in_grid[INDEX(t_6-1,t_7,t_8)];
            ...
            out_grid[center] = out_grid[center] + 0.2*in_grid[INDEX(t_6,t_7,t_8+1)];
          }}}
      return Qtrue;
    }

The specializer uses introspection to grab parameters and inspect the AST of the computation; the emitted OpenMP code is 1000x-2000x faster than the pure Ruby version.

Example: Sparse Matrix-Vector Multiply in Python

    # Gather nonzero entries,
    # multiply them by vector,
    # do for each column
    [remaining Python code not preserved in the transcript]

The specializer outputs CUDA for nvcc: SEJITS leverages downstream toolchains. (B. Catanzaro et al., joint work with NVIDIA Research.)
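
A hypothetical sketch of the kind of productivity-level code those surviving comments describe; the data layout and names are assumptions, not the slide's original code:

    # Column-oriented SpMV in plain Python: y += A[:, j] * x[j] for each
    # column j (assumes a square matrix for brevity).
    def spmv(cols_rows, cols_vals, x):
        # cols_rows[j] / cols_vals[j] hold the row indices and nonzero
        # values of column j of the sparse matrix.
        y = [0.0] * len(x)
        for j, xj in enumerate(x):                        # do for each column
            for i, v in zip(cols_rows[j], cols_vals[j]):  # gather nonzero entries
                y[i] += v * xj                            # multiply them by vector
        return y

    # 2x2 example: A = [[2, 0], [1, 3]] stored by columns.
    print(spmv([[0, 1], [1]], [[2.0, 1.0], [3.0]], [1.0, 1.0]))  # [2.0, 4.0]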

SEJITS in the Cloud: Spark & Nexus
[Diagram: the productivity app (.py) and its specializers run on the PLL; the specializer emits .scala source, scalac compiles it, and the result runs as cached Spark workers on Nexus on Eucalyptus or EC2.]
- Spark enables cloud-distributed, persistent, fault-tolerant shared parallel data structures
- Relies on the Scala runtime and data-parallel abstractions
- Relies on the Nexus (cloud resource management) layer

Example: Logistic Regression using Spark/Scala (in progress)

M. Zaharia et al., "Spark: Cluster Computing with Working Sets," HotCloud '09.
B. Hindman et al., "Nexus: A Common Substrate for Cluster Computing," HotCloud '09.
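
The slide's Scala code is not preserved in this transcript. For reference, the running example in the Spark paper is a batch-gradient logistic regression loop over a cached dataset; a minimal plain-Python analogue of that computation (an illustration, not the slide's code) looks like this:

    import math
    import random

    def train(points, iterations=10, lr=0.1):
        """points: list of (x, y) with feature vector x and label y in {-1, +1}."""
        dim = len(points[0][0])
        w = [random.random() for _ in range(dim)]
        for _ in range(iterations):
            # Accumulate the logistic-loss gradient over the whole dataset;
            # in Spark this is the map/reduce over a cached RDD of points.
            grad = [0.0] * dim
            for x, y in points:
                dot = sum(wi * xi for wi, xi in zip(w, x))
                scale = (1.0 / (1.0 + math.exp(-y * dot)) - 1.0) * y
                for i, xi in enumerate(x):
                    grad[i] += scale * xi
            w = [wi - lr * gi for wi, gi in zip(w, grad)]
        return w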

SEJITS in the Cloud
[Diagram: the same architecture with a Hadoop backend: the productivity app (.py) and specializer run on the PLL; the specializer emits .java source, javac compiles it, and the result runs as cached jobs on a Hadoop master on Nexus in the cloud.]

SEJITS for Cloud Computing
- Idea: the same Python app runs on the desktop, on manycore, and in the cloud
- Cloud/multicore synergy: specialize intra-node as well as generate cloud code (see the sketch below)
  - Cloud: emit JIT-able code for Spark (Scala), Hadoop (Java), MPI (C), ...
  - Single node: emit JIT-able code for OpenCL, CUDA, OpenMP, ...
- Combine abstractions in one app
- Remember: you can always fall back to the PLL
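
One way to picture the per-pattern, per-platform backend choice; a hypothetical sketch whose registry and function names are assumptions, not the actual SEJITS API:

    # Map (pattern, platform) to a code emitter; missing entries mean
    # "no specializer: fall back to the interpreted PLL implementation."
    EMITTERS = {
        ("stencil", "multicore"): "openmp",
        ("stencil", "gpu"):       "cuda",
        ("map",     "cloud"):     "spark_scala",
        ("map",     "cluster"):   "mpi_c",
    }

    def pick_backend(pattern, platforms):
        """Return the first available (platform, emitter) pair, or None."""
        for p in platforms:
            emitter = EMITTERS.get((pattern, p))
            if emitter is not None:
                return p, emitter
        return None

    print(pick_backend("stencil", ["gpu", "multicore"]))  # ('gpu', 'cuda')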

Questions
- Won't we need lots and lots of specializers?
  - If the ParLab "motifs" bet is correct, tens of specializers will go a long way
- What about libraries, frameworks, etc.?
  - SEJITS is complementary to frameworks
  - Most libraries are written for ELLs, and ELLs lack features that promote code reuse and don't raise the abstraction level
- Why isn't this just as hard as a magic compiler?
  - Specializers are written by human experts; SEJITS allows crowdsourcing them
- Will programmers accustomed to Matlab/Fortran learn functional style, list comprehensions, etc.?

Conclusion
- SEJITS enables a code-generation strategy per-function, not per-app
- Uniform approach to productive programming: the same app on cloud, multicore, or autotuned libraries
- Combine multiple frameworks/abstractions in the same app
- Research enabler:
  - incrementally develop specializers for different motifs or prototype hardware
  - no need for a full compiler & toolchain just to get started

Questions