Introduction to the On-line Course: CS267 Applications of Parallel Computers
Jim Demmel, EECS & Math Departments
www.cs.berkeley.edu/~demmel/cs267_Spr14/


Outline
- 3 free on-line parallel computing courses, offered by UC Berkeley and XSEDE
- CS267 – Applications of Parallel Computers
  - Outline of course
  - Big Idea #1: Common Computational Patterns
  - Big Idea #2: Avoiding Communication
  - Who takes it? What final projects do people do?
- Parallel Shortcourse (CS267 in 3 days)
- CS194 – Engineering Parallel Software (for Performance)
  - Undergraduate version – emphasis on (parallel) software engineering

CS267 – Applications of Parallel Computers
- UC Berkeley 1-semester graduate class (some ugrads too)
- Offered every Spring semester, webcast live
  - Spring 2015: Jan 20 – Apr 30, T Th 9:30-11:00
  - Local students use DOE supercomputers at NERSC
- Archived videos broadcast by XSEDE
  - Carefully edited lectures with in-line quizzes
  - Available as a SPOC (Small, Personalized On-line Course) to students for credit at their universities, with a local instructor to give official grades
  - Free NSF supercomputer accounts to do autograded homework and class projects
  - UC Berkeley teaching assistants help answer questions for remote students
  - Contact Steve Gordon if …

Motivation and Outline of Course
- Why powerful computers must be parallel processors (including all your laptops and handhelds)
- Large Computational Science and Engineering (CSE) problems require powerful computers (commercial problems too)
- Why writing (fast) parallel programs is hard (but things are improving)
- Structure of the course

Course summary and goals
- Goal: teach grads and advanced undergrads from diverse departments how to use parallel computers:
  - Efficiently – write programs that run fast
  - Productively – minimizing programming effort
- Basics: computer architectures
  - Memory hierarchies, multicore, distributed memory, GPU, cloud, …
- Basics: programming languages
  - C + Pthreads, OpenMP, MPI, CUDA, UPC, frameworks, …
- Beyond basics: common “patterns” used in all programs that need to run fast:
  - Linear algebra, graph algorithms, structured grids, …
  - How to compose these into larger programs
- Tools for debugging correctness and performance
- Guest lectures: climate modeling, astrophysics, …

Which applications require parallelism?
- Analyzed in detail in the “Berkeley View” report (UC Berkeley EECS TechRpts/2006/EECS-… .html)


Motif/Dwarf: Common Computational Patterns (Red Hot → Blue Cool)
- What do commercial and CSE applications have in common?

For (many of) these patterns, we present
- How they naturally arise in many applications
- How to compose them to implement applications
- How to find good existing implementations, if possible
- Underlying algorithms
  - Usually many to choose from, with various tradeoffs
  - Autotuning: use the computer to choose the best algorithm
  - Lots of open research questions
- What makes one algorithm more efficient than another?
  - The most expensive operation is not arithmetic; it is communication, i.e. moving data between levels of a memory hierarchy or between processors over a network.
  - Goal: avoid communication

Why avoid communication? (1/2)
- Running time of an algorithm is the sum of 3 terms:
  - # flops * time_per_flop
  - # words moved / bandwidth   } communication
  - # messages * latency        }
- time_per_flop << 1/bandwidth << latency
  - Gaps growing exponentially with time [FOSC]
  - Annual improvements: time_per_flop 59%; network bandwidth 26%, network latency 15%; DRAM bandwidth 23%, DRAM latency 5%
- Avoid communication to save time
- Goal: reorganize algorithms to avoid communication
  - Between all memory hierarchy levels: L1 ↔ L2 ↔ DRAM ↔ network, etc.
  - Very large speedups possible
  - Energy savings too!
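As a concrete illustration of this three-term cost model (not part of the original slides), the small C sketch below evaluates the estimated running time; the hardware constants gamma, beta, alpha are placeholders chosen only to show the orders of magnitude, not measurements of any real machine.

    #include <stdio.h>

    /* Three-term model: time = flops*gamma + words*beta + messages*alpha.
       gamma, beta, alpha are illustrative placeholders, not measurements. */
    double model_time(double flops, double words, double messages,
                      double gamma, double beta, double alpha)
    {
        return flops * gamma        /* arithmetic      */
             + words * beta         /* bandwidth term  */
             + messages * alpha;    /* latency term    */
    }

    int main(void)
    {
        double gamma = 1e-11;   /* seconds per flop (assumed)        */
        double beta  = 1e-9;    /* seconds per word moved (assumed)  */
        double alpha = 1e-6;    /* seconds per message (assumed)     */

        /* e.g. 1e12 flops, 1e9 words moved, 1e5 messages */
        printf("estimated time = %g s\n",
               model_time(1e12, 1e9, 1e5, gamma, beta, alpha));
        return 0;
    }

With these (assumed) constants the bandwidth and latency terms dominate the arithmetic term, which is the point of the slide: reducing the flop count barely matters if the data movement stays the same.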

Why Minimize Communication? (2/2)
- Minimize communication to save energy
- [Energy-cost figure omitted; source: John Shalf, LBL]

Goals
- Redesign algorithms to avoid communication
  - Between all memory hierarchy levels: L1 ↔ L2 ↔ DRAM ↔ network, etc.
- Attain lower bounds if possible
  - Current algorithms are often far from the lower bounds
  - Large speedups and energy savings possible

President Obama cites Communication-Avoiding Algorithms in the FY 2012 Department of Energy Budget Request to Congress:

“New Algorithm Improves Performance and Accuracy on Extreme-Scale Computing Systems. On modern computer architectures, communication between processors takes longer than the performance of a floating point arithmetic operation by a given processor. ASCR researchers have developed a new method, derived from commonly used linear algebra methods, to minimize communications between processors and the memory hierarchy, by reformulating the communication patterns specified within the algorithm. This method has been implemented in the TRILINOS framework, a highly-regarded suite of software, which provides functionality for researchers around the world to solve large scale, complex multi-physics problems.”
– FY 2010 Congressional Budget, Volume 4, FY2010 Accomplishments, Advanced Scientific Computing Research (ASCR), pages …

- CA-GMRES (Hoemmen, Mohiyuddin, Yelick, JD)
- “Tall-Skinny” QR (Grigori, Hoemmen, Langou, JD)

Summary of CA Linear Algebra
- “Direct” Linear Algebra
  - Lower bounds on communication for linear algebra problems like Ax=b, least squares, Ax = λx, SVD, etc.
  - Mostly not attained by algorithms in standard libraries
  - New algorithms that attain these lower bounds
    - Being added to libraries: Sca/LAPACK, PLASMA, MAGMA
  - Large speed-ups possible
  - Autotuning to find optimal implementation
- Ditto for “Iterative” Linear Algebra

Lower bound for all “n^3-like” linear algebra
- Let M = “fast” memory size (per processor). Then
  - #words_moved (per processor) = Ω( #flops (per processor) / M^(1/2) )
  - #messages_sent (per processor) = Ω( #flops (per processor) / M^(3/2) )
    (using #messages_sent ≥ #words_moved / largest_message_size)
  - Parallel case: assume either load or memory balanced
- Holds for
  - Matmul, BLAS, LU, QR, eig, SVD, tensor contractions, …
  - Some whole programs (sequences of these operations, no matter how individual ops are interleaved, e.g. A^k)
  - Dense and sparse matrices (where #flops << n^3)
  - Sequential and parallel algorithms
  - Some graph-theoretic algorithms (e.g. Floyd–Warshall)
- SIAM SIAG/Linear Algebra Prize, 2012 (Ballard, D., Holtz, Schwartz)
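To make the bounds concrete, here is a minimal C sketch (not from the slides) that evaluates the per-processor lower bounds for dense n×n matrix multiplication, taking #flops ≈ 2n^3/P per processor; n, P, and M are arbitrary example values, and the bounds hold only up to constant factors c1, c2.

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double n = 1e4;     /* matrix dimension (example)                */
        double P = 1024;    /* number of processors (example)            */
        double M = 4e6;     /* fast-memory words per processor (example) */

        double flops = 2.0 * n * n * n / P;          /* per processor         */
        double words_lb    = flops / sqrt(M);        /* Omega(flops/M^(1/2))  */
        double messages_lb = flops / (M * sqrt(M));  /* Omega(flops/M^(3/2))  */

        printf("words moved >= c1 * %.3g per processor\n", words_lb);
        printf("messages    >= c2 * %.3g per processor\n", messages_lb);
        return 0;
    }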

Can we attain these lower bounds?
- Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these bounds?
  - Often not
- If not, are there other algorithms that do?
  - Yes, for much of dense linear algebra, and for APSP (all-pairs shortest paths)
  - New algorithms, with new numerical properties, new ways to encode answers, new data structures
  - Not just loop transformations (we need those too!)
- Only a few sparse algorithms so far
  - Ex: Matmul of “random” sparse matrices
  - Ex: Sparse Cholesky of matrices with “large” separators
- Lots of work in progress

Summary of dense parallel algorithms attaining communication lower bounds
- Assume n×n matrices on P processors
- Minimum memory per processor: M = O(n^2 / P)
- Recall lower bounds:
  - #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
  - #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )
- Does ScaLAPACK attain these bounds?
  - For #words_moved: mostly, except the nonsymmetric eigenproblem
  - For #messages: asymptotically worse, except Cholesky
- New algorithms attain all bounds, up to polylog(P) factors
  - Cholesky, LU, QR, symmetric and nonsymmetric eigenproblems, SVD
- Can we do better?

Can we do better?
- Aren’t we already optimal?
- Why assume M = O(n^2/P), i.e. minimal?
  - Lower bound is still true with more memory
  - Can we attain it?

2.5D Matrix Multiplication
- Assume we can fit cn^2/P data per processor, c > 1
- Processors form a (P/c)^(1/2) × (P/c)^(1/2) × c grid
- Example: P = 32, c = 2
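A quick sanity check of the grid shape (illustrative, not from the slides): for the slide's example P = 32 and c = 2, each of the c = 2 layers is a sqrt(P/c) × sqrt(P/c) = 4 × 4 process grid.

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        int P = 32, c = 2;                           /* example from the slide */
        int side = (int)lround(sqrt((double)P / c)); /* side of each layer     */
        printf("processor grid: %d x %d x %d (= %d processors)\n",
               side, side, c, side * side * c);
        return 0;
    }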

2.5D Matrix Multiplication
- Assume we can fit cn^2/P data per processor, c > 1
- Processors form a (P/c)^(1/2) × (P/c)^(1/2) × c grid, indexed by (i, j, k)
- Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n(c/P)^(1/2) × n(c/P)^(1/2)
- Algorithm:
  (1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
  (2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σ_m A(i,m)*B(m,j)
  (3) Sum-reduce the partial sums Σ_m A(i,m)*B(m,j) along the k-axis, so that P(i,j,0) owns C(i,j)
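The communication structure of these three steps could be sketched in MPI roughly as below. This is an illustrative sketch only, not the reference implementation from the paper or the course: the rank-to-grid mapping is one simple assumed choice, error handling is omitted, and local_summa_partial() is a stub standing in for each layer's 1/c-th share of the local SUMMA work.

    #include <mpi.h>
    #include <math.h>
    #include <stdlib.h>

    /* Stub: a real version would run 1/c-th of the SUMMA pipeline
       (row/column broadcasts within 'layer' plus a local dgemm). */
    static void local_summa_partial(const double *A, const double *B, double *Cpart,
                                    int nloc, int k, int c, MPI_Comm layer)
    {
        (void)A; (void)B; (void)Cpart; (void)nloc; (void)k; (void)c; (void)layer;
    }

    /* Communication skeleton of 2.5D matmul: replicate, compute, reduce.
       A and B hold this processor's nloc x nloc blocks; C receives the
       result on layer k = 0. Assumed mapping: rank = ij + k*side*side. */
    void matmul_25d(double *A, double *B, double *C, int nloc, int c, MPI_Comm comm)
    {
        int rank, P;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &P);

        int side = (int)lround(sqrt((double)P / c)); /* layers are side x side */
        int k  = rank / (side * side);               /* which layer            */
        int ij = rank % (side * side);               /* position within layer  */

        MPI_Comm kline, layer;
        MPI_Comm_split(comm, ij, k, &kline);  /* the c procs sharing (i,j)       */
        MPI_Comm_split(comm, k, ij, &layer);  /* the side*side procs in layer k  */

        /* (1) Replicate A(i,j), B(i,j) from layer 0 to all c layers. */
        MPI_Bcast(A, nloc * nloc, MPI_DOUBLE, 0, kline);
        MPI_Bcast(B, nloc * nloc, MPI_DOUBLE, 0, kline);

        /* (2) Each layer computes its 1/c-th of the SUMMA sum into Cpart. */
        double *Cpart = calloc((size_t)nloc * nloc, sizeof *Cpart);
        local_summa_partial(A, B, Cpart, nloc, k, c, layer);

        /* (3) Sum-reduce the partial products along the k-axis onto layer 0. */
        MPI_Reduce(Cpart, C, nloc * nloc, MPI_DOUBLE, MPI_SUM, 0, kline);

        free(Cpart);
        MPI_Comm_free(&kline);
        MPI_Comm_free(&layer);
    }

The point of the extra factor c of memory is visible in the skeleton: the broadcasts in step (1) cost O(n^2/P · log c) words once, after which each layer does only 1/c of the SUMMA broadcasts, cutting the dominant bandwidth term by roughly c^(1/2) overall.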

2.5D Matmul on BG/P, 16K nodes / 64K cores
[Performance plots omitted; annotations on the plots: c = 16 copies, 2.7x faster, 12x faster]
- Distinguished Paper Award, EuroPar’11 (Solomonik, D.)
- SC’11 paper by Solomonik, Bhatele, D.

Perfect Strong Scaling – in Time and Energy
- Every time you add a processor, you should use its memory M too
- Start with the minimal number of processors: PM = 3n^2
- Increase P by a factor of c ⇒ total memory increases by a factor of c
- Notation for timing model:
  - γ_T, β_T, α_T = seconds per flop, per word moved, per message of size m
  - T(cP) = n^3/(cP) · [ γ_T + β_T/M^(1/2) + α_T/(m·M^(1/2)) ] = T(P)/c
- Notation for energy model:
  - γ_E, β_E, α_E = joules for the same operations
  - δ_E = joules per word of memory used per second
  - ε_E = joules per second for leakage, etc.
  - E(cP) = cP { n^3/(cP) · [ γ_E + β_E/M^(1/2) + α_E/(m·M^(1/2)) ] + δ_E·M·T(cP) + ε_E·T(cP) } = E(P)
- Extends to N-body, Strassen, …
- Can prove lower bounds on the needed network (e.g. 3D torus for matmul)
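As a sanity check on the algebra (illustrative only; all constants below are made up), the two formulas can be coded directly with M held fixed at the baseline 3n^2/P0: doubling the processor count (c = 2) should halve T and leave E unchanged.

    #include <stdio.h>
    #include <math.h>

    /* Timing model from the slide; M is the fixed per-processor memory. */
    static double T(double n, double P, double M,
                    double gT, double bT, double aT, double m)
    {
        return n*n*n / P * (gT + bT / sqrt(M) + aT / (m * sqrt(M)));
    }

    /* Energy model from the slide, reusing T above. */
    static double E(double n, double P, double M,
                    double gE, double bE, double aE, double dE, double eE,
                    double m, double gT, double bT, double aT)
    {
        double t = T(n, P, M, gT, bT, aT, m);
        return P * (n*n*n / P * (gE + bE / sqrt(M) + aE / (m * sqrt(M)))
                    + dE * M * t + eE * t);
    }

    int main(void)
    {
        double n = 4096, P0 = 64, m = 1024;
        double M = 3.0 * n * n / P0;                      /* baseline PM = 3n^2 */
        double gT = 1e-11, bT = 1e-9, aT = 1e-6;          /* assumed constants  */
        double gE = 1e-10, bE = 1e-9, aE = 1e-6, dE = 1e-12, eE = 1.0;

        printf("T(P)=%g  T(2P)=%g  (expect half)\n",
               T(n, P0, M, gT, bT, aT, m), T(n, 2*P0, M, gT, bT, aT, m));
        printf("E(P)=%g  E(2P)=%g  (expect equal)\n",
               E(n, P0, M, gE, bE, aE, dE, eE, m, gT, bT, aT),
               E(n, 2*P0, M, gE, bE, aE, dE, eE, m, gT, bT, aT));
        return 0;
    }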

What are students expected to do?
- On-line students – quizzes embedded in lectures
- HW0 – Describing a Parallel Application
- HW1 – Tuning matrix multiplication
- HW2 – Parallel Particle Simulation
- HW3 – Parallel Knapsack*
- Project – many possibilities …

On-Line Quizzes for XSEDE students
- Simple questions after each short video, ensuring students followed the speaker and understood the concepts
- Around 20 questions per lecture, depending on length
- All questions multiple choice with 4 options; students are allowed 3 tries (can be varied via Moodle options)
- Most questions had about 10% wrong answers, with 2-3 questions per lecture having over 25% wrong answers

HW0 – Describing a Parallel Application
- A short description of self and a parallel application that the student finds interesting
- Should include a short bio, research interests, goals for the class, and then a description of the parallel application
- Helps group students later for class projects
- Examples of the 2012 descriptions can be found at this link: …BNTmhBUXR3ZElJWExvb21vS0E&single=true&gid=2&output=html

Programming Assignments
- Each assignment has ‘autograder’ code given to students, so they can run tests and see the potential grade they will receive
- Files are submitted to the Moodle website for each assignment, with instructions for archiving files or specifying particular compiler directives
- With help from TAs, local instructors download the files, run the submissions, and update scores on Moodle after the runs return
- For the 2nd and 3rd assignments this usually takes 30+ minutes for the scaling studies to finish

HW1 – Tuning Matrix Multiply
- Square matrix multiplication
- Simplest code is very easy but inefficient
- Give students naïve blocked code
- C(i,j) = C(i,j) + A(i,:) * B(:,j); n^3 + O(n^2) reads/writes altogether

    for (i = 0; i < n; i++)
      for (j = 0; j < n; j++)
        for (k = 0; k < n; k++)
          C(i,j) = C(i,j) + A(i,k)*B(k,j);
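For reference, here is a minimal sketch of what a naïve blocked version might look like (illustrative only, not the actual starter code handed out in class); BLOCK_SIZE and the column-major macro indexing are assumptions.

    #define BLOCK_SIZE 32                  /* assumed tile size           */
    #define C(i,j) c[(i) + (j)*n]          /* column-major indexing assumed */
    #define A(i,j) a[(i) + (j)*n]
    #define B(i,j) b[(i) + (j)*n]

    /* c += a*b for n-by-n matrices, one BLOCK_SIZE tile at a time, so each
       tile of A, B, C is reused while it is still resident in cache. */
    void square_dgemm_blocked(int n, const double *a, const double *b, double *c)
    {
        for (int ii = 0; ii < n; ii += BLOCK_SIZE)
            for (int jj = 0; jj < n; jj += BLOCK_SIZE)
                for (int kk = 0; kk < n; kk += BLOCK_SIZE)
                    for (int i = ii; i < ii + BLOCK_SIZE && i < n; i++)
                        for (int j = jj; j < jj + BLOCK_SIZE && j < n; j++)
                            for (int k = kk; k < kk + BLOCK_SIZE && k < n; k++)
                                C(i,j) += A(i,k) * B(k,j);
    }

Blocking reduces the reads/writes from roughly n^3 to roughly n^3/BLOCK_SIZE (plus lower-order terms), which is exactly the communication-avoidance idea applied to the cache level of the memory hierarchy.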

HW1 – Tuning Matrix Multiply
- Memory access and caching
- SIMD code and optimization
- Using library codes where available
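One common SIMD-oriented tweak (an illustrative sketch, not from the assignment handout) is to expose a vectorizable, stride-1 inner loop to the compiler, e.g. with restrict-qualified pointers and an OpenMP simd pragma; the innermost update of a blocked matmul has this axpy shape when traversed along a unit-stride dimension.

    /* AXPY-style inner kernel: y += alpha * x. The restrict qualifiers and
       the OpenMP simd pragma tell the compiler the loop is safe to vectorize. */
    void axpy(int n, double alpha, const double *restrict x, double *restrict y)
    {
        #pragma omp simd
        for (int i = 0; i < n; i++)
            y[i] += alpha * x[i];
    }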

HW2 – Parallel Particle Simulation
- Simplified particle simulation
- Far-field forces are null outside an interaction radius (makes a simple O(n) algorithm possible)
- Give O(n^2) algorithm for serial, OpenMP, MPI and CUDA (2nd part)

HW2 – Parallel Particle Simulation
- Introduction to OpenMP, MPI and CUDA (2nd part) calls
- Domain decomposition and minimizing communication
- Locks for bins and avoiding synchronization overheads
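The O(n) approach hinted at above is spatial binning: hash each particle into a cell whose side is at least the interaction radius, then only test the 3×3 neighborhood of cells. A serial sketch follows (illustrative only; the particle_t layout is an assumption, and apply_force() stands in for the handout's force routine, assumed to be zero beyond the cutoff).

    #include <stdlib.h>

    typedef struct { double x, y, vx, vy, ax, ay; } particle_t;  /* assumed layout */

    /* Assumed to be provided elsewhere (e.g. by the assignment's common code):
       accumulate the force of *b on *a, zero beyond the cutoff radius. */
    void apply_force(particle_t *a, const particle_t *b);

    /* One force-evaluation pass using square bins of side 'cutoff' over a
       'size' x 'size' domain. */
    void compute_forces_binned(particle_t *p, int n, double size, double cutoff)
    {
        int nb = (int)(size / cutoff) + 1;           /* bins per dimension       */
        int *head = malloc((size_t)nb * nb * sizeof *head); /* first particle per bin */
        int *next = malloc((size_t)n * sizeof *next);       /* linked bin membership  */
        for (int b = 0; b < nb * nb; b++) head[b] = -1;

        for (int i = 0; i < n; i++) {                /* bin the particles        */
            int bx = (int)(p[i].x / cutoff), by = (int)(p[i].y / cutoff);
            int b = bx + by * nb;
            next[i] = head[b];
            head[b] = i;
            p[i].ax = p[i].ay = 0;
        }
        for (int i = 0; i < n; i++) {                /* only the 3x3 neighborhood */
            int bx = (int)(p[i].x / cutoff), by = (int)(p[i].y / cutoff);
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++) {
                    int cx = bx + dx, cy = by + dy;
                    if (cx < 0 || cy < 0 || cx >= nb || cy >= nb) continue;
                    for (int j = head[cx + cy * nb]; j != -1; j = next[j])
                        if (j != i) apply_force(&p[i], &p[j]);
                }
        }
        free(head); free(next);
    }

The parallel versions then map naturally onto the topics above: OpenMP needs per-bin locks (or per-thread bins) when inserting particles, and MPI assigns strips or tiles of bins to ranks and exchanges only the one-cell-wide ghost regions, which is the "minimizing communication" part of the assignment.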

HW3 – Parallel Knapsack*
- 0-1 knapsack problem for n items
- Serial and UPC dynamic-programming solutions are given, but using inefficient UPC collectives and layout
- Recurrence:
  - T(:, 0) = 0
  - T(w, i) = max( T(w, i-1), T(w - w_i, i-1) + v_i )
  - T(W_max, n) is the solution
- Initial UPC version runs slower than serial for most cases due to communication overhead
- [Figure from Wikipedia omitted]
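A minimal serial C sketch of this recurrence (illustrative; not the UPC starter code), storing the full (W_max+1) × (n+1) table T:

    #include <stdio.h>
    #include <stdlib.h>

    static int max(int a, int b) { return a > b ? a : b; }

    /* T(w,i) = best value using the first i items with capacity w. */
    int knapsack(int n, int Wmax, const int *w, const int *v)
    {
        /* T is (Wmax+1) x (n+1), stored as T[cap*(n+1) + i]; calloc gives T(:,0)=0. */
        int *T = calloc((size_t)(Wmax + 1) * (n + 1), sizeof *T);
        for (int i = 1; i <= n; i++)
            for (int cap = 0; cap <= Wmax; cap++) {
                int best = T[cap * (n + 1) + (i - 1)];          /* skip item i */
                if (w[i - 1] <= cap)                            /* take item i */
                    best = max(best,
                               T[(cap - w[i - 1]) * (n + 1) + (i - 1)] + v[i - 1]);
                T[cap * (n + 1) + i] = best;
            }
        int ans = T[Wmax * (n + 1) + n];
        free(T);
        return ans;
    }

    int main(void)
    {
        int w[] = {1, 3, 4, 5}, v[] = {1, 4, 5, 7};        /* example data */
        printf("best value = %d\n", knapsack(4, 7, w, v)); /* expect 9     */
        return 0;
    }

Column i depends only on column i-1, which is why the data layout and collectives matter so much in the UPC version: each step needs (possibly remote) reads of the previous column, and a naïve distribution turns every table lookup into communication.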

Students completing the class at Berkeley in 2014
- … enrolled (40 grad, 9 undergrad, 5 other)
- 28 CS or EECS students; the rest from Applied Math, Applied Science & Technology, Astrophysics, Biophysics, Business Administration, Chemical Engineering, Chemistry, Civil & Environmental Engineering, Math, Mechanical Engineering, Music, Nuclear Engineering, Physics
- 6 CS or EECS undergrads, 3 double majors

CS267 Spring 2014 XSEDE Student Demographics
- Graduate and undergraduate students were nearly evenly represented
- Most students were in Computer Science
- N = … students from 19 universities in North America, South America and Europe enrolled

CS267 Class Project Suggestions Spring 2012

How to Organize A Project Proposal (1/2)
- Parallelizing/comparing implementations of an application
- Parallelizing/comparing implementations of a kernel
- Building/evaluating a parallel software tool
- Evaluating parallel hardware

A few sample CS267 Class Projects (all posters and video presentations at …)
- Content-based image recognition: “Find me other pictures of the person in this picture”
- Faster molecular dynamics, applied to Alzheimer’s Disease
- Better speech recognition through a faster “inference engine”
- Faster algorithms to tolerate errors in new genome sequencers
- Faster simulation of marine zooplankton population
- Sharing cell-phone bandwidth for faster transfers

More Prior Projects
1. High-Throughput, Accurate Image Contour Detection
2. CUDA-based rendering of 3D Minkowski Sums
3. Parallel Particle Filters
4. Scaling Content Based Image Retrieval Systems
5. Towards a parallel implementation of the Growing String Method
6. Optimization of the Poisson Operator in CHOMBO
7. Sparse-Matrix-Vector-Multiplication on GPUs
8. Parallel RI-MP2

More Prior Projects
1. Parallel FFTs in 3D: Testing different implementation schemes
2. Replica Exchange Molecular Dynamics (REMD) for Amber's Particle-Mesh Ewald MD (PMEMD)
3. Creating a Scalable HMM based Inference Engine for Large Vocabulary Continuous Speech Recognition
4. Using exponential integrators to solve large stiff problems
5. Clustering overlapping reads without using a reference genome
6. An AggreGATE Network Abstraction for Mobile Devices
7. Parallel implementation of multipole-based Poisson-Boltzmann solver
8. Finite Element Simulation of Nonlinear Elastic Dynamics using CUDA

Still more prior projects
1. Parallel Groebner Basis Computation using GASNet
2. Accelerating Mesoscale Molecular Simulation using CUDA and MPI
3. Modeling and simulation of red blood cell light scattering
4. NURBS Evaluation and Rendering
5. Performance Variability in Hadoop's Map Reduce
6. Utilizing Multiple Virtual Machines in Legacy Desktop Applications
7. How Useful are Performance Counters, Really? Profiling Chombo Finite Methods Solver and Parsec Fluids Codes on Nehalem and SiCortex
8. Energy Efficiency of MapReduce
9. Symmetric Eigenvalue Problem: Reduction to Tridiagonal
10. Parallel POPCycle Implementation

QUESTIONS?