CS267: Applications of Parallel Computers
https://sites.google.com/lbl.gov/cs267-spr2018/ (the 2018 class on which the XSEDE lectures will be based)
Mention: all lectures are video archived; the XSEDE version this semester is based on Spring 2012; one payoff of this class: autograders.
Aydin Buluc, James Demmel and Kathy Yelick
aydin@eecs.berkeley.edu, demmel@eecs.berkeley.edu, yelick@berkeley.edu
Motivation and Outline of Course
Why powerful computers must be parallel processors (all of them, including your laptops and handhelds; does anyone in the room have a sequential computer?)
Large Computational Science and Engineering (CSE) problems require powerful computers (commercial problems too)
Why writing (fast) parallel programs is hard (but things are improving)
Structure of the course
Story 1: the end of Moore's Law scaling, the multicore revolution, and the need for all software to change in order to benefit.
Story 2: lots of scientific and commercial examples, simulation and "big data"; more about climate and the CMB.
Story 3: new challenges vs. sequential programming: finding enough parallelism (Amdahl's Law) of the right granularity, locality, coordination/synchronization (safe sharing), and debugging (correctness and performance).
Good news #1: the 13 dwarfs; good news #2: higher-level structural patterns; good news #3: an efficiency layer vs. a productivity layer for two kinds of programmers.
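As a quick aside on the Amdahl's Law point above (a standard fact, not something specific to these slides): if a fraction s of a program's work is inherently serial, then the speedup on P processors is bounded by speedup <= 1 / (s + (1 - s)/P) <= 1/s. For example, a program that is 10% serial can never run more than 10x faster, no matter how many processors are used, which is why finding enough parallelism heads the list of challenges.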
Course summary and goals
Goal: teach grads and advanced undergrads from diverse departments how to use parallel computers:
Efficiently – write programs that run fast
Productively – minimizing programming effort
Basics: computer architectures and programming languages
Beyond basics: common "patterns" used in all programs that need to run fast: linear algebra, graph algorithms, structured grids, …
How to compose these into larger programs
Tools for debugging correctness and performance
Guest lectures: climate modeling, astrophysics, …
Students completing the class at Berkeley in 2014
54 enrolled (40 grad, 9 undergrad, 5 other); 28 CS or EECS students, the rest from: Applied Math, Applied Science & Technology, Astrophysics, Biophysics, Business Administration, Chemical Engineering, Chemistry, Civil & Environmental Engineering, Math, Mechanical Engineering, Music, Nuclear Engineering, Physics.
75 initially enrolled or on the wait list; 6 CS or EECS undergrads, 3 double majors.
Students completing the class at Berkeley in 2016
86 enrolled (>90% grad students); 31 (36%) CS and 22 (24%) EECS students, the rest from: Bioengineering (5), Civil & Environmental Eng (6), Mechanical Eng (6), Math (5), Physics, Nuclear Engineering, Chemical Engineering, Political Science, …
Students completing the class at Berkeley in 2017
74 finished, >90% grad students; 58 (78%) are Ph.D. students.
35 (47%; some double counting due to double majors) CS and 11 (15%) EECS students, the rest from: Mechanical Eng (7), ChemEng and Chem (5), Materials (2), Applied Math (2), Physics (2), Nuclear Engineering (2), Energy & Resources, Earth & Planetary Sci., Civil & Environmental Eng., Applied Sci. & Tech.
Students completing the class at Berkeley in 2018
97-100 finished, with a slightly larger share of BA/BS students than before (a trend that is likely to continue in 2019).
Exact numbers are not available because not everyone on the roster has a major/year listed in the database at the moment.
CS267 Spring 2016 XSEDE Student Demographics (chart). Notes: "Math" disappeared? More undergrads?
CS267 Spring 2016 XSEDE Student Demographics: Race/Ethnicity and Gender (charts).
Students from 9 institutions participated in Spring 2015
Which applications require parallelism?
Analyzed in detail in the "Berkeley View" report: www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html
Some people might subdivide the categories and some might combine them; if you are trying to stretch a computer architecture, you can define subcategories as well. Graph Traversal = Probabilistic Models, N-Body = Particle Methods, …
Motif/Dwarf: Common Computational Methods (Red = Hot, Blue = Cool)
What do commercial and CSE applications have in common?
Detailed List of Lectures #1 (2018 list)
Lecture 1: Introduction
Lecture 2: Single Processor Machines: Memory Hierarchies and Processor Features
Lecture 3: Parallel Machines and Programming Models + Tricks with Trees
Lecture 4: Shared Memory Parallelism
Lecture 5: Roofline and Performance Modeling
Lecture 6: Sources of Parallelism and Locality (Part 1)
Lecture 7: Sources of Parallelism and Locality (Part 2)
Lecture 8: An Introduction to CUDA/OpenCL and Graphics Processors (GPUs)
Lecture 9: Distributed Memory Machines and Programming
Lecture 10: Advanced MPI and Collective Communication Algorithms
Lecture 11: UPC++: Partitioned Global Address Space Languages
Lecture 12: Cloud and Big Data
Lecture 13: Parallel Matrix Multiply
Lecture 14: Dense Linear Algebra
Lecture 15: Sparse-Matrix-Vector-Multiplication and Iterative Solvers
Lecture 16: Structured Grids
Detailed List of Lectures #2 (2018 list)
Lecture 17a: Machine Learning Part 1 (Supervised Learning)
Lecture 17b: Scaling Deep Neural Network Training
Lecture 17c: Machine Learning Part 2 (Unsupervised Learning)
Lecture 18: Graph Partitioning
Lecture 19: Graph Algorithms
Lecture 20: Fast Fourier Transform
Lecture 21: Climate Modeling
Lecture 22: Sorting and Searching
Lecture 23: Dynamic Load Balancing
Lecture 24: Hierarchical Methods for the N-Body Problem
Lecture 25: Computational Biology
Lecture 26: Cosmology
Lecture 27: Supercomputers & Superintelligence
Lecture 28: Quantum Computing
Quiz Information
Simple questions after each 20-30 minute video ensure that students followed the speaker and understood the concepts.
Around 20 questions per lecture, depending on length.
All questions are multiple choice with 4 options, and students are allowed 3 tries (this can be varied via Moodle options).
Most questions had a wrong-answer rate of about 10%, with 2-3 questions per lecture exceeding a 25% wrong-answer rate.
HW0 – Describing a Parallel Application
A short description of yourself and a parallel application that you find interesting.
Should include a short bio, research interests, and goals for the class, followed by a description of the parallel application.
Helps group students later for class projects.
Examples of the 2009-2015 descriptions can be found at this link (at the bottom of the page): https://people.eecs.berkeley.edu/~mme/cs267-2016/hw0/index.html
Programming Assignments
Assignment 1: Optimize Matrix Multiplication. Tests the ability to minimize communication, maximize data locality, and exploit parallelism within a single core. Shows the difference between highly-optimized code and naïve code, the performance gap, and the resources wasted by not using a single core to its full potential.
Assignment 2: Parallelize a Particle Simulation. Tests the ability to write scalable and efficient shared- and distributed-memory codes. Teaches the importance of starting from an efficient serial implementation before parallelization.
Assignment 3: Parallelize Graph Algorithms for de Novo Genome Assembly. Introduces students to Partitioned Global Address Space (PGAS) languages as well as codes with irregular data access patterns.
Programming Assignments
Each assignment has 'autograder' code given to students so they can run tests and see the potential grade they will receive.
Files are submitted to the Moodle website for each assignment, with instructions for archiving files or submitting particular compiler directives.
With help from the TAs, local instructors will download and run the submissions and update the scores on Moodle after the runs return.
For the 2nd and 3rd assignments this usually takes 30+ minutes for the scaling studies to finish.
HW1 – Tuning Matrix Multiply
Square matrix multiplication: the simplest code is very easy to write but inefficient. Students are given a naïve blocked code as a starting point.
C(i,j) = C(i,j) + A(i,:) * B(:,j)
The naïve triply nested loop performs n³ + O(n²) reads/writes altogether:
for (i = 0; i < n; i++)
  for (j = 0; j < n; j++)
    for (k = 0; k < n; k++)
      C(i,j) = C(i,j) + A(i,k) * B(k,j);
HW1 – Tuning Matrix Multiply
Memory access and caching
SIMD code and optimization
Using library codes where available
HW1 – Tuning Matrix Multiply
Possible modifications, depending on the audience: divide the assignment into stages, such as
Week 1: multilevel cache blocking
Week 2: copy and transpose optimizations
Week 3: SIMD and final tuning of all parameters
(A sketch of a cache-blocked version follows below.)
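Below is a minimal sketch of what the first stage (one level of cache blocking) might look like, in the spirit of the naïve loop above. It is not the assignment's starter code: the column-major macros, the square_dgemm name and signature, and the BLOCK size are assumptions chosen for illustration, and a real submission would tune the block size and add further levels of blocking, copy optimization, and SIMD.

  /* One level of cache blocking for square C = C + A*B.
   * Assumes column-major storage with leading dimension n (an assumption,
   * not the handout's specification); BLOCK is a tunable parameter. */
  #define BLOCK 64
  #define A(i,j) a[(j)*n + (i)]
  #define B(i,j) b[(j)*n + (i)]
  #define C(i,j) c[(j)*n + (i)]

  void square_dgemm(int n, double *a, double *b, double *c) {
      for (int jj = 0; jj < n; jj += BLOCK)
          for (int kk = 0; kk < n; kk += BLOCK)
              for (int ii = 0; ii < n; ii += BLOCK)
                  /* Multiply the (ii,kk) block of A by the (kk,jj) block of B
                   * while all three blocks fit in cache; the innermost i loop
                   * walks down a column, i.e. with unit stride. */
                  for (int j = jj; j < jj + BLOCK && j < n; j++)
                      for (int k = kk; k < kk + BLOCK && k < n; k++) {
                          double bkj = B(k, j);
                          for (int i = ii; i < ii + BLOCK && i < n; i++)
                              C(i, j) += A(i, k) * bkj;
                      }
  }

Reordering the block loops or copying blocks into contiguous buffers are the kinds of further experiments the staged schedule above is meant to encourage.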
HW2 – Parallel Particle Simulation
Simplified particle simulation: far-field forces are null outside an interaction radius, which makes a simple O(n) algorithm possible.
Students are given an O(n²) algorithm for serial, OpenMP, MPI and CUDA (2nd part).
HW2 – Parallel Particle Simulation
Introduction to OpenMP, MPI and CUDA (2nd part) calls
Domain decomposition and minimizing communication
Locks for bins and avoiding synchronization overheads
HW2 – Parallel Particle Simulation
Possible modifications, depending on the audience: divide the assignment into stages, such as
Week 1: O(n) code
Week 2: MPI and OpenMP code
Week 3: CUDA code
(A sketch of the binning idea behind the O(n) code follows below.)
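As an illustration of the O(n) binning idea and the per-bin locks mentioned above, here is a minimal OpenMP sketch; it is not the assignment's starter code, and the particle_t layout, the apply_force routine, and the cutoff parameter are hypothetical stand-ins for whatever the handout actually provides.

  #include <omp.h>

  /* Hypothetical particle layout and force routine standing in for the
   * starter code's definitions. */
  typedef struct { double x, y, vx, vy, ax, ay; } particle_t;
  void apply_force(particle_t *target, const particle_t *source);

  /* Bin particles into grid cells of side >= cutoff, then compare each
   * particle only against the 3x3 neighborhood of cells: O(n) total work
   * when the particle density is bounded. bin_head needs at least
   * nbins*nbins entries, and bin_locks must have been initialized with
   * omp_init_lock by the caller. */
  void compute_forces(particle_t *p, int n, double size, double cutoff,
                      int *bin_head, int *next, omp_lock_t *bin_locks) {
      int nbins = (int)(size / cutoff) + 1;

      #pragma omp parallel for
      for (int b = 0; b < nbins * nbins; b++) bin_head[b] = -1;

      /* Push each particle onto its bin's linked list; the per-bin lock
       * serializes concurrent pushes onto the same bin. */
      #pragma omp parallel for
      for (int i = 0; i < n; i++) {
          int b = (int)(p[i].y / cutoff) * nbins + (int)(p[i].x / cutoff);
          omp_set_lock(&bin_locks[b]);
          next[i] = bin_head[b];
          bin_head[b] = i;
          omp_unset_lock(&bin_locks[b]);
      }

      /* Force phase: only neighboring bins can contain particles within
       * the interaction radius. */
      #pragma omp parallel for
      for (int i = 0; i < n; i++) {
          p[i].ax = p[i].ay = 0;
          int bx = (int)(p[i].x / cutoff), by = (int)(p[i].y / cutoff);
          for (int dy = -1; dy <= 1; dy++)
              for (int dx = -1; dx <= 1; dx++) {
                  int nx = bx + dx, ny = by + dy;
                  if (nx < 0 || ny < 0 || nx >= nbins || ny >= nbins) continue;
                  for (int j = bin_head[ny * nbins + nx]; j != -1; j = next[j])
                      if (j != i) apply_force(&p[i], &p[j]);
              }
      }
  }

The same decomposition carries over to MPI (each rank owns a strip or block of bins and exchanges only its ghost rows) and to CUDA (one thread per particle or per bin).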
HW3: Genome Sequencing
Given a large number of overlapping DNA fragments (sequences of the letters A, G, C and T), sort them so that together they represent the original complete strand of DNA.
A simplified version of the real genome sequence assembly code developed by former CS267 TA Evangelos Georganas, which has been used to sequence the wheat genome on 18K processors.
Benefits from the one-sided communication model of UPC/UPC++ (irregular remote memory accesses are required).
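To make the one-sided point concrete, here is a minimal sketch written with MPI one-sided communication rather than the UPC++ actually used in HW3 (the entry layout, slot counts, and hash are illustrative only): each rank exposes its slice of a distributed hash table, and any rank can read a remote slot chosen by a hash without the owner posting a matching receive, which is exactly the irregular access pattern the assignment exercises.

  #include <mpi.h>
  #include <string.h>

  #define SLOTS_PER_RANK 1024                               /* illustrative table size */
  typedef struct { char kmer[32]; char ext[2]; } entry_t;   /* hypothetical layout */

  int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);
      int rank, nranks;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nranks);

      /* Each rank owns one slice of the distributed table and exposes it in an
       * RMA window (a real code would first fill the slice with k-mers). */
      entry_t local[SLOTS_PER_RANK];
      memset(local, 0, sizeof(local));
      MPI_Win win;
      MPI_Win_create(local, (MPI_Aint)sizeof(local), (int)sizeof(entry_t),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &win);

      MPI_Win_fence(0, win);
      /* The slot we need is determined by hashing the k-mer we are extending,
       * so it may live on any rank; no receive is posted on the target side. */
      long global_slot = (12345L + rank) % ((long)nranks * SLOTS_PER_RANK);
      int target = (int)(global_slot / SLOTS_PER_RANK);
      MPI_Aint slot = (MPI_Aint)(global_slot % SLOTS_PER_RANK);
      entry_t remote;
      MPI_Get(&remote, (int)sizeof(entry_t), MPI_BYTE,
              target, slot, (int)sizeof(entry_t), MPI_BYTE, win);
      MPI_Win_fence(0, win);   /* remote is valid after the closing fence */
      (void)remote;            /* a real code would use the looked-up entry here */

      MPI_Win_free(&win);
      MPI_Finalize();
      return 0;
  }

In UPC++ the same lookup is a single rget on a global pointer; the point of the exercise is that the traversal order is data-dependent, so two-sided message matching would be awkward.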
CS267 Class Project Suggestions Spring 2012
Class project suggestions
Many kinds of projects, reflecting the broad scope of the field and of the students, who come from many departments.
You need to do one or more of: design / program / measure some parallel application / kernel / software tool / hardware.
Can work alone or in teams; HW0 is posted to help identify possible teammates based on interest.
What you need to do:
Project proposal by midnight Saturday Mar 24
Feedback from instructor over spring break (ongoing conversations)
Poster presentation (+ recording a short video presentation) on Thursday May 3 (class time, during RRR week)
Final report writeups due Monday May 7 at midnight
How to Organize A Project Proposal (1/2)
Parallelizing/comparing implementations of an Application
Parallelizing/comparing implementations of a Kernel
Building/evaluating a parallel software tool
Evaluating parallel hardware
How to Organize A Project Proposal (2/2)
What is the list of tasks you will try? Sort them from low-hanging fruit to harder.
What existing tools will you use or compare to? Don't reinvent wheels, or at least compare to existing wheels to see if yours is better. For applications, consider using frameworks like Chombo or PETSc, and identify the computational and structural patterns you plan to use.
What are your success metrics?
Get application X up on Edison or Stampede, solve problem Y
Get motif Z to run W times faster on a GPU
Collect data V to evaluate/compare approaches
A few sample CS267 Class Projects
All posters and video presentations are at https://people.eecs.berkeley.edu/~demmel/cs267_Spr16/
Parallel Finite Element Analysis
Parallel Stable K-means
Models of Flooding for Digital Elevation Models
Multivariate Time Series Analysis Engine
Coarse-Grained Molecular Dynamics
Parallel Genetic Minimization Algorithm
Scalable Structure Learning
Blocked Time Analysis for Machine Learning
Amyloid beta peptide: how it chemically changes to form the A-beta monomers that are believed to cause Alzheimer's disease.
New genome sequencing machines can be used to very quickly sequence the genomes of many patients, in the hopes of identifying genes that contribute to cancer, heart disease, diabetes, hypertension, etc., but they generate a lot of errors (a side effect of their speed), more than older algorithms can handle.
Zooplankton are at the bottom of the ocean food chain and are sensitive indicators of how climate change can change the distribution of larger ocean creatures, like the ones we eat. (Ben Steffen of Int. Bio., student of Zach Powell)
Better cell phone bandwidth means better music downloads, better video and picture uploads, and better gaming experiences.
Which is the undergraduate project?
Plan to freely offer a 3-day "bootcamp" version of this course later this summer.
More Prior Projects
High-Throughput, Accurate Image Contour Detection
Parallel Particle Filters
Scaling Content Based Image Retrieval Systems
Towards a parallel implementation of the Growing String Method
Content based image recognition
Faster molecular dynamics, applied to Alzheimer's Disease
Speech recognition through a faster "inference engine"
Faster algorithms to tolerate errors in new genome sequencers
Faster simulation of marine zooplankton population
Parallel multipole-based Poisson-Boltzmann solver
1: ParLab image segmentation projects
2: fast image summing for rendering
3: computing the distribution of points from a nonlinear feedback control system
4: searching databases of images using the k-means algorithm
5: electronic structure calculation to determine reaction mechanisms
6: improve parallelization of the stencil operator used in AMR
7: what it sounds like
8: parallelize an improvement to the Hartree-Fock electronic structure algorithm
Still more prior projects
Parallel Groebner Basis Computation using GASNet
Accelerating Mesoscale Molecular Simulation using CUDA and MPI
Modeling and simulation of red blood cell light scattering
NURBS Evaluation and Rendering
Performance Variability in Hadoop's MapReduce
Utilizing Multiple Virtual Machines in Legacy Desktop Applications
How Useful are Performance Counters, Really?
Profiling the Chombo Finite Methods Solver and Parsec Fluids Codes on Nehalem and SiCortex
Energy Efficiency of MapReduce
Symmetric Eigenvalue Problem: Reduction to Tridiagonal
Parallel POPCycle Implementation
A Groebner basis is like Gaussian elimination for systems of polynomials that are not linear.
What it sounds like.
Motivation: rapid detection of sickle cell anemia by analyzing light scattered by a blood sample.
How to rapidly represent and render smooth surfaces with Non-Uniform Rational B-Splines.
Identify causes of variability and suggest improvements to the scheduler.
Use VMs mapped to cores as a way to parallelize browsers etc.
Using hardware performance counters for detailed performance analysis.
Motivation is the energy efficiency of data centers: measure MapReduce workload energies, model components, tune apps to reduce energy.
POPCycle models zooplankton populations in the ocean.