CS 267: Applications of Parallel Computers
Final Project Suggestions
James Demmel
CS267 Lecture 22a, 04/06/2006
www.cs.berkeley.edu/~demmel/cs267_Spr06

Outline

- Kinds of projects
  - Evaluating and improving the performance of a parallel application
    - "Application" could be a full scientific application, or an important kernel
  - Parallelizing a sequential application
    - Other kinds of performance improvements are possible too, e.g. memory hierarchy tuning
  - Devising a new parallel algorithm for some problem
  - Porting a parallel application or systems software to a new architecture
- Examples of previous projects (all on-line)
- Upcoming guest lecturers
  - See their previous lectures, or contact them, for project ideas
- Suggested projects

CS267 Class Projects from 2004

- BLAST Implementation on BEE2 — Chen Chang
- PFLAMELET: An Unsteady Flamelet Solver for Parallel Computers — Fabrizio Bisetti
- Parallel Pattern Matcher — Frank Gennari, Shariq Rizvi, and Guille Díez-Cañas
- Parallel Simulation in Metropolis — Guang Yang
- A Survey of Performance Optimizations for Titanium Immersed Boundary Simulation — Hormozd Gahvari, Omair Kamil, Benjamin Lee, Meling Ngo, and Armando Solar
- Parallelization of oopd1 — Jeff Hammel
- Optimization and Evaluation of a Titanium Adaptive Mesh Refinement Code — Amir Kamil, Ben Schwarz, and Jimmy Su

CS267 Class Projects from 2004 (cont)

- Communication Savings With Ghost Cell Expansion For Domain Decompositions Of Finite Difference Grids — C. Zambrana Rojas and Mark Hoemmen
- Parallelization of Phylogenetic Tree Construction — Michael Tung
- UPC Implementation of the Sparse Triangular Solve and NAS FT — Christian Bell and Rajesh Nishtala
- Widescale Load Balanced Shared Memory Model for Parallel Computing — Sonesh Surana, Yatish Patel, and Dan Adkins

Planned Guest Lecturers

- Katherine Yelick (UPC, heart modeling)
- David Anderson (volunteer computing)
- Kimmen Sjolander (phylogenetic analysis of proteins – SATCHMO – Bonnie Kirkpatrick)
- Julian Borrill (astrophysical data analysis)
- Wes Bethel (graphics and data visualization)
- Phil Colella (adaptive mesh refinement)
- David Skinner (tools for scaling up applications)
- Xiaoye Li (sparse linear algebra)
- Osni Marques and Tony Drummond (ACTS Toolkit)
- Andrew Canning (computational neuroscience)
- Michael Wehner (climate modeling)

Suggested projects (1)

- Weekly research group meetings on these and related topics (see J. Demmel and K. Yelick)
- Contribute to the upcoming ScaLAPACK release (JD)
  - Proposal and talk at www.cs.berkeley.edu/~demmel; ask me for the latest version
  - Performance evaluation of existing parallel algorithms
    - Ex: new eigensolvers based on successive band reduction
  - Improved implementations of existing parallel algorithms
    - Ex: use UPC to overlap communication and computation
  - Many serial algorithms remain to be parallelized
  - See the following slides

Missing Drivers in Sca/LAPACK

Linear Equations
  Algorithms: LU, Cholesky, LDL^T
  LAPACK:     xGESV, xPOSV, xSYSV
  ScaLAPACK:  PxGESV, PxPOSV, missing

Least Squares (LS)
  Algorithms: QR, QR+pivot, SVD/QR, SVD/D&C, SVD/MRRR, QR + iterative refinement
  LAPACK:     xGELS, xGELSY, xGELSS, xGELSD, missing
  ScaLAPACK:  PxGELS, missing, missing (intent?), missing

Generalized LS
  Algorithms: LS + equality constraints, generalized LM, above + iterative refinement
  LAPACK:     xGGLSE, xGGGLM, missing

(A minimal call to one of the existing drivers, xGESV, is sketched after this slide.)
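To make the driver table concrete, the following is a minimal sketch, not taken from the slides, of calling the double-precision GESV driver from C through the LAPACKE interface (which postdates these 2006 slides); the matrix values and sizes are made up for illustration.

/* Minimal sketch: solve A*x = b with LAPACK's dGESV driver via the
 * LAPACKE C interface. Values and sizes are illustrative only.
 * Typical build: cc gesv_demo.c -llapacke -llapack -lblas
 */
#include <stdio.h>
#include <lapacke.h>

int main(void) {
    lapack_int n = 3, nrhs = 1;
    /* Row-major 3x3 matrix A and right-hand side b. */
    double a[9] = { 4.0, 1.0, 0.0,
                    1.0, 3.0, 1.0,
                    0.0, 1.0, 2.0 };
    double b[3] = { 1.0, 2.0, 3.0 };
    lapack_int ipiv[3];

    /* dGESV: LU factorization with partial pivoting, then triangular solves. */
    lapack_int info = LAPACKE_dgesv(LAPACK_ROW_MAJOR, n, nrhs,
                                    a, n, ipiv, b, nrhs);
    if (info != 0) {
        fprintf(stderr, "dgesv failed, info = %d\n", (int)info);
        return 1;
    }
    for (int i = 0; i < n; i++)
        printf("x[%d] = %g\n", i, b[i]);
    return 0;
}

The parallel PxGESV driver in ScaLAPACK follows the same factor-then-solve pattern but operates on block-cyclically distributed matrices.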

More missing drivers

Symmetric EVD
  Algorithms: QR / Bisection+Invit, D&C, MRRR
  LAPACK:     xSYEV / X, xSYEVD, xSYEVR
  ScaLAPACK:  PxSYEV / X, PxSYEVD, missing

Nonsymmetric EVD
  Algorithms: Schur form, vectors too
  LAPACK:     xGEES / X, xGEEV / X
  ScaLAPACK:  missing driver

SVD
  Algorithms: QR, D&C, MRRR, Jacobi
  LAPACK:     xGESVD, xGESDD, missing
  ScaLAPACK:  PxGESVD, missing (intent?), missing, missing

Generalized Symmetric EVD
  Algorithms: QR / Bisection+Invit, D&C, MRRR
  LAPACK:     xSYGV / X, xSYGVD, missing
  ScaLAPACK:  PxSYGV / X, missing (intent?), missing

Generalized Nonsymmetric EVD
  Algorithms: Schur form, vectors too
  LAPACK:     xGGES / X, xGGEV / X
  ScaLAPACK:  missing

Generalized SVD
  Algorithms: Kogbetliantz, MRRR
  LAPACK:     xGGSVD, missing
  ScaLAPACK:  missing (intent), missing

Suggested projects (2)

Contribute to sparse linear algebra (JD & KY)
- Performance tuning to minimize latency and bandwidth costs, both to memory and between processors (sparse means few flops per memory reference or word communicated)
- Typical methods (e.g. CG, the conjugate gradient method) do some number of dot products and saxpys for each SpMV, so the communication cost is O(# iterations)
- Our goal: make the latency cost O(1)!
  - This requires reorganizing the algorithms drastically, including replacing SpMV by a new kernel [Ax, A^2x, A^3x, ..., A^kx], which can be computed with O(1) messages
- Projects (a minimal sequential baseline for the kernel is sketched after this slide):
  - Study the scalability bottlenecks of current CG on real, large matrices
  - Optimize [Ax, A^2x, A^3x, ..., A^kx] on sequential machines
  - Optimize [Ax, A^2x, A^3x, ..., A^kx] on parallel machines
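As a sequential point of reference for the last two project bullets, the sketch below (an illustration, not code from the course) forms [Ax, A^2x, ..., A^kx] the naive way, as k back-to-back CSR sparse matrix-vector products; the communication-avoiding kernel reorganizes exactly this computation. The CSR struct and function names are assumptions made for this sketch.

/* Naive baseline for the matrix powers kernel [Ax, A^2x, ..., A^kx]:
 * k successive CSR sparse matrix-vector multiplies (assumes k >= 1).
 * Communication-avoiding variants reorganize this to use O(1) messages
 * in parallel; this sketch is sequential and only fixes the semantics.
 */
typedef struct {
    int n;          /* matrix dimension (n x n)         */
    int *rowptr;    /* row pointers, length n+1         */
    int *colind;    /* column indices, length rowptr[n] */
    double *val;    /* nonzero values, length rowptr[n] */
} csr_matrix;

/* y = A * x for a CSR matrix A. */
static void spmv(const csr_matrix *A, const double *x, double *y) {
    for (int i = 0; i < A->n; i++) {
        double sum = 0.0;
        for (int j = A->rowptr[i]; j < A->rowptr[i + 1]; j++)
            sum += A->val[j] * x[A->colind[j]];
        y[i] = sum;
    }
}

/* V holds k vectors of length n, stored contiguously:
 * V[0..n-1] = A*x, V[n..2n-1] = A^2*x, ..., V[(k-1)n..kn-1] = A^k*x. */
void matrix_powers_naive(const csr_matrix *A, const double *x,
                         double *V, int k) {
    spmv(A, x, &V[0]);                      /* A*x */
    for (int p = 1; p < k; p++)
        spmv(A, &V[(p - 1) * A->n],         /* A^(p+1)*x from A^p*x */
             &V[p * A->n]);
}

Each call to spmv here re-reads the matrix and, in a distributed setting, would trigger another round of messages; avoiding that repetition is the point of the proposed kernel.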

Suggested projects (3)

Evaluate new languages on applications (KY)
- UPC or Titanium
- UPC for asynchrony, overlapping communication & computation (an analogous overlap sketch in MPI follows this slide)
  - ScaLAPACK in UPC
  - Use the UPC-based 3D FFT in your application
  - Optimize the existing 1D FFT in UPC to use 3D techniques

Porting and evaluating parallel systems software (KY)
- Port UPC to RAMP
- Port GASNet to Blue Gene and evaluate performance
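The overlap idea is not tied to UPC. Below is a rough analogous sketch in C with MPI nonblocking calls rather than UPC's one-sided operations: post the halo exchange, do the work that needs no remote data, then wait and finish. The buffer names and the interior/boundary split are illustrative assumptions, not course code.

/* Sketch of overlapping communication with computation using
 * nonblocking MPI. UPC expresses the same idea with one-sided
 * puts/gets and split-phase synchronization.
 */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define N 1024   /* local vector length, illustrative */
#define H 1      /* halo width, illustrative          */

/* Work that needs no remote data (placeholder for real stencil updates). */
static void compute_interior(double *u, int n) {
    for (int i = H; i < n - H; i++) u[i] *= 0.5;
}

/* Work that uses the received halo (placeholder). */
static void compute_boundary(double *u, const double *halo, int n) {
    u[n - 1] += halo[0];
}

static void exchange_and_compute(double *u, int left, int right,
                                 MPI_Comm comm) {
    double sendbuf[H], halo[H];
    MPI_Request reqs[2];

    /* 1. Copy the outgoing boundary and post the exchange early. */
    memcpy(sendbuf, u, H * sizeof(double));
    MPI_Irecv(halo, H, MPI_DOUBLE, right, 0, comm, &reqs[0]);
    MPI_Isend(sendbuf, H, MPI_DOUBLE, left, 0, comm, &reqs[1]);

    /* 2. Overlap: compute everything that does not need the halo. */
    compute_interior(u, N);

    /* 3. Complete the exchange, then do the halo-dependent part. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    compute_boundary(u, halo, N);
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *u = calloc(N, sizeof *u);
    int left  = (rank + size - 1) % size;   /* ring of neighbors, illustrative */
    int right = (rank + 1) % size;

    exchange_and_compute(u, left, right, MPI_COMM_WORLD);

    free(u);
    MPI_Finalize();
    return 0;
}

The payoff of the overlap depends on how much independent work sits between posting the exchange and the wait, which is exactly what a project evaluating UPC's asynchrony would measure.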