Slide 1: Synthesis of Distributed Arrays in Titanium
Amir Kamil
U.C. Berkeley
May 9, 2006

Slide 2: Background
- Titanium is a single program, multiple data (SPMD) dialect of Java
  - All threads execute the same program text
- Designed for distributed machines
- Global address space: all threads can access all memory
  - But accessing remote memory is much slower than accessing local memory

Slide 3: Grids – The Abstract View
- Grids are used extensively in scientific codes
- Ideally, the programmer specifies only:
  - The size of the grid
  - The operations on each cell

    grid[2d] g = new grid[[0,0] : [100,100]];
    setup(g);
    for (int i = 0; i < iterations; i++) {
      foreach (p in g.domain()) {
        g[p] = (g[p+[0,-1]] + g[p+[0,1]] +
                g[p+[1,0]] + g[p+[-1,0]]) / 4;
      }
    }
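
For readers unfamiliar with Titanium's foreach and point syntax, a rough
serial Java analogue of the averaging loop above might look like the
following (a sketch only; the iteration count and stand-in initialization
are illustrative assumptions, not from the slides):

    // Serial Java analogue of the averaging loop above, assuming a
    // 100x100 interior surrounded by a one-cell border so that the
    // four-point stencil stays in bounds.
    public class StencilSketch {
        public static void main(String[] args) {
            int n = 100, iterations = 10;      // illustrative sizes
            double[][] g = new double[n + 2][n + 2];
            g[n / 2][n / 2] = 1.0;             // stand-in for setup(g)
            for (int iter = 0; iter < iterations; iter++) {
                for (int i = 1; i <= n; i++) {
                    for (int j = 1; j <= n; j++) {
                        // Four-point average, updated in place as above
                        g[i][j] = (g[i][j-1] + g[i][j+1]
                                 + g[i+1][j] + g[i-1][j]) / 4;
                    }
                }
            }
            System.out.println(g[n / 2][n / 2]);
        }
    }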

Slide 4: Grids – The Reality
- Grids must be distributed across processors
  - Global accesses are slow; local accesses are fast
- Load balancing is difficult
- Some problems require multiple levels of refinement
- Access patterns must be tailored to the problem and the machine

Slide 5: Grid Distribution
- Regular partitioning: blocked and cyclic distributions
- Grids can also be partitioned irregularly
- Ghost cells at block boundaries are used to cache neighboring data
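
To make the two regular schemes concrete, the owner of a 1D global index
under a blocked versus a cyclic distribution reduces to simple index
arithmetic. A minimal Java sketch (the method names are invented for
illustration, not part of Titanium or the generated code):

    // Which processor owns global index i under each distribution.
    public class OwnerSketch {
        // BLOCKED: contiguous chunks of ceil(n/p) elements per processor.
        static int blockedOwner(int i, int n, int p) {
            int blockSize = (n + p - 1) / p;   // ceiling division
            return i / blockSize;
        }

        // CYCLIC: elements dealt out round-robin, one at a time.
        static int cyclicOwner(int i, int p) {
            return i % p;
        }

        public static void main(String[] args) {
            int n = 16, p = 4;
            for (int i = 0; i < n; i++) {
                System.out.printf("i=%2d  blocked->P%d  cyclic->P%d%n",
                        i, blockedOwner(i, n, p), cyclicOwner(i, p));
            }
        }
    }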

Slide 6: Multi-level Grids
[Figure: a coarse Level 1 grid and a finer Level 2 grid]
- Parts of the grid may require higher resolution
- Each level is distributed separately
- Lower levels are refinements of upper levels
- Some notion of consistency is maintained between levels
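
One way to read "lower levels are refinements of upper levels": with a
fixed refinement ratio r, each coarse cell corresponds to a block of r
fine cells per dimension, so mappings between levels are integer
arithmetic. A sketch under that assumption (the ratio 4 is illustrative):

    // Index mapping between two levels related by refinement ratio R.
    public class RefinementSketch {
        static final int R = 4;  // illustrative refinement ratio

        // The coarse cell that a given fine cell refines.
        static int coarseParent(int fine) { return Math.floorDiv(fine, R); }

        // The first of the R fine cells covering a given coarse cell.
        static int fineStart(int coarse) { return coarse * R; }

        public static void main(String[] args) {
            System.out.println(coarseParent(10)); // fine cell 10 -> coarse cell 2
            System.out.println(fineStart(2));     // coarse cell 2 -> fine cells 8..11
        }
    }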

Slide 7: Access Patterns
- Different problems require different access patterns, due to:
  - Data dependencies
  - Cache effects
- Examples:
  - Blocked accesses (linear algebra)
  - Red/black ordering (multigrid)
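
For the multigrid case, a red/black sweep visits the cells whose
coordinate parity matches the current color, so each half-sweep reads
only cells of the opposite color. A minimal serial Java sketch of the
ordering (sizes are illustrative):

    // Red/black ordering over a 2D grid: cells where (i + j) is even
    // are "red", odd are "black"; each half-sweep touches one color.
    public class RedBlackSketch {
        public static void main(String[] args) {
            int n = 6;
            double[][] g = new double[n + 2][n + 2];
            for (int color = 0; color <= 1; color++) {   // 0 = red, 1 = black
                for (int i = 1; i <= n; i++) {
                    // First j with (i + j) % 2 == color, then step by 2.
                    for (int j = 2 - (i + color) % 2; j <= n; j += 2) {
                        g[i][j] = (g[i][j-1] + g[i][j+1]
                                 + g[i+1][j] + g[i-1][j]) / 4;
                    }
                }
            }
        }
    }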

Slide 8: Problem #1 – Game of Life
- 2D grid
- Blocked in both dimensions
- Ghost cells of width 1 at the boundaries

    array data {
      dimension[2];
      distribution[BLOCKED(length[1] / 3),
                   BLOCKED(3 * length[2] / Ti.numProcs())];
      boundary[GHOST(1), GHOST(1)];
    }
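
To show what the width-1 ghost ring buys, here is a serial Java sketch
of one Life step over a single local block: the update reads only the
block plus its one-cell halo, which is exactly the data the distributed
version keeps fresh at the boundaries (the class and sizes are
illustrative, not the generated code):

    // One Game of Life step on a local block with a one-cell ghost ring.
    // Indices 1..n are owned; row/column 0 and n+1 are ghost cells that
    // would be refreshed from neighboring blocks before each step.
    public class LifeSketch {
        static int[][] step(int[][] cur, int n) {
            int[][] next = new int[n + 2][n + 2];
            for (int i = 1; i <= n; i++) {
                for (int j = 1; j <= n; j++) {
                    int live = -cur[i][j];   // exclude the cell itself
                    for (int di = -1; di <= 1; di++)
                        for (int dj = -1; dj <= 1; dj++)
                            live += cur[i + di][j + dj];
                    next[i][j] = (live == 3 || (live == 2 && cur[i][j] == 1)) ? 1 : 0;
                }
            }
            return next;
        }

        public static void main(String[] args) {
            int n = 5;
            int[][] g = new int[n + 2][n + 2];
            g[2][1] = g[2][2] = g[2][3] = 1;  // a horizontal blinker
            g = step(g, n);
            System.out.println(g[1][2] + " " + g[2][2] + " " + g[3][2]); // 1 1 1
        }
    }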

Slide 9: Problem #2 – Knapsack
- 2D grid
- Blocked in one dimension
- Blocked access pattern in the other dimension
- Ghost cells of width 1 at the boundaries

    array data {
      dimension[2];
      distribution[BLOCKED(length[1] / Ti.numProcs()), NONE];
      access[NONE, BLOCKED(2)];
      boundary[GHOST(1), NONE];
    }
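
The grid here is presumably the standard 0/1 knapsack DP table, where
entry (i, c) depends on the previous row at columns c and c - w[i]; that
row-to-row dependency is what makes a blocked access pattern along the
capacity dimension pay off. A serial Java sketch of the recurrence (the
weights, values, and capacity are made-up inputs):

    // The 0/1 knapsack DP table a grid like the one above would hold.
    // dp[i][c] depends on dp[i-1][c] and dp[i-1][c - w[i-1]].
    public class KnapsackSketch {
        public static void main(String[] args) {
            int[] w = {2, 3, 4};   // illustrative weights
            int[] v = {3, 4, 5};   // illustrative values
            int cap = 6;
            int[][] dp = new int[w.length + 1][cap + 1];
            for (int i = 1; i <= w.length; i++) {
                for (int c = 0; c <= cap; c++) {
                    dp[i][c] = dp[i - 1][c];             // skip item i
                    if (c >= w[i - 1])                   // or take it
                        dp[i][c] = Math.max(dp[i][c],
                                dp[i - 1][c - w[i - 1]] + v[i - 1]);
                }
            }
            System.out.println(dp[w.length][cap]);  // 8: items of weight 2 and 4
        }
    }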

Slide 10: Grid Usage
- Generated grids are mostly used as if they were normal, global grids
- Array access ([], []=) to any cell is supported
- Ghost cells are updated automatically by calling the synchronize() method
- Methods are provided to restrict access to local elements or to a
  specified pattern, e.g. myDomain(), myBlocks()
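
A rough picture of that interface, as a hypothetical Java-style wrapper
(only synchronize(), myDomain(), and myBlocks() are named on the slide;
everything else here, including all of the bodies, is invented for
illustration):

    // The shape of the interface a generated grid exposes, per the
    // slide. The bodies are serial stand-ins, not the generated code.
    public class DistGridSketch {
        private final double[][] local;   // this processor's block
        DistGridSketch(int rows, int cols) { local = new double[rows][cols]; }

        // Global element access ("[]" and "[]=" on the slide).
        double get(int i, int j) { return local[i][j]; }
        void set(int i, int j, double v) { local[i][j] = v; }

        // Refresh ghost cells from neighboring blocks: a no-op in this
        // serial stand-in, a communication step in the real grid.
        void synchronize() { }

        public static void main(String[] args) {
            DistGridSketch g = new DistGridSketch(4, 4);
            g.set(1, 1, 2.5);
            g.synchronize();              // would update ghost cells
            System.out.println(g.get(1, 1));
        }
    }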

Slide 11: Future Work
- Optimize certain access patterns in the compiler, e.g. remove the owner
  computation when iterating over the local domain:

    foreach (p in grid.myDomain())
      grid[p] = …

- Add basic support for multiple levels of refinement
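
The "owner computation" being removed is the per-point ownership guard
that an SPMD loop over a global domain would otherwise need. In plain
Java the transformation looks roughly like this (the block layout and
variable names are illustrative):

    // Processor "me" owns indices [lo, hi). The guarded loop tests
    // ownership at every point; the optimized loop iterates directly
    // over the local range, as grid.myDomain() does on the slide.
    public class OwnerComputationSketch {
        public static void main(String[] args) {
            int n = 16, p = 4, me = 1;
            int block = n / p, lo = me * block, hi = lo + block;
            double[] grid = new double[n];

            // Before: loop over the global domain with an ownership guard.
            for (int i = 0; i < n; i++) {
                if (i >= lo && i < hi) grid[i] = i;
            }

            // After: the guard is gone.
            for (int i = lo; i < hi; i++) {
                grid[i] = i;
            }
        }
    }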

Slide 12: Future Future Work
- Add more distribution types, e.g. blocked-cyclic, irregular partitioning
- Add load balancing
- Support irregular grids, e.g. AMR
- Add other boundary conditions, e.g. shared cells
- Improve compiler support by adding optimizations and analyses