The Swept Rule for Breaking the Latency Barrier in Time-Advancing PDEs
FINAL PROJECT, MIT 18.337, FALL 2015
PROJECT SUPERVISOR: PROFESSOR QIQI WANG
MAITHAM ALHUBAIL, MOHAMAD SINDI, ABDULAZIZ ALBAIZ, MOHAMMAD ALADWANI

Motivation
◦Many parallel PDE solvers are deployed on computer clusters
◦The number of processing cores per compute node keeps increasing
◦Engineers demand this compute power to speed up the solution of unsteady PDEs
◦Network latency is the major factor limiting the scalability of PDE solvers
◦What can we do to help?

The Swept Rule
◦It is all about following the domain of influence and the domain of dependence while solving PDEs explicitly
◦Always advance along the path that lets you proceed further in time without communication
◦Cells move between processors as the computation sweeps forward

Swept Rule in 1D

Swept Rule in 1D cont…
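To make the 1D picture concrete: starting from one processor's block of cells, each explicit substep can update only the cells whose full stencil is available locally, so successive substeps form a shrinking triangle in space-time. Below is a minimal Julia sketch of this upward sweep, assuming a 3-point explicit stencil; the function names and array layout are illustrative, not the project's code.

# Advance one processor's block of cells as far as a 3-point stencil allows
# without any neighbor data; `update` performs one explicit substep on a cell.
function sweep_up(u::Vector{Float64}, update)
    levels = [copy(u)]                  # the triangle, stored level by level
    cur = u
    while length(cur) >= 3              # each substep loses one cell per side
        cur = [update(cur[i-1], cur[i], cur[i+1]) for i in 2:length(cur)-1]
        push!(levels, cur)
    end
    return levels                       # the center ends ~n/2 substeps ahead
end

The slanted faces of consecutive triangles are exactly what neighboring processors exchange, so each processor advances many substeps per message instead of one, amortizing the network latency.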

Swept Rule in 2D
◦In 2D space plus time, this becomes a 3D (space-time) problem
◦Decompose the domain into squares and assign them to different processors
◦Start from an initial condition
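For illustration, the block-to-worker assignment could be as simple as the round-robin mapping below; this mapping is an assumption for the sketch, not necessarily the project's.

# Map block (bi, bj) of a B-by-B grid of square blocks to a worker id;
# Julia worker ids start at 2, since id 1 is the master process.
owner(bi, bj, B, nworkers) = mod((bi - 1) * B + (bj - 1), nworkers) + 2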

Swept Rule in 2D cont… Timestepping

Swept Rule in 2D cont…
◦At this stage, no further local processing is possible
◦Prepare for the first communication
◦But what exactly do we communicate?

Swept Rule in 2D cont…
◦The panels (side faces) of the pyramids become our communication unit (4 per pyramid)
◦Each panel encapsulates data for different cells at different timesteps
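One plausible way to represent such a panel in Julia, as a hypothetical layout rather than the swept2d.jl data structure:

# A panel: the cell values contributed by one side face of a pyramid,
# indexed by substep level and position along the face.
struct Panel
    face::Symbol                 # which side of the pyramid, e.g. :north
    values::Array{Float64,3}     # (level, i, j) slab of cell data
end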

Swept Rule in 2D cont…
◦Merging 2 panels from different pyramids generates a valley
◦1 panel is owned, 1 is a guest
◦The valley can be filled, since the merged panels provide the full stencil for its internal cells

Swept Rule in 2D cont… Timestepping

Swept Rule in 2D cont…
◦After the valley between 2 panels is filled, no further local processing is possible
◦We call the resulting structures bridges
◦Prepare for the second communication
◦Again, what do we communicate?

Swept Rule in 2D cont…
◦Again, we communicate panels; this time, the sides of the bridges (2 per bridge)
◦They have the same size as the previously communicated panels (the pyramid sides)

Swept Rule in 2D cont…
◦Arrange 4 of the communicated panels
◦2 guests, 2 owned

Swept Rule in 2D cont…
◦Properly placing the 4 panels provides the full stencil needed to fill the gaps between them

Swept Rule in 2D cont…
◦By now, all the gaps are filled
◦And the Swept2D cycle goes on (one full cycle is outlined in the sketch below)
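Putting the preceding slides together, one full cycle per worker looks roughly like the outline below; every function name here is a placeholder for a stage described above, not an actual swept2d.jl function.

# Hypothetical outline of one full Swept2D cycle on each worker.
function swept_cycle!(block)
    pyramid = build_pyramid!(block)           # local timestepping, no communication
    guests  = exchange(panels(pyramid))       # first communication: pyramid panels
    bridges = fill_valleys!(pyramid, guests)  # merge owned + guest panels, timestep
    guests2 = exchange(panels(bridges))       # second communication: bridge panels
    fill_gaps!(block, bridges, guests2)       # 2 owned + 2 guest panels close the gaps
end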

Results

Our Contribution to the Julia Language
◦A swept2d.jl Julia library implementing the Swept algorithm in 2D (~1000 lines of pure Julia).
◦For parallelization we use Julia's low-level remote calls; we didn't want to use MPI, since it is C-based and we wanted to keep everything Julia all the way down: remotecall_fetch(processor_id, function, args...) (see the sketch below)
◦The library is easy to include and use in your code to solve PDEs: you just set up your PDE of interest and its initial condition, and the parallelization is taken care of by the library.
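As a minimal illustration of the remote-call pattern, using a toy function rather than anything from the library: note that in current Julia the function argument comes first, whereas the Julia 0.4 syntax used at the time put the worker id first.

using Distributed
addprocs(2)                                # two local workers for illustration

@everywhere double_all(x) = 2 .* x         # toy stand-in for a panel computation

# Run double_all on worker 2 and fetch the result back to the master process.
result = remotecall_fetch(double_all, 2, [1.0, 2.0, 3.0])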

Example of How to Use the Library:
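A hedged sketch of what a driver script could look like; solve_swept2d, the stencil signature, and the keyword names are hypothetical stand-ins, not the library's documented interface.

include("swept2d.jl")       # the library ships as a single Julia file

# One explicit heat-equation update for interior cell (i, j).
stencil(u, i, j, dt, dx) = u[i, j] + dt / dx^2 *
    (u[i-1, j] + u[i+1, j] + u[i, j-1] + u[i, j+1] - 4u[i, j])

u0 = zeros(256, 256)        # initial condition: all zeros...
u0[128, 128] = 1.0          # ...except a point source in the middle

# Hypothetical entry point: decomposition and communication happen inside.
u = solve_swept2d(stencil, u0; nsteps = 1000, dt = 1e-4, dx = 1e-2)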

Challenges Encountered During the Project
◦The "include" statement seems to be very slow when running on a large number of cores: e.g., on 256 cores, executing include("swept2d.jl"); took ~80 seconds, while the actual parallel computation took only ~7 seconds.
◦The machinefile option didn't seem to work properly; as a workaround, we had to construct the host strings manually in the code and pass them to the addprocs function (sketched below).
◦Out-of-bounds errors were difficult to debug, especially when running in parallel: the debug info doesn't provide proper line numbers, and using print statements to debug in parallel wasn't convenient on a large number of cores (e.g. 256).
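A minimal version of that workaround might look like the following, assuming the machine file lists one hostname per line:

using Distributed  # in Julia 0.4 these functions lived in Base

# Read the host list ourselves and pass it to addprocs instead of relying
# on the --machinefile option.
hosts = [strip(line) for line in readlines("machinefile") if !isempty(strip(line))]
addprocs(hosts)    # launches one SSH worker per listed host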

Live Demo