Optimizing Parallel Programming with MPI
Michael Chen
TJHSST Computer Systems Lab, 2007-2008

Abstract:
With increasingly computation-intensive problems appearing across math, science, and technology, computers need ever more processing power. MPI (Message Passing Interface) is one of the most important and effective ways of programming in parallel, and despite being difficult to use at times, it has remained the de facto standard. But technology has been advancing, and new MPI libraries have been created to keep up with the capabilities of supercomputers. Efficiency has become the new keyword: choosing the right number of processors to use, and managing the latency of passing messages. The purpose of this project is to explore some of these methods of optimization in specific cases like the game of life problem.

Introduction:
Because of the increasing demand for powerful computation, MPI has become a standard even for companies like IBM, which recently began using its BG/L supercomputer with an MPI library to tackle problems like protein folding. Efficiency in parallel programming has therefore become a high priority: finding what yields the lowest latency, and which type of processor best suits each problem. Many hopes for the future lie in this field. Molecular biology, strong artificial intelligence, and ecosystem simulations are just a few of the multitude of applications that will surely require parallel computing. Though my plans only include optimization of the game of life problem, these are the basic skills that carry over into real-world applications, on a much larger scale. Automated testing is becoming increasingly popular, and the results my project has yielded shed some light on the relationship between latency, processor usage, and efficiency.

Procedures and Methodology:
During the first and second quarters, I followed along with the supercomputing class, which has been diving into parallel programming with MPI. Nearing the end of the second quarter, I decided to break off and work more on the game of life program. The process began with writing the game of life without MPI, which was done quickly. Making it run in parallel was the hard part, hindered by two problems: Java has been my main language for the past three years, so I ran into C syntax errors quite often, and I initially encountered problems with sending and receiving in MPI (array sizes caused trouble in the game of life). In the third and fourth quarters, the problems instead revolved around file I/O and C's awkward string handling, which complicated my attempts to automate the whole testing process.

Results and Conclusions:
[Figure: running simulation of the game of life with 9 processors]
[Figure: diagram of latency vs. processing]

The theory behind the game of life with MPI is that the board is split among the processors used in the calculation. Each section only needs a limited amount of information from the surrounding areas, and this is where message passing comes into play: each processor sends one row or one column to a neighboring computer, and receives one in return. This is also where the latency problem arises. If only one row/column is sent per turn, the processors must re-sync every step; if two rows/columns are sent, the amount of re-syncing is halved. The effectiveness of this depends on the individual computers and network latency as well. A limit has to be drawn eventually, though: there is little use in, say, passing the entire board to each process (see the diagram above and the sketch below).
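The report does not include source code, but a minimal sketch of the ghost-row exchange described above might look like the following. It assumes a one-dimensional row decomposition with non-periodic boundaries; the names (exchange_ghost_rows, NCOLS, local_rows) are illustrative, not taken from the original project.

    /* Sketch: exchange k ghost rows with the neighbors above and below.
     * After one exchange, each process can advance up to k generations
     * before it must re-sync (the usable ghost region shrinks by one row
     * per generation), which is the latency trade-off discussed above. */
    #include <mpi.h>

    #define NCOLS 512   /* board width; illustrative value */

    void exchange_ghost_rows(int board[][NCOLS], int local_rows, int k,
                             int rank, int nprocs)
    {
        int above = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
        int below = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;

        /* Rows 0..k-1 and k+local_rows..k+local_rows+k-1 are ghost rows;
         * rows k..k+local_rows-1 are this process's real board section. */

        /* Send my top k real rows up; receive k ghost rows from below. */
        MPI_Sendrecv(&board[k][0],              k * NCOLS, MPI_INT, above, 0,
                     &board[k + local_rows][0], k * NCOLS, MPI_INT, below, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* Send my bottom k real rows down; receive k ghost rows from above. */
        MPI_Sendrecv(&board[local_rows][0], k * NCOLS, MPI_INT, below, 1,
                     &board[0][0],          k * NCOLS, MPI_INT, above, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

Using MPI_Sendrecv for both directions avoids the deadlock that paired blocking MPI_Send/MPI_Recv calls can cause, and MPI_PROC_NULL quietly turns the exchanges at the board's edges into no-ops.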
[Figure: graphs and charts of the averages of multiple runs in different scenarios]

Though I was hoping to find a clear relationship between latency, run time, and number of processors, the data I produced suggests otherwise. The largest problem was that as I added processors, the run time actually increased, almost linearly. This can be explained by the massive overhead of passing arrays, as well as general inefficiency in my code. Not all the results went against my original hypothesis, however. First, with fewer processors the amount passed plays a smaller role: with 4 processors, the running time when passing 1 row/column per exchange differs from passing 5 by only about 25 milliseconds, while with 9 processors the gap is closer to 100 milliseconds. Second, there was a general trend toward faster runs as the amount passed approached 4, followed by a gradual increase in run time above 4. Although the results I have uncovered may be insignificant compared to the grand-scale projects that major companies are pursuing, they rest on much the same basics, and they provide both a foundation for future work and an idea of how research is formally done.
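The averaged timing runs behind these charts could be automated with a small harness along these lines: time a fixed number of generations for each rows-per-exchange setting and average over several runs using MPI_Wtime. The run_life stub stands in for the project's actual simulation loop, and all names and constants here are illustrative assumptions, not the original code.

    #include <mpi.h>
    #include <stdio.h>

    #define RUNS 5
    #define GENERATIONS 1000

    /* Placeholder for the actual game-of-life loop, which would step
     * GENERATIONS times, exchanging rows_per_exchange ghost rows per
     * sync as in the sketch above. */
    static void run_life(int generations, int rows_per_exchange)
    {
        (void)generations;
        (void)rows_per_exchange;
    }

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int k = 1; k <= 5; k++) {        /* rows/columns per exchange */
            double total = 0.0;
            for (int run = 0; run < RUNS; run++) {
                MPI_Barrier(MPI_COMM_WORLD);  /* start all processes together */
                double t0 = MPI_Wtime();
                run_life(GENERATIONS, k);
                MPI_Barrier(MPI_COMM_WORLD);  /* wait for the slowest process */
                total += MPI_Wtime() - t0;
            }
            if (rank == 0)
                printf("k=%d: %.1f ms average\n", k, 1000.0 * total / RUNS);
        }

        MPI_Finalize();
        return 0;
    }

The barriers make each measurement reflect the slowest process, which is what matters when the processors must re-sync every exchange.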